3. ?!?!?!? Any proc/function that is going to be called repeatedly, and is presumably designed to be optimised and to handle potentially millions of calls, is going to suffer badly from N*48 bytes of unnecessary memory traffic, especially in multi-threaded applications. Peak memory bandwidth is roughly 25 GB/s on an average desktop, which means that before any other factors (call overhead etc.) you've already capped yourself at about 520 million calls per second (25 GB/s / 48 bytes per call), and it gets worse with more arguments or more complex types like matrices! Given 4c/8t on most desktops I routinely hit 1 to 1.5 billion calls per second on these sorts of vector/matrix functions (but you have to keep stuff in registers). With Ryzen and the 8c/16t HEDT CPUs the gap only widens: memory would limit you to about 500 million calls per second where the cores could otherwise reach roughly 3 billion.
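A quick back-of-envelope sketch of that cap (a hypothetical illustration; the 25 GB/s and 48 bytes per call are the assumed figures from above, not measurements):

#include <cstdio>

int main()
{
    const double bytes_per_second = 25.0e9; // assumed sustained memory bandwidth (~25 GB/s)
    const double bytes_per_call   = 48.0;   // assumed argument traffic per call (N*48 with N = 1)
    // Dividing bandwidth by per-call traffic gives the hard ceiling on call rate.
    std::printf("bandwidth-limited cap: %.0f million calls/s\n",
                bytes_per_second / bytes_per_call / 1.0e6); // prints about 521
    return 0;
}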
This is bullsh%&t: you are not taking into account the work involved in loading the XMM registers before the call. Take as an example the function
XMMatrixMultiply from the DirectXMath library (XMMATRIX XMMatrixMultiply(FXMMATRIX M1, CXMMATRIX M2)).
Look at the vector call:
000000013F6AB12E movaps xmm0,xmmword ptr [__xmm@412e66664131999a412666663e4ccccd (13F6D08F0h)]
000000013F6AB135 lea rdx,[rbp+40h]
000000013F6AB139 movaps xmm2,xmmword ptr [__xmm@3fc000004129999a4083333340c9999a (13F6D06D0h)]
000000013F6AB140 movaps xmm1,xmm14
000000013F6AB144 movaps xmm3,xmmword ptr [__xmm@408ccccd411b33333fd9999a40400000 (13F6D0830h)]
000000013F6AB14B movups xmm6,xmmword ptr [rax]
000000013F6AB14E movups xmm7,xmmword ptr [rax+10h]
000000013F6AB152 movups xmm8,xmmword ptr [rax+20h]
000000013F6AB157 movups xmm9,xmmword ptr [rax+30h]
000000013F6AB15C call XMMatrixMultiply (13F6A3330h)
Too bad, we do have to load 8 XMM registers before the call; they are not loaded by a miracle :(.
We could just as well do those same loads inside the ASM routine!
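For context, here is a minimal sketch of the kind of C++ caller that produces a call site like the listing above (my own hypothetical example, not the code the disassembly came from): with _XM_VECTORCALL_ in effect the first XMMATRIX parameter (FXMMATRIX) is passed by value in XMM0-XMM3, while the second (CXMMATRIX) is passed by const reference, which in the listing appears to travel in rdx, so the caller has to fill those XMM registers from memory before the call instruction.

#include <DirectXMath.h>
using namespace DirectX;

// Hypothetical caller: the compiler must load the four rows of 'world' into
// XMM0-XMM3 here, before the call to XMMatrixMultiply, while 'view' is
// handed over by reference. The register loads are the caller's work.
XMMATRIX ConcatenateHypothetical(const XMMATRIX& world, const XMMATRIX& view)
{
    return XMMatrixMultiply(world, view);
}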