Hey,
I was curious whether anyone has started playing with AVX and done any test pieces or comparisons with traditional SSE.
I don't seem to be able to get JWASM to assemble any AVX instructions at all at present.
John
Hi johnsa,
did you look at \JWasm207as\Regress\AVX1.ASM?
All the instructions are tested there.
regards
Hey, yeah, I did... For example, this should be no problem:
vmovaps [rdi],ymm0
But that won't assemble,
nor will
vpxor ymm0,ymm0
or anything else for that matter.
Try something like this:
m256 label ymmword
vmovaps m256,ymm0
vmovaps ymm0,[rdi]
vmovaps m256[rdi],ymm0
There is no instruction like
vmovaps [rdi],ymm0
and vpxor with ymm registers is an AVX2 instruction, AFAIK.
Well, vmovaps seems to work if I use
vmovaps ymmword ptr [rdi],ymm0
But that shouldn't be necessary, since the operand size should be implied by the YMM0 register?
I still get a general failure when trying to use any XOR operation:
vpxor ymm0,ymm0
vxorps ymm0,ymm0 etc...
I was hoping to test memory fill and copy test pieces to see if AVX could offer any speedup over SSE code.
On my new machine, however, even my SSE test pieces, which used to be faster, seem to have been superseded by good old rep stosq (at least for smaller allocation sizes).
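For a quick fill test piece that does not depend on the assembler accepting AVX mnemonics, a rough sketch in C with intrinsics is one option. It is written in C rather than JWasm syntax purely so it can be compiled as-is. The names `zero_fill` and `zero_avx` are made up for illustration, and the AVX path assumes GCC or Clang on x86-64; anywhere else it falls back to plain memset. No timing code is included.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the AVX path is only built for GCC/Clang on x86-64. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1
#else
#define HAVE_AVX_PATH 0
#endif

#if HAVE_AVX_PATH
/* Zero n bytes at dst with 256-bit stores (hypothetical helper). */
__attribute__((target("avx")))
static void zero_avx(void *dst, size_t n)
{
    uint8_t *p = (uint8_t *)dst;
    /* Usually compiled to a VEX-encoded vxorps. */
    __m256 zero = _mm256_setzero_ps();
    size_t i = 0;

    /* Unaligned 32-byte stores, so dst needs no special alignment. */
    for (; i + 32 <= n; i += 32)
        _mm256_storeu_ps((float *)(p + i), zero);

    /* Tail of fewer than 32 bytes. */
    memset(p + i, 0, n - i);
}
#endif

/* Zero n bytes, using AVX when the CPU reports it, else memset. */
void zero_fill(void *dst, size_t n)
{
    assert(dst != NULL || n == 0);
#if HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        zero_avx(dst, n);
        return;
    }
#endif
    memset(dst, 0, n);
}
```

Timing `zero_fill` against a rep stosq routine and an SSE loop over a range of buffer sizes would give the comparison asked about above. The unaligned store is used here only to keep the sketch safe for any pointer; an aligned variant would need a 32-byte-aligned buffer.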
John,
That effect has been seen before with Intel hardware. The PIV family of processors was slow with SSE code, and well-thought-out integer code was often nearly as fast. On the Core2 series SSE got a lot faster, and on the i3/5/7 series faster again. Now the exception seems to be the special-case circuitry for at least some combinations of REP STOS/MOVS etc. I found with 32-bit code that it was often hard to improve on ordinary REP MOVSB/W/D using SSE with cached reads and direct write-backs, because the special-case circuitry did all of these things well.
With 32-bit code there was a minimum threshold and, from memory, a maximum threshold between which REP MOVS.. was fast; above and below that range other combinations were faster at times.
Agreed... I found the same. It was somewhere in the region of 4 MB for me. Under that size, or with repeated access, REP STOSD and REP MOVSD gave better performance. Over 4 MB it shifted in favour of a hand-rolled SSE loop, up to about 100 MB, at which point using the non-temporal reads gave greater performance.
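Those size bands could be wired into a simple dispatcher. The following is only a sketch in C: the 4 MB and 100 MB cut-offs are the rough figures from one machine quoted above, not general constants, and the exact behaviour at each boundary is an assumption. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the three approaches discussed in the thread. */
typedef enum {
    FILL_REP_STOS,      /* REP STOSD / REP MOVSD style string instructions */
    FILL_SIMD_LOOP,     /* hand-rolled SSE loop */
    FILL_NON_TEMPORAL   /* non-temporal variant for very large blocks */
} fill_strategy;

/* Rough cut-offs from the post above; measure on the target machine. */
#define SMALL_LIMIT ((size_t)4 << 20)    /* about 4 MB */
#define LARGE_LIMIT ((size_t)100 << 20)  /* about 100 MB */

/* Pick a strategy from the block size alone. */
fill_strategy choose_fill_strategy(size_t nbytes)
{
    if (nbytes < SMALL_LIMIT)
        return FILL_REP_STOS;
    if (nbytes < LARGE_LIMIT)
        return FILL_SIMD_LOOP;
    return FILL_NON_TEMPORAL;
}

/* Human-readable name, handy when printing benchmark results. */
const char *fill_strategy_name(fill_strategy s)
{
    static const char *const names[] = {
        "rep stos", "SIMD loop", "non-temporal"
    };
    assert((unsigned)s < 3u);
    return names[s];
}
```

A real routine would likely tune both limits per CPU, since the crossover points reported in this thread differ between machines and between 32-bit and 64-bit code.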