AVX test pieces

Started by johnsa, August 18, 2012, 12:29:44 AM

johnsa

Hey,

Was curious if anyone had started playing with AVX and done any test pieces or comparisons with traditional SSE?
I don't seem to be able to get JWASM to assemble any AVX instructions at all at present.

John

habran

Hi johnsa,
did you look at \JWasm207as\Regress\AVX1.ASM?
All the instructions are tested in use there.

regards
Cod-Father

johnsa

Hey, Yeah I did... for example this should be no problem:

vmovaps [rdi],ymm0

But that won't assemble..
nor will

vpxor ymm0,ymm0

or anything else for that matter.

habran

try something like this:
  m256 label ymmword
  vmovaps m256,ymm0
  vmovaps ymm0,[rdi]
  vmovaps m256[rdi],ymm0


there is no command like
vmovaps [rdi],ymm0

and vpxor is AVX2 command AFAIK
Cod-Father

johnsa

Well vmovaps seems to work if I use
vmovaps ymmword ptr [rdi],ymm0

But that shouldn't be necessary, as the operand size should be implicit from YMM0 being the register?

I still get general failure trying to use any xor operation..

vpxor ymm0,ymm0
vxorps ymm0,ymm0 etc...
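
For reference, a minimal sketch of forms that would be expected to assemble, assuming a 64-bit JWasm build with AVX support and a 32-byte aligned pointer in rdi (both are assumptions, not something confirmed in this thread). One likely reason for the failures above is that the VEX-encoded xor instructions take three operands (destination plus two sources) and JWasm may not accept a two-operand shorthand. Note also that vpxor on ymm registers needs AVX2, while the xmm form is plain AVX.

```asm
; Fragment only: place inside a procedure. Register choices are illustrative.
    vxorps  ymm0, ymm0, ymm0          ; AVX: zero ymm0 (three-operand VEX form)
    vpxor   xmm1, xmm1, xmm1          ; AVX: 128-bit integer xor, also clears the upper half of ymm1
    vmovaps ymmword ptr [rdi], ymm0   ; aligned 256-bit store, size given explicitly
    vmovaps ymm2, ymmword ptr [rdi]   ; aligned 256-bit load
    vzeroupper                        ; avoid AVX/SSE transition penalties before legacy SSE code
```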

johnsa

I was hoping to test memory fill and copy test pieces to see if AVX could offer any speedup over SSE code.
On my new machine, however, even my SSE test pieces, which used to be faster, seem to have been superseded by good old rep stosq (at least for smaller allocation sizes).

hutch--

John,

That effect has been seen before with Intel hardware. The PIV family of processors was slow with SSE code, and well thought out integer code was often nearly as fast. On the Core2 series SSE got a lot faster, and on the i3/5/7 series faster again. The exception now seems to be the special case circuitry for at least some combinations of REP STOS/MOVS. With 32 bit code I found it was often hard to improve on ordinary REP MOVSB/W/D using SSE with cached reads and direct write backs, because the special case circuitry did all of these things well.

With 32 bit code there was a minimum threshold and from memory a maximum threshold where REP MOVS.. was fast, above and below it other combinations were faster at times.

johnsa

Agreed... I found the same. The crossover was somewhere in the region of 4 MB for me. Under that size, or with repeated access, REP STOSD and REP MOVSD gave better performance. Over 4 MB it shifted in favour of a hand-rolled SSE loop, up to about 100 MB, at which point using the non temporal reads gave greater performance.
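
A rough sketch of how that kind of size-based dispatch could look for a zero fill. Everything here is hypothetical: the FillZero name, the Win64 register convention, assembling with jwasm -win64, and the use of non-temporal stores (movntps) for the largest case are assumptions for illustration. The 4 MB and 100 MB limits are just the figures quoted above and would need re-measuring on any given machine.

```asm
; Hypothetical helper, not code from this thread.
; In:  rcx = destination (assumed 16-byte aligned)
;      rdx = byte count  (assumed a nonzero multiple of 16)
SMALL_LIMIT equ 4*1024*1024           ; below this: rep stosq
LARGE_LIMIT equ 100*1024*1024         ; at or above this: non-temporal stores

    .code
FillZero proc
    cmp   rdx, SMALL_LIMIT
    jae   sse_path

    mov   r8, rdi                     ; rdi is non-volatile in Win64, keep a copy
    mov   rdi, rcx                    ; rdi = destination
    mov   rcx, rdx
    shr   rcx, 3                      ; byte count -> qword count
    xor   eax, eax                    ; fill value 0 (clears all of rax)
    rep   stosq
    mov   rdi, r8                     ; restore rdi
    ret

sse_path:
    xorps xmm0, xmm0                  ; 16 bytes of zero
    cmp   rdx, LARGE_LIMIT
    jae   nt_loop

cached_loop:
    movaps xmmword ptr [rcx], xmm0    ; ordinary cached store
    add   rcx, 16
    sub   rdx, 16
    jnz   cached_loop
    ret

nt_loop:
    movntps xmmword ptr [rcx], xmm0   ; streaming store, bypasses the cache
    add   rcx, 16
    sub   rdx, 16
    jnz   nt_loop
    sfence                            ; order the streaming stores before returning
    ret
FillZero endp
    end
```

The three paths are kept separate so each can be timed on its own when looking for the crossover points.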