The MASM Forum

Title: AVX test pieces
Post by: johnsa on August 18, 2012, 12:29:44 AM
Hey,

I was curious whether anyone has started playing with AVX and done any test pieces or comparisons with traditional SSE?
I don't seem to be able to get JWASM to assemble any AVX instructions at all at present.

John
Title: Re: AVX test pieces
Post by: habran on August 18, 2012, 06:00:20 AM
Hi johnsa,
did you look at \JWasm207as\Regress\AVX1.ASM?
All the instructions are tested in use there.

regards
Title: Re: AVX test pieces
Post by: johnsa on August 20, 2012, 08:57:08 PM
Hey, yeah I did... for example, this should be no problem:

vmovaps [rdi],ymm0

But that won't assemble..
nor will

vpxor ymm0,ymm0

or anything else for that matter.
Title: Re: AVX test pieces
Post by: habran on August 20, 2012, 10:39:30 PM
try something like this:
  m256 label ymmword
  vmovaps m256,ymm0
  vmovaps ymm0,[rdi]
  vmovaps m256[rdi],ymm0


there is no command like
vmovaps [rdi],ymm0

and vpxor is an AVX2 instruction, AFAIK
Title: Re: AVX test pieces
Post by: johnsa on August 20, 2012, 11:07:33 PM
Well vmovaps seems to work if I use
vmovaps ymmword ptr [rdi],ymm0

But that shouldn't be necessary, as the size should be implicit from YMM0 being the register?

I still get a general failure trying to use any XOR operation...

vpxor ymm0,ymm0
vxorps ymm0,ymm0 etc...
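
For reference, the VEX-encoded AVX forms take three operands, so the two-operand spellings above aren't valid forms regardless of the assembler. A minimal sketch of the register-clearing idiom, assuming a JWasm build with AVX support:

  vxorps ymm0,ymm0,ymm0    ; AVX: dest, src1, src2 - clears ymm0
  vpxor  xmm0,xmm0,xmm0    ; AVX: 128-bit integer form
  vpxor  ymm0,ymm0,ymm0    ; 256-bit integer form requires AVX2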
Title: Re: AVX test pieces
Post by: johnsa on August 20, 2012, 11:35:26 PM
I was hoping to test memory fill and copy test pieces to see if AVX could offer any speedup over SSE code.
On my new machine, however, even my SSE test pieces which used to be faster seem to have been superseded by good old rep stosq (at least for smaller allocation sizes).
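
For illustration, the two kinds of fill being compared might look roughly like this (a sketch only, not the actual test pieces; the register assignments and the fill_256 label are assumed, and vmovaps needs rdi 32-byte aligned):

  ; AVX fill: rdi = 32-byte aligned destination, rcx = count of 32-byte blocks
  vxorps ymm0,ymm0,ymm0
  fill_256:
    vmovaps ymmword ptr [rdi],ymm0
    add rdi,32
    dec rcx
    jnz fill_256

  ; rep stosq equivalent: rdi = destination, rcx = qword count, rax = fill value
  xor eax,eax
  rep stosq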
Title: Re: AVX test pieces
Post by: hutch-- on August 21, 2012, 01:13:51 AM
John,

That effect has been seen before with Intel hardware. The PIV family of processors was slow with SSE code; often, well-thought-out integer code was nearly as fast. On the Core2 series SSE got a lot faster, and on the i3/5/7 series faster again. Now the exception seems to be the special-case circuitry for at least some combinations of REP STOS/MOVS etc. ... I found with 32-bit code that it was often hard to improve on ordinary REP MOVSB/W/D using SSE with cached reads and direct write-backs, because the special-case circuitry did all of these things well.

With 32-bit code there was a minimum threshold and, from memory, a maximum threshold where REP MOVS.. was fast; above and below it, other combinations were faster at times.
Title: Re: AVX test pieces
Post by: johnsa on August 21, 2012, 01:16:25 AM
Agreed... I found the same. It was somewhere in the region of 4Mb for me. Under that size, or with repeated access, REP STOSD / REP MOVSD gave better performance. Over 4Mb it shifted in favour of a hand-rolled SSE loop, up to about 100Mb, at which point using non-temporal reads gave greater performance.
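
For reference, a hand-rolled SSE loop with non-temporal hints of the sort described might look roughly like this (a sketch only; the thread doesn't show the actual code, and the register usage and copy_128 label are assumed):

  ; rsi = 16-byte aligned source, rdi = 16-byte aligned destination,
  ; rcx = count of 16-byte blocks
  copy_128:
    prefetchnta byte ptr [rsi+512]   ; pull the source in with a non-temporal hint
    movdqa xmm0,[rsi]                ; aligned SSE2 load
    movntdq xmmword ptr [rdi],xmm0   ; streaming store, bypasses the cache
    add rsi,16
    add rdi,16
    dec rcx
    jnz copy_128
    sfence                           ; make the streaming stores globally visible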