Matrix Transposing AVX versus SSE

aw27 · July 13, 2018, 01:37:43 AM

This is a comparative test for AVX versus SSE Matrix Transposing performance on 3 different sizes of square matrices with rows/columns being multiples of 8:
80x80
800x800
8000x8000

The ASM transposing source code is contained in the file transpmats.asm.
The corresponding object code is linked into a Free Pascal project (Free Pascal is similar to Delphi but is free).

The real innovation of this test is that the results are posted to the ThingSpeak.com website and are immediately visible together with all past results. This is really funny, since the site was designed for IoT stuff.

An interesting point is that SSE performs better than AVX for the smaller matrices. For the 8000x8000 matrix, AVX is almost on a par with SSE. Of course, I am not sure whether the AVX transposing ASM code can be significantly improved or not.

I include all the source code and the .EXE ready for use.

Download

hutch-- · July 13, 2018, 02:18:05 AM

Quote
An interesting point is that SSE performs better than AVX for the smaller matrices. For the 8000x8000 matrix, AVX is almost on a par with SSE. Of course, I am not sure whether the AVX transposing ASM code can be significantly improved or not.

It probably has something to do with the data size that the processor munches internally. I have seen this effect going back to the PIV hardware, where a large data size performed no better than a smaller one, and it appears to come down to the base data size internally. If a processor internally works in 64 bit but you use a 256 bit data size, the processor internally does 4 x 64 bit operations, and while it may have slightly lower internal overhead, it may not be much faster than 4 x 64 bit operations done externally.

aw27 · July 13, 2018, 03:23:18 PM

I will keep investigating this matter because AVX was expected to be clearly faster and the routines are similar. Although the AVX code is larger, it processes 8x8 blocks at a time versus 4x4 blocks for SSE, 4 times more data.
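To make the block sizes concrete, here is a minimal scalar sketch in C of block-wise transposing. It is not the code from transpmats.asm. It assumes 32-bit float elements, a row-major layout and a size that is a multiple of the block, and the name transpose_blocked is made up for illustration. A real SSE or AVX routine would handle each tile with shuffle/unpack instructions instead of the two inner loops.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose an n x n row-major matrix of 32-bit floats, one
   block x block tile at a time. Assumption: n is a multiple of
   block (e.g. 80, 800 or 8000 with block 4 or 8), and src and dst
   do not overlap. */
void transpose_blocked(const float *src, float *dst,
                       size_t n, size_t block)
{
    assert(block != 0 && n % block == 0);

    for (size_t bi = 0; bi < n; bi += block) {
        for (size_t bj = 0; bj < n; bj += block) {
            /* One tile holds block*block elements:
               16 for a 4x4 tile, 64 for an 8x8 tile. */
            for (size_t i = bi; i < bi + block; i++) {
                for (size_t j = bj; j < bj + block; j++) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```

Calling it with block = 4 mimics the SSE tiling and block = 8 the AVX tiling. An 8x8 tile covers 64 elements against 16 for a 4x4 tile, which is where the "4 times more data" comes from.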

aw27 · July 13, 2018, 05:43:43 PM

I solved the mystery - AVX is indeed much faster than SSE on the 800x800 and 8000x8000 matrices.
We need to warm up the memory or whatever. I thought this was not necessary because each matrix is filled with its own random data, so caching should not be an issue.
Now the test performs 2 transpositions on matrices filled with random data just to warm up. Only after that does it perform the real, timed transposition tests.
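The warm-up-then-measure pattern can be sketched in C roughly like this. It is only an illustration under stated assumptions: transpose_naive is a plain scalar stand-in for the real AVX/SSE routines, time_transpose_ms is a made-up helper name, the warm-up reuses the same matrix instead of separate random-filled ones, and clock() measures CPU time with coarse resolution, so it is not the timer the actual test program uses.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Plain scalar transpose, standing in for the real AVX/SSE routine. */
void transpose_naive(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Run `warmups` untimed passes, then time `runs` passes and return
   the average CPU time per pass in milliseconds. */
double time_transpose_ms(const float *src, float *dst, size_t n,
                         int warmups, int runs)
{
    assert(runs > 0);

    /* Warm-up passes: executed but deliberately not timed. */
    for (int k = 0; k < warmups; k++)
        transpose_naive(src, dst, n);

    clock_t start = clock();
    for (int k = 0; k < runs; k++)
        transpose_naive(src, dst, n);
    clock_t stop = clock();

    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / runs;
}
```

With warmups = 2 this mirrors the two untimed transpositions described above. Setting warmups = 0 gives the cold measurement for comparison.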

I have updated the download link.

hutch-- · July 13, 2018, 05:51:27 PM

I get to see this problem with the HASWELL I am using. It idles in noddy land at about 1.2 gig and under load it comes up to its spec 3.3 gig. So for any serious benchmark I run a leading loop to get it up to speed.

zedd151 · July 13, 2018, 05:54:57 PM

With the Download link, I get a blank white page (Windows 7 64 bit, IE 8). It says "Done" at the bottom, no error messages. :(

edit to add:

The url ends with "dl=0". I manually changed it to "dl=1" and it works. ;)

aw27 · July 13, 2018, 06:01:40 PM

@Hutch,
Ah, it makes sense, probably this is the real reason.

@Zed
Damn Dropbox, I had no experience with it.

zedd151 · July 13, 2018, 06:11:25 PM

Quote from: AW on July 13, 2018, 06:01:40 PM
@Zed
Damn Dropbox, I had no experience with it.

I thought it was me that was in some kind of error, some setting that was incorrect. Also, using a 'dated' browser may be the issue. Let's see what other members have to say. Maybe Windows 10 and a more recent browser solve the issue.

I guess this is me:

Unknown CPU 800x800 Matrix Date/Time: 2018-07-13 03:19:22
AVX(ms)=1.32 SSE(ms)=1.46
about a minute ago

AMD A6-9220e 1.60 Ghz, btw.

The two I think 'graph' windows were blank for me. IE 8 issue again?

aw27 · July 13, 2018, 06:19:52 PM

I changed to dl=1 and it works everywhere and provides a direct download, which is what I want. When dl=0 it shows (not for everybody) a page with the ZIP file and its contents and you need to click download to get it.

zedd151 · July 13, 2018, 06:25:45 PM

Quote from: AW on July 13, 2018, 06:19:52 PM
I changed to dl=1 and it works everywhere and provides a direct download, which is what I want.

:t

As for the blank page I got the first time, gotta be an IE 8 issue. Need a newer browser to work with that site. See my modified (addition) post above.

Siekmanski · July 13, 2018, 08:22:37 PM

Some high-end Intel processors are able to turn off the upper 128 bits of the 256 bit execution units in order to save power when they are not used.
It takes approximately 14 µs to power up this upper half after an idle period.
The throughput of 256-bit vector instructions is much lower during this warm-up period because the processor uses the lower 128-bit units twice to execute a 256-bit operation.
It is possible to make the 256-bit units warm up in advance by executing a dummy 256-bit instruction at a suitable time before the 256-bit unit is needed.
The upper half of the 256-bit units will be turned off again after approximately 675 µs of no 256-bit instructions.
This phenomenon is described in Agner Fog's manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

Before timing my routines I "Wake Up" the cores by calling this routine, to get them out of the energy saving "Idle Mode" and force them to go full throttle again.

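WakeUpCores proc
    mov     eax,100000000   ; loop counter: 100 million iterations
splp:
    xchg    ecx,ecx         ; do-nothing instruction, just keeps the core busy
    dec     eax
    jnz     splp            ; spin until the counter reaches zero
    ret
WakeUpCores endp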

Siekmanski · July 13, 2018, 09:15:37 PM

800x800 Matrix -> AVX twice as fast as SSE on my machine.

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

80x80 Matrix Date/Time: 2018-07-13 13:06:06
AVX(microsec)=2,86 SSE(microsec)=3,00

800x800 Matrix Date/Time: 2018-07-13 13:01:16
AVX(ms)=0,43 SSE(ms)=0,85

8000x8000 Matrix Date/Time: 2018-07-13 13:07:25
AVX(ms)=126,87 SSE(ms)=208,25

aw27 · July 13, 2018, 09:20:28 PM

Quote
AMD A6-9220e 1.60 Ghz, btw.

I thought the CPU Brand String ID code could be the same for Intel and AMD. I will check that out.

Quote
The two I think 'graph' windows were blank for me. IE 8 issue again?

Probably, let me look for a computer with IE8 in my virtual machine farm.

@Siekmanski,

Quote
WakeUpCores proc
    mov     eax,100000000
splp:
    xchg    ecx,ecx
    dec     eax
    jnz     splp
    ret
WakeUpCores endp

I tested it; it is not enough to warm up in this case. :(

Siekmanski · July 13, 2018, 09:23:38 PM

Try to insert this in the wakeup routine.

vmovaps ymm0,ymm0 ; dummy 256-bit instruction to start 256 bit execution unit

aw27 · July 13, 2018, 11:12:07 PM

@Siekmanski,

WakeUpCores proc 
    mov     eax,100000000
splp:
    xchg    ecx,ecx
    dec     eax

    vmovaps   ymm0,ymm0 ; <---

    jnz     splp
    ret
WakeUpCores endp

Does not help in this case.
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 800x800 Matrix Date/Time: 2018-07-13 13:58:39 AVX(ms)=0.94 SSE(ms)=0.40
less than a minute ago

My alternative produced:
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 800x800 Matrix Date/Time: 2018-07-13 13:54:42 AVX(ms)=0.21 SSE(ms)=0.29
4 minutes ago

@Zed
The only OS I could find with IE 8 that is able to launch this 64-bit program is Windows XP 64-bit. However, XP (and Vista) do not support AVX.
I modified the program to launch only ThingSpeak.com pages but it does not.
However, the Chrome browser works fine if set as default.

I modified the code (please redownload); now it is expected to identify the AMD CPU brand (fingers crossed).

The MASM Forum
