Matrix Transposing AVX versus SSE

Started by aw27, July 13, 2018, 01:37:43 AM

aw27

This is a comparative test of AVX versus SSE matrix transposition performance on three sizes of square matrix, each with a row/column count that is a multiple of 8:
80x80
800x800
8000x8000

The ASM transposition source code is in the file transpmats.asm.
The corresponding object code is linked into a Free Pascal project (Free Pascal is similar to Delphi, but free).

The real innovation of this test is that the results are posted to the ThingSpeak.com website and are immediately visible together with all past results. This is quite amusing, since the site was designed for IoT stuff  :biggrin:

An interesting point is that SSE performs better than AVX for the smaller matrices. For the 8000x8000 matrix, AVX is almost on a par with SSE. Of course, I am not sure whether the AVX transposition ASM code can be significantly improved or not.

I include all the source code and the .EXE ready for use.

Download

hutch--

Quote
An interesting point is that SSE performs better than AVX for the smaller matrices. For the 8000x8000 matrix, AVX is almost on a par with SSE. Of course, I am not sure whether the AVX transposition ASM code can be significantly improved or not.
It probably has something to do with the data size that the processor munches internally. I have seen this effect going back to the PIV hardware, where a large data size performed no better than a smaller one, and it appears to come down to the base data size used internally. If a processor works in 64 bit internally but you use a 256 bit data size, the processor does 4 x 64 bit operations internally, and while that may have slightly lower overhead, it may not be much faster than four 64 bit operations done externally.


aw27

I will keep investigating this matter, because AVX was expected to be clearly faster given that the routines are similar. Although the AVX code is larger, it processes 8x8 blocks at a time versus 4x4 blocks for SSE, i.e. 4 times more data per block.
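To illustrate the block structure only, here is a minimal scalar C sketch. It is not the code from transpmats.asm: it assumes 32-bit float elements stored row-major and a matrix size that is a multiple of the block size, and the name transpose_blocked and its parameters are hypothetical. With block = 4 each tile holds 16 elements and with block = 8 it holds 64, which is where the "4 times more data" comes from.

```c
#include <stddef.h>

/* Transpose an n x n row-major float matrix tile by tile.
   Assumption: n is a multiple of block (e.g. n = 80, 800 or 8000
   with block = 4 for the SSE case or block = 8 for the AVX case). */
void transpose_blocked(const float *src, float *dst, size_t n, size_t block)
{
    for (size_t bi = 0; bi < n; bi += block) {          /* tile row    */
        for (size_t bj = 0; bj < n; bj += block) {      /* tile column */
            /* Element (i, j) of the source tile goes to (j, i). */
            for (size_t i = bi; i < bi + block; i++)
                for (size_t j = bj; j < bj + block; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

In a SIMD version the two inner loops would be replaced by register shuffles on a whole tile; the outer tile loops stay the same.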

aw27

I solved the mystery - AVX is indeed much faster than SSE on the 800x800 and 8000x8000 matrices.
The memory (or something else) needs to be warmed up first. I thought this was not necessary because each matrix is filled with its own random data, so caching should not be an issue.
Now the test performs 2 transpositions on matrices filled with random data just to warm up. Only after that does it run the real, timed transposition tests.
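The warm-up-then-measure pattern could be sketched in C roughly as follows. This is not the Free Pascal harness from the download: time_after_warmup, kernel_fn, struct buffer and touch_buffer are hypothetical names used only for illustration, and clock() is used purely for portability (its resolution is coarse, so a real benchmark would want a finer timer).

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: any routine we want to time. */
typedef void (*kernel_fn)(void *ctx);

/* Run `warmup` untimed calls first, then return the average time
   in milliseconds of `runs` timed calls. */
double time_after_warmup(kernel_fn fn, void *ctx, int warmup, int runs)
{
    for (int i = 0; i < warmup; i++)
        fn(ctx);                        /* results discarded, not timed */

    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn(ctx);
    clock_t end = clock();

    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / runs;
}

/* Hypothetical stand-in workload: increments every byte of a buffer. */
struct buffer { unsigned char *data; size_t len; };

void touch_buffer(void *ctx)
{
    struct buffer *b = ctx;
    for (size_t i = 0; i < b->len; i++)
        b->data[i]++;
}
```

For the test described above, the call would use warmup = 2 and a kernel that performs one transposition.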

I have updated the download link.

hutch--

 :biggrin:

I get to see this problem with the Haswell I am using: it idles in noddy land at about 1.2 GHz and under load it comes up to its spec 3.3 GHz. So for any serious benchmark I run a leading loop to get it up to speed.

zedd151

With the Download link, I get a blank white page (Windows 7 64-bit, IE 8). It says "Done" at the bottom, with no error messages.  :(

edit to add:

The URL ends with "dl=0". I manually changed it to "dl=1" and it works.   ;)

aw27

@Hutch,
Ah, that makes sense; this is probably the real reason.

@Zed
Damn Dropbox, I had no experience with it.

zedd151

Quote from: AW on July 13, 2018, 06:01:40 PM
@Zed
Damn Dropbox, I had no experience with it.

I thought the error was on my side, some setting that was incorrect. Using a 'dated' browser may also be the issue. Let's see what other members have to say. Maybe Windows 10 with a more recent browser solves the issue.



I guess this is me:


Unknown CPU 800x800 Matrix Date/Time: 2018-07-13 03:19:22 AVX(ms)=1.32 SSE(ms)=1.46
about a minute ago


AMD A6-9220e  1.60 GHz, btw.

The two windows (graphs, I think) were blank for me. IE 8 issue again?

aw27

I changed to dl=1 and it works everywhere and provides a direct download, which is what I want. When dl=0 it shows (not for everybody) a page with the ZIP file and its contents and you need to click download to get it.

zedd151

Quote from: AW on July 13, 2018, 06:19:52 PM
I changed to dl=1 and it works everywhere and provides a direct download, which is what I want.

:t

As for the blank page I got the first time, it has to be an IE 8 issue. A newer browser is needed to work with that site. See the addition to my post above.

Siekmanski

Some high-end Intel processors are able to turn off the upper 128 bits of the 256 bit execution units in order to save power when they are not used.
It takes approximately 14 µs to power up this upper half after an idle period.
The throughput of 256-bit vector instructions is much lower during this warm-up period because the processor uses the lower 128-bit units twice to execute a 256-bit operation.
It is possible to make the 256-bit units warm up in advance by executing a dummy 256-bit instruction at a suitable time before the 256-bit unit is needed.
The upper half of the 256-bit units will be turned off again after approximately 675 µs of no 256-bit instructions.
This phenomenon is described in Agner Fog's manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

Before timing my routines I "wake up" the cores by calling this routine, to get them out of the energy-saving "Idle Mode" and force them to go full throttle again.  :biggrin:

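WakeUpCores proc
    mov     eax,100000000   ; iteration count for the spin loop
splp:
    xchg    ecx,ecx         ; filler instruction (ecx keeps its value)
    dec     eax
    jnz     splp            ; loop until eax reaches zero
    ret
WakeUpCores endp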
Creative coders use backward thinking techniques as a strategy.

Siekmanski

800x800 Matrix -> AVX twice as fast as SSE on my machine.

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

80x80 Matrix Date/Time: 2018-07-13 13:06:06
AVX(microsec)=2,86 SSE(microsec)=3,00

800x800 Matrix Date/Time: 2018-07-13 13:01:16
AVX(ms)=0,43 SSE(ms)=0,85

8000x8000 Matrix Date/Time: 2018-07-13 13:07:25
AVX(ms)=126,87 SSE(ms)=208,25

aw27

Quote
AMD A6-9220e  1.60 GHz, btw.
I thought the CPU Brand String ID code could be the same for Intel and AMD. I will check that out.

Quote
The two windows (graphs, I think) were blank for me. IE 8 issue again?
Probably  :biggrin:, let me look for a computer with IE 8 in my virtual machine farm.

@Siekmanski,

Quote
WakeUpCores proc
    mov     eax,100000000
splp:
    xchg    ecx,ecx
    dec     eax
    jnz     splp
    ret
WakeUpCores endp

I tested it; it is not enough to warm up in this case. :(

Siekmanski

Try inserting this into the wake-up routine.

   vmovaps   ymm0,ymm0   ; dummy 256-bit instruction to start 256 bit execution unit

aw27

@Siekmanski,


WakeUpCores proc
    mov     eax,100000000
splp:
    xchg    ecx,ecx
    dec     eax

    vmovaps   ymm0,ymm0 ; <---

    jnz     splp
    ret
WakeUpCores endp

Does not help in this case.
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 800x800 Matrix Date/Time: 2018-07-13 13:58:39 AVX(ms)=0.94 SSE(ms)=0.40
less than a minute ago

My alternative produced:
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 800x800 Matrix Date/Time: 2018-07-13 13:54:42 AVX(ms)=0.21 SSE(ms)=0.29
4 minutes ago

@Zed
The only OS I could find with IE 8 that is able to launch this 64-bit program is Windows XP 64-bit. However, XP (and Vista) do not support AVX.
I modified the program to launch only the ThingSpeak.com pages, but that does not work either.
However, the Chrome browser works fine if set as the default.

I modified the code (please re-download); it should now identify the AMD CPU brand (fingers crossed  :biggrin:).