The MASM Forum

General => The Laboratory => Topic started by: aw27 on July 13, 2018, 01:37:43 AM

Title: Matrix Transposing AVX versus SSE
Post by: aw27 on July 13, 2018, 01:37:43 AM
This is a comparative test for AVX versus SSE Matrix Transposing performance on 3 different sizes of square matrices with rows/columns being multiples of 8:
80x80
800x800
8000x8000

The ASM Transposing source code is contained in the file transpmats.asm
The corresponding object code is linked to a Free Pascal project (Free Pascal is similar to Delphi but is free).

The real innovation of this test is that the results are posted to the ThingSpeak.com website and are immediately visible together with all past results. This is really funny - this site was designed for IoT stuff  :biggrin:

An interesting point is that SSE performs better than AVX for the smaller matrices. For the 8000x8000 matrix, AVX is almost on a par with SSE. Of course, I am not sure whether the AVX transposing ASM code can be significantly improved or not.

I include all the source code and the .EXE ready for use.

Download (https://www.dropbox.com/s/9mu4u9x724jc7hg/TransposeAVXvsSSE.zip?dl=1)
Title: Re: Matrix Transposing AVX versus SSE
Post by: hutch-- on July 13, 2018, 02:18:05 AM
Quote
An interesting point is that SSE performs better than AVX for the smaller matrices. For the 8000x8000 matrix, AVX is almost on a par with SSE. Of course, I am not sure whether the AVX transposing ASM code can be significantly improved or not.
It probably has something to do with the data size that the processor munches internally. I have seen this effect going back to the PIV hardware, where a large data size performed no better than a smaller one, and it appears to come down to the base data size used internally. If a processor internally works in 64 bit but you use a 256 bit data size, the processor internally does 4 x 64 bit operations, and while it may have slightly lower internal overhead, it may not be much faster than four 64 bit operations done externally.

Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 13, 2018, 03:23:18 PM
I will keep investigating this matter because AVX was expected to be clearly faster; the routines are similar. Although the AVX code is larger, it processes 8x8 blocks at a time versus 4x4 blocks for SSE, i.e. 4 times as much data per block.
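
For readers without the ASM source at hand, the block-wise idea can be sketched in plain scalar C. This is only an illustration, not the code in transpmats.asm: the function name transpose_blocked, the row-major layout and the 32-bit float element type are assumptions (8x8 tiles in 256-bit registers and 4x4 tiles in 128-bit registers suggest 32-bit elements, but the thread does not say so).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar sketch: transpose an n x n row-major float matrix
   one block x block tile at a time. block = 4 mirrors the tile size of
   the SSE routine, block = 8 the tile size of the AVX routine. */
void transpose_blocked(const float *src, float *dst, size_t n, size_t block)
{
    /* The test matrices have sides that are multiples of 8, so every
       tile is complete; this sketch relies on that. */
    assert(block > 0 && n % block == 0);

    for (size_t bi = 0; bi < n; bi += block)          /* tile row    */
        for (size_t bj = 0; bj < n; bj += block)      /* tile column */
            for (size_t i = bi; i < bi + block; ++i)
                for (size_t j = bj; j < bj + block; ++j)
                    dst[j * n + i] = src[i * n + j];  /* element (i,j) goes to (j,i) */
}
```

An 8x8 tile holds 64 elements against 16 for a 4x4 tile, which is where the factor of 4 comes from.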
Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 13, 2018, 05:43:43 PM
I solved the mystery - AVX is indeed much faster than SSE on the 800x800 and 8000x8000 matrices.
We need to warm up the memory or whatever. I thought this was not necessary because each matrix is filled with its own random data, so caching should not be an issue.
Now the test performs 2 transpositions on matrices filled with random data just to warm up. Only after that does it perform the real, timed transposition tests.

I have updated the download link.
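
The warm-up-then-measure pattern described above looks roughly like this in C. It is a sketch under assumptions, not the Free Pascal harness from the download: time_after_warmup, work_fn and the stand-in workload count_call are made-up names, and the standard clock() is used only because it is portable (a real benchmark would want a higher resolution timer).

```c
#include <assert.h>
#include <time.h>

/* Hypothetical callback type for the routine under test. */
typedef void (*work_fn)(void *ctx);

/* Hypothetical stand-in workload: just counts how often it was called. */
void count_call(void *ctx) { ++*(int *)ctx; }

/* Run fn `warmups` times without timing, then time a single call.
   Returns the elapsed processor time of the timed call in milliseconds. */
double time_after_warmup(work_fn fn, void *ctx, int warmups)
{
    assert(fn != NULL && warmups >= 0);

    for (int i = 0; i < warmups; ++i)   /* untimed warm-up calls */
        fn(ctx);

    clock_t t0 = clock();
    fn(ctx);                            /* the call that is actually measured */
    clock_t t1 = clock();

    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

With warmups set to 2 this matches the two untimed transpositions mentioned in the post; the workload passed in would be the transposition routine instead of count_call.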
Title: Re: Matrix Transposing AVX versus SSE
Post by: hutch-- on July 13, 2018, 05:51:27 PM
 :biggrin:

I get to see this problem with the HASWELL I am using. It idles in noddy land at about 1.2 gig and under load it comes up to its spec 3.3 gig. So for any serious benchmark I run a leading loop to get it up to speed.
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 13, 2018, 05:54:57 PM
With the Download link, I get a blank white page (Windows 7 64 bit, IE 8). It says "Done" at the bottom, no error messages.  :(

edit to add:

The url ends with "dl=0". I manually changed it to "dl=1" and it works.   ;)
Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 13, 2018, 06:01:40 PM
@Hutch,
Ah, it makes sense, probably this is the real reason.

@Zed
Damn Dropbox, I had no experience with it.
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 13, 2018, 06:11:25 PM
Quote from: AW on July 13, 2018, 06:01:40 PM
@Zed
Damn Dropbox, I had no experience with it.

I thought it was I that was in some kind of error, some setting that was incorrect. Also, using a 'dated' browser may be the issue. Let's see what other members have to say. Maybe Windows 10 and a more recent browser will solve the issue.



I guess this is me:


Unknown CPU 800x800 Matrix Date/Time: 2018-07-13 03:19:22 AVX(ms)=1.32 SSE(ms)=1.46
about a minute ago


AMD A6-9220e  1.60 Ghz, btw.

The two (I think) 'graph' windows were blank for me. IE 8 issue again?
Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 13, 2018, 06:19:52 PM
I changed to dl=1 and it works everywhere and provides a direct download, which is what I want. When dl=0 it shows (not for everybody) a page with the ZIP file and its contents and you need to click download to get it.
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 13, 2018, 06:25:45 PM
Quote from: AW on July 13, 2018, 06:19:52 PM
I changed to dl=1 and it works everywhere and provides a direct download, which is what I want.

:t

As for the blank page I got the first time, gotta be an IE 8 issue. Need a newer browser to work with that site. See my modified (addition) post above.
Title: Re: Matrix Transposing AVX versus SSE
Post by: Siekmanski on July 13, 2018, 08:22:37 PM
Some high-end Intel processors are able to turn off the upper 128 bits of the 256 bit execution units in order to save power when they are not used.
It takes approximately 14 µs to power up this upper half after an idle period.
The throughput of 256-bit vector instructions is much lower during this warm-up period because the processor uses the lower 128-bit units twice to execute a 256-bit operation.
It is possible to make the 256-bit units warm up in advance by executing a dummy 256-bit instruction at a suitable time before the 256-bit unit is needed.
The upper half of the 256-bit units will be turned off again after approximately 675 µs of no 256-bit instructions.
This phenomenon is described in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

Before timing my routines I "wake up" the cores by calling this routine, to get them out of the energy-saving "idle mode" and force them to go full throttle again.  :biggrin:

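WakeUpCores proc
    mov     eax,100000000   ; iteration count for the spin loop
splp:
    xchg    ecx,ecx         ; filler instruction, just keeps the core busy
    dec     eax
    jnz     splp            ; loop until the counter reaches zero
    ret
WakeUpCores endp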
Title: Re: Matrix Transposing AVX versus SSE
Post by: Siekmanski on July 13, 2018, 09:15:37 PM
800x800 Matrix -> AVX twice as fast as SSE on my machine.

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

80x80 Matrix Date/Time: 2018-07-13 13:06:06
AVX(microsec)=2,86 SSE(microsec)=3,00

800x800 Matrix Date/Time: 2018-07-13 13:01:16
AVX(ms)=0,43 SSE(ms)=0,85

8000x8000 Matrix Date/Time: 2018-07-13 13:07:25
AVX(ms)=126,87 SSE(ms)=208,25
Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 13, 2018, 09:20:28 PM
Quote
AMD A6-9220e  1.60 Ghz, btw.
I thought the CPU Brand String ID code could be the same for Intel and AMD. I will check that out.

Quote
The two I think 'graph' windows were blank for me. IE 8 issue again?
Probably  :biggrin:, let me look for a computer with IE8 in my virtual machines farm.

@Siekmanski,

Quote
WakeUpCores proc
    mov     eax,100000000
splp:
    xchg    ecx,ecx
    dec     eax
    jnz     splp
    ret
WakeUpCores endp

I tested it; it is not enough to warm up in this case. :(
Title: Re: Matrix Transposing AVX versus SSE
Post by: Siekmanski on July 13, 2018, 09:23:38 PM
Try to insert this in the wakeup routine.

   vmovaps   ymm0,ymm0   ; dummy 256-bit instruction to start 256 bit execution unit
Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 13, 2018, 11:12:07 PM
@Siekmanski,


WakeUpCores proc
    mov     eax,100000000
splp:
    xchg    ecx,ecx
    dec     eax

    vmovaps   ymm0,ymm0 ; <---

    jnz     splp
    ret
WakeUpCores endp

Does not help in this case.
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 800x800 Matrix Date/Time: 2018-07-13 13:58:39 AVX(ms)=0.94 SSE(ms)=0.40
less than a minute ago

My alternative produced:
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 800x800 Matrix Date/Time: 2018-07-13 13:54:42 AVX(ms)=0.21 SSE(ms)=0.29
4 minutes ago

@Zed
The only OS I could find with IE 8 that is able to launch this 64-bit program is Windows XP 64-bit. However, XP (and Vista) do not support AVX.
I modified the program to launch only the ThingSpeak.com pages, but it does not.
However Chrome browser works fine if set as default.

I modified the code (please redownload), now it is expected to identify AMD CPU Brand (fingers crossed  :biggrin:).
Title: Re: Matrix Transposing AVX versus SSE
Post by: felipe on July 14, 2018, 02:58:57 AM
This is very interesting too (seems like a great day for the forum today)  :greenclp:. I don't have AVX support in my machine, but in the BIOS configuration I can disable stepping (or similar, don't remember now  :idea:) and a few other settings (like running the fan at full speed, and other things too) to make the hardware work at maximum all the time. I don't do it because it has not been necessary, so I prefer to save power. But for some time I have been waiting for a nice algo to test timings and to compare my machine in "full mode" and in power saving mode  :idea: .  (So jj I'm still waiting for your test piece  ;)  :idea:)

But warming up the machine beforehand with some special algos... amazing, reminds me of old cars in winter.  :shock:
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 14, 2018, 03:17:15 AM
Quote
I modified the code (please redownload), now it is expected to identify AMD CPU Brand (fingers crossed  :biggrin:).

I'll take a look when I get home from work...  I'll run it from both windows 7 and windows 10. Both 64 bit..
Title: Re: Matrix Transposing AVX versus SSE
Post by: Siekmanski on July 14, 2018, 05:40:53 AM
AVX is no faster than in the previous run; SSE is a little bit faster.

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
800x800 Matrix Date/Time: 2018-07-13 21:28:08
AVX(ms)=0,43 SSE(ms)=0,79


about 8 hours ago:
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
800x800 Matrix Date/Time: 2018-07-13 13:01:16
AVX(ms)=0,43 SSE(ms)=0,85
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 14, 2018, 07:36:06 AM
Quote from: zedd151 on July 14, 2018, 03:17:15 AM
Quote
now it is expected to identify AMD CPU Brand (fingers crossed  :biggrin: ).

I'll run it from both windows 7 ..

Here is what I get from Windows 7 Professional 64 bit. Please note that I do not have the proper video drivers installed, just the stock video from M$. (No drivers available for Windows 7 yet, for my machine - if ever)  The font looks off, but it could be an issue once again with IE8,
also I believe the graphs (or charts) should not be blank.

Later I will throw on my backup of Windows 10 Home just for kicks, and to see what changes for me. I bet it will look a lot better.
Performance wise, I don't expect much from this 1.60 Ghz cpu.
Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 14, 2018, 05:31:12 PM
@Felipe,
AVX is not used much anyway. New technologies may take up to 10 years to become widespread.

@Siekmanski
I am making the tests in real life conditions, not increasing the priority of the core, so significant fluctuations are expected in a single run.
I will modify these tests to repeat a few times and take an average.
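
Repeating and averaging could look something like the C sketch below. It is not the planned change to the test program, only an illustration: run_stats and summarize_runs are made-up names, and reporting the minimum and maximum next to the mean is my addition so the run-to-run fluctuation stays visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of several timed runs, all values in ms. */
typedef struct {
    double mean;
    double min;
    double max;
} run_stats;

/* Summarise n timing samples. */
run_stats summarize_runs(const double *ms, size_t n)
{
    assert(ms != NULL && n > 0);

    run_stats s = { 0.0, ms[0], ms[0] };
    for (size_t i = 0; i < n; ++i) {
        s.mean += ms[i];                    /* accumulate the sum first */
        if (ms[i] < s.min) s.min = ms[i];
        if (ms[i] > s.max) s.max = ms[i];
    }
    s.mean /= (double)n;                    /* then turn it into the mean */
    return s;
}
```

A large gap between min and max would be a hint that a single run is dominated by noise rather than by the routine itself.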

@Zed
Your AMD is a bit slow but it is not low-spec - it even supports AVX2.
I am happy the CPU brand ID worked this time. You may see it different in another OS because "It is possible to change the CPUID of AMD processors by using the AMD virtualization instructions. I hope that somebody will volunteer to make a program for this purpose - Agner Fog".
I see that the charts don't plot - IE8 is too old indeed.

Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 14, 2018, 05:50:57 PM
Quote from: AW on July 14, 2018, 05:31:12 PM
I see that the charts don't plot - IE8 is too old indeed.

:biggrin:  Just like me.  :P

I'll get Windows 10 up and running here shortly. Had a little trouble with the backup of Windows 10 'To Go'....   :icon_confused:
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 14, 2018, 06:13:21 PM
Okay, here's from Windows 10 Ultimate 64 bit...

I noticed a significant error in the time displayed. I ran it twice to be sure; both were taken moments ago, but they display as 2 hours ago.

Also, is that what the font is supposed to look like? Also, hard to be sure on the graph which one is mine. I assume the very last entry.
Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 14, 2018, 06:24:54 PM
@Zed
What I have learnt from the site is that the chart shows the browser time, i.e. your local time. I also calculate the local time before sending the data, and that is displayed correctly. The "2 hours ago" does not make sense then, and it was not happening when you were using Windows 7.  :icon_eek:
PS: You may need to refresh the page  :idea:

Quote
Also, is that what the font is supposed to look like?
It is possible to customize the charts, but this is an advanced topic and I don't understand it very well.
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 14, 2018, 06:34:57 PM
Quote from: AW on July 14, 2018, 06:24:54 PM
PS: You may need to refresh the page  :idea:

I came back in a hurry to Windows 7 because M$ was threatening an update on Windows 10. Lemme go back and try again, I'll inform of the results - I'll clear the cache before running...
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 14, 2018, 06:43:13 PM
Okay back in Windows 10:


AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C 3G 8000x8000 Matrix Date/Time: [b]2018-07-14 03:39:09[/b] AVX(ms)=490.97 SSE(ms)=547.48
about 2 hours ago


The date and time displayed are correct for my location Central Standard Time. But still get the 'about 2 hours ago' ....
Now I'm going to run another from Windows 7....

Bold doesn't work within the 'code' box?   :greensml:  just found that out
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 14, 2018, 06:48:47 PM
Ok, now from Windows 7:


AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C 3G 8000x8000 Matrix Date/Time: 2018-07-14 03:46:23 AVX(ms)=447.92 SSE(ms)=738.40
less than a minute ago

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C 3G 8000x8000 Matrix Date/Time: 2018-07-14 03:39:09 AVX(ms)=490.97 SSE(ms)=547.48
7 minutes ago

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C 3G 8000x8000 Matrix Date/Time: 2018-07-14 03:06:33 AVX(ms)=337.76 SSE(ms)=547.36
40 minutes ago


???  :icon_confused:  Doesn't really matter though - at least not for me.

I expect some deviation from normal with Windows 7 because of a lack of drivers for this shiny new computer, but Windows 10 has all
of the latest drivers installed, plus Windows Edge browser. So I am at a loss for any explanation for the time discrepancy.
Title: Re: Matrix Transposing AVX versus SSE
Post by: aw27 on July 14, 2018, 07:00:46 PM
Quote from: zedd151 on July 14, 2018, 06:48:47 PM
I expect some deviation from normal with Windows 7 because of a lack of drivers for this shiny new computer, but Windows 10 has all
of the latest drivers installed, plus Windows Edge browser. So I am at a loss for any explanation for the time discrepancy.
I tested in Edge and IE in Windows 10 and the "time ago" is reported correctly. I have no explanation for the "time ago" discrepancy in your Windows 10.
Title: Re: Matrix Transposing AVX versus SSE
Post by: zedd151 on July 14, 2018, 07:06:17 PM
Quote from: AW on July 14, 2018, 07:00:46 PM
I have no explanation for the "time ago" discrepancy in your Windows 10.

gremlins