Unaligned memory copy test piece.

Started by hutch--, December 06, 2021, 08:34:55 PM


hutch--

I have a task where the memory to be copied cannot be guaranteed to be SSE aligned.

The example has two memory copy techniques, the old rep movsb method as reference and the following for unaligned SSE.

    movdqu xmm0, [rcx+r10]
    movntdq [rdx+r10], xmm0

I have stabilised the timings by running a dummy run before the timed run, and on my old Haswell the unaligned SSE version runs in about 4.7 seconds for a 50 gig copy. As a reference, the rep movsb version runs in about 6.7 seconds for the same 50 gig.

I have not run the two tests together so that one does not affect the other. If you have time, run the SSE version, then switch to the commented-out rep movsb version.

mineiro

This is the result on my machine:
I don't have all those include files and tools; if you can release just the executable for the rep movsb version, I can run it here.

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
wine umc.exe
--------------------------------
50 gig copy in 3338 milliseconds
--------------------------------
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

HSE

i3-10100 not so fast :biggrin:

xmmcopyu:
--------------------------------
50 gig copy in 7531 milliseconds
--------------------------------

ByteCopy:
--------------------------------
50 gig copy in 10563 milliseconds
--------------------------------
Equations in Assembly: SmplMath

hutch--

Thanks guys, all of these results are very useful to me.

hutch--

I added the rep movsb version as a zip file.

--------------------------------
50 gig copy in 6625 milliseconds    rep movsb
--------------------------------
--------------------------------
50 gig copy in 4578 milliseconds    movdqu xmm0, [rcx+r10] : movntdq [rdx+r10], xmm0
--------------------------------

What I am chasing is the ratio difference, as the SSE version will be used to copy memory that has originated from an MMF written to by a 32 bit app.

avcaballero

umcmovsb:
--------------------------------
50 gig copy in 6015 milliseconds
--------------------------------
Press any key to continue...


umc:
--------------------------------
50 gig copy in 4719 milliseconds
--------------------------------
Press any key to continue...

HSE

umcmovsb:
--------------------------------
50 gig copy in 9547 milliseconds
--------------------------------
Press any key to continue...

Greenhorn

AMD Ryzen 3700X

umcmovsb:
--------------------------------
50 gig copy in 5522 milliseconds
--------------------------------

umc:
--------------------------------
50 gig copy in 2902 milliseconds
--------------------------------
Kole Feut un Nordenwind gift en krusen Büdel un en lütten Pint.

Siekmanski

AMD Ryzen 9 5950X 16-Core Processor

umc:
--------------------------------
50 gig copy in 2781 milliseconds
--------------------------------

umcmovsb:
--------------------------------
50 gig copy in 2563 milliseconds
--------------------------------
Creative coders use backward thinking techniques as a strategy.

Greenhorn

Quote from: Siekmanski on December 07, 2021, 03:32:51 AM
AMD Ryzen 9 5950X 16-Core Processor

umc:
--------------------------------
50 gig copy in 2781 milliseconds
--------------------------------

umcmovsb:
--------------------------------
50 gig copy in 2563 milliseconds
--------------------------------

Well, the result for movsb is surprising.  :thumbsup:

mineiro

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
wine umc.exe
--------------------------------
50 gig copy in 3384 milliseconds
--------------------------------
wine umcmovsb.exe
--------------------------------
50 gig copy in 3450 milliseconds
--------------------------------

jj2007

--------------------------------
50 gig copy in 9064 milliseconds
--------------------------------

--------------------------------
50 gig copy in 10343 milliseconds
--------------------------------

hutch--

Thanks all, it seems that over a wide range of different hardware the SSE2 version is faster in almost every instance, and that is useful for the task I have in mind.  :biggrin:

jj2007

Are you sure it's unaligned? My debugger says halloc() delivers a 16-byte aligned buffer. I also wonder whether lodsd would be faster than lodsb.

hutch--

What I have to do is load data from a 32 bit app, via a memory mapped file, into a 64 bit app which uses HeapAlloc() to store the data. The input from the 32 bit side can be rough, byte-aligned string data or anything else that will fit into the memory mapped file size. If alignment were fully controlled at both ends, I would use the faster aligned SSE2 instructions.

"rep movsb" is usually faster than "rep movsd" which seems to be Intel special case circuitry and I have not seen examples of "rep lodsb" being faster so I have used "rep movsb" as a reference to compare the SSE2 version and across multiple CPUs that the folks here have tested on, the SSE2 version is always faster.

I already have prototypes of the task up and running using a 1gb memory mapped file as the data transfer window and the idea is to be able to work with a 32 bit app that can store multiple 1gb blocks in the 64 bit "container" and work on any of them 1 at a time. I had used "rep movsb" for the unaligned data transfer and it worked OK but as you start using larger blocks of memory, speed starts to matter.