Unaligned memory copy test piece.

Started by hutch--, December 06, 2021, 08:34:55 PM


hutch--

I have a task where the memory to be copied cannot be guaranteed to be SSE aligned.

The example has two memory copy techniques, the old rep movsb method as reference and the following for unaligned SSE.

    movdqu xmm0, [rcx+r10]
    movntdq [rdx+r10], xmm0

I have stabilised the timings by running a dummy run before the timed run, and on my old Haswell the unaligned SSE version runs in about 4.7 seconds for a 50 gig copy. As a reference, the rep movsb version runs in about 6.7 seconds for the same 50 gig.

I have not run the two tests together so that one does not affect the other. If you have time, run the SSE version, then switch to the commented-out rep movsb version.

mineiro

This is the result on my machine:
I don't have all those include files and tools; if you can release just the executable for the rep movsb version, I can run it here.

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
wine umc.exe
--------------------------------
50 gig copy in 3338 milliseconds
--------------------------------
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

HSE

i3-10100 not so fast :biggrin:

xmmcopyu:
--------------------------------
50 gig copy in 7531 milliseconds
--------------------------------

ByteCopy:
--------------------------------
50 gig copy in 10563 milliseconds
--------------------------------
Equations in Assembly: SmplMath

hutch--

Thanks guys, all of these results are very useful to me.

hutch--

I added the rep movsb version as a zip file.

--------------------------------
50 gig copy in 6625 milliseconds    rep movsb
--------------------------------
--------------------------------
50 gig copy in 4578 milliseconds    movdqu xmm0, [rcx+r10] : movntdq [rdx+r10], xmm0
--------------------------------

What I am chasing is the ratio difference, as the SSE version will be used to copy memory that has originated from an MMF written to by a 32 bit app.

avcaballero

umcmovsb:
--------------------------------
50 gig copy in 6015 milliseconds
--------------------------------
Press any key to continue...


umc:
--------------------------------
50 gig copy in 4719 milliseconds
--------------------------------
Press any key to continue...

HSE

umcmovsb:
--------------------------------
50 gig copy in 9547 milliseconds
--------------------------------
Press any key to continue...

Greenhorn

AMD Ryzen 3700X

umcmovsb:
--------------------------------
50 gig copy in 5522 milliseconds
--------------------------------

umc:
--------------------------------
50 gig copy in 2902 milliseconds
--------------------------------
Kole Feut un Nordenwind gift en krusen Büdel un en lütten Pint.

Siekmanski

AMD Ryzen 9 5950X 16-Core Processor

umc:
--------------------------------
50 gig copy in 2781 milliseconds
--------------------------------

umcmovsb:
--------------------------------
50 gig copy in 2563 milliseconds
--------------------------------
Creative coders use backward thinking techniques as a strategy.

Greenhorn

Quote from: Siekmanski on December 07, 2021, 03:32:51 AM
AMD Ryzen 9 5950X 16-Core Processor

umc:
--------------------------------
50 gig copy in 2781 milliseconds
--------------------------------

umcmovsb:
--------------------------------
50 gig copy in 2563 milliseconds
--------------------------------

Well, the result for movsb is surprising.  :thumbsup:

mineiro

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
wine umc.exe
--------------------------------
50 gig copy in 3384 milliseconds
--------------------------------
wine umcmovsb.exe
--------------------------------
50 gig copy in 3450 milliseconds
--------------------------------

jj2007

--------------------------------
50 gig copy in 9064 milliseconds
--------------------------------

--------------------------------
50 gig copy in 10343 milliseconds
--------------------------------

hutch--

Thanks all, it seems that over a wide range of different hardware the SSE2 version is faster in almost every instance, and that is useful for the task I have in mind.  :biggrin:

jj2007

Are you sure it's unaligned? My debugger says halloc() delivers a 16-byte aligned buffer. I also wonder whether lodsd would be faster than lodsb.

hutch--

What I have to do is load data from a 32 bit app, via a memory mapped file, into a 64 bit app which uses HeapAlloc() to store the data. The input from the 32 bit side can be rough, byte-aligned string data or anything else that will fit into the memory mapped file size. If alignment were fully controlled at both ends, I would use the faster aligned SSE2 instructions.

"rep movsb" is usually faster than "rep movsd" which seems to be Intel special case circuitry and I have not seen examples of "rep lodsb" being faster so I have used "rep movsb" as a reference to compare the SSE2 version and across multiple CPUs that the folks here have tested on, the SSE2 version is always faster.

I already have prototypes of the task up and running using a 1gb memory mapped file as the data transfer window and the idea is to be able to work with a 32 bit app that can store multiple 1gb blocks in the 64 bit "container" and work on any of them 1 at a time. I had used "rep movsb" for the unaligned data transfer and it worked OK but as you start using larger blocks of memory, speed starts to matter.