Hi there
I took the idea suggested in the above posts and extended the s2s macro to support xmm and ymm registers. Unfortunately, I don't have a CPU that supports AVX-512, so I haven't added any code for zmm registers; however, it is very easy to do by adding a single line of code.
An example of the current implementation:
local R1:RECT, R2:RECT
s2s R1, R2, xmm0
This copies the entire structure in a single pass.
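To make it concrete, for a 16-byte RECT this boils down to one unaligned load plus one unaligned store, roughly like this (a simplified sketch of the expansion, assuming R1 is the source and R2 the destination; the macro's actual argument order and output may differ):

  ; rough sketch of what "s2s R1, R2, xmm0" generates for a 16-byte RECT
  vlddqu  xmm0, xmmword ptr R1            ; unaligned 128-bit load from the source
  vmovdqu xmmword ptr R2, xmm0            ; unaligned 128-bit store to the destination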
If the structure is larger, additional registers can be added to the argument list:
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3
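With a 64-byte structure, for example, these four xmm registers cover the whole copy, roughly like this (simplified sketch, assuming a 64-byte Struc1/Struc2 with Struc1 as the source):

  ; rough sketch for a 64-byte structure
  vlddqu  xmm0, xmmword ptr Struc1[0]
  vlddqu  xmm1, xmmword ptr Struc1[16]
  vlddqu  xmm2, xmmword ptr Struc1[32]
  vlddqu  xmm3, xmmword ptr Struc1[48]
  vmovdqu xmmword ptr Struc2[0],  xmm0
  vmovdqu xmmword ptr Struc2[16], xmm1
  vmovdqu xmmword ptr Struc2[32], xmm2
  vmovdqu xmmword ptr Struc2[48], xmm3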
Finally, if the structure size does not divide evenly into ymm/xmm-sized chunks, general purpose registers should be added to handle the remainder:
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3, rcx, rdx
All registers listed in the arguments are split so that all of their sub-registers can be used as well. This way, you don't have to worry about register sizes at all.
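For example, if 14 bytes remain after the xmm-sized chunks, a single rcx in the list is enough, because ecx and cx can take care of the tail. Roughly (the offsets are invented for the example):

  ; rough sketch of a 14-byte tail copy using rcx and its sub-registers
  mov rcx, qword ptr Struc1[64]           ; 8 bytes
  mov qword ptr Struc2[64], rcx
  mov ecx, dword ptr Struc1[72]           ; 4 bytes
  mov dword ptr Struc2[72], ecx
  mov cx,  word ptr Struc1[76]            ; 2 bytes
  mov word ptr Struc2[76], cx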
A last small optimization that was implemented is register rotation, which in some cases avoids timing penalties.
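In short, the registers from the argument list are used in a rotating fashion instead of reusing the same register back to back. A simplified illustration of the idea (not the macro's literal output):

  ; without rotation: every load/store pair reuses xmm0
  vlddqu  xmm0, xmmword ptr Struc1[0]
  vmovdqu xmmword ptr Struc2[0], xmm0
  vlddqu  xmm0, xmmword ptr Struc1[16]
  vmovdqu xmmword ptr Struc2[16], xmm0

  ; with rotation: consecutive chunks use different registers,
  ; so the loads and stores can be scheduled more freely
  vlddqu  xmm0, xmmword ptr Struc1[0]
  vlddqu  xmm1, xmmword ptr Struc1[16]
  vmovdqu xmmword ptr Struc2[0],  xmm0
  vmovdqu xmmword ptr Struc2[16], xmm1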
I'm aware of the cache-line split issue, but since this macro is intended for unaligned structures, I don't think it matters much here.
I've attached the s2s macro code along with the supporting macros it needs.
As you read the code, you'll notice that I used the vlddqu/vmovdqu instruction pair. This is related to this topic: http://masm32.com/board/index.php?topic=8376.msg91681#msg91681 and may change in the future.
Biterider
PS: to avoid AVX/SSE transition penalties, I updated the instructions to AVX only (new upload).
See method 4 in https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties.
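In practice this means sticking to the VEX-encoded (v-prefixed) forms everywhere; the legacy SSE encodings are the ones that can trigger the transition penalty when the upper halves of the ymm registers are dirty. As an illustration (both lines perform the same unaligned 128-bit load):

  movdqu  xmm0, xmmword ptr Struc1        ; legacy SSE encoding - can cause an AVX/SSE transition penalty
  vlddqu  xmm0, xmmword ptr Struc1        ; VEX-encoded (AVX) form - no transition, upper ymm bits are zeroed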
On my system, I get a 6-15x speedup over the best rep movs algorithm, depending on the structure size.