The MASM Forum

Projects => ObjAsm => Topic started by: Biterider on January 19, 2020, 03:29:03 AM

Title: Structure to Structure Copy
Post by: Biterider on January 19, 2020, 03:29:03 AM
Hi
When I work on larger projects, it often happens that I have to copy structures such as POINT or RECT or larger ones. If the structures are not very large, the most efficient way to accomplish this task is to use the available registers. For example, if you are in the 64-bit world and need to copy a POINT structure, using a 64-bit register, in which both SDWORDs are copied at once, is the smartest method. The same operation in 32-bit requires 2 registers to load the values from the source structure into them and then store the values in the target structure. If only 1 register is available, the process must be unrolled reusing this register.

The other aspect is that structures can take on different sizes, so a combination of register sizes must be used to match the correct length.
Taking these basic requirements into account, I implemented, for my own convenience, a strategy in a macro called s2s (structure-to-structure copy). It can be viewed as a relative of the m2m or mrm macros.

The syntax looks like this:
local Rect1:RECT, Rect2:RECT
...
s2s Rect1, Rect2, rax, rcx
...

In the background, the $SubReg macro is used to select the correct register sizes. In this way, s2s can be used without changes in 32- and 64-bit code.

A user-friendly syntax for bitness-independent code that also works could be:
s2s Rect1, Rect2, xax, xcx
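
As a rough illustration of the register strategy described above, the following sketch shows what a RECT copy through two registers could look like once expanded. It is not the actual output of s2s, which is only available in the attachment. It assumes that the first operand is the destination (as with m2m), that Rect1 and Rect2 are the RECT locals from the example, and that @WordSize is 8 in a 64-bit build.

```asm
; Hand-written sketch, NOT the real s2s expansion.
; Assumption: Rect1 is the destination, Rect2 the source (16 bytes each).

if @WordSize eq 8
    ; 64-bit: each register moves 8 bytes, so one pass covers the RECT
    mov rax, qword ptr [Rect2]          ; left, top
    mov rcx, qword ptr [Rect2 + 8]      ; right, bottom
    mov qword ptr [Rect1], rax
    mov qword ptr [Rect1 + 8], rcx
else
    ; 32-bit: each register moves 4 bytes, so the pair is reused once
    mov eax, dword ptr [Rect2]          ; left
    mov ecx, dword ptr [Rect2 + 4]      ; top
    mov dword ptr [Rect1], eax
    mov dword ptr [Rect1 + 4], ecx
    mov eax, dword ptr [Rect2 + 8]      ; right
    mov ecx, dword ptr [Rect2 + 12]     ; bottom
    mov dword ptr [Rect1 + 8], eax
    mov dword ptr [Rect1 + 12], ecx
endif
```

With only one register available, the 32-bit branch would degrade to four load/store pairs through the same register.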

These macros are part of the upcoming ObjAsm system.inc file. I have put them together for a preview in the attached file. If the code is useful to someone, feel free to use or improve it.  :biggrin:

Regards, Biterider
Title: Re: Structure to Structure Copy
Post by: jj2007 on January 19, 2020, 05:01:59 AM
Don't forget movlps, movups and fld :cool:

include \masm32\include\masm32rt.inc
include StructCopy.inc
.686p
.xmm

.data
pan1 PANOSE <10, 11, 12, 13, 14, 15, 16, 17, 18, 19>
pan2 PANOSE <?>
pt1 POINT <100, 200>
pt2 POINT <?>
rc1 RECT <10, 20, 30, 40>
rc2 RECT <?>

.code
start:
  StructCopy pan2, pan1
  movsx eax, pan2.bFamilyType
  print str$(eax), 9, " bFamilyType", 13, 10
  movsx eax, pan2.bXHeight
  print str$(eax), 9, " bXHeight", 13, 10

  StructCopy pt2, pt1
  mov eax, pt2.x
  print str$(eax), 9, " x", 13, 10
  mov eax, pt2.y
  print str$(eax), 9, " y", 13, 10

  StructCopy rc2, rc1
  print str$(rc2.left), 9, " left", 13, 10
  print str$(rc2.bottom), 9, " bottom", 13, 10
  exit

end start
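
The three instructions mentioned at the top of this post could be used roughly as follows. This is only a sketch and not the contents of StructCopy.inc, which is not shown here. It assumes the Masm32 SDK and an assembler that accepts `xmmword ptr` (UASM or a recent ml).

```asm
include \masm32\include\masm32rt.inc
.686p
.xmm

.data
pt1  POINT  <100, 200>
pt2  POINT  <?>
rc1  RECT   <10, 20, 30, 40>
rc2  RECT   <?>
pan1 PANOSE <10, 11, 12, 13, 14, 15, 16, 17, 18, 19>
pan2 PANOSE <?>

.code
start:
  ; POINT is 8 bytes: one movlps load/store pair
  movlps xmm0, qword ptr pt1
  movlps qword ptr pt2, xmm0

  ; RECT is 16 bytes: one unaligned 128-bit load/store pair
  movups xmm0, xmmword ptr rc1
  movups xmmword ptr rc2, xmm0

  ; PANOSE is 10 bytes: an 80-bit load/store involves no format conversion
  fld  real10 ptr pan1
  fstp real10 ptr pan2

  ; Show one member of each copy
  print str$(pt2.y), 9, " y", 13, 10
  print str$(rc2.bottom), 9, " bottom", 13, 10
  movzx eax, pan2.bXHeight
  print str$(eax), 9, " bXHeight", 13, 10
  exit

end start
```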
Title: Re: Structure to Structure Copy
Post by: HSE on January 28, 2020, 02:06:25 AM
Hi Biterider!

Quote from: Biterider on January 19, 2020, 03:29:03 AM
.. a macro called s2s (structure-to-structure copy). It can be viewed as a relative of m2m or mrm macros.

Impressive!
A lot more complex than my 32bit macro  :biggrin:

CargaStruct macro tipo_de_objeto, ObjectDest, ObjectOrig
    push esi
    cld
    mov edi, &ObjectDest;[Dest]
    mov esi, &ObjectOrig
    mov ecx, sizeof &tipo_de_objeto;[ln]
    rep movsb
    pop esi
endm


I think you can create a similar macro for o2o in same way i have modified the previous macro:

CopiaObjeto macro tipo_de_objeto, ObjectDest, ObjectOrig
    push esi
    cld
    mov edi, &ObjectDest;
    add edi, 16
    mov esi, &ObjectOrig
    add esi, 16
    mov ecx, sizeof &tipo_de_objeto;
    sub ecx, 16
    rep movsb
    pop esi
endm


And i used, for example:

lea edx, dietaJenkins    ; Independent object
lea eax, [esi].Feed    ; Embedded object
CopiaObjeto MollyDiet, <eax>, <edx>


Regards. HSE
Title: Re: Structure to Structure Copy
Post by: Biterider on January 28, 2020, 03:29:20 AM
Hi HSE
Good to hear from you again!

Since the memory structure of an object is much larger than e.g. a RECT, I'm inclined to think that your macro is more efficient in terms of speed.  :thumbsup:
Objects.inc has something similar in the implementation of New.

Regards, Biterider
Title: Re: Structure to Structure Copy
Post by: HSE on January 28, 2020, 04:07:29 AM
Perfect  :thumbsup:

That can improve my macros.

Thanks.
Title: Re: Structure to Structure Copy
Post by: jj2007 on January 28, 2020, 07:08:41 AM
Hi Biterider,

I wanted to get some timings, but your macro throws errors... what's wrong?

include \masm32\include\masm32rt.inc ; plain Masm32
iterations=99999999  ; 100 Mio
include StructCopy.inc
include S2S.inc
.686p
.xmm

.data
pan1 PANOSE <10, 11, 12, 13, 14, 15, 16, 17, 18, 19>
pan2 PANOSE <?>
pt1 POINT <100, 200>
pt2 POINT <?>
rc1 RECT <10, 20, 30, 40>
rc2 RECT <?>

.code
start:
  xpan equ PANOSE ptr [ebx]

  push rv(GetTickCount)
  push iterations
  mov ebx, offset pan2
  .Repeat
; StructCopy xpan, pan1
StructCopy rc2, rc1
dec dword ptr [esp]
  .Until Sign?
  pop edx
  invoke GetTickCount
  pop ecx
  sub eax, ecx
  print str$(eax), " ticks for StructCopy", 13, 10

if 1
  push esi
  push edi
  push rv(GetTickCount)
  push iterations
  mov ebx, offset pan2
  .Repeat
TARGET_BITNESS=32
mov ecx, RECT/4
mov esi, offset rc1
mov edi, offset rc2
rep movsd
dec dword ptr [esp]
  .Until Sign?
  pop edx
  invoke GetTickCount
  pop ecx
  sub eax, ecx
  print str$(eax), " ticks for rep movsd", 13, 10
  pop edi
  pop esi
endif

if 0 ; chokes
  push rv(GetTickCount)
  push iterations
  mov ebx, offset pan2
  .Repeat
TARGET_BITNESS=32
s2s rc1, rc2, rax, rcx
dec dword ptr [esp]
  .Until Sign?
  pop edx
  invoke GetTickCount
  pop ecx
  sub eax, ecx
  print str$(eax), " ticks for s2s", 13, 10
endif

if 0
  StructCopy xpan, pan1
  movsx eax, pan2.bFamilyType
  print str$(eax), 9, " bFamilyType", 13, 10
  movsx eax, pan2.bXHeight
  print str$(eax), 9, " bXHeight", 13, 10

  StructCopy pt2, pt1
  mov eax, pt2.x
  print str$(eax), 9, " x", 13, 10
  mov eax, pt2.y
  print str$(eax), 9, " y", 13, 10

  StructCopy rc2, rc1
  print str$(rc2.left), 9, " left", 13, 10
  print str$(rc2.bottom), 9, " bottom", 13, 10
endif
  inkey "hit any key"
  exit

end start
Title: Re: Structure to Structure Copy
Post by: HSE on January 28, 2020, 08:31:42 AM
I also have a problem with s2s (in $SubReg i think)

But my test of a little improved macro:

CargaStruct macro tipo_de_objeto, ObjectDest, ObjectOrig
    push esi
    cld
    mov edi, &ObjectDest;[Dest]
    mov esi, &ObjectOrig
    $$SizetoCopy = sizeof( &tipo_de_objeto)
    mov ecx, $$SizetoCopy/@WordSize;[ln]
    if @WordSize eq 8
        if $$SizetoCopy ge 8
            rep movsq                                       ;;Copy all possible QWORDs
        endif
        if ($$SizetoCopy and 4) eq 4
            movsd
        endif
    else
        if $$SizetoCopy ge 4
            rep movsd                                       ;;Copy all possible DWORDs
        endif
    endif
    if ($$SizetoCopy and 2) eq 2
        movsw
    endif
    if ($$SizetoCopy and 1) eq 1
        movsb
    endif
    pop esi
endm

(file corrected in my following post)
Title: Re: Structure to Structure Copy
Post by: Biterider on January 28, 2020, 08:51:55 AM
Hi
There are 2 things. First, ml is not able to handle large lines, so please use uasm. Second, the $Upper was missing, my fault.
The new attachment contains all macros and should work.  :rolleyes:

Biterider
Title: Re: Structure to Structure Copy
Post by: Biterider on January 28, 2020, 08:54:37 AM
Hi HSE
:thumbsup:
Title: Re: Structure to Structure Copy
Post by: HSE on January 28, 2020, 09:11:35 AM
Everything Ok!
1762 ticks for StructCopy
858 ticks for CargaStruct
827 ticks for s2s
Title: Re: Structure to Structure Copy
Post by: jj2007 on January 28, 2020, 09:29:51 AM
Works fine now :thumbsup:

203 ticks for StructCopy
889 ticks for rep movsd
203 ticks for s2s
Title: Re: Structure to Structure Copy
Post by: Biterider on January 28, 2020, 05:37:43 PM
Hi
On my system (i7-4770K), the results are:

234 ticks for StructCopy
969 ticks for rep movsd
141 ticks for s2s


Biterider
Title: Re: Structure to Structure Copy
Post by: nidud on January 29, 2020, 01:13:01 AM
deleted
Title: Re: Structure to Structure Copy
Post by: HSE on January 29, 2020, 02:52:10 AM
Confirmed!!

Nidud's machine has a mental problem with 32 bits  :biggrin:
Perhaps there is some problem on a 64-bit machine because of @WordSize :cool:

Btw, I made some modifications for a fair test (I also corrected the wrong destinations and used the same type of loop):

483 ticks for StructCopy
796 ticks for CargaStruct
312 ticks for s2s
312 ticks for mov


later: a new version
later2: removed a penalty on s2s
Title: Re: Structure to Structure Copy
Post by: nidud on January 29, 2020, 03:37:02 AM
deleted
Title: Re: Structure to Structure Copy
Post by: nidud on January 29, 2020, 03:48:33 AM
deleted
Title: Re: Structure to Structure Copy
Post by: Biterider on January 29, 2020, 03:56:37 AM
Hi all
You will get more comparable results if you put the tested code in the same type of loop. In this case, s2s is charged with 2 additional instructions, which increases its tick count.
After changing this, I get the same timings when I compare s2s with "asmc mov".

Looking into what asmc did, I found that it translated the mov pan2,pan1 to

    mov eax, [pan1]
    mov [pan2], eax
    mov eax, [pan1 + 4]
    mov [pan2 + 4], eax
    mov al, [pan1 + 8]
    mov [pan2 + 8], al
    mov al, [pan1 + 9]
    mov [pan2 + 9], al


Note: as usual when we compare different algos, we have to take care of the warmup before the tests begin.
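
A minimal sketch of such a harness, assuming the Masm32 SDK: every candidate runs inside the same loop macro, and an untimed warm-up pass precedes the timed one. CopyUnderTest and RunLoop are hypothetical names introduced only for this illustration, and the warm-up count is an arbitrary choice.

```asm
include \masm32\include\masm32rt.inc

iterations = 100000000      ; timed passes
warmup     = 1000000        ; untimed passes (arbitrary choice)

; Hypothetical stand-in: replace the body with the copy being measured
CopyUnderTest macro
    mov eax, rc1.left
    mov ecx, rc1.top
    mov rc2.left, eax
    mov rc2.top, ecx
    mov eax, rc1.right
    mov ecx, rc1.bottom
    mov rc2.right, eax
    mov rc2.bottom, ecx
endm

; Identical loop overhead for every candidate; the counter lives on the stack
RunLoop macro count
    local again
    push count
again:
    CopyUnderTest
    dec dword ptr [esp]
    jnz again
    add esp, 4              ; drop the counter
endm

.data
rc1 RECT <10, 20, 30, 40>
rc2 RECT <?>

.code
start:
    RunLoop warmup          ; warm-up, not timed
    invoke GetTickCount
    push eax                ; start time
    RunLoop iterations
    invoke GetTickCount
    pop ecx
    sub eax, ecx            ; elapsed milliseconds
    print str$(eax), " ticks for CopyUnderTest", 13, 10
    exit

end start
```

Keeping the counter on the stack leaves all registers free for the code under test, as in the test programs earlier in this thread.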

Biterider
Title: Re: Structure to Structure Copy
Post by: Biterider on January 29, 2020, 04:00:34 AM
Hi nidud
You posted one minute before me!  :biggrin:

Regards, Biterider
Title: Re: Structure to Structure Copy
Post by: HSE on January 29, 2020, 07:37:18 AM
Quote from: Biterider on January 29, 2020, 03:56:37 AM
Note: as usual when we compare different algos, we have to take care of the warmup before the tests begin.

No problem. First was JJ's algorithm. :biggrin:

rc1 size = 16

1092 ticks for CargaStruct
297 ticks for StructCopy
312 ticks for s2s


Looks like the FPU can't improve on mov, at least if you have registers available.
But MOVUPS is perhaps an alternative.
Title: Re: Structure to Structure Copy
Post by: daydreamer on January 30, 2020, 02:55:46 AM
But what if you use FPU or SSE floats? As long as you just copy, it doesn't matter.
One thing I want to try with GDI POINT and RECT structures is:
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fistp rectdest.y
This is useful if you have graphics defined by many positive and negative coordinates around a center and draw them with GDI polys:
movaps xmm0,XscaleYscale ;keep scaling xy,xy in xmm0 reg
mov ecx,lengthofpoints/2
lea esi,rectsource
lea edi,rectdest
@@L1:
movups xmm1,[esi]
cvtdq2ps xmm1,xmm1 ;POINT holds integers: convert to floats before scaling
mulps xmm1,xmm0
cvtps2dq xmm1,xmm1 ;convert back to integers for GDI
movups [edi],xmm1
add esi,sizeofpoint*2 ;2 POINT's at same time
add edi,sizeofpoint*2
sub ecx,1
jne @@L1

Title: Re: Structure to Structure Copy
Post by: jj2007 on January 30, 2020, 09:01:56 AM
Quote from: daydreamer on January 30, 2020, 02:55:46 AM
But what if you use FPU or SSE floats? As long as you just copy, it doesn't matter.
One thing I want to try with GDI POINT and RECT structures is:
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fistp rectdest.y

There is a reason why I used a PANOSE structure in my earlier example: It is 10 bytes long, the size of a REAL10. But fld/fstp are not very fast.
Title: Re: Structure to Structure Copy
Post by: daydreamer on January 30, 2020, 06:03:04 PM
Quote from: jj2007 on January 30, 2020, 09:01:56 AM
There is a reason why I used a PANOSE structure in my earlier example: It is 10 bytes long, the size of a REAL10. But fld/fstp are not very fast.

That depends on the CPU; older AMDs used to be faster than Intel at FPU code.
The use case is, for example, scaling Italy in your Europe map to get a close-up, where the GDI polygon call takes milliseconds; I don't know how many cycles per point that is.
Also keep in mind the path from GDI polys to a D3D custom vertex, which uses floats instead of integers, so you could get hardware acceleration if there are too many polys for GDI to handle.
Title: Re: Structure to Structure Copy
Post by: Biterider on February 27, 2020, 08:42:30 PM
Hi there
I took the idea suggested in the above posts and extended the s2s macro to support xmm and ymm registers. Unfortunately I don't have a CPU that supports AVX512, so I haven't added any code for zmm registers. However, it is very easy to do this by adding a single line of code.

An example of the current implementation:
local R1:RECT, R2:RECT
s2s R1, R2, xmm0

This copies the entire structure with a single load/store pair.

If the structure is larger, additional registers can be added to the argument list:
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3

Finally, if the structure size is not a multiple of the ymm / xmm register size, general purpose registers should be added for the remainder:
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3, rcx, rdx
All registers listed in the arguments are divided so that all of their sub-registers can be used. This way, you don't have to worry about register sizes at all.

A last small implemented optimization is the register rotation. In some cases, this avoids time penalties.
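
To make the register selection more concrete, here is a hand-written sketch of what such copies could expand to. It is not the real macro output. It assumes a 64-bit build with an AVX-capable assembler (e.g. UASM) and CPU, and that the first operand is the destination. S1 and S2 are hypothetical 24-byte structures introduced only for this example.

```asm
; Hand-written sketch, NOT the real s2s expansion.
; R1, R2: RECT (16 bytes). S1, S2: hypothetical 24-byte structures.

; s2s R1, R2, xmm0
; 16 bytes fit exactly into one xmm register
vlddqu  xmm0, xmmword ptr [R2]
vmovdqu xmmword ptr [R1], xmm0

; s2s S1, S2, xmm0, rcx
; 16 bytes go through xmm0, the remaining 8 through rcx
vlddqu  xmm0, xmmword ptr [S2]
mov     rcx, qword ptr [S2 + 16]
vmovdqu xmmword ptr [S1], xmm0
mov     qword ptr [S1 + 16], rcx
```

Issuing both loads before the stores is one plausible ordering when several registers are supplied; the macro's actual ordering and rotation scheme may differ.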

I'm aware of the cache line problem, but this macro is intended for non-aligned structures, so I think it's not that important.  :rolleyes:

I have attached the s2s macro code and the supporting macros it needs.
As you read it, you'll notice that I used the instruction pair vlddqu/vmovdqu. This is related to this topic: http://masm32.com/board/index.php?topic=8376.msg91681#msg91681 and may change in the future.

Biterider

PS: to avoid AVX/SSE transition penalties, I updated the instructions to AVX only (new upload).
See https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties method 4.
On my system I get a 6-15x speedup depending on the structure size compared to the best rep movs algorithm.
Title: Re: Structure to Structure Copy
Post by: hutch-- on March 15, 2020, 12:13:47 PM
> On my system (i7-4770K), the results are:

Interesting, I have recently built myself a box using the identical processor and I am very pleased with the results. A Gigabyte board, 32 gig of DDR3 run in dual channel and a stack of 4 TB drives have made it a very useful machine. On single-thread apps it's a tad faster than my workhorse Haswell E/EP, but the 6 core Haswell with quad channel DDR4 memory is faster when running large multi-threaded apps.

Good to see you are a man of excellent taste when it comes to hardware.  :thumbsup:
Title: Re: Structure to Structure Copy
Post by: Biterider on March 15, 2020, 08:06:22 PM
Hi
Quote from: hutch-- on March 15, 2020, 12:13:47 PM
Good to see you are a man of excellent taste when it comes to hardware.  :thumbsup:
:biggrin: At the time I bought the system, it was a high-end one. Over time, new developments like AVX512 have come on the market that I cannot test with it. But all in all, it's a very good machine.

Biterider