Author Topic: Structure to Structure Copy  (Read 1621 times)

nidud

  • Member
  • *****
  • Posts: 1978
    • https://github.com/nidud/asmc
Re: Structure to Structure Copy
« Reply #15 on: January 29, 2020, 03:48:33 AM »
later2: removed a penalty on s2s

They produce more or less the same code, so the results should be similar, but the test is not very stable.

172 ticks for s2s
156 ticks for mov

156 ticks for s2s
171 ticks for mov

172 ticks for s2s
171 ticks for mov

Biterider

  • Moderator
  • Member
  • *****
  • Posts: 533
  • ObjAsm Developer
    • ObjAsm
Re: Structure to Structure Copy
« Reply #16 on: January 29, 2020, 03:56:37 AM »
Hi all
You will get more comparable results if you put the tested code in the same type of loop. In this case, s2s is charged with 2 additional instructions, which increases its timing count.
Changing this, I get the same timings when I compare s2s with "asmc mov".

Looking into what asmc did, I found that it translated the mov pan2,pan1 to

Code: [Select]
    mov eax, [pan1]
    mov [pan2], eax
    mov eax, [pan1 + 4]
    mov [pan2 + 4], eax
    mov al, [pan1 + 8]
    mov [pan2 + 8], al
    mov al, [pan1 + 9]
    mov [pan2 + 9], al


Note: as usual when we compare different algos, we have to take care of the warmup before the tests begin.
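The two points above (same loop for every candidate, warm-up before timing) can be sketched in C. This is only an illustration: `Panose10`, `copy_assign`, `copy_memcpy` and `time_copy` are hypothetical names, `clock()` stands in for whatever tick counter the original test used, and an optimizing compiler may simplify the loops, so the generated code should be checked before trusting the numbers.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical 10-byte stand-in for the PANOSE structure. */
typedef struct { unsigned char bytes[10]; } Panose10;

typedef void (*copy_fn)(Panose10 *dst, const Panose10 *src);

/* Two candidate copy routines to compare. */
static void copy_assign(Panose10 *dst, const Panose10 *src) { *dst = *src; }
static void copy_memcpy(Panose10 *dst, const Panose10 *src) { memcpy(dst, src, sizeof *dst); }

/* Time `iterations` calls of fn. Every candidate runs through this same
   loop, so the loop overhead is charged equally to all of them. */
static clock_t time_copy(copy_fn fn, long iterations)
{
    Panose10 src, dst;
    memset(&src, 0x5A, sizeof src);
    memset(&dst, 0, sizeof dst);

    /* Warm-up phase: executed but not timed. */
    for (long i = 0; i < iterations / 10 + 1; i++)
        fn(&dst, &src);

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        fn(&dst, &src);
    clock_t elapsed = clock() - start;

    /* Sanity check: the copy really happened. */
    assert(memcmp(&dst, &src, sizeof src) == 0);
    return elapsed;
}
```

Calling `time_copy(copy_assign, n)` and `time_copy(copy_memcpy, n)` several times in alternating order gives a feel for how much the results jitter.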

Biterider
« Last Edit: January 29, 2020, 07:56:38 AM by Biterider »

Biterider

Re: Structure to Structure Copy
« Reply #17 on: January 29, 2020, 04:00:34 AM »
Hi nidud
You posted one minute before me!  :biggrin:

Regards, Biterider

HSE

  • Member
  • *****
  • Posts: 1349
  • <AMD>< 7-32>
Re: Structure to Structure Copy
« Reply #18 on: January 29, 2020, 07:37:18 AM »
Note: as usual when we compare different algos, we have to take care of the warmup before the tests begin.

No problem. First was JJ's algorithm. :biggrin:

Code: [Select]
rc1 size = 16

1092 ticks for CargaStruct
297 ticks for StructCopy
312 ticks for s2s

Looks like the FPU can't improve on mov, at least if you have registers available.
But MOVUPS perhaps is an alternative.

daydreamer

  • Member
  • *****
  • Posts: 1321
  • building nextdoor
Re: Structure to Structure Copy
« Reply #19 on: January 30, 2020, 02:55:46 AM »
But what if you use FPU or SSE floats? As long as you just copy, it doesn't matter.
One thing I want to try with GDI POINT and RECT structures is
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fistp rectdest.y
Useful if you have graphics defined in many + and - coordinates around a center, and draw with GDI polys.
Code: [Select]
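movaps xmm0,XscaleYscale ;keep scaling xy,xy in xmm0 reg (must be 16-byte aligned)
mov ecx,lengthofpoints/2
lea esi,rectsource
lea edi,rectdest
@@L1:
movups xmm1,[esi]
cvtdq2ps xmm1,xmm1 ;POINT coordinates are integers -> convert to floats
mulps xmm1,xmm0
cvtps2dq xmm1,xmm1 ;floats -> integers (rounds per MXCSR, default nearest)
movups [edi],xmm1
add esi,sizeofpoint*2 ;2 POINT's at same time
add edi,sizeofpoint*2
sub ecx,1
jne @@L1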
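
For integer POINT data, a packed float multiply needs an int-to-float conversion before it and a float-to-int conversion after it. A minimal C sketch of that loop with SSE2 intrinsics follows. It assumes an x86 target with SSE2; `Point32` and `scale_points` are hypothetical names, and the point count is assumed to be even.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the GDI POINT structure (two 32-bit integers). */
typedef struct { int32_t x, y; } Point32;

/* Scale `count` points, two per iteration. `count` must be even. */
static void scale_points(Point32 *dst, const Point32 *src, size_t count,
                         float xscale, float yscale)
{
    assert(count % 2 == 0);

    /* Lanes in memory order: x-scale, y-scale, x-scale, y-scale. */
    const __m128 scale = _mm_setr_ps(xscale, yscale, xscale, yscale);

    for (size_t i = 0; i < count; i += 2) {
        /* Unaligned load of two points: x0, y0, x1, y1. */
        __m128i ints = _mm_loadu_si128((const __m128i *)&src[i]);
        __m128 floats = _mm_cvtepi32_ps(ints);        /* int32 -> float */
        floats = _mm_mul_ps(floats, scale);
        /* float -> int32, rounded according to MXCSR (nearest by default). */
        _mm_storeu_si128((__m128i *)&dst[i], _mm_cvtps_epi32(floats));
    }
}
```

If the source data is already stored as floats, as with D3D vertices, the two conversions can simply be dropped.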

Quote from Flashdance
Nick  :  When you give up your dream, you die
*wears a flameproof asbestos suit*
Gone serverside programming p:  :D
I love assembly,because its legal to write
princess:lea eax,luke
:)

jj2007

  • Member
  • *****
  • Posts: 10461
  • Assembler is fun ;-)
    • MasmBasic
Re: Structure to Structure Copy
« Reply #20 on: January 30, 2020, 09:01:56 AM »
But what if you use FPU or SSE floats? As long as you just copy, it doesn't matter.
One thing I want to try with GDI POINT and RECT structures is
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fistp rectdest.y

There is a reason why I used a PANOSE structure in my earlier example: It has REAL10 bytes. But fld/fstp are not very fast.

daydreamer

Re: Structure to Structure Copy
« Reply #21 on: January 30, 2020, 06:03:04 PM »
But what if you use FPU or SSE floats? As long as you just copy, it doesn't matter.
One thing I want to try with GDI POINT and RECT structures is
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fistp rectdest.y

There is a reason why I used a PANOSE structure in my earlier example: It has REAL10 bytes. But fld/fstp are not very fast.
Depends on the CPU; older AMDs used to be faster than Intel with the FPU.
It's for example to scale Italy in your Europe map to get a close-up, where the GDI polygon call takes milliseconds; I don't know how many cycles per point.
Also keep in mind converting polys from GDI to D3D custom vertices, which use floats instead of integers, so you could get hardware acceleration if there are too many polys for GDI to handle.

Biterider

Re: Structure to Structure Copy
« Reply #22 on: February 27, 2020, 08:42:30 PM »
Hi there
I took the idea suggested in the above posts and extended the s2s macro to support xmm and ymm registers. Unfortunately I don't have a CPU that supports AVX512, so I haven't added any code for zmm registers. However, it is very easy to do this by adding a single line of code.

An example of the current implementation:
Code: [Select]
local R1:RECT, R2:RECT
s2s R1, R2, xmm0
This copies the entire structure in a single loop.

If the structure is larger, additional registers can be added to the argument list:
Code: [Select]
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3
Finally, if the structure size does not fit in a ymm / xmm register, general purpose registers should be used:
Code: [Select]
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3, rcx, rdx
All these registers listed in the arguments are divided so that all of their sub-registers can be used. This way, you don't have to worry about register sizes at all.
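
The idea of wide registers for the bulk plus sub-registers for the remainder can be sketched in C with SSE2 intrinsics. This is not the s2s macro itself, only an illustration of the chunking under stated assumptions: an x86 target with SSE2, and a hypothetical helper name `copy_struct_chunks`. The 8/4/2/1-byte tail plays the role of rcx, ecx, cx and cl.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Copy `size` bytes: 16-byte unaligned vector moves first,
   then at most one move each of 8, 4, 2 and 1 bytes. */
static void copy_struct_chunks(void *dst, const void *src, size_t size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(dst != NULL && src != NULL);

    /* Bulk: one unaligned xmm load/store pair per 16 bytes. */
    while (size >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        d += 16; s += 16; size -= 16;
    }

    /* Tail: fewer than 16 bytes remain, so each chunk size is used
       at most once (binary decomposition of the remainder). */
    for (size_t chunk = 8; chunk >= 1; chunk /= 2) {
        if (size >= chunk) {
            memcpy(d, s, chunk);
            d += chunk; s += chunk; size -= chunk;
        }
    }
}
```

A 16-byte RECT goes through exactly one load/store pair here, while a 27-byte structure is split into 16 + 8 + 2 + 1.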

A last small implemented optimization is the register rotation. In some cases, this avoids time penalties.

I'm aware of the cache line problem, but this macro is intended for non-aligned structures, so I think it's not that important.  :rolleyes:

I have attached the s2s macro code and the supporting macros it needs.
As you read, you'll notice that I used this pair of instructions vlddqu/vmovdqu. This is related to this topic: http://masm32.com/board/index.php?topic=8376.msg91681#msg91681 and may change in the future.

Biterider

PS: to avoid AVX/SSE transition penalties, I updated the instructions to AVX only (new upload).
See https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties method 4.
On my system I get a 6-15x speedup depending on the structure size compared to the best rep movs algorithm.
« Last Edit: February 29, 2020, 07:42:51 AM by Biterider »

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7458
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Structure to Structure Copy
« Reply #23 on: March 15, 2020, 12:13:47 PM »
> On my system (i7-4770K), the results are:

Interesting, I have recently built myself a box using the identical processor and I am very pleased with the results. A Gigabyte board, 32 GB of DDR3 running in dual channel and a stack of 4 TB drives have made it a very useful machine. On single-threaded apps it's a tad faster than my workhorse Haswell E/EP, but the 6-core Haswell with quad-channel DDR4 memory is faster when running large multi-threaded apps.

Good to see you are a man of excellent taste when it comes to hardware.  :thumbsup:
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

Biterider

Re: Structure to Structure Copy
« Reply #24 on: March 15, 2020, 08:06:22 PM »
Hi
Good to see you are a man of excellent taste when it comes to hardware.  :thumbsup:
:biggrin: At the time I bought the system, it was a high-end one. Over time, new developments like AVX512 have come on the market that I cannot test with it. But all in all, it's a very good machine.

Biterider