Author Topic: Structure to Structure Copy  (Read 6839 times)

nidud

  • Member
  • *****
  • Posts: 2390
    • https://github.com/nidud/asmc
Re: Structure to Structure Copy
« Reply #15 on: January 29, 2020, 03:48:33 AM »
deleted
« Last Edit: February 26, 2022, 03:38:12 AM by nidud »

Biterider

  • Moderator
  • Member
  • *****
  • Posts: 893
  • ObjAsm Developer
    • ObjAsm
Re: Structure to Structure Copy
« Reply #16 on: January 29, 2020, 03:56:37 AM »
Hi all
You will get more comparable results if you put the tested code in the same type of loop. In this case, s2s is charged with 2 additional instructions, which increases its timing count.
Changing this, I get the same timings when I compare s2s with "asmc mov".

Looking into what asmc did, I found that it translated the mov pan2,pan1 to

Code: [Select]
    mov eax, [pan1]
    mov [pan2], eax
    mov eax, [pan1 + 4]
    mov [pan2 + 4], eax
    mov al, [pan1 + 8]
    mov [pan2 + 8], al
    mov al, [pan1 + 9]
    mov [pan2 + 9], al


Note: as usual when we compare different algos, we have to take care of the warmup before the tests begin.
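
The two points above (one shared loop for every candidate, and a warm-up before the measurement) could be sketched in C roughly as follows. The names `Panose10`, `copy_fn`, `time_copy` and the use of `clock()` are assumptions made for illustration; they are not the timing macros used in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical 10-byte structure standing in for PANOSE. */
typedef struct { unsigned char b[10]; } Panose10;

typedef void (*copy_fn)(Panose10 *dst, const Panose10 *src);

/* Two placeholder candidates to compare. */
static void copy_assign(Panose10 *dst, const Panose10 *src) { *dst = *src; }
static void copy_memcpy(Panose10 *dst, const Panose10 *src) { memcpy(dst, src, sizeof *dst); }

/* Runs fn in one shared loop so every candidate pays the same loop overhead.
   The warm-up pass is executed but not timed. Returns elapsed clock() ticks. */
static clock_t time_copy(copy_fn fn, long warmup, long iters)
{
    Panose10 src, dst;
    volatile unsigned char sink = 0;  /* discourages the compiler from dropping the copies */
    assert(fn != NULL && iters > 0);
    memset(&src, 0x5A, sizeof src);

    for (long i = 0; i < warmup; ++i) { fn(&dst, &src); sink ^= dst.b[0]; }

    clock_t t0 = clock();
    for (long i = 0; i < iters; ++i) { fn(&dst, &src); sink ^= dst.b[0]; }
    clock_t t1 = clock();

    (void)sink;
    return t1 - t0;
}
```

Calling `time_copy(copy_assign, 1000, 1000000)` and `time_copy(copy_memcpy, 1000, 1000000)` then compares both candidates under identical loop overhead. `clock()` is coarse, so a cycle counter would be the better tool for timings as small as the ones discussed here.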

Biterider
« Last Edit: January 29, 2020, 07:56:38 AM by Biterider »

Biterider

Re: Structure to Structure Copy
« Reply #17 on: January 29, 2020, 04:00:34 AM »
Hi nidud
You posted one minute before me!  :biggrin:

Regards, Biterider

HSE

  • Member
  • *****
  • Posts: 2070
  • AMD 7-32 / i3 10-64
Re: Structure to Structure Copy
« Reply #18 on: January 29, 2020, 07:37:18 AM »
> Note: as usual when we compare different algos, we have to take care of the warmup before the tests begin.

No problem. First was JJ's algorithm. :biggrin:

Code: [Select]
rc1 size = 16

1092 ticks for CargaStruct
297 ticks for StructCopy
312 ticks for s2s

Looks like the FPU can't improve on mov, at least if you have registers available.
But MOVUPS is perhaps an alternative.
Equations in Assembly: SmplMath

daydreamer

  • Member
  • *****
  • Posts: 1985
  • "follow the blue star!!!"
Re: Structure to Structure Copy
« Reply #19 on: January 30, 2020, 02:55:46 AM »
But what if you use FPU or SSE floats? As long as you just copy, it doesn't matter.
One thing I want to try with GDI POINT and RECT structures is
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fistp rectdest.y
This is useful if you have graphics defined with many positive and negative coordinates around a center and draw them with GDI polygons.
Code: [Select]
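movaps xmm0,XscaleYscale ;keep the x,y,x,y scale factors (REAL4) in xmm0
mov ecx,lengthofpoints/2 ;two POINTs per iteration
lea esi,rectsource
lea edi,rectdest
@@L1:
movups xmm1,[esi] ;load 2 POINTs = 4 LONGs
cvtdq2ps xmm1,xmm1 ;POINT members are integers, so convert to REAL4 first
mulps xmm1,xmm0
cvtps2dq xmm1,xmm1 ;round back to integers before storing
movups [edi],xmm1
add esi,sizeofpoint*2 ;2 POINTs at the same time
add edi,sizeofpoint*2
sub ecx,1
jne @@L1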
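
GDI POINT coordinates are 32-bit integers, so a packed-float multiply needs an integer-to-float conversion on the way in and a float-to-integer conversion on the way out. A minimal C sketch of that idea with SSE2 intrinsics follows; `Point2i` and `scale_points` are hypothetical names for illustration, and the code builds for x86/x64 targets only.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics; x86/x64 only */

/* Hypothetical stand-in for the GDI POINT structure (two 32-bit integers). */
typedef struct { int x, y; } Point2i;

/* Scales n points by (xscale, yscale), two points per iteration.
   An odd trailing point goes through the same conversion path. */
static void scale_points(Point2i *dst, const Point2i *src, size_t n,
                         float xscale, float yscale)
{
    assert(dst != NULL && src != NULL);
    /* _mm_set_ps lists lanes from high to low, so memory order is x,y,x,y. */
    const __m128 scale = _mm_set_ps(yscale, xscale, yscale, xscale);
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        __m128i pi = _mm_loadu_si128((const __m128i *)&src[i]); /* unaligned load */
        __m128  pf = _mm_cvtepi32_ps(pi);                       /* cvtdq2ps */
        pf = _mm_mul_ps(pf, scale);                             /* mulps    */
        pi = _mm_cvtps_epi32(pf);                               /* cvtps2dq */
        _mm_storeu_si128((__m128i *)&dst[i], pi);
    }
    if (i < n) {  /* one point left: 8-byte load and store */
        __m128i pi = _mm_loadl_epi64((const __m128i *)&src[i]);
        __m128  pf = _mm_mul_ps(_mm_cvtepi32_ps(pi), scale);
        _mm_storel_epi64((__m128i *)&dst[i], _mm_cvtps_epi32(pf));
    }
}
```

Single-precision floats represent integers exactly only up to 2^24, which is plenty for screen coordinates but worth keeping in mind for larger values.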

SIMD fan and macro fan
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Teacher "REAL8 + QWORD is like apples and oranges,you cant mix them"
Student "ofcourse you can,it becomes a fruit salad" :)

jj2007

  • Member
  • *****
  • Posts: 12453
  • Assembler is fun ;-)
    • MasmBasic
Re: Structure to Structure Copy
« Reply #20 on: January 30, 2020, 09:01:56 AM »
> But what if you use FPU or SSE floats? As long as you just copy, it doesn't matter.
> One thing I want to try with GDI POINT and RECT structures is
> fild rectsource.x
> fmul xscale
> fistp rectdest.x
> fild rectsource.y
> fmul yscale
> fistp rectdest.y

There is a reason why I used a PANOSE structure in my earlier example: It has REAL10 bytes. But fld/fstp are not very fast.

daydreamer

Re: Structure to Structure Copy
« Reply #21 on: January 30, 2020, 06:03:04 PM »
> > But what if you use FPU or SSE floats? As long as you just copy, it doesn't matter.
> > One thing I want to try with GDI POINT and RECT structures is
> > fild rectsource.x
> > fmul xscale
> > fistp rectdest.x
> > fild rectsource.y
> > fmul yscale
> > fistp rectdest.y
>
> There is a reason why I used a PANOSE structure in my earlier example: It has REAL10 bytes. But fld/fstp are not very fast.
That depends on the CPU; old AMDs used to be faster than Intel with the FPU.
It's for scaling, for example, Italy in your Europe map to get a close-up, where a GDI polygon call takes milliseconds; I don't know how many cycles per point.
Also keep in mind converting polys from GDI to a D3D custom vertex, which uses floats instead of integers, so you could get hardware acceleration if there are too many polys for GDI to handle.

Biterider

Re: Structure to Structure Copy
« Reply #22 on: February 27, 2020, 08:42:30 PM »
Hi there
I took the idea suggested in the above posts and extended the s2s macro to support xmm and ymm registers. Unfortunately I don't have a CPU that supports AVX512, so I haven't added any code for zmm registers. However, it is very easy to do this by adding a single line of code.

An example of the current implementation:
Code: [Select]
local R1:RECT, R2:RECT
s2s R1, R2, xmm0
This copies the entire structure in a single loop.

If the structure is larger, additional registers can be added to the argument list:
Code: [Select]
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3
Finally, if the structure size does not fit evenly into ymm/xmm registers, general purpose registers should be used as well:
Code: [Select]
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3, rcx, rdx
All the registers listed in the arguments are split up so that all of their sub-registers can be used. This way, you don't have to worry about register sizes at all.

One last small optimization I implemented is register rotation. In some cases, this avoids time penalties.

I'm aware of the cache line problem, but this macro is intended for non-aligned structures, so I think it's not that important.  :rolleyes:
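
The idea of covering a structure with the widest available register first and then falling back to narrower sub-registers can be illustrated with a small C sketch. This is not the s2s macro itself, and how s2s actually divides the registers may differ; `chunked_copy` and its width table are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies `size` bytes using the widest chunk that still fits, then
   progressively narrower ones: 32 (ymm), 16 (xmm), 8, 4, 2, 1 bytes.
   Returns the number of chunk moves performed. */
static size_t chunked_copy(void *dst, const void *src, size_t size)
{
    static const size_t widths[] = { 32, 16, 8, 4, 2, 1 };
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t moves = 0;
    assert(dst != NULL && src != NULL);

    for (size_t w = 0; w < sizeof widths / sizeof widths[0]; ++w) {
        while (size >= widths[w]) {
            memcpy(d, s, widths[w]);  /* stands in for one load/store of that width */
            d += widths[w];
            s += widths[w];
            size -= widths[w];
            ++moves;
        }
    }
    return moves;
}
```

With this decomposition a 16-byte RECT needs a single 16-byte move, and a 10-byte structure such as PANOSE needs two moves (8 + 2), where the asmc expansion shown earlier in the thread used four (4 + 4 + 1 + 1).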

I have attached the s2s macro code and the supporting macros it needs.
As you read it, you'll notice that I used the instruction pair vlddqu/vmovdqu. This is related to this topic: http://masm32.com/board/index.php?topic=8376.msg91681#msg91681 and may change in the future.

Biterider

PS: To avoid AVX/SSE transition penalties, I updated the instructions to AVX only (new upload).
See https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties method 4.
On my system I get a 6-15x speedup depending on the structure size compared to the best rep movs algorithm.
« Last Edit: February 29, 2020, 07:42:51 AM by Biterider »

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 9325
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Structure to Structure Copy
« Reply #23 on: March 15, 2020, 12:13:47 PM »
> On my system (i7-4770K), the results are:

Interesting, I recently built myself a box using the identical processor and I am very pleased with the results. A Gigabyte board, 32 GB of DDR3 running in dual channel and a stack of 4 TB drives have made it a very useful machine. On single-threaded apps it's a tad faster than my workhorse Haswell E/EP, but the 6-core Haswell with quad-channel DDR4 memory is faster when running large multi-threaded apps.

Good to see you are a man of excellent taste when it comes to hardware.  :thumbsup:
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

Biterider

Re: Structure to Structure Copy
« Reply #24 on: March 15, 2020, 08:06:22 PM »
Hi
> Good to see you are a man of excellent taste when it comes to hardware.  :thumbsup:
:biggrin: At the time I bought the system, it was a high-end one. Over time, new developments like AVX512 have come on the market that I cannot test with it. But all in all, it's a very good machine.

Biterider