News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Structure to Structure Copy

Started by Biterider, January 19, 2020, 03:29:03 AM

Previous topic - Next topic

nidud

#15
deleted

Biterider

#16
Hi all
You will get more comparable results if you put the tested code in the same type of loop. In this case, s2s is charged with 2 additional instructions, that increases the timing count.
Changing this, I get the same timings when I compare s2s with "asmc mov".

Looking into what asmc did, I found that it translated the mov pan2,pan1 to

    mov eax, [pan1]
    mov [pan2], eax
    mov eax, [pan1 + 4]
    mov [pan2 + 4], eax
    mov al, [pan1 + 8]
    mov [pan2 + 8], al
    mov al, [pan1 + 9]
    mov [pan2 + 9], al


Note: as usual when we compare different algos, we have to take care of the warmup before the tests begin.

Biterider

Biterider

Hi nidud
You posted one minute before me!  :biggrin:

Regards, Biterider

HSE

Quote from: Biterider on January 29, 2020, 03:56:37 AM
Note: as usual when we compare different algos, we have to take care of the warmup before the tests begin.

No problem. First was JJ's algorithm. :biggrin:

rc1 size = 16

1092 ticks for CargaStruct
297 ticks for StructCopy
312 ticks for s2s


Look like FPU can't improve mov, at least f you have registers available.
But MOVUPS perhaps is an alternative.
Equations in Assembly: SmplMath

daydreamer

but if you use fpu or SSE floats?as long as you just copy it doesnt matter
one thing I want to try in GDI POINT and RECT structures are
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fisp rectdest.y
useful if you have graphics defined in many + and - coordinates around center,and draw with GDI polys
movaps xmm0,XscaleYscale ;keep scaling xy,xy in xmm0 reg
mov ecx,lengthofpoints/2
lea esi,rectsource
lea edi,rectdest
@@L1:
movups xmm1,[esi]
mulps xmm1,xmm0
movups [edi],xmm1
add esi,sizeofpoint*2 ;2 POINT's at same time
add edi,sizeofpoint*2
sub ecx,1
jne @@L1

my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Quote from: daydreamer on January 30, 2020, 02:55:46 AM
but if you use fpu or SSE floats?as long as you just copy it doesnt matter
one thing I want to try in GDI POINT and RECT structures are
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fisp rectdest.y

There is a reason why I used a PANOSE structure in my earlier example: It has REAL10 bytes. But fld/fstp are not very fast.

daydreamer

Quote from: jj2007 on January 30, 2020, 09:01:56 AM
Quote from: daydreamer on January 30, 2020, 02:55:46 AM
but if you use fpu or SSE floats?as long as you just copy it doesnt matter
one thing I want to try in GDI POINT and RECT structures are
fild rectsource.x
fmul xscale
fistp rectdest.x
fild rectsource.y
fmul yscale
fisp rectdest.y

There is a reason why I used a PANOSE structure in my earlier example: It has REAL10 bytes. But fld/fstp are not very fast.
Depends on cpu,old amds faster than Intel before with fpu
Its for example scale Italy,in your Europe map,to get close-up, where gdi polygon call is milliseconds, don't know how much cycles/point
Also have in mind, polys for gdi-> d3d customvertex, which uses floats, instead of integers, so you could get hardware acceleration if too many polys for gdi to handle
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

Biterider

#22
Hi there
I took the idea suggested in the above posts and extended the s2s macro to support xmm and ymm registers. Unfortunately I don't have a CPU that supports AVX512, so I haven't added any code for zmm registers. However, it is very easy to do this by adding a single line of code.

An example of the current implementation:
local R1:RECT, R2:RECT
s2s R1, R2, xmm0

This copies the entire structure in a single loop.

If the structure is larger, additional registers can be added to the argument list:
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3

Finally, if the structure size does not fit in a ymm / xmm register, general purpose registers should be used:
s2s Struc1, Struc2, xmm0, xmm1, xmm2, xmm3, rcx, rdx
All these registers listed in the arguments are divided so that all of their sub-registers can be used. This way, you don't have to worry about register sizes at all.

A last small implemented optimization is the register rotation. In some cases, this avoids time penalties.

I'm aware of the cache line problem, but this macro is intended for non-aligned structures, so I think it's not that important.  :rolleyes:

I add the s2s macro code and the supporting macros needed.
As you read, you'll notice that I used this pair of instructions vlddqu/vmovdqu. This is related to this topic: http://masm32.com/board/index.php?topic=8376.msg91681#msg91681 and may change in the future.

Biterider

PS: to avoid AVX/SEE transition penalties, I updated the instructions to AVX only (new upload).
See https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties method 4.
On my system I get a 6-15x speedup depending on the structure size compared to the best rep movs algorithm.

hutch--

> On my system (i7-4770K), the results are:

Interesting, I have recently built myself a box using the identical processor and I am very pleased with the results. A gigabyte board, 32 gig of DDR3 run in dual channel and a stack of 4tb drives has made it a very useful machine. On single thread apps its a tad faster than my workhorse Haswell E/EP but the 6 core Haswell with quad channel DDR4 memory is faster when doing large multi-threading apps.

Good to see you are a man of excellent taste when it comes to hardware.  :thumbsup:

Biterider

Hi
Quote from: hutch-- on March 15, 2020, 12:13:47 PM
Good to see you are a man of excellent taste when it comes to hardware.  :thumbsup:
:biggrin: At the time I bought the system, it was a high-end one. Over time, new developments like AVX512 have come on the market that I cannot test with it. But all in all, it's a very good machine.

Biterider