Adding QWORDs with reg32, paddq and the FPU

Started by jj2007, November 09, 2020, 09:38:42 AM

daydreamer

This program was assembled with ml64 in 64-bit format.
rax is GetTickCount difference
---
reg add rax     109
FPU add rax     266
xmm add rax     140
---
reg add rax     110
FPU add rax     265
xmm add rax     141
---
reg add rax     109
FPU add rax     250
xmm add rax     141
---
reg add rax     109
FPU add rax     266
xmm add rax     141


What about negative versions? In particular, one comparing NEG vs SUB vs fchs.
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

HSE

Very interesting!

It's not a wow64 effect :thumbsup:


Correcting results with the other test:

fact    reg 172  fpu 280  xmm 218

corrected to same:

reg     reg 172  fpu 258  xmm  95
fpu     reg 187  fpu 280  xmm 104
xmm     reg 391  fpu 587  xmm 218
   
Equations in Assembly: SmplMath

jj2007

Quote from: daydreamer on November 10, 2020, 05:25:57 AM
What about negative versions? In particular, one comparing NEG vs SUB vs fchs.

Go ahead, don't be shy!

jj2007

Quote from: HSE on November 10, 2020, 05:34:07 AM
Very interesting!

It's not a wow64 effect :thumbsup:


Correcting results with the other test:

fact    reg 172  fpu 280  xmm 218

corrected to same:

reg     reg 172  fpu 258  xmm  95
fpu     reg 187  fpu 280  xmm 104
xmm     reg 391  fpu 587  xmm 218


I don't understand what you have done, sorry :cool:

HSE

It's an estimate that assumes proportionality with your first test (reg = 140, fpu = 210, xmm = 78).

The preliminary suggestion is that the 32-bit and 64-bit tests are not directly comparable.
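
For readers following along, here is a minimal sketch of how such a proportional correction could be computed. The function name scale_timing is hypothetical and not from the thread; the baseline values 140/210/78 and the measured values 172/280/218 are the ones quoted above, and the thread's figures appear to be rounded to whole numbers.

```c
#include <stdint.h>

/* Hypothetical helper (not from the thread): scale a baseline timing so
   that the reference column matches the measured value.
   estimate = base * measured_ref / base_ref */
double scale_timing(double base, double base_ref, double measured_ref)
{
    return base * measured_ref / base_ref;
}
```

For example, scale_timing(210, 140, 172) scales the baseline fpu timing so that the reg column matches the measured 172, and scale_timing(78, 210, 280) scales the baseline xmm timing so that the fpu column matches the measured 280.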

jj2007

Quote from: HSE on November 10, 2020, 05:55:57 AM
The preliminary suggestion is that the 32-bit and 64-bit tests are not directly comparable.

I still don't understand what you've done, sorry. Probably, I'm a bit dumb today. Anyway, as regards the "normal" add instruction, it's obvious that the 64-bit version is much faster: in 32-bit, you need two instructions, add and adc, to add the two qwords.
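
To illustrate the add and adc pair, here is a rough C sketch of a 64-bit addition done with 32-bit halves only. It is not the code from the attached test; the type qword32 and both function names are made up for this illustration, and the carry flag is emulated with a comparison.

```c
#include <stdint.h>

/* Hypothetical type: a QWORD held as two 32-bit halves, the way 32-bit
   code keeps it in a register pair such as edx:eax. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} qword32;

/* Add two QWORDs using 32-bit arithmetic only. */
qword32 add_qword32(qword32 a, qword32 b)
{
    qword32 r;
    uint32_t carry;

    r.lo = a.lo + b.lo;          /* "add": low halves, wraps modulo 2^32 */
    carry = (r.lo < a.lo);       /* emulated carry flag: 1 if the low add wrapped */
    r.hi = a.hi + b.hi + carry;  /* "adc": high halves plus the carry */
    return r;
}

/* The same addition as one 64-bit operation, as 64-bit code can do it. */
uint64_t add_qword64(uint64_t a, uint64_t b)
{
    return a + b;
}
```

Adding 1 to 00000001FFFFFFFFh with add_qword32 wraps the low half to zero and carries into the high half, which is the step the single 64-bit add does not need.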

HSE

Quote from: jj2007 on November 10, 2020, 06:57:38 AM
I still don't understand what you've done, sorry. Probably, I'm a bit dumb today.
:biggrin: Only the correction I made is very dumb, just a proportion.

Quote from: jj2007 on November 10, 2020, 06:57:38 AM
Anyway, as regards the "normal" add instruction, it's obvious that the 64-bit version is much faster: in 32-bit, you need two instructions, add and adc, to add the two qwords.
Fantastic! Then xmm looks very slow.

mineiro

$ wine AddingQwords.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

96 cycles for 100 * add+adc
253 cycles for 100 * fadd
191 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

99 cycles for 100 * add+adc
253 cycles for 100 * fadd
196 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

106 cycles for 100 * add+adc
254 cycles for 100 * fadd
192 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

94 cycles for 100 * add+adc
254 cycles for 100 * fadd
185 cycles for 100 * paddq aligned
78 cycles for 100 * paddq unaligned

106 cycles for 100 * add+adc
270 cycles for 100 * fadd
189 cycles for 100 * paddq aligned
77 cycles for 100 * paddq unaligned

34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned

I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Quote from: HSE on November 10, 2020, 07:24:03 AM
Fantastic! Then xmm looks very slow.

Right, but it's a bit of an academic problem: when was the last time you had an innermost loop adding QWORD integers a million times? :biggrin:

@mineiro: thanks :thumbsup:

hutch--

There is considerable variation in instruction sets across different hardware, so you can expect unusual timing differences from one CPU to another. Long ago I remember an AMD processor that did some things really fast against the current Intel CPUs but was slow on many other instructions. It's all silicon acreage that accounts for the difference. I know that Intel over the last 10 years or so has prioritised SSE and AVX over the old integer instructions.

daydreamer

Quote from: hutch-- on November 10, 2020, 11:47:41 AM
There is considerable variation in instruction sets across different hardware, so you can expect unusual timing differences from one CPU to another. Long ago I remember an AMD processor that did some things really fast against the current Intel CPUs but was slow on many other instructions. It's all silicon acreage that accounts for the difference. I know that Intel over the last 10 years or so has prioritised SSE and AVX over the old integer instructions.
Well, the usual way you choose a CPU is to be able to run games and other programs at minimum/recommended specs. Long ago I first had an AMD Athlon with only SSE caps, one step behind Intel's SSE version, so I got an Intel because I wanted the latest SSE instructions. I ran an old legacy landscape raytracing program on both, and the Intel was a disappointment compared to the AMD, probably because the program was developed with good old FPU instructions.

One timing I want to do is BitBlt, StretchBlt and DrawIconEx, to see if there is any difference in milliseconds when running on different GPUs.