Author Topic: Adding QWORDs with reg32, paddq and the FPU  (Read 3111 times)

daydreamer

  • Member
  • *****
  • Posts: 1754
  • building nextdoor
Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #15 on: November 10, 2020, 05:25:57 AM »
Code:
This program was assembled with ml64 in 64-bit format.
rax is GetTickCount difference
---
reg add rax     109
FPU add rax     266
xmm add rax     140
---
reg add rax     110
FPU add rax     265
xmm add rax     141
---
reg add rax     109
FPU add rax     250
xmm add rax     141
---
reg add rax     109
FPU add rax     266
xmm add rax     141

What about negative versions? One in particular to include: NEG vs SUB vs FCHS.
SIMD fan and macro fan
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."

HSE

  • Member
  • *****
  • Posts: 1766
  • <AMD>< 7-32>
« Reply #16 on: November 10, 2020, 05:34:07 AM »
Very interesting!

It's not a WOW64 effect :thumbsup:


Correcting results with the other test:

fact    reg 172   fpu 280   xmm 218

Corrected to same:

reg     reg 172   fpu 258   xmm  95
fpu     reg 187   fpu 280   xmm 104
xmm     reg 391   fpu 587   xmm 218
   

jj2007

  • Member
  • *****
  • Posts: 11586
  • Assembler is fun ;-)
    • MasmBasic
« Reply #17 on: November 10, 2020, 05:45:06 AM »
Quote from: daydreamer
What about negative versions? One in particular to include: NEG vs SUB vs FCHS.

Go ahead, don't be shy!

jj2007

« Reply #18 on: November 10, 2020, 05:46:08 AM »
Quote from: HSE
Very interesting!

It's not a WOW64 effect :thumbsup:


Correcting results with the other test:

fact    reg 172   fpu 280   xmm 218

Corrected to same:

reg     reg 172   fpu 258   xmm  95
fpu     reg 187   fpu 280   xmm 104
xmm     reg 391   fpu 587   xmm 218
 

I don't understand what you have done, sorry :cool:

HSE

« Reply #19 on: November 10, 2020, 05:55:57 AM »
It's an estimate that assumes proportionality with your first test (reg = 140, fpu = 210, xmm = 78).

My preliminary conclusion is that the 32-bit and 64-bit tests are not directly comparable.
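
For readers following along, the proportional correction can be sketched roughly like this. It is only an illustration of the arithmetic, assuming the "fact" row holds the timings from the other test and that each "corrected" row rescales the first test so that one column matches; the names `BASELINE`, `MEASURED` and `scale_to` are made up for this sketch.

```python
# Timings quoted in the thread for the first test (assumed baseline).
BASELINE = {"reg": 140, "fpu": 210, "xmm": 78}
# Timings from the "fact" row of the other test (assumed measurement).
MEASURED = {"reg": 172, "fpu": 280, "xmm": 218}


def scale_to(anchor, baseline=BASELINE, measured=MEASURED):
    """Rescale every baseline timing so that `anchor` equals its measured value."""
    factor = measured[anchor] / baseline[anchor]
    return {name: value * factor for name, value in baseline.items()}


if __name__ == "__main__":
    # One row per anchor column, rounded to one decimal for display.
    for anchor in BASELINE:
        row = scale_to(anchor)
        print(anchor, {name: round(value, 1) for name, value in row.items()})
```

Under these assumptions the rows agree with the posted table to within a unit of rounding (the reg-anchored xmm value comes out at about 95.8).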

jj2007

« Reply #20 on: November 10, 2020, 06:57:38 AM »
Quote from: HSE
My preliminary conclusion is that the 32-bit and 64-bit tests are not directly comparable.

I still don't understand what you've done, sorry. Probably, I'm a bit dumb today. Anyway, as regards the "normal" add instruction, it's obvious that the 64-bit version is much faster: in 32-bit, you need two instructions, add and adc, to add two qwords.
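
To illustrate the add+adc point, here is a minimal sketch that emulates the two-step 32-bit addition in Python. It is not the benchmark code from this thread; `add_qword_32bit`, `split` and `join` are hypothetical helpers that only model what the instruction pair does with the carry flag.

```python
MASK32 = 0xFFFFFFFF


def split(q):
    """Split a 64-bit value into (low dword, high dword)."""
    return q & MASK32, (q >> 32) & MASK32


def join(lo, hi):
    """Recombine low and high dwords into one 64-bit value."""
    return (hi << 32) | lo


def add_qword_32bit(a_lo, a_hi, b_lo, b_hi):
    """Add two qwords held as 32-bit halves, mimicking add followed by adc."""
    lo_sum = a_lo + b_lo                  # add: low dwords
    carry = lo_sum >> 32                  # carry out of the low add (0 or 1)
    lo = lo_sum & MASK32
    hi = (a_hi + b_hi + carry) & MASK32   # adc: high dwords plus carry
    return lo, hi


if __name__ == "__main__":
    a, b = 0x00000001FFFFFFFF, 1
    print(hex(join(*add_qword_32bit(*split(a), *split(b)))))
```

In 64-bit code the same result comes from a single add on a 64-bit register, which is why the plain integer version needs less work there.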

HSE

« Reply #21 on: November 10, 2020, 07:24:03 AM »
Quote from: jj2007
I still don't understand what you've done, sorry. Probably, I'm a bit dumb today.

:biggrin: Only the correction I made is very dumb: it's just a proportion.

Quote from: jj2007
Anyway, as regards the "normal" add instruction, it's obvious that the 64-bit version is much faster: in 32-bit, you need two instructions, add and adc, to add two qwords.

Fantastic! Then xmm looks very slow.

mineiro

  • Member
  • ****
  • Posts: 750
« Reply #22 on: November 10, 2020, 09:50:29 AM »
Code:
$ wine AddingQwords.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

96 cycles for 100 * add+adc
253 cycles for 100 * fadd
191 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

99 cycles for 100 * add+adc
253 cycles for 100 * fadd
196 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

106 cycles for 100 * add+adc
254 cycles for 100 * fadd
192 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

94 cycles for 100 * add+adc
254 cycles for 100 * fadd
185 cycles for 100 * paddq aligned
78 cycles for 100 * paddq unaligned

106 cycles for 100 * add+adc
270 cycles for 100 * fadd
189 cycles for 100 * paddq aligned
77 cycles for 100 * paddq unaligned

34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned

I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

« Reply #23 on: November 10, 2020, 11:30:14 AM »
Quote from: HSE
Fantastic! Then xmm looks very slow.

Right, but it's a bit of an academic problem: when was the last time you had an innermost loop adding QWORD integers a million times? :biggrin:

@mineiro: thanks :thumbsup:

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 8544
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
« Reply #24 on: November 10, 2020, 11:47:41 AM »
There is considerable variation in instruction sets across different hardware, so you can expect unusual timing differences from one CPU to another. Long ago I remember an AMD processor that did some things really fast against the Intel CPUs of the time but was slow on many other instructions. It's all silicon acreage that accounts for the difference. I know that Intel over the last 10 years or so have prioritised SSE and AVX over the old integer instructions.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

daydreamer

« Reply #25 on: November 15, 2020, 09:14:17 AM »
Quote from: hutch--
There is considerable variation in instruction sets across different hardware, so you can expect unusual timing differences from one CPU to another. Long ago I remember an AMD processor that did some things really fast against the Intel CPUs of the time but was slow on many other instructions. It's all silicon acreage that accounts for the difference. I know that Intel over the last 10 years or so have prioritised SSE and AVX over the old integer instructions.

Well, the usual way you choose a CPU is to be able to run games and other programs at minimum/recommended specs. Long ago I first had an AMD Athlon with only SSE capabilities, one step behind Intel's SSE version, so I got an Intel because I wanted the latest SSE instructions. I ran an old legacy landscape raytracing program on both, and the Intel was a disappointment compared to the AMD, probably because the program was developed with good old FPU instructions.

One timing I want to do is BitBlt, StretchBlt and DrawIconEx, to see if there is any difference in milliseconds when running on different GPUs.