Topic: Adding QWORDs with reg32, paddq and the FPU

jj2007

  • Member
  • *****
  • Posts: 11546
  • Assembler is fun ;-)
    • MasmBasic
Adding QWORDs with reg32, paddq and the FPU
« on: November 09, 2020, 09:38:42 AM »
Can I have some timings, please?
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

144     cycles for 100 * add+adc
209     cycles for 100 * fadd
76      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

141     cycles for 100 * add+adc
215     cycles for 100 * fadd
83      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

139     cycles for 100 * add+adc
214     cycles for 100 * fadd
81      cycles for 100 * paddq aligned
145     cycles for 100 * paddq unaligned

138     cycles for 100 * add+adc
209     cycles for 100 * fadd
77      cycles for 100 * paddq aligned
145     cycles for 100 * paddq unaligned

143     cycles for 100 * add+adc
210     cycles for 100 * fadd
76      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned

HSE

  • Member
  • *****
  • Posts: 1740
  • <AMD>< 7-32>
Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #1 on: November 09, 2020, 10:36:16 AM »
Code: [Select]
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
389     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
116     cycles for 100 * fadd
389     cycles for 100 * paddq aligned
410     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
387     cycles for 100 * paddq aligned
410     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
387     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

195     cycles for 100 * add+adc
116     cycles for 100 * fadd
390     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned

Siekmanski

  • Member
  • *****
  • Posts: 2365
Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #2 on: November 09, 2020, 11:00:10 AM »
Code: [Select]
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

161     cycles for 100 * add+adc
251     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

164     cycles for 100 * add+adc
250     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

162     cycles for 100 * add+adc
249     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
165     cycles for 100 * paddq unaligned

164     cycles for 100 * add+adc
250     cycles for 100 * fadd
91      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

163     cycles for 100 * add+adc
250     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
165     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned


--- ok ---

jj2007

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #3 on: November 09, 2020, 10:10:03 PM »
Very interesting, thanks! So AMD has a much faster FPU :cool:

HSE

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #4 on: November 09, 2020, 11:30:10 PM »
Hi JJ!

Siekmanski's machine is 64-bit. Perhaps there is a WOW64 problem?

jj2007

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #5 on: November 09, 2020, 11:57:28 PM »
Mine is 64-bit, too, but that shouldn't be a problem. 32-bit instructions are native to the CPU, and WOW only translates API calls, not single instructions. Apparently AMD simply has the faster FPU :cool:

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 8479
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #6 on: November 10, 2020, 12:49:53 AM »

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

119     cycles for 100 * add+adc
222     cycles for 100 * fadd
51      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
227     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
223     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
58      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
225     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

116     cycles for 100 * add+adc
223     cycles for 100 * fadd
50      cycles for 100 * paddq aligned
60      cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned


--- ok ---

TimoVJL

  • Member
  • ****
  • Posts: 723
Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #7 on: November 10, 2020, 01:03:55 AM »
Code: [Select]
AMD Athlon(tm) II X2 220 Processor (SSE3)

169     cycles for 100 * add+adc
109     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
106     cycles for 100 * fadd
401     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
105     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
432     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
106     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

170     cycles for 100 * add+adc
109     cycles for 100 * fadd
404     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned
Code: [Select]
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

141     cycles for 100 * add+adc
129     cycles for 100 * fadd
213     cycles for 100 * paddq aligned
216     cycles for 100 * paddq unaligned

142     cycles for 100 * add+adc
127     cycles for 100 * fadd
214     cycles for 100 * paddq aligned
216     cycles for 100 * paddq unaligned

140     cycles for 100 * add+adc
127     cycles for 100 * fadd
214     cycles for 100 * paddq aligned
214     cycles for 100 * paddq unaligned

138     cycles for 100 * add+adc
126     cycles for 100 * fadd
218     cycles for 100 * paddq aligned
215     cycles for 100 * paddq unaligned

141     cycles for 100 * add+adc
125     cycles for 100 * fadd
218     cycles for 100 * paddq aligned
214     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned

HSE

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #8 on: November 10, 2020, 01:10:01 AM »
Quote from: jj2007 on November 09, 2020, 11:57:28 PM
... and WOW only translates API calls, not single instructions.

Quote
MMX, 3DNow!, and the x87 FPU are all supported fully in 32-bit compatibility mode in WOW64, but not for 64-bit apps.

Sounds like FPU instructions are not considered single instructions. I don't know.

jj2007

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #9 on: November 10, 2020, 03:54:25 AM »
That's techreport fake news - and Timo just confirmed that AMD has the faster FPU :mrgreen:

Code: [Select]
include \Masm32\MasmBasic\Res\JBasic.inc        ; ## console demo, builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
usedeb=1                                ; use the deb macro
.code
Init           ; OPT_64 1      ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
  fldpi                                 ; the FPU works fine in x64
  fld1
  fadd                                  ; ST(0) = pi + 1
  sub rsp, QWORD                        ; reserve 8 bytes of scratch space on the stack
  fld st                                ; duplicate ST(0)
  fistp qword ptr [rsp]                 ; store the copy as a 64-bit integer and pop it
  movlps xmm0, qword ptr [rsp]          ; xmm0 = integer result
  fst qword ptr [rsp]                   ; store ST(0) as a REAL8
  movlps xmm1, qword ptr [rsp]          ; xmm1 = floating-point result
  add rsp, QWORD                        ; release the scratch space
  deb 4, "The FPU works just fine in 64-bit:", xmm0, f:xmm1, ST(0)
  MsgBox 0, "q.e.d.", "Hi", MB_OK
EndOfCode



HSE

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #10 on: November 10, 2020, 04:06:47 AM »
I think the way is to test the addition in 32- and 64-bit binaries. If FPU instructions are handled by WOW64, we can see a difference.

jj2007

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #11 on: November 10, 2020, 04:18:10 AM »
Quote from: HSE on November 10, 2020, 04:06:47 AM
I think the way is to test the addition in 32- and 64-bit binaries. If FPU instructions are handled by WOW64, we can see a difference.

WOW64 doesn't handle instructions; it passes calls to 32-bit APIs on to their 64-bit equivalents and returns the results as if you had called the 32-bit API.

HSE

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #12 on: November 10, 2020, 04:24:25 AM »
 :biggrin: The article says something different.

I will wait for the test results  :thumbsup:

daydreamer

  • Member
  • *****
  • Posts: 1717
  • building nextdoor
Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #13 on: November 10, 2020, 04:32:07 AM »
@Jochen
AMD's FPUs were faster before, too. I had an Athlon, and in the other forum AMD FPU code was also faster (D3D code uses the FPU as well). My next CPU was an Intel Core Duo because of my interest in SSE; AMD was one step behind Intel in SSE generations.


Code: [Select]
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

82      cycles for 100 * add+adc
239     cycles for 100 * fadd
62      cycles for 100 * paddq aligned
67      cycles for 100 * paddq unaligned

81      cycles for 100 * add+adc
242     cycles for 100 * fadd
63      cycles for 100 * paddq aligned
71      cycles for 100 * paddq unaligned

85      cycles for 100 * add+adc
242     cycles for 100 * fadd
61      cycles for 100 * paddq aligned
70      cycles for 100 * paddq unaligned

83      cycles for 100 * add+adc
241     cycles for 100 * fadd
65      cycles for 100 * paddq aligned
73      cycles for 100 * paddq unaligned

85      cycles for 100 * add+adc
246     cycles for 100 * fadd
62      cycles for 100 * paddq aligned
72      cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned



jj2007

Re: Adding QWORDs with reg32, paddq and the FPU
« Reply #14 on: November 10, 2020, 04:57:23 AM »
Thanks, daydreamer. Here is the 64-bit version, using GetTickCount instead of NanoTimer():

Code: [Select]
This program was assembled with ml64 in 64-bit format.
rax is GetTickCount difference
---
reg add rax     172
FPU add rax     281
xmm add rax     218
---
reg add rax     172
FPU add rax     281
xmm add rax     218
---
reg add rax     172
FPU add rax     280
xmm add rax     219
---
reg add rax     171
FPU add rax     281
xmm add rax     219