Adding QWORDs with reg32, paddq and the FPU

Started by jj2007, November 09, 2020, 09:38:42 AM

jj2007

Can I have some timings, please?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

144     cycles for 100 * add+adc
209     cycles for 100 * fadd
76      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

141     cycles for 100 * add+adc
215     cycles for 100 * fadd
83      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

139     cycles for 100 * add+adc
214     cycles for 100 * fadd
81      cycles for 100 * paddq aligned
145     cycles for 100 * paddq unaligned

138     cycles for 100 * add+adc
209     cycles for 100 * fadd
77      cycles for 100 * paddq aligned
145     cycles for 100 * paddq unaligned

143     cycles for 100 * add+adc
210     cycles for 100 * fadd
76      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned
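
The test source is not reproduced in this thread, so here is a minimal sketch of the three techniques being timed. It is not the benchmark code: the variable names (qA, qB, qSum1 to qSum3) and the operand values are made up for illustration, it assumes the MASM32 SDK is installed in \masm32, and it needs an assembler that accepts SSE2 instructions (UAsm or a recent ML should do; older ML versions may reject movq and paddq on xmm registers).

```asm
include \masm32\include\masm32rt.inc    ; assumption: MASM32 SDK in \masm32
.686
.xmm

.data
qA      dq 00000001FFFFFFFFh    ; hypothetical operand, low dword all ones to force a carry
qB      dq 0000000000000001h    ; hypothetical operand
qSum1   dq 0                    ; result of add+adc
qSum2   dq 0                    ; result of the FPU version
qSum3   dq 0                    ; result of paddq

.code
start:
    ; 1) reg32: add the low dwords, then add the high dwords plus the carry
    mov eax, dword ptr qA
    mov edx, dword ptr qA+4
    add eax, dword ptr qB
    adc edx, dword ptr qB+4
    mov dword ptr qSum1, eax
    mov dword ptr qSum1+4, edx

    ; 2) FPU: load both QWORDs as integers, add, store back as integer
    finit                       ; selects 64-bit mantissa precision, so 64-bit integers stay exact
    fild qA
    fild qB
    fadd                        ; without operands MASM encodes this as faddp st(1), st
    fistp qSum2                 ; pops the sum, leaving the FPU stack empty

    ; 3) SSE2: one 64-bit integer add in the low half of an xmm register
    movq xmm0, qA
    movq xmm1, qB
    paddq xmm0, xmm1
    movq qSum3, xmm0

    ; all three variants leave 0000000200000000h in their qSum variable
    exit
end start
```

Build it as a console application and inspect qSum1, qSum2 and qSum3 in a debugger. The sketch only shows the arithmetic; the aligned versus unaligned paddq distinction in the timings depends on how the operands are loaded from memory, which is not shown here.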

HSE

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
389     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
116     cycles for 100 * fadd
389     cycles for 100 * paddq aligned
410     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
387     cycles for 100 * paddq aligned
410     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
387     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

195     cycles for 100 * add+adc
116     cycles for 100 * fadd
390     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned
Equations in Assembly: SmplMath

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

161     cycles for 100 * add+adc
251     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

164     cycles for 100 * add+adc
250     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

162     cycles for 100 * add+adc
249     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
165     cycles for 100 * paddq unaligned

164     cycles for 100 * add+adc
250     cycles for 100 * fadd
91      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

163     cycles for 100 * add+adc
250     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
165     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned


--- ok ---
Creative coders use backward thinking techniques as a strategy.

jj2007

Very interesting, thanks! So AMD has a much faster FPU :cool:

HSE

Hi JJ!

Siekmanski's machine is 64-bit. Perhaps there is a WOW64 problem?

jj2007

Mine is 64-bit, too, but that shouldn't be a problem. 32-bit instructions are native to the cpu, and WOW only translates API calls, not single instructions. Apparently AMD simply has the faster FPU :cool:

hutch--


Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

119     cycles for 100 * add+adc
222     cycles for 100 * fadd
51      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
227     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
223     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
58      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
225     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

116     cycles for 100 * add+adc
223     cycles for 100 * fadd
50      cycles for 100 * paddq aligned
60      cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned


--- ok ---

TimoVJL

AMD Athlon(tm) II X2 220 Processor (SSE3)

169     cycles for 100 * add+adc
109     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
106     cycles for 100 * fadd
401     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
105     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
432     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
106     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

170     cycles for 100 * add+adc
109     cycles for 100 * fadd
404     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

141     cycles for 100 * add+adc
129     cycles for 100 * fadd
213     cycles for 100 * paddq aligned
216     cycles for 100 * paddq unaligned

142     cycles for 100 * add+adc
127     cycles for 100 * fadd
214     cycles for 100 * paddq aligned
216     cycles for 100 * paddq unaligned

140     cycles for 100 * add+adc
127     cycles for 100 * fadd
214     cycles for 100 * paddq aligned
214     cycles for 100 * paddq unaligned

138     cycles for 100 * add+adc
126     cycles for 100 * fadd
218     cycles for 100 * paddq aligned
215     cycles for 100 * paddq unaligned

141     cycles for 100 * add+adc
125     cycles for 100 * fadd
218     cycles for 100 * paddq aligned
214     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned
May the source be with you

HSE

Quote from: jj2007 on November 09, 2020, 11:57:28 PM
... and WOW only translates API calls, not single instructions.

Quote from: https://techreport.com/review/8131/64-bit-computing-in-theory-and-practice/
MMX, 3DNow!, and the x87 FPU are all supported fully in 32-bit compatibility mode in WOW64, but not for 64-bit apps.

Sounds like FPU instructions are not considered single instructions. I don't know.

jj2007

That's techreport fake news - and Timo just confirmed that AMD has the faster FPU :mrgreen:

include \Masm32\MasmBasic\Res\JBasic.inc        ; ## console demo, builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
usedeb=1                                ; use the deb macro
.code
Init           ; OPT_64 1      ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
  fldpi                                 ; the FPU works fine in x64
  fld1
  fadd
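  sub rsp, QWORD                        ; reserve 8 bytes on the stack
  fld st                                ; duplicate ST(0) so that fistp can pop the copy
  fistp qword ptr [rsp]                 ; store pi+1 rounded to a 64-bit integer
  movlps xmm0, qword ptr [rsp]
  fst qword ptr [rsp]                   ; store pi+1 as a REAL8, keeping ST(0)
  movlps xmm1, qword ptr [rsp]
  add rsp, QWORD                        ; release the stack slot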
  deb 4, "The FPU works just fine in 64-bit:", xmm0, f:xmm1, ST(0)
  MsgBox 0, "q.e.d.", "Hi", MB_OK
EndOfCode



HSE

I think the way to find out is to test the addition in 32-bit and 64-bit binaries. If FPU instructions are handled by WOW64, we should see a difference.

jj2007

Quote from: HSE on November 10, 2020, 04:06:47 AM
I think the way to find out is to test the addition in 32-bit and 64-bit binaries. If FPU instructions are handled by WOW64, we should see a difference.

WOW64 doesn't handle instructions, it passes calls to 32-bit APIs to their 64-bit equivalents, and returns the results as if you had called the 32-bit API.

HSE

 :biggrin: The article says something different.

I will wait for the test results  :thumbsup:

daydreamer

@Jochen
AMD's FPUs were faster before, too. I had an Athlon, and in the other forum AMD FPU code was also faster (including D3D code using the FPU). My next CPU was an Intel Core Duo because of my interest in SSE; AMD was one step behind Intel in SSE generations.


Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

82      cycles for 100 * add+adc
239     cycles for 100 * fadd
62      cycles for 100 * paddq aligned
67      cycles for 100 * paddq unaligned

81      cycles for 100 * add+adc
242     cycles for 100 * fadd
63      cycles for 100 * paddq aligned
71      cycles for 100 * paddq unaligned

85      cycles for 100 * add+adc
242     cycles for 100 * fadd
61      cycles for 100 * paddq aligned
70      cycles for 100 * paddq unaligned

83      cycles for 100 * add+adc
241     cycles for 100 * fadd
65      cycles for 100 * paddq aligned
73      cycles for 100 * paddq unaligned

85      cycles for 100 * add+adc
246     cycles for 100 * fadd
62      cycles for 100 * paddq aligned
72      cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned


-
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Thanks, daydreamer. Here is the 64-bit version, using GetTickCount instead of NanoTimer():

This program was assembled with ml64 in 64-bit format.
rax is GetTickCount difference
---
reg add rax     172
FPU add rax     281
xmm add rax     218
---
reg add rax     172
FPU add rax     281
xmm add rax     218
---
reg add rax     172
FPU add rax     280
xmm add rax     219
---
reg add rax     171
FPU add rax     281
xmm add rax     219