The MASM Forum

General => The Laboratory => Topic started by: jj2007 on November 09, 2020, 09:38:42 AM

Title: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 09, 2020, 09:38:42 AM
Can I have some timings, please?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

144     cycles for 100 * add+adc
209     cycles for 100 * fadd
76      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

141     cycles for 100 * add+adc
215     cycles for 100 * fadd
83      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

139     cycles for 100 * add+adc
214     cycles for 100 * fadd
81      cycles for 100 * paddq aligned
145     cycles for 100 * paddq unaligned

138     cycles for 100 * add+adc
209     cycles for 100 * fadd
77      cycles for 100 * paddq aligned
145     cycles for 100 * paddq unaligned

143     cycles for 100 * add+adc
210     cycles for 100 * fadd
76      cycles for 100 * paddq aligned
140     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: HSE on November 09, 2020, 10:36:16 AM
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
389     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
116     cycles for 100 * fadd
389     cycles for 100 * paddq aligned
410     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
387     cycles for 100 * paddq aligned
410     cycles for 100 * paddq unaligned

196     cycles for 100 * add+adc
115     cycles for 100 * fadd
387     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

195     cycles for 100 * add+adc
116     cycles for 100 * fadd
390     cycles for 100 * paddq aligned
411     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: Siekmanski on November 09, 2020, 11:00:10 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

161     cycles for 100 * add+adc
251     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

164     cycles for 100 * add+adc
250     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

162     cycles for 100 * add+adc
249     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
165     cycles for 100 * paddq unaligned

164     cycles for 100 * add+adc
250     cycles for 100 * fadd
91      cycles for 100 * paddq aligned
166     cycles for 100 * paddq unaligned

163     cycles for 100 * add+adc
250     cycles for 100 * fadd
93      cycles for 100 * paddq aligned
165     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned


--- ok ---
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 09, 2020, 10:10:03 PM
Very interesting, thanks! So AMD has a much faster FPU :cool:
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: HSE on November 09, 2020, 11:30:10 PM
Hi JJ!

Siekmanski's machine is 64-bit. Perhaps there is a WOW problem?
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 09, 2020, 11:57:28 PM
Mine is 64-bit, too, but that shouldn't be a problem. 32-bit instructions are native to the cpu, and WOW only translates API calls, not single instructions. Apparently AMD simply has the faster FPU :cool:
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: hutch-- on November 10, 2020, 12:49:53 AM

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

119     cycles for 100 * add+adc
222     cycles for 100 * fadd
51      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
227     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
223     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
58      cycles for 100 * paddq unaligned

117     cycles for 100 * add+adc
225     cycles for 100 * fadd
53      cycles for 100 * paddq aligned
59      cycles for 100 * paddq unaligned

116     cycles for 100 * add+adc
223     cycles for 100 * fadd
50      cycles for 100 * paddq aligned
60      cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned


--- ok ---
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: TimoVJL on November 10, 2020, 01:03:55 AM
AMD Athlon(tm) II X2 220 Processor (SSE3)

169     cycles for 100 * add+adc
109     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
106     cycles for 100 * fadd
401     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
105     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
432     cycles for 100 * paddq unaligned

167     cycles for 100 * add+adc
106     cycles for 100 * fadd
402     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

170     cycles for 100 * add+adc
109     cycles for 100 * fadd
404     cycles for 100 * paddq aligned
429     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned
AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

141     cycles for 100 * add+adc
129     cycles for 100 * fadd
213     cycles for 100 * paddq aligned
216     cycles for 100 * paddq unaligned

142     cycles for 100 * add+adc
127     cycles for 100 * fadd
214     cycles for 100 * paddq aligned
216     cycles for 100 * paddq unaligned

140     cycles for 100 * add+adc
127     cycles for 100 * fadd
214     cycles for 100 * paddq aligned
214     cycles for 100 * paddq unaligned

138     cycles for 100 * add+adc
126     cycles for 100 * fadd
218     cycles for 100 * paddq aligned
215     cycles for 100 * paddq unaligned

141     cycles for 100 * add+adc
125     cycles for 100 * fadd
218     cycles for 100 * paddq aligned
214     cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: HSE on November 10, 2020, 01:10:01 AM
Quote from: jj2007 on November 09, 2020, 11:57:28 PM
, and WOW only translates API calls, not single instructions.

Quote from: https://techreport.com/review/8131/64-bit-computing-in-theory-and-practice/
MMX, 3DNow!, and the x87 FPU are all supported fully in 32-bit compatibility mode in WOW64, but not for 64-bit apps.

Sounds like FPU instructions are not considered single instructions. I don't know.
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 10, 2020, 03:54:25 AM
That's techreport fake news - and Timo just confirmed that AMD has the faster FPU :mrgreen:

include \Masm32\MasmBasic\Res\JBasic.inc        ; ## console demo, builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
usedeb=1                                ; use the deb macro
.code
Init           ; OPT_64 1      ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
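  fldpi                                 ; the FPU works fine in x64
  fld1
  fadd                                  ; ST(0) = pi + 1
  sub rsp, QWORD                        ; reserve 8 bytes on the stack
  fld st                                ; duplicate ST(0), because fistp pops
  fistp qword ptr [rsp]                 ; store the rounded 64-bit integer
  movlps xmm0, qword ptr [rsp]
  fst qword ptr [rsp]                   ; store ST(0) as a REAL8, no pop
  movlps xmm1, qword ptr [rsp]
  add rsp, QWORD                        ; release the stack slot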
  deb 4, "The FPU works just fine in 64-bit:", xmm0, f:xmm1, ST(0)
  MsgBox 0, "q.e.d.", "Hi", MB_OK
EndOfCode


Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: HSE on November 10, 2020, 04:06:47 AM
I think the way is to test the addition in 32- and 64-bit binaries. If FPU instructions are handled by WOW64, we should see a difference.
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 10, 2020, 04:18:10 AM
Quote from: HSE on November 10, 2020, 04:06:47 AM
I think the way is to test the addition in 32- and 64-bit binaries. If FPU instructions are handled by WOW64, we should see a difference.

WOW64 doesn't handle instructions, it passes calls to 32-bit APIs to their 64-bit equivalents, and returns the results as if you had called the 32-bit API.
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: HSE on November 10, 2020, 04:24:25 AM
 :biggrin: The article says something different.

I will wait for the test results  :thumbsup:
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: daydreamer on November 10, 2020, 04:32:07 AM
@Jochen
AMD's FPUs were faster before, too. I had an Athlon, and in the other forum AMD FPU code was also faster (D3D code uses the FPU as well). My next CPU was an Intel Core Duo because of my interest in SSE; AMD was one step behind Intel in SSE generation.


Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)

82      cycles for 100 * add+adc
239     cycles for 100 * fadd
62      cycles for 100 * paddq aligned
67      cycles for 100 * paddq unaligned

81      cycles for 100 * add+adc
242     cycles for 100 * fadd
63      cycles for 100 * paddq aligned
71      cycles for 100 * paddq unaligned

85      cycles for 100 * add+adc
242     cycles for 100 * fadd
61      cycles for 100 * paddq aligned
70      cycles for 100 * paddq unaligned

83      cycles for 100 * add+adc
241     cycles for 100 * fadd
65      cycles for 100 * paddq aligned
73      cycles for 100 * paddq unaligned

85      cycles for 100 * add+adc
246     cycles for 100 * fadd
62      cycles for 100 * paddq aligned
72      cycles for 100 * paddq unaligned

34      bytes for add+adc
20      bytes for fadd
22      bytes for paddq aligned
25      bytes for paddq unaligned


Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 10, 2020, 04:57:23 AM
Thanks, daydreamer. Here is the 64-bit version, using GetTickCount instead of NanoTimer() (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1171):

This program was assembled with ml64 in 64-bit format.
rax is GetTickCount difference
---
reg add rax     172
FPU add rax     281
xmm add rax     218
---
reg add rax     172
FPU add rax     281
xmm add rax     218
---
reg add rax     172
FPU add rax     280
xmm add rax     219
---
reg add rax     171
FPU add rax     281
xmm add rax     219
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: daydreamer on November 10, 2020, 05:25:57 AM
This program was assembled with ml64 in 64-bit format.
rax is GetTickCount difference
---
reg add rax     109
FPU add rax     266
xmm add rax     140
---
reg add rax     110
FPU add rax     265
xmm add rax     141
---
reg add rax     109
FPU add rax     250
xmm add rax     141
---
reg add rax     109
FPU add rax     266
xmm add rax     141


What about negative versions? In particular, one that compares NEG vs SUB vs fchs.
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: HSE on November 10, 2020, 05:34:07 AM
Very interesting!

It's not a wow64 effect :thumbsup:


Correcting results with the other test:

measured:           reg 172   fpu 280   xmm 218

corrected to the same scale (the row label names the value held fixed):

reg                 reg 172   fpu 258   xmm  95
fpu                 reg 187   fpu 280   xmm 104
xmm                 reg 391   fpu 587   xmm 218
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 10, 2020, 05:45:06 AM
Quote from: daydreamer on November 10, 2020, 05:25:57 AM
What about negative versions? In particular, one that compares NEG vs SUB vs fchs.

Go ahead, don't be shy!
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 10, 2020, 05:46:08 AM
Quote from: HSE on November 10, 2020, 05:34:07 AM
Very interesting!

It's not a wow64 effect :thumbsup:


Correcting results with the other test:

measured:           reg 172   fpu 280   xmm 218

corrected to the same scale (the row label names the value held fixed):

reg                 reg 172   fpu 258   xmm  95
fpu                 reg 187   fpu 280   xmm 104
xmm                 reg 391   fpu 587   xmm 218


I don't understand what you have done, sorry :cool:
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: HSE on November 10, 2020, 05:55:57 AM
It's an estimate assuming proportionality with your first test (reg = 140, fpu = 210, xmm = 78).

The preliminary suggestion is that the 32-bit and 64-bit tests are not directly comparable.
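
For readers who want to reproduce the proportion, here is a minimal Python sketch of what the correction appears to do. It is an illustration only, not HSE's actual method: the helper name scale_from is made up, and the reference values are simply the rounded numbers quoted in the posts. Each row keeps one 64-bit value fixed and scales the other two by the ratios of the first 32-bit test.

```python
# Reference timings copied from the posts above.
first_test = {"reg": 140, "fpu": 210, "xmm": 78}    # 32-bit run, cycles per 100 additions
measured64 = {"reg": 172, "fpu": 280, "xmm": 218}   # 64-bit run, GetTickCount differences

def scale_from(anchor):
    """Scale the 32-bit ratios so that `anchor` matches its 64-bit value.

    Hypothetical helper for illustration; rounds to the nearest integer.
    """
    return {
        name: round(value * measured64[anchor] / first_test[anchor])
        for name, value in first_test.items()
    }

if __name__ == "__main__":
    for anchor in first_test:
        print(anchor, scale_from(anchor))
```

Rounding to the nearest integer gives xmm 96 in the reg row where the post shows 95, so the posted table was probably truncated in that one place; all other cells match.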
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 10, 2020, 06:57:38 AM
Quote from: HSE on November 10, 2020, 05:55:57 AM
The preliminary suggestion is that the 32-bit and 64-bit tests are not directly comparable.

I still don't understand what you've done, sorry. Probably, I'm a bit dumb today. Anyway, as regards the "normal" add instruction, it's obvious that the 64-bit version is much faster: in 32-bit, you need two instructions, add and adc, to add the qwords together.
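
To make the add/adc point concrete, here is a small Python model of the two approaches. It is only a sketch of the arithmetic, not the benchmark code from this thread, and the function names are invented for the example: the 32-bit version adds the low dwords, takes the carry, and feeds it into the addition of the high dwords, while the 64-bit version is a single wrapping add.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def add_qword_32bit(a, b):
    """Add two unsigned 64-bit values using only 32-bit halves (add + adc)."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo = a_lo + b_lo              # add: low dwords
    carry = lo >> 32              # carry flag produced by the low add (0 or 1)
    hi = a_hi + b_hi + carry      # adc: high dwords plus the carry
    return ((hi & MASK32) << 32) | (lo & MASK32)   # wraps at 64 bits

def add_qword_64bit(a, b):
    """Model of a single 64-bit register add, wrapping at 64 bits."""
    return (a + b) & MASK64

if __name__ == "__main__":
    x, y = 0x00000001FFFFFFFF, 0x0000000000000001
    print(hex(add_qword_32bit(x, y)), hex(add_qword_64bit(x, y)))
```

Both functions return the same result for any pair of 64-bit inputs; the difference on real hardware is that the 32-bit path needs two dependent instructions per addition.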
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: HSE on November 10, 2020, 07:24:03 AM
Quote from: jj2007 on November 10, 2020, 06:57:38 AM
I still don't understand what you've done, sorry. Probably, I'm a bit dumb today.
:biggrin: Only the correction I made is very dumb, just a proportion.

Quote from: jj2007 on November 10, 2020, 06:57:38 AM
Anyway, as regards the "normal" add instruction, it's obvious that the 64-bit version is much faster: in 32-bit, you need two instructions, add and adc, to add the qwords together.
Fantastic!  Then xmm looks very slow.
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: mineiro on November 10, 2020, 09:50:29 AM
$ wine AddingQwords.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)

96 cycles for 100 * add+adc
253 cycles for 100 * fadd
191 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

99 cycles for 100 * add+adc
253 cycles for 100 * fadd
196 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

106 cycles for 100 * add+adc
254 cycles for 100 * fadd
192 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned

94 cycles for 100 * add+adc
254 cycles for 100 * fadd
185 cycles for 100 * paddq aligned
78 cycles for 100 * paddq unaligned

106 cycles for 100 * add+adc
270 cycles for 100 * fadd
189 cycles for 100 * paddq aligned
77 cycles for 100 * paddq unaligned

34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned

Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: jj2007 on November 10, 2020, 11:30:14 AM
Quote from: HSE on November 10, 2020, 07:24:03 AM
Fantastic!  Then xmm looks very slow.

Right, but it's a bit of an academic problem: when was the last time you had an innermost loop adding QWORD integers a million times? :biggrin:

@mineiro: thanks :thumbsup:
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: hutch-- on November 10, 2020, 11:47:41 AM
There is considerable variation in instruction sets across different hardware, so you can expect unusual timing differences from one CPU to another. Long ago I remember an AMD processor that did some things really fast against the current Intel CPUs but was slow on many other instructions. It's all silicon acreage that accounts for the difference. I know that Intel over the last 10 years or so has prioritised SSE and AVX over the old integer instructions.
Title: Re: Adding QWORDs with reg32, paddq and the FPU
Post by: daydreamer on November 15, 2020, 09:14:17 AM
Quote from: hutch-- on November 10, 2020, 11:47:41 AM
There is considerable variation in instruction sets across different hardware, so you can expect unusual timing differences from one CPU to another. Long ago I remember an AMD processor that did some things really fast against the current Intel CPUs but was slow on many other instructions. It's all silicon acreage that accounts for the difference. I know that Intel over the last 10 years or so has prioritised SSE and AVX over the old integer instructions.
Well, the usual way to choose a CPU is to make sure it can run games and other programs at their minimum/recommended specs. Long ago I first had an AMD Athlon with only SSE capabilities, one step behind Intel's SSE version, so I got an Intel because I wanted the latest SSE instructions. I ran an old legacy landscape raytracing program on both, and the Intel was a disappointment compared to the AMD, probably because the program was developed with good old FPU instructions.

One timing I want to do is BitBlt, StretchBlt and DrawIconEx, to see whether there is any difference in milliseconds when running on different GPUs.