Can I have some timings, please?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
144 cycles for 100 * add+adc
209 cycles for 100 * fadd
76 cycles for 100 * paddq aligned
140 cycles for 100 * paddq unaligned
141 cycles for 100 * add+adc
215 cycles for 100 * fadd
83 cycles for 100 * paddq aligned
140 cycles for 100 * paddq unaligned
139 cycles for 100 * add+adc
214 cycles for 100 * fadd
81 cycles for 100 * paddq aligned
145 cycles for 100 * paddq unaligned
138 cycles for 100 * add+adc
209 cycles for 100 * fadd
77 cycles for 100 * paddq aligned
145 cycles for 100 * paddq unaligned
143 cycles for 100 * add+adc
210 cycles for 100 * fadd
76 cycles for 100 * paddq aligned
140 cycles for 100 * paddq unaligned
34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
196 cycles for 100 * add+adc
115 cycles for 100 * fadd
389 cycles for 100 * paddq aligned
411 cycles for 100 * paddq unaligned
196 cycles for 100 * add+adc
116 cycles for 100 * fadd
389 cycles for 100 * paddq aligned
410 cycles for 100 * paddq unaligned
196 cycles for 100 * add+adc
115 cycles for 100 * fadd
387 cycles for 100 * paddq aligned
410 cycles for 100 * paddq unaligned
196 cycles for 100 * add+adc
115 cycles for 100 * fadd
387 cycles for 100 * paddq aligned
411 cycles for 100 * paddq unaligned
195 cycles for 100 * add+adc
116 cycles for 100 * fadd
390 cycles for 100 * paddq aligned
411 cycles for 100 * paddq unaligned
34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
161 cycles for 100 * add+adc
251 cycles for 100 * fadd
93 cycles for 100 * paddq aligned
166 cycles for 100 * paddq unaligned
164 cycles for 100 * add+adc
250 cycles for 100 * fadd
93 cycles for 100 * paddq aligned
166 cycles for 100 * paddq unaligned
162 cycles for 100 * add+adc
249 cycles for 100 * fadd
93 cycles for 100 * paddq aligned
165 cycles for 100 * paddq unaligned
164 cycles for 100 * add+adc
250 cycles for 100 * fadd
91 cycles for 100 * paddq aligned
166 cycles for 100 * paddq unaligned
163 cycles for 100 * add+adc
250 cycles for 100 * fadd
93 cycles for 100 * paddq aligned
165 cycles for 100 * paddq unaligned
34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned
--- ok ---
Very interesting, thanks! So AMD has a much faster FPU :cool:
Hi JJ!
Siekmanski's machine is 64-bit. Perhaps there is a WOW64 problem?
Mine is 64-bit, too, but that shouldn't be a problem. 32-bit instructions are native to the cpu, and WOW only translates API calls, not single instructions. Apparently AMD simply has the faster FPU :cool:
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
119 cycles for 100 * add+adc
222 cycles for 100 * fadd
51 cycles for 100 * paddq aligned
59 cycles for 100 * paddq unaligned
117 cycles for 100 * add+adc
227 cycles for 100 * fadd
53 cycles for 100 * paddq aligned
59 cycles for 100 * paddq unaligned
117 cycles for 100 * add+adc
223 cycles for 100 * fadd
53 cycles for 100 * paddq aligned
58 cycles for 100 * paddq unaligned
117 cycles for 100 * add+adc
225 cycles for 100 * fadd
53 cycles for 100 * paddq aligned
59 cycles for 100 * paddq unaligned
116 cycles for 100 * add+adc
223 cycles for 100 * fadd
50 cycles for 100 * paddq aligned
60 cycles for 100 * paddq unaligned
34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned
--- ok ---
AMD Athlon(tm) II X2 220 Processor (SSE3)
169 cycles for 100 * add+adc
109 cycles for 100 * fadd
402 cycles for 100 * paddq aligned
429 cycles for 100 * paddq unaligned
167 cycles for 100 * add+adc
106 cycles for 100 * fadd
401 cycles for 100 * paddq aligned
429 cycles for 100 * paddq unaligned
167 cycles for 100 * add+adc
105 cycles for 100 * fadd
402 cycles for 100 * paddq aligned
432 cycles for 100 * paddq unaligned
167 cycles for 100 * add+adc
106 cycles for 100 * fadd
402 cycles for 100 * paddq aligned
429 cycles for 100 * paddq unaligned
170 cycles for 100 * add+adc
109 cycles for 100 * fadd
404 cycles for 100 * paddq aligned
429 cycles for 100 * paddq unaligned
34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned
AMD Ryzen 5 3400G with Radeon Vega Graphics (SSE4)
141 cycles for 100 * add+adc
129 cycles for 100 * fadd
213 cycles for 100 * paddq aligned
216 cycles for 100 * paddq unaligned
142 cycles for 100 * add+adc
127 cycles for 100 * fadd
214 cycles for 100 * paddq aligned
216 cycles for 100 * paddq unaligned
140 cycles for 100 * add+adc
127 cycles for 100 * fadd
214 cycles for 100 * paddq aligned
214 cycles for 100 * paddq unaligned
138 cycles for 100 * add+adc
126 cycles for 100 * fadd
218 cycles for 100 * paddq aligned
215 cycles for 100 * paddq unaligned
141 cycles for 100 * add+adc
125 cycles for 100 * fadd
218 cycles for 100 * paddq aligned
214 cycles for 100 * paddq unaligned
34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned
Quote from: jj2007 on November 09, 2020, 11:57:28 PM
... and WOW only translates API calls, not single instructions.
Quote from: https://techreport.com/review/8131/64-bit-computing-in-theory-and-practice/
MMX, 3DNow!, and the x87 FPU are all supported fully in 32-bit compatibility mode in WOW64, but not for 64-bit apps.
Sounds like FPU instructions are not considered single instructions. I don't know.
That's techreport fake news - and Timo just confirmed that AMD has the faster FPU :mrgreen:
include \Masm32\MasmBasic\Res\JBasic.inc ; ## console demo, builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
usedeb=1 ; use the deb macro
.code
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
fldpi ; the FPU works fine in x64
fld1
fadd
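sub rsp, QWORD ; reserve 8 bytes on the stack
fld st ; duplicate ST(0)
fistp qword ptr [rsp] ; store the copy as a 64-bit integer and pop it
movlps xmm0, qword ptr [rsp]
fst qword ptr [rsp] ; store ST(0) as a double without popping
movlps xmm1, qword ptr [rsp]
add rsp, QWORD ; release the stack slot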
deb 4, "The FPU works just fine in 64-bit:", xmm0, f:xmm1, ST(0)
MsgBox 0, "q.e.d.", "Hi", MB_OK
EndOfCode
I think the way to find out is to test the addition in 32-bit and 64-bit binaries. If FPU instructions are handled by WOW64, we should see a difference.
Quote from: HSE on November 10, 2020, 04:06:47 AM
I think the way to find out is to test the addition in 32-bit and 64-bit binaries. If FPU instructions are handled by WOW64, we should see a difference.
WOW64 doesn't handle instructions; it passes calls to 32-bit APIs to their 64-bit equivalents, and returns the results as if you had called the 32-bit API.
:biggrin: The article says something different.
I will wait for the test results :thumbsup:
@Jochen
AMD's FPUs were faster before, too. I had an Athlon, and in the other forum AMD was also faster on FPU code (D3D code uses the FPU as well). My next CPU was an Intel Core Duo because of my interest in SSE; AMD was one step behind Intel in SSE generation.
Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (SSE4)
82 cycles for 100 * add+adc
239 cycles for 100 * fadd
62 cycles for 100 * paddq aligned
67 cycles for 100 * paddq unaligned
81 cycles for 100 * add+adc
242 cycles for 100 * fadd
63 cycles for 100 * paddq aligned
71 cycles for 100 * paddq unaligned
85 cycles for 100 * add+adc
242 cycles for 100 * fadd
61 cycles for 100 * paddq aligned
70 cycles for 100 * paddq unaligned
83 cycles for 100 * add+adc
241 cycles for 100 * fadd
65 cycles for 100 * paddq aligned
73 cycles for 100 * paddq unaligned
85 cycles for 100 * add+adc
246 cycles for 100 * fadd
62 cycles for 100 * paddq aligned
72 cycles for 100 * paddq unaligned
34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned
-
Thanks, daydreamer. Here is the 64-bit version, using GetTickCount instead of NanoTimer() (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1171):
This program was assembled with ml64 in 64-bit format.
rax is GetTickCount difference
---
reg add rax 172
FPU add rax 281
xmm add rax 218
---
reg add rax 172
FPU add rax 281
xmm add rax 218
---
reg add rax 172
FPU add rax 280
xmm add rax 219
---
reg add rax 171
FPU add rax 281
xmm add rax 219
This program was assembled with ml64 in 64-bit format.
rax is GetTickCount difference
---
reg add rax 109
FPU add rax 266
xmm add rax 140
---
reg add rax 110
FPU add rax 265
xmm add rax 141
---
reg add rax 109
FPU add rax 250
xmm add rax 141
---
reg add rax 109
FPU add rax 266
xmm add rax 141
What about negative versions? One in particular to include: NEG vs SUB vs fchs
Very interesting!
It's not a wow64 effect :thumbsup:
Correcting results with the other test:
fact reg 172 fpu 280 xmm 218
corrected to same:
reg reg 172 fpu 258 xmm 95
fpu reg 187 fpu 280 xmm 104
xmm reg 391 fpu 587 xmm 218
Quote from: daydreamer on November 10, 2020, 05:25:57 AM
What about negative versions? One in particular to include: NEG vs SUB vs fchs
Go ahead, don't be shy!
Quote from: HSE on November 10, 2020, 05:34:07 AM
Very interesting!
It's not a wow64 effect :thumbsup:
Correcting results with the other test:
fact reg 172 fpu 280 xmm 218
corrected to same:
reg reg 172 fpu 258 xmm 95
fpu reg 187 fpu 280 xmm 104
xmm reg 391 fpu 587 xmm 218
I don't understand what you have done, sorry :cool:
It's an estimate assuming proportionality with your first test (reg = 140, fpu = 210, xmm = 78).
The preliminary suggestion is that tests in 32 and 64 bits are not directly comparable.
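For readers who, like jj2007, could not follow the correction: the sketch below is one reading of it, not necessarily the exact calculation HSE made. It takes the ratios from the first 32-bit test (reg 140, fpu 210, xmm 78) and rescales them so that one column at a time matches the 64-bit GetTickCount figures (reg 172, fpu 280, xmm 218). The names `test32`, `test64` and `scale_to` are made up for this illustration.

```python
# Figures quoted in the thread: cycle counts from the first 32-bit run
# and GetTickCount differences from the first 64-bit run.
test32 = {"reg": 140, "fpu": 210, "xmm": 78}
test64 = {"reg": 172, "fpu": 280, "xmm": 218}


def scale_to(base, ratios, measured):
    """Rescale the 32-bit ratios so the `base` column equals its 64-bit measurement.

    Hypothetical helper: assumes the three timings stay proportional
    between the two tests, which is the assumption being checked.
    """
    factor = measured[base] / ratios[base]
    return {name: value * factor for name, value in ratios.items()}


# One row per base column, in the same layout as the posted table.
for base in test32:
    row = scale_to(base, test32, test64)
    print(base, " ".join(f"{name} {value:.1f}" for name, value in row.items()))
```

Each row comes out within about one unit of the posted table. If the timings really were proportional, all three rows would be identical to the measured 172 / 280 / 218; they are far apart, which is the basis for saying the 32-bit and 64-bit tests are not directly comparable.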
Quote from: HSE on November 10, 2020, 05:55:57 AM
The preliminary suggestion is that tests in 32 and 64 bits are not directly comparable.
I still don't understand what you've done, sorry. Probably, I'm a bit dumb today. Anyway, as regards the "normal" add instruction, it's obvious that the 64-bit version is much faster: in 32-bit, you need two instructions, add and adc, to add the qword together.
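To illustrate why 32-bit code needs two instructions, here is a small Python model of the add + adc pair. It is only a sketch of the arithmetic and says nothing about timing; the function names `add_qword_32bit` and `add_qword_64bit` are invented for this example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add_qword_32bit(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit integers held as 32-bit halves, the way add + adc does."""
    low_sum = a_lo + b_lo                  # "add": low dwords
    carry = low_sum >> 32                  # carry flag: 1 if the low add overflowed 32 bits
    lo = low_sum & MASK32
    hi = (a_hi + b_hi + carry) & MASK32    # "adc": high dwords plus the carry
    return lo, hi


def add_qword_64bit(a, b):
    """Single 64-bit add, wrapping modulo 2**64 like a 64-bit register."""
    return (a + b) & MASK64


a = 0x00000001FFFFFFFF
b = 0x0000000000000001
lo, hi = add_qword_32bit(a & MASK32, a >> 32, b & MASK32, b >> 32)
print(f"two-step: {hi:08X}{lo:08X}")
print(f"one-step: {add_qword_64bit(a, b):016X}")
```

Both lines print the same 64-bit value. The second step depends on the carry produced by the first, so the two 32-bit instructions cannot run independently, whereas 64-bit code does the whole addition in a single add.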
Quote from: jj2007 on November 10, 2020, 06:57:38 AM
I still don't understand what you've done, sorry. Probably, I'm a bit dumb today.
:biggrin: Only the correction I made is very dumb, just a proportion.
Quote from: jj2007 on November 10, 2020, 06:57:38 AM
Anyway, as regards the "normal" add instruction, it's obvious that the 64-bit version is much faster: in 32-bit, you need two instructions, add and adc, to add the qword together.
Fantastic! Then xmm looks very slow.
$ wine AddingQwords.exe
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (SSE4)
96 cycles for 100 * add+adc
253 cycles for 100 * fadd
191 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned
99 cycles for 100 * add+adc
253 cycles for 100 * fadd
196 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned
106 cycles for 100 * add+adc
254 cycles for 100 * fadd
192 cycles for 100 * paddq aligned
76 cycles for 100 * paddq unaligned
94 cycles for 100 * add+adc
254 cycles for 100 * fadd
185 cycles for 100 * paddq aligned
78 cycles for 100 * paddq unaligned
106 cycles for 100 * add+adc
270 cycles for 100 * fadd
189 cycles for 100 * paddq aligned
77 cycles for 100 * paddq unaligned
34 bytes for add+adc
20 bytes for fadd
22 bytes for paddq aligned
25 bytes for paddq unaligned
Quote from: HSE on November 10, 2020, 07:24:03 AM
Fantastic! Then xmm looks very slow.
Right, but it's a bit of an academic problem: when was the last time you had an innermost loop adding QWORD integers a million times? :biggrin:
@mineiro: thanks :thumbsup:
There is considerable variation in instruction sets across different hardware so you can expect unusual timing differences from one CPU to another. Long ago I remember an AMD processor that did some things really fast against the current Intel CPUs but was slow on many other instructions. It's all silicon acreage that accounts for the difference. I know that Intel over the last 10 years or so have prioritised SSE and AVX over the old integer instructions.
Quote from: hutch-- on November 10, 2020, 11:47:41 AM
There is considerable variation in instruction sets across different hardware so you can expect unusual timing differences from one CPU to another. Long ago I remember an AMD processor that did some things really fast against the current Intel CPUs but was slow on many other instructions. It's all silicon acreage that accounts for the difference. I know that Intel over the last 10 years or so have prioritised SSE and AVX over the old integer instructions.
Well, the usual way to choose a CPU is to make sure it can run games and other programs at their minimum/recommended specs. Long ago I first had an AMD Athlon with only SSE capabilities, one step behind Intel's SSE version, so I got an Intel because I wanted the latest SSE instructions. I ran an old legacy landscape raytracing program on both, and the Intel was a disappointment compared to the AMD, probably because the program was developed with good old FPU instructions.
One timing I want to do is BitBlt, StretchBlt and DrawIconEx, to see if there is any difference in milliseconds when running on different GPUs.