From the Wayback Machine: this thread is from before nidud's mass deletion of all of his posts.
nidud's deleted posts from this thread are marked "deleted" below (https://web.archive.org/web/20211019171844/https://masm32.com/board/index.php?topic=3396.0)
Unfortunately, the zip files were not archived by the Wayback Machine and will not work.
i suppose you could use VirtualProtect to allow writes in the .CODE section
then, copy the code under test into a common code address space before executing it
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
to help speed testing up a little....
you could use different counter_begin loop count values for the short and long tests
i try to select a loop count that yields about 0.5 seconds per pass
prescott w/htt - xp sp3
unaligned
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4925905 cycles - 2048..4096 (164) A memcpy SSE 16
4933332 cycles - 2048..4096 (164) A memcpy SSE 16
4953203 cycles - 2048..4096 (164) A memcpy SSE 16
4963909 cycles - 2048..4096 (164) A memcpy SSE 16
4923198 cycles - 2048..4096 (164) A memcpy SSE 16
4941277 cycles - 2048..4096 (164) A memcpy SSE 16
11502669 cycles - 2048..4096 (164) U memcpy SSE 16
11487135 cycles - 2048..4096 (164) U memcpy SSE 16
11564951 cycles - 2048..4096 (164) U memcpy SSE 16
11570118 cycles - 2048..4096 (164) U memcpy SSE 16
11497558 cycles - 2048..4096 (164) U memcpy SSE 16
11526087 cycles - 2048..4096 (164) U memcpy SSE 16
aligned
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4935114 cycles - 2048..4096 (164) A memcpy SSE 16
4934727 cycles - 2048..4096 (164) A memcpy SSE 16
4942523 cycles - 2048..4096 (164) A memcpy SSE 16
4924574 cycles - 2048..4096 (164) A memcpy SSE 16
4924658 cycles - 2048..4096 (164) A memcpy SSE 16
4937763 cycles - 2048..4096 (164) A memcpy SSE 16
11490869 cycles - 2048..4096 (164) U memcpy SSE 16
11481780 cycles - 2048..4096 (164) U memcpy SSE 16
11616596 cycles - 2048..4096 (164) U memcpy SSE 16
11530420 cycles - 2048..4096 (164) U memcpy SSE 16
11488318 cycles - 2048..4096 (164) U memcpy SSE 16
11504392 cycles - 2048..4096 (164) U memcpy SSE 16
deleted
hmmm - that seems wrong
i thought you had to be in a code section to assemble instructions
guess i've never tried it - lol
but - there is nothing to prevent you from putting a proc in the code section and copying it to another address
so long as you allow writes into the affected pages with VirtualProtect and PAGE_EXECUTE_READWRITE
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786%28v=vs.85%29.aspx
deleted
i hadn't thought about that
but - most win32 CALL's are near relative
so, you'd have to translate the target addresses
but - for testing algorithm code that doesn't make any calls, like loops, etc,
the branch addresses are relative, but the target moves with the code :P
if you wanted to use calls or invokes in movable code,
you could store the branch address and use CALL DWORD PTR lpfnFunction
or MOV EAX,Function and CALL EAX
deleted
deleted
deleted
hard to test individual instructions
so much relies on the surrounding code
deleted
deleted
deleted
those tests run way too fast to get reliable readings
for best results:
1) bind to a single core
2) wait about 750 ms before performing any tests - this allows the system to settle
3) adjust individual loop counts so that each test pass takes about 0.5 seconds
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
170724 cycles - ( 0) 0: crt_strchr
172046 cycles - ( 29) 1: x
70540 cycles - (119) 2: 'c'
62934 cycles - (107) 3: 'cccc'
170343 cycles - ( 0) 0: crt_strchr
167063 cycles - ( 29) 1: x
89640 cycles - (119) 2: 'c'
82154 cycles - (107) 3: 'cccc'
240068 cycles - ( 0) 0: crt_strchr
87783 cycles - ( 29) 1: x
25549 cycles - (119) 2: 'c'
23995 cycles - (107) 3: 'cccc'
--- ok ---
H:\nidudString\string\strchr => strchr
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
209073 cycles - ( 0) 0: crt_strchr
221060 cycles - ( 29) 1: x
80107 cycles - (119) 2: 'c'
83854 cycles - (107) 3: 'cccc'
198648 cycles - ( 0) 0: crt_strchr
211263 cycles - ( 29) 1: x
106992 cycles - (119) 2: 'c'
96168 cycles - (107) 3: 'cccc'
253182 cycles - ( 0) 0: crt_strchr
84552 cycles - ( 29) 1: x
26531 cycles - (119) 2: 'c'
36056 cycles - (107) 3: 'cccc'
they're all over the place :P
deleted
deleted
deleted
deleted
deleted
deleted
Quote from: nidud on July 23, 2014, 03:41:08 AM
Minimum supported client
Windows XP
The SSE level used is SSE2 so how common is this combination?
It may hurt the feelings of some fans of old hard- and software, but writing code for >=(SSE2 & Win XP) should be OK for 99% of the users.
There is a poll on SSE support here (http://www.insanelymac.com/forum/topic/35109-sse2-vs-sse3-the-poll/): "I'm still waiting for SSE support :) (5 votes [2.45%])"
That was 2006, 8 years ago ;)
deleted
...or provide fallback routines
you can run a little startup init routine - detect SSE support level - and fill in addresses of PROC's
i am working on something along that line at the moment
these define TYPE's for up to 6 dword parms - you can extend it easily
_FUNC00 TYPEDEF PROTO
_FUNC04 TYPEDEF PROTO :DWORD
_FUNC08 TYPEDEF PROTO :DWORD,:DWORD
_FUNC12 TYPEDEF PROTO :DWORD,:DWORD,:DWORD
_FUNC16 TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD
_FUNC20 TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD,:DWORD
_FUNC24 TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD,:DWORD,:DWORD
_PFUNC00 TYPEDEF PTR _FUNC00
_PFUNC04 TYPEDEF PTR _FUNC04
_PFUNC08 TYPEDEF PTR _FUNC08
_PFUNC12 TYPEDEF PTR _FUNC12
_PFUNC16 TYPEDEF PTR _FUNC16
_PFUNC20 TYPEDEF PTR _FUNC20
_PFUNC24 TYPEDEF PTR _FUNC24
then, i am using a structure with function pointers in it
_FUNC STRUCT
lpfnFunc1 _PFUNC04 ? ;this function has 1 dword arg
lpfnFunc2 _PFUNC12 ? ;this function has 3 dword args
_FUNC ENDS
and, in the .DATA? section...
_Func _FUNC <>
so, you set _Func.lpfnFunc1 and _Func.lpfnFunc2 to point at appropriate routines for the supported SSE level
then.....
INVOKE _Func.lpfnFunc1,arg1
INVOKE _Func.lpfnFunc2,arg1,arg2,arg3
;or
push edi
mov edi,offset _Func
INVOKE [edi]._FUNC.lpfnFunc1,arg1
INVOKE [edi]._FUNC.lpfnFunc2,arg1,arg2,arg3
pop edi
another way to go would be to put all the routines for each support level into a DLL
then, at init, load the DLL that is appropriate for the machine
the routines can then all have the same names
most people probably have at least SSE3
however, we can look at the forum members, alone, and find a few machines
some that probably support only MMX or SSE(1)
i bought this machine in 2005 - it supports SSE3, which was a new thing at the time
so - it's almost 10 years old
Quote from: dedndave on July 23, 2014, 07:53:25 AM
i bought this machine in 2005 - it supports SSE3, which was a new thing at the time
so - it's almost 10 years old
SSE3 was introduced in February 2004 with the Prescott revision of the Pentium 4 processor.
Gunther
deleted
deleted
Quote from: Gunther on July 23, 2014, 09:03:40 AM
SSE3 was introduced in February 2004 with the Prescott revision of the Pentium 4 processor.
SSE2 was introduced in November 2000 with the P4 Willamette. In general, it's absolutely sufficient (try your luck, make Instr_() faster with SSE7.8... (http://masm32.com/board/index.php?topic=3408.msg36297#msg36297)); in particular, pcmpeqb and pmovmskb are important improvements.
deleted
Quote from: nidud on July 25, 2014, 04:29:11 AM
with regard to memcpy there seems little gain using SSE
...
conclusion:
- on newer CPUs, MOVSB is faster than moving blocks
- on older CPUs, MOVSB gets faster with size
- SSE may be faster depending on the CPU
Or, in short: Everything is more complicated than you think. (http://www.masmforum.com/board/index.php?topic=11454.msg87622#msg87622)
deleted
Quote from: nidud on July 25, 2014, 07:19:55 AM
:biggrin:
Yes, it's possible to complicate things I guess, and the link you provide includes a lot of complicated issues but few constructive conclusions to the problem at hand.
The table there looked different for each and every CPU we tested (try yourself the latest version (http://masm32.com/board/index.php?topic=1971.msg20618#msg20618)). So the choice was either choosing an algo that provided reasonable speed for most of them, or going the stony road of checking which CPU family and branching to a specialised one. For MasmBasic's MbCopy, rep movsd made the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...). Good to see that Intel keeps pushing this line, too. I wouldn't use rep movsb for the whole copy, though, as many of the not-so-recent CPUs are very slow with the byte variant of movs.
push ecx
shr ecx, 2 ; divide count by 4
rep movsd ; copy DWORD size blocks
pop ecx ; Reload byte count
and ecx, 3 ; get the rest
rep movsb ; copy the rest
xchg eax, edi ; for CAT$, return a pointer to the end of the destination;
How much of a difference would RAM speed make? DDR3 speeds can be 1333/1600/1866/2133/2400.
deleted
Hi sinsi,
your memcpy application brings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498064 cycles - 10 ( 0) 0: crt_memcpy
890775 cycles - 10 ( 38) 1: movsd - mov eax,ecx
892888 cycles - 10 ( 37) 2: movsd - push ecx
353318 cycles - 10 ( 27) 3: movsb
-- unaligned strings --
1006514 cycles - 10 ( 0) 0: crt_memcpy
1033525 cycles - 10 ( 38) 1: movsd - mov eax,ecx
1033580 cycles - 10 ( 37) 2: movsd - push ecx
377061 cycles - 10 ( 27) 3: movsb
-- short strings 15 --
175505 cycles - 8000 ( 0) 0: crt_memcpy
335538 cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344226 cycles - 8000 ( 37) 2: movsd - push ecx
291953 cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1033175 cycles - 8000 ( 0) 0: crt_memcpy
952811 cycles - 8000 ( 38) 1: movsd - mov eax,ecx
959677 cycles - 8000 ( 37) 2: movsd - push ecx
566948 cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
3224879 cycles - 4000 ( 0) 0: crt_memcpy
3153708 cycles - 4000 ( 38) 1: movsd - mov eax,ecx
3151176 cycles - 4000 ( 37) 2: movsd - push ecx
930276 cycles - 4000 ( 27) 3: movsb
--- ok ---
Gunther
deleted
Seems to be better
x1 = 2051543/884593 = 2.3191942509153927 (8000) ~2.32
x2 = 5844930/2504176 = 2.3340731641865428 (4000) ~2.33
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
-- aligned strings --
1188974 cycles - 10 ( 0) 0: crt_memcpy
1097640 cycles - 10 ( 75) 1: movsd - mov eax,ecx
1103251 cycles - 10 ( 75) 2: movsd - push ecx
1102906 cycles - 10 ( 59) 3: movsb
1310185 cycles - 10 (182) 4: SSE
-- unaligned strings --
2595543 cycles - 10 ( 0) 0: crt_memcpy
2620959 cycles - 10 ( 75) 1: movsd - mov eax,ecx
2611443 cycles - 10 ( 75) 2: movsd - push ecx
7866087 cycles - 10 ( 59) 3: movsb
1358767 cycles - 10 (182) 4: SSE
-- short strings 15 --
343706 cycles - 8000 ( 0) 0: crt_memcpy
789893 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
808747 cycles - 8000 ( 75) 2: movsd - push ecx
2039809 cycles - 8000 ( 59) 3: movsb
237595 cycles - 8000 (182) 4: SSE
-- short strings 271 --
2051543 cycles - 8000 ( 0) 0: crt_memcpy
2096801 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
2083586 cycles - 8000 ( 75) 2: movsd - push ecx
7495329 cycles - 8000 ( 59) 3: movsb
884593 cycles - 8000 (182) 4: SSE
-- short strings 2014 --
5844930 cycles - 4000 ( 0) 0: crt_memcpy
6057324 cycles - 4000 ( 75) 1: movsd - mov eax,ecx
5890555 cycles - 4000 ( 75) 2: movsd - push ecx
22533778 cycles - 4000 ( 59) 3: movsb
2504176 cycles - 4000 (182) 4: SSE
--- ok ---
deleted
align 16
rep movsb
Check if the align is really needed. In the worst case, 15 bytes of code are inserted there. One common trick is to insert the bytes needed before the entry into the proc.
nidud - i hope you're using the one in this post
http://masm32.com/board/index.php?topic=3373.msg35658#msg35658
;EAX return bits:
;0 = MMX
;1 = SSE
;2 = SSE2
;3 = SSE3
;4 = SSSE3
;5 = SSE4.1
;6 = SSE4.2
i would define the EQUates this way...
SSE_MMX equ 1
SSE_SSE equ 2
SSE_SSE2 equ 4
SSE_SSE3 equ 8
SSE_SSSE3 equ 10h
SSE_SSE41 equ 20h
SSE_SSE42 equ 40h
call GetSseLevel
test al,SSE_SSE3
jnz sse3_supported
the EQUates you have would be ok for BT, i suppose :P
Here is a test piece that uses 4 copies of the same simple byte-intensive algo; it runs the 4 versions and times each one. The idea was to test if the identical algo in 4 locations produced any difference in timing, but on my old quad they are almost perfectly identical, even with multiple runs.
I get this result.
File length = 977426
828 ms
828 ms
828 ms
828 ms
Press any key to continue ...
deleted
ok - that one does not preserve EBX - but, it's probably ok, in this case
just so you are aware, CPUID destroys the contents of EBX :t
deleted
File length = 977412
1484 ms
1453 ms
1516 ms
1547 ms
Press any key to continue ...
1344 ms
1344 ms
1343 ms
2016 ms...
1343 ms
1344 ms
1344 ms
2015 ms...
1344 ms
1359 ms
1344 ms
2016 ms
If we remove the worst case ...
If i am not wrong, you are using 2 counters:
First counter = 1000
Second counter = count (=4000,etc.)
You get the result only when
the first counter is 0 (counter_end).
So the result has something to do with the execution
of this:
mov edi,count
mov ebx,esp
.while edi
pushargs
call esi
mov esp,ebx
dec edi
Is there any particular reason for this ?
Quote
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi,count
mov ebx,esp
.while edi
pushargs
call esi
mov esp,ebx
dec edi
.endw
counter_end
Quote from: nidud on July 26, 2014, 02:57:09 AM
Quote
Check if the align is really needed
I normally tune them from the list file in the end
What I intended is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at microcode level, it is of course a loop).
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (MMX, SSE, SSE2, SSE3)
movsd align 16 10476 µs
movsd align 3 10456 µs
movsd align 13 10347 µs
movsb align 16 10510 µs
movsb align 3 10503 µs
movsb align 13 10407 µs
movsd align 16 10514 µs
movsd align 3 10469 µs
movsd align 13 10516 µs
movsb align 16 10455 µs
movsb align 3 10515 µs
movsb align 13 10502 µs
movsd align 16 10526 µs
movsd align 3 10455 µs
movsd align 13 10469 µs
movsb align 16 10360 µs
movsb align 3 10485 µs
movsb align 13 10456 µs
Sample:
test4a proc uses esi edi ecx
align 16
nops 3
rep movsb
ret
test4a endp
Interesting, though, that movsb is indeed equally fast on my trusty old Celeron, at least for a 10 MB string.
deleted
That's the result of memcpy.exe from 1234.zip:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498952 cycles - 10 ( 0) 0: crt_memcpy
898756 cycles - 10 ( 75) 1: movsd - mov eax,ecx
903577 cycles - 10 ( 75) 2: movsd - push ecx
354813 cycles - 10 ( 59) 3: movsb
487954 cycles - 10 (182) 4: SSE
-- unaligned strings --
494936 cycles - 10 ( 0) 0: crt_memcpy
895940 cycles - 10 ( 75) 1: movsd - mov eax,ecx
895968 cycles - 10 ( 75) 2: movsd - push ecx
373553 cycles - 10 ( 59) 3: movsb
491344 cycles - 10 (182) 4: SSE
-- short strings 15 --
175961 cycles - 8000 ( 0) 0: crt_memcpy
361324 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
361586 cycles - 8000 ( 75) 2: movsd - push ecx
313550 cycles - 8000 ( 59) 3: movsb
92719 cycles - 8000 (182) 4: SSE
-- short strings 271 --
841879 cycles - 8000 ( 0) 0: crt_memcpy
780741 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
806939 cycles - 8000 ( 75) 2: movsd - push ecx
623419 cycles - 8000 ( 59) 3: movsb
275466 cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1002628 cycles - 4000 ( 0) 0: crt_memcpy
2239737 cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2226209 cycles - 4000 ( 75) 2: movsd - push ecx
962207 cycles - 4000 ( 59) 3: movsb
972245 cycles - 4000 (182) 4: SSE
--- ok ---
Gunther
Quote
The macro can only be called with EDI, ESI, or EBX, or an immediate value.
I think I just run out of regs once and inserted a loop.
The count for small functions is also rather high so it's
just a way of skipping zeros I guess.
I think you are talking about this macro:
counter_begin MACRO loopcount:REQ, priority
or counter_end
If it is, we cannot use EBX because cpuid destroys EBX
I modified counter_begin -written by MichaelW- to this:
(COUNTERLOOPS=1000 or 10000 or 100000 or ...)
; this macro uses EDI inside = length from kIni to kEnd
; we need to define an array to save the means.
; we need to define _LoopCount,_MaxLength...etc. in .DATA
BEGIN_COUNTER_CYCLE_HIGH_PRIORITY_CLASS MACRO kIni, kEnd
LOCAL labelA,labelB
mov _LoopCount, COUNTERLOOPS
mov _MaxLength, kEnd
mov edi, kIni
;mov _MinLength, edi ;; not used yet
mov _MeanValue, 0 ;; mean is 0
invoke GetCurrentProcess
invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS
labelA: ;; Begin test loop
BEGIN_LOOP_TEST equ <labelA>
xor eax, eax ;; Use same CPUID input value for each call
cpuid ;; Flush pipe & wait for pending ops to finish
rdtsc ;; Read Time Stamp Counter
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
mov _LoopCounter, COUNTERLOOPS
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
@@: ;; Start an empty reference loop
sub _LoopCounter, 1
jnz short @B
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of overhead count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of overhead count in EDX
push edx ;; Preserve high-order 32 bits of overhead count
push eax ;; Preserve low-order 32 bits of overhead count
xor eax, eax
cpuid
rdtsc
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
;;-------------------------------------
;; Start
;;-------------------------------------
mov _LoopCounter, COUNTERLOOPS
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
labelB: ;; Start test loop
START_LOOP_TEST equ <labelB>
ENDM
; ------------------------------------------------------------------------
END_COUNTER_CYCLE MACRO arg
LOCAL $tmpstr$
sub _LoopCounter, 1
jnz START_LOOP_TEST ;; goto labelB
;;---------------------------
;; stop this count
;;---------------------------
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of test count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of test count in EDX
pop ecx ;; Recover low-order 32 bits of overhead count
sub eax, ecx ;; Low-order 32 bits of adjusted count in EAX
pop ecx ;; Recover high-order 32 bits of overhead count
sbb edx, ecx ;; High-order 32 bits of adjusted count in EDX
mov DWORD PTR _CounterQword, eax
mov DWORD PTR _CounterQword + 4, edx
finit
fild _CounterQword
fild _LoopCount
fdiv
fistp _CounterQword
mov ebx, dword ptr _CounterQword
;---------------------------------------------------
; print cycles
;---------------------------------------------------
add ebx, _MeanValue
mov _MeanValue, ebx
add edi, 1
cmp edi, _MaxLength
jbe BEGIN_LOOP_TEST ;; goto labelA
invoke GetCurrentProcess
invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
; --------------------------------------------------
; Save mean and print mean
; --------------------------------------------------
invoke SaveMeans, ebx ;; save it in one array
;; one after another
;---------------------------------------------------
print str$(ebx)
$tmpstr$ CATSTR <chr$(">, <arg>, <",13,10)>
print $tmpstr$
;---------------------------------------------------
ENDM
.data
ALIGN 8 ;; Optimal alignment for QWORD
_CounterQword dq 0
_LoopCount dd 0
_LoopCounter dd 0
_MinLength dd 0
_MaxLength dd 0
_MeanValue dd 0
;------------------------------
ALIGN 4
dd 0 ; <<<--- start with 0
_TblTiming0 dd 600 dup (?)
.code
SaveMeans proc kMean:DWORD
mov eax, kMean
mov edx, offset _TblTiming0
mov ecx, [edx-4] ; number of means
mov [edx+ecx*4], eax
add ecx, 1
mov [edx-4], ecx
ret
SaveMeans endp
deleted
Hi nidud,
Quote from: nidud on July 27, 2014, 02:10:39 AM
I added some bits to Dave's test:
there's nothing attached.
Gunther
deleted
deleted
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
-----------------------------------------------
-- aligned strings --
995933 cycles - 10 ( 0) 0: crt_strrchr
995891 cycles - 10 ( 40) 1: strrchr
273823 cycles - 10 (154) 2: x
94668 cycles - 10 (112) 3: SSE
-- unaligned strings --
996477 cycles - 10 ( 0) 0: crt_strrchr
997094 cycles - 10 ( 40) 1: strrchr
298219 cycles - 10 (154) 2: x
121529 cycles - 10 (112) 3: SSE
-- small strings 128 --
324263 cycles - 500 ( 0) 0: crt_strrchr
323710 cycles - 500 ( 40) 1: strrchr
84786 cycles - 500 (154) 2: x
34915 cycles - 500 (112) 3: SSE
-- small strings 1 --
67914 cycles - 500 ( 0) 0: crt_strrchr
67286 cycles - 500 ( 40) 1: strrchr
12595 cycles - 500 (154) 2: x
16622 cycles - 500 (112) 3: SSE
Hi nidud,
here's the output of auto.zip:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- aligned strings --
491469 cycles - 10 ( 0) 0: crt_memcpy
889651 cycles - 10 ( 63) 1: movsd - mov eax,ecx
887273 cycles - 10 ( 63) 2: movsd - push ecx
355080 cycles - 10 ( 51) 3: movsb
487046 cycles - 10 (182) 4: SSE
355990 cycles - 10 ( 0) 5: auto
-- unaligned strings --
490269 cycles - 10 ( 0) 0: crt_memcpy
886259 cycles - 10 ( 63) 1: movsd - mov eax,ecx
886778 cycles - 10 ( 63) 2: movsd - push ecx
372520 cycles - 10 ( 51) 3: movsb
491780 cycles - 10 (182) 4: SSE
378881 cycles - 10 ( 0) 5: auto
-- short strings 15 --
174897 cycles - 8000 ( 0) 0: crt_memcpy
349626 cycles - 8000 ( 63) 1: movsd - mov eax,ecx
343812 cycles - 8000 ( 63) 2: movsd - push ecx
307384 cycles - 8000 ( 51) 3: movsb
98073 cycles - 8000 (182) 4: SSE
293479 cycles - 8000 ( 0) 5: auto
-- short strings 271 --
832627 cycles - 8000 ( 0) 0: crt_memcpy
773797 cycles - 8000 ( 63) 1: movsd - mov eax,ecx
764418 cycles - 8000 ( 63) 2: movsd - push ecx
586580 cycles - 8000 ( 51) 3: movsb
279676 cycles - 8000 (182) 4: SSE
557134 cycles - 8000 ( 0) 5: auto
-- short strings 2014 --
998188 cycles - 4000 ( 0) 0: crt_memcpy
2198740 cycles - 4000 ( 63) 1: movsd - mov eax,ecx
2195833 cycles - 4000 ( 63) 2: movsd - push ecx
935710 cycles - 4000 ( 51) 3: movsb
961563 cycles - 4000 (182) 4: SSE
906474 cycles - 4000 ( 0) 5: auto
--- ok ---
Gunther
deleted
deleted
Hi nidud,
Quote from: nidud on August 11, 2014, 08:46:40 PM
Is this possible? To have SSE4.1 and not SSE3?
Note: SSE and SSE2 are pre-set since the program will exit if SSE2 is not present, so this bit must be set by the test.
I think not. Did you try that (http://masm32.com/board/index.php?topic=1418.msg14444#msg14444)? It should show you the available instruction sets.
Gunther
deleted
Hi nidud,
you can trust my instruction-detecting application. In any case, your laptop supports SSE3 and SSSE3, and it supports AVX. You can test that with this tool (http://masm32.com/board/index.php?topic=3227.msg35958#msg35958), if you have at least Windows 7 with SP1 installed. The glitch must be in your code. Do you test the right bits?
Gunther
deleted
Hi Gunther,
My Core-i3 G3220 does not support AVX, but the results for your instruction set detection tool:
Supported by Processor and installed Operating System:
------------------------------------------------------
MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
POPCNT, SSE4.2
featurenumber = 13
Appear to match the Intel specs:
http://ark.intel.com/products/77773
Hi Michael,
Quote from: MichaelW on August 12, 2014, 10:00:55 AM
Appear to match the Intel specs:
http://ark.intel.com/products/77773
I hope so. I've written the procedure using the Intel documents as a basis.
Gunther
deleted
movzx eax, byte ptr [esp+8]
if 1
imul eax, eax, 01010101h ; 4 bytes shorter, faster
else
mov ah, al
mov ecx, eax
shl eax, 16
add eax, ecx
endif
movd xmm0, eax
pshufd xmm0, xmm0, 0 ; populate char
deleted
Variants of memchr:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
43778 cycles for 100 * memchr scasb
4474 cycles for 100 * memchr SSE2a
5608 cycles for 100 * memchr SSE2b
43994 cycles for 100 * memchr scasb
4497 cycles for 100 * memchr SSE2a
5602 cycles for 100 * memchr SSE2b
44044 cycles for 100 * memchr scasb
4474 cycles for 100 * memchr SSE2a
5598 cycles for 100 * memchr SSE2b
36 bytes for memchr scasb
88 bytes for memchr SSE2a
92 bytes for memchr SSE2b
Could look much different on other CPUs, as movlps speeds it up a lot on my CPU.
My processor is near (or at) bottom-end today (retail box, $79).
Intel(R) Pentium(R) CPU G3220 @ 3.00GHz (SSE4)
24909 cycles for 100 * memchr scasb
2864 cycles for 100 * memchr SSE2a
2399 cycles for 100 * memchr SSE2b
24934 cycles for 100 * memchr scasb
2882 cycles for 100 * memchr SSE2a
2366 cycles for 100 * memchr SSE2b
24923 cycles for 100 * memchr scasb
2886 cycles for 100 * memchr SSE2a
2418 cycles for 100 * memchr SSE2b
36 bytes for memchr scasb
88 bytes for memchr SSE2a
92 bytes for memchr SSE2b
96 = eax memchr scasb
96 = eax memchr SSE2a
96 = eax memchr SSE2b
deleted
Thanks. As I suspected, the movlps/movhps pair is good only for my trusty Celeron :(
Here is one more, with movups instead:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
43821 cycles for 100 * memchr scasb
4477 cycles for 100 * memchr SSE2 lps/hps
5556 cycles for 100 * memchr SSE2 nidud
5205 cycles for 100 * memchr SSE2 ups
43778 cycles for 100 * memchr scasb
4476 cycles for 100 * memchr SSE2 lps/hps
5606 cycles for 100 * memchr SSE2 nidud
5206 cycles for 100 * memchr SSE2 ups
43762 cycles for 100 * memchr scasb
4482 cycles for 100 * memchr SSE2 lps/hps
5607 cycles for 100 * memchr SSE2 nidud
5200 cycles for 100 * memchr SSE2 ups
36 bytes for memchr scasb
88 bytes for memchr SSE2 lps/hps
92 bytes for memchr SSE2 nidud
84 bytes for memchr SSE2 ups
Intel(R) Pentium(R) CPU G3220 @ 3.00GHz (SSE4)
24916 cycles for 100 * memchr scasb
2889 cycles for 100 * memchr SSE2 lps/hps
2422 cycles for 100 * memchr SSE2 nidud
2351 cycles for 100 * memchr SSE2 ups
24927 cycles for 100 * memchr scasb
2890 cycles for 100 * memchr SSE2 lps/hps
2469 cycles for 100 * memchr SSE2 nidud
2342 cycles for 100 * memchr SSE2 ups
24921 cycles for 100 * memchr scasb
2885 cycles for 100 * memchr SSE2 lps/hps
2405 cycles for 100 * memchr SSE2 nidud
2351 cycles for 100 * memchr SSE2 ups
36 bytes for memchr scasb
88 bytes for memchr SSE2 lps/hps
92 bytes for memchr SSE2 nidud
84 bytes for memchr SSE2 ups
deleted
Jochen,
your timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
21892 cycles for 100 * memchr scasb
3007 cycles for 100 * memchr SSE2 lps/hps
2690 cycles for 100 * memchr SSE2 nidud
2500 cycles for 100 * memchr SSE2 ups
21951 cycles for 100 * memchr scasb
2981 cycles for 100 * memchr SSE2 lps/hps
2721 cycles for 100 * memchr SSE2 nidud
6211 cycles for 100 * memchr SSE2 ups
21827 cycles for 100 * memchr scasb
3003 cycles for 100 * memchr SSE2 lps/hps
2510 cycles for 100 * memchr SSE2 nidud
2721 cycles for 100 * memchr SSE2 ups
36 bytes for memchr scasb
88 bytes for memchr SSE2 lps/hps
92 bytes for memchr SSE2 nidud
84 bytes for memchr SSE2 ups
--- ok ---
Gunther
deleted