Code location sensitivity of timings

jj2007 · July 25, 2014, 06:33:53 AM

Quote from: nidud on July 25, 2014, 04:29:11 AM
with regard to memcpy there seems little gain using SSE
...
conclusion:
- in newer CPUs MOVSB is faster than moving blocks
- in older CPUs MOVSB gets faster with size
- SSE may be faster depending on CPU

Or, in short: Everything is more complicated than you think.

nidud · July 25, 2014, 07:19:55 AM

deleted

jj2007 · July 25, 2014, 06:01:49 PM

Quote from: nidud on July 25, 2014, 07:19:55 AM

Yes, it's possible to complicate things, I guess, and the link you provide includes a lot of complicated issues but few constructive conclusions to the problem at hand.

The table there looked different for each and every CPU we tested (try the latest version yourself). So the choice was either picking an algo that provided reasonable speed on most of them, or going down the stony road of checking the CPU family and branching to a specialised one. For MasmBasic's MbCopy, rep movsd won the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...). Good to see that Intel keeps pushing this line, too. I wouldn't use rep movsb for the whole copy, though, as many of the not-so-recent CPUs are very slow with the byte variant of movs.

	push ecx		; save byte count
	shr ecx, 2		; divide count by 4
	rep movsd		; copy DWORD size blocks
	pop ecx			; reload byte count
	and ecx, 3		; get the rest
	rep movsb		; copy the rest
	xchg eax, edi		; for CAT$, return a pointer to the end of the destination
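
For readers who do not read x86 assembly, the split performed by the snippet above can be sketched in C: whole DWORDs first, then the 0 to 3 leftover bytes, returning a pointer to the end of the destination. This is only an illustration of the idea. `copy_dwords_then_bytes` is a hypothetical name, it is not MasmBasic's MbCopy, and a C compiler is free to emit something quite different from rep movsd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration of the "DWORDs first, bytes last" split. */
unsigned char *copy_dwords_then_bytes(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t dwords = count >> 2;   /* like: shr ecx, 2 */
    size_t rest   = count & 3;    /* like: and ecx, 3 */

    assert(dst != NULL && src != NULL);

    while (dwords--) {            /* like: rep movsd */
        uint32_t tmp;
        memcpy(&tmp, s, sizeof tmp);   /* alignment-safe 4-byte load */
        memcpy(d, &tmp, sizeof tmp);   /* alignment-safe 4-byte store */
        s += 4;
        d += 4;
    }
    while (rest--)                /* like: rep movsb */
        *d++ = *s++;

    return d;                     /* end of destination, as in xchg eax, edi */
}
```

Returning the end pointer makes it cheap to append the next string directly after the one just copied, which is presumably why the assembly version does it for CAT$.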

sinsi · July 25, 2014, 08:10:40 PM

How much of a difference would RAM speed make? DDR3 speeds can be 1333/1600/1866/2133/2400.

nidud · July 25, 2014, 10:21:23 PM

deleted

Gunther · July 25, 2014, 10:56:33 PM

Hi sinsi,

your memcpy application gives these results:

Code:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498064    cycles -  10 (  0) 0: crt_memcpy
890775    cycles -  10 ( 38) 1: movsd - mov eax,ecx
892888    cycles -  10 ( 37) 2: movsd - push ecx
353318    cycles -  10 ( 27) 3: movsb
-- unaligned strings --
1006514   cycles -  10 (  0) 0: crt_memcpy
1033525   cycles -  10 ( 38) 1: movsd - mov eax,ecx
1033580   cycles -  10 ( 37) 2: movsd - push ecx
377061    cycles -  10 ( 27) 3: movsb
-- short strings 15 --
175505    cycles - 8000 (  0) 0: crt_memcpy
335538    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344226    cycles - 8000 ( 37) 2: movsd - push ecx
291953    cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1033175   cycles - 8000 (  0) 0: crt_memcpy
952811    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
959677    cycles - 8000 ( 37) 2: movsd - push ecx
566948    cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
3224879   cycles - 4000 (  0) 0: crt_memcpy
3153708   cycles - 4000 ( 38) 1: movsd - mov eax,ecx
3151176   cycles - 4000 ( 37) 2: movsd - push ecx
930276    cycles - 4000 ( 27) 3: movsb
--- ok ---

Gunther

nidud · July 25, 2014, 11:41:54 PM

deleted

RuiLoureiro · July 26, 2014, 01:14:05 AM

Seems to be better

x1 = 2051543 / 884593  = 2.3191942509153927 (8000) ~2.32
x2 = 5844930 / 2504176 = 2.3340731641865428 (4000) ~2.33
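
The two ratios above appear to be the crt_memcpy cycle counts divided by the SSE cycle counts from the Pentium 4 table that follows (short strings 271 and 2014). A small C sketch to reproduce them; `speed_ratio` is a hypothetical helper name, not part of the test program.

```c
#include <assert.h>

/* Hypothetical helper: how many times more cycles the baseline needed
   than the candidate. */
double speed_ratio(unsigned long baseline_cycles, unsigned long candidate_cycles)
{
    assert(candidate_cycles != 0);   /* avoid division by zero */
    return (double)baseline_cycles / (double)candidate_cycles;
}
```

With the figures from the table, `speed_ratio(2051543, 884593)` is about 2.32 and `speed_ratio(5844930, 2504176)` is about 2.33.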

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
-- aligned strings --
1188974 cycles - 10 ( 0) 0: crt_memcpy
1097640 cycles - 10 ( 75) 1: movsd - mov eax,ecx
1103251 cycles - 10 ( 75) 2: movsd - push ecx
1102906 cycles - 10 ( 59) 3: movsb
1310185 cycles - 10 (182) 4: SSE
-- unaligned strings --
2595543 cycles - 10 ( 0) 0: crt_memcpy
2620959 cycles - 10 ( 75) 1: movsd - mov eax,ecx
2611443 cycles - 10 ( 75) 2: movsd - push ecx
7866087 cycles - 10 ( 59) 3: movsb
1358767 cycles - 10 (182) 4: SSE
-- short strings 15 --
343706 cycles - 8000 ( 0) 0: crt_memcpy
789893 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
808747 cycles - 8000 ( 75) 2: movsd - push ecx
2039809 cycles - 8000 ( 59) 3: movsb
237595 cycles - 8000 (182) 4: SSE
-- short strings 271 --
2051543 cycles - 8000 ( 0) 0: crt_memcpy
2096801 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
2083586 cycles - 8000 ( 75) 2: movsd - push ecx
7495329 cycles - 8000 ( 59) 3: movsb
884593 cycles - 8000 (182) 4: SSE
-- short strings 2014 --
5844930 cycles - 4000 ( 0) 0: crt_memcpy
6057324 cycles - 4000 ( 75) 1: movsd - mov eax,ecx
5890555 cycles - 4000 ( 75) 2: movsd - push ecx
22533778 cycles - 4000 ( 59) 3: movsb
2504176 cycles - 4000 (182) 4: SSE
--- ok ---

nidud · July 26, 2014, 01:46:55 AM

deleted

jj2007 · July 26, 2014, 01:52:14 AM

Code:

	align	16
	rep	movsb

Check if the align is really needed. In the worst case, 15 bytes of code are inserted there. One common trick is to insert the bytes needed before the entry into the proc.

dedndave · July 26, 2014, 02:15:52 AM

nidud - i hope you're using the one in this post

http://masm32.com/board/index.php?topic=3373.msg35658#msg35658

Code:

;EAX return bits:
;0 = MMX
;1 = SSE
;2 = SSE2
;3 = SSE3
;4 = SSSE3
;5 = SSE4.1
;6 = SSE4.2

i would define the EQUates this way...

Code:

SSE_MMX    equ 1
SSE_SSE    equ 2
SSE_SSE2   equ 4
SSE_SSE3   equ 8
SSE_SSSE3  equ 10h
SSE_SSE41  equ 20h
SSE_SSE42  equ 40h

Code:

    call    GetSseLevel
    test    al,SSE_SSE3
    jnz     sse3_supported

the EQUates you have would be ok for BT, i suppose :P
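
The difference between the two styles of equates can be sketched in C. Bit numbers (0 to 6) suit a BT-style test on an index, while mask values (1, 2, 4, ...) suit TEST/AND. The names below are hypothetical and only mirror the bit layout listed above; `has_feature_by_bit` and `has_feature_by_mask` are not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers, as in the "EAX return bits" list (BT style). */
enum { BIT_MMX = 0, BIT_SSE = 1, BIT_SSE2 = 2, BIT_SSE3 = 3,
       BIT_SSSE3 = 4, BIT_SSE41 = 5, BIT_SSE42 = 6 };

/* Mask values (TEST style): each one is 1 shifted left by its bit number. */
enum { MASK_MMX = 0x01, MASK_SSE = 0x02, MASK_SSE2 = 0x04, MASK_SSE3 = 0x08,
       MASK_SSSE3 = 0x10, MASK_SSE41 = 0x20, MASK_SSE42 = 0x40 };

/* Like "bt eax, bit": test a single bit selected by its index. */
int has_feature_by_bit(uint32_t level, unsigned bit)
{
    assert(bit < 32);
    return (int)((level >> bit) & 1u);
}

/* Like "test al, mask": non-zero if any bit of the mask is set. */
int has_feature_by_mask(uint32_t level, uint32_t mask)
{
    return (level & mask) != 0;
}
```

The pitfall is mixing the two: with a level of 07h (MMX, SSE, SSE2 only), `has_feature_by_mask(0x07, BIT_SSE3)` tests bits 0 and 1 and wrongly reports success, while `has_feature_by_mask(0x07, MASK_SSE3)` correctly reports that SSE3 is missing.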

hutch-- · July 26, 2014, 02:55:30 AM

Here is a test piece that uses 4 copies of the same simple byte intensive algo. It runs the 4 versions and times each one. The idea was to test whether the identical algo in 4 locations produced any difference in timing, but on my old quad they are almost perfectly identical, even with multiple runs.

I get this result.

File length = 977426

828 ms
828 ms
828 ms
828 ms
Press any key to continue ...

nidud · July 26, 2014, 02:57:09 AM

deleted

dedndave · July 26, 2014, 03:03:03 AM

ok - that one does not preserve EBX - but, it's probably ok, in this case
just so you are aware, CPUID destroys the contents of EBX

nidud · July 26, 2014, 03:30:28 AM

deleted
