Code location sensitivity of timings

Started by nidud, July 12, 2014, 09:15:44 PM

jj2007

Quote from: nidud on July 25, 2014, 04:29:11 AM
with regard to memcpy there seems little gain using SSE
...
conclusion:
- in newer CPUs MOVSB is faster than moving blocks
- in older CPUs MOVSB gets faster with size
- SSE may be faster depending on CPU

Or, in short: Everything is more complicated than you think.

nidud

#31
deleted

jj2007

Quote from: nidud on July 25, 2014, 07:19:55 AM
:biggrin:

Yes, it's possible to complicate things, I guess, and the link you provide includes a lot of complicated issues but few constructive conclusions to the problem at hand.

The table there looked different for each and every CPU we tested (try the latest version yourself). So the choice was either to pick an algo that gives reasonable speed on most of them, or to go down the stony road of detecting the CPU family and branching to a specialised one. For MasmBasic's MbCopy, rep movsd won the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...). Good to see that Intel keeps pushing this line, too. I wouldn't use rep movsb for the whole copy, though, as many of the not-so-recent CPUs are very slow with the byte variant of movs.

      push ecx             ; save the byte count
      shr ecx, 2           ; divide count by 4
      rep movsd            ; copy DWORD-sized blocks
      pop ecx              ; reload the byte count
      and ecx, 3           ; get the remaining 0..3 bytes
      rep movsb            ; copy the rest
      xchg eax, edi        ; for CAT$, return a pointer to the end of the destination
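
For readers who think in C, the split above (whole DWORDs first, then the 0..3 leftover bytes) can be sketched roughly as follows. This is only an illustration of the logic, not MbCopy itself: the function name copy_dwords_then_tail is hypothetical, and a C compiler is free to generate something quite different from rep movsd / rep movsb.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring the shr / rep movsd / and / rep movsb sequence.
   Returns a pointer to the end of the destination, like the xchg eax, edi step. */
void *copy_dwords_then_tail(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t dwords = count >> 2;   /* shr ecx, 2 : number of 4-byte blocks */
    size_t tail   = count & 3;    /* and ecx, 3 : remaining 0..3 bytes    */

    while (dwords--) {            /* rep movsd */
        uint32_t tmp;
        memcpy(&tmp, s, sizeof tmp);   /* alignment-safe 4-byte load  */
        memcpy(d, &tmp, sizeof tmp);   /* alignment-safe 4-byte store */
        s += 4;
        d += 4;
    }
    while (tail--)                /* rep movsb */
        *d++ = *s++;

    return d;                     /* one past the last byte written */
}
```

Copying 11 bytes, for example, runs the DWORD loop twice and the byte loop three times, and the returned pointer is dst + 11.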

sinsi

How much of a difference would RAM speed make? DDR3 speeds can be 1333/1600/1866/2133/2400.

nidud

#34
deleted

Gunther

Hi sinsi,

here are the results from your memcpy application:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498064    cycles -  10 (  0) 0: crt_memcpy
890775    cycles -  10 ( 38) 1: movsd - mov eax,ecx
892888    cycles -  10 ( 37) 2: movsd - push ecx
353318    cycles -  10 ( 27) 3: movsb
-- unaligned strings --
1006514   cycles -  10 (  0) 0: crt_memcpy
1033525   cycles -  10 ( 38) 1: movsd - mov eax,ecx
1033580   cycles -  10 ( 37) 2: movsd - push ecx
377061    cycles -  10 ( 27) 3: movsb
-- short strings 15 --
175505    cycles - 8000 (  0) 0: crt_memcpy
335538    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344226    cycles - 8000 ( 37) 2: movsd - push ecx
291953    cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1033175   cycles - 8000 (  0) 0: crt_memcpy
952811    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
959677    cycles - 8000 ( 37) 2: movsd - push ecx
566948    cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
3224879   cycles - 4000 (  0) 0: crt_memcpy
3153708   cycles - 4000 ( 38) 1: movsd - mov eax,ecx
3151176   cycles - 4000 ( 37) 2: movsd - push ecx
930276    cycles - 4000 ( 27) 3: movsb
--- ok ---


nidud

#36
deleted

RuiLoureiro

#37
SSE seems to be better here. Ratio of crt_memcpy cycles to SSE cycles for the short strings 271 and 2014 tests:

x1 = 2051543/884593 = 2.3191942509153927 (8000) ~2.32
x2 = 5844930/2504176 = 2.3340731641865428 (4000) ~2.33

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
-- aligned strings --
1188974   cycles -  10 (  0) 0: crt_memcpy
1097640   cycles -  10 ( 75) 1: movsd - mov eax,ecx
1103251   cycles -  10 ( 75) 2: movsd - push ecx
1102906   cycles -  10 ( 59) 3: movsb
1310185   cycles -  10 (182) 4: SSE
-- unaligned strings --
2595543   cycles -  10 (  0) 0: crt_memcpy
2620959   cycles -  10 ( 75) 1: movsd - mov eax,ecx
2611443   cycles -  10 ( 75) 2: movsd - push ecx
7866087   cycles -  10 ( 59) 3: movsb
1358767   cycles -  10 (182) 4: SSE
-- short strings 15 --
343706    cycles - 8000 (  0) 0: crt_memcpy
789893    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
808747    cycles - 8000 ( 75) 2: movsd - push ecx
2039809   cycles - 8000 ( 59) 3: movsb
237595    cycles - 8000 (182) 4: SSE
-- short strings 271 --
2051543   cycles - 8000 (  0) 0: crt_memcpy
2096801   cycles - 8000 ( 75) 1: movsd - mov eax,ecx
2083586   cycles - 8000 ( 75) 2: movsd - push ecx
7495329   cycles - 8000 ( 59) 3: movsb
884593    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
5844930   cycles - 4000 (  0) 0: crt_memcpy
6057324   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
5890555   cycles - 4000 ( 75) 2: movsd - push ecx
22533778  cycles - 4000 ( 59) 3: movsb
2504176   cycles - 4000 (182) 4: SSE
--- ok ---
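
The two ratios quoted above the table are simply baseline cycles divided by candidate cycles. A minimal C sketch of that arithmetic, with the hypothetical helper name speedup, in case someone wants to repeat it for other rows:

```c
/* Hypothetical helper: how many times faster the candidate is than the
   baseline, given the cycle counts reported for the same test. */
double speedup(unsigned long baseline_cycles, unsigned long candidate_cycles)
{
    return (double)baseline_cycles / (double)candidate_cycles;
}
```

With the figures from the table, speedup(2051543, 884593) is about 2.32 and speedup(5844930, 2504176) is about 2.33, matching the values in the post.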

nidud

#38
deleted

jj2007

align 16
rep movsb


Check if the align is really needed. In the worst case, 15 bytes of padding are inserted right there, in the middle of the code path. One common trick is to insert the bytes needed before the entry into the proc instead, so that the padding is never executed.

dedndave

nidud - i hope you're using the one in this post

http://masm32.com/board/index.php?topic=3373.msg35658#msg35658

;EAX return bits:
;0 = MMX
;1 = SSE
;2 = SSE2
;3 = SSE3
;4 = SSSE3
;5 = SSE4.1
;6 = SSE4.2


i would define the EQUates this way...

SSE_MMX    equ 1
SSE_SSE    equ 2
SSE_SSE2   equ 4
SSE_SSE3   equ 8
SSE_SSSE3  equ 10h
SSE_SSE41  equ 20h
SSE_SSE42  equ 40h


    call    GetSseLevel
    test    al,SSE_SSE3
    jnz     sse3_supported


the EQUates you have would be ok for BT, i suppose   :P
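
The difference between a bit index (what BT takes) and a bit mask (what TEST takes) can be sketched in C. The SSE_* masks follow the equates above; the index constant BIT_SSE3 and the two helper functions are assumptions made up for this illustration.

```c
#include <stdbool.h>

/* Bit masks, matching the SSE_* equates above. */
#define SSE_MMX   0x01u
#define SSE_SSE   0x02u
#define SSE_SSE2  0x04u
#define SSE_SSE3  0x08u
#define SSE_SSSE3 0x10u
#define SSE_SSE41 0x20u
#define SSE_SSE42 0x40u

/* Hypothetical bit index for SSE3 (the kind of value BT expects). */
#define BIT_SSE3  3u

/* TEST-style check: AND the level with a mask. */
bool has_mask(unsigned level, unsigned mask)
{
    return (level & mask) != 0;
}

/* BT-style check: pick out a single bit by its index. */
bool has_bit(unsigned level, unsigned index)
{
    return ((level >> index) & 1u) != 0;
}
```

Mixing the two up goes wrong quietly: for a CPU reporting only MMX, SSE and SSE2 (level 7), has_mask(level, BIT_SSE3) tests bits 0 and 1 and reports true, even though bit 3 is clear.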

hutch--

Here is a test piece that uses 4 copies of the same simple byte-intensive algo. It runs the 4 versions and times each one. The idea was to test whether the identical algo in 4 locations produced any difference in timing, but on my old quad they are almost perfectly identical, even across multiple runs.

I get this result.


File length = 977426

828 ms
828 ms
828 ms
828 ms
Press any key to continue ...

nidud

#42
deleted

dedndave

ok - that one does not preserve EBX - but, it's probably ok, in this case
just so you are aware, CPUID destroys the contents of EBX   :t

nidud

#44
deleted