Author Topic: Code location sensitivity of timings  (Read 40411 times)

jj2007

  • Member
  • *****
  • Posts: 12697
  • Assembler is fun ;-)
    • MasmBasic
Re: Code location sensitivity of timings
« Reply #30 on: July 25, 2014, 06:33:53 AM »
with regard to memcpy there seems little gain using SSE
...
conclusion:
- in newer CPUs MOVSB is faster than moving blocks
- in older CPUs MOVSB gets faster with size
- SSE may be faster depending on CPU

Or, in short: Everything is more complicated than you think.

nidud

  • Member
  • *****
  • Posts: 2388
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #31 on: July 25, 2014, 07:19:55 AM »
deleted
« Last Edit: February 25, 2022, 08:28:20 AM by nidud »

jj2007

  • Member
  • *****
  • Posts: 12697
  • Assembler is fun ;-)
    • MasmBasic
Re: Code location sensitivity of timings
« Reply #32 on: July 25, 2014, 06:01:49 PM »
:biggrin:

Yes, it’s possible to complicate things, I guess, and the link you provide covers a lot of complicated issues but offers few constructive conclusions for the problem at hand.

The table there looked different for each and every CPU we tested (try the latest version yourself). So the choice was either picking an algo that provides reasonable speed on most of them, or going the stony road of checking the CPU family and branching to a specialised one. For MasmBasic's MbCopy, rep movsd won the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...). Good to see that Intel keeps pushing this line, too. I wouldn't use rep movsb for the whole copy, though, as many of the not-so-recent CPUs are very slow with the byte variant of movs.

      push ecx
      shr ecx, 2           ; divide count by 4
      rep movsd            ; copy DWORD size blocks
      pop ecx              ; Reload byte count
      and ecx, 3           ; get the rest
      rep movsb            ; copy the rest
      xchg eax, edi        ; for CAT$, return a pointer to the end of the destination
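
For readers who don't read assembly, the split above (whole DWORDs first, then the 0 to 3 leftover bytes) can be sketched in C. This is only an illustration: `copy_dwords_then_bytes` is a hypothetical name, not MasmBasic's MbCopy, and a plain loop will not reproduce the microcoded behaviour of rep movsd that the timings in this thread measure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not MasmBasic's MbCopy): copies count bytes and
   returns a pointer just past the last byte written, mirroring the
   "xchg eax, edi" at the end of the assembly above. */
static unsigned char *copy_dwords_then_bytes(unsigned char *dst,
                                             const unsigned char *src,
                                             size_t count)
{
    size_t dwords = count >> 2;   /* shr ecx, 2: number of 4-byte blocks */
    size_t rest   = count & 3;    /* and ecx, 3: leftover 0..3 bytes     */

    while (dwords--) {            /* rep movsd */
        uint32_t tmp;
        memcpy(&tmp, src, sizeof tmp);  /* alignment-safe 4-byte load  */
        memcpy(dst, &tmp, sizeof tmp);  /* alignment-safe 4-byte store */
        src += 4;
        dst += 4;
    }
    while (rest--) {              /* rep movsb */
        *dst++ = *src++;
    }
    return dst;                   /* end of destination, handy for concatenation */
}
```

Returning the end pointer means a caller that concatenates several strings can pass the result straight in as the next destination.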

sinsi

  • Guest
Re: Code location sensitivity of timings
« Reply #33 on: July 25, 2014, 08:10:40 PM »
How much of a difference would RAM speed make? DDR3 speeds can be 1333/1600/1866/2133/2400.

nidud

  • Member
  • *****
  • Posts: 2388
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #34 on: July 25, 2014, 10:21:23 PM »
deleted
« Last Edit: February 25, 2022, 08:28:36 AM by nidud »

Gunther

  • Member
  • *****
  • Posts: 4072
  • Forgive your enemies, but never forget their names
Re: Code location sensitivity of timings
« Reply #35 on: July 25, 2014, 10:56:33 PM »
Hi sinsi,

your memcpy application gives these results:
Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498064    cycles -  10 (  0) 0: crt_memcpy
890775    cycles -  10 ( 38) 1: movsd - mov eax,ecx
892888    cycles -  10 ( 37) 2: movsd - push ecx
353318    cycles -  10 ( 27) 3: movsb
-- unaligned strings --
1006514   cycles -  10 (  0) 0: crt_memcpy
1033525   cycles -  10 ( 38) 1: movsd - mov eax,ecx
1033580   cycles -  10 ( 37) 2: movsd - push ecx
377061    cycles -  10 ( 27) 3: movsb
-- short strings 15 --
175505    cycles - 8000 (  0) 0: crt_memcpy
335538    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344226    cycles - 8000 ( 37) 2: movsd - push ecx
291953    cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1033175   cycles - 8000 (  0) 0: crt_memcpy
952811    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
959677    cycles - 8000 ( 37) 2: movsd - push ecx
566948    cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
3224879   cycles - 4000 (  0) 0: crt_memcpy
3153708   cycles - 4000 ( 38) 1: movsd - mov eax,ecx
3151176   cycles - 4000 ( 37) 2: movsd - push ecx
930276    cycles - 4000 ( 27) 3: movsb
--- ok ---

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 2388
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #36 on: July 25, 2014, 11:41:54 PM »
deleted
« Last Edit: February 25, 2022, 08:28:51 AM by nidud »

RuiLoureiro

  • Member
  • ****
  • Posts: 820
Re: Code location sensitivity of timings
« Reply #37 on: July 26, 2014, 01:14:05 AM »
Seems to be better

x1 = 2051543/884593  = 2.3191942509153927 (8000) ~2.32
x2 = 5844930/2504176 = 2.3340731641865428 (4000) ~2.33

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
-- aligned strings --
1188974   cycles -  10 (  0) 0: crt_memcpy
1097640   cycles -  10 ( 75) 1: movsd - mov eax,ecx
1103251   cycles -  10 ( 75) 2: movsd - push ecx
1102906   cycles -  10 ( 59) 3: movsb
1310185   cycles -  10 (182) 4: SSE
-- unaligned strings --
2595543   cycles -  10 (  0) 0: crt_memcpy
2620959   cycles -  10 ( 75) 1: movsd - mov eax,ecx
2611443   cycles -  10 ( 75) 2: movsd - push ecx
7866087   cycles -  10 ( 59) 3: movsb
1358767   cycles -  10 (182) 4: SSE
-- short strings 15 --
343706    cycles - 8000 (  0) 0: crt_memcpy
789893    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
808747    cycles - 8000 ( 75) 2: movsd - push ecx
2039809   cycles - 8000 ( 59) 3: movsb
237595    cycles - 8000 (182) 4: SSE
-- short strings 271 --
2051543   cycles - 8000 (  0) 0: crt_memcpy
2096801   cycles - 8000 ( 75) 1: movsd - mov eax,ecx
2083586   cycles - 8000 ( 75) 2: movsd - push ecx
7495329   cycles - 8000 ( 59) 3: movsb
884593    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
5844930   cycles - 4000 (  0) 0: crt_memcpy
6057324   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
5890555   cycles - 4000 ( 75) 2: movsd - push ecx
22533778  cycles - 4000 ( 59) 3: movsb
2504176   cycles - 4000 (182) 4: SSE
--- ok ---
« Last Edit: July 26, 2014, 03:49:33 AM by RuiLoureiro »

nidud

  • Member
  • *****
  • Posts: 2388
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #38 on: July 26, 2014, 01:46:55 AM »
deleted
« Last Edit: February 25, 2022, 08:29:07 AM by nidud »

jj2007

  • Member
  • *****
  • Posts: 12697
  • Assembler is fun ;-)
    • MasmBasic
Re: Code location sensitivity of timings
« Reply #39 on: July 26, 2014, 01:52:14 AM »
Code:
align 16
rep movsb

Check if the align is really needed. In the worst case, 15 bytes of code are inserted there. One common trick is to insert the bytes needed before the entry into the proc.
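
To put numbers on that: the padding an assembler has to emit for an align directive depends only on the current address, and for align 16 it ranges from 0 to 15 bytes. The C sketch below computes it, and also shows the "pad before the proc entry" idea: pick the entry padding so that the hot instruction lands on a boundary without any filler in the executed path. The function names and the addresses in the usage note are made up for illustration.

```c
#include <stdint.h>

/* Bytes of padding needed so that `address` moves up to the next
   multiple of `alignment` (a power of two, e.g. 16). Returns 0 when
   the address is already aligned. */
static unsigned padding_for_align(uintptr_t address, unsigned alignment)
{
    uintptr_t mask = (uintptr_t)alignment - 1;
    return (unsigned)((alignment - (address & mask)) & mask);
}

/* The trick from the post: instead of padding right in front of the
   target instruction, pad in front of the proc entry. `offset_to_target`
   is the distance in bytes from the proc entry to the instruction that
   should be aligned (hypothetical value, read from a listing). */
static unsigned padding_before_entry(uintptr_t entry_address,
                                     unsigned offset_to_target,
                                     unsigned alignment)
{
    return padding_for_align(entry_address + offset_to_target, alignment);
}
```

For example, with a proc at 0x401000 and the rep movsb 0x13 bytes into it, `padding_before_entry(0x401000, 0x13, 16)` is 13: moving the entry to 0x40100D puts the instruction at 0x401020, and the 13 filler bytes sit outside the proc where they are never executed.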

dedndave

  • Member
  • *****
  • Posts: 8828
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #40 on: July 26, 2014, 02:15:52 AM »
nidud - i hope you're using the one in this post

http://masm32.com/board/index.php?topic=3373.msg35658#msg35658

Code:
;EAX return bits:
;0 = MMX
;1 = SSE
;2 = SSE2
;3 = SSE3
;4 = SSSE3
;5 = SSE4.1
;6 = SSE4.2

i would define the EQUates this way...

Code:
SSE_MMX    equ 1
SSE_SSE    equ 2
SSE_SSE2   equ 4
SSE_SSE3   equ 8
SSE_SSSE3  equ 10h
SSE_SSE41  equ 20h
SSE_SSE42  equ 40h

Code:
    call    GetSseLevel
    test    al,SSE_SSE3
    jnz     sse3_supported

the EQUates you have would be ok for BT, i suppose   :P
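
The difference between the two styles is bit number versus bit mask: BT takes a bit index (0 to 6 here), while TEST takes a mask (1 shifted left by that index). A C sketch of the same idea follows. The `BIT_*` names and `sse_level_from_cpuid` are illustrative only; the decoder is a guess at what a routine like GetSseLevel has to do, not the code from the linked post. The CPUID leaf 1 bit positions it reads are the documented ones (EDX 23/25/26 for MMX/SSE/SSE2, ECX 0/9/19/20 for SSE3/SSSE3/SSE4.1/SSE4.2).

```c
#include <stdint.h>

/* Bit numbers, as in the "EAX return bits" list (what BT would take). */
enum { BIT_MMX, BIT_SSE, BIT_SSE2, BIT_SSE3, BIT_SSSE3, BIT_SSE41, BIT_SSE42 };

/* Masks, as in the suggested EQUates (what TEST would take). */
#define SSE_MMX    (1u << BIT_MMX)    /* 1   */
#define SSE_SSE    (1u << BIT_SSE)    /* 2   */
#define SSE_SSE2   (1u << BIT_SSE2)   /* 4   */
#define SSE_SSE3   (1u << BIT_SSE3)   /* 8   */
#define SSE_SSSE3  (1u << BIT_SSSE3)  /* 10h */
#define SSE_SSE41  (1u << BIT_SSE41)  /* 20h */
#define SSE_SSE42  (1u << BIT_SSE42)  /* 40h */

/* Hypothetical decoder: packs the feature flags returned by CPUID
   leaf 1 (in ECX and EDX) into the layout listed above. */
static unsigned sse_level_from_cpuid(uint32_t ecx, uint32_t edx)
{
    unsigned level = 0;
    if (edx & (1u << 23)) level |= SSE_MMX;
    if (edx & (1u << 25)) level |= SSE_SSE;
    if (edx & (1u << 26)) level |= SSE_SSE2;
    if (ecx & (1u << 0))  level |= SSE_SSE3;
    if (ecx & (1u << 9))  level |= SSE_SSSE3;
    if (ecx & (1u << 19)) level |= SSE_SSE41;
    if (ecx & (1u << 20)) level |= SSE_SSE42;
    return level;
}
```

With the masks, the check is `level & SSE_SSE3` (the TEST form); with the bit numbers it is `(level >> BIT_SSE3) & 1` (the BT form). Both give the same answer, so which set of equates is better depends only on which instruction the caller prefers.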

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 9564
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Code location sensitivity of timings
« Reply #41 on: July 26, 2014, 02:55:30 AM »
Here is a test piece that uses 4 copies of the same simple byte-intensive algo; it runs the 4 versions and times each one. The idea was to test whether the identical algo in 4 locations produced any difference in timing, but on my old quad the results are almost perfectly identical, even across multiple runs.

I get this result.


File length = 977426

828 ms
828 ms
828 ms
828 ms
Press any key to continue ...
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

nidud

  • Member
  • *****
  • Posts: 2388
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #42 on: July 26, 2014, 02:57:09 AM »
deleted
« Last Edit: February 25, 2022, 08:29:25 AM by nidud »

dedndave

  • Member
  • *****
  • Posts: 8828
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #43 on: July 26, 2014, 03:03:03 AM »
ok - that one does not preserve EBX - but, it's probably ok, in this case
just so you are aware, CPUID destroys the contents of EBX   :t

nidud

  • Member
  • *****
  • Posts: 2388
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #44 on: July 26, 2014, 03:30:28 AM »
deleted
« Last Edit: February 25, 2022, 08:29:39 AM by nidud »