Author Topic: Code location sensitivity of timings  (Read 36180 times)

jj2007

  • Member
  • *****
  • Posts: 11783
  • Assembler is fun ;-)
    • MasmBasic
Re: Code location sensitivity of timings
« Reply #30 on: July 25, 2014, 06:33:53 AM »
Quote
with regard to memcpy there seems little gain using SSE
...
conclusion:
- in newer CPUs MOVSB is faster than moving blocks
- in older CPUs MOVSB gets faster with size
- SSE may be faster depending on CPU

Or, in short: Everything is more complicated than you think.

nidud

  • Member
  • *****
  • Posts: 2311
    • https://github.com/nidud/asmc
Re: Code location sensitivity of timings
« Reply #31 on: July 25, 2014, 07:19:55 AM »
 :biggrin:

Yes, it’s possible to complicate things, I guess, and the link you provide covers a lot of complicated issues but offers few constructive conclusions for the problem at hand.

My conclusion is that moving memory is a hardware issue that improves with newer hardware. If you look at the latest version of MEMCPY.ASM provided by Intel, you basically see what they are working on in the hardware (and how):

Code: [Select]
; See if Enhanced Fast Strings is supported.
; ENFSTRG supported?
bt __favor, __FAVOR_ENFSTRG
jnc CopyUpSSE2Check ; no jump
;
; use Enhanced Fast Strings
rep movsb
jmp TrailUp0 ; Done
CopyUpSSE2Check:
;
; Next, see if we can use a "fast" copy SSE2 routine

jj2007

Re: Code location sensitivity of timings
« Reply #32 on: July 25, 2014, 06:01:49 PM »
Quote
:biggrin:

Yes, it’s possible to complicate things, I guess, and the link you provide covers a lot of complicated issues but offers few constructive conclusions for the problem at hand.

The table there looked different for each and every CPU we tested (try the latest version yourself). So the choice was between an algo that gives reasonable speed on most of them and the stony road of checking the CPU family and branching to a specialised routine. For MasmBasic's MbCopy, rep movsd won the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...). Good to see that Intel keeps pushing this line, too. I wouldn't use rep movsb for the whole copy, though, as many of the not-so-recent CPUs are very slow with the byte variant of movs.

      push ecx
      shr ecx, 2           ; divide count by 4
      rep movsd            ; copy DWORD size blocks
      pop ecx              ; Reload byte count
      and ecx, 3           ; get the rest
      rep movsb            ; copy the rest
      xchg eax, edi        ; for CAT$, return a pointer to the end of the destination;
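
For readers who think more easily in C, here is a minimal sketch of the same split. It is only a model of the snippet above, not MasmBasic's actual code: the name copy_dwords_then_bytes is made up for illustration, and the two loops stand in for rep movsd and rep movsb.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of the rep movsd / rep movsb split (hypothetical helper name).
   Returns a pointer to the end of the destination, like the xchg eax, edi line. */
unsigned char *copy_dwords_then_bytes(unsigned char *dst,
                                      const unsigned char *src,
                                      size_t count)
{
    size_t dwords = count >> 2;   /* shr ecx, 2: number of 4-byte blocks */
    size_t rest   = count & 3;    /* and ecx, 3: 0..3 leftover bytes */

    for (size_t i = 0; i < dwords; i++) {   /* stands in for rep movsd */
        uint32_t w;
        memcpy(&w, src, 4);
        memcpy(dst, &w, 4);
        src += 4;
        dst += 4;
    }
    for (size_t i = 0; i < rest; i++)       /* stands in for rep movsb */
        *dst++ = *src++;

    return dst;
}
```

With count = 11 this does two 4-byte moves followed by three single-byte moves.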

sinsi

  • Guest
Re: Code location sensitivity of timings
« Reply #33 on: July 25, 2014, 08:10:40 PM »
How much of a difference would RAM speed make? DDR3 speeds can be 1333/1600/1866/2133/2400.

nidud

Re: Code location sensitivity of timings
« Reply #34 on: July 25, 2014, 10:21:23 PM »
Quote
For MasmBasic's MbCopy, rep movsd won the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...).

      push ecx
      shr ecx, 2           ; divide count by 4
      rep movsd            ; copy DWORD size blocks
      pop ecx              ; Reload byte count
      and ecx, 3           ; get the rest
      rep movsb            ; copy the rest

1:
Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count

mov eax,ecx
shr ecx,2
align 4
rep movsd
and eax,11B
mov ecx,eax
rep movsb

mov eax,dst
ret
memcpy  endp

2:
Code: [Select]
push ecx
shr ecx,2
align 4
rep movsd
pop ecx
and ecx,11B
rep movsb

3:
Code: [Select]
align 4
rep movsb


AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
-- aligned strings --
1082814     cycles -  10 (  0) 0: crt_memcpy
1075836     cycles -  10 ( 38) 1: movsd - mov eax,ecx
1079960     cycles -  10 ( 37) 2: movsd - push ecx
1075643     cycles -  10 ( 27) 3: movsb
-- unaligned strings --
1923314     cycles -  10 (  0) 0: crt_memcpy
1959236     cycles -  10 ( 38) 1: movsd - mov eax,ecx
1957439     cycles -  10 ( 37) 2: movsd - push ecx
8000102     cycles -  10 ( 27) 3: movsb
-- short strings 15 --
210825     cycles - 8000 (  0) 0: crt_memcpy
320028     cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344116     cycles - 8000 ( 37) 2: movsd - push ecx
312027     cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1614638     cycles - 8000 (  0) 0: crt_memcpy
1676166     cycles - 8000 ( 38) 1: movsd - mov eax,ecx
1705396     cycles - 8000 ( 37) 2: movsd - push ecx
3633202     cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
5307541     cycles - 4000 (  0) 0: crt_memcpy
5823146     cycles - 4000 ( 38) 1: movsd - mov eax,ecx
5827790     cycles - 4000 ( 37) 2: movsd - push ecx
12299823    cycles - 4000 ( 27) 3: movsb

Gunther

  • Member
  • *****
  • Posts: 3802
  • Forgive your enemies, but never forget their names
Re: Code location sensitivity of timings
« Reply #35 on: July 25, 2014, 10:56:33 PM »
Hi sinsi,

your memcpy application gives:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498064    cycles -  10 (  0) 0: crt_memcpy
890775    cycles -  10 ( 38) 1: movsd - mov eax,ecx
892888    cycles -  10 ( 37) 2: movsd - push ecx
353318    cycles -  10 ( 27) 3: movsb
-- unaligned strings --
1006514   cycles -  10 (  0) 0: crt_memcpy
1033525   cycles -  10 ( 38) 1: movsd - mov eax,ecx
1033580   cycles -  10 ( 37) 2: movsd - push ecx
377061    cycles -  10 ( 27) 3: movsb
-- short strings 15 --
175505    cycles - 8000 (  0) 0: crt_memcpy
335538    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344226    cycles - 8000 ( 37) 2: movsd - push ecx
291953    cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1033175   cycles - 8000 (  0) 0: crt_memcpy
952811    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
959677    cycles - 8000 ( 37) 2: movsd - push ecx
566948    cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
3224879   cycles - 4000 (  0) 0: crt_memcpy
3153708   cycles - 4000 ( 38) 1: movsd - mov eax,ecx
3151176   cycles - 4000 ( 37) 2: movsd - push ecx
930276    cycles - 4000 ( 27) 3: movsb
--- ok ---

Gunther
Get your facts first, and then you can distort them.

nidud

Re: Code location sensitivity of timings
« Reply #36 on: July 25, 2014, 11:41:54 PM »
Well, there is the future for you right there  :biggrin:

Good stuff (and hardware)  :t

I was thinking of using BT sselevel,? with the result from Dave's function above. However, I have SSE3, and the SSE function seems faster at that level, so the test could then be: SSE3 and below use these functions. If you could also try this one to see whether MOVSB is faster than SSE too, that would be good.

In this test I align EDI in all procs:
Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count

test ecx,-4
jz @F
mov eax,[esi]
mov [edi],eax
mov eax,edi
neg eax
and eax,11B
add edi,eax
add esi,eax
sub ecx,eax

...

align 16
@@: rep movsb

mov eax,dst
ret
memcpy  endp

and added a SSE proc:
Code: [Select]
memcpy  proc uses esi edi dst, src, count
mov edi,dst
mov esi,src
mov ecx,count
;
; need 16 byte for overlap..
;
test ecx,-16
jz tail
;
; fix tail bytes and aligned bytes
;
movdqu  xmm0,[esi]
movdqu  [edi],xmm0
movdqu  xmm0,[esi+ecx-16]
movdqu  [edi+ecx-16],xmm0
;
; align EDI 16
;
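mov eax,edi
neg eax
and eax,1111B
add edi,eax
add esi,eax
sub ecx,eax ; bytes left after the alignment step
and ecx,-16 ; whole 16-byte blocks only
jz toend ; head and tail moves already covered everything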

align 16
lup: sub ecx,16
movdqu  xmm0,[esi+ecx]  ; do  copy
movdqa  [edi+ecx],xmm0  ; aligned move
jnz lup
align 16
toend:
mov eax,dst
ret
align 4
tail: test ecx,ecx
jz toend
test ecx,-2
jz @1
test ecx,-4
jz @2
test ecx,-8
jz @4
movq xmm0,[esi] ; move 8..15 byte
movq [edi],xmm0 ; |8...|
movq xmm0,[esi+ecx-8] ; |...8|
movq [edi+ecx-8],xmm0
jmp toend
align 4
@4: mov eax,[esi]
mov [edi],eax
mov eax,[esi+ecx-4]
mov [edi+ecx-4],eax
jmp toend
align 4
@2: mov ax,[esi] ; move 2..3 byte
mov [edi],ax
mov al,[esi+ecx-1] ; last byte (the second byte again when count is 2)
mov [edi+ecx-1],al
jmp toend
align 4
@1: mov al,[esi]
mov [edi],al
jmp toend
memcpy  endp
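
A rough C model of the strategy in the SSE proc, for anyone who wants to check the arithmetic without an assembler. It is a sketch, not a translation: copy_block and model_copy are hypothetical names, and each copy_block call stands in for one load/store pair. The middle loop assumes that the byte count is reduced by the alignment adjustment before it is rounded down to whole 16-byte blocks, so it never runs past the end of the destination. Source and destination are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Moves n (<= 16) bytes as one unit; stands in for a single load/store pair. */
static void copy_block(unsigned char *d, const unsigned char *s, size_t n)
{
    unsigned char tmp[16];
    memcpy(tmp, s, n);
    memcpy(d, tmp, n);
}

/* Overlapping head/tail blocks plus a destination-aligned middle loop. */
void *model_copy(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (count >= 16) {
        copy_block(d, s, 16);                             /* first 16 bytes */
        copy_block(d + count - 16, s + count - 16, 16);   /* last 16 bytes  */

        size_t adj = (size_t)(-(uintptr_t)d & 15);  /* bytes to next 16-byte boundary */
        d += adj;
        s += adj;
        size_t body = (count - adj) & ~(size_t)15;  /* whole blocks left in the middle */
        while (body != 0) {                         /* walks from the top down */
            body -= 16;
            copy_block(d + body, s + body, 16);
        }
    } else if (count >= 8) {                        /* 8..15: two overlapping qwords */
        copy_block(d, s, 8);
        copy_block(d + count - 8, s + count - 8, 8);
    } else if (count >= 4) {                        /* 4..7: two overlapping dwords */
        copy_block(d, s, 4);
        copy_block(d + count - 4, s + count - 4, 4);
    } else if (count >= 2) {                        /* 2..3: a word plus the last byte */
        copy_block(d, s, 2);
        copy_block(d + count - 1, s + count - 1, 1);
    } else if (count == 1) {
        d[0] = s[0];
    }
    return dst;
}
```

The point of the overlapping moves is that no byte-granular loop is needed for the tail: whatever the middle loop leaves uncovered at either end has already been written by the head or tail move.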

and now I get this result:
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
-- aligned strings --
1075537     cycles -  10 (  0) 0: crt_memcpy
1075123     cycles -  10 ( 75) 1: movsd - mov eax,ecx
1075128     cycles -  10 ( 75) 2: movsd - push ecx
1076079     cycles -  10 ( 59) 3: movsb
846397     cycles -  10 (182) 4: SSE
-- unaligned strings --
1111510     cycles -  10 (  0) 0: crt_memcpy
1106994     cycles -  10 ( 75) 1: movsd - mov eax,ecx
1110818     cycles -  10 ( 75) 2: movsd - push ecx
1108074     cycles -  10 ( 59) 3: movsb
852071     cycles -  10 (182) 4: SSE
-- short strings 15 --
200777     cycles - 8000 (  0) 0: crt_memcpy
312027     cycles - 8000 ( 75) 1: movsd - mov eax,ecx
328222     cycles - 8000 ( 75) 2: movsd - push ecx
288027     cycles - 8000 ( 59) 3: movsb
112031     cycles - 8000 (182) 4: SSE
-- short strings 271 --
1304626     cycles - 8000 (  0) 0: crt_memcpy
1304307     cycles - 8000 ( 75) 1: movsd - mov eax,ecx
1338701     cycles - 8000 ( 75) 2: movsd - push ecx
1568333     cycles - 8000 ( 59) 3: movsb
498757     cycles - 8000 (182) 4: SSE
-- short strings 2014 --
2439537     cycles - 4000 (  0) 0: crt_memcpy
2458600     cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2465365     cycles - 4000 ( 75) 2: movsd - push ecx
2448965     cycles - 4000 ( 59) 3: movsb
1139728     cycles - 4000 (182) 4: SSE


RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Code location sensitivity of timings
« Reply #37 on: July 26, 2014, 01:14:05 AM »
Seems to be better

x1 = 2051543 / 884593  = 2.3191942509153927 (8000) ~2.32
x2 = 5844930 / 2504176 = 2.3340731641865428 (4000) ~2.33

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

-----------------------------------------------------
-- aligned strings --
1188974   cycles -  10 (  0) 0: crt_memcpy
1097640   cycles -  10 ( 75) 1: movsd - mov eax,ecx
1103251   cycles -  10 ( 75) 2: movsd - push ecx
1102906   cycles -  10 ( 59) 3: movsb
1310185   cycles -  10 (182) 4: SSE
-- unaligned strings --
2595543   cycles -  10 (  0) 0: crt_memcpy
2620959   cycles -  10 ( 75) 1: movsd - mov eax,ecx
2611443   cycles -  10 ( 75) 2: movsd - push ecx
7866087   cycles -  10 ( 59) 3: movsb
1358767   cycles -  10 (182) 4: SSE
-- short strings 15 --
 343706    cycles - 8000 (  0) 0: crt_memcpy
 789893    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
 808747    cycles - 8000 ( 75) 2: movsd - push ecx
2039809   cycles - 8000 ( 59) 3: movsb
237595    cycles - 8000 (182) 4: SSE
-- short strings 271 --
2051543   cycles - 8000 (  0) 0: crt_memcpy
2096801   cycles - 8000 ( 75) 1: movsd - mov eax,ecx
2083586   cycles - 8000 ( 75) 2: movsd - push ecx
7495329   cycles - 8000 ( 59) 3: movsb
884593    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
  5844930   cycles - 4000 (  0) 0: crt_memcpy
  6057324   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
  5890555   cycles - 4000 ( 75) 2: movsd - push ecx
22533778  cycles - 4000 ( 59) 3: movsb
  2504176   cycles - 4000 (182) 4: SSE
--- ok ---
« Last Edit: July 26, 2014, 03:49:33 AM by RuiLoureiro »

nidud

Re: Code location sensitivity of timings
« Reply #38 on: July 26, 2014, 01:46:55 AM »
I will assume the tipping point is then at SSE4.1.

from Dave's function the bits will be like this:
Code: [Select]
SSEBT_MMX equ 0
SSEBT_SSE equ 1
SSEBT_SSE2 equ 2
SSEBT_SSE3 equ 3
SSEBT_SSSE3 equ 4
SSEBT_SSE41 equ 5
SSEBT_SSE42 equ 6

and the copy function will then be like this:
Code: [Select]
bt sselevel,SSEBT_SSE41
jnc @F

mov eax,edi
align 16
rep movsb
pop edi
pop esi
ret 12

align 4
@@: ; SSE2 copy..

As implemented now, it will exit if the level is below SSE2,
and using the level above still needs testing.
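
The same selection can be sketched in C as a one-time choice of routine. This is only an illustration of the idea: select_copy, copy_movsb_path and copy_sse2_path are hypothetical names, both stand-in routines simply call memcpy, and the SSE4.1 threshold is the assumption from this post, not a measured result.

```c
#include <stddef.h>
#include <string.h>

/* Bit indexes as listed in the post. */
enum { SSEBT_SSE2 = 2, SSEBT_SSE41 = 5 };

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical stand-ins for the rep movsb path and the SSE2 path. */
static void *copy_movsb_path(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *copy_sse2_path(void *d, const void *s, size_t n)  { return memcpy(d, s, n); }

/* Picks a routine from the feature bits; NULL means "below SSE2, give up". */
copy_fn select_copy(unsigned sselevel)
{
    if ((sselevel >> SSEBT_SSE41) & 1u)   /* bt sselevel,SSEBT_SSE41 */
        return copy_movsb_path;
    if ((sselevel >> SSEBT_SSE2) & 1u)
        return copy_sse2_path;
    return NULL;
}
```

Choosing once at startup and calling through the pointer keeps the feature test out of the copy routine itself.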

jj2007

Re: Code location sensitivity of timings
« Reply #39 on: July 26, 2014, 01:52:14 AM »
Code: [Select]
align 16
rep movsb

Check if the align is really needed. In the worst case, 15 bytes of code are inserted there. One common trick is to insert the bytes needed before the entry into the proc.
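
To put numbers on that worst case, here is a small C sketch of how many filler bytes an align directive has to emit at a given code offset. The helper name align_padding is made up, and the boundary is assumed to be a power of two.

```c
#include <stddef.h>

/* Filler bytes needed to reach the next multiple of boundary
   (boundary must be a power of two). */
size_t align_padding(size_t offset, size_t boundary)
{
    return (boundary - (offset & (boundary - 1))) & (boundary - 1);
}
```

For align 16, an instruction that would start at offset 21h needs 15 filler bytes, one at 28h needs 8, and one at 30h needs none.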

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Code location sensitivity of timings
« Reply #40 on: July 26, 2014, 02:15:52 AM »
nidud - i hope you're using the one in this post

http://masm32.com/board/index.php?topic=3373.msg35658#msg35658

Code: [Select]
;EAX return bits:
;0 = MMX
;1 = SSE
;2 = SSE2
;3 = SSE3
;4 = SSSE3
;5 = SSE4.1
;6 = SSE4.2

i would define the EQUates this way...

Code: [Select]
SSE_MMX    equ 1
SSE_SSE    equ 2
SSE_SSE2   equ 4
SSE_SSE3   equ 8
SSE_SSSE3  equ 10h
SSE_SSE41  equ 20h
SSE_SSE42  equ 40h

Code: [Select]
    call    GetSseLevel
    test    al,SSE_SSE3
    jnz     sse3_supported

the EQUates you have would be ok for BT, i suppose   :P
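
The two styles describe the same bits: a BT-style constant is a bit index, a TEST-style constant is the mask 1 << index. A minimal C sketch of the difference follows; has_by_index and has_by_mask are hypothetical helper names, and the constants mirror the SSE3 values from the two lists.

```c
enum { SSEBT_SSE3 = 3 };                /* bit index, as used with BT   */
enum { SSE_SSE3 = 1 << SSEBT_SSE3 };    /* mask (8), as used with TEST  */

/* BT-style check: shift the wanted bit down and isolate it. */
int has_by_index(unsigned level, unsigned bit)
{
    return (int)((level >> bit) & 1u);
}

/* TEST-style check: AND with the mask and look for a nonzero result. */
int has_by_mask(unsigned level, unsigned mask)
{
    return (level & mask) != 0;
}
```

Mixing the two sets is an easy slip: using the index 3 as a mask tests the MMX and SSE bits instead of the SSE3 bit.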

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 8755
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Code location sensitivity of timings
« Reply #41 on: July 26, 2014, 02:55:30 AM »
Here is a test piece that uses 4 copies of the same simple byte-intensive algo; it runs the 4 versions and times each one. The idea was to test whether the identical algo in 4 locations produced any difference in timing, but on my old quad they are almost perfectly identical, even over multiple runs.

I get this result.


File length = 977426

828 ms
828 ms
828 ms
828 ms
Press any key to continue ...
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

nidud

Re: Code location sensitivity of timings
« Reply #42 on: July 26, 2014, 02:57:09 AM »
Quote
Check if the align is really needed
I normally tune them from the list file at the end

00000000         memcpy  proc dst, src, count
00000000  56            push   esi
00000001  57            push   edi
00000002  8B7C240C         mov   edi,[esp+12]
00000006  8B742410         mov   esi,[esp+16]
0000000A  8B4C2414         mov   ecx,[esp+20]
            ifndef __SSE_
0000000E  F7C1F8FFFFFF         test   ecx,-8
00000014  7412            jz   @F
00000016  8B06            mov   eax,[esi]
00000018  8907            mov   [edi],eax
0000001A  8BC7            mov   eax,edi
0000001C  F7D8            neg   eax
0000001E  83E003         and   eax,11B
00000021  90            nop
00000022  03F8            add   edi,eax
00000024  03F0            add   esi,eax
00000026  2BC8            sub   ecx,eax
00000028         @@:
00000028  F3A4            rep   movsb
0000002A  8B44240C         mov   eax,[esp+12]
0000002E  5F            pop   edi
0000002F  5E            pop   esi
00000030  C20C00         ret   12
            endif
00000033         memcpy  endp


Quote
nidud - i hope you're using the one in this post
I'm using this one:
http://masm32.com/board/index.php?topic=3396.msg36278#msg36278

Quote
i would define the EQUates this way...
I define them this way  :biggrin:
Code: [Select]
SSE_MMX equ 00000001B
SSE_SSE equ 00000010B
SSE_SSE2 equ 00000100B
SSE_SSE3 equ 00001000B
SSE_SSSE3 equ 00010000B
SSE_SSE41 equ 00100000B
SSE_SSE42 equ 01000000B

Quote
the EQUates you have would be ok for BT, i suppose   :P

BT is the fastest opcode there is, methinks   :P

dedndave

Re: Code location sensitivity of timings
« Reply #43 on: July 26, 2014, 03:03:03 AM »
ok - that one does not preserve EBX - but, it's probably ok, in this case
just so you are aware, CPUID destroys the contents of EBX   :t

nidud

Re: Code location sensitivity of timings
« Reply #44 on: July 26, 2014, 03:30:28 AM »
when I run the program I get this:

1344 ms
1344 ms
1343 ms
2016 ms
...
1343 ms
1344 ms
1344 ms
2015 ms
...
1344 ms
1359 ms
1344 ms
2016 ms


So in this case the "code buffer" manages to read 3.5 of the procs, but in the middle of the last proc it needs to read in more code to execute. If I increase the size of each proc by inserting db 64 dup(90h) in each of them, I get this result:

2031 ms
1344 ms
1344 ms
2015 ms
...
2016 ms
1359 ms
1344 ms
2016 ms
...
2015 ms
1360 ms
1344 ms
2015 ms

so now it manages to read 2.5 of the procs...
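
One way to reason about this kind of effect is to count how many fixed-size fetch blocks a proc touches at a given start offset. The C sketch below does only that arithmetic. The name blocks_spanned is made up, and the block size is an assumption to be filled in per CPU (64 bytes in the example); it does not model what any particular CPU actually does when fetching code.

```c
#include <stddef.h>

/* Number of block-sized units touched by code occupying [start, start + len).
   block must be nonzero. */
size_t blocks_spanned(size_t start, size_t len, size_t block)
{
    if (len == 0)
        return 0;
    return (start + len - 1) / block - start / block + 1;
}
```

With 64-byte blocks, a 64-byte proc at offset 0 touches one block, while the same proc at offset 32 touches two, so identical code can behave differently purely because of where it sits.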