Author Topic: Optimizing some code  (Read 17562 times)

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #45 on: June 15, 2014, 02:18:58 AM »
Hi nidud,

results for strlen4:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
18212   cycles - 0: standard (scasb)
1871    cycles - 1: AgnerFog
1962    cycles - 2: AgnerFog (unaligned)
2789    cycles - 3: Dave

11414   cycles - 0: standard (scasb)
4372    cycles - 1: AgnerFog
4257    cycles - 2: AgnerFog (unaligned)
6608    cycles - 3: Dave

17074   cycles - 0: standard (scasb)
4358    cycles - 1: AgnerFog
4204    cycles - 2: AgnerFog (unaligned)
6618    cycles - 3: Dave

--- ok ---


Gunther
Get your facts first, and then you can distort them.

FORTRANS

  • Member
  • *****
  • Posts: 1078
Re: Optimizing some code
« Reply #46 on: June 15, 2014, 05:09:47 AM »
Hi,

   The first time it is run, it is slow on the first test.

Regards,

Steve N.

Code: [Select]
First run.

pre-P4 (SSE1)
------------------------------------------------------
315620  cycles - 0: standard (scasb)
5001    cycles - 1: AgnerFog
4967    cycles - 2: AgnerFog (unaligned)
6741    cycles - 3: Dave

11665   cycles - 0: standard (scasb)
4997    cycles - 1: AgnerFog
4975    cycles - 2: AgnerFog (unaligned)
6764    cycles - 3: Dave

11684   cycles - 0: standard (scasb)
4993    cycles - 1: AgnerFog
4967    cycles - 2: AgnerFog (unaligned)
6780    cycles - 3: Dave

--- ok ---

Second run

pre-P4 (SSE1)
------------------------------------------------------
11813   cycles - 0: standard (scasb)
4992    cycles - 1: AgnerFog
4968    cycles - 2: AgnerFog (unaligned)
6755    cycles - 3: Dave

11679   cycles - 0: standard (scasb)
4985    cycles - 1: AgnerFog
4966    cycles - 2: AgnerFog (unaligned)
6768    cycles - 3: Dave

11696   cycles - 0: standard (scasb)
4993    cycles - 1: AgnerFog
4977    cycles - 2: AgnerFog (unaligned)
6758    cycles - 3: Dave

--- ok ---

 
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
296692 cycles - 0: standard (scasb)
4193 cycles - 1: AgnerFog
4078 cycles - 2: AgnerFog (unaligned)
6334 cycles - 3: Dave
 
11671 cycles - 0: standard (scasb)
4148 cycles - 1: AgnerFog
4064 cycles - 2: AgnerFog (unaligned)
6268 cycles - 3: Dave
 
11675 cycles - 0: standard (scasb)
4209 cycles - 1: AgnerFog
4092 cycles - 2: AgnerFog (unaligned)
6357 cycles - 3: Dave
 
--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
11809 cycles - 0: standard (scasb)
4187 cycles - 1: AgnerFog
4093 cycles - 2: AgnerFog (unaligned)
6219 cycles - 3: Dave
 
11793 cycles - 0: standard (scasb)
4147 cycles - 1: AgnerFog
4074 cycles - 2: AgnerFog (unaligned)
6100 cycles - 3: Dave
 
11810 cycles - 0: standard (scasb)
4211 cycles - 1: AgnerFog
4092 cycles - 2: AgnerFog (unaligned)
6225 cycles - 3: Dave
 
--- ok ---

LarryC

  • Regular Member
  • *
  • Posts: 9
Re: Optimizing some code
« Reply #47 on: June 15, 2014, 05:43:47 AM »

Intel(R) Core(TM) i7 CPU         960  @ 3.20GHz (SSE4)
------------------------------------------------------
10025   cycles - 0: standard (scasb)
6427    cycles - 1: AgnerFog
6444    cycles - 2: AgnerFog (unaligned)
11482   cycles - 3: Dave

15319   cycles - 0: standard (scasb)
7873    cycles - 1: AgnerFog
6176    cycles - 2: AgnerFog (unaligned)
11058   cycles - 3: Dave

14978   cycles - 0: standard (scasb)
7816    cycles - 1: AgnerFog
6251    cycles - 2: AgnerFog (unaligned)
10995   cycles - 3: Dave

--- ok ---

nidud

  • Member
  • *****
  • Posts: 1989
    • https://github.com/nidud/asmc
Re: Optimizing some code
« Reply #48 on: June 15, 2014, 11:29:32 PM »
The results are somewhat random, but in general the unaligned
version is equal or faster.

Here is an SSE2 version using unaligned input:
Code: [Select]
strlen  proc string:dword
        mov     edx,string
        sub     edx,16          ; pre-bias so the loop can add first
        xorps   xmm1,xmm1       ; SSE - preset xmm1 to zero
@@:     add     edx,16
        movdqu  xmm0,[edx]      ; SSE2 - unaligned move (may read past the terminator)
        pcmpeqb xmm0,xmm1       ; SSE2 - compare 16 bytes with zero
        pmovmskb eax,xmm0       ; SSE2 - result in AX
        test    eax,eax         ; bits 0..15 set where a byte is zero
        jz      @B
        bsf     eax,eax         ; get index of first zero byte
        sub     edx,string      ; offset of the current 16-byte block
        add     eax,edx         ; length = block offset + index
        ret
strlen  endp
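
One caveat with the unaligned loads: each `movdqu` can read up to 15 bytes past the terminator, so a string that ends just before an unmapped page could fault. A common workaround is to round the pointer down to a 16-byte boundary, mask off the bytes in front of the string, and continue with aligned loads, which never straddle a page. Here is a minimal C sketch of that idea, assuming SSE2 intrinsics from `<emmintrin.h>`. The name `strlen_sse2_aligned` is hypothetical, and this is not one of the routines timed in this thread.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Reads whole aligned 16-byte blocks, so it touches
 * bytes outside the string itself but never outside the pages that
 * hold the string. */
size_t strlen_sse2_aligned(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t addr = (uintptr_t)s;
    unsigned misalign = (unsigned)(addr & 15);        /* 0..15 */
    const __m128i *p = (const __m128i *)(addr - misalign);
    unsigned mask, idx = 0;

    /* First block: compare, then clear the bits that belong to the
     * bytes in front of the string. */
    mask = (unsigned)_mm_movemask_epi8(
               _mm_cmpeq_epi8(_mm_load_si128(p), zero));
    mask >>= misalign;
    mask <<= misalign;

    /* Remaining blocks: aligned loads until a zero byte shows up. */
    while (mask == 0) {
        ++p;
        mask = (unsigned)_mm_movemask_epi8(
                   _mm_cmpeq_epi8(_mm_load_si128(p), zero));
    }

    /* Index of the lowest set bit (what bsf does in the asm version). */
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++idx;
    }
    return (size_t)(((uintptr_t)p + idx) - addr);
}
```

The price is the extra shift pair on the first block. Whether that costs anything measurable would have to be timed.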

Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------------------
20748   cycles - 0: standard (scasb)
15094   cycles - 3: Dave
14545   cycles - 5: MB - len()
11673   cycles - 1: AgnerFog
10198   cycles - 2: AgnerFog (unaligned)
6715   cycles - 6: MB - Len() SSE
4810   cycles - 4: unaligned SSE2

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #49 on: June 15, 2014, 11:34:59 PM »
Hi nidud,

strlen5:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
------------------------------------------------------
22200   cycles - 0: standard (scasb)
10776   cycles - 3: Dave
10271   cycles - 5: MB - len()
7120    cycles - 1: AgnerFog
7264    cycles - 2: AgnerFog (unaligned)
3007    cycles - 6: MB - Len() SSE
2280    cycles - 4: unaligned SSE2

21590   cycles - 0: standard (scasb)
10616   cycles - 3: Dave
10136   cycles - 5: MB - len()
7059    cycles - 1: AgnerFog
17323   cycles - 2: AgnerFog (unaligned)
7226    cycles - 6: MB - Len() SSE
5413    cycles - 4: unaligned SSE2

52253   cycles - 0: standard (scasb)
25722   cycles - 3: Dave
24451   cycles - 5: MB - len()
17339   cycles - 1: AgnerFog
17349   cycles - 2: AgnerFog (unaligned)
7205    cycles - 6: MB - Len() SSE
6116    cycles - 4: unaligned SSE2

--- ok ---

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 1989
    • https://github.com/nidud/asmc
Re: Optimizing some code
« Reply #50 on: June 16, 2014, 01:29:53 AM »
Quote
7120   cycles - 1: AgnerFog
7264   cycles - 2: AgnerFog (unaligned)
7059   cycles - 1: AgnerFog
17323   cycles - 2: AgnerFog (unaligned)
17339   cycles - 1: AgnerFog
17349   cycles - 2: AgnerFog (unaligned)

Some random results there...

However, using aligned input is faster on long strings:
Code: [Select]
1200 cycles - 6: MB - Len() SSE - 32 byte
678 cycles - 4: unaligned SSE2
2741 cycles - 6: MB - Len() SSE - 64 byte
1838 cycles - 4: unaligned SSE2
176010  cycles - 6: MB - Len() SSE - 1000 byte
166282  cycles - 4: unaligned SSE2

Testing only large strings will flip the result.

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: Optimizing some code
« Reply #51 on: June 16, 2014, 02:38:24 AM »
strlen5
Prescott w/ HTT
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
86037   cycles - 0: standard (scasb)
31180   cycles - 3: Dave
33575   cycles - 5: MB - len()
23079   cycles - 1: AgnerFog
25595   cycles - 2: AgnerFog (unaligned)
21374   cycles - 6: MB - Len() SSE
18166   cycles - 4: unaligned SSE2

49577   cycles - 0: standard (scasb)
31080   cycles - 3: Dave
32727   cycles - 5: MB - len()
23139   cycles - 1: AgnerFog
25405   cycles - 2: AgnerFog (unaligned)
21643   cycles - 6: MB - Len() SSE
18152   cycles - 4: unaligned SSE2

49638   cycles - 0: standard (scasb)
31000   cycles - 3: Dave
32762   cycles - 5: MB - len()
23151   cycles - 1: AgnerFog
31292   cycles - 2: AgnerFog (unaligned)
21172   cycles - 6: MB - Len() SSE
18204   cycles - 4: unaligned SSE2