Author Topic: Optimizing some code  (Read 17611 times)

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Optimizing some code
« Reply #30 on: June 11, 2014, 11:14:33 PM »

Quote
yep, that was better
What ? Did you see it ? Where is it ? Show us.
We should compare only comparable things.

the test depends on what you're after
most strings are not aligned
well - BSTR's are - and strings that are inside structures probably are
otherwise... you have to devise a test that tests all alignments

just as an example, i attached a test to look at

1) select a single core, and wait 750 mS to bind before testing
2) select loop counts that yield ~0.5 seconds per pass
3) all alignments are tested
(notice that each string is differently aligned)
4) 16 strings are tested - the overall result is divided by 16 and rounded to nearest
5) you get fewer outliers if you open a console window and type the program name than if you click on it
6) the test should show the processor (this one does not)   :P

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #31 on: June 11, 2014, 11:33:41 PM »
Results for Jochen:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

9731    cycles for 100 * Rui
1146    cycles for 100 * MB
3871    cycles for 100 * strlen32
2730    cycles for 100 * strlen32M
8707    cycles for 100 * Habran
5088    cycles for 100 * ShortLen (Dave)
5891    cycles for 100 * slen (Hutch)

8489    cycles for 100 * Rui
2387    cycles for 100 * MB
3879    cycles for 100 * strlen32
2722    cycles for 100 * strlen32M
8698    cycles for 100 * Habran
5101    cycles for 100 * ShortLen (Dave)
5866    cycles for 100 * slen (Hutch)

18823   cycles for 100 * Rui
2029    cycles for 100 * MB
7357    cycles for 100 * strlen32
7455    cycles for 100 * strlen32M
20309   cycles for 100 * Habran
12587   cycles for 100 * ShortLen (Dave)
17441   cycles for 100 * slen (Hutch)

20010   cycles for 100 * Rui
2033    cycles for 100 * MB
7387    cycles for 100 * strlen32
7442    cycles for 100 * strlen32M
19070   cycles for 100 * Habran
12663   cycles for 100 * ShortLen (Dave)
18717   cycles for 100 * slen (Hutch)

100     = eax Rui
100     = eax MB
100     = eax strlen32
100     = eax strlen32M
100     = eax Habran
100     = eax ShortLen (Dave)
100     = eax slen (Hutch)

--- ok ---

Results for Rui:
Code: [Select]

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
STRLEN test:
short - <1,3,6,10,20,30,40,50,70>
long  - <100,200,300,400,500,700,1000>
------------------------------------------------------
920     cycles - std
775     cycles - Rui
562     cycles - habran
378     cycles - Dave
544     cycles - Hutch
406     cycles - JJ
332     cycles - Rui32
246     cycles - Rui32M
209     cycles - JJ2

14480   cycles - std
10105   cycles - Rui
14099   cycles - habran
8649    cycles - Dave
9680    cycles - Hutch
7488    cycles - JJ
4513    cycles - Rui32
4634    cycles - Rui32M
1162    cycles - JJ2

--- ok ---

Gunther
Get your facts first, and then you can distort them.

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #32 on: June 11, 2014, 11:35:46 PM »
Dave,
          must show the proc.
          it must be clear. otherwise i give him 0.
          if in the school, he must explain all bits.
          This is my rule:
          i don't accept any results unless
          you show your exercise.

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #33 on: June 12, 2014, 12:14:01 AM »
Rui,

Dave,
          must show the proc.
          it must be clear. otherwise i give him 0.
          if in the school, he must explain all bits.
          This is my rule:
          i don't accept any results unless
          you show your exercise.
but slen.zip contains the source. What's your point?

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 2001
    • https://github.com/nidud/asmc
Re: Optimizing some code
« Reply #34 on: June 12, 2014, 12:23:04 AM »
i don't understand what that means
It means that the algo doesn't work: A string length 3 returns 10.
Code: [Select]
.data

strx macro x
str&x&  db x dup('x')
db 0
endm
for x,<1,3,6,10,20,30,40,50,70,100,200,300,400,500,700,1000>
strx x
endm

.code

for x,<1,3,6,10,20,30,40,50,70,100,200,300,400,500,700,1000>
invoke ShortLen,offset str&x&
.if eax != x
print uhex$(eax), " != &x& - Dave: STRLEN error", 13, 10
.endif
endm

result:
Code: [Select]
00000301 != 1 - Dave: STRLEN error
0000000A != 3 - Dave: STRLEN error
000003BC != 700 - Dave: STRLEN error

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #35 on: June 12, 2014, 12:40:12 AM »
Gunther,
              what are you talking about ?
              Could you post here the proc
              that i am talking about ?

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #36 on: June 12, 2014, 12:42:41 AM »
Rui,

no offense. But the zip archive under post #30 contains the source. Or do you mean another procedure?

Gunther
Get your facts first, and then you can distort them.

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #37 on: June 12, 2014, 12:46:41 AM »
No, iam not talking about it.
Do you know this:
This is the place to post assembler algorithms and code design for discussion, optimisation and any other...
You may do any tests you want to do with the code
i posted. I want to do the same. That's the point.
Read my reply 10.

nidud,
          it is not correct to call "Rui32" and "Rui32M" but
          AgnerFog.

:biggrin: :biggrin:
EDIT: Gunther,
                      Could i have a discussion with you about "gambuzinos" ?
("gambuzino" is a creature that noone never saw him, noone never catch him. Sometimes we say to one: go to hunt "gambuzinos".)

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #38 on: June 12, 2014, 04:57:59 AM »
Rui,

EDIT: Gunther,
                      Could i have a discussion with you about "gambuzinos" ?
("gambuzino" is a creature that noone never saw him, noone never catch him. Sometimes we say to one: go to hunt "gambuzinos".)

I've read your post #10. I think that I'm talking about algorithms, I'm posting test results (not only in your thread), I'm not talking about gambuzinos, Yetis and other impossibilities.

But anyway, it's your thread. My apology, I won't post into your threads in the future.

Gunther
Get your facts first, and then you can distort them.

RuiLoureiro

  • Member
  • ****
  • Posts: 819
Re: Optimizing some code
« Reply #39 on: June 12, 2014, 05:53:36 AM »
Hi
What did you do wrong, Gunther ?
I never saw anything wrong. It's clear.
My apology.

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Optimizing some code
« Reply #40 on: June 12, 2014, 09:36:12 AM »
fixed my routine - and added ShowCpu   :P
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
134 135 135 135 134
141 141 141 140 141

FORTRANS

  • Member
  • *****
  • Posts: 1078
Re: Optimizing some code
« Reply #41 on: June 12, 2014, 11:23:18 PM »
Hi Dave,

   Here are some results.

Code: [Select]
Pre-Pentium4 (SSE1)
106 106 106 106 106
104 104 104 104 104
Press any key to continue ...

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
103 104 104 104 103
111 111 111 111 111
Press any key to continue ...

HTH,

Steve N.


Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Optimizing some code
« Reply #42 on: June 12, 2014, 11:45:57 PM »
Dave,

results from slen2.exe:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
55 51 52 51 51
77 77 77 77 77
Press any key to continue ...

Gunther
Get your facts first, and then you can distort them.

nidud

  • Member
  • *****
  • Posts: 2001
    • https://github.com/nidud/asmc
Re: Optimizing some code
« Reply #43 on: June 15, 2014, 01:28:58 AM »
The testbed used to find the fastest algo should be implemented equally for all, which is not the case in most of these tests. Some of them are inlined as macros, and some use different prologue to save some cycles. There is also some that use SSEx code which demand an alternate conventional algo for portability.

However, with regards to the conventional source one algo seems to stand out.

Strings in general are small and unaligned. The function align the input in this case but do not use any scan or other upcodes that needs alignment, so this could be removed. A modified unaligned version could then be written like this:
Code: [Select]
strlen  proc string:dword
mov eax,string
sub eax,4
@@: add eax,4
mov edx,[eax]
lea ecx,[edx-01010101H]
not edx
and ecx,edx
and ecx,80808080H
jz @B
bsf ecx,ecx
shr ecx,3
sub eax,string
add eax,ecx
ret
strlen  endp

tested on strings 1..32:
Quote
AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------------------
1960   cycles - 0: standard (scasb)
1250   cycles - 1: AgnerFog
1077   cycles - 2: AgnerFog (unaligned)
1607   cycles - 3: Dave

1..64:
Quote
6279   cycles - 0: standard (scasb)
4022   cycles - 1: AgnerFog
3519   cycles - 2: AgnerFog (unaligned)
4795   cycles - 3: Dave

and 1..999:
Quote
1056068 cycles - 0: standard (scasb)
453295  cycles - 1: AgnerFog
426523  cycles - 2: AgnerFog (unaligned)
716071  cycles - 3: Dave

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Optimizing some code
« Reply #44 on: June 15, 2014, 01:33:34 AM »
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
79459   cycles - 0: standard (scasb)
7603    cycles - 1: AgnerFog
7785    cycles - 2: AgnerFog (unaligned)
9866    cycles - 3: Dave

16232   cycles - 0: standard (scasb)
7615    cycles - 1: AgnerFog
9145    cycles - 2: AgnerFog (unaligned)
10294   cycles - 3: Dave

16081   cycles - 0: standard (scasb)
7759    cycles - 1: AgnerFog
7763    cycles - 2: AgnerFog (unaligned)
10762   cycles - 3: Dave

you might want to increase the loop counts for better stability