Instr, strstr, find$

Started by jj2007, July 14, 2014, 09:20:15 PM


nidud

#15
deleted

jj2007

#16
Quite efficient :t
However,

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
++++++++++++++++++++
3296    cycles for 10 * MbInstr 0 (A)
4000    cycles for 10 * MbInstr 0 (B)
5169    cycles for 10 * crt_strstr
5519    cycles for 10 * M32 find$

but

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
++++++++++++++++++++
3540    cycles for 10 * MbInstr 0 (A)
3999    cycles for 10 * MbInstr 0 (B)
5159    cycles for 10 * crt_strstr
5335    cycles for 10 * M32 find$
3247    cycles for 10 * strstr_nidud


What did you smuggle in that destroys the performance of my algo??? :eusa_naughty:

What is even more worrying is the bad performance on my Celeron :eusa_boohoo:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
4094    cycles for 10 * MbInstr 0 (A)
3291    cycles for 10 * crt_strstr
3778    cycles for 10 * M32 find$
3347    cycles for 10 * strstr_nidud

nidud

#17
deleted

dedndave

put them in separate programs - run a batch file   :P

Gunther

Quote from: dedndave on July 16, 2014, 05:17:21 AM
put them in separate programs - run a batch file   :P

Not a bad idea.

Gunther
You have to know the facts before you can distort them.

nidud

#20
deleted

jj2007

Here is Instr() with another setting: find echo WARNING in WinExtra.inc

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
33284   kCycles for 10 * MbInstr 0 (zero-delimited)
30100   kCycles for 10 * MbInstr 0 (file size)
34983   kCycles for 10 * crt_strstr
37981   kCycles for 10 * M32 find$
38525   kCycles for 10 * strstr_nidud


The second entry (file size) refers to the additional parameter mentioned above: the function knows how many bytes are available in the source string. The difference is surprisingly small, though - I had expected a stronger influence of the data cache.
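To make the two variants concrete, here is a minimal C sketch of the idea, not the MasmBasic code: a zero-delimited search has to watch for the terminator on every byte, while a length-bounded search is told the buffer size up front. The names find_zero_delimited and find_bounded are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Zero-delimited variant: relies on the terminating 0 byte of hay. */
const char *find_zero_delimited(const char *hay, const char *needle)
{
    return strstr(hay, needle);
}

/* Length-bounded variant (hypothetical helper): the caller passes the
   number of valid bytes, so no terminator check is needed. */
const char *find_bounded(const char *hay, size_t hay_len,
                         const char *needle, size_t needle_len)
{
    if (needle_len == 0) return hay;
    if (needle_len > hay_len) return NULL;

    const char *p = hay;
    /* number of start positions where the needle could still fit */
    size_t remaining = hay_len - needle_len + 1;

    while (remaining > 0) {
        /* jump to the next occurrence of the first needle byte */
        const char *hit = memchr(p, needle[0], remaining);
        if (hit == NULL) return NULL;
        if (memcmp(hit, needle, needle_len) == 0) return hit;
        remaining -= (size_t)(hit - p) + 1;
        p = hit + 1;
    }
    return NULL;
}
```

With a file loaded into memory, the bounded variant would be called with the file size as hay_len, which is the situation the (file size) timing describes.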


Quote from: nidud on July 16, 2014, 06:20:58 AM
function to test:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
10568   cycles for 100 * proc_4
3939    cycles for 100 * Len
13860   cycles for 100 * len

Gunther

Jochen,

that's what InstrTimingsNew did:


Here is the output of proc4:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

6860    cycles for 100 * proc_4
1978    cycles for 100 * Len
9348    cycles for 100 * len

6205    cycles for 100 * proc_4
1957    cycles for 100 * Len
9312    cycles for 100 * len

6224    cycles for 100 * proc_4
1955    cycles for 100 * Len
9924    cycles for 100 * len

100     = eax proc_4
100     = eax Len
100     = eax len

--- ok ---


Gunther

jj2007

Quote from: Gunther on July 16, 2014, 08:24:24 AM
that's what InstrTimingsNew did:

Gunther,
Either you have no \Masm32\include\winextra.inc (unlikely), or you launched the exe from a different drive than your Masm32 drive.

nidud

#24
deleted

Gunther

Jochen,

I do have winextra.inc, but I fired up the application from a different drive. Here is the new output:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++11 of 20 tests valid, loop overhead is approx. 43/10 cycles

20184   kCycles for 10 * MbInstr 0 (zero-delimited)
18778   kCycles for 10 * MbInstr 0 (file size)
22744   kCycles for 10 * crt_strstr
28534   kCycles for 10 * M32 find$
30970   kCycles for 10 * strstr_nidud

19942   kCycles for 10 * MbInstr 0 (zero-delimited)
18751   kCycles for 10 * MbInstr 0 (file size)
22837   kCycles for 10 * crt_strstr
28442   kCycles for 10 * M32 find$
30908   kCycles for 10 * strstr_nidud

20094   kCycles for 10 * MbInstr 0 (zero-delimited)
18727   kCycles for 10 * MbInstr 0 (file size)
22739   kCycles for 10 * crt_strstr
28537   kCycles for 10 * M32 find$
30943   kCycles for 10 * strstr_nidud

1068448 = eax MbInstr 0 (zero-delimited)
1068448 = eax MbInstr 0 (file size)
1068448 = eax crt_strstr
1068448 = eax M32 find$
1068448 = eax strstr_nidud


Gunther

nidud

#26
deleted

hutch--

With Dave's suggestion, try this before timing each algo in its own separate test piece. Set the priority class high enough to avoid the wandering timings and see if this helps to stabilise the results.


    cpuid                           ; serialising instruction for wider separation
    pause                           ; spinlock delay instruction

    invoke SleepEx,10,0

    cpuid                           ; serialising instruction for wider separation
    pause                           ; spinlock delay instruction


I have found that some algos are much more sensitive to code location than others, usually those doing intensive BYTE operations; working with larger data types reduces the variation.

jj2007

#28
I've added a fast variant of Instr(). At least on my CPUs, it looks competitive:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
33220   kCycles for 10 * MbInstr 0 (zero-delimited)
30094   kCycles for 10 * MbInstr 0 (file size)
8362    kCycles for 10 * MbInstr FAST
34858   kCycles for 10 * crt_strstr
38010   kCycles for 10 * M32 find$
38399   kCycles for 10 * strstr_nidud

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
32413   kCycles for 10 * MbInstr 0 (zero-delimited)
24107   kCycles for 10 * MbInstr 0 (file size)
13446   kCycles for 10 * MbInstr FAST
57954   kCycles for 10 * crt_strstr
58467   kCycles for 10 * M32 find$
38112   kCycles for 10 * strstr_nidud

Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
22035   kCycles for 10 * MbInstr 0 (zero-delimited)
19469   kCycles for 10 * MbInstr 0 (file size)
5340    kCycles for 10 * MbInstr FAST
27871   kCycles for 10 * crt_strstr
28522   kCycles for 10 * M32 find$
24944   kCycles for 10 * strstr_nidud


To assemble the source, you will need MasmBasic of today, 24 July.

Usage: Instr_(1, "Test", "Te", FAST)   ; 4 args, last one is uppercase FAST
This is always case-sensitive (same for find$, strstr etc).
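For readers who want to check the case-sensitivity point against the CRT, here is a small C sketch built on strstr. The helper instr_pos is hypothetical, and its return convention (1-based position, 0 when not found) is an assumption modelled on BASIC-style Instr, not taken from the MasmBasic source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1-based position of needle in hay, 0 if absent.
   strstr compares bytes exactly, so the match is case-sensitive. */
size_t instr_pos(const char *hay, const char *needle)
{
    const char *hit = strstr(hay, needle);
    return hit ? (size_t)(hit - hay) + 1 : 0;
}
```

Under that assumed convention, searching "Test" for "Te" finds a match at position 1, while searching for "te" finds nothing.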

johnsa


Timings from me...



Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
++++++++12 of 20 tests valid, loop overhead is approx. 46/10 cycles

16446   kCycles for 10 * MbInstr 0 (zero-delimited)
15438   kCycles for 10 * MbInstr 0 (file size)
3842    kCycles for 10 * MbInstr FAST
26039   kCycles for 10 * crt_strstr
23588   kCycles for 10 * M32 find$
20650   kCycles for 10 * strstr_nidud

16556   kCycles for 10 * MbInstr 0 (zero-delimited)
15557   kCycles for 10 * MbInstr 0 (file size)
3839    kCycles for 10 * MbInstr FAST
25890   kCycles for 10 * crt_strstr
23566   kCycles for 10 * M32 find$
20681   kCycles for 10 * strstr_nidud

16534   kCycles for 10 * MbInstr 0 (zero-delimited)
15786   kCycles for 10 * MbInstr 0 (file size)
3788    kCycles for 10 * MbInstr FAST
25781   kCycles for 10 * crt_strstr
23581   kCycles for 10 * M32 find$
20714   kCycles for 10 * strstr_nidud

1068448 = eax MbInstr 0 (zero-delimited)
1068448 = eax MbInstr 0 (file size)
1068448 = eax MbInstr FAST
1068448 = eax crt_strstr
1068448 = eax M32 find$
1068448 = eax strstr_nidud