From the Wayback Machine: this thread is from before nidud's mass deletion of all of his posts.
nidud's deleted posts from this thread are marked "deleted" below (https://web.archive.org/web/20211019171844/https://masm32.com/board/index.php?topic=3396.0)
Unfortunately, the zip files were not archived by the Wayback Machine and will not work.
i suppose you could use VirtualProtect to allow writes in the .CODE section
then, copy the code under test into a common code address space before executing it
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
to help speed testing up a little....
you could use different counter_begin loop count values for the short and long tests
i try to select a loop count that yields about 0.5 seconds per pass
prescott w/htt - xp sp3
unaligned
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4925905 cycles - 2048..4096 (164) A memcpy SSE 16
4933332 cycles - 2048..4096 (164) A memcpy SSE 16
4953203 cycles - 2048..4096 (164) A memcpy SSE 16
4963909 cycles - 2048..4096 (164) A memcpy SSE 16
4923198 cycles - 2048..4096 (164) A memcpy SSE 16
4941277 cycles - 2048..4096 (164) A memcpy SSE 16
11502669 cycles - 2048..4096 (164) U memcpy SSE 16
11487135 cycles - 2048..4096 (164) U memcpy SSE 16
11564951 cycles - 2048..4096 (164) U memcpy SSE 16
11570118 cycles - 2048..4096 (164) U memcpy SSE 16
11497558 cycles - 2048..4096 (164) U memcpy SSE 16
11526087 cycles - 2048..4096 (164) U memcpy SSE 16
aligned
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4935114 cycles - 2048..4096 (164) A memcpy SSE 16
4934727 cycles - 2048..4096 (164) A memcpy SSE 16
4942523 cycles - 2048..4096 (164) A memcpy SSE 16
4924574 cycles - 2048..4096 (164) A memcpy SSE 16
4924658 cycles - 2048..4096 (164) A memcpy SSE 16
4937763 cycles - 2048..4096 (164) A memcpy SSE 16
11490869 cycles - 2048..4096 (164) U memcpy SSE 16
11481780 cycles - 2048..4096 (164) U memcpy SSE 16
11616596 cycles - 2048..4096 (164) U memcpy SSE 16
11530420 cycles - 2048..4096 (164) U memcpy SSE 16
11488318 cycles - 2048..4096 (164) U memcpy SSE 16
11504392 cycles - 2048..4096 (164) U memcpy SSE 16
deleted
hmmm - that seems wrong
i thought you had to be in a code section to assemble instructions
guess i've never tried it - lol
but - there is nothing to prevent you from putting a proc in the code section and copying it to another address
so long as you allow writes into the affected pages with VirtualProtect and PAGE_EXECUTE_READWRITE
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786%28v=vs.85%29.aspx
deleted
i hadn't thought about that
but - most win32 CALL's are near relative
so, you'd have to translate the target addresses
but - for testing algorithm code that doesn't make any calls, like loops, etc,
the branch addresses are relative, but the target moves with the code :P
if you wanted to use calls or invokes in movable code,
you could store the branch address and use CALL DWORD PTR lpfnFunction
or MOV EAX,Function and CALL EAX
deleted
deleted
deleted
hard to test individual instructions
so much relies on the surrounding code
deleted
deleted
deleted
those tests run way too fast to get reliable readings
for best results:
1) bind to a single core
2) wait about 750 ms before performing any tests - this allows the system to settle
3) adjust individual loop counts so that each test pass takes about 0.5 seconds
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
170724 cycles - ( 0) 0: crt_strchr
172046 cycles - ( 29) 1: x
70540 cycles - (119) 2: 'c'
62934 cycles - (107) 3: 'cccc'
170343 cycles - ( 0) 0: crt_strchr
167063 cycles - ( 29) 1: x
89640 cycles - (119) 2: 'c'
82154 cycles - (107) 3: 'cccc'
240068 cycles - ( 0) 0: crt_strchr
87783 cycles - ( 29) 1: x
25549 cycles - (119) 2: 'c'
23995 cycles - (107) 3: 'cccc'
--- ok ---
H:\nidudString\string\strchr => strchr
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
209073 cycles - ( 0) 0: crt_strchr
221060 cycles - ( 29) 1: x
80107 cycles - (119) 2: 'c'
83854 cycles - (107) 3: 'cccc'
198648 cycles - ( 0) 0: crt_strchr
211263 cycles - ( 29) 1: x
106992 cycles - (119) 2: 'c'
96168 cycles - (107) 3: 'cccc'
253182 cycles - ( 0) 0: crt_strchr
84552 cycles - ( 29) 1: x
26531 cycles - (119) 2: 'c'
36056 cycles - (107) 3: 'cccc'
they're all over the place :P
deleted
deleted
deleted
deleted
deleted
deleted
Quote from: nidud on July 23, 2014, 03:41:08 AM
Minimum supported client
Windows XP
The SSE level used is SSE2 so how common is this combination?
It may hurt the feelings of some fans of old hard- and software, but writing code for >=(SSE2 & Win XP) should be OK for 99% of the users.
There is a poll on SSE support here (http://www.insanelymac.com/forum/topic/35109-sse2-vs-sse3-the-poll/): "I'm still waiting for SSE support :) (5 votes [2.45%])"
That was 2006, 8 years ago ;)
deleted
...or provide fallback routines
you can run a little startup init routine - detect SSE support level - and fill in addresses of PROC's
i am working on something along that line at the moment
these define TYPE's for up to 6 dword parms - you can extend it easily
_FUNC00 TYPEDEF PROTO
_FUNC04 TYPEDEF PROTO :DWORD
_FUNC08 TYPEDEF PROTO :DWORD,:DWORD
_FUNC12 TYPEDEF PROTO :DWORD,:DWORD,:DWORD
_FUNC16 TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD
_FUNC20 TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD,:DWORD
_FUNC24 TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD,:DWORD,:DWORD
_PFUNC00 TYPEDEF PTR _FUNC00
_PFUNC04 TYPEDEF PTR _FUNC04
_PFUNC08 TYPEDEF PTR _FUNC08
_PFUNC12 TYPEDEF PTR _FUNC12
_PFUNC16 TYPEDEF PTR _FUNC16
_PFUNC20 TYPEDEF PTR _FUNC20
_PFUNC24 TYPEDEF PTR _FUNC24
then, i am using a structure with function pointers in it
_FUNC STRUCT
lpfnFunc1 _PFUNC04 ? ;this function has 1 dword arg
lpfnFunc2 _PFUNC12 ? ;this function has 3 dword args
_FUNC ENDS
and, in the .DATA? section...
_Func _FUNC <>
so, you set _Func.lpfnFunc1 and _Func.lpfnFunc2 to point at appropriate routines for the supported SSE level
then.....
INVOKE _Func.lpfnFunc1,arg1
INVOKE _Func.lpfnFunc2,arg1,arg2,arg3
;or
push edi
mov edi,offset _Func
INVOKE [edi]._FUNC.lpfnFunc1,arg1
INVOKE [edi]._FUNC.lpfnFunc2,arg1,arg2,arg3
pop edi
another way to go would be to put all the routines for each support level into a DLL
then, at init, load the DLL that is appropriate for the machine
the routines can then all have the same names
most people probably have at least SSE3
however, we can look at the forum members, alone, and find a few machines
some that probably support only MMX or SSE(1)
i bought this machine in 2005 - it supports SSE3, which was a new thing at the time
so - it's almost 10 years old
Quote from: dedndave on July 23, 2014, 07:53:25 AM
i bought this machine in 2005 - it supports SSE3, which was a new thing at the time
so - it's almost 10 years old
SSE3 was introduced in February 2004 with the Prescott revision of the Pentium 4 processor.
Gunther
deleted
deleted
Quote from: Gunther on July 23, 2014, 09:03:40 AM
SSE3 was introduced in February 2004 with the Prescott revision of the Pentium 4 processor.
SSE2 was introduced in November 2000 with the P4 Willamette. In general, it's absolutely sufficient (try your luck, make Instr_() faster with SSE7.8... (http://masm32.com/board/index.php?topic=3408.msg36297#msg36297)); in particular, pcmpeqb and pmovmskb are important improvements.
deleted
Quote from: nidud on July 25, 2014, 04:29:11 AM
with regard to memcpy there seems little gain using SSE
...
conclusion:
- on newer CPUs, MOVSB is faster than moving blocks
- on older CPUs, MOVSB gets faster with size
- SSE may be faster depending on the CPU
Or, in short: Everything is more complicated than you think. (http://www.masmforum.com/board/index.php?topic=11454.msg87622#msg87622)
deleted
Quote from: nidud on July 25, 2014, 07:19:55 AM
:biggrin:
Yes, it's possible to complicate things I guess, and the link you provide includes a lot of complicated issues but few constructive conclusions to the problem at hand.
The table there looked different for each and every CPU we tested (try yourself the latest version (http://masm32.com/board/index.php?topic=1971.msg20618#msg20618)). So the choice was either choosing an algo that provided reasonable speed for most of them, or going the stony road of checking which CPU family and branching to a specialised one. For MasmBasic's MbCopy, rep movsd made the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...). Good to see that Intel keeps pushing this line, too. I wouldn't use rep movsb for the whole copy, though, as many of the not-so-recent CPUs are very slow with the byte variant of movs.
push ecx
shr ecx, 2 ; divide count by 4
rep movsd ; copy DWORD size blocks
pop ecx ; Reload byte count
and ecx, 3 ; get the rest
rep movsb ; copy the rest
xchg eax, edi ; for CAT$, return a pointer to the end of the destination;
How much of a difference would RAM speed make? DDR3 speeds can be 1333/1600/1866/2133/2400.
deleted
Hi sinsi,
your memcpy application brings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498064 cycles - 10 ( 0) 0: crt_memcpy
890775 cycles - 10 ( 38) 1: movsd - mov eax,ecx
892888 cycles - 10 ( 37) 2: movsd - push ecx
353318 cycles - 10 ( 27) 3: movsb
-- unaligned strings --
1006514 cycles - 10 ( 0) 0: crt_memcpy
1033525 cycles - 10 ( 38) 1: movsd - mov eax,ecx
1033580 cycles - 10 ( 37) 2: movsd - push ecx
377061 cycles - 10 ( 27) 3: movsb
-- short strings 15 --
175505 cycles - 8000 ( 0) 0: crt_memcpy
335538 cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344226 cycles - 8000 ( 37) 2: movsd - push ecx
291953 cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1033175 cycles - 8000 ( 0) 0: crt_memcpy
952811 cycles - 8000 ( 38) 1: movsd - mov eax,ecx
959677 cycles - 8000 ( 37) 2: movsd - push ecx
566948 cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
3224879 cycles - 4000 ( 0) 0: crt_memcpy
3153708 cycles - 4000 ( 38) 1: movsd - mov eax,ecx
3151176 cycles - 4000 ( 37) 2: movsd - push ecx
930276 cycles - 4000 ( 27) 3: movsb
--- ok ---
Gunther
deleted
Seems to be better
x1 = 2051543/884593 = 2.3191942509153927 (8000) ~2.32
x2 = 5844930/2504176 = 2.3340731641865428 (4000) ~2.33
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
-- aligned strings --
1188974 cycles - 10 ( 0) 0: crt_memcpy
1097640 cycles - 10 ( 75) 1: movsd - mov eax,ecx
1103251 cycles - 10 ( 75) 2: movsd - push ecx
1102906 cycles - 10 ( 59) 3: movsb
1310185 cycles - 10 (182) 4: SSE
-- unaligned strings --
2595543 cycles - 10 ( 0) 0: crt_memcpy
2620959 cycles - 10 ( 75) 1: movsd - mov eax,ecx
2611443 cycles - 10 ( 75) 2: movsd - push ecx
7866087 cycles - 10 ( 59) 3: movsb
1358767 cycles - 10 (182) 4: SSE
-- short strings 15 --
343706 cycles - 8000 ( 0) 0: crt_memcpy
789893 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
808747 cycles - 8000 ( 75) 2: movsd - push ecx
2039809 cycles - 8000 ( 59) 3: movsb
237595 cycles - 8000 (182) 4: SSE
-- short strings 271 --
2051543 cycles - 8000 ( 0) 0: crt_memcpy
2096801 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
2083586 cycles - 8000 ( 75) 2: movsd - push ecx
7495329 cycles - 8000 ( 59) 3: movsb
884593 cycles - 8000 (182) 4: SSE
-- short strings 2014 --
5844930 cycles - 4000 ( 0) 0: crt_memcpy
6057324 cycles - 4000 ( 75) 1: movsd - mov eax,ecx
5890555 cycles - 4000 ( 75) 2: movsd - push ecx
22533778 cycles - 4000 ( 59) 3: movsb
2504176 cycles - 4000 (182) 4: SSE
--- ok ---
deleted
align 16
rep movsb
Check if the align is really needed. In the worst case, 15 bytes of code are inserted there. One common trick is to insert the bytes needed before the entry into the proc.
nidud - i hope you're using the one in this post
http://masm32.com/board/index.php?topic=3373.msg35658#msg35658
;EAX return bits:
;0 = MMX
;1 = SSE
;2 = SSE2
;3 = SSE3
;4 = SSSE3
;5 = SSE4.1
;6 = SSE4.2
i would define the EQUates this way...
SSE_MMX equ 1
SSE_SSE equ 2
SSE_SSE2 equ 4
SSE_SSE3 equ 8
SSE_SSSE3 equ 10h
SSE_SSE41 equ 20h
SSE_SSE42 equ 40h
call GetSseLevel
test al,SSE_SSE3
jnz sse3_supported
the EQUates you have would be ok for BT, i suppose :P
Here is a test piece that uses 4 copies of the same simple byte-intensive algo; it runs the 4 versions and times each one. The idea was to test if the identical algo in 4 locations produced any difference in timing, but on my old quad they are almost perfectly identical, even with multiple runs.
I get this result.
File length = 977426
828 ms
828 ms
828 ms
828 ms
Press any key to continue ...
deleted
ok - that one does not preserve EBX - but, it's probably ok, in this case
just so you are aware, CPUID destroys the contents of EBX :t
deleted
File length = 977412
1484 ms
1453 ms
1516 ms
1547 ms
Press any key to continue ...
1344 ms
1344 ms
1343 ms
2016 ms...
1343 ms
1344 ms
1344 ms
2015 ms...
1344 ms
1359 ms
1344 ms
2016 ms
If we remove the worst case ...
If i am not wrong, you are using 2 counters:
First counter = 1000
Second counter = count (=4000,etc.)
You get the result only when
the first counter is 0 (counter_end).
So the result has something to do with the execution
of this:
mov edi,count
mov ebx,esp
.while edi
pushargs
call esi
mov esp,ebx
dec edi
Is there any particular reason for this ?
Quote
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi,count
mov ebx,esp
.while edi
pushargs
call esi
mov esp,ebx
dec edi
.endw
counter_end
Quote from: nidud on July 26, 2014, 02:57:09 AM
Quote
Check if the align is really needed
I normally tune them from the list file in the end
What I intended is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at microcode level, it is of course a loop).
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (MMX, SSE, SSE2, SSE3)
movsd align 16 10476 µs
movsd align 3 10456 µs
movsd align 13 10347 µs
movsb align 16 10510 µs
movsb align 3 10503 µs
movsb align 13 10407 µs
movsd align 16 10514 µs
movsd align 3 10469 µs
movsd align 13 10516 µs
movsb align 16 10455 µs
movsb align 3 10515 µs
movsb align 13 10502 µs
movsd align 16 10526 µs
movsd align 3 10455 µs
movsd align 13 10469 µs
movsb align 16 10360 µs
movsb align 3 10485 µs
movsb align 13 10456 µs
Sample:
test4a proc uses esi edi ecx
align 16
nops 3
rep movsb
ret
test4a endp
Interesting, though, that movsb is indeed equally fast on my trusty old Celeron, at least for a 10 MB string.
deleted
That's the result of memcpy.exe from 1234.zip:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498952 cycles - 10 ( 0) 0: crt_memcpy
898756 cycles - 10 ( 75) 1: movsd - mov eax,ecx
903577 cycles - 10 ( 75) 2: movsd - push ecx
354813 cycles - 10 ( 59) 3: movsb
487954 cycles - 10 (182) 4: SSE
-- unaligned strings --
494936 cycles - 10 ( 0) 0: crt_memcpy
895940 cycles - 10 ( 75) 1: movsd - mov eax,ecx
895968 cycles - 10 ( 75) 2: movsd - push ecx
373553 cycles - 10 ( 59) 3: movsb
491344 cycles - 10 (182) 4: SSE
-- short strings 15 --
175961 cycles - 8000 ( 0) 0: crt_memcpy
361324 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
361586 cycles - 8000 ( 75) 2: movsd - push ecx
313550 cycles - 8000 ( 59) 3: movsb
92719 cycles - 8000 (182) 4: SSE
-- short strings 271 --
841879 cycles - 8000 ( 0) 0: crt_memcpy
780741 cycles - 8000 ( 75) 1: movsd - mov eax,ecx
806939 cycles - 8000 ( 75) 2: movsd - push ecx
623419 cycles - 8000 ( 59) 3: movsb
275466 cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1002628 cycles - 4000 ( 0) 0: crt_memcpy
2239737 cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2226209 cycles - 4000 ( 75) 2: movsd - push ecx
962207 cycles - 4000 ( 59) 3: movsb
972245 cycles - 4000 (182) 4: SSE
--- ok ---
Gunther
Quote
The macro can only be called with EDI, ESI, or EBX, or an immediate value.
I think I just run out of regs once and inserted a loop.
The count for small functions is also rather high so it's
just a way of skipping zeros I guess.
I think you are talking about this macro:
counter_begin MACRO loopcount:REQ, priority
or counter_end
If it is, we cannot use EBX because cpuid destroys EBX
I modified counter_begin -written by MichaelW- to this:
(COUNTERLOOPS=1000 or 10000 or 100000 or ...)
; this macro uses EDI inside = length from kIni to kEnd
; we need to define an array to save the means.
; we need to define _LoopCount,_MaxLength...etc. in .DATA
BEGIN_COUNTER_CYCLE_HIGH_PRIORITY_CLASS MACRO kIni, kEnd
LOCAL labelA,labelB
mov _LoopCount, COUNTERLOOPS
mov _MaxLength, kEnd
mov edi, kIni
;mov _MinLength, edi ;; not used yet
mov _MeanValue, 0 ;; mean is 0
invoke GetCurrentProcess
invoke SetPriorityClass, eax, HIGH_PRIORITY_CLASS
labelA: ;; Begin test loop
BEGIN_LOOP_TEST equ <labelA>
xor eax, eax ;; Use same CPUID input value for each call
cpuid ;; Flush pipe & wait for pending ops to finish
rdtsc ;; Read Time Stamp Counter
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
mov _LoopCounter, COUNTERLOOPS
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
@@: ;; Start an empty reference loop
sub _LoopCounter, 1
jnz short @B
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of overhead count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of overhead count in EDX
push edx ;; Preserve high-order 32 bits of overhead count
push eax ;; Preserve low-order 32 bits of overhead count
xor eax, eax
cpuid
rdtsc
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
;;-------------------------------------
;; Start
;;-------------------------------------
mov _LoopCounter, COUNTERLOOPS
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
labelB: ;; Start test loop
START_LOOP_TEST equ <labelB>
ENDM
; ------------------------------------------------------------------------
END_COUNTER_CYCLE MACRO arg
LOCAL $tmpstr$
sub _LoopCounter, 1
jnz START_LOOP_TEST ;; goto labelB
;;---------------------------
;; stop this count
;;---------------------------
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of test count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of test count in EDX
pop ecx ;; Recover low-order 32 bits of overhead count
sub eax, ecx ;; Low-order 32 bits of adjusted count in EAX
pop ecx ;; Recover high-order 32 bits of overhead count
sbb edx, ecx ;; High-order 32 bits of adjusted count in EDX
mov DWORD PTR _CounterQword, eax
mov DWORD PTR _CounterQword + 4, edx
finit
fild _CounterQword
fild _LoopCount
fdiv
fistp _CounterQword
mov ebx, dword ptr _CounterQword
;---------------------------------------------------
; print cycles
;---------------------------------------------------
add ebx, _MeanValue
mov _MeanValue, ebx
add edi, 1
cmp edi, _MaxLength
jbe BEGIN_LOOP_TEST ;; goto labelA
invoke GetCurrentProcess
invoke SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
; --------------------------------------------------
; Save mean and print mean
; --------------------------------------------------
invoke SaveMeans, ebx ;; save it in one array
;; one after another
;---------------------------------------------------
print str$(ebx)
$tmpstr$ CATSTR <chr$(">, <arg>, <",13,10)>
print $tmpstr$
;---------------------------------------------------
ENDM
.data
ALIGN 8 ;; Optimal alignment for QWORD
_CounterQword dq 0
_LoopCount dd 0
_LoopCounter dd 0
_MinLength dd 0
_MaxLength dd 0
_MeanValue dd 0
;------------------------------
ALIGN 4
dd 0 ; <<<--- start with 0
_TblTiming0 dd 600 dup (?)
.code
SaveMeans proc kMean:DWORD
mov eax, kMean
mov edx, offset _TblTiming0
mov ecx, [edx-4] ; number of means
mov [edx+ecx*4], eax
add ecx, 1
mov [edx-4], ecx
ret
SaveMeans endp
deleted
Hi nidud,
Quote from: nidud on July 27, 2014, 02:10:39 AM
I added some bits to Dave's test:
there's nothing attached.
Gunther
deleted
deleted
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
-----------------------------------------------
-- aligned strings --
995933 cycles - 10 ( 0) 0: crt_strrchr
995891 cycles - 10 ( 40) 1: strrchr
273823 cycles - 10 (154) 2: x
94668 cycles - 10 (112) 3: SSE
-- unaligned strings --
996477 cycles - 10 ( 0) 0: crt_strrchr
997094 cycles - 10 ( 40) 1: strrchr
298219 cycles - 10 (154) 2: x
121529 cycles - 10 (112) 3: SSE
-- small strings 128 --
324263 cycles - 500 ( 0) 0: crt_strrchr
323710 cycles - 500 ( 40) 1: strrchr
84786 cycles - 500 (154) 2: x
34915 cycles - 500 (112) 3: SSE
-- small strings 1 --
67914 cycles - 500 ( 0) 0: crt_strrchr
67286 cycles - 500 ( 40) 1: strrchr
12595 cycles - 500 (154) 2: x
16622 cycles - 500 (112) 3: SSE
Hi nidud,
here's the output of auto.zip:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- aligned strings --
491469 cycles - 10 ( 0) 0: crt_memcpy
889651 cycles - 10 ( 63) 1: movsd - mov eax,ecx
887273 cycles - 10 ( 63) 2: movsd - push ecx
355080 cycles - 10 ( 51) 3: movsb
487046 cycles - 10 (182) 4: SSE
355990 cycles - 10 ( 0) 5: auto
-- unaligned strings --
490269 cycles - 10 ( 0) 0: crt_memcpy
886259 cycles - 10 ( 63) 1: movsd - mov eax,ecx
886778 cycles - 10 ( 63) 2: movsd - push ecx
372520 cycles - 10 ( 51) 3: movsb
491780 cycles - 10 (182) 4: SSE
378881 cycles - 10 ( 0) 5: auto
-- short strings 15 --
174897 cycles - 8000 ( 0) 0: crt_memcpy
349626 cycles - 8000 ( 63) 1: movsd - mov eax,ecx
343812 cycles - 8000 ( 63) 2: movsd - push ecx
307384 cycles - 8000 ( 51) 3: movsb
98073 cycles - 8000 (182) 4: SSE
293479 cycles - 8000 ( 0) 5: auto
-- short strings 271 --
832627 cycles - 8000 ( 0) 0: crt_memcpy
773797 cycles - 8000 ( 63) 1: movsd - mov eax,ecx
764418 cycles - 8000 ( 63) 2: movsd - push ecx
586580 cycles - 8000 ( 51) 3: movsb
279676 cycles - 8000 (182) 4: SSE
557134 cycles - 8000 ( 0) 5: auto
-- short strings 2014 --
998188 cycles - 4000 ( 0) 0: crt_memcpy
2198740 cycles - 4000 ( 63) 1: movsd - mov eax,ecx
2195833 cycles - 4000 ( 63) 2: movsd - push ecx
935710 cycles - 4000 ( 51) 3: movsb
961563 cycles - 4000 (182) 4: SSE
906474 cycles - 4000 ( 0) 5: auto
--- ok ---
Gunther
deleted
deleted
Hi nidud,
Quote from: nidud on August 11, 2014, 08:46:40 PM
Is this possible? To have SSE4.1 and not SSE3?
Note: SSE and SSE2 are pre-set since the program will exit if SSE2 is not present, so this bit must be set by the test.
I think not. Did you try that (http://masm32.com/board/index.php?topic=1418.msg14444#msg14444)? It should show you the available instruction sets.
Gunther
deleted
Hi nidud,
you can trust my instruction-detecting application. In any case, your laptop supports SSE3 and SSSE3, and it supports AVX. You can test that with this tool (http://masm32.com/board/index.php?topic=3227.msg35958#msg35958), if you have at least Windows 7 with SP1 installed. The glitch must be in your code. Do you test the right bits?
Gunther
deleted
Hi Gunther,
My Core-i3 G3220 does not support AVX, but the results for your instruction set detection tool:
Supported by Processor and installed Operating System:
------------------------------------------------------
MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
POPCNT, SSE4.2
featurenumber = 13
Appear to match the Intel specs:
http://ark.intel.com/products/77773
Hi Michael,
Quote from: MichaelW on August 12, 2014, 10:00:55 AM
Appear to match the Intel specs:
http://ark.intel.com/products/77773
I hope so. I've written the procedure using the Intel documents as a basis.
Gunther
deleted
movzx eax, byte ptr [esp+8]
if 1
imul eax, eax, 01010101h ; 4 bytes shorter, faster
else
mov ah, al
mov ecx, eax
shl eax, 16
add eax, ecx
endif
movd xmm0, eax
pshufd xmm0, xmm0, 0 ; populate char
deleted
Variants of memchr:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
43778 cycles for 100 * memchr scasb
4474 cycles for 100 * memchr SSE2a
5608 cycles for 100 * memchr SSE2b
43994 cycles for 100 * memchr scasb
4497 cycles for 100 * memchr SSE2a
5602 cycles for 100 * memchr SSE2b
44044 cycles for 100 * memchr scasb
4474 cycles for 100 * memchr SSE2a
5598 cycles for 100 * memchr SSE2b
36 bytes for memchr scasb
88 bytes for memchr SSE2a
92 bytes for memchr SSE2b
Could look much different on other CPUs, as movlps speeds it up a lot on my CPU.
My processor is near (or at) bottom-end today (retail box, $79).
Intel(R) Pentium(R) CPU G3220 @ 3.00GHz (SSE4)
24909 cycles for 100 * memchr scasb
2864 cycles for 100 * memchr SSE2a
2399 cycles for 100 * memchr SSE2b
24934 cycles for 100 * memchr scasb
2882 cycles for 100 * memchr SSE2a
2366 cycles for 100 * memchr SSE2b
24923 cycles for 100 * memchr scasb
2886 cycles for 100 * memchr SSE2a
2418 cycles for 100 * memchr SSE2b
36 bytes for memchr scasb
88 bytes for memchr SSE2a
92 bytes for memchr SSE2b
96 = eax memchr scasb
96 = eax memchr SSE2a
96 = eax memchr SSE2b
deleted
Thanks. As I suspected, the movlps/movhps pair is good only for my trusty Celeron :(
Here is one more, with movups instead:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
43821 cycles for 100 * memchr scasb
4477 cycles for 100 * memchr SSE2 lps/hps
5556 cycles for 100 * memchr SSE2 nidud
5205 cycles for 100 * memchr SSE2 ups
43778 cycles for 100 * memchr scasb
4476 cycles for 100 * memchr SSE2 lps/hps
5606 cycles for 100 * memchr SSE2 nidud
5206 cycles for 100 * memchr SSE2 ups
43762 cycles for 100 * memchr scasb
4482 cycles for 100 * memchr SSE2 lps/hps
5607 cycles for 100 * memchr SSE2 nidud
5200 cycles for 100 * memchr SSE2 ups
36 bytes for memchr scasb
88 bytes for memchr SSE2 lps/hps
92 bytes for memchr SSE2 nidud
84 bytes for memchr SSE2 ups
Intel(R) Pentium(R) CPU G3220 @ 3.00GHz (SSE4)
24916 cycles for 100 * memchr scasb
2889 cycles for 100 * memchr SSE2 lps/hps
2422 cycles for 100 * memchr SSE2 nidud
2351 cycles for 100 * memchr SSE2 ups
24927 cycles for 100 * memchr scasb
2890 cycles for 100 * memchr SSE2 lps/hps
2469 cycles for 100 * memchr SSE2 nidud
2342 cycles for 100 * memchr SSE2 ups
24921 cycles for 100 * memchr scasb
2885 cycles for 100 * memchr SSE2 lps/hps
2405 cycles for 100 * memchr SSE2 nidud
2351 cycles for 100 * memchr SSE2 ups
36 bytes for memchr scasb
88 bytes for memchr SSE2 lps/hps
92 bytes for memchr SSE2 nidud
84 bytes for memchr SSE2 ups
deleted
Jochen,
your timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
21892 cycles for 100 * memchr scasb
3007 cycles for 100 * memchr SSE2 lps/hps
2690 cycles for 100 * memchr SSE2 nidud
2500 cycles for 100 * memchr SSE2 ups
21951 cycles for 100 * memchr scasb
2981 cycles for 100 * memchr SSE2 lps/hps
2721 cycles for 100 * memchr SSE2 nidud
6211 cycles for 100 * memchr SSE2 ups
21827 cycles for 100 * memchr scasb
3003 cycles for 100 * memchr SSE2 lps/hps
2510 cycles for 100 * memchr SSE2 nidud
2721 cycles for 100 * memchr SSE2 ups
36 bytes for memchr scasb
88 bytes for memchr SSE2 lps/hps
92 bytes for memchr SSE2 nidud
84 bytes for memchr SSE2 ups
--- ok ---
Gunther
deleted