The MASM Forum

General => The Laboratory => Topic started by: nidud on July 12, 2014, 09:15:44 PM

Title: Code location sensitivity of timings
Post by: nidud on July 12, 2014, 09:15:44 PM
From Wayback Machine, this thread from before nidud's mass deletion of all of his posts
nidud's deleted posts from this thread (https://web.archive.org/web/20211019171844/https://masm32.com/board/index.php?topic=3396.0)
Unfortunately the zip files were not archived (by Wayback Machine) and will not work.
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 12, 2014, 10:06:49 PM
i suppose you could use VirtualProtect to allow writes in the .CODE section
then, copy the code under test into a common code address space before executing it

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx

to help speed testing up a little....
you could use different counter_begin loop count values for the short and long tests
i try to select a loop count that yields about 0.5 seconds per pass
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 12, 2014, 10:11:45 PM
prescott w/htt - xp sp3
unaligned
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4925905 cycles - 2048..4096  (164) A memcpy SSE 16
4933332 cycles - 2048..4096  (164) A memcpy SSE 16
4953203 cycles - 2048..4096  (164) A memcpy SSE 16
4963909 cycles - 2048..4096  (164) A memcpy SSE 16
4923198 cycles - 2048..4096  (164) A memcpy SSE 16
4941277 cycles - 2048..4096  (164) A memcpy SSE 16

11502669        cycles - 2048..4096  (164) U memcpy SSE 16
11487135        cycles - 2048..4096  (164) U memcpy SSE 16
11564951        cycles - 2048..4096  (164) U memcpy SSE 16
11570118        cycles - 2048..4096  (164) U memcpy SSE 16
11497558        cycles - 2048..4096  (164) U memcpy SSE 16
11526087        cycles - 2048..4096  (164) U memcpy SSE 16


aligned
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
4935114 cycles - 2048..4096  (164) A memcpy SSE 16
4934727 cycles - 2048..4096  (164) A memcpy SSE 16
4942523 cycles - 2048..4096  (164) A memcpy SSE 16
4924574 cycles - 2048..4096  (164) A memcpy SSE 16
4924658 cycles - 2048..4096  (164) A memcpy SSE 16
4937763 cycles - 2048..4096  (164) A memcpy SSE 16

11490869        cycles - 2048..4096  (164) U memcpy SSE 16
11481780        cycles - 2048..4096  (164) U memcpy SSE 16
11616596        cycles - 2048..4096  (164) U memcpy SSE 16
11530420        cycles - 2048..4096  (164) U memcpy SSE 16
11488318        cycles - 2048..4096  (164) U memcpy SSE 16
11504392        cycles - 2048..4096  (164) U memcpy SSE 16
Title: Re: Code location sensitivity of timings
Post by: nidud on July 12, 2014, 11:04:51 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 12, 2014, 11:21:46 PM
hmmm - that seems wrong
i thought you had to be in a code section to assemble instructions
guess i've never tried it - lol

but - there is nothing to prevent you from putting a proc in the code section and copying it to another address
so long as you allow writes into the affected pages with VirtualProtect and PAGE_EXECUTE_READWRITE

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786%28v=vs.85%29.aspx
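
A minimal sketch of the copy-to-a-common-address idea, assuming the MASM32 SDK (masm32rt.inc). CommonSlot, SLOT_SIZE, TestProc and TestProcSize are hypothetical names made up for illustration, not code from this thread. The routine under test must not make direct calls out of itself, since those are relative.

```masm
include \masm32\include\masm32rt.inc

SLOT_SIZE equ 256                        ; hypothetical: must be >= the largest routine

.data?
dwOldProtect dd ?

.code
align 16
CommonSlot  db SLOT_SIZE dup (0CCh)      ; fixed address every candidate runs from

TestProc proc                            ; hypothetical routine under test
    mov     eax, 1                       ; no calls out, so it can be moved as-is
    ret
TestProc endp
TestProcSize equ $ - TestProc            ; byte length of the routine

start:
    ; allow writes into the pages that hold the slot
    invoke  VirtualProtect, offset CommonSlot, SLOT_SIZE, PAGE_EXECUTE_READWRITE, addr dwOldProtect
    .if eax == 0
        invoke  ExitProcess, 1           ; protection change failed
    .endif

    ; copy the routine into the common address
    mov     esi, offset TestProc
    mov     edi, offset CommonSlot
    mov     ecx, TestProcSize
    cld
    rep     movsb

    ; documented step after writing code bytes
    invoke  GetCurrentProcess
    invoke  FlushInstructionCache, eax, offset CommonSlot, SLOT_SIZE

    mov     eax, offset CommonSlot
    call    eax                          ; run the copy, not the original
    invoke  ExitProcess, 0
end start
```

Every candidate then runs from the same address with the same alignment, so any remaining difference in the timings comes from the code itself.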
Title: Re: Code location sensitivity of timings
Post by: nidud on July 13, 2014, 02:28:37 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 13, 2014, 03:03:15 AM
i hadn't thought about that
but - most win32 CALL's are near relative
so, you'd have to translate the target addresses

but - for testing algorithm code that doesn't make any calls, like loops, etc,
the branch addresses are relative, but the target moves with the code   :P

if you wanted to use calls or invokes in movable code,
you could store the branch address and use CALL DWORD PTR lpfnFunction
or MOV EAX,Function and CALL EAX
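
Both forms can be sketched like this, assuming the MASM32 SDK. Helper, Movable and lpfnHelper are hypothetical names. Neither call encodes a relative displacement, so Movable keeps working after it is copied to another address.

```masm
include \masm32\include\masm32rt.inc

.data?
lpfnHelper dd ?                          ; holds the absolute address of Helper

.code
Helper proc                              ; hypothetical call target
    mov     eax, 42
    ret
Helper endp

Movable proc                             ; hypothetical routine that may be copied
    call    dword ptr lpfnHelper         ; indirect call through the stored address
    mov     eax, offset Helper           ; absolute address as an immediate
    call    eax                          ; register call, no displacement involved
    ret
Movable endp

start:
    mov     lpfnHelper, offset Helper    ; fill in the pointer once at startup
    call    Movable
    invoke  ExitProcess, 0
end start
```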
Title: Re: Code location sensitivity of timings
Post by: nidud on July 13, 2014, 04:10:22 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 14, 2014, 09:39:03 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 14, 2014, 10:09:22 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 15, 2014, 02:10:35 AM
hard to test individual instructions
so much relies on the surrounding code
Title: Re: Code location sensitivity of timings
Post by: nidud on July 15, 2014, 05:46:29 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 16, 2014, 08:15:11 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 20, 2014, 11:43:43 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 21, 2014, 04:40:44 AM
those tests run way too fast to get reliable readings

for best results:
1) bind to a single core
2) wait about 750 ms before performing any tests - this allows the system to settle
3) adjust individual loop counts so that each test pass takes about 0.5 seconds
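
The three steps could be sketched as follows, assuming the MASM32 SDK. CodeUnderTest and TRIAL_COUNT are hypothetical, and GetTickCount is only a coarse clock (roughly 10 to 16 ms resolution), so the scaled count is an estimate.

```masm
include \masm32\include\masm32rt.inc

TRIAL_COUNT equ 100000                   ; hypothetical first guess at the loop count

.code
CodeUnderTest proc                       ; hypothetical stand-in for the tested code
    mov     eax, 1
    ret
CodeUnderTest endp

start:
    ; 1) bind to a single core (bit 0 = first logical CPU)
    invoke  GetCurrentProcess
    invoke  SetProcessAffinityMask, eax, 1

    ; 2) let the system settle
    invoke  Sleep, 750

    ; 3) time a trial pass, then scale the count toward 500 ms per pass
    invoke  GetTickCount
    mov     esi, eax                     ; start time in ms
    mov     edi, TRIAL_COUNT
@@:
    call    CodeUnderTest
    dec     edi
    jnz     @B
    invoke  GetTickCount
    sub     eax, esi                     ; elapsed ms for the trial pass
    .if eax == 0
        mov     eax, 1                   ; avoid dividing by zero on a very fast pass
    .endif
    mov     ecx, eax
    mov     eax, TRIAL_COUNT
    mov     edx, 500
    mul     edx                          ; edx:eax = TRIAL_COUNT * 500
    div     ecx                          ; eax = loop count for about 0.5 s
    ; eax now holds the adjusted loop count for the real measurement
    invoke  ExitProcess, 0
end start
```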

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
170724  cycles - (  0) 0: crt_strchr
172046  cycles - ( 29) 1: x
70540   cycles - (119) 2: 'c'
62934   cycles - (107) 3: 'cccc'

170343  cycles - (  0) 0: crt_strchr
167063  cycles - ( 29) 1: x
89640   cycles - (119) 2: 'c'
82154   cycles - (107) 3: 'cccc'

240068  cycles - (  0) 0: crt_strchr
87783   cycles - ( 29) 1: x
25549   cycles - (119) 2: 'c'
23995   cycles - (107) 3: 'cccc'
--- ok ---

H:\nidudString\string\strchr => strchr

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
STRCHR-------------------------------------------
209073  cycles - (  0) 0: crt_strchr
221060  cycles - ( 29) 1: x
80107   cycles - (119) 2: 'c'
83854   cycles - (107) 3: 'cccc'

198648  cycles - (  0) 0: crt_strchr
211263  cycles - ( 29) 1: x
106992  cycles - (119) 2: 'c'
96168   cycles - (107) 3: 'cccc'

253182  cycles - (  0) 0: crt_strchr
84552   cycles - ( 29) 1: x
26531   cycles - (119) 2: 'c'
36056   cycles - (107) 3: 'cccc'


they're all over the place   :P
Title: Re: Code location sensitivity of timings
Post by: nidud on July 21, 2014, 06:19:30 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 01:21:19 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 01:28:20 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 01:55:38 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 02:08:28 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 03:41:08 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 23, 2014, 05:36:38 AM
Quote from: nidud on July 23, 2014, 03:41:08 AM
Minimum supported client
Windows XP

The SSE level used is SSE2 so how common is this combination?

It may hurt the feelings of some fans of old hard- and software, but writing code for >=(SSE2 & Win XP) should be OK for 99% of the users.

There is a poll on SSE support here (http://www.insanelymac.com/forum/topic/35109-sse2-vs-sse3-the-poll/): "I'm still waiting for SSE support :) (5 votes [2.45%])"

That was 2006, 8 years ago ;)
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 06:43:29 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 23, 2014, 07:48:38 AM
...or provide fallback routines
you can run a little startup init routine - detect SSE support level - and fill in addresses of PROC's
i am working on something along that line at the moment

these define TYPE's for up to 6 dword parms - you can extend it easily
_FUNC00  TYPEDEF PROTO
_FUNC04  TYPEDEF PROTO :DWORD
_FUNC08  TYPEDEF PROTO :DWORD,:DWORD
_FUNC12  TYPEDEF PROTO :DWORD,:DWORD,:DWORD
_FUNC16  TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD
_FUNC20  TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD,:DWORD
_FUNC24  TYPEDEF PROTO :DWORD,:DWORD,:DWORD,:DWORD,:DWORD,:DWORD

_PFUNC00 TYPEDEF Ptr _FUNC00
_PFUNC04 TYPEDEF Ptr _FUNC04
_PFUNC08 TYPEDEF Ptr _FUNC08
_PFUNC12 TYPEDEF Ptr _FUNC12
_PFUNC16 TYPEDEF Ptr _FUNC16
_PFUNC20 TYPEDEF Ptr _FUNC20
_PFUNC24 TYPEDEF Ptr _FUNC24


then, i am using a structure with function pointers in it
_FUNC STRUCT
  lpfnFunc1  _PFUNC04 ?    ;this function has 1 dword arg
  lpfnFunc2  _PFUNC12 ?    ;this function has 3 dword args
_FUNC ENDS


and, in the .DATA? section...
_Func _FUNC <>

so, you set _Func.lpfnFunc1 and _Func.lpfnFunc2 to point at appropriate routines for the supported SSE level
then.....
    INVOKE  _Func.lpfnFunc1,arg1
    INVOKE  _Func.lpfnFunc2,arg1,arg2,arg3

;or

    push    edi
    mov     edi,offset _Func
    INVOKE  [edi]._FUNC.lpfnFunc1,arg1
    INVOKE  [edi]._FUNC.lpfnFunc2,arg1,arg2,arg3
    pop     edi


another way to go would be to put all the routines for each support level into a DLL
then, at init, load the DLL that is appropriate for the machine
the routines can then all have the same names
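
A minimal sketch of the function-pointer approach, assuming the MASM32 SDK and a CPU that has CPUID. Func1Plain, Func1Sse2 and InitFuncs are hypothetical names, and both routines are placeholders that only return their argument.

```masm
include \masm32\include\masm32rt.inc
.586                                     ; CPUID needs .586 or later

_FUNC04  TYPEDEF PROTO :DWORD
_PFUNC04 TYPEDEF PTR _FUNC04

_FUNC STRUCT
  lpfnFunc1  _PFUNC04 ?                  ; this function has 1 dword arg
_FUNC ENDS

.data?
_Func _FUNC <>

.code
Func1Plain proc dwArg:DWORD              ; hypothetical fallback routine
    mov     eax, dwArg
    ret
Func1Plain endp

Func1Sse2 proc dwArg:DWORD               ; hypothetical SSE2 routine (placeholder body)
    mov     eax, dwArg
    ret
Func1Sse2 endp

InitFuncs proc uses ebx                  ; CPUID overwrites EBX, so preserve it
    mov     eax, 1
    cpuid                                ; feature flags returned in EDX and ECX
    mov     eax, offset Func1Plain       ; default to the fallback
    test    edx, 04000000h               ; CPUID.1:EDX bit 26 = SSE2
    jz      @F
    mov     eax, offset Func1Sse2
@@:
    mov     _Func.lpfnFunc1, eax
    ret
InitFuncs endp

start:
    call    InitFuncs                    ; run once at startup
    INVOKE  _Func.lpfnFunc1, 123         ; callers never test the SSE level again
    invoke  ExitProcess, 0
end start
```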
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 23, 2014, 07:53:25 AM
most people probably have at least SSE3
however, we can look at the forum members, alone, and find a few machines
some that probably support only MMX or SSE(1)

i bought this machine in 2005 - it supports SSE3, which was a new thing at the time
so - it's almost 10 years old
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 23, 2014, 09:03:40 AM
Quote from: dedndave on July 23, 2014, 07:53:25 AM
i bought this machine in 2005 - it supports SSE3, which was a new thing at the time
so - it's almost 10 years old

SSE3 was introduced in April 2005 with the Prescott revision of the Pentium 4 processor.

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on July 23, 2014, 09:26:21 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 24, 2014, 03:45:40 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 24, 2014, 04:32:46 AM
Quote from: Gunther on July 23, 2014, 09:03:40 AM
SSE3 was introduced in April 2005 with the Prescott revision of the Pentium 4 processor.

SSE2 was introduced in November 2000 with the P4 Willamette. In general, it's absolutely sufficient (try your luck, make Instr_() faster with SSE7.8... (http://masm32.com/board/index.php?topic=3408.msg36297#msg36297)); in particular, pcmpeqb and pmovmskb are important improvements.
Title: Re: Code location sensitivity of timings
Post by: nidud on July 25, 2014, 04:29:11 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 25, 2014, 06:33:53 AM
Quote from: nidud on July 25, 2014, 04:29:11 AM
with regard to memcpy there seems little gain using SSE
...
conclusion:
- in newer CPU's MOVSB is faster than moving blocks
- in older CPU's MOVSB gets faster with size
- SSE may be faster depending on CPU

Or, in short: Everything is more complicated than you think. (http://www.masmforum.com/board/index.php?topic=11454.msg87622#msg87622)
Title: Re: Code location sensitivity of timings
Post by: nidud on July 25, 2014, 07:19:55 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 25, 2014, 06:01:49 PM
Quote from: nidud on July 25, 2014, 07:19:55 AM
:biggrin:

Yes, it's possible to complicate things I guess and the link you provide includes a lot of complicated issues but few constructive conclusions to the problem at hand.

The table there looked different for each and every CPU we tested (try yourself the latest version (http://masm32.com/board/index.php?topic=1971.msg20618#msg20618)). So the choice was either choosing an algo that provided reasonable speed for most of them, or going the stony road of checking which CPU family and branching to a specialised one. For MasmBasic's MbCopy, rep movsd made the race. It is pretty fast on all CPUs, and it rocks for large copies (and that is where speed matters...). Good to see that Intel keeps pushing this line, too. I wouldn't use rep movsb for the whole copy, though, as many of the not-so-recent CPUs are very slow with the byte variant of movs.

      push ecx
      shr ecx, 2           ; divide count by 4
      rep movsd            ; copy DWORD size blocks
      pop ecx              ; Reload byte count
      and ecx, 3           ; get the rest
      rep movsb            ; copy the rest
      xchg eax, edi        ; for CAT$, return a pointer to the end of the destination;
Title: Re: Code location sensitivity of timings
Post by: sinsi on July 25, 2014, 08:10:40 PM
How much of a difference would RAM speed make? DDR3 speeds can be 1333/1600/1866/2133/2400.
Title: Re: Code location sensitivity of timings
Post by: nidud on July 25, 2014, 10:21:23 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 25, 2014, 10:56:33 PM
Hi sinsi,

your memcpy application brings:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498064    cycles -  10 (  0) 0: crt_memcpy
890775    cycles -  10 ( 38) 1: movsd - mov eax,ecx
892888    cycles -  10 ( 37) 2: movsd - push ecx
353318    cycles -  10 ( 27) 3: movsb
-- unaligned strings --
1006514   cycles -  10 (  0) 0: crt_memcpy
1033525   cycles -  10 ( 38) 1: movsd - mov eax,ecx
1033580   cycles -  10 ( 37) 2: movsd - push ecx
377061    cycles -  10 ( 27) 3: movsb
-- short strings 15 --
175505    cycles - 8000 (  0) 0: crt_memcpy
335538    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
344226    cycles - 8000 ( 37) 2: movsd - push ecx
291953    cycles - 8000 ( 27) 3: movsb
-- short strings 271 --
1033175   cycles - 8000 (  0) 0: crt_memcpy
952811    cycles - 8000 ( 38) 1: movsd - mov eax,ecx
959677    cycles - 8000 ( 37) 2: movsd - push ecx
566948    cycles - 8000 ( 27) 3: movsb
-- short strings 2014 --
3224879   cycles - 4000 (  0) 0: crt_memcpy
3153708   cycles - 4000 ( 38) 1: movsd - mov eax,ecx
3151176   cycles - 4000 ( 37) 2: movsd - push ecx
930276    cycles - 4000 ( 27) 3: movsb
--- ok ---


Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on July 25, 2014, 11:41:54 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 01:14:05 AM
Seems to be better

x1=2051543/884593 = 2.3191942509153927 (8000) ~2.32
x2=5844930/2504176 = 2.3340731641865428 (4000) ~2.33

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

-----------------------------------------------------
-- aligned strings --
1188974   cycles -  10 (  0) 0: crt_memcpy
1097640   cycles -  10 ( 75) 1: movsd - mov eax,ecx
1103251   cycles -  10 ( 75) 2: movsd - push ecx
1102906   cycles -  10 ( 59) 3: movsb
1310185   cycles -  10 (182) 4: SSE
-- unaligned strings --
2595543   cycles -  10 (  0) 0: crt_memcpy
2620959   cycles -  10 ( 75) 1: movsd - mov eax,ecx
2611443   cycles -  10 ( 75) 2: movsd - push ecx
7866087   cycles -  10 ( 59) 3: movsb
1358767   cycles -  10 (182) 4: SSE
-- short strings 15 --
343706    cycles - 8000 (  0) 0: crt_memcpy
789893    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
808747    cycles - 8000 ( 75) 2: movsd - push ecx
2039809   cycles - 8000 ( 59) 3: movsb
237595    cycles - 8000 (182) 4: SSE
-- short strings 271 --
2051543   cycles - 8000 (  0) 0: crt_memcpy
2096801   cycles - 8000 ( 75) 1: movsd - mov eax,ecx
2083586   cycles - 8000 ( 75) 2: movsd - push ecx
7495329   cycles - 8000 ( 59) 3: movsb
884593    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
  5844930   cycles - 4000 (  0) 0: crt_memcpy
  6057324   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
  5890555   cycles - 4000 ( 75) 2: movsd - push ecx
22533778  cycles - 4000 ( 59) 3: movsb
  2504176   cycles - 4000 (182) 4: SSE
--- ok ---
Title: Re: Code location sensitivity of timings
Post by: nidud on July 26, 2014, 01:46:55 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 26, 2014, 01:52:14 AM
align 16
rep movsb


Check if the align is really needed. In the worst case, 15 bytes of code are inserted there. One common trick is to insert the bytes needed before the entry into the proc.
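
The trick can be sketched like this, assuming the MASM32 SDK. CountDown and CountLoop are hypothetical names. The padding count (11) depends on the 5-byte size of the one instruction before the label, so check it against the listing file whenever the code in front of the label changes.

```masm
include \masm32\include\masm32rt.inc

.code
align 16
    db 11 dup (0CCh)                     ; 16 - 5: padding sits before the entry, never executed
CountDown proc                           ; entry is at a 16-byte boundary + 11
    mov     ecx, 1000                    ; B9 imm32 = 5 bytes
CountLoop:                               ; lands on a 16-byte boundary
    dec     ecx
    jnz     CountLoop
    ret
CountDown endp

start:
    call    CountDown
    invoke  ExitProcess, 0
end start
```

Compared with an align 16 placed inside the proc, no filler bytes sit on the execution path between the entry point and the loop.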
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 26, 2014, 02:15:52 AM
nidud - i hope you're using the one in this post

http://masm32.com/board/index.php?topic=3373.msg35658#msg35658

;EAX return bits:
;0 = MMX
;1 = SSE
;2 = SSE2
;3 = SSE3
;4 = SSSE3
;5 = SSE4.1
;6 = SSE4.2


i would define the EQUates this way...

SSE_MMX    equ 1
SSE_SSE    equ 2
SSE_SSE2   equ 4
SSE_SSE3   equ 8
SSE_SSSE3  equ 10h
SSE_SSE41  equ 20h
SSE_SSE42  equ 40h


    call    GetSseLevel
    test    al,SSE_SSE3
    jnz     sse3_supported


the EQUates you have would be ok for BT, i suppose   :P
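
The difference between the two styles can be sketched as below, assuming the MASM32 SDK. SSEBIT_SSE3 is a hypothetical name for a bit-number equate, and the value loaded into EAX is a stand-in for a real detection result.

```masm
include \masm32\include\masm32rt.inc

SSE_SSE3    equ 8                        ; mask form, for TEST
SSEBIT_SSE3 equ 3                        ; bit-number form, for BT (hypothetical name)

.code
start:
    mov     eax, 0Fh                     ; stand-in result: MMX, SSE, SSE2, SSE3 bits set

    test    al, SSE_SSE3                 ; mask: ZF = 0 when the bit is set
    jz      no_sse3

    bt      eax, SSEBIT_SSE3             ; bit number: CF = 1 when the bit is set
    jnc     no_sse3

    print   "SSE3 supported",13,10
    jmp     finished
no_sse3:
    print   "SSE3 not supported",13,10
finished:
    invoke  ExitProcess, 0
end start
```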
Title: Re: Code location sensitivity of timings
Post by: hutch-- on July 26, 2014, 02:55:30 AM
Here is a test piece that uses 4 copies of the same simple byte-intensive algo; it runs the 4 versions and times each one. The idea was to test if the identical algo in 4 locations produced any difference in timing, but on my old quad they are almost perfectly identical, even with multiple runs.

I get this result.


File length = 977426

828 ms
828 ms
828 ms
828 ms
Press any key to continue ...
Title: Re: Code location sensitivity of timings
Post by: nidud on July 26, 2014, 02:57:09 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: dedndave on July 26, 2014, 03:03:03 AM
ok - that one does not preserve EBX - but, it's probably ok, in this case
just so you are aware, CPUID destroys the contents of EBX   :t
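
A sketch of a detection routine that keeps EBX intact, assuming the MASM32 SDK and a CPU that has CPUID. SseLevelSketch is a hypothetical name and not the GetSseLevel routine from the linked post. It only fills the first four bits of the list above.

```masm
include \masm32\include\masm32rt.inc
.586                                     ; CPUID needs .586 or later

.code
SseLevelSketch proc uses ebx             ; "uses ebx" saves and restores EBX around CPUID
    mov     eax, 1
    cpuid                                ; overwrites EAX, EBX, ECX and EDX
    xor     eax, eax                     ; build the result bits in EAX
    test    edx, 00800000h               ; CPUID.1:EDX bit 23 = MMX
    jz      @F
    or      eax, 1
@@:
    test    edx, 02000000h               ; CPUID.1:EDX bit 25 = SSE
    jz      @F
    or      eax, 2
@@:
    test    edx, 04000000h               ; CPUID.1:EDX bit 26 = SSE2
    jz      @F
    or      eax, 4
@@:
    test    ecx, 00000001h               ; CPUID.1:ECX bit 0 = SSE3
    jz      @F
    or      eax, 8
@@:
    ret
SseLevelSketch endp

start:
    call    SseLevelSketch
    print   str$(eax),13,10              ; show the flag bits as a decimal number
    invoke  ExitProcess, 0
end start
```

EBX matters because the Win32 calling convention expects callbacks and procs to preserve it, along with ESI and EDI.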
Title: Re: Code location sensitivity of timings
Post by: nidud on July 26, 2014, 03:30:28 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 03:38:14 AM
File length = 977412

1484 ms
1453 ms
1516 ms
1547 ms
Press any key to continue ...
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 03:40:43 AM
1344 ms
1344 ms
1343 ms
2016 ms...

1343 ms
1344 ms
1344 ms
2015 ms...

1344 ms
1359 ms
1344 ms
2016 ms
If we remove the worst case ...
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 04:08:58 AM
If i am not wrong, you are using 2 counters:
           First  counter     = 1000
           Second counter = count (=4000,etc.)

You get the result only when
the first counter is 0 (counter_end).
So the result has something to do with the execution
of this:

   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi

Is there any particular reason for this ?
Quote
   counter_begin 1000, HIGH_PRIORITY_CLASS
   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi
   .endw
   counter_end
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 26, 2014, 05:09:09 AM
Quote from: nidud on July 26, 2014, 02:57:09 AM
Quote
Check if the align is really needed
I normally tune them from the list file in the end

What I intended is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at micro code level, it is of course a loop).

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (MMX, SSE, SSE2, SSE3)
movsd align 16  10476 µs
movsd align 3   10456 µs
movsd align 13  10347 µs
movsb align 16  10510 µs
movsb align 3   10503 µs
movsb align 13  10407 µs

movsd align 16  10514 µs
movsd align 3   10469 µs
movsd align 13  10516 µs
movsb align 16  10455 µs
movsb align 3   10515 µs
movsb align 13  10502 µs

movsd align 16  10526 µs
movsd align 3   10455 µs
movsd align 13  10469 µs
movsb align 16  10360 µs
movsb align 3   10485 µs
movsb align 13  10456 µs


Sample:
test4a proc uses esi edi ecx
  align 16
  nops 3
  rep movsb
  ret
test4a endp


Interesting, though, that movsb is indeed equally fast on my trusty old Celeron, at least for a 10 MB string.
Title: Re: Code location sensitivity of timings
Post by: nidud on July 26, 2014, 05:43:01 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 26, 2014, 08:50:17 AM
That's the result of memcpy.exe from 1234.zip:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498952    cycles -  10 (  0) 0: crt_memcpy
898756    cycles -  10 ( 75) 1: movsd - mov eax,ecx
903577    cycles -  10 ( 75) 2: movsd - push ecx
354813    cycles -  10 ( 59) 3: movsb
487954    cycles -  10 (182) 4: SSE
-- unaligned strings --
494936    cycles -  10 (  0) 0: crt_memcpy
895940    cycles -  10 ( 75) 1: movsd - mov eax,ecx
895968    cycles -  10 ( 75) 2: movsd - push ecx
373553    cycles -  10 ( 59) 3: movsb
491344    cycles -  10 (182) 4: SSE
-- short strings 15 --
175961    cycles - 8000 (  0) 0: crt_memcpy
361324    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
361586    cycles - 8000 ( 75) 2: movsd - push ecx
313550    cycles - 8000 ( 59) 3: movsb
92719     cycles - 8000 (182) 4: SSE
-- short strings 271 --
841879    cycles - 8000 (  0) 0: crt_memcpy
780741    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
806939    cycles - 8000 ( 75) 2: movsd - push ecx
623419    cycles - 8000 ( 59) 3: movsb
275466    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1002628   cycles - 4000 (  0) 0: crt_memcpy
2239737   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2226209   cycles - 4000 ( 75) 2: movsd - push ecx
962207    cycles - 4000 ( 59) 3: movsb
972245    cycles - 4000 (182) 4: SSE
--- ok ---


Gunther
Title: Re: Code location sensitivity of timings
Post by: RuiLoureiro on July 26, 2014, 09:02:01 AM
Quote
The macro can only be called by EDI, ESI, or EBX or an immediate value.
I think I just run out of regs once and inserted a loop.
The count for small functions is also rather high so it's
just a way of skipping zeros I guess.

I think you are talking about this macro:
   
        counter_begin MACRO loopcount:REQ, priority
   or  counter_end

    If it is, we cannot use EBX because cpuid destroys EBX

I modified counter_begin -written by MichaelW- to this:
(COUNTERLOOPS=1000 or 10000 or 100000 or ...)

; this macro uses EDI inside = length from kIni to kEnd
; we need to define an array to save the means.
; we need to define _LoopCount,_MaxLength...etc. in .DATA
BEGIN_COUNTER_CYCLE_HIGH_PRIORITY_CLASS MACRO   kIni, kEnd
                                        LOCAL   labelA,labelB

                mov     _LoopCount, COUNTERLOOPS
                mov     _MaxLength, kEnd
                mov     edi, kIni
                ;mov     _MinLength, edi         ;; not used yet
                mov     _MeanValue, 0           ;; mean is 0

                invoke  GetCurrentProcess
                invoke  SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    labelA:                                         ;; Begin test loop
   
                BEGIN_LOOP_TEST equ <labelA>
           
                xor     eax, eax        ;; Use same CPUID input value for each call
                cpuid                   ;; Flush pipe & wait for pending ops to finish
                rdtsc                   ;; Read Time Stamp Counter

                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
           
                mov     _LoopCounter, COUNTERLOOPS
                xor     eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
          ALIGN 16                      ;; Optimal loop alignment for P6
          @@:                           ;; Start an empty reference loop
                sub     _LoopCounter, 1
                jnz     short @B

                xor     eax, eax
                cpuid                   ;; Make sure loop instructions finish
                rdtsc                   ;; Read end count
                pop     ecx             ;; Recover low-order 32 bits of start count
                sub     eax, ecx        ;; Low-order 32 bits of overhead count in EAX
                pop     ecx             ;; Recover high-order 32 bits of start count
                sbb     edx, ecx        ;; High-order 32 bits of overhead count in EDX
                push    edx             ;; Preserve high-order 32 bits of overhead count
                push    eax             ;; Preserve low-order 32 bits of overhead count

                xor     eax, eax
                cpuid
                rdtsc
                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
                ;;-------------------------------------
                ;;              Start
                ;;-------------------------------------
                mov         _LoopCounter, COUNTERLOOPS
                xor         eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
    ALIGN 16                            ;; Optimal loop alignment for P6
    labelB:                             ;; Start test loop
                START_LOOP_TEST equ <labelB>
ENDM
; ------------------------------------------------------------------------
END_COUNTER_CYCLE       MACRO  arg
                        LOCAL  $tmpstr$

                sub         _LoopCounter, 1
                jnz         START_LOOP_TEST                ;; goto labelB
                ;;---------------------------
                ;;   stop this count
                ;;---------------------------
                xor         eax, eax
                cpuid                       ;; Make sure loop instructions finish
                rdtsc                       ;; Read end count
                pop         ecx             ;; Recover low-order 32 bits of start count
                sub         eax, ecx        ;; Low-order 32 bits of test count in EAX
                pop         ecx             ;; Recover high-order 32 bits of start count
                sbb         edx, ecx        ;; High-order 32 bits of test count in EDX
                pop         ecx             ;; Recover low-order 32 bits of overhead count
                sub         eax, ecx        ;; Low-order 32 bits of adjusted count in EAX
                pop         ecx             ;; Recover high-order 32 bits of overhead count
                sbb         edx, ecx        ;; High-order 32 bits of adjusted count in EDX

                mov         DWORD PTR _CounterQword, eax
                mov         DWORD PTR _CounterQword + 4, edx
                finit
                fild        _CounterQword
                fild        _LoopCount
                fdiv
                fistp       _CounterQword

                mov         ebx, dword ptr _CounterQword
               
                ;---------------------------------------------------
                ;               print cycles
                ;---------------------------------------------------
                add         ebx, _MeanValue
                mov         _MeanValue, ebx

                add         edi, 1
                cmp         edi, _MaxLength
                jbe         BEGIN_LOOP_TEST                      ;; goto labelA

                invoke      GetCurrentProcess
                invoke      SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
               
                ; --------------------------------------------------
                ;          Save mean and print mean               
                ; --------------------------------------------------
                invoke      SaveMeans, ebx          ;; save it in one array
                                                       ;; one after another
               
                ;---------------------------------------------------
                print       str$(ebx)                       
                $tmpstr$    CATSTR <chr$(">, <arg>, <",13,10)>       
                print       $tmpstr$
                ;---------------------------------------------------                 
ENDM



.data
ALIGN 8                         ;; Optimal alignment for QWORD
_CounterQword   dq 0
_LoopCount      dd 0
_LoopCounter    dd 0                                   

_MinLength      dd 0
_MaxLength      dd 0
_MeanValue      dd 0
;------------------------------
ALIGN   4
                dd 0                ; <<<--- start with 0   
_TblTiming0     dd 600 dup (?)
.code
SaveMeans       proc        kMean:DWORD                   
                mov         eax, kMean
                mov         edx, offset _TblTiming0                   
                mov         ecx, [edx-4]            ; number of means
                mov         [edx+ecx*4], eax                   
                add         ecx, 1
                mov         [edx-4], ecx
                ret
SaveMeans       endp
Title: Re: Code location sensitivity of timings
Post by: nidud on July 27, 2014, 02:10:39 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 27, 2014, 02:54:54 AM
Hi nidud,

Quote from: nidud on July 27, 2014, 02:10:39 AM
I added some bits to Dave's test:

there's nothing attached.

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on July 27, 2014, 04:48:41 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on July 27, 2014, 05:17:38 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on July 27, 2014, 08:06:15 AM
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
-----------------------------------------------
-- aligned strings --
995933  cycles - 10 (  0) 0: crt_strrchr
995891  cycles - 10 ( 40) 1: strrchr
273823  cycles - 10 (154) 2: x
94668   cycles - 10 (112) 3: SSE
-- unaligned strings --
996477  cycles - 10 (  0) 0: crt_strrchr
997094  cycles - 10 ( 40) 1: strrchr
298219  cycles - 10 (154) 2: x
121529  cycles - 10 (112) 3: SSE
-- small strings 128 --
324263  cycles - 500 (  0) 0: crt_strrchr
323710  cycles - 500 ( 40) 1: strrchr
84786   cycles - 500 (154) 2: x
34915   cycles - 500 (112) 3: SSE
-- small strings 1 --
67914   cycles - 500 (  0) 0: crt_strrchr
67286   cycles - 500 ( 40) 1: strrchr
12595   cycles - 500 (154) 2: x
16622   cycles - 500 (112) 3: SSE
Title: Re: Code location sensitivity of timings
Post by: Gunther on July 27, 2014, 10:40:16 AM
Hi nidud,

here's the output of auto.zip:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- aligned strings --
491469    cycles -  10 (  0) 0: crt_memcpy
889651    cycles -  10 ( 63) 1: movsd - mov eax,ecx
887273    cycles -  10 ( 63) 2: movsd - push ecx
355080    cycles -  10 ( 51) 3: movsb
487046    cycles -  10 (182) 4: SSE
355990    cycles -  10 (  0) 5: auto
-- unaligned strings --
490269    cycles -  10 (  0) 0: crt_memcpy
886259    cycles -  10 ( 63) 1: movsd - mov eax,ecx
886778    cycles -  10 ( 63) 2: movsd - push ecx
372520    cycles -  10 ( 51) 3: movsb
491780    cycles -  10 (182) 4: SSE
378881    cycles -  10 (  0) 5: auto
-- short strings 15 --
174897    cycles - 8000 (  0) 0: crt_memcpy
349626    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
343812    cycles - 8000 ( 63) 2: movsd - push ecx
307384    cycles - 8000 ( 51) 3: movsb
98073     cycles - 8000 (182) 4: SSE
293479    cycles - 8000 (  0) 5: auto
-- short strings 271 --
832627    cycles - 8000 (  0) 0: crt_memcpy
773797    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
764418    cycles - 8000 ( 63) 2: movsd - push ecx
586580    cycles - 8000 ( 51) 3: movsb
279676    cycles - 8000 (182) 4: SSE
557134    cycles - 8000 (  0) 5: auto
-- short strings 2014 --
998188    cycles - 4000 (  0) 0: crt_memcpy
2198740   cycles - 4000 ( 63) 1: movsd - mov eax,ecx
2195833   cycles - 4000 ( 63) 2: movsd - push ecx
935710    cycles - 4000 ( 51) 3: movsb
961563    cycles - 4000 (182) 4: SSE
906474    cycles - 4000 (  0) 5: auto
--- ok ---


Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on July 27, 2014, 11:12:01 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: nidud on August 11, 2014, 08:46:40 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: Gunther on August 12, 2014, 01:58:59 AM
Hi nidud,

Quote from: nidud on August 11, 2014, 08:46:40 PM
Is this possible? To have SSE4.1 and not SSE3?

Note: SSE and SSE2 are pre-set since the program will exit if SSE2 is not present, so this bit must be set by the test.

I think not. Did you try that (http://masm32.com/board/index.php?topic=1418.msg14444#msg14444)? It should show you the available instruction sets.

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on August 12, 2014, 03:01:17 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: Gunther on August 12, 2014, 08:49:00 AM
Hi nidud,

You can trust my instruction set detection application. In any case, your laptop supports SSE3, SSSE3, and AVX. You can verify that with this tool (http://masm32.com/board/index.php?topic=3227.msg35958#msg35958), provided you have at least Windows 7 with SP1 installed. The glitch must be in your code. Are you testing the right bits?
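
On the "right bits" question, here is a minimal sketch in C of how the CPUID leaf 1 feature bits map to the instruction sets named in this thread. The bit positions follow Intel's CPUID documentation. The function and flag names are hypothetical, and this is neither the detection tool linked above nor nidud's (deleted) code. The function takes the register values as parameters instead of executing CPUID, so the caller is assumed to supply ECX/EDX from leaf 1 and XCR0 from XGETBV (pass 0 for XCR0 when OSXSAVE is clear).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag names for this sketch only. */
enum {
    FEAT_SSE2  = 1 << 0,
    FEAT_SSE3  = 1 << 1,
    FEAT_SSSE3 = 1 << 2,
    FEAT_SSE41 = 1 << 3,
    FEAT_SSE42 = 1 << 4,
    FEAT_AVX   = 1 << 5
};

/* Return bit n of a 32-bit register value. */
static unsigned bit(uint32_t reg, unsigned n)
{
    return (reg >> n) & 1u;
}

/* Decode CPUID leaf 1 (EAX=1) results.
   ecx, edx: register values returned by CPUID leaf 1.
   xcr0:     value of XCR0 read with XGETBV (0 if OSXSAVE is not set). */
unsigned decode_leaf1(uint32_t ecx, uint32_t edx, uint64_t xcr0)
{
    unsigned f = 0;
    if (bit(edx, 26)) f |= FEAT_SSE2;   /* EDX bit 26: SSE2   */
    if (bit(ecx, 0))  f |= FEAT_SSE3;   /* ECX bit 0:  SSE3   */
    if (bit(ecx, 9))  f |= FEAT_SSSE3;  /* ECX bit 9:  SSSE3  */
    if (bit(ecx, 19)) f |= FEAT_SSE41;  /* ECX bit 19: SSE4.1 */
    if (bit(ecx, 20)) f |= FEAT_SSE42;  /* ECX bit 20: SSE4.2 */
    /* AVX needs the CPU bit (28), OSXSAVE (27), and the OS enabling
       SSE and AVX state in XCR0 (bits 1 and 2). */
    if (bit(ecx, 28) && bit(ecx, 27) && (xcr0 & 0x6u) == 0x6u)
        f |= FEAT_AVX;
    return f;
}

/* Made-up register values modelling a CPU with SSE4.2 but no AVX. */
void decode_leaf1_selfcheck(void)
{
    uint32_t ecx = (1u << 0) | (1u << 9) | (1u << 19) | (1u << 20);
    uint32_t edx = 1u << 26;
    unsigned f = decode_leaf1(ecx, edx, 0);
    assert(f & FEAT_SSE3);
    assert(f & FEAT_SSE41);
    assert(!(f & FEAT_AVX));
}
```

Since SSE3 sits in ECX bit 0 and SSE4.1 in ECX bit 19, a test that reads the wrong register or bit is one way a result like "SSE4.1 without SSE3" could arise. The OS check for AVX is likely why the Windows 7 SP1 requirement matters.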

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on August 12, 2014, 09:27:19 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: MichaelW on August 12, 2014, 10:00:55 AM
Hi Gunther,

My Pentium G3220 does not support AVX, but the results from your instruction set detection tool:

Supported by Processor and installed Operating System:
------------------------------------------------------

     MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
     POPCNT, SSE4.2

     featurenumber = 13


Appear to match the Intel specs:

http://ark.intel.com/products/77773
Title: Re: Code location sensitivity of timings
Post by: Gunther on August 12, 2014, 11:02:35 AM
Hi Michael,

Quote from: MichaelW on August 12, 2014, 10:00:55 AM
Appear to match the Intel specs:

http://ark.intel.com/products/77773

I hope so. I've written the procedure using the Intel documents as a basis.

Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on August 18, 2014, 09:39:19 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on August 18, 2014, 10:36:29 PM
  movzx eax, byte ptr [esp+8]
  if 1
    imul eax, eax, 01010101h ; 4 bytes shorter, faster
  else
    mov ah, al
    mov ecx, eax
    shl eax, 16
    add eax, ecx
  endif
  movd xmm0, eax
  pshufd xmm0, xmm0, 0 ; populate char
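
The two branches of the snippet above compute the same thing, and a small C sketch can check that. It is a model of the arithmetic, not of the instruction encodings or timings, and the function names are hypothetical. Multiplying a zero-extended byte by 01010101h copies it into all four byte lanes, because each partial product lands in its own byte and no carries occur.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply route: 000000cch * 01010101h = cccccccch. */
uint32_t broadcast_imul(uint8_t c)
{
    return (uint32_t)c * 0x01010101u;
}

/* Shift/add route, mirroring the else-branch step by step. */
uint32_t broadcast_shift(uint8_t c)
{
    uint32_t eax = c;        /* movzx eax, byte ptr [...] */
    eax |= eax << 8;         /* mov ah, al                */
    uint32_t ecx = eax;      /* mov ecx, eax              */
    eax <<= 16;              /* shl eax, 16               */
    eax += ecx;              /* add eax, ecx              */
    return eax;
}

/* Exhaustive check over all 256 byte values. */
void broadcast_selfcheck(void)
{
    for (unsigned c = 0; c < 256; c++)
        assert(broadcast_imul((uint8_t)c) == broadcast_shift((uint8_t)c));
}
```

For example, broadcast_imul(0x41) gives 0x41414141. The movd/pshufd pair then replicates that dword across all four lanes of xmm0.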
Title: Re: Code location sensitivity of timings
Post by: nidud on August 18, 2014, 11:40:39 PM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on August 19, 2014, 01:40:54 AM
Variants of memchr:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

43778   cycles for 100 * memchr scasb
4474    cycles for 100 * memchr SSE2a
5608    cycles for 100 * memchr SSE2b

43994   cycles for 100 * memchr scasb
4497    cycles for 100 * memchr SSE2a
5602    cycles for 100 * memchr SSE2b

44044   cycles for 100 * memchr scasb
4474    cycles for 100 * memchr SSE2a
5598    cycles for 100 * memchr SSE2b

36      bytes for memchr scasb
88      bytes for memchr SSE2a
92      bytes for memchr SSE2b


This could look very different on other CPUs, since movlps speeds it up a lot on mine.
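
For readers without the attachments, here is a rough C sketch of the general compare-and-mask technique that an SSE2 memchr typically uses. It is an assumption about the overall shape, not a reconstruction of the SSE2a/SSE2b routines timed above, whose sources are not in this archive. The name memchr_sketch is hypothetical. The unaligned load is left to the compiler, so it will not necessarily become the movlps/movhps pair or the movups discussed here. Without SSE2 support at compile time the code falls back to a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Find the first occurrence of byte c in buf[0..n), or NULL. */
const void *memchr_sketch(const void *buf, int c, size_t n)
{
    const unsigned char *p = (const unsigned char *)buf;
    const unsigned char ch = (unsigned char)c;
#if defined(__SSE2__)
    /* Broadcast the search byte to all 16 lanes. */
    const __m128i needle = _mm_set1_epi8((char)ch);
    while (n >= 16) {
        /* Unaligned 16-byte load, byte-wise compare, then collect
           the 16 compare results into a 16-bit mask. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask) {
            size_t i = 0;
            while (!(mask & 1u)) {  /* index of lowest set bit */
                mask >>= 1;
                i++;
            }
            return p + i;
        }
        p += 16;
        n -= 16;
    }
#endif
    /* Tail (or the whole buffer when SSE2 is unavailable). */
    for (; n; p++, n--)
        if (*p == ch)
            return p;
    return NULL;
}

/* Small check: the match is placed at offset 96 of a 100-byte buffer. */
void memchr_sketch_selfcheck(void)
{
    unsigned char buf[100];
    memset(buf, 'a', sizeof buf);
    buf[96] = 'x';
    assert(memchr_sketch(buf, 'x', sizeof buf) == buf + 96);
    assert(memchr_sketch(buf, 'z', sizeof buf) == NULL);
}
```

An assembly version would normally replace the bit-scan loop with bsf, and the choice of load instruction is exactly the part that varies between the variants compared in these timings.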
Title: Re: Code location sensitivity of timings
Post by: MichaelW on August 19, 2014, 02:27:49 AM
My processor is near (or at) the bottom end today (retail box, $79).

Intel(R) Pentium(R) CPU G3220 @ 3.00GHz (SSE4)

24909   cycles for 100 * memchr scasb
2864    cycles for 100 * memchr SSE2a
2399    cycles for 100 * memchr SSE2b

24934   cycles for 100 * memchr scasb
2882    cycles for 100 * memchr SSE2a
2366    cycles for 100 * memchr SSE2b

24923   cycles for 100 * memchr scasb
2886    cycles for 100 * memchr SSE2a
2418    cycles for 100 * memchr SSE2b

36      bytes for memchr scasb
88      bytes for memchr SSE2a
92      bytes for memchr SSE2b

96      = eax memchr scasb
96      = eax memchr SSE2a
96      = eax memchr SSE2b

Title: Re: Code location sensitivity of timings
Post by: nidud on August 19, 2014, 03:12:24 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: jj2007 on August 19, 2014, 03:39:31 AM
Thanks. As I suspected, the movlps/movhps pair is good only for my trusty Celeron  :(
Here is one more, with movups instead:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
43821   cycles for 100 * memchr scasb
4477    cycles for 100 * memchr SSE2 lps/hps
5556    cycles for 100 * memchr SSE2 nidud
5205    cycles for 100 * memchr SSE2 ups

43778   cycles for 100 * memchr scasb
4476    cycles for 100 * memchr SSE2 lps/hps
5606    cycles for 100 * memchr SSE2 nidud
5206    cycles for 100 * memchr SSE2 ups

43762   cycles for 100 * memchr scasb
4482    cycles for 100 * memchr SSE2 lps/hps
5607    cycles for 100 * memchr SSE2 nidud
5200    cycles for 100 * memchr SSE2 ups

36      bytes for memchr scasb
88      bytes for memchr SSE2 lps/hps
92      bytes for memchr SSE2 nidud
84      bytes for memchr SSE2 ups
Title: Re: Code location sensitivity of timings
Post by: MichaelW on August 19, 2014, 04:07:35 AM

Intel(R) Pentium(R) CPU G3220 @ 3.00GHz (SSE4)

24916   cycles for 100 * memchr scasb
2889    cycles for 100 * memchr SSE2 lps/hps
2422    cycles for 100 * memchr SSE2 nidud
2351    cycles for 100 * memchr SSE2 ups

24927   cycles for 100 * memchr scasb
2890    cycles for 100 * memchr SSE2 lps/hps
2469    cycles for 100 * memchr SSE2 nidud
2342    cycles for 100 * memchr SSE2 ups

24921   cycles for 100 * memchr scasb
2885    cycles for 100 * memchr SSE2 lps/hps
2405    cycles for 100 * memchr SSE2 nidud
2351    cycles for 100 * memchr SSE2 ups

36      bytes for memchr scasb
88      bytes for memchr SSE2 lps/hps
92      bytes for memchr SSE2 nidud
84      bytes for memchr SSE2 ups
Title: Re: Code location sensitivity of timings
Post by: nidud on August 19, 2014, 05:06:01 AM
deleted
Title: Re: Code location sensitivity of timings
Post by: Gunther on August 19, 2014, 05:08:29 AM
Jochen,

your timings:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

21892   cycles for 100 * memchr scasb
3007    cycles for 100 * memchr SSE2 lps/hps
2690    cycles for 100 * memchr SSE2 nidud
2500    cycles for 100 * memchr SSE2 ups

21951   cycles for 100 * memchr scasb
2981    cycles for 100 * memchr SSE2 lps/hps
2721    cycles for 100 * memchr SSE2 nidud
6211    cycles for 100 * memchr SSE2 ups

21827   cycles for 100 * memchr scasb
3003    cycles for 100 * memchr SSE2 lps/hps
2510    cycles for 100 * memchr SSE2 nidud
2721    cycles for 100 * memchr SSE2 ups

36      bytes for memchr scasb
88      bytes for memchr SSE2 lps/hps
92      bytes for memchr SSE2 nidud
84      bytes for memchr SSE2 ups

--- ok ---


Gunther
Title: Re: Code location sensitivity of timings
Post by: nidud on March 22, 2015, 02:17:44 AM
deleted