"Hello masm32", not a BOT, new member

Started by LordAdef, January 22, 2017, 09:42:24 AM

hutch--

 :biggrin:

> I found my example a bit more relevant for practical purpose

I took Intel's word on it, and it seems they know what they are talking about as they designed the hardware. It may be different on AMD.

> Btw how do the timings change with an align 4 before the two loops?

I rarely align code these days as it often slows the code down. It was helpful at times with very old hardware, pre PIV, but from the PIV onwards it often has the reverse effect.

jj2007

Quote from: hutch-- on February 16, 2017, 11:24:39 AM

> Btw how do the timings change with an align 4 before the two loops?

I rarely align code these days as it often slows the code down. It was helpful at times with very old hardware, pre PIV, but from the PIV onwards it often has the reverse effect.

With new hardware, such as my Core i5, an align 4 definitely speeds up the mov ah part.

hutch--

> With new hardware, such as my Core i5, an align 4 definitely speeds up the mov ah part

This is possibly the case. The Intel data indicated that a read or write to a high byte register involves an extra operation, which makes it slower than a read/write to a low byte register. Very vaguely, it had to do with a masking operation to get the high byte.

I will make the point though that writing legacy code in either 32 or 64 bit is a mistake. It simply cannot be done as a 64 bit operation, and while your 32 bit example did work in 64 bit code, you will get stung in performance terms by doing things like that. At least in 64 bit code you have more BYTE registers, which removes the need to use antique code.
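
For reference, a minimal sketch of the register point (not taken from any attachment in this thread; plain ML64 syntax is assumed). 64-bit mode adds the byte registers SIL, DIL, BPL, SPL and R8B to R15B, but AH, BH, CH and DH cannot be encoded in any instruction that needs a REX prefix, so they cannot be mixed with the new registers:

    mov ah, dl          ; legacy high-byte register, no REX prefix needed
    mov sil, dl         ; low byte of RSI, 64-bit mode only (REX prefix)
    mov r8b, dl         ; low byte of R8, 64-bit mode only (REX prefix)
  ; mov ah, r8b         ; will not assemble: AH cannot be combined with a REX-only register
  ; mov ah, sil         ; same restriction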

nidud

#78
deleted

jj2007

Quote from: nidud on February 16, 2017, 10:50:13 PM
Here's a simple 64-bit test-bed without any alignment problems.

Can't be :(

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (AVX)
----------------------------------------------
-- test(1)
   553579 cycles, rep(3000), code(801) 0.asm: load AL
   428491 cycles, rep(3000), code(801) 1.asm: load AH
-- test(2)
   539858 cycles, rep(3000), code(801) 0.asm: load AL
   429470 cycles, rep(3000), code(801) 1.asm: load AH
-- test(3)
   532109 cycles, rep(3000), code(801) 0.asm: load AL
   433764 cycles, rep(3000), code(801) 1.asm: load AH

total [1 .. 3], 1++
  1291725 cycles 1.asm: load AH
  1625546 cycles 0.asm: load AL

hutch--

Here is what it looks like in 64 bit MASM; the load to AH is even slower.

Run on my 3.3 GHz Haswell. Times are in milliseconds, as returned by GetTickCount.

703 load AL
1203 load AH
656 load AL
1172 load AH
672 load AL
1156 load AH
672 load AL
1203 load AH
719 load AL
1156 load AH
672 load AL
1156 load AH
672 load AL
1156 load AH
672 load AL
1156 load AH

That's all folks .....


RE: The testing method. The virtues of real time testing versus interpreted testing.

File attached.

jj2007

    mov rsi, 1024*1024*1024     ; a power of 2, billion.
    mov dl, 0

    invoke GetTickCount
    push rax

  @@:                           ; -------
    mov ah, dl                  ; load AH  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    add dl, 1                   ; -------
    cmp dl, 255
    jne nxt1
    mov dl, 0
  nxt1:
    sub rsi, 1


So at least we now have an official confirmation that AH can be used in 64-bit code. But the mystery remains:

1. why is Nidud's test so much faster for AH?

    repeat 200
    mov dl,ah
    mov ah,bl
    endm


2. why is Hutch's test so much slower for AH?

  @@:                           ; -------
    mov ah, dl                  ; load AH
    add dl, 1                   ; -------
    cmp dl, 255
    jne nxt1
    mov dl, 0
  nxt1:
    sub rsi, 1
    jnz @B


Anybody here from Intel who could enlighten us?

P.S.: What does the 64-bit Windows ABI say about preserving regs?
    mov dl, 0
    invoke GetTickCount
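
For what it is worth, the Microsoft x64 calling convention treats RAX, RCX, RDX and R8 to R11 as volatile, and RBX, RBP, RDI, RSI, RSP and R12 to R15 as nonvolatile, so DL is not guaranteed to survive the call. A minimal sketch of a safer ordering, assuming the masm64rt.inc invoke macro used in this thread and a free nonvolatile register (R15 here is an arbitrary choice):

    invoke GetTickCount         ; may clobber RAX, RCX, RDX, R8-R11
    mov r15, rax                ; keep the start time in a nonvolatile register
    mov dl, 0                   ; initialise DL after the call, not before it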

hutch--


> 2. why is Hutch' test so much slower for AH?

Testing in real time!

Here is a variation on the first test: I removed the PUSH/POP from both examples and used r15 instead, and the load AH code got faster. The load AH is still a lot slower.

703 load AL
1031 load AH
703 load AL
985 load AH
687 load AL
1063 load AH
734 load AL
1016 load AH
687 load AL
1016 load AH
687 load AL
1016 load AH
703 load AL
1016 load AH
734 load AL
1016 load AH

That's all folks .....


The code.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    LOCAL .rsi  :QWORD
    LOCAL .rdi  :QWORD
    LOCAL .r15  :QWORD

    mov .rsi, rsi
    mov .rdi, rdi
    mov .r15, r15

    mov rdi, 8                  ; the loop counter

  lpstart:

  ; -----------------------------------------------------------

    mov rsi, 1024*1024*1024     ; a power of 2, billion.
    mov dl, 0

    invoke GetTickCount
    mov r15, rax

  @@:                           ; -------
    mov al, dl                  ; load AL
    add dl, 1                   ; -------
    cmp dl, 255
    jne nxt
    mov dl, 0
  nxt:
    sub rsi, 1
    jnz @B

    invoke GetTickCount
    mov rcx, r15
    sub rax, rcx

    conout str$(rax)," load AL",lf

  ; -----------------------------------------------------------

    mov rsi, 1024*1024*1024     ; a power of 2, billion.
    mov dl, 0

    invoke GetTickCount
    mov r15, rax

  @@:                           ; -------
    mov ah, dl                  ; load AH
    add dl, 1                   ; -------
    cmp dl, 255
    jne nxt1
    mov dl, 0
  nxt1:
    sub rsi, 1
    jnz @B

    invoke GetTickCount
    mov rcx, r15
    sub rax, rcx

    conout str$(rax)," load AH",lf

  ; -----------------------------------------------------------

    sub rdi, 1
    jnz lpstart

    conout lf
    waitkey "That's all folks ....."

    mov rsi, .rsi
    mov rdi, .rdi
    mov r15, .r15

    invoke ExitProcess,0

    ret

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

nidud

#83
deleted

hutch--

Thanks for that, I don't have an AMD box to test with.

TWell

Another AMD E1
3328 load AL
3297 load AH
3281 load AL
3266 load AH
3297 load AL
3297 load AH
3265 load AL
3282 load AH
3281 load AL
3281 load AH
3281 load AL
3297 load AH
3282 load AL
3265 load AH
3297 load AL
3281 load AH

That's all folks .....

FORTRANS

Hi,

   FWIW.

; - - -
{P-III}
; - - -
pre-P4 (SSE1)

104   cycles for 100 * mov ah
105   cycles for 100 * mov al

104   cycles for 100 * mov ah
105   cycles for 100 * mov al

103   cycles for 100 * mov ah
105   cycles for 100 * mov al

105   cycles for 100 * mov ah
108   cycles for 100 * mov al

104   cycles for 100 * mov ah
104   cycles for 100 * mov al

13   bytes for mov ah
12   bytes for mov al


--- ok ---

; - - -
P-MMX
; - - -
pre-P4
489   cycles for 100 * mov ah
505   cycles for 100 * mov al

489   cycles for 100 * mov ah
513   cycles for 100 * mov al

529   cycles for 100 * mov ah
592   cycles for 100 * mov al

496   cycles for 100 * mov ah
517   cycles for 100 * mov al

499   cycles for 100 * mov ah
518   cycles for 100 * mov al

13   bytes for mov ah
12   bytes for mov al


--- ok ---

; = = =

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

74   cycles for 100 * mov ah
163   cycles for 100 * mov al

74   cycles for 100 * mov ah
162   cycles for 100 * mov al

74   cycles for 100 * mov ah
162   cycles for 100 * mov al

74   cycles for 100 * mov ah
162   cycles for 100 * mov al

74   cycles for 100 * mov ah
162   cycles for 100 * mov al

13   bytes for mov ah
12   bytes for mov al


--- ok ---

; - - -

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (AVX2)
----------------------------------------------
-- test(1)
   613736 cycles, rep(3000), code(801) 0.asm: load AL
   542611 cycles, rep(3000), code(801) 1.asm: load AH
-- test(2)
   620431 cycles, rep(3000), code(801) 0.asm: load AL
   535470 cycles, rep(3000), code(801) 1.asm: load AH
-- test(3)
   628891 cycles, rep(3000), code(801) 0.asm: load AL
   532790 cycles, rep(3000), code(801) 1.asm: load AH

total [1 .. 3], 1++
  1610871 cycles 1.asm: load AH
  1863058 cycles 0.asm: load AL
hit any key to continue...


HTH,

Steve N.

jj2007

  align 8
@@:
mov ah, byte ptr somestring
mov ch, ah
inc ah
movzx eax, ah
dec ebx
jns @B


  align 8
@@:
mov al, byte ptr somestring
mov cl, al
inc al
movzx eax, al
dec ebx
jns @B


Results for a Core i5, one billion iterations:

This code was assembled with ML in 64-bit format
671 ms for AH
1014 ms for AL

655 ms for AH
983 ms for AL

656 ms for AH
982 ms for AL

640 ms for AH
983 ms for AL

671 ms for AH
982 ms for AL


No speed difference between 64-bit and 32-bit code. Sources (rich text, poor text) and executables built with ML, ML64, AsmC and HJWasm32 attached.

P.S.: What exactly is "real time", and how is it different to what?

Quote from: hutch-- on February 17, 2017, 01:31:54 AM
Testing in real time !

LordAdef

I wouldn't have imagined such different results...

I'll run the test here and post my results too. Win7 64b on i7

hutch--

 :biggrin:

> P.S.: What exactly is "real time", and how is it different to what?

The stuff you wake up to in the morning, stop to eat your lunch by, go to the pub after work by and what you set the alarm clock for. I have warned you for a long time that your timing technique is unsound: there are too many interpretive layers and too many variables that can produce misleading results. The other factor is that cycle counts went out the door with the 486DX.

The reason I still use a very crude low level timing technique, run long enough to avoid the effects of task switching, is that it assumes only one thing: how long it takes. Using misleading timing techniques may justify using obsolete instructions, but you won't get fast code by doing it.