npad instruction

Started by TouEnMasm, May 07, 2013, 04:32:28 PM


TouEnMasm


I compiled a C program (it computes the length of a string) with /Ox (maximum optimization).
I looked for the npad instruction in my AMD reference book but could not find it.

char* chaine=("Je suis une phrase");

int longueur (const char * str)
       {
           int length = 0;

           while( *str++ )
                   ++length;

           return( length );
       }



PUBLIC ?longueur@@YAHPBD@Z ; longueur
; Function compile flags: /Ogtpy
_TEXT SEGMENT
_str$ = 8 ; size = 4
?longueur@@YAHPBD@Z PROC ; longueur
; File e:\test_struct\structure.cpp
; Line 23
mov ecx, DWORD PTR _str$[esp-4]
xor eax, eax
cmp BYTE PTR [ecx], al
je SHORT $LN6@longueur
npad 6
$LL2@longueur:
inc ecx
; Line 24
inc eax
cmp BYTE PTR [ecx], 0
jne SHORT $LL2@longueur
$LN6@longueur:
; Line 27
ret 0
?longueur@@YAHPBD@Z ENDP ; longueur
_TEXT ENDS
Fa is a musical note to play with CL

Donkey

npad is just repeated NOP instructions used for alignment. I haven't done the byte count for your disassembly, but it probably aligns the loop to a 16-byte boundary.
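The byte count mentioned above can be sketched in a few lines of C. This is only an illustration: the instruction sizes are the usual 32-bit encodings of the four instructions before the loop label (an assumption, since the listing shows no opcode bytes), the function is assumed to start on a 16-byte boundary, and the names `pad_to_boundary` and `prologue_size` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of padding needed so that `offset` becomes a multiple of `align`.
   `align` must be a power of two (16 in this thread). */
size_t pad_to_boundary(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (offset & (align - 1))) & (align - 1);
}

/* Sum of the assumed encoded sizes of the instructions before the loop:
     mov ecx, DWORD PTR [esp+4]   8B 4C 24 04   -> 4 bytes
     xor eax, eax                 33 C0         -> 2 bytes
     cmp BYTE PTR [ecx], al       38 01         -> 2 bytes
     je  SHORT ...                74 xx         -> 2 bytes */
size_t prologue_size(void)
{
    static const size_t sizes[] = { 4, 2, 2, 2 };
    size_t total = 0;
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; ++i)
        total += sizes[i];
    return total;
}
```

Under those assumptions `prologue_size()` is 10 and `pad_to_boundary(prologue_size(), 16)` is 6, which matches the `npad 6` in the listing and puts the loop label at offset 16.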
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

TouEnMasm

Thanks.

Here is the same in MASM:

chainsize PROC pstring:DWORD
mov ecx,pstring
xor eax, eax
cmp BYTE PTR [ecx], al
je SHORT endlongueur
align 16
repeatlongueur:
inc ecx
inc eax
cmp BYTE PTR [ecx], 0
jne SHORT repeatlongueur
endlongueur:
ret
chainsize ENDP

Jibz

npad is a macro defined in listing.inc in your VC include folder, and it inserts "non destructive" operations rather than just a series of NOP. For instance, the npad 6 from your example should be "lea ebx, [ebx+0]".

A word of warning: the npad macro only works for 32-bit code, because if you insert "lea ebx, [ebx+0]" in 64-bit code it clears the top 32 bits of rbx, which is hardly non-destructive.
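The warning above follows from the x86-64 rule that writing a 32-bit register zero-extends the result into the full 64-bit register. A minimal C model of that effect, with hypothetical helper names (`write_reg32`, `npad6_effect_on_rbx`) that are not part of any real API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models a write to a 32-bit register in 64-bit mode:
   the upper 32 bits of the full register are cleared. */
void write_reg32(uint64_t *reg, uint32_t value)
{
    assert(reg != NULL);
    *reg = (uint64_t)value;
}

/* Models the 6-byte npad encoding (8D 9B 00 00 00 00) when it is
   executed in 64-bit mode as a 32-bit lea into ebx. */
uint64_t npad6_effect_on_rbx(uint64_t rbx)
{
    uint32_t result = (uint32_t)(rbx + 0); /* effective address, truncated to 32 bits */
    write_reg32(&rbx, result);
    return rbx;
}
```

For example, `npad6_effect_on_rbx(0x1122334455667788)` returns `0x55667788`: the low half survives and the high half is lost. A value that already fits in 32 bits comes back unchanged, which is why the padding is harmless in 32-bit code.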
"A problem, properly stated, is a problem on it's way to being solved" -Buckminster Fuller
"Multithreading is just one damn thing after, before, or simultaneous with another" -Andrei Alexandrescu

jj2007

Masm32 has the nops macro:
  nops 6

In general, align 4 before an innermost loop does a good job (and I hope both ML64 and JWasm are aware of the problem flagged by Jibz). Inserting nops will slow down your code a bit, but a nops 8 can be very useful when debugging a lengthy piece of code, because it makes the spot easier to find.

Donkey

Not sure how much NOPs slow down code. Most modern processors do not execute them, since their prediction algorithms just skip over them to the next real instruction (in other words, they never enter an execution pipe and simply advance R/EIP). However, there can be a loss depending on which NOP is selected, as there is still memory access time to consider (though it can generally be considered negligible), so if you need multi-byte padding you are better off choosing the larger NOP instructions.
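As an illustration of "choosing the larger NOP instructions", here is a sketch that fills a padding gap with the fewest instructions. The byte sequences are the recommended multi-byte NOP forms from the Intel manual's NOP entry; they are not what `npad` or `nops` emit, and the `0F 1F` forms need a processor that supports them. The names `MULTIBYTE_NOP` and `emit_padding` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* MULTIBYTE_NOP[n-1] holds one NOP instruction that is n bytes long. */
static const unsigned char MULTIBYTE_NOP[9][9] = {
    { 0x90 },
    { 0x66, 0x90 },
    { 0x0F, 0x1F, 0x00 },
    { 0x0F, 0x1F, 0x40, 0x00 },
    { 0x0F, 0x1F, 0x44, 0x00, 0x00 },
    { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 },
    { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 },
    { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
};

/* Fills buf[0..n) with padding, always taking the longest NOP that
   still fits. Returns the number of instructions written. */
size_t emit_padding(unsigned char *buf, size_t n)
{
    size_t count = 0;
    assert(buf != NULL || n == 0);
    while (n > 0) {
        size_t len = n < 9 ? n : 9;
        memcpy(buf, MULTIBYTE_NOP[len - 1], len);
        buf += len;
        n -= len;
        ++count;
    }
    return count;
}
```

With this scheme a 6-byte gap is covered by one instruction instead of six single-byte NOPs, and a 12-byte gap by two (9 + 3).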

hutch--

One of the things I do with almost any algo I want speed from is test different MASM alignments, because I have seen often enough that an "align 8" or an "align 16" really slowed the algo down. My guess is that the instruction decoder sometimes stalls on a large, complex instruction. I note that some compilers and assemblers pad the ends with DB 90h (nops), which may cause less of a problem than a single large instruction used for alignment. As usual, benchmark the differences on multiple hardware.

jj2007

Quote from: Donkey on May 07, 2013, 06:44:13 PM
Not sure how much NOPs slow down code, most modern processors do not execute them

Here's a test with an incredibly old processor ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
234     ticks for nops 4
390     ticks for nops 8
1763    ticks for nops 40

include \masm32\include\masm32rt.inc

.code
start:
   invoke Sleep, 100
   push rv(GetTickCount)
   mov ecx, 0fffffffh
   .Repeat
      nops 4
      dec ecx
   .Until Sign?
   invoke GetTickCount
   pop edx
   sub eax, edx
   print str$(eax), 9, "ticks for nops 4", 13, 10

   invoke Sleep, 100
   push rv(GetTickCount)
   mov ecx, 0fffffffh
   .Repeat
      nops 8
      dec ecx
   .Until Sign?
   invoke GetTickCount
   pop edx
   sub eax, edx
   print str$(eax), 9, "ticks for nops 8", 13, 10

   invoke Sleep, 100
   push rv(GetTickCount)
   mov ecx, 0fffffffh
   .Repeat
      nops 40
      dec ecx
   .Until Sign?
   invoke GetTickCount
   pop edx
   sub eax, edx
   print str$(eax), 9, "ticks for nops 40", 13, 10

   exit

end start
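To put the tick counts above in perspective, they can be converted to time per iteration. The sketch below assumes the loop body runs 0x10000000 times (ecx starts at 0fffffffh and the loop exits once `dec ecx` sets the sign flag) and that GetTickCount deltas are milliseconds; the function names are made up for this example. Keep in mind that GetTickCount typically has a resolution of roughly 10 to 16 ms, so the last digits are noise.

```c
#include <assert.h>

/* Iterations of the test loop: ecx counts from 0x0fffffff down to 0,
   and the loop exits when dec ecx produces -1. */
#define ITERATIONS 0x10000000UL

/* Average nanoseconds per loop iteration for a GetTickCount delta in ms. */
double ns_per_iteration(unsigned long ticks_ms)
{
    return (double)ticks_ms * 1e6 / (double)ITERATIONS;
}

/* Extra nanoseconds attributed to each additional nop between two runs. */
double ns_per_extra_nop(unsigned long ticks_a, unsigned nops_a,
                        unsigned long ticks_b, unsigned nops_b)
{
    assert(nops_b > nops_a);
    return (ns_per_iteration(ticks_b) - ns_per_iteration(ticks_a))
           / (double)(nops_b - nops_a);
}
```

With the Athlon figures, `ns_per_iteration(234)` is about 0.87 ns, and `ns_per_extra_nop(234, 4, 390, 8)` and `ns_per_extra_nop(390, 8, 1763, 40)` both come out at roughly 0.15 ns per nop, so on that machine the cost grows about linearly with the number of nops executed.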

Jibz

Could you try with the instructions from the npad macro for comparison?

For reference the 4 and 8 byte npad are:

    ; lea esp, [esp+00]
    DB 8DH, 64H, 24H, 00H


    ; jmp .+8; .npad 6
    DB 0EBH, 06H, 8DH, 9BH, 00H, 00H, 00H, 00H


(there is no 40 byte npad, but I guess it would just be a jmp and filler).
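The 8-byte form works because the 2-byte short jump skips the 6 filler bytes that follow it. A small sketch of that arithmetic, using the bytes quoted above; `NPAD8`, `short_jmp_target` and `npad8_jump_target` are illustrative names, not anything from listing.inc:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 8-byte npad bytes quoted above: jmp rel8, then a 6-byte lea. */
static const unsigned char NPAD8[8] = {
    0xEB, 0x06, 0x8D, 0x9B, 0x00, 0x00, 0x00, 0x00
};

/* Target of a 2-byte short jump (EB rel8) located at `offset`.
   The signed displacement is relative to the end of the jump. */
long short_jmp_target(long offset, uint8_t rel8)
{
    return offset + 2 + (int8_t)rel8;
}

/* Where the jump at the start of NPAD8 lands, relative to the pad's start. */
size_t npad8_jump_target(void)
{
    assert(NPAD8[0] == 0xEB); /* EB is the short jmp opcode */
    return (size_t)short_jmp_target(0, NPAD8[1]);
}
```

`npad8_jump_target()` returns 8, the size of the whole pad, so execution resumes at the first byte after the padding and the filler lea is never reached.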

dedndave

the branch at the end of the loop, back to the top of loop, is a SHORT branch
from what i have seen, if it's SHORT, alignment doesn't matter

jj2007

Quote from: Jibz on May 07, 2013, 09:09:47 PM
Could you try with the instructions from the npad macro for comparison?

234     ticks for nops 4
405     ticks for nops 8
234     ticks for npad 4
468     ticks for npad 8


AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
include \masm32\include\masm32rt.inc

npad_4 equ <DB 8DH, 64H, 24H, 00H>
npad_8 equ <DB 0EBH, 06H, 8DH, 9BH, 00H, 00H, 00H, 00H>

Start MACRO
   invoke Sleep, 100
   push rv(GetTickCount)
   mov ecx, 0fffffffh
   align 8
   .Repeat
ENDM
Stop MACRO name
      dec ecx
   .Until Sign?
   invoke GetTickCount
   pop edx
   sub eax, edx
   print str$(eax), 9, "ticks for &name", 13, 10
ENDM

.code
start:
   Start
      nops 4
   Stop nops 4

   Start
      nops 8
   Stop nops 8

   Start
      npad_4
   Stop npad 4

   Start
      npad_8
   Stop npad 8

   exit

end start

Donkey

LOL, I really hope you were joking with those timings. The objective in padding with NOP is to make the code AFTER the alignment faster. As I said, the NOPs should never be executed as the prediction algorithm will not load them into the instruction pipe, they are jumped over. From Agner:

Quote
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be advantageous to align critical loop entries and subroutine entries by 16 in order to minimize the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.

If you noticed, the original routine had cascading loops; the purpose of the padding is to ensure that the subsequent loop begins on a 16-byte boundary. Try running the timing with multiple cascading loops and then you'll have a useful result.

EDIT
After reading over my other post, I guess I explained myself badly. It seems that at 2:45 in the morning I shouldn't be trying to explain prediction and instruction alignment. Sorry about that.
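The point of the Agner quote can be illustrated by counting how many aligned 16-byte fetch blocks a piece of code touches. The sketch below assumes the loop body from the first post is 7 bytes long (inc ecx and inc eax at 1 byte each, cmp BYTE PTR [ecx], 0 at 3 bytes, jne SHORT at 2 bytes) and that the function starts on a 16-byte boundary; `fetch_blocks_spanned` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Number of aligned `block`-byte fetch blocks touched by code that
   occupies the byte range [start, start + length). */
size_t fetch_blocks_spanned(size_t start, size_t length, size_t block)
{
    size_t first, last;
    assert(block != 0 && length != 0);
    first = start / block;
    last  = (start + length - 1) / block;
    return last - first + 1;
}
```

Under those assumptions, the padded loop at offset 16 gives `fetch_blocks_spanned(16, 7, 16) == 1`, while the same loop left at offset 10 gives `fetch_blocks_spanned(10, 7, 16) == 2`. Whether that difference is measurable depends on the processor, as the timings in this thread show.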

TouEnMasm


I have made a test bed to see the effect of loop alignment.
On my computer I can't say the effect is visible; perhaps it is on another machine.

Quote
CPU: Intel(R) Celeron(R) CPU 2.80GHz
System: Microsoft Windows XP Home Edition Build Service Pack 3 2600

chainsize loop alignment is: 2
chainsize time in microseconds: 508

chainsize_4 loop alignment is: 16
chainsize_4 time in microseconds: 524

chainsize_8 loop alignment is: 8
chainsize_8 time in microseconds: 692

chainsize_16 loop alignment is: 32
chainsize_16 time in microseconds: 524


jj2007

Quote from: dedndave on May 07, 2013, 09:15:19 PM
the branch at the end of the loop, back to the top of loop, is a SHORT branch
from what i have seen, if it's SHORT, alignment doesn't matter

If it's LONG, neither... at least on my AMD:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
936     ticks aligned
951     ticks misaligned

936     ticks aligned
936     ticks misaligned

951     ticks aligned
936     ticks misaligned

FORTRANS

Hi,

Quote from: jj2007 on May 07, 2013, 08:10:23 PM
Here's a test with an incredibly old processor ;-)

   Well, you more or less asked...


P-III
1011    ticks for nops 4
1693    ticks for nops 8
7110    ticks for nops 40



P-MMX
6837   ticks for nops 4
10735   ticks for nops 8
37210   ticks for nops 40


Cheers,

Steve N.