npad instruction

TouEnMasm · May 07, 2013, 04:32:28 PM

I compiled a C program (it computes the length of a string) with /Ox (maximum optimization).
I looked for the npad instruction in my AMD reference book but could not find it.

char* chaine=("Je suis une phrase");

 int longueur (const char * str)
       {
           int length = 0;

           while( *str++ )
                   ++length;

           return( length );
       }

PUBLIC	?longueur@@YAHPBD@Z				; longueur
; Function compile flags: /Ogtpy
_TEXT	SEGMENT
_str$ = 8						; size = 4
?longueur@@YAHPBD@Z PROC				; longueur
; File e:\test_struct\structure.cpp
; Line 23
	mov	ecx, DWORD PTR _str$[esp-4]
	xor	eax, eax
	cmp	BYTE PTR [ecx], al
	je	SHORT $LN6@longueur
	npad	6
$LL2@longueur:
	inc	ecx
; Line 24
	inc	eax
	cmp	BYTE PTR [ecx], 0
	jne	SHORT $LL2@longueur
$LN6@longueur:
; Line 27
	ret	0
?longueur@@YAHPBD@Z ENDP				; longueur
_TEXT	ENDS

Donkey · May 07, 2013, 04:41:59 PM

npad is just repeated NOP instructions used for alignment. I haven't done the byte count for your disassembly, but it probably aligns the loop to a 16-byte boundary.
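
The byte count can be checked by hand. Assuming the usual 32-bit encodings (4 bytes for mov ecx, [esp+4], and 2 bytes each for xor eax, eax, cmp [ecx], al and the short je), and assuming the function itself starts on a 16-byte boundary, the four instructions before the loop take 10 bytes, so 6 bytes of padding would put $LL2@longueur at offset 16. A minimal sketch of that arithmetic in C (padding_for is a hypothetical helper name, not something taken from the compiler or from listing.inc):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of padding needed so that `offset` becomes a multiple of
   `alignment`. `alignment` must be a power of two (4, 8, 16, ...). */
size_t padding_for(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (offset & (alignment - 1))) & (alignment - 1);
}
```

With these assumptions, padding_for(10, 16) gives 6, which matches the npad 6 in the listing.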

TouEnMasm · May 07, 2013, 05:06:05 PM

Thanks.

Here is the same in MASM:

chainsize PROC pstring:DWORD 
	mov	ecx,pstring 
	xor	eax, eax
	cmp	BYTE PTR [ecx], al
	je	SHORT endlongueur
	align 16
repeatlongueur:
	inc	ecx
	inc	eax
	cmp	BYTE PTR [ecx], 0
	jne	SHORT repeatlongueur
endlongueur:
	ret
chainsize ENDP

Jibz · May 07, 2013, 05:39:29 PM

npad is a macro defined in listing.inc in your VC include folder, and it inserts "non-destructive" operations rather than just a series of NOPs. For instance, the npad 6 from your example should be "lea ebx, [ebx+0]".

A word of warning: the npad macro only works for 32-bit code, because if you insert "lea ebx, [ebx+0]" in 64-bit code it clears the top 32 bits of rbx, which is hardly non-destructive.
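
The 64-bit problem comes from the rule that writing a 32-bit register in 64-bit mode zeroes the upper half of the full register. Here is a small model of that rule in C. It only simulates the register arithmetic and does not execute the instruction itself, and rbx_after_lea_ebx is a hypothetical name chosen for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Models the effect of "lea ebx, [ebx+0]" on the full rbx in 64-bit mode. */
uint64_t rbx_after_lea_ebx(uint64_t rbx)
{
    uint32_t ebx = (uint32_t)(rbx + 0); /* lea keeps only the low 32 bits */
    return (uint64_t)ebx;               /* 32-bit destination: upper half becomes zero */
}
```

For a starting value of 0x1122334455667788 the model returns 0x55667788, so the upper half is lost. A value that already fits in 32 bits comes back unchanged, which is why the padding is harmless in 32-bit code.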

jj2007 · May 07, 2013, 06:20:28 PM

Masm32 has the nops macro:
nops 6

In general, align 4 before an innermost loop does a good job (and I hope both ML64 and JWasm are aware of the problem flagged by Jibz). Inserting nops will slow down your code a bit, but a nops 8 can be very useful when debugging a lengthy piece of code: it is easier to find that way.

Donkey · May 07, 2013, 06:44:13 PM

I'm not sure how much NOPs slow code down. Most modern processors do not execute them, since their prediction algorithms just skip over them to the next real instruction (in other words, they never enter an execution pipe and simply advance R/EIP). However, there can be some loss depending on which NOP is selected, as there is still memory access time to consider (though it can generally be considered negligible), so if you need multi-byte padding you are better off choosing the larger NOP instructions.
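
As an illustration of "larger NOP instructions", here is a sketch in C that fills a padding area with as few instructions as possible, using the multi-byte NOP encodings recommended in the Intel manuals (the 0F 1F forms). This is not what npad does, the names long_nops and fill_with_long_nops are made up for the example, and the 0F 1F forms may not be available on very old processors:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Recommended multi-byte NOP encodings, 1 to 9 bytes long.
   Row i holds the (i+1)-byte form. */
static const unsigned char long_nops[9][9] = {
    {0x90},
    {0x66, 0x90},
    {0x0F, 0x1F, 0x00},
    {0x0F, 0x1F, 0x40, 0x00},
    {0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

/* Fill buf[0..n) with NOPs, longest forms first.
   Returns the number of instructions written. */
size_t fill_with_long_nops(unsigned char *buf, size_t n)
{
    size_t count = 0;
    while (n > 0) {
        size_t len = n < 9 ? n : 9;
        memcpy(buf, long_nops[len - 1], len);
        buf += len;
        n -= len;
        ++count;
    }
    return count;
}
```

Six bytes of padding then cost one instruction instead of six single-byte NOPs.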

hutch-- · May 07, 2013, 07:29:41 PM

One of the things I do with almost any algo I want pace from is test different MASM alignments, because I have seen often enough that an "align 8" or an "align 16" really slowed the algo down. My guess is the OS decoder sometimes stalls on a large complex instruction. I note that some compilers and assemblers pad the ends with DB 90h (nops), which may cause less of a problem than a single large instruction for alignment purposes. As usual, benchmark the differences on multiple hardware.

jj2007 · May 07, 2013, 08:10:23 PM

Quote from: Donkey on May 07, 2013, 06:44:13 PM
Not sure how much NOPs slow down code, most modern processors do not execute them

Here's a test with an incredibly old processor ;-)

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
234 ticks for nops 4
390 ticks for nops 8
1763 ticks for nops 40

include \masm32\include\masm32rt.inc

.code
start:
   invoke Sleep, 100
   push rv(GetTickCount)
   mov ecx, 0fffffffh
   .Repeat
      nops 4
      dec ecx
   .Until Sign?
   invoke GetTickCount
   pop edx
   sub eax, edx
   print str$(eax), 9, "ticks for nops 4", 13, 10

   invoke Sleep, 100
   push rv(GetTickCount)
   mov ecx, 0fffffffh
   .Repeat
      nops 8
      dec ecx
   .Until Sign?
   invoke GetTickCount
   pop edx
   sub eax, edx
   print str$(eax), 9, "ticks for nops 8", 13, 10

   invoke Sleep, 100
   push rv(GetTickCount)
   mov ecx, 0fffffffh
   .Repeat
      nops 40
      dec ecx
   .Until Sign?
   invoke GetTickCount
   pop edx
   sub eax, edx
   print str$(eax), 9, "ticks for nops 40", 13, 10

   exit

end start
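
To put the tick counts in perspective, they can be converted to time per loop iteration. GetTickCount returns milliseconds, and the loop runs 0x10000000 times (ecx counts from 0x0FFFFFFF down to 0, and one more dec sets the sign flag). A rough sketch in C, with hypothetical helper names:

```c
#include <assert.h>

/* Iterations of: mov ecx, 0fffffffh / .Repeat ... dec ecx / .Until Sign? */
#define LOOP_ITERATIONS 0x10000000u

/* Convert a GetTickCount difference (milliseconds) to ns per iteration. */
double ns_per_iteration(unsigned ticks_ms)
{
    return (double)ticks_ms * 1.0e6 / (double)LOOP_ITERATIONS;
}

/* Extra cost of each added NOP between two runs, in nanoseconds. */
double ns_per_extra_nop(unsigned ticks_a, unsigned nops_a,
                        unsigned ticks_b, unsigned nops_b)
{
    assert(nops_b > nops_a);
    return (ns_per_iteration(ticks_b) - ns_per_iteration(ticks_a))
           / (double)(nops_b - nops_a);
}
```

With the Athlon figures above, each iteration takes a little under 0.9 ns with nops 4, and each additional NOP adds roughly 0.15 ns, so on that processor the NOPs are cheap but not free.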

Jibz · May 07, 2013, 09:09:47 PM

Could you try with the instructions from the npad macro for comparison?

For reference the 4 and 8 byte npad are:

    ; lea esp, [esp+00]
    DB 8DH, 64H, 24H, 00H

    ; jmp .+8; .npad 6
    DB 0EBH, 06H, 8DH, 9BH, 00H, 00H, 00H, 00H

(there is no 40 byte npad, but I guess it would just be a jmp and filler).
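
The 8-byte form is a 2-byte short jmp followed by the 6-byte npad, so the processor only executes the jmp and never reaches the filler. A short sketch in C that checks where the jump lands (short_jmp_target is a hypothetical helper; the bytes are the ones quoted above):

```c
#include <assert.h>
#include <stddef.h>

/* The 8-byte npad: a short jmp followed by the 6-byte npad. */
static const unsigned char npad8[8] = {
    0xEB, 0x06,                         /* jmp short, rel8 = 6 */
    0x8D, 0x9B, 0x00, 0x00, 0x00, 0x00  /* lea ebx, [ebx+00000000], skipped */
};

/* Offset that a short jmp (EB rel8) located at `at` transfers to.
   The displacement is counted from the end of the 2-byte jmp. */
size_t short_jmp_target(const unsigned char *code, size_t at)
{
    assert(code[at] == 0xEB);
    return at + 2 + (size_t)(signed char)code[at + 1];
}
```

short_jmp_target(npad8, 0) is 8, the first byte after the padding.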

dedndave · May 07, 2013, 09:15:19 PM

the branch at the end of the loop, back to the top of loop, is a SHORT branch
from what i have seen, if it's SHORT, alignment doesn't matter

jj2007 · May 07, 2013, 09:56:08 PM

Quote from: Jibz on May 07, 2013, 09:09:47 PM
Could you try with the instructions from the npad macro for comparison?

234 ticks for nops 4
405 ticks for nops 8
234 ticks for npad 4
468 ticks for npad 8

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
include \masm32\include\masm32rt.inc

npad_4 equ <DB 8DH, 64H, 24H, 00H>
npad_8 equ <DB 0EBH, 06H, 8DH, 9BH, 00H, 00H, 00H, 00H>

Start MACRO
   invoke Sleep, 100
   push rv(GetTickCount)
   mov ecx, 0fffffffh
   align 8
   .Repeat
ENDM
Stop MACRO name
      dec ecx
   .Until Sign?
   invoke GetTickCount
   pop edx
   sub eax, edx
   print str$(eax), 9, "ticks for &name", 13, 10
ENDM

.code
start:
   Start
      nops 4
   Stop nops 4

   Start
      nops 8
   Stop nops 8

   Start
      npad_4
   Stop npad 4

   Start
      npad_8
   Stop npad 8

   exit

end start

Donkey · May 08, 2013, 01:30:55 PM

LOL, I really hope you were joking with those timings. The objective in padding with NOP is to make the code AFTER the alignment faster. As I said, the NOPs should never be executed as the prediction algorithm will not load them into the instruction pipe, they are jumped over. From Agner:

Quote
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be advantageous to align critical loop entries and subroutine entries by 16 in order to minimize the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.

If you noticed, the original routine had cascading loops; the purpose of the padding is to ensure that the subsequent loop begins on a 16-byte boundary. Try running the timing with multiple cascading loops and then you'll have a useful result.

EDIT
After reading over my other post I guess I explained myself badly, it seems at 2:45 in the morning I shouldn't be trying to explain prediction and instruction alignment. Sorry about that.
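
The point in the Agner quote can be made concrete for the loop in the first post. Assuming the usual encodings, the loop body (inc ecx, inc eax, cmp BYTE PTR [ecx], 0, jne SHORT) is 7 bytes long. Assuming also that the function starts on a 16-byte boundary, the loop would begin at offset 10 without padding and at offset 16 with the npad 6. This sketch in C counts how many 16-byte fetch blocks the loop touches in each case (fetch_blocks_spanned is a made-up helper name):

```c
#include <assert.h>
#include <stddef.h>

/* Number of aligned fetch blocks touched by `length` bytes of code
   starting at byte offset `start`. `block` must be a power of two. */
size_t fetch_blocks_spanned(size_t start, size_t length, size_t block)
{
    assert(length > 0);
    assert(block != 0 && (block & (block - 1)) == 0);
    return (start + length - 1) / block - start / block + 1;
}
```

Under these assumptions the unpadded loop (offset 10) spans two 16-byte blocks, while the padded loop (offset 16) fits in one.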

TouEnMasm · May 08, 2013, 05:30:28 PM

I have made a test bed to see the effect of loop alignment.
On my computer I can't say the results show a visible difference; perhaps they will on another one.

Quote
CPU: Intel(R) Celeron(R) CPU 2.80GHz
System: Microsoft Windows XP Home Edition Build Service Pack 3 2600

chainsize loop alignment is: 2
chainsize time in microseconds: 508

chainsize_4 loop alignment is: 16
chainsize_4 time in microseconds: 524

chainsize_8 loop alignment is: 8
chainsize_8 time in microseconds: 692

chainsize_16 loop alignment is: 32
chainsize_16 time in microseconds: 524
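
Note that the reported alignment can be larger than the one requested: an align 4 only guarantees a multiple of 4, and the label may happen to land on a multiple of 16. The source of the test bed is not shown, so the following is only a guess at how such a figure could be computed, namely as the largest power of two that divides the address of the loop label (alignment_of is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two that divides `address`
   (isolates the lowest set bit). */
uintptr_t alignment_of(uintptr_t address)
{
    assert(address != 0);
    return address & (~address + 1);
}
```

For example, an address ending in 10h gives 16, and one ending in 20h gives 32.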

jj2007 · May 08, 2013, 06:58:27 PM

Quote from: dedndave on May 07, 2013, 09:15:19 PM
the branch at the end of the loop, back to the top of loop, is a SHORT branch
from what i have seen, if it's SHORT, alignment doesn't matter

If it's LONG, neither... at least on my AMD:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
936 ticks aligned
951 ticks misaligned

936 ticks aligned
936 ticks misaligned

951 ticks aligned
936 ticks misaligned
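
When reading these numbers, keep in mind that GetTickCount does not advance every millisecond: its resolution is documented as typically 10 to 16 ms. A difference such as 936 versus 951 is therefore a single timer step. A tiny sketch in C of that sanity check, where the 16 ms figure is an assumption and within_timer_noise is a name made up for the example:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed worst-case GetTickCount step in milliseconds
   (the exact value depends on the system). */
#define TICK_GRANULARITY_MS 16u

/* True when two timings differ by no more than one timer step,
   i.e. the difference cannot be told apart from timer noise. */
bool within_timer_noise(unsigned ticks_a, unsigned ticks_b)
{
    unsigned diff = ticks_a > ticks_b ? ticks_a - ticks_b : ticks_b - ticks_a;
    return diff <= TICK_GRANULARITY_MS;
}
```

By this measure all three aligned/misaligned pairs above are indistinguishable, whereas the earlier 234 versus 390 result for nops 4 and nops 8 is well outside the noise.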

FORTRANS · May 08, 2013, 10:36:35 PM

Hi,

Quote from: jj2007 on May 07, 2013, 08:10:23 PM
Here's a test with an incredibly old processor ;-)

Well, you more or less asked...

P-III
1011 ticks for nops 4
1693 ticks for nops 8
7110 ticks for nops 40

P-MMX
6837   ticks for nops 4
10735   ticks for nops 8
37210   ticks for nops 40

Cheers,

Steve N.

The MASM Forum
