Ascii to DWORD replacement

hutch-- · January 29, 2013, 12:38:21 AM

> The mere assumption that assembler programmers see DWORD and understand unsigned is risky, given that ML.EXE interprets DWORD -1 as ffffffffh.

This difference has something to do with where we come from. In mnemonic notation, ML sees 32 bits as 32 bits without interpreting them as either signed or unsigned. As you would of course be aware, it is how you evaluate a 32-bit value that determines whether it's signed or unsigned. Some higher level languages name the distinction as DWORD versus LONG, but assembler has 32-bit memory and register locations, nothing more.

Iczelion's old algo was never a signed version; it only ever produced unsigned results, and when I asked about a drop-in replacement, it was to replace that functionality, not add to it or improve its functionality list. It may be appropriate in a high level language to keep adding baggage to hold the hands of the inexperienced à la VC, VB etc .... but with assembler, it's just extra junk that has no place in the typical targets for assembler code. Disassemble core OS code or very high performance critical code and you don't find junk in it at all; this is what the masm32 project was pointed at.

The problem for me maintaining 240 odd modules in a library is the signal to noise ratio. In the past in this forum I could ask members if they had code or algorithms to perform particular functions and they understood what was being targeted, but of late all I get is pseudo philosophical debate, attempts to re-interpret the task and very little grasp of what it takes to maintain a library of procedures that have been used for years by a massive number of people. This is why I rarely ever post code any longer; the effort I put in and the difficulty tracking the results make it unviable.

MichaelW · January 29, 2013, 01:37:47 AM

Quote from: dedndave on January 28, 2013, 04:08:39 PM
the way it is, i wouldn't use this routine in a real-world app
because it eats up erroneous strings and spits out a smiley face
if the string has a non-numeric, is too long, is null, you won't know about it

FWIW, for the heavily used CRT conversion functions:

Quote
The function stops reading the input string at the first character that it cannot recognize as part of a number.

Even for QuickBASIC and VB, hold-your-hand HLLs aimed at non-programmers, the VAL function implemented this same behavior, although like the CRT functions, it would skip over leading whitespace, and flag out-of-range values as an error.
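As a rough illustration of the CRT behaviour quoted above, here is a small C sketch using the standard `strtoul`. The wrapper name `parse_prefix` is hypothetical; only `strtoul` itself is a real CRT call. It skips leading whitespace, converts digits until the first character it cannot use, and reports where it stopped. (For out-of-range input `strtoul` returns `ULONG_MAX` and sets `errno` to `ERANGE`, which is not exercised here.)

```c
#include <stdlib.h>

/* Hypothetical helper: convert the leading decimal number in s and
   report, through *rest, the first character that was not consumed. */
unsigned long parse_prefix(const char *s, const char **rest)
{
    char *end;
    unsigned long v = strtoul(s, &end, 10); /* skips leading whitespace */
    *rest = end;                            /* where the conversion stopped */
    return v;
}
```

Called on "  123abc" this should return 123 with `*rest` pointing at the 'a'. Called on "xyz" it returns 0 with `*rest` still pointing at the 'x', so a caller can tell "no digits at all" from a genuine zero by comparing `rest` with the start of the string.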

jj2007 · January 29, 2013, 01:51:54 AM

I don't mind if

include \masm32\include\masm32rt.inc

.code
start:
MsgBox 0, str$(rv(atodw, " 123 ")), "Should be 123:", MB_OK
exit
end start

shows me 2401470 but at least the manual should explain why it does so
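For reference, one model that reproduces the 2401470 figure. It assumes the routine subtracts 30h from each character in an 8-bit register, so a space (20h) becomes F0h = 240 instead of being rejected, and then accumulates total*10 + digit. This is an assumption about how the library code behaves, not a copy of it, and `unchecked_atodw` is a hypothetical name.

```c
#include <stdint.h>

/* Model of an unchecked converter (assumed behaviour, see above):
   the '0' offset is removed in 8 bits, the total is kept in 32 bits. */
uint32_t unchecked_atodw(const char *s)
{
    uint32_t acc = 0;
    for (; *s != '\0'; ++s) {
        /* 8-bit subtraction: ' ' (0x20) - 0x30 wraps to 0xF0 = 240 */
        uint8_t digit = (uint8_t)((uint8_t)*s - 0x30u);
        acc = acc * 10u + digit;
    }
    return acc;
}
```

Stepping through " 123 ": 240, 2401, 24012, 240123, then 240123*10 + 240 = 2401470. Each space contributes 240 as if it were a digit.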

Tedd · January 29, 2013, 02:24:24 AM

Just to add more noise..

DWORD is non-signed -- it's neither explicitly signed nor unsigned, it comes down to interpretation within the context.
"atou32" is unambiguously unsigned, and "atoi32" is unambiguously signed. Future functions should use these names instead, and deprecate atodw (atodw should still be available for compatibility, but the other two should be preferred for new code.)

As for the atodw replacement, the biggest issue is that it doesn't attempt to validate its input and instead produces incorrect results. I don't believe replacing a function that does no validation with one that does will break anything, as any program relying on this function will already have had to check its input separately (or simply behave erroneously when fed bad input.) So checking the input again will not break anything, but could be argued to waste multiple clock cycles.

However, it's generally more efficient to check the input as you convert it, rather than in a separate step. Of course, this will slow down the function itself, though not the operation as a whole. The only case where it's a downside is when you have 10,000 numbers to convert which are already known to be inherently valid and checking them again would waste entire milliseconds; but I don't expect this usage appears much outside of contrived test cases. The common usage is converting a single user inputted number, and the function should do that correctly.
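A minimal C sketch of "check as you convert", under stated assumptions: the name `atou32_checked` and the bool-plus-out-parameter interface are hypothetical, not anything from the masm32 library. It rejects empty strings, non-digits, and values that do not fit in 32 bits, all in the same pass that builds the result.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical checked converter: returns true and stores the value in
   *out only if s is a non-empty run of decimal digits that fits in
   32 bits. On failure *out is left untouched. */
bool atou32_checked(const char *s, uint32_t *out)
{
    uint32_t acc = 0;
    if (*s == '\0')
        return false;                      /* empty string */
    for (; *s != '\0'; ++s) {
        if (*s < '0' || *s > '9')
            return false;                  /* non-digit character */
        uint32_t d = (uint32_t)(*s - '0');
        if (acc > (UINT32_MAX - d) / 10u)
            return false;                  /* acc*10 + d would overflow */
        acc = acc * 10u + d;
    }
    *out = acc;
    return true;
}
```

The extra work per character is two compares for the digit range and one compare for the overflow guard. A caller that already trusts its input can ignore the return value and use `*out` as-is.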

In any case, the functions should clearly document what they do so these issues don't come as a surprise -- that means the authors should document their own functions!

hutch-- · January 29, 2013, 03:05:14 AM

The distinction here is still between "component" and "object". The new algo does what it's supposed to do: it converts an ascii string within the DWORD integer range into a DWORD and it does not pretend to do anything else. Now if your task is converting user input it's easy enough to exhaustively filter the input to ensure that what you feed into the conversion is purely numeric and within range, but there are enough other tasks where you don't want the extra padding. Try ripping the guts out of a massive log file, parsing the Nth word which is numbers then feeding it to a conversion and the result fed into an array. The last thing you need is to filter it again, especially if the log file runs into the many millions of entries.

This is why you design a high performance library as components which you then use to construct objects if you need them. The problem with constructing objects first up is they often don't fit other tasks. The disease that afflicts many high level languages is exactly the failure to understand why you isolate components so that you don't end up with bloated bundles of junk. VB, VC, Pascal, Java and so on are full of chyte like this and it is the main contributor to bloated, slow, sloppy and unreliable code.

Components have a very good characteristic when it comes to reliability of code: if you get it wrong it goes BANG (OS says naughty things about your app etc ...) and you only have one option, get it right. This may be anathema in VB or JAVA but this is an assembler library and low level code is where the action is in terms of performance. If you want high level library functions that hold your poor hot sweaty little hand and don't let you make mistakes, try JAVA or VB, that is what they are there for; those folks who don't want to know what a pointer is can safely dump the contents into an array rather than directly address the data. :P

Tedd · January 29, 2013, 03:44:35 AM

Quote from: hutch-- on January 29, 2013, 03:05:14 AM
The distinction here is still between "component" and "object". The new algo does what it's supposed to do: it converts an ascii string within the DWORD integer range into a DWORD and it does not pretend to do anything else.

It also does what it's not supposed to do, it converts nonsense "*$&%!" into a supposedly valid DWORD with no indication.

Quote
Now if your task is converting user input it's easy enough to exhaustively filter the input to ensure that what you feed into the conversion is purely numeric and within range, but there are enough other tasks where you don't want the extra padding.

True, it's easy, but if you do it in more than one place then it makes sense to do it at the same time and avoid unnecessary code duplication.

Quote
Try ripping the guts out of a massive log file, parsing the Nth word which is numbers then feeding it to a conversion and the result fed into an array. The last thing you need is to filter it again, especially if the log file runs into the many millions of entries.

Why would you need to filter it again? If you know your input is already valid, you do no further checking and accept the return value as-is. Usage is the same in this case. As for slow-down, millions of 'extra' cycles still only account for a few seconds at most, and this is an insignificant portion of the processing time.

You can rant about hand-holding and going bang if there's an error, but sanity checking is advisable. Programs should not throw an exception and die whenever they encounter a typo.

Obviously the choice is yours in the end, but functions should at least have documentation on their limitations, e.g. "Note: this function does not check input, it will happily convert 'dfghjkl' into a dword."

dedndave · January 29, 2013, 04:51:55 AM

well - that function returns the value in EAX - no changing that
to conform with "standards", EAX would be used for status

maybe the best thing you could do is to return 0 for all non-valid strings
because the EDX register returned nothing on older versions, it could be used for status
but - we are back to discussing design philosophy - lol

i look at it this way.....
whatever they were using a2dw for, before, didn't need validation
for the most part, i suspect that's trivials and test pieces

hutch-- · January 29, 2013, 10:37:37 AM

There are times when I feel like a voice crying in the wilderness, "don't bloat this stuff", "don't write crap code", "don't go down the wide and easy path to destruction" etc etc ....

Let's face it, a conversion IS a conversion and like the vast number of other functions used in Windows programming, it requires user controlled input, in this case a string comprised of ascii numbers only. Now like most other functions you can pass the wrong data to it and get nonsense results, but we are talking about assembler programmers here, not learner VB or similar.

Input control is no big deal and it varies from place to place. From a GUI application, most often you filter the edit control so that only numbers can be entered; if it's floating point you also allow a period. If the input is from the console, which is an ever decreasing task in modern times, you get the string from StdIn and have a look at it first and squawk an error if it has non-numeric characters in it. What you don't do is put this string filtering crap in the conversion, because unless it handles every case of what can be fed to it as a string, you end up with duplication or redundancy when you need a different case.

You make objects from components, do it the other way around and you end up with VB, VC, Delphi and similar crap that tries unsuccessfully to hold the hand of the inexperienced.

Now notwithstanding such weighty considerations, the masm32 library will remain a component library as per its original design but I would not want to stifle the creativity of folks who want to do it differently, I have been encouraging people to do exactly that for many years now, if it "don't fit" roll your own.

mineiro · January 29, 2013, 04:14:50 PM

I tried this tonight: a2dw_min returns eax=0 on error, and a non-checking version follows too:
;----------------
Edited afterwards: I removed the algo because it does not check whether there are more than 10 digits, and the shl by 3 does not catch some carries; sorry for the inconvenience. The algo that does not check anything is:

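OPTION PROLOGUE:NONE			; no stack frame, so [esp+4] is the argument
OPTION EPILOGUE:NONE

atou_min proc String:DWORD

	mov edx,[esp+4]			; edx = pointer to the string
	xor eax,eax			; eax = running total
i = 0
	repeat 10			; unrolled, at most 10 digits
		movzx ecx,byte ptr [edx+i]
		test ecx,ecx
		jz @F			; stop at the terminating zero
		lea ecx,[eax*8+ecx-30h]	; ecx = total*8 + digit
		lea eax,[eax*2+ecx]	; eax = total*10 + digit
i = i+1
	endm
	align 4
	@@:
	ret 4				; avoids reading the return address from below esp
atou_min endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef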
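A C model of what the unrolled loop above computes may help when reasoning about its limits. This is a sketch, not a replacement: `atou_min_model` is a hypothetical name, and the variables are named after the registers only to line up with the listing. It reads at most 10 characters, stops at the terminating zero, does no validation, and wraps modulo 2^32 on overflow.

```c
#include <stdint.h>

/* C model of the unrolled MASM loop: two lea steps give total*10 + digit. */
uint32_t atou_min_model(const char *s)
{
    uint32_t eax = 0;
    for (int i = 0; i < 10; ++i) {          /* repeat 10 */
        uint32_t ecx = (uint8_t)s[i];       /* movzx ecx, byte ptr [edx+i] */
        if (ecx == 0)
            break;                          /* jz @F */
        ecx = eax * 8u + ecx - 0x30u;       /* lea ecx,[eax*8+ecx-30h] */
        eax = eax * 2u + ecx;               /* lea eax,[eax*2+ecx] */
    }
    return eax;
}
```

Two consequences follow directly: "4294967296" wraps to 0 with no indication, and an 11-digit string is silently cut to its first 10 digits.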

sinsi · January 29, 2013, 04:39:39 PM

I would agree that it's better that the caller validate the string first, because they control that part.
It might not need validation so the overhead disappears, whereas if it's in the conversion routine it's redundant.
Where does validation end? Can we skip spaces/tabs? What about a null pointer?

In the old days we would use the carry flag to indicate an error, why did MS get rid of that ::)

dedndave · January 29, 2013, 09:52:07 PM

yah - it made it easy - lol
i guess they figure they can fit more info into EAX
then - they use 32 bits to return 0 or 1 - if you want the error code, call GetLastError :lol:

jj2007 · January 29, 2013, 10:41:11 PM

Quote from: sinsi on January 29, 2013, 04:39:39 PM
I would agree that it's better that the caller validate the string first, because they control that part.
It might not need validation so the overhead disappears, whereas if it's in the conversion routine it's redundant.
Where does validation end? Can we skip spaces/tabs? What about a null pointer?

Put things into perspective:
atodwJJ: 0.042 seconds per 1000000 loops
Val() : 0.136 seconds per 1000000 loops
The "slow" one skips leading and trailing spaces and tabs, and it doesn't care if the string contains a dot, or if it ends with "h" or "b" or "y" or "d" or "e2", or if it starts with "0x" or "$". But it does throw an error if the string ends with "x" or if it finds other stuff that is not in our list of valid number formats.

This kind of algo is what coders need 99% of their time - except in those cases where 0.136 seconds per Million invokes is too slow, and where they can be absolutely sure that the format is always a correct positive decimal string.

Both kinds of algos have their place, and I agree with Hutch that a really fast replacement for atodw doesn't need the bells and whistles. The only point of contention is how drastic the warnings in the documentation should be, to keep beginners from using atodw as if it were a fool-proof Val(whatever).

hutch-- · January 30, 2013, 01:37:22 AM

Here is a quickly written scruffy to test user input from a source like the console StdIn. (Warning, this is a 1:30am model)

IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

include \masm32\include\masm32rt.inc

is_str_int PROTO :DWORD

.code

start:

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    mov edx, rv(is_str_int," -12345678")
    print ustr$(edx),13,10

    mov edx, rv(is_str_int," 12345678")
    print ustr$(edx),13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

is_str_int proc pstring:DWORD

    mov eax, [esp+4]
    sub eax, 1

  tlb:                              ; trim any leading garbage
    add eax, 1
    cmp BYTE PTR [eax], 32
    je tlb
    cmp BYTE PTR [eax], 9
    je tlb

    sub eax, 1

  chlp:                             ; test against integer character range
    add eax, 1
    cmp BYTE PTR [eax], 0
    je iszero
    cmp BYTE PTR [eax], 48
    jb invalid_char
    cmp BYTE PTR [eax], 57
    jna chlp

  invalid_char:
    xor eax, eax                    ; return zero if invalid character
    ret 4

  iszero:
    mov eax, 1                      ; return non zero if integer string
    ret 4

is_str_int endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
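For readers who want to poke at the logic without assembling it, here is a C model of the same test. It is a sketch with a hypothetical name (`is_str_int_model`) that follows the listing label by label: skip leading spaces and tabs, then require every remaining character up to the terminator to be in '0'..'9'.

```c
/* C model of is_str_int: 1 if the string is digits only after optional
   leading spaces/tabs, 0 otherwise. */
int is_str_int_model(const char *p)
{
    while (*p == ' ' || *p == '\t')
        ++p;                      /* tlb: skip leading spaces and tabs */
    for (; *p != '\0'; ++p) {     /* chlp: walk to the terminator */
        if (*p < '0' || *p > '9')
            return 0;             /* invalid_char */
    }
    return 1;                     /* iszero: reached the terminating zero */
}
```

With the two test strings from `main`, " -12345678" gives 0 (the minus sign is rejected) and " 12345678" gives 1. Two edge cases are worth knowing: an empty or all-whitespace string also gives 1, since the loop reaches the terminator without seeing a bad character, and trailing spaces give 0. Neither this model nor the listing checks the digit count, so range still has to be checked separately.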

dedndave · January 30, 2013, 02:18:36 AM

if you are going to pass through the string, you may as well process it

hutch-- · January 30, 2013, 09:57:50 AM

Yeah, but which string, unsigned, signed, floating point, various scientific notation etc etc .... Are you going to fill a conversion with every possible case of data error that can be fed to it? Some experience in library design will help you here: the reason why you write re-usable components is to avoid duplication and redundancy (too many tests for the same thing). Why would you add that much crap to a conversion if you are writing a GUI input where the filtering is done in the edit control?

You are confusing objects (by a particular theory) with components. If you are writing a library that has many objects, often doing similar things, then you write re-usable components and call them from your higher level objects, this way the finer granularity of your library yields smaller more efficient executables.
