Faster Memcopy ...

Started by rrr314159, March 03, 2015, 02:40:50 PM


jj2007

Quote from: lingo on August 19, 2023, 12:11:46 AM
It is my old StrLenW. :biggrin:

IMHO it should return 20, not 39:

The string: [UTF-16LE Hello World]
wLen(edi)       eax             20
MbSLenW         eax             20
StrLenW_Lingo   eax             39

align 16
dummy db 7 dup(0) ;  force unaligned
_dw UnicodeString2, "UTF-16LE ", "Hello World", 0

guga

Quote from: jj2007 on August 19, 2023, 04:52:41 AM
Quote from: lingo on August 19, 2023, 12:11:46 AM
It is my old StrLenW. :biggrin:

IMHO it should return 20, not 39:

The string: [UTF-16LE Hello World]
wLen(edi)       eax             20
MbSLenW         eax             20
StrLenW_Lingo   eax             39

align 16
dummy db 7 dup(0) ;  force unaligned
_dw UnicodeString2, "UTF-16LE ", "Hello World", 0


Hi JJ

I don't get it. Why 20? Shouldn't it return 11 for "Hello World" (11 chars, not counting the 2 bytes of the null terminator)?

So 11*2 = 22 bytes to represent the Unicode text, plus 2 more for the two terminating zero bytes?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

The string: [UTF-16LE Hello World]

Count them, it's 20 chars.

guga

Oh! I see, it's the whole string "UTF-16LE Hello World" and not only "Hello World" written in UTF-16LE. I thought "UTF-16LE" was a MasmBasic token and not part of the string itself :biggrin:  :biggrin:  :biggrin:  :biggrin:

Indeed, it should return 20 chars and not 39.
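To make the count concrete: the whole literal, prefix included, is the string. Below is a minimal C sketch (the function name utf16_units is hypothetical, not from the thread) that counts 16-bit code units up to the 0x0000 terminator, which is what wcslen does on Windows, where wchar_t is 16 bits. "UTF-16LE Hello World" gives 20, "Hello World" alone gives 11, and the terminator is never counted. A character outside the BMP would count as two units (a surrogate pair).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count UTF-16 code units before the 0x0000 terminator.
   In C11, u"..." literals have type char16_t, which is the same type as
   uint_least16_t, so such literals can be passed in directly. */
size_t utf16_units(const uint_least16_t *s)
{
    const uint_least16_t *p = s;
    while (*p != 0)          /* stop at the first zero unit */
        p++;
    return (size_t)(p - s);  /* pointer difference counts units, not bytes */
}
```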

lingo

Here is my speed test, based on MichaelW's simple test. :biggrin:

C:\Documents\Visual Studio 2022\Projects\SpeedTest>c2test2

8 cycles -> StrLenW_Guga ANSI,  Return in EAX: 100

6 cycles -> StrLenW_Lingo ANSI,  Return in EAX: 100

23 cycles -> StrLenW_Guga No SAR,  Return in EAX: 200

13 cycles -> StrLenW_Lingo No SAR,  Return in EAX: 200

8 cycles -> StrLenW_Guga with SAR,  Return in EAX: 34

5 cycles -> StrLenW_Lingo with SAR,  Return in EAX: 34


Press enter to exit...

Source in attachment.
Quid sit futurum cras fuge quaerere. (Don't ask what tomorrow will bring.)

lingo

Hi guys,

You had a lot of fun yesterday :badgrin:  :badgrin:  :badgrin:  :badgrin:
and now, after a speed difference of more than 70%... absolute silence :shhh: ... he-he-he..
or the sound of silence... Nice song: https://www.youtube.com/watch?v=9O9DaZUS_EU

jj2007

I hadn't looked at it, Lingo, but it's indeed almost 27% faster than Guga's algo :thumbsup:

(and it returns a wrong result, but never mind, we are used to that :biggrin: )

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23633  cycles for 100 * CRT wcslen
9029    cycles for 100 * _MbStrLenW2
2702    cycles for 100 * StrLenW Guga
1987    cycles for 100 * StrLenW_LingoNoSar

23663  cycles for 100 * CRT wcslen
9020    cycles for 100 * _MbStrLenW2
2698    cycles for 100 * StrLenW Guga
1997    cycles for 100 * StrLenW_LingoNoSar

23657  cycles for 100 * CRT wcslen
9026    cycles for 100 * _MbStrLenW2
2704    cycles for 100 * StrLenW Guga
1986    cycles for 100 * StrLenW_LingoNoSar

23645  cycles for 100 * CRT wcslen
9021    cycles for 100 * _MbStrLenW2
2704    cycles for 100 * StrLenW Guga
1986    cycles for 100 * StrLenW_LingoNoSar

14      bytes for CRT wcslen
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga
98      bytes for StrLenW_LingoNoSar

100    = eax CRT wcslen
100    = eax Masm32 ucLen
100    = eax _MbStrLenW2
100    = eax StrLenW Guga
199    = eax StrLenW_LingoNoSar

Btw it sucks a bit for short strings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3854    cycles for 100 * CRT wcslen
1549    cycles for 100 * _MbStrLenW2
695     cycles for 100 * StrLenW Guga
684     cycles for 100 * StrLenW_LingoNoSar
618     cycles for 100 * MbSLenW

3853    cycles for 100 * CRT wcslen
1547    cycles for 100 * _MbStrLenW2
695     cycles for 100 * StrLenW Guga
691     cycles for 100 * StrLenW_LingoNoSar
619     cycles for 100 * MbSLenW

3850    cycles for 100 * CRT wcslen
1550    cycles for 100 * _MbStrLenW2
719     cycles for 100 * StrLenW Guga
683     cycles for 100 * StrLenW_LingoNoSar
668     cycles for 100 * MbSLenW

3862    cycles for 100 * CRT wcslen
1549    cycles for 100 * _MbStrLenW2
720     cycles for 100 * StrLenW Guga
684     cycles for 100 * StrLenW_LingoNoSar
645     cycles for 100 * MbSLenW

14      bytes for CRT wcslen
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga
102     bytes for StrLenW_LingoNoSar
50      bytes for MbSLenW

14      = eax CRT wcslen
14      = eax Masm32 ucLen
14      = eax _MbStrLenW2
14      = eax StrLenW Guga
14      = eax StrLenW_LingoNoSar
14      = eax MbSLenW

daydreamer

Guga, Unicode also has game symbols, like chess pieces, that you could memcopy from a library, e.g. for a "checkmate in 3 moves" solver.

If you read text from a file, don't you already get len = filesize? Isn't there an API to check whether a text file is 16-bit Unicode or not?
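On that question: the file size only gives the length if you know the encoding and account for a byte-order mark. A minimal C sketch (the helper name is hypothetical) that checks for the UTF-16LE BOM FF FE and derives the unit count from the size. Windows also has IsTextUnicode, but it relies on statistical tests, so its answer is a guess rather than a guarantee; and a routine like StrLenW only receives a pointer, not a file size.

```c
#include <stddef.h>

/* Hypothetical helper: if the file bytes start with the UTF-16LE byte-order
   mark (FF FE), return the number of UTF-16 code units after it; otherwise -1.
   Caveats: FF FE 00 00 is also the UTF-32LE BOM, and many UTF-16 files have no BOM. */
long utf16le_units_after_bom(const unsigned char *buf, size_t size)
{
    if (size < 2 || buf[0] != 0xFF || buf[1] != 0xFE)
        return -1;                     /* no UTF-16LE BOM: encoding unknown */
    return (long)((size - 2) / 2);     /* 2 bytes per code unit after the BOM */
}
```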
my non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

lingo

#173
Quote
Hi Guga,
IMHO your algo is fast enough :biggrin:
You beat poor CRT by a factor of 9, but ok, as Assembly programmers we are used to that, aren't we?

Again, envy or incompetence regarding speed optimization of code...

Quote
I hadn't looked at it, Lingo, but it's indeed almost 27% faster than Guga's algo

Another attempt at fake manipulation against Lingo:

Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (SSE4)
12358   cycles for 100 * CRT wcslen
9124    cycles for 100 * _MbStrLenW2
2325    cycles for 100 * StrLenW Guga
1572    cycles for 100 * StrLenW_LingoNoSar
1572/2325*100=[b]67.6%[/b]

12426   cycles for 100 * CRT wcslen
9142    cycles for 100 * _MbStrLenW2
2279    cycles for 100 * StrLenW Guga
1576    cycles for 100 * StrLenW_LingoNoSar
1576/2279*100=[b]69.1%[/b]

12360   cycles for 100 * CRT wcslen
9115    cycles for 100 * _MbStrLenW2
2352    cycles for 100 * StrLenW Guga
1579    cycles for 100 * StrLenW_LingoNoSar
1579/2352*100=[b]67.1%[/b]

12264   cycles for 100 * CRT wcslen
9106    cycles for 100 * _MbStrLenW2
2323    cycles for 100 * StrLenW Guga
1575    cycles for 100 * StrLenW_LingoNoSar
1575/2323*100=[b]67.8%[/b]

14      bytes for CRT wcslen
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga
98      bytes for StrLenW_LingoNoSar

100     = eax CRT wcslen
100     = eax Masm32 ucLen
100     = eax _MbStrLenW2
[b]100[/b]     = eax StrLenW Guga
[b]199[/b]     = eax StrLenW_LingoNoSar
--- ok ---
Quote
and it returns a wrong result, but never mind, we are used to that
Another attempt at fake manipulation, and even an insult against Lingo.
You are comparing two different things: LenStrW_LingoNotSAR, without SAR eax,1 at the end of the algo,
and LenStrW_Guga, WITH SAR eax,1 at the end of the algo.
So in your fake manipulated test it is normal to see results that differ by a factor of 2.
Next time, use LenStr_LingoWsar!
Please see the attachment, and Don't Teach Father How To Make Babies!
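The SAR point can be illustrated with a minimal C sketch (hypothetical function names; it assumes uint_least16_t is 16 bits wide, as on all mainstream compilers). A scan that returns the pointer difference in bytes reports twice the character count; halving it, which is what SAR eax,1 does for a non-negative length, gives the number of 16-bit units. So a 100-char string yields 200 without the shift and 100 with it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "no SAR" style result: length of a zero-terminated
   UTF-16 string measured in bytes (terminator excluded). */
size_t strlenw_bytes(const uint_least16_t *s)
{
    const uint_least16_t *p = s;
    while (*p != 0)
        p++;
    return (size_t)((const char *)p - (const char *)s);  /* byte distance */
}

/* Same scan followed by the equivalent of SAR eax,1: for a non-negative
   length an arithmetic shift right by 1 is just a division by 2. */
size_t strlenw_units(const uint_least16_t *s)
{
    return strlenw_bytes(s) >> 1;
}
```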

Quote
Btw it sucks a bit for short strings:
You'll get a lot more sucks if you test it with longer strings! :badgrin:  :skrewy:


It is also from your fake "test":
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Speed Differences:
1987/2702*100=73.5%
1997/2698*100=74%
1986/2704*100=73.4%
1986/2704*100=73.4%


guga

Hi Lingo, I'll test again later.

Just one question: you use align 16 on the input strings, right? How will the algo behave (in terms of speed/accuracy) on unaligned strings?

Did you test those cases?
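Guga's alignment question can be made concrete. Below is a minimal C sketch, not Lingo's or Guga's code (their MASM sources are in the attachments), of one common way an SSE2 StrLenW can accept any start address: for an even address, check units one at a time until the pointer is 16-byte aligned, then test 8 units per aligned 16-byte load with _mm_cmpeq_epi16 and _mm_movemask_epi8. An odd address, like the one produced by jj2007's `dummy db 7 dup(0)`, cannot be brought into 16-bit lane alignment, so the sketch assumes a plain scalar fallback there; other routines may use unaligned loads instead, taking care around page boundaries. All function names are hypothetical, and without SSE2 the sketch compiles to the scalar path only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Read one 16-bit code unit at any byte address (memcpy avoids misaligned-pointer UB). */
static uint16_t load_u16(const unsigned char *p)
{
    uint16_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Plain scalar scan: number of 16-bit units before the 0x0000 terminator. */
static size_t strlenw_scalar(const unsigned char *p)
{
    size_t n = 0;
    while (load_u16(p + 2 * n) != 0)
        n++;
    return n;
}

#ifdef HAVE_SSE2
/* Index of the lowest set bit; the caller guarantees mask != 0. */
static unsigned lowest_bit(unsigned mask)
{
    unsigned i = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        i++;
    }
    return i;
}
#endif

/* Hypothetical StrLenW-style routine that accepts any start address. */
size_t strlenw_sketch(const void *str)
{
    const unsigned char *p = (const unsigned char *)str;
#ifdef HAVE_SSE2
    if (((uintptr_t)p & 1u) == 0) {
        size_t n = 0;
        /* Scalar prologue: one unit at a time until p + 2n is 16-byte aligned. */
        while (((uintptr_t)(p + 2 * n) & 15u) != 0) {
            if (load_u16(p + 2 * n) == 0)
                return n;
            n++;
        }
        const __m128i zero = _mm_setzero_si128();
        for (;;) {
            /* An aligned 16-byte load never crosses a page boundary, so reading a few
               bytes past the terminator cannot fault (memory checkers may still flag it). */
            __m128i v = _mm_load_si128((const __m128i *)(const void *)(p + 2 * n));
            unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi16(v, zero));
            if (mask != 0)
                return n + lowest_bit(mask) / 2;  /* each zero unit sets 2 mask bits */
            n += 8;                               /* 8 units per 16-byte block */
        }
    }
#endif
    /* Odd start address: units straddle the 16-bit lanes, so fall back to scalar. */
    return strlenw_scalar(p);
}
```

The same 20-unit string placed at offsets 0, 2, 6 and 7 from a 16-byte boundary should give 20 each time; only offset 7 takes the scalar path.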

lingo

Thank you Guga, :biggrin:

You can use my test (it is MichaelW's, written in plain MASM32),
so you can change whatever you want in it and recompile it with makeit.bat.  :thumbsup:

guga

Quote from: lingo on August 21, 2023, 06:34:07 AM
Thank you Guga, :biggrin:

You can use my test (it is MichaelW's, written in plain MASM32),
so you can change whatever you want in it and recompile it with makeit.bat.  :thumbsup:

Thanks, Lingo. I'll give it a try later. These functions are only good when we build them to work in as many scenarios as possible. That's why I tested aligned and unaligned strings, and also cases where other data sits before or after the string (as when a string belongs to a structure or a chain of data, etc.).

I'm currently working on a way to speed up (and build) functions that change the case of strings. JJ is helping with this. It's a bit harder than I expected, but I managed to make it work (for non-Latin chars or Unicode strings, at the moment). Later I'll see how to optimize it and then come back for more tests on strlen and strlenW.

It's beyond me why M$ never did something like that with a focus on speed. When I analyze some of the APIs inside msvcrt, they are a true mess. For example, even their libm_sse2_log_precise function, which was supposed to be accurate and fast, is slower than the ones we built here. :dazzled:

jj2007

Quote from: lingo on August 21, 2023, 05:00:49 AM
Speed Differences:
1987/2702*100=73.5%
1997/2698*100=74%
1986/2704*100=73.4%
1986/2704*100=73.4%

Your algo is 26.5% faster: 1-1987/2702=26.5%
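Both sides are quoting correct arithmetic; they are quoting different quantities. A small C sketch (helper names are hypothetical) computes the three usual readings of one timing pair, using the i5 numbers 1987 vs 2702 cycles: the faster routine needs about 73.5% of the baseline's cycles, i.e. about 26.5% fewer cycles, which in turn means about 36% more calls per unit of time.

```c
/* new_cycles: the faster routine; old_cycles: the baseline. */

/* Cycles of the new routine as a percentage of the baseline's cycles. */
double cycles_pct_of_baseline(double new_cycles, double old_cycles)
{
    return 100.0 * new_cycles / old_cycles;          /* 1987 vs 2702: about 73.5 */
}

/* Percentage of cycles saved relative to the baseline. */
double cycle_reduction_pct(double new_cycles, double old_cycles)
{
    return 100.0 * (1.0 - new_cycles / old_cycles);  /* 1987 vs 2702: about 26.5 */
}

/* Extra throughput (calls per unit time) relative to the baseline. */
double throughput_gain_pct(double new_cycles, double old_cycles)
{
    return 100.0 * (old_cycles / new_cycles - 1.0);  /* 1987 vs 2702: about 36.0 */
}
```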

"Don't Teach Father How To Make Babies!" is far below your intellectual level, stop it.