In this thread (http://masm32.com/board/index.php?topic=1458.0) Habran, jj2007 and others submitted memory copy algorithms that were tested in jj's program MemCopyAlgos, which ran them on a block of 2048 bytes with all possible alignments. (BTW the thread is a lot of fun to read.) I substituted a new algo for Habran's "Ferrari" (because I had trouble inserting it elsewhere in the program).
This new algo is fastest on every measure: it beat every other algo at every alignment. Here are typical runs on Intel and AMD machines (rrr314's Ferrari in the last column):
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 114
------------------------------------------------------------------------------------
2048, d0s0-0 159 199 242 239 239 218 201 144
2048, d1s1-0 255 231 263 270 258 249 257 208
2048, d7s7-0 256 234 265 272 260 250 257 208
2048, d7s8-1 260 244 657 473 245 248 257 216
2048, d7s9-2 259 245 657 473 245 248 257 208
2048, d8s7+1 252 239 641 471 246 234 257 208
2048, d8s8-0 256 234 288 290 283 268 257 208
2048, d8s9-1 252 243 649 474 245 248 257 208
2048, d9s7+2 257 239 634 471 246 237 257 209
2048, d9s8+1 257 239 634 471 246 237 257 208
2048, d9s9-0 256 234 275 272 268 253 257 208
2048, d15s15 256 235 274 274 272 254 257 208
--- ok ---
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 157
------------------------------------------------------------------------------------
2048, d0s0-0 170 250 262 326 327 274 278 162
2048, d1s1-0 419 385 384 389 494 348 796 238
2048, d7s7-0 375 477 350 436 407 357 758 245
2048, d7s8-1 628 682 699 402 250 303 617 224
2048, d7s9-2 598 665 757 420 250 296 621 223
2048, d8s7+1 603 643 702 409 249 257 617 219
2048, d8s8-0 356 345 315 367 386 318 298 228
2048, d8s9-1 609 651 701 399 249 298 591 217
2048, d9s7+2 587 661 690 405 274 244 596 225
2048, d9s8+1 587 660 707 403 249 250 604 225
2048, d9s9-0 356 376 347 405 386 331 667 224
2048, d15s15 356 374 334 398 376 331 680 224
--- ok ---
It simply uses all 8 xmm registers to move a large chunk of 128 bytes per loop ... nothing fancy at all. The algo is also shorter, at 157 bytes, than all but the most basic competitors.
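For readers who just want the shape of it, here is a minimal sketch of one pass of such a loop (register assignments and bookkeeping are illustrative only - the real routine is in the zip):
; sketch: esi = source, edi = dest, ecx = number of 128-byte chunks (assumed > 0)
copy128:
movdqu xmm0, [esi] ; read all 8 xmm's
movdqu xmm1, [esi+10h]
movdqu xmm2, [esi+20h]
movdqu xmm3, [esi+30h]
movdqu xmm4, [esi+40h]
movdqu xmm5, [esi+50h]
movdqu xmm6, [esi+60h]
movdqu xmm7, [esi+70h]
movdqu [edi], xmm0 ; write all 8 xmm's
movdqu [edi+10h], xmm1
movdqu [edi+20h], xmm2
movdqu [edi+30h], xmm3
movdqu [edi+40h], xmm4
movdqu [edi+50h], xmm5
movdqu [edi+60h], xmm6
movdqu [edi+70h], xmm7
add esi, 128
add edi, 128
dec ecx
jnz copy128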
It could be made somewhat faster, but I didn't bother - just wanted to prove the point that this basic idea works. BTW I have since found the idea on the net; google "Mark Larson optimize assembler" for a discussion, and also see FASM's discussion board. I'm surprised the idea is so obscure. Still haven't found the "more advanced" techniques used in my "faster tokenizer" algo - which makes me wonder if there's something wrong with them.
It's written in 32-bit; I learned my lesson from last time - there's nothing non-standard about it.
Unfortunately the display might not always fit right in your command window, but I didn't change anything in jj's program more than necessary. Also note that the very first number generated (upper left hand corner) is, I believe, artificially low due to a start-up problem I intend to discuss in more detail elsewhere (my algo beat it anyway).
The zip contains jj's testing program, renamed MemCopyAlgosrrr, and the exe thereof. My algo is in it, called "xmemcpy".
[edit] a more complete version, faster for larger memory copies, is in the zip a few posts below
[edit] an even more complete version is in reply 25, next page. But the one here is a lot simpler (half as many lines) and shows the basic idea: using all 8 xmm registers to copy 128-byte chunks.
rrr,
Something we determined a while ago is that REP MOVS/B/D is slow under about 500 bytes. After that, internal special-case circuitry kicks in and it gets a lot faster. Have you played with different buffer sizes for the copy operation?
No; I didn't "optimize" it specifically for 2048 (the standard number used in that thread), but I didn't try it in any other range either. As mentioned, I just wanted to demonstrate the feasibility of the idea. But I'll give it a shot in a while and let you know.
Ran it at 20480 and 100000. My algo still does all right, and is best at some settings. But at 20480, MemCoP4 and MemCoC2 are overall ahead, my algo about third. At 100000 it's surprising how neck-and-neck they all are, except MemCo2 and MemCoC3, which are slower.
I'm using all the xmm reg's, all unaligned. So for 20480 I bet I can still beat them all, since the two current winners are using just one or two xmm's, but aligned (either source or dest). So I'll try aligning one or the other also, and (I *think*) the strategy of using all 8 will give me the advantage. At least, it's proved effective in the 2 cases I've tried so far. Tomorrow I'll check it out.
However dunno about the larger numbers (100,000). It's quite amazing how all the different algorithms converge to the same speed up there (with 2 exceptions) - did u know they'd do that? The 2 xmm-aligned algos are only a little better, less than 1%. BTW jj has a comment recommending not to exceed 100,000 with this program; someday I'll check out some bigger numbers with my own test prog.
Well, thanks for the suggestion! Here are the numbers I got:
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 157
------------------------------------------------------------------------------------
20480, d0s0- 2796 2760 2795 2835 2803 2786 2761 2640
20480, d1s1- 2935 2782 2809 2815 2838 2811 3036 4244
20480, d7s7- 2899 2786 2816 2848 2848 2804 3038 4343
20480, d7s8- 9095 9091 5962 5716 3156 3125 9210 3114
20480, d7s9- 9092 9090 5968 5720 3129 3124 9357 3101
20480, d8s7+ 3257 3163 5988 5270 3149 3159 3225 3240
20480, d8s8- 2923 2793 2888 2825 2840 2835 3040 4330
20480, d8s9- 9086 9091 5962 5721 3165 3126 9065 3093
20480, d9s7+ 3253 3159 5977 5264 3156 3131 3237 3205
20480, d9s8+ 3267 3166 5975 5271 3115 3156 3256 3212
20480, d9s9- 2880 2826 2808 2833 2870 2861 3043 4339
20480, d15s1 2912 2822 2867 2820 2835 2844 3029 4269
--- ok ---
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 157
------------------------------------------------------------------------------------
100000, d0s0 12981 12716 12930 12924 12930 12969 12717 12962
100000, d1s1 13047 12732 12978 12963 12992 12907 13614 13905
100000, d7s7 13046 12729 12946 12982 12944 12989 13624 13938
100000, d7s8 13655 13281 24727 19108 13681 13681 13639 13951
100000, d7s9 13641 13254 24695 19110 13702 13671 13599 13857
100000, d8s7 13588 13305 24930 20277 13651 13703 13622 13961
100000, d8s8 13075 12739 12986 12946 12973 12960 13650 13950
100000, d8s9 13619 13281 24846 19103 13681 13689 13648 13937
100000, d9s7 13649 13284 25049 20303 13680 13668 13656 13978
100000, d9s8 13643 13293 25028 20336 13583 13586 13620 13950
100000, d9s9 13036 12728 13005 12963 12962 12961 13636 13943
100000, d15s 13036 12708 12942 12965 12922 12931 13656 13950
--- ok ---
Quote from: rrr314159 on March 03, 2015, 05:17:16 PM
At 100000 it's surprising how neck-and-neck they all are...
a sign that you may be hitting hardware limitations
It is certainly fast, and could be an option if you have a special need for repeatedly copying large blocks.
For MasmBasic, I chose rep movsd in the end, both on grounds of speed and convenience: it is a frequently used algo, and I wouldn't like a wrapper preserving the xmm regs for each and every little memcopy.
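(To illustrate the overhead in question - a hypothetical sketch of preserving just two of the xmm regs on the stack; the real cost scales up with all eight:)
sub esp, 32 ; stack room for two xmm regs
movups [esp], xmm6 ; save
movups [esp+16], xmm7
; ... the xmm-hungry memcopy would run here ...
movups xmm6, [esp] ; restore
movups xmm7, [esp+16]
add esp, 32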
It's been a long time since I did any benchmarking on memcopy, but I found REP MOVS/B/D seemed to have all of the advantages on cached reads, non-cached writes and optimal prefetch, and I think Intel added the special-case circuitry because it has been so commonly used for so long. On sub-500-byte copies from memory, incremented pointers (MOV EAX / MOV AL) were faster on small buffers. I tend to look at the rest of the algo that does the memcopy to see what is worth the effort; there are enough times when a clunky, fussy big mover is a PITA and a simple byte copy is faster than the algo calling it.
Thanks all for your comments! Here's what I wrote b4 I saw them, and I think it agrees with your general attitude.
This new version has the destination aligned, and is fastest (vs. every algo at every alignment) at 20483 bytes. It's particularly clear on AMD, but on Intel too. At 100000 it's hard to say what's fastest; a few are very close, and it varies between the two computers - you can look at the numbers. Certainly, my algo is no longer the "winner".
There are some obvious things to do to speed it up at the higher byte counts, but I'm sick of this - it's a pursuit well suited to masochists. First, it's very frustrating to get good timing numbers. Second, a change can be good in one circumstance (byte count, alignment, phase of the moon) but bad in another; it varies between computers, other progs running, etc. Third, the algo (which started out simple) is getting longer and longer, harder to understand/debug/deal with. Fourth, reviewing old optimization pages (like Mark Larson's, a few years old), I noticed that tricks that worked back then are obsolete now! Branch prediction is totally different from what he optimized for. Unaligned moves are pretty much just as fast as aligned. Non-temporal moves are completely useless on my machines. And so forth. So all this effort will be meaningless soon. Fifth, when u get right down to it, who cares about a few nanoseconds one way or the other?
The zip contains MemCopyAlgosrrr.asm and .exe, same as previous posting.
[edit] fixed 3 small things in zip, now it's cleaner; functionality unchanged, saves a couple nanoseconds, 2015/3/6
--- ok ---
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 206
------------------------------------------------------------------------------------
20483, d0s0- 2821 2750 2804 2834 2841 2834 2777 2717
20483, d1s1- 2919 2784 2859 2831 2816 2863 3045 2737
20483, d7s7- 2885 2794 2805 2787 2787 2829 3029 2678
20483, d7s8- 9093 9095 5935 5696 3139 3096 9212 2830
20483, d7s9- 9092 9092 5940 5700 3139 3136 9361 2887
20483, d8s7+ 3244 3146 5951 5249 3099 3089 3223 2931
20483, d8s8- 2919 2783 2795 2821 2796 2785 3035 2683
20483, d8s9- 9087 9089 5935 5700 3145 3092 9067 2893
20483, d9s7+ 3258 3148 5949 5249 3086 3109 3245 2868
20483, d9s8+ 3247 3158 5946 5243 3108 3142 3231 2890
20483, d9s9- 2920 2787 2843 2848 2862 2866 3067 2721
20483, d15s1 2902 2801 2858 2810 2839 2831 3052 2668
--- ok ---
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 206
------------------------------------------------------------------------------------
100000, d0s0 13082 12709 12938 12972 12974 12942 12721 12956
100000, d1s1 13099 12702 12994 12958 12967 13022 13664 13020
100000, d7s7 13081 12739 13022 12983 12961 12991 13663 12999
100000, d7s8 13652 13312 24752 19100 13683 13696 13654 13537
100000, d7s9 13652 13267 24698 19113 13629 13704 13652 13624
100000, d8s7 13640 13315 24653 20291 13659 13684 13658 13570
100000, d8s8 13023 12695 12979 12983 12984 12966 13653 12951
100000, d8s9 13599 13287 24610 19086 13583 13645 13642 13647
100000, d9s7 13602 13305 25021 20289 13618 13634 13670 13578
100000, d9s8 13652 13317 24692 20279 13713 13681 13643 13533
100000, d9s9 13035 12718 12954 12993 12989 13064 13636 12974
100000, d15s 13071 12738 12989 13002 12948 13028 13658 12927
--- ok ---
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 206
------------------------------------------------------------------------------------
20483, d0s0- 5989 5914 5920 5634 5725 5723 5743 4819
20483, d1s1- 6143 5738 5653 5616 5600 5506 9796 4846
20483, d7s7- 6018 5656 5694 5601 5629 5602 9668 4869
20483, d7s8- 10563 10941 9734 9322 7603 7613 9132 6136
20483, d7s9- 10578 10961 9736 9324 7613 7637 9994 6116
20483, d8s7+ 10728 10852 7924 8886 7223 7194 10801 6290
20483, d8s8- 5960 5617 5682 5621 5606 5624 5615 4847
20483, d8s9- 10624 10974 9740 9322 7714 7608 10651 6106
20483, d9s7+ 10787 10809 7824 8919 7190 7189 9878 6206
20483, d9s8+ 10810 10818 7842 8917 7253 7223 8821 6201
20483, d9s9- 6059 5650 5675 5594 5577 5614 9622 4863
20483, d15s1 5861 5594 5680 5631 5568 5648 9694 4765
--- ok ---
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 206
------------------------------------------------------------------------------------
100000, d0s0 24435 22509 22496 22415 22002 22048 22183 22259
100000, d1s1 22379 22313 22026 22110 21864 21799 43627 22193
100000, d7s7 22365 22331 22055 22006 21805 22071 43413 22149
100000, d7s8 48009 47571 31597 32397 25654 24978 39742 24342
100000, d7s9 48044 47574 31809 32465 25620 24905 43194 24367
100000, d8s7 49095 49177 31628 31530 25843 24612 49296 25211
100000, d8s8 22408 22264 22047 22044 21643 21890 22259 22186
100000, d8s9 47976 47620 31494 32418 25651 24999 47788 24421
100000, d9s7 49815 49884 31259 31463 25324 24670 44214 25206
100000, d9s8 49755 49189 31743 31498 25266 24655 39641 25244
100000, d9s9 22411 22149 21979 22138 21807 21903 43295 22222
100000, d15s 22365 22310 22133 22023 21826 21728 43418 22051
ps jj2007, I hadn't thought of needing to preserve all those xmm regs; I was just considering the problem in isolation from the real world, but you're right. Maybe that's why this idea is so obscure. Still, in special cases, when the xmm regs are not being used for anything else and the speed actually matters, it could be applicable. Also in 64-bit, with 16 xmm's, it makes more sense.
Here are the results on my aging i7.
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)
Algo memcpy MemCo1 MemCo2 MemCoC3 MemCoP4 MemCoC2 MemCoL xmemcpy
Description CRT rep movs movdqa lps+hps movdqa movdqa Masm32 rrr314's
dest-al psllq CeleronM dest-al src-al library Ferrari
Code size ? 70 291 222 200 269 33 206
------------------------------------------------------------------------------------
20483, d0s0- 3530 3520 3502 3497 3499 3511 3655 3376
20483, d1s1- 3752 3554 3535 3538 3541 3540 3969 3341
20483, d7s7- 3758 3560 3534 3538 3541 3544 3921 3346
20483, d7s8- 8911 8889 7560 7840 4461 4465 9022 3770
20483, d7s9- 8909 8890 7560 7840 4463 4464 9414 3763
20483, d8s7+ 4489 4445 7352 7038 4417 4411 4496 3814
20483, d8s8- 3759 3558 3537 3543 3536 3544 3913 3344
20483, d8s9- 8901 8890 7554 7839 4466 4465 8867 3763
20483, d9s7+ 4490 4448 7345 7038 4413 4411 4499 3815
20483, d9s8+ 4489 4446 7345 7042 4405 4408 4468 3818
20483, d9s9- 3759 3560 3539 3538 3535 3538 3914 3346
20483, d15s1 3581 3559 3530 3528 3526 3535 3951 3311
--- ok ---
rrr
Mis-alignment can be a problem, especially if both the source and the destination buffers are mis-aligned, and especially if they have different alignments. The usual case is that one or the other of the buffers can be aligned mod 16. The way to handle these is to move a block of 16 bytes from the start of the buffer and a block of 16 bytes at the end of the buffer with un-aligned XMM moves. Then calculate the size left to be moved (data size - 32), round this up to the next mod 16, and start at the first mod-16 byte boundary of the un-aligned buffer, moving 16 bytes at a time (there will be a small overlap at either end).
Dave.
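A rough sketch of the head/tail idea Dave describes (register use hypothetical; this variant aligns the destination, rounds the middle down rather than up, and assumes a count of at least 32 and non-overlapping buffers):
; esi = source, edi = dest, ecx = count (>= 32)
movdqu xmm0, [esi] ; 16-byte head, unaligned
movdqu xmm1, [esi+ecx-16] ; 16-byte tail, unaligned
movdqu [edi], xmm0
movdqu [edi+ecx-16], xmm1
mov eax, edi
neg eax
and eax, 15 ; bytes up to dest's next 16-byte boundary
add edi, eax ; first aligned dest address
add esi, eax ; keep source in step
sub ecx, eax
and ecx, -16 ; middle section, a multiple of 16 (overlaps head and tail)
@@:
movdqu xmm0, [esi] ; source may still be unaligned
movdqa [edi], xmm0 ; dest stores are aligned
add esi, 16
add edi, 16
sub ecx, 16
jnz @B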
Quote from: rrr314159 on March 04, 2015, 06:32:41 AM
ps jj2007, I hadn't thought of needing to preserve all those xmm regs; I was just considering the problem in isolation from the real world, but you're right. Maybe that's why this idea is so obscure. Still, in special cases, when the xmm regs are not being used for anything else and the speed actually matters, it could be applicable. Also in 64-bit, with 16 xmm's, it makes more sense.
If speed matters, it looks like a really good alternative. As a general purpose algo, it would fail already when looking at the minimum chunk size...
@KeepingRealBusy,
thx for the suggestion. The way I did it is more obvious; u can check it out in the zip, at the beginning of the "xmemcpy" proc. Hard to say which is better; it probably varies with circumstances. The main advantage of my approach is that it fits with the existing algo: I didn't have to change a single line, just added 20, whereas yours would require a total rewrite. It's interesting that these are 2 completely different ways to handle the same seemingly straightforward problem.
@jj2007: just in case of misunderstanding, altho my chunk size is 128, of course the routine handles any smaller number also; for < 128 it uses a single xmm reg (chunk = 16); less than that, it moves byte by byte (chunk = 1 :) Could get more complex, use chunk sizes of 2, 4, 8, and 32, 64; I saw an Agner Fog algo that did that; but it's all, really, of academic interest only, IMHO
@hutch thx for checking it out; give your old machine a rest now, it's earned it ;)
Quote from: rrr314159 on March 04, 2015, 07:19:34 PM@jj2007: just in case of misunderstanding, altho my chunk size is 128, of course the routine handles any smaller number also
Sorry, had not seen that one :redface:
shr ecx, 7 ; divide by 128 (chunk size)
loop128: ; do chunks of 128
jle endloop128
Let me apologize for that sloppy code which aligns the dest block. There are 3 places where instructions are unnecessarily executed. I knew about them when I posted it, but 1) they make no difference at all to the final results, 2) I didn't think anyone would read it, and 3) I was so tired of timing, timing, timing (to paraphrase Hamlet) that the extra 20 seconds to make the changes - and the couple minutes to test them - was more than I could stand. If the changes had to be made I would have just chucked the whole thing.
But the original code is fine (I think) and compact enuff that u have to read carefully, as u found out! At one point, a condition code is set, then not checked for another 21 lines (of course it's unchanged by the intervening lines). This gives the processor max time to predict the branch correctly, and makes the code harder to read - which, sometimes, I consider a plus :)
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol
you never seem to get both
in the past, i have done something like this (rough sketch below):
1) test lower bits of address register for desired alignment
branch if aligned
2) perform operation on a byte (copy or whatever)
3) test lower bits of address again
branch if aligned
4) perform operation on a byte
and so on, until alignment is achieved (the lower address bits are 0's)
there's no good way to make that pretty ;)
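in code, that byte-at-a-time pattern comes out roughly like this (schematic only):
; schematic: bump esi/edi a byte at a time until esi is 16-aligned
alignloop:
test esi, 15 ; test lower bits for desired alignment
jz aligned ; branch if aligned
mov al, [esi] ; perform the operation on a byte
mov [edi], al
inc esi
inc edi
jmp alignloop ; and so on, until the lower address bits are 0's
aligned: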
Don't forget u also need to check whether you're exceeding the length to copy! What if it takes 15 to align, but they only asked u to copy 3? Anyway, the way I did it: first figure how many bytes are needed to get to alignment, then do that many. keepingrealbusy had a totally different way (see his post). In this case it's not real important how u do it, since once u get there you'll probably copy many thousands of bytes; a few instruction cycles more or less to align really doesn't matter. Interesting to see 3 different ways to do such a seemingly straightforward task; wonder how many more there are!
rrr,
The primary concern is where the data is located and where it has to reside. What I usually run into is that I need to read in a file of data. I allocate the buffers with VirtualAlloc and read in the data unbuffered (for all but the last partial disk sector), then read in that last block buffered and add the size read to the size of the prior blocks (close the file and open it as buffered, position it the last partial size from the end of the file, then read buffered).
If you need to keep the same alignment with data that is on an unaligned boundary, then use my method. Just allocate a buffer of the required size rounded up to mod 16, then offset the start of the destination data area to match the unaligned source, unaligned-copy the starting 16 bytes and the ending 16 bytes, then copy all of the middle bytes (starting at the first aligned location) using aligned moves.
Dave.
That approach sounds good IF you have the freedom to allocate your destination buffer, but that's not the point of this exercise. I'm given both buffers by the user and must write where he specifies. Normally dest won't have the same alignment as source, and I have to put up with it. If they happen to match, my method works out very similar to yours (BTW note that movdqu is almost as fast as movdqa these days, on aligned data). But normally they won't be aligned, so only half the moves can use movdqa; in my case it's the writes; u could do it with the reads instead; it seems to make little difference.
> (BTW note that movdqu is almost as fast as movdqa these days, on aligned data)
This is where the action is with later hardware: while 16-byte aligned is slightly faster on some hardware, it's not by much, and it's a lot of messing around to set it up. You will tend to be restricted more by the memory bottleneck than by the instructions that move or copy memory. I mainly see the gain in later XMM in what it can do with a 128-bit block of data, as there are enough techniques that just grunt memory content around and are fast enough in most contexts. The actual data processing gives the main speed gain, not the memory copy speed.
There are many contexts where I happily use a byte-level copy, as it is unproblematic and often a lot faster than the data processing done on the data.
deleted
Hi nidud,
it's too bad you didn't use the version of the 128-byte mover that also works for larger memory blocks (20,000 or so). Originally I just wanted to show the technique for small jobs around 2048, but hutch suggested I try it at higher numbers; the result was the second version, posted in reply #7 (see discussion). So, if I get around to it I'll plug it into your tester and see how it does. My guess is it will come in second, behind proc_3, judging by my previous numbers. Still, I don't know if I can summon the energy - timing algos is very dull work!
As for the short 31 bytes, simple byte copy is much better for such jobs. In fact pencil and paper would probably be better.
I feel obligated to check it out - but I'd promised myself I wouldn't touch timing again for at least a year ... so if u feel like doing it, go for it - it's in Reply #7!
> but I'd promised myself I wouldn't touch timing again for at least a year
Don't give up on timing, it's good for you, and as an aside you tend to pop faster algos if you do it regularly. You get faster at it over time, and with more practice at tuning the timing to the task, the algo will perform a lot better.
deleted
You understand, I was merely demonstrating the efficacy of using 8 xmm reg's and paid no attention at all to other details ... simply slapped it together. However, thanks for the mods. I'll look at it (now you've got me interested) and see if any further improvements are available.
[edit] also thanks for going ahead and testing the "original new" algo; I know what a PITA all this can be ;)
@nidud,
I finally managed to get my memory copy algo faster than yours, but it's a close shave. I used some of your suggestions. Doing the tail that way hadn't occurred to me until keepingrealbusy suggested it. Also an adaptation of your short-copy handling (similar to Agner Fog's). Wouldn't have bothered without your example, which proved these ideas better, so thanks. Code is below.
I couldn't get your test bed working right; for one thing the eax return value wouldn't co-operate, and I'm not familiar with that non-standard environment, so I put together this extremely standard masm32 test bed with MichaelW's algos. Hope it will suit. Also included a shorter test and, more important, made runs aligned and also misaligned.
Here's a run where my algo is best on every number, but that's unusual - had to make 6 runs to get it :biggrin: Generally, tho, it seems overall best. Shorter too.
mem2 = nidud's improved version of rrr's first algo (was called mem4)
mem3 = nidud's algo (equals mem3 in his test prog)
mem4 = rrr latest (hopefully last) attempt
test.asm: 506 lines, 3 passes, 3688 ms, 0 warnings, 0 errors
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(300)
8 2387 2359 2337
31 2396 2291 2193
271 7372 9300 6269
2014 20854 25887 13538
262159 5229387 5238006 5165092
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(300)
8 1198 1190 1180
31 1248 1198 1147
271 3952 5367 3714
2014 14134 25988 14074
262159 5266919 5485524 5263508
; «««««««««««««««««««««««««««««««««««««««««««««««««««
mem4 proc Dest:PTR BYTE, Source:PTR BYTE, ln:DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; written by rrr314159 2015 03/08
push esi
push edi
mov edi, [Dest]
mov esi, [Source]
mov ecx, [ln]
; if very small get it over with quick
cmp ecx, 16
jge bigger_than_15
test ecx, 8
jz @f
movq xmm1, [esi]
movq xmm0, [esi+ecx-8]
movq [edi], xmm1
movq [edi+ecx-8], xmm0
jmp end_memcopy
@@:
test ecx, 4
jz @f
mov eax, [esi]
mov ebx, [esi+ecx-4]
mov [edi], eax
mov [edi+ecx-4], ebx
jmp end_memcopy
@@:
test ecx, 2
jz @f
mov ax, [esi]
mov [edi], ax
@@:
test ecx, 1
jz end_memcopy
mov al, [esi+ecx-1]
mov [edi+ecx-1], al
jmp end_memcopy
bigger_than_15:
movdqu xmm1, [esi] ; 16 or greater it's safe
movdqu xmm0, [esi+ecx-16]
movdqu [edi], xmm1 ; to do head
movdqu [edi+ecx-16], xmm0 ; and tail here
cmp ecx, 32
jg @f
end_memcopy: ; put end close to short copies, quicker
pop edi
pop esi
ret 12
@@:
; continue for more than 32 bytes - check dest alignment
test edi, 15
jz destisaligned
mov ebx, edi
and ebx, -16
add ebx, 16 ; ebx gets bytes to align edi (1<= ebx <=15)
sub ebx, edi ; don't need to check < ecx
sub ecx, ebx ; how many bytes will be left to do in ecx
lea edi, [edi+ebx] ; align dest
lea esi, [esi+ebx] ; make source match
destisaligned: ; esi, edi and ecx should all be right
mov ebx, ecx
and ecx, 127 ; tail bytes (up to 127) in ecx
shr ebx, 7 ; divide by 128 (chunk size)
jle endloop128
align 4
loop128: ; do chunks of 128
dec ebx ; dec ebx long b4 using it
movdqu xmm1, [esi] ; read all 8 xmm's
movdqu xmm2, [esi+10h]
movdqu xmm3, [esi+20h]
movdqu xmm4, [esi+30h]
lea esi, [esi+40h] ; seems to work best adding here
movdqu xmm5, [esi]
movdqu xmm6, [esi+10h]
movdqu xmm7, [esi+20h]
movdqu xmm0, [esi+30h]
lea esi, [esi+40h] ; add to esi long b4 use
movdqa [edi], xmm1 ; write all 8 xmm's
movdqa [edi+10h], xmm2
movdqa [edi+20h], xmm3
movdqa [edi+30h], xmm4
lea edi, [edi+40h]
movdqa [edi], xmm5
movdqa [edi+10h], xmm6
movdqa [edi+20h], xmm7
movdqa [edi+30h], xmm0
lea edi, [edi+40h]
jg loop128
endloop128:
shr ecx, 4 ; ecx has chunks of 16
jle end_memcopy
loop16:
dec ecx
movdqu xmm1, [esi] ; just one xmm could use bigger
lea esi, [esi+10h] ; chunks (32, 48 etc) but why bother
movdqa [edi], xmm1
lea edi, [edi+10h]
jg loop16
jmp end_memcopy
mem4 endp
The zip contains test.asm with all 3 algos, and the exe
rrr,
Thanks for giving me the tail end credit. Most would grab the code and run.
Dave
This was basically the last time I played with memory copy.
mov esi, pad1
mov edi, pad2
; --------------------------------------
; if blen < 16 jump to tail end trimmer
; --------------------------------------
mov eax, blen
cmp eax, 16
jae nxt
mov scnt, eax
xor edx, edx
jmp bypass
nxt:
shr eax, 4 ; int div by 16
mov lcnt, eax
shl eax, 4 ; int mul by 16
mov ecx, blen
sub ecx, eax
mov scnt, ecx
mov ecx, lcnt
mov edx, -16
; -------------------------------------------------
; perform the bulk of the process in 16 byte blocks
; -------------------------------------------------
align 16
lbl1:
add edx, 16
prefetcht1 BYTE PTR [esi+edx]
movdqa xmm0, XMMWORD PTR [esi+edx]
movntdq [edi+edx], xmm0
sub ecx, 1
jnz lbl1
; -------------------------------
; if no uneven tail, jump to exit
; -------------------------------
cmp scnt, 0
je bye
; -------------------------------------------
; process any uneven tail as a BYTE operation
; -------------------------------------------
add edx, 16
bypass:
mov ecx, scnt
test ecx, ecx ; also covers blen = 0
jz bye
lbl2:
movzx eax, BYTE PTR [esi+edx] ; read from source (esi)
mov [edi+edx], al ; write to destination (edi)
add edx, 1
sub ecx, 1
jnz lbl2
bye:
It was slower than the REP MOVSD code.
cld
mov esi, src
mov edi, dst
mov ecx, ln
shr ecx, 2 ; dword count
rep movsd
mov ecx, ln
and ecx, 3 ; 0-3 tail bytes
rep movsb
deleted
Quote from: dedndavefrom my experience, code that aligns for a loop can be either "pretty" or "fast" - lol
- when I first saw that a couple days ago I thought my approach was both pretty and fast, but I forbore to disagree: I was taught not to argue with my elders :biggrin: But now I see you're right, it wasn't all that fast; and since it's not the fastest, it's not pretty either
Quote from: keepingrealbusyThanks for giving me the tail end credit. Most would grab the code and run.
I have a stock answer when I'm told that. You can't blame those ("most") people: if they didn't grab credit for others' work, they'd never get credit for anything
Quote from: hutchIt was slower than the REP MOVSD code.
I've got to check out the REP MOVS* instructions. Fact is, I've never used them! When I came here in January I was still using 1980-era techniques (Z80); I've learned a bunch of modern stuff, like macros and SIMD, but to me REP MOVSD is still "new-fangled"
Quote from: nidudOne of the problems with these tests is that code location is sensitive to timings. ... The algo to test is then handled as data, and in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Sure, that seems like an excellent approach.
I switched the order of the algos in my test bed and may have experienced the phenomenon you're talking about: in one category, ALIGNED 2014, the two algos pretty much reversed their timings, as though it ONLY depended on location! Other numbers weren't affected, although the numbers are not very stable anyway, so it's hard to say. So, I simply copy-pasted a dozen runs before the final three, for which I report the numbers (still with my algo in what may be the "worst" position, first). Takes longer but it's a lot more stable. Here are the first 4, and only, runs I made:
mem2 = rrr/nidud modified; mem3 = nidud; mem4 = rrr latest
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(294)
8 2969 2759 2566
31 2706 2651 2559
271 6891 8023 6884
2014 15540 39733 15315
262159 5258012 5218352 5255524
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(294)
8 1441 1350 1386
31 1402 1346 1499
271 3655 5698 3788
2014 13863 26579 14449
262159 5318643 5385829 5317655
Press any key to continue ...
C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test
***********
ASCII build
***********
test.asm: 531 lines, 3 passes, 3656 ms, 0 warnings, 0 errors
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(294)
8 2688 2730 2538
31 1452 1377 1279
271 3468 4053 3338
2014 13658 25588 13532
262159 5259190 5255524 5259200
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(294)
8 1465 1510 1352
31 1388 1327 1286
271 3577 5337 3571
2014 13922 25698 14171
262159 5370304 5417275 5319313
Press any key to continue ...
C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test
***********
ASCII build
***********
test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(294)
8 2689 2730 2539
31 1428 1437 1429
271 3441 4143 3521
2014 13636 25594 13482
262159 5211287 5219878 5276594
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(294)
8 1409 1378 1335
31 1449 1387 1281
271 3603 5482 3889
2014 13870 25605 14170
262159 5372493 5509407 5361170
Press any key to continue ...
C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test
***********
ASCII build
***********
test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors
BUFFERS ALIGNED
thecount mem2(314) mem3(245) mem4(294)
8 2709 2754 2550
31 1439 1407 1353
271 3529 4226 3449
2014 14126 25590 13531
262159 5209791 5210645 5211581
BUFFERS MISALIGNED src 11 dest 7
thecount mem2(314) mem3(245) mem4(294)
8 1416 1385 1358
31 1397 1338 1291
271 3483 5454 3570
2014 13910 25713 14117
262159 5315510 5381937 5314749
Press any key to continue ...
My latest is still ahead (27 out of 40) but the differences are pretty negligible. Not surprising, since the two algos are now using, with 4 or 5 exceptions, the same techniques. The only significant conclusion is that both are definitely better than your original mem3 in the middle range (271 and 2014). BTW this reminds me of a comment someone made in an old post: it seems everybody's algo is best when run on their own machine. If anyone wants to check my results, the code is posted above, but reverse the order of the algos and copy/paste a dozen copies or so; it may make a little difference.
Apparently your machine doesn't do 128-bit (XMM) moves right; many older machines do them as two separate 64's, completely negating any advantage. That's why your 64-block worked so well; I'm glad u checked that out. I think the basic point is demonstrated: larger copy blocks (the largest a given machine supports efficiently) do give a bit more speed. BTW my machine has a similar prob with AVX 256: it does it as 2 separate 128's, so it's actually slower for this task (a little better for some others, though).
... But none of it matters if hutch, and the ref you mention later, are right: that MOVS* is better on newer hardware.
Well I reckon we've beaten memory-moving to death! I'm working on some much more interesting fast algos, hope to post something in a couple days. Thanks for the exercise, and various tips!
deleted
while that's true, to some extent
you can't avoid checking cases to get alignment :P
that means some form of conditional branching
if you play with it, i think you'll find that JNZ is faster for reverse branches and JZ is faster for forward branches :t
you can sometimes design your code to take advantage
you might also try a jump table - haven't tried that, but it might be faster
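a schematic of that arrangement (labels hypothetical):
top:
; ... hot loop body ...
dec ecx
jnz top ; reverse branch, usually taken
test eax, eax ; some rare condition
jz rare_case ; forward branch, usually not taken
; ... common path continues ...
rare_case: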
deleted
Hi nidud,
This is a very interesting investigation; unfortunately I don't have time to get into it at the moment, but in a couple days I'd really like to do so. The following comments are (deliberately) w/o reference to anything but my organic memory subsystem ...
First - since we're getting down to details - do a global search and replace on my algo, changing "ebx" to "edx". Of course, it has no effect on timing.
Now, your jump timings are right in effect - that is, the conclusion drawn is right: it's better to skip that branch. I noticed u did that but didn't bother to change my code, thinking it was meaningless; but it seems it's more significant. I say "in effect" because in actuality, jmps take far fewer cycles than the numbers you give: 13.5 and, uh (alright, I'll take a look ...) 30.1, 10.7 etc. Yet those are reasonable Expected Values because of branch prediction. The reason for avoiding jumps is not that they're inherently that long; rather, if the branch is incorrectly predicted, it costs many cycles while the pipeline is cleared and reloaded. So u can't come up with exact numbers w/o many more details. It's not only different for different companies (as u mention, AMD and Intel): it's different for the same processor, depending on recent history - even the very first time the branch is encountered - because other nearby branches can affect prediction for this branch. Thus a detailed analysis is inherently probabilistic. As I say, the numbers u give look like good EVs (presumably u got them from some ref that did such an analysis, or a performance analyzer?) and they give the correct conclusion: the branch shouldn't be there.
I only glanced thru your algo (mem3), but here's what I noticed. There was a clever substitute for the initial pushes which saved - either a byte or a cycle, I forget which. The comparison tests were different; mine, I felt, were more intuitive. Other small details amount either to nothing or not much: a couple of slightly better techniques which I didn't bother to borrow (like the push substitute; next time I'll use them), and a couple of places where I think my technique is a little better. Half the algo I didn't read, but I presume there's nothing too important there. But I removed almost all the alignments (from your mod'ed version of my algo); my tests showed they were basically insignificant (maybe a cycle or two), and I wanted to reduce the code size, enhancing cache-fit probability by a half percent if that.
My point is this. The differences between the two algos are minimal (to my knowledge); but now you're showing that your branch elimination saves quite a few cycles, more than I had supposed. So why, then, is my algo - more accurately, my version of the same algo, since these diffs are virtually just a matter of style - why is it faster? (by a small, but significant, amount?) Since u, apparently, have gone thru them with a fine-tooth comb, presumably u must know. There's something about mem4, versus mem2, that's saving more than just a cycle or two - it must be quite noticeable - so, what is it?
I'm not kidding about a couple days: happy to investigate this further with you when I have more time.
deleted
@nidud, Thanks for providing your test software; in future I'll use some of those (better) techniques.
But I think I've answered my own question: why my algo does a bit better when timed my way instead of yours. I'm using the standard approach: not loading .bin algos or FlushInstructionCache'ing. (These are good things to do, thanks for the tip.) But doing it the usual way, when the dest is aligned the branch (around the aligning instructions) is hit many thousands of times, and *always* taken. So the processor predicts the branch perfectly, and that jump costs almost nothing, a couple of cycles (not sure exactly); whereas you're executing six instructions there (13.5 cycles). This matches my results: my algo's a bit ahead on ALIGNED data, but a bit behind on non-aligned, since not taking the jump costs a little. Conclusion:
- rrr's rule for not-very-careful timing test beds: Conditional branches (that depend in a simple way on the inputs) are almost free, so use as many as u want!
Reminds me of something hutch said: "I have always been an advocate of real time testing ...". In real life the environments will be very different, and variable; you don't know what you'll run into, so:
- rrr's timing statistical rule of thumb: ignore differences smaller than about 5%.
I think you should update your scoring routine to allow ties!
A more important thought re branch prediction occurred to me. When a utility routine like memory-copy is called in real life, there's a decent chance the branch to it wasn't predicted right, so possibly the pipeline is going to be empty (other factors may cause the same thing). So, always put a (necessary) branch right at the beginning if possible, where a misprediction might be almost free.
Quote from: nidudHowever, most new machines (and almost all Intel CPU's) works best on aligned loops.
I'm sure you're right, but I saw approximately no timing difference with "align x". At least in one important case, it turned out the loop was (accidentally) already aligned; but even when not, no real difference - dunno why.
Quote from: nidudIf you think about it, the code above (and below) the loop label is peanuts compared to what happens in the main loop, so that's the main priority. The init and exit code will only hit you when multiple calls are made with small lengths.
Your first sentence is true when called with large blocks. But your second sentence explains why the code above (or below) the loop label *can be* important: your user may indeed be using it for lots of small lengths. That's why, if one can save a cycle or two (for instance, by the placement of the "toend" code), it should be saved in favor of small lengths rather than long lengths.
Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P
c u later,
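(If anyone wants to reproduce that blow-up, here's a hypothetical test setup - memcopy_under_test stands for whichever routine you're poking at. Commit one page, leave the next one reserved but inaccessible, and put the source flush against the boundary:)
invoke VirtualAlloc, 0, 8192, MEM_RESERVE, PAGE_NOACCESS
mov esi, eax ; two reserved pages
invoke VirtualAlloc, esi, 4096, MEM_COMMIT, PAGE_READWRITE ; commit only the first
invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
mov edi, eax ; destination, anywhere writable
lea esi, [esi+4096-15] ; source: last 15 bytes of the committed page
invoke memcopy_under_test, edi, esi, 15 ; any 16-byte read at [esi] faults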
I have also learnt a few lessons on code alignment. Regularly, while bashing the guts out of an algo to get its pace up, I have tried aligning labels that get hit hard and seen the timing take a lot longer; much the same with procedure alignment - align at 4 bytes and it works OK, change it to 8 or 16 bytes and it often slows right down. Then it's worth shifting the location of the algo around in the exe, to make sure you are not deluding yourself, and seeing what the results are.
I have tried spacing algos from each other with padding from under 1k to over 1 megabyte, but that was more or less useless, so I usually settle on a delay between each algo of about 500 ms and set the priority class to high, but not the one higher. As hardware gets smarter, timings wander even further. :biggrin:
Quote from: hutchAs hardware gets smarter, timings wander even further.
Very true, meanwhile I'm getting slower ... but, I'm reminded of something hutch once said,
Quote from: hutchDon't give up on timing, it's good for you, and as an aside you tend to pop faster algos if you do it regularly. You get faster at it over time, and with more practice at tuning the timing to the task, the algo will perform a lot better.
Also true! One good thing: it gives the lie to the C programmers' favorite saying, "the compiler does it better than you". Unless the compiler writers are down in the trenches timing away - as opposed to just believing Intel/AMD/Agner, as I think they do - they can't match a real MASMologist
deleted
deleted
Looks extremely familiar (http://www.masmforum.com/board/index.php?topic=18728.msg159628#msg159628) 8)
deleted
You may safely read/write buffers in larger than byte-by-byte fashion.
There is no difference between the "MEM" and "STR" functions here - just imagine your "MEM" function received the wrong bytecount to scan/copy/etc. If the buffer is actually smaller than the bytecount param specifies, and the buffer lies near the end of the memory page, the "MEM" code will crash just as the "STR" code will.
When you scan memory you may do so safely with larger-than-byte-sized reads/writes; you just need to pay attention to the details of how memory is allocated under x86 page-addressed protected mode (Windows, Linux, Mac and any other protected mode operating system on x86).
Memory has a granularity of one page - 4096 bytes at least - so you may always assume that if you were given some address inside a memory page, the full page is available in memory for read/write (depending on protection).
I.e., if you were given the address 23987234h - just for instance - you may take the base address of the page within which this address is located - just clear the 12 lowest bits of the address (AND EAX,-4096) - and you have the page address, for this example 23987000h. So now you know the start of the page, and you know that the size of a page is 4096 bytes, so you know how far the buffer extends AT LEAST: 23987000h + 1000h = 23988000h. Within this range (from the given start of the buffer, 23987234h, to 23988000h) you may freely read the data in any sized fashion - just take care that your reads do not go beyond that. I.e., you may freely start a single-reg XMM read (16 bytes) at 23987ff0h, but you obviously may not start the same read at 23987ff1h - because the final address of the data grabbed, 23987ff1h + 10h = 23988001h, is beyond the page range, and you may not be sure that the next page is reserved/committed in memory.
Just pay attention to the addresses at which you read/write and the size of each data grab, and you have all the info needed to avoid reading beyond the buffer into inaccessible memory.
It is simplest to align the pointers to a power of two - then you may freely read and assume that the code will not crash - and it will not crash, provided you check every read for a match to the algo's purpose.
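A minimal sketch of that page-end check before a 16-byte read (register use illustrative):
mov edx, eax ; eax = current read pointer
and edx, 0FFFh ; offset within the 4096-byte page
cmp edx, 0FF0h
ja careful ; a 16-byte read here would cross the page end
movdqu xmm0, [eax] ; safe: all 16 bytes lie inside this page
jmp done_check
careful:
; drop to smaller reads (qword/dword/word/byte) up to the page end
done_check: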
As an example, imagine a byte-by-byte StrLen algo and an SSE algo which uses a single-reg XMM read. If you pass a properly formatted zero-terminated string to them, both algos will work correctly, provided they check the data they have read for the finish condition (the zero byte) before reading the next bytes. If both algos get a wrongly formatted string - no zero terminator until the very end of the page, and no accessible area after that - both will crash. No difference between the algos.
What is wrong with that example of SSE code which reads with two XMM regs: the code does TWO reads and only ONE check, after both reads. So the code may read the proper string end (the zero byte) in the first of the two reads, but it will nevertheless try to read the next 16 bytes - even though the properly zero-terminated string already ended in the previous 16 bytes (the first XMM read). So, reading a PROPER zero-terminated string placed at the closest possible location to the end of a buffer with no accessible memory beyond it, this code will crash; that is why it is IMPROPER.
We had a thread about a StrLen XMM algo on the old forum; the main posters were I, Jochen and Lingo AFAIK, and Lingo provided an algo of exactly this kind - it did two XMM reads, combined the byte-mask results into one 32-bit GPR reg, and only after that verified the result. Of course that code was buggy, and it is funny and a shame (taking into account his megalomania) that Lingo doesn't know such simple and basic things as buffer/memory layouts. Loud cries that "the algo is faster instead" are no excuse - an algo must be reliable first, and only then MAY one think about making it faster. So Jochen decided to use a one-read XMM algo in his lib, and that was the simplest and most reliable solution. It is probably possible to write a complex solution which reads more than one reg - but it would have to check every reg before it has the right to read the next reg from memory, so that complex code would probably not only be no faster than one-reg-reading code, it would be much slower. So Lingo's code was just trash from the point of view of software production - a real fancy thing but not a real working solution. Well, after so long a time I have described my point of view on the "Lingo coding style", and the real "value" of his over-valued "style", in proper English now :greensml:
Intel's "read ahead" is not actually a real read-ahead - just show us the full code, and for sure they verify the finish condition (the zero byte) before reading the next data which may lie beyond the page range. A real read-ahead only makes sense when the code reads non-dword-aligned strings; but considering that the code probably does the alignment itself (not obvious from the small piece of code provided), and/or that every "advanced" compiler aligns strings on a dword boundary, this is not dangerous "read-aheading" at all - it has nothing in common with an algo which does two (or more) reads before checking the termination condition. If the algo, whatever technique it uses, still takes care to stay within the page range, it will not crash. All the other differences are details of implementation, not the idea itself.
deleted
deleted
Just re-read what was written: if you have the address of the buffer, you have everything required to check whether the buffer is near the end of the page. What you do after that is your decision - of course you may try to read a whole heap of data even if you see that the buffer you were given starts at the last byte of the page, or you may process this case with the appropriate-sized reads - QWORD/DWORD/WORD/BYTE - depending on how close to the end of the page your buffer is.
In the case of the StrLen code which Jochen used in his lib, I just aligned the input buffer address to 16 (take notice that I mentioned this in my first post here - align the reading pointer to a power of two). I aligned it "backwards" (just AND with -16, so the read always lies at least 16 bytes from the end of the page) and thus did the first read not ahead but "backward", after which I just shifted the "backward" tail out of the masking reg. So, that SSE code MAY NOT FAIL with any zero-terminated string placed anywhere in memory.
Quote from: Antariy on March 19, 2015, 09:45:35 AMIn the case of the StrLen code which Jochen used in his lib, I just aligned the input buffer address to 16 (take notice that I mentioned this in my first post here - align the reading pointer to a power of two). I aligned it "backwards" (just AND with -16, so the read always lies at least 16 bytes from the end of the page) and thus did the first read not ahead but "backward", after which I just shifted the "backward" tail out of the masking reg. So, that SSE code MAY NOT FAIL with any zero-terminated string placed anywhere in memory.
This is correct, but to be fair, the example of the 4096 byte buffer posted below works only with SEH:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://www.masm32.com/board/index.php?topic=12460)
Init tc ; install the SEH handler
FileWrite "test.txt", String$(4096, "x") ; prepare a nasty example
xchg rv(filesize, Chr$("test.txt")), ebx
xchg rv(VirtualAlloc, 0, ebx, MEM_COMMIT, PAGE_READWRITE), edi
push rv(_lopen, Chr$("test.txt"), OF_READ)
invoke _lread, eax, edi, ebx ; exactly 4096 bytes in a VirtualAlloc'ed buffer
call _lclose
Print Str$("%i bytes read from file\n", ebx) ; this file is exactly one page long
Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi)) ; Len() WORKS OK
PrintLine "Now trying crt_strlen:" ; once you see this, Ctrl C is your only change...
invoke crt_strlen, edi ; ... because the CRT fails miserably, it just hangs
Inkey Str$("crt_strlen says the len of string is %i bytes\n", eax) ; you will not see this message
Exit
TryCatchEnd (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1217) ; activate the handler - must be last instruction before "end start"
end start
Yes, it does return 4096 bytes, while the CRT just hangs, but... we must go at least one byte beyond the limits; otherwise, how would the algo know that byte 4096+1 is a zero byte...?
Still, Len() remains usable, while the CRT just hangs. You can try inserting an exception handler also for the CRT:
Try
invoke crt_strlen, edi ; great, it doesn't hang any more...!
Catch
That works in the sense that your app is no longer blocked, but look at the result :P
Interestingly enough, Masm32 len() fails silently:
Try
Print Str$("Masm32 len() says the length of the string is %i bytes\n", len(edi))
Catch
And of course, all these examples work correctly if you replace the 4096 with 4095.
deleted
Once again, this algo will not fail with ANY zero-terminated string located anywhere in memory:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
;mov edx,[esp+4]
;mov eax,[esp+4]
db 8bh,94h,24h,04,0,0,0
db 8bh,84h,24h,04,0,0,0
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
;and edx,-10h
db 81h,0e2h,0F0h,0FFh,0FFh,0FFh
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
jz @F
pcmpeqb xmm7,[edx]
add edx,10h
pmovmskb eax,xmm7
shr eax,cl
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16
pcmpeqb xmm7,[edx]
add edx,16
pmovmskb eax,xmm7
test eax,eax
jz @B
bsf eax,eax
sub edx,[esp+4+16+4]
lea eax,[eax+edx-10h]
@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4
AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Jochen,
your "example" is WRONG. You're passing NOT PROPER string to the algo - and you should not wait for proper results. Your string is exactly unterminated - it's an infinite size from the point of view of the algorithm logic, and the results you are returning (4096) - ARE WRONG, take that or not - that is for your decision. The algo should fail there (crash) or it should return the flag which OBVIOUSLY marks that the string is incorrectly formatted - for an instance you should return -1 as mark of the string ininity. Otherwice using your StrLen you will never know that the string is not proper string - preperly terminated - with the real size of 4096 bytes - and your code will use that improper string further with full assuming that it's proper, but it is WRONG "ZERO TERMINATED" STRING. And any other components of the program yours or not yours (for example Windows), will crash or misuse that improper string, for an instance long after you did "checked" it, so after that you will long search why it crashes - because at the point where you were required to flag the improper string, you just silently returned "the fine" and "seems ok" result.
The code should never "silently substitute" a break in the program logic (and an improper zero-terminated string is exactly that) with a result which "seems OK" - but your StrLen does exactly this.
nidud,
your code still has the same problem - for some reason you check only the first 32 bytes of the string, but for the rest you assume it safe to do two reads and only one check - the same bug. Your code now is partially "safe" only with strings smaller than 32 bytes; any other string will crash your code if it is placed such that its zero terminator is closer to the page end than 32 bytes - i.e. if the string terminates 1...31 bytes from the page end.
My first post here, written in almost proper English and very descriptive, has all the info needed to implement a fully safe algo without any assumptions or silent substitutions/hidings of errors. Sometimes I agree with the topic poster's thought that the more descriptive and full the information one tries to provide, the less respect and careful attention one's descriptions/writings get.
deleted
Quote from: Antariy on March 19, 2015, 10:30:34 PM
Jochen,
your "example" is WRONG. You're passing NOT PROPER string to the algo - and you should not wait for proper results.
Alex,
My example is certainly problematic, and I agree that such an "improperly delimited" string should not be passed to a Windows API that expects a zero-delimited string; which, for example, is not the case for crt_lcpyn:
Let esi=Space$(4097)
invoke lstrcpyn, esi, edi, Len(edi)
Print "[", esi, "]"
OTOH, the string is perfectly valid in a MasmBasic macro such as Let or Print (which only want to know how many bytes must be copied):
Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi)) ; Len() WORKS OK
Let esi="The string: ["+edi+"]"
PrintLine esi
PrintLine "[", edi, "]"
Nonetheless, I do not encourage SEH (that is what Init tc does) to enable problematic examples like this.
Quote from: nidud on March 20, 2015, 12:11:59 AM
The code won't fail :P
To put your mind at ease, this tests the last 32 bytes
You may keep your sarcasm to yourself :P
Yes, it will not fail, because you aligned the data fetch to 32 bytes - but when I first replied to your code sample, you provided only the inner loop logic, without the other code which shows that you align to 32. The second time, I paid attention to the way you do the first check but did not notice that you align the buffer fetch to 32; that's why I said it would fail - and it would indeed fail if it fetched 32 bytes of data with the buffer fetch aligned to only 16 bytes. But, as already said, I just did not notice the proper alignment when I replied the second time, and the first time there was no full code from your side.
But, again, nothing is wrong with what I was saying: in my first post in this thread I described the proper way to implement the algo; for instance, I wrote this sentence there
Quote
It is simpler to align the pointers to a power of two - so you may freely read and assume that the code will not crash - and it will not crash if you check every read for a match to the algo's purpose.
You just did it right (aligned to a power of two equal to the size of the data fetch - the same thing I wrote: align and use the proper data size - QWORD/DWORD/WORD/BYTE etc.; you did that with a "DXMMWORD" (double XMM word) - so nothing is wrong with my explanation).
The source you provided is not recompilable - probably the set of your custom macros is missing - so I did not run the test itself, but it will surely work.
Just one thing: which code did you use in the test? Or do you load the code from the .bin files? I just haven't found, for instance, the code you posted earlier (the one that aligns to 32) in the disassembly listing ::)
Jochen,
Quote from: jj2007 on March 20, 2015, 01:47:04 AM
My example is certainly problematic, and I agree that such an "improperly delimited" string should not be passed to a Windows API that expects a zero-delimited string; which, for example, is not the case for crt_lcpyn:
Let esi=Space$(4097)
invoke lstrcpyn, esi, edi, Len(edi)
Print "[", esi, "]"
Well, lstrcpyn is more like the memcpy function with one additional stop factor - the zero byte - because it takes a length specifier. So it's not really a valid example. Moreover, for some reason the lstrXXXn functions break the zero-terminated logic itself: they cut the string at an unexpected place, so in some cases you receive a partial string, which does not look like a proper result, as you of course understand. These functions act more like those "hiders" of errors - the code actually copied only part of the string, but the program continues to use it thinking it is the proper, full string with all the information it should hold. A truncated string does not contain the full information, so it should be treated as trash, not data. If the string was just some message, it may be "acceptable" to show a truncated message on the display; but if the original string which was copied into the too-small buffer and got truncated held important data (a file name, a file-system path, the name of some object, an encoded form of other data such as a short base64 message, etc.) - then the case is obviously unacceptable, because you get trash, not data.
In your example you might substitute lstrcpyn with memcopy plus zero termination, for instance, since the work of handling the exception and determining the "length" of the string was done by MasmBasic anyway.
But that does not change things, actually: your code got an improper string, and it does not know why the string was improper - it was not terminated, and the scan continued until the end of the page. The code should flag this instead of carrying on as if the string were properly terminated. A rule of programming is that a program which does not work is better than one which works improperly. So it is better for the code to detect that it got an improper zero-terminated string and flag it in some way (to the user, or by taking another path/returning, etc.). Of course, this applies to code which expects ZERO-TERMINATED strings - if an algo that expects zero-terminated strings receives an unterminated one, it should crash or flag it. Algos which do not require zero-terminated strings, but instead rely on "buffer address" and "known size" params, work the same whether or not a zero terminator is present - but only because by design they work on binary strings, not zero-terminated ones.
Your MB Len algo is proper for MB strings, but when you are passed a string which is claimed to be zero-terminated, you should not silently accept one that is not actually terminated - that was the point I tried to make.
Anyway, the problem with accepting improper zero-terminated strings in algos that do not flag and reject them is this: if you continue, you may keep receiving more and more such strings, and SEH is very greedy in terms of timing, so every time you receive such a string you pay thousands of cycles of stall just to continue after the SEH catch.
And, of course, if you pass that unterminated string to any other code, it may crash there. Another option is to copy the improper string into a proper MB string and pass that to the algos - but, again, every such copy costs thousands of cycles.
Quote from: jj2007 on March 20, 2015, 01:47:04 AM
OTOH, the string is perfectly valid in a MasmBasic macro such as Let or Print (which only want to know how many bytes must be copied):
Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi)) ; Len() WORKS OK
Let esi="The string: ["+edi+"]"
PrintLine esi
PrintLine "[", edi, "]"
Nonetheless, I do not encourage SEH (that is what Init tc does) to enable problematic examples like this.
BTW, what does the acronym OTOH mean?
Yes - I did not say that MasmBasic strings are wrong if they are not terminated with zero; I was talking about ZERO-TERMINATED strings. If an MB algo receives a supposedly zero-terminated string which is not actually terminated, then it should obviously report that to you in some way... for instance, you may return the "seems OK" string length in EAX, with -1 in EDX if the string was not terminated and zero in EDX if it was a proper zero-terminated string.
Does the installed SEH cover all the algos in your library? I.e. do Let and Len avoid crashing on that improper zero-terminated string only if you install the SEH in Init? If you don't, will both Len and Let crash on that string? How did you implement the SEH coverage in your lib - as an UnhandledExceptionFilter, or as per-thread SEH frames (just like in my signature)? UnhandledExceptionFilter gives no way to handle exceptions in multi-threaded applications, because the "safe address" used to restore execution might be overwritten by another thread entering some function that sets that "safe address" up with its own restoration address.
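For reference, a minimal sketch of the per-thread SEH frame technique mentioned above (the MyHandler label is hypothetical; this is the usual fs:[0] chaining, not code from MasmBasic):
assume fs:nothing
push offset MyHandler ; address of the exception handler
push dword ptr fs:[0] ; link to the previous frame
mov dword ptr fs:[0],esp ; install the new frame for this thread
; ... guarded code: the handler can resume at a known-safe address ...
pop dword ptr fs:[0] ; uninstall: restore the previous frame
add esp,4 ; drop the handler address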
Quote from: nidud on March 20, 2015, 07:17:22 AM
Quote from: Antariy on March 20, 2015, 06:01:37 AM
You may keep your sarcasm to yourself :P
:biggrin:
:t
Quote from: nidud on March 20, 2015, 07:17:22 AM
Quote
Yes, it will not fail, because you align the data fetch to 32 bytes - but when I first replied to your code sample, you had provided only the inner-loop logic, without the surrounding code that shows you align to 32.
That code did actually fail. All these versions are included in the strlen.zip file (#50). I converted a string library some time ago, and the read-ahead code is used in many of these functions.
Ah, so at least one of the pieces discussed did fail. You mean this algo, probably?
.686
.model flat, stdcall
.code
.xmm
OPTION PROLOGUE:NONE, EPILOGUE:NONE
strlen proc string
mov ecx,[esp+4] ; 4
xorps xmm2,xmm2 ; 3
push edx ; 1
@@:
movdqu xmm1,[ecx+16]
movdqu xmm0,[ecx]
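; note: 32 bytes are fetched with unaligned reads before any check -
; this unchecked read-ahead is what crashes near the end of a page,
; as discussed in the thread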
add ecx,32
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb eax,xmm1
shl eax,16
or eax,edx
jz @B
pop edx
bsf eax,eax
lea eax,[ecx+eax-32]
sub eax,[esp+4]
ret 4
strlen endp
END
It's in 4.asm, and it is the only algo that reads unaligned, so it crashed near the end of the page even though it read 16-byte chunks - and on top of that it does two reads, 32 bytes. So the first time, you posted the two-reading piece of this code?
Quote from: nidud on March 20, 2015, 07:17:22 AM
Quote
The source you provided is not recompilable - probably the set of your custom macros is missing - so I did not run the test itself, but it will surely work. Just one thing: which code did you use in the test? Or do you load the code from the .bin files?
I uploaded the test-bed (benchmarks.zip (http://masm32.com/board/index.php?action=dlattach;topic=4067.0;attach=3869)) used here (#34) (http://masm32.com/board/index.php?topic=4067.msg43117#msg43117).
Ah, I got it earlier but did not realize it was the full set of files. Is dz.exe your file manager DosZip?
But, the other question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.
Quote from: Antariy on March 20, 2015, 06:45:08 AM
BTW, what does the acronym OTOH mean?
on the other hand
Quote
Does the installed SEH cover all the algos in your library? I.e. do Let and Len avoid crashing on that improper zero-terminated string only if you install the SEH in Init?
If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.
Hello Antariy, nice to meet you,
Quote from: Antariy on March 19, 2015, 06:21:32 AM
There is no difference between the "MEM" and "STR" functions - just imagine your "MEM" function received a wrong bytecount to scan/copy/etc.: if the buffer is actually smaller than the bytecount param says, and the buffer lies near the end of a memory page, the "MEM" code will crash just like the "STR" code.
- If the user makes an incorrect request (wrong bytecount), then perhaps the algo should just crash. If he's careless enough to do that, he probably won't check the error return code, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps let it crash, so he knows of the error. That's how my memcopy behaves.
But that's a matter of opinion. The important point is: if the user gives a correct bytecount (or a correctly terminated C-style zero-STR), then you must not crash by reading/writing outside the requested buffers. Even though it reads 128 bytes at a time, my memcopy never goes beyond the user-specified bounds.
Quote from: Antariy on March 19, 2015, 06:21:32 AM
It is simpler to align the pointers to a power of two...
- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called a "chunk".)
Quote from: Antariy on March 19, 2015, 06:21:32 AM
The loud cries of "but the algo is faster" are no excuse - the algo should be reliable first; only then MAY one think about making it faster.
- Of course! Surely Lingo agrees?
Quote from: Antariy on March 19, 2015, 10:30:34 PM
Sometimes I agree with the topic paster on the thought that the more descriptive and complete the information one tries to provide, the less respect and careful attention one's descriptions/writings receive.
- Very annoying: go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. I couldn't find anywhere on the net to copy it from, so I had to write it myself :P
Quote from: Antariy on March 19, 2015, 10:30:34 PM
Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory: ...
- Right, but it can be 18 bytes shorter, and a little cleaner:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4] ; don't get eax from [esp+4]
mov eax, edx ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0 ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0 ; these two db instructions
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
and edx,-10h ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh ; with this db instruction
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F ; no jump save 2 bytes
pcmpeqb xmm7,[edx]
; add edx,10h ; don't add here save 3 bytes
pmovmskb eax,xmm7
shr eax,cl ; works fine if cl is 0
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16 ; still aligned but 16 bytes less
add edx,16 ; switch order of these
pcmpeqb xmm7,[edx] ; two instructions
pmovmskb eax,xmm7
test eax,eax
jz @B
bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
add eax, edx ; don't need -10h any more, save 2 bytes
@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4
AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Here it is without all the extra comments:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
mov eax, edx
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
and edx,-10h
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
pcmpeqb xmm7,[edx]
pmovmskb eax,xmm7
shr eax,cl
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; naturally aligned to 16
add edx,16
pcmpeqb xmm7,[edx]
pmovmskb eax,xmm7
test eax,eax
jz @B
bsf eax,eax
sub edx,[esp+4+16+4]
add eax, edx
@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4
AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
- BTW your English is very understandable, much better than I write your language! c u later
Quote from: jj2007 on March 20, 2015, 02:14:53 PM
Quote from: Antariy on March 20, 2015, 06:45:08 AM
BTW, what does the acronym OTOH mean?
on the other hand
Quote
Does the installed SEH cover all the algos in your library? I.e. do Let and Len avoid crashing on that improper zero-terminated string only if you install the SEH in Init?
If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.
How did you implement the exception handling - as per-thread SEH or SetUnhandledExceptionFilter? IIRC it was SUEF?
Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Hello Antariy, nice to meet you,
Hello rrr, nice to meet you, too.
Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Hello Antariy, nice to meet you,
Quote from: Antariy on March 19, 2015, 06:21:32 AM
There is no difference between the "MEM" and "STR" functions - just imagine your "MEM" function received a wrong bytecount to scan/copy/etc.: if the buffer is actually smaller than the bytecount param says, and the buffer lies near the end of a memory page, the "MEM" code will crash just like the "STR" code.
- If the user makes an incorrect request (wrong bytecount), then perhaps the algo should just crash. If he's careless enough to do that, he probably won't check the error return code, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps let it crash, so he knows of the error. That's how my memcopy behaves.
But that's a matter of opinion. The important point is: if the user gives a correct bytecount (or a correctly terminated C-style zero-STR), then you must not crash by reading/writing outside the requested buffers. Even though it reads 128 bytes at a time, my memcopy never goes beyond the user-specified bounds.
The algo cannot know whether the user is careless - it is "just code" ("just a machine") and has no right to make assumptions about those who use it. Imagine the same with, for instance, some protection scheme: the user sits at the computer, turns it on, and the computer says "I dislike your face today; probably you're drunk, or you're not the owner of this machine, maybe you're a criminal - let's format the hard drive to protect the real owner from data theft".
The fact that the "ideology" of "protected programming" has been perverted by many contemporary programmers - careless people who use "protected programming" techniques carelessly - does not make the ideology itself wrong. Code is better off behaving in a predictable, fully documented and robust way - the rest is the user's decision. The routines are just tools, aren't they? One wants a tool that is as useful, comfortable and robust as possible, so if such a tool exists - good. But HOW the tool is used, and what product it produces, is up to the "master" who wields it.
So probably nothing is wrong with "protected programming" or "SEH" etc. - the problem is people's psychology rather than the "ideology": careless and sloppy coders produce the problems, not the code itself.
The code cannot know whether the user passed a wrong bytecount, or whether the buffer for some reason is not the size the user thinks while the bytecount itself was right (this can really happen - for instance with memory-mapped files/sections, when the data was not loaded into the memory page for some reason: the disk unexpectedly became full, or simply a hardware failure). If the user was careless enough not to notice that the system did not provide the full data and passed the pointer on, the code gives him one more chance to learn about the error - especially if the error-flagging return differs very noticeably from the regular return. System bugs themselves cannot be discounted either - even if the system produced a wrong result, the user can still notice it if the code reports it.
After all, yes, "protected programming" implies that one relies on the system as a bug-free "standard", and the coding style is the programmer's decision. As usual there are two big camps: those who code absolutely sloppily, and those who code paranoidally "carefully" - both camps usually produce the buggiest products on the planet, and usually they are the biggest and most respected companies on the planet :greensml:
Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Quote from: Antariy on March 19, 2015, 06:21:32 AM
It is simpler to align the pointers to a power of two...
- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called a "chunk".)
Yes, this correction is right and expresses what I wanted to say more clearly.
Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Quote from: Antariy on March 19, 2015, 06:21:32 AM
The loud cries of "but the algo is faster" are no excuse - the algo should be reliable first; only then MAY one think about making it faster.
- Of course! Surely Lingo agrees?
I do not know which things he agrees with, and moreover I do not want to know :greensml: But from his "coding style" many of our members have probably formed the opinion that he is a megalomaniacal, self-appointed "God of assembly" and a "Code Nazi" who invented everything on the planet first, with everyone else having stolen it from him. He is also known for producing buggy but "fast working" code, and when he gets notes on that he just resorts to insulting others - so, I think, it's better to ask his psychiatrist what he agrees with :greensml:
Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Quote from: Antariy on March 19, 2015, 10:30:34 PM
Sometimes I agree with the topic paster on the thought that the more descriptive and complete the information one tries to provide, the less respect and careful attention one's descriptions/writings receive.
- Very annoying: go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. I couldn't find anywhere on the net to copy it from, so I had to write it myself :P
Well, jokes apart, I think you understand that I did not mean "copy"-"paster". You "pasted" the topic onto the forum - there was no such topic, and then it appeared - it was "pasted" there, so the one who did it was the topic-"paster".
Isn't the word "paste" independent of "copy", and of the term "copy-paste"? If so, then it was said quite correctly :P
Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Quote from: Antariy on March 19, 2015, 10:30:34 PM
Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory: ...
- Right, but it can be 18 bytes shorter, and a little cleaner:
How it was written has its reasons. For instance, the XMM and ECX preservation was a "condition of production" - Jochen's condition, since his lib preserves those regs in its internal functions; it's MasmBasic's ABI.
Other things, like the "wide" instructions, were used to pad the inner loop to align it by 16 - but that, too, was for a reason: to make the code fit as written, with all its specifics. The code can be made shorter, yes, but as written it seemed (to me, of course, so call this a "subjective judgement") the most performant across the widest variety of hardware. Some of our members can confirm that I sometimes like to "hunt for bytes" like a maniac :greensml:, but this was not such a case (pun intended).
I will comment on the changed code in the lines below the changes - without indent - explaining why it was originally written the way it was.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4] ; don't get eax from [esp+4]
mov eax, edx ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0 ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0 ; these two db instructions
The original code will work faster because it references the same memory location which, once loaded into the cache on first access (if it was not there yet), is a "fast to do" thing - unlike the dependent, and thus stalling, instruction pair where the value is loaded into one reg and then copied from it into a second reg.
If you want to save those two bytes and still leave these instructions as they were coded, just replace the pxor instructions with xorps.
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
Actually, if I had to decrease the size of the algo, I would probably remove the XMM and ECX preservation, because it is not the API's ABI - only MasmBasic's ABI.
and edx,-10h ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh ; with this db instruction
This was coded so as to put the loop on a 16-byte alignment; why the code was written the way it was can be understood only by looking at it as a whole, not as separate pieces - read below why it was considered inappropriate to write the code the way it was changed in this example.
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F ; no jump save 2 bytes
The CPU "predicts" the "forward" jumps as "probably untaken", so, the jump will not be taken with the
possibility 16:1 (unaligned string) with byte-aligned strings, or 8:1 with even aligned strings, or 4:1 with
dword aligned strings, so, the jump and the some slowdown is more likely to not happen. And even if it happens
the slowdown itself is not big (the high numbers of jumps are exaggerated really, and as the jump is near and in the same code
location actually, it maybe assumed that the jump will not lead to any "misprediction" stalls since the code
is already fetched and decoded, even that code where the jump jumps (the pun again) to). So, it will cost only few cycles,
but taking in account that it over-jumps some instructions, which will take some time, this jump in the real world
(not millions of iterations on the same data as it is doing with our testbeds) will not only be a slowdown but will
rather speed the code up.
pcmpeqb xmm7,[edx]
; add edx,10h ; don't add here save 3 bytes
Why the pointer advance is done after the data fetch is described below.
pmovmskb eax,xmm7
shr eax,cl ; works fine if cl is 0
The jump was there not to "protect" against a zero ECX, but to avoid executing unneeded, slowing-down code for no reason.
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16 ; still aligned but 16 bytes less
add edx,16 ; switch order of these
pcmpeqb xmm7,[edx] ; two instructions
Advancing the pointer BEFORE the data fetch slows the code down, especially on CPUs older than the modern ones. mov eax,[edx] / add edx,16 will generally always be faster than add edx,16 / mov eax,[edx], because in the first case the two instructions pair (the memory access goes into one execution block and the arithmetic operation into the other), while in the second case they do not.
pmovmskb eax,xmm7
test eax,eax
jz @B
bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
add eax, edx ; don't need -10h any more, save 2 bytes
This was needed because the pointer advance was deliberately placed after the data fetch.
@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4
AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
- BTW your English is very understandable, much better than I write your language! c u later
Well, it was not so a few years ago, but some forum members have assured me it is quite understandable, even if it is rather "Cyrillic English", as one of the forum friends called it.
Do you write Russian?
Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote from: Antariy on March 20, 2015, 07:31:40 AM
Ah, so at least one of the pieces discussed did fail. You mean this algo, probably?
Yes, there are many like this around (http://masm32.com/board/index.php?topic=3396.msg36254#msg36254), and here are some more (http://www.alfredklomp.com/programming/sse-strings/) :lol:
Did you point to that site for a particular reason (was there something useful on that page? I did not read it thoroughly), or just as an example of the fact that there are many implementations which do not care about the possibility of a crash (they do not align the fetch, and even grab the data unaligned all the time)?
Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote
It's in 4.asm, and it is the only algo that reads unaligned, so it crashed near the end of the page even though it read 16-byte chunks - and on top of that it does two reads, 32 bytes.
The same also applies to 4 and 8.
Yes, but I was talking about the SSE one and just did not mention the others.
Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote
So the first time, you posted the two-reading piece of this code?
I think I posted it somewhere, but I'm not sure.
Just on the previous page there was a piece containing only the inner loop of the two-reads-one-check algo. My first post here came soon after that - where I first said the problem with the algo is the two reads with one check. Since you provided only the inner loop and not the full algo the first time, I assumed you did the reads unaligned, or aligned to only 16.
Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote
Ah, I got it earlier but did not realize it was the full set of files. Is dz.exe your file manager DosZip?
Yes, plus an assembler and make to simplify the testing. If it's on the same drive as masm32, it should work out of the box.
OK, good. Is the assembler JWasm?
Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote
But, the other question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.
Yes, for reasons explained here (http://masm32.com/board/index.php?topic=4067.msg43020#msg43020).
Ah, the well-known code-location problem. But you might implement the solution more simply - by defining a different segment for every tested algo. That also allows running algos which are not "code-location independent", with no need to relocate them manually in some way. Algos with jump/call tables are such "relocation required" algos, for instance, as are algos that do non-near relative/direct calls/jumps.
hello, Alex :biggrin:
that's a lot of English - lol
Hi Alex,
Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex :biggrin:
that's a lot of English - lol
but that's good news: Alex is back after a long break, and he's back with high-quality posts. :t I missed you.
Gunther
Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex :biggrin:
that's a lot of English - lol
Hello Dave :t
Is that more or less proper English? Sometimes it isn't, yeah :greensml:
Quote from: Gunther on March 22, 2015, 12:21:42 AM
Hi Alex,
Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex :biggrin:
that's a lot of English - lol
but that's good news: Alex is back after a long break, and he's back with high-quality posts. :t I missed you.
Gunther
Thank you, Gunther! :biggrin: :redface:
Quote from: nidud on March 22, 2015, 02:03:27 AM
Quote from: Antariy on March 21, 2015, 11:08:43 PM
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh ; with this db instruction
and edx,dword ptr -10 ; same same ?...
well, here is a similar version using 32 bytes
I absolutely don't get what that is in the quote of "mine"??? What are the words "same same" supposed to mean?
Also, what does "similar version" mean - similar to what? To AxStrLenSSE? Well, if so, it's actually not very similar. The most important difference is that your version advances the pointer BEFORE the data fetch. That is the very thing I described in the post commenting on rrr's "edition" of the algo - and I did not accept that edit as a replacement for the current AxStrLenSSE implementation; I thought I had said clearly that I consider the original version the best possible for that kind of code (simple, using a single XMM reg).
Also, writing to [esp-4] is more or less "legal" in user mode, though even that is disputable (there are cases where placing data below ESP makes code work incorrectly), but in kernel-mode code (in drivers) this string routine could not be used at all, especially on systems with DEP enabled.
Also, XMM is not preserved... so it is actually a strange "reproduction" of the algo with its logic broken/not fully repeated: XMM isn't saved at all; instead of ECX it saves EDX, and it saves it the wrong way - below ESP. And the pointer advances before the data fetch - which slows the algo down.
Quote from: nidud on March 20, 2015, 08:22:02 AM
My point was that you seem to think in size of registers and not in chunk size :P
That is a strange point. Just re-read, again, the first post here - maybe it is not crystal clear, but it described everything that has since been discussed here on the topic of StrCmp. Probably the thread should be split in two, because MemCopy has drifted into StrCmp ::)
It talked about the granularity of memory in pages, about the possibility of safe algo operation, and about power-of-two "data grabbing sizes" (later corrected to the more suitable word "chunk"). That post was written as an answer to your post where you said that only byte-by-byte reading algos can be safe, and that reading data "ahead" is a "common technique" - "even in Intel". Since I did not agree with all of that, I wrote that post. Algos can be safe while reading more than a byte; that "read ahead" is not really read-ahead in the proper sense; and the "common technique" widely used by hobby programmers all over the world is buggy, because real read-ahead is a way to crash code that accesses buffers.
Taking all that into account, it's strange that you draw such conclusions ::)
Quote from: nidud on March 22, 2015, 02:03:27 AM
Tried padding and a lot of other things, but this seems to be the best method so far, at least on the CPU I'm now using. The single-file edit also simplifies the testing by having both the (now small) list file and source open in the editor at the same time. The list file then updates when you compile the source from the editor. Flipping back and forth (using F6) makes it easy to align the code.
Padding is not what I meant. I said: just create a different segment for every code example. Like this:
SEG1 SEGMENT PARA PUBLIC 'CODE'
tested code 1 here
SEG1 ENDS
SEG2 SEGMENT PARA PUBLIC 'CODE'
tested code 2 here
SEG2 ENDS
... etc ...
When you write your source this way, every code piece is placed in its own section in the PE file, and it lands in page-granular "sections" in memory - every piece in its own page, starting at an address aligned to 4096. This solves the code-location problems as well - you might try it on your CPU. And it's more convenient because it does not limit the tested code to location-independent code only. I have actually suggested this way of fixing code-location problems in several places on the forum.
The only really reliable way to test algorithms that are subject to code location is to put each algo in a separate executable using identical test conditions. The problem seems to vary with the hardware, but if you are performing a simple test between 2 or more algos in the same test piece, you can move the algos around to see if their timing changes. Occasionally a simple "align ##" will stabilise the timings, but I have seen enough algorithms that do not respond to alignment.
There is another factor that I used to see with some of Lingo's test pieces: one algo leaves the following algo with a mess to clean up, and to get a reliable timing for the affected algo you had to comment out the first one.
Antariy,
Thanks very much for your comprehensive answer, but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo, so I will leave those topics alone! As for "topic paster", it was just a joke, no worries.
The important thing is the code. I understand your reasoning (of course), but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications and tested the shorter algo against yours as well as I could. Bottom line: the shorter algo is, for the most part, faster.
First, to avoid the register stall with the instructions mov edx, [esp+4] / mov eax, edx: I just moved the mov down 2 instructions.
I didn't know xorps could save a byte over pxor, thank you. I used those 2 bytes to put the jump back in; it was necessary for the AMD, which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.
Still preserving XMM and ECX.
Yes, I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop, and I know that's better than nops or lea eax, [eax], which waste time as well as space. But surely it's even better not to waste the bytes at all, as long as the routine is still as fast or faster.
CPU branch prediction - you know, the behavior seems to change with every new release from Intel or AMD. Often we optimize a routine on our own machine, but on newer/older machines it may behave differently. I routinely optimize on my Intel i5, then try it on an AMD A6 and a Pentium 4; often the fastest on one machine is the slowest on another. So I'm leery of artificial coding techniques for speed.
Now, I had thought you were right that the pointer advance should go AFTER the data fetch. However, on the Intel my loop was faster this way; on AMD, a little slower. Apparently the order of the two instructions makes little difference. Anyway, there's a very simple correction available if needed: just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - it uses one more byte.
By far the hardest part of the whole exercise is not writing the routine but getting semi-reliable timings! First, I used your suggestion and put all the algos in separate segments. Then I doubled both of them: yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.
Here are the resulting runs:
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)
BUFFER ALIGNED
thecount StrLen_orig(104) StrLen_rrr2(86) StrLen_rrr2(86) StrLen_orig(104)
8 906 854 852 905
31 1147 1020 1019 1074
271 4024 4142 4020 3924
2014 26115 24593 24595 25818
62159 757816 747523 748235 757502
BUFFER MISALIGNED src 11
thecount StrLen_orig(104) StrLen_rrr2(86) StrLen_rrr2(86) StrLen_orig(104)
8 1184 1157 1132 1241
31 1399 1391 1395 1425
271 4508 4368 4432 4522
2014 25622 25036 25018 25604
62159 757612 747909 746568 757986
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
BUFFER ALIGNED
thecount StrLen_orig(104) StrLen_rrr2(86) StrLen_rrr2(86) StrLen_orig(104)
8 2124 1551 1551 2319
31 2526 1944 1926 2494
271 6220 5679 5676 6416
2014 29950 30171 30869 30104
62159 872547 886154 887221 871530
BUFFER MISALIGNED src 11
thecount StrLen_orig(104) StrLen_rrr2(86) StrLen_rrr2(86) StrLen_orig(104)
8 2776 2320 2319 2622
31 2795 2420 2418 2797
271 6016 5461 5476 6055
2014 30809 31229 31080 30842
62159 887148 887519 888207 889350
Your routine was a bit better on Intel ALIGNED 271, and slightly better on the longer strings on AMD. Everywhere else the shorter routine is better. It's dramatically better on AMD short strings, who knows why, and better on Intel long strings. BTW, all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.
Here is my modified routine:
; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
mov eax,[esp+4]
add esp,-14h
movups [esp],xmm7
mov edx, eax
mov [esp+16],ecx
and edx,-10h
xorps xmm7,xmm7
mov ecx,eax
and ecx,0fh
jz intoloop
pcmpeqb xmm7,[edx]
pmovmskb eax,xmm7
shr eax,cl
bsf eax,eax
jnz @ret
xorps xmm7,xmm7
@@: ; naturally aligned to 16
add edx,16
intoloop:
pcmpeqb xmm7,[edx]
pmovmskb eax,xmm7
test eax,eax
jz @B
bsf eax,eax
sub edx,[esp+4+16+4]
add eax, edx
@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4
algo2 endp
end_algo2:
Bottom line, I can't believe it's right to waste those 18 bytes.
Finally, of course I can write Russian! I'll do it again: "Russian". - very easy. OTOH I can't write in Cyrillic to save my life :)
Zip has "testStrLen.asm" test program.
Hi nidud,
Quote from: nidud on March 22, 2015, 10:34:21 PM
Here is the Intel version of strlen from 2011:
interesting,
but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.
Gunther
Quote from: Gunther on March 23, 2015, 07:08:14 AM
interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.
Not hard,
but tedious! If someone wants to do it and post the results, that would be appreciated. There are lots of defines like L(label) which (AFAIK) must be made into function MACROs - or else all occurrences replaced by hand. There are the "." labels to fix. There are the multiline statements, with ;'s separating them, to deal with; and of course the order of about 400 instructions to reverse. The #'s, %'s and $'s are easy. Normally this routine would be about 20 or 30 lines (in fact that's the whole point: he's unrolled everything to the max) and the job would be trivial. OTOH, if there were a few thousand lines, it would be rather fun to make .bat files and MACROs to automate the conversion. But for 700 lines it's not really worth the effort. Tedious. Oh, and don't forget to fix that egregious spelling mistake, "calee".
Brings up a question,
why is it AT&T syntax? Obviously this routine is not for Windows. Note that he uses xmm0-xmm3 then skips to xmm6, in Windows you'd use 4 and 5 wouldn't you? Reason it might matter, what is the target (type of) processor? If it's for handhelds it may not be optimal for us, those processors might not have good branch prediction, for instance, or other differences. I don't know, u understand, but he has some rather surprising techniques. Like, avoids branches like the plague; avoids bsf, prefers pminub to pcmepqb, other little things. These might be great techniques for us, or, as I wonder, might be designed for different processors (say, Atom) or environments. It also seems optimized for very small strings.
And, it doesn't seem very professional. All these lines, and he still doesn't have a MAXLEN - if the string isn't terminated the routine will dive right off the deep end. And why is Intel making it public anyway? Is this really their best effort? I'll bet it's not what Windows is using.
Anyway, it would be nice if someone would translate it, I'll be happy to test it, wouldn't be amazed if it's not all that impressive speedwise - in our typical PC environment.
[edit] Of course I'm pretty ignorant about these issues, don't mean to pretend I know what I'm talking about. Just some thoughts/questions that occurred to me
Hi nidud,
could you post the entire code as an attachment, please? It's much easier for me to do the conversion. Thank you.
Gunther
I must confess I don't understand what your testbed is showing. Can you explain?
-- string length 0..40
862314 cycles - ( 0) proc_0: crt_strlen
513683 cycles - (105) proc_1: SSE 32
-- string length 40..80
940951 cycles - ( 0) proc_0: crt_strlen
331639 cycles - (105) proc_1: SSE 32
- what are the numbers in brackets?
- why does SSE 32 run faster with the longer string?
Hi nidud,
Quote from: nidud on March 24, 2015, 01:13:21 AM
I converted and tested both of them.
The Intel files are 8.asm and 9.asm in the archive
thank you, that saves me a bit of work. Here are the results:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- string length 0..40
444941 cycles - ( 0) proc_0: crt_strlen
231197 cycles - (105) proc_1: SSE 32
218271 cycles - (104) proc_2: AxStrLenSSE
210172 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
228546 cycles - (673) proc_8: SSE Intel Silvermont
374264 cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
493818 cycles - ( 0) proc_0: crt_strlen
155729 cycles - (105) proc_1: SSE 32
152824 cycles - (104) proc_2: AxStrLenSSE
148059 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
141848 cycles - (673) proc_8: SSE Intel Silvermont
263988 cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
341526 cycles - ( 0) proc_0: crt_strlen
69545 cycles - (105) proc_1: SSE 32
85672 cycles - (104) proc_2: AxStrLenSSE
87101 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
43498 cycles - (673) proc_8: SSE Intel Silvermont
57624 cycles - (818) proc_9: SSE Intel Atom
result:
413892 cycles - (673) proc_8: SSE Intel Silvermont
445332 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
456471 cycles - (105) proc_1: SSE 32
456767 cycles - (104) proc_2: AxStrLenSSE
695876 cycles - (818) proc_9: SSE Intel Atom
1280285 cycles - ( 0) proc_0: crt_strlen
hit any key to continue...
Gunther
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
----------------------------------------------
-- string length 0..40
1592748 cycles - ( 0) proc_0: crt_strlen
1342550 cycles - (105) proc_1: SSE 32
1361220 cycles - (104) proc_2: AxStrLenSSE
1319353 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1256349 cycles - (673) proc_8: SSE Intel Silvermont
1586403 cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1711483 cycles - ( 0) proc_0: crt_strlen
927442 cycles - (105) proc_1: SSE 32
962485 cycles - (104) proc_2: AxStrLenSSE
968221 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
895419 cycles - (673) proc_8: SSE Intel Silvermont
1476721 cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
659013 cycles - ( 0) proc_0: crt_strlen
286781 cycles - (105) proc_1: SSE 32
333387 cycles - (104) proc_2: AxStrLenSSE
339087 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
255315 cycles - (673) proc_8: SSE Intel Silvermont
320034 cycles - (818) proc_9: SSE Intel Atom
result:
2407083 cycles - (673) proc_8: SSE Intel Silvermont
2556773 cycles - (105) proc_1: SSE 32
2626661 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
2657092 cycles - (104) proc_2: AxStrLenSSE
3383158 cycles - (818) proc_9: SSE Intel Atom
3963244 cycles - ( 0) proc_0: crt_strlen
hit any key to continue...
I:\nidudStrlen2\strlen2 => timeit
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
----------------------------------------------
-- string length 0..40
1596428 cycles - ( 0) proc_0: crt_strlen
1353058 cycles - (105) proc_1: SSE 32
1362316 cycles - (104) proc_2: AxStrLenSSE
1351833 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1251872 cycles - (673) proc_8: SSE Intel Silvermont
1600414 cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1711735 cycles - ( 0) proc_0: crt_strlen
926670 cycles - (105) proc_1: SSE 32
952517 cycles - (104) proc_2: AxStrLenSSE
982818 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
906223 cycles - (673) proc_8: SSE Intel Silvermont
1443038 cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
681045 cycles - ( 0) proc_0: crt_strlen
279759 cycles - (105) proc_1: SSE 32
326521 cycles - (104) proc_2: AxStrLenSSE
339426 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
263106 cycles - (673) proc_8: SSE Intel Silvermont
319237 cycles - (818) proc_9: SSE Intel Atom
result:
2421201 cycles - (673) proc_8: SSE Intel Silvermont
2559487 cycles - (105) proc_1: SSE 32
2641354 cycles - (104) proc_2: AxStrLenSSE
2674077 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
3362689 cycles - (818) proc_9: SSE Intel Atom
3989208 cycles - ( 0) proc_0: crt_strlen
Quote from: nidud on March 24, 2015, 02:47:14 AM
Quote
why does SSE 32 run faster with the longer string?
It's a matter of priority for the end result: A 40 byte string is more important than a 1000 byte string in this case.
I think Jochen meant that your code is faster with the longer strings - it is faster on the 40...80 byte strings than on the 0...40 byte ones - so it is faster on the less important strings.
The timings of your latest zip (strlen2):
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
----------------------------------------------
-- string length 0..40
1631130 cycles - ( 0) proc_0: crt_strlen
1438090 cycles - (105) proc_1: SSE 32
2560055 cycles - (104) proc_2: AxStrLenSSE
2488369 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1332828 cycles - (673) proc_8: SSE Intel Silvermont
1697691 cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1756723 cycles - ( 0) proc_0: crt_strlen
999911 cycles - (105) proc_1: SSE 32
1511301 cycles - (104) proc_2: AxStrLenSSE
1492232 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
972191 cycles - (673) proc_8: SSE Intel Silvermont
1551460 cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
704575 cycles - ( 0) proc_0: crt_strlen
305959 cycles - (105) proc_1: SSE 32
416240 cycles - (104) proc_2: AxStrLenSSE
418694 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
280099 cycles - (673) proc_8: SSE Intel Silvermont
348494 cycles - (818) proc_9: SSE Intel Atom
result:
2585118 cycles - (673) proc_8: SSE Intel Silvermont
2743960 cycles - (105) proc_1: SSE 32
3597645 cycles - (818) proc_9: SSE Intel Atom
4092428 cycles - ( 0) proc_0: crt_strlen
4399295 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
4487596 cycles - (104) proc_2: AxStrLenSSE
hit any key to continue...
No answers to the other posts yet - I wore myself out writing the long post I posted in the Lab, the "Research" one.
BTW, timings are wanted there :biggrin:
Quote from: Gunther on March 24, 2015, 04:17:22 AM
413892 cycles - (673) proc_8: SSE Intel Silvermont
445332 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
456471 cycles - (105) proc_1: SSE 32
456767 cycles - (104) proc_2: AxStrLenSSE
695876 cycles - (818) proc_9: SSE Intel Atom
1280285 cycles - ( 0) proc_0: crt_strlen
Quote from: dedndave on March 24, 2015, 06:57:32 AM
result:
2407083 cycles - (673) proc_8: SSE Intel Silvermont
2556773 cycles - (105) proc_1: SSE 32
2626661 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
2657092 cycles - (104) proc_2: AxStrLenSSE
3383158 cycles - (818) proc_9: SSE Intel Atom
3963244 cycles - ( 0) proc_0: crt_strlen
hit any key to continue...
result:
2421201 cycles - (673) proc_8: SSE Intel Silvermont
2559487 cycles - (105) proc_1: SSE 32
2641354 cycles - (104) proc_2: AxStrLenSSE
2674077 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
3362689 cycles - (818) proc_9: SSE Intel Atom
3989208 cycles - ( 0) proc_0: crt_strlen
Not bad for simple, single-XMM-reg code (and rrr's short version), taking into account the crazy unrolling of Intel's code and the larger chunk size of nidud's code.
Well, I think that if the algo posted in the Lab's "Research" topic is developed further - handling short strings of 64...256 bytes with simpler code, like nidud's 32-byte-fetching one - then it will be two times smaller than Intel's Silvermont version and probably faster than it.
BTW, nidud, you might try code like this in your inner loop:
@mainloop:
pcmpeqb xmm0,[ebx]
add ebx,16
pcmpeqb xmm1,[ebx]
add ebx,16
pmovmskb eax,xmm0
pmovmskb edx,xmm1
cmp eax,edx
jz @mainloop
shl edx,16
add edx,eax
bsf ecx,edx
It avoids unnecessary shifts and combining in the loop.
Quote from: nidud on March 24, 2015, 08:29:21 AM
Quote from: Antariy on March 24, 2015, 08:07:25 AM
BTW, nidud, you might try code like this in your inner loop:
I think it will fail if they are equal but not zero.
Yes... a silly idea - I'm worn out after two days of coding the prototype for the test thread in the Lab, and of typing ::)
Then try it like this (I'm not sure it will be faster, just suggesting to try):
@mainloop:
pcmpeqb xmm0,[ebx]
add ebx,16
pcmpeqb xmm1,[ebx]
add ebx,16
pmovmskb eax,xmm0
pmovmskb edx,xmm1
add eax,edx
jz @mainloop
jc @mainloop
pmovmskb eax,xmm0
shl edx,16
add edx,eax
bsf ecx,edx
The changed/added instructions are the ones without indent.
Note that the possibility of a jz misprediction - both EAX and EDX being 80000000h - is incredibly small, so the jc instruction will almost always be executed only once, and only on a proper exit; it costs little, while the loop itself performs fewer operations.
Quote from: nidud on March 24, 2015, 09:14:22 AM
I tried a few combinations yesterday
I first used this one
lea eax,[eax+20h]
pcmpeqb xmm1,[eax]
pmovmskb edx,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1
but this seems to be faster for some reason
movdqa xmm1,[eax+30h]
movdqa xmm0,[eax+20h]
pcmpeqb xmm0,xmm2
pmovmskb edx,xmm0
lea eax,[eax+20h]
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1
Reversed order of the data fetch - the higher offset first, then the lower offset - I used that too in the big code in the "Research" thread, in the first part, which handles 64 bytes of misalignment as two 32-byte parts :biggrin:
I think the first piece you showed worked slower - and with the possibility of incorrect operation (i.e. it did not work as required at all, hence the slowdowns probably) - because you compare into the same XMM1 reg: if the first comparison changed some of its byte(s) to 0xFF, the second comparison is no longer against a zeroed reg but against a mess. Being changed randomly from time to time, the reg slows the code down, and moreover it will produce wrong results on the second comparison whenever the first comparison found a zero.
There's a problem with this whole exercise: same code runs very different on different processors.
Quote from: nidud
but this seems to be faster for some reason
I do the same thing: try different combinations and select the fastest
on my machine, often not knowing why it's faster. Problem is, what happens on another machine?
I put together a new "AxStrLenSSE" which beats the Silvermont on my i5. It was easy. First, notice that Antariy's original algo saves XMM and ECX because of the environment it was designed for; the Silvermont routine doesn't, so I took that out. It was still slower on the long strings, so I unrolled the loop 9 times, and it easily "won" on nidud's test bed. But then I tried it on my AMD, and it came in 3rd (SSE 32 also ahead)! (AMD is very bad at bsf, pcmpeqb and pmovmskb.)
Working at Intel, Torbjorn Granlund has access to hundreds of machines (probably) and can try his Silvermont algo on all of them. If you don't have similar facilities, there's no point trying to keep up with him. I also think it's ridiculous to use 700 bytes for a strlen algo. This is one reason Microsoft and other modern code is so bloated.
This reminds me of a Go proverb (an oriental board game; you've probably heard of it). Often a student will go to great lengths to keep a group alive, destroying his entire position in the process. It would have been much better to sacrifice the group for advantage elsewhere; this way he's guaranteed to lose. Then the teacher says, "If that's what you have to do to stay alive, why don't you just kill yourself?" That's how I feel about 700 lines to save a few nanoseconds looking for the end of a string!
Just the same, Torbjorn might have some good ideas. There is only one thing he does that I've never seen before: he uses pminub to take care of the situation you (Antariy and nidud) are discussing right now - that is, integrating the results from 2 or more XMM regs reading the string. Perhaps you should take a look at his technique.
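For instance (a hedged sketch of that pminub idea - my reconstruction, not Torbjorn's actual code - assuming xmm2 holds all zeros and eax is the 32-byte-aligned scan pointer): the byte-wise minimum of two blocks contains a zero byte iff either block does, so one test covers 32 bytes:
@scan32:
movdqa xmm0,[eax] ; first 16-byte block
movdqa xmm1,[eax+16] ; second 16-byte block
add eax,32
pminub xmm0,xmm1 ; byte-wise min: a zero survives from either block
pcmpeqb xmm0,xmm2 ; xmm2 = all zeros
pmovmskb edx,xmm0
test edx,edx
jz @scan32 ; no zero byte anywhere in the 32 bytes
; on exit, re-test the two blocks separately to locate the terminator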
nidud, your test bed is very good; this is the first time I've used it much, and the timings are very stable and reliable compared to other methods. It's also very convenient (once you get used to it) to just change 3.asm and recompile, without needing to recompile timeit.asm or any other algo.
Anyway, here's the new version of Antariy's algo, which is better on i5's than Silvermont but worse on AMD... I know there are various ways to improve it, but I think I'll do something else instead!
.686
.model flat, stdcall
.code
.xmm
OPTION PROLOGUE:NONE, EPILOGUE:NONE
AxStrLenSSE2 proc STDCALL lpszStr:DWORD ; fast with unaligned strings
; original by Antariy modified by rrr314159
mov eax,[esp+4]
mov edx, eax
and edx,-10h
pxor xmm0,xmm0
mov ecx,eax
and ecx,0fh
pcmpeqb xmm0,[edx]
pmovmskb eax,xmm0
shr eax,cl
bsf eax,eax
jnz @ret
xorps xmm0,xmm0
pxor xmm1, xmm1
pxor xmm2, xmm2
@@: ; naturally aligned to 16
pcmpeqb xmm0,[edx+10h]
pmovmskb eax,xmm0
test eax,eax
jnz gotat10h
pcmpeqb xmm1,[edx+20h]
pmovmskb eax,xmm1
test eax,eax
jnz gotat20h
pcmpeqb xmm2,[edx+30h]
pmovmskb eax,xmm2
test eax,eax
jnz gotat30h
pcmpeqb xmm0,[edx+40h]
pmovmskb eax,xmm0
test eax,eax
jnz gotat40h
pcmpeqb xmm1,[edx+50h]
pmovmskb eax,xmm1
test eax,eax
jnz gotat50h
pcmpeqb xmm2,[edx+60h]
pmovmskb eax,xmm2
test eax,eax
jnz gotat60h
pcmpeqb xmm0,[edx+70h]
pmovmskb eax,xmm0
test eax,eax
jnz gotat70h
pcmpeqb xmm1,[edx+80h]
pmovmskb eax,xmm1
test eax,eax
jnz gotat80h
pcmpeqb xmm2,[edx+90h]
pmovmskb eax,xmm2
test eax,eax
jnz gotat90h
add edx, 90h
jmp @B
gotat90h:
add edx, 10h
gotat80h:
add edx, 10h
gotat70h:
add edx, 10h
gotat60h:
add edx, 10h
gotat50h:
add edx, 10h
gotat40h:
add edx, 10h
gotat30h:
add edx, 10h
gotat20h:
add edx, 10h
gotat10h:
add edx, 10h
bsf eax,eax
sub edx,[esp+4]
add eax, edx
@ret:
ret 4
AxStrLenSSE2 endp
END
Here are the timings:
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (AVX)
----------------------------------------------
-- string length 0..40
462421 cycles - ( 0) proc_0: crt_strlen
239532 cycles - (105) proc_1: SSE 32
232826 cycles - (104) proc_2: AxStrLenSSE
217896 cycles - (229) proc_3: AxStrLenSSE (rrr)
242578 cycles - (673) proc_8: SSE Intel Silvermont
395083 cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
446781 cycles - ( 0) proc_0: crt_strlen
167097 cycles - (105) proc_1: SSE 32
164663 cycles - (104) proc_2: AxStrLenSSE
147078 cycles - (229) proc_3: AxStrLenSSE (rrr)
151820 cycles - (673) proc_8: SSE Intel Silvermont
282262 cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
349538 cycles - ( 0) proc_0: crt_strlen
73973 cycles - (105) proc_1: SSE 32
94474 cycles - (104) proc_2: AxStrLenSSE
55565 cycles - (229) proc_3: AxStrLenSSE (rrr)
46600 cycles - (673) proc_8: SSE Intel Silvermont
61976 cycles - (818) proc_9: SSE Intel Atom
result:
420539 cycles - (229) proc_3: AxStrLenSSE (rrr)
440998 cycles - (673) proc_8: SSE Intel Silvermont
480602 cycles - (105) proc_1: SSE 32
491963 cycles - (104) proc_2: AxStrLenSSE
739321 cycles - (818) proc_9: SSE Intel Atom
1258740 cycles - ( 0) proc_0: crt_strlen
hit any key to continue...
AMD A6-6310 APU with AMD Radeon R4 Graphics (AVX)
----------------------------------------------
-- string length 0..40
749682 cycles - ( 0) proc_0: crt_strlen
478036 cycles - (105) proc_1: SSE 32
537144 cycles - (104) proc_2: AxStrLenSSE
542685 cycles - (229) proc_3: AxStrLenSSE (rrr)
443542 cycles - (673) proc_8: SSE Intel Silvermont
618257 cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
876412 cycles - ( 0) proc_0: crt_strlen
321733 cycles - (105) proc_1: SSE 32
401381 cycles - (104) proc_2: AxStrLenSSE
361959 cycles - (229) proc_3: AxStrLenSSE (rrr)
310989 cycles - (673) proc_8: SSE Intel Silvermont
448527 cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
520175 cycles - ( 0) proc_0: crt_strlen
109789 cycles - (105) proc_1: SSE 32
113737 cycles - (104) proc_2: AxStrLenSSE
102691 cycles - (229) proc_3: AxStrLenSSE (rrr)
79685 cycles - (673) proc_8: SSE Intel Silvermont
95800 cycles - (818) proc_9: SSE Intel Atom
result:
834216 cycles - (673) proc_8: SSE Intel Silvermont
909558 cycles - (105) proc_1: SSE 32
1007335 cycles - (229) proc_3: AxStrLenSSE (rrr)
1052262 cycles - (104) proc_2: AxStrLenSSE
1162584 cycles - (818) proc_9: SSE Intel Atom
2146269 cycles - ( 0) proc_0: crt_strlen
hit any key to continue...
Hi rrr :t
I don't ignore your messages, just was too tired to answer thorougly.
One note right now - yes, AMDs look very slow with BSF, though that instruction is not executed in the inner loop, so it has a noticeable influence only with short strings. You may speed it up on AMD just by using BSF AX,AX instead of EAX,EAX. On Intel it makes little difference (the 16-bit form is a bit slower), but for AMD it makes sense. With a one-register algo that produces a 16-bit mask, the 16-bit BSF is the right choice on AMD.
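To make the suggestion concrete, here is a minimal sketch of the two tails being compared (my illustration; it assumes, as in these algos, that the pmovmskb result is already in eax and is known to be non-zero at this point):
bsf eax, eax ; 32-bit form: fine on Intel
; versus
bsf ax, ax ; 16-bit form: pmovmskb on an XMM source only ever sets the low 16 bits, so this is equivalent - and faster on these AMDs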
nidud, here is a tweak of your algo which in this form comes close to the algo posted on the "Research" thread on my machine:
strlen proc STDCALL lpszStr:DWORD
mov ecx,[esp+4]
xorps xmm2,xmm2
mov eax,ecx
and ecx,32-1
and eax,-32
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
or edx,-1
shl edx,cl
pcmpeqb xmm0,xmm2
pmovmskb ecx,xmm0
and ecx,edx
jnz done
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1
shl ecx,16
and ecx,edx
jnz done
loop_32:
lea eax,[eax+32]
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb ecx,xmm1
add edx,ecx ; both masks fit in 16 bits, so the add cannot carry:
jbe loop_32 ; jbe acts as jz here - loop while both masks are zero
shl ecx,16 ; rebuild the combined 32-bit mask: high word from xmm1,
pmovmskb edx,xmm0 ; low word from xmm0
add ecx,edx
done:
bsf ecx,ecx
lea eax,[ecx+eax]
sub eax,[esp+4]
ret 4
strlen endp
Actually I think that if you use an entry like in my StrLen (just shift out the overhead bits), this inner loop (strange, but the movdqa/pcmpeqb sequence seems faster than a direct pcmpeqb on memory), and the loop-exit check I suggested, this code will probably be one of the best StrLen solutions for 32-bit machines. Intel's crazy-unrolled solution has no gain in the real world.
rrr's unrolling is probably the maximum that makes sense; it's strange to look at algos that are 600+ or even 800+ bytes long just to check a string length on, after all, a real-world machine in a real-world app.
i5 test with a 200 MB file - 50 bibles concatenated:
Len(): 55896 us for 1519150 lines, 200850450 bytes
len(): 109353 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51731 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51359 us for 1519150 lines, 200850450 bytes
Len(): 50408 us for 1519150 lines, 200850450 bytes
len(): 108360 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51540 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 52424 us for 1519150 lines, 200850450 bytes
Len(): 52240 us for 1519150 lines, 200850450 bytes
len(): 108979 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 50288 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51747 us for 1519150 lines, 200850450 bytes
Len(): 51231 us for 1519150 lines, 200850450 bytes
len(): 107123 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 49613 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51625 us for 1519150 lines, 200850450 bytes
Len(): 50967 us for 1519150 lines, 200850450 bytes
len(): 107987 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51297 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51591 us for 1519150 lines, 200850450 bytes
AxStrLenSSE is a clear winner.
Old, old, old post, but I was interested in the timings. I made a small variation of Nidud's StrLen. Seems fast and stable, but I didn't measure the results on masm. Can someone give it a try?
Proc FastStrlenTest:
Arguments @Input
Uses ecx, edx ; RosAsm macro to preserve registers: just a push ecx | push edx after the start of the function (push ebp | mov ebp esp). Before the function ends at the 'EndP' macro, the registers are restored with pop edx | pop ecx
mov eax D@Input
xorps xmm2 xmm2 ; zero comparand, shared by the entry blocks and the loop
movdqu xmm0 X$eax ; bytes 0..15
movdqu xmm1 X$eax+16 ; bytes 16..31 - this load was missing, so a terminator there was never seen
pcmpeqb xmm0 xmm2
pmovmskb ecx xmm0
test ecx ecx
jne @done
pcmpeqb xmm1 xmm2
pmovmskb ecx xmm1
shl ecx 16
test ecx ecx
jne @done
@loop_32:
lea eax D$eax+32
movdqu xmm0 X$eax
movdqu xmm1 X$eax+16
pcmpeqb xmm0 xmm2 | pmovmskb edx xmm0
pcmpeqb xmm1 xmm2 | pmovmskb ecx xmm1
add edx ecx
jbe @loop_32
shl ecx 16
pmovmskb edx xmm0
add ecx edx
@done:
bsf ecx ecx
lea eax D$ecx+eax
sub eax D@Input
EndP
Note: I modified the beginning of Nidud's algo, because the address computed at the start may result in a crash if the string is located exactly at the beginning of the memory allocation. That's why I removed the initial
and ecx 32-1
and eax 0-32
Btw... the function seems to work on any size of null-terminated string... from 1 to XXXXX
Need to fully understand this, in order to try making a Unicode version too :)
I saw that while I was debugging the function. It didn't crash, but the pointer entered a previously allocated memory area (28 bytes before the start of the input). It probably didn't crash because the memory was correctly allocated, so it didn't reach its beginning.
Ex: say the memory starts at 100000 and the function uses a pointer at the very first byte, 100001. Since the and operation forces the inputted pointer down to 100000 (aligned down by up to 32 bytes), if the area before that address is not allocated, most likely it will crash. However, it didn't crash in my tests; I made that small adaptation just to make sure it never will.
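For what it's worth, the aligned read itself cannot fault, because memory protection has page granularity; a small worked sketch (hypothetical address, 4096-byte pages assumed):
mov eax, 100001h ; hypothetical string pointer, for illustration only
and eax, -32 ; align down -> 100000h
; pages are 4096-aligned and 4096 is a multiple of 32, so aligning down
; by at most 31 bytes can never leave the page that already contains the
; string's first byte: the aligned loads below cannot fault
movdqa xmm0, [eax] ; would only run with a real, mapped pointer
movdqa xmm1, [eax+16]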
Btw... it's an amazing algorithm. I wonder if it can be optimized further. I made a small test comparing pcmpeqb xmm0 xmm1 rather than pcmpeqb xmm0 xmm2 / pcmpeqb xmm1 xmm2, but I couldn't yet understand what this opcode is doing exactly.
The timings are simply great, and it does find the length for small strings etc... while the AxStrLenSSE2 version seems to have some errors in the results (at least, if I ported it properly).
(https://i.ibb.co/d7N8vt4/vsdfagva-Image1.png) (https://ibb.co/d7N8vt4)
Here is a test run with your (marginally modified) 1.asm ... 4.asm and other StrLen algos.
The 3.asm algo returns a wrong value, I have not found out why.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6931 cycles for 100 * CRT strlen
5372 cycles for 100 * Masm32 StrLen
19701 cycles for 100 * Windows lstrlen
2006 cycles for 100 * MasmBasic Len
1579 cycles for 100 * Algo1
1712 cycles for 100 * Algo2
1103 cycles for 100 * Algo3
3209 cycles for 100 * Algo4
6941 cycles for 100 * CRT strlen
5393 cycles for 100 * Masm32 StrLen
19722 cycles for 100 * Windows lstrlen
1994 cycles for 100 * MasmBasic Len
1587 cycles for 100 * Algo1
1722 cycles for 100 * Algo2
1102 cycles for 100 * Algo3
3227 cycles for 100 * Algo4
6938 cycles for 100 * CRT strlen
5380 cycles for 100 * Masm32 StrLen
19692 cycles for 100 * Windows lstrlen
2011 cycles for 100 * MasmBasic Len
1580 cycles for 100 * Algo1
1713 cycles for 100 * Algo2
1101 cycles for 100 * Algo3
3205 cycles for 100 * Algo4
6970 cycles for 100 * CRT strlen
5404 cycles for 100 * Masm32 StrLen
19710 cycles for 100 * Windows lstrlen
2010 cycles for 100 * MasmBasic Len
1589 cycles for 100 * Algo1
1710 cycles for 100 * Algo2
1104 cycles for 100 * Algo3
3204 cycles for 100 * Algo4
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
82 bytes for Algo1
114 bytes for Algo2
690 bytes for Algo3
761 bytes for Algo4
100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax Windows lstrlen
100 = eax MasmBasic Len
100 = eax Algo1
100 = eax Algo2
14 = eax Algo3
100 = eax Algo4
This is the clear winner:
; include 1j.asm
Algo1 proc
mov eax,[esp+4]
mov ecx,eax ; much faster than [esp+4]
and eax,-16
and ecx,16-1
or edx,-1
shl edx,cl
xorps xmm0,xmm0
pcmpeqb xmm0,[eax]
add eax,16
pmovmskb ecx,xmm0
; xorps xmm0,xmm0 ; ??
and ecx,edx
jnz L2
L1:
movaps xmm1,[eax]
pcmpeqb xmm1,xmm0
pmovmskb ecx,xmm1
add eax,16
test ecx,ecx
jz L1
L2:
bsf ecx,ecx
lea eax,[eax+ecx-16]
sub eax,[esp+4]
retn 4
Algo1 endp
Quote from: nidud on November 15, 2019, 09:34:12 AM
Quote from: jj2007 on November 15, 2019, 08:52:04 AM
Algo1 proc
mov eax,[esp+4]
mov ecx,eax ; much faster than [esp+4]
Actually they are the same (speed-wise), but that form was selected by size, for the alignment of L1.
On my CPU the algo is over 2% faster with mov ecx, eax
Quote
; xorps xmm0,xmm0 ; ??
This must be zero for compare below:
L1:
movaps xmm1,[eax]
pcmpeqb xmm1,xmm0
I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?
Quote from: nidud on November 15, 2019, 11:12:13 AM
Quote
I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?
Yes. The alignment may go back 15 bytes, and they may all be zero.
I've tested it with a 100 byte string, increasing the pointer 100 times. "All zero" shouldn't be a problem. But I admit I am too tired right now to understand it ;-)
Here's my current version of your fast algo, 62 bytes short and considerably faster than the original:
Algo1 proc
mov eax, [esp+4]
mov ecx, eax ; much faster than [esp+4]
and eax, -16
and ecx, 16-1
or edx, -1
shl edx, cl
xorps xmm0, xmm0 ; needed for short strings
pcmpeqb xmm0, [eax]
pmovmskb ecx, xmm0
; xorps xmm0, xmm0 ; ??
and ecx, edx ; short string?
jnz L2
L1:
add eax, 16
movaps xmm1, [eax]
pcmpeqb xmm1, xmm0
pmovmskb ecx, xmm1
test ecx, ecx
jz L1
L2:
bsf ecx, ecx
add eax, ecx
sub eax, [esp+4]
retn 4
Algo1 endp
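For completeness, a hypothetical call site (STDCALL, pointer on the stack, length returned in eax; the label is mine):
.data
msg db "Hello, world!", 0
.code
push offset msg
call Algo1 ; eax = 13; note that ecx and edx are clobbered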
Tks, guys, I'll give it a test for benchmarking the speed.
Nidud, about the memory issue... indeed you are right. I was confused, thinking the algo would write something to a protected area, but it does not. It's only computing the difference from where to start calculating the length, is that it? Tks for the unicode version. I'll test it right now.
JJ... I tried to assemble your file, but masmbasic threw a configuration error. I think I do have uasm on the same path as ml.exe, but this is what is shown to me:
(https://i.ibb.co/TMS9xWW/vfsdgf-Image1.png) (https://ibb.co/TMS9xWW)
How can I configure masmbasic properly to make it work on this test?
Hi Nidud.... The unicode version is working as expected :)
One question: how to implement a safety check inside the function to see whether the string is really unicode or not, while it is calculating the length?
I mean... say I have a bad unicode string like this:
[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]
How to make the function check for the bad chars 'hi' and return a value of... say 0-1 (meaning the function found an error)?
The RosAsm port of your function looks like this:
Proc UnicodeStrlen:
Arguments @Input
Uses ecx, edx, edi ; preserve ecx, edx, edi - I benchmark all algos with preserved registers, so they all behave the same way and the speed comparison happens under the very same conditions
mov eax D@Input
bt eax 0 | jc L3>
mov ecx D@Input
and eax 0-16
and ecx 16-1
or edx 0-1
shl edx cl
pxor xmm0 xmm0
pcmpeqw xmm0 X$eax
add eax 16
pmovmskb ecx xmm0
pxor xmm0 xmm0
and ecx edx
jnz L2>
L1:
movups xmm1 X$eax
pcmpeqw xmm1 xmm0
pmovmskb ecx xmm1
add eax 16
test ecx ecx
jz L1<
L2:
bsf ecx ecx
lea eax D$eax+ecx-16
sub eax D@Input
shr eax 1
ExitP
L3:
mov edx edi
mov edi eax
xor eax eax
or ecx 0-1
repne scasw
not ecx
dec ecx
mov eax ecx
mov edi edx
EndP
About the Unicode version: JJ, I ported yours to work with Unicode as well, and it seems to work ok, and it is faster on my machine. Can you give it a test on your benchmark function to compare the speeds, pls?
JJ Unicode Version
Proc UniStrLenJJ:
Arguments @Input
Uses ecx, edx ; <--- preserved register
mov eax D@Input
mov ecx eax ; much faster than [esp+4]
and eax 0-16
and ecx 16-1
or edx 0-1
shl edx cl
xorps xmm0 xmm0
pcmpeqw xmm0 X$eax
add eax 16
pmovmskb ecx xmm0
xorps xmm0 xmm0
and ecx edx
jnz L2>
L1:
movups xmm1 X$eax ; changed to the unaligned version in both the unicode and ansi versions - a bit faster on my machine, and prevents crashing on unaligned string reads
pcmpeqw xmm1 xmm0
pmovmskb ecx xmm1
add eax 16
test ecx ecx
jz L1<
L2:
bsf ecx ecx
lea eax D$eax+ecx-16
sub eax D@Input
shr eax 1
EndP
JJ Ansi version
Proc StrLenJJ:
Arguments @Input
Uses ecx, edx
mov eax D@Input
mov ecx eax ; much faster than [esp+4]
and eax 0-16
and ecx 16-1
or edx 0-1
shl edx cl
xorps xmm0 xmm0
pcmpeqb xmm0 X$eax
add eax 16
pmovmskb ecx xmm0
xorps xmm0 xmm0
and ecx edx
jnz L2>
L1:
movups xmm1 X$eax
pcmpeqb xmm1 xmm0
pmovmskb ecx xmm1
add eax 16
test ecx ecx
jz L1<
L2:
bsf ecx ecx
lea eax D$eax+ecx-16
sub eax D@Input
EndP
In my tests the Ansi and Unicode versions do not vary much in speed.
Ansi version, using the following string (99 chars... I'm testing odd and even strings, big or as tiny as 1 byte):
[TestAnsi: B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timing (in nanosecs): 2.95913440552019 ns
Unicode version:
[TestUnicode: U$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timing (in nanosecs): 3.46197153137555 ns
Btw... same question I asked Nidud: how to make the Unicode version check whether the string being measured is really Unicode, and return 0-1 if it finds non-unicode chars while calculating the length?
Quote from: nidud on November 15, 2019, 12:31:13 PM
Well, if you feed it with this string:
align 16
db 15 dup(0)
err db "error",0
the aligned (first) compare will be:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e | r r o r 0
xmm0 will be:
00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
and there will be no jump to L2
Correct :thumbsup:
Quote from: guga on November 15, 2019, 03:26:18 PM
About the Unicode version: JJ, I ported yours to work with Unicode as well, and it seems to work ok, and it is faster on my machine. Can you give it a test on your benchmark function to compare the speeds, pls?
If you translate it to Masm syntax, I can do that :thup:
Hi JJ. Tks :)
I believe the port to masm is something like this:
UniStrLenJJ proc Input :DWORD
; uses ebp rather than esp; MASM generates push ebp | mov ebp, esp automatically for a proc with declared arguments, so no explicit prologue is needed here
push ecx
push edx
mov eax, Input
mov ecx, eax
and eax, -16
and ecx, 16-1
or edx, -1
shl edx, cl
xorps xmm0, xmm0
pcmpeqw xmm0, xmmword ptr [eax] ; xmmword ptr is the correct MASM spelling; same as Algo1, but with pcmpeqw rather than pcmpeqb
add eax, 16
pmovmskb ecx, xmm0
xorps xmm0, xmm0
and ecx, edx
jnz short Out1
InnerLoop:
movups xmm1, xmmword ptr [eax] ; as above, xmmword ptr [eax]
pcmpeqw xmm1, xmm0
pmovmskb ecx, xmm1
add eax, 16
test ecx, ecx
jz short InnerLoop
Out1:
bsf ecx, ecx
lea eax, [ecx+eax-16]
sub eax, Input
shr eax, 1
pop edx
pop ecx
ret ; with a declared argument, MASM's default epilogue expands this to mov esp, ebp | pop ebp | retn 4
UniStrLenJJ endp
This performs better with small strings:
.686
.xmm
.model flat
.code
mov eax,-16
mov edx,[esp+4]
xorps xmm0, xmm0
@loop:
add eax, 16
PCMPISTRI xmm0, xmmword ptr [edx + eax], 1000b ; unsigned bytes, "equal each": ZF is set when the block contains the terminating zero, ECX = its index
jnz @loop
add eax, ecx
ret
end
total [0 .. 40], 8++
262586 cycles 5.asm: PCMPISTRI
329094 cycles 3.asm: SSE Intel Silvermont
366495 cycles 1.asm: SSE 16
453912 cycles 2.asm: SSE 32
682335 cycles 0.asm: msvcrt.strlen()
728151 cycles 4.asm: SSE Intel Atom
hit any key to continue...
(*) Note: can read up to 15 bytes past the end of the string and cause an exception in some cases, normally when crossing a page boundary.
Testbed with Guga's Unicode version, using ebp:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
7047 cycles for 100 * CRT strlen
5265 cycles for 100 * Masm32 StrLen
19964 cycles for 100 * Windows lstrlen
1711 cycles for 100 * MasmBasic Len
1499 cycles for 100 * Algo1
2918 cycles for 100 * AlgoGuga
7097 cycles for 100 * CRT strlen
5311 cycles for 100 * Masm32 StrLen
19940 cycles for 100 * Windows lstrlen
1704 cycles for 100 * MasmBasic Len
1497 cycles for 100 * Algo1
2929 cycles for 100 * AlgoGuga
7030 cycles for 100 * CRT strlen
5250 cycles for 100 * Masm32 StrLen
19956 cycles for 100 * Windows lstrlen
1674 cycles for 100 * MasmBasic Len
1494 cycles for 100 * Algo1
2935 cycles for 100 * AlgoGuga
7057 cycles for 100 * CRT strlen
5268 cycles for 100 * Masm32 StrLen
20014 cycles for 100 * Windows lstrlen
1714 cycles for 100 * MasmBasic Len
1511 cycles for 100 * Algo1
3000 cycles for 100 * AlgoGuga
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
74 bytes for Algo1
87 bytes for AlgoGuga
100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax Windows lstrlen
100 = eax MasmBasic Len
100 = eax Algo1
100 = eax AlgoGuga
This is the Unicode version using PCMPISTRI
.code
mov rax,-16
mov rdx, rcx
xorps xmm0, xmm0
@loop:
add rax, 16
PCMPISTRI xmm0, xmmword ptr [rdx + rax], 1001b ; unsigned words, "equal each": ZF set when the terminating zero word is found, ECX = its word index
jnz @loop
shr eax, 1
add eax, ecx
ret
end
total [0 .. 40], 8++
323358 cycles 3.asm: AVX 32
404277 cycles 4.asm: PCMPISTRI
477789 cycles 2.asm: SSE 16
1237417 cycles 0.asm: msvcrt.wcslen()
3886924 cycles 1.asm: scasw
total [41 .. 80], 7++
291655 cycles 3.asm: AVX 32
562089 cycles 2.asm: SSE 16
563122 cycles 4.asm: PCMPISTRI
1935096 cycles 0.asm: msvcrt.wcslen()
4320489 cycles 1.asm: scasw
total [600 .. 1000], 100++
333669 cycles 3.asm: AVX 32
982307 cycles 2.asm: SSE 16
1405725 cycles 4.asm: PCMPISTRI
3490272 cycles 0.asm: msvcrt.wcslen()
6914474 cycles 1.asm: scasw
Tks a lot, JJ :)
It seems as fast as expected.
Here is the results:
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
5842 cycles for 100 * CRT strlen
5813 cycles for 100 * Masm32 StrLen
18946 cycles for 100 * Windows lstrlen
1878 cycles for 100 * MasmBasic Len
1545 cycles for 100 * Algo1
2579 cycles for 100 * AlgoGuga
5769 cycles for 100 * CRT strlen
5711 cycles for 100 * Masm32 StrLen
18825 cycles for 100 * Windows lstrlen
2350 cycles for 100 * MasmBasic Len
1917 cycles for 100 * Algo1
3390 cycles for 100 * AlgoGuga
7980 cycles for 100 * CRT strlen
7723 cycles for 100 * Masm32 StrLen
24159 cycles for 100 * Windows lstrlen
2372 cycles for 100 * MasmBasic Len
1930 cycles for 100 * Algo1
2587 cycles for 100 * AlgoGuga
6088 cycles for 100 * CRT strlen
7132 cycles for 100 * Masm32 StrLen
22808 cycles for 100 * Windows lstrlen
1673 cycles for 100 * MasmBasic Len
1609 cycles for 100 * Algo1
2636 cycles for 100 * AlgoGuga
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
74 bytes for Algo1
87 bytes for AlgoGuga
100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax Windows lstrlen
100 = eax MasmBasic Len
100 = eax Algo1
100 = eax AlgoGuga
--- ok ---
Quote from: guga on November 16, 2019, 10:23:57 AM
Tks a lot, JJ :)
It seems as fast as expected.
Thanks, Guga. Here is a new version, 59 bytes short and pretty fast:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6984 cycles for 100 * CRT strlen
5216 cycles for 100 * Masm32 StrLen
19898 cycles for 100 * Windows lstrlen
1674 cycles for 100 * MasmBasic Len
1500 cycles for 100 * Algo1/Nidud
2608 cycles for 100 * Algo1/Guga+JJ
7035 cycles for 100 * CRT strlen
5249 cycles for 100 * Masm32 StrLen
19929 cycles for 100 * Windows lstrlen
1700 cycles for 100 * MasmBasic Len
1569 cycles for 100 * Algo1/Nidud
2625 cycles for 100 * Algo1/Guga+JJ
7087 cycles for 100 * CRT strlen
5215 cycles for 100 * Masm32 StrLen
19859 cycles for 100 * Windows lstrlen
1701 cycles for 100 * MasmBasic Len
1498 cycles for 100 * Algo1/Nidud
2644 cycles for 100 * Algo1/Guga+JJ
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
53 bytes for Algo1/Nidud
59 bytes for Algo1/Guga+JJ
100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax Windows lstrlen
100 = eax MasmBasic Len
100 = eax Algo1/Nidud
100 = eax Algo1/Guga+JJ
Note that results are not fully comparable. Masm32 len(), MasmBasic Len() (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1144) and Algo1/Guga+JJ preserve edx and ecx. This is unusual, but the good ol' Masm32 len() algo did this, so in order to avoid compatibility problems, Len() (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1144) behaves the same. In addition, Len() preserves xmm0. Still, Len() is over 4 times faster than CRT strlen :cool:
Quote from: guga on November 15, 2019, 03:05:45 PM
Hi Nidud.... The unicode version is working as expected :)
One question: how to implement a safety check inside the function to see whether the string is really unicode or not, while it is calculating the length?
I mean... say I have a bad unicode string like this:
[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]
How to make the function check for the bad chars 'hi' and return a value of... say 0-1 (meaning the function found an error)?
In my experience, when I put unicode characters into a text editor, the editor detects that I have entered or pasted unicode and asks whether I want to save in unicode format or lose that info by saving in good old ascii. My suggestion is to support a unicode-or-not flag somewhere in the copy routine, so a text editor or GUI creator using it can set the flag as soon as it detects unicode input from the user; maybe also add a superfast routine that expands ascii to unicode when that happens?
Just write a UNICODE editor, no conversions.
Visual Studio and some text editors like Notepad++ detect Unicode on save.
But in this thread, what we have been calling Unicode are word-length characters, that is, UCS-2: the UTF-16 character set without the additional planes for emoji and other stuff. This is what Windows internals use. For programming and for most purposes it is very OK, no need to complicate things further. However, there is more to it.
Windows has the IsTextUnicode API function but it sometimes fails. I have seen and used better algos but am never sure it will not fail somewhere.
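As a concrete illustration of those additional planes (standard UTF-16 encoding; the example is mine, not from the thread): a code point outside the BMP occupies two words, so any word-counting length routine reports 2 for a single character:
; U+1F600 (a smiling-face emoji) as a UTF-16 surrogate pair:
emoji dw 0D83Dh, 0DE00h, 0 ; two words, one character, plus terminator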
Quote from: hutch-- on November 16, 2019, 09:14:42 PM
Just write a UNICODE editor, no conversions.
Hi Steve, tks... but I was thinking of a function able to detect it for disassembler purposes. While I was testing the GDI issues from that other post, I faced a really old problem in RosAsm and decided to start a minor update of some of the routines. The routine I'm currently working on is the disassembler, which also uses routines to check chunks of bytes. I remembered using a very, very old function for string length and updated it, but it was not fast enough; that's why I decided to give these a try.
Since RosAsm (now) disassembles unicode strings properly, I'm also trying a unicode string length check and a routine to automatically detect simple unicode strings (Char+0, Char+0 etc). So, from a disassembler's point of view, a faster routine able to determine whether a string is unicode or ansi (even the simple unicode format) may be necessary to avoid bad recognitions.
Yesterday I succeeded in fixing an old bug in the RosAsm disassembler that made the disassembling process rather slow on some apps and dlls. For example, on shell32.dll or atioglxx.dll for XP (12 MB and 17 MB respectively), the disassembler took almost 25 minutes to finish, due to heavy computation of code/data identification, with the main functions going back and forward thousands of times. It was a stupid design mistake I (or René, I don't remember) made years ago, forcing one of the functions to allocate and deallocate a huge amount of memory every time the routine is used (and it is used thousands of times on problematic files). Now the process is way faster for some problematic files: the same test on shell32.dll disassembled it in about 20 seconds and kept the accuracy of the disassembled data.
As long as you can safely identify character data, either ANSI or UNICODE, it may be viable to have a manual option to switch between the two. There is no truly reliable method of detecting the difference so you use the best you can get/write and have a manual option.
Manual option? :thumbsup: :thumbsup: :thumbsup: Indeed. Yeah... maybe I'll implement it later, when I finish the fixes. I plan to add a couple of user options as soon as I have some time to dedicate to RosAsm updates again. The current version identifies ANSI and Unicode strings with reasonable accuracy, but still has some minor problems, because in some cases a chunk of data can be either a string or a pointer, and that is not easy to fix without the help of other tools, like a signature system I started developing years ago but never finished.
Manual options could be good, as in IdaPro, allowing the user to choose whether to disassemble C-style strings, DOS style, Pascal style, Delphi strings etc. But for those specific cases (strings tied to certain compilers) it would be better to do that when (or if) I succeed in finishing the signature technique (I called it DIS - Digital Identification System) I started many years ago.
The current routine does the job more than 90% of the time for real apps. The disassembler sets a couple of flags for forbidden areas of the PE, to avoid disassembling them - for example, sections flagged as import, export, resources etc. It basically identifies the good code and data sections, and those are the ones that are actually disassembled.
One of the problems is that it is common to find embedded data, structures, strings etc. inside the code section. Although the disassembler works reasonably well on those sections, some files do produce wrong results... but those I'll fix later, once I fix some minor bugs in RosAsm.
I'm comparing the results of the fixes I'm currently doing, and they are more accurate than the ones in IdaPro (not considering Ida's FLIRT system, of course), but there are still some issues.
Eventually I'll try to create a set of macros or internal routines to allow compilation in masm syntax style, but... this will still take more time.
Quote from: hutch-- on November 16, 2019, 09:14:42 PM
Just write a UNICODE editor, no conversions.
I already worked on a unicode richedit, but as an asm programmer I cannot resist downsizing tricks like halving the size of the text data, then upsizing by adding the start of the specific language range it belongs to and moving it into the character buffer.
Quote from: guga on November 17, 2019, 11:50:24 AM
Manual options could be good, as in IdaPro, allowing the user to choose whether to disassemble C-style strings, DOS style, Pascal style, Delphi strings etc. But for those specific cases (strings tied to certain compilers) it would be better to do that when (or if) I succeed in finishing the signature technique (I called it DIS - Digital Identification System) I started many years ago.
The current routine does the job more than 90% of the time for real apps. The disassembler sets a couple of flags for forbidden areas of the PE, to avoid disassembling them - for example, sections flagged as import, export, resources etc. It basically identifies the good code and data sections, and those are the ones that are actually disassembled.
90%, great job :thumbsup: For some of the failing %, maybe the creator of the code tried to make it harder to disassemble?
Why not make it like a smartphone editor: show its best guess of the string format together with a tiny sample, and give the user the option to choose from a list of different text formats? I had no idea there were loads of different text formats.
I can also think of hardcoded mov eax,"abcd" or mov eax,"ab" ; unicode can be very hard to handle in a disassembler.
Also good to know: how to detect that text files are saved in utf16 format.
Using rich edit, the shift from ANSI to UNICODE is simple enough to do. If you have to work with both, load the file and, if it looks like garbage, switch from one to the other - it just means reloading the file again. You can make tangled messes that still can't guarantee being correct, or switch between the two; the latter is much easier.
This is another one, this time for an AVX-512 strlen (based on this url (https://github.com/WojciechMula/toys/blob/master/avx512-string/avx512f-strlen.cpp))
strlenZMM:
push esi
mov eax, 01010101h
vpbroadcastd zmm2, eax ; broadcast eax to all elements
xor edx, edx ; len = 0
mov eax, 80808080h
vpbroadcastd zmm3, eax
mov esi, [esp+8]
@@:
vmovdqu32 zmm0, ZMMWORD PTR [esi+edx]
vpsubd zmm1, zmm0, zmm2 ; v - 01010101h, per dword
vpternlogd zmm1, zmm0, zmm3, 32 ; imm 32 = A and not B and C: (v-01010101h) and not v and 80808080h
vptestmd k1, zmm1, zmm1 ; one mask bit per dword containing a zero byte
kmovw eax, k1
movzx eax, ax
test ax, ax
jnz @F
add edx, 64
jmp short @B
@@:
bsf eax, eax
push 32
pop ecx
cmovne ecx, eax
lea esi, dword ptr [esi+ecx*4]
cmp byte ptr [esi+edx], 0
lea eax, dword ptr [edx+ecx*4]
je short @exit
cmp byte ptr [esi+edx+1], 0
jne short @F
inc eax
jmp short @exit
@@:
cmp byte ptr [esi+edx+2], 0
jne short @F
add eax, 2
jmp short @exit
@@:
add eax, 3
@exit:
vzeroupper
pop esi
ret
end
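In case the vpternlogd line looks cryptic: imm8 = 32 selects the boolean function A and not B and C, so the loop evaluates the classic zero-byte predicate (v - 01010101h) and not v and 80808080h on each dword. A scalar sketch of the same test (my illustration, not part of the posted algo):
; eax = dword to test; afterwards eax is non-zero iff some byte of it was zero
mov edx, eax
not edx ; not v
sub eax, 01010101h ; v - 01010101h: a borrow runs out of any zero byte
and eax, edx ; keep only borrows caused by a genuinely zero byte
and eax, 80808080h ; isolate the per-byte sign bits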
I added a 4th test, for strings between 40000 and 40900 bytes, to see if the AVX-512 breaks away. Well, not really; SSE Intel Silvermont and SSE Intel Atom are right there as well. :hmmm:
total [0 .. 40], 8++
290780 cycles 7.asm: sse2
355355 cycles 5.asm: PCMPISTRI
412251 cycles 3.asm: SSE Intel Silvermont
469664 cycles 8.asm: Agner Fog
502841 cycles 1.asm: SSE 16
524321 cycles 2.asm: SSE 32
597335 cycles 9.asm: ZMM AVX512
865552 cycles 4.asm: SSE Intel Atom
908227 cycles 6.asm: scasb
913651 cycles 0.asm: msvcrt.strlen()
total [41 .. 80], 7++
270380 cycles 3.asm: SSE Intel Silvermont
299431 cycles 5.asm: PCMPISTRI
306940 cycles 7.asm: sse2
314735 cycles 1.asm: SSE 16
364536 cycles 9.asm: ZMM AVX512
380247 cycles 8.asm: Agner Fog
405156 cycles 2.asm: SSE 32
639091 cycles 4.asm: SSE Intel Atom
758265 cycles 6.asm: scasb
982403 cycles 0.asm: msvcrt.strlen()
total [600 .. 1000], 100++
202227 cycles 9.asm: ZMM AVX512
237534 cycles 3.asm: SSE Intel Silvermont
292854 cycles 4.asm: SSE Intel Atom
334146 cycles 2.asm: SSE 32
338568 cycles 1.asm: SSE 16
356720 cycles 7.asm: sse2
436840 cycles 8.asm: Agner Fog
650222 cycles 5.asm: PCMPISTRI
1438033 cycles 6.asm: scasb
1830544 cycles 0.asm: msvcrt.strlen()
total [40000 .. 40900], 100++
2161645 cycles 3.asm: SSE Intel Silvermont
2224521 cycles 4.asm: SSE Intel Atom
2342704 cycles 9.asm: ZMM AVX512
3137064 cycles 1.asm: SSE 16
3465817 cycles 7.asm: sse2
3514206 cycles 2.asm: SSE 32
4113016 cycles 8.asm: Agner Fog
6173622 cycles 5.asm: PCMPISTRI
13022424 cycles 6.asm: scasb
16670776 cycles 0.asm: msvcrt.strlen()
@nidud,
It has a bug; I have not yet figured out the logic, but it does not get past this:
00007ff65266174e test ecx, ecx
00007ff652661750 jz 0x7ff652661734
You can debug with VS 2012 or 2015 and the Intel SDE Debugger. I don't know if it works with the Express or Community, I have the Pro for those years.
Later: I think it does because I read this:
http://masm32.com/board/index.php?topic=6473.msg69456#msg69456
I understood the logic and corrected it in the following way and it does not need to be aligned.
.code
avx512aligned proc
sub rsp, 8
xor rax,rax
vpbroadcastq zmm0,rax
mov r8,rcx
mov rax,rcx
and rax,-64
and rcx,64-1
xor rdx,rdx
dec rdx
shl rdx,cl
vpcmpgtb k1,zmm0,[rax]
kmovq rcx,k1
and rcx,rdx
jnz L2
L1:
add rax,64
vpcmpgtb k1,zmm0,[rax]
kmovq rcx,k1
test rcx,rcx
jz L1
L2:
bsf rcx,rcx
lea rax,[rax+rcx]
sub rax,r8
dec rax
add rsp, 8
ret
avx512aligned endp
end
What a mess there was with the k2! Simply remove it.
Hi, DayDreamer
"90% great job :thumbsup: some of the fail % it maybe is creator of code tried to make it harder to disassemble?"
Thanks :thumbsup: :thumbsup: :) About the code being harder by the creator's choice: well, not necessarily. Most of the time, when a disassembler fails to identify things, it is due to characteristics of each compiler or its libraries (when the file is not packed, of course). For example, VisualBasic 6 code contains a lot of embedded data inside the code section; some Delphi or Borland files have that too. Plain C using APIs, on the other hand, is somewhat easier to disassemble, because code and data are more distinguishable from each other; some C++ files are somewhat easier as well. What makes the disassembling process hard are badly encoded libraries, especially ones made with C++, or files containing dead code - functions that are never used. In those situations (functions without any reference), following the data chain is tricky, because some of those chunks can be either code or data.
Sure, no disassembler will be 100% accurate, but the higher the accuracy rate we can get, the better. I'm pretty sure that if I finish the DIS system (similar to FLIRT in Ida), I can reach something around 98% accuracy, but... no plans/time to do that right now. I still have to fix a lot of problems inside RosAsm and try to make its inner functions less attached to the interface (and eventually rebuild it completely). I'll have to isolate the encoder, the disassembler, the debugger, and create a new resource editor; only then can I try to rewrite the interface or implement a new set of macros (or internal routines) to make it work with something closer to masm syntax too.
Hi Steve :)
"Using rich edit, the shift from ANSI to UNICODE is simple enough to do. If you have to work in both, load the file and if it looks like garbage, switch from one to the other. Just means reloading the file again. You can make tangled messes that still can't garrantee being correct or switch between the two, the latter is much easier."
Interesting idea. :) Easier indeed.
Quote from: guga on November 18, 2019, 11:45:21 AM
"Using rich edit, the shift from ANSI to UNICODE is simple enough..."
The RichEdit control can use rtf under the hood, and thus has no problems using Unicode. In RichMasm (http://masm32.com/board/index.php?topic=5314.0), for example, your source can be Ansi or Unicode or a mix of both, no problem. The more interesting part is what you pass from editor to assembler - and that must be Utf-8, if you want to display non-English text in your executable.
Hi JJ. Tks.
I tested the new update and it is working faster :)
About the UTF8: so I always need to convert to UTF8 before displaying, whenever I load a file or immediately before showing it on screen, with no user choice needed? I'm not sure I understood what you meant by passing from editor to assembler.
You open a file (unicode or ansi) in RichMasm, and then you export it as UTF-8 - or is the UTF-8 conversion done internally, only to display the asm text on the screen?
This is OT but I'll keep it short:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
uMsgBox 0, "Добро пожаловать", "歡迎", MB_OK
EndOfCode
- RichMasm exports this source to whatever.asm as Utf-8
- the assembler (ML, UAsm, AsmC) sees an 8-bit text but doesn't care whether that's codepage 1252 or 65001 or whatever
- the executable sees an 8-bit text, and gets told via the u in uMsgBox that a MessageBoxW is required, and that it should kindly translate the Utf-8 to Utf-16 before passing it on to MessageBoxW
That's the whole trick. For the coder, it's convenient because he can write everything directly into your source. And of course, comments can be in any encoding, the RichEdit control uses RTF to store it, and the assemblers don't care.
It works very well. :thumbsup:
total [0 .. 40], 8++
307793 cycles 1.asm: PCMPISTRI
443231 cycles 0.asm: AVX-512 aligned
521571 cycles 2.asm: ZMM AVX512 (older)
total [41 .. 80], 7++
257807 cycles 0.asm: AVX-512 aligned
356038 cycles 2.asm: ZMM AVX512 (older)
370879 cycles 1.asm: PCMPISTRI
total [600 .. 1000], 100++
113553 cycles 0.asm: AVX-512 aligned
204811 cycles 2.asm: ZMM AVX512 (older)
859649 cycles 1.asm: PCMPISTRI
total [40000 .. 40800], 100++
897536 cycles 0.asm: AVX-512 aligned
2127546 cycles 2.asm: ZMM AVX512 (older)
5980835 cycles 1.asm: PCMPISTRI
If you convert the remaining to 64-bit I will test them all.
Quote from: nidud on November 19, 2019, 07:12:37 AM
Added the Intel versions as well. They were originally 32-bit, so a native 64-bit version would have been coded differently from the direct translation done here.
Lots of space there now: 64 * 32 regs = 2048 bytes.
I think there is a bug in the ANSI version here:
args_x macro
lea rcx,str_1[size_s]
mov eax,step_x
add eax,eax <---------- HERE
sub rcx,rax
exitm<>
endm
Although it is not causing a buffer overflow, it clearly changes the results.
These are the results after removing the "add eax, eax":
total [0 .. 40], 16++
133475 cycles 5.asm: AVX 32
152210 cycles 3.asm: SSE Intel Silvermont
157375 cycles 1.asm: SSE 16
172910 cycles 6.asm: AVX512 64
178312 cycles 2.asm: SSE 32
228359 cycles 0.asm: msvcrt.strlen()
326672 cycles 4.asm: SSE Intel Atom
total [41 .. 80], 16++
117358 cycles 5.asm: AVX 32
123539 cycles 6.asm: AVX512 64
165831 cycles 1.asm: SSE 16
169369 cycles 3.asm: SSE Intel Silvermont
210518 cycles 2.asm: SSE 32
270514 cycles 0.asm: msvcrt.strlen()
281378 cycles 4.asm: SSE Intel Atom
total [600 .. 1000], 200++
67189 cycles 6.asm: AVX512 64
110356 cycles 5.asm: AVX 32
218898 cycles 3.asm: SSE Intel Silvermont
235207 cycles 4.asm: SSE Intel Atom
272100 cycles 2.asm: SSE 32
296195 cycles 1.asm: SSE 16
643732 cycles 0.asm: msvcrt.strlen()
This is another suite of test results for 64-bit strlen variations, including Agner Fog, PCMPISTRI and Masm32 SDK.
I added a test for extra long strings in the range 40000 to 40800 bytes.
Masm32 SDK strlen is not SIMD assisted (and I believe msvcrt.strlen is not either), so it plays a different tournament. But it could be made that way, because all 64-bit machines support SSE.
total [0 .. 40], 16++
135718 cycles 5.asm: AVX 32
147583 cycles 8.asm: PCMPISTRI
159476 cycles 1.asm: SSE 16
169063 cycles 3.asm: SSE Intel Silvermont
190066 cycles 2.asm: SSE 32
192212 cycles 7.asm: Agner Fog
210091 cycles 6.asm: AVX512 64
238010 cycles 0.asm: msvcrt.strlen()
280346 cycles 4.asm: SSE Intel Atom
282475 cycles 9.asm: Masm32 SDK
total [41 .. 80], 16++
116625 cycles 5.asm: AVX 32
120875 cycles 6.asm: AVX512 64
136046 cycles 3.asm: SSE Intel Silvermont
169359 cycles 8.asm: PCMPISTRI
179466 cycles 1.asm: SSE 16
198766 cycles 2.asm: SSE 32
205015 cycles 7.asm: Agner Fog
257180 cycles 0.asm: msvcrt.strlen()
278100 cycles 4.asm: SSE Intel Atom
487603 cycles 9.asm: Masm32 SDK
total [600 .. 1000], 200++
83570 cycles 6.asm: AVX512 64
110477 cycles 5.asm: AVX 32
218994 cycles 3.asm: SSE Intel Silvermont
253867 cycles 4.asm: SSE Intel Atom
279579 cycles 2.asm: SSE 32
307387 cycles 1.asm: SSE 16
334595 cycles 7.asm: Agner Fog
488680 cycles 8.asm: PCMPISTRI
621900 cycles 0.asm: msvcrt.strlen()
1066191 cycles 9.asm: Masm32 SDK
total [40000 .. 40800], 200++
505134 cycles 6.asm: AVX512 64
977509 cycles 5.asm: AVX 32
1468195 cycles 3.asm: SSE Intel Silvermont
1684275 cycles 4.asm: SSE Intel Atom
2241774 cycles 2.asm: SSE 32
2250641 cycles 1.asm: SSE 16
2609106 cycles 7.asm: Agner Fog
3257461 cycles 8.asm: PCMPISTRI
4818268 cycles 0.asm: msvcrt.strlen()
8809927 cycles 9.asm: Masm32 SDK
In this test I introduce a new AVX-512 strlen variation called Fast AVX512 64 :biggrin:
.code
xor rax, rax
vxorps zmm1, zmm1, zmm1 ; zero comparand
L1:
vpcmpeqb k1, zmm1, zmmword ptr [rcx+rax] ; one k-bit per zero byte; note the 64-byte load can read past the terminator
kmovq r9,k1
add rax, 64
test r9,r9
jz L1
bsf r9,r9
lea rax,[rax+r9-64]
ret
end
These are the results:
total [0 .. 40], 16++
83041 cycles 9.asm: Fast AVX512 64
118463 cycles 8.asm: PCMPISTRI
136861 cycles 5.asm: AVX 32
145743 cycles 3.asm: SSE Intel Silvermont
163889 cycles 1.asm: SSE 16
178432 cycles 6.asm: AVX512 64
185371 cycles 7.asm: Agner Fog
196856 cycles 2.asm: SSE 32
228516 cycles 0.asm: msvcrt.strlen()
277227 cycles 4.asm: SSE Intel Atom
total [41 .. 80], 16++
61027 cycles 9.asm: Fast AVX512 64
111154 cycles 5.asm: AVX 32
130256 cycles 6.asm: AVX512 64
139440 cycles 3.asm: SSE Intel Silvermont
155091 cycles 8.asm: PCMPISTRI
183854 cycles 1.asm: SSE 16
194775 cycles 7.asm: Agner Fog
212161 cycles 2.asm: SSE 32
285351 cycles 4.asm: SSE Intel Atom
311238 cycles 0.asm: msvcrt.strlen()
total [600 .. 1000], 200++
71159 cycles 9.asm: Fast AVX512 64
81938 cycles 6.asm: AVX512 64
110439 cycles 5.asm: AVX 32
220499 cycles 3.asm: SSE Intel Silvermont
254703 cycles 4.asm: SSE Intel Atom
293130 cycles 2.asm: SSE 32
308233 cycles 1.asm: SSE 16
338944 cycles 7.asm: Agner Fog
516498 cycles 8.asm: PCMPISTRI
648680 cycles 0.asm: msvcrt.strlen()
total [40000 .. 40800], 200++
390634 cycles 6.asm: AVX512 64
414175 cycles 9.asm: Fast AVX512 64
606734 cycles 5.asm: AVX 32
1392867 cycles 3.asm: SSE Intel Silvermont
1417887 cycles 4.asm: SSE Intel Atom
2194951 cycles 1.asm: SSE 16
2200795 cycles 2.asm: SSE 32
2229910 cycles 7.asm: Agner Fog
3295851 cycles 8.asm: PCMPISTRI
4538755 cycles 0.asm: msvcrt.strlen()
For huge strings the other variation catches up. :badgrin:
Maybe it can be improved just a little.
Later:
I have not used it in the above test but the following, which is based on the same idea, is also faster than AVX 32 except for huge strings (but the difference is small).
.code
xor rax, rax
vxorps ymm0, ymm0, ymm0
L1:
vpcmpeqb ymm1, ymm0, ymmword ptr [rcx+rax]
vpmovmskb r9d,ymm1
add rax,32
test r9d,r9d
jz L1
bsf r9,r9
lea rax,[rax+r9-32]
ret
end
Very good Nidud.
Something I dislike is the .cmd extension and the makefile. I dislike as well the "option dllimport:" that comes back from Jwasm and in general everything that tries to establish unnecessary alternatives for doing things that are well done in the traditional way.
Notepad++ users (probably not many here) can build these ASMC test suites for debugging using this script:
(https://www.dropbox.com/s/cgh80bug1e3y44b/asmcdebug.png?dl=1)
Let's see if it works:
(https://www.dropbox.com/s/g8ndvcukgy5z4hj/asmdb2.png?dl=1)
I understand, but ASMC libs appear not to be compatible with Microsoft LINK.exe.
We will need to learn how to use linkw.exe, right?
Uasm libs don't force us to use anything else (and they never have, thank God).
Quote from: KeepingRealBusy on March 04, 2015, 10:23:39 AM
rrr
Mis-alignment can be a problem, especially if both the source and the destination buffers have some mis-alignment, and especially if they have different alignments. The usual case is that one or the other buffer can be aligned mod 16. The way to handle these is to move a block of 16 bytes from the start of the buffer and a block of 16 bytes at the end of the buffer with unaligned XMM moves. Then calculate the size left to be moved (data size - 32) and round it up to the next mod 16, then start at the first mod-16 byte boundary of the unaligned buffer and move 16 bytes at a time (there will be a small overlap at either end).
Dave.
Very few people take their efforts so far as to avoid this issue, but internally any data not aligned on its own size (i.e. a dword not aligned on a 4 byte boundary) is going to require 2 reads to retrieve the data. That's a big slowdown. The effort required to properly align data across the board is well worth it.
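A minimal sketch of the head/tail scheme Dave describes, for what it's worth (Win64 register usage, the proc name, and the choice to align only the destination are my assumptions, not from the original post):
memcpy_ht proc
; rcx = dest, rdx = src, r8 = byte count, assumed >= 32
movups xmm0, [rdx] ; 16-byte head, unaligned load
movups xmm1, [rdx+r8-16] ; 16-byte tail, unaligned load
movups [rcx], xmm0
movups [rcx+r8-16], xmm1 ; head and tail done; overlaps rewrite identical bytes
sub rdx, rcx ; rdx = src - dest, so the source byte for [r9] is [r9+rdx]
lea r9, [rcx+16]
and r9, -16 ; first mod-16 boundary inside the destination
lea r10, [rcx+r8-16] ; stop line: the tail already covers everything beyond it
@@:
cmp r9, r10
jae done
movups xmm0, [r9+rdx] ; the source side generally stays unaligned
movdqa [r9], xmm0 ; the destination stores are aligned
add r9, 16
jmp @B
done:
ret
memcpy_ht endp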
AW,
Clock this one in your list. Memory must be aligned by the data size.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
NOSTACKFRAME
ymmcopya proc
; rcx = source address
; rdx = destination address
; r8 = byte count
mov r11, r8
shr r11, 5 ; div by 32 for loop count
xor r10, r10 ; zero r10 to use as index
lpst:
vmovntdqa ymm0, YMMWORD PTR [rcx+r10]
vmovntdq YMMWORD PTR [rdx+r10], ymm0
add r10, 32
sub r11, 1
jnz lpst
mov rax, r8 ; calculate remainder if any
and rax, 31
test rax, rax
jnz @F
ret
@@:
mov r9b, [rcx+r10] ; copy any remainder
mov [rdx+r10], r9b
add r10, 1
sub rax, 1
jnz @B
ret
ymmcopya endp
STACKFRAME
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
I can't test now, but non-temporal instructions tend to perform poorly on modern processors; that's the main reason Agner Fog's tests did not shine.
I will test when possible or you can do it directly from nidud's example.
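For reference, the cached-store variant differs only in the two moves of the inner loop; a sketch under the same register conventions as ymmcopya above (which of the two wins depends on how the copy size compares to the last-level cache):
lpst2:
vmovdqu ymm0, YMMWORD PTR [rcx+r10] ; ordinary cached load
vmovdqu YMMWORD PTR [rdx+r10], ymm0 ; ordinary cached store: usually faster,
; unless the copy is far larger than the last-level cache, where the
; non-temporal store avoids evicting useful data
add r10, 32
sub r11, 1
jnz lpst2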
Old thread, but can someone test this update of StrLen? I tried to port it to masm to be used in masmbasic, but couldn't do it.
This is an old routine JJ and I did a long time ago, but I updated it, since it had some small flaws when retrieving the length of unaligned strings and of strings that contain extra bytes at the end (immediately after the null terminator).
Now it seems to handle all situations, but I'm not sure about the speed.
RosAsm syntax:
Proc StrLen3:
Arguments @Input
Uses ecx
xorps xmm0 xmm0
mov ecx D@Input
L0:
movups xmm1 X$ecx | pcmpeqb xmm0 xmm1 | pmovmskb eax xmm0 ; if no zero byte is found, every byte of the pcmpeqb result is 0, so xmm0 is the zero comparand again on the next pass
add ecx 16 ; point to the next 16 bytes of our string
test ax ax | jz L0< ; did we find any zeroes? No: jump back and do it again. Otherwise fall through
sub ecx D@Input
add ecx 0-16
bsf ax ax
add eax ecx
EndP
Masm syntax
StrLen3 proc near
Input = dword ptr 8
push ebp
mov ebp, esp
push ecx
xorps xmm0, xmm0
mov ecx, [ebp+Input]
ZeroNotFound:
movups xmm1, xmmword ptr [ecx] ; xmmword ptr is valid MASM syntax (most assemblers also accept plain movups xmm1, [ecx])
pcmpeqb xmm0, xmm1
pmovmskb eax, xmm0
add ecx, 10h
test ax, ax
jz short ZeroNotFound
sub ecx, [ebp+Input]
add ecx, -16
bsf ax, ax
add eax, ecx
pop ecx
mov esp, ebp
pop ebp
retn 4
StrLen3 endp
Quote from: guga on August 17, 2023, 04:21:21 AM
[...] and of strings that contain extra bytes at the end (immediately after the null terminator)
Since when should this be a concern? Anything after the terminating NULL should just be ignored, right?
Not really. I was running a new test and found that the older routine sometimes has flaws when the string sits between other chains of bytes or is unaligned. This was happening for both the strlen and strlenw versions.
The problem was that when the end of the string (after the null terminator) contains extra data (also when the start of the string is unaligned), it could run past the null terminator and thus give a wrong result.
These issues are fixed in the newer version (the one I posted, plus the one about strlenw I'll post later), but I want to make sure the speed was not affected.
The former error happened more often with non-Latin strings (strlenW version) - I mean, strings in Chinese, Russian etc.
Some examples were things like this in the unicode version:
[UnicodeString5: B$ 1, 7, 13, 58, 12, 36, 77, 158, 18,
;U$ "Hello", 0
StringStart3: W$ 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 0
NextData3: W$ 7, 1, 1, 1, 1, 1, 1, 1]
[UnicodeString: U$ "Hello", 0]
[DummyBytes2: B$ 35, 47, 5, 5, 14, 5]
[< UnicodeStringUnaligned: B$ 1,
W$ 04f60, 0597d, 056fe, 06211, 04e00, 0
B$ 37]
And the former versions (that produced incorrect results in some circumstances) were like this:
unicode version
Proc StrLenW::
Arguments @Input
Uses ecx, edx
mov eax D@Input
mov ecx eax
and eax 0-16
and ecx 16-1
or edx 0-1
shl edx cl
xorps xmm0 xmm0
pcmpeqw xmm0 X$eax
add eax 16
pmovmskb ecx xmm0
xorps xmm0 xmm0
and ecx edx
jnz L2>
L1:
movups xmm1 X$eax
pcmpeqw xmm1 xmm0
pmovmskb ecx xmm1
add eax 16
test ecx ecx
jz L1<
L2:
bsf ecx ecx
lea eax D$eax+ecx-16
sub eax D@Input
shr eax 1
EndP
Ansi version
Proc StrLen::
Arguments @Input
Uses ecx, edx
mov eax D@Input
mov ecx eax
and eax 0-16
and ecx 16-1
or edx 0-1
shl edx cl
xorps xmm0 xmm0
pcmpeqb xmm0 X$eax
add eax 16
pmovmskb ecx xmm0
xorps xmm0 xmm0
and ecx edx
jnz L2>
L1:
movups xmm1 X$eax
pcmpeqb xmm1 xmm0
pmovmskb ecx xmm1
add eax 16
test ecx ecx
jz L1<
L2:
bsf ecx ecx
lea eax D$eax+ecx-16
sub eax D@Input
EndP
With the new fixes there should no longer be any unknown error resulting in a wrong value. It now works for unaligned strings, for strings embedded in chains of data, and with whatever codepage (in case we are using the strlenW version).
Quote from: guga on August 17, 2023, 04:21:21 AM
it had some small flaws when retrieving the length of unaligned strings and of strings that contain extra bytes at the end (immediately after the null terminator)
Hi Guga,
I'm not aware of such problems, can you post an example of a string that misbehaves, in Masm syntax?
There is a known issue with VirtualAlloc'ed strings (https://www.masmforum.com/board/index.php?topic=18728.msg159628;sslRedirect#msg159628).
Talking about missing people: rrr314159 hasn't posted anything on Quora since March 4. Not good. :eusa_boohoo:
Quote from: jj2007 on August 17, 2023, 06:55:55 AM
Quote from: guga on August 17, 2023, 04:21:21 AM
it had some small flaws when retrieving the length of unaligned strings and of strings that contain extra bytes at the end (immediately after the null terminator)
Hi Guga,
I'm not aware of such problems, can you post an example of a string that misbehaves, in Masm syntax?
There is a known issue with VirtualAlloc'ed strings (https://www.masmforum.com/board/index.php?topic=18728.msg159628;sslRedirect#msg159628).
Hi JJ, ok.... I'll post 2 examples: one now, so you see what's wrong (and my fix), and then I'll try to recreate the error in the Ansi version of the string length algo.
StrlenW - Unicode version_________________________________
.data
UnicodeString2 db 1 ; Yes...but as a byte to force it to be unaligned
text "UTF-16LE", 'Hello',0
NextData dw 9
dw 1
dw 1
dw 1
dw 1
dw 1
dw 1
dw 1
.code
(.................)
mov edi, offset UnicodeString2
inc edi ; make the string start exactly at "Hello", inside our data chain, unaligned
push edi
call StrLenW
add eax, 2
add edi, eax ; edi should now point past the null terminating word, i.e. to "NextData", but it points to the 2nd "l" in Hello
The function was (the one that returns incorrect values in some circumstances):
StrLenW proc near
Input = dword ptr 8
push ebp
mov ebp, esp
push ecx
push edx
mov eax, [ebp+Input]
mov ecx, eax
and eax, 0FFFFFFF0h
and ecx, 0Fh
or edx, 0FFFFFFFFh
shl edx, cl
movdqu xmm1, xmmword ptr [eax] ; note: this load is dead - xmm1 is overwritten in the loop before use
xorps xmm0, xmm0
pcmpeqw xmm0, xmmword ptr [eax] ; xmmword ptr is valid MASM syntax
add eax, 10h
pmovmskb ecx, xmm0
xorps xmm0, xmm0
and ecx, edx
jnz short loc_4040B0
loc_40409E:
movups xmm1, xmmword ptr [eax]
pcmpeqw xmm1, xmm0
pmovmskb ecx, xmm1
add eax, 10h
test ecx, ecx
jz short loc_40409E
loc_4040B0:
bsf ecx, ecx
lea eax, [ecx+eax-10h]
sub eax, [ebp+Input]
shr eax, 1
pop edx
pop ecx
mov esp, ebp
pop ebp
retn 4
StrLenW endp
In the example above I used Unicode text encapsulated between two pieces of data (1 byte before the string starts, and a sequence of words after it). This was to force a check for alignment problems.
The same problem happens if you replace "Hello" with another codepage, such as Chinese etc. Try replacing Hello with dw 04f60, 0597d, 056fe, 06211, 04e00, 0 and this kind of error will happen more often.
To fix that I changed the code to only this (masm syntax):
My fix
StrLenW2 proc near
AddBytes = dword ptr -4
Input = dword ptr 8
push ebp
mov ebp, esp
sub esp, 4
push ecx
mov [ebp+AddBytes], 0
xorps xmm0, xmm0
mov ecx, [ebp+Input]
movups xmm1, xmmword ptr [ecx]
pcmpeqw xmm0, xmm1
pmovmskb eax, xmm0
test ax, ax
jnz short loc_4040BA
loc_4040A1:
add ecx, 10h
movups xmm1, xmmword ptr [ecx]
pcmpeqw xmm0, xmm1
pmovmskb eax, xmm0
test ax, ax
jz short loc_4040A1
sub ecx, [ebp+Input]
mov [ebp+AddBytes], ecx
loc_4040BA:
and ax, 101010101010101b
bsf ax, ax
add eax, [ebp+AddBytes]
pop ecx
mov esp, ebp
pop ebp
retn 4
StrLenW2 endp
Of course, this can also be optimized (as I did for strlen), but I'll do that later, after I post the Ansi version and we check the speed. With this fix the Unicode version (strlenW) works correctly no matter whether the string is encapsulated, has extra bytes after the last null word, or is unaligned - and especially no matter what codepage you use; it even works if the string is empty. So it works for regular "Ascii+zero" Unicode and for other codepages (2 bytes as in Chinese, Russian etc.), or even for encodings that use 3 or 4 bytes to represent a char.
The resulting value in eax is the total amount of bytes used by the Unicode string (not the amount of chars). I made it return bytes instead of chars so we can also use it on strings in other codepages without being forced to convert with M$ APIs etc.
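So, when a character count is wanted for plain UCS-2 text, it is simply the byte count halved; a hypothetical call site (masm syntax, label name mine):
push offset SomeWideString
call StrLenW2 ; eax = length in bytes
shr eax, 1 ; eax = length in characters, valid only for fixed 2-byte text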
I'll now try to reproduce the error in the strlen (ansi) version; the error happens on the same principle, although it is a bit harder to find.
Note:
Below is the RosAsm version of StrLenW (unoptimized yet), with all the comments I made while trying to fix it.
Proc StrLenW2::
Arguments @Input
Local @AddBytes
Uses ecx
mov D@AddBytes 0 ; create a variable to store extra bytes found in the loop, added to the string length at the end
xorps xmm0 xmm0 ; zero register xmm0. Why zero? Because we are scanning words until we find the null terminating word (so 2 zero bytes).
; Since xmm0 holds 8 words (16 bytes), we later scan those 8 words for the double 0
mov ecx D@Input
movups xmm1 X$ecx ; Load the string onto xmm1 register
pcmpeqw xmm0 xmm1 ; compare the words in xmm1 against xmm0 (i.e. compare each word to 0, our double zero to be found)
; pcmpeqw treats the operands as vectors of words: wherever a word of xmm1 equals 0 (a double zero),
; all bits of the corresponding result word are set. So each result word becomes 0FFFF where a zero word was found, or 0 where not.
; At this point, if the terminating word falls within these first 8 words, no matter where it is located,
; it will show up in one of the 8 words. Like this: say we have the unicode string "Hello", 0
; On xmm1 it will be displayed as:
; Word7 Word6 Word5 Word4 Word3 Word2 Word1 Word0 ==> (bits 128 - 0)
; 0001 0007 0000 006F 006C 006C 0065 0048 ==> 0001 and 0007 are extra data after the end of the string (which may or may not exist, useless anyway), followed by 'olleH'
; ? ? Ok o l l e H
; (zero)
; After pcmpeqw xmm0 will become:
; Word7 Word6 Word5 Word4 Word3 Word2 Word1 Word0 ==> (bits 128 - 0)
; 0000 0000 FFFF 0000 0000 0000 0000 0000 ==> our null terminating word is at the 6th word = "Word5", and all bits of that word were set: 0FFFF = 1111_1111__1111_1111
; which means the null terminating word is at the 6th position (in words). So, in theory, we already found our len (in chars): 6-1 = 5. Ok?
; We now know the size of our string, but it is expressed as a word position, and we need the amount of bytes. So, let's continue.
pmovmskb eax xmm0 ; PMOVMSKB builds a mask from the most significant bit of each byte of xmm0 (the src) and stores it in eax. But hold on... what does that mean?
; It means we are simply transposing the position of our null terminating word from xmm0 to eax. But wait... xmm registers are 128 bits, and eax is only 32.
; So what's going on? pmovmskb maps each byte of xmm0 to one bit of eax, so the 0FFFF word (16 bits set) at the 6th word position
; shows up as a pair of set bits (11 in binary = 3 in decimal) at the matching position of the 16-bit result;
; therefore eax becomes 0C00h = 0000_1100__0000_0000. Each word of xmm0 corresponds to two bits of eax:
; Half Half Half Half Half Half Half Half
; Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1 Byte0 ==> (bits 16 - 0)
; 00 00 11 00 00 00 00 00 <==== bits = 00__0000_1100__0000_0000 (binary) = 0C00 (in hexa)
; So all we need is the value in ax (low part of eax); the high part is always 0, because pmovmskb on an XMM source sets only 16 bits.
; Whatever the value here is, the maximum for eax is 0FFFFh.
test ax ax | jnz L1> ; test against 0, because eax carries the length information; the sequence stored in xmm0 covers only 8 words.
; In fact, ax is all we need, since pmovmskb stores the result in the low word of eax and clears the high word.
; If any of those 8 words is the null terminating word (double zero), the mask holds 0FFFF there, which pmovmskb turns into a pair of set bits (11 binary).
; But if the 8 words contain no zero, xmm0 holds only 0, and therefore eax is 0 too (no pair of bits set).
; In other words: whenever the terminator falls within the first 8 words, the routine jumps over to "L1" and performs the actual computation of the bytes the string has.
; Otherwise we continue to the next line, add 16 to the pointer (to examine the next 16 bytes) and loop until the double 0 is found.
L0:
add ecx 16 ; point to the next 16 bytes of our string
movups xmm1 X$ecx | pcmpeqw xmm0 xmm1 | pmovmskb eax xmm0
test ax ax | jz L0< ; Did we found any double 0 in eax ? No, jmp back and do it again. Otherwise, go to the next line
sub ecx D@Input ; Now something easier. Since eax holds the position within this last chain of 16 bytes, all we need to do is add the bytes
; already walked. To do that, we subtract the start of the string (Input) from the current location in ecx; that gives the amount of bytes covered so far, stored in AddBytes
mov D@AddBytes ecx
L1:
;movzx eax ax ; On eax = 0 means we have reached the limit of 7 bytes
; Now comes the trick. At this point eax (or better speaking, ax) contains the position (in bytes) of the double zero: in our example, the pair of bits at Half Byte5 (bits 10 and 11).
; Since we are dealing with words, we ended up with 2 bits set per match. Our 1st zero byte corresponds to the 1st bit of the pair, so we can safely discard the second bit (from right to left). Doing this we will be able to identify
; the least significant bit, which corresponds to the index (len) of our string.
; So, for each pair of bits we simply AND with 01 (in binary), because we want to discard the 2nd (most significant) bit of each pair. Since we have 8 "half bytes", we AND all 8 pairs at once with:
; 01_01_01_01__01_01_01_01 (in binary) = 05555 (in hexa)
and ax 00_01_01_01_01__01_01_01_01
; Now that we got rid of the upper bit of each pair, only the least significant bits remain (the 1st ones to be checked). So, after the "and", ax results in 0400 (hexa) = 00__0000_0100__0000_0000 (in binary),
; which corresponds exactly to:
; Half Half Half Half Half Half Half Half
; Byte7 Byte6 Byte5 Byte4 Byte3 Byte2 Byte1 Byte0 ==> (bits 15 - 0)
; 00 00 01 00 00 00 00 00 <==== bits = 00__0000_0100__0000_0000 (binary) = 0400 (in hexa)
; Ok, now the remaining bit is at the low bit of Half Byte5, i.e. bit 10, which is precisely the length in bytes: 5 words = 10 bytes. Since we removed the upper bit of each pair, all we need to do is scan (search) the position of that bit with bsf.
; Whatever position the bit was set at, it is always the 1st (lower) one of its pair; therefore it will never land on an odd position, so we don't need to align the result with a shr eax 1 | shl eax 1 pair or an and eax 0FFFF_FFFE to remove bit0.
; That's because we are, in the end, counting byte positions. Could we do it with and ax 10_10_10_10__10_10_10_10 instead? Yes, but that would result in odd values and then we would need to fix them with and eax 0FFFF_FFFE or with shr eax 1 | shl eax 1.
bsf ax ax ; got the byte offset of the terminator within this chunk of words from xmm1 (stored there earlier, remember?). We don't need bsf eax because all the information is in ax. So, perhaps, bsf ax ax is faster.
; Consider doing it later with lzcnt or popcnt
add eax D@AddBytes ; and now we simply add whatever bytes we counted before in the loop (if we entered the loop at all) to this resulting value in eax, which gives the correct length of the string in bytes.
EndP
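For reference, here is a minimal C sketch of the same scan using SSE2 intrinsics - just an illustration of the steps above, not the routine itself, and it assumes a GCC/Clang compiler for __builtin_ctz (the bsf step). _mm_cmpeq_epi16 is pcmpeqw, _mm_movemask_epi8 is pmovmskb, and the 0x5555 mask is the one discussed above. Like the asm, it may read up to 15 bytes past the terminator.
#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>

// Illustrative sketch only: word-wise scan, returns the length in BYTES.
size_t StrLenW_bytes_sketch(const uint16_t *str)
{
    const char *p = (const char *)str;
    const __m128i zero = _mm_setzero_si128();
    for (;; p += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p); // movups
        __m128i eq    = _mm_cmpeq_epi16(chunk, zero);        // pcmpeqw: FFFF where a word is 0
        int mask = _mm_movemask_epi8(eq);                    // pmovmskb: one bit per byte
        if (mask) {
            mask &= 0x5555;                                  // keep the 1st bit of each pair
            return (size_t)(p - (const char *)str)           // bytes skipped in full chunks
                 + (unsigned)__builtin_ctz(mask);            // bsf: offset within this chunk
        }
    }
}
A shr eax 1 (here: >> 1) at the end would turn the byte count into chars, as discussed further down.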
Hi Guga,
Your string behaves just fine with wLen() resp. MbStrLenW - see attachment. Are you using an older algo?
Quote from: jj2007 on August 17, 2023, 08:39:56 AM
Hi Guga,
Your string behaves just fine with wLen() resp. MbStrLenW - see attachment. Are you using an older algo?
I don't know, perhaps. I haven't touched the strlen and strlenw algos in a long time. But what are the correct/updated algos? I can't find them in MasmBasic. I saw the source you uploaded but the algos aren't there (in masm syntax).
Quote from: guga on August 17, 2023, 09:20:47 AM
But what are the correct/updated algos?
See your PM :azn:
Btw it seems we never timed the Unicode version of StrLen, here it is:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
23327 cycles for 100 * CRT wcslen
20836 cycles for 100 * Masm32 ucLen
9566 cycles for 100 * MasmBasic wLen
8981 cycles for 100 * _MbStrLenW
23340 cycles for 100 * CRT wcslen
20809 cycles for 100 * Masm32 ucLen
9566 cycles for 100 * MasmBasic wLen
8988 cycles for 100 * _MbStrLenW
23284 cycles for 100 * CRT wcslen
20852 cycles for 100 * Masm32 ucLen
9157 cycles for 100 * MasmBasic wLen
9003 cycles for 100 * _MbStrLenW
23262 cycles for 100 * CRT wcslen
20883 cycles for 100 * Masm32 ucLen
9586 cycles for 100 * MasmBasic wLen
8992 cycles for 100 * _MbStrLenW
14 bytes for CRT wcslen
10 bytes for Masm32 ucLen
10 bytes for MasmBasic wLen
66 bytes for _MbStrLenW
100 = eax CRT wcslen
100 = eax Masm32 ucLen
100 = eax MasmBasic wLen
100 = eax _MbStrLenW
Hi JJ. Tks, indeed I was using an older version. But can you test it so we can compare the speed?
Quote from: guga on August 17, 2023, 04:12:03 PM
Hi JJ. Tks, indeed I was using an older version. But can you test it so we can compare the speed?
Hi Guga,
IMHO your algo is fast enough :biggrin:
You beat poor CRT by a factor of 9, but ok, as Assembly programmers we are used to that, aren't we?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
23713 cycles for 100 * CRT wcslen
20882 cycles for 100 * Masm32 ucLen
9471 cycles for 100 * MasmBasic wLen
9030 cycles for 100 * _MbStrLenW
9099 cycles for 100 * _MbStrLenW2
2781 cycles for 100 * StrLenW Guga
23718 cycles for 100 * CRT wcslen
20855 cycles for 100 * Masm32 ucLen
9555 cycles for 100 * MasmBasic wLen
9012 cycles for 100 * _MbStrLenW
9014 cycles for 100 * _MbStrLenW2
2696 cycles for 100 * StrLenW Guga
23641 cycles for 100 * CRT wcslen
20836 cycles for 100 * Masm32 ucLen
9541 cycles for 100 * MasmBasic wLen
9023 cycles for 100 * _MbStrLenW
9010 cycles for 100 * _MbStrLenW2
2699 cycles for 100 * StrLenW Guga
23658 cycles for 100 * CRT wcslen
20844 cycles for 100 * Masm32 ucLen
9435 cycles for 100 * MasmBasic wLen
9008 cycles for 100 * _MbStrLenW
9006 cycles for 100 * _MbStrLenW2
2701 cycles for 100 * StrLenW Guga
14 bytes for CRT wcslen
10 bytes for Masm32 ucLen
10 bytes for MasmBasic wLen
66 bytes for _MbStrLenW
66 bytes for _MbStrLenW2
58 bytes for StrLenW Guga
100 = eax CRT wcslen
100 = eax Masm32 ucLen
100 = eax MasmBasic wLen
100 = eax _MbStrLenW
100 = eax _MbStrLenW2
100 = eax StrLenW Guga
Tks a lot, JJ
It seems fast, indeed :greenclp: :greenclp: :greenclp:
I'll now do a few more tests on the strlen version (the Ansi one) and finish some comments in the code, so I won't forget what I've done :biggrin: :biggrin: :biggrin:
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
16116 cycles for 100 * CRT wcslen
25196 cycles for 100 * Masm32 ucLen
7524 cycles for 100 * MasmBasic wLen
7574 cycles for 100 * _MbStrLenW
7405 cycles for 100 * _MbStrLenW2
2568 cycles for 100 * StrLenW Guga
16085 cycles for 100 * CRT wcslen
25030 cycles for 100 * Masm32 ucLen
7537 cycles for 100 * MasmBasic wLen
7419 cycles for 100 * _MbStrLenW
7424 cycles for 100 * _MbStrLenW2
2545 cycles for 100 * StrLenW Guga
16308 cycles for 100 * CRT wcslen
25578 cycles for 100 * Masm32 ucLen
7614 cycles for 100 * MasmBasic wLen
7419 cycles for 100 * _MbStrLenW
7456 cycles for 100 * _MbStrLenW2
2573 cycles for 100 * StrLenW Guga
16086 cycles for 100 * CRT wcslen
25075 cycles for 100 * Masm32 ucLen
7600 cycles for 100 * MasmBasic wLen
7447 cycles for 100 * _MbStrLenW
7396 cycles for 100 * _MbStrLenW2
2547 cycles for 100 * StrLenW Guga
14 bytes for CRT wcslen
10 bytes for Masm32 ucLen
10 bytes for MasmBasic wLen
66 bytes for _MbStrLenW
66 bytes for _MbStrLenW2
58 bytes for StrLenW Guga
100 = eax CRT wcslen
100 = eax Masm32 ucLen
100 = eax MasmBasic wLen
100 = eax _MbStrLenW
100 = eax _MbStrLenW2
100 = eax StrLenW Guga
--- ok ---
Btw, you may use it in MasmBasic if you wish, and others can use it too, of course :)
I needed to finish those routines to continue updating them in RosAsm. During the updates I found those minor errors on both functions that required fixes. I hope I can finish the updates (in RosAsm) soon to release the next version, but it's a bit hard because I made several changes in the RosAsm code to try to make it faster and more independent of the interface, and accidentally broke a thing or two, especially in the resources editor (I'm trying to fix it right now). A hell, a true hell to do. :bgrin: :bgrin: :bgrin:
One question: in your optimization, you returned eax as chars instead of bytes. Considering that the new StrLenW function can be used with other codepages (Chinese, Russian, Greek strings etc.), and also with encodings that may use 3 or 4 bytes to represent a character, is returning chars in eax better than bytes? What if the user's codepage contains UTF8 (or another encoding that uses more than 2 bytes per char)?
https://www.ibm.com/docs/en/db2-for-zos/11?topic=unicode-utfs
Quote from: guga on August 17, 2023, 11:45:44 PM
One question: in your optimization, you returned eax as chars instead of bytes.
Practically all Windows functions require chars, not bytes. For Utf8, you can get chars with uLen() (https://www.jj2007.eu/MasmBasicQuickReference.htm).
Quote from: jj2007 on August 18, 2023, 12:14:46 AM
Practically all Windows functions require chars, not bytes. For Utf8, you can get chars with uLen() (https://www.jj2007.eu/MasmBasicQuickReference.htm).
Hi JJ. Ok, I'll change it to chars, then. One question... in your version, did you need to detect the codepage?
Len, wLen, uLen get string length
mov ebx, Len(My$)
mov eax, Len("123") ; Len returns 3 in eax, so Olly will show mov eax, eax
void Len("123") ; put the result into eax (void avoids the mov eax, eax)
mov eax, wLen(MyUnicode$) ; wideLen for Unicode strings
void wLen(wChr$("123")) ; returns 3 chars in eax
Let esi="Добро пожаловатьäöü" ; assign a UTF-8 string
uPrint "[Добро пожаловатьäöü]" ; or print it directly
Print Str$(" is %i characters long (expected: 19)\n", uLen(esi))
Rem returns length in eax; bytes for ANSI, chars for Unicode; for UTF-8 encoded strings, Len returns bytes, uLen returns chars
I mean detecting whether the sequence of bytes before the null terminator belongs to UTF8, UTF7, UTF32 etc.? Maybe we can make a small update at the end of StrLenW to determine whether the string is UTF8 using 2, 3 or 4 bytes per character.
A single line after the last instruction at "add eax ecx" may be enough, with a flag perhaps.
Something like: if the result must come from UTF7 (3 bytes per char), we divide the result by 3; if UTF32, by 4, and so on. So, add another parameter to the function, used as a flag, perhaps.
add eax ecx
test Flag UTF7 | je L0>  ; UTF7 flag not set ? Then check UTF32 next. Otherwise, fall through and divide the result by 3
mov edx 0AAAAAAAB
mul edx                  ; edx:eax = eax * 0AAAAAAAB ; edx now holds the high dword
shr edx 1
mov eax edx              ; eax = original eax / 3 (reciprocal multiplication)
jmp L2>
L0:
test Flag UTF32 | je L1> ; UTF32 flag not set ? Then keep as Unicode. Otherwise, divide by 4
shr eax 2
jmp L2>
L1: ; else keep as Unicode (2 bytes per char)
shr eax 1
L2:
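For the record, the 0AAAAAAAB constant above is the standard reciprocal-multiplication trick for an unsigned divide by 3: 0AAAAAAAB = ceil(2^33 / 3), so the high 32 bits of the product (edx after mul), shifted right once, give eax/3 exactly for any 32-bit input. A tiny C check of the constant - my own snippet, not part of the proposal:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// high32(n * 0xAAAAAAAB) >> 1 == n / 3 for every 32-bit unsigned n
static uint32_t div3(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xAAAAAAABu; // mul edx
    return (uint32_t)(product >> 32) >> 1;        // take edx, then shr edx 1
}

int main(void)
{
    for (uint64_t n = 0; n <= 0xFFFFFFFFu; n += 97)  // sample the whole range
        assert(div3((uint32_t)n) == (uint32_t)n / 3);
    assert(div3(0xFFFFFFFFu) == 0xFFFFFFFFu / 3);
    puts("0AAAAAAAB divide-by-3 verified");
    return 0;
}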
Is it worthwhile extending the algo to handle UTF7 or even UTF32, or is it unnecessary for general usage? I'm asking because I know that UTF7 and UTF32 do exist, but I don't know whether they are commonly used in Windows (or other systems).
If it won't be useful, I'll do as you said and use only the
shr eax 1 at the end to return chars instead of bytes.
Quote from: guga on August 18, 2023, 12:54:26 AM
in your version, did you need to detect the codepage?
Utf8 is basically Ansi with some tricks. For most uses of Instr() or Len(), like parsing etc., the plain Ansi functions are appropriate. Only in rare cases do you need to know the chars - that's when uLen jumps in. There is no codepage test in this sense; rather, you tell it explicitly "treat this string as Utf8".
Quote
Is it worthwhile extending the algo to handle UTF7 or even UTF32, or is it unnecessary for general usage?
I wouldn't invest any effort. I've never seen these formats in the wild.
Ok, I guess I understand now. I'll do as you said then.
I'll keep StrLenW returning bytes as it already does (since that is more useful for parsing etc.) and create a variation returning chars (with shr eax 1 after the add eax ecx), named UstrLen, UCharLen or something like that.
Quote from: guga on August 18, 2023, 01:34:53 AM
Ok, I guess I understand now. I'll do as you said then.
I'll keep StrLenW returning bytes as it already does (since that is more useful for parsing etc.) and create a variation returning chars (with shr eax 1 after the add eax ecx), named UstrLen, UCharLen or something like that.
Attention: Windows expects chars from "W" variants. Many WinAPI string functions have two variants, "A" and "W". Both expect chars, but in the case of the Ansi variants chars=bytes.
I use uLen() to indicate Utf8 chars, which are fewer than the bytes because one char can be composed of one, two or three bytes.
include \masm32\MasmBasic\MasmBasic.inc
Init
Let esi="Нажмите на эту кнопку" ; click on this button
mov ecx, uLen(esi)
Print Str$("The string is %i bytes long", Len(esi)), Str$(", but has %i chars\n", ecx)
PrintLine "[", uLeft$(esi, 7), "] - correct, 7 chars"
PrintLine "[", Left$(esi, 7), "] - incorrect, 7 bytes"
uMsgBox 0, esi, "Hi", MB_OK
EndOfCode
Output:
The string is 39 bytes long, but has 21 chars
[Нажмите] - correct, 7 chars
[Наж�] - incorrect, 7 bytes
Hi,
It is my old StrLenW. :biggrin:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Align 16
db 7 dup (90h)
StrLenW_Lingo proc
movdqu xmm1, [eax]
pxor xmm0, xmm0
pcmpeqw xmm1, xmm0
pmovmskb ecx, xmm1
test ecx, ecx
jne @L_End
mov edx, eax
and eax, -16
@@:
pcmpeqw xmm0, [eax+16]
pcmpeqw xmm1, [eax+32]
por xmm1, xmm0
add eax, 32
pmovmskb ecx, xmm1
test ecx, ecx
jz @b
shl ecx, 16
sub eax, edx
pmovmskb edx, xmm0
add ecx, edx
; and cx,5555h
bsf ecx, ecx
lea eax, [eax+ecx-16]
ret
@L_End:
; and cx,5555h
bsf eax, ecx
ret
StrLenW_Lingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Hi Lingo
It's very similar. I remember it was based on one JJ and I did a long time ago for strlen; perhaps a variation of yours, I don't remember.
This one I adapted from strlen, and now I've updated it because it had flaws with unaligned strings and UTF16 (Chinese etc.). Update yours to this too:
StrLenW proc near
InputString = dword ptr 8
push ebp
mov ebp, esp
push ecx
xorps xmm0, xmm0
mov ecx, [ebp+InputString]
ZeroNotFound:
movups xmm1, xmmword ptr [ecx]
pcmpeqw xmm0, xmm1
add ecx, 16
pmovmskb eax, xmm0
test ax, ax
jz short ZeroNotFound
sub ecx, [ebp+InputString]
add ecx, -16
and ax, 101010101010101b
bsf ax, ax
add eax, ecx
shr eax, 1
pop ecx
mov esp, ebp
pop ebp
retn 4
StrLenW endp
It works for all codepages that use 2 bytes to create a char.
If you want one that can be used for parsing and returns the length in bytes, you can use this as well:
UniStrLen proc near
InputString = dword ptr 8
push ebp
mov ebp, esp
push ecx
xorps xmm0, xmm0
mov ecx, [ebp+InputString]
ZeroNotFound:
movups xmm1, xmmword ptr [ecx]
pcmpeqw xmm0, xmm1
add ecx, 16
pmovmskb eax, xmm0
test ax, ax
jz short ZeroNotFound
sub ecx, [ebp+InputString]
add ecx, -16
and ax, 101010101010101b
bsf ax, ax
add eax, ecx
pop ecx
mov esp, ebp
pop ebp
retn 4
UniStrLen endp
Or this one for Ansi:
StrLen proc near
InputString = dword ptr 8
push ebp
mov ebp, esp
push ecx
xorps xmm0, xmm0
mov ecx, [ebp+InputString]
ZeroNotFound:
movups xmm1, xmmword ptr [ecx]
pcmpeqb xmm0, xmm1
add ecx, 16
pmovmskb eax, xmm0
test ax, ax
jz short ZeroNotFound
sub ecx, [ebp+InputString]
add ecx, -16
bsf ax, ax
add eax, ecx
pop ecx
mov esp, ebp
pop ebp
retn 4
StrLen endp
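Side note on why the Ansi variant is simpler: pcmpeqb already yields one mask bit per byte, so there is no bit pair to strip (no and ax, 101010101010101b) and no final shr. A minimal C sketch of that contrast - again just an illustration with SSE2 intrinsics (assuming GCC/Clang for __builtin_ctz), not the proc above:
#include <emmintrin.h>
#include <stddef.h>

// Byte-wise (Ansi) scan: one mask bit per byte, so bsf gives the length directly.
size_t StrLen_bytes_sketch(const char *str)
{
    const __m128i zero = _mm_setzero_si128();
    for (const char *p = str;; p += 16) {
        __m128i eq = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)p), zero);
        int mask = _mm_movemask_epi8(eq);         // pmovmskb
        if (mask)                                 // no 5555h mask, no shr needed
            return (size_t)(p - str) + (unsigned)__builtin_ctz(mask);
    }
}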
Don't know if your old version had those fixes, but feel free to update them :thumbsup: :thumbsup: :thumbsup:
Quote from: lingo on August 19, 2023, 12:11:46 AM
It is my old StrLenW. :biggrin:
IMHO it should return 20, not 39:
The string: [UTF-16LE Hello World]
wLen(edi) eax 20
MbSLenW eax 20
StrLenW_Lingo eax 39
align 16
dummy db 7 dup(0) ; force unaligned
_dw UnicodeString2, "UTF-16LE ", "Hello World", 0
Quote from: jj2007 on August 19, 2023, 04:52:41 AM
Quote from: lingo on August 19, 2023, 12:11:46 AM
It is my old StrLenW. :biggrin:
IMHO it should return 20, not 39:
The string: [UTF-16LE Hello World]
wLen(edi) eax 20
MbSLenW eax 20
StrLenW_Lingo eax 39
align 16
dummy db 7 dup(0) ; force unaligned
_dw UnicodeString2, "UTF-16LE ", "Hello World", 0
Hi JJ
I don't get it. Why 20? Wouldn't it be supposed to return 11 for "Hello World"? (11 chars, not counting the 2 null terminator bytes?)
So, 11*2 = 22 bytes to represent the Unicode text, plus 2 more for the extra 2 zeroes?
The string: [UTF-16LE Hello World]
Count them, it's 20 chars.
OH! I see, it's the whole string "UTF-16LE Hello World" and not only "Hello World" written in UTF-16LE. I thought that was a MasmBasic token and not part of the string itself :biggrin: :biggrin: :biggrin: :biggrin:
Indeed, it should return 20 chars and not 39
It is my speed test with MichaelW's simple test. :biggrin:
C:\Documents\Visual Studio 2022\Projects\SpeedTest>c2test2
8 cycles -> StrLenW_Guga ANSI, Return in EAX: 100
6 cycles -> StrLenW_Lingo ANSI, Return in EAX: 100
23 cycles -> StrLenW_Guga No SAR, Return in EAX: 200
13 cycles -> StrLenW_Lingo No SAR, Return in EAX: 200
8 cycles -> StrLenW_Guga with SAR, Return in EAX: 34
5 cycles -> StrLenW_Lingo with SAR, Return in EAX: 34
Press enter to exit...
Source in attachment.
Hi guys,
You had a lot of fun yesterday :badgrin: :badgrin: :badgrin: :badgrin:
and now, after a more than 70% speed difference... absolute silence :shhh: ... he-he-he..
or the sound of silence... Nice song https://www.youtube.com/watch?v=9O9DaZUS_EU
I hadn't looked at it, Lingo, but it's indeed almost 27% faster than Guga's algo :thumbsup:
(and it returns a wrong result, but never mind, we are used to that :biggrin: )
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
23633 cycles for 100 * CRT wcslen
9029 cycles for 100 * _MbStrLenW2
2702 cycles for 100 * StrLenW Guga
1987 cycles for 100 * StrLenW_LingoNoSar
23663 cycles for 100 * CRT wcslen
9020 cycles for 100 * _MbStrLenW2
2698 cycles for 100 * StrLenW Guga
1997 cycles for 100 * StrLenW_LingoNoSar
23657 cycles for 100 * CRT wcslen
9026 cycles for 100 * _MbStrLenW2
2704 cycles for 100 * StrLenW Guga
1986 cycles for 100 * StrLenW_LingoNoSar
23645 cycles for 100 * CRT wcslen
9021 cycles for 100 * _MbStrLenW2
2704 cycles for 100 * StrLenW Guga
1986 cycles for 100 * StrLenW_LingoNoSar
14 bytes for CRT wcslen
66 bytes for _MbStrLenW2
58 bytes for StrLenW Guga
98 bytes for StrLenW_LingoNoSar
100 = eax CRT wcslen
100 = eax Masm32 ucLen
100 = eax _MbStrLenW2
100 = eax StrLenW Guga
199 = eax StrLenW_LingoNoSar
Btw it sucks a bit for short strings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3854 cycles for 100 * CRT wcslen
1549 cycles for 100 * _MbStrLenW2
695 cycles for 100 * StrLenW Guga
684 cycles for 100 * StrLenW_LingoNoSar
618 cycles for 100 * MbSLenW
3853 cycles for 100 * CRT wcslen
1547 cycles for 100 * _MbStrLenW2
695 cycles for 100 * StrLenW Guga
691 cycles for 100 * StrLenW_LingoNoSar
619 cycles for 100 * MbSLenW
3850 cycles for 100 * CRT wcslen
1550 cycles for 100 * _MbStrLenW2
719 cycles for 100 * StrLenW Guga
683 cycles for 100 * StrLenW_LingoNoSar
668 cycles for 100 * MbSLenW
3862 cycles for 100 * CRT wcslen
1549 cycles for 100 * _MbStrLenW2
720 cycles for 100 * StrLenW Guga
684 cycles for 100 * StrLenW_LingoNoSar
645 cycles for 100 * MbSLenW
14 bytes for CRT wcslen
66 bytes for _MbStrLenW2
58 bytes for StrLenW Guga
102 bytes for StrLenW_LingoNoSar
50 bytes for MbSLenW
14 = eax CRT wcslen
14 = eax Masm32 ucLen
14 = eax _MbStrLenW2
14 = eax StrLenW Guga
14 = eax StrLenW_LingoNoSar
14 = eax MbSLenW
Guga, Unicode also has game things like chess pieces; memcopy from a library that solves checkmate in 3 moves.
If you read the text from a file, don't you already get len = filesize? Isn't there an API to check whether a text file is 16-bit Unicode or not?
Quote
Hi Guga,
IMHO your algo is fast enough :biggrin:
You beat poor CRT by a factor of 9, but ok, as Assembly programmers we are used to that, aren't we?
Again envy or incompetence about speed optimization of code...
Quote
I hadn't looked at it, Lingo, but it's indeed almost 27% faster than Guga's algo
Another fake manipulation attempt against Lingo:
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (SSE4)
12358 cycles for 100 * CRT wcslen
9124 cycles for 100 * _MbStrLenW2
2325 cycles for 100 * StrLenW Guga
1572 cycles for 100 * StrLenW_LingoNoSar
1572/2325*100 = 67.6%
12426 cycles for 100 * CRT wcslen
9142 cycles for 100 * _MbStrLenW2
2279 cycles for 100 * StrLenW Guga
1576 cycles for 100 * StrLenW_LingoNoSar
1576/2279*100 = 69.1%
12360 cycles for 100 * CRT wcslen
9115 cycles for 100 * _MbStrLenW2
2352 cycles for 100 * StrLenW Guga
1579 cycles for 100 * StrLenW_LingoNoSar
1579/2352*100 = 67.1%
12264 cycles for 100 * CRT wcslen
9106 cycles for 100 * _MbStrLenW2
2323 cycles for 100 * StrLenW Guga
1575 cycles for 100 * StrLenW_LingoNoSar
1575/2323*100 = 67.8%
14 bytes for CRT wcslen
66 bytes for _MbStrLenW2
58 bytes for StrLenW Guga
98 bytes for StrLenW_LingoNoSar
100 = eax CRT wcslen
100 = eax Masm32 ucLen
100 = eax _MbStrLenW2
100 = eax StrLenW Guga
199 = eax StrLenW_LingoNoSar
--- ok ---
Quote
and it returns a wrong result, but never mind, we are used to that
Another attempt at fake manipulation, and even an insult against Lingo.
You are comparing two different things: StrLenW_LingoNoSar,
without SAR eax,1 at the end of the algo,
with StrLenW_Guga,
WITH SAR eax,1 at the end of the algo.
It is normal to have a factor-of-2 difference in your fake manipulated test.
Next time use LenStr_LingoWsar!
Pls, see the attachment and
Don't Teach Father How To Make Babies!
Quote
Btw it sucks a bit for short strings:
You'll get a lot more sucks if you test it with longer strings! :badgrin: :skrewy:
It is also from your fake "test":
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Speed Differences:
1987/2702*100=73.5%
1997/2698*100=74%
1986/2704*100=73.4%
1986/2704*100=73.4%
Hi Lingo, I'll test again later.
One question only: you use align 16 on the input strings, right? How does the algo behave (in terms of speed/accuracy) for unaligned strings?
Did you test those cases?
Thank you Guga, :biggrin:
You can use my test (it is from MichaelW, written in plain MASM32).
So, you can change what you want in it and recompile it again with makeit.bat. :thumbsup:
Quote from: lingo on August 21, 2023, 06:34:07 AM
Thank you Guga, :biggrin:
You can use my test (it is from MichaelW, written in plain MASM32).
So, you can change what you want in it and recompile it again with makeit.bat. :thumbsup:
Tks, Lingo. I'll give it a try later. These functions are good when we build them to work in as many scenarios as possible. That's why I tested with aligned/unaligned strings and also with other data before or after the string (as when a string belongs to another structure or chain of data, etc.).
I'm currently working on a way to speed up (and build) functions that change the case of strings. JJ is helping with this. It's a bit harder than I expected, but I managed to make it work (for non-Latin chars or Unicode strings, at the moment). Later I'll see how to optimize it and then come back for more tests on strlen and StrLenW.
It's beyond me why M$ never did something like this focused on speed. Analyzing some of the APIs inside msvcrt, they are a true mess. For example, even their libm_sse2_log_precise function, which was supposed to be accurate and faster, is slower than the ones we build here. :dazzled:
Quote from: lingo on August 21, 2023, 05:00:49 AM
Speed Differences:
1987/2702*100=73.5%
1997/2698*100=74%
1986/2704*100=73.4%
1986/2704*100=73.4%
Your algo is 26.5% faster: 1-1987/2702=26.5%
"Don't Teach Father How To Make Babies!" is far below your intellectual level, stop it.
Quote from: jj2007 on August 17, 2023, 11:35:27 AM
Btw it seems we never timed the Unicode version of StrLen, here it is:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
23327 cycles for 100 * CRT wcslen
20836 cycles for 100 * Masm32 ucLen
9566 cycles for 100 * MasmBasic wLen
8981 cycles for 100 * _MbStrLenW
23340 cycles for 100 * CRT wcslen
20809 cycles for 100 * Masm32 ucLen
9566 cycles for 100 * MasmBasic wLen
8988 cycles for 100 * _MbStrLenW
23284 cycles for 100 * CRT wcslen
20852 cycles for 100 * Masm32 ucLen
9157 cycles for 100 * MasmBasic wLen
9003 cycles for 100 * _MbStrLenW
23262 cycles for 100 * CRT wcslen
20883 cycles for 100 * Masm32 ucLen
9586 cycles for 100 * MasmBasic wLen
8992 cycles for 100 * _MbStrLenW
14 bytes for CRT wcslen
10 bytes for Masm32 ucLen
10 bytes for MasmBasic wLen
66 bytes for _MbStrLenW
100 = eax CRT wcslen
100 = eax Masm32 ucLen
100 = eax MasmBasic wLen
100 = eax _MbStrLenW
Using StrLenUnicode.zip
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (SSE4)
20177 cycles for 100 * CRT wcslen
19941 cycles for 100 * Masm32 ucLen
11274 cycles for 100 * MasmBasic wLen
9125 cycles for 100 * _MbStrLenW
20273 cycles for 100 * CRT wcslen
19816 cycles for 100 * Masm32 ucLen
11184 cycles for 100 * MasmBasic wLen
9052 cycles for 100 * _MbStrLenW
20202 cycles for 100 * CRT wcslen
19791 cycles for 100 * Masm32 ucLen
11175 cycles for 100 * MasmBasic wLen
9172 cycles for 100 * _MbStrLenW
20217 cycles for 100 * CRT wcslen
19862 cycles for 100 * Masm32 ucLen
11273 cycles for 100 * MasmBasic wLen
9136 cycles for 100 * _MbStrLenW
14 bytes for CRT wcslen
10 bytes for Masm32 ucLen
10 bytes for MasmBasic wLen
66 bytes for _MbStrLenW
100 = eax CRT wcslen
100 = eax Masm32 ucLen
100 = eax MasmBasic wLen
100 = eax _MbStrLenW
Using StrLenUtf16.zip
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (SSE4)
21137 cycles for 100 * CRT wcslen
20385 cycles for 100 * Masm32 ucLen
11727 cycles for 100 * MasmBasic wLen
9451 cycles for 100 * _MbStrLenW
9436 cycles for 100 * _MbStrLenW2
2292 cycles for 100 * StrLenW Guga
21140 cycles for 100 * CRT wcslen
20491 cycles for 100 * Masm32 ucLen
11686 cycles for 100 * MasmBasic wLen
9477 cycles for 100 * _MbStrLenW
9271 cycles for 100 * _MbStrLenW2
2354 cycles for 100 * StrLenW Guga
21147 cycles for 100 * CRT wcslen
20346 cycles for 100 * Masm32 ucLen
11686 cycles for 100 * MasmBasic wLen
9238 cycles for 100 * _MbStrLenW
9404 cycles for 100 * _MbStrLenW2
2333 cycles for 100 * StrLenW Guga
20655 cycles for 100 * CRT wcslen
20393 cycles for 100 * Masm32 ucLen
11688 cycles for 100 * MasmBasic wLen
9272 cycles for 100 * _MbStrLenW
9415 cycles for 100 * _MbStrLenW2
2344 cycles for 100 * StrLenW Guga
14 bytes for CRT wcslen
10 bytes for Masm32 ucLen
10 bytes for MasmBasic wLen
66 bytes for _MbStrLenW
66 bytes for _MbStrLenW2
58 bytes for StrLenW Guga
100 = eax CRT wcslen
100 = eax Masm32 ucLen
100 = eax MasmBasic wLen
100 = eax _MbStrLenW
100 = eax _MbStrLenW2
100 = eax StrLenW Guga
Just for fun. :biggrin:
WoW guga! :dazzled: Fantastic!
Tks a lot Zedd :thumbsup: :thumbsup: :thumbsup:
Quote from: guga on April 05, 2025, 05:20:17 PM
Tks a lot Zedd :thumbsup: :thumbsup: :thumbsup:
You're welcome. :smiley: