The MASM Forum

General => The Laboratory => Topic started by: rrr314159 on March 03, 2015, 02:40:50 PM

Title: Faster Memcopy ...
Post by: rrr314159 on March 03, 2015, 02:40:50 PM
In this thread (http://masm32.com/board/index.php?topic=1458.0) Habran, jj2007 and others submitted memory copy algorithms that were tested in jj's program MemCopyAlgos, which ran them on a block of 2048 bytes with all possible alignments. (BTW the thread is a lot of fun to read.) I substituted a new algo for Habran's "Ferrari" (because I had trouble inserting it elsewhere in the program).

This new algo is fastest on every measure: it beat every other algo at every alignment. Here are typical runs on Intel and AMD machines (rrr314's Ferrari in the last column):

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      114
------------------------------------------------------------------------------------
2048, d0s0-0      159      199      242      239      239      218      201      144
2048, d1s1-0      255      231      263      270      258      249      257      208
2048, d7s7-0      256      234      265      272      260      250      257      208
2048, d7s8-1      260      244      657      473      245      248      257      216
2048, d7s9-2      259      245      657      473      245      248      257      208
2048, d8s7+1      252      239      641      471      246      234      257      208
2048, d8s8-0      256      234      288      290      283      268      257      208
2048, d8s9-1      252      243      649      474      245      248      257      208
2048, d9s7+2      257      239      634      471      246      237      257      209
2048, d9s8+1      257      239      634      471      246      237      257      208
2048, d9s9-0      256      234      275      272      268      253      257      208
2048, d15s15      256      235      274      274      272      254      257      208


--- ok ---

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
2048, d0s0-0      170      250      262      326      327      274      278      162
2048, d1s1-0      419      385      384      389      494      348      796      238
2048, d7s7-0      375      477      350      436      407      357      758      245
2048, d7s8-1      628      682      699      402      250      303      617      224
2048, d7s9-2      598      665      757      420      250      296      621      223
2048, d8s7+1      603      643      702      409      249      257      617      219
2048, d8s8-0      356      345      315      367      386      318      298      228
2048, d8s9-1      609      651      701      399      249      298      591      217
2048, d9s7+2      587      661      690      405      274      244      596      225
2048, d9s8+1      587      660      707      403      249      250      604      225
2048, d9s9-0      356      376      347      405      386      331      667      224
2048, d15s15      356      374      334      398      376      331      680      224


--- ok ---


It simply uses all 8 xmm registers to move a large chunk of 128 bytes per loop ... nothing fancy at all. The algo is also shorter, at 157 bytes, than all but the most basic competitors.
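In outline the whole trick is just this (a sketch only, not the exact code in the zip - esi = source, edi = dest, ecx = number of 128-byte chunks):

copy128:
    movdqu xmm0, [esi]          ; load 128 bytes into all 8 xmm regs
    movdqu xmm1, [esi+10h]
    movdqu xmm2, [esi+20h]
    movdqu xmm3, [esi+30h]
    movdqu xmm4, [esi+40h]
    movdqu xmm5, [esi+50h]
    movdqu xmm6, [esi+60h]
    movdqu xmm7, [esi+70h]
    movdqu [edi], xmm0          ; then store all 128 bytes
    movdqu [edi+10h], xmm1
    movdqu [edi+20h], xmm2
    movdqu [edi+30h], xmm3
    movdqu [edi+40h], xmm4
    movdqu [edi+50h], xmm5
    movdqu [edi+60h], xmm6
    movdqu [edi+70h], xmm7
    add esi, 80h
    add edi, 80h
    dec ecx
    jnz copy128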

It could be made somewhat faster, but I didn't bother - just wanted to prove the point that this basic idea works. BTW I have found the idea on the net since my last post; google "Mark Larson optimize assembler" for a discussion. Also see FASM's discussion board. I'm surprised the idea is so obscure. Still haven't found the "more advanced" techniques used in my "faster tokenizer" algo - which makes me wonder if there's something wrong with them.

It's written in 32-bit code; I learned my lesson from last time: there's nothing non-standard about it.

Unfortunately the display might not fit right on your command window, but I didn't change anything in jj's program more than necessary. Also note that the very first number generated (upper left hand corner) is, I believe, artificially low due to a start-up problem I intend to discuss in more detail elsewhere (my algo beat it anyway).

The zip contains jj's testing program, renamed MemCopyAlgosrrr; and the exe thereof. My algo is in it, called "xmemcpy".

[edit] a more complete version, faster for larger memory copies, is in the zip a few posts below

[edit] even more complete version is in reply 25, next page. But the one here is a lot simpler (half as many lines) and shows the basic idea: using all 8 registers to copy 128-byte chunks.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 03, 2015, 02:55:16 PM
rrr,

Something we determined a while ago is that REP MOVS/B/D is slow under about 500 bytes. After that, internal special case circuitry kicks in and it gets a lot faster. Have you had a paddle with different buffer sizes for the copy operation?
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 03, 2015, 03:51:52 PM
No; I didn't "optimize" it specifically for 2048 (the standard number used in that thread), but didn't try it in any other range either. As mentioned, I just wanted to demonstrate feasibility of the idea. But I'll give it a shot in a while and let you know
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 03, 2015, 05:17:16 PM
Ran it at 20480 and 100000. My algo still does all right, and is best at some settings. But at 20480, MemCoP4 and MemCoC2 are overall ahead, my algo about third. At 100000 it's surprising how neck-and-neck they all are, except MemCo2 and MemCoC3, which are slower.

I'm using all the xmm reg's, all unaligned. So for 20480 I bet I can still beat them all, since the two current winners are using just one or two xmm's, but aligned (either source or dest). So I'll try aligning one or the other also, and (I *think*) the strategy of using all 8 will give me the advantage. At least, it's proved effective in the 2 cases I've tried so far. Tomorrow I'll check it out.

However, dunno about the larger numbers (100,000). It's quite amazing how all the different algorithms converge to the same speed up there (with 2 exceptions) - did u know they'd do that? The 2 xmm-aligned algos are only a little better, less than 1%. BTW jj has a comment recommending not to exceed 100,000 with this program; someday I'll check out some bigger numbers with my own test prog.

Well, thanks for the suggestion! Here are the numbers I got:

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
20480, d0s0-     2796     2760     2795     2835     2803     2786     2761     2640
20480, d1s1-     2935     2782     2809     2815     2838     2811     3036     4244
20480, d7s7-     2899     2786     2816     2848     2848     2804     3038     4343
20480, d7s8-     9095     9091     5962     5716     3156     3125     9210     3114
20480, d7s9-     9092     9090     5968     5720     3129     3124     9357     3101
20480, d8s7+     3257     3163     5988     5270     3149     3159     3225     3240
20480, d8s8-     2923     2793     2888     2825     2840     2835     3040     4330
20480, d8s9-     9086     9091     5962     5721     3165     3126     9065     3093
20480, d9s7+     3253     3159     5977     5264     3156     3131     3237     3205
20480, d9s8+     3267     3166     5975     5271     3115     3156     3256     3212
20480, d9s9-     2880     2826     2808     2833     2870     2861     3043     4339
20480, d15s1     2912     2822     2867     2820     2835     2844     3029     4269

--- ok ---

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
100000, d0s0    12981    12716    12930    12924    12930    12969    12717    12962
100000, d1s1    13047    12732    12978    12963    12992    12907    13614    13905
100000, d7s7    13046    12729    12946    12982    12944    12989    13624    13938
100000, d7s8    13655    13281    24727    19108    13681    13681    13639    13951
100000, d7s9    13641    13254    24695    19110    13702    13671    13599    13857
100000, d8s7    13588    13305    24930    20277    13651    13703    13622    13961
100000, d8s8    13075    12739    12986    12946    12973    12960    13650    13950
100000, d8s9    13619    13281    24846    19103    13681    13689    13648    13937
100000, d9s7    13649    13284    25049    20303    13680    13668    13656    13978
100000, d9s8    13643    13293    25028    20336    13583    13586    13620    13950
100000, d9s9    13036    12728    13005    12963    12962    12961    13636    13943
100000, d15s    13036    12708    12942    12965    12922    12931    13656    13950


--- ok ---

Title: Re: Faster Memcopy ...
Post by: dedndave on March 03, 2015, 06:01:29 PM
Quote from: rrr314159 on March 03, 2015, 05:17:16 PM
At 100000 it's surprising how neck-and-neck they all are...

a sign that you may be hitting hardware limitations
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 03, 2015, 06:25:51 PM
It is certainly fast, and could be an option if you have a special need for repeatedly copying large blocks.

For MasmBasic, I chose rep movsd in the end, both on grounds of speed and convenience: It is a frequently used algo, and I wouldn't like a wrapper for preserving the xmm regs for each and every little memcopy.
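
Just to show what such a wrapper implies, saving and restoring only two of the regs already looks like this (a sketch; a full version would need room for all eight):

    sub esp, 32                 ; room for two xmm regs on the stack
    movdqu [esp], xmm6          ; save the regs the copy loop clobbers
    movdqu [esp+16], xmm7
    ; ... the xmm copy loop runs here ...
    movdqu xmm6, [esp]          ; restore them for the caller
    movdqu xmm7, [esp+16]
    add esp, 32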
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 03, 2015, 08:02:53 PM
It's been a long time since I did any benchmarking on memcopy, but I found REP MOVS/B/D seemed to have all of the advantages on cached reads, non-cached writes and the optimal prefetch, and I think Intel did the special case circuitry because it's so commonly used over a long time. On sub 500 byte repeat copies from memory, incremented pointers (mov EAX/MOV AL) were faster on small buffers. I tend to look at the rest of the algo that does the memcopy to see what is worth the effort; there are enough times when a clunky fussy big mover is a PITA and a simple byte copy is faster than the algo calling it.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 04, 2015, 06:17:40 AM
Thanks all for your comments! Here's what I wrote b4 I saw them, and I think it agrees with your general attitude.

This new version has the destination aligned, and is fastest (vs. every algo at every alignment) at 20483 bytes. On AMD it's particularly clear, but on Intel also. At 100000 it's hard to say what's fastest; a few are very close, and it varies between the two computers; you can look at the numbers. Certainly, my algo is no longer the "winner".

There are some obvious things to do to speed it up at the higher byte counts, but I'm sick of this - it's a pursuit well suited to masochists. First, it's very frustrating to get good timing numbers. Second, a change can be good in one circumstance (byte count, alignment, phase of the moon) but bad in another. It varies between computers, other progs running, etc. Third, the algo (which started out simple) is getting longer and longer, harder to understand/debug/deal with. Fourth, reviewing old optimization pages (like Mark Larson's, a few years old), I noticed that tricks that worked back then are obsolete now! Branch prediction is totally different than what he optimized for. Unaligned moves are pretty much just as fast as aligned. Non-temporal moves are completely useless on my machines. And so forth. So, all this effort will be meaningless soon. Fifth, when u get right down to it, who cares about a few nanoseconds one way or the other?

The zip contains MemCopyAlgosrrr.asm and .exe, same as previous posting.

[edit] fixed 3 small things in zip, now it's cleaner; functionality unchanged, saves a couple nanoseconds, 2015/3/6
--- ok ---
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
20483, d0s0-     2821     2750     2804     2834     2841     2834     2777     2717
20483, d1s1-     2919     2784     2859     2831     2816     2863     3045     2737
20483, d7s7-     2885     2794     2805     2787     2787     2829     3029     2678
20483, d7s8-     9093     9095     5935     5696     3139     3096     9212     2830
20483, d7s9-     9092     9092     5940     5700     3139     3136     9361     2887
20483, d8s7+     3244     3146     5951     5249     3099     3089     3223     2931
20483, d8s8-     2919     2783     2795     2821     2796     2785     3035     2683
20483, d8s9-     9087     9089     5935     5700     3145     3092     9067     2893
20483, d9s7+     3258     3148     5949     5249     3086     3109     3245     2868
20483, d9s8+     3247     3158     5946     5243     3108     3142     3231     2890
20483, d9s9-     2920     2787     2843     2848     2862     2866     3067     2721
20483, d15s1     2902     2801     2858     2810     2839     2831     3052     2668


--- ok ---

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
100000, d0s0    13082    12709    12938    12972    12974    12942    12721    12956
100000, d1s1    13099    12702    12994    12958    12967    13022    13664    13020
100000, d7s7    13081    12739    13022    12983    12961    12991    13663    12999
100000, d7s8    13652    13312    24752    19100    13683    13696    13654    13537
100000, d7s9    13652    13267    24698    19113    13629    13704    13652    13624
100000, d8s7    13640    13315    24653    20291    13659    13684    13658    13570
100000, d8s8    13023    12695    12979    12983    12984    12966    13653    12951
100000, d8s9    13599    13287    24610    19086    13583    13645    13642    13647
100000, d9s7    13602    13305    25021    20289    13618    13634    13670    13578
100000, d9s8    13652    13317    24692    20279    13713    13681    13643    13533
100000, d9s9    13035    12718    12954    12993    12989    13064    13636    12974
100000, d15s    13071    12738    12989    13002    12948    13028    13658    12927

--- ok ---


AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
20483, d0s0-     5989     5914     5920     5634     5725     5723     5743     4819
20483, d1s1-     6143     5738     5653     5616     5600     5506     9796     4846
20483, d7s7-     6018     5656     5694     5601     5629     5602     9668     4869
20483, d7s8-    10563    10941     9734     9322     7603     7613     9132     6136
20483, d7s9-    10578    10961     9736     9324     7613     7637     9994     6116
20483, d8s7+    10728    10852     7924     8886     7223     7194    10801     6290
20483, d8s8-     5960     5617     5682     5621     5606     5624     5615     4847
20483, d8s9-    10624    10974     9740     9322     7714     7608    10651     6106
20483, d9s7+    10787    10809     7824     8919     7190     7189     9878     6206
20483, d9s8+    10810    10818     7842     8917     7253     7223     8821     6201
20483, d9s9-     6059     5650     5675     5594     5577     5614     9622     4863
20483, d15s1     5861     5594     5680     5631     5568     5648     9694     4765


--- ok ---

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
100000, d0s0    24435    22509    22496    22415    22002    22048    22183    22259
100000, d1s1    22379    22313    22026    22110    21864    21799    43627    22193
100000, d7s7    22365    22331    22055    22006    21805    22071    43413    22149
100000, d7s8    48009    47571    31597    32397    25654    24978    39742    24342
100000, d7s9    48044    47574    31809    32465    25620    24905    43194    24367
100000, d8s7    49095    49177    31628    31530    25843    24612    49296    25211
100000, d8s8    22408    22264    22047    22044    21643    21890    22259    22186
100000, d8s9    47976    47620    31494    32418    25651    24999    47788    24421
100000, d9s7    49815    49884    31259    31463    25324    24670    44214    25206
100000, d9s8    49755    49189    31743    31498    25266    24655    39641    25244
100000, d9s9    22411    22149    21979    22138    21807    21903    43295    22222
100000, d15s    22365    22310    22133    22023    21826    21728    43418    22051
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 04, 2015, 06:32:41 AM
ps jj2007, I hadn't thought of needing to preserve all those xmm regs, was just considering the problem in isolation from the real world, but you're right. Maybe that's why this idea is so obscure. Still, in special cases, when xmm regs are not being used for anything else, and the speed actually matters, it could be applicable. Also in 64-bit, with 16 xmm's, it makes more sense.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 04, 2015, 09:15:18 AM
Here are the results on my aging i7.


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
20483, d0s0-     3530     3520     3502     3497     3499     3511     3655     3376
20483, d1s1-     3752     3554     3535     3538     3541     3540     3969     3341
20483, d7s7-     3758     3560     3534     3538     3541     3544     3921     3346
20483, d7s8-     8911     8889     7560     7840     4461     4465     9022     3770
20483, d7s9-     8909     8890     7560     7840     4463     4464     9414     3763
20483, d8s7+     4489     4445     7352     7038     4417     4411     4496     3814
20483, d8s8-     3759     3558     3537     3543     3536     3544     3913     3344
20483, d8s9-     8901     8890     7554     7839     4466     4465     8867     3763
20483, d9s7+     4490     4448     7345     7038     4413     4411     4499     3815
20483, d9s8+     4489     4446     7345     7042     4405     4408     4468     3818
20483, d9s9-     3759     3560     3539     3538     3535     3538     3914     3346
20483, d15s1     3581     3559     3530     3528     3526     3535     3951     3311


--- ok ---
Title: Re: Faster Memcopy ...
Post by: KeepingRealBusy on March 04, 2015, 10:23:39 AM
rrr

Mis-alignment can be a problem, especially if both the source and the destination buffers have some mis-alignment, and especially if they have different alignments. The usual case is that one or the other of the buffers can be aligned mod 16. The way to handle these is to move a block of 16 bytes from the start of the buffer and a block of 16 bytes at the end of the buffer with un-aligned XMM moves. Then calculate the size left to be moved (data size - 32), round this up to the next mod 16, and start at the first mod 16 byte boundary of the un-aligned buffer, moving 16 bytes at a time (there will be a small overlap at either end).
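In outline the scheme looks like this (a sketch only - registers and labels are illustrative; assumes the length in ecx is at least 32, esi = source, edi = dest, aligning the destination side):

    movdqu xmm0, [esi]              ; 16 unaligned bytes from the head
    movdqu xmm1, [esi+ecx-16]       ; 16 unaligned bytes from the tail
    movdqu [edi], xmm0
    movdqu [edi+ecx-16], xmm1
    mov eax, edi
    neg eax
    and eax, 15                     ; bytes up to the next mod-16 dest boundary
    add esi, eax                    ; the head store already covered these
    add edi, eax
    sub ecx, eax
    sub ecx, 16                     ; the tail store covers the last 16 bytes
    add ecx, 15                     ; round the middle UP to whole 16-byte
    and ecx, -16                    ; blocks (small overlap at either end)
middle:
    movdqu xmm2, [esi]              ; source may still be unaligned
    movdqa [edi], xmm2              ; dest is aligned from here on
    add esi, 16
    add edi, 16
    sub ecx, 16
    jnz middle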

Dave.
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 04, 2015, 10:26:58 AM
Quote from: rrr314159 on March 04, 2015, 06:32:41 AM
ps jj2007, I hadn't thought of needing to preserve all those xmm regs, was just considering the problem in isolation from the real world, but you're right. Maybe that's why this idea is so obscure. Still, in special cases, when xmm regs are not being used for anything else, and the speed actually matters, it could be applicable. Also in 64-bit, with 16 xmm's, it makes more sense.

If speed matters, it looks like a really good alternative. As a general purpose algo, it would fail already when looking at the minimum chunk size...
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 04, 2015, 07:19:34 PM
@KeepingRealBusy,

thx for the suggestion. The way I did it is more obvious; u can check it out in the zip, at beginning of "xmemcpy" proc. Hard to say which is better, probably varies with circumstances. Main advantage of my approach is, it fit with the existing algo; didn't have to change a single line, just added 20, whereas yours would require a total rewrite. It's interesting that these are 2 completely different ways to handle the same seemingly straightforward problem.

@jj2007: just in case of misunderstanding, altho my chunk size is 128, of course the routine handles any smaller number also; for < 128 it uses single xmm reg (chunk=16); less than that, moves byte by byte (chunk = 1 :) Could get more complex, use chunk sizes of 2, 4, 8, and 32, 64; I saw an Agner Fog algo that did that; but it's all, really, of academic interest only, IMHO

@hutch thx for checking it out; give your old machine a rest now, it's earned it ;)
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 04, 2015, 08:15:13 PM
Quote from: rrr314159 on March 04, 2015, 07:19:34 PM
@jj2007: just in case of misunderstanding, altho my chunk size is 128, of course the routine handles any smaller number also

Sorry, had not seen that one :redface:

    shr ecx, 7                  ; divide by 128 (chunk size)

    loop128:                    ; do chunks of 128
        jle endloop128

Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 05, 2015, 08:00:41 AM
Let me apologize for that sloppy code which aligns the dest block. There are 3 places where instructions are unnecessarily executed. I knew about them when I posted it, but 1) it makes no difference at all to the final results, 2) I didn't think anyone would read it, and 3) I was so tired of timing, timing, timing (to paraphrase Hamlet) that the extra 20 seconds to make the changes - and couple minutes to test them - was more than I could stand. If the changes had to be made I would have just chucked the whole thing.

But the original code is fine (I think) and compact enuff that u have to read carefully, as u found out! At one point, condition code is set, then not checked for another 21 lines (of course it's unchanged by those intervening lines). This gives the processor max time to predict the branch correctly, and makes the code harder to read - which, sometimes, I consider a plus :)
Title: Re: Faster Memcopy ...
Post by: dedndave on March 05, 2015, 10:47:47 AM
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol
you never seem to get both

in the past, i have done something like this:
1) test lower bits of address register for desired alignment
   branch if aligned
2) perform operation on a byte (copy or whatever)
3) test lower bits of address again
   branch if aligned
4) perform operation on a byte

and so on, until alignment is achieved (the lower address bits are 0's)

there's no good way to make that pretty   ;)
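
for example, aligning esi to 16 might look like this (a sketch only - the length check rrr mentions below is left out):

    test esi, 15                ; desired alignment reached ?
    jz aligned
    mov al, [esi]               ; copy one byte
    mov [edi], al
    inc esi
    inc edi
    test esi, 15                ; test again
    jz aligned
    mov al, [esi]
    mov [edi], al
    inc esi
    inc edi
    ; ... and so on, until the lower 4 bits of esi are all 0's
aligned: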
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 05, 2015, 02:38:53 PM
Don't forget u need to check also if you're exceeding the length to copy! What if it takes 15 to align, but they only asked u to copy 3? Anyway, the way I did it, first figured how many bytes needed to get to alignment, then did that many. keepingrealbusy had a totally different way (see his post). In this case not real important how u do it, since once u get there you'll probably copy many thousands of bytes; a few instruction cycles more or less to align really doesn't matter. Interesting to see 3 different ways to do such a seemingly straightforward task; wonder how many more there are!
Title: Re: Faster Memcopy ...
Post by: KeepingRealBusy on March 08, 2015, 06:16:36 AM
rrr,

The primary concern is where the data is located and where it has to reside. What I usually run into is that I need to read in a file of data. I allocate the buffers with VirtualAlloc and read in the data unbuffered (for all but the last partial disk sector), then read in that last block buffered and add the size read to the size of the prior blocks (close the file, reopen it buffered, position it the last partial size from the end of the file, then read buffered).

If you need to keep the same alignment with data that is on an unaligned boundary, then use my method. Just allocate a buffer that is the required size rounded up to mod 16, then offset the start of the destination data area to match the unaligned source, unaligned copy the starting 16 bytes and the ending 16 bytes, then copy all of the middle bytes (starting at the first aligned location) using aligned moves.
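
For instance (a sketch with hypothetical names - srcPtr is the unaligned source pointer, datLen the byte count):

    mov eax, datLen
    add eax, 32                 ; room to round up and to offset the start
    invoke VirtualAlloc, 0, eax, MEM_COMMIT, PAGE_READWRITE
    mov ebx, eax                ; VirtualAlloc returns a page-aligned base
    mov edx, srcPtr
    and edx, 15                 ; misalignment of the source within 16
    add ebx, edx                ; offset the dest start to match the source
    ; ebx is the destination: source and dest now share the same mod-16
    ; alignment, so past the first boundary the copy can use movdqa on
    ; both sides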

Dave.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 08, 2015, 08:20:39 AM
That approach sounds good IF you have the freedom to allocate your destination buffer, but that's not the point of this exercise. I'm given both buffers by the user and must write where he specifies. Normally dest won't have same alignment as source, and I have to put up with it. If they happen to match, my method works out very similar to yours (BTW note that movdqu is almost as fast as movdqa these days, on aligned data). But normally they won't be aligned so only half the moves can use movdqa; in my case it's the writes; u could do it with the reads instead; seems to make little difference.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 08, 2015, 09:22:54 AM
> (BTW note that movdqu is almost as fast as movdqa these days, on aligned data)

This is where the action is with later hardware: while 16 byte aligned is slightly faster on some hardware, it's not by much and it's a lot of messing around to set it up. You will tend to be restricted more by the memory bottleneck than by the instructions that move or copy memory, and I mainly see the gain in later XMM in what it can do with a 128 bit block of data, as there are enough techniques that just grunt memory content around that are fast enough in most contexts. The actual data processing gives the main speed gain, not the memory copy speed.

There are many contexts where I happily use byte level copy as it is unproblematic and often a lot faster than the data processing done with the data.
Title: Re: Faster Memcopy ...
Post by: nidud on March 08, 2015, 10:13:35 AM
deleted
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 08, 2015, 11:40:47 AM
Hi nidud,

it's too bad you didn't use the version of the 128-byte mover that also works for larger memory blocks (20,000 or so). Originally I just wanted to show the technique for small jobs around 2048, but hutch suggested I try it out at higher numbers; the result was the second version, posted in reply #7 (see discussion). So, if I get around to it I'll plug it into your tester and see how it does. My guess is it will come in second, behind proc_3, judging by my previous numbers. Still, I don't know if I can summon the energy - timing algos is very dull work!

As for the short 31 bytes, simple byte copy is much better for such jobs. In fact pencil and paper would probably be better.

I feel obligated to check it out - but I'd promised myself I wouldn't touch timing again for at least a year ... so if u feel like doing it, go for it! The version to use is in Reply #7.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 08, 2015, 12:23:48 PM
> but I'd promised myself I wouldn't touch timing again for at least a year

Don't give up on timing, it's good for you, and as an aside you tend to pop faster algos if you do it regularly. You get faster at it over time, and with more practice you tune the timing to the task and the algo will perform a lot better.
Title: Re: Faster Memcopy ...
Post by: nidud on March 09, 2015, 02:04:50 AM
deleted
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 09, 2015, 06:18:24 AM
You understand, I was merely demonstrating the efficacy of using 8 xmm reg's and paid no attention at all to other details ... simply slapped it together. However, thanks for the mods. I'll look at it (now you've got me interested) and see if any further improvements are available.

[edit] also thanks for going ahead and testing the "original new" algo; I know what a PITA all this can be ;)
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 09, 2015, 01:24:59 PM
@nidud,
I finally managed to get my memory copy algo faster than yours, but it's a close shave. I used some of your suggestions. Doing the tail that way hadn't occurred to me until keepingrealbusy suggested it. Also an adaptation of your short-copy handling (similar to Agner Fog's). Wouldn't have bothered without your example, which proved these ideas better, so thanks. Code is below.

I couldn't get your test bed working right; for one thing the eax return value wouldn't co-operate, and I'm not familiar with that non-standard environment, so I put together this extremely standard masm32 test bed with MichaelW's algos. Hope it will suit. Also included a shorter test and, more important, made runs both aligned and misaligned.

Here's a run where my algo is best on every number, but that's unusual, had to make 6 runs to get it :biggrin: Generally, tho, it seems overall best. Shorter too.

mem2 = nidud's improved version of rrr's first algo (was called mem4)
mem3 = nidud's algo (equals mem3 in his test prog)
mem4 = rrr latest (hopefully last) attempt

test.asm: 506 lines, 3 passes, 3688 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(300)
     8        2387            2359            2337
    31        2396            2291            2193
   271        7372            9300            6269
  2014       20854           25887           13538
262159     5229387         5238006         5165092

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(300)
     8        1198            1190            1180
    31        1248            1198            1147
   271        3952            5367            3714
  2014       14134           25988           14074
262159     5266919         5485524         5263508


; «««««««««««««««««««««««««««««««««««««««««««««««««««
mem4 proc Dest:PTR BYTE, Source:PTR BYTE, ln:DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; written by rrr314159 2015 03/08

push esi
push edi
mov edi, [Dest]
mov esi, [Source]
mov ecx, [ln]

; if very small get it over with quick
cmp ecx, 16
jge bigger_than_15
test ecx, 8
        jz @f
        movq xmm1, [esi]
        movq xmm0, [esi+ecx-8]
        movq [edi], xmm1
        movq [edi+ecx-8], xmm0
        jmp end_memcopy
@@:
test ecx, 4
        jz @f
        mov eax, [esi]
        mov ebx, [esi+ecx-4]
        mov [edi], eax
        mov [edi+ecx-4], ebx
        jmp end_memcopy
@@:
test ecx, 2
        jz @f
        mov ax, [esi]
        mov [edi], ax
@@:
test ecx, 1
        jz end_memcopy
        mov al, [esi+ecx-1]
        mov [edi+ecx-1], al
        jmp end_memcopy

bigger_than_15:
    movdqu xmm1, [esi]              ; 16 or greater it's safe
    movdqu xmm0, [esi+ecx-16]
    movdqu [edi], xmm1              ; to do head
    movdqu [edi+ecx-16], xmm0       ; and tail here

cmp ecx, 32
jg @f

end_memcopy:                        ; put end close to short copies, quicker
    pop edi
    pop esi
    ret 12
@@:

; continue for more than 32 bytes - check dest alignment

test edi, 15
jz destisaligned
    mov ebx, edi
    and ebx, -16
    add ebx, 16                     ; ebx gets bytes to align edi (1<= ebx <=15)
    sub ebx, edi                    ; don't need to check < ecx
    sub ecx, ebx                    ; how many bytes will be left to do in ecx
    lea edi, [edi+ebx]              ; align dest
    lea esi, [esi+ebx]              ; make source match
destisaligned:                      ; esi, edi and ecx should all be right

mov ebx, ecx
and ecx, 127                        ; tail bytes (up to 127) in ecx
shr ebx, 7                          ; divide by 128 (chunk size)
jle endloop128

align 4
loop128:                            ; do chunks of 128
    dec ebx                         ; dec ebx long b4 using it
    movdqu xmm1, [esi]              ; read all 8 xmm's
    movdqu xmm2, [esi+10h]
    movdqu xmm3, [esi+20h]
    movdqu xmm4, [esi+30h]
    lea esi, [esi+40h]              ; seems to work best adding here
   
    movdqu xmm5, [esi]
    movdqu xmm6, [esi+10h]
    movdqu xmm7, [esi+20h]
    movdqu xmm0, [esi+30h]
    lea esi, [esi+40h]              ; add to esi long b4 use
   
    movdqa [edi], xmm1              ; write all 8 xmm's
    movdqa [edi+10h], xmm2
    movdqa [edi+20h], xmm3
    movdqa [edi+30h], xmm4
    lea edi, [edi+40h]
   
    movdqa [edi], xmm5
    movdqa [edi+10h], xmm6
    movdqa [edi+20h], xmm7
    movdqa [edi+30h], xmm0
    lea edi, [edi+40h]
   
    jg loop128
endloop128:

shr ecx, 4                          ; ecx has chunks of 16
jle end_memcopy

loop16:
    dec ecx
    movdqu xmm1, [esi]              ; just one xmm could use bigger
    lea esi, [esi+10h]              ; chunks (32, 48 etc) but why bother
    movdqa [edi], xmm1
    lea edi, [edi+10h]
    jg loop16

jmp end_memcopy

mem4 endp


The zip contains test.asm with all 3 algos, and the exe
Title: Re: Faster Memcopy ...
Post by: KeepingRealBusy on March 10, 2015, 04:37:53 AM
rrr,

Thanks for giving me the tail end credit. Most would grab the code and run.

Dave
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 10, 2015, 05:43:08 AM
This was basically the last time I played with memory copy.


    mov esi, pad1
    mov edi, pad2

  ; --------------------------------------
  ; if blen < 16 jump to tail end trimmer
  ; --------------------------------------
    mov eax, blen
    cmp eax, 16
    jae nxt
    mov scnt, eax
    xor edx, edx
    jmp bypass

  nxt:
    shr eax, 4                ; int div by 16
    mov lcnt, eax
    shl eax, 4                ; int mul by 16
    mov ecx, blen
    sub ecx, eax
    mov scnt, ecx

    mov ecx, lcnt
    mov edx, -16

  ; -------------------------------------------------
  ; perform the bulk of the process in 16 byte blocks
  ; -------------------------------------------------
  align 16
  lbl1:
    add edx, 16
    prefetcht1 BYTE PTR [esi+edx]
    movdqa xmm0, XMMWORD PTR [esi+edx]
    movntdq [edi+edx], xmm0

    sub ecx, 1
    jnz lbl1

  ; -------------------------------
  ; if no uneven tail, jump to exit
  ; -------------------------------
    cmp scnt, 0
    je bye

  ; -------------------------------------------
  ; process any uneven tail as a BYTE operation
  ; -------------------------------------------
    add edx, 16

  bypass:
    mov ecx, scnt
    test ecx, ecx             ; guard: nothing left to copy
    jz bye
  lbl2:
    movzx eax, BYTE PTR [esi+edx]
    mov [edi+edx], al
    add edx, 1
    sub ecx, 1
    jnz lbl2

  bye:


It was slower than the REP MOVSD code.


    cld
    mov esi, src
    mov edi, dst
    mov ecx, ln

    shr ecx, 2
    rep movsd

    mov ecx, ln
    and ecx, 3
    rep movsb
Title: Re: Faster Memcopy ...
Post by: nidud on March 10, 2015, 06:03:04 AM
deleted
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 10, 2015, 05:03:59 PM
Quote from: dedndave
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol

- when I first saw that a couple days ago I thought my approach was both pretty and fast, but I forebore to disagree: I was taught not to argue with my elders :biggrin: But now I see you're right, it wasn't all that fast; and since it's not the fastest, it's not pretty either

Quote from: keepingrealbusy
Thanks for giving me the tail end credit. Most would grab the code and run.
I have a stock answer when I'm told that. You can't blame those ("most") people: if they didn't grab credit for others' work, they'd never get credit for anything

Quote from: hutch
It was slower than the REP MOVSD code.
I've got to check out REP MOVS* instructions. Fact is I've never used them! When I came here in January I was still using 1980 era techniques (Z80); I've learned a bunch of modern stuff, like macros and SIMD, but to me REP MOVSD is still "new-fangled"

Quote from: nidud
One of the problems with these tests is that code location is sensitive to timings. ... The algo to test is then handled as data, and in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Sure, that seems like an excellent approach.

I switched the order of the algos in my test bed and may have experienced the phenomenon you're talking about: in one category, ALIGNED 2014, the two algos pretty much reversed their timings, as though it ONLY depended on location! Other numbers weren't affected, although the numbers are not very stable anyway, so it's hard to say. So, I simply copy-pasted a dozen runs before the final three, for which I report the numbers (still with my algo in what may be the "worst" position, first). Takes longer but it's a lot more stable. Here are the first 4, and only, runs I made:

  mem2 = rrr/nidud modified; mem3 = nidud; mem4 = rrr latest

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)
     8        2969            2759            2566
    31        2706            2651            2559
   271        6891            8023            6884
  2014       15540           39733           15315
262159     5258012         5218352         5255524

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)
     8        1441            1350            1386
    31        1402            1346            1499
   271        3655            5698            3788
  2014       13863           26579           14449
262159     5318643         5385829         5317655
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3656 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)
     8        2688            2730            2538
    31        1452            1377            1279
   271        3468            4053            3338
  2014       13658           25588           13532
262159     5259190         5255524         5259200

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)
     8        1465            1510            1352
    31        1388            1327            1286
   271        3577            5337            3571
  2014       13922           25698           14171
262159     5370304         5417275         5319313
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)
     8        2689            2730            2539
    31        1428            1437            1429
   271        3441            4143            3521
  2014       13636           25594           13482
262159     5211287         5219878         5276594

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)
     8        1409            1378            1335
    31        1449            1387            1281
   271        3603            5482            3889
  2014       13870           25605           14170
262159     5372493         5509407         5361170
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)
     8        2709            2754            2550
    31        1439            1407            1353
   271        3529            4226            3449
  2014       14126           25590           13531
262159     5209791         5210645         5211581

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)
     8        1416            1385            1358
    31        1397            1338            1291
   271        3483            5454            3570
  2014       13910           25713           14117
262159     5315510         5381937         5314749
Press any key to continue ...


My latest is still ahead (27 out of 40) but the differences are pretty negligible. Not surprising, since the two algos now use, with 4 or 5 exceptions, the same techniques. The only significant conclusion is: both are definitely better than your original mem3 in the middle range (271 and 2014). BTW this reminds me of a comment someone made in an old post: seems everybody's algo is best when run on their own machine. If anyone wants to check my results, the code is posted above, but reverse the order of the algos and copy/paste a dozen copies or so; it may make a little difference.

Apparently your machine doesn't do 128-bit (XMM) moves right; many older machines do them as two separate 64's, completely negating any advantage. That's why your 64-block worked so well; I'm glad u checked that out. I think the basic point is demonstrated: larger copy blocks (the largest a given machine supports efficiently) do give a bit more speed. BTW my machine has a similar prob with AVX 256; it does them as 2 separate 128's, so it's actually slower for this task (a little better for some others, though).

... But none of it matters if hutch, and the ref you mention later, are right: that MOVS* is better on newer hardware.

Well I reckon we've beaten memory-moving to death! I'm working on some much more interesting fast-algos, hope to post something in a couple days. Thanks for the exercise, and various tips!
Title: Re: Faster Memcopy ...
Post by: nidud on March 11, 2015, 12:55:50 AM
deleted
Title: Re: Faster Memcopy ...
Post by: dedndave on March 11, 2015, 02:37:18 AM
while that's true, to some extent

you can't avoid checking cases to get alignment   :P
that means some form of conditional branching

if you play with it, i think you'll find that JNZ is faster for reverse branches and JZ is faster for forward branches   :t
you can sometimes design your code to take advantage

you might also try a jump table - haven't tried that, but it might be faster
Title: Re: Faster Memcopy ...
Post by: nidud on March 11, 2015, 03:37:13 AM
deleted
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 11, 2015, 06:11:08 AM
Hi nidud,

This is a very interesting investigation, unfortunately I don't have time to get into it at the moment, but in a couple days I'd really like to do so. The following comments are (deliberately) w/o reference to anything but my organic memory subsystem, ...

First - since we're getting down to details - do a global search and replace on my algo, change "ebx" to "edx". Of course, has no effect on timing.

Now, your jump timings are right in effect - that is, the conclusion drawn is right: it's better to skip that branch. I noticed u did that but didn't bother to change my code, thinking it was meaningless; but it seems it's more significant. I say "in effect" because in actuality, jmps take far fewer cycles than the numbers you give: 13.5 and, uh, (alright, I'll take a look ...) 30.1, 10.7 etc. Yet, those are reasonable Expected Values because of branch prediction. The reason for avoiding jumps is not that they're inherently that long; rather, if a branch is incorrectly predicted, they cost many cycles while the pipeline is cleared and reloaded. So u can't come up with exact numbers w/o many more details. It's not only different for different companies (as u mention, AMD and Intel): it's different for the same processor, depending on recent history - even the very first time the branch is encountered - because other nearby branches can affect prediction for this branch. Thus a detailed analysis is inherently probabilistic. As I say, the numbers u give look like good EVs (presumably u got them from some ref that did such analysis, or a performance analyzer?) and they give the correct conclusion: the branch shouldn't be there.

I only glanced thru your algo (mem3), but here's what I noticed. There was a clever substitute for the initial pushes which saved - either a byte or a cycle, I forget which. The comparison tests were different; mine, I felt, were more intuitive. Other small details that amount either to nothing or not much; a couple slightly better techniques which I didn't bother to borrow (like the push substitute; next time I'll use them); a couple places I think my technique a little better. Half the algo I didn't read but presume there's nothing 2 important. But I removed almost all alignments (from your mod'ed version of my algo); my tests showed they were basically insignificant (maybe a cycle or two) and I wanted to reduce the code size, enhancing cache fit probability by a half percent if that.

My point is this. The difference between the two algos is minimal (to my knowledge); but now you're showing that your branch elimination saves quite a few cycles, more than I had supposed. So why, then, is my algo - more accurately, my version of the same algo, since these diffs are virtually just a matter of style - why is it faster? (a small, but significant, amount?) Since u, apparently, have gone thru them with a fine tooth comb, presumably u must know. There's something about mem4, versus mem2, that's saving more than just a cycle or two - it must be quite noticeable - so, what is it?

I'm not kidding about a couple days: happy to investigate this further with you when I have more time.
Title: Re: Faster Memcopy ...
Post by: nidud on March 12, 2015, 06:20:59 AM
deleted
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 12, 2015, 08:24:13 AM
@nidud, Thanks for providing your test software, in future I'll use some of those (better) techniques.

But I think I've answered my own question, why my algo does a bit better when timed my way instead of yours. I'm using the standard approach: not loading .bin algos or FlushInstructionCache'ing. (These are good things to do, thanks for the tip.) But doing it the usual way, when the dest is aligned the branch (around the aligning instructions) is hit many thousands of times, and *always* taken. So, the processor predicts the branch perfectly. That jump costs almost nothing, a couple cycles (not sure exactly), whereas you're executing six instructions there (13.5 cycles). This matches my results: my algo's a bit ahead on ALIGNED data, but a bit behind on non-aligned, since not taking the jump costs a little. Conclusion:

- rrr's rule for not-very-careful timing test beds: Conditional branches (that depend in a simple way on the inputs) are almost free, so use as many as u want!

Reminds me of something hutch said, "I have always been an advocate of real time testing ...". In real life the environments will be very different, and variable; you don't know what you'll run into, so:

- rrr's timing statistical rule of thumb: ignore differences smaller than about 5%.

I think you should update your scoring routine to allow ties!

A more important thought re. branch prediction occurred to me. When a utility routine like memory-copy is called in real life, there's a decent chance the branch to it wasn't predicted right, so possibly the pipeline is going to be empty (other factors may cause the same thing). So, always put a (necessary) branch right at the beginning if possible, where a misprediction might be almost free.
Quote from: nidud
However, most new machines (and almost all Intel CPU's) works best on aligned loops.

I'm sure you're right, but I saw approximately no timing difference with "align x". At least in one important case, it turned out the loop was (accidentally) already aligned. But even when not, no real difference: dunno why.

Quote from: nidud
If you think about it, the code above (and below) the loop label is peanuts compare to what happens in the main loop, so that's the main priority. The init and exit code will only hit you when multiple calls are made with small lengths.

Your first sentence is true when called with large blocks. But your second sentence explains why the code above (or below) the loop label *can be* important: your user may indeed be using it for lots of small lengths. That's why if one can save a cycle or two (for instance, by the placement of the "toend" code) it should be saved in favor of small lengths rather than long lengths.

Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P

c u later,
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 12, 2015, 08:45:05 AM
I have also learnt a few lessons on code alignment: regularly, while bashing the guts out of an algo to get its pace up, I have tried aligning labels that get hit hard and seen the timing take a lot longer. Much the same with procedure alignment: align at 4 bytes and it works OK; change it to 8 or 16 bytes and it often slows right down. Then it's worth shifting the location of the algo around in the exe to make sure you are not deluding yourself and see what the results are.

I have tried spacing algos from each other with padding from under 1k to over 1 megabyte, but that was more or less useless, so I usually settle on a delay between each algo of about 500 ms and set the priority class to high but not the one higher. As hardware gets smarter, timings wander even further.  :biggrin:
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 12, 2015, 09:07:07 AM
Quote from: hutch
As hardware gets smarter, timings wander even further.

Very true, meanwhile I'm getting slower ... but, I'm reminded of something hutch once said,

Quote from: hutch
Don't give up on timing, it's good for you, and as an aside you tend to pop faster algos if you do it regularly. You get faster at it over time, and with more practice you tune the timing to the task and the algo will perform a lot better.

Also true! One good thing: it gives the lie to the C programmers' favorite saying, "the compiler does it better than you". Unless the compiler writers are down in the trenches timing away - as opposed to just believing Intel/AMD/Agner, as I think they do - they can't match a real MASMologist
Title: Re: Faster Memcopy ...
Post by: nidud on March 12, 2015, 09:22:35 AM
deleted
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 02:39:37 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 19, 2015, 03:22:09 AM
Looks extremely familiar (http://www.masmforum.com/board/index.php?topic=18728.msg159628#msg159628) 8)
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 05:26:14 AM
deleted
Title: Re: Faster Memcopy ...
Post by: Antariy on March 19, 2015, 06:21:32 AM
You may safely read/write buffers in larger than byte-by-byte fashion.

There is no difference between the "MEM" and "STR" functions here - just imagine your "MEM" function receives a wrong bytecount to scan/copy/etc.: if the buffer is actually smaller than the bytecount param specifies, and the buffer lies near the end of a memory page, the "MEM" code will crash just as the "STR" code would.

When you scan memory you may do it safely with larger than byte-sized reads/writes; for this you just need to pay attention to the details of how memory is allocated under x86 page-addressed protected mode (Windows, Linux, Mac and any other protected mode operating system on x86).
Memory has the granularity of one page - 4096 bytes "at least" - so you may always assume that, if you were given some address inside a memory page, there is a full page in memory available for read/write (depending on protection).
I.e., if you were given the address 23987234h - just for an instance - you may take the base address of the page within which this address is located - just clear the 12 lowest bits of the address (AND EAX,-4096) - and you have the page address, for this example 23987000h. So, now you know the start of the page, and you know that the size of a page is 4096 bytes, so from that you know how far the buffer extends AT LEAST: 23987000h + 1000h = 23988000h. Within this range (from the given start of the buffer 23987234h to 23988000h) you may freely read the data in any sized fashion - just take care that your reads do not go beyond that. I.e. you may freely start a single-reg XMM read (16 bytes) at 23987ff0h, but you obviously may not start the same read at 23987ff1h - because the final address of the data grabbed, 23987ff1h + 10h = 23988001h, is beyond the page range and you may not be sure that the next page is reserved/committed in memory.
Just pay attention to the addresses at which you read/write and, depending on the size of the data grab, you have all the info needed to avoid reading beyond the buffer into inaccessible memory.
It is simpler to align the pointers to a power of two - then you may freely read and assume that the code will not crash - and it will not crash if you check every read for a match to the algo's purpose.
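In code the check is only a few instructions (a sketch; eax holds the current read pointer, "careful" is the smaller-reads path):

    mov edx, eax
    and edx, -4096              ; base of the page the pointer lies in
    add edx, 4096               ; first address beyond that page
    sub edx, eax                ; bytes guaranteed readable from eax
    cmp edx, 16
    jb  careful                 ; fewer than 16 left - drop to smaller reads
    movdqu xmm0, [eax]          ; safe: this read cannot cross the page end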

As an example - just imagine that you use a byte-by-byte StrLen reading algo and an SSE algo which uses a single-reg XMM read. If you pass a properly formatted zero-terminated string to the algos, both will work correctly if they check the data read for the finish condition (the zero byte) before they read the next bytes. If both algos are given wrongly formatted strings without a zero terminator until the very end of the page - and no accessible area after that - both algos will crash. No difference between the algos.

What is wrong with that example of SSE code which reads with two XMM regs: that code does TWO readings and ONE check after both readings. So the code may actually hit the proper string end (zero byte) in the first of the two reads, but it will, paying no attention to that, try to read the next 16 bytes - even though the proper zero-terminated string was already terminated within the previous 16 bytes (the first XMM read). So, reading a PROPER zero-terminated string at a PROPER possible location near the end of the buffer, with no accessible memory beyond it, this code will crash; that is why it is IMPROPER. We had a thread about a StrLen XMM algo on the old forum; I, Jochen and Lingo were the main posters AFAIK, and Lingo provided an algo of that kind - it did two XMM reads, combined the byte masks into one 32 bit GPR reg, and only after that verified the result. Of course that code was buggy, and it is funny and a shame (taking into account his megalomania) that Lingo doesn't know such simple and basic things as buffer/memory layouts. The loud cries of "the algo is faster instead" are no excuse - an algo should be reliable first, and only then MAY one think about making it faster. So Jochen decided to use a one-read XMM algo in his lib, and that was the simplest and most reliable solution. It is probably possible to do a complex solution which reads more than one reg - but it would have to check every reg before it has the right to read the next reg from memory, so that complex code probably not only would not be faster than the one-reg-reading code, it would be much slower. So Lingo's code was just trash from the point of view of software production - a real fancy thing but not a real working solution. Well, after so long a time I have described my point of view on the "Lingo coding style" and the real "value" of his over-valued "style", in proper English now :greensml:
Title: Re: Faster Memcopy ...
Post by: Antariy on March 19, 2015, 06:41:04 AM
The Intel's "read ahead" is not actually a real read ahead - just show us full code and it is for sure that they do verify of the finish condition (zero byte) before they read next data which may be beyond page range. The full sense of read-ahead is only when the code does read only with not-dword-aligned strings, but taking in attention that the code might probably do the alignment (which is not obvious from the small piece of code provided) and/or that that every "advanced" compiler does string alignment on the dword boundary - this is not a dangeround "read-aheading" at all - it has nothing common with the algo which does two (or more) reads before it checks the condition of termination. If the algo with use of any technique is still in the range of taking notice to be in the page range, the algo will not crash. All other differences - are the details of implementation but not the idea itself.
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 09:15:28 AM
deleted
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 09:43:57 AM
deleted
Title: Re: Faster Memcopy ...
Post by: Antariy on March 19, 2015, 09:45:35 AM
Just re-read what was written: if you have the address of the buffer, you have everything required to check whether the buffer is near the end of the page. What you do after that is your decision - you may of course try to read a whole heap of data even when you see that the given buffer starts at the last byte of the page, or you may handle that case with appropriately sized reads - QWORD/DWORD/WORD/BYTE - depending on how close to the end of the page your buffer is.
In the case of that StrLen code which Jochen used in his lib, I simply aligned the input buffer address to 16 (note that I mentioned this in my first post here - align the reading pointer to a power of two), and aligned it "backwards" (just AND with -16, so the aligned 16-byte read never crosses the page end), thus doing the first read not ahead but "backward", and then shifted the "backward" tail out of the mask reg. So that SSE code CANNOT FAIL with any zero-terminated string placed anywhere in memory.
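A worked instance of that backward alignment, with an invented address for concreteness - say the string starts at 23987FF5h, 11 bytes before the page end:

mov edx, eax            ; eax = 23987FF5h, the string start
and edx, -16            ; edx = 23987FF0h - an aligned 16-byte read cannot cross the page end
mov ecx, eax
and ecx, 0Fh            ; ecx = 5, the "backward tail" bytes before the string start
pxor xmm7, xmm7         ; zeroes to compare against
pcmpeqb xmm7, [edx]     ; reads 23987FF0h..23987FFFh - entirely inside the page
pmovmskb eax, xmm7
shr eax, cl             ; shift the 5 tail bits out of the mask; bit 0 is now the string's first byte

This is exactly the head of the AxStrLenSSE routine posted below, just with concrete numbers plugged in.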
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 19, 2015, 11:14:51 AM
Quote from: Antariy on March 19, 2015, 09:45:35 AM
In the case of that StrLen code which Jochen used in his lib, I simply aligned the input buffer address to 16 (note that I mentioned this in my first post here - align the reading pointer to a power of two), and aligned it "backwards" (just AND with -16, so the aligned 16-byte read never crosses the page end), thus doing the first read not ahead but "backward", and then shifted the "backward" tail out of the mask reg. So that SSE code CANNOT FAIL with any zero-terminated string placed anywhere in memory.

This is correct, but to be fair, the example of the 4096 byte buffer posted below works only with SEH:

include \masm32\MasmBasic\MasmBasic.inc      ; download (http://www.masm32.com/board/index.php?topic=12460)
  Init tc                  ; install the SEH handler

  FileWrite "test.txt", String$(4096, "x")      ; prepare a nasty example
  xchg rv(filesize, Chr$("test.txt")), ebx
  xchg rv(VirtualAlloc, 0, ebx, MEM_COMMIT, PAGE_READWRITE), edi
  push rv(_lopen, Chr$("test.txt"), OF_READ)
  invoke _lread, eax, edi, ebx      ; exactly 4096 bytes in a VirtualAlloc'ed buffer
  call _lclose
  Print Str$("%i bytes read from file\n", ebx)      ; this file is exactly one page long
  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK

  PrintLine "Now trying crt_strlen:"      ; once you see this, Ctrl C is your only change...
  invoke crt_strlen, edi                  ; ... because the CRT fails miserably, it just hangs
  Inkey Str$("crt_strlen says the len of string is %i bytes\n", eax)      ; you will not see this message
  Exit
  TryCatch (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1217)End            ; activate the handler - must be last instruction before "end start"
end start

Yes it does return 4096 bytes, while the CRT just hangs, but... we must go at least one byte beyond the limits, otherwise how would the algo know that byte 4096+1 is a zero byte...?

Still, Len() remains usable, while the CRT just hangs. You can try inserting an exception handler also for the CRT:
  Try
  invoke crt_strlen, edi      ; great, it doesn't hang any more...!
  Catch

That works in the sense that your app is no longer blocked, but look at the result :P

Interestingly enough, Masm32 len() fails silently:
  Try
  Print Str$("Masm32 len() says the length of the string is %i bytes\n", len(edi))
  Catch


And of course, all these examples work correctly if you replace the 4096 with 4095.
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 09:43:31 PM
deleted
Title: Re: Faster Memcopy ...
Post by: Antariy on March 19, 2015, 10:30:34 PM
Once again, this algo will not fail with ANY zero-terminated string located anywhere in memory:


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
;mov edx,[esp+4]
;mov eax,[esp+4]
db 8bh,94h,24h,04,0,0,0
db 8bh,84h,24h,04,0,0,0
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
;and edx,-10h
db 81h,0e2h,0F0h,0FFh,0FFh,0FFh
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
jz @F
pcmpeqb xmm7,[edx]
add edx,10h
pmovmskb eax,xmm7
shr eax,cl
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16
pcmpeqb xmm7,[edx]
add edx,16
pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
lea eax,[eax+edx-10h]

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


Jochen,
your "example" is WRONG. You're passing NOT PROPER string to the algo - and you should not wait for proper results. Your string is exactly unterminated - it's an infinite size from the point of view of the algorithm logic, and the results you are returning (4096) - ARE WRONG, take that or not - that is for your decision. The algo should fail there (crash) or it should return the flag which OBVIOUSLY marks that the string is incorrectly formatted - for an instance you should return -1 as mark of the string ininity. Otherwice using your StrLen you will never know that the string is not proper string - preperly terminated - with the real size of 4096 bytes - and your code will use that improper string further with full assuming that it's proper, but it is WRONG "ZERO TERMINATED" STRING. And any other components of the program yours or not yours (for example Windows), will crash or misuse that improper string, for an instance long after you did "checked" it, so after that you will long search why it crashes - because at the point where you were required to flag the improper string, you just silently returned "the fine" and "seems ok" result.
The code should never "silently substitute" the breaks in the program logic (improper zero terminated string is that stuff) with the "result which seems OK" - but your strlen does exactly this.
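One possible shape for such flagging - a hypothetical sketch, not taken from anyone's library - is a bounded scan that returns -1 when no terminator exists within a caller-chosen maximum:

; hypothetical BoundedStrLen: eax = buffer, ecx = max bytes to scan
; returns length in eax, or -1 if no zero byte within the limit
BoundedStrLen:
    mov  edx, eax            ; edx = running pointer
@@: test ecx, ecx
    jz   not_terminated      ; whole allowed range scanned, no zero byte found
    cmp  byte ptr [edx], 0
    je   terminated
    inc  edx
    dec  ecx
    jmp  @B
terminated:
    sub  edx, eax            ; pointer difference = string length
    mov  eax, edx
    ret
not_terminated:
    or   eax, -1             ; -1 loudly flags an improperly terminated string
    ret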

nidud,
your code still has the same problem - for some reason you check only the first 32 bytes of the string, but for the rest you assume it is safe to do two reads and only one check - the same bug. Your code is now partially "safe" only with strings smaller than 32 bytes - any other string will crash your code if it is placed so that its zero terminator is closer to the page end than 32 bytes, i.e. if the string terminates 1...31 bytes from the page end.

My first post here, written in almost proper English and very descriptive, has all the info needed to implement a fully safe algo without any assumptions or silent substitution/hiding of errors. Sometimes I agree with the topic paster's thought that the more descriptive and complete information one tries to provide, the less respect and careful attention one's descriptions/writings receive.
Title: Re: Faster Memcopy ...
Post by: nidud on March 20, 2015, 12:11:59 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 20, 2015, 01:47:04 AM
Quote from: Antariy on March 19, 2015, 10:30:34 PM
Jochen,
your "example" is WRONG. You're passing NOT PROPER string to the algo - and you should not wait for proper results.

Alex,

My example is certainly problematic, and I agree that such an "improperly delimited" string should not be passed to a Windows API that expects a zero-delimited string; which, for example, is not the case for lstrcpyn:

  Let esi=Space$(4097)
  invoke lstrcpyn, esi, edi, Len(edi)
  Print "[", esi, "]"


OTOH, the string is perfectly valid in a MasmBasic macro such as Let or Print (which only want to know how many bytes must be copied):

  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK
  Let esi="The string: ["+edi+"]"
  PrintLine esi
  PrintLine "[", edi, "]"


Nonetheless, I do not encourage using SEH (that is what Init tc does) just to enable problematic examples like this.
Title: Re: Faster Memcopy ...
Post by: Antariy on March 20, 2015, 06:01:37 AM
Quote from: nidud on March 20, 2015, 12:11:59 AM
The code won't fail   :P

To put your mind at ease, this test the last 32 bytes


You may leave your sarcasm for yourself :P

Yes, it will not fail, because you aligned the data fetch to 32 bytes - but when I first replied to your code sample, you provided only the inner loop logic, without the other code which shows that you align to 32. The second time, I paid attention to how you do the first check but did not notice that you align the buffer fetch to 32 - that is why I said it would fail. It would indeed fail if it fetched 32 bytes with the fetch aligned to only 16; but, as said, I simply did not notice the proper alignment when replying the second time, and the first time there was no full code from your side.
But, again, nothing is wrong with what I said: in my first post in the thread I described the proper way to implement the algo; for an instance, I wrote this sentence there
Quote
It is simpler to align the pointers to a power of two - then you may freely read and assume the code will not crash, and it will not crash provided you check every read against the algo's termination condition.

You just did it right (aligned to a power of two equal to the data fetch size - the same thing I wrote: align and use a proper data size - QWORD/DWORD/WORD/BYTE etc; you just did it with a "DXMMWORD" (double XMM word) - so nothing is wrong with my explanation).

The source you provided is not recompilable - probably a set of your custom macros is missing - so I did not run the test itself, but it will certainly work.
Just one thing: which code did you use in the test? Or do you load the code from the .bin files? I just haven't found, for an instance, the code you posted earlier (which aligns to 32) in the disassembly listing ::)
Title: Re: Faster Memcopy ...
Post by: Antariy on March 20, 2015, 06:45:08 AM
Jochen,

Quote from: jj2007 on March 20, 2015, 01:47:04 AM
My example is certainly problematic, and I agree that such an "improperly delimited" string should not be passed to a Windows API that expects a zero-delimited string; which, for example, is not the case for crt_lcpyn:

  Let esi=Space$(4097)
  invoke lstrcpyn, esi, edi, Len(edi)
  Print "[", esi, "]"


Well, lstrcpyn is more like a memcpy function with one additional stop factor - the zero byte - because it has a length specifier. So it's not really a valid example. Moreover, for some reason the lstrXXXn functions break the zero-terminated logic itself: they cut the string at an unexpected place, so in some cases you receive a partial string, which may not look like a proper result, as you of course understand. These functions act more like those "hiders" of errors - the code actually copied only part of the string, but the proggie continues to use that string thinking it is the proper, full string with all the information it must hold. Of course the truncated string contains incomplete info, so it may be treated only as trash, not data. IF the string was just some message etc., it may be "acceptable" to get a truncated message on the display; but if the original string that was copied to the too-small buffer and got truncated contained important data (a file name, a file system path, the name of some object, an encoded form of other data (a short base64 message, for an instance) etc. etc.), then the case is obviously unacceptable, because you get not the data but trash.
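A small illustration of that truncation hazard (path and buffer size invented for the example):

.data
szPath   db "C:\Program Files\MyApp\config.ini",0
.data?
szSmall  db 16 dup(?)
.code
invoke lstrcpyn, ADDR szSmall, ADDR szPath, 16
; szSmall now holds "C:\Program File",0 - properly zero-terminated,
; but as a file system path it is trash, and nothing flagged the loss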
In your example you might substitute lstrcpyn with memcopy + zero termination, for an instance, since the work of handling the exception and determining the "length" of the string was already done by MasmBasic.
But that does not change things, actually: your code got an improper string, and the code does not know why the string was improper - it was not terminated and continued until the end of the page. The code should flag this instead of just continuing to work as if the string were properly terminated. A rule of programming is that a program which does not work is better than one which works improperly. So it is better for the code to detect that it got an improper zero-terminated string and flag it in some way (to the user, or just by taking another path/returning etc). Of course, this applies to code which expects ZERO TERMINATED strings - if an algo which requires zero-terminated strings receives an unterminated one, it should crash or flag it. But algos which do not require zero-terminated strings, relying instead on "buffer address" and "known size" params, work the same whether or not there is a zero terminator - simply because by design they require binary strings, not zero-terminated ones.
Your MB Len algo is proper for MB strings, but when you pass a string which is claimed to be zero-terminated, you should not silently accept strings which are not actually terminated - that was the point I tried to describe.

Anyway, the problem with algos which accept improper zero-terminated strings without flagging and rejecting them is this: if you continue, you may keep receiving more and more such strings, and SEH is very greedy in terms of timing, so every time you receive such a string you pay thousands of cycles in stalls just to continue work after the SEH catch.
And, of course, if you pass that not-actually-terminated string to any other code, it may crash there. The other thing you may do is copy the improper string into a proper MB string and pass that to the algos - but, again, every improper string copy will cost thousands of cycles.

Quote from: jj2007 on March 20, 2015, 01:47:04 AM
OTOH, the string is perfectly valid in a MasmBasic macro such as Let or Print (which only want to know how many bytes must be copied):

  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK
  Let esi="The string: ["+edi+"]"
  PrintLine esi
  PrintLine "[", edi, "]"


Nonetheless, I do not encourage SEH (that is what Init tc does) to enable problematic examples like this.

BTW, what does the acronym OTOH mean?

Yes, I did not say that MasmBasic strings are wrong if they are not terminated with zero; I was talking about ZERO TERMINATED strings. If an MB algo receives a zero-terminated string which is not actually terminated, then it should obviously report this to you in some way - for an instance, you might return the "seems OK" string length in EAX, with -1 in EDX if the string was not terminated and zero in EDX if it was a proper zero-terminated string.

Does the installed SEH cover all the algos in your library? I.e., do the Let and Len algos not crash on that improper zero-terminated string only if you install the SEH at Init? If you don't, will both Len and Let crash on that string? How did you implement the SEH coverage in your lib - as an UnhandledExceptionFilter, or as per-thread SEH frames (just like in my signature)? UnhandledExceptionFilter does not provide a way to handle exceptions in multi-threaded applications, because the "safe address" used to resume execution might be overwritten by another thread entering some function which sets up its own restoration address.
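For reference, the per-thread frame variant looks roughly like this (a minimal sketch; MyHandler is a hypothetical routine that must follow the SEH handler prototype):

assume fs:nothing
push offset MyHandler        ; address of this thread's exception handler
push dword ptr fs:[0]        ; link to the previous frame in this thread's SEH chain
mov  fs:[0], esp             ; our frame is now the first in the chain
; ... code that may fault ...
pop  dword ptr fs:[0]        ; unlink our frame
add  esp, 4                  ; discard the handler address

Since each thread has its own chain rooted at fs:[0], this avoids the multi-threading problem described above.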
Title: Re: Faster Memcopy ...
Post by: nidud on March 20, 2015, 07:17:22 AM
deleted
Title: Re: Faster Memcopy ...
Post by: Antariy on March 20, 2015, 07:31:40 AM
Quote from: nidud on March 20, 2015, 07:17:22 AM
Quote from: Antariy on March 20, 2015, 06:01:37 AM
You may leave your sarcasm for yourself :P

:biggrin:

:t

Quote from: nidud on March 20, 2015, 07:17:22 AM
Quote
Yes, it will not fail, because you aligned the data fetch to 32 bytes - but when I first replied to your code sample, you provided only the inner loop logic, without the other code which shows that you align to 32.

That code did actually fail. All these versions are included in the strlen.zip file (#50). I converted a string library some time ago, and the read-ahead code is used in many of these functions.

Ah, so at least one of the pieces discussed failed. You mean this algo, probably?

.686
.model flat, stdcall
.code
.xmm

OPTION PROLOGUE:NONE, EPILOGUE:NONE

strlen  proc string
mov ecx,[esp+4] ; 4
xorps xmm2,xmm2 ; 3
push edx ; 1
@@:
movdqu  xmm1,[ecx+16]
movdqu  xmm0,[ecx]
add ecx,32
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb eax,xmm1
shl eax,16
or eax,edx
jz @B
pop edx
bsf eax,eax
lea eax,[ecx+eax-32]
sub eax,[esp+4]
ret 4
strlen  endp

END


It's in 4.asm, and it is the only algo which reads unaligned, so it crashed near the end of the page even though it read in 16-byte chunks - and moreover it does two reads, 32 bytes in total. So the first time, you posted the two-reading piece of this code?

Quote from: nidud on March 20, 2015, 07:17:22 AM
Quote
Quote
The source you provided is not recompilable - probably a set of your custom macros is missing - so I did not run the test itself, but it will certainly work. Just one thing: which code did you use in the test? Or do you load the code from the .bin files?

I uploaded the test-bed (benchmarks.zip (http://masm32.com/board/index.php?action=dlattach;topic=4067.0;attach=3869)) used here (#34) (http://masm32.com/board/index.php?topic=4067.msg43117#msg43117).

Ah, I got it earlier but did not realize it was the full set of files. Is dz.exe your file manager DosZip?
But the other question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.
Title: Re: Faster Memcopy ...
Post by: nidud on March 20, 2015, 08:22:02 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 20, 2015, 02:14:53 PM
Quote from: Antariy on March 20, 2015, 06:45:08 AM
BTW, what does the acronym OTOH mean?
on the other hand

Quote
Does the installed SEH cover all the algos in your library? I.e., do the Let and Len algos not crash on that improper zero-terminated string only if you install the SEH at Init?

If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 20, 2015, 09:15:14 PM
Hello Antariy, nice to meet you,

Quote from: Antariy on March 19, 2015, 06:21:32 AM
There is no difference in the "MEM" or "STR" functions - just imagine your "MEM" function received wrong bytecount to scan/copy/etc - and if the buffer has actually the size smaller than it was specified with bytecount param, and the buffer is lying near to the end of the memory page - the "MEM" code will crash as well as the "STR" code.

- If the user makes an incorrect request (wrong bytecount) then perhaps the algo should just crash. If he's careless enough to do that he probably won't check error code return, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps, let it crash so he knows the error. That's how my memcopy behaves.

But, that's a matter of opinion. The important point is, if user gives correct bytecount, (or, correctly terminated C-style zero-STR), then you must not crash by reading / writing outside the requested buffers. Even though reading 128 bytes at a time, my memcopy never goes beyond user-specified bounds.

Quote from: Antariy on March 19, 2015, 06:21:32 AM
It is simpler to align the pointers to a power of two...

- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called "chunk".)

Quote from: Antariy on March 19, 2015, 06:21:32 AM
The loud cry "but the algo is faster" is no excuse - the algo must be reliable first, and only then MAY one think about making it faster.

- Of course! Surely Lingo agrees?

Quote from: Antariy on March 19, 2015, 10:30:34 PM
Sometimes I agree with the topic paster's thought that the more descriptive and complete information one tries to provide, the less respect and careful attention one's descriptions/writings receive.

- Very annoying, go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. Couldn't find anywhere on the net to copy it from, so I had to write it myself :P

Quote from: Antariy on March 19, 2015, 10:30:34 PM
Once again, this algo will not fail with ANY zero-terminated string located anywhere in memory: ...

- Right, but it can be 18 bytes shorter, and a little cleaner:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4]                   ; don't get eax from [esp+4]
        mov eax, edx                     ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0           ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0           ; these two db instructions
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
and edx,-10h                      ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F                             ; no jump save 2 bytes
pcmpeqb xmm7,[edx]
; add edx,10h                       ; don't add here save 3 bytes
pmovmskb eax,xmm7
shr eax,cl                         ; works fine if cl is 0
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16                ; still aligned but 16 bytes less
add edx,16                         ; switch order of these
pcmpeqb xmm7,[edx]                 ; two instructions
pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
        add eax, edx                    ;  don't need -10h any more, save 2 bytes

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


Here it is without all the extra comments:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings

    mov edx,[esp+4]
    mov eax, edx
    add esp,-14h
    movups [esp],xmm7
    mov [esp+16],ecx
    and edx,-10h
    mov ecx,eax
    pxor xmm7,xmm7
    and ecx,0fh
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    shr eax,cl
    bsf eax,eax
    jnz @ret
   
    pxor xmm7,xmm7
    @@:                                ; naturally aligned to 16
        add edx,16
        pcmpeqb xmm7,[edx]
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
   
    bsf eax,eax
    sub edx,[esp+4+16+4]
    add eax, edx
   
    @ret:
    movups xmm7,[esp]
    mov ecx,[esp+16]
    add esp,14h
    ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


- BTW your English is very understandable, much better than I write your language! c u later
Title: Re: Faster Memcopy ...
Post by: Antariy on March 21, 2015, 11:08:43 PM
Quote from: jj2007 on March 20, 2015, 02:14:53 PM
Quote from: Antariy on March 20, 2015, 06:45:08 AM
BTW, what does the acronym OTOH mean?
on the other hand

Quote
Does the installed SEH cover all the algos in your library? I.e., do the Let and Len algos not crash on that improper zero-terminated string only if you install the SEH at Init?

If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.

How did you implement the exception handling - as per-thread SEH or SetUnhandledExceptionFilter? IIRC it was SUEF?




Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Hello Antariy, nice to meet you,

Hello rrr, nice to meet you, too.

Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Hello Antariy, nice to meet you,

Quote from: Antariy on March 19, 2015, 06:21:32 AM
There is no difference in the "MEM" or "STR" functions - just imagine your "MEM" function received wrong bytecount to scan/copy/etc - and if the buffer has actually the size smaller than it was specified with bytecount param, and the buffer is lying near to the end of the memory page - the "MEM" code will crash as well as the "STR" code.

- If the user makes an incorrect request (wrong bytecount) then perhaps the algo should just crash. If he's careless enough to do that he probably won't check error code return, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps, let it crash so he knows the error. That's how my memcopy behaves.

But, that's a matter of opinion. The important point is, if user gives correct bytecount, (or, correctly terminated C-style zero-STR), then you must not crash by reading / writing outside the requested buffers. Even though reading 128 bytes at a time, my memcopy never goes beyond user-specified bounds.

The algo cannot know whether the user is careless or not - it is "just code" ("just a machine") and has no right to make assumptions about those who use it. Imagine the same applied, for an instance, to some protection scheme: the user sits down at the computer, turns it on, and the computer says "I dislike your face today; probably you're drunk, or you're not the owner of this machine, maybe you're a criminal - let's format the hard drive of this machine to protect the real owner from data theft".

The fact that the "ideology" of "protected programming" was perversed by many contemporal programmers who are careless and simultaneously with carelessness use the "protected programming" techniques, do not make the "ideology" itself wrong. The code is better to behave in predictable, fully documented and robust way - and the rest is for the user's decision. The routines are just tools, aren't they? One wants to use as much as possible useful, comfortable and robust tool, so, if there is such tool - it's good. But HOW the tool will be used and which product it will produce is for the "master" who uses this tool.

So probably nothing is wrong with "protected programming" or "SEH" etc. - this is a problem of human psychology rather than of the "ideology": careless and sloppy coders produce the problems, not the code itself.

The code cannot know whether the user provided a wrong bytecount, or whether the buffer for some reason is not the size the user thinks it is (this can really occur - for an instance with memory-mapped files/sections, when the data was not loaded into the memory page for some reason (the disk unexpectedly became full, or simply a hardware failure) even though the bytecount was right). So, if the user was careless enough not to notice that the system did not provide the full data and passed the pointer to the code anyway, he gets one more opportunity to learn that there is an error - especially if the error-flagging return differs very noticeably from the regular return. System bugs themselves cannot be discounted either - even if the system produced a wrong result, the user may still notice it thanks to code which reports it.
After all, yes, "protected programming" implies relying on the system as a "standard" which has no bugs, and the coding style is the programmer's decision. There are, as usual, two big camps: those who code absolutely sloppily, and those who code paranoidally "carefully" - both camps usually produce the buggiest products on the planet and are usually the biggest and most respected companies on the planet :greensml:



Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Quote from: Antariy on March 19, 2015, 06:21:32 AM
It is simpler to align the pointers to a power of two...

- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called "chunk".)

Yes, this correction is right and expresses what I wanted to say more clearly.

Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Quote from: Antariy on March 19, 2015, 06:21:32 AM
The loud cry "but the algo is faster" is no excuse - the algo must be reliable first, and only then MAY one think about making it faster.

- Of course! Surely Lingo agrees?

I do not know which things he agrees with, and moreover I do not want to know :greensml: But from his "coding style" many of our members have probably formed the opinion that he is a megalomaniacal, self-styled "God of assembly" and a "Code Nazi" who was first on the planet to invent everything, with everyone else having stolen it from him. But he is also known for producing buggy but "fast working" code, and when he gets notes on that he just goes over to insulting others - so, I think, it's better to ask his psychiatrist what he agrees with :greensml:

Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Quote from: Antariy on March 19, 2015, 10:30:34 PM
Sometimes I agree with the topic paster's thought that the more descriptive and complete information one tries to provide, the less respect and careful attention one's descriptions/writings receive.

- Very annoying, go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. Couldn't find anywhere on the net to copy it from, so I had to write it myself :P

Well, jokes apart, I think you understood that I did not mean "copy"-"paster". You "pasted" the topic onto the forum - there was no such topic, and then it appeared - it was "pasted" there, so the one who did it was the topic-"paster".
Isn't the word "paste" independent of "copy", and of the term "copy-paste"? So that was pretty rightly said :P

Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
Quote from: Antariy on March 19, 2015, 10:30:34 PM
Once again, this algo will not fail with ANY zero-terminated string located anywhere in memory: ...

- Right, but it can be 18 bytes shorter, and a little cleaner:

How it was written has its reasons. For an instance, the XMM and ECX preservation was a "condition of production" - Jochen's condition, since his lib preserves those regs in its internal functions; it's MasmBasic's ABI.
Other things, like the "wide" instruction encodings, were used to pad the inner loop to align it by 16 - yes, but that too was deliberate, to fit the code exactly as written, with all its specifics. The code can be shorter, yes, but as written it seemed (to me, of course, so call it a "subjective judgement") the most performant across the widest variety of hardware. Some of our members can confirm that I like to "hunt for bytes" like a maniac sometimes :greensml:, but this was not such a case (pun intended).

I will comment on the changed code in the lines below the changes - without indent - explaining why it was written the way it was.

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4]                   ; don't get eax from [esp+4]
        mov eax, edx                     ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0           ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0           ; these two db instructions

The original code works faster because both instructions reference the same memory location, which, once loaded into the cache on first access (if it was not there already), is a "fast to do" thing - unlike the dependent, and thus stalling, instruction pair where the value is loaded into one reg and then copied into a second reg from the first.

If you want to save those two bytes and still leave these instructions as originally coded, just replace the pxor instructions with xorps.

add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx

Actually, if I had to decrease the size of the algo I would probably remove the XMM and ECX preservation, because it is not the API's ABI, only MasmBasic's ABI.

and edx,-10h                      ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction

This was coded this way to align the loop to 16 bytes. Why the code as a whole was written like this can only be understood by looking at it as a whole, not as separate pieces - read below why writing the code the way this example changes it was judged inappropriate.

mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F                             ; no jump save 2 bytes

The CPU "predicts" the "forward" jumps as "probably untaken", so, the jump will not be taken with the
possibility 16:1 (unaligned string) with byte-aligned strings, or 8:1 with even aligned strings, or 4:1 with
dword aligned strings, so, the jump and the some slowdown is more likely to not happen. And even if it happens
the slowdown itself is not big (the high numbers of jumps are exaggerated really, and as the jump is near and in the same code
location actually, it maybe assumed that the jump will not lead to any "misprediction" stalls since the code
is already fetched and decoded, even that code where the jump jumps (the pun again) to). So, it will cost only few cycles,
but taking in account that it over-jumps some instructions, which will take some time, this jump in the real world
(not millions of iterations on the same data as it is doing with our testbeds) will not only be a slowdown but will
rather speed the code up.

pcmpeqb xmm7,[edx]
; add edx,10h                       ; don't add here save 3 bytes

Why the pointer advance was placed after the data fetch is described below.

pmovmskb eax,xmm7
shr eax,cl                         ; works fine if cl is 0

The jump was done not to "protect" from zero ECX but to not execute unrequired and slowing down code without reason.

bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16                ; still aligned but 16 bytes less
add edx,16                         ; switch order of these
pcmpeqb xmm7,[edx]                 ; two instructions

Advancing the pointer BEFORE the data fetch slows the code down, especially on CPUs older than the modern ones. mov eax,[edx] / add edx,16 will generally always be faster than add edx,16 / mov eax,[edx], because in the first case the two instructions pair (the memory access goes into one execution block and the arithmetic operation into the other), and in the second case they do not.

pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
        add eax, edx                    ;  don't need -10h any more, save 2 bytes
       
This was needed because the pointer advance was deliberately placed after the data fetch.

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


Quote from: rrr314159 on March 20, 2015, 09:15:14 PM
- BTW your English is very understandable, much better than I write your language! c u later

Well, it was not so some years ago, but some forum members have also assured me it is quite understandable, even being rather "Cyrillic English", as one of the forum friends called it.
Do you write Russian?



Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote from: Antariy on March 20, 2015, 07:31:40 AM
Ah, so at least one of the pieces discussed failed. You mean this algo, probably?

Yes, there are many like this around (http://masm32.com/board/index.php?topic=3396.msg36254#msg36254), and here are some more (http://www.alfredklomp.com/programming/sse-strings/)  :lol:

Did you point to that site for some particular reason (was there something useful on that page? I did not read it thoroughly), or just as an example of the fact that there are many implementations which do not care about the possibility of a crash (they do not align the fetch, and even grab the data unaligned all the time)?


Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote
It's in 4.asm, and it is the only algo which reads unaligned, so it crashed near the end of the page even though it read in 16-byte chunks - and moreover it does two reads, 32 bytes in total.

The same also with 4 and 8.

Yes, but I was talking about the SSE one and just did not mention the others.


Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote
So the first time, you posted the two-reading piece of this code?
Think I posted it somewhere but I'm not sure.

Just on the previous page there was a piece containing only the inner loop of the two-reads-one-check algo. My first post here came soon after that - where I first said that the problem with the algo is two reads and one check. Since you provided only the inner loop and not the full algo the first time, I assumed you did the reads unaligned/aligned to only 16.


Quote from: nidud on March 20, 2015, 08:22:02 AM

Quote
Ah, I got it earlier but did not realize it was the full set of files. Is dz.exe your file manager DosZip?
Yes, plus assembler and make to simplify the testing. If on the same drive as masm32 it should work out of the box.

OK, good. Is the assembler JWasm?


Quote from: nidud on March 20, 2015, 08:22:02 AM
Quote
But the other question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.

Yes, for reasons explained here (http://masm32.com/board/index.php?topic=4067.msg43020#msg43020).

Ah, the well-known code location problem. But you might implement the solution more simply - by defining a different segment for every tested algo. This also allows running algos which are not "code location independent", with no need to relocate them manually in some way. Algos with jump/call tables are such "relocation required" algos, for an instance, as are algos which do non-near relative/direct calls/jumps.
Title: Re: Faster Memcopy ...
Post by: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex   :biggrin:

that's a lot of English - lol
Title: Re: Faster Memcopy ...
Post by: Gunther on March 22, 2015, 12:21:42 AM
Hi Alex,

Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex   :biggrin:

that's a lot of English - lol

but that's good news: Alex is back after a long break and he's back with high quality posts.  :t I missed you.

Gunther
Title: Re: Faster Memcopy ...
Post by: nidud on March 22, 2015, 02:03:27 AM
deleted
Title: Re: Faster Memcopy ...
Post by: Antariy on March 22, 2015, 06:08:02 AM
Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex   :biggrin:

that's a lot of English - lol

Hello Dave :t

Is that more or less proper English? Sometimes it isn't, yeah :greensml:

Quote from: Gunther on March 22, 2015, 12:21:42 AM
Hi Alex,

Quote from: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex   :biggrin:

that's a lot of English - lol

but that's good news: Alex is back after a long break and he's back with high quality posts.  :t I missed you.

Gunther

Thank you, Gunther! :biggrin: :redface:



Quote from: nidud on March 22, 2015, 02:03:27 AM
Quote from: Antariy on March 21, 2015, 11:08:43 PM

; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
and edx,dword ptr -10 ; same same ?...


well, here is a similar version using 32 bytes

I absolutely don't get what that is in the quote of "mine"??? What do the words "same same" mean?
Also, what does "similar version" mean - similar to what? To AxStrLenSSE? Well, if so, then it's actually not very similar. Most importantly, your version advances the pointer BEFORE the data fetch. That is the very thing I described in the post commenting on rrr's "edition" of the algo - and I did not accept that edit as a replacement for the current AxStrLenSSE implementation; I thought it was clearly said that I consider the original version the best possible for that kind of code (simple, using a single XMM reg).
Also, writing to [esp-4] is more or less "legal" in user mode, though even that is disputable (there are cases where putting data below ESP causes the code to work wrongly); and in kernel mode (in drivers) this string routine may not be used at all, especially on systems with DEP enabled.
Also, XMM is not preserved... so this is actually a strange "reproduction" of the algo with broken/not fully repeated logic. XMM isn't saved at all; instead of ECX it saves EDX, and saves it the wrong way - below ESP. And the pointer advances before the data fetch - which slows the algo down.


Quote from: nidud on March 20, 2015, 08:22:02 AM

My point was that you seem to think in size of registers and not in chunk size  :P


That is a strange point if you will, again, just re-read my first post here, which may not be crystal clear, but describes everything that has been discussed here on the topic of StrLen. Probably the thread should be split into two topics, because MemCopy has drifted into StrLen ::)

It talked about the page-size granularity of memory, about the possibility of safe algo operation, and about power-of-two "data grabbing sizes" (later corrected to the more suitable word "chunk"). And that post was made in answer to your post where you said that only byte-by-byte reading algos can be safe, and that it is a "common technique" to read data "ahead" - "even at Intel". Since I did not agree with all you said, I wrote that post. Algos may be safe while reading more than a byte; the "read ahead" is not actually read-ahead in the strict sense; and that "common technique", used widely by many hobby programmers over the world, is buggy, because real read-ahead is a way to crash code when it accesses buffers.

Taking all that into account, it's strange that you draw such conclusions ::)

Quote from: nidud on March 22, 2015, 02:03:27 AM
Tried padding and a lot of other things, but this seems to be the best method so far, at least on the CPU I'm now using. The single-file edit also simplifies the testing by having both the (now small) list file and source open in the editor at the same time. The list file then updates when you compile the source from the editor. Flipping back and forth (using F6) makes it easy to align the code.

Padding is not what I said. I said: just create a different segment for every code example. Like this:

SEG1 SEGMENT PARA PUBLIC 'CODE'
tested code 1 here
SEG1 ENDS

SEG2 SEGMENT PARA PUBLIC 'CODE'
tested code 2 here
SEG2 ENDS

... etc ...


When you write your source this way, every code piece is placed in its own section in the PE file, and the sections are page-granular in memory - every piece lands in its own page, starting at an address aligned to 4096. This solves the code location problems as well - you might try it on your CPU. And it is more comfortable because it does not limit the testable code to location-independent code only. I have actually suggested this way of fixing code location problems in several places on the forum.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 22, 2015, 06:56:15 AM
The only really reliable way to test algorithms that are subject to code location is to put each algo in a separate executable using the identical test conditions. The problem seems to vary with the hardware but if you are performing a simple test between 2 or more algos in the same test piece, you can move the algos around to see if their timing changes. Occasionally doing a simple "align ##" will stabilise the timings but I have seen enough algorithms that do not respond to alignment.

There is another factor I used to see with some of Lingo's test pieces: one algo leaves the following algo with a mess to clean up, and to get a reliable timing for the affected algo, you had to comment out the first one.
Title: Re: Faster Memcopy ...
Post by: nidud on March 22, 2015, 07:24:48 AM
deleted
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 22, 2015, 06:27:42 PM
Antariy,

Thanks very much for your comprehensive answer but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo so will leave those topics alone! As for "topic paster", it was just a joke, no worries.

The important thing is the code. I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications, and tested the shorter algo against yours as well as I could. Bottom line, it seems the shorter algo is, for the most part, faster.

First, to avoid the register stall in the instruction pair mov edx,[esp+4] / mov eax,edx, I just moved the mov down 2 instructions.

I didn't know xorps could save a byte over pxor, thank you. I used those 2 bytes to put the jump back in. It was necessary for the AMD which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.

Still preserving XMM and ECX.

Yes I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop. I know that's better than nop's or lea eax, [eax] which waste time as well as space. But surely it's even better to not waste the bytes, as long as routine is still as fast or faster.

CPU branch prediction - u know, behavior seems to change with every new release from Intel or AMD. Often, we optimize a routine on our own machines, but on newer/older machines it may behave different. I routinely optimize on my Intel I5, then try it on AMD A6 and Pentium 4; often fastest on one machine may be slowest on another. So I'm leery of artificial coding techniques for speed.

Now, I had thought you were right: pointer advance should go AFTER data fetching. However on the Intel my loop was faster. On AMD, a little slower. Apparently the order of the two instructions makes little difference. Anyway, there's very simple correction available, if needed. Just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - uses one more byte.

By far, the hardest part of the whole exercise is not writing the routine, but getting semi-reliable timing! First, I used your suggestion and put all algos in separate segments. Then, I doubled both of them; put yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.

Here are the resulting runs:

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

BUFFER ALIGNED
thecount StrLen_orig(104) StrLen_rrr2(86) StrLen_rrr2(86) StrLen_orig(104)
     8         906             854             852             905
    31        1147            1020            1019            1074
   271        4024            4142            4020            3924
  2014       26115           24593           24595           25818
 62159      757816          747523          748235          757502

BUFFER MISALIGNED src 11
thecount StrLen_orig(104) StrLen_rrr2(86) StrLen_rrr2(86) StrLen_orig(104)
     8        1184            1157            1132            1241
    31        1399            1391            1395            1425
   271        4508            4368            4432            4522
  2014       25622           25036           25018           25604
 62159      757612          747909          746568          757986

AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)

BUFFER ALIGNED
thecount StrLen_orig(104) StrLen_rrr2(86) StrLen_rrr2(86) StrLen_orig(104)
     8        2124            1551            1551            2319
    31        2526            1944            1926            2494
   271        6220            5679            5676            6416
  2014       29950           30171           30869           30104
 62159      872547          886154          887221          871530

BUFFER MISALIGNED src 11
thecount StrLen_orig(104) StrLen_rrr2(86) StrLen_rrr2(86) StrLen_orig(104)
     8        2776            2320            2319            2622
    31        2795            2420            2418            2797
   271        6016            5461            5476            6055
  2014       30809           31229           31080           30842
 62159      887148          887519          888207          889350


Your routine was a bit better on Intel ALIGNED 271; also slightly better on the longer strings on AMD. Everywhere else, the shorter routine is better. It's dramatically better on AMD short strings, who knows why; and better on Intel long strings. BTW all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.

Here is my modified routine:

; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
    mov eax,[esp+4]
    add esp,-14h
    movups [esp],xmm7
    mov edx, eax
    mov [esp+16],ecx
    and edx,-10h
    xorps xmm7,xmm7
    mov ecx,eax
    and ecx,0fh
    jz intoloop
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    shr eax,cl
    bsf eax,eax
    jnz @ret
    xorps xmm7,xmm7
    @@:                       ; naturally aligned to 16
        add edx,16
    intoloop:
        pcmpeqb xmm7,[edx]
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
    bsf eax,eax
    sub edx,[esp+4+16+4]
    add eax, edx
   
    @ret:
    movups xmm7,[esp]
    mov ecx,[esp+16]
    add esp,14h
    ret 4
algo2 endp
end_algo2:


Bottom line, I can't believe it's right to waste those 18 bytes.

Finally, of course I can write Russian! I'll do it again: "Russian". - very easy. OTOH I can't write in Cyrillic to save my life :)

Zip has "testStrLen.asm" test program.
Title: Re: Faster Memcopy ...
Post by: nidud on March 22, 2015, 10:34:21 PM
deleted
Title: Re: Faster Memcopy ...
Post by: Gunther on March 23, 2015, 07:08:14 AM
Hi nidud,

Quote from: nidud on March 22, 2015, 10:34:21 PM
Here is the Intel version of strlen from 2011:

interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.

Gunther
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 23, 2015, 08:55:27 AM
Quote from: Gunther on March 23, 2015, 07:08:14 AM
interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.

Not hard, but tedious! If someone wants to do it and post the results, that would be appreciated. There are lots of defines like L(label) which (AFAIK) must be made into function MACROs - or else all occurrences replaced by hand. There are the "." labels to fix. There are the multiline statements, with ;'s separating them, to deal with; and of course the operand order of about 400 instructions to reverse. The #'s, %'s and $'s are easy. Normally this routine would be about 20 or 30 lines (in fact that's the whole point - he has unrolled everything to the max) and the job would be trivial. OTOH if there were a few thousand lines, it would be rather fun to make .bat files and MACROs to automate the conversion. But for 700 lines it's not really worth the effort. Tedious. Oh, and don't forget to fix that egregious spelling mistake, "calee".
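For anyone attempting it, the mechanical part of the conversion looks like this (a made-up two-instruction sample, not taken from the Intel source):

; AT&T:    movdqu  (%eax), %xmm0
;          pcmpeqb %xmm0, %xmm1
; Intel syntax after conversion:
movdqu  xmm0, [eax]          ; % prefixes dropped, operand order reversed
pcmpeqb xmm1, xmm0           ; the (%eax) memory form becomes [eax]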

Brings up a question: why is it in AT&T syntax? Obviously this routine is not for Windows. Note that he uses xmm0-xmm3 then skips to xmm6; in Windows you'd use 4 and 5, wouldn't you? The reason it might matter: what is the target (type of) processor? If it's for handhelds it may not be optimal for us - those processors might not have good branch prediction, for instance, or other differences. I don't know, you understand, but he has some rather surprising techniques. Like, he avoids branches like the plague; avoids bsf; prefers pminub to pcmpeqb; other little things. These might be great techniques for us, or, as I wonder, might be designed for different processors (say, Atom) or environments. It also seems optimized for very small strings.

And, it doesn't seem very professional. All these lines, and he still doesn't have a MAXLEN - if the string isn't terminated the routine will dive right off the deep end. And why is Intel making it public anyway? Is this really their best effort? I'll bet it's not what Windows is using.

Anyway, it would be nice if someone would translate it, I'll be happy to test it, wouldn't be amazed if it's not all that impressive speedwise - in our typical PC environment.

[edit] Of course I'm pretty ignorant about these issues, don't mean to pretend I know what I'm talking about. Just some thoughts/questions that occurred to me
Title: Re: Faster Memcopy ...
Post by: nidud on March 23, 2015, 11:04:49 PM
deleted
Title: Re: Faster Memcopy ...
Post by: Gunther on March 24, 2015, 12:36:04 AM
Hi nidud,

could you post the entire code as attachment, please? It's much easier for me to do the conversion. Thank you.

Gunther
Title: Re: Faster Memcopy ...
Post by: nidud on March 24, 2015, 01:13:21 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 24, 2015, 02:30:50 AM
I must confess I don't understand what your testbed is showing. Can you explain?

-- string length 0..40
862314    cycles - (  0) proc_0: crt_strlen
513683    cycles - (105) proc_1: SSE 32
-- string length 40..80
940951    cycles - (  0) proc_0: crt_strlen
331639    cycles - (105) proc_1: SSE 32

- what are the numbers in brackets?
- why does SSE 32 run faster with the longer string?
Title: Re: Faster Memcopy ...
Post by: nidud on March 24, 2015, 02:47:14 AM
deleted
Title: Re: Faster Memcopy ...
Post by: Gunther on March 24, 2015, 04:17:22 AM
Hi nidud,

Quote from: nidud on March 24, 2015, 01:13:21 AM
I converted and tested both of them.
The Intel files are 8.asm and 9.asm in the archive

thank you, that saves me a bit of work. Here are the results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- string length 0..40
444941    cycles - (  0) proc_0: crt_strlen
231197    cycles - (105) proc_1: SSE 32
218271    cycles - (104) proc_2: AxStrLenSSE
210172    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
228546    cycles - (673) proc_8: SSE Intel Silvermont
374264    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
493818    cycles - (  0) proc_0: crt_strlen
155729    cycles - (105) proc_1: SSE 32
152824    cycles - (104) proc_2: AxStrLenSSE
148059    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
141848    cycles - (673) proc_8: SSE Intel Silvermont
263988    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
341526    cycles - (  0) proc_0: crt_strlen
69545     cycles - (105) proc_1: SSE 32
85672     cycles - (104) proc_2: AxStrLenSSE
87101     cycles - ( 86) proc_3: AxStrLenSSE (rrr)
43498     cycles - (673) proc_8: SSE Intel Silvermont
57624     cycles - (818) proc_9: SSE Intel Atom

result:
   413892 cycles - (673) proc_8: SSE Intel Silvermont
   445332 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   456471 cycles - (105) proc_1: SSE 32
   456767 cycles - (104) proc_2: AxStrLenSSE
   695876 cycles - (818) proc_9: SSE Intel Atom
  1280285 cycles - (  0) proc_0: crt_strlen
hit any key to continue...


Gunther
Title: Re: Faster Memcopy ...
Post by: dedndave on March 24, 2015, 06:57:32 AM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
----------------------------------------------
-- string length 0..40
1592748   cycles - (  0) proc_0: crt_strlen
1342550   cycles - (105) proc_1: SSE 32
1361220   cycles - (104) proc_2: AxStrLenSSE
1319353   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1256349   cycles - (673) proc_8: SSE Intel Silvermont
1586403   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1711483   cycles - (  0) proc_0: crt_strlen
927442    cycles - (105) proc_1: SSE 32
962485    cycles - (104) proc_2: AxStrLenSSE
968221    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
895419    cycles - (673) proc_8: SSE Intel Silvermont
1476721   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
659013    cycles - (  0) proc_0: crt_strlen
286781    cycles - (105) proc_1: SSE 32
333387    cycles - (104) proc_2: AxStrLenSSE
339087    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
255315    cycles - (673) proc_8: SSE Intel Silvermont
320034    cycles - (818) proc_9: SSE Intel Atom

result:
  2407083 cycles - (673) proc_8: SSE Intel Silvermont
  2556773 cycles - (105) proc_1: SSE 32
  2626661 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  2657092 cycles - (104) proc_2: AxStrLenSSE
  3383158 cycles - (818) proc_9: SSE Intel Atom
  3963244 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

I:\nidudStrlen2\strlen2 => timeit
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
----------------------------------------------
-- string length 0..40
1596428   cycles - (  0) proc_0: crt_strlen
1353058   cycles - (105) proc_1: SSE 32
1362316   cycles - (104) proc_2: AxStrLenSSE
1351833   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1251872   cycles - (673) proc_8: SSE Intel Silvermont
1600414   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1711735   cycles - (  0) proc_0: crt_strlen
926670    cycles - (105) proc_1: SSE 32
952517    cycles - (104) proc_2: AxStrLenSSE
982818    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
906223    cycles - (673) proc_8: SSE Intel Silvermont
1443038   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
681045    cycles - (  0) proc_0: crt_strlen
279759    cycles - (105) proc_1: SSE 32
326521    cycles - (104) proc_2: AxStrLenSSE
339426    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
263106    cycles - (673) proc_8: SSE Intel Silvermont
319237    cycles - (818) proc_9: SSE Intel Atom

result:
  2421201 cycles - (673) proc_8: SSE Intel Silvermont
  2559487 cycles - (105) proc_1: SSE 32
  2641354 cycles - (104) proc_2: AxStrLenSSE
  2674077 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  3362689 cycles - (818) proc_9: SSE Intel Atom
  3989208 cycles - (  0) proc_0: crt_strlen
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 07:49:00 AM
Quote from: nidud on March 24, 2015, 02:47:14 AM
Quote: why does SSE 32 run faster with the longer string?

It's a matter of priority for the end result: A 40 byte string is more important than a 1000 byte string in this case.

I think Jochen meant that your code is faster with the longer strings: it's faster with an 80-byte string than with a 40-byte string, i.e. faster on the less important case.

The timings of your latest zip (strlen2)

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
----------------------------------------------
-- string length 0..40
1631130   cycles - (  0) proc_0: crt_strlen
1438090   cycles - (105) proc_1: SSE 32
2560055   cycles - (104) proc_2: AxStrLenSSE
2488369   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1332828   cycles - (673) proc_8: SSE Intel Silvermont
1697691   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1756723   cycles - (  0) proc_0: crt_strlen
999911    cycles - (105) proc_1: SSE 32
1511301   cycles - (104) proc_2: AxStrLenSSE
1492232   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
972191    cycles - (673) proc_8: SSE Intel Silvermont
1551460   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
704575    cycles - (  0) proc_0: crt_strlen
305959    cycles - (105) proc_1: SSE 32
416240    cycles - (104) proc_2: AxStrLenSSE
418694    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
280099    cycles - (673) proc_8: SSE Intel Silvermont
348494    cycles - (818) proc_9: SSE Intel Atom

result:
  2585118 cycles - (673) proc_8: SSE Intel Silvermont
  2743960 cycles - (105) proc_1: SSE 32
  3597645 cycles - (818) proc_9: SSE Intel Atom
  4092428 cycles - (  0) proc_0: crt_strlen
  4399295 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  4487596 cycles - (104) proc_2: AxStrLenSSE
hit any key to continue...


No answers to any other posts, since I got worn out writing the long post in the Lab - the "Research" one.
BTW, timings are needed there :biggrin:
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 07:55:45 AM
Quote from: Gunther on March 24, 2015, 04:17:22 AM
   413892 cycles - (673) proc_8: SSE Intel Silvermont
   445332 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   456471 cycles - (105) proc_1: SSE 32
   456767 cycles - (104) proc_2: AxStrLenSSE
   695876 cycles - (818) proc_9: SSE Intel Atom
  1280285 cycles - (  0) proc_0: crt_strlen

Quote from: dedndave on March 24, 2015, 06:57:32 AM
result:
  2407083 cycles - (673) proc_8: SSE Intel Silvermont
  2556773 cycles - (105) proc_1: SSE 32
  2626661 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  2657092 cycles - (104) proc_2: AxStrLenSSE
  3383158 cycles - (818) proc_9: SSE Intel Atom
  3963244 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

result:
  2421201 cycles - (673) proc_8: SSE Intel Silvermont
  2559487 cycles - (105) proc_1: SSE 32
  2641354 cycles - (104) proc_2: AxStrLenSSE
  2674077 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  3362689 cycles - (818) proc_9: SSE Intel Atom
  3989208 cycles - (  0) proc_0: crt_strlen

Not bad for simple single-reg XMM code (and rrr's version is short), taking into account the crazy unrolling of Intel's code and the larger chunk size of nidud's code.
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 08:07:25 AM
Well, I think that if we advance the algo posted in the Lab in the "Research" topic, and handle short strings (64...256 bytes) with simpler code, like nidud's 32-byte-fetching one, then it will be two times smaller than Intel's Silvermont and probably faster than it.

BTW, nidud, you may try to use such a code in your inner loop:


@mainloop:
pcmpeqb xmm0,[ebx]
add ebx,16
pcmpeqb xmm1,[ebx]
add ebx,16
pmovmskb eax,xmm0
pmovmskb edx,xmm1

cmp eax,edx
jz @mainloop

shl edx,16
add edx,eax
bsf ecx,edx


It avoids unnecessary shifts and combining in the loop.
Title: Re: Faster Memcopy ...
Post by: nidud on March 24, 2015, 08:29:21 AM
deleted
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 08:37:12 AM
Quote from: nidud on March 24, 2015, 08:29:21 AM
Quote from: Antariy on March 24, 2015, 08:07:25 AM
BTW, nidud, you may try to use such a code in your inner loop:

I think it will fail if they are equal but not zero.

Yes... silly idea; I'm worn out from coding the prototype for the test in the Lab thread and from typing for two days ::)
Title: Re: Faster Memcopy ...
Post by: nidud on March 24, 2015, 09:14:22 AM
deleted
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 09:15:55 AM
Then try it this way (I'm not sure it will be faster, just suggesting to try):

@mainloop:
pcmpeqb xmm0,[ebx]
add ebx,16
pcmpeqb xmm1,[ebx]
add ebx,16
pmovmskb eax,xmm0
pmovmskb edx,xmm1

add eax,edx
jz @mainloop
jc @mainloop

pmovmskb eax,xmm0

shl edx,16
add edx,eax
bsf ecx,edx


The changed/added instructions are the ones without indent.
Take notice that the possibility of a jz misprediction (it would require both EAX and EDX to be 80000000h) is incredibly small, so the jc instruction will almost always be executed only once, and only after a proper exit, so it will not cost much, while the loop itself executes fewer operations.
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 09:27:59 AM
Quote from: nidud on March 24, 2015, 09:14:22 AM
I tried a few combinations yesterday
I first used this one

lea eax,[eax+20h]
pcmpeqb xmm1,[eax]
pmovmskb edx,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1

but this seems to be faster for some reason

movdqa  xmm1,[eax+30h]
movdqa  xmm0,[eax+20h]
pcmpeqb xmm0,xmm2
pmovmskb edx,xmm0
lea eax,[eax+20h]
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1


Reversed order of the data fetch - the higher offset first, then the lower offset. I also used that in the big code in the "Research" thread, in the first part, which handles 64-byte misalignment with two 32-byte parts :biggrin:

I think the first piece you showed here worked slower, possibly even incorrectly (i.e. it did not work as required at all, hence, probably, the slowdowns), because you compare with the same XMM1 reg twice: if the first comparison changed some byte(s) of it to 0xFF, then the second comparison actually compares not against a zeroed reg but against a mess. With the reg changing randomly from time to time, the code slows down; moreover, it will produce wrong results in the second comparison whenever the first comparison found a zero.
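A minimal illustration of that failure mode (hypothetical bytes, not nidud's actual data):

; xmm1 = 0
; pcmpeqb xmm1,[eax]      ; [eax] contains a zero byte -> that byte of xmm1 becomes 0FFh
; pcmpeqb xmm1,[eax+16]   ; xmm1 is no longer all-zero, so this now marks the bytes
;                         ; of [eax+16] that happen to equal 0FFh, not the zero bytes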
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 24, 2015, 12:01:19 PM
There's a problem with this whole exercise: the same code runs very differently on different processors.

Quote from: nidud: but this seems to be faster for some reason

I do the same thing: try different combinations and select the fastest on my machine, often not knowing why it's faster. Problem is, what happens on another machine?

I put together a new "AxStrLenSSE" which beats the Silvermont on my i5. It was easy. First, notice that Antariy's original algo saves XMM and ECX because of the environment it was designed for; Silvermont's doesn't. So I took that out. It was still slower on the long strings, so I unrolled the loop 9 times, and it easily "won" on nidud's test bed. But then I tried it on my AMD, and it came in 3rd (SSE 32 also ahead)! (AMD is very bad for bsf, pcmpeqb and pmovmskb.)

Working at Intel, Torbjorn Granlund has access to hundreds of machines (probably) and can try his Silvermont algo on all of them. If you don't have similar facilities there's no point in trying to keep up with him. I also think it's ridiculous to use 700 bytes for a strLen algo. This is one reason Microsoft and other modern code is so bloated.

This reminds me of a Go proverb (an oriental board game, you've probably heard of it). Often a student will go to great lengths to keep a group alive, destroying his entire position in the process. It would have been much better to sacrifice the group, for advantage elsewhere; this way he's guaranteed to lose. Then the teacher says, "If that's what you have to do to stay alive, why don't you just kill yourself?" That's how I feel about 700 lines to save a few nanoseconds looking for the end of a string!

Just the same, Torbjorn might have some good ideas. There is only one thing he does that I've never seen before: he uses pminub to take care of the situation you (Antariy and nidud) are discussing right now, that is, integrating the results from 2 or more XMMs reading the string. Perhaps you should take a look at his technique.
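For reference, a minimal sketch of the pminub idea as I understand it (register use and the loop label are mine, not Granlund's code):

    movdqa   xmm0, [eax]       ; first 16 bytes
    movdqa   xmm1, [eax+16]    ; next 16 bytes
    pminub   xmm0, xmm1        ; byte-wise minimum: a zero byte in either block survives
    pcmpeqb  xmm0, xmm2        ; xmm2 = 0, flag the zero bytes
    pmovmskb edx, xmm0
    test     edx, edx          ; one test covers all 32 bytes
    jz       next_block        ; hypothetical loop label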

nidud, your test bed is very good; this is the first time I've used it much, and the timings are very stable and reliable compared to other methods. It's also very convenient (once you get used to it) to just change 3.asm and recompile, without needing to recompile timeit.asm or any other algos.

Anyway, here's this new version of Antariy's algo, which is better on i5's than Silvermont, but worse on AMD ... I know there are various ways to improve it, but I think I'll do something else instead!

.686
.model flat, stdcall
.code
.xmm

OPTION PROLOGUE:NONE, EPILOGUE:NONE

AxStrLenSSE2 proc STDCALL lpszStr:DWORD  ; fast with unaligned strings
; original by Antariy modified by rrr314159
    mov eax,[esp+4]
    mov edx, eax
    and edx,-10h
    pxor xmm0,xmm0
    mov ecx,eax
    and ecx,0fh
    pcmpeqb xmm0,[edx]
    pmovmskb eax,xmm0
    shr eax,cl
    bsf eax,eax
    jnz @ret
    xorps xmm0,xmm0
    pxor xmm1, xmm1
    pxor xmm2, xmm2
    @@:                                ; naturally aligned to 16
        pcmpeqb xmm0,[edx+10h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat10h

        pcmpeqb xmm1,[edx+20h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat20h
   
        pcmpeqb xmm2,[edx+30h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat30h

        pcmpeqb xmm0,[edx+40h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat40h

        pcmpeqb xmm1,[edx+50h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat50h
   
        pcmpeqb xmm2,[edx+60h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat60h

        pcmpeqb xmm0,[edx+70h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat70h

        pcmpeqb xmm1,[edx+80h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat80h
   
        pcmpeqb xmm2,[edx+90h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat90h

        add edx, 90h
        jmp @B

gotat90h:
    add edx, 10h
gotat80h:
    add edx, 10h
gotat70h:
    add edx, 10h
gotat60h:
    add edx, 10h
gotat50h:
    add edx, 10h
gotat40h:
    add edx, 10h
gotat30h:
    add edx, 10h
gotat20h:
    add edx, 10h
gotat10h:
    add edx, 10h

    bsf eax,eax
    sub edx,[esp+4]
    add eax, edx

    @ret:
    ret 4
AxStrLenSSE2 endp
END


Here are the timings:

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (AVX)
----------------------------------------------
-- string length 0..40
462421    cycles - (  0) proc_0: crt_strlen
239532    cycles - (105) proc_1: SSE 32
232826    cycles - (104) proc_2: AxStrLenSSE
217896    cycles - (229) proc_3: AxStrLenSSE (rrr)
242578    cycles - (673) proc_8: SSE Intel Silvermont
395083    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
446781    cycles - (  0) proc_0: crt_strlen
167097    cycles - (105) proc_1: SSE 32
164663    cycles - (104) proc_2: AxStrLenSSE
147078    cycles - (229) proc_3: AxStrLenSSE (rrr)
151820    cycles - (673) proc_8: SSE Intel Silvermont
282262    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
349538    cycles - (  0) proc_0: crt_strlen
73973     cycles - (105) proc_1: SSE 32
94474     cycles - (104) proc_2: AxStrLenSSE
55565     cycles - (229) proc_3: AxStrLenSSE (rrr)
46600     cycles - (673) proc_8: SSE Intel Silvermont
61976     cycles - (818) proc_9: SSE Intel Atom

result:
   420539 cycles - (229) proc_3: AxStrLenSSE (rrr)
   440998 cycles - (673) proc_8: SSE Intel Silvermont
   480602 cycles - (105) proc_1: SSE 32
   491963 cycles - (104) proc_2: AxStrLenSSE
   739321 cycles - (818) proc_9: SSE Intel Atom
  1258740 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

AMD A6-6310 APU with AMD Radeon R4 Graphics (AVX)
----------------------------------------------
-- string length 0..40
749682    cycles - (  0) proc_0: crt_strlen
478036    cycles - (105) proc_1: SSE 32
537144    cycles - (104) proc_2: AxStrLenSSE
542685    cycles - (229) proc_3: AxStrLenSSE (rrr)
443542    cycles - (673) proc_8: SSE Intel Silvermont
618257    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
876412    cycles - (  0) proc_0: crt_strlen
321733    cycles - (105) proc_1: SSE 32
401381    cycles - (104) proc_2: AxStrLenSSE
361959    cycles - (229) proc_3: AxStrLenSSE (rrr)
310989    cycles - (673) proc_8: SSE Intel Silvermont
448527    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
520175    cycles - (  0) proc_0: crt_strlen
109789    cycles - (105) proc_1: SSE 32
113737    cycles - (104) proc_2: AxStrLenSSE
102691    cycles - (229) proc_3: AxStrLenSSE (rrr)
79685     cycles - (673) proc_8: SSE Intel Silvermont
95800     cycles - (818) proc_9: SSE Intel Atom

result:
   834216 cycles - (673) proc_8: SSE Intel Silvermont
   909558 cycles - (105) proc_1: SSE 32
  1007335 cycles - (229) proc_3: AxStrLenSSE (rrr)
  1052262 cycles - (104) proc_2: AxStrLenSSE
  1162584 cycles - (818) proc_9: SSE Intel Atom
  2146269 cycles - (  0) proc_0: crt_strlen
hit any key to continue...
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 08:50:52 PM
Hi rrr :t

I don't ignore your messages, I was just too tired to answer thoroughly.
One note right now: yes, AMDs look very slow with BSF, though that instruction is not executed in the inner loop, so it has a noticeable influence only with short strings. You may speed it up on AMD just by using BSF AX,AX instead of EAX,EAX. On Intel that makes little difference (it's a bit slower), but on AMD it makes sense: with a one-reg algo that produces a 16-bit mask, the 16-bit BSF is the right choice on AMD.

nidud, here is a tweak of your algo which in this form comes close to the algo posted on the "Research" thread on my machine:


strlen proc STDCALL lpszStr:DWORD
mov ecx,[esp+4]
xorps xmm2,xmm2
mov eax,ecx
and ecx,32-1
and eax,-32
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
or edx,-1
shl edx,cl
pcmpeqb xmm0,xmm2
pmovmskb ecx,xmm0
and ecx,edx
jnz done
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1
shl ecx,16
and ecx,edx
jnz done
loop_32:
lea eax,[eax+32]
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb ecx,xmm1
add edx,ecx
jbe loop_32
shl ecx,16
pmovmskb edx,xmm0
add ecx,edx
done:

bsf ecx,ecx
lea eax,[ecx+eax]
sub eax,[esp+4]
ret 4
strlen endp


Actually I think that if you use an entry like in my StrLen (just shift the overhead bits to the right), this inner loop (strangely, the movdqa/pcmpeqb sequence seems faster than the direct memory-operand pcmpeqb), and the loop-exit check I suggested, this code will probably be one of the best StrLen solutions for 32-bit machines. The crazy-unrolled Intel solution has no gain in the real world.
rrr's unrolling is probably the maximum that makes sense; it's strange to look at algos that are 600+ or even 800+ bytes long just to check a string's length on, after all, a real-world machine in a real-world app.
Title: Re: Faster Memcopy ...
Post by: nidud on March 25, 2015, 01:16:14 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 25, 2015, 02:27:55 AM
i5 test with a 200 MB file - 50 bibles concatenated:
Len(): 55896 us for 1519150 lines, 200850450 bytes
len(): 109353 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51731 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51359 us for 1519150 lines, 200850450 bytes

Len(): 50408 us for 1519150 lines, 200850450 bytes
len(): 108360 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51540 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 52424 us for 1519150 lines, 200850450 bytes

Len(): 52240 us for 1519150 lines, 200850450 bytes
len(): 108979 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 50288 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51747 us for 1519150 lines, 200850450 bytes

Len(): 51231 us for 1519150 lines, 200850450 bytes
len(): 107123 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 49613 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51625 us for 1519150 lines, 200850450 bytes

Len(): 50967 us for 1519150 lines, 200850450 bytes
len(): 107987 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51297 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51591 us for 1519150 lines, 200850450 bytes


AxStrLenSSE is a clear winner.
Title: Re: Faster Memcopy ...
Post by: nidud on March 25, 2015, 02:56:03 AM
deleted
Title: Re: Faster Memcopy ...
Post by: guga on November 14, 2019, 08:55:06 PM
Old, old, old post, but I was interested in the timings. I made a small variation of nidud's StrLen. Seems fast and stable, but I didn't measure the results in MASM. Can someone give it a try?

Proc FastStrlenTest:
    Arguments @Input
    Uses ecx, edx ; RosAsm macro to preserve registers: just a push ecx + push edx after the start of the function (push ebp | mov ebp esp). Before the function ends at the "EndP" macro, the registers are restored with pop edx | pop ecx

    mov eax D@Input
    movdqu xmm0 X$eax
    xorps xmm1 xmm1
    pcmpeqb xmm0 xmm1
    pmovmskb ecx xmm0
    and ecx 0-1
    jne @done
        pmovmskb ecx xmm1
        shl ecx 16
        and ecx 0-1
    jne @done
    xorps xmm2 xmm2
@loop_32:

    lea eax D$eax+32
    movdqu xmm0 X$eax
    movdqu  xmm1 X$eax+16
    pcmpeqb xmm0 xmm2 | pmovmskb edx xmm0
    pcmpeqb xmm1 xmm2 | pmovmskb ecx xmm1
    add edx ecx
jbe @loop_32
    shl ecx 16
    pmovmskb edx xmm0
    add ecx edx
@done:

bsf ecx ecx
lea eax D$ecx+eax
sub eax D@Input

EndP


Note: I modified the beginning of nidud's algo, because the address rounding at the start may result in a crash if the string is located exactly at the beginning of the memory's addressable area. That's why I removed the initial
    and ecx 32-1
    and eax 0-32


Btw... the function seems to work with any size of null-terminated string... from 1 to XXXXX


I need to fully understand this in order to try making a Unicode version too :)
Title: Re: Faster Memcopy ...
Post by: nidud on November 14, 2019, 11:14:04 PM
deleted
Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 12:21:09 AM
I saw that while I was debugging the function. It didn't crash, but the pointer entered a previously allocated memory area (28 bytes before the start of the input). It probably didn't crash because that memory was correctly allocated, so the pointer never reached its real beginning.

Ex: say the memory starts at 100000h and the function is given a pointer to the very first byte, 100001h. The and operation will force the inputted pointer back to 100000h minus some bytes (or something like that), and if the area before that address is not allocated, it will most likely crash. It didn't crash in my tests, but I made that small adaptation just to make sure it never will.
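For concreteness, a worked example of the arithmetic in question (values assumed, 32-byte variant):

; input pointer           = 100001h
; and eax,-32             ; eax = 100000h, the previous 32-byte boundary
; and ecx,32-1            ; ecx = 1, the number of bytes stepped back
; or edx,-1 / shl edx,cl  ; edx = mask that discards the ecx bytes before the real start
; pages are 4KB-aligned, so stepping back at most 31 bytes never leaves the string's page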

Btw... it's an amazing algorithm. I wonder if it can be optimized further. I made a small test comparing pcmpeqb xmm0 xmm1 rather than pcmpeqb xmm0 xmm2 / pcmpeqb xmm1 xmm2, but I couldn't understand what this opcode is doing exactly yet.
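For the record, a minimal illustration of what pcmpeqb + pmovmskb do here (example bytes are hypothetical):

; xmm0 = 'H','e','l','l',0,...   and xmm2 = all zero
; pcmpeqb xmm0, xmm2   ; each byte becomes 0FFh where equal to zero, 00h elsewhere:
;                      ; xmm0 = 00,00,00,00,FF,...
; pmovmskb ecx, xmm0   ; one bit per byte -> ecx = 00000010h (bit 4 set)
; bsf ecx, ecx         ; ecx = 4, the index of the terminator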

The timings are simply great, and it does find the length for small strings etc., while the AxStrLenSSE2 version seems to have some errors in the results (at least, if I ported it properly)

(https://i.ibb.co/d7N8vt4/vsdfagva-Image1.png) (https://ibb.co/d7N8vt4)
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 02:22:04 AM
deleted
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 05:49:03 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 15, 2019, 08:52:04 AM
Here is a test run with your (marginally modified) 1.asm ... 4.asm and other StrLen algos.
The 3.asm algo returns a wrong value; I have not found out why.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6931    cycles for 100 * CRT strlen
5372    cycles for 100 * Masm32 StrLen
19701   cycles for 100 * Windows lstrlen
2006    cycles for 100 * MasmBasic Len
1579    cycles for 100 * Algo1
1712    cycles for 100 * Algo2
1103    cycles for 100 * Algo3
3209    cycles for 100 * Algo4

6941    cycles for 100 * CRT strlen
5393    cycles for 100 * Masm32 StrLen
19722   cycles for 100 * Windows lstrlen
1994    cycles for 100 * MasmBasic Len
1587    cycles for 100 * Algo1
1722    cycles for 100 * Algo2
1102    cycles for 100 * Algo3
3227    cycles for 100 * Algo4

6938    cycles for 100 * CRT strlen
5380    cycles for 100 * Masm32 StrLen
19692   cycles for 100 * Windows lstrlen
2011    cycles for 100 * MasmBasic Len
1580    cycles for 100 * Algo1
1713    cycles for 100 * Algo2
1101    cycles for 100 * Algo3
3205    cycles for 100 * Algo4

6970    cycles for 100 * CRT strlen
5404    cycles for 100 * Masm32 StrLen
19710   cycles for 100 * Windows lstrlen
2010    cycles for 100 * MasmBasic Len
1589    cycles for 100 * Algo1
1710    cycles for 100 * Algo2
1104    cycles for 100 * Algo3
3204    cycles for 100 * Algo4

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
82      bytes for Algo1
114     bytes for Algo2
690     bytes for Algo3
761     bytes for Algo4

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1
100     = eax Algo2
14      = eax Algo3
100     = eax Algo4


This is the clear winner:
; include 1j.asm
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]
  and eax,-16
  and ecx,16-1
  or edx,-1
  shl edx,cl
  xorps xmm0,xmm0
  pcmpeqb xmm0,[eax]
  add eax,16
  pmovmskb ecx,xmm0
  ; xorps xmm0,xmm0 ; ??
  and ecx,edx
  jnz L2
L1:
  movaps xmm1,[eax]
  pcmpeqb xmm1,xmm0
  pmovmskb ecx,xmm1
  add eax,16
  test ecx,ecx
  jz L1
L2:
  bsf ecx,ecx
  lea eax,[eax+ecx-16]
  sub eax,[esp+4]
  retn 4
Algo1 endp
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 09:34:12 AM
deleted
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 09:48:25 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 15, 2019, 10:44:32 AM
Quote from: nidud on November 15, 2019, 09:34:12 AM
Quote from: jj2007 on November 15, 2019, 08:52:04 AM

Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]

Actually they are the same (speed-wise) but selected by size for alignment of L1.

On my CPU the algo is over 2% faster with mov ecx, eax

Quote

  ; xorps xmm0,xmm0   ; ??

This must be zero for the compare below:

L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0


I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 11:12:13 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 15, 2019, 11:32:06 AM
Quote from: nidud on November 15, 2019, 11:12:13 AM
Quote: I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

Yes. The alignment may go back 15 bytes and they may all be zero.

I've tested it with a 100 byte string, increasing the pointer 100 times. "All zero" shouldn't be a problem. But I admit I am too tired right now to understand it ;-)

Here's my current version of your fast algo, 62 bytes short and considerably faster than the original:
Algo1 proc
  mov eax, [esp+4]
  mov ecx, eax ; much faster than [esp+4]
  and eax, -16
  and ecx, 16-1
  or edx, -1
  shl edx, cl
  xorps xmm0, xmm0 ; needed for short strings
  pcmpeqb xmm0, [eax]
  pmovmskb ecx, xmm0
;   xorps xmm0, xmm0 ; ??
  and ecx, edx ; short string?
  jnz L2
L1:
  add eax, 16
  movaps xmm1, [eax]
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1
  test ecx, ecx
  jz L1
L2:
  bsf ecx, ecx
  add eax, ecx
  sub eax, [esp+4]
  retn 4
Algo1 endp
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 12:31:13 PM
deleted
Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 02:22:37 PM
Tks, guys, I'll give it a test for benchmarking the speed.

Nidud, about the memory issue... indeed you are right. I was confused, thinking the algo would write something to a protected area, but it does not. It's only computing the offset of where to start calculating the length, is that it? Tks for the Unicode version; I'll test it right now.

JJ... I tried to assemble your file, but MasmBasic throws a configuration error. I think I do have UAsm in the same path as ml.exe, but this is what is shown to me:

(https://i.ibb.co/TMS9xWW/vfsdgf-Image1.png) (https://ibb.co/TMS9xWW)

How can I configure MasmBasic properly to make it work for this test?
Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 03:05:45 PM
Hi nidud... the Unicode version is working as expected :)

One question: how can I implement a check inside the function to see whether the string really is Unicode, while it is calculating the length?

I mean... say I have a bad Unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How can I make the function check for the bad chars 'hi' and return a value of, say, 0-1 (meaning the function found an error)?

My RosAsm port of your function looks like this:


Proc UnicodeStrlen:
    Arguments @Input
    Uses ecx, edx, edi ; <---- preserve ecx, edx, edi. I benchmark all algos with registers preserved, to be sure they all behave the same way and the comparison of which is faster happens under the very same conditions.

    mov eax D@Input
    bt eax 0 | jc L3>
    mov ecx D@Input
    and eax 0-16
    and ecx 16-1
    or  edx 0-1
    shl edx cl
    pxor xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    pxor xmm0 xmm0
    and  ecx edx
    jnz  L2>

L1:
    movups  xmm1 X$eax
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz  L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1
    ExitP
L3:
    mov  edx edi
    mov  edi eax
    xor  eax eax
    or   ecx 0-1
    repne scasw
    not  ecx
    dec  ecx
    mov  eax ecx
    mov  edi edx

EndP

Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 03:26:18 PM
About the Unicode version: JJ, I ported yours to work with Unicode as well; it seems to work OK and it is faster on my machine. Can you give it a test on your benchmark to compare the speeds, please?

JJ Unicode Version

Proc UniStrLenJJ:
    Arguments @Input
    Uses ecx, edx ; <--- preserved register
   
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax ; <---- changed to the unaligned load in both the Unicode and ANSI versions. A bit faster on my machine, and prevents crashes on unaligned string calculations
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1

EndP


JJ Ansi version

Proc StrLenJJ:
    Arguments @Input
    Uses ecx, edx
   
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx,ecx
    lea eax D$eax+ecx-16
    sub eax D@Input

EndP


In my tests, the ANSI and Unicode versions do not vary that much in speed.

ANSI version, using the following string (99 chars... I'm testing odd and even lengths, strings big or as tiny as 1 byte):
[TestAnsi: B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timing (in nanosecs): 2.95913440552019 ns

Unicode version
[TestUnicode: U$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timing (in nanosecs): 3.46197153137555 ns



Btw... same question I asked nidud: how can I make the Unicode version check whether the string being measured really is Unicode, and return 0-1 if it finds non-Unicode chars while calculating the length?
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 15, 2019, 08:56:43 PM
Quote from: nidud on November 15, 2019, 12:31:13 PM
Well, if you feed it with this string:

    align 16
    db 15 dup(0)
err db "error",0

the aligned (first) compare will be:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e | r r o r 0

xmm0 will be:
00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

and there will be no jump to L2

Correct :thumbsup:

Quote from: guga on November 15, 2019, 03:26:18 PM
ABout the Unicode version. JJ, i ported yours to work with Unicode as well, and it seem to work ok, and it is faster on my machine. Can you give a test on your benchmark function to compare the speeds, pls ?

If you translate it to Masm syntax, I can do that  :thup:
Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 09:43:21 PM
Hi JJ. Tks :)



I believe the port to MASM is something like this:



UniStrLenJJ proc uses ecx edx Input:DWORD
    ; "uses ecx edx" makes MASM emit the push/pop pairs, and the default
    ; prologue/epilogue supplies push ebp | mov ebp,esp and the final ret 4
    mov eax, Input
    mov ecx, eax
    and eax, -16
    and ecx, 16-1
    or edx, -1
    shl edx, cl
    xorps xmm0, xmm0
    pcmpeqw xmm0, [eax]     ; same as Algo1, but pcmpeqw for word-sized chars
                            ; (xmmword ptr [eax] is the explicit form)
    add eax, 16
    pmovmskb ecx, xmm0
    xorps xmm0, xmm0
    and ecx, edx
    jnz Out1

InnerLoop:
    movups  xmm1, [eax]
    pcmpeqw xmm1, xmm0
    pmovmskb ecx, xmm1
    add  eax, 16
    test ecx, ecx
    jz InnerLoop

Out1:
    bsf ecx, ecx
    lea eax, [ecx+eax-16]
    sub eax, Input
    shr eax, 1              ; byte count -> character count
    ret                     ; expands to the epilogue and ret 4
UniStrLenJJ endp
Title: Re: Faster Memcopy ...
Post by: nidud on November 16, 2019, 12:15:47 AM
deleted
Title: Re: Faster Memcopy ...
Post by: aw27 on November 16, 2019, 01:12:02 AM
This performs better with small strings:


    .686
    .xmm
    .model flat

    .code
    mov     eax,-16
    mov     edx,[esp+4]
xorps xmm0, xmm0
@loop:
    add eax, 16
    PCMPISTRI xmm0, xmmword ptr [edx + eax], 1000b
    jnz @loop
    add eax, ecx
    ret

    end

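For what it's worth, my reading of the imm8 used above (1000b = 08h), as a hedged note:

; bits 1:0 = 00 -> the operands are packed unsigned bytes
; bits 3:2 = 10 -> "equal each" aggregation
; with xmm0 = 0 (an implicit-length string of length 0), PCMPISTRI sets ZF
; when the memory operand contains a null byte and returns its index in ECX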

total [0 .. 40], 8++
   262586 cycles 5.asm: PCMPISTRI
   329094 cycles 3.asm: SSE Intel Silvermont
   366495 cycles 1.asm: SSE 16
   453912 cycles 2.asm: SSE 32
   682335 cycles 0.asm: msvcrt.strlen()
   728151 cycles 4.asm: SSE Intel Atom
hit any key to continue...

(*) Note: it can read up to 15 characters past the end of the string and cause an exception in some cases, normally when crossing pages.
Title: Re: Faster Memcopy ...
Post by: nidud on November 16, 2019, 02:39:42 AM
deleted
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 16, 2019, 03:47:45 AM
Testbed with Guga's Unicode version, using ebp:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7047    cycles for 100 * CRT strlen
5265    cycles for 100 * Masm32 StrLen
19964   cycles for 100 * Windows lstrlen
1711    cycles for 100 * MasmBasic Len
1499    cycles for 100 * Algo1
2918    cycles for 100 * AlgoGuga

7097    cycles for 100 * CRT strlen
5311    cycles for 100 * Masm32 StrLen
19940   cycles for 100 * Windows lstrlen
1704    cycles for 100 * MasmBasic Len
1497    cycles for 100 * Algo1
2929    cycles for 100 * AlgoGuga

7030    cycles for 100 * CRT strlen
5250    cycles for 100 * Masm32 StrLen
19956   cycles for 100 * Windows lstrlen
1674    cycles for 100 * MasmBasic Len
1494    cycles for 100 * Algo1
2935    cycles for 100 * AlgoGuga

7057    cycles for 100 * CRT strlen
5268    cycles for 100 * Masm32 StrLen
20014   cycles for 100 * Windows lstrlen
1714    cycles for 100 * MasmBasic Len
1511    cycles for 100 * Algo1
3000    cycles for 100 * AlgoGuga

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
74      bytes for Algo1
87      bytes for AlgoGuga

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1
100     = eax AlgoGuga
Title: Re: Faster Memcopy ...
Post by: aw27 on November 16, 2019, 05:20:30 AM
This is the Unicode version using PCMPISTRI


    .code
    mov     rax,-16
    mov     rdx, rcx
    xorps xmm0, xmm0
@loop:
    add rax, 16
    PCMPISTRI xmm0, xmmword ptr [rdx + rax], 1001b
    jnz @loop
    shr eax, 1
    add eax, ecx
    ret
    end


total [0 .. 40], 8++
   323358 cycles 3.asm: AVX 32
   404277 cycles 4.asm: PCMPISTRI
   477789 cycles 2.asm: SSE 16
  1237417 cycles 0.asm: msvcrt.wcslen()
  3886924 cycles 1.asm: scasw

  total [41 .. 80], 7++
   291655 cycles 3.asm: AVX 32
   562089 cycles 2.asm: SSE 16
   563122 cycles 4.asm: PCMPISTRI
  1935096 cycles 0.asm: msvcrt.wcslen()
  4320489 cycles 1.asm: scasw
 
  total [600 .. 1000], 100++
   333669 cycles 3.asm: AVX 32
   982307 cycles 2.asm: SSE 16
  1405725 cycles 4.asm: PCMPISTRI
  3490272 cycles 0.asm: msvcrt.wcslen()
  6914474 cycles 1.asm: scasw
Title: Re: Faster Memcopy ...
Post by: nidud on November 16, 2019, 06:30:24 AM
deleted
Title: Re: Faster Memcopy ...
Post by: guga on November 16, 2019, 10:23:57 AM
Tks a lot, JJ :)

It seems fast, as expected.
Here are the results:


AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

5842 cycles for 100 * CRT strlen
5813 cycles for 100 * Masm32 StrLen
18946 cycles for 100 * Windows lstrlen
1878 cycles for 100 * MasmBasic Len
1545 cycles for 100 * Algo1
2579 cycles for 100 * AlgoGuga

5769 cycles for 100 * CRT strlen
5711 cycles for 100 * Masm32 StrLen
18825 cycles for 100 * Windows lstrlen
2350 cycles for 100 * MasmBasic Len
1917 cycles for 100 * Algo1
3390 cycles for 100 * AlgoGuga

7980 cycles for 100 * CRT strlen
7723 cycles for 100 * Masm32 StrLen
24159 cycles for 100 * Windows lstrlen
2372 cycles for 100 * MasmBasic Len
1930 cycles for 100 * Algo1
2587 cycles for 100 * AlgoGuga

6088 cycles for 100 * CRT strlen
7132 cycles for 100 * Masm32 StrLen
22808 cycles for 100 * Windows lstrlen
1673 cycles for 100 * MasmBasic Len
1609 cycles for 100 * Algo1
2636 cycles for 100 * AlgoGuga

14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
74 bytes for Algo1
87 bytes for AlgoGuga

100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax Windows lstrlen
100 = eax MasmBasic Len
100 = eax Algo1
100 = eax AlgoGuga

--- ok ---



Title: Re: Faster Memcopy ...
Post by: jj2007 on November 16, 2019, 03:36:03 PM
Quote from: guga on November 16, 2019, 10:23:57 AM
Tks a lot, JJ :)

It seems fast, as expected.

Thanks, Guga. Here is a new version, 59 bytes short and pretty fast:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6984    cycles for 100 * CRT strlen
5216    cycles for 100 * Masm32 StrLen
19898   cycles for 100 * Windows lstrlen
1674    cycles for 100 * MasmBasic Len
1500    cycles for 100 * Algo1/Nidud
2608    cycles for 100 * Algo1/Guga+JJ

7035    cycles for 100 * CRT strlen
5249    cycles for 100 * Masm32 StrLen
19929   cycles for 100 * Windows lstrlen
1700    cycles for 100 * MasmBasic Len
1569    cycles for 100 * Algo1/Nidud
2625    cycles for 100 * Algo1/Guga+JJ

7087    cycles for 100 * CRT strlen
5215    cycles for 100 * Masm32 StrLen
19859   cycles for 100 * Windows lstrlen
1701    cycles for 100 * MasmBasic Len
1498    cycles for 100 * Algo1/Nidud
2644    cycles for 100 * Algo1/Guga+JJ

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
53      bytes for Algo1/Nidud
59      bytes for Algo1/Guga+JJ

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1/Nidud
100     = eax Algo1/Guga+JJ


Note that results are not fully comparable. Masm32 len(), MasmBasic Len() (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1144) and Algo1/Guga+JJ preserve edx and ecx. This is unusual, but the good ol' Masm32 len() algo did this, so in order to avoid compatibility problems, Len() (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1144) behaves the same. In addition, Len() preserves xmm0. Still, Len() is over 4 times faster than CRT strlen :cool:
Title: Re: Faster Memcopy ...
Post by: daydreamer on November 16, 2019, 08:04:12 PM
Quote from: guga on November 15, 2019, 03:05:45 PM
Hi Nidud....The unicode version is working as expected :)

One question. How to implement a security inside the function to see either the string is really unicode or not, while it is calculating the lenght ?

I mean....say i have a bad unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How to make the function checks the bad chars 'hi', and return a value of ... say 0-1 (meaning the function found an error ) ?


In my experience, when using a text editor and starting to enter Unicode characters, the editor detects that I have typed or copy/pasted Unicode and asks whether I want to save in a Unicode format or lose that info by saving in good old ASCII. My suggestion is to support a Unicode-or-not flag somewhere in the copy routine, so that a text editor/GUI creator using it can set the flag as soon as it detects Unicode input from the user. Maybe also add a superfast routine that expands ASCII to Unicode when that happens?
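A minimal sketch of such an expansion routine (my assumptions: plain ASCII input, esi = source, edi = destination):

    pxor      xmm0, xmm0               ; zeros for the high bytes
    movq      xmm1, qword ptr [esi]    ; 8 ASCII bytes
    punpcklbw xmm1, xmm0               ; interleave with zeros -> 8 UTF-16 code units
    movdqu    [edi], xmm1
    add       esi, 8
    add       edi, 16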
Title: Re: Faster Memcopy ...
Post by: hutch-- on November 16, 2019, 09:14:42 PM
Just write a UNICODE editor, no conversions.
Title: Re: Faster Memcopy ...
Post by: aw27 on November 17, 2019, 01:18:53 AM
Visual Studio and some text editors like Notepad++ detect Unicode on save.
But in this thread, what we have been calling Unicode is word-length characters, that is, the UTF-16 character set without the additional planes for emoji and other stuff. This is what the Windows internals use. For programming and for most purposes it is perfectly OK; no need to complicate things further. However, there is more to it.

Windows has the IsTextUnicode API function, but it sometimes fails. I have seen and used better algos, but I am never sure they will not fail somewhere.
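A minimal usage sketch (buffer and count names are mine; IS_TEXT_UNICODE_STATISTICS is one of the standard test flags):

.data
  utResult dd IS_TEXT_UNICODE_STATISTICS   ; in: tests to run, out: tests that passed
.code
  invoke IsTextUnicode, addr buffer, byteCount, addr utResult
  ; eax nonzero -> the buffer passed the requested statistical test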



Title: Re: Faster Memcopy ...
Post by: guga on November 17, 2019, 07:36:08 AM
Quote from: hutch-- on November 16, 2019, 09:14:42 PM
Just write a UNICODE editor, no conversions.

Hi Steve, tks... but I was thinking of a function able to detect it for disassembler purposes. While I was testing the GDI issues from that other post, I faced a really old problem in RosAsm and decided to start a minor update of some of the routines. The routine I'm currently working on is the disassembler, which also uses routines to check chunks of bytes. I remembered using a very, very old function for string length and updated it, but it was not fast enough; that's why I decided to give these algos a try.

Since RosAsm now disassembles Unicode strings properly, I'm also trying a Unicode string-length check and a routine to automatically detect simple Unicode strings (char+0, char+0, etc.). From a disassembler's point of view, a faster routine able to determine whether a string is Unicode or ANSI (even the simple Unicode format) may be necessary to avoid bad recognitions.

Yesterday I succeeded in fixing an old bug in the RosAsm disassembler that made the disassembly process rather slow on some apps and dlls. For example, on shell32.dll or atioglxx.dll for XP (12 MB and 17 MB respectively), the disassembler took almost 25 minutes to finish, due to heavy computation for code/data identification, with the main functions going back and forward thousands of times. It was a stupid design mistake I (or René, I don't remember) made years ago, forcing one of the functions to allocate and deallocate a huge amount of memory every time the routine was used (and it is used thousands of times on problematic files). Now the process is much faster on such files: the same test on shell32.dll disassembled it in about 20 seconds and kept the accuracy of the disassembled data.

Title: Re: Faster Memcopy ...
Post by: hutch-- on November 17, 2019, 09:42:44 AM
As long as you can safely identify character data, either ANSI or UNICODE, it may be viable to have a manual option to switch between the two. There is no truly reliable method of detecting the difference so you use the best you can get/write and have a manual option.
Title: Re: Faster Memcopy ...
Post by: guga on November 17, 2019, 11:50:24 AM
Manual option? :thumbsup: :thumbsup: :thumbsup: Indeed. Yeah... maybe I'll implement it later, when I finish the fixes. I plan to add a couple of user options as soon as I have some time to dedicate to RosAsm updates again. The current version identifies ANSI and Unicode strings with reasonable accuracy, but it still has some minor problems, because in some cases a chunk of data can be either a string or a pointer, and that is not so easy to fix without the help of other tools, like the signature system I started developing years ago but never finished.

Manual options could be good, as in IdaPro, allowing the user to choose whether he wants to disassemble C-style strings, DOS-style, Pascal-style, Delphi strings, etc. But for those specific cases (of strings tied to certain compilers) it would be better to do it when (or if) I succeed in finishing the signature technique (I called it DIS - Digital Identification System) that I started many years ago.

The current routine does the job more than 90% of the time on real apps. The disassembler sets a couple of flags on forbidden areas of the PE to prevent them from being disassembled - for example, sections flagged as import, export, resources, etc. It basically identifies the good code and data sections, and those are the ones actually disassembled.

One of the problems is that inside the code section it is common to find embedded data, structures, strings, etc. Although the disassembler works reasonably well on those sections, some files produce wrong results... but I'll fix those later, once I fix some minor bugs in RosAsm.

I'm comparing the results of the fixes I'm currently doing, and they are more accurate than the ones from IdaPro (not considering Ida's FLIRT system, of course), but there are still some issues.

Eventually I'll try to create a set of macros or internal routines to allow compiling MASM-syntax code, but... this will still take more time.
Title: Re: Faster Memcopy ...
Post by: daydreamer on November 17, 2019, 08:50:52 PM
Quote from: hutch-- on November 16, 2019, 09:14:42 PM
Just write a UNICODE editor, no conversions.
I already worked on a Unicode richedit, but as an asm programmer I cannot resist downsizing tricks like storing the text data at half size, and upsizing by adding the start value of the specific language block it belongs to while moving it into the character buffer.
Title: Re: Faster Memcopy ...
Post by: daydreamer on November 17, 2019, 09:12:38 PM
Quote from: guga on November 17, 2019, 11:50:24 AM
Manual options could be good, as in IdaPro, allowing the user to choose whether he wants to disassemble C-style strings, DOS-style, Pascal-style, Delphi strings, etc. But for those specific cases (of strings tied to certain compilers) it would be better to do it when (or if) I succeed in finishing the signature technique (I called it DIS - Digital Identification System) that I started many years ago.

The current routine does the job more than 90% of the time on real apps. The disassembler sets a couple of flags on forbidden areas of the PE to prevent them from being disassembled - for example, sections flagged as import, export, resources, etc. It basically identifies the good code and data sections, and those are the ones actually disassembled.

90%, great job :thumbsup: As for some of the failing %, maybe the creator of the code tried to make it harder to disassemble?
Why not make it like a smartphone editor: show its best guess of the string format together with a tiny sample, and let the user choose from a list of different text formats? I had no idea there were loads of different text formats.
I can also think of hardcoded mov eax,"abcd" or mov eax,"ab" ;unicode - that can be very hard to handle in a disassembler.
It is also good to know how to detect text files saved in UTF-16 format.
Title: Re: Faster Memcopy ...
Post by: hutch-- on November 17, 2019, 10:15:24 PM
Using rich edit, the shift from ANSI to UNICODE is simple enough to do. If you have to work in both, load the file, and if it looks like garbage, switch from one to the other; it just means reloading the file again. You can make tangled messes that still can't guarantee being correct, or switch between the two - the latter is much easier.
Title: Re: Faster Memcopy ...
Post by: aw27 on November 17, 2019, 11:03:31 PM
This is another one, this time for an AVX-512 strlen (based on this url (https://github.com/WojciechMula/toys/blob/master/avx512-string/avx512f-strlen.cpp))


strlenZMM:
    push esi
    mov  eax, 01010101h
    vpbroadcastd zmm2, eax ; broadcast eax to all elements
    xor  edx, edx ; len = 0
    mov  eax, 80808080h
    vpbroadcastd zmm3, eax
    mov  esi, [esp+8]
@@:
    vmovdqu32  zmm0, ZMMWORD PTR [esi+edx]
    vpsubd     zmm1, zmm0, zmm2
    vpternlogd zmm1, zmm0, zmm3, 32
    vptestmd   k1, zmm1, zmm1
    kmovw eax, k1
    movzx eax, ax
    test  ax, ax
    jnz   @F
    add   edx, 64
    jmp short @B
@@:
    bsf  eax, eax
    push 32
    pop  ecx
    cmovne ecx, eax
    lea  esi, dword ptr [esi+ecx*4]
    cmp  byte ptr [esi+edx], 0
    lea  eax, dword ptr [edx+ecx*4]
    je   short @exit
    cmp  byte ptr [esi+edx+1], 0
    jne  short @F
    inc  eax
    jmp  short @exit
@@:
    cmp  byte ptr [esi+edx+2], 0
    jne  short @F
    add  eax, 2
    jmp  short @exit
@@:
    add  eax, 3
@exit:
    vzeroupper
    pop  esi
    ret
    end


I added a 4th test, for strings between 40000 and 40900 bytes, to see if the AVX-512 version pulls away. Well, not really; SSE Intel Silvermont and SSE Intel Atom are right there as well. :hmmm:

total [0 .. 40], 8++
   290780 cycles 7.asm: sse2
   355355 cycles 5.asm: PCMPISTRI
   412251 cycles 3.asm: SSE Intel Silvermont
   469664 cycles 8.asm: Agner Fog
   502841 cycles 1.asm: SSE 16
   524321 cycles 2.asm: SSE 32
   597335 cycles 9.asm: ZMM AVX512
   865552 cycles 4.asm: SSE Intel Atom
   908227 cycles 6.asm: scasb
   913651 cycles 0.asm: msvcrt.strlen()
   
   
total [41 .. 80], 7++
   270380 cycles 3.asm: SSE Intel Silvermont
   299431 cycles 5.asm: PCMPISTRI
   306940 cycles 7.asm: sse2
   314735 cycles 1.asm: SSE 16
   364536 cycles 9.asm: ZMM AVX512
   380247 cycles 8.asm: Agner Fog
   405156 cycles 2.asm: SSE 32
   639091 cycles 4.asm: SSE Intel Atom
   758265 cycles 6.asm: scasb
   982403 cycles 0.asm: msvcrt.strlen()

   total [600 .. 1000], 100++
   202227 cycles 9.asm: ZMM AVX512
   237534 cycles 3.asm: SSE Intel Silvermont
   292854 cycles 4.asm: SSE Intel Atom
   334146 cycles 2.asm: SSE 32
   338568 cycles 1.asm: SSE 16
   356720 cycles 7.asm: sse2
   436840 cycles 8.asm: Agner Fog
   650222 cycles 5.asm: PCMPISTRI
  1438033 cycles 6.asm: scasb
  1830544 cycles 0.asm: msvcrt.strlen()
 
total [40000 .. 40900], 100++
  2161645 cycles 3.asm: SSE Intel Silvermont
  2224521 cycles 4.asm: SSE Intel Atom
  2342704 cycles 9.asm: ZMM AVX512
  3137064 cycles 1.asm: SSE 16
  3465817 cycles 7.asm: sse2
  3514206 cycles 2.asm: SSE 32
  4113016 cycles 8.asm: Agner Fog
  6173622 cycles 5.asm: PCMPISTRI
13022424 cycles 6.asm: scasb
16670776 cycles 0.asm: msvcrt.strlen() 
Title: Re: Faster Memcopy ...
Post by: nidud on November 18, 2019, 12:53:02 AM
deleted
Title: Re: Faster Memcopy ...
Post by: aw27 on November 18, 2019, 05:19:44 AM
@nidud,

It has a bug; I have not figured out the logic yet, but it does not get past this:
00007ff65266174e  test         ecx, ecx 
00007ff652661750  jz         0x7ff652661734 

You can debug with VS 2012 or 2015 and the Intel SDE debugger. I don't know if it works with the Express or Community editions; I have the Pro for those years.
Later: I think it does because I read this:
http://masm32.com/board/index.php?topic=6473.msg69456#msg69456
Title: Re: Faster Memcopy ...
Post by: aw27 on November 18, 2019, 08:07:16 AM
I understood the logic and corrected it in the following way, and it does not need to be aligned.


.code

avx512aligned proc
    sub rsp, 8
    xor             rax,rax
    vpbroadcastq    zmm0,rax
    mov             r8,rcx
    mov             rax,rcx
    and             rax,-64
    and             rcx,64-1
    xor             rdx,rdx
    dec             rdx   
    shl             rdx,cl
    vpcmpgtb        k1,zmm0,[rax]
    kmovq           rcx,k1
    and             rcx,rdx
    jnz             L2

L1:
    add             rax,64
    vpcmpgtb        k1,zmm0,[rax]
    kmovq           rcx,k1
    test            rcx,rcx
    jz              L1

L2:
    bsf             rcx,rcx
    lea             rax,[rax+rcx]
    sub             rax,r8
    dec rax
    add rsp, 8
    ret
avx512aligned endp

end


What a mess there was with the k2! Simply remove it.
Title: Re: Faster Memcopy ...
Post by: guga on November 18, 2019, 11:45:21 AM
Hi, DayDreamer
"90% great job :thumbsup: some of the fail % it maybe is creator of code tried to make it harder to disassemble?"
Thanks :thumbsup: :thumbsup: :) About the code being harder due to the creator's choice: well, not necessarily. The vast majority of the time, when the disassembler fails to identify something, it's due to the characteristics of each compiler or its libraries (when the file is not packed, of course). For example, VisualBasic 6 code contains a lot of embedded data inside the code section; some Delphi or Borland files have that too. Plain C using APIs, on the other hand, is somewhat easier to disassemble because code and data are more distinguishable from each other, and some C++ files are somewhat easier as well. What makes the disassembly process hard are heavily bad-encoded libraries, especially ones made with C++, or trash code inside - I mean, functions that are never used. In those situations (functions without any reference), following the data chain is tricky, because some of those chunks can be either code or data.

Sure, no disassembler will be 100% accurate, but the higher the accuracy rate we can get, the better. I'm pretty sure that if I finish the DIS system (similar to FLIRT in Ida), I can reach something around 98% accuracy, but... no plans/time to do that right now. I still have to fix a lot of problems inside RosAsm and try to make its inner functions less attached to the interface (and eventually rebuild it completely). I'll have to isolate the encoder, the disassembler and the debugger, and create a new resource editor; only then can I try to rewrite the interface or implement a new set of macros (or internal routines) to force it to work with something closer to MASM syntax too.


Hi Steve :)
"Using rich edit, the shift from ANSI to UNICODE is simple enough to do. If you have to work in both, load the file and if it looks like garbage, switch from one to the other. Just means reloading the file again. You can make tangled messes that still can't garrantee being correct or switch between the two, the latter is much easier."
Interesting idea. :) Easier indeed.
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 18, 2019, 08:58:06 PM
Quote from: guga on November 18, 2019, 11:45:21 AM: "Using rich edit, the shift from ANSI to UNICODE is simple enough..."

The RichEdit control can use rtf under the hood, and thus has no problems using Unicode. In RichMasm (http://masm32.com/board/index.php?topic=5314.0), for example, your source can be Ansi or Unicode or a mix of both, no problem. The more interesting part is what you pass from the editor to the assembler - and that must be Utf-8, if you want to display non-English text in your executable.
Title: Re: Faster Memcopy ...
Post by: guga on November 18, 2019, 11:34:00 PM
Hi Jj. Tks.

I tested the new update and it is working faster :)

About the UTF-8: so I always need to convert to UTF-8 whenever I load a file, or immediately before showing it on screen, and don't need to offer a user choice? I'm not sure I understood what you meant by passing from the editor to the assembler.
Do you open a file (Unicode or ANSI) in RichMasm and export it as UTF-8, or is the UTF-8 conversion done internally, only to display the asm text on the screen?
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 19, 2019, 12:35:25 AM
This is OT but I'll keep it short:

include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  uMsgBox 0, "Добро пожаловать", "歡迎", MB_OK
EndOfCode


- RichMasm exports this source to whatever.asm as Utf-8
- the assembler (ML, UAsm, AsmC) sees an 8-bit text but doesn't care whether that's codepage 1252 or 65001 or whatever
- the executable sees an 8-bit text, and gets told via the u in uMsgBox that a MessageBoxW is required, and that it should kindly translate the Utf-8 to Utf-16 before passing it on to MessageBoxW

That's the whole trick. For the coder, it's convenient because he can write everything directly into the source. And of course, comments can be in any encoding; the RichEdit control uses RTF to store them, and the assemblers don't care.
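Guessing at what happens under the hood of the u macro - a hedged sketch of that Utf-8 to Utf-16 step (buffer names are mine):

.data?
  wideBuf dw 256 dup(?)
.code
  invoke MultiByteToWideChar, CP_UTF8, 0, addr utf8Text, -1, addr wideBuf, 256
  invoke MessageBoxW, 0, addr wideBuf, 0, MB_OK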
Title: Re: Faster Memcopy ...
Post by: nidud on November 19, 2019, 03:17:18 AM
deleted
Title: Re: Faster Memcopy ...
Post by: aw27 on November 19, 2019, 05:37:46 AM
It works very well.  :thumbsup:


total [0 .. 40], 8++
   307793 cycles 1.asm: PCMPISTRI
   443231 cycles 0.asm: AVX-512 aligned
   521571 cycles 2.asm: ZMM AVX512 (older)

total [41 .. 80], 7++
   257807 cycles 0.asm: AVX-512 aligned
   356038 cycles 2.asm: ZMM AVX512 (older)
   370879 cycles 1.asm: PCMPISTRI
   
total [600 .. 1000], 100++
   113553 cycles 0.asm: AVX-512 aligned
   204811 cycles 2.asm: ZMM AVX512 (older)
   859649 cycles 1.asm: PCMPISTRI

total [40000 .. 40800], 100++
   897536 cycles 0.asm: AVX-512 aligned
  2127546 cycles 2.asm: ZMM AVX512 (older)
  5980835 cycles 1.asm: PCMPISTRI 


If you convert the remaining to 64-bit I will test them all.
Title: Re: Faster Memcopy ...
Post by: nidud on November 19, 2019, 07:12:37 AM
deleted
Title: Re: Faster Memcopy ...
Post by: aw27 on November 19, 2019, 07:17:29 PM
Quote from: nidud on November 19, 2019, 07:12:37 AM
Added the Intel versions as well. They were originally 32-bit, so the 64-bit versions would have been coded differently from the direct translation done here.

Lots of space there now: 32 regs * 64 bytes = 2048 bytes.

I think there is a bug in the ANSI version here:
args_x macro
    lea rcx,str_1[size_s]
    mov eax,step_x
    add eax,eax  <---------- HERE
    sub rcx,rax
    exitm<>
    endm

Although it is not causing a buffer overflow, it clearly changes the results: the doubled step makes the ANSI test point 2*step_x bytes before the end of the buffer instead of step_x (the add eax, eax was presumably inherited from the Unicode version, where the step has to be doubled to count bytes).

These are the results after removing the "add eax, eax":


total [0 .. 40], 16++
   133475 cycles 5.asm: AVX 32
   152210 cycles 3.asm: SSE Intel Silvermont
   157375 cycles 1.asm: SSE 16
   172910 cycles 6.asm: AVX512 64
   178312 cycles 2.asm: SSE 32
   228359 cycles 0.asm: msvcrt.strlen()
   326672 cycles 4.asm: SSE Intel Atom
   
total [41 .. 80], 16++
   117358 cycles 5.asm: AVX 32
   123539 cycles 6.asm: AVX512 64
   165831 cycles 1.asm: SSE 16
   169369 cycles 3.asm: SSE Intel Silvermont
   210518 cycles 2.asm: SSE 32
   270514 cycles 0.asm: msvcrt.strlen()
   281378 cycles 4.asm: SSE Intel Atom

total [600 .. 1000], 200++
    67189 cycles 6.asm: AVX512 64
   110356 cycles 5.asm: AVX 32
   218898 cycles 3.asm: SSE Intel Silvermont
   235207 cycles 4.asm: SSE Intel Atom
   272100 cycles 2.asm: SSE 32
   296195 cycles 1.asm: SSE 16
   643732 cycles 0.asm: msvcrt.strlen()
Title: Re: Faster Memcopy ...
Post by: aw27 on November 19, 2019, 08:40:48 PM
This is another suite of test results for 64-bit strlen variations, including Agner Fog, PCMPISTRI and the Masm32 SDK.
I added a test for extra long strings, in the range 40000 to 40800 bytes.
The Masm32 SDK strlen is not SIMD assisted (and I believe msvcrt.strlen is not either), so it plays a different tournament. But it could be made that way, because all 64-bit machines support SSE2.


total [0 .. 40], 16++
   135718 cycles 5.asm: AVX 32
   147583 cycles 8.asm: PCMPISTRI
   159476 cycles 1.asm: SSE 16
   169063 cycles 3.asm: SSE Intel Silvermont
   190066 cycles 2.asm: SSE 32
   192212 cycles 7.asm: Agner Fog
   210091 cycles 6.asm: AVX512 64
   238010 cycles 0.asm: msvcrt.strlen()
   280346 cycles 4.asm: SSE Intel Atom
   282475 cycles 9.asm: Masm32 SDK
   
   total [41 .. 80], 16++
   116625 cycles 5.asm: AVX 32
   120875 cycles 6.asm: AVX512 64
   136046 cycles 3.asm: SSE Intel Silvermont
   169359 cycles 8.asm: PCMPISTRI
   179466 cycles 1.asm: SSE 16
   198766 cycles 2.asm: SSE 32
   205015 cycles 7.asm: Agner Fog
   257180 cycles 0.asm: msvcrt.strlen()
   278100 cycles 4.asm: SSE Intel Atom
   487603 cycles 9.asm: Masm32 SDK

   total [600 .. 1000], 200++
    83570 cycles 6.asm: AVX512 64
   110477 cycles 5.asm: AVX 32
   218994 cycles 3.asm: SSE Intel Silvermont
   253867 cycles 4.asm: SSE Intel Atom
   279579 cycles 2.asm: SSE 32
   307387 cycles 1.asm: SSE 16
   334595 cycles 7.asm: Agner Fog
   488680 cycles 8.asm: PCMPISTRI
   621900 cycles 0.asm: msvcrt.strlen()
  1066191 cycles 9.asm: Masm32 SDK

  total [40000 .. 40800], 200++
   505134 cycles 6.asm: AVX512 64
   977509 cycles 5.asm: AVX 32
  1468195 cycles 3.asm: SSE Intel Silvermont
  1684275 cycles 4.asm: SSE Intel Atom
  2241774 cycles 2.asm: SSE 32
  2250641 cycles 1.asm: SSE 16
  2609106 cycles 7.asm: Agner Fog
  3257461 cycles 8.asm: PCMPISTRI
  4818268 cycles 0.asm: msvcrt.strlen()
  8809927 cycles 9.asm: Masm32 SDK
 
Title: Re: Faster Memcopy ...
Post by: aw27 on November 19, 2019, 11:42:23 PM
In this test I introduce a new AVX-512 strlen variation called Fast AVX512 64  :biggrin:


.code
    xor rax, rax
    vxorps zmm1, zmm1, zmm1                     ; 64 zero bytes to compare against
L1:
    vpcmpeqb k1, zmm1, zmmword ptr [rcx+rax]    ; k1 gets one mask bit per zero byte found
    kmovq r9, k1                                ; move the 64-bit mask into a GP register
    add rax, 64
    test r9, r9
    jz L1                                       ; no zero byte in this 64-byte chunk
    bsf r9, r9                                  ; index of the first zero byte in the chunk
    lea rax, [rax+r9-64]                        ; length = bytes scanned + index - last increment
    ret
end
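(Note: the compare-into-mask form of vpcmpeqb and the 64-bit kmovq both require AVX512BW. Also, since the loop reads 64 bytes per iteration regardless of where the terminator is, it can read past the end of the string; that is fine in these benchmark buffers, but worth remembering near page boundaries.)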


These are the results:


total [0 .. 40], 16++
    83041 cycles 9.asm: Fast AVX512 64
   118463 cycles 8.asm: PCMPISTRI
   136861 cycles 5.asm: AVX 32
   145743 cycles 3.asm: SSE Intel Silvermont
   163889 cycles 1.asm: SSE 16
   178432 cycles 6.asm: AVX512 64
   185371 cycles 7.asm: Agner Fog
   196856 cycles 2.asm: SSE 32
   228516 cycles 0.asm: msvcrt.strlen()
   277227 cycles 4.asm: SSE Intel Atom
   
total [41 .. 80], 16++
    61027 cycles 9.asm: Fast AVX512 64
   111154 cycles 5.asm: AVX 32
   130256 cycles 6.asm: AVX512 64
   139440 cycles 3.asm: SSE Intel Silvermont
   155091 cycles 8.asm: PCMPISTRI
   183854 cycles 1.asm: SSE 16
   194775 cycles 7.asm: Agner Fog
   212161 cycles 2.asm: SSE 32
   285351 cycles 4.asm: SSE Intel Atom
   311238 cycles 0.asm: msvcrt.strlen()

total [600 .. 1000], 200++
    71159 cycles 9.asm: Fast AVX512 64
    81938 cycles 6.asm: AVX512 64
   110439 cycles 5.asm: AVX 32
   220499 cycles 3.asm: SSE Intel Silvermont
   254703 cycles 4.asm: SSE Intel Atom
   293130 cycles 2.asm: SSE 32
   308233 cycles 1.asm: SSE 16
   338944 cycles 7.asm: Agner Fog
   516498 cycles 8.asm: PCMPISTRI
   648680 cycles 0.asm: msvcrt.strlen()

total [40000 .. 40800], 200++
   390634 cycles 6.asm: AVX512 64
   414175 cycles 9.asm: Fast AVX512 64
   606734 cycles 5.asm: AVX 32
  1392867 cycles 3.asm: SSE Intel Silvermont
  1417887 cycles 4.asm: SSE Intel Atom
  2194951 cycles 1.asm: SSE 16
  2200795 cycles 2.asm: SSE 32
  2229910 cycles 7.asm: Agner Fog
  3295851 cycles 8.asm: PCMPISTRI
  4538755 cycles 0.asm: msvcrt.strlen()   


For huge strings the other variation catches up.  :badgrin:
Maybe it can be improved just a little.


Later:

I have not used it in the above test, but the following, which is based on the same idea, is also faster than AVX 32 except for huge strings (and there the difference is small).


.code
    xor rax, rax
    vxorps ymm0, ymm0, ymm0                     ; 32 zero bytes to compare against
L1:
    vpcmpeqb ymm1, ymm0, ymmword ptr [rcx+rax]
    vpmovmskb r9d, ymm1                         ; one mask bit per zero byte found
    add rax, 32
    test r9d, r9d
    jz L1                                       ; no zero byte in this 32-byte chunk
    bsf r9, r9                                  ; index of the first zero byte
    lea rax, [rax+r9-32]                        ; length = bytes scanned + index - last increment
    ret
    end
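(Same caveat as the zmm version: the loop reads 32 bytes per iteration and can read past the terminator; vpmovmskb on ymm registers requires AVX2.)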

Title: Re: Faster Memcopy ...
Post by: nidud on November 20, 2019, 03:53:14 AM
deleted
Title: Re: Faster Memcopy ...
Post by: aw27 on November 20, 2019, 06:34:12 AM
Very good Nidud.

Something I dislike is the .cmd extension and the makefile. I also dislike the "option dllimport:" that comes back from Jwasm, and in general everything that tries to establish unnecessary alternatives to things that are already well done the traditional way.


Notepad++ users (probably not many here) can build these ASMC test suites for debugging using this script:

(https://www.dropbox.com/s/cgh80bug1e3y44b/asmcdebug.png?dl=1)

Let's see if it works:


(https://www.dropbox.com/s/g8ndvcukgy5z4hj/asmdb2.png?dl=1)
Title: Re: Faster Memcopy ...
Post by: nidud on November 20, 2019, 07:17:44 AM
deleted
Title: Re: Faster Memcopy ...
Post by: aw27 on November 20, 2019, 10:58:27 PM
I understand, but ASMC libs appear not to be compatible with Microsoft LINK.exe.
We will need to learn how to use linkw.exe, right?
Uasm libs don't force us to use anything else (thank God).
Title: Re: Faster Memcopy ...
Post by: nidud on November 21, 2019, 06:35:28 AM
deleted
Title: Re: Faster Memcopy ...
Post by: CMalcheski on April 03, 2020, 11:47:46 AM
Quote from: KeepingRealBusy on March 04, 2015, 10:23:39 AM
rrr

Mis-alignment can be a problem, especially if both the source and the destination buffers have some mis-alignment, and especially if they have different alignments. The usual case is that one or the other of the buffers can be aligned mod 16. The way to handle these is to move a block of 16 bytes from the start of the buffer and a block of 16 bytes at the end of the buffer with un-aligned XMM moves. Then calculate the size left to be moved (data size - 32), round this up to the next mod 16, then start at the first mod 16 byte boundary of the un-aligned buffer and move 16 bytes at a time (there will be a small overlap at either end).

Dave.

Very few people take their efforts far enough to avoid this issue, but internally any data not aligned on its own size (e.g. a dword not aligned on a 4-byte boundary) can require two reads to retrieve. That's a big slowdown. The effort required to properly align data across the board is well worth it.
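For concreteness, here is a minimal, untested sketch of the head/tail technique Dave describes above (my reconstruction, not his code), with rcx = dest, rdx = src, r8 = byte count, assuming a count of at least 32:

copy16 proc
    movups xmm0, [rdx]                ; head: first 16 bytes, unaligned load
    movups xmm1, [rdx+r8-16]          ; tail: last 16 bytes, unaligned load
    movups [rcx], xmm0                ; unaligned stores for head and tail
    movups [rcx+r8-16], xmm1
    lea r9, [rcx+16]
    and r9, -16                       ; first 16-aligned dest address past the head
    sub rdx, rcx
    add rdx, r9                       ; source pointer matching the aligned dest
    lea r11, [rcx+r8-16]              ; from here on, the tail store covers everything
    cmp r9, r11
    jnb done                          ; nothing left between head and tail
  lp:
    movups xmm0, [rdx]                ; source may still be misaligned
    movaps [r9], xmm0                 ; dest is aligned; overlap with head/tail is harmless
    add rdx, 16
    add r9, 16
    cmp r9, r11
    jb lp
  done:
    ret
copy16 endp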
Title: Re: Faster Memcopy ...
Post by: hutch-- on April 03, 2020, 12:18:35 PM
AW,

Clock this one in your list. Memory must be aligned by the data size.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

ymmcopya proc

  ; rcx = source address
  ; rdx = destination address
  ; r8  = byte count

    mov r11, r8
    shr r11, 5                  ; div by 32 for loop count
    xor r10, r10                ; zero r10 to use as index

  lpst:
    vmovntdqa ymm0, YMMWORD PTR [rcx+r10]
    vmovntdq YMMWORD PTR [rdx+r10], ymm0

    add r10, 32
    sub r11, 1
    jnz lpst

    mov rax, r8                 ; calculate remainder if any
    and rax, 31
    test rax, rax
    jnz @F
    ret

  @@:
    mov r9b, [rcx+r10]          ; copy any remainder
    mov [rdx+r10], r9b
    add r10, 1
    sub rax, 1
    jnz @B

    ret

ymmcopya endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Title: Re: Faster Memcopy ...
Post by: aw27 on April 03, 2020, 08:57:49 PM
I can't test now but non-temporal instructions tend to perform poorly in modern processors. That's the main reason Agner Fog's tests did not shine.
I will test when possible or you can do it directly from nidud's example.
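For a quick A/B check, here is a hypothetical temporal variant of the inner loop (same register usage as ymmcopya above): on ordinary write-back memory vmovntdqa loads behave like normal loads anyway, and if the non-temporal stores are kept, an sfence should follow the loop before the data is consumed elsewhere.

  lpst:
    vmovdqu ymm0, YMMWORD PTR [rcx+r10]   ; ordinary (temporal) load
    vmovdqu YMMWORD PTR [rdx+r10], ymm0   ; ordinary store keeps the data in cache
    add r10, 32
    sub r11, 1
    jnz lpst
    ; (with the vmovntdq version, add an sfence here)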
Title: Re: Faster Memcopy ...
Post by: guga on August 17, 2023, 04:21:21 AM
Old thread, but can someone test this update of StrLen ? I tried to port it to masm to be used in MasmBasic, but couldn't do it.

This is an old routine JJ and I wrote a long time ago, but I have updated it, since it had some small flaws when retrieving the length of unaligned strings, and also of strings that contain extra bytes at the end (immediately after the null terminator).

Now it seems to handle all situations, but I'm not sure about the speed.

RosAsm syntax:
Proc StrLen3:
    Arguments @Input
    Uses ecx

    xorps xmm0 xmm0
    mov ecx D@Input

    L0:
         movups xmm1 X$ecx | pcmpeqb xmm0 xmm1 | pmovmskb eax xmm0
         add ecx 16 ; point to the next 16 bytes of our string
     test ax ax | jz L0< ; Did we find any zeroes ? No: jump back and do it again. Otherwise, fall through to the next line
    sub ecx D@Input
    add ecx 0-16
    bsf ax ax
    add eax ecx

EndP


Masm syntax
StrLen3         proc near               ; CODE XREF: start+13↑p

Input           = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+Input]

ZeroNotFound:
                movups  xmm1, xmmword ptr [ecx] ; is this syntax correct, or just movups xmm1, [ecx] ? (both assemble: MASM infers the xmmword size from the register)
                pcmpeqb xmm0, xmm1
                pmovmskb eax, xmm0
                add     ecx, 10h
                test    ax, ax
                jz      short ZeroNotFound
                sub     ecx, [ebp+Input]
                add     ecx, -16
                bsf     ax, ax
                add     eax, ecx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLen3         endp
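(Note the retn 4: StrLen3 is stdcall with one stack argument, so a call site looks like

    push offset MyString    ; hypothetical string label
    call StrLen3            ; length returned in eax; the callee cleans the stack

and the caller must not add any stack cleanup of its own.)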


Title: Re: Faster Memcopy ...
Post by: NoCforMe on August 17, 2023, 04:52:04 AM
Quote from: guga on August 17, 2023, 04:21:21 AM[...] and also of strings that contain extra bytes at the end (immediately after the null terminator)

Since when should this be a concern? Anything after the terminating NULL should just be ignored, right?
Title: Re: Faster Memcopy ...
Post by: guga on August 17, 2023, 05:20:28 AM
Not really. I was doing a new test and found that the older routine sometimes has flaws when the string sits between other chains of bytes or is unaligned. This was happening with both the strlen and strlenw versions.

The problem was that when there is extra data after the end of the string (after the null terminator), or when the start of the string is unaligned, the routine could run past the null terminator and thus give a wrong result.

These issues are fixed in the newer version (the one I posted, and another I'll post later about strlenw), but I want to make sure the speed was not affected.

The former error happened more often when dealing with non-Latin Unicode strings (strlenW version), I mean strings in Chinese, Russian, etc.

Some examples were things like this in the Unicode version:

[UnicodeString5: B$ 1, 7, 13, 58, 12, 36, 77, 158, 18,
;U$ "Hello", 0
StringStart3: W$ 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 04f60, 0597d, 0e, 012, 0ff00, 0

NextData3: W$ 7, 1, 1, 1, 1, 1, 1, 1]

[UnicodeString: U$ "Hello", 0]

[DummyBytes2: B$ 35, 47, 5, 5, 14, 5]
[< UnicodeStringUnaligned: B$ 1,
W$ 04f60, 0597d, 056fe, 06211, 04e00, 0
B$ 37]


And the former versions (that produced incorrect results in some circumstances) were like this:
Unicode version
Proc StrLenW::
    Arguments @Input
    Uses ecx, edx

    mov eax D@Input
    mov ecx eax
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1

EndP



Ansi version
Proc StrLen::
    Arguments @Input
    Uses ecx, edx

    mov eax D@Input
    mov ecx eax
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input

EndP


With the new fixes there should no longer be any unknown error resulting in a wrong value. It now works for unaligned strings, for strings embedded inside chains of data, and with whatever code page you use (in the case of the strlenW version).
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 17, 2023, 06:55:55 AM
Quote from: guga on August 17, 2023, 04:21:21 AMit had some small flaws when retrieving the length of unaligned strings, and also of strings that contain extra bytes at the end (immediately after the null terminator)

Hi Guga,

I'm not aware of such problems, can you post an example of a string that misbehaves, in Masm syntax?

There is a known issue with VirtualAlloc'ed strings (https://www.masmforum.com/board/index.php?topic=18728.msg159628;sslRedirect#msg159628).
Title: Re: Faster Memcopy ...
Post by: HSE on August 17, 2023, 07:04:39 AM
Talking about missing people, rrr314159 hasn't posted anything on Quora since March 4. Not good. :eusa_boohoo:
Title: Re: Faster Memcopy ...
Post by: guga on August 17, 2023, 08:22:14 AM
Quote from: jj2007 on August 17, 2023, 06:55:55 AM
Quote from: guga on August 17, 2023, 04:21:21 AMit had some small flaws when retrieving the length of unaligned strings, and also of strings that contain extra bytes at the end (immediately after the null terminator)

Hi Guga,

I'm not aware of such problems, can you post an example of a string that misbehaves, in Masm syntax?

There is a known issue with VirtualAlloc'ed strings (https://www.masmforum.com/board/index.php?topic=18728.msg159628;sslRedirect#msg159628).

Hi JJ, ok... I'll post 2 examples: one now, so you can see what's wrong (and my fix); then I'll try to recreate the error in the Ansi version of the string length algo.

StrlenW - Unicode version
_________________________________
.data
UnicodeString2  db 1 ;  Yes...but as a byte to force it to be unaligned
                text "UTF-16LE", 'Hello',0
NextData        dw 9
                dw 1
                dw 1
                dw 1
                dw 1
                dw 1
                dw 1
                dw 1
.code
(.................)

mov edi, offset UnicodeString2
inc edi ; make the string start exactly at "Hello", inside our data chain, so it is unaligned
push edi
call StrLenW
add eax, 2
add edi, eax ; edi should now point past the null terminator word, i.e. at NextData, but instead it points at the 2nd "l" in Hello

The function was (the one that returns incorrect values in some circumstances):
StrLenW        proc near
Input           = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                push    edx
                mov     eax, [ebp+Input]
                mov     ecx, eax
                and     eax, 0FFFFFFF0h
                and     ecx, 0Fh
                or      edx, 0FFFFFFFFh
                shl     edx, cl
                movdqu  xmm1, xmmword ptr [eax] ; movdqu  xmm1, [eax] ?
                xorps   xmm0, xmm0
                pcmpeqw xmm0, xmmword ptr [eax] ; pcmpeqw xmm0, [eax] ?
                add     eax, 10h
                pmovmskb ecx, xmm0
                xorps   xmm0, xmm0
                and     ecx, edx
                jnz     short loc_4040B0

loc_40409E:
                movups  xmm1, xmmword ptr [eax]
                pcmpeqw xmm1, xmm0
                pmovmskb ecx, xmm1
                add     eax, 10h
                test    ecx, ecx
                jz      short loc_40409E

loc_4040B0:
                bsf     ecx, ecx
                lea     eax, [ecx+eax-10h]
                sub     eax, [ebp+Input]
                shr     eax, 1
                pop     edx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLenW        endp


In the above example I used Unicode text encapsulated between two pieces of data (1 byte before the string starts, and a sequence of words after it). This was to force a check for alignment problems.

The same problem happens if you replace "Hello" with another code page, such as Chinese etc... Try replacing Hello with dw 04f60, 0597d, 056fe, 06211, 04e00, 0 and this kind of error will happen more often.


To fix that, I changed the code to just this (masm syntax):
My fix
StrLenW2        proc near
AddBytes        = dword ptr -4
Input           = dword ptr  8

                push    ebp
                mov     ebp, esp
                sub     esp, 4
                push    ecx
                mov     [ebp+AddBytes], 0
                xorps   xmm0, xmm0
                mov     ecx, [ebp+Input]
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqw xmm0, xmm1
                pmovmskb eax, xmm0
                test    ax, ax
                jnz     short loc_4040BA

loc_4040A1:
                add     ecx, 10h
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqw xmm0, xmm1
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short loc_4040A1
                sub     ecx, [ebp+Input]
                mov     [ebp+AddBytes], ecx

loc_4040BA:
                and     ax, 101010101010101b
                bsf     ax, ax
                add     eax, [ebp+AddBytes]
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLenW2        endp



Of course, this can also be optimized (as I did for strlen), but I'll do that later, after I post the Ansi version for you and we check the speed. With this fix, the Unicode version (strlenW) works correctly no matter if the string is encapsulated, no matter if it has extra bytes after the terminating null word, no matter if it is unaligned, and especially no matter what code page you use; it even works if the string length is zero. So it works for regular Unicode "Ascii+zero" and for other code pages (2 bytes per char, as in Chinese, Russian etc.), or even encodings that use 3 or 4 bytes to represent a char.

The resulting value in eax is the total amount of bytes used by the Unicode string (not the amount of chars). I made it return bytes instead of chars so we can also use it on strings in other code pages without being forced to convert with M$ APIs etc.


I'll now try to reproduce the error in the strlen (Ansi) version; the error happens on the same principle, although it is a bit harder to find.



Note:
Below is the RosAsm version of StrLenW2 (not yet optimized), with all the comments I made while trying to fix it.
Proc StrLenW2::
    Arguments @Input
    Local @AddBytes
    Uses ecx

    mov D@AddBytes 0 ; create a variable where we store the byte count of the full chunks scanned in the loop, to be added to the string length later
    xorps xmm0 xmm0 ; zero register xmm0. Why zero ? Because we are looking at words until we find the null terminator word (so, 2 zero bytes).
                    ; xmm0 holds 8 words (16 bytes), so later we scan those 8 words to find the double zero

    mov ecx D@Input
    movups xmm1 X$ecx ; load the first 16 bytes of the string into xmm1
    pcmpeqw xmm0 xmm1 ; compare the 8 words in xmm1 against the zeroed xmm0. pcmpeqw treats the registers as vectors of words:
                      ; wherever a word of xmm1 equals 0 (our null terminator word), all 16 bits of that word in xmm0 are set,
                      ; i.e. the word becomes 0FFFF; words that don't match become 0.
                      ; So if the string is at most 7 chars long (not counting the null terminator word), no matter where the
                      ; terminator is located, it will show up in one of these 8 words. Example: say we have the Unicode string "Hello", 0.
                      ; In xmm1 it is laid out as:
                      ; Word7 Word6 Word5 Word4 Word3 Word2 Word1 Word0 ==> (bits 127 - 0)
                      ; 0001  0007  0000  006F  006C  006C  0065  0048 ==> 0001 and 0007 are extra data after the end of the string (which may or may not exist), followed by 'olleH'
                      ;   ?     ?    Ok    o      l     l     e     H
                      ;            (zero)

                      ; After pcmpeqw, xmm0 becomes:

                      ; Word7 Word6 Word5 Word4 Word3 Word2 Word1 Word0 ==> (bits 127 - 0)
                      ; 0000  0000  FFFF  0000  0000  0000  0000  0000 ==> our null terminator word is the 6th word ("Word5"), and all bits of that word were set: 0FFFF = 1111_1111_1111_1111 (binary)
                      ; So the terminator sits at word index 5, which means the length is 5 chars. We now know the length of our string,
                      ; but it is expressed as a word position, and we still need to turn it into a byte count. So, let's continue.

    pmovmskb eax xmm0 ; PMOVMSKB builds a mask from the most significant bit of each byte of xmm0 (the src) and stores the result in eax. But hold on... what does that mean ?
                      ; It means we are transposing the position of our null terminator word from xmm0 into eax. But wait... xmm registers are 128 bits, and eax is only 32.
                      ; So what's going on here ? pmovmskb turns each of the 16 bytes of xmm0 into a single bit of eax, so our 0FFFF word (16 set bits) shrinks
                      ; to a pair of set bits (11 in binary = 3 in decimal) at the corresponding position, and the upper 16 bits of eax are cleared.
                      ; Therefore eax becomes 0C00 = 00__0000_1100__0000_0000 (binary): each word of xmm0 maps to a 2-bit pair of ax at the same position. As a picture:

                      ; Pair7 Pair6 Pair5 Pair4 Pair3 Pair2 Pair1 Pair0 ==> (bits 15 - 0)
                      ;  00    00    11    00    00    00    00    00   <==== bits = 00__0000_1100__0000_0000 (binary) = 0C00 (in hexa)

                      ; So all we really need is the value in ax (the low part of eax); the high part is always 0 because pcmpeqw only produced 16 mask bytes.
                      ; Whatever the value here is, the maximum value for eax is 0FFFF
    test ax ax | jnz L1> ; We test ax against 0 because ax now carries the information about where (if anywhere) the null terminator was found among these 8 words.
                         ; In fact ax is all we need, since pmovmskb stores the result in the low word of eax and clears the high word.
                           ; If any of the 8 words contained the null terminator, one of them was masked to 0FFFF, and pmovmskb flagged the corresponding bit pair with 11 (binary).
                           ; But if this chunk of 8 words contains no zero word, xmm0 holds only zeros and eax is 0 as well (no bit pair is set).
                           ; Which means: whenever the terminator is inside this first chunk, the routine jumps over to "L1" and performs the actual byte count.
                           ; Otherwise we fall into the loop, advance to the next 16 bytes, and keep looking for the double zero until we find it.

        L0:
                add ecx 16 ; point to the next 16 bytes of our string
                movups xmm1 X$ecx | pcmpeqw xmm0 xmm1 | pmovmskb eax xmm0
            test ax ax | jz L0< ; Did we find a double zero ? No: jump back and do it again. Otherwise, fall through to the next line
        sub ecx D@Input     ; Now comes the easier part. ax holds the mask for the chunk that contains the terminator, so what we still need is the offset of
                            ; that chunk from the start of the string: subtract Input from the current position in ecx and store the result in AddBytes
        mov D@AddBytes ecx
L1:

    ;movzx eax ax ; not needed: pmovmskb already cleared the high word of eax
    ; Now comes the trick. At this point ax contains the position of the double zero as a 2-bit pair (in our example, pair 5).
    ; Since we compared words, each match sets 2 bits. Our first zero byte starts at the first bit of the pair, so we can safely discard the second bit (from right to left).
    ; Doing this leaves a single set bit, the least significant one, whose index is the byte offset of the terminator inside this chunk.

    ; So for each pair of bits we simply and with 01 (binary), discarding the 2nd (most significant) bit of each pair. Since we have 8 pairs, the full mask is
    ; 00_01_01_01_01__01_01_01_01 (in binary) = 05555 (in hexa):
    and ax 00_01_01_01_01__01_01_01_01
    ; Now that the 2nd bit of each pair is gone, eax has become 0400 (hexa) = 00__0000_0100__0000_0000 (binary), which corresponds exactly to:
    ; Pair7 Pair6 Pair5 Pair4 Pair3 Pair2 Pair1 Pair0 ==> (bits 15 - 0)
    ;  00    00    01    00    00    00    00    00   <==== bits = 00__0000_0100__0000_0000 (binary) = 0400 (in hexa)
    ; The surviving bit sits at bit 10, i.e. at the start of pair 5: byte offset 10, which is precisely the byte length of our "Hello" in UTF-16.
    ; And since the surviving bit is always the first of its pair, bsf always returns an even number, so we don't need to round with a shr eax 1 | shl eax 1 pair
    ; or an and eax 0FFFF_FFFE to clear bit 0. Could we and with 00_10_10_10_10__10_10_10_10 instead ? Yes, but that would yield odd values, which we would
    ; then have to fix with and eax 0FFFF_FFFE or with shr eax 1 | shl eax 1.
    bsf ax ax ; find the byte offset of the terminator within this chunk of 16 bytes. bsf ax ax is enough (and perhaps faster) because the whole mask lives in ax.
              ; Consider doing it later with lzcnt or popcnt
    add eax D@AddBytes ; finally, add the bytes of the full chunks we skipped in the loop (if we entered it); eax now holds the correct length of the string in bytes


EndP
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 17, 2023, 08:39:56 AM
Hi Guga,

Your string behaves just fine with wLen() resp. MbStrLenW - see attachment. Are you using an older algo?
Title: Re: Faster Memcopy ...
Post by: guga on August 17, 2023, 09:20:47 AM
Quote from: jj2007 on August 17, 2023, 08:39:56 AMHi Guga,

Your string behaves just fine with wLen() resp. MbStrLenW - see attachment. Are you using an older algo?

I don't know, perhaps. I haven't touched the strlen and strlenw algos in a long time. But which are the correct/updated algos ? I can't find them in MasmBasic. I saw the source you uploaded, but the algos aren't there (in masm syntax).
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 17, 2023, 11:35:27 AM
Quote from: guga on August 17, 2023, 09:20:47 AMBut what are the correct/updated algos?

See your PM :azn:

Btw it seems we never timed the Unicode version of StrLen, here it is:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23327   cycles for 100 * CRT wcslen
20836   cycles for 100 * Masm32 ucLen
9566    cycles for 100 * MasmBasic wLen
8981    cycles for 100 * _MbStrLenW

23340   cycles for 100 * CRT wcslen
20809   cycles for 100 * Masm32 ucLen
9566    cycles for 100 * MasmBasic wLen
8988    cycles for 100 * _MbStrLenW

23284   cycles for 100 * CRT wcslen
20852   cycles for 100 * Masm32 ucLen
9157    cycles for 100 * MasmBasic wLen
9003    cycles for 100 * _MbStrLenW

23262   cycles for 100 * CRT wcslen
20883   cycles for 100 * Masm32 ucLen
9586    cycles for 100 * MasmBasic wLen
8992    cycles for 100 * _MbStrLenW

14      bytes for CRT wcslen
10      bytes for Masm32 ucLen
10      bytes for MasmBasic wLen
66      bytes for _MbStrLenW

100     = eax CRT wcslen
100     = eax Masm32 ucLen
100     = eax MasmBasic wLen
100     = eax _MbStrLenW
Title: Re: Faster Memcopy ...
Post by: guga on August 17, 2023, 04:12:03 PM
Hi JJ. Tks, indeed I was using an older version. But can you test it so we can compare the speed ?
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 17, 2023, 07:46:51 PM
Quote from: guga on August 17, 2023, 04:12:03 PMHi JJ. Tks, indeed I was using an older version. But can you test it so we can compare the speed ?

Hi Guga,

IMHO your algo is fast enough :biggrin:

You beat poor CRT by a factor of 9, but ok, as Assembly programmers we are used to that, aren't we?

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23713   cycles for 100 * CRT wcslen
20882   cycles for 100 * Masm32 ucLen
9471    cycles for 100 * MasmBasic wLen
9030    cycles for 100 * _MbStrLenW
9099    cycles for 100 * _MbStrLenW2
2781    cycles for 100 * StrLenW Guga

23718   cycles for 100 * CRT wcslen
20855   cycles for 100 * Masm32 ucLen
9555    cycles for 100 * MasmBasic wLen
9012    cycles for 100 * _MbStrLenW
9014    cycles for 100 * _MbStrLenW2
2696    cycles for 100 * StrLenW Guga

23641   cycles for 100 * CRT wcslen
20836   cycles for 100 * Masm32 ucLen
9541    cycles for 100 * MasmBasic wLen
9023    cycles for 100 * _MbStrLenW
9010    cycles for 100 * _MbStrLenW2
2699    cycles for 100 * StrLenW Guga

23658   cycles for 100 * CRT wcslen
20844   cycles for 100 * Masm32 ucLen
9435    cycles for 100 * MasmBasic wLen
9008    cycles for 100 * _MbStrLenW
9006    cycles for 100 * _MbStrLenW2
2701    cycles for 100 * StrLenW Guga

14      bytes for CRT wcslen
10      bytes for Masm32 ucLen
10      bytes for MasmBasic wLen
66      bytes for _MbStrLenW
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga

100     = eax CRT wcslen
100     = eax Masm32 ucLen
100     = eax MasmBasic wLen
100     = eax _MbStrLenW
100     = eax _MbStrLenW2
100     = eax StrLenW Guga
Title: Re: Faster Memcopy ...
Post by: guga on August 17, 2023, 11:45:44 PM
Tks a lot, JJ


It seems fast, indeed :greenclp:  :greenclp:  :greenclp:

I'll now do a few more tests on the strlen version (the Ansi one) and finish some comments in the code, so I won't forget what I've done :biggrin:  :biggrin:  :biggrin:

AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

16116   cycles for 100 * CRT wcslen
25196   cycles for 100 * Masm32 ucLen
7524    cycles for 100 * MasmBasic wLen
7574    cycles for 100 * _MbStrLenW
7405    cycles for 100 * _MbStrLenW2
2568    cycles for 100 * StrLenW Guga

16085   cycles for 100 * CRT wcslen
25030   cycles for 100 * Masm32 ucLen
7537    cycles for 100 * MasmBasic wLen
7419    cycles for 100 * _MbStrLenW
7424    cycles for 100 * _MbStrLenW2
2545    cycles for 100 * StrLenW Guga

16308   cycles for 100 * CRT wcslen
25578   cycles for 100 * Masm32 ucLen
7614    cycles for 100 * MasmBasic wLen
7419    cycles for 100 * _MbStrLenW
7456    cycles for 100 * _MbStrLenW2
2573    cycles for 100 * StrLenW Guga

16086   cycles for 100 * CRT wcslen
25075   cycles for 100 * Masm32 ucLen
7600    cycles for 100 * MasmBasic wLen
7447    cycles for 100 * _MbStrLenW
7396    cycles for 100 * _MbStrLenW2
2547    cycles for 100 * StrLenW Guga

14      bytes for CRT wcslen
10      bytes for Masm32 ucLen
10      bytes for MasmBasic wLen
66      bytes for _MbStrLenW
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga

100     = eax CRT wcslen
100     = eax Masm32 ucLen
100     = eax MasmBasic wLen
100     = eax _MbStrLenW
100     = eax _MbStrLenW2
100     = eax StrLenW Guga

--- ok ---


Btw, you may use it in MasmBasic if you wish, and others can use it too, of course :)

I needed to finish those routines to continue updating them in RosAsm. During the updates I found those minor errors in both functions, which required fixes. I hope I can finish the RosAsm updates soon and release the next version, but it's a bit hard: I made several changes to the RosAsm code, trying to make it faster and more independent of the interface, and accidentally broke a thing or two, especially in the resource editor (I'm trying to fix it right now). A hell, a true hell to do. :bgrin:  :bgrin:  :bgrin:

One question about your optimization: you returned eax as chars instead of bytes. Considering that the new strlenW function can be used with other code pages (Chinese, Russian, Greek strings etc.), and also with encodings that may use 3 or 4 bytes to represent a character, is returning chars in eax better than returning bytes ? What if the user's text is UTF8 (or another encoding that uses more than 2 bytes per char) ?
https://www.ibm.com/docs/en/db2-for-zos/11?topic=unicode-utfs
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 18, 2023, 12:14:46 AM
Quote from: guga on August 17, 2023, 11:45:44 PMOne question about your optimization: you returned eax as chars instead of bytes.

Practically all Windows functions require chars, not bytes. For Utf8, you can get chars with uLen() (https://www.jj2007.eu/MasmBasicQuickReference.htm).
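As a rough illustration of the idea (this is a minimal sketch, not MasmBasic's uLen): Utf8 chars can be counted by skipping continuation bytes, i.e. bytes of the form 10xxxxxxb.

u8len proc                 ; esi = zero-terminated Utf-8 string; returns chars in eax (esi is advanced)
    xor eax, eax
@@:
    mov dl, [esi]
    test dl, dl
    jz done
    inc esi
    and dl, 0C0h
    cmp dl, 80h            ; continuation byte ? then it does not start a new char
    je @B
    inc eax                ; lead byte (or plain ASCII): count one char
    jmp @B
done:
    ret
u8len endp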
Title: Re: Faster Memcopy ...
Post by: guga on August 18, 2023, 12:54:26 AM
Quote from: jj2007 on August 18, 2023, 12:14:46 AMPractically all Windows functions require chars, not bytes. For Utf8, you can get chars with uLen() (https://www.jj2007.eu/MasmBasicQuickReference.htm).

Hi JJ. Ok, I'll change it to chars, then. One question... in your version, did you need to detect the code page ?

Len, wLen, uLen                                get string length
                mov ebx, Len(My$)
                mov eax, Len("123")                        ; Len returns 3 in eax, so Olly will show mov eax, eax
                void Len("123")                                ; put the result into eax (void avoids the mov eax, eax)
                mov eax, wLen(MyUnicode$)            ; wideLen for Unicode strings
                void wLen(wChr$("123"))                ; returns 3 chars in eax
            Let esi="Добро пожаловатьäöü"                ; assign a UTF-8 string
            uPrint "[Добро пожаловатьäöü]"                ; or print it directly
                Print Str$(" is %i characters long (expected: 19)\n", uLen(esi))
Rem        returns length in eax; bytes for ANSI, chars for Unicode; for UTF-8 encoded strings, Len returns bytes, uLen returns chars

I mean, detecting whether the sequence of bytes before the null terminator belongs to UTF8, UTF7, UTF32 etc. ? Maybe we can make a small update at the end of strlenW to account for UTF8 characters that are 2, 3 or 4 bytes long.


A single line after the final "add eax, ecx" might be enough, perhaps with a flag.

Something like: if the result must be for UTF7 (3 bytes per char), we divide the result by 3; if UTF32, by 4; and so on. So perhaps add another parameter to the function, used as a flag.

add eax ecx

test Flag UTF7 | je L1> ; is the string in UTF7 ? divide the result by 3
    mov edx 0AAAAAAAB
    mul edx
    shr edx 1
    mov eax edx
jmp L2>
test Flag UTF32 | je L1> ; Is UTF32 ? divide by 4
shr eax 2
jmp L2>
L1:  ; else keep as Unicode (2 bytes per char)
    shr eax 1
L2:
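(The 0AAAAAAAB constant works because 0AAAAAAABh = ceil(2^33 / 3): mul leaves the high dword of the 64-bit product in edx, and the shr edx 1 completes the division by 2^33, so edx = eax/3 for any unsigned 32-bit eax.)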

Is it worthwhile extending the algo to handle UTF7 or even UTF32, or is that unnecessary for general usage ? I'm asking because I know UTF7 and UTF32 exist, but I don't know whether they are commonly used in Windows (or other systems).

If it won't be useful, I'll stick with what you said and use only the shr eax 1 at the end to return chars instead of bytes.
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 18, 2023, 01:17:04 AM
Quote from: guga on August 18, 2023, 12:54:26 AMon your version did you needed to detect the CodePage ?

Utf8 is basically Ansi with some tricks. For most uses of Instr() or Len(), like parsing etc., the plain Ansi functions are appropriate. Only in rare cases do you need to know the chars - that's where uLen jumps in. There is no codepage test in this sense, but you tell it explicitly "treat this string as Utf8".

QuoteIs it worthwhile extending the algo to handle UTF7 or even UTF32, or is that unnecessary for general usage?

I wouldn't invest any effort. I've never seen these formats in the wild.
Title: Re: Faster Memcopy ...
Post by: guga on August 18, 2023, 01:34:53 AM
Ok, I guess I understand now. I'll do as you said then.

I'll keep strlenW returning bytes, as it already does (since that is more useful for parsing etc.), and create a variation that returns chars (with shr eax 1 after the add eax ecx), named UstrLen, UCharLen or something like that.
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 18, 2023, 01:49:00 AM
Quote from: guga on August 18, 2023, 01:34:53 AMOk, I guess I understand now. I'll do as you said then.

I'll keep strlenW returning bytes, as it already does (since that is more useful for parsing etc.), and create a variation that returns chars (with shr eax 1 after the add eax ecx), named UstrLen, UCharLen or something like that.

Attention: Windows expects chars from "W" variants. Many WinAPI string functions have two variants, "A" and "W". Both expect chars, but in the case of the Ansi variants chars=bytes.

I use uLen() to indicate Utf8 chars, which are fewer than the bytes because one char can be composed of one, two, three or even four bytes.

include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let esi="Нажмите на эту кнопку"   ; click on this button
  mov ecx, uLen(esi)
  Print Str$("The string is %i bytes long", Len(esi)), Str$(", but has %i chars\n", ecx)
  PrintLine "[", uLeft$(esi, 7), "] - correct, 7 chars"
  PrintLine "[", Left$(esi, 7), "] - incorrect, 7 bytes"
  uMsgBox 0, esi, "Hi", MB_OK

EndOfCode

Output:
The string is 39 bytes long, but has 21 chars
[Нажмите] - correct, 7 chars
[Наж�] - incorrect, 7 bytes
Title: Re: Faster Memcopy ...
Post by: lingo on August 19, 2023, 12:11:46 AM
Hi,
It is my old StrLenW. :biggrin:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Align 16
db 7 dup (90h)
StrLenW_Lingo                 proc
movdqu xmm1, [eax]
pxor xmm0, xmm0
pcmpeqw xmm1, xmm0
pmovmskb ecx,  xmm1
test ecx,  ecx
jne @L_End
mov edx,  eax
and eax,  -16
@@:
pcmpeqw xmm0, [eax+16]
pcmpeqw xmm1, [eax+32]
por xmm1, xmm0
add eax,  32
pmovmskb ecx,  xmm1
test ecx,  ecx
jz @b
shl ecx,  16
sub eax,  edx
pmovmskb edx,  xmm0
add ecx,  edx
                              ; and  cx,5555h
bsf ecx,  ecx
lea eax,  [eax+ecx-16]
ret
@L_End:
                              ; and  cx,5555h
bsf eax, ecx
ret
StrLenW_Lingo                 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Title: Re: Faster Memcopy ...
Post by: guga on August 19, 2023, 03:46:49 AM
Hi Lingo

It's very similar. I remember it was based on one JJ and I did a long time ago for strlen; perhaps a variation of yours, I don't remember.

This one I adapted from strlen, and I have now updated it because it had flaws with unaligned strings and UTF16 (Chinese etc.). Update yours with this too:

StrLenW         proc near
InputString     = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+InputString]

ZeroNotFound:
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqw xmm0, xmm1
                add     ecx, 16
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short ZeroNotFound
                sub     ecx, [ebp+InputString]
                add     ecx, -16
                and     ax, 101010101010101b
                bsf     ax, ax
                add     eax, ecx
                shr     eax, 1
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLenW         endp

It works for all code pages that use 2 bytes per char.


If you want one that can be used for parsing and returns the length in bytes, you can use this as well:


UniStrLen         proc near
InputString     = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+InputString]

ZeroNotFound:
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqw xmm0, xmm1
                add     ecx, 16
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short ZeroNotFound
                sub     ecx, [ebp+InputString]
                add     ecx, -16
                and     ax, 101010101010101b
                bsf     ax, ax
                add     eax, ecx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
UniStrLen         endp

Or this for Ansi


StrLen         proc near
InputString     = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+InputString]

ZeroNotFound:
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqb xmm0, xmm1
                add     ecx, 16
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short ZeroNotFound
                sub     ecx, [ebp+InputString]
                add     ecx, -16
                bsf     ax, ax
                add     eax, ecx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLen         endp

I don't know if your old version had those fixes, but feel free to update them  :thumbsup:  :thumbsup:  :thumbsup:
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 19, 2023, 04:52:41 AM
Quote from: lingo on August 19, 2023, 12:11:46 AMIt is my old StrLenW. :biggrin:

IMHO it should return 20, not 39:

The string: [UTF-16LE Hello World]
wLen(edi)       eax             20
MbSLenW         eax             20
StrLenW_Lingo   eax             39

align 16
dummy db 7 dup(0) ;  force unaligned
_dw UnicodeString2, "UTF-16LE ", "Hello World", 0
Title: Re: Faster Memcopy ...
Post by: guga on August 19, 2023, 05:28:47 AM
Quote from: jj2007 on August 19, 2023, 04:52:41 AM
Quote from: lingo on August 19, 2023, 12:11:46 AMIt is my old StrLenW. :biggrin:

IMHO it should return 20, not 39:

The string: [UTF-16LE Hello World]
wLen(edi)       eax             20
MbSLenW         eax             20
StrLenW_Lingo   eax             39

align 16
dummy db 7 dup(0) ;  force unaligned
_dw UnicodeString2, "UTF-16LE ", "Hello World", 0


Hi JJ

I don't get it. Why 20 ? Isn't it supposed to return 11 for "Hello World" ? (11 chars, not counting the null terminator word.)

So, 11*2 = 22 bytes to represent the Unicode text, plus 2 more for the two extra zeroes ?
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 19, 2023, 05:35:18 AM
The string: [UTF-16LE Hello World]

Count them, it's 20 chars.
Title: Re: Faster Memcopy ...
Post by: guga on August 19, 2023, 05:46:15 AM
OH ! I see, it's the whole string "UTF-16LE Hello World", and not just "Hello World" written in UTF-16LE. I thought that was a token for MasmBasic and not part of the string itself :biggrin:  :biggrin:  :biggrin:  :biggrin:

Indeed, it should return 20 chars and not 39.
Title: Re: Faster Memcopy ...
Post by: lingo on August 20, 2023, 10:17:26 AM
This is my speed test, using MichaelW's simple timing framework. :biggrin:

C:\Documents\Visual Studio 2022\Projects\SpeedTest>c2test2

8 cycles -> StrLenW_Guga ANSI,  Return in EAX: 100

6 cycles -> StrLenW_Lingo ANSI,  Return in EAX: 100

23 cycles -> StrLenW_Guga No SAR,  Return in EAX: 200

13 cycles -> StrLenW_Lingo No SAR,  Return in EAX: 200

8 cycles -> StrLenW_Guga with SAR,  Return in EAX: 34

5 cycles -> StrLenW_Lingo with SAR,  Return in EAX: 34


Press enter to exit...

Source in attachment.
Title: Re: Faster Memcopy ...
Post by: lingo on August 20, 2023, 11:09:54 PM
Hi guys,

You had a lot of fun yesterday :badgrin:  :badgrin:  :badgrin:  :badgrin:
and now, after a more than 70% speed difference... absolute silence  :shhh: ... he-he-he..
or the sound of silence... Nice song: https://www.youtube.com/watch?v=9O9DaZUS_EU
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 21, 2023, 02:30:33 AM
I hadn't looked at it, Lingo, but it's indeed almost 27% faster than Guga's algo :thumbsup:

(and it returns a wrong result, but never mind, we are used to that :biggrin: )

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

23633  cycles for 100 * CRT wcslen
9029    cycles for 100 * _MbStrLenW2
2702    cycles for 100 * StrLenW Guga
1987    cycles for 100 * StrLenW_LingoNoSar

23663  cycles for 100 * CRT wcslen
9020    cycles for 100 * _MbStrLenW2
2698    cycles for 100 * StrLenW Guga
1997    cycles for 100 * StrLenW_LingoNoSar

23657  cycles for 100 * CRT wcslen
9026    cycles for 100 * _MbStrLenW2
2704    cycles for 100 * StrLenW Guga
1986    cycles for 100 * StrLenW_LingoNoSar

23645  cycles for 100 * CRT wcslen
9021    cycles for 100 * _MbStrLenW2
2704    cycles for 100 * StrLenW Guga
1986    cycles for 100 * StrLenW_LingoNoSar

14      bytes for CRT wcslen
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga
98      bytes for StrLenW_LingoNoSar

100    = eax CRT wcslen
100    = eax Masm32 ucLen
100    = eax _MbStrLenW2
100    = eax StrLenW Guga
199    = eax StrLenW_LingoNoSar

Btw it sucks a bit for short strings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3854    cycles for 100 * CRT wcslen
1549    cycles for 100 * _MbStrLenW2
695     cycles for 100 * StrLenW Guga
684     cycles for 100 * StrLenW_LingoNoSar
618     cycles for 100 * MbSLenW

3853    cycles for 100 * CRT wcslen
1547    cycles for 100 * _MbStrLenW2
695     cycles for 100 * StrLenW Guga
691     cycles for 100 * StrLenW_LingoNoSar
619     cycles for 100 * MbSLenW

3850    cycles for 100 * CRT wcslen
1550    cycles for 100 * _MbStrLenW2
719     cycles for 100 * StrLenW Guga
683     cycles for 100 * StrLenW_LingoNoSar
668     cycles for 100 * MbSLenW

3862    cycles for 100 * CRT wcslen
1549    cycles for 100 * _MbStrLenW2
720     cycles for 100 * StrLenW Guga
684     cycles for 100 * StrLenW_LingoNoSar
645     cycles for 100 * MbSLenW

14      bytes for CRT wcslen
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga
102     bytes for StrLenW_LingoNoSar
50      bytes for MbSLenW

14      = eax CRT wcslen
14      = eax Masm32 ucLen
14      = eax _MbStrLenW2
14      = eax StrLenW Guga
14      = eax StrLenW_LingoNoSar
14      = eax MbSLenW
Title: Re: Faster Memcopy ...
Post by: daydreamer on August 21, 2023, 03:49:35 AM
Guga, Unicode also has game things like chess pieces; memcopy them from a library with solved mate-in-3 problems.

If you read text from a file, don't you already get len = filesize? And isn't there an API to check whether a text file is 16-bit Unicode or not (IsTextUnicode)?
Title: Re: Faster Memcopy ...
Post by: lingo on August 21, 2023, 05:00:49 AM
QuoteHi Guga,
IMHO your algo is fast enough :biggrin:
You beat poor CRT by a factor of 9, but ok, as Assembly programmers we are used to that, aren't we?

Again envy or incompetence about speed  optimization of code...

QuoteI hadn't looked at it, Lingo, but it's indeed almost 27% faster than Guga's algo

Another fake manipulation attempt against Lingo:

Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (SSE4)
12358   cycles for 100 * CRT wcslen
9124    cycles for 100 * _MbStrLenW2
2325    cycles for 100 * StrLenW Guga
1572    cycles for 100 * StrLenW_LingoNoSar
1572/2325*100 = 67.6%

12426   cycles for 100 * CRT wcslen
9142    cycles for 100 * _MbStrLenW2
2279    cycles for 100 * StrLenW Guga
1576    cycles for 100 * StrLenW_LingoNoSar
1576/2279*100 = 69.1%

12360   cycles for 100 * CRT wcslen
9115    cycles for 100 * _MbStrLenW2
2352    cycles for 100 * StrLenW Guga
1579    cycles for 100 * StrLenW_LingoNoSar
1579/2352*100 = 67.1%

12264   cycles for 100 * CRT wcslen
9106    cycles for 100 * _MbStrLenW2
2323    cycles for 100 * StrLenW Guga
1575    cycles for 100 * StrLenW_LingoNoSar
1575/2323*100 = 67.8%

14      bytes for CRT wcslen
66      bytes for _MbStrLenW2
58      bytes for StrLenW Guga
98      bytes for StrLenW_LingoNoSar

100     = eax CRT wcslen
100     = eax Masm32 ucLen
100     = eax _MbStrLenW2
100     = eax StrLenW Guga
199     = eax StrLenW_LingoNoSar
--- ok ---
Quoteand it returns a wrong result, but never mind, we are used to that
Another attempt at fake manipulation, and even an insult against Lingo.
You are comparing two different things: StrLenW_LingoNoSar WITHOUT the sar eax,1 at the end of the algo,
with StrLenW_Guga WITH the sar eax,1 at the end of the algo.
It is normal to have a factor-of-2 difference in your manipulated test.
Next time use StrLenW_LingoWsar!
Pls see the attachment, and Don't Teach Father How To Make Babies!

QuoteBtw it sucks a bit for short strings:
You'll get a lot more sucks if you test it with longer strings! :badgrin:  :skrewy:


It is also from your fake "test":
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
Speed Differences:
1987/2702*100=73.5%
1997/2698*100=74%
1986/2704*100=73.4%
1986/2704*100=73.4%

Title: Re: Faster Memcopy ...
Post by: guga on August 21, 2023, 06:10:36 AM
Hi Lingo, I'll test again later.

Just one question: you use align 16 on the input strings, right ? How will the algo behave (in terms of speed/accuracy) for unaligned strings ?

Did you test those cases ?
Title: Re: Faster Memcopy ...
Post by: lingo on August 21, 2023, 06:34:07 AM
Thank you Guga, :biggrin:

You can use my test (it is from MichaelW, written in plain MASM32).
So you can change whatever you want in it and recompile it with makeit.bat.  :thumbsup:
Title: Re: Faster Memcopy ...
Post by: guga on August 21, 2023, 06:45:25 AM
Quote from: lingo on August 21, 2023, 06:34:07 AMThank you Guga, :biggrin:

You can use my test (it is from MichaelW written in plain MASM32).
So, you can change what you want in it end recompile it again with makeit.bat.  :thumbsup:

Tks, Lingo. I'll give it a try later. These functions are best when we build them to work in as many scenarios as possible. That's why I tested aligned/unaligned strings, and also the case where there is other data before or after the string (as when a string belongs to another structure or chain of data etc.).

I'm currently working on a way to speed up (and build) functions that change the case of strings. JJ is helping with this. It's a bit harder than I expected, but I managed to make it work (for non-Latin chars or Unicode strings, at the moment). Later I'll see how to optimize it, and then I'll come back for more tests on strlen and strlenW.

It's beyond me why M$ never did something like this with a focus on speed. Analyzed up close, some of the APIs inside msvcrt are a true mess. For example, even their libm_sse2_log_precise function, which was supposed to be accurate and fast, is slower than the ones we build here. :dazzled:
Title: Re: Faster Memcopy ...
Post by: jj2007 on August 21, 2023, 07:43:31 AM
Quote from: lingo on August 21, 2023, 05:00:49 AMSpeed Differences:
1987/2702*100=73.5%
1997/2698*100=74%
1986/2704*100=73.4%
1986/2704*100=73.4%

Your algo is 26.5% faster: 1-1987/2702=26.5%

"Don't Teach Father How To Make Babies!" is far below your intellectual level, stop it.