The MASM Forum

General => The Laboratory => Topic started by: rrr314159 on March 03, 2015, 02:40:50 PM

Title: Faster Memcopy ...
Post by: rrr314159 on March 03, 2015, 02:40:50 PM
In this thread (http://masm32.com/board/index.php?topic=1458.0) Habran, jj2007 and others submitted memory copy algorithms that were tested in jj's program MemCopyAlgos, which ran them on a block of 2048 bytes with all possible alignments. (BTW the thread is a lot of fun to read.) I substituted a new algo for Habran's "Ferrari" (because I had trouble inserting it elsewhere in the program).

This new algo is the fastest on every measure: it beat every other algo at every alignment. Here are typical runs on Intel and AMD machines (rrr314's Ferrari in the last column):

Code: [Select]
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      114
------------------------------------------------------------------------------------
2048, d0s0-0      159      199      242      239      239      218      201      144
2048, d1s1-0      255      231      263      270      258      249      257      208
2048, d7s7-0      256      234      265      272      260      250      257      208
2048, d7s8-1      260      244      657      473      245      248      257      216
2048, d7s9-2      259      245      657      473      245      248      257      208
2048, d8s7+1      252      239      641      471      246      234      257      208
2048, d8s8-0      256      234      288      290      283      268      257      208
2048, d8s9-1      252      243      649      474      245      248      257      208
2048, d9s7+2      257      239      634      471      246      237      257      209
2048, d9s8+1      257      239      634      471      246      237      257      208
2048, d9s9-0      256      234      275      272      268      253      257      208
2048, d15s15      256      235      274      274      272      254      257      208


--- ok ---

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
2048, d0s0-0      170      250      262      326      327      274      278      162
2048, d1s1-0      419      385      384      389      494      348      796      238
2048, d7s7-0      375      477      350      436      407      357      758      245
2048, d7s8-1      628      682      699      402      250      303      617      224
2048, d7s9-2      598      665      757      420      250      296      621      223
2048, d8s7+1      603      643      702      409      249      257      617      219
2048, d8s8-0      356      345      315      367      386      318      298      228
2048, d8s9-1      609      651      701      399      249      298      591      217
2048, d9s7+2      587      661      690      405      274      244      596      225
2048, d9s8+1      587      660      707      403      249      250      604      225
2048, d9s9-0      356      376      347      405      386      331      667      224
2048, d15s15      356      374      334      398      376      331      680      224


--- ok ---

It simply uses all 8 xmm registers to move a large chunk of 128 bytes per loop iteration ... nothing fancy at all. The algo is also shorter, at 157 bytes, than all but the most basic competitors.
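For reference, the core of the idea is just this - a minimal sketch only, not the posted xmemcpy itself (which also handles short counts, tails and alignment); it assumes esi/edi hold source/dest and ecx holds a byte count that is a multiple of 128:
Code: [Select]
copy128:
    movdqu  xmm0, [esi]             ; load 128 bytes into all 8 xmm regs
    movdqu  xmm1, [esi+10h]
    movdqu  xmm2, [esi+20h]
    movdqu  xmm3, [esi+30h]
    movdqu  xmm4, [esi+40h]
    movdqu  xmm5, [esi+50h]
    movdqu  xmm6, [esi+60h]
    movdqu  xmm7, [esi+70h]
    movdqu  [edi], xmm0             ; then store all 128 bytes
    movdqu  [edi+10h], xmm1
    movdqu  [edi+20h], xmm2
    movdqu  [edi+30h], xmm3
    movdqu  [edi+40h], xmm4
    movdqu  [edi+50h], xmm5
    movdqu  [edi+60h], xmm6
    movdqu  [edi+70h], xmm7
    add     esi, 128
    add     edi, 128
    sub     ecx, 128
    jnz     copy128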

It could be made somewhat faster, but I didn't bother - just wanted to prove the point that this basic idea works. BTW I have found the idea on the net since my last post; google "Mark Larson optimize assembler" for a discussion. Also see FASM's discussion board. I'm surprised the idea is so obscure. Still haven't found the "more advanced" techniques used in my "faster tokenizer" algo - which makes me wonder if there's something wrong with them.

It's written in 32-bit code; I learned my lesson from last time: there's nothing non-standard about it.

Unfortunately the display sometimes might not fit right in your command window, but I didn't change anything in jj's program more than necessary. Also note that the very first number generated (upper left-hand corner) is, I believe, artificially low due to a start-up problem I intend to discuss in more detail elsewhere (my algo beat it anyway).

The zip contains jj's testing program, renamed MemCopyAlgosrrr; and the exe thereof. My algo is in it, called "xmemcpy".

[edit] a more complete version, faster for larger memory copies, is in the zip a few posts below

[edit] an even more complete version is in reply 25, next page. But the one here is a lot simpler (half as many lines) and shows the basic idea: using all 8 xmm registers to copy 128-bit chunks.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 03, 2015, 02:55:16 PM
rrr,

Something we determined a while ago is that REP MOVS/B/D is slow under about 500 bytes. After that, internal special-case circuitry kicks in and it gets a lot faster. Have you had a paddle with different buffer sizes for the copy operation?
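A minimal sketch of that kind of size split (the 500-byte threshold is only the rough figure mentioned above, the labels are hypothetical, and esi/edi/ecx are assumed to hold source, dest and length with the direction flag clear):
Code: [Select]
    cmp     ecx, 500                ; below the rough threshold a plain
    jb      small_copy              ; byte loop avoids the rep startup cost
    mov     edx, ecx
    shr     ecx, 2
    rep     movsd                   ; bulk copy in dwords
    mov     ecx, edx
    and     ecx, 3
    rep     movsb                   ; remaining 0..3 bytes
    jmp     copy_done
small_copy:
    test    ecx, ecx
    jz      copy_done
@@:
    mov     al, [esi]               ; incremented-pointer byte copy
    mov     [edi], al
    add     esi, 1
    add     edi, 1
    sub     ecx, 1
    jnz     @B
copy_done: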
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 03, 2015, 03:51:52 PM
No; I didn't "optimize" it specifically for 2048 (the standard number used in that thread), but I didn't try it in any other range either. As mentioned, I just wanted to demonstrate the feasibility of the idea. But I'll give it a shot in a while and let you know.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 03, 2015, 05:17:16 PM
Ran it at 20480 and 100000. My algo still does all right, and is best at some settings. But at 20480, MemCoP4 and MemCoC2 are ahead overall, with my algo about third. At 100000 it's surprising how neck-and-neck they all are, except MemCo2 and MemCoC3, which are slower.

I'm using all the xmm reg's, all unaligned. So for 20480 I bet I can still beat them all, since the two current winners are using just one or two xmm's, but aligned (either source or dest). So I'll try aligning one or the other also, and (I *think*) the strategy of using all 8 will give me the advantage. At least, it's proved effective in the 2 cases I've tried so far. Tomorrow I'll check it out.

However dunno about the larger numbers (100,000). It's quite amazing how all the different algorithms converge to the same speed up there (with 2 exceptions) - did u know they'd do that? The 2 xmm-aligned algos are only a little better, less than 1%. BTW jj has a comment recommending not to exceed 100,000 with this program; someday I'll check out some bigger numbers with my own test prog.

Well, thanks for the suggestion! Here are the numbers I got:
Code: [Select]
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
20480, d0s0-     2796     2760     2795     2835     2803     2786     2761     2640
20480, d1s1-     2935     2782     2809     2815     2838     2811     3036     4244
20480, d7s7-     2899     2786     2816     2848     2848     2804     3038     4343
20480, d7s8-     9095     9091     5962     5716     3156     3125     9210     3114
20480, d7s9-     9092     9090     5968     5720     3129     3124     9357     3101
20480, d8s7+     3257     3163     5988     5270     3149     3159     3225     3240
20480, d8s8-     2923     2793     2888     2825     2840     2835     3040     4330
20480, d8s9-     9086     9091     5962     5721     3165     3126     9065     3093
20480, d9s7+     3253     3159     5977     5264     3156     3131     3237     3205
20480, d9s8+     3267     3166     5975     5271     3115     3156     3256     3212
20480, d9s9-     2880     2826     2808     2833     2870     2861     3043     4339
20480, d15s1     2912     2822     2867     2820     2835     2844     3029     4269

--- ok ---

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      157
------------------------------------------------------------------------------------
100000, d0s0    12981    12716    12930    12924    12930    12969    12717    12962
100000, d1s1    13047    12732    12978    12963    12992    12907    13614    13905
100000, d7s7    13046    12729    12946    12982    12944    12989    13624    13938
100000, d7s8    13655    13281    24727    19108    13681    13681    13639    13951
100000, d7s9    13641    13254    24695    19110    13702    13671    13599    13857
100000, d8s7    13588    13305    24930    20277    13651    13703    13622    13961
100000, d8s8    13075    12739    12986    12946    12973    12960    13650    13950
100000, d8s9    13619    13281    24846    19103    13681    13689    13648    13937
100000, d9s7    13649    13284    25049    20303    13680    13668    13656    13978
100000, d9s8    13643    13293    25028    20336    13583    13586    13620    13950
100000, d9s9    13036    12728    13005    12963    12962    12961    13636    13943
100000, d15s    13036    12708    12942    12965    12922    12931    13656    13950


--- ok ---
Title: Re: Faster Memcopy ...
Post by: dedndave on March 03, 2015, 06:01:29 PM
At 100000 it's surprising how neck-and-neck they all are...

a sign that you may be hitting hardware limitations
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 03, 2015, 06:25:51 PM
It is certainly fast, and could be an option if you have a special need for repeatedly copying large blocks.

For MasmBasic, I chose rep movsd in the end, both on grounds of speed and convenience: It is a frequently used algo, and I wouldn't like a wrapper for preserving the xmm regs for each and every little memcopy.
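To make that cost concrete, a wrapper preserving the xmm regs around such a copy would look roughly like the hypothetical sketch below (xmemcpy_core, pDest, pSrc and byteCount are invented names for illustration, not MasmBasic code):
Code: [Select]
    sub     esp, 128                ; spill all 8 xmm regs to the stack
    movdqu  [esp],     xmm0
    movdqu  [esp+16],  xmm1
    movdqu  [esp+32],  xmm2
    movdqu  [esp+48],  xmm3
    movdqu  [esp+64],  xmm4
    movdqu  [esp+80],  xmm5
    movdqu  [esp+96],  xmm6
    movdqu  [esp+112], xmm7
    push    byteCount               ; stdcall: callee cleans the args
    push    pSrc
    push    pDest
    call    xmemcpy_core            ; the xmm-hungry copy itself
    movdqu  xmm0, [esp]             ; restore the regs afterwards
    movdqu  xmm1, [esp+16]
    movdqu  xmm2, [esp+32]
    movdqu  xmm3, [esp+48]
    movdqu  xmm4, [esp+64]
    movdqu  xmm5, [esp+80]
    movdqu  xmm6, [esp+96]
    movdqu  xmm7, [esp+112]
    add     esp, 128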
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 03, 2015, 08:02:53 PM
It's been a long time since I did any benchmarking on memcopy, but I found REP MOVS/B/D seemed to have all of the advantages on cached reads, non-cached writes and the optimal prefetch, and I think Intel added the special-case circuitry because it's been so commonly used for so long. On sub-500-byte copies, incremented pointers (mov EAX / mov AL) were faster on small buffers. I tend to look at the rest of the algo that does the memcopy to see what is worth the effort; there are enough times when a clunky, fussy big mover is a PITA and a simple byte copy is faster than the algo calling it.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 04, 2015, 06:17:40 AM
Thanks all for your comments! Here's what I wrote b4 I saw them, and I think it agrees with your general attitude.

This new version has the destination aligned, and it is fastest (vs. every algo at every alignment) at 20483 bytes. On AMD it's particularly clear, but on Intel too. At 100000 it's hard to say what's fastest; a few are very close, and it varies between the two computers; you can look at the numbers. Certainly, my algo is no longer the "winner".

There are some obvious things to do to speed it up at the higher byte counts, but I'm sick of this - it's a pursuit well suited to masochists. First, it's very frustrating to get good timing numbers. Second, a change can be good in one circumstance (byte count, alignment, phase of the moon) but bad in another. It varies between computers, other progs running, etc. Third, the algo (which started out simple) is getting longer and longer, harder to understand/debug/deal with. Fourth, reviewing old optimization pages (like Mark Larson's, a few years old), I noticed that tricks that worked back then are obsolete now! Branch prediction is totally different from what he optimized for. Unaligned moves are pretty much just as fast as aligned. Non-temporal moves are completely useless on my machines. And so forth. So, all this effort will be meaningless soon. Fifth, when u get right down to it, who cares about a few nanoseconds one way or the other?

The zip contains MemCopyAlgosrrr.asm and .exe, same as previous posting.

[edit] fixed 3 small things in zip, now it's cleaner; functionality unchanged, saves a couple nanoseconds, 2015/3/6
Code: [Select]
--- ok ---
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
20483, d0s0-     2821     2750     2804     2834     2841     2834     2777     2717
20483, d1s1-     2919     2784     2859     2831     2816     2863     3045     2737
20483, d7s7-     2885     2794     2805     2787     2787     2829     3029     2678
20483, d7s8-     9093     9095     5935     5696     3139     3096     9212     2830
20483, d7s9-     9092     9092     5940     5700     3139     3136     9361     2887
20483, d8s7+     3244     3146     5951     5249     3099     3089     3223     2931
20483, d8s8-     2919     2783     2795     2821     2796     2785     3035     2683
20483, d8s9-     9087     9089     5935     5700     3145     3092     9067     2893
20483, d9s7+     3258     3148     5949     5249     3086     3109     3245     2868
20483, d9s8+     3247     3158     5946     5243     3108     3142     3231     2890
20483, d9s9-     2920     2787     2843     2848     2862     2866     3067     2721
20483, d15s1     2902     2801     2858     2810     2839     2831     3052     2668


--- ok ---

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
100000, d0s0    13082    12709    12938    12972    12974    12942    12721    12956
100000, d1s1    13099    12702    12994    12958    12967    13022    13664    13020
100000, d7s7    13081    12739    13022    12983    12961    12991    13663    12999
100000, d7s8    13652    13312    24752    19100    13683    13696    13654    13537
100000, d7s9    13652    13267    24698    19113    13629    13704    13652    13624
100000, d8s7    13640    13315    24653    20291    13659    13684    13658    13570
100000, d8s8    13023    12695    12979    12983    12984    12966    13653    12951
100000, d8s9    13599    13287    24610    19086    13583    13645    13642    13647
100000, d9s7    13602    13305    25021    20289    13618    13634    13670    13578
100000, d9s8    13652    13317    24692    20279    13713    13681    13643    13533
100000, d9s9    13035    12718    12954    12993    12989    13064    13636    12974
100000, d15s    13071    12738    12989    13002    12948    13028    13658    12927

--- ok ---


AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
20483, d0s0-     5989     5914     5920     5634     5725     5723     5743     4819
20483, d1s1-     6143     5738     5653     5616     5600     5506     9796     4846
20483, d7s7-     6018     5656     5694     5601     5629     5602     9668     4869
20483, d7s8-    10563    10941     9734     9322     7603     7613     9132     6136
20483, d7s9-    10578    10961     9736     9324     7613     7637     9994     6116
20483, d8s7+    10728    10852     7924     8886     7223     7194    10801     6290
20483, d8s8-     5960     5617     5682     5621     5606     5624     5615     4847
20483, d8s9-    10624    10974     9740     9322     7714     7608    10651     6106
20483, d9s7+    10787    10809     7824     8919     7190     7189     9878     6206
20483, d9s8+    10810    10818     7842     8917     7253     7223     8821     6201
20483, d9s9-     6059     5650     5675     5594     5577     5614     9622     4863
20483, d15s1     5861     5594     5680     5631     5568     5648     9694     4765


--- ok ---

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
100000, d0s0    24435    22509    22496    22415    22002    22048    22183    22259
100000, d1s1    22379    22313    22026    22110    21864    21799    43627    22193
100000, d7s7    22365    22331    22055    22006    21805    22071    43413    22149
100000, d7s8    48009    47571    31597    32397    25654    24978    39742    24342
100000, d7s9    48044    47574    31809    32465    25620    24905    43194    24367
100000, d8s7    49095    49177    31628    31530    25843    24612    49296    25211
100000, d8s8    22408    22264    22047    22044    21643    21890    22259    22186
100000, d8s9    47976    47620    31494    32418    25651    24999    47788    24421
100000, d9s7    49815    49884    31259    31463    25324    24670    44214    25206
100000, d9s8    49755    49189    31743    31498    25266    24655    39641    25244
100000, d9s9    22411    22149    21979    22138    21807    21903    43295    22222
100000, d15s    22365    22310    22133    22023    21826    21728    43418    22051
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 04, 2015, 06:32:41 AM
ps jj2007, I hadn't thought of needing to preserve all those xmm regs, was just considering the problem in isolation from the real world, but you're right. Maybe that's why this idea is so obscure. Still, in special cases, when xmm regs are not being used for anything else, and the speed actually matters, it could be applicable. Also in 64-bit, with 16 xmm's, it makes more sense.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 04, 2015, 09:15:18 AM
Here are the results on my aging i7.


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 rrr314's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      206
------------------------------------------------------------------------------------
20483, d0s0-     3530     3520     3502     3497     3499     3511     3655     3376
20483, d1s1-     3752     3554     3535     3538     3541     3540     3969     3341
20483, d7s7-     3758     3560     3534     3538     3541     3544     3921     3346
20483, d7s8-     8911     8889     7560     7840     4461     4465     9022     3770
20483, d7s9-     8909     8890     7560     7840     4463     4464     9414     3763
20483, d8s7+     4489     4445     7352     7038     4417     4411     4496     3814
20483, d8s8-     3759     3558     3537     3543     3536     3544     3913     3344
20483, d8s9-     8901     8890     7554     7839     4466     4465     8867     3763
20483, d9s7+     4490     4448     7345     7038     4413     4411     4499     3815
20483, d9s8+     4489     4446     7345     7042     4405     4408     4468     3818
20483, d9s9-     3759     3560     3539     3538     3535     3538     3914     3346
20483, d15s1     3581     3559     3530     3528     3526     3535     3951     3311


--- ok ---
Title: Re: Faster Memcopy ...
Post by: KeepingRealBusy on March 04, 2015, 10:23:39 AM
rrr

Mis-alignment can be a problem, especially if both the source and the destination buffers have some mis-alignment, and especially if they have different alignments. The usual case is that one or the other of the buffers can be aligned mod 16. The way to handle these is to move a block of 16 bytes from the start of the buffer and a block of 16 bytes at the end of the buffer with unaligned XMM moves. Then calculate the size left to be moved (data size - 32), round this up to the next mod 16, then start at the first mod-16 byte boundary of the unaligned buffer and move 16 bytes at a time (there will be a small overlap at either end).
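A minimal sketch of that pattern (hypothetical labels; esi = source, edi = destination, ecx = byte count of at least 32, non-overlapping buffers, with the destination being the one that gets aligned):
Code: [Select]
    movdqu  xmm0, [esi]             ; 16 unaligned bytes at the start
    movdqu  xmm1, [esi+ecx-16]      ; 16 unaligned bytes at the end
    movdqu  [edi], xmm0
    movdqu  [edi+ecx-16], xmm1
    mov     edx, edi
    neg     edx
    and     edx, 15                 ; bytes up to the first mod-16 boundary of dest
    add     esi, edx                ; advance both pointers to that boundary
    add     edi, edx
    sub     ecx, edx
    and     ecx, -16                ; whole 16-byte blocks left in the middle
@@:
    sub     ecx, 16
    movdqu  xmm0, [esi+ecx]         ; source may still be unaligned
    movdqa  [edi+ecx], xmm0         ; destination is aligned from here on
    jnz     @B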

Dave.
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 04, 2015, 10:26:58 AM
ps jj2007, I hadn't thought of needing to preserve all those xmm regs, was just considering the problem in isolation from the real world, but you're right. Maybe that's why this idea is so obscure. Still, in special cases, when xmm regs are not being used for anything else, and the speed actually matters, it could be applicable. Also in 64-bit, with 16 xmm's, it makes more sense.

If speed matters, it looks like a really good alternative. As a general purpose algo, it would fail already when looking at the minimum chunk size...
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 04, 2015, 07:19:34 PM
@KeepingRealBusy,

thx for the suggestion. The way I did it is more obvious; u can check it out in the zip, at the beginning of the "xmemcpy" proc. Hard to say which is better; it probably varies with circumstances. The main advantage of my approach is that it fits with the existing algo; I didn't have to change a single line, just added 20, whereas yours would require a total rewrite. It's interesting that these are 2 completely different ways to handle the same seemingly straightforward problem.

@jj2007: just in case of misunderstanding, altho my chunk size is 128, of course the routine handles any smaller number also; for < 128 it uses a single xmm reg (chunk = 16); less than that, it moves byte by byte (chunk = 1 :) It could get more complex and use chunk sizes of 2, 4, 8, and 32, 64; I saw an Agner Fog algo that did that; but it's all, really, of academic interest only, IMHO

@hutch thx for checking it out; give your old machine a rest now, it's earned it ;)
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 04, 2015, 08:15:13 PM
@jj2007: just in case of misunderstanding, altho my chunk size is 128, of course the routine handles any smaller number also

Sorry, had not seen that one :redface:

Code: [Select]
    shr ecx, 7                  ; divide by 128 (chunk size)

    loop128:                    ; do chunks of 128
        jle endloop128
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 05, 2015, 08:00:41 AM
Let me apologize for that sloppy code which aligns the dest block. There are 3 places where instructions are executed unnecessarily. I knew about them when I posted it, but 1) it makes no difference at all to the final results, 2) I didn't think anyone would read it, and 3) I was so tired of timing, timing, timing (to paraphrase Hamlet) that the extra 20 seconds to make the changes - and a couple of minutes to test them - was more than I could stand. If the changes had to be made I would have just chucked the whole thing.

But the original code is fine (I think) and compact enuff that u have to read carefully, as u found out! At one point, a condition code is set, then not checked for another 21 lines (of course it's unchanged by those intervening lines). This gives the processor max time to predict the branch correctly, and makes the code harder to read - which, sometimes, I consider a plus :)
Title: Re: Faster Memcopy ...
Post by: dedndave on March 05, 2015, 10:47:47 AM
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol
you never seem to get both

in the past, i have done something like this:
Code: [Select]
1) test lower bits of address register for desired alignment
   branch if aligned
2) perform operation on a byte (copy or whatever)
3) test lower bits of address again
   branch if aligned
4) perform operation on a byte
and so on, until alignment is achieved (the lower address bits are 0's)

there's no good way to make that pretty   ;)
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 05, 2015, 02:38:53 PM
Don't forget u also need to check whether you're exceeding the length to copy! What if it takes 15 bytes to align, but they only asked u to copy 3? Anyway, the way I did it, I first figured how many bytes were needed to get to alignment, then did that many. keepingrealbusy had a totally different way (see his post). In this case it's not really important how u do it, since once u get there you'll probably copy many thousands of bytes; a few instruction cycles more or less to align really doesn't matter. Interesting to see 3 different ways to do such a seemingly straightforward task; I wonder how many more there are!
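For concreteness, a byte-at-a-time form that folds in both checks might look like this (a hypothetical sketch, bringing edi to a 16-byte boundary while ecx holds the bytes left to copy):
Code: [Select]
align_dest:
    test    edi, 15                 ; destination already on a 16-byte boundary?
    jz      dest_aligned
    test    ecx, ecx                ; don't copy past the requested length
    jz      dest_aligned
    mov     al, [esi]               ; move one byte and test again
    mov     [edi], al
    add     esi, 1
    add     edi, 1
    sub     ecx, 1
    jmp     align_dest
dest_aligned: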
Title: Re: Faster Memcopy ...
Post by: KeepingRealBusy on March 08, 2015, 06:16:36 AM
rrr,

The primary concern is where the data is located and where it has to reside. What I usually run into is that I need to read in a file of data. I allocate the buffers with VirtualAlloc and read in the data unbuffered (for all but the last partial disk sector), then read in that last block buffered and add the size read to the size of the prior blocks (close the file, reopen it buffered, position it the last partial size from the end of the file, then read buffered).

If you need to keep the same alignment with data that is on an unaligned boundary, then use my method. Just allocate a buffer that is the required size rounded up to mod 16, then offset the start of the destination data area to match the unaligned source, unaligned copy the starting 16 bytes and the ending 16 bytes, then copy all of the middle bytes (starting at the first aligned location) using aligned moves.

Dave.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 08, 2015, 08:20:39 AM
That approach sounds good IF you have the freedom to allocate your destination buffer, but that's not the point of this exercise. I'm given both buffers by the user and must write where he specifies. Normally the dest won't have the same alignment as the source, and I have to put up with it. If they happen to match, my method works out very similar to yours (BTW note that movdqu is almost as fast as movdqa these days, on aligned data). But normally they won't both be aligned, so only half the moves can use movdqa; in my case it's the writes; u could do it with the reads instead; it seems to make little difference.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 08, 2015, 09:22:54 AM
> (BTW note that movdqu is almost as fast as movdqa these days, on aligned data)

This is where the action is with later hardware: while 16-byte aligned is slightly faster on some hardware, it's not by much, and it's a lot of messing around to set it up. You will tend to be restricted more by the memory bottleneck than by the instructions that move or copy memory, and I mainly see the gain in later XMM in what it can do with a 128-bit block of data, as there are enough techniques that just grunt memory content around and are fast enough in most contexts. The actual data processing gives the main speed gain, not the memory copy speed.

There are many contexts where I happily use byte level copy as it is unproblematic and often a lot faster than the data processing done with the data.
Title: Re: Faster Memcopy ...
Post by: nidud on March 08, 2015, 10:13:35 AM
My conclusion from the last test was three versions:
 - one for machines without SIMD functions
 - one using SIMD functions
 - and one for AVX++ using movsb.

This version uses SIMD functions and handles overlapping data:
Code: [Select]
memcpy  proc dst, src, count
push esi
push edi
push edx
mov eax,[esp+16]
mov esi,[esp+20]
mov ecx,[esp+24]
movdqu  xmm1,[esi] ; save aligned and tail bytes
movdqu  xmm2,[esi+ecx-16]
test ecx,-32 ; need 32 bytes
jz copy_0_31
mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
mov edx,esi ; get direction of copy
sub edx,eax
cmp edx,ecx
lea edx,[eax+ecx-16]
jbe overlapped ; move left if not overlapping
add esi,edi
sub ecx,edi
add edi,eax
and ecx,dword ptr -16 ; align EIP and ECX 16
align 16
loop_L:
REPEAT  0
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
jz done
ENDM
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
jnz loop_L
align 16
done:
movdqu  [eax],xmm1
movdqu  [edx],xmm2
toend:
pop edx
pop edi
pop esi
ret 12
;----------------------------------------------------------------------
; Overlapping buffers
;----------------------------------------------------------------------
align 16
overlapped:
sub ecx,edi
and ecx,dword ptr -16
add edi,ecx
add esi,edi
add edi,eax
neg ecx
align 16
loop_R:
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
add ecx,16
jnz loop_R
jmp done
;----------------------------------------------------------------------
; Copy 0..31 byte
;----------------------------------------------------------------------
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2_3
test ecx,-8
jz copy_4_7
test ecx,-16
jz copy_8_15
lea edx,[eax+ecx-16]
jmp done
align 4
copy_8_15:
movq xmm0,[esi+ecx-8]
movq [eax],xmm1
movq [eax+ecx-8],xmm0
jmp toend
align 4
copy_4_7:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
align 4
copy_2_3:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
align 4
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

However, this was faster on newer machines:
Code: [Select]
memcpy  proc dst, src, count
push esi
push edi
mov eax,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
mov edi,eax
cmp eax,esi
ja @F
rep movsb
pop edi
pop esi
ret 12
@@:
lea esi,[esi+ecx-1]
lea edi,[edi+ecx-1]
std
rep movsb
cld
pop edi
pop esi
ret 12
memcpy  endp

I added the function copying 128 bytes to the test:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
560692    cycles - (  0) proc_0: ??? crt_memcpy *
1040554   cycles - ( 44) proc_1: AVX -   1 byte *
620459    cycles - (242) proc_2: 386 -   4 byte *
381270    cycles - (245) proc_3: SSE -  16 byte *
1822737   cycles - (164) proc_4: SSE - 128 byte
--------- short: 271
506692    cycles - (  0) proc_0: ??? crt_memcpy *
1101216   cycles - ( 44) proc_1: AVX -   1 byte *
455698    cycles - (242) proc_2: 386 -   4 byte *
210824    cycles - (245) proc_3: SSE -  16 byte *
394439    cycles - (164) proc_4: SSE - 128 byte
--------- short: 2014
931445    cycles - (  0) proc_0: ??? crt_memcpy *
3310250   cycles - ( 44) proc_1: AVX -   1 byte *
1262714   cycles - (242) proc_2: 386 -   4 byte *
448062    cycles - (245) proc_3: SSE -  16 byte *
574491    cycles - (164) proc_4: SSE - 128 byte
--------- aligned: 262159
1314077   cycles - (  0) proc_0: ??? crt_memcpy *
3795774   cycles - ( 44) proc_1: AVX -   1 byte *
1363916   cycles - (242) proc_2: 386 -   4 byte *
992821    cycles - (245) proc_3: SSE -  16 byte *
1061304   cycles - (164) proc_4: SSE - 128 byte
--------- unaligned: 262159
1312767   cycles - (  0) proc_0: ??? crt_memcpy *
3794592   cycles - ( 44) proc_1: AVX -   1 byte *
1352670   cycles - (242) proc_2: 386 -   4 byte *
995327    cycles - (245) proc_3: SSE -  16 byte *
1326086   cycles - (164) proc_4: SSE - 128 byte

result: * memcpy = memmove
  3028304 cycles - proc_3: SSE -  16 byte *
  4625673 cycles - proc_0: ??? crt_memcpy *
  5055457 cycles - proc_2: 386 -   4 byte *
  5179057 cycles - proc_4: SSE - 128 byte
 13042386 cycles - proc_1: AVX -   1 byte *
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 08, 2015, 11:40:47 AM
Hi nidud,

it's too bad you didn't use the version of the 128-byte mover that also works for larger memory blocks (20,000 or so). Originally I just wanted to show the technique for small jobs around 2048, but hutch suggested I try it out at higher numbers; the result was the second version posted in reply #7 (see discussion). So, if I get around to it I'll plug it into your tester and see how it does. My guess is it will come in second, behind proc_3, judging by my previous numbers. Still, I don't know if I can summon the energy - timing algos is very dull work!

As for the short 31 bytes, simple byte copy is much better for such jobs. In fact pencil and paper would probably be better.

I feel obligated to check it out - but I'd promised myself I wouldn't touch timing again for at least a year ... so if u feel like doing it, go for it! The version in question is in Reply #7.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 08, 2015, 12:23:48 PM
> but I'd promised myself I wouldn't touch timing again for at least a year

Don't give up on timing, it's good for you, and as an aside you tend to pop out faster algos if you do it regularly. You get faster at it over time, and with more practice at tuning the timing to the task, the algo will perform a lot better.
Title: Re: Faster Memcopy ...
Post by: nidud on March 09, 2015, 02:04:50 AM
here is the test of the new version:  :t
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
520885   cycles - (  0) proc_0: ??? crt_memcpy *
1043349   cycles - ( 44) proc_1: AVX - 1 byte *
620460   cycles - (242) proc_2: 386 - 4 byte *
380509   cycles - (245) proc_3: SSE -  16 byte *
1902620   cycles - (164) proc_4: SSE - 128 byte
1540727   cycles - (210) proc_5: SSE - 128 byte new
--------- short: 271
506894   cycles - (  0) proc_0: ??? crt_memcpy *
1101241   cycles - ( 44) proc_1: AVX - 1 byte *
455809   cycles - (242) proc_2: 386 - 4 byte *
214328   cycles - (245) proc_3: SSE -  16 byte *
401273   cycles - (164) proc_4: SSE - 128 byte
312955   cycles - (210) proc_5: SSE - 128 byte new
--------- short: 2014
942392   cycles - (  0) proc_0: ??? crt_memcpy *
3310465   cycles - ( 44) proc_1: AVX - 1 byte *
1261069   cycles - (242) proc_2: 386 - 4 byte *
450311   cycles - (245) proc_3: SSE -  16 byte *
575949   cycles - (164) proc_4: SSE - 128 byte
483387   cycles - (210) proc_5: SSE - 128 byte new
--------- aligned: 262159
1311687   cycles - (  0) proc_0: ??? crt_memcpy *
3794303   cycles - ( 44) proc_1: AVX - 1 byte *
1369066   cycles - (242) proc_2: 386 - 4 byte *
992980   cycles - (245) proc_3: SSE -  16 byte *
1020577   cycles - (164) proc_4: SSE - 128 byte
1010543   cycles - (210) proc_5: SSE - 128 byte new
--------- unaligned: 262159
1312919   cycles - (  0) proc_0: ??? crt_memcpy *
3797085   cycles - ( 44) proc_1: AVX - 1 byte *
1355184   cycles - (242) proc_2: 386 - 4 byte *
993036   cycles - (245) proc_3: SSE -  16 byte *
1427699   cycles - (164) proc_4: SSE - 128 byte
1029638   cycles - (210) proc_5: SSE - 128 byte new

result: * memcpy = memmove
  3031164 cycles - proc_3: SSE -  16 byte *
  4377250 cycles - proc_5: SSE - 128 byte new
  4594777 cycles - proc_0: ??? crt_memcpy *
  5061588 cycles - proc_2: 386 -   4 byte *
  5328118 cycles - proc_4: SSE - 128 byte
 13046443 cycles - proc_1: AVX -   1 byte *

However, with some modifications the algo may even be faster  :biggrin:
Code: [Select]
memcpy  proc dst, src, count

push esi
push edi
mov eax,[esp+12] ; keep return value in eax
mov esi,[esp+16]
mov ecx,[esp+20]
;
; skip byte copy of aligned and tail bytes
;
movdqu  xmm0,[esi] ; save aligned bytes
cmp ecx,32 ; need 32 bytes
jb copy_0_31

movdqu  xmm1,[esi+ecx-16] ; copy aligned and tail bytes
movdqu  [eax],xmm0 ;
movdqu  [eax+ecx-16],xmm1
;
; now only chunks of 16 left to copy
;
mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
add esi,edi
sub ecx,edi
add edi,eax

mov edx,ecx
shr ecx,7 ; divide by 128 (chunk size)
jz endloop128
;
; make sure the loop label is aligned 16
;
align 16

loop128: ; do chunks of 128
;
; this generate two extra jumps in the loop
;
; jle endloop128
; dec ecx ; dec ecx long b4 using it
sub ecx,dword ptr 1
movdqu  xmm1,[esi] ; read all 8 xmm's
movdqu  xmm2,[esi+10h]
movdqu  xmm3,[esi+20h]
movdqu  xmm4,[esi+30h]
lea esi, [esi+40h]  ; seems to work best adding here
movdqu  xmm5,[esi]
movdqu  xmm6,[esi+10h]
movdqu  xmm7,[esi+20h]
movdqu  xmm0,[esi+30h]
lea esi, [esi+40h]  ; add to esi long b4 use
;------------------------------------
movdqa  [edi],xmm1 ; write all 8 xmm's
movdqa  [edi+10h],xmm2
movdqa  [edi+20h],xmm3
movdqa  [edi+30h],xmm4
lea edi,[edi+40h]
movdqa  [edi],xmm5
movdqa  [edi+10h],xmm6
movdqa  [edi+20h],xmm7
movdqa  [edi+30h],xmm0
lea edi,[edi+40h]
jg loop128
; jmp loop128

align 16
endloop128:

and edx,127 ; tail bytes in edx
shr edx,4 ; edx has chunks of 16
jz toend

align 8
loop16:
dec edx
movdqu  xmm1,[esi] ; just one xmm could use bigger
lea esi,[esi+10h] ; chunks (32, 48 etc) but why bother
movdqa  [edi],xmm1
lea edi,[edi+10h]
jg loop16
align 4
toend:
pop edi
pop esi
ret 12
;
; Copy 0..31 byte
;
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2_3
test ecx,-8
jz copy_4_7
test ecx,-16
jz copy_8_15
movdqu  xmm1,[esi+ecx-16]
movdqu  [eax],xmm0
movdqu  [eax+ecx-16],xmm1
jmp toend
align 4
copy_8_15:
movq xmm1,[esi+ecx-8]
movq [eax],xmm0
movq [eax+ecx-8],xmm1
jmp toend
align 4
copy_4_7:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
align 4
copy_2_3:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
align 4
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
--------- short: 31
520271    cycles - (  0) proc_0: ??? crt_memcpy *
1040219   cycles - ( 44) proc_1: AVX -   1 byte *
622641    cycles - (242) proc_2: 386 -   4 byte *
380032    cycles - (245) proc_3: SSE -  16 byte *
1821193   cycles - (164) proc_4: SSE - 128 byte
1571898   cycles - (209) proc_5: SSE - 128 byte new
340027    cycles - (314) proc_6: SSE - 128 byte modified
--------- short: 271
506636    cycles - (  0) proc_0: ??? crt_memcpy *
1101511   cycles - ( 44) proc_1: AVX -   1 byte *
455635    cycles - (242) proc_2: 386 -   4 byte *
212339    cycles - (245) proc_3: SSE -  16 byte *
394498    cycles - (164) proc_4: SSE - 128 byte
312829    cycles - (209) proc_5: SSE - 128 byte new
183604    cycles - (314) proc_6: SSE - 128 byte modified
--------- short: 2014
931433    cycles - (  0) proc_0: ??? crt_memcpy *
3312946   cycles - ( 44) proc_1: AVX -   1 byte *
1264095   cycles - (242) proc_2: 386 -   4 byte *
446437    cycles - (245) proc_3: SSE -  16 byte *
574598    cycles - (164) proc_4: SSE - 128 byte
485746    cycles - (209) proc_5: SSE - 128 byte new
419227    cycles - (314) proc_6: SSE - 128 byte modified
--------- aligned: 262159
1308439   cycles - (  0) proc_0: ??? crt_memcpy *
3793440   cycles - ( 44) proc_1: AVX -   1 byte *
1364127   cycles - (242) proc_2: 386 -   4 byte *
991860    cycles - (245) proc_3: SSE -  16 byte *
1044200   cycles - (164) proc_4: SSE - 128 byte
1043795   cycles - (209) proc_5: SSE - 128 byte new
1044381   cycles - (314) proc_6: SSE - 128 byte modified
--------- unaligned: 262159
1311390   cycles - (  0) proc_0: ??? crt_memcpy *
3797150   cycles - ( 44) proc_1: AVX -   1 byte *
1355516   cycles - (242) proc_2: 386 -   4 byte *
993870    cycles - (245) proc_3: SSE -  16 byte *
1389335   cycles - (164) proc_4: SSE - 128 byte
1061247   cycles - (209) proc_5: SSE - 128 byte new
1058086   cycles - (314) proc_6: SSE - 128 byte modified

result: * memcpy = memmove
  3024538 cycles - proc_3: SSE -  16 byte *
  3045325 cycles - proc_6: SSE - 128 byte modified
  4475515 cycles - proc_5: SSE - 128 byte new
  4578169 cycles - proc_0: ??? crt_memcpy *
  5062014 cycles - proc_2: 386 -   4 byte *
  5223824 cycles - proc_4: SSE - 128 byte
 13045266 cycles - proc_1: AVX -   1 byte *
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 09, 2015, 06:18:24 AM
You understand, I was merely demonstrating the efficacy of using 8 xmm reg's and paid no attention at all to other details ... simply slapped it together. However, thanks for the mods. I'll look at it (now you've got me interested) and see if any further improvements are available.

[edit] also thanks for going ahead and testing the "original new" algo; I know what a PITA all this can be ;)
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 09, 2015, 01:24:59 PM
@nidud,
I finally managed to get my memory copy algo faster than yours, but it's a close shave. I used some of your suggestions. Doing the tail that way hadn't occurred to me until keepingrealbusy suggested it. I also used an adaptation of your short-copy handling (similar to Agner Fog's). I wouldn't have bothered without your example, which proved these ideas were better, so thanks. Code is below.

I couldn't get your test bed working right; for one thing the eax return value wouldn't co-operate, also I'm not familiar with that non-standard environment, so I put together this extremely standard masm32 test bed with MichaelW's algos. Hope it will suit. I also included shorter tests and, more importantly, made runs both aligned and misaligned.

Here's a run where my algo is best on every number, but that's unusual - I had to make 6 runs to get it :biggrin: Generally, tho, it seems best overall. Shorter too.

mem2 = nidud's improved version of rrr's first algo (was called mem4)
mem3 = nidud's algo (equals mem3 in his test prog)
mem4 = rrr latest (hopefully last) attempt

Code: [Select]
test.asm: 506 lines, 3 passes, 3688 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(300)

     8        2387            2359            2337

    31        2396            2291            2193

   271        7372            9300            6269

  2014       20854           25887           13538

262159     5229387         5238006         5165092

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(300)

     8        1198            1190            1180

    31        1248            1198            1147

   271        3952            5367            3714

  2014       14134           25988           14074

262159     5266919         5485524         5263508

Code: [Select]
; «««««««««««««««««««««««««««««««««««««««««««««««««««
mem4 proc Dest:PTR BYTE, Source:PTR BYTE, ln:DWORD
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; written by rrr314159 2015 03/08

push esi
push edi
push ebx                            ; ebx is used below, preserve it
mov edi, [Dest]
mov esi, [Source]
mov ecx, [ln]

; if very small get it over with quick
cmp ecx, 16
jge bigger_than_15
test ecx, 8
        jz @f
        movq xmm1, [esi]
        movq xmm0, [esi+ecx-8]
        movq [edi], xmm1
        movq [edi+ecx-8], xmm0
        jmp end_memcopy
@@:
test ecx, 4
        jz @f
        mov eax, [esi]
        mov ebx, [esi+ecx-4]
        mov [edi], eax
        mov [edi+ecx-4], ebx
        jmp end_memcopy
@@:
test ecx, 2
        jz @f
        mov ax, [esi]
        mov [edi], ax
@@:
test ecx, 1
        jz end_memcopy
        mov al, [esi+ecx-1]
        mov [edi+ecx-1], al
        jmp end_memcopy

bigger_than_15:
    movdqu xmm1, [esi]              ; 16 or greater it's safe
    movdqu xmm0, [esi+ecx-16]
    movdqu [edi], xmm1              ; to do head
    movdqu [edi+ecx-16], xmm0       ; and tail here

cmp ecx, 32
jg @f

end_memcopy:                        ; put end close to short copies, quicker
    pop ebx
    pop edi
    pop esi
    ret 12
@@:

; continue for more than 32 bytes - check dest alignment

test edi, 15
jz destisaligned
    mov ebx, edi
    and ebx, -16
    add ebx, 16                     ; ebx gets bytes to align edi (1<= ebx <=15)
    sub ebx, edi                    ; don't need to check < ecx
    sub ecx, ebx                    ; how many bytes will be left to do in ecx
    lea edi, [edi+ebx]              ; align dest
    lea esi, [esi+ebx]              ; make source match
destisaligned:                      ; esi, edi and ecx shld all be right

mov ebx, ecx
and ecx, 127                        ; tail bytes (up to 127) in ecx
shr ebx, 7                          ; divide by 128 (chunk size)
jle endloop128

align 4
loop128:                            ; do chunks of 128
    dec ebx                         ; dec ebx long b4 using it
    movdqu xmm1, [esi]              ; read all 8 xmm's
    movdqu xmm2, [esi+10h]
    movdqu xmm3, [esi+20h]
    movdqu xmm4, [esi+30h]
    lea esi, [esi+40h]              ; seems to work best adding here
   
    movdqu xmm5, [esi]
    movdqu xmm6, [esi+10h]
    movdqu xmm7, [esi+20h]
    movdqu xmm0, [esi+30h]
    lea esi, [esi+40h]              ; add to esi long b4 use
   
    movdqa [edi], xmm1              ; write all 8 xmm's
    movdqa [edi+10h], xmm2
    movdqa [edi+20h], xmm3
    movdqa [edi+30h], xmm4
    lea edi, [edi+40h]
   
    movdqa [edi], xmm5
    movdqa [edi+10h], xmm6
    movdqa [edi+20h], xmm7
    movdqa [edi+30h], xmm0
    lea edi, [edi+40h]
   
    jg loop128
endloop128:

shr ecx, 4                          ; ecx has chunks of 16
jle end_memcopy

loop16:
    dec ecx
    movdqu xmm1, [esi]              ; just one xmm could use bigger
    lea esi, [esi+10h]              ; chunks (32, 48 etc) but why bother
    movdqa [edi], xmm1
    lea edi, [edi+10h]
    jg loop16

jmp end_memcopy

mem4 endp

The zip contains test.asm with all 3 algos, and the exe
Title: Re: Faster Memcopy ...
Post by: KeepingRealBusy on March 10, 2015, 04:37:53 AM
rrr,

Thanks for giving me the tail end credit. Most would grab the code and run.

Dave
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 10, 2015, 05:43:08 AM
This was basically the last time I played with memory copy.

Code: [Select]
    mov esi, pad1
    mov edi, pad2

  ; --------------------------------------
  ; if blen < 16 jump to tail end trimmer
  ; --------------------------------------
    mov eax, blen
    cmp eax, 16
    jae nxt
    mov scnt, eax
    xor edx, edx
    jmp bypass

  nxt:
    shr eax, 4                ; int div by 16
    mov lcnt, eax
    shl eax, 4                ; int mul by 16
    mov ecx, blen
    sub ecx, eax
    mov scnt, ecx

    mov ecx, lcnt
    mov edx, -16

  ; -------------------------------------------------
  ; perform the bulk of the process in 16 byte blocks
  ; -------------------------------------------------
  align 16
  lbl1:
    add edx, 16
    prefetcht1 BYTE PTR [esi+edx]
    movdqa xmm0, XMMWORD PTR [esi+edx]
    movntdq [edi+edx], xmm0

    sub ecx, 1
    jnz lbl1

  ; -------------------------------
  ; if no uneven tail, jump to exit
  ; -------------------------------
    cmp scnt, 0
    je bye

  ; -------------------------------------------
  ; process any uneven tail as a BYTE operation
  ; -------------------------------------------
    add edx, 16

  bypass:
    mov ecx, scnt
    add ecx, 1
  lbl2:
    movzx eax, BYTE PTR [esi+edx]
    mov [edi+edx], al
    add edx, 1
    sub ecx, 1
    jnz lbl2

  bye:

It was slower than the REP MOVSD code.

Code: [Select]
    cld
    mov esi, src
    mov edi, dst
    mov ecx, ln

    shr ecx, 2
    rep movsd

    mov ecx, ln
    and ecx, 3
    rep movsb
Title: Re: Faster Memcopy ...
Post by: nidud on March 10, 2015, 06:03:04 AM
Quote
I couldn't get your test bed working right; for one thing the eax return value wouldn't co-operate, also I'm not familiar with that non-standard environment,

One of the problems with these tests is that the timings are sensitive to code location. When duplicating the same code you may get this result, depending on the offset of the proc:
Code: [Select]
1232712 cycles - 2048..4096  (164) A memcpy SSE 16
877928  cycles - 2048..4096  (164) A memcpy SSE 16
1232112 cycles - 2048..4096  (164) A memcpy SSE 16
878896  cycles - 2048..4096  (164) A memcpy SSE 16
1233148 cycles - 2048..4096  (164) A memcpy SSE 16
879296  cycles - 2048..4096  (164) A memcpy SSE 16

To compensate for this, the algo to be tested is copied to a buffer and executed there:
Code: [Select]
.code

align 16
proc_x  db size_p dup(0)

The algo to test is then handled as data, and in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Code: [Select]
.386
.model flat
.code
push esi
push edi
mov eax,[esp+12]
mov esi,[esp+16]
mov ecx,[esp+20]
mov edi,eax
rep movsb
pop edi
pop esi
ret 12
END

In testit.asm you need to add some text and a number:
Code: [Select]
info_7  db "SSE - 128 byte mem4(rrr)",0
info_8  db "x",0
info_9  db "x",0
info_p  dd info_0,info_1,info_2,info_3,info_4
dd info_5,info_6,info_7,info_8,info_9

procs equ <for x,<7,7,7,7,7,7>> ; add procs to test here..

the result should (in theory) then be more equal:
Code: [Select]
1395909   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1392582   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1396457   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1394577   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1396294   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
1392589   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)

On copying short values I get this:
Code: [Select]
  1864337 cycles - proc_6: SSE - 128 byte modified
  1940998 cycles - proc_7: SSE - 128 byte mem4(rrr)
  1956522 cycles - proc_3: SSE -  16 byte *
  5412838 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)

On copying long values I get this:
Code: [Select]
--------- unaligned: 262159
4217994   cycles - (269) proc_3: SSE -  16 byte *
4295528   cycles - (209) proc_5: SSE - 128 byte xmemcpy(rrr)
4300409   cycles - (314) proc_6: SSE - 128 byte modified
4302474   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)
--------- aligned: 262159
4220315   cycles - (269) proc_3: SSE -  16 byte *
4295777   cycles - (209) proc_5: SSE - 128 byte xmemcpy(rrr)
4299777   cycles - (314) proc_6: SSE - 128 byte modified
4302524   cycles - (298) proc_7: SSE - 128 byte mem4(rrr)

result: * memcpy = memmove
  8438309 cycles - proc_3: SSE -  16 byte *
  8591305 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  8600186 cycles - proc_6: SSE - 128 byte modified
  8604998 cycles - proc_7: SSE - 128 byte mem4(rrr)

Copying larger chunks doesn't seem to impress this CPU, but this may not be the same for others. I tried to insert a 64-byte block copy into the function:

Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

memcpy  proc dst, src, count

mov eax,esp ; save one byte
push esi
push edi
push edx
mov ecx,[eax+12]
mov esi,[eax+8]
mov eax,[eax+4] ; to land at loop_64 aligned..

movdqu  xmm6,[esi] ; save aligned bytes
cmp ecx,32 ; need 32 bytes
jb copy_0_31

mov edi,eax
neg edi
and edi,16-1 ; align EDI 16
movdqu  xmm7,[esi+ecx-16] ; save tail bytes

mov edx,esi ; get direction of copy
sub edx,eax
cmp edx,ecx
mov edx,ecx
jbe overlapped ; move left if not overlapping

add esi,edi
sub ecx,edi
add edi,eax
and ecx,-16
cmp ecx,64
jb loop_16

align 16
loop_64:
movdqu  xmm0,[esi+ecx-10h] ; copy 64 byte
movdqu  xmm1,[esi+ecx-20h]
movdqu  xmm2,[esi+ecx-30h]
movdqu  xmm3,[esi+ecx-40h]
movdqa  [edi+ecx-10h],xmm0
movdqa  [edi+ecx-20h],xmm1
movdqa  [edi+ecx-30h],xmm2
movdqa  [edi+ecx-40h],xmm3
sub ecx,dword ptr 64
je done
test ecx,dword ptr -64
jnz loop_64
align 16
loop_16:
sub ecx,16
movdqu  xmm0,[esi+ecx] ; copy 16 byte
movdqa  [edi+ecx],xmm0
jnz loop_16
align 16
done:
movdqu  [eax],xmm6
movdqu  [eax+edx-16],xmm7
pop edx
pop edi
pop esi
ret 12
;----------------------------------------------------------------------
; Overlapping buffers
;----------------------------------------------------------------------
align 16
overlapped:
sub ecx,edi
and ecx,dword ptr -16
add edi,ecx
add esi,edi
add edi,eax
neg ecx
align 16
loop_R:
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
add ecx,16
jnz loop_R
jmp done
;----------------------------------------------------------------------
; Copy 0..31 byte
;----------------------------------------------------------------------
align 4
copy_0_31:
test ecx,ecx
jz toend
test ecx,-2
jz copy_1
test ecx,-4
jz copy_2
test ecx,-8
jz copy_4
test ecx,-16
jz copy_8
movdqu  xmm1,[esi+ecx-16]
movdqu  [eax],xmm6
movdqu  [eax+ecx-16],xmm1
toend:
pop edx
pop edi
pop esi
ret 12
copy_8:
movq xmm1,[esi+ecx-8]
movq [eax],xmm6
movq [eax+ecx-8],xmm1
jmp toend
copy_4:
mov edi,[esi]
mov esi,[esi+ecx-4]
mov [eax],edi
mov [eax+ecx-4],esi
jmp toend
copy_2:
mov di,[esi]
mov si,[esi+ecx-2]
mov [eax+ecx-2],si
mov [eax],di
jmp toend
copy_1:
mov cl,[esi]
mov [eax],cl
jmp toend
memcpy  endp

end

  8362882 cycles - proc_4: SSE -  64 byte *
  8441756 cycles - proc_3: SSE -  16 byte *
  8592957 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  8601138 cycles - proc_6: SSE - 128 byte modified
  8601695 cycles - proc_7: SSE - 128 byte mem4(rrr)

testing a combination of all lengths gives this result:
Code: [Select]
11418933 cycles - proc_4: SSE -  64 byte *
 11641683 cycles - proc_3: SSE -  16 byte *
 11962870 cycles - proc_6: SSE - 128 byte modified
 12057471 cycles - proc_7: SSE - 128 byte mem4(rrr)
 15605422 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)

These results depend on CPU and OS used. On newer hardware, as demonstrated here (http://masm32.com/board/index.php?topic=3396.msg36369#msg36369), a simple MOVSB-copy will be the fastest.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 10, 2015, 05:03:59 PM
Quote from: dedndave
from my experience, code that aligns for a loop can be either "pretty" or "fast" - lol

- when I first saw that a couple days ago I thought my approach was both pretty and fast, but I forebore to disagree: I was taught not to argue with my elders :biggrin: But now I see you're right, it wasn't all that fast; and since it's not the fastest, it's not pretty either

Quote from: keepingrealbusy
Thanks for giving me the tail end credit. Most would grab the code and run.
I have a stock answer when I'm told that. You can't blame those ("most") people: if they didn't grab credit for others' work, they'd never get credit for anything

Quote from: hutch
It was slower than the REP MOVSD code.
I've got to check out REP MOVS* instructions. Fact is I've never used them! When I came here in January I was still using 1980 era techniques (Z80); I've learned a bunch of modern stuff, like macros and SIMD, but to me REP MOVSD is still "new-fangled"

Quote from: nidud
One of the problems with these tests is that the timings are sensitive to code location. ... The algo to test is then handled as data, and in this case as a .BIN file created by JWASM -bin ?.asm. To add a function to the test you simply add a new file.
Sure, that seems like an excellent approach.

I switched the order of the algos in my test bed and may have experienced the phenomenon you're talking about: on one category, ALIGNED 2014, the two algos pretty much reversed their timings, as though it ONLY depended on location! Other numbers weren't affected, although the numbers are not very stable anyway, so it's hard to say. So, I simply copy-pasted a dozen runs before the final three, for which I report the numbers (still with my algo in what may be the "worst" position, first). It takes longer but it's a lot more stable. Here are the first 4 (and only) runs I made:

Code: [Select]
  mem2 = rrr/nidud modified; mem3 = nidud; mem4 = rrr latest

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2969            2759            2566

    31        2706            2651            2559

   271        6891            8023            6884

  2014       15540           39733           15315

262159     5258012         5218352         5255524

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1441            1350            1386

    31        1402            1346            1499

   271        3655            5698            3788

  2014       13863           26579           14449

262159     5318643         5385829         5317655
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3656 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2688            2730            2538

    31        1452            1377            1279

   271        3468            4053            3338

  2014       13658           25588           13532

262159     5259190         5255524         5259200

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1465            1510            1352

    31        1388            1327            1286

   271        3577            5337            3571

  2014       13922           25698           14171

262159     5370304         5417275         5319313
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2689            2730            2539

    31        1428            1437            1429

   271        3441            4143            3521

  2014       13636           25594           13482

262159     5211287         5219878         5276594

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1409            1378            1335

    31        1449            1387            1281

   271        3603            5482            3889

  2014       13870           25605           14170

262159     5372493         5509407         5361170
Press any key to continue ...

C:\Users\r\Desktop\nidud2\test_memcopy_algos change order>..\doj32 test

***********
ASCII build
***********

test.asm: 531 lines, 3 passes, 3672 ms, 0 warnings, 0 errors

BUFFERS ALIGNED

thecount        mem2(314)       mem3(245)       mem4(294)


     8        2709            2754            2550

    31        1439            1407            1353

   271        3529            4226            3449

  2014       14126           25590           13531

262159     5209791         5210645         5211581

BUFFERS MISALIGNED src 11 dest 7

thecount        mem2(314)       mem3(245)       mem4(294)


     8        1416            1385            1358

    31        1397            1338            1291

   271        3483            5454            3570

  2014       13910           25713           14117

262159     5315510         5381937         5314749
Press any key to continue ...

My latest is still ahead (27 out of 40) but the differences are pretty negligible. Not surprising since the two algos are now using, with 4 or 5 exceptions, the same techniques. The only significant conclusion is: both are definitely better than your original mem3 in the middle range (271 and 2014). BTW this reminds me of a comment someone made in an old post: seems everybody's algo is best when run on their own machines. If anyone wants to check my results, the code is posted above, but reverse the order of the algos and copy/paste a dozen copies or so; may make a little difference.

Apparently your machine doesn't do 128 (XMM) right; many older machines do it as two separate 64's, completely negating any advantage. That's why your 64-block worked so well; I'm glad u checked that out. I think the basic point is demonstrated: that larger copy blocks (the largest a given machine supports efficiently) do give a bit more speed. BTW my machine has a similar prob with AVX 256; does it as 2 separate 128's; so it's actually slower for this task (a little better for some others, though).
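For anyone wondering what a 128-byte block buys you: the inner loop just moves 8 XMM registers per pass, roughly like this (sketch only - the posted algos handle alignment and the remainder differently):
Code: [Select]
; 128 bytes per iteration, sketch only; ESI = src, EDI = 16-aligned dest,
; ECX = count already rounded down to a multiple of 128
loop_128:
        movdqu  xmm0,[esi]
        movdqu  xmm1,[esi+16]
        movdqu  xmm2,[esi+32]
        movdqu  xmm3,[esi+48]
        movdqu  xmm4,[esi+64]
        movdqu  xmm5,[esi+80]
        movdqu  xmm6,[esi+96]
        movdqu  xmm7,[esi+112]
        movdqa  [edi],xmm0
        movdqa  [edi+16],xmm1
        movdqa  [edi+32],xmm2
        movdqa  [edi+48],xmm3
        movdqa  [edi+64],xmm4
        movdqa  [edi+80],xmm5
        movdqa  [edi+96],xmm6
        movdqa  [edi+112],xmm7
        add     esi,128
        add     edi,128
        sub     ecx,128
        jnz     loop_128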

... But none of it matters if hutch, and the ref you mention later, are right: that MOVS* is better on newer hardware.

Well I reckon we've beaten memory-moving to death! I'm working on some much more interesting fast-algos, hope to post something in a couple days. Thanks for the exercise, and various tips!
Title: Re: Faster Memcopy ...
Post by: nidud on March 11, 2015, 12:55:50 AM
Code: [Select]
1) test lower bits of address register for desired alignment
   branch if aligned

consider this:
Code: [Select]
opcode    size   AMD  Intel
JMP        [2]  30.5    8.1
JNZ        [2]  30.1   10.7  ; jump
JZ         [2]   1.0    8.0  ; no jump

jumping around costs, in the worst case, about 30 cycles

the jump situation:
Code: [Select]
mov edi,eax ; 1.0
test edi, 15 ; 1.0
jz aligned ; 1.0/30.0
neg edi ; 2.5
and edi,16-1; 2.5
add esi,edi ; 2.5
sub ecx,edi ; 2.5
add edi,eax ; 2.5
unaligned: ; 15.0
aligned: ; 32.0

the no jump situation:
Code: [Select]
mov edi,eax ; 1.0
neg edi ; 2.5
and edi,16-1; 2.5
add esi,edi ; 2.5
sub ecx,edi ; 2.5
add edi,eax ; 2.5
justdoit: ; 13.5
Title: Re: Faster Memcopy ...
Post by: dedndave on March 11, 2015, 02:37:18 AM
while that's true, to some extent

you can't avoid checking cases to get alignment   :P
that means some form of conditional branching

if you play with it, i think you'll find that JNZ is faster for reverse branches and JZ is faster for forward branches   :t
you can sometimes design your code to take advantage

you might also try a jump table - haven't tried that, but it might be faster
Title: Re: Faster Memcopy ...
Post by: nidud on March 11, 2015, 03:37:13 AM
you can't avoid checking cases to get alignment   :P
that means some form of conditional branching

I think I just did  :lol:

If you take care of the aligned start and tail bytes, which is only 4 moves, it doesn't matter if the pointer is aligned or not. If it is aligned, EDI is 0 and nothing changes. However, you need a count of at least the alignment size to do this, so one conditional branch is needed.
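Schematically, the idea looks like this (just a sketch to illustrate - the real code differs in detail, and it assumes a count of at least 16):
Code: [Select]
; ESI = src, EAX = dest, ECX = count (>= 16); sketch only
        movdqu  xmm0,[esi]              ; head, unaligned
        movdqu  xmm1,[esi+ecx-16]       ; tail, unaligned
        movdqu  [eax],xmm0
        movdqu  [eax+ecx-16],xmm1
        mov     edi,eax
        neg     edi
        and     edi,16-1                ; 0 if dest is already aligned
        add     esi,edi
        sub     ecx,edi
        add     edi,eax                 ; EDI = first aligned dest address
        and     ecx,-16                 ; full 16-byte blocks left
        jz      done                    ; head/tail already covered the rest
align 16
copy_loop:
        sub     ecx,16
        movdqu  xmm0,[esi+ecx]
        movdqa  [edi+ecx],xmm0          ; aligned store
        jnz     copy_loop
done: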

Quote
if you play with it, i think you'll find that JNZ is faster for reverse branches and JZ is faster for forward branches   :t
you can sometimes design your code to take advantage

Haven't thought about that, but that seems to be the case:  :P
Code: [Select]
align 16
loop_L:
sub ecx,16
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
jnz loop_L
...
neg ecx
align 16
loop_R:
movdqu  xmm0,[esi+ecx]
movdqa  [edi+ecx],xmm0
add ecx,16
jnz loop_R

Quote
you might also try a jump table - haven't tried that, but it might be faster

it’s painfully slow - crt_memcpy uses that
but then again, no need to jump around  :biggrin:
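For the record, the jump-table style of tail copy looks roughly like this (sketch only - the CRT version is more elaborate):
Code: [Select]
.data
tail_tab dd tail_0, tail_1, tail_2, tail_3     ; one entry per leftover count
.code
        and     ecx,3                          ; 0..3 bytes remaining
        jmp     dword ptr tail_tab[ecx*4]      ; indirect jump - hard to predict
tail_3: mov     al,[esi+2]
        mov     [edi+2],al
tail_2: mov     al,[esi+1]
        mov     [edi+1],al
tail_1: mov     al,[esi]
        mov     [edi],al
tail_0: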
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 11, 2015, 06:11:08 AM
Hi nidud,

This is a very interesting investigation, unfortunately I don't have time to get into it at the moment, but in a couple days I'd really like to do so. The following comments are (deliberately) w/o reference to anything but my organic memory subsystem, ...

First - since we're getting down to details - do a global search and replace on my algo, change "ebx" to "edx". Of course, has no effect on timing.

Now, your jump timings are right in effect - that is, the conclusion drawn is right: it's better to skip that branch. I noticed u did that but didn't bother to change my code, thinking it was meaningless; but seems it's more significant. I say "in effect" because in actuality, jmps take far fewer cycles than the numbers you give: 13.5 and, uh, (alrigght, I'll take a look ...) 30.1, 10.7 etc. Yet, those are reasonable Expected Values because of branch prediction. The reason for avoiding jumps is not that they're inherently that long, rather if branch is incorrectly predicted, they cost many cycles while pipeline is cleared and reloaded. So u can't come up with exact numbers w/o many more details. It's not only different for different companies (as u mention, AMD and Intel): it's different for the same processor, depending on recent history - even the very first time the branch is encountered - because other nearby branches can affect prediction for this branch. Thus a detailed analysis is inherently probabilistic. As I say, the numbers u give look like good EVs (presumably u got them from some ref that did such analysis, or a performance analyzer?) and they give the correct conclusion: the branch shouldn't be there.

I only glanced thru your algo (mem3), but here's what I noticed. There was a clever substitute for the initial pushes which saved - either a byte or a cycle, I forget which. The comparison tests were different; mine, I felt, were more intuitive. Other small details that amount either to nothing or not much; a couple slightly better techniques which I didn't bother to borrow (like the push substitute, next time I'll use them); a couple places I think my technique a little better. Half the algo I didn't read but presume there's nothing 2 important. But, I removed almost all alignments (from your mod'ed version of my algo); my tests showed they were basically insignificant (maybe a cycle or two) and I wanted to reduce the code size, enhancing cache fit probability by a half percent if that.

My point is this. The differences between the two algos is minimal (to my knowledge); but now you're showing that your branch elimination saves quite a few cycles, more than I had supposed. So why, then, is my algo - more accurately, my version of the same algo, since these diffs are virtually just a matter of style - why is it faster? (a small, but significant, amount?) Since u, apparently, have gone thru them with a fine tooth comb, presumably u must know. There's something about mem4, versus mem2, that's saving more than just a cycle or two - must be quite noticeable - so, what is it?

I'm not kidding about a couple days: happy to investigate this further with you when I have more time.
Title: Re: Faster Memcopy ...
Post by: nidud on March 12, 2015, 06:20:59 AM
Quote
First - since we're getting down to details - do a global search and replace on my algo, change "ebx" to "edx". Of course, has no effect on timing.

I notice that EBX pops up randomly in your code but EDX is seldom used  :biggrin:

Quote
Now, your jump timings are right in effect - that is, the conclusion
drawn is right: it's better to skip that branch. I noticed u did that but didn't bother to change my code, thinking it was meaningless; but seems it's more significant. I say "in effect" because in actuality, jmps take far fewer cycles than the numbers you give: 13.5 and, uh, (alrigght, I'll take a look ...) 30.1, 10.7 etc. Yet, those are reasonable Expected Values because of branch prediction. The reason for avoiding jumps is not that they're inherently that long, rather if branch is incorrectly predicted, they cost many cycles while pipeline is cleared and reloaded. So u can't come up with exact numbers w/o many more details. It's not only different for different companies (as u mention, AMD and Intel): it's different for the same processor, depending on recent history - even the very first time the branch is encountered - because other nearby branches can affect prediction for this branch. Thus a detailed analysis is inherently probabilistic. As I say, the numbers u give look like good EVs (presumably u got them from some ref that did such analysis, or a performance analyzer?) and they give the correct conclusion: the branch shouldn't be there.

True. The jump estimate is a worst-case value and depends on many things, as you say. The timings are based on a short jump of 2 bytes and calibrated against a MOV instruction (1 cycle). Using the current test bed, adjusting the loop count to illustrate this:
Code: [Select]
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (AVX)
----------------------------------------------
    96030 cycles - proc_1: test eax,eax
   100605 cycles - proc_0: mov eax,1
   164682 cycles - proc_9: movdqu xmm0,[eax]
   271729 cycles - proc_8: dec eax
   272095 cycles - proc_7: neg eax
   285086 cycles - proc_5: jz  @F no jump
   570121 cycles - proc_6: jnz @F +7
   570438 cycles - proc_4: jmp @F +5
   574778 cycles - proc_3: jmp @F +3
   593173 cycles - proc_2: jmp @F +2

AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
   100283 cycles - proc_1: test eax,eax
   100287 cycles - proc_0: mov eax,1
   154779 cycles - proc_5: jz  @F no jump
   176163 cycles - proc_9: movdqu xmm0,[eax]
   271020 cycles - proc_8: dec eax
   274000 cycles - proc_7: neg eax
   569227 cycles - proc_6: jnz @F +7
   928449 cycles - proc_4: jmp @F +5
  2570589 cycles - proc_3: jmp @F +3
  2941084 cycles - proc_2: jmp @F +2

The AMD/Intel timing will then be:
Code: [Select]
jz  <>  ;  1.6  2.9
jnz +7  ;  5.7  5.7
jmp +5  ;  9.3  5.7
jmp +3  ; 25.7  5.7
jmp +2  ; 29.4  5.9

AMD have a short test (1.6) in this case but a long jump (25.7):
Code: [Select]
.if eax
    inc ecx
.endif ; 1.6+2.7/25.7 (AMD) cycles
...
.if eax
    add ecx,1
.endif ; 1.6+2.7/ 9.3 (AMD) cycles

Quote
I only glanced thru your algo (mem3), but here's what I noticed. There was a clever substitute for the initial pushes which saved - either a byte or a cycle, I forget which.

This uses EAX instead of ESP to save one byte. This has to do with code size and alignment, which is important for speed, though many systems are less picky in this regard. However, most new machines (and almost all Intel CPUs) work best with aligned loops. The ideal format would look something like this:
Code: [Select]
00000000:
; init code in para
00000040:
; copy loop
ret
align 16
00000080:
; handle short lengths and overlapped data
ret

There is also "real" and "fake" alignment in this case. The latter adds cycles in the form of filler code like NOP, LEA EAX,[EAX], LEA ESP,[ESP] and so on, used to pad the code out to a specific offset. So it is better to keep the alignment as real as possible.

In the first algo I needed to extend the code, and used DWORD PTR -16 to add 3 bytes:
Code: [Select]
00000038  03F8 add edi,eax
0000003A  81E1F0FFFFFF and ecx,dword ptr -16 ; align EIP and ECX 16
00000040 align 16
00000040 loop_L:

if 3 bytes are added above, this happens:
Code: [Select]
0000003B  03F8 add edi,eax
0000003D  81E1F0FFFFFF and ecx,dword ptr -16 ; align EIP and ECX 16
00000043  8DA424000000008D80 align 16
00000050 loop_L:

you may then compensate by removing the DWORD PTR extension:
Code: [Select]
0000003B  03F8 add edi,eax
0000003D  83E1F0 and ecx,-16 ; align EIP and ECX 16
00000040 align 16
00000040 loop_L:

Quote
The comparison tests were different; mine, I felt, were more intuitive. Other small details that amount either to nothing or not much; a couple slightly better techniques which I didn't bother to borrow (like the push substitute, next time I'll use them); a couple places I think my technique a little better. Half the algo I didn't read but presume there's nothing 2 important. But, I removed almost all alignments (from your mod'ed version of my algo); my tests showed they were basically insignificant (maybe a cycle or two) and I wanted to reduce the code size, enhancing cache fit probability by a half percent if that.

If you think about it, the code above (and below) the loop label is peanuts compared to what happens in the main loop, so that's the main priority. The init and exit code will only hit you when multiple calls are made with small lengths.
Code: [Select]
--------- un/aligned: 262159

Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (AVX)
  1778875 cycles - proc_7: SSE - 128 byte mem2(rrr)
  1782810 cycles - proc_8: SSE - 128 byte *
  1783226 cycles - proc_6: SSE - 128 byte modified
  1787929 cycles - proc_3: SSE -  16 byte *
  1789860 cycles - proc_4: SSE -  64 byte *
  1793160 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  1875893 cycles - proc_0: ??? crt_memcpy *
  3195546 cycles - proc_2: 386 -   4 byte *
 10433134 cycles - proc_1: AVX -   1 byte *

AMD Athlon(tm) II X2 245 Processor (SSE3)
  2562789 cycles - proc_8: SSE - 128 byte *
  2573278 cycles - proc_4: SSE -  64 byte *
  2647475 cycles - proc_3: SSE -  16 byte *
  2721439 cycles - proc_6: SSE - 128 byte modified
  2732438 cycles - proc_7: SSE - 128 byte mem2(rrr)
  2749617 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
  3495427 cycles - proc_0: ??? crt_memcpy *
  3610008 cycles - proc_2: 386 -   4 byte *
 10119189 cycles - proc_1: AVX -   1 byte *

Quote
My point is this. The differences between the two algos is minimal (to my knowledge); but now you're showing that your branch elimination saves quite a few cycles, more than I had supposed. So why, then, is my algo - more accurately, my version of the same algo, since these diffs are virtually just a matter of style - why is it faster? (a small, but significant, amount?) Since u, apparently, have gone thru them with a fine tooth comb, presumably u must know. There's something about mem4, versus mem2, that's saving more than just a cycle or two - must be quite noticeable - so, what is it?

It’s difficult to say given they seem not so different in these tests.

Code: [Select]
--------- short: 2014

Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (AVX)
   141012 cycles - proc_8: SSE - 128 byte *
   141209 cycles - proc_4: SSE -  64 byte *
   165921 cycles - proc_6: SSE - 128 byte modified
   166576 cycles - proc_7: SSE - 128 byte mem2(rrr)
   194728 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
   278250 cycles - proc_3: SSE -  16 byte *
   480782 cycles - proc_0: ??? crt_memcpy *
   780798 cycles - proc_2: 386 -   4 byte *
  2246478 cycles - proc_1: AVX -   1 byte *

AMD Athlon(tm) II X2 245 Processor (SSE3)
   212251 cycles - proc_4: SSE -  64 byte *
   228560 cycles - proc_8: SSE - 128 byte *
   251544 cycles - proc_7: SSE - 128 byte mem2(rrr)
   253457 cycles - proc_6: SSE - 128 byte modified
   267874 cycles - proc_3: SSE -  16 byte *
   293842 cycles - proc_5: SSE - 128 byte xmemcpy(rrr)
   559291 cycles - proc_0: ??? crt_memcpy *
   756845 cycles - proc_2: 386 -   4 byte *
  1987521 cycles - proc_1: AVX -   1 byte *

This is the copy algo for 0..31 bytes:
Code: [Select]
--------- short: 1..15

Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (AVX)
  3889571 cycles - proc_5: SSE -  16 byte *
  4850792 cycles - proc_2: 386 -   4 byte *
  4887667 cycles - proc_4: SSE - 128 byte mem2(rrr)
  5897787 cycles - proc_0: ??? crt_memcpy *
  8191556 cycles - proc_3: SSE - 128 byte xmemcpy(rrr)
 21950067 cycles - proc_1: AVX -   1 byte *

AMD Athlon(tm) II X2 245 Processor (SSE3)
  4158723 cycles - proc_5: SSE -  16 byte *
  4572560 cycles - proc_4: SSE - 128 byte mem2(rrr)
  5055317 cycles - proc_2: 386 -   4 byte *
  6617426 cycles - proc_0: ??? crt_memcpy *
  9515503 cycles - proc_1: AVX -   1 byte *
 10004988 cycles - proc_3: SSE - 128 byte xmemcpy(rrr)

I think the copying of the short lengths is a bit faster, but that code is only used if the size < 32. The main difference at larger sizes is the lack of head and tail-byte copying, which is done in 4 moves.

Code: [Select]
movdqu  xmm0,[esi]
movdqu  xmm1,[esi+ecx-16]
movdqu  [eax],xmm0
movdqu  [eax+ecx-16],xmm1

I included the tests in an archive if you're interested. I mostly use command-line tools and makefiles for the tests, so I included some binaries.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 12, 2015, 08:24:13 AM
@nidud, Thanks for providing your test software, in future I'll use some of those (better) techniques.

But I think I've answered my own question, why my algo does a bit better when timed my way instead of yours. I'm using the standard approach: not loading .bin algos or FlushInstructionCache 'ing. (These are good things to do, thanks for the tip) But doing it the usual way, when the dest is aligned the branch (around the aligning instructions) is hit many thousands of times, and *always* taken. So, the processor predicts the branch perfectly. That jump costs almost nothing, couple cycles (not sure exactly). Whereas you're executing six instructions there (13.5 cycles). This matches my results: my algo's a bit ahead on ALIGNED data, but a bit behind on non-aligned, since not taking the jump costs a little. Conclusion:

- rrr's rule for not-very-careful timing test beds: Conditional branches (that depend in a simple way on the inputs) are almost free, so use as many as u want!

Reminds me of something hutch said, "I have always been an advocate of real time testing ...". In real life the environments will be very different, and variable, you don't know what you'll run into, so:

- rrr's timing statistical rule of thumb: ignore differences smaller than about 5%.

I think you should update your scoring routine to allow ties!

A more important thought re. branch prediction occurred to me. When a utility routine like memory-copy is called in real life, there's a decent chance the branch to it wasn't predicted right, so possibly the pipeline is going to be empty (other factors may cause the same thing). So, always put a (necessary) branch right at the beginning if possible, where a misprediction might be almost free.
 
Quote from: nidud
However, most new machines (and almost all Intel CPUs) work best with aligned loops.

I'm sure you're right, but I saw approximately no timing difference with "align x". At least in one important case, it turned out the loop was (accidentally) already aligned. But even when not, no real difference: dunno why.

Quote from: nidud
If you think about it, the code above (and below) the loop label is peanuts compare to what happens in the main loop, so that’s the main priority. The init and exit code will only hit you when multiple calls are made with small lengths.

Your first sentence is true when called with large blocks. But your second sentence explains why the code above (or below) the loop label *can be* important; your user may indeed be using it for lots of small lengths. That's why, if one can save a cycle or two (for instance, by where the "toend" code is placed), it should be saved in favor of the small lengths rather than the long ones.

Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P
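If you want to see it happen, commit two pages, protect the second one and put a short source buffer right at the end of the first - something like this (my own quick test sketch, not code from this thread):
Code: [Select]
.data?
dwOld   dd ?
.code
        invoke  VirtualAlloc, 0, 8192, MEM_COMMIT, PAGE_READWRITE
        mov     esi,eax                      ; two committed pages
        lea     edx,[eax+4096]
        invoke  VirtualProtect, edx, 4096, PAGE_NOACCESS, addr dwOld
        lea     esi,[esi+4096-8]             ; 8 readable bytes, guard page right after
        ; call the copy routine here with a count <= 15:
        ; a 16-byte read from ESI now touches the PAGE_NOACCESS page and faults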

c u later,
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 12, 2015, 08:45:05 AM
I have also learnt a few lessons on code alignment. Regularly, while bashing the guts out of an algo to get its pace up, I have tried aligning labels that get hit hard and seen the timing take a lot longer; much the same with procedure alignment - align at 4 bytes and it works OK, change it to 8 or 16 bytes and it often slows right down. Then it's worth shifting the location of the algo around in the exe, to make sure you are not deluding yourself, and seeing what the results are.

I have tried spacing algos from each other with padding from under 1k to over 1 megabyte, but that was more or less useless, so I usually settle on a delay of about 500 ms between each algo and set the priority class to high, but not the one higher. As hardware gets smarter, timings wander even further.  :biggrin:
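In practice that boils down to something like this around each run (just a sketch - every test bed does it a bit differently):
Code: [Select]
        invoke  GetCurrentProcess
        invoke  SetPriorityClass, eax, HIGH_PRIORITY_CLASS  ; high, but not realtime
        ; for each algo:
        invoke  Sleep, 500                                  ; ~500 ms settling delay
        ; ... run and time the algo here ...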
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 12, 2015, 09:07:07 AM
Quote from: hutch
As hardware gets smarter, timings wander even further.

Very true, meanwhile I'm getting slower ... but, I'm reminded of something hutch once said,

Quote from: hutch
Don't give up on timing, it's good for you, and as an aside you tend to pop faster algos if you do it regularly. You get faster at it over time, and with more practice you tune the timing to the task so the algo will perform a lot better.

Also true! One good thing, it gives the lie to the C programmer's favorite saying, "the compiler does it better than you". Unless the compiler writers are down in the trenches timing away - as opposed to just believing Intel/AMD/Agner, as I think they do - they can't match a real MASMologist
Title: Re: Faster Memcopy ...
Post by: nidud on March 12, 2015, 09:22:35 AM
Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P

 :biggrin:

I was thinking the opposite, since the first version also copies the last bytes.
Code: [Select]
movdqu  xmm2,[esi]
movdqu  xmm3,[esi+ecx-16]
I moved this below the jump not thinking ahead  :P

well, good catch  :t
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 02:39:37 AM
Finally, something you should know: if your routines are called to copy <=15 bytes, when PAGE_NOACCESS protected memory happens to be just beyond the source buffer, they'll blow up :P

The read-ahead technique seems to be widely used. Here's the Intel version of STRLEN:

Code: [Select]
        add     eax,dword ptr 0         ; 5 byte nop to align label below
        align   16                      ; should be redundant
main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes

This version then reads up to 3 bytes ahead, the 64-bit version up to 7, and the SIMD version up to 15 bytes. The version I'm currently using reads ahead 32 bytes:
Code: [Select]
@@:
movdqu  xmm1,[ecx+16] ; compare 32 byte
movdqu  xmm0,[ecx]
add ecx,32
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb eax,xmm1
shl eax,16
or eax,edx ; using 32 bit
jz @B

The only safe way will then be a byte scan. In the MEM functions it will be possible to avoid this issue given you have the actual size of the buffer, but in the STR functions you need to scan.

The benefit of using it:
Code: [Select]
2415815   cycles - (  0) 0: crt_strlen
 851439   cycles - ( 58) 5: SSE

 330516   cycles - (  0) 0: crt_strchr
  43257   cycles - ( 85) 3: SSE

1112057   cycles - (  0) 0: crt_strcat
 353820   cycles - ( 90) 3: SSE

 400591   cycles - (  0) 0: crt_strstr
  41779   cycles - (125) 3: SSE

There is however a potential problem with this method.
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 19, 2015, 03:22:09 AM
Looks extremely familiar (http://www.masmforum.com/board/index.php?topic=18728.msg159628#msg159628) 8)
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 05:26:14 AM
Yes, there you go.

I do control memory allocation where these functions are used, and there is a 16-byte header at the end of the heap. Reading 32 bytes may however overlap.

Stack and Data shouldn't be a problem in this case, but then again, the source may be the clipboard or other externals so you don't really know.
Title: Re: Faster Memcopy ...
Post by: Antariy on March 19, 2015, 06:21:32 AM
You may safely read/write buffers in larger than byte-by-byte fashion.

There is no difference in the "MEM" or "STR" functions - just imagine your "MEM" function received wrong bytecount to scan/copy/etc - and if the buffer has actually the size smaller than it was specified with bytecount param, and the buffer is lying near to the end of the memory page - the "MEM" code will crash as well as the "STR" code.

When you scan memory you may do this safely with larger than byte-sized reads/writes; you just have to pay attention to how memory is allocated under x86 page-addressed protected mode (Windows, Linux, Mac and any other protected mode operating system on x86).
Memory has a granularity of one page - 4096 bytes "at least" - so you may always assume that if you are given some address inside a memory page, the full page is available in memory for read/write (depending on protection).
I.e., if you are given the address 23987234h - just for instance - you may take the base address of the page within which this address is located - just clear the 12 lowest bits of the address (AND EAX,-4096) - and you have the page address, in this example 23987000h. Now you know the start of the page, and you know that the page size is 4096 bytes, so you also know how far the buffer extends AT LEAST: 23987000h + 1000h = 23988000h. Within this range (from the given start of the buffer, 23987234h, up to 23988000h) you may freely read the data in any sized chunks - just make sure your reads do not go beyond that. I.e., you may freely start a single-reg XMM read (16 bytes) at 23987ff0h, but you obviously may not start the same read at 23987ff1h - because the final address of the data grabbed is 23987ff1h + 10h = 23988001h, which is beyond the page range, and you cannot be sure the next page is reserved/committed in memory.
Just pay attention to which addresses you read/write and, depending on the size of the data grabbing, you have all the info needed to avoid reading beyond the buffer into inaccessible memory.
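In code the check is trivial - something like this (just an illustration of the arithmetic above; use_smaller_reads is a placeholder label):
Code: [Select]
; EAX = some address inside a committed page
        mov     edx,eax
        and     edx,4096-1          ; offset of EAX within its page
        neg     edx
        add     edx,4096            ; EDX = bytes left in the page from EAX on
        cmp     edx,16
        jb      use_smaller_reads   ; a 16-byte read at EAX would cross the page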
It is simpler to align the pointers to the power of two - then you may freely read and assume that the code will not crash - and it will not crash, as long as you check every read for a match to the algo's purpose.

As an example - just imagine that you use a byte-by-byte StrLen reading algo and an SSE algo which uses a single-reg XMM read. If you pass a properly formatted zero-terminated string to the algos - both algos will work correctly if they check the data just read for the finish condition (the zero byte) before they read the next bytes. If both algos are given a wrongly formatted string, without a zero terminator until the very end of the page - and no accessible area after that - both algos will crash. No difference between the algos.
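Reduced to its core, the safe pattern is simply: check what you just read before you read the next chunk (sketch only; the pointer in EAX is assumed already 16-aligned and XMM7 zeroed):
Code: [Select]
@@:
        movdqa  xmm0,[eax]          ; aligned 16-byte read - cannot cross a page
        pcmpeqb xmm0,xmm7           ; compare with zero
        pmovmskb edx,xmm0
        add     eax,16
        test    edx,edx
        jz      @B                  ; only fetch the next 16 bytes after this check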

What is wrong with that example of SSE code which does reading with two XMM regs: that code does TWO readings and only ONE check after both readings. So the code may actually read the proper string end (the zero byte) in the first of the two reads, but it will still, paying no attention to that, try to read the next 16 bytes - even though the properly zero-terminated string already ended in the previous 16 bytes (the first XMM read). So, reading a PROPER zero-terminated string at a PROPER possible location near the end of the buffer, with no accessible memory beyond that, this code will crash - that is why it is IMPROPER.

We had a thread about an XMM StrLen algo on the old forum; the main posters were I, Jochen and Lingo AFAIK, and Lingo provided an algo of exactly that kind - it did two XMM reads, then combined the byte masks into one 32-bit GPR reg, and only after that verified the result. Of course that code was buggy, and it is funny and a shame (taking into account his megalomania) that Lingo doesn't know such simple and basic things as buffer/memory layouts. The loud cries about "the algo is faster instead" are not excusable - the algo should be reliable first, and only then MAY one think of making it faster. So Jochen decided to use a one-reading XMM algo in his lib, and that was the simplest and most reliable solution. It is probably possible to do a complex solution which reads more than one reg - but it would have to check every reg before it has the right to read the next reg from memory, so that complex code would probably not only not be faster than the one-reg-reading code, it would be much slower. So Lingo's code was just trash from the point of view of software production - a real fancy thing but not a real working solution. Well, after so long a time I have described my point of view on the "Lingo's coding style" and the real "value" of his over-valued "style", in proper English now :greensml:
Title: Re: Faster Memcopy ...
Post by: Antariy on March 19, 2015, 06:41:04 AM
The Intel's "read ahead" is not actually a real read ahead - just show us full code and it is for sure that they do verify of the finish condition (zero byte) before they read next data which may be beyond page range. The full sense of read-ahead is only when the code does read only with not-dword-aligned strings, but taking in attention that the code might probably do the alignment (which is not obvious from the small piece of code provided) and/or that that every "advanced" compiler does string alignment on the dword boundary - this is not a dangeround "read-aheading" at all - it has nothing common with the algo which does two (or more) reads before it checks the condition of termination. If the algo with use of any technique is still in the range of taking notice to be in the page range, the algo will not crash. All other differences - are the details of implementation but not the idea itself.
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 09:15:28 AM
There is no difference in the "MEM" or "STR" functions - just imagine your "MEM" function received wrong bytecount to scan/copy/etc

None of these (or any other) functions are safe from that point of view.

The only valid test is then the case of a zero at the very end of the buffer. This means the pointer must be aligned at the top by testing each byte. The rest of the scan may then safely read aligned, chunk-sized blocks.
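The top of such a scan would then be something like this (sketch only; found_zero and aligned_scan are just placeholder labels):
Code: [Select]
; ECX = string pointer; test byte by byte until 16-aligned, then switch to chunks
align_top:
        test    ecx,16-1
        jz      aligned_scan        ; pointer is now 16-aligned
        cmp     byte ptr [ecx],0
        je      found_zero
        inc     ecx
        jmp     align_top
aligned_scan:
        ; from here on, aligned 16/32-byte reads stay inside the page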

Quote
As an example - just imagine that you use a byte-by-byte StrLen reading algo and an SSE algo which uses a single-reg XMM read. If you pass a properly formatted zero-terminated string to the algos - both algos will work correctly if they check the data just read for the finish condition (the zero byte) before they read the next bytes.

Given the pointer points to the last byte, the size of the buffer is 1 byte, and in this case you read 16 bytes.

Quote
What is wrong with that example of SSE code which does reading with two XMM regs:

The same thing that is wrong with the 4, 8, and 16 byte overlapping reads. If the pointer is not aligned before the scan it will fail regardless of size (16/32).
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 09:43:57 AM
The Intel's "read ahead" is not actually a real read ahead - just show us full code and it is for sure that they do verify of the finish condition (zero byte) before they read next data which may be beyond page range.

This is true. The Intel code does a byte-wise alignment test before using DWORDs.
http://masm32.com/board/index.php?topic=1872.msg19450#msg19450 (http://masm32.com/board/index.php?topic=1872.msg19450#msg19450)

Quote
Read-ahead in the full sense is only when the code reads not-dword-aligned strings

The same also applies to larger alignment like 8, 16, 32, ...

The problem in this case is:
- no alignment before read
- aligning using size > 1

This fails:
Code: [Select]
mov ecx,[esp+4]
mov eax,[ecx]
and ecx,-4
add ecx,4
push edx
lea edx,[eax-01010101h]
not eax
and eax,edx
and eax,80808080h
jz @F
bsf eax,eax
shr eax,3
pop edx
ret 4
@@:
mov eax,[ecx]

This works:
Code: [Select]
;
; Agner Fog is moving the pointer back to read the first 1..4 bytes:
;
mov     ecx, pBuf    ; get pointer to string
mov     eax, ecx    ; copy pointer
and     ecx, 3    ; lower 2 bits of address, check alignment
jz     L2    ; string is aligned by 4. Go to loop
and     eax, -4    ; align pointer by 4
shl     ecx, 3    ; mul by 8 = displacement in bits
mov     edx, -1
shl     edx, cl    ; make byte mask
not     edx    ; mask = 0FFH for false bytes
or     edx, [eax]    ; read from nearest preceding boundary
; check first four bytes for zero
;-----------------------------------
lea     ecx, [edx-01010101H]   ; subtract 1 from each byte
not     edx    ; invert all bytes
and     ecx, edx    ; and these two
and     ecx, 80808080H    ; test all sign bits
jnz     L3    ; zero-byte found
; Main loop, read 4 bytes aligned
;-----------------------------------
L1: add     eax, 4    ; increment pointer by 4
L2: mov     edx, [eax]    ; read 4 bytes of string
Title: Re: Faster Memcopy ...
Post by: Antariy on March 19, 2015, 09:45:35 AM
Just re-read what was written: if you have the address of the buffer, you have all that is required to check whether the buffer is near the end of the page. What you do after that is your decision - of course you may try to read a whole heap of data even if you see that the buffer you were given is in the last byte of the page, or you may handle this case with the appropriate read size - QWORD/DWORD/WORD/BYTE - depending on how close to the end of the page your buffer is.
In the case of that StrLen code which Jochen used in his lib, I just aligned the input buffer address to 16 (and take notice that I also mentioned that in my first post here - that you should align the reading pointer to a power of two), aligned it "backwards" (just AND with -16, so a 16-byte read from there can never cross the end of the page) and thus did the first read not ahead but "backward" instead, and after that just shifted the "backward" tail out of the masking reg. So that SSE code MAY NOT FAIL with any zero-terminated string placed anywhere in memory.
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 19, 2015, 11:14:51 AM
In the case of that StrLen code which Jochen used in his lib, I just aligned the input buffer address to 16 (and take notice that I also mentioned that in my first post here - that you should align the reading pointer to a power of two), aligned it "backwards" (just AND with -16, so a 16-byte read from there can never cross the end of the page) and thus did the first read not ahead but "backward" instead, and after that just shifted the "backward" tail out of the masking reg. So that SSE code MAY NOT FAIL with any zero-terminated string placed anywhere in memory.

This is correct, but to be fair, the example of the 4096 byte buffer posted below works only with SEH:

include \masm32\MasmBasic\MasmBasic.inc      ; download (http://www.masm32.com/board/index.php?topic=12460)
  Init tc                  ; install the SEH handler

  FileWrite "test.txt", String$(4096, "x")      ; prepare a nasty example
  xchg rv(filesize, Chr$("test.txt")), ebx
  xchg rv(VirtualAlloc, 0, ebx, MEM_COMMIT, PAGE_READWRITE), edi
  push rv(_lopen, Chr$("test.txt"), OF_READ)
  invoke _lread, eax, edi, ebx      ; exactly 4096 bytes in a VirtualAlloc'ed buffer
  call _lclose
  Print Str$("%i bytes read from file\n", ebx)      ; this file is exactly one page long
  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK

  PrintLine "Now trying crt_strlen:"      ; once you see this, Ctrl C is your only change...
  invoke crt_strlen, edi                  ; ... because the CRT fails miserably, it just hangs
  Inkey Str$("crt_strlen says the len of string is %i bytes\n", eax)      ; you will not see this message
  Exit
  TryCatch (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1217)End            ; activate the handler - must be last instruction before "end start"
end start

Yes it does return 4096 bytes, while the CRT just hangs, but... we must go at least one byte beyond the limits, otherwise how would the algo know that byte 4096+1 is a zero byte...?

Still, Len() remains usable, while the CRT just hangs. You can try inserting an exception handler also for the CRT:
  Try
  invoke crt_strlen, edi      ; great, it doesn't hang any more...!
  Catch

That works in the sense that your app is no longer blocked, but look at the result :P

Interestingly enough, Masm32 len() fails silently:
  Try
  Print Str$("Masm32 len() says the length of the string is %i bytes\n", len(edi))
  Catch


And of course, all these examples work correctly if you replace the 4096 with 4095.
Title: Re: Faster Memcopy ...
Post by: nidud on March 19, 2015, 09:43:31 PM
And of course, all these examples work correctly if you replace the 4096 with 4095.

Are you sure?
Code: [Select]
mov byte ptr [edi + 4096 - 1],0
add edi,15

Well, this version should read 32 bytes just as safely as 1.
Code: [Select]
OPTION PROLOGUE:NONE, EPILOGUE:NONE

strlen  proc string
mov ecx,[esp+4]
xorps xmm2,xmm2
mov eax,ecx
and ecx,32-1
and eax,-32
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
or edx,-1
shl edx,cl
pcmpeqb xmm0,xmm2
pmovmskb ecx,xmm0
and ecx,edx
jnz done
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1
shl ecx,16
and ecx,edx
jnz done
loop_32:
lea eax,[eax+32]
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb ecx,xmm1
shl ecx,16
or ecx,edx
jz loop_32
done:
bsf ecx,ecx
lea eax,[ecx+eax]
sub eax,[esp+4]
ret 4
strlen  endp
Title: Re: Faster Memcopy ...
Post by: Antariy on March 19, 2015, 10:30:34 PM
Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory:

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
;mov edx,[esp+4]
;mov eax,[esp+4]
db 8bh,94h,24h,04,0,0,0
db 8bh,84h,24h,04,0,0,0
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
;and edx,-10h
db 81h,0e2h,0F0h,0FFh,0FFh,0FFh
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
jz @F
pcmpeqb xmm7,[edx]
add edx,10h
pmovmskb eax,xmm7
shr eax,cl
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16
pcmpeqb xmm7,[edx]
add edx,16
pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
lea eax,[eax+edx-10h]

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Jochen,
your "example" is WRONG. You're passing NOT PROPER string to the algo - and you should not wait for proper results. Your string is exactly unterminated - it's an infinite size from the point of view of the algorithm logic, and the results you are returning (4096) - ARE WRONG, take that or not - that is for your decision. The algo should fail there (crash) or it should return the flag which OBVIOUSLY marks that the string is incorrectly formatted - for an instance you should return -1 as mark of the string ininity. Otherwice using your StrLen you will never know that the string is not proper string - preperly terminated - with the real size of 4096 bytes - and your code will use that improper string further with full assuming that it's proper, but it is WRONG "ZERO TERMINATED" STRING. And any other components of the program yours or not yours (for example Windows), will crash or misuse that improper string, for an instance long after you did "checked" it, so after that you will long search why it crashes - because at the point where you were required to flag the improper string, you just silently returned "the fine" and "seems ok" result.
The code should never "silently substitute" the breaks in the program logic (improper zero terminated string is that stuff) with the "result which seems OK" - but your strlen does exactly this.

nidud,
your code still has the same problem - for some reason you check only the first 32 bytes of the string, but for the rest you assume it is safe to do two reads with only one check - the same bug is there. Your code is now partially "safe" only with strings smaller than 32 bytes - any other string will crash your code if it is placed in such a manner that its zero terminator is closer to the page end than 32 bytes - i.e. if the string terminates 1...31 bytes from the page end.
 
My first post here, written in almost proper English and very descriptive, has all the info needed to implement a fully safe algo without any assumptions or silent substitutions/hiding of errors. Sometimes I agree with the topic paster on the thought that the more descriptive and full the information one tries to provide, the less respect and careful attention one's descriptions/writings get.
Title: Re: Faster Memcopy ...
Post by: nidud on March 20, 2015, 12:11:59 AM
nidud,
your code still has the same problem - for some reason you check only the first 32 bytes of the string, but for the rest you assume it is safe to do two reads with only one check - the same bug is there. Your code is now partially "safe" only with strings smaller than 32 bytes - any other string will crash your code if it is placed in such a manner that its zero terminator is closer to the page end than 32 bytes - i.e. if the string terminates 1...31 bytes from the page end.

The code won’t fail   :P

To put your mind at ease, this tests the last 32 bytes
Code: [Select]
    .if VirtualAlloc(0,4096,MEM_COMMIT,PAGE_READWRITE)
mov ebx,eax
mov edi,eax
mov ecx,4096
mov eax,'x'
rep stosb
mov edi,4096-15
add ebx,15
.repeat
dec edi
mov byte ptr [ebx+edi],0
push ebx
call esi
mov esp,ebp
.until edi == 4096 - 33 - 15
sub ebx,15
invoke  VirtualFree,ebx,0,MEM_RELEASE
    .endif

The timings for the new algo:
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
-- string length 0..40
844627    cycles - (  0) proc_0: crt_strlen
1120048   cycles - (  0) proc_1: len()
832447    cycles - (  0) proc_2: Len()
497990    cycles - (104) proc_5: SSE 32 - safe
620398    cycles - (104) proc_6: AxStrLenSSE
-- string length 40..80
937328    cycles - (  0) proc_0: crt_strlen
1108642   cycles - (  0) proc_1: len()
542569    cycles - (  0) proc_2: Len()
340045    cycles - (104) proc_5: SSE 32 - safe
474865    cycles - (104) proc_6: AxStrLenSSE
-- string length 600..1000
510733    cycles - (  0) proc_0: crt_strlen
761988    cycles - (  0) proc_1: len()
171665    cycles - (  0) proc_2: Len()
111466    cycles - (104) proc_5: SSE 32 - safe
162919    cycles - (104) proc_6: AxStrLenSSE

result:
   949501 cycles - (104) proc_5: SSE 32 - safe
  1258182 cycles - (104) proc_6: AxStrLenSSE
  1546681 cycles - (  0) proc_2: Len()
  2292688 cycles - (  0) proc_0: crt_strlen
  2990678 cycles - (  0) proc_1: len()
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 20, 2015, 01:47:04 AM
Jochen,
your "example" is WRONG. You're passing NOT PROPER string to the algo - and you should not wait for proper results.

Alex,

My example is certainly problematic, and I agree that such an "improperly delimited" string should not be passed to a Windows API that expects a zero-delimited string; which, for example, is not the case for crt_lcpyn:

  Let esi=Space$(4097)
  invoke lstrcpyn, esi, edi, Len(edi)
  Print "[", esi, "]"


OTOH, the string is perfectly valid in a MasmBasic macro such as Let or Print (which only want to know how many bytes must be copied):

  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK
  Let esi="The string: ["+edi+"]"
  PrintLine esi
  PrintLine "[", edi, "]"


Nonetheless, I do not encourage SEH (that is what Init tc does) to enable problematic examples like this.
Title: Re: Faster Memcopy ...
Post by: Antariy on March 20, 2015, 06:01:37 AM
The code won’t fail   :P

To put your mind at ease, this tests the last 32 bytes


You may leave your sarcasm for yourself :P

Yes, it will not fail, because you did aligned data fetch to 32 bytes - but when I was first replying to your code sample, you provided only the inner loop logic, without other code, which shows that you do align to 32. The second time I paid attention to the way you do the first check, but did not notice that you align the buffer fetch to 32 - that's why I said it would fail: it would indeed fail fetching 32 bytes of data if the fetch were aligned to only 16 bytes. But, as already said, I just did not notice the proper alignment when I replied the second time, and the first time there was no full code from your side.
But, again, there is nothing wrong with what I was saying: in my first post in the thread I described the proper way to implement the algo; for instance, this sentence:
Quote
It is simpler to align the pointers to the power of two - then you may freely read and assume that the code will not crash - and it will not crash, as long as you check every read for a match to the algo's purpose.

You just did it right (aligned to the power of two equal to the size of the data fetch - the same as I was writing, too: that you should align and use the proper data size - QWORD/DWORD/WORD/BYTE etc. - you just did that with a "DXMMWORD" (double XMM word) - so nothing wrong with my explanation).

The source you provided is not recompilable, probably set of your custom macroses is missed there, so I did not run the test itself, but for sure it will work.
Just a thing: which code did you used in the test? Or you do load of the code from the .bin files? Just haven't found for an instance your code which you posted earlier there (which aligns to 32) in the disassembly listing ::)
Title: Re: Faster Memcopy ...
Post by: Antariy on March 20, 2015, 06:45:08 AM
Jochen,

My example is certainly problematic, and I agree that such an "improperly delimited" string should not be passed to a Windows API that expects a zero-delimited string; which, for example, is not the case for crt_lcpyn:

  Let esi=Space$(4097)
  invoke lstrcpyn, esi, edi, Len(edi)
  Print "[", esi, "]"


Well, lstrcpyn is more like the memcpy function with one additional stop factor - the zero byte - because it has a length specifier. So it's not really a valid example. Moreover, for some reason the lstrXXXn functions break the zero-terminated logic itself: they cut the string at some unexpected place, so in some cases you might receive a partial string, which is not a proper result, as you of course understand. These functions act more like "hiders" of errors - the code actually copied just a part of the string, but the proggie continues to use that string thinking it's the proper, full string with all the information it must hold. Of course, a truncated string does not contain the full info, so it may be treated as trash, not data. IF the string was just some message etc., it may be "acceptable" to display a truncated message, but if the original string which was copied to the smaller buffer and got truncated contained some important data (a file name, a file system path, the name of some object, an encoded form of other data (for instance a short base64 message), etc.) - then this case is obviously unacceptable, because you get not the data but trash.
In your example you might substitute the lstrcpyn with memcopy + zero termination, for instance, since the work of handling the exception and determining the "length" of the string was done by MasmBasic anyway.
But that does not actually change things: your code got an improper string, and the code does not know for what reason the string was improper - it was not terminated and continued until the end of the page. The code should flag this, instead of just continuing to work as if the string were properly terminated. The rule of programming is that it's better to have a program that does not work than one that works improperly. So it's better for the code to detect that it got an improper zero-terminated string and flag it in some way (to the user, or just take another path/return, etc.). Of course, this applies to code which expects ZERO TERMINATED strings - if an algo which should receive zero-terminated strings receives an unterminated string, it should crash or flag it. But algos which do not require zero-terminated strings to work, and instead rely on "buffer address" and "known size" params, may work the same whether there is a zero terminator or not - simply because by design they do not require zero-terminated strings, but rather binary strings.
Your MB Len algo is proper for the MB strings, but in a case you pass a string which is known to be zero terminated - you should not accept silently those claimed as "zero terminated" but not terminated actually strings, that was the point I tried to describe.

Anyway, the problem with accepting improper zero-terminated strings in algos which do not flag that and reject further processing is that, if you continue, you might keep receiving more and more such strings - and SEH is very greedy in terms of timing, so every time you receive such a string you pay thousands of cycles in stalls just to continue work after a SEH catch.
And, of course, if you pass that string, which is not actually terminated, to any other code - it may crash there. Another thing you may do is copy that improper string to a proper MB string and pass that to the algos but, again, every improper string copy will cost thousands of cycles.

OTOH, the string is perfectly valid in a MasmBasic macro such as Let or Print (which only want to know how many bytes must be copied):

  Print Str$("MasmBasic says the length of the string is %i bytes\n", Len(edi))      ; Len() WORKS OK
  Let esi="The string: ["+edi+"]"
  PrintLine esi
  PrintLine "[", edi, "]"


Nonetheless, I do not encourage SEH (that is what Init tc does) to enable problematic examples like this.

BTW, what does the acronym OTOH mean?

Yes, I did not say that MASMBASIC strings are wrong if they are not terminated with zero; I was talking about ZERO TERMINATED strings. If an MB algo receives a zero-terminated string which is in fact not terminated, then it should obviously report to you in some way that the string is not terminated... for instance, you may return the "seems ok" string length in EAX, but -1 in EDX if the string was not terminated and zero in EDX if it was a proper zero-terminated string.

Does the installed SEH cover all the algos in your library? I.e., do the Let and Len algos not crash with that improper zero-terminated string only if you install the SEH at Init? If you don't, will both Len and Let crash with that string? How did you implement SEH coverage in your lib - as an UnhandledExceptionFilter or as per-thread SEH frames (like in my signature)? The UnhandledExceptionFilter does not provide a way to handle exceptions in multi-threaded applications, because the "safe address" which you use to restore execution might be overwritten by another thread entering some function which sets that "safe address" up with its own restoration address.
Title: Re: Faster Memcopy ...
Post by: nidud on March 20, 2015, 07:17:22 AM
You may leave your sarcasm for yourself :P

 :biggrin:

Quote
Yes, it will not fail, because you did aligned data fetch to 32 bytes - but when I was first replying to your code sample, you provided only the inner loop logic, without other code, which shows that you do align to 32.

That code did actually fail. All these versions are included in the strlen.zip file (#50). I converted a string library some time ago, and that read-ahead code is used in many of these functions.

Quote
The source you provided is not recompilable, probably set of your custom macroses is missed there, so I did not run the test itself, but for sure it will work. Just a thing: which code did you used in the test? Or you do load of the code from the .bin files?

I uploaded the test-bed (benchmarks.zip (http://masm32.com/board/index.php?action=dlattach;topic=4067.0;attach=3869)) used here (#34) (http://masm32.com/board/index.php?topic=4067.msg43117#msg43117).
Title: Re: Faster Memcopy ...
Post by: Antariy on March 20, 2015, 07:31:40 AM
You may leave your sarcasm for yourself :P

 :biggrin:

:t

Quote
Yes, it will not fail, because you did aligned data fetch to 32 bytes - but when I was first replying to your code sample, you provided only the inner loop logic, without other code, which shows that you do align to 32.

That code did actually fail. All these versions are included in the strlen.zip file (#50). I converted a string library some time ago, and that read-ahead code is used in many of these functions.

Ah, so at least one of the pieces discussed failed. You mean this algo, probably?
Code: [Select]
.686
.model flat, stdcall
.code
.xmm

OPTION PROLOGUE:NONE, EPILOGUE:NONE

strlen  proc string
mov ecx,[esp+4] ; 4
xorps xmm2,xmm2 ; 3
push edx ; 1
@@:
movdqu  xmm1,[ecx+16]
movdqu  xmm0,[ecx]
add ecx,32
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb eax,xmm1
shl eax,16
or eax,edx
jz @B
pop edx
bsf eax,eax
lea eax,[ecx+eax-32]
sub eax,[esp+4]
ret 4
strlen  endp

END

It's in 4.asm, and the only algo which reads unaligned so it crashed near end of the page even if it did reads by 16 byte chunks, but it more over does two reads of 32 bytes. So first time you posted two-reading piece of this code?

Quote
Quote
The source you provided is not recompilable, probably set of your custom macroses is missed there, so I did not run the test itself, but for sure it will work. Just a thing: which code did you used in the test? Or you do load of the code from the .bin files?

I uploaded the test-bed (benchmarks.zip (http://masm32.com/board/index.php?action=dlattach;topic=4067.0;attach=3869)) used here (#34) (http://masm32.com/board/index.php?topic=4067.msg43117#msg43117).

Ah, got it earlier but did not got that it was full set of the files. the dz.exe is your file manager DosZip?
But, the other question: do you load algos from binary files? Just did not found your safe 32 bytes algo in the listing of the timeit.exe.
Title: Re: Faster Memcopy ...
Post by: nidud on March 20, 2015, 08:22:02 AM
Ah, so at least one of the pieces discussed failed. You mean this algo, probably?

Yes, there are many like this around (http://masm32.com/board/index.php?topic=3396.msg36254#msg36254), and here are some more (http://www.alfredklomp.com/programming/sse-strings/)  :lol:

Quote
It's in 4.asm, and the only algo which reads unaligned so it crashed near end of the page even if it did reads by 16 byte chunks, but it more over does two reads of 32 bytes.

The same also with 4 and 8.

Quote
So was that the first time you posted the two-read piece of this code?
Think I posted it somewhere but I'm not sure.

Quote
Ah, I got it earlier but did not realize it was the full set of files. Is dz.exe your file manager DosZip?
Yes, plus assembler and make to simplify the testing. If on the same drive as masm32 it should work out of the box.

Quote
But the other question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.

Yes, for reasons explained here (http://masm32.com/board/index.php?topic=4067.msg43020#msg43020).
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 20, 2015, 02:14:53 PM
BTW, what does the acronym OTOH mean?
on the other hand

Quote
Does the installed SEH cover all the algos in your library? I.e., do the Let and Len algos avoid crashing on an improperly zero-terminated string only if the SEH is installed at Init?

If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 20, 2015, 09:15:14 PM
Hello Antariy, nice to meet you,

There is no difference between the "MEM" and "STR" functions - just imagine your "MEM" function received a wrong bytecount to scan/copy/etc.: if the buffer is actually smaller than the bytecount param specifies, and the buffer lies near the end of a memory page, the "MEM" code will crash just like the "STR" code.

- If the user makes an incorrect request (wrong bytecount) then perhaps the algo should just crash. If he's careless enough to do that he probably won't check error code return, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps, let it crash so he knows the error. That's how my memcopy behaves.

But, that's a matter of opinion. The important point is, if user gives correct bytecount, (or, correctly terminated C-style zero-STR), then you must not crash by reading / writing outside the requested buffers. Even though reading 128 bytes at a time, my memcopy never goes beyond user-specified bounds.

It is simpler to align the pointers to the power of two...

- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called "chunk".)
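
(To pin the idea down, a minimal sketch - not code from any algo in this thread; CHUNK is a placeholder constant:)

Code: [Select]
CHUNK equ 16                ; chunk size, must be a power of two (16, 32, ...)
    mov ecx, eax            ; eax = original string/buffer pointer
    and eax, -CHUNK         ; round the pointer DOWN to a CHUNK boundary
    and ecx, CHUNK-1        ; ecx = misalignment = bytes to ignore in the first chunk
    ; the first aligned read covers [eax .. eax+CHUNK-1]; the low ecx bytes of its
    ; result get masked off because they lie before the real start of the data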

The loud cries of "but the algo is faster" are no excuse - the algo should be reliable first, and only then MAY one think about making it faster.

- Of course! Surely Lingo agrees?

Sometimes I agree with the topic paster that the more descriptive and complete the information one tries to provide, the less respect and careful attention one's writing receives.

- Very annoying, go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. Couldn't find anywhere on the net to copy it from, so I had to write it myself :P

Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory: ...

- Right, but it can be 18 bytes shorter, and a little cleaner:

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4]                   ; don't get eax from [esp+4]
        mov eax, edx                     ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0           ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0           ; these two db instructions
add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx
and edx,-10h                      ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F                             ; no jump save 2 bytes
pcmpeqb xmm7,[edx]
; add edx,10h                       ; don't add here save 3 bytes
pmovmskb eax,xmm7
shr eax,cl                         ; works fine if cl is 0
bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16                ; still aligned but 16 bytes less
add edx,16                         ; switch order of these
pcmpeqb xmm7,[edx]                 ; two instructions
pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
        add eax, edx                    ;  don't need -10h any more, save 2 bytes

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Here it is without all the extra comments:

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings

    mov edx,[esp+4]
    mov eax, edx
    add esp,-14h
    movups [esp],xmm7
    mov [esp+16],ecx
    and edx,-10h
    mov ecx,eax
    pxor xmm7,xmm7
    and ecx,0fh
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    shr eax,cl
    bsf eax,eax
    jnz @ret
   
    pxor xmm7,xmm7
    @@:                                ; naturally aligned to 16
        add edx,16
        pcmpeqb xmm7,[edx]
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
   
    bsf eax,eax
    sub edx,[esp+4+16+4]
    add eax, edx
   
    @ret:
    movups xmm7,[esp]
    mov ecx,[esp+16]
    add esp,14h
    ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

- BTW your English is very understandable, much better than I write your language! c u later
Title: Re: Faster Memcopy ...
Post by: Antariy on March 21, 2015, 11:08:43 PM
BTW, what does the acronym OTOH mean?
on the other hand

Quote
Does the installed SEH cover all the algos in your library? I.e., do the Let and Len algos avoid crashing on an improperly zero-terminated string only if the SEH is installed at Init?

If SEH is installed (and I rarely use that feature), then there is exactly one point in the library where it is being used: MbStrLen. Which is a function that is used internally very often, of course.

How did you implement the exception handling - as per-thread SEH or SetUnhandledExceptionFilter? IIRC it was SUEF?
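
(I.e., a minimal sketch of the two options, just to be clear what I mean - generic code, not your actual implementation; MyHandler and MyFilter are placeholder routines:)

Code: [Select]
; per-thread SEH: push an EXCEPTION_REGISTRATION record onto the stack
assume fs:nothing
    push    offset MyHandler      ; handler for this frame
    push    fs:[0]                ; link to the previous record
    mov     fs:[0], esp           ; install
    ; ... code that may fault, e.g. the string read ...
    pop     fs:[0]                ; uninstall
    add     esp, 4

; or a process-wide, last-chance filter instead:
    invoke  SetUnhandledExceptionFilter, offset MyFilter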




Hello Antariy, nice to meet you,

Hello rrr, nice to meet you, too.

Hello Antariy, nice to meet you,

There is no difference between the "MEM" and "STR" functions - just imagine your "MEM" function received a wrong bytecount to scan/copy/etc.: if the buffer is actually smaller than the bytecount param specifies, and the buffer lies near the end of a memory page, the "MEM" code will crash just like the "STR" code.

- If the user makes an incorrect request (wrong bytecount) then perhaps the algo should just crash. If he's careless enough to do that he probably won't check error code return, so how do you tell him he made a mistake? If you cover for him, it will probably nail him later somewhere else. Perhaps, let it crash so he knows the error. That's how my memcopy behaves.

But, that's a matter of opinion. The important point is, if user gives correct bytecount, (or, correctly terminated C-style zero-STR), then you must not crash by reading / writing outside the requested buffers. Even though reading 128 bytes at a time, my memcopy never goes beyond user-specified bounds.

The algo cannot know whether the user is careless or not - it is "just code" ("just a machine") and has no right to make assumptions about those who use it. Imagine the same with, for instance, some protection scheme: the user sits down at the computer, turns it on, and the computer says "I dislike your face today, probably you're drunk, or you're not the owner of this machine, maybe you're a criminal - let's format the hard drive to protect the real owner from data theft".

The fact that the "ideology" of "protected programming" has been perverted by many contemporary programmers - who are careless and yet use "protected programming" techniques at the same time - does not make the "ideology" itself wrong. The code is better off behaving in a predictable, fully documented and robust way - the rest is the user's decision. The routines are just tools, aren't they? One wants a tool that is as useful, comfortable and robust as possible, so if such a tool exists, that's good. But HOW the tool is used, and what product it produces, is up to the "master" who uses it.

So, probably nothing is wrong with "protected programming" or "SEH" etc. - this is a problem of human psychology rather than of the "ideology": careless and sloppy coders produce the problems, not the code itself.

The code cannot know whether the user passed a wrong bytecount, or whether the bytecount was right but the buffer for some reason is not the size the user thinks it is (this can really happen - for instance with memory-mapped files/sections, when the data was not loaded into the memory page because the disk unexpectedly became full or the hardware simply failed). So, if the user was careless enough not to notice that the system did not provide the full data and passed the pointer on to the code, he gets one more chance to learn that there is an error - especially if the error-flagging return value differs very noticeably from the regular return. System bugs themselves should not be discounted either: even if the system produced a wrong result, the user can still notice it if the code reports it.
After all, yes, "protected programming" implies that one relies on the system as a bug-free "standard", and the coding style is the programmer's decision. There are, as usual, two big camps: those who code absolutely sloppily, and those who code paranoidally "carefully" - both camps usually produce the buggiest products on the planet, and usually they are the biggest and most respected companies on the planet :greensml:



It is simpler to align the pointers to the power of two...

- You mean, align to the chunk size, assuming it's a power of 2. (The "size of the data grabbing" is called "chunk".)

Yes, this correction is right and describes what I wanted to say more clearly.

The loud cries of "but the algo is faster" are no excuse - the algo should be reliable first, and only then MAY one think about making it faster.

- Of course! Surely Lingo agrees?

I do not know which things he agrees with, and moreover I don't want to know :greensml: But from his "coding style" many of our members have probably formed the opinion that he is a megalomaniacal self-proclaimed "God of assembly" and a "Code Nazi" who invented everything on the planet first, and from whom everyone else has stolen it. He is also known to produce buggy but "fast working" code, and when he gets notes about that he just starts insulting others - so, I think, it's better to ask his psychiatrist what he agrees with :greensml:

Sometimes I agree with the topic paster that the more descriptive and complete the information one tries to provide, the less respect and careful attention one's writing receives.

- Very annoying, go to a lot of trouble with the language and then no one reads it! But, minor point: actually, I didn't paste the topic. Couldn't find anywhere on the net to copy it from, so I had to write it myself :P

Well, jokes apart, I think you understand that I did not mean "copy"-"paster". You "pasted" the topic onto the forum - there was no such topic, and then it appeared, it was "pasted" there - so the one who did it was the topic-"paster".
Isn't the word "paste" independent of "copy" and of the term "copy-paste"? So, then, it was said quite correctly :P

Once again, this algo will not fail with ANY zero-terminated string located anywhere in the memory: ...

- Right, but it can be 18 bytes shorter, and a little cleaner:

It was written that way for reasons. For instance, the XMM and ECX preservation was a "condition of production" - Jochen's condition, since his lib preserves those regs in its internal functions - it's MasmBasic's ABI.
Other things, like the "wide" instructions used to pad the inner loop so it aligns to 16, were there for a reason too: they fit the code as it was written, with all its specifics. The code can be shorter, yes, but as written it seemed (to me, of course, so call it a "subjective judgement") the most performant on the widest variety of hardware. Some of our members can confirm that I like to "hunt for bytes" like a maniac sometimes :greensml:, but in this case that was not the case (pun intended).

I will just comment on the changed code in the lines below each change - without indentation - explaining why it was written that way.

Code: [Select]
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
AxStrLenSSE proc STDCALL lpszStr:DWORD ; SSE safe, ECX safe :), fast with unaligned strings
mov edx,[esp+4]
; mov eax,[esp+4]                   ; don't get eax from [esp+4]
        mov eax, edx                     ; but from edx, save 2 bytes
; db 8bh,94h,24h,04,0,0,0           ; don't pad 6 bytes with
; db 8bh,84h,24h,04,0,0,0           ; these two db instructions

The original code will work faster because it references the same memory location twice: once it has been loaded
into the cache (if it was not there already) on the first access, the second load is cheap - cheaper than a pair of
dependent, and therefore stalling, instructions where the value is taken into one reg and then copied from that reg
into the second one.

If you want to save those two bytes and still keep these instructions as they were coded, then just replace the
pxor instructions with xorps.
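
; For reference, the encodings behind the byte counts mentioned above
; (sizes are for these particular register/operand choices):
;   8B 44 24 04    mov  eax,[esp+4]    ; 4 bytes - a second load from the same (cached) dword
;   8B C2          mov  eax,edx        ; 2 bytes - shorter, but dependent on the mov into edx
;   66 0F EF FF    pxor  xmm7,xmm7     ; 4 bytes
;   0F 57 FF       xorps xmm7,xmm7     ; 3 bytes - same zeroing effect, one byte shorter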

add esp,-14h
movups [esp],xmm7
mov [esp+16],ecx

Actually, if I had to decrease the size of the algo, I would probably remove the XMM and ECX preservation,
because it's not the API's ABI but only MasmBasic's ABI.

and edx,-10h                      ; don't pad 3 bytes
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction

This was coded so as to align the loop to 16 bytes. Why the code as a whole was written this way can only be understood
by looking at it as a whole, not as separate pieces - read below why it was considered inappropriate
to write the code the way it was changed in this example.

mov ecx,eax
pxor xmm7,xmm7
and ecx,0fh
; jz @F                             ; no jump save 2 bytes

The CPU "predicts" the "forward" jumps as "probably untaken", so, the jump will not be taken with the
possibility 16:1 (unaligned string) with byte-aligned strings, or 8:1 with even aligned strings, or 4:1 with
dword aligned strings, so, the jump and the some slowdown is more likely to not happen. And even if it happens
the slowdown itself is not big (the high numbers of jumps are exaggerated really, and as the jump is near and in the same code
location actually, it maybe assumed that the jump will not lead to any "misprediction" stalls since the code
is already fetched and decoded, even that code where the jump jumps (the pun again) to). So, it will cost only few cycles,
but taking in account that it over-jumps some instructions, which will take some time, this jump in the real world
(not millions of iterations on the same data as it is doing with our testbeds) will not only be a slowdown but will
rather speed the code up.

pcmpeqb xmm7,[edx]
; add edx,10h                       ; don't add here save 3 bytes

It will be described below why the pointer advancing was done after the data fetch.

pmovmskb eax,xmm7
shr eax,cl                         ; works fine if cl is 0

The jump was there not to "protect" against a zero ECX, but to avoid executing, for no reason, code that is not required and only slows things down.

bsf eax,eax
jnz @ret
pxor xmm7,xmm7
@@: ; aligned to 16                ; still aligned but 16 bytes less
add edx,16                         ; switch order of these
pcmpeqb xmm7,[edx]                 ; two instructions

Advancing the pointer BEFORE the data fetch slows the code down, especially on older CPUs.
mov eax,[edx] / add edx,16 will generally always be faster than add edx,16 / mov eax,[edx],
because in the first case the two instructions pair (the memory access goes into one execution block and the
arithmetic operation into the other), while in the second case they do not.

pmovmskb eax,xmm7
test eax,eax
jz @B

bsf eax,eax
sub edx,[esp+4+16+4]
; lea eax,[eax+edx-10h]
        add eax, edx                    ;  don't need -10h any more, save 2 bytes
       
This was needed because the pointer advance was deliberately placed after the data fetch.

@ret:
movups xmm7,[esp]
mov ecx,[esp+16]
add esp,14h
ret 4

AxStrLenSSE endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

- BTW your English is very understandable, much better than I write your language! c u later

Well, it was not so a few years ago, but some forum members have assured me that it is quite understandable, even if it is rather "Cyrillic English", as one forum friend called it.
Do you write Russian?



Ah, so at least one of the pieces discussed did fail. You mean this algo, probably?

Yes, there are many like this around (http://masm32.com/board/index.php?topic=3396.msg36254#msg36254), and here are some more (http://www.alfredklomp.com/programming/sse-strings/)  :lol:

Did you point to that site for some reason (maybe there was something useful on that page? I did not read it thoroughly), or just as an example of the fact that there are many implementations which do not care about the possibility of a crash (they do not align the fetch, and even grab the data unaligned all the time)?


Quote
It's in 4.asm, and it's the only algo which reads unaligned, so it crashed near the end of the page even though it reads in 16-byte chunks - and moreover it does two reads of 32 bytes.

The same also with 4 and 8.

Yes, but I was talking about the SSE one and just did not mention the others.


Quote
So was that the first time you posted the two-read piece of this code?
Think I posted it somewhere but I'm not sure.

Just on the previous page there was a piece which included only the inner loop of the two-read, one-check algo. My first post here came right after that, and in it I first said that the problem with the algo is the two reads with one check - since you provided only the inner loop and not the full algo the first time, I assumed you do the reads unaligned, or aligned only to 16.



Quote
Ah, I got it earlier but did not realize it was the full set of files. Is dz.exe your file manager DosZip?
Yes, plus assembler and make to simplify the testing. If on the same drive as masm32 it should work out of the box.

OK, good. Assembler is JWasm?


Quote
But the other question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.

Yes, for reasons explained here (http://masm32.com/board/index.php?topic=4067.msg43020#msg43020).

Ah, the well-known code location problem. But you might try a simpler solution - define a different segment for every tested algo. This also allows you to run algos which are not "code location independent", with no need to relocate them manually in some way. Algos with jump/call tables are such "relocation required" algos, for instance, as are algos which do non-near relative/direct calls/jumps.
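
(For example, a minimal sketch of a "relocation required" construct - hypothetical labels, just to show the idea:)

Code: [Select]
; this dispatch cannot simply be copied to another address:
; the jump table holds absolute addresses fixed at link/load time
    movzx   eax, byte ptr [esi]             ; some selector value 0..3
    jmp     dword ptr [jump_table+eax*4]
case0:  ; ...
case1:  ; ...
case2:  ; ...
case3:  ; ...
jump_table dd case0, case1, case2, case3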
Title: Re: Faster Memcopy ...
Post by: dedndave on March 22, 2015, 12:11:12 AM
hello, Alex   :biggrin:

that's a lot of English - lol
Title: Re: Faster Memcopy ...
Post by: Gunther on March 22, 2015, 12:21:42 AM
Hi Alex,

hello, Alex   :biggrin:

that's a lot of English - lol

but that's good news: Alex is back after a long break and he's back with high quality posts.  :t I missed you.

Gunther
Title: Re: Faster Memcopy ...
Post by: nidud on March 22, 2015, 02:03:27 AM
Code: [Select]
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
and edx,dword ptr -10h ; same same ?...

well, here is a similar version using 32 bytes
Code: [Select]
xorps xmm1,xmm1
mov eax,[esp+4]
mov [esp-4],edx
mov ecx,eax
and eax,-32
and ecx,32-1
or edx,-1
shl edx,cl
pcmpeqb xmm1,[eax]
pmovmskb ecx,xmm1
and ecx,edx
jnz done
pxor xmm1,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1
shl ecx,16
and ecx,edx
jnz done
@@:
add eax,32
pcmpeqb xmm1,[eax]
pmovmskb edx,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1
shl ecx,16
or ecx,edx
jz @B
done:
bsf ecx,ecx
sub eax,[esp+4]
mov edx,[esp-4]
add eax,ecx
ret 4

Code: [Select]
result:
   972055 cycles - (105) proc_5: SSE 32 - safe
  1010521 cycles - ( 99) proc_9: AxStrLenSSE (nidud)
  1265207 cycles - (104) proc_6: AxStrLenSSE
  1289162 cycles - ( 86) proc_8: AxStrLenSSE (rrr)
  1565286 cycles - (  0) proc_2: Len()
  2415594 cycles - (  0) proc_0: crt_strlen
  3018226 cycles - (  0) proc_1: len()

Quote
, or just as an example of the fact that there are many implementations which do not care about the possibility of a crash (they do not align the fetch, and even grab the data unaligned all the time)?
yes

Quote
Quote
It's in 4.asm, and it's the only algo which reads unaligned, so it crashed near the end of the page even though it reads in 16-byte chunks - and moreover it does two reads of 32 bytes.

The same also with 4 and 8.

Yes, but I was talking about the SSE one and just did not mention the others.

My point was that you seem to think in size of registers and not in chunk size  :P

Quote
OK, good. Assembler is JWasm?

yes, modified version.

Quote
Quote
But the other question: do you load the algos from binary files? I just did not find your safe 32-byte algo in the listing of timeit.exe.

Yes, for reasons explained here (http://masm32.com/board/index.php?topic=4067.msg43020#msg43020).

Ah, the well-known code location problem. But you might try a simpler solution - define a different segment for every tested algo. This also allows you to run algos which are not "code location independent", with no need to relocate them manually in some way. Algos with jump/call tables are such "relocation required" algos, for instance, as are algos which do non-near relative/direct calls/jumps.

Tried padding and a lot of other things, but this seems to be the best method so far, at least on the CPU I'm now using. The single-file edit also simplifies the testing by having both the (now small) list file and source open in the editor at the same time. The list file then updates when you compile the source from the editor. Flipping back and forth (using F6) makes it easy to align the code.

I also added a batch file to the BIN directory to build the project from the editor.
BIN\build.bat
Code: [Select]
make
and added a short-key to DZ.INI
Code: [Select]
[AltF9]
asm = build.bat
lst = build.bat
makefile= build.bat

The problem is that the editor passes the current file as argument to external tools, so a batch file is needed. Well, this lets you edit, compile, and execute the test from the editor.

Title: Re: Faster Memcopy ...
Post by: Antariy on March 22, 2015, 06:08:02 AM
hello, Alex   :biggrin:

that's a lot of English - lol

Hello Dave :t

Is that more or less proper English? Sometimes it isn't, yeah :greensml:

Hi Alex,

hello, Alex   :biggrin:

that's a lot of English - lol

but that's good news: Alex is back after a long break and he's back with high quality posts.  :t I missed you.

Gunther

Thank you, Gunther! :biggrin: :redface:



Code: [Select]
; db 81h,0e2h,0F0h,0FFh,0FFh,0FFh   ; with this db instruction
and edx,dword ptr -10h ; same same ?...

well, here is a similar version using 32 bytes

I absolutely don't get what that is in that quote of "mine"??? What do the words "same same" mean?
Also, what does "similar version" mean - similar to which one? To AxStrLenSSE? Well, if so, then it's actually not very similar. Most importantly, your version advances the pointer BEFORE the data fetch. That was the thing I described in the post commenting on rrr's "edition" of the algo - and I did not accept that edit as a replacement for the current AxStrLenSSE implementation; I thought I had said clearly that I consider the original version the best possible for that kind of code (simple, using a single XMM reg).
Also, writing to [esp-4] is more or less "legal" in user mode, though that's still disputable (there are cases where putting data below ESP may make the code work incorrectly), but in kernel mode (in drivers) this string routine could not be used at all, especially on systems with DEP enabled.
Also, XMM is not preserved... so this is actually a strange "reproduction" of the algo with broken / not fully reproduced logic. XMM isn't saved at all; instead of ECX it saves EDX, and saves it the wrong way - below ESP. And the pointer is advanced before the data fetch - which slows the algo down.



My point was that you seem to think in size of registers and not in chunk size  :P


That is a strange point to make if you just, once again, re-read the first post here, which maybe is not crystal clear, but it described everything that has since been discussed here on the StrCmp topic. Probably the thread should be split into two topics, because MemCopy has drifted into StrCmp ::)

That post talked about memory granularity being the page size, about the possibility of safe algos, and about power-of-two "data grabbing sizes" (later corrected to the more suitable word "chunk"). And that post was made in answer to your post where you said that safe algos can only read byte-by-byte, and that it is a "common technique" to read data "ahead" - "even in Intel". Since I did not agree with all of that, I wrote that post. Algos can be safe while reading more than a byte at a time; the "read ahead" is actually not that in the proper sense; and this "common technique", used widely by many hobby programmers around the world, is buggy, because a real read beyond the buffer is a way to crash the code when the buffer is accessed.
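
In a nutshell (a minimal sketch, assuming the usual 4096-byte pages and a 16-byte chunk):

Code: [Select]
; why an aligned 16-byte read is safe: 4096 MOD 16 = 0, so an aligned chunk
; never straddles a page boundary - it stays inside the page that already
; contains at least one valid byte of the string.
    mov     eax, esi            ; esi = current string pointer (hypothetical)
    and     eax, -16            ; eax = chunk start, 16-aligned
    ; (eax AND 0FFFh) <= 0FF0h, so bytes [eax..eax+15] lie in one page
    movdqa  xmm0, [eax]         ; cannot fault if [esi] itself is a valid byte
; a genuine read-ahead (past the chunk holding the terminator, or an
; unaligned read near the page end) has no such guarantee.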

Taking all that into account, it's strange that you draw such conclusions ::)

Tried padding and a lot of other things, but this seems to be the best method so far, at least on the CPU I'm now using. The single-file edit also simplifies the testing by having both the (now small) list file and source open in the editor at the same time. The list file then updates when you compile the source from the editor. Flipping back and forth (using F6) makes it easy to align the code.

Padding is not what I said. I said: just create a different segment for every code example. Like this:

SEG1 SEGMENT PARA PUBLIC 'CODE'
tested code 1 here
SEG1 ENDS

SEG2 SEGMENT PARA PUBLIC 'CODE'
tested code 2 here
SEG2 ENDS

... etc ...


When you write your source this way, every code piece is placed in its own section of the PE file, and in memory the sections get page granularity - every piece lands in its own page, starting at an address aligned to 4096. This solves the code location problems as well - you might try it on your CPU. And it's more convenient because it does not limit the testable code to location-independent code only. Actually I have suggested this way of fixing code location problems in several places on the forum.
Title: Re: Faster Memcopy ...
Post by: hutch-- on March 22, 2015, 06:56:15 AM
The only really reliable way to test algorithms that are subject to code location is to put each algo in a separate executable using the identical test conditions. The problem seems to vary with the hardware but if you are performing a simple test between 2 or more algos in the same test piece, you can move the algos around to see if their timing changes. Occasionally doing a simple "align ##" will stabilise the timings but I have seen enough algorithms that do not respond to alignment.

There is another factor that I used to see with some of Lingo's test pieces: one algo leaves the following algo with a mess to clean up, and to get a reliable timing for the affected algo you had to comment out the first one.
Title: Re: Faster Memcopy ...
Post by: nidud on March 22, 2015, 07:24:48 AM
I absolutely don't get what that is in that quote of "mine"??? What do the words "same same" mean?

It's a type of English used in some parts of the world
"same same but different"  :P

What I meant was that you could extend the code by using DWORD PTR instead of DB ...

Quote
Also, what does "similar version" mean - similar to which one? To AxStrLenSSE?

Well, given the pointer is aligned you may compare it directly, as you do, and thereby reduce code size, so in that sense it's a similar approach.

Quote
Also, writing to [esp-4] is more or less "legal" in user mode, though that's still disputable (there are cases where putting data below ESP may make the code work incorrectly), but in kernel mode (in drivers) this string routine could not be used at all, especially on systems with DEP enabled.

True.

Quote
And the pointer is advanced before the data fetch - which slows the algo down.

Are you sure ?

Quote
And that post was made in answer to your post where you said that safe algos can only read byte-by-byte, and that it is a "common technique" to read data "ahead" - "even in Intel". Since I did not agree with all of that, I wrote that post.

True, that was a wrong assumption. On viewing the strlen function made by Agner Fog I assumed the alignment was only done for speed and not safety.

Quote
Padding is not what I said. I said: just create a different segment for every code example. Like this:

It may be that paging also has an impact on timings, but the test in this case was on small functions which all together take less than a page. There is a thread about this somewhere, and the results differ for most machines.

Quote
And it's more convenient because it does not limit the testable code to location-independent code only.

The test also has this option. The functions you want to test may all be in the same file, or be externals like crt_strlen, or macros like Len() and len().
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 22, 2015, 06:27:42 PM
Antariy,

Thanks very much for your comprehensive answer but I hate to make you type so much! No need to spell everything out in such detail. I see you have strong feelings about "protected programming" and Lingo so will leave those topics alone! As for "topic paster", it was just a joke, no worries.

The important thing is the code. I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted! So, considering your suggestions, I made some modifications, and tested the shorter algo against yours as well as I could. Bottom line, it seems the shorter algo is, for the most part, faster.

First, to avoid register stall with these instructions: mov edx, [esp+4] / mov eax, edx: I just moved the mov down 2 instructions.

I didn't know xorps could save a byte over pxor, thank you. I used those 2 bytes to put the jump back in. It was necessary for the AMD which is horribly slow on bsf. I jump into the loop, skipping the "add edx, 16", so the logic remains valid.

Still preserving XMM and ECX.

Yes I know the purpose of the db instructions is to pad 9 extra bytes to align the beginning of the loop. I know that's better than nop's or lea eax, [eax] which waste time as well as space. But surely it's even better to not waste the bytes, as long as routine is still as fast or faster.

CPU branch prediction - u know, behavior seems to change with every new release from Intel or AMD. Often, we optimize a routine on our own machines, but on newer/older machines it may behave different. I routinely optimize on my Intel I5, then try it on AMD A6 and Pentium 4; often fastest on one machine may be slowest on another. So I'm leery of artificial coding techniques for speed.

Now, I had thought you were right: pointer advance should go AFTER data fetching. However on the Intel my loop was faster. On AMD, a little slower. Apparently the order of the two instructions makes little difference. Anyway, there's very simple correction available, if needed. Just "pcmpeqb xmm7, [edx+10h]" first, then "add edx, 16" - uses one more byte.

By far, the hardest part of the whole exercise is not writing the routine, but getting semi-reliable timing! First, I used your suggestion and put all algos in separate segments. Then, I doubled both of them; put yours first and last, mine 2nd and 3rd. It appears the "last" position is slightly favorable. Then, in desperation, I copied/pasted the timing runs a number of times, using just the final numbers.

Here are the resulting runs:

Code: [Select]
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (SSE4)

BUFFER ALIGNED

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8         906              854              852              905
      31        1147             1020             1019             1074
     271        4024             4142             4020             3924
    2014       26115            24593            24595            25818
   62159      757816           747523           748235           757502

BUFFER MISALIGNED src 11

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8        1184             1157             1132             1241
      31        1399             1391             1395             1425
     271        4508             4368             4432             4522
    2014       25622            25036            25018            25604
   62159      757612           747909           746568           757986


AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)

BUFFER ALIGNED

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8        2124             1551             1551             2319
      31        2526             1944             1926             2494
     271        6220             5679             5676             6416
    2014       29950            30171            30869            30104
   62159      872547           886154           887221           871530

BUFFER MISALIGNED src 11

thecount  StrLen_orig(104)  StrLen_rrr2(86)  StrLen_rrr2(86)  StrLen_orig(104)
       8        2776             2320             2319             2622
      31        2795             2420             2418             2797
     271        6016             5461             5476             6055
    2014       30809            31229            31080            30842
   62159      887148           887519           888207           889350

Your routine was a bit better on Intel ALIGNED 271; also slightly better on the longer strings on AMD. Everywhere else, the shorter routine is better. It's dramatically better on AMD short strings, who knows why; and better on Intel long strings. BTW all these numbers came out pretty much like this on multiple tests; I'm only showing one typical run from each machine.

Here is my modified routine:

Code: [Select]
; «««««««««««««««««««««««««««««««««««««««««««««««««««
algo2 proc SrcStr:PTR BYTE
; «««««««««««««««««««««««««««««««««««««««««««««««««««
; rrr modified version of Antariy algo - number 2
    mov eax,[esp+4]
    add esp,-14h
    movups [esp],xmm7
    mov edx, eax
    mov [esp+16],ecx
    and edx,-10h
    xorps xmm7,xmm7
    mov ecx,eax
    and ecx,0fh
    jz intoloop
    pcmpeqb xmm7,[edx]
    pmovmskb eax,xmm7
    shr eax,cl
    bsf eax,eax
    jnz @ret
    xorps xmm7,xmm7
    @@:                       ; naturally aligned to 16
        add edx,16
    intoloop:
        pcmpeqb xmm7,[edx]
        pmovmskb eax,xmm7
        test eax,eax
        jz @B
    bsf eax,eax
    sub edx,[esp+4+16+4]
    add eax, edx
   
    @ret:
    movups xmm7,[esp]
    mov ecx,[esp+16]
    add esp,14h
    ret 4
algo2 endp
end_algo2:

Bottom line, I can't believe it's right to waste those 18 bytes.

Finally, of course I can write Russian! I'll do it again: "Russian". - very easy. OTOH I can't write in Cyrillic to save my life :)

Zip has "testStrLen.asm" test program.
Title: Re: Faster Memcopy ...
Post by: nidud on March 22, 2015, 10:34:21 PM
I understand your reasoning (of course) but it offends my aesthetic sensibility to see all those bytes wasted!

  :biggrin:

The fast string functions used these days are normally huge in size, and they go a long way to remove branching to save a few cycles.

Here is the Intel version of strlen from 2011:
Code: [Select]
/*
Copyright (c) 2011, Intel Corporation
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
    * this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice,
    * this list of conditions and the following disclaimer in the documentation
    * and/or other materials provided with the distribution.

    * Neither the name of Intel Corporation nor the names of its contributors
    * may be used to endorse or promote products derived from this software
    * without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#ifndef USE_AS_STRCAT

# ifndef STRLEN
#  define STRLEN strlen
# endif

# ifndef L
#  define L(label) .L##label
# endif

# ifndef cfi_startproc
#  define cfi_startproc .cfi_startproc
# endif

# ifndef cfi_endproc
#  define cfi_endproc .cfi_endproc
# endif

/* calee safe register only for strnlen is required */

# ifdef USE_AS_STRNLEN
#  ifndef cfi_rel_offset
#   define cfi_rel_offset(reg, off) .cfi_rel_offset reg, off
#  endif

#  ifndef cfi_restore
#   define cfi_restore(reg) .cfi_restore reg
#  endif

#  ifndef cfi_adjust_cfa_offset
#   define cfi_adjust_cfa_offset(off) .cfi_adjust_cfa_offset off
#  endif
# endif

# ifndef ENTRY
#  define ENTRY(name) \
.type name,  @function; \
.globl name; \
.p2align 4; \
name: \
cfi_startproc
# endif

# ifndef END
#  define END(name) \
cfi_endproc; \
.size name, .-name
# endif

# define PARMS 4
# define STR PARMS
# define RETURN ret

# ifdef USE_AS_STRNLEN
#  define LEN PARMS + 8
#  define CFI_PUSH(REG) \
cfi_adjust_cfa_offset (4); \
cfi_rel_offset (REG, 0)

#  define CFI_POP(REG) \
cfi_adjust_cfa_offset (-4); \
cfi_restore (REG)

#  define PUSH(REG) pushl REG; CFI_PUSH (REG)
#  define POP(REG) popl REG; CFI_POP (REG)
#  undef RETURN
#  define RETURN POP (%edi); ret; CFI_PUSH(%edi);
# endif

.text
ENTRY (STRLEN)
mov STR(%esp), %edx
# ifdef USE_AS_STRNLEN
PUSH (%edi)
movl LEN(%esp), %edi
sub $4, %edi
jbe L(len_less4_prolog)
# endif
#endif
xor %eax, %eax
cmpb $0, (%edx)
jz L(exit_tail0)
cmpb $0, 1(%edx)
jz L(exit_tail1)
cmpb $0, 2(%edx)
jz L(exit_tail2)
cmpb $0, 3(%edx)
jz L(exit_tail3)

#ifdef USE_AS_STRNLEN
sub $4, %edi
jbe L(len_less8_prolog)
#endif

cmpb $0, 4(%edx)
jz L(exit_tail4)
cmpb $0, 5(%edx)
jz L(exit_tail5)
cmpb $0, 6(%edx)
jz L(exit_tail6)
cmpb $0, 7(%edx)
jz L(exit_tail7)

#ifdef USE_AS_STRNLEN
sub $4, %edi
jbe L(len_less12_prolog)
#endif

cmpb $0, 8(%edx)
jz L(exit_tail8)
cmpb $0, 9(%edx)
jz L(exit_tail9)
cmpb $0, 10(%edx)
jz L(exit_tail10)
cmpb $0, 11(%edx)
jz L(exit_tail11)

#ifdef USE_AS_STRNLEN
sub $4, %edi
jbe L(len_less16_prolog)
#endif

cmpb $0, 12(%edx)
jz L(exit_tail12)
cmpb $0, 13(%edx)
jz L(exit_tail13)
cmpb $0, 14(%edx)
jz L(exit_tail14)
cmpb $0, 15(%edx)
jz L(exit_tail15)

pxor %xmm0, %xmm0
lea 16(%edx), %eax
mov %eax, %ecx
and $-16, %eax

#ifdef USE_AS_STRNLEN
and $15, %edx
add %edx, %edi
sub $64, %edi
jbe L(len_less64)
#endif

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
pxor %xmm1, %xmm1
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
pxor %xmm2, %xmm2
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm2
pmovmskb %xmm2, %edx
pxor %xmm3, %xmm3
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm3
pmovmskb %xmm3, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

#ifdef USE_AS_STRNLEN
sub $64, %edi
jbe L(len_less64)
#endif

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm2
pmovmskb %xmm2, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm3
pmovmskb %xmm3, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

#ifdef USE_AS_STRNLEN
sub $64, %edi
jbe L(len_less64)
#endif

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm2
pmovmskb %xmm2, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm3
pmovmskb %xmm3, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

#ifdef USE_AS_STRNLEN
sub $64, %edi
jbe L(len_less64)
#endif

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm2
pmovmskb %xmm2, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

pcmpeqb (%eax), %xmm3
pmovmskb %xmm3, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(exit)

#ifdef USE_AS_STRNLEN
mov %eax, %edx
and $63, %edx
add %edx, %edi
#endif

and $-0x40, %eax

.p2align 4
L(aligned_64_loop):
#ifdef USE_AS_STRNLEN
sub $64, %edi
jbe L(len_less64)
#endif
movaps (%eax), %xmm0
movaps 16(%eax), %xmm1
movaps 32(%eax), %xmm2
movaps 48(%eax), %xmm6
pminub %xmm1, %xmm0
pminub %xmm6, %xmm2
pminub %xmm0, %xmm2
pcmpeqb %xmm3, %xmm2
pmovmskb %xmm2, %edx
lea 64(%eax), %eax
test %edx, %edx
jz L(aligned_64_loop)

pcmpeqb -64(%eax), %xmm3
pmovmskb %xmm3, %edx
lea 48(%ecx), %ecx
test %edx, %edx
jnz L(exit)

pcmpeqb %xmm1, %xmm3
pmovmskb %xmm3, %edx
lea -16(%ecx), %ecx
test %edx, %edx
jnz L(exit)

pcmpeqb -32(%eax), %xmm3
pmovmskb %xmm3, %edx
lea -16(%ecx), %ecx
test %edx, %edx
jnz L(exit)

pcmpeqb %xmm6, %xmm3
pmovmskb %xmm3, %edx
lea -16(%ecx), %ecx
L(exit):
sub %ecx, %eax
test %dl, %dl
jz L(exit_high)

mov %dl, %cl
and $15, %cl
jz L(exit_8)
test $0x01, %dl
jnz L(exit_tail0)
test $0x02, %dl
jnz L(exit_tail1)
test $0x04, %dl
jnz L(exit_tail2)
add $3, %eax
RETURN

.p2align 4
L(exit_8):
test $0x10, %dl
jnz L(exit_tail4)
test $0x20, %dl
jnz L(exit_tail5)
test $0x40, %dl
jnz L(exit_tail6)
add $7, %eax
RETURN

.p2align 4
L(exit_high):
mov %dh, %ch
and $15, %ch
jz L(exit_high_8)
test $0x01, %dh
jnz L(exit_tail8)
test $0x02, %dh
jnz L(exit_tail9)
test $0x04, %dh
jnz L(exit_tail10)
add $11, %eax
RETURN

.p2align 4
L(exit_high_8):
test $0x10, %dh
jnz L(exit_tail12)
test $0x20, %dh
jnz L(exit_tail13)
test $0x40, %dh
jnz L(exit_tail14)
add $15, %eax
L(exit_tail0):
RETURN

#ifdef USE_AS_STRNLEN

.p2align 4
L(len_less64):
pxor %xmm0, %xmm0
add $64, %edi

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
pxor %xmm1, %xmm1
lea 16(%eax), %eax
test %edx, %edx
jnz L(strnlen_exit)

sub $16, %edi
jbe L(return_start_len)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(strnlen_exit)

sub $16, %edi
jbe L(return_start_len)

pcmpeqb (%eax), %xmm0
pmovmskb %xmm0, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(strnlen_exit)

sub $16, %edi
jbe L(return_start_len)

pcmpeqb (%eax), %xmm1
pmovmskb %xmm1, %edx
lea 16(%eax), %eax
test %edx, %edx
jnz L(strnlen_exit)

#ifndef USE_AS_STRLCAT
movl LEN(%esp), %eax
RETURN
#else
jmp L(return_start_len)
#endif

.p2align 4
L(strnlen_exit):
sub %ecx, %eax

test %dl, %dl
jz L(strnlen_exit_high)
mov %dl, %cl
and $15, %cl
jz L(strnlen_exit_8)
test $0x01, %dl
jnz L(exit_tail0)
test $0x02, %dl
jnz L(strnlen_exit_tail1)
test $0x04, %dl
jnz L(strnlen_exit_tail2)
sub $4, %edi
jb L(return_start_len)
lea 3(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_8):
test $0x10, %dl
jnz L(strnlen_exit_tail4)
test $0x20, %dl
jnz L(strnlen_exit_tail5)
test $0x40, %dl
jnz L(strnlen_exit_tail6)
sub $8, %edi
jb L(return_start_len)
lea 7(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_high):
mov %dh, %ch
and $15, %ch
jz L(strnlen_exit_high_8)
test $0x01, %dh
jnz L(strnlen_exit_tail8)
test $0x02, %dh
jnz L(strnlen_exit_tail9)
test $0x04, %dh
jnz L(strnlen_exit_tail10)
sub $12, %edi
jb L(return_start_len)
lea 11(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_high_8):
test $0x10, %dh
jnz L(strnlen_exit_tail12)
test $0x20, %dh
jnz L(strnlen_exit_tail13)
test $0x40, %dh
jnz L(strnlen_exit_tail14)
sub $16, %edi
jb L(return_start_len)
lea 15(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail1):
sub $2, %edi
jb L(return_start_len)
lea 1(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail2):
sub $3, %edi
jb L(return_start_len)
lea 2(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail4):
sub $5, %edi
jb L(return_start_len)
lea 4(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail5):
sub $6, %edi
jb L(return_start_len)
lea 5(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail6):
sub $7, %edi
jb L(return_start_len)
lea 6(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail8):
sub $9, %edi
jb L(return_start_len)
lea 8(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail9):
sub $10, %edi
jb L(return_start_len)
lea 9(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail10):
sub $11, %edi
jb L(return_start_len)
lea 10(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail12):
sub $13, %edi
jb L(return_start_len)
lea 12(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail13):
sub $14, %edi
jb L(return_start_len)
lea 13(%eax), %eax
RETURN

.p2align 4
L(strnlen_exit_tail14):
sub $15, %edi
jb L(return_start_len)
lea 14(%eax), %eax
RETURN

#ifndef USE_AS_STRLCAT
.p2align 4
L(return_start_len):
movl LEN(%esp), %eax
RETURN
#endif

/* for prolog only */

.p2align 4
L(len_less4_prolog):
xor %eax, %eax

add $4, %edi
jz L(exit_tail0)

cmpb $0, (%edx)
jz L(exit_tail0)
cmp $1, %edi
je L(exit_tail1)

cmpb $0, 1(%edx)
jz L(exit_tail1)
cmp $2, %edi
je L(exit_tail2)

cmpb $0, 2(%edx)
jz L(exit_tail2)
cmp $3, %edi
je L(exit_tail3)

cmpb $0, 3(%edx)
jz L(exit_tail3)
mov %edi, %eax
RETURN

.p2align 4
L(len_less8_prolog):
add $4, %edi

cmpb $0, 4(%edx)
jz L(exit_tail4)
cmp $1, %edi
je L(exit_tail5)

cmpb $0, 5(%edx)
jz L(exit_tail5)
cmp $2, %edi
je L(exit_tail6)

cmpb $0, 6(%edx)
jz L(exit_tail6)
cmp $3, %edi
je L(exit_tail7)

cmpb $0, 7(%edx)
jz L(exit_tail7)
mov $8, %eax
RETURN


.p2align 4
L(len_less12_prolog):
add $4, %edi

cmpb $0, 8(%edx)
jz L(exit_tail8)
cmp $1, %edi
je L(exit_tail9)

cmpb $0, 9(%edx)
jz L(exit_tail9)
cmp $2, %edi
je L(exit_tail10)

cmpb $0, 10(%edx)
jz L(exit_tail10)
cmp $3, %edi
je L(exit_tail11)

cmpb $0, 11(%edx)
jz L(exit_tail11)
mov $12, %eax
RETURN

.p2align 4
L(len_less16_prolog):
add $4, %edi

cmpb $0, 12(%edx)
jz L(exit_tail12)
cmp $1, %edi
je L(exit_tail13)

cmpb $0, 13(%edx)
jz L(exit_tail13)
cmp $2, %edi
je L(exit_tail14)

cmpb $0, 14(%edx)
jz L(exit_tail14)
cmp $3, %edi
je L(exit_tail15)

cmpb $0, 15(%edx)
jz L(exit_tail15)
mov $16, %eax
RETURN
#endif

.p2align 4
L(exit_tail1):
add $1, %eax
RETURN

L(exit_tail2):
add $2, %eax
RETURN

L(exit_tail3):
add $3, %eax
RETURN

L(exit_tail4):
add $4, %eax
RETURN

L(exit_tail5):
add $5, %eax
RETURN

L(exit_tail6):
add $6, %eax
RETURN

L(exit_tail7):
add $7, %eax
RETURN

L(exit_tail8):
add $8, %eax
RETURN

L(exit_tail9):
add $9, %eax
RETURN

L(exit_tail10):
add $10, %eax
RETURN

L(exit_tail11):
add $11, %eax
RETURN

L(exit_tail12):
add $12, %eax
RETURN

L(exit_tail13):
add $13, %eax
RETURN

L(exit_tail14):
add $14, %eax
RETURN

L(exit_tail15):
add $15, %eax
#ifndef USE_AS_STRCAT
RETURN
END (STRLEN)
#endif
Title: Re: Faster Memcopy ...
Post by: Gunther on March 23, 2015, 07:08:14 AM
Hi nidud,

Here is the Intel version of strlen from 2011:

interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.

Gunther
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 23, 2015, 08:55:27 AM
interesting, but AT&T syntax. Not the best choice. Fortunately the conversion into Intel syntax isn't very hard.

Not hard, but tedious! If someone wants to do it and post the results that would be appreciated. There are lots of defines like L(label) which (AFAIK) must be made into function MACROs - or else all occurrences replaced by hand. There are the "." labels to fix. There are the multiline statements, separated by ;'s, to deal with; and of course the order of about 400 commands to reverse. The #'s, %'s and the $'s are easy. Normally this routine would be about 20 or 30 lines (in fact that's the whole point, that he's unrolled everything to the max) and the job would be trivial. OTOH if there were a few thousand lines, it would be rather fun to make .bat files and MACROs to automate the conversion. But for 700 lines it's not really worth the effort. Tedious. Oh, and don't forget to fix that egregious spelling mistake "calee".
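
Just to show what the conversion involves - a possible MASM rendering of a few lines from the listing above (label names flattened, operand order reversed, sigils dropped, .p2align 4 becoming align 16):

Code: [Select]
; AT&T (as posted)                       ; Intel/MASM equivalent
; mov STR(%esp), %edx                    mov      edx, [esp+4]       ; STR = PARMS = 4
; pcmpeqb (%eax), %xmm0                  pcmpeqb  xmm0, [eax]
; pmovmskb %xmm0, %edx                   pmovmskb edx, xmm0
; lea 16(%eax), %eax                     lea      eax, [eax+16]
; jnz L(exit)                            jnz      L_exit
; .p2align 4                             align    16
; L(exit):                             L_exit: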

Brings up a question: why is it AT&T syntax? Obviously this routine is not for Windows. Note that he uses xmm0-xmm3 then skips to xmm6; in Windows you'd use 4 and 5, wouldn't you? The reason it might matter: what is the target (type of) processor? If it's for handhelds it may not be optimal for us - those processors might not have good branch prediction, for instance, or other differences. I don't know, u understand, but he has some rather surprising techniques. Like, avoids branches like the plague; avoids bsf, prefers pminub to pcmpeqb, other little things. These might be great techniques for us, or, as I wonder, might be designed for different processors (say, Atom) or environments. It also seems optimized for very small strings.

And, it doesn't seem very professional. All these lines, and he still doesn't have a MAXLEN - if the string isn't terminated the routine will dive right off the deep end. And why is Intel making it public anyway? Is this really their best effort? I'll bet it's not what Windows is using.

Anyway, it would be nice if someone would translate it, I'll be happy to test it, wouldn't be amazed if it's not all that impressive speedwise - in our typical PC environment.

[edit] Of course I'm pretty ignorant about these issues, don't mean to pretend I know what I'm talking about. Just some thoughts/questions that occurred to me
Title: Re: Faster Memcopy ...
Post by: nidud on March 23, 2015, 11:04:49 PM
Well, it was just to illustrate the point about code size, and yes this code is made for the Atom processor.

This one is made for the Silvermont in 2014. This is more compact and similar to the ones we worked with.

Code: [Select]
/*
Copyright (c) 2014, Intel Corporation
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice,
    * this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice,
    * this list of conditions and the following disclaimer in the documentation
    * and/or other materials provided with the distribution.

    * Neither the name of Intel Corporation nor the names of its contributors
    * may be used to endorse or promote products derived from this software
    * without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#ifndef STRLEN
# define STRLEN strlen
#endif

#ifndef L
# define L(label) .L##label
#endif

#ifndef cfi_startproc
# define cfi_startproc .cfi_startproc
#endif

#ifndef cfi_endproc
# define cfi_endproc .cfi_endproc
#endif

#ifndef cfi_rel_offset
# define cfi_rel_offset(reg, off) .cfi_rel_offset reg, off
#endif

#ifndef cfi_restore
# define cfi_restore(reg) .cfi_restore reg
#endif

#ifndef cfi_adjust_cfa_offset
# define cfi_adjust_cfa_offset(off) .cfi_adjust_cfa_offset off
#endif

#ifndef ENTRY
# define ENTRY(name)             \
.type name,  @function;  \
.globl name;             \
.p2align 4;              \
name:                            \
cfi_startproc
#endif

#ifndef END
# define END(name)               \
cfi_endproc;             \
.size name, .-name
#endif

#define CFI_PUSH(REG)                   \
cfi_adjust_cfa_offset (4);      \
cfi_rel_offset (REG, 0)

#define CFI_POP(REG)                    \
cfi_adjust_cfa_offset (-4);     \
cfi_restore (REG)

#define PUSH(REG) pushl REG; CFI_PUSH (REG)
#define POP(REG) popl REG; CFI_POP (REG)

.section .text.sse2,"ax",@progbits
ENTRY (STRLEN)
mov 4(%esp), %edx
mov %edx, %ecx
and $0x3f, %ecx
pxor %xmm0, %xmm0
cmp $0x30, %ecx
ja L(next)
movdqu (%edx), %xmm1
pcmpeqb %xmm1, %xmm0
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit_less16)
mov %edx, %eax
and $-16, %eax
jmp L(align16_start)
L(next):
mov %edx, %eax
and $-16, %eax
PUSH (%edi)
pcmpeqb (%eax), %xmm0
mov $-1, %edi
sub %eax, %ecx
shl %cl, %edi
pmovmskb %xmm0, %ecx
and %edi, %ecx
POP (%edi)
jnz L(exit_unaligned)
pxor %xmm0, %xmm0
L(align16_start):
pxor %xmm1, %xmm1
pxor %xmm2, %xmm2
pxor %xmm3, %xmm3
pcmpeqb 16(%eax), %xmm0
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit64)

pcmpeqb 80(%eax), %xmm0
add $64, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit64)

pcmpeqb 80(%eax), %xmm0
add $64, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit64)

pcmpeqb 80(%eax), %xmm0
add $64, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit64)


test $0x3f, %eax
jz L(align64_loop)

pcmpeqb 80(%eax), %xmm0
add $80, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit)

test $0x3f, %eax
jz L(align64_loop)

pcmpeqb 16(%eax), %xmm1
add $16, %eax
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit)

test $0x3f, %eax
jz L(align64_loop)

pcmpeqb 16(%eax), %xmm2
add $16, %eax
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit)

test $0x3f, %eax
jz L(align64_loop)

pcmpeqb 16(%eax), %xmm3
add $16, %eax
pmovmskb %xmm3, %ecx
test %ecx, %ecx
jnz L(exit)

add $16, %eax
.p2align 4
L(align64_loop):
movaps (%eax), %xmm4
pminub 16(%eax), %xmm4
movaps 32(%eax), %xmm5
pminub 48(%eax), %xmm5
add $64, %eax
pminub %xmm4, %xmm5
pcmpeqb %xmm0, %xmm5
pmovmskb %xmm5, %ecx
test %ecx, %ecx
jz L(align64_loop)


pcmpeqb -64(%eax), %xmm0
sub $80, %eax
pmovmskb %xmm0, %ecx
test %ecx, %ecx
jnz L(exit16)

pcmpeqb 32(%eax), %xmm1
pmovmskb %xmm1, %ecx
test %ecx, %ecx
jnz L(exit32)

pcmpeqb 48(%eax), %xmm2
pmovmskb %xmm2, %ecx
test %ecx, %ecx
jnz L(exit48)

pcmpeqb 64(%eax), %xmm3
pmovmskb %xmm3, %ecx
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $64, %eax
ret

.p2align 4
L(exit):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
ret

L(exit_less16):
bsf %ecx, %eax
ret

.p2align 4
L(exit_unaligned):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
ret

.p2align 4
L(exit16):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $16, %eax
ret

.p2align 4
L(exit32):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $32, %eax
ret

.p2align 4
L(exit48):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $48, %eax
ret

.p2align 4
L(exit64):
sub %edx, %eax
bsf %ecx, %ecx
add %ecx, %eax
add $64, %eax
ret

END (STRLEN)
Title: Re: Faster Memcopy ...
Post by: Gunther on March 24, 2015, 12:36:04 AM
Hi nidud,

could you post the entire code as an attachment, please? It's much easier for me to do the conversion. Thank you.

Gunther
Title: Re: Faster Memcopy ...
Post by: nidud on March 24, 2015, 01:13:21 AM
I converted and tested both of them.
Code: [Select]
AMD Athlon(tm) II X2 245 Processor (SSE3)
----------------------------------------------
-- string length 0..40
862314    cycles - (  0) proc_0: crt_strlen
513683    cycles - (105) proc_1: SSE 32
653387    cycles - (104) proc_2: AxStrLenSSE
641895    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
561687    cycles - (673) proc_8: SSE Intel Silvermont
726553    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
940951    cycles - (  0) proc_0: crt_strlen
331639    cycles - (105) proc_1: SSE 32
474770    cycles - (104) proc_2: AxStrLenSSE
478938    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
383493    cycles - (673) proc_8: SSE Intel Silvermont
469439    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
464667    cycles - (  0) proc_0: crt_strlen
111167    cycles - (105) proc_1: SSE 32
162923    cycles - (104) proc_2: AxStrLenSSE
166028    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
105110    cycles - (673) proc_8: SSE Intel Silvermont
120176    cycles - (818) proc_9: SSE Intel Atom

result:
   956489 cycles - (105) proc_1: SSE 32
  1050290 cycles - (673) proc_8: SSE Intel Silvermont
  1286861 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  1291080 cycles - (104) proc_2: AxStrLenSSE
  1316168 cycles - (818) proc_9: SSE Intel Atom
  2267932 cycles - (  0) proc_0: crt_strlen

The Intel files are 8.asm and 9.asm in the archive
Title: Re: Faster Memcopy ...
Post by: jj2007 on March 24, 2015, 02:30:50 AM
I must confess I don't understand what your testbed is showing. Can you explain?

-- string length 0..40
862314    cycles - (  0) proc_0: crt_strlen
513683    cycles - (105) proc_1: SSE 32
-- string length 40..80
940951    cycles - (  0) proc_0: crt_strlen
331639    cycles - (105) proc_1: SSE 32

- what are the numbers in brackets?
- why does SSE 32 run faster with the longer string?
Title: Re: Faster Memcopy ...
Post by: nidud on March 24, 2015, 02:47:14 AM
Quote
what are the numbers in brackets?

The size of the proc from reading the .BIN file.
The size of crt_strlen is unknown.

Quote
why does SSE 32 run faster with the longer string?

It's a matter of priority for the end result: A 40 byte string is more important than a 1000 byte string in this case.
Title: Re: Faster Memcopy ...
Post by: Gunther on March 24, 2015, 04:17:22 AM
Hi nidud,

I converted and tested both of them.
The Intel files are 8.asm and 9.asm in the archive

thank you, that saves me a bit of work. Here are the results:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- string length 0..40
444941    cycles - (  0) proc_0: crt_strlen
231197    cycles - (105) proc_1: SSE 32
218271    cycles - (104) proc_2: AxStrLenSSE
210172    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
228546    cycles - (673) proc_8: SSE Intel Silvermont
374264    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
493818    cycles - (  0) proc_0: crt_strlen
155729    cycles - (105) proc_1: SSE 32
152824    cycles - (104) proc_2: AxStrLenSSE
148059    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
141848    cycles - (673) proc_8: SSE Intel Silvermont
263988    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
341526    cycles - (  0) proc_0: crt_strlen
69545     cycles - (105) proc_1: SSE 32
85672     cycles - (104) proc_2: AxStrLenSSE
87101     cycles - ( 86) proc_3: AxStrLenSSE (rrr)
43498     cycles - (673) proc_8: SSE Intel Silvermont
57624     cycles - (818) proc_9: SSE Intel Atom

result:
   413892 cycles - (673) proc_8: SSE Intel Silvermont
   445332 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   456471 cycles - (105) proc_1: SSE 32
   456767 cycles - (104) proc_2: AxStrLenSSE
   695876 cycles - (818) proc_9: SSE Intel Atom
  1280285 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

Gunther
Title: Re: Faster Memcopy ...
Post by: dedndave on March 24, 2015, 06:57:32 AM
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
----------------------------------------------
-- string length 0..40
1592748   cycles - (  0) proc_0: crt_strlen
1342550   cycles - (105) proc_1: SSE 32
1361220   cycles - (104) proc_2: AxStrLenSSE
1319353   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1256349   cycles - (673) proc_8: SSE Intel Silvermont
1586403   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1711483   cycles - (  0) proc_0: crt_strlen
927442    cycles - (105) proc_1: SSE 32
962485    cycles - (104) proc_2: AxStrLenSSE
968221    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
895419    cycles - (673) proc_8: SSE Intel Silvermont
1476721   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
659013    cycles - (  0) proc_0: crt_strlen
286781    cycles - (105) proc_1: SSE 32
333387    cycles - (104) proc_2: AxStrLenSSE
339087    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
255315    cycles - (673) proc_8: SSE Intel Silvermont
320034    cycles - (818) proc_9: SSE Intel Atom

result:
  2407083 cycles - (673) proc_8: SSE Intel Silvermont
  2556773 cycles - (105) proc_1: SSE 32
  2626661 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  2657092 cycles - (104) proc_2: AxStrLenSSE
  3383158 cycles - (818) proc_9: SSE Intel Atom
  3963244 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

I:\nidudStrlen2\strlen2 => timeit
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
----------------------------------------------
-- string length 0..40
1596428   cycles - (  0) proc_0: crt_strlen
1353058   cycles - (105) proc_1: SSE 32
1362316   cycles - (104) proc_2: AxStrLenSSE
1351833   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1251872   cycles - (673) proc_8: SSE Intel Silvermont
1600414   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1711735   cycles - (  0) proc_0: crt_strlen
926670    cycles - (105) proc_1: SSE 32
952517    cycles - (104) proc_2: AxStrLenSSE
982818    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
906223    cycles - (673) proc_8: SSE Intel Silvermont
1443038   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
681045    cycles - (  0) proc_0: crt_strlen
279759    cycles - (105) proc_1: SSE 32
326521    cycles - (104) proc_2: AxStrLenSSE
339426    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
263106    cycles - (673) proc_8: SSE Intel Silvermont
319237    cycles - (818) proc_9: SSE Intel Atom

result:
  2421201 cycles - (673) proc_8: SSE Intel Silvermont
  2559487 cycles - (105) proc_1: SSE 32
  2641354 cycles - (104) proc_2: AxStrLenSSE
  2674077 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  3362689 cycles - (818) proc_9: SSE Intel Atom
  3989208 cycles - (  0) proc_0: crt_strlen
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 07:49:00 AM
Quote
why does SSE 32 run faster with the longer string?

It's a matter of priority for the end result: A 40 byte string is more important than a 1000 byte string in this case.

I think Jochen meant that your code is faster with the longer string, i.e. it's faster with an 80-byte string than with a 40-byte string, so it's faster with the less important string.

The timings of your latest zip (strlen2)
Code: [Select]
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
----------------------------------------------
-- string length 0..40
1631130   cycles - (  0) proc_0: crt_strlen
1438090   cycles - (105) proc_1: SSE 32
2560055   cycles - (104) proc_2: AxStrLenSSE
2488369   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
1332828   cycles - (673) proc_8: SSE Intel Silvermont
1697691   cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
1756723   cycles - (  0) proc_0: crt_strlen
999911    cycles - (105) proc_1: SSE 32
1511301   cycles - (104) proc_2: AxStrLenSSE
1492232   cycles - ( 86) proc_3: AxStrLenSSE (rrr)
972191    cycles - (673) proc_8: SSE Intel Silvermont
1551460   cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
704575    cycles - (  0) proc_0: crt_strlen
305959    cycles - (105) proc_1: SSE 32
416240    cycles - (104) proc_2: AxStrLenSSE
418694    cycles - ( 86) proc_3: AxStrLenSSE (rrr)
280099    cycles - (673) proc_8: SSE Intel Silvermont
348494    cycles - (818) proc_9: SSE Intel Atom

result:
  2585118 cycles - (673) proc_8: SSE Intel Silvermont
  2743960 cycles - (105) proc_1: SSE 32
  3597645 cycles - (818) proc_9: SSE Intel Atom
  4092428 cycles - (  0) proc_0: crt_strlen
  4399295 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  4487596 cycles - (104) proc_2: AxStrLenSSE
hit any key to continue...

No answers to any other posts yet - I got worn out writing the long post in the Lab, the "Research" one.
BTW, timings there are requested :biggrin:
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 07:55:45 AM
   413892 cycles - (673) proc_8: SSE Intel Silvermont
   445332 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
   456471 cycles - (105) proc_1: SSE 32
   456767 cycles - (104) proc_2: AxStrLenSSE
   695876 cycles - (818) proc_9: SSE Intel Atom
  1280285 cycles - (  0) proc_0: crt_strlen

result:
  2407083 cycles - (673) proc_8: SSE Intel Silvermont
  2556773 cycles - (105) proc_1: SSE 32
  2626661 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  2657092 cycles - (104) proc_2: AxStrLenSSE
  3383158 cycles - (818) proc_9: SSE Intel Atom
  3963244 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

result:
  2421201 cycles - (673) proc_8: SSE Intel Silvermont
  2559487 cycles - (105) proc_1: SSE 32
  2641354 cycles - (104) proc_2: AxStrLenSSE
  2674077 cycles - ( 86) proc_3: AxStrLenSSE (rrr)
  3362689 cycles - (818) proc_9: SSE Intel Atom
  3989208 cycles - (  0) proc_0: crt_strlen

Not bad for simple single-reg XMM code (and rrr's version is short), taking into account the crazy unrolling of Intel's code and the larger chunk size of nidud's code.
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 08:07:25 AM
Well, I think that if the algo posted in the Lab in the "Research" topic is extended with a short-string (64...256 bytes) path using simpler code, like nidud's 32-byte fetching one, it will be half the size of Intel's Silvermont version and probably faster than it.

BTW, nidud, you may try to use such a code in your inner loop:

Code: [Select]
@mainloop:
pcmpeqb xmm0,[ebx]
add ebx,16
pcmpeqb xmm1,[ebx]
add ebx,16
pmovmskb eax,xmm0
pmovmskb edx,xmm1

cmp eax,edx
jz @mainloop

shl edx,16
add edx,eax
bsf ecx,edx

It avoids unnecessary shifts and combining in the loop.
Title: Re: Faster Memcopy ...
Post by: nidud on March 24, 2015, 08:29:21 AM
BTW, nidud, you may try to use such a code in your inner loop:

I think it will fail if they are equal but not zero.
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 08:37:12 AM
BTW, nidud, you may try to use such a code in your inner loop:

I think it will fail if they are equal but not zero.

Yes... a silly idea - I'm worn out from coding the prototype for the test thread in the Lab, and from typing for two days ::)
Title: Re: Faster Memcopy ...
Post by: nidud on March 24, 2015, 09:14:22 AM
I tried a few combinations yesterday
I first used this one
Code: [Select]
lea eax,[eax+20h]
pcmpeqb xmm1,[eax]
pmovmskb edx,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1
but this seems to be faster for some reason
Code: [Select]
movdqa  xmm1,[eax+30h]
movdqa  xmm0,[eax+20h]
pcmpeqb xmm0,xmm2
pmovmskb edx,xmm0
lea eax,[eax+20h]
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 09:15:55 AM
Then try it like this (I'm not sure it will be faster, just suggesting to try it)
Code: [Select]
@mainloop:
pcmpeqb xmm0,[ebx]
add ebx,16
pcmpeqb xmm1,[ebx]
add ebx,16
pmovmskb eax,xmm0
pmovmskb edx,xmm1

add eax,edx
jz @mainloop
jc @mainloop

pmovmskb eax,xmm0

shl edx,16
add edx,eax
bsf ecx,edx

The changed/added instructions are the ones without indent.
Take notice that the possibility of the jz being taken wrongly - both EAX and EDX being 80000000h at the same time - is incredibly small, so the jc instruction will almost always be executed only once, and only after a proper exit; it will not cost much, while the loop itself does fewer operations.
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 09:27:59 AM
I tried a few combinations yesterday
I first used this one
Code: [Select]
lea eax,[eax+20h]
pcmpeqb xmm1,[eax]
pmovmskb edx,xmm1
pcmpeqb xmm1,[eax+16]
pmovmskb ecx,xmm1
but this seems to be faster for some reason
Code: [Select]
movdqa  xmm1,[eax+30h]
movdqa  xmm0,[eax+20h]
pcmpeqb xmm0,xmm2
pmovmskb edx,xmm0
lea eax,[eax+20h]
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1

Reversed order of the data fetch - the higher offset first, then the lower offset - I also used that in the big code in the "Research" thread, in the first part, which handles 64 bytes of misalignment as two 32-byte parts :biggrin:

I think the first piece you showed here worked slower, and possibly incorrectly (i.e. it did not work at all as required, hence the slowdowns, probably), because you compare against the same XMM1 reg twice: if the first comparison changed some byte(s) of it to 0xFF, then the second comparison no longer compares against a zeroed reg but against a mess. Being changed randomly from time to time, the register slows the code down, and moreover it will produce wrong results in the second comparison if the first comparison found a zero.
Title: Re: Faster Memcopy ...
Post by: rrr314159 on March 24, 2015, 12:01:19 PM
There's a problem with this whole exercise: the same code runs very differently on different processors.

Quote from: nidud
but this seems to be faster for some reason

I do the same thing: try different combinations and select the fastest on my machine, often not knowing why it's faster. Problem is, what happens on another machine?

I put together a new "AxStrLenSSE" which beats the Silvermont on my i5. It was easy. First, notice that Antariy's original algo saves XMM and ECX, because of the environment it was designed for; but the Silvermont one doesn't. So I took that out. Then, it was still slower on the long strings. So, I unrolled the loop 9 times, and it easily "won" on nidud's test bed. But then I tried it on my AMD, and it came in 3rd (SSE 32 also ahead)! (AMD is very bad for bsf, pcmpeqb and pmovmskb.)

Working at Intel, Torbjorn Granlund has access to hundreds of machines (probably) and can try his Silvermont algo on all of them. If you don't have similar facilities there's no point in trying to keep up with him. I also think it's ridiculous to use 700 bytes for a strLen algo. This is one reason Microsoft and other modern code is so bloated.

This reminds me of a Go proverb (an oriental board game, you've probably heard of it). Often a student will go to great lengths to keep a group alive, destroying his entire position in the process. It would have been much better to sacrifice the group, for advantage elsewhere; this way he's guaranteed to lose. Then the teacher says, "If that's what you have to do to stay alive, why don't you just kill yourself?" That's how I feel about 700 lines to save a few nanoseconds looking for the end of a string!

Just the same, Torbjorn might have some good ideas. There is only one thing he does I've never seen before, uses pminub to take care of the situation you (Antariy and nidud) are discussing right now. That is, integrating the results from 2 or more XMM's reading the string. Perhaps you should take a look at his technique.
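Roughly, the pminub idea looks like this (a minimal untested sketch, assuming eax is 16-aligned and xmm0 holds zero; the label name is just illustrative):

Code: [Select]
scan64:
    movaps   xmm4, [eax]        ; chunk 0
    pminub   xmm4, [eax+16]     ; byte-wise minimum: a zero byte in either chunk survives
    movaps   xmm5, [eax+32]     ; chunk 2
    pminub   xmm5, [eax+48]     ; min with chunk 3
    pminub   xmm5, xmm4         ; fold all 64 bytes into one register
    pcmpeqb  xmm5, xmm0         ; a single compare against zero for the whole 64 bytes
    pmovmskb ecx, xmm5
    add      eax, 64
    test     ecx, ecx
    jz       scan64
    ; the terminator is somewhere in [eax-64 .. eax-1]; re-test the four
    ; 16-byte chunks individually to locate it exactly

That is essentially what the align64_loop in the Intel code quoted earlier in the thread does: the shifts and ORs for combining masks disappear from the hot loop.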

nidud, your test bed is very good; this is the first time I've used it much, and the timings are very stable and reliable compared to other methods. It's also very convenient (once you get used to it) to just change 3.asm and recompile, without needing to recompile timeit.asm or any other algos.

Anyway, here's this new version of Antariy's algo, which is better on i5's than Silvermont, but worse on AMD ... I know there are various ways to improve it, but I think I'll do something else instead!

Code: [Select]
.686
.model flat, stdcall
.code
.xmm

OPTION PROLOGUE:NONE, EPILOGUE:NONE

AxStrLenSSE2 proc STDCALL lpszStr:DWORD  ; fast with unaligned strings
; original by Antariy modified by rrr314159
    mov eax,[esp+4]
    mov edx, eax
    and edx,-10h
    pxor xmm0,xmm0
    mov ecx,eax
    and ecx,0fh
    pcmpeqb xmm0,[edx]
    pmovmskb eax,xmm0
    shr eax,cl
    bsf eax,eax
    jnz @ret
    xorps xmm0,xmm0
    pxor xmm1, xmm1
    pxor xmm2, xmm2
    @@:                                ; naturally aligned to 16
        pcmpeqb xmm0,[edx+10h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat10h

        pcmpeqb xmm1,[edx+20h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat20h
   
        pcmpeqb xmm2,[edx+30h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat30h

        pcmpeqb xmm0,[edx+40h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat40h

        pcmpeqb xmm1,[edx+50h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat50h
   
        pcmpeqb xmm2,[edx+60h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat60h

        pcmpeqb xmm0,[edx+70h]
        pmovmskb eax,xmm0
        test eax,eax
        jnz gotat70h

        pcmpeqb xmm1,[edx+80h]
        pmovmskb eax,xmm1
        test eax,eax
        jnz gotat80h
   
        pcmpeqb xmm2,[edx+90h]
        pmovmskb eax,xmm2
        test eax,eax
        jnz gotat90h

        add edx, 90h
        jmp @B

gotat90h:
    add edx, 10h
gotat80h:
    add edx, 10h
gotat70h:
    add edx, 10h
gotat60h:
    add edx, 10h
gotat50h:
    add edx, 10h
gotat40h:
    add edx, 10h
gotat30h:
    add edx, 10h
gotat20h:
    add edx, 10h
gotat10h:
    add edx, 10h

    bsf eax,eax
    sub edx,[esp+4]
    add eax, edx

    @ret:
    ret 4
AxStrLenSSE2 endp
END

Here are the timings:

Code: [Select]
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (AVX)
----------------------------------------------
-- string length 0..40
462421    cycles - (  0) proc_0: crt_strlen
239532    cycles - (105) proc_1: SSE 32
232826    cycles - (104) proc_2: AxStrLenSSE
217896    cycles - (229) proc_3: AxStrLenSSE (rrr)
242578    cycles - (673) proc_8: SSE Intel Silvermont
395083    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
446781    cycles - (  0) proc_0: crt_strlen
167097    cycles - (105) proc_1: SSE 32
164663    cycles - (104) proc_2: AxStrLenSSE
147078    cycles - (229) proc_3: AxStrLenSSE (rrr)
151820    cycles - (673) proc_8: SSE Intel Silvermont
282262    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
349538    cycles - (  0) proc_0: crt_strlen
73973     cycles - (105) proc_1: SSE 32
94474     cycles - (104) proc_2: AxStrLenSSE
55565     cycles - (229) proc_3: AxStrLenSSE (rrr)
46600     cycles - (673) proc_8: SSE Intel Silvermont
61976     cycles - (818) proc_9: SSE Intel Atom

result:
   420539 cycles - (229) proc_3: AxStrLenSSE (rrr)
   440998 cycles - (673) proc_8: SSE Intel Silvermont
   480602 cycles - (105) proc_1: SSE 32
   491963 cycles - (104) proc_2: AxStrLenSSE
   739321 cycles - (818) proc_9: SSE Intel Atom
  1258740 cycles - (  0) proc_0: crt_strlen
hit any key to continue...

AMD A6-6310 APU with AMD Radeon R4 Graphics (AVX)
----------------------------------------------
-- string length 0..40
749682    cycles - (  0) proc_0: crt_strlen
478036    cycles - (105) proc_1: SSE 32
537144    cycles - (104) proc_2: AxStrLenSSE
542685    cycles - (229) proc_3: AxStrLenSSE (rrr)
443542    cycles - (673) proc_8: SSE Intel Silvermont
618257    cycles - (818) proc_9: SSE Intel Atom
-- string length 40..80
876412    cycles - (  0) proc_0: crt_strlen
321733    cycles - (105) proc_1: SSE 32
401381    cycles - (104) proc_2: AxStrLenSSE
361959    cycles - (229) proc_3: AxStrLenSSE (rrr)
310989    cycles - (673) proc_8: SSE Intel Silvermont
448527    cycles - (818) proc_9: SSE Intel Atom
-- string length 600..1000
520175    cycles - (  0) proc_0: crt_strlen
109789    cycles - (105) proc_1: SSE 32
113737    cycles - (104) proc_2: AxStrLenSSE
102691    cycles - (229) proc_3: AxStrLenSSE (rrr)
79685     cycles - (673) proc_8: SSE Intel Silvermont
95800     cycles - (818) proc_9: SSE Intel Atom

result:
   834216 cycles - (673) proc_8: SSE Intel Silvermont
   909558 cycles - (105) proc_1: SSE 32
  1007335 cycles - (229) proc_3: AxStrLenSSE (rrr)
  1052262 cycles - (104) proc_2: AxStrLenSSE
  1162584 cycles - (818) proc_9: SSE Intel Atom
  2146269 cycles - (  0) proc_0: crt_strlen
hit any key to continue...
Title: Re: Faster Memcopy ...
Post by: Antariy on March 24, 2015, 08:50:52 PM
Hi rrr :t

I don't ignore your messages, I was just too tired to answer thoroughly.
The one note right now - yes, AMDs look very slow with BSF, though that instruction is executed outside the inner loop, so it has a noticeable influence only with short strings; but you may speed it up on AMD just by using BSF AX,AX instead of EAX,EAX. On Intel that makes little difference (it's a bit slower), but for AMD it makes sense. With a one-reg algo, which produces a 16-bit mask, using a 16-bit BSF on AMD is the right choice.
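To spell it out, since pmovmskb from an XMM register can only ever set bits 0..15, the exit code can be as short as this (a rough sketch, assuming the mask is known to be non-zero at that point - otherwise BSF leaves its destination undefined):

Code: [Select]
    pmovmskb ecx, xmm0      ; only bits 0..15 can ever be set
    bsf      cx, cx         ; 16-bit BSF - the cheaper form on AMD
    ; the upper half of ecx is still zero from pmovmskb,
    ; so ecx now holds the index of the first zero byte in the block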

nidud, here is a tweak of your algo which in this form comes close to the algo posted on the "Research" thread on my machine:

Code: [Select]
strlen proc STDCALL lpszStr:DWORD
mov ecx,[esp+4]
xorps xmm2,xmm2
mov eax,ecx
and ecx,32-1
and eax,-32
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
or edx,-1
shl edx,cl
pcmpeqb xmm0,xmm2
pmovmskb ecx,xmm0
and ecx,edx
jnz done
pcmpeqb xmm1,xmm2
pmovmskb ecx,xmm1
shl ecx,16
and ecx,edx
jnz done
loop_32:
lea eax,[eax+32]
movdqa  xmm0,[eax]
movdqa  xmm1,[eax+16]
pcmpeqb xmm0,xmm2
pcmpeqb xmm1,xmm2
pmovmskb edx,xmm0
pmovmskb ecx,xmm1
add edx,ecx
jbe loop_32
shl ecx,16
pmovmskb edx,xmm0
add ecx,edx
done:

bsf ecx,ecx
lea eax,[ecx+eax]
sub eax,[esp+4]
ret 4
strlen endp

Actually I think that if you use an entry like in my StrLen (just shift the overhead bits away), this inner loop (strangely, the movdqa/pcmpeqb sequence seems faster than the direct memory-operand pcmpeqb), and the loop-exit check I suggested, this code will probably be one of the best StrLen solutions for 32-bit machines. The crazily unrolled Intel solution gains nothing in the real world.
Probably rrr's unrolling is the maximum that makes sense, but it's strange to look at algos that are 600+ or even 800+ bytes long just to get the length of a string on, after all, a real-world machine in a real-world app.
Title: Re: Faster Memcopy ...
Post by: nidud on March 25, 2015, 01:16:14 AM
I do the same thing: try different combinations and select the fastest on my machine, often not knowing why it's faster. Problem is, what happens on another machine?

True, we are "typecast" to the machine we use. However, the difference between them is not that big, so the idea that crt_strlen would suddenly be the fastest algo on a different machine is, I assume, very unlikely.

Quote
Working at Intel, Torbjorn Granlund has access to hundreds of machines (probably) and can try his Silvermont algo on all of them.

We have an array of different machines here also, and from my experience there are some minor differences between AMD and Intel, and also between old and new hardware in general.

Quote
I also think it's ridiculous to use 700 bytes for a strLen algo. This is one reason Microsoft and other modern code is so bloated.

The thing is that the size of available memory and storage increases, transfer speeds increase, and the use of text parsing increases. This means that code size is less important than speed. In addition to this, string and memory functions are used more frequently now than before.

The strlen function is then important since most of the STR functions also need to know the length of the string, so the method used there will be duplicated in these functions.

I have converted most of these functions to correct the overlapping bug. There is, however, some problem when dealing with two pointers, as in strcpy. That was done by reading 16 bytes, testing for zero, copying and continuing. The fastest safe way will then be the old way:
Code: [Select]
len = strlen(src);
memcpy(dst, src, len);

Quote
Just the same, Torbjorn might have some good ideas. There is only one thing he does I've never seen before, uses pminub to take care of the situation you (Antariy and nidud) are discussing right now. That is, integrating the results from 2 or more XMM's reading the string. Perhaps you should take a look at his technique.

I have tested this, but removing the shift doesn't seem to have any effect.

Title: Re: Faster Memcopy ...
Post by: jj2007 on March 25, 2015, 02:27:55 AM
i5 test with a 200 MB file - 50 bibles concatenated:
Code: [Select]
Len(): 55896 us for 1519150 lines, 200850450 bytes
len(): 109353 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51731 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51359 us for 1519150 lines, 200850450 bytes

Len(): 50408 us for 1519150 lines, 200850450 bytes
len(): 108360 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51540 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 52424 us for 1519150 lines, 200850450 bytes

Len(): 52240 us for 1519150 lines, 200850450 bytes
len(): 108979 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 50288 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51747 us for 1519150 lines, 200850450 bytes

Len(): 51231 us for 1519150 lines, 200850450 bytes
len(): 107123 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 49613 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51625 us for 1519150 lines, 200850450 bytes

Len(): 50967 us for 1519150 lines, 200850450 bytes
len(): 107987 us for 1519150 lines, 200850450 bytes
AxStrLenSSE: 51297 us for 1519150 lines, 200850450 bytes
AxStrLenSSE1: 51591 us for 1519150 lines, 200850450 bytes

AxStrLenSSE is a clear winner.
Title: Re: Faster Memcopy ...
Post by: nidud on March 25, 2015, 02:56:03 AM
Well, that’s one way to go I guess   :biggrin:

Here is another approach:
Code: [Select]
public  string_lengths
.data
string_lengths dd 1024 dup(0)
.code
...
.if eax >= 1024
    inc string_lengths[1023*4]
.else
    inc string_lengths[eax*4]
.endif
test eax,eax
ret 4
strlen  endp
Title: Re: Faster Memcopy ...
Post by: guga on November 14, 2019, 08:55:06 PM
Old, old, old post, but I was interested in the timings. I made a small variation of nidud's StrLen. It seems fast and stable, but I didn't measure the results in masm. Can someone give it a try?

Code: [Select]
Proc FastStrlenTest:
    Arguments @Input
    Uses ecx, edx ; RosAsm macro to preserve registers. Just a push ecx+push edx after the start of the function (push ebp | mov ebp esp). Before the function ends at "endP' macro, the registers are restored with pop edx | pop ecx

    mov eax D@Input
    movdqu xmm0 X$eax
    xorps xmm1 xmm1
    pcmpeqb xmm0 xmm1
    pmovmskb ecx xmm0
    and ecx 0-1
    jne @done
        pmovmskb ecx xmm1
        shl ecx 16
        and ecx 0-1
    jne @done
    xorps xmm2 xmm2
@loop_32:

    lea eax D$eax+32
    movdqu xmm0 X$eax
    movdqu  xmm1 X$eax+16
    pcmpeqb xmm0 xmm2 | pmovmskb edx xmm0
    pcmpeqb xmm1 xmm2 | pmovmskb ecx xmm1
    add edx ecx
 jbe @loop_32
    shl ecx 16
    pmovmskb edx xmm0
    add ecx edx
@done:

 bsf ecx ecx
 lea eax D$ecx+eax
 sub eax D@Input

EndP

Note: I modified the beginning of nidud's algo, because the address at the start may result in a crash if the string is located exactly at the beginning of the memory addressing. That's why I removed the initial
Code: [Select]
    and ecx 32-1
    and eax 0-32

Btw... the function seems to work with any size of null-terminated string... from 1 to XXXXX


Need to fully understand this, in order to try making a Unicode version too :)
Title: Re: Faster Memcopy ...
Post by: nidud on November 14, 2019, 11:14:04 PM
Note: I modified the beginning of nidud's algo, because the address at the start may result in a crash if the string is located exactly at the beginning of the memory addressing. That's why I removed the initial

Not sure this is possible, and the benchmark also tests for overlapping memory moves.

So did this actually happen, or is it just a theory?
Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 12:21:09 AM
I saw that while I was debugging the function. It didn't crash, but the pointer entered previously allocated memory (28 bytes before the start of the input). It didn't crash, probably because the memory was correctly allocated, so it didn't reach its beginning.

Ex: say the memory starts at 100000 and the function uses a pointer at the very 1st byte, 100001. What could happen is that the and operation will force the inputted pointer to start at 100000-32 bytes (or something), and if the area before that address is not allocated, it will most likely crash. However, it didn't crash in my tests, but I made that small adaptation just to make sure it won't ever crash.

Btw... it's an amazing algorithm. I wonder if it can be optimized further. I made a small test comparing pcmpeqb xmm0 xmm1 rather than pcmpeqb xmm0 xmm2 / pcmpeqb xmm1 xmm2, but I couldn't understand what this opcode is doing exactly yet.

The timings are simply great and it does find the len for small strings etc... while the AxStrLenSSE2 version seems to have some errors in the results (at least, if I ported it properly)

(https://i.ibb.co/d7N8vt4/vsdfagva-Image1.png) (https://ibb.co/d7N8vt4)
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 02:22:04 AM
I saw that while I was debugging the function. It didn't crash, but the pointer entered previously allocated memory (28 bytes before the start of the input). It didn't crash, probably because the memory was correctly allocated, so it didn't reach its beginning.

Ex: say the memory starts at 100000 and the function uses a pointer at the very 1st byte, 100001. What could happen is that the and operation will force the inputted pointer to start at 100000-32 bytes (or something), and if the area before that address is not allocated, it will most likely crash. However, it didn't crash in my tests, but I made that small adaptation just to make sure it won't ever crash.

Memory is page-aligned so it never starts at an unaligned address. The problem is more about the end, as addressed in this thread. The situation may be explained like this:

0              1             2
[PAGE_NOACCESS][current page][PAGE_NOACCESS]

Assuming the address is 0x1000 (the start of the current page), the alignment code will have no effect ((0x1000 & -32) = 0x1000), so page 0 will never be accessed. However, if the string is two bytes at address 0x1FFE and the alignment code is removed, you will start by reading 30 bytes of page 2. To avoid this the address is set to 0x2000-32 and a bit-mask is added for the valid bytes (the two upper bits in this case) in the first read.
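Plugging that 0x1FFE case into the entry code of the 32-byte routine discussed above gives, just as a commented walk-through (register roles as in the code quoted earlier; the literal address is only for illustration):

Code: [Select]
    mov  ecx, 01FFEh     ; the string pointer ("a",0 sits at 0x1FFE-0x1FFF)
    mov  eax, ecx
    and  eax, -32        ; eax = 0x1FE0 = 0x2000-32: the read stays inside the current page
    and  ecx, 32-1       ; ecx = 30: offset of the string inside the aligned block
    or   edx, -1
    shl  edx, cl         ; edx = 0C0000000h: only the two upper bits are valid
    ; the match mask from the 32 aligned bytes is AND-ed with edx, which
    ; throws away the 30 garbage bytes in front of the string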

Quote
I made a small test comparing pcmpeqb xmm0 xmm1 rather then pcmpeqb xmm0 xmm2/pcmpeqb xmm1 xmm2 but couldn´t understand what this opcode is doing exactly yet.

It compares 16 bytes and, for each byte that compares equal, sets that byte to 0FFh in the first operand; pmovmskb then converts the result to 16 bits.
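In other words, the basic building block in all these routines is something like this (a sketch; xmm1 is assumed to be zeroed beforehand and the label is only illustrative):

Code: [Select]
    movdqu   xmm0, [eax]     ; 16 bytes of the string
    pcmpeqb  xmm0, xmm1      ; bytes equal to 0 become 0FFh, the rest become 0
    pmovmskb ecx, xmm0       ; collect the 16 sign bits into bits 0..15 of ecx
    test     ecx, ecx
    jz       next_block      ; mask 0 = no terminator in these 16 bytes
    bsf      ecx, ecx        ; otherwise: bit index = offset of the first zero byte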

Quote
The timmings are simply great and it do finds the len for small strings etc...while AxStrLenSSE2 version seems to have some errors on the results (at least, if i ported it properly)

It's an old test-bed and it won't assemble with the current version of Asmc, so it needs an update.
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 05:49:03 AM
I updated the 32-bit benchmark test (https://github.com/nidud/asmc/tree/master/source/test/benchmark/x86).

A 16 byte version seems better, at least on the machine I use now.

    mov     eax,[esp+4]
    mov     ecx,[esp+4]
    and     eax,-16
    and     ecx,16-1
    or      edx,-1
    shl     edx,cl
    xorps   xmm0,xmm0
    pcmpeqb xmm0,[eax]
    add     eax,16
    pmovmskb ecx,xmm0
    xorps   xmm0,xmm0
    and     ecx,edx
    jnz     L2
L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0
    pmovmskb ecx,xmm1
    add     eax,16
    test    ecx,ecx
    jz      L1
L2:
    bsf     ecx,ecx
    lea     eax,[eax+ecx-16]
    sub     eax,[esp+4]
    ret

bench:

total [0 .. 40], 8++
   339601 cycles 3.asm: SSE Intel Silvermont
   370809 cycles 1.asm: SSE 16
   465072 cycles 2.asm: SSE 32
   756379 cycles 4.asm: SSE Intel Atom
   853326 cycles 0.asm: msvcrt.strlen()
total [41 .. 80], 7++
   301950 cycles 3.asm: SSE Intel Silvermont
   327903 cycles 1.asm: SSE 16
   423696 cycles 2.asm: SSE 32
   649882 cycles 4.asm: SSE Intel Atom
  1262412 cycles 0.asm: msvcrt.strlen()
total [600 .. 1000], 100++
   239360 cycles 3.asm: SSE Intel Silvermont
   308941 cycles 4.asm: SSE Intel Atom
   481037 cycles 2.asm: SSE 32
   485014 cycles 1.asm: SSE 16
  2005072 cycles 0.asm: msvcrt.strlen()
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 15, 2019, 08:52:04 AM
Here is a test run with your (marginally modified) 1.asm ... 4.asm and other StrLen algos.
The 3.asm algo returns a wrong value, I have not found out why.

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6931    cycles for 100 * CRT strlen
5372    cycles for 100 * Masm32 StrLen
19701   cycles for 100 * Windows lstrlen
2006    cycles for 100 * MasmBasic Len
1579    cycles for 100 * Algo1
1712    cycles for 100 * Algo2
1103    cycles for 100 * Algo3
3209    cycles for 100 * Algo4

6941    cycles for 100 * CRT strlen
5393    cycles for 100 * Masm32 StrLen
19722   cycles for 100 * Windows lstrlen
1994    cycles for 100 * MasmBasic Len
1587    cycles for 100 * Algo1
1722    cycles for 100 * Algo2
1102    cycles for 100 * Algo3
3227    cycles for 100 * Algo4

6938    cycles for 100 * CRT strlen
5380    cycles for 100 * Masm32 StrLen
19692   cycles for 100 * Windows lstrlen
2011    cycles for 100 * MasmBasic Len
1580    cycles for 100 * Algo1
1713    cycles for 100 * Algo2
1101    cycles for 100 * Algo3
3205    cycles for 100 * Algo4

6970    cycles for 100 * CRT strlen
5404    cycles for 100 * Masm32 StrLen
19710   cycles for 100 * Windows lstrlen
2010    cycles for 100 * MasmBasic Len
1589    cycles for 100 * Algo1
1710    cycles for 100 * Algo2
1104    cycles for 100 * Algo3
3204    cycles for 100 * Algo4

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
82      bytes for Algo1
114     bytes for Algo2
690     bytes for Algo3
761     bytes for Algo4

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1
100     = eax Algo2
14      = eax Algo3
100     = eax Algo4

This is the clear winner:
Code: [Select]
; include 1j.asm
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]
  and eax,-16
  and ecx,16-1
  or edx,-1
  shl edx,cl
  xorps xmm0,xmm0
  pcmpeqb xmm0,[eax]
  add eax,16
  pmovmskb ecx,xmm0
  ; xorps xmm0,xmm0 ; ??
  and ecx,edx
  jnz L2
L1:
  movaps xmm1,[eax]
  pcmpeqb xmm1,xmm0
  pmovmskb ecx,xmm1
  add eax,16
  test ecx,ecx
  jz L1
L2:
  bsf ecx,ecx
  lea eax,[eax+ecx-16]
  sub eax,[esp+4]
  retn 4
Algo1 endp
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 09:34:12 AM
Code: [Select]
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]

Actually they are the same (speed-wise) but selected by size for alignment of L1.

  ; xorps xmm0,xmm0   ; ??

This must be zero for compare below:

L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 09:48:25 AM
This is the Unicode version. Note that the string must be aligned 2 (which is mostly the case with Unicode strings) for this to work.

    mov     eax,[esp+4]
    bt      eax,0
    jc      L3
    mov     ecx,[esp+4]
    and     eax,-16
    and     ecx,16-1
    or      edx,-1
    shl     edx,cl
    pxor    xmm0,xmm0
    pcmpeqw xmm0,[eax]
    add     eax,16
    pmovmskb ecx,xmm0
    pxor    xmm0,xmm0
    and     ecx,edx
    jnz     L2
L1:
    movaps  xmm1,[eax]
    pcmpeqw xmm1,xmm0
    pmovmskb ecx,xmm1
    add     eax,16
    test    ecx,ecx
    jz      L1
L2:
    bsf     ecx,ecx
    lea     eax,[eax+ecx-16]
    sub     eax,[esp+4]
    shr     eax,1
    ret
L3:
    mov     edx,edi
    mov     edi,eax
    xor     eax,eax
    or      ecx,-1
    repne   scasw
    not     ecx
    dec     ecx
    mov     eax,ecx
    mov     edi,edx
    ret

Result:

total [0 .. 40], 8++
   575817 cycles 2.asm: SSE 16
  3081171 cycles 0.asm: msvcrt.wcslen()
  4261124 cycles 1.asm: scasw
total [41 .. 80], 7++
   629595 cycles 2.asm: SSE 16
  4696938 cycles 1.asm: scasw
  4742392 cycles 0.asm: msvcrt.wcslen()
total [600 .. 1000], 100++
   987251 cycles 2.asm: SSE 16
  7455315 cycles 1.asm: scasw
  8530590 cycles 0.asm: msvcrt.wcslen()
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 15, 2019, 10:44:32 AM
Code: [Select]
Algo1 proc
  mov eax,[esp+4]
  mov ecx,eax ; much faster than [esp+4]

Actually they are the same (speed-wise) but selected by size for alignment of L1.

On my CPU the algo is over 2% faster with mov ecx, eax

Quote

  ; xorps xmm0,xmm0   ; ??

This must be zero for compare below:

L1:
    movaps  xmm1,[eax]
    pcmpeqb xmm1,xmm0


I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 11:12:13 AM
On my CPU the algo is over 2% faster with mov ecx, eax

The first move is slower but the next should (in theory) be the same. It's difficult to test this in a loop given it ends up in the cache after the first move. A simple test should then be more or less equal.

    mov eax,[esp+4]
    mov ecx,[esp+4]
    mov edx,[esp+4]
...
    mov eax,[esp+4]
    mov ecx,eax
    mov edx,eax
...
Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- test(0)
    35942 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    35951 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx
-- test(1)
    35058 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    36532 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx
-- test(2)
    34778 cycles, rep(10000), code( 13) 0.asm: mov eax,[esp+4]
    35262 cycles, rep(10000), code(  9) 1.asm: mov eax,ecx

total [0 .. 2], 1++
   105778 cycles 0.asm: mov eax,[esp+4]
   107745 cycles 1.asm: mov eax,ecx

Quote
I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

Yes. The alignment may go back 15 bytes and they may all be zero.
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 15, 2019, 11:32:06 AM
Quote
I saw that but can pcmpeqb xmm0,[eax] produce a non-zero result without jumping to L2?

Yes. The alignment may go back 15 bytes and they may all be zero.

I've tested it with a 100 byte string, increasing the pointer 100 times. "All zero" shouldn't be a problem. But I admit I am too tired right now to understand it ;-)

Here's my current version of your fast algo, 62 bytes short and considerably faster than the original:
Code: [Select]
Algo1 proc
  mov eax, [esp+4]
  mov ecx, eax ; much faster than [esp+4]
  and eax, -16
  and ecx, 16-1
  or edx, -1
  shl edx, cl
  xorps xmm0, xmm0 ; needed for short strings
  pcmpeqb xmm0, [eax]
  pmovmskb ecx, xmm0
;   xorps xmm0, xmm0 ; ??
  and ecx, edx ; short string?
  jnz L2
L1:
  add eax, 16
  movaps xmm1, [eax]
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1
  test ecx, ecx
  jz L1
L2:
  bsf ecx, ecx
  add eax, ecx
  sub eax, [esp+4]
  retn 4
Algo1 endp
Title: Re: Faster Memcopy ...
Post by: nidud on November 15, 2019, 12:31:13 PM
Well, if you feed it with this string:

    align 16
    db 15 dup(0)
err db "error",0

the aligned (first) compare will be:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e | r r o r 0

xmm0 will be:
00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

and there will be no jump to L2
Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 02:22:37 PM
Thanks, guys, I'll give it a test for benchmarking the speed.

Nidud, about the memory issue... Indeed you are right. I was confused, thinking the algo would write something to a protected area, but it does not. It's only computing the offset of where to start calculating the length - is that it? Thanks for the Unicode version. I'll test it right now.

JJ... I tried to assemble your file, but MasmBasic throws a configuration error. I think I do have UAsm in the same path as ml.exe, but this is what is being shown to me:

(https://i.ibb.co/TMS9xWW/vfsdgf-Image1.png) (https://ibb.co/TMS9xWW)

How can I configure MasmBasic properly to make it work with this test?
Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 03:05:45 PM
Hi Nidud... The Unicode version is working as expected :)

One question: how can I implement a check inside the function to see whether the string is really Unicode or not, while it is calculating the length?

I mean... say I have a bad Unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How can I make the function check for the bad chars 'hi' and return a value of... say 0-1 (meaning the function found an error)?

The RosAsm porting of your function, is like this:
Code: [Select]

Proc UnicodeStrlen:
    Arguments @Input
    Uses ecx, edx, edi ; <---- preserve ecx, edx, edi - I benchmark the speed with all algos preserving the registers, so they all behave the same way, to see which one is faster etc. under the very same conditions.

    mov eax D@Input
    bt eax 0 | jc L3>
    mov ecx D@Input
    and eax 0-16
    and ecx 16-1
    or  edx 0-1
    shl edx cl
    pxor xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    pxor xmm0 xmm0
    and  ecx edx
    jnz  L2>

L1:
    movups  xmm1 X$eax
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz  L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1
    ExitP
L3:
    mov  edx edi
    mov  edi eax
    xor  eax eax
    or   ecx 0-1
    repne scasw
    not  ecx
    dec  ecx
    mov  eax ecx
    mov  edi edx

EndP

Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 03:26:18 PM
About the Unicode version: JJ, I ported yours to work with Unicode as well, and it seems to work OK, and it is faster on my machine. Can you give it a test with your benchmark function to compare the speeds, please?

JJ Unicode Version
Code: [Select]
Proc UniStrLenJJ:
    Arguments @Input
    Uses ecx, edx ; <--- preserved register
   
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqw xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax ; <---- changed to the unaligned version in both the Unicode and ANSI versions. A bit faster on my machine, and prevents crashing when calculating unaligned strings
    pcmpeqw xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx ecx
    lea eax D$eax+ecx-16
    sub eax D@Input
    shr eax 1

EndP

JJ Ansi version
Code: [Select]
Proc StrLenJJ:
    Arguments @Input
    Uses ecx, edx
   
    mov eax D@Input
    mov ecx eax ; much faster than [esp+4]
    and eax 0-16
    and ecx 16-1
    or edx 0-1
    shl edx cl
    xorps xmm0 xmm0
    pcmpeqb xmm0 X$eax
    add eax 16
    pmovmskb ecx xmm0
    xorps xmm0 xmm0
    and ecx edx
    jnz L2>
L1:
    movups xmm1 X$eax
    pcmpeqb xmm1 xmm0
    pmovmskb ecx xmm1
    add eax 16
    test ecx ecx
    jz L1<
L2:
    bsf ecx,ecx
    lea eax D$eax+ecx-16
    sub eax D@Input

EndP

In my tests, the ANSI version and the Unicode version do not vary that much in speed

ANSI version, using the following string (99 chars... I'm testing odd and even lengths, big strings and tiny ones of 1 byte only):
[TestAnsi: B$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timing (in nanosecs): 2.95913440552019 ns

Unicode version
[TestUnicode: U$ "Hello, this is a simple string intended for testing string algos. It has 100 characters without zer" , 0]
Average timing (in nanosecs): 3.46197153137555 ns



Btw... same question I asked Nidud: how can I make the Unicode version check whether the string being measured is really Unicode or not, and return 0-1 if it finds non-Unicode chars while it is calculating the length?
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 15, 2019, 08:56:43 PM
Well, if you feed it with this string:

    align 16
    db 15 dup(0)
err db "error",0

the aligned (first) compare will be:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 e | r r o r 0

xmm0 will be:
00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

and there will be no jump to L2

Correct :thumbsup:

ABout the Unicode version. JJ, i ported yours to work with Unicode as well, and it seem to work ok, and it is faster on my machine. Can you give a test on your benchmark function to compare the speeds, pls ?

If you translate it to Masm syntax, I can do that  :thup:
Title: Re: Faster Memcopy ...
Post by: guga on November 15, 2019, 09:43:21 PM
Hi JJ. Thanks :)



I believe the port to masm is something like this:

Code: [Select]

UniStrLenJJ proc Input :DWORD
             ; <--- uses ebp rather than esp. There should be a push ebp | mov ebp, esp at the start. Forgot how to include that - maybe using the invoke token in masm, right?
    push ecx
    push edx

    mov eax, Input
    mov ecx, eax
    and eax, -16
    and ecx, 16-1
    or edx, -1
    shl edx, cl
    xorps xmm0, xmm0
    pcmpeqw xmm0, [eax] ; <---- or is it xmmword ptr [eax]? Don't remember if there is an xmmword operator in masm. It's the same as in Algo1, but using pcmpeqw rather than pcmpeqb
    add eax, 16
    pmovmskb ecx, xmm0
    xorps xmm0, xmm0
    and ecx, edx
    jnz short Out1

InnerLoop:
     movups  xmm1, [eax] ; <---- or is it xmmword ptr [eax]? Don't remember if there is an xmmword operator in masm
     pcmpeqw xmm1, xmm0
     pmovmskb ecx, xmm1
     add  eax, 16
     test ecx, ecx
     jz short InnerLoop

Out1:
     bsf ecx, ecx
     lea eax, [ecx+eax-16]
     sub eax, Input
     shr eax, 1

     pop edx
     pop ecx
     retn 4 ; <----- or a simple ret. Don't remember the syntax; before the ret there should be mov esp, ebp | pop ebp instructions
UniStrLenJJ     endp
Title: Re: Faster Memcopy ...
Post by: nidud on November 16, 2019, 12:15:47 AM
Nidud, about the memory issue...Indeed you are right. I was confused thinking the algo would write something to a protected area, but, it is not.

Alignment of data is normally used to increase performance, but in this case it also prevents the function from reading (ahead) into protected memory, so if you remove the alignment the function will blow up.

Quote
It´s only computing the difference where to start calculating the lenght is that it ?

In addition to increasing performance, yes. Consider the "hello" string at the end of the buffer, moved/compared into xmm0. The valid bytes in the aligned block are A to F.

 0 0 0 0 0 0 0 0 0 0 h e l l o 0|p r o t e c t e d 0
                    [0 1 2 3 4 5 6 7 8 9 A B C D E F]
[0 1 2 3 4 5 6 7 8 9 A B C D E F]

One question: how can I implement a check inside the function to see whether the string is really Unicode or not, while it is calculating the length?

I mean... say I have a bad Unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How can I make the function check for the bad chars 'hi' and return a value of... say 0-1 (meaning the function found an error)?

This is not something you do in an optimized string routine like strlen so don't worry about that.

Quote
The RosAsm porting of your function, is like this:

I don't know how RosAsm implements prologue code, but I assume a stack frame is created, so this adds code on top which will destroy the alignment. The algorithm is designed as a bare-bone modular construct without any additional code added. Any changes above L1 should therefore (at best) adapt to this logic. A list file is helpful to view how this plays out:
00000000  8B442404                  mov     eax,[esp+4]
00000004  0FBAE000                  bt      eax,0
00000008  7246                      jc      L3
0000000A  8B4C2404                  mov     ecx,[esp+4]
0000000E  83E0F0                    and     eax,-16
00000011  83E10F                    and     ecx,16-1
00000014  83CAFF                    or      edx,-1
00000017  D3E2                      shl     edx,cl
00000019  660FEFC0                  pxor    xmm0,xmm0
0000001D  660F7500                  pcmpeqw xmm0,[eax]
00000021  83C010                    add     eax,16
00000024  660FD7C8                  pmovmskb ecx,xmm0
00000028  660FEFC0                  pxor    xmm0,xmm0
0000002C  23CA                      and     ecx,edx
0000002E  7512                      jnz     L2
00000030                        L1:

If you add code to the mix the alignment breaks.

00000000  55                        push    ebp
00000001  8BEC                      mov     ebp,esp
...
00000033                        L1:

The code above the loop is not that important speed-wise, so you basically select instructions by size to align the loop below (L1).

00000000  55                        push    ebp
00000001  8BEC                      mov     ebp,esp
00000003  8B4508                    mov     eax,[ebp+8] ; - one byte
00000006  0FBAE000                  bt      eax,0
0000000A  7244                      jc      L3
0000000C  8BC8                      mov     ecx,eax     ; - two bytes
0000000E  83E0F0                    and     eax,-16
00000011  83E10F                    and     ecx,16-1
00000014  83CAFF                    or      edx,-1
00000017  D3E2                      shl     edx,cl
00000019  660FEFC0                  pxor    xmm0,xmm0
0000001D  660F7500                  pcmpeqw xmm0,[eax]
00000021  83C010                    add     eax,16
00000024  660FD7C8                  pmovmskb ecx,xmm0
00000028  660FEFC0                  pxor    xmm0,xmm0
0000002C  23CA                      and     ecx,edx
0000002E  7512                      jnz     L2
00000030                        L1:

Quote
Proc UnicodeStrlen:
    Arguments @Input
    Uses ecx, edx, edi ; <----preserve ecx, edx, edi registers - I benchmark the speed with all algos preserving the registers to be sure they are all behaving on the same way to see which one is faster etc under the very same conditions.

Well, preserving registers here is ill-advised (and edi is not even used), but it is possible to save two bytes (making room for two pushes) by flipping PXOR to XORPS. This approach may be more reliable than using a benchmark to adjust these things.
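For the record, the encodings involved (a side note; byte counts as in the list files above):

Code: [Select]
660FEFC0    pxor    xmm0,xmm0   ; 4 bytes
0F57C0      xorps   xmm0,xmm0   ; 3 bytes - same effect when only zeroing a register
51          push    ecx         ; 1 byte
52          push    edx         ; 1 byte

Flipping the two PXORs to XORPS frees exactly the 2 bytes needed for the two pushes, so the L1 loop keeps its 16-byte alignment.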
Title: Re: Faster Memcopy ...
Post by: AW on November 16, 2019, 01:12:02 AM
This performs better with small strings:

Code: [Select]
    .686
    .xmm
    .model flat

    .code
    mov     eax,-16
    mov     edx,[esp+4]
xorps xmm0, xmm0
@loop:
    add eax, 16
    PCMPISTRI xmm0, xmmword ptr [edx + eax], 1000b
    jnz @loop
    add eax, ecx
    ret

    end

total [0 .. 40], 8++
   262586 cycles 5.asm: PCMPISTRI
   329094 cycles 3.asm: SSE Intel Silvermont
   366495 cycles 1.asm: SSE 16
   453912 cycles 2.asm: SSE 32
   682335 cycles 0.asm: msvcrt.strlen()
   728151 cycles 4.asm: SSE Intel Atom
hit any key to continue...

(*) Note: it can read up to 15 characters past the end of the string and cause an exception in some cases, normally when crossing pages.
Title: Re: Faster Memcopy ...
Post by: nidud on November 16, 2019, 02:39:42 AM
64-bit version of wcslen:

    .code

wcslen::

    test    ecx,1
    jnz     L3
    mov     r8,rcx
    mov     rax,rcx
    and     rax,-32
    and     ecx,32-1
    mov     edx,-1
    shl     edx,cl
    pxor    xmm0,xmm0
    vpcmpeqw ymm1,ymm0,[rax]
    add     rax,32
    vpmovmskb ecx,ymm1
    and     ecx,edx
    jnz     L2
L1:
    vpcmpeqw ymm1,ymm0,[rax]
    vpmovmskb ecx,ymm1
    add     rax,32
    test    ecx,ecx
    jz      L1
L2:
    bsf     ecx,ecx
    lea     rax,[rax+rcx-32]
    sub     rax,r8
    shr     rax,1
    ret
L3:
    push    rdi
    mov     rdi,rcx
    xor     eax,eax
    mov     rcx,-1
    repne   scasw
    not     rcx
    dec     rcx
    mov     rax,rcx
    pop     rdi
    ret

    end

bench:

total [0 .. 40], 8++
   338995 cycles 3.asm: AVX 32
   509346 cycles 2.asm: SSE 16
  1565427 cycles 0.asm: msvcrt.wcslen()
  4166297 cycles 1.asm: scasw
total [41 .. 80], 7++
   311340 cycles 3.asm: AVX 32
   605787 cycles 2.asm: SSE 16
  2549266 cycles 0.asm: msvcrt.wcslen()
  4701738 cycles 1.asm: scasw
total [600 .. 1000], 100++
   425699 cycles 3.asm: AVX 32
  1081618 cycles 2.asm: SSE 16
  3821484 cycles 0.asm: msvcrt.wcslen()
  7475438 cycles 1.asm: scasw
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 16, 2019, 03:47:45 AM
Testbed with Guga's Unicode version, using ebp:

Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7047    cycles for 100 * CRT strlen
5265    cycles for 100 * Masm32 StrLen
19964   cycles for 100 * Windows lstrlen
1711    cycles for 100 * MasmBasic Len
1499    cycles for 100 * Algo1
2918    cycles for 100 * AlgoGuga

7097    cycles for 100 * CRT strlen
5311    cycles for 100 * Masm32 StrLen
19940   cycles for 100 * Windows lstrlen
1704    cycles for 100 * MasmBasic Len
1497    cycles for 100 * Algo1
2929    cycles for 100 * AlgoGuga

7030    cycles for 100 * CRT strlen
5250    cycles for 100 * Masm32 StrLen
19956   cycles for 100 * Windows lstrlen
1674    cycles for 100 * MasmBasic Len
1494    cycles for 100 * Algo1
2935    cycles for 100 * AlgoGuga

7057    cycles for 100 * CRT strlen
5268    cycles for 100 * Masm32 StrLen
20014   cycles for 100 * Windows lstrlen
1714    cycles for 100 * MasmBasic Len
1511    cycles for 100 * Algo1
3000    cycles for 100 * AlgoGuga

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
74      bytes for Algo1
87      bytes for AlgoGuga

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1
100     = eax AlgoGuga
Title: Re: Faster Memcopy ...
Post by: AW on November 16, 2019, 05:20:30 AM
This is the Unicode version using PCMPISTRI

Code: [Select]
    .code
    mov     rax,-16
    mov     rdx, rcx
    xorps xmm0, xmm0
@loop:
    add rax, 16
    PCMPISTRI xmm0, xmmword ptr [rdx + rax], 1001b
    jnz @loop
    shr eax, 1
    add eax, ecx
    ret
    end

total [0 .. 40], 8++
   323358 cycles 3.asm: AVX 32
   404277 cycles 4.asm: PCMPISTRI
   477789 cycles 2.asm: SSE 16
  1237417 cycles 0.asm: msvcrt.wcslen()
  3886924 cycles 1.asm: scasw

  total [41 .. 80], 7++
   291655 cycles 3.asm: AVX 32
   562089 cycles 2.asm: SSE 16
   563122 cycles 4.asm: PCMPISTRI
  1935096 cycles 0.asm: msvcrt.wcslen()
  4320489 cycles 1.asm: scasw
 
  total [600 .. 1000], 100++
   333669 cycles 3.asm: AVX 32
   982307 cycles 2.asm: SSE 16
  1405725 cycles 4.asm: PCMPISTRI
  3490272 cycles 0.asm: msvcrt.wcslen()
  6914474 cycles 1.asm: scasw
Title: Re: Faster Memcopy ...
Post by: nidud on November 16, 2019, 06:30:24 AM
Aligned version

    .code
    mov     r8,rcx
    mov     rax,rcx
    and     rax,-16
    and     rcx,16-1
    mov     edx,-1
    shl     edx,cl
    pxor    xmm0,xmm0
    pcmpeqw xmm0,[rax]
    add     rax,16
    pmovmskb ecx,xmm0
    pxor    xmm0,xmm0
    and     ecx,edx
    bsf     ecx,ecx
    jnz     L2
L1:
    pcmpistri xmm0,[rax],9
    lea     rax,[rax+16]
    jnz     L1
L2:
    lea     rax,[rax+rcx-16]
    sub     rax,r8
    shr     rax,1
    ret
    end

Similar results

total [0 .. 40], 8++
   339355 cycles 3.asm: AVX 32
   468741 cycles 4.asm: SSE4_2 PCMPISTRI
   508062 cycles 2.asm: SSE 16
  1618825 cycles 0.asm: msvcrt.wcslen()
  4159568 cycles 1.asm: scasw

total [41 .. 80], 7++
   314024 cycles 3.asm: AVX 32
   609103 cycles 2.asm: SSE 16
   622895 cycles 4.asm: SSE4_2 PCMPISTRI
  2544738 cycles 0.asm: msvcrt.wcslen()
  4694525 cycles 1.asm: scasw

total [600 .. 1000], 100++
   427188 cycles 3.asm: AVX 32
  1081861 cycles 2.asm: SSE 16
  1558990 cycles 4.asm: SSE4_2 PCMPISTRI
  3815976 cycles 0.asm: msvcrt.wcslen()
  7475706 cycles 1.asm: scasw
Title: Re: Faster Memcopy ...
Post by: guga on November 16, 2019, 10:23:57 AM
Thanks a lot, JJ :)

It seems fast, as expected.
Here are the results:

Code: [Select]
AMD Ryzen 5 2400G with Radeon Vega Graphics     (SSE4)

5842 cycles for 100 * CRT strlen
5813 cycles for 100 * Masm32 StrLen
18946 cycles for 100 * Windows lstrlen
1878 cycles for 100 * MasmBasic Len
1545 cycles for 100 * Algo1
2579 cycles for 100 * AlgoGuga

5769 cycles for 100 * CRT strlen
5711 cycles for 100 * Masm32 StrLen
18825 cycles for 100 * Windows lstrlen
2350 cycles for 100 * MasmBasic Len
1917 cycles for 100 * Algo1
3390 cycles for 100 * AlgoGuga

7980 cycles for 100 * CRT strlen
7723 cycles for 100 * Masm32 StrLen
24159 cycles for 100 * Windows lstrlen
2372 cycles for 100 * MasmBasic Len
1930 cycles for 100 * Algo1
2587 cycles for 100 * AlgoGuga

6088 cycles for 100 * CRT strlen
7132 cycles for 100 * Masm32 StrLen
22808 cycles for 100 * Windows lstrlen
1673 cycles for 100 * MasmBasic Len
1609 cycles for 100 * Algo1
2636 cycles for 100 * AlgoGuga

14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
74 bytes for Algo1
87 bytes for AlgoGuga

100 = eax CRT strlen
100 = eax Masm32 StrLen
100 = eax Windows lstrlen
100 = eax MasmBasic Len
100 = eax Algo1
100 = eax AlgoGuga

--- ok ---


Title: Re: Faster Memcopy ...
Post by: jj2007 on November 16, 2019, 03:36:03 PM
Tks a lot, JJ :)

It seems it is as fast as expected.

Thanks, Guga. Here is a new version, 59 bytes short and pretty fast:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6984    cycles for 100 * CRT strlen
5216    cycles for 100 * Masm32 StrLen
19898   cycles for 100 * Windows lstrlen
1674    cycles for 100 * MasmBasic Len
1500    cycles for 100 * Algo1/Nidud
2608    cycles for 100 * Algo1/Guga+JJ

7035    cycles for 100 * CRT strlen
5249    cycles for 100 * Masm32 StrLen
19929   cycles for 100 * Windows lstrlen
1700    cycles for 100 * MasmBasic Len
1569    cycles for 100 * Algo1/Nidud
2625    cycles for 100 * Algo1/Guga+JJ

7087    cycles for 100 * CRT strlen
5215    cycles for 100 * Masm32 StrLen
19859   cycles for 100 * Windows lstrlen
1701    cycles for 100 * MasmBasic Len
1498    cycles for 100 * Algo1/Nidud
2644    cycles for 100 * Algo1/Guga+JJ

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
53      bytes for Algo1/Nidud
59      bytes for Algo1/Guga+JJ

100     = eax CRT strlen
100     = eax Masm32 StrLen
100     = eax Windows lstrlen
100     = eax MasmBasic Len
100     = eax Algo1/Nidud
100     = eax Algo1/Guga+JJ

Note that results are not fully comparable. Masm32 len(), MasmBasic Len() (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1144) and Algo1/Guga+JJ preserve edx and ecx. This is unusual, but the good ol' Masm32 len() algo did this, so in order to avoid compatibility problems, Len() (http://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1144) behaves the same. In addition, Len() preserves xmm0. Still, Len() is over 4 times faster than CRT strlen :cool:
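
For reference, a minimal sketch of what such a register-preserving wrapper might look like around an SSE core (32-bit, stdcall). The core name Algo1Core, the stack layout and the offsets are illustrative, not the actual MasmBasic code:
Code: [Select]
    .686
    .xmm
    .model flat, stdcall

Algo1Core proto c :DWORD            ; hypothetical SSE strlen core, result in eax

    .code
; hypothetical wrapper: returns the length in eax, ecx/edx/xmm0 come back unchanged
LenPreserving proc
    push    ecx
    push    edx
    sub     esp, 16
    movups  xmmword ptr [esp], xmm0 ; save xmm0 with a plain unaligned store
    mov     eax, [esp+28]           ; the string pointer: 24 bytes pushed + 4 for the return address
    push    eax
    call    Algo1Core
    add     esp, 4
    movups  xmm0, xmmword ptr [esp] ; restore xmm0, edx, ecx
    add     esp, 16
    pop     edx
    pop     ecx
    ret     4
LenPreserving endp
    end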
Title: Re: Faster Memcopy ...
Post by: daydreamer on November 16, 2019, 08:04:12 PM
Hi Nidud.... The Unicode version is working as expected :)

One question: how can I implement a check inside the function to see whether the string is really Unicode or not, while it is calculating the length?

I mean... say I have a bad Unicode string like this:

[TestingUnicode: B$ 'H', 0, 'e', 0, 'hi', 0]

How can the function be made to check for the bad chars 'hi' and return a value of... say 0-1 (meaning the function found an error)?


In my experience, when I use a text editor and start to put in Unicode characters, the editor detects that I have entered or copy/pasted Unicode and asks whether I want to save in Unicode format or lose that info by saving in good old ASCII. My suggestion is to support a Unicode-or-not flag somewhere in the copy routine, so that if it is used by a text editor/GUI creator, code can set the Unicode flag as soon as it detects Unicode input from the user. Maybe also add a superfast routine that expands ASCII to Unicode when that happens?
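
As an aside, a minimal word-by-word sketch of the kind of check guga asks for above (reject anything that is not a simple "char + 0" word while counting). The routine name and the register usage are illustrative only, and it is deliberately scalar rather than SIMD:
Code: [Select]
    .code
wlen_checked proc              ; rcx = pointer to a zero-terminated word string
    xor     rax, rax
@next:
    movzx   edx, word ptr [rcx+rax*2]
    test    edx, edx
    jz      @done              ; null word: rax = length in characters
    cmp     edx, 0FFh
    ja      @bad               ; high byte is not zero: not a simple "char + 0" string
    inc     rax
    jmp     @next
@done:
    ret
@bad:
    mov     rax, -1            ; report the error as 0-1, as suggested
    ret
wlen_checked endp
    end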
Title: Re: Faster Memcopy ...
Post by: hutch-- on November 16, 2019, 09:14:42 PM
Just write a UNICODE editor, no conversions.
Title: Re: Faster Memcopy ...
Post by: AW on November 17, 2019, 01:18:53 AM
Visual Studio and some text editors like Notepad++ detect Unicode on save.
But in this thread, what we have been calling Unicode is word-length characters, i.e. the UTF-16 character set without the supplementary planes for emoji and other stuff. This is what Windows uses internally. For programming and for most purposes it is perfectly OK, no need to complicate things further. However, there is more to it.

Windows has the IsTextUnicode API function, but it sometimes fails. I have seen and used better algos, but I am never sure they will not fail somewhere.
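
For anyone who wants to try it, a hedged 32-bit sketch of calling that API; the sample buffer is made up, and the flag constant (from winnt.h) is spelled out rather than taken from an include file:
Code: [Select]
    .386
    .model flat, stdcall
    option casemap:none

    includelib \masm32\lib\advapi32.lib
    IsTextUnicode proto :DWORD, :DWORD, :DWORD   ; BOOL IsTextUnicode(const void *lpv, int iSize, int *lpiResult)
    IS_TEXT_UNICODE_UNICODE_MASK equ 0Fh         ; the generic "looks like UTF-16LE" tests

    .data
    sample  dw 'H','e','l','l','o',0             ; made-up test buffer
    uflags  dd IS_TEXT_UNICODE_UNICODE_MASK      ; in: tests to run, out: tests that passed

    .code
CheckUnicode proc
    invoke IsTextUnicode, offset sample, sizeof sample, offset uflags
    ret                                          ; eax nonzero = buffer passed the requested tests
CheckUnicode endp
    end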



Title: Re: Faster Memcopy ...
Post by: guga on November 17, 2019, 07:36:08 AM
Just write a UNICODE editor, no conversions.

Hi Steve, tks... But I was thinking of a function able to detect it for disassembler purposes. While I was testing the GDI issues of that other post, I faced a really old problem in RosAsm and decided to start a minor update of some of the routines. The routine I'm currently working on is the disassembler, which also uses routines to check chunks of bytes. I remembered using a very, very old function for string length and updated it, but it was not fast enough; that's why I decided to give these a try.

Since RosAsm now disassembles Unicode strings properly, I'm also trying a Unicode string length check and a routine to automatically detect simple Unicode strings (Char+0, Char+0 etc.). So, from a disassembler's point of view, a faster routine able to determine whether a string is Unicode or ANSI (even simple Unicode format) may be necessary to avoid bad recognitions.

Yesterday I succeeded in fixing an old bug in the RosAsm disassembler that made the disassembling process rather slow while working on some apps or DLLs. For example, on shell32.dll or atioglxx.dll for XP (12 MB and 17 MB respectively), the disassembler was taking almost 25 minutes to finish due to heavy computation of code/data identification, and the main functions were going back and forth thousands of times. It was a stupid design mistake I (or René, I don't remember) made years ago, forcing one of the functions to allocate and deallocate a huge amount of memory every time the routine is used (and it is used thousands of times for problematic files). Now the process is way faster for some problematic files: the same test on shell32.dll disassembled it in about 20 seconds and kept the accuracy of the disassembled data.

Title: Re: Faster Memcopy ...
Post by: hutch-- on November 17, 2019, 09:42:44 AM
As long as you can safely identify character data, either ANSI or UNICODE, it may be viable to have a manual option to switch between the two. There is no truly reliable method of detecting the difference so you use the best you can get/write and have a manual option.
Title: Re: Faster Memcopy ...
Post by: guga on November 17, 2019, 11:50:24 AM
Manual option?  :thumbsup: :thumbsup: :thumbsup: Indeed. Yeah... maybe I'll implement it later when I finish the fixes. I plan to add a couple of user options as soon as I have some time to dedicate to working on RosAsm updates again. The current version can identify with relative accuracy what is an ANSI or Unicode string, but it still has some minor problems, because in some cases a chunk of data can be either a string or a pointer, and that is not easy to fix without the help of other tools, like a signature system I started developing years ago but never finished.

Manual options could be good, as in IdaPro, allowing the user to choose whether to disassemble C-style strings, DOS style, Pascal style, Delphi strings etc. But for those specific cases (strings tied to particular compilers) the better route is to do it when (or if) I succeed in finishing the signature technique (I called it DIS - Digital Identification System) that I started many years ago.

The current routine does the job more than 90% of the time for real apps. The disassembler sets a couple of flags on forbidden areas of the PE to prevent them from being disassembled - for example, sections flagged as import, export, resources etc. It basically identifies the good code and data sections, and those are the ones that are actually disassembled.

One of the problems is that, inside the code section, it is common to find embedded data, structures, strings etc. Although the disassembler works reasonably well on those sections, there are files that produce wrong results... but those I'll fix later, once I fix some minor bugs in RosAsm.

I'm comparing the results of the fixes I'm currently doing and they are more accurate than the ones in IdaPro (not considering the FLIRT system used in Ida, of course), but there are still some issues.

Eventually I'll try to create a set of macros or internal routines to allow compilation of MASM-syntax style, but this will take more time.
Title: Re: Faster Memcopy ...
Post by: daydreamer on November 17, 2019, 08:50:52 PM
Just write a UNICODE editor, no conversions.
I already worked on a Unicode RichEdit, but as an asm programmer I cannot resist downsizing tricks like halving the size of the text data and upsizing again by adding the start of the specific language block it belongs to and moving the result to the character buffer.
Title: Re: Faster Memcopy ...
Post by: daydreamer on November 17, 2019, 09:12:38 PM
Manual options could be good, as in IdaPro, allowing the user to choose whether to disassemble C-style strings, DOS style, Pascal style, Delphi strings etc. But for those specific cases (strings tied to particular compilers) the better route is to do it when (or if) I succeed in finishing the signature technique (I called it DIS - Digital Identification System) that I started many years ago.

The current routine does the job more than 90% of the time for real apps. The disassembler sets a couple of flags on forbidden areas of the PE to prevent them from being disassembled - for example, sections flagged as import, export, resources etc. It basically identifies the good code and data sections, and those are the ones that are actually disassembled.

90% - great job :thumbsup: Maybe some of the failing % is because the creator of the code tried to make it harder to disassemble?
Why not make it like a smartphone editor: its best guess of the string format is shown with a tiny sample, and the user gets the option to choose from a list of different text formats? I had no idea there were loads of different text formats.
I can also think of hardcoded mov eax,"abcd" or mov eax,"ab" ; unicode - that can be very hard to handle in a disassembler.
It would also be good to know how to detect that text files are saved in UTF-16 format.
Title: Re: Faster Memcopy ...
Post by: hutch-- on November 17, 2019, 10:15:24 PM
Using rich edit, the shift from ANSI to UNICODE is simple enough to do. If you have to work in both, load the file and, if it looks like garbage, switch from one to the other. It just means reloading the file again. You can make tangled messes that still can't guarantee being correct, or switch between the two; the latter is much easier.
Title: Re: Faster Memcopy ...
Post by: AW on November 17, 2019, 11:03:31 PM
This is another one, this time for an AVX-512 strlen (based on this url (https://github.com/WojciechMula/toys/blob/master/avx512-string/avx512f-strlen.cpp))

Code: [Select]
strlenZMM:
push esi
mov eax, 01010101h
vpbroadcastd zmm2, eax ; broadcast eax to all elements
xor edx, edx ; len = 0
mov eax, 80808080h
vpbroadcastd zmm3, eax
mov esi, [esp+8]
@@:
vmovdqu32 zmm0, ZMMWORD PTR [esi+edx]   ; load 64 bytes
vpsubd zmm1, zmm0, zmm2                 ; v - 01010101h, per dword
vpternlogd zmm1, zmm0, zmm3, 32         ; (v - 01010101h) AND NOT v AND 80808080h: nonzero in any dword holding a zero byte
vptestmd k1, zmm1, zmm1                 ; k1 gets one bit per dword that contains a zero byte
kmovw eax, k1
movzx eax, ax
test ax, ax
jnz @F
add edx, 64
jmp short @B
@@:
bsf eax, eax
push 32
pop ecx
cmovne ecx, eax
lea esi, dword ptr [esi+ecx*4]
cmp byte ptr [esi+edx], 0
lea eax, dword ptr [edx+ecx*4]
je short @exit
cmp byte ptr [esi+edx+1], 0
jne short @F
inc eax
jmp short @exit
@@:
cmp byte ptr [esi+edx+2], 0
jne short @F
add eax, 2
jmp short @exit
@@:
add eax, 3
@exit:
vzeroupper
pop esi
ret
    end

I added a 4th test, for strings between 40000 and 40900 bytes, to see if the AVX-512 version pulls away. Well, not really: SSE Intel Silvermont and SSE Intel Atom are right there as well.  :hmmm:

total [0 .. 40], 8++
   290780 cycles 7.asm: sse2
   355355 cycles 5.asm: PCMPISTRI
   412251 cycles 3.asm: SSE Intel Silvermont
   469664 cycles 8.asm: Agner Fog
   502841 cycles 1.asm: SSE 16
   524321 cycles 2.asm: SSE 32
   597335 cycles 9.asm: ZMM AVX512
   865552 cycles 4.asm: SSE Intel Atom
   908227 cycles 6.asm: scasb
   913651 cycles 0.asm: msvcrt.strlen()
   
   
total [41 .. 80], 7++
   270380 cycles 3.asm: SSE Intel Silvermont
   299431 cycles 5.asm: PCMPISTRI
   306940 cycles 7.asm: sse2
   314735 cycles 1.asm: SSE 16
   364536 cycles 9.asm: ZMM AVX512
   380247 cycles 8.asm: Agner Fog
   405156 cycles 2.asm: SSE 32
   639091 cycles 4.asm: SSE Intel Atom
   758265 cycles 6.asm: scasb
   982403 cycles 0.asm: msvcrt.strlen()

   total [600 .. 1000], 100++
   202227 cycles 9.asm: ZMM AVX512
   237534 cycles 3.asm: SSE Intel Silvermont
   292854 cycles 4.asm: SSE Intel Atom
   334146 cycles 2.asm: SSE 32
   338568 cycles 1.asm: SSE 16
   356720 cycles 7.asm: sse2
   436840 cycles 8.asm: Agner Fog
   650222 cycles 5.asm: PCMPISTRI
  1438033 cycles 6.asm: scasb
  1830544 cycles 0.asm: msvcrt.strlen()
 
total [40000 .. 40900], 100++
  2161645 cycles 3.asm: SSE Intel Silvermont
  2224521 cycles 4.asm: SSE Intel Atom
  2342704 cycles 9.asm: ZMM AVX512
  3137064 cycles 1.asm: SSE 16
  3465817 cycles 7.asm: sse2
  3514206 cycles 2.asm: SSE 32
  4113016 cycles 8.asm: Agner Fog
  6173622 cycles 5.asm: PCMPISTRI
 13022424 cycles 6.asm: scasb
 16670776 cycles 0.asm: msvcrt.strlen() 
Title: Re: Faster Memcopy ...
Post by: nidud on November 18, 2019, 12:53:02 AM
I have no way of testing this, but an aligned 64-byte (cache-line) version in 64-bit may look something like this.

    .code

strlen::

    xor             eax,eax
    vpbroadcastq    zmm0,rax

    mov             r8,rcx
    mov             rax,rcx
    and             rax,-64
    and             ecx,64-1
    xor             edx,edx
    dec             rdx   
    shl             rdx,cl
    vpcmpgtb        k1{k2},zmm0,[rax]
    kmovd           ecx,k1
    add             rax,32
    and             ecx,edx
    jnz             L2
    kmovd           ecx,k2
    add             rax,32
    shr             rdx,32
    and             rcx,rdx
    jnz             L2

L1:
    vpcmpgtb        k1{k2},zmm0,[rax]
    kmovd           ecx,k1
    add             rax,32
    test            ecx,ecx
    jnz             L2
    kmovd           ecx,k2
    add             rax,32
    test            ecx,ecx
    jz              L1

L2:
    bsf             ecx,ecx
    lea             rax,[rax+rcx-32]
    sub             rax,r8
    ret

    end
Title: Re: Faster Memcopy ...
Post by: AW on November 18, 2019, 05:19:44 AM
@nidud,

It has a bug; I have not figured out the logic yet, but it does not get past this:
00007ff65266174e  test         ecx, ecx 
00007ff652661750  jz         0x7ff652661734 

You can debug with VS 2012 or 2015 and the Intel SDE debugger. I don't know if it works with the Express or Community editions; I have the Pro for those years.
Later: I think it does, because I read this:
http://masm32.com/board/index.php?topic=6473.msg69456#msg69456
Title: Re: Faster Memcopy ...
Post by: AW on November 18, 2019, 08:07:16 AM
I understood the logic and corrected it as follows; this version does not need the input to be aligned.

Code: [Select]
.code

avx512aligned proc
    sub rsp, 8
    xor             rax,rax
    vpbroadcastq    zmm0,rax
    mov             r8,rcx
    mov             rax,rcx
    and             rax,-64
    and             rcx,64-1
    xor             rdx,rdx
    dec             rdx   
    shl             rdx,cl
    vpcmpgtb        k1,zmm0,[rax]
    kmovq           rcx,k1
    and             rcx,rdx
    jnz             L2
 
L1:
    add             rax,64
    vpcmpgtb        k1,zmm0,[rax]
    kmovq           rcx,k1
    test            rcx,rcx
    jz              L1

L2:
    bsf             rcx,rcx
    lea             rax,[rax+rcx]
    sub             rax,r8
    dec rax
    add rsp, 8
    ret
avx512aligned endp

end

What a mess there was with the k2! Simply remove it.
Title: Re: Faster Memcopy ...
Post by: guga on November 18, 2019, 11:45:21 AM
Hi, DayDreamer
"90% great job :thumbsup: some of the fail % it maybe is creator of code tried to make it harder to disassemble?"
Thanks :thumbsup: :thumbsup: :)  About the code being harder due to creator choice, well, not necessarily. The vast majority of time disassembler fails to identify is due to characteristics of each compiler or it´s libraries (When the file was not packed, of course).For example, VisualBasic6 code contains a lot of embedded data inside the code section. Some delphi or borland files have that too. Plain C using Api´s by the other-hand are somewhat easier to disassemble because the code and data are more distinguishable from each other. Also some C++ files too are somewhat easier. What makes disablement process a bit hard are heavily bad encoded libraries, specially when they are made with C++ for example or when there are trash code inside, i mean, functions that are never used. On those situations (functions that don´t have any reference), following the data chain is tricky, because some of that chunks can be either code or data.

Sure that there is not any disassembler will be 100% accurate, but if we can get a higher rate of accuracy, the better. I`m pretty sure that if i finish the DIS System (Similar to Flirt on Ida), i can reach something around 98%of accuracy, but..no plans/time to do that right now. I still have to fix lot of problems inside RosAsm yet, and try to make their inner functions be less attached to the interface (and, rebuild it completely eventually). I´ll have to isolate the encoder, the disassembler, the debugger, and create a new resource editor to only then i can try rewrite the interface or implement a new set of macros (or internal routines) to force it to work with something closer to a masm syntax too.


Hi Steve :)
"Using rich edit, the shift from ANSI to UNICODE is simple enough to do. If you have to work in both, load the file and if it looks like garbage, switch from one to the other. Just means reloading the file again. You can make tangled messes that still can't garrantee being correct or switch between the two, the latter is much easier."
Interesting idea. :) Easier indeed.
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 18, 2019, 08:58:06 PM
"Using rich edit, the shift from ANSI to UNICODE is simple enough..."

The RichEdit control can use rtf under the hood, and thus has no problems using Unicode. In RichMasm (http://masm32.com/board/index.php?topic=5314.0), for example, your source can be Ansi or Unicode or a mix of both, no problem. The more interesting part is what you pass from editor to assembler - and that must be Utf-8, if you want to display non-English text in your executable.
Title: Re: Faster Memcopy ...
Post by: guga on November 18, 2019, 11:34:00 PM
Hi Jj. Tks.

I tested the new update and it is working faster :)

About the UTF-8: so I always need to convert to UTF-8 whenever I load a file, or immediately before I show it on screen, and there is no need for a user choice? I'm not sure I understood what you meant by passing from editor to assembler.
Do you open a file (Unicode or ANSI) in RichMasm and export it as UTF-8, or is the UTF-8 conversion done internally only to display the asm text on the screen?
Title: Re: Faster Memcopy ...
Post by: jj2007 on November 19, 2019, 12:35:25 AM
This is OT but I'll keep it short:

include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  uMsgBox 0, "Добро пожаловать", "歡迎", MB_OK
EndOfCode


- RichMasm exports this source to whatever.asm as Utf-8
- the assembler (ML, UAsm, AsmC) sees an 8-bit text but doesn't care whether that's codepage 1252 or 65001 or whatever
- the executable sees an 8-bit text, and gets told via the u in uMsgBox that a MessageBoxW is required, and that it should kindly translate the Utf-8 to Utf-16 before passing it on to MessageBoxW

That's the whole trick. For the coder it's convenient, because he can write everything directly into the source. And of course comments can be in any encoding; the RichEdit control uses RTF to store them, and the assemblers don't care.
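
For the curious, a bare-bones sketch of the widening step described above, done by hand with the Windows APIs rather than by MasmBasic's uMsgBox; the buffer size is arbitrary and the source file is assumed to be saved as Utf-8:
Code: [Select]
    include \masm32\include\masm32rt.inc
    CP_UTF8 equ 65001

    .data
    utf8msg db "Добро пожаловать", 0     ; Utf-8 bytes, exactly as the editor exported them

    .data?
    wbuf    dw 260 dup(?)                ; room for the widened Utf-16 string

    .code
start:
    invoke MultiByteToWideChar, CP_UTF8, 0, addr utf8msg, -1, addr wbuf, 260
    invoke MessageBoxW, 0, addr wbuf, 0, MB_OK   ; NULL caption -> default title
    invoke ExitProcess, 0
end start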
Title: Re: Faster Memcopy ...
Post by: nidud on November 19, 2019, 03:17:18 AM
I understood the logic and corrected it as follows; this version does not need the input to be aligned.

Yes, you will get 64 bits there, but you need to align the input for safety.

Quote
What a mess there was with the k2! Simply remove it.

I think it (the k2 mask) also must be set to actually work, so you may as well use it directly for the first fetch.

    mov             r8,rcx             ; save the start address
    xor             eax,eax
    vpbroadcastq    zmm0,rax           ; zmm0 = 64 zero bytes
    mov             rax,rcx
    and             rax,-64            ; round the pointer down to a cache line
    and             ecx,64-1           ; misalignment in bytes
    mov             rdx,-1
    shl             rdx,cl             ; clear the mask bits below the string start
    kmovq           k2,rdx
    vpcmpeqb        k1{k2},zmm0,[rax]  ; first fetch: compare under mask k2
    jmp             L2
L1:
    vpcmpeqb        k1,zmm0,[rax]      ; full 64-byte aligned compare against zero
L2:
    kmovq           rcx,k1
    add             rax,64
    test            rcx,rcx
    jz              L1                 ; no zero byte found yet
    bsf             rcx,rcx            ; offset of the first zero byte in the block
    lea             rax,[rax+rcx-64]
    sub             rax,r8             ; terminator address minus start = length
    ret
Title: Re: Faster Memcopy ...
Post by: AW on November 19, 2019, 05:37:46 AM
It works very well.  :thumbsup:

Code: [Select]
total [0 .. 40], 8++
   307793 cycles 1.asm: PCMPISTRI
   443231 cycles 0.asm: AVX-512 aligned
   521571 cycles 2.asm: ZMM AVX512 (older)

total [41 .. 80], 7++
   257807 cycles 0.asm: AVX-512 aligned
   356038 cycles 2.asm: ZMM AVX512 (older)
   370879 cycles 1.asm: PCMPISTRI
   
total [600 .. 1000], 100++
   113553 cycles 0.asm: AVX-512 aligned
   204811 cycles 2.asm: ZMM AVX512 (older)
   859649 cycles 1.asm: PCMPISTRI

total [40000 .. 40800], 100++
   897536 cycles 0.asm: AVX-512 aligned
  2127546 cycles 2.asm: ZMM AVX512 (older)
  5980835 cycles 1.asm: PCMPISTRI 

If you convert the remaining to 64-bit I will test them all.
Title: Re: Faster Memcopy ...
Post by: nidud on November 19, 2019, 07:12:37 AM
Added the Intel versions as well. They were originally 32-bit, so a native 64-bit version would have been coded differently from the direct translation done here.

Lots of register space there now: 64 bytes * 32 regs = 2048 bytes.
Title: Re: Faster Memcopy ...
Post by: AW on November 19, 2019, 07:17:29 PM
Added the Intel versions as well. They were originally 32-bit, so a native 64-bit version would have been coded differently from the direct translation done here.

Lots of register space there now: 64 bytes * 32 regs = 2048 bytes.

I think there is a bug in the ANSI version here:
args_x macro
    lea rcx,str_1[size_s]
    mov eax,step_x
    add eax,eax  <---------- HERE
    sub rcx,rax
    exitm<>
    endm

Although it does not cause a buffer overflow, it clearly changes the results.
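
Presumably the fixed ANSI macro is just the posted one with the doubling removed (step_x is already a byte count in the ANSI test, so it must not be multiplied by two):
args_x macro
    lea rcx,str_1[size_s]
    mov eax,step_x
    sub rcx,rax
    exitm<>
    endm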

These are the results after removing the "add eax, eax"

Code: [Select]
total [0 .. 40], 16++
   133475 cycles 5.asm: AVX 32
   152210 cycles 3.asm: SSE Intel Silvermont
   157375 cycles 1.asm: SSE 16
   172910 cycles 6.asm: AVX512 64
   178312 cycles 2.asm: SSE 32
   228359 cycles 0.asm: msvcrt.strlen()
   326672 cycles 4.asm: SSE Intel Atom
   
total [41 .. 80], 16++
   117358 cycles 5.asm: AVX 32
   123539 cycles 6.asm: AVX512 64
   165831 cycles 1.asm: SSE 16
   169369 cycles 3.asm: SSE Intel Silvermont
   210518 cycles 2.asm: SSE 32
   270514 cycles 0.asm: msvcrt.strlen()
   281378 cycles 4.asm: SSE Intel Atom

total [600 .. 1000], 200++
    67189 cycles 6.asm: AVX512 64
   110356 cycles 5.asm: AVX 32
   218898 cycles 3.asm: SSE Intel Silvermont
   235207 cycles 4.asm: SSE Intel Atom
   272100 cycles 2.asm: SSE 32
   296195 cycles 1.asm: SSE 16
   643732 cycles 0.asm: msvcrt.strlen()
Title: Re: Faster Memcopy ...
Post by: AW on November 19, 2019, 08:40:48 PM
This is another suite of test results for 64-bit strlen variations, including Agner Fog, PCMPISTRI and Masm32 SDK.
I added a test for extra long strings in the range 40000 to 40800 bytes.
Masm32 SDK strlen is not SIMD-assisted (and I believe msvcrt.strlen is not either), so it plays in a different league. But it could be made that way, because all 64-bit machines support SSE (see the sketch after the results).

Code: [Select]
total [0 .. 40], 16++
   135718 cycles 5.asm: AVX 32
   147583 cycles 8.asm: PCMPISTRI
   159476 cycles 1.asm: SSE 16
   169063 cycles 3.asm: SSE Intel Silvermont
   190066 cycles 2.asm: SSE 32
   192212 cycles 7.asm: Agner Fog
   210091 cycles 6.asm: AVX512 64
   238010 cycles 0.asm: msvcrt.strlen()
   280346 cycles 4.asm: SSE Intel Atom
   282475 cycles 9.asm: Masm32 SDK
   
   total [41 .. 80], 16++
   116625 cycles 5.asm: AVX 32
   120875 cycles 6.asm: AVX512 64
   136046 cycles 3.asm: SSE Intel Silvermont
   169359 cycles 8.asm: PCMPISTRI
   179466 cycles 1.asm: SSE 16
   198766 cycles 2.asm: SSE 32
   205015 cycles 7.asm: Agner Fog
   257180 cycles 0.asm: msvcrt.strlen()
   278100 cycles 4.asm: SSE Intel Atom
   487603 cycles 9.asm: Masm32 SDK

   total [600 .. 1000], 200++
    83570 cycles 6.asm: AVX512 64
   110477 cycles 5.asm: AVX 32
   218994 cycles 3.asm: SSE Intel Silvermont
   253867 cycles 4.asm: SSE Intel Atom
   279579 cycles 2.asm: SSE 32
   307387 cycles 1.asm: SSE 16
   334595 cycles 7.asm: Agner Fog
   488680 cycles 8.asm: PCMPISTRI
   621900 cycles 0.asm: msvcrt.strlen()
  1066191 cycles 9.asm: Masm32 SDK

  total [40000 .. 40800], 200++
   505134 cycles 6.asm: AVX512 64
   977509 cycles 5.asm: AVX 32
  1468195 cycles 3.asm: SSE Intel Silvermont
  1684275 cycles 4.asm: SSE Intel Atom
  2241774 cycles 2.asm: SSE 32
  2250641 cycles 1.asm: SSE 16
  2609106 cycles 7.asm: Agner Fog
  3257461 cycles 8.asm: PCMPISTRI
  4818268 cycles 0.asm: msvcrt.strlen()
  8809927 cycles 9.asm: Masm32 SDK
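
As a footnote to the remark above that the Masm32 SDK strlen could be SIMD-assisted, a minimal 64-bit sketch of the classic pcmpeqb/pmovmskb loop; the routine name is illustrative, and for brevity it uses unaligned loads, so unlike nidud's aligned versions it can read past the terminator into the next page:
Code: [Select]
    .code
sse_strlen proc              ; rcx = zero-terminated byte string, length returned in rax
    pxor    xmm0, xmm0
    mov     rax, -16
@@:
    add     rax, 16
    movdqu  xmm1, xmmword ptr [rcx+rax]
    pcmpeqb xmm1, xmm0       ; 0FFh in every byte position that holds a zero
    pmovmskb edx, xmm1
    test    edx, edx
    jz      @B
    bsf     edx, edx         ; index of the first zero byte inside the block
    add     rax, rdx
    ret
sse_strlen endp
    end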
 
Title: Re: Faster Memcopy ...
Post by: AW on November 19, 2019, 11:42:23 PM
In this test I introduce a new AVX-512 strlen variation called Fast AVX512 64  :biggrin:

Code: [Select]
.code
xor rax, rax
vxorps zmm1, zmm1, zmm1
L1:
vpcmpeqb k1, zmm1,zmmword ptr [rcx+rax]
kmovq  r9,k1
add rax, 64
test r9,r9
jz L1
    bsf             r9,r9
    lea             rax,[rax+r9-64]
    ret
end

These are the results:

Code: [Select]
total [0 .. 40], 16++
    83041 cycles 9.asm: Fast AVX512 64
   118463 cycles 8.asm: PCMPISTRI
   136861 cycles 5.asm: AVX 32
   145743 cycles 3.asm: SSE Intel Silvermont
   163889 cycles 1.asm: SSE 16
   178432 cycles 6.asm: AVX512 64
   185371 cycles 7.asm: Agner Fog
   196856 cycles 2.asm: SSE 32
   228516 cycles 0.asm: msvcrt.strlen()
   277227 cycles 4.asm: SSE Intel Atom
   
total [41 .. 80], 16++
    61027 cycles 9.asm: Fast AVX512 64
   111154 cycles 5.asm: AVX 32
   130256 cycles 6.asm: AVX512 64
   139440 cycles 3.asm: SSE Intel Silvermont
   155091 cycles 8.asm: PCMPISTRI
   183854 cycles 1.asm: SSE 16
   194775 cycles 7.asm: Agner Fog
   212161 cycles 2.asm: SSE 32
   285351 cycles 4.asm: SSE Intel Atom
   311238 cycles 0.asm: msvcrt.strlen()

total [600 .. 1000], 200++
    71159 cycles 9.asm: Fast AVX512 64
    81938 cycles 6.asm: AVX512 64
   110439 cycles 5.asm: AVX 32
   220499 cycles 3.asm: SSE Intel Silvermont
   254703 cycles 4.asm: SSE Intel Atom
   293130 cycles 2.asm: SSE 32
   308233 cycles 1.asm: SSE 16
   338944 cycles 7.asm: Agner Fog
   516498 cycles 8.asm: PCMPISTRI
   648680 cycles 0.asm: msvcrt.strlen()

total [40000 .. 40800], 200++
   390634 cycles 6.asm: AVX512 64
   414175 cycles 9.asm: Fast AVX512 64
   606734 cycles 5.asm: AVX 32
  1392867 cycles 3.asm: SSE Intel Silvermont
  1417887 cycles 4.asm: SSE Intel Atom
  2194951 cycles 1.asm: SSE 16
  2200795 cycles 2.asm: SSE 32
  2229910 cycles 7.asm: Agner Fog
  3295851 cycles 8.asm: PCMPISTRI
  4538755 cycles 0.asm: msvcrt.strlen()   

For huge strings the other variation catches up.  :badgrin:
Maybe it can be improved just a little.


Later:

I have not used it in the above test but the following, which is based on the same idea, is also faster than AVX 32 except for huge strings (but the difference is small).

Code: [Select]
.code
    xor rax, rax
    vxorps ymm0, ymm0, ymm0
L1:
    vpcmpeqb ymm1, ymm0, ymmword ptr [rcx+rax]
    vpmovmskb r9d,ymm1
    add     rax,32
    test    r9d,r9d
    jz L1
    bsf  r9,r9
    lea  rax,[rax+r9-32]
    ret
    end
Title: Re: Faster Memcopy ...
Post by: nidud on November 20, 2019, 03:53:14 AM
I think there is a bug in the ANSI version here:
args_x macro
    lea rcx,str_1[size_s]
    mov eax,step_x
    add eax,eax  <---------- HERE
    sub rcx,rax
    exitm<>
    endm

Yes, it was copied from the Unicode test.
I updated the timeit.inc (https://github.com/nidud/asmc/blob/master/source/test/benchmark/x64/timeit.inc) file for 64-bit to remove the 0..9 limit.

You may now continue from 9.asm to a[..z].asm
The size of the info-array is now equal to the largest id used.

procs equ <for x,<0,1>> ; add functions to test...

    .data
    info_0 db "0",0 ; only used ones needed...
    info_1 db "1",0

The id-array is still numeric but files and info use chars

procs equ <for x,<0,10>> ; add functions to test...

    .data
    info_0 db "0",0
    info_1 db "1",0
    ...
    info_9 db "9",0
    info_a db "10",0

Title: Re: Faster Memcopy ...
Post by: AW on November 20, 2019, 06:34:12 AM
Very good, Nidud.

Something I dislike is the .cmd extension and the makefile. I also dislike the "option dllimport:" that comes from JWasm, and in general everything that tries to establish unnecessary alternatives to things that are already done well the traditional way.


Notepad++ users (probably not many here) can build these ASMC test suites for debugging using this script:

(https://www.dropbox.com/s/cgh80bug1e3y44b/asmcdebug.png?dl=1)

Let's see if it works:


(https://www.dropbox.com/s/g8ndvcukgy5z4hj/asmdb2.png?dl=1)
Title: Re: Faster Memcopy ...
Post by: nidud on November 20, 2019, 07:17:44 AM
Something I dislike is the .cmd extension and the makefile. I also dislike the "option dllimport:" that comes from JWasm, and in general everything that tries to establish unnecessary alternatives to things that are already done well the traditional way.

The main include files uses:
.pragma (https://github.com/nidud/asmc/blob/master/include/string.inc#L7) comment(lib, libc, msvcrt)

This will use option dllimport:<msvcrt> if the -pe switch is used and includelib libc.lib if not.

timeit.inc:

    .pragma comment(lib, msvcrt)
    printf  proto :ptr byte, :vararg
    exit    proto :dword
    _getch  proto

    .pragma comment(lib, kernel32)
    GetCurrentProcess proto
Title: Re: Faster Memcopy ...
Post by: AW on November 20, 2019, 10:58:27 PM
I understand, but ASMC libs appear not to be compatible with Microsoft LINK.exe.
We will need to learn how to use linkw.exe, right?
UAsm libs don't force us to use anything else (they have no such requirement either, thank God).
Title: Re: Faster Memcopy ...
Post by: nidud on November 21, 2019, 06:35:28 AM
I understand, but ASMC libs appear not to be compatible with Microsoft LINK.exe.

They are all compatible with LINK except for the 64-bit import libraries created by LIBW (which needs an update). It has to be this way; otherwise you would be forced to download a few gigs of corporate tools just to build them, or do as Hutch does and use POLIB for this purpose.

There are many samples (https://github.com/nidud/asmc/tree/master/source/test/winbase) in the source directory on how to build def-files, include files, and import libraries from installed dll-files. To update the import libraries using LIB (or POLIB) you may add the following changes to the import.asm (https://github.com/nidud/asmc/blob/master/lib/amd64/import.asm#L39) file and rebuild:

    fprintf(r12, "LIBRARY %s\nEXPORTS\n", rdi)
    .while r13d
        lodsd
        ;fprintf(r12, "++%s.\'%s.dll\'\n", addr [rax+rbx], rdi)
        fprintf(r12, "%s\n", &[rax+rbx])
        dec r13d
    .endw
    fclose(r12)
    lea rbx,buffer
    ;sprintf(rbx, "libw /n /c /q /b /fac /i6 %s\\%s.lib @%s.def", path, rdi, rdi)
    sprintf(rbx, "lib /machine:x64 /def:%s.def /out:%s\\%s.lib", rdi, path, rdi)

Note: the PATH needs to be set for LIB in the above sample.

Quote
We will need to learn how to use linkw.exe, right?

As for the benchmark test, it uses neither include files nor a linker, but in general terms no. The LIBC startup and auto-install is a bit complicated and incorporated into the tools, so this may differ depending on version and tool-chain. In the Asmc LIBC.LIB mainCRTStartup is included, but not in the current msvcrt.dll (Win7-64), so to build a 64-bit debug version using LINK:

include stdio.inc

    .code

main proc

    printf("debug: Win64 console application\n")
    xor eax,eax
    ret

main endp

    end

set LIB=\asmc\lib\amd64
asmc64 -Zi test.asm
link /map /debug /MACHINE:X64 /subsystem:console test.obj

To build without LIBC (use msvcrt.dll):
...
include tchar.inc (https://github.com/nidud/asmc/blob/master/include/tchar.inc#L1055)
...
asmc64 -D__PE__ -Zi test.asm

If you want to debug LIBC add the full path of the source to the makefile (https://github.com/nidud/asmc/blob/master/source/lib64/makefile):

AFLAGS = -Zi -Zp8 -D_CRTBLD -Cs -I$(inc_dir)

$(lib_dir)\amd64\libc.lib:
    asmc64 $(AFLAGS) /r G:\asmc\source\lib64\*.asm
Title: Re: Faster Memcopy ...
Post by: CMalcheski on April 03, 2020, 11:47:46 AM
rrr

Mis-alignment can be a problem, especially if both the source and the destination buffers have some mis-alignment, and especially if they have different alignments. The usual case is that one or the other of the buffers can be aligned mod 16. The way to handle these is to move a block of 16 bytes from the start of the buffer and a block of 16 bytes at the end of the buffer with unaligned XMM moves. Then calculate the size left to be moved (data size - 32), round it up to the next multiple of 16, and start at the first 16-byte boundary of the unaligned buffer, moving 16 bytes at a time (there will be a small overlap at either end).

Dave.

Very few people take their efforts far enough to avoid this issue, but internally any data not aligned on its own size (e.g. a dword not aligned on a 4-byte boundary) is going to require two reads to retrieve.  That's a big slowdown.  The effort required to properly align data across the board is well worth it.
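
A minimal 64-bit sketch of the scheme Dave describes above (unaligned head and tail blocks, aligned 16-byte stores in between); the procedure name and register convention are illustrative, and it assumes a byte count of at least 16 and non-overlapping buffers:
Code: [Select]
    .code
xmmcopy_sketch proc            ; rcx = dest, rdx = source, r8 = byte count (assumed >= 16)
    movdqu  xmm0, xmmword ptr [rdx]          ; head: 16 unaligned bytes from the start
    movdqu  xmm1, xmmword ptr [rdx+r8-16]    ; tail: 16 unaligned bytes from the end
    movdqu  xmmword ptr [rcx], xmm0
    movdqu  xmmword ptr [rcx+r8-16], xmm1
    lea     rax, [rcx+16]
    and     rax, -16                         ; first 16-aligned destination address past the head
    sub     rax, rcx                         ; rax = offset where the aligned middle starts (1..16)
    lea     r9, [r8-16]                      ; the tail store already covers offsets >= count-16
@middle:
    cmp     rax, r9
    jae     @done
    movdqu  xmm0, xmmword ptr [rdx+rax]      ; source may still be misaligned: unaligned load
    movdqa  xmmword ptr [rcx+rax], xmm0      ; destination is 16-aligned here: aligned store
    add     rax, 16
    jmp     @middle
@done:
    ret
xmmcopy_sketch endp
    end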
Title: Re: Faster Memcopy ...
Post by: hutch-- on April 03, 2020, 12:18:35 PM
AW,

Clock this one in your list. Memory must be aligned by the data size.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

ymmcopya proc

  ; rcx = source address
  ; rdx = destination address
  ; r8  = byte count

    mov r11, r8
    shr r11, 5                  ; div by 32 for loop count
    xor r10, r10                ; zero r10 to use as index

  lpst:
    vmovntdqa ymm0, YMMWORD PTR [rcx+r10]
    vmovntdq YMMWORD PTR [rdx+r10], ymm0

    add r10, 32
    sub r11, 1
    jnz lpst

    mov rax, r8                 ; calculate remainder if any
    and rax, 31
    test rax, rax
    jnz @F
    ret

  @@:
    mov r9b, [rcx+r10]          ; copy any remainder
    mov [rdx+r10], r9b
    add r10, 1
    sub rax, 1
    jnz @B

    ret

ymmcopya endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Title: Re: Faster Memcopy ...
Post by: AW on April 03, 2020, 08:57:49 PM
I can't test now, but non-temporal instructions tend to perform poorly on modern processors: the stores bypass the cache, which hurts whenever the copied block would have fit in it anyway. That's the main reason Agner Fog's tests did not shine.
I will test when possible, or you can do it directly from nidud's example.
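
For comparison, a hedged variant of hutch's ymmcopya above with ordinary (temporal) stores, so the two can be clocked side by side in the same harness; the register convention is kept, and the remainder loop is left out for brevity, so the byte count is assumed to be a multiple of 32 and the buffers 32-aligned:
Code: [Select]
ymmcopyt proc

  ; rcx = source address
  ; rdx = destination address
  ; r8  = byte count (multiple of 32 assumed)

    mov r11, r8
    shr r11, 5                  ; div by 32 for loop count
    xor r10, r10                ; zero r10 to use as index

  lpst:
    vmovdqa ymm0, YMMWORD PTR [rcx+r10]
    vmovdqa YMMWORD PTR [rdx+r10], ymm0   ; plain cached store instead of vmovntdq

    add r10, 32
    sub r11, 1
    jnz lpst

    vzeroupper
    ret

ymmcopyt endp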