I don't get much time to write 32 bit code these days but since its prototype was wasted in the campus, here is a derivation of the first version that is self contained and full register design. Nothing to test it against but it runs OK.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
tRev PROTO pstr:DWORD
.data
numz db "1234567890",0
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL rval :DWORD
invoke tRev,ADDR numz ; reverse the number above
mov rval, eax
print rval,13,10
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
tRev proc pstr:DWORD
; ---------------------------------------------
; in place string reverse, fully self contained
; ---------------------------------------------
push ebx ; save EBX
push edi
push esi
mov edi, [ebp+8]
mov eax, edi
sub eax, 1
@@: ; unroll by 4
REPEAT 3
add eax, 1 ; get length
cmp BYTE PTR [eax], 0 ; loop to buffer end
je @F
ENDM
add eax, 1 ; get length
cmp BYTE PTR [eax], 0 ; loop to buffer end
jne @B
@@:
sub eax, edi ; sub address from eax for len
mov esi, eax ; store length in esi
shr eax, 1 ; get integer half length of string
mov ebx, eax ; store it in variable
mov eax, edi ; eax has start char
mov edx, edi ; load start address
add edx, esi ; get end character address
sub edx, 1 ; sub 1 to get end char
mov esi, ebx ; copy half len into edi
@@:
movzx ebx, BYTE PTR [eax] ; load start char
movzx ecx, BYTE PTR [edx] ; load end char
mov [eax], cl ; reverse char pair
add eax, 1 ; increment start char
mov [edx], bl
sub edx, 1 ; decrement end char
sub esi, 1 ; sub 1 from half length
jnz @B ; loop back if its not zero
mov eax, edi ; return the address in eax
pop esi
pop edi
pop ebx ; restore EBX
ret 4 ; balance the stack by 4 bytes
tRev endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
Since we are in the Lab :biggrin:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
25756 cycles for 100 * tRev (Hutch)
33742 cycles for 100 * strrev (CRT)
13687 cycles for 100 * jRev (jj)
24425 cycles for 100 * tRev (Hutch)
33634 cycles for 100 * strrev (CRT)
13737 cycles for 100 * jRev (jj)
24779 cycles for 100 * tRev (Hutch)
33666 cycles for 100 * strrev (CRT)
13671 cycles for 100 * jRev (jj)
24694 cycles for 100 * tRev (Hutch)
33700 cycles for 100 * strrev (CRT)
13758 cycles for 100 * jRev (jj)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)
1819043144 = eax tRev (Hutch)
1819043144 = eax strrev (CRT)
1819043144 = eax jRev (jj)
Hi,
Two results.
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
25531 cycles for 100 * tRev (Hutch)
38912 cycles for 100 * strrev (CRT)
14510 cycles for 100 * jRev (jj)
25745 cycles for 100 * tRev (Hutch)
38790 cycles for 100 * strrev (CRT)
14539 cycles for 100 * jRev (jj)
25463 cycles for 100 * tRev (Hutch)
39089 cycles for 100 * strrev (CRT)
14285 cycles for 100 * jRev (jj)
25515 cycles for 100 * tRev (Hutch)
38875 cycles for 100 * strrev (CRT)
14330 cycles for 100 * jRev (jj)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)
1819043144 = eax tRev (Hutch)
1819043144 = eax strrev (CRT)
1819043144 = eax jRev (jj)
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
37301 cycles for 100 * tRev (Hutch)
65684 cycles for 100 * strrev (CRT)
24447 cycles for 100 * jRev (jj)
37206 cycles for 100 * tRev (Hutch)
65876 cycles for 100 * strrev (CRT)
24418 cycles for 100 * jRev (jj)
37151 cycles for 100 * tRev (Hutch)
65790 cycles for 100 * strrev (CRT)
24463 cycles for 100 * jRev (jj)
37155 cycles for 100 * tRev (Hutch)
65919 cycles for 100 * strrev (CRT)
24468 cycles for 100 * jRev (jj)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)
1819043144 = eax tRev (Hutch)
1819043144 = eax strrev (CRT)
1819043144 = eax jRev (jj)
--- ok ---
Cheers,
Steve N.
Thanks, Steve. Here is another one called xRev, using pshufb:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
24424 cycles for 100 * tRev (Hutch)
33751 cycles for 100 * strrev (CRT)
13704 cycles for 100 * jRev (jj)
14602 cycles for 100 * Mirror$ (MasmBasic)
4662 cycles for 100 * xRev (jj)
24513 cycles for 100 * tRev (Hutch)
33951 cycles for 100 * strrev (CRT)
13689 cycles for 100 * jRev (jj)
14589 cycles for 100 * Mirror$ (MasmBasic)
4659 cycles for 100 * xRev (jj)
24555 cycles for 100 * tRev (Hutch)
33987 cycles for 100 * strrev (CRT)
13773 cycles for 100 * jRev (jj)
14670 cycles for 100 * Mirror$ (MasmBasic)
4677 cycles for 100 * xRev (jj)
24750 cycles for 100 * tRev (Hutch)
33709 cycles for 100 * strrev (CRT)
13749 cycles for 100 * jRev (jj)
14620 cycles for 100 * Mirror$ (MasmBasic)
4721 cycles for 100 * xRev (jj)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)
12 bytes for Mirror$ (MasmBasic)
87 bytes for xRev (jj)
1819043144 = eax tRev (Hutch)
1819043144 = eax strrev (CRT)
1819043144 = eax jRev (jj)
2053468783 = eax Mirror$ (MasmBasic)
1819043144 = eax xRev (jj)
Note Mirror$() (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1167) and Caballero's algo write to a buffer, i.e. they are not in-place algos.
Do not use for serious stuff - this is work in progress and not fully tested :cool:
P.S.: I've added Lingo's algo (http://www.masmforum.com/board/index.php?topic=14041.msg111377#msg111377) (fast but crashes for unaligned strings), and Caballero's contribution (http://masm32.com/board/index.php?topic=9347.msg102591#msg102591).
wouldnt it be more standard today with unicode 16 reverse string?
Write one for us Magnus. :thup:
Hi,
One result and a partial.
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
25930 cycles for 100 * tRev (Hutch)
39465 cycles for 100 * strrev (CRT)
14214 cycles for 100 * jRev (jj)
17288 cycles for 100 * Mirror$ (MasmBasic)
5350 cycles for 100 * xRev (jj, needs pshufb)
6834 cycles for 100 * RevLingo (aligned strings only)
113656 cycles for 100 * cRev (caballero)
26047 cycles for 100 * tRev (Hutch)
40261 cycles for 100 * strrev (CRT)
14364 cycles for 100 * jRev (jj)
18570 cycles for 100 * Mirror$ (MasmBasic)
5377 cycles for 100 * xRev (jj, needs pshufb)
6637 cycles for 100 * RevLingo (aligned strings only)
114863 cycles for 100 * cRev (caballero)
26578 cycles for 100 * tRev (Hutch)
39065 cycles for 100 * strrev (CRT)
14363 cycles for 100 * jRev (jj)
16243 cycles for 100 * Mirror$ (MasmBasic)
5453 cycles for 100 * xRev (jj, needs pshufb)
6650 cycles for 100 * RevLingo (aligned strings only)
115443 cycles for 100 * cRev (caballero)
26088 cycles for 100 * tRev (Hutch)
39862 cycles for 100 * strrev (CRT)
14335 cycles for 100 * jRev (jj)
16200 cycles for 100 * Mirror$ (MasmBasic)
5371 cycles for 100 * xRev (jj, needs pshufb)
6628 cycles for 100 * RevLingo (aligned strings only)
115281 cycles for 100 * cRev (caballero)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)
10 bytes for Mirror$ (MasmBasic)
87 bytes for xRev (jj, needs pshufb)
123 bytes for RevLingo (aligned strings only)
56 bytes for cRev (caballero)
1259388704 = eax tRev (Hutch)
1259388704 = eax strrev (CRT)
1259388704 = eax jRev (jj)
1259388704 = eax Mirror$ (MasmBasic)
1259388704 = eax xRev (jj, needs pshufb)
1259388704 = eax RevLingo (aligned strings only)
1259388704 = eax cRev (caballero)
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
37197 cycles for 100 * tRev (Hutch)
65773 cycles for 100 * strrev (CRT)
24460 cycles for 100 * jRev (jj)
28888 cycles for 100 * Mirror$ (MasmBasic)
Regards,
Steve N.
Thanks, Steve. Apparently your Pentium M doesn't feature pshufb...
Hi, I wonder why is my version soooo slow. As I see, I think the problem is the use of "lstrlen" each time cRev proc is called. If we know that the string is always 100 chars long, we could change the code as follows. It should be faster now.
cRev proc uses esi edi pSrc, pDest
mov esi, pSrc
;invoke lstrlen, esi
;mov ecx, eax
mov ecx, 100
mov edi, pDest
add edi, ecx
mov byte ptr [edi], 0
dec edi
@Bucle:
movsb
sub edi, 2
loop @Bucle
ret
cRev endp
The new version is indeed a bit faster. However, all other algos check for the length of the string.
cRev proc uses esi edi pSrc, pDest
mov esi, pSrc
if 0
mov ecx, 100 ; 75800 cycles for 100 * cRev (caballero)
else
invoke lstrlen, esi ; 96500 cycles
mov ecx, eax
endif
I don't know the range of designs but any string reverse should be able to handle variable length strings. To be fair to all techniques, ranging from 2 or 3 characters to a couple of hundred characters would give you more useful results.
I have done some testings. It seems to penalize a lot using movs and even loop. Finally the best of all is using dwords as much as possible. All of them use the str length.
It was a bit hard to me compile jj's code, so I created my own to test it.
These are the results:
Test Algorithm 7:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 6:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 5:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 4:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 3:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 2:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 1:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Timming Algorithm 7: 94
Test Timming Algorithm 6: 78
Test Timming Algorithm 5: 109
Test Timming Algorithm 4: 219
Test Timming Algorithm 3: 203
Test Timming Algorithm 2: 313
Test Timming Algorithm 1: 312
algo 1: lstrlen, movsb, loop ecx
algo 2: Longitud (my own subroutine to calc str_len), movsb, loop ecx
algo 3: lstrlen, inc/dec, loop ecx
algo 4: Longitud, inc/dec, loop ecx
algo 5: Longitud, inc/dec, dec ecx
algo 6: lstrlen, dwords, dec ecx
algo 7: Longitud, dwords, dec ecx
caballero,
Try something like this.
mov eax, pTxt
sub eax, 1
@@:
add eax, 1
cmp BYTE PTR [eax], 0
jne @B
sub eax, pTxt
Length is in eax.
@Hutch
Yes, that's faster :thumbsup:. It definitely doesn't seem like a good idea to use movs niether scas
deleted
@nidud
Interesting! though it overlaps the original string. Yet, it seems that the best algo remains the 7th one.
Quote
Test Algorithm 1:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 2:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 3:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 4:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 5:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 6:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 7:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 8:
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Timming Algorithm 1: 329
Test Timming Algorithm 2: 281
Test Timming Algorithm 3: 219
Test Timming Algorithm 4: 203
Test Timming Algorithm 5: 93
Test Timming Algorithm 6: 79
Test Timming Algorithm 7: 62
Test Timming Algorithm 8: 78
New version with Nidud's algo (I called it nRev for consistency :biggrin:):
nRev:
mov edx,[esp+4]
mov ecx,[esp+4]
xor eax,eax
slen:
add ecx,1
cmp al,[ecx-1]
jne slen
sub ecx,2
lupe:
cmp edx,ecx
ja done
mov al,[edx]
mov ah,[ecx]
mov [ecx],al
mov [edx],ah
lea edx,[edx+1]
lea ecx,[ecx-1]
jne lupe
done:
mov eax,[esp+4]
retn 4
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
24486 cycles for 100 * tRev (Hutch)
33686 cycles for 100 * strrev (CRT)
13557 cycles for 100 * jRev (jj)
13920 cycles for 100 * Mirror$ (MasmBasic)
4541 cycles for 100 * xRev (jj, needs pshufb)
5548 cycles for 100 * RevLingo (aligned strings only)
95812 cycles for 100 * cRev (caballero)
32676 cycles for 100 * nRev (Nidud)
24901 cycles for 100 * tRev (Hutch)
33800 cycles for 100 * strrev (CRT)
13677 cycles for 100 * jRev (jj)
14012 cycles for 100 * Mirror$ (MasmBasic)
4618 cycles for 100 * xRev (jj, needs pshufb)
5955 cycles for 100 * RevLingo (aligned strings only)
95700 cycles for 100 * cRev (caballero)
32590 cycles for 100 * nRev (Nidud)
24625 cycles for 100 * tRev (Hutch)
33799 cycles for 100 * strrev (CRT)
13914 cycles for 100 * jRev (jj)
14180 cycles for 100 * Mirror$ (MasmBasic)
4617 cycles for 100 * xRev (jj, needs pshufb)
5668 cycles for 100 * RevLingo (aligned strings only)
95766 cycles for 100 * cRev (caballero)
32612 cycles for 100 * nRev (Nidud)
24617 cycles for 100 * tRev (Hutch)
33827 cycles for 100 * strrev (CRT)
13672 cycles for 100 * jRev (jj)
14032 cycles for 100 * Mirror$ (MasmBasic)
4619 cycles for 100 * xRev (jj, needs pshufb)
5663 cycles for 100 * RevLingo (aligned strings only)
97332 cycles for 100 * cRev (caballero)
33423 cycles for 100 * nRev (Nidud)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
51 bytes for jRev (jj)
10 bytes for Mirror$ (MasmBasic)
87 bytes for xRev (jj, needs pshufb)
123 bytes for RevLingo (aligned strings only)
56 bytes for cRev (caballero)
63 bytes for nRev (Nidud)
1259388704 = eax tRev (Hutch)
1259388704 = eax strrev (CRT)
1259388704 = eax jRev (jj)
1259388704 = eax Mirror$ (MasmBasic)
1259388704 = eax xRev (jj, needs pshufb)
1259388704 = eax RevLingo (aligned strings only)
1259388704 = eax cRev (caballero)
1259388704 = eax nRev (Nidud)
Using the old string instructions without the REP prefix is self inflicted punishment, with the REP prefix you get special case circuitry that performs well, still slightly slower than SSE or AVX but plenty fast enough for most situations. Avoid old instructions like the plague, they live in junk microcode for backwards compatibility, not performance.
deleted
deleted
Hi,
Fast array reversal with SIMD (A small study in hardware accelerated AoS reversal) - https://dev.to/wunk/fast-array-reversal-with-simd-j3p
https://github.com/Wunkolo/qreverse/archive/refs/heads/master.zip
// Powers of two
Bench< ELEMENTSIZE, 8 >();
Bench< ELEMENTSIZE, 16 >();
Bench< ELEMENTSIZE, 32 >();
Bench< ELEMENTSIZE, 64 >();
Bench< ELEMENTSIZE, 128 >();
Bench< ELEMENTSIZE, 256 >();
Bench< ELEMENTSIZE, 512 >();
Bench< ELEMENTSIZE, 1024 >();
// Powers of ten
Bench< ELEMENTSIZE, 100 >();
Bench< ELEMENTSIZE, 1000 >();
Bench< ELEMENTSIZE, 10000 >();
Bench< ELEMENTSIZE, 100000 >();
Bench< ELEMENTSIZE, 1000000 >();
// Primes
Bench< ELEMENTSIZE, 59 >();
Bench< ELEMENTSIZE, 79 >();
Bench< ELEMENTSIZE, 173 >();
Bench< ELEMENTSIZE, 6133 >();
Bench< ELEMENTSIZE, 10177 >();
Bench< ELEMENTSIZE, 25253 >();
Bench< ELEMENTSIZE, 31391 >();
Bench< ELEMENTSIZE, 50432 >();
AVX2 byte
void qReverse<1>(void* Array, std::size_t Count)
{
std::uint8_t* Array8 = reinterpret_cast<std::uint8_t*>(Array);
std::size_t i = 0;
for( std::size_t j = i / 32; j < ((Count / 2) / 32); ++j )
{
const __m256i ShuffleRev = _mm256_set_epi8(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15
);
// Load 32 elements at once into one 32-byte register
__m256i Lower = _mm256_loadu_si256(
reinterpret_cast<__m256i*>(&Array8[i])
);
__m256i Upper = _mm256_loadu_si256(
reinterpret_cast<__m256i*>(&Array8[Count - i - 32])
);
// Reverse the byte order of our 32-byte vectors
Lower = _mm256_shuffle_epi8(Lower,ShuffleRev);
Upper = _mm256_shuffle_epi8(Upper,ShuffleRev);
Lower = _mm256_permute2x128_si256(Lower,Lower,1);
Upper = _mm256_permute2x128_si256(Upper,Upper,1);
// Place them at their swapped position
_mm256_storeu_si256(
reinterpret_cast<__m256i*>(&Array8[i]),
Upper
);
_mm256_storeu_si256(
reinterpret_cast<__m256i*>(&Array8[Count - i - 32]),
Lower
);
// 32 elements at a time
i += 32;
}
}// BSWAP 32
for( std::size_t j = i / 4; j < ((Count / 2) / 4); ++j )
{
// Get bswapped versions of our Upper and Lower 4-byte chunks
std::uint32_t Lower = Swap32(
*reinterpret_cast<std::uint32_t*>(&Array8[i])
);
std::uint32_t Upper = Swap32(
*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4])
);
// Place them at their swapped position
*reinterpret_cast<std::uint32_t*>(&Array8[i]) = Upper;
*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4]) = Lower;
// Four elements at a time
i += 4;
}
// BSWAP 16
for( std::size_t j = i / 2; j < ((Count / 2) / 2); ++j )
{
// Get bswapped versions of our Upper and Lower 4-byte chunks
std::uint16_t Lower = Swap16(
*reinterpret_cast<std::uint16_t*>(&Array8[i])
);
std::uint16_t Upper = Swap16(
*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2])
);
// Place them at their swapped position
*reinterpret_cast<std::uint16_t*>(&Array8[i]) = Upper;
*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2]) = Lower;
// Two elements at a time
i += 2;
}
// Everything else that we can not do a bswap on, we swap normally
// Naive swaps
for( ; i < Count / 2; ++i )
{
// Exchange the upper and lower element as we work our
// way down to the middle from either end
std::uint8_t Temp(Array8[i]);
Array8[i] = Array8[Count - i - 1];
Array8[Count - i - 1] = Temp;
}
}
SSE3
void qReverse<1>(void* Array, std::size_t Count)
{
std::uint8_t* Array8 = reinterpret_cast<std::uint8_t*>(Array);
std::size_t i = 0;
for( std::size_t j = i / 16; j < ((Count / 2) / 16); ++j )
{
const __m128i ShuffleRev = _mm_set_epi8(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
);
// Load 16 elements at once into one 16-byte register
__m128i Lower = _mm_loadu_si128(
reinterpret_cast<__m128i*>(&Array8[i])
);
__m128i Upper = _mm_loadu_si128(
reinterpret_cast<__m128i*>(&Array8[Count - i - 16])
);
// Reverse the byte order of our 16-byte vectors
Lower = _mm_shuffle_epi8(Lower, ShuffleRev);
Upper = _mm_shuffle_epi8(Upper, ShuffleRev);
// Place them at their swapped position
_mm_storeu_si128(
reinterpret_cast<__m128i*>(&Array8[i]),
Upper
);
_mm_storeu_si128(
reinterpret_cast<__m128i*>(&Array8[Count - i - 16]),
Lower
);
// 16 elements at a time
i += 16;
}
// BSWAP 32
for( std::size_t j = i / 4; j < ((Count / 2) / 4); ++j )
{
// Get bswapped versions of our Upper and Lower 4-byte chunks
std::uint32_t Lower = Swap32(
*reinterpret_cast<std::uint32_t*>(&Array8[i])
);
std::uint32_t Upper = Swap32(
*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4])
);
// Place them at their swapped position
*reinterpret_cast<std::uint32_t*>(&Array8[i]) = Upper;
*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4]) = Lower;
// Four elements at a time
i += 4;
}
// BSWAP 16
for( std::size_t j = i / 2; j < ((Count / 2) / 2); ++j )
{
// Get bswapped versions of our Upper and Lower 4-byte chunks
std::uint16_t Lower = Swap16(
*reinterpret_cast<std::uint16_t*>(&Array8[i])
);
std::uint16_t Upper = Swap16(
*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2])
);
// Place them at their swapped position
*reinterpret_cast<std::uint16_t*>(&Array8[i]) = Upper;
*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2]) = Lower;
// Two elements at a time
i += 2;
}
// Everything else that we can not do a bswap on, we swap normally
// Naive swaps
for( ; i < Count / 2; ++i )
{
// Exchange the upper and lower element as we work our
// way down to the middle from either end
std::uint8_t Temp(Array8[i]);
Array8[i] = Array8[Count - i - 1];
Array8[Count - i - 1] = Temp;
}
}
other method
I have modify the end of reverse,a char was missing
chaine db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero",0
firstreverse db 100h dup (0)
reorder db 100h dup (0)
.code
reverse proc uses esi edi ebx
;invoke DebugBreak
mov esi,offset chaine
mov edx,offset chaine
add esi,sizeof chaine
dec esi
dec esi
mov edi,offset firstreverse
@@:
std
lodsb
cld
stosb
cmp esi,edx
jge @B
mov byte ptr [edi],0
ret
reverse endp
main PROC C
invoke reverse
invoke printf,TXT("reverse : %s "),addr firstreverse
invoke _getch
ret
main endp
Yves,
Attached the testbed with your algo - it's very slow on my machine. Since nobody used
sizeof chaine, I've used len() instead. But the problem is elsewhere: the flags switches std and cld are very, very slow.
yRev proc uses esi edi ebx pSrc, pDest
mov esi, pSrc
add esi, len(esi) ; sizeof chaine
mov edx, pSrc
; dec esi
dec esi
mov edi, pDest
@@:
std
lodsb
cld
stosb
cmp esi,edx
jge @B
mov byte ptr [edi],0
ret
yRev endp
Quote from: dedndave on January 01, 2013, 05:11:20 AMthat's because it has to test the direction flag
you may remember our previous discussions about STD and CLD taking ~80 cycles
I have revisited the nidud's algo, doing the reversing in the same string gives some opportunities :)
Quote
Test Algorithm 1:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 2:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 3:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 4:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 5:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 6:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 7:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 8:
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Algorithm 9:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
Test Timming Algorithm 1: 313
Test Timming Algorithm 2: 312
Test Timming Algorithm 3: 188
Test Timming Algorithm 4: 203
Test Timming Algorithm 5: 94
Test Timming Algorithm 6: 78
Test Timming Algorithm 7: 63
Test Timming Algorithm 8: 94
Test Timming Algorithm 9: 78
Algo8 is nidud's example, Algo9 is the revisited. 78 ticks takes it at the same level of Algo6. Probably using dwords could carry it to 63 ticks.
RevAlg09 proc uses ecx esi, pSrc
;
mov esi, [pSrc]
mov ecx, esi
; Busca final
@top:
inc ecx
or byte ptr [ecx], 0
jnz @top
dec ecx
@Bucle:
mov al, byte ptr [esi]
mov ah, byte ptr [ecx]
mov byte ptr [ecx], al
mov byte ptr [esi], ah
inc esi
dec ecx
cmp ecx, esi
jnbe @Bucle
ret
RevAlg09 endp
New version with more thorough testing:
- my own jRev and xRev algos produced wrong results depending on the length of the string; fixed
- Hutch' algo could not handle short strings with less than 2 bytes; fixed
- Caballero's algo crashed on zero length strings; fixed
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
24614 cycles for 99 * tRev (Hutch)
33313 cycles for 99 * strrev (CRT)
6669 cycles for 99 * jRev (jj)
13590 cycles for 99 * Mirror$ (MasmBasic)
4436 cycles for 99 * xRev (jj, needs pshufb)
5542 cycles for 99 * RevLingo (aligned strings only)
94021 cycles for 99 * cRev (caballero)
32311 cycles for 99 * nRev (Nidud)
24690 cycles for 99 * tRev (Hutch)
33388 cycles for 99 * strrev (CRT)
6819 cycles for 99 * jRev (jj)
13694 cycles for 99 * Mirror$ (MasmBasic)
4539 cycles for 99 * xRev (jj, needs pshufb)
5608 cycles for 99 * RevLingo (aligned strings only)
94088 cycles for 99 * cRev (caballero)
32321 cycles for 99 * nRev (Nidud)
24671 cycles for 99 * tRev (Hutch)
33401 cycles for 99 * strrev (CRT)
6788 cycles for 99 * jRev (jj)
13728 cycles for 99 * Mirror$ (MasmBasic)
4555 cycles for 99 * xRev (jj, needs pshufb)
5617 cycles for 99 * RevLingo (aligned strings only)
94015 cycles for 99 * cRev (caballero)
32272 cycles for 99 * nRev (Nidud)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
87 bytes for jRev (jj)
10 bytes for Mirror$ (MasmBasic)
111 bytes for xRev (jj, needs pshufb)
123 bytes for RevLingo (aligned strings only)
56 bytes for cRev (caballero)
63 bytes for nRev (Nidud)
len(src)=100 bytes
464062 tRev (Hutch) .sogla gnirts gnitset rof dedn
464062 strrev (CRT) .sogla gnirts gnitset rof dedn
464062 jRev (jj) .sogla gnirts gnitset rof dedn
464062 Mirror$ (MasmBasic) .sogla gnirts gnitset rof dedn
464062 xRev (jj, needs pshufb) .sogla gnirts gnitset rof dedn
464062 RevLingo (aligned strings only) .sogla gnirts gnitset rof dedn
464062 cRev (caballero) .sogla gnirts gnitset rof dedn
464062 nRev (Nidud) .sogla gnirts gnitset rof dedn
Stick this version into the test bed. I removed the stack frame so it may be a bit faster.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
tRev proc
; ---------------------------------------------
; in place string reverse, fully self contained
; ---------------------------------------------
push ebx
push edi
push esi
mov edi, [esp+4][12] ; string address
mov eax, edi
sub eax, 1
@@: ; unroll by 4
REPEAT 3
add eax, 1 ; get length
cmp BYTE PTR [eax], 0 ; loop to buffer end
je @F
ENDM
add eax, 1 ; get length
cmp BYTE PTR [eax], 0 ; loop to buffer end
jne @B
@@:
sub eax, edi ; sub address from eax for len
mov esi, eax ; store length in esi
shr eax, 1 ; get integer half length of string
mov ebx, eax ; store it in variable
mov eax, edi ; eax has start char
mov edx, edi ; load start address
add edx, esi ; get end character address
mov esi, ebx ; copy half len into esi
sub edx, 1 ; sub 1 to get end char
@@:
movzx ebx, BYTE PTR [eax] ; load start char
movzx ecx, BYTE PTR [edx] ; load end char
mov [eax], cl ; reverse char pair
add eax, 1 ; increment start char
mov [edx], bl
sub edx, 1 ; decrement end char
sub esi, 1 ; sub 1 from half length
jnz @B ; loop back if its not zero
mov eax, edi ; return the address in eax
pop esi
pop edi
pop ebx
ret 4 ; balance the stack by 4 bytes
tRev endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Result your last revised version.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)
20914 cycles for 99 * tRev (Hutch)
31184 cycles for 99 * strrev (CRT)
6057 cycles for 99 * jRev (jj)
12924 cycles for 99 * Mirror$ (MasmBasic)
4213 cycles for 99 * xRev (jj, needs pshufb)
5318 cycles for 99 * RevLingo (aligned strings only)
91759 cycles for 99 * cRev (caballero)
20469 cycles for 99 * nRev (Nidud)
20989 cycles for 99 * tRev (Hutch)
30259 cycles for 99 * strrev (CRT)
6175 cycles for 99 * jRev (jj)
13076 cycles for 99 * Mirror$ (MasmBasic)
4324 cycles for 99 * xRev (jj, needs pshufb)
5419 cycles for 99 * RevLingo (aligned strings only)
91712 cycles for 99 * cRev (caballero)
20467 cycles for 99 * nRev (Nidud)
20983 cycles for 99 * tRev (Hutch)
30276 cycles for 99 * strrev (CRT)
6175 cycles for 99 * jRev (jj)
13089 cycles for 99 * Mirror$ (MasmBasic)
4327 cycles for 99 * xRev (jj, needs pshufb)
5440 cycles for 99 * RevLingo (aligned strings only)
92241 cycles for 99 * cRev (caballero)
20443 cycles for 99 * nRev (Nidud)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
87 bytes for jRev (jj)
10 bytes for Mirror$ (MasmBasic)
111 bytes for xRev (jj, needs pshufb)
123 bytes for RevLingo (aligned strings only)
56 bytes for cRev (caballero)
63 bytes for nRev (Nidud)
len(src)=100 bytes
464062 tRev (Hutch) .sogla gnirts gnitset rof dedn
464062 strrev (CRT) .sogla gnirts gnitset rof dedn
464062 jRev (jj) .sogla gnirts gnitset rof dedn
464062 Mirror$ (MasmBasic) .sogla gnirts gnitset rof dedn
464062 xRev (jj, needs pshufb) .sogla gnirts gnitset rof dedn
464062 RevLingo (aligned strings only) .sogla gnirts gnitset rof dedn
464062 cRev (caballero) .sogla gnirts gnitset rof dedn
464062 nRev (Nidud) .sogla gnirts gnitset rof dedn
--- ok ---
Quote from: hutch-- on May 09, 2021, 11:35:24 AM
Stick this version into the test bed. I removed the stack frame so it may be a bit faster.
Done :cool:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
24595 cycles for 99 * tRev (Hutch)
33051 cycles for 99 * strrev (CRT)
6683 cycles for 99 * jRev (jj)
13491 cycles for 99 * Mirror$ (MasmBasic)
4457 cycles for 99 * xRev (jj, needs pshufb)
7378 cycles for 99 * RevLingo (aligned strings only)
93296 cycles for 99 * cRev (caballero)
32548 cycles for 99 * nRev (Nidud)
23937 cycles for 99 * tRev (Hutch)
32986 cycles for 99 * strrev (CRT)
6725 cycles for 99 * jRev (jj)
13404 cycles for 99 * Mirror$ (MasmBasic)
4427 cycles for 99 * xRev (jj, needs pshufb)
7379 cycles for 99 * RevLingo (aligned strings only)
93127 cycles for 99 * cRev (caballero)
32965 cycles for 99 * nRev (Nidud)
23949 cycles for 99 * tRev (Hutch)
33041 cycles for 99 * strrev (CRT)
6703 cycles for 99 * jRev (jj)
13384 cycles for 99 * Mirror$ (MasmBasic)
4474 cycles for 99 * xRev (jj, needs pshufb)
7385 cycles for 99 * RevLingo (aligned strings only)
93401 cycles for 99 * cRev (caballero)
31649 cycles for 99 * nRev (Nidud)
107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
87 bytes for jRev (jj)
10 bytes for Mirror$ (MasmBasic)
111 bytes for xRev (jj, needs pshufb)
123 bytes for RevLingo (aligned strings only)
56 bytes for cRev (caballero)
63 bytes for nRev (Nidud)
len(src)=99 bytes
454815 tRev (Hutch) sogla gnirts gnitset rof dedne
454815 strrev (CRT) sogla gnirts gnitset rof dedne
454815 jRev (jj) sogla gnirts gnitset rof dedne
454815 Mirror$ (MasmBasic) sogla gnirts gnitset rof dedne
454815 xRev (jj, needs pshufb) sogla gnirts gnitset rof dedne
454479 RevLingo (aligned strings only) sogla gnirtstgni set rof dedne
454815 cRev (caballero) sogla gnirts gnitset rof dedne
454815 nRev (Nidud) sogla gnirts gnitset rof dedne
At the risk of wearing you out, one more that uses an external proc to get the length, the old Agner Fog StrLen.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
tRev proc
; --------------------------------------------------
; in place string reverse using external length proc
; --------------------------------------------------
push ebx ;
push edi ; save regs
push esi ;
mov edi, [esp+4][12] ; load string address
push ebp
push edi
call StrLen ; use external string length algo
mov edx, edi ; load start address
sub edx, 1 ; sub 1 to get end char
add edx, eax ; get end character address
shr eax, 1 ; get integer half length of string
mov esi, eax ; store it in esi
mov eax, edi ; eax has start char
@@:
movzx ebx, BYTE PTR [eax] ; load start char
movzx ecx, BYTE PTR [edx] ; load end char
mov [eax], cl ; reverse char pair
add eax, 1 ; increment start char
mov [edx], bl
sub edx, 1 ; decrement end char
sub esi, 1 ; sub 1 from half length
jnz @B ; loop back if its not zero
mov eax, edi ; return the address in eax
pop ebp
pop esi ;
pop edi ; restore regs
pop ebx ;
ret 4 ; balance the stack by 4 bytes
tRev endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
You are on the right track, Hutch - 17k instead of 24kCycles, congrats :biggrin:
Indeed, one of the secrets of the fast jRev and xRev algos is this line: add Len (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1144)(ecx), ecx
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
17399 cycles for 99 * tRev (Hutch)
33082 cycles for 99 * strrev (CRT)
6719 cycles for 99 * jRev (jj)
13395 cycles for 99 * Mirror$ (MasmBasic)
4458 cycles for 99 * xRev (jj, needs pshufb)
7459 cycles for 99 * RevLingo (aligned strings only)
93268 cycles for 99 * cRev (caballero)
31727 cycles for 99 * nRev (Nidud)
17005 cycles for 99 * tRev (Hutch)
33112 cycles for 99 * strrev (CRT)
6697 cycles for 99 * jRev (jj)
13400 cycles for 99 * Mirror$ (MasmBasic)
4452 cycles for 99 * xRev (jj, needs pshufb)
7379 cycles for 99 * RevLingo (aligned strings only)
93369 cycles for 99 * cRev (caballero)
31654 cycles for 99 * nRev (Nidud)
17052 cycles for 99 * tRev (Hutch)
33091 cycles for 99 * strrev (CRT)
6729 cycles for 99 * jRev (jj)
13402 cycles for 99 * Mirror$ (MasmBasic)
4439 cycles for 99 * xRev (jj, needs pshufb)
7371 cycles for 99 * RevLingo (aligned strings only)
93431 cycles for 99 * cRev (caballero)
31718 cycles for 99 * nRev (Nidud)
71 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
87 bytes for jRev (jj)
10 bytes for Mirror$ (MasmBasic)
111 bytes for xRev (jj, needs pshufb)
123 bytes for RevLingo (aligned strings only)
56 bytes for cRev (caballero)
63 bytes for nRev (Nidud)
len(src)=99 bytes
454815 tRev (Hutch) sogla gnirts gnitset rof dedne
454815 strrev (CRT) sogla gnirts gnitset rof dedne
454815 jRev (jj) sogla gnirts gnitset rof dedne
454815 Mirror$ (MasmBasic) sogla gnirts gnitset rof dedne
454815 xRev (jj, needs pshufb) sogla gnirts gnitset rof dedne
454479 RevLingo (aligned strings only) sogla gnirtstgni set rof dedne
454815 cRev (caballero) sogla gnirts gnitset rof dedne
454815 nRev (Nidud) sogla gnirts gnitset rof dedne
Btw this is necessary to avoid a crash in case somebody passes a zero length string to the algo:
sub esi, 1 ; sub 1 from half length
jg @B ; loop back if its not greater than zero
Option prologue etc are not necessary if the proc has no locals and no args.
deleted
deleted