String reverse algo

avcaballero · May 08, 2021, 09:13:34 PM

@nidud
Interesting! though it overlaps the original string. Yet, it seems that the best algo remains the 7th one.

Quote
Test Algorithm 1:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 2:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 3:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 4:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 5:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 6:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 7:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 8:
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Timming Algorithm 1: 329
Test Timming Algorithm 2: 281
Test Timming Algorithm 3: 219
Test Timming Algorithm 4: 203
Test Timming Algorithm 5: 93
Test Timming Algorithm 6: 79
Test Timming Algorithm 7: 62
Test Timming Algorithm 8: 78

jj2007 · May 08, 2021, 09:20:12 PM

New version with Nidud's algo (I called it nRev for consistency

):

Code Select

nRev:
  mov edx,[esp+4]
  mov ecx,[esp+4]
  xor eax,eax
slen:
  add ecx,1
  cmp al,[ecx-1]
  jne slen
  sub ecx,2
lupe:
  cmp edx,ecx
  ja done
  mov al,[edx]
  mov ah,[ecx]
  mov [ecx],al
  mov [edx],ah
  lea edx,[edx+1]
  lea ecx,[ecx-1]
  jne lupe
done:
  mov eax,[esp+4]
  retn 4

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

24486   cycles for 100 * tRev (Hutch)
33686   cycles for 100 * strrev (CRT)
13557   cycles for 100 * jRev (jj)
13920   cycles for 100 * Mirror$ (MasmBasic)
4541    cycles for 100 * xRev (jj, needs pshufb)
5548    cycles for 100 * RevLingo (aligned strings only)
95812   cycles for 100 * cRev (caballero)
32676   cycles for 100 * nRev (Nidud)

24901   cycles for 100 * tRev (Hutch)
33800   cycles for 100 * strrev (CRT)
13677   cycles for 100 * jRev (jj)
14012   cycles for 100 * Mirror$ (MasmBasic)
4618    cycles for 100 * xRev (jj, needs pshufb)
5955    cycles for 100 * RevLingo (aligned strings only)
95700   cycles for 100 * cRev (caballero)
32590   cycles for 100 * nRev (Nidud)

24625   cycles for 100 * tRev (Hutch)
33799   cycles for 100 * strrev (CRT)
13914   cycles for 100 * jRev (jj)
14180   cycles for 100 * Mirror$ (MasmBasic)
4617    cycles for 100 * xRev (jj, needs pshufb)
5668    cycles for 100 * RevLingo (aligned strings only)
95766   cycles for 100 * cRev (caballero)
32612   cycles for 100 * nRev (Nidud)

24617   cycles for 100 * tRev (Hutch)
33827   cycles for 100 * strrev (CRT)
13672   cycles for 100 * jRev (jj)
14032   cycles for 100 * Mirror$ (MasmBasic)
4619    cycles for 100 * xRev (jj, needs pshufb)
5663    cycles for 100 * RevLingo (aligned strings only)
97332   cycles for 100 * cRev (caballero)
33423   cycles for 100 * nRev (Nidud)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
51      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
87      bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

1259388704      = eax tRev (Hutch)
1259388704      = eax strrev (CRT)
1259388704      = eax jRev (jj)
1259388704      = eax Mirror$ (MasmBasic)
1259388704      = eax xRev (jj, needs pshufb)
1259388704      = eax RevLingo (aligned strings only)
1259388704      = eax cRev (caballero)
1259388704      = eax nRev (Nidud)

hutch-- · May 08, 2021, 09:49:19 PM

Using the old string instructions without the REP prefix is self inflicted punishment, with the REP prefix you get special case circuitry that performs well, still slightly slower than SSE or AVX but plenty fast enough for most situations. Avoid old instructions like the plague, they live in junk microcode for backwards compatibility, not performance.

nidud · May 08, 2021, 09:54:06 PM

deleted

nidud · May 08, 2021, 10:21:57 PM

deleted

LiaoMi · May 09, 2021, 01:42:49 AM

Hi,

Fast array reversal with SIMD (A small study in hardware accelerated AoS reversal) - https://dev.to/wunk/fast-array-reversal-with-simd-j3p
https://github.com/Wunkolo/qreverse/archive/refs/heads/master.zip

// Powers of two
   Bench< ELEMENTSIZE, 8 >();
   Bench< ELEMENTSIZE, 16 >();
   Bench< ELEMENTSIZE, 32 >();
   Bench< ELEMENTSIZE, 64 >();
   Bench< ELEMENTSIZE, 128 >();
   Bench< ELEMENTSIZE, 256 >();
   Bench< ELEMENTSIZE, 512 >();
   Bench< ELEMENTSIZE, 1024 >();

   // Powers of ten
   Bench< ELEMENTSIZE, 100 >();
   Bench< ELEMENTSIZE, 1000 >();
   Bench< ELEMENTSIZE, 10000 >();
   Bench< ELEMENTSIZE, 100000 >();
   Bench< ELEMENTSIZE, 1000000 >();

   // Primes
   Bench< ELEMENTSIZE, 59 >();
   Bench< ELEMENTSIZE, 79 >();
   Bench< ELEMENTSIZE, 173 >();
   Bench< ELEMENTSIZE, 6133 >();
   Bench< ELEMENTSIZE, 10177 >();
   Bench< ELEMENTSIZE, 25253 >();
   Bench< ELEMENTSIZE, 31391 >();
   Bench< ELEMENTSIZE, 50432 >();

AVX2 byte

Code Select

void qReverse<1>(void* Array, std::size_t Count)
{
	std::uint8_t* Array8 = reinterpret_cast<std::uint8_t*>(Array);
	std::size_t i = 0;

        for( std::size_t j = i / 32; j < ((Count / 2) / 32); ++j )
	{
		const __m256i ShuffleRev = _mm256_set_epi8(
			0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
			0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15
		);
		// Load 32 elements at once into one 32-byte register
		__m256i Lower = _mm256_loadu_si256(
			reinterpret_cast<__m256i*>(&Array8[i])
		);
		__m256i Upper = _mm256_loadu_si256(
			reinterpret_cast<__m256i*>(&Array8[Count - i - 32])
		);

		// Reverse the byte order of our 32-byte vectors
		Lower = _mm256_shuffle_epi8(Lower,ShuffleRev);
		Upper = _mm256_shuffle_epi8(Upper,ShuffleRev);

		Lower = _mm256_permute2x128_si256(Lower,Lower,1);
		Upper = _mm256_permute2x128_si256(Upper,Upper,1);

		// Place them at their swapped position
		_mm256_storeu_si256(
			reinterpret_cast<__m256i*>(&Array8[i]),
			Upper
		);
		_mm256_storeu_si256(
			reinterpret_cast<__m256i*>(&Array8[Count - i - 32]),
			Lower
		);

		// 32 elements at a time
		i += 32;
	}
}// BSWAP 32
	for( std::size_t j = i / 4; j < ((Count / 2) / 4); ++j )
	{
		// Get bswapped versions of our Upper and Lower 4-byte chunks
		std::uint32_t Lower = Swap32(
			*reinterpret_cast<std::uint32_t*>(&Array8[i])
		);
		std::uint32_t Upper = Swap32(
			*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4])
		);

		// Place them at their swapped position
		*reinterpret_cast<std::uint32_t*>(&Array8[i]) = Upper;
		*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4]) = Lower;

		// Four elements at a time
		i += 4;
	}
	// BSWAP 16
	for( std::size_t j = i / 2; j < ((Count / 2) / 2); ++j )
	{
		// Get bswapped versions of our Upper and Lower 4-byte chunks
		std::uint16_t Lower = Swap16(
			*reinterpret_cast<std::uint16_t*>(&Array8[i])
		);
		std::uint16_t Upper = Swap16(
			*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2])
		);

		// Place them at their swapped position
		*reinterpret_cast<std::uint16_t*>(&Array8[i]) = Upper;
		*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2]) = Lower;

		// Two elements at a time
		i += 2;
	}

	// Everything else that we can not do a bswap on, we swap normally
	// Naive swaps
	for( ; i < Count / 2; ++i )
	{
		// Exchange the upper and lower element as we work our
		// way down to the middle from either end
		std::uint8_t Temp(Array8[i]);
		Array8[i] = Array8[Count - i - 1];
		Array8[Count - i - 1] = Temp;
	}
}

SSE3

Code Select


void qReverse<1>(void* Array, std::size_t Count)
{
	std::uint8_t* Array8 = reinterpret_cast<std::uint8_t*>(Array);
	std::size_t i = 0;

for( std::size_t j = i / 16; j < ((Count / 2) / 16); ++j )
	{
		const __m128i ShuffleRev = _mm_set_epi8(
			0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
		);
		// Load 16 elements at once into one 16-byte register
		__m128i Lower = _mm_loadu_si128(
			reinterpret_cast<__m128i*>(&Array8[i])
		);
		__m128i Upper = _mm_loadu_si128(
			reinterpret_cast<__m128i*>(&Array8[Count - i - 16])
		);

		// Reverse the byte order of our 16-byte vectors
		Lower = _mm_shuffle_epi8(Lower, ShuffleRev);
		Upper = _mm_shuffle_epi8(Upper, ShuffleRev);

		// Place them at their swapped position
		_mm_storeu_si128(
			reinterpret_cast<__m128i*>(&Array8[i]),
			Upper
		);
		_mm_storeu_si128(
			reinterpret_cast<__m128i*>(&Array8[Count - i - 16]),
			Lower
		);

		// 16 elements at a time
		i += 16;
	}
// BSWAP 32
	for( std::size_t j = i / 4; j < ((Count / 2) / 4); ++j )
	{
		// Get bswapped versions of our Upper and Lower 4-byte chunks
		std::uint32_t Lower = Swap32(
			*reinterpret_cast<std::uint32_t*>(&Array8[i])
		);
		std::uint32_t Upper = Swap32(
			*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4])
		);

		// Place them at their swapped position
		*reinterpret_cast<std::uint32_t*>(&Array8[i]) = Upper;
		*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4]) = Lower;

		// Four elements at a time
		i += 4;
	}
	// BSWAP 16
	for( std::size_t j = i / 2; j < ((Count / 2) / 2); ++j )
	{
		// Get bswapped versions of our Upper and Lower 4-byte chunks
		std::uint16_t Lower = Swap16(
			*reinterpret_cast<std::uint16_t*>(&Array8[i])
		);
		std::uint16_t Upper = Swap16(
			*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2])
		);

		// Place them at their swapped position
		*reinterpret_cast<std::uint16_t*>(&Array8[i]) = Upper;
		*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2]) = Lower;

		// Two elements at a time
		i += 2;
	}

	// Everything else that we can not do a bswap on, we swap normally
	// Naive swaps
	for( ; i < Count / 2; ++i )
	{
		// Exchange the upper and lower element as we work our
		// way down to the middle from either end
		std::uint8_t Temp(Array8[i]);
		Array8[i] = Array8[Count - i - 1];
		Array8[Count - i - 1] = Temp;
	}
}

TouEnMasm · May 09, 2021, 03:03:26 AM

other method
I have modify the end of reverse,a char was missing

Code Select


chaine db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero",0
firstreverse db 100h dup (0)
reorder  db  100h dup (0)

.code		
reverse proc uses esi edi ebx
;invoke DebugBreak
	mov esi,offset chaine
	mov edx,offset chaine
	add esi,sizeof chaine
	dec esi
	dec esi
	mov edi,offset firstreverse
	@@:
	std
	lodsb
	cld
	stosb
	cmp esi,edx
	jge @B
	mov byte ptr [edi],0
	ret

	
reverse endp
main PROC  C  
	invoke reverse
	invoke printf,TXT("reverse : %s "),addr firstreverse
	invoke _getch
	ret
main endp

jj2007 · May 09, 2021, 04:33:08 AM

Yves,

Attached the testbed with your algo - it's very slow on my machine. Since nobody used sizeof chaine, I've used len() instead. But the problem is elsewhere: the flags switches std and cld are very, very slow.

Code Select

yRev proc uses esi edi ebx pSrc, pDest
  mov esi, pSrc
  add esi, len(esi)	; sizeof chaine
  mov edx, pSrc
  ; dec esi
  dec esi
  mov edi, pDest
  @@:
  std
  lodsb
  cld
  stosb
  cmp esi,edx
  jge @B
  mov byte ptr [edi],0
  ret
yRev endp

Quote from: dedndave on January 01, 2013, 05:11:20 AMthat's because it has to test the direction flag
you may remember our previous discussions about STD and CLD taking ~80 cycles

avcaballero · May 09, 2021, 05:44:44 AM

I have revisited the nidud's algo, doing the reversing in the same string gives some opportunities :)

Quote
Test Algorithm 1:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 2:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 3:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 4:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 5:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 6:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 7:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 8:
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 9:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Timming Algorithm 1: 313
Test Timming Algorithm 2: 312
Test Timming Algorithm 3: 188
Test Timming Algorithm 4: 203
Test Timming Algorithm 5: 94
Test Timming Algorithm 6: 78
Test Timming Algorithm 7: 63
Test Timming Algorithm 8: 94
Test Timming Algorithm 9: 78

Algo8 is nidud's example, Algo9 is the revisited. 78 ticks takes it at the same level of Algo6. Probably using dwords could carry it to 63 ticks.

Code Select


RevAlg09 proc uses ecx esi, pSrc
  ; 
  mov        esi, [pSrc]
  mov        ecx, esi
  ; Busca final
  @top:
    inc        ecx
    or         byte ptr [ecx], 0
  jnz        @top
  dec        ecx
  @Bucle:
    mov        al, byte ptr [esi]
    mov        ah, byte ptr [ecx]
    mov        byte ptr [ecx], al
    mov        byte ptr [esi], ah
    inc        esi
    dec        ecx
    cmp        ecx, esi
  jnbe       @Bucle
  ret
RevAlg09 endp

jj2007 · May 09, 2021, 11:20:57 AM

New version with more thorough testing:
- my own jRev and xRev algos produced wrong results depending on the length of the string; fixed
- Hutch' algo could not handle short strings with less than 2 bytes; fixed
- Caballero's algo crashed on zero length strings; fixed

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

24614   cycles for 99 * tRev (Hutch)
33313   cycles for 99 * strrev (CRT)
6669    cycles for 99 * jRev (jj)
13590   cycles for 99 * Mirror$ (MasmBasic)
4436    cycles for 99 * xRev (jj, needs pshufb)
5542    cycles for 99 * RevLingo (aligned strings only)
94021   cycles for 99 * cRev (caballero)
32311   cycles for 99 * nRev (Nidud)

24690   cycles for 99 * tRev (Hutch)
33388   cycles for 99 * strrev (CRT)
6819    cycles for 99 * jRev (jj)
13694   cycles for 99 * Mirror$ (MasmBasic)
4539    cycles for 99 * xRev (jj, needs pshufb)
5608    cycles for 99 * RevLingo (aligned strings only)
94088   cycles for 99 * cRev (caballero)
32321   cycles for 99 * nRev (Nidud)

24671   cycles for 99 * tRev (Hutch)
33401   cycles for 99 * strrev (CRT)
6788    cycles for 99 * jRev (jj)
13728   cycles for 99 * Mirror$ (MasmBasic)
4555    cycles for 99 * xRev (jj, needs pshufb)
5617    cycles for 99 * RevLingo (aligned strings only)
94015   cycles for 99 * cRev (caballero)
32272   cycles for 99 * nRev (Nidud)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
87      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
111     bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

len(src)=100 bytes
464062  tRev (Hutch)                    .sogla gnirts gnitset rof dedn
464062  strrev (CRT)                    .sogla gnirts gnitset rof dedn
464062  jRev (jj)                       .sogla gnirts gnitset rof dedn
464062  Mirror$ (MasmBasic)             .sogla gnirts gnitset rof dedn
464062  xRev (jj, needs pshufb)         .sogla gnirts gnitset rof dedn
464062  RevLingo (aligned strings only) .sogla gnirts gnitset rof dedn
464062  cRev (caballero)                .sogla gnirts gnitset rof dedn
464062  nRev (Nidud)                    .sogla gnirts gnitset rof dedn

hutch-- · May 09, 2021, 11:35:24 AM

Stick this version into the test bed. I removed the stack frame so it may be a bit faster.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

tRev proc
; ---------------------------------------------
; in place string reverse, fully self contained
; ---------------------------------------------
push ebx
push edi
push esi

mov edi, [esp+4][12] ; string address
mov eax, edi

sub eax, 1
@@: ; unroll by 4
REPEAT 3
add eax, 1 ; get length
cmp BYTE PTR [eax], 0 ; loop to buffer end
je @F
ENDM
add eax, 1 ; get length
cmp BYTE PTR [eax], 0 ; loop to buffer end
jne @B

@@:
sub eax, edi ; sub address from eax for len
mov esi, eax ; store length in esi

shr eax, 1 ; get integer half length of string
mov ebx, eax ; store it in variable
mov eax, edi ; eax has start char
mov edx, edi ; load start address
add edx, esi ; get end character address
mov esi, ebx ; copy half len into esi
sub edx, 1 ; sub 1 to get end char

@@:
movzx ebx, BYTE PTR [eax] ; load start char
movzx ecx, BYTE PTR [edx] ; load end char
mov [eax], cl ; reverse char pair
add eax, 1 ; increment start char
mov [edx], bl
sub edx, 1 ; decrement end char
sub esi, 1 ; sub 1 from half length
jnz @B ; loop back if its not zero

mov eax, edi ; return the address in eax

pop esi
pop edi
pop ebx

ret 4 ; balance the stack by 4 bytes

tRev endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

hutch-- · May 09, 2021, 11:42:47 AM

Result your last revised version.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

20914 cycles for 99 * tRev (Hutch)
31184 cycles for 99 * strrev (CRT)
6057 cycles for 99 * jRev (jj)
12924 cycles for 99 * Mirror$ (MasmBasic)
4213 cycles for 99 * xRev (jj, needs pshufb)
5318 cycles for 99 * RevLingo (aligned strings only)
91759 cycles for 99 * cRev (caballero)
20469 cycles for 99 * nRev (Nidud)

20989 cycles for 99 * tRev (Hutch)
30259 cycles for 99 * strrev (CRT)
6175 cycles for 99 * jRev (jj)
13076 cycles for 99 * Mirror$ (MasmBasic)
4324 cycles for 99 * xRev (jj, needs pshufb)
5419 cycles for 99 * RevLingo (aligned strings only)
91712 cycles for 99 * cRev (caballero)
20467 cycles for 99 * nRev (Nidud)

20983 cycles for 99 * tRev (Hutch)
30276 cycles for 99 * strrev (CRT)
6175 cycles for 99 * jRev (jj)
13089 cycles for 99 * Mirror$ (MasmBasic)
4327 cycles for 99 * xRev (jj, needs pshufb)
5440 cycles for 99 * RevLingo (aligned strings only)
92241 cycles for 99 * cRev (caballero)
20443 cycles for 99 * nRev (Nidud)

107 bytes for tRev (Hutch)
19 bytes for strrev (CRT)
87 bytes for jRev (jj)
10 bytes for Mirror$ (MasmBasic)
111 bytes for xRev (jj, needs pshufb)
123 bytes for RevLingo (aligned strings only)
56 bytes for cRev (caballero)
63 bytes for nRev (Nidud)

len(src)=100 bytes
464062 tRev (Hutch) .sogla gnirts gnitset rof dedn
464062 strrev (CRT) .sogla gnirts gnitset rof dedn
464062 jRev (jj) .sogla gnirts gnitset rof dedn
464062 Mirror$ (MasmBasic) .sogla gnirts gnitset rof dedn
464062 xRev (jj, needs pshufb) .sogla gnirts gnitset rof dedn
464062 RevLingo (aligned strings only) .sogla gnirts gnitset rof dedn
464062 cRev (caballero) .sogla gnirts gnitset rof dedn
464062 nRev (Nidud) .sogla gnirts gnitset rof dedn

--- ok ---

jj2007 · May 09, 2021, 07:24:15 PM

Quote from: hutch-- on May 09, 2021, 11:35:24 AM
Stick this version into the test bed. I removed the stack frame so it may be a bit faster.

Done

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

24595   cycles for 99 * tRev (Hutch)
33051   cycles for 99 * strrev (CRT)
6683    cycles for 99 * jRev (jj)
13491   cycles for 99 * Mirror$ (MasmBasic)
4457    cycles for 99 * xRev (jj, needs pshufb)
7378    cycles for 99 * RevLingo (aligned strings only)
93296   cycles for 99 * cRev (caballero)
32548   cycles for 99 * nRev (Nidud)

23937   cycles for 99 * tRev (Hutch)
32986   cycles for 99 * strrev (CRT)
6725    cycles for 99 * jRev (jj)
13404   cycles for 99 * Mirror$ (MasmBasic)
4427    cycles for 99 * xRev (jj, needs pshufb)
7379    cycles for 99 * RevLingo (aligned strings only)
93127   cycles for 99 * cRev (caballero)
32965   cycles for 99 * nRev (Nidud)

23949   cycles for 99 * tRev (Hutch)
33041   cycles for 99 * strrev (CRT)
6703    cycles for 99 * jRev (jj)
13384   cycles for 99 * Mirror$ (MasmBasic)
4474    cycles for 99 * xRev (jj, needs pshufb)
7385    cycles for 99 * RevLingo (aligned strings only)
93401   cycles for 99 * cRev (caballero)
31649   cycles for 99 * nRev (Nidud)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
87      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
111     bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

len(src)=99 bytes
454815  tRev (Hutch)                    sogla gnirts gnitset rof dedne
454815  strrev (CRT)                    sogla gnirts gnitset rof dedne
454815  jRev (jj)                       sogla gnirts gnitset rof dedne
454815  Mirror$ (MasmBasic)             sogla gnirts gnitset rof dedne
454815  xRev (jj, needs pshufb)         sogla gnirts gnitset rof dedne
454479  RevLingo (aligned strings only) sogla gnirtstgni set rof dedne
454815  cRev (caballero)                sogla gnirts gnitset rof dedne
454815  nRev (Nidud)                    sogla gnirts gnitset rof dedne

hutch-- · May 09, 2021, 11:34:42 PM

At the risk of wearing you out, one more that uses an external proc to get the length, the old Agner Fog StrLen.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

tRev proc
; --------------------------------------------------
; in place string reverse using external length proc
; --------------------------------------------------

push ebx ;
push edi ; save regs
push esi ;

mov edi, [esp+4][12] ; load string address

push ebp

push edi
call StrLen ; use external string length algo

mov edx, edi ; load start address
sub edx, 1 ; sub 1 to get end char
add edx, eax ; get end character address
shr eax, 1 ; get integer half length of string
mov esi, eax ; store it in esi
mov eax, edi ; eax has start char

@@:
movzx ebx, BYTE PTR [eax] ; load start char
movzx ecx, BYTE PTR [edx] ; load end char

mov [eax], cl ; reverse char pair
add eax, 1 ; increment start char
mov [edx], bl
sub edx, 1 ; decrement end char
sub esi, 1 ; sub 1 from half length
jnz @B ; loop back if its not zero

mov eax, edi ; return the address in eax

pop ebp
pop esi ;
pop edi ; restore regs
pop ebx ;

ret 4 ; balance the stack by 4 bytes

tRev endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

jj2007 · May 10, 2021, 06:34:07 AM

You are on the right track, Hutch - 17k instead of 24kCycles, congrats

Indeed, one of the secrets of the fast jRev and xRev algos is this line: add Len(ecx), ecx

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
17399   cycles for 99 * tRev (Hutch)
33082   cycles for 99 * strrev (CRT)
6719    cycles for 99 * jRev (jj)
13395   cycles for 99 * Mirror$ (MasmBasic)
4458    cycles for 99 * xRev (jj, needs pshufb)
7459    cycles for 99 * RevLingo (aligned strings only)
93268   cycles for 99 * cRev (caballero)
31727   cycles for 99 * nRev (Nidud)

17005   cycles for 99 * tRev (Hutch)
33112   cycles for 99 * strrev (CRT)
6697    cycles for 99 * jRev (jj)
13400   cycles for 99 * Mirror$ (MasmBasic)
4452    cycles for 99 * xRev (jj, needs pshufb)
7379    cycles for 99 * RevLingo (aligned strings only)
93369   cycles for 99 * cRev (caballero)
31654   cycles for 99 * nRev (Nidud)

17052   cycles for 99 * tRev (Hutch)
33091   cycles for 99 * strrev (CRT)
6729    cycles for 99 * jRev (jj)
13402   cycles for 99 * Mirror$ (MasmBasic)
4439    cycles for 99 * xRev (jj, needs pshufb)
7371    cycles for 99 * RevLingo (aligned strings only)
93431   cycles for 99 * cRev (caballero)
31718   cycles for 99 * nRev (Nidud)

71      bytes for tRev (Hutch)
19      bytes for strrev (CRT)
87      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
111     bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

len(src)=99 bytes
454815  tRev (Hutch)                    sogla gnirts gnitset rof dedne
454815  strrev (CRT)                    sogla gnirts gnitset rof dedne
454815  jRev (jj)                       sogla gnirts gnitset rof dedne
454815  Mirror$ (MasmBasic)             sogla gnirts gnitset rof dedne
454815  xRev (jj, needs pshufb)         sogla gnirts gnitset rof dedne
454479  RevLingo (aligned strings only) sogla gnirtstgni set rof dedne
454815  cRev (caballero)                sogla gnirts gnitset rof dedne
454815  nRev (Nidud)                    sogla gnirts gnitset rof dedne

Btw this is necessary to avoid a crash in case somebody passes a zero length string to the algo:
sub esi, 1 ; sub 1 from half length
jg @B ; loop back if its ~~not~~ greater than zero

Option prologue etc are not necessary if the proc has no locals and no args.

The MASM Forum

News:

String reverse algo

avcaballero

jj2007

hutch--

nidud

nidud

LiaoMi

TouEnMasm

jj2007

avcaballero

jj2007

hutch--

hutch--

jj2007

hutch--

jj2007