News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

String reverse algo

Started by hutch--, May 07, 2021, 09:34:39 PM

Previous topic - Next topic

avcaballero

@nidud
Interesting! though it overlaps the original string. Yet, it seems that the best algo remains the 7th one.
Quote
Test Algorithm 1:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 2:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 3:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 4:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 5:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 6:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 7:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 8:
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Timming Algorithm 1:  329
Test Timming Algorithm 2:  281
Test Timming Algorithm 3:  219
Test Timming Algorithm 4:  203
Test Timming Algorithm 5: 93
Test Timming Algorithm 6: 79
Test Timming Algorithm 7: 62
Test Timming Algorithm 8: 78

jj2007

New version with Nidud's algo (I called it nRev for consistency :biggrin:):

nRev:
  mov edx,[esp+4]
  mov ecx,[esp+4]
  xor eax,eax
slen:
  add ecx,1
  cmp al,[ecx-1]
  jne slen
  sub ecx,2
lupe:
  cmp edx,ecx
  ja done
  mov al,[edx]
  mov ah,[ecx]
  mov [ecx],al
  mov [edx],ah
  lea edx,[edx+1]
  lea ecx,[ecx-1]
  jne lupe
done:
  mov eax,[esp+4]
  retn 4


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

24486   cycles for 100 * tRev (Hutch)
33686   cycles for 100 * strrev (CRT)
13557   cycles for 100 * jRev (jj)
13920   cycles for 100 * Mirror$ (MasmBasic)
4541    cycles for 100 * xRev (jj, needs pshufb)
5548    cycles for 100 * RevLingo (aligned strings only)
95812   cycles for 100 * cRev (caballero)
32676   cycles for 100 * nRev (Nidud)

24901   cycles for 100 * tRev (Hutch)
33800   cycles for 100 * strrev (CRT)
13677   cycles for 100 * jRev (jj)
14012   cycles for 100 * Mirror$ (MasmBasic)
4618    cycles for 100 * xRev (jj, needs pshufb)
5955    cycles for 100 * RevLingo (aligned strings only)
95700   cycles for 100 * cRev (caballero)
32590   cycles for 100 * nRev (Nidud)

24625   cycles for 100 * tRev (Hutch)
33799   cycles for 100 * strrev (CRT)
13914   cycles for 100 * jRev (jj)
14180   cycles for 100 * Mirror$ (MasmBasic)
4617    cycles for 100 * xRev (jj, needs pshufb)
5668    cycles for 100 * RevLingo (aligned strings only)
95766   cycles for 100 * cRev (caballero)
32612   cycles for 100 * nRev (Nidud)

24617   cycles for 100 * tRev (Hutch)
33827   cycles for 100 * strrev (CRT)
13672   cycles for 100 * jRev (jj)
14032   cycles for 100 * Mirror$ (MasmBasic)
4619    cycles for 100 * xRev (jj, needs pshufb)
5663    cycles for 100 * RevLingo (aligned strings only)
97332   cycles for 100 * cRev (caballero)
33423   cycles for 100 * nRev (Nidud)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
51      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
87      bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

1259388704      = eax tRev (Hutch)
1259388704      = eax strrev (CRT)
1259388704      = eax jRev (jj)
1259388704      = eax Mirror$ (MasmBasic)
1259388704      = eax xRev (jj, needs pshufb)
1259388704      = eax RevLingo (aligned strings only)
1259388704      = eax cRev (caballero)
1259388704      = eax nRev (Nidud)

hutch--

Using the old string instructions without the REP prefix is self inflicted punishment, with the REP prefix you get special case circuitry that performs well, still slightly slower than SSE or AVX but plenty fast enough for most situations. Avoid old instructions like the plague, they live in junk microcode for backwards compatibility, not performance.

nidud

#18
deleted

nidud

#19
deleted

LiaoMi

Hi,

Fast array reversal with SIMD (A small study in hardware accelerated AoS reversal) - https://dev.to/wunk/fast-array-reversal-with-simd-j3p
https://github.com/Wunkolo/qreverse/archive/refs/heads/master.zip

// Powers of two
   Bench< ELEMENTSIZE, 8 >();
   Bench< ELEMENTSIZE, 16 >();
   Bench< ELEMENTSIZE, 32 >();
   Bench< ELEMENTSIZE, 64 >();
   Bench< ELEMENTSIZE, 128 >();
   Bench< ELEMENTSIZE, 256 >();
   Bench< ELEMENTSIZE, 512 >();
   Bench< ELEMENTSIZE, 1024 >();

   // Powers of ten
   Bench< ELEMENTSIZE, 100 >();
   Bench< ELEMENTSIZE, 1000 >();
   Bench< ELEMENTSIZE, 10000 >();
   Bench< ELEMENTSIZE, 100000 >();
   Bench< ELEMENTSIZE, 1000000 >();

   // Primes
   Bench< ELEMENTSIZE, 59 >();
   Bench< ELEMENTSIZE, 79 >();
   Bench< ELEMENTSIZE, 173 >();
   Bench< ELEMENTSIZE, 6133 >();
   Bench< ELEMENTSIZE, 10177 >();
   Bench< ELEMENTSIZE, 25253 >();
   Bench< ELEMENTSIZE, 31391 >();
   Bench< ELEMENTSIZE, 50432 >();

AVX2 byte
void qReverse<1>(void* Array, std::size_t Count)
{
std::uint8_t* Array8 = reinterpret_cast<std::uint8_t*>(Array);
std::size_t i = 0;

        for( std::size_t j = i / 32; j < ((Count / 2) / 32); ++j )
{
const __m256i ShuffleRev = _mm256_set_epi8(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15
);
// Load 32 elements at once into one 32-byte register
__m256i Lower = _mm256_loadu_si256(
reinterpret_cast<__m256i*>(&Array8[i])
);
__m256i Upper = _mm256_loadu_si256(
reinterpret_cast<__m256i*>(&Array8[Count - i - 32])
);

// Reverse the byte order of our 32-byte vectors
Lower = _mm256_shuffle_epi8(Lower,ShuffleRev);
Upper = _mm256_shuffle_epi8(Upper,ShuffleRev);

Lower = _mm256_permute2x128_si256(Lower,Lower,1);
Upper = _mm256_permute2x128_si256(Upper,Upper,1);

// Place them at their swapped position
_mm256_storeu_si256(
reinterpret_cast<__m256i*>(&Array8[i]),
Upper
);
_mm256_storeu_si256(
reinterpret_cast<__m256i*>(&Array8[Count - i - 32]),
Lower
);

// 32 elements at a time
i += 32;
}
}// BSWAP 32
for( std::size_t j = i / 4; j < ((Count / 2) / 4); ++j )
{
// Get bswapped versions of our Upper and Lower 4-byte chunks
std::uint32_t Lower = Swap32(
*reinterpret_cast<std::uint32_t*>(&Array8[i])
);
std::uint32_t Upper = Swap32(
*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4])
);

// Place them at their swapped position
*reinterpret_cast<std::uint32_t*>(&Array8[i]) = Upper;
*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4]) = Lower;

// Four elements at a time
i += 4;
}
// BSWAP 16
for( std::size_t j = i / 2; j < ((Count / 2) / 2); ++j )
{
// Get bswapped versions of our Upper and Lower 4-byte chunks
std::uint16_t Lower = Swap16(
*reinterpret_cast<std::uint16_t*>(&Array8[i])
);
std::uint16_t Upper = Swap16(
*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2])
);

// Place them at their swapped position
*reinterpret_cast<std::uint16_t*>(&Array8[i]) = Upper;
*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2]) = Lower;

// Two elements at a time
i += 2;
}

// Everything else that we can not do a bswap on, we swap normally
// Naive swaps
for( ; i < Count / 2; ++i )
{
// Exchange the upper and lower element as we work our
// way down to the middle from either end
std::uint8_t Temp(Array8[i]);
Array8[i] = Array8[Count - i - 1];
Array8[Count - i - 1] = Temp;
}
}


SSE3

void qReverse<1>(void* Array, std::size_t Count)
{
std::uint8_t* Array8 = reinterpret_cast<std::uint8_t*>(Array);
std::size_t i = 0;

for( std::size_t j = i / 16; j < ((Count / 2) / 16); ++j )
{
const __m128i ShuffleRev = _mm_set_epi8(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
);
// Load 16 elements at once into one 16-byte register
__m128i Lower = _mm_loadu_si128(
reinterpret_cast<__m128i*>(&Array8[i])
);
__m128i Upper = _mm_loadu_si128(
reinterpret_cast<__m128i*>(&Array8[Count - i - 16])
);

// Reverse the byte order of our 16-byte vectors
Lower = _mm_shuffle_epi8(Lower, ShuffleRev);
Upper = _mm_shuffle_epi8(Upper, ShuffleRev);

// Place them at their swapped position
_mm_storeu_si128(
reinterpret_cast<__m128i*>(&Array8[i]),
Upper
);
_mm_storeu_si128(
reinterpret_cast<__m128i*>(&Array8[Count - i - 16]),
Lower
);

// 16 elements at a time
i += 16;
}
// BSWAP 32
for( std::size_t j = i / 4; j < ((Count / 2) / 4); ++j )
{
// Get bswapped versions of our Upper and Lower 4-byte chunks
std::uint32_t Lower = Swap32(
*reinterpret_cast<std::uint32_t*>(&Array8[i])
);
std::uint32_t Upper = Swap32(
*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4])
);

// Place them at their swapped position
*reinterpret_cast<std::uint32_t*>(&Array8[i]) = Upper;
*reinterpret_cast<std::uint32_t*>(&Array8[Count - i - 4]) = Lower;

// Four elements at a time
i += 4;
}
// BSWAP 16
for( std::size_t j = i / 2; j < ((Count / 2) / 2); ++j )
{
// Get bswapped versions of our Upper and Lower 4-byte chunks
std::uint16_t Lower = Swap16(
*reinterpret_cast<std::uint16_t*>(&Array8[i])
);
std::uint16_t Upper = Swap16(
*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2])
);

// Place them at their swapped position
*reinterpret_cast<std::uint16_t*>(&Array8[i]) = Upper;
*reinterpret_cast<std::uint16_t*>(&Array8[Count - i - 2]) = Lower;

// Two elements at a time
i += 2;
}

// Everything else that we can not do a bswap on, we swap normally
// Naive swaps
for( ; i < Count / 2; ++i )
{
// Exchange the upper and lower element as we work our
// way down to the middle from either end
std::uint8_t Temp(Array8[i]);
Array8[i] = Array8[Count - i - 1];
Array8[Count - i - 1] = Temp;
}
}

TouEnMasm

other method
I have modify the end of reverse,a char was missing

chaine db "Hello, this is a simple string intended for testing string algos. It has 100 characters without zero",0
firstreverse db 100h dup (0)
reorder  db  100h dup (0)

.code
reverse proc uses esi edi ebx
;invoke DebugBreak
mov esi,offset chaine
mov edx,offset chaine
add esi,sizeof chaine
dec esi
dec esi
mov edi,offset firstreverse
@@:
std
lodsb
cld
stosb
cmp esi,edx
jge @B
mov byte ptr [edi],0
ret


reverse endp
main PROC  C 
invoke reverse
invoke printf,TXT("reverse : %s "),addr firstreverse
invoke _getch
ret
main endp

Fa is a musical note to play with CL

jj2007

Yves,

Attached the testbed with your algo - it's very slow on my machine. Since nobody used sizeof chaine, I've used len() instead. But the problem is elsewhere: the flags switches std and cld are very, very slow.

yRev proc uses esi edi ebx pSrc, pDest
  mov esi, pSrc
  add esi, len(esi) ; sizeof chaine
  mov edx, pSrc
  ; dec esi
  dec esi
  mov edi, pDest
  @@:
  std
  lodsb
  cld
  stosb
  cmp esi,edx
  jge @B
  mov byte ptr [edi],0
  ret
yRev endp


Quote from: dedndave on January 01, 2013, 05:11:20 AMthat's because it has to test the direction flag
you may remember our previous discussions about STD and CLD taking ~80 cycles

avcaballero

I have revisited the nidud's algo, doing the reversing in the same string gives some opportunities :)
Quote
Test Algorithm 1:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 2:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 3:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 4:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 5:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 6:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 7:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 8:
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Algorithm 9:
Hello, this is a simple string intended for testing string algos. It has 100 characters without zero
orez tuohtiw sretcarahc 001 sah tI .sogla gnirts gnitset rof dednetni gnirts elpmis a si siht ,olleH

Test Timming Algorithm 1:  313
Test Timming Algorithm 2:  312
Test Timming Algorithm 3:  188
Test Timming Algorithm 4:  203
Test Timming Algorithm 5: 94
Test Timming Algorithm 6: 78
Test Timming Algorithm 7: 63
Test Timming Algorithm 8: 94
Test Timming Algorithm 9: 78
Algo8 is nidud's example, Algo9 is the revisited. 78 ticks takes it at the same level of Algo6. Probably using dwords could carry it to 63 ticks.


RevAlg09 proc uses ecx esi, pSrc
  ;
  mov        esi, [pSrc]
  mov        ecx, esi
  ; Busca final
  @top:
    inc        ecx
    or         byte ptr [ecx], 0
  jnz        @top
  dec        ecx
  @Bucle:
    mov        al, byte ptr [esi]
    mov        ah, byte ptr [ecx]
    mov        byte ptr [ecx], al
    mov        byte ptr [esi], ah
    inc        esi
    dec        ecx
    cmp        ecx, esi
  jnbe       @Bucle
  ret
RevAlg09 endp


jj2007

New version with more thorough testing:
- my own jRev and xRev algos produced wrong results depending on the length of the string; fixed
- Hutch' algo could not handle short strings with less than 2 bytes; fixed
- Caballero's algo crashed on zero length strings; fixed

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

24614   cycles for 99 * tRev (Hutch)
33313   cycles for 99 * strrev (CRT)
6669    cycles for 99 * jRev (jj)
13590   cycles for 99 * Mirror$ (MasmBasic)
4436    cycles for 99 * xRev (jj, needs pshufb)
5542    cycles for 99 * RevLingo (aligned strings only)
94021   cycles for 99 * cRev (caballero)
32311   cycles for 99 * nRev (Nidud)

24690   cycles for 99 * tRev (Hutch)
33388   cycles for 99 * strrev (CRT)
6819    cycles for 99 * jRev (jj)
13694   cycles for 99 * Mirror$ (MasmBasic)
4539    cycles for 99 * xRev (jj, needs pshufb)
5608    cycles for 99 * RevLingo (aligned strings only)
94088   cycles for 99 * cRev (caballero)
32321   cycles for 99 * nRev (Nidud)

24671   cycles for 99 * tRev (Hutch)
33401   cycles for 99 * strrev (CRT)
6788    cycles for 99 * jRev (jj)
13728   cycles for 99 * Mirror$ (MasmBasic)
4555    cycles for 99 * xRev (jj, needs pshufb)
5617    cycles for 99 * RevLingo (aligned strings only)
94015   cycles for 99 * cRev (caballero)
32272   cycles for 99 * nRev (Nidud)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
87      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
111     bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

len(src)=100 bytes
464062  tRev (Hutch)                    .sogla gnirts gnitset rof dedn
464062  strrev (CRT)                    .sogla gnirts gnitset rof dedn
464062  jRev (jj)                       .sogla gnirts gnitset rof dedn
464062  Mirror$ (MasmBasic)             .sogla gnirts gnitset rof dedn
464062  xRev (jj, needs pshufb)         .sogla gnirts gnitset rof dedn
464062  RevLingo (aligned strings only) .sogla gnirts gnitset rof dedn
464062  cRev (caballero)                .sogla gnirts gnitset rof dedn
464062  nRev (Nidud)                    .sogla gnirts gnitset rof dedn

hutch--

Stick this version into the test bed. I removed the stack frame so it may be a bit faster.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

tRev proc
  ; ---------------------------------------------
  ; in place string reverse, fully self contained
  ; ---------------------------------------------
    push ebx
    push edi
    push esi

    mov edi, [esp+4][12]                ; string address
    mov eax, edi

    sub eax, 1
  @@:                                   ; unroll by 4
  REPEAT 3
    add eax, 1                          ; get length
    cmp BYTE PTR [eax], 0               ; loop to buffer end
    je @F
  ENDM
    add eax, 1                          ; get length
    cmp BYTE PTR [eax], 0               ; loop to buffer end
    jne @B

  @@:
    sub eax, edi                        ; sub address from eax for len
    mov esi, eax                        ; store length in esi

    shr eax, 1                          ; get integer half length of string
    mov ebx, eax                        ; store it in variable
    mov eax, edi                        ; eax has start char
    mov edx, edi                        ; load start address
    add edx, esi                        ; get end character address
    mov esi, ebx                        ; copy half len into esi
    sub edx, 1                          ; sub 1 to get end char

  @@:
    movzx ebx, BYTE PTR [eax]           ; load start char
    movzx ecx, BYTE PTR [edx]           ; load end char
    mov [eax], cl                       ; reverse char pair
    add eax, 1                          ; increment start char
    mov [edx], bl
    sub edx, 1                          ; decrement end char
    sub esi, 1                          ; sub 1 from half length
    jnz @B                              ; loop back if its not zero

    mov eax, edi                        ; return the address in eax

    pop esi
    pop edi
    pop ebx

    ret 4                               ; balance the stack by 4 bytes

tRev endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

hutch--

Result your last revised version.

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (SSE4)

20914   cycles for 99 * tRev (Hutch)
31184   cycles for 99 * strrev (CRT)
6057    cycles for 99 * jRev (jj)
12924   cycles for 99 * Mirror$ (MasmBasic)
4213    cycles for 99 * xRev (jj, needs pshufb)
5318    cycles for 99 * RevLingo (aligned strings only)
91759   cycles for 99 * cRev (caballero)
20469   cycles for 99 * nRev (Nidud)

20989   cycles for 99 * tRev (Hutch)
30259   cycles for 99 * strrev (CRT)
6175    cycles for 99 * jRev (jj)
13076   cycles for 99 * Mirror$ (MasmBasic)
4324    cycles for 99 * xRev (jj, needs pshufb)
5419    cycles for 99 * RevLingo (aligned strings only)
91712   cycles for 99 * cRev (caballero)
20467   cycles for 99 * nRev (Nidud)

20983   cycles for 99 * tRev (Hutch)
30276   cycles for 99 * strrev (CRT)
6175    cycles for 99 * jRev (jj)
13089   cycles for 99 * Mirror$ (MasmBasic)
4327    cycles for 99 * xRev (jj, needs pshufb)
5440    cycles for 99 * RevLingo (aligned strings only)
92241   cycles for 99 * cRev (caballero)
20443   cycles for 99 * nRev (Nidud)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
87      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
111     bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

len(src)=100 bytes
464062  tRev (Hutch)                    .sogla gnirts gnitset rof dedn
464062  strrev (CRT)                    .sogla gnirts gnitset rof dedn
464062  jRev (jj)                       .sogla gnirts gnitset rof dedn
464062  Mirror$ (MasmBasic)             .sogla gnirts gnitset rof dedn
464062  xRev (jj, needs pshufb)         .sogla gnirts gnitset rof dedn
464062  RevLingo (aligned strings only) .sogla gnirts gnitset rof dedn
464062  cRev (caballero)                .sogla gnirts gnitset rof dedn
464062  nRev (Nidud)                    .sogla gnirts gnitset rof dedn

--- ok ---

jj2007

Quote from: hutch-- on May 09, 2021, 11:35:24 AM
Stick this version into the test bed. I removed the stack frame so it may be a bit faster.

Done :cool:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

24595   cycles for 99 * tRev (Hutch)
33051   cycles for 99 * strrev (CRT)
6683    cycles for 99 * jRev (jj)
13491   cycles for 99 * Mirror$ (MasmBasic)
4457    cycles for 99 * xRev (jj, needs pshufb)
7378    cycles for 99 * RevLingo (aligned strings only)
93296   cycles for 99 * cRev (caballero)
32548   cycles for 99 * nRev (Nidud)

23937   cycles for 99 * tRev (Hutch)
32986   cycles for 99 * strrev (CRT)
6725    cycles for 99 * jRev (jj)
13404   cycles for 99 * Mirror$ (MasmBasic)
4427    cycles for 99 * xRev (jj, needs pshufb)
7379    cycles for 99 * RevLingo (aligned strings only)
93127   cycles for 99 * cRev (caballero)
32965   cycles for 99 * nRev (Nidud)

23949   cycles for 99 * tRev (Hutch)
33041   cycles for 99 * strrev (CRT)
6703    cycles for 99 * jRev (jj)
13384   cycles for 99 * Mirror$ (MasmBasic)
4474    cycles for 99 * xRev (jj, needs pshufb)
7385    cycles for 99 * RevLingo (aligned strings only)
93401   cycles for 99 * cRev (caballero)
31649   cycles for 99 * nRev (Nidud)

107     bytes for tRev (Hutch)
19      bytes for strrev (CRT)
87      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
111     bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

len(src)=99 bytes
454815  tRev (Hutch)                    sogla gnirts gnitset rof dedne
454815  strrev (CRT)                    sogla gnirts gnitset rof dedne
454815  jRev (jj)                       sogla gnirts gnitset rof dedne
454815  Mirror$ (MasmBasic)             sogla gnirts gnitset rof dedne
454815  xRev (jj, needs pshufb)         sogla gnirts gnitset rof dedne
454479  RevLingo (aligned strings only) sogla gnirtstgni set rof dedne
454815  cRev (caballero)                sogla gnirts gnitset rof dedne
454815  nRev (Nidud)                    sogla gnirts gnitset rof dedne

hutch--

At the risk of wearing you out, one more that uses an external proc to get the length, the old Agner Fog StrLen.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

tRev proc
  ; --------------------------------------------------
  ; in place string reverse using external length proc
  ; --------------------------------------------------

    push ebx                                ;
    push edi                                ; save regs
    push esi                                ;

    mov edi, [esp+4][12]                    ; load string address

    push ebp

    push edi
    call StrLen                             ; use external string length algo

    mov edx, edi                            ; load start address
    sub edx, 1                              ; sub 1 to get end char
    add edx, eax                            ; get end character address
    shr eax, 1                              ; get integer half length of string
    mov esi, eax                            ; store it in esi
    mov eax, edi                            ; eax has start char

  @@:
    movzx ebx, BYTE PTR [eax]               ; load start char
    movzx ecx, BYTE PTR [edx]               ; load end char

    mov [eax], cl                           ; reverse char pair
    add eax, 1                              ; increment start char
    mov [edx], bl
    sub edx, 1                              ; decrement end char
    sub esi, 1                              ; sub 1 from half length
    jnz @B                                  ; loop back if its not zero

    mov eax, edi                            ; return the address in eax

    pop ebp
    pop esi                                 ;
    pop edi                                 ; restore regs
    pop ebx                                 ;

    ret 4                                   ; balance the stack by 4 bytes

tRev endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

jj2007

You are on the right track, Hutch - 17k instead of 24kCycles, congrats :biggrin:

Indeed, one of the secrets of the fast jRev and xRev algos is this line: add Len(ecx), ecx


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
17399   cycles for 99 * tRev (Hutch)
33082   cycles for 99 * strrev (CRT)
6719    cycles for 99 * jRev (jj)
13395   cycles for 99 * Mirror$ (MasmBasic)
4458    cycles for 99 * xRev (jj, needs pshufb)
7459    cycles for 99 * RevLingo (aligned strings only)
93268   cycles for 99 * cRev (caballero)
31727   cycles for 99 * nRev (Nidud)

17005   cycles for 99 * tRev (Hutch)
33112   cycles for 99 * strrev (CRT)
6697    cycles for 99 * jRev (jj)
13400   cycles for 99 * Mirror$ (MasmBasic)
4452    cycles for 99 * xRev (jj, needs pshufb)
7379    cycles for 99 * RevLingo (aligned strings only)
93369   cycles for 99 * cRev (caballero)
31654   cycles for 99 * nRev (Nidud)

17052   cycles for 99 * tRev (Hutch)
33091   cycles for 99 * strrev (CRT)
6729    cycles for 99 * jRev (jj)
13402   cycles for 99 * Mirror$ (MasmBasic)
4439    cycles for 99 * xRev (jj, needs pshufb)
7371    cycles for 99 * RevLingo (aligned strings only)
93431   cycles for 99 * cRev (caballero)
31718   cycles for 99 * nRev (Nidud)

71      bytes for tRev (Hutch)
19      bytes for strrev (CRT)
87      bytes for jRev (jj)
10      bytes for Mirror$ (MasmBasic)
111     bytes for xRev (jj, needs pshufb)
123     bytes for RevLingo (aligned strings only)
56      bytes for cRev (caballero)
63      bytes for nRev (Nidud)

len(src)=99 bytes
454815  tRev (Hutch)                    sogla gnirts gnitset rof dedne
454815  strrev (CRT)                    sogla gnirts gnitset rof dedne
454815  jRev (jj)                       sogla gnirts gnitset rof dedne
454815  Mirror$ (MasmBasic)             sogla gnirts gnitset rof dedne
454815  xRev (jj, needs pshufb)         sogla gnirts gnitset rof dedne
454479  RevLingo (aligned strings only) sogla gnirtstgni set rof dedne
454815  cRev (caballero)                sogla gnirts gnitset rof dedne
454815  nRev (Nidud)                    sogla gnirts gnitset rof dedne


Btw this is necessary to avoid a crash in case somebody passes a zero length string to the algo:
    sub esi, 1                              ; sub 1 from half length
    jg @B                                  ; loop back if its not greater than zero

Option prologue etc are not necessary if the proc has no locals and no args.