Change string case in SSE2

Started by guga, August 20, 2023, 10:19:18 AM

guga

Hi guys

Has anyone succeeded in creating a string case converter in SSE2? I found one here: https://gist.github.com/easyaspi314/9d31e5c0f9cead66aba2ede248b74d64

But it is very confusing and also for x64 only.

The goal is to convert a string to upper case or lower case with SSE2 in 32 bits.

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Very simple:
include \masm32\MasmBasic\MasmBasic.inc
Or32 OWORD 20202020202020202020202020202020h
  Init
  Cls 3
  Let esi="A SHORT STRING"
  Let edi="This is the destination buffer"
  movdqu xmm0, oword ptr [esi]
  movups xmm1, Or32
  orps xmm0, xmm1
  movdqu [edi], xmm0
  PrintLine "src= [", esi, "]"
  PrintLine "dest=[", edi, "]"
EndOfCode

Output:
src= [A SHORT STRING]
dest=[a short string  ination buffer]

Minor problem: you have to find a way to move 14 bytes from an xmmreg to memory :cool:
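
To see where the two extra blanks in dest come from, the 16-byte OR can be modeled byte by byte. This is only a Python sketch of the lane logic, not SIMD code. The helper name orps and the assumption that the two bytes after "A SHORT STRING" in memory are zero are mine, not from the post.

```python
OR32 = bytes([0x20]) * 16  # same constant as Or32 above

def orps(a: bytes, b: bytes) -> bytes:
    """Model of a 16-byte bitwise OR (hypothetical helper)."""
    return bytes(x | y for x, y in zip(a, b))

# Assumption: the bytes following the terminator are zero.
src = b"A SHORT STRING".ljust(16, b"\0")
dest = bytearray(b"This is the destination buffer")

# movdqu [edi], xmm0 always stores all 16 bytes, including the
# terminator, which the OR has turned into a space (020).
dest[:16] = orps(src, OR32)
result = bytes(dest)
print(result)  # b'a short string  ination buffer'
```

The store overwrites exactly 16 bytes of the destination, which is why the tail "ination buffer" survives and why a 14-byte string needs special handling.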

guga

Hi JJ

Many thanks.

So, for upper case we only need to xorps after the orps, right?

Lower Case
[ToLowerTbl: Q$ 020_20_20_20_20_20_20_20, 020_20_20_20_20_20_20_20]

Proc SSEToLower:
    Arguments @pString, @pOutput

    mov esi D@pString
    movdqu xmm0 X$esi
    movdqu xmm1 X$ToLowerTbl
    orps xmm0 xmm1
    mov esi D@pOutput
    movups X$esi xmm0

EndP


[StringInput: B$ "hEllo", 0]
[OutputString: B$ 0 #128]
call SSEToLower StringInput, OutputString
OutputString: B$ "hello", 0 ---> in fact it will also turn the zero bytes into spaces (020), but this was just for me to understand whether it is similar to regular x86 but using or

Proc SSEToUpper:
    Arguments @pString, @pOutput

    mov esi D@pString
    movdqu xmm0 X$esi
    movdqu xmm1 X$ToLowerTbl
    orps xmm0 xmm1
    xorps xmm0 xmm1
    mov esi D@pOutput
    movups X$esi xmm0

EndP


[StringInput: B$ "hEllo", 0]
[OutputString: B$ 0 #128]
call SSEToUpper StringInput, OutputString
OutputString: B$ "HELLO", 0 ---> here the zero bytes stay zero, because the xor clears bit 020 again; this was just for me to understand whether it is similar to regular x86 but using xor

One question: how do I prevent the case change from touching other chars, such as numbers, or ? | _ etc.?

I gave it a try, converting to masks with PCMPGTB, but it got me nowhere. It identified chars bigger than 'Z' and masked those byte positions as 0FF, but I was not able to use their positions to convert only the needed chars to lower or upper case.

For example, say I have the string "Hello 123 i'm doing this. How are you ?". We have non-letter chars (' ? .) and numbers (1 2 3) in the middle of the text. How do I convert everything to uppercase except the numbers and the other non-letter chars?
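
A small Python model shows why the unconditional OR followed by XOR damages such a string: (b OR 020) XOR 020 simply clears bit 020 in every byte, letter or not. The function names below are hypothetical, and the second function is only a scalar reference for the result a masked SSE2 version should produce.

```python
def or_then_xor(data: bytes) -> bytes:
    """What SSEToUpper does to each byte: clear bit 0x20 unconditionally."""
    return bytes((b | 0x20) ^ 0x20 for b in data)

def upper_ascii(data: bytes) -> bytes:
    """Reference: change only bytes in 'a'..'z' (0x61..0x7A)."""
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in data)

text = b"Hello 123 i'm doing this. How are you ?"

damaged = or_then_xor(b"a1 ?")
# 'a' -> 'A' is fine, but '1' -> 0x11, ' ' -> 0x00 and '?' -> 0x1F
print(damaged.hex(" "))     # 41 11 00 1f
print(upper_ascii(text))    # b"HELLO 123 I'M DOING THIS. HOW ARE YOU ?"
```

Note that the space becomes a zero byte, so the damaged string would even appear truncated when printed as a zero-terminated string.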

jj2007

Quote from: jj2007 on August 20, 2023, 11:06:06 AMMinor problem: you have to find a way to move 14 bytes from an xmmreg to memory :cool:
It seems you didn't catch the irony...

And it's only one of your problems:
include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let esi="Введите текст здесь: Enter text here in Russian"
  PrintLine esi
  PrintLine Upper$(esi) 
EndOfCode

Output:
Введите текст здесь: Enter text here in Russian
ВВЕДИТЕ ТЕКСТ ЗДЕСЬ: ENTER TEXT HERE IN RUSSIAN

guga

Hi JJ.

About the 14 bytes.. :bgrin:  :bgrin:  :bgrin: I didn't realize it was irony. But you can do it by storing the last 16 converted bytes on the stack and, at the end of the routine, copying only the remaining bytes from the stack to the output buffer where the converted string goes. That's what the macro Structure @TmpStorage 32, @TmpStringDis 0 is for. It allocates 32 bytes on the stack to handle the case where the string is shorter than 16 bytes (or the part left over after the last full 16-byte block). This also prevents crashes on the output side, since we copy only the converted chars that are needed and never write a full 16 bytes past the end of the output buffer.

I succeeded in making 2 functions that work, but they still have issues with some chars (especially Latin ones, such as ç ã õ etc.). But I guess this is the way to do it. I made those functions as:

StringtoLower

[<16 ToCase_asciiA: Q$ 040_40_40_40_40_40_40_40, 040_40_40_40_40_40_40_40] ; 'A'-1
[<16 ToCase_asciiZ: Q$ 05B_5B_5B_5B_5B_5B_5B_5B, 05B_5B_5B_5B_5B_5B_5B_5B] ; 'Z'+1
[<16 ToCase_Diff: Q$ 020_20_20_20_20_20_20_20, 020_20_20_20_20_20_20_20] ; 'a'-'A'


Proc StringtoLower:
    Arguments @pString, @pOutput
    Local @StringLenght
    Structure @TmpStorage 32, @TmpStringDis 0
    Uses esi, edi


    mov edi D@pOutput
    mov esi D@pString
    call StrLen esi
    mov D@StringLenght eax

    ..While D@StringLenght >= 16

        ; input string = xmm0
        movdqu xmm0 X$esi

        ; GreaterThanA = pcmpgtb InputString, ToCase_asciiA
        ; pcmpgtb compares SIGNED bytes. All chars from 'A' to 07F are flagged as 0FF. Bytes 0 to '@' (064) are flagged as 0, and so are bytes 080 to 0FF (negative when signed)
        movdqu xmm1 xmm0; xmm1 = InputString
        pcmpgtb xmm1 X$ToCase_asciiA ; xmm1 = greaterThanA. If InputChar > '@' ('A'-1) as a signed byte, Mask1 = 0FF, Else Mask1 = 0. Mask1 = xmm1

        ; lessEqualZ = pcmpgtb ToCase_asciiZ, InputString
        ; Now we do the opposite. All chars > 'Z' up to 07F are flagged as 0. Bytes 0 to 'Z' are flagged as 0FF, and so are bytes 080 to 0FF (signed compare again)
        movdqu xmm2 X$ToCase_asciiZ; xmm2 = X$ToCase_asciiZ
        pcmpgtb xmm2 xmm0 ; xmm2 = lessEqualZ. If InputChar < '[' ('Z'+1) as a signed byte, Mask2 = 0FF, Else Mask2 = 0. Mask2 = xmm2

        ; Mask3 = pand   lessEqualz, greaterThanA
        ; Char >= 'A', Flag 0FF, Else 0
        ; Char <= 'Z', Flag 0FF, Else 0
        ; We have then. Value = 0FF when Char >= 'A' and Char <= 'Z'
        ; and both results
        pand xmm2 xmm1 ; mask3 . Now everything in between A to Z is flagged as 0FF, all the rest is 0

        ; toAdd = pand ToCase_Diff Mask3
        ; Finally we AND the mask with our case difference (020 = 'a'-'A'), so xmm1 keeps 020 only at the flagged positions.
        ; So everything flagged as 0FF turns into 020. Everything else is 0
        movdqu xmm1 X$ToCase_Diff
        pand xmm1 xmm2

        ; added = paddb toAdd InputString
        ; Finally we add those flagged bytes to our string to change the case. Since bit 020 is clear in 'A' to 'Z', we can also simply 'or' it in with orps
        ;paddb xmm0 xmm1 ; works with paddb, orps and xorps as well. Need to see which one is faster
        ;orps xmm0 xmm1
        xorps xmm0 xmm1

        ;_mm_storeu_si128((__m128i *)str, added);
        movdqu X$edi xmm0

        add edi 16
        add esi 16
        sub D@StringLenght 16
    ..End_While

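    ; calculate remainders
    ; note: this still loads a full 16 bytes from esi, so it reads past the end of the string and assumes that memory is readable
    .If D@StringLenght > 0
        mov eax D@StringLenght

        movdqu xmm0 X$esi
        movdqu xmm1 xmm0; xmm1 = InputString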
        pcmpgtb xmm1 X$ToCase_asciiA ; xmm1 = greaterThanA

        movdqu xmm2 X$ToCase_asciiZ; xmm2 = X$ToLowCase_asciiZ
        pcmpgtb xmm2 xmm0 ; xmm2 = lessEqualz
        pand xmm2 xmm1 ; mask

        movdqu xmm1 X$ToCase_Diff
        pand xmm1 xmm2

        ;paddb xmm0 xmm1
        xorps xmm0 xmm1

        mov esi D@TmpStorage | mov D$esi+eax 0
        movdqu X$esi xmm0
        ; ready to copy the remainders
        L3:  movsb | dec eax | jnz L3<
    .End_If

    mov eax D@StringLenght
    mov B$edi 0

EndP
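
The mask steps of one 16-byte iteration can be checked with a byte-lane model. This is a Python sketch, not SIMD code, and the helper names simply mirror the instructions they stand for. It assumes the documented behavior of pcmpgtb, a signed byte compare, which is why bytes 080 to 0FF (for example Ç = 0C7 in an ANSI codepage) never get converted by this routine.

```python
def signed(b: int) -> int:
    return b - 256 if b >= 0x80 else b

def pcmpgtb(a: bytes, b: bytes) -> bytes:
    """0xFF where a > b as SIGNED bytes, else 0."""
    return bytes(0xFF if signed(x) > signed(y) else 0 for x, y in zip(a, b))

def pand(a: bytes, b: bytes) -> bytes:
    return bytes(x & y for x, y in zip(a, b))

def paddb(a: bytes, b: bytes) -> bytes:
    return bytes((x + y) & 0xFF for x, y in zip(a, b))

ASCII_A = bytes([0x40]) * 16   # 'A'-1
ASCII_Z = bytes([0x5B]) * 16   # 'Z'+1
DIFF    = bytes([0x20]) * 16   # 'a'-'A'

def to_lower_block(block: bytes) -> bytes:
    greater_than_a = pcmpgtb(block, ASCII_A)
    less_equal_z = pcmpgtb(ASCII_Z, block)
    mask = pand(less_equal_z, greater_than_a)   # 0xFF only for 'A'..'Z'
    return paddb(block, pand(DIFF, mask))

print(to_lower_block(b"Hello 123 [Z@A]!"))   # b'hello 123 [z@a]!'
```

For a byte such as 0C7 the second compare yields 0FF (05B is greater than a negative number), but the first yields 0, so the combined mask is 0 and the byte passes through unchanged.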


StringtoUpper

[<16 ToLowCase_asciiA: Q$ 060_60_60_60_60_60_60_60, 060_60_60_60_60_60_60_60] ; 'a'-1
[<16 ToLowCase_asciiZ: Q$ 07B_7B_7B_7B_7B_7B_7B_7B, 07B_7B_7B_7B_7B_7B_7B_7B] ; 'z'+1

Proc StringtoUpper:
    Arguments @pString, @pOutput
    Local @StringLenght
    Structure @TmpStorage 32, @TmpStringDis 0
    Uses esi, edi


    mov edi D@pOutput
    mov esi D@pString
    call StrLen esi
    mov D@StringLenght eax

    ..While D@StringLenght >= 16

        ; input string = xmm0
        movdqu xmm0 X$esi

        ; GreaterThanA = pcmpgtb InputString, ToLowCase_asciiA
        ; pcmpgtb compares SIGNED bytes. All chars from 'a' to 07F are flagged as 0FF. Bytes 0 to '`' (096) are flagged as 0, and so are bytes 080 to 0FF (negative when signed)
        movdqu xmm1 xmm0; xmm1 = InputString
        pcmpgtb xmm1 X$ToLowCase_asciiA ; xmm1 = greaterThanA. If InputChar > 096 ('a'-1) as a signed byte, Mask1 = 0FF, Else Mask1 = 0. Mask1 = xmm1

        ; lessEqualZ = pcmpgtb ToLowCase_asciiZ, InputString
        ; Now we do the opposite. All chars > 'z' up to 07F are flagged as 0. Bytes 0 to 'z' are flagged as 0FF, and so are bytes 080 to 0FF (signed compare again)
        movdqu xmm2 X$ToLowCase_asciiZ; xmm2 = X$ToLowCase_asciiZ
        pcmpgtb xmm2 xmm0 ; xmm2 = lessEqualZ. If InputChar < '{' (07B) as a signed byte, Mask2 = 0FF, Else Mask2 = 0. Mask2 = xmm2

        ; Mask3 = pand   lessEqualz, greaterThanA
        ; Char >= 'a', Flag 0FF, Else 0
        ; Char <= 'z', Flag 0FF, Else 0
        ; We have then. Value = 0FF when Char >= 'a' and Char <= 'z'
        ; and both results
        pand xmm2 xmm1 ; mask3 . Now everything in between a to z is flagged as 0FF, all the rest is 0

        ; toSub = pand ToCase_Diff Mask3
        ; Finally we AND the mask with our case difference (020 = 'a'-'A'), so xmm1 keeps 020 only at the flagged positions.
        ; So everything flagged as 0FF turns into 020. Everything else is 0
        movdqu xmm1 X$ToCase_Diff
        pand xmm1 xmm2

        ; subtracted = psubb InputString toSub
        ; Finally we subtract those flagged bytes from our string to change the case. Since bit 020 is set in 'a' to 'z', xorps clears it just as well
        ;psubb xmm0 xmm1 ; works with psubb and xorps as well. Need to see which one is faster
        xorps xmm0 xmm1


        ;_mm_storeu_si128((__m128i *)str, added);
        movdqu X$edi xmm0

        add edi 16
        add esi 16
        sub D@StringLenght 16
    ..End_While

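    ; calculate remainders
    ; note: this still loads a full 16 bytes from esi, so it reads past the end of the string and assumes that memory is readable
    .If D@StringLenght > 0
        mov eax D@StringLenght

        movdqu xmm0 X$esi
        movdqu xmm1 xmm0; xmm1 = InputString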
        pcmpgtb xmm1 X$ToLowCase_asciiA ; xmm1 = greaterThanA

        movdqu xmm2 X$ToLowCase_asciiZ; xmm2 = X$ToLowCase_asciiZ
        pcmpgtb xmm2 xmm0 ; xmm2 = lessEqualz
        pand xmm2 xmm1 ; mask

        movdqu xmm1 X$ToCase_Diff
        pand xmm1 xmm2

        ;psubb xmm0 xmm1
        xorps xmm0 xmm1

        mov esi D@TmpStorage | mov D$esi+eax 0
        movdqu X$esi xmm0
        ; ready to copy the remainders
        L3:  movsb | dec eax | jnz L3<
    .End_If

    mov eax D@StringLenght
    mov B$edi 0

EndP


Examples of usage:

[OutputString: B$ 0 #128]
[BigText2a: B$ "[zzzzz? / \ : ; zzzzzzzzzzzzzzzzzzzzzzzzaaaaaaaaaaggTTTTTTnvd123456/", 0]

    call StringtoLower BigText2a, OutputString

    call StringtoUpper BigText2a, OutputString

I'll do some speed tests and convert it to masm for you. I'll also try to find a way to fix the Latin chars.

Once it is all fixed, we can try to optimize even further. There could also be another function, with "Ex" appended to the name, for cases where the length of the string is already precalculated (which will also make the function faster, btw).


Btw, I'll also check which opcodes are faster to use at the end of the case conversion. For example, whether paddb, orps or xorps differ significantly from each other, or whether it doesn't matter which one we choose. The functions originally use paddb xmm0 xmm1 for StringtoLower and psubb xmm0 xmm1 for StringtoUpper. But in both cases, for these tests, I'm using xorps xmm0 xmm1 to see if it makes a difference in speed.


The functions were adapted from C ones I found here for x64: https://gist.github.com/easyaspi314/9d31e5c0f9cead66aba2ede248b74d64 (although the C version seems to be slow, because all of those _mm_add_epi8 etc. take lots of instructions, at least on gcc from https://godbolt.org )


About the Unicode version, I'm not there yet. But it seems that, at least for Russian, the difference between upper and lower case is also 32 (020), see https://en.wikipedia.org/wiki/Russian_alphabet  But I'll try it later, after I test the speed of all this and see if I can do something about the Latin chars.

jj2007

Quote from: guga on August 21, 2023, 04:15:49 AMHi JJ.

About the 14 bytes.. :bgrin:  :bgrin:  :bgrin: I didn´t thought it was a irony. But you can do it copying all that left from 16 bytes to stack and at the end of the routine copy the remainder bytes from the stack onto the outputted memory buffer where the actual string will be converted.

Yes, but copying 14 bytes from an XMM register is very, very slow. You may say that it doesn't matter if the string is one megabyte long, but did it ever happen to you (or anyone else) that one megabyte of text had to be converted to UPPERCASE?

guga

Quote from: jj2007 on August 21, 2023, 04:50:47 AM
Yes, but copying 14 bytes from an XMM register is very, very slow. You may say that it doesn't matter if the string is one megabyte long, but did it ever happen to you (or anyone else) that one megabyte of text had to be converted to UPPERCASE?

Hi JJ

It can be a bit slow, because at the end it does a byte-by-byte copy of however many bytes are left (fewer than 16).

This part, right?
movdqu X$esi xmm0
L3:  movsb | dec eax | jnz L3<

But... we may overcome this by checking, at the beginning of the function, whether the remainder contains chunks of 8, 4 or 2 bytes, and precalculating the remainder of the remainder (perhaps using non-SSE registers just for data shorter than 16 bytes). I'm not sure it will speed up the cases where we change the case of small strings, but this can be tested later when we try to optimize it.
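
One way to read this idea: a remainder of fewer than 16 bytes always splits into at most one 8-byte, one 4-byte, one 2-byte and one 1-byte copy, given by the bits of the remainder length. The Python sketch below only models that bookkeeping. The function names are hypothetical, and whether this beats the movsb loop has to be measured.

```python
def tail_chunks(n: int) -> list:
    """Chunk sizes needed to copy n (< 16) bytes: one per set bit of n."""
    if not 0 <= n < 16:
        raise ValueError("remainder must be in 0..15")
    return [size for size in (8, 4, 2, 1) if n & size]

def copy_tail(dst: bytearray, dst_off: int, src: bytes, n: int) -> int:
    """Copy n bytes of src to dst in at most 4 moves; returns the move count."""
    off = 0
    chunks = tail_chunks(n)
    for size in chunks:
        dst[dst_off + off:dst_off + off + size] = src[off:off + size]
        off += size
    return len(chunks)

out = bytearray(20)
moves = copy_tail(out, 2, b"ABCDEFGHIJKLMNOP", 14)
print(tail_chunks(14), moves)   # [8, 4, 2] 3
```

For the 14-byte case this is three moves instead of fourteen movsb iterations.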

Btw, on my AMD, using paddb xmm0 xmm1 for StringtoLower and psubb xmm0 xmm1 for StringtoUpper is a bit faster than using xorps or orps. Nothing big, something around 1% or less, but it may count for something when the functions are optimized further.

I'll convert those simple versions to masm now for you to test, and then I'll see how to make it work for Latin strings. Later a Unicode version will also be needed :azn:

guga

Masm versions:

Data used by both (keep it 16-byte aligned: pcmpgtb reads these constants as memory operands, which must be aligned):
ToCase_asciiA   xmmword 40404040404040404040404040404040h
ToCase_asciiZ   xmmword 5B5B5B5B5B5B5B5B5B5B5B5B5B5B5B5Bh
ToCase_Diff     xmmword 20202020202020202020202020202020h
ToLowCase_asciiA xmmword 60606060606060606060606060606060h
ToLowCase_asciiZ xmmword 7B7B7B7B7B7B7B7B7B7B7B7B7B7B7B7Bh


StringtoUpper


StringtoUpper   proc near               ; CODE XREF: start+26↑p
                                        ; .text:00404DD7↑j

TmpStorage      = dword ptr -8
StringLenght    = dword ptr -4
pString         = dword ptr  8
pOutput         = dword ptr  0Ch

                push    ebp
                mov     ebp, esp
                sub     esp, 4
                sub     esp, 24h
                mov     [ebp+TmpStorage], esp
                push    esi
                push    edi
                mov     edi, [ebp+pOutput]
                mov     esi, [ebp+pString]
                push    esi
                call    StrLen
                mov     [ebp+StringLenght], eax

loc_404DFD:                             ; CODE XREF: StringtoUpper+65↓j
                cmp     [ebp+StringLenght], 10h
                jb      loc_404E4A
                movdqu  xmm0, xmmword ptr [esi]
                movdqu  xmm1, xmm0
                pcmpgtb xmm1, ToLowCase_asciiA
                movdqu  xmm2, ToLowCase_asciiZ
                pcmpgtb xmm2, xmm0
                pand    xmm2, xmm1
                movdqu  xmm1, ToCase_Diff
                pand    xmm1, xmm2
                psubb   xmm0, xmm1
                movdqu  xmmword ptr [edi], xmm0
                add     edi, 10h
                add     esi, 10h
                sub     [ebp+StringLenght], 10h
                jmp     loc_404DFD
; ---------------------------------------------------------------------------

loc_404E4A:                             ; CODE XREF: StringtoUpper+21↑j
                cmp     [ebp+StringLenght], 0
                jbe     loc_404E99
                mov     eax, [ebp+StringLenght]
                movdqu  xmm0, xmmword ptr [esi]
                movdqu  xmm1, xmm0
                pcmpgtb xmm1, ToLowCase_asciiA
                movdqu  xmm2, ToLowCase_asciiZ
                pcmpgtb xmm2, xmm0
                pand    xmm2, xmm1
                movdqu  xmm1, ToCase_Diff
                pand    xmm1, xmm2
                psubb   xmm0, xmm1
                mov     esi, [ebp+TmpStorage]
                mov     dword ptr [eax+esi], 0
                movdqu  xmmword ptr [esi], xmm0

loc_404E95:                             ; CODE XREF: StringtoUpper+B7↓j
                movsb
                dec     eax
                jnz     short loc_404E95

loc_404E99:                             ; CODE XREF: StringtoUpper+6E↑j
                mov     eax, [ebp+StringLenght]
                mov     byte ptr [edi], 0
                pop     edi
                pop     esi
                mov     esp, ebp
                pop     ebp
                retn    8
StringtoUpper   endp

StringtoLower
StringtoLower   proc near               ; CODE XREF: start+17↑p
                                        ; .text:00404D0B↑j

TmpStorage      = dword ptr -8
StringLenght    = dword ptr -4
pString         = dword ptr  8
pOutput         = dword ptr  0Ch

                push    ebp
                mov     ebp, esp
                sub     esp, 4
                sub     esp, 24h
                mov     [ebp+TmpStorage], esp
                push    esi
                push    edi
                mov     edi, [ebp+pOutput]
                mov     esi, [ebp+pString]
                push    esi
                call    StrLen
                mov     [ebp+StringLenght], eax

loc_404D2D:                             ; CODE XREF: StringtoLower+65↓j
                cmp     [ebp+StringLenght], 10h
                jb      loc_404D7A
                movdqu  xmm0, xmmword ptr [esi]
                movdqu  xmm1, xmm0
                pcmpgtb xmm1, ToCase_asciiA
                movdqu  xmm2, ToCase_asciiZ
                pcmpgtb xmm2, xmm0
                pand    xmm2, xmm1
                movdqu  xmm1, ToCase_Diff
                pand    xmm1, xmm2
                paddb   xmm0, xmm1
                movdqu  xmmword ptr [edi], xmm0
                add     edi, 10h
                add     esi, 10h
                sub     [ebp+StringLenght], 10h
                jmp     loc_404D2D
; ---------------------------------------------------------------------------

loc_404D7A:                             ; CODE XREF: StringtoLower+21↑j
                cmp     [ebp+StringLenght], 0
                jbe     loc_404DC9
                mov     eax, [ebp+StringLenght]
                movdqu  xmm0, xmmword ptr [esi]
                movdqu  xmm1, xmm0
                pcmpgtb xmm1, ToCase_asciiA
                movdqu  xmm2, ToCase_asciiZ
                pcmpgtb xmm2, xmm0
                pand    xmm2, xmm1
                movdqu  xmm1, ToCase_Diff
                pand    xmm1, xmm2
                paddb   xmm0, xmm1
                mov     esi, [ebp+TmpStorage]
                mov     dword ptr [eax+esi], 0
                movdqu  xmmword ptr [esi], xmm0

loc_404DC5:                             ; CODE XREF: StringtoLower+B7↓j
                movsb
                dec     eax
                jnz     short loc_404DC5

loc_404DC9:                             ; CODE XREF: StringtoLower+6E↑j
                mov     eax, [ebp+StringLenght]
                mov     byte ptr [edi], 0
                pop     edi
                pop     esi
                mov     esp, ebp
                pop     ebp
                retn    8
StringtoLower   endp

Additional function
StrLen
StrLen          proc near
pString         = dword ptr  8

                push    ebp
                mov     ebp, esp
                push    ecx
                xorps   xmm0, xmm0
                mov     ecx, [ebp+pString]

loc_4086BA:                             ; CODE XREF: StrLen+1B↓j
                movups  xmm1, xmmword ptr [ecx]
                pcmpeqb xmm0, xmm1
                add     ecx, 10h
                pmovmskb eax, xmm0
                test    ax, ax
                jz      short loc_4086BA
                sub     ecx, [ebp+pString]
                add     ecx, 0FFFFFFF0h
                bsf     ax, ax
                add     eax, ecx
                pop     ecx
                mov     esp, ebp
                pop     ebp
                retn    4
StrLen          endp



guga

JJ, I created an extended version of it named StringtoUpperEx that takes one additional parameter, the precalculated length of the string.

To my surprise, StringtoUpperEx is extremely fast.

Text to convert:
[BigText2a: B$ "[zzzzz? / \ : ; zzzzzzzzzzzzzzzzzzzzzzzzaaaaaaaaaaggTTTTTTnvd123456/", 0]

On the normal version (with StrLen inside to calculate the length of the string), it takes 72.84 clock cycles, while the extended version (precalculated length) takes 36.33 clock cycles.

Normal version
The fastest results was found in Algo method: 2

Value: 72.84205172214122 clocks

Standard Deviation Results

Mean: 74.43649916350911 clocks

Max (STD Population): 76.02582798029908 clocks

Min (STD Population): 72.84717034671915 clocks

Variance (STD Population): 0.70297743260073 clocks

Standard Deviation (STD Population): 1.58932881678996 clocks

Max (STD Sample): 76.03094660487702 clocks

Min (STD Sample): 72.84205172214122 clocks

Variance (STD Sample): 0.70751277087558 clocks

Standard Deviation (STD Sample): 1.59444744136790 clocks

Extended version
The fastest results was found in Algo method: 2

Value: 36.33179307783062 clocks

Standard Deviation Results

Mean: 38.30579268930184 clocks

Max (STD Population): 40.27649955517551 clocks

Min (STD Population): 36.33508582342818 clocks

Variance (STD Population): 1.07834219136009 clocks

Standard Deviation (STD Population): 1.97070686587366 clocks

Max (STD Sample): 40.27979230077307 clocks

Min (STD Sample): 36.33179307783062 clocks

Variance (STD Sample): 1.08194868698336 clocks

Standard Deviation (STD Sample): 1.97399961147123 clocks

Of course, these are only very preliminary tests, since I have not optimized the code and haven't yet found a way to handle the Latin chars. But it seems promising in terms of speed and accuracy for both versions.

In these results, the extended version has a slightly bigger variance than the normal one, which could indicate room for more optimization, perhaps by fixing alignment problems or caching of SSE registers?

If I can gain something around 50% in speed on each of them, I can then add one more parameter to be used as a flag with which the user can activate the Latin mode or not (for ç, ã, õ, ô, é, è etc.).

Then I'll be able to take a look at the other problem you mentioned, the 14 bytes from XMM to memory. There is a way to do it, but I don't know yet whether it will affect performance. I'll try to do it tonight or tomorrow.

guga

Found a way to simulate the pcmpeqb routine using only 4 bytes at a time to perform the case conversion. In my tests, this routine is faster than the regular lower case conversion that scans byte by byte and xors only the bytes inside the range 'A' to 'Z', but it is slower than the ones I made just now, which combine SIMD instructions with regular x86 to try to maximize the performance in both cases.

JJ, in my tests I succeeded in overcoming the speed problem when only 14 or 15 (or fewer) bytes are left: copying from xmm registers to regular x86 registers was no longer a problem (at least in these preliminary tests). I'm cleaning up this whole beast and will upload it here today so we can test it (both syntaxes, masm and rosasm).

Btw, the function below has a small error when saving the bytes to edi. It is missing a store in the 2nd step, but since I won't use it any longer (it's slower than the other technique), I'm only putting it here so you can see how this can be done without SIMD, which may perhaps be faster than the regular ways to do it. The function below is not part of the new algo and I didn't fix it; I'm just putting it here so I won't forget the steps and tests I did when trying to optimize the main functions.

Proc ChangeCaseShort:
    Arguments @pString, @StringLenght, @Output
    Local @RemainderBytes, @LoopCount
    Uses edi, esi, ebx, edx, ecx

    mov esi D@pString;$BigText2a
    mov edi D@Output
    mov eax D@StringLenght; | shr eax 2 ; divide by 4. ecx now is the counter of multiple of 4 bytes
    and eax 0-4 | mov ecx eax | xor eax D@StringLenght | mov D@RemainderBytes eax; | jz L1> ; When 0, the length is divisible by 4 and we have no remainder, so jmp over to the main function

    shr ecx 2 | jz L1>

    .Do

        mov eax D$esi
        mov ebx eax
        mov edx eax
        xor edx 020202020
        or ebx 05F5F5F5F ; 05F (95 in decimal) = the or of all chars from A to Z = A or B or C or D ... Z = 05F (in hexa)
        and edx ebx
        sub edx eax

        ; create the mask. Why 07F on each byte? Because the mask needs only the 8th bit set. So we invert all bits, resulting in 07F (00__0111_1111)
        ; Doing that, whatever byte had 0FF as a mask will be zeroed and all others will keep whatever byte they are, but with only the 8th bit disabled
        and edx 07F7F7F7F | xor edx 07F7F7F7F
        xor ebx ebx ; create our mask now in ebx
        Test_If_Not dl dl ; compare each byte to see if it is zeroed or not. If zero, we or it with 0FF at the given position, thus creating our mask in ebx
            or ebx 0FF
        Test_End
        Test_If_Not dh dh
            or ebx 0FF_00
        Test_End
        shr edx 16
        Test_If_Not dl dl
            or ebx 0FF_00_00
        Test_End
        Test_If_Not dh dh
            or ebx 0FF_00_00_00
        Test_End
        and ebx 020202020
        add eax ebx ; convert to lowercase

        mov D$edi eax
        add esi 4
        add edi 4
        dec ecx
    ;.Loop_Until ecx = 0
    .Repeat_Until_Zero ecx

    add esi D@RemainderBytes
    add edi D@RemainderBytes

L1:

    mov ecx D@RemainderBytes
        mov eax D$esi
        mov ebx eax
        mov edx eax
        xor edx 020202020
        or ebx 05F5F5F5F ; 05F (95 in decimal) = the or of all chars from A to Z
        and edx ebx
        sub edx eax

        ; create the mask. Why 07F on each byte? Because the mask needs only the 8th bit set. So we invert all bits, resulting in 07F (00__0111_1111)
        ; Doing that, whatever byte had 0FF as a mask will be zeroed and all others will keep whatever byte they are, but with only the 8th bit disabled
        and edx 07F7F7F7F | xor edx 07F7F7F7F
        xor ebx ebx ; create our mask now in ebx
        Test_If_Not dl dl ; compare each byte to see if it is zeroed or not. If zero, we or it with 0FF at the given position, thus creating our mask in ebx
            or ebx 0FF
        Test_End
        Test_If_Not dh dh
            or ebx 0FF_00
        Test_End
        shr edx 16
        Test_If_Not dl dl
            or ebx 0FF_00_00
        Test_End
        Test_If_Not dh dh
            or ebx 0FF_00_00_00
        Test_End
        and ebx 020202020
        add eax ebx ; convert to lowercase

        mov B$edi al | dec ecx | jz L2>
        inc edi | mov B$edi ah | dec ecx | jz L2>
        shr eax 16 | inc edi | mov B$edi al | dec ecx | jz L2>
        mov B$edi 0

L2:

EndP
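
For comparison, here is a different and well-known way to do the 4-bytes-at-a-time conversion without per-byte tests. It is not the routine above, only a Python model of 32-bit register arithmetic, and the function names are hypothetical. The idea is to add constants to the low 7 bits of each byte so that bit 7 records whether the byte is at least 'A' and whether it is above 'Z', then shift the resulting 080 flags down to 020.

```python
def to_lower_word(w: int) -> int:
    """Lowercase the ASCII letters in a 32-bit word holding 4 chars."""
    heptets = w & 0x7F7F7F7F           # low 7 bits of each byte
    gt_z = heptets + 0x25252525        # bit 7 set where heptet > 'Z' (0x5A)
    ge_a = heptets + 0x3F3F3F3F        # bit 7 set where heptet >= 'A' (0x41)
    is_ascii = ~w & 0x80808080         # bit 7 set where the byte is < 0x80
    is_upper = (ge_a ^ gt_z) & is_ascii
    return w | (is_upper >> 2)         # 0x80 >> 2 = 0x20, the case bit

def to_lower4(chunk: bytes) -> bytes:
    w = int.from_bytes(chunk, "little")
    return to_lower_word(w).to_bytes(4, "little")

print(to_lower4(b"AZ@["))   # b'az@['
```

No addition can carry into the neighboring byte, because the largest per-byte sum is 07F + 03F = 0BE. Bytes of 080 and above are left alone, like in the SSE2 version.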

jj2007

When you are ready to test "Введите текст здесь: Enter text here in Russian", let me know.

guga

Hi JJ

Ok. I'm nearly finished with the regular Ansi version. Only some minor comments left to do, plus a variation for Latin chars (for Portuguese, French, Italian, Spanish; I don't know yet whether German has accented chars in ANSI that need the case change).

So, at the end we will have 2 versions
StringtoLower for regular A to Z / a to z chars
StringtoLowerEx for accents in portuguese, spanish, italian, french etc
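
For the accented version, the byte ranges can be checked against Python's own case tables. The sketch below assumes the text is ISO-8859-1 (Latin-1); other ANSI codepages such as Windows-1252 add a few letters outside this range. The function name is hypothetical.

```python
def latin1_lower(b: int) -> int:
    """Lowercase one Latin-1 byte using Python's Unicode case mapping."""
    return bytes([b]).decode("latin-1").lower().encode("latin-1")[0]

# Every byte whose lowercase form is exactly byte + 0x20
shifted = [b for b in range(256) if latin1_lower(b) == b + 0x20]

print([hex(b) for b in shifted if b >= 0x80])
# 0xc0 .. 0xde without 0xd7 (the multiplication sign)
```

So a second range check for 0C0 to 0DE that skips 0D7 covers the accented capitals. Going the other way needs more care: ß (0DF) and ÿ (0FF) have no single-byte uppercase form in Latin-1.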

About Russian and the other Unicode versions (the ones ending with W), it will be a bit harder. Russian does have uppercase chars, and it seems that some of them also differ by 020 from their lowercase form, but others seem to differ by only 1?
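
The differences can be read directly off the Unicode case mapping. The Python sketch below uses a hypothetical helper case_delta and compares code points, which are also the UTF-16 code units for these chars.

```python
def case_delta(ch: str) -> int:
    """Code point distance from a char to its lowercase form."""
    return ord(ch.lower()) - ord(ch)

# Basic Russian alphabet, U+0410..U+042F: always 0x20
basic = {case_delta(chr(cp)) for cp in range(0x0410, 0x0430)}

# U+0401 (IO) is outside that block and maps to U+0451
io_delta = case_delta("\u0401")

# Historic letters from U+0460 on alternate upper/lower: delta 1
omega_delta = case_delta("\u0460")

print(basic, hex(io_delta), omega_delta)   # {32} 0x50 1
```

So a fixed 020 works for the 32 basic letters, but not for every Cyrillic char.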

https://www.unicode.org/charts/

I'm not there yet on the Unicode version, but I'm close to a solution (at least for Unicode, and not UTF16 etc. that we discussed before. Russian is UTF16, right ?)

On the Ansi version I'll add a small parameter to handle extra info, such as the language, and I'll add at least the identification of accents for Portuguese, Italian, French, Spanish (or German, if it has accents and is ANSI as well). Perhaps the same can be done for Russian. I don't know yet.

More info I'm researching:
https://www.optilingo.com/blog/french/french-accent-marks/
https://www.busuu.com/en/french/accent-marks
https://www.fluentin3months.com/french-accent-marks/?expand_article=1
https://studyspanish.com/typing-spanish-accents
https://en.wiktionary.org/wiki/Appendix:Spanish_alphabet
https://en.wikipedia.org/wiki/Russian_alphabet

jj2007

Quote from: guga on August 23, 2023, 07:58:35 AMRussian is UTF16, right ?

Russian can be handled with Utf-8, Utf-16 or its native codepage. My text is Utf-8, and so is 90% of the Internet.

Russian is only one example. You know I am a great fan of speed, but for Upper$() I chose a slow algo: MultiByteToWideChar works under the hood, slow but versatile and reliable.

guga

One question
About MultiByteToWideChar: do you use different codepages for the different languages, or only UTF8?

I don't know exactly how strings in UTF8 (Russian, Japanese etc.) work, but most of them seem to have the same difference of 32 between the cases (upper or lower). Maybe the open source version of MultiByteToWideChar from Wine or ReactOS can give more clues on how to handle other languages in a faster way.

The problem is identifying the language, but if (and it is a big IF) Unicode chars differ by only 32 (with few exceptions), can we speed things up using a table of languages for the exceptional chars or situations?
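
In UTF-8 the picture is different from the code point view, because each Cyrillic letter takes two bytes. A short Python check (the helper name is hypothetical) shows that a difference of 020 in the code point is not always a difference of 020 in the second byte:

```python
def utf8_case_pair(ch: str):
    """UTF-8 bytes of the uppercase and lowercase form of one char."""
    return ch.upper().encode("utf-8"), ch.lower().encode("utf-8")

pe = utf8_case_pair("\u043f")   # Cyrillic pe, U+041F / U+043F
er = utf8_case_pair("\u0440")   # Cyrillic er, U+0420 / U+0440

print(pe[0].hex(" "), "->", pe[1].hex(" "))   # d0 9f -> d0 bf
print(er[0].hex(" "), "->", er[1].hex(" "))   # d0 a0 -> d1 80
```

For the first half of the alphabet only the second byte changes by 020, but from U+0420 on the lowercase letter moves to lead byte 0D1, so both bytes change. A byte-wise SSE2 pass over UTF-8 would therefore have to handle lead and continuation bytes together.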

https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

guga

Another thing. Perhaps this can give more clues on how to do it:

https://www.coderstool.com/utf16-encoding-decoding

Put your text there ("Введите текст здесь: Enter text here in Russian"),

then click on convert and then on convert to uppercase/lowercase. We can use the results to identify the differences between each pair of chars.