News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Fast Sorting algo help

Started by guga, September 23, 2014, 05:10:13 AM

Previous topic - Next topic

guga

Hi guys

I need help on an example of a fast structure sorting algoritm

The goal is sort a array of structure based on the contents of one of it´s members.

The structure is on this syntax
[DFMScript:
DFMScript.pLabelName: D$ Label1
DFMScript.pSigLabel: D$ SigLabel1
DFMScript.pExtention: D$ ExtLabel1
DFMScript.pSignatureData: D$ SigData1
DFMScript.pSigSize: D$ 16]

[Label1: B$ "Trilobyte's JPEG graphics Library", 0]
[SigLabel1: "Sample Name", 0]
[ExtLabel1: B$ "jpg", 0]
[SigData1: B$ 084 010 0F1 0F2 0F3 0F4 01E 00 01 010 08 09 07 06 05 04]


When i have an array of n structures they need to be sorted biased on the SigSize and also on pSignatureData. So, the 1st array of the structure must be the value with bigger size. Ex:

Unsorted

[DFMScript:
DFMScript.pLabelName: D$ Label2
DFMScript.pSigLabel: D$ SigLabel2
DFMScript.pExtention: D$ ExtLabel2
DFMScript.pSignatureData: D$ SigData2
DFMScript.pSigSize: D$ 10]
[DFMScript:
DFMScript.pLabelName: D$ Label1
DFMScript.pSigLabel: D$ SigLabel1
DFMScript.pExtention: D$ ExtLabel1
DFMScript.pSignatureData: D$ SigData1
DFMScript.pSigSize: D$ 16]

[Label1: B$ "Trilobyte's JPEG graphics Library", 0]
[SigLabel1: "Sample Name", 0]
[ExtLabel1: B$ "jpg", 0]
[SigData1: B$ 084 010 0F1 0F2 0F3 0F4 01E 00 01 010 08 09 07 06 05 04]

[Label2: B$ "Trilobyte's JPEG graphics Library 2", 0]
[SigLabel2: "Sample Name 2", 0]
[ExtLabel2: B$ "jpg", 0]
[SigData2: B$ 084 010 0F1 0F2 0F3 0F0 09 00 01 010]



Sorted


Unsorted

[DFMScript:
DFMScript.pLabelName: D$ Label1
DFMScript.pSigLabel: D$ SigLabel1
DFMScript.pExtention: D$ ExtLabel1
DFMScript.pSignatureData: D$ SigData1
DFMScript.pSigSize: D$ 16]

[DFMScript:
DFMScript.pLabelName: D$ Label2
DFMScript.pSigLabel: D$ SigLabel2
DFMScript.pExtention: D$ ExtLabel2
DFMScript.pSignatureData: D$ SigData2
DFMScript.pSigSize: D$ 10]

[Label1: B$ "Trilobyte's JPEG graphics Library", 0]
[SigLabel1: "Sample Name", 0]
[ExtLabel1: B$ "jpg", 0]
[SigData1: B$ 084 010 0F1 0F2 0F3 0F4 01E 00 01 010 08 09 07 06 05 04]

[Label2: B$ "Trilobyte's JPEG graphics Library 2", 0]
[SigLabel2: "Sample Name 2", 0]
[ExtLabel2: B$ "jpg", 0]
[SigData2: B$ 084 010 0F1 0F2 0F3 0F0 09 00 01 010]


Btw...what needs to be sorted is the structure members and not necessarily their pointers.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Gustavo,

You could treat the structures as "strings", and use crt_qsort. Erol has an example. The CompareProc needs to decide which array element is "higher".

KeepingRealBusy

Guga,

I have a few questions:

How big is the array (how many elements)?

Is it memory contained or on file?

If it is memory contained, then this is just a string sort (with a string size element) with a doubly indexed string pointer. Do a quick sort and just swap the structure pointers.

Dave.

guga

Tks JJ, i´ll give a test

KeepingRealBusy, each structure have 5 members. I need to sort the structure biased on the contents of the 4th and 5th  members (4th member is a pointers to a sequence of bytes. 5th member is the total size of the sequence). I don´t want to reorder the members of each structure, just reorder the structures themselves

The sort is memory contained. It will load a file containing, let´s say 5000 structures that needs to be sorted.

It is a signature database i´m making for a DFM parser.
The user have 2 options
1st create his own database in binary format from the script file.
2nd use a default binary database.

What i´m doing is a file format identifier to be used on the DFM decoder. The script syntax is similar to one found in PEId, Trid, Droid etc. It is something like:
[Trilobyte's JPEG graphics Library:||jpg|84 10 F1 F2 F3 F4 1E 00 01 10 08 09 07 06 05 04
[TBitmap:|TBitmap|bmp|042 04D ???? ?? ?? ???? 028
[TImageList:|TImageList|iml|049 04C
[Icon:||ico|00 00 00 01 00
[Cursor:||cur|00 00 00 02 00
[Bitmap(PBM):||pbm|050 036 0A
[Delphi ImageList:||iml|049 04C
[Wavelet compressed bitmap:||wvl|057 049
[Trilobyte's JPEG graphics Library:||jpg|84 10 FF FF FF FF 1E 00 01 10 08 00 00 00 00 00


The script is converted to binary, from where the user can load it to accelerate the recognition process.

The binary data is on the format i posted above
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

KeepingRealBusy

Guga,

This is not quite a string sort as described in your first post. What you described was a sort that sorted all long strings first, then within strings of the same size, sorted on the string content.

Since this is memory contained, sort in two passes, the first pass sorts the strings by size, then sort the same sized strings by string content. all swapping is done by swapping the structure pointers.

Dave.

guga

Hmm....maybe i´m not making it clear

The strings are not what need to be sorted.

The strings will be located outside the array of structures (I mean, they are not part of the structures array that needs to be sorted). What is being sorted is only each structure, but the structure needs to take onto account the content of a pointer.

Ex: if the 4th member of 2nd structure points to a data containing: 09 01 02 03
and the 4th member of the 1st structure  points to a data  containing: FF 07 02 03

The the order of the sorted structures will be:
2nd structure
1st structure

2nd structure comes 1st because the data pointed by the 4th member  starts with 09 which is smaller then 0FF

The strings (contents of the members are located elsewhere (in general after the last structure), but those strings does not needs to be sorted, since that all we need is the reorder of the structures that will point to them anyway.

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

MichaelW

If the array must be sorted on more than one key then the sorting algorithm needs to be "stable",  and the CRT qsort is the wrong choice.
Well Microsoft, here's another nice mess you've gotten us into.

jj2007

Why exactly does it have to be a stable sort, Michael?

Interesting read at Intel developer zone:
QuoteAn easy way to perform a stable sort would be to modify the comparison function. When content of keys are ==, use address of key to specify key of lower address as less than other key. IOW the external results of the comparison will never show ==.

Problem is colleagues don't agree with Jim Dempsey :(

@guga: an example file would be nice...

guga

Hi JJ

i´ll try making the function i´m working export the file to be sorted and will upload it here to you see..
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Ok, JJ.

I´m posting the generated file here (without the header which i´ll build later). The extension name (dft - Digital FileType) is just temporary. The contents is formed by 9 structures, each one of them containing 6 members, folllowed by the member´s content pointers.

The structure have this format:
Quote
[DFMScript:
DFMScript.pLabelName: D$ 0 <----Pointer to the label name, relative to the start of the file
DFMScript.pSigLabel: D$ 0 <----Pointer to the Signature name, relative to the start of the file
DFMScript.pExtention: D$ 0 <----Pointer to the extension name, relative to the start of the file
DFMScript.pSignatureDataLen: D$ 0 <----------- the total amount of encoded bytes (Also, when sorting, it will take onto acount the amounf of encoded bytes)
DFMScript.pSignatureData: D$ 0 <----Pointer to the signature bytes data (encoded), relative to the start of the file (This pointer is what is biased to be sorted)
DFMScript.pSigSize: D$ 0] <--- Total amounf of undecoded bytes of the signature

The listing is like:

0000000 DFMScript       struc ; (sizeof=0x18)   ; XREF: seg000:FileStartr
00000000                                         ; seg000:00000018r ...
00000000 pLabelName      dd ?                    ; offset (00000000)
00000004 pSigLabel       dd ?                    ; offset (00000000)
00000008 pExtention      dd ?                    ; offset (00000000)
0000000C pSignatureDataLen dd ?                  ; base 10
00000010 pSignatureData  dd ?                    ; offset (00000000)
00000014 pSigSize        dd ?                    ; base 10
00000018 DFMScript       ends
------------------------------------------------------------------------

seg000:00000000 FileStart       DFMScript <offset aTrilobyteSJpeg,0, offset aJpg, 20, \
seg000:00000000                                         ; DATA XREF: seg000:FileStarto
seg000:00000000                                         ; seg000:00000048o ...
seg000:00000000                            offset unk_100, 16> ; "Trilobyte's JPEG graphics Library"
seg000:00000018                 DFMScript <offset aTbitmap, offset aTbitmap_0, offset aBmp, 7, \ ; "TBitmap"
seg000:00000018                            offset unk_12C, 15>
seg000:00000030                 DFMScript <offset aTimagelist, offset aTimagelist_0, offset aIml, 3, \ ; "TImageList"
seg000:00000030                            offset unk_150, 2>
seg000:00000048                 DFMScript <offset aIcon, 0, offset aIco, 7, \ ; "Icon"
seg000:00000048                            offset unk_160, 5>
seg000:00000060                 DFMScript <offset aCursor, 0, offset aCur, 7, \ ; "Cursor"
seg000:00000060                            offset unk_174, 5>
seg000:00000078                 DFMScript <offset unk_17C, 0, offset unk_188, 5, \
seg000:00000078                            offset unk_18C, 3>
seg000:00000090                 DFMScript <offset aDelphiImagelis, 0, offset aIml_0, 3,\ ; "Delphi ImageList"
seg000:00000090                            offset unk_1AC, 2>
seg000:000000A8                 DFMScript <offset aWaveletCompres, 0, offset aWvl, 3, \ ; "Wavelet compressed bitmap"
seg000:000000A8                            offset unk_1D0, 2>
seg000:000000C0                 DFMScript <offset aTrilobyteSJp_0, 0, offset aJpg_0, \ ; "Trilobyte's JPEG graphics Library"
seg000:000000C0                            20, offset unk_1FC, 16>
seg000:000000D8 aTrilobyteSJpeg db 'Trilobyte',27h,'s JPEG graphics Library',0
seg000:000000D8                                         ; DATA XREF: seg000:FileStarto
seg000:000000FA                 db    0
seg000:000000FB                 db    0
seg000:000000FC aJpg            db 'jpg',0              ; DATA XREF: seg000:FileStarto
seg000:00000100 unk_100         db    3                 ; DATA XREF: seg000:FileStarto
seg000:00000101                 db  84h ; ä
seg000:00000102                 db  10h
seg000:00000103                 db 0F1h ; ±
seg000:00000104                 db 0F2h ; =
seg000:00000105                 db    3
seg000:00000106                 db 0F3h ; =
seg000:00000107                 db 0F4h ; (
seg000:00000108                 db  1Eh
seg000:00000109                 db    0
seg000:0000010A                 db    3
seg000:0000010B                 db    1
seg000:0000010C                 db  10h
seg000:0000010D                 db    8
seg000:0000010E                 db    9
seg000:0000010F                 db    3
seg000:00000110                 db    7
seg000:00000111                 db    6
seg000:00000112                 db    5
seg000:00000113                 db    4
seg000:00000114                 db    0
seg000:00000115                 db    0
seg000:00000116                 db    0
seg000:00000117                 db    0
seg000:00000118 aTbitmap        db 'TBitmap',0          ; DATA XREF: seg000:00000018o
seg000:00000120 aTbitmap_0      db 'TBitmap',0          ; DATA XREF: seg000:00000018o
seg000:00000128 aBmp            db 'bmp',0              ; DATA XREF: seg000:00000018o
seg000:0000012C unk_12C         db    2                 ; DATA XREF: seg000:00000018o
seg000:0000012D                 db  42h ; B
seg000:0000012E                 db  4Dh ; M
seg000:0000012F                 db    5
(...)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Ok, guys, I guess, i´m at the right path. I made a sorting algo based on http://www.java2s.com/Tutorial/C/0180__Structure/SortingStructures.htm

I didn´t tested it for speed yet. (Since the database will be large, faster algos will be required), but....here is what i´ve done so far:

Proc quick_struct:
    Arguments @items, @count
    Uses eax, ebx, ecx, esi, edi, edx

    mov eax D@count | dec eax
    call qs_struct D@items, 0, eax

EndP


[temp:
temp.pLabelName: D$ 0
temp.pSigLabel: D$ 0
temp.pExtention: D$ 0
temp.pSignatureDataLen: D$ 0
temp.pSignatureData: D$ 0
temp.pSigSize: D$ 0]

; equates used
[DFMScript.pLabelNameDis 0
DFMScript.pSigLabelDis 4
DFMScript.pExtentionDis 8
DFMScript.pSignatureDataLenDis 12
DFMScript.pSignatureDataDis 16
DFMScript.pSigSizeDis 20]

[Size_of_DFMScript 24]

Proc qs_struct:
    Arguments @items, @left, @right
    Local @iCounter, @jCounter, @xPos, @DataLen1, @DataLen2

    move D@iCounter D@left
    move D@jCounter D@right

    mov eax D@left | add eax D@right | cdq | sub eax edx | sar eax 1 | imul eax Size_of_DFMScript
    mov ecx D@items
    mov edx D$ecx+eax+DFMScript.pSignatureDataDis | add edx D$ScriptData2 | mov D@xPos edx
    mov ebx D$ecx+eax+DFMScript.pSignatureDataLenDis | mov D@DataLen2 ebx

L0: ; loop2

    .Do

        mov ecx D@iCounter | imul ecx Size_of_DFMScript | mov edx D@items | mov eax D$edx+ecx+DFMScript.pSignatureDataDis | add eax D$ScriptData2
        mov ecx D$edx+ecx+DFMScript.pSignatureDataLenDis

        call DataCmp eax, ecx, D@xPos, D@DataLen2
        test eax eax | jge L1>
        mov eax D@iCounter
        cmp eax D@right | jge L1>
        inc D@iCounter
        jmp L0<
L1: ; loop1


        mov ecx D@jCounter | imul ecx Size_of_DFMScript | mov edx D@items | mov eax D$edx+ecx+DFMScript.pSignatureDataDis | add eax D$ScriptData2
        mov ecx D$edx+ecx+DFMScript.pSignatureDataLenDis

        call DataCmp eax, ecx, D@xPos, D@DataLen2
        test eax eax | jle L2>
        mov eax D@jCounter
        cmp eax D@left | jle L2>
        dec D@jCounter
        jmp L1<

L2:

        mov eax D@iCounter
        ...If eax <s= D@jCounter

            ; temp = items[i];
            mov esi D@iCounter | imul esi Size_of_DFMScript | add esi D@items | mov ecx 6 | mov edi Temp | rep movsd
            ; items[i] = items[j];
            mov esi D@jCounter | imul esi Size_of_DFMScript | add esi D@items | mov edi D@iCounter | imul edi Size_of_DFMScript | add edi D@items | mov ecx 6 | rep movsd
            ; items[j] = temp;
            mov edi D@jCounter | imul edi Size_of_DFMScript | add edi D@items | mov ecx 6 | mov esi Temp | rep movsd

            inc D@iCounter
            dec D@jCounter
        ...End_If

        mov eax D@iCounter
    .Loop_Until eax >s D@jCounter

    mov eax D@left
    If eax <s D@jCounter
        call qs_struct D@items, D@left, D@jCounter
    End_If

    mov eax D@iCounter
    If eax <s D@right
        call qs_struct D@items, D@iCounter, D@right
    End_If

EndP


; DataCmp have similar functionality as Stricmp. The returned values are the same.

Proc DataCmp:
    Arguments @String1, @String1Len, @String2, @String2Len
    Uses edi, esi, ecx

    mov esi D@String2
    mov edi D@String1

    mov ecx D@String1Len
    If ecx > D@String2Len
        mov ecx D@String2Len
    End_If
   
        Do
            On ecx = 0, jmp L2>
            mov al B$esi | inc esi
            mov ah B$edi | inc edi
            dec ecx
        Loop_Until ah <> al
    sbb al al
    sbb al 0-1
L2:
    movsx eax al

EndP


called as:


; ScriptData2 is the pointer to the binary data file
call quick_struct D$ScriptData2, 9


The structure (in binary data) seems to be ordered. I´m showing here the resultant values on the original script file just To make easier to identify the order of the binary data.

unordered

[GugaScript: B$ "
[Trilobyte's JPEG graphics Library:||jpg|84 10 F1 F2 F3 F4 1E 00 01 10 08 09 07 06 05 04
[TBitmap:|TBitmap|bmp|042 04D ???? ?? ?? ???? 028
[TImageList:|TImageList|iml|049 04C
[Icon:||ico|00 00 00 01 00
[Cursor:||cur|00 00 00 02 00
[Bitmap(PBM):||pbm|050 036 0A
[Delphi ImageList:||iml|049 04C
[Wavelet compressed bitmap:||wvl|057 049
[Trilobyte's JPEG graphics Library:||jpg|84 10 FF FF FF FF 1E 00 01 10 08 00 00 00 00 00
", 0]


ordered

[Trilobyte's JPEG graphics Library:||jpg|84 10 FF FF FF FF 1E 00 01 10 08 00 00 00 00 00
[Trilobyte's JPEG graphics Library:||jpg|84 10 F1 F2 F3 F4 1E 00 01 10 08 09 07 06 05 04
[Cursor:||cur|00 00 00 02 00
[Icon:||ico|00 00 00 01 00
[Wavelet compressed bitmap:||wvl|057 049
[Bitmap(PBM):||pbm|050 036 0A
[Delphi ImageList:||iml|049 04C
[TImageList:|TImageList|iml|049 04C
[TBitmap:|TBitmap|bmp|042 04D ???? ?? ?? ???? 028


The problem seems to be on TBitmap which was not supposed to be at the last item, but this perhaps is due to the fact that i encoded the signature before the sorting function.
Also, Cursor and Icon should be at the last itens, since they starts with 0

I´ll give a try with the sorting algo decoding only the necessary comparition chunck inside the DataCmp function.  I could, however, simply not encode before, and encode only after the sort algo played, but, this will make it loose speed, as long i´ll have to sort and parse the strings data that contains spaces, "?" marks etc etc
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

This Sort is not working properly :(

After using the decoded version i got as results something like (The resultant values displayed here are the original script file just To make easier to identify the order of the binary data.)


[Icon:||ico|00 00 00 01 00
[Cursor:||cur|00 00 00 02 00
[TImageList:|TImageList|iml|049 04C
[TBitmap:|TBitmap|bmp|042 04D ???? ?? ?? ???? 028
[Delphi ImageList:||iml|049 04C
[Bitmap(PBM):||pbm|050 036 0A
[Wavelet compressed bitmap:||wvl|057 049
[Trilobyte's JPEG graphics Library:||jpg|84 10 F1 F2 F3 F4 1E 00 01 10 08 09 07 06 05 04
[Trilobyte's JPEG graphics Library:||jpg|84 10 FF FF FF FF 1E 00 01 10 08 00 00 00 00 00


Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

qWord

Not sure if I interpret that correct, but I guess you can just use qsort whereby comparing the secondary keys only if the primary keys are equal.
MREAL macros - when you need floating point arithmetic while assembling!

jj2007

Quote from: qWord on September 24, 2014, 03:56:54 AM
Not sure if I interpret that correct, but I guess you can just use qsort whereby comparing the secondary keys only if the primary keys are equal.

That was also my idea, but Michael (#6) says no ::)

What do you think of Jim Dempsey's idea to make qsort stable?

KeepingRealBusy

My take on this is that Qsort will work as long as it is an indirect sort, i.e. where you swap the pointers to data or to structures of pointers to data, leaving the records in place.

Dave.