Duplicated values removal in array of floats

guga · June 02, 2024, 07:21:09 AM

Using SSE2, someone has a faster way to remove duplications on a array of Real4 ? Btw...one for sorted and other for unsorted arrays. Pls

raymond · June 03, 2024, 02:37:40 AM

I have no idea about your intended end result.

Removing duplicates from an array of REAL4s is no different than removing duplicates from an array of 32-bit integers. The use of SSE2 instructions to provide any speed advantage may thus be questionable from that point of view. However, I have no expertise with SSE2 and someone else may want to inject some light on that subject.

Removing duplicates from arrays of REAL8s in 32-bit mode would be a different beast to be treated differently, while it could be done similarly in 64-bit mode.

Removing duplicates from arrays of REAL10s becomes "two bits" more tricky using integers!

guga · June 03, 2024, 04:38:51 AM

Hi Raymond. I created a routine to remove duplicated values from a array of floats without using SSE2 (Same as as regular Dwords, as you said). But i was thinking on a way to make it faster working with 4 dwords at once. This is for my dbscan algorithm i'm making. Since we are dealing with images, the final array of data (pixels) can be huge.

daydreamer · June 03, 2024, 05:13:06 AM

Similar approach as count unique colours ?
24 bit color data used as pointer to color array ?
Many of pixels same colours = many writes to same array element ?
For example 0,0,ff pixel color array [255] +Set bit flag that color exists

guga · June 03, 2024, 06:19:30 AM

Hi Daydreamer. The pixels are in ARGB or BGRA format, therefore 32 bits. They are loaded using Gdi+

The part of the code i made to remove the duplications is like this:

Code Select

    ; 2 - remove the duplications (And make delta absolute)
    xorps xmm2 xmm2
    xor ecx ecx
    xor edx edx
    .Do
        movss xmm0 X$esi+edx*4 ; esi = Pointer to the array of Floats
        mov eax D$esi+edx*4
        Do
            inc edx
            mov ebx D$esi+edx*4
            On ebx = 0, jmp L1> ; out the loop if zero is found
            On edx => D@num_points, jmp L1> ; out the loop if the counter reaches the number of points
        Loop_Until eax <> ebx
        mov D$esi+ecx*4 eax

        movss xmm1 X$esi+edx*4 ; prepare to get the maximum value of the delta inside this same routine
        subss xmm0 xmm1
        SSE_ABS_REAL4 xmm0 ; and convert them to absolute
        maxss xmm2 xmm0 ; get the maximum distance between 2 points

        inc ecx
    .Loop_Until edx >= D@num_points

L1:

What this part of the code do is while remove the repetitive values, it calculate the maximum delta between one and another (the ones that are unique) to be used in another routine. Here, the array is already ordered

Btw...the content of the array in esi is no longer the pixels, but the pixels density. So, a Real4 value representing the delta between their neighbors. This part i already calculated and placed on this array. The problem is that it contains thousands of duplications that needs to be removed. Also, on the same routine i calculated the maximum delta between the unique values on this array to use in another part of the code.

Baisally it does this:

Array: 30.0 50.0 50.0 50.0 50.0 51.6 99.9 .......

remove the duplications and also calculate the delta at once

Array Result: 30.0 50.0 51.6 99.9 .......

DeltaMax = (abs(30-50)), (abs(50-51)), abs(51-51.6), abs (51.6-99.9) and so on. = The maximum value of the deltas on the unique values of the array.

After the main loop, ecx contains the total amount of unique values of the array. No need to create another memory buffer to store the new array, since i´m using esi as input and output.

SSE_ABS_REAL4 is a macro like this:

[SSE_ABS_REAL4 | pslld #1 1 | psrld #1 1]
and the other macro for the loop is:
[Do | K0:]
[Loop_Until | #=3 | cmp #1 #3 | jn#2 K0<] . unfold as: cmp eax ebx | je K0<

The problem is that the 1st loop is slow. I tried to extend it to find the next 4 or even 8 dwords after, but i failed to make it get back to the correct pointer.
like this:
xmm0 = 4 floats
xmm1 = next 4 floats
compare xmm0 with xmm1, if equal jump over 8 floats and back to the loop and so on. And this is where i was failing, because i couldn´t make it go back to the proper pointer to do the comparison all over again.

This part of the code is responsible to get the delta of the density of each pixel that was preordered with quicksort, but it contains thousands of duplications. The performance already slowed down due to the usage of quicksort, now with this duplication removal it get a bit worst in terms of speed. That´s why i´m trying a faster routine in at least one of those two main parts that can slow down the code.

NoCforMe · June 03, 2024, 07:15:21 AM

Quote from: guga on June 03, 2024, 06:19:30 AMThe pixels are in ARGB or BGRA format, therefore 32 bits.

Therefore they're not floats; they're really 32-bit structures.
Not that it matters; you're just trying to get rid of any duplicate 32-bit things, whatever they may be.
Wish I could help you with the SSE2 stuff.

guga · June 03, 2024, 08:44:30 AM

Not necessarily. The source pixel values are no longer used when the algorithm starts. The resulting value is the delta of each pixel in each channel, calculated by the distance between neighboring pixels. In fact, before the algorithm starts, it converts the image to grayscale. Converting to gray don´t affect the result....at least, not in the tests I'm doing, since to speed things up, I'm converting to grayscale as mentioned in the articles I read. And if it works, it might be possible to work with the channels separately if that changes anything when working with color images. But I don't think it will change much, since, in one way or another, when we convert to gray, we are working with Luma values that are invariable, that is, they do not change the aspect (texture, grain, Hue, saturation etc etc etc) of the image.

What dbscan does is calculate the difference (delta) between each pixel and its neighbors, regardless of the colorspace used (RGB, CieLCH, HSL, etc.). Of course, any resulting value for this delta will not exceed the limit of one pixel (255 or equivalent if not using RGB as input), but we no longer work directly with the pixels of the image, neither their structures.

The entire delta is placed in an array (of Floats).

QuoteWish I could help you with the SSE2 stuff.

Tks :) ´ll give another try later. It is possible, but i spent the last 2 or 3 days trying to improve this and nothing yet. So, i´ll test later and check the results.

daydreamer · June 03, 2024, 03:36:01 PM

Wouldn't SSE abs 4 values at a time followed by maxps xmm0,xmm2 speed up ?
Unroll the above few times ?
Unroll inner loop and align 64 before loop ?

If you compare with packed compares equals
50.0,50.0 ,you get a mask ,use mask for add EDI, by 1 element or not
Copy to new array,if two copies are found,increment not

jj2007 · June 03, 2024, 09:16:41 PM

Quote from: guga on June 03, 2024, 04:38:51 AMmake it faster working with 4 dwords at once

Have a look at PCMPEQD and PCMPESTRI.

HSE · June 04, 2024, 05:03:33 AM

Hi Guga,

Quote from: guga on June 02, 2024, 07:21:09 AMsomeone has a faster way to remove duplications on a array of Real4

Perhaps faster ways is to prevent addition of duplicated values in the array.

Regards, HSE.

NoCforMe · June 04, 2024, 05:18:42 AM

Quote from: HSE on June 04, 2024, 05:03:33 AMPerhaps faster ways is to prevent addition of duplicated values in the array.

But but but ... wouldn't that require a (linear, binary) search every time you add a value? Seems like that would be the slowest of all, compared to just going through the completed array one time.

daydreamer · June 04, 2024, 05:40:54 AM

Quote from: HSE on June 04, 2024, 05:03:33 AMHi Guga,

Quote from: guga on June 02, 2024, 07:21:09 AMsomeone has a faster way to remove duplications on a array of Real4

Perhaps faster ways is to prevent addition of duplicated values in the array.

Regards, HSE.

That's possible with sse2 packed compare instruction that creates masks 0 or 0ffffffffh
Movd xmm6,array-1
Movd xmm7,array
Pmpeq xmm6,xmm7
Movd mask,xmm6
Movd xmm7,array
Pand xmm7,mask
Movd xmm0,sum
Paddd xmm0,xmm7 ; add zero if duplicate ,add delta if none duplicate

HSE · June 04, 2024, 07:58:43 AM

Quote from: NoCforMe on June 04, 2024, 05:18:42 AMcompared to just going through the completed array one time.

Think

NoCforMe · June 04, 2024, 08:31:13 AM

I did think; that's why I posted that.

OK, as you build the list, the first versions will be very short, but getting longer each time, right? 1, 2, 3, ... n.
So if you do things your way you'll be dealing with a total of

n(n+1)∕2

potential comparisons, instead of just

.
(Of course if you use anything besides a dumbass linear search you can reduce that number, but never as low as n.)

Or am I missing something here?

jj2007 · June 04, 2024, 08:41:44 AM

Quote from: daydreamer on June 04, 2024, 05:40:54 AMPmpeq xmm6,xmm7

You probably mean pcmpeqd.

Quote from: jj2007 on June 03, 2024, 09:16:41 PMHave a look at PCMPEQD.

The MASM Forum

News:

Duplicated values removal in array of floats

guga

raymond

guga

daydreamer

guga

NoCforMe

guga

daydreamer

jj2007

HSE

NoCforMe

daydreamer

HSE

NoCforMe

jj2007