Using SSE2, someone has a faster way to remove duplicates from an array of Real4? Btw... one for sorted and another for unsorted arrays. Pls
I have no idea about your intended end result.
Removing duplicates from an array of REAL4s is no different than removing duplicates from an array of 32-bit integers. The use of SSE2 instructions to provide any speed advantage may thus be questionable from that point of view. However, I have no expertise with SSE2 and someone else may want to inject some light on that subject.
Removing duplicates from arrays of REAL8s in 32-bit mode would be a different beast to be treated differently, while it could be done similarly in 64-bit mode.
Removing duplicates from arrays of REAL10s becomes "two bits" more tricky using integers! :dazzled:
Hi Raymond. I created a routine to remove duplicated values from an array of floats without using SSE2 (same as regular dwords, as you said). But I was thinking of a way to make it faster working with 4 dwords at once. This is for the dbscan algorithm I'm making. Since we are dealing with images, the final array of data (pixels) can be huge.
A similar approach to counting unique colours?
24-bit colour data used as an index into a colour array?
Many pixels with the same colour = many writes to the same array element?
For example, a 0,0,FF pixel maps to colour array[255]; set a bit flag that the colour exists.
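Roughly like this in C (just a sketch of the bit-flag idea; names made up, one presence bit per possible 24-bit colour):

#include <stdint.h>
#include <string.h>

/* Count unique 24-bit colours with a presence bitmap: one bit per possible colour. */
static uint32_t count_unique_colors(const uint32_t *pixels, size_t count)
{
    static uint8_t seen[(1 << 24) / 8];           /* 2 MB: one bit per colour */
    memset(seen, 0, sizeof(seen));
    uint32_t unique = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t c = pixels[i] & 0x00FFFFFF;      /* drop the alpha byte */
        if (!(seen[c >> 3] & (1u << (c & 7)))) {  /* colour not seen yet? */
            seen[c >> 3] |= (uint8_t)(1u << (c & 7));
            unique++;
        }
    }
    return unique;
}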
Hi Daydreamer. The pixels are in ARGB or BGRA format, therefore 32 bits. They are loaded using GDI+.
The part of the code I made to remove the duplicates is like this:
; 2 - remove the duplications (And make delta absolute)
xorps xmm2 xmm2
xor ecx ecx
xor edx edx
.Do
movss xmm0 X$esi+edx*4 ; esi = Pointer to the array of Floats
mov eax D$esi+edx*4
Do
inc edx
mov ebx D$esi+edx*4
On ebx = 0, jmp L1> ; out the loop if zero is found
On edx => D@num_points, jmp L1> ; out the loop if the counter reaches the number of points
Loop_Until eax <> ebx
mov D$esi+ecx*4 eax ; write the unique value back in place (esi is both input and output)
movss xmm1 X$esi+edx*4 ; prepare to get the maximum value of the delta inside this same routine
subss xmm0 xmm1 ; delta = current unique value - next different value
SSE_ABS_REAL4 xmm0 ; and convert it to absolute
maxss xmm2 xmm0 ; get the maximum distance between 2 points
inc ecx ; one more unique value stored
.Loop_Until edx >= D@num_points
L1:
What this part of the code does is, while removing the repeated values, calculate the maximum delta between each unique value and the next, to be used in another routine. Here, the array is already sorted.
Btw... the content of the array in esi is no longer the pixels, but the pixel density: a Real4 value representing the delta between their neighbors. This part I already calculated and placed in this array. The problem is that it contains thousands of duplicates that need to be removed. Also, in the same routine I calculate the maximum delta between the unique values in this array, to use in another part of the code.
Basically it does this:
Array: 30.0 50.0 50.0 50.0 50.0 51.6 99.9 .......
Remove the duplicates and also calculate the delta at once:
Array Result: 30.0 50.0 51.6 99.9 .......
DeltaMax = max(abs(30-50), abs(50-51.6), abs(51.6-99.9)) and so on = the maximum of the deltas between the unique values of the array.
After the main loop, ecx contains the total number of unique values in the array. There is no need to create another memory buffer to store the new array, since I'm using esi as both input and output.
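In plain C, that pass would look roughly like this (just a scalar reference for clarity; names are made up, and the early exit on a 0.0 value in the asm is left out):

#include <math.h>
#include <stddef.h>

/* Compact a sorted float array in place, dropping duplicates, and return the
   largest delta between consecutive unique values.  *unique_count takes the
   role of ecx above (the new length). */
static float dedup_and_max_delta(float *a, size_t n, size_t *unique_count)
{
    float max_delta = 0.0f;
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        float cur = a[i];
        do { i++; } while (i < n && a[i] == cur);   /* skip the run of duplicates */
        a[out++] = cur;                             /* write the unique value back in place */
        if (i < n)
            max_delta = fmaxf(max_delta, fabsf(a[i] - cur));
    }
    *unique_count = out;
    return max_delta;
}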
SSE_ABS_REAL4 is a macro like this:
[SSE_ABS_REAL4 | pslld #1 1 | psrld #1 1]
and the other macro for the loop is:
[Do | K0:]
[Loop_Until | #=3 | cmp #1 #3 | jn#2 K0<], which for Loop_Until eax <> ebx unfolds as: cmp eax ebx | je K0<
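Side note on SSE_ABS_REAL4: the pslld/psrld pair simply clears the sign bit, which is the same as ANDing with 7FFFFFFFh. In C intrinsics terms it would be something like this (just to illustrate the trick):

#include <emmintrin.h>

/* Absolute value of packed REAL4s: clear the sign bit of each lane. */
static __m128 sse_abs_real4(__m128 x)
{
    return _mm_and_ps(x, _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF)));
}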
The problem is that the 1st loop is slow. I tried to extend it to look at the next 4 or even 8 dwords at once, but I failed to make it get back to the correct pointer.
like this:
xmm0 = 4 floats
xmm1 = next 4 floats
compare xmm0 with xmm1; if they are equal, jump over 8 floats and back to the loop, and so on. And this is where I was failing, because I couldn't make it go back to the proper pointer to do the comparison all over again.
This part of the code is responsible for getting the delta of the density of each pixel, which was pre-sorted with quicksort, but it contains thousands of duplicates. The performance already slowed down due to the usage of quicksort, and with this duplicate removal it gets a bit worse in terms of speed. That's why I'm trying to find a faster routine for at least one of those two main parts that slow down the code.
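One idea I might try, to keep the pointer bookkeeping simple: only use the 4-wide compare to skip whole runs of duplicates and let the scalar code handle everything else, so the read index never has to move backwards. A sketch of the idea in C with SSE2 intrinsics (untested, names made up):

#include <emmintrin.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

/* Dedup a sorted REAL4 array in place and return the largest delta between
   consecutive unique values.  The SSE2 part only skips whole runs of 4
   duplicates at a time; everything else stays scalar. */
static float dedup_max_delta_sse2(float *a, size_t n, size_t *unique_count)
{
    float max_delta = 0.0f;
    size_t out = 0, i = 0;
    while (i < n) {
        float cur = a[i++];
        int curbits;
        memcpy(&curbits, &cur, sizeof(curbits));       /* reuse the bit pattern for integer compares */
        __m128i vcur = _mm_set1_epi32(curbits);        /* broadcast, like shufps reg,reg,0 */
        while (i + 4 <= n) {                           /* fast path: 4 duplicates at once */
            __m128i next4 = _mm_loadu_si128((const __m128i *)(a + i));
            int mask = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(next4, vcur)));
            if (mask != 0xF)
                break;                                 /* some lane differs: the scalar loop finds it */
            i += 4;
        }
        while (i < n && a[i] == cur)                   /* scalar tail of the run */
            i++;
        a[out++] = cur;                                /* write the unique value back in place */
        if (i < n)
            max_delta = fmaxf(max_delta, fabsf(a[i] - cur));
    }
    *unique_count = out;
    return max_delta;
}

Since the read index only ever moves forward, there is nothing to back up when the 4 values are not all equal; the scalar loop just picks up where the vector test stopped.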
Quote from: guga on June 03, 2024, 06:19:30 AM
The pixels are in ARGB or BGRA format, therefore 32 bits.
Therefore they're not floats; they're really 32-bit structures.
Not that it matters; you're just trying to get rid of any duplicate 32-bit things, whatever they may be.
Wish I could help you with the SSE2 stuff.
Not necessarily. The source pixel values are no longer used once the algorithm starts. The resulting value is the delta of each pixel in each channel, calculated from the distance between neighboring pixels. In fact, before the algorithm starts, it converts the image to grayscale. Converting to gray doesn't affect the result... at least, not in the tests I'm doing; to speed things up, I'm converting to grayscale as mentioned in the articles I read. If it works, it might be possible to work with the channels separately, in case that changes anything when working with color images. But I don't think it will change much, since, one way or another, when we convert to gray we are working with luma values that are invariant, that is, they do not change the aspect (texture, grain, hue, saturation, etc.) of the image.
What dbscan does is calculate the difference (delta) between each pixel and its neighbors, regardless of the colorspace used (RGB, CieLCH, HSL, etc.). Of course, any resulting value for this delta will not exceed the limit of one pixel (255, or the equivalent if not using RGB as input), but we no longer work directly with the pixels of the image, nor with their structures.
The entire delta is placed in an array (of Floats).
Quote
Wish I could help you with the SSE2 stuff.
Tks :) I'll give it another try later. It is possible, but I spent the last 2 or 3 days trying to improve this with nothing to show yet. So, I'll test later and check the results.
Wouldn't doing the SSE abs on 4 values at a time, followed by maxps xmm0,xmm2, speed it up?
Unroll the above a few times?
Unroll the inner loop and align 64 before the loop?
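Something like this maybe (C with SSE2 intrinsics, just to show the idea; it assumes the max-delta step is done as its own pass over the already deduplicated array):

#include <emmintrin.h>
#include <stddef.h>

/* 4-wide version of the abs + max step: deltas between neighbours,
   absolute value, running packed maximum, horizontal max at the end. */
static float max_abs_delta_sse2(const float *a, size_t n)
{
    const __m128 signmask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    __m128 vmax = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 5 <= n; i += 4) {
        __m128 cur  = _mm_loadu_ps(a + i);
        __m128 next = _mm_loadu_ps(a + i + 1);
        __m128 d = _mm_and_ps(_mm_sub_ps(next, cur), signmask);  /* |a[i+1] - a[i]| for 4 pairs */
        vmax = _mm_max_ps(vmax, d);
    }
    /* horizontal max of the 4 lanes */
    __m128 t = _mm_max_ps(vmax, _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(2, 3, 0, 1)));
    t = _mm_max_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
    float result = _mm_cvtss_f32(t);
    for (; i + 1 < n; i++) {                                     /* scalar tail */
        float d = a[i + 1] - a[i];
        if (d < 0) d = -d;
        if (d > result) result = d;
    }
    return result;
}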
If you compare with a packed compare-for-equality, e.g. 50.0 vs 50.0, you get a mask; use the mask to decide whether to advance EDI by one element or not.
Copy to a new array: if two copies are found, don't increment.
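In C terms the idea would be roughly this (sketch only; the array is assumed already sorted). The store always happens, only the write index is data-dependent, so there is no branch on the compare:

#include <stddef.h>

/* Branchless dedup of a sorted array: always write the value, only advance
   the output index when it differs from the previous element. */
static size_t dedup_branchless(float *a, size_t n)
{
    if (n == 0) return 0;
    size_t out = 1;                      /* the first element is always kept */
    for (size_t i = 1; i < n; i++) {
        a[out] = a[i];
        out += (a[i] != a[i - 1]);       /* 1 if unique, 0 if duplicate */
    }
    return out;                          /* number of unique values */
}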
Quote from: guga on June 03, 2024, 04:38:51 AM
make it faster working with 4 dwords at once
Have a look at PCMPEQD and PCMPESTRI.
Hi Guga,
Quote from: guga on June 02, 2024, 07:21:09 AM
someone has a faster way to remove duplicates from an array of Real4
Perhaps a faster way is to prevent adding duplicated values to the array.
Regards, HSE.
Quote from: HSE on June 04, 2024, 05:03:33 AM
Perhaps a faster way is to prevent adding duplicated values to the array.
But but but ... wouldn't that require a (linear, binary) search every time you add a value? Seems like that would be the slowest of all, compared to just going through the completed array one time.
Quote from: HSE on June 04, 2024, 05:03:33 AM
Hi Guga,
Quote from: guga on June 02, 2024, 07:21:09 AM
someone has a faster way to remove duplicates from an array of Real4
Perhaps a faster way is to prevent adding duplicated values to the array.
Regards, HSE.
That's possible with an SSE2 packed compare instruction, which creates masks of 0 or 0FFFFFFFFh:
movd xmm6, array-4   ; previous element (one dword back)
movd xmm7, array     ; current element
pcmpeqd xmm6, xmm7   ; xmm6 = 0FFFFFFFFh if it is a duplicate, 0 if not
movd xmm7, array     ; reload the current element
pandn xmm6, xmm7     ; zero if duplicate, the value itself if not
movd xmm0, sum
paddd xmm0, xmm6     ; add zero if duplicate, add the value if not a duplicate (addps if the values are REAL4 floats)
Quote from: NoCforMe on June 04, 2024, 05:18:42 AM
compared to just going through the completed array one time.
Think :biggrin:
I did think; that's why I posted that.
OK, as you build the list, the first versions will be very short, but getting longer each time, right? 1, 2, 3, ... n.
So if you do things your way you'll be dealing with a total of n(n+1)/2 potential comparisons, instead of just n.
(Of course if you use anything besides a dumbass linear search you can reduce that number, but never as low as n.)
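(To put a number on it: for n = 1,000,000 values, n(n+1)/2 is about 5 × 10^11 comparisons, against 10^6 for a single final pass.)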
Or am I missing something here?
Quote from: HSE on June 04, 2024, 08:48:16 AM
What is n?
The number of items in the list (or array if you prefer).
Quote from: NoCforMe on June 04, 2024, 09:00:45 AM
The number of items in the list (or array if you prefer).
:biggrin: When?
As you add items (you start off with zero, of course): 1, 2, 3, ... n.
This is all obvious, right? What are you getting at here?
Quote from: NoCforMe on June 04, 2024, 09:22:41 AM
What are you getting at here?
Apparently you are thinking that all pixels are different. Anyway, that is the worst case. But you don't know.
n is the number of pixels. Testing at the end, you have n*(n-1) comparisons (in the worst case), plus something to exclude the self-comparison or to use it later, or you have to rearrange the array when a pixel value is a duplicate.
Yes, that's the worst-case assumption.
So it depends on how many duplicates there actually are (which of course you have no idea when you're running). If there are few duplicates, then your suggestion (removing duplicates as you go rather than at the end) would probably be better.
It's a race between doing (potentially) n(n-1) comparisons (worst case) at the end, vs. potentially many more comparisons (1+2+3+4+ ... +n) if you do it your way, as you go.
I don't know. It seems like it would take more time your way, but now I'm not sure. (No expert in combinatorial math here.)
OK, let's try this: I'm not a mathematician, but I'm going to play one on TV for just a minute.
To formalize this, I believe (correct me if I'm wrong) that the potential total number of operations (O1) to do things your way--eliminating duplicates as they're added to the array--is this:
[attached image: O1.gif]
while the total number of operations (O2) if you wait until the end to remove duplicates is
[attached image: O2.gif]
Seems like the first number would be greater.
Now I'm going to sit back and wait for others who know more to shoot holes in my theory ...
At the end, probabilistically, it is always worse:
O2 > O1
n(m-1) > n(m+1)/2
Where
n >= m
n: number of values
m: number of unique values
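(Dividing both sides by n, that inequality reduces to m - 1 > (m + 1)/2, i.e. m > 3, so it holds whenever there are more than three unique values.)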
OK, I'll have to take your word for that.
BTW, I may have screwed up O1, which I guess is just the sum from 0 to n, i.e. 0 + 1 + 2 + ... + n
(using a linear search to see if the value being added is already there).
Tks a lot guys. I'll take a look. I'm just finishing the adjustments of the DbScan filter I described here https://masm32.com/board/index.php?topic=11919.30 and later I'll give the duplicate-removal problem a test.