Hi guys
Someone succeeded to create a DBScan algorithm on a given data set ?
Im trying to create a dbscan clustering functiojn to analyse a chunck of data (in dwords or bytes or whatever needed input)
The dbscan is defined as in:
https://medium.com/@balajicena1995/dbscan-clustering-2a577d384e61 (https://medium.com/@balajicena1995/dbscan-clustering-2a577d384e61)
https://www.youtube.com/watch?v=87ogbzhXUxo (https://www.youtube.com/watch?v=87ogbzhXUxo)
https://github.com/gyaikhom/dbscan/blob/master/dbscan.c (https://github.com/gyaikhom/dbscan/blob/master/dbscan.c)
https://dev.to/rajaniraiyn/dbscan-clustering-algorithm-demystified-1d5o (https://dev.to/rajaniraiyn/dbscan-clustering-algorithm-demystified-1d5o)
https://www.codeproject.com/Articles/5129186/Step-by-Step-Guide-to-Implement-Machine-Learning-8 (https://www.codeproject.com/Articles/5129186/Step-by-Step-Guide-to-Implement-Machine-Learning-8)
https://en.wikipedia.org/wiki/DBSCAN (https://en.wikipedia.org/wiki/DBSCAN)
The goal is to determine if a certain chunk of data is homogeneous, i mean, if the data inside a given chunck contains similar values. that can be used to identify/distinguish from others (to use in image processing, for example or watermark or object detection etc)
Homogeneity and Completeness are defined in:
https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html (https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html)
and
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score)
But, i didn´t understood how this can be implemented. The C version in github is way to complex for me to start a simple algorithm.
Chatgpt gave me some ideas on how to do it in masm, but i got nowhere.
On the example i provided i tried to analyse the following sequence:
{0, 1, 1, 1, 1, 1, 1, 1, 25, 15, 96, 19, 1, 1, 2, 25}
Where 1 is the most common value, so the data may lead to some value of homogeneity and COmpleteness.
When i started, chatgpt created me an scheme as:
1st question i made:QuoteTo determine if pixel values in a sequence of images represent the same pixel or object, especially when the majority of pixels are around a certain value and a few are outliers, you can use clustering algorithms or similarity measures. Here's an approach you can take:
Preprocess the Data: Normalize the pixel values to a common scale if necessary. This ensures that all values are comparable.
Cluster Pixel Values: Use a clustering algorithm such as K-means or DBSCAN to group similar pixel values together. This can help identify clusters of pixels that represent the same object or feature in the images. Set the number of clusters based on the expected number of objects or features in the images.
Identify Outliers: Pixels that do not belong to any cluster or belong to very small clusters can be considered outliers. These outliers may represent noise or background pixels that are different from the main object or feature.
Evaluate Results: Inspect the clusters and outliers to determine if they correspond to different objects or features in the images. Visualizing the clusters can help you understand the grouping of pixel values.
Adjust Parameters: If the results are not satisfactory, adjust the parameters of the clustering algorithm (e.g., number of clusters, distance metric) and re-run the analysis.
Post-processing: Depending on the application, you may need to further process the clustered pixel values to refine the identification of objects or features.
By using clustering algorithms and analyzing the distribution of pixel values in the images, you can determine if the values represent the same pixel or object, and identify outliers that may represent noise or background pixels.
2nd questioning lead me to:QuoteTo demonstrate how clustering algorithms can be used to analyze the distribution of pixel values in a sequence of images, let's consider a simplified example with a set of 20 images. We'll use the K-means clustering algorithm to group similar pixel values together and identify outliers that may represent noise or background pixels.
Assuming each image has the same size and we want to analyze a specific pixel (e.g., pixel at coordinates (x, y)), here's how you can approach it:
Load the Images:
Load the 20 images into memory.
Extract Pixel Values:
For each image, extract the pixel value at the specified coordinates (x, y). You will have 20 pixel values for this specific pixel location.
Normalize Pixel Values:
Normalize the pixel values to a common scale if necessary.
Apply K-means Clustering:
Use the K-means clustering algorithm to cluster the pixel values. Set the number of clusters based on the expected number of objects or features in the images.
Identify Outliers:
Pixels that do not belong to any cluster or belong to very small clusters can be considered outliers. These outliers may represent noise or background pixels.
Here's a Python example using the scikit-learn library to perform K-means clustering on a set of 20 images:
import numpy as np
from sklearn.cluster import KMeans
# Assume pixel_values is a 1D array containing pixel values at location (x, y) for 20 images
# This is a simplified example, and in practice, you would extract pixel values from actual images
pixel_values = np.array([120, 130, 140, 150, 160, 200, 210, 220, 230, 240,
150, 140, 135, 130, 125, 195, 205, 210, 215, 220])
# Reshape the array to have a single feature (required by K-means)
pixel_values = pixel_values.reshape(-1, 1)
# Apply K-means clustering with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(pixel_values)
# Get the cluster labels
cluster_labels = kmeans.labels_
# Get the cluster centers
cluster_centers = kmeans.cluster_centers_
# Identify outliers
outliers = pixel_values[np.where(cluster_labels != np.argmax(np.bincount(cluster_labels)))]
print("Cluster Centers:", cluster_centers.flatten())
print("Outliers:", outliers.flatten())
I then asked it to create a similar function in Masm32 and using DbScan instead K-mean
This s the resultant function, ChatGPT gave me
.DATA
pixel_values DWORD 120, 130, 140, 150, 160, 200, 210, 220, 230, 240, \
150, 140, 135, 130, 125, 195, 205, 210, 215, 220
num_pixels DWORD 20
epsilon DWORD 15 ; Adjust epsilon as needed
minPts DWORD 3 ; Adjust minPts as needed
cluster_labels DWORD num_pixels DUP(0)
cluster_count DWORD 0
.CODE
main PROC
; Load addresses of data
mov esi, OFFSET pixel_values
mov edi, OFFSET cluster_labels
; DBSCAN algorithm
mov ecx, num_pixels
outer_loop:
cmp DWORD PTR [edi], 0
jne skip_iteration ; Skip if point is already assigned to a cluster
mov eax, ecx
dec eax
mov ebx, eax
mov edx, epsilon
shl edx, 2 ; Multiply epsilon by 4 (size of DWORD)
mov esi, OFFSET pixel_values
lea esi, [esi + eax * 4]
inner_loop:
cmp DWORD PTR [edi + eax * 4], 0
jne next_iteration ; Skip if point is already assigned to a cluster
mov esi, OFFSET pixel_values
lea esi, [esi + ebx * 4]
sub esi, OFFSET pixel_values
mov edx, epsilon
shl edx, 2 ; Multiply epsilon by 4 (size of DWORD)
mov eax, ebx
mov ecx, eax
call distance ; Calculate distance between points
cmp eax, edx ; Compare distance with epsilon
jge next_iteration ; Skip if distance is greater than epsilon
inc ecx ; Increment count of neighbors
cmp ecx, minPts
jl next_iteration ; Skip if minimum number of points is not reached
mov eax, ebx
inc cluster_count ; Assign new cluster ID
assign_cluster:
mov DWORD PTR [edi + eax * 4], cluster_count
jmp next_iteration
skip_iteration:
dec ecx
next_iteration:
dec ebx
jns inner_loop
inc ecx
cmp ecx, num_pixels
jl outer_loop
; Output cluster labels
mov esi, OFFSET cluster_labels
mov ecx, num_pixels
print_clusters:
mov eax, [esi]
call WriteDec
call Crlf
add esi, 4
loop print_clusters
; Exit program
call ExitProcess
main ENDP
distance PROC
; Calculate Euclidean distance between two points
; Input: ESI - offset of first point, EDI - offset of second point
; Output: EAX - distance
mov eax, DWORD PTR [esi]
sub eax, DWORD PTR [edi]
imul eax, eax
mov edx, DWORD PTR [esi + 4]
sub edx, DWORD PTR [edi + 4]
imul edx, edx
add eax, edx
mov edx, DWORD PTR [esi + 8]
sub edx, DWORD PTR [edi + 8]
imul edx, edx
add eax, edx
mov edx, DWORD PTR [esi + 12]
sub edx, DWORD PTR [edi + 12]
imul edx, edx
add eax, edx
call Sqrt
ret
distance ENDP
END main
But this was just to Ok,calculate the cluster_labels. I then asked me to create the other functions such as the cluster_centers and outliers. Which lead me to:
QuoteYes, the function I provided is just to calculate the cluster_labels based on the DBSCAN algorithm. To find the cluster_centers and outliers, you can modify the code to calculate them after assigning the cluster labels. Here's how you can do it:
Cluster Centers: After assigning cluster labels, you can calculate the cluster centers by averaging the pixel values of all points in each cluster.
Outliers: Outliers are points that do not belong to any cluster or belong to very small clusters (noise points). You can identify outliers by checking for points with a cluster label of 0 or labels that are not assigned to a core point (i.e., points with fewer than minPts neighbors).
.DATA
...
cluster_centers DWORD 0, 0 ; Store the cluster centers (assuming 2 clusters)
outliers DWORD num_pixels DUP(0) ; Store the indices of outliers
num_outliers DWORD 0 ; Counter for the number of outliers
.CODE
...
update_cluster_centers PROC
; Calculate cluster centers
mov esi, OFFSET pixel_values
mov edi, OFFSET cluster_labels
xor ecx, ecx
xor edx, edx
calculate_centers_loop:
cmp DWORD PTR [edi + ecx * 4], cluster_count
jne not_in_cluster
add esi, DWORD PTR [esi + ecx * 4] ; Add pixel value to sum
inc edx ; Increment count of points in cluster
not_in_cluster:
inc ecx
cmp ecx, num_pixels
jl calculate_centers_loop
div edx ; Calculate average pixel value (cluster center)
mov DWORD PTR [cluster_centers + cluster_count * 4], eax
inc cluster_count
ret
update_cluster_centers ENDP
identify_outliers PROC
; Identify outliers
mov esi, OFFSET cluster_labels
xor ecx, ecx
find_outliers_loop:
cmp DWORD PTR [esi], 0 ; Check if point is not assigned to any cluster
jne not_outlier
mov DWORD PTR [outliers + num_outliers * 4], ecx ; Add index to outliers
inc num_outliers
not_outlier:
inc ecx
add esi, 4
cmp ecx, num_pixels
jl find_outliers_loop
ret
identify_outliers ENDP
main PROC
...
call update_cluster_centers
call identify_outliers
...
main ENDP
But, this is incorrect and crashed since it is not pointing to a given address in the distance function. I then asked it to explain in details how this thing works...but got more confused then ever. This is what chatgpt wrote:
QuoteLet's walk through the function step by step with the given pixel_values and assuming epsilon is 10 and minPts is 2 for this example:
Initialization:
pixel_values = {120, 130, 140, 150, 160}
cluster_labels = {0, 0, 0, 0, 0}
num_pixels = 5
epsilon = 10
minPts = 2
Outer Loop (for pixel 5, value 160):
Compare cluster label of pixel 5 (cluster_labels[4]) with 0. It is 0.
Set eax to 4 (index of pixel 5).
Set ebx to 4.
Calculate edx (distance threshold): 10 << 2 = 40.
Calculate address of pixel 5 (esi): pixel_values + 4 * 4 = pixel_values + 16.
Inner Loop (for pixel 4, value 150):
Compare cluster label of pixel 5 with 0. It is 0.
Calculate address of pixel 4 (esi): pixel_values + 3 * 4 = pixel_values + 12.
Calculate distance between pixel 5 and pixel 4. Assume distance is 10.
Compare distance (10) with epsilon (40). Distance is less than epsilon.
Increment neighbor count (ecx) to 1.
Continue inner loop for pixel 3.
Continue outer loop for pixel 4.
Continue the same process for pixels 3, 2, and 1.
Output Cluster Labels:
The cluster_labels array after processing may look like: {1, 1, 1, 1, 1} indicating all pixels belong to the same cluster.
Please note that the actual calculation of the distance and clustering logic is not shown here but would be necessary to implement the complete DBSCAN algorithm.
But, i´m failing badly to understand this whole thing.
Can someone help me trying to create such a thing ? Or fix the masm functions chatgpt did ?
The few times I had used chatgpt to write code, it used the irvine32 libraries often. I gave up on that soon after.
Seeing in your code snippet from chatgpt, I see 'call Crlf' -- is that from irvine32 library? There is no header there so I don't know for sure.
Don´t know if chatgpt uses irvine´s code or not. Chatgpt has several flaws in what concerns trying to write something in masm (or Nasm, FAsm, RosAsm), but it gives some idea how to start or, in some cases do 50% of the work. The problem is that on this example, i´m not being able to follow the logic that i is showing because it seems a bit different than the ones explained in wikipedia and the other links.
It seems a simple loop back on each value that neds to be compared to others, but i´m faoiling badly to understand the logic and the math behind this.
And porting the algo at https://github.com/gyaikhom/dbscan/blob/master/dbscan.c (https://github.com/gyaikhom/dbscan/blob/master/dbscan.c) to masm seems to me a bit complex for such a task that seems simple at the 1st .
Have you tried godbolt.org (https://godbolt.org), to help with the conversion/translation to assembly? (Assuming you have a usable source in another language)
Once converted, could more easily be optimized or change parts of it, to better suit your needs.
Would it be possible for you to provide a link that explains what DBScan is in plain English to someone who's unfamiliar with all that jargon? I really can't make head nor tail of it using those links you provided.
(Plus could you link-ize your links? there's a button for that right in the posting editor)
I suppose the Wikipedia entry (https://en.wikipedia.org/wiki/DBSCAN) (hate Wikipedia!) is as good as anything I've found so far:
QuoteIt is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away).
* I had to edit that entry, as it was badly written, as is so much stuff there.
Here's another article (https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80), on Medium by a Portuguese writer who writes better English than most native speakers, in pretty plain language.
Enough with the jargon! Tell me what it is, how it works and what it's used for.
Quote from: sudoku on May 12, 2024, 03:01:02 AMHave you tried godbolt.org (https://godbolt.org), to help with the conversion/translation to assembly? (Assuming you have a usable source in another language)
Once converted, could more easily be optimized or change parts of it, to better suit your needs.
Yeah, i´m doing it. But the C version uses 3d points and i want to use only 1d points. (Linear values of a sequence of images. So, taking pixel at pos x = 10, y = 10 on all images. All of the next images at this particular positions will contains specific values of their pixels (RGB). I need to compute the dbscan of each channel separated. So, dbscan of 1 pixel at a specific position in all subsequent images.
Quote from: NoCforMe on May 12, 2024, 03:03:29 AMWould it be possible for you to provide a link that explains what DBScan is in plain English to someone who's unfamiliar with all that jargon? I really can't make head nor tail of it using those links you provided.
(Plus could you link-ize your links? there's a button for that right in the posting editor)
https://en.wikipedia.org/wiki/DBSCAN (https://en.wikipedia.org/wiki/DBSCAN)
https://www.youtube.com/watch?v=4AW_5nYQkuc (https://www.youtube.com/watch?v=4AW_5nYQkuc)
https://www.youtube.com/watch?v=2Zsz-0K-Ax4 (https://www.youtube.com/watch?v=2Zsz-0K-Ax4)
Quote from: NoCforMe on May 12, 2024, 03:11:35 AMI suppose the Wikipedia entry (https://en.wikipedia.org/wiki/DBSCAN) (hate Wikipedia!) is as good as anything I've found so far:
QuoteIt is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away).
* I had to edit that entry, as it was badly written, as is so much stuff there.
Here's another article (https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80), on Medium by a Portuguese writer who writes better English than most native speakers, in pretty plain language.
Enough with the jargon! Tell me what it is, how it works and what it's used for.
I´m not a native english speaker. I´ll take a read on that article.
https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80 (https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80)
If you speak portuguese, i can talk to you in my native language, but it will be confusing and hard for others understand.
Btw....I´m using it to try a better way to identify a watermark on a set of images. I was giving a try using Median or MAD algorithms but it seems DbScan may work better.
Hi Guga!
Quote from: guga on May 11, 2024, 11:39:49 PMSomeone succeeded to create a DBScan algorithm on a given data set
An implementation is present in Comparing K-Means and Others Algorithms for Data Clustering in Assembly (https://masm32.com/board/index.php?topic=11649.0./).
Regards, HSE.
Tks Hse. Can you fix the link / It´s not pointing to any address. I managed to succeed to convert the C file, but would be nice take a look on the link you provided.
Btw....about finding the Homogeneity i made a small function that maybe used for this task. (Unless it was already written on your link on a more accurate way.
;;
Homogeneity metric of a cluster labeling given a ground truth.
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won't change the score value in any way.
This metric is not symmetric: switching label_true with label_pred will return the completeness_score which will be different in general.
Mathematically, homogeneity has the connotation of invariance, as all components of the equation have the same degree of value whether or not each
of these components are scaled to different values, for example, by multiplication or addition. Cumulative distribution fits this description.
"The state of having identical cumulative distribution function or values".
The function uses a logarithm sum of good clusters found on a chain of data. This forces the result to be between 0 and 1 and enforce for homogeneity.
In general,the higher the value (closer to 1) more homogenous is the data.
Parameters:
points(in) - A poiter to a Data formed by a array of point_t structures
MaxClasses(in) - The maximum amout of usable classes (Cluster_Id) available. This parameter takes onto account the Ids bigger then 0.
num_points(in) - The total amount of elements on the data chain.
pHomogeneity(out) - Pointer to a variable that will hold the value of the homogeneity. The size of the variable must be a Real8 (8 bytes).
Return Value: The function does not returns any value. All values are returned in pHomogeneity parameter.
The range of the result is between 0 and 1
Remarks:
The mathematical equation for this is as follow:
homogeneity = 1 + ((log(cluster_id(0)/TotalCluster_id0)+
(log(cluster_id(1)/TotalCluster_id1)+
(log(cluster_id(2)/TotalCluster_id2)+
(log(cluster_id(3)/TotalCluster_id3)+
....
(log(cluster_id(N)/TotalCluster_idN)+
) / NumPoints
But, we exclude Noise (negative Ids) and also 0 Ids. Excluding 0 id is neded because it can also be interpreted as Noise and also there´s no log (0).
A 0 id means the data is closer to a cluster but, it can still be considered as noise. A negativa value (-2 is pure noise) and 0 can represent a 'Border' Noise.
So, in fact, our equaion results only in:
homogeneity = 1 + ((log(cluster_id(1)/TotalCluster_id1)+
(log(cluster_id(2)/TotalCluster_id2)+
(log(cluster_id(3)/TotalCluster_id3)+
....
(log(cluster_id(N)/TotalCluster_idN)+
) / NumPoints
In other wods, it takes a somatory of the log of ClusterId divided tbe the total amount of elements related to that Id After summing them, it divide by the
total amount of elements of the data chunk (num_points) and finally add 1 onto this result. This will grant the resultant value will always be between
0 and 1.
Example, say you have a data chunk formed by a sequence of data (53 elements = points of x, y, z) where you already calculated theis Cluster_ids
associated with each point of the data chain. Like this:
x y z cluster_id
-----------------------------
1.00 3.00 1.00: 0
1.00 4.00 1.00: 0
1.00 5.00 1.00: 0
1.00 6.00 1.00: 0
2.00 2.00 1.00: 2
2.00 3.00 0.00: 1
2.00 4.00 0.00: 1
2.00 5.00 0.00: 1
2.00 6.00 0.00: 1
2.00 7.00 1.00: 3
3.00 1.00 1.00: 2
3.00 2.00 1.00: 2
3.00 3.00 1.00: 2
3.00 4.00 0.00: 1
3.00 5.00 0.00: 1
3.00 6.00 1.00: 3
3.00 7.00 1.00: 3
4.00 1.00 1.00: 2
4.00 2.00 1.00: 2
4.00 3.00 0.00: 1
4.00 4.00 0.00: 1
4.00 5.00 1.00: -2
4.00 6.00 0.00: 1
4.00 7.00 1.00: 3
4.00 8.00 1.00: 3
5.00 1.00 1.00: 2
5.00 2.00 0.00: 1
5.00 3.00 0.00: 1
5.00 4.00 0.00: 1
5.00 5.00 0.00: 1
5.00 6.00 0.00: 1
5.00 7.00 1.00: 3
5.00 8.00 1.00: 3
6.00 1.00 1.00: 2
6.00 2.00 0.00: 1
6.00 3.00 1.00: 3
6.00 4.00 1.00: 3
6.00 5.00 1.00: 3
6.00 6.00 1.00: 3
6.00 7.00 1.00: 3
7.00 1.00 1.00: 2
7.00 2.00 0.00: 1
7.00 3.00 0.00: 1
7.00 4.00 0.00: 1
7.00 5.00 1.00: 3
8.00 1.00 1.00: 2
8.00 2.00 1.00: 2
8.00 3.00 0.00: 1
8.00 4.00 1.00: 3
8.00 5.00 1.00: 3
8.00 6.00 1.00: 3
9.00 2.00 1.00: 2
9.00 3.00 1.00: 2
From the above data chain (Points) we have the following clusters (groups) from -2 to 3. We need to count how many elements we have on each Cluster_ids
ClusterId = -2 = 1 elements (Pure Noise)
ClusterId = 0 = 4 elements (Border Noise)
ClusterId = 1 = 19 elements
ClusterId = 2 = 13 elements
ClusterId = 3 = 16 elements
Total elements = 53 (including Noise and border noise)
Negative Ids, such as -2 and 0, represents Noise, therefore it is not calculated on the equation of homogeneity.
So, it will result in:
homogeneity = 1 + (log(1/19)+log(2/13)+log(3/16))/53 = 0.8775430643464206899449251...
;;
[TmpLogValue: R$ 0]
Proc calculate_homogeneity_Log:
Arguments @points, @MaxClasses, @num_points, @pHomogeneity
Local @ClusterCount
Uses ecx, esi, edi
xorpd xmm0 xmm0
...If D@MaxClasses > 1 ; Skip class = 0 (We added earlier to 1)
mov esi D@points
mov edi 1 ; start with Cluster_Id = 1
.Do
mov D@ClusterCount 0
mov ecx D@num_points
Do
If D$esi+cluster_idDis = edi
inc D@ClusterCount
End_If
add esi Size_Of_point_t
dec ecx
Loop_Until ecx = 0
mov esi D@points
cvtsi2sd xmm1 edi ; converts a signed integer to double
cvtsi2sd xmm2 D@ClusterCount | divsd xmm1 xmm2
movsd X$TmpLogValue xmm1 | movsd xmm2 xmm0
call Sse2_log TmpLogValue, SSE_EXP_REAL8
addsd xmm0 xmm2
inc edi
.Loop_Until edi >= D@MaxClasses
cvtsi2sd xmm1 D@num_points | divsd xmm0 xmm1
...End_If
mov eax 1 | cvtsi2sd xmm1 eax ; converts a signed integer to double
addsd xmm0 xmm1
mov eax D@pHomogeneity
movsd X$eax xmm0
EndP
Btw...does it woks for dbscan too ?
Oh, got it..here, right ?
https://masm32.com/board/index.php?topic=11649.0 (https://masm32.com/board/index.php?topic=11649.0)
https://github.com/ASMHSE/Clusters-in-Assembly/tree/main (https://github.com/ASMHSE/Clusters-in-Assembly/tree/main)
It do contains dbscan. Very good work. I´ll take a further look later and try to understand
A very interesting topic.
the parameters ("epsilon and MinPts") of DBSCAN can have a significant impact on the clustering results. Careful parameter tuning is often required for optimal performance.
Hi,guga
Do you use the DBSCAN to analyse some image?
Quote from: six_L on May 16, 2024, 07:14:27 PMA very interesting topic.
the parameters ("epsilon and MinPts") of DBSCAN can have a significant impact on the clustering results. Careful parameter tuning is often required for optimal performance.
Hi,guga
Do you use the DBSCAN to analyse some image?
Hi six_L yes, i´m trying to port the JavaScript example on that link on github to assembly to handle images. The javascript example segments the image using DbScan algorithm. I guess i succeeded to port it, but it is incredibly slow. Maybe using a table of precalculated the euclidian distances can help, but i´m not being able to precalculate it correctly. I´m failing understand the math behinds it.
(https://i.postimg.cc/K1hNSG6q/screenshot.png) (https://postimg.cc/K1hNSG6q)
Btw..i attached the files i succeeded to port (The simpler one). Source code is embedded in exe (for RosAsm) and also included the asm file and the necessary example.dat it uses to do the calculations. Also attached the source code splitted in their own titles to make easier to follow. The part of the code itself related to dbscan is at part01.asm. Top.asm ans SSE_Macros.asm are only the macros i used for that file. The other asm files are simply additional functions for the console and Fast Logarithm functions
The output is this (Where -2 = Pure noise and 0 means noise but more close to the clusters, i suppose:
QuoteEpsilon: 1.000000
Minimum points: 2
Homogeneity: 0.000000
Homogeneity(log): 0.877543
Number of points: 53
x y z cluster_id
-----------------------------
1.00 3.00 1.00: 0
1.00 4.00 1.00: 0
1.00 5.00 1.00: 0
1.00 6.00 1.00: 0
2.00 2.00 1.00: 2
2.00 3.00 0.00: 1
2.00 4.00 0.00: 1
2.00 5.00 0.00: 1
2.00 6.00 0.00: 1
2.00 7.00 1.00: 3
3.00 1.00 1.00: 2
3.00 2.00 1.00: 2
3.00 3.00 1.00: 2
3.00 4.00 0.00: 1
3.00 5.00 0.00: 1
3.00 6.00 1.00: 3
3.00 7.00 1.00: 3
4.00 1.00 1.00: 2
4.00 2.00 1.00: 2
4.00 3.00 0.00: 1
4.00 4.00 0.00: 1
4.00 5.00 1.00: -2
4.00 6.00 0.00: 1
4.00 7.00 1.00: 3
4.00 8.00 1.00: 3
5.00 1.00 1.00: 2
5.00 2.00 0.00: 1
5.00 3.00 0.00: 1
5.00 4.00 0.00: 1
5.00 5.00 0.00: 1
5.00 6.00 0.00: 1
5.00 7.00 1.00: 3
5.00 8.00 1.00: 3
6.00 1.00 1.00: 2
6.00 2.00 0.00: 1
6.00 3.00 1.00: 3
6.00 4.00 1.00: 3
6.00 5.00 1.00: 3
6.00 6.00 1.00: 3
6.00 7.00 1.00: 3
7.00 1.00 1.00: 2
7.00 2.00 0.00: 1
7.00 3.00 0.00: 1
7.00 4.00 0.00: 1
7.00 5.00 1.00: 3
8.00 1.00 1.00: 2
8.00 2.00 1.00: 2
8.00 3.00 0.00: 1
8.00 4.00 1.00: 3
8.00 5.00 1.00: 3
8.00 6.00 1.00: 3
9.00 2.00 1.00: 2
9.00 3.00 1.00: 2
Press enter to exit...
About epsilon variable, it seems that there is a way to pre-calculate it so it can maximize the results of homogeneity, but i´m still stuck at the precalculatd euclidian distance table
New version in C (to be compiled with Pelles) with a bit of optimization and precalculating the euclidean distances.
I´ll try optimize it even further before porting it to Assembly.
Succeeded to automate the process of identifying Epsilon and Minimum Points. For the articles i´m reading, the smaller epsilon, the better.
https://hyperskill.org/learn/step/31529
https://mrinalyadav7.medium.com/dbscan-algorithm-c894701306d5
On the attached version, it generated only 2 epsilon values.
Main:
....
call GetMinPoints DBSCAN_DIM_1D
mov D$MinPoints eax
call GetEpsilon D$points, D$Num_Points, Epsilon, DBSCAN_DIM_1D
call dbscan D$points, D$Num_Points, Epsilon, D$MinPoints, Homogeneity, HomogeneityLog
....
;;
Automatically find epsilon
http://www.sefidian.com/2022/12/18/how-to-determine-epsilon-and-minpts-parameters-of-dbscan-clustering/
https://mrinalyadav7.medium.com/dbscan-algorithm-c894701306d5
https://hyperskill.org/learn/step/31529
;;
[DBSCAN_DIM_1D 1]
[DBSCAN_DIM_2D 2]
[DBSCAN_DIM_3D 3]
Proc GetMinPoints:
Arguments @DimensionFlags
mov eax D@DimensionFlags
shl eax 1
EndP
Proc GetEpsilon:
Arguments @points, @num_points, @pOutput, @Flag
Local @k_distances, @kFactor
Uses ebx, esi, edi, ecx, edx
xor eax eax
On D@num_points = 0, ExitP
C_call 'msvcrt.calloc' D@num_points, 8
mov D@k_distances eax
mov eax D@Flag
.If_Or eax = DBSCAN_DIM_1D, eax = 0
mov eax 2 ; dimensions*2) = 1*2
.Else_If eax = DBSCAN_DIM_2D
mov eax 4 ; dimensions*2) = 2*2
.Else_If eax = DBSCAN_DIM_3D
mov eax 6 ; dimensions*2) = 3*2
.Else_If_And eax > 3, eax < 7
mov eax 7
.Else_If eax > 10
inc eax
.End_If
mov D@kFactor eax
; Calculate k-distances for each point
call calculate_k_distances D@points, D@num_points, D@kFactor, D@k_distances
call find_knee_point D@k_distances, D@num_points
mov eax D@pOutput | fstp R$eax
C_call 'msvcrt.free' D@k_distances
EndP
; Function to calculate k-distance for each point
Proc calculate_k_distances:
Arguments @points, @num_points, @kFactor, @k_distances
local @Distances, @PointsIPos
Uses ebx, esi, edi, ecx, edx
xor eax eax
On D@num_points = 0, ExitP
C_call 'msvcrt.calloc' D@num_points, 8
mov D@Distances eax
mov ebx eax
mov esi D@points
mov D@PointsIPos esi
xor edi edi
.Do
mov esi D@points
xor ecx ecx
Do
call euclidean_dist D@PointsIPos, esi
fstp R$ebx+ecx*8
add esi Size_Of_point_t
inc ecx
Loop_Until ecx >= D@num_points ; jl
C_call 'msvcrt.qsort' ebx, D@num_points, 8, compare_doubles
; k-th nearest neighbor distance
mov eax D@kFactor
mov edx D@k_distances
fld R$ebx+eax*8 | fstp R$edx+edi*8
add D@PointsIPos Size_Of_point_t
inc edi
.Loop_Until edi >= D@num_points
C_call 'msvcrt.free' D@Distances
EndP
Proc compare_doubles:
Arguments @Arg1, @Arg2
mov eax D@Arg1 | movsd XMM0 X$eax
mov eax D@Arg2 | movsd XMM1 X$eax
xor eax eax
SSE_D_If xmm1 > xmm0
mov eax 0-1
SSE_D_Else_If xmm1 < xmm0
mov eax 1
SSE_D_End_If
EndSTD
[KneeRate: R$ 0.1]
Proc find_knee_point:
Arguments @k_Distance, @Num_Points
Local @KneeIndex
Uses ebx, esi, ecx, edx
mov ebx D@k_Distance
mov esi D@Num_Points
C_call 'msvcrt.qsort' ebx, esi, 8, compare_doubles
fild D@Num_Points | fmul R$KneeRate | fistp D@KneeIndex
;fld R$ebx;+eax*8
;mov eax 52 | fld R$ebx+eax*8
mov eax D@KneeIndex | fld R$ebx+eax*8
EndP
On the function find_knee_point it creates a table of distances values (possible values of epsilon) forming a array of data (Size is NumPoints *8 = 8 because we are dealing with a Real8). The generated contents are displaced in ascending order On the example we have 53 values, on 2 unique epsilon values from the smaller to the bigger.
The function chooses one located at the 10% of the initial data in the array. Later i´ll make a routine to get rid of the duplicated data, and insert the different values on another table to make easier the user choose what type of data we wants do identify.
according to the 2nd article, Epsilon can be better described as:
Quoteε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to the k = minPts-1 nearest neighbor ordered from the largest to the smallest value. Good values of ε are where this plot shows an "elbow": if ε is chosen much too small, a large part of the data will not be clustered; whereas for a too high value of ε, clusters will merge and the majority of objects will be in the same cluster. In general, small values of ε are preferable, and as a rule of thumb, only a small fraction of points should be within this distance of each other.
Btw,, so far, chatgpt didn´t succeeded to help me making the algo way more fast. It is incredibly slow when working on real images. The alternative i proposed was create a table in memory to directly point to the addresses in get_neighbor routines, but it turns out that i´ll need a insane huge table for that (NumPixels*8*NumPixels). I´m trying to overcome this, and use only 1 table (size = Numpixes*8) and then, make some lighter arithmetic function to calculate the values of the next indexed tables. Thinking how to do it yet....
About epsilon and minimum points, chatgpt answered me like this (That proven to be correct at the end)
QuoteThe choice of k is not strictly fixed and can vary based on the characteristics of the data and the specific requirements of the clustering task. Here's a more detailed explanation of how to determine an appropriate k value for different dimensions:
1D Data
For 1D data, clusters are linear, and local density is easier to determine. Therefore, a small k value is usually sufficient.
k = 2: Using the nearest neighbor (1st neighbor) can be enough to define local density in 1D data.
2D Data
For 2D data, clusters are more complex, and you generally need a slightly larger k to capture the local density effectively.
k = 4: This is a commonly used heuristic. A point and its three nearest neighbors can adequately define a local dense region in 2D space.
3D Data
For 3D data, the complexity increases due to the additional dimension, requiring a larger k value to capture the local density accurately.
k = 6: This value is chosen to include a point and its five nearest neighbors, which helps define the local density in 3D space.
k = 7-10: Depending on the noise level and the expected cluster structure, you might need a larger k to ensure robustness.
Choosing k Based on Dimensions
The heuristic k = dimensions + 1 is a starting point, but it may need to be adjusted based on empirical observations and the specific dataset. The values mentioned (e.g., k = 6 for 3D) come from practical experience and empirical studies in clustering literature, where a slightly larger k helps to better capture the density in higher dimensions.
Example Adjustments in Code
Here is how you can set k based on the dimensionality of your data and adjust it if necessary:
His piece of code was originally this:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
// Structure to represent a point in 3D (adjust for higher dimensions if needed)
typedef struct {
double x, y, z; // Add more dimensions as necessary
} point_t;
// Function to calculate Euclidean distance between two points
double euclidean_distance(point_t *a, point_t *b) {
return sqrt(pow(a->x - b->x, 2) + pow(a->y - b->y, 2) + pow(a->z - b->z, 2)); // Adjust for dimensions
}
// Function to calculate k-distance for each point
void calculate_k_distances(point_t *points, int num_points, int k, double *k_distances) {
for (int i = 0; i < num_points; ++i) {
double distances[num_points];
for (int j = 0; j < num_points; ++j) {
distances[j] = euclidean_distance(&points[i], &points[j]);
}
// Sort distances to find the k-th nearest neighbor
qsort(distances, num_points, sizeof(double), compare_doubles);
k_distances[i] = distances[k]; // k-th nearest neighbor distance
}
}
// Comparison function for qsort
int compare_doubles(const void *a, const void *b) {
double arg1 = *(const double*)a;
double arg2 = *(const double*)b;
if (arg1 < arg2) return -1;
if (arg1 > arg2) return 1;
return 0;
}
// Function to find the "knee" point in k-distances plot
double find_knee_point(double *k_distances, int num_points) {
// Sort the k-distances to create the k-distances plot
qsort(k_distances, num_points, sizeof(double), compare_doubles);
// Simple heuristic: return the distance at the 90th percentile
return k_distances[(int)(0.9 * num_points)];
}
int main() {
// Assuming points is already populated
point_t *points = ...; // Replace with actual points initialization
int num_points = ...; // Replace with actual number of points
int dimensions = 3; // Adjust based on the dimensionality of your data
int k = dimensions + 3; // Adjusted heuristic for k (example: dimensions + 3)
double k_distances[num_points];
// Calculate k-distances for each point
calculate_k_distances(points, num_points, k, k_distances);
// Find a suitable epsilon using the knee point of the k-distances plot
double epsilon = find_knee_point(k_distances, num_points);
// Determine minPts, often set as 2 * dimensions
int minPts = 2 * dimensions;
// Now run DBSCAN with the calculated epsilon and minPts
int num_clusters = dbscan_run(points, epsilon, minPts, num_points);
// Continue with your process...
return 0;
}
And then it answered me:
QuoteSummary
1D Data: k = 2 is typically sufficient.
2D Data: k = 4 is a commonly used heuristic.
3D Data: k = 6 or k = 7-10 based on noise and cluster structure.
Higher Dimensions: k = dimensions + 1 or slightly higher based on empirical observation.
These values are starting points, and you may need to adjust k based on your specific dataset and the results you observe.
The resultant values were the same i´ve got earlier, so, it seems the routines i created to automate epsilon and min points were correct
Epsilon: 1.000000
Minimum points: 2
Homogeneity: 0.000000
Homogeneity(log): 0.877543
Number of points: 53
x y z cluster_id
-----------------------------
1.00 3.00 1.00: 0
1.00 4.00 1.00: 0
1.00 5.00 1.00: 0
1.00 6.00 1.00: 0
2.00 2.00 1.00: 2
2.00 3.00 0.00: 1
2.00 4.00 0.00: 1
2.00 5.00 0.00: 1
2.00 6.00 0.00: 1
2.00 7.00 1.00: 3
3.00 1.00 1.00: 2
3.00 2.00 1.00: 2
3.00 3.00 1.00: 2
3.00 4.00 0.00: 1
3.00 5.00 0.00: 1
3.00 6.00 1.00: 3
3.00 7.00 1.00: 3
4.00 1.00 1.00: 2
4.00 2.00 1.00: 2
4.00 3.00 0.00: 1
4.00 4.00 0.00: 1
4.00 5.00 1.00: -2
4.00 6.00 0.00: 1
4.00 7.00 1.00: 3
4.00 8.00 1.00: 3
5.00 1.00 1.00: 2
5.00 2.00 0.00: 1
5.00 3.00 0.00: 1
5.00 4.00 0.00: 1
5.00 5.00 0.00: 1
5.00 6.00 0.00: 1
5.00 7.00 1.00: 3
5.00 8.00 1.00: 3
6.00 1.00 1.00: 2
6.00 2.00 0.00: 1
6.00 3.00 1.00: 3
6.00 4.00 1.00: 3
6.00 5.00 1.00: 3
6.00 6.00 1.00: 3
6.00 7.00 1.00: 3
7.00 1.00 1.00: 2
7.00 2.00 0.00: 1
7.00 3.00 0.00: 1
7.00 4.00 0.00: 1
7.00 5.00 1.00: 3
8.00 1.00 1.00: 2
8.00 2.00 1.00: 2
8.00 3.00 0.00: 1
8.00 4.00 1.00: 3
8.00 5.00 1.00: 3
8.00 6.00 1.00: 3
9.00 2.00 1.00: 2
9.00 3.00 1.00: 2
Press enter to exit...
And below are some links i found with examples on how to create a faster library for it:
https://github.com/Eleobert/dbscan
https://proceedings.mlr.press/v157/weng21a.html
I didn´t tested this new routines yet, neither ported it co Plain C or assembly yet. Thinking....
Hi,guga
First of all, congratulations on the progress you have made in your research.
Next,thanks you for providing a lot of useful information about DBSCAN. As is known to all, if use the Bing searching, i'll get very little useful information, but waste a lot of time.
Thank you very much!
;-------------------------------------
1) About image analysing, I remember a thing.
long time ago, There was a criminal who had specialized in robbing money and killing people in front of some banks. More than ten people have been killed. the police ordered the arrest of the criminal at making public, had released front and side photos of the criminal(he has been gone to jail before).When the criminal had committed an offence again, the police and military dispatched more than 5000 armed personnel to surround and annihilate the criminal.the criminal had died. the on-site photos were very bloody. Only the one ear could be recognized. Due to the difference in the two ear in the published photos, many people doubted: the person who was killed was not the criminal. then the police wrote a article to prove: the person who was killed was the criminal. I ultimately believed the police. I also understood a bit of the principle of projection.
2) the Technology of Image Generation
light; radar; infrared ray; ultraviolet rays; radio telescope; ultrasonic wave; nuclear magnetic resonance. These images have different characteristics.
3)about distance
there are the euclidean distance,the riemannian distance, the cosine distance,the manhattan distance.
4)trainning the DBSCAN model.
We need typical image datas to train the DBSCAN model, get the optimum epsilon and minPts parameters.
5)the DBSCAN model verificating datas.
i found the most DBSCAN used the mathematical datas,not the real image datas. Only the embryonic DBSCAN used the real image datas.
regards.
Hi Six_L
tks you :)
About the usage in real images, it is possible to do. Most likely the same way on the embryo image used in javascript. The main problem seems to be the insane amount of memory it needs to be allocated in order to make it work faster. I´m thinking how to overcome this.
If i succeed to make a routine to make it faster on a 1-D data, the same can also be used in any other dimensions i suppose.
Btw...I also found a way to identify the maximum values of a cluster. The function is like this:
;;
Max Points = ((k+1)/2)^D
;;
[DBSCAN_DIM_1D 1] ; 1 Dimension
[DBSCAN_DIM_2D 2] ; 2 Dimensions
[DBSCAN_DIM_3D 3] ; 3 Dimensions
Proc GetMaxPoints:
Arguments @DimensionFlags
Local @kFactorPlus1
mov eax D@DimensionFlags
.If_Or eax = DBSCAN_DIM_1D, eax = 0
mov eax 2 ; dimensions*2) = 1*2
.Else_If eax = DBSCAN_DIM_2D
mov eax 4 ; dimensions*2) = 2*2
.Else_If eax = DBSCAN_DIM_3D
mov eax 6 ; dimensions*2) = 3*2
.Else_If_And eax > 3, eax < 7
mov eax 7
.Else_If eax > 10
inc eax
.End_If
inc eax
mov D@kFactorPlus1 eax | imul eax eax | imul eax D@kFactorPlus1
shr eax 3 ; divide by 8
EndP
This time, chatgpt was usefull in helping me finding the maximum values. It answered me that:
QuoteTo calculate the maximum number of points in a dataset for a given value of kk and a specific dimensionality D, you can use the following formula:
Max points= ((k+1)/2)^D
Where:
x denotes the floor function, which returns the largest integer less than or equal to x.
k is the number of nearest neighbors used in the density estimation.
D is the dimensionality of the data.
For example, let's say you want to calculate the maximum number of points for k=6 in 3D data:
Max points= (((6+1)/2)^3)
Max points= ((7/2)^3)
Max points= (3.5^3)
Max points= 42.875
Max points= 42
So, for k=6 in 3D data, the maximum number of points would be 42.
So, each cluster now is properly identifiable by a minimum and maximum number of points it can have.
Maybe Knowing the maximum amount of points, can give me more clues on how to optimize this whole thing.
I´m currently reviewing he code for min and maxpts and also trying to implement a routine to make this thing faster
Hi,guga
i want to select a part in image for dbscaning, this is creating some tested datasets automatically. but the SCROLL of static can't work. now i am progressing slowly.
regards
That´s pretty good :greenclp: :greenclp: :greenclp:
I´m doing some more tests to see if i can optimize it even further. Dbscan seems a impressive algo for images. For noise removal, using dbscan maybe can do a wonderful thing in removing all that is noise. Perhaps we simply have to run the algo a couple of times to remove the noise and thus enhance the image completely even without the needs to use large models etc.
I was thinking that for noise removal, all we need to do relies in a few steps working on separated channels or the full image RGB contents. Personally, i would try with separated channels 1st.
The strategy maybe resume to this:
1 - Run Dbscan once to identify noises all over the image. It will result in well formed clusters and the corresponding noises that may exists near them.
2 - Create a routine to identify where the noises are and then remove them using the average of the values of the clusters that surrounds that noise or are closer to them. For example, let´s say that we run all the algo in separated channels.
So we identify the noises only on Red Channel, then we check if the noise is close to one or more clusters. We then calculate the average (Or standard deviation - that maybe better) of the colors of the cluster (or clusters) nearby the noise and apply the resultant value on that specific noise pixel. Then we go through all the image and do it all over again, until we eliminate the majority of the noises
We do that step only when we have 1 noise pixel surrounded or closer of a good cluster. If we have 2 or more pixels noises close to each other (even if they are closer or surrounded by clusters), we skip this routine, because we are mainly focusing in isolated noises pixels.
3 - Now, our image/channel contains either no noise (if all noises were isolated ones) or may still contains small chunks of noises near each other and surrounded by good clusters. Since we worked on isolated noises previously and settled their new values biased on the average (or standard deviation) of the clusters nearby, we then run DbScan again.
Why ? Because since we applied new values for what was noise, then we may have formed new set of clusters that uses those new pixels, thus enhancing the image a bit more.
4 - If the results on step 3 still contains noises (or a new set of noises) we do steps 1-3 all over again.
This is done until we can reach no noises on the resultant image
Also, we can do a routine to go out in case of infinite loops that maybe generated in the steps 1-4 results always in small chunks of noises closer to each other. On that case, we can go out the loop, and do a couple of things:
a) End the routine, since the majority of noises are now over.
b) create another routine to deal with those small chunks of noises closer to good clusters. And then create a new strategy on how to handle those. Perhaps, starts checking for small groups of odd and even noises. For example
1 - if we have 2 noises surrounded by 3 good clusters. We apply the average/standard deviation biased on the noise that are closer to a given cluster. (See image attached).
Noise1 is closer to Cluster A than from B or C. So, the Average (or Standard deviation) Color of Cluster A influences more on the result then Cluster B or C. So we can apply a weighted of the average colors of them, perhaps biased on the distance from where they are from each other.
Noise 2 is closer to Cluster B than from Cluster A or C, then we do it the same as above
c) once we do this, we run dbscan all over again....and keep doing this untill all noises are removed
New version
A bit faster, but i´m not quite sure about the results. ´ll make a double check later.
Hi,guga
The translation of DBSCAN.C completed. Compilation has been passed, Testing is still in progress.
DBSCAN.C has some mazy data structures. Only "euclidean_dist" and "fcpu_cmp_val" works normally.
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
;point_s STRUCT ;this is the struct from dbscan.cpp, we need to analyse, modify
; x REAL8 ?
; y REAL8 ?
; z REAL8 ?
; cluster_id DWORD ?
;point_s ENDS
;point_t typedef ptr point_s
point_s STRUCT ;this is the struct from dbscan.cpp, we need to analyse image, modify
x QWORD ? ; x coordinate
y QWORD ? ; y coordinate
z QWORD ? ; Pixel
cluster_id QWORD ?
point_s ENDS
point_t typedef ptr point_s
dist_t typedef ptr dist
dist STRUCT
a point_s <>
b point_s <>
dist ENDS
node_t typedef ptr node_s
node_s STRUCT
index QWORD ?
next point_t ?
node_s ENDS
epsilon_neighbours_t typedef ptr epsilon_neighbours_s
epsilon_neighbours_s STRUCT
num_members QWORD ?
head node_t ?
tail node_t ?
epsilon_neighbours_s ENDS
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
create_node PROTO index:QWORD
append_at_end PROTO index:QWORD,en:epsilon_neighbours_t
get_epsilon_neighbours PROTO index:QWORD,points:point_t,num_points:QWORD,epsilon:REAL8,@dist:dist_t
print_epsilon_neighbours PROTO points:point_t,en:epsilon_neighbours_t
destroy_epsilon_neighbours PROTO en:epsilon_neighbours_t
dbscan PROTO points:point_t,num_points:QWORD,epsilon:REAL8,minpts:QWORD,@dist:dist_t
expand PROTO index:QWORD,cluster_id:QWORD,points:point_t,num_points:QWORD,epsilon:REAL8,minpts:QWORD,@dist:dist_t
spread PROTO index:QWORD,seeds:epsilon_neighbours_t,cluster_id:QWORD,points:point_t,num_points:QWORD,epsilon:REAL8,minpts:QWORD,@dist:dist_t
euclidean_dist PROTO a:point_t,b:point_t
adjacent_intensity_dist PROTO a:point_t,b:point_t
parse_input PROTO hWndSrc:QWORD
print_points PROTO points:point_t,num_points:QWORD
fcpu_cmp_val proc Val_1:REAL8, Val_2:REAL8
finit
fld Val_1
;fild val_2 ;if val_2 is int.
fld Val_2
fcompp
fstsw ax ;retrieve exception flags from FPU
fwait
shr al,1 ;test for invalid operation
jc @err ;clean-up and return result
sahf ;transfer to the CPU flags
jpe @err ;error if non comparable
ja @Greater
jc @F
mov rax,3 ;val_1 = val_2
jmp @Exit
@@:
mov rax,1 ;val_1 < val_2
jmp @Exit
@Greater:
mov rax,2 ;val_1 > val_2
@Exit:
ret
@err:
xor rax,rax ;rax = 0 -> Error
ret
fcpu_cmp_val endp
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
euclidean_dist proc uses rsi rdi a:point_t,b:point_t
Local x1:REAL10
Local x2:REAL10
finit
mov rsi,a
mov rdi,b
;sqrt[(x2-x1)^2+(y2-y1)^2+(z2-z1)^2]
fild [rsi].point_s.x ;x1->st0
fild [rdi].point_s.x ;x2->st1
fsub ;st0-st1
fst st(1) ;st0->st1
fmul ;st0*st1
fstp x1
fild [rsi].point_s.y ;x1->st0
fild [rdi].point_s.y ;x2->st1
fsub ;st0-st1
fst st(1) ;st0->st1
fmul ;st0*st1
fstp x2
fild [rsi].point_s.z ;x1->st0
fild [rdi].point_s.z ;x2->st1
fsub ;st0-st1
fst st(1) ;st0->st1
fmul ;st0*st1
fld x2
fadd
fld x1
fadd
fsqrt
;result in st0,
;Local ResultR8:real8,Local ResultR10:real10
;invoke euclidean_dist, addr aa,addr bb
;fstp ResultR8 ;fstp ResultR10
;fstp rVal
ret
euclidean_dist endp
regards
Hi Six_L
Nice work. Indeed, this algorithm has some strange structures. I succeeded to port to RosAsm, but still it requires a insane amount of memory to be allocated specially when our final goal is processing images.
But, in the meanwhile, i found an very very interesting approach that claims to be a improvement for dbscan. Take a look. I suceeded to compile in Pelles and the results are great.
It do has some flaws, like can´t handle some types of images, or it shows a weird diagonal line in one of my tests, but it seems promising. I`m trying to port it right now. It uses basically 2 main functions, and it is really fast
https://github.com/DIP-BurdwanUniversity/DBSCAN-ImageClustering/tree/master/input
https://github.com/DIP-BurdwanUniversity/DBSCAN-ImageClustering
https://biswajit.netlify.app/pages/dip-burdwanuniversity/
https://github.com/DIP-BurdwanUniversity
Ok, gentlemen. 1st version Pre-alpha released :biggrin: :biggrin: :biggrin:
File attached. It may crash sometimes, i suppose. I´ll make further tests later and some small bug fixes. So far, was mainly to test for speed of this algorithm from the BurdwanUniversity
It´s really good, in fact. At the end, their algorithm is basically creating a density map rather than the "normal" clusters and noises in a conventional dbscan algorithm. But this can also be implemented.
The good news is that it does not take tons of gigabytes and is really, really fast.
Note: Source code embedded in main app (as usual, in RosAsm files). Later i´ll clean up the code (it´s a huge, huge mess and post it here on the next release)
(https://i.ibb.co/dm1NF2w/testiing.png) (https://ibb.co/dm1NF2w)
Hi,guga
very nice!
Hi Six_L tks
It was hard to understand and port this thing, but i suceeded to do it.I´ll try later to create a sort of tutorial on how this thing works (at least the part showed by the article at BurdwanUniversity)
Basically, it takes only the maximum difference between 8 neighbors pixels at every pixel we are working with. 1St it converts the image to grayscale and uses only the unique generated color corresponding to grey, then it creates a map of the grey pixels (Size = width*height*4 = 4 because we are dealing with a dword (In fact, i converted to Real4 to gain extra speed in SSE2) to make the computation easier). For example, say that on a certain region of the image, the pixels are displaced as (already converted to gray):
99 140 185
102 63 162
58 113 138
And we are at the center of the area. So, we are at pixel whose value = 63
The original article (and all other places i read so far), tell us to get the difference between each pixel and the one we are in (99-63), (140-63)....(138-63). Then some articles says that we have to get the minimum of maximum distance etc etc while the article at Burdwan University says to we get only the maximum difference between all of them and use the resultant value to create a Density map.
But all of this consumes a lot of processing time. So, all i had to do is go straightforward finding the maximum value of all of the 8 neighbors pixels and simply subtracted that value from the one at the center.
So, instead calculating all the 8 deltas and only later finding the bigger of them etc etc . All is needed, in fact, is the maximum of those 8 and subtract directly. Then we divide the value by 8 pixels
So, in this example, the maximum value of the neighbor pixels = 185. Therefore, the value of the density map = (185-63)/8.
Using SSE2 it takes a few operations to do that. All we need is locate the starting address at the top and bottom (3 pixels each), put them on a xmm register and calculate he maximum value of them.
Then,, we do the same but only for the pixels on the left and right of the center. Get the max of both.
And finally, we get the maximum of the resultant 2 values (Top/Bottom and Left/Right).
There´s no need for subtract all those 8 pixels, get their results, make them absolute and only later, get their maximum. This is simply because, no matter what subtraction or deltas you will have, the maximum delta will always be the one that contains the maximum value of one of the 8 pixels.
So, all we need is find the max value of the 8 pixels and subtract it from the center. And then divide the resultant delta by 8 (Because we have 8 neighbor pixels)
Since the algo takes 8 pixels surrounding each one, the better when start all over this, is create the grey table with a extended size. So, width+1 and height+1 and put the image at the center of that buffer area, making the borders (1 pixel each) useless, but preventing crashes and other computations to adjust the border all over the time.
On this way, if we start the algo say, at position x = 0. When he try to get the pixel on the left, it wont crash, and since it´s value would be 0 , it won´t affect too much the result (except on the borders and corners) that will take, in fact, 3 to 5 pixels to calculate the max, instead the regular 8.
So, in memory, our grey image map will be like this:
Padding1 Padding2 Padding Padding Padding Padding Padding Padding Padding Padding
Padding2 148 96 98 98 66 139 142 24 Padding
Padding3 120 90 78 149 161 101 112 46 Padding
Padding 97 97 99 140 185 180 105 140 Padding
Padding 106 99 102 63 162 115 115 139 Padding
Padding 95 107 58 113 138 82 134 176 Padding
Padding 97 102 91 53 121 78 156 204 Padding
Padding 126 93 68 57 145 155 135 58 Padding
Padding 137 95 33 86 122 169 66 37 Padding
Padding Padding Padding Padding Padding Padding Padding Padding Padding Padding
Where padding, is nothing less then a null value used to adjust the problem at the borders and corners.
The whole function is :
; equates related to pshufd macro
[SSE_ROTATE_RIGHT_96BITS 147]
[SSE_ROTATE_LEFT_32BITS 147]
[SSE_ROTATE_LEFT_96BITS 57]
[SSE_ROTATE_RIGHT_32BITS 57]
[SSE_SWAP_QWORDS 78] ; the same things and result as SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_LEFT_64BITS, SSE_ROTATE_64BITS
[SSE_ROTATE_LEFT_64BITS 78] ; the same things and result as SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_LEFT_64BITS, SSE_ROTATE_64BITS
[SSE_ROTATE_RIGHT_64BITS 78] ; the same things and result as SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_LEFT_64BITS, SSE_ROTATE_64BITS
[SSE_ROTATE_64BITS 78] ; the same things and result as SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_LEFT_64BITS, SSE_ROTATE_64BITS
[SSE_INVERT_DWORDS 27] ; invert the order of dwords
[SSE_SWAP_DWORDS 177] ; Dwords in xmm are copied from 0123 to 1032 ordering. The same as: pshufd XMM0 XMM0 {SHUFFLE 1,0,3,2}
[SSE_SWAP_LOQWORD 225] ; Only the 2 dwords in the LowQword are swaped. Dwords in xmm are copied from 0123 to 0132 ordering. The same as: pshufd XMM0 XMM0 {SHUFFLE 0,1,3,2}
[SSE_SWAP_HIQWORD 180] ; Only the 2 dwords in the HighQword are swaped. Dwords in xmm are copied from 0123 to 1023 ordering. The same as: pshufd XMM0 XMM0 {SHUFFLE 1,0,2,3}
; Macros used
[SHUFFLE | ( (#1 shl 6) or (#2 shl 4) or (#3 shl 2) or #4 )] ; Marinus/Sieekmanski
[pshufd | pshufd #1 #2 #3]
[SSE2_MAX_4FLOATS | maxps #1 #2 | pshufd #2 #1 SSE_SWAP_DWORDS | maxps #1 #2 | pshufd #2 #1 SSE_ROTATE_64BITS | maxps #1 #2]
[SSE_ABS_REAL4 | pslld #1 1 | psrld #1 1]
; Variables used
[NeighboursDivision: F$ (1/8), F$ (1/8), F$ (1/8), F$ (1/8)]
[MaxDensity: F$ 0, 0, 0, 0] ; ge the maximum density found to we use in the scale to density
[DensityRange: F$ 255, 255, 255, 255] ; to be used in Scale to density)
Proc EightNeighbourDistance:
Arguments @pInput, @pOutput, @Height, @Width, @Padding, @DensityRatio
Local @CurYPos, @NextRow, @PreviousRow, @NextOutput, @NextInput, @AddPaddingStart
Uses ebx, esi, edi, edx, ecx
xor eax eax
If_Or D@Height = 0, D@Width = 0
ExitP
End_If
movups xmm3 X$NeighboursDivision
movups xmm4 X$MaxDensity
mov eax D@Padding | shl eax 2 | mov D@AddPaddingStart eax
; calculate the address of the next input row
mov eax D@Padding | shl eax 1 | add eax D@Width | shl eax 2 | mov D@NextInput eax
; calculate the address of the next output row
mov eax D@Width | shl eax 2 | mov D@NextOutput eax
mov eax D@Padding | mov ebx eax | add eax D@Width | shl eax 2 | mov D@NextRow eax; 9
shl ebx 1 | shl ebx 2 | add ebx eax | mov D@PreviousRow ebx
; Input start at
mov eax D@Height | mov D@CurYPos eax
mov edx D@pInput | add edx D@NextInput
mov edi D@pOutput
..Do
mov ebx edx | add ebx D@AddPaddingStart
xor esi esi
.Do
; get previous line (Up)
lea eax D$ebx+esi*4 | sub eax D@PreviousRow | movdqu xmm1 X$eax; | SSE_CONV_4INT_TO_4FLOAT xmm1 ; convert the integers to floats
pshufd xmm1 xmm1 {SHUFFLE 2,2,1,0} ; the 4th Float is not needed for this comparision, because we want only 3 (padding + width + padding)
; get next line (Down)
lea eax D$ebx+esi*4 | add eax D@NextRow | movdqu xmm0 X$eax; | SSE_CONV_4INT_TO_4FLOAT xmm0 ; convert the integers to floats
pshufd xmm0 xmm0 {SHUFFLE 2,2,1,0} ; the 4th Float is not needed for this comparision, because we want only 3 (padding + width + padding)
SSE2_MAX_4FLOATS xmm0 xmm1
; now get left
movdqu xmm1 X$ebx+esi*4-4 | pshufd xmm1 xmm1 {SHUFFLE 0,0,0,0}
; now get right
movdqu xmm2 X$ebx+esi*4+4 | pshufd xmm2 xmm2 {SHUFFLE 0,0,0,0}
maxps xmm1 xmm2
; and finally get the maximum between up and down and left to right
maxps xmm1 xmm0
; get our core pixel
mov eax D$ebx+esi*4 | movd xmm0 eax | pshufd xmm0 xmm0 {SHUFFLE 0,0,0,0}
; subtract pixels from right from the core
subps xmm0 xmm1 ; Subtract corresponding integers
SSE_ABS_REAL4 xmm0 ; and convert them to absolute
mulps xmm0 xmm3
; get the maximum density
maxps xmm4 xmm0
movd D$edi+esi*4 xmm0
inc esi
.Loop_Until esi => D@Width
add edx D@NextInput
add edi D@NextOutput
dec D@CurYPos
..Repeat_Until_Zero
; finally calculate the density ratio to be used in our new scale_density_to_image function
mov eax D@DensityRatio
movups xmm0 X$DensityRange | divss xmm0 xmm4 | movd D$eax xmm0
EndP
I´ll clean up the code later, but it only uses 2 functions to do all the trick. In fact, it uses only one "EightNeighbourDistance" which is responsible to create a map of the density (size = width*height*4 only :azn: :azn: ) (It creates a sort of a "pixeldensity" or "DensityPixel" or whatever need to call this stuff), while another function scale_density_to_image is responsible only to make that map be visible since the values of the density map are around 10% of the original grey ones we inputted. That´s why i calculated the maximum density and the ratio inside EightNeighbourDistance at once, to avoid extra checking on the scale_density_to_image function.
So, the final value of the pixel (to it be visible) is basically DensityPixel*DensityRatio
I added a parameter "DensityRatio" on the function EightNeighbourDistance (responsible to create the density map) to make easier to we make the pixels be visible again from the scale_density_to_image function.
The density ratio is basically this:
Ratio = 255/MaxDensity
Max Density is the maximum value found in the density map at EightNeighbourDistance function. And 255 is because this is the maximum value of any pixel.
So, if in the density map we found that the maximum value of the density image is let´s say: 28. We simply calculate the ratio r = 255/28
Then, to make the pixels visible again, we do it in anther function (scale_density_to_image) . So, on that function, we take the value of each "PixelDensity" and multiply with our Ratio.
To convert the pixel density to visible pixels we simply do:
Final Pixel Value= PixelDensity*(255/28)
Could we use euclidean distance to calculate the value of the density map ? Sure...but the question is why ? Using euclidian distance may result in 2 disadvantages:
1 - Speed
2 - Accuracy
The speed problem could be solved using a Table of 256 floats (Real4) containing the correspondent value of the Euclidean Distance (since it will result in values from 0 to 255 anyway). So, the max delta could be applied to a table containing 256 predefined values of Euclidean distances.
But the problem seems to be accuracy. Using euclidean distances may result in a higher values of Density which could not be accurated when classifyig the clusters. Ex:
if at a given pixel we calculate the density using euclidean as:
Maximum value of the 8 Neighbors = 90
Center Pixel to get the delta 40
Euclidean Distance = sqrt(90^2-40^2) = 80.62....
But, using a simple delta, we get:
Density Map (simple delta) = 90-40 = 50
So, it seems that using a simple delta from the max of the 8 neighbours and the pixel we are using to create (or classify) in the density map seem enough and more accurate for me - Not to mention, way faster. (Is the same conclusion, btw found in the article at the Burdwan University)
The algorithm seems really food to apply in other filters. I made a quick test using the Density Map over each RGB channel of the image and the result is pretty good to enhance the contours.
It can be refined, but for now it´s just a quick test. Didn´t checked for bugs, and probably the app will crash if you try to open a new image without closing the app (I didn´t created the routine yet to release the memory since it was just a quick test.
What it does is get the values of each channel separately and subtract it from the value of the pixel found in the density map (I used the scaled value for this purpose). Then apply only the delta and use this new value as the one on each channel. Like this:
Channel Red at a given position
Pixel 30 48 150
Density Map 48 48 48 (Already scaled)
Get the delta
Pixel = Abs(30-48) (48-48) (150-48) and voilà !
In some images it may appears to increase some noises but, in fact what it is doing is enhanced the noises that were already in the original image or existent to bad encoding etc.
The better to use this techique should be apply the delta only on the luma with CieLCH colorspace, but i have no time right now to fix my CieLCH routines
Anyway, for a test, it seems very good. Now i´m proceeding to try to create the Classification (Cluster types and homogeneity etc)
Note: use the same dlls i posted in my previous version. I released only the executable this time.
Btw, calculating the delta on all of 32 bytes on xmm register revealed o be very tricky in SSE2. But here is the part of the code that do it
movups xmm0 X$ebx+edx*8 ; original image dest
movups xmm1 X$esi+edx*8 ; density src
; SSE2_ABS_BYTE_DELTA xmm0 xmm3
; calculate the delta on each one of the 32 bytes on the xmm register
; Backup original xmm0 and xmm1
movaps xmm7, xmm0 ; xmm7 = xmm0
movaps xmm6, xmm1 ; xmm6 = xmm1
; Calculate xmm0 - xmm1
psubusb xmm0, xmm1 ; xmm0 = max(xmm0 - xmm1, 0)
; Calculate xmm1 - xmm7
psubusb xmm6, xmm7 ; xmm6 = max(xmm1 - xmm7, 0)
; Take the maximum of both results to get the absolute difference
por xmm0, xmm6 ; xmm0 = max(xmm0, xmm6)
Examples
(https://i.ibb.co/cykV3j2/Image2test.png) (https://ibb.co/cykV3j2)
(https://i.ibb.co/18jJj1F/Image2test.png) (https://ibb.co/18jJj1F)
Judging from those pictures, you've cooked up a pretty good "sharpen" filter there, yes?
Hi NoCforMe. Yes, the algorithm can be used as a filter to sharpen images. It wasn't designed for that, but in the end I saw that this was another possible use.
Testing for edition. Working now, Stoo :)
Finally succeeded to make this works
I created 2 specific functions to properly identify the clusters. One biased on the "normal" identification method as described here: https://www.youtube.com/watch?v=-p354tQsKrs
And other forcing the point nearby a cluster to be a border even if it don´t have the minimum distance of Epsilon. On such cases, if i decide to use this method too (after testing on real images), i´ll see if i can set a newer flag called NEAR_BORDER meaning that we have a point near a cluster that can be considered border despite it´s values is not enough to be considered as such.
Both methods uses the technique described at: https://biswajit.netlify.app/pages/dip-burdwanuniversity
I´ll make further tests before the next release.
This algorithm can be usefull to create a more accurate denoise filter, since it identifies precisely what is noise and also store the necessary amount of data needed for that specific pixel be considered at least a border and later be used to form new clusters if necessary.
Now i need also to find a way to assign each border to it´s own cluster (or clusters) and create a type of subgroup of clusters (for cases where clusters can be merged to form a larger one)
Ok, guys. I suceeded to assign a unique ID for each Core and Border points and also succeeded to identify each Noise point as well. Now i´m trying to figure it out a way to assign a unique Id for each cluster and associate it to the corresponding core and border points that forms it.
Any ideas ?
Say that i have an image (width = 6, height = 5) that contains these classifications:
Noise - Core - Core - Noise - Core - Core
Border - Border - Core - Border - Border - Core
Noise - Border - Core - Border - Border - Border
Noise - Border - Border - Border - Border - Core
Noise - Noise - Noise - Noise - Core - Core
How do i determine the clusters and how many are there ?