DBScan, Homogeneity and Completeness algorithm

Started by guga, May 11, 2024, 11:39:49 PM

guga

Hi guys

Has anyone succeeded in creating a DBSCAN algorithm for a given data set?

I'm trying to create a DBSCAN clustering function to analyse a chunk of data (in dwords, bytes, or whatever input is needed).

DBSCAN is described in:
https://medium.com/@balajicena1995/dbscan-clustering-2a577d384e61
https://www.youtube.com/watch?v=87ogbzhXUxo
https://github.com/gyaikhom/dbscan/blob/master/dbscan.c
https://dev.to/rajaniraiyn/dbscan-clustering-algorithm-demystified-1d5o
https://www.codeproject.com/Articles/5129186/Step-by-Step-Guide-to-Implement-Machine-Learning-8
https://en.wikipedia.org/wiki/DBSCAN

The goal is to determine whether a certain chunk of data is homogeneous, i.e. whether the data inside a given chunk contains similar values that can be used to identify/distinguish it from others (for use in image processing, watermark or object detection, etc.).

Homogeneity and Completeness are defined in:
https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
and
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score

But I didn't understand how this can be implemented. The C version on GitHub is way too complex for me to use as a starting point for a simple algorithm.

ChatGPT gave me some ideas on how to do it in MASM, but I got nowhere.

In the example I provided, I tried to analyse the following sequence:
{0, 1, 1, 1, 1, 1, 1, 1, 25, 15, 96, 19, 1, 1, 2, 25}

Here 1 is the most common value, so the data should yield some value of homogeneity and completeness.
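
Just to make the goal concrete, here is a minimal Python sketch (using scikit-learn, which the links above are based on) that runs DBSCAN on that exact sequence; the eps and min_samples values are only guesses for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

# The sample sequence above, treated as 1-D points (one row per point)
data = np.array([0, 1, 1, 1, 1, 1, 1, 1, 25, 15, 96, 19, 1, 1, 2, 25], dtype=float)
points = data.reshape(-1, 1)

# eps (neighbourhood radius) and min_samples are guesses; real data needs tuning
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(points)

print(labels)                                             # -1 marks noise/outliers
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))

Note that scikit-learn's homogeneity_score compares the cluster labels against known ground-truth class labels, which is a different setup from simply measuring how uniform a chunk is.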

When I started, ChatGPT gave me a scheme like this.

My first question led to:

Quote: To determine if pixel values in a sequence of images represent the same pixel or object, especially when the majority of pixels are around a certain value and a few are outliers, you can use clustering algorithms or similarity measures. Here's an approach you can take:

    Preprocess the Data: Normalize the pixel values to a common scale if necessary. This ensures that all values are comparable.

    Cluster Pixel Values: Use a clustering algorithm such as K-means or DBSCAN to group similar pixel values together. This can help identify clusters of pixels that represent the same object or feature in the images. Set the number of clusters based on the expected number of objects or features in the images.

    Identify Outliers: Pixels that do not belong to any cluster or belong to very small clusters can be considered outliers. These outliers may represent noise or background pixels that are different from the main object or feature.

    Evaluate Results: Inspect the clusters and outliers to determine if they correspond to different objects or features in the images. Visualizing the clusters can help you understand the grouping of pixel values.

    Adjust Parameters: If the results are not satisfactory, adjust the parameters of the clustering algorithm (e.g., number of clusters, distance metric) and re-run the analysis.

    Post-processing: Depending on the application, you may need to further process the clustered pixel values to refine the identification of objects or features.

By using clustering algorithms and analyzing the distribution of pixel values in the images, you can determine if the values represent the same pixel or object, and identify outliers that may represent noise or background pixels.

My second question led me to:

Quote: To demonstrate how clustering algorithms can be used to analyze the distribution of pixel values in a sequence of images, let's consider a simplified example with a set of 20 images. We'll use the K-means clustering algorithm to group similar pixel values together and identify outliers that may represent noise or background pixels.

Assuming each image has the same size and we want to analyze a specific pixel (e.g., pixel at coordinates (x, y)), here's how you can approach it:

    Load the Images:
    Load the 20 images into memory.

    Extract Pixel Values:
    For each image, extract the pixel value at the specified coordinates (x, y). You will have 20 pixel values for this specific pixel location.

    Normalize Pixel Values:
    Normalize the pixel values to a common scale if necessary.

    Apply K-means Clustering:
    Use the K-means clustering algorithm to cluster the pixel values. Set the number of clusters based on the expected number of objects or features in the images.

    Identify Outliers:
    Pixels that do not belong to any cluster or belong to very small clusters can be considered outliers. These outliers may represent noise or background pixels.

Here's a Python example using the scikit-learn library to perform K-means clustering on a set of 20 images:
import numpy as np
from sklearn.cluster import KMeans

# Assume pixel_values is a 1D array containing pixel values at location (x, y) for 20 images
# This is a simplified example, and in practice, you would extract pixel values from actual images
pixel_values = np.array([120, 130, 140, 150, 160, 200, 210, 220, 230, 240,
                          150, 140, 135, 130, 125, 195, 205, 210, 215, 220])

# Reshape the array to have a single feature (required by K-means)
pixel_values = pixel_values.reshape(-1, 1)

# Apply K-means clustering with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(pixel_values)

# Get the cluster labels
cluster_labels = kmeans.labels_

# Get the cluster centers
cluster_centers = kmeans.cluster_centers_

# Identify outliers
outliers = pixel_values[np.where(cluster_labels != np.argmax(np.bincount(cluster_labels)))]

print("Cluster Centers:", cluster_centers.flatten())
print("Outliers:", outliers.flatten())

I then asked it to create a similar function in MASM32, using DBSCAN instead of K-means.

This is the resulting function ChatGPT gave me:
.DATA
    pixel_values DWORD 120, 130, 140, 150, 160, 200, 210, 220, 230, 240, \
                    150, 140, 135, 130, 125, 195, 205, 210, 215, 220
    num_pixels DWORD 20
    epsilon DWORD 15   ; Adjust epsilon as needed
    minPts DWORD 3      ; Adjust minPts as needed
    cluster_labels DWORD num_pixels DUP(0)
    cluster_count DWORD 0
.CODE
    main PROC
        ; Load addresses of data
        mov esi, OFFSET pixel_values
        mov edi, OFFSET cluster_labels
       
        ; DBSCAN algorithm
        mov ecx, num_pixels
    outer_loop:
        cmp DWORD PTR [edi], 0
        jne skip_iteration   ; Skip if point is already assigned to a cluster
        mov eax, ecx
        dec eax
        mov ebx, eax
        mov edx, epsilon
        shl edx, 2           ; Multiply epsilon by 4 (size of DWORD)
        mov esi, OFFSET pixel_values
        lea esi, [esi + eax * 4]
    inner_loop:
        cmp DWORD PTR [edi + eax * 4], 0
        jne next_iteration   ; Skip if point is already assigned to a cluster
        mov esi, OFFSET pixel_values
        lea esi, [esi + ebx * 4]
        sub esi, OFFSET pixel_values
        mov edx, epsilon
        shl edx, 2           ; Multiply epsilon by 4 (size of DWORD)
        mov eax, ebx
        mov ecx, eax
        call distance       ; Calculate distance between points
        cmp eax, edx        ; Compare distance with epsilon
        jge next_iteration  ; Skip if distance is greater than epsilon
        inc ecx             ; Increment count of neighbors
        cmp ecx, minPts
        jl next_iteration   ; Skip if minimum number of points is not reached
        mov eax, ebx
        inc cluster_count   ; Assign new cluster ID
    assign_cluster:
        mov DWORD PTR [edi + eax * 4], cluster_count
        jmp next_iteration
    skip_iteration:
        dec ecx
    next_iteration:
        dec ebx
        jns inner_loop
        inc ecx
        cmp ecx, num_pixels
        jl outer_loop

        ; Output cluster labels
        mov esi, OFFSET cluster_labels
        mov ecx, num_pixels
    print_clusters:
        mov eax, [esi]
        call WriteDec
        call Crlf
        add esi, 4
        loop print_clusters

        ; Exit program
        call ExitProcess
    main ENDP

    distance PROC
        ; Calculate Euclidean distance between two points
        ; Input: ESI - offset of first point, EDI - offset of second point
        ; Output: EAX - distance
        mov eax, DWORD PTR [esi]
        sub eax, DWORD PTR [edi]
        imul eax, eax
        mov edx, DWORD PTR [esi + 4]
        sub edx, DWORD PTR [edi + 4]
        imul edx, edx
        add eax, edx
        mov edx, DWORD PTR [esi + 8]
        sub edx, DWORD PTR [edi + 8]
        imul edx, edx
        add eax, edx
        mov edx, DWORD PTR [esi + 12]
        sub edx, DWORD PTR [edi + 12]
        imul edx, edx
        add eax, edx
        call Sqrt
        ret
    distance ENDP
END main

OK, but this was just to calculate the cluster_labels. I then asked it to create the other functions, such as the cluster_centers and outliers, which led me to:
Quote: Yes, the function I provided is just to calculate the cluster_labels based on the DBSCAN algorithm. To find the cluster_centers and outliers, you can modify the code to calculate them after assigning the cluster labels. Here's how you can do it:

    Cluster Centers: After assigning cluster labels, you can calculate the cluster centers by averaging the pixel values of all points in each cluster.

    Outliers: Outliers are points that do not belong to any cluster or belong to very small clusters (noise points). You can identify outliers by checking for points with a cluster label of 0 or labels that are not assigned to a core point (i.e., points with fewer than minPts neighbors).
.DATA
    ...
    cluster_centers DWORD 0, 0  ; Store the cluster centers (assuming 2 clusters)
    outliers DWORD num_pixels DUP(0)  ; Store the indices of outliers
    num_outliers DWORD 0  ; Counter for the number of outliers
.CODE
    ...
    update_cluster_centers PROC
        ; Calculate cluster centers
        mov esi, OFFSET pixel_values
        mov edi, OFFSET cluster_labels
        xor ecx, ecx
        xor edx, edx
    calculate_centers_loop:
        cmp DWORD PTR [edi + ecx * 4], cluster_count
        jne not_in_cluster
        add esi, DWORD PTR [esi + ecx * 4]  ; Add pixel value to sum
        inc edx  ; Increment count of points in cluster
    not_in_cluster:
        inc ecx
        cmp ecx, num_pixels
        jl calculate_centers_loop
        div edx  ; Calculate average pixel value (cluster center)
        mov DWORD PTR [cluster_centers + cluster_count * 4], eax
        inc cluster_count
        ret
    update_cluster_centers ENDP

    identify_outliers PROC
        ; Identify outliers
        mov esi, OFFSET cluster_labels
        xor ecx, ecx
    find_outliers_loop:
        cmp DWORD PTR [esi], 0  ; Check if point is not assigned to any cluster
        jne not_outlier
        mov DWORD PTR [outliers + num_outliers * 4], ecx  ; Add index to outliers
        inc num_outliers
    not_outlier:
        inc ecx
        add esi, 4
        cmp ecx, num_pixels
        jl find_outliers_loop
        ret
    identify_outliers ENDP

    main PROC
        ...
        call update_cluster_centers
        call identify_outliers
        ...
    main ENDP
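
For comparison, the cluster-centre and outlier steps ChatGPT describes above come down to a few lines in Python once the labels exist. This is only an illustrative sketch: eps and min_samples are guesses, and it uses DBSCAN's convention that label -1 marks noise:

import numpy as np
from sklearn.cluster import DBSCAN

# Sample values from the ChatGPT example above
pixel_values = np.array([120, 130, 140, 150, 160, 200, 210, 220, 230, 240,
                         150, 140, 135, 130, 125, 195, 205, 210, 215, 220], dtype=float)

# eps/min_samples are illustrative guesses
labels = DBSCAN(eps=15, min_samples=3).fit_predict(pixel_values.reshape(-1, 1))

# Cluster centres: mean of the member values of each non-noise cluster
centers = {lab: pixel_values[labels == lab].mean() for lab in set(labels) if lab != -1}

# Outliers: points DBSCAN marked as noise (label -1)
outliers = pixel_values[labels == -1]

print("labels:  ", labels)
print("centers: ", centers)
print("outliers:", outliers)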


But the ChatGPT assembly above is incorrect and crashed, since the distance function is not pointing to a valid address. I then asked it to explain in detail how this thing works... but I got more confused than ever. This is what ChatGPT wrote:

Quote: Let's walk through the function step by step with the given pixel_values and assuming epsilon is 10 and minPts is 2 for this example:

    Initialization:
        pixel_values = {120, 130, 140, 150, 160}
        cluster_labels = {0, 0, 0, 0, 0}
        num_pixels = 5
        epsilon = 10
        minPts = 2

    Outer Loop (for pixel 5, value 160):

        Compare cluster label of pixel 5 (cluster_labels[4]) with 0. It is 0.

        Set eax to 4 (index of pixel 5).

        Set ebx to 4.

        Calculate edx (distance threshold): 10 << 2 = 40.

        Calculate address of pixel 5 (esi): pixel_values + 4 * 4 = pixel_values + 16.

        Inner Loop (for pixel 4, value 150):

            Compare cluster label of pixel 5 with 0. It is 0.

            Calculate address of pixel 4 (esi): pixel_values + 3 * 4 = pixel_values + 12.

            Calculate distance between pixel 5 and pixel 4. Assume distance is 10.

            Compare distance (10) with epsilon (40). Distance is less than epsilon.

            Increment neighbor count (ecx) to 1.

            Continue inner loop for pixel 3.

        Continue outer loop for pixel 4.

    Continue the same process for pixels 3, 2, and 1.

    Output Cluster Labels:
        The cluster_labels array after processing may look like: {1, 1, 1, 1, 1} indicating all pixels belong to the same cluster.

Please note that the actual calculation of the distance and clustering logic is not shown here but would be necessary to implement the complete DBSCAN algorithm.
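
Since that walkthrough leaves out the actual neighbour search and cluster expansion, here is a minimal pure-Python DBSCAN that follows the textbook pseudocode (e.g. the Wikipedia entry) on plain 1-D values. It is only a sketch of the control flow, with eps and min_pts as illustrative parameters, not an optimized implementation:

def region_query(data, i, eps):
    # Indices of all points within distance eps of point i (the point itself included)
    return [j for j, v in enumerate(data) if abs(v - data[i]) <= eps]

def dbscan(data, eps, min_pts):
    # Returns one label per point: -1 = noise, 1..N = cluster ids
    UNVISITED, NOISE = 0, -1
    labels = [UNVISITED] * len(data)
    cluster_id = 0
    for i in range(len(data)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:           # not a core point: mark as noise for now
            labels[i] = NOISE
            continue
        cluster_id += 1                        # start a new cluster from this core point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                           # expand through density-reachable points
            j = seeds.pop()
            if labels[j] == NOISE:             # border point previously marked as noise
                labels[j] = cluster_id
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(data, j, eps)
            if len(j_neighbors) >= min_pts:    # j is also a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels

# Example with the sequence from the first post
print(dbscan([0, 1, 1, 1, 1, 1, 1, 1, 25, 15, 96, 19, 1, 1, 2, 25], eps=2, min_pts=3))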

But I'm failing badly to understand how ChatGPT's scheme is supposed to work.

Can someone help me create such a thing? Or fix the MASM functions ChatGPT produced?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

zedd151

The few times I used ChatGPT to write code, it often used the Irvine32 libraries. I gave up on that soon after.
Looking at your code snippet from ChatGPT, I see 'call Crlf' -- is that from the Irvine32 library? There is no header there, so I don't know for sure.
:undecided:

guga

#2
I don't know whether ChatGPT uses Irvine's code or not. ChatGPT has several flaws when it comes to writing something in MASM (or NASM, FASM, RosAsm), but it gives some idea of how to start or, in some cases, does 50% of the work. The problem is that in this example I'm not able to follow the logic it is showing, because it seems a bit different from the one explained in Wikipedia and the other links.

It seems to be a simple loop over each value, comparing it with the others, but I'm failing badly to understand the logic and the math behind this.

And porting the algorithm at https://github.com/gyaikhom/dbscan/blob/master/dbscan.c to MASM seems a bit complex for a task that looks simple at first.

zedd151

Have you tried godbolt.org, to help with the conversion/translation to assembly? (Assuming you have a usable source in another language.)
Once converted, it could more easily be optimized, or parts of it changed to better suit your needs.
:undecided:

NoCforMe

Would it be possible for you to provide a link that explains what DBScan is in plain English to someone who's unfamiliar with all that jargon? I really can't make head nor tail of it using those links you provided.

(Plus could you link-ize your links? there's a button for that right in the posting editor)
Assembly language programming should be fun. That's why I do it.

NoCforMe

I suppose the Wikipedia entry (hate Wikipedia!) is as good as anything I've found so far:
Quote: It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away).

* I had to edit that entry, as it was badly written, as is so much stuff there.

Here's another article, on Medium by a Portuguese writer who writes better English than most native speakers, in pretty plain language.

Enough with the jargon! Tell me what it is, how it works and what it's used for.

guga

Quote from: sudoku on May 12, 2024, 03:01:02 AM: Have you tried godbolt.org, to help with the conversion/translation to assembly? (Assuming you have a usable source in another language.)
Once converted, it could more easily be optimized, or parts of it changed to better suit your needs.

Yeah, I'm doing that. But the C version uses 3-D points and I want to use only 1-D points (linear values from a sequence of images). So, take the pixel at position x = 10, y = 10 in all images; at that particular position, each of the following images will contain a specific pixel value (RGB). I need to compute the DBSCAN of each channel separately, i.e. DBSCAN of one pixel at a specific position across all subsequent images.
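
A rough Python sketch of that setup (one DBSCAN run per colour channel on the values of a single pixel position across the image stack); the file names, eps and min_samples are only placeholders, and Pillow is assumed for loading the frames:

import numpy as np
from PIL import Image                 # assumption: Pillow is used to load the frames
from sklearn.cluster import DBSCAN

# Hypothetical frame list and pixel position
frames = ["frame00.png", "frame01.png", "frame02.png"]   # placeholder file names
x, y = 10, 10

# Collect the RGB value of pixel (x, y) from every frame: shape (num_frames, 3)
stack = np.array([np.asarray(Image.open(f).convert("RGB"))[y, x] for f in frames],
                 dtype=float)

# One 1-D DBSCAN per channel, as described above
for ch, name in enumerate(("R", "G", "B")):
    values = stack[:, ch].reshape(-1, 1)
    labels = DBSCAN(eps=10, min_samples=3).fit_predict(values)   # illustrative parameters
    print(name, "labels:", labels, " noise:", int(np.sum(labels == -1)))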

guga

#7
Quote from: NoCforMe on May 12, 2024, 03:03:29 AM: Would it be possible for you to provide a link that explains what DBScan is in plain English to someone who's unfamiliar with all that jargon? I really can't make head nor tail of it using those links you provided.

(Plus could you link-ize your links? there's a button for that right in the posting editor)
https://en.wikipedia.org/wiki/DBSCAN
https://www.youtube.com/watch?v=4AW_5nYQkuc
https://www.youtube.com/watch?v=2Zsz-0K-Ax4

guga

#8
Quote from: NoCforMe on May 12, 2024, 03:11:35 AM: I suppose the Wikipedia entry (hate Wikipedia!) is as good as anything I've found so far:
Quote: It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away).

* I had to edit that entry, as it was badly written, as is so much stuff there.

Here's another article, on Medium by a Portuguese writer who writes better English than most native speakers, in pretty plain language.

Enough with the jargon! Tell me what it is, how it works and what it's used for.

I'm not a native English speaker. I'll give that article a read.

https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80

If you speak Portuguese, I can talk to you in my native language, but it would be confusing and hard for others to understand.


Btw... I'm using it to try to find a better way to identify a watermark on a set of images. I was trying the Median and MAD algorithms, but it seems DBSCAN may work better.

HSE

#9
Hi Guga!

Quote from: guga on May 11, 2024, 11:39:49 PM: Has anyone succeeded in creating a DBSCAN algorithm for a given data set?

An implementation is present in Comparing K-Means and Others Algorithms for Data Clustering in Assembly.

Regards, HSE.
Equations in Assembly: SmplMath

guga

Thanks HSE. Can you fix the link? It's not pointing to any address. I managed to convert the C file, but it would be nice to take a look at what the link you provided points to.

Btw... about finding the homogeneity, I made a small function that may be used for this task (unless it is already handled in a more accurate way in the code behind your link).


;;

Homogeneity metric of a cluster labeling given a ground truth.

A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won't change the score value in any way.
This metric is not symmetric: switching label_true with label_pred will return the completeness_score which will be different in general.
Mathematically, homogeneity has the connotation of invariance, as all components of the equation have the same degree of value whether or not each of these
components is scaled to different values, for example, by multiplication or addition. A cumulative distribution fits this description:
"the state of having identical cumulative distribution functions or values".

The function uses a logarithmic sum over the good clusters found in a chain of data. This forces the result to lie between 0 and 1 and enforces homogeneity.
In general, the higher the value (closer to 1), the more homogeneous the data.

    Parameters:
        points (in) - A pointer to the data, formed by an array of point_t structures.
        MaxClasses (in) - The maximum amount of usable classes (Cluster_Id) available. This parameter only takes into account Ids bigger than 0.
        num_points (in) - The total amount of elements in the data chain.
        pHomogeneity (out) - Pointer to a variable that will hold the homogeneity value. The variable must be a Real8 (8 bytes).

    Return Value: The function does not return any value. The result is returned in the pHomogeneity parameter.
                  The range of the result is between 0 and 1.

    Remarks:
        The mathematical equation is as follows:

        homogeneity = 1 + ( log(cluster_id(0)/TotalCluster_id0) +
                            log(cluster_id(1)/TotalCluster_id1) +
                            log(cluster_id(2)/TotalCluster_id2) +
                            log(cluster_id(3)/TotalCluster_id3) +
                            ....
                            log(cluster_id(N)/TotalCluster_idN)
                          ) / NumPoints

        But we exclude Noise (negative Ids) and also the 0 Id. Excluding the 0 Id is needed because it can also be interpreted as Noise, and there is no log(0).
        A 0 Id means the data is close to a cluster but can still be considered noise. A negative value (-2) is pure noise, and 0 can represent 'Border' Noise.
        So, in fact, our equation reduces to:

        homogeneity = 1 + ( log(cluster_id(1)/TotalCluster_id1) +
                            log(cluster_id(2)/TotalCluster_id2) +
                            log(cluster_id(3)/TotalCluster_id3) +
                            ....
                            log(cluster_id(N)/TotalCluster_idN)
                          ) / NumPoints


    In other words, it takes the sum of the log of each ClusterId divided by the total amount of elements associated with that Id. After summing them, it divides by the
    total amount of elements of the data chunk (num_points) and finally adds 1 to this result, which keeps the resulting value between
    0 and 1.

    For example, say you have a data chunk formed by a sequence of data (53 elements = points of x, y, z) where you have already calculated the Cluster_id
    associated with each point of the data chain, like this:

        x     y     z     cluster_id
        -----------------------------
        1.00  3.00  1.00: 0
        1.00  4.00  1.00: 0
        1.00  5.00  1.00: 0
        1.00  6.00  1.00: 0
        2.00  2.00  1.00: 2
        2.00  3.00  0.00: 1
        2.00  4.00  0.00: 1
        2.00  5.00  0.00: 1
        2.00  6.00  0.00: 1
        2.00  7.00  1.00: 3
        3.00  1.00  1.00: 2
        3.00  2.00  1.00: 2
        3.00  3.00  1.00: 2
        3.00  4.00  0.00: 1
        3.00  5.00  0.00: 1
        3.00  6.00  1.00: 3
        3.00  7.00  1.00: 3
        4.00  1.00  1.00: 2
        4.00  2.00  1.00: 2
        4.00  3.00  0.00: 1
        4.00  4.00  0.00: 1
        4.00  5.00  1.00: -2
        4.00  6.00  0.00: 1
        4.00  7.00  1.00: 3
        4.00  8.00  1.00: 3
        5.00  1.00  1.00: 2
        5.00  2.00  0.00: 1
        5.00  3.00  0.00: 1
        5.00  4.00  0.00: 1
        5.00  5.00  0.00: 1
        5.00  6.00  0.00: 1
        5.00  7.00  1.00: 3
        5.00  8.00  1.00: 3
        6.00  1.00  1.00: 2
        6.00  2.00  0.00: 1
        6.00  3.00  1.00: 3
        6.00  4.00  1.00: 3
        6.00  5.00  1.00: 3
        6.00  6.00  1.00: 3
        6.00  7.00  1.00: 3
        7.00  1.00  1.00: 2
        7.00  2.00  0.00: 1
        7.00  3.00  0.00: 1
        7.00  4.00  0.00: 1
        7.00  5.00  1.00: 3
        8.00  1.00  1.00: 2
        8.00  2.00  1.00: 2
        8.00  3.00  0.00: 1
        8.00  4.00  1.00: 3
        8.00  5.00  1.00: 3
        8.00  6.00  1.00: 3
        9.00  2.00  1.00: 2
        9.00  3.00  1.00: 2

        From the above data chain (points) we have clusters (groups) with Ids from -2 to 3. We need to count how many elements we have in each Cluster_id:

            ClusterId = -2 = 1 element  (Pure Noise)
            ClusterId = 0 = 4 elements  (Border Noise)
            ClusterId = 1 = 19 elements
            ClusterId = 2 = 13 elements
            ClusterId = 3 = 16 elements

            Total elements = 53 (including noise and border noise)

            The negative Id (-2) and the 0 Id represent Noise, therefore they are not included in the homogeneity equation.

            So, it results in:
                homogeneity = 1 + (log(1/19) + log(2/13) + log(3/16)) / 53 = 0.8775430643464206899449251...

;;

[TmpLogValue: R$ 0]

Proc calculate_homogeneity_Log:
    Arguments @points, @MaxClasses, @num_points, @pHomogeneity
    Local @ClusterCount
    Uses ecx, esi, edi

    xorpd xmm0 xmm0
    ...If D@MaxClasses > 1 ; Skip class = 0 (We added earlier to 1)
        mov esi D@points
        mov edi 1 ; start with Cluster_Id = 1
        .Do
            mov D@ClusterCount 0
            mov ecx D@num_points
            Do
                If D$esi+cluster_idDis = edi
                    inc D@ClusterCount
                End_If
                add esi Size_Of_point_t
                dec ecx
            Loop_Until ecx = 0
            mov esi D@points
            cvtsi2sd xmm1 edi ; converts a signed integer to double
            cvtsi2sd xmm2 D@ClusterCount | divsd xmm1 xmm2
            movsd X$TmpLogValue xmm1 | movsd xmm2 xmm0
            call Sse2_log TmpLogValue, SSE_EXP_REAL8
            addsd xmm0 xmm2
            inc edi
        .Loop_Until edi >= D@MaxClasses
        cvtsi2sd xmm1 D@num_points | divsd xmm0 xmm1
    ...End_If
    mov eax 1 | cvtsi2sd xmm1 eax ; converts a signed integer to double
    addsd xmm0 xmm1
    mov eax D@pHomogeneity
    movsd X$eax xmm0

EndP
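
As a cross-check of the formula documented above, here is the same computation in a few lines of Python, using the cluster counts from the worked example (it reproduces the 0.877543... value):

import math

# Counts per cluster id from the example above; the noise id (-2) and id 0 are excluded
counts = {1: 19, 2: 13, 3: 16}
num_points = 53

# homogeneity = 1 + sum(log(id / count_of_id)) / num_points
homogeneity = 1.0 + sum(math.log(cid / n) for cid, n in counts.items()) / num_points
print(homogeneity)   # ~0.8775430643464207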

Btw... does it work for DBSCAN too?

guga

#11
Oh, got it... here, right?

https://masm32.com/board/index.php?topic=11649.0
https://github.com/ASMHSE/Clusters-in-Assembly/tree/main

It does contain DBSCAN. Very good work. I'll take a further look later and try to understand it.

HSE

Equations in Assembly: SmplMath

six_L

A very interesting topic.

The parameters (epsilon and MinPts) of DBSCAN can have a significant impact on the clustering results. Careful parameter tuning is often required for optimal performance.

Hi, guga
Are you using DBSCAN to analyse some images?
Say you, Say me, Say the codes together for ever.

guga

Quote from: six_L on May 16, 2024, 07:14:27 PM: A very interesting topic.

The parameters (epsilon and MinPts) of DBSCAN can have a significant impact on the clustering results. Careful parameter tuning is often required for optimal performance.

Hi, guga
Are you using DBSCAN to analyse some images?
Hi six_L, yes. I'm trying to port the JavaScript example from that GitHub link to assembly to handle images. The JavaScript example segments the image using the DBSCAN algorithm. I think I succeeded in porting it, but it is incredibly slow. Maybe using a table of precalculated Euclidean distances could help, but I'm not able to precalculate it correctly; I'm failing to understand the math behind it.
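
On the precalculated-distance idea: one way to sketch it (shown here with scikit-learn, just to illustrate the shape of the computation; the data and parameters are placeholders) is to build the full pairwise distance table once and hand it to DBSCAN with metric='precomputed'. For n points the table has n*n entries, so for whole images it grows quickly:

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical points, e.g. per-pixel feature vectors (random data just for the shape)
rng = np.random.default_rng(0)
points = rng.random((200, 3))                      # 200 points, 3 features (e.g. R, G, B)

# Precompute the Euclidean distance table once: dist[i, j] = ||points[i] - points[j]||
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff * diff).sum(axis=-1))         # shape (200, 200)

# DBSCAN can consume the table directly instead of recomputing distances on the fly
labels = DBSCAN(eps=0.1, min_samples=4, metric="precomputed").fit_predict(dist)
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0),
      " noise:", int((labels == -1).sum()))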



Btw... I attached the files I managed to port (the simpler one). The source code is embedded in the exe (for RosAsm), and I also included the asm file and the example.dat it needs to do the calculations. I also attached the source code split into its own titles to make it easier to follow. The part of the code related to DBSCAN is in part01.asm; Top.asm and SSE_Macros.asm are only the macros I used for that file. The other asm files are simply additional functions for the console and fast logarithm routines.


The output is this (where -2 = pure noise and 0 means noise that is closer to the clusters, I suppose):
Quote:
Epsilon: 1.000000
Minimum points: 2
Homogeneity: 0.000000
Homogeneity(log): 0.877543
Number of points: 53
 x    y    z    cluster_id
-----------------------------
 1.00  3.00  1.00: 0
 1.00  4.00  1.00: 0
 1.00  5.00  1.00: 0
 1.00  6.00  1.00: 0
 2.00  2.00  1.00: 2
 2.00  3.00  0.00: 1
 2.00  4.00  0.00: 1
 2.00  5.00  0.00: 1
 2.00  6.00  0.00: 1
 2.00  7.00  1.00: 3
 3.00  1.00  1.00: 2
 3.00  2.00  1.00: 2
 3.00  3.00  1.00: 2
 3.00  4.00  0.00: 1
 3.00  5.00  0.00: 1
 3.00  6.00  1.00: 3
 3.00  7.00  1.00: 3
 4.00  1.00  1.00: 2
 4.00  2.00  1.00: 2
 4.00  3.00  0.00: 1
 4.00  4.00  0.00: 1
 4.00  5.00  1.00: -2
 4.00  6.00  0.00: 1
 4.00  7.00  1.00: 3
 4.00  8.00  1.00: 3
 5.00  1.00  1.00: 2
 5.00  2.00  0.00: 1
 5.00  3.00  0.00: 1
 5.00  4.00  0.00: 1
 5.00  5.00  0.00: 1
 5.00  6.00  0.00: 1
 5.00  7.00  1.00: 3
 5.00  8.00  1.00: 3
 6.00  1.00  1.00: 2
 6.00  2.00  0.00: 1
 6.00  3.00  1.00: 3
 6.00  4.00  1.00: 3
 6.00  5.00  1.00: 3
 6.00  6.00  1.00: 3
 6.00  7.00  1.00: 3
 7.00  1.00  1.00: 2
 7.00  2.00  0.00: 1
 7.00  3.00  0.00: 1
 7.00  4.00  0.00: 1
 7.00  5.00  1.00: 3
 8.00  1.00  1.00: 2
 8.00  2.00  1.00: 2
 8.00  3.00  0.00: 1
 8.00  4.00  1.00: 3
 8.00  5.00  1.00: 3
 8.00  6.00  1.00: 3
 9.00  2.00  1.00: 2
 9.00  3.00  1.00: 2

Press enter to exit...


About the epsilon variable: it seems there is a way to pre-calculate it so that it maximizes the homogeneity result, but I'm still stuck on the precalculated Euclidean distance table.
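
On choosing epsilon, a common heuristic (not from this thread, just the usual k-distance trick) is to sort every point's distance to its MinPts-th nearest neighbour and look for the 'elbow' in that curve. A hedged Python sketch:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 1-D data, e.g. one pixel position across a set of frames
data = np.array([0, 1, 1, 1, 1, 1, 1, 1, 25, 15, 96, 19, 1, 1, 2, 25],
                dtype=float).reshape(-1, 1)
min_pts = 3

# Distance of every point to its min_pts-th nearest neighbour
# (the point itself counts as one neighbour, matching DBSCAN's eps-neighbourhood convention)
nn = NearestNeighbors(n_neighbors=min_pts).fit(data)
dists, _ = nn.kneighbors(data)               # shape (n, min_pts); column 0 is the point itself
k_dist = np.sort(dists[:, -1])               # sorted k-distances

# The 'elbow' of this sorted curve is a reasonable starting value for epsilon
print(k_dist)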