The MASM Forum

General => The Laboratory => Topic started by: guga on May 11, 2024, 11:39:49 PM

Title: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 11, 2024, 11:39:49 PM
Hi guys

Has anyone succeeded in creating a DBScan algorithm for a given data set?

I'm trying to create a DBScan clustering function to analyse a chunk of data (in dwords or bytes or whatever input is needed).

The dbscan is defined as in:
https://medium.com/@balajicena1995/dbscan-clustering-2a577d384e61 (https://medium.com/@balajicena1995/dbscan-clustering-2a577d384e61)
https://www.youtube.com/watch?v=87ogbzhXUxo (https://www.youtube.com/watch?v=87ogbzhXUxo)
https://github.com/gyaikhom/dbscan/blob/master/dbscan.c (https://github.com/gyaikhom/dbscan/blob/master/dbscan.c)
https://dev.to/rajaniraiyn/dbscan-clustering-algorithm-demystified-1d5o (https://dev.to/rajaniraiyn/dbscan-clustering-algorithm-demystified-1d5o)
https://www.codeproject.com/Articles/5129186/Step-by-Step-Guide-to-Implement-Machine-Learning-8 (https://www.codeproject.com/Articles/5129186/Step-by-Step-Guide-to-Implement-Machine-Learning-8)
https://en.wikipedia.org/wiki/DBSCAN (https://en.wikipedia.org/wiki/DBSCAN)

The goal is to determine whether a certain chunk of data is homogeneous, i.e. whether the data inside a given chunk contains similar values that can be used to identify/distinguish it from others (for use in image processing, for example, or watermark or object detection etc.).

Homogeneity and Completeness are defined in:
https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html (https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html)
and
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score)

But I didn't understand how this can be implemented. The C version on GitHub is way too complex for me to use as a starting point for a simple algorithm.

ChatGPT gave me some ideas on how to do it in MASM, but I got nowhere.

In the example below I tried to analyse the following sequence:
{0, 1, 1, 1, 1, 1, 1, 1, 25, 15, 96, 19, 1, 1, 2, 25}

Here 1 is the most common value, so the data should produce some value of Homogeneity and Completeness.

When I started, ChatGPT created a scheme for me like this.

My 1st question led to:

QuoteTo determine if pixel values in a sequence of images represent the same pixel or object, especially when the majority of pixels are around a certain value and a few are outliers, you can use clustering algorithms or similarity measures. Here's an approach you can take:

    Preprocess the Data: Normalize the pixel values to a common scale if necessary. This ensures that all values are comparable.

    Cluster Pixel Values: Use a clustering algorithm such as K-means or DBSCAN to group similar pixel values together. This can help identify clusters of pixels that represent the same object or feature in the images. Set the number of clusters based on the expected number of objects or features in the images.

    Identify Outliers: Pixels that do not belong to any cluster or belong to very small clusters can be considered outliers. These outliers may represent noise or background pixels that are different from the main object or feature.

    Evaluate Results: Inspect the clusters and outliers to determine if they correspond to different objects or features in the images. Visualizing the clusters can help you understand the grouping of pixel values.

    Adjust Parameters: If the results are not satisfactory, adjust the parameters of the clustering algorithm (e.g., number of clusters, distance metric) and re-run the analysis.

    Post-processing: Depending on the application, you may need to further process the clustered pixel values to refine the identification of objects or features.

By using clustering algorithms and analyzing the distribution of pixel values in the images, you can determine if the values represent the same pixel or object, and identify outliers that may represent noise or background pixels.

My 2nd question led me to:

QuoteTo demonstrate how clustering algorithms can be used to analyze the distribution of pixel values in a sequence of images, let's consider a simplified example with a set of 20 images. We'll use the K-means clustering algorithm to group similar pixel values together and identify outliers that may represent noise or background pixels.

Assuming each image has the same size and we want to analyze a specific pixel (e.g., pixel at coordinates (x, y)), here's how you can approach it:

    Load the Images:
    Load the 20 images into memory.

    Extract Pixel Values:
    For each image, extract the pixel value at the specified coordinates (x, y). You will have 20 pixel values for this specific pixel location.

    Normalize Pixel Values:
    Normalize the pixel values to a common scale if necessary.

    Apply K-means Clustering:
    Use the K-means clustering algorithm to cluster the pixel values. Set the number of clusters based on the expected number of objects or features in the images.

    Identify Outliers:
    Pixels that do not belong to any cluster or belong to very small clusters can be considered outliers. These outliers may represent noise or background pixels.

Here's a Python example using the scikit-learn library to perform K-means clustering on a set of 20 images:
import numpy as np
from sklearn.cluster import KMeans

# Assume pixel_values is a 1D array containing pixel values at location (x, y) for 20 images
# This is a simplified example, and in practice, you would extract pixel values from actual images
pixel_values = np.array([120, 130, 140, 150, 160, 200, 210, 220, 230, 240,
                          150, 140, 135, 130, 125, 195, 205, 210, 215, 220])

# Reshape the array to have a single feature (required by K-means)
pixel_values = pixel_values.reshape(-1, 1)

# Apply K-means clustering with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(pixel_values)

# Get the cluster labels
cluster_labels = kmeans.labels_

# Get the cluster centers
cluster_centers = kmeans.cluster_centers_

# Identify outliers
outliers = pixel_values[np.where(cluster_labels != np.argmax(np.bincount(cluster_labels)))]

print("Cluster Centers:", cluster_centers.flatten())
print("Outliers:", outliers.flatten())

I then asked it to create a similar function in MASM32, using DBScan instead of K-means.

This is the resultant function ChatGPT gave me:
.DATA
    pixel_values DWORD 120, 130, 140, 150, 160, 200, 210, 220, 230, 240, \
                    150, 140, 135, 130, 125, 195, 205, 210, 215, 220
    num_pixels DWORD 20
    epsilon DWORD 15   ; Adjust epsilon as needed
    minPts DWORD 3      ; Adjust minPts as needed
    cluster_labels DWORD num_pixels DUP(0)
    cluster_count DWORD 0
.CODE
    main PROC
        ; Load addresses of data
        mov esi, OFFSET pixel_values
        mov edi, OFFSET cluster_labels
       
        ; DBSCAN algorithm
        mov ecx, num_pixels
    outer_loop:
        cmp DWORD PTR [edi], 0
        jne skip_iteration   ; Skip if point is already assigned to a cluster
        mov eax, ecx
        dec eax
        mov ebx, eax
        mov edx, epsilon
        shl edx, 2           ; Multiply epsilon by 4 (size of DWORD)
        mov esi, OFFSET pixel_values
        lea esi, [esi + eax * 4]
    inner_loop:
        cmp DWORD PTR [edi + eax * 4], 0
        jne next_iteration   ; Skip if point is already assigned to a cluster
        mov esi, OFFSET pixel_values
        lea esi, [esi + ebx * 4]
        sub esi, OFFSET pixel_values
        mov edx, epsilon
        shl edx, 2           ; Multiply epsilon by 4 (size of DWORD)
        mov eax, ebx
        mov ecx, eax
        call distance       ; Calculate distance between points
        cmp eax, edx        ; Compare distance with epsilon
        jge next_iteration  ; Skip if distance is greater than epsilon
        inc ecx             ; Increment count of neighbors
        cmp ecx, minPts
        jl next_iteration   ; Skip if minimum number of points is not reached
        mov eax, ebx
        inc cluster_count   ; Assign new cluster ID
    assign_cluster:
        mov DWORD PTR [edi + eax * 4], cluster_count
        jmp next_iteration
    skip_iteration:
        dec ecx
    next_iteration:
        dec ebx
        jns inner_loop
        inc ecx
        cmp ecx, num_pixels
        jl outer_loop

        ; Output cluster labels
        mov esi, OFFSET cluster_labels
        mov ecx, num_pixels
    print_clusters:
        mov eax, [esi]
        call WriteDec
        call Crlf
        add esi, 4
        loop print_clusters

        ; Exit program
        call ExitProcess
    main ENDP

    distance PROC
        ; Calculate Euclidean distance between two points
        ; Input: ESI - offset of first point, EDI - offset of second point
        ; Output: EAX - distance
        mov eax, DWORD PTR [esi]
        sub eax, DWORD PTR [edi]
        imul eax, eax
        mov edx, DWORD PTR [esi + 4]
        sub edx, DWORD PTR [edi + 4]
        imul edx, edx
        add eax, edx
        mov edx, DWORD PTR [esi + 8]
        sub edx, DWORD PTR [edi + 8]
        imul edx, edx
        add eax, edx
        mov edx, DWORD PTR [esi + 12]
        sub edx, DWORD PTR [edi + 12]
        imul edx, edx
        add eax, edx
        call Sqrt
        ret
    distance ENDP
END main

But this was just to calculate the cluster_labels. I then asked it to create the other functions, such as cluster_centers and outliers, which led me to:
QuoteYes, the function I provided is just to calculate the cluster_labels based on the DBSCAN algorithm. To find the cluster_centers and outliers, you can modify the code to calculate them after assigning the cluster labels. Here's how you can do it:

    Cluster Centers: After assigning cluster labels, you can calculate the cluster centers by averaging the pixel values of all points in each cluster.

    Outliers: Outliers are points that do not belong to any cluster or belong to very small clusters (noise points). You can identify outliers by checking for points with a cluster label of 0 or labels that are not assigned to a core point (i.e., points with fewer than minPts neighbors).
.DATA
    ...
    cluster_centers DWORD 0, 0  ; Store the cluster centers (assuming 2 clusters)
    outliers DWORD num_pixels DUP(0)  ; Store the indices of outliers
    num_outliers DWORD 0  ; Counter for the number of outliers
.CODE
    ...
    update_cluster_centers PROC
        ; Calculate cluster centers
        mov esi, OFFSET pixel_values
        mov edi, OFFSET cluster_labels
        xor ecx, ecx
        xor edx, edx
    calculate_centers_loop:
        cmp DWORD PTR [edi + ecx * 4], cluster_count
        jne not_in_cluster
        add esi, DWORD PTR [esi + ecx * 4]  ; Add pixel value to sum
        inc edx  ; Increment count of points in cluster
    not_in_cluster:
        inc ecx
        cmp ecx, num_pixels
        jl calculate_centers_loop
        div edx  ; Calculate average pixel value (cluster center)
        mov DWORD PTR [cluster_centers + cluster_count * 4], eax
        inc cluster_count
        ret
    update_cluster_centers ENDP

    identify_outliers PROC
        ; Identify outliers
        mov esi, OFFSET cluster_labels
        xor ecx, ecx
    find_outliers_loop:
        cmp DWORD PTR [esi], 0  ; Check if point is not assigned to any cluster
        jne not_outlier
        mov DWORD PTR [outliers + num_outliers * 4], ecx  ; Add index to outliers
        inc num_outliers
    not_outlier:
        inc ecx
        add esi, 4
        cmp ecx, num_pixels
        jl find_outliers_loop
        ret
    identify_outliers ENDP

    main PROC
        ...
        call update_cluster_centers
        call identify_outliers
        ...
    main ENDP


But this is incorrect and crashed, since it is not pointing to a valid address in the distance function. I then asked it to explain in detail how this thing works... but got more confused than ever. This is what ChatGPT wrote:

QuoteLet's walk through the function step by step with the given pixel_values and assuming epsilon is 10 and minPts is 2 for this example:

    Initialization:
        pixel_values = {120, 130, 140, 150, 160}
        cluster_labels = {0, 0, 0, 0, 0}
        num_pixels = 5
        epsilon = 10
        minPts = 2

    Outer Loop (for pixel 5, value 160):

        Compare cluster label of pixel 5 (cluster_labels[4]) with 0. It is 0.

        Set eax to 4 (index of pixel 5).

        Set ebx to 4.

        Calculate edx (distance threshold): 10 << 2 = 40.

        Calculate address of pixel 5 (esi): pixel_values + 4 * 4 = pixel_values + 16.

        Inner Loop (for pixel 4, value 150):

            Compare cluster label of pixel 5 with 0. It is 0.

            Calculate address of pixel 4 (esi): pixel_values + 3 * 4 = pixel_values + 12.

            Calculate distance between pixel 5 and pixel 4. Assume distance is 10.

            Compare distance (10) with epsilon (40). Distance is less than epsilon.

            Increment neighbor count (ecx) to 1.

            Continue inner loop for pixel 3.

        Continue outer loop for pixel 4.

    Continue the same process for pixels 3, 2, and 1.

    Output Cluster Labels:
        The cluster_labels array after processing may look like: {1, 1, 1, 1, 1} indicating all pixels belong to the same cluster.

Please note that the actual calculation of the distance and clustering logic is not shown here but would be necessary to implement the complete DBSCAN algorithm.

But I'm failing badly at understanding this whole thing.

Can someone help me create such a thing? Or fix the MASM functions ChatGPT wrote?
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: sudoku on May 12, 2024, 01:17:40 AM
The few times I used ChatGPT to write code, it often used the Irvine32 libraries. I gave up on it soon after.
Looking at your code snippet from ChatGPT, I see 'call Crlf' -- is that from the Irvine32 library? There is no header there, so I don't know for sure.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 12, 2024, 01:43:56 AM
I don't know if ChatGPT uses Irvine's code or not. ChatGPT has several flaws when it comes to writing something in MASM (or NASM, FASM, RosAsm), but it gives some idea of how to start or, in some cases, does 50% of the work. The problem is that in this example I'm not able to follow the logic it is showing, because it seems a bit different from the ones explained in Wikipedia and the other links.

It seems to be a simple loop back over each value that needs to be compared to the others, but I'm failing badly to understand the logic and the math behind this.

And porting the algo at https://github.com/gyaikhom/dbscan/blob/master/dbscan.c (https://github.com/gyaikhom/dbscan/blob/master/dbscan.c) to MASM seems to me a bit complex for a task that looks simple at first.
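To keep the logic straight while I try to port it, the core of the algorithm on 1-D data can be condensed to a minimal C sketch like the one below (my own reduction of the standard DBSCAN loop, using 0 = unvisited and -1 = noise; it is not the gyaikhom code):

#include <math.h>
#include <stdlib.h>

/* neighbors of point i within eps (1-D euclidean distance = |a-b|);
   writes their indices to out[] (capacity n) and returns how many */
static int region_query(const double *p, int n, int i, double eps, int *out)
{
    int c = 0;
    for (int j = 0; j < n; j++)
        if (fabs(p[j] - p[i]) <= eps) out[c++] = j;
    return c;
}

/* label[i]: 0 = unvisited, -1 = noise, 1..N = cluster id */
void dbscan_1d(const double *p, int n, double eps, int min_pts, int *label)
{
    int *seeds = malloc(n * sizeof *seeds);
    int *nb = malloc(n * sizeof *nb);
    char *queued = calloc(n, 1);      /* avoids queueing an index twice */
    int cluster = 0;

    for (int i = 0; i < n; i++) label[i] = 0;
    for (int i = 0; i < n; i++) {
        if (label[i] != 0) continue;
        int ns = region_query(p, n, i, eps, seeds);
        if (ns < min_pts) { label[i] = -1; continue; }  /* not a core point */
        cluster++;
        for (int s = 0; s < ns; s++) queued[seeds[s]] = 1;
        for (int s = 0; s < ns; s++) {                  /* expand the cluster */
            int j = seeds[s];
            if (label[j] > 0) continue;
            label[j] = cluster;       /* core, border, or rescued noise */
            int nn = region_query(p, n, j, eps, nb);
            if (nn >= min_pts)        /* j is a core point too: grow the seed list */
                for (int t = 0; t < nn; t++)
                    if (!queued[nb[t]] && label[nb[t]] <= 0)
                        { queued[nb[t]] = 1; seeds[ns++] = nb[t]; }
        }
        for (int s = 0; s < ns; s++) queued[seeds[s]] = 0;
    }
    free(seeds); free(nb); free(queued);
}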
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: sudoku on May 12, 2024, 03:01:02 AM
Have you tried godbolt.org (https://godbolt.org) to help with the conversion/translation to assembly? (Assuming you have a usable source in another language.)
Once converted, it could more easily be optimized, or parts of it changed, to better suit your needs.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: NoCforMe on May 12, 2024, 03:03:29 AM
Would it be possible for you to provide a link that explains what DBScan is in plain English to someone who's unfamiliar with all that jargon? I really can't make head nor tail of it using those links you provided.

(Plus could you link-ize your links? there's a button for that right in the posting editor)
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: NoCforMe on May 12, 2024, 03:11:35 AM
I suppose the Wikipedia entry (https://en.wikipedia.org/wiki/DBSCAN) (hate Wikipedia!) is as good as anything I've found so far:
QuoteIt is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away).

* I had to edit that entry, as it was badly written, as is so much stuff there.

Here's another article (https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80), on Medium by a Portuguese writer who writes better English than most native speakers, in pretty plain language.

Enough with the jargon! Tell me what it is, how it works and what it's used for.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 12, 2024, 03:43:32 AM
Quote from: sudoku on May 12, 2024, 03:01:02 AMHave you tried godbolt.org (https://godbolt.org) to help with the conversion/translation to assembly? (Assuming you have a usable source in another language.)
Once converted, it could more easily be optimized, or parts of it changed, to better suit your needs.

Yeah, I'm doing that. But the C version uses 3D points and I want to use only 1-D points (linear values from a sequence of images). So, take the pixel at pos x = 10, y = 10 in all images: each subsequent image will contain a specific value for that pixel (RGB). I need to compute the dbscan of each channel separately. So: dbscan of 1 pixel at a specific position across all subsequent images.
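To show what I mean, the data fed to dbscan would be gathered with something like this (a hypothetical C sketch; the BGRA frame layout and the function name are only illustrative):

#include <stdint.h>

/* Collect the red-channel value of the pixel at (x, y) from every frame
   into a 1-D series; frames[f] points to 32-bit BGRA pixel data of width w. */
void pixel_series_red(uint8_t **frames, int num_frames, int w,
                      int x, int y, double *series)
{
    for (int f = 0; f < num_frames; f++) {
        const uint8_t *px = frames[f] + 4 * (y * w + x); /* BGRA layout */
        series[f] = (double)px[2];                       /* red channel */
    }
}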
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 12, 2024, 03:44:04 AM
Quote from: NoCforMe on May 12, 2024, 03:03:29 AMWould it be possible for you to provide a link that explains what DBScan is in plain English to someone who's unfamiliar with all that jargon? I really can't make head nor tail of it using those links you provided.

(Plus could you link-ize your links? there's a button for that right in the posting editor)
https://en.wikipedia.org/wiki/DBSCAN (https://en.wikipedia.org/wiki/DBSCAN)
https://www.youtube.com/watch?v=4AW_5nYQkuc (https://www.youtube.com/watch?v=4AW_5nYQkuc)
https://www.youtube.com/watch?v=2Zsz-0K-Ax4 (https://www.youtube.com/watch?v=2Zsz-0K-Ax4)
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 12, 2024, 03:46:49 AM
Quote from: NoCforMe on May 12, 2024, 03:11:35 AMI suppose the Wikipedia entry (https://en.wikipedia.org/wiki/DBSCAN) (hate Wikipedia!) is as good as anything I've found so far:
QuoteIt is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away).

* I had to edit that entry, as it was badly written, as is so much stuff there.

Here's another article (https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80), on Medium by a Portuguese writer who writes better English than most native speakers, in pretty plain language.

Enough with the jargon! Tell me what it is, how it works and what it's used for.

I'm not a native English speaker. I'll have a read of that article.

https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80 (https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80)

If you speak Portuguese, I can talk to you in my native language, but it would be confusing and hard for others to understand.


Btw.... I'm using it to try to find a better way to identify a watermark on a set of images. I gave the Median and MAD algorithms a try, but it seems DBScan may work better.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: HSE on May 15, 2024, 01:36:56 AM
Hi Guga!

Quote from: guga on May 11, 2024, 11:39:49 PMHas anyone succeeded in creating a DBScan algorithm for a given data set

An implementation is present in Comparing K-Means and Others Algorithms for Data Clustering in Assembly (https://masm32.com/board/index.php?topic=11649.0./).

Regards, HSE.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 15, 2024, 02:31:46 AM
Tks HSE. Can you fix the link? It's not pointing to any address. I managed to convert the C file, but it would be nice to take a look at the link you provided.

Btw.... about finding the Homogeneity, I made a small function that may be used for this task. (Unless it was already written in your link in a more accurate way.)


;;

Homogeneity metric of a cluster labeling given a ground truth.

A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.
This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won't change the score value in any way.
This metric is not symmetric: switching label_true with label_pred will return the completeness_score, which will be different in general.
Mathematically, homogeneity has the connotation of invariance, as all components of the equation have the same degree of value whether or not each
of these components is scaled to different values, for example, by multiplication or addition. Cumulative distribution fits this description:
"the state of having identical cumulative distribution functions or values".

The function uses a logarithm sum over the good clusters found in a chain of data. This forces the result to be between 0 and 1 and enforces homogeneity.
In general, the higher the value (closer to 1), the more homogeneous the data is.

    Parameters:
        points(in) - A pointer to the data, formed by an array of point_t structures
        MaxClasses(in) - The maximum amount of usable classes (Cluster_Id) available. This parameter takes into account only the Ids bigger than 0.
        num_points(in) - The total amount of elements in the data chain.
        pHomogeneity(out) - Pointer to a variable that will hold the value of the homogeneity. The variable must be a Real8 (8 bytes).

    Return Value: The function does not return any value. The result is stored in the pHomogeneity parameter.
                  The range of the result is between 0 and 1.

    Remarks:
        The mathematical equation for this is as follows:

        homogeneity = 1 + ( log(cluster_id(0)/TotalCluster_id0) +
                            log(cluster_id(1)/TotalCluster_id1) +
                            log(cluster_id(2)/TotalCluster_id2) +
                            ...
                            log(cluster_id(N)/TotalCluster_idN) ) / NumPoints

        where cluster_id(i) is the Id value itself and TotalCluster_idI is the amount of elements belonging to that Id.

        But we exclude Noise (negative Ids) and also Id 0. Excluding Id 0 is needed because it can also be interpreted as Noise, and also there's no log(0).
        A 0 Id means the data is close to a cluster but can still be considered noise. A negative value (-2) is pure noise, while 0 can represent 'Border' Noise.
        So, in fact, our equation results only in:

        homogeneity = 1 + ( log(cluster_id(1)/TotalCluster_id1) +
                            log(cluster_id(2)/TotalCluster_id2) +
                            log(cluster_id(3)/TotalCluster_id3) +
                            ...
                            log(cluster_id(N)/TotalCluster_idN) ) / NumPoints

    In other words, it takes a summation of the log of each ClusterId divided by the total amount of elements related to that Id. After summing them,
    it divides by the total amount of elements of the data chunk (num_points) and finally adds 1 to the result. This keeps the resultant value between
    0 and 1.
   
    For example, say you have a data chunk formed by a sequence of data (53 elements = points of x, y, z) where you have already calculated the Cluster_ids
    associated with each point of the data chain, like this:

        x     y     z     cluster_id
        -----------------------------
        1.00  3.00  1.00: 0
        1.00  4.00  1.00: 0
        1.00  5.00  1.00: 0
        1.00  6.00  1.00: 0
        2.00  2.00  1.00: 2
        2.00  3.00  0.00: 1
        2.00  4.00  0.00: 1
        2.00  5.00  0.00: 1
        2.00  6.00  0.00: 1
        2.00  7.00  1.00: 3
        3.00  1.00  1.00: 2
        3.00  2.00  1.00: 2
        3.00  3.00  1.00: 2
        3.00  4.00  0.00: 1
        3.00  5.00  0.00: 1
        3.00  6.00  1.00: 3
        3.00  7.00  1.00: 3
        4.00  1.00  1.00: 2
        4.00  2.00  1.00: 2
        4.00  3.00  0.00: 1
        4.00  4.00  0.00: 1
        4.00  5.00  1.00: -2
        4.00  6.00  0.00: 1
        4.00  7.00  1.00: 3
        4.00  8.00  1.00: 3
        5.00  1.00  1.00: 2
        5.00  2.00  0.00: 1
        5.00  3.00  0.00: 1
        5.00  4.00  0.00: 1
        5.00  5.00  0.00: 1
        5.00  6.00  0.00: 1
        5.00  7.00  1.00: 3
        5.00  8.00  1.00: 3
        6.00  1.00  1.00: 2
        6.00  2.00  0.00: 1
        6.00  3.00  1.00: 3
        6.00  4.00  1.00: 3
        6.00  5.00  1.00: 3
        6.00  6.00  1.00: 3
        6.00  7.00  1.00: 3
        7.00  1.00  1.00: 2
        7.00  2.00  0.00: 1
        7.00  3.00  0.00: 1
        7.00  4.00  0.00: 1
        7.00  5.00  1.00: 3
        8.00  1.00  1.00: 2
        8.00  2.00  1.00: 2
        8.00  3.00  0.00: 1
        8.00  4.00  1.00: 3
        8.00  5.00  1.00: 3
        8.00  6.00  1.00: 3
        9.00  2.00  1.00: 2
        9.00  3.00  1.00: 2

        From the above data chain (points) we have the following clusters (groups), with Ids from -2 to 3. We need to count how many elements we have for each Cluster_id:
           
            ClusterId = -2 = 1 element   (Pure Noise)
            ClusterId = 0 = 4 elements  (Border Noise)
            ClusterId = 1 = 19 elements
            ClusterId = 2 = 13 elements
            ClusterId = 3 = 16 elements
           
            Total elements = 53 (including Noise and border noise)

            Negative Ids (such as -2) and 0 represent Noise, therefore they are not counted in the homogeneity equation.

            So, it will result in:
                homogeneity = 1 + (log(1/19)+log(2/13)+log(3/16))/53 = 0.8775430643464206899449251...

;;

[TmpLogValue: R$ 0]

Proc calculate_homogeneity_Log:
    Arguments @points, @MaxClasses, @num_points, @pHomogeneity
    Local @ClusterCount
    Uses ecx, esi, edi

    xorpd xmm0 xmm0
    ...If D@MaxClasses > 1 ; Skip class = 0 (We added earlier to 1)
        mov esi D@points
        mov edi 1 ; start with Cluster_Id = 1
        .Do
            mov D@ClusterCount 0
            mov ecx D@num_points
            Do
                If D$esi+cluster_idDis = edi
                    inc D@ClusterCount
                End_If
                add esi Size_Of_point_t
                dec ecx
            Loop_Until ecx = 0
            mov esi D@points
            cvtsi2sd xmm1 edi ; converts a signed integer to double
            cvtsi2sd xmm2 D@ClusterCount | divsd xmm1 xmm2
            movsd X$TmpLogValue xmm1 | movsd xmm2 xmm0
            call Sse2_log TmpLogValue, SSE_EXP_REAL8
            addsd xmm0 xmm2
            inc edi
        .Loop_Until edi >= D@MaxClasses
        cvtsi2sd xmm1 D@num_points | divsd xmm0 xmm1
    ...End_If
    mov eax 1 | cvtsi2sd xmm1 eax ; converts a signed integer to double
    addsd xmm0 xmm1
    mov eax D@pHomogeneity
    movsd X$eax xmm0

EndP
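Just to make the metric clearer, here is the same computation as a minimal C sketch (my reading of the proc above; the labels filled in main are a hypothetical reconstruction using the counts from the worked example):

#include <math.h>
#include <stdio.h>

/* Minimal C sketch of the log-based homogeneity metric above.
   labels[] holds the cluster_id of each point; ids <= 0 (noise and
   border noise) are skipped, exactly as in the proc. */
double homogeneity_log(const int *labels, int num_points, int max_classes)
{
    double sum = 0.0;
    for (int id = 1; id < max_classes; id++) {
        int count = 0;
        for (int i = 0; i < num_points; i++)
            if (labels[i] == id) count++;
        if (count > 0)
            sum += log((double)id / (double)count);   /* natural log */
    }
    return 1.0 + sum / (double)num_points;
}

int main(void)
{
    /* hypothetical labels with the counts of the worked example:
       id 1 = 19, id 2 = 13, id 3 = 16, border noise = 4, pure noise = 1 */
    int labels[53], i = 0;
    while (i < 19) labels[i++] = 1;
    while (i < 32) labels[i++] = 2;
    while (i < 48) labels[i++] = 3;
    while (i < 52) labels[i++] = 0;
    labels[52] = -2;
    printf("%.10f\n", homogeneity_log(labels, 53, 4)); /* ~0.8775430643 */
    return 0;
}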

Btw... does it work for dbscan too?
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 15, 2024, 02:47:12 AM
Oh, got it.. here, right?

https://masm32.com/board/index.php?topic=11649.0 (https://masm32.com/board/index.php?topic=11649.0)
https://github.com/ASMHSE/Clusters-in-Assembly/tree/main (https://github.com/ASMHSE/Clusters-in-Assembly/tree/main)

It does contain dbscan. Very good work. I'll take a further look later and try to understand it.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: HSE on May 15, 2024, 06:20:21 AM
Quote from: guga on May 15, 2024, 02:47:12 AMOh, got it.. here, right?

Exactly  :thumbsup:
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: six_L on May 16, 2024, 07:14:27 PM
A very interesting topic.

The parameters (epsilon and MinPts) of DBSCAN can have a significant impact on the clustering results. Careful parameter tuning is often required for optimal performance.

Hi, guga
Are you using DBSCAN to analyse some images?
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 17, 2024, 12:24:42 AM
Quote from: six_L on May 16, 2024, 07:14:27 PMA very interesting topic.

The parameters (epsilon and MinPts) of DBSCAN can have a significant impact on the clustering results. Careful parameter tuning is often required for optimal performance.

Hi, guga
Are you using DBSCAN to analyse some images?
Hi six_L, yes, I'm trying to port the JavaScript example from that GitHub link to assembly to handle images. The JavaScript example segments the image using the DBScan algorithm. I guess I succeeded in porting it, but it is incredibly slow. Maybe using a table of precalculated euclidean distances can help, but I'm not able to precalculate it correctly. I'm failing to understand the math behind it.
(https://i.postimg.cc/K1hNSG6q/screenshot.png) (https://postimg.cc/K1hNSG6q)


Btw.. I attached the files I succeeded in porting (the simpler one). Source code is embedded in the exe (for RosAsm), and I also included the asm file and the necessary example.dat it uses to do the calculations. I also attached the source code split into its own titles to make it easier to follow. The part of the code related to dbscan is in part01.asm. Top.asm and SSE_Macros.asm are only the macros I used for that file. The other asm files are simply additional functions for the console and fast logarithm functions.


The output is this (where -2 = pure noise and 0 means noise but closer to the clusters, I suppose):
QuoteEpsilon: 1.000000
Minimum points: 2
Homogeneity: 0.000000
Homogeneity(log): 0.877543
Number of points: 53
 x    y    z    cluster_id
-----------------------------
 1.00  3.00  1.00: 0
 1.00  4.00  1.00: 0
 1.00  5.00  1.00: 0
 1.00  6.00  1.00: 0
 2.00  2.00  1.00: 2
 2.00  3.00  0.00: 1
 2.00  4.00  0.00: 1
 2.00  5.00  0.00: 1
 2.00  6.00  0.00: 1
 2.00  7.00  1.00: 3
 3.00  1.00  1.00: 2
 3.00  2.00  1.00: 2
 3.00  3.00  1.00: 2
 3.00  4.00  0.00: 1
 3.00  5.00  0.00: 1
 3.00  6.00  1.00: 3
 3.00  7.00  1.00: 3
 4.00  1.00  1.00: 2
 4.00  2.00  1.00: 2
 4.00  3.00  0.00: 1
 4.00  4.00  0.00: 1
 4.00  5.00  1.00: -2
 4.00  6.00  0.00: 1
 4.00  7.00  1.00: 3
 4.00  8.00  1.00: 3
 5.00  1.00  1.00: 2
 5.00  2.00  0.00: 1
 5.00  3.00  0.00: 1
 5.00  4.00  0.00: 1
 5.00  5.00  0.00: 1
 5.00  6.00  0.00: 1
 5.00  7.00  1.00: 3
 5.00  8.00  1.00: 3
 6.00  1.00  1.00: 2
 6.00  2.00  0.00: 1
 6.00  3.00  1.00: 3
 6.00  4.00  1.00: 3
 6.00  5.00  1.00: 3
 6.00  6.00  1.00: 3
 6.00  7.00  1.00: 3
 7.00  1.00  1.00: 2
 7.00  2.00  0.00: 1
 7.00  3.00  0.00: 1
 7.00  4.00  0.00: 1
 7.00  5.00  1.00: 3
 8.00  1.00  1.00: 2
 8.00  2.00  1.00: 2
 8.00  3.00  0.00: 1
 8.00  4.00  1.00: 3
 8.00  5.00  1.00: 3
 8.00  6.00  1.00: 3
 9.00  2.00  1.00: 2
 9.00  3.00  1.00: 2

Press enter to exit...


About the epsilon variable, it seems there is a way to precalculate it so it maximizes the homogeneity results, but I'm still stuck at the precalculated euclidean distance table.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 17, 2024, 04:59:54 PM
New version in C (to be compiled with Pelles) with a bit of optimization and precalculation of the euclidean distances.

I'll try to optimize it even further before porting it to assembly.

Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 19, 2024, 12:17:37 AM
I succeeded in automating the process of identifying Epsilon and Minimum Points. From the articles I'm reading, the smaller epsilon is, the better.

https://hyperskill.org/learn/step/31529
https://mrinalyadav7.medium.com/dbscan-algorithm-c894701306d5

In the attached version, it generated only 2 epsilon values.

Main:
....
        call GetMinPoints DBSCAN_DIM_1D
        mov D$MinPoints eax
        call GetEpsilon D$points, D$Num_Points, Epsilon, DBSCAN_DIM_1D


        call dbscan D$points, D$Num_Points, Epsilon, D$MinPoints, Homogeneity, HomogeneityLog
....

;;
   Automatically find epsilon
   http://www.sefidian.com/2022/12/18/how-to-determine-epsilon-and-minpts-parameters-of-dbscan-clustering/
   https://mrinalyadav7.medium.com/dbscan-algorithm-c894701306d5
   https://hyperskill.org/learn/step/31529
;;

[DBSCAN_DIM_1D 1]
[DBSCAN_DIM_2D 2]
[DBSCAN_DIM_3D 3]

Proc GetMinPoints:
    Arguments @DimensionFlags

    mov eax D@DimensionFlags
    shl eax 1

EndP

Proc GetEpsilon:
    Arguments @points, @num_points, @pOutput, @Flag
    Local @k_distances, @kFactor
    Uses ebx, esi, edi, ecx, edx

    xor eax eax
    On D@num_points = 0, ExitP

    C_call 'msvcrt.calloc' D@num_points, 8
    mov D@k_distances eax

    mov eax D@Flag
    .If_Or eax = DBSCAN_DIM_1D, eax = 0
        mov eax 2 ; dimensions*2 = 1*2
    .Else_If eax = DBSCAN_DIM_2D
        mov eax 4 ; dimensions*2 = 2*2
    .Else_If eax = DBSCAN_DIM_3D
        mov eax 6 ; dimensions*2 = 3*2
    .Else_If_And eax > 3, eax < 7
        mov eax 7
    .Else_If eax > 10
        inc eax
    .End_If
    mov D@kFactor eax

    ; Calculate k-distances for each point
    call calculate_k_distances D@points, D@num_points, D@kFactor, D@k_distances
    call find_knee_point D@k_distances, D@num_points

    mov eax D@pOutput | fstp R$eax

    C_call 'msvcrt.free' D@k_distances

EndP

; Function to calculate k-distance for each point

Proc calculate_k_distances:
    Arguments @points, @num_points, @kFactor, @k_distances
    local @Distances, @PointsIPos
    Uses ebx, esi, edi, ecx, edx

    xor eax eax
    On D@num_points = 0, ExitP

    C_call 'msvcrt.calloc' D@num_points, 8
    mov D@Distances eax
    mov ebx eax
    mov esi D@points
    mov D@PointsIPos esi
    xor edi edi

    .Do

        mov esi D@points
        xor ecx ecx
        Do
       
            call euclidean_dist D@PointsIPos, esi
            fstp R$ebx+ecx*8
            add esi Size_Of_point_t
            inc ecx
        Loop_Until ecx >= D@num_points ; jl

        C_call 'msvcrt.qsort' ebx, D@num_points, 8, compare_doubles
        ;  k-th nearest neighbor distance
        mov eax D@kFactor
        mov edx D@k_distances
        fld R$ebx+eax*8 | fstp R$edx+edi*8

        add D@PointsIPos Size_Of_point_t
        inc edi
    .Loop_Until edi >= D@num_points

    C_call 'msvcrt.free' D@Distances

EndP

Proc compare_doubles:
    Arguments @Arg1, @Arg2

    mov eax D@Arg1 | movsd XMM0 X$eax
    mov eax D@Arg2 | movsd XMM1 X$eax

    xor eax eax
    SSE_D_If xmm1 > xmm0
        mov eax 0-1
    SSE_D_Else_If xmm1 < xmm0
        mov eax 1
    SSE_D_End_If

EndSTD

[KneeRate: R$ 0.1]

Proc find_knee_point:
    Arguments @k_Distance, @Num_Points
    Local @KneeIndex
    Uses ebx, esi, ecx, edx

    mov ebx D@k_Distance
    mov esi D@Num_Points
    C_call 'msvcrt.qsort' ebx, esi, 8, compare_doubles
    fild D@Num_Points | fmul R$KneeRate | fistp D@KneeIndex

;fld R$ebx;+eax*8
;mov eax 52 | fld R$ebx+eax*8

    mov eax D@KneeIndex | fld R$ebx+eax*8

EndP



In the function find_knee_point, a table of distance values (the possible values of epsilon) is built as an array of data (size = NumPoints*8, 8 because we are dealing with Real8 values). The generated contents are sorted in ascending order. In the example we have 53 values, with only 2 unique epsilon values, from the smallest to the biggest.

The function chooses the one located 10% of the way into the array. Later I'll make a routine to get rid of the duplicated data and insert the distinct values into another table, to make it easier for the user to choose what type of data he wants to identify.

According to the 2nd article, epsilon can be better described as:
Quoteε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to the k = minPts-1 nearest neighbor ordered from the largest to the smallest value. Good values of ε are where this plot shows an "elbow": if ε is chosen much too small, a large part of the data will not be clustered; whereas for a too high value of ε, clusters will merge and the majority of objects will be in the same cluster. In general, small values of ε are preferable, and as a rule of thumb, only a small fraction of points should be within this distance of each other.


Btw, so far ChatGPT didn't succeed in helping me make the algo much faster. It is incredibly slow when working on real images. The alternative I proposed was to create a table in memory to directly point to the addresses in the get_neighbor routines, but it turns out I'd need an insanely huge table for that (NumPixels*8*NumPixels). I'm trying to overcome this and use only 1 table (size = NumPixels*8) and then make some lighter arithmetic function to calculate the values of the next indexed tables. Still thinking about how to do it....


About epsilon and minimum points, ChatGPT answered me like this (which proved to be correct in the end):

QuoteThe choice of k is not strictly fixed and can vary based on the characteristics of the data and the specific requirements of the clustering task. Here's a more detailed explanation of how to determine an appropriate k value for different dimensions:
1D Data

For 1D data, clusters are linear, and local density is easier to determine. Therefore, a small k value is usually sufficient.

    k = 2: Using the nearest neighbor (1st neighbor) can be enough to define local density in 1D data.

2D Data

For 2D data, clusters are more complex, and you generally need a slightly larger k to capture the local density effectively.

    k = 4: This is a commonly used heuristic. A point and its three nearest neighbors can adequately define a local dense region in 2D space.

3D Data

For 3D data, the complexity increases due to the additional dimension, requiring a larger k value to capture the local density accurately.

    k = 6: This value is chosen to include a point and its five nearest neighbors, which helps define the local density in 3D space.
    k = 7-10: Depending on the noise level and the expected cluster structure, you might need a larger k to ensure robustness.

Choosing k Based on Dimensions

The heuristic k = dimensions + 1 is a starting point, but it may need to be adjusted based on empirical observations and the specific dataset. The values mentioned (e.g., k = 6 for 3D) come from practical experience and empirical studies in clustering literature, where a slightly larger k helps to better capture the density in higher dimensions.
Example Adjustments in Code

Here is how you can set k based on the dimensionality of your data and adjust it if necessary:

Its piece of code was originally this:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

// Structure to represent a point in 3D (adjust for higher dimensions if needed)
typedef struct {
    double x, y, z; // Add more dimensions as necessary
} point_t;

// Forward declaration for the qsort comparator defined below (missing in the original listing)
int compare_doubles(const void *a, const void *b);

// Function to calculate Euclidean distance between two points
double euclidean_distance(point_t *a, point_t *b) {
    return sqrt(pow(a->x - b->x, 2) + pow(a->y - b->y, 2) + pow(a->z - b->z, 2)); // Adjust for dimensions
}

// Function to calculate k-distance for each point
void calculate_k_distances(point_t *points, int num_points, int k, double *k_distances) {
    for (int i = 0; i < num_points; ++i) {
        double distances[num_points];
        for (int j = 0; j < num_points; ++j) {
            distances[j] = euclidean_distance(&points[i], &points[j]);
        }
        // Sort distances to find the k-th nearest neighbor
        qsort(distances, num_points, sizeof(double), compare_doubles);
        k_distances[i] = distances[k]; // k-th nearest neighbor distance
    }
}

// Comparison function for qsort
int compare_doubles(const void *a, const void *b) {
    double arg1 = *(const double*)a;
    double arg2 = *(const double*)b;
    if (arg1 < arg2) return -1;
    if (arg1 > arg2) return 1;
    return 0;
}

// Function to find the "knee" point in k-distances plot
double find_knee_point(double *k_distances, int num_points) {
    // Sort the k-distances to create the k-distances plot
    qsort(k_distances, num_points, sizeof(double), compare_doubles);
    // Simple heuristic: return the distance at the 90th percentile
    return k_distances[(int)(0.9 * num_points)];
}

int main() {
    // Assuming points is already populated
    point_t *points = ...; // Replace with actual points initialization
    int num_points = ...; // Replace with actual number of points
    int dimensions = 3; // Adjust based on the dimensionality of your data
    int k = dimensions + 3; // Adjusted heuristic for k (example: dimensions + 3)
    double k_distances[num_points];

    // Calculate k-distances for each point
    calculate_k_distances(points, num_points, k, k_distances);
    // Find a suitable epsilon using the knee point of the k-distances plot
    double epsilon = find_knee_point(k_distances, num_points);
    // Determine minPts, often set as 2 * dimensions
    int minPts = 2 * dimensions;

    // Now run DBSCAN with the calculated epsilon and minPts
    int num_clusters = dbscan_run(points, epsilon, minPts, num_points);

    // Continue with your process...

    return 0;
}


And then it answered me:
QuoteSummary

    1D Data: k = 2 is typically sufficient.
    2D Data: k = 4 is a commonly used heuristic.
    3D Data: k = 6 or k = 7-10 based on noise and cluster structure.
    Higher Dimensions: k = dimensions + 1 or slightly higher based on empirical observation.

These values are starting points, and you may need to adjust k based on your specific dataset and the results you observe.


The resultant values were the same as the ones I got earlier, so it seems the routines I created to automate epsilon and min points were correct:

Epsilon: 1.000000
Minimum points: 2
Homogeneity: 0.000000
Homogeneity(log): 0.877543
Number of points: 53
 (the per-point x/y/z/cluster_id table is identical to the output posted above)

Press enter to exit...



And below are some links I found with examples of how to create a faster library for it:
https://github.com/Eleobert/dbscan
https://proceedings.mlr.press/v157/weng21a.html

I didn't test these new routines yet, nor have I ported them to plain C or assembly. Thinking....
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: six_L on May 19, 2024, 07:11:45 PM
Hi, guga
First of all, congratulations on the progress you have made in your research.
Next, thank you for providing a lot of useful information about DBSCAN. As everyone knows, if I use Bing search I get very little useful information but waste a lot of time.

Thank you very much!
;-------------------------------------
1) About image analysing, I remember something.
A long time ago, there was a criminal who specialized in robbing and killing people in front of banks. More than ten people had been killed. The police publicly ordered the arrest of the criminal and released front and side photos of him (he had been to jail before). When the criminal committed an offence again, the police and military dispatched more than 5000 armed personnel to surround and annihilate him. The criminal died. The on-site photos were very bloody; only one ear could be recognized. Due to the difference between the two ears in the published photos, many people doubted that the person who was killed was the criminal. The police then wrote an article to prove that the person who was killed was the criminal. I ultimately believed the police. I also understood a bit of the principle of projection.

2) The technology of image generation
Light; radar; infrared rays; ultraviolet rays; radio telescopes; ultrasonic waves; nuclear magnetic resonance. These images have different characteristics.

3) About distance
There are the euclidean distance, the riemannian distance, the cosine distance and the manhattan distance.

4) Training the DBSCAN model
We need typical image data to train the DBSCAN model and get the optimum epsilon and minPts parameters.

5) Verifying the DBSCAN model with data
I found that most DBSCAN implementations use mathematical data, not real image data. Only the embryo example used real image data.

regards.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 20, 2024, 12:11:13 AM
Hi Six_L

tks :)


About usage on real images, it is possible to do, most likely in the same way as the embryo image used in the JavaScript example. The main problem seems to be the insane amount of memory that needs to be allocated in order to make it work faster. I'm thinking about how to overcome this.
If I succeed in making a routine that is faster on 1-D data, the same can also be used in any other dimension, I suppose.

Btw... I also found a way to identify the maximum number of points a cluster can have. The function is like this:

;;
    Max Points = ((k+1)/2)^D
;;

[DBSCAN_DIM_1D 1] ; 1 Dimension
[DBSCAN_DIM_2D 2] ; 2 Dimensions
[DBSCAN_DIM_3D 3] ; 3 Dimensions


Proc GetMaxPoints:
    Arguments @DimensionFlags
    Local @kFactorPlus1

    mov eax D@DimensionFlags
    .If_Or eax = DBSCAN_DIM_1D, eax = 0
        mov eax 2 ; dimensions*2 = 1*2
    .Else_If eax = DBSCAN_DIM_2D
        mov eax 4 ; dimensions*2 = 2*2
    .Else_If eax = DBSCAN_DIM_3D
        mov eax 6 ; dimensions*2 = 3*2
    .Else_If_And eax > 3, eax < 7
        mov eax 7
    .Else_If eax > 10
        inc eax
    .End_If
    inc eax
    mov D@kFactorPlus1 eax | imul eax eax | imul eax D@kFactorPlus1 ; (k+1)^3
    shr eax 3 ; divide by 8 -> ((k+1)/2)^3, i.e. the D = 3 case of the formula

EndP


This time, ChatGPT was useful in helping me find the maximum values. It answered me that:
QuoteTo calculate the maximum number of points in a dataset for a given value of k and a specific dimensionality D, you can use the following formula:

Max points = ⌊((k+1)/2)^D⌋

Where:

⌊x⌋ denotes the floor function, which returns the largest integer less than or equal to x.
k is the number of nearest neighbors used in the density estimation.
D is the dimensionality of the data.

For example, let's say you want to calculate the maximum number of points for k=6 in 3D data:

Max points = ⌊((6+1)/2)^3⌋
Max points = ⌊(7/2)^3⌋
Max points = ⌊3.5^3⌋
Max points = ⌊42.875⌋
Max points = 42

So, for k=6 in 3D data, the maximum number of points would be 42.

So, each cluster is now properly identifiable by the minimum and maximum number of points it can have.
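As a cross-check, the general formula can be written directly as a tiny C helper (hypothetical; note that the GetMaxPoints proc above hard-codes the D = 3 case, (k+1)^3/8):

#include <math.h>

/* Max points = floor(((k+1)/2)^D), for any dimensionality D */
int max_points(int k, int dims)
{
    return (int)floor(pow((k + 1) / 2.0, (double)dims));
}
/* max_points(6, 3) = floor(3.5^3) = floor(42.875) = 42, as in the quote */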

Maybe knowing the maximum amount of points can give me more clues on how to optimize this whole thing.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 21, 2024, 04:05:31 AM
I'm currently reviewing the code for min and max points and also trying to implement a routine to make this thing faster.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: six_L on May 21, 2024, 08:32:29 AM
Hi, guga
I want to select a part of an image for dbscanning; this is for creating some test datasets automatically. But the SCROLL of the static control doesn't work, so I am progressing slowly for now.

regards
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 21, 2024, 04:50:06 PM
That´s pretty good  :greenclp:  :greenclp:  :greenclp:

I'm doing some more tests to see if I can optimize it even further. DBScan seems to be an impressive algo for images. For noise removal, dbscan can maybe do a wonderful job of removing everything that is noise. Perhaps we simply have to run the algo a couple of times to remove the noise and thus enhance the image completely, even without the need to use large models etc.

I was thinking that, for noise removal, all we need to do relies on a few steps working on separated channels or on the full RGB contents of the image. Personally, I would try separated channels 1st.

The strategy may be resumed like this:

1 - Run DBScan once to identify the noise all over the image. It will result in well formed clusters and the corresponding noise that may exist near them.
2 - Create a routine to identify where the noise pixels are and then remove them using the average of the values of the clusters that surround that noise or are closest to it. For example, let's say we run the whole algo on separated channels.

So we identify the noise only on the Red channel, then we check if each noise pixel is close to one or more clusters. We then calculate the average (or standard deviation, which may be better) of the colors of the cluster (or clusters) near the noise and apply the resultant value to that specific noise pixel. Then we go through the whole image and do it all over again, until we eliminate the majority of the noise.

We do that step only when we have 1 noise pixel surrounded by or close to a good cluster. If we have 2 or more noise pixels close to each other (even if they are close to or surrounded by clusters), we skip this routine, because we are mainly focusing on isolated noise pixels.

3 - Now our image/channel either contains no noise (if all noise pixels were isolated ones) or may still contain small chunks of noise near each other and surrounded by good clusters. Since we worked on the isolated noise previously and set its new values based on the average (or standard deviation) of the clusters nearby, we then run DBScan again.

Why? Because since we applied new values to what was noise, we may have formed a new set of clusters that uses those new pixels, thus enhancing the image a bit more.

4 - If the result of step 3 still contains noise (or a new set of noise), we do steps 1-3 all over again.

This is done until we reach no noise in the resultant image.

Also, we can make a routine to bail out of the infinite loops that may be generated if steps 1-4 always result in small chunks of noise close to each other. In that case, we can leave the loop and do a couple of things:

a) End the routine, since the majority of the noise is now gone.
b) Create another routine to deal with those small chunks of noise close to good clusters, and then work out a new strategy for how to handle them. Perhaps start by checking for small groups of noise pixels. For example:
1 - If we have 2 noise pixels surrounded by 3 good clusters, we apply the average/standard deviation biased towards the cluster each noise pixel is closest to (see image attached).
Noise1 is closer to Cluster A than to B or C, so the average (or standard deviation) color of Cluster A influences the result more than Cluster B or C. So we can apply a weighted mix of their average colors, perhaps biased on the distances between them.
Noise2 is closer to Cluster B than to Cluster A or C, and we do the same as above.
c) Once we do this, we run dbscan all over again... and keep doing this until all the noise is removed. A rough sketch of this loop is below.
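For illustration only, the loop I have in mind would look roughly like this in C (every helper name here is a hypothetical placeholder, not a routine from my port; dbscan_channel stands for whatever DBScan implementation is used):

/* Sketch of the iterative denoise strategy above (steps 1-4).
   labels: > 0 = cluster id, 0 = border noise, -2 = pure noise. */

/* provided elsewhere (hypothetical signature): clusters one channel and
   fills labels[]; returns how many noise pixels were found */
extern int dbscan_channel(const float *chan, int w, int h, int *labels);

/* true if (x,y) is noise and none of its 8 neighbors is also noise */
static int is_isolated_noise(const int *lbl, int w, int h, int x, int y)
{
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if ((dx || dy) && nx >= 0 && ny >= 0 && nx < w && ny < h &&
                lbl[ny * w + nx] <= 0)
                return 0;               /* a neighbor is noise too: skip */
        }
    return 1;
}

/* average of the 8 neighbors that belong to a cluster (label > 0) */
static float cluster_avg(const float *chan, const int *lbl,
                         int w, int h, int x, int y)
{
    float sum = 0.0f; int cnt = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if ((dx || dy) && nx >= 0 && ny >= 0 && nx < w && ny < h &&
                lbl[ny * w + nx] > 0) { sum += chan[ny * w + nx]; cnt++; }
        }
    return cnt ? sum / (float)cnt : chan[y * w + x];
}

#define MAX_PASSES 16   /* bail-out guard against the infinite loop case */

void denoise_channel(float *chan, int *labels, int w, int h)
{
    for (int pass = 0; pass < MAX_PASSES; pass++) {
        if (dbscan_channel(chan, w, h, labels) == 0) return; /* steps 1/3 */
        int replaced = 0;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (labels[y * w + x] <= 0 &&
                    is_isolated_noise(labels, w, h, x, y)) {
                    chan[y * w + x] = cluster_avg(chan, labels, w, h, x, y);
                    replaced++;                              /* step 2 */
                }
        if (replaced == 0) return;  /* only chunks of noise left: cases a/b */
    }
}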



Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 23, 2024, 11:14:32 AM
New version

A bit faster, but I'm not quite sure about the results. I'll double-check later.
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: six_L on May 25, 2024, 05:31:45 PM
Hi, guga
The translation of DBSCAN.C is completed. Compilation passed; testing is still in progress.
DBSCAN.C has some mazy data structures. Only "euclidean_dist" and "fcpu_cmp_val" work normally so far.
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
;point_s STRUCT        ;this is the struct from dbscan.cpp, we need to analyse, modify
;    x        REAL8 ?
;    y        REAL8 ?
;    z        REAL8 ?
;    cluster_id    DWORD ?
;point_s ENDS
;point_t typedef ptr point_s
point_s STRUCT        ;this is the struct from dbscan.cpp, we need to analyse image, modify
    x        QWORD ?    ; x coordinate
    y        QWORD ?    ; y coordinate
    z        QWORD ?    ; Pixel
    cluster_id    QWORD ?
point_s ENDS
point_t typedef ptr point_s

dist_t typedef ptr dist
dist STRUCT
    a    point_s <>
    b    point_s <>
dist ENDS

node_t typedef ptr node_s
node_s STRUCT
    index        QWORD ?
    next        point_t ?
node_s ENDS

epsilon_neighbours_t typedef ptr epsilon_neighbours_s
epsilon_neighbours_s STRUCT
    num_members    QWORD ?
    head        node_t ?
    tail        node_t ?
epsilon_neighbours_s ENDS
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
create_node            PROTO index:QWORD
append_at_end            PROTO index:QWORD,en:epsilon_neighbours_t
get_epsilon_neighbours        PROTO index:QWORD,points:point_t,num_points:QWORD,epsilon:REAL8,@dist:dist_t
print_epsilon_neighbours    PROTO points:point_t,en:epsilon_neighbours_t
destroy_epsilon_neighbours    PROTO en:epsilon_neighbours_t
dbscan                PROTO points:point_t,num_points:QWORD,epsilon:REAL8,minpts:QWORD,@dist:dist_t
expand PROTO index:QWORD,cluster_id:QWORD,points:point_t,num_points:QWORD,epsilon:REAL8,minpts:QWORD,@dist:dist_t
spread PROTO index:QWORD,seeds:epsilon_neighbours_t,cluster_id:QWORD,points:point_t,num_points:QWORD,epsilon:REAL8,minpts:QWORD,@dist:dist_t
euclidean_dist            PROTO a:point_t,b:point_t
adjacent_intensity_dist        PROTO a:point_t,b:point_t
parse_input            PROTO hWndSrc:QWORD
print_points            PROTO points:point_t,num_points:QWORD
fcpu_cmp_val proc Val_1:REAL8, Val_2:REAL8
   
    finit
    fld    Val_1
    ;fild    val_2        ;if val_2 is int.
    fld    Val_2
    fcompp 
    fstsw    ax              ;retrieve exception flags from FPU
    fwait
    shr    al,1            ;test for invalid operation
    jc    @err            ;clean-up and return result
    sahf                    ;transfer to the CPU flags
    jpe    @err            ;error if non comparable
    ja    @Greater
    jc    @F
    mov    rax,3        ;val_1 = val_2
    jmp    @Exit
@@:
    mov    rax,1        ;val_1 < val_2
    jmp    @Exit
@Greater:
    mov    rax,2        ;val_1 > val_2
@Exit:
        ret
@err:
    xor    rax,rax        ;rax = 0 -> Error
    ret

fcpu_cmp_val endp
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
euclidean_dist proc uses rsi rdi a:point_t,b:point_t
    Local x1:REAL10
    Local x2:REAL10
   
    finit
    mov    rsi,a
    mov    rdi,b

    ;sqrt[(x2-x1)^2+(y2-y1)^2+(z2-z1)^2]
    fild    [rsi].point_s.x        ;x1->st0
    fild    [rdi].point_s.x        ;x2->st1
    fsub                ;st0-st1
    fst    st(1)            ;st0->st1
    fmul                ;st0*st1
    fstp    x1   

    fild    [rsi].point_s.y        ;x1->st0
    fild    [rdi].point_s.y        ;x2->st1
    fsub                ;st0-st1
    fst    st(1)            ;st0->st1
    fmul                ;st0*st1
    fstp    x2   

    fild    [rsi].point_s.z        ;x1->st0
    fild    [rdi].point_s.z        ;x2->st1
    fsub                ;st0-st1
    fst    st(1)            ;st0->st1
    fmul                ;st0*st1
   
    fld    x2
    fadd
    fld    x1
    fadd

    fsqrt               
    ;result in st0,
    ;Local ResultR8:real8,Local ResultR10:real10   
    ;invoke    euclidean_dist, addr aa,addr bb
    ;fstp    ResultR8 ;fstp    ResultR10

    ;fstp    rVal

        ret

euclidean_dist endp

regards
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 27, 2024, 02:00:48 AM
Hi Six_L

Nice work. Indeed, this algorithm has some strange structures. I succeeded in porting it to RosAsm, but it still requires an insane amount of memory to be allocated, especially since our final goal is processing images.

But in the meanwhile I found a very, very interesting approach that claims to be an improvement on dbscan. Take a look. I succeeded in compiling it with Pelles and the results are great.

It does have some flaws, like not being able to handle some types of images, or showing a weird diagonal line in one of my tests, but it seems promising. I'm trying to port it right now. It uses basically 2 main functions, and it is really fast.


https://github.com/DIP-BurdwanUniversity/DBSCAN-ImageClustering/tree/master/input
https://github.com/DIP-BurdwanUniversity/DBSCAN-ImageClustering
https://biswajit.netlify.app/pages/dip-burdwanuniversity/
https://github.com/DIP-BurdwanUniversity
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 29, 2024, 08:30:12 PM
Ok, gentlemen. First pre-alpha version released :biggrin:  :biggrin:  :biggrin:

File attached. It may still crash sometimes, I suppose. I'll make further tests and some small bug fixes later. So far this was mainly to test the speed of the algorithm from Burdwan University.

It's really good, in fact. In the end, their algorithm basically creates a density map rather than the "normal" clusters and noise of a conventional DBSCAN algorithm. But that can also be implemented.

The good news is that it does not take tons of gigabytes and is really, really fast.

Note: source code is embedded in the main app (as usual with RosAsm files). Later I'll clean up the code (it's a huge, huge mess) and post it here with the next release.

(https://i.ibb.co/dm1NF2w/testiing.png) (https://ibb.co/dm1NF2w)
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: six_L on May 30, 2024, 05:00:47 AM
Hi, guga
very nice!
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 30, 2024, 06:14:45 AM
Hi Six_L, tks

It was hard to understand and port this thing, but I succeeded. I'll try later to write a sort of tutorial on how it works (at least the part shown in the article from Burdwan University).

Basically, it takes only the maximum difference between the 8 neighbouring pixels of every pixel we are working on. First it converts the image to grayscale and keeps only the single grey value generated per pixel; then it builds a map of the grey pixels (size = width*height*4 bytes; 4 because we are dealing with dwords, and in fact I converted them to Real4 to gain extra speed with SSE2) to make the computation easier. For example, say that in a certain region of the image the pixels are laid out as follows (already converted to grey):

99    140    185
102    63    162
58    113    138

And we are at the center of the area, so we are at the pixel whose value is 63.

The original article (and every other place I read so far) tells us to take the difference between each neighbour and the pixel we are on: (99-63), (140-63) ... (138-63). Some articles then say we have to take the minimum of the maximum distance etc., while the article from Burdwan University says to take only the maximum difference among all of them and use the resulting value to build a density map.

But all of this consumes a lot of processing time. So all I had to do was go straight to finding the maximum value among the 8 neighbouring pixels and simply subtract the center value from it.

So, instead of calculating all 8 deltas and only later finding the biggest of them, all that is needed is the maximum of those 8, subtracted directly. Then we divide the value by the 8 pixels.

So, in this example, the maximum value among the neighbouring pixels is 185. Therefore, the value in the density map is (185-63)/8.

Using SSE2 it takes only a few operations to do that. All we need is to locate the starting addresses of the rows above and below (3 pixels each), put them in xmm registers and compute their maximum.
Then we do the same for the pixels to the left and right of the center, and take the max of both.
And finally, we take the maximum of the two results (top/bottom and left/right).

There's no need to subtract all 8 pixels, take the results, make them absolute and only later find their maximum. This is simply because, whatever subtractions or deltas you compute, the maximum delta will always be the one involving the maximum value among the 8 pixels.

So all we need is to find the max value of the 8 pixels, subtract the center from it, and divide the resulting delta by 8 (because we have 8 neighbouring pixels), as sketched below.
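
Here is a scalar C sketch of that per-pixel step, under the same assumption that only the neighbourhood maximum matters (names are mine, not from the real code):

#include <math.h>

/* Density for one pixel: |max of the 8 neighbours - center| / 8.
   grey is the padded Real4 map; stride is the padded row width in elements;
   x and y are coordinates inside the padded map. */
static float pixel_density(const float *grey, int stride, int x, int y)
{
    float m = grey[(y-1)*stride + (x-1)];
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0) continue;    /* skip the center pixel */
            float v = grey[(y+dy)*stride + (x+dx)];
            if (v > m) m = v;
        }
    return fabsf(m - grey[y*stride + x]) / 8.0f;
}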

Since the algorithm takes the 8 pixels surrounding each one, the best way to start is to create the grey table with an extended size: width+2 and height+2 (one extra pixel on each side), placing the image at the center of that buffer. The 1-pixel borders are useless data, but they prevent crashes and save us from adjusting for the border on every iteration.

This way, if the algorithm starts at, say, position x = 0, when it tries to read the pixel on the left it won't crash; and since that value is 0 it won't affect the result much (except at the borders and corners, which in fact use 3 to 5 pixels to calculate the max instead of the regular 8).


So, in memory, our grey image map will be like this:

Padding    Padding    Padding    Padding    Padding    Padding    Padding    Padding    Padding    Padding
Padding    148    96    98    98    66    139    142    24    Padding
Padding    120    90    78    149    161    101    112    46    Padding
Padding    97    97    99    140    185    180    105    140    Padding
Padding    106    99    102    63    162    115    115    139    Padding
Padding    95    107    58    113    138    82    134    176    Padding
Padding    97    102    91    53    121    78    156    204    Padding
Padding    126    93    68    57    145    155    135    58    Padding
Padding    137    95    33    86    122    169    66    37    Padding
Padding    Padding    Padding    Padding    Padding    Padding    Padding    Padding    Padding    Padding


Where padding is nothing less than a null (zero) value used to work around the problem at the borders and corners. A minimal sketch of building this padded map is below.
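
Something like this, in C (the function name and signature are my own):

#include <stdlib.h>
#include <string.h>

/* Allocate a (width+2) x (height+2) float map, zeroed, and copy the grey
   image into its center, so the 1-pixel border acts as the null padding. */
static float *make_padded_grey(const float *grey, int width, int height)
{
    int stride = width + 2;
    float *map = (float *)calloc((size_t)stride * (height + 2), sizeof(float));
    if (!map) return NULL;
    for (int y = 0; y < height; y++)
        memcpy(map + (y + 1) * stride + 1, grey + (size_t)y * width,
               (size_t)width * sizeof(float));
    return map;
}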

The whole function is :


; equates related to pshufd macro
[SSE_ROTATE_RIGHT_96BITS    147]
[SSE_ROTATE_LEFT_32BITS     147]

[SSE_ROTATE_LEFT_96BITS     57]
[SSE_ROTATE_RIGHT_32BITS    57]

[SSE_SWAP_QWORDS 78]         ; SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_LEFT_64BITS and SSE_ROTATE_64BITS are aliases: same encoding, same result
[SSE_ROTATE_LEFT_64BITS 78]
[SSE_ROTATE_RIGHT_64BITS 78]
[SSE_ROTATE_64BITS 78]

[SSE_INVERT_DWORDS 27]      ; invert the order of dwords
[SSE_SWAP_DWORDS 177]       ; Dwords in xmm are copied from 0123 to 1032 ordering. The same as: pshufd XMM0 XMM0 {SHUFFLE 1,0,3,2}
[SSE_SWAP_LOQWORD 225]      ; Only the 2 dwords in the LowQword are swaped. Dwords in xmm are copied from 0123 to 0132 ordering. The same as: pshufd XMM0 XMM0 {SHUFFLE 0,1,3,2}
[SSE_SWAP_HIQWORD 180]      ; Only the 2 dwords in the HighQword are swaped. Dwords in xmm are copied from 0123 to 1023 ordering. The same as: pshufd XMM0 XMM0 {SHUFFLE 1,0,2,3}


; Macros used
[SHUFFLE | ( (#1 shl 6) or (#2 shl 4) or (#3 shl 2) or #4 )] ; Marinus/Sieekmanski
[pshufd | pshufd #1 #2 #3]

[SSE2_MAX_4FLOATS | maxps #1 #2 | pshufd #2 #1 SSE_SWAP_DWORDS | maxps #1 #2 | pshufd #2 #1 SSE_ROTATE_64BITS | maxps #1 #2]

[SSE_ABS_REAL4 | pslld #1 1 | psrld #1 1]


; Variables used

[NeighboursDivision: F$ (1/8), F$ (1/8), F$ (1/8), F$ (1/8)]
[MaxDensity: F$ 0, 0, 0, 0] ; gets the maximum density found, used later to scale the density map
[DensityRange: F$ 255, 255, 255, 255] ; used in scale_density_to_image

Proc EightNeighbourDistance:
    Arguments @pInput, @pOutput, @Height, @Width, @Padding, @DensityRatio
    Local @CurYPos, @NextRow, @PreviousRow, @NextOutput, @NextInput, @AddPaddingStart
    Uses ebx, esi, edi, edx, ecx

    xor eax eax
    If_Or D@Height = 0, D@Width = 0
        ExitP
    End_If

    movups xmm3 X$NeighboursDivision
    movups xmm4 X$MaxDensity

    mov eax D@Padding | shl eax 2 | mov D@AddPaddingStart eax
    ; byte offset from one input row to the next ((2*Padding + Width) * 4)
    mov eax D@Padding | shl eax 1 | add eax D@Width | shl eax 2 | mov D@NextInput eax

    ; byte offset from one output row to the next (Width * 4)
    mov eax D@Width | shl eax 2 | mov D@NextOutput eax

    mov eax D@Padding | mov ebx eax | add eax D@Width | shl eax 2 | mov D@NextRow eax ; NextRow = (Padding+Width)*4
    shl ebx 1 | shl ebx 2 | add ebx eax | mov D@PreviousRow ebx ; PreviousRow = NextRow + Padding*8
    ; the input pointer starts one full (padding) row down

    mov eax D@Height | mov D@CurYPos eax
    mov edx D@pInput | add edx D@NextInput
    mov edi D@pOutput
    ..Do
        mov ebx edx | add ebx D@AddPaddingStart
        xor esi esi
        .Do
            ; get previous line (Up)
            lea eax D$ebx+esi*4 | sub eax D@PreviousRow | movdqu xmm1 X$eax; | SSE_CONV_4INT_TO_4FLOAT xmm1 ; convert the integers to floats
            pshufd xmm1 xmm1 {SHUFFLE 2,2,1,0} ; the 4th Float is not needed for this comparision, because we want only 3 (padding + width + padding)

            ; get next line (Down)
            lea eax D$ebx+esi*4 | add eax D@NextRow | movdqu xmm0 X$eax; | SSE_CONV_4INT_TO_4FLOAT xmm0 ; convert the integers to floats
            pshufd xmm0 xmm0 {SHUFFLE 2,2,1,0} ; the 4th Float is not needed for this comparision, because we want only 3 (padding + width + padding)

            SSE2_MAX_4FLOATS xmm0 xmm1

            ; now get left
            movdqu xmm1 X$ebx+esi*4-4 | pshufd xmm1 xmm1 {SHUFFLE 0,0,0,0}

            ; now get right
            movdqu xmm2 X$ebx+esi*4+4 | pshufd xmm2 xmm2 {SHUFFLE 0,0,0,0}
            maxps xmm1 xmm2

            ; and finally get the maximum between up and down and left to right
            maxps xmm1 xmm0

            ; get our core pixel
            mov eax D$ebx+esi*4 | movd xmm0 eax | pshufd xmm0 xmm0 {SHUFFLE 0,0,0,0}

            ; subtract the neighbourhood maximum from the core pixel
            subps xmm0 xmm1 ; center - max(neighbours), as Real4 floats

            SSE_ABS_REAL4 xmm0 ; and convert them to absolute

            mulps xmm0 xmm3

            ; get the maximum density
            maxps xmm4 xmm0
            movd D$edi+esi*4 xmm0

            inc esi
        .Loop_Until esi => D@Width

        add edx D@NextInput
        add edi D@NextOutput
        dec D@CurYPos
    ..Repeat_Until_Zero

    ; finally calculate the density ratio to be used in our new scale_density_to_image function
    mov eax D@DensityRatio
    movups xmm0 X$DensityRange | divss xmm0 xmm4 | movd D$eax xmm0

EndP

I'll clean up the code later, but it only takes 2 functions to do the whole trick. In fact, only one of them, EightNeighbourDistance, is responsible for creating the density map (size = width*height*4 only  :azn:  :azn: ), producing a sort of "pixel density" (or "DensityPixel", or whatever we want to call this stuff). The other function, scale_density_to_image, is only responsible for making that map visible, since the values in the density map are around 10% of the original grey values we fed in. That's why I calculated the maximum density and the ratio inside EightNeighbourDistance at once: to avoid extra checking in the scale_density_to_image function.

So the final value of the pixel (to make it visible) is basically DensityPixel*DensityRatio.

I added a parameter "DensityRatio" to the EightNeighbourDistance function (the one responsible for creating the density map) to make it easier to make the pixels visible again from the scale_density_to_image function.

The density ratio is basically this:
Ratio = 255/MaxDensity

MaxDensity is the maximum value found in the density map by the EightNeighbourDistance function, and 255 is the maximum value any pixel can take.

So, if we find that the maximum value in the density map is, let's say, 28, we simply calculate the ratio r = 255/28.

Then, to make the pixels visible again, we do it in another function (scale_density_to_image). In that function we take the value of each "PixelDensity" and multiply it by our ratio.

To convert the pixel density to visible pixels we simply do:

Final Pixel Value = PixelDensity*(255/28)
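
In C, the whole rescaling step amounts to something like this sketch (the function name follows the post; the signature is my guess):

/* Make the density map visible again: scale every density value by
   ratio = 255 / MaxDensity, as computed in EightNeighbourDistance. */
static void scale_density_to_image(const float *density, unsigned char *out,
                                   int count, float ratio)
{
    for (int i = 0; i < count; i++) {
        float v = density[i] * ratio;   /* PixelDensity * (255/MaxDensity) */
        out[i] = (unsigned char)(v > 255.0f ? 255.0f : v);
    }
}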


Could we use the Euclidean distance to calculate the values of the density map? Sure... but the question is why? Using the Euclidean distance has 2 disadvantages:
1 - Speed
2 - Accuracy
The speed problem could be solved with a table of 256 floats (Real4) containing the corresponding Euclidean distance values (since the result lies in the 0 to 255 range anyway). So the max delta could be looked up in a table of 256 predefined Euclidean distances.

But the real problem seems to be accuracy. Using Euclidean distances may produce higher density values, which could be inaccurate when classifying the clusters. Example:

If at a given pixel we calculate the density the Euclidean way:
Maximum value of the 8 neighbours = 90
Center pixel used for the delta = 40

Euclidean Distance = sqrt(90^2 - 40^2) = 80.62...

But, using a simple delta, we get:
Density Map (simple delta) = 90-40 = 50

So, using a simple delta between the max of the 8 neighbours and the pixel we are classifying in the density map seems sufficient, and more accurate to me. Not to mention way faster. (That is the same conclusion, by the way, reached in the article from Burdwan University.)
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 31, 2024, 06:25:11 AM
The algorithm seems really good for building other filters on top of. I made a quick test applying the density map to each RGB channel of the image, and the result is pretty good at enhancing the contours.

It can be refined, but for now it's just a quick test. I didn't check for bugs, and the app will probably crash if you try to open a new image without closing the app (I haven't created the routine to release the memory yet, since it was just a quick test).

What it does is take the value of each channel separately and subtract from it the value of the pixel found in the density map (I used the scaled value for this). Then it uses only the absolute delta as the new value of that channel. Like this:

Pixel (R, G, B) at a given position:
30 48 150

Density map value (already scaled): 48 (the same for all 3 channels)


Get the delta:
New pixel = Abs(30-48), Abs(48-48), Abs(150-48) and voilà! A scalar sketch is below.
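
In C, per pixel this is just the following (the BGRA byte order and names are my assumptions):

/* Replace each colour channel with |channel - scaled density|, alpha untouched. */
static void apply_density_delta(unsigned char *bgra, const unsigned char *density,
                                int num_pixels)
{
    for (int i = 0; i < num_pixels; i++) {
        int d = density[i];
        for (int c = 0; c < 3; c++) {   /* B, G, R */
            int v = bgra[i*4 + c] - d;
            bgra[i*4 + c] = (unsigned char)(v < 0 ? -v : v);
        }
    }
}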

In some images it may appear to increase noise but, in fact, what it is doing is enhancing noise that was already in the original image (or introduced by bad encoding etc.).

The better way to use this technique would be to apply the delta only to the luma in the CieLCH colorspace, but I have no time right now to fix my CieLCH routines.

Anyway, as a test it looks very good. Now I'm proceeding to try to create the classification (cluster types, homogeneity etc.).

Note: use the same DLLs I posted with my previous version. I released only the executable this time.


Btw, calculating the delta on all 16 bytes of an xmm register turned out to be very tricky in SSE2. But here is the part of the code that does it:

            movups xmm0 X$ebx+edx*8 ; original image dest
            movups xmm1 X$esi+edx*8 ; density src

;            SSE2_ABS_BYTE_DELTA xmm0 xmm3

    ; calculate the delta on each one of the 16 bytes in the xmm register
   
    ; Backup original xmm0 and xmm1
    movaps xmm7, xmm0   ; xmm7 = xmm0
    movaps xmm6, xmm1    ; xmm6 = xmm1

    ; Calculate xmm0 - xmm1
    psubusb xmm0, xmm1    ; xmm0 = max(xmm0 - xmm1, 0)

    ; Calculate xmm1 - xmm0 (using the backups)
    psubusb xmm6, xmm7 ; xmm6 = max(xmm1 - xmm0, 0)

    ; Combine both partial results: per byte, one of the two is zero,
    ; so OR-ing them yields the absolute difference
    por xmm0, xmm6      ; xmm0 = |a - b| per byte
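
The same unsigned-saturation trick written with SSE2 intrinsics, for anyone following along in C:

#include <emmintrin.h>

/* |a - b| per byte: each saturating subtract clamps the "wrong" direction
   to zero, so OR-ing the two partial results yields the absolute difference. */
static __m128i abs_diff_u8(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
}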
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on May 31, 2024, 06:41:55 AM
Examples


(https://i.ibb.co/cykV3j2/Image2test.png) (https://ibb.co/cykV3j2)

(https://i.ibb.co/18jJj1F/Image2test.png) (https://ibb.co/18jJj1F)
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: NoCforMe on June 01, 2024, 05:04:30 AM
Judging from those pictures, you've cooked up a pretty good "sharpen" filter there, yes?
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on June 02, 2024, 12:50:43 AM
Hi NoCforMe. Yes, the algorithm can be used as a filter to sharpen images. It wasn't designed for that, but in the end I saw that this was another possible use.

Testing for edition. Working now, Stoo :)
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on June 09, 2024, 12:49:08 AM
Finally succeeded in making this work.

I created 2 specific functions to properly identify the clusters. One is based on the "normal" identification method as described here: https://www.youtube.com/watch?v=-p354tQsKrs

The other forces a point near a cluster to be a border even if it isn't within the Epsilon distance. In such cases, if I decide to use this method too (after testing on real images), I'll see if I can add a new flag called NEAR_BORDER, meaning a point near a cluster that can be considered a border even though its values aren't enough to qualify as one.

Both methods use the technique described at: https://biswajit.netlify.app/pages/dip-burdwanuniversity

I´ll make further tests before the next release.

This algorithm can be useful for creating a more accurate denoise filter, since it identifies precisely what is noise and also stores the amount of data a given pixel would need to be considered at least a border, to be used later to form new clusters if necessary.

Now I also need to find a way to assign each border to its own cluster (or clusters) and create a kind of subgroup of clusters (for cases where clusters can be merged to form a larger one).
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on June 10, 2024, 07:07:22 AM
Ok, guys. I succeeded in assigning a unique ID to each core and border point, and also in identifying each noise point. Now I'm trying to figure out a way to assign a unique ID to each cluster and associate it with the core and border points that form it.

Any ideas ?
Title: Re: DBScan, Homogeneity and Completeness algorithm
Post by: guga on June 10, 2024, 01:54:18 PM
Say I have an image (width = 6, height = 5) with these classifications:

Noise - Core - Core - Noise - Core - Core
Border - Border - Core - Border - Border - Core
Noise - Border - Core - Border - Border - Border
Noise - Border - Border - Border - Border - Core
Noise - Noise - Noise - Noise - Core - Core

How do I determine the clusters, and how many are there?
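
One approach I'm considering (a rough, untested C sketch, not the final method): treat it as connected-component labelling and flood-fill from each unlabelled core point, letting cores spread the label over their 8 neighbours while borders take the label but don't spread it. Noise stays unlabelled:

enum { NOISE, BORDER, CORE };

static void flood(int x, int y, int id, int w, int h,
                  const unsigned char *cls, int *label)
{
    if (x < 0 || y < 0 || x >= w || y >= h) return;
    int i = y * w + x;
    if (label[i] || cls[i] == NOISE) return;
    label[i] = id;
    if (cls[i] == BORDER) return;        /* borders join but don't spread */
    for (int dy = -1; dy <= 1; dy++)     /* recursion is fine for a sketch; a real
                                            implementation would use an explicit stack */
        for (int dx = -1; dx <= 1; dx++)
            flood(x + dx, y + dy, id, w, h, cls, label);
}

static int label_clusters(int w, int h, const unsigned char *cls, int *label)
{
    int next_id = 0;                     /* label[] must start zeroed */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (cls[y*w + x] == CORE && !label[y*w + x])
                flood(x, y, ++next_id, w, h, cls, label);
    return next_id;                      /* number of clusters found */
}

Applied to the 6x5 grid above, this would give both the number of clusters and the cluster each core/border cell belongs to (a border adjacent to two clusters just keeps the first label it receives).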