News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Recent posts

#1
The Laboratory / Re: DBScan, Homogeneity and Co...
Last post by guga - Today at 12:17:37 AM
Succeeded to automate the process of identifying Epsilon and Minimum Points. For the articles i´m reading, the smaller epsilon, the better.

https://hyperskill.org/learn/step/31529
https://mrinalyadav7.medium.com/dbscan-algorithm-c894701306d5

On the attached version, it generated only 2 epsilon values.

Main:
....
        call GetMinPoints DBSCAN_DIM_1D
        mov D$MinPoints eax
        call GetEpsilon D$points, D$Num_Points, Epsilon, DBSCAN_DIM_1D


        call dbscan D$points, D$Num_Points, Epsilon, D$MinPoints, Homogeneity, HomogeneityLog
....

;;
   Automatically find epsilon
   http://www.sefidian.com/2022/12/18/how-to-determine-epsilon-and-minpts-parameters-of-dbscan-clustering/
   https://mrinalyadav7.medium.com/dbscan-algorithm-c894701306d5
   https://hyperskill.org/learn/step/31529
;;

[DBSCAN_DIM_1D 1]
[DBSCAN_DIM_2D 2]
[DBSCAN_DIM_3D 3]

Proc GetMinPoints:
    Arguments @DimensionFlags

    mov eax D@DimensionFlags
    shl eax 1

EndP

Proc GetEpsilon:
    Arguments @points, @num_points, @pOutput, @Flag
    Local @k_distances, @kFactor
    Uses ebx, esi, edi, ecx, edx

    xor eax eax
    On D@num_points = 0, ExitP

    C_call 'msvcrt.calloc' D@num_points, 8
    mov D@k_distances eax

    mov eax D@Flag
    .If_Or eax = DBSCAN_DIM_1D, eax = 0
        mov eax 2 ; dimensions*2) = 1*2
    .Else_If eax = DBSCAN_DIM_2D
        mov eax 4 ; dimensions*2) = 2*2
    .Else_If eax = DBSCAN_DIM_3D
        mov eax 6 ; dimensions*2) = 3*2
    .Else_If_And eax > 3, eax < 7
        mov eax 7
    .Else_If eax > 10
        inc eax
    .End_If
    mov D@kFactor eax

    ; Calculate k-distances for each point
    call calculate_k_distances D@points, D@num_points, D@kFactor, D@k_distances
    call find_knee_point D@k_distances, D@num_points

    mov eax D@pOutput | fstp R$eax

    C_call 'msvcrt.free' D@k_distances

EndP

; Function to calculate k-distance for each point

Proc calculate_k_distances:
    Arguments @points, @num_points, @kFactor, @k_distances
    local @Distances, @PointsIPos
    Uses ebx, esi, edi, ecx, edx

    xor eax eax
    On D@num_points = 0, ExitP

    C_call 'msvcrt.calloc' D@num_points, 8
    mov D@Distances eax
    mov ebx eax
    mov esi D@points
    mov D@PointsIPos esi
    xor edi edi

    .Do

        mov esi D@points
        xor ecx ecx
        Do
       
            call euclidean_dist D@PointsIPos, esi
            fstp R$ebx+ecx*8
            add esi Size_Of_point_t
            inc ecx
        Loop_Until ecx >= D@num_points ; jl

        C_call 'msvcrt.qsort' ebx, D@num_points, 8, compare_doubles
        ;  k-th nearest neighbor distance
        mov eax D@kFactor
        mov edx D@k_distances
        fld R$ebx+eax*8 | fstp R$edx+edi*8

        add D@PointsIPos Size_Of_point_t
        inc edi
    .Loop_Until edi >= D@num_points

    C_call 'msvcrt.free' D@Distances

EndP

Proc compare_doubles:
    Arguments @Arg1, @Arg2

    mov eax D@Arg1 | movsd XMM0 X$eax
    mov eax D@Arg2 | movsd XMM1 X$eax

    xor eax eax
    SSE_D_If xmm1 > xmm0
        mov eax 0-1
    SSE_D_Else_If xmm1 < xmm0
        mov eax 1
    SSE_D_End_If

EndSTD

[KneeRate: R$ 0.1]

Proc find_knee_point:
    Arguments @k_Distance, @Num_Points
    Local @KneeIndex
    Uses ebx, esi, ecx, edx

    mov ebx D@k_Distance
    mov esi D@Num_Points
    C_call 'msvcrt.qsort' ebx, esi, 8, compare_doubles
    fild D@Num_Points | fmul R$KneeRate | fistp D@KneeIndex

;fld R$ebx;+eax*8
;mov eax 52 | fld R$ebx+eax*8

    mov eax D@KneeIndex | fld R$ebx+eax*8

EndP



On the function find_knee_point it creates a table of distances values (possible values of epsilon) forming a array of data (Size is NumPoints *8 = 8 because we are dealing with a Real8). The generated contents are displaced in ascending order On the example we have 53 values, on 2 unique epsilon values from the smaller to the bigger.

The function chooses one located at the 10% of the initial data in the array. Later i´ll make a routine to get rid of the duplicated data, and insert the different values on another table to make easier the user choose what type of data we wants do identify.

according to the 2nd article, Epsilon can be better described as:
Quoteε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to the k = minPts-1 nearest neighbor ordered from the largest to the smallest value. Good values of ε are where this plot shows an "elbow": if ε is chosen much too small, a large part of the data will not be clustered; whereas for a too high value of ε, clusters will merge and the majority of objects will be in the same cluster. In general, small values of ε are preferable, and as a rule of thumb, only a small fraction of points should be within this distance of each other.


Btw,, so far, chatgpt didn´t succeeded to help me making the algo way more fast. It is incredibly slow when working on real images. The alternative i proposed was create a table in memory to directly point to the addresses in get_neighbor routines, but it turns out that i´ll need a insane huge table for that (NumPixels*8*NumPixels). I´m trying to overcome this, and use only 1 table (size = Numpixes*8) and then, make some lighter arithmetic function to calculate the values of the next indexed tables. Thinking how to do it yet....


About epsilon and minimum points, chatgpt answered me like this (That proven to be correct at the end)

QuoteThe choice of k is not strictly fixed and can vary based on the characteristics of the data and the specific requirements of the clustering task. Here's a more detailed explanation of how to determine an appropriate k value for different dimensions:
1D Data

For 1D data, clusters are linear, and local density is easier to determine. Therefore, a small k value is usually sufficient.

    k = 2: Using the nearest neighbor (1st neighbor) can be enough to define local density in 1D data.

2D Data

For 2D data, clusters are more complex, and you generally need a slightly larger k to capture the local density effectively.

    k = 4: This is a commonly used heuristic. A point and its three nearest neighbors can adequately define a local dense region in 2D space.

3D Data

For 3D data, the complexity increases due to the additional dimension, requiring a larger k value to capture the local density accurately.

    k = 6: This value is chosen to include a point and its five nearest neighbors, which helps define the local density in 3D space.
    k = 7-10: Depending on the noise level and the expected cluster structure, you might need a larger k to ensure robustness.

Choosing k Based on Dimensions

The heuristic k = dimensions + 1 is a starting point, but it may need to be adjusted based on empirical observations and the specific dataset. The values mentioned (e.g., k = 6 for 3D) come from practical experience and empirical studies in clustering literature, where a slightly larger k helps to better capture the density in higher dimensions.
Example Adjustments in Code

Here is how you can set k based on the dimensionality of your data and adjust it if necessary:

His piece of code was originally this:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

// Structure to represent a point in 3D (adjust for higher dimensions if needed)
typedef struct {
    double x, y, z; // Add more dimensions as necessary
} point_t;

// Function to calculate Euclidean distance between two points
double euclidean_distance(point_t *a, point_t *b) {
    return sqrt(pow(a->x - b->x, 2) + pow(a->y - b->y, 2) + pow(a->z - b->z, 2)); // Adjust for dimensions
}

// Function to calculate k-distance for each point
void calculate_k_distances(point_t *points, int num_points, int k, double *k_distances) {
    for (int i = 0; i < num_points; ++i) {
        double distances[num_points];
        for (int j = 0; j < num_points; ++j) {
            distances[j] = euclidean_distance(&points[i], &points[j]);
        }
        // Sort distances to find the k-th nearest neighbor
        qsort(distances, num_points, sizeof(double), compare_doubles);
        k_distances[i] = distances[k]; // k-th nearest neighbor distance
    }
}

// Comparison function for qsort
int compare_doubles(const void *a, const void *b) {
    double arg1 = *(const double*)a;
    double arg2 = *(const double*)b;
    if (arg1 < arg2) return -1;
    if (arg1 > arg2) return 1;
    return 0;
}

// Function to find the "knee" point in k-distances plot
double find_knee_point(double *k_distances, int num_points) {
    // Sort the k-distances to create the k-distances plot
    qsort(k_distances, num_points, sizeof(double), compare_doubles);
    // Simple heuristic: return the distance at the 90th percentile
    return k_distances[(int)(0.9 * num_points)];
}

int main() {
    // Assuming points is already populated
    point_t *points = ...; // Replace with actual points initialization
    int num_points = ...; // Replace with actual number of points
    int dimensions = 3; // Adjust based on the dimensionality of your data
    int k = dimensions + 3; // Adjusted heuristic for k (example: dimensions + 3)
    double k_distances[num_points];

    // Calculate k-distances for each point
    calculate_k_distances(points, num_points, k, k_distances);
    // Find a suitable epsilon using the knee point of the k-distances plot
    double epsilon = find_knee_point(k_distances, num_points);
    // Determine minPts, often set as 2 * dimensions
    int minPts = 2 * dimensions;

    // Now run DBSCAN with the calculated epsilon and minPts
    int num_clusters = dbscan_run(points, epsilon, minPts, num_points);

    // Continue with your process...

    return 0;
}


And then it answered me:
QuoteSummary

    1D Data: k = 2 is typically sufficient.
    2D Data: k = 4 is a commonly used heuristic.
    3D Data: k = 6 or k = 7-10 based on noise and cluster structure.
    Higher Dimensions: k = dimensions + 1 or slightly higher based on empirical observation.

These values are starting points, and you may need to adjust k based on your specific dataset and the results you observe.


The resultant values were the same i´ve got earlier, so, it seems the routines i created to automate epsilon and min points were correct

Epsilon: 1.000000
Minimum points: 2
Homogeneity: 0.000000
Homogeneity(log): 0.877543
Number of points: 53
 x    y    z    cluster_id
-----------------------------
 1.00  3.00  1.00: 0
 1.00  4.00  1.00: 0
 1.00  5.00  1.00: 0
 1.00  6.00  1.00: 0
 2.00  2.00  1.00: 2
 2.00  3.00  0.00: 1
 2.00  4.00  0.00: 1
 2.00  5.00  0.00: 1
 2.00  6.00  0.00: 1
 2.00  7.00  1.00: 3
 3.00  1.00  1.00: 2
 3.00  2.00  1.00: 2
 3.00  3.00  1.00: 2
 3.00  4.00  0.00: 1
 3.00  5.00  0.00: 1
 3.00  6.00  1.00: 3
 3.00  7.00  1.00: 3
 4.00  1.00  1.00: 2
 4.00  2.00  1.00: 2
 4.00  3.00  0.00: 1
 4.00  4.00  0.00: 1
 4.00  5.00  1.00: -2
 4.00  6.00  0.00: 1
 4.00  7.00  1.00: 3
 4.00  8.00  1.00: 3
 5.00  1.00  1.00: 2
 5.00  2.00  0.00: 1
 5.00  3.00  0.00: 1
 5.00  4.00  0.00: 1
 5.00  5.00  0.00: 1
 5.00  6.00  0.00: 1
 5.00  7.00  1.00: 3
 5.00  8.00  1.00: 3
 6.00  1.00  1.00: 2
 6.00  2.00  0.00: 1
 6.00  3.00  1.00: 3
 6.00  4.00  1.00: 3
 6.00  5.00  1.00: 3
 6.00  6.00  1.00: 3
 6.00  7.00  1.00: 3
 7.00  1.00  1.00: 2
 7.00  2.00  0.00: 1
 7.00  3.00  0.00: 1
 7.00  4.00  0.00: 1
 7.00  5.00  1.00: 3
 8.00  1.00  1.00: 2
 8.00  2.00  1.00: 2
 8.00  3.00  0.00: 1
 8.00  4.00  1.00: 3
 8.00  5.00  1.00: 3
 8.00  6.00  1.00: 3
 9.00  2.00  1.00: 2
 9.00  3.00  1.00: 2

Press enter to exit...



And below are some links i found with examples on how to create a faster library for it:
https://github.com/Eleobert/dbscan
https://proceedings.mlr.press/v157/weng21a.html

I didn´t tested this new routines yet, neither ported it co Plain C or assembly yet. Thinking....
#2
16 bit DOS Programming / Re: Building 16 bit Exe
Last post by tda0626 - May 18, 2024, 11:17:04 PM
Quote from: daydreamer on May 16, 2024, 02:09:05 PMIs dosbox-x the version which support MMX opcodes ?
Optimisations with dosbox should be done with old CPUs cycle count data



It has the option of emulating the Pentium MMX in the CPU type menu.
#3
MasmBasic & the RichMasm IDE / Re: Silly question
Last post by TimoVJL - May 18, 2024, 09:05:22 PM
#4
MasmBasic & the RichMasm IDE / Re: Silly question
Last post by FORTRANS - May 18, 2024, 08:21:10 PM
Hi,

   Will look at the attachments.

   A fix to a technical error: The BASIC mentioned in the book is
"Microsoft BASIC".  BASICA was in the IBM PC as a ROM.  QBASIC was
the one supplied with MS-DOS.  All are similar.

Regards,

Steve N.
#5
MasmBasic & the RichMasm IDE / Re: Silly question
Last post by daydreamer - May 18, 2024, 04:45:24 PM
I found a package which combine qbasic with dosbox,for old basic programs designed to run on slow few MHz CPU ,to keep speed down to make old games playable
Other way is like jochen pacman example ,put code in timer that updates each frame 1/60 ,otherwise running changing screen coordinates @3ghz makes game unplayable, because objects on screen seem to move at lightspeed
The big advantage is a very slow init data on original computer ,is made in milliseconds or less
My 16 bit dos coding started with emulate similar apple 2 vector graphics, few bits / vector
#6
MasmBasic & the RichMasm IDE / Re: Silly question
Last post by jj2007 - May 18, 2024, 09:06:29 AM
Attached a translated file - a very early version, just to give you an idea... lots of manual fixes needed. I won't have much time to work on it.

Source and exe of the translator in the second attachment. Drag any *.bas file over the exe to translate it to MyTest.asm
#7
MasmBasic & the RichMasm IDE / Re: Silly question
Last post by FORTRANS - May 18, 2024, 06:26:14 AM
Quote from: jj2007 on May 18, 2024, 05:36:14 AMHi Steve,

Can you post an example (not a Hello World, something bigger)?

   Here is one of the games.  Typed in, partly debugged, and starts to
run.  It may have more errors though.  I also put in some INPUT statements
to avoid the scrolling from allowing one to read things.  Not yet worked
out for all text.

QuoteMasmBasic is closer to GfaBasic, but it might not be difficult to write a translator. It seems BASICA is roughly the same as GW-BASIC, right?

   If I remember, GW-BASIC was newer and could compile to an executable.
BASICA is only an interpreter.  They were compatible for all the programs
I tried at the time.  Equals long ago though.

Regards,

Steve
#8
CoderStudio / Listing shared printers
Last post by Vortex - May 18, 2024, 06:15:53 AM
Hello,

This command line tool enumerates the printers shared by a remote system. You can copy the import library winspool.lib offered with the zip archive to the folder \Link32\lib

Usage : ListPrinters \\server
ListPrinters.c :

#include <windows.h>

#include <winspool.h>

#include <stdio.h>

int main(int argc, char * argv[]) {

    DWORD cbBuf = 0;
    DWORD pcReturned;
    PRINTER_INFO_2 * prinfo;
    BOOL r;
    int rv = 0;
    DWORD i;

    char * err = "The EnumPrinters API could not retrieve printer information.\n";

    if (argc != 2) {
        printf("ListPrinters V1.0 by Vortex\n\nUsage : ListPrinters \\\\server\n");
        return 1;
    }

    EnumPrinters(PRINTER_ENUM_SHARED | PRINTER_ENUM_NAME, argv[1], 2, NULL, 0, & cbBuf, & pcReturned);

    if (!cbBuf) {
        printf("%s", err);
        return 2;
    }

    prinfo = (PRINTER_INFO_2 * )VirtualAlloc(0, cbBuf, MEM_COMMIT, PAGE_READWRITE);

    if (!prinfo) {
        printf("Memory allocation failed.\n");
        return 3;
    }

    r = EnumPrinters(PRINTER_ENUM_SHARED | PRINTER_ENUM_NAME, argv[1], 2, (LPBYTE) prinfo, cbBuf, & cbBuf, & pcReturned);

    if (!r) {
        printf("%s", err);
        rv = 2;

    } else {

        for (i = 0; i < pcReturned; ++i) {
            printf("%s\n", prinfo[i].pShareName);
        }
    }

    VirtualFree(prinfo, 0, MEM_RELEASE);
    return rv;

}

Building the project :

\pcc32\pcc32 /I\pcc32\include /c ListPrinters.c
\link32\link32 /SUBSYSTEM:CONSOLE /LIBPATH:\link32\lib /NEV ListPrinters.obj kernel32.lib user32.lib msvcrt.lib winspool.lib
#9
The Campus / Re: sprintf_s in masm ?
Last post by alCoPaUL - May 18, 2024, 06:13:42 AM
StrLen() must be sanitized by checking if it's FFFFFFFEh to get the Maxed Capacity.

But the Primacy of the 999426 DUP(0) buffer is upheld so StrLen(data), by practice, is wayyy smaller than the function's Maxed Capacity.

And since I posted the 64-bit version, now it's clear why the size of the buffer in 64-bit sprintf_s can be ommited coz the maxed capacity of StrLen(data), to suffice the length of the inputted data to sprintf_s, Ideally, is effing FFFFFFFFFFFFFFFEh bytes.
#10
MasmBasic & the RichMasm IDE / Re: Silly question
Last post by jj2007 - May 18, 2024, 05:36:14 AM
Hi Steve,

Can you post an example (not a Hello World, something bigger)?

MasmBasic is closer to GfaBasic, but it might not be difficult to write a translator. It seems BASICA is roughly the same as GW-BASIC, right?