hash table

Started by mabdelouahab, September 21, 2016, 07:30:43 AM


mabdelouahab

I want to work with a complete hash table 21 GB in length, where each key maps to a unique bucket (key = pointer). What is the best and quickest way (MASM 32 or 64) to deal with it, knowing that I have just 8 GB of RAM?

hutch--

I don't think there is a simple solution. I gather what you are after is an index, held in memory, to data in a database file; to load that index you would need to scan the very large data file to get the offsets of each member you are after. Once you have scanned the data file and have a list of offsets, then to get any individual piece of data from the large file you can keep the file open and just seek to the offset of the database member you want. You could also think of using a memory mapped file, which may be faster than the normal seek operation.
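
A minimal sketch of the keep-the-file-open approach (added for illustration, not from the original post); hFile, dwOffsetLow, dwOffsetHigh, buffer, dwRecordSize and dwRead are assumed variables:

; hypothetical sketch: seek to a stored 64-bit offset and read one record
Invoke SetFilePointer, hFile, dwOffsetLow, Addr dwOffsetHigh, FILE_BEGIN
Invoke ReadFile, hFile, Addr buffer, dwRecordSize, Addr dwRead, NULL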

mabdelouahab

Quote from: hutch-- on September 21, 2016, 11:47:13 PM
I don't think there is a simple solution. I gather what you are after is an index, held in memory, to data in a database file... You could also think of using a memory mapped file, which may be faster than the normal seek operation.

This means that I must create a database engine.

It would be easy if I had 32 GB of memory: just load my data into a memory buffer, then calculate the pointer as follows:
Pointer = BaseAddr + ( Key * 5 )
the record length is 5 bytes  ; ((2^32 records) * 5 bytes = 21,474,836,480 bytes)
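
For illustration, a minimal 64-bit sketch of that address calculation (a 21 GB buffer needs 64-bit addressing; the register assignments are assumptions):

mov rax, rcx                ; rcx = Key (caller-supplied)
lea rax, [rax+rax*4]        ; rax = Key * 5, via the scaled-index trick
add rax, rbx                ; rbx = BaseAddr, so rax = BaseAddr + Key * 5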

So how do I make this work with 8 GB of memory while keeping the quickest possible access to the data?

fearless

I agree, a memory mapped file would probably be handier, as you can map in a specific slice of the file at a specific offset, regardless of the file's size or the memory size. That way you're only consuming a chunk of memory at a time, then sliding this view along to consume another chunk, and so on. It's probably handiest for sequential access, if you know the optimal size to read at a time, and it probably requires some sort of index management.
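
'Sliding' the view might look something like this (a rough sketch, not my engine code; pView, hMapping and the offset/size variables are assumptions, and the new offset must be a multiple of the allocation granularity):

Invoke UnmapViewOfFile, pView   ; release the current chunk
Invoke MapViewOfFile, hMapping, FILE_MAP_READ, dwNextOffsetHigh, dwNextOffsetLow, dwChunkSize
mov pView, eax                  ; base of the new view (NULL on failure)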

Anyhow, I use something similar for working with large files in a game engine. Once they approach 1GB in size it's safer to switch to a large file mapping option (instead of reading the entire file) that maps in portions of the file based on the size required and the system's allocation granularity (probably for alignment purposes, from what I recall). Here is some rough code to demonstrate how I calc the offsets and size required using a CalcLargeFileView function:

LOCAL LargeMapResourceSize:DWORD
LOCAL LargeMapResourceOffset:DWORD

Invoke CalcLargeFileView, dwRequiredViewSize, dwRequiredViewOffset, Addr LargeMapResourceSize, Addr LargeMapResourceOffset
Invoke MapViewOfFileEx, LargeMapHandle, FILE_MAP_ALL_ACCESS, 0, LargeMapResourceOffset, LargeMapResourceSize, NULL


And the routine CalcLargeFileView:

;-----------------------------------------------------------------------------------------
; CalcLargeFileView - calculate large mapping size, view and offset
;-----------------------------------------------------------------------------------------
CalcLargeFileView PROC USES EBX dwRequiredViewSize:DWORD, dwRequiredViewOffset:DWORD, lpdwMappedViewSize:DWORD, lpdwMappedViewOffset:DWORD
    LOCAL sysinfo:SYSTEM_INFO
    LOCAL dwAllocationGranularity:DWORD
    LOCAL dwAdjustedOffset:DWORD
   
    Invoke GetSystemInfo, Addr sysinfo
    mov eax, sysinfo.dwAllocationGranularity
    mov dwAllocationGranularity, eax

    mov eax, dwRequiredViewSize
    .IF eax < dwAllocationGranularity
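        ; map two granules: one extra so the view still covers the data
        ; once the offset below is rounded down to a granule boundary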
        mov ebx, lpdwMappedViewSize
        mov eax, dwAllocationGranularity
        add eax, dwAllocationGranularity
        mov [ebx], eax
    .ELSE
        mov eax, dwRequiredViewSize
        xor edx, edx
        mov ebx, dwAllocationGranularity
        div ebx ; Divides dwRequiredViewSize by dwAllocationGranularity. EAX = quotient and EDX = remainder (modulo).
        .IF edx > 0 ; we have a remainder: round dwRequiredViewSize up to a granularity boundary, plus one extra granule to cover the offset adjustment
            mov eax, dwRequiredViewSize
            sub eax, edx
            add eax, dwAllocationGranularity
            add eax, dwAllocationGranularity
            mov ebx, lpdwMappedViewSize
            mov [ebx], eax
        .ELSE ; else we have a multiple of dwAllocationGranularity, so just return the dwRequiredViewSize as actualsize
            mov eax, dwRequiredViewSize
            mov ebx, lpdwMappedViewSize
            mov [ebx], eax
        .ENDIF
    .ENDIF

    mov eax, dwRequiredViewOffset
    .IF eax < dwAllocationGranularity
        ; offset is within the first granule: map from the start of the
        ; file and return the required offset itself as the adjustment
        mov dwAdjustedOffset, eax
        xor eax, eax
        mov ebx, lpdwMappedViewOffset
        mov [ebx], eax
    .ELSE
        mov eax, dwRequiredViewOffset
        xor edx, edx
        mov ebx, dwAllocationGranularity
        div ebx ; Divides dwRequiredViewOffset by dwAllocationGranularity. EAX = quotient and EDX = remainder (modulo).
        .IF edx > 0 ; we have a remainder: round dwRequiredViewOffset down to a granularity boundary and return the remainder as the adjustment
            mov dwAdjustedOffset, edx
            mov eax, dwRequiredViewOffset
            sub eax, edx
            mov ebx, lpdwMappedViewOffset
            mov [ebx], eax
        .ELSE ; offset is already a multiple of dwAllocationGranularity, so return it unchanged with no adjustment
            mov eax, dwRequiredViewOffset
            mov ebx, lpdwMappedViewOffset
            mov [ebx], eax
            mov dwAdjustedOffset, 0
        .ENDIF
    .ENDIF
    mov eax, dwAdjustedOffset
    ret
CalcLargeFileView ENDP
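
A hypothetical usage sketch (not from the original post): the value returned in eax is the distance from the start of the mapped view to the requested byte, so it is added to the view base; dwAdjust is an assumed local and error checks are omitted:

Invoke CalcLargeFileView, dwRequiredViewSize, dwRequiredViewOffset, Addr LargeMapResourceSize, Addr LargeMapResourceOffset
mov dwAdjust, eax   ; save the returned adjustment
Invoke MapViewOfFileEx, LargeMapHandle, FILE_MAP_ALL_ACCESS, 0, LargeMapResourceOffset, LargeMapResourceSize, NULL
add eax, dwAdjust   ; eax now points at the data at dwRequiredViewOffset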


I have to save some of the information for later reuse, so the memory mapped handle, the view pointer and the file handle are saved to a custom structure as my 'handle', so that I can process it later on or use it in other custom functions.
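
Such a structure might look something like this (a sketch only; the field names are assumptions, not the actual structure):

LARGEFILE STRUCT
    hFile           DWORD ?   ; file handle from CreateFile
    hMapping        DWORD ?   ; mapping handle from CreateFileMapping
    pView           DWORD ?   ; view base returned by MapViewOfFileEx
    dwViewOffset    DWORD ?   ; file offset of the current mapped view
    dwViewSize      DWORD ?   ; size of the current mapped view
LARGEFILE ENDS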

If you need more info let me know.

mabdelouahab

Thank you fearless, but that's not what I'm looking for.

aw27

Memory mapped files provide fast and transparent random access to files. Let us know, in the unlikely event that you find a better and faster solution.

mabdelouahab

Quote from: aw27 on September 23, 2016, 09:51:26 PM
Memory mapped files provide fast and transparent random access to files. Let us know, in the unlikely event that you find a better and faster solution.
Not yet.
I didn't find a programmatic solution, but I changed the math equation until I got a key length of 16 bits rather than 32, at the cost of a few additional instructions.
Old:
    Pointer = BaseAddr + ( Key * 5 )
    the record length is 5 bytes (4 bytes + 1)  ; ((2^32 records) * 5 bytes = 21,474,836,480 bytes) = 21 GB
Now:
    Pointer = BaseAddr + ( Key * 3 )
    the record length is 3 bytes (2 bytes + 1)  ; ((2^16 records) * 3 bytes = 196,608 bytes) ≈ 200 KB

Mathematics makes every job easy.
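
For illustration, a minimal 32-bit sketch of the revised scheme (the names pTable and wKey are assumptions); at this size the whole table fits in memory trivially:

Invoke GlobalAlloc, GMEM_FIXED or GMEM_ZEROINIT, 196608  ; 2^16 records * 3 bytes
mov pTable, eax               ; BaseAddr (assumed DWORD local)
movzx ecx, word ptr wKey      ; the 16-bit key (assumed WORD variable)
lea ecx, [ecx+ecx*2]          ; ecx = Key * 3
add ecx, pTable               ; ecx = BaseAddr + Key * 3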

hutch--

What I don't know is what you are trying to do. Hash tables come in many forms: an algorithm to generate as unique an array index as possible within the range of the hash table, a collision detection mechanism with an alternate strategy if a "bucket" is already in use, and then the write, read, delete and/or modify operations, depending on what you want to do with it. The complication of needing far more memory than you have available can probably only be addressed with a memory mapped file. The alternative is a very big table where you save the offsets from the start of the file for each bucket, then move around the file based on the hash of the data you want to access.
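
As a minimal illustration of the first ingredient (not hutch's code; HashKey and TABLE_BITS are assumed names), a multiplicative hash that maps a 32-bit key into a power-of-two table range:

TABLE_BITS EQU 20                   ; 2^20 buckets (assumed table size)

HashKey PROC dwKey:DWORD
    mov eax, dwKey
    imul eax, eax, 09E3779B1h       ; Knuth's multiplicative hash constant
    shr eax, 32 - TABLE_BITS        ; keep the top TABLE_BITS bits as the index
    ret
HashKey ENDP

On a collision, the simplest alternate strategy is linear probing: step to the next bucket, wrapping at the table size, until a free one is found.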

File and disk based access will be a lot slower than an in-memory hash table array.

mabdelouahab

Quote from: hutch-- on September 28, 2016, 01:31:30 AM
What I don't know is what you are trying to do.

Numerical analysis

Zen

Hi, MABDELOUAHAB,
:bgrin:  :bgrin:  :bgrin:

GuruSR

From your example, why do you even need a hash table? Your last comment about the size would easily fit in most memory situations, and from what I see you could just use the "key" as an offset value into your array. I'm not entirely sure why you need a hash table, and "Numerical analysis" doesn't offer much... (And people said my NDA'ed Registry stuff was cryptic!)

GuruSR.
Learned 68k Motorola Asm instruction set in 30 minutes on the way to an Amiga Developer's Forum meeting.
Following week wrote a kernel level memory pool manager in 68k assembler for fun.