The MASM Forum

General => The Campus => Topic started by: mabdelouahab on September 21, 2016, 07:30:43 AM

Title: hash table
Post by: mabdelouahab on September 21, 2016, 07:30:43 AM
I want to work with a complete hash table 21 GB in length, with each key mapping to a unique bucket (key = pointer). What is the best and quickest way (MASM 32 or 64) to handle it, given that I have only 8 GB of RAM?
Title: Re: hash table
Post by: hutch-- on September 21, 2016, 11:47:13 PM
I don't think there is a simple solution. I gather what you are after is an index to data in a database, where the index is stored in memory; to load that index you would need to scan the very large data file to get the offset of each member you are after. Once you have scanned the data file and have a list of offsets, then to get any individual piece of data from the large file you can keep the file open and just change the seek offset to the database member you want. You could also think of using a memory-mapped file, which may be faster than the normal seek operation.
Title: Re: hash table
Post by: mabdelouahab on September 22, 2016, 05:22:07 PM
Quote from: hutch-- on September 21, 2016, 11:47:13 PM
I don't think there is a simple solution. I gather what you are after is an index to data in a database, where the index is stored in memory; to load that index you would need to scan the very large data file to get the offset of each member you are after. Once you have scanned the data file and have a list of offsets, then to get any individual piece of data from the large file you can keep the file open and just change the seek offset to the database member you want. You could also think of using a memory-mapped file, which may be faster than the normal seek operation.

This means that I must create a database engine.

It's easy if I have 32 GB of memory: just load my data into a buffer in memory, then calculate the pointer as follows:
Pointer = BaseAddr + ( Key * 5 )
the record length is 5 bytes  ; ((2 ^ 32 records) * 5 bytes = 21 474 836 480 bytes)

So, how do I make this work with 8 GB of memory while keeping the quickest access to the data?
Title: Re: hash table
Post by: fearless on September 22, 2016, 10:26:15 PM
I agree, a memory-mapped file would probably be handier, as you can map in a view of the file at a specific offset and of a specific size, regardless of the file size or the memory size. That way you're only consuming a chunk of memory at a time, then sliding this view along to consume another chunk, etc. It's probably best suited to sequential access, if you know the optimal size to read at a time, and it probably requires some sort of index management.

Anyhow, I use something similar for working with large files in a game engine. Once they approach 1 GB in size it's safer to switch to a large file mapping option (instead of reading the entire file) that maps in portions of the file based on the size required and the system's allocation granularity (probably for alignment purposes, from what I recall). Here is some rough code to demonstrate how I calculate the offsets and size required using a CalcLargeFileView function:

LOCAL LargeMapResourceSize:DWORD
LOCAL LargeMapResourceOffset:DWORD

Invoke CalcLargeFileView, dwRequiredViewSize, dwRequiredViewOffset, Addr LargeMapResourceSize, Addr LargeMapResourceOffset
Invoke MapViewOfFileEx, LargeMapHandle, FILE_MAP_ALL_ACCESS, 0, LargeMapResourceOffset, LargeMapResourceSize, NULL


And the routine CalcLargeFileView

;-----------------------------------------------------------------------------------------
; CalcLargeFileView - calculate large mapping size, view and offset
;-----------------------------------------------------------------------------------------
CalcLargeFileView PROC USES EBX dwRequiredViewSize:DWORD, dwRequiredViewOffset:DWORD, lpdwMappedViewSize:DWORD, lpdwMappedViewOffset:DWORD
    LOCAL sysinfo:SYSTEM_INFO
    LOCAL dwAllocationGranularity:DWORD
    LOCAL dwAdjustedOffset:DWORD
   
    Invoke GetSystemInfo, Addr sysinfo
    mov eax, sysinfo.dwAllocationGranularity
    mov dwAllocationGranularity, eax

    mov eax, dwRequiredViewSize
    .IF eax < dwAllocationGranularity
        mov ebx, lpdwMappedViewSize
        mov eax, dwAllocationGranularity
        add eax, dwAllocationGranularity ; below one granule: request two granules as the minimum view size
        mov [ebx], eax
    .ELSE
        mov eax, dwRequiredViewSize
        xor edx, edx
        mov ebx, dwAllocationGranularity
        div ebx ; Divides dwRequiredViewSize by dwAllocationGranularity. EAX = quotient and EDX = remainder (modulo).
        .IF edx > 0 ; we have a remainder, so calc to add dwAllocationGranularity to dwRequiredViewSize - remainder
            mov eax, dwRequiredViewSize
            sub eax, edx
            add eax, dwAllocationGranularity
            add eax, dwAllocationGranularity
            mov ebx, lpdwMappedViewSize
            mov [ebx], eax
        .ELSE ; else we have a multiple of dwAllocationGranularity, so just return the dwRequiredViewSize as actualsize
            mov eax, dwRequiredViewSize
            mov ebx, lpdwMappedViewSize
            mov [ebx], eax
        .ENDIF
    .ENDIF

    mov eax, dwRequiredViewOffset
    .IF eax < dwAllocationGranularity
        ; the mapped view starts at offset 0, so the pointer
        ; adjustment within the view is the required offset itself
        mov dwAdjustedOffset, eax
        mov eax, 0
        mov ebx, lpdwMappedViewOffset
        mov [ebx], eax
    .ELSE
        mov eax, dwRequiredViewOffset
        xor edx, edx
        mov ebx, dwAllocationGranularity
        div ebx ; Divides dwRequiredViewOffset by dwAllocationGranularity. EAX = quotient and EDX = remainder (modulo).
        .IF edx > 0 ; we have a remainder, so round dwRequiredViewOffset down to a granularity boundary
            mov dwAdjustedOffset, edx
            mov eax, dwRequiredViewOffset
            sub eax, edx
            mov ebx, lpdwMappedViewOffset
            mov [ebx], eax
        .ELSE ; else dwRequiredViewOffset is already a multiple of dwAllocationGranularity, so return it unchanged
            mov eax, dwRequiredViewOffset
            mov ebx, lpdwMappedViewOffset
            mov [ebx], eax
            mov dwAdjustedOffset, 0
        .ENDIF
    .ENDIF
    mov eax, dwAdjustedOffset
    ret
CalcLargeFileView ENDP


I have to save some of the information for later reuse, so the memory-mapped handle, the pointer and the file handle are saved to a custom structure as my 'handle' so that I can process it later on or use it in other custom functions.

If you need more info let me know.
Title: Re: hash table
Post by: mabdelouahab on September 23, 2016, 02:25:43 AM
Thank you fearless, but that's not what I'm looking for.
Title: Re: hash table
Post by: aw27 on September 23, 2016, 09:51:26 PM
Memory-mapped files provide fast and transparent random access to files. Let us know, in the unlikely event that you find a better and faster solution.
Title: Re: hash table
Post by: mabdelouahab on September 25, 2016, 02:26:49 AM
Quote from: aw27 on September 23, 2016, 09:51:26 PM
Memory-mapped files provide fast and transparent random access to files. Let us know, in the unlikely event that you find a better and faster solution.
Not yet
I didn't find a programmatic solution, but I changed the math so that the key length is 16 bits rather than 32, at the cost of an additional instruction.
Old:
    Pointer = BaseAddr + ( Key * 5 )
    the record length is 5 bytes (4 bytes + 1)  ; ((2 ^ 32 records) * 5 bytes = 21 474 836 480 bytes) = 21 GB
Now:
    Pointer = BaseAddr + ( Key * 3 )
    the record length is 3 bytes (2 bytes + 1) ; ((2 ^ 16 records) * 3 bytes = 196 608 bytes) ≈ 200 KB

Mathematics makes every job easy.
Title: Re: hash table
Post by: hutch-- on September 28, 2016, 01:31:30 AM
What I don't know is what you are trying to do. Hash tables come in many forms: an algorithm to generate an array index that is as unique as possible within the range of the hash table; a collision-detection mechanism with an alternate strategy if a "bucket" is already in use; and then write, read, delete and/or modify operations, depending on what you want to do with it. The complication of needing far more memory than you have available can probably only be addressed with a memory-mapped file. The alternative is a very big table where you save the offset from the start of the file for each bucket, then move around the file based on the hash of the data you want to access.

File and disk based access will be a lot slower than an in memory hash table array.
Title: Re: hash table
Post by: mabdelouahab on September 28, 2016, 07:27:59 AM
Quote from: hutch-- on September 28, 2016, 01:31:30 AM
What I don't know is what you are trying to do.

Numerical analysis
Title: Re: hash table
Post by: Zen on September 28, 2016, 07:37:06 AM
Hi, MABDELOUAHAB,
:bgrin:  :bgrin:  :bgrin:
Title: Re: hash table
Post by: GuruSR on October 06, 2016, 01:36:44 PM
From your example, why do you even need a hash table? Your last comment about the size would easily fit in most memory situations, and from what I see you could just use the "key" as an offset into your array. I'm not entirely sure why you need a hash table, and "Numerical analysis" doesn't offer much...  (And people said my NDA'ed Registry stuff was cryptic!)

GuruSR.