The MASM Forum

General => The Workshop => Topic started by: MtheK on September 28, 2013, 03:01:46 AM

Title: ML.exe super-slow with large BYTE values?
Post by: MtheK on September 28, 2013, 03:01:46 AM
  If I code this and assemble it, ML.exe finishes in about half a second:

DSNBUFFERL DWORD 32767
DSNBUFFER1 BYTE 32767 DUP(00h)
DSNBUFFER2 BYTE 32767 DUP(00h)

time:  10:34:46.13 to 10:34:46.62


  Yet, if I do the same program and change the above to this:

DSNBUFFERL DWORD 1024000               
DSNBUFFER1 BYTE 1024000 DUP(00h)
DSNBUFFER2 BYTE 1024000 DUP(00h)

the assembly runs about 455x slower, taking roughly 4 minutes:

time:  10:37:38.67 to 10:41:21.90

  My program itself runs the same either way, albeit a bit faster.

  Is this expected?
Title: Re: ML.exe super-slow with large BYTE values?
Post by: jj2007 on September 28, 2013, 03:07:04 AM
Quote from: MtheK on September 28, 2013, 03:01:46 AM
Is this expected?

Yes, it's a known bug. Workarounds:
- use JWasm
- use the MasmBasic crtbuf macro, or something similar:

crtbuf DSNBUFFER1, 1024000  ; in the .code section

crtbuf MACRO var, BufLen, msAlign:=<4>   ; usage: crtbuf pBuffer, 1000 [, 16]
LOCAL cblabel
.data?
align msAlign
  cblabel LABEL BYTE            ; start of the buffer
  var equ offset cblabel        ; var becomes the buffer's address
  ORG $+BufLen-1                ; advance the location counter instead of a huge DUP
  db ?                          ; reserve the final byte so the segment really grows
.code
ENDM
Title: Re: ML.exe super-slow with large BYTE values?
Post by: dedndave on September 28, 2013, 03:10:13 AM
yes - it is a known masm bug

there are ways to avoid it
a simple way is to break up the define
i think if you break the 1024000-byte buffers into sixteen 64000-byte buffers, it will build faster
DSNBUFFER1 BYTE 64000 DUP(00h)
           BYTE 64000 DUP(00h)
           BYTE 64000 DUP(00h)

;etc


the ORG macro that Jochen mentioned is also a good way

of course the SIZEOF and LENGTHOF operators won't work
but, you can put a label at the end and subtract addresses to get past that
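
for instance, a minimal sketch of that trick, building on the split define above (label names are just illustrative):

.data
DSNBUFFER1     BYTE 64000 DUP(00h)
               BYTE 64000 DUP(00h)   ; ...remaining chunks...
DSNBUFFER1_END LABEL BYTE

.code
    mov ecx, offset DSNBUFFER1_END
    sub ecx, offset DSNBUFFER1       ; total size of all chunks - SIZEOF would only report the first line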
Title: Re: ML.exe super-slow with large BYTE values?
Post by: hutch-- on September 28, 2013, 10:54:42 AM
The general drift is to use allocated memory from the OS, not the initialised data section, when you need more than trivial amounts. It's generally classed as bad practice to use the initialised data section because it blows out the size of the file by the amount of memory you define that way. With allocated memory you can routinely allocate up to nearly 2 gigabytes if you have enough memory on the machine.
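
A minimal sketch of that approach, assuming the usual masm32 includes (the 1 MB size and the pBuffer name are just placeholders):

include \masm32\include\masm32rt.inc

.data?
pBuffer dd ?

.code
start:
    ; ask the OS for 1 MB of zero-filled, page-aligned memory
    invoke VirtualAlloc, NULL, 1024*1024, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE
    mov pBuffer, eax
    ; ... use the buffer through pBuffer ...
    invoke VirtualFree, pBuffer, 0, MEM_RELEASE
    invoke ExitProcess, 0
end start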
Title: Re: ML.exe super-slow with large BYTE values?
Post by: Rockoon on September 28, 2013, 01:47:51 PM
While "bad practice", you dont have to deal with a runtime out-of-memory error if you never call malloc() or its equivalents.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: dedndave on September 28, 2013, 05:07:34 PM
something i forgot to mention...
if you put it in the uninitialized data section, it will be filled with 0's
and the EXE will be a lot smaller   :biggrin:
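
for example, a minimal sketch (whether ML assembles a large DUP(?) count quickly is another matter - the ORG trick in the crtbuf macro avoids DUP altogether):

.data?
DSNBUFFER1 BYTE 1024000 DUP(?)   ; reserved, not stored in the EXE; zero-filled by the loader
DSNBUFFER2 BYTE 1024000 DUP(?)
.code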
Title: Re: ML.exe super-slow with large BYTE values?
Post by: jj2007 on September 28, 2013, 05:39:02 PM
Quote from: dedndave on September 28, 2013, 05:07:34 PM
if you put it in the uninitialized data section, it will be filled with 0's
and the EXE will be a lot smaller   :biggrin:

In fact that's what the crtbuf macro does. Full example:

include \masm32\include\masm32rt.inc

crtbuf MACRO var, BufLen, msAlign:=<4>
LOCAL cblabel
.data?
align msAlign
  cblabel LABEL BYTE
  var equ offset cblabel
  ORG $+BufLen-1
  db ?
.code
ENDM

.code
start:   crtbuf mypath, MAX_PATH      ; simple version
   crtbuf MySSE2buffer, 1024*1024, 16      ; with optional argument specifying alignment
   invoke lstrcpy, offset mypath, rv(GetCommandLine)
   MsgBox 0, offset mypath, "The commandline:", MB_OK
   exit
end start


Re "bad practice": As Rockoon already wrote, using .data? avoids calling the alloc family. In practice, the OS couldn't care less if it's .data or somewhere else on the heap; the only occasion where alloc (HeapAlloc, GlobalAlloc, VirtualAlloc, ...) shows an advantage is when you need temporarily a dozen megabytes or so (a case which is rare in this forum but certainly relevant for Firefox developers).
Title: Re: ML.exe super-slow with large BYTE values?
Post by: hutch-- on September 29, 2013, 01:09:30 AM
 :biggrin:

Since when was a few megabytes "rare" in MASM coding? I deviate only when I need a gigabyte or so.  :P  Dynamic allocation has the advantage that it only occupies memory when it's needed by an app, not for the life of the app, although it can be done that way dynamically as well.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: jj2007 on September 29, 2013, 02:13:49 AM
Quote from: hutch-- on September 29, 2013, 01:09:30 AM
Dynamic allocation has the advantage that it only occupies memory when it's needed by an app

Hutch,
If you can show me one source here on the forum that allocates some megabytes and deallocates them before the user hits Alt F4 (that is the meaning of "temporarily"), then I owe you a bottle of malt :icon14:
Title: Re: ML.exe super-slow with large BYTE values?
Post by: GoneFishing on September 29, 2013, 02:31:05 AM
maybe http://masm32.com/board/index.php?topic=2162.0  reply#8  ?  :dazzled:
Title: Re: ML.exe super-slow with large BYTE values?
Post by: dedndave on September 29, 2013, 02:41:57 AM
careful how you word that, Jochen - the postage could get expensive   :redface:
Title: Re: ML.exe super-slow with large BYTE values?
Post by: Gunther on September 29, 2013, 03:23:24 AM
Jochen,

that DPMI application (http://masm32.com/board/index.php?topic=2100.0) from the 16 bit sub-forum temporarily allocates all available RAM and frees it before it terminates. That's approximately 3.5 GB on my machine. Is that enough? By the way, I would prefer Glenfiddich.

Gunther
Title: Re: ML.exe super-slow with large BYTE values?
Post by: jj2007 on September 29, 2013, 03:25:10 AM
Quote from: vertograd on September 29, 2013, 02:31:05 AM
maybe http://masm32.com/board/index.php?topic=2162.0  reply#8  ?  :dazzled:

I don't feel it fits into the description of a normal app that temporarily allocates a few megabytes...

Quote from: KeepingRealBusy on July 30, 2013, 08:33:12 AM
My memory allocation always allocates the largest block possible (VirtualAlloc), and does this 10 times, then de-allocates a block thought to be large enough to bring in the debugger then does an int 3. If the program dies with no messages (instant death), then the reserved block is not large enough, double the size ...

Quote from: Gunther on September 29, 2013, 03:23:24 AM
that DPMI application (http://masm32.com/board/index.php?topic=2100.0) from the 16 bit sub-forum temporarily allocates all available RAM and frees it before it terminates.

Quote from: Gunther on July 11, 2013, 01:48:32 AM
The program checks the available extended memory, allocates that amount of memory, fills the memory with values and reads the values back for printing at the text screen. That's all.

Again, this is not an app that temporarily allocates memory for performing a specific task. It allocates memory for the whole life of the application, and that could be done as well in .data. Since it's a DOS app, the user does not hit Alt F4, I guess, but anyway, the deallocation happens after the user decides to terminate the application, right? Or did I misunderstand something, and the user can continue to run the app without the "temporarily" allocated memory??

There is a reason why I mentioned Firefox. A browser opens a tab and needs some megabytes for that. The user closes the tab, and Firefox gently releases the memory (because otherwise it would be bashed in various benchmarking sites as a memory hog). That is temporary allocation, because deallocation occurs despite the fact that the app keeps running.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: japheth on September 29, 2013, 04:21:34 AM
jwasm also very often allocates a few megabytes temporarily. The size depends on the complexity of the assembly source it has to digest, of course, but 4 MB is usually exceeded for a Windows application.

I want Glenfiddich, at least 24 years old!
Title: Re: ML.exe super-slow with large BYTE values?
Post by: nidud on September 29, 2013, 05:07:08 AM
deleted
Title: Re: ML.exe super-slow with large BYTE values?
Post by: jj2007 on September 29, 2013, 05:14:54 AM
Glenfiddich, Dimple :dazzled:

You drunkards!!!! Read reply #8 carefully!!!

First, it's clearly addressed to Hutch
Second, it relates to a source here on the forum that deallocates RAM before the user hits Alt F4
Third, it only says I owe him a bottle of malt. In case he can present that source, and it was posted before today, he'll get a PM explaining where he can pick up the malt (there are many nice places in Northern Italy :biggrin:).

@Hutch: Yes, I'll also offer you a cappuccino :icon14:
Title: Re: ML.exe super-slow with large BYTE values?
Post by: Gunther on September 29, 2013, 05:26:27 AM
Jochen,

Quote from: jj2007 on September 29, 2013, 05:14:54 AM
First, it's clearly addressed to Hutch

I'm so sorry.

Quote from: jj2007 on September 29, 2013, 05:14:54 AM
Second, it relates to a source here on the forum that deallocates RAM before the user hits Alt F4

That's a bit quibbling.

Quote from: jj2007 on September 29, 2013, 05:14:54 AM
(there are many nice places in Northern Italy :biggrin:).

Of course; I would say near Lago di Garda.

Gunther
Title: Re: ML.exe super-slow with large BYTE values?
Post by: jj2007 on September 29, 2013, 06:17:14 AM
Quote from: Gunther on September 29, 2013, 05:26:27 AM
Quote from: jj2007 on September 29, 2013, 05:14:54 AM
Second, it relates to a source here on the forum that deallocates RAM before the user hits Alt F4

That's a bit quibbling.

Well, no, it's really the essence of the argument:
Quote from: hutch-- on September 29, 2013, 01:09:30 AM
Dynamic allocation has the advantage that it only occupies memory when it's needed by an app, not for the life of the app

This is the whole argument in favour of dynamic allocation: that you can allocate and release memory as needed, reacting to user requests, e.g. for opening a new browser tab. All the examples posted above as "evidence" don't do that - they allocate memory for the life of the app. And that can be done more efficiently via .data? ...
Title: Re: ML.exe super-slow with large BYTE values?
Post by: dedndave on September 29, 2013, 06:53:34 AM
most programs in here - we don't have to use Alt-F4   :biggrin:
Title: Re: ML.exe super-slow with large BYTE values?
Post by: jj2007 on September 29, 2013, 07:26:52 AM
Quote from: dedndave on September 29, 2013, 06:53:34 AM
most programs in here - we don't have to use Alt-F4   :biggrin:

Yes, they end without user intervention :greensml:
Title: Re: ML.exe super-slow with large BYTE values?
Post by: hutch-- on September 29, 2013, 07:39:57 AM
 :biggrin:

I won't hold you to the bottle of Glenfiddich as I have one open and a new one as a spare that I got for my last birthday.  :P
Title: Re: ML.exe super-slow with large BYTE values?
Post by: RuiLoureiro on September 29, 2013, 07:40:45 AM
Quote from: jj2007 on September 29, 2013, 02:13:49 AM
Quote from: hutch-- on September 29, 2013, 01:09:30 AM
Dynamic allocation has the advantage that it only occupies memory when it's needed by an app

Hutch,
If you can show me one source here on the forum that allocates some megabytes and deallocates them before the user hits Alt F4 (that is the meaning of "temporarily"), then I owe you a bottle of malt :icon14:
Hummmm,
                  i have apps that do this, but they deallocate only when we close the app! I've never done it before. So, Jochen, save the bottle or ... share it with me  ;) :P
:icon14:
Title: Re: ML.exe super-slow with large BYTE values?
Post by: dedndave on September 29, 2013, 08:12:34 AM
i used to drink a little scotch - lol

nowadays, i like to have one of these around - which lasts me 2 or 3 years
by the time i get to the end, it's 12 years old   :lol:

good ole Kentucky "sippin" whiskey   :biggrin:

(http://www.examiner.com/images/blog/wysiwyg/image/Knob_Creek_Small_Batch_Bourbon_9_Year_250.jpg)

just like grandad used to make
Title: Re: ML.exe super-slow with large BYTE values?
Post by: Gunther on September 29, 2013, 08:48:51 AM
Dave,

it's Bourbon; we're talking about Whisky. There is a mile of difference.

Gunther
Title: Re: ML.exe super-slow with large BYTE values?
Post by: KeepingRealBusy on September 29, 2013, 08:58:45 AM
Quote from: jj2007 on September 29, 2013, 06:17:14 AM
Quote from: Gunther on September 29, 2013, 05:26:27 AM
Quote from: jj2007 on September 29, 2013, 05:14:54 AM
Second, it relates to a source here on the forum that deallocates RAM before the user hits Alt F4

That's a bit quibbling.

Well, no, it's really the essence of the argument:
Quote from: hutch-- on September 29, 2013, 01:09:30 AM
Dynamic allocation has the advantage that it only occupies memory when it's needed by an app, not for the life of the app

This is the whole argument in favour of dynamic allocation: That you can allocate and release memory as needed, in reacting to user requests, e.g. for opening a new browser tab. All examples posted above as "evidence" don't do that - they allocate memory for the life of the app. And that can be done more efficiently via .data? ...

JJ,

I don't see the problem with keeping the memory around until the end of the app. It is only virtual memory. If I am not using it, then the system will page it out if it needs some real memory. And this paging also happens even if I allocate 3.5 GB of memory and am currently using all of it, since the system only allows a limited number of pages to be resident at once - the least recently accessed page will page out and the new page I need RIGHT NOW will page in, only to be paged out at some later time.

This paging problem requires careful program design to keep accesses within certain pages (whenever possible) instead of randomly accessing a BYTE at some low address, and then a BYTE at a very high address, then in the middle.

In some cases it is faster to read in a huge buffer and then process the buffer multiple times, selecting the records for one output file and skipping the others, then re-processing the buffer and selecting the second file's records for output. Actually you can test by timing to see how many output buffers you can select in one pass without encountering severe paging. In my case I am reading a file into an input buffer and splitting it into 512 output buffers (no writing required) where the input is 3.7 GB, then I access each output buffer in turn and split its content back into the input buffer, but split into 4 pieces, then finally write out the 4 pieces (with an index header) to the actual output file.

This worked for the first half of the data. Those initial data files all had the same size, and the output files were all the same size. The second half of the data files, however, is not so evenly dispersed. The files range from 5.080 GB down to 2.547 GB. The biggest files, when split to 512 buffers, will not fit in the memory buffers, so I am left with several options as to how to proceed. I cannot split the data into 256 buffers and then later split the data into 8 pieces, because the data will still overflow the buffers - there will be twice as much data in half the buffer count. I cannot even read the entire file into memory and split off (process) only one output file at a time, because I can only allocate 3.5 GB of memory and need to split 5.080 GB of data, so I am left with writing intermediate files. This doubles the file I/O time, which is the largest part of the process (6 hrs writing and 1.5 hrs processing on the initial data files).

Still thinking about what to do with this.

Any one have any suggestions?

Dave.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: hutch-- on September 29, 2013, 10:15:06 AM
Dave,

That is what memory mapped files are for. You could tile it with disk I/O if the data can be processed in blocks, but a memory mapped file is probably the better way.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: KeepingRealBusy on September 29, 2013, 10:42:32 AM
Steve,

Thank you for the hint. I have never used memory mapped I/O. Any good links to read?

Dave.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: hutch-- on September 29, 2013, 01:16:00 PM
Dave,

There is an example in the masm32 examples called mmfdemo.asm, but it's more aimed at sharing memory between apps than at working with a disk file. The APIs are CreateFileMapping(), MapViewOfFile(), UnmapViewOfFile() and CloseHandle().

I don't immediately have a disk file version, but the WIN32.HLP documentation looks pretty clear.
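
A minimal read-only sketch along those lines (no error checking; the file name is just a placeholder):

include \masm32\include\masm32rt.inc

.data?
hFile    dd ?
hMapping dd ?
pView    dd ?

.code
start:
    invoke CreateFile, chr$("bigdata.bin"), GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL
    mov hFile, eax
    invoke CreateFileMapping, hFile, NULL, PAGE_READONLY, 0, 0, NULL   ; 0, 0 = size of the file
    mov hMapping, eax
    invoke MapViewOfFile, hMapping, FILE_MAP_READ, 0, 0, 0             ; 0, 0, 0 = map the whole file
    mov pView, eax            ; the file contents are now addressable as ordinary memory
    ; ... read the data through pView ...
    invoke UnmapViewOfFile, pView
    invoke CloseHandle, hMapping
    invoke CloseHandle, hFile
    invoke ExitProcess, 0
end start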
Title: Re: ML.exe super-slow with large BYTE values?
Post by: KeepingRealBusy on September 29, 2013, 01:35:36 PM
Steve,

Thanks for the info. I got into some of the MSDN info online, still studying. Really looking for a compelling reason to switch to this rather than OpenFile, SeekFile, ReadFile, WriteFile etc.

Dave.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: jj2007 on September 29, 2013, 03:59:27 PM
Quote from: KeepingRealBusy on September 29, 2013, 08:58:45 AM
I don't see the problem with keeping the memory around until the end of the app. It is only virtual memory. If I am not using it, then the system will page it out if it needs some real memory.

Dave,

Yes, I agree fully - and that's the whole point of the whisky controversy: If the memory is needed for the lifetime of the application, then it doesn't matter if you use dynamic allocation or the .data? section, the OS couldn't care less.

But you are introducing a new aspect: paging. One could argue that dynamic memory is better because the OS can park it in the pagefile if not needed, while .data? mem is "worse" because it blocks other applications. But actually, that's not the case: even uninitialised data gets paged in only when needed, as the attached little app demonstrates. So it is really difficult to find a factual argument against using uninitialised data, except for the one case that is disputed above: That you want to deallocate it without exiting the application.

P.S.: bufsize=1024*1024*1024 works fine, too. The app starts half a second slower, though

include \masm32\MasmBasic\MasmBasic.inc        ; download (http://masm32.com/board/index.php?topic=94.0)
        Init
        bufsize=1024*1024*32   ; we want 32 Megabytes of uninitialised memory
        crtbuf fatbuffer, bufsize, 16
        mov edi, fatbuffer
        invoke MbCopy, edi, Chr$("Open Task Manager, look at mem usage, Alt Tab back here and hit any key"), -1
        Inkey edi
        xor ecx, ecx
        .Repeat
                mov byte ptr [edi+ecx], 0
                add ecx, 127
        .Until ecx>=bufsize
        inkey "more?"
        Exit
end start
Title: Re: ML.exe super-slow with large BYTE values?
Post by: sinsi on September 29, 2013, 04:05:16 PM
If you are looking at that much data why not 64-bit? Even file mapping is limited to a dword map length in 32-bit windows, so you still have the problem.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: Gunther on September 29, 2013, 08:11:48 PM
Hi Dave,

sinsi is right. 64 bit is the way to go for your problem.

Gunther
Title: Re: ML.exe super-slow with large BYTE values?
Post by: dedndave on September 29, 2013, 09:42:42 PM
Quote from: sinsi on September 29, 2013, 04:05:16 PM
Even file mapping is limited to a dword map length in 32-bit windows, so you still have the problem.

a dword supports more address space than you can actually use, of course   :P

but - file mapping, in this case, would be similar to the paging system you are using
with the major exception that you can control which portions of the file are visible at any given time
you can create numerous views to a file at once   :t

i think you mentioned you'd like 3 views at any given time - that should work nicely
and - you don't have to hog a lot of memory - make them a comfortable size
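
a minimal sketch of such a partial view (the 4 GB offset and 64 MB window are just illustrative; hMapping is assumed to come from CreateFileMapping as in the earlier sketch, and the offset must be a multiple of the allocation granularity, normally 64 KB):

    invoke MapViewOfFile, hMapping, FILE_MAP_READ, 1, 0, 64*1024*1024
    ; dwFileOffsetHigh=1, dwFileOffsetLow=0  ->  the view starts 4 GB into the file
    mov pView, eax
    ; ... process this window, unmap it, then map the next window further along ...
    invoke UnmapViewOfFile, pView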
Title: Re: ML.exe super-slow with large BYTE values?
Post by: KeepingRealBusy on September 30, 2013, 05:29:15 AM
To all,

Thank you very much for the suggestions.

I would like to avoid the 64-bit conversion (a whole new world to contend with and learn; at least I know to start with JWASM).

The following is a copy of my internal documentation describing my working process (for the initial files I worked with) with actual times, some guesses about different processing options and their expected times, and finally a description of a potential memory mapped process with several questions. Anyone care to make some guesses about the answers to the questions? Also, feel free to discuss the different options I have made guesses about.


;
;   A problem developed while processing the High Data files because of the file
;   size. The Low Data files were all about the same size (3,758,103,552 BYTES)
;   and when split into 512 parts, they were 7,340,046 BYTES each. When further
;   split into 4 parts, each part was 1,835,011 BYTES. The input buffer was
;   16,429,056 BYTES, and the 512 buffers were each 8,208,384 BYTES, and 1/4 of
;   the input buffer was 4,107,264 BYTES. Each 512 split size would fit in a
;   split buffer, and then when split 4 ways, each part would fit in the 1/4
;   sized input buffer, so compacting the 4 parts to a single buffer was a
;   simple move operation, and everything fit in the buffers.
;
;   Not so lucky with the High Data files. The total size was the same (the same
;   number of records), but the file sizes varied from 5,080,753,211 BYTES to
;   2,608,592,091 BYTES. There was no way to split 5 GB into memory contained
;   buffers since only 3.5 GB of memory could be allocated. There were several
;   things that could be done to process such large files:
;
;       As a baseline from processing the Low Data files: read the input, split
;       to 512 buffers, read all buffers and split each 4 ways, move the pieces
;       to an output buffer, write the combined pieces to the output file.
;       Overall, this is 1 read, move all records, move all records, move all
;       records (compact 4 pieces), and finally 1 write. This is 2 I/O's and 3
;       moves.
;
;       Accumulated times:
;
;       Time:     6:38:05.558 I/O time.     (2 I/O's)
;       Time:     1:19:09.954 Process time. (3 moves)
;
;       Time:     3:19:02.799 I/O time.     (1 I/O)
;       Time:     0:26:10.333 Process time. (1 move)
;
;       The new method for processing High Data files: read the input, write out
;       the 512 buffers as they fill, read back the data to split it 4 ways,
;       compact the 4 pieces, then write out the 4 pieces with the preface index
;       DWORDS. Overall, this is 1 read, move all records, 1 write, 1 read,
;       move all records, move all records (compact 4 pieces), and finally 1
;       write. This is 4 I/O's and 3 moves. I don't believe that Windows could
;       cache the intermediate writes and then the later reads (even with
;       unbuffered I/O it seems to do this whether you want it or not),
;       especially since the total data size is 5 GB, processed 4 times.
;
;       Guesstimate times:
;
;       Time:    13:16:11.196 I/O time.     (4 I/O's)
;       Time:     1:19:09.954 Process time. (3 moves)
;
;       Another method for processing High Data files: read the input block,
;       split the block in 4 steps (128 files at a time, 4 buffers per file),
;       for each of the 128 buffer quads, compact the 4 pieces, write out the 4
;       pieces with the preface index DWORDS, then repeat 4 times. Overall,
;       this is 4 reads, move all records (compact 4 pieces), and finally 1
;       write. This is 5 I/O's and 1 move. Again, I don't believe that caching
;       would be able to reduce this because of the total data size.
;
;       Guesstimate times:
;
;       Time:    16:35:49.995 I/O time.     (5 I/O's)
;       Time:     0:26:10.333 Process time. (1 move)
;
;       Now for an attempt to do this with memory mapped files. If I worked with
;       records only (no big buffers in my process), then I would have to read
;       all of the 7 BYTE records, one by one, and then move the 6 BYTE records,
;       one by one, into 4*131072 memory mapped files. First question, can
;       Windows handle 524,289 memory mapped files (2,147,487,744 BYTES for 4096
;       BYTE buffers) all at once? How efficiently can it move 7 BYTE records or
;       6 Byte records? Once the Data is read and/or written, the memory mapped
;       I/O seems to be done as unbuffered I/O (in blocks of 4096 BYTES) so that
;       should be the same as my unbuffered I/O. If instead, I created my
;       internal input buffer as 4096*7, and my split buffers as 524,288*4096*3,
;       I would need 6,442,450,944 BYTES, but I can only get 3.5 GB of memory,
;       so that option is out. I could get 1/4 of that but would have to do two
;       splits instead of just one split (more data movement, more I/O). How
;       would that affect memory mapped file processing in Windows, where
;       Windows has only 6 GB of real memory (the 131,073 4096 BYTE Windows
;       buffers would only take 536,875,008 BYTES, my 131,073 buffers would be
;       1,610,612,736 BYTES)?
;


Dave.
Title: Re: ML.exe super-slow with large BYTE values?
Post by: KeepingRealBusy on September 30, 2013, 05:37:18 AM
To all,

OBTW, there are 256 Low input files being split into 512 output files each, each output file internally split into 4 blocks with a preface index (already done), then repeat for the High input files (needs to be done). There are 0x80000000*64 low 7 BYTE input records, and 0x80000000*64 low 6 BYTE output records, and the same number of high records to be done.

Dave.