News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

String array sort with fat files

Started by jj2007, March 26, 2015, 07:48:47 PM

Previous topic - Next topic

jj2007

Quote from: Antariy on March 27, 2015, 06:40:39 AM:t Did you try to use LARGEADDRESSAWARE switch?

Thanks, but this would just hide the problem. Of course, if I had a concrete application that needs more memory, I could use that, but it's not a solution. I must work on the design of the algo. Again, for files under 200 MB there is no problem, it works fine.

Antariy

Quote from: jj2007 on March 27, 2015, 06:44:50 AM
Quote from: Antariy on March 27, 2015, 06:40:39 AM:t Did you try to use LARGEADDRESSAWARE switch?

Thanks, but this would just hide the problem. Of course, if I had a concrete application that needs more memory, I could use that, but it's not a solution. I must work on the design of the algo. Again, for files under 200 MB there is no problem, it works fine.

Well, I did not say it's solution, just mentioned that if it may be useful for you. You currently read entire file into memory before you starting the scan of it and separating the lines to pointers? Do you do inplace separation, inserting zero in the CR/LF place?

jj2007

Quote from: Antariy on March 27, 2015, 06:48:15 AMDo you do inplace separation, inserting zero in the CR/LF place?

Yes, exactly. Masm32 ltok and Hutch' new tokeniser use the same technique afaik. It's extremely efficient for read-only arrays (frequent case). It is a bit less efficient when you start heap-allocating new strings, but even in the most extreme case, only the memory of the original file size hangs around. No harm done if it doesn't reach the gigabyte range.

The alternative would be to copy each string to a heap-allocated new string, and delete the original array afterwards. That is awfully slow, and given that each heapstring neads a header and has trailing unused bytes, it is hardly more memory-efficient.

jj2007

New version with strongly improved memory management. Now six Million 'bible' lines can be sorted without problems. There is a version with 10 Mio lines, but it choked sometimes - see bx$ in the source.

Getting text files
Reading Bible200.txt
Reading 6076600 lines took 0.658 seconds
Reading 6076600 lines took 0.640 seconds
***
CRT QSort for Bible200.txt:
6076600 lines sorted in 7.45 seconds
6076600 lines sorted in 7.44 seconds
6076600 lines sorted in 7.44 seconds
Writing 6076600 lines took 5.52 seconds

t0=[(According as it is written, God hath
t6076599=[Ziph, and Telem, and Bealoth, .

MasmBasic QSort for Bible200.txt:
6076600 lines sorted in 6.21 seconds
6076600 lines sorted in 6.37 seconds
6076600 lines sorted in 6.34 seconds
Writing 6076600 lines took 5.34 seconds

t0=[(According as it is written, God hath
t6076399=[Ziph, and Telem, and Bealoth, .
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

dedndave


rrr314159

I move that old computers be banned ... anyone second the motion?

Perhaps we could pass the hat to retire these aging warhorses, save a lot of trouble chasing down anachronistic bugs
Get dedndave a nice new Windows 8.1 machine, maybe even Windows 10 :biggrin:

[edit] just kidding of course - actually, as we all know, "Older is Better"
I am NaN ;)

dedndave

i like my P4 prescott   :biggrin:
XP media center edition
4 GB ram (3 usable, i guess)
had it 10 years - it's been a good machine
although, i may pick up a used win7-64 laptop for business use
win8, win10 - no thank you

sinsi

Crash again, with the i7-4790 and Win8.1Pro

Windows 7 Ultimate 64-bit
Getting text files
Downloading Canterbury Corpus
4047392 21.03.1997  15:31:58    bible.txt
Unzipping to C:\Users\tester\Desktop\Sort6MioLines\Bible.txt
4638690 26.02.1997  11:12:00    E.coli
2473400 18.03.1997  15:13:34    world192.txt
Building 50*Bible.txt as Bible50.txt
Building 200*Bible.txt as Bible200.txt
Reading Bible200.txt
Reading 6076600 lines took 0.963 seconds
Reading 6076600 lines took 1.28 seconds
Reading 6076600 lines took 1.11 seconds
***
CRT QSort for Bible200.txt:
6076600 lines sorted in 9.80 seconds
6076600 lines sorted in 9.81 seconds
6076600 lines sorted in 9.72 seconds
Writing 6076600 lines took 2.79 seconds
t0=[(According as it is written, God hath given them the spirit ...]
t6076599=[Ziph, and Telem, and Bealoth, ...]

MasmBasic QSort for Bible200.txt:
6076600 lines sorted in 9.07 seconds
6076600 lines sorted in 9.00 seconds
6076600 lines sorted in 8.93 seconds
Writing 6076600 lines took 1.85 seconds
t0=[(According as it is written, God hath given them the spirit ...]
t6076399=[Ziph, and Telem, and Bealoth, ...]
Last error: The operation completed successfully.
AMD A10-7850K APU with Radeon(TM) R7 Graphics   (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
--- hit any key ---

Windows 10 Preview 64-bit
Getting text files
Downloading Canterbury Corpus
4047392 21.03.1997  15:31:58    bible.txt
Unzipping to C:\Users\user\Desktop\Sort6MioLines\Bible.txt
4638690 26.02.1997  11:12:00    E.coli
2473400 18.03.1997  15:13:34    world192.txt
Building 50*Bible.txt as Bible50.txt
Building 200*Bible.txt as Bible200.txt
Reading Bible200.txt
Reading 6076600 lines took 1.37 seconds
Reading 6076600 lines took 1.30 seconds
Reading 6076600 lines took 1.30 seconds
***
CRT QSort for Bible200.txt:
6076600 lines sorted in 10.2 seconds
6076600 lines sorted in 10.2 seconds
6076600 lines sorted in 9.87 seconds
Writing 6076600 lines took 23.3 seconds
t0=[(According as it is written, God hath given them the spirit ...]
t6076599=[Ziph, and Telem, and Bealoth, ...]

MasmBasic QSort for Bible200.txt:
6076600 lines sorted in 9.12 seconds
6076600 lines sorted in 9.03 seconds
6076600 lines sorted in 8.77 seconds
Writing 6076600 lines took 9.23 seconds
t0=[(According as it is written, God hath given them the spirit ...]
t6076399=[Ziph, and Telem, and Bealoth, ...]
Last error: The operation completed successfully.
Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1)
--- hit any key ---

Windows 8.1 Professional 64-bit
Getting text files
Downloading Canterbury Corpus
4047392 21.03.1997  15:31:58    bible.txt
Unzipping to C:\Users\sinsi\Desktop\Sort6MioLines\Bible.txt
4638690 26.02.1997  11:12:00    E.coli
2473400 18.03.1997  15:13:34    world192.txt
Building 50*Bible.txt as Bible50.txt
Building 200*Bible.txt as Bible200.txt
Reading Bible200.txt
Reading 6076600 lines took 0.572 seconds
Reading 6076600 lines took 0.545 seconds
Reading 6076600 lines took 0.653 seconds
*
Fatal error: HeapAlloc

🍺🍺🍺

jj2007

Thanxalot, Sinsi & Dave - especially for the Windows 10 preview test :bgrin:

Re HeapAlloc error: it's 1.6GB in the peak, quite a lot. But it works much better than the previous version, which chokes earlier.

nidud

#24
deleted

jj2007

Thanks. Here is WinXP, Celeron M (failing with HeapAlloc):
Getting text files
Downloading Canterbury Corpus
4047392 21.03.1997  15:31:58    bible.txt
Unzipping to C:\DOCUME~1\USER\DOCUME~1\DOWNLO~1\Bible.txt
4638690 26.02.1997  11:12:00    E.coli
2473400 18.03.1997  15:13:34    world192.txt
Building 50*Bible.txt as Bible50.txt
Building 200*Bible.txt as Bible200.txt
Reading Bible200.txt
Reading 6076600 lines took 37.9 seconds
Reading 6076600 lines took 32.8 seconds
Reading 6076600 lines took 32.5 seconds