Faster Tokenizer? ...

rrr314159 · February 25, 2015, 12:50:59 PM

Inspired by this thread, I wrote a 64-bit tokenizer algo which is four times faster than RECALL and ToutenMasm's CompteurLignes. There are a few caveats but as far as I can tell it really is at least twice as fast. It uses SSE 4.2: XMM registers, etc. The exe is 6k, not bad considering it's 64-bit.

On my I5 system, it does windows.inc in less than 300 microseconds (usually). It does a 448 meg file in < 60 ~~micro~~milliseconds. (I didn't have a big text file so I used the android sdk zip; it only has 1,791,929 CR's; so a text file would take longer, maybe 100 ms). This is a rate of about 5 gigs a second.

Here are two comparison runs between "token" and the 3 algos from that thread, for Intel and AMD systems:

Code Select

Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz (Actual CPU @ 2.923 GHz)
(MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Tokenize File: \Masm32\include\windows.inc, size 977412

LAST RUN of 12: number of lines 26902; us to tokenize: 136.723

Average us to read FILE         137.212
Average us to run ALGO          142.301
Average for FILE + ALGO         279.514

873100 ;;;; head
87310b
87310d comment * -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-
87315a
87315c       WINDOWS.INC for 32 bit MASM (Version 1.6 RELEASE January 2012)
...
...
961a6a echo ------------------------------------------
961a9b echo WARNING Duplicate include file windows.inc
961acc echo ------------------------------------------
961afd ENDIF
961b04

Results from 32-bit tokenizers: Average for FILE + ALGO, 12 Runs
Hutch           Yves            Recall

h 3485          y 1442          j 1293
----------------------------------------------------
Press any key to continue . . .

AMD A6-6310 APU with AMD Radeon R4 Graphics     (Actual CPU @ 1.755 GHz)
(MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Tokenize File: \Masm32\include\windows.inc, size 977412

LAST RUN of 12: number of lines 26902; us to tokenize: 299.304

Average us to read FILE         539.084
Average us to run ALGO          341.554
Average for FILE + ALGO         880.638

11e3100 ;;;; head
11e310b
11e310d comment * -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
-=-
11e315a
11e315c       WINDOWS.INC for 32 bit MASM (Version 1.6 RELEASE January 2012)
...
...
12d1a6a echo ------------------------------------------
12d1a9b echo WARNING Duplicate include file windows.inc
12d1acc echo ------------------------------------------
12d1afd ENDIF
12d1b04
______________________
Results from 32-bit tokenizers: Average for FILE + ALGO, 12 Runs
Hutch           Yves            Recall

h 8007          y 3554          j 3640
----------------------------------------------------
Press any key to continue . . .

Now, caveats. I'm using uninitialized memory instead of allocating, and not counting lines first, as hutch does before allocating. That would slow it down a good 50%. Second, I'm using my own timing macros, which haven't been tested very thoroughly. They work more-or-less the same as other timers (rdtsc). I had to write my own for 64-bit, and other reasons. So it's possible I'm doing something stupid, and my macros mis-time by a factor of 2 (or 4!).

There's also this weird fact: when you run it in a pop-up command window, it's usually about 70% slower!? I would love to know why, any suggestions welcome. So, please run it from a regular (not pop-up) cmd window. In that case, it slows down every now and then, maybe one out of ten tries.

How it works: The basic idea is, the xmm search is unrolled 18 times. There's more to it; merely unrolling hardly helps at all. The code is in the zip. I would like to mention that altho this code is entirely my own work, of course, it's not entirely "original". I followed hutch's end-of-line approach; stole 100 lines of macros from MASMBasic; and borrowed ToutenMasm's bit-handling technique for manipulating the mask from pmovmskb, much cleaner than my first clumsy method. Plus I shld mention Agner Fog and many other internet gurus; the Intel architecture manuals are hard to understand w/o some help.

If it turns out that (due to buggy timing methods) my algo is actually slower than others, then I blame those same people mentioned above :)

The zip has the following files:

token.asm, main program, the algo plus the timing mechanism
support_macros.asm, all the 64-bit macros required (like print, InputFile, etc)
timing_macros.asm, the timing macros.
support.inc, all required includes (u also need 64-bit kernel32, user32, msvcrt libs; JWasm, 64-bit linker)
token.exe
make_token.bat, batch make file
run_token.bat, batch file allows running token.exe from pop-up cmd window (it's slow)

hutchyvesjj.asm, code from that thread, cosmetically modified (functionality unchanged)
hutchyvesjj.exe (33K) ** SEE NOTE **
runtokenizers.bat, batch file which produced the runs given above

**NOTE** decided not to include this, made the zip 2 big. U can easily re-compile the asm file.

Any comments welcome of course. To anticipate one possible comment: I'm willing to convert it to 32-bits, to allow direct comparison; it will be a little slower; but 64-bit is so much easier I'd rather not.

sinsi · February 25, 2015, 12:58:29 PM

AMD A10-7850K APU with Radeon(TM) R7 Graphics (Actual CPU @ 3.607 GHz)
(MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Tokenize File: \Masm32\include\windows.inc, size 977412

LAST RUN of 12: number of lines 26902; us to tokenize: 191.513

Average us to read FILE 0.000
Average us to run ALGO 0.012
Average for FILE + ALGO 0.012

AMD is fast eh?

Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (Actual CPU @ 3.515 GHz)
(MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Tokenize File: \Masm32\include\windows.inc, size 977412

LAST RUN of 12: number of lines 26902; us to tokenize: 101.570

Average us to read FILE 158.600
Average us to run ALGO 105.400
Average for FILE + ALGO 264.001

rrr314159 · February 25, 2015, 01:08:08 PM

Thanks sinsi,

At the moment I can't imagine why the 3 "Average" numbers didn't come out on your AMD system. Just re-checked the zip, works fine for me. Baffled. The 191.513 is presumably accurate. Well at least it worked on the Intel.

Thanks for giving it a try!

[edit] Figured it out ... u edited those numbers! Very funny. Your avatar threw me off: impossible to mistrust such an innocent face :)

sinsi · February 25, 2015, 01:27:57 PM

We normally use MichaelW's timing macros to keep things standard. Look in the sticky at the top of the lab.

rrr314159 · February 25, 2015, 02:20:43 PM

No, I'm sorry, that doesn't work. For one thing the empty timing loop approach is, IMHO, no longer viable. My CPU's just optimize it away to nothing, so it doesn't really reflect how much time the working loop takes. Also, that algo pushes/pops across the test target; that's a no-no with 64-bit, because of the calling convention, if you use habran's new JWasm, or do similar stack manipulation. In fact there are old posts where people got caught by that. qWord's updated version fixes it, but ...

The problem is, the whole approach is not flexible enough. For instance, with this token algo, I had to separate the file-reading from the algo. (The algo writes zero's to delimit lines, so file must be re-read each time). File opening has a huge variance, which swamps the algo results; if you don't separate it, u need at least 4 times as many runs (hard to say how many) to get the same confidence level. Show me how to separate those numbers with the old method!

But even beyond that ... the whole idea that timing is something special, to be done on a special test bed after the algo is written, is no longer viable. (IMHO) These CPU's are so tricky, the timer must be an integral part of your code, always available, constantly consulted, to find the best approach. It's no longer possible to design a fast algo a priori, from logical considerations, u have to let the CPU tell u what's right. Newton himself couldn't do it.

With 64-bits you can put 100 years of cycles in a single register; no longer need to keep the "high part" and "low part". I do statistical analyses on these numbers, on-the-fly, it's as easy as small numbers used to be. The old approach is still probably best for 32-bit, but with 64 it's like wearing handcuffs.

IMHO.

[edit] Well, OK - probably Newton could do it.

hutch-- · February 25, 2015, 03:42:38 PM

I have always been an advocate of real time testing, make a really big sample (1gb), isolate the load time from the algo timing and see if you can get a reliable timing that way.

jj2007 · February 25, 2015, 08:09:03 PM

Quote from: rrr314159 on February 25, 2015, 12:50:59 PMOn my I5 system, it does windows.inc in less than 300 microseconds (usually). It does a 448 meg file in < 60 microseconds.

Milliseconds, I guess? 448 meg = 600*WinInc...

With StringToArray, which is Recall without reading the file:
Let esi=FileRead$("\Masm32\include\Windows.inc")
NanoTimer()
StringToArray esi, L$()
Print Str$("StringToArray:\t%i us\n\n", NanoTimer(us))

... I get typically 1120 us on my i5. Your algo does it in 500 or 250, depending on my machine's mood.

In any case: CONGRATS, I'm impressed :t

rrr314159 · February 26, 2015, 03:34:42 AM

Woops - milliseconds it is. ::)

U know, this sort of technique is generally applicable - can use it with other string handling algos, BMP processing, etc. Obviously somebody somewhere knows about it, but haven't yet found it on the net. Of course what I'm doing is getting the execution units working in parallel (3 is a magic number for this).

I would love to know why it sometimes takes about twice as long. What is affecting the computer's mood? Perhaps the longer time is when a context switch happens to occur while the algo is running - but strange it's always about twice. Would expect it to add, say, 250 us always; but if running on a shorter or longer file, u get 100 or 500 extra - always about twice. It never happens with other algos, like Recall, as far as I can tell. Another important hint, if u also check the file read number, it does the same - approx. doubles at the same time.

Thanks for checking it out! This is the first time someone else has actually compared it; finally gives me confidence it works.

npnw · April 16, 2015, 12:06:21 PM

Off top of my head, I would say its a cacheing issue. If it can't load into cache, and has to read memory, then that is what is taking the time.
Clean your cache and you would get a consistent speed or dirty the cache and it would be consistent.

rrr314159 · April 16, 2015, 12:26:36 PM

npnw, thx for the suggestion, makes sense. I'll let u know the results when I check it out ...

Gunther · April 17, 2015, 06:05:46 AM

Hi rrr,

hera are the results:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (Actual CPU @ 3.313 GHz)
(MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX)
Tokenize File: \Masm32\include\windows.inc, size 977412

LAST RUN of 12: number of lines 26902; us to tokenize: 265.561

Average us to read FILE 0.000
Average us to run ALGO 0.013
Average for FILE + ALGO 0.013

12d3100 ;;;; head
12d310b
12d310d comment * -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
-=-
12d315a
12d315c WINDOWS.INC for 32 bit MASM (Version 1.6 RELEASE January 2012)
...
...
13c1a6a echo ------------------------------------------
13c1a9b echo WARNING Duplicate include file windows.inc
13c1acc echo ------------------------------------------
13c1afd ENDIF
13c1b04
______________________

Gunther

The MASM Forum

News:

Faster Tokenizer? ...

rrr314159

sinsi

rrr314159

sinsi

rrr314159

hutch--

jj2007

rrr314159

npnw

rrr314159

Gunther