MD4-Hash algorithm ...using SSE up to SSE2

phaap · January 11, 2013, 05:23:41 AM

Good evening!
I've just coded MD4-Hash algorithm in 64-bit-mode using SSE up to v2 ...its currently not optimized very well in the way that: it doesn't using hyper-/multithreading in any way, make no use of multicore and doesn't support general purpose registers to compute the hash parallel to SSE and i think also the algo itself could be optimized to increase performance by a few percents.
At this state it performs ~44 million Hashs/sec - SAMInside(v2.7.0.0) performs ~34 million Hashs/sec on my machine... machine: Intel i7 3770K @ ~4.1Ghz

I've a question about the independence between SSE and MMX? IF they are independent i could use mmx too...
But the greatest increase of performance WILL BE to fit the algorithm to make use of multicore/hyper-/multithreading!
I will be post the core algorithm if someone is interested...

qWord · January 11, 2013, 05:58:54 AM

IMO you can safely assume that the execution unit(s) (Intel called this execution ports) for a specific operation are shared for all instruction sets (everything else would be a waste of die area). The question is, if the core has enough of them thus it can feed and execute SSEx and MMX code parallel.

hutch-- · January 11, 2013, 05:29:06 PM

The problem is to multi-thread a task you will need to redesign the task and this means the algorithm as well. Now with an i7 quad where you have up to SSE4.2, each core can execute SSE instructions so if you can design a process that can parallel the operation you may see big speed gains if you can fit the number of threads to the number of cores available. Unless you bypass it the OS will handle the matching of threads to the core count (in your case 8 cores with Hyperthreading) so if you can design a technique of parallel execution you can look at 5 to 6 times the throughput allowing for some losses with the overhead of multithreading.

phaap · January 11, 2013, 10:26:27 PM

thanks for the replies :t

@hutch
With my design of the algorithm it will be easy to make the whole 'hash-process' fit for multi/-hyperthreading ...since every thread will feed/proceed his own strings/buffers - i don't want/must to split the algorithm and feed the single threads with a portion of the algorithm...

today i will make the algorithm ready to make use of general purpose registers and maybe of MMX ...furthermore i will analyze the algorithm to see if it could be optimized...

phaap · January 12, 2013, 04:44:36 AM

ok, the parallel execution of general purpose registers and SSE leads to a DECREASE of performance - possibly cause the reason qWord had mentioned?!?
the optimization of the core algorithm leads to a gain of performance by around 8% -> ~ 47,2 million Hashs/sec...
...but after i implemented the support of multithreading the performance (running 2 threads) increased to ~71,2 million Hashs/sec...
...next step will be the support of all available threads (in my case 8 )...

qWord · January 12, 2013, 05:02:59 AM

Quote from: phaap on January 12, 2013, 04:44:36 AM
ok, the parallel execution of general purpose registers and SSE leads to a DECREASE of performance - possibly cause the reason qWord had mentioned?!?

Hard to say, but it does not sound surprisingly for me: your GP code does (probably) need the same, or maybe more, code bytes to do the same operation as the SIMD version (and that on less data). I also think this will be a problem with MMX and you should simply use different thread instead. It would be interesting to see, if we can get an performance improvement, if we use a SIMD version for a non-hyper thread and GP code (or MMX ) for the corresponding hyper thread (in compare to the case we use SSEx for all threads).
You may upload your current test bed?

phaap · January 12, 2013, 06:27:03 AM

QuoteIt would be interesting to see, if we can get an performance improvement, if we use a SIMD version for a non-hyper thread and GP code (or MMX ) for the corresponding hyper thread (in compare to the case we use SSEx for all threads).

Yes, THIS IS a nice idea!!! :t
I will test it as soon as possible - since i'm not at home i will post the code as soon as possible, too!

hutch-- · January 12, 2013, 05:52:24 PM

Unloess you are targeting very old hardware there is no reason not to use SSE instructions as you should get a lot higher throughput per thread if you write the code properly. As long as you have a method of chopping up the data to be processed, them you can use the same algo for each thread that is running in parallel and you should see some impressive speed gains if you can run 4 real threads or 8 threads with hyperthreading.

It is probably worth benchmarking the difference between 4 threads and 8 using hyperthreading to see if the latter is any faster.

phaap · January 17, 2013, 04:21:15 AM

Good evening!
I'm sorry I'm late - but i've redesigned the whole algorithm from scratch.
Now it performs about 67 million Hashs/sec on 1 THREAD - performance gain about 42% - now at a level with 'Hash Suite x64 v0.4b' (for 1 Thread) - the fastest application i know (for only CPU computation, no gpu/apu)...
...and i haven't implemented the optimizations which increased the performance at the 'old' algorithm - I keep trying!

phaap · January 19, 2013, 01:58:47 AM

After i restructured the buffer too (to decrease memory access) it boosts the performance (for 1 thread) up to ~71 million Hashs/sec ...and for 2 threads it leads up to ~120 million Hashs/sec
...i think there is no more decided improvement possible... ...except the usage of more/all possible threads - and this WILL be the next step!

EDIT:
I think it was worth the trouble to redesigning from scratch - not only the massive improvement for 1 thread (~71 to ~47 million hashs/sec = ~51%)
...but also the decrease of overhead for 'multithreading':
* old algo (2 threads/ 1 thread) : ~71/~47 million hashs/sec => 51%
* new algo (2 threads/ 1 thread) : ~120/~71 million hashs/sec => 69%

qWord · January 19, 2013, 02:15:45 AM

will we ever see any working example?

phaap · January 19, 2013, 03:38:44 AM

...i've just implemented support for 4 threads!
i can't believe but this boosts the performance up to ~216 million hashs/sec

@qWord:
I'm sorry, but since this was a stiff piece of work and this routine seems the fastest at all (only cpu-usage) (at least for me) i decided not to make it public ...at least not for the moment

frktons · January 19, 2013, 03:41:38 AM

Quote from: qWord on January 19, 2013, 02:15:45 AM
will we ever see any working example?

You have to wait for a while, and after you'll know by yourself.

phaap · January 22, 2013, 04:54:19 AM

i've just implemented use of hyperthreading.
compared with full core support (in my case 4 cores/threads) this gave me a increase of performance by ~40% (8 threads).

* 4 threads: ~220 million hashs/sec
* 8 threads: ~310 million hashs/sec

to be honest: i haven't expect such a boost of performance 'only' by adding use of hyperthreading!

EDIT: by the way, the redesign of the routine degraded the SSE-requirements from SSE4.1 to SSE2

The MASM Forum

News:

MD4-Hash algorithm ...using SSE up to SSE2

phaap

qWord

hutch--

phaap

phaap

qWord

phaap

hutch--

phaap

phaap

qWord

phaap

frktons

phaap