OK, running the general-purpose-register code and the SSE code in parallel leads to a DECREASE in performance, possibly for the reason qWord mentioned?
Optimizing the core algorithm gave a performance gain of around 8% -> ~47.2 million hashes/sec...
...but after I implemented multithreading support, the performance (running 2 threads) increased to ~71.2 million hashes/sec, i.e. roughly 1.5x the single-threaded rate...
...the next step will be support for all available threads (8 in my case)...
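
For anyone following along, here is a rough C++ sketch of how scaling from 2 threads to all available ones could look. It is not my actual code: `toy_hash` is just an FNV-1a stand-in for the real hash routine, and `scan_range`, `scan_parallel` and `worker_count` are made-up names for illustration. The only assumption is that the candidates form a contiguous range that can be cut into one independent chunk per thread.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for the real hash routine (hypothetical, NOT the algorithm from this thread):
// FNV-1a over the 8 bytes of the candidate value.
std::uint32_t toy_hash(std::uint64_t candidate) {
    std::uint32_t h = 2166136261u;
    for (int i = 0; i < 8; ++i) {
        h ^= static_cast<std::uint8_t>(candidate >> (8 * i));
        h *= 16777619u;
    }
    return h;
}

// Hash every candidate in [begin, end) and count those whose hash equals target.
std::uint64_t scan_range(std::uint64_t begin, std::uint64_t end, std::uint32_t target) {
    std::uint64_t hits = 0;
    for (std::uint64_t c = begin; c < end; ++c) {
        if (toy_hash(c) == target) ++hits;
    }
    return hits;
}

// Number of worker threads to start; hardware_concurrency() may return 0
// when the value is unknown, so fall back to a single thread.
unsigned worker_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n ? n : 1;
}

// Split [0, total) into one contiguous chunk per thread and scan the chunks in parallel.
// The last thread also takes the remainder when total is not divisible by threads.
std::uint64_t scan_parallel(std::uint64_t total, std::uint32_t target, unsigned threads) {
    if (threads == 0) threads = 1;
    std::vector<std::uint64_t> hits(threads, 0);  // one result slot per thread, no locking needed
    std::vector<std::thread> pool;
    const std::uint64_t chunk = total / threads;
    for (unsigned t = 0; t < threads; ++t) {
        const std::uint64_t begin = t * chunk;
        const std::uint64_t end = (t + 1 == threads) ? total : begin + chunk;
        pool.emplace_back([&hits, t, begin, end, target] {
            hits[t] = scan_range(begin, end, target);
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t sum = 0;
    for (std::uint64_t h : hits) sum += h;
    return sum;
}
```

Calling `scan_parallel(total, target, worker_count())` would use every logical core the runtime reports. Each thread writes its result only once, into its own slot, so the threads do not need to synchronize until the final `join`. Whether 8 threads actually give 8x depends on how many of them are physical cores, so the numbers above should not be extrapolated linearly.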