The decimal-adjust instructions (DAA, DAS, AAA, AAS, AAM, AAD) are not available in x64, and I think they only work at the byte level (on AL), so BCD arithmetic is going to be slow. You could probably write your own decimal-adjust routines that work on bigger chunks than a byte, but that would also make shifting more complex.
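To give an idea of what a "bigger chunk" decimal adjust could look like, here is a rough sketch in C that adds 15 packed BCD digits in one 64-bit word. It uses the well-known carry-detection trick (add 6 to every digit up front, then take the 6 back out of every digit that did not carry); it is not what DAA does internally. The name bcd_add15 is made up for illustration, and I assume both inputs are valid packed BCD with the top nibble left at zero so the final carry has somewhere to go.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add two packed-BCD values of up to 15 digits each.
   Assumes every nibble of a and b is 0..9 and the top nibble is 0. */
uint64_t bcd_add15(uint64_t a, uint64_t b)
{
    assert((a >> 60) == 0 && (b >> 60) == 0);

    /* Bias each of the 15 low digits by 6 so a decimal carry becomes
       a carry out of the nibble. */
    uint64_t t1 = a + 0x0666666666666666ULL;
    uint64_t t2 = t1 + b;                      /* biased binary sum */

    /* t1 ^ b is the sum without carries, so XOR with t2 leaves the
       carry that entered each bit position. */
    uint64_t carries = t2 ^ (t1 ^ b);

    /* A digit produced no carry if the lowest bit of the next nibble
       received none; those digits still hold the extra 6. */
    uint64_t no_carry = ~carries & 0x1111111111111110ULL;

    /* Turn each marker bit into a 6 in the nibble below it. */
    uint64_t fix = (no_carry >> 2) | (no_carry >> 3);

    return t2 - fix;
}
```

For example, bcd_add15(0x19, 0x03) gives 0x22. Shifting by one digit is just a 4-bit shift of the whole word, but carrying the 16th nibble across words is the extra work mentioned above.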
Binary arithmetic is the fastest and the easiest to implement, at the cost of more complex input/output routines.
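The extra I/O cost is mostly the conversion between binary limbs and decimal text. As a minimal sketch (the function names and the little-endian 32-bit limb layout are my own assumptions, not anything from this thread), output can be done by dividing the whole number by 10^9 repeatedly and printing each remainder as nine digits:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Divide the little-endian number n[0..len-1] by 10^9 in place and
   return the remainder (hypothetical helper). */
uint32_t div_1e9(uint32_t *n, size_t len)
{
    uint64_t rem = 0;
    for (size_t i = len; i-- > 0; ) {
        uint64_t cur = (rem << 32) | n[i];
        n[i] = (uint32_t)(cur / 1000000000u);
        rem = cur % 1000000000u;
    }
    return (uint32_t)rem;
}

/* Write n as decimal text into out and return out. n is destroyed.
   out_size must be at least 10 * len + 1 bytes. */
char *to_decimal(uint32_t *n, size_t len, char *out, size_t out_size)
{
    assert(len > 0 && out_size >= 10 * len + 1);

    size_t pos = out_size;
    out[--pos] = '\0';

    for (;;) {
        uint32_t chunk = div_1e9(n, len);
        while (len > 0 && n[len - 1] == 0)
            len--;                       /* drop limbs that became zero */
        int last = (len == 0);

        /* Digits are produced lowest first, so fill the buffer backwards.
           Every chunk except the most significant one is zero padded. */
        for (int d = 0; d < 9; d++) {
            out[--pos] = (char)('0' + chunk % 10);
            chunk /= 10;
            if (last && chunk == 0)
                break;
        }
        if (last)
            break;
    }

    memmove(out, out + pos, out_size - pos);   /* move text to the front */
    return out;
}
```

This is quadratic in the number of limbs, which is why the I/O side ends up being the slow part for very large numbers.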
I have stepped through AAA in the debugger so that I can write an AAA macro that works on registers other than AL, which gives better possibilities for unrolling.
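For reference, this is roughly what such a macro has to reproduce, written as a C sketch that follows the AAA pseudocode in the Intel manual for current processors (the original 8086 handled the carry into AH slightly differently). The function name aaa_emulate and passing AF through a pointer are only my way of modelling it; a real macro would do the same steps on whatever 16-bit register you choose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of AAA on an arbitrary 16-bit value.
   ax : high byte plays the role of AH, low byte the role of AL
   af : auxiliary carry flag from the preceding ADD, updated on return
        (AAA sets CF to the same value as AF) */
uint16_t aaa_emulate(uint16_t ax, int *af)
{
    assert(af != NULL && (*af == 0 || *af == 1));

    if ((ax & 0x0F) > 9 || *af) {
        ax = (uint16_t)(ax + 0x106);   /* AL += 6 and AH += 1 in one add */
        *af = 1;
    } else {
        *af = 0;
    }
    return (uint16_t)(ax & 0xFF0F);    /* keep only the low nibble of AL */
}
```

So after adding the digits 8 and 5 (AL = 0Dh, AF = 0) the result is 0103h, meaning digit 3 with a carry of 1 in the high byte. The awkward part for a macro is AF: after an ADD on another register you have to recover the nibble carry yourself, for example from the XOR of the operands and the sum.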
I am trying a SIMT solution, where one thread does the slow print/fprint and the slow input conversion while the other threads calculate.
But if you have many cores, you can try a solution that executes many decimal adjusts in parallel, which might be easier if you lack SIMD skills.
If you only have 32-bit code, SSE2 128-bit math might help.
I also wondered about using the stack, either with a kind of recursion or by just calculating and pushing the results onto the stack, because CreateThread supports a custom stack size bigger than the standard 1 MB.