Actually, there is no problem with the printf in my demo.
The problem is with WaitForMultipleObjects, which cannot wait on more than MAXIMUM_WAIT_OBJECTS (64) handles, while I am creating 3000 threads. The side effect is that the program may exit before all threads have completed their job.
As a quick fix that does not complicate things further, all we need to do is replace WaitForMultipleObjects with a Sleep of, say, 2 or 3 seconds. For 3000 threads, the sum reported at the end should be 225225000.
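For anyone who prefers not to rely on a Sleep, here is a minimal sketch of waiting on the handles in batches of at most MAXIMUM_WAIT_OBJECTS. It is not taken from my demo: next_batch and wait_for_all_threads are hypothetical helper names, and the array of thread handles is assumed to be filled in already. The #else branch exists only so that the batching arithmetic compiles off Windows.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Off Windows only the batching arithmetic compiles; 64 mirrors the Win32 value. */
#define MAXIMUM_WAIT_OBJECTS 64
#endif

/* Hypothetical helper: number of handles in the batch that starts at `offset`. */
static size_t next_batch(size_t total, size_t offset)
{
    size_t remaining = total - offset;
    return remaining < MAXIMUM_WAIT_OBJECTS ? remaining : MAXIMUM_WAIT_OBJECTS;
}

#ifdef _WIN32
/* Hypothetical helper: wait for every thread handle, one batch at a time.
   Returns TRUE if all waits succeeded, FALSE if any wait failed. */
static BOOL wait_for_all_threads(const HANDLE *handles, size_t count)
{
    size_t offset = 0;
    while (offset < count) {
        size_t n = next_batch(count, offset);
        /* bWaitAll = TRUE: return only when every handle in this batch is signaled. */
        DWORD rc = WaitForMultipleObjects((DWORD)n, handles + offset, TRUE, INFINITE);
        if (rc == WAIT_FAILED)
            return FALSE;
        offset += n;
    }
    return TRUE;
}
#endif
```

With 3000 handles this makes 46 waits of 64 handles and a final wait of 56. Unlike the Sleep, it returns only once every thread has actually finished.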
I will not change the program now. I am going to work on the RTM part and will eventually come up with an integrated solution.
BTW, it is normal for the threads to run out of order; there is no FIFO guarantee.