I wrote some matrix transpose routines for different sizes, from 2x2 up to 4x4.

The idea is to use them as building blocks for an algorithm that transposes very large matrices of any size.
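To illustrate the building-block idea, here is a minimal sketch in C of how a large row-major matrix could be transposed tile by tile. It is not taken from the attached sources: `transpose_tile` and `transpose_blocked` are hypothetical names, the `float` element type is an assumption, and the plain-C tile function only stands in for the fixed-size routines.

```c
#include <stddef.h>

/* Hypothetical stand-in for one of the small fixed-size routines:
   transposes a bh x bw tile (bh, bw <= 4) from src into dst.
   Strides are given in elements, not bytes. */
static void transpose_tile(const float *src, size_t src_stride,
                           float *dst, size_t dst_stride,
                           size_t bh, size_t bw)
{
    for (size_t r = 0; r < bh; ++r)
        for (size_t c = 0; c < bw; ++c)
            dst[c * dst_stride + r] = src[r * src_stride + c];
}

/* Transposes a rows x cols row-major matrix into dst (cols x rows)
   by walking it in tiles of at most 4x4. Tiles on the right and
   bottom edges are smaller when rows or cols is not a multiple of 4. */
void transpose_blocked(const float *src, float *dst,
                       size_t rows, size_t cols)
{
    enum { TILE = 4 };
    for (size_t r = 0; r < rows; r += TILE) {
        size_t bh = rows - r < TILE ? rows - r : TILE;
        for (size_t c = 0; c < cols; c += TILE) {
            size_t bw = cols - c < TILE ? cols - c : TILE;
            /* Source tile starts at (r, c); its transpose lands at (c, r). */
            transpose_tile(src + r * cols + c, cols,
                           dst + c * rows + r, rows, bh, bw);
        }
    }
}
```

In a real version the inner call would dispatch on `bh` and `bw` to the matching fixed-size routine (4x4, 4x3, 3x4 and so on). Note that an edge tile can also end up 1 wide or 1 high, which none of the sizes listed here covers, so that case would need its own handling.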

I have included the sources and a timing mechanism that measures the clock cycles and the time each matrix routine takes over 10,000,000 calls.
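For anyone who wants to cross-check the numbers from a high-level language, a rough sketch of the wall-time half of such a measurement in C might look like this. It is not the timer from the attached sources: `time_routine` and `transpose2x2` are hypothetical names, and the sketch uses the standard `clock()` function rather than a cycle counter (counting cycles would need something like the x86 timestamp counter, which is left out here).

```c
#include <time.h>

#define CALLS 10000000L

/* Hypothetical routine under test: a plain-C 2x2 transpose. */
static void transpose2x2(const float *src, float *dst)
{
    dst[0] = src[0]; dst[1] = src[2];
    dst[2] = src[1]; dst[3] = src[3];
}

/* Returns the CPU time in seconds spent on CALLS invocations of fn.
   The loop and the sink update are included in the measured time. */
double time_routine(void (*fn)(const float *, float *))
{
    float src[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float dst[4];
    volatile float sink = 0.0f;  /* discourages the compiler from dropping the calls */

    clock_t start = clock();
    for (long i = 0; i < CALLS; ++i) {
        fn(src, dst);
        sink += dst[1];
    }
    clock_t stop = clock();

    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `time_routine(transpose2x2)` and printing the result with `%.9f` gives a figure comparable in form to the RoutineTime column. Because the loop overhead is part of the measurement, the absolute values will not match the assembly timer.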

You can run "SaveResults.bat" to save the timing results to "MatrixTimerResults.txt".

It would be nice if you could post your results here and give some feedback on how to improve the routines.

Thanks.

Here are my results:

` 10000000 calls per Matrix for the Cycle counter and the Routine timer.`

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

4x4 Cycles: 4 CodeSize: 66 RoutineTime: 0.032362423 seconds

4x3 Cycles: 3 CodeSize: 56 RoutineTime: 0.029423712 seconds

4x2 Cycles: 4 CodeSize: 42 RoutineTime: 0.032384074 seconds

3x4 Cycles: 4 CodeSize: 62 RoutineTime: 0.032416620 seconds

3x3 Cycles: 2 CodeSize: 51 RoutineTime: 0.026471800 seconds

3x2 Cycles: 3 CodeSize: 35 RoutineTime: 0.029419801 seconds

2x4 Cycles: 0 CodeSize: 31 RoutineTime: 0.020588231 seconds

2x3 Cycles: 0 CodeSize: 27 RoutineTime: 0.020594936 seconds

2x2 Cycles: -1 CodeSize: 18 RoutineTime: 0.017652313 seconds

Code Alignment 64 byte check: 000h

EDIT: I forgot to mention that the routines are meant to be used backwards in memory, because when a row has 3 values, 4 are written!