Hi Guys
Someone knows how to make a transposing matrix of any size for Float values ??
This is for usage in image processing exchanging the width by the height.
Example:
I have a sequence of values like this. On these cases, they are not pixels in RGB, they are simple float values representing Luma, for example:
[ValueA1:
F$ 1,2,3
F$ 4,5, 6]
Which means a row (width) of 3 Values and col (height) of 2 Values.
I need them to be transposed as:
[ValueA1:
F$ 1, 4
F$ 2, 5
F$ 3, 6]
It need to be fast and work on any size (squared or not) and work either if height is bigger then width and vice-versa. As:
[ValueA1:
F$ 1, 2
F$ 3, 4
F$ 5, 6]
will turn onto:
[ValueA1:
F$ 1,3, 5
F$ 2, 4, 6]
Note: "F$" is Float notation for RosAsm. But...if someone also have this for Real (Double) it would be appreciated as well.
On the above examples they are only to display the functionality, but in fact it will be used also on large matrixes with size: 507 x 208 , 64x64, 480x640, 1024x768 etc etc
I tried to port the code from here: http://stackoverflow.com/questions/16737298/what-is-the-fastest-way-to-transpose-a-matrix-in-c
But...it seems to not working.
Quote from: guga on April 01, 2017, 06:01:08 PMIt need to be fast
This is the biggest problem :biggrin:
mov esi, offset source
mov edi, offset dest
xor edx, edx
mov ecx, rows
.repeat
mov eax, [esi+edx]
stosd
add edx, columns*4
dec ecx
.until zero?
I use this little macro for everything:matrix macro x, y , xlarge
mov eax , x
mov ecx , xlarge
mul ecx
add eax, y
endm
The idea is:ForLp a, 0, 2
ForLp b, 0, 1
matrix a, b, 3
fld matriz1[eax*8]
matrix b, a, 2
fstp matriz2[eax*8]
Next b
Next b
LATER: I forget to say matrix are 0 based
I think FPU is irrelevant.
It is about moving array of sizeof float to new table to a new order.
Like this improved C versionvoid transpose(float *src, float *dst, const int N, const int M)
{
// #pragma omp parallel for
register long *p1 = (long *)src;
register long *p2 = (long *)dst;
for(int n = 0; n<N*M; n++) {
int i = n/N;
int j = n%N;
// dst[n] = src[M*j + i];
p2[n] = p1[M*j + i];
}
}
REAL4:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
MOV ECX,ESI
MOV EAX,col
doLine:
FLD REAL4 PTR [ECX]
FSTP REAL4 PTR [EDI]
ADD EDI,4
ADD ECX,4*col
DEC EAX
JNZ doLine
ADD ESI,4
DEC EDX
JNZ doCol
REAL8:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
MOV ECX,ESI
MOV EAX,col
doLine:
FLD REAL8 PTR [ECX]
FSTP REAL8 PTR [EDI]
ADD EDI,8
ADD ECX,8*col
DEC EAX
JNZ doLine
ADD ESI,8
DEC EDX
JNZ doCol
REAL10:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
MOV ECX,ESI
MOV EAX,col
doLine:
FLD REAL10 PTR [ECX]
FSTP REAL10 PTR [EDI]
ADD EDI,10
ADD ECX,10*col
DEC EAX
JNZ doLine
ADD ESI,10
DEC EDX
JNZ doCol
interesting code but col isn't constant
ADD ECX,4*col
i leave this issue to professionals ;)
Thanks guys...I´ll try the examples. to see if a faster version can be made. I have one that transpose a 4x4 matrix using SSE2, but didn´t tested on this version yet...
For these initial testings, i built a slow version biased on JJ´s code :).
Proc transpose_block:
Arguments @Input, @Output, @Width, @Height
Local @NextRowPos, @MaxWidth
Uses edi, esi, ecx, ebx
mov edi D@Output
mov esi D@Input
mov ecx D@Width | mov D@MaxWidth ecx
mov ebx D@Width | shl ebx 2; ebx = end of width that is the nextYPos/Line
L2:
mov ecx D@Height
xor edx edx
L1:
; edx starts at 0. Cur row pos
mov eax D$esi+edx
stosd
add edx ebx ; add to the next line
dec ecx
jnz L1<
add esi 4 ; go to the next col line
dec D@MaxWidth
jnz L2<
EndP
This seems to work regardless the size of the matrix. So, it works either if Width > Height, or Height>Width or Width=height.
Tested on:
[ValueA: D$ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
D$ 17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
D$ 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48
D$ 49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64
D$ 65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
D$ 81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96
D$ 97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
D$ 113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128]
[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA
call transpose_block ValueA1, ValueB, 16, 8
[ValueA: D$ 1, 2
D$ 3, 4]
[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA
call transpose_block ValueA1, ValueB, 2, 2
[ValueA: D$ 1, 2
D$ 3, 4
D$ 5, 6]
[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA
call transpose_block ValueA1, ValueB, 2, 3
; Matrix 20 X 30
[ValueA: F$ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
F$ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
F$ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
F$ 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
F$ 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
F$ 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
F$ 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140
F$ 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160
F$ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180
F$ 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200
F$ 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220
F$ 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240
F$ 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260
F$ 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280
F$ 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300
F$ 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320
F$ 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340
F$ 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360
F$ 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380
F$ 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400
F$ 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420
F$ 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440
F$ 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460
F$ 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480
F$ 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500
F$ 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520
F$ 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540
F$ 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560
F$ 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580
F$ 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600]
[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA
call transpose_block ValueA1, ValueB, 20, 30
Quote from: guga on April 02, 2017, 04:23:39 AMI have one that transpose a 4x4 matrix using SSE2
Loading is fine if your columns can be divided in 4*4 pieces, but the storing is not very effective: MOVD and pshufd. Doesn't sound very fast...
If you need REAL8: movlps is relatively fast.
A simple Qword/Dword X/Y pointer swop, with a tmp variable would do, without any FPU usage.
:biggrin:
A works with general regs, B with xmm regs:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2306 cycles for 100 * TransposeA
775 cycles for 100 * TransposeB
1917 cycles for 100 * TransposeA
778 cycles for 100 * TransposeB
2294 cycles for 100 * TransposeA
778 cycles for 100 * TransposeB
2289 cycles for 100 * TransposeA
776 cycles for 100 * TransposeB
1919 cycles for 100 * TransposeA
774 cycles for 100 * TransposeB
61 bytes for TransposeA
114 bytes for TransposeB
4*4 tranpose from my d3dmath.lib
align 16
D3DMatrixTranspose proc pOut:DWORD,pM:DWORD
; [0 1 2 3] [0 4 8 C]
; [4 5 6 7] [1 5 9 D]
; [8 9 A B] [2 6 A E]
; [C D E F] [3 7 B F]
mov eax,pM
movaps xmm0,[eax+0] ; [0 1 2 3]
movaps xmm1,[eax+16] ; [4 5 6 7]
movaps xmm4,[eax+32] ; [8 9 A B]
movaps xmm3,[eax+48] ; [C D E F]
mov eax,pOut
movaps xmm2,xmm0 ; [0 1 2 3]
movaps xmm5,xmm1 ; [4 5 6 7]
shufps xmm0,xmm3,01000100b ; [0 1 C D]
shufps xmm1,xmm4,01000100b ; [4 5 8 9]
shufps xmm2,xmm4,11101110b ; [2 3 A B]
shufps xmm5,xmm3,11101110b ; [6 7 E F]
movaps xmm4,xmm0 ; [0 1 C D]
movaps xmm3,xmm2 ; [2 3 A B]
shufps xmm0,xmm1,10001000b ; [0 C 4 8]
shufps xmm1,xmm4,11011101b ; [5 9 1 D]
shufps xmm2,xmm5,10001000b ; [2 A 6 E]
shufps xmm3,xmm5,11011101b ; [3 B 7 F]
shufps xmm0,xmm0,01111000b ; [0 4 8 C]
shufps xmm1,xmm1,11010010b ; [1 5 9 D]
shufps xmm2,xmm2,11011000b ; [2 6 A E]
shufps xmm3,xmm3,11011000b ; [3 7 B F]
movaps [eax+0],xmm0 ; [0 4 8 C]
movaps [eax+16],xmm1 ; [1 5 9 D]
movaps [eax+32],xmm2 ; [2 6 A E]
movaps [eax+48],xmm3 ; [3 7 B F]
ret
D3DMatrixTranspose endp
Thank you guys
Ok..we are close :)
I gave a test on JJ and Siekmanski code, and Siekmanski seems a bit faster.
the results i've got for 20x32 matrix are:
Siekmanski : 231.37439769473289 ns
JJ´s: 281.62775417344523 ns
But...the problem remains the same concerning the size of the matrix. Both works as expected if Width and Height are bigger then 4...But...how to make it work for any matrix size (or sizes not multiple of 4) ? I mean, width = 11, height = 23...or width = 3 height = 2 etc ?
The tested codes ported to RosAsm are:
JJ´s
Proc Fast_transpose_blockJJ:
Arguments @Input, @Output, @Width, @Height
Local @CurXPos
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
mov ebx D@Width | mov D@CurXPos ebx | shl ebx 2
L2:
xor ecx ecx
xor edx edx
Do
; copy the 1st 4 dwords from esi to register XMM
movd XMM0 D$esi+edx | add edx ebx
movd XMM1 D$esi+edx | add edx ebx
movd XMM2 D$esi+edx | add edx ebx
movd XMM3 D$esi+edx | add edx ebx
pslldq XMM1 4
pslldq XMM2 8
pslldq XMM3 12
xorps XMM0 XMM1
xorps XMM0 XMM2
xorps XMM0 XMM3
movaps X$edi XMM0
add edi 16
add ecx 4
Loop_Until ecx >= D@Height
add esi 4
dec D@CurXPos | jnz L2<<
mov eax D@Output
EndP
calling it as:
call Fast_transpose_blockJJ ValueA, Output, 20, 32
Siekmanski´s
Proc D3DMatrixTranspose:
Arguments @Input, @Output, @Width, @Height
Local @CurXPos, @NextHeightPos
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
mov ebx D@Width | mov ecx ebx | shl ebx 2
mov eax D@Height | shl eax 2 | mov D@NextHeightPos eax
shr ecx 2 | mov D@CurXPos ecx ; divide by 4
L2:
xor ecx ecx
xor edx edx
Do
movaps xmm0 X$esi+edx ; [0 1 2 3]
add edx ebx
movaps xmm1 X$esi+edx ; [4 5 6 7]
add edx ebx
movaps xmm4 X$esi+edx ; [8 9 A B]
add edx ebx
movaps xmm3 X$esi+edx ; [C D E F]
movaps xmm2 xmm0 ; [0 1 2 3]
movaps xmm5 xmm1 ; [4 5 6 7]
shufps xmm0 xmm3 00_01000100 ; [0 1 C D]
shufps xmm1 xmm4 00_01000100 ; [4 5 8 9]
shufps xmm2 xmm4 00_11101110 ; [2 3 A B]
shufps xmm5 xmm3 00_11101110 ; [6 7 E F]
movaps xmm4 xmm0 ; [0 1 C D]
movaps xmm3 xmm2 ; [2 3 A B]
shufps xmm0 xmm1 00_10001000 ; [0 C 4 8]
shufps xmm1 xmm4 00_11011101 ; [5 9 1 D]
shufps xmm2 xmm5 00_10001000 ; [2 A 6 E]
shufps xmm3 xmm5 00_11011101 ; [3 B 7 F]
shufps xmm0 xmm0 00_01111000 ; [0 4 8 C]
shufps xmm1 xmm1 00_11010010 ; [1 5 9 D]
shufps xmm2 xmm2 00_11011000 ; [2 6 A E]
shufps xmm3 xmm3 00_11011000 ; [3 7 B F]
movaps X$edi xmm0 ; [0 4 8 C]
mov eax D@NextHeightPos
movaps X$edi+eax xmm1 ; [1 5 9 D]
add eax D@NextHeightPos
movaps X$edi+eax xmm2 ; [2 6 A E]
add eax D@NextHeightPos
movaps X$edi+eax xmm3 ; [3 7 B F]
add edx ebx
add edi 16
add ecx 4
Loop_Until ecx >= D@Height
add esi 16;4
add edi eax
dec D@CurXPos | jnz L2<<
mov eax D@Output
EndP
calling it as:
call D3DMatrixTranspose ValueA, Output, 20, 32
JJ´s code is very close, but...it fails if we have matrix like:
[ValueD2: F$ 1, 2, 3, 4, 5, 6,
F$ 7, 8, 9, 10, 11, 12,
F$ 13, 14, 15, 16, 17, 18,
F$ 19, 20, 21, 22, 23, 24
F$ 25, 26, 27, 28, 29, 30]
call Fast_transpose_blockJJ ValueD2, Output, 6,5
Siekmanski´s will crash, because the width and height are smaller then 16. So i couldn´t test his version yet on this example.
Wrote another faster 4 by 4 transpose routine...
Quote from: guga on April 02, 2017, 01:56:12 PMBut...how to make it work for any matrix size (or sizes not multiple of 4) ? I mean, width = 11, height = 23...or width = 3 height = 2 etc ?
Simple solution: Check the size and take my fast version if possible, or my version A if too small. After all, why should small matrices need a fast solution?
Thanks SiekManski. I´l take a look. It seems pretty fast.
Hi JJ
I made another test and a check for the size. So far it is working for any Height longer then 4 bytes and not multiple of 4. Tomorrow i´ll test to see if it works also on height smaller then 4
Proc Fast_transpose_blockJJguga:
Arguments @Input, @Output, @Width, @Height
Local @CurXPos
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
mov ebx D@Width | mov D@CurXPos ebx | shl ebx 2
; get remainders for edi
xor eax eax
mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
jz L1>
mov eax 4 | sub eax ecx | shl eax 2
L1:
L2:
xor ecx ecx
xor edx edx
Do
; copy the 1st 4 dwords from esi to register XMM
movd XMM0 D$esi+edx | add edx ebx
movd XMM1 D$esi+edx | add edx ebx
movd XMM2 D$esi+edx | add edx ebx
movd XMM3 D$esi+edx | add edx ebx
pslldq XMM1 4
pslldq XMM2 8
pslldq XMM3 12
xorps XMM0 XMM1
xorps XMM0 XMM2
xorps XMM0 XMM3
movups X$edi XMM0 ; <---- must be movups (unaligned) due to the remainders in height. otherwise it will crash. it work regardless the output in aligned or not
add edi 16 ; 4*4
add ecx 4
Loop_Until ecx >= D@Height
sub edi eax; adjust edi from the remainder
add esi 4
dec D@CurXPos | jnz L2<<
; clear remainder bytes if any
test eax eax | jz L1>
shr eax 2 | jz L1> | mov D$edi 0
dec eax | jz L1> | mov D$edi+4 0
dec eax | jz L1> | mov D$edi+8 0
L1:
mov eax D@Output
EndP
I´ll check tomorrow if it works when Height or width are smaller then 4. So far, it works for 6x5; 6x6; 6x7, 30x32, 32x30 ....
i´ll test tomorrow for 4x30, 3x10, 10x3, 1x2, 2x20 etc :)
If it works on those situations, i´ll see if i can be able to adapt Siekmanski onto your variation as well. On this way we may have something around 20-30% of speed gain, i suppose ?.
I realise that small matrix don´t actually needs speed, but we may have a matrix containing a small height (smaller then 4) and a higher width. Ex: width = 100, height = 3 (Or vice-versa)
Time for some timings?Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
3442 cycles for 100 * TransposeA
570 cycles for 100 * TransposeB
786 cycles for 100 * TransposeC (old)
2285 cycles for 100 * TransposeA
571 cycles for 100 * TransposeB
772 cycles for 100 * TransposeC (old)
2285 cycles for 100 * TransposeA
570 cycles for 100 * TransposeB
786 cycles for 100 * TransposeC (old)
2284 cycles for 100 * TransposeA
570 cycles for 100 * TransposeB
770 cycles for 100 * TransposeC (old)
1913 cycles for 100 * TransposeA
570 cycles for 100 * TransposeB
784 cycles for 100 * TransposeC (old)
61 bytes for TransposeA
111 bytes for TransposeB
109 bytes for TransposeC (old)
P.S.: Don't take the slow version I posted earlier. Take V3 instead :greensml:
Great, but not working for non multiple of 4 matrices
Try with this:
Width = 6
Height = 7
[ValueD4: D$ 1, 2, 3, 4, 5, 6,
D$ 7, 8, 9, 10, 11, 12,
D$ 13, 14, 15, 16, 17, 18,
D$ 19, 20, 21, 22, 23, 24,
D$ 25, 26, 27, 28, 29, 30,
D$ 31, 32, 33, 34, 35, 36,
D$ 37, 38, 39, 40, 41, 42]
The other version worked on non multiple of 4.
Hi JJ
Ok...i guess i fix it to make work for 6x7 and not multiple of 4
It seems to work on 6x7, 20x30 etc
Proc Fast_transpose_blockJJgugaNew:
Arguments @Input, @Output, @Width, @Height
Local @CurXPos, @Reminder
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
;mov ebx D@Width | mov D@CurXPos ebx | shl ebx 2
; get remainders for edi
xor eax eax
mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
jz L1>
mov eax 4 | sub eax ecx | shl eax 2
L1:
mov D@Reminder eax
mov eax D@Width | mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2
L2:
mov ecx D@Height
mov edx esi
sar ecx 2 | jnc L3> | inc ecx | clc | L3: ; if height have a remainder (i mean, not multiple of 4, increment it). And clear Carry flag just for safety
;Align 4 Alignment not really needed ?
L8:
; copy the 1st 4 dwords from esi to register XMM
movd XMM1 D$edx+eax
movd XMM0 D$edx ; +0*eax
pshufd XMM1 XMM1 00__0101_0001
xorps XMM0 XMM1
movd XMM2 D$edx+ebx ; 3*eax
movd xmm1 D$edx+eax*2
pshufd xmm2 xmm2 00_00010101
pshufd xmm1 xmm1 00_01000101
xorps xmm0 xmm1
xorps xmm0 xmm2
lea edx D$edx+eax*4
movups X$edi xmm0
add edi (4*4) ; advance one xmm reg
dec ecx | jg L8<
sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder
add esi 4
dec D@CurXPos | jnz L2<<
mov eax D@Reminder
; clear remainder bytes if any
test eax eax | jz L1>
shr eax 2 | jz L1> | mov D$edi 0
dec eax | jz L1> | mov D$edi+4 0
dec eax | jz L1> | mov D$edi+8 0
L1:
mov eax D@Output
EndP
Speed tests:
With fixing the height (jnc) 249.51956409010458 ns
Without fixing the height (not work for non 4 multiples): 243.87414443039290 ns (A bit faster, because it misses the data for non multiple of 4)
Old version: 289.57585729750531 ns
Gain of speed from new to old: something around 14%
I´ll start making the tests for height and width smaller then 4 and also exchanging the width and height (To see if it works when height is smaller then width or vice-versa)
Hmm...it seems that align4 make no effect, but align 16 gained more speed 8)
; fastest
Proc Fast_transpose_blockJJgugaNew:
Arguments @Input, @Output, @Width, @Height
Local @CurXPos, @Reminder
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
; get remainders for edi
xor eax eax
mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
jz L1>
mov eax 4 | sub eax ecx | shl eax 2
L1:
mov D@Reminder eax
mov eax D@Width | mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2
L2:
mov ecx D@Height
mov edx esi
sar ecx 2 | jnc L3> | inc ecx | clc | L3: ; if height have a remainder (i mean, not multiple of 4, increment it)
Align 16 ; <--------- Must be aligned with 16 to gain more speed
L8:
; copy the 1st 4 dwords from esi to register XMM
movd XMM1 D$edx+eax
movd XMM0 D$edx ; +0*eax
pshufd XMM1 XMM1 00__0101_0001
xorps XMM0 XMM1
movd XMM2 D$edx+ebx ; 3*eax
movd xmm1 D$edx+eax*2
pshufd xmm2 xmm2 00_00010101
pshufd xmm1 xmm1 00_01000101
xorps xmm0 xmm1
xorps xmm0 xmm2
lea edx D$edx+eax*4
movups X$edi xmm0
add edi (4*4) ; advance one xmm reg
dec ecx | jg L8<
sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder
add esi 4
dec D@CurXPos | jnz L2<<
mov eax D@Reminder
; clear remainder bytes if any
test eax eax | jz L1>
shr eax 2 | jz L1> | mov D$edi 0
dec eax | jz L1> | mov D$edi+4 0
dec eax | jz L1> | mov D$edi+8 0
L1:
mov eax D@Output
EndP
New result: 217.25499826251323 ns
Another faster version that seems to work with any size (width and height) both bigger then 4 bytes
The previous "sar" operation is no longer needed. Since we can compute the max value of Y (Col) only once and not inside each loop
; fastest
Proc Fast_transpose_blockJJ:
Arguments @Input, @Output, @Width, @Height
Local @CurXPos, @Reminder, @MAxYPos
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
; get remainders for edi
xor eax eax
mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | shl edx 2 | sub ecx edx ; integer count divide by 4 dwords
jz L1>
inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
mov eax 4 | sub eax ecx | shl eax 2
L1:
mov D@Reminder eax
mov eax D@Width | mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2
L2:
mov ecx D@MaxYPos ; < No longer needed "sar" operation since we alreadu calculated the max value of Y (Col/Height)
mov edx esi
Align 16 ; <---- Must be aligned to 16 to gain more speed and stability
L8:
; copy the 1st 4 dwords from esi to register XMM
movd XMM1 D$edx+eax
movd XMM0 D$edx ; +0*eax
pshufd XMM1 XMM1 00__0101_0001
xorps XMM0 XMM1
movd XMM2 D$edx+ebx ; 3*eax
movd xmm1 D$edx+eax*2
pshufd xmm2 xmm2 00_00010101
pshufd xmm1 xmm1 00_01000101
xorps xmm0 xmm1
xorps xmm0 xmm2
lea edx D$edx+eax*4
movups X$edi xmm0
add edi (4*4) ; advance one xmm reg
dec ecx | jg L8<
sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder to we reach the correct position
add esi 4
dec D@CurXPos | jnz L2<<
mov eax D@Reminder
; clear remainder bytes if any
test eax eax | jz L1>
shr eax 2 | jz L1> | mov D$edi 0
dec eax | jz L1> | mov D$edi+4 0
dec eax | jz L1> | mov D$edi+8 0
L1:
mov eax D@Output
EndP
New timming: 189.46967097307137 ns
JJ, It is something around 50% faster then the old version (the 1st one) and something around 30% faster then your new one. It seems that using align 16 and putting a variable to hold the Maximum Y Position (col) was the necessary to gain more speed. Can you test on yours ?
Quote from: guga on April 03, 2017, 05:13:32 AMJJ, It is something around 50% faster then the old version (the 1st one) and something around 30% faster then your new one. It seems that using align 16 and putting a variable to hold the Maximum Y Position (col) was the necessary to gain more speed. Can you test on yours ?
Sorry, can't test that one, I know Masm syntax only. From what I see, it is unlikely that your version is any faster than mine, though.
hello sir guga, I hope you're fine irmão;
I don't have tested because don't have rosasm in hands, so I can only give suggestion.
On line:
mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | shl edx 2 | sub ecx edx ; integer count divide by 4 dwords
You can get remainder of a 4 division by "and" number with (X-1) mask. This only works to 2,4,8,16,... . To get remainder of 8 the mask is 8-1==7, to 16 the mask is 15, ... .
mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx,3
If conditional or inconditional jump address is aligned we have a gain. So, if you analyse opcode generated, from my tests, sometimes is better use add instead of inc per example, because you're align next intruction or a label to a even address.
Not remember where I have read, but, text said that if you like to read just a byte on an odd adress, so machine internally read a word, discard one byte and move a byte. So, even address speed up code. Instructions on even address perfoms better than on odd address.
Some codes that I have see done to larger matrix just get remainder and get address to store too and do calculus. The idea is align address to 16 multiple. They do something like a head code (to align remainder and address) on byte/word/dword/qword way until address its a 16 multiple where the gain appear, and a tail to deal with bytes too.
I also forgot to say, on line below:
add edi (4*4) ; advance one xmm reg
dec ecx | jg L8<
add instruction set flags and dec instruction too, this create a bit dependency. You can test with lea that don't change flags to add a displacement, so flag dependency will be on dec instruction.
Oi Mineiro
Obrigado pela dica. usando o and ecx 3 fica realmente mais pratico para encontrar a posição correta do output em edi. Estou testando agora para ver se funciona com varios tamanhos diferentes de comprimento e altura. Estou tentando fazer algo relacionado com tranposição de matrizes com tamanhos diferentes pois estou testando um algoritmo para identificar imagens a partir de um hash. A biblioteca se chama "pHash" e, em princípio, utiliza a transposição de matrizes após os pixels terem sido transformados para Luma. Ele dispõe o resultado em matrizes no formato de 32x32 que representam o padrão de iluminação da imagem (na verdade, ele cria uma nova imagem em 32x32) para que o hash possa ser gerado. No entanto, pelo que vi do código, se for utilizado nova matriz do mesmo tamanho da imagem original, o algoritmo funcionaria de forma mais eficiente. O maior problema é que a biblioteca que usam é muito muito lenta. Usam uma versão alterada de cimg que tem dentro dela a transposição de matrizes, mas, como disse é bastante lenta por isso precisaria de uma função melhorada para a transposição de matrizes de qualquer tamanho sejam quadradas ou retangulares (até para poder utilizar em qualquer outro aplicativo e não apenas neste caso específico).
A função que JJ fez é ótima do ponto de vista de rapidez, no entanto preciso apenas tentar fazer funcionar em matrizes não quadradas e, ainda assim, tentar manter a mesma rapidez quanto o código original que ele criou.
Ok...I guess i got it working for retangular matrices (Quadratic or not). However, Width and Height on this version must be bigger then 4 (both)
; fastest
Proc Fast_transpose_blockJJ:
Arguments @Input, @Output, @Width, @Height
Local @CurXPos, @RemainderY, @MaxYPos
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
; get remainders for edi
mov D@RemainderY 0
mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx 3; Check if value (Height) is a multiple of 4
jz L1>
; found not multiple of 4
inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
mov eax 4 | sub eax ecx | shl eax 2
mov D@RemainderY eax
L1:
mov eax D@Width | mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2
L2:
mov ecx D@MaxYPos
mov edx esi
Align 16 ; <---- Must be aligned to 16 to gain more speed and stability
L8:
; copy the 1st 4 dwords from esi to register XMM
movd XMM1 D$edx+eax
movd XMM0 D$edx ; +0*eax
pshufd XMM1 XMM1 00__0101_0001
xorps XMM0 XMM1
movd XMM2 D$edx+ebx ; 3*eax
movd xmm1 D$edx+eax*2
pshufd xmm2 xmm2 00_00010101
pshufd xmm1 xmm1 00_01000101
xorps xmm0 xmm1
xorps xmm0 xmm2
lea edx D$edx+eax*4
movups X$edi xmm0
add edi (4*4) ; advance one xmm reg
dec ecx | jg L8<
sub edi D@RemainderY; adjust edi from the remainder in YPos only
add esi 4
dec D@CurXPos | jnz L2<<
mov eax D@RemainderY
; clear remainder bytes if any
test eax eax | jz L1>
shr eax 2 | jz L1> | mov D$edi 0
dec eax | jz L1> | mov D$edi+4 0
dec eax | jz L1> | mov D$edi+8 0
L1:
mov eax D@Output
EndP
; Speed results using "and ecx" 213.78110071619122 ns
The tests were done on these data sets
; 4x4 OK Fast_transpose_blockJJ4544. Start testing the newer versions for retangular matrixes
[Teste4x4: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22]
; 5x4
[Teste5x4: F$ 1, 2, 3, 4, 5,
F$ 6, 7, 8, 9, 10,
F$ 11, 12, 13, 14, 15,
F$ 16, 17, 18, 19, 20]
[Teste6x4: F$ 1, 2, 3, 4, 5, 6,
F$ 26, 7, 8, 9, 10,27
F$ 11, 12, 13, 14, 15,28
F$ 16, 17, 18, 19, 20,29]
[Teste7x4: F$ 1, 2, 3, 4, 5, 6, 77,
F$ 26, 7, 8, 9, 10,27, 102
F$ 11, 12, 13, 14, 15,28, 95
F$ 16, 17, 18, 19, 20,29, 35]
[Teste8x4: F$ 1, 2, 3, 4, 5, 6, 77, 222,
F$ 26, 7, 8, 9, 10,27, 102, 67,
F$ 11, 12, 13, 14, 15,28, 95, 77,
F$ 16, 17, 18, 19, 20,29, 35, 91]
; for Height = 4. Width works ok on any size Fast_transpose_blockJJ_WidthOk. Now let´ see Heght with odd values
; remainder =12
[Teste4x5: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 79, 80, 81, 32]
; remainder = 8
[Teste4x6: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 79, 80, 81, 32,
F$ 39, 49, 68, 67]
; remainder = 4
[Teste4x7: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 79, 80, 81, 32,
F$ 39, 49, 68, 67,
F$ 59, 69, 98, 107]
; remainder = 0
[Teste4x8: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 79, 80, 81, 32,
F$ 39, 49, 68, 67,
F$ 59, 69, 98, 107,
F$ 259, 169, 908, 77]
; 5x5
[Teste5x5: F$ 1, 2, 3, 4, 6,
F$ 7, 8, 9, 10, 11
F$ 13, 14, 15, 16, 12
F$ 19, 20, 21, 22, 23
F$ 79, 80, 81, 32, 37]
I´ll start the tests on matrix whose width or height are smaller then 4. Ex: 3x100 ; 2x10, 1x90, 150x2, 100x3 etc
Ok, guys...it is working on all sizes : )
Thank you a lot ! :t :t :t :t
The only condition is that the inputed matrix must be stored on a buffer bigger then 64 bytes (4x4 dwords) and also the output buffer must be at least 64 bytes long too. But..it is working as expected :)
Proc Fast_Matrix_Transpose:
Arguments @Input, @Output, @Width, @Height
Local @CurXPos, @RemainderY, @MaxYPos
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
; get remainders for edi
mov D@RemainderY 0
mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx 3; Check if value (Height) is a multiple of 4
jz L1>
; found not multiple of 4
inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
mov eax 4 | sub eax ecx | shl eax 2
mov D@RemainderY eax
L1:
mov eax D@Width | mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2
L2:
mov ecx D@MaxYPos
mov edx esi
Align 16 ; <---- Must be aligned to 16 to gain more speed and stability
L8:
; copy the 1st 4 dwords from esi to register XMM
movd XMM1 D$edx+eax
movd XMM0 D$edx ; +0*eax
pshufd XMM1 XMM1 00__0101_0001
xorps XMM0 XMM1
movd XMM2 D$edx+ebx ; 3*eax
movd xmm1 D$edx+eax*2
pshufd xmm2 xmm2 00_00010101
pshufd xmm1 xmm1 00_01000101
xorps xmm0 xmm1
xorps xmm0 xmm2
lea edx D$edx+eax*4
movups X$edi xmm0
add edi (4*4) ; advance one xmm reg
dec ecx | jg L8<
sub edi D@RemainderY; adjust edi from the remainder in YPos only
add esi 4
dec D@CurXPos | jnz L2<<
mov eax D@RemainderY
; clear remainder bytes if any
test eax eax | jz L1>
shr eax 2 | jz L1> | mov D$edi 0
dec eax | jz L1> | mov D$edi+4 0
dec eax | jz L1> | mov D$edi+8 0
L1:
mov eax D@Output
EndP
Quote from: guga on April 03, 2017, 12:03:20 PMA função que JJ fez é ótima do ponto de vista de rapidez, no entanto preciso apenas tentar fazer funcionar em matrizes não quadradas
My function works fine on non-quadratic matrices, as long as the rows are a multiple of 4.
sub edi D@RemainderY; adjust edi from the remainder in YPos only
:t
Here is the original function, in Masm format (full code with speed test in reply #16 (http://masm32.com/board/index.php?topic=6105.msg64799#msg64799)):
TransposeB proc uses esi edi ebx pSrc, pDest, SrcRows, SrcCols ; fast
mov esi, pSrc
mov edi, pDest
mov eax, SrcCols
push eax
shl eax, 2 ; *4
lea ebx, [eax+2*eax] ; 3*eax
.Repeat
mov ecx, SrcRows
mov edx, esi
sar ecx, 2
align 4
@@: movd xmm1, dword ptr [edx+1*eax] ; +1*eax
movd xmm0, dword ptr [edx] ; +0*eax
pshufd xmm1, xmm1, 01010001b
xorps xmm0, xmm1
movd xmm2, dword ptr [edx+1*ebx] ; 3*eax
movd xmm1, dword ptr [edx+2*eax] ; 2*eax
pshufd xmm2, xmm2, 00010101b
pshufd xmm1, xmm1, 01000101b
xorps xmm0, xmm1
xorps xmm0, xmm2
lea edx, [edx+4*eax]
movups [edi], xmm0
add edi, 4*4 ; advance one xmm reg
dec ecx
jg @B
add esi, 4
dec stack
.Until Zero?
pop edx
if ShowB
mov esi, offset ValueC
For_ ct=0 To 19
For_ ecx=0 To 31
lodsd
Print Str$("%i ", eax)
Next
Print
Next
Exit
endif
ret
TransposeB endp
opa, beleza guga;
O projeto soa interessante, você deseja encontrar padrões pelo que supus e isto oferece uma gama de possibilidades onde possa ser usado. Na hora de testar o código com matrizes tente quadrado, retângulo e linha, quadrados e retângulos são formas geométricas que linhas sobrepostas podem formar, e linha nesse contexto significa um número primo, por exemplo uma matriz de 7x1. No código fonte, não cheguei a analisá-lo por completo, daí suponha que apenas tenho uma vaga ideia, mas sendo uma multiplicação então significa que é mais fácil fazer cinco vezes dez do que dez vezes o número cinco, em número de operações matemáticas.
Concordo com o senhor, neste fórum existe exímios programadores, aprendi muito e continuo aprendendo com os membros. Um eterno aprendiz.
Boa sorte no projeto.
Hi mineiro
Obrigado :) Sim, o phash usa padrões e um algoritmo chamado DCT para reduzir as frequências encontradas na imagem. Só é um pouco complicado entender a funcionalidade seguindo o código-fonte original mas, creio que dê para portar para assembly. Minha idéia inicial é fazer uma função que possa calcular o pHash de imagens mas, ao invés de fazer funções que criam e reduzem o tamanho das imagens etc, fazer apenas a que diretamente realizará o trabalho, ou seja, usar como input width, height (e talvez o pitch) e o ponteiro para os pixels (convertidos já para algum formato como Luma, -por exemplo). A biblioteca menciona que convertem a imagem para greyscale mas, pelos testes que estou fazendo em um plugin para o virtualdub (para detectar mudanças de cenas em um filme), é melhor converter já para Luma usando o valor encontrado após os pixels terem sido convertidos para YCbcr ou CieLab. Como se usa apenas o Luma, não é necessário usar toda a função de conversão de YCrcb ou CieLab apenas o início que converte RGB para Luma e o valor de Luma (de 0 pra 1 ou seja, o Luma já está normalizado: dividido por 255) é o que será passado para as matrizes e para o DCT.
Se quiser, de uma lida aqui que explicam melhor todo o funcionamento do phash de modo mais claro:
http://cloudinary.com/blog/how_to_automatically_identify_similar_images_using_phash
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html
Sobre a função de JJ fiz os testes de maneira extensiva e, em principio está funcionando para todo e qualquer formato da matrix, incluindo lineares, quadradas e retangulares, independente de serem ou não multiplos de 4 e o melhor de tudo, com uma rapidez impecável :)
Os dados dos testes que fiz foram estes:
; Matrix 20 X 30
[ValueA: F$ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
F$ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
F$ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
F$ 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
F$ 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
F$ 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
F$ 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140
F$ 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160
F$ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180
F$ 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200
F$ 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220
F$ 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240
F$ 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260
F$ 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280
F$ 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300
F$ 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320
F$ 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340
F$ 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360
F$ 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380
F$ 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400
F$ 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420
F$ 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440
F$ 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460
F$ 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480
F$ 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500
F$ 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520
F$ 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540
F$ 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560
F$ 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580
F$ 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600]
; 5x4
[ ValueD: F$ 1, 2, 3, 4, 5
F$ 6, 7, 8, 9, 10
F$ 11, 12, 13, 14, 15
F$ 16, 17, 18, 19, 20]
; 6x5
[ValueD2: F$ 1, 2, 3, 4, 5, 6,
F$ 7, 8, 9, 10, 11, 12,
F$ 13, 14, 15, 16, 17, 18,
F$ 19, 20, 21, 22, 23, 24,
F$ 25, 26, 27, 28, 29, 30]
; 6 x 6 8 btes ahead
[ValueD3: F$ 1, 2, 3, 4, 5, 6,
F$ 7, 8, 9, 10, 11, 12,
F$ 13, 14, 15, 16, 17, 18,
F$ 19, 20, 21, 22, 23, 24,
F$ 25, 26, 27, 28, 29, 30,
F$ 31, 32, 33, 34, 35, 36]
;6x7 4 bytes ahead
[ ValueD4: F$ 1, 2, 3, 4, 5, 6,
F$ 7, 8, 9, 10, 11, 12,
F$ 13, 14, 15, 16, 17, 18,
F$ 19, 20, 21, 22, 23, 24,
F$ 25, 26, 27, 28, 29, 30,
F$ 31, 32, 33, 34, 35, 36,
F$ 37, 38, 39, 40, 41, 42]
; 4 x 7
[ValueD5: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 25, 26, 27, 28,
F$ 31, 32, 33, 34,
F$ 37, 38, 39, 40,
F$ 41, 42, 43, 44]
; [ValueD6: F$ 1, 2, 3, 4, 77,
F$ 7, 8, 9, 10, 64,
F$ 13, 14, 15, 16, 79
F$ 19, 20, 21, 22, 98
F$ 25, 26, 27, 28, 101
F$ 31, 32, 33, 34, 222
F$ 37, 38, 39, 40, 57
F$ 41, 42, 43, 44, 58]
; 4x4 OK Fast_transpose_blockJJ4544
[Teste4x4: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22]
; 5x4
[Teste5x4: F$ 1, 2, 3, 4, 5,
F$ 6, 7, 8, 9, 10,
F$ 11, 12, 13, 14, 15,
F$ 16, 17, 18, 19, 20]
[Teste6x4: F$ 1, 2, 3, 4, 5, 6,
F$ 26, 7, 8, 9, 10,27
F$ 11, 12, 13, 14, 15,28
F$ 16, 17, 18, 19, 20,29]
[Teste7x4: F$ 1, 2, 3, 4, 5, 6, 77,
F$ 26, 7, 8, 9, 10,27, 102
F$ 11, 12, 13, 14, 15,28, 95
F$ 16, 17, 18, 19, 20,29, 35]
[Teste8x4: F$ 1, 2, 3, 4, 5, 6, 77, 222,
F$ 26, 7, 8, 9, 10,27, 102, 67,
F$ 11, 12, 13, 14, 15,28, 95, 77,
F$ 16, 17, 18, 19, 20,29, 35, 91]
; for Height = 4. Width works ok on any size Fast_transpose_blockJJ_WidthOk. Now let´s see Height with odd values
; remainder =12
[Teste4x5: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 79, 80, 81, 32]
; remainder = 8
[Teste4x6: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 79, 80, 81, 32,
F$ 39, 49, 68, 67]
; remainder = 4
[Teste4x7: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 79, 80, 81, 32,
F$ 39, 49, 68, 67,
F$ 59, 69, 98, 107]
; remainder = 0
[Teste4x8: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22,
F$ 79, 80, 81, 32,
F$ 39, 49, 68, 67,
F$ 59, 69, 98, 107,
F$ 259, 169, 908, 77]
; 5x5
[Teste5x5: F$ 1, 2, 3, 4, 6,
F$ 7, 8, 9, 10, 11
F$ 13, 14, 15, 16, 12
F$ 19, 20, 21, 22, 23
F$ 79, 80, 81, 32, 37]
; Fast_transpose_blockJJ_YESSSS WORKS FOR EVERYTHING BIGGER THEN 4
; Now let´ start with height or width smaller then 4
; 1st we test for width be smaller then 4
; remainder = 0; works. no adjustment
[Teste3x4: F$ 1, 2, 3,
F$ 7, 8, 9,
F$ 13, 14, 15,
F$ 19, 20, 21]
; remainder = 0; works. no adjustment
[Teste2x4: F$ 1, 2,
F$ 7, 8,
F$ 13, 14,
F$ 19, 20]
; remainder = 0; works. no adjustment
[Teste1x4: F$ 1,
F$ 7,
F$ 13,
F$ 19,]
; remainder = 0; works. no adjustment
[Teste3x5: F$ 1, 2, 3,
F$ 7, 8, 9,
F$ 13, 14, 15,
F$ 19, 20, 21,
F$ 89, 70, 51]
; remainder = 0; works. no adjustment
[Teste2x5: F$ 1, 2,
F$ 7, 8,
F$ 13, 14,
F$ 19, 20,
F$ 11, 122]
; remainder = 0; works. no adjustment
[Teste1x5: F$ 1,
F$ 7,
F$ 13,
F$ 19
F$ 77]
; Now we test for height be smaller then 4
; remainder = 0; works. no adjustment
[Teste4x3: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16]
[Teste5x3: F$ 1, 2, 3, 4, 77,
F$ 7, 8, 9, 10, 91,
F$ 13, 14, 15, 16,125]
[Teste5x2: F$ 1, 2, 3, 4, 77,
F$ 7, 8, 9, 10, 91]
[Teste5x1: F$ 1, 2, 3, 4, 77]
[Teste1x1: F$ 1]
opa guga;
Agora o projeto ficou ainda mais audaz após ler os links postados, muito bacana mesmo. Não conhecia sobre perceputal hash, me veio na memória perceptron no ramo da inteligência artificial mas agora consigo pensar em outros algoritmos que no futuro podem ser agregados para uma automação.
http://courses.cs.washington.edu/courses/cse599/01wi/admin/Assignments/nn.html
Lembro que existia um programa chamado Altamira Composer, e este oferecia um zoom infinito, e não sei como funciona este algoritmo.
Tenho uma pergunta guga; o algoritmo é eficiente entre objetos similares? Pensei em uma laranja, mexerica e limão doce (tangerina). Tem como encontrar apenas laranjas neste pomar de fotos?
Para objetos semelhantes ele aponta qual o percentual mas...entre laranjas e mexericas...kkk.. não sei..bem pensado. É algo à se testar.
O link da demonstração online do phash é:
http://www.phash.org
tenta selecionar foto de laranjas e mexericas etc e ve qual o resultado que o phash encontrou. Se ele conseguir discernir mesmo entre uma laranja e uma mexerica etc, nossa...então o algoritmo é melhor do que pensei :)
Aliás...o que já consegui portar foi a função hamming Distance do phash. Apesar de eu achar um pouco lenta, pelo menos já consegui portar. Assim vou fazendo aos poucos para facilitar a otimização quando tudo ficar pronto.
A biblioteca tb tem uma função para calcular o Fast Fourier Transformation. Eu converti ela e posto mais tarde para tentarmos otimizar :)
[XValue: Q$ 0]
Proc ph_hamming_distance:
Arguments @Hash1, @Hash2
Uses ecx, edx, esi, edi
mov edi D@Hash1
mov esi D@Hash2
mov eax D$edi | xor eax D$esi | mov D$XValue eax
mov ecx D$edi+4 | xor ecx D$esi+4 | mov D$XValue+4 ecx
call aullshr 1, D$XValue, D$XValue+4
and eax 055555555 | and edx 055555555
mov ecx D$XValue | sub ecx eax | mov D$XValue ecx
mov eax D$XValue+4 | sbb eax edx | mov D$XValue+4 eax
mov esi D$XValue | and esi 033333333
mov edi D$XValue+4 | and edi 033333333
call aullshr 2, D$XValue, D$XValue+4
and eax 033333333 | add esi eax | mov D$XValue esi
and edx 033333333 | adc edi edx | mov D$XValue+4 edi
call aullshr 4, D$XValue, D$XValue+4
add eax D$XValue
mov ecx D$XValue+4 | adc ecx edx
and eax 0F0F0F0F | mov D$XValue eax
and ecx 0F0F0F0F | mov D$XValue+4 ecx
call allmul D$XValue, D$XValue+4, 01010101, 01010101
call aullshr 56, eax, edx
EndP
; results in eax/edx
Proc aullshr:
Arguments @Flag, @Val1, @Val2
Uses ecx
mov ecx D@Flag | movzx ecx cl
mov eax D@Val1
mov edx D@Val2
.If cl < 040
If cl < 020
shrd eax edx cl
shr edx cl
Else
mov eax edx
xor edx edx
and cl 01F
shr eax cl
End_If
.Else
xor eax eax
xor edx edx
.End_If
EndP
___________________________________________________________________________________
Proc allmul:
Arguments @Val1, @Val1a, @Val2, @Val2a
Uses ebx
mov eax D@Val1a
If D@Val2a = eax
mov eax D@Val1
mul D@Val2
Else
mul D@Val2
mov ebx eax
mov eax D@Val1
mul D@Val2a
add ebx eax
mov eax D@Val1
mul ecx
add edx ebx
End_If
EndP
Irei tentar o site amanhã e me aprofundar no assunto, gosto de alimento cerebral.
Creio que já tenha visto neste fórum uma função fft rápida, esqueci agora qual usuário a fez, talvez no velho fórum.
abraços dom.
Não vi ainda mas, se foi criada, creio que possa ter sido ou pelo JJ ou por Siekmanski . Ambos são ótimos quando o assunto é otimização de código :)
doideira, não conhecia isso. Interessante
Oi Lord
Sim, é muto interessante a técnica. No entanto, phash utiliza uma biblioteca tosca de manipulação de imagem chamada CImg que é incrivelmente lenta e confusa de portar.
Estou tentando portar ela para assembly e tentando entender direito a funcionalidade. basicamente, eles utilizam uma máscara por cima de uma imagem. A máscara é invertida (transposta mas, não do modo 'normal". A transposição se dá fazendo um flip horizontal e outro vertical da máscara) e depois passam essa máscara por sobre a imagem aplicando uma convolação de matrix. Antes da convolação e do flip, a imagem e a máscara precisam ser convertidas para greyscale (Eu vou usar Luma que é mais preciso).
Eu havia criado uma dll do pHash que tem as funções de exportação necessárias "ph_dct_imagehash" e "ph_hamming_distance" mas, após testar elas quanto à velocidade, é impraticável para usá-las em vídeo, por exemplo.
Exatamente por isso que estou otimizando e tentando rever toda a funcionalidade da biblioteca para fazê-la simplificada sem a necessidade de se utilizar biblioteca externa de manipulação de imagens. A idéia é apenas recriar as funções do pHash tomando como base os pixels já convertidos para Luma (normalizado) e apenas aplicar as alterações necessárias das matrizes e a convolação.
Isso garantirá uma maior rapidez do algoritmo. Se eu conseguir criar uma função que consiga criar o phash de uma imagem (e identificar com o de outro) utilizando apenas alguns nanosegundos, será melhor. Acredito que a função como um todo, talvez se consiga fazer em torno de uns 800 nanosegundos (para uma imagem de 32x32, por exemplo), já incluindo a convolação. Ou seja, uma imagem de 640x480 pode ser feita em algo que leve alguns poucos microsegundos. (Creio que uns 10 no máximo 20 mas, quanto menos, melhor).
Eu conseguindo isolar apenas o algoritmo de phash, já é um avanço pois se alguém quiser criar um programa de identificação de cenas (como eu estou fazendo) ou de busca por imagens etc, a velocidade final dependerá apenas da forma como o programador usou para ter acesso aos pixels antes de passar pelo algoritmo e nunca em função do algoritmo em si (Que é o que acontece com o phash atualmente)