What is the rule to get the multiplicative inverse of X (8 bits)?
What is the condition that is contrived in 02BH to be the multiplicative inverse for 015H?
QuoteThis affine transformation is the sum of multiple rotations of the byte as a vector, where addition is the XOR operation:
s = b ⊕ ( b ⋘ 1 ) ⊕ ( b ⋘ 2 ) ⊕ ( b ⋘ 3 ) ⊕ ( b ⋘ 4 ) ⊕ 63
where b represents the multiplicative inverse, ....
Rijndael S-box (https://en.wikipedia.org/wiki/Rijndael_S-box)
Quotea ≡ b (mod n) means :
- a and b have the same remainder when divided by n
- a = K.n + b
- (a - b) is multiple of n
How it was applied to the Rijndael table?
017h≡05Fh? or 06Dh≡093h? ...
.data ;00 , 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
MulInv DB 001h,001h,08Dh,0F6h,0CBh,052h,07Bh,0D1h,0E8h,04Fh,029h,0C0h,0B0h,0E1h,0E5h,0C7h ;00
DB 074h,0B4h,0AAh,04Bh,099h,02Bh,060h,05Fh,058h,03Fh,0FDh,0CCh,0FFh,040h,0EEh,0B2h ;01
DB 03Ah,06Eh,05Ah,0F1h,055h,04Dh,0A8h,0C9h,0C1h,00Ah,098h,015h,030h,044h,0A2h,0C2h ;02
DB 02Ch,045h,092h,06Ch,0F3h,039h,066h,042h,0F2h,035h,020h,06Fh,077h,0BBh,059h,019h ;03
DB 01Dh,0FEh,037h,067h,02Dh,031h,0F5h,069h,0A7h,064h,0ABh,013h,054h,025h,0E9h,009h ;04
DB 0EDh,05Ch,005h,0CAh,04Ch,024h,087h,0BFh,018h,03Eh,022h,0F0h,051h,0ECh,061h,017h ;05
DB 016h,05Eh,0AFh,0D3h,049h,0A6h,036h,043h,0F4h,047h,091h,0DFh,033h,093h,021h,03Bh ;06
DB 079h,0B7h,097h,085h,010h,0B5h,0BAh,03Ch,0B6h,070h,0D0h,006h,0A1h,0FAh,081h,082h ;07
DB 083h,07Eh,07Fh,080h,096h,073h,0BEh,056h,09Bh,09Eh,095h,0D9h,0F7h,002h,0B9h,0A4h ;08
DB 0DEh,06Ah,032h,06Dh,0D8h,08Ah,084h,072h,02Ah,014h,09Fh,088h,0F9h,0DCh,089h,09Ah ;09
DB 0FBh,07Ch,02Eh,0C3h,08Fh,0B8h,065h,048h,026h,0C8h,012h,04Ah,0CEh,0E7h,0D2h,062h ;0A
DB 00Ch,0E0h,01Fh,0EFh,011h,075h,078h,071h,0A5h,08Eh,076h,03Dh,0BDh,0BCh,086h,057h ;0B
DB 00Bh,028h,02Fh,0A3h,0DAh,0D4h,0E4h,00Fh,0A9h,027h,053h,004h,01Bh,0FCh,0ACh,0E6h ;0C
DB 07Ah,007h,0AEh,063h,0C5h,0DBh,0E2h,0EAh,094h,08Bh,0C4h,0D5h,09Dh,0F8h,090h,06Bh ;0D
DB 0B1h,00Dh,0D6h,0EBh,0C6h,00Eh,0CFh,0ADh,008h,04Eh,0D7h,0E3h,05Dh,050h,01Eh,0B3h ;0E
DB 05Bh,023h,038h,034h,068h,046h,003h,08Ch,0DDh,09Ch,07Dh,0A0h,0CDh,01Ah,041h,01Ch ;0F
Hi mabdelouahab,
4.2 Multiplication
5.1.1 SubBytes()Transformation
http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf)
Your question appears to be: How is the Multiplicative Inverse Table produced for the Rijndael characteristic 2 finite field which uses the following polynomial for multiplication: x^8 + x^4 + x^3 + x + 1?
There are a few ways, but you can try the Euclid-Wallis Algorithm, mentioned here on the chapter Simple algebraic field extensions (https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm).
If you follow the mentioned example where a= x^6+x^4+x+1 (binary 0101 0011 = 53h) you will get the inverse as x^7 + x^6 + x^3 + x (binary 1100 1010 = CAh). Confirm in the table that it is correct!
If you don't understand, I can't explain better because the person that explained it to me will not be available in the short term. :biggrin:
Translated by Google sorry.
I think you want to calculate the multiplicative inverse of x (Mod 255) you need to calculate
MCDE with the values of x, 255. If the mcde (x, 255) is different from 1, then x has no inverse
If the mcde (x, 255) == 1 then the algorithm also returns the inverse.
I have the function written in C # for the BigInteger class. If you want it, you tell me.
Mcd(2Bh,FFh) = 1 OK. The condition inv(2B) = ACh
Mcd(15h,FFh) = 3 != 1 No inverse exist.
Quote from: manuel on September 10, 2019, 04:05:45 AM
Mcd(2Bh,FFh) = 1 OK. The condition inv(2B) = ACh
Mcd(15h,FFh) = 3 != 1 No inverse exist.
We are talking about a different thing,we are talking about the Finite Field GF(2^8). Things are much more complicated.
Introduction to Finite Fields.
http://www.cs.utsa.edu/~wagner/laws/FFM.html
Quote from: LiaoMi on September 09, 2019, 06:46:14 PM
Hi mabdelouahab,
4.2 Multiplication
5.1.1 SubBytes()Transformation
http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf)
Hi LiaoMi
"For the AES algorithm, this irreducible polynomial is m(x) = x^8 + x^4 + x^3 + x +1 or {01}{1b} in hexadecimal notation."These explain a lot, Thanks
Quote from: AW on September 10, 2019, 12:09:01 AM
If you don't understand, I can't explain better because the person that explained it to me will not be available in the short term. :biggrin:
I think we will not need your friend, because we have LiaoMi :biggrin:
But the algorithm did not work with me، It gives me in the first step : quotient=x +1 , but the author confirms that the first step : quotient=x^2 +1
"Moreover, div is an auxiliary function that computes the quotient of the Euclidean division." function inverse(a, p)
t := 0; newt := 1;
r := p; newr := a;
while newr ≠ 0
quotient := r div newr
(r, newr) := (newr, r - quotient * newr)
(t, newt) := (newt, t - quotient * newt)
if degree(r) > 0 then
return "Either p is not irreducible or a is a multiple of p"
return (1/r) * t
inversefunc proc a;,p
LOCAL _t,_newt,_r,_newr,_quotient
mov _t,0
mov _newt,1
mov eax,011BH ;p in my case p Always equal 011bh
mov _r,eax
mov eax,a
mov _newr,eax
@@:
xor edx,edx
mov eax,_r
mov ecx,_newr
div ecx
mov _quotient,eax ; quotient := r div newr
mov edx,_newr
mov ebx,_r
mov _r,edx ; r:= newr
mul ecx
sub ebx,eax
mov _newr,ebx ; newr:=r - quotient * newr
mov eax,_quotient
imul _newt
mov ebx,_t
mov edx,_newt
mov _t,edx ; t:=newt
sub ebx,eax
mov _newt,ebx ; newt:=t - quotient * newt)
cmp _newr,0
jne @B
mov eax,_t
ret
inversefunc endp
@mabdelouahab
You can't divide polynomials that way! :badgrin:
Check with Liao Mi, if you don't believe. :biggrin:
Here is another example of using the Euclid-Wallis Algorithm (https://math.stackexchange.com/questions/529156/extended-euclidean-algorithm-in-gf28), to build the table.
We want the inverse of x^5+1 (which is 10 0001 = 21h). The result is x^6+x^5+x^3+x^2+x (which means 110 1110 = 6Eh)
Confirm in the table that the result is correct. I did, and can tel you it is! :thumbsup:
Final code
include masm32rt.inc
include GF28.inc
v=053h
.code
Start:
invoke gf28GetMultiplicativeInverse,v
printf("Multiplicative inverse of {%xh}=%Xh\n",v,eax)
inkey
exit
end Start
GF28.inc:
tPolyType struct
coeff db 16 dup(0)
tPolyType ends
.data
gf28IrreduciblePolynomial db 1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0 ; =11Bh
.code
bytetopoly proc b :byte ,p :ptr poly8
mov esi,p
dec esi
xor eax,eax
mov al,b
mov ecx,-1
bytetopolystrt:
inc cl
cmp cl,8
je bytetopolyend__
inc esi
bt eax,ecx
jnc bytetopolyz00
mov byte ptr [esi],1
jmp bytetopolystrt
bytetopolyz00:
mov byte ptr [esi],0
jmp bytetopolystrt
bytetopolyend__:
ret
bytetopoly endp
wordtopoly proc b :word ,p :ptr poly16
mov esi,p
dec esi
xor eax,eax
mov ax,b
mov ecx,-1
wordtopolystrt:
inc cl
cmp cl,16
je wordtopolyend__
inc esi
bt eax,ecx
jnc wordtopolyz00
mov byte ptr [esi],1
jmp wordtopolystrt
wordtopolyz00:
mov byte ptr [esi],0
jmp wordtopolystrt
wordtopolyend__:
ret
wordtopoly endp
polytobyte proc pb :ptr byte,p :ptr poly8
mov esi,p
mov edi,pb
mov byte ptr [edi],0
xor ecx,ecx
mov eax,1
polytobytestrt:
cmp byte ptr [esi],1
jne @F
or byte ptr [edi],al
@@:
inc esi
shl al,1
inc cl
cmp cl,8
jne polytobytestrt
ret
polytobyte endp
polytoword proc pb :ptr word,p :ptr poly16
mov esi,p
mov edi,pb
mov word ptr [edi],0
xor ecx,ecx
mov eax,1
polytowordstrt:
cmp byte ptr [esi],1
jne @F
or word ptr [edi],ax
@@:
inc esi
shl ax,1
inc cl
cmp cl,16
jne polytowordstrt
ret
polytoword endp
gf28resetpoly proc ptPolyType,deg
mov edi,ptPolyType
xor ecx,ecx
@@: mov byte ptr [edi+ecx],0
inc ecx
cmp ecx,deg
jne @B
ret
gf28resetpoly endp
gf28MovPolyToPoly proc pdes,psrc,l
mov edi,psrc
mov esi,pdes
xor ecx,ecx
@@:
mov al,[edi+ecx]
mov [esi+ecx],al
inc ecx
cmp ecx,l
jne @B
ret
gf28MovPolyToPoly endp
gf28GetDegreePoly proc uses esi edi ecx edx pfx
LOCAL d,notzero_
mov d,0
mov notzero_,0
mov esi,pfx
xor ecx,ecx
@@:
.if byte ptr [esi+ecx] !=0
mov notzero_,1
mov d,ecx
.endif
inc ecx
cmp ecx,15
jne @B
mov eax,d
.if notzero_
inc eax
.endif
ret
gf28GetDegreePoly endp
gf28AddPoly proc pfxgx:ptr,pfx:ptr,pgx:ptr,l
LOCAL tmppoly:tPolyType
invoke gf28resetpoly,addr tmppoly,16
lea esi,tmppoly
mov edi,pfx
mov edx,pgx
xor ecx,ecx
@@: mov al,[edx+ecx]
xor al,[edi+ecx]
mov [esi+ecx],al
inc ecx
cmp ecx,l
jne @B
invoke gf28MovPolyToPoly,pfxgx,addr tmppoly,16
ret
gf28AddPoly endp
gf28mulPoly proc pfxgx:ptr,pfx:ptr,pgx:ptr
LOCAL indx1,indx2
LOCAL tmppoly:tPolyType
invoke gf28resetpoly,addr tmppoly,16
lea esi,tmppoly
mov edi,pfx
mov indx1,0
gf28mulPolyf:
mov edx,pgx
mov indx2,0
gf28mulPolyg:
;*************************
mov al,[edx]
and al,[edi]
mov ecx,indx1
add ecx,indx2
xor al,[esi+ecx]
mov [esi+ecx],al
;*************************
inc edx
inc indx2
cmp indx2,8
jne gf28mulPolyg
inc edi
inc indx1
cmp indx1,8
jne gf28mulPolyf
lea edi,gf28IrreduciblePolynomial
gf28mulPolymod:
invoke gf28GetDegreePoly,addr tmppoly
sub eax,9
mov indx1,eax
.if sdword ptr eax < 0
invoke gf28MovPolyToPoly,pfxgx,addr tmppoly,16
ret
.endif
xor ecx,ecx
@@:
mov edx,indx1
add edx,ecx
mov al,[esi+edx]
xor al,[edi+ecx]
mov [esi+edx],al
inc ecx
cmp ecx,9
jne @B
jmp gf28mulPolymod
ret
gf28mulPoly endp
gf28divpoly proc pfxgx:ptr,pfx:ptr,pgx:ptr,pmod:ptr
LOCAL indx1,indx2
LOCAL tmppolymod:tPolyType
LOCAL tmppolyqot:tPolyType
.if pfxgx==0 && pmod==0
ret
.endif
.if pfxgx != 0
invoke gf28resetpoly, addr tmppolyqot,16
lea esi,tmppolyqot
.endif
invoke gf28resetpoly,addr tmppolymod,16
lea ebx,tmppolymod
mov edi,pfx
xor ecx,ecx
@@:
mov al,[edi+ecx]
mov [ebx+ecx],al
inc ecx
cmp ecx,9
jne @B
mov edi,ebx
mov edx,pgx
mov indx2,rv(gf28GetDegreePoly,pgx)
gf28divmod:
mov indx1,rv(gf28GetDegreePoly,edi)
sub eax,indx2
mov indx1,eax
.if (sdword ptr eax < 0)
.if pmod != 0
invoke gf28MovPolyToPoly,pmod,addr tmppolymod,9
.endif
.if pfxgx != 0
invoke gf28MovPolyToPoly,pfxgx,addr tmppolyqot,9
.endif
ret
.endif
.if pfxgx !=0
mov byte ptr [esi+eax],1
.endif
push edx
push ebx
push edi
pop edi
pop ebx
pop edx
xor ecx,ecx
@@:
mov ebx,indx1
add ebx,ecx
mov al,[edi+ebx]
xor al,[edx+ecx]
mov [edi+ebx],al
inc ecx
cmp ecx,indx2
jne @B
jmp gf28divmod
ret
gf28divpoly endp
gf28invPoly proc pinv,pfx
LOCAL _t :tPolyType
LOCAL _r :tPolyType
LOCAL _newt :tPolyType
LOCAL _newr :tPolyType
LOCAL _quotient :tPolyType
LOCAL _tmp :tPolyType
invoke gf28MovPolyToPoly, addr _r,addr gf28IrreduciblePolynomial,16 ; r := p; invoke wordtopoly,011bh,addr _r
invoke wordtopoly,0h,addr _t ; t := 0;
invoke wordtopoly,1h,addr _newt ; newt := 1;
invoke gf28MovPolyToPoly, addr _newr,pfx,16 ; newr := a
@@:
invoke gf28GetDegreePoly,addr _newr
.if eax == 0
invoke gf28MovPolyToPoly, pinv,addr _t,16
ret
.endif
invoke gf28MovPolyToPoly, addr _tmp,addr _newr,16
invoke gf28divpoly,addr _quotient,addr _r,addr _newr,addr _newr
invoke gf28MovPolyToPoly, addr _r,addr _tmp,16 ; r := _newr;
invoke gf28mulPoly,addr _tmp,addr _newt,addr _quotient
invoke gf28AddPoly,addr _tmp,addr _tmp,addr _t,8
invoke gf28MovPolyToPoly, addr _t,addr _newt ,8 ; t := _newt; ;
invoke gf28MovPolyToPoly,addr _newt, addr _tmp,8
jmp @B ; loop
ret
gf28invPoly endp
gf28GetMultiplicativeInverse proc a:byte
LOCAL _tmpPoly :tPolyType
LOCAL _tmpInvPoly :tPolyType
LOCAL _byteInverse :byte
invoke gf28resetpoly,addr _tmpPoly,16
invoke bytetopoly,a,addr _tmpPoly
invoke gf28invPoly,addr _tmpInvPoly,addr _tmpPoly
invoke polytobyte,addr _byteInverse,addr _tmpInvPoly
xor eax,eax
mov al,_byteInverse
ret
gf28GetMultiplicativeInverse endp
It works but is a confusing mess, it appears like you converted it from some HLL.
All you need are 3 (three) ASM functions. Polynomial divide, GF(2^8) multiply and GF(2^8) add.
.model flat, C
.code
polyDivision proc uses esi edi num:dword, den:dword, quocPtr:ptr, remPtr:ptr
LOCAL remainder
LOCAL term
mov eax, num
mov remainder, eax
bsr esi, remainder
bsr edi, den
xor edx, edx
.while esi>=edi
mov ecx, esi
sub ecx, edi
mov eax, den
shl eax, cl
mov term, eax
mov eax, 1
shl eax, cl
xor edx, eax
mov eax, remainder
xor eax, term
mov remainder, eax
bsr esi, remainder
bsr edi, den
.endw
mov eax, quocPtr
mov dword ptr [eax], edx
mov eax, remPtr
mov ecx, remainder
mov dword ptr [eax], ecx
ret
polyDivision endp
polyMultiply proc mult1:byte, mult2:byte, res : ptr
LOCAL result : byte
mov result, 0
.while (mult1 != 0) && (mult2 !=0)
movzx eax,mult2
and al, 1
.if al !=0
movzx eax, result
movzx ecx, mult1
xor eax, ecx
mov result, al
.endif
movzx eax, mult1
and eax, 80h
.if al != 0
movzx eax, mult1
shl eax, 1
xor eax, 11Bh
mov mult1, al
.else
mov al, mult1
shl al, 1
mov mult1, al
.endif
mov al, mult2
shr al,1
mov mult2, al
.endw
mov eax, res
mov cl, result
mov byte ptr [eax], cl
ret
polyMultiply endp
polyAdd proc add1:byte, add2:byte
movzx eax, add1
movzx ecx, add2
xor eax, ecx
ret
polyAdd endp
end
When you use the above procedures with the following C you can make fast a beautiful table:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyAdd(unsigned char add1, unsigned char add2);
int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
for (i = 0; i < 16; i++)
printf("\t%.2x", i);
printf("\n\n");
for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
polyDivision(r, newr, "ient, &remain);
r = newr;
newr = remain;
polyMultiply(quotient, newt, &tempByte);
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);
}
printf("\n");
}
return 0;
}
Output:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
00 00 01 8d f6 cb 52 7b d1 e8 4f 29 c0 b0 e1 e5 c7
01 74 b4 aa 4b 99 2b 60 5f 58 3f fd cc ff 40 ee b2
02 3a 6e 5a f1 55 4d a8 c9 c1 0a 98 15 30 44 a2 c2
03 2c 45 92 6c f3 39 66 42 f2 35 20 6f 77 bb 59 19
04 1d fe 37 67 2d 31 f5 69 a7 64 ab 13 54 25 e9 09
05 ed 5c 05 ca 4c 24 87 bf 18 3e 22 f0 51 ec 61 17
06 16 5e af d3 49 a6 36 43 f4 47 91 df 33 93 21 3b
07 79 b7 97 85 10 b5 ba 3c b6 70 d0 06 a1 fa 81 82
08 83 7e 7f 80 96 73 be 56 9b 9e 95 d9 f7 02 b9 a4
09 de 6a 32 6d d8 8a 84 72 2a 14 9f 88 f9 dc 89 9a
0a fb 7c 2e c3 8f b8 65 48 26 c8 12 4a ce e7 d2 62
0b 0c e0 1f ef 11 75 78 71 a5 8e 76 3d bd bc 86 57
0c 0b 28 2f a3 da d4 e4 0f a9 27 53 04 1b fc ac e6
0d 7a 07 ae 63 c5 db e2 ea 94 8b c4 d5 9d f8 90 6b
0e b1 0d d6 eb c6 0e cf ad 08 4e d7 e3 5d 50 1e b3
0f 5b 23 38 34 68 46 03 8c dd 9c 7d a0 cd 1a 41 1c
:thumbsup:
deleted
Quote from: nidud on September 13, 2019, 07:26:16 AM
A reduced version with less confusing mess:
At least you understood the mess enough to roll up your sleeves and spend ten minutes looking for a way to save 2 milliseconds. :thumbsup:
Quote
Bear metal is shorter and faster.
Do you mean beer metal? :rolleyes:
Bare :biggrin:
deleted
Quote from: nidud on September 13, 2019, 10:28:51 AM
why are you here?
I am here mostly to have fun with people that publish performance results but don't show how they got them.
In particular, I would like to see how you used RDTSC (http://www.forwardscattering.org/post/15) to measure the clock cycles :skrewy:
:badgrin:
Interesting that when I replace your highly optimized polyDivision with my messy polyDivision the performance difference is about 13%. (60 versus 68 cycles, no chance to cook bitcoins :sad: with this difference)
Cycles polynomial Division: 60 (53898/896) versus Cycles polynomial Division: 68 (61178/896)
I am yet to discover where you obtained 100106 cycles for the original. and 63363 cycles for the optimized. but I am going to think the whole weekend about this and may be able to find out. The likely reason might be related to the bear metal (it does happen :thumbsup:).
Yes, I know I have disregarded some practices like flushing the queue to guarantee serial execution of RDTSC, but this was just to illustrate that your conclusions are delirious. Sure, and I believe as well that these tests are mostly meaningless for modern processors but this is another discussion.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyAdd(unsigned char add1, unsigned char add2);
int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
unsigned long long a, b;
int counter = 0;
unsigned long long accumDif = 0;
for (i = 0; i < 16; i++)
printf("\t%.2x", i);
printf("\n\n");
for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
counter++; // NEW
a=__rdtsc(); // NEW
polyDivision(r, newr, "ient, &remain);
//remain = polyDivision2(r, newr, "ient); // NEW
b = __rdtsc(); // NEW
accumDif += b - a; // NEW
r = newr;
newr = remain;
polyMultiply(quotient, newt, &tempByte);
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);
}
printf("\n");
}
printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter); // NEW
return 0;
}
:skrewy:
And for the GF(2^8) multiply the results are "Cycles GF(2^8) multiply: 105 (94234/896)" for the original and "Cycles GF(2^8) multiply: 90 (80948/896)" for the super-ultra-hyper optimized version from Nidud. A decent, but not impressive, 11.7% improvements. :greenclp:
But I am still waiting for your report explaining where you obtained the 100392 cycles versus 37749 cycles improvement (266% improvement :badgrin:). I guess I will have to wait a long long time. :sad:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyMultiply2(unsigned char mult1, unsigned char mult2);
unsigned char polyAdd(unsigned char add1, unsigned char add2);
int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
unsigned long long a, b;
int counter = 0;
unsigned long long accumDif = 0;
for (i = 0; i < 16; i++)
printf("\t%.2x", i);
printf("\n\n");
for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
counter++;
//a=__rdtsc();
polyDivision(r, newr, "ient, &remain);
//remain = polyDivision2(r, newr, "ient);
//b = __rdtsc();
//accumDif += b - a;
r = newr;
newr = remain;
a = __rdtsc();
//polyMultiply(quotient, newt, &tempByte);
tempByte = polyMultiply2(quotient, newt); //NEW
b = __rdtsc();
accumDif += b - a;
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);
}
printf("\n");
}
//printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter);
printf("Cycles GF(2^8) multiply: %d (%lld/%d)", (int)(accumDif / counter), accumDif, counter);
return 0;
}
deleted
It took me more than a reasonable amount of time to build what you attached here because if there is something I dislike in ASMC is that it is not user friendly. It is made by one ultra nerd for other ultra nerds.
Now, your test does not make sense. You picked up a set of 3 ASM instructions, repeated them 50000 times and showed that if we replace that with just one instruction we would save a lot of cycles! Thank you my friend!
But in the whole procedure that does weight much as I have demonstrated. Of course, it is a positive contribution and I would probably find it if I was set to optimize the code.
Have a nice day. :skrewy:
deleted
You are in a state of reality denial and I will, of course, not feed the hallucination.
Quote
It's a masm compatible assembler so it shouldn't be that complicated.
asmc \masm32\m32lib\*.asm
ml -c \masm32\m32lib\*.asm
Try to build what you dropped here like that then. :badgrin:
The 3 *.asm files need to be built to .bin, a trickery that only JWASM descendents have, and the .cmd file is not a file to be run by cmd.exe :rolleyes: it is an ASM file. WTF! :skrewy:
I don't knew that CMD can make that! Interesting :thumbsup:
Here making .bin with UASM32:\nidud\shr>uasm32 -q -bin ?.asm
\nidud\shr>asmc -q -pe "\nidud\shr\timeit.cmd"
\nidud\shr>timeit.exe
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
----------------------------------------------
-- test(0)
380926 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
353402 cycles, rep(1000), code(160) 1.asm: shr m,1
50546 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(1)
362377 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
358679 cycles, rep(1000), code(160) 1.asm: shr m,1
48543 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(2)
351191 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
358392 cycles, rep(1000), code(160) 1.asm: shr m,1
49500 cycles, rep(1000), code(113) 2.asm: shr al,1
total [0 .. 2], 1++
148589 cycles 2.asm: shr al,1
1070473 cycles 1.asm: shr m,1
1094494 cycles 0.asm: mov al,m | shr al,1 | mov m,al
hit any key to continue...