The MASM Forum

General => The Campus => Topic started by: mabdelouahab on September 09, 2019, 07:31:44 AM

Title: Modular multiplicative inverse
Post by: mabdelouahab on September 09, 2019, 07:31:44 AM
What is the rule to get the multiplicative inverse of  X (8 bits)?
What is the condition that is contrived in 02BH to be the multiplicative inverse  for 015H?

Quote
This affine transformation is the sum of multiple rotations of the byte as a vector, where addition is the XOR operation:

    s = b ⊕ ( b ⋘ 1 ) ⊕ ( b ⋘ 2 ) ⊕ ( b ⋘ 3 ) ⊕ ( b ⋘ 4 ) ⊕ 63

where b represents the multiplicative inverse, ....
Rijndael S-box (https://en.wikipedia.org/wiki/Rijndael_S-box)
Title: Re: Modular multiplicative inverse
Post by: mabdelouahab on September 09, 2019, 05:39:56 PM
Quote
a ≡ b (mod n) means :
    - a and b have the same remainder when divided by n
    - a = K.n + b
    - (a - b) is multiple of n

How it was applied to the Rijndael table?
017h≡05Fh? or 06Dh≡093h? ...

Code: [Select]
.data ;00 , 01   02   03   04   05   06   07   08   09   0A   0B   0C   0D   0E   0F
MulInv DB 001h,001h,08Dh,0F6h,0CBh,052h,07Bh,0D1h,0E8h,04Fh,029h,0C0h,0B0h,0E1h,0E5h,0C7h ;00
DB 074h,0B4h,0AAh,04Bh,099h,02Bh,060h,05Fh,058h,03Fh,0FDh,0CCh,0FFh,040h,0EEh,0B2h ;01
DB 03Ah,06Eh,05Ah,0F1h,055h,04Dh,0A8h,0C9h,0C1h,00Ah,098h,015h,030h,044h,0A2h,0C2h ;02
DB 02Ch,045h,092h,06Ch,0F3h,039h,066h,042h,0F2h,035h,020h,06Fh,077h,0BBh,059h,019h ;03
DB 01Dh,0FEh,037h,067h,02Dh,031h,0F5h,069h,0A7h,064h,0ABh,013h,054h,025h,0E9h,009h ;04
DB 0EDh,05Ch,005h,0CAh,04Ch,024h,087h,0BFh,018h,03Eh,022h,0F0h,051h,0ECh,061h,017h ;05
DB 016h,05Eh,0AFh,0D3h,049h,0A6h,036h,043h,0F4h,047h,091h,0DFh,033h,093h,021h,03Bh ;06
DB 079h,0B7h,097h,085h,010h,0B5h,0BAh,03Ch,0B6h,070h,0D0h,006h,0A1h,0FAh,081h,082h ;07
DB 083h,07Eh,07Fh,080h,096h,073h,0BEh,056h,09Bh,09Eh,095h,0D9h,0F7h,002h,0B9h,0A4h ;08
DB 0DEh,06Ah,032h,06Dh,0D8h,08Ah,084h,072h,02Ah,014h,09Fh,088h,0F9h,0DCh,089h,09Ah ;09
DB 0FBh,07Ch,02Eh,0C3h,08Fh,0B8h,065h,048h,026h,0C8h,012h,04Ah,0CEh,0E7h,0D2h,062h ;0A
DB 00Ch,0E0h,01Fh,0EFh,011h,075h,078h,071h,0A5h,08Eh,076h,03Dh,0BDh,0BCh,086h,057h ;0B
DB 00Bh,028h,02Fh,0A3h,0DAh,0D4h,0E4h,00Fh,0A9h,027h,053h,004h,01Bh,0FCh,0ACh,0E6h ;0C
DB 07Ah,007h,0AEh,063h,0C5h,0DBh,0E2h,0EAh,094h,08Bh,0C4h,0D5h,09Dh,0F8h,090h,06Bh ;0D
DB 0B1h,00Dh,0D6h,0EBh,0C6h,00Eh,0CFh,0ADh,008h,04Eh,0D7h,0E3h,05Dh,050h,01Eh,0B3h ;0E
DB 05Bh,023h,038h,034h,068h,046h,003h,08Ch,0DDh,09Ch,07Dh,0A0h,0CDh,01Ah,041h,01Ch ;0F
Title: Re: Modular multiplicative inverse
Post by: LiaoMi on September 09, 2019, 06:46:14 PM
Hi mabdelouahab,

4.2 Multiplication
5.1.1 SubBytes()Transformation
http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf)
Title: Re: Modular multiplicative inverse
Post by: AW on September 10, 2019, 12:09:01 AM
Your question appears to be: How is the Multiplicative Inverse Table produced for the Rijndael characteristic 2 finite field which uses the following polynomial for multiplication: x^8 + x^4 + x^3 + x + 1?

There are a few ways, but you can try the Euclid-Wallis Algorithm, mentioned here on the chapter Simple algebraic field extensions (https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm).
If you follow the mentioned example where a= x^6+x^4+x+1 (binary 0101 0011 = 53h) you will get the inverse as x^7 + x^6 + x^3 + x (binary 1100 1010 = CAh). Confirm in the table that it is correct!

If you don't understand, I can't explain better because the person that explained it to me will not be available in the short term.  :biggrin:
Title: Re: Modular multiplicative inverse
Post by: manuel on September 10, 2019, 03:47:28 AM
Translated by Google sorry.
I think you want to calculate the multiplicative inverse of x (Mod 255) you need to calculate
MCDE with the values of x, 255.  If the mcde (x, 255) is different from 1, then x has no inverse
If the mcde (x, 255) == 1 then the algorithm also returns the inverse.
I have the function written in C # for the BigInteger class. If you want it, you tell me.
Title: Re: Modular multiplicative inverse
Post by: manuel on September 10, 2019, 04:05:45 AM
Mcd(2Bh,FFh) = 1 OK. The condition       inv(2B) = ACh 
Mcd(15h,FFh) = 3 != 1   No inverse exist.
Title: Re: Modular multiplicative inverse
Post by: AW on September 10, 2019, 04:17:48 AM
Mcd(2Bh,FFh) = 1 OK. The condition       inv(2B) = ACh 
Mcd(15h,FFh) = 3 != 1   No inverse exist.

We are talking about a different thing,we are talking about the Finite Field GF(2^8). Things are much more complicated.
Introduction to Finite Fields.
http://www.cs.utsa.edu/~wagner/laws/FFM.html
Title: Re: Modular multiplicative inverse
Post by: mabdelouahab on September 10, 2019, 06:54:30 AM
Hi mabdelouahab,

4.2 Multiplication
5.1.1 SubBytes()Transformation
http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf)
Hi LiaoMi
"For  the  AES  algorithm,  this  irreducible polynomial is m(x) = x^8 + x^4 + x^3 + x +1 or {01}{1b} in hexadecimal notation."
These explain a lot, Thanks

If you don't understand, I can't explain better because the person that explained it to me will not be available in the short term.  :biggrin:

I think we will not need your friend, because we have LiaoMi  :biggrin:
But the algorithm did not work with me، It gives me in the first step : quotient=x +1 , but the author confirms that the first step : quotient=x^2 +1
 
"Moreover, div is an auxiliary function that computes the quotient of the Euclidean division."
Code: [Select]
function inverse(a, p)
    t := 0;     newt := 1;   
    r := p;     newr := a;   
    while newr ≠ 0
        quotient := r div newr
        (r, newr) := (newr, r - quotient * newr)
        (t, newt) := (newt, t - quotient * newt)
    if degree(r) > 0 then
        return "Either p is not irreducible or a is a multiple of p"
    return (1/r) * t

Code: [Select]
inversefunc proc a;,p
LOCAL _t,_newt,_r,_newr,_quotient
mov _t,0
mov _newt,1
mov eax,011BH ;p in my case p Always equal 011bh
mov _r,eax
mov eax,a
mov _newr,eax
@@:
xor edx,edx
mov eax,_r
mov ecx,_newr
div ecx
mov _quotient,eax ;        quotient := r div newr
mov edx,_newr
mov ebx,_r
mov _r,edx ; r:= newr
mul ecx
sub ebx,eax
mov _newr,ebx ; newr:=r - quotient * newr
mov eax,_quotient
imul _newt
mov ebx,_t
mov edx,_newt
mov _t,edx ; t:=newt
sub ebx,eax
mov _newt,ebx ; newt:=t - quotient * newt)
cmp _newr,0
jne @B
mov eax,_t
ret

inversefunc endp
Title: Re: Modular multiplicative inverse
Post by: AW on September 10, 2019, 01:20:47 PM
@mabdelouahab
You can't divide polynomials that way!  :badgrin:
Check with Liao Mi, if you don't believe. :biggrin:

Here is another example of using the Euclid-Wallis Algorithm (https://math.stackexchange.com/questions/529156/extended-euclidean-algorithm-in-gf28), to build the table.
We want the inverse of x^5+1 (which is 10 0001 = 21h). The result is x^6+x^5+x^3+x^2+x (which means 110 1110 = 6Eh)
Confirm in the table that the result is correct. I did, and can tel you it is!  :thumbsup:
Title: Re: Modular multiplicative inverse
Post by: mabdelouahab on September 13, 2019, 01:31:30 AM
Final code

Code: [Select]
include masm32rt.inc
include GF28.inc
v=053h
.code
Start:
invoke gf28GetMultiplicativeInverse,v
printf("Multiplicative inverse of {%xh}=%Xh\n",v,eax)
inkey
exit

end Start

GF28.inc:
Code: [Select]
tPolyType struct
coeff db 16 dup(0)
tPolyType ends
.data
  gf28IrreduciblePolynomial db 1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0 ; =11Bh

.code
bytetopoly proc b  :byte ,p :ptr poly8
mov esi,p
dec esi
xor eax,eax
mov al,b
mov ecx,-1
bytetopolystrt:
inc cl
cmp cl,8
je bytetopolyend__
inc esi
bt eax,ecx
jnc bytetopolyz00
mov byte ptr [esi],1
jmp bytetopolystrt
bytetopolyz00:
mov byte ptr [esi],0
jmp bytetopolystrt
bytetopolyend__:
ret
bytetopoly endp
wordtopoly proc b  :word ,p :ptr poly16
mov esi,p
dec esi
xor eax,eax
mov ax,b
mov ecx,-1
wordtopolystrt:
inc cl
cmp cl,16
je wordtopolyend__
inc esi
bt eax,ecx
jnc wordtopolyz00
mov byte ptr [esi],1
jmp wordtopolystrt
wordtopolyz00:
mov byte ptr [esi],0
jmp wordtopolystrt
wordtopolyend__:
ret
wordtopoly endp
polytobyte proc pb :ptr byte,p :ptr poly8
mov esi,p
mov edi,pb
mov byte ptr [edi],0
xor ecx,ecx
mov eax,1
polytobytestrt:
cmp byte ptr [esi],1
jne @F
or byte ptr [edi],al
@@:
inc esi
shl al,1
inc cl
cmp cl,8
jne polytobytestrt
ret
polytobyte endp
polytoword proc pb :ptr word,p :ptr poly16
mov esi,p
mov edi,pb
mov word ptr [edi],0
xor ecx,ecx
mov eax,1
polytowordstrt:
cmp byte ptr [esi],1
jne @F
or word ptr [edi],ax
@@:
inc esi
shl ax,1
inc cl
cmp cl,16
jne polytowordstrt
ret
polytoword endp
gf28resetpoly proc ptPolyType,deg
mov edi,ptPolyType
xor ecx,ecx
@@: mov byte ptr [edi+ecx],0
inc ecx
cmp ecx,deg
jne @B
ret
gf28resetpoly endp

gf28MovPolyToPoly proc pdes,psrc,l
mov edi,psrc
mov esi,pdes
xor ecx,ecx
@@:
mov al,[edi+ecx]
mov [esi+ecx],al
inc ecx
cmp ecx,l
jne @B
ret
gf28MovPolyToPoly endp

gf28GetDegreePoly proc uses esi edi ecx edx pfx
LOCAL d,notzero_
mov d,0
mov notzero_,0
mov esi,pfx

xor ecx,ecx
@@:
.if byte ptr [esi+ecx] !=0
mov notzero_,1
mov d,ecx
.endif
inc ecx
cmp ecx,15
jne @B
mov eax,d
.if notzero_
inc eax
.endif
ret

gf28GetDegreePoly endp

gf28AddPoly proc pfxgx:ptr,pfx:ptr,pgx:ptr,l
LOCAL tmppoly:tPolyType
invoke gf28resetpoly,addr tmppoly,16

lea esi,tmppoly
mov edi,pfx
mov edx,pgx
xor ecx,ecx

@@: mov al,[edx+ecx]
xor al,[edi+ecx]
mov [esi+ecx],al
inc ecx
cmp ecx,l
jne @B

invoke gf28MovPolyToPoly,pfxgx,addr tmppoly,16

ret
gf28AddPoly endp

gf28mulPoly proc pfxgx:ptr,pfx:ptr,pgx:ptr
LOCAL indx1,indx2
LOCAL tmppoly:tPolyType
invoke gf28resetpoly,addr tmppoly,16
lea esi,tmppoly
mov edi,pfx
mov indx1,0
gf28mulPolyf:
mov edx,pgx
mov indx2,0
gf28mulPolyg:
;*************************
mov al,[edx]
and al,[edi]
mov ecx,indx1
add ecx,indx2
xor al,[esi+ecx]
mov [esi+ecx],al
;*************************
inc edx
inc indx2
cmp indx2,8
jne gf28mulPolyg
inc edi
inc indx1
cmp indx1,8
jne gf28mulPolyf

lea edi,gf28IrreduciblePolynomial

gf28mulPolymod:
invoke gf28GetDegreePoly,addr tmppoly

sub eax,9
mov indx1,eax
.if sdword ptr eax < 0
invoke gf28MovPolyToPoly,pfxgx,addr tmppoly,16
ret
.endif
xor ecx,ecx
@@:
mov edx,indx1
add edx,ecx
mov al,[esi+edx]
xor al,[edi+ecx]
mov [esi+edx],al
inc ecx
cmp ecx,9
jne @B
jmp gf28mulPolymod

ret
gf28mulPoly endp
gf28divpoly proc  pfxgx:ptr,pfx:ptr,pgx:ptr,pmod:ptr
LOCAL indx1,indx2
LOCAL tmppolymod:tPolyType
LOCAL tmppolyqot:tPolyType
.if pfxgx==0 && pmod==0
ret
.endif
.if pfxgx != 0
invoke gf28resetpoly, addr tmppolyqot,16
lea esi,tmppolyqot
.endif
invoke gf28resetpoly,addr tmppolymod,16
lea ebx,tmppolymod
mov edi,pfx
xor ecx,ecx
@@:
mov al,[edi+ecx]
mov [ebx+ecx],al
inc ecx
cmp ecx,9
jne @B

mov edi,ebx
mov edx,pgx

mov indx2,rv(gf28GetDegreePoly,pgx)
gf28divmod:
mov indx1,rv(gf28GetDegreePoly,edi)
sub eax,indx2
mov indx1,eax
.if (sdword ptr eax < 0)
.if pmod != 0
invoke gf28MovPolyToPoly,pmod,addr tmppolymod,9
.endif
.if pfxgx != 0
invoke gf28MovPolyToPoly,pfxgx,addr tmppolyqot,9
.endif
ret
.endif
.if pfxgx !=0
mov byte ptr [esi+eax],1
.endif
push edx
push ebx
push edi
pop edi
pop ebx
pop edx

xor ecx,ecx
@@:
mov ebx,indx1
add ebx,ecx
mov al,[edi+ebx]
xor al,[edx+ecx]
mov [edi+ebx],al
inc ecx
cmp ecx,indx2
jne @B
jmp gf28divmod

ret

gf28divpoly endp

gf28invPoly proc pinv,pfx
LOCAL _t :tPolyType
LOCAL _r :tPolyType
LOCAL _newt :tPolyType
LOCAL _newr :tPolyType
LOCAL _quotient :tPolyType
LOCAL _tmp :tPolyType

invoke gf28MovPolyToPoly, addr _r,addr gf28IrreduciblePolynomial,16 ; r := p; invoke wordtopoly,011bh,addr _r       
invoke wordtopoly,0h,addr _t ;    t := 0;       
invoke wordtopoly,1h,addr _newt ;    newt := 1;   
invoke gf28MovPolyToPoly, addr _newr,pfx,16 ;    newr := a

@@:
invoke gf28GetDegreePoly,addr _newr
.if eax == 0
invoke gf28MovPolyToPoly, pinv,addr _t,16
ret
.endif
invoke gf28MovPolyToPoly, addr _tmp,addr _newr,16    
invoke gf28divpoly,addr _quotient,addr _r,addr _newr,addr _newr
invoke gf28MovPolyToPoly, addr _r,addr _tmp,16 ;    r := _newr;   

invoke gf28mulPoly,addr _tmp,addr _newt,addr _quotient
invoke gf28AddPoly,addr _tmp,addr _tmp,addr _t,8
invoke gf28MovPolyToPoly, addr _t,addr _newt ,8 ;    t := _newt; ;   
invoke gf28MovPolyToPoly,addr _newt, addr _tmp,8   
jmp @B ; loop
ret
gf28invPoly endp
gf28GetMultiplicativeInverse proc a:byte
LOCAL _tmpPoly :tPolyType
LOCAL _tmpInvPoly :tPolyType
LOCAL _byteInverse :byte

invoke gf28resetpoly,addr _tmpPoly,16
invoke bytetopoly,a,addr _tmpPoly
invoke gf28invPoly,addr _tmpInvPoly,addr _tmpPoly
invoke polytobyte,addr _byteInverse,addr _tmpInvPoly
xor eax,eax
mov al,_byteInverse
ret
gf28GetMultiplicativeInverse endp
Title: Re: Modular multiplicative inverse
Post by: AW on September 13, 2019, 03:58:43 AM
It works but is a confusing mess, it appears like you converted it from some HLL.
All you need are 3 (three) ASM functions. Polynomial divide, GF(2^8) multiply and GF(2^8) add.

Code: [Select]
.model flat, C

.code

polyDivision proc uses esi edi num:dword, den:dword, quocPtr:ptr, remPtr:ptr
LOCAL remainder
LOCAL term

mov eax, num
mov remainder, eax
bsr esi, remainder
bsr edi, den
xor edx, edx

.while esi>=edi
mov ecx, esi
sub ecx, edi
mov eax, den
shl eax, cl
mov term, eax
mov eax, 1
shl eax, cl
xor edx, eax
mov eax, remainder
xor eax, term
mov remainder, eax

bsr esi, remainder
bsr edi, den
.endw
mov eax, quocPtr
mov dword ptr [eax], edx
mov eax, remPtr
mov ecx, remainder
mov dword ptr [eax], ecx

ret
polyDivision endp

polyMultiply proc mult1:byte, mult2:byte, res : ptr
LOCAL result : byte
mov result, 0
.while (mult1 != 0) && (mult2 !=0)
movzx eax,mult2
and al, 1
.if al !=0
movzx eax, result
movzx ecx, mult1
xor eax, ecx
mov result, al
.endif
movzx eax, mult1
and eax, 80h
.if al != 0
movzx eax, mult1
shl eax, 1
xor eax, 11Bh
mov mult1, al
.else
mov al, mult1
shl al, 1
mov mult1, al
.endif
mov al, mult2
shr al,1
mov mult2, al
.endw
mov eax, res
mov cl, result
mov byte ptr [eax], cl
ret
polyMultiply endp

polyAdd proc add1:byte, add2:byte
movzx eax, add1
movzx ecx, add2
xor eax, ecx
ret
polyAdd endp

end

When you use the above procedures with the following C you can make fast a beautiful table:

Code: [Select]
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;

for (i = 0; i < 16; i++)
printf("\t%.2x", i);

printf("\n\n");

for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
polyDivision(r, newr, &quotient, &remain);
r = newr;
newr = remain;
polyMultiply(quotient, newt, &tempByte);
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);

}
printf("\n");
}

return 0;
}


Output:

Code: [Select]
        00      01      02      03      04      05      06      07      08      09      0a      0b      0c      0d     0e       0f

00      00      01      8d      f6      cb      52      7b      d1      e8      4f      29      c0      b0      e1     e5       c7
01      74      b4      aa      4b      99      2b      60      5f      58      3f      fd      cc      ff      40     ee       b2
02      3a      6e      5a      f1      55      4d      a8      c9      c1      0a      98      15      30      44     a2       c2
03      2c      45      92      6c      f3      39      66      42      f2      35      20      6f      77      bb     59       19
04      1d      fe      37      67      2d      31      f5      69      a7      64      ab      13      54      25     e9       09
05      ed      5c      05      ca      4c      24      87      bf      18      3e      22      f0      51      ec     61       17
06      16      5e      af      d3      49      a6      36      43      f4      47      91      df      33      93     21       3b
07      79      b7      97      85      10      b5      ba      3c      b6      70      d0      06      a1      fa     81       82
08      83      7e      7f      80      96      73      be      56      9b      9e      95      d9      f7      02     b9       a4
09      de      6a      32      6d      d8      8a      84      72      2a      14      9f      88      f9      dc     89       9a
0a      fb      7c      2e      c3      8f      b8      65      48      26      c8      12      4a      ce      e7     d2       62
0b      0c      e0      1f      ef      11      75      78      71      a5      8e      76      3d      bd      bc     86       57
0c      0b      28      2f      a3      da      d4      e4      0f      a9      27      53      04      1b      fc     ac       e6
0d      7a      07      ae      63      c5      db      e2      ea      94      8b      c4      d5      9d      f8     90       6b
0e      b1      0d      d6      eb      c6      0e      cf      ad      08      4e      d7      e3      5d      50     1e       b3
0f      5b      23      38      34      68      46      03      8c      dd      9c      7d      a0      cd      1a     41       1c
Title: Re: Modular multiplicative inverse
Post by: mabdelouahab on September 13, 2019, 04:51:22 AM
 :thumbsup:
Title: Re: Modular multiplicative inverse
Post by: nidud on September 13, 2019, 07:26:16 AM
It works but is a confusing mess, it appears like you converted it from some HLL.

Indeed :biggrin:

The divide function should return a value so the last argument is not needed.

unsigned polyDivision(unsigned num, unsigned den, unsigned* polyDivision);

The function is 89 bytes and bench at 100106 cycles.

A reduced version with less confusing mess:

polyDivision proc uses edi num:dword, den:dword, quocPtr:ptr
    mov eax,num
    bsr edi,den
    bsr ecx,eax
    xor edx,edx
    .while ecx >= edi
        sub ecx,edi
        mov edi,den
        shl edi,cl
        xor eax,edi
        mov edi,1
        shl edi,cl
        xor edx,edi
        bsr ecx,eax
        bsr edi,den
    .endw
    mov ecx,quocPtr
    mov [ecx],edx
    ret
polyDivision endp

57 bytes and 63363 cycles.

You should try to avoid using memory operands if possible. Bear metal is shorter and faster.

polyMultiply proc mult1:byte, mult2:byte
    xor eax,eax
    mov dl,mult1
    mov dh,mult2
    .while dl && dh
        .if dh & 1
            xor al,dl
        .endif
        .if dl & 80h
            shl dl,1
            xor dl,1Bh
        .else
            shl dl,1
        .endif
        shr dh,1
    .endw
    ret
polyMultiply endp

106 byte and 100392 cycles versus 48 byte and 37749 cycles.
Title: Re: Modular multiplicative inverse
Post by: AW on September 13, 2019, 08:38:47 AM
A reduced version with less confusing mess:

At least you understood the mess enough to roll up your sleeves and spend ten minutes looking for a way to save 2 milliseconds.  :thumbsup:

Quote
Bear metal is shorter and faster.

Do you mean beer metal?   :rolleyes:
Title: Re: Modular multiplicative inverse
Post by: HSE on September 13, 2019, 09:53:20 AM
Bare  :biggrin:
Title: Re: Modular multiplicative inverse
Post by: nidud on September 13, 2019, 10:28:51 AM
At least you understood the mess enough to roll up your sleeves and spend ten minutes looking for a way to save 2 milliseconds.  :thumbsup:

That's one way of looking at it I guess but 100392 versus 37749 cycles is more than 60% reduction in energy consumption: your battery will last 10 days instead of 4. You may increase your bitcoin production by 60% and all the other reasons why people use assembler.

Why do you write assembler code and why are you here?

Quote
Do you mean beer metal?   :rolleyes:

Probably a Russian bear so the metal must be a reference to Stalin.
Title: Re: Modular multiplicative inverse
Post by: AW on September 13, 2019, 04:56:30 PM
why are you here?

I am here mostly to have fun with people that publish performance results but don't show how they got them.

In particular, I would like to see how you used RDTSC (http://www.forwardscattering.org/post/15) to measure the clock cycles    :skrewy:
Title: Re: Modular multiplicative inverse
Post by: AW on September 14, 2019, 02:45:01 AM
 :badgrin:

Interesting that when I replace your highly optimized polyDivision with my messy polyDivision the performance difference is about 13%. (60 versus 68 cycles, no chance to cook bitcoins  :sad: with this difference)
Cycles polynomial Division: 60 (53898/896) versus Cycles polynomial Division: 68 (61178/896)

I am yet to discover where you obtained 100106 cycles for the original. and 63363 cycles for the optimized. but I am going to think the whole weekend about this and may be able to find out. The likely reason might be related to the bear metal (it does happen  :thumbsup:).

Yes, I know I have disregarded some practices like flushing the queue to guarantee serial execution of RDTSC, but this was just to illustrate that your conclusions are delirious. Sure, and I believe as well that these tests are mostly meaningless for modern processors but this is another discussion.

Code: [Select]
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
unsigned long long a, b;
int counter = 0;
unsigned long long accumDif = 0;

for (i = 0; i < 16; i++)
printf("\t%.2x", i);

printf("\n\n");

for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
counter++; // NEW
a=__rdtsc(); // NEW
polyDivision(r, newr, &quotient, &remain);
//remain = polyDivision2(r, newr, &quotient); // NEW
b = __rdtsc(); // NEW
accumDif += b - a; // NEW
r = newr;
newr = remain;
polyMultiply(quotient, newt, &tempByte);
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);

}
printf("\n");
}

printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter); // NEW

return 0;
}







Title: Re: Modular multiplicative inverse
Post by: AW on September 14, 2019, 06:08:51 PM
 :skrewy:

And for the GF(2^8) multiply the results are "Cycles GF(2^8) multiply: 105 (94234/896)" for the original and "Cycles GF(2^8) multiply: 90 (80948/896)" for the super-ultra-hyper optimized version from Nidud. A decent, but not impressive, 11.7% improvements.  :greenclp:
But I am still waiting for your report explaining where you obtained the 100392 cycles versus 37749 cycles improvement (266% improvement  :badgrin:). I guess I will have to wait a long long time.  :sad:

Code: [Select]
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

extern
void polyDivision(unsigned num, unsigned den, unsigned* polyDivision, unsigned* remPtr);
int polyDivision2(unsigned num, unsigned den, unsigned* polyDivision);
void polyMultiply(unsigned char mult1, unsigned char mult2, unsigned char* res);
unsigned char polyMultiply2(unsigned char mult1, unsigned char mult2);
unsigned char polyAdd(unsigned char add1, unsigned char add2);

int main()
{
int i, j;
int t, newt, r, newr, quotient, remain;
unsigned char tempByte, tempNew, printValue, tempNewr;
unsigned long long a, b;
int counter = 0;
unsigned long long accumDif = 0;

for (i = 0; i < 16; i++)
printf("\t%.2x", i);

printf("\n\n");

for (i = 0; i < 16; i++)
{
printf("%.2x\t", i);
for (j = 0; j < 16; j++)
{
tempNewr = i * 16 + j;
newr = tempNewr;
r = 0x11B;
t = 0;
newt = 1;
if (newr>1)
{
while (newr != 1)
{
counter++;
//a=__rdtsc();
polyDivision(r, newr, &quotient, &remain);
//remain = polyDivision2(r, newr, &quotient);
//b = __rdtsc();
//accumDif += b - a;
r = newr;
newr = remain;
a = __rdtsc();
//polyMultiply(quotient, newt, &tempByte);
tempByte = polyMultiply2(quotient, newt); //NEW
b = __rdtsc();
accumDif += b - a;
tempNew = polyAdd(t, tempByte);
t = newt;
newt = tempNew;
}
}
if (tempNewr < 2)
printValue = tempNewr;
else
printValue = tempNew;
printf("%.2x\t", printValue);

}
printf("\n");
}

//printf("Cycles polynomial Division: %d (%lld/%d)", (int) (accumDif / counter), accumDif, counter);
printf("Cycles GF(2^8) multiply: %d (%lld/%d)", (int)(accumDif / counter), accumDif, counter);

return 0;
}


Title: Re: Modular multiplicative inverse
Post by: nidud on September 14, 2019, 09:33:31 PM
 :biggrin:

Still confused I see. The original claim:
Quote
it appears like you converted it from some HLL.

Then you produced this:

        mov al, mult2
        shr al,1
        mov mult2, al

A simple test case:

foo proc
  local tmp:byte
  repeat 50
    mov al,tmp
    shr al,1
    mov tmp,al
  endm
    ret
foo endp
...
  repeat 50
    shr tmp,1
  endm
...
  mov al,tmp
  repeat 50
    shr al,1
  endm

result:

Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- test(0)
   237712 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   233321 cycles, rep(1000), code(160) 1.asm: shr m,1
    41915 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(1)
   236884 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   223148 cycles, rep(1000), code(160) 1.asm: shr m,1
    41330 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(2)
   232043 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   223281 cycles, rep(1000), code(160) 1.asm: shr m,1
    40657 cycles, rep(1000), code(113) 2.asm: shr al,1

total [0 .. 2], 1++
   123902 cycles 2.asm: shr al,1
   679750 cycles 1.asm: shr m,1
   706639 cycles 0.asm: mov al,m | shr al,1 | mov m,al

bloat: 410/113
speed: 70/12

Quote
but I am going to think the whole weekend about this and may be able to find out
Well, good luck and have a nice weekend  :thumbsup:
Title: Re: Modular multiplicative inverse
Post by: AW on September 15, 2019, 12:33:54 AM
It took me more than a reasonable amount of time to build what you attached here because if there is something I dislike in ASMC is that it is not user friendly. It is made by one ultra nerd for other ultra nerds.

Now, your test does not make sense. You picked up a set of 3 ASM instructions, repeated them 50000 times and showed that if we replace that with just one instruction we would save a lot of cycles! Thank you my friend!
But in the whole procedure that does weight much as I have demonstrated. Of course, it is a positive contribution and I would probably find it if I was set to optimize the code.

Have a nice day.  :skrewy:
Title: Re: Modular multiplicative inverse
Post by: nidud on September 17, 2019, 03:36:49 AM
It took me more than a reasonable amount of time to build what you attached here

It's just a batch file.

Quote
because if there is something I dislike in ASMC is that it is not user friendly.

It's a masm compatible assembler so it shouldn't be that complicated.

asmc \masm32\m32lib\*.asm
ml -c \masm32\m32lib\*.asm

As for the timing code this is written by MichaelW which is located in \masm32\macros\timers.asm. You will find hundredths of samples of this in the Lab.

Quote
It is made by one ultra nerd for other ultra nerds.

A person with low social skills and a mental capacity of a child but still capable of writing some usable HLL code.

Quote
Now, your test does not make sense.

Hence the question of your presence here. You enter this forum by producing a document claiming it's impossible to beat optimized compiler code. When pressed on this you claim it doesn't matter so why assembler? It doesn't make any sense.

Quote
You picked up a set of 3 ASM instructions, repeated them 50000 times and showed that if we replace that with just one instruction we would save a lot of cycles!

7:1 is 700% and you do this consistently throughout your code so this will obviously have a negative effect on performance.

Quote
But in the whole procedure that does weight much as I have demonstrated.

Demonstrated how exactly? You think your translation performs any better than the original HLL code?

Quote
Of course, it is a positive contribution and I would probably find it if I was set to optimize the code.

So you just added all this bloat because you had some time to kill?

Well, my rant was more about your social skills or lack thereof in this case where you constantly dunk on others for tings you consistently do all the time yourself.
Title: Re: Modular multiplicative inverse
Post by: AW on September 17, 2019, 04:39:29 AM
You are in a state of reality denial and I will, of course, not feed the hallucination.

Quote
It's a masm compatible assembler so it shouldn't be that complicated.
asmc \masm32\m32lib\*.asm
ml -c \masm32\m32lib\*.asm

Try to build what you dropped here like that then.   :badgrin:
The 3 *.asm files need to be built to .bin, a trickery that only JWASM descendents have, and the .cmd file is not a file to be run by cmd.exe  :rolleyes: it is an ASM file. WTF!  :skrewy:
Title: Re: Modular multiplicative inverse
Post by: HSE on September 17, 2019, 04:56:21 AM
I don't knew that CMD can make that! Interesting :thumbsup:

Here making .bin with UASM32:
Code: [Select]
\nidud\shr>uasm32 -q -bin ?.asm

\nidud\shr>asmc -q -pe "\nidud\shr\timeit.cmd"

\nidud\shr>timeit.exe
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
----------------------------------------------
-- test(0)
   380926 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   353402 cycles, rep(1000), code(160) 1.asm: shr m,1
    50546 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(1)
   362377 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   358679 cycles, rep(1000), code(160) 1.asm: shr m,1
    48543 cycles, rep(1000), code(113) 2.asm: shr al,1
-- test(2)
   351191 cycles, rep(1000), code(410) 0.asm: mov al,m | shr al,1 | mov m,al
   358392 cycles, rep(1000), code(160) 1.asm: shr m,1
    49500 cycles, rep(1000), code(113) 2.asm: shr al,1

total [0 .. 2], 1++
   148589 cycles 2.asm: shr al,1
  1070473 cycles 1.asm: shr m,1
  1094494 cycles 0.asm: mov al,m | shr al,1 | mov m,al
hit any key to continue...