Changed argument loading for 64-bit fastcall and vectorcall: arguments are now loaded right to left.
This now works:
p5(rcx,rdx,r8, r9, 0 )
p5(0, rcx,rdx,r8, r9 )
p5(0, 0, rcx,rdx,r8 )
p5(0, 0, 0, rcx,rdx )
p5(0, 0, 0, 0, rcx )
This still fails (a source register is overwritten before it is read):
p5(rdx,r8, r9, 0, 0 )
p5(r8, r9, 0, 0, 0 )
p5(r9, 0, 0, 0, 0 )
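
The pattern in the two lists can be reproduced with a small model. This is only a sketch, not the assembler's actual code: it assumes the first four arguments go to rcx, rdx, r8, r9 (Microsoft x64 convention), the fifth goes to a stack slot, and each argument is moved into place one at a time from right to left. The name load_right_to_left is made up for illustration.

```python
# Hypothetical model: parameters 1-4 live in rcx, rdx, r8, r9; parameter 5
# goes to a stack slot. Arguments are moved into place from right to left.
PARAM_REGS = ["rcx", "rdx", "r8", "r9"]


def load_right_to_left(args):
    """Return True if every parameter ends up with the value the caller meant."""
    # Give each register a distinct symbolic value so clobbering is detectable.
    regs = {r: "old_" + r for r in PARAM_REGS}
    # Intended value of each parameter, taken before anything is moved.
    want = [regs.get(a, a) for a in args]
    got = [None] * len(args)
    for i in reversed(range(len(args))):
        value = regs.get(args[i], args[i])  # read the source as it is *now*
        if i < len(PARAM_REGS):
            regs[PARAM_REGS[i]] = value     # mov param_reg, source
        got[i] = value                      # register or stack slot contents
    return got == want


calls = [
    ["rcx", "rdx", "r8", "r9", 0],
    [0, "rcx", "rdx", "r8", "r9"],
    [0, 0, "rcx", "rdx", "r8"],
    [0, 0, 0, "rcx", "rdx"],
    [0, 0, 0, 0, "rcx"],
    ["rdx", "r8", "r9", 0, 0],
    ["r8", "r9", 0, 0, 0],
    ["r9", 0, 0, 0, 0],
]
for call in calls:
    print(call, "ok" if load_right_to_left(call) else "clobbered")
```

In this model the first five calls come out intact because each register is read before the move that overwrites it. In the last three, r9 is set to 0 for the fourth parameter before it is read as the source for an earlier one.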