You can see this problem better if you think with odd, even, odd, even, ... . Or at some point with even, odd, even, odd, ... .
On entry point of our windows program, rsp=???8.
Before we do a 'call' stack should be 16 bytes aligned, so we should subtract from rsp 8 bytes to get rsp=???0 and after do a call, and on procedure entry rsp is back again to rsp=???8.
Suppose we call a function that have 4 parameters only and configuration below:
4*8 because rcx,rdx,r8,r9 will be saved on stack, like function parameters.
7*8 because locals
3*8 because I'm supposing that a function that have maximum parameters inside this procedure have 7 parameters, the first 4 goes into registers, so we need 3 on stack, and this space will be reusable by all others functions with minimum parameters. We setup to biggest parameter functions.
So our prologue can be
sub rsp, (4*8+7*8+3*8 ) == 4+7+3==14, the result is even, this is not ok because rsp=???8(odd). We need insert a foo (1*8 ) to stack get aligned into this prologue.
This way you can free rbp.
Other way is:
on entry point of our program, rsp=???8, but if we just do a call to a procedure without parameters, like 'call main', so this will align stack to 16 bytes multiple.
This is why I said about odds and evens, and my opinions this is the real challenge on coding to x86-64 O.S. (the same problem on linux side, just entry point is aligned to 16 bytes instead of 8 bytes, and instead of 4 shadows you have 6 shadows (both are even)). But I'm not talking about non volatile registers into this context, so, odds and evens again.
Dword locals are pseudo promoted to qwords, this is why it is a qword as have been said.
A simple test is, call at entry point printf function with 7 parameters, after with 8. One of these will be invalid.
So, now do the same as before but now with stack aligned, one of that will be invalid again.