Just for reference, I was working on a software renderer some time back. On my Core i7 I was doing 5 million points at 35 fps. That was for the entire pipeline: 4x4 transforms, rotation, treating each point as a vertex and calculating Phong shading parameters at each one, texture look-ups, and drawing them to the screen as single pixels. I was using it as a test bench for optimizing my transform stages with SIMD/swizzles and breaking the transformation up into blocks for parallel processing.
In terms of putting together a 3D engine for yourself, I also wouldn't advise doing your projection in this standard equation form. It would be much better to have the whole engine use homogeneous 4x4 matrices. The output of a projection matrix gives you what you need for 3D clipping (homogeneous clip space), the w coordinate for the perspective divide, etc. It also makes it easier to scale up to device coordinate space.
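To make the matrix approach a bit more concrete, here is a rough sketch of what I mean, not a drop-in implementation. It assumes an OpenGL-style convention (right-handed view space, camera looking down -z, clip volume of -w to +w on every axis); your engine may well use a different one. The function names (perspective, transform, inside_clip_volume, to_screen) are just made up for this example.

```python
import math

def perspective(fov_y_deg, aspect, near, far):
    """Build a 4x4 perspective matrix (row-major, applied to column vectors).

    Assumption: OpenGL-style convention, camera looks down -z.
    """
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],  # copies -z into w for the perspective divide
    ]

def transform(m, v):
    """Multiply a 4x4 matrix by a homogeneous point [x, y, z, w]."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def inside_clip_volume(clip):
    """Test a point in homogeneous clip space, before any divide."""
    x, y, z, w = clip
    return w > 0.0 and all(-w <= c <= w for c in (x, y, z))

def to_screen(clip, width, height):
    """Perspective divide to NDC, then scale to device (pixel) coordinates."""
    x, y, z, w = clip
    ndc_x, ndc_y = x / w, y / w
    sx = (ndc_x + 1.0) * 0.5 * width
    sy = (1.0 - ndc_y) * 0.5 * height  # flip y so the origin is top-left
    return sx, sy

proj = perspective(90.0, 1.0, 0.1, 100.0)
clip = transform(proj, [0.0, 0.0, -1.0, 1.0])  # a point 1 unit in front of the camera
if inside_clip_volume(clip):
    print(to_screen(clip, 640, 480))
```

The point is that clipping happens on the 4D output of the matrix, before dividing by w, so points behind the camera (w <= 0) get rejected cleanly instead of blowing up the divide. The mapping to pixels is then just a scale and offset at the very end.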