In my experience I've not found any benefit to using the cache hinting instructions like prefetcht0, prefetchnta etc. I've run thousands of tests on them with various offset amounts and never found a significant performance difference that pokes its head out of the general measurement noise.
That said, I generally still stick to using vmovaps/movaps/vmovdqa/movdqa wherever possible instead of the unaligned versions.
I also found that using movhps combinations was considerably faster than movups, e.g.:
; Write HDR pixels (movlps/hps is far better than movups).
vmovq [rdi],xmm4
vmovhps [rdi+8],xmm4
vmovq [rdi+rdx],xmm6
vmovhps [rdi+rdx+8],xmm6
With regard to the non-temporal operations, it can be a bit hit and miss. I've not found the load operations to be of any benefit, but depending on the algorithm and how you use the data, the stores can be considerably faster.
Basically, if you access your pixel buffer / data in a random way with small strides, the non-temporal versions will be worse.
If you need to read the data back again at any point in the near future, they will also be worse. If the strides are massive and cause cache misses every time, non-temporal will be better.
If the data is massive (e.g. you're writing an image buffer of 4096x4096 or so sequentially, one time only), they will be better.
I normally put both versions in and switch back and forth while working to see which wins, as it sometimes changes in favour of one or the other during development depending on what else is happening.
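
For anyone who wants to try the same switch-back-and-forth approach, here is a minimal sketch of how it could look. It is written in C with SSE2 intrinsics rather than assembly, only so it is easy to compile and time. The names USE_NT_STORES and fill_pixels are made up for illustration, and the sketch assumes a 16-byte aligned destination and a float count that is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 header; also pulls in _mm_store_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical build-time switch: compile with -DUSE_NT_STORES=1 to try the
   non-temporal path, leave it at 0 for ordinary cached stores. */
#ifndef USE_NT_STORES
#define USE_NT_STORES 0
#endif

/* Fill `count` floats at `dst` with `value`.
   Assumption: dst is 16-byte aligned and count is a multiple of 4. */
static void fill_pixels(float *dst, size_t count, float value)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert((count & 3u) == 0);

    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < count; i += 4) {
#if USE_NT_STORES
        _mm_stream_ps(dst + i, v);  /* movntps: non-temporal store */
#else
        _mm_store_ps(dst + i, v);   /* movaps: normal aligned store */
#endif
    }
#if USE_NT_STORES
    _mm_sfence();  /* order the streamed stores before any later stores */
#endif
}
```

Build it once with each setting and time both on your real buffer sizes and access pattern. Both paths produce the same result, so any difference is purely in speed.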