John,
That effect has been seen before with Intel hardware. The PIV (Pentium 4) family of processors was slow with SSE code, and well thought out integer code was often nearly as fast. On the Core2 series SSE got a lot faster, and on the i3/5/7 series it got faster again. The exception seems to be the special case circuitry for at least some combinations of REP STOS/MOVS etc. I found with 32 bit code that it was often hard to improve on ordinary REP MOVSB/W/D by using SSE with cached reads and direct write backs, because the special case circuitry did all of these things well.
With 32 bit code there was a minimum threshold and, from memory, a maximum threshold between which REP MOVS was fast. Below the minimum and above the maximum, other combinations were faster at times.
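To make the threshold idea concrete, here is a rough sketch in C of a copy routine that picks a method by size. It is not measured code. The two threshold values are placeholders I made up and would need timing on the target processor. The names copy_dispatch, copy_rep_movsb and copy_sse2_stream are hypothetical. I am also assuming that "direct write backs" means non-temporal stores, and that a plain byte loop is a fair stand-in for "other combinations" at small sizes. It assumes GCC or Clang, and falls back to memcpy where REP MOVSB or SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical thresholds: placeholders only, tune by benchmarking. */
#define REP_MIN_BYTES 64u
#define REP_MAX_BYTES (256u * 1024u)

/* Mid-sized copies: let the REP MOVSB special case circuitry do the work. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    /* EDI/RDI = destination, ESI/RSI = source, ECX/RCX = byte count. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n); /* fallback on other compilers or processors */
#endif
}

/* Large copies: cached (unaligned) reads, non-temporal 16 byte writes. */
static void copy_sse2_stream(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy a short head so the destination is 16 byte aligned. */
    size_t head = (size_t)(-(uintptr_t)d & 15u);
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v); /* bypasses the cache */
        d += 16; s += 16; n -= 16;
    }
    _mm_sfence();    /* make the streaming stores globally visible */
    memcpy(d, s, n); /* remaining tail of fewer than 16 bytes */
#else
    memcpy(dst, src, n); /* fallback when SSE2 is not enabled */
#endif
}

/* Choose a method by size; buffers must not overlap. */
void *copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < REP_MIN_BYTES) {
        /* Below the minimum threshold: simple byte loop. */
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
    } else if (n <= REP_MAX_BYTES) {
        copy_rep_movsb(dst, src, n);
    } else {
        copy_sse2_stream(dst, src, n);
    }
    return dst;
}
```

The only way to find where the two crossover points really sit is to time each path over a range of sizes on the processor in question, since they differ between the PIV, Core2 and i3/5/7 families.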