Pick a processor, pick a result. Years ago I owned a 550 MHz AMD that was really fast on the old string instructions but very ordinary on the lower-level integer instructions. Since then there have been at least two families of PIIIs, three families of PIVs, the Core series Duos and Quads and the current i3/i5/i7s, and that is only counting Intel processors. Averaged across all of them, the old string instructions live in microcode, not among the fast, directly decoded instructions. On top of that you are locked into an antique architecture with fixed source and destination index registers, à la 16-bit 8088 DOS code.
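To make the "locked into the source and destination index" point concrete, here is a minimal sketch assuming GCC or Clang extended asm on x86. The helper names `fill_rep_stosb` and `fill_loop` are hypothetical, made up for illustration. `REP STOSB` has no operand fields, so the compiler must place the pointer, count and value in specific registers, while the plain loop leaves register choice entirely to the compiler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill n bytes at dst with value using REP STOSB.
   STOSB takes no operands: the destination must be in (E/R)DI, the count
   in (E/R)CX and the byte value in AL, the same fixed roles the 16-bit
   8088 version used. Assumes GCC/Clang; non-x86 targets use memset. */
static void fill_rep_stosb(void *dst, unsigned char value, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)   /* DI and CX are consumed */
                     : "a"(value)           /* byte value goes in AL */
                     : "memory");
#else
    memset(dst, value, n);
#endif
}

/* Hypothetical helper: the same fill with ordinary integer stores. The
   compiler picks any registers it likes and may vectorize the loop or
   replace it with a memset call. */
static void fill_loop(unsigned char *dst, unsigned char value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}
```

Both functions produce the same bytes; the difference is only in which instructions and registers end up doing the work.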
The only place the old string instructions still belong is compatibility mode, where special-case circuitry makes the REP-prefixed versions fast once the count passes a specific byte threshold. The price of locking yourself into old DOS junk is buying its ancient architecture, something like trying to tune the last couple of horsepower out of a Model T Ford.
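Since the byte count at which REP-prefixed copies become fast differs from one processor to the next, the only reliable answer is to measure on the machine in question. The following is a rough timing sketch, assuming GCC or Clang on x86 and using `clock()` for CPU time; the names `copy_rep_movsb`, `copy_loop`, `time_copy` and `sweep` are hypothetical. It compares `REP MOVSB` against a plain byte loop over a range of sizes and makes no claim about where the crossover lands on any particular CPU.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: REP MOVSB copy. Destination in (E/R)DI, source in
   (E/R)SI, count in (E/R)CX. Assumes GCC/Clang; non-x86 uses memcpy. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Hypothetical helper: byte loop built from ordinary loads and stores. */
static void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* CPU seconds for `reps` copies of n bytes; use_rep selects the method. */
static double time_copy(int use_rep, unsigned char *dst,
                        const unsigned char *src, size_t n, long reps)
{
    clock_t start = clock();
    for (long r = 0; r < reps; r++) {
        if (use_rep)
            copy_rep_movsb(dst, src, n);
        else
            copy_loop(dst, src, n);
        /* Compiler barrier so repeated identical copies are not merged. */
        __asm__ volatile("" : : : "memory");
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Sweep sizes 16 B, 64 B, ... 64 KiB, copying about 64 MiB per size. */
static void sweep(void)
{
    static unsigned char src[65536], dst[65536];
    memset(src, 0xA5, sizeof src);
    for (size_t n = 16; n <= sizeof src; n *= 4) {
        long reps = (long)((64u * 1024u * 1024u) / n);
        double t_loop = time_copy(0, dst, src, n, reps);
        double t_rep = time_copy(1, dst, src, n, reps);
        printf("%6zu bytes: loop %.3f s, rep movsb %.3f s\n",
               n, t_loop, t_rep);
    }
}
```

Calling `sweep()` from `main` prints one line per size. The numbers depend heavily on compiler flags as well as the CPU: GCC, for example, may replace the byte loop with a `memcpy` call unless built with `-fno-tree-loop-distribute-patterns`, and it may still vectorize the loop.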