The general slowness of the MSVCRT functions can be partially explained by the need to run on older processors. For my test I used the Microsoft strcmp source from the PSDK, compiled with the range of optimizations provided with the VC++ Toolkit 2003 compiler.
Windows 2000 SP4, P6:
1105 cycles, crt_strcmp
882 cycles, strcmp_gb
883 cycles, strcmp_g3
882 cycles, strcmp_g4
882 cycles, strcmp_g5
882 cycles, strcmp_g6
1098 cycles, strcmp_g7
1098 cycles, strcmp_g7_sse2
1106 cycles, crt_strcmp
883 cycles, strcmp_gb
883 cycles, strcmp_g3
883 cycles, strcmp_g4
883 cycles, strcmp_g5
883 cycles, strcmp_g6
1098 cycles, strcmp_g7
1098 cycles, strcmp_g7_sse2
1106 cycles, crt_strcmp
884 cycles, strcmp_gb
883 cycles, strcmp_g3
883 cycles, strcmp_g4
883 cycles, strcmp_g5
883 cycles, strcmp_g6
1097 cycles, strcmp_g7
1097 cycles, strcmp_g7_sse2
Windows XP SP3, P4 Northwood:
633 cycles, crt_strcmp
1318 cycles, strcmp_gb
1316 cycles, strcmp_g3
1316 cycles, strcmp_g4
1316 cycles, strcmp_g5
1319 cycles, strcmp_g6
893 cycles, strcmp_g7
911 cycles, strcmp_g7_sse2
619 cycles, crt_strcmp
1317 cycles, strcmp_gb
1316 cycles, strcmp_g3
1316 cycles, strcmp_g4
1316 cycles, strcmp_g5
1316 cycles, strcmp_g6
904 cycles, strcmp_g7
914 cycles, strcmp_g7_sse2
626 cycles, crt_strcmp
1317 cycles, strcmp_gb
1315 cycles, strcmp_g3
1316 cycles, strcmp_g4
1316 cycles, strcmp_g5
1316 cycles, strcmp_g6
904 cycles, strcmp_g7
915 cycles, strcmp_g7_sse2
Note how much lower the cycle count is for the XP SP3 MSVCRT (roughly 620 to 630 cycles for crt_strcmp, against about 1105 on Windows 2000 SP4), and that this is running on a processor with a lower IPC than the P3.
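For anyone who wants to gather comparable numbers, here is a minimal sketch of a timing harness. It is not the code used for the results above. The names `read_ticks`, `time_strcmp` and `strcmp_bytewise` are made up for illustration, and it assumes a compiler that provides the `__rdtsc` intrinsic (the VC++ Toolkit 2003 does not, so inline assembly would be needed there). On non-x86 targets it falls back to `clock()`, in which case the result is not in cycles.

```c
#include <assert.h>
#include <time.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>       /* __rdtsc on newer MSVC */
#define HAVE_TSC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>    /* __rdtsc on GCC/Clang */
#define HAVE_TSC 1
#else
#define HAVE_TSC 0
#endif

typedef int (*strcmp_fn)(const char *, const char *);

/* Read a tick counter: the TSC on x86, clock() elsewhere. */
static unsigned long long read_ticks(void)
{
#if HAVE_TSC
    return __rdtsc();
#else
    return (unsigned long long)clock();
#endif
}

/* Plain byte-at-a-time comparison, a stand-in for one compiled variant. */
static int strcmp_bytewise(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Time `calls` back-to-back calls, repeat that `trials` times, and return
 * the best batch divided by `calls` (approximate ticks per call). */
static unsigned long long time_strcmp(strcmp_fn fn, const char *a,
                                      const char *b, int trials, int calls)
{
    unsigned long long best = ~0ULL;
    unsigned long long start, elapsed;
    volatile int sink = 0;   /* keeps the calls from being optimized away */
    int t, i;

    assert(trials > 0 && calls > 0);
    for (t = 0; t < trials; ++t) {
        start = read_ticks();
        for (i = 0; i < calls; ++i)
            sink = fn(a, b);
        elapsed = read_ticks() - start;
        if (elapsed < best)
            best = elapsed;
    }
    (void)sink;
    return best / (unsigned long long)calls;
}
```

Calling `time_strcmp(strcmp_bytewise, s1, s2, 10, 1000)` and then the same with the library `strcmp` gives two per-call figures to compare. Taking the best of several batches is one way to reduce noise from interrupts and cache warm-up. The absolute values depend on string length and alignment, so they will not match the listings above.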
The relevant parts of the code-generation options:
/G3 optimize for 80386
/G4 optimize for 80486
/G5 optimize for Pentium
/G6 optimize for PPro, P-II, P-III
/G7 optimize for Pentium 4 or Athlon
/GB optimize for blended model (default)
/arch:<SSE|SSE2> minimum CPU architecture requirements, one of:
    SSE - enable use of instructions available with SSE enabled CPUs
    SSE2 - enable use of instructions available with SSE2 enabled CPUs
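The SSE2 option had no meaningful effect: the strcmp_g7_sse2 timings match strcmp_g7 on the P6 and are within about 2% of it on the P4.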
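One way to get a separately named function per option from a single source file is to let a macro supply the function name. The sketch below is an assumption about how such variants could be built, not a record of how the ones above were. The `STRCMP_NAME` macro, the file name `strcmp_variant.c` and the command lines in the comment are hypothetical, and the body is a plain byte loop standing in for the PSDK source.

```c
/* Hypothetical builds, one object file per option (add whatever
 * optimization switches are wanted):
 *   cl /c /G6 /DSTRCMP_NAME=strcmp_g6 /Fostrcmp_g6.obj strcmp_variant.c
 *   cl /c /G7 /arch:SSE2 /DSTRCMP_NAME=strcmp_g7_sse2 /Fostrcmp_g7_sse2.obj strcmp_variant.c
 */
#ifndef STRCMP_NAME
#define STRCMP_NAME strcmp_gb   /* default name when none is given */
#endif

/* Compare two strings byte by byte; returns -1, 0 or 1. */
int STRCMP_NAME(const char *src, const char *dst)
{
    int ret = 0;

    while ((ret = *(const unsigned char *)src
                - *(const unsigned char *)dst) == 0 && *dst != '\0') {
        ++src;
        ++dst;
    }

    if (ret < 0)
        ret = -1;
    else if (ret > 0)
        ret = 1;
    return ret;
}
```

Each object file then exports a different symbol, so all variants can be linked into one test program and timed side by side.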