There are lots of profilers listed here and I've tried a few of them myself - however I ended up writing my own based on this:
http://code.google.com/p/high-performance-cplusplus-profiler/
It does of course require that you modify the code base, but it's perfect for narrowing down bottlenecks, should work on all x86s (could be a problem with multi-core boxes, i.e. it uses rdtsc, however - this is purely for indicative timing anyway - so I find it's sufficient for my needs..)