I'm trying to get started with Google Perf Tools to profile some CPU-intensive applications. It's a statistical calculation that dumps each step to a file using `ofstream`. I'm not a C++ expert, so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
 357  41.7%  41.7%   357  41.7% _write$UNIX2003
 134  15.6%  57.3%   134  15.6% _exp$fenv_access_off
 109  12.7%  70.0%   276  32.2% scythe::dnorm
 103  12.0%  82.0%   103  12.0% _log$fenv_access_off
  58   6.8%  88.8%    58   6.8% scythe::const_matrix_forward_iterator::operator*
  37   4.3%  93.1%    37   4.3% scythe::matrix_forward_iterator::operator*
  15   1.8%  94.9%    47   5.5% std::transform
  13   1.5%  96.4%   486  56.7% SliceStep::DoStep
  10   1.2%  97.5%    10   1.2% 0x0002726c
   5   0.6%  98.1%     5   0.6% 0x000271c7
   5   0.6%  98.7%     5   0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The `_write$UNIX2003` (where can I find out what this is?) appears to come from writing the output file. Now, what confuses me: if I comment out all the `outfile << "text"` statements and run pprof, 95% of the samples are in SliceStep::DoStep and `_write$UNIX2003` goes away. However, my application does not speed up, as measured by total time; the whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the `outfile <<` statements is:
Total: 790 samples
 205  25.9%  25.9%   205  25.9% _exp$fenv_access_off
 170  21.5%  47.5%   170  21.5% _log$fenv_access_off
 162  20.5%  68.0%   437  55.3% scythe::dnorm
  83  10.5%  78.5%    83  10.5% scythe::const_matrix_forward_iterator::operator*
  70   8.9%  87.3%    70   8.9% scythe::matrix_forward_iterator::operator*
  28   3.5%  90.9%    78   9.9% std::transform
  26   3.3%  94.2%    26   3.3% 0x00027262
  12   1.5%  95.7%    12   1.5% _write$NOCANCEL$UNIX2003
  11   1.4%  97.1%   764  96.7% SliceStep::DoStep
   9   1.1%  98.2%     9   1.1% 0x00027253
   6   0.8%  99.0%     6   0.8% 0x000274a6
This looks like what I'd expect, except I see no visible increase in performance (0.1 second on a 10-second calculation). The code is essentially:
std::ofstream outfile("out.txt");
for (int i = 0; i < n_steps; ++i)           // main loop
    outfile << SliceStep::DoStep() << "\n"; // all the real work is in DoStep
outfile.close();
Update: I'm timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
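A minimal sketch of that setup (the profile filename is arbitrary, and older gperftools installs use <google/profiler.h> instead):

#include <cstdio>
#include <boost/timer.hpp>
#include <gperftools/profiler.h>  // ProfilerStart / ProfilerStop

int main() {
    ProfilerStart("slice.prof");  // profiler starts here...
    boost::timer t;               // ...and the timer starts with it
    // ... the DoStep/outfile loop shown above goes here ...
    std::printf("elapsed: %f s\n", t.elapsed());
    ProfilerStop();               // ...and both end together
    return 0;
}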
From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another easy tool: the time command (e.g. time ./yourprog). It should give you a rough idea of where your time is spent.
If the results are still not conclusive, you need a better test case:
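For example, a stripped-down program that performs only the output, so the I/O cost can be timed in isolation (the line count and the dummy value here are made up, not taken from the poster's code):

#include <fstream>

int main() {
    std::ofstream outfile("out.txt");
    for (int i = 0; i < 100000; ++i)  // about as many lines as the real run
        outfile << 3.14159 << "\n";   // dummy "result"
    return 0;
}

Time this with the time command and compare it against the full program.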
Tiristan: It's all in user time. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
Compare printing 100000 lines with Python (print.py):

for i in xrange(100000):
    print i
To the console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python print.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is: your internal measurements and `time` show you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?
_write$UNIX2003 is probably referring to the write POSIX system call, which performs the actual output (to your file, in this case; on Mac OS X, the $UNIX2003 suffix marks the UNIX03-conformant variant of a libc function). I/O is very slow compared to almost anything else, so it makes sense that your program is spending a lot of time there if you are writing a fair bit of output.
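Note that how often write actually gets called depends on buffering: an ofstream only issues a write when its buffer fills or is explicitly flushed. A minimal illustration (not the poster's code; the loop bound is made up):

#include <fstream>
#include <ostream>

int main() {
    std::ofstream out("out.txt");
    for (int i = 0; i < 100000; ++i) {
        out << i << '\n';         // stays in the stream buffer; few write calls
        // out << i << std::endl; // would flush every line: one write per line
    }
    return 0;                     // destructor flushes the rest and closes the file
}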
I'm not sure why your program wouldn't speed up when you remove the output, but I can't really make a guess on only the information you've given. It would be nice to see some of the code, or even the perftools output when the output statements are removed.
Google perftools collects samples of the call stack, so what you need is to get some visibility into those.
According to the docs, you can display the call graph at statement or address granularity (pprof's --lines and --addresses options, versus the default --functions). That should tell you what you need to know.