Profiling a GraalVM native image with perf

The GraalVM native-image tool allows you to generate a native executable (or native image) from your Java application.

This native executable starts very quickly and has a much smaller memory footprint than a traditional Java application, at the cost of reduced peak performance and a relatively long build time.

A native executable embeds a minimalist JVM called SubstrateVM, which has some limitations:

  • Partial support for reflection
  • Partial support for dynamic proxy
  • Partial support for dynamic class loading
  • Partial support for JNI (requires configuration)
  • No JVMTI

No support for JVMTI means no support for Java agents, JMX, Java profilers, Java debuggers, Java Flight Recorder and Java Mission Control, nor for the tools delivered with the JDK (jps, jstack, jmap).

For all the needs covered by these tools, you must therefore use either a solution integrated into the application (for example, replacing JMX metrics with Prometheus metrics) or the standard tools provided by your operating system.

To profile the execution of an application, the Linux OS has a very powerful tool: perf.

The perf tool has many features: it can access all OS and CPU metrics (performance counters, hence its name) and profile an application in many different ways.

Using perf to profile the CPU

The perf tool uses the symbols embedded in your application's binary to link a memory address to the corresponding Java method (or system call).

By default, these symbols are not kept in native executables, so you must ask the native-image tool to leave them there via the options -H:-DeleteLocalSymbols -H:+PreserveFramePointer.

If you want to test these steps, you can use the Quarkus getting-started application. Quarkus makes native-image builds easy: just add the property quarkus.native.additional-build-args=-H:-DeleteLocalSymbols,-H:+PreserveFramePointer to your application's application.properties, and Quarkus will automatically add these options to the command line of the native-image tool.
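As a minimal sketch of that configuration step, the snippet below appends the property to application.properties if it is not already there. The PROJECT_DIR variable is hypothetical (it falls back to a temporary directory so the snippet can be tried safely), and the src/main/resources location assumes the standard Maven layout.

```shell
# PROJECT_DIR is a hypothetical variable pointing at your Quarkus project;
# it defaults to a throwaway directory so the sketch can be tried safely.
project_dir="${PROJECT_DIR:-$(mktemp -d)}"
props="$project_dir/src/main/resources/application.properties"
mkdir -p "$(dirname "$props")"

# Options are comma-separated in the Quarkus property.
line='quarkus.native.additional-build-args=-H:-DeleteLocalSymbols,-H:+PreserveFramePointer'

# Append the property only once, even if the snippet is run several times.
grep -qxF -- "$line" "$props" 2>/dev/null || echo "$line" >> "$props"

cat "$props"
```

Running it twice against the same project leaves a single copy of the property in the file.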

After generating your native executable, launch it and retrieve its PID; we will use it on the perf command line.
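One way to get the PID, sketched below, is to start the executable in the background and read the shell's $! variable. Here sleep 30 is only a stand-in for the native executable, so that the snippet runs anywhere.

```shell
# Stand-in for: ./getting-started-1.0-SNAPSHOT-runner &
sleep 30 &
# $! holds the PID of the most recent background job.
app_pid=$!
echo "application PID: $app_pid"

# kill -0 sends no signal; it only checks that the process exists.
if kill -0 "$app_pid" 2>/dev/null; then
  alive=yes
else
  alive=no
fi
echo "alive: $alive"

# Clean up the stand-in process.
kill "$app_pid"
```

If the application is already running, pgrep -f with part of the executable name is a common alternative.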

Once your application is launched, and ideally under load (you can use a tool such as wrk to generate load), you can profile it via the following perf command: perf record -F 99 -p PID --call-graph dwarf sleep 10.

  • record : asks perf to start profiling the application.
  • -F 99 : profiles at 99 Hertz, which means 99 samples per second.
  • -p PID : asks perf to profile this particular PID (the one of your application).
  • --call-graph dwarf : tells perf to record call stacks, unwinding them with the DWARF debug information of your application.
  • sleep 10 : as perf profiles a PID and not a command, it must be given a command to execute. When this command is complete, perf will stop profiling your application. By using sleep 10 as a command, we will therefore profile the application for 10 seconds.
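To make the options above concrete, here is a small sketch that assembles the command for a given PID and estimates how many samples to expect. The helper names perf_cpu_cmd and expected_samples are hypothetical, and the estimate is only a rough order of magnitude for a single thread that stays on a CPU for the whole duration.

```shell
# Hypothetical helper: prints (does not run) the perf command for a PID.
perf_cpu_cmd() {
  pid="$1"
  seconds="${2:-10}"   # profiling duration, via the sleep command
  freq="${3:-99}"      # sampling frequency in Hertz
  echo "perf record -F $freq -p $pid --call-graph dwarf sleep $seconds"
}

# Rough number of samples for one thread that stays on CPU the whole time.
expected_samples() {
  echo $(( $1 * $2 ))
}

perf_cpu_cmd 12345 10       # 12345 is a placeholder PID
expected_samples 99 10      # prints 990
```

An application that is mostly idle will produce far fewer samples, which is why profiling under load gives a more useful picture.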

When the command is finished, perf will have generated a data file containing the profile of your application (CPU profile here, because it has not been told which event it should profile): perf.data.

You can view this profile in the console with the following command: perf report --stdio. The result will look something like this:

# Children      Self  Command          Shared Object                        Symbol                                                                                                                        >
# ........  ........  ...............  ...................................  ..............................................................................................................................>
#
    13.47%     0.00%  tloop-thread-19  libpthread-2.31.so                   [.] start_thread
            |
            ---start_thread
               IsolateEnterStub_PosixJavaThreads_pthreadStartRoutine_e1f4a8c0039f8337338252cd8734f63a79b5e3df_06195ea7c1ac11d884862c6f069b026336aa4f8c
               JavaThreads_threadStartRoutine_241bd8ce6d5858d439c83fac40308278d1b55d23
               Thread_run_857ee078f8137062fcf27275732adf5c4870652a
               FastThreadLocalRunnable_run_0329ad2c5210a091812879bcecd155c58e561e60
               ThreadExecutorMap$2_run_66c8943ee6536a10df07f979fb6cd278adcf96bc
               SingleThreadEventExecutor$4_run_1b47df7867e302a2fb7f28d7657a73e92f89d91f
               |          
               |--12.64%--NioEventLoop_run_be89580b4d16514bef6e948913d2ed21c5e4f679
               |          |          
               |          |--5.14%--NioEventLoop_processSelectedKeys_9a76c58d657b781ee037bbb65f41f01d2eb54e7c
               |          |          NioEventLoop_processSelectedKeysOptimized_c36ca161e53573665bc03cb5392e91c123bcd359
               |          |          NioEventLoop_processSelectedKey_3a0d92ce472db6c251df4485227a85acb9d3a1ca
               |          |          AbstractNioByteChannel$NioByteUnsafe_read_45358e803c643a6380776021e488e79d981b159d

And this over thousands of lines … not easy to analyze eh?
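Part of the noise comes from the hash suffixes appended to each symbol. As a sketch, assuming every suffix is an underscore followed by 40 hexadecimal digits (as in the output above), a sed filter can strip them; strip_hashes is a hypothetical helper name.

```shell
# Assumption: each suffix is an underscore followed by 40 lowercase hex digits.
strip_hashes() {
  sed -E 's/_[0-9a-f]{40}//g'
}

# Two symbols taken from the report above.
printf '%s\n' \
  'NioEventLoop_run_be89580b4d16514bef6e948913d2ed21c5e4f679' \
  'ThreadExecutorMap$2_run_66c8943ee6536a10df07f979fb6cd278adcf96bc' \
  | strip_hashes
```

It can be placed at the end of a pipe, for example perf report --stdio | strip_hashes. Note that stripping the suffix may make overloaded methods indistinguishable.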

To easily analyze a profile generated by perf, you can use the FlameGraph tool, accessible here: https://github.com/brendangregg/FlameGraph

A FlameGraph is a way of visualizing the profile of an application that lets you instantly spot the most frequent code paths. The x-axis shows the sample population: each frame (generally a method) has a width proportional to the number of samples in which it appears. The y-axis shows the stack depth. More information on FlameGraphs is available in the article linked at the end of this post.

There is a small problem in the profile data: instead of containing the given command, the Command column contains the name of the thread (truncated, moreover). This is a bug in the native-image tool; to work around it, we will use sed to modify the profile data before passing it to the FlameGraph tool. The value of the Command column ends up at the base of the FlameGraph, so it must be the same for all threads of a given kind for their stacks to be aggregated together.

The first step is to use perf script to extract the profile data into a textual format, then use sed to correct the command name so that you can then generate a FlameGraph.

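# Dump the recorded samples as text
perf script > out.perf
# Normalize the truncated thread names so that stacks aggregate under one base frame
sed -i -E "s/cutor-thread-[0-9]*/executor-thread/" out.perf
sed -i -E "s/ntloop-thread-[0-9]*/eventloop-thread/" out.perf
sed -i -E "s/tloop-thread-[0-9]*/eventloop-thread/" out.perf
# Fold the stacks and render the SVG
~/FlameGraph/stackcollapse-perf.pl out.perf | ~/FlameGraph/flamegraph.pl > perf.svg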
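To see why the thread names need to be normalized, here is a small self-contained sketch. It applies the same three substitutions to a few made-up stacks and then sums the counts of identical stacks. For brevity the sample input is already in the folded "base;frame;... count" form that stackcollapse-perf.pl emits, whereas the commands above apply sed to the raw perf script output. The stacks, counts and the normalize helper name are invented for illustration.

```shell
# Same three substitutions as in the commands above, reading from stdin.
normalize() {
  sed -E \
    -e 's/cutor-thread-[0-9]*/executor-thread/' \
    -e 's/ntloop-thread-[0-9]*/eventloop-thread/' \
    -e 's/tloop-thread-[0-9]*/eventloop-thread/'
}

# Made-up folded stacks ("base;frame;... count"), for illustration only.
folded='tloop-thread-19;start_thread;NioEventLoop_run 5
tloop-thread-3;start_thread;NioEventLoop_run 7
cutor-thread-1;start_thread;Thread_run 2'

# Sum the counts of identical stacks, as happens when frames are merged.
merged=$(printf '%s\n' "$folded" | normalize \
  | awk '{ n[$1] += $2 } END { for (s in n) print s, n[s] }' | LC_ALL=C sort)
echo "$merged"
```

Without the normalization, the two event loop threads would each get their own base frame; with it, their samples are merged into a single eventloop-thread tower.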

Here is an example of a generated FlameGraph:

What I like most about FlameGraphs is that you can zoom in on them by clicking on a frame:

Using perf to profile memory

To profile memory, we’ll use the same technique with a slightly modified command.

There are several ways to profile memory with perf: you can ask perf to record memory-related OS events, profile one of the system functions that allocate memory, or use perf mem. We will use the last option.

To do so, you must start your application through the perf tool: perf mem record --call-graph dwarf -F 99 ./getting-started-1.0-SNAPSHOT-runner.

When the application stops, perf will save the profile data on disk which can then be used in the same way as the CPU profile data (via perf report, perf script and the FlameGraph tool).
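As a sketch, the steps used for the CPU profile can be wrapped in a small function and reused for the memory profile. The function name make_flamegraph and the FLAMEGRAPH_DIR variable are hypothetical; the function assumes the FlameGraph scripts are checked out in ~/FlameGraph, as earlier in this post.

```shell
# Hypothetical helper: turns a perf data file into a FlameGraph SVG.
# FLAMEGRAPH_DIR is an assumed variable; it defaults to ~/FlameGraph.
make_flamegraph() {
  data="${1:-perf.data}"
  out="${2:-perf.svg}"
  fg_dir="${FLAMEGRAPH_DIR:-$HOME/FlameGraph}"

  if [ ! -f "$data" ]; then
    echo "profile data not found: $data" >&2
    return 1
  fi

  # Same pipeline as for the CPU profile: dump, fix thread names, fold, render.
  perf script -i "$data" \
    | sed -E \
        -e 's/cutor-thread-[0-9]*/executor-thread/' \
        -e 's/ntloop-thread-[0-9]*/eventloop-thread/' \
        -e 's/tloop-thread-[0-9]*/eventloop-thread/' \
    | "$fg_dir/stackcollapse-perf.pl" \
    | "$fg_dir/flamegraph.pl" > "$out"
}

# Usage, once perf mem record has written perf.data:
#   make_flamegraph perf.data memory.svg
```

The function returns a non-zero status without touching the output file when the profile data is missing.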

Going further

A talk I gave on the topic (it starts at minute 44): https://www.youtube.com/watch?v=TXnJ9eyoEhw.

Tips for using perf with lots of recipes: http://www.brendangregg.com/perf.html.

An article describing in detail what a FlameGraph is: https://queue.acm.org/detail.cfm?id=2927301.
