Benchmark: conversion from long to byte[]

I’ve been using Kafka a lot lately, and in Kafka a lot of things are byte arrays, even headers!

As I have many components that exchange messages, I added headers to help with message tracking, including a timestamp header which has the value System.currentTimeMillis().

So I had to transform a long into a byte array; in a very naive way, I coded this: String.valueOf(System.currentTimeMillis()).getBytes(). But instantiating a String each time a header is created does not seem very optimal to me!

Looking a bit further, Guava offers a solution based on bitwise operations in its Longs class, and Kafka has one in its LongSerializer. You can also use a ByteBuffer to perform the conversion. The three alternatives are sketched below.
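For reference, here is roughly what these three alternatives look like side by side (the class and variable names are mine, just for illustration):

import java.nio.ByteBuffer;

import com.google.common.primitives.Longs;
import org.apache.kafka.common.serialization.LongSerializer;

public class LongConversions {

    public static void main(String[] args) {
        long timestamp = System.currentTimeMillis();

        // Guava: bitwise shifts behind a utility method, big-endian result
        byte[] viaGuava = Longs.toByteArray(timestamp);

        // Kafka: the serializer normally used for record keys/values
        // (the topic is not needed for the conversion itself)
        byte[] viaKafka = new LongSerializer().serialize(null, timestamp);

        // ByteBuffer: write the long into an 8-byte heap buffer and expose its backing array
        byte[] viaBuffer = ByteBuffer.allocate(Long.BYTES).putLong(timestamp).array();

        System.out.println(viaGuava.length + " / " + viaKafka.length + " / " + viaBuffer.length); // 8 / 8 / 8
    }
}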

To compare these approaches, nothing beats JMH – the Java Microbenchmark Harness. This tool lets you write meaningful micro-benchmarks that take the internal characteristics of the JVM into account. It also offers integrated tools to analyze the performance of the tests (profiling, disassembling, …). If you don’t know JMH, you can refer to this article: INTRODUCTION TO JMH – JAVA MICROBENCHMARK HARNESS.

The benchmark

First of all, I configured the benchmark with a thread-scoped @State so that the setup is run for each thread. Among other things, I created one ByteBuffer per thread in order to compare implementations with and without reuse of the buffer.

@State(Scope.Thread)
// other JMH annotations ...
public class LongToByteArray {
    private static final LongSerializer LONG_SERIALIZER = new LongSerializer();

    long timestamp;
    ByteBuffer perThreadBuffer;

    @Setup
    public void setup() {
        timestamp = System.currentTimeMillis();
        perThreadBuffer = ByteBuffer.allocate(Long.BYTES);
    }
    
    // benchmark methods
}
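The "other JMH annotations" are not shown in the snippet above; a plausible configuration, consistent with the published results (average-time mode, nanosecond output, 5 measurement iterations), could look like the following. The warmup, iteration time and fork values are my own assumptions, not necessarily the ones used for this article:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)       // matches the "avgt" mode in the results
@OutputTimeUnit(TimeUnit.NANOSECONDS)  // matches the "ns/op" unit in the results
@Warmup(iterations = 5, time = 1)      // assumption
@Measurement(iterations = 5, time = 1) // Cnt = 5 in the results; the iteration time is an assumption
@Fork(1)                               // assumption
public class LongToByteArray {
    // ... fields, @Setup and @Benchmark methods as shown above ...
}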

Then I implemented a benchmark method for each way of converting a long into a byte[]. For ByteBuffer I implemented two variants: one that allocates a new buffer on each conversion, and another that recycles the ByteBuffer created during the benchmark setup phase.

    @Benchmark
    public byte[] testStringValueOf() {
        return String.valueOf(timestamp).getBytes();
    }

    @Benchmark
    public byte[] testGuava() {
        return Longs.toByteArray(timestamp);
    }

    @Benchmark
    public byte[] testKafkaSerde() {
        return LONG_SERIALIZER.serialize(null, timestamp);
    }

    @Benchmark
    public byte[] testByteBuffer() {
        ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES);
        buffer.putLong(timestamp);
        return buffer.array();
    }

    @Benchmark
    public byte[] testByteBuffer_reuse() {
        perThreadBuffer.putLong(timestamp);
        // array() exposes the backing array without copying, hence no allocation here
        byte[] result = perThreadBuffer.array();
        perThreadBuffer.clear(); // only resets position and limit, the bytes are not wiped
        return result;
    }

The full benchmark is accessible here.

The results

All the tests were run on my laptop: Intel(R) Core(TM) i7-8750H 6 cores (12 with hyperthreading) – Ubuntu 19.10.

The Java version used was: openjdk version "11.0.7" 2020-04-14.

Benchmark                             Mode  Cnt   Score   Error  Units
LongToByteArray.testByteBuffer        avgt    5   4,429 ± 0,204  ns/op
LongToByteArray.testByteBuffer_reuse  avgt    5   5,655 ± 0,793  ns/op
LongToByteArray.testGuava             avgt    5   6,422 ± 0,428  ns/op
LongToByteArray.testKafkaSerde        avgt    5   9,103 ± 1,515  ns/op
LongToByteArray.testStringValueOf     avgt    5  39,660 ± 4,372  ns/op

First observation: my intuition was right, instantiating a String for each conversion is very bad, 4 to 10 times slower than all the other implementations. When we look at the result of the conversion, we understand why. With a String we no longer encode a 64-bit number but a character string in which each character (each digit of the number) takes one byte. So we are not comparing exactly the same thing: the conversion via a String yields an array of 13 bytes, whereas a long fits in 8 bytes, which is what the conversions via Guava, Kafka or a ByteBuffer produce.
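A quick check makes the size difference visible (a throwaway snippet of mine, relying on the fact that current epoch milliseconds have 13 decimal digits):

import com.google.common.primitives.Longs;

public class ConversionSizes {
    public static void main(String[] args) {
        long timestamp = System.currentTimeMillis(); // currently a 13-digit number

        byte[] asText   = String.valueOf(timestamp).getBytes(); // one byte per digit -> 13 bytes
        byte[] asBinary = Longs.toByteArray(timestamp);         // fixed-size encoding -> 8 bytes

        System.out.println(asText.length + " vs " + asBinary.length); // prints "13 vs 8"
    }
}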

Surprisingly, Kafka, which is known for its performance, has a slower implementation than Guava’s or the one based on a ByteBuffer.

The results obtained via ByteBuffer are surprising: instantiating a ByteBuffer for each conversion is more efficient than reusing an existing one (which requires clearing the buffer).

A little more detailed analysis

Let’s put aside the implementation via a String and try to better understand the differences between the other implementations.

For this I will use the profiling capabilities of JMH via the -prof option.
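For the record, the same profilers can also be enabled programmatically through JMH’s Runner API, which is roughly equivalent to passing -prof gc (or -prof perf) on the command line. The ProfiledRun class below is just my illustration:

import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ProfiledRun {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(LongToByteArray.class.getSimpleName())
                .addProfiler(GCProfiler.class) // same as "-prof gc"
                // use org.openjdk.jmh.profile.LinuxPerfProfiler for "-prof perf" (Linux only)
                .build();
        new Runner(opt).run();
    }
}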

If we profile the memory allocations via -prof gc we have the following results:

LongToByteArray.testByteBuffer                      avgt    5     4,492 ±   0,708   ns/op
LongToByteArray.testByteBuffer:·gc.alloc.rate       avgt    5  4635,903 ± 712,889  MB/sec
LongToByteArray.testByteBuffer_reuse                avgt    5     5,798 ±   1,139   ns/op
LongToByteArray.testByteBuffer_reuse:·gc.alloc.rate avgt    5    ≈ 10⁻⁴            MB/sec
LongToByteArray.testGuava                           avgt    5     6,939 ±   0,899   ns/op
LongToByteArray.testGuava:·gc.alloc.rate            avgt    5  3000,818 ± 376,613  MB/sec
LongToByteArray.testKafkaSerde                      avgt    5     9,317 ±   0,842   ns/op
LongToByteArray.testKafkaSerde:·gc.alloc.rate       avgt    5  4467,791 ± 405,897  MB/sec

We can clearly see the advantage of reusing the ByteBuffer: there is no memory allocation at all, whereas creating a new buffer for each conversion generates around 4.6 GB/s of allocations!
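As a rough sanity check: at about 4.5 ns per operation, an allocation rate of roughly 4.6 GB/s works out to around 20 bytes per call, which is on the order of one small byte[] (8 bytes of data plus the object header) allocated per conversion.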

On the other hand, the allocation rates of the three allocating implementations are of the same order of magnitude, so this does not tell us much more.

Now let’s try to profile the CPU with -prof perf which will use the perf tool to profile the application.

The results are not easy to interpret (you can see them here); a few observations:

  • Reusing a ByteBuffer seems to involve many more CPU branches; this may be the cause of the performance difference.
  • The Kafka implementation seems to involve more CPU branches than Guava’s despite executing fewer instructions. Because of these branches, fewer instructions can be executed per CPU cycle, which is probably why the Guava implementation is faster.

Finally, out of curiosity, I looked at the code of HeapByteBuffer.putLong(), the implementation used via ByteBuffer since I don’t do any direct allocation. It relies on the Unsafe.putLongUnaligned() method. Unsafe is known for its high-performance implementations (but should not be used by everyone); here the method is annotated with @HotSpotIntrinsicCandidate, which means that an intrinsic may exist for it, and that could explain the performance difference with the other implementations. An intrinsic can be seen as a piece of native code, optimized for your OS / CPU architecture, that the JVM substitutes for the Java implementation of the method under certain conditions.

Conclusion

Be careful what you measure: the implementation via a String does not produce the same byte array as the others, and is therefore much less efficient.

Reusing a ByteBuffer is not always the best solution, as the cost of recycling can be significant. Allocations are not very expensive in the JVM, and sometimes it is better to allocate than to execute more instructions.

Follow the force, read the code;)

Although JMH is a great tool, it takes technical skill and a lot of time to fully analyze its results. Even if the observed differences are not fully explained, I’m still happy with my little experiment ;)
