Jumbled Protocol Buffer Message Layout

TL;DR

When protoc (the protocol buffer compiler) compiles a protobuf message into a C++ class, it reorders the fields according to OptimizeLayoutHelper::Family. In some circumstances, this reordering unintentionally ruins a carefully designed cache-line layout and hurts your program's performance.

Optimizing in the wrong way.

For example, an online advertising system had a Stats message with a timestamp and some statistics data fields:

message Stats {
    int64 ts                   = 1;
    int64 show                 = 2;
    int64 click                = 3;
    int64 cost                 = 4;
    int64 hour_show            = 5;
    int64 hour_click           = 6;
    int64 hour_cost            = 7;
    int64 acc_show             = 8;
    int64 acc_click            = 9;
    int64 acc_cost             = 10;
    repeated int64 bucket      = 11;
    repeated int64 hour_bucket = 12;
    repeated int64 acc_bucket  = 13;
}

One day, the message had to be modified to delete its hourly statistics, the hour_* fields. I expected these deletions to save both memory and CPU. Instead, the server developed a CPU hotspot when accessing the stats message.

// After removing the `hour_*` fields
message StatsOpt {
    int64 ts                   = 1;
    int64 show                 = 2;
    int64 click                = 3;
    int64 cost                 = 4;
    int64 acc_show             = 5;
    int64 acc_click            = 6;
    int64 acc_cost             = 7;
    repeated int64 bucket      = 8;
    repeated int64 acc_bucket  = 9;
}

After hours of analysis, I found that the culprit behind this regression was the OptimizeLayoutHelper::Family reordering. Following that layout ordering, we can draw the 64-byte cache-line layout of the objects generated from these two messages.
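To make the reordering concrete, here is a minimal sketch that models it under simplifying assumptions: only two families (repeated fields first, then scalars), declaration order kept inside a family, 24 bytes per repeated field and 8 bytes per int64, and an object that starts on a cache-line boundary. The names `Family`, `Field`, `Placed`, `LayOut`, `CacheLineOf`, `StatsFields`, and `StatsOptFields` are hypothetical; this is not protoc's actual code, which has more families and additional tie-breaking rules.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified families: repeated fields are laid out first.
enum class Family { kRepeated = 0, kScalar = 1 };

struct Field {
    std::string name;
    Family family;
    std::size_t size;  // bytes the field occupies in the generated object
};

struct Placed {
    std::string name;
    std::size_t offset;  // byte offset from the start of the modeled object
};

constexpr std::size_t kCacheLine = 64;

// Group fields by family, keep declaration order inside a family,
// then assign consecutive offsets starting at 0.
std::vector<Placed> LayOut(std::vector<Field> fields) {
    std::stable_sort(fields.begin(), fields.end(),
                     [](const Field& a, const Field& b) {
                         return a.family < b.family;
                     });
    std::vector<Placed> placed;
    std::size_t offset = 0;
    for (const Field& f : fields) {
        assert(f.size % 8 == 0);  // the model only handles 8-byte multiples
        placed.push_back({f.name, offset});
        offset += f.size;
    }
    return placed;
}

// Returns the cache-line index of `name`, or -1 if the field is absent.
long CacheLineOf(const std::vector<Placed>& placed, const std::string& name) {
    for (const Placed& p : placed) {
        if (p.name == name) return static_cast<long>(p.offset / kCacheLine);
    }
    return -1;
}

// Builds a field list in declaration order: scalars first, then repeated.
std::vector<Field> MakeFields(const std::vector<std::string>& scalars,
                              const std::vector<std::string>& repeated) {
    std::vector<Field> fields;
    for (const auto& n : scalars) fields.push_back({n, Family::kScalar, 8});
    for (const auto& n : repeated) fields.push_back({n, Family::kRepeated, 24});
    return fields;
}

std::vector<Field> StatsFields() {
    return MakeFields({"ts", "show", "click", "cost", "hour_show", "hour_click",
                       "hour_cost", "acc_show", "acc_click", "acc_cost"},
                      {"bucket", "hour_bucket", "acc_bucket"});
}

std::vector<Field> StatsOptFields() {
    return MakeFields({"ts", "show", "click", "cost",
                       "acc_show", "acc_click", "acc_cost"},
                      {"bucket", "acc_bucket"});
}
```

In this model, `CacheLineOf(LayOut(StatsFields()), "ts")` and the same call for `"cost"` both return 1, while for `StatsOptFields()` they return 0 and 1 respectively.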

The object layout of message Stats:

+------ 16 BYTE ------+- 8 BYTE -+------ 16 BYTE ------+- 8 BYTE -+------ 16 BYTE ------+
|    (11)bucket       | (11)size |   (12)hour_bucket   | (12)size |  (13)acc_bucket     |
+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+
| (13)size |   (1)ts  |  (2)show | (3)click |  (4)cost |    (5)   |    (6)   | (7)      |
+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+
|    (8)   |    (9)   |   (10)   |     *    |     *    |     *    |     *    |     *    |
+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+

And, the object layout of message StatsOpt:

+------ 16 BYTE ------+- 8 BYTE -+------ 16 BYTE ------+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+
|    (8)bucket        | (8)size  |      (9)acc_bucket  | (9)size  |  (1)ts   | (2)show  |
+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+
| (3)click |  (4)cost |    (5)   |    (6)   |    (7)   |     *    |     *    |     *    |
+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+- 8 BYTE -+

protoc uses two class members, 24 bytes in total, to represent a repeated field: a 16-byte member for the RepeatedField<T> f_ part and an 8-byte member for the atomic<int> _size part.

The ad server always accesses the stats ts and cost fields one after the other. In the object generated from Stats, ts and cost are packed into the same cache line; in the object generated from StatsOpt, they land in different cache lines. This is the culprit that breaks spatial locality.
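The same conclusion can be checked with plain structs and `offsetof`. The sketch below mirrors the two diagrams with hand-written structs; `FakeRepeatedInt64`, `StatsLayout`, `StatsOptLayout`, `CacheLineIndex`, and `CheckPostLayouts` are hypothetical names, and the stand-in for the repeated field is only an assumption about its size, not the real protobuf class. It assumes a typical 64-bit ABI (4-byte int, 8-byte pointer) and an object that starts on a 64-byte boundary, and it ignores any other members a generated class carries.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte stand-in for a repeated int64 field; NOT the real class.
struct FakeRepeatedInt64 {
    int current_size;
    int total_size;
    void* elements;
};

// Mirrors the Stats diagram: repeated fields (plus cached sizes) first.
struct StatsLayout {
    FakeRepeatedInt64 bucket;       std::atomic<int> bucket_size;
    FakeRepeatedInt64 hour_bucket;  std::atomic<int> hour_bucket_size;
    FakeRepeatedInt64 acc_bucket;   std::atomic<int> acc_bucket_size;
    std::int64_t ts, show, click, cost;
    std::int64_t hour_show, hour_click, hour_cost;
    std::int64_t acc_show, acc_click, acc_cost;
};

// Mirrors the StatsOpt diagram after the hour_* fields are removed.
struct StatsOptLayout {
    FakeRepeatedInt64 bucket;       std::atomic<int> bucket_size;
    FakeRepeatedInt64 acc_bucket;   std::atomic<int> acc_bucket_size;
    std::int64_t ts, show, click, cost;
    std::int64_t acc_show, acc_click, acc_cost;
};

// Index of the 64-byte cache line that contains the given byte offset.
constexpr std::size_t CacheLineIndex(std::size_t offset) { return offset / 64; }

// Asserts the facts described in the post under the assumptions above.
inline void CheckPostLayouts() {
    // Each repeated field plus its padded cached size takes 24 bytes.
    assert(offsetof(StatsLayout, hour_bucket) == 24);
    // Stats: ts and cost share a cache line.
    assert(CacheLineIndex(offsetof(StatsLayout, ts)) ==
           CacheLineIndex(offsetof(StatsLayout, cost)));
    // StatsOpt: ts and cost are split across two cache lines.
    assert(CacheLineIndex(offsetof(StatsOptLayout, ts)) !=
           CacheLineIndex(offsetof(StatsOptLayout, cost)));
}
```

Calling `CheckPostLayouts()` in a debug build passes on such a platform. Members of a real generated class are private, so this kind of check only works on a hand-written mirror like the one here.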

Summary

Using protocol buffers as a server communication protocol is a good idea: they are good at serialization, deserialization, and initialization. However, the layout of the generated objects is not always optimal for a high-performance server, so it takes some care to design your messages in a way that does not break spatial locality.

This post is licensed under CC BY 4.0 by the author.