Order Your Members

The C++ standard guarantees that the members of a class or struct appear in memory in the same order as they are declared.

Nonstatic data members of a (non-union) class with the same access control are allocated so that later members have higher addresses within a class object. The order of allocation of non-static data members with different access control is unspecified. Implementation alignment requirements might cause two adjacent members not to be allocated immediately after each other; so might requirements for space for managing virtual functions and virtual base classes.


Alignment & Padding

The last sentence from the Standard's quote is interesting: even though members appear in the same order, they don't have to be allocated next to each other because of data alignment. Remember that a CPU reads data one word at a time, e.g. 32 bits or 4 bytes for x86.

Consider what happens when you have 4 bytes of data. If the data is aligned, the CPU can read it from a single address in a single word. If it's not aligned, the data is by definition spread across two words. If the architecture supports reading unaligned data at all, rather than simply faulting, doing so certainly comes at a penalty. The x86 architecture can deal with unaligned data, but doing so requires the CPU to read from two addresses, each containing part of the data. Even in this simplistic example, reading unaligned data takes twice as long, and it only gets worse when you consider writes and take caching into account. Either way, unaligned access is undefined behavior in C++ and should be avoided1.

To keep data aligned, padding is added to make sure that each member begins on a suitable boundary. Take for example a struct2 containing a char (1 byte), an int (commonly 4 bytes), and a short (commonly 2 bytes).

struct S {  
  char c;
  int i;
  short s;
};

We can expect 3 bytes of padding after the char so that the int is aligned, and another 2 bytes after the short so that the struct itself is aligned3. Assuming c, i and s represent the bytes of their respective members, the layout looks something like this in memory:

0x00 c000  
0x04 iiii  
0x08 ss00  

The struct therefore contains 5 bytes of padding in addition to the 7 bytes of actual data, for a total size of 12 bytes.

We can save 4 bytes or one third of the total size by simply reordering the members from large to small. That might not sound like much, but if you have an array with a few thousand of these structs, the savings quickly add up.

struct S {  
  int i;
  short s;
  char c;
};
0x00 iiii  
0x04 ssc0  

You can verify the struct sizes yourself with the sizeof operator. If your computer's architecture matches mine, you will see the same results of 12 bytes and 8 bytes respectively.

When optimizing for memory footprint, it is a good idea to order members by decreasing alignment.

Alignment in C++

Since C++11, the language provides a wider set of tools for working with alignment. Two interesting things to mention are:

  • The alignof operator. It returns the alignment in bytes for a given type-id.
  • The alignas specifier. It specifies the alignment requirement of a class, struct, union or enumeration.

Consider the example below. With alignof we can query the alignment requirements of the different types we mentioned in the previous section.

struct S {  
  int i;
  short s;
  char c;
};

struct alignas(64) S64 {};

int main()  
{
    std::cout << "Alignment of 'char': " << alignof(char) << std::endl;
    std::cout << "Alignment of 'short': " << alignof(short) << std::endl;
    std::cout << "Alignment of 'int': " << alignof(int) << std::endl;
    std::cout << "Alignment of 'struct S': " << alignof(S) << std::endl;
    std::cout << "Alignment of 'struct S64': " << alignof(S64) << std::endl;
}

Below is the output on my computer. As expected, char, short and int are 1, 2, and 4 byte aligned respectively. The struct S itself is 4-byte aligned because of the integer, and because of the alignas specifier the S64 struct is 64-byte aligned.

Alignment of 'char': 1  
Alignment of 'short': 2  
Alignment of 'int': 4  
Alignment of 'struct S': 4  
Alignment of 'struct S64': 64  

Child Classes

An interesting technique that was mentioned on Reddit involves using a child class to make use of the padding of its parent class.

The example below follows the guideline of ordering members in decreasing order of alignment. The total size of Y is 136 bytes.

// sizeof(X) == 112
struct X {  
    std::uint64_t a[13];
    std::uint32_t b;
};

// sizeof(Y) == 136
struct Y {  
    X x;
    std::uint64_t a;
    std::uint64_t b;
    bool flag;
};

By using inheritance, we can make more efficient use of the available memory4. The boolean flag fits nicely behind the std::uint32_t, and the total size of Y is reduced to 128 bytes. Note that this relies on the compiler reusing the tail padding of the base class, which depends on the ABI and compiler, so verify the resulting size on your platform.

// sizeof(Y) == 128
struct Y: X {  
    bool flag;
    std::uint64_t a;
    std::uint64_t b;
};

Offset Calculation

The members of a struct are accessed via an offset relative to the beginning of the structure. The first member requires no offset at all, as its address equals that of the struct. Any member that starts within the first 128 bytes can be addressed with an offset that fits in 8 bits. Members beyond that require at least a 32-bit offset, making the instruction encoding larger and the address computation slightly less cache efficient.

Benchmarks

I benchmarked the impact on performance by updating a char member in a list of structs. The layouts of the structs differed only in the position of the character. For SmallOffset, its address was equal to that of the struct, while for LargeOffset, 128 bytes separated it from the start of the struct.

struct SmallOffset  
{
  char c;
  char a[128];
};

struct LargeOffset  
{
  char a[128];
  char c;
};

It took an array of about one million structs before the difference became noticeable; with fewer elements there was no measurable difference. Reducing the offset to less than 128 bytes indeed improved performance, and the advantage persisted at 10 million elements.

The data below gives the average running time of 100 iterations with 10 million elements per execution.

Scenario Avg. Running Time in Seconds
Small Offset 0.9282
Large Offset 1.3256

Conclusion

When optimizing for runtime performance, it is a good idea to put the most used members in the first 128 bytes of the struct.

Caching

Another important thing to consider is caching. Remember that data is transferred between main memory and the CPU cache in fixed-size chunks called cache lines. How this works exactly depends on the architecture, but in general we can expect that using fewer cache lines improves performance. The reasons are that:

  1. Loading data in the cache is an expensive operation.
  2. Lines are evicted from the cache to make space when new lines are loaded.

Note that this isn't universally true. Think about false sharing, where two threads keep invalidating the same cache line, leading to contention and degraded performance.

Single-Threaded Environment

When a single thread on a single core accesses the data in a struct, we can improve caching performance by using as few cache lines as possible.

  1. By optimizing for memory footprint, as discussed in the previous section, the struct uses less memory and hence occupies less space in the cache.

  2. By placing heavily used members close together, we hope (based on the locality of reference principle) that they will end up closer together in the cache, preferably even on the same cache line, and hence use less cache space.

  3. By separating hot fields from cold ones, we reduce the number of cache lines filled with unused data.

Benchmarks

To measure the impact of placing members close together, I did something similar to the previous benchmark. In this benchmark, members a and b are compared for equality. Since the data is only read, there should be no contention. Against my own advice from the previous section, I put the two members at the end of the CloseTogether struct to offset the impact of address calculation.

struct FarApart  
{
  char a;
  char c[128];
  char b;
};

struct CloseTogether  
{
  char c[128];
  char a;
  char b;
};

Again, it took about one million elements for the performance difference to become observable. As the number of elements grows, the second struct performs increasingly better.

The data below gives the average running time of 100 iterations with 10 million elements per execution.

Scenario Avg. Running Time in Seconds
Members far apart 1.0361
Members close together 0.8333

Multi-Threaded Environment

When different threads are accessing the data in a struct, we basically want to do the opposite and make sure that members used by different threads are not within the same cache line. Every time a member is modified from one core, the corresponding cache line is invalidated on all other cores. This means that a thread trying to access a different member in the same cache line stalls until the line is reloaded, even though that member's value would still have been valid after the first thread's modification.

Benchmarks

The same structs from the previous section were used for benchmarking. One thread repeatedly reads member a while another thread repeatedly writes to member b. The operations are repeated 10 million times.

The data below gives the average running time of 100 iterations.

Scenario Avg. Running Time in Seconds
Members far apart 0.02657
Members close together 0.0845

Unsurprisingly, the case where no false sharing occurs is significantly faster. The total time is reduced because the cache line containing a is not invalidated on each write to b.

False & True Sharing in C++

Remember the remark about false sharing. As of C++17, the thread support library gains two interesting compile-time constants5.

  • hardware_destructive_interference_size: the minimum offset between two objects to avoid false sharing. This value can be used as an argument to alignas to keep two fields of a struct apart.
  • hardware_constructive_interference_size: the maximum size of contiguous memory to promote true sharing.

Basically, these constants provide a portable way to access the L1 data cache line size from C++.

Conclusion

I've tried to highlight the impact that different choices regarding the ordering of members in a struct can have on the performance of your code. A lot can be achieved by taking into account some general guidelines. However, as always, to really know how a particular application behaves, you will have to measure and compare its performance.


All benchmarking code is available on GitHub: https://github.com/JDevlieghere/order-your-members. The data shown in the article was obtained from my early 2015 MacBook Pro (2.7 GHz Intel Core i5, 8GB DDR3) running macOS Sierra.


  1. Unless explicitly allowed by the compiler, such as with packed representations.

  2. I will use struct throughout the article, but everything equally applies to classes.

  3. On a 4 byte aligned system.

  4. Thank you u/matthieum for pointing this out!

  5. Thank you u/iaanus for pointing this out!

Jonas Devlieghere

Software Engineer at GuardSquare, working with LLVM on mobile application protection for iOS.
