The C++ standard guarantees that the members of a class or struct appear in memory in the same order as they are declared.
> Nonstatic data members of a (non-union) class with the same access control are allocated so that later members have higher addresses within a class object. The order of allocation of non-static data members with different access control is unspecified. Implementation alignment requirements might cause two adjacent members not to be allocated immediately after each other; so might requirements for space for managing virtual functions and virtual base classes.
Alignment & Padding
The last sentence of the quote is interesting: even though members appear in the same order, they don’t have to be allocated next to each other because of data alignment. Remember that a CPU reads data one word at a time, e.g. 32 bits or 4 bytes for x86.
Consider what happens when you have 4 bytes of data. If the data is aligned, the CPU can read it from a single address in a single word. However, if it’s not aligned, the data is by definition spread across two words. If the architecture supports reading unaligned data at all and doesn’t just crash, doing so certainly comes at a penalty. The x86 architecture can deal with unaligned data, but doing so requires the CPU to read from two addresses, each containing part of the data. Even in this simplistic example, reading unaligned data takes twice as long, and it only gets worse when you consider writes and take caching into account. Either way, unaligned access is undefined behavior and should be avoided[^1].
To keep data aligned, the compiler adds padding so that each member begins at a suitably aligned address. Take for example a struct[^2] containing a char (1 byte), an integer (commonly 4 bytes), and a short (commonly 2 bytes).
struct S {
    char c;
    int i;
    short s;
};
We can expect 3 bytes of padding after the char to make the next integer aligned, and another 2 bytes after the short to make the struct itself aligned[^3]. Assuming c, i and s represent the bytes of their respective members, this looks something like this in memory:
0x00 c000
0x04 iiii
0x08 ss00
The total size of the struct will therefore be 12 bytes: 7 bytes of actual data plus 5 bytes of padding.
We can save 4 bytes or one third of the total size by simply reordering the members from large to small. That might not sound like much, but if you have an array with a few thousand of these structs, the savings quickly add up.
struct S {
    int i;
    short s;
    char c;
};
0x00 iiii
0x04 ssc0
You can validate the struct sizes yourself with the sizeof operator. If the architecture of your computer matches mine, you will see the same results of 12 and 8 bytes respectively.
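As a quick check, the sketch below prints both sizes. This is a minimal example; the struct names Padded and Reordered are mine, and the commented values assume a typical platform where an int is 4 bytes and a short is 2, since sizes and padding are implementation-defined.

#include <iostream>

// Original ordering: padding after the char and after the short.
struct Padded {
    char c;
    int i;
    short s;
};

// Members ordered from large to small: only one byte of padding.
struct Reordered {
    int i;
    short s;
    char c;
};

int main()
{
    std::cout << sizeof(Padded) << std::endl;    // 12 on a typical platform
    std::cout << sizeof(Reordered) << std::endl; // 8 on a typical platform
}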
When optimizing for memory footprint, it is a good idea to order members by decreasing alignment.
Alignment in C++
Since C++11, the language provides a wider set of tools for working with alignment. Two interesting things to mention are:
- The alignof operator. It returns the alignment in bytes for a given type-id.
- The alignas specifier. It specifies the alignment requirement of a class, struct, union or enumeration.
Consider the example below. With alignof we can query the alignment requirements of the different types we mentioned in the previous section.
#include <iostream>

struct S {
    int i;
    short s;
    char c;
};

struct alignas(64) S64 {};

int main()
{
    std::cout << "Alignment of 'char': " << alignof(char) << std::endl;
    std::cout << "Alignment of 'short': " << alignof(short) << std::endl;
    std::cout << "Alignment of 'int': " << alignof(int) << std::endl;
    std::cout << "Alignment of 'struct S': " << alignof(S) << std::endl;
    std::cout << "Alignment of 'struct S64': " << alignof(S64) << std::endl;
}
Below is the output on my computer. As expected, char, short and int are 1, 2, and 4 byte aligned respectively. The struct S itself is 4-byte aligned because of the integer, and because of the alignas specifier the S64 struct is 64-byte aligned.
Alignment of 'char': 1
Alignment of 'short': 2
Alignment of 'int': 4
Alignment of 'struct S': 4
Alignment of 'struct S64': 64
Child Classes
An interesting technique that was mentioned on Reddit[^] involves using a child class to make use of the padding of its parent class.
The example below follows the guideline of ordering members in decreasing order of alignment. The total size of Y is 136 bytes.
#include <cstdint>

// sizeof(X) == 112
struct X {
    std::uint64_t a[13];
    std::uint32_t b;
};

// sizeof(Y) == 136
struct Y {
    X x;
    std::uint64_t a;
    std::uint64_t b;
    bool flag;
};
By using inheritance, we can make more efficient use of the available memory[^4]. The boolean flag fits nicely behind the std::uint32_t, and the total size of Y is reduced to 128 bytes.
// sizeof(Y) == 128
struct Y : X {
    bool flag;
    std::uint64_t a;
    std::uint64_t b;
};
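Whether the compiler can actually place the derived class's members in the base class's tail padding depends on its ABI, so it is worth verifying the sizes on your own platform. Below is a minimal sketch (the names YMember and YDerived are mine):

#include <cstdint>
#include <iostream>

struct X {
    std::uint64_t a[13];
    std::uint32_t b;
};

// Composition, as in the first example.
struct YMember {
    X x;
    std::uint64_t a;
    std::uint64_t b;
    bool flag;
};

// Inheritance, as in the second example.
struct YDerived : X {
    bool flag;
    std::uint64_t a;
    std::uint64_t b;
};

int main()
{
    std::cout << sizeof(X) << std::endl;        // 112 on my machine
    std::cout << sizeof(YMember) << std::endl;  // 136 on my machine
    std::cout << sizeof(YDerived) << std::endl; // 128 if the padding is reused
}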
Offset Calculation
The different members in a struct are accessed via an offset relative to the beginning of the structure. The first member requires no offset at all, as its address is equal to that of the struct. On x86, the offset of any member that starts within the first 128 bytes can be encoded with only 8 bits. Members after that require at least a 32-bit offset, making the instruction encoding longer and slightly less cache efficient.
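You can inspect these offsets with offsetof from <cstddef>, as in the sketch below (the Record struct is a made-up example):

#include <cstddef>
#include <iostream>

struct Record {
    char first;       // offset 0: its address equals that of the struct
    char filler[127]; // fills the rest of the first 128 bytes
    char last;        // offset 128: needs a 32-bit displacement on x86
};

int main()
{
    std::cout << offsetof(Record, first) << std::endl;  // 0
    std::cout << offsetof(Record, filler) << std::endl; // 1
    std::cout << offsetof(Record, last) << std::endl;   // 128
}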
Benchmarks
I benchmarked the impact on performance by updating a char member in a list of structs. The layout of the structs differed only in the position of the character. For SmallOffset its address was equal to that of the struct, while for LargeOffset there were 128 bytes in between.
struct SmallOffset
{
    char c;
    char a[128];
};

struct LargeOffset
{
    char a[128];
    char c;
};
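The update loop itself boils down to something like the simplified sketch below (the update name is mine; the full benchmarking code, including the timing harness, is linked at the end of the article):

#include <vector>

// Increment the char member of every element. The member access compiles
// down to 'base address + offset', which is where the two layouts differ.
template <typename T>
void update(std::vector<T>& data)
{
    for (auto& element : data) {
        ++element.c;
    }
}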
It took an array of one million structs before the difference became noticeable. Reducing the offset to less than 128 bytes indeed improved performance; reducing it further to zero made no additional measurable difference, even at 10 million elements.
The data below gives the average running time of 100 iterations with 10 million elements per execution.
| Scenario | Avg. Running Time in Seconds |
|---|---|
| Small Offset | 0.9282 |
| Large Offset | 1.3256 |
Conclusion
When optimizing for runtime performance, it is a good idea to put the most used members in the first 128 bytes of the struct.
Caching
Another important thing to consider is caching. Remember that data is transferred between memory and the CPU cache in fixed-size chunks, called cache lines. How this works exactly depends on the architecture, but in general we can expect that using fewer cache lines improves performance. The reason is that:
- Loading data in the cache is an expensive operation.
- Lines are evicted from the cache to make space when new lines are loaded.
Note that this isn’t true per se. Think about the issue of false sharing, where two threads keep invalidating the same cache line, leading to contention and degraded performance.
Single-Threaded Environment
When a single thread on a single core is accessing the data in a struct, we can improve caching performance by using as few cache lines as possible.
- By optimizing for memory footprint, as discussed in the previous section, the struct uses less memory and hence occupies less space in the cache.
- By placing heavily used members close together, we hope (based on the locality of reference principle) that they will end up closer together in the cache, preferably even on the same cache line, and hence use less cache space.
- By separating hot fields from cold ones, we reduce the number of cache lines filled with unused data. A sketch of this technique follows the list.
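A common way to separate hot fields from cold ones is to move the rarely used members behind a pointer, so the frequently accessed ones stay densely packed. The example below is made up, and the field names are hypothetical:

#include <memory>
#include <string>

// Cold: bookkeeping that is rarely touched lives in its own allocation.
struct ColdData {
    std::string name;
    std::string description;
};

// Hot: the frequently accessed fields stay small and densely packed,
// so more of them fit on a single cache line.
struct Entity {
    float x, y, z;
    float velocity;
    std::unique_ptr<ColdData> cold;
};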
Benchmarks
To measure the impact of placing members close together, I did something similar to the previous benchmark. In this benchmark, members a and b are compared for equality. Since the data is only read, there should be no contention. Against my own advice from the previous section, I put the two members at the end of the CloseTogether struct to offset the impact of address calculation.
struct FarApart
{
    char a;
    char c[128];
    char b;
};

struct CloseTogether
{
    char c[128];
    char a;
    char b;
};
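The comparison loop looks roughly like the sketch below (a simplification; the count_equal name is mine):

#include <cstddef>
#include <vector>

// Count the elements whose members a and b are equal. Both members are
// read on every iteration, so their distance in memory matters.
template <typename T>
std::size_t count_equal(const std::vector<T>& data)
{
    std::size_t count = 0;
    for (const auto& element : data) {
        if (element.a == element.b) {
            ++count;
        }
    }
    return count;
}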
Again it took about one million elements for the performance difference to become observable. As the number of elements grows, the second struct performs increasingly better.
The data below gives the average running time of 100 iterations with 10 million elements per execution.
| Scenario | Avg. Running Time in Seconds |
|---|---|
| Members far apart | 1.0361 |
| Members close together | 0.8333 |
Multi-Threaded Environment
When different threads are accessing the data in a struct, we basically want to do the opposite and make sure that their members are not within the same cache line. Every time a member is modified by one core, the corresponding cache line is invalidated across all other cores. This means that a thread trying to access a different member in the same cache line stalls until the line is reloaded, even though the value it needs was still valid after the first thread modified the cache line.
Benchmarks
The same structs from the previous section were used for benchmarking. One thread repeatedly reads member a while another thread repeatedly writes to member b. The operations are repeated 10 million times.
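In simplified form, the setup looks like the sketch below. Note that the members are made std::atomic<char> here so that the concurrent accesses are well-defined; the actual benchmark code is on GitHub.

#include <atomic>
#include <thread>

// Mirrors FarApart: the padding keeps a and b on different cache lines.
struct FarApartAtomic
{
    std::atomic<char> a;
    char padding[128];
    std::atomic<char> b;
};

int main()
{
    FarApartAtomic data{};
    std::thread reader([&data] {
        for (int i = 0; i < 10000000; ++i)
            data.a.load(std::memory_order_relaxed); // only reads a
    });
    std::thread writer([&data] {
        for (int i = 0; i < 10000000; ++i)
            data.b.store(1, std::memory_order_relaxed); // only writes b
    });
    reader.join();
    writer.join();
}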
The data below gives the average running time of 100 iterations.
| Scenario | Avg. Running Time in Seconds |
|---|---|
| Members far apart | 0.02657 |
| Members close together | 0.0845 |
Unsurprisingly, the case where no false sharing occurs is significantly faster. The total time is reduced because the cache line containing a is not invalidated on each write to b.
False & True Sharing in C++
Remember the remark about false sharing. As of C++17, two interesting compile-time constants are added to the thread support library[^5].
- hardware_destructive_interference_size: the minimum offset between two objects to avoid false sharing. This value can be used as an argument to alignas to keep two fields of a struct apart.
- hardware_constructive_interference_size: the maximum size of contiguous memory to promote true sharing.
Basically, these constants provide a portable way to access the L1 data cache line size from C++.
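For example, the destructive interference size can be used with alignas to keep two members on separate cache lines. A minimal sketch (both constants are declared in the <new> header, and note that not every standard library ships them yet):

#include <new>

// Each counter gets its own cache line, so writing one of them does not
// invalidate the line holding the other.
struct Counters {
    alignas(std::hardware_destructive_interference_size) long reads;
    alignas(std::hardware_destructive_interference_size) long writes;
};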
Conclusion
I’ve tried to highlight the impact that different choices regarding the ordering of members in a struct can have on the performance of your code. A lot can be achieved by taking into account some general guidelines. However, as always, to really know how a particular application behaves, you will have to measure and compare its performance.
Jason Turner mentioned the -Wpadded flag on CppCast. Enabling this flag will cause the compiler to print a warning when a member is padded. Clang will even tell you how many bytes were added.
padding.cpp:6:9: warning: padding struct 'S' with 3 bytes to align 'i' [-Wpadded]
int i;
^
padding.cpp:4:10: warning: padding size of 'S' with 2 bytes to alignment boundary [-Wpadded]
struct S {
^
All benchmarking code is available on GitHub: https://github.com/JDevlieghere/order-your-members. The data shown in the article was obtained from my early 2015 MacBook Pro (2.7 GHz Intel Core i5, 8GB DDR3) running macOS Sierra.