MCF
MCF Route planning.
Languages
C
Bugzilla Tickets

PR | Status | Require target changes
 | UNCONFIRMED | ?
 | UNCONFIRMED | ?
 | UNCONFIRMED | ?
Discussion/Notes
Structure peeling nodes
Benchmark description: 505.mcf_r is a route-planning workload. It creates and manipulates a graph of nodes and connecting arcs. The node and arc data is read from a file at program startup into a chunk of allocated memory. The first optimization concerns the nodes structure. It is defined in defines.h:
The cost_t and flow_t types are typedefs of the 64-bit long type. arc_p is a pointer to an arc structure and node_p is a pointer to a node structure. The node structure is therefore recursive: it contains pointers to other nodes.
The benchmark creates a large array of these node structures in the read_min function:
net->nodes = (node_t *) calloc( net->n + 1, sizeof(node_t) );
All instances of the node structure in the program come from that array; no other part of the program creates node objects. The net structure is a single global instance that holds pointers to all nodes and arcs, their sizes and other data used throughout the program. Many of the hot functions in the benchmark iterate across the nodes through the child, pred, sibling and sibling_prev links, using one or two other fields to compute various metrics (potential, flow). These linked-list traversals, coupled with accesses to only a few of the fields of the struct, cause poor memory behavior because the unused fields of each node take up cache space.
Note that the arc_tmp field is unused in the whole 505.mcf_r benchmark and could thus be removed from the struct by whole-program analysis.
One optimization proposed over the years to improve the cache behavior of mcf (in its SPEC2000 and SPEC2006 incarnations) is structure peeling [1][2]. This transforms an array of structs into multiple arrays of the struct's individual fields, making the field data from the nodes more compact. The accesses to the fields also need to be transformed: a pointer to a node is replaced by an integer, and an access to a field of a node is replaced by an array indexing operation. This transformation can be performed manually on the 505.mcf_r source.
[1] https://dl.acm.org/citation.cfm?id=1122408
[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.509.259&rep=rep1&type=pdf
The structure peeling transformation splits up the array of 14-field node structures into 14 global arrays, each containing all instances of one field. If performed at the source level, the individual arrays could be defined like so:
typedef struct node_potential { cost_t potential; } node_potential_t;
typedef struct node_orientation { int orientation; } node_orientation_t;
typedef struct node_child { node_p child; } node_child_t;
typedef struct node_pred { node_p pred; } node_pred_t;
typedef struct node_sibling { node_p sibling; } node_sibling_t;
typedef struct node_sibling_prev { node_p sibling_prev; } node_sibling_prev_t;
typedef struct node_basic_arc { arc_p basic_arc; } node_basic_arc_t;
typedef struct node_firstout { arc_p firstout; } node_firstout_t;
typedef struct node_firstin { arc_p firstin; } node_firstin_t;
typedef struct node_arc_tmp { arc_p arc_tmp; } node_arc_tmp_t;
typedef struct node_flow { flow_t flow; } node_flow_t;
typedef struct node_depth { LONG depth; } node_depth_t;
typedef struct node_number { int number; } node_number_t;
typedef struct node_time { int time; } node_time_t;

node_potential_t *node_potentials;
node_orientation_t *node_orientations;
node_child_t *node_childs;
node_pred_t *node_preds;
node_sibling_t *node_siblings;
node_sibling_prev_t *node_sibling_prevs;
node_basic_arc_t *node_basic_arcs;
node_firstout_t *node_firstouts;
node_firstin_t *node_firstins;
node_arc_tmp_t *node_arc_tmps;
node_flow_t *node_flows;
node_depth_t *node_depths;
node_number_t *node_numbers;
node_time_t *node_times;
After this transformation, every pointer to a node (node_p) becomes an integer index into these global arrays whenever a node field is required. Thus node_p can be redefined as:
typedef int64_t node_p;
Whole-program analysis has determined that the single net->nodes array is the only pool of nodes in the program, with no node structures allocated anywhere else. Therefore these arrays can be global. net->nodes, now an int64_t index, is initialised to zero, pointing at the first node.
After the layout transformation, the accesses to each field of the original node structure must be changed to index the new global arrays. For example, in a loop that walks the pred links, the accesses to the depth, orientation and flow fields are transformed into:
if( delta > feas_tol )
{
  // temp is a node_p (int64_t)
  for( temp = jminus; temp != w; temp = node_preds[temp].pred )
  {
    node_depths[temp].depth -= depth_iminus;
    if( node_orientations[temp].orientation != cycle_ori )
      node_flows[temp].flow += delta;
    else
      node_flows[temp].flow -= delta;
  }
}
Implementation considerations
When implementing this transformation, the compiler (or the user, if doing this transformation by hand) needs to make a few minor implementation decisions. These include:
The memory allocation and deallocation of the net->nodes array must now be transformed into multiple allocations of the new global node_* arrays that hold the fields. These arrays can be allocated individually with multiple calls to malloc, or grouped in a large allocation, in which case a bit of extra code will need to be inserted to set up the individual arrays to point to the right offset from the allocation. In the case of 505.mcf_r this seems like a minor choice that doesn’t affect the performance of the application.
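As a sketch, the single calloc can be replaced by per-field allocations like the following (alloc_node_arrays is a hypothetical helper, not from the benchmark sources; only two of the fourteen arrays are shown):

```c
#include <stdlib.h>
#include <stdint.h>

typedef int64_t cost_t;
typedef int64_t node_p; /* Node "pointers" are now indices. */

typedef struct node_potential { cost_t potential; } node_potential_t;
typedef struct node_pred { node_p pred; } node_pred_t;

node_potential_t *node_potentials;
node_pred_t *node_preds;

/* Replaces net->nodes = calloc( net->n + 1, sizeof(node_t) ) with one
   zero-initialised allocation per peeled field array.
   Returns 0 on success. */
int
alloc_node_arrays (long n)
{
  size_t count = (size_t) n + 1;
  node_potentials = calloc (count, sizeof (node_potential_t));
  node_preds = calloc (count, sizeof (node_pred_t));
  /* ...and similarly for the other twelve field arrays... */
  return (node_potentials && node_preds) ? 0 : -1;
}
```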
Comparing against the NULL node pointer must now be rewritten, as a node_p of value zero is now a valid index accessing the first node in the pool. There are a few strategies proposed in existing literature. One is to define an INVALID_NODE value (say -1) and substitute all comparisons of node pointers with comparisons against that value. Another is to allocate one extra node element and set the net->nodes root to 1, so that all pointers derived from it will be offset from 1. In the case of 505.mcf_r this seems like a minor choice that doesn’t affect the performance of the application.
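A sketch of the first strategy (INVALID_NODE and find_root are illustrative, not names from the benchmark):

```c
#include <stdint.h>

typedef int64_t node_p;

/* Sentinel replacing NULL: index 0 is now a valid node, so use -1. */
#define INVALID_NODE ((node_p) -1)

typedef struct node_pred { node_p pred; } node_pred_t;

/* Original (pointer) form:  while (node->pred != NULL) node = node->pred;
   Peeled (index) form, walking the pred links through the field array: */
node_p
find_root (const node_pred_t *node_preds, node_p node)
{
  while (node_preds[node].pred != INVALID_NODE)
    node = node_preds[node].pred;
  return node;
}
```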
Pointer compression
Once the structure peeling transformation on the nodes is performed, we now have the option of “compressing” the node_p pointers by defining them to a shorter data type, such as int32_t. This allows us to pack the node_p fields in their respective arrays even tighter. From our investigations compressing the node_p pointers to an int32_t gave a 9% improvement on 505.mcf_r over keeping them as int64_t. Going even further and compressing them to int16_t gave us a 25% improvement over using int64_t.
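At the source level the compression amounts to changing the node_p typedef, plus a startup check that the node count read from the input actually fits in the narrower type (a sketch; check_node_count is a hypothetical helper):

```c
#include <stdint.h>
#include <assert.h>

/* Compressed node index: 32 bits instead of 64. Every peeled array
   holding node_p fields (pred, child, sibling, sibling_prev) shrinks
   to half its former size. */
typedef int32_t node_p;

typedef struct node_pred { node_p pred; } node_pred_t;

/* Hypothetical startup helper: the node count read from the input
   file (plus one for the root) must fit in the compressed type. */
static void
check_node_count (long n)
{
  assert (n + 1 <= INT32_MAX);
}
```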
Arc transformation
This transformation affects the arcs structure from defines.h:
At first glance this looks much like the node structure, and indeed it creates similar cache-access bottlenecks when touching many arcs in quick succession, as in the spec_qsort calls from implicit.c.
The structure peeling optimization won’t apply because of the way arcs are allocated and used. First, there are three distinct pools of arcs in the net structure:
typedef struct network
{
  char inputfile[200];
  char clustfile[200];
  LONG n, n_trips;
  LONG max_m, m, m_org, m_impl;
  LONG max_residual_new_m, max_new_m;

  /* Various fields. */
  double optcost;
  cost_t ignore_impl;
  node_p nodes, stop_nodes;
  arc_p arcs, stop_arcs, sorted_arcs;
  arc_p dummy_arcs, stop_dummy;
  LONG iterations;
  LONG bound_exchanges;
  LONG nr_group, full_groups, max_elems;
} network_t;
The arcs, sorted_arcs and dummy_arcs pointers point to distinctly allocated pools at startup, and the stop_arcs and stop_dummy pointers are derived from these pools.
Having multiple pools of arcs does not allow us to have a single global pool to index into. Furthermore, these pools are swapped and even reallocated at different points in the benchmark.
Before transforming the arc pools there are two prerequisite transformations:
- The node structures are peeled as in the first transformation and the pointers are compressed to 32 bits.
- The nextout and nextin fields of the arc structure are removed, as whole-program analysis detects them as effectively unused: they are only ever written during execution, and the only place they are read is a debug dump function that is never called.
The transformation that can be performed on each pool is:
- It groups the arc fields together such that the individual 8-byte fields of every 1024 arcs are packed into a bucket. Each bucket thus occupies 8192 bytes (8KB) of memory.
- Fields smaller than 8 bytes (id, ident, tail, head) are interleaved in pairs and packed into buckets: id and tail form one bucket; head and ident form another. We'll call such an 8-byte field, or pair of fields, a granule.
- The buckets for each granule are laid out consecutively in memory in a group. Thus, each group contains the data for 1024 arcs.
- Groups are laid out consecutively in memory.
Given such a memory layout change, some operations on arc pointers (arc_p) must also be changed. If performing this transformation at a source level, it is best to redefine arc_p as a byte pointer (char *).
Every valid arc pointer must always point into the first 8K bytes of a group.
From that pointer the access to a particular field associated with that arc is derived by adding a constant offset corresponding to the position of the relevant bucket in the group.
Therefore, given an arc pointer ptr (of underlying type char *), the following table gives the formulas for computing the memory addresses of the arc's fields.
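The exact formulas depend on the order of the buckets within a group, which is an implementation choice. Purely as an illustration, assuming a hypothetical bucket order of cost, id|tail, head|ident, flow, org_cost (with tail and head compressed to 32 bits per the prerequisite above, and assuming cost, flow and org_cost are the 8-byte fields), the field accesses could be written as:

```c
#include <stdint.h>

typedef int64_t cost_t;
typedef int64_t flow_t;
typedef char *arc_p; /* Arc pointers are byte pointers into a group's first bucket. */

#define BUCKET 8192 /* 1024 arcs x 8 bytes per granule. */

/* Each macro adds the constant offset of the field's bucket within the
   group (plus 4 for the second member of an interleaved pair). */
#define ARC_COST(ptr)     (*(cost_t *) (ptr))
#define ARC_ID(ptr)       (*(int32_t *) ((ptr) + 1 * BUCKET))
#define ARC_TAIL(ptr)     (*(int32_t *) ((ptr) + 1 * BUCKET + 4))
#define ARC_HEAD(ptr)     (*(int32_t *) ((ptr) + 2 * BUCKET))
#define ARC_IDENT(ptr)    (*(int32_t *) ((ptr) + 2 * BUCKET + 4))
#define ARC_FLOW(ptr)     (*(flow_t *) ((ptr) + 3 * BUCKET))
#define ARC_ORG_COST(ptr) (*(cost_t *) ((ptr) + 4 * BUCKET))
```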
Besides changing the layout of the arc pools, you can also transform the memory allocations for these pools into calls to aligned_alloc with an 8K alignment. This guarantees that each bucket is not only 8K-sized but also 8K-aligned, and each group is therefore also 8K-aligned.
The memory layout transformation described above requires the pointer increment semantics for arc pointers to also be updated. Pointing to the next arc in the original program corresponds to incrementing the pointer by the size of one granule (8 bytes) in the modified program.
Because an arc pointer must always point into the first 8K bytes of a group, we need to describe what happens when a pointer is offset such that it has to point to the next group. For example, incrementing an arc pointer pointing to the 1020th arc in group 10 by 6 must advance the pointer to the 2nd arc in group 11.
To achieve that “jump” efficiently you can make use of the 8K-aligned property of buckets in the following algorithm:
intptr_t
offset_arc_ptr (intptr_t ptr, long off)
{
  long off_times_8 = off * 8;
  intptr_t new_granule = ptr + off_times_8; // Point OFF granules further.
  // Did we cross an 8K boundary?
  intptr_t group_diff = (new_granule >> 13) - (ptr >> 13);
  intptr_t res;
  if (group_diff == 0) // If not, then a simple offset is enough.
    res = new_granule;
  else
    {
      intptr_t curr_group_start = (ptr >> 13) << 13; // Start of current group.
      // Offset from start of current group.
      intptr_t curr_off_from_group = ptr - curr_group_start;
      // Offset inside the new group. C's % truncates towards zero,
      // so adjust negative remainders when jumping backwards.
      intptr_t new_off = (curr_off_from_group + off_times_8) % 8192;
      if (new_off < 0)
        new_off += 8192;
      res = curr_group_start
            + ((group_diff * 5) << 13) // Groups to jump (5 buckets per group).
            + new_off;
    }
  return res;
}
Because the groups are known to be 8K-aligned and consecutively laid out in memory, we can efficiently determine for each arc pointer which group it belongs to by shifting right by 13 (log2(8192)). If the pointer after the offset lies within the same group, no further adjustment is needed. If the pointer crosses into a different 8K region, it must be adjusted to skip to the new group. Although the pointer offset logic contains conditional paths, the (potentially unpredictable) branches it would introduce can be avoided through if-conversion.
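In fact the conditional logic can be folded into straight-line code. A sketch (offset_arc_ptr_branchless is not from the benchmark changes; it relies only on the 8K alignment of groups): each group spans five buckets while a valid arc pointer stays in the first one, so a pointer that crosses k 8K boundaries needs exactly k * 4 * 8192 bytes of extra adjustment, in either direction:

```c
#include <stdint.h>

/* Branchless pointer offset: crossing k 8K boundaries (k may be
   negative) means skipping k * 4 extra buckets, because only the
   first of each group's five buckets holds arc pointers. */
intptr_t
offset_arc_ptr_branchless (intptr_t ptr, long off)
{
  intptr_t new_granule = ptr + off * 8;
  /* Arithmetic right shift acts as floor division by 8192, which
     makes backward jumps come out right as well. */
  intptr_t group_diff = (new_granule >> 13) - (ptr >> 13);
  return new_granule + group_diff * 4 * 8192;
}
```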
One complication that this scheme must deal with is reallocation. There are a few places in 505.mcf_r where two of the three pools are reallocated through calls to realloc. This presents a problem because realloc cannot be made to guarantee the required 8K group alignment.
To work around this limitation you can transform these calls to realloc into aligned_alloc calls followed by a manual memcpy from the previous pool before freeing it. To perform this memcpy, the transformed code needs to know the size of the original allocation, which is normally not available for memory allocated with malloc (or calloc, or aligned_alloc).
Therefore, you can store the size of each allocation together with the pool data, at a negative offset from the pool start. Because the pools are allocated in 8K-aligned chunks, this wastes an 8K block of memory per pool to store an 8-byte size.
When transforming calls to realloc on the arc pools, you can replace them with an aligned_alloc, followed by a memcpy using the size stored at the negative offset from the pool pointer, followed by a call to free on the original pool pointer. The new allocation size is similarly stored at the negative offset from the newly-allocated pool.
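A sketch of the scheme under the stated assumptions (pool_alloc and pool_realloc are illustrative names, not from the benchmark changes):

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define GROUP_ALIGN 8192

/* One extra 8K-aligned block in front of the pool data holds the
   usable size, stored just below the returned pointer. */
static void *
pool_alloc (size_t size)
{
  /* aligned_alloc requires the size to be a multiple of the alignment. */
  size_t rounded = (size + GROUP_ALIGN - 1) & ~(size_t) (GROUP_ALIGN - 1);
  char *base = aligned_alloc (GROUP_ALIGN, GROUP_ALIGN + rounded);
  if (!base)
    return NULL;
  ((size_t *) (base + GROUP_ALIGN))[-1] = size; /* Negative offset from the pool. */
  return base + GROUP_ALIGN;
}

/* Replaces realloc on a pool: aligned_alloc, memcpy using the stored
   size, then free of the original allocation. */
static void *
pool_realloc (void *pool, size_t new_size)
{
  void *new_pool = pool_alloc (new_size);
  if (new_pool && pool)
    {
      size_t old_size = ((size_t *) pool)[-1];
      memcpy (new_pool, pool, old_size < new_size ? old_size : new_size);
      free ((char *) pool - GROUP_ALIGN);
    }
  return new_pool;
}
```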
Performance impact
We implemented the arc transformation changes described above as source changes to the 505.mcf_r benchmark, on top of the structure peeling transformation described earlier. With these two changes, the GCC score for 505.mcf_r on x86_64 compiled with -Ofast -flto -march=native matched the top spec score, within 1%. This represents a roughly 2x improvement over the score GCC achieves when run on unmodified 505.mcf_r sources.