[Bug c++/78180] New: Poor optimization of std::array on gcc 4.8/5.4/6.2 as compared to simple raw array

Tue Nov 1 21:00:00 GMT 2016

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78180

            Bug ID: 78180
           Summary: Poor optimization of std::array on gcc 4.8/5.4/6.2 as
                    compared to simple raw array
           Product: gcc
           Version: 6.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: barry.revzin at gmail dot com
  Target Milestone: ---

Here is a complete benchmark comparing a few simple operations on a
std::array<int64_t, 128> vs. an int64_t[128]. I'm using
https://github.com/google/benchmark and compiling with -std=c++11 -O3
-D_GLIBCXX_USE_CXX11_ABI=0:

=============================================================
#include <array>
#include <cstdint>
#include <benchmark/benchmark_api.h>

template <class C>
class Rolling
{
    C times_{};
    uint32_t idx_;
    const uint32_t size_;

public:
    Rolling(uint32_t size)
    : idx_(0)
    , size_(size)
    { } 

    void add(int64_t t)
    {   
        times_[idx_] = t;
        ++idx_;
        if (idx_ == size_) {
            idx_ = 0;
        }
    }   

    bool exceeded(int64_t now, int64_t intv)
    {   
        return now - times_[idx_] < intv;
    }   
};

template <class C>
void BM_Rolling(benchmark::State& state)
{
    Rolling<C> r(100);
    int64_t exc = 0;

    while (state.KeepRunning()) {
        for (int i = 0; i < state.range(0); ++i) {
            r.add(i);
            if (r.exceeded(i, 1000000)) {
                benchmark::DoNotOptimize(++exc);
            }
        }
    }   
}

#define JOIN(...) __VA_ARGS__
BENCHMARK_TEMPLATE(BM_Rolling, int64_t[128])->Range(8, 8<<10);
BENCHMARK_TEMPLATE(BM_Rolling, JOIN(std::array<int64_t, 128>))->Range(8, 8<<10);

BENCHMARK_MAIN();
=============================================================

This yields the following performance numbers (similar across 4.8.2, 5.4.0, and
6.2.0):

Run on (16 X 3199.66 MHz CPU s)
2016-11-01 15:56:13
Benchmark                                               Time           CPU Iterations
-------------------------------------------------------------------------------------
BM_Rolling<JOIN(std::array<int64_t, 128>)>/8           18 ns         18 ns   39568747
BM_Rolling<JOIN(std::array<int64_t, 128>)>/64         135 ns        134 ns    5218330
BM_Rolling<JOIN(std::array<int64_t, 128>)>/512       1084 ns       1031 ns     678795
BM_Rolling<JOIN(std::array<int64_t, 128>)>/4k        8221 ns       8185 ns      85583
BM_Rolling<JOIN(std::array<int64_t, 128>)>/8k       16975 ns      16520 ns      42752
BM_Rolling<int64_t[128]>/8                             15 ns         15 ns   45940368
BM_Rolling<int64_t[128]>/64                           112 ns        111 ns    6301196
BM_Rolling<int64_t[128]>/512                          821 ns        817 ns     858168
BM_Rolling<int64_t[128]>/4k                          6538 ns       6496 ns     108570
BM_Rolling<int64_t[128]>/8k                         12957 ns      12902 ns      53582

That is a large performance gap between std::array and raw array (std::array
is roughly 20-30% slower at every size), where I wouldn't expect any, since
std::array is only a thin wrapper around a raw array. When compiling with
clang, I don't see any gap at all (though for both containers, the performance
is significantly worse than gcc's).
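
A reduced test case that does not depend on the benchmark library may make it
easier to compare the generated code for the two containers. The following is
only a sketch under stated assumptions: count_exceeded is a hypothetical helper
that is not part of the benchmark above, it takes the iteration count as a
plain argument instead of state.range(0), and it returns the counter instead of
passing it to benchmark::DoNotOptimize. Because of those differences, it may
not reproduce the gap exactly as measured.

```cpp
#include <array>
#include <cstdint>

// Same class as in the benchmark above.
template <class C>
class Rolling
{
    C times_{};
    uint32_t idx_;
    const uint32_t size_;

public:
    Rolling(uint32_t size)
    : idx_(0)
    , size_(size)
    { }

    void add(int64_t t)
    {
        times_[idx_] = t;
        ++idx_;
        if (idx_ == size_) {
            idx_ = 0;
        }
    }

    bool exceeded(int64_t now, int64_t intv)
    {
        return now - times_[idx_] < intv;
    }
};

// Hypothetical helper (an assumption, not from the report): runs the same
// inner loop as BM_Rolling for n iterations and returns the counter.
template <class C>
int64_t count_exceeded(int64_t n)
{
    Rolling<C> r(100);
    int64_t exc = 0;
    for (int64_t i = 0; i < n; ++i) {
        r.add(i);
        if (r.exceeded(i, 1000000)) {
            ++exc;
        }
    }
    return exc;
}

// Explicit instantiations so that both versions are emitted in the assembly.
template int64_t count_exceeded<int64_t[128]>(int64_t);
template int64_t count_exceeded<std::array<int64_t, 128>>(int64_t);
```

Compiling this with g++ -std=c++11 -O3 -S and comparing the two instantiations
side by side should show whether the loop bodies differ.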