[Bug tree-optimization/79262] New: [6/7 Regression] load gap with store gap causing performance regression in 462.libquantum

Sat Jan 28 08:33:00 GMT 2017

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262

            Bug ID: 79262
           Summary: [6/7 Regression] load gap with store gap causing
                    performance regression in 462.libquantum
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
            Blocks: 53947
  Target Milestone: ---
            Target: aarch64

This was reported at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18438#c9, but
what is not mentioned there is that it is a regression from GCC 5.  I noticed
it again while working on the ThunderX 2 CN99xx performance differences
between -O2 and -Ofast, and between GCC 5.4.0 and the trunk.

Take:
struct node_struct
{
  float _Complex gap;
  unsigned long long state;
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}
---- CUT ---
Currently the trunk vectorizes this using loads with gaps, but the store is
then done with scalars.  This is much slower, and it also only processes 2
elements at a time.  There are also some cost model issues in the aarch64
backend dealing with scalar costs for int vs floating point.  I might just go
fix those first.
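
To make the "gap" concrete, here is a minimal, self-contained sketch of the
testcase plus one helper.  The helper state_stride_in_words() is hypothetical
(it is not part of the original testcase), and the result of 2 assumes the
usual layout where float _Complex and unsigned long long are both 8 bytes, as
on aarch64 LP64.  Under that assumption only every other 64-bit word of the
node array is touched, so a vector load brings in the unused gap field along
with state.

```c
#include <stddef.h>

struct node_struct
{
  float _Complex gap;        /* never touched by the loop */
  unsigned long long state;  /* the only field read and written */
};

struct reg_struct
{
  int size;
  struct node_struct *node;
};

/* Same loop as in the testcase above. */
void
func(int target, struct reg_struct *reg)
{
  int i;

  for(i=0; i<reg->size; i++)
    reg->node[i].state ^= ((unsigned long long) 1 << target);
}

/* Hypothetical helper: distance between consecutive state fields,
   measured in 64-bit words.  Assumed to be 2 on aarch64 LP64. */
size_t
state_stride_in_words(void)
{
  return sizeof(struct node_struct) / sizeof(unsigned long long);
}
```

With a stride of 2, a 128-bit vector load covers exactly one node, so only
one of its two 64-bit lanes holds a state value.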

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations