This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Thu, 26 Oct 2017 12:16:27 +0000
Subject: [Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")
Auto-submitted: auto-generated
References: <bug-82729-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Jakub Jelinek from comment #4)
> As for this exact ones, I'm now working on GIMPLE store merging
> improvements, but that of course won't handle this case.
> For RTL I had code to handle this at RTL DSE time, see PR22141 and
> https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
> The problem was that the patch caused performance regressions on PowerPC and
> it was hard to find a good cost model for it.  Of course, for -Os the cost
> model would be quite simple, but although you count instructions, you were
> reporting this for -O3.

Yeah, fewer total stores, fewer instructions, and smaller code size *are* what
make this better for performance.  An 8-byte store that doesn't cross a
cache-line boundary has nearly identical cost to a 1-byte store, at least on
Intel.

x86 is robust with overlapping stores, although store-forwarding only works for
loads that get all their data from one store (and even then some CPUs have some
alignment restrictions for the load relative to the store).  Still, that
generally means that fewer wider stores are better, because most CPUs can
forward from a 4B store to a byte reload of any of those 4 bytes.
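
To make the comparison concrete, here is a minimal sketch of the two
initialization strategies.  The struct and all function names are hypothetical
stand-ins (a struct is used only to force the two arrays to be adjacent, which
real locals are not guaranteed to be), and whether the memcpy really becomes a
single 4-byte store depends on the compiler and options.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for two adjacent small locals, e.g.
   char a[] = "a"; char b[] = "b"; placed next to each other. */
struct small_objs {
    char a[2];
    char b[2];
};

/* The merged store only makes sense if the arrays are contiguous. */
_Static_assert(sizeof(struct small_objs) == 4, "expected no padding");

/* One store per byte: roughly what unmerged initialization looks like. */
static void init_bytewise(struct small_objs *s)
{
    s->a[0] = 'a'; s->a[1] = '\0';
    s->b[0] = 'b'; s->b[1] = '\0';
}

/* One 4-byte store covering both objects.  Building the value with memcpy
   avoids aliasing and endianness assumptions; compilers typically lower a
   fixed-size 4-byte memcpy to a single store, but that is not guaranteed. */
static void init_merged(struct small_objs *s)
{
    static const char image[4] = { 'a', '\0', 'b', '\0' };
    uint32_t wide;
    memcpy(&wide, image, sizeof wide);
    memcpy(s, &wide, sizeof wide);
}

/* Byte reload of data written by the wide store: the case where
   forwarding from a 4B store to a 1B load matters. */
static char reload_b0_after_merged_store(void)
{
    struct small_objs s;
    init_merged(&s);
    return s.b[0];
}

/* Both strategies must leave identical bytes in memory. */
static void self_check(void)
{
    struct small_objs x, y;
    init_bytewise(&x);
    init_merged(&y);
    assert(memcmp(&x, &y, sizeof x) == 0);
    assert(reload_b0_after_merged_store() == 'b');
}
```

Comparing the asm for init_bytewise and init_merged at -O3 is one way to see
whether the compiler already merges the byte stores on a given target.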


> Doing this at GIMPLE time is impossible, because it is extremely complex
> where exactly the variables are allocated, depends on many flags etc. (e.g.
> -fsanitize=address allocates pads in between them, some targets allocate
> them from top to bottom, others the other way around, ...),

Is allocation order fixed for a given target?  Ideally we'd allocate locals to
pack them together well to avoid wasted padding, and/or put ones used together
next to each other for possible SIMD (including non-loop XMM stuff like a pair
of `double`s or copying a group of integer locals into a struct).  (In case of
a really large local array, you want variables used together in the same page
and same cache line.)

Considering all the possibilities might be computationally infeasible though,
especially if the typical gains are small.

> -fstack-protector* might protect some but not others and thus allocate in
> different buckets, alignment could play roles etc.

Anyway, it sounds like it would make more sense to look for possibilities like
this in RTL when deciding how to lay out the local variables.  For x86 it seems
gcc sorts them by size?  Changing the order of declaration changes the order of
the stores, but not the locations.
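
A rough way to check the layout observation on a given compiler is to measure
the distance between two locals under both declaration orders.  This is only a
sketch: the function names are hypothetical, the result is entirely
implementation-dependent (target, optimization level, -fstack-protector*,
-fsanitize=address), and taking the addresses forces the arrays into memory,
which may itself influence what the compiler does.

```c
#include <stdint.h>
#include <stdio.h>

/* Byte distance from a to b when a is declared first. */
static intptr_t gap_a_then_b(void)
{
    char a[] = "a";
    char b[] = "b";
    /* Convert to integers first: subtracting pointers to unrelated
       objects directly is not defined by the C standard. */
    return (intptr_t)((uintptr_t)b - (uintptr_t)a);
}

/* Same measurement with the declaration order swapped. */
static intptr_t gap_b_then_a(void)
{
    char b[] = "b";
    char a[] = "a";
    return (intptr_t)((uintptr_t)b - (uintptr_t)a);
}

/* Print both gaps; equal values suggest that declaration order does
   not affect where the compiler places these locals. */
static void report_layout(void)
{
    printf("a declared first: b - a = %ld\n", (long)gap_a_then_b());
    printf("b declared first: b - a = %ld\n", (long)gap_b_then_a());
}
```

Calling report_layout() from main and rebuilding with different options shows
how much the placement varies, which is the difficulty with deciding this
before the stack layout is known.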

References:
- [Bug tree-optimization/82729] New: adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")
  - From: peter at cordes dot ca
