As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2241339 when ceph is compiled with LTO on aarch64 _M_emplace_equal<std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*>&> is changed by SRA to only store 64 + 32 bits into the std::pair rather than 64 + 64 bits. Reproducer: ./xg++ -B ./ -O2 -flto=auto -ffat-lto-objects -fexceptions -g -Wall -Wno-complain-wrong-lang -Werror=format-security -fstack-protector-strong -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -std=c++20 -fPIC -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -fno-strict-aliasing -fsigned-char -Wtype-limits -Wignored-qualifiers -Wpointer-arith -Werror=format-security -Winit-self -Wno-unknown-pragmas -Wnon-virtual-dtor -Wno-ignored-qualifiers -ftemplate-depth-1024 -Wpessimizing-move -Wredundant-move -Wstrict-null-sentinel -Woverloaded-virtual -fstack-protector-strong -fdiagnostics-color=auto -c Timer.ii ./xg++ -B ./ -O2 -flto=auto -ffat-lto-objects -fexceptions -g -Wall -Wno-complain-wrong-lang -Werror=format-security -fstack-protector-strong -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -std=c++20 -fPIC -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -fno-strict-aliasing -fsigned-char -Wtype-limits -Wignored-qualifiers -Wpointer-arith -Werror=format-security -Winit-self -Wno-unknown-pragmas -Wnon-virtual-dtor -Wno-ignored-qualifiers -ftemplate-depth-1024 -Wpessimizing-move -Wredundant-move -Wstrict-null-sentinel -Woverloaded-virtual -fstack-protector-strong -fdiagnostics-color=auto -c SloppyCRCMap.ii ./xg++ -B ./ -O2 -c CrtStuff.i ./xg++ -B ./ -flto=auto -shared -o out.so SloppyCRCMap.o Timer.o CrtStuff.o -nostdlib grep_cleanup() { c++filt | grep -A16 '_M_emplace_equal<std::pair<std::chrono.*>:' | sed 's/[^:]*: *//' } diff -u <(aarch64-linux-gnu-objdump -d Timer.o | grep_cleanup) <(aarch64-linux-gnu-objdump -d out.so | grep_cleanup) which prints --- /dev/fd/63 2024-01-12 19:24:37.317433462 +0100 +++ /dev/fd/62 2024-01-12 19:24:37.318433448 +0100 @@ -7,11 +7,11 @@ d2800600 mov x0, #0x30 // #48 f90013f5 str x21, [sp, #32] aa0103f5 mov x21, x1 - 94000000 bl 0 <operator new(unsigned long)> + 97fff0bc bl 6ee0 <operator new(unsigned long)@plt> aa0003f4 mov x20, x0 f9400a62 ldr x2, [x19, #16] 91002263 add x3, x19, #0x8 f94002a7 ldr x7, [x21] f9001007 str x7, [x0, #32] - f94006a0 ldr x0, [x21, #8] - f9001680 str x0, [x20, #40] + b9400aa1 ldr w1, [x21, #8] + b9002801 str w1, [x0, #40] i.e. the -ffat-lto-objects non-LTO compilation stores 64-bit pointers at the new returned pointer + 32 and + 40, while LTO stores 64-bit pointer only to the former and 32-bit into the latter. In *.cplxlower1 I still see MEM[(struct pair *)_37 + 32B] = ISRA.253; where struct pair ISRA.253; but sra has: Created a replacement for ISRA.253 offset: 0, size: 64: SR.256D.22298 Created a replacement for ISRA.253 offset: 64, size: 32: SR.257D.22299 From what I can see, that should be std::multimap<clock_t::time_point, Context*>::value_type, i.e. std::pair<const clock_t::time_point, Context*>
Created attachment 57056 [details] Timer.ii.xz
Created attachment 57057 [details] SloppyCRCMap.ii.xz
Created attachment 57058 [details] CrtStuff.i.xz
Will try to reproduce with trunk next (unfortunately due to the builtin trait changes the GCC 13 preprocessed sources can't be compiled with GCC 14).
Created attachment 57060 [details] Timer.ii.xz from GCC 14
Created attachment 57061 [details] SloppyCRCMap.ii.xz When using current GCC trunk instead of current GCC 13 branch (CrtStuff.i.xz is the same between both), the problematic optimization doesn't happen, both before SRA and after it the code has MEM[(struct pair *)_5 + 32B] = MEM[(const struct pair &)__args#0_4(D)];
I am going to double check something but it might be related to PR 112616 .
All these might also point to IPA mod-ref issues.
SRA creates the replacements (in GCC 13) during total scalarization, i.e. the bit that is not driven by pre-existing accesses to aggregates, but because it sees an aggregate that is small and regular and so it is split according to its type in the hope it will go away. Unfortunately in the LTO and non-LTO case, they see a different type. I have added a dumping of types and fields of totally scalarized records and got the following. In the non-LTO case, the type of the aggregate is: <record_type 0xffff553cabd0 pair readonly sizes-gimplified cxx-odr-p type_1 type_5 type_6 TI size <integer_cst 0xffff6fe96090 type <integer_type 0xffff6fe9c0a8 bitsizetype> constant 128> unit-size <integer_cst 0xffff6fe960a8 type <integer_type 0xffff6fe9c000 sizetype> constant 16> align:64 warn_if_not_align:0 symtab:1430035184 alias-set -1 canonical-type 0xffff553cabd0 ... and specifically its third field is a pointer: <field_decl 0xffff55161ab0 second type <pointer_type 0xffff4cf303f0 type <record_type 0xffff4cf30348 Context cxx-odr-p type_1 type_4 type_5 type_6 VOID align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xffff562720a8 context <translation_unit_decl 0xffff6fe9a000 /home/jakub/rpmbuild/BUILD/ceph-18.2.1/src/common/Timer.cc> pointer_to_this <pointer_type 0xffff4cf303f0>> unsigned DI size <integer_cst 0xffff6fe96048 constant 64> unit-size <integer_cst 0xffff6fe96060 constant 8> align:64 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xffff562729d8 reference_to_this <reference_type 0xffff4cf6c738>> used unsigned nonlocal decl_3 DI /usr/include/c++/13/bits/stl_pair.h:194:11 size <integer_cst 0xffff6fe96048 type <integer_type 0xffff6fe9c0a8 bitsizetype> constant 64> unit-size <integer_cst 0xffff6fe96060 type <integer_type 0xffff6fe9c000 sizetype> constant 8> align:64 warn_if_not_align:0 offset_align 128 decl_not_flexarray: 0 offset <integer_cst 0xffff6fe96078 type <integer_type 0xffff6fe9c000 sizetype> constant 0> bit-offset <integer_cst 0xffff6fe96048 type <integer_type 0xffff6fe9c0a8 bitsizetype> constant 64> context <record_type 0xffff553ba540 pair>> However, in the LTO case the type of the aggregate is: <record_type 0xffff61cc1498 pair cxx-odr-p TI size <integer_cst 0xffff623fdd08 type <integer_type 0xffff624100a8 bitsizetype> constant 128> unit-size <integer_cst 0xffff623fdd20 type <integer_type 0xffff62410000 sizetype> constant 16> align:64 warn_if_not_align:0 symtab:0 alias-set 98 canonical-type 0xffff61cc1498 ... which however has an unsigned int as its third field: <field_decl 0xffff61cc05f0 second type <integer_type 0xffff62410690 unsigned int public unsigned SI size <integer_cst 0xffff623fdf00 constant 32> unit-size <integer_cst 0xffff623fdf18 constant 4> align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0xffff62410690 precision:32 min <integer_cst 0xffff623fdf30 0> max <integer_cst 0xffff623fdee8 4294967295> pointer_to_this <pointer_type 0xffff61cf50a8> reference_to_this <reference_type 0xffff618fd5e8>> unsigned nonlocal SI /usr/include/c++/13/bits/stl_pair.h:194:11 size <integer_cst 0xffff623fdf00 type <integer_type 0xffff624100a8 bitsizetype> constant 32> unit-size <integer_cst 0xffff623fdf18 type <integer_type 0xffff62410000 sizetype> constant 4> align:32 warn_if_not_align:0 offset_align 128 decl_not_flexarray: 0 offset <integer_cst 0xffff623fdcf0 type <integer_type 0xffff62410000 sizetype> constant 0> bit-offset <integer_cst 0xffff623fdcc0 type <integer_type 0xffff624100a8 bitsizetype> constant 64> context <record_type 0xffff61cc1498 pair>> An so only an unsigned int replacement is created. The name of the aggregate indicates it has been created by IPA-SRA and so that is where I am looking right now, but IPA-SRA simply takes (and streams) the type of the access in the original function body for these. Can't this perhaps be some type-merging issue?
I see the 'pair' type is marked TYPE_CXX_ODR_P, I'd say you should see a ODR type violation diagnostic, and if you don't, this means we force different alias sets for both? Not sure - Honza added this stuff. It only affects TYPE_CANONICAL though, regular type merging shouldn't merge them but it's likely that you get to see another type because of COMDATs and symbol merging chosing a different prevailing function which has that other type? Btw, can you dump the mangled name of the type? It should be type_with_linkage_p () I think, of course 'pair' itself is a template so only a specific instantiation should be subject to ODR. (of course there might be ODR functions that use different instantiated pair in the signature ..)
If there are two ODR types with same ODR name one with integer and other with pointer types third field, then indeed we should get ODR warning and give up on handling them as ODR types for type merging. So dumping their assembler names would be useful starting point. Of course if you have two ODR types of different names but you mix them up in COMDAT function of same name, then the warning will not trigger, so this might be some missing type compatibility check in ipa-sra or ipa-prop summary, too.
Just going from the demangled name of std::pair<std::chrono::time_point<ceph::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, Context*> it would surprise me if it was ODR violation in the testcase, because class Context is only defined in Timer.ii, not in the other preprocessed source, ceph::mono_clock is defined in both but looks the same (and it is empty class anyway), and the pair uses Context* as second type anyway, so it is unclear how it could become something smaller than pointer. But, admittedly I haven't looked up at this under the debugger and not even sure where to look at that.
On Tue, 6 Feb 2024, jakub at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359 > > --- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> --- > Just going from the demangled name of > std::pair<std::chrono::time_point<ceph::mono_clock, > std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const, > Context*> > it would surprise me if it was ODR violation in the testcase, because class > Context > is only defined in Timer.ii, not in the other preprocessed source, > ceph::mono_clock is defined in both but looks the same (and it is empty class > anyway), and the pair uses Context* as second type anyway, so it is unclear how > it could become something smaller than pointer. > But, admittedly I haven't looked up at this under the debugger and not even > sure where to look at that. Might be also an interaction with IPA ICF in case there's a pointer to the pair involved?
(In reply to rguenther@suse.de from comment #13) > Might be also an interaction with IPA ICF in case there's a pointer to > the pair involved? Yes, this is exactly what seems to be happening. The problem goes away with -fno-icf. (Possibly because the testcase uses -fno-strict-aliasing,) IPA-ICF merges two functions which copy a structure and that access type it what IPA-SRA saves, but loads only the one of the merged functions. SRA then uses the (wrong) type to split aggregate copies into copies by individual fields. I have talked to Honza about this. It seems that IPA-ICF needs to be careful about aggreage with holes in different places. The ideal next step would be to create a testcase not dependent on IPA-SRA.
Created attachment 57462 [details] Simple testcase (needs disabling early - and only early - SRA) This is a simpler testcase which exhibits the problem on x86_64-linux and current master. Steps to reproduce: $ ~/gcc/trunk/inst/bin/gcc -O2 -fno-strict-aliasing -fno-ipa-cp --disable-tree-esra -flto pr113359.c -c -o 1.o cc1: note: disable pass tree-esra for functions in the range of [0, 4294967295] $ ~/gcc/trunk/inst/bin/gcc -O2 -fno-strict-aliasing -fno-ipa-cp --disable-tree-esra -flto -DFILE2 pr113359.c -c -o 2.o cc1: note: disable pass tree-esra for functions in the range of [0, 4294967295] $ ~/gcc/trunk/inst/bin/gcc -flto 1.o 2.o -o test.exe $ ./test.exe Aborted (core dumped) If you add -fno-ipa-icf to the "compilation" commands, the test will pass. Late (post ICF) intra-procedural SRA is necessary to exhibit the problem. On the other hand, early SRA must be suppressed or it will scalarize the aggregate assignment too early and the results will look different to IPA-ICF. Instead of using --disable-tree-esra we could pass the address of tmp in both geta() and getb() to an empty function coming from a third compilation unit. Disabling strict aliasing is also necessary to show the problem, with strict aliasing IPA-ICF takes the alias class of types into acount when hashing and considers geta() and getb() different from the start.
*** Bug 114263 has been marked as a duplicate of this bug. ***
There's another small testcase in PR114263.
Confirmed.
Should we update target and summary to also include x86_64?
Go right ahead. I'm mostly trying to get things in the right broad buckets. So if you've got additional information, please add it.
Martin, please work on this together with Honza (might be an actual duplicate).
Created attachment 57828 [details] Potential fix I'm testing this patch
The patch looks reasonable. We probably could hash the padding vectors at summary generation time to reduce WPA overhead, but that can be done incrementally next stage1. I however wonder if we really guarantee to copy the paddings everywhere else then the total scalarization part? (i.e. in all paths through the RTL expansion)
(In reply to Jan Hubicka from comment #23) > I however wonder if we really guarantee to copy the paddings everywhere else > then the total scalarization part? > (i.e. in all paths through the RTL expansion) I wanted that we sometimes don't do that in PR 80689 and the idea was refused. And as far as I can recall the code I don't think we do. Anyway, I have sent the patch to the mailing list: https://inbox.sourceware.org/gcc-patches/ri6jzlc25db.fsf@virgil.suse.cz/T/#u
The master branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>: https://gcc.gnu.org/g:1e3312a25a7b34d6e3f549273e1674c7114e4408 commit r14-9841-g1e3312a25a7b34d6e3f549273e1674c7114e4408 Author: Martin Jambor <mjambor@suse.cz> Date: Mon Apr 8 18:53:23 2024 +0200 ICF&SRA: Make ICF and SRA agree on padding PR 113359 shows that (at least with -fno-strict-aliasing) ICF can unify two functions which copy an aggregate type of the same size but then SRA, through its total scalarization, can copy the aggregate by pieces, skipping paddding, but the padding was not the same in the two original functions that ICF unified. This patch enhances SRA with the ability to collect padding information which then can be compared from within ICF. Unfortunately SRA uses OPTION_SET_P when determining its limits, so ICF needs to switch cfuns at least once to figure it out too. gcc/ChangeLog: 2024-03-27 Martin Jambor <mjambor@suse.cz> PR ipa/113359 * ipa-icf-gimple.h (func_checker): New members safe_for_total_scalarization_p, m_total_scalarization_limit_known_p and m_total_scalarization_limit. (func_checker::func_checker): Initialize new member variables. * ipa-icf-gimple.cc: Include tree-sra.h. (func_checker::func_checker): Initialize new member variables. (func_checker::safe_for_total_scalarization_p): New function. (func_checker::compare_operand): Use the new function. * tree-sra.h (sra_get_max_scalarization_size): Declare. (sra_total_scalarization_would_copy_same_data_p): Likewise. * tree-sra.cc (prepare_iteration_over_array_elts): New function. (class sra_padding_collecting): New. (sra_padding_collecting::record_padding): Likewise. (scalarizable_type_p): Rename to totally_scalarizable_type_p. Add ability to record padding when requested. (totally_scalarize_subtree): Split out gathering information necessary to iterate over array elements to prepare_iteration_over_array_elts. Fix errornous early exit. (analyze_all_variable_accesses): Adjust the call to totally_scalarizable_type_p. Move determining of total scalariation size limit... (sra_get_max_scalarization_size): ...here. (check_ts_and_push_padding_to_vec): New function. (sra_total_scalarization_would_copy_same_data_p): Likewise. gcc/testsuite/ChangeLog: 2024-03-27 Martin Jambor <mjambor@suse.cz> PR ipa/113359 * gcc.dg/lto/pr113359-1_0.c: New. * gcc.dg/lto/pr113359-1_1.c: Likewise. * gcc.dg/lto/pr113359-2_0.c: Likewise. * gcc.dg/lto/pr113359-2_1.c: Likewise. * gcc.dg/lto/pr113359-3_0.c: Likewise. * gcc.dg/lto/pr113359-3_1.c: Likewise. * gcc.dg/lto/pr113359-4_0.c: Likewise. * gcc.dg/lto/pr113359-4_1.c: Likewise. * gcc.dg/lto/pr113359-5_0.c: Likewise. * gcc.dg/lto/pr113359-5_1.c: Likewise.
This should be fixed on master, I'll backport the fix in a few weeks to at least gcc-13 where it was reported.
.
The releases/gcc-13 branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>: https://gcc.gnu.org/g:10bf53a80eefa46500bffb442719777e2640e7d7 commit r13-8773-g10bf53a80eefa46500bffb442719777e2640e7d7 Author: Martin Jambor <mjambor@suse.cz> Date: Mon Apr 8 18:53:23 2024 +0200 ICF&SRA: Make ICF and SRA agree on padding PR 113359 shows that (at least with -fno-strict-aliasing) ICF can unify two functions which copy an aggregate type of the same size but then SRA, through its total scalarization, can copy the aggregate by pieces, skipping paddding, but the padding was not the same in the two original functions that ICF unified. This patch enhances SRA with the ability to collect padding information which then can be compared from within ICF. Unfortunately SRA uses OPTION_SET_P when determining its limits, so ICF needs to switch cfuns at least once to figure it out too. gcc/ChangeLog: 2024-03-27 Martin Jambor <mjambor@suse.cz> PR ipa/113359 * ipa-icf-gimple.h (func_checker): New members safe_for_total_scalarization_p, m_total_scalarization_limit_known_p and m_total_scalarization_limit. (func_checker::func_checker): Initialize new member variables. * ipa-icf-gimple.cc: Include tree-sra.h. (func_checker::func_checker): Initialize new member variables. (func_checker::safe_for_total_scalarization_p): New function. (func_checker::compare_operand): Use the new function. * tree-sra.h (sra_get_max_scalarization_size): Declare. (sra_total_scalarization_would_copy_same_data_p): Likewise. * tree-sra.cc (prepare_iteration_over_array_elts): New function. (class sra_padding_collecting): New. (sra_padding_collecting::record_padding): Likewise. (scalarizable_type_p): Rename to totally_scalarizable_type_p. Add ability to record padding when requested. (totally_scalarize_subtree): Split out gathering information necessary to iterate over array elements to prepare_iteration_over_array_elts. Fix errornous early exit. (analyze_all_variable_accesses): Adjust the call to totally_scalarizable_type_p. Move determining of total scalariation size limit... (sra_get_max_scalarization_size): ...here. (check_ts_and_push_padding_to_vec): New function. (sra_total_scalarization_would_copy_same_data_p): Likewise. gcc/testsuite/ChangeLog: 2024-03-27 Martin Jambor <mjambor@suse.cz> PR ipa/113359 * gcc.dg/lto/pr113359-1_0.c: New. * gcc.dg/lto/pr113359-1_1.c: Likewise. * gcc.dg/lto/pr113359-2_0.c: Likewise. * gcc.dg/lto/pr113359-2_1.c: Likewise. * gcc.dg/lto/pr113359-3_0.c: Likewise. * gcc.dg/lto/pr113359-3_1.c: Likewise. * gcc.dg/lto/pr113359-4_0.c: Likewise. * gcc.dg/lto/pr113359-4_1.c: Likewise. * gcc.dg/lto/pr113359-5_0.c: Likewise. * gcc.dg/lto/pr113359-5_1.c: Likewise. (cherry picked from commit 1e3312a25a7b34d6e3f549273e1674c7114e4408)
Fixed
...so set to fixed as well.
Hello Martin, Any chance the fix that fixes the new test for 32bits can be also backported? 4923ed49b93352bcf9e43cafac38345e4a54c3f8 https://gcc.gnu.org/g:4923ed49b93352bcf9e43cafac38345e4a54c3f8 Not sure why it's not tagged so that it would appear here.
(In reply to Marc Poulhiès from comment #31) > Hello Martin, > > Any chance the fix that fixes the new test for 32bits can be also backported? > > 4923ed49b93352bcf9e43cafac38345e4a54c3f8 > https://gcc.gnu.org/g:4923ed49b93352bcf9e43cafac38345e4a54c3f8 > > Not sure why it's not tagged so that it would appear here. My apologies for not including this commit, I completely forgot about it. Unfortunately I'm afraid it will have to wait until after the 13.3 release, but I will backport it quickly afterwards. Sorry again.
No worries, we've applied it locally. Thanks!