Created attachment 51131 [details]
memset collapsing breaks __builtin_object_size()

I've found a strange misoptimization around memset() and __builtin_object_size(). If there are two memset() calls that can be collapsed (due to being fully overlapping, I assume), use of __builtin_object_size() may return the wrong result. The example code shows that __builtin_object_size(&int_value, 1) returns 1 instead of 4:

> $ gcc -Wall -Wextra -fno-strict-aliasing -fwrapv -O2 -c -o wat.o wat.c
> In function ‘do_wipe’,
>     inlined from ‘loops’ at wat.c:24:3:
> wat.c:15:3: warning: call to ‘__detected_overflow’ declared with attribute warning: detected overflow [-Wattribute-warning]
>    15 |   __detected_overflow(__builtin_object_size(&info->lg, 1), sizeof(info->lg));
>       |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here, info->lg is int, but the call to __builtin_object_size() resolves to the size of info->sm (char). This can be seen directly in the resulting output:

> $ objdump -rd wat.o
> ...
> 0000000000000000 <loops>:
>    0:	53                   	push   %rbx
>    1:	be 04 00 00 00       	mov    $0x4,%esi
>    6:	48 89 fb             	mov    %rdi,%rbx
>    9:	c6 07 00             	movb   $0x0,(%rdi)
>    c:	bf 01 00 00 00       	mov    $0x1,%edi
>   11:	e8 00 00 00 00       	call   16 <loops+0x16>
> 			12: R_X86_64_PLT32	__detected_overflow-0x4
>   16:	c7 03 00 00 00 00    	movl   $0x0,(%rbx)
>   1c:	5b                   	pop    %rbx
>   1d:	c3                   	ret

The first argument to __detected_overflow() is "1", instead of 4.

Any change to this example code makes the bug disappear (removal of loops, removal of the empty asm, or reordering of the memset() calls).

Using Compiler Explorer, this bug appears to have been introduced between GCC 8.5 and 9.1: https://godbolt.org/z/oGq5K9fE4
I think it behaves as expected: __builtin_object_size computes a maximum or minimum value, not an exact value.
Started with r9-2635-g78ea9abc2018243af7f7ada6135144ac90c6ad27

I wonder if the objsz pass, when insert_min_max_p, shouldn't, in addition to adding MIN_EXPR or MAX_EXPR around the result of __bos, also drop the least significant bit from the __bos second argument.
(In reply to Jakub Jelinek from comment #2)
> Started with r9-2635-g78ea9abc2018243af7f7ada6135144ac90c6ad27
> I wonder if objsz pass when insert_min_max_p shouldn't in addition to adding
> MIN_EXPR or MAX_EXPR around the result of __bos also drop the least
> significant bit from the __bos second argument.

But then isn't this wrong as well?

Btw, I never get what the second bit means. It's documented as "The second bit determines if maximum or minimum of remaining bytes is computed.", but does it compute the maximum size when the bit is set, or when it is not set? For the least significant bit the situation is better: "If the least significant bit is clear, objects are whole variables, if it is set, a closest surrounding subobject is considered the object a pointer points to." Unfortunately the examples also only use 0 and 1.
--- gcc/tree-object-size.c.jj	2021-01-04 10:25:39.911221618 +0100
+++ gcc/tree-object-size.c	2021-07-12 11:28:51.328120222 +0200
@@ -1393,6 +1393,11 @@ pass_object_sizes::execute (function *fu
 	    {
 	      tree tem = make_ssa_name (type);
 	      gimple_call_set_lhs (call, tem);
+	      if (object_size_type == 1)
+		{
+		  ost = build_zero_cst (TREE_TYPE (ost));
+		  gimple_call_set_arg (call, 1, ost);
+		}
 	      enum tree_code code
 		= object_size_type == 1 ? MIN_EXPR : MAX_EXPR;
 	      tree cst = build_int_cstu (type, bytes);

works, but unfortunately only for __builtin_object_size (, 1). When I did that also for 3 (to turn it into 2), it breaks several gcc.dg/builtin-object-size*.c tests.

0 is defined as an upper bound of the object size, subobjects not taken into account
1 is defined similarly, but subobjects are taken into account
2 is defined as a lower bound of the object size, subobjects not taken into account
3 is 2 with subobjects

-D_FORTIFY_SOURCE={1,2} typically only uses the 0 and 1 modes; 2 and 3 are rare, and are useful e.g. when somebody wants to find out if the object size is exact or not (for the exact case the upper and lower bounds are equal).
Note, for the above patch I was worried about something like:

struct S { int a; char b[36]; int c; };

static inline __SIZE_TYPE__
foo (char *p)
{
  return __builtin_object_size (p, 1);
}

__SIZE_TYPE__
bar (void)
{
  struct S s;
  return foo (&s.b[0]);
}

with -O2 -fno-early-inlining, but in that case we actually don't perform that at all - the MIN_EXPR is added only if __bos can be computed during the objsz1 pass to something other than the "unknown" value. If it is, then there shouldn't be a pointer passed from the caller as one of the possible values, so maybe it is safe.

Of course this shows that the objsz1 pass is just a mitigation pass, because when the function with the __builtin_object_size calls isn't early inlined (or some caller of that), and ultimately the objsz1 pass isn't able to compute a known value, then any SCCVN pointer replacement with different subobject sizes that could feed a subobject __bos (1 or 3) could change the user-visible value. E.g.

struct S { char b[36]; int c; };
void baz (char *);

static inline __SIZE_TYPE__
foo (char *p)
{
  return __builtin_object_size (p, 1);
}

__SIZE_TYPE__
bar (void)
{
  struct S *p = __builtin_malloc (2 * sizeof (struct S)) + sizeof (struct S);
  baz ((char *) p);
  return foo (&p->b[0]);
}

doesn't reproduce it, we still pass

  p_5 = _1 + 40;
  baz (p_5);
  _2 = &MEM[(struct S *)_1 + 40B].b[0];
  _8 = __builtin_object_size (_2, 1);

and don't figure out that _2 == p_5. But a modified #c0 testcase triggers it at -O2:

typedef __SIZE_TYPE__ size_t;
void baz (int, int) __attribute__((__warning__("detected overflow")));

union U { int i; char c; };

static void
qux (void *p, size_t q)
{
  if (__builtin_object_size (p, 1) < q)
    baz (__builtin_object_size (p, 1), q);
  __builtin_memset (p, 0, q);
}

static void
foo (union U *u)
{
  qux (&u->c, sizeof (u->c));
  qux (&u->i, sizeof (u->i));
}

void
bar (union U *u)
{
  int i, j;
  for (i = 0; i < 1; i++)
    {
      foo (u);
      for (j = 0; j < 2; j++)
	asm volatile ("");
    }
}

and this patch doesn't fix it.
Nothing can "fix" __builtin_object_size here (on sub-objects) without changing how we represent and CSE addresses, especially if you consider inlining, where we want to interpret __builtin_object_size (p, ..) as having passed not literal 'p' but the value 'p' has at this point. Thus any pointer CSE that happens on the value of 'p' before inlining happens will "break" our expectation on it.

So suppose for a moment we'd have

  ADDR_WITH_SIZE_EXPR <obj, size-cst>

which we could lower to just ADDR_EXPR after the final object-size pass. Then during all early opts we couldn't CSE addresses with different sizes or simplify equality conditionals on them. And we'd have to do that everywhere, as for example an LTO link might expose a caller/callee with __builtin_object_size. Maybe we could somehow lower ADDR_WITH_SIZE_EXPRs that do not "escape" (but we'd need to compute that).

That said, my point is that something like __builtin_object_size is quite fundamentally broken [for an optimizing compiler].
It is the cunrolli pass where things go wrong (it VNs the &u->c for both __builtin_object_size calls, while previously only the first one was &u->c and the second was &u->i).

Now, objsz2 obviously needs to be done after IPA opts (most importantly inlining), but what other opt passes we want in between IPA and objsz2 is a question:

  PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations)
      NEXT_PASS (pass_remove_cgraph_callee_edges);
      /* Initial scalar cleanups before alias computation.
	 They ensure memory accesses are not indirect wherever possible.  */
      NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
      NEXT_PASS (pass_ccp, true /* nonzero_p */);
      NEXT_PASS (pass_post_ipa_warn);
      /* After CCP we rewrite no longer addressed locals into SSA
	 form if possible.  */
      NEXT_PASS (pass_complete_unrolli);
      NEXT_PASS (pass_backprop);
      NEXT_PASS (pass_phiprop);
      NEXT_PASS (pass_forwprop);
      NEXT_PASS (pass_object_sizes, false /* insert_min_max_p */);

I bet the rewriting of no-longer-addressed locals into SSA form done for ccp could sometimes be relevant, because when an address is stored into a previously address-taken local and read back, __bos could see it if in SSA form and not otherwise. And in theory cunrolli could reveal some cases, though I'm not sure.

So, either do the TODO_update_address_taken sooner (e.g. at the end of pass_remove_cgraph_callee_edges - does ccp itself bring any address-taken removal opportunities?) and then do objsz2, or schedule another objsz pass with insert_minmax right at the start of pass_all_optimizations? Out of ideas, and it is still just mitigation.

Another way would be to change SCCVN to handle it conservatively for __bos (x, 1), but that could break the rarely used __bos (x, 3) - though it can break randomly for both now. E.g. in the

  _8 = &u_5(D)->c;
  use (_8);
  ...
  _9 = &u_5(D)->i;
  use2 (_9);

case, when objsz2 wasn't done yet (check with a property), and perhaps if any __bos (x, 1) are seen (cfun flag propagated conservatively during inlining), compute __bos (ptr, 1) for both and rewrite the setter to the larger one, i.e. not optimize into:

  _8 = &u_5(D)->c;
  use (_8);
  ...
  use2 (_8);

but

  _8 = &u_5(D)->i;
  use (_8);
  ...
  use2 (_8);
--- gcc/tree-pass.h.jj	2021-01-27 10:10:00.525903635 +0100
+++ gcc/tree-pass.h	2021-07-12 13:10:59.621933276 +0200
@@ -208,6 +208,7 @@ protected:
 #define PROP_gimple_lcf		(1 << 1)	/* lowered control flow */
 #define PROP_gimple_leh		(1 << 2)	/* lowered eh */
 #define PROP_cfg		(1 << 3)
+#define PROP_objsz		(1 << 4)	/* object sizes computed */
 #define PROP_ssa		(1 << 5)
 #define PROP_no_crit_edges	(1 << 6)
 #define PROP_rtl		(1 << 7)
--- gcc/tree-object-size.c.jj	2021-01-04 10:25:39.911221618 +0100
+++ gcc/tree-object-size.c	2021-07-12 13:19:50.021568629 +0200
@@ -1450,6 +1450,8 @@ pass_object_sizes::execute (function *fu
     }
 
   fini_object_sizes ();
+  if (!insert_min_max_p)
+    fun->curr_properties |= PROP_objsz;
   return 0;
 }
--- gcc/tree-ssa-sccvn.c.jj	2021-06-09 10:20:08.988342285 +0200
+++ gcc/tree-ssa-sccvn.c	2021-07-12 13:14:33.482962387 +0200
@@ -925,12 +925,10 @@ copy_reference_ops_from_ref (tree ref, v
 		      + (wi::to_offset (bit_offset) >> LOG2_BITS_PER_UNIT));
 		  /* Probibit value-numbering zero offset components
 		     of addresses the same before the pass folding
-		     __builtin_object_size had a chance to run
-		     (checking cfun->after_inlining does the
-		     trick here).  */
+		     __builtin_object_size had a chance to run.  */
 		  if (TREE_CODE (orig) != ADDR_EXPR
 		      || maybe_ne (off, 0)
-		      || cfun->after_inlining)
+		      || (cfun->curr_properties & PROP_objsz))
 		    off.to_shwi (&temp.off);
 		}
 	    }

seems to work for both testcases.
The above patch will slightly pessimize optimizations during the

      NEXT_PASS (pass_remove_cgraph_callee_edges);
      /* Initial scalar cleanups before alias computation.
	 They ensure memory accesses are not indirect wherever possible.  */
      NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
      NEXT_PASS (pass_ccp, true /* nonzero_p */);
      NEXT_PASS (pass_post_ipa_warn);
      /* After CCP we rewrite no longer addressed locals into SSA
	 form if possible.  */
      NEXT_PASS (pass_complete_unrolli);
      NEXT_PASS (pass_backprop);
      NEXT_PASS (pass_phiprop);
      NEXT_PASS (pass_forwprop);

passes right after IPA, where we currently only pessimize it before and during IPA.
Created attachment 51139 [details]
gcc12-pr101419.patch

Full untested patch.
I think none of

      NEXT_PASS (pass_complete_unrolli);
      NEXT_PASS (pass_backprop);
      NEXT_PASS (pass_phiprop);
      NEXT_PASS (pass_forwprop);

are useful for objsize, so IMHO we should move pass_object_sizes earlier. CCP and its into-SSA rewriting are probably useful though. Like the following (untested):

diff --git a/gcc/passes.def b/gcc/passes.def
index 945d2bc797c..fd38e1f2d8e 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -194,14 +194,14 @@ along with GCC; see the file COPYING3.  If not see
 	 They ensure memory accesses are not indirect wherever possible.  */
       NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
       NEXT_PASS (pass_ccp, true /* nonzero_p */);
-      NEXT_PASS (pass_post_ipa_warn);
       /* After CCP we rewrite no longer addressed locals into SSA
 	 form if possible.  */
+      NEXT_PASS (pass_object_sizes, false /* insert_min_max_p */);
+      NEXT_PASS (pass_post_ipa_warn);
       NEXT_PASS (pass_complete_unrolli);
       NEXT_PASS (pass_backprop);
       NEXT_PASS (pass_phiprop);
       NEXT_PASS (pass_forwprop);
-      NEXT_PASS (pass_object_sizes, false /* insert_min_max_p */);
       /* pass_build_alias is a dummy pass that ensures that we
 	 execute TODO_rebuild_alias at this point.  */
       NEXT_PASS (pass_build_alias);
(In reply to Jakub Jelinek from comment #10)
> Created attachment 51139 [details]
> gcc12-pr101419.patch
> 
> Full untested patch.

I guess that's certainly what was intended (and after the patch more explicit). Whether it's good to not CSE any addresses in unrolled loop bodies is another question (but see my proposed pass order change, which would fix this).
(In reply to Richard Biener from comment #12)
> (In reply to Jakub Jelinek from comment #10)
> > Created attachment 51139 [details]
> > gcc12-pr101419.patch
> > 
> > Full untested patch.
> 
> I guess that's certainly what was intended (and after the patch more
> explicit). Whether it's good to not CSE any addresses in unrolled loop
> bodies is another question (but see my proposed pass order change which
> would fix this).

Note it might "break" GIMPLE FE testcases with startwith after the objsz pass - break in the sense that later FRE will behave differently. Note that usually we still run all property providers, but the way you set the property outside of pass->properties_provided breaks this. Thus maybe split objsz into two separate passes rather than using the flag, so you can use properties_provided.
I agree about most of the passes you are moving, but I have an (albeit artificial) testcase that proves that cunrolli does affect objsz:

__SIZE_TYPE__ a[10];

void
foo (void)
{
  char *p = __builtin_malloc (64);
  for (int i = 0; i < 4; i++)
    {
      a[i] = __builtin_object_size (p, 0);
      p += 6;
    }
}

When objsz is done before cunrolli, a[0] to a[3] will all be 64, while when it is done after cunrolli, it will be 64, 58, 52, 46.

So, what about a mix of your and my patch: add

--- gcc/passes.def
+++ gcc/passes.def
@@ -194,14 +194,14 @@ along with GCC; see the file COPYING3.  If not see
 	 They ensure memory accesses are not indirect wherever possible.  */
       NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
       NEXT_PASS (pass_ccp, true /* nonzero_p */);
-      NEXT_PASS (pass_post_ipa_warn);
       /* After CCP we rewrite no longer addressed locals into SSA
 	 form if possible.  */
+      NEXT_PASS (pass_post_ipa_warn);
       NEXT_PASS (pass_complete_unrolli);
+      NEXT_PASS (pass_object_sizes, false /* insert_min_max_p */);
       NEXT_PASS (pass_backprop);
       NEXT_PASS (pass_phiprop);
       NEXT_PASS (pass_forwprop);
-      NEXT_PASS (pass_object_sizes, false /* insert_min_max_p */);
       /* pass_build_alias is a dummy pass that ensures that we
 	 execute TODO_rebuild_alias at this point.  */
       NEXT_PASS (pass_build_alias);

to my patch?
(In reply to Richard Biener from comment #13)
> Note usually we still run all property providers but the way you set
> the property outside of pass->properties_provided breaks this. Thus
> maybe split objsz into two separate passes rather than using the
> flag so you can use properties_provided.

I wanted to avoid having two separate passes, but if you prefer it, it can be done. It will be a user-visible change in the dumps and for -fdisable-tree-* etc. (though we do such changes all the time).
On Mon, 12 Jul 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101419
> 
> --- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> I agree about most of the passes you are moving, but I have an (albeit
> artificial) testcase that proves that cunrolli does affect objsz:
> __SIZE_TYPE__ a[10];
> 
> void
> foo (void)
> {
>   char *p = __builtin_malloc (64);
>   for (int i = 0; i < 4; i++)
>     {
>       a[i] = __builtin_object_size (p, 0);
>       p += 6;
>     }
> }
> 
> When objsz is done before cunrolli, a[0] to a[3] will all be 64, while
> when it is done after cunrolli, it will be 64, 58, 52, 46.
> 
> So, what about a mix of your and my patch, add
> --- gcc/passes.def
> +++ gcc/passes.def
> @@ -194,14 +194,14 @@ along with GCC; see the file COPYING3.  If not see
>  	 They ensure memory accesses are not indirect wherever possible.  */
>        NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
>        NEXT_PASS (pass_ccp, true /* nonzero_p */);
> -      NEXT_PASS (pass_post_ipa_warn);
>        /* After CCP we rewrite no longer addressed locals into SSA
>  	 form if possible.  */
> +      NEXT_PASS (pass_post_ipa_warn);
>        NEXT_PASS (pass_complete_unrolli);
> +      NEXT_PASS (pass_object_sizes, false /* insert_min_max_p */);
>        NEXT_PASS (pass_backprop);
>        NEXT_PASS (pass_phiprop);
>        NEXT_PASS (pass_forwprop);
> -      NEXT_PASS (pass_object_sizes, false /* insert_min_max_p */);
>        /* pass_build_alias is a dummy pass that ensures that we
>  	 execute TODO_rebuild_alias at this point.  */
>        NEXT_PASS (pass_build_alias);
> to my patch?

Well, my point was to avoid pessimizing the VN done from cunrolli ;) Of course any duplication / threading can improve __bos precision, but then any transform also risks breaking it. Your example is IMHO too artificial as a good argument.
On Mon, 12 Jul 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101419
> 
> --- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #13)
> > Note usually we still run all property providers but the way you set
> > the property outside of pass->properties_provided breaks this. Thus
> > maybe split objsz into two separate passes rather than using the
> > flag so you can use properties_provided.
> 
> I wanted to avoid having two separate passes, but if you prefer it, it can be
> done. Will be a user visible change in the dumps and for -fdisable-tree-* etc.
> (though we do such changes all the time).

Yes, I think it's needed for GIMPLE FE testcase correctness. As for the dumpfile renaming, yeah - that's unfortunate. I'm always hoping somebody bites the bullet and implements

  NEXT_PASS (pass_late_object_size, "objsz2")

aka specifying the dump suffix explicitly.
(In reply to rguenther@suse.de from comment #16)
> Well, my point was to avoid pessimizing the VN done from cunrolli ;)
> Of course any duplication / threading can improve __bos precision,
> but then any transform also risks breaking it. Your example
> is IMHO too artificial as good argument.

Is that VN so important for cunrolli itself (I mean, what harm will there be if it is only VN-simplified during FRE after it)? Does it affect anything but the number of statements that is perhaps used to determine whether to unroll completely or not?

My testcase was artificial, sure, but I was worrying about short loops (say 2 iterations) doing some strcpy/memcpy etc., where having a more precise object size would improve security.

(In reply to rguenther@suse.de from comment #17)
> Yes, I think it's needed for GIMPLE FE testcase correctness.

Ok, I will change it to earlyobjsz and objsz then.
On Mon, 12 Jul 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101419
> 
> --- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #16)
> > Well, my point was to avoid pessimizing the VN done from cunrolli ;)
> > Of course any duplication / threading can improve __bos precision,
> > but then any transform also risks breaking it. Your example
> > is IMHO too artificial as good argument.
> 
> Is that VN so important for cunrolli itself (I mean, what harm will be there if
> it is only VN simplified during FRE after it)? Does it affect anything but the
> number of statements that perhaps is used to determine whether to unroll
> completely or not?

Sure, if you have address comparisons inside the loop body that resolve by unrolling (by making the index into an array part constant), then this changes the size estimate used for further unrolling of outer loops.

> My testcase was artificial, sure, but I was worrying about short loops (say 2
> iterations) doing some strcpy/memcpy etc. where having more precise object size
> would improve security.

Well, we can always find examples where security is improved, and we can find examples where optimization is pessimized. We can't, unfortunately, both have the cake and also eat it.
Created attachment 51144 [details]
gcc12-pr101419.patch

Updated patch.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:dddb6ffdc5c25264dd75ad82dad8e48a0718d2d9

commit r12-2270-gdddb6ffdc5c25264dd75ad82dad8e48a0718d2d9
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jul 13 11:04:22 2021 +0200

    passes: Fix up subobject __bos [PR101419]
    
    The following testcase is miscompiled, because VN during cunrolli
    changes __bos argument from address of a larger field to address of
    a smaller field and so __builtin_object_size (, 1) then folds into
    smaller value than the actually available size.
    copy_reference_ops_from_ref has a hack for this, but it was using
    cfun->after_inlining as a check whether the hack can be ignored, and
    cunrolli is after_inlining.
    
    This patch uses a property to make it exact (set at the end of objsz
    pass that doesn't do insert_min_max_p) and additionally based on
    discussions in the PR moves the objsz pass earlier after IPA.
    
    2021-07-13  Jakub Jelinek  <jakub@redhat.com>
    	    Richard Biener  <rguenther@suse.de>
    
    	PR tree-optimization/101419
    	* tree-pass.h (PROP_objsz): Define.
    	(make_pass_early_object_sizes): Declare.
    	* passes.def (pass_all_early_optimizations): Rename pass_object_sizes
    	there to pass_early_object_sizes, drop parameter.
    	(pass_all_optimizations): Move pass_object_sizes right after pass_ccp,
    	drop parameter, move pass_post_ipa_warn right after that.
    	* tree-object-size.c (pass_object_sizes::execute): Rename to...
    	(object_sizes_execute): ... this.  Add insert_min_max_p argument.
    	(pass_data_object_sizes): Move after object_sizes_execute.
    	(pass_object_sizes): Likewise.  In execute method call
    	object_sizes_execute, drop set_pass_param method and insert_min_max_p
    	non-static data member and its initializer in the ctor.
    	(pass_data_early_object_sizes, pass_early_object_sizes,
    	make_pass_early_object_sizes): New.
    	* tree-ssa-sccvn.c (copy_reference_ops_from_ref): Use
    	(cfun->curr_properties & PROP_objsz) instead of cfun->after_inlining.
    
    	* gcc.dg/builtin-object-size-10.c: Pass -fdump-tree-early_objsz-details
    	instead of -fdump-tree-objsz1-details in dg-options and adjust names
    	of dump file in scan-tree-dump.
    	* gcc.dg/pr101419.c: New test.
Fixed on the trunk. Unsure about backports.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.