Bug 36861

Summary: [4.5/4.6 Regression] boost's compressed avl confuses GCC
Product: gcc Reporter: Lothar Werzinger <lothar>
Component: targetAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED INVALID    
Severity: normal CC: aldot, fang, gcc-bugs, igaztanaga, pinskia, rguenth
Priority: P2 Keywords: alias, wrong-code
Version: 4.3.1   
Target Milestone: 4.5.3   
Host: Target: x86_64-pc-linux-gnu
Build: Known to work: 4.2.3, 4.3.3, 4.3.4, 4.4.0, 4.4.3
Known to fail: 4.3.2, 4.5.0 Last reconfirmed: 2010-07-05 15:40:28
Bug Depends on: 21485, 37795    
Bug Blocks:    
Attachments: test case to reproduce the problem
updated testcase
the preprocessed file for case that is slower
tar file with oprofile logs
oprofile logs
unincluded testcase

Description Lothar Werzinger 2008-07-17 07:27:43 UTC
gcc 4.3.1 generates code with -O3 that is more than 10 times slower than with -O0. gcc 4.2.3 does not show this behavior. I am going to attach the test case that I used to produce these numbers:

with gcc 4.2.3:
( time ./avltest-nn >/dev/null; time ./avltest-no >/dev/null; time ./avltest-on >/dev/null; time ./avltest-oo >/dev/null; )

real    0m1.112s
user    0m1.028s
sys     0m0.012s

real    0m0.078s
user    0m0.060s
sys     0m0.008s

real    0m0.940s
user    0m0.916s
sys     0m0.008s

real    0m0.088s
user    0m0.072s
sys     0m0.004s


with gcc 4.3.1:
( time ./avltest-nn >/dev/null; time ./avltest-no >/dev/null; time ./avltest-on >/dev/null; time ./avltest-oo >/dev/null; )

real    0m0.974s
user    0m0.948s
sys     0m0.004s

real    0m12.936s
user    0m12.893s
sys     0m0.016s

real    0m1.128s
user    0m1.012s
sys     0m0.000s

real    0m0.104s
user    0m0.076s
sys     0m0.004s

Here is the information about the compilers used:

$ /opt2/linux/ix86/bin/g++-4.2.3 -v
Using built-in specs.
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-4.2.3/configure --program-suffix=-4.2.3 --enable-__cxa_atexit --enable-languages=c,c++,java --prefix=/opt2/linux/ix86 --target=x86_64-pc-linux-gnu --with-sysroot=/opt2/linux/ix86/gcc-sysroot --enable-version-specific-runtime-libs --enable-clocale=gnu
Thread model: posix
gcc version 4.2.3

$ /opt2/linux/ix86/bin/g++-4.3.1 -v
Using built-in specs.
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-4.3.1/configure --enable-targets=all --enable-multilib --enable-__cxa_atexit --enable-languages=c,c++,java --enable-version-specific-runtime-libs --disable-nls --enable-clocale=gnu --program-suffix=-4.3.1 --prefix=/opt2/linux/ix86 --target=i686-pc-linux-gnu --target=x86_64-pc-linux-gnu --with-sysroot=/opt2/linux/ix86/gcc-sysroot
Thread model: posix
gcc version 4.3.1 (GCC)
Comment 1 Lothar Werzinger 2008-07-17 07:28:42 UTC
Created attachment 15921 [details]
test case to reproduce the problem
Comment 2 Lothar Werzinger 2008-07-17 16:54:24 UTC
Created attachment 15923 [details]
updated testcase

changed the test case to resemble my application a bit closer. Here it is almost 40 times slower with -O3 than with -O0

The other strange thing with our app (that I can NOT recreate with a test case so far) is that if I use -DUSE_OPTIMIZATION=1 in my app the -O3 compiled version seems to go into an infinite loop.


real    0m1.586s
user    0m0.988s
sys     0m0.020s

real    0m40.963s
user    0m40.655s
sys     0m0.048s

real    0m2.457s
user    0m1.976s
sys     0m0.020s

real    0m0.291s
user    0m0.164s
sys     0m0.020s
Comment 3 Richard Biener 2008-07-17 17:42:50 UTC
Please provide preprocessed source for the testcase which you get when
appending -save-temps to the compilation command.  It will be called
avltest.ii.
Comment 4 Andrew Pinski 2008-07-17 17:47:46 UTC
Does -O3 -fno-tree-vectorize cause the speed back to normal?

Not everyone has boost installed :).
Comment 5 Lothar Werzinger 2008-07-17 17:50:36 UTC
Created attachment 15924 [details]
the preprocessed file for case that is slower
Comment 6 Lothar Werzinger 2008-07-17 17:52:50 UTC
in reply to #4:
-fno-tree-vectorize does not help

$ /opt2/linux/ix86/bin/g++-4.3.1 -m64 -O3 -fno-tree-vectorize -g -save-temps -I/opt2/linux/x86_64/include/boost-1_35 -DUSE_OPTIMIZATION=0 -o ./avltest-no ./avltest.cpp
./avltest.cpp:115:2: warning: #warning without optimization
$ time ./avltest-no >/dev/null

real    0m40.528s
user    0m39.902s
sys     0m0.036s
Comment 7 Richard Biener 2008-07-17 18:01:40 UTC
pointer_plus_2_bits will confuse points-to alias analysis galore.  Of course
this doesn't explain the difference to -O0.

Still, -O1 is fast, -O1 -fstrict-aliasing is _very_ slow.
Comment 8 Lothar Werzinger 2008-07-17 18:09:29 UTC
If you have questions about the internal implementation of the boost avl tree you can contact Ion Gaztañaga igaztanaga at gmail dot com. He is aware of this problem already.
Comment 9 Lothar Werzinger 2008-07-17 20:15:08 UTC
in reply to #7:

Does this mean the status is no longer UNCONFIRMED?
Comment 10 Richard Biener 2008-07-17 20:19:47 UTC
No, it's still not analyzed.
Comment 11 Andrew Pinski 2008-07-22 21:43:37 UTC
I think someone needs to kick the hell out of boost for making crappy code.  I am going to remove the regression marker for now because it is definitely a bit weird what boost is doing and not normal for real code.
Comment 12 Lothar Werzinger 2008-07-22 21:46:19 UTC
Created attachment 15943 [details]
tar file with oprofile logs

I ran both versions (gcc 4.3.1 and gcc 4.2.3) with oprofile and attached the ouput from opannotate --source --assembly 

What I find interesting is that there is this line

 67350 13.6008 :  403944:       jl     403970 
<_ZN5boost9intrusive12avl_set_implINS0_10avl_setoptINS0_6detail16base_hook_traitsI4NodeNS0_19avltree_node_traitsINS_12interprocess10offset_ptrIvEELb0EEELNS0_14link_mode_typeE1ENS0_11default_tagELi6EEESt4lessIS5_EmLb1EEEE4findERKS5_+0x240>


is 'jl' not assembly for 'jump long'

why would a jump instruction have such a long execution time? It's the longest 
time in both files (see attached grep file):

grep jl *oprofile.txt > ./jl.oprofile.txt
Comment 13 Lothar Werzinger 2008-07-22 21:48:54 UTC
(In reply to comment #11)
From an application perspective it is still a regression, as it works happily with gcc 4.2.3
Comment 14 Andrew Pinski 2008-07-22 21:52:33 UTC
(In reply to comment #13)
> (In reply to comment #11)
> From an application perspective it is still a regression, as it works happily
> with gcc 4.2.3

Yes but it is boost's fault that boost tries to be smart about space optimizations when they should not have.  It really does confuse the hell out of compilers, doing what they are doing.  C/C++ was not designed the way they think it was.
Comment 15 Lothar Werzinger 2008-07-22 22:01:00 UTC
What I am worried about is that if it is not marked as a regression nobody cares to fix it although it breaks a real application that works with gcc 4.3.2.

If you have any insights how to improve the boost intrusive library to avoid confusing gcc I am sure the boost people are happy to hear them and to incorporate them in the upcoming boost release.
Comment 16 Lothar Werzinger 2008-07-22 22:22:04 UTC
I looked at the oprofile logs in more detail and the significant difference is in the static pointer to_value_ptr(node_ptr n) function:

there are two consecutive if(!result) blocks

and as you can see the 4.3.1 one 
67350 13.6008 :  403944:	jl     403970

last more than 13 seconds, where the 4.2.3 one
   13  0.8820 :  401a7e:	jl     401aa7

only lasts 0.88 seconds

I hope this helps in pinpointing the problem


GCC-4.3.1:
               :    if(!result)
               :    {
               :      result = ( equal && rc_left.k < rc_right.k );
   491  0.0992 :  403940:	mov    (%rbx),%eax
               :      equal  = ( equal && rc_left.k == rc_right.k );
               :    }
               :    
               :    if(!result)
   294  0.0594 :  403942:	cmp    %eax,(%rdx)
 67350 13.6008 :  403944:	jl     403970 <_ZN5boost9intrusive12avl_set_implINS0_10avl_setoptINS0_6detail16base_hook_traitsI4NodeNS0_19avltree_node_traitsINS_12interprocess10offset_ptrIvEELb0EEELNS0_14link_mode_typeE1ENS0_11default_tagELi6EEESt4lessIS5_EmLb1EEEE4findERKS5_+0x240>
               :    {
               :      result = ( equal && rc_left.a < rc_right.a );
 13650  2.7565 :  403946:	jne    403a30 <_ZN5boost9intrusive12avl_set_implINS0_10avl_setoptINS0_6detail16base_hook_traitsI4NodeNS0_19avltree_node_traitsINS_12interprocess10offset_ptrIvEELb0EEELNS0_14link_mode_typeE1ENS0_11default_tagELi6EEESt4lessIS5_EmLb1EEEE4findERKS5_+0x300>
 13187  2.6630 :  40394c:	mov    0x8(%rbx),%rsi
   123  0.0248 :  403950:	cmp    %rsi,0x8(%rdx)
  2762  0.5578 :  403954:	jb     403970 <_ZN5boost9intrusive12avl_set_implINS0_10avl_setoptINS0_6detail16base_hook_traitsI4NodeNS0_19avltree_node_traitsINS_12interprocess10offset_ptrIvEELb0EEELNS0_14link_mode_typeE1ENS0_11default_tagELi6EEESt4lessIS5_EmLb1EEEE4findERKS5_+0x240>
 12904  2.6059 :  403956:	xchg   %ax,%ax
               :      equal  = ( equal && rc_left.a == rc_right.a );
   454  0.0917 :  403958:	jne    403a30 <_ZN5boost9intrusive12avl_set_implINS0_10avl_setoptINS0_6detail16base_hook_traitsI4NodeNS0_19avltree_node_traitsINS_12interprocess10offset_ptrIvEELb0EEELNS0_14link_mode_typeE1ENS0_11default_tagELi6EEESt4lessIS5_EmLb1EEEE4findERKS5_+0x300>
               :    }


GCC-4.2.3:
               :    if(!result)
               :    {
               :      result = ( equal && rc_left.k < rc_right.k );
     3  0.2035 :  401a76:	mov    (%rsi),%ecx
    32  2.1710 :  401a78:	mov    0x30(%r8),%eax
               :      equal  = ( equal && rc_left.k == rc_right.k );
               :    }
               :    
               :    if(!result)
               :  401a7c:	cmp    %eax,%ecx
    13  0.8820 :  401a7e:	jl     401aa7 <_Z5checkPKcmm+0x4b7>
               :    {
               :      result = ( equal && rc_left.a < rc_right.a );
     7  0.4749 :  401a80:	cmp    %eax,%ecx
     1  0.0678 :  401a82:	jne    401b60 <_Z5checkPKcmm+0x570>
    10  0.6784 :  401a88:	mov    0x38(%r8),%rax
     1  0.0678 :  401a8c:	cmp    %rax,0x8(%rsi)
     2  0.1357 :  401a90:	jb     401aa7 <_Z5checkPKcmm+0x4b7>
               :      equal  = ( equal && rc_left.a == rc_right.a );
     3  0.2035 :  401a92:	jne    401b60 <_Z5checkPKcmm+0x570>
               :    }
Comment 17 Ion Gaztañaga 2008-07-23 08:03:44 UTC
Andrew, I think I do write some crappy code, but the slowdown is also present in the non-optimized version.

FWIW, there are STL implementations that embed the color bit in rbtrees in the parent pointer so the optimization is not that strange. Maybe my implementation is not correct and that's why the size-optimized version ends in an infinite loop.

And we have tons of two-way pointers in production code. Too much crappy code to ignore, I'm afraid.
Comment 18 Daniel Jacobowitz 2008-07-24 22:18:38 UTC
Putting the regression marker back.  The code doesn't matter; if it's a regression, then it's regression.
Comment 19 Lothar Werzinger 2008-07-28 18:55:50 UTC
Created attachment 15973 [details]
oprofile logs

I compiled optimized executables with and without -fno-strict-aliasing and generated an oprofile log for them:

/opt2/linux/ix86/bin/g++-4.3.1 -m64 -O3 -no-strict-aliasing -g -I/opt2/linux/x86_64/include/boost-1_35 -DUSE_OPTIMIZATION=0 -o ./avltest-non ./avltest.cpp
/opt2/linux/ix86/bin/g++-4.3.1 -m64 -O3 -fno-strict-aliasing -g -I/opt2/linux/x86_64/include/boost-1_35 -DUSE_OPTIMIZATION=0 -o ./avltest-non ./avltest.cpp
sudo opcontrol --reset; sudo opcontrol --start; ./avltest-no; ./avltest-non ; sudo opcontrol --stop
opannotate --source --assembly --demangle=smart --threshold=1 ./avltest-no >./avltest-no.oprofile.txt
opannotate --source --assembly --demangle=smart --threshold=1 ./avltest-non >./avltest-non.oprofile.txt
diff -u ./avltest-no.oprofile.txt ./avltest-non.oprofile.txt ./avltest-diff.oprofile.txt

$ time ./avltest-non
sizeof(Node)=72
sizeof(NodeSet)=40
creating file ./avltest.img done
region.get_address()=0x7f4d43ed1000
region.get_size()=3600056
p_set=0x7f4d43ed1000
p_values=0x7f4d43ed1030
created/inserted 50000 entries
region.get_address()=0x7f4d43b62000
region.get_size()=3600056
p_set=0x7f4d43b62000
p_values=0x7f4d43b62030
checked 50000 entries, result=1249975000

real    0m0.614s
user    0m0.140s
sys     0m0.016s

$ time ./avltest-no
sizeof(Node)=72
sizeof(NodeSet)=40
creating file ./avltest.img done
region.get_address()=0x7f7fdce43000
region.get_size()=3600056
p_set=0x7f7fdce43000
p_values=0x7f7fdce43030
created/inserted 50000 entries
region.get_address()=0x7f7fdcad4000
region.get_size()=3600056
p_set=0x7f7fdcad4000
p_values=0x7f7fdcad4030
checked 50000 entries, result=1249975000

real    0m22.634s
user    0m22.277s
sys     0m0.120s

I hope these tests aid in finding the problem
Comment 20 Lothar Werzinger 2008-07-31 02:01:53 UTC
With all the supplied test cases and data, why is it so hard to move the bug out of the UNCONFIRMED state. It should be fairly easy to confirm that it is a problem by running the attached test.
Comment 21 Bernhard Reutner-Fischer 2008-07-31 07:56:15 UTC
Lothar, see #10
Comment 22 Richard Biener 2008-09-20 14:25:55 UTC
This PR is still very weird.  For the curious here are the numbers, now
with numbers from trunk included:

4.3 -O: 0.4s
4.3 -O -fstrict-aliasing: 59s
4.4 -O: 1m3s
4.4 -O -fstrict-aliasing: 0.4s
4.4 -O2: 0.3s
4.4 -O2 -fno-strict-aliasing: 0.2s

?!
Comment 23 Richard Biener 2008-09-20 14:27:35 UTC
Created attachment 16368 [details]
unincluded testcase

Unincluded testcase attached.

I'll leave this at P3 so some brave soul might eventually pick up the
analysis.
Comment 24 Richard Biener 2008-10-10 09:57:40 UTC
I am pretty sure this is a target issue.  We seem to generate code that makes the branch prediction unit mispredict a lot.  So, whoever wants to reproduce this
wants to state exactly on what processor it was measured (comment #22 was done
on a Athlon X2, which is a Fam 8 AMD Core).

For example with -O2 the code gets very dense with branches (which is why we
add all the alignments):

        addq    %rax, %rcx      # D.64407, prephitmp.989
.L322:
        movl    (%rcx), %eax    # <variable>.D.52563.k, D.64356
        cmpl    %eax, (%rbp)    # D.64356, <variable>.D.52563.k
        jl      .L324   #,
        jne     .L325   #,
        movq    8(%rcx), %rdx   # <variable>.D.52563.a,
        cmpq    %rdx, 8(%rbp)   #, <variable>.D.52563.a
        jb      .L324   #,
        .p2align 4,,7
        .p2align 3
        jne     .L325   #,

PRE/CSE of the loads makes the issue worse as the instruction encoding gets
even smaller.  This does look like a similar problem as PR21485.

The only thing that I can imagine would help is much more aggressive if-conversion to reduce the load on the branch predictor.  Of course we
want to do this only if there is a high branch density as otherwise extra
data dependencies hurt as well ...

The interesting function to look at is

_ZN5boost9intrusive12avl_set_implINS0_10avl_setoptINS0_6detail16base_hook_traitsI4NodeNS0_19avltree_node_traitsINS_12interprocess10offset_ptrIvEELb0EEELNS0_14link_mode_typeE1ENS0_11default_tagELi6EEESt4lessIS5_EmLb1EEEE4findERKS5_

boost::intrusive::avl_set_impl<boost::intrusive::avl_setopt<boost::intrusive::detail::base_hook_traits<Node, boost::intrusive::avltree_node_traits<boost::interprocess::offset_ptr<void>, false>, (boost::intrusive::link_mode_type)1, boost::intrusive::default_tag, 6>, std::less<Node>, unsigned long, true> >::find(Node const&)

which we nearly completely flatten by inlining.  There are a lot of strange
asserts remaining in the code which check differences of addresses of _different_ local(!) variables ... (struct offset_ptr).  Eventually VRP should
just optimize those away as undefined...  the asserts are all like

__assert_fail (&"m_offset != 1"[0], &"/opt2/linux/x86_64/include/boost-1_35/boost/interprocess/offset_ptr.hpp"[0], 75, &__PRETTY_FUNCTION__[0]);

Also we have a missed jump-threading / ifcombine thing:

<bb 25>:
  # prephitmp.956_583 = PHI <0B(22), prephitmp.956_344(80)>
  D.64081_165 = prephitmp.956_583->D.52563.k;
  D.64082_166 = (int) D.64081_165;
  D.64083_167 = value_4(D)->D.52563.k;
  D.64084_168 = (int) D.64083_167;
  if (D.64082_166 >= D.64084_168)
    goto <bb 26>;
  else
    goto <bb 30>;

<bb 26>:
  if (D.64081_165 != D.64083_167)
    goto <bb 36>;
  else
    goto <bb 27>;

which is probably because we fold (int) x != (int) y to x != y during forwprop maybe (I would have to check).  With this likely jump-threading makes things
worse.
Comment 25 Lothar Werzinger 2008-10-10 15:58:42 UTC
I was running the code on my Lenovo T61 Laptop:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
stepping        : 10
cpu MHz         : 800.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm ida
bogomips        : 4793.98
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
stepping        : 10
cpu MHz         : 800.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm ida
bogomips        : 4787.49
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
Comment 26 Mark Mitchell 2008-10-22 03:02:05 UTC
Richard has reproduced the bad behavior.  So, this is not UNCONFIRMED.  It's an optimization regression; hence P2.
Comment 27 Richard Biener 2009-01-24 10:20:37 UTC
GCC 4.3.3 is being released, adjusting target milestone.
Comment 28 Paolo Bonzini 2009-02-03 09:54:26 UTC
Am I right that the bug is just at -O, not at -O2?  4.4 -O2 is worse than 4.4 -O2 -fno-strict-aliasing, but are they better than 4.3 -O2 or worse?
Comment 29 Richard Biener 2009-08-04 12:29:18 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 30 Richard Biener 2010-03-16 13:15:39 UTC
I do have updated numbers.

4.3.[012] -O3: 57s
4.3.[34] -O[123]: 0.4s
4.4.[0123] -O[123]: 0.4s
4.5 -O0: 1.6s
4.5 -O[123]: 50s (*sigh*)

that makes it fixed on the release branches and a 4.5 regression only.
Comment 31 Richard Biener 2010-04-06 11:19:39 UTC
GCC 4.5.0 is being released.  Deferring to 4.5.1.
Comment 32 Richard Biener 2010-07-05 15:40:28 UTC
Re-confirmed for current 4.5 branch and trunk.
Comment 33 Richard Biener 2010-07-05 16:03:23 UTC
Using valgrind there are randomly reported errors, so this is likely either
an invalid testcase or a miscompile that manifests itself as runtime regression.
Comment 34 Richard Biener 2010-07-31 09:29:17 UTC
GCC 4.5.1 is being released, adjusting target milestone.
Comment 35 Richard Biener 2010-12-16 13:02:44 UTC
GCC 4.5.2 is being released, adjusting target milestone.
Comment 36 Richard Biener 2011-01-18 14:43:50 UTC
Trunk is now very fast and segfaults the testcase ;)
Comment 37 Richard Biener 2011-01-18 15:19:05 UTC
Ok, with -O1 -fno-tree-loop-im as basic flags -fno-tree-dse makes the
difference for the segfault when we remove

 static void boost::intrusive::avltree_algorithms<NodeTraits>::rebalance_after_i
nsertion(boost::intrusive::avltree_algorithms<NodeTraits>::node_ptr, boost::intr
usive::avltree_algorithms<NodeTraits>::node_ptr) [with NodeTraits = boost::intru
sive::avltree_node_traits<boost::interprocess::offset_ptr<void>, false>, boost::
intrusive::avltree_algorithms<NodeTraits>::node_ptr = boost::interprocess::offse
t_ptr<boost::intrusive::avltree_node<boost::interprocess::offset_ptr<void> > >] 
(struct node_ptr & restrict header, struct node_ptr & restrict x)
 {
   long int D.64530;
@@ -8290,7 +7992,6 @@
   p.19_2156 = (long int) D.64522_2154;
   D.64529_2157 = (long int) D.64513_2150;
   D.64530_2158 = p.19_2156 - D.64529_2157;
-  MEM[(struct offset_ptr *)D.64515_2149].m_offset = D.64530_2158;
   if (D.64530_2158 == 1)
     goto <bb 746>;
   else

this pointer is believed to point to

D.64515_2149, points-to NULL, points-to vars: { D.57715 CAST_RESTRICT.445 } (includes restrict tags)

computed from

<bb 742>:
  n_2143 = (struct node_ptr & restrict) &D.57715;
  p_2144 = (struct node_ptr & restrict) &D.57716;
  D.64517_2145 = D.57715.m_offset;
  D.64516_2147 = (long unsigned int) D.64517_2145;
  D.64515_2148 = n_2143 + D.64516_2147;

<bb 743>:
  # D.64515_2149 = PHI <0B(739), D.64515_2148(742)>

which confirms the points-to calculation.

But the compressed AVL tree thinks it can represent a pointer to
a global object by using the address of a local variable and
the difference between the address of the global and the local.
Which it of course cannot according to the C++ standard.
D.57715.m_offset is set by

<bb 740>:
  p.19_2139 = (long int) D.64504_2137;
  D.64508_2140 = (long int) &D.57715;
  D.64509_2141 = p.19_2139 - D.64508_2140;
  D.57715.m_offset = D.64509_2141;
  if (D.64509_2141 == 1)
    goto <bb 741>;
  else
    goto <bb 742>;

and we have

<bb 737>:
boost::interprocess::offset_ptr<boost::intrusive::avltree_node<boost::interprocess::offset_ptr<void> > >::offset_ptr (&D.57716, &root);
  D.64502_2134 = MEM[(const struct offset_ptr *)header_2(D)].m_offset;
  if (D.64502_2134 != 1)
    goto <bb 738>;
  else
    goto <bb 739>;

<bb 738>:
  D.64503_2136 = (long unsigned int) D.64502_2134;
  D.64504_2137 = header_2(D) + D.64503_2136;

where header_2(D) is a parameter, struct node_ptr & restrict header.  So
Boost AVL expects us to compute whatever header points-to as points-to
set of D.64515_2149 with no backing from the standard that we are required
to do that.

If we were able to "optimize" the code back to base the address on
header it would of course work (by luck).

For

extern void abort (void);
struct X { unsigned long offset; };
void foo (char *p)
{
  struct X a;
  char tmp, *q;
  a.offset = p - &tmp;
  if (a.offset == 1)
    abort ();
  q = &tmp + a.offset;
  *q = 0;
}

already early DCE kills the store (the above is what compressed AVL does).

Thus, this testcase is invalid.  Resulting performance is irrelevant.
Comment 38 Ion Gaztañaga 2011-01-18 19:15:22 UTC
Thanks Richard for the detailed report. The fact is that the next standard is trying to support relative pointers for container implementations (much like Boost.Interprocess does), to be able to place containers in shared memory. In those cases the relative pointers stores the difference between "this" and the pointee, which is what boost::interprocess::offset_ptr does.

It is certainly outside the standard right now and the standard won't include such "relative pointer" class, but it will require to container implementations to support them as allocator::pointer types. It would be important to find a way to use smart pointers, maybe with some existing annotation/attribute to avoid optimizing away that relative addressing, so that there is always away to compute the pointee address based on "this" + this->m_offset. It is also important to support smart pointers created on the stack pointing to global objects (temporaries, function arguments, etc.)

I know this is outside the bug, but it will be certainly a common question when using GCC and next standard containers. Thanks
Comment 39 rguenther@suse.de 2011-01-19 09:54:19 UTC
On Tue, 18 Jan 2011, igaztanaga at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36861
> 
> --- Comment #38 from Ion Gaztañaga <igaztanaga at gmail dot com> 2011-01-18 19:15:22 UTC ---
> Thanks Richard for the detailed report. The fact is that the next standard is
> trying to support relative pointers for container implementations (much like
> Boost.Interprocess does), to be able to place containers in shared memory. In
> those cases the relative pointers stores the difference between "this" and the
> pointee, which is what boost::interprocess::offset_ptr does.

But then the pointer arithmetic will not cause the pointer to move
from one object to another - the object will be simply the shared memory
segment (which isn't statically allocated either) - that's perfectly
valid in C and C++ right now.

> It is certainly outside the standard right now and the standard won't include
> such "relative pointer" class, but it will require to container implementations
> to support them as allocator::pointer types. It would be important to find a
> way to use smart pointers, maybe with some existing annotation/attribute to
> avoid optimizing away that relative addressing, so that there is always away to
> compute the pointee address based on "this" + this->m_offset. It is also
> important to support smart pointers created on the stack pointing to global
> objects (temporaries, function arguments, etc.)
> 
> I know this is outside the bug, but it will be certainly a common question when
> using GCC and next standard containers. Thanks

The point with the case in question is that a pointer to a global
object is computed relative to the address _of an automatic variable_.
I hope no standard will ever allow that ;)  It might be a simple
oversight (and thus bug) in Boost AVL though.

If the object were a global one then you wouldn't trigger the particular
bug (though still GCC would conclude that the pointer is pointing to
that global object, which, when in statically allocated storage, can
still lead to "miscompilations").

Richard.
Comment 40 Ion Gaztañaga 2011-01-19 16:07:01 UTC
(In reply to comment #39)
> The point with the case in question is that a pointer to a global
> object is computed relative to the address _of an automatic variable_.
> I hope no standard will ever allow that ;)  It might be a simple
> oversight (and thus bug) in Boost AVL though.

Is not an oversight, it's a basic feature to build easy relative addressing to be used in shared memory and memory mapped files shared between several processes. A pointer only stores the distance between the "smart pointer" and the pointee, so it can be mapped in different addresses in several processes. Support for automatic storage smart pointers is needed just to support iterators that could be stored in shared memory (thus, storing smart pointers in shared memory) and also be used by programmers in a loop (automatic storage).

Until now, this illegal scheme was working in all known compilers (although outside the standard, I agree). Since the standard does not support shared memory, if someday shared memory support is added to the standard, it will surely require some changes to support this widely used relative addressing. Until then, I agree, this bug report is invalid.

The only issue I would like to know is if this relative addressing scheme for automatic variables can be used with GCC with some kind of existing annotation, anything that could force that automatic variable to have an address that could be used to point to shared memory elements.

Ion
Comment 41 rguenther@suse.de 2011-01-19 16:40:15 UTC
On Wed, 19 Jan 2011, igaztanaga at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36861
> 
> --- Comment #40 from Ion Gaztañaga <igaztanaga at gmail dot com> 2011-01-19 16:07:01 UTC ---
> (In reply to comment #39)
> > The point with the case in question is that a pointer to a global
> > object is computed relative to the address _of an automatic variable_.
> > I hope no standard will ever allow that ;)  It might be a simple
> > oversight (and thus bug) in Boost AVL though.
> 
> Is not an oversight, it's a basic feature to build easy relative addressing to
> be used in shared memory and memory mapped files shared between several
> processes. A pointer only stores the distance between the "smart pointer" and
> the pointee, so it can be mapped in different addresses in several processes.
> Support for automatic storage smart pointers is needed just to support
> iterators that could be stored in shared memory (thus, storing smart pointers
> in shared memory) and also be used by programmers in a loop (automatic
> storage).

The case in question seems to use random automatic storage to represent
a pointer to unrelated storage.  That is never going to work.  And if
it will ever be required to work all points-to analysis will just
be useless which is not something you want.

> Until now, this illegal scheme was working in all known compilers (although
> outside the standard, I agree). Since the standard does not support shared
> memory, if someday shared memory support is added to the standard, it will
> surely require some changes to support this widely used relative addressing.
> Until then, I agree, this bug report is invalid.
> 
> The only issue I would like to know is if this relative addressing scheme for
> automatic variables can be used with GCC with some kind of existing annotation,
> anything that could force that automatic variable to have an address that could
> be used to point to shared memory elements.

No, there is no way to do that.  Sorry, but you really have to
re-design this thing to be standard conformant.

Richard.
Comment 42 Ion Gaztañaga 2011-01-20 11:52:36 UTC
> No, there is no way to do that.  Sorry, but you really have to
> re-design this thing to be standard conformant.

Thanks for your detailed report, I was not aware of this problem so I'll try to find a solution.

Ion