Bug 117924 - unused std::vector<bool> are not optimized out fully at gimple level
Summary: unused std::vector<bool> are not optimized out fully at gimple level
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 15.0
: P3 normal
Target Milestone: ---
Assignee: Jan Hubicka
URL:
Keywords: missed-optimization, patch
Depends on: 121921
Blocks: std::vector
  Show dependency treegraph
 
Reported: 2024-12-05 16:20 UTC by Jan Hubicka
Modified: 2025-09-12 03:54 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2025-09-01 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jan Hubicka 2024-12-05 16:20:32 UTC
compiling:
#include <vector>
void
empty(std::vector<bool> src)
{
        std::vector <bool> data=src;
}

yields in optimized dump:

void empty (struct vector & src)
{
  _Bit_type * _13;
  unsigned int _14;
  _Bit_type * _15;
  long int _16;
  long int _17;
  long int _18;
  long int _19;
  long unsigned int _20;
  long unsigned int _25;
  long unsigned int _26;
  long unsigned int _45;

  <bb 2> [local count: 1073741824]:
  _13 = MEM[(const struct vector *)src_2(D)].D.25603._M_impl.D.25087._M_start.D.16487._M_p;
  _14 = MEM[(const struct _Bit_iterator &)src_2(D) + 16].D.16487._M_offset;
  _15 = MEM[(const struct _Bit_iterator &)src_2(D) + 16].D.16487._M_p;
  _16 = _15 - _13;
  _17 = _16 * 8;
  _18 = (long int) _14;
  _19 = _17 + _18;
  if (_19 != 0)
    goto <bb 3>; [33.00%]
  else
    goto <bb 4>; [67.00%]

  <bb 3> [local count: 574129755]:
  _20 = (long unsigned int) _19;
  _25 = _20 + 63;
  _26 = _25 >> 6;
  _45 = _26 * 8;

  <bb 4> [local count: 1073741824]:
  return;

}
All the computations are obviously dead.
Eventually RTL optimizes it out, but we ought to do it earlier. Unused new/delete pair is only being determined at cddce3 which is bit late. At cddce2 we still have memcpy copying the vector which is only detected dead by dse5 run shortly before. Dse4 is not run and during dse3 we get quite a lot of code which somehow prevents the dse from happening.



If bool is changed to int memcpy goes away already during dse2.
Comment 1 Jan Hubicka 2024-12-05 16:51:07 UTC
looking at dse3 dump we get:

  <bb 2> [local count: 1073741824]:
  MEM[(struct _Bvector_impl_data *)&data] ={v} {CLOBBER(bob)};
  MEM[(struct __as_base  &)&data] ={v} {CLOBBER(bob)};
  _13 = MEM[(const struct vector *)src_2(D)].D.25603._M_impl.D.25087._M_start.D.16487._M_p;
  _14 = MEM[(const struct _Bit_iterator &)src_2(D) + 16].D.16487._M_offset;
  _15 = MEM[(const struct _Bit_iterator &)src_2(D) + 16].D.16487._M_p;
  _16 = _15 - _13;
  _17 = _16 * 8;
  _18 = (long int) _14;
  _19 = _17 + _18;
  _20 = (long unsigned int) _19;
  if (_20 != 0)  
    goto <bb 3>; [33.00%]
  else
    goto <bb 22>; [67.00%]

  <bb 22> [local count: 719407024]:
  goto <bb 5>; [100.00%]

  <bb 3> [local count: 354334800]:
  _25 = _20 + 63;
  _26 = _25 >> 6;
  _45 = _26 * 8;
  _46 = operator new (_45);


If I am reading it right, then _20 is size of the vector in bits.
<bb 3> then takes the size in bits to determine size in 64bit words which is then converted to size in bytes.

_16 already holds size of allocated vector. I think this all can be simplified to using it and adding 1 if offset is non-0.

Later we do:
  <bb 6> [local count: 966367640]:
  _54 = (long unsigned int) _16;
  __builtin_memcpy (data$_M_p_173, _13, _54);

Here _16 bytes is copied, so the last word must (if offset is non-0) must be copied separately:

  if (_14 == 0)
    goto <bb 23>; [10.20%]
  else
    goto <bb 24>; [89.80%]

  <bb 24> [local count: 867798143]:
  _55 = data$_M_p_173 + _54;
  goto <bb 11>; [100.00%]

....

 <bb 11> [local count: 964220160]:
  # __result$_M_p_40 = PHI <data$_M_p_173(26), _55(24), _57(30)>
  <bb 11> [local count: 964220160]:
  # __result$_M_p_40 = PHI <data$_M_p_173(26), _55(24), _57(30)>

  <bb 12> [local count: 9453138808]:
  # __first$_M_offset_154 = PHI <__first$_M_offset_58(21), 0(11)>
  # __first$_M_p_151 = PHI <__first$_M_p_112(21), _15(11)>
  # __result$_M_p_136 = PHI <__result$_M_p_105(21), __result$_M_p_40(11)>
  _82 = 1 << __first$_M_offset_154;
  _84 = *__first$_M_p_151;
  _85 = _82 & _84;
  pretmp_88 = *__result$_M_p_136;
  if (_85 != 0)
    goto <bb 13>; [50.00%]
  else
    goto <bb 14>; [50.00%]

this copies the last byte but takes care to zero out the unused part of vector, why?

  <bb 13> [local count: 4726569404]:
  _90 = _82 | pretmp_88;
  goto <bb 15>; [100.00%]

  <bb 14> [local count: 4726569404]:
  _92 = ~_82;
  _93 = pretmp_88 & _92;

  <bb 15> [local count: 9453138808]:
  # cstore_138 = PHI <_90(13), _93(14)>
  *__result$_M_p_136 = cstore_138;

this seems to be computing the last byte.
It is not clear to me why it is still considred live at this point, but also the whole code could just be replaced by memcpying the last word as well.  We compute the size of memory allocated and we can simply copy everything, right?
Comment 2 Richard Biener 2024-12-06 07:19:01 UTC
_Bit_type -> might be TBAA related?
Comment 3 Jan Hubicka 2024-12-06 08:46:21 UTC
Actually the main problem is that copying of bitvectors is done by loop
copying every bit individually. This loop stays until loop optimizers
and then we are quite late in optimization.  Have patch for that.
Comment 4 Jan Hubicka 2024-12-14 16:09:37 UTC
patch posted here https://gcc.gnu.org/pipermail/gcc-patches/2024-December/671137.html
Comment 5 Andrew Pinski 2025-09-12 01:16:10 UTC
Note there is another way of solving this. From my anylsis (which I wrote in PR 121921):
currently DSE5 can remove the stores:
```
  Deleted dead store: MEM[(struct __as_base  &)&data] ={v} {CLOBBER(bob)};

  Deleted dead store: MEM[(struct _Bvector_impl_data *)&data] ={v} {CLOBBER(bob)};

```
But DCE7 (which is right afterwards) does not `remove operator new/delete` because this missed optimization and then forwprop4 (which is right after dce7) is able to see (b+s) - (b+s - b) is just b and then later on the next DCE optimizes away the new/delete pair.

> Unused new/delete pair is only being determined at cddce3 which is bit late. 

The reason why it is not before hand is due to `e - (e - b)` not being optimized to b until forwprop4 which is right after dce7. If `e - (e - b)` got folded say fre1:
```
  _1 = this_15(D)->_M_impl.D.25104._M_start.D.16464._M_p;
...
  _20 = MEM[(const struct _Bvector_impl *)this_15(D)].D.25104._M_end_of_storage;
  _5 = _20 - _1; // e - b
  _8 = (long unsigned int) _5;
  _9 = -_8;
  _10 = _20 + _9; // e - (e - b)
  _11 = &this_15(D)->_M_impl;
  operator delete (_10, _8);
```
We should recongize the operator new/delete pair earlier too.
Comment 6 Andrew Pinski 2025-09-12 01:40:21 UTC
```
        _GLIBCXX20_CONSTEXPR
        _Bit_type*
        _M_end_addr() const _GLIBCXX_NOEXCEPT
        {
          if (this->_M_end_of_storage)
            return std::__addressof(this->_M_end_of_storage[-1]) + 1;
          return 0;
        }

...
      _GLIBCXX20_CONSTEXPR
      void
      _M_deallocate()
      {
        if (_M_impl._M_start._M_p)
          {
            const size_t __n = _M_impl._M_end_addr() - _M_impl._M_start._M_p;
            _Bit_alloc_traits::deallocate(_M_impl,
                                          _M_impl._M_end_of_storage - __n,
                                          __n);
            _M_impl._M_reset();
          }
      }
```

I don't get why we just don't call _Bit_alloc_traits::deallocate on _M_impl._M_start._M_p here instead of `_M_impl._M_end_of_storage - __n`. Unless I am missing something with respect to allocator traits.

That was changed with r5-4575-gccd615e3fdf2d2 ; which was the move to support C++11 allocator requirements. But I would expect all allocators should produce `e - (e - b)` as being b always. and `_M_end_of_storage  - __n` I would hope being always _M_impl._M_start._M_p too. Unless I miss something here.
Comment 7 Andrew Pinski 2025-09-12 03:54:38 UTC
(In reply to Andrew Pinski from comment #5)
> Note there is another way of solving this. From my anylsis (which I wrote in
> PR 121921):
> currently DSE5 can remove the stores:
> ```
>   Deleted dead store: MEM[(struct __as_base  &)&data] ={v} {CLOBBER(bob)};
> 
>   Deleted dead store: MEM[(struct _Bvector_impl_data *)&data] ={v}
> {CLOBBER(bob)};
> 
> ```
> But DCE7 (which is right afterwards) does not `remove operator new/delete`
> because this missed optimization and then forwprop4 (which is right after
> dce7) is able to see (b+s) - (b+s - b) is just b and then later on the next
> DCE optimizes away the new/delete pair.
> 
> > Unused new/delete pair is only being determined at cddce3 which is bit late. 
> 
> The reason why it is not before hand is due to `e - (e - b)` not being
> optimized to b until forwprop4 which is right after dce7. If `e - (e - b)`
> got folded say fre1:
> ```
>   _1 = this_15(D)->_M_impl.D.25104._M_start.D.16464._M_p;
> ...
>   _20 = MEM[(const struct _Bvector_impl
> *)this_15(D)].D.25104._M_end_of_storage;
>   _5 = _20 - _1; // e - b
>   _8 = (long unsigned int) _5;
>   _9 = -_8;
>   _10 = _20 + _9; // e - (e - b)
>   _11 = &this_15(D)->_M_impl;
>   operator delete (_10, _8);
> ```
> We should recongize the operator new/delete pair earlier too.

Nope because we are till left with:
```
  _133 = _34 + _33;
...
  _9 = _133 - _34;
  _10 = (long unsigned int) _9;
```
Not being converted into _33 until forwprop still.
The reason is fre5 does not get it due to the need for jump threading:
```
  <bb 8> [local count: 111448560]:
  # _150 = PHI <_34(7), 0B(4), _34(6)>
  # data$D25093$_M_end_of_storage_175 = PHI <_28(7), 0B(4), _28(6)>
  __first ={v} {CLOBBER(eos)};
  __result ={v} {CLOBBER(eos)};
  if (_150 != 0B)
    goto <bb 10>; [53.47%]
  else
    goto <bb 11>; [46.53%]

...

  <bb 10> [local count: 58514395]:
  _9 = data$D25093$_M_end_of_storage_175 - _150;
```

In theory we could optimize:
```
  _28 = _34 + _33;
...
  <bb 10> [local count: 111448560]:
  # __result_72 = PHI <_69(7), _34(8), _71(9), 0B(4)>
  # _150 = PHI <_34(7), _34(8), _34(9), 0B(4)>
  # data$D25093$_M_end_of_storage_175 = PHI <_28(7), _28(8), _28(9), 0B(4)>

...
  _9 = data$D25093$_M_end_of_storage_175 - _150;
  _10 = (long unsigned int) _9;


Into:
```
  <bb 10> [local count: 111448560]:
 # _t = PHI<_33(7),_33(8),_33(9),0>
...
  _9 = (long int)_t
  _10 = (long unsigned int) _9;
...
```

But I am not sure how expensive in compile time this would be. Then in ccp4 we would get the decent code.