compiling: #include <vector> void empty(std::vector<bool> src) { std::vector <bool> data=src; } yields in optimized dump: void empty (struct vector & src) { _Bit_type * _13; unsigned int _14; _Bit_type * _15; long int _16; long int _17; long int _18; long int _19; long unsigned int _20; long unsigned int _25; long unsigned int _26; long unsigned int _45; <bb 2> [local count: 1073741824]: _13 = MEM[(const struct vector *)src_2(D)].D.25603._M_impl.D.25087._M_start.D.16487._M_p; _14 = MEM[(const struct _Bit_iterator &)src_2(D) + 16].D.16487._M_offset; _15 = MEM[(const struct _Bit_iterator &)src_2(D) + 16].D.16487._M_p; _16 = _15 - _13; _17 = _16 * 8; _18 = (long int) _14; _19 = _17 + _18; if (_19 != 0) goto <bb 3>; [33.00%] else goto <bb 4>; [67.00%] <bb 3> [local count: 574129755]: _20 = (long unsigned int) _19; _25 = _20 + 63; _26 = _25 >> 6; _45 = _26 * 8; <bb 4> [local count: 1073741824]: return; } All the computations are obviously dead. Eventually RTL optimizes it out, but we ought to do it earlier. Unused new/delete pair is only being determined at cddce3 which is bit late. At cddce2 we still have memcpy copying the vector which is only detected dead by dse5 run shortly before. Dse4 is not run and during dse3 we get quite a lot of code which somehow prevents the dse from happening. If bool is changed to int memcpy goes away already during dse2.
looking at dse3 dump we get: <bb 2> [local count: 1073741824]: MEM[(struct _Bvector_impl_data *)&data] ={v} {CLOBBER(bob)}; MEM[(struct __as_base &)&data] ={v} {CLOBBER(bob)}; _13 = MEM[(const struct vector *)src_2(D)].D.25603._M_impl.D.25087._M_start.D.16487._M_p; _14 = MEM[(const struct _Bit_iterator &)src_2(D) + 16].D.16487._M_offset; _15 = MEM[(const struct _Bit_iterator &)src_2(D) + 16].D.16487._M_p; _16 = _15 - _13; _17 = _16 * 8; _18 = (long int) _14; _19 = _17 + _18; _20 = (long unsigned int) _19; if (_20 != 0) goto <bb 3>; [33.00%] else goto <bb 22>; [67.00%] <bb 22> [local count: 719407024]: goto <bb 5>; [100.00%] <bb 3> [local count: 354334800]: _25 = _20 + 63; _26 = _25 >> 6; _45 = _26 * 8; _46 = operator new (_45); If I am reading it right, then _20 is size of the vector in bits. <bb 3> then takes the size in bits to determine size in 64bit words which is then converted to size in bytes. _16 already holds size of allocated vector. I think this all can be simplified to using it and adding 1 if offset is non-0. Later we do: <bb 6> [local count: 966367640]: _54 = (long unsigned int) _16; __builtin_memcpy (data$_M_p_173, _13, _54); Here _16 bytes is copied, so the last word must (if offset is non-0) must be copied separately: if (_14 == 0) goto <bb 23>; [10.20%] else goto <bb 24>; [89.80%] <bb 24> [local count: 867798143]: _55 = data$_M_p_173 + _54; goto <bb 11>; [100.00%] .... <bb 11> [local count: 964220160]: # __result$_M_p_40 = PHI <data$_M_p_173(26), _55(24), _57(30)> <bb 11> [local count: 964220160]: # __result$_M_p_40 = PHI <data$_M_p_173(26), _55(24), _57(30)> <bb 12> [local count: 9453138808]: # __first$_M_offset_154 = PHI <__first$_M_offset_58(21), 0(11)> # __first$_M_p_151 = PHI <__first$_M_p_112(21), _15(11)> # __result$_M_p_136 = PHI <__result$_M_p_105(21), __result$_M_p_40(11)> _82 = 1 << __first$_M_offset_154; _84 = *__first$_M_p_151; _85 = _82 & _84; pretmp_88 = *__result$_M_p_136; if (_85 != 0) goto <bb 13>; [50.00%] else goto <bb 14>; [50.00%] this copies the last byte but takes care to zero out the unused part of vector, why? <bb 13> [local count: 4726569404]: _90 = _82 | pretmp_88; goto <bb 15>; [100.00%] <bb 14> [local count: 4726569404]: _92 = ~_82; _93 = pretmp_88 & _92; <bb 15> [local count: 9453138808]: # cstore_138 = PHI <_90(13), _93(14)> *__result$_M_p_136 = cstore_138; this seems to be computing the last byte. It is not clear to me why it is still considred live at this point, but also the whole code could just be replaced by memcpying the last word as well. We compute the size of memory allocated and we can simply copy everything, right?
_Bit_type -> might be TBAA related?
Actually the main problem is that copying of bitvectors is done by loop copying every bit individually. This loop stays until loop optimizers and then we are quite late in optimization. Have patch for that.
patch posted here https://gcc.gnu.org/pipermail/gcc-patches/2024-December/671137.html