Bug 94300 - [10 Regression] memcpy vector load miscompiled during const-prop since r10-6809-g7f5617b00445dcc8
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 10.0
Importance: P1 normal
Target Milestone: 10.0
Assignee: Jakub Jelinek
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2020-03-24 12:48 UTC by Matthias Kretz (Vir)
Modified: 2020-03-25 08:22 UTC (History)
CC: 2 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work: 9.3.0
Known to fail: 10.0
Last reconfirmed: 2020-03-24 00:00:00


Attachments
gcc10-pr94300.patch (815 bytes, patch)
2020-03-24 14:53 UTC, Jakub Jelinek

Description Matthias Kretz (Vir) 2020-03-24 12:48:09 UTC
Test case `-O1 -march=skylake-avx512`:

int main()
{
  double mem[16];
  using V [[gnu::vector_size(64)]] = double;
  const V a{0, 1, 2, 3, 4, 5, 6, 7};
  const V b{8, 9, 10, 11, 12, 13, 14, 15};
  __builtin_memcpy(mem, &a, 64);
  __builtin_memcpy(mem + 8, &b, 64);
  V c = {};
  __builtin_memcpy(&c, mem + 4, 64);
  if (c[5] != double(9))
    __builtin_abort();
}

Reduced from my extended test case; with the miscompile, c ends up as {4, 5, 6, 7, 8, 15, 9, 10} instead of the expected {4, 5, 6, 7, 8, 9, 10, 11}. If GCC is made to forget the contents of mem (e.g. via inline asm), the test does not fail. GCC 9 does not constant-fold the code above and therefore doesn't fail.
Comment 1 Martin Liška 2020-03-24 13:17:52 UTC
Confirmed, started with r10-6809-g7f5617b00445dcc8.
I can reproduce that on Intel SDE simulator:

$ g++ pr94300.c -O1 -march=skylake-avx512 && /home/marxin/Programming/intel-sde-new/sde-external-8.16.0-2018-01-30-lin/sde -skx -- ./a.out 
Aborted (core dumped)
Comment 2 Jakub Jelinek 2020-03-24 14:53:53 UTC
Created attachment 48105 [details]
gcc10-pr94300.patch

Untested fix.
Comment 3 GCC Commits 2020-03-25 08:17:43 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:f1154b4d3c54e83d493cc66d1a30c410b9b3108a

commit r10-7366-gf1154b4d3c54e83d493cc66d1a30c410b9b3108a
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Wed Mar 25 09:17:01 2020 +0100

    sccvn: Fix buffer overflow in push_partial_def [PR94300]
    
    The following testcase is miscompiled, because there is a buffer overflow
    in push_partial_def in the little-endian case when working with 64-byte
    vectors.  The code computes the number of bytes we need in the BUFFER:
    NEEDED_LEN, which is the number of bits we need rounded up to whole bytes.
    Then the code calls native_encode_expr on each (partially overlapping) pd,
    encoding it into THIS_BUFFER.
    If pd.offset < 0, i.e. the pd.rhs store starts at some bits before the
    window we are interested in, we pass -pd.offset to native_encode_expr and
    shrink the size already earlier:
          HOST_WIDE_INT size = pd.size;
          if (pd.offset < 0)
            size -= ROUND_DOWN (-pd.offset, BITS_PER_UNIT);
    On this testcase, the problem is with a store with pd.offset > 0,
    in particular pd.offset 256, pd.size 512, i.e. a 64-byte store which
    doesn't fit entirely into BUFFER.
    We have just:
              size = MIN (size, (HOST_WIDE_INT) needed_len * BITS_PER_UNIT);
    in this case for little-endian, which isn't sufficient, because needed_len
    is 64, the entire BUFFER (except for the last extra byte used for shifting).
    native_encode_expr fills the whole THIS_BUFFER (again, except for the last
    extra byte), and the code then performs memcpy (BUFFER + 32, THIS_BUFFER, 64);
    which overflows BUFFER and, as THIS_BUFFER is usually laid out right after
    it, spills into THIS_BUFFER.
    The following patch fixes it by making sure that, for pd.offset > 0, size
    is reduced accordingly.  For big-endian the code does things differently
    and already handles this case correctly.
    
    2020-03-25  Jakub Jelinek  <jakub@redhat.com>
    
            PR tree-optimization/94300
            * tree-ssa-sccvn.c (vn_walk_cb_data::push_partial_def): If pd.offset
            is positive, make sure that off + size isn't larger than needed_len.
    
            * gcc.target/i386/avx512f-pr94300.c: New test.
Comment 4 Jakub Jelinek 2020-03-25 08:22:31 UTC
Fixed.