65709 – [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

Bug 65709 - [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

Summary: [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	5.0

Importance:	P3 normal
Target Milestone:	5.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-04-09 06:22 UTC by Evan Nemerson
Modified:	2023-05-06 02:15 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Test data (14.66 KB, application/x-lz4) 2015-04-09 06:22 UTC, Evan Nemerson	Details
preprocessed test case (20.74 KB, text/plain) 2015-04-09 06:22 UTC, Evan Nemerson	Details
gcc5-pr65709.patch (966 bytes, patch) 2015-04-09 08:29 UTC, Jakub Jelinek	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Evan Nemerson 2015-04-09 06:22:27 UTC

Created attachment 35266 [details]
Test data

With GCC 5 at -O3 on x86_64 some inputs to LZ4 will generate a segfault when decompressing.  This doesn't happen with GCC 4.9, or -O2.  It does happen with -O2 -ftree-loop-vectorize -fvect-cost-model=dynamic.

I have tested gcc-5.0.0-0.21.fc22.x86_64 from Fedora 22, as well as SVN revision 221940.

To trigger the segfault pass the attached input (sum.lz4) as the only command line argument to the preprocessed test case.

Comment 1 Evan Nemerson 2015-04-09 06:22:54 UTC

Created attachment 35267 [details]
preprocessed test case

Comment 2 Markus Trippelsdorf 2015-04-09 06:34:56 UTC

markus@x4 tmp % gcc -fsanitize=undefined -O3 test.i
markus@x4 tmp % ./a.out sum.lz4
test.c:285:29: runtime error: load of misaligned address 0x7efefb79d001 for type 'const void', which requires 8 byte alignment
0x7efefb79d001: note: pointer points here
 00 00 00  85 7f 45 4c 46 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01 0b 98  00 00 00 34 00
              ^ 
test.c:200:16: runtime error: load of misaligned address 0x7efefb79d009 for type 'const void', which requires 2 byte alignment
0x7efefb79d009: note: pointer points here
 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01 0b 98  00 00 00 34 00 00 91 50  1c 00 f1 00 34
              ^ 
test.c:285:27: runtime error: store to misaligned address 0x000001205c31 for type 'void', which requires 8 byte alignment
0x000001205c31: note: pointer points here
 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00
              ^ 
test.c:285:27: runtime error: store to misaligned address 0x000001205c44 for type 'void', which requires 8 byte alignment
0x000001205c44: note: pointer points here
  00 00 91 50 1c 00 f1 00  34 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
              ^ 
test.c:285:27: runtime error: store to misaligned address 0x000001205c4c for type 'void', which requires 8 byte alignment
0x000001205c4c: note: pointer points here
  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
              ^ 
test.c:285:29: runtime error: load of misaligned address 0x000001205c3c for type 'const void', which requires 8 byte alignment
0x000001205c3c: note: pointer points here
  00 01 0b 98 00 00 00 34  00 00 91 50 00 00 00 00  00 34 00 20 00 05 00 28  00 1a 00 17 00 00 00 06
              ^ 
test.c:285:29: runtime error: load of misaligned address 0x000001205c44 for type 'const void', which requires 8 byte alignment
0x000001205c44: note: pointer points here
  00 00 91 50 00 00 00 00  00 34 00 20 00 05 00 28  00 1a 00 17 00 00 00 06  00 00 00 34 00 00 91 50
              ^ 
test.c:273:25: runtime error: load of misaligned address 0x000001205c62 for type 'const void', which requires 4 byte alignment
0x000001205c62: note: pointer points here
 00 34  00 00 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00
              ^ 
test.c:273:23: runtime error: store to misaligned address 0x000001205c66 for type 'void', which requires 4 byte alignment
0x000001205c66: note: pointer points here
 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00
             ^ 
test.c:285:27: runtime error: store to misaligned address 0x00000120f177 for type 'void', which requires 8 byte alignment
0x00000120f177: note: pointer points here
 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  81 6e 01 00 00 00 00 00  00 00 00
             ^

Comment 3 Markus Trippelsdorf 2015-04-09 06:41:04 UTC

BTW clang's output is more informative:

test.c:285:29: runtime error: load of misaligned address 0x7fe669eee001 for type 'U64' (aka 'unsigned long'), which requires 8 byte alignment
0x7fe669eee001: note: pointer points here
<memory cannot be printed>
test.c:200:16: runtime error: load of misaligned address 0x7fe669eee009 for type 'U16' (aka 'unsigned short'), which requires 2 byte alignment
0x7fe669eee009: note: pointer points here
 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01 0b 98  00 00 00 34 00 00 91 50  1c 00 f1 00 34
              ^ 
test.c:285:13: runtime error: store to misaligned address 0x000002020021 for type 'U64' (aka 'unsigned long'), which requires 8 byte alignment
0x000002020021: note: pointer points here
 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00
              ^ 
test.c:273:25: runtime error: load of misaligned address 0x000002020052 for type 'U32' (aka 'unsigned int'), which requires 4 byte alignment
0x000002020052: note: pointer points here
 00 34  00 00 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00
              ^ 
test.c:273:9: runtime error: store to misaligned address 0x000002020056 for type 'U32' (aka 'unsigned int'), which requires 4 byte alignment
0x000002020056: note: pointer points here
 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00
             ^ 

e.g. "type 'const void'" vs. "type 'U32' (aka 'unsigned int')"

Comment 4 Markus Trippelsdorf 2015-04-09 07:29:17 UTC

If you change:

4309 static void LZ4_copy8(void* dstPtr, const void* srcPtr)                                                                                                                  
4310 {                                                                                                                                                                        
4311                                                                                                                                                                          
4312     if (1)                                                                                                                                                               
4313     {                                                                                                                                                                    
4314         if (LZ4_64bits())                                                                                                                                                
4315             *(U64*)dstPtr = *(U64*)srcPtr;                                                                                                                               
4316         else                                                                                                                                                             
4317             ((U32*)dstPtr)[0] = ((U32*)srcPtr)[0],                                                                                                                       
4318             ((U32*)dstPtr)[1] = ((U32*)srcPtr)[1];                                                                                                                       
4319         return;                                                                                                                                                          
4320     }                                                                                                                                                                    
4321                                                                                                                                                                          
4322     memcpy(dstPtr, srcPtr, 8);                                                                                                                                           
4323 } 

to if (0) it will work.

And looking at the lz4 git repository I see:

 269 static void LZ4_copy8(void* dstPtr, const void* srcPtr)                                                                                                                  
 270 {                                                                                                                                                                        
 271 #if GCC_VERSION!=409  /* disabled on GCC 4.9, as it generates invalid opcode (crash) */                                                                                  
 272     if (LZ4_UNALIGNED_ACCESS)                                                                                                                                            
 273     {                                                                                                                                                                    
 274         if (LZ4_64bits())                                                                                                                                                
 275             *(U64*)dstPtr = *(U64*)srcPtr;                                                                                                                               
 276         else                                                                                                                                                             
 277             ((U32*)dstPtr)[0] = ((U32*)srcPtr)[0],                                                                                                                       
 278             ((U32*)dstPtr)[1] = ((U32*)srcPtr)[1];                                                                                                                       
 279         return;                                                                                                                                                          
 280     }                                                                                                                                                                    
 281 #endif                                                                                                                                                                   
 282     memcpy(dstPtr, srcPtr, 8);                                                                                                                                           
 283 }

Comment 5 Richard Biener 2015-04-09 08:02:34 UTC

*(U64*)dstPtr = *(U64*)srcPtr; 

makes GCC assume that dstPtr and srcPtr are suitably aligned for U64, if they
are not then you invoke undefined behavior.  As x86 doesn't trap on unaligned accesses unless they are from vectorized code this shows up only when the
vectorizer exploits that alignment info.

Thus I'd say this bug is invalid.

Comment 6 Markus Trippelsdorf 2015-04-09 08:27:24 UTC

(In reply to Richard Biener from comment #5)
> *(U64*)dstPtr = *(U64*)srcPtr; 
> 
> makes GCC assume that dstPtr and srcPtr are suitably aligned for U64, if they
> are not then you invoke undefined behavior.  As x86 doesn't trap on
> unaligned accesses unless they are from vectorized code this shows up only
> when the
> vectorizer exploits that alignment info.
> 
> Thus I'd say this bug is invalid.

Agreed, closing.

Comment 7 Jakub Jelinek 2015-04-09 08:29:32 UTC

Created attachment 35271 [details]
gcc5-pr65709.patch

For the bad wording in the -fsanitize=alignment diagnostics, here is a fix.  The type of MEM_REF's argument isn't relevant, it can be anything, what really matters is the type of the MEM_REF, which is the access type.

Comment 8 Jakub Jelinek 2015-04-09 19:51:39 UTC

Author: jakub
Date: Thu Apr  9 19:51:08 2015
New Revision: 221958

URL: https://gcc.gnu.org/viewcvs?rev=221958&root=gcc&view=rev
Log:
	PR tree-optimization/65709
	* ubsan.c (instrument_mem_ref): Use TREE_TYPE (base) instead of
	TREE_TYPE (TREE_TYPE (t)).

	* c-c++-common/ubsan/align-9.c: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/ubsan/align-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/ubsan.c

Comment 9 Yann Collet 2015-04-09 21:04:49 UTC

While the issue can be easily fixed from an LZ4 perspective, 
the main topic here is to analyze a GCC 4.9+ vectorizer choice.

The piece of code that it tried to optimize can be summarized as follows (once removed all the garbage) :

static void LZ4_copy8(void* dstPtr, const void* srcPtr)
{
   *(U64*)dstPtr = *(U64*)srcPtr;
}

Pretty simple.
Let's assume for the rest of the post that both pointers are correctly aligned, so it's not a problem anymore.

Looking at the assembler generated, we see that GCC generates a MOVDQA instruction for it.
> movdqa (%rdi,%rax,1),%xmm0
> $rdi=0x7fffea4b53e6
> $rax=0x0

This seems wrong on 2 levels :

- The function only wants to copy 8 bytes. MOVDQA works on a full SSE register, which is 16 bytes. This spell troubles, if only for buffer boundaries checks : the algorithm uses 8 bytes because it knows it can safely read/write that size without crossing buffer limits. With 16 bytes, no such guarantee.

- MOVDQA requires both positions to be aligned.
I read it as being SSE size aligned, which means 16-bytes aligned.
But they are not, these pointers are supposed to be 8-bytes aligned only.

(A bit off topic, but from a general perspective, I don't understand the use of MOVDQA, which requires such a strong alignment condition, while there is also MOVDQU available, which works fine at any memory address, while suffering no performance penalty on aligned memory addresses. MOVDQU looks like a better choice in every circumstances.)

Anyway, the core of the issue is rather above :
this is just an 8-bytes copy operation, replacing by a 16-bytes one looks suspicious. Maybe it would deserve a look.

Comment 10 Jakub Jelinek 2015-04-09 21:30:55 UTC

(In reply to Yann Collet from comment #9)
> Looking at the assembler generated, we see that GCC generates a MOVDQA
> instruction for it.
> > movdqa (%rdi,%rax,1),%xmm0
> > $rdi=0x7fffea4b53e6
> > $rax=0x0
> 
> This seems wrong on 2 levels :
> 
> - The function only wants to copy 8 bytes. MOVDQA works on a full SSE
> register, which is 16 bytes. This spell troubles, if only for buffer
> boundaries checks : the algorithm uses 8 bytes because it knows it can
> safely read/write that size without crossing buffer limits. With 16 bytes,
> no such guarantee.

The function has been inlined into the callers, like:
      do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);
and this loop is then vectorized.  The vectorization prologue of course has to adjust if s is not 16 byte aligned, but it can assume that both s and d are 8 byte aligned (otherwise it is undefined behavior).  So, if they aren't 8 byte aligned, you could get crashes etc.  The load is then performed as aligned, because the vectorization prologue ensured it is aligned (unless the program has undefined behavior), while the stores as done using movups because it is possible the pointers aren't both aligned the same.

> - MOVDQA requires both positions to be aligned.
> I read it as being SSE size aligned, which means 16-bytes aligned.
> But they are not, these pointers are supposed to be 8-bytes aligned only.
> 
> (A bit off topic, but from a general perspective, I don't understand the use
> of MOVDQA, which requires such a strong alignment condition, while there is
> also MOVDQU available, which works fine at any memory address, while
> suffering no performance penalty on aligned memory addresses. MOVDQU looks
> like a better choice in every circumstances.)

On most CPUs there is a significant performance difference between the two, even if you use MOVDQU with aligned addresses.

Comment 11 Yann Collet 2015-04-10 12:02:39 UTC

Does that answer also takes into consideration "buffer boundaries check" ?

The algorithm is supposed to pay attention to this property when moving 8-bytes at a time. For 16-bytes at a time, it seems this core assumption is broken, and can result in buffer overflow.

Comment 12 Jakub Jelinek 2015-04-10 12:05:55 UTC

Of course, if a vectorized loop doesn't have compile time known number of iterations divisible by vectorization factor, or if some early iterations are performed as scalar to align some pointer, the compiler emits an epilogue (loop, possibly unrolled) that performs the last few iterations if needed.

Comment 13 Yann Collet 2015-04-10 14:06:55 UTC

Clear enough. Thanks for feedback.

Comment 14 Jeffrey Walton 2015-07-13 11:02:50 UTC

(In reply to Jakub Jelinek from comment #10)
> (In reply to Yann Collet from comment #9)
> > Looking at the assembler generated, we see that GCC generates a MOVDQA
> > instruction for it.
> > > movdqa (%rdi,%rax,1),%xmm0
> > > $rdi=0x7fffea4b53e6
> > > $rax=0x0
> > 
> > This seems wrong on 2 levels :
> > 
> > - The function only wants to copy 8 bytes. MOVDQA works on a full SSE
> > register, which is 16 bytes. This spell troubles, if only for buffer
> > boundaries checks : the algorithm uses 8 bytes because it knows it can
> > safely read/write that size without crossing buffer limits. With 16 bytes,
> > no such guarantee.
> 
> The function has been inlined into the callers, like:
>       do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);
> and this loop is then vectorized.  The vectorization prologue of course has
> to adjust if s is not 16 byte aligned, but it can assume that both s and d
> are 8 byte aligned (otherwise it is undefined behavior)...
Forgive my barging in Jakub. I was referred to this issue and comment from another issue.

Its not clear to me where the leap is made that its OK use vmovdqa. Are you stating (unequivocally, for folks like me) that is does *not* matter what the alignment is in the C-sources; and that the prologue ensures both 's' and 's' are eventually 16-byte aligned when vmovdqa is invoked. That is, when we see vmovdqa used, we know the alignment is correct (and at 16-bytes).

Sorry to have to ask.

Comment 15 Jakub Jelinek 2015-07-13 11:14:05 UTC

I'm saying that if the program does not trigger undefined behavior (e.g. by accessing misaligned integers without telling the compiler about it (by using memcpy, or packed attribute or pragma), then it would be a compiler bug to use an instruction requiring higher alignment than guaranteed in the source, without ensuring such alignment (through realigning arrays, introducing a loop for aligning pointers before the vectorized loop, peeling a few iterations needed to align the pointer(s), or using instructions that don't require such high alignment).
No testcase has been provided here without having undefined behavior in them that would show a bug on the compiler side.

Comment 16 Jeffrey Walton 2015-07-13 11:38:47 UTC

(In reply to Jakub Jelinek from comment #15)
> I'm saying that if the program does not trigger undefined behavior (e.g. by
> accessing misaligned integers without telling the compiler about it (by
> using memcpy, or packed attribute or pragma), then it would be a compiler
> bug to use an instruction requiring higher alignment than guaranteed in the
> source, without ensuring such alignment (through realigning arrays,
> introducing a loop for aligning pointers before the vectorized loop, peeling
> a few iterations needed to align the pointer(s), or using instructions that
> don't require such high alignment).
> No testcase has been provided here without having undefined behavior in them
> that would show a bug on the compiler side.

OK, so you'll have to forgive my ignorance again.

So you are saying that it may be a bug to use vmovdqa if the source and/or destination are not 16-byte aligned; but all the user code you have seen has undefined behavior so you're not going to answer. Is that correct?

(My apologies if its too sharp a point. I'm just trying to figure out what the preconditions are for vmovdqa, and if it should be used when source or destination is 8-byte aligned).

Comment 17 Jakub Jelinek 2015-07-13 12:06:01 UTC

(In reply to Jeffrey Walton from comment #16)
> OK, so you'll have to forgive my ignorance again.
> 
> So you are saying that it may be a bug to use vmovdqa if the source and/or
> destination are not 16-byte aligned; but all the user code you have seen has
> undefined behavior so you're not going to answer. Is that correct?
> 
> (My apologies if its too sharp a point. I'm just trying to figure out what
> the preconditions are for vmovdqa, and if it should be used when source or
> destination is 8-byte aligned).

I'm saying we as the compiler writers know what we are doing, and the various cases like using unaligned accesses or peeling for alignment or versioning for alignment, or realigning arrays are handled in the compiler.
They do assume that the source is valid and does not trigger undefined behavior.
If you e.g. compile on x86_64 with -O3 -mavx2
void
foo (int *a, int *b)
{
  int i;
  for (i = 0; i < 64; i++)
    a[i] = 2 * b[i];
}
you'll see that compiler decided to peel for alignment of b pointer and you can see an (unrolled) scalar loop first that handles first few iterations to align b if it is not already aligned, and then the main vector loop uses
vmovdqa for loads and vmovups for stores (because the a pointer modulo 32 might not be the same as b pointer modulo 32).  If you compile with -O2 -ftree-vectorize -mavx2, you'll see that peeling for alignment isn't performed, as it enlarges the code, and vmovdqu is used for the loads instead.
The peeling for alignment assumes that there is no undefined behavior originally, so if you call this with (uintptr) b % sizeof (int) != 0, it will not work properly, but that is a bug in the code, not in the compiler.
So, if you have some testcase where there is no undefined behavior triggered (try various sanitizers, inspect the code yourself, read the C standard), and you are convinced the compiler introduces a bug where there isn't originally (i.e. miscompiles the code), feel free to file a new bugreport.
Nothing like that has been presented in this PR.

Comment 18 Yann Collet 2015-08-12 18:24:43 UTC

This issue makes me wonder : how to efficiently access unaligned memory ?


The case in point is ARM cpu.
They don't support SSE/AVX, so they seem unaffected by this specific issue,
but this issue force writing the source code in a certain way, to remain compatible with vectorizer assumtion.
Therefore, for portable code, it becomes an issue :
how to write a code which is both portable and efficient on both targets ?


Since apparently, writing : u32 = *(U32*)ptr;
is forbidden if ptr is not guaranteed to be aligned on 4-bytes boundaries
as the compiler will then be authorized to assume ptr is properly aligned,
how to efficiently load 4-bytes from memory at unaligned position ?

I know 3 ways :

1) byte by byte : secure, but slow ==> not efficient

2) using memcpy : memcpy(&u32, ptr, sizeof(u32));
It works. It's safe, and on x86/x64 it's correctly translated into a single mov instruction, so it's also fast.
Alas, on ARM target, this get translated into much more complex /cautious sequence, depending on optimization settings.
This is not a small difference :
at -O3 settings, we get a x2 performance difference.
at -O2 settings, it becomes x5 (unaligned code is slower).

3) using __packed instruction : Basically, feature the same benefits and problems than memcpy() method above


The problem is therefore for newer ARM CPU, which efficiently support unaligned memory.
Accessing this performance is not possible using memcpy() nor __packed.
And it seems the only way to get it is to do : u32 = *(U32*)ptr;

The difference in performance is really huge, in fact it totally changes the application, so it can't be ignored.


The question is :
Is there a way to access this performance without violating the principle which has been stated into this thread, 
that is : it's not authorized to write : u32 = *(U32*)ptr; if ptr is not guaranteed to be properly aligned on 4-bytes boundaries.

Comment 19 Jeffrey Walton 2015-08-12 18:37:01 UTC

(In reply to Yann Collet from comment #18)
> This issue makes me wonder : how to efficiently access unaligned memory ?
> 
> 
> The case in point is ARM cpu.
> They don't support SSE/AVX, so they seem unaffected by this specific issue,
> but this issue force writing the source code in a certain way, to remain
> compatible with vectorizer assumtion.

Just one comment here (sorry for speaking out of turn)....

Modern ARM has __ARM_FEATURE_UNALIGNED, which means the processor tolerates unaligned access. However, I believe it runs afoul of the C/C++ standard and GCC aliasing rules.

> Therefore, for portable code, it becomes an issue :
> how to write a code which is both portable and efficient on both targets ?

I've been relying on intrinsics to side step C/C++ requirements. In the absence of intrinsics, I use inline assembler to avoid the C/C++ language rules.

Now, I could be totally wrong, but I don't feel I'm violating the C/C++ language rules until a write a C/C++ statement that performs the violation. Hence the reason I use intrinsics or drop into assembler.

Comment 20 Paolo Bonzini 2016-03-04 11:03:30 UTC

> how to efficiently access unaligned memory ?

Use memcpy between unsigned char pointers and with a constant size.  The compiler knows to translate it to an unaligned memory access, or even a combination of unaligned and aligned memory accesses:

$ cat f.c
int f(char *restrict a, const char *restrict b)
{
	int i;
	for (i = 0; i < 512; i++)
		a[i] = b[i];
}

$ gcc f.c -O3 -S -o f.s -fdump-tree-optimized
$ cat f.c.191t.optimized

;; Function f (f, funcdef_no=0, decl_uid=1832, cgraph_uid=0, symbol_order=0)

f (char * restrict a, const char * restrict b)
{
  <bb 2>:
  __builtin_memcpy (a_5(D), b_7(D), 512); [tail call]
  return;

}

$ cat f.s
...
f:
	.cfi_startproc
	movq	(%rsi), %rdx
	movq	%rdi, %rax
	leaq	8(%rdi), %rdi
	movq	%rdx, -8(%rdi)
	movq	504(%rsi), %rdx
	movq	%rdx, 496(%rdi)
	andq	$-8, %rdi
	subq	%rdi, %rax
	subq	%rax, %rsi
	addl	$512, %eax
	shrl	$3, %eax
	movl	%eax, %ecx
	rep movsq
	ret
...

It's doing unaligned accesses for the first and last 8 bytes, and 31 aligned 8-byte accesses in the middle.