Bug 86680 - possible gcc optimization
Summary: possible gcc optimization
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 8.0
Importance: P3 enhancement
Target Milestone: 8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-07-26 10:46 UTC by Florian La Roche
Modified: 2023-06-09 18:13 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-07-26 00:00:00


Attachments
testcase (143 bytes, text/plain)
2018-07-26 10:46 UTC, Florian La Roche

Description Florian La Roche 2018-07-26 10:46:51 UTC
Created attachment 44444 [details]
testcase

I can see this on x86_64 and aarch64. The first function is compiled into much
bigger code. It seems the 8-byte alignment, and thus the fact that the length
is a multiple of 8 bytes, is forgotten in some optimization step.

best regards,

Florian La Roche
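
(The 143-byte attachment test.c is not inlined on this page. Judging from the
GIMPLE dumps in comments 2 and 3 -- clear_bss1 compares the addresses of two
extern arrays, clear_bss2 loads two extern pointer variables -- the testcase
presumably looks like the following reconstruction, not the verbatim
attachment:)

extern unsigned long __bss_start1[], __bss_end1[];
extern unsigned long *__bss_start2, *__bss_end2;

void clear_bss1(void)
{
    unsigned long *bss = __bss_start1;
    while (bss < __bss_end1)
        *bss++ = 0UL;
}

void clear_bss2(void)
{
    unsigned long *bss = __bss_start2;
    while (bss < __bss_end2)
        *bss++ = 0UL;
}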




$ aarch64-linux-gnu-gcc-8 -O2 -c test.c
$ aarch64-linux-gnu-objdump -d test.o 

test.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:	90000001 	adrp	x1, 0 <__bss_start1>
   4:	90000000 	adrp	x0, 0 <__bss_end1>
   8:	f9400022 	ldr	x2, [x1]
   c:	f9400000 	ldr	x0, [x0]
  10:	eb00005f 	cmp	x2, x0
  14:	54000142 	b.cs	3c <clear_bss1+0x3c>  // b.hs, b.nlast
  18:	d1000401 	sub	x1, x0, #0x1
  1c:	aa0203e0 	mov	x0, x2
  20:	cb020021 	sub	x1, x1, x2
  24:	927df021 	and	x1, x1, #0xfffffffffffffff8
  28:	91002021 	add	x1, x1, #0x8
  2c:	8b020021 	add	x1, x1, x2
  30:	f800841f 	str	xzr, [x0], #8
  34:	eb01001f 	cmp	x0, x1
  38:	54ffffc1 	b.ne	30 <clear_bss1+0x30>  // b.any
  3c:	d65f03c0 	ret

0000000000000040 <clear_bss2>:
  40:	90000000 	adrp	x0, 0 <__bss_start2>
  44:	90000001 	adrp	x1, 0 <__bss_end2>
  48:	f9400000 	ldr	x0, [x0]
  4c:	f9400021 	ldr	x1, [x1]
  50:	f9400000 	ldr	x0, [x0]
  54:	f9400021 	ldr	x1, [x1]
  58:	eb01001f 	cmp	x0, x1
  5c:	54000082 	b.cs	6c <clear_bss2+0x2c>  // b.hs, b.nlast
  60:	f800841f 	str	xzr, [x0], #8
  64:	eb01001f 	cmp	x0, x1
  68:	54ffffc3 	b.cc	60 <clear_bss2+0x20>  // b.lo, b.ul, b.last
  6c:	d65f03c0 	ret



Please note how much smaller the second function compiles. In the first
function, the instructions from offset 18 to 2c should basically be optimized away.


Compiling with -Os also gives much smaller code:
$ aarch64-linux-gnu-gcc-8 -Os -c test.c
$ aarch64-linux-gnu-objdump -d test.o 

test.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:	90000000 	adrp	x0, 0 <__bss_start1>
   4:	90000001 	adrp	x1, 0 <__bss_end1>
   8:	f9400000 	ldr	x0, [x0]
   c:	f9400021 	ldr	x1, [x1]
  10:	eb01001f 	cmp	x0, x1
  14:	54000043 	b.cc	1c <clear_bss1+0x1c>  // b.lo, b.ul, b.last
  18:	d65f03c0 	ret
  1c:	f800841f 	str	xzr, [x0], #8
  20:	17fffffc 	b	10 <clear_bss1+0x10>

0000000000000024 <clear_bss2>:
  24:	90000000 	adrp	x0, 0 <__bss_start2>
  28:	90000001 	adrp	x1, 0 <__bss_end2>
  2c:	f9400000 	ldr	x0, [x0]
  30:	f9400021 	ldr	x1, [x1]
  34:	f9400000 	ldr	x0, [x0]
  38:	f9400021 	ldr	x1, [x1]
  3c:	eb00003f 	cmp	x1, x0
  40:	54000048 	b.hi	48 <clear_bss2+0x24>  // b.pmore
  44:	d65f03c0 	ret
  48:	f800841f 	str	xzr, [x0], #8
  4c:	17fffffc 	b	3c <clear_bss2+0x18>







The problem also shows up on x86_64, in the instructions from offset 13 to 22:
$ gcc -O2 -c test.c
$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <clear_bss1>:
   0:	48 8d 05 00 00 00 00 	lea    0x0(%rip),%rax        # 7 <clear_bss1+0x7>
   7:	48 8d 15 00 00 00 00 	lea    0x0(%rip),%rdx        # e <clear_bss1+0xe>
   e:	48 39 d0             	cmp    %rdx,%rax
  11:	73 25                	jae    38 <clear_bss1+0x38>
  13:	48 8d 48 08          	lea    0x8(%rax),%rcx
  17:	48 83 c2 07          	add    $0x7,%rdx
  1b:	48 29 ca             	sub    %rcx,%rdx
  1e:	48 83 e2 f8          	and    $0xfffffffffffffff8,%rdx
  22:	48 01 ca             	add    %rcx,%rdx
  25:	0f 1f 00             	nopl   (%rax)
  28:	48 c7 00 00 00 00 00 	movq   $0x0,(%rax)
  2f:	48 83 c0 08          	add    $0x8,%rax
  33:	48 39 d0             	cmp    %rdx,%rax
  36:	75 f0                	jne    28 <clear_bss1+0x28>
  38:	f3 c3                	repz retq 
  3a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

0000000000000040 <clear_bss2>:
  40:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # 47 <clear_bss2+0x7>
  47:	48 8b 15 00 00 00 00 	mov    0x0(%rip),%rdx        # 4e <clear_bss2+0xe>
  4e:	48 39 d0             	cmp    %rdx,%rax
  51:	73 16                	jae    69 <clear_bss2+0x29>
  53:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  58:	48 83 c0 08          	add    $0x8,%rax
  5c:	48 c7 40 f8 00 00 00 	movq   $0x0,-0x8(%rax)
  63:	00 
  64:	48 39 d0             	cmp    %rdx,%rax
  67:	72 ef                	jb     58 <clear_bss2+0x18>
  69:	f3 c3                	repz retq
Comment 1 Martin Liška 2018-07-26 11:04:01 UTC
(In reply to Florian La Roche from comment #0)
> Created attachment 44444 [details]
> testcase
> 
> I can see this on x86_64 and aarch64. The first function is compiled with
> much
> bigger code. Seems the alignment to 8 bytes and thus this multiple of 8
> is forgotten in some optimization step.

Can you please explain what precisely is wrong with 'is compiled with much bigger code'?
Comment 2 Andreas Schwab 2018-07-26 11:45:48 UTC
The extra insns are introduced in the ivopts pass.

  <bb 5> [local count: 105119325]:
  ivtmp.7_7 = (unsigned long) &__bss_start1;
  _11 = (unsigned long) &__bss_end1;
  _12 = _11 + 18446744073709551615;
  _13 = (unsigned long) &__bss_start1;
  _14 = _12 - _13;
  _15 = _14 & 18446744073709551608;
  _16 = (unsigned long) &__bss_start1;
  _17 = _16 + 8;
  _18 = _15 + _17;
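
In plain C terms (one reading of the dump; 18446744073709551615 is -1 and
18446744073709551608 is ~7UL, and the variable names here are only
illustrative), this computes the loop exit bound as:

  unsigned long start = (unsigned long) &__bss_start1;
  unsigned long end   = (unsigned long) &__bss_end1;
  unsigned long bound = ((end - 1 - start) & ~7UL) + (start + 8);

i.e. ivopts recomputes a rounded byte count instead of simply reusing the
end address.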
Comment 3 Florian La Roche 2018-07-26 12:02:14 UTC
Hello Martin,

I expect the two functions clear_bss1() and clear_bss2() to work on
identically aligned data and to produce similar assembler output.
Yet looking at the assembler output, the first function produces
many more instructions. "-Os" also keeps the output pretty small.

The first assembler listing should lose the instructions from 18 to 2c, and
the last listing should lose those from 13 to 22.



Here is another dump from gcc, where the additional pseudocode shows up
after optimization. The lines with the pseudo variables _13 to _20 should
not be produced at all.



;; Function clear_bss1 (clear_bss1, funcdef_no=0, decl_uid=3118, cgraph_uid=0, symbol_order=0)

Removing basic block 6
Removing basic block 7
Removing basic block 8
clear_bss1 ()
{
  unsigned long ivtmp.9;
  void * _11;
  unsigned long _12;
  unsigned long _13;
  unsigned long _16;
  unsigned long _17;
  unsigned long _18;
  unsigned long _19;
  unsigned long _20;

  <bb 2> [15.00%]:
  if (&__bss_start1 < &__bss_end1)
    goto <bb 3>; [85.00%]
  else
    goto <bb 5>; [15.00%]

  <bb 3> [12.75%]:
  ivtmp.9_7 = (unsigned long) &MEM[(void *)&__bss_start1 + 8B];
  _12 = (unsigned long) &__bss_end1;
  _13 = _12 + 7;
  _16 = _13 - ivtmp.9_7;
  _17 = _16 & 18446744073709551608;
  _18 = (unsigned long) &__bss_start1;
  _19 = _18 + 16;
  _20 = _17 + _19;

  <bb 4> [85.00%]:
  # ivtmp.9_10 = PHI <ivtmp.9_1(4), ivtmp.9_7(3)>
  _11 = (void *) ivtmp.9_10;
  MEM[base: _11, offset: -8B] = 0;
  ivtmp.9_1 = ivtmp.9_10 + 8;
  if (ivtmp.9_1 != _20)
    goto <bb 4>; [85.00%]
  else
    goto <bb 5>; [15.00%]

  <bb 5> [15.00%]:
  return;

}


;; Function clear_bss2 (clear_bss2, funcdef_no=1, decl_uid=3127, cgraph_uid=1, symbol_order=1)

Removing basic block 5
Removing basic block 6
Removing basic block 7
Removing basic block 8
clear_bss2 ()
{
  long unsigned int * bss;
  long unsigned int * __bss_end2.2_10;

  <bb 2> [15.00%]:
  bss_5 = __bss_start2;
  __bss_end2.2_10 = __bss_end2;
  if (bss_5 < __bss_end2.2_10)
    goto <bb 3>; [85.00%]
  else
    goto <bb 4>; [15.00%]

  <bb 3> [85.00%]:
  # bss_11 = PHI <bss_6(3), bss_5(2)>
  bss_6 = bss_11 + 8;
  MEM[base: bss_6, offset: -8B] = 0;
  if (bss_6 < __bss_end2.2_10)
    goto <bb 3>; [85.00%]
  else
    goto <bb 4>; [15.00%]

  <bb 4> [15.00%]:
  return;

}





Does this help to explain my bug report?


best regards,

Florian La Roche
Comment 4 Florian La Roche 2018-07-26 12:10:17 UTC
Right, compiling with "-O2 -fno-ivopts" resolves my issues.

best regards,

Florian La Roche
Comment 5 Andrew Pinski 2018-07-27 19:27:43 UTC
I suspect this is due to what looks like undefined behavior.  That is, you cannot compare pointers into two different arrays and expect good behavior.
Comment 6 Jakub Jelinek 2018-07-27 20:43:56 UTC
Yeah.  You probably need something like:
void clear_bss1(void)
{
    unsigned long *bss = __bss_start1;
    asm ("" : "+g" (bss));
    while (bss < __bss_end1)
        *bss++ = 0UL;
}
to hide the UB from the optimizers; pointer arithmetic is only defined within the same object, and similarly pointer comparison other than ==/!=, while the first function involves two different objects.
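
A self-contained sketch of this workaround (assuming the extern declarations
from the original testcase; only the asm line is new):

extern unsigned long __bss_start1[], __bss_end1[];

void clear_bss1(void)
{
    unsigned long *bss = __bss_start1;
    /* The empty asm makes bss opaque to the optimizer, so it can no longer
       prove that the comparison spans two distinct objects. */
    asm ("" : "+g" (bss));
    while (bss < __bss_end1)
        *bss++ = 0UL;
}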
Comment 7 Florian La Roche 2018-07-27 20:49:36 UTC
Hello Andrew Pinski,

shouldn't the compiler see that both pointers must be aligned to 8 bytes,
and thus that their difference must also be a multiple of 8 bytes?

I haven't looked into the gcc sources, but maybe this information could
be exploited for additional optimization.

best regards,

Florian La Roche
Comment 8 Florian La Roche 2018-07-27 21:29:15 UTC
I've found a variant that the compiler optimizes quite nicely.
(Good for the compiler, but I'd be happier to stay with the original code,
which was much easier for humans to read.)



extern unsigned long __bss_start[];
extern unsigned long __bss_end[];
//extern unsigned long __bss_size;

void clear_bss(void)
{
    unsigned long *bss = __bss_start;
    unsigned long i, end = __bss_end - __bss_start;
    //unsigned long i = __bss_size;
    for (i = 0; i < end; i += sizeof (unsigned long))
        *bss++ = 0UL;
}




On aarch64 this results in the following code:
0000000000000000 <clear_bss>:
   0:	90000001 	adrp	x1, 0 <__bss_end>
   4:	90000002 	adrp	x2, 0 <__bss_start>
   8:	f9400021 	ldr	x1, [x1]
   c:	f9400042 	ldr	x2, [x2]
  10:	cb020021 	sub	x1, x1, x2
  14:	9343fc21 	asr	x1, x1, #3
  18:	b40000c1 	cbz	x1, 30 <clear_bss+0x30>
  1c:	d2800000 	mov	x0, #0x0                   	// #0
  20:	f822681f 	str	xzr, [x0, x2]
  24:	91002000 	add	x0, x0, #0x8
  28:	eb00003f 	cmp	x1, x0
  2c:	54ffffa8 	b.hi	20 <clear_bss+0x20>  // b.pmore
  30:	d65f03c0 	ret


Jakub, your example code also resulted in pretty large code
(but I've only tested 8.0.1 on this, not the newest release).


Thanks a lot,
best regards,

Florian La Roche
Comment 9 Florian La Roche 2018-07-27 21:49:06 UTC
Whew, I even introduced an error there: end counted elements while i was
stepped in bytes, so only an eighth of the region got cleared. This version
works, but it is getting complex compared to the original code:



extern unsigned long __bss_start[];
extern unsigned long __bss_end[];

void clear_bss(void)
{
    unsigned long *bss = __bss_start;
    unsigned long i, end = (__bss_end - __bss_start) * sizeof (unsigned long);
    for (i = 0; i < end; i += sizeof (unsigned long))
        *bss++ = 0UL;
}


best regards,

Florian La Roche
Comment 10 Florian La Roche 2018-07-27 21:56:49 UTC
In my opinion, the result for
"end = (__bss_end - __bss_start) * sizeof (unsigned long)"
in my last testcase shows that the compiler should be able to
optimize the originally submitted test code just as well.

(It is of course still completely unclear whether this makes sense to implement.)

best regards,

Florian La Roche
Comment 11 Florian La Roche 2018-11-29 16:41:53 UTC
Below is my current code, which disables the ivopts optimization for this one
function and thus generates code of acceptable length.

best regards,

Florian La Roche




#if __GNUC__ > 4
#define __gcc_no_ivopts __attribute__ ((optimize("no-ivopts")))
#else
#define __gcc_no_ivopts
#endif

extern unsigned long __bss_start[], __bss_end[];

void __gcc_no_ivopts clear_bss(void)
{
    unsigned long *bss = __bss_start;
#if 1
    while (bss < __bss_end)
        *bss++ = 0UL;
#else
    unsigned long i, end = (__bss_end - __bss_start) * sizeof(unsigned long);
    for (i = 0; i < end; i += sizeof(unsigned long))
        *bss++ = 0UL;
#endif
}
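
(For reference, GCC's option pragmas should give the same effect as the
function attribute -- a sketch, not tested here:

#pragma GCC push_options
#pragma GCC optimize ("no-ivopts")
void clear_bss(void) { /* ... as above ... */ }
#pragma GCC pop_options

or the whole file can simply be compiled with -O2 -fno-ivopts, as in
comment 4.)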
Comment 12 Andrew Pinski 2023-06-09 18:13:44 UTC
These days we get:
  _1 = (unsigned long) &__bss_end1;
  _7 = _1 + 18446744073709551615;
  _6 = (unsigned long) &__bss_start1;
  _11 = _7 - _6;
  _12 = _11 >> 3;
  _13 = _12 + 1;
  _14 = _13 * 8;
  __builtin_memset (&__bss_start1, 0, _14); [tail call]


Or, written out:
((((end - 1) - start) >> 3) + 1) * 8

In theory (I think) we could instead do:
(end - start + 7) & -8
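
A quick sanity check that both expressions agree (a sketch, not part of the
PR; for 8-byte-aligned start < end, both reduce to end - start):

#include <assert.h>

int main(void)
{
    for (unsigned long start = 0; start < 64; start += 8)
        for (unsigned long end = start + 8; end <= 128; end += 8) {
            unsigned long cur     = ((((end - 1) - start) >> 3) + 1) * 8;
            unsigned long simpler = (end - start + 7) & -8UL;
            /* Both round the byte count up to a multiple of 8. */
            assert(cur == simpler && simpler == end - start);
        }
    return 0;
}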