Bug 81907 - memset called when it does not need to be; -mtune=cortex-a9
Summary: memset called when it does not need to be; -mtune=cortex-a9
Status: RESOLVED WORKSFORME
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 6.3.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-08-21 02:52 UTC by dongkyun.s
Modified: 2017-09-06 01:15 UTC
0 users

See Also:
Host:
Target: arm-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-08-21 00:00:00


Attachments
memset_test (1.30 KB, application/x-compressed)
2017-08-21 02:52 UTC, dongkyun.s
memset_test_cortex-a9.o (made by '-Os -mtune=cortex-a12') (537 bytes, application/x-object)
2017-08-21 05:04 UTC, dongkyun.s
obj made by '-Os -mtune=cortex-a9' (537 bytes, application/x-object)
2017-08-21 05:21 UTC, dongkyun.s
obj made by '-Os -mtune=cortex-a12' (523 bytes, application/x-object)
2017-08-21 05:35 UTC, dongkyun.s

Description dongkyun.s 2017-08-21 02:52:21 UTC
Created attachment 42013
memset_test

Compiling the attached source without a trivial memset implementation:

Fails with "undefined reference to `memset'" when using
OPTFLAGS = -Os -g -mabi=aapcs -fno-function-sections -Wall -mfloat-abi=soft -mtune=cortex-a9

Succeeds with
OPTFLAGS = -Os -g -mabi=aapcs -fno-function-sections -Wall -mfloat-abi=soft -mtune=cortex-a12

Using -O2 instead of -Os (optimization level) also fixes the failure.
What differs in GCC's optimization behavior (implementation) between cortex-a9 and cortex-a12, as selected by the -mcpu or -mtune option?

Found related issue in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888.
Comment 1 Andrew Pinski 2017-08-21 03:56:02 UTC
>What is different optimization behavior(implementation) in GCC between cortex-a9 and cortex-a12 -given by mcpu or mtune option ?

Different tuning.  Though maybe at -Os should be almost the same except for the allowance for using the instructions that are in cortex-a12 rather than a9 (for the -mcpu case). 

But really memset is part of the C standard here and you don't use -fno-hoisting option; though IIRC that still requires memset being included in your libc.
Comment 2 Andrew Pinski 2017-08-21 03:56:52 UTC
(In reply to dongkyun.s from comment #0)
> Found related issue in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888.

Unrelated bug report.
Comment 3 Andrew Pinski 2017-08-21 03:59:06 UTC
Not a bug, see PR 63393 comment #5 for explanation of why.

*** This bug has been marked as a duplicate of bug 63393 ***
Comment 4 dongkyun.s 2017-08-21 05:04:16 UTC
Created attachment 42014
memset_test_cortex-a9.o (made by '-Os -mtune=cortex-a12')

./gcc-linaro-6.3.1-2017.02-x86_64_arm-linux-gnueabi/bin/arm-linux-gnueabi-objdump -d memset_test_cortex-a12.o

memset_test_cortex-a12.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <func1>:
   0:	b530      	push	{r4, r5, lr}
   2:	f1a0 0208 	sub.w	r2, r0, #8
   6:	460c      	mov	r4, r1
   8:	2300      	movs	r3, #0
   a:	2000      	movs	r0, #0
   c:	2100      	movs	r1, #0
   e:	42a3      	cmp	r3, r4
  10:	db00      	blt.n	14 <func1+0x14>
  12:	bd30      	pop	{r4, r5, pc}
  14:	f852 5f08 	ldr.w	r5, [r2, #8]!
  18:	3301      	adds	r3, #1
  1a:	1940      	adds	r0, r0, r5
  1c:	eb41 71e5 	adc.w	r1, r1, r5, asr #31
  20:	e7f5      	b.n	e <func1+0xe>

00000022 <test_func>:
  22:	b51f      	push	{r0, r1, r2, r3, r4, lr}
  24:	2200      	movs	r2, #0
  26:	490c      	ldr	r1, [pc, #48]	; (58 <test_func+0x36>)
  28:	2300      	movs	r3, #0
  2a:	e9cd 2300 	strd	r2, r3, [sp]
  2e:	e9cd 2302 	strd	r2, r3, [sp, #8]
  32:	780c      	ldrb	r4, [r1, #0]
  34:	7908      	ldrb	r0, [r1, #4]
  36:	4623      	mov	r3, r4
  38:	4302      	orrs	r2, r0
  3a:	e9cd 2300 	strd	r2, r3, [sp]
  3e:	788a      	ldrb	r2, [r1, #2]
  40:	2300      	movs	r3, #0
  42:	4668      	mov	r0, sp
  44:	f043 0307 	orr.w	r3, r3, #7
  48:	2102      	movs	r1, #2
  4a:	e9cd 2302 	strd	r2, r3, [sp, #8]
  4e:	f7ff fffe 	bl	0 <func1>
  52:	b004      	add	sp, #16
  54:	bd10      	pop	{r4, pc}
  56:	bf00      	nop
  58:	00000000 	.word	0x00000000
Comment 5 dongkyun.s 2017-08-21 05:21:59 UTC
Created attachment 42016
obj made by '-Os -mtune=cortex-a9'

./gcc-linaro-6.3.1-2017.02-x86_64_arm-linux-gnueabi/bin/arm-linux-gnueabi-objdump -d memset_test_cortex-a9.o

memset_test_cortex-a9.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <func1>:
   0:	b530      	push	{r4, r5, lr}
   2:	f1a0 0208 	sub.w	r2, r0, #8
   6:	460c      	mov	r4, r1
   8:	2300      	movs	r3, #0
   a:	2000      	movs	r0, #0
   c:	2100      	movs	r1, #0
   e:	42a3      	cmp	r3, r4
  10:	db00      	blt.n	14 <func1+0x14>
  12:	bd30      	pop	{r4, r5, pc}
  14:	f852 5f08 	ldr.w	r5, [r2, #8]!
  18:	3301      	adds	r3, #1
  1a:	1940      	adds	r0, r0, r5
  1c:	eb41 71e5 	adc.w	r1, r1, r5, asr #31
  20:	e7f5      	b.n	e <func1+0xe>

00000022 <test_func>:
  22:	b51f      	push	{r0, r1, r2, r3, r4, lr}
  24:	2210      	movs	r2, #16
  26:	2100      	movs	r1, #0
  28:	4668      	mov	r0, sp
  2a:	f7ff fffe 	bl	0 <memset>
  2e:	490a      	ldr	r1, [pc, #40]	; (58 <test_func+0x36>)
  30:	2200      	movs	r2, #0
  32:	780c      	ldrb	r4, [r1, #0]
  34:	7908      	ldrb	r0, [r1, #4]
  36:	4623      	mov	r3, r4
  38:	4302      	orrs	r2, r0
  3a:	4668      	mov	r0, sp
  3c:	e9cd 2300 	strd	r2, r3, [sp]
  40:	2300      	movs	r3, #0
  42:	788a      	ldrb	r2, [r1, #2]
  44:	f043 0307 	orr.w	r3, r3, #7
  48:	2102      	movs	r1, #2
  4a:	e9cd 2302 	strd	r2, r3, [sp, #8]
  4e:	f7ff fffe 	bl	0 <func1>
  52:	b004      	add	sp, #16
  54:	bd10      	pop	{r4, pc}
  56:	bf00      	nop
  58:	00000000 	.word	0x00000000
Comment 6 dongkyun.s 2017-08-21 05:35:28 UTC
Created attachment 42017
obj made by '-Os -mtune=cortex-a12'

./gcc-linaro-6.3.1-2017.02-x86_64_arm-linux-gnueabi/bin/arm-linux-gnueabi-objdump -d memset_test_cortex-a12.o

memset_test_cortex-a12.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <func1>:
   0:	b530      	push	{r4, r5, lr}
   2:	f1a0 0208 	sub.w	r2, r0, #8
   6:	460c      	mov	r4, r1
   8:	2300      	movs	r3, #0
   a:	2000      	movs	r0, #0
   c:	2100      	movs	r1, #0
   e:	42a3      	cmp	r3, r4
  10:	db00      	blt.n	14 <func1+0x14>
  12:	bd30      	pop	{r4, r5, pc}
  14:	f852 5f08 	ldr.w	r5, [r2, #8]!
  18:	3301      	adds	r3, #1
  1a:	1940      	adds	r0, r0, r5
  1c:	eb41 71e5 	adc.w	r1, r1, r5, asr #31
  20:	e7f5      	b.n	e <func1+0xe>

00000022 <test_func>:
  22:	b51f      	push	{r0, r1, r2, r3, r4, lr}
  24:	2200      	movs	r2, #0
  26:	490c      	ldr	r1, [pc, #48]	; (58 <test_func+0x36>)
  28:	2300      	movs	r3, #0
  2a:	e9cd 2300 	strd	r2, r3, [sp]
  2e:	e9cd 2302 	strd	r2, r3, [sp, #8]
  32:	780c      	ldrb	r4, [r1, #0]
  34:	7908      	ldrb	r0, [r1, #4]
  36:	4623      	mov	r3, r4
  38:	4302      	orrs	r2, r0
  3a:	e9cd 2300 	strd	r2, r3, [sp]
  3e:	788a      	ldrb	r2, [r1, #2]
  40:	2300      	movs	r3, #0
  42:	4668      	mov	r0, sp
  44:	f043 0307 	orr.w	r3, r3, #7
  48:	2102      	movs	r1, #2
  4a:	e9cd 2302 	strd	r2, r3, [sp, #8]
  4e:	f7ff fffe 	bl	0 <func1>
  52:	b004      	add	sp, #16
  54:	bd10      	pop	{r4, pc}
  56:	bf00      	nop
  58:	00000000 	.word	0x00000000
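
For readers comparing the listings: the only difference in test_func is how the 16-byte stack array is cleared before the element stores. The cortex-a9 object sets r0 = sp, r1 = 0, r2 = 16 and calls memset, while the cortex-a12 object clears it with two strd stores of a zero register pair. A rough C-level sketch of the two strategies follows. The helper names zero_by_call and zero_inline are hypothetical and do not appear in the attachment, and a compiler is free to turn either form into the other.

```c
#include <stdint.h>
#include <string.h>

/* Roughly what the -mtune=cortex-a9 listing does:
   r0 = sp, r1 = 0, r2 = 16, then "bl memset". */
void zero_by_call(uint64_t x[2])
{
    memset(x, 0, 2 * sizeof(uint64_t));
}

/* Roughly what the -mtune=cortex-a12 listing does:
   two 64-bit stores of zero (the two "strd r2, r3" instructions). */
void zero_inline(uint64_t x[2])
{
    x[0] = 0;
    x[1] = 0;
}
```

Both helpers leave the array in the same state; only the first one creates a link-time dependency on memset.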
Comment 7 dongkyun.s 2017-08-21 05:51:15 UTC
> Different tuning.  Though maybe at -Os should be almost the same except for the allowance for using the instructions that are in cortex-a12 rather than a9 (for the -mcpu case). 
I attached .o files made by '-mtune=cortex-a9' and '-mtune=cortex-a12' (same as the -mcpu case).
Could you describe in more detail why memset is added for cortex-a9 or below?

memset_test_cortex-a9.o:     file format elf32-littlearm
Disassembly of section .text:
...
00000022 <test_func>:
  22:	b51f      	push	{r0, r1, r2, r3, r4, lr}
  24:	2210      	movs	r2, #16
  26:	2100      	movs	r1, #0
  28:	4668      	mov	r0, sp
  2a:	f7ff fffe 	bl	0 <memset>

> But really memset is part of the C standard here and you don't use -fno-hoisting option; 
Which option do you mean? (I'm sorry, but -fno-hoisting is not found.)

> Not a bug, see PR 63393 comment #5 for explanation of why.
This is not related to freestanding implementations. Again, the only difference in options is '-mcpu' or '-mtune'.

(1) CFLAGS: -Os -mtune=cortex-a9
(CC) memset_test.o
(CC) main.o
gcc-linaro-6.3.1-2017.02-x86_64_arm-linux-gnueabi/bin/arm-linux-gnueabi-ld -Bstatic -o memset_test \
	memset_test.o main.o \
	--start-group -L/home/dongkyun.s/tmp/memset_test/gcc-linaro-6.3.1-2017.02-x86_64_arm-linux-gnueabi/bin/../lib/gcc/arm-linux-gnueabi/6.3.1 -lgcc --end-group -Map memset_test.map #--gc-sections
memset_test.c:(.text+0x2a): undefined reference to `memset'

(2) CFLAGS: -Os -mtune=cortex-a12
(CC) memset_test.o
(CC) main.o
gcc-linaro-6.3.1-2017.02-x86_64_arm-linux-gnueabi/bin/arm-linux-gnueabi-ld -Bstatic -o memset_test \
	memset_test.o main.o \
	--start-group -L/home/dongkyun.s/tmp/memset_test/gcc-linaro-6.3.1-2017.02-x86_64_arm-linux-gnueabi/bin/../lib/gcc/arm-linux-gnueabi/6.3.1 -lgcc --end-group -Map memset_test.map #--gc-sections
BUILD_TARGETS=memset_test.bin memset_test.txt memset_test.dis memset_test.ver
Build Done!
Comment 8 Andrew Pinski 2017-08-21 05:54:41 UTC
>This is not related to freestanding implementations.

Huh?  Since you are not linking against the C library, it has to be.
Or do you mean this should be optimized not to use memset? That is a different question from what your summary is about.
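
For a link like the one in comment 7, which pulls in only -lgcc, it may help to note that GCC expects even a freestanding environment to provide memcpy, memmove, memset and memcmp, because the compiler can emit calls to them on its own. A minimal sketch of the kind of trivial implementation the description refers to is shown here. The name trivial_memset is hypothetical, chosen so the sketch can be built next to a normal libc; a real freestanding build would name it memset.

```c
#include <stddef.h>

/* Byte-wise fill. Hypothetical name: rename to "memset" in a build that
   does not link a C library, so compiler-generated calls can resolve. */
void *trivial_memset(void *dst, int value, size_t len)
{
    unsigned char *p = dst;

    while (len--)
        *p++ = (unsigned char)value;

    return dst;   /* memset returns its first argument */
}
```

One caveat, stated as an assumption about typical setups rather than about this testcase: once the function is really named memset, GCC may recognize the loop and replace it with a call to memset, so such a file is commonly built with -fno-tree-loop-distribute-patterns.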
Comment 9 dongkyun.s 2017-08-21 06:07:49 UTC
> Or do you mean this should be optimized not to use memset? That is a different question from what your summary is about.

I mean that neither -ffreestanding nor -fno-freestanding is used in this testcase; only the -mtune/-mcpu option differs.
Thanks!
Comment 10 Andrew Pinski 2017-08-21 07:05:35 UTC
(In reply to dongkyun.s from comment #9)
> I mean that neither -ffreestanding nor -fno-freestanding is used in this
> testcase; only the -mtune/-mcpu option differs.
Yes, but your summary said memset was missing, which is not correct and would make this bug report invalid.  In reality you are complaining that the memset was not needed in the first place: why is it used for -mtune=cortex-a9 when -mtune=cortex-a12 can get away without it?

Two different issues :).
Comment 11 dongkyun.s 2017-08-21 07:17:32 UTC
Dear pinskia@gcc.gnu.org,
Thanks for correcting the title to "memset called when it does not need to be; -mtune=cortex-a9" along with the comment :)
Comment 12 ktkachov 2017-08-21 09:36:20 UTC
Confirmed the call on 6.4.1 but GCC 7 and trunk don't generate the call for -mcpu=cortex-a9 .
I don't know off the top of my head what change fixed this though.
Comment 13 dongkyun.s 2017-08-22 05:45:23 UTC
> Confirmed the call on 6.4.1 but GCC 7 and trunk don't generate the call for -mcpu=cortex-a9 .

I also verified that the memset call is not generated with GCC 7.1 + "-mcpu=cortex-a9 or -mtune=cortex-a9" or lower.

It is interesting that
in GCC 6,
- the memset call is not generated for -mcpu=cortex-a12 or higher (e.g., cortex-a15, V7 big.LITTLE)
- the memset call is always generated for -mcpu=cortex-a9 or lower (e.g., cortex-a7, cortex-a5)

in GCC 7.1
- the memset call is never generated (even with V3 architecture processors, e.g., -mcpu=arm7)
Comment 14 Richard Earnshaw 2017-08-22 10:11:34 UTC
(In reply to dongkyun.s from comment #13)
> > Confirmed the call on 6.4.1 but GCC 7 and trunk don't generate the call for -mcpu=cortex-a9 .
> 
> I also verified that the memset call is not generated with GCC 7.1 +
> "-mcpu=cortex-a9 or -mtune=cortex-a9" or lower.
> 
> It is interesting that
> in GCC 6,
> - the memset call is not generated for -mcpu=cortex-a12 or higher (e.g.,
> cortex-a15, V7 big.LITTLE)
> - the memset call is always generated for -mcpu=cortex-a9 or lower (e.g.,
> cortex-a7, cortex-a5)
> 
> in GCC 7.1
> - the memset call is never generated (even with V3 architecture
> processors, e.g., -mcpu=arm7)

There's nothing in the compiler that explicitly says: use memset for these cores and not for others.  The choice will be down to available instructions and their relative costs.
Comment 15 dongkyun.s 2017-08-23 01:06:11 UTC
> There's nothing in the compiler that explicitly says: use memset for these cores and not for others.  The choice will be down to available instructions and their relative costs.

Agreed, but I'm just wondering why the behavior differs between GCC versions with -Os. (The result should be the same if the choice is made by available instructions and their costs.)
Comment 16 Michail 2017-08-30 14:54:41 UTC
> Agreed, but I'm just wondering why the behavior differs between GCC
> versions with -Os. (The result should be the same if the choice is made by
> available instructions and their costs.)

I think that for this example GCC 7 generates memset() call after changes in tree-ssa-dse https://gcc.gnu.org/viewcvs/gcc?limit_changes=0&view=revision&revision=244442
(tuning is the same)

for reduced test:
$ cat memset_test_reduced.c
long long func1( long long *pl) 
{
  long  long  r = 0;
  for(int i=0; i<2; i++)
    r += ( long long ) pl[i];
  return r;
}

long long test_func(void) 
{
  long long x[2] = {0};
  x[0] = 3;
  x[1] = 4;
  return func1(x);
}

compiled with:
gcc -S memset_test_reduced.c -g -mabi=aapcs -fno-function-sections -Wall -mfloat-abi=soft -Os -mtune=cortex-a9 -fdump-tree-all

The difference between the GIMPLE produced by gcc-6.4.1 and gcc-7.2.1 is that gcc-7 optimized out "x = {};" in the first DSE pass:

diff -u 6.4.1/memset_test_reduced.c.210t.optimized  7.2.1/memset_test_reduced.c.227t.optimized
...
test_func ()
 {
   long long int x[2];
-  long long int _5;
+  long long int _4;
 
-  <bb 2>:
==============
-  x = {};
==============
+  <bb 2> [100.00%]:
   x[0] = 3;
   x[1] = 4;
-  _5 = func1 (&x);
+  _4 = func1 (&x);
   x ={v} {CLOBBER};
-  return _5;
+  return _4;
}

GCC 6 keeps it and eventually transforms it into a memset() call (depending on tuning options?).
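
To make the effect of the removed "x = {};" concrete, here is a hand-written C approximation of the two variants, based on the reduced test above. The names test_func_with_zeroing and test_func_without_zeroing are hypothetical; the second one only mimics at source level what the GIMPLE diff shows the DSE pass doing, it is not output from either compiler.

```c
/* func1 as in the reduced test. */
long long func1(long long *pl)
{
    long long r = 0;
    for (int i = 0; i < 2; i++)
        r += pl[i];
    return r;
}

/* Corresponds to the GIMPLE that still contains "x = {};":
   the whole array is zeroed, then every element is overwritten. */
long long test_func_with_zeroing(void)
{
    long long x[2] = {0};
    x[0] = 3;
    x[1] = 4;
    return func1(x);
}

/* Corresponds to the GIMPLE without "x = {};": the zeroing is a dead
   store because both elements are assigned before they are read. */
long long test_func_without_zeroing(void)
{
    long long x[2];
    x[0] = 3;
    x[1] = 4;
    return func1(x);
}
```

Both variants return the same value, so dropping the initializer is safe here. On a compiler that keeps the zeroing, writing the second form by hand is one possible source-level workaround, assuming every element really is assigned before use.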
Comment 17 Michail 2017-08-30 15:05:29 UTC
> I think that for this example GCC 7 generates memset() call after changes in
> tree-ssa-dse

I mean GCC 6 generates memset()
Comment 18 dongkyun.s 2017-09-06 01:15:12 UTC
Dear Michail,
Your analysis was very helpful. I've also verified that the compiler may or may not insert a memset() call depending on 1) DSE optimization - object size/base and tune options 2) code generation.