88560 – [9 Regression] armv8_2-fp16-move-1.c and related regressions after r266385

Bug 88560 - [9 Regression] armv8_2-fp16-move-1.c and related regressions after r266385

Summary: [9 Regression] armv8_2-fp16-move-1.c and related regressions after r266385

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	9.0

Importance:	P1 normal
Target Milestone:	9.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization, ra

Depends on:
Blocks:

Reported:	2018-12-20 10:45 UTC by Sam Tebbs
Modified:	2019-02-11 16:54 UTC (History)
CC List:	5 users (show)

See Also:
Host:
Target:	arm
Build:
Known to work:
Known to fail:
Last reconfirmed:	2018-12-20 00:00:00

Attachments
Code generated for armv8_2-fp16-move-1.c with r266385 (996 bytes, text/plain) 2018-12-20 10:45 UTC, Sam Tebbs	Details
Code generated for armv8_2-fp16-move-1.c without r266385 (1000 bytes, text/plain) 2018-12-20 10:46 UTC, Sam Tebbs	Details
Proposed patch (558 bytes, patch) 2019-01-21 22:13 UTC, Vladimir Makarov	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sam Tebbs 2018-12-20 10:45:42 UTC

Created attachment 45266 [details]
Code generated for armv8_2-fp16-move-1.c with r266385

Several regressions were seen on arm-none-linux-gnueabihf and arm-none-eabi after r260385.

FAIL: gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times 
ldrh\\tr[0-9]+ 2
FAIL: gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times 
vld1\\.16\\t{d[0-9]+\\[[0-9]+\\]}, \\[r[0-9]+\\] 2
FAIL: gcc.target/arm/armv8_2-fp16-move-1.c scan-assembler-times 
vmov\\.f16\\ts[0-9]+, r[0-9]+ 2
FAIL: gcc.target/arm/fp16-aapcs-1.c scan-assembler 
vmov(\\.f16)?\\tr[0-9]+, s[0-9]+
FAIL: gcc.target/arm/fp16-aapcs-1.c scan-assembler vmov(\\.f16)?\\ts0, 
r[0-9]+
FAIL: gcc.target/arm/fp16-aapcs-3.c scan-assembler-times vmov\\tr[0-9]+, 
s[0-2] 2
FAIL: gcc.target/arm/fp16-aapcs-3.c scan-assembler-times vmov\\ts0, 
r[0-9]+ 2

Full command line used to compile and test armv8_2-fp16-move-1.c (done by make check-gcc):

bin/gcc armv8_2-fp16-move-1.c -fno-diagnostics-show-caret 
-fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 
-mfpu=fp-armv8 -march=armv8.2-a+fp16 -mfloat-abi=hard -ffat-lto-objects 
-fno-ident -S -o armv8_2-fp16-move-1.s.

Comment 1 Sam Tebbs 2018-12-20 10:46:36 UTC

Created attachment 45267 [details]
Code generated for armv8_2-fp16-move-1.c without r266385

Comment 2 Wilco 2018-12-20 13:22:27 UTC

Eg. before

test_load_store_1:
        ldrh    r3, [r2, r1, lsl #1]    @ __fp16
        strh    r3, [r0, r1, lsl #1]    @ __fp16
        bx      lr

after:

test_load_store_1:
        vmov.f16        s0, r3  @ __fp16
        ldrh    r3, [r2, r1, lsl #1]    @ __fp16
        strh    r3, [r0, r1, lsl #1]    @ __fp16
        bx      lr

Inserting spurious extra moves certainly doesn't look correct.

Comment 3 Florian Weimer 2018-12-20 13:25:40 UTC

Is the revision number (r260385) really correct?

Comment 4 Sam Tebbs 2018-12-20 13:28:11 UTC

(In reply to Florian Weimer from comment #3)
> Is the revision number (r260385) really correct?

Sorry it was r266385, fixed that now.

Comment 5 Vladimir Makarov 2019-01-18 20:56:08 UTC

We have too many tests checking expected generated code.  We should more focus on overall effect of the change.  SPEC would be a good criterium although it is hard to check SPEC for each patch.

I've checked the generated code of arm8_2-fp16-move-1.c and found that in most cases GCC generates better code with the patch:

@@ -80,7 +80,6 @@ test_load_store_1:
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        lsl     r1, r1, #1
-       vmov.f16        s0, r3  @ __fp16
        ldrh    r3, [r2, r1]    @ __fp16
        strh    r3, [r0, r1]    @ __fp16
        bx      lr
@@ -97,10 +96,11 @@ test_load_store_2:
        @ link register save eliminated.
        add     r1, r1, #2
        lsl     r1, r1, #1
+       add     r3, r2, r1
        add     r0, r0, r1
-       ldrh    r3, [r2, r1]    @ __fp16
+       vld1.16 {d0[0]}, [r3]
+       vmov.f16        r3, s0  @ __fp16
        strh    r3, [r0, #-4]   @ __fp16
-       vmov.f16        s0, r3  @ __fp16
        bx      lr
        .size   test_load_store_2, .-test_load_store_2
        .align  2
@@ -176,14 +176,13 @@ test_select_5:
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vcvtb.f32.f16   s0, s0
-       vcvtb.f32.f16   s14, s1
-       vmov    s15, s1 @ __fp16
-       vcmpe.f32       s0, s14
+       vcvtb.f32.f16   s15, s1
+       vcmpe.f32       s0, s15
        vmrs    APSR_nzcv, FPSCR
        bmi     .L17
-       vmov    s15, s2 @ __fp16
+       vmov    s1, s2  @ __fp16
 .L17:
-       vmov    s0, s15 @ __fp16
+       vmov    s0, s1  @ __fp16
        bx      lr
        .size   test_select_5, .-test_select_5
        .align  2
@@ -197,14 +196,13 @@ test_select_6:
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vcvtb.f32.f16   s0, s0
-       vcvtb.f32.f16   s14, s1
-       vmov    s15, s1 @ __fp16
-       vcmpe.f32       s0, s14
+       vcvtb.f32.f16   s15, s1
+       vcmpe.f32       s0, s15
        vmrs    APSR_nzcv, FPSCR
        bls     .L19
-       vmov    s15, s2 @ __fp16
+       vmov    s1, s2  @ __fp16
 .L19:
-       vmov    s0, s15 @ __fp16
+       vmov    s0, s1  @ __fp16
        bx      lr
        .size   test_select_6, .-test

The only worse code is for test_load_store_2.  The culprit insn is 

(insn 12 10 16 2 (set (mem:HF (plus:SI (reg/f:SI 121)
                (const_int -4 [0xfffffffffffffffc])) [1 *_6+0 S2 A16])
        (reg:HF 116 [ <retval> ])) "b.i":4:8 653 {*movhf_vfp_fp16}
     (expr_list:REG_DEAD (reg/f:SI 121)
        (nil)))

Before the patch p116 had GENERAL_REGS class and after the patch it has VFP_LO_REGS.  The memory move cost for VFP_LO_REGS is small although the insn for VFP_LO_REGS prohibits memory with this kind address.

There are a few ways to fix this:
  1. give RA correct cost for store VFP_LO_REGS into given memory which means adding a new hook taking specific memory rtx
  2. check insn constraints to modify cost in some way
  3. try to reload address instead of memory itself

I prefer the last solution although it requires changes in sensitive part of LRA and will require a lot testing on a few targets.  So I am working on the patch.

In any case, I don't think the PR should have P1.  P3 or even P4 would be more accurate estimation (please see my top comments).

Comment 6 Wilco 2019-01-18 21:43:15 UTC

(In reply to Vladimir Makarov from comment #5)
> We have too many tests checking expected generated code.  We should more
> focus on overall effect of the change.  SPEC would be a good criterium
> although it is hard to check SPEC for each patch.
> 
> I've checked the generated code of arm8_2-fp16-move-1.c and found that in
> most cases GCC generates better code with the patch:
> 
> @@ -80,7 +80,6 @@ test_load_store_1:
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         lsl     r1, r1, #1
> -       vmov.f16        s0, r3  @ __fp16
>         ldrh    r3, [r2, r1]    @ __fp16
>         strh    r3, [r0, r1]    @ __fp16
>         bx      lr

When I tested it, this test added that vmov, not removed it - see comment #2.

Comment 7 Vladimir Makarov 2019-01-21 16:53:31 UTC

(In reply to Wilco from comment #6)
> (In reply to Vladimir Makarov from comment #5)
> > We have too many tests checking expected generated code.  We should more
> > focus on overall effect of the change.  SPEC would be a good criterium
> > although it is hard to check SPEC for each patch.
> > 
> > I've checked the generated code of arm8_2-fp16-move-1.c and found that in
> > most cases GCC generates better code with the patch:
> > 
> > @@ -80,7 +80,6 @@ test_load_store_1:
> >         @ frame_needed = 0, uses_anonymous_args = 0
> >         @ link register save eliminated.
> >         lsl     r1, r1, #1
> > -       vmov.f16        s0, r3  @ __fp16
> >         ldrh    r3, [r2, r1]    @ __fp16
> >         strh    r3, [r0, r1]    @ __fp16
> >         bx      lr
> 
> When I tested it, this test added that vmov, not removed it - see comment #2.

I've just checked the generated code again by using

./xgcc -B. /home/vmakarov/build1/trunk/gcc/gcc/testsuite/gcc.target/arm/armv8_2-fp16-move-1.c -fno-diagnostics-show-caret -fno-diagnostics-show-line-numbers -fdiagnostics-color=never -O2 -mfpu=fp-armv8 -march=armv8.2-a+fp16 -mfloat-abi=hard -ffat-lto-objects -fno-ident -o b1.s -S
 
The code is 

test_load_store_1:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        lsl     r1, r1, #1
        ldrh    r3, [r2, r1]    @ __fp16
        strh    r3, [r0, r1]    @ __fp16
        bx      lr

GCC revision is r267848 as of Jan 11,2019.

Comment 8 Vladimir Makarov 2019-01-21 22:13:42 UTC

Created attachment 45485 [details]
Proposed patch

Comment 9 Vladimir Makarov 2019-01-21 22:14:30 UTC

(In reply to Vladimir Makarov from comment #8)
> Created attachment 45485 [details]
> Proposed patch

Does this patch solves the problem?

Comment 10 Tamar Christina 2019-01-23 13:43:10 UTC

Thanks for the patch Vladimir!

I've started a validation of the patch, will let you know as soon as it finishes.

Comment 11 Tamar Christina 2019-01-25 17:21:31 UTC

Hi Vladimir,

I've tested the patch and checked the testcases.

The code is now better in most cases so no issue there. The testcases will need to be updated but I can do that after the patch is committed.

I've kicked off an overnight native regression test and will inspect the result and any failing tests and update you first thing monday.

Comment 12 Tamar Christina 2019-02-04 15:35:26 UTC

Hi Vladimir,

sorry for the delay, I was away last week.  The patch looks good, please go ahead and commit it.

I will fix the testisms after it has been committed.

Thanks,
Tamar

Comment 13 Vladimir Makarov 2019-02-08 19:01:46 UTC

Author: vmakarov
Date: Fri Feb  8 19:01:10 2019
New Revision: 268705

URL: https://gcc.gnu.org/viewcvs?rev=268705&root=gcc&view=rev
Log:
2019-02-08  Vladimir Makarov  <vmakarov@redhat.com>

	PR middle-end/88560
	* lra-constraints.c (process_alt_operands): Don't increase reject
	for memory when offset memory is required.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-constraints.c

Comment 14 Jakub Jelinek 2019-02-08 19:13:37 UTC

Hopefully fixed.

Comment 15 Tamar Christina 2019-02-11 16:54:49 UTC

Author: tnfchris
Date: Mon Feb 11 16:54:18 2019
New Revision: 268772

URL: https://gcc.gnu.org/viewcvs?rev=268772&root=gcc&view=rev
Log:
Arm: Update tests after register allocation changes. (PR/target 88560)


After the register allocator changes of r268705 we need to update a few tests
with new output.

In all cases the compiler is now generating the expected code, since the tests
are all float16 testcases using a hard-floar abi, we expect that actual fp16
instructions are used rather than using integer loads and stores.  Because of
we also save on some mov.f16s that were being emitted before to move between
the two.

The aapcs cases now match the f32 cases in using floating point operations.

gcc/testsuite/Changelog

	PR middle-end/88560
	* gcc.target/arm/armv8_2-fp16-move-1.c: Update assembler scans.
	* gcc.target/arm/fp16-aapcs-1.c: Likewise.
	* gcc.target/arm/fp16-aapcs-3.c: Likewise.



Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/arm/armv8_2-fp16-move-1.c
    trunk/gcc/testsuite/gcc.target/arm/fp16-aapcs-1.c
    trunk/gcc/testsuite/gcc.target/arm/fp16-aapcs-3.c