Bug 97417

Summary: RISC-V Unnecessary andi instruction when loading volatile bool
Product: gcc
Component: target
Status: RESOLVED FIXED
Severity: normal
Priority: P3
Version: 10.2.0
Target Milestone: ---
Target: riscv
Reporter: Kjetil Østerås <kjetilos>
Assignee: Not yet assigned to anyone <unassigned>
CC: admin, jiawei, kito, wilson
Keywords: missed-optimization
Last reconfirmed: 2020-10-15 00:00:00
Attachments:
optimization fix for BUG 97417
Optimization Patch for QI/HImode(32bit) and QI/HI/SImode(64bit)
Third test patch
QI/HI/SI/DI zero/sign-extend for RV32/64/128
QI/HI/SI/DI zero/sign-extend for RV32/64/128
V4 patch for QI/HI/SI/DI zero/sign-extend for RV32/64/128
QI/HI/SImode auto Zero/Sign-extend
QI/HI/SImode auto Zero/Sign-extend
QI/HI/SImode auto Zero/Sign-extend
riscv-shorten-memrefs_V1.patch
Initial V1 patch on combine.c
Auto-extend Patch
untested fix to use instead of levy's combine.c patch

Description Kjetil Østerås 2020-10-14 11:56:11 UTC
When loading a volatile bool value, I see that the compiler adds an unnecessary andi instruction. Here is a reproducer:

#include <stdbool.h>

extern volatile bool active;

int foo(void)
{
  if (!active) {
    return 42;
  } else {
    return -42;
  }
}

And the code generated looks like this:

foo:
        lui     a5,%hi(active)
        lbu     a5,%lo(active)(a5)
        li      a0,42
        andi    a5,a5,0xff
        beq     a5,zero,.L1
        li      a0,-42
.L1:
        ret

The andi instruction seems unnecessary, since lbu already zero-extends the byte.

ref. https://github.com/riscv/riscv-gnu-toolchain/issues/737
Comment 1 Jim Wilson 2020-10-15 20:34:41 UTC
Comparing with the ARM port, I see that in the ARM port, the movqi pattern emits
(insn 8 7 9 2 (set (reg:SI 117)
        (zero_extend:SI (mem/v/c:QI (reg/f:SI 115) [1 active+0 S1 A8]))) "tmp.c":7:7 -1
     (nil))
(insn 9 8 10 2 (set (reg:QI 116)
        (subreg:QI (reg:SI 117) 0)) "tmp.c":7:7 -1
     (nil))
and then later it combines the subreg operation with the following zero_extend and cancels them out.

Whereas in the RISC-V port, the movqi pattern emits
(insn 9 7 10 2 (set (reg:QI 76)
	(mem/v/c:QI (lo_sum:DI (reg:DI 74)
                (symbol_ref:DI ("active") [flags 0xc4]  <var_decl 0x7f9f0310312\
0 active>)) [1 active+0 S1 A8])) "tmp.c":7:7 -1
     (nil))
and then combine refuses to combine the following zero-extend with this insn as the memory operation is volatile.

So it seems we need to rewrite the RISC-V port to make movqi and movhi zero extend to si/di mode and then subreg.  That probably will require cascading changes to avoid code size and performance regressions.

Looks like a tractable small to medium size project, but will need to wait for a volunteer to work on it.
Comment 2 jiawei 2020-10-21 03:36:21 UTC
(In reply to Jim Wilson from comment #1)
> Comparing with the ARM port, I see that in the ARM port, the movqi pattern
> emits
> (insn 8 7 9 2 (set (reg:SI 117)
>         (zero_extend:SI (mem/v/c:QI (reg/f:SI 115) [1 active+0 S1 A8]))) "tmp.c":7:7 -1
>      (nil))
> (insn 9 8 10 2 (set (reg:QI 116)
>         (subreg:QI (reg:SI 117) 0)) "tmp.c":7:7 -1
>      (nil))
> and then later it combines the subreg operation with the following
> zero_extend and cancels them out.
> 
> Whereas in the RISC-V port, the movqi pattern emits
> (insn 9 7 10 2 (set (reg:QI 76)
> 	(mem/v/c:QI (lo_sum:DI (reg:DI 74)
>                 (symbol_ref:DI ("active") [flags 0xc4]  <var_decl 0x7f9f03103120 active>)) [1 active+0 S1 A8])) "tmp.c":7:7 -1
>      (nil))
> and then combine refuses to combine the following zero-extend with this insn
> as the memory operation is volatile.
> 
> So it seems we need to rewrite the RISC-V port to make movqi and movhi zero
> extend to si/di mode and then subreg.  That probably will require cascading
> changes to avoid code size and performance regressions.
> 
> Looks like a tractable small to medium size project, but will need to wait
> for a volunteer to work on it.

Hi Jim, my name is Jiawei Chen. I am from the PLCT Lab. I have reproduced this bug and would like to help fix it. What should I modify? Do you have any suggestions?
Comment 3 Jim Wilson 2020-10-21 05:58:40 UTC
The basic idea here is that the movqi pattern in riscv.md currently emits RTL for a load that looks like this
  (set (reg:QI target) (mem:QI (address)))
As an experiment, we want to try changing that to something like this
  (set (reg:DI temp) (zero_extend:DI (mem:QI (address))))
  (set (reg:QI target) (subreg:QI (reg:DI temp) 0))
The hope is that the optimizer will combine the subreg with the following operations, resulting in smaller, faster code in the end.  This should also solve the volatile load optimization problem.  So we need a patch, and then we need experiments to see whether the patch actually produces better code on real programs.  It should be fairly easy to write the patch even if you don't have any gcc experience; the testing part is probably more work than the patch writing.

The movqi pattern calls riscv_legitimize_move in riscv.c, so that would have to be modified to look for QImode loads from memory, allocate a temporary register, do a zero-extending load into the temp reg, and then a subreg copy into the target register.

You will probably also need to handle cases where both the target and source are memory locations, in which case this already gets split into two instructions, a load followed by a store.

You can look at the movqi pattern in arm.md file to see an example of how to do this, where it calls gen_zero_extendqisi2.  Though for RISC-V, we would want gen_zero_extendqidi2 for rv64 and gen_zero_extendqisi2 for rv32.

If the movqi change works, then we would want similar changes for movhi and maybe also movsi for rv64.

It might also be worth checking whether zero-extend or sign-extend is the better choice.  We zero extend char by default, so that should be best.  For rv64, the hardware sign extends simode to dimode by default, so sign-extend is probably best for that.  For himode I'm not sure, I think we prefer sign-extend by default, but that isn't necessarily the best choice for loads.  This would have to be tested.

You can see rtl dumps by using -fdump-rtl-all.  The combiner is the pass that should be optimizing away the unnecessary zero-extend.  You can see details of what the combiner pass is doing by using -fdump-rtl-combine-all.
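For example, with a riscv64 cross compiler (the riscv64-unknown-elf-gcc triple and options here are just an assumption; adjust them to your build):

  riscv64-unknown-elf-gcc -march=rv64gc -mabi=lp64d -O2 -S -fdump-rtl-combine-all foo.c

That writes numbered dump files like foo.c.*r.combine next to the source; the "Trying N -> M" and "Failed to match this instruction" lines in there show which insn pairs the combiner attempted and why they failed.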
Comment 4 jiawei 2020-10-27 11:18:28 UTC
I did some tests on this problem and found:

foo.c

#include <stdbool.h>
  
extern volatile bool active;

int foo(void)
{
  if (!active) {
    return 42;
  } else {
    return -42;
  }
}

code generated in foo.s

foo:
        lui     a5,%hi(active)
        lbu     a5,%lo(active)(a5)
        li      a0,42
        andi    a5,a5,0xff
        beq     a5,zero,.L2
        li      a0,-42

When we remove the keyword `volatile`

foo_without_volatile.c

#include <stdbool.h>
  
extern bool active;

int foo(void)
{
  if (!active) {
    return 42;
  } else {
    return -42;
  }
}

code generated in foo_without_volatile.s

foo:
        lui     a5,%hi(active)
        lbu     a5,%lo(active)(a5)
        li      a0,42
        beq     a5,zero,.L2
        li      a0,-42

and then we change the type from `bool` to `int`

foo_int.c

#include <stdbool.h>
  
extern volatile int active;

int foo(void)
{
  if (!active) {
    return 42;
  } else {
    return -42;
  }
}

code generated in foo_int.s

foo:
        lui     a5,%hi(active)
        lw      a5,%lo(active)(a5)
        li      a0,42
        sext.w  a5,a5
        beq     a5,zero,.L2
        li      a0,-42

the `sext.w` instruction replaces the `andi`.

We also remove the keyword `volatile` in foo_int_without_volatile.c

#include <stdbool.h>
  
extern int active;

int foo(void)
{
  if (!active) {
    return 42;
  } else {
    return -42;
  }
}

code generated in foo_int_without_volatile.s also looks optimized:

foo:
        lui     a5,%hi(active)
        lw      a5,%lo(active)(a5)
        li      a0,42
        beq     a5,zero,.L2
        li      a0,-42

Maybe this problem is due to the keyword `volatile`.
Comment 5 Jim Wilson 2020-10-27 15:25:07 UTC
Yes, the volatile is the problem.  We need to disable some optimizations like the combiner to avoid breaking the semantics of volatile.  However, if you look at other ports, like arm, you can see that they don't have this problem, because they generate different RTL at the start and hence do not need the combiner to eliminate the redundant extension.  So the proposal here is that we modify the RISC-V gcc port to also emit the extended load at RTL generation time, which solves this problem.  And then we need to do some testing to make sure that this actually generates good code in every case, as we don't want to accidentally introduce a code size or performance regression while fixing this volatile optimization problem.

If you are curious about the combiner issue, see the init_recog_no_volatile call in combine.c.  If you comment that out, the andi will be optimized away.  But we can't remove that call, because that would break programs using volatile.
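For reference, the mechanism is tiny; paraphrasing gcc/recog.c (a sketch, check the source for the exact context):

/* Paraphrased from gcc/recog.c: combine runs with volatile_ok clear,
   and the operand predicates (general_operand etc.) reject any volatile
   MEM while it is clear, so no combined pattern containing one can match.  */
void
init_recog_no_volatile (void)
{
  volatile_ok = 0;
}

void
init_recog (void)
{
  volatile_ok = 1;
}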
Comment 6 Levy Hsu 2020-10-29 05:27:43 UTC
Hi Jim

Levy from StarFive. 

Adding the following code to the head of riscv_legitimize_move(), per your reply, seems to solve the problem:

if(mode == QImode && MEM_P (src) && REG_P (dest))
  {
    rtx temp_reg;
    if (TARGET_64BIT)
    {
      temp_reg = gen_reg_rtx (DImode);
      emit_insn(gen_zero_extendqidi2(temp_reg, src));
    }
    else
    {
      temp_reg = gen_reg_rtx (SImode);
      emit_insn(gen_zero_extendqisi2(temp_reg, src));
    }
    
    riscv_emit_move(dest, gen_rtx_SUBREG(QImode,temp_reg,0));
    return true;
  }

same foo.c will produce:
foo:
	lui	a5,%hi(active)
	lbu	a5,%lo(active)(a5)
	li	a0,42
	bne	a5,zero,.L6
	ret
.L6:
	li	a0,-42
	ret
	.size	foo, .-foo
	.ident	"GCC: (GNU) 10.2.0"

Not sure if I'm doing it right, especially for the 64-bit DImode case, because I've only been with gcc for a month. Just wondering if you have time after Monday's compiler meeting so we can discuss movsi, movhi, and MEM-to-MEM copy.
Comment 7 Jim Wilson 2020-10-29 22:49:29 UTC
That patch is basically correct.  I would suggest using gen_lowpart instead of gen_rtx_SUBREG as a minor cleanup.  It will do the same thing, and is shorter and easier to read.

There is one problem: you can't generate new pseudo registers during register allocation, or after register allocation is complete.  So you need to disable this optimization in that case.  You can do that by adding a check for can_create_pseudo_p ().  This is already used explicitly in one place in riscv_legitimize_move and implicitly in some subfunctions, and is used in the arm.md movqi pattern.

The next part is testing the patch.  We need some correctness testing.  You can just run the gcc testsuite for that.  And we need some code size/performance testing.  I'd suggest compiling some code with and without the patch, checking function sizes, looking for anything that got bigger with the patch, and checking whether that is a problem.  I like to use the toolchain libraries like libc.a and libstdc++.a since they are being built anyway, but using a nice benchmark is OK too as long as it is big enough to stress the compiler.

If the patch passes testing, then we can look at expanding the scope to handle more modes, and also handle MEM dest in addition to REG dest.

Yes, we can discuss this Monday.
Comment 8 Levy Hsu 2020-10-30 08:35:04 UTC
Created attachment 49470 [details]
optimization fix for BUG 97417

Proposing a temp patch here in case someone needs it; I'll submit a full patch with a test case later.

The following code was added to riscv_legitimize_move () in riscv-gcc/gcc/config/riscv/riscv.c:

  if(mode == QImode && MEM_P (src) && REG_P (dest) && can_create_pseudo_p())
  {
    rtx temp_reg;

    if (TARGET_64BIT)
    {
      temp_reg = gen_reg_rtx (DImode);
      emit_insn(gen_zero_extendqidi2(temp_reg, src));
    }
    else
    {
      temp_reg = gen_reg_rtx (SImode);
      emit_insn(gen_zero_extendqisi2(temp_reg, src));
    }
    
    riscv_emit_move(dest, gen_lowpart(QImode,temp_reg));
    return true;
  }

Tested with make report-gcc, haven't done the regression/performance test yet:

========= Summary of gcc testsuite =========
                            | # of unexpected case / # of unique unexpected case
                            |          gcc |          g++ |     gfortran |
 rv64imafdc/  lp64d/ medlow |    0 /     0 |    0 /     0 |      - |
Comment 9 Levy Hsu 2020-10-30 12:46:16 UTC
Thanks Jim. See you on Monday.
Comment 10 Levy Hsu 2020-11-04 06:10:26 UTC
Created attachment 49500 [details]
Optimization Patch for QI/HImode(32bit) and QI/HI/SImode(64bit)

Proposing a second patch for QI/HImode (32-bit) and QI/HI/SImode (64-bit),
with both zero-extend & subreg.

Tested with make report-gcc
Two failed cases: shorten-memrefs-5.c & shorten-memrefs-6.c

Both failed due to the DejaGnu rule:
/* { dg-final { scan-assembler "load1r:\n\taddi" } } */

But if we look at the assembly code, for the same input in both files:

int load1r (int *array)
{
  int a = 0;
  a += array[200];
  a += array[201];
  a += array[202];
  a += array[203];
  return a;
}

The current gcc RISC-V port produces:
load1r:
	addi	a5,a0,768
	lw	a4,36(a5)
	lw	a0,32(a5)
	addw	a0,a0,a4
	lw	a4,40(a5)
	addw	a4,a4,a0
	lw	a0,44(a5)
	addw	a0,a0,a4
	ret
The patched port produces:
load1r:
	lwu	a4,800(a0)
	lwu	a5,804(a0)
	addw	a5,a5,a4
	lwu	a4,808(a0)
	lwu	a0,812(a0)
	addw	a5,a5,a4
	addw	a0,a5,a0
	ret
With one instruction fewer, the patched output seems correct and even faster to me. However, we still need to test sign extension and check performance.

Test case and performance evaluation will be provided later (hopefully).
Comment 11 Kito Cheng 2020-11-04 06:35:27 UTC
> Two failed cases: shorten-memrefs-5.c & shorten-memrefs-6.c

Seems like the shorten_memrefs pass doesn't handle zero_extend and sign_extend wrapping a memory access.

You can take a look into riscv-shorten-memrefs.c:pass_shorten_memrefs::analyze and add logic to handle zero_extend and sign_extend.


> With one instruction less, the patched one seems right and even faster to me. However we still need a test on sign extend and check performance

shorten_memrefs optimizes for size; the idea is to transform several load instructions with a large immediate into one add-immediate instruction plus load instructions with small immediates, so that C-extension (compressed) instructions can be used where possible.

So the instruction count increases, but the code size is smaller.
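A quick byte count on the load1r example above illustrates the tradeoff (assuming RVC, where lw with a small scaled offset can become a 2-byte c.lw if the assembler picks the compressed form, while lwu has no compressed encoding): the unpatched loads cost one 4-byte addi (768 does not fit c.addi) plus four 2-byte c.lw, about 12 bytes, while the patched loads cost four 4-byte lwu, 16 bytes, despite being one instruction fewer overall.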
Comment 12 Levy Hsu 2020-11-04 07:03:03 UTC
(In reply to Kito Cheng from comment #11)
> > Two failed cases: shorten-memrefs-5.c & shorten-memrefs-6.c
> 
> Seems like shorten_memrefs pass didn't handle zero_extend and sign_extend
> with memory.
> 
> You can take a look into
> riscv-shorten-memrefs.c:pass_shorten_memrefs::analyze and add logic to
> handle zero_extend and sign_extend.
> 
> 
> > With one instruction less, the patched one seems right and even faster to me. However we still need a test on sign extend and check performance
> 
> shorten_memrefs optimizes for size; the idea is to transform several load
> instructions with a large immediate into one add-immediate instruction plus
> load instructions with small immediates, so that C-extension (compressed)
> instructions can be used where possible.
> 
> So the instruction count increases, but the code size is smaller.

Thank you, Cheng. I'll have a look.
Comment 13 Jim Wilson 2020-11-05 22:57:37 UTC
The attachments show the entire riscv.c file being deleted and then re-added with your change.  This is due to a line ending problem: the original file has the unix-style line feed and the new file has the windows-style carriage return and line feed.  Git has a setting called core.autocrlf that can help with this.  This will require using git diff instead of just diff to generate patches, but should give better diffs.  Or alternatively, maybe you can force the original file to have msdos line endings before you make the diff; e.g. just loading the original file into your editor and saving it without making changes may fix the line endings.

You have in the second patch
    (mode == QImode || mode == SImode || mode == HImode)
which is wrong but harmless for rv32 since we can't extend SImode, and is also wrong for the eventual rv128 support.  You can fix this by using something like
    (GET_MODE_CLASS (mode) == MODE_INT && GET_MODE_SIZE (mode) < UNITS_PER_WORD)
There is one place in riscv_legitimize_move that already uses this.

The code inside the if statement is a lot more verbose than necessary.  You can use some helper functions to simplify it.  First you can use word_mode, which is DImode for rv64 and SImode for rv32 (and will be TImode for rv128).  Then you can call a helper to do the mode conversion.  So for instance something like
  temp_reg = force_reg (word_mode, convert_to_mode (word_mode, src, 1));
should work.  That should emit an insn to do the zero extend and put it in a reg.  Now you no longer need to check src mode or TARGET_64BIT as the code is the same in all cases.  So it should just be about 3 or 4 lines of code for the body of the if statement.

You have a check for REG_P (dest).  This is stricter than you need, since it only works for REG and doesn't accept SUBREG.  We should handle both.  Also, it isn't hard to also handle non-reg or non-subreg dests.  You just need to force the source to a reg, and you already did that when you generated the zero_extend operation.  So this code looks like it should work for any dest and you should be able to drop the REG_P (dest) check.
Comment 14 Jim Wilson 2020-11-06 01:20:42 UTC
Actually, for the SImode to DImode conversion, zero extending is unlikely to be correct.  The rv64 'w' instructions require the input to be sign-extended from 32 bits to 64 bits, so a sign-extend is probably required.  There will be a similar consideration for rv128 when we get to that.  So maybe we should only handle QImode and HImode here.

Or maybe we make the choice of zero-extend or sign-extend depend on the mode, and use zero-extend for smaller than SImode and sign-extend for SImode and larger.

For qimode, char is unsigned by default, so zero extension is likely the right choice.  For himode, it isn't clear which is best, but the arm port does a zero extend.  Also, the Huawei code size proposal says that zero-extended byte and short loads are more important than sign-extended byte and short loads, so that is more evidence that zero extend is more useful in those cases.
Comment 15 Jim Wilson 2020-11-06 02:40:00 UTC
I tried testing your first patch.  I just built unpatched/patched rv32-elf and rv64-linux toolchains and used nm/objdump to look at libraries like libc, libstdc++, and libm.

I managed to find a testcase from glibc that gives worse code with the patch.
struct
{
  unsigned int a : 1;
  unsigned int b : 1;
  unsigned int c : 1;
  unsigned int d : 1;
  unsigned int pad1 : 28;
} s;

void
sub (void)
{
  s.a = 1;
  s.c = 1;
}

Without the patch we get
sub:
	lui	a5,%hi(s)
	addi	a5,a5,%lo(s)
	lbu	a4,1(a5)
	ori	a4,a4,5
	sb	a4,1(a5)
	ret
and with the patch we get
sub:
	lui	a4,%hi(s)
	lbu	a5,%lo(s)(a4)
	andi	a5,a5,-6
	ori	a5,a5,5
	sb	a5,%lo(s)(a4)
	ret
Note the extra andi instruction.

This seems to be a combine problem.  With the patched compiler, I see in the -fdump-rtl-combine-all dump file
Trying 9 -> 11:
    9: r79:DI=r78:DI&0xfffffffffffffffa
      REG_DEAD r78:DI
   11: r81:DI=r79:DI|0x5
      REG_DEAD r79:DI
Failed to match this instruction:
(set (reg:DI 81)
    (ior:DI (and:DI (reg:DI 78 [ MEM <unsigned char> [(struct  *)&sD.1491] ])
            (const_int 250 [0xfa]))
        (const_int 5 [0x5])))
Combine knows that reg 78 only has 8 nonzero bits, so it simplified the AND -6 to AND 250 (-6 is ...fffa in two's complement, and keeping only the low 8 bits gives 0xfa = 250).  If combine had left that constant alone, or if combine had propagated that info about nonzero bits through to r81, then it should have been able to optimize out the AND operation.  This does work when the load does not zero extend by default.

The ARM port shows the exact same problem.  I see
sub:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	movw	r2, #:lower16:.LANCHOR0
	movt	r2, #:upper16:.LANCHOR0
	ldrb	r3, [r2]	@ zero_extendqisi2
	bic	r3, r3, #5
	orr	r3, r3, #5
	strb	r3, [r2]
	bx	lr
and the bic (bit clear) is obviously unnecessary.

This probably should be submitted as a separate bug if we don't want to fix it here.
Comment 16 Kito Cheng 2020-11-06 02:44:08 UTC
> Or maybe we make the choice of zero-extend or sign-extend depend on the mode, and use zero-extend for smaller than SImode and sign-extend for SImode and larger.


Maybe depend on LOAD_EXTEND_OP?
Comment 17 Jim Wilson 2020-11-06 03:35:29 UTC
Yes, LOAD_EXTEND_OP is a good suggestion.  We can maybe do something like
  int extend = (LOAD_EXTEND_OP (mode) == ZERO_EXTEND);
  temp_reg = force_reg (word_mode, convert_to_mode (word_mode, src, extend));
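For background, LOAD_EXTEND_OP (MODE) is the target macro that reports which extension a load of MODE naturally performs into a word register.  For the scheme above to pick the zero-extending lbu/lhu for sub-word modes and the sign-extending lw for SImode on rv64, riscv.h needs a definition shaped roughly like this (a sketch only, not copied from riscv.h; check the port for the exact macro):

/* Sketch only: sub-word loads are wanted zero-extended, while rv64's lw
   sign-extends SImode to the full register.  */
#define LOAD_EXTEND_OP(MODE) \
  (TARGET_64BIT && (MODE) == SImode ? SIGN_EXTEND : ZERO_EXTEND)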
Comment 18 Levy Hsu 2020-11-06 09:46:35 UTC
if (GET_MODE_CLASS (mode) == MODE_INT
    && GET_MODE_SIZE (mode) < UNITS_PER_WORD
    && can_create_pseudo_p()
    && MEM_P (src))
  {
    int extend = (LOAD_EXTEND_OP (mode) == ZERO_EXTEND);
    rtx temp_reg = force_reg (word_mode, convert_to_mode (word_mode, src, extend));
    riscv_emit_move(dest, temp_reg);
    return true;
  }

I tried to insert this code at the beginning of riscv_legitimize_move(), but it seems convert_to_mode() raised a segfault during make.
Comment 19 Levy Hsu 2020-11-06 11:38:29 UTC
Also tested the code without the extend variable, just zero-extending with unsignedp set to 1.  Still segfaults.

if (GET_MODE_CLASS (mode) == MODE_INT
    && GET_MODE_SIZE (mode) < UNITS_PER_WORD
    && can_create_pseudo_p()
    && MEM_P (src))
  {
    rtx temp_reg = force_reg (word_mode, convert_to_mode (word_mode, src, 1));
    riscv_emit_move(dest, temp_reg);
    return true;
  }
Comment 20 Jim Wilson 2020-11-06 20:44:44 UTC
Maybe convert_to_mode is recursively calling convert_to_mode until you run out of stack space.  You can use -save-temps to generate a .i preprocessed file from your input testcase, then run cc1 under gdb.  You need to give the default arch/abi options or you will get an error
(gdb) run -march=rv64gc -mabi=lp64d -O2 tmp.i
then look at the backtrace when you hit the segfault.

There are other ways to get the zero extend we need here.  We could just generate the RTL for the insn directly instead of calling the gen_zero_extendXiYi2 functions, because we know that they exist and what form they are in.  This is inconvenient though.  We could try querying the optabs table to get the right insn code.  There is a gen_extend_insn function that looks like it would work and is more direct than convert_to_mode.  gen_extend_insn does require that the input is in a form the convert will accept, so it must be forced to a register if it isn't already one.  Another solution would be to use one of the rtx simplify functions, e.g. you can do
     simplify_gen_unary (ZERO_EXTEND, word_mode, src, mode);
but the simplify functions are really for simplifying complex RTL into simpler RTL, so I think that's not the right approach here.

I think gen_extend_insn is the best approach if we can't use convert_to_mode.
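Concretely, the shape would be something like this (untested sketch):

  rtx temp_reg = gen_reg_rtx (word_mode);
  int unsignedp = (LOAD_EXTEND_OP (mode) == ZERO_EXTEND);
  /* gen_extend_insn only builds the insn; it still has to be emitted.  */
  emit_insn (gen_extend_insn (temp_reg, src, word_mode, mode, unsignedp));
  riscv_emit_move (dest, gen_lowpart (mode, temp_reg));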
Comment 21 Jim Wilson 2020-11-06 21:08:25 UTC
I submitted my testcase as 97747 so it will get more attention.
Comment 22 Levy Hsu 2020-11-09 08:35:46 UTC
Under the condition

if (GET_MODE_CLASS (mode) == MODE_INT
    && GET_MODE_SIZE (mode) < UNITS_PER_WORD
    && can_create_pseudo_p()
    && MEM_P (src))

with the variables:

rtx temp_reg;
int extend = (LOAD_EXTEND_OP (mode) == ZERO_EXTEND);

I've tried the combination of:

gen_extend_insn (temp_reg, force_reg (mode, src), word_mode, mode, extend);
gen_extend_insn (temp_reg, force_reg (word_mode, src), word_mode, word_mode, extend);
gen_extend_insn (temp_reg, src, word_mode, mode, extend);

with:
riscv_emit_move(dest, gen_lowpart (mode, temp_reg));
riscv_emit_move(dest, force_reg(mode, temp_reg));

then return true

All raised segfaults during make gcc.

For example:

  if (GET_MODE_CLASS (mode) == MODE_INT
      && GET_MODE_SIZE (mode) < UNITS_PER_WORD
      && can_create_pseudo_p()
      && MEM_P (src))
  {
    rtx temp_reg;
    int extend = (LOAD_EXTEND_OP (mode) == ZERO_EXTEND);
    gen_extend_insn (temp_reg, force_reg (mode, src), word_mode, mode, extend);
    riscv_emit_move(dest, force_reg(mode, temp_reg));
    return true;
  }
at the beginning of riscv_legitimize_move().
Comment 23 Levy Hsu 2020-11-09 09:22:19 UTC
Created attachment 49524 [details]
Third test patch

While I'm waiting for a solution, I've reused my second patch and made a new patch.
The third test patch adds one extra function, gen_extend_insn_auto(), in optabs.c/h,
which is then called from riscv_legitimize_move().
Now it can emit the sign/zero-extend insn automatically.

Still haven't solved the shorten-memrefs issue, so make report-gcc will still show the 2 failures,
and the -Os option (size optimization) may not behave as expected.
Comment 24 Levy Hsu 2020-11-10 05:20:39 UTC
Created attachment 49532 [details]
QI/HI/SI/DI zero/sign-extend for RV32/64/128

Rewrote the third proposed patch.
Comment 25 Kito Cheng 2020-11-10 05:29:30 UTC
Seems like you have added code to gcc/optabs.h and gcc/optabs.c; however, those functions are RISC-V specific, so I would suggest you put them in riscv.c and riscv-protos.h.
Comment 26 Levy Hsu 2020-11-10 05:34:33 UTC
Created attachment 49533 [details]
QI/HI/SI/DI zero/sign-extend for RV32/64/128

BUG fix for unimplemented code
Comment 27 Levy Hsu 2020-11-10 05:36:10 UTC
(In reply to Kito Cheng from comment #25)
> Seems like you have added code to gcc/optabs.h and gcc/optabs.c; however,
> those functions are RISC-V specific, so I would suggest you put them in
> riscv.c and riscv-protos.h.

No problem, I'll make a new patch.
Comment 28 Levy Hsu 2020-11-10 06:01:13 UTC
Created attachment 49534 [details]
V4 patch for QI/HI/SI/DI zero/sign-extend for RV32/64/128

As suggested by Kito, the V4 patch moves gen_extend_insn_auto() into riscv.c, with its declaration in riscv-protos.h, since it handles the RISC-V backend only.
Comment 29 Levy Hsu 2020-11-10 10:47:01 UTC
Created attachment 49536 [details]
QI/HI/SImode auto Zero/Sign-extend

Finally, gen_extend_insn() seems to work with auto-extension, based on Jim's and Kito's ideas.

Just 10 lines of code at the beginning of riscv_legitimize_move() in riscv-gcc/gcc/config/riscv/riscv.c:

if (GET_MODE_CLASS (mode) == MODE_INT
	    && GET_MODE_SIZE (mode) < UNITS_PER_WORD
      && can_create_pseudo_p()
      && MEM_P (src))
  {
      rtx temp_reg = gen_reg_rtx (word_mode);
      int zero_sign_extend = (LOAD_EXTEND_OP (mode) == ZERO_EXTEND);
      emit_insn(gen_extend_insn(temp_reg, src, word_mode, mode, zero_sign_extend));
      riscv_emit_move(dest, gen_lowpart(mode, temp_reg));
      return true;
  }

Haven't run make report-gcc yet, but it already passed a 2-stage make.
I'll run the tests later.
Comment 30 Jim Wilson 2020-11-11 01:21:50 UTC
Looking at your v2 patch, the first version fails because you are passing mismatched modes to emit_move_insn.  The version with gen_lowpart solves that problem, but fails because of infinite recursion.

The v4 patch looks good.  There are some coding style issues, but I can always fix them for you.  There should be a space before open paren.  The first && should line up with the last two.  The braces should be indented two more spaces.  The comment needs to end with a period and two spaces.

In the comment, instead of "Separate ... to ..." I'd say "Expand ... to ..." or maybe "Emit ... as ...".

Now we need the copyright assignment paperwork.
Comment 31 Levy Hsu 2020-11-11 05:43:50 UTC
Created attachment 49542 [details]
QI/HI/SImode auto Zero/Sign-extend

Much appreciated, Jim.

The new patch should solve the format issue by adding the same tabs on each line.

In the meantime, I'll try to patch pass_shorten_memrefs::analyze() in riscv-shorten-memrefs.c.

Any idea on the combiner issue?
Comment 32 Levy Hsu 2020-11-11 06:43:59 UTC
Created attachment 49543 [details]
QI/HI/SImode auto Zero/Sign-extend

Added one missing space at the end of the comment.
Added one tab before each brace.
Replaced all tabs with spaces (come on....)
Comment 33 Jim Wilson 2020-11-11 19:35:00 UTC
I did say that I'm willing to fix code style issues.  All major software projects I have worked with have coding style conventions.  It makes it easier to maintain a large software base when everything is using the same coding style.  We do have info to help you follow the GNU coding style.  See
https://gcc.gnu.org/wiki/FormattingCodeForGCC
which has emacs and vim settings to get the right code style.
Comment 34 Levy Hsu 2020-11-12 01:26:21 UTC
(In reply to Jim Wilson from comment #33)
> I did say that I'm willing to fix code style issues.  All major software
> projects I have worked with have coding style conventions.  It makes it
> easier to maintain a large software base when everything is using the same
> coding style.  We do have info to help you follow the GNU coding style.  See
> https://gcc.gnu.org/wiki/FormattingCodeForGCC
> which has emacs and vim settings to get the right code style.

No problem, and thank you, Jim. I'll try to catch up with the coding style.
What about the combine issue and shorten-memrefs?
Shall we fix them here or someplace else?
Comment 35 Jim Wilson 2020-11-13 00:00:17 UTC
By combine issue, are you referring to the regression I mentioned in comment 15 and filed as bug 97747?  We can handle that as a separate issue.  It should be uncommon.  I expect to get much more benefit from this patch than the downside due to that combine issue.

As for the shorten-memrefs problem, I didn't notice that one in my testing.  It does need to be fixed.  Taking a look now, it looks pretty simple to fix.  The code currently looks for MEM; it needs to handle (SIGN_EXTEND (MEM)) and (ZERO_EXTEND (MEM)).  See the get_si_mem_base_reg function.  You need to skip over the sign_extend or zero_extend when looking for the mem at the first call.  Then at the second call you need to be careful to put the sign_extend or zero_extend back when creating the new RTL.  Maybe just add another arg to get_si_mem_base_reg so it can record the parent rtx code of the mem.  Or maybe do this outside get_si_mem_base_reg: skip over a sign/zero extend at the first call, then do the same at the second call but record what rtx we skipped over so that we can put it back.  We can either handle this here or as another patch.  But since you have some time while waiting for paperwork, maybe you can try writing a fix.
Comment 36 Levy Hsu 2020-11-16 01:17:11 UTC
It seems get_si_mem_base_reg() is called repeatedly via FOR_BB_INSNS from both pass_shorten_memrefs::analyze and pass_shorten_memrefs::transform.

Correct me if I'm wrong:
it seems we need some data structure (a linked list should work) to store the zero/sign extend when we strip it off, like:

if (GET_CODE (mem) == ZERO_EXTEND
      || GET_CODE (mem) == SIGN_EXTEND)
    mem = XEXP (mem, 0);
in each insn.

Then in pass_shorten_memrefs::transform(), each time get_si_mem_base_reg() is called, we check whether we need to put it back.
Comment 37 Kito Cheng 2020-11-16 03:24:08 UTC
Maybe we could add a parameter to pass_shorten_memrefs::get_si_mem_base_reg to indicate the type of memory access: plain_mem, zext_mem or sext_mem.

e.g.
      for (int i = 0; i < 2; i++)
        {   
          rtx mem = XEXP (pat, i); 
          rtx addr;
          mem_access_type_t type;
          if (get_si_mem_base_reg (mem, &addr, &type))
            {   
              HOST_WIDE_INT regno = REGNO (XEXP (addr, 0));
              /* Do not transform store zero as these cannot be compressed.  */
              if (i == 0)
                {   
                  if (XEXP (pat, 1) == CONST0_RTX (GET_MODE (XEXP (pat, 1))))
                    continue;
                }   
              if (m->get_or_insert (regno) > 3)
                {
                  addr
                    = targetm.legitimize_address (addr, addr, GET_MODE (mem));
                  switch (type)
                    {
                    case plain_mem:
                      XEXP (pat, i) = replace_equiv_address (mem, addr);
                      break;
                    case zext_mem:
                      ...
                      break;
                    case sext_mem:
                      ...
                      break;
                    }
                  df_insn_rescan (insn);
                }   
            }
Comment 38 Levy Hsu 2020-11-17 10:19:18 UTC
Created attachment 49575 [details]
riscv-shorten-memrefs_V1.patch

Made a small change in get_si_mem_base_reg() and transform().
Now, for the same C input array_test.c:

#include <stdio.h>

int load1r (int *array)
{
  int a = 0;
  a += array[200];
  a += array[201];
  a += array[202];
  a += array[203];
  return a;
}

int main ()
{
    int arr[300]= {0};
    arr[200] = 15;
    arr[201] = -33;
    arr[202] = 7;
    arr[203] = -999;
    int b = load1r(arr);
    printf("Result: %d\n",b);
    return 0;
}

In load1r, when putting a debug_rtx(pat) after:

(unpatched) XEXP (pat, i) = replace_equiv_address (mem, addr);
or
(patched) XEXP (XEXP (pat, i), 0) = replace_equiv_address (XEXP (mem, 0), addr);



the unpatched gcc produces the following insns:
---------------------------------------------------------
(set (reg:SI 81 [ MEM[(int *)array_5(D) + 800B] ])
    (mem:SI (plus:DI (reg:DI 88)
            (const_int 32 [0x20])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32]))
(set (reg:SI 82 [ MEM[(int *)array_5(D) + 804B] ])
    (mem:SI (plus:DI (reg:DI 89)
            (const_int 36 [0x24])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32]))
(set (reg:SI 84 [ MEM[(int *)array_5(D) + 808B] ])
    (mem:SI (plus:DI (reg:DI 90)
            (const_int 40 [0x28])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32]))
(set (reg:SI 86 [ MEM[(int *)array_5(D) + 812B] ])
    (mem:SI (plus:DI (reg:DI 91)
            (const_int 44 [0x2c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32]))
---------------------------------------------------------


the patched gcc produces the following insns:
---------------------------------------------------------
(set (reg:DI 82 [ MEM[(int *)array_5(D) + 800B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 92)
                (const_int 32 [0x20])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32])))
(set (reg:DI 84 [ MEM[(int *)array_5(D) + 804B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 93)
                (const_int 36 [0x24])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32])))
(set (reg:DI 87 [ MEM[(int *)array_5(D) + 808B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 94)
                (const_int 40 [0x28])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32])))
(set (reg:DI 90 [ MEM[(int *)array_5(D) + 812B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 95)
                (const_int 44 [0x2c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32])))
---------------------------------------------------------
The patched output looks OK to me, but the final assembly code still shows no change (both under -Os).

Not quite sure where the problem is; I'll have a look near array_test.c.250r.shorten_memrefs tomorrow.

Please ignore the coding style, as it's just a debug patch.
Comment 39 Levy Hsu 2020-11-18 06:09:49 UTC
Checked all passes from 250r.shorten_memrefs to 270r.ce2.

In 269r.combine I saw the following combination merge the replaced addresses:
-------------------------------------------------------
modifying insn i3    27: r92:DI=r96:DI+0x300
      REG_DEAD r96:DI
deferring rescan insn with uid = 27.
(!)allowing combination of insns 27 and 6
original costs 4 + 16 = 20
replacement costs 4 + 16 = 20
modifying insn i2    27: r92:DI=r96:DI+0x300
deferring rescan insn with uid = 27.
modifying insn i3     6: r82:DI=sign_extend([r96:DI+0x320])
      REG_DEAD r96:DI
deferring rescan insn with uid = 6.
(!)allowing combination of insns 27 and 8
original costs 4 + 16 = 20
replacement costs 4 + 16 = 20
modifying insn i2    27: r92:DI=r96:DI+0x300
deferring rescan insn with uid = 27.
modifying insn i3     8: r84:DI=sign_extend([r96:DI+0x324])
      REG_DEAD r96:DI
deferring rescan insn with uid = 8.
(!)allowing combination of insns 27 and 12
original costs 4 + 16 = 20
replacement costs 4 + 16 = 20
modifying insn i2    27: r92:DI=r96:DI+0x300
deferring rescan insn with uid = 27.
modifying insn i3    12: r87:DI=sign_extend([r96:DI+0x328])
      REG_DEAD r96:DI
deferring rescan insn with uid = 12.
(!)allowing combination of insns 27 and 16
original costs 4 + 16 = 20
replacement cost 16
deferring deletion of insn with uid = 27.
modifying insn i3    16: r90:DI=sign_extend([r96:DI+0x32c])
      REG_DEAD r96:DI
deferring rescan insn with uid = 16.
allowing combination of insns 18 and 19
-------------------------------------------------------
Whereas in 268r.ud_dce both insns 27 are the same (except for the memory address):

(insn 27 26 28 2 (set (reg:DI 10 a0)
        (lo_sum:DI (reg/f:DI 85)
            (symbol_ref/f:DI ("*.LC0") [flags 0x82]  <var_decl 0x7f3546fe91b0 *.LC0>))) "array_test.c":21:5 133 {*lowdi}
     (expr_list:REG_DEAD (reg/f:DI 85)
        (expr_list:REG_EQUAL (symbol_ref/f:DI ("*.LC0") [flags 0x82]  <var_decl 0x7f3546fe91b0 *.LC0>)
            (nil))))
-------------------------------------------------------
In 270r.combine (patched):

(note 27 3 6 2 NOTE_INSN_DELETED)

and the following insns had the 768 + 32/36/40/44 offsets folded back in as:

(insn 6 27 8 2 (set (reg:DI 82 [ MEM[(int *)array_5(D) + 800B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 800 [0x320])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}


(insn 8 6 10 2 (set (reg:DI 84 [ MEM[(int *)array_5(D) + 804B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 804 [0x324])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}


(insn 12 10 14 2 (set (reg:DI 87 [ MEM[(int *)array_5(D) + 808B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 808 [0x328])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32]))) "array_test.c":8:5 90 {extendsidi2}

(insn 16 14 18 2 (set (reg:DI 90 [ MEM[(int *)array_5(D) + 812B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 812 [0x32c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32]))) "array_test.c":9:5 90 {extendsidi2}
     (expr_list:REG_DEAD (reg:DI 96)
        (nil)))

Maybe combine.c needs some modification?
Comment 40 Jim Wilson 2020-11-18 18:31:01 UTC
If you look at riscv.opt you will see that the -mshorten-memrefs option sets the variable riscv_mshorten_memrefs.  If you grep for that, you will see that it is used in riscv_address_cost in riscv.c.  I believe it is this change to the address cost that is supposed to prevent the recombination back into addresses that don't fit in compressed instructions.  So you need to look at why this works in the current code, but not with your zero/sign extend load patch.  Maybe there is something about the rtx costs for a regular load versus a zero/sign extend load that causes the problem.  In the combine dump where it says "original costs" and "replacement costs" that is where it is using rtx_cost and riscv_address_cost.  The replacement cost should be more than the original cost to prevent the recombination.  You should see that if you look at the combine dump for the unpatched compiler.
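The hook is short; a paraphrased sketch of the idea (not the exact riscv.c source, and compressed_offset_p is a hypothetical stand-in for the real check):

/* Sketch: with -mshorten-memrefs, when optimizing for size, report an
   address whose offset cannot be encoded in a compressed load/store as
   one unit more expensive, so later passes prefer not to fold the large
   offset back into the mem.  */
static int
riscv_address_cost (rtx addr, machine_mode mode,
                    addr_space_t as ATTRIBUTE_UNUSED, bool speed)
{
  if (!speed && riscv_mshorten_memrefs && mode == SImode
      && !compressed_offset_p (addr, mode))  /* hypothetical helper */
    return riscv_address_insns (addr, mode, false) + 1;
  return riscv_address_insns (addr, mode, false);
}

The key point: callers extract the address from a plain (mem ...), so if the load is now wrapped as (sign_extend (mem ...)) and a caller doesn't look through the extension, the penalty is never applied; that matches the observation in the next comment that the patched compiler never reaches riscv_address_cost.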
Comment 41 Levy Hsu 2020-11-20 02:41:52 UTC
When putting the same debug_rtx(addr) at the first line of riscv_address_cost()

Unpatched one outputs:
(plus:DI (reg/f:DI 15 a5 [88])
    (const_int 32 [0x20]))
(plus:DI (reg:DI 10 a0 [92])
    (const_int 800 [0x320]))
(plus:DI (reg/f:DI 15 a5 [88])
    (const_int 36 [0x24]))
(plus:DI (reg:DI 10 a0 [92])
    (const_int 804 [0x324]))
.......

The patched one outputs nothing.  What this means to me is that in the RISC-V backend, something like:
(sign_extend:DI (mem:SI (plus:DI (reg:DI 93) 

is never passed to riscv_address_cost(), unlike:
(mem:SI (plus:DI (reg:DI 89)

so whether riscv_mshorten_memrefs is set or not doesn't really matter here.
Then I traced back to where address_cost() is called:

1. address_cost() is called by riscv_new_address_profitable_p() in riscv.c
2. riscv_new_address_profitable_p() is called by attempt_change() in sched-deps.c
And since I'm not that familiar with struct mem_inc_info, this part of the trace could be wrong:
3. attempt_change() is called by find_inc() in sched-deps.c
....(Still tracing)

------------------------------------------------------
Also, since Arm can handle sign/zero extend with subreg under -Os, I had a quick look at arm.c.

In arm_address_cost(), the rtx is passed as x, not addr (where addr may be XEXP (mem, 0)).

So GET_CODE (x) can be used to determine whether there is a MEM/LABEL_REF/SYMBOL_REF... at the front.  The cost can then vary from 0 all the way to 10.

This is also worth some investigation, to see how -Os works on the Arm side.
------------------------------------------------------

Need some time to work on it.
Comment 42 Jim Wilson 2020-11-20 03:32:45 UTC
riscv_address_cost is a hook, so it is targetm.address_cost which is only called from address_cost which is only called in a few places one of which is in postreload.c so that is the one I would look at first.  This is in try_replace_in_use which is called from reload_combine_recognize_const_pattern which is trying to put offsets back into mems which is exactly what we don't want here.  This suggests that containing_mem isn't getting set when we have a sign/zero extend.  It should get set in reload_combine_note_use.
Comment 43 Levy Hsu 2020-11-23 06:17:40 UTC
Thanks for pointing the hook out Jim.

For both the patched and unpatched versions, so far I've traced through

try_replace_in_use()
to
reload_combine_recognize_const_pattern()
to
reload_combine()
to
reload_cse_regs()
to
pass_postreload_cse::execute()

in postreload.c

-------------------------------------------------------------------
For reload_combine(), by printing each insn at the front of the loop (line 1302):

for (insn = get_last_insn (); insn; insn = prev)
{
   debug_rtx (insn);
   ...
}
-------------------------------------------------------------------
Unpatched gcc shows:

(insn 13 11 14 2 (set (reg:DI 10 a0)
        (sign_extend:DI (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                    (const_int 44 [0x2c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32]))) "array_test.c":9:5 90 {extendsidi2}
     (nil))
(insn 11 10 13 2 (set (reg:SI 14 a4 [83])
        (plus:SI (reg:SI 14 a4 [orig:84 MEM[(int *)array_5(D) + 808B] ] [84])
            (reg:SI 10 a0 [80]))) "array_test.c":8:5 3 {addsi3}
     (nil))
(insn 10 8 11 2 (set (reg:DI 14 a4)
        (sign_extend:DI (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                    (const_int 40 [0x28])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32]))) "array_test.c":8:5 90 {extendsidi2}
     (expr_list:REG_EQUIV (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 40 [0x28])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32])
        (nil)))
(insn 8 7 10 2 (set (reg:SI 10 a0 [80])
        (plus:SI (reg:SI 10 a0 [orig:81 MEM[(int *)array_5(D) + 800B] ] [81])
            (reg:SI 14 a4 [orig:82 MEM[(int *)array_5(D) + 804B] ] [82]))) "array_test.c":7:5 3 {addsi3}
     (nil))
(insn 7 6 8 2 (set (reg:DI 14 a4)
        (sign_extend:DI (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                    (const_int 36 [0x24])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (expr_list:REG_EQUIV (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 36 [0x24])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32])
        (nil)))
(insn 6 23 7 2 (set (reg:DI 10 a0)
        (sign_extend:DI (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                    (const_int 32 [0x20])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (expr_list:REG_EQUIV (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 32 [0x20])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32])
        (nil)))
-------------------------------------------------------------------
The patched one shows already-merged results:

(insn 16 14 18 2 (set (reg:DI 10 a0 [orig:90 MEM[(int *)array_5(D) + 812B] ] [90])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 10 a0 [96])
                    (const_int 812 [0x32c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32]))) "array_test.c":9:5 90 {extendsidi2}
     (nil))
(insn 14 12 16 2 (set (reg:SI 15 a5 [85])
        (plus:SI (reg:SI 15 a5 [80])
            (reg:SI 14 a4 [orig:87 MEM[(int *)array_5(D) + 808B] ] [87]))) "array_test.c":8:5 3 {addsi3}
     (nil))
(insn 12 10 14 2 (set (reg:DI 14 a4 [orig:87 MEM[(int *)array_5(D) + 808B] ] [87])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 10 a0 [96])
                    (const_int 808 [0x328])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32]))) "array_test.c":8:5 90 {extendsidi2}
     (nil))
(insn 10 8 12 2 (set (reg:SI 15 a5 [80])
        (plus:SI (reg:SI 15 a5 [orig:84 MEM[(int *)array_5(D) + 804B] ] [84])
            (reg:SI 14 a4 [orig:82 MEM[(int *)array_5(D) + 800B] ] [82]))) "array_test.c":7:5 3 {addsi3}
     (nil))
(insn 8 6 10 2 (set (reg:DI 15 a5 [orig:84 MEM[(int *)array_5(D) + 804B] ] [84])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 10 a0 [96])
                    (const_int 804 [0x324])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (nil))
(insn 6 27 8 2 (set (reg:DI 14 a4 [orig:82 MEM[(int *)array_5(D) + 800B] ] [82])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 10 a0 [96])
                    (const_int 800 [0x320])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (nil))
-------------------------------------------------------------------
Before reload_combine() comes reload_cse_regs_1() in reload_cse_regs(), which "detects no-op moves where we happened to assign two different pseudo-registers to the same hard register",

and pass_postreload_cse::execute() calls reload_cse_regs(),

so pass_postreload_cse::execute() looks like the entry point for postreload.c.

To confirm it's not in postreload.c, I put:
----------------------------------------------------------
  rtx_insn *insn, *next;
  for (insn = get_insns (); insn; insn = next)
  {
	  debug_rtx(insn);
	  next = NEXT_INSN (insn);
  }
----------------------------------------------------------
in pass_postreload_cse::execute (function *fun)

right after:

if (!dbg_cnt (postreload_cse))
    return 0;
----------------------------------------------------------
Unpatched:

(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 2 4 3 2 NOTE_INSN_DELETED)
(note 3 2 23 2 NOTE_INSN_FUNCTION_BEG)
(insn 23 3 6 2 (set (reg/f:DI 15 a5 [88])
        (plus:DI (reg:DI 10 a0 [92])
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (nil))
(insn 6 23 7 2 (set (reg:SI 10 a0 [orig:81 MEM[(int *)array_5(D) + 800B] ] [81])
        (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 32 [0x20])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32])) "array_test.c":7:5 136 {*movsi_internal}
     (expr_list:REG_EQUIV (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 32 [0x20])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32])
        (nil)))
(insn 7 6 8 2 (set (reg:SI 14 a4 [orig:82 MEM[(int *)array_5(D) + 804B] ] [82])
        (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 36 [0x24])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32])) "array_test.c":7:5 136 {*movsi_internal}
     (expr_list:REG_EQUIV (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 36 [0x24])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32])
        (nil)))
(insn 8 7 10 2 (set (reg:SI 10 a0 [80])
        (plus:SI (reg:SI 10 a0 [orig:81 MEM[(int *)array_5(D) + 800B] ] [81])
            (reg:SI 14 a4 [orig:82 MEM[(int *)array_5(D) + 804B] ] [82]))) "array_test.c":7:5 3 {addsi3}
     (nil))
(insn 10 8 11 2 (set (reg:SI 14 a4 [orig:84 MEM[(int *)array_5(D) + 808B] ] [84])
        (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 40 [0x28])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32])) "array_test.c":8:5 136 {*movsi_internal}
     (expr_list:REG_EQUIV (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 40 [0x28])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32])
        (nil)))
(insn 11 10 13 2 (set (reg:SI 14 a4 [83])
        (plus:SI (reg:SI 14 a4 [orig:84 MEM[(int *)array_5(D) + 808B] ] [84])
            (reg:SI 10 a0 [80]))) "array_test.c":8:5 3 {addsi3}
     (nil))
(insn 13 11 14 2 (set (reg:SI 10 a0 [orig:86 MEM[(int *)array_5(D) + 812B] ] [86])
        (mem:SI (plus:DI (reg/f:DI 15 a5 [88])
                (const_int 44 [0x2c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32])) "array_test.c":9:5 136 {*movsi_internal}
     (nil))
......
----------------------------------------------------------
Patched:

(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 2 4 3 2 NOTE_INSN_DELETED)
(note 3 2 27 2 NOTE_INSN_FUNCTION_BEG)
(note 27 3 6 2 NOTE_INSN_DELETED)
(insn 6 27 8 2 (set (reg:DI 14 a4 [orig:82 MEM[(int *)array_5(D) + 800B] ] [82])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 10 a0 [96])
                    (const_int 800 [0x320])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (nil))
(insn 8 6 10 2 (set (reg:DI 15 a5 [orig:84 MEM[(int *)array_5(D) + 804B] ] [84])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 10 a0 [96])
                    (const_int 804 [0x324])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (nil))
(insn 10 8 12 2 (set (reg:SI 15 a5 [80])
        (plus:SI (reg:SI 15 a5 [orig:84 MEM[(int *)array_5(D) + 804B] ] [84])
            (reg:SI 14 a4 [orig:82 MEM[(int *)array_5(D) + 800B] ] [82]))) "array_test.c":7:5 3 {addsi3}
     (nil))
(insn 12 10 14 2 (set (reg:DI 14 a4 [orig:87 MEM[(int *)array_5(D) + 808B] ] [87])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 10 a0 [96])
                    (const_int 808 [0x328])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32]))) "array_test.c":8:5 90 {extendsidi2}
     (nil))
(insn 14 12 16 2 (set (reg:SI 15 a5 [85])
        (plus:SI (reg:SI 15 a5 [80])
            (reg:SI 14 a4 [orig:87 MEM[(int *)array_5(D) + 808B] ] [87]))) "array_test.c":8:5 3 {addsi3}
     (nil))
(insn 16 14 18 2 (set (reg:DI 10 a0 [orig:90 MEM[(int *)array_5(D) + 812B] ] [90])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 10 a0 [96])
                    (const_int 812 [0x32c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32]))) "array_test.c":9:5 90 {extendsidi2}
     (nil))
......
----------------------------------------------------------
In other words this was before reload_cse_regs(), before reload_combine(), before reload_combine_recognize_const_pattern()....

------------------------------------------------------------------
In postreload.c at around line 939, the comment says:

/* We look for a REG1 = REG2 + CONSTANT insn, followed by either
     uses of REG1 inside an address, or inside another add insn.  If
     possible and profitable, merge the addition into subsequent
     uses.  */

Which really looks like the function we're looking for, but I've debugged each insn from bottom to top; maybe I've missed something in between?

I'll investigate other places where address_cost() is called.
Comment 44 Levy Hsu 2020-11-23 06:38:05 UTC
should_replace_address() in fwprop.c looks really interesting:

/* OLD is a memory address.  Return whether it is good to use NEW instead,
   for a memory access in the given MODE.  */
Comment 45 Levy Hsu 2020-11-23 07:43:23 UTC
Basically crossed out rtlanal.c and fwprop.c.
I'm looking back at riscv.c.  Since address_cost() is called by the hook function new_address_profitable_p(), maybe some place that uses this function will provide more info.
Comment 46 Levy Hsu 2020-12-01 03:03:47 UTC
Looking at gcc/passes.def and gcc/config/riscv/riscv-passes.def:

pass_shorten_memrefs is inserted after NEXT_PASS (pass_rtl_store_motion);

  NEXT_PASS (pass_rtl_store_motion);
  (pass_shorten_memrefs)
  NEXT_PASS (pass_cse_after_global_opts);
  NEXT_PASS (pass_rtl_ifcvt);
  NEXT_PASS (pass_reginfo_init);
  /* Perform loop optimizations.  It might be better to do them a bit
     sooner, but we want the profile feedback to work more
     efficiently.  */
  NEXT_PASS (pass_loop2);
  PUSH_INSERT_PASSES_WITHIN (pass_loop2)
      NEXT_PASS (pass_rtl_loop_init);
      NEXT_PASS (pass_rtl_move_loop_invariants);
      NEXT_PASS (pass_rtl_unroll_loops);
      NEXT_PASS (pass_rtl_doloop);
      NEXT_PASS (pass_rtl_loop_done);
  POP_INSERT_PASSES ()
  NEXT_PASS (pass_lower_subreg2);
  NEXT_PASS (pass_web);
  NEXT_PASS (pass_rtl_cprop);
  NEXT_PASS (pass_cse2);
  NEXT_PASS (pass_rtl_dse1);
  NEXT_PASS (pass_rtl_fwprop_addr);
  NEXT_PASS (pass_inc_dec);
  NEXT_PASS (pass_initialize_regs);
  NEXT_PASS (pass_ud_rtl_dce);
  NEXT_PASS (pass_combine);
  NEXT_PASS (pass_if_after_combine);
  NEXT_PASS (pass_jump_after_combine);
  NEXT_PASS (pass_partition_blocks);
  NEXT_PASS (pass_outof_cfg_layout_mode);
  NEXT_PASS (pass_split_all_insns);
  NEXT_PASS (pass_lower_subreg3);
  NEXT_PASS (pass_df_initialize_no_opt);
  NEXT_PASS (pass_stack_ptr_mod);
  NEXT_PASS (pass_mode_switching);
  NEXT_PASS (pass_match_asm_constraints);
  NEXT_PASS (pass_sms);
  NEXT_PASS (pass_live_range_shrinkage);
  NEXT_PASS (pass_sched);
  NEXT_PASS (pass_early_remat);
  NEXT_PASS (pass_ira);
  NEXT_PASS (pass_reload);
  NEXT_PASS (pass_postreload);
  PUSH_INSERT_PASSES_WITHIN (pass_postreload)
      NEXT_PASS (pass_postreload_cse);
      NEXT_PASS (pass_gcse2);
      NEXT_PASS (pass_split_after_reload);
......

After some debugging, it seems either:
1. The address cost info was calculated between (pass_combine) and (pass_shorten_memrefs) for the patched version, then merged in the combine pass.
The patched form fails to be recognized the way the unpatched one is, due to the sign/zero extend plus subreg.
This can be verified by adding the -fdisable-rtl-combine option when compiling; also, address_cost was never called the whole time.

2. The 4 insns were determined (or say, fixed?) before (pass_rtl_fwprop_addr); for the patched version, I saw forward_propagate_and_simplify() called 4 extra times, then going all the way through propagate_rtx() -> propagate_rtx_1() -> should_replace_address() -> address_cost() in fwprop.c.

I've also tested (pass_postreload), as mentioned by Jim, and new_address_profitable_p(), but they don't seem to be the right ones.

Need some time to examine and trace the passes between (pass_shorten_memrefs) and (pass_rtl_fwprop_addr).
Comment 47 Levy Hsu 2020-12-08 09:22:19 UTC
Where the insns are merged:
in combine.c (pass_combine),

rest_of_handle_combine()
calls
combine_instructions(),
which calls
create_log_links(), which creates the log links for insns 768 & 32/36/40/44
for both the patched and unpatched versions (checked with log_links()).

Then in combine_instructions(), FOR_EACH_LOG_LINK, try_combine() calls combine_validate_cost().

In combine_validate_cost(), for the patched version:

OLD===================846930886===================OLD
i2 & Cost 4:
(insn 27 3 6 2 (set (reg/f:DI 92)
        (plus:DI (reg:DI 96)
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (expr_list:REG_DEAD (reg:DI 96)
        (nil)))
i3 & Cost 16:
(insn 6 27 8 2 (set (reg:DI 82 [ MEM[(int *)array_5(D) + 800B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 800 [0x320])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (nil))
Old Cost 20:



NEW===================846930886===================NEW
New_Cost: 20
i0 & Cost 0:
(nil)
i1 & Cost 0:
(nil)
i2 & Cost 4:
(insn 27 3 6 2 (set (reg/f:DI 92)
        (plus:DI (reg:DI 96)
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (expr_list:REG_DEAD (reg:DI 96)
        (nil)))
i3 & Cost 16:
(insn 6 27 8 2 (set (reg:DI 82 [ MEM[(int *)array_5(D) + 800B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 800 [0x320])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (nil))
newpat:
(set (reg:DI 82 [ MEM[(int *)array_5(D) + 800B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                (const_int 800 [0x320])) [1 MEM[(int *)array_5(D) + 800B]+0 S4 A32])))
newi2pat:
(set (reg/f:DI 92)
    (plus:DI (reg:DI 96)
        (const_int 768 [0x300])))
newotherpat:
(nil)
GO!---------------------------------------------------


OLD===================1681692777===================OLD
i2 & Cost 4:
(insn 27 3 6 2 (set (reg/f:DI 92)
        (plus:DI (reg:DI 96)
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (nil))
i3 & Cost 16:
(insn 8 6 10 2 (set (reg:DI 84 [ MEM[(int *)array_5(D) + 804B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 804 [0x324])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (nil))
Old Cost 20:



NEW===================1681692777===================NEW
New_Cost: 20
i0 & Cost 0:
(nil)
i1 & Cost 0:
(nil)
i2 & Cost 4:
(insn 27 3 6 2 (set (reg/f:DI 92)
        (plus:DI (reg:DI 96)
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (nil))
i3 & Cost 16:
(insn 8 6 10 2 (set (reg:DI 84 [ MEM[(int *)array_5(D) + 804B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 804 [0x324])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32]))) "array_test.c":7:5 90 {extendsidi2}
     (nil))
newpat:
(set (reg:DI 84 [ MEM[(int *)array_5(D) + 804B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                (const_int 804 [0x324])) [1 MEM[(int *)array_5(D) + 804B]+0 S4 A32])))
newi2pat:
(set (reg/f:DI 92)
    (plus:DI (reg:DI 96)
        (const_int 768 [0x300])))
newotherpat:
(nil)
GO!---------------------------------------------------


OLD===================1714636915===================OLD
i2 & Cost 4:
(insn 27 3 6 2 (set (reg/f:DI 92)
        (plus:DI (reg:DI 96)
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (nil))
i3 & Cost 16:
(insn 12 10 14 2 (set (reg:DI 87 [ MEM[(int *)array_5(D) + 808B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 808 [0x328])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32]))) "array_test.c":8:5 90 {extendsidi2}
     (nil))
Old Cost 20:



NEW===================1714636915===================NEW
New_Cost: 20
i0 & Cost 0:
(nil)
i1 & Cost 0:
(nil)
i2 & Cost 4:
(insn 27 3 6 2 (set (reg/f:DI 92)
        (plus:DI (reg:DI 96)
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (nil))
i3 & Cost 16:
(insn 12 10 14 2 (set (reg:DI 87 [ MEM[(int *)array_5(D) + 808B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 808 [0x328])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32]))) "array_test.c":8:5 90 {extendsidi2}
     (nil))
newpat:
(set (reg:DI 87 [ MEM[(int *)array_5(D) + 808B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                (const_int 808 [0x328])) [1 MEM[(int *)array_5(D) + 808B]+0 S4 A32])))
newi2pat:
(set (reg/f:DI 92)
    (plus:DI (reg:DI 96)
        (const_int 768 [0x300])))
newotherpat:
(nil)
GO!---------------------------------------------------


OLD===================1957747793===================OLD
i2 & Cost 4:
(insn 27 3 6 2 (set (reg/f:DI 92)
        (plus:DI (reg:DI 96)
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (nil))
i3 & Cost 16:
(insn 16 14 18 2 (set (reg:DI 90 [ MEM[(int *)array_5(D) + 812B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 812 [0x32c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32]))) "array_test.c":9:5 90 {extendsidi2}
     (expr_list:REG_DEAD (reg/f:DI 92)
        (nil)))
Old Cost 20:



NEW===================1957747793===================NEW
New_Cost: 16
i0 & Cost 0:
(nil)
i1 & Cost 0:
(nil)
i2 & Cost 4:
(insn 27 3 6 2 (set (reg/f:DI 92)
        (plus:DI (reg:DI 96)
            (const_int 768 [0x300]))) "array_test.c":7:5 4 {adddi3}
     (nil))
i3 & Cost 16:
(insn 16 14 18 2 (set (reg:DI 90 [ MEM[(int *)array_5(D) + 812B] ])
        (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                    (const_int 812 [0x32c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32]))) "array_test.c":9:5 90 {extendsidi2}
     (expr_list:REG_DEAD (reg/f:DI 92)
        (nil)))
newpat:
(set (reg:DI 90 [ MEM[(int *)array_5(D) + 812B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                (const_int 812 [0x32c])) [1 MEM[(int *)array_5(D) + 812B]+0 S4 A32])))
newi2pat:
(nil)
newotherpat:
(nil)
GO!---------------------------------------------------

Here "GO!" stands for !reject (the replacement happens) and newpat is the new pattern.

The merge above is not observed in the unpatched version.

  One naive fix to verify this would be changing:
int reject = old_cost > 0 && new_cost > old_cost;
  to:
int reject = old_cost > 0 && new_cost >= old_cost;
  since both the unmerged and merged results cost 20, but that would surely cause side effects elsewhere.

As for why the unpatched version's insns never reach this point, I think it would be better to look back at try_combine() or earlier.
Comment 48 Levy Hsu 2020-12-14 10:43:51 UTC
Created attachment 49757 [details]
Initial V1 patch on combine.c

Three patches together: 


               ========= Summary of gcc testsuite =========
                            | # of unexpected case / # of unique unexpected case
                            |          gcc |          g++ |     gfortran |
 rv64imafdc/  lp64d/ medlow |    0 /     0 |    0 /     0 |      - |

I'll merge all 3 patches together and fix the remaining debug / coding style / efficiency problems, with explanations, later this week.

Looks like it's fixed from my side.
Comment 49 Levy Hsu 2020-12-15 09:55:58 UTC
Created attachment 49767 [details]
Auto-extend Patch

Combined all three patches.

               ========= Summary of gcc testsuite =========
                            | # of unexpected case / # of unique unexpected case
                            |          gcc |          g++ |     gfortran |
 rv64imafdc/  lp64d/ medlow |    0 /     0 |    0 /     0 |      - |

(May require some work on coding style)
Comment 50 Jim Wilson 2020-12-16 02:40:42 UTC
The combine change is inconvenient.  We can't do that in stage3, and it means we need to make sure that this doesn't break other targets.

If the combine change is a good idea, then I think you can just modify is_just_move rather than add a new function.  Just skip over a ZERO_EXTEND or SIGN_EXTEND before the general_operand check.  We might need a mode check against UNITS_PER_WORD, since extending past the word size is not necessarily a simple move.

So the problem stems from the code in combine that is willing to do a 2->2 split if neither original instruction is a simple move.  When we add a SIGN_EXTEND or ZERO_EXTEND they aren't considered simple moves anymore.
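
A minimal sketch of that idea, assuming the GCC 10-era shape of is_just_move in combine.c (untested; the known_le guard is the UNITS_PER_WORD check mentioned above):

static bool
is_just_move (rtx x)
{
  if (INSN_P (x))
    x = PATTERN (x);

  if (GET_CODE (x) != SET)
    return false;

  rtx src = SET_SRC (x);
  /* Untested sketch: treat a word-size-or-smaller sign/zero extension
     of a simple operand as still being "just a move", so the 2->2
     split logic keeps firing once sub-word loads are emitted as
     extend + subreg.  */
  if ((GET_CODE (src) == ZERO_EXTEND || GET_CODE (src) == SIGN_EXTEND)
      && known_le (GET_MODE_SIZE (GET_MODE (src)), UNITS_PER_WORD))
    src = XEXP (src, 0);

  return general_operand (src, VOIDmode);
}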

There is still the question of why the instruction cost allows the change.  First I noticed that riscv_address_cost has a hook to check for shorten_memrefs,
but riscv_rtx_costs isn't calling it; it uses riscv_address_insns instead.  So it seems like adding a shorten_memref check to the MEM case of riscv_rtx_costs might solve the problem.  But riscv_compressed_lw_address_p has a !reload_completed check, and combine runs before reload, so it returns the same result for both the new and old insns.  I think that reload_completed check isn't quite right.  If we have a pseudo-reg, then we should allow it, but if we have an offset that is clearly not compressible, then we should reject it before reload.  So I think the reload_completed check should be moved down to where the function checks for a compressible register.  With those two changes, I can make the testcase work without a combine change.  Though since I changed how shorten_memrefs works, we need a check to make sure this patch doesn't change code size.  I haven't tried to do that yet.
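
In outline, those two changes might look roughly like this (a sketch, not the exact diff; riscv_mshorten_memrefs is assumed here to be the option variable behind -mshorten-memrefs):

/* riscv.c, riscv_rtx_costs: sketch of the MEM-case change.  When
   optimizing for size with shorten_memrefs, charge extra for an
   SImode access whose address can never be a compressed lw/sw, so
   combine rejects folding the offset back into the load.  */
case MEM:
  if (TARGET_RVC && !speed && riscv_mshorten_memrefs && mode == SImode
      && !riscv_compressed_lw_address_p (XEXP (x, 0)))
    {
      *total = COSTS_N_INSNS (1);
      return true;
    }
  return false;

/* riscv_compressed_lw_address_p: sketch of moving the reload_completed
   check down.  Before reload, assume any pseudo might end up in a
   compressed register, but still reject offsets that can never be
   compressed.  */
if (reload_completed && !riscv_compressed_reg_p (REGNO (addr.reg)))
  return false;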

With my changes, in the combine dump, I see

Successfully matched this instruction:
(set (reg/f:DI 92)
    (plus:DI (reg:DI 96)
        (const_int 768 [0x300])))
Successfully matched this instruction:
(set (reg:DI 82 [ MEM[(intD.1 *)array_5(D) + 800B] ])
    (sign_extend:DI (mem:SI (plus:DI (reg:DI 96)
                (const_int 800 [0x320])) [1 MEM[(intD.1 *)array_5(D) + 800B]+0 S4 A32])))
rejecting combination of insns 27 and 6
original costs 4 + 16 = 20
replacement costs 4 + 20 = 24

so now the replacement gets rejected because of the increased address cost.
Comment 51 Jim Wilson 2020-12-16 02:42:35 UTC
Created attachment 49773 [details]
untested fix to use instead of levy's combine.c patch

Needs testing without Levy's patch to make sure it doesn't accidentally increase code size.
Comment 52 Jim Wilson 2020-12-17 18:13:38 UTC
I did some simple testing of my patch alone.  I modified the riscv-gnu-toolchain makefile to use -Os to build all libraries, built a rv32imac/ilp32 toolchain, and looked at library code size.  I see a few cases where my patch gives smaller code, and no obvious cases where it gives larger code, so I think it is OK.

I haven't tried a full test with my patch combined with Levy's patch minus the combine change.
Comment 53 jiawei 2020-12-17 18:26:10 UTC
Hi Jim,

Levy asked me to help with the testing. I have registered on EEMBC and am waiting for access to more benchmarks. Right now I am testing on csibe and Coremark-pro; I expect to post the results this week. If it needs testing on more suites, just mail me about it.

Best wishes,
Jiawei
Comment 54 jiawei 2020-12-21 15:08:58 UTC
Hi Jim,

I have finished the test with the Coremark-pro benchmark, and it shows that the patch doesn't accidentally increase code size.

The test ran with the args "XCMD='-c4' certify-all", and the results follow:

WORKLOAD RESULTS TABLE (original gcc-10.2.0 from upstream, compiled with -mabi=lp64 -march=rv64imafdc)

                                                 MultiCore SingleCore           
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  119.05      46.08       2.58
core                                                  1.10       0.28       3.93
linear_alg-mid-100x100-sp                            35.19       8.92       3.95
loops-all-mid-10k-sp                                  1.77       0.46       3.85
nnet_test                                             1.69       0.51       3.31
parser-125k                                          43.01      15.62       2.75
radix2-big-64k                                      257.93      56.84       4.54
sha-test                                            156.25      57.80       2.70
zip-test                                             81.63      26.32       3.10

MARK RESULTS TABLE

Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                       2669.64     796.33       3.35



WORKLOAD RESULTS TABLE (with patch one applied -- Auto-extend Patch)

                                                 MultiCore SingleCore           
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  131.58      45.87       2.87
core                                                  1.08       0.28       3.86
linear_alg-mid-100x100-sp                            35.49       8.75       4.06
loops-all-mid-10k-sp                                  1.75       0.46       3.80
nnet_test                                             1.68       0.51       3.29
parser-125k                                          54.05      15.62       3.46
radix2-big-64k                                      257.80      65.57       3.93
sha-test                                            153.85      57.47       2.68
zip-test                                             76.92      26.32       2.92

MARK RESULTS TABLE

Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                       2737.52     806.42       3.39



WORKLOAD RESULTS TABLE (with all patches applied -- Auto-extend Patch & untested fix to use instead of levy's combine.c patch)

                                                 MultiCore SingleCore           
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  120.48      45.66       2.64
core                                                  1.08       0.28       3.86
linear_alg-mid-100x100-sp                            35.61       8.77       4.06
loops-all-mid-10k-sp                                  1.76       0.46       3.83
nnet_test                                             1.68       0.51       3.29
parser-125k                                          46.51      15.62       2.98
radix2-big-64k                                      257.33      65.18       3.95
sha-test                                            153.85      57.47       2.68
zip-test                                             83.33      26.32       3.17

MARK RESULTS TABLE

Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                       2691.95     805.68       3.34

The csibe test setup is still being modified, and I will run it once it works with riscv-gnu-toolchain.
Comment 55 Kito Cheng 2020-12-21 15:38:55 UTC
Hi jiawei:

Thanks for the data. The performance change for coremark-pro seems interesting; could you find which part generates different code after the patch?

And I am curious which platform you used for the performance data?
Comment 56 jiawei 2020-12-21 16:09:48 UTC
Hi Kito,

I measured the performance data on qemu-riscv64, compiling the benchmark with riscv-unknown-linux-gnu-gcc -Os. All the modifications are in /coremark-pro/util/make/, to change the toolchain and run environment.

I am also interested in which part has changed due to the patch. I will try to find out the difference.
Comment 57 Levy Hsu 2020-12-22 06:35:17 UTC
Thank you JiaWei for the CoreMark-Pro result.

Personally, I agree with Jim: changing the split behaviour of try_combine in combine.c could affect other platforms. Theoretically we could fix it with a platform flag and a UNITS_PER_WORD check, or maybe just skip over a ZERO_EXTEND or SIGN_EXTEND before the general_operand check, but that seems inconvenient.

We probably need more testing of all the patches to see the differences in code size & speed. Maybe after the EEMBC results come out we can decide how to proceed.
Comment 58 Kito Cheng 2020-12-25 03:31:13 UTC
Hi jiawei:

I would suggest using instruction counts rather than cycles or time for measuring the benchmark if you are using qemu. Qemu is a functional simulator, not a cycle-accurate (or even nearly cycle-accurate) simulator, so the performance numbers really come from your native host (x86_64) CPU, and they may also be sensitive to your host load. Or maybe you could try gem5 for that?

Thanks for helping benchmark this :)
Comment 59 jiawei 2020-12-25 09:09:40 UTC
Hi Kito,

Okay, I will retest the benchmark on gem5.
Comment 60 GCC Commits 2021-02-13 20:24:13 UTC
The master branch has been updated by Jim Wilson <wilson@gcc.gnu.org>:

https://gcc.gnu.org/g:a4953810bac524e19126a2745c75fed58db962c2

commit r11-7236-ga4953810bac524e19126a2745c75fed58db962c2
Author: Jim Wilson <jimw@sifive.com>
Date:   Sat Feb 13 12:13:08 2021 -0800

    RISC-V: Shorten memrefs improvement, partial fix 97417.
    
    We already have a check for riscv_shorten_memrefs in riscv_address_cost.
    This adds the same check to riscv_rtx_costs.  Making this work also
    requires a change to riscv_compressed_lw_address_p to work before reload
    by checking the offset and assuming any pseudo reg is OK.  Testing shows
    that this consistently gives small code size reductions.
    
            gcc/
            PR target/97417
            * config/riscv/riscv.c (riscv_compressed_lw_address_p): Drop early
            exit when !reload_completed.  Only perform check for compressed reg
            if reload_completed.
            (riscv_rtx_costs): In MEM case, when optimizing for size and
            shorten memrefs, if not compressible, then increase cost.
Comment 61 GCC Commits 2021-02-13 20:37:17 UTC
The master branch has been updated by Jim Wilson <wilson@gcc.gnu.org>:

https://gcc.gnu.org/g:18fabc35f47f0abf4ec14d147098ec4e0734d2a3

commit r11-7237-g18fabc35f47f0abf4ec14d147098ec4e0734d2a3
Author: Levy Hsu <admin@levyhsu.com>
Date:   Sat Feb 13 12:26:33 2021 -0800

    RISC-V: Avoid zero/sign extend for volatile loads.  Fix for 97417.
    
    This expands sub-word loads as a zero/sign extended load, followed by
    a subreg.  This helps eliminate unnecessary zero/sign extend insns after
    the load, particularly for volatiles, but also in some other cases.
    Testing shows that it gives consistent code size decreases.
    
    Tested with riscv32-elf rv32imac/ilp32 and riscv64-linux rv64gc/lp64d
    builds and checks.  Some -gsplit-stack tests fail with the patch, but
    this turns out to be an existing bug with the split-stack support that
    I hadn't noticed before.  It isn't a bug in this patch.  Ignoring that
    there are no regressions.
    
    Committed.
    
            gcc/
            PR target/97417
            * config/riscv/riscv-shorten-memrefs.c (pass_shorten_memrefs): Add
            extend parameter to get_si_mem_base_reg declaration.
            (get_si_mem_base_reg): Add extend parameter.  Set it.
            (analyze): Pass extend arg to get_si_mem_base_reg.
            (transform): Likewise.  Use it when rewriting mems.
            * config/riscv/riscv.c (riscv_legitimize_move): Check for subword
            loads and emit sign/zero extending load followed by subreg move.
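
In outline, the new expansion in riscv_legitimize_move does roughly the following (a sketch reconstructed from the description above, not the verbatim committed hunk):

/* In riscv_legitimize_move: expand a sub-word integer load as an
   extending load into a word_mode temporary plus a lowpart subreg,
   e.g.
     (set (reg:QI target) (mem:QI addr))
   becomes
     (set (reg:DI temp)   (zero_extend:DI (mem:QI addr)))
     (set (reg:QI target) (subreg:QI (reg:DI temp) 0))
   so a later extension of the loaded value can be combined away,
   even when the memory is volatile.  */
if (GET_MODE_CLASS (mode) == MODE_INT
    && GET_MODE_SIZE (mode) < UNITS_PER_WORD
    && can_create_pseudo_p ()
    && MEM_P (src))
  {
    rtx temp_reg = gen_reg_rtx (word_mode);
    int zero_extend_p = (LOAD_EXTEND_OP (mode) == ZERO_EXTEND);
    emit_insn (gen_extend_insn (temp_reg, src, word_mode, mode,
                                zero_extend_p));
    riscv_emit_move (dest, gen_lowpart (mode, temp_reg));
    return true;
  }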
Comment 62 Jim Wilson 2021-02-13 20:48:13 UTC
I committed my shorten memrefs patch and Levy's patch minus the combine change.  I don't believe that we need the combine change.

I did notice one interesting case where we get unnecessary zero extends.  I will submit that as a new bug report.

I also noticed with riscv64-linux testing that some -gsplit-dwarf tests fail, but this turns out to be a latent bug in the split-dwarf support.  I will submit that as a new bug report.

I believe this is fixed, so closing as resolved.