18576 – [3.4/4.0 Regression] missing jump threading because of type changes

Bug 18576 - [3.4/4.0 Regression] missing jump threading because of type changes

Summary: [3.4/4.0 Regression] missing jump threading because of type changes

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.0.0

Importance:	P2 normal
Target Milestone:	4.1.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	jumpthreading
	Show dependency tree / graph

Reported:	2004-11-20 06:29 UTC by Andrew Pinski
Modified:	2005-11-04 15:37 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:
Build:
Known to work:	3.3.3 3.3.1 4.1.0
Known to fail:	3.4.0 4.0.0 3.0.4 3.2.3 2.95.3
Last reconfirmed:	2005-01-23 22:35:20

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Andrew Pinski 2004-11-20 06:29:17 UTC

struct loop
{
  int depth;
  struct loop **pred;
};

static __inline__ unsigned char
flow_loop_nested_p (const struct loop *outer, const struct loop *loop)
{
  return (loop->depth > outer->depth
  && loop->pred[outer->depth] == outer);
}

unsigned char
flow_bb_inside_loop_p (const struct loop *loop, const struct loop *source_loop)
{
  return loop == source_loop || flow_loop_nested_p (loop, source_loop);
}

In .64.final_cleanup:
<L2>:;
  iftmp.0 = 1;
  goto <bb 5> (<L14>);

<L4>:;
  iftmp.0 = 0;

<L14>:;
  if ((unsigned char) (int) (unsigned char) iftmp.0 != 0) goto <L8>; else goto <L7>;

See miss this jump threading (in fact we miss it on the RTL level also).
We do get better code already on the tree-cleanup-branch but that is because of my rewrite of PHI-OPT
for that branch we get:
<L12>:;
  iftmp.0 = 0;
  goto <bb 5> (<L4>);

<L1>:;
  iftmp.0 = (int) !(loop != *(source_loop->pred + (struct loop * *) ((unsigned int) D.1160 * 4)));

<L4>:;
  iftmp.1 = (int) ((unsigned char) (int) (unsigned char) iftmp.0 != 0);

Comment 1 Andrew Pinski 2004-11-20 06:37:39 UTC

The corresponding asm for the tcb:
        cmpw cr7,r3,r4
        mr r11,r3
        li r10,0
        li r3,1
        beqlr- cr7
        lwz r0,0(r11)
        lwz r2,0(r4)
        slwi r9,r0,2
        cmpw cr7,r2,r0
        ble- cr7,L7
        lwz r2,4(r4)
        lwzx r0,r9,r2
        xor r10,r11,r0
        subfic r2,r10,0
        adde r10,r2,r10
L7:
        mr r3,r10
        blr

the mainline (which is much worse):
        cmpw cr7,r3,r4
        li r10,0
        beq- cr7,L2
        lwz r0,0(r3)
        li r11,1
        lwz r2,0(r4)
        slwi r9,r0,2
        cmpw cr7,r2,r0
        bgt- cr7,L12
L4:
        li r11,0
L7:
        cmpwi cr7,r11,0
        bne- cr7,L2
        mr r3,r10
        blr
L12:
        lwz r2,4(r4)
        lwzx r0,r9,r2
        cmpw cr7,r3,r0
        bne+ cr7,L4
        b L7
L2:
        li r10,1
        mr r3,r10
        blr

Comment 2 Andrew Pinski 2004-11-20 06:43:00 UTC

Oh, even though the asm is fixed on the tcb we are still missing it on the tree level.

By the way I noticed this while trying to speedup flow_bb_inside_loop_p in the first place.

Comment 3 Steven Bosscher 2004-11-20 16:44:20 UTC

 
Confirmed, I see similar code on x86: 
 
.L4: 
        xorl    %ebx, %ebx 
.L7: 
        xorl    %edx, %edx 
        testb   %bl, %bl 
        jne     .L2

Comment 4 Steven Bosscher 2004-11-20 17:01:25 UTC

I have the following in the .optimized dump: 
 
flow_bb_inside_loop_p (loop, source_loop) 
{ 
  unsigned char D.1171; 
  struct loop * D.1170; 
  struct loop * * D.1169; 
  struct loop * * D.1168; 
  unsigned int D.1167; 
  unsigned int D.1166; 
  struct loop * * D.1165; 
  int D.1164; 
  int D.1163; 
  int iftmp.0; 
  int D.1161; 
  struct loop * outer; 
  struct loop * loop; 
  unsigned char D.1146; 
  unsigned char D.1145; 
  int iftmp.1; 
  int D.1140; 
  int D.1160; 
 
  # BLOCK 0 
  # PRED: ENTRY [100.0%]  (fallthru,exec) 
  if (loop == source_loop) goto <L8>; else goto <L0>; 
  # SUCC: 6 [10.4%]  (true,exec) 1 [89.6%]  (false,exec) 
 
  # BLOCK 1 
  # PRED: 0 [89.6%]  (false,exec) 
<L0>:; 
  D.1164 = loop->depth; 
  if (source_loop->depth <= D.1164) goto <L4>; else goto <L1>; 
  # SUCC: 4 [50.0%]  (true,exec) 2 [50.0%]  (false,exec) 
 
  # BLOCK 2 
  # PRED: 1 [50.0%]  (false,exec) 
<L1>:; 
  if (loop != *(source_loop->pred + (struct loop * *) ((unsigned int) D.1164 * 
4))) goto <L4>; else goto <L2>; 
  # SUCC: 4 [81.0%]  (true,exec) 3 [19.0%]  (false,exec) 
 
  # BLOCK 3 
  # PRED: 2 [19.0%]  (false,exec) 
<L2>:; 
  iftmp.0 = 1; 
  goto <bb 7> (<L14>); 
  # SUCC: 7 [100.0%]  (fallthru,exec) 
 
  # BLOCK 4 
  # PRED: 1 [50.0%]  (true,exec) 2 [81.0%]  (true,exec) 
<L4>:; 
  iftmp.0 = 0; 
  # SUCC: 7 [100.0%]  (fallthru) 
 
  # BLOCK 7 
  # PRED: 4 [100.0%]  (fallthru) 3 [100.0%]  (fallthru,exec) 
<L14>:; 
  if ((unsigned char) (int) (unsigned char) iftmp.0 != 0) goto <L8>; else goto 
<L7>; 
  # SUCC: 6 [33.0%]  (true,exec) 5 [67.0%]  (false,exec) 
 
  # BLOCK 5 
  # PRED: 7 [67.0%]  (false,exec) 
<L7>:; 
  iftmp.1 = 0; 
  goto <bb 8> (<L15>); 
  # SUCC: 8 [100.0%]  (fallthru,exec) 
 
  # BLOCK 6 
  # PRED: 0 [10.4%]  (true,exec) 7 [33.0%]  (true,exec) 
<L8>:; 
  iftmp.1 = 1; 
  # SUCC: 8 [100.0%]  (fallthru) 
 
  # BLOCK 8 
  # PRED: 6 [100.0%]  (fallthru) 5 [100.0%]  (fallthru,exec) 
<L15>:; 
  return (int) (unsigned char) iftmp.1; 
  # SUCC: EXIT [100.0%] 
} 
 
Control flow graph: 
 
ENTRY 
| 
0 
|\ 
| \ 
1  \ 
|\  \ 
| \  \ 
2  \  \ 
|\ |  | 
| \|  | 
3  4  | 
| /   | 
|/    | 
7     / 
|\   / 
| \ / 
5  6 
| / 
|/ 
8 
| 
EXIT 
 
 
 
Dominator tree: 
 
  0 
 /|\ 
6 8 1 
   /|\ 
  4 7 2 
    | | 
    5 3 
 
 
The problematic part is here: 
 
  # BLOCK 3 
  # PRED: 2 [19.0%]  (false,exec) 
<L2>:; 
  iftmp.0 = 1; 
  goto <bb 7> (<L14>); 
  # SUCC: 7 [100.0%]  (fallthru,exec) 
 
  # BLOCK 4 
  # PRED: 1 [50.0%]  (true,exec) 2 [81.0%]  (true,exec) 
<L4>:; 
  iftmp.0 = 0; 
  # SUCC: 7 [100.0%]  (fallthru) 
 
  # BLOCK 7 
  # PRED: 4 [100.0%]  (fallthru) 3 [100.0%]  (fallthru,exec) 
<L14>:; 
  if ((unsigned char) (int) (unsigned char) iftmp.0 != 0) goto <L8>; else goto 
<L7>; 
  # SUCC: 6 [33.0%]  (true,exec) 5 [67.0%]  (false,exec) 
 
 
Possible causes of the missed jump thread: 
1) Are casts getting in the way? 
2) Do we need that one of edges into the condition we want to 
   thread through dominates the block?

Comment 5 Andrew Pinski 2004-11-20 17:11:19 UTC

Note changing flow_loop_nested_p to return an int instead of unsigned char, the jump threading works

Comment 6 Steven Bosscher 2004-11-20 17:11:50 UTC

Situation in DOM3:  
  
  # BLOCK 4  
  # PRED: 3 [100.0%]  (fallthru,exec) 2 [81.0%]  (true,exec) 1 [50.0%]   
(true,exec)  
  # iftmp.0_1 = PHI <0(2), 1(3), 0(1)>;  
<L4>:;  
  D.1171_13 = (unsigned char) iftmp.0_1;  
  D.1160_14 = (int) D.1171_13;  
  D.1145_16 = (unsigned char) D.1160_14;  
  if (D.1145_16 != 0) goto <L8>; else goto <L7>;  
  # SUCC: 6 [33.0%]  (true,exec) 5 [67.0%]  (false,exec)  
  
So this is almost certainly casts again getting in the way.

Comment 7 Andrew Pinski 2004-11-20 17:56:59 UTC

Hmm, this is 3.4 and 4.0 Regression, 3.3.3 produced:
flow_bb_inside_loop_p:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %ecx
        pushl   %ebx
        movl    12(%ebp), %eax
        xorl    %ebx, %ebx
        cmpl    %eax, %ecx
        je      .L5
        movl    (%ecx), %edx
        cmpl    %edx, (%eax)
        jle     .L4
        movl    4(%eax), %eax
        cmpl    %ecx, (%eax,%edx,4)
        je      .L5
        .p2align 4,,15
.L4:
        movl    %ebx, %eax
        popl    %ebx
        popl    %ebp
        ret
        .p2align 4,,7
.L5:
        movl    $1, %ebx
        jmp     .L4

But 4.0 and 3.4.0 produces:
flow_bb_inside_loop_p:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        movl    %esi, 4(%esp)
        movl    8(%ebp), %ecx
        xorl    %esi, %esi
        movl    %ebx, (%esp)
        movl    12(%ebp), %eax
        cmpl    %eax, %ecx
        je      .L3
        movl    (%ecx), %edx
        xorl    %ebx, %ebx
        cmpl    %edx, (%eax)
        jle     .L5
        movl    4(%eax), %eax
        cmpl    %ecx, (%eax,%edx,4)
        je      .L7
.L5:
        testb   %bl, %bl
        je      .L2
.L3:
        movl    $1, %esi
.L2:
        movl    %esi, %eax
        movl    (%esp), %ebx
        movl    4(%esp), %esi
        movl    %ebp, %esp
        popl    %ebp
        ret
.L7:
        movl    $1, %ebx
        jmp     .L5

Comment 8 Steven Bosscher 2004-11-24 22:38:24 UTC

I don't think this is "minor".  We have many places where we have 
predicates returning "bool", and this PR suggests that we can never 
thread on their results because of casts. 
 
Adding Jeff, this may be a new prey for him after he's done with 
Bug 15524 ;-)

Comment 9 Jeffrey A. Law 2004-11-25 00:22:17 UTC

Subject: Re:  [3.4/4.0 Regression] missing
	jump threading because of type changes

On Wed, 2004-11-24 at 22:38 +0000, steven at gcc dot gnu dot org wrote:
> ------- Additional Comments From steven at gcc dot gnu dot org  2004-11-24 22:38 -------
> I don't think this is "minor".  We have many places where we have 
> predicates returning "bool", and this PR suggests that we can never 
> thread on their results because of casts. 
>  
> Adding Jeff, this may be a new prey for him after he's done with 
> Bug 15524 ;-) 
There are a number of ways we can attack this.  I don't know if any are
appropriate for 4.0 though.

One approach is to go with the phi optimization rewrite which allows it
to work with > 2 predecessors to the PHI block.

Another would be to look at Kazu's work which (IIRC) tried to identify
typecasts which were fed by PHIs with all constant arguments.  It then
removed the typecast and original PHI and installed a new PHI with a
result of the proper type.

A third would be to revamp the jump threading selection code.   In the
early days of the jump threading code we couldn't handle SSA graph
updates if we threaded through a block with side effects.  The new
SSA graph update code handles that case, so the selection code shouldn't
need to be so conservative anymore.

I think ultimately we'll want to do all three.  The question in my mind
is which (if any) are suitable for gcc 4.0.

jeff

Comment 10 Kazu Hirata 2004-12-12 17:26:24 UTC

> One approach is to go with the phi optimization rewrite which allows it
> to work with > 2 predecessors to the PHI block.

This is done on tree-cleanup-branch.

> Another would be to look at Kazu's work which (IIRC) tried to identify
> typecasts which were fed by PHIs with all constant arguments.  It then
> removed the typecast and original PHI and installed a new PHI with a
> result of the proper type.

The problem of replacing one PHI node with another is filed as PR 14843.

Comment 11 Kazu Hirata 2004-12-12 17:42:00 UTC

Another (possibly expensive) approach would be to "ignore" statements whose
results are used only in the basic block we are threading through.

Consider:

  # iftmp.0_1 = PHI <1(2), 0(3), 0(1)>;
<L4>:;
  D.1171_13 = (unsigned char) iftmp.0_1;
  D.1160_14 = (int) D.1171_13;
  D.1145_16 = (unsigned char) D.1160_14;
  if (D.1145_16 != 0) goto <L8>; else goto <L14>;

Note that the LHS of all MODIFY_EXPRs are used only in this basic block.
So if we get to COND_EXPR and find out that COND_EXPR_COND is a constant
(due to temporary propagation), then we can safely duplicate this basic block
and thread an incoming edge (without causing code bloat).
All the duplicate copies of MODIFY_EXPRs will be removed as dead code.

Comment 12 Jeffrey A. Law 2004-12-13 16:33:10 UTC

Subject: Re:  [3.4/4.0 Regression] missing
	jump threading because of type changes

On Sun, 2004-12-12 at 17:42 +0000, kazu at cs dot umass dot edu wrote:
> ------- Additional Comments From kazu at cs dot umass dot edu  2004-12-12 17:42 -------
> Another (possibly expensive) approach would be to "ignore" statements whose
> results are used only in the basic block we are threading through.
> 
> Consider:
> 
>   # iftmp.0_1 = PHI <1(2), 0(3), 0(1)>;
> <L4>:;
>   D.1171_13 = (unsigned char) iftmp.0_1;
>   D.1160_14 = (int) D.1171_13;
>   D.1145_16 = (unsigned char) D.1160_14;
>   if (D.1145_16 != 0) goto <L8>; else goto <L14>;
> 
> Note that the LHS of all MODIFY_EXPRs are used only in this basic block.
> So if we get to COND_EXPR and find out that COND_EXPR_COND is a constant
> (due to temporary propagation), then we can safely duplicate this basic block
> and thread an incoming edge (without causing code bloat).
> All the duplicate copies of MODIFY_EXPRs will be removed as dead code.
No.  The way to handle this is to remove all the nonsense about proving
that statements in the block are NOPs.  That precondition for threading
is no longer necessary and just gets in the way of doing a good job
at optimizing.

This is a great example of the kind of code I expect the threader to
be able to handle once that lameness is removed.

jeff

Comment 13 Kazu Hirata 2004-12-13 17:13:13 UTC

Subject: Re:  [3.4/4.0 Regression] missing
 jump threading because of type changes

Hi Jeff,

> No.  The way to handle this is to remove all the nonsense about proving
> that statements in the block are NOPs.  That precondition for threading
> is no longer necessary and just gets in the way of doing a good job
> at optimizing.
> 
> This is a great example of the kind of code I expect the threader to
> be able to handle once that lameness is removed.

Sorry, the example was a little bad.  I had something like this in my
mind.

  # iftmp.0_1 = PHI <a_1(2), 0(3)>;
<L4>:;
  D.1171_13 = iftmp.0_1 + 1;
  if (D.1171_13 != 0) goto <L8>; else goto <L14>;

Note that D.1171_13 is partially constant.  I guess we should really
transform this into

<bb 2>:
  a_2 = a_1 + 1;

  # D.1171_13 = PHI <a_2(2), 1(3)>;
<L4>:;
  if (D.1171_13 != 0) goto <L8>; else goto <L14>;

Of course, if there are a lot of incoming edges that do not have
constant PHI arguments, sending "+ 1" upstream would cause code bloat.
We could "fix" this by creating a helper block that does "+ 1" and
then jump to L4, but things get more and more complicated, and I don't
know if it's profitable.

Another thing I've seen blocking a jump threading opportunity is:

<L4>:;
  D.1171_13 = global_variable;
  if (D.1171_13 != 0) goto <L8>; else goto <L14>;

Where the value of global_variable is known on one of the incoming
edges of L4.  I guess we need to remove partially redundancy
associated with this load and shouldn't blame the jump threader.  That
is, we should transform the code above to

  D.1171_13 = PHI <...(2), ...(3)>
<L4>:;
  if (D.1171_13 != 0) goto <L8>; else goto <L14>;

Kazu Hirata

Comment 14 Jeffrey A. Law 2004-12-13 17:28:19 UTC

Subject: Re:  [3.4/4.0 Regression] missing
	jump threading because of type changes

On Mon, 2004-12-13 at 17:13 +0000, kazu at cs dot umass dot edu wrote:

> Sorry, the example was a little bad.  I had something like this in my
> mind.
> 
>   # iftmp.0_1 = PHI <a_1(2), 0(3)>;
> <L4>:;
>   D.1171_13 = iftmp.0_1 + 1;
>   if (D.1171_13 != 0) goto <L8>; else goto <L14>;
> 
> Note that D.1171_13 is partially constant.  I guess we should really
> transform this into
> 
> <bb 2>:
>   a_2 = a_1 + 1;
> 
>   # D.1171_13 = PHI <a_2(2), 1(3)>;
> <L4>:;
>   if (D.1171_13 != 0) goto <L8>; else goto <L14>;
> 
> Of course, if there are a lot of incoming edges that do not have
> constant PHI arguments, sending "+ 1" upstream would cause code bloat.
> We could "fix" this by creating a helper block that does "+ 1" and
> then jump to L4, but things get more and more complicated, and I don't
> know if it's profitable.
Not really.

We should thread the edge coming from block 3.  Again, this is precisely
the kind of thing we can and should be doing in the selection code
now that the SSA updating code can handle it.

> 
> Another thing I've seen blocking a jump threading opportunity is:
> 
> <L4>:;
>   D.1171_13 = global_variable;
>   if (D.1171_13 != 0) goto <L8>; else goto <L14>;
> 
> Where the value of global_variable is known on one of the incoming
> edges of L4.  I guess we need to remove partially redundancy
> associated with this load and shouldn't blame the jump threader.  That
> is, we should transform the code above to
> 
>   D.1171_13 = PHI <...(2), ...(3)>
> <L4>:;
>   if (D.1171_13 != 0) goto <L8>; else goto <L14>;
Again, this should be something we can handle in the threader.  If we
know the value of the global on one more more incoming edges, then
we ought to be able to thread it.  The only reason we do not in cases
like this is because the old SSA updating code couldn't handle this
kind of update because the assignment to D.1171_13 might need to be
preserved on one or more paths.

I don't see anything in these example that can not and should not
be handled by fixing the jump thread selection code.  In fact, they
are all just instances of issues I have already seen and designed
the SSA updating code to property handle.

jeff

Comment 15 Kazu Hirata 2004-12-22 14:08:29 UTC

One approach is to fix PR 14843.
It is just one approach, so I won't say this PR is blocked by PR 14843.

Comment 16 Andrew Pinski 2005-03-22 19:25:58 UTC

Fixed at least on the mainline.

Comment 17 Andrew Pinski 2005-07-22 21:12:12 UTC

Moving to 4.0.2 pre Mark.

Comment 18 Andrew Pinski 2005-11-04 15:37:40 UTC

Since this is only a missed optimization (and this was my bug) closing as fixed for 4.1.0.