33989 – Extra load/store for float with union

Bug 33989 - Extra load/store for float with union

Summary: Extra load/store for float with union

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.3.0

Importance:	P3 enhancement
Target Milestone:	9.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	34043 42588
	Show dependency tree / graph

Reported:	2007-11-03 17:47 UTC by Andrew Pinski
Modified:	2020-01-30 19:10 UTC (History)
CC List:	7 users (show)

See Also:
Host:
Target:	powerpc-
Build:
Known to work:
Known to fail:	3.3.6, 4.3.0
Last reconfirmed:	2008-01-02 14:14:25

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Andrew Pinski 2007-11-03 17:47:12 UTC

Testcase:
union a
{
  int i;
  float f;
};

void f(float *a, int *b, float e)
{
  union a c;
  c.f = *a + e;
  *b = c.i;
}

--- CUT ---
Currently we get (on x86):
        subl    $28, %esp
        movl    32(%esp), %eax
        flds    40(%esp)
        fadds   (%eax)
        movl    36(%esp), %eax
        fstps   12(%esp)  <--- extra store
        movl    12(%esp), %edx  <--- extra load
        movl    %edx, (%eax)  <--- store result to *b
        addl    $28, %esp
        ret
Or with -mfpmath=sse:
_f:
        subl    $28, %esp
        movl    32(%esp), %eax
        movss   40(%esp), %xmm0
        addss   (%eax), %xmm0
        movl    36(%esp), %eax
        movss   %xmm0, 12(%esp)  <--- extra store
        movl    12(%esp), %edx <--- extra load
        movl    %edx, (%eax)  <---store result to *b
        addl    $28, %esp
        ret

Or on PPC:
f:
        lfs 0,0(3)
        stwu 1,-16(1)
        fadds 0,1,0
        stfs 0,8(1)  <--- extra store
        lwz 0,8(1)  <--- extra load
        addi 1,1,16
        stw 0,0(4)  <--- store result to *b
        blr

The issue is that SFmode cannot be in integer registers.
The rtl looks like:
(insn 8 7 9 3 t1.c:10 (set (reg:SF 124)
        (mem:SF (reg/v/f:SI 120 [ a ]) [2 S4 A32])) -1 (nil))

(insn 9 8 10 3 t1.c:10 (set (reg:SF 123)
        (plus:SF (reg/v:SF 122 [ e ])
            (reg:SF 124))) -1 (nil))

(insn 10 9 11 3 t1.c:10 (set (subreg:SF (reg/v:SI 119 [ c ]) 0)
        (reg:SF 123)) -1 (nil))

(insn 11 10 16 3 t1.c:11 (set (mem:SI (reg/v/f:SI 121 [ b ]) [3 S4 A32])
        (reg/v:SI 119 [ c ])) -1 (nil))

See how we could translate register 119 (SImode) into a register that is in SFmode and get away with it.

Comment 1 Andrew Pinski 2007-11-05 03:09:00 UTC

So if I have emit_move_insn not to produce:
(insn 10 9 11 3 t1.c:10 (set (subreg:SF (reg/v:SI 119 [ c ]) 0)
        (reg:SF 123)) -1 (nil))
but instead:
(insn 10 9 11 3 t1.c:10 (set (reg/v:SI 119 [ c ])
        (subreg:SI (reg:SF 123) 0)) -1 (nil))
I could not get the (subreg:SI (reg:SF 123) 0) propgated into
 (insn 11 10 16 3 t1.c:11 (set (mem:SI (reg/v/f:DI 121 [ b ]) [3 S4 A32])
        (reg/v:SI 119 [ c ])) -1 (nil))

I have not figured out why yet.  Once that is done, we can have simplify_set in combine change the modes of the mem.

Comment 2 Richard Biener 2008-01-02 14:14:25 UTC

Confirmed.

Comment 3 Richard Biener 2008-02-28 16:47:18 UTC

With the patch proposed for PR34043 we get

f:
.LFB2:
        addss   (%rdi), %xmm0
        movd    %xmm0, (%rsi)
        ret

instead of

f:
.LFB2:
        addss   (%rdi), %xmm0
        movss   %xmm0, -4(%rsp)
        movl    -4(%rsp), %eax
        movl    %eax, (%rsi)
        ret

so expansion can handle

f (a, b, e)
{
<bb 2>:
  *b = VIEW_CONVERT_EXPR<int>(*a + e);
  return;

}

well compared to

f (a, b, e)
{
  union a c;

<bb 2>:
  c.f = *a + e;
  *b = c.i;
  return;

}

Comment 4 Richard Biener 2008-03-14 14:52:56 UTC

Subject: Bug 33989

Author: rguenth
Date: Fri Mar 14 14:52:07 2008
New Revision: 133218

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133218
Log:
2008-03-14  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/34043
	PR tree-optimization/33989
	* tree-ssa-pre.c (execute_pre): Allow SCCVN to do insertion
	when doing FRE.
	(bitmap_find_leader): Use extra argument to verify dominance
	relationship inside a basic-block.
	(can_PRE_operation): Add VIEW_CONVERT_EXPR.
	(find_leader_in_sets): Adjust.
	(create_component_ref_by_pieces): Take extra argument for
	dominance check, handle lookup failures.
	(find_or_generate_expression): Likewise.
	(create_expression_by_pieces): Likewise.
	(insert_into_preds_of_block): Adjust.
	(create_value_expr_from): If asked for, verify all operands
	are in the blocks AVAIL_OUT set.
	(make_values_for_stmt): Check for SSA_NAMEs that are life
	over an abnormal edge.
	(compute_avail): Remove such check.
	(do_SCCVN_insertion): New function.
	(eliminate): If we do not find a leader suitable for replacement
	insert a replacement expression from SCCVN if available.
	* tree-ssa-sccvn.h (run_scc_vn): Update prototype.
	(struct vn_ssa_aux): Add needs_insertion flag.
	* tree-ssa-sccvn.c (may_insert): New global flag.
	(copy_reference_ops_from_ref): Value-number union member access
	based on its size, not type and member if insertion is allowed.
	(visit_reference_op_load): For a weak match from union type
	punning lookup a view-converted value and insert a SSA_NAME
	for that value if that is not found.
	(visit_use): Make dumps shorter.  Do not disallow value numbering
	SSA_NAMEs that are life over an abnormal edge to constants.
	(free_scc_vn): Release inserted SSA_NAMEs.
	(run_scc_vn): New flag to specify whether insertion is allowed.
	Process SSA_NAMEs in forward order.
	* tree-ssa-loop-im.c (for_each_index): Handle invariant
	ADDR_EXPRs inside VIEW_CONVERT_EXPR.
	* fold-const.c (fold_unary): Fold VIEW_CONVERT_EXPRs from/to
	pointer type to/from integral types that do not change the
	precision to regular conversions.

	* gcc.dg/tree-ssa/ssa-fre-7.c: New testcase.
	* gcc.dg/tree-ssa/ssa-fre-8.c: Likewise.
	* gcc.dg/tree-ssa/ssa-fre-9.c: Likewise.
	* gcc.dg/tree-ssa/ssa-fre-10.c: Likewise.
	* gcc.dg/tree-ssa/ssa-pre-17.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-10.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-7.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-8.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-9.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-17.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-im.c
    trunk/gcc/tree-ssa-pre.c
    trunk/gcc/tree-ssa-sccvn.c
    trunk/gcc/tree-ssa-sccvn.h

Comment 5 Richard Biener 2008-03-14 16:16:09 UTC

Fixed.

Comment 6 Andrew Pinski 2008-03-14 17:34:26 UTC

>   *b = VIEW_CONVERT_EXPR<int>(*a + e);

I don't think this is fully fixed for PPC, I will check later today to be sure.

-- Pinski

Comment 7 Andrew Pinski 2008-03-14 19:12:36 UTC

PPC still gets bad code:
_f:
        lfs f0,0(r3)
        fadds f0,f1,f0
        stfs f0,-8(r1)
        lwz r0,-8(r1)
        stw r0,0(r4)
        blr

Comment 8 Andrew Pinski 2008-03-14 19:14:43 UTC

We do get:
  *b = VIEW_CONVERT_EXPR<int>(*a + e);

Now, if produced:
  VIEW_CONVERT_EXPR<float>(*b) = *a + e;

We might produce better code as SImode is not able to held in FPRs.

Comment 9 Richard Biener 2008-03-14 19:37:59 UTC

Can we fix this at expansion time?  Thus, move the VIEW_CONVERT_EXPR from the
rhs to the lhs?  This might be a target dependent optimization.

Comment 10 Andrew Pinski 2008-12-29 04:18:45 UTC

(In reply to comment #9)
> Can we fix this at expansion time?  Thus, move the VIEW_CONVERT_EXPR from the
> rhs to the lhs?  This might be a target dependent optimization.

Yes but then we have to fix fwprop to forward prop SUBREG.  I have patches for all of this but I am still getting mixed results due to scheduling differences with the pre ra schedule.

Comment 11 Andrew Pinski 2010-03-08 17:16:54 UTC

No longer working on this one, I lost the patches when I left sony.

Comment 12 Scott Mansell 2010-11-11 14:38:37 UTC

I'm still getting this same bug with PPC and gcc 4.5.1

Comment 13 Richard Biener 2010-11-11 14:42:39 UTC

With 4.6 it should be fixed finally thanks to mem-ref and value-numbering
improvements.

;; Function f (f)

f (float * a, int * b, float e)
{
  int D.2694;
  float D.2690;
  float D.2689;

<bb 2>:
  D.2689_2 = *a_1(D);
  D.2690_4 = D.2689_2 + e_3(D);
  D.2694_10 = VIEW_CONVERT_EXPR<int>(D.2690_4);
  *b_6(D) = D.2694_10;
  return;

}

f:
.LFB0:
        .cfi_startproc
        addss   (%rdi), %xmm0
        movd    %xmm0, (%rsi)
        ret
        .cfi_endproc

maybe you can verify on ppc.

Comment 14 Scott Mansell 2010-11-12 03:15:21 UTC

I downloaded and compiled the 2010-11-6 snapshot of gcc 4.6.
I'm still getting the extra load/store in ppc with -O3.

.L.f:
	lfs 0,0(3)
	fadds 0,1,0
	stfs 0,-16(1)
	lwz 0,-16(1)
	stw 0,0(4)
	blr

Comment 15 Scott Mansell 2010-11-12 04:04:25 UTC

Weirdly, it works fine with doubles.

Testcase:
union a
{
  long int i;
  double f;
};

void d(double *a, long int *b, double e)
{
  union a c;
  c.f = *a + e;
  *b = c.i;
}

Results in:
.L.d:
	lfd 0,0(3)
	fadd 0,1,0
	stfd 0,0(4)
	blr

One thing that I noticed while looking at the assembly of my program, it that is was allocated 8 bytes on the stack for each stfs instruction. Does gcc think that stfs writes a 64 bit value which is preventing it writing directly into the space of an int? 

I've checked the IBM docs, stfs does write a 32 bit value.

Comment 16 Andrew Pinski 2020-01-22 11:05:53 UTC

(In reply to Scott Mansell from comment #15)
> Weirdly, it works fine with doubles.

At least at the time I wrote up this bug report, can_change_mode target hook on rs6000 would reject SImode and SFmode IIRC which forces a reload (spill when moving between those two).  Now I have not looked at the rs6000 back-end for almost 10 years so I cannot say if this still applies or not.

Moving to a target bug.

Comment 17 seurer 2020-01-30 15:04:16 UTC

The reported problem was that with -O3 you get an extra load/store:

.L.f:
lfs 0,0(3)
fadds 0,1,0
stfs 0,-16(1)
lwz 0,-16(1)
stw 0,0(4)
blr

I tried the test code using both gcc 9 and 10 (on both powerpc64 BE and LE just in case) and I see this:

f:
.LFB0:
.cfi_startproc
lfs 0,0(3)
fadds 1,1,0
stfs 1,0(4)
blr

No extra load/store.

FYI: I looked at gcc 5 (the distro version on one of older servers) and it generated way different and probably equally not so good code. I tried on another older system that has gcc 4.8.5 and it did indeed generate the extra load/store as reported.

Comment 18 Andrew Pinski 2020-01-30 19:10:41 UTC