[Fwd: Re: Parallelized loads and widening mults cont:ed (was: Re: GCC porting tutorials)]

Mon May 3 16:53:00 GMT 2010

>> Date: Thu, 29 Apr 2010 08:55:56 +0200 (CEST)
>> From: "Jonas Paulsson" <d03jp@student.lth.se>
>
>> It feels good to know that the widening mults issue has been
>> resolved
>
> Yes, nice, and as late as last week too, though the patch was
> from February.
>
>> as it was a bit of a disappointment when I noted the erratic
>> behaviour with GCC 4.4.1. Perhaps you would care to comment on what
>> to expect as a user now, then?
>
> IIUC, it should Just Work.  No, I haven't checked.  Note that
> the fix was somewhat along the lines of what you wrote in your
> thesis IIUC; adding a specific pass to fix up separated
> operations.  See
> <http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29274> and
> <http://gcc.gnu.org/ml/gcc-patches/2010-02/msg00643.html>.  BTW,
> my observation was from the 4.3 era.  It's a regression, which
> explains why I hadn't noticed it with the 3.x version I used
> before that.  A pity it was deemed too invasive to fix for 4.5.

>
>> Another issue that gave me porting problems was the SIMD memory
>> accesses, e.g. doing a wide load into two adjacent narrow registers
>> with one instruction. It was concluded earlier on the mailing list
>> that this cannot be handled in RTL, so I wonder now if anything has
>> been done for this, as it too seems rather reasonable, just like the
>> widening mults?
>
> You wanted to load adjacent data in a wider mode that was then
> to be separately used in a mode half that size, but the
> registers had to be adjacent too?  That's kind of the opposite
> problem to what's usually needed!  If the use of the data was
> actually for the obvious wider mode (SI or V2HI), you'd just
> have to define the movsi or movv2hi pattern and it would be
> used, but that unfortunately seems not applicable in any way.
> I'm not sure that problem is of common interest I'm afraid, but
> if it can be resolved with a target-specific pass, there'd be
> reason to add a hook somewhat like
> TARGET_MACHINE_DEPENDENT_REORG, but earlier.
>
> But, did you check whether combine tried to match RTL that
> looked somewhat like:
>
> (parallel
>  [(set (reg:HI 1) (mem:HI (plus:SI (reg:SI 3) (const_int 2))))
>   (set (reg:HI 2) (mem:HI (plus:SI (reg:SI 3) (const_int 4))))])
>
> I.e. a parallel with the two loads where the addresses were
> adjacent?  From gdb you can inspect the calls to try_combine
> (IIRC).  That insn could have been matched to a pattern like:
>
> (define_insn "*load_wide"
>  [(set (match_operand:HI 0 "register_operand" "=d0,d1,d2")
>        (match_operand:HI 1 "reg_plus_const_memory_operand" "m"))
>   (set (match_operand:HI 2 "register_operand" "=d1,d2,d3")
>        (match_operand:HI 3 "reg_plus_const_memory_operand" "m"))]
>  "rtx_equal_p (XEXP (operands[3], 0),
>                plus_constant (XEXP (operands[1], 0), 2))"
>  "load_wide %0,%1")
>

Yes, of course I checked with combine, but the debug dump of the pass
revealed that it does not look for a wider load in the case of
adjacent loads, unfortunately. I checked this again now by setting a
breakpoint in try_combine, but there is no attempt to use the wider
load insn. I then handled this combination with an added pass that
located and replaced load/store instructions. Along with successful
uses of post-inc insns, this was an important optimization for the
project. It does not make sense to me not to do even such a simple
thing as looking for offset 1 in the local block... (Of course, I did
just the simple thing of checking for CODE_FOR... in terms of locating
load insns.)

I think I tried to add an insn pattern whose RTL template was a
parallel, but combine did not care about it.
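
To make the idea of the added pass concrete, here is a rough sketch in
plain C. It is only a toy model and not the actual pass: the structs
"load" and "wide_load" and the function pair_adjacent_loads are
hypothetical stand-ins for RTL insns, it only looks at directly
consecutive loads, and it assumes that the lower address goes to the
lower-numbered register of an adjacent pair (the real target rules may
well differ).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a narrow (16-bit) load: dest <- mem[base + offset].
   Real GCC insns are RTL; these structs exist only for illustration. */
struct load {
    int dest;    /* destination register number */
    int base;    /* base address register number */
    int offset;  /* byte offset from the base register */
    int merged;  /* set to 1 once folded into a wide load */
};

/* Hypothetical wide load filling two adjacent narrow registers. */
struct wide_load {
    int dest_lo; /* register receiving the lower-address half */
    int dest_hi; /* register receiving the higher-address half */
    int base;
    int offset;  /* offset of the lower-address half */
};

#define NARROW_SIZE 2 /* assumed size in bytes of one narrow load */

/* Scan consecutive loads and pair those that read adjacent memory into
   adjacent registers, in either order.  OUT must have room for n / 2
   entries.  Returns the number of wide loads produced. */
size_t pair_adjacent_loads(struct load *insns, size_t n, struct wide_load *out)
{
    size_t found = 0;

    assert(insns != NULL && out != NULL);

    for (size_t i = 0; i + 1 < n; i++) {
        struct load *a = &insns[i];
        struct load *b = &insns[i + 1];

        if (a->merged || b->merged)
            continue;
        if (a->base != b->base)
            continue;
        /* If the first load overwrites the base register, the second
           load uses a different address, so merging would be wrong. */
        if (a->dest == a->base)
            continue;

        /* Order the two loads by address so both orders are handled. */
        struct load *lo = (a->offset < b->offset) ? a : b;
        struct load *hi = (lo == a) ? b : a;

        if (hi->offset - lo->offset != NARROW_SIZE)
            continue;
        /* Assumption: a register pair is (r, r + 1). */
        if (hi->dest != lo->dest + 1)
            continue;

        out[found].dest_lo = lo->dest;
        out[found].dest_hi = hi->dest;
        out[found].base = lo->base;
        out[found].offset = lo->offset;
        a->merged = 1;
        b->merged = 1;
        found++;
    }
    return found;
}
```

In this sketch the destination registers must already be adjacent,
which is exactly the part that register allocation would have to be
coerced into; a pass run before allocation would have to drop that
check and tie the two pseudos together instead.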

> Just a WAG, there are reasons this would not match in the
> general case (for one, you'd want to try to match the opposite
> order too).  Don't pay too much attention to the exact matching
> predicates, constraints and condition above.  The point is just
> whether combine tried to generate and match a parallel with two
> valid loads, given source where there was obvious opportunity
> for it.
>
> That insn *could* then be caught with a pattern which would,
> through the right constraints, coerce register allocation to make
> the right choices for the (initially separate) registers.  In
> the example above, four registers are assumed to be valid as
> destinations, with the matching singleton constraints d0..d3.
>
I guess I wonder here how much of a tweak it is to use GCC in this
fashion, pairing 16-bit regs into 32-bit regs. There does not seem to be
complete support for it, although it works on a basic level.

> brgds, H-P
>