Bug 53949 - [SH] Add support for mac.w / mac.l instructions
Summary: [SH] Add support for mac.w / mac.l instructions
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.8.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-13 09:01 UTC by Oleg Endo
Modified: 2016-04-30 14:01 UTC (History)
2 users (show)

See Also:
Host:
Target: sh*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-01-30 00:00:00


Attachments
Proof of concept patch (1.64 KB, patch)
2012-07-15 12:11 UTC, Oleg Endo
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Oleg Endo 2012-07-13 09:01:08 UTC
So far, GCC does not utilize the integer multiply-add instructions.
On SH1 only the mac.w instruction is supported.
On SH2 and above the mac.w and mac.l instructions are available.

Carry over from PR 39423 comment #20

> On a related thread, for further work, I'm thinking on adding support for the
> MAC instruction, now that was have the multiply and add. But this requires
> exposing the MACLH registers to reload. Had anyone had a thought on this ? I'd
> like to give this a try pretty soon.
Comment 1 Oleg Endo 2012-07-13 10:34:20 UTC
(In reply to comment #0)
> So far, GCC does not utilize the integer multiply-add instructions.
> On SH1 only the mac.w instruction is supported.
> On SH2 and above the mac.w and mac.l instructions are available.
> 
> Carry over from PR 39423 comment #20
> 
> > On a related thread, for further work, I'm thinking on adding support for the
> > MAC instruction, now that was have the multiply and add. But this requires
> > exposing the MACLH registers to reload. Had anyone had a thought on this ? I'd
> > like to give this a try pretty soon.

I think the biggest problem is that the mac operands have to be in memory.
For example:

long long fun (int a, int long b, long long c)
{
  return (long long)a * (long long)b + c;
}

would need to become something like ... 

mov.l  r4,@-r15
mov    r15,r1
mov.l  r5,@-r15
lds      r6,mach
lds      r7,macl
mac.l  @r15+,@r1+
sts      mach,r1
sts      macl,r0
rts
add     #4,r15

not using the mac instruction seems a bit simpler in this case:

dmuls.l  r4,r5
sts      mach,r1
clrt
sts      macl,r0
addc   r6,r0
rts
addc   r7,r1

I think the mac instructions can be very useful when they can be used inside of
loops,  but for this the whole post-inc memory stuff has to integrate properly into
the surrounding code.

Chris, do you have any ideas/plans on how to handle the SR.S bit, for example to implement
the ssmaddhisi4 pattern with mac.w?
Comment 2 chrbr 2012-07-13 11:00:55 UTC
I see the MAC only as a global optimization, since its interest is to spawns across several loop BBs as you said. Their is also problem on clear the accumulator. 

That should certainly be new extension in the gimple SSA loop optimizers, based on the presence on a multiply and and pattern. Not sure what is the best way to do this as this point.
Comment 3 Oleg Endo 2012-07-15 12:11:20 UTC
Created attachment 27799 [details]
Proof of concept patch

This is a proof of concept patch just to probe around.
The idea is to allow the RA to allocate macl and mach registers in DImode, and have mac insns that use the macl/mach regs as a pair in DImode.

With the patch applied, the following function ...

int64_t test01 (const int16_t* a, const int16_t* b)
{
  int64_t sum = 0;
  for (int i = 0; i < 16; ++i)
    sum += (int64_t)(*a++) * (int64_t)(*b++);
  return sum;
}

compiled with -m4 -O2 results in ...

__Z6test01PKsS0_:
.LFB0:
        .cfi_startproc
        mov	#16,r1          ! 88	movsi_ie/3	[length = 2]
        clrmac                  ! 39	clrmac/1	[length = 2]
        .align 2
.L3:
        dt	r1              ! 89	dect	[length = 2]
        bf/s	.L3             ! 90	branch_false	[length = 2]
        mac.w	@r4+,@r5+       ! 61	*macw	[length = 2]
        sts	macl,r0         ! 82	movsi_ie/8	[length = 2]
        rts                     ! 99	*return_i	[length = 2]
        sts	mach,r1         ! 83	movsi_ie/8	[length = 2]

... which is not that bad already.


Some notes I took while playing around with this:


- When compiling for big endian the RA mistakes mach and macl when
  storing mach:macl to a DImode reg:reg pair.
  This could probably fixed by providing appropriate move insns patterns.


- Move insns/splits for DImode mach:macl <-> memory have to be added.
  I've seen an ICE when compiling with -O1:
  error: unrecognizable insn:
  (insn 122 14 15 2 (set (mem/c:DI (plus:SI (reg/f:SI 15 r15)
                (const_int 8 [0x8])) [0 %sfp+-8 S8 A32])
        (reg:DI 148 macl)) sh_mac.cpp:38 -1
     (nil))


- In some cases the mach:macl reg pair gets swapped to a general reg pair
  without any obvious need.  Example function:

  int64_t test04 (const int16_t* a, const int16_t* b,
                  const int16_t* c, const int16_t* d)
  {
    int64_t sum0 = 0;
    int64_t sum1 = 0;
    for (int i = 0; i < 16; ++i)
      sum0 += (int64_t)(*a++) * (int64_t)(*b++);

    for (int i = 0; i < 16; ++i)
      sum1 += (int64_t)(*c++) * (int64_t)(*d++);

    return sum0 - sum1;
  }

  The IRA pass first allocates sum0 and sum1 to mach:macl, but then reload
  seems to think that they are conflicting and moves sum0 to a general regs
  pair.  This results in ...

        mov     #0,r2
        mov     #16,r1
        mov     r2,r3
  .L16:
        lds     r2,macl
        lds     r3,mach
        dt      r1
        mac.w   @r4+,@r5+
        sts     macl,r2
        bf/s    .L16
        sts     mach,r3

        mov     #16,r1
        clrmac
        .align 2
  .L18:
        dt      r1
        bf/s    .L18
        mac.w   @r6+,@r7+


  which would be better as:
        mov     #16,r1
        clrmac
  .L16:
        dt      r1
        bf/s    .L16
        mac.w   @r4+,@r5+

        sts     macl,r2
        sts     mach,r3
        clrmac
        mov     #16,r1
  .L18:
        dt      r1
        bf/s    .L18
        mac.w   @r6+,@r7+


- Loops with multiple running sums like
  for (int i = 0; i < 16; ++i)
  {
    sum0 += (int64_t)(*a++) * (int64_t)(*b++);
    sum1 += (int64_t)(*c++) * (int64_t)(*d++);
  }

  result in macl:mach swapping to general reg pairs between subsequent
  mac.w instructions.  Ideally such loops should be split into multiple
  loops like in the previous example.


- When loop unrolling is turned on the auto-inc addresses refs are
  converted to displacement addresses.  Because the auto-inc-dec pass
  currently fails to detect a lot of auto-inc-dec possibilities the 
  mac.w pattern will not match.
  The same goes for manually unrolled code like 
  
  sum += (int64_t)(*a++) * (int64_t)(*b++);
  sum += (int64_t)(*a++) * (int64_t)(*b++);


- Running sum variables should be turned into DImode variables if possible:
  int32_t test00 (const int16_t* a, const int16_t* b)
  {
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i)
      sum += (*a++) * (*b++);
    return sum;
  }


- The existing multiplication patterns could be adopted to utilize macl:mach
  reg pair allocation, especially 32x32 -> 64 bit multiplications.


- Normal multiplications that do not need a full MAC operation but use
  memory operands can be done with a clrmac-mac sequence.


Probably there are more subtle issues.  Also, I have not tried expanding
the standard name 'maddmn4' pattern,  maybe it would make some of the
problems mentioned above automagically disappear.
Comment 4 Oleg Endo 2012-07-17 19:12:49 UTC
To be able to use the mac.w instruction for implementing the ssmaddhisi4 pattern I think the already existing mode-switching facilities can be used, as it is done for float/double mode switching.

Even without doing SR.S mode switches, before any mac instructions can be used, we must make sure that the SR.S bit is in a defined state at function entry and function leave.

Kaz, do you think it is safe to assume that SR.S = 0 at function entry?
Comment 5 Kazumoto Kojima 2012-07-17 23:04:54 UTC
(In reply to comment #4)
> Kaz, do you think it is safe to assume that SR.S = 0 at function entry?

I think so.  I can't imagine a practical system with setting
SR.S to one in its start-up code, though I'm wrong about it.
Comment 6 Oleg Endo 2012-07-22 16:47:44 UTC
If I understand correctly PR 29961 is somewhat related to this.
Comment 7 Oleg Endo 2012-10-11 20:43:02 UTC
A note regarding the SR.S bit.  The insns sets and clrs are available only on SH3* and SH4*.  SH1* and SH2* (incl SH2A) do not implement them.
Comment 8 Oleg Endo 2012-11-07 21:31:39 UTC
Christian, I just wanted to check with you whether you've already started doing something regarding the mac.w / mac.l instructions?
Comment 9 Oleg Endo 2013-05-04 13:39:10 UTC
(In reply to comment #3)
> - Loops with multiple running sums like
>   for (int i = 0; i < 16; ++i)
>   {
>     sum0 += (int64_t)(*a++) * (int64_t)(*b++);
>     sum1 += (int64_t)(*c++) * (int64_t)(*d++);
>   }
> 
>   result in macl:mach swapping to general reg pairs between subsequent
>   mac.w instructions.  Ideally such loops should be split into multiple
>   loops like in the previous example.

This is basically what -ftree-loop-distribution does.  The question would be how to re-use it for this particular case.
Comment 10 Oleg Endo 2013-12-17 12:25:51 UTC
I was wondering whether it would make sense to convert sequences such as

                             SH4       SH4A
        mov.l   @r15,r3      LS/2      LS/2  
        mul.l   r2,r3        CO/4      EX/3
        sts     macl,r3      CO/3      LS/2
        add     r1,r3        EX/1      EX/1

into
        mov     r15,r0       MT/0      MT/1
        mov.l   r2,@-r15     LS/1      LS/1
        lds     r1,macl      CO/3      LS/1
        mac.l   @r15+,@r0+   CO/4      CO/5
        sts     macl,r3      CO/3      LS/2

Looking simply at the issue cycles (the numbers above) would suggest that it's not worth doing it, at least not if the value has to be pulled out from the mac register immediately after the mac operation.  Probably it's not beneficial to emit a single mac insn if the data is not already in place so that it can be reached easily with the post-inc addressing.

On the other hand something like ...

int test33 (int* x, int y, int z)
{
  return x[0] * 40 + z;
}

currently compiles to:
        mov.l   @r4,r2
        mov     #40,r1
        mul.l   r1,r2
        sts     macl,r0
        rts
        add     r6,r0

where this one maybe could be better:
        mova    .L40,r0
        lds     r6,macl
        mac.l   @r4+,r0+
        rts
        sts     macl,r0

        .align 2
.L40:   .long   40
Comment 11 Oleg Endo 2013-12-17 12:37:30 UTC
Another question is whether the following is OK to do on all SH implementations:

int test33 (int x, int y, int z)
{
  return x * y + z;
}

currently compiles:
        mul.l   r5,r4
        sts     macl,r0
        rts
        add     r6,r0

could also be done as:
        lds     r6,macl
        mov.l   r4,@-r15
        mov.l   r5,@-r15
        mac.l   @r15+,@r15+
        rts
        sts     macl,r0

This assumes that a mac insn with both address operands being the same works exactly as it's described in the Renesas manuals:

 tempn = Read_32 (R[n]);
  R[n] += 4;
  tempm = Read_32 (R[m]);
  R[m] += 4;

However, I don't know whether this is true for all SH implementations.
Comment 12 Oleg Endo 2015-01-30 19:36:46 UTC
This is an example from Renesas public material regarding the SH-DSP.  The example program can utilize either the regular mac.w instruction, or the DSP ISA (found on SH2-DSP, SH3-DSP, SH4AL-DSP).

typedef short  Fixed;
typedef long Lfixed;
typedef long Laccum;
#define __X
#define __Y

void func(__X Fixed x[5],Fixed *out, __Y Fixed y[128][5] )
{
  int i;
  __Y Fixed *yp=y[0];
  __X Fixed *xp=x;
  Fixed x0,y0;
  Lfixed m0;
  Laccum a0;
  for(i=0;i<128;i++)
  {
    a0=0;
    x0=*xp++; y0=*yp++;
    m0=x0*y0; x0=*xp++; y0=*yp++;
    a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
    a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
    a0+=m0; m0=x0*y0; x0=*xp++; y0=*yp++;
    a0+=m0; m0=x0*y0; 
    a0+=m0;
    *out++=a0;
  }
}

which should compile to something like:
_func:
        mov     #-128,r1
        mov     r5,r3
        extu.b  r1,r1
.L11:
        clrmac
        mac.w   @r6+,@r4+
        dt      r1
        mac.w   @r6+,@r4+
        mac.w   @r6+,@r4+
        mac.w   @r6+,@r4+
        sts     macl,r2
        mov.w   r2,@r3
        bf/s    .L11
        add     #2,r3

        rts
        nop
Comment 13 Oleg Endo 2015-02-01 00:37:30 UTC
A more interesting real-world example from libjpeg would be function jpeg_idct_ifast (jidctint.c).

If we take the code as-is, there are few mac opportunities due to sharing of the terms.  The expressions could be un-CSE'd which would result in longer mac chains, but the overall result gets worse because the data layout is not in a mac friendly way.

The first loop in jpeg_idct_ifast can be split into 8 independent loops for the output value wsptr[8*n+i].
For n = 1,2,3,4,5,6 the loops look a bit complex, but for n = 0 and n = 7 we get similar looking loops like:

for (int i = 0; i < 8; ++i)
{
  wsptr[8*7+i] = inptr[8*0 + i] * quantptr[8*0 + i]
                 - inptr[8*1 + i] * quantptr[8*1 + i]
                 + inptr[8*2 + i] * quantptr[8*2 + i]
                 - inptr[8*3 + i] * quantptr[8*3 + i]
                 + inptr[8*4 + i] * quantptr[8*4 + i]
                 - inptr[8*5 + i] * quantptr[8*5 + i]
                 + inptr[8*6 + i] * quantptr[8*6 + i]
                 - inptr[8*7 + i] * quantptr[8*7 + i];
}

Still, due to the subtractions and memory access pattern, plain mac insns can't be used.

The subtractions can be converted into additions by negating the operands.  Since mac wants both operands in memory, those can be placed on the stack.  Also, in this case the address registers can be pre-computed outside the loop, since there are enough registers.
A possible outcome would be something like this:

                                    // r4 = inptr[8*0+i]
                                    // r5 = quantptr[8*0+i]
                                    // r6 = wsptr[8*0+i]

  mov    r4,r3;    add    #32,r3    // r3 = inptr[8*1+i]
  mov    r3,r7;    add    #32,r7    // r7 = inptr[8*2+i]
  mov    r7,r8;    add    #32,r8    // r8 = inptr[8*3+i]
  mov    r8,r9;    add    #32,r9    // r9 = inptr[8*4+i]
  mov    r9,r10;   add    #32,r10   // r10 = inptr[8*5+i]
  mov    r10,r11;  add    #32,r11   // r11 = inptr[8*6+i]
  mov    r11,r12;  add    #32,r12   // r12 = inptr[8*7+i]

  mov    #8,r14
  add    #126,r6;  add    #102,r6   // r6 = wpstr + 8*7*4 + 4
  mov    r4,r0;    sub    r5,r0     // r0 = quantptr - intptr

.Loop:
  mov.l  @(r0,r12),r1   // quantptr[8*7+i]
  mov.l  @(r0,r11),r2	// quantptr[8*6+i]
  mov.l  @(r0,r10),r13	// quantptr[8*5+i]
  neg    r1,r1
  mov.l  r1,@-r15
  mov.l  r2,@-r15
  neg    r13,r13
  mov.l  @(r0,r8),r1    // quantptr[8*3+i]
  mov.l  @(r0,r9),r2    // quantptr[8*4+i]
  mov.l  r13,@-r15
  neg    r1,r1
  mov.l  r2,@-r15
  mov.l  @(r0,r7),r2    // quantptr[8*2+i]
  mov.l  @(r0,r3),r13   // quantptr[8*1+i]
  mov.l  r1,@-r15
  mov.l  r2,@-r15
  neg    r13,r13
  mov.l  r13,@-r15

  clrmac
  mac.l	@r4+,@r5+
  mac.l	@r3+,@r15+
  mac.l	@r7+,@r15+
  mac.l	@r8+,@r15+
  mac.l	@r9+,@r15+
  mac.l	@r10+,@r15+
  mac.l	@r11+,@r15+
  mac.l	@r12+,@r15+
  dt	r14
  sts	macl,@-r6
  bf/s	.Loop
  add	#8,r6

which is 31 insns per loop and (almost) no pipeline stalls, vs. 53 insns per loop + stalls on mul-sts sequences when the mac insn is not used.
The above loop can be optimized even further with partial unrolling to avoid the latency of the last mac and sts.
Of course it'd be even better, if the application's data was in a mac friendly layout.
Comment 14 Oleg Endo 2015-04-08 20:08:31 UTC
(In reply to Oleg Endo from comment #3)
> 
> - When compiling for big endian the RA mistakes mach and macl when
>   storing mach:macl to a DImode reg:reg pair.
>   This could probably fixed by providing appropriate move insns patterns.

I believe this is PR 29961.
Comment 15 Oleg Endo 2016-04-30 14:01:57 UTC
Maybe the standard name patterns sdot_prod for 16 bit and 32 bit int vectors could be implemented using the mac.w and mac.l insns, if the input vectors are somehow put in memory.