This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [PATCH][RFC] Initial patch for better performance of 64-bit math instructions in 32-bit mode on x86-64

From: Ilya Enkovich <enkovich dot gnu at gmail dot com>
To: Richard Biener <richard dot guenther at gmail dot com>
Cc: Uros Bizjak <ubizjak at gmail dot com>, Yuri Rumyantsev <ysrumyan at gmail dot com>, "H.J. Lu" <hjl dot tools at gmail dot com>, gcc-patches <gcc-patches at gcc dot gnu dot org>, Jeff Law <law at redhat dot com>
Date: Wed, 1 Jun 2016 13:40:20 +0300
Subject: Re: [PATCH][RFC] Initial patch for better performance of 64-bit math instructions in 32-bit mode on x86-64

2016-06-01 13:06 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Wed, Jun 1, 2016 at 11:57 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubizjak@gmail.com>:
>>> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>> Hi Uros,
>>>>
>>>> Here is an initial patch to improve the performance of 64-bit integer
>>>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>>>> behind icc and clang on the rsa benchmark from the eembc2.0 suite.
>>>> The problem function looks like this:
>>>> typedef unsigned long long ull;
>>>> typedef unsigned long ul;
>>>> ul mul_add(ul *rp, ul *ap, int num, ul w)
>>>>  {
>>>>  ul c1=0;
>>>>  ull t;
>>>>  for (;;)
>>>>   {
>>>>   { t=(ull)w * ap[0] + rp[0] + c1;
>>>>    rp[0]= ((ul)t)&0xffffffffL; c1= ((ul)((t)>>32))&(0xffffffffL); };
>>>>   if (--num == 0) break;
>>>>   { t=(ull)w * ap[1] + rp[1] + c1;
>>>>    rp[1]= ((ul)(t))&(0xffffffffL); c1= (((ul)((t)>>32))&(0xffffffffL)); };
>>>>   if (--num == 0) break;
>>>>   { t=(ull)w * ap[2] + rp[2] + c1;
>>>>    rp[2]= (((ul)(t))&(0xffffffffL)); c1= (((ul)((t)>>32))&(0xffffffffL)); };
>>>>   if (--num == 0) break;
>>>>   { t=(ull)w * ap[3] + rp[3] + c1;
>>>>    rp[3]= (((ul)(t))&(0xffffffffL)); c1= (((ul)((t)>>32))&(0xffffffffL)); };
>>>>   if (--num == 0) break;
>>>>   ap+=4;
>>>>   rp+=4;
>>>>   }
>>>>  return(c1);
>>>>  }
>>>>
>>>> If we apply the patch below, we get a 6% speed-up for rsa on Silvermont.
>>>>
>>>> The patch looks like this (it is not complete, since there are other
>>>> 64-bit instructions, e.g. subtraction):
>>>>
>>>> Index: i386.md
>>>> ===================================================================
>>>> --- i386.md     (revision 236181)
>>>> +++ i386.md     (working copy)
>>>> @@ -5439,7 +5439,7 @@
>>>>     (clobber (reg:CC FLAGS_REG))]
>>>>    "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>>>    "#"
>>>> -  "reload_completed"
>>>> +  "1"
>>>>    [(parallel [(set (reg:CCC FLAGS_REG)
>>>>                    (compare:CCC
>>>>                      (plus:DWIH (match_dup 1) (match_dup 2))
>>>>
>>>> What is your opinion?
>>>
>>> This splitter doesn't depend on hard registers, so there is no
>>> technical obstacle for the proposed patch. OTOH, this is a very old
>>> splitter; it is possible that it was introduced to handle some of
>>> reload's deficiencies. Maybe Jeff knows something about this approach.
>>> We have LRA now, so perhaps we have to rethink the purpose of these
>>> DImode splitters.
>>
>> The change doesn't break the splitter for the hard-register case, so the
>> splitter should still be able to handle any reload deficiencies.  I think
>> we should try to split all instructions working on multiword registers
>> (not only the PLUS case) in earlier passes, to allow more optimizations on
>> the split code and to relax register allocation (currently we need to
>> allocate consecutive registers).  Perhaps do a separate split right after
>> STV?  This should help with PR70321.
>
> There are already pass_lower_subreg{,2}, not sure if x86 uses it for splitting
> DImode ops though.

Looking at the pass description, I see it works when "all the uses of a
multi-word register are via SUBREG, or are copies of the register to another
location".  It doesn't cover cases where we operate on whole multi-word
registers.
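
As a rough illustration of the whole-register case (an editorial sketch, not
code from the patch or from GCC): splitting a double-word PLUS means rewriting
one 64-bit addition as a low-word add followed by a high-word add that consumes
the carry, which is what an add/adc pair does on x86.  The `u64_pair` type and
the `split`, `join` and `add_split` helpers below are hypothetical names
introduced only for this example.

```c
#include <stdint.h>

/* Hypothetical representation: a 64-bit value held as two 32-bit words. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Break a 64-bit value into its low and high 32-bit words. */
static u64_pair split(uint64_t v)
{
    u64_pair p;
    p.lo = (uint32_t)v;
    p.hi = (uint32_t)(v >> 32);
    return p;
}

/* Reassemble a 64-bit value from its two words. */
static uint64_t join(u64_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* 64-bit addition expressed only with 32-bit operations. */
static u64_pair add_split(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;              /* low-word add, wraps modulo 2^32 */
    uint32_t carry = r.lo < a.lo;    /* 1 if the low-word add overflowed */
    r.hi = a.hi + b.hi + carry;      /* high-word add plus carry */
    return r;
}
```

For example, `join(add_split(split(x), split(y)))` should equal `x + y` for any
64-bit `x` and `y`.  Once the addition is written on the two halves, each half
is an ordinary word-sized value, so in principle the halves no longer have to
be handled as one unit.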

Thanks,
Ilya

>
> Richard.
>
>> Thanks,
>> Ilya
>>
>>>
>>> A pragmatic approach would be - if the patch shows measurable benefit,
>>> and doesn't introduce regressions, then Stage 1 is the time to try it.
>>>
>>> BTW: Use "&& 1" in the split condition of the combined insn_and_split
>>> pattern to copy the enable condition from the insn part. If there is
>>> no condition, you should just use "".
>>>
>>> Uros.
