This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: ]PATCH][RFC] Initial patch for better performance of 64-bit math instructions in 32-bit mode on x86-64
- From: Ilya Enkovich <enkovich dot gnu at gmail dot com>
- To: Richard Biener <richard dot guenther at gmail dot com>
- Cc: Uros Bizjak <ubizjak at gmail dot com>, Yuri Rumyantsev <ysrumyan at gmail dot com>, "H.J. Lu" <hjl dot tools at gmail dot com>, gcc-patches <gcc-patches at gcc dot gnu dot org>, Jeff Law <law at redhat dot com>
- Date: Wed, 1 Jun 2016 13:40:20 +0300
- Subject: Re: ]PATCH][RFC] Initial patch for better performance of 64-bit math instructions in 32-bit mode on x86-64
- Authentication-results: sourceware.org; auth=none
- References: <CAEoMCqRUy94A2w4W-0CfECc-oQOuNb9O6sjsURkDt_9g=08exw at mail dot gmail dot com> <CAFULd4ZBpjF1=osHV1+fD90vu07EH9Qadv-Yo=jsQbpcse+cDA at mail dot gmail dot com> <CAMbmDYaxqzMA3e8u3Vt9zrX0d4Pn7-t5f2kwOPzQd9oj61FrRg at mail dot gmail dot com> <CAFiYyc0ivrFSbtff9fYy8ffuY2qXpuvzHVwP+VG41ukC6vKpPA at mail dot gmail dot com>
2016-06-01 13:06 GMT+03:00 Richard Biener <richard.guenther@gmail.com>:
> On Wed, Jun 1, 2016 at 11:57 AM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> 2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubizjak@gmail.com>:
>>> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>>>> Hi Uros,
>>>>
>>>> Here is initial patch to improve performance of 64-bit integer
>>>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>>>> behind icc and clang on rsa benchmark from eembc2.0 suite.
>>>> Te problem function looks like
>>>> typedef unsigned long long ull;
>>>> typedef unsigned long ul;
>>>> ul mul_add(ul *rp, ul *ap, int num, ul w)
>>>> {
>>>> ul c1=0;
>>>> ull t;
>>>> for (;;)
>>>> {
>>>> { t=(ull)w * ap[0] + rp[0] + c1;
>>>> rp[0]= ((ul)t)&0xffffffffL; c1= ((ul)((t)>>32))&(0xffffffffL); };
>>>> if (--num == 0) break;
>>>> { t=(ull)w * ap[1] + rp[1] + c1;
>>>> rp[1]= ((ul)(t))&(0xffffffffL); c1= (((ul)((t)>>32))&(0xffffffffL)); };
>>>> if (--num == 0) break;
>>>> { t=(ull)w * ap[2] + rp[2] + c1;
>>>> rp[2]= (((ul)(t))&(0xffffffffL)); c1= (((ul)((t)>>32))&(0xffffffffL)); };
>>>> if (--num == 0) break;
>>>> { t=(ull)w * ap[3] + rp[3] + c1;
>>>> rp[3]= (((ul)(t))&(0xffffffffL)); c1= (((ul)((t)>>32))&(0xffffffffL)); };
>>>> if (--num == 0) break;
>>>> ap+=4;
>>>> rp+=4;
>>>> }
>>>> return(c1);
>>>> }
>>>>
>>>> If we apply patch below we will get +6% speed-up for rsa on Silvermont.
>>>>
>>>> The patch looks loke (not complete since there are other 64-bit
>>>> instructions e.g. subtraction):
>>>>
>>>> Index: i386.md
>>>> ===================================================================
>>>> --- i386.md (revision 236181)
>>>> +++ i386.md (working copy)
>>>> @@ -5439,7 +5439,7 @@
>>>> (clobber (reg:CC FLAGS_REG))]
>>>> "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>>>> "#"
>>>> - "reload_completed"
>>>> + "1"
>>>> [(parallel [(set (reg:CCC FLAGS_REG)
>>>> (compare:CCC
>>>> (plus:DWIH (match_dup 1) (match_dup 2))
>>>>
>>>> What is your opinion?
>>>
>>> This splitter doesn't depend on hard registers, so there is no
>>> technical obstacle for the proposed patch. OTOH, this is a very old
>>> splitter, it is possible that it was introduced to handle some of
>>> reload deficiencies. Maybe Jeff knows something about this approach.
>>> We have LRA now, so perhaps we have to rethink the purpose of these
>>> DImode splitters.
>>
>> The change doesn't spoil splitter for hard register case and therefore
>> splitter still should be able to handle any reload deficiencies. I think
>> we should try to split all instructions working on multiword registers
>> (not only PLUS case) at earlier passes to allow more optimizations on
>> splitted code and relax registers allocation (now we need to allocate
>> consequent registers). Probably make a separate split right after STV?
>> This should help with PR70321.
>
> There are already pass_lower_subreg{,2}, not sure if x86 uses it for splitting
> DImode ops though.
Looking at pass description I see it works when "all the uses of a multi-word
register are via SUBREG, or are copies of the register to another location".
It doesn't cover cases when we operate with whole multi-word registers.
Thanks,
Ilya
>
> Richard.
>
>> Thanks,
>> Ilya
>>
>>>
>>> A pragmatic approach would be - if the patch shows measurable benefit,
>>> and doesn't introduce regressions, then Stage 1 is the time to try it.
>>>
>>> BTW: Use "&& 1" in the split condition of the combined insn_and_split
>>> pattern to copy the enable condition from the insn part. If there is
>>> no condition, you should just use "".
>>>
>>> Uros.