This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH][RFC] Initial patch for better performance of 64-bit math instructions in 32-bit mode on x86-64
- From: Ilya Enkovich <enkovich dot gnu at gmail dot com>
- To: Uros Bizjak <ubizjak at gmail dot com>
- Cc: Yuri Rumyantsev <ysrumyan at gmail dot com>, "H.J. Lu" <hjl dot tools at gmail dot com>, gcc-patches <gcc-patches at gcc dot gnu dot org>, Jeff Law <law at redhat dot com>
- Date: Wed, 1 Jun 2016 12:57:45 +0300
- Subject: Re: [PATCH][RFC] Initial patch for better performance of 64-bit math instructions in 32-bit mode on x86-64
- Authentication-results: sourceware.org; auth=none
- References: <CAEoMCqRUy94A2w4W-0CfECc-oQOuNb9O6sjsURkDt_9g=08exw at mail dot gmail dot com> <CAFULd4ZBpjF1=osHV1+fD90vu07EH9Qadv-Yo=jsQbpcse+cDA at mail dot gmail dot com>
2016-05-31 19:15 GMT+03:00 Uros Bizjak <ubizjak@gmail.com>:
> On Tue, May 31, 2016 at 5:00 PM, Yuri Rumyantsev <ysrumyan@gmail.com> wrote:
>> Hi Uros,
>>
>> Here is initial patch to improve performance of 64-bit integer
>> arithmetic in 32-bit mode. We discovered that gcc is significantly
>> behind icc and clang on rsa benchmark from eembc2.0 suite.
>> The problem function looks like
>> typedef unsigned long long ull;
>> typedef unsigned long ul;
>> ul mul_add (ul *rp, ul *ap, int num, ul w)
>> {
>>   ul c1 = 0;
>>   ull t;
>>   for (;;)
>>     {
>>       t = (ull) w * ap[0] + rp[0] + c1;
>>       rp[0] = (ul) t & 0xffffffffL;
>>       c1 = (ul) (t >> 32) & 0xffffffffL;
>>       if (--num == 0) break;
>>       t = (ull) w * ap[1] + rp[1] + c1;
>>       rp[1] = (ul) t & 0xffffffffL;
>>       c1 = (ul) (t >> 32) & 0xffffffffL;
>>       if (--num == 0) break;
>>       t = (ull) w * ap[2] + rp[2] + c1;
>>       rp[2] = (ul) t & 0xffffffffL;
>>       c1 = (ul) (t >> 32) & 0xffffffffL;
>>       if (--num == 0) break;
>>       t = (ull) w * ap[3] + rp[3] + c1;
>>       rp[3] = (ul) t & 0xffffffffL;
>>       c1 = (ul) (t >> 32) & 0xffffffffL;
>>       if (--num == 0) break;
>>       ap += 4;
>>       rp += 4;
>>     }
>>   return c1;
>> }
>>
>> If we apply the patch below, we get a +6% speed-up for rsa on Silvermont.
>>
>> The patch looks like this (not complete, since there are other 64-bit
>> instructions to handle, e.g. subtraction):
>>
>> Index: i386.md
>> ===================================================================
>> --- i386.md (revision 236181)
>> +++ i386.md (working copy)
>> @@ -5439,7 +5439,7 @@
>> (clobber (reg:CC FLAGS_REG))]
>> "ix86_binary_operator_ok (PLUS, <DWI>mode, operands)"
>> "#"
>> - "reload_completed"
>> + "1"
>> [(parallel [(set (reg:CCC FLAGS_REG)
>> (compare:CCC
>> (plus:DWIH (match_dup 1) (match_dup 2))
>>
>> What is your opinion?
>
> This splitter doesn't depend on hard registers, so there is no
> technical obstacle for the proposed patch. OTOH, this is a very old
> splitter, it is possible that it was introduced to handle some of
> reload deficiencies. Maybe Jeff knows something about this approach.
> We have LRA now, so perhaps we have to rethink the purpose of these
> DImode splitters.
The change doesn't spoil the splitter for the hard register case, so the
splitter should still be able to handle any reload deficiencies. I think
we should try to split all instructions working on multiword registers
(not only the PLUS case) at earlier passes, to allow more optimizations
on the split code and to relax register allocation (currently we need to
allocate consecutive registers). Perhaps add a separate split pass right
after STV? This should help with PR70321.
Thanks,
Ilya
>
> A pragmatic approach would be - if the patch shows measurable benefit,
> and doesn't introduce regressions, then Stage 1 is the time to try it.
>
> BTW: Use "&& 1" in the split condition of the combined insn_and_split
> pattern to copy the enable condition from the insn part. If there is
> no condition, you should just use "".
>
> Uros.