Bug 79173 - add-with-carry and subtract-with-borrow support (x86_64 and others)
Summary: add-with-carry and subtract-with-borrow support (x86_64 and others)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 7.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 80491
Blocks:
Reported: 2017-01-21 00:51 UTC by Vincent Lefèvre
Modified: 2021-10-27 16:24 UTC
CC List: 10 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-05-28 00:00:00


Description Vincent Lefèvre 2017-01-21 00:51:53 UTC
There should be a way to support full add-with-carry and subtract-with-borrow by generating adc / sbb instructions on x86_64 (and similar instructions on other targets).

GCC could add builtins, such as __builtin_addc* and __builtin_subc* (taking two operands and a carry in, and producing the result and a carry out), similar to Clang:
http://clang.llvm.org/docs/LanguageExtensions.html#multiprecision-arithmetic-builtins
as suggested in PR 60206 comment 3.
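
For illustration, a three-limb add with such builtins could look like this (a sketch using the signature from the Clang documentation above; GCC does not provide these yet):

/* Clang's documented signature:
   unsigned long __builtin_addcl (unsigned long x, unsigned long y,
                                  unsigned long carryin,
                                  unsigned long *carryout);  */
void add3 (unsigned long *p, unsigned long u0, unsigned long u1,
           unsigned long u2, unsigned long v0, unsigned long v1,
           unsigned long v2)
{
  unsigned long c0, c1, c2;
  p[0] = __builtin_addcl (u0, v0, 0, &c0);   /* carry out in c0 */
  p[1] = __builtin_addcl (u1, v1, c0, &c1);  /* chain the carry */
  p[2] = __builtin_addcl (u2, v2, c1, &c2);
}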

Detection of special constructs in standard C (or C++, etc.) code would be useful too. Here are some examples from https://gcc.gnu.org/ml/gcc-help/2017-01/msg00067.html for subtraction:

typedef unsigned long T;

void sub1 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  T t1;
  int b0, b1;

  p[0] = u0 - v0;
  b0 = u0 < v0;
  t1 = u1 - v1;
  b1 = u1 < v1;
  p[1] = t1 - b0;
  b1 |= p[1] > t1;
  p[2] = u2 - v2 - b1;
}

void sub2 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  int b0, b1;

  p[0] = u0 - v0;
  b0 = u0 < v0;
  p[1] = u1 - v1 - b0;
  b1 = u1 < v1 || (u1 == v1 && b0 != 0);
  p[2] = u2 - v2 - b1;
}

In the second example, the b1 line could also be replaced by:

  b1 = u1 < v1 + b0 || v1 + b0 < v1;

For the subtractions, optimal code would contain 1 sub and 2 sbb's.
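
For reference, the hoped-for x86_64 output would be roughly this (a hand-written sketch, assuming the SysV calling convention, where v2 arrives on the stack):

sub:
        subq    %r8, %rsi        # u0 - v0, borrow -> CF
        movq    %rsi, (%rdi)
        sbbq    %r9, %rdx        # u1 - v1 - CF
        movq    %rdx, 8(%rdi)
        sbbq    8(%rsp), %rcx    # u2 - v2 - CF
        movq    %rcx, 16(%rdi)
        ret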
Comment 1 Marc Glisse 2017-01-21 08:41:23 UTC
We could start with the simpler:

void f(unsigned*__restrict__ r,unsigned*__restrict__ s,unsigned a,unsigned b,unsigned c, unsigned d){
  *r=a+b;
  *s=c+d+(*r<a);
}
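
The hoped-for code here is just an add followed by an adc, roughly (hand-written sketch, SysV calling convention assumed):

        addl    %ecx, %edx       # a + b, carry -> CF
        movl    %edx, (%rdi)     # *r
        adcl    %r9d, %r8d       # c + d + CF
        movl    %r8d, (%rsi)     # *s
        ret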

After combine, we have:

(insn 34 12 20 2 (set (reg:SI 93 [ _15+4 ])
        (ltu:SI (reg:CCC 17 flags)
            (const_int 0 [0]))) 608 {*setcc_si_1_movzbl}
     (expr_list:REG_DEAD (reg:CCC 17 flags)
        (nil)))

(insn 21 20 22 2 (parallel [
            (set (reg:SI 102)
                (plus:SI (reg:SI 37 r8 [ c ])
                    (reg:SI 38 r9 [ d ])))
            (clobber (reg:CC 17 flags))
        ]) "a.c":3 213 {*addsi_1}
     (expr_list:REG_DEAD (reg:SI 38 r9 [ d ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_DEAD (reg:SI 37 r8 [ c ])
                (nil)))))

(insn 25 24 26 2 (parallel [
            (set (reg:SI 105)
                (plus:SI (reg:SI 102)
                    (reg:SI 93 [ _15+4 ])))
            (clobber (reg:CC 17 flags))
        ]) "a.c":3 213 {*addsi_1}
     (expr_list:REG_DEAD (reg:SI 93 [ _15+4 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_DEAD (reg:SI 102)
                (nil)))))

The combine dump says "Trying 21, 34 -> 25:" but the next line is blank and it moves on to trying something else.

If I use parentheses *s=c+(d+(*r<a)); and disable reassoc, I get:

Trying 23, 24 -> 25:
Successfully matched this instruction:
(parallel [
        (set (reg:SI 105)
            (plus:SI (plus:SI (ltu:SI (reg:CCC 17 flags)
                        (const_int 0 [0]))
                    (reg:SI 37 r8 [ c ]))
                (reg:SI 38 r9 [ d ])))
        (clobber (reg:CC 17 flags))
    ])
Instruction not appropriate for target.

I don't know where that target restriction is coming from, but at least we seem to be getting somewhere.

If I remove c and keep *s=d+(*r<a), I get:

Failed to match this instruction:
(parallel [
        (set (reg:SI 103)
            (plus:SI (ltu:SI (reg:CCC 17 flags)
                    (const_int 0 [0]))
                (reg:SI 38 r9 [ d ])))
        (clobber (reg:CC 17 flags))
    ])
Failed to match this instruction:
(set (reg:SI 103)
    (plus:SI (ltu:SI (reg:CCC 17 flags)
            (const_int 0 [0]))
        (reg:SI 38 r9 [ d ])))

We would probably need a special pattern for this case, virtually adding 0.
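
Something of roughly this shape, perhaps (a hypothetical sketch, not an actual i386.md pattern; the name and constraints are invented for illustration):

;; Hypothetical: match "carry flag plus register" on its own by
;; emitting an adc with an immediate 0.
(define_insn "*addsi3_carry_0"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (ltu:SI (reg:CCC FLAGS_REG) (const_int 0))
                 (match_operand:SI 1 "register_operand" "0")))
   (clobber (reg:CC FLAGS_REG))]
  ""
  "adc{l}\t{$0, %0|%0, 0}"
  [(set_attr "type" "alu")])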
Comment 2 Marc Glisse 2017-01-21 10:36:03 UTC
(In reply to Marc Glisse from comment #1)
> Trying 23, 24 -> 25:
> Successfully matched this instruction:
> (parallel [
>         (set (reg:SI 105)
>             (plus:SI (plus:SI (ltu:SI (reg:CCC 17 flags)
>                         (const_int 0 [0]))
>                     (reg:SI 37 r8 [ c ]))
>                 (reg:SI 38 r9 [ d ])))
>         (clobber (reg:CC 17 flags))
>     ])
> Instruction not appropriate for target.

I didn't notice immediately, but apparently ix86_legitimate_combined_insn was happy to let combine propagate the hard registers r8 and r9 into the simple addition *addsi_1, which has (match_operand:SWI48 2 "x86_64_general_operand" "rme,re,0,le"), but it then refuses them in the addition with carry. My guess is that letting the hard registers into the additions is premature and should be delayed until after combine. If I manually reject those in gdb, we do produce addl+adcl.
Comment 3 Richard Biener 2017-01-23 08:49:09 UTC
Confirmed. The bug should possibly be split (Clang-compatible builtins, target-specific detection of patterns, middle-end detection of patterns). Note that a suitable middle-end representation would not have the address argument but would likely return two values (via the usual _Complex int trick).
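
That is, at the GIMPLE level it would look something like the existing overflow internal functions (an illustrative sketch; .ADDC is a hypothetical name here, not an existing internal function):

  _1 = .ADDC (u0, v0, carry_in);
  sum = REALPART_EXPR <_1>;
  carry_out = IMAGPART_EXPR <_1>;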
Comment 4 Vincent Lefèvre 2017-01-25 14:12:00 UTC
Also, make sure that the optimization is still performed when a variable is a constant or gets replaced by one (with Clang, the optimization is no longer done in such a case).
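
For example (illustrative), the pattern from comment 1 with one addend replaced by a constant should still become add+adc:

void g (unsigned *__restrict__ r, unsigned *__restrict__ s,
        unsigned a, unsigned c, unsigned d)
{
  *r = a + 12345;          /* b replaced by a constant */
  *s = c + d + (*r < a);
}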
Comment 5 Michael_S 2020-11-15 21:21:39 UTC
(In reply to Vincent Lefèvre from comment #0)
> There should be a way to support full add-with-carry and
> subtract-with-borrow by generating adc / sbb instructions on x86_64 (and
> similar instructions on other targets).
> [...]

I agree with regard to "other targets", first of all aarch64, but the x86_64 variant of gcc already provides the requested functionality in the form of the _subborrow_u64() and _addcarry_u64() intrinsic functions.
The problem here is not a lack of functionality, but a very poor implementation (mentioned many times on bugzilla, with minimal effect).
In that regard gcc is more than a decade behind MSVC and ~4 years behind clang/llvm. Surprisingly, icc is also quite bad.
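
For instance, a three-limb add with the existing intrinsics (operand order as in the Intel Intrinsics Guide):

#include <x86intrin.h>

void add3_u64 (unsigned long long *p, const unsigned long long *u,
               const unsigned long long *v)
{
  unsigned char c;
  c = _addcarry_u64 (0, u[0], v[0], &p[0]);
  c = _addcarry_u64 (c, u[1], v[1], &p[1]);
  (void) _addcarry_u64 (c, u[2], v[2], &p[2]);
}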
Comment 6 Michael_S 2020-11-15 21:24:13 UTC
(In reply to Marc Glisse from comment #1)
> We could start with the simpler:
> 
> void f(unsigned*__restrict__ r,unsigned*__restrict__ s,unsigned a,unsigned
> b,unsigned c, unsigned d){
>   *r=a+b;
>   *s=c+d+(*r<a);
> }
> 

That works for dual-precision addition, but not for triple or more.

Comment 7 Jakub Jelinek 2020-11-15 21:32:57 UTC
(In reply to Michael_S from comment #5)
> I agree with regard to "other targets", first of all aarch64, but the
> x86_64 variant of gcc already provides the requested functionality in the
> form of the _subborrow_u64() and _addcarry_u64() intrinsic functions.
> The problem here is not a lack of functionality, but a very poor
> implementation (mentioned many times on bugzilla, with minimal effect).
> In that regard gcc is more than a decade behind MSVC and ~4 years behind
> clang/llvm. Surprisingly, icc is also quite bad.

Are you sure you have tested gcc trunk?
There were fixes for this a month ago as part of the PR97387 work.
Comment 8 Michael_S 2020-11-15 22:30:27 UTC
(In reply to Jakub Jelinek from comment #7)
> (In reply to Michael_S from comment #5)
> > [...]
> 
> Are you sure you have tested gcc trunk?
> There were fixes for this a month ago as part of the PR97387 work.

I am sure I did not :-)
And I am most likely not going to test trunk, sorry. I'd rather wait for a release.
Essentially, I am posting here not because I deeply care about the topic right now (I did care a year or so ago, but have lost interest since then, see https://www.realworldtech.com/forum/?threadid=188061&curpostid=188061), but because I stumbled on it while waiting for a response to PR 97832.

So, sorry for the intervention.
Comment 9 Michael_S 2020-11-15 22:45:39 UTC
Despite what I wrote above, I did take a look at trunk on godbolt, with the same old code from a year ago, because it was so easy. And indeed trunk looks a lot better.
But until it's released I won't know whether it's actually up to the speed of MSVC and clang.
Comment 10 Thomas Koenig 2021-05-28 15:34:59 UTC
Just had a look at trunk.
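
Judging by the generated code below (note the "adc:" labels), the test case was along these lines (a reconstruction, not the original source):

#include <x86intrin.h>

void adc (unsigned long long *dst, const unsigned long long *src)
{
  unsigned char c = 0;
  for (int i = 0; i < 100; i++)    /* 100 limbs = 800 bytes */
    c = _addcarry_u64 (c, dst[i], src[i], &dst[i]);
}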

It currently produces

adc:
        leaq    800(%rsi), %rcx
        xorl    %edx, %edx
.L2:
        movq    (%rdi), %rax
        addb    $-1, %dl
        adcq    (%rsi), %rax
        setc    %dl
        addq    $8, %rsi
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rcx, %rsi
        jne     .L2
        ret

Clang does

adc:                                    # @adc
        movl    $4, %eax
        xorl    %ecx, %ecx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movq    -32(%rsi,%rax,8), %rdx
        addb    $-1, %cl
        adcq    %rdx, -32(%rdi,%rax,8)
        movq    -24(%rsi,%rax,8), %rcx
        adcq    %rcx, -24(%rdi,%rax,8)
        movq    -16(%rsi,%rax,8), %rcx
        adcq    %rcx, -16(%rdi,%rax,8)
        movq    -8(%rsi,%rax,8), %rcx
        adcq    %rcx, -8(%rdi,%rax,8)
        movq    (%rsi,%rax,8), %rcx
        adcq    %rcx, (%rdi,%rax,8)
        setb    %cl
        addq    $5, %rax
        cmpq    $104, %rax
        jne     .LBB0_1
        retq

so it actually unrolls the loop and produces the ideal sequence of
add-with-carry instructions.
Comment 11 Andrew Pinski 2021-09-09 06:48:18 UTC
x86 has _addcarry_u64, which mostly gets there (see PR 97387).

The clang-compatible __builtin_*_overflow builtins are there, but not the __builtin_addc* ones.

GCC does do a decent job of optimizing the original code now too.
Comment 12 Vincent Lefèvre 2021-09-09 08:54:53 UTC
(In reply to Andrew Pinski from comment #11)
> x86 has _addcarry_u64, which mostly gets there (see PR 97387).
> 
> The clang-compatible __builtin_*_overflow builtins are there, but not the
> __builtin_addc* ones.
> 
> GCC does do a decent job of optimizing the original code now too.

By "original code", do you mean the code with _addcarry_u64 (I haven't tested)? Otherwise, I don't see any optimization at all on the code I posted in Comment 0.

One issue is that _addcarry_u64 / x86intrin.h are not documented, so the conditions of their use in portable code are not clear. I suppose that it is designed to be wrapped in a target-independent compiler builtin.
Comment 13 Andrew Pinski 2021-09-09 09:05:55 UTC
(In reply to Vincent Lefèvre from comment #12)
> One issue is that _addcarry_u64 / x86intrin.h are not documented, so the
> conditions of their use in portable code are not clear. I suppose that it
> is designed to be wrapped in a target-independent compiler builtin.

None of the x86 intrinsics are documented, except in the Intel Intrinsics Guide, e.g.:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_addcarry_u64