There should be a way to support full add-with-carry and subtract-with-borrow by generating adc / sbb instructions on x86_64 (and similar instructions on other targets).

GCC could add builtins such as __builtin_addc* and __builtin_subc* (taking two operands and a carry in, producing the result and a carry out), similar to Clang:
http://clang.llvm.org/docs/LanguageExtensions.html#multiprecision-arithmetic-builtins
as suggested in PR 60206 comment 3.

Detection of special constructs in standard C/... code would be useful too. Here are some examples from https://gcc.gnu.org/ml/gcc-help/2017-01/msg00067.html for subtraction:

typedef unsigned long T;

void sub1 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  T t1;
  int b0, b1;

  p[0] = u0 - v0;
  b0 = u0 < v0;
  t1 = u1 - v1;
  b1 = u1 < v1;
  p[1] = t1 - b0;
  b1 |= p[1] > t1;
  p[2] = u2 - v2 - b1;
}

void sub2 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  int b0, b1;

  p[0] = u0 - v0;
  b0 = u0 < v0;
  p[1] = u1 - v1 - b0;
  b1 = u1 < v1 || (u1 == v1 && b0 != 0);
  p[2] = u2 - v2 - b1;
}

In the second example, the b1 line could also be replaced by:

  b1 = u1 < v1 + b0 || v1 + b0 < v1;

For the subtractions, optimal code would contain 1 sub and 2 sbb's.
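As a rough sketch (not part of the original report), the requested builtins would let the 3-limb subtraction above be written as follows, using Clang's __builtin_subcl signature from the page linked above (result = x - y - carry_in, borrow stored through the last argument):

/* Sketch only: 3-limb subtraction with Clang-style builtins.
   Assumed signature, per the Clang documentation:
   unsigned long __builtin_subcl (unsigned long x, unsigned long y,
                                  unsigned long carry_in,
                                  unsigned long *carry_out).  */
void sub_builtin (unsigned long *p,
                  unsigned long u0, unsigned long u1, unsigned long u2,
                  unsigned long v0, unsigned long v1, unsigned long v2)
{
  unsigned long b0, b1, b2;
  p[0] = __builtin_subcl (u0, v0, 0, &b0);
  p[1] = __builtin_subcl (u1, v1, b0, &b1);
  p[2] = __builtin_subcl (u2, v2, b1, &b2);
  /* b2 (the final borrow) is ignored, as in sub1/sub2 above.  */
}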
We could start with the simpler:

void f(unsigned*__restrict__ r,unsigned*__restrict__ s,unsigned a,unsigned b,unsigned c, unsigned d){
*r=a+b;
*s=c+d+(*r<a);
}

After combine, we have:

(insn 34 12 20 2 (set (reg:SI 93 [ _15+4 ])
        (ltu:SI (reg:CCC 17 flags)
            (const_int 0 [0]))) 608 {*setcc_si_1_movzbl}
     (expr_list:REG_DEAD (reg:CCC 17 flags)
        (nil)))

(insn 21 20 22 2 (parallel [
            (set (reg:SI 102)
                (plus:SI (reg:SI 37 r8 [ c ])
                    (reg:SI 38 r9 [ d ])))
            (clobber (reg:CC 17 flags))
        ]) "a.c":3 213 {*addsi_1}
     (expr_list:REG_DEAD (reg:SI 38 r9 [ d ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_DEAD (reg:SI 37 r8 [ c ])
                (nil)))))

(insn 25 24 26 2 (parallel [
            (set (reg:SI 105)
                (plus:SI (reg:SI 102)
                    (reg:SI 93 [ _15+4 ])))
            (clobber (reg:CC 17 flags))
        ]) "a.c":3 213 {*addsi_1}
     (expr_list:REG_DEAD (reg:SI 93 [ _15+4 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_DEAD (reg:SI 102)
                (nil)))))

The combine dump says "Trying 21, 34 -> 25:" but the next line is blank and it moves on to trying something else.

If I use parentheses *s=c+(d+(*r<a)); and disable reassoc, I get:

Trying 23, 24 -> 25:
Successfully matched this instruction:
(parallel [
        (set (reg:SI 105)
            (plus:SI (plus:SI (ltu:SI (reg:CCC 17 flags)
                        (const_int 0 [0]))
                    (reg:SI 37 r8 [ c ]))
                (reg:SI 38 r9 [ d ])))
        (clobber (reg:CC 17 flags))
    ])
Instruction not appropriate for target.

I don't know where that target restriction is coming from, but at least we seem to be getting somewhere.

If I remove c and keep *s=d+(*r<a), I get

Failed to match this instruction:
(parallel [
        (set (reg:SI 103)
            (plus:SI (ltu:SI (reg:CCC 17 flags)
                    (const_int 0 [0]))
                (reg:SI 38 r9 [ d ])))
        (clobber (reg:CC 17 flags))
    ])
Failed to match this instruction:
(set (reg:SI 103)
    (plus:SI (ltu:SI (reg:CCC 17 flags)
            (const_int 0 [0]))
        (reg:SI 38 r9 [ d ])))

we would probably need a special pattern for this case, virtually adding 0.
(In reply to Marc Glisse from comment #1)
> Trying 23, 24 -> 25:
> Successfully matched this instruction:
> (parallel [
>         (set (reg:SI 105)
>             (plus:SI (plus:SI (ltu:SI (reg:CCC 17 flags)
>                         (const_int 0 [0]))
>                     (reg:SI 37 r8 [ c ]))
>                 (reg:SI 38 r9 [ d ])))
>         (clobber (reg:CC 17 flags))
>     ])
> Instruction not appropriate for target.

I didn't notice immediately, but apparently ix86_legitimate_combined_insn was happy to let combine propagate the hard registers r8 and r9 into the simple additions (*addsi_1, which has (match_operand:SWI48 2 "x86_64_general_operand" "rme,re,0,le")), but then it won't accept them in the addition with carry. My guess would be that letting the hard registers into the additions is premature and should be delayed until after combine. If I manually reject those in gdb, we do produce addl+adcl.
Confirmed. The bug should possibly be split (Clang-compatible builtins, target-specific detection of patterns, middle-end detection of patterns). Note that a suitable middle-end representation would not have the address argument but would likely return two values (via the usual _Complex int trick).
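A rough C-level model of that two-value representation (illustration only, not the actual internal-function interface; uses the GNU _Complex integer extension, with carry_in assumed to be in [0, 1]):

/* Illustration only: real part = sum, imaginary part = carry out.  */
static inline _Complex unsigned long
uaddc_model (unsigned long x, unsigned long y, unsigned long carry_in)
{
  _Complex unsigned long ret;
  unsigned long r;
  unsigned long c1 = __builtin_add_overflow (x, y, &r);
  unsigned long c2 = __builtin_add_overflow (r, carry_in, &r);
  __real__ ret = r;
  __imag__ ret = c1 | c2;   /* valid because carry_in is in [0, 1] */
  return ret;
}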
Also, make sure that the optimization is still done when a variable is a constant or replaced by a constant (with Clang, the optimization is no longer done in such a case).
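For instance (an illustrative case, not taken from the report), with one low limb a compile-time constant the carry chain should still be recognized:

/* Illustrative constant case: v0 is a compile-time constant, yet the
   carry into the high limb should still become add + adc (adc of 0).  */
void add_const (unsigned long *p, unsigned long u0, unsigned long u1)
{
  const unsigned long v0 = 0x12345678UL;  /* arbitrary constant */
  p[0] = u0 + v0;
  p[1] = u1 + (p[0] < u0);                /* expect: adcq $0 */
}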
(In reply to Vincent Lefèvre from comment #0)
> There should be a way to support full add-with-carry and
> subtract-with-borrow by generating adc / sbb instructions on x86_64 (and
> similar instructions on other targets).
> [...]
> For the subtractions, optimal code would contain 1 sub and 2 sbb's.

I agree with regard to "other targets", first of all aarch64, but the x86_64 variant of gcc already provides the requested functionality in the form of the _subborrow_u64() and _addcarry_u64() intrinsic functions.
The problem here is not lack of functionality, but a very poor implementation (mentioned many times on bugzilla with minimal effect).
In that regard gcc is more than a decade behind MSVC and ~4 years behind clang/llvm. Surprisingly, icc is also quite bad.
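For reference, a sketch of that existing x86-specific route (x86-64 only; the 3-limb shape and names are illustrative, not code from this thread):

#include <x86intrin.h>

/* Sketch: 3-limb add via the existing intrinsic.  _addcarry_u64 takes a
   carry in, two 64-bit operands and an output pointer, and returns the
   carry out.  */
void add3_intrin (unsigned long long *p,
                  const unsigned long long *u, const unsigned long long *v)
{
  unsigned char c;
  c = _addcarry_u64 (0, u[0], v[0], &p[0]);
  c = _addcarry_u64 (c, u[1], v[1], &p[1]);
  (void) _addcarry_u64 (c, u[2], v[2], &p[2]);
}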
(In reply to Marc Glisse from comment #1)
> We could start with the simpler:
>
> void f(unsigned*__restrict__ r,unsigned*__restrict__ s,unsigned a,unsigned
> b,unsigned c, unsigned d){
> *r=a+b;
> *s=c+d+(*r<a);
> }

That works for dual-precision addition, but not for triple or more.
(In reply to Michael_S from comment #5)
> I agree with regard to "other targets", first of all aarch64, but the
> x86_64 variant of gcc already provides the requested functionality in the
> form of the _subborrow_u64() and _addcarry_u64() intrinsic functions.
> The problem here is not lack of functionality, but a very poor
> implementation (mentioned many times on bugzilla with minimal effect).
> In that regard gcc is more than a decade behind MSVC and ~4 years behind
> clang/llvm. Surprisingly, icc is also quite bad.

Are you sure you have tested gcc trunk?
Fixes for this went in a month ago as part of the PR97387 fixes.
(In reply to Jakub Jelinek from comment #7)
> Are you sure you have tested gcc trunk?
> Fixes for this went in a month ago as part of the PR97387 fixes.

I am sure I did not :-)
And I most likely am not going to test trunk, sorry. I'd rather wait for a release.
Essentially, I am posting here not because I deeply care about the topic right now (I did care a year or so ago, but have lost interest since then, see https://www.realworldtech.com/forum/?threadid=188061&curpostid=188061), but because I stumbled upon it by accident while waiting for a response to 97832. So, sorry for the intervention.
Despite what I wrote above, I did take a look at trunk on godbolt with the same old code from a year ago, because it was so easy. And indeed trunk looks a lot better. But until it's released I won't know whether it's actually up to the speed of MSVC and clang.
Just had a look at trunk. It currently produces:

adc:
        leaq    800(%rsi), %rcx
        xorl    %edx, %edx
.L2:
        movq    (%rdi), %rax
        addb    $-1, %dl
        adcq    (%rsi), %rax
        setc    %dl
        addq    $8, %rsi
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rcx, %rsi
        jne     .L2
        ret

Clang does:

adc:                                    # @adc
        movl    $4, %eax
        xorl    %ecx, %ecx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movq    -32(%rsi,%rax,8), %rdx
        addb    $-1, %cl
        adcq    %rdx, -32(%rdi,%rax,8)
        movq    -24(%rsi,%rax,8), %rcx
        adcq    %rcx, -24(%rdi,%rax,8)
        movq    -16(%rsi,%rax,8), %rcx
        adcq    %rcx, -16(%rdi,%rax,8)
        movq    -8(%rsi,%rax,8), %rcx
        adcq    %rcx, -8(%rdi,%rax,8)
        movq    (%rsi,%rax,8), %rcx
        adcq    %rcx, (%rdi,%rax,8)
        setb    %cl
        addq    $5, %rax
        cmpq    $104, %rax
        jne     .LBB0_1
        retq

so it actually unrolls the loop and does the ideal sequence of add with carry.
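For context, the dumps above are consistent with a source along these lines (a guess at the shape of the test case, not the exact code that was compiled): an in-place 100-limb add-with-carry loop built from __builtin_add_overflow pairs.

/* Guessed shape of the test case: 100 limbs, dst[i] += src[i] with carry.  */
void adc (unsigned long *dst, const unsigned long *src)
{
  unsigned long c = 0;
  for (int i = 0; i < 100; i++)
    {
      unsigned long r;
      unsigned long c1 = __builtin_add_overflow (dst[i], src[i], &r);
      unsigned long c2 = __builtin_add_overflow (r, c, &r);
      dst[i] = r;
      c = c1 | c2;
    }
}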
x86 has _addcarry_u64, which mostly gets there (see PR 97387).

The __builtin_*_overflow builtins are there, but not the clang-style __builtin_addc* builtins.

GCC does do a decent job of optimizing the original code now too.
(In reply to Andrew Pinski from comment #11)
> x86 has _addcarry_u64, which mostly gets there (see PR 97387).
>
> The __builtin_*_overflow builtins are there, but not the clang-style
> __builtin_addc* builtins.
>
> GCC does do a decent job of optimizing the original code now too.

By "original code", do you mean the code with _addcarry_u64 (I haven't tested)? Otherwise, I don't see any optimization at all on the code I posted in Comment 0.

One issue is that _addcarry_u64 / x86intrin.h are not documented, so the conditions of its use in portable code are not clear. I suppose that it is designed to be used in a target-independent compiler builtin.
(In reply to Vincent Lefèvre from comment #12)
> One issue is that _addcarry_u64 / x86intrin.h are not documented, so the
> conditions of its use in portable code are not clear.

None of the x86 intrinsics are documented except in the Intel intrinsics guide, e.g.:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_addcarry_u64
Unfortunately, the clang __builtin_addc* and __builtin_subc* builtins are badly designed. Instead of adding or subtracting a 1-bit carry, where the result is guaranteed to have a 1-bit carry as well, they are:

unsigned __builtin_addc (unsigned x, unsigned y, unsigned carry_in, unsigned *carry_out)
{
  unsigned r;
  unsigned c1 = __builtin_add_overflow (x, y, &r);
  unsigned c2 = __builtin_add_overflow (r, carry_in, &r);
  *carry_out = c1 + c2;
  return r;
}

unsigned __builtin_subc (unsigned x, unsigned y, unsigned carry_in, unsigned *carry_out)
{
  unsigned r;
  unsigned c1 = __builtin_sub_overflow (x, y, &r);
  unsigned c2 = __builtin_sub_overflow (r, carry_in, &r);
  *carry_out = c1 + c2;
  return r;
}

So, instead of doing
  [0, 0xffffffff] + [0, 0xffffffff] + [0, 1]
resulting in [0, 0xffffffff] plus [0, 1] carry, they actually do
  [0, 0xffffffff] + [0, 0xffffffff] + [0, 0xffffffff]
resulting in [0, 0xffffffff] plus [0, 2] carry. So much for good "design".

So I am not really sure if it is worth implementing those builtins; one can use __builtin_add_overflow/__builtin_sub_overflow instead. All we need is to pattern detect if they are chained and so start with 0 carry in; then the carry outs are guaranteed to be [0, 1], and further pairs of .ADD_OVERFLOW/.SUB_OVERFLOW again can count on [0, 1] carry in and produce [0, 1] carry out. And pattern detect that into some new IFN which will try to produce efficient code for these.
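The chained form described above, written out explicitly (a sketch of the shape such pattern matching would look for, not an excerpt from a GCC testcase):

/* Sketch: chained 3-limb add built only from __builtin_add_overflow.
   Limb 0 starts with carry 0; every later limb uses a pair of calls and
   the [0, 1] carries are combined with |.  */
void add3_chain (unsigned long *p, const unsigned long *u, const unsigned long *v)
{
  unsigned long r0, r1, r2, c, c1, c2;
  c = __builtin_add_overflow (u[0], v[0], &r0);
  c1 = __builtin_add_overflow (u[1], v[1], &r1);
  c2 = __builtin_add_overflow (r1, c, &r1);
  c = c1 | c2;
  c1 = __builtin_add_overflow (u[2], v[2], &r2);
  c2 = __builtin_add_overflow (r2, c, &r2);   /* final carry out ignored */
  p[0] = r0;
  p[1] = r1;
  p[2] = r2;
}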
For what it's worth, clang's __builtin_addc is implemented in the frontend only, as a pair of __builtin_add_overflow calls. The commit from 11 years ago does not explain why they were added.
https://github.com/llvm/llvm-project/commit/54398015bf8cbdc3af54dda74807d6f3c8436164

Producing a chain of ADC instructions out of __builtin_add_overflow patterns was only done quite recently (~1 year ago), and this work is not fully finished yet.

On the other hand, Go recently added "addc"-like "builtins" in https://pkg.go.dev/math/bits, and they are a real pleasure to use in multi-precision arithmetic.
Created attachment 55271 [details]
gcc14-pr79173-wip.patch

Untested WIP (with a backend implementation for x86 only so far). I still need to add/tweak some peephole2s to make the pr79173-{1,2,3,4}.c tests clean on both x86-64 and i?86, and then, as can be seen in pr79173-5.c, do some tweaks so that it also pattern recognizes the pattern-recognized __builtin_{add,sub}_overflow (i.e. cases where those were themselves matched from plain C) rather than only direct uses. C23 is standardizing __builtin_{add,sub,mul}_overflow under the ckd_{add,sub,mul} names, so pattern recognizing it that way is definitely desirable. Oh, and maybe incrementally check what happens if one of the addends or subtrahends is an immediate.
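For reference, the C23 spelling mentioned above looks like this (a sketch assuming <stdckdint.h> is available; not one of the pr79173-*.c testcases):

#include <stdckdint.h>
#include <stdbool.h>

/* Sketch: 2-limb add spelled with the C23 ckd_add macro, which should be
   recognizable the same way as __builtin_add_overflow.  */
void add2_c23 (unsigned long *p, const unsigned long *u, const unsigned long *v)
{
  unsigned long r0, r1;
  bool c0 = ckd_add (&r0, u[0], v[0]);
  bool c1 = ckd_add (&r1, u[1], v[1]);
  bool c2 = ckd_add (&r1, r1, (unsigned long) c0);
  p[0] = r0;
  p[1] = r1;
  (void) (c1 || c2);  /* carry out of the high limb, ignored here */
}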
Created attachment 55274 [details]
gcc14-pr79173.patch

Full patch I'm going to test. Unfortunately, I haven't been successful at getting the subc stuff working when not using the __builtin_sub_overflow builtins nor the _subborrow_u* intrinsics; only addc seems to work in that case. It seems reassoc rewrites things so much that we are no longer able to even pattern match __builtin_sub_overflow there. So maybe even adding the ugly clang builtins as a canned way to express it canonically would be useful; the pattern matching can't handle an infinite number of different ways of writing the same thing.
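For comparison, the subtraction form that is expected to match is the explicit __builtin_sub_overflow chain, analogous to the addition chain earlier (a sketch, not code from the patch):

/* Sketch: 3-limb subtraction written with explicit __builtin_sub_overflow
   pairs, the form the pattern matching is aimed at.  */
void sub3_chain (unsigned long *p, const unsigned long *u, const unsigned long *v)
{
  unsigned long r0, r1, r2, b, b1, b2;
  b = __builtin_sub_overflow (u[0], v[0], &r0);
  b1 = __builtin_sub_overflow (u[1], v[1], &r1);
  b2 = __builtin_sub_overflow (r1, b, &r1);
  b = b1 | b2;
  b1 = __builtin_sub_overflow (u[2], v[2], &r2);
  b2 = __builtin_sub_overflow (r2, b, &r2);   /* final borrow ignored */
  p[0] = r0;
  p[1] = r1;
  p[2] = r2;
}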
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:b6ca11407d4f5d16ccfb580ea2d3d9aa08d7cd11

commit r14-1835-gb6ca11407d4f5d16ccfb580ea2d3d9aa08d7cd11
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Jun 15 09:05:01 2023 +0200

    i386: Add peephole2 patterns to improve add with carry or subtract with
    borrow with memory destination [PR79173]

    This patch adds various peephole2s which help to recognize add with
    carry or subtract with borrow with memory destination.

    2023-06-14  Jakub Jelinek  <jakub@redhat.com>

            PR middle-end/79173
            * config/i386/i386.md (*sub<mode>_3, @add<mode>3_carry,
            addcarry<mode>, @sub<mode>3_carry, *add<mode>3_cc_overflow_1):
            Add define_peephole2 TARGET_READ_MODIFY_WRITE/-Os patterns to
            prefer using memory destination in these patterns.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:ec52d228d6db7f77188ad099a8c0ff65dead3241

commit r14-1836-gec52d228d6db7f77188ad099a8c0ff65dead3241
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Jun 15 09:08:37 2023 +0200

    i386: Add peephole2 patterns to improve subtract with borrow with memory
    destination [PR79173]

    This patch adds a subborrow<mode> alternative so that it can have a
    memory destination and adds various peephole2s which help to match it.

    2023-06-15  Jakub Jelinek  <jakub@redhat.com>

            PR middle-end/79173
            * config/i386/i386.md (subborrow<mode>): Add alternative with
            memory destination and add for it define_peephole2
            TARGET_READ_MODIFY_WRITE/-Os patterns to prefer using memory
            destination in these patterns.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:43a3252c42af12ad90082e4088ea58eecd0bf582

commit r14-1837-g43a3252c42af12ad90082e4088ea58eecd0bf582
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Jun 15 09:12:40 2023 +0200

    middle-end, i386: Pattern recognize add/subtract with carry [PR79173]

    The following patch introduces {add,sub}c5_optab and pattern recognizes
    various forms of add with carry and subtract with carry/borrow, see
    pr79173-{1,2,3,4,5,6}.c tests on what is matched.
    Primarily forms with 2 __builtin_add_overflow or __builtin_sub_overflow
    calls per limb (with just one for the least significant one), for add
    with carry even when it is hand written in C (for subtraction reassoc
    seems to change it too much so that the pattern recognition doesn't
    work).  __builtin_{add,sub}_overflow are standardized in C23 under
    ckd_{add,sub} names, so it isn't any longer a GNU only extension.

    Note, clang has for these (IMHO badly designed)
    __builtin_{add,sub}c{b,s,,l,ll} builtins which don't add/subtract just
    a single bit of carry, but basically add 3 unsigned values or subtract
    2 unsigned values from one, and result in carry out of 0, 1, or 2
    because of that.  If we wanted to introduce those for clang
    compatibility, we could, and lower them early to just two
    __builtin_{add,sub}_overflow calls and let the pattern matching in this
    patch recognize it later.

    I've added expanders for this on ix86 and in addition to that added
    various peephole2s (in preparation patches for this patch) to make sure
    we get nice (and small) code for the common cases.  I think there are
    other PRs which request that e.g. for the _{addcarry,subborrow}_u{32,64}
    intrinsics, which the patch also improves.

    Would be nice if support for these optabs was added to many other
    targets, arm/aarch64 and powerpc* certainly have such instructions, I'd
    expect in fact that most targets do.  The _BitInt support I'm working on
    will also need this to emit reasonable code.

    2023-06-15  Jakub Jelinek  <jakub@redhat.com>

            PR middle-end/79173
            * internal-fn.def (UADDC, USUBC): New internal functions.
            * internal-fn.cc (expand_UADDC, expand_USUBC): New functions.
            (commutative_ternary_fn_p): Return true also for IFN_UADDC.
            * optabs.def (uaddc5_optab, usubc5_optab): New optabs.
            * tree-ssa-math-opts.cc (uaddc_cast, uaddc_ne0, uaddc_is_cplxpart,
            match_uaddc_usubc): New functions.
            (math_opts_dom_walker::after_dom_children): Call match_uaddc_usubc
            for PLUS_EXPR, MINUS_EXPR, BIT_IOR_EXPR and BIT_XOR_EXPR unless
            other optimizations have been successful for those.
            * gimple-fold.cc (gimple_fold_call): Handle IFN_UADDC and
            IFN_USUBC.
            * fold-const-call.cc (fold_const_call): Likewise.
            * gimple-range-fold.cc (adjust_imagpart_expr): Likewise.
            * tree-ssa-dce.cc (eliminate_unnecessary_stmts): Likewise.
            * doc/md.texi (uaddc<mode>5, usubc<mode>5): Document new named
            patterns.
            * config/i386/i386.md (uaddc<mode>5, usubc<mode>5): New
            define_expand patterns.
            (*setcc_qi_addqi3_cconly_overflow_1_<mode>, *setccc): Split into
            NOTE_INSN_DELETED note rather than nop instruction.
            (*setcc_qi_negqi_ccc_1_<mode>, *setcc_qi_negqi_ccc_2_<mode>):
            Likewise.

            * gcc.target/i386/pr79173-1.c: New test.
            * gcc.target/i386/pr79173-2.c: New test.
            * gcc.target/i386/pr79173-3.c: New test.
            * gcc.target/i386/pr79173-4.c: New test.
            * gcc.target/i386/pr79173-5.c: New test.
            * gcc.target/i386/pr79173-6.c: New test.
            * gcc.target/i386/pr79173-7.c: New test.
            * gcc.target/i386/pr79173-8.c: New test.
            * gcc.target/i386/pr79173-9.c: New test.
            * gcc.target/i386/pr79173-10.c: New test.
CCing some maintainers: could you please consider adding these uaddc<mode>5/usubc<mode>5 expanders to the rs6000, aarch64, arm, s390 and riscv targets if those targets have appropriate instructions (add with carry and subtract with carry/borrow), copy the above gcc.target/i386/ testcases to the other gcc.target subdirectories, and verify that you get optimal code for those sequences there? The _BitInt addition/subtraction code will also use these new internal functions if possible, so implementing them will also help get usable code for _BitInt later on.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:2b4e0415ad664cdb3ce87d1f7eee5ca26911a05b

commit r14-1896-g2b4e0415ad664cdb3ce87d1f7eee5ca26911a05b
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Fri Jun 16 19:47:28 2023 +0200

    builtins: Add support for clang compatible __builtin_{add,sub}c{,l,ll}
    [PR79173]

    While the design of these builtins in clang is questionable, rather than
    being say
      unsigned __builtin_addc (unsigned, unsigned, bool, bool *)
    so that it is clear they add two [0, 0xffffffff] range numbers plus one
    [0, 1] range carry in and give [0, 0xffffffff] range return plus [0, 1]
    range carry out, they actually instead add 3 [0, 0xffffffff] values
    together, but the carry out isn't then the expected [0, 2] value because
    0xffffffffULL + 0xffffffff + 0xffffffff is 0x2fffffffd, but just [0, 1]
    whether there was any overflow at all.

    It is something used in the wild and shorter to write than the
    corresponding

    #define __builtin_addc(a,b,carry_in,carry_out) \
      ({ unsigned _s; \
         unsigned _c1 = __builtin_uadd_overflow (a, b, &_s); \
         unsigned _c2 = __builtin_uadd_overflow (_s, carry_in, &_s); \
         *(carry_out) = (_c1 | _c2); \
         _s; })

    and so a canned builtin for something people could often use.

    It isn't that hard to maintain on the GCC side, as we just lower it to
    two .ADD_OVERFLOW calls early, and the already committed pattern
    recognition code can then make .UADDC/.USUBC calls out of that if the
    carry in is in [0, 1] range and the corresponding optab is supported by
    the target.

    2023-06-16  Jakub Jelinek  <jakub@redhat.com>

            PR middle-end/79173
            * builtin-types.def (BT_FN_UINT_UINT_UINT_UINT_UINTPTR,
            BT_FN_ULONG_ULONG_ULONG_ULONG_ULONGPTR,
            BT_FN_ULONGLONG_ULONGLONG_ULONGLONG_ULONGLONG_ULONGLONGPTR):
            New types.
            * builtins.def (BUILT_IN_ADDC, BUILT_IN_ADDCL, BUILT_IN_ADDCLL,
            BUILT_IN_SUBC, BUILT_IN_SUBCL, BUILT_IN_SUBCLL): New builtins.
            * builtins.cc (fold_builtin_addc_subc): New function.
            (fold_builtin_varargs): Handle BUILT_IN_{ADD,SUB}C{,L,LL}.
            * doc/extend.texi (__builtin_addc, __builtin_subc): Document.

            * gcc.target/i386/pr79173-11.c: New test.
            * gcc.dg/builtin-addc-1.c: New test.
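A usage sketch of the newly added clang-compatible builtins (the unsigned long flavor; the 3-limb shape and names are illustrative, not taken from the testsuite):

/* Sketch: 3-limb add using __builtin_addcl as added by the commit above;
   with carries known to be in [0, 1], this should be pattern matched into
   .UADDC and emit add + adc + adc.  */
void add3_addc (unsigned long *p, const unsigned long *u, const unsigned long *v)
{
  unsigned long c0, c1, c2;
  p[0] = __builtin_addcl (u[0], v[0], 0, &c0);
  p[1] = __builtin_addcl (u[1], v[1], c0, &c1);
  p[2] = __builtin_addcl (u[2], v[2], c1, &c2);
  /* c2 is the final carry out, ignored here.  */
}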
RISC-V doesn't have any special instructions to implement add-with-carry or subtract-with-borrow. Depending on who you talk to, that's either a feature or a mis-design.
Sorry, in that case nothing needs to be done for riscv. I'm sure aarch64 and arm have such instructions (e.g. adcs), I think powerpc has some (but e.g. PR43892 is still open), and I'm sure s390 has them too (alc*, slb*).
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:f8f68c4ca622a24c2e8cf2b5f2f9fdcd47a7b369

commit r14-2001-gf8f68c4ca622a24c2e8cf2b5f2f9fdcd47a7b369
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jun 20 20:17:41 2023 +0200

    tree-ssa-math-opts: Small uaddc/usubc pattern matching improvement
    [PR79173]

    In the following testcase we fail to pattern recognize the least
    significant .UADDC call.  The reason is that arg3 in that case is

      _3 = .ADD_OVERFLOW (...);
      _2 = __imag__ _3;
      _1 = _2 != 0;
      arg3 = (unsigned long) _1;

    and while before the changes arg3 has a single use in some .ADD_OVERFLOW
    later on, we add a .UADDC call next to it (and gsi_remove/gsi_replace
    only what is strictly necessary and leave quite a few dead stmts around
    which the next DCE cleans up), and so all of a sudden it isn't used just
    once, but twice (.ADD_OVERFLOW and .UADDC), and so uaddc_cast fails.

    While we could tweak uaddc_cast and not require has_single_use in these
    uses, there is also no vrp that would figure out that because
    __imag__ _3 is in [0, 1] range, it can just use arg3 = __imag__ _3; and
    drop the comparison and cast.

    We already search if either arg2 or arg3 is ultimately set from
    __imag__ of a .{{ADD,SUB}_OVERFLOW,U{ADD,SUB}C} call, so the following
    patch just remembers the lhs of __imag__ from that case and uses it
    later.

    2023-06-20  Jakub Jelinek  <jakub@redhat.com>

            PR middle-end/79173
            * tree-ssa-math-opts.cc (match_uaddc_usubc): Remember lhs of
            IMAGPART_EXPR of arg2/arg3 and use that as arg3 if it has the
            right type.

            * g++.target/i386/pr79173-1.C: New test.
(In reply to CVS Commits from comment #22)
> While the design of these builtins in clang is questionable, rather than
> being say
>   unsigned __builtin_addc (unsigned, unsigned, bool, bool *)
> so that it is clear they add two [0, 0xffffffff] range numbers plus one
> [0, 1] range carry in and give [0, 0xffffffff] range return plus [0, 1]
> range carry out, they actually instead add 3 [0, 0xffffffff] values
> together, but the carry out isn't then the expected [0, 2] value because
> 0xffffffffULL + 0xffffffff + 0xffffffff is 0x2fffffffd, but just [0, 1]
> whether there was any overflow at all.

That is very strange. I always thought that the original intent was for __builtin_addc to assume that its third argument was in [0, 1] and generate a single addc instruction on hardware that has it, and that the type only ended up being the same as the others for convenience (also, C used not to have a bool type). The final overflow never being 2 confirms this.

It may be worth discussing with the clang developers whether they would be willing to document such a [0, 1] restriction, and maybe have ubsan check it.
Given that the builtins have existed for 10 years already, I think changing them is too late, though clang doesn't seem to take backwards compatibility as seriously.
They don't document the [0, 1] restriction, and the behavior implemented in GCC is what I saw when trying it.
Note, in many cases it isn't that big a deal, because if carry_in is in [0, 1] range and the compiler can see it from VRP, it can still optimize it. And given that carry_out is always in [0, 1] range, for chained cases worst case the first additions will be less optimized but the chained ones will already be better.
(In reply to Jakub Jelinek from comment #27)
> Given that the builtins have existed for 10 years already, I think changing
> them is too late, though clang doesn't seem to take backwards compatibility
> as seriously.
> They don't document the [0, 1] restriction, and the behavior implemented in
> GCC is what I saw when trying it.

Their documentation at https://clang.llvm.org/docs/LanguageExtensions.html is currently just

  unsigned sum = __builtin_addc(x, y, carryin, &carryout);

But a carry for a 2-ary addition is always 0 or 1, so the [0, 1] restriction is implicit (by the language that is used). And in their example, the carries are always 0 or 1.

> Note, in many cases it isn't that big a deal, because if carry_in is in
> [0, 1] range and the compiler can see it from VRP, it can still optimize
> it. And given that carry_out is always in [0, 1] range, for chained cases
> worst case the first additions will be less optimized but the chained ones
> will already be better.

What do you mean by "the first additions will be less optimized"? (If you don't know anything about the initial carryin and the arguments, you can't optimize at all, AFAIK.)
(In reply to Vincent Lefèvre from comment #28)
> (In reply to Jakub Jelinek from comment #27)
> > Given that the builtins have existed for 10 years already, I think
> > changing them is too late, though clang doesn't seem to take backwards
> > compatibility as seriously.
> > They don't document the [0, 1] restriction, and the behavior implemented
> > in GCC is what I saw when trying it.
>
> Their documentation at https://clang.llvm.org/docs/LanguageExtensions.html
> is currently just
>
>   unsigned sum = __builtin_addc(x, y, carryin, &carryout);
>
> But a carry for a 2-ary addition is always 0 or 1, so the [0, 1]
> restriction is implicit (by the language that is used).

That is something that would need to be said explicitly, namely that it is undefined behavior if it is some other value, like we document that e.g. __builtin_clz is undefined behavior on 0 input.

> What do you mean by "the first additions will be less optimized"? (If you
> don't know anything about the initial carryin and the arguments, you can't
> optimize at all, AFAIK.)

I mean that if the compiler can't see it is in [0, 1], it will need to use 2 additions and or the 2 carry bits together. But, because the ored carry bits are in [0, 1] range, all the higher limbs could be done using addc.

If you try clang trunk with -O2 on

unsigned int foo (unsigned x, unsigned y, unsigned carry_in, unsigned *carry_out)
{
  return __builtin_addc (x, y, carry_in, carry_out);
}

unsigned int bar (unsigned x, unsigned y, unsigned carry_in, unsigned *carry_out)
{
  if (carry_in > 1)
    __builtin_unreachable ();
  return __builtin_addc (x, y, carry_in, carry_out);
}

it shows exactly those 2 additions, rather than trying to optimize it, in both cases. GCC trunk emits something comparable for the first case, and for the second you get addb $-1, %dl; adcl %esi, %eax for the main work.
(In reply to Jakub Jelinek from comment #29)
> I mean that if the compiler can't see it is in [0, 1], it will need to use
> 2 additions and or the 2 carry bits together. But, because the ored carry
> bits are in [0, 1] range, all the higher limbs could be done using addc.

If the compiler can't see that carryin is in [0, 1], then it must not "or" the carry bits; it needs to add them, as carryout may be 2. So each part of the whole chain would need 2 __builtin_add_overflow calls and an addition of the carry bits.

However, if the compiler can detect that at some point the arguments cannot both be 0xffffffff at the same time (while carryin is in [0, 2]), then an optimization is possible for the rest of the chain.
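A quick worked example of the carry-out-of-2 case mentioned above (illustrative numbers only):

/* With a carry in outside [0, 1], the true carry out of a 32-bit addition
   can be 2, so carries must be added, not ored.  */
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  uint64_t x = 0xffffffffu, y = 0xffffffffu, carry_in = 2;
  uint64_t full = x + y + carry_in;              /* 0x200000000 */
  printf ("low limb = 0x%08llx, carry out = %llu\n",
          (unsigned long long) (full & 0xffffffffu),
          (unsigned long long) (full >> 32));    /* prints 0x00000000, 2 */
  return 0;
}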
(In reply to Vincent Lefèvre from comment #30)
> (In reply to Jakub Jelinek from comment #29)
> > I mean that if the compiler can't see it is in [0, 1], it will need to
> > use 2 additions and or the 2 carry bits together.
>
> If the compiler can't see that carryin is in [0, 1], then it must not "or"
> the carry bits; it needs to add them, as carryout may be 2.

That is not how the clang builtin works, which is why I've implemented the | and documented it that way, as it is a compatibility builtin.
(In reply to Jakub Jelinek from comment #31)
> (In reply to Vincent Lefèvre from comment #30)
> > If the compiler can't see that carryin is in [0, 1], then it must not
> > "or" the carry bits; it needs to add them, as carryout may be 2.
>
> That is not how the clang builtin works, which is why I've implemented the |
> and documented it that way, as it is a compatibility builtin.

I'm confused. In Comment 14, you said that *carry_out = c1 + c2; was used. This is an addition, not an OR.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:a7aec76a74dd38524be325343158d3049b6ab3ac

commit r14-3541-ga7aec76a74dd38524be325343158d3049b6ab3ac
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Aug 29 10:46:01 2023 +0200

    tree-ssa-math-opts: Improve uaddc/usubc pattern matching [PR111209]

    The usual uaddc/usubc matching is of the .{ADD,SUB}_OVERFLOW pair in the
    middle, which adds/subtracts carry-in (from lower limbs) and computes
    carry-out (to higher limbs).  Before optimizations (unless the user
    writes it intentionally that way already), all the steps look the same,
    but optimizations simplify the handling of the least significant limb
    (the one which adds/subtracts 0 carry-in) to just a single
    .{ADD,SUB}_OVERFLOW, and the handling of the most significant limb, if
    the computed carry-out is ignored, to a normal addition/subtraction of
    multiple operands.

    Now, match_uaddc_usubc has code to turn that least significant
    .{ADD,SUB}_OVERFLOW call into a .U{ADD,SUB}C call with 0 carry-in if a
    more significant limb above it is matched into .U{ADD,SUB}C; this isn't
    necessary for functionality, as .ADD_OVERFLOW (x, y) is functionally
    equal to .UADDC (x, y, 0) (provided the types of the operands are the
    same and the result is a complex type with that type as element), and it
    also has code to match the most significant limb with ignored carry-out
    (in that case one pattern match turns both the penultimate limb pair of
    .{ADD,SUB}_OVERFLOW into .U{ADD,SUB}C and the addition/subtraction of
    the 4 values (2 carries) into another .U{ADD,SUB}C).

    As the following patch shows, what we weren't handling is the case when
    one uses either the __builtin_{add,sub}c builtins or hand written forms
    thereof (either __builtin_*_overflow or even written by hand) for just
    2 limbs, where the least significant has 0 carry-in and the most
    significant ignores carry-out.  The following patch matches that, e.g.

      _16 = .ADD_OVERFLOW (_1, _2);
      _17 = REALPART_EXPR <_16>;
      _18 = IMAGPART_EXPR <_16>;
      _15 = _3 + _4;
      _12 = _15 + _18;

    into

      _16 = .UADDC (_1, _2, 0);
      _17 = REALPART_EXPR <_16>;
      _18 = IMAGPART_EXPR <_16>;
      _19 = .UADDC (_3, _4, _18);
      _12 = IMAGPART_EXPR <_19>;

    so that we can emit better code.  As the 2 later comments show, we must
    do that carefully, because the pass walks the IL from first to last stmt
    in a bb and we must avoid pattern matching this way something that
    should be matched on a later instruction differently.

    2023-08-29  Jakub Jelinek  <jakub@redhat.com>

            PR middle-end/79173
            PR middle-end/111209
            * tree-ssa-math-opts.cc (match_uaddc_usubc): Match also just 2
            limb uaddc/usubc with 0 carry-in on lower limb and ignored
            carry-out on higher limb.  Don't match it though if it could be
            matched later on 4 argument addition/subtraction.

            * gcc.target/i386/pr79173-12.c: New test.
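At the source level, the 2-limb shape the commit above describes corresponds roughly to the following (a sketch, not the pr79173-12.c testcase):

/* Sketch of the 2-limb case: low limb with 0 carry-in via
   __builtin_add_overflow, high limb as a plain 3-operand addition whose
   carry out is ignored.  */
void add2 (unsigned long *p, unsigned long a0, unsigned long a1,
           unsigned long b0, unsigned long b1)
{
  unsigned long r0;
  unsigned long c = __builtin_add_overflow (a0, b0, &r0);
  p[0] = r0;
  p[1] = a1 + b1 + c;   /* carry out ignored */
}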