This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Fold pointer range checks with equal spans
- From: Richard Sandiford <richard dot sandiford at arm dot com>
- To: gcc-patches at gcc dot gnu dot org
- Date: Fri, 20 Jul 2018 11:27:25 +0100
- Subject: Fold pointer range checks with equal spans
When checking whether vectorised accesses at A and B are independent,
the vectoriser falls back to tests of the form:
A + size <= B || B + size <= A
But in the common case that "size" is just the constant size of a vector
(or a small multiple), it would be more efficient to do:
((size_t) A - (size_t) B + size - 1) > (size - 1) * 2
This patch adds folds to do that. E.g. before the patch, the alias
checks for:
for (int j = 0; j < n; ++j)
{
for (int i = 0; i < 16; ++i)
a[i] = (b[i] + c[i]) >> 1;
a += step;
b += step;
c += step;
}
were:
add x7, x1, 15
add x5, x0, 15
cmp x0, x7
add x7, x2, 15
ccmp x1, x5, 2, ls
cset w8, hi
cmp x0, x7
ccmp x2, x5, 2, ls
cset w4, hi
tst w8, w4
while after the patch they're:
sub x7, x0, x1
sub x5, x0, x2
add x7, x7, 15
add x5, x5, 15
cmp x7, 30
ccmp x5, 30, 0, hi
The old scheme needs:
[A] one addition per vector pointer
[B] two comparisons and one IOR per range check
The new one needs:
[C] one subtraction, one addition and one comparison per range check
The range checks are then ANDed together, with the same number of
ANDs either way.
With conditional comparisons (such as on AArch64), we're able to remove
the IOR between comparisons in the old scheme, but then need an explicit
AND or branch when combining the range checks, as the example above shows.
With the new scheme we can instead use conditional comparisons for
the AND chain.
So even with conditional comparisons, the new scheme should in practice
be a win in almost all cases. Without conditional comparisons, the new
scheme removes [A] and replaces [B] with an equivalent number of operations
[C], so should always give fewer operations overall. Although each [C]
is linear, multiple [C]s are easily parallelisable.
A better implementation of the above would be:
add x5, x0, 15
sub x7, x5, x1
sub x5, x5, x2
cmp x7, 30
ccmp x5, 30, 0, hi
where we add 15 to "a" before the subtraction. Unfortunately,
canonicalisation rules mean that even if we try to create it in
that form, it gets folded into the one above instead.
An alternative would be not to do this in match.pd and instead get
tree-data-ref.c to do it itself. I started out that way but thought
the match.pd approach seemed cleaner.
Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
and x86_64-linux-gnu. OK to install?
Richard
2018-07-20 Richard Sandiford <richard.sandiford@arm.com>
gcc/
* match.pd: Optimise pointer range checks.
gcc/testsuite/
* gcc.dg/pointer-range-check-1.c: New test.
* gcc.dg/pointer-range-check-2.c: Likewise.
Index: gcc/match.pd
===================================================================
--- gcc/match.pd 2018-07-18 18:44:22.565914281 +0100
+++ gcc/match.pd 2018-07-20 11:24:33.692045585 +0100
@@ -4924,3 +4924,37 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(if (inverse_conditions_p (@0, @2)
&& element_precision (type) == element_precision (op_type))
(view_convert (cond_op @2 @3 @4 @5 (view_convert:op_type @1)))))))
+
+/* For pointers @0 and @2 and unsigned constant offset @1, look for
+ expressions like:
+
+ A: (@0 + @1 < @2) | (@2 + @1 < @0)
+ B: (@0 + @1 <= @2) | (@2 + @1 <= @0)
+
+ If pointers are known not to wrap, B checks whether @1 bytes starting at
+ @0 and @2 do not overlap, while A tests the same thing for @1 + 1 bytes.
+ A is more efficiently tested as:
+
+ ((sizetype) @0 - (sizetype) @2 + @1) > (@1 * 2)
+
+ as long as @1 * 2 doesn't overflow. B is the same with @1 replaced
+ with @1 - 1. */
+(for ior (truth_orif truth_or bit_ior)
+ (for cmp (le lt)
+ (simplify
+ (ior (cmp (pointer_plus:s @0 INTEGER_CST@1) @2)
+ (cmp (pointer_plus:s @2 @1) @0))
+ (if (!flag_trapv && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+ /* Convert the B form to the A form. */
+ (with { offset_int off = wi::to_offset (@1) - (cmp == LE_EXPR ? 1 : 0); }
+ /* Always fails for negative values. */
+ (if (wi::min_precision (off, UNSIGNED) * 2 <= TYPE_PRECISION (sizetype))
+ /* It doesn't matter whether we use @2 - @0 or @0 - @2, so let
+ tree_swap_operands_p pick a canonical order. */
+ (with { tree ptr1 = @0, ptr2 = @2;
+ if (tree_swap_operands_p (ptr1, ptr2))
+ std::swap (ptr1, ptr2); }
+ (gt (plus (minus (convert:sizetype { ptr1; })
+ (convert:sizetype { ptr2; }))
+ { wide_int_to_tree (sizetype, off); })
+ { wide_int_to_tree (sizetype, off * 2); }))))))))
Index: gcc/testsuite/gcc.dg/pointer-range-check-1.c
===================================================================
--- /dev/null 2018-07-09 14:52:09.234750850 +0100
+++ gcc/testsuite/gcc.dg/pointer-range-check-1.c 2018-07-20 11:24:33.692045585 +0100
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-ipa-icf -fdump-tree-optimized" } */
+
+/* All four functions should be folded to:
+
+ ((sizetype) a - (sizetype) b + 15) < 30. */
+
+_Bool
+f1 (char *a, char *b)
+{
+ return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f2 (char *a, char *b)
+{
+ return (a + 15 < b) || (b + 15 < a);
+}
+
+_Bool
+f3 (char *a, char *b)
+{
+ return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f4 (char *a, char *b)
+{
+ return (a + 15 < b) || (b + 15 < a);
+}
+
+/* { dg-final { scan-tree-dump-times { = [^\n]* - [^\n]*;} 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { = [^\n]* \+ [^\n]*;} 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { = [^\n]*\ > [^\n]*;} 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-not {=[^\n]*\ < [^\n]*;} "optimized" } } */
+/* { dg-final { scan-tree-dump-times { \+ 15} 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { > 30} 4 "optimized" } } */
Index: gcc/testsuite/gcc.dg/pointer-range-check-2.c
===================================================================
--- /dev/null 2018-07-09 14:52:09.234750850 +0100
+++ gcc/testsuite/gcc.dg/pointer-range-check-2.c 2018-07-20 11:24:33.692045585 +0100
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-ipa-icf -fwrapv-pointer -fdump-tree-optimized" } */
+
+_Bool
+f1 (char *a, char *b)
+{
+ return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f2 (char *a, char *b)
+{
+ return (a + 15 < b) || (b + 15 < a);
+}
+
+_Bool
+f3 (char *a, char *b)
+{
+ return (a + 16 <= b) || (b + 16 <= a);
+}
+
+_Bool
+f4 (char *a, char *b)
+{
+ return (a + 15 < b) || (b + 15 < a);
+}
+
+/* { dg-final { scan-tree-dump-not { = [^\n]* - [^\n]*;} "optimized" } } */
+/* { dg-final { scan-tree-dump-times { = [^\n]* \+ [^\n]*;} 8 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { \+ 15} 4 "optimized" } } */
+/* { dg-final { scan-tree-dump-times { \+ 16} 4 "optimized" } } */