This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH, rs6000] generate loop code for memcmp inline expansion
- From: Segher Boessenkool <segher at kernel dot crashing dot org>
- To: Aaron Sawdey <acsawdey at linux dot vnet dot ibm dot com>
- Cc: gcc-patches at gcc dot gnu dot org, David Edelsohn <dje dot gcc at gmail dot com>, Jeff Law <law at redhat dot com>
- Date: Tue, 12 Dec 2017 10:13:28 -0600
- Subject: Re: [PATCH, rs6000] generate loop code for memcmp inline expansion
- Authentication-results: sourceware.org; auth=none
- References: <1513007385.30918.2.camel@linux.vnet.ibm.com>
Hi!
On Mon, Dec 11, 2017 at 09:49:45AM -0600, Aaron Sawdey wrote:
> This patch builds on two previously posted patches:
>
> https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01216.html
> https://gcc.gnu.org/ml/gcc-patches/2017-11/msg02599.html
>
> The previous patches allow the use of bdnzt.
>
> This patch allows the cmpstrsi pattern expansion to handle longer blocks without
> increasing code size by using a loop:
>
> .L13:
> ldbrx 7,3,6
> ldbrx 9,10,6
> ldbrx 0,3,5
> ldbrx 4,10,5
> addi 6,6,16
> addi 5,5,16
> subfc. 9,9,7
> bne 0,.L10
> subfc. 9,4,0
> bdnzt 2,.L13
>
> Performance testing on P8 showed that unroll x2 with 2 IVs was the way to go,
> and it is the same performance as the previous non-loop inlne code for a
> 32 byte compare. Also on P8 the performance crossover with calling glibc memcmp
> happens around 192 bytes length. The loop expansion code can also be generated
> when the length is not known at compile time.
>
> The gcc.dg/memcmp-1.c and gcc.dg/strncmp-2.c test cases are updated by the
> patch to check the additional memcmp expansion for known and unknown lengths.
>
> The undocumented option -mblock-compare-inline-limit was changed slightly to
> mean how many bytes will be compared by the non-loop inline code, the default
> is 31 bytes. As previously, -mblock-compare-inlne-limit=0 will disable all
> inline expansion of memcmp(). The new -mblock-compare-inline-loop-limit option
> sets the upper limit, for lengths above that no expansion is done if the length
> is known at compile time. If the length is unknown, the generated code calls
> glibc memcmp() after comparing that many bytes.
>
> Bootstrap and regtest pass on P8 LE and P9 LE, testing in progress for
> BE (default, P7, and P8). If tests pass, OK for trunk when the first of the
> previously posted patches is approved?
It looks very good. Just some whitespace etc. issues:
> +/* Generate rtl for a load, shift, and compare of less than a full word.
> +
> + LOAD_MODE is the machine mode for the loads.
> + DIFF is the reg for the difference.
> + CMP_REM is the reg containing the remaining bytes to compare.
> + DCOND is the CCUNS reg for the compare if we are doing P9 code with setb.
> + SRC1_ADDR is the first source address.
> + SRC2_ADDR is the second source address.
> + ORIG_SRC1 is the original first source block's address rtx.
> + ORIG_SRC2 is the original second source block's address rtx. */
You have a tab instead of a space at the end there.
> +/* Generate rtl for an overlapping load and compare of less than a
> + full load_mode. This assumes that we have already compared the
> + emit_move_insn (adj_reg, GEN_INT (-addr_adj));
That last line doesn't really make sense; pasto?
> + ORIG_SRC2 is the original second source block's address rtx. */
Dot-space-space-end-of-comment.
> +static void
> +do_overlap_load_compare (machine_mode load_mode, bool isConst,
> + HOST_WIDE_INT bytes_rem, rtx diff,
> + rtx cmp_rem, rtx dcond, rtx src1_addr, rtx src2_addr,
> + rtx orig_src1, rtx orig_src2)
Should use tabs.
> + bool isP7 = (rs6000_cpu == PROCESSOR_POWER7);
> + //bool isP8 = (rs6000_cpu == PROCESSOR_POWER8);
> + bool isP9 = (rs6000_cpu == PROCESSOR_POWER9);
Just remove the unused code please (certainly when it's trivial to recreate
it, like here).
> + /* Anything to move? */
> + HOST_WIDE_INT bytes = 0;
> + if (bytes_is_const)
> + bytes = INTVAL (bytes_rtx);
> +
> + if (bytes_is_const && bytes == 0)
> + return true;
> +
> + /* Limit the amount we compare, if known statically. */
These last two comments end in tabs as well; they shouldn't.
> + /* Allow the option to override the default. */
And more; check the whole file please.
> + if (rs6000_block_compare_inline_loop_limit >= 0)
> + max_bytes = (unsigned HOST_WIDE_INT) rs6000_block_compare_inline_loop_limit;
This doesn't need a cast I think?
> + /* Extend bytes_rtx to word_mode if needed. But, we expect only to
> + maybe have to deal with the case were bytes_rtx is SImode and
> + word_mode is DImode. */
> + if (!bytes_is_const)
> + {
> + if (GET_MODE_SIZE (GET_MODE (bytes_rtx)) > GET_MODE_SIZE (word_mode))
> + return false; /* This is unexpected. */
Comments at the end of line are unexpected, too ;-) Put it a line up?
Or even before the "if".
> + /* Number of bytes per iteration of the unrolled loop. */
> + HOST_WIDE_INT loop_bytes = 2*load_mode_size;
> + /* max iters and bytes compared in the loop. */
> + HOST_WIDE_INT max_loop_iter = max_bytes/loop_bytes;
> + HOST_WIDE_INT max_loop_bytes = max_loop_iter*loop_bytes;
> + int l2lb = floor_log2 (loop_bytes);
Spaces around binary operators.
> + rtx final_label = gen_label_rtx();
Space before () (more of these).
> + /* Difference found is stored here before jump to diff_label. */
> + rtx diff = gen_reg_rtx(word_mode);
Tab at end of comment; space before (.
> + /* If bytes is not const, compare length and branch directly
> + to the cleanup code that can handle 0-16 bytes if length
> + is >= 16. Stash away bytes-max_bytes for the library call. */
Here a tab in the middle of the line.
> + emit_move_insn (iter, GEN_INT (niter));
> + emit_move_insn (loop_cmp, GEN_INT (niter*loop_bytes));
> + emit_move_insn (cmp_rem, GEN_INT (bytes-niter*loop_bytes));
Spaces around operators please.
> + /* Check for < loop_bytes bytes. */
> + (void)do_ifelse (CCmode, LT, bytes_rtx, GEN_INT (loop_bytes),
> + NULL_RTX, cleanup_label);
No cast to void please (here and in the rest of the file). Also, space
after cast.
> + /* Compute bytes to compare in loop if bytes <= max_bytes. */
> + rtx mask = GEN_INT (HOST_WIDE_INT_M1U << l2lb);
> + if (word_mode == DImode)
> + {
> + emit_insn (gen_anddi3 (loop_cmp, bytes_rtx, mask));
> + }
> + else
> + {
> + emit_insn (gen_andsi3 (loop_cmp, bytes_rtx, mask));
> + }
You don't need {} here.
> + if ( need_library_call )
No spaces inside ().
> + if (!bytes_is_const || bytes >= loop_bytes)
> + {
> + /* It should not be possible to come here if remaining bytes is
> + < 16. Compute number of loop iterations. We compare
Please do an assert then, if something should not be possible.
> + JUMP_LABEL(j) = library_call_label;
Space before (, also in macro invocations (but not in macro definitions
obviously).
Please fix those trivialities, and it's okay for trunk (after the
rtlanal patch is approved too). Thanks!
Segher