The get_ref hot function in 525.x264_r inlines a hot helper that performs a vector average: void pixel_avg( unsigned char *dst, int i_dst_stride, unsigned char *src1, int i_src1_stride, unsigned char *src2, int i_src2_stride, int i_width, int i_height ) { for( int y = 0; y < i_height; y++ ) { for( int x = 0; x < i_width; x++ ) dst[x] = ( src1[x] + src2[x] + 1 ) >> 1; dst += i_dst_stride; src1 += i_src1_stride; src2 += i_src2_stride; } } GCC 9 already knows how to generate vector average instructions (PR 85694). For aarch64 it generates a 16x vectorised loop. Runtime profiling of the arguments to this function, however, show that the >50% of the time the i_width has value 8 during runtime and therefore the vector loop is skipped in favour of a scalar fallback: 32.07% 40ed2c ldrb w3, [x0,x5] 11.41% 40ed30 ldrb w11, [x4,x5] 40ed34 add w3, w3, w11 40ed38 add w3, w3, #0x1 40ed3c asr w3, w3, #1 0.71% 40ed40 strb w3, [x2,x5] 40ed44 add x5, x5, #0x1 40ed48 cmp w6, w5 40ed4c b.gt <loop> The most frequent runtime combinations of inputs to this function are: 29240545 i_height: 8, i_width: 8, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344 22714355 i_height: 16, i_width: 16, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344 19669512 i_height: 8, i_width: 8, i_dst_stride: 16, i_src1_stride: 704, i_src2_stride: 704 3689216 i_height: 16, i_width: 8, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344 3670639 i_height: 8, i_width: 16, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344 That's a shame. AArch64 supports the V8QI form of the vector average instruction (and advertises it through optabs). With --param vect-epilogues-nomask=1 we already generate something like: if (bytes_left > 16) { while (bytes_left > 16) 16x_vectorised; if (bytes_left > 8) 8x_vectorised; unrolled_scalar_epilogue; } else scalar_loop; Could we perhaps generate: while (bytes_left > 16) 16x_vectorised; if (bytes_left > 8) 8x_vectorised; unrolled_scalar_epilogue; // or keep it as a rolled scalar_loop to save on codesize? Basically I'm looking for a way to take advantage of the 8x vectorised form.
Yeah, the epilogue stuff is on my list of things to revisit for GCC 10.
(In reply to Richard Biener from comment #1) > Yeah, the epilogue stuff is on my list of things to revisit for GCC 10. I think this isn't necessarily about the epilogue (the main vector loop is not entered here), but rather the scalar fallback. Unless we merge the fallback and epilogue into one.
I'll be taking a look at this one as a part of GCC 10 as well.
*** Bug 91460 has been marked as a duplicate of this bug. ***
Author: avieira Date: Tue Oct 29 13:15:46 2019 New Revision: 277569 URL: https://gcc.gnu.org/viewcvs?rev=277569&root=gcc&view=rev Log: [vect]PR 88915: Vectorize epilogues when versioning loops gcc/ChangeLog: 2019-10-29 Andre Vieira <andre.simoesdiasvieira@arm.com> PR 88915 * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter and make the valueize function pointer also take a void pointer. * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap around vn_valueize, to call it without a context. (process_bb): Use vn_valueize_wrapper instead of vn_valueize. * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. (~_loop_vec_info): Release epilogue_vinfos. (vect_analyze_loop_costing): Use knowledge of main VF to estimate number of iterations of epilogue. (vect_analyze_loop_2): Adapt to analyse main loop for all supported vector sizes when vect-epilogues-nomask=1. Also keep track of lowest versioning threshold needed for main loop. (vect_analyze_loop): Likewise. (find_in_mapping): New helper function. (update_epilogue_loop_vinfo): New function. (vect_transform_loop): When vectorizing epilogues re-use analysis done on main loop and call update_epilogue_loop_vinfo to update it. * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert stmts on loop preheader edge. (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. Update epilogues NITERS and construct ADVANCE to update epilogues data references where needed. * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. (vect_do_peeling, vect_update_inits_of_drs, determine_peel_for_niter, vect_analyze_loop): Add or update declarations. * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already created loop_vec_info's for epilogues when available. Otherwise analyse epilogue separately. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-loop-niter.c trunk/gcc/tree-ssa-loop-niter.h trunk/gcc/tree-ssa-sccvn.c trunk/gcc/tree-vect-loop-manip.c trunk/gcc/tree-vect-loop.c trunk/gcc/tree-vectorizer.c trunk/gcc/tree-vectorizer.h
Can this issue be closed now Andre?
Fixed on trunk.