Bug 88915

Summary: Try smaller vectorisation factors in scalar fallback
Product: gcc Reporter: ktkachov
Component: tree-optimizationAssignee: avieira
Status: RESOLVED FIXED    
Severity: normal CC: dimhen, hjl.tools, jamborm
Priority: P3 Keywords: missed-optimization
Version: 9.0   
Target Milestone: 10.0   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2019-01-21 00:00:00
Bug Depends on:    
Bug Blocks: 53947    

Description ktkachov 2019-01-18 16:31:31 UTC
The get_ref hot function in 525.x264_r inlines a hot helper that performs a vector average:
void pixel_avg( unsigned char *dst, int i_dst_stride,
                               unsigned char *src1, int i_src1_stride,
                               unsigned char *src2, int i_src2_stride,
                               int i_width, int i_height )
 {
     for( int y = 0; y < i_height; y++ )
     {
         for( int x = 0; x < i_width; x++ )
             dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
         dst += i_dst_stride;
         src1 += i_src1_stride;
         src2 += i_src2_stride;
     }
 }

GCC 9 already knows how to generate vector average instructions (PR 85694).
For aarch64 it generates a 16x vectorised loop.
Runtime profiling of the arguments to this function, however, show that the >50% of the time the i_width has value 8 during runtime and therefore the vector loop is skipped in favour of a scalar fallback:
32.07%  40ed2c		ldrb	w3, [x0,x5]
11.41%  40ed30		ldrb	w11, [x4,x5]
        40ed34		add	w3, w3, w11
        40ed38		add	w3, w3, #0x1
        40ed3c		asr	w3, w3, #1
0.71%   40ed40		strb	w3, [x2,x5]
        40ed44		add	x5, x5, #0x1
        40ed48		cmp	w6, w5
        40ed4c		b.gt	<loop>

The most frequent runtime combinations of inputs to this function are:
29240545 i_height: 8, i_width: 8, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344
22714355 i_height: 16, i_width: 16, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344
19669512 i_height: 8, i_width: 8, i_dst_stride: 16, i_src1_stride: 704, i_src2_stride: 704
3689216 i_height: 16, i_width: 8, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344
3670639 i_height: 8, i_width: 16, i_dst_stride: 16, i_src1_stride: 1344, i_src2_stride: 1344

That's a shame. AArch64 supports the V8QI form of the vector average instruction (and advertises it through optabs).
With --param vect-epilogues-nomask=1 we already generate something like:
if (bytes_left > 16)
{
  while (bytes_left > 16)
    16x_vectorised;
  if (bytes_left > 8)
    8x_vectorised;
  unrolled_scalar_epilogue;
}
else
  scalar_loop;

Could we perhaps generate:
  while (bytes_left > 16)
    16x_vectorised;
  if (bytes_left > 8)
    8x_vectorised;
  unrolled_scalar_epilogue; // or keep it as a rolled scalar_loop to save on codesize?

Basically I'm looking for a way to take advantage of the 8x vectorised form.
Comment 1 Richard Biener 2019-01-21 08:55:01 UTC
Yeah, the epilogue stuff is on my list of things to revisit for GCC 10.
Comment 2 ktkachov 2019-01-21 09:00:45 UTC
(In reply to Richard Biener from comment #1)
> Yeah, the epilogue stuff is on my list of things to revisit for GCC 10.

I think this isn't necessarily about the epilogue (the main vector loop is not entered here), but rather the scalar fallback.

Unless we merge the fallback and epilogue into one.
Comment 3 Tamar Christina 2019-04-09 18:14:18 UTC
I'll be taking a look at this one as a part of GCC 10 as well.
Comment 4 avieira 2019-08-16 09:21:46 UTC
*** Bug 91460 has been marked as a duplicate of this bug. ***
Comment 5 avieira 2019-10-29 13:16:57 UTC
Author: avieira
Date: Tue Oct 29 13:15:46 2019
New Revision: 277569

URL: https://gcc.gnu.org/viewcvs?rev=277569&root=gcc&view=rev
Log:
[vect]PR 88915: Vectorize epilogues when versioning loops

gcc/ChangeLog:
2019-10-29  Andre Vieira  <andre.simoesdiasvieira@arm.com>

	PR 88915
	* tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration.
	* tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter
	and make the valueize function pointer also take a void pointer.
	* gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap
	around vn_valueize, to call it without a context.
	(process_bb): Use vn_valueize_wrapper instead of vn_valueize.
	* tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos.
	(~_loop_vec_info): Release epilogue_vinfos.
	(vect_analyze_loop_costing): Use knowledge of main VF to estimate
	number of iterations of epilogue.
	(vect_analyze_loop_2): Adapt to analyse main loop for all supported
	vector sizes when vect-epilogues-nomask=1.  Also keep track of lowest
	versioning threshold needed for main loop.
	(vect_analyze_loop): Likewise.
	(find_in_mapping): New helper function.
	(update_epilogue_loop_vinfo): New function.
	(vect_transform_loop): When vectorizing epilogues re-use analysis done
	on main loop and call update_epilogue_loop_vinfo to update it.
	* tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert
	stmts on loop preheader edge.
	(vect_do_peeling): Enable skip-vectors when doing loop versioning if
	we decided to vectorize epilogues.  Update epilogues NITERS and
	construct ADVANCE to update epilogues data references where needed.
	* tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos.
	(vect_do_peeling, vect_update_inits_of_drs,
	 determine_peel_for_niter, vect_analyze_loop): Add or update
	declarations.
	* tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already
	created loop_vec_info's for epilogues when available.  Otherwise analyse
	epilogue separately.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-loop-niter.c
    trunk/gcc/tree-ssa-loop-niter.h
    trunk/gcc/tree-ssa-sccvn.c
    trunk/gcc/tree-vect-loop-manip.c
    trunk/gcc/tree-vect-loop.c
    trunk/gcc/tree-vectorizer.c
    trunk/gcc/tree-vectorizer.h
Comment 6 Tamar Christina 2019-11-22 14:24:42 UTC
Can this issue be closed now Andre?
Comment 7 Richard Biener 2019-11-27 08:48:28 UTC
Fixed on trunk.