This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] Fix PR82255 (vectorizer cost model overcounts some vector load costs)
On Thu, Sep 21, 2017 at 3:15 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Sep 21, 2017, at 2:37 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>
>> On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
>> <wschmidt@linux.vnet.ibm.com> wrote:
>>> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>>>>
>>>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>>>>>
>>>>> But I think this shows up another problem. In the vectorised loop,
>>>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
>>>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
>>>>> even for the loads. So I think it's a double whammy: we're quadrupling
>>>>> the real cost via the excessive ncopies_for_cost, and we're using the
>>>>> wrong access type too.
>>>>>
>>>>> The patch below should fix the second problem but isn't much use without
>>>>> the first. I think we need to adjust ncopies_for_cost in the recursive
>>>>> calls as well.
>>>>
>>>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>>>> once for the whole SLP instance, which is not ideal since we know the
>>>> number of loads and stores doesn't follow the same rules.
>>>>
>>>> What about the following? I *think* that group_size / nunits is sufficient for
>>>> VMAT_STRIDED_SLP and VMAT_CONTIGUOUS for which permuted loads are
>>>> not possible (already handled differently), but I could well be missing something.
>>>
>>> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...
>>
>> I don't think any such simple approach will result in appropriate costing.
>> The current costing is conservative and you definitely don't want to be
>> overly optimistic here ;)
>>
>> You'd have to fully emulate the vectorizable_load code. Which means we should
>> probably refactor the code-gen part:
>>
>> if (memory_access_type == VMAT_ELEMENTWISE
>> || memory_access_type == VMAT_STRIDED_SLP)
>> {
>> ...
>>
>> where it computes ltype, nloads, etc., do that computation up-front in the
>> analysis phase, storing the result somewhere in stmt_vinfo (err, or slp_node
>> or both -- think of hybrid SLP and/or the stmt used in different SLP
>> contexts) and use the info during transform. And hope to have enough info to
>> statically compute the number of loads and their size/alignment at costing
>> time.
>
> I suppose so. I was thinking that if these are truly strided loads, then the
> cost is predictable -- but only for targets that have hardware strided loads
> available, which are few, so this isn't conservative enough. The code in
> question didn't deal with VMAT_ELEMENTWISE, so the factoring is even
> more complicated -- the structures of the costing logic and the code gen
> logic don't match well at all.
>
> I'm unlikely to have time for a complex solution right now, so I'll stick a pointer
> to this conversation in the bugzilla and take my name off in case somebody
> gets to it sooner. But I'll probably keep poking at it in spare moments.
I hope to get to some costmodel cleanups for GCC 8.
Richard.
> Thanks,
> Bill
>>
>> Or go the full way of restructuring costing to have a prologue/body_cost_vec
>> in stmt_info and slp_node so that we can compute and store the cost
>> from vectorizable_* rather than duplicating the hard parts in SLP costing code.
>>
>> Richard.
>>
>>> Bill
>>>>
>>>> Index: gcc/tree-vect-slp.c
>>>> ===================================================================
>>>> --- gcc/tree-vect-slp.c (revision 252760)
>>>> +++ gcc/tree-vect-slp.c (working copy)
>>>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>> stmt_info = vinfo_for_stmt (stmt);
>>>> if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>>> {
>>>> + int group_size = GROUP_SIZE (stmt_info);
>>>> + int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>> vect_memory_access_type memory_access_type
>>>> - = (STMT_VINFO_STRIDED_P (stmt_info)
>>>> + = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>>> ? VMAT_STRIDED_SLP
>>>> : VMAT_CONTIGUOUS);
>>>> if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>>>> - vect_model_store_cost (stmt_info, ncopies_for_cost,
>>>> + vect_model_store_cost (stmt_info, group_size / nunits,
>>>> memory_access_type, vect_uninitialized_def,
>>>> node, prologue_cost_vec, body_cost_vec);
>>>> else
>>>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>> + nunits - 1) / nunits);
>>>> ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>>> }
>>>> + else
>>>> + ncopies_for_cost = group_size / nunits;
>>>> /* Record the cost for the vector loads. */
>>>> vect_model_load_cost (stmt_info, ncopies_for_cost,
>>>> memory_access_type, node, prologue_cost_vec,
>>>>
>>>> Bill
>>>>
>>>
>>
>