Re: RFC: [ARM] Disable peeling


On Wed, Dec 12, 2012 at 9:06 AM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> On 11 December 2012 13:26, Tim Prince <n8tm@aol.com> wrote:
>> On 12/11/2012 5:14 AM, Richard Earnshaw wrote:
>>>
>>> On 11/12/12 09:56, Richard Biener wrote:
>>>>
>>>> On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw <rearnsha@arm.com>
>>>> wrote:
>>>>>
>>>>> On 11/12/12 09:45, Richard Biener wrote:
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen <andi@firstfloor.org>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Jan Hubicka <hubicka@ucw.cz> writes:
>>>>>>>
>>>>>>>> Note that I think Core has similar characteristics - at least for
>>>>>>>> string operations it fares well with unaligned accesses.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nehalem and later have very fast unaligned vector loads. There's
>>>>>>> still some penalty when they cross cache lines, however.
>>>>>>>
>>>>>>> IIRC the rule of thumb is to do unaligned accesses for 128-bit
>>>>>>> vectors, but to avoid them for 256-bit vectors, because the
>>>>>>> cache-line-crossing penalty is larger on Sandy Bridge and more
>>>>>>> likely with the larger vectors.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yes, I think the rule was that using the unaligned instruction
>>>>>> variants carries no penalty when the actual access is aligned, but
>>>>>> that aligned accesses are still faster than unaligned accesses.
>>>>>> Thus peeling for alignment _is_ a win.
>>>>>> I also seem to remember that the story for unaligned stores vs.
>>>>>> unaligned loads is usually different.
>>>>>
>>>>>
>>>>>
>>>>> Yes, it's generally the case that unaligned loads are slightly more
>>>>> expensive than unaligned stores, since the stores can often merge in
>>>>> a store buffer with little or no penalty.
>>>>
>>>>
>>>> It was the other way around on AMD CPUs AFAIK - unaligned stores
>>>> forced flushes of the store buffers, which is why the vectorizer
>>>> first and foremost tries to align stores.
>>>>
>>>
>>> In which case, which to align should be a question that the middle
>>> end (ME) asks the back end (BE).
>>>
>>> R.
>>>
>>>
>> I see that this thread is no longer about ARM.
>> Yes, when peeling for alignment, aligned stores should take precedence
>> over aligned loads.
>> "Ivy Bridge" (corei7-3) is supposed to have corrected the situation on
>> "Sandy Bridge" (corei7-2), where an unaligned 256-bit load is more
>> expensive than explicitly split (128-bit) loads.  There aren't yet any
>> production multi-socket corei7-3 platforms.
>> It seems difficult to make the best decision between 128-bit unaligned
>> accesses without peeling and 256-bit accesses with peeling for alignment
>> (unless the loop count is known to be too small for the latter to come
>> up to speed).  The facilities various compilers provide to let the
>> programmer guide this choice are rather strange and probably not to be
>> counted on.
>> In my experience, "Westmere" unaligned 128-bit loads are more expensive
>> than explicitly split (64-bit) loads, but the architecture manuals
>> disagree with this finding.  gcc already does a good job for corei7[-1]
>> in such situations.
>>
>> --
>> Tim Prince
>>
>
> Since this thread is also about x86 now, I have tried to look at how
> things are implemented on this target.
> People have mentioned Nehalem, Sandy Bridge, Ivy Bridge and Westmere;
> I have searched for occurrences of these strings in GCC, and I
> couldn't find anything that would imply a different behavior with
> respect to unaligned loads on 128/256-bit vectors. Is it still
> unimplemented?
>
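
To make the peeling-for-alignment strategy discussed above concrete, here
is a minimal C sketch (not GCC's implementation; add_arrays, dst, src and
n are hypothetical names) of peeling scalar iterations until the store
pointer is aligned, which is the side the vectorizer prefers to align on
targets where unaligned stores are the expensive case:

#include <stddef.h>
#include <stdint.h>

void
add_arrays (float *dst, const float *src, size_t n)
{
  size_t i = 0;

  /* Scalar prologue: peel iterations until DST is 16-byte aligned.
     SRC may remain unaligned; its loads would use the unaligned forms.  */
  while (i < n && ((uintptr_t) (dst + i) & 15) != 0)
    {
      dst[i] += src[i];
      i++;
    }

  /* Vector body: DST accesses are now 16-byte aligned, so a compiler
     can use aligned stores here (and unaligned loads for SRC).  */
  for (; i + 4 <= n; i += 4)
    {
      dst[i + 0] += src[i + 0];
      dst[i + 1] += src[i + 1];
      dst[i + 2] += src[i + 2];
      dst[i + 3] += src[i + 3];
    }

  /* Scalar epilogue for the remaining iterations.  */
  for (; i < n; i++)
    dst[i] += src[i];
}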

i386.c has

   {
      /* When not optimize for size, enable vzeroupper optimization for
         TARGET_AVX with -fexpensive-optimizations and split 32-byte
         AVX unaligned load/store.  */
      if (!optimize_size)
        {
          if (flag_expensive_optimizations
              && !(target_flags_explicit & MASK_VZEROUPPER))
            target_flags |= MASK_VZEROUPPER;
          if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
              && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
            target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
          if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
              && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
            target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
          /* Enable 128-bit AVX instruction generation
             for the auto-vectorizer.  */
          if (TARGET_AVX128_OPTIMAL
              && !(target_flags_explicit & MASK_PREFER_AVX128))
            target_flags |= MASK_PREFER_AVX128;
        }
    }
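
For reference, those masks correspond to the user-visible options
-mavx256-split-unaligned-load, -mavx256-split-unaligned-store,
-mprefer-avx128 and -mvzeroupper, and whether the
x86_avx256_split_unaligned_* tuning bits are set depends on the selected
-march/-mtune.  A small test case for experimenting with them (vecadd and
the file name are hypothetical; the exact code generated depends on the
GCC version and tuning):

/* Compile e.g. with
     gcc -O3 -mavx -S vecadd.c
   and compare against
     gcc -O3 -mavx -mno-avx256-split-unaligned-load \
         -mno-avx256-split-unaligned-store -S vecadd.c
   With the split options in effect, 256-bit accesses that end up
   unaligned are expected to be emitted as two 128-bit halves
   (vmovups xmm plus vinsertf128/vextractf128) rather than full-width
   vmovups ymm; the vectorizer may of course also peel or version the
   loop for alignment.  */
void
vecadd (float *a, const float *b, int n)
{
  int i;

  for (i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}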


-- 
H.J.

