Re: [PATCH][AArch64] Optimized implementation of search_line_fast for the CPP lexer


On Mon, Nov 07, 2016 at 01:39:53PM +0000, Richard Earnshaw (lists) wrote:
> This patch contains an implementation of search_line_fast for the CPP
> lexer.  It's based in part on the AArch32 (ARM) code but incorporates
> new instructions available in AArch64 (reduction add operations) plus
> some tricks for reducing the realignment overheads.  We assume a page
> size of 4k, but that's a safe assumption -- AArch64 systems can never
> have a smaller page size than that: on systems with larger pages we will
> go through the realignment code more often than strictly necessary, but
> it's still likely to be in the noise (less than 0.5% of the time).
> Bootstrapped on aarch64-none-linux-gnu.

Some very minor nits wrt. style for the Advanced SIMD intrinsics, otherwise
OK from me.

> 
> +  const uint8x16_t xmask = (uint8x16_t) vdupq_n_u64 (0x8040201008040201ULL);


It is a pedantic point, but these casts are a GNU extension; the
"portable" way to write this would be:

  vreinterpretq_u8_u64 (vdupq_n_u64 (0x8040201008040201ULL));
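
(vreinterpretq_u8_u64 is defined by the ACLE, so it is portable across
compilers, and it should generate no code; it just retypes the bits in
the register.)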

> +
> +#ifdef __AARCH64EB
> +  const int16x8_t shift = {8, 8, 8, 8, 0, 0, 0, 0};

This sort of vector initialisation is a bit scary for user programmers,
as we shouldn't generally mix Neon intrinsics with the GNU vector
extensions (for exactly the reason you have here: keeping BE and LE
straight takes extra effort).

This could be written portably as:

  vcombine_s16 (vdup_n_s16 (8), vdup_n_s16 (0));

Or if you prefer to be explicit about the elements:

  int16_t buf[] = {8, 8, 8, 8, 0, 0, 0, 0};
  int16x8_t shift = vld1q_s16 (buf);
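
A nice property of the vld1q_s16 form is that lane numbering follows
memory order on both big- and little-endian, so I'd expect a single
constant to serve both cases and the #ifdef to disappear entirely
(worth checking on a big-endian target, of course).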

> +#else
> +  const int16x8_t shift = {0, 0, 0, 0, 8, 8, 8, 8};
> +#endif
> +
> +  unsigned int found;
> +  const uint8_t *p;
> +  uint8x16_t data;
> +  uint8x16_t t;
> +  uint16x8_t m;
> +  uint8x16_t u, v, w;
> +
> +  /* Align the source pointer.  */
> +  p = (const uint8_t *)((uintptr_t)s & -16);
> +
> +  /* Assuming random string start positions, with a 4k page size we'll take
> +     the slow path about 0.37% of the time.  */
> +  if (__builtin_expect ((AARCH64_MIN_PAGE_SIZE
> +			 - (((uintptr_t) s) & (AARCH64_MIN_PAGE_SIZE - 1)))
> +			< 16, 0))
> +    {
> +      /* Slow path: the string starts near a possible page boundary.  */
> +      uint32_t misalign, mask;
> +
> +      misalign = (uintptr_t)s & 15;
> +      mask = (-1u << misalign) & 0xffff;
> +      data = vld1q_u8 (p);
> +      t = vceqq_u8 (data, repl_nl);
> +      u = vceqq_u8 (data, repl_cr);
> +      v = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
> +      w = vorrq_u8 (u, vceqq_u8 (data, repl_qm));
> +      t = vorrq_u8 (v, w);

Can you trust the compiler to perform this reassociation itself, rather
than doing it manually?  That would let you write this in the more
natural form:

      t = vceqq_u8 (data, repl_nl);
      t = vorrq_u8 (t, vceqq_u8 (data, repl_cr));
      t = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
      t = vorrq_u8 (t, vceqq_u8 (data, repl_qm));
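
If the reassociation can't be relied on, you could also keep the
two-level reduction tree explicit while still dropping the u/v/w
temporaries, e.g. (untested sketch):

      t = vorrq_u8 (vorrq_u8 (vceqq_u8 (data, repl_nl),
                              vceqq_u8 (data, repl_cr)),
                    vorrq_u8 (vceqq_u8 (data, repl_bs),
                              vceqq_u8 (data, repl_qm)));

which preserves the shorter dependency chain regardless of what the
compiler does.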

> +      t = vandq_u8 (t, xmask);
> +      m = vpaddlq_u8 (t);
> +      m = vshlq_u16 (m, shift);
> +      found = vaddvq_u16 (m);
> +      found &= mask;
> +      if (found)
> +	return (const uchar*)p + __builtin_ctz (found);
> +    }
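
For what it's worth, the bitmask reduction above is neat.  Spelling out
how I read it, for a match in byte 5 only (little-endian lane order,
untested trace):

  t          = {0,0,0,0,0,0xff,0,0, ...}   /* compare result */
  t & xmask  = {0,0,0,0,0,0x20,0,0, ...}   /* byte n keeps bit n%8 */
  vpaddlq_u8 = {0,0,0x20,0, 0,0,0,0}       /* 16-bit lanes */
  vshlq_u16  : low lanes << 0, high lanes << 8
  vaddvq_u16 = 0x0020, so __builtin_ctz gives 5.

So bit n of "found" corresponds to byte n of the vector.  The 0.37%
figure also checks out: 15 of the 4096 start offsets within a page
leave fewer than 16 bytes before the boundary, and 15/4096 is about
0.37%.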
> +  else
> +    {
> +      data = vld1q_u8 ((const uint8_t *) s);
> +      t = vceqq_u8 (data, repl_nl);
> +      u = vceqq_u8 (data, repl_cr);
> +      v = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
> +      w = vorrq_u8 (u, vceqq_u8 (data, repl_qm));
> +      t = vorrq_u8 (v, w);
> +      if (__builtin_expect (vpaddd_u64 ((uint64x2_t)t), 0))
> +	goto done;

As above, this cast is a GNU extension:

    if (__builtin_expect (vpaddd_u64 (vreinterpretq_u64_u8 (t)), 0))

> +    }
> +
> +  do
> +    {
> +      p += 16;
> +      data = vld1q_u8 (p);
> +      t = vceqq_u8 (data, repl_nl);
> +      u = vceqq_u8 (data, repl_cr);
> +      v = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
> +      w = vorrq_u8 (u, vceqq_u8 (data, repl_qm));
> +      t = vorrq_u8 (v, w);
> +    } while (!vpaddd_u64 ((uint64x2_t)t));

Likewise here.
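
That is:

    } while (!vpaddd_u64 (vreinterpretq_u64_u8 (t)));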

Thanks,
James

