[lno] [patch] misaligned loads support

This patch adds support for handling unaligned loads in the vectorizer.
The parts in tree-vectorizer.c are largely based on Ayal's implementation
on apple-ppc branch.
Passed bootstrap on ppc-darwin and i686-pc-linux, and SPEC on ppc-darwin.
vect-27.c and vect-52.c now get vectorized, and a new testcase vect-72.c.

It's not entirely complete, but complete enough that there's something
working that I can commit to lno as a preview. I hope it can be
pre-reviewed before I start working on porting it to mainline. Especially
the parts I'm less confident about - the parts of the patch that deal with
RTL expansion (expr.c, optabs.c, builtins.c), and the issues I bring up

(1) The patch includes a little more than what is actually supported; i.e,
you'll find a declaration of vec_realign_store_optab although misaligned
stores are not supported yet, and a declaration of
BUILT_IN_BUILD_CC_MASK_FOR_LOAD even though it's not implemented. I
included these as an indication of how things are intended to look like and
so I could get feedback on whether the overall picture conforms with the
meaning of the new tree-codes and optabs we have discussed ( These should probably not
be included in the patch for mainline?

(2) I did not systematically add consideration of the new
considered; I only added the minimum that allowed me to build the compiler
and pass the vectorizer testcases and SPEC. I'll complete this before I
prepare the patch for mainline. Anything else to look for besides 'grep'ing

(3) The function vect_create_data_ref used to return a "(*vp)[indx]" (vp
was a pointer to an array of vectypes). Now it is renamed to
vect_create_data_ref_ptr and returns "vp" pointing to the same address,
computed as follows:
vp = vp_init + (indx * vectype_size_in_bytes). The caller is responsible to
create an (ALIGN/MISALIGNED_)INDIRECT_REF based on vp.
This change triggered a failure in vect-6.c, which seems to be caused by a
problem in DCE that has been fixed on mainline. I disabled it for now, and
will verify that it passes when I port this patch to mainline.

(4) In tree-pretty-print.c I have ALIGN_INDIRECT_REF represented as "A*"
and MISALIGNED_INDIRECT_REF represented as "M*". e.g:
vect_var_.61_65 = A*vect_p.62_64
vect_var_.61_65 = M*vect_p.62_64{misalignment=1}
Maybe using a "*" like a regular INDIRECT_REF is better?

(5) with respect to:
> I'm also thinking that perhaps the "mask" versions should not use
> an optab at all, but rather a builtin function.  The reason here
> is that the form of the mask differs between systems.  For Altivec,
> the mask is a complete 16-byte vector.  For SPE, the mask is a CCmode
> value.  If we have this as a builtin, then the target can easily
> influence the return type of the function, and thus the type of the
> variable that we create for the loop.

I implemented it by introducing a different generic builtin for each type
of mask - BUILT_IN_BUILD_VECTOR_MASK_FOR_LOAD for a vector mask, which is
implemented for altivec, and just as an example I also introduced
BUILT_IN_BUILD_CC_MASK_FOR_LOAD, which may be implemented for SPE. More
forms can be added. Which builtin form will be generated is determined by
checking available target support (i.e, HAVE_build_vector_mask_for_load or
HAVE_build_cc_mask_for_load). Is this a reasonable approach?
An alternative way could be to introduce a single generic builtin whose
return type is determined by the target, as follows (in builtins.def):
#ifdef HAVE_build_vector_mask_for_load
DEF_GCC_BUILTIN  (BUILT_IN_BUILD_MASK, "build_mask_for_load",
#if HAVE_build_cc_mask_for_load
DEF_GCC_BUILTIN  (BUILT_IN_BUILD_MASK, "build_mask_for_load",
? (this one seems cleaner. I think I had a problem when I tried it, but I
don't remember what it was...)

(6) I hardcoded '16' in the declaration of BT_CHAR_VECTOR in
builtin-types.def. I wanted to use UNITS_PER_SIMD_WORD instead, but that
would require including default.h wherever DEF_PRIMITIVE_TYPE and
DEF_FUNCTION_TYPE_1 are defined, so I wasn't sure about that. ?

(7) addr_floor_v* basically does nothing (copies the input to
the output). Is that ok, or is there a better way to represent that?

(8) addr_misaligned_ is implemented only for 16QI because
sse2_movdqu expects only 16QI. As a result I have misalignment support on
i386 only for chars. I know this is wrong, but I wanted to get something
preliminary done quickly so I could get feedback on how to really do this
for i386... ?

Given this testcase:
  char ia[N];
  char ib[N+1] = {....};
  for (i = 1; i < N+1; i++){
      ia[i-1] = ib[i];

This is what is generated for altivec:
        addi r9,r1,193
        li r0,8
        neg r2,r9
        mtctr r0
        lvsr v12,0,r2
        addi r11,r1,208
        lvx v13,0,r9
        li r2,0
        addi r9,r1,64
        lvx v0,r2,r11
        vperm v1,v13,v0,v12
        vor v13,v0,v0
        stvx v1,r2,r9
        addi r2,r2,16
        bdnz L6

Given the above testcase, this is what is generated for i386:
        movb    %al, -280(%ebp,%eax)
        incl    %eax
        cmpl    $129, %eax
        jne     .L4
        leal    -279(%ebp), %ebx
        xorl    %ecx, %ecx
        leal    -136(%ebp), %esi
        movl    %ebx, %edx
        movl    %esi, %eax
        incl    %ecx
        movdqu  (%edx), %xmm0
        subl    %ebx, %eax
        movdqa  %xmm0, (%edx,%eax)
        addl    $16, %edx
        cmpl    $8, %ecx
        jne     .L6




        (REALIGN_LOAD_EXPR, REALIGN_STORE_EXPR): New tree-codes.
        * tree.h (REF_ORIGINAL): Consider ALIGN_INDIRECT_REF and
        * alias.c (get_alias_set, nonoverlapping_memrefs_p): Likewise.
        * tree-gimple.c (get_base_address): Likewise.
        * tree-ssa-loop-ivopts.c (for_each_index, peel_address): Likewise.
        * tree-pretty-print.c (op_prio): Likewise.
        (dump_generic_node): Likewise + consider REALIGN_LOAD_EXPR.
        * tree-ssa-operands.c (get_expr_operands): Same.
        * expr.c (safe_from_p, expand_expr_real_1, rewrite_address_base)
        (find_interesting_uses_address): Consider
        (expand_expr_real_1): Consider REALIGN_LOAD_EXPR.
        * optabs.h (vec_realign_store_optab, vec_realign_load_optab)
        (addr_floor_optab, addr_misaligned_optab): New optabs.
        (OTI_vec_realign_store, OTI_vec_realign_load, OTI_addr_floor)
        (OTI_addr_misaligned): New optab_index values for the above new
        (expand_realign_op, expand_addr_floor_op,
        Declaration for new functions.
        * optabs.c (optab_for_tree_code): Add new cases for the above
        new tree-codes.
        (expand_realign_op, expand_addr_floor_op,
        New functions.
        (init_optabs): Init vec_realign_load_optab, addr_floor_optab and
        * genopinit.c (optabs): Handle above new optabs.

        * builtin-types.def (BT_CHAR_VECTOR, BT_FN_CHAR_VECTOR_PTR): New
        * builtins.def (BUILT_IN_BUILD_VECTOR_MASK_FOR_LOAD): New builtin.
        (BUILT_IN_BUILD_CC_MASK_FOR_LOAD): New builtin.
        * builtins.c (expand_builtin_build_mask_for_load): New function.
        (exapnd_builtin): New cases for BUILT_IN_BUILD_VECTOR_MASK_FOR_LOAD

        * config/rs6000/ (build_vector_mask_for_load): New
        (addr_floor_v4si, addr_floor_v4sf, addr_floor_v4hi,
        New define_expand.
        (vec_realign_load_v4si, vec_realign_load_v4sf,
        (vec_realign_load_v16qi): New define_insn.
        * config/i386/ (addr_misaligned_v16qi): New define_expand.

        * tree-vectorizer.c (vect_create_data_ref): Renamed to
        vect_create_data_ref_ptr. Functionality changes reflected in the
        function documentation.
        (offset, initial_address, only_init): New arguments.
        (vect_ptr): New pointer to vectype rather than pointer to array of
        (vectorizable_store): Call vect_create_data_ref_ptr with additional
        arguments, and create an indirect_ref with its return value
        Check aligned_access_p.
        (vect_create_cond_for_align_checks): Call vect_create_data_ref_ptr
        additional arguments.
        (vect_create_addr_base_for_vector_ref): Takes an additional
argument -
        offset. Creates &(base[init_val+offset]) instead of
&(base[init_val) if
        offst is provided.
        (vectorizable_load): Handle misaligned loads, using
        scheme with REALIGN_LOAD_EXPR and ALIGN_INDIRECT_REF if
        vec_realign_load_optab and addr_floor_optab are supported, or using
        regular scheme (without software-pipelining) with
        MISALIGNED_INDIRECT_REF if addr_misaligned_optab is supported.
        (BUILT_IN_build_mask_for_load): New variable.
        (vect_enhance_data_refs_alignment): Don't do versioning for
        loads/stores that can be vectorized. Call vectorizable_load/store
        initialize STMT_VINFO_VECTYPE.
        (vect_analyze_data_refs_alignment): Don't fail vectorization in the
        presence of misaligned loads.

        (vect_create_addr_base_for_vector_ref, vect_create_data_ref)
        (vect_compute_data_ref_alignment): Call unshare_expr.

        (add_loop_guard_on_edge, vect_create_index_for_vector_ref)
        (vect_finish_stmt_generation, vect_compute_data_refs_alignment)
        (vect_analyze_data_ref_access): Minor editting fixes (don't
overflow 80

testsuite Changelog.lno:

        * gcc.dg/vect/vect-72.c: New test.
        * gcc.dg/vect/vect-27.c: Now vectorized on ppc*.
        * gcc.dg/vect/vect-52.c: Now vectorized on ppc*.
        * gcc.dg/vect/vect-6.c: Temporarily changed from run to compile
        * gcc.dg/vect/vect-26.c: Use sse2 instead of sse.
        * gcc.dg/vect/vect-27.c: Use sse2 instead of sse.
        * gcc.dg/vect/vect-28.c: Use sse2 instead of sse.
        * gcc.dg/vect/vect-29.c: Use sse2 instead of sse.
        * gcc.dg/vect/vect-4?.c: Use sse2 instead of sse.
        * gcc.dg/vect/vect-5?.c: Use sse2 instead of sse.
        * gcc.dg/vect/vect-60.c: Use sse2 instead of sse.
        * gcc.dg/vect/vect-61.c: Use sse2 instead of sse.

(See attached file: patch.Sept10)

