[rs6000] Expand vec_ld and vec_st during parsing to improve performance

Message ID 1461017157.18355.108.camel@oc8801110288.ibm.com
State New

Commit Message

Bill Schmidt April 18, 2016, 10:05 p.m. UTC
Hi,

Expanding built-ins in the usual way (leaving them as calls until
expanding into RTL) restricts the amount of optimization that can be
performed on the code represented by the built-ins.  This has been
observed to be particularly bad for the vec_ld and vec_st built-ins on
PowerPC, which represent the lvx and stvx instructions.  Currently these
are expanded into UNSPECs that are left untouched by the optimizers, so
no redundant load or store elimination can take place.  For certain
idiomatic usages, this leads to very bad performance.
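
For illustration only (this example is not part of the patch), the kind of
idiom affected looks like the following; with the opaque UNSPEC expansion the
two identical loads cannot be CSEd, whereas once the implicit "& -16" masking
is expressed directly, ordinary redundancy elimination applies:

  #include <altivec.h>

  /* Hypothetical example: both vec_ld calls read the same 16-byte
     block, but with the UNSPEC form the compiler cannot prove that
     and emits two lvx instructions.  */
  vector int
  sum_twice (const vector int *p)
  {
    vector int a = vec_ld (0, p);
    vector int b = vec_ld (0, p);
    return vec_add (a, b);
  }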

Initially I planned to just change the UNSPEC representation to RTL that
directly expresses the address masking implicit in lvx and stvx.  This
turns out to be only partially successful in improving performance.
Among other things, by the time we reach RTL we have lost track of the
__restrict__ attribute, leading to more appearances of may-alias
relationships than should really be present.  Instead, this patch
expands the built-ins during parsing so that they are exposed to all
GIMPLE optimizations as well.

This works well for vec_ld and vec_st.  It is also possible for
programmers to instead use __builtin_altivec_lvx_<mode> and
__builtin_altivec_stvx_<mode>.  These are not so easy to catch during
parsing, since they are not processed by the overloaded built-in
function table.  For these, I am currently falling back to expansion
during RTL while still exposing the address-masking semantics, which
seems ok for these somewhat obscure built-ins.  At some future time we
may decide to handle them similarly to vec_ld and vec_st.

For POWER8 little-endian only, the loads and stores during expand time
require some special handling, since the POWER8 expanders want to
convert these to lxvd2x/xxswapd and xxswapd/stxvd2x.  To deal with this,
I've added an extra pre-pass to the swap optimization phase that
recognizes the lvx and stvx patterns and canonicalizes them so they'll
be properly recognized.  This isn't an issue for earlier or later
processors, or for big-endian POWER8, so doing this as part of swap
optimization is appropriate.

We have a lot of existing test cases for this code, which proved very
useful in discovering bugs, so I haven't seen a reason to add any new
tests.

The patch is fairly large, but it isn't feasible to break it up into
smaller units without leaving something in a broken state.  So I will
have to just apologize for the size and leave it at that.  Sorry! :)

Bootstrapped and tested successfully on powerpc64le-unknown-linux-gnu,
and on powerpc64-unknown-linux-gnu (-m32 and -m64) with no regressions.
Is this ok for trunk after GCC 6 releases?

Thanks,
Bill


2016-04-18  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* config/rs6000/altivec.md (altivec_lvx_<mode>): Remove.
	(altivec_lvx_<mode>_internal): Document.
	(altivec_lvx_<mode>_2op): New define_insn.
	(altivec_lvx_<mode>_1op): Likewise.
	(altivec_lvx_<mode>_2op_si): Likewise.
	(altivec_lvx_<mode>_1op_si): Likewise.
	(altivec_stvx_<mode>): Remove.
	(altivec_stvx_<mode>_internal): Document.
	(altivec_stvx_<mode>_2op): New define_insn.
	(altivec_stvx_<mode>_1op): Likewise.
	(altivec_stvx_<mode>_2op_si): Likewise.
	(altivec_stvx_<mode>_1op_si): Likewise.
	* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
	Expand vec_ld and vec_st during parsing.
	* config/rs6000/rs6000.c (altivec_expand_lvx_be): Commentary
	changes.
	(altivec_expand_stvx_be): Likewise.
	(altivec_expand_lv_builtin): Expand lvx built-ins to expose the
	address-masking behavior in RTL.
	(altivec_expand_stv_builtin): Expand stvx built-ins to expose the
	address-masking behavior in RTL.
	(altivec_expand_builtin): Change builtin code arguments for calls
	to altivec_expand_stv_builtin and altivec_expand_lv_builtin.
	(insn_is_swappable_p): Avoid incorrect swap optimization in the
	presence of lvx/stvx patterns.
	(alignment_with_canonical_addr): New function.
	(alignment_mask): Likewise.
	(find_alignment_op): Likewise.
	(combine_lvx_pattern): Likewise.
	(combine_stvx_pattern): Likewise.
	(combine_lvx_stvx_patterns): Likewise.
	(rs6000_analyze_swaps): Perform a pre-pass to recognize lvx and
	stvx patterns from expand.
	* config/rs6000/vector.md (vector_altivec_load_<mode>): Use new
	expansions.
	(vector_altivec_store_<mode>): Likewise.

Comments

Richard Biener April 19, 2016, 8:09 a.m. UTC | #1
On Tue, Apr 19, 2016 at 12:05 AM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> Hi,
>
> Expanding built-ins in the usual way (leaving them as calls until
> expanding into RTL) restricts the amount of optimization that can be
> performed on the code represented by the built-ins.  This has been
> observed to be particularly bad for the vec_ld and vec_st built-ins on
> PowerPC, which represent the lvx and stvx instructions.  Currently these
> are expanded into UNSPECs that are left untouched by the optimizers, so
> no redundant load or store elimination can take place.  For certain
> idiomatic usages, this leads to very bad performance.
>
> Initially I planned to just change the UNSPEC representation to RTL that
> directly expresses the address masking implicit in lvx and stvx.  This
> turns out to be only partially successful in improving performance.
> Among other things, by the time we reach RTL we have lost track of the
> __restrict__ attribute, leading to more appearances of may-alias
> relationships than should really be present.  Instead, this patch
> expands the built-ins during parsing so that they are exposed to all
> GIMPLE optimizations as well.
>
> This works well for vec_ld and vec_st.  It is also possible for
> programmers to instead use __builtin_altivec_lvx_<mode> and
> __builtin_altivec_stvx_<mode>.  These are not so easy to catch during
> parsing, since they are not processed by the overloaded built-in
> function table.  For these, I am currently falling back to expansion
> during RTL while still exposing the address-masking semantics, which
> seems ok for these somewhat obscure built-ins.  At some future time we
> may decide to handle them similarly to vec_ld and vec_st.
>
> For POWER8 little-endian only, the loads and stores during expand time
> require some special handling, since the POWER8 expanders want to
> convert these to lxvd2x/xxswapd and xxswapd/stxvd2x.  To deal with this,
> I've added an extra pre-pass to the swap optimization phase that
> recognizes the lvx and stvx patterns and canonicalizes them so they'll
> be properly recognized.  This isn't an issue for earlier or later
> processors, or for big-endian POWER8, so doing this as part of swap
> optimization is appropriate.
>
> We have a lot of existing test cases for this code, which proved very
> useful in discovering bugs, so I haven't seen a reason to add any new
> tests.
>
> The patch is fairly large, but it isn't feasible to break it up into
> smaller units without leaving something in a broken state.  So I will
> have to just apologize for the size and leave it at that.  Sorry! :)
>
> Bootstrapped and tested successfully on powerpc64le-unknown-linux-gnu,
> and on powerpc64-unknown-linux-gnu (-m32 and -m64) with no regressions.
> Is this ok for trunk after GCC 6 releases?

Just took a very quick look but it seems you are using integer arithmetic
for the pointer adjustment and bit-and.  You could use POINTER_PLUS_EXPR
for the addition and BIT_AND_EXPR is also valid on pointer types.  Which
means you don't need conversions to/from sizetype.
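
Something along these lines, perhaps (an untested sketch, reusing loc, arg0
and arg1 as they appear in the vec_ld handling of the patch):

  /* Keep the arithmetic in the pointer domain so that alias
     information such as "restrict" survives.  The offset operand of
     POINTER_PLUS_EXPR must be sizetype.  */
  tree off = fold_convert (sizetype, arg0);
  tree addr = fold_build2_loc (loc, POINTER_PLUS_EXPR,
                               TREE_TYPE (arg1), arg1, off);
  tree aligned = fold_build2_loc (loc, BIT_AND_EXPR, TREE_TYPE (addr),
                                  addr,
                                  build_int_cst (TREE_TYPE (addr), -16));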

x86 nowadays has intrinsics implemented as inlines - they come from
header files.  It seems for ppc the intrinsics are somehow magically
there, w/o a header file?

Richard.

> Thanks,
> Bill
>
>
> 2016-04-18  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
>
>         * config/rs6000/altivec.md (altivec_lvx_<mode>): Remove.
>         (altivec_lvx_<mode>_internal): Document.
>         (altivec_lvx_<mode>_2op): New define_insn.
>         (altivec_lvx_<mode>_1op): Likewise.
>         (altivec_lvx_<mode>_2op_si): Likewise.
>         (altivec_lvx_<mode>_1op_si): Likewise.
>         (altivec_stvx_<mode>): Remove.
>         (altivec_stvx_<mode>_internal): Document.
>         (altivec_stvx_<mode>_2op): New define_insn.
>         (altivec_stvx_<mode>_1op): Likewise.
>         (altivec_stvx_<mode>_2op_si): Likewise.
>         (altivec_stvx_<mode>_1op_si): Likewise.
>         * config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
>         Expand vec_ld and vec_st during parsing.
>         * config/rs6000/rs6000.c (altivec_expand_lvx_be): Commentary
>         changes.
>         (altivec_expand_stvx_be): Likewise.
>         (altivec_expand_lv_builtin): Expand lvx built-ins to expose the
>         address-masking behavior in RTL.
>         (altivec_expand_stv_builtin): Expand stvx built-ins to expose the
>         address-masking behavior in RTL.
>         (altivec_expand_builtin): Change builtin code arguments for calls
>         to altivec_expand_stv_builtin and altivec_expand_lv_builtin.
>         (insn_is_swappable_p): Avoid incorrect swap optimization in the
>         presence of lvx/stvx patterns.
>         (alignment_with_canonical_addr): New function.
>         (alignment_mask): Likewise.
>         (find_alignment_op): Likewise.
>         (combine_lvx_pattern): Likewise.
>         (combine_stvx_pattern): Likewise.
>         (combine_lvx_stvx_patterns): Likewise.
>         (rs6000_analyze_swaps): Perform a pre-pass to recognize lvx and
>         stvx patterns from expand.
>         * config/rs6000/vector.md (vector_altivec_load_<mode>): Use new
>         expansions.
>         (vector_altivec_store_<mode>): Likewise.
>
>
> Index: gcc/config/rs6000/altivec.md
> ===================================================================
> --- gcc/config/rs6000/altivec.md        (revision 235090)
> +++ gcc/config/rs6000/altivec.md        (working copy)
> @@ -2514,20 +2514,9 @@
>    "lvxl %0,%y1"
>    [(set_attr "type" "vecload")])
>
> -(define_expand "altivec_lvx_<mode>"
> -  [(parallel
> -    [(set (match_operand:VM2 0 "register_operand" "=v")
> -         (match_operand:VM2 1 "memory_operand" "Z"))
> -     (unspec [(const_int 0)] UNSPEC_LVX)])]
> -  "TARGET_ALTIVEC"
> -{
> -  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> -    {
> -      altivec_expand_lvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_LVX);
> -      DONE;
> -    }
> -})
> -
> +; This version of lvx is used only in cases where we need to force an lvx
> +; over any other load, and we don't care about losing CSE opportunities.
> +; Its primary use is for prologue register saves.
>  (define_insn "altivec_lvx_<mode>_internal"
>    [(parallel
>      [(set (match_operand:VM2 0 "register_operand" "=v")
> @@ -2537,20 +2526,45 @@
>    "lvx %0,%y1"
>    [(set_attr "type" "vecload")])
>
> -(define_expand "altivec_stvx_<mode>"
> -  [(parallel
> -    [(set (match_operand:VM2 0 "memory_operand" "=Z")
> -         (match_operand:VM2 1 "register_operand" "v"))
> -     (unspec [(const_int 0)] UNSPEC_STVX)])]
> -  "TARGET_ALTIVEC"
> -{
> -  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> -    {
> -      altivec_expand_stvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_STVX);
> -      DONE;
> -    }
> -})
> +; The next two patterns embody what lvx should usually look like.
> +(define_insn "altivec_lvx_<mode>_2op"
> +  [(set (match_operand:VM2 0 "register_operand" "=v")
> +        (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
> +                                  (match_operand:DI 2 "register_operand" "r"))
> +                        (const_int -16))))]
> +  "TARGET_ALTIVEC && TARGET_64BIT"
> +  "lvx %0,%1,%2"
> +  [(set_attr "type" "vecload")])
>
> +(define_insn "altivec_lvx_<mode>_1op"
> +  [(set (match_operand:VM2 0 "register_operand" "=v")
> +        (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
> +                        (const_int -16))))]
> +  "TARGET_ALTIVEC && TARGET_64BIT"
> +  "lvx %0,0,%1"
> +  [(set_attr "type" "vecload")])
> +
> +; 32-bit versions of the above.
> +(define_insn "altivec_lvx_<mode>_2op_si"
> +  [(set (match_operand:VM2 0 "register_operand" "=v")
> +        (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
> +                                  (match_operand:SI 2 "register_operand" "r"))
> +                        (const_int -16))))]
> +  "TARGET_ALTIVEC && TARGET_32BIT"
> +  "lvx %0,%1,%2"
> +  [(set_attr "type" "vecload")])
> +
> +(define_insn "altivec_lvx_<mode>_1op_si"
> +  [(set (match_operand:VM2 0 "register_operand" "=v")
> +        (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
> +                        (const_int -16))))]
> +  "TARGET_ALTIVEC && TARGET_32BIT"
> +  "lvx %0,0,%1"
> +  [(set_attr "type" "vecload")])
> +
> +; This version of stvx is used only in cases where we need to force an stvx
> +; over any other store, and we don't care about losing CSE opportunities.
> +; Its primary use is for epilogue register restores.
>  (define_insn "altivec_stvx_<mode>_internal"
>    [(parallel
>      [(set (match_operand:VM2 0 "memory_operand" "=Z")
> @@ -2560,6 +2574,42 @@
>    "stvx %1,%y0"
>    [(set_attr "type" "vecstore")])
>
> +; The next two patterns embody what stvx should usually look like.
> +(define_insn "altivec_stvx_<mode>_2op"
> +  [(set (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
> +                                 (match_operand:DI 2 "register_operand" "r"))
> +                        (const_int -16)))
> +        (match_operand:VM2 0 "register_operand" "v"))]
> +  "TARGET_ALTIVEC && TARGET_64BIT"
> +  "stvx %0,%1,%2"
> +  [(set_attr "type" "vecstore")])
> +
> +(define_insn "altivec_stvx_<mode>_1op"
> +  [(set (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
> +                        (const_int -16)))
> +        (match_operand:VM2 0 "register_operand" "v"))]
> +  "TARGET_ALTIVEC && TARGET_64BIT"
> +  "stvx %0,0,%1"
> +  [(set_attr "type" "vecstore")])
> +
> +; 32-bit versions of the above.
> +(define_insn "altivec_stvx_<mode>_2op_si"
> +  [(set (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
> +                                 (match_operand:SI 2 "register_operand" "r"))
> +                        (const_int -16)))
> +        (match_operand:VM2 0 "register_operand" "v"))]
> +  "TARGET_ALTIVEC && TARGET_32BIT"
> +  "stvx %0,%1,%2"
> +  [(set_attr "type" "vecstore")])
> +
> +(define_insn "altivec_stvx_<mode>_1op_si"
> +  [(set (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
> +                        (const_int -16)))
> +        (match_operand:VM2 0 "register_operand" "v"))]
> +  "TARGET_ALTIVEC && TARGET_32BIT"
> +  "stvx %0,0,%1"
> +  [(set_attr "type" "vecstore")])
> +
>  (define_expand "altivec_stvxl_<mode>"
>    [(parallel
>      [(set (match_operand:VM2 0 "memory_operand" "=Z")
> Index: gcc/config/rs6000/rs6000-c.c
> ===================================================================
> --- gcc/config/rs6000/rs6000-c.c        (revision 235090)
> +++ gcc/config/rs6000/rs6000-c.c        (working copy)
> @@ -4800,6 +4800,164 @@ assignment for unaligned loads and stores");
>        return stmt;
>      }
>
> +  /* Expand vec_ld into an expression that masks the address and
> +     performs the load.  We need to expand this early to allow
> +     the best aliasing, as by the time we get into RTL we no longer
> +     are able to honor __restrict__, for example.  We may want to
> +     consider this for all memory access built-ins.
> +
> +     When -maltivec=be is specified, simply punt to existing
> +     built-in processing.  */
> +  if (fcode == ALTIVEC_BUILTIN_VEC_LD
> +      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
> +    {
> +      tree arg0 = (*arglist)[0];
> +      tree arg1 = (*arglist)[1];
> +
> +      /* Strip qualifiers like "const" from the pointer arg.  */
> +      tree arg1_type = TREE_TYPE (arg1);
> +      tree inner_type = TREE_TYPE (arg1_type);
> +      if (TYPE_QUALS (TREE_TYPE (arg1_type)) != 0)
> +       {
> +         arg1_type = build_pointer_type (build_qualified_type (inner_type,
> +                                                               0));
> +         arg1 = fold_convert (arg1_type, arg1);
> +       }
> +
> +      /* Construct the masked address.  We have to jump through some hoops
> +        here.  If the first argument to a PLUS_EXPR is a pointer,
> +        build_binary_op will multiply the offset by the size of the
> +        inner type of the pointer (C semantics).  With vec_ld and vec_st,
> +        the offset must be left alone.  However, if we convert to a
> +        sizetype to do the arithmetic, we get a PLUS_EXPR instead of a
> +        POINTER_PLUS_EXPR, which interferes with aliasing (causing us,
> +        for example, to lose "restrict" information).  Thus where legal,
> +        we pre-adjust the offset knowing that a multiply by size is
> +        coming.  When the offset isn't a multiple of the size, we are
> +        forced to do the arithmetic in size_type for correctness, at the
> +        cost of losing aliasing information.  This, however, should be
> +        quite rare with these operations.  */
> +      arg0 = fold (arg0);
> +
> +      /* Let existing error handling take over if we don't have a constant
> +        offset.  */
> +      if (TREE_CODE (arg0) == INTEGER_CST)
> +       {
> +         HOST_WIDE_INT off = TREE_INT_CST_LOW (arg0);
> +         HOST_WIDE_INT size = int_size_in_bytes (inner_type);
> +         tree addr;
> +
> +         if (off % size == 0)
> +           {
> +             tree adjoff = build_int_cst (TREE_TYPE (arg0), off / size);
> +             addr = build_binary_op (loc, PLUS_EXPR, arg1, adjoff, 0);
> +             addr = build1 (NOP_EXPR, sizetype, addr);
> +           }
> +         else
> +           {
> +             tree hack_arg1 = build1 (NOP_EXPR, sizetype, arg1);
> +             addr = build_binary_op (loc, PLUS_EXPR, hack_arg1, arg0, 0);
> +           }
> +         tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
> +                                         build_int_cst (sizetype, -16), 0);
> +
> +         /* Find the built-in to get the return type so we can convert
> +            the result properly (or fall back to default handling if the
> +            arguments aren't compatible).  */
> +         for (desc = altivec_overloaded_builtins;
> +              desc->code && desc->code != fcode; desc++)
> +           continue;
> +
> +         for (; desc->code == fcode; desc++)
> +           if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
> +               && (rs6000_builtin_type_compatible (TREE_TYPE (arg1),
> +                                                   desc->op2)))
> +             {
> +               tree ret_type = rs6000_builtin_type (desc->ret_type);
> +               if (TYPE_MODE (ret_type) == V2DImode)
> +                 /* Type-based aliasing analysis thinks vector long
> +                    and vector long long are different and will put them
> +                    in distinct alias classes.  Force our return type
> +                    to be a may-alias type to avoid this.  */
> +                 ret_type
> +                   = build_pointer_type_for_mode (ret_type, Pmode,
> +                                                  true/*can_alias_all*/);
> +               else
> +                 ret_type = build_pointer_type (ret_type);
> +               aligned = build1 (NOP_EXPR, ret_type, aligned);
> +               tree ret_val = build_indirect_ref (loc, aligned, RO_NULL);
> +               return ret_val;
> +             }
> +       }
> +    }
> +
> +  /* Similarly for stvx.  */
> +  if (fcode == ALTIVEC_BUILTIN_VEC_ST
> +      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
> +    {
> +      tree arg0 = (*arglist)[0];
> +      tree arg1 = (*arglist)[1];
> +      tree arg2 = (*arglist)[2];
> +
> +      /* Construct the masked address.  See handling for ALTIVEC_BUILTIN_VEC_LD
> +        for an explanation of address arithmetic concerns.  */
> +      arg1 = fold (arg1);
> +
> +      /* Let existing error handling take over if we don't have a constant
> +        offset.  */
> +      if (TREE_CODE (arg1) == INTEGER_CST)
> +       {
> +         HOST_WIDE_INT off = TREE_INT_CST_LOW (arg1);
> +         tree inner_type = TREE_TYPE (TREE_TYPE (arg2));
> +         HOST_WIDE_INT size = int_size_in_bytes (inner_type);
> +         tree addr;
> +
> +         if (off % size == 0)
> +           {
> +             tree adjoff = build_int_cst (TREE_TYPE (arg1), off / size);
> +             addr = build_binary_op (loc, PLUS_EXPR, arg2, adjoff, 0);
> +             addr = build1 (NOP_EXPR, sizetype, addr);
> +           }
> +         else
> +           {
> +             tree hack_arg2 = build1 (NOP_EXPR, sizetype, arg2);
> +             addr = build_binary_op (loc, PLUS_EXPR, hack_arg2, arg1, 0);
> +           }
> +         tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
> +                                         build_int_cst (sizetype, -16), 0);
> +
> +         /* Find the built-in to make sure a compatible one exists; if not
> +            we fall back to default handling to get the error message.  */
> +         for (desc = altivec_overloaded_builtins;
> +              desc->code && desc->code != fcode; desc++)
> +           continue;
> +
> +         for (; desc->code == fcode; desc++)
> +           if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
> +               && rs6000_builtin_type_compatible (TREE_TYPE (arg1), desc->op2)
> +               && rs6000_builtin_type_compatible (TREE_TYPE (arg2),
> +                                                  desc->op3))
> +             {
> +               tree arg0_type = TREE_TYPE (arg0);
> +               if (TYPE_MODE (arg0_type) == V2DImode)
> +                 /* Type-based aliasing analysis thinks vector long
> +                    and vector long long are different and will put them
> +                    in distinct alias classes.  Force our address type
> +                    to be a may-alias type to avoid this.  */
> +                 arg0_type
> +                   = build_pointer_type_for_mode (arg0_type, Pmode,
> +                                                  true/*can_alias_all*/);
> +               else
> +                 arg0_type = build_pointer_type (arg0_type);
> +               aligned = build1 (NOP_EXPR, arg0_type, aligned);
> +               tree stg = build_indirect_ref (loc, aligned, RO_NULL);
> +               tree retval = build2 (MODIFY_EXPR, TREE_TYPE (stg), stg,
> +                                     convert (TREE_TYPE (stg), arg0));
> +               return retval;
> +             }
> +       }
> +    }
> +
>    for (n = 0;
>         !VOID_TYPE_P (TREE_VALUE (fnargs)) && n < nargs;
>         fnargs = TREE_CHAIN (fnargs), n++)
> Index: gcc/config/rs6000/rs6000.c
> ===================================================================
> --- gcc/config/rs6000/rs6000.c  (revision 235090)
> +++ gcc/config/rs6000/rs6000.c  (working copy)
> @@ -13025,9 +13025,9 @@ swap_selector_for_mode (machine_mode mode)
>    return force_reg (V16QImode, gen_rtx_CONST_VECTOR (V16QImode, gen_rtvec_v (16, perm)));
>  }
>
> -/* Generate code for an "lvx", "lvxl", or "lve*x" built-in for a little endian target
> -   with -maltivec=be specified.  Issue the load followed by an element-reversing
> -   permute.  */
> +/* Generate code for an "lvxl", or "lve*x" built-in for a little endian target
> +   with -maltivec=be specified.  Issue the load followed by an element-
> +   reversing permute.  */
>  void
>  altivec_expand_lvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
>  {
> @@ -13043,8 +13043,8 @@ altivec_expand_lvx_be (rtx op0, rtx op1, machine_m
>    emit_insn (gen_rtx_SET (op0, vperm));
>  }
>
> -/* Generate code for a "stvx" or "stvxl" built-in for a little endian target
> -   with -maltivec=be specified.  Issue the store preceded by an element-reversing
> +/* Generate code for a "stvxl" built-in for a little endian target with
> +   -maltivec=be specified.  Issue the store preceded by an element-reversing
>     permute.  */
>  void
>  altivec_expand_stvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
> @@ -13106,22 +13106,65 @@ altivec_expand_lv_builtin (enum insn_code icode, t
>
>    op1 = copy_to_mode_reg (mode1, op1);
>
> -  if (op0 == const0_rtx)
> +  /* For LVX, express the RTL accurately by ANDing the address with -16.
> +     LVXL and LVE*X expand to use UNSPECs to hide their special behavior,
> +     so the raw address is fine.  */
> +  switch (icode)
>      {
> -      addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
> -    }
> -  else
> -    {
> -      op0 = copy_to_mode_reg (mode0, op0);
> -      addr = gen_rtx_MEM (blk ? BLKmode : tmode, gen_rtx_PLUS (Pmode, op0, op1));
> -    }
> +    case CODE_FOR_altivec_lvx_v2df_2op:
> +    case CODE_FOR_altivec_lvx_v2di_2op:
> +    case CODE_FOR_altivec_lvx_v4sf_2op:
> +    case CODE_FOR_altivec_lvx_v4si_2op:
> +    case CODE_FOR_altivec_lvx_v8hi_2op:
> +    case CODE_FOR_altivec_lvx_v16qi_2op:
> +      {
> +       rtx rawaddr;
> +       if (op0 == const0_rtx)
> +         rawaddr = op1;
> +       else
> +         {
> +           op0 = copy_to_mode_reg (mode0, op0);
> +           rawaddr = gen_rtx_PLUS (Pmode, op1, op0);
> +         }
> +       addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
> +       addr = gen_rtx_MEM (blk ? BLKmode : tmode, addr);
>
> -  pat = GEN_FCN (icode) (target, addr);
> +       /* For -maltivec=be, emit the load and follow it up with a
> +          permute to swap the elements.  */
> +       if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> +         {
> +           rtx temp = gen_reg_rtx (tmode);
> +           emit_insn (gen_rtx_SET (temp, addr));
>
> -  if (! pat)
> -    return 0;
> -  emit_insn (pat);
> +           rtx sel = swap_selector_for_mode (tmode);
> +           rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, temp, temp, sel),
> +                                       UNSPEC_VPERM);
> +           emit_insn (gen_rtx_SET (target, vperm));
> +         }
> +       else
> +         emit_insn (gen_rtx_SET (target, addr));
>
> +       break;
> +      }
> +
> +    default:
> +      if (op0 == const0_rtx)
> +       addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
> +      else
> +       {
> +         op0 = copy_to_mode_reg (mode0, op0);
> +         addr = gen_rtx_MEM (blk ? BLKmode : tmode,
> +                             gen_rtx_PLUS (Pmode, op1, op0));
> +       }
> +
> +      pat = GEN_FCN (icode) (target, addr);
> +      if (! pat)
> +       return 0;
> +      emit_insn (pat);
> +
> +      break;
> +    }
> +
>    return target;
>  }
>
> @@ -13208,7 +13251,7 @@ altivec_expand_stv_builtin (enum insn_code icode,
>    rtx op0 = expand_normal (arg0);
>    rtx op1 = expand_normal (arg1);
>    rtx op2 = expand_normal (arg2);
> -  rtx pat, addr;
> +  rtx pat, addr, rawaddr;
>    machine_mode tmode = insn_data[icode].operand[0].mode;
>    machine_mode smode = insn_data[icode].operand[1].mode;
>    machine_mode mode1 = Pmode;
> @@ -13220,24 +13263,69 @@ altivec_expand_stv_builtin (enum insn_code icode,
>        || arg2 == error_mark_node)
>      return const0_rtx;
>
> -  if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
> -    op0 = copy_to_mode_reg (smode, op0);
> -
>    op2 = copy_to_mode_reg (mode2, op2);
>
> -  if (op1 == const0_rtx)
> +  /* For STVX, express the RTL accurately by ANDing the address with -16.
> +     STVXL and STVE*X expand to use UNSPECs to hide their special behavior,
> +     so the raw address is fine.  */
> +  switch (icode)
>      {
> -      addr = gen_rtx_MEM (tmode, op2);
> -    }
> -  else
> -    {
> -      op1 = copy_to_mode_reg (mode1, op1);
> -      addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op1, op2));
> -    }
> +    case CODE_FOR_altivec_stvx_v2df_2op:
> +    case CODE_FOR_altivec_stvx_v2di_2op:
> +    case CODE_FOR_altivec_stvx_v4sf_2op:
> +    case CODE_FOR_altivec_stvx_v4si_2op:
> +    case CODE_FOR_altivec_stvx_v8hi_2op:
> +    case CODE_FOR_altivec_stvx_v16qi_2op:
> +      {
> +       if (op1 == const0_rtx)
> +         rawaddr = op2;
> +       else
> +         {
> +           op1 = copy_to_mode_reg (mode1, op1);
> +           rawaddr = gen_rtx_PLUS (Pmode, op2, op1);
> +         }
>
> -  pat = GEN_FCN (icode) (addr, op0);
> -  if (pat)
> -    emit_insn (pat);
> +       addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
> +       addr = gen_rtx_MEM (tmode, addr);
> +
> +       op0 = copy_to_mode_reg (tmode, op0);
> +
> +       /* For -maltivec=be, emit a permute to swap the elements, followed
> +          by the store.  */
> +       if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> +         {
> +           rtx temp = gen_reg_rtx (tmode);
> +           rtx sel = swap_selector_for_mode (tmode);
> +           rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, op0, op0, sel),
> +                                       UNSPEC_VPERM);
> +           emit_insn (gen_rtx_SET (temp, vperm));
> +           emit_insn (gen_rtx_SET (addr, temp));
> +         }
> +       else
> +         emit_insn (gen_rtx_SET (addr, op0));
> +
> +       break;
> +      }
> +
> +    default:
> +      {
> +       if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
> +         op0 = copy_to_mode_reg (smode, op0);
> +
> +       if (op1 == const0_rtx)
> +         addr = gen_rtx_MEM (tmode, op2);
> +       else
> +         {
> +           op1 = copy_to_mode_reg (mode1, op1);
> +           addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op2, op1));
> +         }
> +
> +       pat = GEN_FCN (icode) (addr, op0);
> +       if (pat)
> +         emit_insn (pat);
> +      }
> +    }
> +
>    return NULL_RTX;
>  }
>
> @@ -14073,18 +14161,18 @@ altivec_expand_builtin (tree exp, rtx target, bool
>    switch (fcode)
>      {
>      case ALTIVEC_BUILTIN_STVX_V2DF:
> -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df, exp);
> +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df_2op, exp);
>      case ALTIVEC_BUILTIN_STVX_V2DI:
> -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di, exp);
> +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di_2op, exp);
>      case ALTIVEC_BUILTIN_STVX_V4SF:
> -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf, exp);
> +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf_2op, exp);
>      case ALTIVEC_BUILTIN_STVX:
>      case ALTIVEC_BUILTIN_STVX_V4SI:
> -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si, exp);
> +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si_2op, exp);
>      case ALTIVEC_BUILTIN_STVX_V8HI:
> -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi, exp);
> +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi_2op, exp);
>      case ALTIVEC_BUILTIN_STVX_V16QI:
> -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi, exp);
> +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi_2op, exp);
>      case ALTIVEC_BUILTIN_STVEBX:
>        return altivec_expand_stv_builtin (CODE_FOR_altivec_stvebx, exp);
>      case ALTIVEC_BUILTIN_STVEHX:
> @@ -14272,23 +14360,23 @@ altivec_expand_builtin (tree exp, rtx target, bool
>        return altivec_expand_lv_builtin (CODE_FOR_altivec_lvxl_v16qi,
>                                         exp, target, false);
>      case ALTIVEC_BUILTIN_LVX_V2DF:
> -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df,
> +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df_2op,
>                                         exp, target, false);
>      case ALTIVEC_BUILTIN_LVX_V2DI:
> -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di,
> +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di_2op,
>                                         exp, target, false);
>      case ALTIVEC_BUILTIN_LVX_V4SF:
> -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf,
> +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf_2op,
>                                         exp, target, false);
>      case ALTIVEC_BUILTIN_LVX:
>      case ALTIVEC_BUILTIN_LVX_V4SI:
> -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si,
> +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si_2op,
>                                         exp, target, false);
>      case ALTIVEC_BUILTIN_LVX_V8HI:
> -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi,
> +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi_2op,
>                                         exp, target, false);
>      case ALTIVEC_BUILTIN_LVX_V16QI:
> -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi,
> +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi_2op,
>                                         exp, target, false);
>      case ALTIVEC_BUILTIN_LVLX:
>        return altivec_expand_lv_builtin (CODE_FOR_altivec_lvlx,
> @@ -37139,7 +37227,9 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
>       fix them up by converting them to permuting ones.  Exceptions:
>       UNSPEC_LVE, UNSPEC_LVX, and UNSPEC_STVX, which have a PARALLEL
>       body instead of a SET; and UNSPEC_STVE, which has an UNSPEC
> -     for the SET source.  */
> +     for the SET source.  Also we must now make an exception for lvx
> +     and stvx when they are not in the UNSPEC_LVX/STVX form (with the
> +     explicit "& -16") since this leads to unrecognizable insns.  */
>    rtx body = PATTERN (insn);
>    int i = INSN_UID (insn);
>
> @@ -37147,6 +37237,11 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
>      {
>        if (GET_CODE (body) == SET)
>         {
> +         rtx rhs = SET_SRC (body);
> +         gcc_assert (GET_CODE (rhs) == MEM);
> +         if (GET_CODE (XEXP (rhs, 0)) == AND)
> +           return 0;
> +
>           *special = SH_NOSWAP_LD;
>           return 1;
>         }
> @@ -37156,8 +37251,14 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
>
>    if (insn_entry[i].is_store)
>      {
> -      if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) != UNSPEC)
> +      if (GET_CODE (body) == SET
> +         && GET_CODE (SET_SRC (body)) != UNSPEC)
>         {
> +         rtx lhs = SET_DEST (body);
> +         gcc_assert (GET_CODE (lhs) == MEM);
> +         if (GET_CODE (XEXP (lhs, 0)) == AND)
> +           return 0;
> +
>           *special = SH_NOSWAP_ST;
>           return 1;
>         }
> @@ -37827,6 +37928,267 @@ dump_swap_insn_table (swap_web_entry *insn_entry)
>    fputs ("\n", dump_file);
>  }
>
> +/* Return RTX with its address canonicalized to (reg) or (+ reg reg).
> +   Here RTX is an (& addr (const_int -16)).  Always return a new copy
> +   to avoid problems with combine.  */
> +static rtx
> +alignment_with_canonical_addr (rtx align)
> +{
> +  rtx canon;
> +  rtx addr = XEXP (align, 0);
> +
> +  if (REG_P (addr))
> +    canon = addr;
> +
> +  else if (GET_CODE (addr) == PLUS)
> +    {
> +      rtx addrop0 = XEXP (addr, 0);
> +      rtx addrop1 = XEXP (addr, 1);
> +
> +      if (!REG_P (addrop0))
> +       addrop0 = force_reg (GET_MODE (addrop0), addrop0);
> +
> +      if (!REG_P (addrop1))
> +       addrop1 = force_reg (GET_MODE (addrop1), addrop1);
> +
> +      canon = gen_rtx_PLUS (GET_MODE (addr), addrop0, addrop1);
> +    }
> +
> +  else
> +    canon = force_reg (GET_MODE (addr), addr);
> +
> +  return gen_rtx_AND (GET_MODE (align), canon, GEN_INT (-16));
> +}
> +
> +/* Check whether an rtx is an alignment mask, and if so, return
> +   a fully-expanded rtx for the masking operation.  */
> +static rtx
> +alignment_mask (rtx_insn *insn)
> +{
> +  rtx body = PATTERN (insn);
> +
> +  if (GET_CODE (body) != SET
> +      || GET_CODE (SET_SRC (body)) != AND
> +      || !REG_P (XEXP (SET_SRC (body), 0)))
> +    return 0;
> +
> +  rtx mask = XEXP (SET_SRC (body), 1);
> +
> +  if (GET_CODE (mask) == CONST_INT)
> +    {
> +      if (INTVAL (mask) == -16)
> +       return alignment_with_canonical_addr (SET_SRC (body));
> +      else
> +       return 0;
> +    }
> +
> +  if (!REG_P (mask))
> +    return 0;
> +
> +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +  df_ref use;
> +  rtx real_mask = 0;
> +
> +  FOR_EACH_INSN_INFO_USE (use, insn_info)
> +    {
> +      if (!rtx_equal_p (DF_REF_REG (use), mask))
> +       continue;
> +
> +      struct df_link *def_link = DF_REF_CHAIN (use);
> +      if (!def_link || def_link->next)
> +       return 0;
> +
> +      rtx_insn *const_insn = DF_REF_INSN (def_link->ref);
> +      rtx const_body = PATTERN (const_insn);
> +      if (GET_CODE (const_body) != SET)
> +       return 0;
> +
> +      real_mask = SET_SRC (const_body);
> +
> +      if (GET_CODE (real_mask) != CONST_INT
> +         || INTVAL (real_mask) != -16)
> +       return 0;
> +    }
> +
> +  if (real_mask == 0)
> +    return 0;
> +
> +  return alignment_with_canonical_addr (SET_SRC (body));
> +}
> +
> +/* Given INSN that's a load or store based at BASE_REG, look for a
> +   feeding computation that aligns its address on a 16-byte boundary.  */
> +static rtx
> +find_alignment_op (rtx_insn *insn, rtx base_reg)
> +{
> +  df_ref base_use;
> +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +  rtx and_operation = 0;
> +
> +  FOR_EACH_INSN_INFO_USE (base_use, insn_info)
> +    {
> +      if (!rtx_equal_p (DF_REF_REG (base_use), base_reg))
> +       continue;
> +
> +      struct df_link *base_def_link = DF_REF_CHAIN (base_use);
> +      if (!base_def_link || base_def_link->next)
> +       break;
> +
> +      rtx_insn *and_insn = DF_REF_INSN (base_def_link->ref);
> +      and_operation = alignment_mask (and_insn);
> +      if (and_operation != 0)
> +       break;
> +    }
> +
> +  return and_operation;
> +}
> +
> +struct del_info { bool replace; rtx_insn *replace_insn; };
> +
> +/* If INSN is the load for an lvx pattern, put it in canonical form.  */
> +static void
> +combine_lvx_pattern (rtx_insn *insn, del_info *to_delete)
> +{
> +  rtx body = PATTERN (insn);
> +  gcc_assert (GET_CODE (body) == SET
> +             && GET_CODE (SET_SRC (body)) == VEC_SELECT
> +             && GET_CODE (XEXP (SET_SRC (body), 0)) == MEM);
> +
> +  rtx mem = XEXP (SET_SRC (body), 0);
> +  rtx base_reg = XEXP (mem, 0);
> +
> +  rtx and_operation = find_alignment_op (insn, base_reg);
> +
> +  if (and_operation != 0)
> +    {
> +      df_ref def;
> +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +      FOR_EACH_INSN_INFO_DEF (def, insn_info)
> +       {
> +         struct df_link *link = DF_REF_CHAIN (def);
> +         if (!link || link->next)
> +           break;
> +
> +         rtx_insn *swap_insn = DF_REF_INSN (link->ref);
> +         if (!insn_is_swap_p (swap_insn)
> +             || insn_is_load_p (swap_insn)
> +             || insn_is_store_p (swap_insn))
> +           break;
> +
> +         /* Expected lvx pattern found.  Change the swap to
> +            a copy, and propagate the AND operation into the
> +            load.  */
> +         to_delete[INSN_UID (swap_insn)].replace = true;
> +         to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
> +
> +         XEXP (mem, 0) = and_operation;
> +         SET_SRC (body) = mem;
> +         INSN_CODE (insn) = -1; /* Force re-recognition.  */
> +         df_insn_rescan (insn);
> +
> +         if (dump_file)
> +           fprintf (dump_file, "lvx opportunity found at %d\n",
> +                    INSN_UID (insn));
> +       }
> +    }
> +}
> +
> +/* If INSN is the store for an stvx pattern, put it in canonical form.  */
> +static void
> +combine_stvx_pattern (rtx_insn *insn, del_info *to_delete)
> +{
> +  rtx body = PATTERN (insn);
> +  gcc_assert (GET_CODE (body) == SET
> +             && GET_CODE (SET_DEST (body)) == MEM
> +             && GET_CODE (SET_SRC (body)) == VEC_SELECT);
> +  rtx mem = SET_DEST (body);
> +  rtx base_reg = XEXP (mem, 0);
> +
> +  rtx and_operation = find_alignment_op (insn, base_reg);
> +
> +  if (and_operation != 0)
> +    {
> +      rtx src_reg = XEXP (SET_SRC (body), 0);
> +      df_ref src_use;
> +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +      FOR_EACH_INSN_INFO_USE (src_use, insn_info)
> +       {
> +         if (!rtx_equal_p (DF_REF_REG (src_use), src_reg))
> +           continue;
> +
> +         struct df_link *link = DF_REF_CHAIN (src_use);
> +         if (!link || link->next)
> +           break;
> +
> +         rtx_insn *swap_insn = DF_REF_INSN (link->ref);
> +         if (!insn_is_swap_p (swap_insn)
> +             || insn_is_load_p (swap_insn)
> +             || insn_is_store_p (swap_insn))
> +           break;
> +
> +         /* Expected stvx pattern found.  Change the swap to
> +            a copy, and propagate the AND operation into the
> +            store.  */
> +         to_delete[INSN_UID (swap_insn)].replace = true;
> +         to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
> +
> +         XEXP (mem, 0) = and_operation;
> +         SET_SRC (body) = src_reg;
> +         INSN_CODE (insn) = -1; /* Force re-recognition.  */
> +         df_insn_rescan (insn);
> +
> +         if (dump_file)
> +           fprintf (dump_file, "stvx opportunity found at %d\n",
> +                    INSN_UID (insn));
> +       }
> +    }
> +}
> +
> +/* Look for patterns created from builtin lvx and stvx calls, and
> +   canonicalize them to be properly recognized as such.  */
> +static void
> +combine_lvx_stvx_patterns (function *fun)
> +{
> +  int i;
> +  basic_block bb;
> +  rtx_insn *insn;
> +
> +  int num_insns = get_max_uid ();
> +  del_info *to_delete = XCNEWVEC (del_info, num_insns);
> +
> +  FOR_ALL_BB_FN (bb, fun)
> +    FOR_BB_INSNS (bb, insn)
> +    {
> +      if (!NONDEBUG_INSN_P (insn))
> +       continue;
> +
> +      if (insn_is_load_p (insn) && insn_is_swap_p (insn))
> +       combine_lvx_pattern (insn, to_delete);
> +      else if (insn_is_store_p (insn) && insn_is_swap_p (insn))
> +       combine_stvx_pattern (insn, to_delete);
> +    }
> +
> +  /* Turning swaps into copies is delayed until now, to avoid problems
> +     with deleting instructions during the insn walk.  */
> +  for (i = 0; i < num_insns; i++)
> +    if (to_delete[i].replace)
> +      {
> +       rtx swap_body = PATTERN (to_delete[i].replace_insn);
> +       rtx src_reg = XEXP (SET_SRC (swap_body), 0);
> +       rtx copy = gen_rtx_SET (SET_DEST (swap_body), src_reg);
> +       rtx_insn *new_insn = emit_insn_before (copy,
> +                                              to_delete[i].replace_insn);
> +       set_block_for_insn (new_insn,
> +                           BLOCK_FOR_INSN (to_delete[i].replace_insn));
> +       df_insn_rescan (new_insn);
> +       df_insn_delete (to_delete[i].replace_insn);
> +       remove_insn (to_delete[i].replace_insn);
> +       to_delete[i].replace_insn->set_deleted ();
> +      }
> +
> +  free (to_delete);
> +}
> +
>  /* Main entry point for this pass.  */
>  unsigned int
>  rs6000_analyze_swaps (function *fun)
> @@ -37833,7 +38195,7 @@ rs6000_analyze_swaps (function *fun)
>  {
>    swap_web_entry *insn_entry;
>    basic_block bb;
> -  rtx_insn *insn;
> +  rtx_insn *insn, *curr_insn = 0;
>
>    /* Dataflow analysis for use-def chains.  */
>    df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
> @@ -37841,12 +38203,15 @@ rs6000_analyze_swaps (function *fun)
>    df_analyze ();
>    df_set_flags (DF_DEFER_INSN_RESCAN);
>
> +  /* Pre-pass to combine lvx and stvx patterns so we don't lose info.  */
> +  combine_lvx_stvx_patterns (fun);
> +
>    /* Allocate structure to represent webs of insns.  */
>    insn_entry = XCNEWVEC (swap_web_entry, get_max_uid ());
>
>    /* Walk the insns to gather basic data.  */
>    FOR_ALL_BB_FN (bb, fun)
> -    FOR_BB_INSNS (bb, insn)
> +    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
>      {
>        unsigned int uid = INSN_UID (insn);
>        if (NONDEBUG_INSN_P (insn))
> Index: gcc/config/rs6000/vector.md
> ===================================================================
> --- gcc/config/rs6000/vector.md (revision 235090)
> +++ gcc/config/rs6000/vector.md (working copy)
> @@ -167,7 +167,14 @@
>    if (VECTOR_MEM_VSX_P (<MODE>mode))
>      {
>        operands[1] = rs6000_address_for_altivec (operands[1]);
> -      emit_insn (gen_altivec_lvx_<mode> (operands[0], operands[1]));
> +      rtx and_op = XEXP (operands[1], 0);
> +      gcc_assert (GET_CODE (and_op) == AND);
> +      rtx addr = XEXP (and_op, 0);
> +      if (GET_CODE (addr) == PLUS)
> +        emit_insn (gen_altivec_lvx_<mode>_2op (operands[0], XEXP (addr, 0),
> +                                              XEXP (addr, 1)));
> +      else
> +        emit_insn (gen_altivec_lvx_<mode>_1op (operands[0], operands[1]));
>        DONE;
>      }
>  }")
> @@ -183,7 +190,14 @@
>    if (VECTOR_MEM_VSX_P (<MODE>mode))
>      {
>        operands[0] = rs6000_address_for_altivec (operands[0]);
> -      emit_insn (gen_altivec_stvx_<mode> (operands[0], operands[1]));
> +      rtx and_op = XEXP (operands[0], 0);
> +      gcc_assert (GET_CODE (and_op) == AND);
> +      rtx addr = XEXP (and_op, 0);
> +      if (GET_CODE (addr) == PLUS)
> +        emit_insn (gen_altivec_stvx_<mode>_2op (operands[1], XEXP (addr, 0),
> +                                               XEXP (addr, 1)));
> +      else
> +        emit_insn (gen_altivec_stvx_<mode>_1op (operands[1], operands[0]));
>        DONE;
>      }
>  }")
>
>
Bill Schmidt April 19, 2016, 1:10 p.m. UTC | #2
On Tue, 2016-04-19 at 10:09 +0200, Richard Biener wrote:
> On Tue, Apr 19, 2016 at 12:05 AM, Bill Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > Expanding built-ins in the usual way (leaving them as calls until
> > expanding into RTL) restricts the amount of optimization that can be
> > performed on the code represented by the built-ins.  This has been
> > observed to be particularly bad for the vec_ld and vec_st built-ins on
> > PowerPC, which represent the lvx and stvx instructions.  Currently these
> > are expanded into UNSPECs that are left untouched by the optimizers, so
> > no redundant load or store elimination can take place.  For certain
> > idiomatic usages, this leads to very bad performance.
> >
> > Initially I planned to just change the UNSPEC representation to RTL that
> > directly expresses the address masking implicit in lvx and stvx.  This
> > turns out to be only partially successful in improving performance.
> > Among other things, by the time we reach RTL we have lost track of the
> > __restrict__ attribute, leading to more appearances of may-alias
> > relationships than should really be present.  Instead, this patch
> > expands the built-ins during parsing so that they are exposed to all
> > GIMPLE optimizations as well.
> >
> > This works well for vec_ld and vec_st.  It is also possible for
> > programmers to instead use __builtin_altivec_lvx_<mode> and
> > __builtin_altivec_stvx_<mode>.  These are not so easy to catch during
> > parsing, since they are not processed by the overloaded built-in
> > function table.  For these, I am currently falling back to expansion
> > during RTL while still exposing the address-masking semantics, which
> > seems ok for these somewhat obscure built-ins.  At some future time we
> > may decide to handle them similarly to vec_ld and vec_st.
> >
> > For POWER8 little-endian only, the loads and stores during expand time
> > require some special handling, since the POWER8 expanders want to
> > convert these to lxvd2x/xxswapd and xxswapd/stxvd2x.  To deal with this,
> > I've added an extra pre-pass to the swap optimization phase that
> > recognizes the lvx and stvx patterns and canonicalizes them so they'll
> > be properly recognized.  This isn't an issue for earlier or later
> > processors, or for big-endian POWER8, so doing this as part of swap
> > optimization is appropriate.
> >
> > We have a lot of existing test cases for this code, which proved very
> > useful in discovering bugs, so I haven't seen a reason to add any new
> > tests.
> >
> > The patch is fairly large, but it isn't feasible to break it up into
> > smaller units without leaving something in a broken state.  So I will
> > have to just apologize for the size and leave it at that.  Sorry! :)
> >
> > Bootstrapped and tested successfully on powerpc64le-unknown-linux-gnu,
> > and on powerpc64-unknown-linux-gnu (-m32 and -m64) with no regressions.
> > Is this ok for trunk after GCC 6 releases?
> 
> Just took a very quick look but it seems you are using integer arithmetic
> for the pointer adjustment and bit-and.  You could use POINTER_PLUS_EXPR
> for the addition and BIT_AND_EXPR is also valid on pointer types.  Which
> means you don't need conversions to/from sizetype.

Thanks, I appreciate that help -- I had tried to use BIT_AND_EXPR on
pointer types but it didn't work; I must have done something wrong, and
assumed it wasn't allowed.  I'll take another crack at that, as the
conversions are definitely an annoyance.

Using PLUS_EXPR was automatically getting me a POINTER_PLUS_EXPR based
on type, but it is probably best to make that explicit.

> 
> x86 nowadays has intrinsics implemented as inlines - they come from
> header files.  It seems for ppc the intrinsics are somehow magically
> there, w/o a header file?

Yes, and we really need to start gravitating to the inlines in header
files model (Clang does this successfully for PowerPC and it is quite a
bit cleaner, and allows for more optimization).  We have a very
complicated setup for handling overloaded built-ins that could use a
rewrite once somebody has time to attack it.  We do have one header file
for built-ins (altivec.h) but it largely just #defines well-known
aliases for the internal built-in names.  We have a lot of other things
we have to do in GCC 7, but I'd like to do something about this in the
relatively near future.  (Things like "vec_add" that just do a vector
addition aren't expanded until RTL time??  Gack.)
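
As a rough sketch of that direction (purely hypothetical, not what altivec.h
contains today), a simple operation like vec_add could live in the header as
an always-inline function built on GCC's generic vector extensions:

  /* Hypothetical header-based intrinsic; the generic vector "+" is
     fully visible to the GIMPLE optimizers.  */
  static __inline__ __vector signed int
  __attribute__ ((__always_inline__))
  vec_add_sketch (__vector signed int __a, __vector signed int __b)
  {
    return __a + __b;
  }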

David, let me take another shot at eliminating the sizetype conversions
before you review this.

Thanks!

Bill

> 
> Richard.
> 
> > Thanks,
> > Bill
> >
> >
> > 2016-04-18  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> >
> >         * config/rs6000/altivec.md (altivec_lvx_<mode>): Remove.
> >         (altivec_lvx_<mode>_internal): Document.
> >         (altivec_lvx_<mode>_2op): New define_insn.
> >         (altivec_lvx_<mode>_1op): Likewise.
> >         (altivec_lvx_<mode>_2op_si): Likewise.
> >         (altivec_lvx_<mode>_1op_si): Likewise.
> >         (altivec_stvx_<mode>): Remove.
> >         (altivec_stvx_<mode>_internal): Document.
> >         (altivec_stvx_<mode>_2op): New define_insn.
> >         (altivec_stvx_<mode>_1op): Likewise.
> >         (altivec_stvx_<mode>_2op_si): Likewise.
> >         (altivec_stvx_<mode>_1op_si): Likewise.
> >         * config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
> >         Expand vec_ld and vec_st during parsing.
> >         * config/rs6000/rs6000.c (altivec_expand_lvx_be): Commentary
> >         changes.
> >         (altivec_expand_stvx_be): Likewise.
> >         (altivec_expand_lv_builtin): Expand lvx built-ins to expose the
> >         address-masking behavior in RTL.
> >         (altivec_expand_stv_builtin): Expand stvx built-ins to expose the
> >         address-masking behavior in RTL.
> >         (altivec_expand_builtin): Change builtin code arguments for calls
> >         to altivec_expand_stv_builtin and altivec_expand_lv_builtin.
> >         (insn_is_swappable_p): Avoid incorrect swap optimization in the
> >         presence of lvx/stvx patterns.
> >         (alignment_with_canonical_addr): New function.
> >         (alignment_mask): Likewise.
> >         (find_alignment_op): Likewise.
> >         (combine_lvx_pattern): Likewise.
> >         (combine_stvx_pattern): Likewise.
> >         (combine_lvx_stvx_patterns): Likewise.
> >         (rs6000_analyze_swaps): Perform a pre-pass to recognize lvx and
> >         stvx patterns from expand.
> >         * config/rs6000/vector.md (vector_altivec_load_<mode>): Use new
> >         expansions.
> >         (vector_altivec_store_<mode>): Likewise.
> >
> >
> > Index: gcc/config/rs6000/altivec.md
> > ===================================================================
> > --- gcc/config/rs6000/altivec.md        (revision 235090)
> > +++ gcc/config/rs6000/altivec.md        (working copy)
> > @@ -2514,20 +2514,9 @@
> >    "lvxl %0,%y1"
> >    [(set_attr "type" "vecload")])
> >
> > -(define_expand "altivec_lvx_<mode>"
> > -  [(parallel
> > -    [(set (match_operand:VM2 0 "register_operand" "=v")
> > -         (match_operand:VM2 1 "memory_operand" "Z"))
> > -     (unspec [(const_int 0)] UNSPEC_LVX)])]
> > -  "TARGET_ALTIVEC"
> > -{
> > -  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> > -    {
> > -      altivec_expand_lvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_LVX);
> > -      DONE;
> > -    }
> > -})
> > -
> > +; This version of lvx is used only in cases where we need to force an lvx
> > +; over any other load, and we don't care about losing CSE opportunities.
> > +; Its primary use is for prologue register saves.
> >  (define_insn "altivec_lvx_<mode>_internal"
> >    [(parallel
> >      [(set (match_operand:VM2 0 "register_operand" "=v")
> > @@ -2537,20 +2526,45 @@
> >    "lvx %0,%y1"
> >    [(set_attr "type" "vecload")])
> >
> > -(define_expand "altivec_stvx_<mode>"
> > -  [(parallel
> > -    [(set (match_operand:VM2 0 "memory_operand" "=Z")
> > -         (match_operand:VM2 1 "register_operand" "v"))
> > -     (unspec [(const_int 0)] UNSPEC_STVX)])]
> > -  "TARGET_ALTIVEC"
> > -{
> > -  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> > -    {
> > -      altivec_expand_stvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_STVX);
> > -      DONE;
> > -    }
> > -})
> > +; The next two patterns embody what lvx should usually look like.
> > +(define_insn "altivec_lvx_<mode>_2op"
> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
> > +        (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
> > +                                  (match_operand:DI 2 "register_operand" "r"))
> > +                        (const_int -16))))]
> > +  "TARGET_ALTIVEC && TARGET_64BIT"
> > +  "lvx %0,%1,%2"
> > +  [(set_attr "type" "vecload")])
> >
> > +(define_insn "altivec_lvx_<mode>_1op"
> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
> > +        (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
> > +                        (const_int -16))))]
> > +  "TARGET_ALTIVEC && TARGET_64BIT"
> > +  "lvx %0,0,%1"
> > +  [(set_attr "type" "vecload")])
> > +
> > +; 32-bit versions of the above.
> > +(define_insn "altivec_lvx_<mode>_2op_si"
> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
> > +        (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
> > +                                  (match_operand:SI 2 "register_operand" "r"))
> > +                        (const_int -16))))]
> > +  "TARGET_ALTIVEC && TARGET_32BIT"
> > +  "lvx %0,%1,%2"
> > +  [(set_attr "type" "vecload")])
> > +
> > +(define_insn "altivec_lvx_<mode>_1op_si"
> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
> > +        (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
> > +                        (const_int -16))))]
> > +  "TARGET_ALTIVEC && TARGET_32BIT"
> > +  "lvx %0,0,%1"
> > +  [(set_attr "type" "vecload")])
> > +
> > +; This version of stvx is used only in cases where we need to force an stvx
> > +; over any other store, and we don't care about losing CSE opportunities.
> > +; Its primary use is for epilogue register restores.
> >  (define_insn "altivec_stvx_<mode>_internal"
> >    [(parallel
> >      [(set (match_operand:VM2 0 "memory_operand" "=Z")
> > @@ -2560,6 +2574,42 @@
> >    "stvx %1,%y0"
> >    [(set_attr "type" "vecstore")])
> >
> > +; The next two patterns embody what stvx should usually look like.
> > +(define_insn "altivec_stvx_<mode>_2op"
> > +  [(set (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
> > +                                 (match_operand:DI 2 "register_operand" "r"))
> > +                        (const_int -16)))
> > +        (match_operand:VM2 0 "register_operand" "v"))]
> > +  "TARGET_ALTIVEC && TARGET_64BIT"
> > +  "stvx %0,%1,%2"
> > +  [(set_attr "type" "vecstore")])
> > +
> > +(define_insn "altivec_stvx_<mode>_1op"
> > +  [(set (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
> > +                        (const_int -16)))
> > +        (match_operand:VM2 0 "register_operand" "v"))]
> > +  "TARGET_ALTIVEC && TARGET_64BIT"
> > +  "stvx %0,0,%1"
> > +  [(set_attr "type" "vecstore")])
> > +
> > +; 32-bit versions of the above.
> > +(define_insn "altivec_stvx_<mode>_2op_si"
> > +  [(set (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
> > +                                 (match_operand:SI 2 "register_operand" "r"))
> > +                        (const_int -16)))
> > +        (match_operand:VM2 0 "register_operand" "v"))]
> > +  "TARGET_ALTIVEC && TARGET_32BIT"
> > +  "stvx %0,%1,%2"
> > +  [(set_attr "type" "vecstore")])
> > +
> > +(define_insn "altivec_stvx_<mode>_1op_si"
> > +  [(set (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
> > +                        (const_int -16)))
> > +        (match_operand:VM2 0 "register_operand" "v"))]
> > +  "TARGET_ALTIVEC && TARGET_32BIT"
> > +  "stvx %0,0,%1"
> > +  [(set_attr "type" "vecstore")])
> > +
> >  (define_expand "altivec_stvxl_<mode>"
> >    [(parallel
> >      [(set (match_operand:VM2 0 "memory_operand" "=Z")
> > Index: gcc/config/rs6000/rs6000-c.c
> > ===================================================================
> > --- gcc/config/rs6000/rs6000-c.c        (revision 235090)
> > +++ gcc/config/rs6000/rs6000-c.c        (working copy)
> > @@ -4800,6 +4800,164 @@ assignment for unaligned loads and stores");
> >        return stmt;
> >      }
> >
> > +  /* Expand vec_ld into an expression that masks the address and
> > +     performs the load.  We need to expand this early to allow
> > +     the best aliasing, as by the time we get into RTL we no longer
> > +     are able to honor __restrict__, for example.  We may want to
> > +     consider this for all memory access built-ins.
> > +
> > +     When -maltivec=be is specified, simply punt to existing
> > +     built-in processing.  */
> > +  if (fcode == ALTIVEC_BUILTIN_VEC_LD
> > +      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
> > +    {
> > +      tree arg0 = (*arglist)[0];
> > +      tree arg1 = (*arglist)[1];
> > +
> > +      /* Strip qualifiers like "const" from the pointer arg.  */
> > +      tree arg1_type = TREE_TYPE (arg1);
> > +      tree inner_type = TREE_TYPE (arg1_type);
> > +      if (TYPE_QUALS (TREE_TYPE (arg1_type)) != 0)
> > +       {
> > +         arg1_type = build_pointer_type (build_qualified_type (inner_type,
> > +                                                               0));
> > +         arg1 = fold_convert (arg1_type, arg1);
> > +       }
> > +
> > +      /* Construct the masked address.  We have to jump through some hoops
> > +        here.  If the first argument to a PLUS_EXPR is a pointer,
> > +        build_binary_op will multiply the offset by the size of the
> > +        inner type of the pointer (C semantics).  With vec_ld and vec_st,
> > +        the offset must be left alone.  However, if we convert to a
> > +        sizetype to do the arithmetic, we get a PLUS_EXPR instead of a
> > +        POINTER_PLUS_EXPR, which interferes with aliasing (causing us,
> > +        for example, to lose "restrict" information).  Thus where legal,
> > +        we pre-adjust the offset knowing that a multiply by size is
> > +        coming.  When the offset isn't a multiple of the size, we are
> > +        forced to do the arithmetic in size_type for correctness, at the
> > +        cost of losing aliasing information.  This, however, should be
> > +        quite rare with these operations.  */
> > +      arg0 = fold (arg0);
> > +
> > +      /* Let existing error handling take over if we don't have a constant
> > +        offset.  */
> > +      if (TREE_CODE (arg0) == INTEGER_CST)
> > +       {
> > +         HOST_WIDE_INT off = TREE_INT_CST_LOW (arg0);
> > +         HOST_WIDE_INT size = int_size_in_bytes (inner_type);
> > +         tree addr;
> > +
> > +         if (off % size == 0)
> > +           {
> > +             tree adjoff = build_int_cst (TREE_TYPE (arg0), off / size);
> > +             addr = build_binary_op (loc, PLUS_EXPR, arg1, adjoff, 0);
> > +             addr = build1 (NOP_EXPR, sizetype, addr);
> > +           }
> > +         else
> > +           {
> > +             tree hack_arg1 = build1 (NOP_EXPR, sizetype, arg1);
> > +             addr = build_binary_op (loc, PLUS_EXPR, hack_arg1, arg0, 0);
> > +           }
> > +         tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
> > +                                         build_int_cst (sizetype, -16), 0);
> > +
> > +         /* Find the built-in to get the return type so we can convert
> > +            the result properly (or fall back to default handling if the
> > +            arguments aren't compatible).  */
> > +         for (desc = altivec_overloaded_builtins;
> > +              desc->code && desc->code != fcode; desc++)
> > +           continue;
> > +
> > +         for (; desc->code == fcode; desc++)
> > +           if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
> > +               && (rs6000_builtin_type_compatible (TREE_TYPE (arg1),
> > +                                                   desc->op2)))
> > +             {
> > +               tree ret_type = rs6000_builtin_type (desc->ret_type);
> > +               if (TYPE_MODE (ret_type) == V2DImode)
> > +                 /* Type-based aliasing analysis thinks vector long
> > +                    and vector long long are different and will put them
> > +                    in distinct alias classes.  Force our return type
> > +                    to be a may-alias type to avoid this.  */
> > +                 ret_type
> > +                   = build_pointer_type_for_mode (ret_type, Pmode,
> > +                                                  true/*can_alias_all*/);
> > +               else
> > +                 ret_type = build_pointer_type (ret_type);
> > +               aligned = build1 (NOP_EXPR, ret_type, aligned);
> > +               tree ret_val = build_indirect_ref (loc, aligned, RO_NULL);
> > +               return ret_val;
> > +             }
> > +       }
> > +    }
> > +
> > +  /* Similarly for stvx.  */
> > +  if (fcode == ALTIVEC_BUILTIN_VEC_ST
> > +      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
> > +    {
> > +      tree arg0 = (*arglist)[0];
> > +      tree arg1 = (*arglist)[1];
> > +      tree arg2 = (*arglist)[2];
> > +
> > +      /* Construct the masked address.  See handling for ALTIVEC_BUILTIN_VEC_LD
> > +        for an explanation of address arithmetic concerns.  */
> > +      arg1 = fold (arg1);
> > +
> > +      /* Let existing error handling take over if we don't have a constant
> > +        offset.  */
> > +      if (TREE_CODE (arg1) == INTEGER_CST)
> > +       {
> > +         HOST_WIDE_INT off = TREE_INT_CST_LOW (arg1);
> > +         tree inner_type = TREE_TYPE (TREE_TYPE (arg2));
> > +         HOST_WIDE_INT size = int_size_in_bytes (inner_type);
> > +         tree addr;
> > +
> > +         if (off % size == 0)
> > +           {
> > +             tree adjoff = build_int_cst (TREE_TYPE (arg1), off / size);
> > +             addr = build_binary_op (loc, PLUS_EXPR, arg2, adjoff, 0);
> > +             addr = build1 (NOP_EXPR, sizetype, addr);
> > +           }
> > +         else
> > +           {
> > +             tree hack_arg2 = build1 (NOP_EXPR, sizetype, arg2);
> > +             addr = build_binary_op (loc, PLUS_EXPR, hack_arg2, arg1, 0);
> > +           }
> > +         tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
> > +                                         build_int_cst (sizetype, -16), 0);
> > +
> > +         /* Find the built-in to make sure a compatible one exists; if not
> > +            we fall back to default handling to get the error message.  */
> > +         for (desc = altivec_overloaded_builtins;
> > +              desc->code && desc->code != fcode; desc++)
> > +           continue;
> > +
> > +         for (; desc->code == fcode; desc++)
> > +           if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
> > +               && rs6000_builtin_type_compatible (TREE_TYPE (arg1), desc->op2)
> > +               && rs6000_builtin_type_compatible (TREE_TYPE (arg2),
> > +                                                  desc->op3))
> > +             {
> > +               tree arg0_type = TREE_TYPE (arg0);
> > +               if (TYPE_MODE (arg0_type) == V2DImode)
> > +                 /* Type-based aliasing analysis thinks vector long
> > +                    and vector long long are different and will put them
> > +                    in distinct alias classes.  Force our address type
> > +                    to be a may-alias type to avoid this.  */
> > +                 arg0_type
> > +                   = build_pointer_type_for_mode (arg0_type, Pmode,
> > +                                                  true/*can_alias_all*/);
> > +               else
> > +                 arg0_type = build_pointer_type (arg0_type);
> > +               aligned = build1 (NOP_EXPR, arg0_type, aligned);
> > +               tree stg = build_indirect_ref (loc, aligned, RO_NULL);
> > +               tree retval = build2 (MODIFY_EXPR, TREE_TYPE (stg), stg,
> > +                                     convert (TREE_TYPE (stg), arg0));
> > +               return retval;
> > +             }
> > +       }
> > +    }
> > +
> >    for (n = 0;
> >         !VOID_TYPE_P (TREE_VALUE (fnargs)) && n < nargs;
> >         fnargs = TREE_CHAIN (fnargs), n++)
> > Index: gcc/config/rs6000/rs6000.c
> > ===================================================================
> > --- gcc/config/rs6000/rs6000.c  (revision 235090)
> > +++ gcc/config/rs6000/rs6000.c  (working copy)
> > @@ -13025,9 +13025,9 @@ swap_selector_for_mode (machine_mode mode)
> >    return force_reg (V16QImode, gen_rtx_CONST_VECTOR (V16QImode, gen_rtvec_v (16, perm)));
> >  }
> >
> > -/* Generate code for an "lvx", "lvxl", or "lve*x" built-in for a little endian target
> > -   with -maltivec=be specified.  Issue the load followed by an element-reversing
> > -   permute.  */
> > +/* Generate code for an "lvxl", or "lve*x" built-in for a little endian target
> > +   with -maltivec=be specified.  Issue the load followed by an element-
> > +   reversing permute.  */
> >  void
> >  altivec_expand_lvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
> >  {
> > @@ -13043,8 +13043,8 @@ altivec_expand_lvx_be (rtx op0, rtx op1, machine_m
> >    emit_insn (gen_rtx_SET (op0, vperm));
> >  }
> >
> > -/* Generate code for a "stvx" or "stvxl" built-in for a little endian target
> > -   with -maltivec=be specified.  Issue the store preceded by an element-reversing
> > +/* Generate code for a "stvxl" built-in for a little endian target with
> > +   -maltivec=be specified.  Issue the store preceded by an element-reversing
> >     permute.  */
> >  void
> >  altivec_expand_stvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
> > @@ -13106,22 +13106,65 @@ altivec_expand_lv_builtin (enum insn_code icode, t
> >
> >    op1 = copy_to_mode_reg (mode1, op1);
> >
> > -  if (op0 == const0_rtx)
> > +  /* For LVX, express the RTL accurately by ANDing the address with -16.
> > +     LVXL and LVE*X expand to use UNSPECs to hide their special behavior,
> > +     so the raw address is fine.  */
> > +  switch (icode)
> >      {
> > -      addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
> > -    }
> > -  else
> > -    {
> > -      op0 = copy_to_mode_reg (mode0, op0);
> > -      addr = gen_rtx_MEM (blk ? BLKmode : tmode, gen_rtx_PLUS (Pmode, op0, op1));
> > -    }
> > +    case CODE_FOR_altivec_lvx_v2df_2op:
> > +    case CODE_FOR_altivec_lvx_v2di_2op:
> > +    case CODE_FOR_altivec_lvx_v4sf_2op:
> > +    case CODE_FOR_altivec_lvx_v4si_2op:
> > +    case CODE_FOR_altivec_lvx_v8hi_2op:
> > +    case CODE_FOR_altivec_lvx_v16qi_2op:
> > +      {
> > +       rtx rawaddr;
> > +       if (op0 == const0_rtx)
> > +         rawaddr = op1;
> > +       else
> > +         {
> > +           op0 = copy_to_mode_reg (mode0, op0);
> > +           rawaddr = gen_rtx_PLUS (Pmode, op1, op0);
> > +         }
> > +       addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
> > +       addr = gen_rtx_MEM (blk ? BLKmode : tmode, addr);
> >
> > -  pat = GEN_FCN (icode) (target, addr);
> > +       /* For -maltivec=be, emit the load and follow it up with a
> > +          permute to swap the elements.  */
> > +       if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> > +         {
> > +           rtx temp = gen_reg_rtx (tmode);
> > +           emit_insn (gen_rtx_SET (temp, addr));
> >
> > -  if (! pat)
> > -    return 0;
> > -  emit_insn (pat);
> > +           rtx sel = swap_selector_for_mode (tmode);
> > +           rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, temp, temp, sel),
> > +                                       UNSPEC_VPERM);
> > +           emit_insn (gen_rtx_SET (target, vperm));
> > +         }
> > +       else
> > +         emit_insn (gen_rtx_SET (target, addr));
> >
> > +       break;
> > +      }
> > +
> > +    default:
> > +      if (op0 == const0_rtx)
> > +       addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
> > +      else
> > +       {
> > +         op0 = copy_to_mode_reg (mode0, op0);
> > +         addr = gen_rtx_MEM (blk ? BLKmode : tmode,
> > +                             gen_rtx_PLUS (Pmode, op1, op0));
> > +       }
> > +
> > +      pat = GEN_FCN (icode) (target, addr);
> > +      if (! pat)
> > +       return 0;
> > +      emit_insn (pat);
> > +
> > +      break;
> > +    }
> > +
> >    return target;
> >  }
> >
> > @@ -13208,7 +13251,7 @@ altivec_expand_stv_builtin (enum insn_code icode,
> >    rtx op0 = expand_normal (arg0);
> >    rtx op1 = expand_normal (arg1);
> >    rtx op2 = expand_normal (arg2);
> > -  rtx pat, addr;
> > +  rtx pat, addr, rawaddr;
> >    machine_mode tmode = insn_data[icode].operand[0].mode;
> >    machine_mode smode = insn_data[icode].operand[1].mode;
> >    machine_mode mode1 = Pmode;
> > @@ -13220,24 +13263,69 @@ altivec_expand_stv_builtin (enum insn_code icode,
> >        || arg2 == error_mark_node)
> >      return const0_rtx;
> >
> > -  if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
> > -    op0 = copy_to_mode_reg (smode, op0);
> > -
> >    op2 = copy_to_mode_reg (mode2, op2);
> >
> > -  if (op1 == const0_rtx)
> > +  /* For STVX, express the RTL accurately by ANDing the address with -16.
> > +     STVXL and STVE*X expand to use UNSPECs to hide their special behavior,
> > +     so the raw address is fine.  */
> > +  switch (icode)
> >      {
> > -      addr = gen_rtx_MEM (tmode, op2);
> > -    }
> > -  else
> > -    {
> > -      op1 = copy_to_mode_reg (mode1, op1);
> > -      addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op1, op2));
> > -    }
> > +    case CODE_FOR_altivec_stvx_v2df_2op:
> > +    case CODE_FOR_altivec_stvx_v2di_2op:
> > +    case CODE_FOR_altivec_stvx_v4sf_2op:
> > +    case CODE_FOR_altivec_stvx_v4si_2op:
> > +    case CODE_FOR_altivec_stvx_v8hi_2op:
> > +    case CODE_FOR_altivec_stvx_v16qi_2op:
> > +      {
> > +       if (op1 == const0_rtx)
> > +         rawaddr = op2;
> > +       else
> > +         {
> > +           op1 = copy_to_mode_reg (mode1, op1);
> > +           rawaddr = gen_rtx_PLUS (Pmode, op2, op1);
> > +         }
> >
> > -  pat = GEN_FCN (icode) (addr, op0);
> > -  if (pat)
> > -    emit_insn (pat);
> > +       addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
> > +       addr = gen_rtx_MEM (tmode, addr);
> > +
> > +       op0 = copy_to_mode_reg (tmode, op0);
> > +
> > +       /* For -maltivec=be, emit a permute to swap the elements, followed
> > +          by the store.  */
> > +       if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> > +         {
> > +           rtx temp = gen_reg_rtx (tmode);
> > +           rtx sel = swap_selector_for_mode (tmode);
> > +           rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, op0, op0, sel),
> > +                                       UNSPEC_VPERM);
> > +           emit_insn (gen_rtx_SET (temp, vperm));
> > +           emit_insn (gen_rtx_SET (addr, temp));
> > +         }
> > +       else
> > +         emit_insn (gen_rtx_SET (addr, op0));
> > +
> > +       break;
> > +      }
> > +
> > +    default:
> > +      {
> > +       if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
> > +         op0 = copy_to_mode_reg (smode, op0);
> > +
> > +       if (op1 == const0_rtx)
> > +         addr = gen_rtx_MEM (tmode, op2);
> > +       else
> > +         {
> > +           op1 = copy_to_mode_reg (mode1, op1);
> > +           addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op2, op1));
> > +         }
> > +
> > +       pat = GEN_FCN (icode) (addr, op0);
> > +       if (pat)
> > +         emit_insn (pat);
> > +      }
> > +    }
> > +
> >    return NULL_RTX;
> >  }
> >
> > @@ -14073,18 +14161,18 @@ altivec_expand_builtin (tree exp, rtx target, bool
> >    switch (fcode)
> >      {
> >      case ALTIVEC_BUILTIN_STVX_V2DF:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX_V2DI:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX_V4SF:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX:
> >      case ALTIVEC_BUILTIN_STVX_V4SI:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX_V8HI:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX_V16QI:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi_2op, exp);
> >      case ALTIVEC_BUILTIN_STVEBX:
> >        return altivec_expand_stv_builtin (CODE_FOR_altivec_stvebx, exp);
> >      case ALTIVEC_BUILTIN_STVEHX:
> > @@ -14272,23 +14360,23 @@ altivec_expand_builtin (tree exp, rtx target, bool
> >        return altivec_expand_lv_builtin (CODE_FOR_altivec_lvxl_v16qi,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V2DF:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V2DI:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V4SF:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX:
> >      case ALTIVEC_BUILTIN_LVX_V4SI:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V8HI:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V16QI:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVLX:
> >        return altivec_expand_lv_builtin (CODE_FOR_altivec_lvlx,
> > @@ -37139,7 +37227,9 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
> >       fix them up by converting them to permuting ones.  Exceptions:
> >       UNSPEC_LVE, UNSPEC_LVX, and UNSPEC_STVX, which have a PARALLEL
> >       body instead of a SET; and UNSPEC_STVE, which has an UNSPEC
> > -     for the SET source.  */
> > +     for the SET source.  Also we must now make an exception for lvx
> > +     and stvx when they are not in the UNSPEC_LVX/STVX form (with the
> > +     explicit "& -16") since this leads to unrecognizable insns.  */
> >    rtx body = PATTERN (insn);
> >    int i = INSN_UID (insn);
> >
> > @@ -37147,6 +37237,11 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
> >      {
> >        if (GET_CODE (body) == SET)
> >         {
> > +         rtx rhs = SET_SRC (body);
> > +         gcc_assert (GET_CODE (rhs) == MEM);
> > +         if (GET_CODE (XEXP (rhs, 0)) == AND)
> > +           return 0;
> > +
> >           *special = SH_NOSWAP_LD;
> >           return 1;
> >         }
> > @@ -37156,8 +37251,14 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
> >
> >    if (insn_entry[i].is_store)
> >      {
> > -      if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) != UNSPEC)
> > +      if (GET_CODE (body) == SET
> > +         && GET_CODE (SET_SRC (body)) != UNSPEC)
> >         {
> > +         rtx lhs = SET_DEST (body);
> > +         gcc_assert (GET_CODE (lhs) == MEM);
> > +         if (GET_CODE (XEXP (lhs, 0)) == AND)
> > +           return 0;
> > +
> >           *special = SH_NOSWAP_ST;
> >           return 1;
> >         }
> > @@ -37827,6 +37928,267 @@ dump_swap_insn_table (swap_web_entry *insn_entry)
> >    fputs ("\n", dump_file);
> >  }
> >
> > +/* Return RTX with its address canonicalized to (reg) or (+ reg reg).
> > +   Here RTX is an (& addr (const_int -16)).  Always return a new copy
> > +   to avoid problems with combine.  */
> > +static rtx
> > +alignment_with_canonical_addr (rtx align)
> > +{
> > +  rtx canon;
> > +  rtx addr = XEXP (align, 0);
> > +
> > +  if (REG_P (addr))
> > +    canon = addr;
> > +
> > +  else if (GET_CODE (addr) == PLUS)
> > +    {
> > +      rtx addrop0 = XEXP (addr, 0);
> > +      rtx addrop1 = XEXP (addr, 1);
> > +
> > +      if (!REG_P (addrop0))
> > +       addrop0 = force_reg (GET_MODE (addrop0), addrop0);
> > +
> > +      if (!REG_P (addrop1))
> > +       addrop1 = force_reg (GET_MODE (addrop1), addrop1);
> > +
> > +      canon = gen_rtx_PLUS (GET_MODE (addr), addrop0, addrop1);
> > +    }
> > +
> > +  else
> > +    canon = force_reg (GET_MODE (addr), addr);
> > +
> > +  return gen_rtx_AND (GET_MODE (align), canon, GEN_INT (-16));
> > +}
> > +
> > +/* Check whether an rtx is an alignment mask, and if so, return
> > +   a fully-expanded rtx for the masking operation.  */
> > +static rtx
> > +alignment_mask (rtx_insn *insn)
> > +{
> > +  rtx body = PATTERN (insn);
> > +
> > +  if (GET_CODE (body) != SET
> > +      || GET_CODE (SET_SRC (body)) != AND
> > +      || !REG_P (XEXP (SET_SRC (body), 0)))
> > +    return 0;
> > +
> > +  rtx mask = XEXP (SET_SRC (body), 1);
> > +
> > +  if (GET_CODE (mask) == CONST_INT)
> > +    {
> > +      if (INTVAL (mask) == -16)
> > +       return alignment_with_canonical_addr (SET_SRC (body));
> > +      else
> > +       return 0;
> > +    }
> > +
> > +  if (!REG_P (mask))
> > +    return 0;
> > +
> > +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> > +  df_ref use;
> > +  rtx real_mask = 0;
> > +
> > +  FOR_EACH_INSN_INFO_USE (use, insn_info)
> > +    {
> > +      if (!rtx_equal_p (DF_REF_REG (use), mask))
> > +       continue;
> > +
> > +      struct df_link *def_link = DF_REF_CHAIN (use);
> > +      if (!def_link || def_link->next)
> > +       return 0;
> > +
> > +      rtx_insn *const_insn = DF_REF_INSN (def_link->ref);
> > +      rtx const_body = PATTERN (const_insn);
> > +      if (GET_CODE (const_body) != SET)
> > +       return 0;
> > +
> > +      real_mask = SET_SRC (const_body);
> > +
> > +      if (GET_CODE (real_mask) != CONST_INT
> > +         || INTVAL (real_mask) != -16)
> > +       return 0;
> > +    }
> > +
> > +  if (real_mask == 0)
> > +    return 0;
> > +
> > +  return alignment_with_canonical_addr (SET_SRC (body));
> > +}
> > +
> > +/* Given INSN that's a load or store based at BASE_REG, look for a
> > +   feeding computation that aligns its address on a 16-byte boundary.  */
> > +static rtx
> > +find_alignment_op (rtx_insn *insn, rtx base_reg)
> > +{
> > +  df_ref base_use;
> > +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> > +  rtx and_operation = 0;
> > +
> > +  FOR_EACH_INSN_INFO_USE (base_use, insn_info)
> > +    {
> > +      if (!rtx_equal_p (DF_REF_REG (base_use), base_reg))
> > +       continue;
> > +
> > +      struct df_link *base_def_link = DF_REF_CHAIN (base_use);
> > +      if (!base_def_link || base_def_link->next)
> > +       break;
> > +
> > +      rtx_insn *and_insn = DF_REF_INSN (base_def_link->ref);
> > +      and_operation = alignment_mask (and_insn);
> > +      if (and_operation != 0)
> > +       break;
> > +    }
> > +
> > +  return and_operation;
> > +}
> > +
> > +struct del_info { bool replace; rtx_insn *replace_insn; };
> > +
> > +/* If INSN is the load for an lvx pattern, put it in canonical form.  */
> > +static void
> > +combine_lvx_pattern (rtx_insn *insn, del_info *to_delete)
> > +{
> > +  rtx body = PATTERN (insn);
> > +  gcc_assert (GET_CODE (body) == SET
> > +             && GET_CODE (SET_SRC (body)) == VEC_SELECT
> > +             && GET_CODE (XEXP (SET_SRC (body), 0)) == MEM);
> > +
> > +  rtx mem = XEXP (SET_SRC (body), 0);
> > +  rtx base_reg = XEXP (mem, 0);
> > +
> > +  rtx and_operation = find_alignment_op (insn, base_reg);
> > +
> > +  if (and_operation != 0)
> > +    {
> > +      df_ref def;
> > +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> > +      FOR_EACH_INSN_INFO_DEF (def, insn_info)
> > +       {
> > +         struct df_link *link = DF_REF_CHAIN (def);
> > +         if (!link || link->next)
> > +           break;
> > +
> > +         rtx_insn *swap_insn = DF_REF_INSN (link->ref);
> > +         if (!insn_is_swap_p (swap_insn)
> > +             || insn_is_load_p (swap_insn)
> > +             || insn_is_store_p (swap_insn))
> > +           break;
> > +
> > +         /* Expected lvx pattern found.  Change the swap to
> > +            a copy, and propagate the AND operation into the
> > +            load.  */
> > +         to_delete[INSN_UID (swap_insn)].replace = true;
> > +         to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
> > +
> > +         XEXP (mem, 0) = and_operation;
> > +         SET_SRC (body) = mem;
> > +         INSN_CODE (insn) = -1; /* Force re-recognition.  */
> > +         df_insn_rescan (insn);
> > +
> > +         if (dump_file)
> > +           fprintf (dump_file, "lvx opportunity found at %d\n",
> > +                    INSN_UID (insn));
> > +       }
> > +    }
> > +}
> > +
> > +/* If INSN is the store for an stvx pattern, put it in canonical form.  */
> > +static void
> > +combine_stvx_pattern (rtx_insn *insn, del_info *to_delete)
> > +{
> > +  rtx body = PATTERN (insn);
> > +  gcc_assert (GET_CODE (body) == SET
> > +             && GET_CODE (SET_DEST (body)) == MEM
> > +             && GET_CODE (SET_SRC (body)) == VEC_SELECT);
> > +  rtx mem = SET_DEST (body);
> > +  rtx base_reg = XEXP (mem, 0);
> > +
> > +  rtx and_operation = find_alignment_op (insn, base_reg);
> > +
> > +  if (and_operation != 0)
> > +    {
> > +      rtx src_reg = XEXP (SET_SRC (body), 0);
> > +      df_ref src_use;
> > +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> > +      FOR_EACH_INSN_INFO_USE (src_use, insn_info)
> > +       {
> > +         if (!rtx_equal_p (DF_REF_REG (src_use), src_reg))
> > +           continue;
> > +
> > +         struct df_link *link = DF_REF_CHAIN (src_use);
> > +         if (!link || link->next)
> > +           break;
> > +
> > +         rtx_insn *swap_insn = DF_REF_INSN (link->ref);
> > +         if (!insn_is_swap_p (swap_insn)
> > +             || insn_is_load_p (swap_insn)
> > +             || insn_is_store_p (swap_insn))
> > +           break;
> > +
> > +         /* Expected stvx pattern found.  Change the swap to
> > +            a copy, and propagate the AND operation into the
> > +            store.  */
> > +         to_delete[INSN_UID (swap_insn)].replace = true;
> > +         to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
> > +
> > +         XEXP (mem, 0) = and_operation;
> > +         SET_SRC (body) = src_reg;
> > +         INSN_CODE (insn) = -1; /* Force re-recognition.  */
> > +         df_insn_rescan (insn);
> > +
> > +         if (dump_file)
> > +           fprintf (dump_file, "stvx opportunity found at %d\n",
> > +                    INSN_UID (insn));
> > +       }
> > +    }
> > +}
> > +
> > +/* Look for patterns created from builtin lvx and stvx calls, and
> > +   canonicalize them to be properly recognized as such.  */
> > +static void
> > +combine_lvx_stvx_patterns (function *fun)
> > +{
> > +  int i;
> > +  basic_block bb;
> > +  rtx_insn *insn;
> > +
> > +  int num_insns = get_max_uid ();
> > +  del_info *to_delete = XCNEWVEC (del_info, num_insns);
> > +
> > +  FOR_ALL_BB_FN (bb, fun)
> > +    FOR_BB_INSNS (bb, insn)
> > +    {
> > +      if (!NONDEBUG_INSN_P (insn))
> > +       continue;
> > +
> > +      if (insn_is_load_p (insn) && insn_is_swap_p (insn))
> > +       combine_lvx_pattern (insn, to_delete);
> > +      else if (insn_is_store_p (insn) && insn_is_swap_p (insn))
> > +       combine_stvx_pattern (insn, to_delete);
> > +    }
> > +
> > +  /* Turning swaps into copies is delayed until now, to avoid problems
> > +     with deleting instructions during the insn walk.  */
> > +  for (i = 0; i < num_insns; i++)
> > +    if (to_delete[i].replace)
> > +      {
> > +       rtx swap_body = PATTERN (to_delete[i].replace_insn);
> > +       rtx src_reg = XEXP (SET_SRC (swap_body), 0);
> > +       rtx copy = gen_rtx_SET (SET_DEST (swap_body), src_reg);
> > +       rtx_insn *new_insn = emit_insn_before (copy,
> > +                                              to_delete[i].replace_insn);
> > +       set_block_for_insn (new_insn,
> > +                           BLOCK_FOR_INSN (to_delete[i].replace_insn));
> > +       df_insn_rescan (new_insn);
> > +       df_insn_delete (to_delete[i].replace_insn);
> > +       remove_insn (to_delete[i].replace_insn);
> > +       to_delete[i].replace_insn->set_deleted ();
> > +      }
> > +
> > +  free (to_delete);
> > +}
> > +
> >  /* Main entry point for this pass.  */
> >  unsigned int
> >  rs6000_analyze_swaps (function *fun)
> > @@ -37833,7 +38195,7 @@ rs6000_analyze_swaps (function *fun)
> >  {
> >    swap_web_entry *insn_entry;
> >    basic_block bb;
> > -  rtx_insn *insn;
> > +  rtx_insn *insn, *curr_insn = 0;
> >
> >    /* Dataflow analysis for use-def chains.  */
> >    df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
> > @@ -37841,12 +38203,15 @@ rs6000_analyze_swaps (function *fun)
> >    df_analyze ();
> >    df_set_flags (DF_DEFER_INSN_RESCAN);
> >
> > +  /* Pre-pass to combine lvx and stvx patterns so we don't lose info.  */
> > +  combine_lvx_stvx_patterns (fun);
> > +
> >    /* Allocate structure to represent webs of insns.  */
> >    insn_entry = XCNEWVEC (swap_web_entry, get_max_uid ());
> >
> >    /* Walk the insns to gather basic data.  */
> >    FOR_ALL_BB_FN (bb, fun)
> > -    FOR_BB_INSNS (bb, insn)
> > +    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
> >      {
> >        unsigned int uid = INSN_UID (insn);
> >        if (NONDEBUG_INSN_P (insn))
> > Index: gcc/config/rs6000/vector.md
> > ===================================================================
> > --- gcc/config/rs6000/vector.md (revision 235090)
> > +++ gcc/config/rs6000/vector.md (working copy)
> > @@ -167,7 +167,14 @@
> >    if (VECTOR_MEM_VSX_P (<MODE>mode))
> >      {
> >        operands[1] = rs6000_address_for_altivec (operands[1]);
> > -      emit_insn (gen_altivec_lvx_<mode> (operands[0], operands[1]));
> > +      rtx and_op = XEXP (operands[1], 0);
> > +      gcc_assert (GET_CODE (and_op) == AND);
> > +      rtx addr = XEXP (and_op, 0);
> > +      if (GET_CODE (addr) == PLUS)
> > +        emit_insn (gen_altivec_lvx_<mode>_2op (operands[0], XEXP (addr, 0),
> > +                                              XEXP (addr, 1)));
> > +      else
> > +        emit_insn (gen_altivec_lvx_<mode>_1op (operands[0], operands[1]));
> >        DONE;
> >      }
> >  }")
> > @@ -183,7 +190,14 @@
> >    if (VECTOR_MEM_VSX_P (<MODE>mode))
> >      {
> >        operands[0] = rs6000_address_for_altivec (operands[0]);
> > -      emit_insn (gen_altivec_stvx_<mode> (operands[0], operands[1]));
> > +      rtx and_op = XEXP (operands[0], 0);
> > +      gcc_assert (GET_CODE (and_op) == AND);
> > +      rtx addr = XEXP (and_op, 0);
> > +      if (GET_CODE (addr) == PLUS)
> > +        emit_insn (gen_altivec_stvx_<mode>_2op (operands[1], XEXP (addr, 0),
> > +                                               XEXP (addr, 1)));
> > +      else
> > +        emit_insn (gen_altivec_stvx_<mode>_1op (operands[1], operands[0]));
> >        DONE;
> >      }
> >  }")
> >
> >
> 
>
Bill Schmidt April 19, 2016, 8:27 p.m. UTC | #3
On Tue, 2016-04-19 at 10:09 +0200, Richard Biener wrote:
> On Tue, Apr 19, 2016 at 12:05 AM, Bill Schmidt
> <wschmidt@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > Expanding built-ins in the usual way (leaving them as calls until
> > expanding into RTL) restricts the amount of optimization that can be
> > performed on the code represented by the built-ins.  This has been
> > observed to be particularly bad for the vec_ld and vec_st built-ins on
> > PowerPC, which represent the lvx and stvx instructions.  Currently these
> > are expanded into UNSPECs that are left untouched by the optimizers, so
> > no redundant load or store elimination can take place.  For certain
> > idiomatic usages, this leads to very bad performance.
> >
> > Initially I planned to just change the UNSPEC representation to RTL that
> > directly expresses the address masking implicit in lvx and stvx.  This
> > turns out to be only partially successful in improving performance.
> > Among other things, by the time we reach RTL we have lost track of the
> > __restrict__ attribute, leading to more appearances of may-alias
> > relationships than should really be present.  Instead, this patch
> > expands the built-ins during parsing so that they are exposed to all
> > GIMPLE optimizations as well.
> >
> > This works well for vec_ld and vec_st.  It is also possible for
> > programmers to instead use __builtin_altivec_lvx_<mode> and
> > __builtin_altivec_stvx_<mode>.  These are not so easy to catch during
> > parsing, since they are not processed by the overloaded built-in
> > function table.  For these, I am currently falling back to expansion
> > during RTL while still exposing the address-masking semantics, which
> > seems ok for these somewhat obscure built-ins.  At some future time we
> > may decide to handle them similarly to vec_ld and vec_st.
> >
> > For POWER8 little-endian only, the loads and stores during expand time
> > require some special handling, since the POWER8 expanders want to
> > convert these to lxvd2x/xxswapd and xxswapd/stxvd2x.  To deal with this,
> > I've added an extra pre-pass to the swap optimization phase that
> > recognizes the lvx and stvx patterns and canonicalizes them so they'll
> > be properly recognized.  This isn't an issue for earlier or later
> > processors, or for big-endian POWER8, so doing this as part of swap
> > optimization is appropriate.
> >
> > We have a lot of existing test cases for this code, which proved very
> > useful in discovering bugs, so I haven't seen a reason to add any new
> > tests.
> >
> > The patch is fairly large, but it isn't feasible to break it up into
> > smaller units without leaving something in a broken state.  So I will
> > have to just apologize for the size and leave it at that.  Sorry! :)
> >
> > Bootstrapped and tested successfully on powerpc64le-unknown-linux-gnu,
> > and on powerpc64-unknown-linux-gnu (-m32 and -m64) with no regressions.
> > Is this ok for trunk after GCC 6 releases?
> 
> Just took a very quick look but it seems you are using integer arithmetic
> for the pointer adjustment and bit-and.  You could use POINTER_PLUS_EXPR
> for the addition and BIT_AND_EXPR is also valid on pointer types, which
> means you don't need conversions to/from sizetype.

I just verified that I run into trouble with both of these changes.  The
build_binary_op interface doesn't accept POINTER_PLUS_EXPR as a valid
code (we hit a gcc_unreachable in the main switch statement), but does
produce pointer additions from a PLUS_EXPR.  Also, apparently
BIT_AND_EXPR is not valid on at least these pointer types:

ld.c: In function 'test':
ld.c:68:9: error: invalid operands to binary & (have '__vector(16) unsigned char *' and '__vector(16) unsigned char *')
   vuc = vec_ld (0, (vector unsigned char *)svuc);
         ^

That's what happens if I try:

          tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
                                          build_int_cst (TREE_TYPE (arg1),
                                                         -16), 0);

If I instead build the -16 as a sizetype constant, I get the same error
message except that the second argument listed is 'sizetype'.  Is there
something else I should be trying instead?

Thanks,
Bill
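
For concreteness, here is a minimal sketch of the kind of idiom this work
is aimed at.  It is not taken from the patch or from the testsuite; the
function name is made up, and it assumes <altivec.h> and -maltivec.
vec_ld and vec_st mask their address operand with -16, so once that
masking is visible to the GIMPLE optimizers instead of being hidden in an
UNSPEC, the final reload below becomes a candidate for redundant load
elimination:

  /* Illustrative only; not part of the patch.  */
  #include <altivec.h>

  vector unsigned char
  round_trip (const vector unsigned char *src, vector unsigned char *dst)
  {
    /* lvx: the load reads from src & -16.  */
    vector unsigned char v = vec_ld (0, src);
    /* stvx: the store writes to dst & -16.  */
    vec_st (v, 0, dst);
    /* This reads back from dst & -16, the location just stored to; with
       the masking exposed in GIMPLE the load can be seen to be redundant.  */
    return vec_ld (0, dst);
  }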


> 
> x86 nowadays has intrinsics implemented as inlines - they come from
> header files.  It seems for ppc the intrinsics are somehow magically
> there, w/o a header file?
> 
> Richard.
> 
> > Thanks,
> > Bill
> >
> >
> > 2016-04-18  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
> >
> >         * config/rs6000/altivec.md (altivec_lvx_<mode>): Remove.
> >         (altivec_lvx_<mode>_internal): Document.
> >         (altivec_lvx_<mode>_2op): New define_insn.
> >         (altivec_lvx_<mode>_1op): Likewise.
> >         (altivec_lvx_<mode>_2op_si): Likewise.
> >         (altivec_lvx_<mode>_1op_si): Likewise.
> >         (altivec_stvx_<mode>): Remove.
> >         (altivec_stvx_<mode>_internal): Document.
> >         (altivec_stvx_<mode>_2op): New define_insn.
> >         (altivec_stvx_<mode>_1op): Likewise.
> >         (altivec_stvx_<mode>_2op_si): Likewise.
> >         (altivec_stvx_<mode>_1op_si): Likewise.
> >         * config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
> >         Expand vec_ld and vec_st during parsing.
> >         * config/rs6000/rs6000.c (altivec_expand_lvx_be): Commentary
> >         changes.
> >         (altivec_expand_stvx_be): Likewise.
> >         (altivec_expand_lv_builtin): Expand lvx built-ins to expose the
> >         address-masking behavior in RTL.
> >         (altivec_expand_stv_builtin): Expand stvx built-ins to expose the
> >         address-masking behavior in RTL.
> >         (altivec_expand_builtin): Change builtin code arguments for calls
> >         to altivec_expand_stv_builtin and altivec_expand_lv_builtin.
> >         (insn_is_swappable_p): Avoid incorrect swap optimization in the
> >         presence of lvx/stvx patterns.
> >         (alignment_with_canonical_addr): New function.
> >         (alignment_mask): Likewise.
> >         (find_alignment_op): Likewise.
> >         (combine_lvx_pattern): Likewise.
> >         (combine_stvx_pattern): Likewise.
> >         (combine_lvx_stvx_patterns): Likewise.
> >         (rs6000_analyze_swaps): Perform a pre-pass to recognize lvx and
> >         stvx patterns from expand.
> >         * config/rs6000/vector.md (vector_altivec_load_<mode>): Use new
> >         expansions.
> >         (vector_altivec_store_<mode>): Likewise.
> >
> >
> > Index: gcc/config/rs6000/altivec.md
> > ===================================================================
> > --- gcc/config/rs6000/altivec.md        (revision 235090)
> > +++ gcc/config/rs6000/altivec.md        (working copy)
> > @@ -2514,20 +2514,9 @@
> >    "lvxl %0,%y1"
> >    [(set_attr "type" "vecload")])
> >
> > -(define_expand "altivec_lvx_<mode>"
> > -  [(parallel
> > -    [(set (match_operand:VM2 0 "register_operand" "=v")
> > -         (match_operand:VM2 1 "memory_operand" "Z"))
> > -     (unspec [(const_int 0)] UNSPEC_LVX)])]
> > -  "TARGET_ALTIVEC"
> > -{
> > -  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> > -    {
> > -      altivec_expand_lvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_LVX);
> > -      DONE;
> > -    }
> > -})
> > -
> > +; This version of lvx is used only in cases where we need to force an lvx
> > +; over any other load, and we don't care about losing CSE opportunities.
> > +; Its primary use is for prologue register saves.
> >  (define_insn "altivec_lvx_<mode>_internal"
> >    [(parallel
> >      [(set (match_operand:VM2 0 "register_operand" "=v")
> > @@ -2537,20 +2526,45 @@
> >    "lvx %0,%y1"
> >    [(set_attr "type" "vecload")])
> >
> > -(define_expand "altivec_stvx_<mode>"
> > -  [(parallel
> > -    [(set (match_operand:VM2 0 "memory_operand" "=Z")
> > -         (match_operand:VM2 1 "register_operand" "v"))
> > -     (unspec [(const_int 0)] UNSPEC_STVX)])]
> > -  "TARGET_ALTIVEC"
> > -{
> > -  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> > -    {
> > -      altivec_expand_stvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_STVX);
> > -      DONE;
> > -    }
> > -})
> > +; The next two patterns embody what lvx should usually look like.
> > +(define_insn "altivec_lvx_<mode>_2op"
> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
> > +        (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
> > +                                  (match_operand:DI 2 "register_operand" "r"))
> > +                        (const_int -16))))]
> > +  "TARGET_ALTIVEC && TARGET_64BIT"
> > +  "lvx %0,%1,%2"
> > +  [(set_attr "type" "vecload")])
> >
> > +(define_insn "altivec_lvx_<mode>_1op"
> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
> > +        (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
> > +                        (const_int -16))))]
> > +  "TARGET_ALTIVEC && TARGET_64BIT"
> > +  "lvx %0,0,%1"
> > +  [(set_attr "type" "vecload")])
> > +
> > +; 32-bit versions of the above.
> > +(define_insn "altivec_lvx_<mode>_2op_si"
> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
> > +        (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
> > +                                  (match_operand:SI 2 "register_operand" "r"))
> > +                        (const_int -16))))]
> > +  "TARGET_ALTIVEC && TARGET_32BIT"
> > +  "lvx %0,%1,%2"
> > +  [(set_attr "type" "vecload")])
> > +
> > +(define_insn "altivec_lvx_<mode>_1op_si"
> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
> > +        (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
> > +                        (const_int -16))))]
> > +  "TARGET_ALTIVEC && TARGET_32BIT"
> > +  "lvx %0,0,%1"
> > +  [(set_attr "type" "vecload")])
> > +
> > +; This version of stvx is used only in cases where we need to force an stvx
> > +; over any other store, and we don't care about losing CSE opportunities.
> > +; Its primary use is for epilogue register restores.
> >  (define_insn "altivec_stvx_<mode>_internal"
> >    [(parallel
> >      [(set (match_operand:VM2 0 "memory_operand" "=Z")
> > @@ -2560,6 +2574,42 @@
> >    "stvx %1,%y0"
> >    [(set_attr "type" "vecstore")])
> >
> > +; The next two patterns embody what stvx should usually look like.
> > +(define_insn "altivec_stvx_<mode>_2op"
> > +  [(set (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
> > +                                 (match_operand:DI 2 "register_operand" "r"))
> > +                        (const_int -16)))
> > +        (match_operand:VM2 0 "register_operand" "v"))]
> > +  "TARGET_ALTIVEC && TARGET_64BIT"
> > +  "stvx %0,%1,%2"
> > +  [(set_attr "type" "vecstore")])
> > +
> > +(define_insn "altivec_stvx_<mode>_1op"
> > +  [(set (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
> > +                        (const_int -16)))
> > +        (match_operand:VM2 0 "register_operand" "v"))]
> > +  "TARGET_ALTIVEC && TARGET_64BIT"
> > +  "stvx %0,0,%1"
> > +  [(set_attr "type" "vecstore")])
> > +
> > +; 32-bit versions of the above.
> > +(define_insn "altivec_stvx_<mode>_2op_si"
> > +  [(set (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
> > +                                 (match_operand:SI 2 "register_operand" "r"))
> > +                        (const_int -16)))
> > +        (match_operand:VM2 0 "register_operand" "v"))]
> > +  "TARGET_ALTIVEC && TARGET_32BIT"
> > +  "stvx %0,%1,%2"
> > +  [(set_attr "type" "vecstore")])
> > +
> > +(define_insn "altivec_stvx_<mode>_1op_si"
> > +  [(set (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
> > +                        (const_int -16)))
> > +        (match_operand:VM2 0 "register_operand" "v"))]
> > +  "TARGET_ALTIVEC && TARGET_32BIT"
> > +  "stvx %0,0,%1"
> > +  [(set_attr "type" "vecstore")])
> > +
> >  (define_expand "altivec_stvxl_<mode>"
> >    [(parallel
> >      [(set (match_operand:VM2 0 "memory_operand" "=Z")
> > Index: gcc/config/rs6000/rs6000-c.c
> > ===================================================================
> > --- gcc/config/rs6000/rs6000-c.c        (revision 235090)
> > +++ gcc/config/rs6000/rs6000-c.c        (working copy)
> > @@ -4800,6 +4800,164 @@ assignment for unaligned loads and stores");
> >        return stmt;
> >      }
> >
> > +  /* Expand vec_ld into an expression that masks the address and
> > +     performs the load.  We need to expand this early to allow
> > +     the best aliasing, as by the time we get into RTL we no longer
> > +     are able to honor __restrict__, for example.  We may want to
> > +     consider this for all memory access built-ins.
> > +
> > +     When -maltivec=be is specified, simply punt to existing
> > +     built-in processing.  */
> > +  if (fcode == ALTIVEC_BUILTIN_VEC_LD
> > +      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
> > +    {
> > +      tree arg0 = (*arglist)[0];
> > +      tree arg1 = (*arglist)[1];
> > +
> > +      /* Strip qualifiers like "const" from the pointer arg.  */
> > +      tree arg1_type = TREE_TYPE (arg1);
> > +      tree inner_type = TREE_TYPE (arg1_type);
> > +      if (TYPE_QUALS (TREE_TYPE (arg1_type)) != 0)
> > +       {
> > +         arg1_type = build_pointer_type (build_qualified_type (inner_type,
> > +                                                               0));
> > +         arg1 = fold_convert (arg1_type, arg1);
> > +       }
> > +
> > +      /* Construct the masked address.  We have to jump through some hoops
> > +        here.  If the first argument to a PLUS_EXPR is a pointer,
> > +        build_binary_op will multiply the offset by the size of the
> > +        inner type of the pointer (C semantics).  With vec_ld and vec_st,
> > +        the offset must be left alone.  However, if we convert to a
> > +        sizetype to do the arithmetic, we get a PLUS_EXPR instead of a
> > +        POINTER_PLUS_EXPR, which interferes with aliasing (causing us,
> > +        for example, to lose "restrict" information).  Thus where legal,
> > +        we pre-adjust the offset knowing that a multiply by size is
> > +        coming.  When the offset isn't a multiple of the size, we are
> > +        forced to do the arithmetic in size_type for correctness, at the
> > +        cost of losing aliasing information.  This, however, should be
> > +        quite rare with these operations.  */
> > +      arg0 = fold (arg0);
> > +
> > +      /* Let existing error handling take over if we don't have a constant
> > +        offset.  */
> > +      if (TREE_CODE (arg0) == INTEGER_CST)
> > +       {
> > +         HOST_WIDE_INT off = TREE_INT_CST_LOW (arg0);
> > +         HOST_WIDE_INT size = int_size_in_bytes (inner_type);
> > +         tree addr;
> > +
> > +         if (off % size == 0)
> > +           {
> > +             tree adjoff = build_int_cst (TREE_TYPE (arg0), off / size);
> > +             addr = build_binary_op (loc, PLUS_EXPR, arg1, adjoff, 0);
> > +             addr = build1 (NOP_EXPR, sizetype, addr);
> > +           }
> > +         else
> > +           {
> > +             tree hack_arg1 = build1 (NOP_EXPR, sizetype, arg1);
> > +             addr = build_binary_op (loc, PLUS_EXPR, hack_arg1, arg0, 0);
> > +           }
> > +         tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
> > +                                         build_int_cst (sizetype, -16), 0);
> > +
> > +         /* Find the built-in to get the return type so we can convert
> > +            the result properly (or fall back to default handling if the
> > +            arguments aren't compatible).  */
> > +         for (desc = altivec_overloaded_builtins;
> > +              desc->code && desc->code != fcode; desc++)
> > +           continue;
> > +
> > +         for (; desc->code == fcode; desc++)
> > +           if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
> > +               && (rs6000_builtin_type_compatible (TREE_TYPE (arg1),
> > +                                                   desc->op2)))
> > +             {
> > +               tree ret_type = rs6000_builtin_type (desc->ret_type);
> > +               if (TYPE_MODE (ret_type) == V2DImode)
> > +                 /* Type-based aliasing analysis thinks vector long
> > +                    and vector long long are different and will put them
> > +                    in distinct alias classes.  Force our return type
> > +                    to be a may-alias type to avoid this.  */
> > +                 ret_type
> > +                   = build_pointer_type_for_mode (ret_type, Pmode,
> > +                                                  true/*can_alias_all*/);
> > +               else
> > +                 ret_type = build_pointer_type (ret_type);
> > +               aligned = build1 (NOP_EXPR, ret_type, aligned);
> > +               tree ret_val = build_indirect_ref (loc, aligned, RO_NULL);
> > +               return ret_val;
> > +             }
> > +       }
> > +    }
> > +
> > +  /* Similarly for stvx.  */
> > +  if (fcode == ALTIVEC_BUILTIN_VEC_ST
> > +      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
> > +    {
> > +      tree arg0 = (*arglist)[0];
> > +      tree arg1 = (*arglist)[1];
> > +      tree arg2 = (*arglist)[2];
> > +
> > +      /* Construct the masked address.  See handling for ALTIVEC_BUILTIN_VEC_LD
> > +        for an explanation of address arithmetic concerns.  */
> > +      arg1 = fold (arg1);
> > +
> > +      /* Let existing error handling take over if we don't have a constant
> > +        offset.  */
> > +      if (TREE_CODE (arg1) == INTEGER_CST)
> > +       {
> > +         HOST_WIDE_INT off = TREE_INT_CST_LOW (arg1);
> > +         tree inner_type = TREE_TYPE (TREE_TYPE (arg2));
> > +         HOST_WIDE_INT size = int_size_in_bytes (inner_type);
> > +         tree addr;
> > +
> > +         if (off % size == 0)
> > +           {
> > +             tree adjoff = build_int_cst (TREE_TYPE (arg1), off / size);
> > +             addr = build_binary_op (loc, PLUS_EXPR, arg2, adjoff, 0);
> > +             addr = build1 (NOP_EXPR, sizetype, addr);
> > +           }
> > +         else
> > +           {
> > +             tree hack_arg2 = build1 (NOP_EXPR, sizetype, arg2);
> > +             addr = build_binary_op (loc, PLUS_EXPR, hack_arg2, arg1, 0);
> > +           }
> > +         tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
> > +                                         build_int_cst (sizetype, -16), 0);
> > +
> > +         /* Find the built-in to make sure a compatible one exists; if not
> > +            we fall back to default handling to get the error message.  */
> > +         for (desc = altivec_overloaded_builtins;
> > +              desc->code && desc->code != fcode; desc++)
> > +           continue;
> > +
> > +         for (; desc->code == fcode; desc++)
> > +           if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
> > +               && rs6000_builtin_type_compatible (TREE_TYPE (arg1), desc->op2)
> > +               && rs6000_builtin_type_compatible (TREE_TYPE (arg2),
> > +                                                  desc->op3))
> > +             {
> > +               tree arg0_type = TREE_TYPE (arg0);
> > +               if (TYPE_MODE (arg0_type) == V2DImode)
> > +                 /* Type-based aliasing analysis thinks vector long
> > +                    and vector long long are different and will put them
> > +                    in distinct alias classes.  Force our address type
> > +                    to be a may-alias type to avoid this.  */
> > +                 arg0_type
> > +                   = build_pointer_type_for_mode (arg0_type, Pmode,
> > +                                                  true/*can_alias_all*/);
> > +               else
> > +                 arg0_type = build_pointer_type (arg0_type);
> > +               aligned = build1 (NOP_EXPR, arg0_type, aligned);
> > +               tree stg = build_indirect_ref (loc, aligned, RO_NULL);
> > +               tree retval = build2 (MODIFY_EXPR, TREE_TYPE (stg), stg,
> > +                                     convert (TREE_TYPE (stg), arg0));
> > +               return retval;
> > +             }
> > +       }
> > +    }
> > +
> >    for (n = 0;
> >         !VOID_TYPE_P (TREE_VALUE (fnargs)) && n < nargs;
> >         fnargs = TREE_CHAIN (fnargs), n++)
> > Index: gcc/config/rs6000/rs6000.c
> > ===================================================================
> > --- gcc/config/rs6000/rs6000.c  (revision 235090)
> > +++ gcc/config/rs6000/rs6000.c  (working copy)
> > @@ -13025,9 +13025,9 @@ swap_selector_for_mode (machine_mode mode)
> >    return force_reg (V16QImode, gen_rtx_CONST_VECTOR (V16QImode, gen_rtvec_v (16, perm)));
> >  }
> >
> > -/* Generate code for an "lvx", "lvxl", or "lve*x" built-in for a little endian target
> > -   with -maltivec=be specified.  Issue the load followed by an element-reversing
> > -   permute.  */
> > +/* Generate code for an "lvxl", or "lve*x" built-in for a little endian target
> > +   with -maltivec=be specified.  Issue the load followed by an element-
> > +   reversing permute.  */
> >  void
> >  altivec_expand_lvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
> >  {
> > @@ -13043,8 +13043,8 @@ altivec_expand_lvx_be (rtx op0, rtx op1, machine_m
> >    emit_insn (gen_rtx_SET (op0, vperm));
> >  }
> >
> > -/* Generate code for a "stvx" or "stvxl" built-in for a little endian target
> > -   with -maltivec=be specified.  Issue the store preceded by an element-reversing
> > +/* Generate code for a "stvxl" built-in for a little endian target with
> > +   -maltivec=be specified.  Issue the store preceded by an element-reversing
> >     permute.  */
> >  void
> >  altivec_expand_stvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
> > @@ -13106,22 +13106,65 @@ altivec_expand_lv_builtin (enum insn_code icode, t
> >
> >    op1 = copy_to_mode_reg (mode1, op1);
> >
> > -  if (op0 == const0_rtx)
> > +  /* For LVX, express the RTL accurately by ANDing the address with -16.
> > +     LVXL and LVE*X expand to use UNSPECs to hide their special behavior,
> > +     so the raw address is fine.  */
> > +  switch (icode)
> >      {
> > -      addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
> > -    }
> > -  else
> > -    {
> > -      op0 = copy_to_mode_reg (mode0, op0);
> > -      addr = gen_rtx_MEM (blk ? BLKmode : tmode, gen_rtx_PLUS (Pmode, op0, op1));
> > -    }
> > +    case CODE_FOR_altivec_lvx_v2df_2op:
> > +    case CODE_FOR_altivec_lvx_v2di_2op:
> > +    case CODE_FOR_altivec_lvx_v4sf_2op:
> > +    case CODE_FOR_altivec_lvx_v4si_2op:
> > +    case CODE_FOR_altivec_lvx_v8hi_2op:
> > +    case CODE_FOR_altivec_lvx_v16qi_2op:
> > +      {
> > +       rtx rawaddr;
> > +       if (op0 == const0_rtx)
> > +         rawaddr = op1;
> > +       else
> > +         {
> > +           op0 = copy_to_mode_reg (mode0, op0);
> > +           rawaddr = gen_rtx_PLUS (Pmode, op1, op0);
> > +         }
> > +       addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
> > +       addr = gen_rtx_MEM (blk ? BLKmode : tmode, addr);
> >
> > -  pat = GEN_FCN (icode) (target, addr);
> > +       /* For -maltivec=be, emit the load and follow it up with a
> > +          permute to swap the elements.  */
> > +       if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> > +         {
> > +           rtx temp = gen_reg_rtx (tmode);
> > +           emit_insn (gen_rtx_SET (temp, addr));
> >
> > -  if (! pat)
> > -    return 0;
> > -  emit_insn (pat);
> > +           rtx sel = swap_selector_for_mode (tmode);
> > +           rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, temp, temp, sel),
> > +                                       UNSPEC_VPERM);
> > +           emit_insn (gen_rtx_SET (target, vperm));
> > +         }
> > +       else
> > +         emit_insn (gen_rtx_SET (target, addr));
> >
> > +       break;
> > +      }
> > +
> > +    default:
> > +      if (op0 == const0_rtx)
> > +       addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
> > +      else
> > +       {
> > +         op0 = copy_to_mode_reg (mode0, op0);
> > +         addr = gen_rtx_MEM (blk ? BLKmode : tmode,
> > +                             gen_rtx_PLUS (Pmode, op1, op0));
> > +       }
> > +
> > +      pat = GEN_FCN (icode) (target, addr);
> > +      if (! pat)
> > +       return 0;
> > +      emit_insn (pat);
> > +
> > +      break;
> > +    }
> > +
> >    return target;
> >  }
> >
> > @@ -13208,7 +13251,7 @@ altivec_expand_stv_builtin (enum insn_code icode,
> >    rtx op0 = expand_normal (arg0);
> >    rtx op1 = expand_normal (arg1);
> >    rtx op2 = expand_normal (arg2);
> > -  rtx pat, addr;
> > +  rtx pat, addr, rawaddr;
> >    machine_mode tmode = insn_data[icode].operand[0].mode;
> >    machine_mode smode = insn_data[icode].operand[1].mode;
> >    machine_mode mode1 = Pmode;
> > @@ -13220,24 +13263,69 @@ altivec_expand_stv_builtin (enum insn_code icode,
> >        || arg2 == error_mark_node)
> >      return const0_rtx;
> >
> > -  if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
> > -    op0 = copy_to_mode_reg (smode, op0);
> > -
> >    op2 = copy_to_mode_reg (mode2, op2);
> >
> > -  if (op1 == const0_rtx)
> > +  /* For STVX, express the RTL accurately by ANDing the address with -16.
> > +     STVXL and STVE*X expand to use UNSPECs to hide their special behavior,
> > +     so the raw address is fine.  */
> > +  switch (icode)
> >      {
> > -      addr = gen_rtx_MEM (tmode, op2);
> > -    }
> > -  else
> > -    {
> > -      op1 = copy_to_mode_reg (mode1, op1);
> > -      addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op1, op2));
> > -    }
> > +    case CODE_FOR_altivec_stvx_v2df_2op:
> > +    case CODE_FOR_altivec_stvx_v2di_2op:
> > +    case CODE_FOR_altivec_stvx_v4sf_2op:
> > +    case CODE_FOR_altivec_stvx_v4si_2op:
> > +    case CODE_FOR_altivec_stvx_v8hi_2op:
> > +    case CODE_FOR_altivec_stvx_v16qi_2op:
> > +      {
> > +       if (op1 == const0_rtx)
> > +         rawaddr = op2;
> > +       else
> > +         {
> > +           op1 = copy_to_mode_reg (mode1, op1);
> > +           rawaddr = gen_rtx_PLUS (Pmode, op2, op1);
> > +         }
> >
> > -  pat = GEN_FCN (icode) (addr, op0);
> > -  if (pat)
> > -    emit_insn (pat);
> > +       addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
> > +       addr = gen_rtx_MEM (tmode, addr);
> > +
> > +       op0 = copy_to_mode_reg (tmode, op0);
> > +
> > +       /* For -maltivec=be, emit a permute to swap the elements, followed
> > +          by the store.  */
> > +       if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
> > +         {
> > +           rtx temp = gen_reg_rtx (tmode);
> > +           rtx sel = swap_selector_for_mode (tmode);
> > +           rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, op0, op0, sel),
> > +                                       UNSPEC_VPERM);
> > +           emit_insn (gen_rtx_SET (temp, vperm));
> > +           emit_insn (gen_rtx_SET (addr, temp));
> > +         }
> > +       else
> > +         emit_insn (gen_rtx_SET (addr, op0));
> > +
> > +       break;
> > +      }
> > +
> > +    default:
> > +      {
> > +       if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
> > +         op0 = copy_to_mode_reg (smode, op0);
> > +
> > +       if (op1 == const0_rtx)
> > +         addr = gen_rtx_MEM (tmode, op2);
> > +       else
> > +         {
> > +           op1 = copy_to_mode_reg (mode1, op1);
> > +           addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op2, op1));
> > +         }
> > +
> > +       pat = GEN_FCN (icode) (addr, op0);
> > +       if (pat)
> > +         emit_insn (pat);
> > +      }
> > +    }
> > +
> >    return NULL_RTX;
> >  }
> >
> > @@ -14073,18 +14161,18 @@ altivec_expand_builtin (tree exp, rtx target, bool
> >    switch (fcode)
> >      {
> >      case ALTIVEC_BUILTIN_STVX_V2DF:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX_V2DI:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX_V4SF:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX:
> >      case ALTIVEC_BUILTIN_STVX_V4SI:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX_V8HI:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi_2op, exp);
> >      case ALTIVEC_BUILTIN_STVX_V16QI:
> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi, exp);
> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi_2op, exp);
> >      case ALTIVEC_BUILTIN_STVEBX:
> >        return altivec_expand_stv_builtin (CODE_FOR_altivec_stvebx, exp);
> >      case ALTIVEC_BUILTIN_STVEHX:
> > @@ -14272,23 +14360,23 @@ altivec_expand_builtin (tree exp, rtx target, bool
> >        return altivec_expand_lv_builtin (CODE_FOR_altivec_lvxl_v16qi,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V2DF:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V2DI:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V4SF:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX:
> >      case ALTIVEC_BUILTIN_LVX_V4SI:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V8HI:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVX_V16QI:
> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi,
> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi_2op,
> >                                         exp, target, false);
> >      case ALTIVEC_BUILTIN_LVLX:
> >        return altivec_expand_lv_builtin (CODE_FOR_altivec_lvlx,
> > @@ -37139,7 +37227,9 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
> >       fix them up by converting them to permuting ones.  Exceptions:
> >       UNSPEC_LVE, UNSPEC_LVX, and UNSPEC_STVX, which have a PARALLEL
> >       body instead of a SET; and UNSPEC_STVE, which has an UNSPEC
> > -     for the SET source.  */
> > +     for the SET source.  Also we must now make an exception for lvx
> > +     and stvx when they are not in the UNSPEC_LVX/STVX form (with the
> > +     explicit "& -16") since this leads to unrecognizable insns.  */
> >    rtx body = PATTERN (insn);
> >    int i = INSN_UID (insn);
> >
> > @@ -37147,6 +37237,11 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
> >      {
> >        if (GET_CODE (body) == SET)
> >         {
> > +         rtx rhs = SET_SRC (body);
> > +         gcc_assert (GET_CODE (rhs) == MEM);
> > +         if (GET_CODE (XEXP (rhs, 0)) == AND)
> > +           return 0;
> > +
> >           *special = SH_NOSWAP_LD;
> >           return 1;
> >         }
> > @@ -37156,8 +37251,14 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
> >
> >    if (insn_entry[i].is_store)
> >      {
> > -      if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) != UNSPEC)
> > +      if (GET_CODE (body) == SET
> > +         && GET_CODE (SET_SRC (body)) != UNSPEC)
> >         {
> > +         rtx lhs = SET_DEST (body);
> > +         gcc_assert (GET_CODE (lhs) == MEM);
> > +         if (GET_CODE (XEXP (lhs, 0)) == AND)
> > +           return 0;
> > +
> >           *special = SH_NOSWAP_ST;
> >           return 1;
> >         }
> > @@ -37827,6 +37928,267 @@ dump_swap_insn_table (swap_web_entry *insn_entry)
> >    fputs ("\n", dump_file);
> >  }
> >
> > +/* Return RTX with its address canonicalized to (reg) or (+ reg reg).
> > +   Here RTX is an (& addr (const_int -16)).  Always return a new copy
> > +   to avoid problems with combine.  */
> > +static rtx
> > +alignment_with_canonical_addr (rtx align)
> > +{
> > +  rtx canon;
> > +  rtx addr = XEXP (align, 0);
> > +
> > +  if (REG_P (addr))
> > +    canon = addr;
> > +
> > +  else if (GET_CODE (addr) == PLUS)
> > +    {
> > +      rtx addrop0 = XEXP (addr, 0);
> > +      rtx addrop1 = XEXP (addr, 1);
> > +
> > +      if (!REG_P (addrop0))
> > +       addrop0 = force_reg (GET_MODE (addrop0), addrop0);
> > +
> > +      if (!REG_P (addrop1))
> > +       addrop1 = force_reg (GET_MODE (addrop1), addrop1);
> > +
> > +      canon = gen_rtx_PLUS (GET_MODE (addr), addrop0, addrop1);
> > +    }
> > +
> > +  else
> > +    canon = force_reg (GET_MODE (addr), addr);
> > +
> > +  return gen_rtx_AND (GET_MODE (align), canon, GEN_INT (-16));
> > +}
> > +
> > +/* Check whether an rtx is an alignment mask, and if so, return
> > +   a fully-expanded rtx for the masking operation.  */
> > +static rtx
> > +alignment_mask (rtx_insn *insn)
> > +{
> > +  rtx body = PATTERN (insn);
> > +
> > +  if (GET_CODE (body) != SET
> > +      || GET_CODE (SET_SRC (body)) != AND
> > +      || !REG_P (XEXP (SET_SRC (body), 0)))
> > +    return 0;
> > +
> > +  rtx mask = XEXP (SET_SRC (body), 1);
> > +
> > +  if (GET_CODE (mask) == CONST_INT)
> > +    {
> > +      if (INTVAL (mask) == -16)
> > +       return alignment_with_canonical_addr (SET_SRC (body));
> > +      else
> > +       return 0;
> > +    }
> > +
> > +  if (!REG_P (mask))
> > +    return 0;
> > +
> > +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> > +  df_ref use;
> > +  rtx real_mask = 0;
> > +
> > +  FOR_EACH_INSN_INFO_USE (use, insn_info)
> > +    {
> > +      if (!rtx_equal_p (DF_REF_REG (use), mask))
> > +       continue;
> > +
> > +      struct df_link *def_link = DF_REF_CHAIN (use);
> > +      if (!def_link || def_link->next)
> > +       return 0;
> > +
> > +      rtx_insn *const_insn = DF_REF_INSN (def_link->ref);
> > +      rtx const_body = PATTERN (const_insn);
> > +      if (GET_CODE (const_body) != SET)
> > +       return 0;
> > +
> > +      real_mask = SET_SRC (const_body);
> > +
> > +      if (GET_CODE (real_mask) != CONST_INT
> > +         || INTVAL (real_mask) != -16)
> > +       return 0;
> > +    }
> > +
> > +  if (real_mask == 0)
> > +    return 0;
> > +
> > +  return alignment_with_canonical_addr (SET_SRC (body));
> > +}
> > +
> > +/* Given INSN that's a load or store based at BASE_REG, look for a
> > +   feeding computation that aligns its address on a 16-byte boundary.  */
> > +static rtx
> > +find_alignment_op (rtx_insn *insn, rtx base_reg)
> > +{
> > +  df_ref base_use;
> > +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> > +  rtx and_operation = 0;
> > +
> > +  FOR_EACH_INSN_INFO_USE (base_use, insn_info)
> > +    {
> > +      if (!rtx_equal_p (DF_REF_REG (base_use), base_reg))
> > +       continue;
> > +
> > +      struct df_link *base_def_link = DF_REF_CHAIN (base_use);
> > +      if (!base_def_link || base_def_link->next)
> > +       break;
> > +
> > +      rtx_insn *and_insn = DF_REF_INSN (base_def_link->ref);
> > +      and_operation = alignment_mask (and_insn);
> > +      if (and_operation != 0)
> > +       break;
> > +    }
> > +
> > +  return and_operation;
> > +}
> > +
> > +struct del_info { bool replace; rtx_insn *replace_insn; };
> > +
> > +/* If INSN is the load for an lvx pattern, put it in canonical form.  */
> > +static void
> > +combine_lvx_pattern (rtx_insn *insn, del_info *to_delete)
> > +{
> > +  rtx body = PATTERN (insn);
> > +  gcc_assert (GET_CODE (body) == SET
> > +             && GET_CODE (SET_SRC (body)) == VEC_SELECT
> > +             && GET_CODE (XEXP (SET_SRC (body), 0)) == MEM);
> > +
> > +  rtx mem = XEXP (SET_SRC (body), 0);
> > +  rtx base_reg = XEXP (mem, 0);
> > +
> > +  rtx and_operation = find_alignment_op (insn, base_reg);
> > +
> > +  if (and_operation != 0)
> > +    {
> > +      df_ref def;
> > +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> > +      FOR_EACH_INSN_INFO_DEF (def, insn_info)
> > +       {
> > +         struct df_link *link = DF_REF_CHAIN (def);
> > +         if (!link || link->next)
> > +           break;
> > +
> > +         rtx_insn *swap_insn = DF_REF_INSN (link->ref);
> > +         if (!insn_is_swap_p (swap_insn)
> > +             || insn_is_load_p (swap_insn)
> > +             || insn_is_store_p (swap_insn))
> > +           break;
> > +
> > +         /* Expected lvx pattern found.  Change the swap to
> > +            a copy, and propagate the AND operation into the
> > +            load.  */
> > +         to_delete[INSN_UID (swap_insn)].replace = true;
> > +         to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
> > +
> > +         XEXP (mem, 0) = and_operation;
> > +         SET_SRC (body) = mem;
> > +         INSN_CODE (insn) = -1; /* Force re-recognition.  */
> > +         df_insn_rescan (insn);
> > +
> > +         if (dump_file)
> > +           fprintf (dump_file, "lvx opportunity found at %d\n",
> > +                    INSN_UID (insn));
> > +       }
> > +    }
> > +}
> > +
> > +/* If INSN is the store for an stvx pattern, put it in canonical form.  */
> > +static void
> > +combine_stvx_pattern (rtx_insn *insn, del_info *to_delete)
> > +{
> > +  rtx body = PATTERN (insn);
> > +  gcc_assert (GET_CODE (body) == SET
> > +             && GET_CODE (SET_DEST (body)) == MEM
> > +             && GET_CODE (SET_SRC (body)) == VEC_SELECT);
> > +  rtx mem = SET_DEST (body);
> > +  rtx base_reg = XEXP (mem, 0);
> > +
> > +  rtx and_operation = find_alignment_op (insn, base_reg);
> > +
> > +  if (and_operation != 0)
> > +    {
> > +      rtx src_reg = XEXP (SET_SRC (body), 0);
> > +      df_ref src_use;
> > +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> > +      FOR_EACH_INSN_INFO_USE (src_use, insn_info)
> > +       {
> > +         if (!rtx_equal_p (DF_REF_REG (src_use), src_reg))
> > +           continue;
> > +
> > +         struct df_link *link = DF_REF_CHAIN (src_use);
> > +         if (!link || link->next)
> > +           break;
> > +
> > +         rtx_insn *swap_insn = DF_REF_INSN (link->ref);
> > +         if (!insn_is_swap_p (swap_insn)
> > +             || insn_is_load_p (swap_insn)
> > +             || insn_is_store_p (swap_insn))
> > +           break;
> > +
> > +         /* Expected stvx pattern found.  Change the swap to
> > +            a copy, and propagate the AND operation into the
> > +            store.  */
> > +         to_delete[INSN_UID (swap_insn)].replace = true;
> > +         to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
> > +
> > +         XEXP (mem, 0) = and_operation;
> > +         SET_SRC (body) = src_reg;
> > +         INSN_CODE (insn) = -1; /* Force re-recognition.  */
> > +         df_insn_rescan (insn);
> > +
> > +         if (dump_file)
> > +           fprintf (dump_file, "stvx opportunity found at %d\n",
> > +                    INSN_UID (insn));
> > +       }
> > +    }
> > +}
> > +
> > +/* Look for patterns created from builtin lvx and stvx calls, and
> > +   canonicalize them to be properly recognized as such.  */
> > +static void
> > +combine_lvx_stvx_patterns (function *fun)
> > +{
> > +  int i;
> > +  basic_block bb;
> > +  rtx_insn *insn;
> > +
> > +  int num_insns = get_max_uid ();
> > +  del_info *to_delete = XCNEWVEC (del_info, num_insns);
> > +
> > +  FOR_ALL_BB_FN (bb, fun)
> > +    FOR_BB_INSNS (bb, insn)
> > +    {
> > +      if (!NONDEBUG_INSN_P (insn))
> > +       continue;
> > +
> > +      if (insn_is_load_p (insn) && insn_is_swap_p (insn))
> > +       combine_lvx_pattern (insn, to_delete);
> > +      else if (insn_is_store_p (insn) && insn_is_swap_p (insn))
> > +       combine_stvx_pattern (insn, to_delete);
> > +    }
> > +
> > +  /* Turning swaps into copies is delayed until now, to avoid problems
> > +     with deleting instructions during the insn walk.  */
> > +  for (i = 0; i < num_insns; i++)
> > +    if (to_delete[i].replace)
> > +      {
> > +       rtx swap_body = PATTERN (to_delete[i].replace_insn);
> > +       rtx src_reg = XEXP (SET_SRC (swap_body), 0);
> > +       rtx copy = gen_rtx_SET (SET_DEST (swap_body), src_reg);
> > +       rtx_insn *new_insn = emit_insn_before (copy,
> > +                                              to_delete[i].replace_insn);
> > +       set_block_for_insn (new_insn,
> > +                           BLOCK_FOR_INSN (to_delete[i].replace_insn));
> > +       df_insn_rescan (new_insn);
> > +       df_insn_delete (to_delete[i].replace_insn);
> > +       remove_insn (to_delete[i].replace_insn);
> > +       to_delete[i].replace_insn->set_deleted ();
> > +      }
> > +
> > +  free (to_delete);
> > +}
> > +
> >  /* Main entry point for this pass.  */
> >  unsigned int
> >  rs6000_analyze_swaps (function *fun)
> > @@ -37833,7 +38195,7 @@ rs6000_analyze_swaps (function *fun)
> >  {
> >    swap_web_entry *insn_entry;
> >    basic_block bb;
> > -  rtx_insn *insn;
> > +  rtx_insn *insn, *curr_insn = 0;
> >
> >    /* Dataflow analysis for use-def chains.  */
> >    df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
> > @@ -37841,12 +38203,15 @@ rs6000_analyze_swaps (function *fun)
> >    df_analyze ();
> >    df_set_flags (DF_DEFER_INSN_RESCAN);
> >
> > +  /* Pre-pass to combine lvx and stvx patterns so we don't lose info.  */
> > +  combine_lvx_stvx_patterns (fun);
> > +
> >    /* Allocate structure to represent webs of insns.  */
> >    insn_entry = XCNEWVEC (swap_web_entry, get_max_uid ());
> >
> >    /* Walk the insns to gather basic data.  */
> >    FOR_ALL_BB_FN (bb, fun)
> > -    FOR_BB_INSNS (bb, insn)
> > +    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
> >      {
> >        unsigned int uid = INSN_UID (insn);
> >        if (NONDEBUG_INSN_P (insn))
> > Index: gcc/config/rs6000/vector.md
> > ===================================================================
> > --- gcc/config/rs6000/vector.md (revision 235090)
> > +++ gcc/config/rs6000/vector.md (working copy)
> > @@ -167,7 +167,14 @@
> >    if (VECTOR_MEM_VSX_P (<MODE>mode))
> >      {
> >        operands[1] = rs6000_address_for_altivec (operands[1]);
> > -      emit_insn (gen_altivec_lvx_<mode> (operands[0], operands[1]));
> > +      rtx and_op = XEXP (operands[1], 0);
> > +      gcc_assert (GET_CODE (and_op) == AND);
> > +      rtx addr = XEXP (and_op, 0);
> > +      if (GET_CODE (addr) == PLUS)
> > +        emit_insn (gen_altivec_lvx_<mode>_2op (operands[0], XEXP (addr, 0),
> > +                                              XEXP (addr, 1)));
> > +      else
> > +        emit_insn (gen_altivec_lvx_<mode>_1op (operands[0], operands[1]));
> >        DONE;
> >      }
> >  }")
> > @@ -183,7 +190,14 @@
> >    if (VECTOR_MEM_VSX_P (<MODE>mode))
> >      {
> >        operands[0] = rs6000_address_for_altivec (operands[0]);
> > -      emit_insn (gen_altivec_stvx_<mode> (operands[0], operands[1]));
> > +      rtx and_op = XEXP (operands[0], 0);
> > +      gcc_assert (GET_CODE (and_op) == AND);
> > +      rtx addr = XEXP (and_op, 0);
> > +      if (GET_CODE (addr) == PLUS)
> > +        emit_insn (gen_altivec_stvx_<mode>_2op (operands[1], XEXP (addr, 0),
> > +                                               XEXP (addr, 1)));
> > +      else
> > +        emit_insn (gen_altivec_stvx_<mode>_1op (operands[1], operands[0]));
> >        DONE;
> >      }
> >  }")
> >
> >
> 
>
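
For reference, the address-masking behavior that the new altivec_lvx/stvx
"_2op"/"_1op" patterns (and the early expansion of vec_ld/vec_st) make
explicit can be sketched in plain C.  This is only an illustration of the
lvx/stvx semantics discussed above, not code from the patch; the helper
name is made up and it assumes a PowerPC target with -maltivec:

  #include <altivec.h>
  #include <stdint.h>

  /* vec_ld (off, p) loads 16 bytes from the 16-byte-aligned address
     ((uintptr_t) p + off) & -16; vec_st is the analogous store.  */
  vector unsigned char
  load_like_lvx (long off, const unsigned char *p)
  {
    uintptr_t ea = ((uintptr_t) p + off) & (uintptr_t) -16;
    return *(const vector unsigned char *) ea;
  }

Because the AND is now visible in the IL instead of being hidden behind an
UNSPEC, the GIMPLE and RTL optimizers can recognize repeated accesses to the
same masked address and eliminate redundant loads and stores.
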
Richard Biener April 20, 2016, 9:05 a.m. UTC | #4
On Tue, Apr 19, 2016 at 10:27 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Tue, 2016-04-19 at 10:09 +0200, Richard Biener wrote:
>> On Tue, Apr 19, 2016 at 12:05 AM, Bill Schmidt
>> <wschmidt@linux.vnet.ibm.com> wrote:
>> > Hi,
>> >
>> > Expanding built-ins in the usual way (leaving them as calls until
>> > expanding into RTL) restricts the amount of optimization that can be
>> > performed on the code represented by the built-ins.  This has been
>> > observed to be particularly bad for the vec_ld and vec_st built-ins on
>> > PowerPC, which represent the lvx and stvx instructions.  Currently these
>> > are expanded into UNSPECs that are left untouched by the optimizers, so
>> > no redundant load or store elimination can take place.  For certain
>> > idiomatic usages, this leads to very bad performance.
>> >
>> > Initially I planned to just change the UNSPEC representation to RTL that
>> > directly expresses the address masking implicit in lvx and stvx.  This
>> > turns out to be only partially successful in improving performance.
>> > Among other things, by the time we reach RTL we have lost track of the
>> > __restrict__ attribute, leading to more appearances of may-alias
>> > relationships than should really be present.  Instead, this patch
>> > expands the built-ins during parsing so that they are exposed to all
>> > GIMPLE optimizations as well.
>> >
>> > This works well for vec_ld and vec_st.  It is also possible for
>> > programmers to instead use __builtin_altivec_lvx_<mode> and
>> > __builtin_altivec_stvx_<mode>.  These are not so easy to catch during
>> > parsing, since they are not processed by the overloaded built-in
>> > function table.  For these, I am currently falling back to expansion
>> > during RTL while still exposing the address-masking semantics, which
>> > seems ok for these somewhat obscure built-ins.  At some future time we
>> > may decide to handle them similarly to vec_ld and vec_st.
>> >
>> > For POWER8 little-endian only, the loads and stores during expand time
>> > require some special handling, since the POWER8 expanders want to
>> > convert these to lxvd2x/xxswapd and xxswapd/stxvd2x.  To deal with this,
>> > I've added an extra pre-pass to the swap optimization phase that
>> > recognizes the lvx and stvx patterns and canonicalizes them so they'll
>> > be properly recognized.  This isn't an issue for earlier or later
>> > processors, or for big-endian POWER8, so doing this as part of swap
>> > optimization is appropriate.
>> >
>> > We have a lot of existing test cases for this code, which proved very
>> > useful in discovering bugs, so I haven't seen a reason to add any new
>> > tests.
>> >
>> > The patch is fairly large, but it isn't feasible to break it up into
>> > smaller units without leaving something in a broken state.  So I will
>> > have to just apologize for the size and leave it at that.  Sorry! :)
>> >
>> > Bootstrapped and tested successfully on powerpc64le-unknown-linux-gnu,
>> > and on powerpc64-unknown-linux-gnu (-m32 and -m64) with no regressions.
>> > Is this ok for trunk after GCC 6 releases?
>>
>> I just took a very quick look, but it seems you are using integer
>> arithmetic for the pointer adjustment and bit-and.  You could use
>> POINTER_PLUS_EXPR for the addition, and BIT_AND_EXPR is also valid on
>> pointer types, which means you don't need conversions to/from sizetype.
>
> I just verified that I run into trouble with both these changes.  The
> build_binary_op interface doesn't accept POINTER_PLUS_EXPR as a valid
> code (we hit a gcc_unreachable in the main switch statement), but does
> produce pointer additions from a PLUS_EXPR.  Also, apparently
> BIT_AND_EXPR is not valid on at least these pointer types:
>
> ld.c: In function 'test':
> ld.c:68:9: error: invalid operands to binary & (have '__vector(16) unsigned char *' and '__vector(16) unsigned char *')
>    vuc = vec_ld (0, (vector unsigned char *)svuc);
>          ^
>
> That's what happens if I try:
>
>           tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
>                                           build_int_cst (TREE_TYPE (arg1),
>                                                          -16), 0);
>
> If I instead build the -16 as a sizetype constant, I get the same error
> message, except that the second argument listed is 'sizetype'.  Is there
> something else I should be trying instead?

Ah, it might be that the FE interfaces (build_binary_op and friends) do not
accept this.  If you simply use fold_build2 instead, it should work.  For
the BIT_AND_EXPR, the constant has to be of the same type as 'addr'.

Richard.
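
A minimal sketch of what this suggestion amounts to, reusing the arg0/arg1
names from the vec_ld case in the patch above; the exact calls are an
assumption for illustration, not the code that was ultimately committed:

  /* POINTER_PLUS_EXPR takes a byte offset and keeps the result a pointer,
     so neither the sizetype round trip nor the pre-division of the offset
     by the element size (the adjoff dance above) is needed.  Per the
     advice above, the -16 mask must be built with the pointer's own
     type.  */
  tree addr = fold_build_pointer_plus (arg1, arg0);
  tree aligned = fold_build2 (BIT_AND_EXPR, TREE_TYPE (addr), addr,
                              build_int_cst (TREE_TYPE (addr), -16));

Keeping the arithmetic in pointer form is the point of the exercise: the
aliasing information (e.g. __restrict__) then survives into GIMPLE instead
of being lost behind sizetype conversions.
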

> Thanks,
> Bill
>
>
>>
>> x86 nowadays has intrinsics implemented as inlines - they come from
>> header files.  It seems for ppc the intrinsics are somehow magically
>> there, w/o a header file?
>>
>> Richard.
>>
>> > Thanks,
>> > Bill
>> >
>> >
>> > 2016-04-18  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>
>> >
>> >         * config/rs6000/altivec.md (altivec_lvx_<mode>): Remove.
>> >         (altivec_lvx_<mode>_internal): Document.
>> >         (altivec_lvx_<mode>_2op): New define_insn.
>> >         (altivec_lvx_<mode>_1op): Likewise.
>> >         (altivec_lvx_<mode>_2op_si): Likewise.
>> >         (altivec_lvx_<mode>_1op_si): Likewise.
>> >         (altivec_stvx_<mode>): Remove.
>> >         (altivec_stvx_<mode>_internal): Document.
>> >         (altivec_stvx_<mode>_2op): New define_insn.
>> >         (altivec_stvx_<mode>_1op): Likewise.
>> >         (altivec_stvx_<mode>_2op_si): Likewise.
>> >         (altivec_stvx_<mode>_1op_si): Likewise.
>> >         * config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
>> >         Expand vec_ld and vec_st during parsing.
>> >         * config/rs6000/rs6000.c (altivec_expand_lvx_be): Commentary
>> >         changes.
>> >         (altivec_expand_stvx_be): Likewise.
>> >         (altivec_expand_lv_builtin): Expand lvx built-ins to expose the
>> >         address-masking behavior in RTL.
>> >         (altivec_expand_stv_builtin): Expand stvx built-ins to expose the
>> >         address-masking behavior in RTL.
>> >         (altivec_expand_builtin): Change builtin code arguments for calls
>> >         to altivec_expand_stv_builtin and altivec_expand_lv_builtin.
>> >         (insn_is_swappable_p): Avoid incorrect swap optimization in the
>> >         presence of lvx/stvx patterns.
>> >         (alignment_with_canonical_addr): New function.
>> >         (alignment_mask): Likewise.
>> >         (find_alignment_op): Likewise.
>> >         (combine_lvx_pattern): Likewise.
>> >         (combine_stvx_pattern): Likewise.
>> >         (combine_lvx_stvx_patterns): Likewise.
>> >         (rs6000_analyze_swaps): Perform a pre-pass to recognize lvx and
>> >         stvx patterns from expand.
>> >         * config/rs6000/vector.md (vector_altivec_load_<mode>): Use new
>> >         expansions.
>> >         (vector_altivec_store_<mode>): Likewise.
>> >
>> >
>> > Index: gcc/config/rs6000/altivec.md
>> > ===================================================================
>> > --- gcc/config/rs6000/altivec.md        (revision 235090)
>> > +++ gcc/config/rs6000/altivec.md        (working copy)
>> > @@ -2514,20 +2514,9 @@
>> >    "lvxl %0,%y1"
>> >    [(set_attr "type" "vecload")])
>> >
>> > -(define_expand "altivec_lvx_<mode>"
>> > -  [(parallel
>> > -    [(set (match_operand:VM2 0 "register_operand" "=v")
>> > -         (match_operand:VM2 1 "memory_operand" "Z"))
>> > -     (unspec [(const_int 0)] UNSPEC_LVX)])]
>> > -  "TARGET_ALTIVEC"
>> > -{
>> > -  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
>> > -    {
>> > -      altivec_expand_lvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_LVX);
>> > -      DONE;
>> > -    }
>> > -})
>> > -
>> > +; This version of lvx is used only in cases where we need to force an lvx
>> > +; over any other load, and we don't care about losing CSE opportunities.
>> > +; Its primary use is for prologue register saves.
>> >  (define_insn "altivec_lvx_<mode>_internal"
>> >    [(parallel
>> >      [(set (match_operand:VM2 0 "register_operand" "=v")
>> > @@ -2537,20 +2526,45 @@
>> >    "lvx %0,%y1"
>> >    [(set_attr "type" "vecload")])
>> >
>> > -(define_expand "altivec_stvx_<mode>"
>> > -  [(parallel
>> > -    [(set (match_operand:VM2 0 "memory_operand" "=Z")
>> > -         (match_operand:VM2 1 "register_operand" "v"))
>> > -     (unspec [(const_int 0)] UNSPEC_STVX)])]
>> > -  "TARGET_ALTIVEC"
>> > -{
>> > -  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
>> > -    {
>> > -      altivec_expand_stvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_STVX);
>> > -      DONE;
>> > -    }
>> > -})
>> > +; The next two patterns embody what lvx should usually look like.
>> > +(define_insn "altivec_lvx_<mode>_2op"
>> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
>> > +        (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
>> > +                                  (match_operand:DI 2 "register_operand" "r"))
>> > +                        (const_int -16))))]
>> > +  "TARGET_ALTIVEC && TARGET_64BIT"
>> > +  "lvx %0,%1,%2"
>> > +  [(set_attr "type" "vecload")])
>> >
>> > +(define_insn "altivec_lvx_<mode>_1op"
>> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
>> > +        (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
>> > +                        (const_int -16))))]
>> > +  "TARGET_ALTIVEC && TARGET_64BIT"
>> > +  "lvx %0,0,%1"
>> > +  [(set_attr "type" "vecload")])
>> > +
>> > +; 32-bit versions of the above.
>> > +(define_insn "altivec_lvx_<mode>_2op_si"
>> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
>> > +        (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
>> > +                                  (match_operand:SI 2 "register_operand" "r"))
>> > +                        (const_int -16))))]
>> > +  "TARGET_ALTIVEC && TARGET_32BIT"
>> > +  "lvx %0,%1,%2"
>> > +  [(set_attr "type" "vecload")])
>> > +
>> > +(define_insn "altivec_lvx_<mode>_1op_si"
>> > +  [(set (match_operand:VM2 0 "register_operand" "=v")
>> > +        (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
>> > +                        (const_int -16))))]
>> > +  "TARGET_ALTIVEC && TARGET_32BIT"
>> > +  "lvx %0,0,%1"
>> > +  [(set_attr "type" "vecload")])
>> > +
>> > +; This version of stvx is used only in cases where we need to force an stvx
>> > +; over any other store, and we don't care about losing CSE opportunities.
>> > +; Its primary use is for epilogue register restores.
>> >  (define_insn "altivec_stvx_<mode>_internal"
>> >    [(parallel
>> >      [(set (match_operand:VM2 0 "memory_operand" "=Z")
>> > @@ -2560,6 +2574,42 @@
>> >    "stvx %1,%y0"
>> >    [(set_attr "type" "vecstore")])
>> >
>> > +; The next two patterns embody what stvx should usually look like.
>> > +(define_insn "altivec_stvx_<mode>_2op"
>> > +  [(set (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
>> > +                                 (match_operand:DI 2 "register_operand" "r"))
>> > +                        (const_int -16)))
>> > +        (match_operand:VM2 0 "register_operand" "v"))]
>> > +  "TARGET_ALTIVEC && TARGET_64BIT"
>> > +  "stvx %0,%1,%2"
>> > +  [(set_attr "type" "vecstore")])
>> > +
>> > +(define_insn "altivec_stvx_<mode>_1op"
>> > +  [(set (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
>> > +                        (const_int -16)))
>> > +        (match_operand:VM2 0 "register_operand" "v"))]
>> > +  "TARGET_ALTIVEC && TARGET_64BIT"
>> > +  "stvx %0,0,%1"
>> > +  [(set_attr "type" "vecstore")])
>> > +
>> > +; 32-bit versions of the above.
>> > +(define_insn "altivec_stvx_<mode>_2op_si"
>> > +  [(set (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
>> > +                                 (match_operand:SI 2 "register_operand" "r"))
>> > +                        (const_int -16)))
>> > +        (match_operand:VM2 0 "register_operand" "v"))]
>> > +  "TARGET_ALTIVEC && TARGET_32BIT"
>> > +  "stvx %0,%1,%2"
>> > +  [(set_attr "type" "vecstore")])
>> > +
>> > +(define_insn "altivec_stvx_<mode>_1op_si"
>> > +  [(set (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
>> > +                        (const_int -16)))
>> > +        (match_operand:VM2 0 "register_operand" "v"))]
>> > +  "TARGET_ALTIVEC && TARGET_32BIT"
>> > +  "stvx %0,0,%1"
>> > +  [(set_attr "type" "vecstore")])
>> > +
>> >  (define_expand "altivec_stvxl_<mode>"
>> >    [(parallel
>> >      [(set (match_operand:VM2 0 "memory_operand" "=Z")
>> > Index: gcc/config/rs6000/rs6000-c.c
>> > ===================================================================
>> > --- gcc/config/rs6000/rs6000-c.c        (revision 235090)
>> > +++ gcc/config/rs6000/rs6000-c.c        (working copy)
>> > @@ -4800,6 +4800,164 @@ assignment for unaligned loads and stores");
>> >        return stmt;
>> >      }
>> >
>> > +  /* Expand vec_ld into an expression that masks the address and
>> > +     performs the load.  We need to expand this early to allow
>> > +     the best aliasing, as by the time we get into RTL we no longer
>> > +     are able to honor __restrict__, for example.  We may want to
>> > +     consider this for all memory access built-ins.
>> > +
>> > +     When -maltivec=be is specified, simply punt to existing
>> > +     built-in processing.  */
>> > +  if (fcode == ALTIVEC_BUILTIN_VEC_LD
>> > +      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
>> > +    {
>> > +      tree arg0 = (*arglist)[0];
>> > +      tree arg1 = (*arglist)[1];
>> > +
>> > +      /* Strip qualifiers like "const" from the pointer arg.  */
>> > +      tree arg1_type = TREE_TYPE (arg1);
>> > +      tree inner_type = TREE_TYPE (arg1_type);
>> > +      if (TYPE_QUALS (TREE_TYPE (arg1_type)) != 0)
>> > +       {
>> > +         arg1_type = build_pointer_type (build_qualified_type (inner_type,
>> > +                                                               0));
>> > +         arg1 = fold_convert (arg1_type, arg1);
>> > +       }
>> > +
>> > +      /* Construct the masked address.  We have to jump through some hoops
>> > +        here.  If the first argument to a PLUS_EXPR is a pointer,
>> > +        build_binary_op will multiply the offset by the size of the
>> > +        inner type of the pointer (C semantics).  With vec_ld and vec_st,
>> > +        the offset must be left alone.  However, if we convert to a
>> > +        sizetype to do the arithmetic, we get a PLUS_EXPR instead of a
>> > +        POINTER_PLUS_EXPR, which interferes with aliasing (causing us,
>> > +        for example, to lose "restrict" information).  Thus where legal,
>> > +        we pre-adjust the offset knowing that a multiply by size is
>> > +        coming.  When the offset isn't a multiple of the size, we are
>> > +        forced to do the arithmetic in size_type for correctness, at the
>> > +        cost of losing aliasing information.  This, however, should be
>> > +        quite rare with these operations.  */
>> > +      arg0 = fold (arg0);
>> > +
>> > +      /* Let existing error handling take over if we don't have a constant
>> > +        offset.  */
>> > +      if (TREE_CODE (arg0) == INTEGER_CST)
>> > +       {
>> > +         HOST_WIDE_INT off = TREE_INT_CST_LOW (arg0);
>> > +         HOST_WIDE_INT size = int_size_in_bytes (inner_type);
>> > +         tree addr;
>> > +
>> > +         if (off % size == 0)
>> > +           {
>> > +             tree adjoff = build_int_cst (TREE_TYPE (arg0), off / size);
>> > +             addr = build_binary_op (loc, PLUS_EXPR, arg1, adjoff, 0);
>> > +             addr = build1 (NOP_EXPR, sizetype, addr);
>> > +           }
>> > +         else
>> > +           {
>> > +             tree hack_arg1 = build1 (NOP_EXPR, sizetype, arg1);
>> > +             addr = build_binary_op (loc, PLUS_EXPR, hack_arg1, arg0, 0);
>> > +           }
>> > +         tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
>> > +                                         build_int_cst (sizetype, -16), 0);
>> > +
>> > +         /* Find the built-in to get the return type so we can convert
>> > +            the result properly (or fall back to default handling if the
>> > +            arguments aren't compatible).  */
>> > +         for (desc = altivec_overloaded_builtins;
>> > +              desc->code && desc->code != fcode; desc++)
>> > +           continue;
>> > +
>> > +         for (; desc->code == fcode; desc++)
>> > +           if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
>> > +               && (rs6000_builtin_type_compatible (TREE_TYPE (arg1),
>> > +                                                   desc->op2)))
>> > +             {
>> > +               tree ret_type = rs6000_builtin_type (desc->ret_type);
>> > +               if (TYPE_MODE (ret_type) == V2DImode)
>> > +                 /* Type-based aliasing analysis thinks vector long
>> > +                    and vector long long are different and will put them
>> > +                    in distinct alias classes.  Force our return type
>> > +                    to be a may-alias type to avoid this.  */
>> > +                 ret_type
>> > +                   = build_pointer_type_for_mode (ret_type, Pmode,
>> > +                                                  true/*can_alias_all*/);
>> > +               else
>> > +                 ret_type = build_pointer_type (ret_type);
>> > +               aligned = build1 (NOP_EXPR, ret_type, aligned);
>> > +               tree ret_val = build_indirect_ref (loc, aligned, RO_NULL);
>> > +               return ret_val;
>> > +             }
>> > +       }
>> > +    }
>> > +
>> > +  /* Similarly for stvx.  */
>> > +  if (fcode == ALTIVEC_BUILTIN_VEC_ST
>> > +      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
>> > +    {
>> > +      tree arg0 = (*arglist)[0];
>> > +      tree arg1 = (*arglist)[1];
>> > +      tree arg2 = (*arglist)[2];
>> > +
>> > +      /* Construct the masked address.  See handling for ALTIVEC_BUILTIN_VEC_LD
>> > +        for an explanation of address arithmetic concerns.  */
>> > +      arg1 = fold (arg1);
>> > +
>> > +      /* Let existing error handling take over if we don't have a constant
>> > +        offset.  */
>> > +      if (TREE_CODE (arg1) == INTEGER_CST)
>> > +       {
>> > +         HOST_WIDE_INT off = TREE_INT_CST_LOW (arg1);
>> > +         tree inner_type = TREE_TYPE (TREE_TYPE (arg2));
>> > +         HOST_WIDE_INT size = int_size_in_bytes (inner_type);
>> > +         tree addr;
>> > +
>> > +         if (off % size == 0)
>> > +           {
>> > +             tree adjoff = build_int_cst (TREE_TYPE (arg1), off / size);
>> > +             addr = build_binary_op (loc, PLUS_EXPR, arg2, adjoff, 0);
>> > +             addr = build1 (NOP_EXPR, sizetype, addr);
>> > +           }
>> > +         else
>> > +           {
>> > +             tree hack_arg2 = build1 (NOP_EXPR, sizetype, arg2);
>> > +             addr = build_binary_op (loc, PLUS_EXPR, hack_arg2, arg1, 0);
>> > +           }
>> > +         tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
>> > +                                         build_int_cst (sizetype, -16), 0);
>> > +
>> > +         /* Find the built-in to make sure a compatible one exists; if not
>> > +            we fall back to default handling to get the error message.  */
>> > +         for (desc = altivec_overloaded_builtins;
>> > +              desc->code && desc->code != fcode; desc++)
>> > +           continue;
>> > +
>> > +         for (; desc->code == fcode; desc++)
>> > +           if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
>> > +               && rs6000_builtin_type_compatible (TREE_TYPE (arg1), desc->op2)
>> > +               && rs6000_builtin_type_compatible (TREE_TYPE (arg2),
>> > +                                                  desc->op3))
>> > +             {
>> > +               tree arg0_type = TREE_TYPE (arg0);
>> > +               if (TYPE_MODE (arg0_type) == V2DImode)
>> > +                 /* Type-based aliasing analysis thinks vector long
>> > +                    and vector long long are different and will put them
>> > +                    in distinct alias classes.  Force our address type
>> > +                    to be a may-alias type to avoid this.  */
>> > +                 arg0_type
>> > +                   = build_pointer_type_for_mode (arg0_type, Pmode,
>> > +                                                  true/*can_alias_all*/);
>> > +               else
>> > +                 arg0_type = build_pointer_type (arg0_type);
>> > +               aligned = build1 (NOP_EXPR, arg0_type, aligned);
>> > +               tree stg = build_indirect_ref (loc, aligned, RO_NULL);
>> > +               tree retval = build2 (MODIFY_EXPR, TREE_TYPE (stg), stg,
>> > +                                     convert (TREE_TYPE (stg), arg0));
>> > +               return retval;
>> > +             }
>> > +       }
>> > +    }
>> > +
>> >    for (n = 0;
>> >         !VOID_TYPE_P (TREE_VALUE (fnargs)) && n < nargs;
>> >         fnargs = TREE_CHAIN (fnargs), n++)
>> > Index: gcc/config/rs6000/rs6000.c
>> > ===================================================================
>> > --- gcc/config/rs6000/rs6000.c  (revision 235090)
>> > +++ gcc/config/rs6000/rs6000.c  (working copy)
>> > @@ -13025,9 +13025,9 @@ swap_selector_for_mode (machine_mode mode)
>> >    return force_reg (V16QImode, gen_rtx_CONST_VECTOR (V16QImode, gen_rtvec_v (16, perm)));
>> >  }
>> >
>> > -/* Generate code for an "lvx", "lvxl", or "lve*x" built-in for a little endian target
>> > -   with -maltivec=be specified.  Issue the load followed by an element-reversing
>> > -   permute.  */
>> > +/* Generate code for an "lvxl", or "lve*x" built-in for a little endian target
>> > +   with -maltivec=be specified.  Issue the load followed by an element-
>> > +   reversing permute.  */
>> >  void
>> >  altivec_expand_lvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
>> >  {
>> > @@ -13043,8 +13043,8 @@ altivec_expand_lvx_be (rtx op0, rtx op1, machine_m
>> >    emit_insn (gen_rtx_SET (op0, vperm));
>> >  }
>> >
>> > -/* Generate code for a "stvx" or "stvxl" built-in for a little endian target
>> > -   with -maltivec=be specified.  Issue the store preceded by an element-reversing
>> > +/* Generate code for a "stvxl" built-in for a little endian target with
>> > +   -maltivec=be specified.  Issue the store preceded by an element-reversing
>> >     permute.  */
>> >  void
>> >  altivec_expand_stvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
>> > @@ -13106,22 +13106,65 @@ altivec_expand_lv_builtin (enum insn_code icode, t
>> >
>> >    op1 = copy_to_mode_reg (mode1, op1);
>> >
>> > -  if (op0 == const0_rtx)
>> > +  /* For LVX, express the RTL accurately by ANDing the address with -16.
>> > +     LVXL and LVE*X expand to use UNSPECs to hide their special behavior,
>> > +     so the raw address is fine.  */
>> > +  switch (icode)
>> >      {
>> > -      addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
>> > -    }
>> > -  else
>> > -    {
>> > -      op0 = copy_to_mode_reg (mode0, op0);
>> > -      addr = gen_rtx_MEM (blk ? BLKmode : tmode, gen_rtx_PLUS (Pmode, op0, op1));
>> > -    }
>> > +    case CODE_FOR_altivec_lvx_v2df_2op:
>> > +    case CODE_FOR_altivec_lvx_v2di_2op:
>> > +    case CODE_FOR_altivec_lvx_v4sf_2op:
>> > +    case CODE_FOR_altivec_lvx_v4si_2op:
>> > +    case CODE_FOR_altivec_lvx_v8hi_2op:
>> > +    case CODE_FOR_altivec_lvx_v16qi_2op:
>> > +      {
>> > +       rtx rawaddr;
>> > +       if (op0 == const0_rtx)
>> > +         rawaddr = op1;
>> > +       else
>> > +         {
>> > +           op0 = copy_to_mode_reg (mode0, op0);
>> > +           rawaddr = gen_rtx_PLUS (Pmode, op1, op0);
>> > +         }
>> > +       addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
>> > +       addr = gen_rtx_MEM (blk ? BLKmode : tmode, addr);
>> >
>> > -  pat = GEN_FCN (icode) (target, addr);
>> > +       /* For -maltivec=be, emit the load and follow it up with a
>> > +          permute to swap the elements.  */
>> > +       if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
>> > +         {
>> > +           rtx temp = gen_reg_rtx (tmode);
>> > +           emit_insn (gen_rtx_SET (temp, addr));
>> >
>> > -  if (! pat)
>> > -    return 0;
>> > -  emit_insn (pat);
>> > +           rtx sel = swap_selector_for_mode (tmode);
>> > +           rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, temp, temp, sel),
>> > +                                       UNSPEC_VPERM);
>> > +           emit_insn (gen_rtx_SET (target, vperm));
>> > +         }
>> > +       else
>> > +         emit_insn (gen_rtx_SET (target, addr));
>> >
>> > +       break;
>> > +      }
>> > +
>> > +    default:
>> > +      if (op0 == const0_rtx)
>> > +       addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
>> > +      else
>> > +       {
>> > +         op0 = copy_to_mode_reg (mode0, op0);
>> > +         addr = gen_rtx_MEM (blk ? BLKmode : tmode,
>> > +                             gen_rtx_PLUS (Pmode, op1, op0));
>> > +       }
>> > +
>> > +      pat = GEN_FCN (icode) (target, addr);
>> > +      if (! pat)
>> > +       return 0;
>> > +      emit_insn (pat);
>> > +
>> > +      break;
>> > +    }
>> > +
>> >    return target;
>> >  }
>> >
>> > @@ -13208,7 +13251,7 @@ altivec_expand_stv_builtin (enum insn_code icode,
>> >    rtx op0 = expand_normal (arg0);
>> >    rtx op1 = expand_normal (arg1);
>> >    rtx op2 = expand_normal (arg2);
>> > -  rtx pat, addr;
>> > +  rtx pat, addr, rawaddr;
>> >    machine_mode tmode = insn_data[icode].operand[0].mode;
>> >    machine_mode smode = insn_data[icode].operand[1].mode;
>> >    machine_mode mode1 = Pmode;
>> > @@ -13220,24 +13263,69 @@ altivec_expand_stv_builtin (enum insn_code icode,
>> >        || arg2 == error_mark_node)
>> >      return const0_rtx;
>> >
>> > -  if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
>> > -    op0 = copy_to_mode_reg (smode, op0);
>> > -
>> >    op2 = copy_to_mode_reg (mode2, op2);
>> >
>> > -  if (op1 == const0_rtx)
>> > +  /* For STVX, express the RTL accurately by ANDing the address with -16.
>> > +     STVXL and STVE*X expand to use UNSPECs to hide their special behavior,
>> > +     so the raw address is fine.  */
>> > +  switch (icode)
>> >      {
>> > -      addr = gen_rtx_MEM (tmode, op2);
>> > -    }
>> > -  else
>> > -    {
>> > -      op1 = copy_to_mode_reg (mode1, op1);
>> > -      addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op1, op2));
>> > -    }
>> > +    case CODE_FOR_altivec_stvx_v2df_2op:
>> > +    case CODE_FOR_altivec_stvx_v2di_2op:
>> > +    case CODE_FOR_altivec_stvx_v4sf_2op:
>> > +    case CODE_FOR_altivec_stvx_v4si_2op:
>> > +    case CODE_FOR_altivec_stvx_v8hi_2op:
>> > +    case CODE_FOR_altivec_stvx_v16qi_2op:
>> > +      {
>> > +       if (op1 == const0_rtx)
>> > +         rawaddr = op2;
>> > +       else
>> > +         {
>> > +           op1 = copy_to_mode_reg (mode1, op1);
>> > +           rawaddr = gen_rtx_PLUS (Pmode, op2, op1);
>> > +         }
>> >
>> > -  pat = GEN_FCN (icode) (addr, op0);
>> > -  if (pat)
>> > -    emit_insn (pat);
>> > +       addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
>> > +       addr = gen_rtx_MEM (tmode, addr);
>> > +
>> > +       op0 = copy_to_mode_reg (tmode, op0);
>> > +
>> > +       /* For -maltivec=be, emit a permute to swap the elements, followed
>> > +          by the store.  */
>> > +       if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
>> > +         {
>> > +           rtx temp = gen_reg_rtx (tmode);
>> > +           rtx sel = swap_selector_for_mode (tmode);
>> > +           rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, op0, op0, sel),
>> > +                                       UNSPEC_VPERM);
>> > +           emit_insn (gen_rtx_SET (temp, vperm));
>> > +           emit_insn (gen_rtx_SET (addr, temp));
>> > +         }
>> > +       else
>> > +         emit_insn (gen_rtx_SET (addr, op0));
>> > +
>> > +       break;
>> > +      }
>> > +
>> > +    default:
>> > +      {
>> > +       if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
>> > +         op0 = copy_to_mode_reg (smode, op0);
>> > +
>> > +       if (op1 == const0_rtx)
>> > +         addr = gen_rtx_MEM (tmode, op2);
>> > +       else
>> > +         {
>> > +           op1 = copy_to_mode_reg (mode1, op1);
>> > +           addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op2, op1));
>> > +         }
>> > +
>> > +       pat = GEN_FCN (icode) (addr, op0);
>> > +       if (pat)
>> > +         emit_insn (pat);
>> > +      }
>> > +    }
>> > +
>> >    return NULL_RTX;
>> >  }
>> >
>> > @@ -14073,18 +14161,18 @@ altivec_expand_builtin (tree exp, rtx target, bool
>> >    switch (fcode)
>> >      {
>> >      case ALTIVEC_BUILTIN_STVX_V2DF:
>> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df, exp);
>> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df_2op, exp);
>> >      case ALTIVEC_BUILTIN_STVX_V2DI:
>> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di, exp);
>> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di_2op, exp);
>> >      case ALTIVEC_BUILTIN_STVX_V4SF:
>> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf, exp);
>> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf_2op, exp);
>> >      case ALTIVEC_BUILTIN_STVX:
>> >      case ALTIVEC_BUILTIN_STVX_V4SI:
>> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si, exp);
>> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si_2op, exp);
>> >      case ALTIVEC_BUILTIN_STVX_V8HI:
>> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi, exp);
>> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi_2op, exp);
>> >      case ALTIVEC_BUILTIN_STVX_V16QI:
>> > -      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi, exp);
>> > +      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi_2op, exp);
>> >      case ALTIVEC_BUILTIN_STVEBX:
>> >        return altivec_expand_stv_builtin (CODE_FOR_altivec_stvebx, exp);
>> >      case ALTIVEC_BUILTIN_STVEHX:
>> > @@ -14272,23 +14360,23 @@ altivec_expand_builtin (tree exp, rtx target, bool
>> >        return altivec_expand_lv_builtin (CODE_FOR_altivec_lvxl_v16qi,
>> >                                         exp, target, false);
>> >      case ALTIVEC_BUILTIN_LVX_V2DF:
>> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df,
>> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df_2op,
>> >                                         exp, target, false);
>> >      case ALTIVEC_BUILTIN_LVX_V2DI:
>> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di,
>> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di_2op,
>> >                                         exp, target, false);
>> >      case ALTIVEC_BUILTIN_LVX_V4SF:
>> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf,
>> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf_2op,
>> >                                         exp, target, false);
>> >      case ALTIVEC_BUILTIN_LVX:
>> >      case ALTIVEC_BUILTIN_LVX_V4SI:
>> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si,
>> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si_2op,
>> >                                         exp, target, false);
>> >      case ALTIVEC_BUILTIN_LVX_V8HI:
>> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi,
>> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi_2op,
>> >                                         exp, target, false);
>> >      case ALTIVEC_BUILTIN_LVX_V16QI:
>> > -      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi,
>> > +      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi_2op,
>> >                                         exp, target, false);
>> >      case ALTIVEC_BUILTIN_LVLX:
>> >        return altivec_expand_lv_builtin (CODE_FOR_altivec_lvlx,
>> > @@ -37139,7 +37227,9 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
>> >       fix them up by converting them to permuting ones.  Exceptions:
>> >       UNSPEC_LVE, UNSPEC_LVX, and UNSPEC_STVX, which have a PARALLEL
>> >       body instead of a SET; and UNSPEC_STVE, which has an UNSPEC
>> > -     for the SET source.  */
>> > +     for the SET source.  Also we must now make an exception for lvx
>> > +     and stvx when they are not in the UNSPEC_LVX/STVX form (with the
>> > +     explicit "& -16") since this leads to unrecognizable insns.  */
>> >    rtx body = PATTERN (insn);
>> >    int i = INSN_UID (insn);
>> >
>> > @@ -37147,6 +37237,11 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
>> >      {
>> >        if (GET_CODE (body) == SET)
>> >         {
>> > +         rtx rhs = SET_SRC (body);
>> > +         gcc_assert (GET_CODE (rhs) == MEM);
>> > +         if (GET_CODE (XEXP (rhs, 0)) == AND)
>> > +           return 0;
>> > +
>> >           *special = SH_NOSWAP_LD;
>> >           return 1;
>> >         }
>> > @@ -37156,8 +37251,14 @@ insn_is_swappable_p (swap_web_entry *insn_entry, r
>> >
>> >    if (insn_entry[i].is_store)
>> >      {
>> > -      if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) != UNSPEC)
>> > +      if (GET_CODE (body) == SET
>> > +         && GET_CODE (SET_SRC (body)) != UNSPEC)
>> >         {
>> > +         rtx lhs = SET_DEST (body);
>> > +         gcc_assert (GET_CODE (lhs) == MEM);
>> > +         if (GET_CODE (XEXP (lhs, 0)) == AND)
>> > +           return 0;
>> > +
>> >           *special = SH_NOSWAP_ST;
>> >           return 1;
>> >         }
>> > @@ -37827,6 +37928,267 @@ dump_swap_insn_table (swap_web_entry *insn_entry)
>> >    fputs ("\n", dump_file);
>> >  }
>> >
>> > +/* Return RTX with its address canonicalized to (reg) or (+ reg reg).
>> > +   Here RTX is an (& addr (const_int -16)).  Always return a new copy
>> > +   to avoid problems with combine.  */
>> > +static rtx
>> > +alignment_with_canonical_addr (rtx align)
>> > +{
>> > +  rtx canon;
>> > +  rtx addr = XEXP (align, 0);
>> > +
>> > +  if (REG_P (addr))
>> > +    canon = addr;
>> > +
>> > +  else if (GET_CODE (addr) == PLUS)
>> > +    {
>> > +      rtx addrop0 = XEXP (addr, 0);
>> > +      rtx addrop1 = XEXP (addr, 1);
>> > +
>> > +      if (!REG_P (addrop0))
>> > +       addrop0 = force_reg (GET_MODE (addrop0), addrop0);
>> > +
>> > +      if (!REG_P (addrop1))
>> > +       addrop1 = force_reg (GET_MODE (addrop1), addrop1);
>> > +
>> > +      canon = gen_rtx_PLUS (GET_MODE (addr), addrop0, addrop1);
>> > +    }
>> > +
>> > +  else
>> > +    canon = force_reg (GET_MODE (addr), addr);
>> > +
>> > +  return gen_rtx_AND (GET_MODE (align), canon, GEN_INT (-16));
>> > +}
>> > +
>> > +/* Check whether an rtx is an alignment mask, and if so, return
>> > +   a fully-expanded rtx for the masking operation.  */
>> > +static rtx
>> > +alignment_mask (rtx_insn *insn)
>> > +{
>> > +  rtx body = PATTERN (insn);
>> > +
>> > +  if (GET_CODE (body) != SET
>> > +      || GET_CODE (SET_SRC (body)) != AND
>> > +      || !REG_P (XEXP (SET_SRC (body), 0)))
>> > +    return 0;
>> > +
>> > +  rtx mask = XEXP (SET_SRC (body), 1);
>> > +
>> > +  if (GET_CODE (mask) == CONST_INT)
>> > +    {
>> > +      if (INTVAL (mask) == -16)
>> > +       return alignment_with_canonical_addr (SET_SRC (body));
>> > +      else
>> > +       return 0;
>> > +    }
>> > +
>> > +  if (!REG_P (mask))
>> > +    return 0;
>> > +
>> > +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> > +  df_ref use;
>> > +  rtx real_mask = 0;
>> > +
>> > +  FOR_EACH_INSN_INFO_USE (use, insn_info)
>> > +    {
>> > +      if (!rtx_equal_p (DF_REF_REG (use), mask))
>> > +       continue;
>> > +
>> > +      struct df_link *def_link = DF_REF_CHAIN (use);
>> > +      if (!def_link || def_link->next)
>> > +       return 0;
>> > +
>> > +      rtx_insn *const_insn = DF_REF_INSN (def_link->ref);
>> > +      rtx const_body = PATTERN (const_insn);
>> > +      if (GET_CODE (const_body) != SET)
>> > +       return 0;
>> > +
>> > +      real_mask = SET_SRC (const_body);
>> > +
>> > +      if (GET_CODE (real_mask) != CONST_INT
>> > +         || INTVAL (real_mask) != -16)
>> > +       return 0;
>> > +    }
>> > +
>> > +  if (real_mask == 0)
>> > +    return 0;
>> > +
>> > +  return alignment_with_canonical_addr (SET_SRC (body));
>> > +}
>> > +
>> > +/* Given INSN that's a load or store based at BASE_REG, look for a
>> > +   feeding computation that aligns its address on a 16-byte boundary.  */
>> > +static rtx
>> > +find_alignment_op (rtx_insn *insn, rtx base_reg)
>> > +{
>> > +  df_ref base_use;
>> > +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> > +  rtx and_operation = 0;
>> > +
>> > +  FOR_EACH_INSN_INFO_USE (base_use, insn_info)
>> > +    {
>> > +      if (!rtx_equal_p (DF_REF_REG (base_use), base_reg))
>> > +       continue;
>> > +
>> > +      struct df_link *base_def_link = DF_REF_CHAIN (base_use);
>> > +      if (!base_def_link || base_def_link->next)
>> > +       break;
>> > +
>> > +      rtx_insn *and_insn = DF_REF_INSN (base_def_link->ref);
>> > +      and_operation = alignment_mask (and_insn);
>> > +      if (and_operation != 0)
>> > +       break;
>> > +    }
>> > +
>> > +  return and_operation;
>> > +}
>> > +
>> > +struct del_info { bool replace; rtx_insn *replace_insn; };
>> > +
>> > +/* If INSN is the load for an lvx pattern, put it in canonical form.  */
>> > +static void
>> > +combine_lvx_pattern (rtx_insn *insn, del_info *to_delete)
>> > +{
>> > +  rtx body = PATTERN (insn);
>> > +  gcc_assert (GET_CODE (body) == SET
>> > +             && GET_CODE (SET_SRC (body)) == VEC_SELECT
>> > +             && GET_CODE (XEXP (SET_SRC (body), 0)) == MEM);
>> > +
>> > +  rtx mem = XEXP (SET_SRC (body), 0);
>> > +  rtx base_reg = XEXP (mem, 0);
>> > +
>> > +  rtx and_operation = find_alignment_op (insn, base_reg);
>> > +
>> > +  if (and_operation != 0)
>> > +    {
>> > +      df_ref def;
>> > +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> > +      FOR_EACH_INSN_INFO_DEF (def, insn_info)
>> > +       {
>> > +         struct df_link *link = DF_REF_CHAIN (def);
>> > +         if (!link || link->next)
>> > +           break;
>> > +
>> > +         rtx_insn *swap_insn = DF_REF_INSN (link->ref);
>> > +         if (!insn_is_swap_p (swap_insn)
>> > +             || insn_is_load_p (swap_insn)
>> > +             || insn_is_store_p (swap_insn))
>> > +           break;
>> > +
>> > +         /* Expected lvx pattern found.  Change the swap to
>> > +            a copy, and propagate the AND operation into the
>> > +            load.  */
>> > +         to_delete[INSN_UID (swap_insn)].replace = true;
>> > +         to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
>> > +
>> > +         XEXP (mem, 0) = and_operation;
>> > +         SET_SRC (body) = mem;
>> > +         INSN_CODE (insn) = -1; /* Force re-recognition.  */
>> > +         df_insn_rescan (insn);
>> > +
>> > +         if (dump_file)
>> > +           fprintf (dump_file, "lvx opportunity found at %d\n",
>> > +                    INSN_UID (insn));
>> > +       }
>> > +    }
>> > +}
>> > +
>> > +/* If INSN is the store for an stvx pattern, put it in canonical form.  */
>> > +static void
>> > +combine_stvx_pattern (rtx_insn *insn, del_info *to_delete)
>> > +{
>> > +  rtx body = PATTERN (insn);
>> > +  gcc_assert (GET_CODE (body) == SET
>> > +             && GET_CODE (SET_DEST (body)) == MEM
>> > +             && GET_CODE (SET_SRC (body)) == VEC_SELECT);
>> > +  rtx mem = SET_DEST (body);
>> > +  rtx base_reg = XEXP (mem, 0);
>> > +
>> > +  rtx and_operation = find_alignment_op (insn, base_reg);
>> > +
>> > +  if (and_operation != 0)
>> > +    {
>> > +      rtx src_reg = XEXP (SET_SRC (body), 0);
>> > +      df_ref src_use;
>> > +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
>> > +      FOR_EACH_INSN_INFO_USE (src_use, insn_info)
>> > +       {
>> > +         if (!rtx_equal_p (DF_REF_REG (src_use), src_reg))
>> > +           continue;
>> > +
>> > +         struct df_link *link = DF_REF_CHAIN (src_use);
>> > +         if (!link || link->next)
>> > +           break;
>> > +
>> > +         rtx_insn *swap_insn = DF_REF_INSN (link->ref);
>> > +         if (!insn_is_swap_p (swap_insn)
>> > +             || insn_is_load_p (swap_insn)
>> > +             || insn_is_store_p (swap_insn))
>> > +           break;
>> > +
>> > +         /* Expected stvx pattern found.  Change the swap to
>> > +            a copy, and propagate the AND operation into the
>> > +            store.  */
>> > +         to_delete[INSN_UID (swap_insn)].replace = true;
>> > +         to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
>> > +
>> > +         XEXP (mem, 0) = and_operation;
>> > +         SET_SRC (body) = src_reg;
>> > +         INSN_CODE (insn) = -1; /* Force re-recognition.  */
>> > +         df_insn_rescan (insn);
>> > +
>> > +         if (dump_file)
>> > +           fprintf (dump_file, "stvx opportunity found at %d\n",
>> > +                    INSN_UID (insn));
>> > +       }
>> > +    }
>> > +}
>> > +
>> > +/* Look for patterns created from builtin lvx and stvx calls, and
>> > +   canonicalize them to be properly recognized as such.  */
>> > +static void
>> > +combine_lvx_stvx_patterns (function *fun)
>> > +{
>> > +  int i;
>> > +  basic_block bb;
>> > +  rtx_insn *insn;
>> > +
>> > +  int num_insns = get_max_uid ();
>> > +  del_info *to_delete = XCNEWVEC (del_info, num_insns);
>> > +
>> > +  FOR_ALL_BB_FN (bb, fun)
>> > +    FOR_BB_INSNS (bb, insn)
>> > +    {
>> > +      if (!NONDEBUG_INSN_P (insn))
>> > +       continue;
>> > +
>> > +      if (insn_is_load_p (insn) && insn_is_swap_p (insn))
>> > +       combine_lvx_pattern (insn, to_delete);
>> > +      else if (insn_is_store_p (insn) && insn_is_swap_p (insn))
>> > +       combine_stvx_pattern (insn, to_delete);
>> > +    }
>> > +
>> > +  /* Turning swaps into copies is delayed until now, to avoid problems
>> > +     with deleting instructions during the insn walk.  */
>> > +  for (i = 0; i < num_insns; i++)
>> > +    if (to_delete[i].replace)
>> > +      {
>> > +       rtx swap_body = PATTERN (to_delete[i].replace_insn);
>> > +       rtx src_reg = XEXP (SET_SRC (swap_body), 0);
>> > +       rtx copy = gen_rtx_SET (SET_DEST (swap_body), src_reg);
>> > +       rtx_insn *new_insn = emit_insn_before (copy,
>> > +                                              to_delete[i].replace_insn);
>> > +       set_block_for_insn (new_insn,
>> > +                           BLOCK_FOR_INSN (to_delete[i].replace_insn));
>> > +       df_insn_rescan (new_insn);
>> > +       df_insn_delete (to_delete[i].replace_insn);
>> > +       remove_insn (to_delete[i].replace_insn);
>> > +       to_delete[i].replace_insn->set_deleted ();
>> > +      }
>> > +
>> > +  free (to_delete);
>> > +}
>> > +
>> >  /* Main entry point for this pass.  */
>> >  unsigned int
>> >  rs6000_analyze_swaps (function *fun)
>> > @@ -37833,7 +38195,7 @@ rs6000_analyze_swaps (function *fun)
>> >  {
>> >    swap_web_entry *insn_entry;
>> >    basic_block bb;
>> > -  rtx_insn *insn;
>> > +  rtx_insn *insn, *curr_insn = 0;
>> >
>> >    /* Dataflow analysis for use-def chains.  */
>> >    df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
>> > @@ -37841,12 +38203,15 @@ rs6000_analyze_swaps (function *fun)
>> >    df_analyze ();
>> >    df_set_flags (DF_DEFER_INSN_RESCAN);
>> >
>> > +  /* Pre-pass to combine lvx and stvx patterns so we don't lose info.  */
>> > +  combine_lvx_stvx_patterns (fun);
>> > +
>> >    /* Allocate structure to represent webs of insns.  */
>> >    insn_entry = XCNEWVEC (swap_web_entry, get_max_uid ());
>> >
>> >    /* Walk the insns to gather basic data.  */
>> >    FOR_ALL_BB_FN (bb, fun)
>> > -    FOR_BB_INSNS (bb, insn)
>> > +    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
>> >      {
>> >        unsigned int uid = INSN_UID (insn);
>> >        if (NONDEBUG_INSN_P (insn))
>> > Index: gcc/config/rs6000/vector.md
>> > ===================================================================
>> > --- gcc/config/rs6000/vector.md (revision 235090)
>> > +++ gcc/config/rs6000/vector.md (working copy)
>> > @@ -167,7 +167,14 @@
>> >    if (VECTOR_MEM_VSX_P (<MODE>mode))
>> >      {
>> >        operands[1] = rs6000_address_for_altivec (operands[1]);
>> > -      emit_insn (gen_altivec_lvx_<mode> (operands[0], operands[1]));
>> > +      rtx and_op = XEXP (operands[1], 0);
>> > +      gcc_assert (GET_CODE (and_op) == AND);
>> > +      rtx addr = XEXP (and_op, 0);
>> > +      if (GET_CODE (addr) == PLUS)
>> > +        emit_insn (gen_altivec_lvx_<mode>_2op (operands[0], XEXP (addr, 0),
>> > +                                              XEXP (addr, 1)));
>> > +      else
>> > +        emit_insn (gen_altivec_lvx_<mode>_1op (operands[0], operands[1]));
>> >        DONE;
>> >      }
>> >  }")
>> > @@ -183,7 +190,14 @@
>> >    if (VECTOR_MEM_VSX_P (<MODE>mode))
>> >      {
>> >        operands[0] = rs6000_address_for_altivec (operands[0]);
>> > -      emit_insn (gen_altivec_stvx_<mode> (operands[0], operands[1]));
>> > +      rtx and_op = XEXP (operands[0], 0);
>> > +      gcc_assert (GET_CODE (and_op) == AND);
>> > +      rtx addr = XEXP (and_op, 0);
>> > +      if (GET_CODE (addr) == PLUS)
>> > +        emit_insn (gen_altivec_stvx_<mode>_2op (operands[1], XEXP (addr, 0),
>> > +                                               XEXP (addr, 1)));
>> > +      else
>> > +        emit_insn (gen_altivec_stvx_<mode>_1op (operands[1], operands[0]));
>> >        DONE;
>> >      }
>> >  }")
>> >
>> >
>>
>>
>
>
Bill Schmidt April 20, 2016, 1:55 p.m. UTC | #5
On Tue, 2016-04-19 at 08:10 -0500, Bill Schmidt wrote:
> On Tue, 2016-04-19 at 10:09 +0200, Richard Biener wrote:
> > 
> > x86 nowadays has intrinsics implemented as inlines - they come from
> > header files.  It seems for ppc the intrinsics are somehow magically
> > there, w/o a header file?
> 
> Yes, and we really need to start gravitating to the inlines in header
> files model (Clang does this successfully for PowerPC and it is quite a
> bit cleaner, and allows for more optimization).  We have a very
> complicated setup for handling overloaded built-ins that could use a
> rewrite once somebody has time to attack it.  We do have one header file
> for built-ins (altivec.h) but it largely just #defines well-known
> aliases for the internal built-in names.  We have a lot of other things
> we have to do in GCC 7, but I'd like to do something about this in the
> relatively near future.  (Things like "vec_add" that just do a vector
> addition aren't expanded until RTL time??  Gack.)

Looking into this a bit more reminded me why things are the way they
are.  The AltiVec interfaces were designed way back to be overloaded
functions, which isn't valid C99.  Thus they can't be declared in
headers without some magic.  Clang solved this by adding an extension
__attribute__ ((__overloadable__)), which allows nice always-inline
functions that fully express the semantics and integrate well into the
optimizers.  To date, GCC doesn't have such an attribute.  Thus we have
this somewhat nasty code that gets called out of the front end that
allows us to resolve the overloaded built-ins during parsing.

With C11 we could use _Generic, but having two separate interfaces to
maintain based on the language level doesn't seem reasonable.
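
As a rough sketch of the _Generic direction (toy types and helper names
only, not the real altivec.h interface; the point is just that the
dispatch resolves entirely at compile time):

  /* Toy "overloaded vector add" built on C11 _Generic.  The vector types
     use GCC's generic vector_size extension so this compiles on any
     target; a real header would use the AltiVec types and built-ins.  */
  typedef int   toy_v4si __attribute__ ((vector_size (16)));
  typedef float toy_v4sf __attribute__ ((vector_size (16)));

  static inline toy_v4si toy_add_v4si (toy_v4si a, toy_v4si b) { return a + b; }
  static inline toy_v4sf toy_add_v4sf (toy_v4sf a, toy_v4sf b) { return a + b; }

  /* Pick the matching always-inline helper from the argument's type,
     then call it.  toy_vec_add (x, y) on two toy_v4si values resolves
     straight to toy_add_v4si, so the optimizers see an ordinary vector
     addition from the start.  */
  #define toy_vec_add(a, b)             \
    _Generic ((a),                      \
              toy_v4si: toy_add_v4si,   \
              toy_v4sf: toy_add_v4sf) ((a), (b))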

It looks like there is a way to do this with GCC built-ins, however,
using __builtin_choose_expr and __builtin_types_compatible_p
(https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html).  I need to
play with this and see what kind of code gets generated.  If we end up
with a bunch of run-time type checks, that would still not be a good
solution.
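
For the record, here is a minimal sketch of that approach, reusing the
toy_* types and helpers from the _Generic sketch above (illustrative
names only, not the real interface).  Since __builtin_types_compatible_p
is an integer constant expression, the selection happens entirely at
compile time:

  /* Same toy dispatch using the GCC built-ins, which also work at the
     gnu89/gnu99 level.  Selecting the helper function itself (rather
     than a call) keeps the unselected branch from type-checking
     mismatched arguments; __builtin_types_compatible_p folds to 0 or 1
     during parsing, so no run-time type checks remain.  An unsupported
     argument type simply fails to compile against the fallback.  */
  #define toy_vec_add2(a, b)                                            \
    (__builtin_choose_expr                                              \
       (__builtin_types_compatible_p (__typeof__ (a), toy_v4si),        \
        toy_add_v4si, toy_add_v4sf) ((a), (b)))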

I wonder how hard it would be to get support for __attribute__
((__overloadable__)) in GCC...

Thanks again,
Bill
Mike Stump April 20, 2016, 9:24 p.m. UTC | #6
> On Apr 20, 2016, at 6:55 AM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
> Looking into this a bit more reminded me why things are the way they
> are.  The AltiVec interfaces were designed way back to be overloaded
> functions, which isn't valid C99.  Thus they can't be declared in
> headers without some magic.

And for fun, I have a nice generic overload-resolution subsystem for builtin functions, built on top of a generic builtins subsystem.  Kinda would like to donate it, as with it builtins are quite a bit nicer to deal with.  It is yet another solution to the problem.

We have a 5k-line Python program that sings and dances and processes builtins and wires them into the compiler.  Let me know if you want to invest some time; otherwise, we’d see about contributing it at some point, maybe later this year.

Patch

Index: gcc/config/rs6000/altivec.md
===================================================================
--- gcc/config/rs6000/altivec.md	(revision 235090)
+++ gcc/config/rs6000/altivec.md	(working copy)
@@ -2514,20 +2514,9 @@ 
   "lvxl %0,%y1"
   [(set_attr "type" "vecload")])
 
-(define_expand "altivec_lvx_<mode>"
-  [(parallel
-    [(set (match_operand:VM2 0 "register_operand" "=v")
-	  (match_operand:VM2 1 "memory_operand" "Z"))
-     (unspec [(const_int 0)] UNSPEC_LVX)])]
-  "TARGET_ALTIVEC"
-{
-  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
-    {
-      altivec_expand_lvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_LVX);
-      DONE;
-    }
-})
-
+; This version of lvx is used only in cases where we need to force an lvx
+; over any other load, and we don't care about losing CSE opportunities.
+; Its primary use is for prologue register saves.
 (define_insn "altivec_lvx_<mode>_internal"
   [(parallel
     [(set (match_operand:VM2 0 "register_operand" "=v")
@@ -2537,20 +2526,45 @@ 
   "lvx %0,%y1"
   [(set_attr "type" "vecload")])
 
-(define_expand "altivec_stvx_<mode>"
-  [(parallel
-    [(set (match_operand:VM2 0 "memory_operand" "=Z")
-	  (match_operand:VM2 1 "register_operand" "v"))
-     (unspec [(const_int 0)] UNSPEC_STVX)])]
-  "TARGET_ALTIVEC"
-{
-  if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
-    {
-      altivec_expand_stvx_be (operands[0], operands[1], <MODE>mode, UNSPEC_STVX);
-      DONE;
-    }
-})
+; The next two patterns embody what lvx should usually look like.
+(define_insn "altivec_lvx_<mode>_2op"
+  [(set (match_operand:VM2 0 "register_operand" "=v")
+        (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
+                                  (match_operand:DI 2 "register_operand" "r"))
+		         (const_int -16))))]
+  "TARGET_ALTIVEC && TARGET_64BIT"
+  "lvx %0,%1,%2"
+  [(set_attr "type" "vecload")])
 
+(define_insn "altivec_lvx_<mode>_1op"
+  [(set (match_operand:VM2 0 "register_operand" "=v")
+        (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
+			 (const_int -16))))]
+  "TARGET_ALTIVEC && TARGET_64BIT"
+  "lvx %0,0,%1"
+  [(set_attr "type" "vecload")])
+
+; 32-bit versions of the above.
+(define_insn "altivec_lvx_<mode>_2op_si"
+  [(set (match_operand:VM2 0 "register_operand" "=v")
+        (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
+                                  (match_operand:SI 2 "register_operand" "r"))
+		         (const_int -16))))]
+  "TARGET_ALTIVEC && TARGET_32BIT"
+  "lvx %0,%1,%2"
+  [(set_attr "type" "vecload")])
+
+(define_insn "altivec_lvx_<mode>_1op_si"
+  [(set (match_operand:VM2 0 "register_operand" "=v")
+        (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
+			 (const_int -16))))]
+  "TARGET_ALTIVEC && TARGET_32BIT"
+  "lvx %0,0,%1"
+  [(set_attr "type" "vecload")])
+
+; This version of stvx is used only in cases where we need to force an stvx
+; over any other store, and we don't care about losing CSE opportunities.
+; Its primary use is for epilogue register restores.
 (define_insn "altivec_stvx_<mode>_internal"
   [(parallel
     [(set (match_operand:VM2 0 "memory_operand" "=Z")
@@ -2560,6 +2574,42 @@ 
   "stvx %1,%y0"
   [(set_attr "type" "vecstore")])
 
+; The next two patterns embody what stvx should usually look like.
+(define_insn "altivec_stvx_<mode>_2op"
+  [(set (mem:VM2 (and:DI (plus:DI (match_operand:DI 1 "register_operand" "b")
+  	                          (match_operand:DI 2 "register_operand" "r"))
+	                 (const_int -16)))
+        (match_operand:VM2 0 "register_operand" "v"))]
+  "TARGET_ALTIVEC && TARGET_64BIT"
+  "stvx %0,%1,%2"
+  [(set_attr "type" "vecstore")])
+
+(define_insn "altivec_stvx_<mode>_1op"
+  [(set (mem:VM2 (and:DI (match_operand:DI 1 "register_operand" "r")
+	                 (const_int -16)))
+        (match_operand:VM2 0 "register_operand" "v"))]
+  "TARGET_ALTIVEC && TARGET_64BIT"
+  "stvx %0,0,%1"
+  [(set_attr "type" "vecstore")])
+
+; 32-bit versions of the above.
+(define_insn "altivec_stvx_<mode>_2op_si"
+  [(set (mem:VM2 (and:SI (plus:SI (match_operand:SI 1 "register_operand" "b")
+  	                          (match_operand:SI 2 "register_operand" "r"))
+	                 (const_int -16)))
+        (match_operand:VM2 0 "register_operand" "v"))]
+  "TARGET_ALTIVEC && TARGET_32BIT"
+  "stvx %0,%1,%2"
+  [(set_attr "type" "vecstore")])
+
+(define_insn "altivec_stvx_<mode>_1op_si"
+  [(set (mem:VM2 (and:SI (match_operand:SI 1 "register_operand" "r")
+	                 (const_int -16)))
+        (match_operand:VM2 0 "register_operand" "v"))]
+  "TARGET_ALTIVEC && TARGET_32BIT"
+  "stvx %0,0,%1"
+  [(set_attr "type" "vecstore")])
+
 (define_expand "altivec_stvxl_<mode>"
   [(parallel
     [(set (match_operand:VM2 0 "memory_operand" "=Z")
Index: gcc/config/rs6000/rs6000-c.c
===================================================================
--- gcc/config/rs6000/rs6000-c.c	(revision 235090)
+++ gcc/config/rs6000/rs6000-c.c	(working copy)
@@ -4800,6 +4800,164 @@  assignment for unaligned loads and stores");
       return stmt;
     }
 
+  /* Expand vec_ld into an expression that masks the address and
+     performs the load.  We need to expand this early to allow
+     the best aliasing, as by the time we get into RTL we no longer
+     are able to honor __restrict__, for example.  We may want to
+     consider this for all memory access built-ins.
+
+     When -maltivec=be is specified, simply punt to existing
+     built-in processing.  */
+  if (fcode == ALTIVEC_BUILTIN_VEC_LD
+      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
+    {
+      tree arg0 = (*arglist)[0];
+      tree arg1 = (*arglist)[1];
+
+      /* Strip qualifiers like "const" from the pointer arg.  */
+      tree arg1_type = TREE_TYPE (arg1);
+      tree inner_type = TREE_TYPE (arg1_type);
+      if (TYPE_QUALS (TREE_TYPE (arg1_type)) != 0)
+	{
+	  arg1_type = build_pointer_type (build_qualified_type (inner_type,
+								0));
+	  arg1 = fold_convert (arg1_type, arg1);
+	}
+
+      /* Construct the masked address.  We have to jump through some hoops
+	 here.  If the first argument to a PLUS_EXPR is a pointer,
+	 build_binary_op will multiply the offset by the size of the
+	 inner type of the pointer (C semantics).  With vec_ld and vec_st,
+	 the offset must be left alone.  However, if we convert to a
+	 sizetype to do the arithmetic, we get a PLUS_EXPR instead of a
+	 POINTER_PLUS_EXPR, which interferes with aliasing (causing us,
+	 for example, to lose "restrict" information).  Thus where legal,
+	 we pre-adjust the offset knowing that a multiply by size is
+	 coming.  When the offset isn't a multiple of the size, we are
+	 forced to do the arithmetic in size_type for correctness, at the
+	 cost of losing aliasing information.  This, however, should be
+	 quite rare with these operations.  */
+      arg0 = fold (arg0);
+
+      /* Let existing error handling take over if we don't have a constant
+	 offset.  */
+      if (TREE_CODE (arg0) == INTEGER_CST)
+	{
+	  HOST_WIDE_INT off = TREE_INT_CST_LOW (arg0);
+	  HOST_WIDE_INT size = int_size_in_bytes (inner_type);
+	  tree addr;
+
+	  if (off % size == 0)
+	    {
+	      tree adjoff = build_int_cst (TREE_TYPE (arg0), off / size);
+	      addr = build_binary_op (loc, PLUS_EXPR, arg1, adjoff, 0);
+	      addr = build1 (NOP_EXPR, sizetype, addr);
+	    }
+	  else
+	    {
+	      tree hack_arg1 = build1 (NOP_EXPR, sizetype, arg1);
+	      addr = build_binary_op (loc, PLUS_EXPR, hack_arg1, arg0, 0);
+	    }
+	  tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
+					  build_int_cst (sizetype, -16), 0);
+
+	  /* Find the built-in to get the return type so we can convert
+	     the result properly (or fall back to default handling if the
+	     arguments aren't compatible).  */
+	  for (desc = altivec_overloaded_builtins;
+	       desc->code && desc->code != fcode; desc++)
+	    continue;
+
+	  for (; desc->code == fcode; desc++)
+	    if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
+		&& (rs6000_builtin_type_compatible (TREE_TYPE (arg1),
+						    desc->op2)))
+	      {
+		tree ret_type = rs6000_builtin_type (desc->ret_type);
+		if (TYPE_MODE (ret_type) == V2DImode)
+		  /* Type-based aliasing analysis thinks vector long
+		     and vector long long are different and will put them
+		     in distinct alias classes.  Force our return type
+		     to be a may-alias type to avoid this.  */
+		  ret_type
+		    = build_pointer_type_for_mode (ret_type, Pmode,
+						   true/*can_alias_all*/);
+		else
+		  ret_type = build_pointer_type (ret_type);
+		aligned = build1 (NOP_EXPR, ret_type, aligned);
+		tree ret_val = build_indirect_ref (loc, aligned, RO_NULL);
+		return ret_val;
+	      }
+	}
+    }
+
+  /* Similarly for stvx.  */
+  if (fcode == ALTIVEC_BUILTIN_VEC_ST
+      && (BYTES_BIG_ENDIAN || !VECTOR_ELT_ORDER_BIG))
+    {
+      tree arg0 = (*arglist)[0];
+      tree arg1 = (*arglist)[1];
+      tree arg2 = (*arglist)[2];
+
+      /* Construct the masked address.  See handling for ALTIVEC_BUILTIN_VEC_LD
+	 for an explanation of address arithmetic concerns.  */
+      arg1 = fold (arg1);
+
+      /* Let existing error handling take over if we don't have a constant
+	 offset.  */
+      if (TREE_CODE (arg1) == INTEGER_CST)
+	{
+	  HOST_WIDE_INT off = TREE_INT_CST_LOW (arg1);
+	  tree inner_type = TREE_TYPE (TREE_TYPE (arg2));
+	  HOST_WIDE_INT size = int_size_in_bytes (inner_type);
+	  tree addr;
+
+	  if (off % size == 0)
+	    {
+	      tree adjoff = build_int_cst (TREE_TYPE (arg1), off / size);
+	      addr = build_binary_op (loc, PLUS_EXPR, arg2, adjoff, 0);
+	      addr = build1 (NOP_EXPR, sizetype, addr);
+	    }
+	  else
+	    {
+	      tree hack_arg2 = build1 (NOP_EXPR, sizetype, arg2);
+	      addr = build_binary_op (loc, PLUS_EXPR, hack_arg2, arg1, 0);
+	    }
+	  tree aligned = build_binary_op (loc, BIT_AND_EXPR, addr,
+					  build_int_cst (sizetype, -16), 0);
+
+	  /* Find the built-in to make sure a compatible one exists; if not
+	     we fall back to default handling to get the error message.  */
+	  for (desc = altivec_overloaded_builtins;
+	       desc->code && desc->code != fcode; desc++)
+	    continue;
+
+	  for (; desc->code == fcode; desc++)
+	    if (rs6000_builtin_type_compatible (TREE_TYPE (arg0), desc->op1)
+		&& rs6000_builtin_type_compatible (TREE_TYPE (arg1), desc->op2)
+		&& rs6000_builtin_type_compatible (TREE_TYPE (arg2),
+						   desc->op3))
+	      {
+		tree arg0_type = TREE_TYPE (arg0);
+		if (TYPE_MODE (arg0_type) == V2DImode)
+		  /* Type-based aliasing analysis thinks vector long
+		     and vector long long are different and will put them
+		     in distinct alias classes.  Force our address type
+		     to be a may-alias type to avoid this.  */
+		  arg0_type
+		    = build_pointer_type_for_mode (arg0_type, Pmode,
+						   true/*can_alias_all*/);
+		else
+		  arg0_type = build_pointer_type (arg0_type);
+		aligned = build1 (NOP_EXPR, arg0_type, aligned);
+		tree stg = build_indirect_ref (loc, aligned, RO_NULL);
+		tree retval = build2 (MODIFY_EXPR, TREE_TYPE (stg), stg,
+				      convert (TREE_TYPE (stg), arg0));
+		return retval;
+	      }
+	}
+    }
+
   for (n = 0;
        !VOID_TYPE_P (TREE_VALUE (fnargs)) && n < nargs;
        fnargs = TREE_CHAIN (fnargs), n++)
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 235090)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -13025,9 +13025,9 @@  swap_selector_for_mode (machine_mode mode)
   return force_reg (V16QImode, gen_rtx_CONST_VECTOR (V16QImode, gen_rtvec_v (16, perm)));
 }
 
-/* Generate code for an "lvx", "lvxl", or "lve*x" built-in for a little endian target
-   with -maltivec=be specified.  Issue the load followed by an element-reversing
-   permute.  */
+/* Generate code for an "lvxl", or "lve*x" built-in for a little endian target
+   with -maltivec=be specified.  Issue the load followed by an element-
+   reversing permute.  */
 void
 altivec_expand_lvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
 {
@@ -13043,8 +13043,8 @@  altivec_expand_lvx_be (rtx op0, rtx op1, machine_m
   emit_insn (gen_rtx_SET (op0, vperm));
 }
 
-/* Generate code for a "stvx" or "stvxl" built-in for a little endian target
-   with -maltivec=be specified.  Issue the store preceded by an element-reversing
+/* Generate code for a "stvxl" built-in for a little endian target with
+   -maltivec=be specified.  Issue the store preceded by an element-reversing
    permute.  */
 void
 altivec_expand_stvx_be (rtx op0, rtx op1, machine_mode mode, unsigned unspec)
@@ -13106,22 +13106,65 @@  altivec_expand_lv_builtin (enum insn_code icode, t
 
   op1 = copy_to_mode_reg (mode1, op1);
 
-  if (op0 == const0_rtx)
+  /* For LVX, express the RTL accurately by ANDing the address with -16.
+     LVXL and LVE*X expand to use UNSPECs to hide their special behavior,
+     so the raw address is fine.  */
+  switch (icode)
     {
-      addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
-    }
-  else
-    {
-      op0 = copy_to_mode_reg (mode0, op0);
-      addr = gen_rtx_MEM (blk ? BLKmode : tmode, gen_rtx_PLUS (Pmode, op0, op1));
-    }
+    case CODE_FOR_altivec_lvx_v2df_2op:
+    case CODE_FOR_altivec_lvx_v2di_2op:
+    case CODE_FOR_altivec_lvx_v4sf_2op:
+    case CODE_FOR_altivec_lvx_v4si_2op:
+    case CODE_FOR_altivec_lvx_v8hi_2op:
+    case CODE_FOR_altivec_lvx_v16qi_2op:
+      {
+	rtx rawaddr;
+	if (op0 == const0_rtx)
+	  rawaddr = op1;
+	else
+	  {
+	    op0 = copy_to_mode_reg (mode0, op0);
+	    rawaddr = gen_rtx_PLUS (Pmode, op1, op0);
+	  }
+	addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
+	addr = gen_rtx_MEM (blk ? BLKmode : tmode, addr);
 
-  pat = GEN_FCN (icode) (target, addr);
+	/* For -maltivec=be, emit the load and follow it up with a
+	   permute to swap the elements.  */
+	if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
+	  {
+	    rtx temp = gen_reg_rtx (tmode);
+	    emit_insn (gen_rtx_SET (temp, addr));
 
-  if (! pat)
-    return 0;
-  emit_insn (pat);
+	    rtx sel = swap_selector_for_mode (tmode);
+	    rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, temp, temp, sel),
+					UNSPEC_VPERM);
+	    emit_insn (gen_rtx_SET (target, vperm));
+	  }
+	else
+	  emit_insn (gen_rtx_SET (target, addr));
 
+	break;
+      }
+
+    default:
+      if (op0 == const0_rtx)
+	addr = gen_rtx_MEM (blk ? BLKmode : tmode, op1);
+      else
+	{
+	  op0 = copy_to_mode_reg (mode0, op0);
+	  addr = gen_rtx_MEM (blk ? BLKmode : tmode,
+			      gen_rtx_PLUS (Pmode, op1, op0));
+	}
+
+      pat = GEN_FCN (icode) (target, addr);
+      if (! pat)
+	return 0;
+      emit_insn (pat);
+
+      break;
+    }
+  
   return target;
 }
 
@@ -13208,7 +13251,7 @@  altivec_expand_stv_builtin (enum insn_code icode,
   rtx op0 = expand_normal (arg0);
   rtx op1 = expand_normal (arg1);
   rtx op2 = expand_normal (arg2);
-  rtx pat, addr;
+  rtx pat, addr, rawaddr;
   machine_mode tmode = insn_data[icode].operand[0].mode;
   machine_mode smode = insn_data[icode].operand[1].mode;
   machine_mode mode1 = Pmode;
@@ -13220,24 +13263,69 @@  altivec_expand_stv_builtin (enum insn_code icode,
       || arg2 == error_mark_node)
     return const0_rtx;
 
-  if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
-    op0 = copy_to_mode_reg (smode, op0);
-
   op2 = copy_to_mode_reg (mode2, op2);
 
-  if (op1 == const0_rtx)
+  /* For STVX, express the RTL accurately by ANDing the address with -16.
+     STVXL and STVE*X expand to use UNSPECs to hide their special behavior,
+     so the raw address is fine.  */
+  switch (icode)
     {
-      addr = gen_rtx_MEM (tmode, op2);
-    }
-  else
-    {
-      op1 = copy_to_mode_reg (mode1, op1);
-      addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op1, op2));
-    }
+    case CODE_FOR_altivec_stvx_v2df_2op:
+    case CODE_FOR_altivec_stvx_v2di_2op:
+    case CODE_FOR_altivec_stvx_v4sf_2op:
+    case CODE_FOR_altivec_stvx_v4si_2op:
+    case CODE_FOR_altivec_stvx_v8hi_2op:
+    case CODE_FOR_altivec_stvx_v16qi_2op:
+      {
+	if (op1 == const0_rtx)
+	  rawaddr = op2;
+	else
+	  {
+	    op1 = copy_to_mode_reg (mode1, op1);
+	    rawaddr = gen_rtx_PLUS (Pmode, op2, op1);
+	  }
 
-  pat = GEN_FCN (icode) (addr, op0);
-  if (pat)
-    emit_insn (pat);
+	addr = gen_rtx_AND (Pmode, rawaddr, gen_rtx_CONST_INT (Pmode, -16));
+	addr = gen_rtx_MEM (tmode, addr);
+
+	op0 = copy_to_mode_reg (tmode, op0);
+
+	/* For -maltivec=be, emit a permute to swap the elements, followed
+	   by the store.  */
+	if (!BYTES_BIG_ENDIAN && VECTOR_ELT_ORDER_BIG)
+	  {
+	    rtx temp = gen_reg_rtx (tmode);
+	    rtx sel = swap_selector_for_mode (tmode);
+	    rtx vperm = gen_rtx_UNSPEC (tmode, gen_rtvec (3, op0, op0, sel),
+					UNSPEC_VPERM);
+	    emit_insn (gen_rtx_SET (temp, vperm));
+	    emit_insn (gen_rtx_SET (addr, temp));
+	  }
+	else
+	  emit_insn (gen_rtx_SET (addr, op0));
+
+	break;
+      }
+
+    default:
+      {
+	if (! (*insn_data[icode].operand[1].predicate) (op0, smode))
+	  op0 = copy_to_mode_reg (smode, op0);
+
+	if (op1 == const0_rtx)
+	  addr = gen_rtx_MEM (tmode, op2);
+	else
+	  {
+	    op1 = copy_to_mode_reg (mode1, op1);
+	    addr = gen_rtx_MEM (tmode, gen_rtx_PLUS (Pmode, op2, op1));
+	  }
+
+	pat = GEN_FCN (icode) (addr, op0);
+	if (pat)
+	  emit_insn (pat);
+      }
+    }      
+
   return NULL_RTX;
 }
 
@@ -14073,18 +14161,18 @@  altivec_expand_builtin (tree exp, rtx target, bool
   switch (fcode)
     {
     case ALTIVEC_BUILTIN_STVX_V2DF:
-      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df, exp);
+      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2df_2op, exp);
     case ALTIVEC_BUILTIN_STVX_V2DI:
-      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di, exp);
+      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v2di_2op, exp);
     case ALTIVEC_BUILTIN_STVX_V4SF:
-      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf, exp);
+      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4sf_2op, exp);
     case ALTIVEC_BUILTIN_STVX:
     case ALTIVEC_BUILTIN_STVX_V4SI:
-      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si, exp);
+      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v4si_2op, exp);
     case ALTIVEC_BUILTIN_STVX_V8HI:
-      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi, exp);
+      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v8hi_2op, exp);
     case ALTIVEC_BUILTIN_STVX_V16QI:
-      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi, exp);
+      return altivec_expand_stv_builtin (CODE_FOR_altivec_stvx_v16qi_2op, exp);
     case ALTIVEC_BUILTIN_STVEBX:
       return altivec_expand_stv_builtin (CODE_FOR_altivec_stvebx, exp);
     case ALTIVEC_BUILTIN_STVEHX:
@@ -14272,23 +14360,23 @@  altivec_expand_builtin (tree exp, rtx target, bool
       return altivec_expand_lv_builtin (CODE_FOR_altivec_lvxl_v16qi,
 					exp, target, false);
     case ALTIVEC_BUILTIN_LVX_V2DF:
-      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df,
+      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2df_2op,
 					exp, target, false);
     case ALTIVEC_BUILTIN_LVX_V2DI:
-      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di,
+      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v2di_2op,
 					exp, target, false);
     case ALTIVEC_BUILTIN_LVX_V4SF:
-      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf,
+      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4sf_2op,
 					exp, target, false);
     case ALTIVEC_BUILTIN_LVX:
     case ALTIVEC_BUILTIN_LVX_V4SI:
-      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si,
+      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v4si_2op,
 					exp, target, false);
     case ALTIVEC_BUILTIN_LVX_V8HI:
-      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi,
+      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v8hi_2op,
 					exp, target, false);
     case ALTIVEC_BUILTIN_LVX_V16QI:
-      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi,
+      return altivec_expand_lv_builtin (CODE_FOR_altivec_lvx_v16qi_2op,
 					exp, target, false);
     case ALTIVEC_BUILTIN_LVLX:
       return altivec_expand_lv_builtin (CODE_FOR_altivec_lvlx,
@@ -37139,7 +37227,9 @@  insn_is_swappable_p (swap_web_entry *insn_entry, r
      fix them up by converting them to permuting ones.  Exceptions:
      UNSPEC_LVE, UNSPEC_LVX, and UNSPEC_STVX, which have a PARALLEL
      body instead of a SET; and UNSPEC_STVE, which has an UNSPEC
-     for the SET source.  */
+     for the SET source.  Also we must now make an exception for lvx
+     and stvx when they are not in the UNSPEC_LVX/STVX form (with the
+     explicit "& -16") since this leads to unrecognizable insns.  */
   rtx body = PATTERN (insn);
   int i = INSN_UID (insn);
 
@@ -37147,6 +37237,11 @@  insn_is_swappable_p (swap_web_entry *insn_entry, r
     {
       if (GET_CODE (body) == SET)
 	{
+	  rtx rhs = SET_SRC (body);
+	  gcc_assert (GET_CODE (rhs) == MEM);
+	  if (GET_CODE (XEXP (rhs, 0)) == AND)
+	    return 0;
+
 	  *special = SH_NOSWAP_LD;
 	  return 1;
 	}
@@ -37156,8 +37251,14 @@  insn_is_swappable_p (swap_web_entry *insn_entry, r
 
   if (insn_entry[i].is_store)
     {
-      if (GET_CODE (body) == SET && GET_CODE (SET_SRC (body)) != UNSPEC)
+      if (GET_CODE (body) == SET
+	  && GET_CODE (SET_SRC (body)) != UNSPEC)
 	{
+	  rtx lhs = SET_DEST (body);
+	  gcc_assert (GET_CODE (lhs) == MEM);
+	  if (GET_CODE (XEXP (lhs, 0)) == AND)
+	    return 0;
+	  
 	  *special = SH_NOSWAP_ST;
 	  return 1;
 	}
@@ -37827,6 +37928,267 @@  dump_swap_insn_table (swap_web_entry *insn_entry)
   fputs ("\n", dump_file);
 }
 
+/* Return RTX with its address canonicalized to (reg) or (+ reg reg).
+   Here RTX is an (& addr (const_int -16)).  Always return a new copy
+   to avoid problems with combine.  */
+static rtx
+alignment_with_canonical_addr (rtx align)
+{
+  rtx canon;
+  rtx addr = XEXP (align, 0);
+
+  if (REG_P (addr))
+    canon = addr;
+
+  else if (GET_CODE (addr) == PLUS)
+    {
+      rtx addrop0 = XEXP (addr, 0);
+      rtx addrop1 = XEXP (addr, 1);
+
+      if (!REG_P (addrop0))
+	addrop0 = force_reg (GET_MODE (addrop0), addrop0);
+
+      if (!REG_P (addrop1))
+	addrop1 = force_reg (GET_MODE (addrop1), addrop1);
+
+      canon = gen_rtx_PLUS (GET_MODE (addr), addrop0, addrop1);
+    }
+
+  else
+    canon = force_reg (GET_MODE (addr), addr);
+
+  return gen_rtx_AND (GET_MODE (align), canon, GEN_INT (-16));
+}
+
+/* Check whether an rtx is an alignment mask, and if so, return 
+   a fully-expanded rtx for the masking operation.  */
+static rtx
+alignment_mask (rtx_insn *insn)
+{
+  rtx body = PATTERN (insn);
+
+  if (GET_CODE (body) != SET
+      || GET_CODE (SET_SRC (body)) != AND
+      || !REG_P (XEXP (SET_SRC (body), 0)))
+    return 0;
+
+  rtx mask = XEXP (SET_SRC (body), 1);
+
+  if (GET_CODE (mask) == CONST_INT)
+    {
+      if (INTVAL (mask) == -16)
+	return alignment_with_canonical_addr (SET_SRC (body));
+      else
+	return 0;
+    }
+
+  if (!REG_P (mask))
+    return 0;
+
+  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  df_ref use;
+  rtx real_mask = 0;
+
+  FOR_EACH_INSN_INFO_USE (use, insn_info)
+    {
+      if (!rtx_equal_p (DF_REF_REG (use), mask))
+	continue;
+
+      struct df_link *def_link = DF_REF_CHAIN (use);
+      if (!def_link || def_link->next)
+	return 0;
+
+      rtx_insn *const_insn = DF_REF_INSN (def_link->ref);
+      rtx const_body = PATTERN (const_insn);
+      if (GET_CODE (const_body) != SET)
+	return 0;
+
+      real_mask = SET_SRC (const_body);
+
+      if (GET_CODE (real_mask) != CONST_INT
+	  || INTVAL (real_mask) != -16)
+	return 0;
+    }
+
+  if (real_mask == 0)
+    return 0;
+
+  return alignment_with_canonical_addr (SET_SRC (body));
+}
+
+/* Given INSN that's a load or store based at BASE_REG, look for a
+   feeding computation that aligns its address on a 16-byte boundary.  */
+static rtx
+find_alignment_op (rtx_insn *insn, rtx base_reg)
+{
+  df_ref base_use;
+  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  rtx and_operation = 0;
+
+  FOR_EACH_INSN_INFO_USE (base_use, insn_info)
+    {
+      if (!rtx_equal_p (DF_REF_REG (base_use), base_reg))
+	continue;
+
+      struct df_link *base_def_link = DF_REF_CHAIN (base_use);
+      if (!base_def_link || base_def_link->next)
+	break;
+
+      rtx_insn *and_insn = DF_REF_INSN (base_def_link->ref);
+      and_operation = alignment_mask (and_insn);
+      if (and_operation != 0)
+	break;
+    }
+
+  return and_operation;
+}
+
+struct del_info { bool replace; rtx_insn *replace_insn; };
+
+/* If INSN is the load for an lvx pattern, put it in canonical form.  */
+static void
+combine_lvx_pattern (rtx_insn *insn, del_info *to_delete)
+{
+  rtx body = PATTERN (insn);
+  gcc_assert (GET_CODE (body) == SET
+	      && GET_CODE (SET_SRC (body)) == VEC_SELECT
+	      && GET_CODE (XEXP (SET_SRC (body), 0)) == MEM);
+
+  rtx mem = XEXP (SET_SRC (body), 0);
+  rtx base_reg = XEXP (mem, 0);
+
+  rtx and_operation = find_alignment_op (insn, base_reg);
+
+  if (and_operation != 0)
+    {
+      df_ref def;
+      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+      FOR_EACH_INSN_INFO_DEF (def, insn_info)
+	{
+	  struct df_link *link = DF_REF_CHAIN (def);
+	  if (!link || link->next)
+	    break;
+
+	  rtx_insn *swap_insn = DF_REF_INSN (link->ref);
+	  if (!insn_is_swap_p (swap_insn)
+	      || insn_is_load_p (swap_insn)
+	      || insn_is_store_p (swap_insn))
+	    break;
+
+	  /* Expected lvx pattern found.  Change the swap to
+	     a copy, and propagate the AND operation into the
+	     load.  */
+	  to_delete[INSN_UID (swap_insn)].replace = true;
+	  to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
+
+	  XEXP (mem, 0) = and_operation;
+	  SET_SRC (body) = mem;
+	  INSN_CODE (insn) = -1; /* Force re-recognition.  */
+	  df_insn_rescan (insn);
+		  
+	  if (dump_file)
+	    fprintf (dump_file, "lvx opportunity found at %d\n",
+		     INSN_UID (insn));
+	}
+    }
+}
+
+/* If INSN is the store for an stvx pattern, put it in canonical form.  */
+static void
+combine_stvx_pattern (rtx_insn *insn, del_info *to_delete)
+{
+  rtx body = PATTERN (insn);
+  gcc_assert (GET_CODE (body) == SET
+	      && GET_CODE (SET_DEST (body)) == MEM
+	      && GET_CODE (SET_SRC (body)) == VEC_SELECT);
+  rtx mem = SET_DEST (body);
+  rtx base_reg = XEXP (mem, 0);
+
+  rtx and_operation = find_alignment_op (insn, base_reg);
+
+  if (and_operation != 0)
+    {
+      rtx src_reg = XEXP (SET_SRC (body), 0);
+      df_ref src_use;
+      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+      FOR_EACH_INSN_INFO_USE (src_use, insn_info)
+	{
+	  if (!rtx_equal_p (DF_REF_REG (src_use), src_reg))
+	    continue;
+
+	  struct df_link *link = DF_REF_CHAIN (src_use);
+	  if (!link || link->next)
+	    break;
+
+	  rtx_insn *swap_insn = DF_REF_INSN (link->ref);
+	  if (!insn_is_swap_p (swap_insn)
+	      || insn_is_load_p (swap_insn)
+	      || insn_is_store_p (swap_insn))
+	    break;
+
+	  /* Expected stvx pattern found.  Change the swap to
+	     a copy, and propagate the AND operation into the
+	     store.  */
+	  to_delete[INSN_UID (swap_insn)].replace = true;
+	  to_delete[INSN_UID (swap_insn)].replace_insn = swap_insn;
+
+	  XEXP (mem, 0) = and_operation;
+	  SET_SRC (body) = src_reg;
+	  INSN_CODE (insn) = -1; /* Force re-recognition.  */
+	  df_insn_rescan (insn);
+		  
+	  if (dump_file)
+	    fprintf (dump_file, "stvx opportunity found at %d\n",
+		     INSN_UID (insn));
+	}
+    }
+}
+
+/* Look for patterns created from builtin lvx and stvx calls, and
+   canonicalize them to be properly recognized as such.  */
+static void
+combine_lvx_stvx_patterns (function *fun)
+{
+  int i;
+  basic_block bb;
+  rtx_insn *insn;
+
+  int num_insns = get_max_uid ();
+  del_info *to_delete = XCNEWVEC (del_info, num_insns);
+
+  FOR_ALL_BB_FN (bb, fun)
+    FOR_BB_INSNS (bb, insn)
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      if (insn_is_load_p (insn) && insn_is_swap_p (insn))
+	combine_lvx_pattern (insn, to_delete);
+      else if (insn_is_store_p (insn) && insn_is_swap_p (insn))
+	combine_stvx_pattern (insn, to_delete);
+    }
+
+  /* Turning swaps into copies is delayed until now, to avoid problems
+     with deleting instructions during the insn walk.  */
+  for (i = 0; i < num_insns; i++)
+    if (to_delete[i].replace)
+      {
+	rtx swap_body = PATTERN (to_delete[i].replace_insn);
+	rtx src_reg = XEXP (SET_SRC (swap_body), 0);
+	rtx copy = gen_rtx_SET (SET_DEST (swap_body), src_reg);
+	rtx_insn *new_insn = emit_insn_before (copy,
+					       to_delete[i].replace_insn);
+	set_block_for_insn (new_insn,
+			    BLOCK_FOR_INSN (to_delete[i].replace_insn));
+	df_insn_rescan (new_insn);
+	df_insn_delete (to_delete[i].replace_insn);
+	remove_insn (to_delete[i].replace_insn);
+	to_delete[i].replace_insn->set_deleted ();
+      }
+  
+  free (to_delete);
+}
+
 /* Main entry point for this pass.  */
 unsigned int
 rs6000_analyze_swaps (function *fun)
@@ -37833,7 +38195,7 @@  rs6000_analyze_swaps (function *fun)
 {
   swap_web_entry *insn_entry;
   basic_block bb;
-  rtx_insn *insn;
+  rtx_insn *insn, *curr_insn = 0;
 
   /* Dataflow analysis for use-def chains.  */
   df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
@@ -37841,12 +38203,15 @@  rs6000_analyze_swaps (function *fun)
   df_analyze ();
   df_set_flags (DF_DEFER_INSN_RESCAN);
 
+  /* Pre-pass to combine lvx and stvx patterns so we don't lose info.  */
+  combine_lvx_stvx_patterns (fun);
+
   /* Allocate structure to represent webs of insns.  */
   insn_entry = XCNEWVEC (swap_web_entry, get_max_uid ());
 
   /* Walk the insns to gather basic data.  */
   FOR_ALL_BB_FN (bb, fun)
-    FOR_BB_INSNS (bb, insn)
+    FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
     {
       unsigned int uid = INSN_UID (insn);
       if (NONDEBUG_INSN_P (insn))
Index: gcc/config/rs6000/vector.md
===================================================================
--- gcc/config/rs6000/vector.md	(revision 235090)
+++ gcc/config/rs6000/vector.md	(working copy)
@@ -167,7 +167,14 @@ 
   if (VECTOR_MEM_VSX_P (<MODE>mode))
     {
       operands[1] = rs6000_address_for_altivec (operands[1]);
-      emit_insn (gen_altivec_lvx_<mode> (operands[0], operands[1]));
+      rtx and_op = XEXP (operands[1], 0);
+      gcc_assert (GET_CODE (and_op) == AND);
+      rtx addr = XEXP (and_op, 0);
+      if (GET_CODE (addr) == PLUS)
+        emit_insn (gen_altivec_lvx_<mode>_2op (operands[0], XEXP (addr, 0),
+	                                       XEXP (addr, 1)));
+      else
+        emit_insn (gen_altivec_lvx_<mode>_1op (operands[0], operands[1]));
       DONE;
     }
 }")
@@ -183,7 +190,14 @@ 
   if (VECTOR_MEM_VSX_P (<MODE>mode))
     {
       operands[0] = rs6000_address_for_altivec (operands[0]);
-      emit_insn (gen_altivec_stvx_<mode> (operands[0], operands[1]));
+      rtx and_op = XEXP (operands[0], 0);
+      gcc_assert (GET_CODE (and_op) == AND);
+      rtx addr = XEXP (and_op, 0);
+      if (GET_CODE (addr) == PLUS)
+        emit_insn (gen_altivec_stvx_<mode>_2op (operands[1], XEXP (addr, 0),
+	                                        XEXP (addr, 1)));
+      else
+        emit_insn (gen_altivec_stvx_<mode>_1op (operands[1], operands[0]));
       DONE;
     }
 }")