diff mbox

, PR target/80510, Optimize offsettable memory references on power7/power8

Message ID 20170512213350.GA985@ibm-tiger.the-meissners.org
State New
Headers show

Commit Message

Michael Meissner May 12, 2017, 9:33 p.m. UTC
When I was fixing PR target/68163, I noticed there was a wider problem, and
opened up PR 80510.

The problem is if the DImode, DFmode, and SFmode are allowed in Altivec
registers before ISA 3.0, and the compiler wants to do an offsettable store.
The compiler generates a move from an Altivec register to a traditional
floating point register, and then the compiler generates the STFD or STFS
instruction.

This code adds peephole2's that notices there is a move from an altivec
regsiter to fpr register and store, it changes this load the offset into a GPR,
and do the indexed store from the Altivec register.  I also added code to do
the reverse (notice if there is a load to a FPR register and copy it to an
Altivec register) and use an indexed load.

I ran the Spec 2006 floating point suite with this patch, and the LBM benchmark
shows a nearly 3% gain with this patch, and there were no significant
regressions.  I also bootstrapped the compiler and insured there were no
regressions.  Can I check this into GCC 8?  How about GCC 7?

Note, using peepholes are a quick way to fix the particular problem.  However,
it would be nice long term to arrange things so the back end can tell the
register allocator to load up the offset into a register, instead of doing the
move/store.  I tried various modifications to secondary reload, but I wasn't
able to get it to change behavor.

[gcc]
2017-05-12  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/80510
	* config/rs6000/predicates.md (simple_offsettable_mem_operand):
	New predicate.

	* config/rs6000/rs6000.md (ALTIVEC_DFORM): New iterator.
	(define_peephole2 for Altivec d-form load): Add peepholes to catch
	cases where the register allocator uses a move and an offsettable
	memory operation to/from a FPR register on ISA 2.06/2.07.
	(define_peephole2 for Altivec d-form store): Likewise.

[gcc/testsuite]
2017-05-12  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/80510
	* gcc.target/powerpc/pr80510-1.c: New test.
	* gcc.target/powerpc/pr80510-2.c: Likewise.

Comments

Segher Boessenkool May 15, 2017, 9:26 p.m. UTC | #1
Hi,

On Fri, May 12, 2017 at 05:33:50PM -0400, Michael Meissner wrote:
> The problem is if the DImode, DFmode, and SFmode are allowed in Altivec
> registers before ISA 3.0, and the compiler wants to do an offsettable store.
> The compiler generates a move from an Altivec register to a traditional
> floating point register, and then the compiler generates the STFD or STFS
> instruction.
> 
> This code adds peephole2's that notices there is a move from an altivec
> regsiter to fpr register and store, it changes this load the offset into a GPR,
> and do the indexed store from the Altivec register.  I also added code to do
> the reverse (notice if there is a load to a FPR register and copy it to an
> Altivec register) and use an indexed load.

Ok.

> I ran the Spec 2006 floating point suite with this patch, and the LBM benchmark
> shows a nearly 3% gain with this patch, and there were no significant
> regressions.

Nice :-)

> Note, using peepholes are a quick way to fix the particular problem.  However,
> it would be nice long term to arrange things so the back end can tell the
> register allocator to load up the offset into a register, instead of doing the
> move/store.  I tried various modifications to secondary reload, but I wasn't
> able to get it to change behavor.

These peepholes are simple and look perfectly safe.  It would of course
be great if we wouldn't need them, if LRA was a bit smarter.

> +;; Optimize cases where we want to do a D-form load (register+offset) on
> +;; ISA 2.06/2.07 to an Altivec register, and the register allocator
> +;; has generated:
> +;;	load fpr
> +;;	move fpr->altivec

Maybe show here the actual machine instructions before and after the
peephole has been applied?

> +/* { dg-final { scan-assembler     {\xsadddp\M} } } */
> +/* { dg-final { scan-assembler     {\stxsdx\M}  } } */
> +/* { dg-final { scan-assembler-not {\mmfvsrd\M} } } */

You forgot the "m" (in "\m") in the first two of these.  (I wonder
how this worked, esp. the "\s" one?)

> +/* { dg-final { scan-assembler     {\xsaddsp\M}  } } */
> +/* { dg-final { scan-assembler     {\stxsspx\M}  } } */
> +/* { dg-final { scan-assembler-not {\mmfvsrd\M}  } } */
> +/* { dg-final { scan-assembler-not {\mmfvsrwz\M} } } */

And again.

Okay for trunk with that fixed.  Also okay for 7 (after a delay).
Thanks!


Segher
diff mbox

Patch

Index: gcc/config/rs6000/rs6000.md
===================================================================
--- gcc/config/rs6000/rs6000.md	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/branches/ibm/meissner-gcc8/gcc/config/rs6000)	(revision 247282)
+++ gcc/config/rs6000/rs6000.md	(.../gcc/config/rs6000)	(working copy)
@@ -701,6 +701,11 @@  (define_code_attr     minmax	[(smin "min
 (define_code_attr     SMINMAX	[(smin "SMIN")
 				 (smax "SMAX")])
 
+;; Iterator to optimize the following cases:
+;;	D-form load to FPR register & move to Altivec register
+;;	Move Altivec register to FPR register and store
+(define_mode_iterator ALTIVEC_DFORM [DI DF SF])
+
 
 ;; Start with fixed-point load and store insns.  Here we put only the more
 ;; complex forms.  Basic data transfer is done later.
@@ -14067,6 +14072,74 @@  (define_insn "*fusion_p9_<mode>_constant
    (set_attr "length" "8")])
 
 
+;; Optimize cases where we want to do a D-form load (register+offset) on
+;; ISA 2.06/2.07 to an Altivec register, and the register allocator
+;; has generated:
+;;	load fpr
+;;	move fpr->altivec
+
+(define_peephole2
+  [(match_scratch:DI 0 "b")
+   (set (match_operand:ALTIVEC_DFORM 1 "fpr_reg_operand")
+	(match_operand:ALTIVEC_DFORM 2 "simple_offsettable_mem_operand"))
+   (set (match_operand:ALTIVEC_DFORM 3 "altivec_register_operand")
+	(match_dup 1))]
+  "TARGET_VSX && TARGET_POWERPC64 && TARGET_UPPER_REGS_<MODE>
+   && !TARGET_P9_DFORM_SCALAR && peep2_reg_dead_p (2, operands[1])"
+  [(set (match_dup 0)
+	(match_dup 4))
+   (set (match_dup 3)
+	(match_dup 5))]
+{
+  rtx tmp_reg = operands[0];
+  rtx mem = operands[2];
+  rtx addr = XEXP (mem, 0);
+  rtx add_op0, add_op1, new_addr;
+
+  gcc_assert (GET_CODE (addr) == PLUS || GET_CODE (addr) == LO_SUM);
+  add_op0 = XEXP (addr, 0);
+  add_op1 = XEXP (addr, 1);
+  gcc_assert (REG_P (add_op0));
+  new_addr = gen_rtx_PLUS (DImode, add_op0, tmp_reg);
+
+  operands[4] = add_op1;
+  operands[5] = change_address (mem, <MODE>mode, new_addr);
+})
+
+;; Optimize cases were want to do a D-form store on ISA 2.06/2.07 from an
+;; Altivec register, and the register allocator has generated:
+;;	move altivec->fpr
+;;	store fpr
+
+(define_peephole2
+  [(match_scratch:DI 0 "b")
+   (set (match_operand:ALTIVEC_DFORM 1 "fpr_reg_operand")
+	(match_operand:ALTIVEC_DFORM 2 "altivec_register_operand"))
+   (set (match_operand:ALTIVEC_DFORM 3 "simple_offsettable_mem_operand")
+	(match_dup 1))]
+  "TARGET_VSX && TARGET_POWERPC64 && TARGET_UPPER_REGS_<MODE>
+   && !TARGET_P9_DFORM_SCALAR && peep2_reg_dead_p (2, operands[1])"
+  [(set (match_dup 0)
+	(match_dup 4))
+   (set (match_dup 5)
+	(match_dup 2))]
+{
+  rtx tmp_reg = operands[0];
+  rtx mem = operands[3];
+  rtx addr = XEXP (mem, 0);
+  rtx add_op0, add_op1, new_addr;
+
+  gcc_assert (GET_CODE (addr) == PLUS || GET_CODE (addr) == LO_SUM);
+  add_op0 = XEXP (addr, 0);
+  add_op1 = XEXP (addr, 1);
+  gcc_assert (REG_P (add_op0));
+  new_addr = gen_rtx_PLUS (DImode, add_op0, tmp_reg);
+
+  operands[4] = add_op1;
+  operands[5] = change_address (mem, <MODE>mode, new_addr);
+})
+   
+
 ;; Miscellaneous ISA 2.06 (power7) instructions
 (define_insn "addg6s"
   [(set (match_operand:SI 0 "register_operand" "=r")
Index: gcc/config/rs6000/predicates.md
===================================================================
--- gcc/config/rs6000/predicates.md	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/branches/ibm/meissner-gcc8/gcc/config/rs6000)	(revision 247282)
+++ gcc/config/rs6000/predicates.md	(.../gcc/config/rs6000)	(working copy)
@@ -847,6 +847,22 @@  (define_predicate "offsettable_mem_opera
   (and (match_operand 0 "memory_operand")
        (match_test "offsettable_nonstrict_memref_p (op)")))
 
+;; Return 1 if the operand is a simple offsettable memory operand
+;; that does not include pre-increment, post-increment, etc.
+(define_predicate "simple_offsettable_mem_operand"
+  (match_operand 0 "offsettable_mem_operand")
+{
+  rtx addr = XEXP (op, 0);
+
+  if (GET_CODE (addr) != PLUS && GET_CODE (addr) != LO_SUM)
+    return 0;
+
+  if (!CONSTANT_P (XEXP (addr, 1)))
+    return 0;
+
+  return base_reg_operand (XEXP (addr, 0), Pmode);
+})
+
 ;; Return 1 if the operand is suitable for load/store quad memory.
 ;; This predicate only checks for non-atomic loads/stores (not lqarx/stqcx).
 (define_predicate "quad_memory_operand"
Index: gcc/testsuite/gcc.target/powerpc/pr80510-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr80510-1.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/branches/ibm/meissner-gcc8/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80510-1.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 247113)
@@ -0,0 +1,211 @@ 
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power7" } } */
+/* { dg-options "-mcpu=power7 -O2" } */
+
+/* Make sure that STXSDX is generated for double scalars in Altivec registers
+   on power7 instead of moving the value to a FPR register and doing a X-FORM
+   store.  */
+
+#ifndef TYPE
+#define TYPE double
+#endif
+
+#ifndef TYPE_IN
+#define TYPE_IN TYPE
+#endif
+
+#ifndef TYPE_OUT
+#define TYPE_OUT TYPE
+#endif
+
+#ifndef ITYPE
+#define ITYPE long
+#endif
+
+#ifdef DO_CALL
+extern ITYPE get_bits (ITYPE);
+
+#else
+#define get_bits(X) (X)
+#endif
+
+void test (ITYPE *bits, ITYPE n, TYPE one, TYPE_IN *p, TYPE_OUT *q)
+{
+  TYPE x_00 = p[ 0];
+  TYPE x_01 = p[ 1];
+  TYPE x_02 = p[ 2];
+  TYPE x_03 = p[ 3];
+  TYPE x_04 = p[ 4];
+  TYPE x_05 = p[ 5];
+  TYPE x_06 = p[ 6];
+  TYPE x_07 = p[ 7];
+  TYPE x_08 = p[ 8];
+  TYPE x_09 = p[ 9];
+
+  TYPE x_10 = p[10];
+  TYPE x_11 = p[11];
+  TYPE x_12 = p[12];
+  TYPE x_13 = p[13];
+  TYPE x_14 = p[14];
+  TYPE x_15 = p[15];
+  TYPE x_16 = p[16];
+  TYPE x_17 = p[17];
+  TYPE x_18 = p[18];
+  TYPE x_19 = p[19];
+
+  TYPE x_20 = p[20];
+  TYPE x_21 = p[21];
+  TYPE x_22 = p[22];
+  TYPE x_23 = p[23];
+  TYPE x_24 = p[24];
+  TYPE x_25 = p[25];
+  TYPE x_26 = p[26];
+  TYPE x_27 = p[27];
+  TYPE x_28 = p[28];
+  TYPE x_29 = p[29];
+
+  TYPE x_30 = p[30];
+  TYPE x_31 = p[31];
+  TYPE x_32 = p[32];
+  TYPE x_33 = p[33];
+  TYPE x_34 = p[34];
+  TYPE x_35 = p[35];
+  TYPE x_36 = p[36];
+  TYPE x_37 = p[37];
+  TYPE x_38 = p[38];
+  TYPE x_39 = p[39];
+
+  TYPE x_40 = p[40];
+  TYPE x_41 = p[41];
+  TYPE x_42 = p[42];
+  TYPE x_43 = p[43];
+  TYPE x_44 = p[44];
+  TYPE x_45 = p[45];
+  TYPE x_46 = p[46];
+  TYPE x_47 = p[47];
+  TYPE x_48 = p[48];
+  TYPE x_49 = p[49];
+
+  ITYPE i;
+
+  for (i = 0; i < n; i++)
+    {
+      ITYPE bit = get_bits (bits[i]);
+
+      if ((bit & ((ITYPE)1) << 	0) != 0) x_00 += one;
+      if ((bit & ((ITYPE)1) << 	1) != 0) x_01 += one;
+      if ((bit & ((ITYPE)1) << 	2) != 0) x_02 += one;
+      if ((bit & ((ITYPE)1) << 	3) != 0) x_03 += one;
+      if ((bit & ((ITYPE)1) << 	4) != 0) x_04 += one;
+      if ((bit & ((ITYPE)1) << 	5) != 0) x_05 += one;
+      if ((bit & ((ITYPE)1) << 	6) != 0) x_06 += one;
+      if ((bit & ((ITYPE)1) << 	7) != 0) x_07 += one;
+      if ((bit & ((ITYPE)1) << 	8) != 0) x_08 += one;
+      if ((bit & ((ITYPE)1) << 	9) != 0) x_09 += one;
+
+      if ((bit & ((ITYPE)1) << 10) != 0) x_10 += one;
+      if ((bit & ((ITYPE)1) << 11) != 0) x_11 += one;
+      if ((bit & ((ITYPE)1) << 12) != 0) x_12 += one;
+      if ((bit & ((ITYPE)1) << 13) != 0) x_13 += one;
+      if ((bit & ((ITYPE)1) << 14) != 0) x_14 += one;
+      if ((bit & ((ITYPE)1) << 15) != 0) x_15 += one;
+      if ((bit & ((ITYPE)1) << 16) != 0) x_16 += one;
+      if ((bit & ((ITYPE)1) << 17) != 0) x_17 += one;
+      if ((bit & ((ITYPE)1) << 18) != 0) x_18 += one;
+      if ((bit & ((ITYPE)1) << 19) != 0) x_19 += one;
+
+      if ((bit & ((ITYPE)1) << 20) != 0) x_20 += one;
+      if ((bit & ((ITYPE)1) << 21) != 0) x_21 += one;
+      if ((bit & ((ITYPE)1) << 22) != 0) x_22 += one;
+      if ((bit & ((ITYPE)1) << 23) != 0) x_23 += one;
+      if ((bit & ((ITYPE)1) << 24) != 0) x_24 += one;
+      if ((bit & ((ITYPE)1) << 25) != 0) x_25 += one;
+      if ((bit & ((ITYPE)1) << 26) != 0) x_26 += one;
+      if ((bit & ((ITYPE)1) << 27) != 0) x_27 += one;
+      if ((bit & ((ITYPE)1) << 28) != 0) x_28 += one;
+      if ((bit & ((ITYPE)1) << 29) != 0) x_29 += one;
+
+      if ((bit & ((ITYPE)1) << 30) != 0) x_30 += one;
+      if ((bit & ((ITYPE)1) << 31) != 0) x_31 += one;
+      if ((bit & ((ITYPE)1) << 32) != 0) x_32 += one;
+      if ((bit & ((ITYPE)1) << 33) != 0) x_33 += one;
+      if ((bit & ((ITYPE)1) << 34) != 0) x_34 += one;
+      if ((bit & ((ITYPE)1) << 35) != 0) x_35 += one;
+      if ((bit & ((ITYPE)1) << 36) != 0) x_36 += one;
+      if ((bit & ((ITYPE)1) << 37) != 0) x_37 += one;
+      if ((bit & ((ITYPE)1) << 38) != 0) x_38 += one;
+      if ((bit & ((ITYPE)1) << 39) != 0) x_39 += one;
+
+      if ((bit & ((ITYPE)1) << 40) != 0) x_40 += one;
+      if ((bit & ((ITYPE)1) << 41) != 0) x_41 += one;
+      if ((bit & ((ITYPE)1) << 42) != 0) x_42 += one;
+      if ((bit & ((ITYPE)1) << 43) != 0) x_43 += one;
+      if ((bit & ((ITYPE)1) << 44) != 0) x_44 += one;
+      if ((bit & ((ITYPE)1) << 45) != 0) x_45 += one;
+      if ((bit & ((ITYPE)1) << 46) != 0) x_46 += one;
+      if ((bit & ((ITYPE)1) << 47) != 0) x_47 += one;
+      if ((bit & ((ITYPE)1) << 48) != 0) x_48 += one;
+      if ((bit & ((ITYPE)1) << 49) != 0) x_49 += one;
+    }
+
+  q[ 0] = x_00;
+  q[ 1] = x_01;
+  q[ 2] = x_02;
+  q[ 3] = x_03;
+  q[ 4] = x_04;
+  q[ 5] = x_05;
+  q[ 6] = x_06;
+  q[ 7] = x_07;
+  q[ 8] = x_08;
+  q[ 9] = x_09;
+
+  q[10] = x_10;
+  q[11] = x_11;
+  q[12] = x_12;
+  q[13] = x_13;
+  q[14] = x_14;
+  q[15] = x_15;
+  q[16] = x_16;
+  q[17] = x_17;
+  q[18] = x_18;
+  q[19] = x_19;
+
+  q[20] = x_20;
+  q[21] = x_21;
+  q[22] = x_22;
+  q[23] = x_23;
+  q[24] = x_24;
+  q[25] = x_25;
+  q[26] = x_26;
+  q[27] = x_27;
+  q[28] = x_28;
+  q[29] = x_29;
+
+  q[30] = x_30;
+  q[31] = x_31;
+  q[32] = x_32;
+  q[33] = x_33;
+  q[34] = x_34;
+  q[35] = x_35;
+  q[36] = x_36;
+  q[37] = x_37;
+  q[38] = x_38;
+  q[39] = x_39;
+
+  q[40] = x_40;
+  q[41] = x_41;
+  q[42] = x_42;
+  q[43] = x_43;
+  q[44] = x_44;
+  q[45] = x_45;
+  q[46] = x_46;
+  q[47] = x_47;
+  q[48] = x_48;
+  q[49] = x_49;
+}
+
+/* { dg-final { scan-assembler     {\xsadddp\M} } } */
+/* { dg-final { scan-assembler     {\stxsdx\M}  } } */
+/* { dg-final { scan-assembler-not {\mmfvsrd\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/pr80510-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr80510-2.c	(.../svn+ssh://meissner@gcc.gnu.org/svn/gcc/branches/ibm/meissner-gcc8/gcc/testsuite/gcc.target/powerpc)	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr80510-2.c	(.../gcc/testsuite/gcc.target/powerpc)	(revision 247113)
@@ -0,0 +1,212 @@ 
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power8" } } */
+/* { dg-options "-mcpu=power8 -O2" } */
+
+/* Make sure that STXSSPX is generated for float scalars in Altivec registers
+   on power7 instead of moving the value to a FPR register and doing a X-FORM
+   store.  */
+
+#ifndef TYPE
+#define TYPE float
+#endif
+
+#ifndef TYPE_IN
+#define TYPE_IN TYPE
+#endif
+
+#ifndef TYPE_OUT
+#define TYPE_OUT TYPE
+#endif
+
+#ifndef ITYPE
+#define ITYPE long
+#endif
+
+#ifdef DO_CALL
+extern ITYPE get_bits (ITYPE);
+
+#else
+#define get_bits(X) (X)
+#endif
+
+void test (ITYPE *bits, ITYPE n, TYPE one, TYPE_IN *p, TYPE_OUT *q)
+{
+  TYPE x_00 = p[ 0];
+  TYPE x_01 = p[ 1];
+  TYPE x_02 = p[ 2];
+  TYPE x_03 = p[ 3];
+  TYPE x_04 = p[ 4];
+  TYPE x_05 = p[ 5];
+  TYPE x_06 = p[ 6];
+  TYPE x_07 = p[ 7];
+  TYPE x_08 = p[ 8];
+  TYPE x_09 = p[ 9];
+
+  TYPE x_10 = p[10];
+  TYPE x_11 = p[11];
+  TYPE x_12 = p[12];
+  TYPE x_13 = p[13];
+  TYPE x_14 = p[14];
+  TYPE x_15 = p[15];
+  TYPE x_16 = p[16];
+  TYPE x_17 = p[17];
+  TYPE x_18 = p[18];
+  TYPE x_19 = p[19];
+
+  TYPE x_20 = p[20];
+  TYPE x_21 = p[21];
+  TYPE x_22 = p[22];
+  TYPE x_23 = p[23];
+  TYPE x_24 = p[24];
+  TYPE x_25 = p[25];
+  TYPE x_26 = p[26];
+  TYPE x_27 = p[27];
+  TYPE x_28 = p[28];
+  TYPE x_29 = p[29];
+
+  TYPE x_30 = p[30];
+  TYPE x_31 = p[31];
+  TYPE x_32 = p[32];
+  TYPE x_33 = p[33];
+  TYPE x_34 = p[34];
+  TYPE x_35 = p[35];
+  TYPE x_36 = p[36];
+  TYPE x_37 = p[37];
+  TYPE x_38 = p[38];
+  TYPE x_39 = p[39];
+
+  TYPE x_40 = p[40];
+  TYPE x_41 = p[41];
+  TYPE x_42 = p[42];
+  TYPE x_43 = p[43];
+  TYPE x_44 = p[44];
+  TYPE x_45 = p[45];
+  TYPE x_46 = p[46];
+  TYPE x_47 = p[47];
+  TYPE x_48 = p[48];
+  TYPE x_49 = p[49];
+
+  ITYPE i;
+
+  for (i = 0; i < n; i++)
+    {
+      ITYPE bit = get_bits (bits[i]);
+
+      if ((bit & ((ITYPE)1) << 	0) != 0) x_00 += one;
+      if ((bit & ((ITYPE)1) << 	1) != 0) x_01 += one;
+      if ((bit & ((ITYPE)1) << 	2) != 0) x_02 += one;
+      if ((bit & ((ITYPE)1) << 	3) != 0) x_03 += one;
+      if ((bit & ((ITYPE)1) << 	4) != 0) x_04 += one;
+      if ((bit & ((ITYPE)1) << 	5) != 0) x_05 += one;
+      if ((bit & ((ITYPE)1) << 	6) != 0) x_06 += one;
+      if ((bit & ((ITYPE)1) << 	7) != 0) x_07 += one;
+      if ((bit & ((ITYPE)1) << 	8) != 0) x_08 += one;
+      if ((bit & ((ITYPE)1) << 	9) != 0) x_09 += one;
+
+      if ((bit & ((ITYPE)1) << 10) != 0) x_10 += one;
+      if ((bit & ((ITYPE)1) << 11) != 0) x_11 += one;
+      if ((bit & ((ITYPE)1) << 12) != 0) x_12 += one;
+      if ((bit & ((ITYPE)1) << 13) != 0) x_13 += one;
+      if ((bit & ((ITYPE)1) << 14) != 0) x_14 += one;
+      if ((bit & ((ITYPE)1) << 15) != 0) x_15 += one;
+      if ((bit & ((ITYPE)1) << 16) != 0) x_16 += one;
+      if ((bit & ((ITYPE)1) << 17) != 0) x_17 += one;
+      if ((bit & ((ITYPE)1) << 18) != 0) x_18 += one;
+      if ((bit & ((ITYPE)1) << 19) != 0) x_19 += one;
+
+      if ((bit & ((ITYPE)1) << 20) != 0) x_20 += one;
+      if ((bit & ((ITYPE)1) << 21) != 0) x_21 += one;
+      if ((bit & ((ITYPE)1) << 22) != 0) x_22 += one;
+      if ((bit & ((ITYPE)1) << 23) != 0) x_23 += one;
+      if ((bit & ((ITYPE)1) << 24) != 0) x_24 += one;
+      if ((bit & ((ITYPE)1) << 25) != 0) x_25 += one;
+      if ((bit & ((ITYPE)1) << 26) != 0) x_26 += one;
+      if ((bit & ((ITYPE)1) << 27) != 0) x_27 += one;
+      if ((bit & ((ITYPE)1) << 28) != 0) x_28 += one;
+      if ((bit & ((ITYPE)1) << 29) != 0) x_29 += one;
+
+      if ((bit & ((ITYPE)1) << 30) != 0) x_30 += one;
+      if ((bit & ((ITYPE)1) << 31) != 0) x_31 += one;
+      if ((bit & ((ITYPE)1) << 32) != 0) x_32 += one;
+      if ((bit & ((ITYPE)1) << 33) != 0) x_33 += one;
+      if ((bit & ((ITYPE)1) << 34) != 0) x_34 += one;
+      if ((bit & ((ITYPE)1) << 35) != 0) x_35 += one;
+      if ((bit & ((ITYPE)1) << 36) != 0) x_36 += one;
+      if ((bit & ((ITYPE)1) << 37) != 0) x_37 += one;
+      if ((bit & ((ITYPE)1) << 38) != 0) x_38 += one;
+      if ((bit & ((ITYPE)1) << 39) != 0) x_39 += one;
+
+      if ((bit & ((ITYPE)1) << 40) != 0) x_40 += one;
+      if ((bit & ((ITYPE)1) << 41) != 0) x_41 += one;
+      if ((bit & ((ITYPE)1) << 42) != 0) x_42 += one;
+      if ((bit & ((ITYPE)1) << 43) != 0) x_43 += one;
+      if ((bit & ((ITYPE)1) << 44) != 0) x_44 += one;
+      if ((bit & ((ITYPE)1) << 45) != 0) x_45 += one;
+      if ((bit & ((ITYPE)1) << 46) != 0) x_46 += one;
+      if ((bit & ((ITYPE)1) << 47) != 0) x_47 += one;
+      if ((bit & ((ITYPE)1) << 48) != 0) x_48 += one;
+      if ((bit & ((ITYPE)1) << 49) != 0) x_49 += one;
+    }
+
+  q[ 0] = x_00;
+  q[ 1] = x_01;
+  q[ 2] = x_02;
+  q[ 3] = x_03;
+  q[ 4] = x_04;
+  q[ 5] = x_05;
+  q[ 6] = x_06;
+  q[ 7] = x_07;
+  q[ 8] = x_08;
+  q[ 9] = x_09;
+
+  q[10] = x_10;
+  q[11] = x_11;
+  q[12] = x_12;
+  q[13] = x_13;
+  q[14] = x_14;
+  q[15] = x_15;
+  q[16] = x_16;
+  q[17] = x_17;
+  q[18] = x_18;
+  q[19] = x_19;
+
+  q[20] = x_20;
+  q[21] = x_21;
+  q[22] = x_22;
+  q[23] = x_23;
+  q[24] = x_24;
+  q[25] = x_25;
+  q[26] = x_26;
+  q[27] = x_27;
+  q[28] = x_28;
+  q[29] = x_29;
+
+  q[30] = x_30;
+  q[31] = x_31;
+  q[32] = x_32;
+  q[33] = x_33;
+  q[34] = x_34;
+  q[35] = x_35;
+  q[36] = x_36;
+  q[37] = x_37;
+  q[38] = x_38;
+  q[39] = x_39;
+
+  q[40] = x_40;
+  q[41] = x_41;
+  q[42] = x_42;
+  q[43] = x_43;
+  q[44] = x_44;
+  q[45] = x_45;
+  q[46] = x_46;
+  q[47] = x_47;
+  q[48] = x_48;
+  q[49] = x_49;
+}
+
+/* { dg-final { scan-assembler     {\xsaddsp\M}  } } */
+/* { dg-final { scan-assembler     {\stxsspx\M}  } } */
+/* { dg-final { scan-assembler-not {\mmfvsrd\M}  } } */
+/* { dg-final { scan-assembler-not {\mmfvsrwz\M} } } */