Patchwork GCC/test: Disable loop-19.c for classic FPU Power

Submitter Maciej W. Rozycki
Date Aug. 30, 2014, 2:46 a.m.
Message ID <alpine.DEB.1.10.1408291552410.2958@tp.orcam.me.uk>
Permalink /patch/384427/
State New

Comments

Maciej W. Rozycki - Aug. 30, 2014, 2:46 a.m.
Hi,

 The loop-19.c test case has regressed from 4.8 to 4.9 and trunk on 
classic FPU Power targets; these failures are now seen:

FAIL: gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )a," 2
FAIL: gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )c," 2

 However, upon inspection of the generated code it is obvious that its 
quality has improved: the autoincrement rather than the indexed addressing 
mode is now used in the loop produced, reducing the number of instructions 
in the loop from 4 to 3 and also removing another instruction from outside 
the loop, i.e. (new code):

	.globl tuned_STREAM_Copy
	.type	tuned_STREAM_Copy, @function
tuned_STREAM_Copy:
	lis 8,0x1e
	lis 10,a-8@ha
	ori 8,8,33920
	lis 9,c-8@ha
	mtctr 8
	la 10,a-8@l(10)
	la 9,c-8@l(9)
.L2:
	lfdu 0,8(10)
	stfdu 0,8(9)
	bdnz .L2
	blr
	.size	tuned_STREAM_Copy, .-tuned_STREAM_Copy

vs (old code):

	.globl tuned_STREAM_Copy
	.type	tuned_STREAM_Copy, @function
tuned_STREAM_Copy:
	lis 7,0x1e
	ori 7,7,33920
	mtctr 7
	lis 8,c@ha
	lis 10,a@ha
	li 9,0
	la 8,c@l(8)
	la 10,a@l(10)
.L3:
	lfdx 0,10,9
	stfdx 0,8,9
	addi 9,9,8
	bdnz .L3
	blr
	.size	tuned_STREAM_Copy,.-tuned_STREAM_Copy

The only Power targets that still pass this test are e500v2 ones such as 
`-mcpu=8548 -mfloat-gprs=double -mspe=yes -mabi=spe' that use the SPE unit 
for FP operations, because the indexed mode is still used (there's no 
autoincrement addressing mode available for the memory access instructions 
concerned):

	.globl tuned_STREAM_Copy
	.type	tuned_STREAM_Copy, @function
tuned_STREAM_Copy:
	lis 10,0x1e
	lis 7,c@ha
	lis 8,a@ha
	ori 10,10,0x8480
	li 9,0
	la 7,c@l(7)
	la 8,a@l(8)
	mtctr 10
.L2:
	evlddx 10,8,9
	evstddx 10,7,9
	addi 9,9,8
	bdnz .L2
	blr
	.size	tuned_STREAM_Copy,.-tuned_STREAM_Copy

[I have removed "-fno-common" from the current test flags for the purpose 
of this consideration to compare apples to apples; 4.8 didn't have it.  
The presence or absence of this flag does not appear to make a difference 
for this test case for Power targets.]
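
 For reference, the loop that all three listings above implement can be
sketched in C roughly as below.  This is a reconstruction from the
assembly and the dumps, not a verbatim copy of loop-19.c: the names a, c
and tuned_STREAM_Copy come from the listings, while N and the helper
trip_count_from_asm are assumptions made for illustration.  The helper
only confirms that the lis/ori pair feeding mtctr encodes the same trip
count as the 16000000-byte bound seen in the old dump.

```c
#include <assert.h>

/* Assumed element count: 16000000 bytes / 8 bytes per double.  */
#define N 2000000

double a[N], c[N];

/* The copy loop; the function name is taken from the listings.  */
void tuned_STREAM_Copy(void)
{
    int j;

    for (j = 0; j < N; j++)
        c[j] = a[j];
}

/* Hypothetical helper: the value built by "lis r,0x1e; ori r,r,33920"
   before it is moved to CTR.  */
unsigned long trip_count_from_asm(void)
{
    unsigned long ctr = 0x1eUL << 16;   /* lis sets the upper halfword */

    ctr |= 33920UL;                     /* ori fills the lower halfword, 0x8480 */
    assert(ctr == N);
    return ctr;
}
```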

 The obvious reason for the failure is the offset of -8 now seen in the new 
classic FP code for preinitialising the pointers before entering the loop.  
The initial offset is needed so that it is cancelled by the offset of 8 
used in the loop itself to autoincrement these pointers.  So the new code 
is not only better, but it actually has to use these offsets, or 
autoincrementing would not work.
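
 The difference between the two addressing schemes can be illustrated
with a minimal C sketch.  It is an analogy for what lfdx/stfdx plus addi
versus lfdu/stfdu do, not compiler output; all function and buffer names
are hypothetical.  Because portable C does not allow forming a pointer
before the start of an array, the sketch assumes one padding element in
front of each buffer to stand in for the "a-8" and "c-8" start addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form, as in the old code: fixed bases plus a running index.  */
void copy_indexed(double *dst, const double *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Update form, as in the new code: each pointer is advanced by one
   element before the access, so it must start one element early.  */
void copy_preincrement(double *dst_minus_1, const double *src_minus_1,
                       size_t n)
{
    const double *p = src_minus_1;
    double *q = dst_minus_1;

    while (n-- > 0)
        *++q = *++p;
}

enum { LEN = 4 };

/* Element 0 of each buffer is padding; the data lives in 1..LEN.  */
static double src_buf[LEN + 1] = { 0.0, 1.0, 2.0, 3.0, 4.0 };
static double dst_buf[LEN + 1];

/* Returns 1 if the biased pointers copy exactly the LEN data elements.  */
int preincrement_demo(void)
{
    size_t i;

    copy_preincrement(dst_buf, src_buf, LEN);
    for (i = 1; i <= LEN; i++)
        if (dst_buf[i] != src_buf[i])
            return 0;
    assert(dst_buf[0] == 0.0);          /* padding is never written */
    return 1;
}
```

 Without the initial bias the first access would land on the second
element, which is why the offset cannot be dropped from the setup code.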

 Therefore I think at this point the test case is invalid for classic FP 
Power, so I propose that we exclude it from testing here, leaving only SPE 
FP Power for whatever value the test case may have for it, and especially 
x86 variants where there's an actual code size penalty for using an 
immediate offset (displacement) in addition to a base register.

 For the record here are the optimization dumps examined by the test case, 
for the old generated code that passes:

;; Function tuned_STREAM_Copy (tuned_STREAM_Copy, funcdef_no=0, decl_uid=1382, cgraph_uid=0)

tuned_STREAM_Copy ()
{
  sizetype ivtmp.10;
  double _4;

  <bb 2>:

  <bb 3>:
  # ivtmp.10_8 = PHI <ivtmp.10_2(4), 0(2)>
  _4 = MEM[symbol: a, index: ivtmp.10_8, offset: 0B];
  MEM[symbol: c, index: ivtmp.10_8, offset: 0B] = _4;
  ivtmp.10_2 = ivtmp.10_8 + 8;
  if (ivtmp.10_2 != 16000000)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

  <bb 5>:
  return;

}

and for the new code that fails:

;; Function tuned_STREAM_Copy (tuned_STREAM_Copy, funcdef_no=0, decl_uid=2191, symbol_order=2)

Removing basic block 5
tuned_STREAM_Copy ()
{
  unsigned int ivtmp.13;
  unsigned int ivtmp.9;
  double _4;
  void * _15;
  void * _16;
  unsigned int _17;

  <bb 2>:
  ivtmp.9_11 = (unsigned int) &MEM[(void *)&a + 4294967288B];
  ivtmp.13_14 = (unsigned int) &MEM[(void *)&c + 4294967288B];
  _17 = (unsigned int) &MEM[(void *)&a + 15999992B];

  <bb 3>:
  # ivtmp.9_8 = PHI <ivtmp.9_2(3), ivtmp.9_11(2)>
  # ivtmp.13_12 = PHI <ivtmp.13_13(3), ivtmp.13_14(2)>
  ivtmp.9_2 = ivtmp.9_8 + 8;
  _15 = (void *) ivtmp.9_2;
  _4 = MEM[base: _15, offset: 0B];
  ivtmp.13_13 = ivtmp.13_12 + 8;
  _16 = (void *) ivtmp.13_13;
  MEM[base: _16, offset: 0B] = _4;
  if (ivtmp.9_2 != _17)
    goto <bb 3>;
  else
    goto <bb 4>;

  <bb 4>:
  return;

}
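
 The large constants in the new dump are the same -8 bias printed as
unsigned 32-bit values.  The following sketch checks that arithmetic,
assuming 32-bit addresses as the "unsigned int" induction variables in
the dump suggest; bias32 and offsets_agree are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add a signed byte offset modulo 2^32.  */
uint32_t bias32(uint32_t addr, int32_t offset)
{
    return addr + (uint32_t)offset;
}

/* Returns 1 if the offsets printed in the dump describe a loop that
   starts 8 bytes before BASE and runs 2000000 times in steps of 8.  */
int offsets_agree(uint32_t base)
{
    uint32_t start = base + UINT32_C(4294967288);  /* ivtmp.9_11 */
    uint32_t end = base + UINT32_C(15999992);      /* _17 */
    uint32_t iterations = (end - start) / 8;

    return start == bias32(base, -8) && iterations == 2000000;
}
```

 The exit test compares the already incremented pointer with _17, so the
last access is at a + 15999992, the final element of a 16000000-byte
array.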

 Tested with the following powerpc-linux-gnu multilibs, with the respective 
results noted on the right:

-mcpu=603e						UNSUPPORTED
-mcpu=603e -msoft-float					UNSUPPORTED
-mcpu=8540 -mfloat-gprs=single -mspe=yes -mabi=spe	UNSUPPORTED
-mcpu=8548 -mfloat-gprs=double -mspe=yes -mabi=spe	PASS
-mcpu=7400 -maltivec -mabi=altivec			UNSUPPORTED
-mcpu=e6500 -maltivec -mabi=altivec			UNSUPPORTED
-mcpu=e5500 -m64					UNSUPPORTED
-mcpu=e6500 -m64 -maltivec -mabi=altivec		UNSUPPORTED

Original results:

-mcpu=603e						FAIL
-mcpu=603e -msoft-float					UNSUPPORTED
-mcpu=8540 -mfloat-gprs=single -mspe=yes -mabi=spe	UNSUPPORTED
-mcpu=8548 -mfloat-gprs=double -mspe=yes -mabi=spe	PASS
-mcpu=7400 -maltivec -mabi=altivec			FAIL
-mcpu=e6500 -maltivec -mabi=altivec			FAIL
-mcpu=e5500 -m64					FAIL
-mcpu=e6500 -m64 -maltivec -mabi=altivec		FAIL
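
 The change between the two tables follows from the target selector in
the dg-do line.  As a rough model, the old and new selectors can be
written as boolean predicates.  The keywords powerpc_hard_double and
powerpc_fprs are those used in the patch; struct target, its x86 field
and the function names are hypothetical, and the model ignores the
separate nonpic requirement.

```c
#include <stdbool.h>

/* Hypothetical summary of the properties the selector looks at.  */
struct target {
    bool x86;                   /* i?86-*-* or x86_64-*-* */
    bool powerpc_hard_double;   /* hardware double-precision FP */
    bool powerpc_fprs;          /* classic floating-point registers */
};

/* Selector before the patch.  */
bool selected_before(struct target t)
{
    return t.x86 || t.powerpc_hard_double;
}

/* Selector after the patch: classic FPU Power is excluded.  */
bool selected_after(struct target t)
{
    return t.x86 || (t.powerpc_hard_double && !t.powerpc_fprs);
}
```

 Under this model a classic FPU multilib was selected before (and
failed) and is reported as UNSUPPORTED afterwards, while an SPE
double-precision multilib is selected in both cases.  The per-multilib
property values are inferred from the tables above rather than checked
against the testsuite support code.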

 OK to apply (for trunk and 4.9)?

2014-08-30  Maciej W. Rozycki  <macro@codesourcery.com>

	* gcc.dg/tree-ssa/loop-19.c: Exclude classic FPU Power targets.

  Maciej

gcc-test-power-loop-19.diff
David Edelsohn - Aug. 30, 2014, 10:51 p.m.
On Fri, Aug 29, 2014 at 10:46 PM, Maciej W. Rozycki
<macro@codesourcery.com> wrote:
>  OK to apply (for trunk and 4.9)?
>
> 2014-08-30  Maciej W. Rozycki  <macro@codesourcery.com>
>
>         * gcc.dg/tree-ssa/loop-19.c: Exclude classic FPU Power targets.
>
>   Maciej
>
> gcc-test-power-loop-19.diff
> Index: gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c
> ===================================================================
> --- gcc-fsf-trunk-quilt.orig/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c    2014-08-29 16:45:27.748122597 +0100
> +++ gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c 2014-08-30 02:53:03.658955978 +0100
> @@ -4,7 +4,7 @@
>
>     The testcase comes from PR 29256 (and originally, the stream benchmark).  */
>
> -/* { dg-do compile { target { i?86-*-* || { x86_64-*-* || powerpc_hard_double } } } } */
> +/* { dg-do compile { target { i?86-*-* || { x86_64-*-* || { powerpc_hard_double && { ! powerpc_fprs } } } } } } */
>  /* { dg-require-effective-target nonpic } */
>  /* { dg-options "-O3 -fno-tree-loop-distribute-patterns -fno-prefetch-loop-arrays -fdump-tree-optimized -fno-common" } */
>

Okay.

Thanks, David
Maciej W. Rozycki - Sept. 1, 2014, 3:44 p.m.
On Sat, 30 Aug 2014, David Edelsohn wrote:

> > 2014-08-30  Maciej W. Rozycki  <macro@codesourcery.com>
> >
> >         * gcc.dg/tree-ssa/loop-19.c: Exclude classic FPU Power targets.
> >
> >   Maciej
> >
> > gcc-test-power-loop-19.diff
> > Index: gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c
> > ===================================================================
> > --- gcc-fsf-trunk-quilt.orig/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c    2014-08-29 16:45:27.748122597 +0100
> > +++ gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c 2014-08-30 02:53:03.658955978 +0100
> > @@ -4,7 +4,7 @@
> >
> >     The testcase comes from PR 29256 (and originally, the stream benchmark).  */
> >
> > -/* { dg-do compile { target { i?86-*-* || { x86_64-*-* || powerpc_hard_double } } } } */
> > +/* { dg-do compile { target { i?86-*-* || { x86_64-*-* || { powerpc_hard_double && { ! powerpc_fprs } } } } } } */
> >  /* { dg-require-effective-target nonpic } */
> >  /* { dg-options "-O3 -fno-tree-loop-distribute-patterns -fno-prefetch-loop-arrays -fdump-tree-optimized -fno-common" } */
> >
> 
> Okay.

 Applied to trunk now and backported to 4.9.  Thanks.

  Maciej

Patch

Index: gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c
===================================================================
--- gcc-fsf-trunk-quilt.orig/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c	2014-08-29 16:45:27.748122597 +0100
+++ gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c	2014-08-30 02:53:03.658955978 +0100
@@ -4,7 +4,7 @@ 
  
    The testcase comes from PR 29256 (and originally, the stream benchmark).  */
 
-/* { dg-do compile { target { i?86-*-* || { x86_64-*-* || powerpc_hard_double } } } } */
+/* { dg-do compile { target { i?86-*-* || { x86_64-*-* || { powerpc_hard_double && { ! powerpc_fprs } } } } } } */
 /* { dg-require-effective-target nonpic } */
 /* { dg-options "-O3 -fno-tree-loop-distribute-patterns -fno-prefetch-loop-arrays -fdump-tree-optimized -fno-common" } */