Patchwork [RFH] subreg of a vector without going through memory

login
register
mail settings
Submitter Marc Glisse
Date Nov. 4, 2012, 9:29 a.m.
Message ID <alpine.DEB.2.02.1211040959520.5576@stedding.saclay.inria.fr>
Download mbox | patch
Permalink /patch/196984/
State New
Headers show

Comments

Marc Glisse - Nov. 4, 2012, 9:29 a.m.
Hello,

trying to make some progress on PR 53101, I wrote the attached patch
(it might be completely wrong for big endian, I don't know)
(it is also missing a check that it isn't a paradoxical subreg)

 	* simplify-rtx.c (simplify_subreg): For vectors, create a VEC_SELECT.

However, when I compile this code on x86_64:

typedef double v4 __attribute__((vector_size(32)));
typedef double v2 __attribute__((vector_size(16)));
v2 f(v4 x){
   return *(v2*)&x;
}


I see in the *.combine dump:

[...]
Trying 6 -> 7:
Successfully matched this instruction:
(set (reg:V2DF 60 [ <retval> ])
     (vec_select:V2DF (reg/v:V4DF 61 [ x ])
         (parallel [
                 (const_int 0 [0])
                 (const_int 1 [0x1])
             ])))
rejecting combination of insns 6 and 7
original costs 4 + 16 = 20
replacement cost 32
[...]
(note 4 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 4 3 2 (set (reg/v:V4DF 61 [ x ])
         (reg:V4DF 21 xmm0 [ x ])) v.cc:3 1123 {*movv4df_internal}
      (expr_list:REG_DEAD (reg:V4DF 21 xmm0 [ x ])
         (nil)))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (set (reg:OI 63 [ x ])
         (subreg:OI (reg/v:V4DF 61 [ x ]) 0)) v.cc:4 60 
{*movoi_internal_avx}
      (expr_list:REG_DEAD (reg/v:V4DF 61 [ x ])
         (nil)))
(insn 7 6 11 2 (set (reg:V2DF 60 [ <retval> ])
         (subreg:V2DF (reg:OI 63 [ x ]) 0)) v.cc:4 1124 {*movv2df_internal}
      (expr_list:REG_DEAD (reg:OI 63 [ x ])
         (nil)))
(insn 11 7 14 2 (set (reg/i:V2DF 21 xmm0)
         (reg:V2DF 60 [ <retval> ])) v.cc:5 1124 {*movv2df_internal}
      (expr_list:REG_DEAD (reg:V2DF 60 [ <retval> ])
         (nil)))
(insn 14 11 0 2 (use (reg/i:V2DF 21 xmm0)) v.cc:5 -1
      (nil))



I am surprised by that high replacement cost that prevents the change. Is 
my approach wrong? Is there an issue with the evaluation of costs?

The approach was suggested by Richard B:
http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00197.html
Marc Glisse - Nov. 4, 2012, 10:42 a.m.
On Sun, 4 Nov 2012, Marc Glisse wrote:

> Hello,
>
> trying to make some progress on PR 53101, I wrote the attached patch
> (it might be completely wrong for big endian, I don't know)
> (it is also missing a check that it isn't a paradoxical subreg)
>
> 	* simplify-rtx.c (simplify_subreg): For vectors, create a VEC_SELECT.
>
> However, when I compile this code on x86_64:
>
> typedef double v4 __attribute__((vector_size(32)));
> typedef double v2 __attribute__((vector_size(16)));
> v2 f(v4 x){
>  return *(v2*)&x;
> }
>
>
> I see in the *.combine dump:
>
> [...]
> Trying 6 -> 7:
> Successfully matched this instruction:
> (set (reg:V2DF 60 [ <retval> ])
>    (vec_select:V2DF (reg/v:V4DF 61 [ x ])
>        (parallel [
>                (const_int 0 [0])
>                (const_int 1 [0x1])
>            ])))
> rejecting combination of insns 6 and 7
> original costs 4 + 16 = 20
> replacement cost 32

This cost comes from the x86 target:

     case VEC_SELECT:
     case VEC_CONCAT:
     case VEC_MERGE:
     case VEC_DUPLICATE:
       /* ??? Assume all of these vector manipulation patterns are
          recognizable.  In which case they all pretty much have the
          same cost.  */
      *total = cost->fabs;
      return true;

If the canonical form of a subvector is vec_select, I guess the cost needs 
updating, and if it is subreg, the target should learn how to handle it 
properly?

> [...]
> (note 4 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
> (insn 2 4 3 2 (set (reg/v:V4DF 61 [ x ])
>        (reg:V4DF 21 xmm0 [ x ])) v.cc:3 1123 {*movv4df_internal}
>     (expr_list:REG_DEAD (reg:V4DF 21 xmm0 [ x ])
>        (nil)))
> (note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
> (insn 6 3 7 2 (set (reg:OI 63 [ x ])
>        (subreg:OI (reg/v:V4DF 61 [ x ]) 0)) v.cc:4 60 {*movoi_internal_avx}
>     (expr_list:REG_DEAD (reg/v:V4DF 61 [ x ])
>        (nil)))
> (insn 7 6 11 2 (set (reg:V2DF 60 [ <retval> ])
>        (subreg:V2DF (reg:OI 63 [ x ]) 0)) v.cc:4 1124 {*movv2df_internal}
>     (expr_list:REG_DEAD (reg:OI 63 [ x ])
>        (nil)))
> (insn 11 7 14 2 (set (reg/i:V2DF 21 xmm0)
>        (reg:V2DF 60 [ <retval> ])) v.cc:5 1124 {*movv2df_internal}
>     (expr_list:REG_DEAD (reg:V2DF 60 [ <retval> ])
>        (nil)))
> (insn 14 11 0 2 (use (reg/i:V2DF 21 xmm0)) v.cc:5 -1
>     (nil))
>
>
>
> I am surprised by that high replacement cost that prevents the change. Is my 
> approach wrong? Is there an issue with the evaluation of costs?
>
> The approach was suggested by Richard B:
> http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00197.html
Richard Sandiford - Nov. 5, 2012, 10:46 a.m.
Marc Glisse <marc.glisse@inria.fr> writes:
>  	* simplify-rtx.c (simplify_subreg): For vectors, create a VEC_SELECT.

Probably not helpful, sorry, but a subreg->vec_select transformation
feels like it's going in the wrong direction.  Going from vec_select
to subreg would be OK from a "layering" perspective (although whether
it's a good idea or not is another question), but not the other way.
E.g. we convert can convert truncate to subreg, but should never
convert subreg to truncate.

One problem is that subregs can be lvalues as well as rvalues,
whereas vec_select is always an rvalue.  Also, it's invalid to
form a non-lowpart subreg of a word or subword register, so having
general vec_select handling feels like it's coping with input that
ought to be rejected with a null return.

Richard
Marc Glisse - Nov. 5, 2012, 12:03 p.m.
On Mon, 5 Nov 2012, Richard Sandiford wrote:

> Marc Glisse <marc.glisse@inria.fr> writes:
>>  	* simplify-rtx.c (simplify_subreg): For vectors, create a VEC_SELECT.
>
> Probably not helpful, sorry, but a subreg->vec_select transformation
> feels like it's going in the wrong direction.  Going from vec_select
> to subreg would be OK from a "layering" perspective (although whether
> it's a good idea or not is another question), but not the other way.
> E.g. we convert can convert truncate to subreg, but should never
> convert subreg to truncate.

Thanks, it is helpful to know what the right direction is supposed to be 
:-)

> One problem is that subregs can be lvalues as well as rvalues,
> whereas vec_select is always an rvalue.

Hmm indeed. Not sure how to test that.

> Also, it's invalid to form a non-lowpart subreg of a word or subword 
> register, so having general vec_select handling feels like it's coping 
> with input that ought to be rejected with a null return.

Ah, even if it was the right direction, I should limit it to memory and 
registers larger than a word, ok.


So if I understand you, the x86 back-end is missing define_insns for movs 
of subregs. And I should duplicate in simplify_subreg any code from 
simplify_binary_operation(VEC_SELECT,...) that might apply. And maybe add 
code that recognizes when a VEC_SELECT could be a SUBREG, but maybe not.

(this other patch I was writing (untested) sounds less useful then, I'll 
just attach it here so it doesn't get lost)

Well, I guess there are worse issues with vectors to handle first, as can 
be seen when compiling this for SSE2, which produces no less than 36 mov 
instructions :-(

typedef double vec4 __attribute__((vector_size(32)));

void f(vec4*x){
   *x+=*x**x;
}

Good thing next stage1 is far away, I am getting way too confused...

Patch

Index: simplify-rtx.c

===================================================================
--- simplify-rtx.c	(revision 193127)

+++ simplify-rtx.c	(working copy)

@@ -5884,20 +5884,35 @@  simplify_subreg (enum machine_mode outer

   if (SCALAR_INT_MODE_P (outermode)
       && SCALAR_INT_MODE_P (innermode)
       && GET_MODE_PRECISION (outermode) < GET_MODE_PRECISION (innermode)
       && byte == subreg_lowpart_offset (outermode, innermode))
     {
       rtx tem = simplify_truncation (outermode, op, innermode);
       if (tem)
 	return tem;
     }
 
+  if (VECTOR_MODE_P (innermode)

+      && GET_MODE_INNER (innermode) == (VECTOR_MODE_P (outermode)

+					? GET_MODE_INNER (outermode)

+					: outermode))

+    {

+      unsigned elem_size = GET_MODE_SIZE (GET_MODE_INNER (innermode));

+      unsigned n = GET_MODE_SIZE (outermode) / elem_size;

+      unsigned start = byte / elem_size;

+      rtvec vec = rtvec_alloc (n);

+      for (unsigned i = 0; i < n; i++)

+	RTVEC_ELT (vec, i) = GEN_INT (start + i);

+      return simplify_gen_binary (VEC_SELECT, outermode, op,

+				  gen_rtx_PARALLEL (VOIDmode, vec));

+    }

+

   return NULL_RTX;
 }
 
 /* Make a SUBREG operation or equivalent if it folds.  */
 
 rtx
 simplify_gen_subreg (enum machine_mode outermode, rtx op,
 		     enum machine_mode innermode, unsigned int byte)
 {
   rtx newx;