Add mulv32qi3 support

Submitter Jakub Jelinek
Date Oct. 12, 2011, 4:24 p.m.
Message ID <20111012162445.GD2210@tyan-ft48-01.lab.bos.redhat.com>
Permalink /patch/119252/
State New

Comments

Jakub Jelinek - Oct. 12, 2011, 4:24 p.m.
Hi!

On
long long a[1024], c[1024];
char b[1024];
void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    b[i] = a[i] + 3 * c[i];
}
I've noticed that while the i?86 backend supports
mulv16qi3, it doesn't support mulv32qi3 even with AVX2.

The following patch implements that similarly to how
mulv16qi3 is implemented.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

BTW, I wonder if vector multiply expansion, when one argument is a VECTOR_CST
with all elements the same, shouldn't use something similar to what expand_mult
does; I'm not sure if that belongs in the generic code or at least in the backends.
Testing the costs will be harder; maybe it could just test fewer algorithms
and perhaps just count the number of instructions or something similar.
But certainly e.g. v32qi multiplication by 3 is quite costly
(4 interleaves, 2 v16hi multiplications, 4 insns to select the even bytes from
the two), while two vector additions (tmp = x + x; result = x + tmp;)
would do the job.

2011-10-12  Jakub Jelinek  <jakub@redhat.com>

	* config/i386/sse.md (vec_avx2): New mode_attr.
	(mulv16qi3): Macroize to cover also mulv32qi3 for
	TARGET_AVX2 into ...
	(mul<mode>3): ... this.


	Jakub
Richard Henderson - Oct. 12, 2011, 8:30 p.m.
On 10/12/2011 09:24 AM, Jakub Jelinek wrote:
> BTW, I wonder if vector multiply expansion, when one argument is a VECTOR_CST
> with all elements the same, shouldn't use something similar to what expand_mult
> does; I'm not sure if that belongs in the generic code or at least in the backends.
> Testing the costs will be harder; maybe it could just test fewer algorithms
> and perhaps just count the number of instructions or something similar.
> But certainly e.g. v32qi multiplication by 3 is quite costly
> (4 interleaves, 2 v16hi multiplications, 4 insns to select the even bytes from
> the two), while two vector additions (tmp = x + x; result = x + tmp;)
> would do the job.

It would certainly be a good thing to try to do this in the middle-end.


> 2011-10-12  Jakub Jelinek  <jakub@redhat.com>
> 
> 	* config/i386/sse.md (vec_avx2): New mode_attr.
> 	(mulv16qi3): Macroize to cover also mulv32qi3 for
> 	TARGET_AVX2 into ...
> 	(mul<mode>3): ... this.

Ok.


r~

Patch

--- gcc/config/i386/sse.md.jj	2011-10-12 09:23:37.000000000 +0200
+++ gcc/config/i386/sse.md	2011-10-12 12:16:39.000000000 +0200
@@ -163,6 +163,12 @@  (define_mode_attr avx_avx2
    (V4SI "avx2") (V2DI "avx2")
    (V8SI "avx2") (V4DI "avx2")])
 
+(define_mode_attr vec_avx2
+  [(V16QI "vec") (V32QI "avx2")
+   (V8HI "vec") (V16HI "avx2")
+   (V4SI "vec") (V8SI "avx2")
+   (V2DI "vec") (V4DI "avx2")])
+
 ;; Mapping of logic-shift operators
 (define_code_iterator lshift [lshiftrt ashift])
 
@@ -4838,10 +4844,10 @@  (define_insn "*<sse2_avx2>_<plusminus_in
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "TI")])
 
-(define_insn_and_split "mulv16qi3"
-  [(set (match_operand:V16QI 0 "register_operand" "")
-	(mult:V16QI (match_operand:V16QI 1 "register_operand" "")
-		    (match_operand:V16QI 2 "register_operand" "")))]
+(define_insn_and_split "mul<mode>3"
+  [(set (match_operand:VI1_AVX2 0 "register_operand" "")
+	(mult:VI1_AVX2 (match_operand:VI1_AVX2 1 "register_operand" "")
+		       (match_operand:VI1_AVX2 2 "register_operand" "")))]
   "TARGET_SSE2
    && can_create_pseudo_p ()"
   "#"
@@ -4850,34 +4856,41 @@  (define_insn_and_split "mulv16qi3"
 {
   rtx t[6];
   int i;
+  enum machine_mode mulmode = <sseunpackmode>mode;
 
   for (i = 0; i < 6; ++i)
-    t[i] = gen_reg_rtx (V16QImode);
+    t[i] = gen_reg_rtx (<MODE>mode);
 
   /* Unpack data such that we've got a source byte in each low byte of
      each word.  We don't care what goes into the high byte of each word.
      Rather than trying to get zero in there, most convenient is to let
      it be a copy of the low byte.  */
-  emit_insn (gen_vec_interleave_highv16qi (t[0], operands[1], operands[1]));
-  emit_insn (gen_vec_interleave_highv16qi (t[1], operands[2], operands[2]));
-  emit_insn (gen_vec_interleave_lowv16qi (t[2], operands[1], operands[1]));
-  emit_insn (gen_vec_interleave_lowv16qi (t[3], operands[2], operands[2]));
+  emit_insn (gen_<vec_avx2>_interleave_high<mode> (t[0], operands[1],
+						   operands[1]));
+  emit_insn (gen_<vec_avx2>_interleave_high<mode> (t[1], operands[2],
+						   operands[2]));
+  emit_insn (gen_<vec_avx2>_interleave_low<mode> (t[2], operands[1],
+						  operands[1]));
+  emit_insn (gen_<vec_avx2>_interleave_low<mode> (t[3], operands[2],
+						  operands[2]));
 
   /* Multiply words.  The end-of-line annotations here give a picture of what
      the output of that instruction looks like.  Dot means don't care; the
      letters are the bytes of the result with A being the most significant.  */
-  emit_insn (gen_mulv8hi3 (gen_lowpart (V8HImode, t[4]), /* .A.B.C.D.E.F.G.H */
-			   gen_lowpart (V8HImode, t[0]),
-			   gen_lowpart (V8HImode, t[1])));
-  emit_insn (gen_mulv8hi3 (gen_lowpart (V8HImode, t[5]), /* .I.J.K.L.M.N.O.P */
-			   gen_lowpart (V8HImode, t[2]),
-			   gen_lowpart (V8HImode, t[3])));
+  emit_insn (gen_rtx_SET (VOIDmode, gen_lowpart (mulmode, t[4]),
+			  gen_rtx_MULT (mulmode,	/* .A.B.C.D.E.F.G.H */
+					gen_lowpart (mulmode, t[0]),
+					gen_lowpart (mulmode, t[1]))));
+  emit_insn (gen_rtx_SET (VOIDmode, gen_lowpart (mulmode, t[5]),
+			  gen_rtx_MULT (mulmode,	/* .I.J.K.L.M.N.O.P */
+					gen_lowpart (mulmode, t[2]),
+					gen_lowpart (mulmode, t[3]))));
 
   /* Extract the even bytes and merge them back together.  */
   ix86_expand_vec_extract_even_odd (operands[0], t[5], t[4], 0);
 
   set_unique_reg_note (get_last_insn (), REG_EQUAL,
-		       gen_rtx_MULT (V16QImode, operands[1], operands[2]));
+		       gen_rtx_MULT (<MODE>mode, operands[1], operands[2]));
   DONE;
 })