Patchwork x86_64 varargs setup jump table

Submitter Richard Henderson
Date July 20, 2010, 4:32 p.m.
Message ID <4C45CFA0.9070404@redhat.com>
Permalink /patch/59344/
State New

Comments

Richard Henderson - July 20, 2010, 4:32 p.m.
On 07/19/2010 02:13 PM, Richard Henderson wrote:
> This bootstraps; regression test starting now.
> 
> Obviously there's some patterns in i386.md that should be removed
> along with this, were this patch to go in.

A slightly different patch that passes regression testing.
This also vanishes the patterns that should go.


r~
Richard Henderson - July 20, 2010, 11:05 p.m.
On 07/20/2010 09:32 AM, Richard Henderson wrote:
> On 07/19/2010 02:13 PM, Richard Henderson wrote:
>> This bootstraps; regression test starting now.
>>
>> Obviously there's some patterns in i386.md that should be removed
>> along with this, were this patch to go in.
> 
> A slightly different patch that passes regression testing.
> This also vanishes the patterns that should go.

It was pointed out to me on IRC that x86_64 bootstrap is still broken.
So unless there are objections to this patch, I'll commit it tomorrow
morning (GMT+7).  We can debate the performance impact afterward.


r~
Sebastian Pop - July 20, 2010, 11:20 p.m.
On Tue, Jul 20, 2010 at 18:05, Richard Henderson <rth@redhat.com> wrote:
> On 07/20/2010 09:32 AM, Richard Henderson wrote:
>> On 07/19/2010 02:13 PM, Richard Henderson wrote:
>>> This bootstraps; regression test starting now.
>>>
>>> Obviously there's some patterns in i386.md that should be removed
>>> along with this, were this patch to go in.
>>
>> A slightly different patch that passes regression testing.
>> This also vanishes the patterns that should go.
>
> It was pointed out to me on IRC that x86_64 bootstrap is still broken.
> So unless there are objections to this patch, I'll commit it tomorrow
> morning (GMT+7).  We can debate the performance impact afterward.

I am also testing SPEC 2006 with your last patch on amd64-linux.

Sebastian
H.J. Lu - July 21, 2010, 11:15 p.m.
On Tue, Jul 20, 2010 at 9:32 AM, Richard Henderson <rth@redhat.com> wrote:
> On 07/19/2010 02:13 PM, Richard Henderson wrote:
>> This bootstraps; regression test starting now.
>>
>> Obviously there's some patterns in i386.md that should be removed
>> along with this, were this patch to go in.
>
> A slightly different patch that passes regression testing.
> This also vanishes the patterns that should go.
>

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45027
H.J. Lu - July 22, 2010, 1:51 p.m.
On Tue, Jul 20, 2010 at 4:05 PM, Richard Henderson <rth@redhat.com> wrote:
> On 07/20/2010 09:32 AM, Richard Henderson wrote:
>> On 07/19/2010 02:13 PM, Richard Henderson wrote:
>>> This bootstraps; regression test starting now.
>>>
>>> Obviously there's some patterns in i386.md that should be removed
>>> along with this, were this patch to go in.
>>
>> A slightly different patch that passes regression testing.
>> This also vanishes the patterns that should go.
>
> It was pointed out to me on IRC that x86_64 bootstrap is still broken.
> So unless there are objections to this patch, I'll commit it tomorrow
> morning (GMT+7).  We can debate the performance impact afterward.
>

Here are results on Intel Core i7:

Old: Gcc 4.6.0 revision 162269
New: Gcc 4.6.0 revision 162269 + this patch.

                                 New/Old
-O2:
164.gzip 			 0.595829%
175.vpr 			 0.123508%
176.gcc 			 0.580271%
181.mcf 			 -0.724263%
186.crafty 			 0.222717%
197.parser 			 0.100806%
252.eon 			 -0.159744%
253.perlbmk 			 -6.45709%
254.gap 			 0.417202%
255.vortex 			 0.127226%
256.bzip2 			 -0.167926%
300.twolf 			 0%
SPECint_base2000 			 -0.461082%

168.wupwise 			 -0.0897666%
171.swim 			 -0.210859%
172.mgrid 			 0.304298%
173.applu 			 1.39241%
177.mesa 			 -0.0538938%
178.galgel 			 0.0946074%
179.art 			 -0.244662%
183.equake 			 -1.24064%
187.facerec 			 0.50797%
188.ammp 			 0.532092%
189.lucas 			 -0.149785%
191.fma3d 			 0.841751%
200.sixtrack 			 -0.0752445%
301.apsi 			 -0.209018%
SPECfp_base2000 			 0.0985068%

400.perlbench 			 3.35821%
401.bzip2 			 1.11732%
403.gcc 			 -0.413223%
429.mcf 			 0.425532%
445.gobmk 			 0.41841%
456.hmmer 			 0%
458.sjeng 			 0%
462.libquantum 			 -1.81818%
464.h264ref 			 0%
471.omnetpp 			 0%
473.astar 			 0%
483.xalancbmk 			 0.3861%
SPECint(R)_base2006 			 0.41841%

410.bwaves 			 -0.287356%
416.gamess 			 0%
433.milc 			 1.08696%
434.zeusmp 			 0%
435.gromacs 			 -0.490196%
436.cactusADM 			 -4.67836%
437.leslie3d 			 -0.502513%
444.namd 			 0%
447.dealII 			 0%
450.soplex 			 -1.14504%
453.povray 			 -0.358423%
454.calculix 			 0%
459.GemsFDTD 			 -1.32159%
465.tonto 			 0.44843%
470.lbm 			 -10%
481.wrf 			 0%
482.sphinx3 			 0.621118%
SPECfp(R)_base2006 			 -0.865801%

-O3:

164.gzip 			 -0.0487567%
175.vpr 			 0.204499%
176.gcc 			 -0.389509%
181.mcf 			 -0.721092%
186.crafty 			 0%
197.parser 			 0.328638%
252.eon 			 -0.021796%
253.perlbmk 			 0.107817%
254.gap 			 0.607806%
255.vortex 			 0.176278%
256.bzip2 			 0%
300.twolf 			 0%
SPECint_base2000 			 0.018909%

168.wupwise 			 0.404722%
171.swim 			 -0.289321%
172.mgrid 			 0.662495%
173.applu 			 0%
177.mesa 			 0.0272109%
178.galgel 			 0.0126374%
179.art 			 0.155699%
183.equake 			 3.73021%
187.facerec 			 0.27248%
188.ammp 			 0.064%
189.lucas 			 0.432331%
191.fma3d 			 0.658436%
200.sixtrack 			 -0.142653%
301.apsi 			 -0.536572%
SPECfp_base2000 			 0.385806%

400.perlbench 			 2.22222%
401.bzip2 			 6.89655%
403.gcc 			 -0.809717%
429.mcf 			 0%
445.gobmk 			 0%
456.hmmer 			 0%
458.sjeng 			 0%
462.libquantum 			 -2.86396%
464.h264ref 			 0%
471.omnetpp 			 0%
473.astar 			 0%
483.xalancbmk 			 -0.352113%
SPECint(R)_base2006 			 0.403226%

410.bwaves 			 -7.14286%
416.gamess 			 0%
433.milc 			 -1.02564%
434.zeusmp 			 -0.865801%
435.gromacs 			 0%
436.cactusADM 			 0%
437.leslie3d 			 -1.66667%
444.namd 			 0%
447.dealII 			 -0.595238%
450.soplex 			 -0.754717%
453.povray 			 0.373134%
454.calculix 			 0.480769%
459.GemsFDTD 			 0.444444%
465.tonto 			 0.44843%
470.lbm 			 -0.438596%
481.wrf 			 0%
482.sphinx3 			 -1.08696%
SPECfp(R)_base2006 			 -0.78125%

Results look OK.  I can't tell why there are such big changes in
410.bwaves, 400.perlbench and 401.bzip2.
Sebastian Pop - July 22, 2010, 6:02 p.m.
Here are the results on AMD Phenom(tm) 9950 Quad-Core.

Old: Gcc 4.6.0 revision 162355
New: Gcc 4.6.0 revision 162355 + this patch.
Flags: -O3 -funroll-loops -fpeel-loops -ffast-math -march=native

Each number is the percentage change in run time, (old - new) / old * 100;
positive means the patch is faster.
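As a quick sketch, the metric Sebastian describes can be computed like this
(function name invented here for illustration):

```python
def pct_improvement(old_time, new_time):
    # Percentage change in run time between the unpatched ("old") and
    # patched ("new") compilers; positive means the patch is faster.
    return (old_time - new_time) / old_time * 100.0
```

For example, the 6.11% figure for 429.mcf below means a run that took
100 s with the old compiler takes roughly 93.89 s with the patch.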

483.xalancbmk  Error
403.gcc        Error

410.bwaves      	0.00%
416.gamess      	0.00%
433.milc        	0.00%
434.zeusmp      	-0.40%
435.gromacs     	-0.18%
436.cactusADM   	0.92%
437.leslie3d    	0.00%
444.namd        	0.00%
447.dealII      	-0.16%
450.soplex      	-0.15%
453.povray      	-0.33%
454.calculix    	-0.16%
459.GemsFDTD    	0.11%
465.tonto       	-0.38%
470.lbm         	-0.12%
481.wrf         	-0.11%
482.sphinx3     	0.85%
400.perlbench   	0.33%
401.bzip2       	0.33%
429.mcf         	6.11%
445.gobmk       	0.13%
456.hmmer       	9.95%
458.sjeng       	0.23%
462.libquantum  	-0.38%
464.h264ref     	-0.91%
471.omnetpp     	0.27%
473.astar       	0.00%
Richard Henderson - July 22, 2010, 6:15 p.m.
On 07/22/2010 11:02 AM, Sebastian Pop wrote:
> Here are the results on AMD Phenom(tm) 9950 Quad-Core.
> 
> Old: Gcc 4.6.0 revision 162355
> New: Gcc 4.6.0 revision 162355 + this patch.
> Flags: -O3 -funroll-loops -fpeel-loops -ffast-math -march=native
> 
> The number is the run time percentage: (old - new) / old * 100
> (positive is better)
> 
> [ no positive results ]

Hmm.  At least HJ had some positive results.  I'm surprised
that there are none on the AMD box.

Does movaps have reformatting stalls that movdqa perhaps does
on that particular micro-architecture?  Or are stores exempt
from reformatting stalls now?

Otherwise the only explanation I can think of is that the computed jump
was in practice very predictable (i.e. lots of calls containing
the same sequence of types), and that performing a few fewer
stores makes the difference.


r~
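The tradeoff Richard is weighing here can be modeled very roughly.  The
sketch below is only an illustration of which registers each scheme ends up
storing (the function names are invented; `al` stands for the value the
x86-64 ABI passes in AL, the count of vector registers used by the caller,
and `n_named` for the number of named SSE arguments):

```python
SSE_REGPARM_MAX = 8  # XMM0..XMM7 are the x86-64 SSE parameter registers

def computed_jump_save(al, n_named):
    """Old scheme (sse_prologue_save): jump into a table of stores so that
    only the registers the caller actually used, n_named .. al-1, are
    written to the save area."""
    return list(range(n_named, al))

def straight_line_save(al, n_named):
    """New scheme from this patch: one conditional branch skips the whole
    block when AL is zero; otherwise every potentially-live register
    n_named .. 7 is stored with a plain move."""
    if al == 0:
        return []
    return list(range(n_named, SSE_REGPARM_MAX))
```

The new scheme may perform extra stores (harmless, since the save area is
sized for all eight registers) but eliminates the indirect jump; whether
that is a win depends on how predictable the jump was, which is exactly the
question raised above.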
Richard Henderson - July 22, 2010, 6:17 p.m.
On 07/22/2010 11:15 AM, Richard Henderson wrote:
> Hmm.  At least HJ had some positive results.  I'm surprised
> that there are none on the AMD box.

Clearly I can't read.  There were improvements, some large.


r~
Sebastian Pop - July 22, 2010, 6:19 p.m.
On Thu, Jul 22, 2010 at 13:17, Richard Henderson <rth@redhat.com> wrote:
> On 07/22/2010 11:15 AM, Richard Henderson wrote:
>> Hmm.  At least HJ had some positive results.  I'm surprised
>> that there are none on the AMD box.
>
> Clearly I can't read.  There were improvements, some large.

Right.  The negative results are minor, less than 1%.

Sebastian

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index bb0b890..d9dc571 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -7062,11 +7062,8 @@  static void
 setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
 {
   rtx save_area, mem;
-  rtx label;
-  rtx tmp_reg;
-  rtx nsse_reg;
   alias_set_type set;
-  int i;
+  int i, max;
 
   /* GPR size of varargs save area.  */
   if (cfun->va_list_gpr_size)
@@ -7087,10 +7084,11 @@  setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
   save_area = frame_pointer_rtx;
   set = get_varargs_alias_set ();
 
-  for (i = cum->regno;
-       i < X86_64_REGPARM_MAX
-       && i < cum->regno + cfun->va_list_gpr_size / UNITS_PER_WORD;
-       i++)
+  max = cum->regno + cfun->va_list_gpr_size / UNITS_PER_WORD;
+  if (max > X86_64_REGPARM_MAX)
+    max = X86_64_REGPARM_MAX;
+
+  for (i = cum->regno; i < max; i++)
     {
       mem = gen_rtx_MEM (Pmode,
 			 plus_constant (save_area, i * UNITS_PER_WORD));
@@ -7102,33 +7100,41 @@  setup_incoming_varargs_64 (CUMULATIVE_ARGS *cum)
 
   if (ix86_varargs_fpr_size)
     {
+      enum machine_mode smode;
+      rtx label, test;
+
       /* Now emit code to save SSE registers.  The AX parameter contains number
-	 of SSE parameter registers used to call this function.  We use
-	 sse_prologue_save insn template that produces computed jump across
-	 SSE saves.  We need some preparation work to get this working.  */
+	 of SSE parameter registers used to call this function, though all we
+	 actually check here is the zero/non-zero status.  */
 
       label = gen_label_rtx ();
+      test = gen_rtx_EQ (VOIDmode, gen_rtx_REG (QImode, AX_REG), const0_rtx);
+      emit_jump_insn (gen_cbranchqi4 (test, XEXP (test, 0), XEXP (test, 1),
+				      label));
+
+      /* If we've determined that we're only loading scalars (and not
+	 vector data) then we can store doubles instead.  */
+      if (crtl->stack_alignment_needed < 128)
+	smode = DFmode;
+      else
+	smode = V4SFmode;
 
-      nsse_reg = gen_reg_rtx (Pmode);
-      emit_insn (gen_zero_extendqidi2 (nsse_reg, gen_rtx_REG (QImode, AX_REG)));
-
-      /* Compute address of memory block we save into.  We always use pointer
-	 pointing 127 bytes after first byte to store - this is needed to keep
-	 instruction size limited by 4 bytes (5 bytes for AVX) with one
-	 byte displacement.  */
-      tmp_reg = gen_reg_rtx (Pmode);
-      emit_insn (gen_rtx_SET (VOIDmode, tmp_reg,
-			      plus_constant (save_area,
-					     ix86_varargs_gpr_size + 127)));
-      mem = gen_rtx_MEM (BLKmode, plus_constant (tmp_reg, -127));
-      MEM_NOTRAP_P (mem) = 1;
-      set_mem_alias_set (mem, set);
-      set_mem_align (mem, 64);
+      max = cum->sse_regno + cfun->va_list_fpr_size / 16;
+      if (max > X86_64_SSE_REGPARM_MAX)
+	max = X86_64_SSE_REGPARM_MAX;
 
-      /* And finally do the dirty job!  */
-      emit_insn (gen_sse_prologue_save (mem, nsse_reg,
-					GEN_INT (cum->sse_regno), label,
-					gen_reg_rtx (Pmode)));
+      for (i = cum->sse_regno; i < max; ++i)
+	{
+	  mem = plus_constant (save_area, i * 16 + ix86_varargs_gpr_size);
+	  mem = gen_rtx_MEM (smode, mem);
+	  MEM_NOTRAP_P (mem) = 1;
+	  set_mem_alias_set (mem, set);
+	  set_mem_align (mem, GET_MODE_ALIGNMENT (smode));
+
+	  emit_move_insn (mem, gen_rtx_REG (smode, SSE_REGNO (i)));
+	}
+
+      emit_label (label);
     }
 }
 
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 88b4029..6616da2 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -79,13 +79,11 @@ 
   ;; Prologue support
   UNSPEC_STACK_ALLOC
   UNSPEC_SET_GOT
-  UNSPEC_SSE_PROLOGUE_SAVE
   UNSPEC_REG_SAVE
   UNSPEC_DEF_CFA
   UNSPEC_SET_RIP
   UNSPEC_SET_GOT_OFFSET
   UNSPEC_MEMORY_BLOCKAGE
-  UNSPEC_SSE_PROLOGUE_SAVE_LOW
 
   ;; TLS support
   UNSPEC_TP
@@ -17825,179 +17823,6 @@ 
   { return ASM_SHORT "0x0b0f"; }
   [(set_attr "length" "2")])
 
-(define_expand "sse_prologue_save"
-  [(parallel [(set (match_operand:BLK 0 "" "")
-		   (unspec:BLK [(reg:DI XMM0_REG)
-				(reg:DI XMM1_REG)
-				(reg:DI XMM2_REG)
-				(reg:DI XMM3_REG)
-				(reg:DI XMM4_REG)
-				(reg:DI XMM5_REG)
-				(reg:DI XMM6_REG)
-				(reg:DI XMM7_REG)] UNSPEC_SSE_PROLOGUE_SAVE))
-	      (clobber (reg:CC FLAGS_REG))
-	      (clobber (match_operand:DI 1 "register_operand" ""))
-	      (use (match_operand:DI 2 "immediate_operand" ""))
-	      (use (label_ref:DI (match_operand 3 "" "")))
-	      (clobber (match_operand:DI 4 "register_operand" ""))
-	      (use (match_dup 1))])]
-  "TARGET_64BIT"
-  "")
-
-;; Pre-reload version of prologue save.  Until after prologue generation we don't know
-;; what the size of save instruction will be.
-;; Operand 0+operand 6 is the memory save area
-;; Operand 1 is number of registers to save (will get overwritten to operand 5)
-;; Operand 2 is number of non-vaargs SSE arguments
-;; Operand 3 is label starting the save block
-;; Operand 4 is used for temporary computation of jump address
-(define_insn "*sse_prologue_save_insn1"
-  [(set (mem:BLK (plus:DI (match_operand:DI 0 "register_operand" "R")
-			  (match_operand:DI 6 "const_int_operand" "n")))
-	(unspec:BLK [(reg:DI XMM0_REG)
-		     (reg:DI XMM1_REG)
-		     (reg:DI XMM2_REG)
-		     (reg:DI XMM3_REG)
-		     (reg:DI XMM4_REG)
-		     (reg:DI XMM5_REG)
-		     (reg:DI XMM6_REG)
-		     (reg:DI XMM7_REG)] UNSPEC_SSE_PROLOGUE_SAVE))
-   (clobber (reg:CC FLAGS_REG))
-   (clobber (match_operand:DI 1 "register_operand" "=r"))
-   (use (match_operand:DI 2 "const_int_operand" "i"))
-   (use (label_ref:DI (match_operand 3 "" "X")))
-   (clobber (match_operand:DI 4 "register_operand" "=&r"))
-   (use (match_operand:DI 5 "register_operand" "1"))]
-  "TARGET_64BIT
-   && INTVAL (operands[6]) + X86_64_SSE_REGPARM_MAX * 16 - 16 < 128
-   && INTVAL (operands[6]) + INTVAL (operands[2]) * 16 >= -128"
-  "#"
-  [(set_attr "type" "other")
-   (set_attr "memory" "store")
-   (set_attr "mode" "DI")])
-
-;; We know size of save instruction; expand the computation of jump address
-;; in the jumptable.
-(define_split
-  [(parallel [(set (match_operand:BLK 0 "" "")
-		    (unspec:BLK [(reg:DI XMM0_REG)
-				 (reg:DI XMM1_REG)
-				 (reg:DI XMM2_REG)
-				 (reg:DI XMM3_REG)
-				 (reg:DI XMM4_REG)
-				 (reg:DI XMM5_REG)
-				 (reg:DI XMM6_REG)
-				 (reg:DI XMM7_REG)] UNSPEC_SSE_PROLOGUE_SAVE))
-	       (clobber (reg:CC FLAGS_REG))
-	       (clobber (match_operand:DI 1 "register_operand" ""))
-	       (use (match_operand:DI 2 "const_int_operand" ""))
-	       (use (match_operand 3 "" ""))
-	       (clobber (match_operand:DI 4 "register_operand" ""))
-	       (use (match_operand:DI 5 "register_operand" ""))])]
-  "reload_completed"
-  [(parallel [(set (match_dup 0)
-		   (unspec:BLK [(reg:DI XMM0_REG)
-				(reg:DI XMM1_REG)
-				(reg:DI XMM2_REG)
-				(reg:DI XMM3_REG)
-				(reg:DI XMM4_REG)
-				(reg:DI XMM5_REG)
-				(reg:DI XMM6_REG)
-				(reg:DI XMM7_REG)]
-			       UNSPEC_SSE_PROLOGUE_SAVE_LOW))
-	      (use (match_dup 1))
-	      (use (match_dup 2))
-	      (use (match_dup 3))
-	      (use (match_dup 5))])]
-{
-  /* Movaps is 4 bytes, AVX and movsd is 5 bytes.  */
-  int size = 4 + (TARGET_AVX || crtl->stack_alignment_needed < 128);
-
-  /* Compute address to jump to:
-     label - eax*size + nnamed_sse_arguments*size. */
-  if (size == 5)
-    emit_insn (gen_rtx_SET (VOIDmode, operands[4],
-			    gen_rtx_PLUS
-			      (Pmode,
-			       gen_rtx_MULT (Pmode, operands[1],
-					     GEN_INT (4)),
-			       operands[1])));
-  else  if (size == 4)
-    emit_insn (gen_rtx_SET (VOIDmode, operands[4],
-			    gen_rtx_MULT (Pmode, operands[1],
-					  GEN_INT (4))));
-  else
-    gcc_unreachable ();
-  if (INTVAL (operands[2]))
-    emit_move_insn
-      (operands[1],
-       gen_rtx_CONST (DImode,
-		      gen_rtx_PLUS (DImode,
-				    operands[3],
-				    GEN_INT (INTVAL (operands[2])
-					     * size))));
-  else
-    emit_move_insn (operands[1], operands[3]);
-  emit_insn (gen_subdi3 (operands[1], operands[1], operands[4]));
-  operands[5] = GEN_INT (size);
-})
-
-(define_insn "sse_prologue_save_insn"
-  [(set (mem:BLK (plus:DI (match_operand:DI 0 "register_operand" "R")
-			  (match_operand:DI 4 "const_int_operand" "n")))
-	(unspec:BLK [(reg:DI XMM0_REG)
-		     (reg:DI XMM1_REG)
-		     (reg:DI XMM2_REG)
-		     (reg:DI XMM3_REG)
-		     (reg:DI XMM4_REG)
-		     (reg:DI XMM5_REG)
-		     (reg:DI XMM6_REG)
-		     (reg:DI XMM7_REG)] UNSPEC_SSE_PROLOGUE_SAVE_LOW))
-   (use (match_operand:DI 1 "register_operand" "r"))
-   (use (match_operand:DI 2 "const_int_operand" "i"))
-   (use (label_ref:DI (match_operand 3 "" "X")))
-   (use (match_operand:DI 5 "const_int_operand" "i"))]
-  "TARGET_64BIT
-   && INTVAL (operands[4]) + X86_64_SSE_REGPARM_MAX * 16 - 16 < 128
-   && INTVAL (operands[4]) + INTVAL (operands[2]) * 16 >= -128"
-{
-  int i;
-  operands[0] = gen_rtx_MEM (Pmode,
-			     gen_rtx_PLUS (Pmode, operands[0], operands[4]));
-  /* VEX instruction with a REX prefix will #UD.  */
-  if (TARGET_AVX && GET_CODE (XEXP (operands[0], 0)) != PLUS)
-    gcc_unreachable ();
-
-  output_asm_insn ("jmp\t%A1", operands);
-  for (i = X86_64_SSE_REGPARM_MAX - 1; i >= INTVAL (operands[2]); i--)
-    {
-      operands[4] = adjust_address (operands[0], DImode, i*16);
-      operands[5] = gen_rtx_REG (TImode, SSE_REGNO (i));
-      PUT_MODE (operands[4], TImode);
-      if (GET_CODE (XEXP (operands[0], 0)) != PLUS)
-        output_asm_insn ("rex", operands);
-      if (crtl->stack_alignment_needed < 128)
-        output_asm_insn ("%vmovsd\t{%5, %4|%4, %5}", operands);
-      else
-        output_asm_insn ("%vmovaps\t{%5, %4|%4, %5}", operands);
-    }
-  targetm.asm_out.internal_label (asm_out_file, "L",
-				  CODE_LABEL_NUMBER (operands[3]));
-  return "";
-}
-  [(set_attr "type" "other")
-   (set_attr "length_immediate" "0")
-   (set_attr "length_address" "0")
-   ;; 2 bytes for jump and opernds[4] bytes for each save.
-   (set (attr "length")
-     (plus (const_int 2)
-	   (mult (symbol_ref ("INTVAL (operands[5])"))
-		 (symbol_ref ("X86_64_SSE_REGPARM_MAX - INTVAL (operands[2])")))))
-   (set_attr "memory" "store")
-   (set_attr "modrm" "0")
-   (set_attr "prefix" "maybe_vex")
-   (set_attr "mode" "DI")])
-
 (define_expand "prefetch"
   [(prefetch (match_operand 0 "address_operand" "")
 	     (match_operand:SI 1 "const_int_operand" "")