
[aarch64]: added variable issue rate feature for falkor

Message ID CAEwic4aTGH11iQa2iiVCUeYELYV3kYkdpYcA6SNhwQtm=qywww@mail.gmail.com
State New
Series [aarch64]: added variable issue rate feature for falkor

Commit Message

Kai Tietz Aug. 13, 2018, 8:16 a.m. UTC
Hello,

this patch implements a variable issue rate feature for the falkor CPU.
Additionally, this patch adjusts the issue rate for falkor to 8, as this
value better reflects the CPU's specification.

This patch was tested against SPEC 2017 & 2016 and showed in general
some improvements without any regressions for those tests.

ChangeLog:

        Jim Wilson <jim.wilson@linaro.org>
        Kai Tietz <kai.tietz@linaro.org>

        * config/aarch64.c (aarch64_sched_reorder): Implementing
        TARGET_SHED_REORDER hook.
        (aarch64_variable_issue): Implemented TARGET_SHED_VARIABLE_ISSUE
        hook.
        (TARGET_SHED_REORDER): Defined.
        (TARGET_SHED_VARIABLE_ISSUE): Likewise.
        * config/aarch64/falor.md (falkor_variable_issue): New.

Ok for apply?

Regards,
Kai

PS: I might need to update my key files for a stronger key.
Whom can I ask to do this?

Comments

Bernhard Reutner-Fischer Aug. 13, 2018, 1:12 p.m. UTC | #1
On 13 August 2018 10:16:20 CEST, Kai Tietz <ktietz70@googlemail.com> wrote:
>Hello,
>
>this patch implements variable issue rate feature for falkor cpu.
>Additionally this patch adjusts the issue rate for falkor 8 as this
>value reflects more cpu's specification.
>
>This patch was tested against SPEC 2017 & 2016 and showed in general
>some improvements without any regressions for thos tests.
>
>ChangeLog:
>
>        Jim Wilson <jim.wilson@linaro.org>
>        Kai Tietz <kai.tietz@linaro.org>
>
>        * config/aarch64.c (aarch64_sched_reorder): Implementing
>        TARGET_SHED_REORDER hook.

Present tense in ChangeLog please.

>       (aarch64_variable_issue): Implemented TARGET_SHED_VARIABLE_ISSUE
>        hook.
>        (TARGET_SHED_REORDER): Defined.
>        (TARGET_SHED_VARIABLE_ISSUE): Likewise.
>        * config/aarch64/falor.md (falkor_variable_issue): New.
>
>Ok for apply?

s/subtrated/subtracted/


>
>Regards,
>Kai
>
>PS: I might be in need to update my key-files for stronger bitness.
>Whom I can ask for doing this?

If you can still log in:
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02105.html

HTH,
Bernhard Reutner-Fischer Aug. 13, 2018, 1:43 p.m. UTC | #2
On 13 August 2018 15:12:30 CEST, Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote:
>On 13 August 2018 10:16:20 CEST, Kai Tietz <ktietz70@googlemail.com>
>wrote:
>>Hello,
>>
>>this patch implements variable issue rate feature for falkor cpu.
>>Additionally this patch adjusts the issue rate for falkor 8 as this
>>value reflects more cpu's specification.
>>
>>This patch was tested against SPEC 2017 & 2016 and showed in general
>>some improvements without any regressions for thos tests.
>>
>>ChangeLog:
>>
>>        Jim Wilson <jim.wilson@linaro.org>
>>        Kai Tietz <kai.tietz@linaro.org>
>>
>>        * config/aarch64.c (aarch64_sched_reorder): Implementing
>>        TARGET_SHED_REORDER hook.
>
>Present tense in ChangeLog please.
>
>>       (aarch64_variable_issue): Implemented
>TARGET_SHED_VARIABLE_ISSUE
>>        hook.
>>        (TARGET_SHED_REORDER): Defined.
>>        (TARGET_SHED_VARIABLE_ISSUE): Likewise.
>>        * config/aarch64/falor.md (falkor_variable_issue): New.
>>
>>Ok for apply?
>
>s/subtrated/subtracted/
>
>
>>
>>Regards,
>>Kai
>>
>>PS: I might be in need to update my key-files for stronger bitness.
>>Whom I can ask for doing this?
>
>If you still can login:
>https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02105.html

And if it's "just" that you had a DSA key, then add a new one while temporarily accepting the old one:
ssh -oPubkeyAcceptedKeyTypes=+ssh-dss ....

>HTH,
Kai Tietz Aug. 13, 2018, 2:29 p.m. UTC | #3
2018-08-13 15:43 GMT+02:00 Bernhard Reutner-Fischer <rep.dot.nop@gmail.com>:
> On 13 August 2018 15:12:30 CEST, Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote:
>>On 13 August 2018 10:16:20 CEST, Kai Tietz <ktietz70@googlemail.com>
>>wrote:
>>>Hello,
>>>
>>>this patch implements variable issue rate feature for falkor cpu.
>>>Additionally this patch adjusts the issue rate for falkor 8 as this
>>>value reflects more cpu's specification.
>>>
>>>This patch was tested against SPEC 2017 & 2016 and showed in general
>>>some improvements without any regressions for thos tests.
>>>
>>>ChangeLog:
>>>
>>>        Jim Wilson <jim.wilson@linaro.org>
>>>        Kai Tietz <kai.tietz@linaro.org>
>>>
>>>        * config/aarch64.c (aarch64_sched_reorder): Implementing
>>>        TARGET_SHED_REORDER hook.
>>
>>Present tense in ChangeLog please.
>>
>>>       (aarch64_variable_issue): Implemented
>>TARGET_SHED_VARIABLE_ISSUE
>>>        hook.
>>>        (TARGET_SHED_REORDER): Defined.
>>>        (TARGET_SHED_VARIABLE_ISSUE): Likewise.
>>>        * config/aarch64/falor.md (falkor_variable_issue): New.
>>>
>>>Ok for apply?
>>
>>s/subtrated/subtracted/
>>
>>
>>>
>>>Regards,
>>>Kai
>>>
>>>PS: I might be in need to update my key-files for stronger bitness.
>>>Whom I can ask for doing this?
>>
>>If you still can login:
>>https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02105.html
>
> And if it's "just" you had a DSA key then add a new one with temporarily accepting the old one:
> ssh -oPubkeyAcceptedKeyTypes=+ssh-dss ....
>
>>HTH,
>

Thanks for your help. I corrected the ChangeLog text accordingly.
Sadly, I don't have the old key anymore, so I would need an overseer to
replace my old key with a new one.

Kai
Segher Boessenkool Aug. 13, 2018, 3:51 p.m. UTC | #4
Hi!

On Mon, Aug 13, 2018 at 10:16:20AM +0200, Kai Tietz wrote:
>         * config/aarch64.c (aarch64_sched_reorder): Implementing
>         TARGET_SHED_REORDER hook.
>         (aarch64_variable_issue): Implemented TARGET_SHED_VARIABLE_ISSUE
>         hook.
>         (TARGET_SHED_REORDER): Defined.
>         (TARGET_SHED_VARIABLE_ISSUE): Likewise.
>         * config/aarch64/falor.md (falkor_variable_issue): New.

SCHED, not SHED :-)  And s/falor/falkor/ .


Segher
Kai Tietz Aug. 13, 2018, 4:13 p.m. UTC | #5
2018-08-13 17:51 GMT+02:00 Segher Boessenkool <segher@kernel.crashing.org>:
> Hi!
>
> On Mon, Aug 13, 2018 at 10:16:20AM +0200, Kai Tietz wrote:
>>         * config/aarch64.c (aarch64_sched_reorder): Implementing
>>         TARGET_SHED_REORDER hook.
>>         (aarch64_variable_issue): Implemented TARGET_SHED_VARIABLE_ISSUE
>>         hook.
>>         (TARGET_SHED_REORDER): Defined.
>>         (TARGET_SHED_VARIABLE_ISSUE): Likewise.
>>         * config/aarch64/falor.md (falkor_variable_issue): New.
>
> SCHED, not SHED :-)  And s/falor/falkor/ .
>
>
> Segher

:) Thanks, otherwise ok?

Kai
Segher Boessenkool Aug. 13, 2018, 4:23 p.m. UTC | #6
On Mon, Aug 13, 2018 at 06:13:58PM +0200, Kai Tietz wrote:
> 2018-08-13 17:51 GMT+02:00 Segher Boessenkool <segher@kernel.crashing.org>:
> > On Mon, Aug 13, 2018 at 10:16:20AM +0200, Kai Tietz wrote:
> >>         * config/aarch64.c (aarch64_sched_reorder): Implementing
> >>         TARGET_SHED_REORDER hook.
> >>         (aarch64_variable_issue): Implemented TARGET_SHED_VARIABLE_ISSUE
> >>         hook.
> >>         (TARGET_SHED_REORDER): Defined.
> >>         (TARGET_SHED_VARIABLE_ISSUE): Likewise.
> >>         * config/aarch64/falor.md (falkor_variable_issue): New.
> >
> > SCHED, not SHED :-)  And s/falor/falkor/ .
> 
> :) Thanks, otherwise ok?

I cannot okay patches for aarch64.  But I did not notice other typos,
if that is what you're asking.


Segher
Kai Tietz Aug. 13, 2018, 4:40 p.m. UTC | #7
2018-08-13 18:23 GMT+02:00 Segher Boessenkool <segher@kernel.crashing.org>:
> I cannot okay patches for aarch64.  But I did not notice other typos,
> if that is what you're asking.

Yes. Thanks.

So, is patch ok with updated ChangeLog:

        Jim Wilson <jim.wilson@linaro.org>
        Kai Tietz <kai.tietz@linaro.org>

        * config/aarch64.c (aarch64_sched_reorder): Implement
        TARGET_SCHED_REORDER hook.
        (aarch64_variable_issue): Implement TARGET_SCHED_VARIABLE_ISSUE
        hook.
        (TARGET_SCHED_REORDER): Define.
        (TARGET_SCHED_VARIABLE_ISSUE): Likewise.
        * config/aarch64/falkor.md (falkor_variable_issue): New.

Kai
Kai Tietz Aug. 13, 2018, 4:48 p.m. UTC | #8
I'm reposting the updated patch containing the ChangeLog entry.

Regards,
Kai
Jim Wilson <jim.wilson@linaro.org>
	Kai Tietz <kai.tietz@linaro.org>

	* config/aarch64.c (aarch64_sched_reorder): Implement
	TARGET_SCHED_REORDER hook.
	(aarch64_variable_issue): Implement TARGET_SCHED_VARIABLE_ISSUE
	hook.
	(TARGET_SCHED_REORDER): Define.
	(TARGET_SCHED_VARIABLE_ISSUE): Likewise.
	* config/aarch64/falkor.md (falkor_variable_issue): New.

Index: aarch64/aarch64.c
===================================================================
--- aarch64.orig/aarch64.c
+++ aarch64/aarch64.c
@@ -914,7 +914,7 @@ static const struct tune_params qdf24xx_
   &generic_branch_cost,
   &generic_approx_modes,
   4, /* memmov_cost  */
-  4, /* issue_rate  */
+  8, /* issue_rate  */
   (AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
    | AARCH64_FUSE_MOVK_MOVK), /* fuseable_ops  */
   "16",	/* function_align.  */
@@ -17551,6 +17551,105 @@ aarch64_run_selftests (void)
 
 #endif /* #if CHECKING_P */
 
+/* The number of micro ops left over after issuing the last instruction in a
+   cycle.  This is subtracted from the next cycle before we start issuing insns.
+   This is initialized to 0 at the start of every basic block.  */
+static int leftover_uops = 0;
+
+/* Implement TARGET_SCHED_REORDER.  */
+
+static int
+aarch64_sched_reorder (FILE *file, int verbose,
+		       rtx_insn **ready ATTRIBUTE_UNUSED,
+		       int *n_readyp ATTRIBUTE_UNUSED,
+		       int clock)
+{
+  int can_issue_more = aarch64_sched_issue_rate ();
+
+  if ((enum attr_tune) aarch64_tune == TUNE_FALKOR)
+    {
+      /* The start of a basic block.  */
+      if (clock == 0)
+	{
+	  if (leftover_uops && file && (verbose > 3))
+	    fprintf (file, ";;\tLeftover uops ignored at bb start.\n");
+
+	  leftover_uops = 0;
+	}
+
+      /* Account for issue slots left over from previous cycle.  This value
+	 can be larger than the number of issue slots per cycle, so we need
+	 to check it here before scheduling any instructions.  */
+      else if (leftover_uops)
+	{
+	  can_issue_more -= leftover_uops;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tUse %d issue slots for leftover uops.\n",
+		       leftover_uops);
+	      fprintf (file, ";;\t%d issue slots left.\n", can_issue_more);
+	    }
+
+	  leftover_uops = 0;
+
+	  if (can_issue_more < 0)
+	    {
+	      leftover_uops = 0 - can_issue_more;
+	      can_issue_more = 0;
+
+	      if (file && (verbose > 3))
+		{
+		  fprintf (file, ";;skipping issue cycle.\n");
+		  fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+		}
+	    }
+	}
+    }
+
+  return can_issue_more;
+}
+
+/* Implement TARGET_SCHED_VARIABLE_ISSUE.  */
+
+static int
+aarch64_variable_issue (FILE *file, int verbose,
+			rtx_insn *insn, int more)
+{
+  if (GET_CODE (PATTERN (insn)) != USE
+      && GET_CODE (PATTERN (insn)) != CLOBBER)
+    {
+      if ((enum attr_tune) aarch64_tune != TUNE_FALKOR)
+	more -= 1;
+      else
+	{
+	  int issue_slots = get_attr_falkor_variable_issue (insn);
+	  more -= issue_slots;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tInsn takes %d issue slots.\n", issue_slots);
+	      fprintf (file, ";;\t%d issue slots left.\n", more);
+	    }
+
+	  /* We schedule an instruction first, and then subtract issue slots,
+	     which means the result can be negative.  We carry the extra over
+	     to the next cycle.  */
+
+	  if (more < 0)
+	    {
+	      leftover_uops = 0 - more;
+	      more = 0;
+
+	      if (file && (verbose > 3))
+		fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+	    }
+	}
+    }
+
+  return more;
+}
+
 #undef TARGET_ADDRESS_COST
 #define TARGET_ADDRESS_COST aarch64_address_cost
 
@@ -17779,6 +17878,12 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_SCHED_ISSUE_RATE
 #define TARGET_SCHED_ISSUE_RATE aarch64_sched_issue_rate
 
+#undef TARGET_SCHED_REORDER
+#define TARGET_SCHED_REORDER aarch64_sched_reorder
+
+#undef TARGET_SCHED_VARIABLE_ISSUE
+#define TARGET_SCHED_VARIABLE_ISSUE aarch64_variable_issue
+
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   aarch64_sched_first_cycle_multipass_dfa_lookahead
Index: aarch64/falkor.md
===================================================================
--- aarch64.orig/falkor.md
+++ aarch64/falkor.md
@@ -685,3 +685,24 @@
 (define_bypass 3
   "falkor_afp_5_vxvy_mul,falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mul,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mul,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mul,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mul,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mul,falkor_fpdt_6_vxvy_mla"
   "falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mla")
+
+
+(define_attr "falkor_variable_issue" ""
+  (cond [
+;; A64 Instructions
+	 (eq_attr "type" "neon_fp_neg_s_q,neon_fp_neg_d_q,neon_fp_abs_s_q,neon_fp_abs_d_q,neon_fp_minmax_s_q,neon_fp_minmax_d_q,neon_fp_compare_s_q,neon_fp_compare_d_q,neon_fp_round_s_q,neon_fp_round_d_q,neon_fp_abd_s_q,neon_fp_abd_d_q,neon_fp_addsub_s_q,neon_fp_addsub_d_q,neon_fp_reduc_add_s_q,neon_fp_reduc_add_d_q,neon_fp_to_int_s_q,neon_fp_to_int_d_q,neon_int_to_fp_s_q,neon_int_to_fp_d_q,neon_fp_mla_d_q,neon_fp_mla_d_scalar_q,neon_fp_div_s,neon_fp_div_d,neon_fp_sqrt_s,neon_fp_sqrt_d,neon_shift_imm_long,neon_add_q,neon_reduc_add_q,neon_logic_q,neon_neg_q,neon_sub_q,neon_add_halve_q,neon_sub_halve_q,neon_shift_imm_q,neon_shift_reg_q,neon_minmax_q,neon_abs_q,neon_compare_q,neon_compare_zero_q,neon_tst_q,neon_reduc_add_long,neon_shift_acc_q,neon_reduc_add_acc_q,neon_abd_q,neon_abd_long,neon_qadd_q,neon_qsub_q,neon_qabs_q,neon_qneg_q,neon_sat_shift_imm_q,neon_sat_shift_reg_q,neon_mul_b_q,neon_mul_h_q,neon_mul_s_q,neon_mul_h_scalar_q,neon_mul_s_scalar_q,neon_sat_mul_b_q,neon_sat_mul_h_q,neon_sat_mul_s_q,neon_mul_b_long,neon_mul_h_long,neon_mul_s_long,neon_mul_d_long,neon_mul_h_scalar_long,neon_mul_s_scalar_long,neon_sat_mul_b_long,neon_sat_mul_h_long,neon_sat_mul_s_long,neon_sat_mul_h_scalar_q,neon_sat_mul_s_scalar_q,neon_sat_mul_h_scalar_long,neon_sat_mul_s_scalar_long,neon_mla_b_q,neon_mla_h_q,neon_mla_s_q,neon_mla_h_scalar_q,neon_mla_s_scalar_q,neon_mla_b_long,neon_mla_h_long,neon_mla_s_long,neon_mla_h_scalar_long,neon_mla_s_scalar_long,neon_sat_mla_b_long,neon_sat_mla_h_long,neon_sat_mla_s_long,neon_sat_mla_h_scalar_long,neon_sat_mla_s_scalar_long,neon_add_halve_narrow_q,neon_sub_halve_narrow_q,neon_arith_acc,neon_load1_2reg,neon_load2_2reg,neon_load2_all_lanes,neon_load1_2reg_q,neon_load2_2reg_q,neon_load2_all_lanes_q,neon_load3_one_lane,neon_load4_one_lane,neon_ldp,neon_ldp_q,neon_from_gp_q,neon_bsl_q,neon_dup_q,neon_ext_q,neon_move_q,neon_rev_q,neon_tbl1_q,neon_permute_q,neon_cls_q,neon_cnt_q,neon_rbit_q,neon_tbl2,neon_fp_recpe_s_q,neon_fp_recpe_d_q,neon_fp_rsqrte_s_q
,neon_fp_rsqrte_d_q,neon_fp_recps_s_q,neon_fp_recps_d_q,neon_fp_rsqrts_d_q,neon_store1_1reg,neon_store1_1reg_q,neon_store1_one_lane,neon_store1_one_lane_q,neon_store1_2reg,neon_store2_2reg,neon_store2_one_lane,neon_store2_one_lane_q,neon_stp,crypto_sha1_xor,crypto_sha256_fast,crypto_sha1_slow,crypto_sha256_slow,crypto_aese,f_stores,f_stored,fdivs,fdivd,sdiv,udiv")
+	   (const_int 2)
+	 (eq_attr "type" "neon_fp_cvt_narrow_s_q,neon_fp_cvt_narrow_d_q,neon_load1_3reg,neon_load3_3reg,neon_load3_all_lanes,neon_load1_3reg_q,neon_load3_3reg_q,neon_load3_all_lanes_q,neon_tbl2_q,neon_tbl3")
+	   (const_int 3)
+	 (eq_attr "type" "neon_fp_div_s_q,neon_fp_div_d_q,neon_fp_sqrt_s_q,neon_fp_sqrt_d_q,neon_add_widen,neon_sub_widen,neon_arith_acc_q,neon_load1_4reg,neon_load4_4reg,neon_load1_4reg_q,neon_load4_4reg_q,neon_load4_all_lanes,neon_load4_all_lanes_q,neon_tbl3_q,neon_tbl4,neon_store1_2reg_q,neon_store1_3reg,neon_store1_4reg,neon_store2_2reg_q,neon_store3_3reg,neon_store4_4reg,neon_store3_one_lane,neon_store3_one_lane_q,neon_store4_one_lane,neon_store4_one_lane_q,neon_stp_q")
+	   (const_int 4)
+	 (eq_attr "type" "neon_tbl4_q")
+	   (const_int 5)
+	 (eq_attr "type" "neon_store1_3reg_q,neon_store3_3reg_q")
+	   (const_int 6)
+	 (eq_attr "type" "neon_store1_4reg_q,neon_store4_4reg_q")
+	   (const_int 8)
+	 (eq_attr "type" "multiple")
+	   (const_int 2)
+	]
+	(const_int 1)))
Kyrill Tkachov Aug. 14, 2018, 2:50 p.m. UTC | #9
Hi Kai,

On 13/08/18 17:48, Kai Tietz wrote:
> I repost updated patch containing ChangeLog entry.
>
> Regards,
> Kai

I think I understand what this patch does; please correct me if I'm wrong.
You model the processor's micro-ops, and some A64 instructions use multiple micro-ops.
This is what the falkor_variable_issue attribute specifies.
In TARGET_SCHED_VARIABLE_ISSUE you count the issue slots that the micro-ops take and how much "space" is left over, which is stored in leftover_uops,
and you use leftover_uops in TARGET_SCHED_REORDER to tell the scheduler how many more micro-ops it can issue in that cycle.

And with that change the issue_rate is no longer the *instruction* issue rate, but rather the *micro-op* issue rate.
Overall this looks very similar to the implementation of this functionality in the powerpc port (rs6000).
Is this correct?

I have a few comments on the implementation inline...

	Jim Wilson<jim.wilson@linaro.org>
	Kai Tietz<kai.tietz@linaro.org>

	* config/aarch64.c (aarch64_sched_reorder): Implement
	TARGET_SCHED_REORDER hook.
	(aarch64_variable_issue): Implement TARGET_SCHED_VARIABLE_ISSUE
	hook.
	(TARGET_SCHED_REORDER): Define.
	(TARGET_SCHED_VARIABLE_ISSUE): Likewise.
	* config/aarch64/falkor.md (falkor_variable_issue): New.

Index: aarch64/aarch64.c
===================================================================
--- aarch64.orig/aarch64.c
+++ aarch64/aarch64.c
@@ -914,7 +914,7 @@ static const struct tune_params qdf24xx_
    &generic_branch_cost,
    &generic_approx_modes,
    4, /* memmov_cost  */
-  4, /* issue_rate  */
+  8, /* issue_rate  */
    (AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
     | AARCH64_FUSE_MOVK_MOVK), /* fuseable_ops  */
    "16",	/* function_align.  */
@@ -17551,6 +17551,105 @@ aarch64_run_selftests (void)
  
  #endif /* #if CHECKING_P */
  
+/* The number of micro ops left over after issuing the last instruction in a
+   cycle.  This is subtracted from the next cycle before we start issuing insns.
+   This is initialized to 0 at the start of every basic block.  */
+static int leftover_uops = 0;
+

I believe the scheduler provides hooks specifically for storing backend-specific scheduling state, so we should
avoid creating such static variables in aarch64.c. Can you use the TARGET_SCHED_*_SCHED_CONTEXT family of hooks here?
Then it will be up to the scheduler midend to keep track of the state between basic blocks, functions etc.

  +/* Implement TARGET_SCHED_REORDER.  */
+
+static int
+aarch64_sched_reorder (FILE *file, int verbose,
+		       rtx_insn **ready ATTRIBUTE_UNUSED,
+		       int *n_readyp ATTRIBUTE_UNUSED,
+		       int clock)
+{
+  int can_issue_more = aarch64_sched_issue_rate ();
+
+  if ((enum attr_tune) aarch64_tune == TUNE_FALKOR)
+    {
+      /* The start of a basic block.  */
+      if (clock == 0)
+	{
+	  if (leftover_uops && file && (verbose > 3))
+	    fprintf (file, ";;\tLeftover uops ignored at bb start.\n");
+
+	  leftover_uops = 0;
+	}
+
+      /* Account for issue slots left over from previous cycle.  This value
+	 can be larger than the number of issue slots per cycle, so we need
+	 to check it here before scheduling any instructions.  */
+      else if (leftover_uops)
+	{
+	  can_issue_more -= leftover_uops;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tUse %d issue slots for leftover uops.\n",
+		       leftover_uops);
+	      fprintf (file, ";;\t%d issue slots left.\n", can_issue_more);
+	    }
+
+	  leftover_uops = 0;
+
+	  if (can_issue_more < 0)
+	    {
+	      leftover_uops = 0 - can_issue_more;
+	      can_issue_more = 0;
+
+	      if (file && (verbose > 3))
+		{
+		  fprintf (file, ";;skipping issue cycle.\n");
+		  fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+		}
+	    }
+	}
+    }
+
+  return can_issue_more;
+}
+
+/* Implement TARGET_SCHED_VARIABLE_ISSUE.  */
+

A comment here like you have for TARGET_SCHED_REORDER describing what this function accomplishes would be very helpful.

  +static int
+aarch64_variable_issue (FILE *file, int verbose,
+			rtx_insn *insn, int more)
+{
+  if (GET_CODE (PATTERN (insn)) != USE
+      && GET_CODE (PATTERN (insn)) != CLOBBER)
+    {
+      if ((enum attr_tune) aarch64_tune != TUNE_FALKOR)
+	more -= 1;
+      else
+	{
+	  int issue_slots = get_attr_falkor_variable_issue (insn);
+	  more -= issue_slots;
+


We generally try to avoid having explicit CPU-specific checks like this in the aarch64 backend.
Instead we try to keep all the CPU-specific tuning information in the CPU tuning structures.

This particular infrastructure looks like it could be used for other CPUs in the future. In order for that to happen we don't want
to have a check of aarch64_tune, but rather a tuning flag defined in aarch64-tuning-flags.def that tells us whether we're scheduling
considering micro-ops or not. It can be on by default in the falkor tuning struct.

Then, we'd want the falkor_variable_issue attribute to be a generic attribute that specifies the number of micro-ops per instruction and per CPU.
Unfortunately I don't see how this could be done in the current RTL attribute infrastructure without creating a big unwieldy attribute that will be hard to keep
up to date :( Ideally we'd want something that could be specified in each CPU's .md file. Any ideas?

Thanks,
Kyrill

+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tInsn takes %d issue slots.\n", issue_slots);
+	      fprintf (file, ";;\t%d issue slots left.\n", more);
+	    }
+
+	  /* We schedule an instruction first, and then subtract issue slots,
+	     which means the result can be negative.  We carry the extra over
+	     to the next cycle.  */
+
+	  if (more < 0)
+	    {
+	      leftover_uops = 0 - more;
+	      more = 0;
+
+	      if (file && (verbose > 3))
+		fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+	    }
+	}
+    }
+
+  return more;
+}
+
  #undef TARGET_ADDRESS_COST
  #define TARGET_ADDRESS_COST aarch64_address_cost
  
@@ -17779,6 +17878,12 @@ aarch64_libgcc_floating_mode_supported_p
  #undef TARGET_SCHED_ISSUE_RATE
  #define TARGET_SCHED_ISSUE_RATE aarch64_sched_issue_rate
  
+#undef TARGET_SCHED_REORDER
+#define TARGET_SCHED_REORDER aarch64_sched_reorder
+
+#undef TARGET_SCHED_VARIABLE_ISSUE
+#define TARGET_SCHED_VARIABLE_ISSUE aarch64_variable_issue
+
  #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
    aarch64_sched_first_cycle_multipass_dfa_lookahead
Index: aarch64/falkor.md
===================================================================
--- aarch64.orig/falkor.md
+++ aarch64/falkor.md
@@ -685,3 +685,24 @@
  (define_bypass 3
    "falkor_afp_5_vxvy_mul,falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mul,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mul,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mul,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mul,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mul,falkor_fpdt_6_vxvy_mla"
    "falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mla")
+
+
+(define_attr "falkor_variable_issue" ""
+  (cond [
+;; A64 Instructions
+	 (eq_attr "type" "neon_fp_neg_s_q,neon_fp_neg_d_q,neon_fp_abs_s_q,neon_fp_abs_d_q,neon_fp_minmax_s_q,neon_fp_minmax_d_q,neon_fp_compare_s_q,neon_fp_compare_d_q,neon_fp_round_s_q,neon_fp_round_d_q,neon_fp_abd_s_q,neon_fp_abd_d_q,neon_fp_addsub_s_q,neon_fp_addsub_d_q,neon_fp_reduc_add_s_q,neon_fp_reduc_add_d_q,neon_fp_to_int_s_q,neon_fp_to_int_d_q,neon_int_to_fp_s_q,neon_int_to_fp_d_q,neon_fp_mla_d_q,neon_fp_mla_d_scalar_q,neon_fp_div_s,neon_fp_div_d,neon_fp_sqrt_s,neon_fp_sqrt_d,neon_shift_imm_long,neon_add_q,neon_reduc_add_q,neon_logic_q,neon_neg_q,neon_sub_q,neon_add_halve_q,neon_sub_halve_q,neon_shift_imm_q,neon_shift_reg_q,neon_minmax_q,neon_abs_q,neon_compare_q,neon_compare_zero_q,neon_tst_q,neon_reduc_add_long,neon_shift_acc_q,neon_reduc_add_acc_q,neon_abd_q,neon_abd_long,neon_qadd_q,neon_qsub_q,neon_qabs_q,neon_qneg_q,neon_sat_shift_imm_q,neon_sat_shift_reg_q,neon_mul_b_q,neon_mul_h_q,neon_mul_s_q,neon_mul_h_scalar_q,neon_mul_s_scalar_q,neon_sat_mul_b_q,neon_sat_mul_h_q,neon_sat_mul_s_q,neon_mul_b_long,neon_mul_h_long,neon_mul_s_long,neon_mul_d_long,neon_mul_h_scalar_long,neon_mul_s_scalar_long,neon_sat_mul_b_long,neon_sat_mul_h_long,neon_sat_mul_s_long,neon_sat_mul_h_scalar_q,neon_sat_mul_s_scalar_q,neon_sat_mul_h_scalar_long,neon_sat_mul_s_scalar_long,neon_mla_b_q,neon_mla_h_q,neon_mla_s_q,neon_mla_h_scalar_q,neon_mla_s_scalar_q,neon_mla_b_long,neon_mla_h_long,neon_mla_s_long,neon_mla_h_scalar_long,neon_mla_s_scalar_long,neon_sat_mla_b_long,neon_sat_mla_h_long,neon_sat_mla_s_long,neon_sat_mla_h_scalar_long,neon_sat_mla_s_scalar_long,neon_add_halve_narrow_q,neon_sub_halve_narrow_q,neon_arith_acc,neon_load1_2reg,neon_load2_2reg,neon_load2_all_lanes,neon_load1_2reg_q,neon_load2_2reg_q,neon_load2_all_lanes_q,neon_load3_one_lane,neon_load4_one_lane,neon_ldp,neon_ldp_q,neon_from_gp_q,neon_bsl_q,neon_dup_q,neon_ext_q,neon_move_q,neon_rev_q,neon_tbl1_q,neon_permute_q,neon_cls_q,neon_cnt_q,neon_rbit_q,neon_tbl2,neon_fp_recpe_s_q,neon_fp_recpe_d_q,neon_fp_rsqrte_s_q
,neon_fp_rsqrte_d_q,neon_fp_recps_s_q,neon_fp_recps_d_q,neon_fp_rsqrts_d_q,neon_store1_1reg,neon_store1_1reg_q,neon_store1_one_lane,neon_store1_one_lane_q,neon_store1_2reg,neon_store2_2reg,neon_store2_one_lane,neon_store2_one_lane_q,neon_stp,crypto_sha1_xor,crypto_sha256_fast,crypto_sha1_slow,crypto_sha256_slow,crypto_aese,f_stores,f_stored,fdivs,fdivd,sdiv,udiv")
+	   (const_int 2)
+	 (eq_attr "type" "neon_fp_cvt_narrow_s_q,neon_fp_cvt_narrow_d_q,neon_load1_3reg,neon_load3_3reg,neon_load3_all_lanes,neon_load1_3reg_q,neon_load3_3reg_q,neon_load3_all_lanes_q,neon_tbl2_q,neon_tbl3")
+	   (const_int 3)
+	 (eq_attr "type" "neon_fp_div_s_q,neon_fp_div_d_q,neon_fp_sqrt_s_q,neon_fp_sqrt_d_q,neon_add_widen,neon_sub_widen,neon_arith_acc_q,neon_load1_4reg,neon_load4_4reg,neon_load1_4reg_q,neon_load4_4reg_q,neon_load4_all_lanes,neon_load4_all_lanes_q,neon_tbl3_q,neon_tbl4,neon_store1_2reg_q,neon_store1_3reg,neon_store1_4reg,neon_store2_2reg_q,neon_store3_3reg,neon_store4_4reg,neon_store3_one_lane,neon_store3_one_lane_q,neon_store4_one_lane,neon_store4_one_lane_q,neon_stp_q")
+	   (const_int 4)
+	 (eq_attr "type" "neon_tbl4_q")
+	   (const_int 5)
+	 (eq_attr "type" "neon_store1_3reg_q,neon_store3_3reg_q")
+	   (const_int 6)
+	 (eq_attr "type" "neon_store1_4reg_q,neon_store4_4reg_q")
+	   (const_int 8)
+	 (eq_attr "type" "multiple")
+	   (const_int 2)
+	]
+	(const_int 1)))
Kai Tietz Aug. 15, 2018, 10:45 a.m. UTC | #10
Hello Kyrill,

thanks for your comments.

2018-08-14 16:50 GMT+02:00 Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com>:
> Hi Kai,
>
>
> On 13/08/18 17:48, Kai Tietz wrote:
>>
>> I repost updated patch containing ChangeLog entry.
>>
>> Regards,
>> Kai
>
>
> I think I understand what this patch does, please correct me if I'm wrong.
> You model the processors micro-ops and some A64 instructions use multiple
> micro-ops.
> This is what the falkor_variable_issue attribute specifies.
> In TARGET_SCHED_VARIABLE_ISSUE you count the issue slots that the micro-ops
> take and how much "space" is left over, which is stored in leftover_uops
> and you use leftover_uops in TARGET_SCHED_REORDER to tell the scheduler how
> many more micro-ops it can issue on that cycle.

Correct.

> And with that change the issue_rate is no longer the *instruction* issue
> rate, but rather the *micro-op* issue rate.
> Overall this looks very similar to the implementation of this functionality
> in the powerpc port (rs6000).
> Is this correct?

Yes, it is somewhat similar to the rs6000 variant.

> I have a few comments on the implementation inline...
>
>         Jim Wilson<jim.wilson@linaro.org>
>         Kai Tietz<kai.tietz@linaro.org>
>
>         * config/aarch64.c (aarch64_sched_reorder): Implement
>         TARGET_SCHED_REORDER hook.
>         (aarch64_variable_issue): Implement TARGET_SCHED_VARIABLE_ISSUE
>         hook.
>         (TARGET_SCHED_REORDER): Define.
>         (TARGET_SCHED_VARIABLE_ISSUE): Likewise.
>         * config/aarch64/falkor.md (falkor_variable_issue): New.
>
> Index: aarch64/aarch64.c
> ===================================================================
> --- aarch64.orig/aarch64.c
> +++ aarch64/aarch64.c
> @@ -914,7 +914,7 @@ static const struct tune_params qdf24xx_
>    &generic_branch_cost,
>    &generic_approx_modes,
>    4, /* memmov_cost  */
> -  4, /* issue_rate  */
> +  8, /* issue_rate  */
>    (AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
>     | AARCH64_FUSE_MOVK_MOVK), /* fuseable_ops  */
>    "16",        /* function_align.  */
> @@ -17551,6 +17551,105 @@ aarch64_run_selftests (void)
>   #endif /* #if CHECKING_P */
>  +/* The number of micro ops left over after issuing the last instruction in
> a
> +   cycle.  This is subtracted from the next cycle before we start issuing
> insns.
> +   This is initialized to 0 at the start of every basic block.  */
> +static int leftover_uops = 0;
> +
>
> I believe the scheduler provides hooks specifically for storing
> backend-specific scheduling state so we should
> avoid creating such static variables in aarch64.c. Can you use the
> TARGET_SCHED_*_SCHED_CONTEXT family of hooks here?
> Then it will be up to the scheduler midend to keep track of the state and
> between basic blocks, functions etc.

I think you are referring to the ppc implementation. But if you take a
closer look at it, you will see that such an implementation will
nevertheless require global variables.
So I am not really sure it is worth introducing the ..._SCHED_CONTEXT
API to avoid one global variable by introducing at least two others.
Nevertheless, I admit that making use of SCHED_CONTEXT could be a
general nice-to-have, but it is not necessarily a gain in this case.

>  +/* Implement TARGET_SCHED_REORDER.  */
> +
> +static int
> +aarch64_sched_reorder (FILE *file, int verbose,
> +                      rtx_insn **ready ATTRIBUTE_UNUSED,
> +                      int *n_readyp ATTRIBUTE_UNUSED,
> +                      int clock)
> +{
> +  int can_issue_more = aarch64_sched_issue_rate ();
> +
> +  if ((enum attr_tune) aarch64_tune == TUNE_FALKOR)
> +    {
> +      /* The start of a basic block.  */
> +      if (clock == 0)
> +       {
> +         if (leftover_uops && file && (verbose > 3))
> +           fprintf (file, ";;\tLeftover uops ignored at bb start.\n");
> +
> +         leftover_uops = 0;
> +       }
> +
> +      /* Account for issue slots left over from previous cycle.  This value
> +        can be larger than the number of issue slots per cycle, so we need
> +        to check it here before scheduling any instructions.  */
> +      else if (leftover_uops)
> +       {
> +         can_issue_more -= leftover_uops;
> +
> +         if (file && (verbose > 3))
> +           {
> +             fprintf (file, ";;\tUse %d issue slots for leftover uops.\n",
> +                      leftover_uops);
> +             fprintf (file, ";;\t%d issue slots left.\n", can_issue_more);
> +           }
> +
> +         leftover_uops = 0;
> +
> +         if (can_issue_more < 0)
> +           {
> +             leftover_uops = 0 - can_issue_more;
> +             can_issue_more = 0;
> +
> +             if (file && (verbose > 3))
> +               {
> +                 fprintf (file, ";;skipping issue cycle.\n");
> +                 fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
> +               }
> +           }
> +       }
> +    }
> +
> +  return can_issue_more;
> +}
> +
> +/* Implement TARGET_SCHED_VARIABLE_ISSUE.  */
> +
>
> A comment here like you have for TARGET_SCHED_REORDER describing what this
> function accomplishes would be very helpful.

Ok. As it is a simple implementation of a well-described hook, it
didn't seem to me that it would need more commentary, but I am open to
improving it.

>  +static int
> +aarch64_variable_issue (FILE *file, int verbose,
> +                       rtx_insn *insn, int more)
> +{
> +  if (GET_CODE (PATTERN (insn)) != USE
> +      && GET_CODE (PATTERN (insn)) != CLOBBER)
> +    {
> +      if ((enum attr_tune) aarch64_tune != TUNE_FALKOR)
> +       more -= 1;
> +      else
> +       {
> +         int issue_slots = get_attr_falkor_variable_issue (insn);
> +         more -= issue_slots;
> +
>
>
> We generally try to avoid having explicit CPU-specific checks like this in
> the aarch64 backend.
> Instead we try to keep all the CPU-specific tuning information in the CPU
> tuning structures.
>
> This particular infrastructure looks like it could be used for other CPUs in
> the future. In order for that to happen we don't want
> to have a check of aarch64_tune, but rather a tuning flag defined in
> aarch64-tuning-flags.def that tells us whether we're scheduling
> considering micro-ops or not. It can be on by default in the falkor tuning
> struct.

Ok, this is fair. I will add such a flag to tuning-flags.def indicating
whether to schedule for micro-ops or not.

> Then, we'd want the falkor_variable_issue attribute to be a generic
> attribute that specifies the number of micro-ops per-instruction and
> per-cpu.
Ok, so I will rename the hook to a generic name instead.

> Unfortunately I'm don't see how this could be done in the current RTL
> attribute infrastructure without creating a big unwieldy attribute that will
> be hard to keep
> up to date :( Ideally we'd want something that could be specified in each
> CPU's .md file. Any ideas?

Sadly, no. I have thought about this too. The approach via such a 'big'
attribute looks hard to handle long-term. As you mentioned, maintaining
such a beast would be gross.
Therefore I decided for now to address it in a CPU-flavor-specific way.

> Thanks,
> Kyrill
>

Thanks,
Kai
Steve Ellcey Aug. 15, 2018, 6:06 p.m. UTC | #11
On Wed, 2018-08-15 at 12:45 +0200, Kai Tietz wrote:

> > I believe the scheduler provides hooks specifically for storing
> > backend-specific scheduling state so we should
> > avoid creating such static variables in aarch64.c. Can you use the
> > TARGET_SCHED_*_SCHED_CONTEXT family of hooks here?
> > Then it will be up to the scheduler midend to keep track of the
> > state and
> > between basic blocks, functions etc.
> I think you are referring to the ppc implementation. But if you take a
> closer look at it, you will see that such an implementation nevertheless
> requires global variables.
> So I am not really sure it is worth introducing the ..._SCHED_CONTEXT
> API just to avoid one global variable by introducing at least two others.
> Nevertheless, I admit that making use of SCHED_CONTEXT could be a
> general nice-to-have, but it is not necessarily a gain in this case.

I think having the SCHED_CONTEXT infrastructure could be helpful to me
for Thunderx2.  I am looking at an issue that I
think TARGET_SCHED_VARIABLE_ISSUE will help with, so having the general
API available would be useful.

I am looking at the lbm benchmark from SPEC and I believe it is
overscheduling floating-point instructions; if I cut the issue rate
from 4 to 2, I get better performance on this one benchmark, but not
on others.

According to the Thunderx2 description, we can dispatch up to 4 uops
per clock, but only up to 2 FP/SIMD uops.  I know dispatch is different
from issue, but I don't think that GCC models both.  For my patch
I would probably want to save the previously scheduled instruction so
that if it and the current one are both FP/SIMD ops, then that is all
we can issue.  I might need to save several instructions, not just the
last one, to get everything correct.

Steve Ellcey
sellcey@cavium.com
Kai Tietz Oct. 11, 2018, 3:25 p.m. UTC | #12
Hi,

I reworked the patch to use a tuning flag instead of checking
explicitly for the CPU flavor. I will soon send an update which won't
use the static variable anymore and will use the SCHED API instead.

I would first like to get some comments on the current version.

Regards,
Kai
        Jim Wilson <jim.wilson@linaro.org>
        Kai Tietz <kai.tietz@linaro.org>

        * config/aarch64/aarch64.c (aarch64_sched_reorder): Implement
        TARGET_SCHED_REORDER hook.
        (aarch64_variable_issue): Implement TARGET_SCHED_VARIABLE_ISSUE
        hook.
        (qdf24xx_): Add AARCH64_EXTRA_TUNE_SCHED_MICRO_OPS tune flag.
        (TARGET_SCHED_REORDER): Define.
        (TARGET_SCHED_VARIABLE_ISSUE): Likewise.
        * config/aarch64/falkor.md (falkor_variable_issue): New.
        * config/aarch64/aarch64-tuning-flags.def (SCHED_MICRO_OPS): New flag.

Index: trunk/gcc/config/aarch64/aarch64.c
===================================================================
--- trunk.orig/gcc/config/aarch64/aarch64.c
+++ trunk/gcc/config/aarch64/aarch64.c
@@ -955,7 +955,7 @@ static const struct tune_params qdf24xx_
   &generic_branch_cost,
   &generic_approx_modes,
   4, /* memmov_cost  */
-  4, /* issue_rate  */
+  8, /* issue_rate  */
   (AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
    | AARCH64_FUSE_MOVK_MOVK), /* fuseable_ops  */
   "16",	/* function_align.  */
@@ -968,7 +968,7 @@ static const struct tune_params qdf24xx_
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
+  AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS | AARCH64_EXTRA_TUNE_SCHED_MICRO_OPS, /* tune_flags.  */
   &qdf24xx_prefetch_tune
 };
 
@@ -18037,6 +18037,109 @@ aarch64_run_selftests (void)
 
 #endif /* #if CHECKING_P */
 
+/* The number of micro ops left over after issuing the last instruction in a
+   cycle.  This is subtracted from the next cycle before we start issuing insns.
+   This is initialized to 0 at the start of every basic block.  */
+static int leftover_uops = 0;
+
+/* Implement TARGET_SCHED_REORDER.  */
+
+static int
+aarch64_sched_reorder (FILE *file, int verbose,
+		       rtx_insn **ready ATTRIBUTE_UNUSED,
+		       int *n_readyp ATTRIBUTE_UNUSED,
+		       int clock)
+{
+  int can_issue_more = aarch64_sched_issue_rate ();
+
+  if ((aarch64_tune_params.extra_tuning_flags
+       & AARCH64_EXTRA_TUNE_SCHED_MICRO_OPS) != 0)
+    {
+      /* The start of a basic block.  */
+      if (clock == 0)
+	{
+	  if (leftover_uops && file && (verbose > 3))
+	    fprintf (file, ";;\tLeftover uops ignored at bb start.\n");
+
+	  leftover_uops = 0;
+	}
+
+      /* Account for issue slots left over from previous cycle.  This value
+	 can be larger than the number of issue slots per cycle, so we need
+	 to check it here before scheduling any instructions.  */
+      else if (leftover_uops)
+	{
+	  can_issue_more -= leftover_uops;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tUse %d issue slots for leftover uops.\n",
+		       leftover_uops);
+	      fprintf (file, ";;\t%d issue slots left.\n", can_issue_more);
+	    }
+
+	  leftover_uops = 0;
+
+	  if (can_issue_more < 0)
+	    {
+	      leftover_uops = 0 - can_issue_more;
+	      can_issue_more = 0;
+
+	      if (file && (verbose > 3))
+		{
+		  fprintf (file, ";;skipping issue cycle.\n");
+		  fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+		}
+	    }
+	}
+    }
+
+  return can_issue_more;
+}
+
+/* Implement TARGET_SCHED_VARIABLE_ISSUE.  */
+
+static int
+aarch64_variable_issue (FILE *file, int verbose,
+			rtx_insn *insn, int more)
+{
+  if (GET_CODE (PATTERN (insn)) != USE
+      && GET_CODE (PATTERN (insn)) != CLOBBER)
+    {
+      if ((aarch64_tune_params.extra_tuning_flags
+	   & AARCH64_EXTRA_TUNE_SCHED_MICRO_OPS) == 0)
+	more -= 1;
+      else
+	{
+	  /* For now only the falkor target supports scheduling of micro
+	     operations, therefore we don't need to check.  */
+	  int issue_slots = get_attr_falkor_variable_issue (insn);
+	  more -= issue_slots;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tInsn takes %d issue slots.\n", issue_slots);
+	      fprintf (file, ";;\t%d issue slots left.\n", more);
+	    }
+
+	  /* We schedule an instruction first, and then subtract issue slots,
+	     which means the result can be negative.  We carry the extra over
+	     to the next cycle.  */
+
+	  if (more < 0)
+	    {
+	      leftover_uops = 0 - more;
+	      more = 0;
+
+	      if (file && (verbose > 3))
+		fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+	    }
+	}
+    }
+
+  return more;
+}
+
 #undef TARGET_ADDRESS_COST
 #define TARGET_ADDRESS_COST aarch64_address_cost
 
@@ -18265,6 +18368,12 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_SCHED_ISSUE_RATE
 #define TARGET_SCHED_ISSUE_RATE aarch64_sched_issue_rate
 
+#undef TARGET_SCHED_REORDER
+#define TARGET_SCHED_REORDER aarch64_sched_reorder
+
+#undef TARGET_SCHED_VARIABLE_ISSUE
+#define TARGET_SCHED_VARIABLE_ISSUE aarch64_variable_issue
+
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   aarch64_sched_first_cycle_multipass_dfa_lookahead
Index: trunk/gcc/config/aarch64/falkor.md
===================================================================
--- trunk.orig/gcc/config/aarch64/falkor.md
+++ trunk/gcc/config/aarch64/falkor.md
@@ -685,3 +685,24 @@
 (define_bypass 3
   "falkor_afp_5_vxvy_mul,falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mul,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mul,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mul,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mul,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mul,falkor_fpdt_6_vxvy_mla"
   "falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mla")
+
+
+(define_attr "falkor_variable_issue" ""
+  (cond [
+;; A64 Instructions
+	 (eq_attr "type" "neon_fp_neg_s_q,neon_fp_neg_d_q,neon_fp_abs_s_q,neon_fp_abs_d_q,neon_fp_minmax_s_q,neon_fp_minmax_d_q,neon_fp_compare_s_q,neon_fp_compare_d_q,neon_fp_round_s_q,neon_fp_round_d_q,neon_fp_abd_s_q,neon_fp_abd_d_q,neon_fp_addsub_s_q,neon_fp_addsub_d_q,neon_fp_reduc_add_s_q,neon_fp_reduc_add_d_q,neon_fp_to_int_s_q,neon_fp_to_int_d_q,neon_int_to_fp_s_q,neon_int_to_fp_d_q,neon_fp_mla_d_q,neon_fp_mla_d_scalar_q,neon_fp_div_s,neon_fp_div_d,neon_fp_sqrt_s,neon_fp_sqrt_d,neon_shift_imm_long,neon_add_q,neon_reduc_add_q,neon_logic_q,neon_neg_q,neon_sub_q,neon_add_halve_q,neon_sub_halve_q,neon_shift_imm_q,neon_shift_reg_q,neon_minmax_q,neon_abs_q,neon_compare_q,neon_compare_zero_q,neon_tst_q,neon_reduc_add_long,neon_shift_acc_q,neon_reduc_add_acc_q,neon_abd_q,neon_abd_long,neon_qadd_q,neon_qsub_q,neon_qabs_q,neon_qneg_q,neon_sat_shift_imm_q,neon_sat_shift_reg_q,neon_mul_b_q,neon_mul_h_q,neon_mul_s_q,neon_mul_h_scalar_q,neon_mul_s_scalar_q,neon_sat_mul_b_q,neon_sat_mul_h_q,neon_sat_mul_s_q,neon_mul_b_long,neon_mul_h_long,neon_mul_s_long,neon_mul_d_long,neon_mul_h_scalar_long,neon_mul_s_scalar_long,neon_sat_mul_b_long,neon_sat_mul_h_long,neon_sat_mul_s_long,neon_sat_mul_h_scalar_q,neon_sat_mul_s_scalar_q,neon_sat_mul_h_scalar_long,neon_sat_mul_s_scalar_long,neon_mla_b_q,neon_mla_h_q,neon_mla_s_q,neon_mla_h_scalar_q,neon_mla_s_scalar_q,neon_mla_b_long,neon_mla_h_long,neon_mla_s_long,neon_mla_h_scalar_long,neon_mla_s_scalar_long,neon_sat_mla_b_long,neon_sat_mla_h_long,neon_sat_mla_s_long,neon_sat_mla_h_scalar_long,neon_sat_mla_s_scalar_long,neon_add_halve_narrow_q,neon_sub_halve_narrow_q,neon_arith_acc,neon_load1_2reg,neon_load2_2reg,neon_load2_all_lanes,neon_load1_2reg_q,neon_load2_2reg_q,neon_load2_all_lanes_q,neon_load3_one_lane,neon_load4_one_lane,neon_ldp,neon_ldp_q,neon_from_gp_q,neon_bsl_q,neon_dup_q,neon_ext_q,neon_move_q,neon_rev_q,neon_tbl1_q,neon_permute_q,neon_cls_q,neon_cnt_q,neon_rbit_q,neon_tbl2,neon_fp_recpe_s_q,neon_fp_recpe_d_q,neon_fp_rsqrte_s_q
,neon_fp_rsqrte_d_q,neon_fp_recps_s_q,neon_fp_recps_d_q,neon_fp_rsqrts_d_q,neon_store1_1reg,neon_store1_1reg_q,neon_store1_one_lane,neon_store1_one_lane_q,neon_store1_2reg,neon_store2_2reg,neon_store2_one_lane,neon_store2_one_lane_q,neon_stp,crypto_sha1_xor,crypto_sha256_fast,crypto_sha1_slow,crypto_sha256_slow,crypto_aese,f_stores,f_stored,fdivs,fdivd,sdiv,udiv")
+	   (const_int 2)
+	 (eq_attr "type" "neon_fp_cvt_narrow_s_q,neon_fp_cvt_narrow_d_q,neon_load1_3reg,neon_load3_3reg,neon_load3_all_lanes,neon_load1_3reg_q,neon_load3_3reg_q,neon_load3_all_lanes_q,neon_tbl2_q,neon_tbl3")
+	   (const_int 3)
+	 (eq_attr "type" "neon_fp_div_s_q,neon_fp_div_d_q,neon_fp_sqrt_s_q,neon_fp_sqrt_d_q,neon_add_widen,neon_sub_widen,neon_arith_acc_q,neon_load1_4reg,neon_load4_4reg,neon_load1_4reg_q,neon_load4_4reg_q,neon_load4_all_lanes,neon_load4_all_lanes_q,neon_tbl3_q,neon_tbl4,neon_store1_2reg_q,neon_store1_3reg,neon_store1_4reg,neon_store2_2reg_q,neon_store3_3reg,neon_store4_4reg,neon_store3_one_lane,neon_store3_one_lane_q,neon_store4_one_lane,neon_store4_one_lane_q,neon_stp_q")
+	   (const_int 4)
+	 (eq_attr "type" "neon_tbl4_q")
+	   (const_int 5)
+	 (eq_attr "type" "neon_store1_3reg_q,neon_store3_3reg_q")
+	   (const_int 6)
+	 (eq_attr "type" "neon_store1_4reg_q,neon_store4_4reg_q")
+	   (const_int 8)
+	 (eq_attr "type" "multiple")
+	   (const_int 2)
+	]
+	(const_int 1)))
Index: trunk/gcc/config/aarch64/aarch64-tuning-flags.def
===================================================================
--- trunk.orig/gcc/config/aarch64/aarch64-tuning-flags.def
+++ trunk/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -46,4 +46,7 @@ AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
 
+/* Enable micro-op based scheduling instead of the default per-instruction one.  */
+AARCH64_EXTRA_TUNING_OPTION ("sched_microops", SCHED_MICRO_OPS)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
Kyrill Tkachov Oct. 16, 2018, 4:59 p.m. UTC | #13
Hi Kai,

On 11/10/18 16:25, Kai Tietz wrote:
> Hi,
>
> I reworked the patch to use a tuning flag instead of checking
> explicitly for the CPU flavor. I will soon send an update which won't
> use the static variable anymore and will use the SCHED API instead.
>
> I would first like to get some comments on the current version.
>
> Regards,
> Kai

Index: trunk/gcc/config/aarch64/aarch64.c
===================================================================
--- trunk.orig/gcc/config/aarch64/aarch64.c
+++ trunk/gcc/config/aarch64/aarch64.c
@@ -955,7 +955,7 @@ static const struct tune_params qdf24xx_
    &generic_branch_cost,
    &generic_approx_modes,
    4, /* memmov_cost  */
-  4, /* issue_rate  */
+  8, /* issue_rate  */
    (AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
     | AARCH64_FUSE_MOVK_MOVK), /* fuseable_ops  */
    "16",	/* function_align.  */
@@ -968,7 +968,7 @@ static const struct tune_params qdf24xx_
    2,	/* min_div_recip_mul_df.  */
    0,	/* max_case_values.  */
    tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS, /* tune_flags.  */
+  AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS | AARCH64_EXTRA_TUNE_SCHED_MICRO_OPS, /* tune_flags.  */

Thanks for this, this looks much cleaner and more extensible.

  &qdf24xx_prefetch_tune
  };
  
@@ -18037,6 +18037,109 @@ aarch64_run_selftests (void)
  
  #endif /* #if CHECKING_P */
  
+/* The number of micro ops left over after issuing the last instruction in a
+   cycle.  This is subtracted from the next cycle before we start issuing insns.
+   This is initialized to 0 at the start of every basic block.  */
+static int leftover_uops = 0;
+
+/* Implement TARGET_SCHED_REORDER.  */
+
+static int
+aarch64_sched_reorder (FILE *file, int verbose,
+		       rtx_insn **ready ATTRIBUTE_UNUSED,
+		       int *n_readyp ATTRIBUTE_UNUSED,
+		       int clock)
+{
+  int can_issue_more = aarch64_sched_issue_rate ();
+
+  if ((aarch64_tune_params.extra_tuning_flags
+       & AARCH64_EXTRA_TUNE_SCHED_MICRO_OPS) != 0)
+    {
+      /* The start of a basic block.  */
+      if (clock == 0)
+	{
+	  if (leftover_uops && file && (verbose > 3))
+	    fprintf (file, ";;\tLeftover uops ignored at bb start.\n");
+
+	  leftover_uops = 0;
+	}
+
+      /* Account for issue slots left over from previous cycle.  This value
+	 can be larger than the number of issue slots per cycle, so we need
+	 to check it here before scheduling any instructions.  */
+      else if (leftover_uops)
+	{
+	  can_issue_more -= leftover_uops;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tUse %d issue slots for leftover uops.\n",
+		       leftover_uops);
+	      fprintf (file, ";;\t%d issue slots left.\n", can_issue_more);
+	    }
+
+	  leftover_uops = 0;
+
+	  if (can_issue_more < 0)
+	    {
+	      leftover_uops = 0 - can_issue_more;
+	      can_issue_more = 0;
+
+	      if (file && (verbose > 3))
+		{
+		  fprintf (file, ";;skipping issue cycle.\n");
+		  fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+		}
+	    }
+	}
+    }
+
+  return can_issue_more;
+}
+
+/* Implement TARGET_SCHED_VARIABLE_ISSUE.  */
+
+static int
+aarch64_variable_issue (FILE *file, int verbose,
+			rtx_insn *insn, int more)
+{
+  if (GET_CODE (PATTERN (insn)) != USE
+      && GET_CODE (PATTERN (insn)) != CLOBBER)
+    {
+      if ((aarch64_tune_params.extra_tuning_flags
+	   & AARCH64_EXTRA_TUNE_SCHED_MICRO_OPS) == 0)
+	more -= 1;
+      else
+	{
+	  /* For now only the falkor target supports scheduling of micro
+	     operations, therefore we don't need to check.  */
+	  int issue_slots = get_attr_falkor_variable_issue (insn);


This is still my concern about having this falkor-specific path.
I think we'll want to go with my suggestion at https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00815.html,
which is to create a generic attribute called something like "number_micro_ops" and have it populated for each core if they decide to use it.
Your patch would provide the falkor numbers initially. Then this part of aarch64.c wouldn't be falkor-specific and could use something like "get_attr_number_micro_ops".

That is really the biggest concern I had.

Thanks,
Kyrill


+	  more -= issue_slots;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tInsn takes %d issue slots.\n", issue_slots);
+	      fprintf (file, ";;\t%d issue slots left.\n", more);
+	    }
+
+	  /* We schedule an instruction first, and then subtract issue slots,
+	     which means the result can be negative.  We carry the extra over
+	     to the next cycle.  */
+
+	  if (more < 0)
+	    {
+	      leftover_uops = 0 - more;
+	      more = 0;
+
+	      if (file && (verbose > 3))
+		fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+	    }
+	}
+    }
+
+  return more;
+}
+
  #undef TARGET_ADDRESS_COST
  #define TARGET_ADDRESS_COST aarch64_address_cost
  
@@ -18265,6 +18368,12 @@ aarch64_libgcc_floating_mode_supported_p
  #undef TARGET_SCHED_ISSUE_RATE
  #define TARGET_SCHED_ISSUE_RATE aarch64_sched_issue_rate
  
+#undef TARGET_SCHED_REORDER
+#define TARGET_SCHED_REORDER aarch64_sched_reorder
+
+#undef TARGET_SCHED_VARIABLE_ISSUE
+#define TARGET_SCHED_VARIABLE_ISSUE aarch64_variable_issue
+
  #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
    aarch64_sched_first_cycle_multipass_dfa_lookahead
Index: trunk/gcc/config/aarch64/falkor.md
===================================================================
--- trunk.orig/gcc/config/aarch64/falkor.md
+++ trunk/gcc/config/aarch64/falkor.md
@@ -685,3 +685,24 @@
  (define_bypass 3
    "falkor_afp_5_vxvy_mul,falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mul,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mul,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mul,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mul,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mul,falkor_fpdt_6_vxvy_mla"
    "falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mla")
+
+
+(define_attr "falkor_variable_issue" ""
+  (cond [
+;; A64 Instructions
+	 (eq_attr "type" "neon_fp_neg_s_q,neon_fp_neg_d_q,neon_fp_abs_s_q,neon_fp_abs_d_q,neon_fp_minmax_s_q,neon_fp_minmax_d_q,neon_fp_compare_s_q,neon_fp_compare_d_q,neon_fp_round_s_q,neon_fp_round_d_q,neon_fp_abd_s_q,neon_fp_abd_d_q,neon_fp_addsub_s_q,neon_fp_addsub_d_q,neon_fp_reduc_add_s_q,neon_fp_reduc_add_d_q,neon_fp_to_int_s_q,neon_fp_to_int_d_q,neon_int_to_fp_s_q,neon_int_to_fp_d_q,neon_fp_mla_d_q,neon_fp_mla_d_scalar_q,neon_fp_div_s,neon_fp_div_d,neon_fp_sqrt_s,neon_fp_sqrt_d,neon_shift_imm_long,neon_add_q,neon_reduc_add_q,neon_logic_q,neon_neg_q,neon_sub_q,neon_add_halve_q,neon_sub_halve_q,neon_shift_imm_q,neon_shift_reg_q,neon_minmax_q,neon_abs_q,neon_compare_q,neon_compare_zero_q,neon_tst_q,neon_reduc_add_long,neon_shift_acc_q,neon_reduc_add_acc_q,neon_abd_q,neon_abd_long,neon_qadd_q,neon_qsub_q,neon_qabs_q,neon_qneg_q,neon_sat_shift_imm_q,neon_sat_shift_reg_q,neon_mul_b_q,neon_mul_h_q,neon_mul_s_q,neon_mul_h_scalar_q,neon_mul_s_scalar_q,neon_sat_mul_b_q,neon_sat_mul_h_q,neon_sat_mul_s_q,neon_mul_b_long,neon_mul_h_long,neon_mul_s_long,neon_mul_d_long,neon_mul_h_scalar_long,neon_mul_s_scalar_long,neon_sat_mul_b_long,neon_sat_mul_h_long,neon_sat_mul_s_long,neon_sat_mul_h_scalar_q,neon_sat_mul_s_scalar_q,neon_sat_mul_h_scalar_long,neon_sat_mul_s_scalar_long,neon_mla_b_q,neon_mla_h_q,neon_mla_s_q,neon_mla_h_scalar_q,neon_mla_s_scalar_q,neon_mla_b_long,neon_mla_h_long,neon_mla_s_long,neon_mla_h_scalar_long,neon_mla_s_scalar_long,neon_sat_mla_b_long,neon_sat_mla_h_long,neon_sat_mla_s_long,neon_sat_mla_h_scalar_long,neon_sat_mla_s_scalar_long,neon_add_halve_narrow_q,neon_sub_halve_narrow_q,neon_arith_acc,neon_load1_2reg,neon_load2_2reg,neon_load2_all_lanes,neon_load1_2reg_q,neon_load2_2reg_q,neon_load2_all_lanes_q,neon_load3_one_lane,neon_load4_one_lane,neon_ldp,neon_ldp_q,neon_from_gp_q,neon_bsl_q,neon_dup_q,neon_ext_q,neon_move_q,neon_rev_q,neon_tbl1_q,neon_permute_q,neon_cls_q,neon_cnt_q,neon_rbit_q,neon_tbl2,neon_fp_recpe_s_q,neon_fp_recpe_d_q,neon_fp_rsqrte_s_q
,neon_fp_rsqrte_d_q,neon_fp_recps_s_q,neon_fp_recps_d_q,neon_fp_rsqrts_d_q,neon_store1_1reg,neon_store1_1reg_q,neon_store1_one_lane,neon_store1_one_lane_q,neon_store1_2reg,neon_store2_2reg,neon_store2_one_lane,neon_store2_one_lane_q,neon_stp,crypto_sha1_xor,crypto_sha256_fast,crypto_sha1_slow,crypto_sha256_slow,crypto_aese,f_stores,f_stored,fdivs,fdivd,sdiv,udiv")
+	   (const_int 2)
+	 (eq_attr "type" "neon_fp_cvt_narrow_s_q,neon_fp_cvt_narrow_d_q,neon_load1_3reg,neon_load3_3reg,neon_load3_all_lanes,neon_load1_3reg_q,neon_load3_3reg_q,neon_load3_all_lanes_q,neon_tbl2_q,neon_tbl3")
+	   (const_int 3)
+	 (eq_attr "type" "neon_fp_div_s_q,neon_fp_div_d_q,neon_fp_sqrt_s_q,neon_fp_sqrt_d_q,neon_add_widen,neon_sub_widen,neon_arith_acc_q,neon_load1_4reg,neon_load4_4reg,neon_load1_4reg_q,neon_load4_4reg_q,neon_load4_all_lanes,neon_load4_all_lanes_q,neon_tbl3_q,neon_tbl4,neon_store1_2reg_q,neon_store1_3reg,neon_store1_4reg,neon_store2_2reg_q,neon_store3_3reg,neon_store4_4reg,neon_store3_one_lane,neon_store3_one_lane_q,neon_store4_one_lane,neon_store4_one_lane_q,neon_stp_q")
+	   (const_int 4)
+	 (eq_attr "type" "neon_tbl4_q")
+	   (const_int 5)
+	 (eq_attr "type" "neon_store1_3reg_q,neon_store3_3reg_q")
+	   (const_int 6)
+	 (eq_attr "type" "neon_store1_4reg_q,neon_store4_4reg_q")
+	   (const_int 8)
+	 (eq_attr "type" "multiple")
+	   (const_int 2)
+	]
+	(const_int 1)))
Index: trunk/gcc/config/aarch64/aarch64-tuning-flags.def
===================================================================
--- trunk.orig/gcc/config/aarch64/aarch64-tuning-flags.def
+++ trunk/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -46,4 +46,7 @@ AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp
  
  AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
  
+/* Enable micro-op based scheduling instead of the default per-instruction one.  */
+AARCH64_EXTRA_TUNING_OPTION ("sched_microops", SCHED_MICRO_OPS)
+
  #undef AARCH64_EXTRA_TUNING_OPTION

Patch

	Jim Wilson <jim.wilson@linaro.org>
	Kai Tietz <kai.tietz@linaro.org>

	* config/aarch64/aarch64.c (aarch64_sched_reorder): Implement
	TARGET_SCHED_REORDER hook.
	(aarch64_variable_issue): Implement TARGET_SCHED_VARIABLE_ISSUE
	hook.
	(TARGET_SCHED_REORDER): Define.
	(TARGET_SCHED_VARIABLE_ISSUE): Likewise.
	* config/aarch64/falkor.md (falkor_variable_issue): New.

Index: aarch64/aarch64.c
===================================================================
--- aarch64.orig/aarch64.c
+++ aarch64/aarch64.c
@@ -914,7 +914,7 @@  static const struct tune_params qdf24xx_
   &generic_branch_cost,
   &generic_approx_modes,
   4, /* memmov_cost  */
-  4, /* issue_rate  */
+  8, /* issue_rate  */
   (AARCH64_FUSE_MOV_MOVK | AARCH64_FUSE_ADRP_ADD
    | AARCH64_FUSE_MOVK_MOVK), /* fuseable_ops  */
   "16",	/* function_align.  */
@@ -17551,6 +17551,105 @@  aarch64_run_selftests (void)
 
 #endif /* #if CHECKING_P */
 
+/* The number of micro ops left over after issuing the last instruction in a
+   cycle.  This is subtracted from the next cycle before we start issuing insns.
+   This is initialized to 0 at the start of every basic block.  */
+static int leftover_uops = 0;
+
+/* Implement TARGET_SCHED_REORDER.  */
+
+static int
+aarch64_sched_reorder (FILE *file, int verbose,
+		       rtx_insn **ready ATTRIBUTE_UNUSED,
+		       int *n_readyp ATTRIBUTE_UNUSED,
+		       int clock)
+{
+  int can_issue_more = aarch64_sched_issue_rate ();
+
+  if ((enum attr_tune) aarch64_tune == TUNE_FALKOR)
+    {
+      /* The start of a basic block.  */
+      if (clock == 0)
+	{
+	  if (leftover_uops && file && (verbose > 3))
+	    fprintf (file, ";;\tLeftover uops ignored at bb start.\n");
+
+	  leftover_uops = 0;
+	}
+
+      /* Account for issue slots left over from previous cycle.  This value
+	 can be larger than the number of issue slots per cycle, so we need
+	 to check it here before scheduling any instructions.  */
+      else if (leftover_uops)
+	{
+	  can_issue_more -= leftover_uops;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tUse %d issue slots for leftover uops.\n",
+		       leftover_uops);
+	      fprintf (file, ";;\t%d issue slots left.\n", can_issue_more);
+	    }
+
+	  leftover_uops = 0;
+
+	  if (can_issue_more < 0)
+	    {
+	      leftover_uops = 0 - can_issue_more;
+	      can_issue_more = 0;
+
+	      if (file && (verbose > 3))
+		{
+		  fprintf (file, ";;skipping issue cycle.\n");
+		  fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+		}
+	    }
+	}
+    }
+
+  return can_issue_more;
+}
+
+/* Implement TARGET_SCHED_VARIABLE_ISSUE.  */
+
+static int
+aarch64_variable_issue (FILE *file, int verbose,
+			rtx_insn *insn, int more)
+{
+  if (GET_CODE (PATTERN (insn)) != USE
+      && GET_CODE (PATTERN (insn)) != CLOBBER)
+    {
+      if ((enum attr_tune) aarch64_tune != TUNE_FALKOR)
+	more -= 1;
+      else
+	{
+	  int issue_slots = get_attr_falkor_variable_issue (insn);
+	  more -= issue_slots;
+
+	  if (file && (verbose > 3))
+	    {
+	      fprintf (file, ";;\tInsn takes %d issue slots.\n", issue_slots);
+	      fprintf (file, ";;\t%d issue slots left.\n", more);
+	    }
+
+	  /* We schedule an instruction first, and then subtract issue slots,
+	     which means the result can be negative.  We carry the extra over
+	     to the next cycle.  */
+
+	  if (more < 0)
+	    {
+	      leftover_uops = 0 - more;
+	      more = 0;
+
+	      if (file && (verbose > 3))
+		fprintf (file, ";;\t%d uops left over.\n", leftover_uops);
+	    }
+	}
+    }
+
+  return more;
+}
+
 #undef TARGET_ADDRESS_COST
 #define TARGET_ADDRESS_COST aarch64_address_cost
 
@@ -17779,6 +17878,12 @@  aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_SCHED_ISSUE_RATE
 #define TARGET_SCHED_ISSUE_RATE aarch64_sched_issue_rate
 
+#undef TARGET_SCHED_REORDER
+#define TARGET_SCHED_REORDER aarch64_sched_reorder
+
+#undef TARGET_SCHED_VARIABLE_ISSUE
+#define TARGET_SCHED_VARIABLE_ISSUE aarch64_variable_issue
+
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   aarch64_sched_first_cycle_multipass_dfa_lookahead
Index: aarch64/falkor.md
===================================================================
--- aarch64.orig/falkor.md
+++ aarch64/falkor.md
@@ -685,3 +685,24 @@ 
 (define_bypass 3
   "falkor_afp_5_vxvy_mul,falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mul,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mul,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mul,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mul,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mul,falkor_fpdt_6_vxvy_mla"
   "falkor_afp_5_vxvy_mla,falkor_afp_5_vxvy_vxvy_mla,falkor_afp_6_vxvy_mla,falkor_afp_6_vxvy_vxvy_mla,falkor_fpdt_5_vxvy_mla,falkor_fpdt_6_vxvy_mla")
+
+
+(define_attr "falkor_variable_issue" ""
+  (cond [
+;; A64 Instructions
+	 (eq_attr "type" "neon_fp_neg_s_q,neon_fp_neg_d_q,neon_fp_abs_s_q,neon_fp_abs_d_q,neon_fp_minmax_s_q,neon_fp_minmax_d_q,neon_fp_compare_s_q,neon_fp_compare_d_q,neon_fp_round_s_q,neon_fp_round_d_q,neon_fp_abd_s_q,neon_fp_abd_d_q,neon_fp_addsub_s_q,neon_fp_addsub_d_q,neon_fp_reduc_add_s_q,neon_fp_reduc_add_d_q,neon_fp_to_int_s_q,neon_fp_to_int_d_q,neon_int_to_fp_s_q,neon_int_to_fp_d_q,neon_fp_mla_d_q,neon_fp_mla_d_scalar_q,neon_fp_div_s,neon_fp_div_d,neon_fp_sqrt_s,neon_fp_sqrt_d,neon_shift_imm_long,neon_add_q,neon_reduc_add_q,neon_logic_q,neon_neg_q,neon_sub_q,neon_add_halve_q,neon_sub_halve_q,neon_shift_imm_q,neon_shift_reg_q,neon_minmax_q,neon_abs_q,neon_compare_q,neon_compare_zero_q,neon_tst_q,neon_reduc_add_long,neon_shift_acc_q,neon_reduc_add_acc_q,neon_abd_q,neon_abd_long,neon_qadd_q,neon_qsub_q,neon_qabs_q,neon_qneg_q,neon_sat_shift_imm_q,neon_sat_shift_reg_q,neon_mul_b_q,neon_mul_h_q,neon_mul_s_q,neon_mul_h_scalar_q,neon_mul_s_scalar_q,neon_sat_mul_b_q,neon_sat_mul_h_q,neon_sat_mul_s_q,neon_mul_b_long,neon_mul_h_long,neon_mul_s_long,neon_mul_d_long,neon_mul_h_scalar_long,neon_mul_s_scalar_long,neon_sat_mul_b_long,neon_sat_mul_h_long,neon_sat_mul_s_long,neon_sat_mul_h_scalar_q,neon_sat_mul_s_scalar_q,neon_sat_mul_h_scalar_long,neon_sat_mul_s_scalar_long,neon_mla_b_q,neon_mla_h_q,neon_mla_s_q,neon_mla_h_scalar_q,neon_mla_s_scalar_q,neon_mla_b_long,neon_mla_h_long,neon_mla_s_long,neon_mla_h_scalar_long,neon_mla_s_scalar_long,neon_sat_mla_b_long,neon_sat_mla_h_long,neon_sat_mla_s_long,neon_sat_mla_h_scalar_long,neon_sat_mla_s_scalar_long,neon_add_halve_narrow_q,neon_sub_halve_narrow_q,neon_arith_acc,neon_load1_2reg,neon_load2_2reg,neon_load2_all_lanes,neon_load1_2reg_q,neon_load2_2reg_q,neon_load2_all_lanes_q,neon_load3_one_lane,neon_load4_one_lane,neon_ldp,neon_ldp_q,neon_from_gp_q,neon_bsl_q,neon_dup_q,neon_ext_q,neon_move_q,neon_rev_q,neon_tbl1_q,neon_permute_q,neon_cls_q,neon_cnt_q,neon_rbit_q,neon_tbl2,neon_fp_recpe_s_q,neon_fp_recpe_d_q,neon_fp_rsqrte_s_q
,neon_fp_rsqrte_d_q,neon_fp_recps_s_q,neon_fp_recps_d_q,neon_fp_rsqrts_d_q,neon_store1_1reg,neon_store1_1reg_q,neon_store1_one_lane,neon_store1_one_lane_q,neon_store1_2reg,neon_store2_2reg,neon_store2_one_lane,neon_store2_one_lane_q,neon_stp,crypto_sha1_xor,crypto_sha256_fast,crypto_sha1_slow,crypto_sha256_slow,crypto_aese,f_stores,f_stored,fdivs,fdivd,sdiv,udiv")
+	   (const_int 2)
+	 (eq_attr "type" "neon_fp_cvt_narrow_s_q,neon_fp_cvt_narrow_d_q,neon_load1_3reg,neon_load3_3reg,neon_load3_all_lanes,neon_load1_3reg_q,neon_load3_3reg_q,neon_load3_all_lanes_q,neon_tbl2_q,neon_tbl3")
+	   (const_int 3)
+	 (eq_attr "type" "neon_fp_div_s_q,neon_fp_div_d_q,neon_fp_sqrt_s_q,neon_fp_sqrt_d_q,neon_add_widen,neon_sub_widen,neon_arith_acc_q,neon_load1_4reg,neon_load4_4reg,neon_load1_4reg_q,neon_load4_4reg_q,neon_load4_all_lanes,neon_load4_all_lanes_q,neon_tbl3_q,neon_tbl4,neon_store1_2reg_q,neon_store1_3reg,neon_store1_4reg,neon_store2_2reg_q,neon_store3_3reg,neon_store4_4reg,neon_store3_one_lane,neon_store3_one_lane_q,neon_store4_one_lane,neon_store4_one_lane_q,neon_stp_q")
+	   (const_int 4)
+	 (eq_attr "type" "neon_tbl4_q")
+	   (const_int 5)
+	 (eq_attr "type" "neon_store1_3reg_q,neon_store3_3reg_q")
+	   (const_int 6)
+	 (eq_attr "type" "neon_store1_4reg_q,neon_store4_4reg_q")
+	   (const_int 8)
+	 (eq_attr "type" "multiple")
+	   (const_int 2)
+	]
+	(const_int 1)))