Message ID: 20160714202940.18399-1-bobby.prani@gmail.com
State: New
On 07/15/2016 01:59 AM, Pranith Kumar wrote: > This patch applies on top of the fence generation patch series. > > This commit optimizes fence instructions. Two optimizations are > currently implemented. These are: > > 1. Unnecessary duplicate fence instructions > > If the same fence instruction is detected consecutively, we remove > one instance of it. > > ex: mb; mb => mb, strl; strl => strl > > 2. Merging weaker fence with subsequent/previous stronger fence > > load-acquire/store-release fence can be combined with a full fence > without relaxing the ordering constraint. > > ex: a) ld; ldaq; mb => ld; mb > b) mb; strl; st => mb; st > > Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> > --- > tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > tcg/tcg.h | 1 + > 2 files changed, 60 insertions(+) > > diff --git a/tcg/optimize.c b/tcg/optimize.c > index c0d975b..a655829 100644 > --- a/tcg/optimize.c > +++ b/tcg/optimize.c > @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2) > return false; > } > > +/* Eliminate duplicate and unnecessary fence instructions */ > +void tcg_optimize_mb(TCGContext *s) > +{ > + int oi, oi_next; > + TCGArg prev_op_mb = -1; > + TCGOp *prev_op; > + > + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) { > + TCGOp *op = &s->gen_op_buf[oi]; > + TCGArg *args = &s->gen_opparam_buf[op->args]; > + TCGOpcode opc = op->opc; > + > + switch (opc) { > + case INDEX_op_mb: > + { > + TCGBar curr_mb_type = args[0] & 0xF0; > + TCGBar prev_mb_type = prev_op_mb & 0xF0; No check to make sure that prev_op_mb != -1? > + > + if (curr_mb_type == prev_mb_type || > + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) { > + /* Remove the current weaker barrier op. The previous > + * barrier is stronger and sufficient. 
> + * mb; strl => mb; st > + */ > + tcg_op_remove(s, op); > + } else if (curr_mb_type == TCG_BAR_SC && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Remove the previous weaker barrier op. The current > + * barrier is stronger and sufficient. > + * ldaq; mb => ld; mb > + */ > + tcg_op_remove(s, prev_op); > + } else if (curr_mb_type == TCG_BAR_STRL && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Consecutive load-acquire and store-release barriers > + * can be merged into one stronger SC barrier > + * ldaq; strl => ld; mb; st > + */ > + args[0] = (args[0] & 0x0F) | TCG_BAR_SC; > + tcg_op_remove(s, prev_op); Don't you need to merge the bottom 4 bits as well? > + default: > + prev_op_mb = -1; > + prev_op = NULL; > + break; A reasonable start, I suppose, but there's nothing stopping you from considering barriers that aren't exactly adjacent. You only need to reset prev_op_mb when you see qemu_ld, qemu_st, call, or tcg_op_defs[opc].flags & TCG_OPF_BB_END. r~
Pranith Kumar <bobby.prani@gmail.com> writes: > This patch applies on top of the fence generation patch series. > > This commit optimizes fence instructions. Two optimizations are > currently implemented. These are: > > 1. Unnecessary duplicate fence instructions > > If the same fence instruction is detected consecutively, we remove > one instance of it. > > ex: mb; mb => mb, strl; strl => strl > > 2. Merging weaker fence with subsequent/previous stronger fence > > load-acquire/store-release fence can be combined with a full fence > without relaxing the ordering constraint. > > ex: a) ld; ldaq; mb => ld; mb > b) mb; strl; st => mb; st > > Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> > --- > tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > tcg/tcg.h | 1 + > 2 files changed, 60 insertions(+) > > diff --git a/tcg/optimize.c b/tcg/optimize.c > index c0d975b..a655829 100644 > --- a/tcg/optimize.c > +++ b/tcg/optimize.c > @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2) > return false; > } > > +/* Eliminate duplicate and unnecessary fence instructions */ > +void tcg_optimize_mb(TCGContext *s) > +{ > + int oi, oi_next; > + TCGArg prev_op_mb = -1; > + TCGOp *prev_op; The compiler throws up warnings about prev_op not being set: /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c: In function ‘tcg_optimize_mb’: /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c:611:17: error: ‘prev_op’ may be used uninitialized in this function [-Werror=maybe-uninitialized] tcg_op_remove(s, prev_op); ^ > + > + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) { > + TCGOp *op = &s->gen_op_buf[oi]; > + TCGArg *args = &s->gen_opparam_buf[op->args]; > + TCGOpcode opc = op->opc; > + > + switch (opc) { > + case INDEX_op_mb: > + { > + TCGBar curr_mb_type = args[0] & 0xF0; > + TCGBar prev_mb_type = prev_op_mb & 0xF0; > + > + if (curr_mb_type == prev_mb_type || > + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) { > + /* Remove 
the current weaker barrier op. The previous > + * barrier is stronger and sufficient. > + * mb; strl => mb; st > + */ > + tcg_op_remove(s, op); > + } else if (curr_mb_type == TCG_BAR_SC && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Remove the previous weaker barrier op. The current > + * barrier is stronger and sufficient. > + * ldaq; mb => ld; mb > + */ > + tcg_op_remove(s, prev_op); > + } else if (curr_mb_type == TCG_BAR_STRL && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Consecutive load-acquire and store-release barriers > + * can be merged into one stronger SC barrier > + * ldaq; strl => ld; mb; st > + */ > + args[0] = (args[0] & 0x0F) | TCG_BAR_SC; > + tcg_op_remove(s, prev_op); > + } > + prev_op_mb = args[0]; > + prev_op = op; > + break; > + } > + case INDEX_op_insn_start: > + break; > + default: > + prev_op_mb = -1; > + prev_op = NULL; > + break; > + } > + > + oi_next = op->next; > + } > +} > + > /* Propagate constants and copies, fold constant expressions. */ > void tcg_optimize(TCGContext *s) > { > @@ -1327,4 +1384,6 @@ void tcg_optimize(TCGContext *s) > break; > } > } > + > + tcg_optimize_mb(s); > } > diff --git a/tcg/tcg.h b/tcg/tcg.h > index 60ebc2b..2d7a46d 100644 > --- a/tcg/tcg.h > +++ b/tcg/tcg.h > @@ -904,6 +904,7 @@ void tcg_gen_callN(TCGContext *s, void *func, > TCGArg ret, int nargs, TCGArg *args); > > void tcg_op_remove(TCGContext *s, TCGOp *op); > +void tcg_optimize_mb(TCGContext *s); > void tcg_optimize(TCGContext *s); > > /* only used for debugging purposes */ -- Alex Bennée
Pranith Kumar <bobby.prani@gmail.com> writes: > This patch applies on top of the fence generation patch series. > > This commit optimizes fence instructions. Two optimizations are > currently implemented. These are: > > 1. Unnecessary duplicate fence instructions > > If the same fence instruction is detected consecutively, we remove > one instance of it. > > ex: mb; mb => mb, strl; strl => strl > > 2. Merging weaker fence with subsequent/previous stronger fence > > load-acquire/store-release fence can be combined with a full fence > without relaxing the ordering constraint. > > ex: a) ld; ldaq; mb => ld; mb > b) mb; strl; st => mb; st What test cases do you have for this? Currently the litmus tests don't fire (as they have exactly what they need). Most programs don't seem to trigger multiple barriers. > > Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> > --- > tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > tcg/tcg.h | 1 + > 2 files changed, 60 insertions(+) > > diff --git a/tcg/optimize.c b/tcg/optimize.c > index c0d975b..a655829 100644 > --- a/tcg/optimize.c > +++ b/tcg/optimize.c > @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2) > return false; > } > > +/* Eliminate duplicate and unnecessary fence instructions */ > +void tcg_optimize_mb(TCGContext *s) > +{ > + int oi, oi_next; > + TCGArg prev_op_mb = -1; > + TCGOp *prev_op; > + > + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) { > + TCGOp *op = &s->gen_op_buf[oi]; > + TCGArg *args = &s->gen_opparam_buf[op->args]; > + TCGOpcode opc = op->opc; > + > + switch (opc) { > + case INDEX_op_mb: > + { > + TCGBar curr_mb_type = args[0] & 0xF0; > + TCGBar prev_mb_type = prev_op_mb & 0xF0; > + > + if (curr_mb_type == prev_mb_type || > + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) { > + /* Remove the current weaker barrier op. The previous > + * barrier is stronger and sufficient. 
> + * mb; strl => mb; st > + */ > + tcg_op_remove(s, op); > + } else if (curr_mb_type == TCG_BAR_SC && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Remove the previous weaker barrier op. The current > + * barrier is stronger and sufficient. > + * ldaq; mb => ld; mb > + */ > + tcg_op_remove(s, prev_op); > + } else if (curr_mb_type == TCG_BAR_STRL && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Consecutive load-acquire and store-release barriers > + * can be merged into one stronger SC barrier > + * ldaq; strl => ld; mb; st > + */ > + args[0] = (args[0] & 0x0F) | TCG_BAR_SC; > + tcg_op_remove(s, prev_op); > + } > + prev_op_mb = args[0]; > + prev_op = op; > + break; > + } > + case INDEX_op_insn_start: > + break; > + default: > + prev_op_mb = -1; > + prev_op = NULL; > + break; > + } > + > + oi_next = op->next; > + } > +} > + > /* Propagate constants and copies, fold constant expressions. */ > void tcg_optimize(TCGContext *s) > { > @@ -1327,4 +1384,6 @@ void tcg_optimize(TCGContext *s) > break; > } > } > + > + tcg_optimize_mb(s); > } > diff --git a/tcg/tcg.h b/tcg/tcg.h > index 60ebc2b..2d7a46d 100644 > --- a/tcg/tcg.h > +++ b/tcg/tcg.h > @@ -904,6 +904,7 @@ void tcg_gen_callN(TCGContext *s, void *func, > TCGArg ret, int nargs, TCGArg *args); > > void tcg_op_remove(TCGContext *s, TCGOp *op); > +void tcg_optimize_mb(TCGContext *s); > void tcg_optimize(TCGContext *s); > > /* only used for debugging purposes */ -- Alex Bennée
On 14/07/2016 22:29, Pranith Kumar wrote: > + } else if (curr_mb_type == TCG_BAR_STRL && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Consecutive load-acquire and store-release barriers > + * can be merged into one stronger SC barrier > + * ldaq; strl => ld; mb; st > + */ > + args[0] = (args[0] & 0x0F) | TCG_BAR_SC; > + tcg_op_remove(s, prev_op); Is this really an optimization? For example the processor could reorder "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this if you change ldaq1/strl2 to ld1/mb/st2. On x86 for example a memory fence costs ~50 clock cycles, while normal loads and stores are of course faster. Of course this is useful if your target doesn't have ldaq/strl instructions. In this case, however, you probably want to lower ldaq to "ld;mb" and strl to "mb;st"; the other optimizations then will remove the unnecessary barrier. Paolo
Richard Henderson writes:

> On 07/15/2016 01:59 AM, Pranith Kumar wrote:
>> This patch applies on top of the fence generation patch series.
>>
>> This commit optimizes fence instructions. Two optimizations are
>> currently implemented. These are:
>>
>> 1. Unnecessary duplicate fence instructions
>>
>> If the same fence instruction is detected consecutively, we remove
>> one instance of it.
>>
>> ex: mb; mb => mb, strl; strl => strl
>>
>> 2. Merging weaker fence with subsequent/previous stronger fence
>>
>> load-acquire/store-release fence can be combined with a full fence
>> without relaxing the ordering constraint.
>>
>> ex: a) ld; ldaq; mb => ld; mb
>>     b) mb; strl; st => mb; st
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>> ---
>> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> tcg/tcg.h | 1 +
>> 2 files changed, 60 insertions(+)
>>
>> diff --git a/tcg/optimize.c b/tcg/optimize.c
>> index c0d975b..a655829 100644
>> --- a/tcg/optimize.c
>> +++ b/tcg/optimize.c
>> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
>>      return false;
>>  }
>>
>> +/* Eliminate duplicate and unnecessary fence instructions */
>> +void tcg_optimize_mb(TCGContext *s)
>> +{
>> +    int oi, oi_next;
>> +    TCGArg prev_op_mb = -1;
>> +    TCGOp *prev_op;
>> +
>> +    for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) {
>> +        TCGOp *op = &s->gen_op_buf[oi];
>> +        TCGArg *args = &s->gen_opparam_buf[op->args];
>> +        TCGOpcode opc = op->opc;
>> +
>> +        switch (opc) {
>> +        case INDEX_op_mb:
>> +        {
>> +            TCGBar curr_mb_type = args[0] & 0xF0;
>> +            TCGBar prev_mb_type = prev_op_mb & 0xF0;
>
> No check to make sure that prev_op_mb != -1?

I don't think that is necessary, since if prev_op_mb is -1 then
prev_mb_type will be 0xF0, which will not match any of the cases we
compare against below.

>> +
>> +            if (curr_mb_type == prev_mb_type ||
>> +                (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) {
>> +                /* Remove the current weaker barrier op. The previous
>> +                 * barrier is stronger and sufficient.
>> +                 * mb; strl => mb; st
>> +                 */
>> +                tcg_op_remove(s, op);
>> +            } else if (curr_mb_type == TCG_BAR_SC &&
>> +                       prev_mb_type == TCG_BAR_LDAQ) {
>> +                /* Remove the previous weaker barrier op. The current
>> +                 * barrier is stronger and sufficient.
>> +                 * ldaq; mb => ld; mb
>> +                 */
>> +                tcg_op_remove(s, prev_op);
>> +            } else if (curr_mb_type == TCG_BAR_STRL &&
>> +                       prev_mb_type == TCG_BAR_LDAQ) {
>> +                /* Consecutive load-acquire and store-release barriers
>> +                 * can be merged into one stronger SC barrier
>> +                 * ldaq; strl => ld; mb; st
>> +                 */
>> +                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> +                tcg_op_remove(s, prev_op);
>
> Don't you need to merge the bottom 4 bits as well?

That is an interesting point. Ideally, the TCG_MO_* flags for both
barriers should be the same. At least for now, for LDAR and STRL
barriers this flag is TCG_MO_ALL, since we do not have any other
variant.

>> +        default:
>> +            prev_op_mb = -1;
>> +            prev_op = NULL;
>> +            break;
>
> A reasonable start, I suppose, but there's nothing stopping you from
> considering barriers that aren't exactly adjacent.
>
> You only need to reset prev_op_mb when you see qemu_ld, qemu_st, call,
> or tcg_op_defs[opc].flags & TCG_OPF_BB_END.

That sounds like a good idea to extend this to. I will work on it.

Thanks!
Alex Bennée writes: > Pranith Kumar <bobby.prani@gmail.com> writes: > >> This patch applies on top of the fence generation patch series. >> >> This commit optimizes fence instructions. Two optimizations are >> currently implemented. These are: >> >> 1. Unnecessary duplicate fence instructions >> >> If the same fence instruction is detected consecutively, we remove >> one instance of it. >> >> ex: mb; mb => mb, strl; strl => strl >> >> 2. Merging weaker fence with subsequent/previous stronger fence >> >> load-acquire/store-release fence can be combined with a full fence >> without relaxing the ordering constraint. >> >> ex: a) ld; ldaq; mb => ld; mb >> b) mb; strl; st => mb; st >> >> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> >> --- >> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> tcg/tcg.h | 1 + >> 2 files changed, 60 insertions(+) >> >> diff --git a/tcg/optimize.c b/tcg/optimize.c >> index c0d975b..a655829 100644 >> --- a/tcg/optimize.c >> +++ b/tcg/optimize.c >> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2) >> return false; >> } >> >> +/* Eliminate duplicate and unnecessary fence instructions */ >> +void tcg_optimize_mb(TCGContext *s) >> +{ >> + int oi, oi_next; >> + TCGArg prev_op_mb = -1; >> + TCGOp *prev_op; > > > The compiler throws up warnings about prev_op not being set: > > /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c: In function ‘tcg_optimize_mb’: > /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c:611:17: error: ‘prev_op’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > tcg_op_remove(s, prev_op); > ^ OK, I will fix this. Thanks,
Alex Bennée writes: > Pranith Kumar <bobby.prani@gmail.com> writes: > >> This patch applies on top of the fence generation patch series. >> >> This commit optimizes fence instructions. Two optimizations are >> currently implemented. These are: >> >> 1. Unnecessary duplicate fence instructions >> >> If the same fence instruction is detected consecutively, we remove >> one instance of it. >> >> ex: mb; mb => mb, strl; strl => strl >> >> 2. Merging weaker fence with subsequent/previous stronger fence >> >> load-acquire/store-release fence can be combined with a full fence >> without relaxing the ordering constraint. >> >> ex: a) ld; ldaq; mb => ld; mb >> b) mb; strl; st => mb; st > > What test cases do you have for this? > > Currently the litmus tests don't fire (as they have exactly what they > need). Most programs don't seem to trigger multiple barriers. > Indeed, these cases are not so commonly seen in the wild. To test it I wrote small test programs using C11 builtins to generate appropriate instructions. Then verified it by running 'qemu-aarch64 -d in_asm,out_asm'. _Atomic int val; int main() { val = __atomic_load_n(&val, __ATOMIC_ACQUIRE); __atomic_store_n(&val, val, __ATOMIC_RELEASE); barrier(); barrier(); }
Paolo Bonzini writes:

> On 14/07/2016 22:29, Pranith Kumar wrote:
>> +            } else if (curr_mb_type == TCG_BAR_STRL &&
>> +                       prev_mb_type == TCG_BAR_LDAQ) {
>> +                /* Consecutive load-acquire and store-release barriers
>> +                 * can be merged into one stronger SC barrier
>> +                 * ldaq; strl => ld; mb; st
>> +                 */
>> +                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> +                tcg_op_remove(s, prev_op);
>
> Is this really an optimization? For example the processor could reorder
> "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this
> if you change ldaq1/strl2 to ld1/mb/st2.
>
> On x86 for example a memory fence costs ~50 clock cycles, while normal
> loads and stores are of course faster.
>
> Of course this is useful if your target doesn't have ldaq/strl
> instructions. In this case, however, you probably want to lower ldaq to
> "ld;mb" and strl to "mb;st"; the other optimizations then will remove
> the unnecessary barrier.
>
> Paolo

I agree that this is a conservative optimization. The problem is that
currently, even for architectures which have ldaq/strl instructions, the
TCG backend does not generate them. TCG just generates plain loads and
stores. I guess we didn't need anything more while TCG was
single-threaded, before MTTCG. I am trying to add support for generating
these instructions on AArch64. Once that is done, we can disable the
above optimization.
diff --git a/tcg/optimize.c b/tcg/optimize.c
index c0d975b..a655829 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
     return false;
 }
 
+/* Eliminate duplicate and unnecessary fence instructions */
+void tcg_optimize_mb(TCGContext *s)
+{
+    int oi, oi_next;
+    TCGArg prev_op_mb = -1;
+    TCGOp *prev_op;
+
+    for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) {
+        TCGOp *op = &s->gen_op_buf[oi];
+        TCGArg *args = &s->gen_opparam_buf[op->args];
+        TCGOpcode opc = op->opc;
+
+        switch (opc) {
+        case INDEX_op_mb:
+        {
+            TCGBar curr_mb_type = args[0] & 0xF0;
+            TCGBar prev_mb_type = prev_op_mb & 0xF0;
+
+            if (curr_mb_type == prev_mb_type ||
+                (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) {
+                /* Remove the current weaker barrier op. The previous
+                 * barrier is stronger and sufficient.
+                 * mb; strl => mb; st
+                 */
+                tcg_op_remove(s, op);
+            } else if (curr_mb_type == TCG_BAR_SC &&
+                       prev_mb_type == TCG_BAR_LDAQ) {
+                /* Remove the previous weaker barrier op. The current
+                 * barrier is stronger and sufficient.
+                 * ldaq; mb => ld; mb
+                 */
+                tcg_op_remove(s, prev_op);
+            } else if (curr_mb_type == TCG_BAR_STRL &&
+                       prev_mb_type == TCG_BAR_LDAQ) {
+                /* Consecutive load-acquire and store-release barriers
+                 * can be merged into one stronger SC barrier
+                 * ldaq; strl => ld; mb; st
+                 */
+                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
+                tcg_op_remove(s, prev_op);
+            }
+            prev_op_mb = args[0];
+            prev_op = op;
+            break;
+        }
+        case INDEX_op_insn_start:
+            break;
+        default:
+            prev_op_mb = -1;
+            prev_op = NULL;
+            break;
+        }
+
+        oi_next = op->next;
+    }
+}
+
 /* Propagate constants and copies, fold constant expressions. */
 void tcg_optimize(TCGContext *s)
 {
@@ -1327,4 +1384,6 @@ void tcg_optimize(TCGContext *s)
             break;
         }
     }
+
+    tcg_optimize_mb(s);
 }
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 60ebc2b..2d7a46d 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -904,6 +904,7 @@ void tcg_gen_callN(TCGContext *s, void *func,
                    TCGArg ret, int nargs, TCGArg *args);
 
 void tcg_op_remove(TCGContext *s, TCGOp *op);
+void tcg_optimize_mb(TCGContext *s);
 void tcg_optimize(TCGContext *s);
 
 /* only used for debugging purposes */
This patch applies on top of the fence generation patch series.

This commit optimizes fence instructions. Two optimizations are
currently implemented. These are:

1. Unnecessary duplicate fence instructions

   If the same fence instruction is detected consecutively, we remove
   one instance of it.

   ex: mb; mb => mb, strl; strl => strl

2. Merging weaker fence with subsequent/previous stronger fence

   load-acquire/store-release fence can be combined with a full fence
   without relaxing the ordering constraint.

   ex: a) ld; ldaq; mb => ld; mb
       b) mb; strl; st => mb; st

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h      |  1 +
 2 files changed, 60 insertions(+)