Message ID: 20160714202940.18399-1-bobby.prani@gmail.com
State: New
On 07/15/2016 01:59 AM, Pranith Kumar wrote: > This patch applies on top of the fence generation patch series. > > This commit optimizes fence instructions. Two optimizations are > currently implemented. These are: > > 1. Unnecessary duplicate fence instructions > > If the same fence instruction is detected consecutively, we remove > one instance of it. > > ex: mb; mb => mb, strl; strl => strl > > 2. Merging weaker fence with subsequent/previous stronger fence > > load-acquire/store-release fence can be combined with a full fence > without relaxing the ordering constraint. > > ex: a) ld; ldaq; mb => ld; mb > b) mb; strl; st => mb; st > > Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> > --- > tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > tcg/tcg.h | 1 + > 2 files changed, 60 insertions(+) > > diff --git a/tcg/optimize.c b/tcg/optimize.c > index c0d975b..a655829 100644 > --- a/tcg/optimize.c > +++ b/tcg/optimize.c > @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2) > return false; > } > > +/* Eliminate duplicate and unnecessary fence instructions */ > +void tcg_optimize_mb(TCGContext *s) > +{ > + int oi, oi_next; > + TCGArg prev_op_mb = -1; > + TCGOp *prev_op; > + > + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) { > + TCGOp *op = &s->gen_op_buf[oi]; > + TCGArg *args = &s->gen_opparam_buf[op->args]; > + TCGOpcode opc = op->opc; > + > + switch (opc) { > + case INDEX_op_mb: > + { > + TCGBar curr_mb_type = args[0] & 0xF0; > + TCGBar prev_mb_type = prev_op_mb & 0xF0; No check to make sure that prev_op_mb != -1? > + > + if (curr_mb_type == prev_mb_type || > + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) { > + /* Remove the current weaker barrier op. The previous > + * barrier is stronger and sufficient. 
> + * mb; strl => mb; st > + */ > + tcg_op_remove(s, op); > + } else if (curr_mb_type == TCG_BAR_SC && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Remove the previous weaker barrier op. The current > + * barrier is stronger and sufficient. > + * ldaq; mb => ld; mb > + */ > + tcg_op_remove(s, prev_op); > + } else if (curr_mb_type == TCG_BAR_STRL && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Consecutive load-acquire and store-release barriers > + * can be merged into one stronger SC barrier > + * ldaq; strl => ld; mb; st > + */ > + args[0] = (args[0] & 0x0F) | TCG_BAR_SC; > + tcg_op_remove(s, prev_op); Don't you need to merge the bottom 4 bits as well? > + default: > + prev_op_mb = -1; > + prev_op = NULL; > + break; A reasonable start, I suppose, but there's nothing stopping you from considering barriers that aren't exactly adjacent. You only need to reset prev_op_mb when you see qemu_ld, qemu_st, call, or tcg_op_defs[opc].flags & TCG_OPF_BB_END. r~
Pranith Kumar <bobby.prani@gmail.com> writes: > This patch applies on top of the fence generation patch series. > > This commit optimizes fence instructions. Two optimizations are > currently implemented. These are: > > 1. Unnecessary duplicate fence instructions > > If the same fence instruction is detected consecutively, we remove > one instance of it. > > ex: mb; mb => mb, strl; strl => strl > > 2. Merging weaker fence with subsequent/previous stronger fence > > load-acquire/store-release fence can be combined with a full fence > without relaxing the ordering constraint. > > ex: a) ld; ldaq; mb => ld; mb > b) mb; strl; st => mb; st > > Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> > --- > tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > tcg/tcg.h | 1 + > 2 files changed, 60 insertions(+) > > diff --git a/tcg/optimize.c b/tcg/optimize.c > index c0d975b..a655829 100644 > --- a/tcg/optimize.c > +++ b/tcg/optimize.c > @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2) > return false; > } > > +/* Eliminate duplicate and unnecessary fence instructions */ > +void tcg_optimize_mb(TCGContext *s) > +{ > + int oi, oi_next; > + TCGArg prev_op_mb = -1; > + TCGOp *prev_op; The compiler throws up warnings about prev_op not being set: /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c: In function ‘tcg_optimize_mb’: /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c:611:17: error: ‘prev_op’ may be used uninitialized in this function [-Werror=maybe-uninitialized] tcg_op_remove(s, prev_op); ^ > + > + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) { > + TCGOp *op = &s->gen_op_buf[oi]; > + TCGArg *args = &s->gen_opparam_buf[op->args]; > + TCGOpcode opc = op->opc; > + > + switch (opc) { > + case INDEX_op_mb: > + { > + TCGBar curr_mb_type = args[0] & 0xF0; > + TCGBar prev_mb_type = prev_op_mb & 0xF0; > + > + if (curr_mb_type == prev_mb_type || > + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) { > + /* Remove 
the current weaker barrier op. The previous > + * barrier is stronger and sufficient. > + * mb; strl => mb; st > + */ > + tcg_op_remove(s, op); > + } else if (curr_mb_type == TCG_BAR_SC && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Remove the previous weaker barrier op. The current > + * barrier is stronger and sufficient. > + * ldaq; mb => ld; mb > + */ > + tcg_op_remove(s, prev_op); > + } else if (curr_mb_type == TCG_BAR_STRL && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Consecutive load-acquire and store-release barriers > + * can be merged into one stronger SC barrier > + * ldaq; strl => ld; mb; st > + */ > + args[0] = (args[0] & 0x0F) | TCG_BAR_SC; > + tcg_op_remove(s, prev_op); > + } > + prev_op_mb = args[0]; > + prev_op = op; > + break; > + } > + case INDEX_op_insn_start: > + break; > + default: > + prev_op_mb = -1; > + prev_op = NULL; > + break; > + } > + > + oi_next = op->next; > + } > +} > + > /* Propagate constants and copies, fold constant expressions. */ > void tcg_optimize(TCGContext *s) > { > @@ -1327,4 +1384,6 @@ void tcg_optimize(TCGContext *s) > break; > } > } > + > + tcg_optimize_mb(s); > } > diff --git a/tcg/tcg.h b/tcg/tcg.h > index 60ebc2b..2d7a46d 100644 > --- a/tcg/tcg.h > +++ b/tcg/tcg.h > @@ -904,6 +904,7 @@ void tcg_gen_callN(TCGContext *s, void *func, > TCGArg ret, int nargs, TCGArg *args); > > void tcg_op_remove(TCGContext *s, TCGOp *op); > +void tcg_optimize_mb(TCGContext *s); > void tcg_optimize(TCGContext *s); > > /* only used for debugging purposes */ -- Alex Bennée
Pranith Kumar <bobby.prani@gmail.com> writes: > This patch applies on top of the fence generation patch series. > > This commit optimizes fence instructions. Two optimizations are > currently implemented. These are: > > 1. Unnecessary duplicate fence instructions > > If the same fence instruction is detected consecutively, we remove > one instance of it. > > ex: mb; mb => mb, strl; strl => strl > > 2. Merging weaker fence with subsequent/previous stronger fence > > load-acquire/store-release fence can be combined with a full fence > without relaxing the ordering constraint. > > ex: a) ld; ldaq; mb => ld; mb > b) mb; strl; st => mb; st What test cases do you have for this? Currently the litmus tests don't fire (as they have exactly what they need). Most programs don't seem to trigger multiple barriers. > > Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> > --- > tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > tcg/tcg.h | 1 + > 2 files changed, 60 insertions(+) > > diff --git a/tcg/optimize.c b/tcg/optimize.c > index c0d975b..a655829 100644 > --- a/tcg/optimize.c > +++ b/tcg/optimize.c > @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2) > return false; > } > > +/* Eliminate duplicate and unnecessary fence instructions */ > +void tcg_optimize_mb(TCGContext *s) > +{ > + int oi, oi_next; > + TCGArg prev_op_mb = -1; > + TCGOp *prev_op; > + > + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) { > + TCGOp *op = &s->gen_op_buf[oi]; > + TCGArg *args = &s->gen_opparam_buf[op->args]; > + TCGOpcode opc = op->opc; > + > + switch (opc) { > + case INDEX_op_mb: > + { > + TCGBar curr_mb_type = args[0] & 0xF0; > + TCGBar prev_mb_type = prev_op_mb & 0xF0; > + > + if (curr_mb_type == prev_mb_type || > + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) { > + /* Remove the current weaker barrier op. The previous > + * barrier is stronger and sufficient. 
> + * mb; strl => mb; st > + */ > + tcg_op_remove(s, op); > + } else if (curr_mb_type == TCG_BAR_SC && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Remove the previous weaker barrier op. The current > + * barrier is stronger and sufficient. > + * ldaq; mb => ld; mb > + */ > + tcg_op_remove(s, prev_op); > + } else if (curr_mb_type == TCG_BAR_STRL && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Consecutive load-acquire and store-release barriers > + * can be merged into one stronger SC barrier > + * ldaq; strl => ld; mb; st > + */ > + args[0] = (args[0] & 0x0F) | TCG_BAR_SC; > + tcg_op_remove(s, prev_op); > + } > + prev_op_mb = args[0]; > + prev_op = op; > + break; > + } > + case INDEX_op_insn_start: > + break; > + default: > + prev_op_mb = -1; > + prev_op = NULL; > + break; > + } > + > + oi_next = op->next; > + } > +} > + > /* Propagate constants and copies, fold constant expressions. */ > void tcg_optimize(TCGContext *s) > { > @@ -1327,4 +1384,6 @@ void tcg_optimize(TCGContext *s) > break; > } > } > + > + tcg_optimize_mb(s); > } > diff --git a/tcg/tcg.h b/tcg/tcg.h > index 60ebc2b..2d7a46d 100644 > --- a/tcg/tcg.h > +++ b/tcg/tcg.h > @@ -904,6 +904,7 @@ void tcg_gen_callN(TCGContext *s, void *func, > TCGArg ret, int nargs, TCGArg *args); > > void tcg_op_remove(TCGContext *s, TCGOp *op); > +void tcg_optimize_mb(TCGContext *s); > void tcg_optimize(TCGContext *s); > > /* only used for debugging purposes */ -- Alex Bennée
On 14/07/2016 22:29, Pranith Kumar wrote: > + } else if (curr_mb_type == TCG_BAR_STRL && > + prev_mb_type == TCG_BAR_LDAQ) { > + /* Consecutive load-acquire and store-release barriers > + * can be merged into one stronger SC barrier > + * ldaq; strl => ld; mb; st > + */ > + args[0] = (args[0] & 0x0F) | TCG_BAR_SC; > + tcg_op_remove(s, prev_op); Is this really an optimization? For example the processor could reorder "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this if you change ldaq1/strl2 to ld1/mb/st2. On x86 for example a memory fence costs ~50 clock cycles, while normal loads and stores are of course faster. Of course this is useful if your target doesn't have ldaq/strl instructions. In this case, however, you probably want to lower ldaq to "ld;mb" and strl to "mb;st"; the other optimizations then will remove the unnecessary barrier. Paolo
Richard Henderson writes:

> On 07/15/2016 01:59 AM, Pranith Kumar wrote:
>> This patch applies on top of the fence generation patch series.
>>
>> This commit optimizes fence instructions. Two optimizations are
>> currently implemented. These are:
>>
>> 1. Unnecessary duplicate fence instructions
>>
>> If the same fence instruction is detected consecutively, we remove
>> one instance of it.
>>
>> ex: mb; mb => mb, strl; strl => strl
>>
>> 2. Merging weaker fence with subsequent/previous stronger fence
>>
>> load-acquire/store-release fence can be combined with a full fence
>> without relaxing the ordering constraint.
>>
>> ex: a) ld; ldaq; mb => ld; mb
>>     b) mb; strl; st => mb; st
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>> ---
>> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> tcg/tcg.h | 1 +
>> 2 files changed, 60 insertions(+)
>>
>> diff --git a/tcg/optimize.c b/tcg/optimize.c
>> index c0d975b..a655829 100644
>> --- a/tcg/optimize.c
>> +++ b/tcg/optimize.c
>> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
>>      return false;
>>  }
>>
>> +/* Eliminate duplicate and unnecessary fence instructions */
>> +void tcg_optimize_mb(TCGContext *s)
>> +{
>> +    int oi, oi_next;
>> +    TCGArg prev_op_mb = -1;
>> +    TCGOp *prev_op;
>> +
>> +    for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) {
>> +        TCGOp *op = &s->gen_op_buf[oi];
>> +        TCGArg *args = &s->gen_opparam_buf[op->args];
>> +        TCGOpcode opc = op->opc;
>> +
>> +        switch (opc) {
>> +        case INDEX_op_mb:
>> +        {
>> +            TCGBar curr_mb_type = args[0] & 0xF0;
>> +            TCGBar prev_mb_type = prev_op_mb & 0xF0;
>
> No check to make sure that prev_op_mb != -1?

I don't think that is necessary, since if prev_op_mb is -1 then
prev_mb_type will be 0xF0, which will not match any of the cases we
compare against below.

>> +
>> +            if (curr_mb_type == prev_mb_type ||
>> +                (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) {
>> +                /* Remove the current weaker barrier op. The previous
>> +                 * barrier is stronger and sufficient.
>> +                 * mb; strl => mb; st
>> +                 */
>> +                tcg_op_remove(s, op);
>> +            } else if (curr_mb_type == TCG_BAR_SC &&
>> +                       prev_mb_type == TCG_BAR_LDAQ) {
>> +                /* Remove the previous weaker barrier op. The current
>> +                 * barrier is stronger and sufficient.
>> +                 * ldaq; mb => ld; mb
>> +                 */
>> +                tcg_op_remove(s, prev_op);
>> +            } else if (curr_mb_type == TCG_BAR_STRL &&
>> +                       prev_mb_type == TCG_BAR_LDAQ) {
>> +                /* Consecutive load-acquire and store-release barriers
>> +                 * can be merged into one stronger SC barrier
>> +                 * ldaq; strl => ld; mb; st
>> +                 */
>> +                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> +                tcg_op_remove(s, prev_op);
>
> Don't you need to merge the bottom 4 bits as well?

That is an interesting point. Ideally, the TCG_MO_* flags for both
barriers should be the same. At least for now, for LDAR and STRL
barriers this flag is TCG_MO_ALL, since we do not have any other
variant.

>> +        default:
>> +            prev_op_mb = -1;
>> +            prev_op = NULL;
>> +            break;
>
> A reasonable start, I suppose, but there's nothing stopping you from
> considering barriers that aren't exactly adjacent.
>
> You only need to reset prev_op_mb when you see qemu_ld, qemu_st, call,
> or tcg_op_defs[opc].flags & TCG_OPF_BB_END.

That sounds like a good idea to extend this to. I will work on it.

Thanks!
Alex Bennée writes: > Pranith Kumar <bobby.prani@gmail.com> writes: > >> This patch applies on top of the fence generation patch series. >> >> This commit optimizes fence instructions. Two optimizations are >> currently implemented. These are: >> >> 1. Unnecessary duplicate fence instructions >> >> If the same fence instruction is detected consecutively, we remove >> one instance of it. >> >> ex: mb; mb => mb, strl; strl => strl >> >> 2. Merging weaker fence with subsequent/previous stronger fence >> >> load-acquire/store-release fence can be combined with a full fence >> without relaxing the ordering constraint. >> >> ex: a) ld; ldaq; mb => ld; mb >> b) mb; strl; st => mb; st >> >> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> >> --- >> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> tcg/tcg.h | 1 + >> 2 files changed, 60 insertions(+) >> >> diff --git a/tcg/optimize.c b/tcg/optimize.c >> index c0d975b..a655829 100644 >> --- a/tcg/optimize.c >> +++ b/tcg/optimize.c >> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2) >> return false; >> } >> >> +/* Eliminate duplicate and unnecessary fence instructions */ >> +void tcg_optimize_mb(TCGContext *s) >> +{ >> + int oi, oi_next; >> + TCGArg prev_op_mb = -1; >> + TCGOp *prev_op; > > > The compiler throws up warnings about prev_op not being set: > > /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c: In function ‘tcg_optimize_mb’: > /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c:611:17: error: ‘prev_op’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > tcg_op_remove(s, prev_op); > ^ OK, I will fix this. Thanks,
Alex Bennée writes: > Pranith Kumar <bobby.prani@gmail.com> writes: > >> This patch applies on top of the fence generation patch series. >> >> This commit optimizes fence instructions. Two optimizations are >> currently implemented. These are: >> >> 1. Unnecessary duplicate fence instructions >> >> If the same fence instruction is detected consecutively, we remove >> one instance of it. >> >> ex: mb; mb => mb, strl; strl => strl >> >> 2. Merging weaker fence with subsequent/previous stronger fence >> >> load-acquire/store-release fence can be combined with a full fence >> without relaxing the ordering constraint. >> >> ex: a) ld; ldaq; mb => ld; mb >> b) mb; strl; st => mb; st > > What test cases do you have for this? > > Currently the litmus tests don't fire (as they have exactly what they > need). Most programs don't seem to trigger multiple barriers. > Indeed, these cases are not so commonly seen in the wild. To test it I wrote small test programs using C11 builtins to generate appropriate instructions. Then verified it by running 'qemu-aarch64 -d in_asm,out_asm'. _Atomic int val; int main() { val = __atomic_load_n(&val, __ATOMIC_ACQUIRE); __atomic_store_n(&val, val, __ATOMIC_RELEASE); barrier(); barrier(); }
Paolo Bonzini writes:

> On 14/07/2016 22:29, Pranith Kumar wrote:
>> +            } else if (curr_mb_type == TCG_BAR_STRL &&
>> +                       prev_mb_type == TCG_BAR_LDAQ) {
>> +                /* Consecutive load-acquire and store-release barriers
>> +                 * can be merged into one stronger SC barrier
>> +                 * ldaq; strl => ld; mb; st
>> +                 */
>> +                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> +                tcg_op_remove(s, prev_op);
>
> Is this really an optimization? For example the processor could reorder
> "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this
> if you change ldaq1/strl2 to ld1/mb/st2.
>
> On x86 for example a memory fence costs ~50 clock cycles, while normal
> loads and stores are of course faster.
>
> Of course this is useful if your target doesn't have ldaq/strl
> instructions. In this case, however, you probably want to lower ldaq to
> "ld;mb" and strl to "mb;st"; the other optimizations then will remove
> the unnecessary barrier.
>
> Paolo

I agree that this is a conservative optimization. The problem is that
currently, even for architectures which have ldaq/strl instructions, the
TCG backend does not generate them. TCG just generates plain loads and
stores. I guess we didn't need anything more while TCG was
single-threaded, before MTTCG. I am trying to add support for generating
these instructions on AArch64. Once that is done, we can disable the
above optimization.
diff --git a/tcg/optimize.c b/tcg/optimize.c
index c0d975b..a655829 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
     return false;
 }
 
+/* Eliminate duplicate and unnecessary fence instructions */
+void tcg_optimize_mb(TCGContext *s)
+{
+    int oi, oi_next;
+    TCGArg prev_op_mb = -1;
+    TCGOp *prev_op;
+
+    for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) {
+        TCGOp *op = &s->gen_op_buf[oi];
+        TCGArg *args = &s->gen_opparam_buf[op->args];
+        TCGOpcode opc = op->opc;
+
+        switch (opc) {
+        case INDEX_op_mb:
+        {
+            TCGBar curr_mb_type = args[0] & 0xF0;
+            TCGBar prev_mb_type = prev_op_mb & 0xF0;
+
+            if (curr_mb_type == prev_mb_type ||
+                (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) {
+                /* Remove the current weaker barrier op. The previous
+                 * barrier is stronger and sufficient.
+                 * mb; strl => mb; st
+                 */
+                tcg_op_remove(s, op);
+            } else if (curr_mb_type == TCG_BAR_SC &&
+                       prev_mb_type == TCG_BAR_LDAQ) {
+                /* Remove the previous weaker barrier op. The current
+                 * barrier is stronger and sufficient.
+                 * ldaq; mb => ld; mb
+                 */
+                tcg_op_remove(s, prev_op);
+            } else if (curr_mb_type == TCG_BAR_STRL &&
+                       prev_mb_type == TCG_BAR_LDAQ) {
+                /* Consecutive load-acquire and store-release barriers
+                 * can be merged into one stronger SC barrier
+                 * ldaq; strl => ld; mb; st
+                 */
+                args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
+                tcg_op_remove(s, prev_op);
+            }
+            prev_op_mb = args[0];
+            prev_op = op;
+            break;
+        }
+        case INDEX_op_insn_start:
+            break;
+        default:
+            prev_op_mb = -1;
+            prev_op = NULL;
+            break;
+        }
+
+        oi_next = op->next;
+    }
+}
+
 /* Propagate constants and copies, fold constant expressions. */
 void tcg_optimize(TCGContext *s)
 {
@@ -1327,4 +1384,6 @@ void tcg_optimize(TCGContext *s)
             break;
         }
     }
+
+    tcg_optimize_mb(s);
 }
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 60ebc2b..2d7a46d 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -904,6 +904,7 @@ void tcg_gen_callN(TCGContext *s, void *func,
                    TCGArg ret, int nargs, TCGArg *args);
 
 void tcg_op_remove(TCGContext *s, TCGOp *op);
+void tcg_optimize_mb(TCGContext *s);
 void tcg_optimize(TCGContext *s);
 
 /* only used for debugging purposes */
This patch applies on top of the fence generation patch series.

This commit optimizes fence instructions. Two optimizations are
currently implemented. These are:

1. Unnecessary duplicate fence instructions

   If the same fence instruction is detected consecutively, we remove
   one instance of it.

   ex: mb; mb => mb, strl; strl => strl

2. Merging weaker fence with subsequent/previous stronger fence

   load-acquire/store-release fence can be combined with a full fence
   without relaxing the ordering constraint.

   ex: a) ld; ldaq; mb => ld; mb
       b) mb; strl; st => mb; st

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h      |  1 +
 2 files changed, 60 insertions(+)