diff mbox

[v2,0/3] Fix exceptions handling for MIPS and i386

Message ID 20150618081640.GK931@aurel32.net
State New
Headers show

Commit Message

Aurelien Jarno June 18, 2015, 8:16 a.m. UTC
On 2015-06-18 10:12, Pavel Dovgaluk wrote:
> > From: Aurelien Jarno [mailto:aurelien@aurel32.net]
> > On 2015-06-17 15:41, Pavel Dovgalyuk wrote:
> > > In icount mode every translation block looks as follows:
> > >
> > > if icount < n then exit
> > > icount -= n
> > > instr1
> > > instr2
> > > ...
> > > instrn
> > > exit
> > >
> > > When one of these instructions initiates an exception, icount should be
> > > restored and adjusted number of instructions should be subtracted from icount
> > > instead of initial n.
> > >
> > > tlb_fill function passes retaddr to raise_exception, which allows restoring
> > > current instructions in TB and correct icount calculation.
> > >
> > > When exception triggered with other function (e.g. by embedding call to
> > > exception raising helper into TB), then PC is not passed as retaddr and
> > > correct icount is not recovered. In such cases icount will be decreased
> > > by the value equal to the size of TB.
> > 
> > Looking at how icount work, I see it's basically a variable in the CPU
> > state (icount_decr.u16.low), which is already accessed from the TB.
> > Couldn't we adjust it using additional code before generating an
> > exception, when in icount mode.
> > 
> > For example for MIPS, we can add some code before generate_exception
> > which use the value from s->gen_opc_icount[j] to adjust
> > the variable icount_decr.u16.low.
> 
> It is possible, but it will incur additional overhead, because we will 
> have to update icount every time the exception might be generated.
> We'll have to update icount value before and after every helper call, 
> that can cause an exception:
> 
> icount -= n
> ...
> instr_k
> icount += n - k
> helper
> icount -= n - k
> ...
> 
> And this overhead will slowdown the code even if no exception occur.

That's where I might disagree. Retranslation seems a very good idea on
the paper, but in practice it doesn't seems to always bring the
performance improvement it should. In addition it seems to be highly
dependent on the target. Just to give some numbers, on MIPS (as your
patch originally concerns this architecture), 40% of code generation is
actually due to retranslation. The problem is that over the time we have
improved a lot the code generation (liveness analysis, better register
allocation, constant propagation, ...) and thus we have increased the
code generation time. While it clearly has some benefits when this code
is actually executed, it's not the case when the code is simply
retranslated. In short we spend more time to find the CPU state
corresponding to an exception than before.

A simple way to show that is to apply the simple patch below, which
disable retranslation and save the CPU state before each instruction:


On x86, this patch brings a 5% boot time improvement on MIPS. One of the
reason is that the TCG code generator has a good knowledge about which
TCG ops or helpers can trigger an exception, so it can optimize out part
of the instructions saving the CPU state. I guess that the host CPUs have
also evolved over the time, now being superscalar and out-of-order so
that saving the CPU state can be done "in background". Also it's just a
quick and dirty patch, we can probably even do better.

All of that to say that I am worried for the performances to see more
paths through the retranslation code, especially on MIPS as it seems to
be costly. That said I haven't really look in details at other targets,
nor hosts.

Now to come back about your patches, we might want to simply fix icount
first, even if it has some performance impact, and deal with the
retranslation issue separately, as it concerns more than just icount.

Comments

Pavel Dovgalyuk June 18, 2015, 8:58 a.m. UTC | #1
> From: Aurelien Jarno [mailto:aurelien@aurel32.net]
> On 2015-06-18 10:12, Pavel Dovgaluk wrote:
> > > From: Aurelien Jarno [mailto:aurelien@aurel32.net]
> > > On 2015-06-17 15:41, Pavel Dovgalyuk wrote:
> > > > In icount mode every translation block looks as follows:
> > > >
> > > > if icount < n then exit
> > > > icount -= n
> > > > instr1
> > > > instr2
> > > > ...
> > > > instrn
> > > > exit
> > > >
> > > > When one of these instructions initiates an exception, icount should be
> > > > restored and adjusted number of instructions should be subtracted from icount
> > > > instead of initial n.
> > > >
> > > > tlb_fill function passes retaddr to raise_exception, which allows restoring
> > > > current instructions in TB and correct icount calculation.
> > > >
> > > > When exception triggered with other function (e.g. by embedding call to
> > > > exception raising helper into TB), then PC is not passed as retaddr and
> > > > correct icount is not recovered. In such cases icount will be decreased
> > > > by the value equal to the size of TB.
> > >
> > > Looking at how icount work, I see it's basically a variable in the CPU
> > > state (icount_decr.u16.low), which is already accessed from the TB.
> > > Couldn't we adjust it using additional code before generating an
> > > exception, when in icount mode.
> > >
> > > For example for MIPS, we can add some code before generate_exception
> > > which use the value from s->gen_opc_icount[j] to adjust
> > > the variable icount_decr.u16.low.
> >
> > It is possible, but it will incur additional overhead, because we will
> > have to update icount every time the exception might be generated.
> > We'll have to update icount value before and after every helper call,
> > that can cause an exception:
> >
> > icount -= n
> > ...
> > instr_k
> > icount += n - k
> > helper
> > icount -= n - k
> > ...
> >
> > And this overhead will slowdown the code even if no exception occur.
> 
> That's where I might disagree. Retranslation seems a very good idea on
> the paper, but in practice it doesn't seems to always bring the
> performance improvement it should. In addition it seems to be highly
> dependent on the target. Just to give some numbers, on MIPS (as your
> patch originally concerns this architecture), 40% of code generation is
> actually due to retranslation. The problem is that over the time we have
> improved a lot the code generation (liveness analysis, better register
> allocation, constant propagation, ...) and thus we have increased the
> code generation time. While it clearly has some benefits when this code
> is actually executed, it's not the case when the code is simply
> retranslated. In short we spend more time to find the CPU state
> corresponding to an exception than before.
> 
...
> 
> All of that to say that I am worried for the performances to see more
> paths through the retranslation code, especially on MIPS as it seems to
> be costly. That said I haven't really look in details at other targets,
> nor hosts.

I fixed syscalls, exceptions that occur without any conditions,
and removed redundant calls to save_cpu_state. Then I measured the performance
without enabling icount. And Linux boots even faster than with original version.
I'll submit this version for review soon.

> Now to come back about your patches, we might want to simply fix icount
> first, even if it has some performance impact, and deal with the
> retranslation issue separately, as it concerns more than just icount.

Pavel Dovgalyuk
diff mbox

Patch

diff --git a/target-mips/translate.c b/target-mips/translate.c
index 1d128ee..5238d71 100644
--- a/target-mips/translate.c
+++ b/target-mips/translate.c
@@ -19435,6 +19435,7 @@  gen_intermediate_code_internal(MIPSCPU *cpu, TranslationBlock *tb,
     LOG_DISAS("\ntb %p idx %d hflags %04x\n", tb, ctx.mem_idx, ctx.hflags);
     gen_tb_start(tb);
     while (ctx.bstate == BS_NONE) {
+        save_cpu_state(&ctx, 1);
         if (unlikely(!QTAILQ_EMPTY(&cs->breakpoints))) {
             QTAILQ_FOREACH(bp, &cs->breakpoints, entry) {
                 if (bp->pc == ctx.pc) {
diff --git a/translate-all.c b/translate-all.c
index b6b0e1c..3d4c017 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -212,6 +212,8 @@  static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     int64_t ti;
 #endif
 
+    return -1;
+
 #ifdef CONFIG_PROFILER
     ti = profile_getclock();
 #endif