Message ID | 20110125122749.GA19736@edde.se.axis.com |
---|---|
State | New |
On 01/25/2011 04:27 AM, Edgar E. Iglesias wrote:
> I've tested this patch a bit and got mixed results. I tested with patched
> CRIS and MicroBlaze translators. The patch works OK (it doesn't break
> anything) for the use cases I had but I saw a bit of a slowdown with
> MicroBlaze (compared to not using deposit at all).
>
> I suspect that the fast 8 and 16 bit x86 deposits are giving me a slight
> speedup with CRIS. But MicroBlaze uses one bit fields into bit 2 and
> 31. Those seem to be slower with deposit than with other tcg sequences.
>
> I would have guessed that at worst, this patch would be equally fast
> as any TCG sequence. Am I missing something?

With or without the i386 tcg-target.c changes?

If without, then I'm stumped, since it looks like identical tcg ops
being emitted.

If with, then perhaps SHLD is slower than I thought. I see that GCC
lists this insn as "vector decoded" for AMD cores, as opposed to
"direct decoded". If this insn is indeed microcoded on some hosts
then maybe the i386 tcg-target patch isn't such a great idea.

That said, there are still other tcg targets which support this
operation directly. I would be shocked if you measured a slowdown
with these changes on a ppc host, for instance.

r~

On Tue, Jan 25, 2011 at 08:13:53AM -0800, Richard Henderson wrote:
> On 01/25/2011 04:27 AM, Edgar E. Iglesias wrote:
> > I've tested this patch a bit and got mixed results. I tested with patched
> > CRIS and MicroBlaze translators. The patch works OK (it doesn't break
> > anything) for the use cases I had but I saw a bit of a slowdown with
> > MicroBlaze (compared to not using deposit at all).
> >
> > I suspect that the fast 8 and 16 bit x86 deposits are giving me a slight
> > speedup with CRIS. But MicroBlaze uses one bit fields into bit 2 and
> > 31. Those seem to be slower with deposit than with other tcg sequences.
> >
> > I would have guessed that at worst, this patch would be equally fast
> > as any TCG sequence. Am I missing something?
>
> With or without the i386 tcg-target.c changes?
>
> If without, then I'm stumped, since it looks like identical tcg ops
> being emitted.

It's with the tcg-target patch.

> If with, then perhaps SHLD is slower than I thought. I see that GCC
> lists this insn as "vector decoded" for AMD cores, as opposed to
> "direct decoded". If this insn is indeed microcoded on some hosts
> then maybe the i386 tcg-target patch isn't such a great idea.

OK, I see. Maybe we should try to emit an insn sequence more similar to
what tcg was emitting (for the non 8 & 16-bit deposits)? That ought to
at least give similar results as before for those and give us a speedup
for the byte and word moves.

> That said, there are still other tcg targets which support this
> operation directly. I would be shocked if you measured a slowdown
> with these changes on a ppc host, for instance.

Yep, agreed.

Cheers

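For readers following the thread, a deposit can be written as a plain mask-and-merge, which is roughly the kind of generic sequence being compared against SHLD here. The sketch below is ordinary C, not QEMU or TCG code; the name `deposit32_sketch` and its parameter names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not taken from QEMU: insert the low `len` bits of
 * `field` into `base` starting at bit `pos`, leaving all other bits of
 * `base` untouched. */
uint32_t deposit32_sketch(uint32_t base, uint32_t field,
                          unsigned pos, unsigned len)
{
    /* The field must be non-empty and fit inside the 32-bit word. */
    assert(len >= 1 && len <= 32 && pos <= 32 - len);

    /* Build `len` one-bits, avoiding the undefined shift by 32. */
    uint32_t ones = (len == 32) ? UINT32_MAX : ((UINT32_C(1) << len) - 1);
    uint32_t mask = ones << pos;

    /* Clear the destination field, then merge in the shifted source. */
    return (base & ~mask) | ((field << pos) & mask);
}
```

With `pos == 0` and `len` of 8 or 16 this is the byte/word store case from the patch description; the one-bit fields at bits 2 and 31 that MicroBlaze uses are the `len == 1` case.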
Hi,

On Tue, Jan 25, 2011 at 01:27:49PM +0100, Edgar E. Iglesias wrote:
> On Mon, Jan 10, 2011 at 07:23:46PM -0800, Richard Henderson wrote:
> > Special case deposits that are implementable with byte and word stores.
> > Otherwise implement with double-word shift plus rotates.
> >
> > Expose tcg_scratch_alloc to the backend for allocation of scratch registers.
> >
> > Signed-off-by: Richard Henderson <rth@twiddle.net>
>
> Hi,
>
> I've tested this patch a bit and got mixed results. I tested with patched
> CRIS and MicroBlaze translators. The patch works OK (it doesn't break
> anything) for the use cases I had but I saw a bit of a slowdown with
> MicroBlaze (compared to not using deposit at all).
>

This weekend I have tested it emulating an x86-64 machine on x86-64,
with all the patch series applied. I have measured the boot time from
the bootloader up to the graphical environment of a Debian installation.
I used -snapshot to make sure the host hard drive is not introducing any
noise in the measurement (so that the whole image is in the host cache),
and did the measurement 10 times. The machine is a Core 2 Q9650, nothing
else was running on the machine except the few standard daemons.

I have found that the boot time is roughly 1.8% faster with the patch
series applied. It's undoubtedly an improvement, but still close to the
measurement noise. This is a bit disappointing...

On 01/31/2011 12:33 AM, Aurelien Jarno wrote:
> This weekend I have tested it emulating an x86-64 machine on x86-64,
> with all the patch series applied. I have measured the boot time from
> the bootloader up to the graphical environment of a Debian installation.
> I used -snapshot to make sure the host hard drive is not introducing any
> noise in the measurement (so that the whole image is in the host cache),
> and did the measurement 10 times. The machine is a Core 2 Q9650, nothing
> else was running on the machine except the few standard daemons.
>
> I have found that the boot time is roughly 1.8% faster with the patch
> series applied. It's undoubtedly an improvement, but still close to the
> measurement noise. This is a bit disappointing...

It's also not terribly surprising, with that test scenario. GCC tries
not to generate partial register stores, except when (as here) it's
really a bitfield insert.

A test that might show off the deposit code more would be booting a
16-bit OS. Either FreeDOS, or Windows 3.1 (if anyone still has a copy).
In that case, the translator will be emitting a deposit op for almost
every guest instruction.

(Which is probably a mistake from a translator point of view -- there's
no reason we can't emulate 16-bit operations with 32-bit operations given
that the high bits are ignorable.)

r~

On 02/08/2011 07:05 PM, Richard Henderson wrote:
> (Which is probably a mistake from a translator point of view -- there's
> no reason we can't emulate 16-bit operations with 32-bit operations given
> that the high bits are ignorable.)

Not really, you never know if the guest is going to use a 66 prefix on
the next instruction.

Paolo

On Wed, Feb 9, 2011 at 9:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 02/08/2011 07:05 PM, Richard Henderson wrote:
>>
>> (Which is probably a mistake from a translator point of view -- there's
>> no reason we can't emulate 16-bit operations with 32-bit operations given
>> that the high bits are ignorable.)
>
> Not really, you never know if the guest is going to use a 66 prefix on the
> next instruction.

Perhaps a system similar to the current lazy condition code evaluation
could be used. The translator would keep a record of whether the high
bits are in use, and if they are, emit extra ops to clear (or
recalculate?) the high bits.

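To make the lazy idea concrete, here is a toy model in plain C. It is only a sketch under several assumptions: the struct `LazyReg`, its fields, and both functions are invented for illustration, and it tracks staleness at run time, whereas the suggestion above is about the translator tracking it at translation time. It shows that a 16-bit add can be done as a plain 32-bit add as long as the upper half is repaired before any 32-bit use.

```c
#include <stdint.h>

/* Hypothetical model of a 32-bit guest register whose upper half may be
 * stale after 16-bit operations. None of these names exist in QEMU. */
typedef struct {
    uint32_t val;        /* low 16 bits are always correct */
    uint32_t saved_high; /* correct upper 16 bits, already in position */
    int high_stale;      /* nonzero if the upper 16 bits of val are garbage */
} LazyReg;

/* 16-bit add performed as a full 32-bit add, with no deposit afterwards.
 * The carry out of bit 15 and the operand's upper bits pollute the upper
 * half, so remember the good upper half the first time this happens. */
void add16_lazy(LazyReg *r, uint32_t operand)
{
    if (!r->high_stale) {
        r->saved_high = r->val & 0xFFFF0000u;
        r->high_stale = 1;
    }
    r->val += operand;
}

/* Called before any 32-bit use (for example an instruction carrying an
 * operand-size prefix): restore the upper half, keep the low 16 bits. */
uint32_t read32(LazyReg *r)
{
    if (r->high_stale) {
        r->val = r->saved_high | (r->val & 0x0000FFFFu);
        r->high_stale = 0;
    }
    return r->val;
}
```

Whether this wins anything depends on how often 32-bit reads follow 16-bit writes; the model says nothing about that.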
diff --git a/target-microblaze/translate.c b/target-microblaze/translate.c
index 2207431..39ab3a5 100644
--- a/target-microblaze/translate.c
+++ b/target-microblaze/translate.c
@@ -160,6 +160,7 @@ static void read_carry(DisasContext *dc, TCGv d)

 static void write_carry(DisasContext *dc, TCGv v)
 {
+#if 0
     TCGv t0 = tcg_temp_new();
     tcg_gen_shli_tl(t0, v, 31);
     tcg_gen_sari_tl(t0, t0, 31);
@@ -168,6 +169,10 @@ static void write_carry(DisasContext *dc, TCGv v)
                     ~(MSR_C | MSR_CC));
     tcg_gen_or_tl(cpu_SR[SR_MSR], cpu_SR[SR_MSR], t0);
     tcg_temp_free(t0);
+#else
+    tcg_gen_deposit_tl(cpu_SR[SR_MSR], cpu_SR[SR_MSR], v, 2, 1);
+    tcg_gen_deposit_tl(cpu_SR[SR_MSR], cpu_SR[SR_MSR], v, 31, 1);
+#endif
 }

CRIS translator:

commit 9f427e14b2535a067bf046fea093f28cfaa92f7f
Author: Edgar E. Iglesias <edgar.iglesias@gmail.com>
Date:   Fri Jan 21 22:09:44 2011 +0100

    cris: Use deposit for ALU writeback

    Most ALU insns on CRIS have deposit semantics in the writeback
    stage. Use the new deposit tcg operation to perform the write
    back to registers. Move the extract of the result into cc_result
    to the slow path in evaluate_flags.

    Signed-off-by: Edgar E. Iglesias <edgar.iglesias@gmail.com>

diff --git a/target-cris/translate.c b/target-cris/translate.c
index f4cc125..018ce68 100644
--- a/target-cris/translate.c
+++ b/target-cris/translate.c
@@ -861,11 +861,6 @@ static void cris_alu_op_exec(DisasContext *dc, int op,
         BUG();
         break;
     }
-
-    if (size == 1)
-        tcg_gen_andi_tl(dst, dst, 0xff);
-    else if (size == 2)
-        tcg_gen_andi_tl(dst, dst, 0xffff);
 }

 static void cris_alu(DisasContext *dc, int op,
@@ -880,6 +875,7 @@ static void cris_alu(DisasContext *dc, int op,
         tmp = tcg_temp_new();
         writeback = 0;
     } else if (size == 4) {
+        /* We write directly into the dest. */
         tmp = d;
         writeback = 0;
     } else
@@ -892,11 +888,7 @@ static void cris_alu(DisasContext *dc, int op,

     /* Writeback. */
     if (writeback) {
-        if (size == 1)
-            tcg_gen_andi_tl(d, d, ~0xff);
-        else
-            tcg_gen_andi_tl(d, d, ~0xffff);
-        tcg_gen_or_tl(d, d, tmp);
+        tcg_gen_deposit_tl(d, d, tmp, 0, size * 8);
     }
     if (!TCGV_EQUAL(tmp, d))
         tcg_temp_free(tmp);
@@ -941,6 +933,10 @@ static void gen_tst_cc (DisasContext *dc, TCGv cc, int cond)
      * When this function is done, T0 should be non-zero if the condition
      * code is true.
      */
+    if (dc->cc_size != 4) {
+        tcg_gen_andi_tl(cc_result, cc_result,
+                        (1 << (dc->cc_size * 8)) - 1);
+    }
     arith_opt = arith_cc(dc) && !dc->flags_uptodate;
     move_opt = (dc->cc_op == CC_OP_MOVE);
     switch (cond) {
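
As a sanity check on the two rewrites in these diffs, the old and new sequences can be modelled on plain integers. This is a sketch in ordinary C, not TCG code. The carry bit positions (2 and 31) come from the discussion earlier in the thread; the two lines the MicroBlaze hunks skip over are assumed to mask `t0` and `MSR` with the carry bits, and every function and macro name below is made up for illustration.

```c
#include <stdint.h>

/* Assumed values: the thread says the carry lives in bits 2 and 31. */
#define MSR_C_SKETCH   (UINT32_C(1) << 2)
#define MSR_CC_SKETCH  (UINT32_C(1) << 31)

/* Generic deposit for fields narrower than 32 bits (hypothetical helper). */
uint32_t deposit_sketch(uint32_t base, uint32_t field,
                        unsigned pos, unsigned len)
{
    uint32_t mask = ((UINT32_C(1) << len) - 1) << pos;
    return (base & ~mask) | ((field << pos) & mask);
}

/* Old MicroBlaze write_carry: spread bit 0 of v over the whole word
 * (shli 31 then sari 31), keep the two carry bits, merge into msr. */
uint32_t write_carry_old(uint32_t msr, uint32_t v)
{
    uint32_t t0 = (v & 1u) ? UINT32_MAX : 0u;
    t0 &= (MSR_C_SKETCH | MSR_CC_SKETCH);
    msr &= ~(MSR_C_SKETCH | MSR_CC_SKETCH);
    return msr | t0;
}

/* New MicroBlaze write_carry: two one-bit deposits. */
uint32_t write_carry_new(uint32_t msr, uint32_t v)
{
    msr = deposit_sketch(msr, v, 2, 1);
    return deposit_sketch(msr, v, 31, 1);
}

/* Old CRIS writeback for size 1 or 2: tmp was masked in
 * cris_alu_op_exec, then merged into the cleared low part of d. */
uint32_t cris_writeback_old(uint32_t d, uint32_t tmp, unsigned size)
{
    uint32_t low = (size == 1) ? 0xFFu : 0xFFFFu;
    return (d & ~low) | (tmp & low);
}

/* New CRIS writeback: one deposit of size * 8 bits at bit 0. */
uint32_t cris_writeback_new(uint32_t d, uint32_t tmp, unsigned size)
{
    return deposit_sketch(d, tmp, 0, size * 8);
}
```

Under these assumptions the old and new versions agree on their results, so any difference seen in the measurements would come from the host code generated for the deposit op, not from what it computes.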