Patchwork [5/7] tcg-i386: Implement deposit operation.

Submitter Edgar Iglesias
Date Jan. 25, 2011, 12:27 p.m.
Message ID <20110125122749.GA19736@edde.se.axis.com>
Permalink /patch/80354/
State New

Comments

Edgar Iglesias - Jan. 25, 2011, 12:27 p.m.
On Mon, Jan 10, 2011 at 07:23:46PM -0800, Richard Henderson wrote:
> Special case deposits that are implementable with byte and word stores.
> Otherwise implement with double-word shift plus rotates.
> 
> Expose tcg_scratch_alloc to the backend for allocation of scratch registers.
> 
> Signed-off-by: Richard Henderson <rth@twiddle.net>

Hi,

I've tested this patch a bit and got mixed results. I tested with patched
CRIS and MicroBlaze translators. The patch works OK (it doesn't break
anything) for the use cases I had, but I saw a bit of a slowdown with
MicroBlaze (compared to not using deposit at all).

I suspect that the fast 8- and 16-bit x86 deposits are giving me a slight
speedup with CRIS. But MicroBlaze deposits one-bit fields into bits 2 and
31. Those seem to be slower with deposit than with other tcg sequences.

I would have guessed that at worst, this patch would be equally fast
as any TCG sequence. Am I missing something?
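
For reference, the deposit semantics under discussion can be sketched in plain C as a mask-and-or. This is only a model of the operation, not the code TCG emits; the name deposit32_ref is illustrative and does not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model: replace len bits of base, starting at bit ofs,
 * with the low len bits of field.  The name is illustrative only. */
static inline uint32_t deposit32_ref(uint32_t base, uint32_t field,
                                     unsigned ofs, unsigned len)
{
    assert(len > 0 && len <= 32 && ofs <= 32 - len);
    /* Build the field mask without ever shifting by 32. */
    uint32_t low = (len == 32) ? UINT32_MAX : ((UINT32_C(1) << len) - 1);
    uint32_t mask = low << ofs;
    return (base & ~mask) | ((field << ofs) & mask);
}
```

The MicroBlaze case corresponds to len = 1 with ofs = 2 and ofs = 31; the CRIS case corresponds to ofs = 0 with len = 8 or 16.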

These are the patches I've applied:

Microblaze translator:
Richard Henderson - Jan. 25, 2011, 4:13 p.m.
On 01/25/2011 04:27 AM, Edgar E. Iglesias wrote:
> I've tested this patch a bit and got mixed results. I tested with patched
> CRIS and MicroBlaze translators. The patch works OK (it doesn't break
> anything) for the use cases I had, but I saw a bit of a slowdown with
> MicroBlaze (compared to not using deposit at all).
> 
> I suspect that the fast 8- and 16-bit x86 deposits are giving me a slight
> speedup with CRIS. But MicroBlaze deposits one-bit fields into bits 2 and
> 31. Those seem to be slower with deposit than with other tcg sequences.
> 
> I would have guessed that at worst, this patch would be equally fast
> as any TCG sequence. Am I missing something?

With or without the i386 tcg-target.c changes?

If without, then I'm stumped, since it looks like identical tcg ops
being emitted.

If with, then perhaps SHLD is slower than I thought.  I see that GCC
lists this insn as "vector decoded" for AMD cores, as opposed to
"direct decoded".  If this insn is indeed microcoded on some hosts
then maybe the i386 tcg-target patch isn't such a great idea.

That said, there are still other tcg targets which support this 
operation directly.  I would be shocked if you measured a slowdown
with these changes on a ppc host, for instance.


r~
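
To make the cost being discussed concrete, here is a C model of one plausible "double-word shift plus rotates" sequence for a general deposit: two rotates, one SHLD, and a scratch register holding the pre-shifted field. This is a sketch under assumptions. The thread does not show the exact instructions the tcg-target patch emits, so the real sequence may differ, and rol32, shld32 and deposit32_shld are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n (mod 32) without undefined shift counts. */
static inline uint32_t rol32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Model of x86 SHLD for counts 1..31: shift dst left by n and fill
 * the vacated low bits from the high bits of src. */
static inline uint32_t shld32(uint32_t dst, uint32_t src, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (dst << n) | (src >> (32 - n));
}

/* Deposit the low len bits of field into base at bit ofs. */
static inline uint32_t deposit32_shld(uint32_t base, uint32_t field,
                                      unsigned ofs, unsigned len)
{
    assert(len > 0 && len <= 32 && ofs <= 32 - len);
    if (len == 32) {
        return field;                          /* whole-register move */
    }
    uint32_t src = field << (32 - len);        /* scratch: field in top bits */
    uint32_t t = rol32(base, 32 - ofs - len);  /* old field now in top bits */
    t = shld32(t, src, len);                   /* drop old field, shift in new */
    return rol32(t, ofs);                      /* rotate new field back to ofs */
}
```

Even when each step is cheap, this is three dependent operations plus a scratch setup, so a slow (microcoded) SHLD on the host could plausibly make it lose against a plain and/or sequence.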
Edgar Iglesias - Jan. 25, 2011, 4:48 p.m.
On Tue, Jan 25, 2011 at 08:13:53AM -0800, Richard Henderson wrote:
> On 01/25/2011 04:27 AM, Edgar E. Iglesias wrote:
> > I've tested this patch a bit and got mixed results. I tested with patched
> > CRIS and MicroBlaze translators. The patch works OK (it doesn't break
> > anything) for the use cases I had, but I saw a bit of a slowdown with
> > MicroBlaze (compared to not using deposit at all).
> > 
> > I suspect that the fast 8- and 16-bit x86 deposits are giving me a slight
> > speedup with CRIS. But MicroBlaze deposits one-bit fields into bits 2 and
> > 31. Those seem to be slower with deposit than with other tcg sequences.
> > 
> > I would have guessed that at worst, this patch would be equally fast
> > as any TCG sequence. Am I missing something?
> 
> With or without the i386 tcg-target.c changes?
> 
> If without, then I'm stumped, since it looks like identical tcg ops
> being emitted.

It's with the tcg-target patch.

> If with, then perhaps SHLD is slower than I thought.  I see that GCC
> lists this insn as "vector decoded" for AMD cores, as opposed to
> "direct decoded".  If this insn is indeed microcoded on some hosts
> then maybe the i386 tcg-target patch isn't such a great idea.

OK, I see. Maybe we should try to emit an insn sequence closer to what
tcg was emitting before (for the non 8- and 16-bit deposits)?
That ought to at least give results similar to before for those, and
still give us a speedup for the byte and word moves.

> That said, there are still other tcg targets which support this 
> operation directly.  I would be shocked if you measured a slowdown
> with these changes on a ppc host, for instance.

Yep, agreed.

Cheers
Aurelien Jarno - Jan. 31, 2011, 8:33 a.m.
Hi,

On Tue, Jan 25, 2011 at 01:27:49PM +0100, Edgar E. Iglesias wrote:
> On Mon, Jan 10, 2011 at 07:23:46PM -0800, Richard Henderson wrote:
> > Special case deposits that are implementable with byte and word stores.
> > Otherwise implement with double-word shift plus rotates.
> > 
> > Expose tcg_scratch_alloc to the backend for allocation of scratch registers.
> > 
> > Signed-off-by: Richard Henderson <rth@twiddle.net>
> 
> Hi,
> 
> I've tested this patch a bit and got mixed results. I tested with patched
> CRIS and MicroBlaze translators. The patch works OK (it doesn't break
> anything) for the use cases I had, but I saw a bit of a slowdown with
> MicroBlaze (compared to not using deposit at all).
> 

This weekend I tested it by emulating an x86-64 machine on x86-64,
with the whole patch series applied. I measured the boot time from
the bootloader up to the graphical environment of a Debian installation.
I used -snapshot to make sure the host hard drive does not introduce any
noise into the measurement (so that the whole image is in the host cache),
and repeated the measurement 10 times. The machine is a Core 2 Q9650; nothing
else was running on it except the few standard daemons.

I have found that the boot time is roughly 1.8% faster with the patch
series applied. It's undoubtedly an improvement, but still close to the
measurement noise. This is a bit disappointing...
Richard Henderson - Feb. 8, 2011, 6:05 p.m.
On 01/31/2011 12:33 AM, Aurelien Jarno wrote:
> This weekend I tested it by emulating an x86-64 machine on x86-64,
> with the whole patch series applied. I measured the boot time from
> the bootloader up to the graphical environment of a Debian installation.
> I used -snapshot to make sure the host hard drive does not introduce any
> noise into the measurement (so that the whole image is in the host cache),
> and repeated the measurement 10 times. The machine is a Core 2 Q9650; nothing
> else was running on it except the few standard daemons.
> 
> I have found that the boot time is roughly 1.8% faster with the patch
> series applied. It's undoubtedly an improvement, but still close to the
> measurement noise. This is a bit disappointing...

It's also not terribly surprising, with that test scenario.  GCC tries
not to generate partial register stores, except when (as here) it's 
really a bitfield insert.

A test that might show off the deposit code more would be booting a
16-bit OS.  Either FreeDOS, or Windows 3.1 (if anyone still has a copy).
In that case, the translator will be emitting a deposit op for almost
every guest instruction.

(Which is probably a mistake from a translator point of view -- there's
no reason we can't emulate 16-bit operations with 32-bit operations given
that the high bits are ignorable.)


r~
Paolo Bonzini - Feb. 9, 2011, 7:41 a.m.
On 02/08/2011 07:05 PM, Richard Henderson wrote:
> (Which is probably a mistake from a translator point of view -- there's
> no reason we can't emulate 16-bit operations with 32-bit operations given
> that the high bits are ignorable.)

Not really, you never know if the guest is going to use a 66 prefix on 
the next instruction.

Paolo
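
The objection can be illustrated with a small C model. A 16-bit guest ADD carried out as a plain 32-bit operation gives the same low half as the precise version, but leaks a carry into bit 16, which a following instruction with a 66 prefix (32-bit operand size) would observe. The function names are illustrative and not taken from the translator.

```c
#include <assert.h>
#include <stdint.h>

/* Precise 16-bit ADD: the result is deposited into bits 0..15 and
 * bits 16..31 of the register are preserved. */
static inline uint32_t add16_deposit(uint32_t reg, uint16_t imm)
{
    uint32_t sum = (reg + imm) & 0xffffu;
    return (reg & 0xffff0000u) | sum;
}

/* The same ADD done as a plain 32-bit operation: the low 16 bits
 * match, but a carry out of bit 15 corrupts the high half. */
static inline uint32_t add16_as_32(uint32_t reg, uint16_t imm)
{
    return reg + imm;
}
```

With reg = 0x0001ffff and imm = 1, both variants leave 0x0000 in the low half, but the full register is 0x00010000 in the first case and 0x00020000 in the second.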
Blue Swirl - Feb. 9, 2011, 5:24 p.m.
On Wed, Feb 9, 2011 at 9:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 02/08/2011 07:05 PM, Richard Henderson wrote:
>>
>> (Which is probably a mistake from a translator point of view -- there's
>> no reason we can't emulate 16-bit operations with 32-bit operations given
>> that the high bits are ignorable.)
>
> Not really, you never know if the guest is going to use a 66 prefix on the
> next instruction.

Perhaps a system similar to the current lazy condition code evaluation
could be used. The translator would keep a record of whether the high
bits are valid, and if they are about to be used, emit extra ops to
clear (or recalculate?) the high bits.
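
A very rough C model of that idea follows, assuming the "recalculate" variant: the architectural high half is saved the first time a 16-bit result overwrites the register, and restored only when a 32-bit read happens. Everything here (the LazyReg type, its fields and the helper names) is hypothetical and only meant to show the bookkeeping; nothing like it exists in the translator as far as this thread shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register model with lazily maintained high bits. */
typedef struct {
    uint32_t val;         /* low 16 bits are always valid */
    uint32_t saved_high;  /* architectural bits 16..31 while high_stale */
    bool high_stale;      /* true if val's high half may be garbage */
} LazyReg;

/* Store the result of a 16-bit op that was computed with 32-bit ops. */
static inline void lazy_write16(LazyReg *r, uint32_t result32)
{
    if (!r->high_stale) {
        r->saved_high = r->val & 0xffff0000u;  /* remember the real high half */
        r->high_stale = true;
    }
    r->val = result32;                         /* high half may be garbage */
}

/* 16-bit reads never need the fixup. */
static inline uint32_t lazy_read16(const LazyReg *r)
{
    return r->val & 0xffffu;
}

/* 32-bit reads pay for the fixup once, then the register is clean. */
static inline uint32_t lazy_read32(LazyReg *r)
{
    if (r->high_stale) {
        r->val = r->saved_high | (r->val & 0xffffu);
        r->high_stale = false;
    }
    return r->val;
}
```

In this model, a run of 16-bit operations costs no deposit at all, and the single merge is deferred to the first instruction that actually looks at the full register.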

Patch

diff --git a/target-microblaze/translate.c b/target-microblaze/translate.c
index 2207431..39ab3a5 100644
--- a/target-microblaze/translate.c
+++ b/target-microblaze/translate.c
@@ -160,6 +160,7 @@  static void read_carry(DisasContext *dc, TCGv d)
 
 static void write_carry(DisasContext *dc, TCGv v)
 {
+#if 0
     TCGv t0 = tcg_temp_new();
     tcg_gen_shli_tl(t0, v, 31);
     tcg_gen_sari_tl(t0, t0, 31);
@@ -168,6 +169,10 @@  static void write_carry(DisasContext *dc, TCGv v)
                     ~(MSR_C | MSR_CC));
     tcg_gen_or_tl(cpu_SR[SR_MSR], cpu_SR[SR_MSR], t0);
     tcg_temp_free(t0);
+#else
+    tcg_gen_deposit_tl(cpu_SR[SR_MSR], cpu_SR[SR_MSR], v, 2, 1);
+    tcg_gen_deposit_tl(cpu_SR[SR_MSR], cpu_SR[SR_MSR], v, 31, 1);
+#endif
 }
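
As a cross-check of the change above, both versions of write_carry can be modelled in C and compared. This is a sketch under assumptions: MSR_C and MSR_CC are taken to be bits 2 and 31, as the thread and the deposit offsets suggest, and the lines elided between the two hunks are assumed to mask t0 with (MSR_C | MSR_CC) before the MSR is cleared. The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_C  (UINT32_C(1) << 2)    /* assumed: carry flag at bit 2 */
#define MSR_CC (UINT32_C(1) << 31)   /* assumed: carry copy at bit 31 */

/* Model of the old shift/and/or sequence. */
static inline uint32_t write_carry_old(uint32_t msr, uint32_t v)
{
    uint32_t t0 = v << 31;                /* shli: bit 0 of v -> bit 31 */
    t0 = (t0 >> 31) ? UINT32_MAX : 0;     /* sari 31: replicate to all bits */
    t0 &= MSR_C | MSR_CC;                 /* assumed elided andi on t0 */
    msr &= ~(MSR_C | MSR_CC);             /* clear both carry bits */
    return msr | t0;
}

/* Model of the two one-bit deposits at offsets 2 and 31. */
static inline uint32_t write_carry_deposit(uint32_t msr, uint32_t v)
{
    uint32_t bit = v & 1;                 /* a 1-bit deposit uses bit 0 only */
    msr = (msr & ~MSR_C) | (bit << 2);
    msr = (msr & ~MSR_CC) | (bit << 31);
    return msr;
}
```

Both models only look at bit 0 of v, so they agree for every input; the difference measured in the thread is purely in how the host backend implements each deposit.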


CRIS translator:
commit 9f427e14b2535a067bf046fea093f28cfaa92f7f
Author: Edgar E. Iglesias <edgar.iglesias@gmail.com>
Date:   Fri Jan 21 22:09:44 2011 +0100

    cris: Use deposit for ALU writeback
    
    Most ALU insns on CRIS have deposit semantics in the writeback
    stage. Use the new deposit tcg operation to perform the write
    back to registers.
    
    Move the extract of the result into cc_result to the slow path
    in evaluate_flags.
    
    Signed-off-by: Edgar E. Iglesias <edgar.iglesias@gmail.com>

diff --git a/target-cris/translate.c b/target-cris/translate.c
index f4cc125..018ce68 100644
--- a/target-cris/translate.c
+++ b/target-cris/translate.c
@@ -861,11 +861,6 @@  static void cris_alu_op_exec(DisasContext *dc, int op,
 			BUG();
 			break;
 	}
-
-	if (size == 1)
-		tcg_gen_andi_tl(dst, dst, 0xff);
-	else if (size == 2)
-		tcg_gen_andi_tl(dst, dst, 0xffff);
 }
 
 static void cris_alu(DisasContext *dc, int op,
@@ -880,6 +875,7 @@  static void cris_alu(DisasContext *dc, int op,
 		tmp = tcg_temp_new();
 		writeback = 0;
 	} else if (size == 4) {
+		/* We write directly into the dest.  */
 		tmp = d;
 		writeback = 0;
 	} else
@@ -892,11 +888,7 @@  static void cris_alu(DisasContext *dc, int op,
 
 	/* Writeback.  */
 	if (writeback) {
-		if (size == 1)
-			tcg_gen_andi_tl(d, d, ~0xff);
-		else
-			tcg_gen_andi_tl(d, d, ~0xffff);
-		tcg_gen_or_tl(d, d, tmp);
+		tcg_gen_deposit_tl(d, d, tmp, 0, size * 8);
 	}
 	if (!TCGV_EQUAL(tmp, d))
 		tcg_temp_free(tmp);
@@ -941,6 +933,10 @@  static void gen_tst_cc (DisasContext *dc, TCGv cc, int cond)
 	 * When this function is done, T0 should be non-zero if the condition
 	 * code is true.
 	 */
+	if (dc->cc_size != 4) {
+		tcg_gen_andi_tl(cc_result, cc_result,
+				(1 << (dc->cc_size * 8)) - 1);
+	}
 	arith_opt = arith_cc(dc) && !dc->flags_uptodate;
 	move_opt = (dc->cc_op == CC_OP_MOVE);
 	switch (cond) {