
[5/7] tcg-i386: Implement deposit operation.

Message ID 4D3F4993.4010109@twiddle.net
State New

Commit Message

Richard Henderson Jan. 25, 2011, 10:07 p.m. UTC
On 01/25/2011 08:48 AM, Edgar E. Iglesias wrote:
> OK, I see. Maybe we should try to emit an insn sequence more similar
> to what tcg was emitting (for the non 8 & 16-bit deposits)?
> That ought to at least give similar results as before for those and
> give us a speedup for the byte and word moves.

Please try this replacement version for tcg/i386/*.

I was able to run your cris testsuite, and stuff looked ok there.
But for some reason the microblaze kernel would not boot.  It seems
that the kernel commandline isn't in place properly and it isn't 
finding the disk image.


r~

Comments

Edgar E. Iglesias Jan. 26, 2011, 8:53 a.m. UTC | #1
On Tue, Jan 25, 2011 at 02:07:15PM -0800, Richard Henderson wrote:
> On 01/25/2011 08:48 AM, Edgar E. Iglesias wrote:
> > OK, I see. Maybe we should try to emit an insn sequence more similar
> > to what tcg was emitting (for the non 8 & 16-bit deposits)?
> > That ought to at least give similar results as before for those and
> > give us a speedup for the byte and word moves.
> 
> Please try this replacement version for tcg/i386/*.
> 
> I was able to run your cris testsuite, and stuff looked ok there.
> But for some reason the microblaze kernel would not boot.  It seems
> that the kernel commandline isn't in place properly and it isn't 
> finding the disk image.

Yes, you need to build qemu with libfdt for that image to boot.
IIRC, libfdt comes with the dtc package on some dists. In the future
I'll see if I can upload an image that boots without the need for
devicetree manipulation in qemu.

I tried your new patch and got similar results as before. Maybe slightly
faster, but within the noise.

I looked a little bit more at it and realized that I'm probably not doing
a fair comparison with microblaze. The write_carry sequence actually
deposits a bit into two locations with a sequence of tcg ops that
is not much longer than the one to deposit a single bit. So I'm basically
comparing the cost of a single tcg deposit sequence with the cost of two
tcg deposit backend ops. I should probably just accept that the new
deposit op is not worth using for that particular case.
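
Concretely, the pattern is something like the following sketch (the
MSR bit positions here are made up for illustration, not the real
microblaze layout):

static void write_carry_sketch(TCGv_i32 msr, TCGv_i32 carry)
{
    /* With the new op: two deposit backend ops.  */
    tcg_gen_deposit_i32(msr, msr, carry, 2, 1);
    tcg_gen_deposit_i32(msr, msr, carry, 31, 1);
}

static void write_carry_open_coded(TCGv_i32 msr, TCGv_i32 carry)
{
    /* The mask-based sequence is only a little longer.  */
    TCGv_i32 t = tcg_temp_new_i32();

    tcg_gen_andi_i32(msr, msr, ~((1u << 2) | (1u << 31)));
    tcg_gen_andi_i32(t, carry, 1);
    tcg_gen_shli_i32(t, t, 2);          /* place the bit at position 2 */
    tcg_gen_or_i32(msr, msr, t);
    tcg_gen_shli_i32(t, t, 29);         /* and again at position 31 */
    tcg_gen_or_i32(msr, msr, t);
    tcg_temp_free_i32(t);
}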

It would be nice if somebody else also tested this patch before we
agree on applying it.

One note, the tcg_scratch_alloc hunk from the previous version was
missing on this one.

Thanks
Alexander Graf Jan. 26, 2011, 9:23 a.m. UTC | #2
On 26.01.2011, at 09:53, Edgar E. Iglesias wrote:

> On Tue, Jan 25, 2011 at 02:07:15PM -0800, Richard Henderson wrote:
>> On 01/25/2011 08:48 AM, Edgar E. Iglesias wrote:
>>> OK, I see. Maybe we should try to emit an insn sequence more similar
>>> to what tcg was emitting (for the non 8 & 16-bit deposits)?
>>> That ought to at least give similar results as before for those and
>>> give us a speedup for the byte and word moves.
>> 
>> Please try this replacement version for tcg/i386/*.
>> 
>> I was able to run your cris testsuite, and stuff looked ok there.
>> But for some reason the microblaze kernel would not boot.  It seems
>> that the kernel commandline isn't in place properly and it isn't 
>> finding the disk image.
> 
> Yes, you need to build qemu with libfdt for that image to boot.
> IIRC, libfdt comes with the dtc package on some dists. In the future
> I'll see if I can upload an image that boots without the need for
> devicetree manipulation in qemu.
> 
> I tried your new patch and got similar results as before. Maybe slightly
> faster, but within the noise.
> 
> I looked a little bit more at it and realized that I'm probably not doing
> a fair comparison with microblaze. The write_carry sequence actually
> deposits a bit into two locations with a sequence of tcg ops that
> is not much longer than the one to deposit a single bit. So I'm basically
> comparing the cost of a single tcg deposit sequence with the cost of two
> tcg deposit backend ops. I should probably just accept that the new
> deposit op is not worth using for that particular case.
> 
> It would be nice if somebody else also tested this patch before we
> agree on applying it.
> 
> One note, the tcg_scratch_alloc hunk from the previous version was
> missing on this one.

I'm using the deposit instruction on S/390 with the x86 implementation patch applied, and it does not appear to improve performance:

Booting into initrd shell:

3.53s  w/ x86 deposit patch 
3.52s  w/o x86 deposit patch


But maybe this is the same problem as with the shift opcode that was reported earlier in the thread:

agraf@toonie:/studio/s390/qemu-s390> grep deposit target-s390x/translate.c 
    tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 32);
    tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 32);
    tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 16);
    tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 8);
        tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 48, 16);
        tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 32, 16);

The 0, 32 and 0, 16 versions should get accelerated pretty well while the 32, 16 and 48, 16 are not I assume?

I'm not saying that the deposit instruction doesn't make sense, btw. Code-cleanup-wise it was awesome - 10 lines of tcg combined into a single call :). Maybe a simple andi and ori emission for unknown masks would make more sense, though?
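
Something like the following, which is roughly what I'd expect a
generic fallback to expand to anyway (a sketch for the 64-bit case
only, no special-casing):

static void gen_deposit_masks_i64(TCGv_i64 ret, TCGv_i64 arg1, TCGv_i64 arg2,
                                  unsigned ofs, unsigned len)
{
    uint64_t mask = (len == 64 ? ~0ull : (1ull << len) - 1);
    TCGv_i64 t = tcg_temp_new_i64();

    tcg_gen_andi_i64(t, arg2, mask);             /* keep only the field  */
    tcg_gen_shli_i64(t, t, ofs);                 /* move it into place   */
    tcg_gen_andi_i64(ret, arg1, ~(mask << ofs)); /* clear it in the dest */
    tcg_gen_or_i64(ret, ret, t);
    tcg_temp_free_i64(t);
}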


Alex
Richard Henderson Jan. 26, 2011, 3:50 p.m. UTC | #3
On 01/26/2011 01:23 AM, Alexander Graf wrote:
> agraf@toonie:/studio/s390/qemu-s390> grep deposit target-s390x/translate.c 
>     tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 32);
>     tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 32);
>     tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 16);
>     tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 8);
>         tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 48, 16);
>         tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 32, 16);
> 
> The 0, 32 and 0, 16 versions should get accelerated pretty well while
> the 32, 16 and 48, 16 are not I assume?

No, only the 0,16 and 0,8 deposits correspond to a hardware insn on x86.

Given that the 0,32 lowpart writeback is almost certainly the most common
operation for s390x, I doubt the deposit patch will help with an x86 host.
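
Mapping the grep output above onto the tcg_out_deposit paths in the
patch at the bottom of this page (roughly, assuming a 64-bit host):

    tcg_gen_deposit_i64(r, r, v, 0, 32);   /* mask-based and/or path */
    tcg_gen_deposit_i64(r, r, v, 0, 16);   /* single movw            */
    tcg_gen_deposit_i64(r, r, v, 0, 8);    /* single movb            */
    tcg_gen_deposit_i64(r, r, v, 32, 16);  /* mov + ror + movw + rol */
    tcg_gen_deposit_i64(r, r, v, 48, 16);  /* mov + ror + movw + rol */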

Have you thought about buffering the lowpart writeback in the translator?
I.e. when a 32-bit insn writes to a register, remember that value without
writing it back.  If the next insn in the TB is also 32-bit, reuse the 
saved value, etc.  Only perform the writeback for 64-bit insns using the
register as a source, end of TB, and places that can take an exception.


r~
Alexander Graf Jan. 26, 2011, 4 p.m. UTC | #4
Richard Henderson wrote:
> On 01/26/2011 01:23 AM, Alexander Graf wrote:
>   
>> agraf@toonie:/studio/s390/qemu-s390> grep deposit target-s390x/translate.c 
>>     tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 32);
>>     tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 32);
>>     tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 16);
>>     tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 8);
>>         tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 48, 16);
>>         tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 32, 16);
>>
>> The 0, 32 and 0, 16 versions should get accelerated pretty well while
>> the 32, 16 and 48, 16 are not I assume?
>>     
>
> No, only the 0,16 and 0,8 deposits correspond to a hardware insn on x86.
>
> Given that the 0,32 lowpart writeback is almost certainly the most common
> operation for s390x, I doubt the deposit patch will help with an x86 host.
>
> Have you thought about buffering the lowpart writeback in the translator?
> I.e. when a 32-bit insn writes to a register, remember that value without
> writing it back.  If the next insn in the TB is also 32-bit, reuse the 
> saved value, etc.  Only perform the writeback for 64-bit insns using the
> register as a source, end of TB, and places that can take an exception.
>   

That's basically what uli's tcg_sync op did which got rejected upstream :).

Keeping it only inside of the translator would break on page faults, as
the lower 32 bits of the register would lie around in a temporary which
is invisible for the page fault resolver.


Alex
Avi Kivity Jan. 26, 2011, 4:34 p.m. UTC | #5
On 01/26/2011 06:00 PM, Alexander Graf wrote:
> Richard Henderson wrote:
> >  On 01/26/2011 01:23 AM, Alexander Graf wrote:
> >
> >>  agraf@toonie:/studio/s390/qemu-s390>  grep deposit target-s390x/translate.c
> >>      tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 32);
> >>      tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 32);
> >>      tcg_gen_deposit_i64(regs[reg], regs[reg], tmp, 0, 16);
> >>      tcg_gen_deposit_i64(regs[reg], regs[reg], v, 0, 8);
> >>          tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 48, 16);
> >>          tcg_gen_deposit_i64(regs[r1], regs[r1], tmp, 32, 16);
> >>
> >>  The 0, 32 and 0, 16 versions should get accelerated pretty well while
> >>  the 32, 16 and 48, 16 are not I assume?
> >>
> >
> >  No, only the 0,16 and 0,8 deposits correspond to a hardware insn on x86.
> >
> >  Given that the 0,32 lowpart writeback is almost certainly the most common
> >  operation for s390x, I doubt the deposit patch will help with an x86 host.
> >
> >  Have you thought about buffering the lowpart writeback in the translator?
> >  I.e. when a 32-bit insn writes to a register, remember that value without
> >  writing it back.  If the next insn in the TB is also 32-bit, reuse the
> >  saved value, etc.  Only perform the writeback for 64-bit insns using the
> >  register as a source, end of TB, and places that can take an exception.
> >
>
> That's basically what uli's tcg_sync op did which got rejected upstream :).
>
> Keeping it only inside of the translator would break on page faults, as
> the lower 32 bits of the register would lie around in a temporary which
> is invisible for the page fault resolver.

It should be possible to have the exception injector look up the instruction 
pointer in a bitmap in the TB and, if the bit is set, execute a completion 
which writes the temporary back to the correct register.  Not trivial, but I 
imagine it would be useful for many optimizations which combine several 
target instructions into one host instruction (probably most useful for 
risc->cisc translations).
Richard Henderson Jan. 26, 2011, 4:40 p.m. UTC | #6
On 01/26/2011 08:00 AM, Alexander Graf wrote:
> Keeping it only inside of the translator would break on page faults, as
> the lower 32 bits of the register would lie around in a temporary which
> is invisible for the page fault resolver.

Given that QEMU doesn't support truly async signals, and that the
translator can tell which insns can fault, I can't imagine that this
is actually a problem.

You should get the same sequence of writebacks when translating the TB
the second time for tcg_gen_code_search_pc.

Am I totally confused here?


r~
Alexander Graf Jan. 26, 2011, 4:50 p.m. UTC | #7
Richard Henderson wrote:
> On 01/26/2011 08:00 AM, Alexander Graf wrote:
>   
>> Keeping it only inside of the translator would break on page faults, as
>> the lower 32 bits of the register would lie around in a temporary which
>> is invisible for the page fault resolver.
>>     
>
> Given that QEMU doesn't support truly async signals, and that the
> translator can tell which insns can fault, I can't imagine that this
> is actually a problem.
>
> You should get the same sequence of writebacks when translating the TB
> the second time for tcg_gen_code_search_pc.
>
> Am I totally confused here?
>   

Oh, you mean basically to have the following:

TCGv_i32 regs32[16];
TCGv_i64 regs[16];

Then declare both as globals with offset and just switch between the
access type using a disas struct variable. Once the TB ends, I'd
obviously have to sync back to 64 bit again.

Yes, that would work. I'll see if I can get that rolling once I'm done
with the CC optimizations :)


Alex
Richard Henderson Jan. 26, 2011, 6:21 p.m. UTC | #8
On 01/26/2011 08:50 AM, Alexander Graf wrote:
> Oh, you mean basically to have the following:
> 
> TCGv_i32 regs32[16];
> TCGv_i64 regs[16];
> 
> Then declare both as globals with offset and just switch between the
> access type using a disas struct variable. Once the TB ends, I'd
> obviously have to sync back to 64 bit again.

Maybe?  I don't think that regs32 can overlap with regs in the real
cpu structure.  Otherwise you have the same sort of sync problems as
was originally rejected.

I had been picturing regs32 as a member of DisasContext, and it would
hold tcg_temp_new_i32 variables as-needed.  Something like

#if TCG_TARGET_REG_BITS == 64

static void writeback_reg32(DisasContext *dc, int r)
{
    // ??? This macro should really exist to match TCGV_UNUSED_I32 etc.
    if (!TCGV_IS_UNUSED_I32 (dc->regs32[r])) {
        tcg_gen_deposit_tl(cpu_regs[r], cpu_regs[r],
                           MAKE_TCGV_I64 (GET_TCGV_I32 (dc->regs32[r])),
                           0, 32);
    }
}

static void writeback_all_reg32(DisasContext *dc)
{
    int i;
    for (i = 0; i < 16; ++i) {
        writeback_reg32(dc, i);
    }
}

static void flush_all_reg32(DisasContext *dc)
{
    int i;
    for (i = 0; i < 16; ++i) {
        if (!TCGV_IS_UNUSED_I32 (dc->regs32[i])) {
            tcg_temp_free_i32 (dc->regs32[i]);
            TCGV_UNUSED_I32 (dc->regs32[i]);
        }
    }
}

static TCGv_i32 get_reg32(DisasContext *dc, int r)
{
    if (TCGV_IS_UNUSED_I32 (dc->regs32[r])) {
        dc->regs32[r] = tcg_temp_new_i32();
        tcg_gen_trunc_i64_i32(dc->regs32[r], cpu_regs[r]);
    }
    return dc->regs32[r];
}

static TCGv_i64 get_reg64(DisasContext *dc, int r)
{
    writeback_reg32(dc, r);
    return cpu_regs[r];
}

#elif TCG_TARGET_REG_BITS == 32

static void writeback_reg32(DisasContext *dc, int r) { }
static void writeback_all_reg32(DisasContext *dc) { }
static void flush_all_reg32(DisasContext *dc) { }

static TCGv_i32 get_reg32(DisasContext *dc, int r)
{
    return TCGV_LOW(cpu_regs[r]);
}

static TCGv_i64 get_reg64(DisasContext *dc, int r)
{
    return cpu_regs[r];
}

#endif
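
Usage at the insn level would then look something like this; the insn
shapes are hypothetical, just to show the flow:

static void gen_add32_sketch(DisasContext *dc, int r1, int r2)
{
    /* 32-bit arithmetic works on the cached i32 views; the 64-bit
       globals are only updated lazily via writeback_reg32.  */
    tcg_gen_add_i32(get_reg32(dc, r1), get_reg32(dc, r1), get_reg32(dc, r2));
}

static void gen_sync_point_sketch(DisasContext *dc)
{
    /* Call this before anything that can fault or leave the TB
       (qemu_ld/st, helper calls, TB end): the cached low halves go
       back into the 64-bit globals, which TCG then writes back to
       env as usual.  */
    writeback_all_reg32(dc);
}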


r~
Alexander Graf Jan. 26, 2011, 6:27 p.m. UTC | #9
Richard Henderson wrote:
> On 01/26/2011 08:50 AM, Alexander Graf wrote:
>   
>> Oh, you mean basically to have the following:
>>
>> TCGv_i32 regs32[16];
>> TCGv_i64 regs[16];
>>
>> Then declare both as globals with offset and just switch between the
>> access type using a disas struct variable. Once the TB ends, I'd
>> obviously have to sync back to 64 bit again.
>>     
>
> Maybe?  I don't think that regs32 can overlap with regs in the real
> cpu structure.  Otherwise you have the same sort of sync problems as
> was originally rejected.
>
> I had been picturing regs32 as a member of DisasContext, and it would
> hold tcg_temp_new_i32 variables as-needed.  Something like
>   

That would mean that during a TB the reg32 parts would be in
temporaries, right? So they'd end up being somewhere in a register.

What do I do on a page fault now? I could rerun the translator to find
out which registers would be stuck in a temporary, but I have no way to
actually read the temporary's value, as all register state is thrown
away on a page fault IIUC.

So the only alternative to this would be that I'd keep both 32 bit and
64 bit register states around in env, declare both as global, and sync
them manually on access, I guess.


Alex
Richard Henderson Jan. 26, 2011, 6:55 p.m. UTC | #10
On 01/26/2011 10:27 AM, Alexander Graf wrote:
> What do I do on a page fault now? I could rerun the translator to find
> out which registers would be stuck in a temporary, but I have no way to
> actually read the temporary's value, as all register state is thrown
> away on a page fault IIUC.

When does a page-fault happen?

As far as I know, it does not happen at random.  Which seems to be 
what you are suggesting.


r~
Alexander Graf Jan. 26, 2011, 7:01 p.m. UTC | #11
Richard Henderson wrote:
> On 01/26/2011 10:27 AM, Alexander Graf wrote:
>   
>> What do I do on a page fault now? I could rerun the translator to find
>> out which registers would be stuck in a temporary, but I have no way to
>> actually read the temporary's value, as all register state is thrown
>> away on a page fault IIUC.
>>     
>
> When does a page-fault happen?
>
> As far as I know, it does not happen at random.  Which seems to be 
> what you are suggesting.
>   

It happens on load/store and potentially helpers. The main difference
IIUC between globals and temps is that globals are kept in registers as
long as possible (read: until load/store or helper or tb end gets
emitted) while temporaries are not stored back to any memory, so they
are lost on load/store.

So what you are suggesting is basically to use a different set of
globals for regs32 and to keep track of their usage throughout the TB,
so we can convert on demand. We can't use temporaries for that unless we
manually store them off on load/store/helper/tb end which means we'd
rewrite the globals treatment in target code :).


Alex
Richard Henderson Jan. 26, 2011, 7:05 p.m. UTC | #12
On 01/26/2011 11:01 AM, Alexander Graf wrote:
>> As far as I know, it does not happen at random.  Which seems to be 
>> what you are suggesting.
> 
> It happens on load/store and potentially helpers. The main difference
> IIUC between globals and temps is that globals are kept in registers as
> long as possible (read: until load/store or helper or tb end gets
> emitted) while temporaries are not stored back to any memory, so they
> are lost on load/store.
> 
> So what you are suggesting is basically to use a different set of
> globals for regs32 and to keep track of their usage throughout the TB,
> so we can convert on demand. We can't use temporaries for that unless we
> manually store them off on load/store/helper/tb end which means we'd
> rewrite the globals treatment in target code :).

No, what I'm suggesting is manually storing the reg32 temporaries back
to their reg64 origins in the translator immediately before issuing the
load/store/helper/tbend, at which point the generic TCG bits write back
the reg64 globals to their env origin.

Do you have a pointer to your s390x tree?



r~
Alexander Graf Jan. 26, 2011, 7:09 p.m. UTC | #13
Richard Henderson wrote:
> On 01/26/2011 11:01 AM, Alexander Graf wrote:
>   
>>> As far as I know, it does not happen at random.  Which seems to be 
>>> what you are suggesting.
>>>       
>> It happens on load/store and potentially helpers. The main difference
>> IIUC between globals and temps is that globals are kept in registers as
>> long as possible (read: until load/store or helper or tb end gets
>> emitted) while temporaries are not stored back to any memory, so they
>> are lost on load/store.
>>
>> So what you are suggesting is basically to use a different set of
>> globals for regs32 and to keep track of their usage throughout the TB,
>> so we can convert on demand. We can't use temporaries for that unless we
>> manually store them off on load/store/helper/tb end which means we'd
>> rewrite the globals treatment in target code :).
>>     
>
> No, what I'm suggesting is manually storing the reg32 temporaries back
> to their reg64 origins in the translator immediately before issuing the
> load/store/helper/tbend, at which point the generic TCG bits write back
> the reg64 globals to their env origin.
>   

Oh, I see :). That makes sense now. Thanks for the clarification :).

> Do you have a pointer to your s390x tree?
>   

Sure.

  git://repo.or.cz/qemu/agraf.git  s390

Please be aware of the fact that I'm currently reworking the whole CC
model, so if you start working on the register acceleration now there
will be conflicts for sure :). Do you have code to test it with?


Alex
Richard Henderson Jan. 26, 2011, 7:19 p.m. UTC | #14
On 01/26/2011 11:09 AM, Alexander Graf wrote:
> Please be aware of the fact that I'm currently reworking the whole CC
> model, so if you start working on the register acceleration now there
> will be conflicts for sure :).

Roger.

> Do you have code to test it with?

Er, I assume I can pull something out of my hercules installation,
but if you already have images tarred up, that would be helpful.


r~
Alexander Graf Jan. 26, 2011, 7:27 p.m. UTC | #15
Richard Henderson wrote:
> On 01/26/2011 11:09 AM, Alexander Graf wrote:
>   
>> Please be aware of the fact that I'm currently reworking the whole CC
>> model, so if you start working on the register acceleration now there
>> will be conflicts for sure :).
>>     
>
> Roger.
>
>   
>> Do you have code to test it with?
>>     
>
> Er, I assume I can pull something out of my hercules installation,
> but if you already have images tarred up, that would be helpful.
>   

I'll gladly put together something :).


Alex

Patch

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index bb19a95..ec49b34 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -258,7 +258,8 @@  static inline int tcg_target_const_match(tcg_target_long val,
 #define OPC_JMP_long	(0xe9)
 #define OPC_JMP_short	(0xeb)
 #define OPC_LEA         (0x8d)
-#define OPC_MOVB_EvGv	(0x88)		/* stores, more or less */
+#define OPC_MOVB_EbGb	(0x88)		/* stores, more or less */
+#define OPC_MOVB_GbEb   (0x8a)		/* loads, more or less */
 #define OPC_MOVL_EvGv	(0x89)		/* stores, more or less */
 #define OPC_MOVL_GvEv	(0x8b)		/* loads, more or less */
 #define OPC_MOVL_EvIz	(0xc7)
@@ -277,6 +278,7 @@  static inline int tcg_target_const_match(tcg_target_long val,
 #define OPC_SHIFT_1	(0xd1)
 #define OPC_SHIFT_Ib	(0xc1)
 #define OPC_SHIFT_cl	(0xd3)
+#define OPC_SHRD_Ib	(0xac | P_EXT)
 #define OPC_TESTL	(0x85)
 #define OPC_XCHG_ax_r32	(0x90)
 
@@ -710,6 +712,107 @@  static void tcg_out_addi(TCGContext *s, int reg, tcg_target_long val)
     }
 }
 
+static void tcg_out_deposit(TCGContext *s, int inout, int val,
+                            unsigned ofs, unsigned len, int rexw)
+{
+    TCGType type = rexw ? TCG_TYPE_I64 : TCG_TYPE_I32;
+    tcg_target_ulong imask, vmask;
+    TCGRegSet live;
+    int scratch;
+
+    /* Look for MOVB w/ %reg_h special case.  */
+    if (ofs == 8 && len == 8 && inout < 4 && val < 4) {
+        tcg_out_modrm(s, OPC_MOVB_GbEb, inout + 4, val);
+        return;
+    }
+
+    /* Look for MOVB/MOVW special cases.  */
+    if (len == 16
+        || (len == 8
+            && (TCG_TARGET_REG_BITS == 64 || (inout < 4 && val < 4)))) {
+        /* If the offset is non-zero, and we have a deposit from self,
+           then we need a tempoarary.  */
+        if (ofs != 0 && inout == val) {
+            tcg_regset_clear(live);
+            tcg_regset_set_reg(live, inout);
+            val = tcg_scratch_alloc(s, type, live);
+            tcg_out_mov(s, type, val, inout);
+        }
+
+        /* If the offset is non-zero, rotate the destination into place.  */
+        if (ofs != 0) {
+            tcg_out_shifti(s, SHIFT_ROR + rexw, inout, ofs);
+        }
+
+        if (len == 8) {
+            tcg_out_modrm(s, OPC_MOVB_GbEb + P_REXB_R + P_REXB_RM, inout, val);
+        } else {
+            tcg_out_modrm(s, OPC_MOVL_GvEv + P_DATA16, inout, val);
+        }
+
+        /* Restore the destination to its proper location.  */
+        if (ofs != 0) {
+            tcg_out_shifti(s, SHIFT_ROL + rexw, inout, ofs);
+        }
+        return;
+    }
+
+    /* Otherwise we can't support this operation natively.  It's possible to
+       play tricks with rotates and shld in order to implement this, which
+       is much smaller than using masks, but it turns out that shld is too slow
+       on many cpus.  */
+    tcg_regset_clear(live);
+    tcg_regset_set_reg(live, inout);
+    tcg_regset_set_reg(live, val);
+    scratch = tcg_scratch_alloc(s, type, live);
+
+    vmask = ((tcg_target_ulong)1 << len) - 1;
+    imask = ~(vmask << ofs);
+
+    /* Careful, some 64-bit masks cannot use immediate operands.  */
+    if (type == TCG_TYPE_I64 && imask != (int32_t)imask) {
+        bool val_scratch = false;
+
+        /* Since we are going to clobber INOUT first, the destination
+           bitfield cannot overlap the input bits.  */
+        if (inout == val && ofs < len) {
+            tcg_regset_set_reg(live, scratch);
+            val = tcg_scratch_alloc(s, type, live);
+            tcg_out_mov(s, type, val, inout);
+            val_scratch = true;
+        }
+
+        tcg_out_movi(s, type, scratch, imask);
+        tgen_arithr(s, ARITH_AND + rexw, inout, scratch);
+
+        if (vmask > 0xffffffffu) {
+            tcg_out_movi(s, type, scratch, vmask);
+            tgen_arithr(s, ARITH_AND + P_REXW, scratch, val);
+        } else {
+            if (val_scratch) {
+                scratch = val;
+            } else {
+                tcg_out_mov(s, TCG_TYPE_I32, scratch, val);
+            }
+            tgen_arithi(s, ARITH_AND, scratch, vmask, 0);
+        }
+
+        tcg_out_shifti(s, SHIFT_SHL + P_REXW, scratch, ofs);
+        tgen_arithr(s, ARITH_OR + P_REXW, inout, scratch);
+        return;
+    }
+
+    /* Both IMASK and VMASK are valid immediate operands, which means that
+       VAL may be treated as a 32-bit value.  */
+    tcg_out_mov(s, TCG_TYPE_I32, scratch, val);
+    tgen_arithi(s, ARITH_AND, scratch, vmask, 0);
+    tcg_out_shifti (s, SHIFT_SHL + rexw, scratch, ofs);
+
+    tgen_arithi(s, ARITH_AND + rexw, inout, imask, 0);
+    tgen_arithr(s, ARITH_OR + rexw, inout, scratch);
+}
+
+
 /* Use SMALL != 0 to force a short forward branch.  */
 static void tcg_out_jxx(TCGContext *s, int opc, int label_index, int small)
 {
@@ -1266,7 +1369,7 @@  static void tcg_out_qemu_st_direct(TCGContext *s, int datalo, int datahi,
 
     switch (sizeop) {
     case 0:
-        tcg_out_modrm_offset(s, OPC_MOVB_EvGv + P_REXB_R, datalo, base, ofs);
+        tcg_out_modrm_offset(s, OPC_MOVB_EbGb + P_REXB_R, datalo, base, ofs);
         break;
     case 1:
         if (bswap) {
@@ -1504,7 +1607,7 @@  static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
     OP_32_64(st8):
-        tcg_out_modrm_offset(s, OPC_MOVB_EvGv | P_REXB_R,
+        tcg_out_modrm_offset(s, OPC_MOVB_EbGb | P_REXB_R,
                              args[0], args[1], args[2]);
         break;
     OP_32_64(st16):
@@ -1603,6 +1706,10 @@  static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         }
         break;
 
+    OP_32_64(deposit):
+        tcg_out_deposit(s, args[0], args[2], args[3], args[4], rexw);
+        break;
+
     case INDEX_op_brcond_i32:
         tcg_out_brcond32(s, args[2], args[0], args[1], const_args[1],
                          args[3], 0);
@@ -1783,6 +1890,7 @@  static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_sar_i32, { "r", "0", "ci" } },
     { INDEX_op_rotl_i32, { "r", "0", "ci" } },
     { INDEX_op_rotr_i32, { "r", "0", "ci" } },
+    { INDEX_op_deposit_i32, { "r", "0", "r" } },
 
     { INDEX_op_brcond_i32, { "r", "ri" } },
 
@@ -1835,6 +1943,7 @@  static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_sar_i64, { "r", "0", "ci" } },
     { INDEX_op_rotl_i64, { "r", "0", "ci" } },
     { INDEX_op_rotr_i64, { "r", "0", "ci" } },
+    { INDEX_op_deposit_i64, { "r", "0", "r" } },
 
     { INDEX_op_brcond_i64, { "r", "re" } },
     { INDEX_op_setcond_i64, { "r", "r", "re" } },
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index bfafbfc..9f90d17 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -77,6 +77,7 @@  enum {
 /* optional instructions */
 #define TCG_TARGET_HAS_div2_i32
 #define TCG_TARGET_HAS_rot_i32
+#define TCG_TARGET_HAS_deposit_i32
 #define TCG_TARGET_HAS_ext8s_i32
 #define TCG_TARGET_HAS_ext16s_i32
 #define TCG_TARGET_HAS_ext8u_i32
@@ -94,6 +95,7 @@  enum {
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_div2_i64
 #define TCG_TARGET_HAS_rot_i64
+#define TCG_TARGET_HAS_deposit_i64
 #define TCG_TARGET_HAS_ext8s_i64
 #define TCG_TARGET_HAS_ext16s_i64
 #define TCG_TARGET_HAS_ext32s_i64