Patchwork [14/27] Add book3s_64 specific opcode emulation

Submitter Alexander Graf
Date Oct. 30, 2009, 3:47 p.m.
Message ID <1256917647-6200-15-git-send-email-agraf@suse.de>
Permalink /patch/37326/
State Accepted, archived
Commit c215c6e49fef6c79a5b98f66f11cc6b1e395cb59
Delegated to: Benjamin Herrenschmidt

Comments

Alexander Graf - Oct. 30, 2009, 3:47 p.m.
There are generic parts of PowerPC that can be shared across all
implementations and specific parts that only apply to BookE or desktop PPCs.

This patch adds emulation for desktop specific opcodes that don't apply
to BookE CPUs.

Signed-off-by: Alexander Graf <agraf@suse.de>

---

v5 -> v6:

  - // -> /* */
---
 arch/powerpc/kvm/book3s_64_emulate.c |  337 ++++++++++++++++++++++++++++++++++
 1 files changed, 337 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_64_emulate.c
Segher Boessenkool - Nov. 3, 2009, 8:47 a.m.
Nice patchset.  Some comments on the emulation part:

> +#define OP_31_XOP_EIOIO		854

You mean EIEIO.

> +	case 19:
> +		switch (get_xop(inst)) {
> +		case OP_19_XOP_RFID:
> +		case OP_19_XOP_RFI:
> +			vcpu->arch.pc = vcpu->arch.srr0;
> +			kvmppc_set_msr(vcpu, vcpu->arch.srr1);
> +			*advance = 0;
> +			break;

I think you should only emulate the insns that exist on whatever the guest
pretends to be.  RFID exists only on 64-bit implementations.  Same comment
everywhere else.

> +		case OP_31_XOP_EIOIO:
> +			break;

Have you always executed an eieio or sync when you get here, or
do you just not allow direct access to I/O devices?  Other context
synchronising insns are not enough, they do not broadcast on the
bus.

> +		case OP_31_XOP_DCBZ:
> +		{
> +			ulong rb =  vcpu->arch.gpr[get_rb(inst)];
> +			ulong ra = 0;
> +			ulong addr;
> +			u32 zeros[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
> +
> +			if (get_ra(inst))
> +				ra = vcpu->arch.gpr[get_ra(inst)];
> +
> +			addr = (ra + rb) & ~31ULL;
> +			if (!(vcpu->arch.msr & MSR_SF))
> +				addr &= 0xffffffff;
> +
> +			if (kvmppc_st(vcpu, addr, 32, zeros)) {

DCBZ zeroes out a cache line, not 32 bytes; except on 970, where there
are HID bits to make it work on 32 bytes only, and an extra DCBZL insn
that always clears a full cache line (128 bytes).
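
The address arithmetic in the quoted hunk can be looked at in isolation. The
following is a minimal sketch, assuming the 32-byte dcbz mode discussed here;
the helper name dcbz32_addr and its parameters are hypothetical and not part
of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: effective address zeroed by an emulated 32-byte dcbz.
 * ra_field_is_zero mirrors the get_ra(inst) == 0 case, where the base is a
 * literal 0 rather than the contents of r0.
 */
static uint64_t dcbz32_addr(uint64_t ra_val, uint64_t rb_val,
                            bool ra_field_is_zero, bool msr_sf)
{
	uint64_t base = ra_field_is_zero ? 0 : ra_val;
	uint64_t addr = (base + rb_val) & ~UINT64_C(31); /* round down to 32 bytes */

	if (!msr_sf)                    /* 32-bit mode: upper half is ignored */
		addr &= UINT64_C(0xffffffff);

	return addr;
}
```

With a different cache line size only the alignment mask and the store length
would change; the rest of the computation stays the same.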

> +	switch (sprn) {
> +	case SPRN_IBAT0U ... SPRN_IBAT3L:
> +		bat = &vcpu_book3s->ibat[(sprn - SPRN_IBAT0U) / 2];
> +		break;
> +	case SPRN_IBAT4U ... SPRN_IBAT7L:
> +		bat = &vcpu_book3s->ibat[(sprn - SPRN_IBAT4U) / 2];
> +		break;
> +	case SPRN_DBAT0U ... SPRN_DBAT3L:
> +		bat = &vcpu_book3s->dbat[(sprn - SPRN_DBAT0U) / 2];
> +		break;
> +	case SPRN_DBAT4U ... SPRN_DBAT7L:
> +		bat = &vcpu_book3s->dbat[(sprn - SPRN_DBAT4U) / 2];
> +		break;

Do xBAT4..7 have the same SPR numbers on all CPUs?  They are CPU-specific
SPRs, after all.  Some CPUs have only six, some only four, some none, btw.

> +	case SPRN_HID0:
> +		to_book3s(vcpu)->hid[0] = vcpu->arch.gpr[rs];
> +		break;
> +	case SPRN_HID1:
> +		to_book3s(vcpu)->hid[1] = vcpu->arch.gpr[rs];
> +		break;
> +	case SPRN_HID2:
> +		to_book3s(vcpu)->hid[2] = vcpu->arch.gpr[rs];
> +		break;
> +	case SPRN_HID4:
> +		to_book3s(vcpu)->hid[4] = vcpu->arch.gpr[rs];
> +		break;
> +	case SPRN_HID5:
> +		to_book3s(vcpu)->hid[5] = vcpu->arch.gpr[rs];

HIDs are different per CPU; and worse, different CPUs have different
registers (SPR #s) for the same register name!

> +		/* guest HID5 set can change is_dcbz32 */
> +		if (vcpu->arch.mmu.is_dcbz32(vcpu) &&
> +		    (mfmsr() & MSR_HV))
> +			vcpu->arch.hflags |= BOOK3S_HFLAG_DCBZ32;
> +		break;

Wait, does this mean you allow other HID writes when MSR[HV] isn't
set?  All HIDs (and many other SPRs) cannot be read or written in
supervisor mode.


Segher
Alexander Graf - Nov. 3, 2009, 9:06 a.m.
On 03.11.2009, at 09:47, Segher Boessenkool wrote:

> Nice patchset.  Some comments on the emulation part:

Cool, thanks for looking through them!

>> +#define OP_31_XOP_EIOIO		854
>
> You mean EIEIO.

Probably, yeah.

>> +	case 19:
>> +		switch (get_xop(inst)) {
>> +		case OP_19_XOP_RFID:
>> +		case OP_19_XOP_RFI:
>> +			vcpu->arch.pc = vcpu->arch.srr0;
>> +			kvmppc_set_msr(vcpu, vcpu->arch.srr1);
>> +			*advance = 0;
>> +			break;
>
> I think you should only emulate the insns that exist on whatever the guest
> pretends to be.  RFID exists only on 64-bit implementations.  Same comment
> everywhere else.

True.

>
>> +		case OP_31_XOP_EIOIO:
>> +			break;
>
> Have you always executed an eieio or sync when you get here, or
> do you just not allow direct access to I/O devices?  Other context
> synchronising insns are not enough, they do not broadcast on the
> bus.

There is no device passthrough yet :-). It's theoretically possible,  
but nothing for it is implemented so far.

>
>> +		case OP_31_XOP_DCBZ:
>> +		{
>> +			ulong rb =  vcpu->arch.gpr[get_rb(inst)];
>> +			ulong ra = 0;
>> +			ulong addr;
>> +			u32 zeros[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
>> +
>> +			if (get_ra(inst))
>> +				ra = vcpu->arch.gpr[get_ra(inst)];
>> +
>> +			addr = (ra + rb) & ~31ULL;
>> +			if (!(vcpu->arch.msr & MSR_SF))
>> +				addr &= 0xffffffff;
>> +
>> +			if (kvmppc_st(vcpu, addr, 32, zeros)) {
>
> DCBZ zeroes out a cache line, not 32 bytes; except on 970, where there
> are HID bits to make it work on 32 bytes only, and an extra DCBZL insn
> that always clears a full cache line (128 bytes).

Yes. We only come here when we patched the dcbz opcodes to invalid  
instructions because cache line size of target == 32.
On 970 with MSR_HV = 0 we actually use the dcbz 32-bytes mode.

Admittedly though, this could be a lot more clever.
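
The opcode patching mentioned here can be sketched as follows. The values 1014
and 1010 come from the comment in the patch itself; the helper names and the
standalone form are hypothetical, and the real patching code lives elsewhere
in the series.

```c
#include <stdint.h>

#define XOP_DCBZ      1014u /* architected extended opcode of dcbz */
#define XOP_DCBZ_FAKE 1010u /* patched value, per the comment in the patch */

/* Primary opcode: top 6 bits of the instruction word. */
static uint32_t get_op(uint32_t inst)
{
	return inst >> 26;
}

/* Extended opcode: 10 bits starting one bit above the Rc bit. */
static uint32_t get_xop(uint32_t inst)
{
	return (inst >> 1) & 0x3ff;
}

/* Rewrite a dcbz into the fake opcode; leave every other insn untouched. */
static uint32_t patch_dcbz(uint32_t inst)
{
	if (get_op(inst) == 31 && get_xop(inst) == XOP_DCBZ) {
		inst &= ~(UINT32_C(0x3ff) << 1);
		inst |= XOP_DCBZ_FAKE << 1;
	}
	return inst;
}
```

The register fields are preserved, so the emulation path can still read RA and
RB from the patched word.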

>> +	switch (sprn) {
>> +	case SPRN_IBAT0U ... SPRN_IBAT3L:
>> +		bat = &vcpu_book3s->ibat[(sprn - SPRN_IBAT0U) / 2];
>> +		break;
>> +	case SPRN_IBAT4U ... SPRN_IBAT7L:
>> +		bat = &vcpu_book3s->ibat[(sprn - SPRN_IBAT4U) / 2];
>> +		break;
>> +	case SPRN_DBAT0U ... SPRN_DBAT3L:
>> +		bat = &vcpu_book3s->dbat[(sprn - SPRN_DBAT0U) / 2];
>> +		break;
>> +	case SPRN_DBAT4U ... SPRN_DBAT7L:
>> +		bat = &vcpu_book3s->dbat[(sprn - SPRN_DBAT4U) / 2];
>> +		break;
>
> Do xBAT4..7 have the same SPR numbers on all CPUs?  They are CPU-specific
> SPRs, after all.  Some CPUs have only six, some only four, some none, btw.

For now only Linux runs as a guest, and IIRC it only uses the first 3(?).
But yes, it's probably worth looking into at some point.
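
One detail of the quoted switch: (sprn - SPRN_IBAT4U) / 2 yields 0..3, so as
written xBAT4..7 land on the same array slots as xBAT0..3. Below is a minimal
sketch of a mapping that keeps them apart, assuming the ibat/dbat arrays have
eight entries each. The SPR numbers are the ones I recall from Linux's
asm/reg.h and should be treated as assumptions, as should the struct and
function names, which are hypothetical.

```c
/* SPR numbers assumed from Linux's asm/reg.h; they may differ per CPU. */
#define SPRN_IBAT0U 0x210
#define SPRN_DBAT0U 0x218
#define SPRN_IBAT4U 0x230
#define SPRN_DBAT4U 0x238

struct bat_slot {
	int is_data;  /* 0 = IBAT, 1 = DBAT */
	int index;    /* 0..7 */
	int is_upper; /* 1 = upper half (even SPR number), 0 = lower half */
};

/* Returns 0 and fills *out for a BAT SPR, -1 for any other SPR number. */
static int decode_bat_sprn(int sprn, struct bat_slot *out)
{
	if (sprn >= SPRN_IBAT0U && sprn < SPRN_IBAT0U + 8) {
		out->is_data = 0;
		out->index = (sprn - SPRN_IBAT0U) / 2;
	} else if (sprn >= SPRN_DBAT0U && sprn < SPRN_DBAT0U + 8) {
		out->is_data = 1;
		out->index = (sprn - SPRN_DBAT0U) / 2;
	} else if (sprn >= SPRN_IBAT4U && sprn < SPRN_IBAT4U + 8) {
		out->is_data = 0;
		out->index = 4 + (sprn - SPRN_IBAT4U) / 2; /* offset past BAT0..3 */
	} else if (sprn >= SPRN_DBAT4U && sprn < SPRN_DBAT4U + 8) {
		out->is_data = 1;
		out->index = 4 + (sprn - SPRN_DBAT4U) / 2;
	} else {
		return -1;
	}
	out->is_upper = (sprn % 2) == 0;
	return 0;
}
```

Returning an error for unknown SPR numbers, instead of BUG(), would also leave
room for guests whose CPU model has fewer BATs.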

>
>> +	case SPRN_HID0:
>> +		to_book3s(vcpu)->hid[0] = vcpu->arch.gpr[rs];
>> +		break;
>> +	case SPRN_HID1:
>> +		to_book3s(vcpu)->hid[1] = vcpu->arch.gpr[rs];
>> +		break;
>> +	case SPRN_HID2:
>> +		to_book3s(vcpu)->hid[2] = vcpu->arch.gpr[rs];
>> +		break;
>> +	case SPRN_HID4:
>> +		to_book3s(vcpu)->hid[4] = vcpu->arch.gpr[rs];
>> +		break;
>> +	case SPRN_HID5:
>> +		to_book3s(vcpu)->hid[5] = vcpu->arch.gpr[rs];
>
> HIDs are different per CPU; and worse, different CPUs have different
> registers (SPR #s) for the same register name!

Sigh :-(

>> +		/* guest HID5 set can change is_dcbz32 */
>> +		if (vcpu->arch.mmu.is_dcbz32(vcpu) &&
>> +		    (mfmsr() & MSR_HV))
>> +			vcpu->arch.hflags |= BOOK3S_HFLAG_DCBZ32;
>> +		break;
>
> Wait, does this mean you allow other HID writes when MSR[HV] isn't
> set?  All HIDs (and many other SPRs) cannot be read or written in
> supervisor mode.

When we're running in MSR_HV=0 mode on a 970 we can use the 32 byte  
dcbz HID flag. So all we need to do is tell our entry/exit code to set  
this bit.

If we're on 970 on a hypervisor or on a non-970 though we can't use  
the HID5 bit, so we need to binary patch the opcodes.

So in order to emulate real 970 behavior, we need to be able to  
emulate that HID5 bit too! That's what this chunk of code does - it  
basically sets us in dcbz32 mode when allowed on 970 guests.

Alex
Benjamin Herrenschmidt - Nov. 3, 2009, 9:38 p.m.
On Tue, 2009-11-03 at 10:06 +0100, Alexander Graf wrote:

> > DCBZ zeroes out a cache line, not 32 bytes; except on 970, where there
> > are HID bits to make it work on 32 bytes only, and an extra DCBZL insn
> > that always clears a full cache line (128 bytes).
> 
> Yes. We only come here when we patched the dcbz opcodes to invalid  
> instructions because cache line size of target == 32.
> On 970 with MSR_HV = 0 we actually use the dcbz 32-bytes mode.
> 
> Admittedly though, this could be a lot more clever.

Yeah well, we also really need to fix ppc32 Linux to use the device-tree
provided cache line size :-) For 64-bits, that should already be the
case, and thus the emulation trick shouldn't be useful as long as you
properly provide the guest with the right size in the device-tree.

(Though glibc can be nasty, afaik it might load up optimized variants of
some routines with hard wired cache line sizes based on the CPU type)

> >> +	switch (sprn) {
> >> +	case SPRN_IBAT0U ... SPRN_IBAT3L:
> >> +		bat = &vcpu_book3s->ibat[(sprn - SPRN_IBAT0U) / 2];
> >> +		break;
> >> +	case SPRN_IBAT4U ... SPRN_IBAT7L:
> >> +		bat = &vcpu_book3s->ibat[(sprn - SPRN_IBAT4U) / 2];
> >> +		break;
> >> +	case SPRN_DBAT0U ... SPRN_DBAT3L:
> >> +		bat = &vcpu_book3s->dbat[(sprn - SPRN_DBAT0U) / 2];
> >> +		break;
> >> +	case SPRN_DBAT4U ... SPRN_DBAT7L:
> >> +		bat = &vcpu_book3s->dbat[(sprn - SPRN_DBAT4U) / 2];
> >> +		break;
> >
> > Do xBAT4..7 have the same SPR numbers on all CPUs?  They are CPU-specific
> > SPRs, after all.  Some CPUs have only six, some only four, some none, btw.
> 
> For now only Linux runs which only uses the first 3(?) IIRC. But yes,  
> it's probably worth looking into at one point or the other.
> 
> >
> >> +	case SPRN_HID0:
> >> +		to_book3s(vcpu)->hid[0] = vcpu->arch.gpr[rs];
> >> +		break;
> >> +	case SPRN_HID1:
> >> +		to_book3s(vcpu)->hid[1] = vcpu->arch.gpr[rs];
> >> +		break;
> >> +	case SPRN_HID2:
> >> +		to_book3s(vcpu)->hid[2] = vcpu->arch.gpr[rs];
> >> +		break;
> >> +	case SPRN_HID4:
> >> +		to_book3s(vcpu)->hid[4] = vcpu->arch.gpr[rs];
> >> +		break;
> >> +	case SPRN_HID5:
> >> +		to_book3s(vcpu)->hid[5] = vcpu->arch.gpr[rs];
> >
> > HIDs are different per CPU; and worse, different CPUs have different
> > registers (SPR #s) for the same register name!
> 
> Sigh :-(

On the other hand, you can probably just "Swallow" all of these and
Linux won't even notice, except for the case of the sleep state maybe on
6xx/7xx/7xxx. Just a matter of knowing what you are emulating as guest.

Cheers,
Ben.
Arnd Bergmann - Nov. 4, 2009, 8:43 a.m.
On Tuesday 03 November 2009, Benjamin Herrenschmidt wrote:
> (Though glibc can be nasty, afaik it might load up optimized variants of
> some routines with hard wired cache line sizes based on the CPU type)

You can also get applications with hand-coded cache optimizations
that are even harder, if not impossible, to fix.

	Arnd <><
Benjamin Herrenschmidt - Nov. 4, 2009, 8:47 a.m.
On Wed, 2009-11-04 at 09:43 +0100, Arnd Bergmann wrote:
> On Tuesday 03 November 2009, Benjamin Herrenschmidt wrote:
> > (Though glibc can be nasty, afaik it might load up optimized variants of
> > some routines with hard wired cache line sizes based on the CPU type)
> 
> You can also get application with hand-coded cache optimizations
> that are even harder, if not impossible, to fix. 

Right. But those are already broken across CPU variants anyways.

Cheers,
Ben
Alexander Graf - Nov. 4, 2009, 11:35 a.m.
On 04.11.2009, at 09:47, Benjamin Herrenschmidt wrote:

> On Wed, 2009-11-04 at 09:43 +0100, Arnd Bergmann wrote:
>> On Tuesday 03 November 2009, Benjamin Herrenschmidt wrote:
>>> (Though glibc can be nasty, afaik it might load up optimized variants of
>>> some routines with hard wired cache line sizes based on the CPU type)
>>
>> You can also get application with hand-coded cache optimizations
>> that are even harder, if not impossible, to fix.
>
> Right. But those are already broken across CPU variants anyways.

... which might be the reason you're using KVM in the first place.

Alex
Segher Boessenkool - Nov. 5, 2009, 12:53 a.m.
>>> +		case OP_31_XOP_EIOIO:
>>> +			break;
>>
>> Have you always executed an eieio or sync when you get here, or
>> do you just not allow direct access to I/O devices?  Other context
>> synchronising insns are not enough, they do not broadcast on the
>> bus.
>
> There is no device passthrough yet :-). It's theoretically  
> possible, but nothing for it is implemented so far.

You could just always do an eieio here, it's not expensive at all
compared to the emulation trap itself.

However -- eieio is a Book II insn, it will never trap anyway!

>>> +		case OP_31_XOP_DCBZ:
>>> +		{
>>> +			ulong rb =  vcpu->arch.gpr[get_rb(inst)];
>>> +			ulong ra = 0;
>>> +			ulong addr;
>>> +			u32 zeros[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
>>> +
>>> +			if (get_ra(inst))
>>> +				ra = vcpu->arch.gpr[get_ra(inst)];
>>> +
>>> +			addr = (ra + rb) & ~31ULL;
>>> +			if (!(vcpu->arch.msr & MSR_SF))
>>> +				addr &= 0xffffffff;
>>> +
>>> +			if (kvmppc_st(vcpu, addr, 32, zeros)) {
>>
>> DCBZ zeroes out a cache line, not 32 bytes; except on 970, where there
>> are HID bits to make it work on 32 bytes only, and an extra DCBZL insn
>> that always clears a full cache line (128 bytes).
>
> Yes. We only come here when we patched the dcbz opcodes to invalid  
> instructions

Ah yes, I forgot.  Could you rename it to OP_31_XOP_FAKE_32BIT_DCBZ
or such?

> because cache line size of target == 32.
> On 970 with MSR_HV = 0 we actually use the dcbz 32-bytes mode.
>
> Admittedly though, this could be a lot more clever.

>>> +		/* guest HID5 set can change is_dcbz32 */
>>> +		if (vcpu->arch.mmu.is_dcbz32(vcpu) &&
>>> +		    (mfmsr() & MSR_HV))
>>> +			vcpu->arch.hflags |= BOOK3S_HFLAG_DCBZ32;
>>> +		break;
>>
>> Wait, does this mean you allow other HID writes when MSR[HV] isn't
>> set?  All HIDs (and many other SPRs) cannot be read or written in
>> supervisor mode.
>
> When we're running in MSR_HV=0 mode on a 970 we can use the 32 byte  
> dcbz HID flag. So all we need to do is tell our entry/exit code to  
> set this bit.

Which patch contains that entry/exit code?

> If we're on 970 on a hypervisor or on a non-970 though we can't use  
> the HID5 bit, so we need to binary patch the opcodes.
>
> So in order to emulate real 970 behavior, we need to be able to  
> emulate that HID5 bit too! That's what this chunk of code does - it  
> basically sets us in dcbz32 mode when allowed on 970 guests.

But when MSR[HV]=0 and MSR[PR]=0, mtspr to a hypervisor resource
will not trap but be silently ignored.  Sorry for not being more clear.
...Oh.  You run your guest as MSR[PR]=1 anyway!  Tricky.


Segher
Alexander Graf - Nov. 5, 2009, 10:09 a.m.
On 05.11.2009, at 01:53, Segher Boessenkool wrote:

>>>> +		case OP_31_XOP_EIOIO:
>>>> +			break;
>>>
>>> Have you always executed an eieio or sync when you get here, or
>>> do you just not allow direct access to I/O devices?  Other context
>>> synchronising insns are not enough, they do not broadcast on the
>>> bus.
>>
>> There is no device passthrough yet :-). It's theoretically  
>> possible, but nothing for it is implemented so far.
>
> You could just always do an eieio here, it's not expensive at all
> compared to the emulation trap itself.
>
> However -- eieio is a Book II insn, it will never trap anyway!

Don't all 31 ops trap? I'm pretty sure I added the emulation because I  
saw the trap.

>>>> +		case OP_31_XOP_DCBZ:
>>>> +		{
>>>> +			ulong rb =  vcpu->arch.gpr[get_rb(inst)];
>>>> +			ulong ra = 0;
>>>> +			ulong addr;
>>>> +			u32 zeros[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
>>>> +
>>>> +			if (get_ra(inst))
>>>> +				ra = vcpu->arch.gpr[get_ra(inst)];
>>>> +
>>>> +			addr = (ra + rb) & ~31ULL;
>>>> +			if (!(vcpu->arch.msr & MSR_SF))
>>>> +				addr &= 0xffffffff;
>>>> +
>>>> +			if (kvmppc_st(vcpu, addr, 32, zeros)) {
>>>
>>> DCBZ zeroes out a cache line, not 32 bytes; except on 970, where there
>>> are HID bits to make it work on 32 bytes only, and an extra DCBZL insn
>>> that always clears a full cache line (128 bytes).
>>
>> Yes. We only come here when we patched the dcbz opcodes to invalid  
>> instructions
>
> Ah yes, I forgot.  Could you rename it to OP_31_XOP_FAKE_32BIT_DCBZ
> or such?

Good idea.

>> because cache line size of target == 32.
>> On 970 with MSR_HV = 0 we actually use the dcbz 32-bytes mode.
>>
>> Admittedly though, this could be a lot more clever.
>
>>>> +		/* guest HID5 set can change is_dcbz32 */
>>>> +		if (vcpu->arch.mmu.is_dcbz32(vcpu) &&
>>>> +		    (mfmsr() & MSR_HV))
>>>> +			vcpu->arch.hflags |= BOOK3S_HFLAG_DCBZ32;
>>>> +		break;
>>>
>>> Wait, does this mean you allow other HID writes when MSR[HV] isn't
>>> set?  All HIDs (and many other SPRs) cannot be read or written in
>>> supervisor mode.
>>
>> When we're running in MSR_HV=0 mode on a 970 we can use the 32 byte  
>> dcbz HID flag. So all we need to do is tell our entry/exit code to  
>> set this bit.
>
> Which patch contains that entry/exit code?

That's patch 7 / 27.

+	/* Some guests may need to have dcbz set to 32 byte length.
+	 *
+	 * Usually we ensure that by patching the guest's instructions
+	 * to trap on dcbz and emulate it in the hypervisor.
+	 *
+	 * If we can, we should tell the CPU to use 32 byte dcbz though,
+	 * because that's a lot faster.
+	 */
+
+	ld	r3, VCPU_HFLAGS(r4)
+	rldicl.	r3, r3, 0, 63		/* CR = ((r3 & 1) == 0) */
+	beq	no_dcbz32_on
+
+	mfspr   r3,SPRN_HID5
+	ori     r3, r3, 0x80		/* XXX HID5_dcbz32 = 0x80 */
+	mtspr   SPRN_HID5,r3
+
+no_dcbz32_on:
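
In C terms the quoted entry snippet amounts to roughly the following. This is
a sketch only: the function and macro names are hypothetical, and the 0x80 bit
value is taken from the XXX comment in the assembly rather than from any
datasheet.

```c
#include <stdint.h>

#define HFLAG_DCBZ32_BIT UINT64_C(0x1)  /* bit kept by rldicl. r3, r3, 0, 63 */
#define HID5_DCBZ32      UINT64_C(0x80) /* marked XXX in the quoted assembly */

/* Value to write to HID5 on guest entry, given the vcpu's hflags. */
static uint64_t hid5_for_guest_entry(uint64_t hid5, uint64_t hflags)
{
	if (hflags & HFLAG_DCBZ32_BIT) /* beq no_dcbz32_on skips this when clear */
		hid5 |= HID5_DCBZ32;
	return hid5;
}
```

The exit path would presumably have to clear the bit again so the host keeps
its normal dcbz behaviour.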

>> If we're on 970 on a hypervisor or on a non-970 though we can't use  
>> the HID5 bit, so we need to binary patch the opcodes.
>>
>> So in order to emulate real 970 behavior, we need to be able to  
>> emulate that HID5 bit too! That's what this chunk of code does - it  
>> basically sets us in dcbz32 mode when allowed on 970 guests.
>
> But when MSR[HV]=0 and MSR[PR]=0, mtspr to a hypervisor resource
> will not trap but be silently ignored.  Sorry for not being more clear.
> ...Oh.  You run your guest as MSR[PR]=1 anyway!  Tricky.

Yeah, the guest is always running in PR=1, so all HV checks are for  
the host. Usually we run in HV=1 on the host, because IBM doesn't sell  
machines that have HV=0 accessible for mortals :-).


I'll address your comments in a follow-up patch once the stuff is  
merged.

Alex

Patch

diff --git a/arch/powerpc/kvm/book3s_64_emulate.c b/arch/powerpc/kvm/book3s_64_emulate.c
new file mode 100644
index 0000000..c343e67
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_64_emulate.c
@@ -0,0 +1,337 @@ 
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright SUSE Linux Products GmbH 2009
+ *
+ * Authors: Alexander Graf <agraf@suse.de>
+ */
+
+#include <asm/kvm_ppc.h>
+#include <asm/disassemble.h>
+#include <asm/kvm_book3s.h>
+#include <asm/reg.h>
+
+#define OP_19_XOP_RFID		18
+#define OP_19_XOP_RFI		50
+
+#define OP_31_XOP_MFMSR		83
+#define OP_31_XOP_MTMSR		146
+#define OP_31_XOP_MTMSRD	178
+#define OP_31_XOP_MTSRIN	242
+#define OP_31_XOP_TLBIEL	274
+#define OP_31_XOP_TLBIE		306
+#define OP_31_XOP_SLBMTE	402
+#define OP_31_XOP_SLBIE		434
+#define OP_31_XOP_SLBIA		498
+#define OP_31_XOP_MFSRIN	659
+#define OP_31_XOP_SLBMFEV	851
+#define OP_31_XOP_EIOIO		854
+#define OP_31_XOP_SLBMFEE	915
+
+/* DCBZ is actually 1014, but we patch it to 1010 so we get a trap */
+#define OP_31_XOP_DCBZ		1010
+
+int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
+                           unsigned int inst, int *advance)
+{
+	int emulated = EMULATE_DONE;
+
+	switch (get_op(inst)) {
+	case 19:
+		switch (get_xop(inst)) {
+		case OP_19_XOP_RFID:
+		case OP_19_XOP_RFI:
+			vcpu->arch.pc = vcpu->arch.srr0;
+			kvmppc_set_msr(vcpu, vcpu->arch.srr1);
+			*advance = 0;
+			break;
+
+		default:
+			emulated = EMULATE_FAIL;
+			break;
+		}
+		break;
+	case 31:
+		switch (get_xop(inst)) {
+		case OP_31_XOP_MFMSR:
+			vcpu->arch.gpr[get_rt(inst)] = vcpu->arch.msr;
+			break;
+		case OP_31_XOP_MTMSRD:
+		{
+			ulong rs = vcpu->arch.gpr[get_rs(inst)];
+			if (inst & 0x10000) {
+				vcpu->arch.msr &= ~(MSR_RI | MSR_EE);
+				vcpu->arch.msr |= rs & (MSR_RI | MSR_EE);
+			} else
+				kvmppc_set_msr(vcpu, rs);
+			break;
+		}
+		case OP_31_XOP_MTMSR:
+			kvmppc_set_msr(vcpu, vcpu->arch.gpr[get_rs(inst)]);
+			break;
+		case OP_31_XOP_MFSRIN:
+		{
+			int srnum;
+
+			srnum = (vcpu->arch.gpr[get_rb(inst)] >> 28) & 0xf;
+			if (vcpu->arch.mmu.mfsrin) {
+				u32 sr;
+				sr = vcpu->arch.mmu.mfsrin(vcpu, srnum);
+				vcpu->arch.gpr[get_rt(inst)] = sr;
+			}
+			break;
+		}
+		case OP_31_XOP_MTSRIN:
+			vcpu->arch.mmu.mtsrin(vcpu,
+				(vcpu->arch.gpr[get_rb(inst)] >> 28) & 0xf,
+				vcpu->arch.gpr[get_rs(inst)]);
+			break;
+		case OP_31_XOP_TLBIE:
+		case OP_31_XOP_TLBIEL:
+		{
+			bool large = (inst & 0x00200000) ? true : false;
+			ulong addr = vcpu->arch.gpr[get_rb(inst)];
+			vcpu->arch.mmu.tlbie(vcpu, addr, large);
+			break;
+		}
+		case OP_31_XOP_EIOIO:
+			break;
+		case OP_31_XOP_SLBMTE:
+			if (!vcpu->arch.mmu.slbmte)
+				return EMULATE_FAIL;
+
+			vcpu->arch.mmu.slbmte(vcpu, vcpu->arch.gpr[get_rs(inst)],
+						vcpu->arch.gpr[get_rb(inst)]);
+			break;
+		case OP_31_XOP_SLBIE:
+			if (!vcpu->arch.mmu.slbie)
+				return EMULATE_FAIL;
+
+			vcpu->arch.mmu.slbie(vcpu, vcpu->arch.gpr[get_rb(inst)]);
+			break;
+		case OP_31_XOP_SLBIA:
+			if (!vcpu->arch.mmu.slbia)
+				return EMULATE_FAIL;
+
+			vcpu->arch.mmu.slbia(vcpu);
+			break;
+		case OP_31_XOP_SLBMFEE:
+			if (!vcpu->arch.mmu.slbmfee) {
+				emulated = EMULATE_FAIL;
+			} else {
+				ulong t, rb;
+
+				rb = vcpu->arch.gpr[get_rb(inst)];
+				t = vcpu->arch.mmu.slbmfee(vcpu, rb);
+				vcpu->arch.gpr[get_rt(inst)] = t;
+			}
+			break;
+		case OP_31_XOP_SLBMFEV:
+			if (!vcpu->arch.mmu.slbmfev) {
+				emulated = EMULATE_FAIL;
+			} else {
+				ulong t, rb;
+
+				rb = vcpu->arch.gpr[get_rb(inst)];
+				t = vcpu->arch.mmu.slbmfev(vcpu, rb);
+				vcpu->arch.gpr[get_rt(inst)] = t;
+			}
+			break;
+		case OP_31_XOP_DCBZ:
+		{
+			ulong rb =  vcpu->arch.gpr[get_rb(inst)];
+			ulong ra = 0;
+			ulong addr;
+			u32 zeros[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
+
+			if (get_ra(inst))
+				ra = vcpu->arch.gpr[get_ra(inst)];
+
+			addr = (ra + rb) & ~31ULL;
+			if (!(vcpu->arch.msr & MSR_SF))
+				addr &= 0xffffffff;
+
+			if (kvmppc_st(vcpu, addr, 32, zeros)) {
+				vcpu->arch.dear = addr;
+				vcpu->arch.fault_dear = addr;
+				to_book3s(vcpu)->dsisr = DSISR_PROTFAULT |
+						      DSISR_ISSTORE;
+				kvmppc_book3s_queue_irqprio(vcpu,
+					BOOK3S_INTERRUPT_DATA_STORAGE);
+				kvmppc_mmu_pte_flush(vcpu, addr, ~0xFFFULL);
+			}
+
+			break;
+		}
+		default:
+			emulated = EMULATE_FAIL;
+		}
+		break;
+	default:
+		emulated = EMULATE_FAIL;
+	}
+
+	return emulated;
+}
+
+static void kvmppc_write_bat(struct kvm_vcpu *vcpu, int sprn, u64 val)
+{
+	struct kvmppc_vcpu_book3s *vcpu_book3s = to_book3s(vcpu);
+	struct kvmppc_bat *bat;
+
+	switch (sprn) {
+	case SPRN_IBAT0U ... SPRN_IBAT3L:
+		bat = &vcpu_book3s->ibat[(sprn - SPRN_IBAT0U) / 2];
+		break;
+	case SPRN_IBAT4U ... SPRN_IBAT7L:
+		bat = &vcpu_book3s->ibat[(sprn - SPRN_IBAT4U) / 2];
+		break;
+	case SPRN_DBAT0U ... SPRN_DBAT3L:
+		bat = &vcpu_book3s->dbat[(sprn - SPRN_DBAT0U) / 2];
+		break;
+	case SPRN_DBAT4U ... SPRN_DBAT7L:
+		bat = &vcpu_book3s->dbat[(sprn - SPRN_DBAT4U) / 2];
+		break;
+	default:
+		BUG();
+	}
+
+	if (!(sprn % 2)) {
+		/* Upper BAT */
+		u32 bl = (val >> 2) & 0x7ff;
+		bat->bepi_mask = (~bl << 17);
+		bat->bepi = val & 0xfffe0000;
+		bat->vs = (val & 2) ? 1 : 0;
+		bat->vp = (val & 1) ? 1 : 0;
+	} else {
+		/* Lower BAT */
+		bat->brpn = val & 0xfffe0000;
+		bat->wimg = (val >> 3) & 0xf;
+		bat->pp = val & 3;
+	}
+}
+
+int kvmppc_core_emulate_mtspr(struct kvm_vcpu *vcpu, int sprn, int rs)
+{
+	int emulated = EMULATE_DONE;
+
+	switch (sprn) {
+	case SPRN_SDR1:
+		to_book3s(vcpu)->sdr1 = vcpu->arch.gpr[rs];
+		break;
+	case SPRN_DSISR:
+		to_book3s(vcpu)->dsisr = vcpu->arch.gpr[rs];
+		break;
+	case SPRN_DAR:
+		vcpu->arch.dear = vcpu->arch.gpr[rs];
+		break;
+	case SPRN_HIOR:
+		to_book3s(vcpu)->hior = vcpu->arch.gpr[rs];
+		break;
+	case SPRN_IBAT0U ... SPRN_IBAT3L:
+	case SPRN_IBAT4U ... SPRN_IBAT7L:
+	case SPRN_DBAT0U ... SPRN_DBAT3L:
+	case SPRN_DBAT4U ... SPRN_DBAT7L:
+		kvmppc_write_bat(vcpu, sprn, vcpu->arch.gpr[rs]);
+		/* BAT writes happen so rarely that we're ok to flush
+		 * everything here */
+		kvmppc_mmu_pte_flush(vcpu, 0, 0);
+		break;
+	case SPRN_HID0:
+		to_book3s(vcpu)->hid[0] = vcpu->arch.gpr[rs];
+		break;
+	case SPRN_HID1:
+		to_book3s(vcpu)->hid[1] = vcpu->arch.gpr[rs];
+		break;
+	case SPRN_HID2:
+		to_book3s(vcpu)->hid[2] = vcpu->arch.gpr[rs];
+		break;
+	case SPRN_HID4:
+		to_book3s(vcpu)->hid[4] = vcpu->arch.gpr[rs];
+		break;
+	case SPRN_HID5:
+		to_book3s(vcpu)->hid[5] = vcpu->arch.gpr[rs];
+		/* guest HID5 set can change is_dcbz32 */
+		if (vcpu->arch.mmu.is_dcbz32(vcpu) &&
+		    (mfmsr() & MSR_HV))
+			vcpu->arch.hflags |= BOOK3S_HFLAG_DCBZ32;
+		break;
+	case SPRN_ICTC:
+	case SPRN_THRM1:
+	case SPRN_THRM2:
+	case SPRN_THRM3:
+	case SPRN_CTRLF:
+	case SPRN_CTRLT:
+		break;
+	default:
+		printk(KERN_INFO "KVM: invalid SPR write: %d\n", sprn);
+#ifndef DEBUG_SPR
+		emulated = EMULATE_FAIL;
+#endif
+		break;
+	}
+
+	return emulated;
+}
+
+int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, int rt)
+{
+	int emulated = EMULATE_DONE;
+
+	switch (sprn) {
+	case SPRN_SDR1:
+		vcpu->arch.gpr[rt] = to_book3s(vcpu)->sdr1;
+		break;
+	case SPRN_DSISR:
+		vcpu->arch.gpr[rt] = to_book3s(vcpu)->dsisr;
+		break;
+	case SPRN_DAR:
+		vcpu->arch.gpr[rt] = vcpu->arch.dear;
+		break;
+	case SPRN_HIOR:
+		vcpu->arch.gpr[rt] = to_book3s(vcpu)->hior;
+		break;
+	case SPRN_HID0:
+		vcpu->arch.gpr[rt] = to_book3s(vcpu)->hid[0];
+		break;
+	case SPRN_HID1:
+		vcpu->arch.gpr[rt] = to_book3s(vcpu)->hid[1];
+		break;
+	case SPRN_HID2:
+		vcpu->arch.gpr[rt] = to_book3s(vcpu)->hid[2];
+		break;
+	case SPRN_HID4:
+		vcpu->arch.gpr[rt] = to_book3s(vcpu)->hid[4];
+		break;
+	case SPRN_HID5:
+		vcpu->arch.gpr[rt] = to_book3s(vcpu)->hid[5];
+		break;
+	case SPRN_THRM1:
+	case SPRN_THRM2:
+	case SPRN_THRM3:
+	case SPRN_CTRLF:
+	case SPRN_CTRLT:
+		vcpu->arch.gpr[rt] = 0;
+		break;
+	default:
+		printk(KERN_INFO "KVM: invalid SPR read: %d\n", sprn);
+#ifndef DEBUG_SPR
+		emulated = EMULATE_FAIL;
+#endif
+		break;
+	}
+
+	return emulated;
+}
+