Patchwork [27/27] KVM: PPC: Add Documentation about PV interface

login
register
mail settings
Submitter Alexander Graf
Date July 1, 2010, 10:43 a.m.
Message ID <1277980982-12433-28-git-send-email-agraf@suse.de>
Download mbox | patch
Permalink /patch/57527/
State Not Applicable
Headers show

Comments

Alexander Graf - July 1, 2010, 10:43 a.m.
We just introduced a new PV interface that screams for documentation. So here
it is - a shiny new and awesome text file describing the internal works of
the PPC KVM paravirtual interface.

Signed-off-by: Alexander Graf <agraf@suse.de>

---

v1 -> v2:

  - clarify guest implementation
  - clarify that privileged instructions still work
  - explain safe MSR bits
  - Fix dsisr patch description
  - change hypervisor calls to use new register values
---
 Documentation/kvm/ppc-pv.txt |  185 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 185 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/kvm/ppc-pv.txt
Segher Boessenkool - July 2, 2010, 4:27 p.m.
> +To find out if we're running on KVM or not, we overlay the PVR  
> register. Usually
> +the PVR register contains an id that identifies your CPU type. If,  
> however, you
> +pass KVM_PVR_PARA in the register that you want the PVR result in,  
> the register
> +still contains KVM_PVR_PARA after the mfpvr call.
> +
> +	LOAD_REG_IMM(r5, KVM_PVR_PARA)
> +	mfpvr	r5
> +	[r5 still contains KVM_PVR_PARA]

I love this part :-)

> +	__u64 scratch3;
> +	__u64 critical;		/* Guest may not get interrupts if == r1 */
> +	__u64 sprg0;
> +	__u64 sprg1;
> +	__u64 sprg2;
> +	__u64 sprg3;
> +	__u64 srr0;
> +	__u64 srr1;
> +	__u64 dar;
> +	__u64 msr;
> +	__u32 dsisr;
> +	__u32 int_pending;	/* Tells the guest if we have an interrupt */
> +};
> +
> +Additions to the page must only occur at the end. Struct fields  
> are always 32
> +bit aligned.

The u64s are 64-bit aligned, should they always be?

> +The "ld" and "std" instructions are transormed to "lwz" and "stw"  
> instructions
> +respectively on 32 bit systems with an added offset of 4 to  
> accomodate for big
> +endianness.

Will this add never overflow?  Is there anything that checks for it?

> +mtmsrd	rX, 0		b	<special mtmsr section>
> +mtmsr			b	<special mtmsr section>

mtmsr rX


Segher
Hollis Blanchard - July 2, 2010, 5:59 p.m.
[Resending...]

Please reconcile this with
http://www.linux-kvm.org/page/PowerPC_Hypercall_ABI, which has been
discussed in the (admittedly closed) Power.org embedded hypervisor
working group. Bear in mind that other hypervisors are already
implementing the documented ABI, so if you have concerns, you should
probably raise them with that audience...

-Hollis

On Thu, Jul 1, 2010 at 3:43 AM, Alexander Graf <agraf@suse.de> wrote:
>
> We just introduced a new PV interface that screams for documentation. So here
> it is - a shiny new and awesome text file describing the internal works of
> the PPC KVM paravirtual interface.
>
> Signed-off-by: Alexander Graf <agraf@suse.de>
>
> ---
>
> v1 -> v2:
>
>  - clarify guest implementation
>  - clarify that privileged instructions still work
>  - explain safe MSR bits
>  - Fix dsisr patch description
>  - change hypervisor calls to use new register values
> ---
>  Documentation/kvm/ppc-pv.txt |  185 ++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 185 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/kvm/ppc-pv.txt
>
> diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt
> new file mode 100644
> index 0000000..82de6c6
> --- /dev/null
> +++ b/Documentation/kvm/ppc-pv.txt
> @@ -0,0 +1,185 @@
> +The PPC KVM paravirtual interface
> +=================================
> +
> +The basic execution principle by which KVM on PowerPC works is to run all kernel
> +space code in PR=1 which is user space. This way we trap all privileged
> +instructions and can emulate them accordingly.
> +
> +Unfortunately that is also the downfall. There are quite some privileged
> +instructions that needlessly return us to the hypervisor even though they
> +could be handled differently.
> +
> +This is what the PPC PV interface helps with. It takes privileged instructions
> +and transforms them into unprivileged ones with some help from the hypervisor.
> +This cuts down virtualization costs by about 50% on some of my benchmarks.
> +
> +The code for that interface can be found in arch/powerpc/kernel/kvm*
> +
> +Querying for existence
> +======================
> +
> +To find out if we're running on KVM or not, we overlay the PVR register. Usually
> +the PVR register contains an id that identifies your CPU type. If, however, you
> +pass KVM_PVR_PARA in the register that you want the PVR result in, the register
> +still contains KVM_PVR_PARA after the mfpvr call.
> +
> +       LOAD_REG_IMM(r5, KVM_PVR_PARA)
> +       mfpvr   r5
> +       [r5 still contains KVM_PVR_PARA]
> +
> +Once determined to run under a PV capable KVM, you can now use hypercalls as
> +described below.
> +
> +PPC hypercalls
> +==============
> +
> +The only viable ways to reliably get from guest context to host context are:
> +
> +       1) Call an invalid instruction
> +       2) Call the "sc" instruction with a parameter to "sc"
> +       3) Call the "sc" instruction with parameters in GPRs
> +
> +Method 1 is always a bad idea. Invalid instructions can be replaced later on
> +by valid instructions, rendering the interface broken.
> +
> +Method 2 also has downfalls. If the parameter to "sc" is != 0 the spec is
> +rather unclear if the sc is targeted directly for the hypervisor or the
> +supervisor. It would also require that we read the syscall issuing instruction
> +every time a syscall is issued, slowing down guest syscalls.
> +
> +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R0 and
> +KVM_SC_MAGIC_R3) in r0 and r3 respectively. If a syscall instruction with these
> +magic values arrives from the guest's kernel mode, we take the syscall as a
> +hypercall.
> +
> +The parameters are as follows:
> +
> +       r0              KVM_SC_MAGIC_R0
> +       r3              KVM_SC_MAGIC_R3         Return code
> +       r4              Hypercall number
> +       r5              First parameter
> +       r6              Second parameter
> +       r7              Third parameter
> +       r8              Fourth parameter
> +
> +Hypercall definitions are shared in generic code, so the same hypercall numbers
> +apply for x86 and powerpc alike.
> +
> +The magic page
> +==============
> +
> +To enable communication between the hypervisor and guest there is a new shared
> +page that contains parts of supervisor visible register state. The guest can
> +map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
> +
> +With this hypercall issued the guest always gets the magic page mapped at the
> +desired location in effective and physical address space. For now, we always
> +map the page to -4096. This way we can access it using absolute load and store
> +functions. The following instruction reads the first field of the magic page:
> +
> +       ld      rX, -4096(0)
> +
> +The interface is designed to be extensible should there be need later to add
> +additional registers to the magic page. If you add fields to the magic page,
> +also define a new hypercall feature to indicate that the host can give you more
> +registers. Only if the host supports the additional features, make use of them.
> +
> +The magic page has the following layout as described in
> +arch/powerpc/include/asm/kvm_para.h:
> +
> +struct kvm_vcpu_arch_shared {
> +       __u64 scratch1;
> +       __u64 scratch2;
> +       __u64 scratch3;
> +       __u64 critical;         /* Guest may not get interrupts if == r1 */
> +       __u64 sprg0;
> +       __u64 sprg1;
> +       __u64 sprg2;
> +       __u64 sprg3;
> +       __u64 srr0;
> +       __u64 srr1;
> +       __u64 dar;
> +       __u64 msr;
> +       __u32 dsisr;
> +       __u32 int_pending;      /* Tells the guest if we have an interrupt */
> +};
> +
> +Additions to the page must only occur at the end. Struct fields are always 32
> +bit aligned.
> +
> +MSR bits
> +========
> +
> +The MSR contains bits that require hypervisor intervention and bits that do
> +not require direct hypervisor intervention because they only get interpreted
> +when entering the guest or don't have any impact on the hypervisor's behavior.
> +
> +The following bits are safe to be set inside the guest:
> +
> +  MSR_EE
> +  MSR_RI
> +  MSR_CR
> +  MSR_ME
> +
> +If any other bit changes in the MSR, please still use mtmsr(d).
> +
> +Patched instructions
> +====================
> +
> +The "ld" and "std" instructions are transormed to "lwz" and "stw" instructions
> +respectively on 32 bit systems with an added offset of 4 to accomodate for big
> +endianness.
> +
> +The following is a list of mapping the Linux kernel performs when running as
> +guest. Implementing any of those mappings is optional, as the instruction traps
> +also act on the shared page. So calling privileged instructions still works as
> +before.
> +
> +From                   To
> +====                   ==
> +
> +mfmsr  rX              ld      rX, magic_page->msr
> +mfsprg rX, 0           ld      rX, magic_page->sprg0
> +mfsprg rX, 1           ld      rX, magic_page->sprg1
> +mfsprg rX, 2           ld      rX, magic_page->sprg2
> +mfsprg rX, 3           ld      rX, magic_page->sprg3
> +mfsrr0 rX              ld      rX, magic_page->srr0
> +mfsrr1 rX              ld      rX, magic_page->srr1
> +mfdar  rX              ld      rX, magic_page->dar
> +mfdsisr        rX              lwz     rX, magic_page->dsisr
> +
> +mtmsr  rX              std     rX, magic_page->msr
> +mtsprg 0, rX           std     rX, magic_page->sprg0
> +mtsprg 1, rX           std     rX, magic_page->sprg1
> +mtsprg 2, rX           std     rX, magic_page->sprg2
> +mtsprg 3, rX           std     rX, magic_page->sprg3
> +mtsrr0 rX              std     rX, magic_page->srr0
> +mtsrr1 rX              std     rX, magic_page->srr1
> +mtdar  rX              std     rX, magic_page->dar
> +mtdsisr        rX              stw     rX, magic_page->dsisr
> +
> +tlbsync                        nop
> +
> +mtmsrd rX, 0           b       <special mtmsr section>
> +mtmsr                  b       <special mtmsr section>
> +
> +mtmsrd rX, 1           b       <special mtmsrd section>
> +
> +[BookE only]
> +wrteei [0|1]           b       <special wrteei section>
> +
> +
> +Some instructions require more logic to determine what's going on than a load
> +or store instruction can deliver. To enable patching of those, we keep some
> +RAM around where we can live translate instructions to. What happens is the
> +following:
> +
> +       1) copy emulation code to memory
> +       2) patch that code to fit the emulated instruction
> +       3) patch that code to return to the original pc + 4
> +       4) patch the original instruction to branch to the new code
> +
> +That way we can inject an arbitrary amount of code as replacement for a single
> +instruction. This allows us to check for pending interrupts when setting EE=1
> +for example.
> +
> --
> 1.6.0.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf - July 2, 2010, 6:41 p.m.
On 02.07.2010, at 18:27, Segher Boessenkool wrote:

>> +To find out if we're running on KVM or not, we overlay the PVR register. Usually
>> +the PVR register contains an id that identifies your CPU type. If, however, you
>> +pass KVM_PVR_PARA in the register that you want the PVR result in, the register
>> +still contains KVM_PVR_PARA after the mfpvr call.
>> +
>> +	LOAD_REG_IMM(r5, KVM_PVR_PARA)
>> +	mfpvr	r5
>> +	[r5 still contains KVM_PVR_PARA]
> 
> I love this part :-)

:)

> 
>> +	__u64 scratch3;
>> +	__u64 critical;		/* Guest may not get interrupts if == r1 */
>> +	__u64 sprg0;
>> +	__u64 sprg1;
>> +	__u64 sprg2;
>> +	__u64 sprg3;
>> +	__u64 srr0;
>> +	__u64 srr1;
>> +	__u64 dar;
>> +	__u64 msr;
>> +	__u32 dsisr;
>> +	__u32 int_pending;	/* Tells the guest if we have an interrupt */
>> +};
>> +
>> +Additions to the page must only occur at the end. Struct fields are always 32
>> +bit aligned.
> 
> The u64s are 64-bit aligned, should they always be?

That's obvious, isn't it? And the ABI only specifies u64s to be 32 bit aligned, no? At least that's what ld and std specify.

> 
>> +The "ld" and "std" instructions are transormed to "lwz" and "stw" instructions
>> +respectively on 32 bit systems with an added offset of 4 to accomodate for big
>> +endianness.
> 
> Will this add never overflow?  Is there anything that checks for it?

It basically means that to access dar, we either do

ld  rX, DAR(0)

or

lwz rX, DAR+4(0)


> 
>> +mtmsrd	rX, 0		b	<special mtmsr section>
>> +mtmsr			b	<special mtmsr section>
> 
> mtmsr rX

Nod.


Alex
Alexander Graf - July 2, 2010, 6:47 p.m.
On 02.07.2010, at 19:59, Hollis Blanchard wrote:

> [Resending...]
> 
> Please reconcile this with
> http://www.linux-kvm.org/page/PowerPC_Hypercall_ABI, which has been
> discussed in the (admittedly closed) Power.org embedded hypervisor
> working group. Bear in mind that other hypervisors are already
> implementing the documented ABI, so if you have concerns, you should
> probably raise them with that audience...

We can not use sc with LV=1 because that would break the KVM in something else case which is KVM's strong point on PPC.

Alex
Scott Wood - July 2, 2010, 7:10 p.m.
On Fri, 2 Jul 2010 20:47:44 +0200
Alexander Graf <agraf@suse.de> wrote:

> 
> On 02.07.2010, at 19:59, Hollis Blanchard wrote:
> 
> > [Resending...]
> > 
> > Please reconcile this with
> > http://www.linux-kvm.org/page/PowerPC_Hypercall_ABI, which has been
> > discussed in the (admittedly closed) Power.org embedded hypervisor
> > working group. Bear in mind that other hypervisors are already
> > implementing the documented ABI, so if you have concerns, you should
> > probably raise them with that audience...
> 
> We can not use sc with LV=1 because that would break the KVM in
> something else case which is KVM's strong point on PPC.

The current proposal involves the hypervisor specifying the hcall opcode
sequence in the device tree -- to allow either "sc 1" or "sc 0 plus
magic GPR" depending on whether you've got the hardware hypervisor
feature (hereafter HHV).

With HHV, "sc 0 plus magic GPR" just doesn't work, since it won't trap
to the hypervisor.  "sc 1 plus magic GPR" might be problematic on some
non-HHV implementations, especially if you *do* have HHV but the
non-HHV hypervisor is running as an HHV guest.

-Scott
Benjamin Herrenschmidt - July 3, 2010, 10:41 p.m.
On Fri, 2010-07-02 at 18:27 +0200, Segher Boessenkool wrote:
> > +To find out if we're running on KVM or not, we overlay the PVR  
> > register. Usually
> > +the PVR register contains an id that identifies your CPU type. If,  
> > however, you
> > +pass KVM_PVR_PARA in the register that you want the PVR result in,  
> > the register
> > +still contains KVM_PVR_PARA after the mfpvr call.
> > +
> > +	LOAD_REG_IMM(r5, KVM_PVR_PARA)
> > +	mfpvr	r5
> > +	[r5 still contains KVM_PVR_PARA]
> 
> I love this part :-)

Me not :-)

It should be in the device-tree instead, or something like that. Enough
games with PVR...

Ben.

> > +	__u64 scratch3;
> > +	__u64 critical;		/* Guest may not get interrupts if == r1 */
> > +	__u64 sprg0;
> > +	__u64 sprg1;
> > +	__u64 sprg2;
> > +	__u64 sprg3;
> > +	__u64 srr0;
> > +	__u64 srr1;
> > +	__u64 dar;
> > +	__u64 msr;
> > +	__u32 dsisr;
> > +	__u32 int_pending;	/* Tells the guest if we have an interrupt */
> > +};
> > +
> > +Additions to the page must only occur at the end. Struct fields  
> > are always 32
> > +bit aligned.
> 
> The u64s are 64-bit aligned, should they always be?
> 
> > +The "ld" and "std" instructions are transormed to "lwz" and "stw"  
> > instructions
> > +respectively on 32 bit systems with an added offset of 4 to  
> > accomodate for big
> > +endianness.
> 
> Will this add never overflow?  Is there anything that checks for it?
> 
> > +mtmsrd	rX, 0		b	<special mtmsr section>
> > +mtmsr			b	<special mtmsr section>
> 
> mtmsr rX
> 
> 
> Segher
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Benjamin Herrenschmidt - July 3, 2010, 10:42 p.m.
On Fri, 2010-07-02 at 20:41 +0200, Alexander Graf wrote:
> The u64s are 64-bit aligned, should they always be?
> 
> That's obvious, isn't it? And the ABI only specifies u64s to be 32 bit
> aligned, no? At least that's what ld and std specify.

No, the PowerPC ABI specifies u64's to be 64-bit aligned, even for
32-bit binaries.

Ben.

> > 
> >> +The "ld" and "std" instructions are transormed to "lwz" and "stw"
> instructions
> >> +respectively on 32 bit systems with an added offset of 4 to
> accomodate for big
> >> +endianness.
> > 
> > Will this add never overflow?  Is there anything that checks for it?
> 
> It basically means that to access dar, we either do
> 
> ld  rX, DAR(0)
> 
> or
> 
> lwz rX, DAR+4(0)
> 
> 
> > 
> >> +mtmsrd      rX, 0           b       <special mtmsr section>
> >> +mtmsr                       b       <special mtmsr section>
> > 
> > mtmsr rX
> 
> Nod.
Alexander Graf - July 4, 2010, 9:02 a.m.
On 02.07.2010, at 21:10, Scott Wood wrote:

> On Fri, 2 Jul 2010 20:47:44 +0200
> Alexander Graf <agraf@suse.de> wrote:
> 
>> 
>> On 02.07.2010, at 19:59, Hollis Blanchard wrote:
>> 
>>> [Resending...]
>>> 
>>> Please reconcile this with
>>> http://www.linux-kvm.org/page/PowerPC_Hypercall_ABI, which has been
>>> discussed in the (admittedly closed) Power.org embedded hypervisor
>>> working group. Bear in mind that other hypervisors are already
>>> implementing the documented ABI, so if you have concerns, you should
>>> probably raise them with that audience...
>> 
>> We can not use sc with LV=1 because that would break the KVM in
>> something else case which is KVM's strong point on PPC.
> 
> The current proposal involves the hypervisor specifying the hcall opcode
> sequence in the device tree -- to allow either "sc 1" or "sc 0 plus
> magic GPR" depending on whether you've got the hardware hypervisor
> feature (hereafter HHV).

Ah right, so you can still trap a hypercall with HHV. Makes sense.

> 
> With HHV, "sc 0 plus magic GPR" just doesn't work, since it won't trap
> to the hypervisor.  "sc 1 plus magic GPR" might be problematic on some
> non-HHV implementations, especially if you *do* have HHV but the
> non-HHV hypervisor is running as an HHV guest.

Yes, that's why I need sc 0 plus magic GPR in r0 and r3 - to accomodate for all the non-HHV cases. And it would be clever to have a way to expose the same functionality when we do use the HHV features.

So, is that draft available anywhere? The wiki page Hollis pointed to is very vague.


Alex
Alexander Graf - July 4, 2010, 9:04 a.m.
On 04.07.2010, at 00:41, Benjamin Herrenschmidt wrote:

> On Fri, 2010-07-02 at 18:27 +0200, Segher Boessenkool wrote:
>>> +To find out if we're running on KVM or not, we overlay the PVR  
>>> register. Usually
>>> +the PVR register contains an id that identifies your CPU type. If,  
>>> however, you
>>> +pass KVM_PVR_PARA in the register that you want the PVR result in,  
>>> the register
>>> +still contains KVM_PVR_PARA after the mfpvr call.
>>> +
>>> +	LOAD_REG_IMM(r5, KVM_PVR_PARA)
>>> +	mfpvr	r5
>>> +	[r5 still contains KVM_PVR_PARA]
>> 
>> I love this part :-)
> 
> Me not :-)
> 
> It should be in the device-tree instead, or something like that. Enough
> games with PVR...

My biggest concern about putting things in the device-tree is that I was trying to keep things as separate as possible. Why does the firmware have to know that it's running in KVM? Why do I have to patch 3 projects (Linux, OpenBIOS, Qemu) when I could go with patching a single one (Linux)?

Alex
Alexander Graf - July 4, 2010, 9:04 a.m.
On 04.07.2010, at 00:42, Benjamin Herrenschmidt wrote:

> On Fri, 2010-07-02 at 20:41 +0200, Alexander Graf wrote:
>> The u64s are 64-bit aligned, should they always be?
>> 
>> That's obvious, isn't it? And the ABI only specifies u64s to be 32 bit
>> aligned, no? At least that's what ld and std specify.
> 
> No, the PowerPC ABI specifies u64's to be 64-bit aligned, even for
> 32-bit binaries.

I see :).

Alex
Avi Kivity - July 4, 2010, 9:10 a.m.
On 07/04/2010 12:04 PM, Alexander Graf wrote:
>
> My biggest concern about putting things in the device-tree is that I was trying to keep things as separate as possible. Why does the firmware have to know that it's running in KVM?

It doesn't need to know about kvm, it needs to know that a particular 
hypercall protocol is available.

> Why do I have to patch 3 projects (Linux, OpenBIOS, Qemu) when I could go with patching a single one (Linux)?
>    

That's not a valid argument.  You patch as many projects as it takes to 
get it right (not that I have an opinion in this particular discussion).

At the very least you have to patch qemu for reasons described before 
(backwards compatible live migration).
Alexander Graf - July 4, 2010, 9:17 a.m.
On 04.07.2010, at 11:10, Avi Kivity wrote:

> On 07/04/2010 12:04 PM, Alexander Graf wrote:
>> 
>> My biggest concern about putting things in the device-tree is that I was trying to keep things as separate as possible. Why does the firmware have to know that it's running in KVM?
> 
> It doesn't need to know about kvm, it needs to know that a particular hypercall protocol is available.

Considering how the parts of the draft that I read about sound like, that's not the inventor's idea. PPC people love to see the BIOS be part of the virtualization solution. I don't. That's the biggest difference here and reason for us going different directions.

I think what they thought of is something like

if (in_kvm()) {
  device_tree_put("/hypervisor/exit", EXIT_TYPE_MAGIC);
  device_tree_put("/hypervisor/exit_magic", EXIT_MAGIC);
}

which then the OS reads out. But that's useless, as the hypercalls are hypervisor specific. So why make the detection on the Linux side generic?

> 
>> Why do I have to patch 3 projects (Linux, OpenBIOS, Qemu) when I could go with patching a single one (Linux)?
>>   
> 
> That's not a valid argument.  You patch as many projects as it takes to get it right (not that I have an opinion in this particular discussion).

If you can put code in Linux that touches 3 submaintainer's directories or one submaintainer's directory with both ending up being functionally equivalent, which way would you go?

> 
> At the very least you have to patch qemu for reasons described before (backwards compatible live migration).

There is no live migration on PPC (yet). That point is completely moot atm.


Alex
Alexander Graf - July 4, 2010, 9:30 a.m.
On 04.07.2010, at 11:17, Alexander Graf wrote:

> 
> On 04.07.2010, at 11:10, Avi Kivity wrote:
> 
>> On 07/04/2010 12:04 PM, Alexander Graf wrote:
>>> 
>>> My biggest concern about putting things in the device-tree is that I was trying to keep things as separate as possible. Why does the firmware have to know that it's running in KVM?
>> 
>> It doesn't need to know about kvm, it needs to know that a particular hypercall protocol is available.
> 
> Considering how the parts of the draft that I read about sound like, that's not the inventor's idea. PPC people love to see the BIOS be part of the virtualization solution. I don't. That's the biggest difference here and reason for us going different directions.
> 
> I think what they thought of is something like
> 
> if (in_kvm()) {
>  device_tree_put("/hypervisor/exit", EXIT_TYPE_MAGIC);
>  device_tree_put("/hypervisor/exit_magic", EXIT_MAGIC);
> }
> 
> which then the OS reads out. But that's useless, as the hypercalls are hypervisor specific. So why make the detection on the Linux side generic?

In fact, it's even worse. Right now with KVM for PPC we have 3 different ways of generating the device tree:

1) OpenBIOS (Mac emulation)
2) Qemu libfdt (BookE)
3) MOL OF implementation

So I'd have to touch even more projects. Just for the sake of splitting out something that belongs together anyway. And probably even create new interfaces just for that sake (qemu asking the kernel which type of hypercalls the vm should use) even though the guest could just query all that itself.

Alex
Avi Kivity - July 4, 2010, 9:37 a.m.
On 07/04/2010 12:17 PM, Alexander Graf wrote:
> On 04.07.2010, at 11:10, Avi Kivity wrote:
>
>    
>> On 07/04/2010 12:04 PM, Alexander Graf wrote:
>>      
>>> My biggest concern about putting things in the device-tree is that I was trying to keep things as separate as possible. Why does the firmware have to know that it's running in KVM?
>>>        
>> It doesn't need to know about kvm, it needs to know that a particular hypercall protocol is available.
>>      
> Considering how the parts of the draft that I read about sound like, that's not the inventor's idea. PPC people love to see the BIOS be part of the virtualization solution. I don't. That's the biggest difference here and reason for us going different directions.
>    

Regardless of which direction is "correct", you need to go in one direction.

> I think what they thought of is something like
>
> if (in_kvm()) {
>    device_tree_put("/hypervisor/exit", EXIT_TYPE_MAGIC);
>    device_tree_put("/hypervisor/exit_magic", EXIT_MAGIC);
> }
>
> which then the OS reads out. But that's useless, as the hypercalls are hypervisor specific. So why make the detection on the Linux side generic?
>    

Looks like the benefit is less magic in the detection code.  x86 has 
(more or less) standardized feature detection.  Is this an attempt to 
bring something similar to ppc-land?

>>> Why do I have to patch 3 projects (Linux, OpenBIOS, Qemu) when I could go with patching a single one (Linux)?
>>>
>>>        
>> That's not a valid argument.  You patch as many projects as it takes to get it right (not that I have an opinion in this particular discussion).
>>      
> If you can put code in Linux that touches 3 submaintainer's directories or one submaintainer's directory with both ending up being functionally equivalent, which way would you go?
>    

We would do the right thing.  Trivial examples include adding defines to 
include/asm/processor.h or include/asm/msr-index.h, more complicated 
ones are the topic for my talk in kvm forum 2010.

Yes, coordinating the acks and trees and merge windows is not as fun as 
coding.  Yes, it's even more difficult with separate trees.  No, that's 
not an excuse if we[1] determine that the right thing to do is the most 
complicated.

[1] "we" in this case are the powerpc Linux arch maintainers and/or 
whoever defines the hardware specification

>> At the very least you have to patch qemu for reasons described before (backwards compatible live migration).
>>      
> There is no live migration on PPC (yet). That point is completely moot atm.
>    

You still need to make that feature disableable from userspace.
Avi Kivity - July 4, 2010, 9:41 a.m.
On 07/04/2010 12:30 PM, Alexander Graf wrote:
>
>> Considering how the parts of the draft that I read about sound like, that's not the inventor's idea. PPC people love to see the BIOS be part of the virtualization solution. I don't. That's the biggest difference here and reason for us going different directions.
>>
>> I think what they thought of is something like
>>
>> if (in_kvm()) {
>>   device_tree_put("/hypervisor/exit", EXIT_TYPE_MAGIC);
>>   device_tree_put("/hypervisor/exit_magic", EXIT_MAGIC);
>> }
>>
>> which then the OS reads out. But that's useless, as the hypercalls are hypervisor specific. So why make the detection on the Linux side generic?
>>      
> In fact, it's even worse. Right now with KVM for PPC we have 3 different ways of generating the device tree:
>
> 1) OpenBIOS (Mac emulation)
> 2) Qemu libfdt (BookE)
> 3) MOL OF implementation
>    

I sympathize.  But, if the arch says that's how you do things, then 
that's how you do things.

> So I'd have to touch even more projects. Just for the sake of splitting out something that belongs together anyway. And probably even create new interfaces just for that sake (qemu asking the kernel which type of hypercalls the vm should use) even though the guest could just query all that itself.
>    

qemu needs to be involved, in case one day you support more than one 
type of hypercalls (like x86 does with hyper-v) or if you want to live 
migrate from a host that has hypercall support to another host that has 
this feature removed (as has already happened on x86 with the pvmmu).

Planning for the future means a lot of boring interfaces.
m j - July 9, 2010, 9:11 a.m.
On Thu, Jul 1, 2010 at 4:13 PM, Alexander Graf <agraf@suse.de> wrote:
> We just introduced a new PV interface that screams for documentation. So here
> it is - a shiny new and awesome text file describing the internal works of
> the PPC KVM paravirtual interface.
>
> Signed-off-by: Alexander Graf <agraf@suse.de>
>
> ---
>
> v1 -> v2:
>
>  - clarify guest implementation
>  - clarify that privileged instructions still work
>  - explain safe MSR bits
>  - Fix dsisr patch description
>  - change hypervisor calls to use new register values
> ---
>  Documentation/kvm/ppc-pv.txt |  185 ++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 185 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/kvm/ppc-pv.txt
>
> diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt
> new file mode 100644
> index 0000000..82de6c6
> --- /dev/null
> +++ b/Documentation/kvm/ppc-pv.txt
> @@ -0,0 +1,185 @@
> +The PPC KVM paravirtual interface
> +=================================
> +
> +The basic execution principle by which KVM on PowerPC works is to run all kernel
> +space code in PR=1 which is user space. This way we trap all privileged
> +instructions and can emulate them accordingly.
> +
> +Unfortunately that is also the downfall. There are quite some privileged
> +instructions that needlessly return us to the hypervisor even though they
> +could be handled differently.
> +
> +This is what the PPC PV interface helps with. It takes privileged instructions
> +and transforms them into unprivileged ones with some help from the hypervisor.
> +This cuts down virtualization costs by about 50% on some of my benchmarks.
> +
> +The code for that interface can be found in arch/powerpc/kernel/kvm*
> +
> +Querying for existence
> +======================
> +
> +To find out if we're running on KVM or not, we overlay the PVR register. Usually
> +the PVR register contains an id that identifies your CPU type. If, however, you
> +pass KVM_PVR_PARA in the register that you want the PVR result in, the register
> +still contains KVM_PVR_PARA after the mfpvr call.
> +
> +       LOAD_REG_IMM(r5, KVM_PVR_PARA)
> +       mfpvr   r5
> +       [r5 still contains KVM_PVR_PARA]
> +
> +Once determined to run under a PV capable KVM, you can now use hypercalls as
> +described below.
> +
> +PPC hypercalls
> +==============
> +
> +The only viable ways to reliably get from guest context to host context are:
> +
> +       1) Call an invalid instruction
> +       2) Call the "sc" instruction with a parameter to "sc"
> +       3) Call the "sc" instruction with parameters in GPRs
> +
> +Method 1 is always a bad idea. Invalid instructions can be replaced later on
> +by valid instructions, rendering the interface broken.
> +
> +Method 2 also has downfalls. If the parameter to "sc" is != 0 the spec is
> +rather unclear if the sc is targeted directly for the hypervisor or the
> +supervisor. It would also require that we read the syscall issuing instruction
> +every time a syscall is issued, slowing down guest syscalls.
> +
> +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R0 and
> +KVM_SC_MAGIC_R3) in r0 and r3 respectively. If a syscall instruction with these
> +magic values arrives from the guest's kernel mode, we take the syscall as a
> +hypercall.
> +
> +The parameters are as follows:
> +
> +       r0              KVM_SC_MAGIC_R0
> +       r3              KVM_SC_MAGIC_R3         Return code
> +       r4              Hypercall number
> +       r5              First parameter
> +       r6              Second parameter
> +       r7              Third parameter
> +       r8              Fourth parameter
> +
> +Hypercall definitions are shared in generic code, so the same hypercall numbers
> +apply for x86 and powerpc alike.
> +
> +The magic page
> +==============
> +
> +To enable communication between the hypervisor and guest there is a new shared
> +page that contains parts of supervisor visible register state. The guest can
> +map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
> +
> +With this hypercall issued the guest always gets the magic page mapped at the
> +desired location in effective and physical address space. For now, we always
> +map the page to -4096. This way we can access it using absolute load and store
> +functions. The following instruction reads the first field of the magic page:
> +
> +       ld      rX, -4096(0)
> +
> +The interface is designed to be extensible should there be need later to add
> +additional registers to the magic page. If you add fields to the magic page,
> +also define a new hypercall feature to indicate that the host can give you more
> +registers. Only if the host supports the additional features, make use of them.
> +
> +The magic page has the following layout as described in
> +arch/powerpc/include/asm/kvm_para.h:
> +
> +struct kvm_vcpu_arch_shared {
> +       __u64 scratch1;
> +       __u64 scratch2;
> +       __u64 scratch3;
> +       __u64 critical;         /* Guest may not get interrupts if == r1 */
> +       __u64 sprg0;
> +       __u64 sprg1;
> +       __u64 sprg2;
> +       __u64 sprg3;
> +       __u64 srr0;
> +       __u64 srr1;
> +       __u64 dar;
> +       __u64 msr;
> +       __u32 dsisr;
> +       __u32 int_pending;      /* Tells the guest if we have an interrupt */
> +};
> +
> +Additions to the page must only occur at the end. Struct fields are always 32
> +bit aligned.
> +
> +MSR bits
> +========
> +
> +The MSR contains bits that require hypervisor intervention and bits that do
> +not require direct hypervisor intervention because they only get interpreted
> +when entering the guest or don't have any impact on the hypervisor's behavior.
> +
> +The following bits are safe to be set inside the guest:
> +
> +  MSR_EE
> +  MSR_RI
> +  MSR_CR
> +  MSR_ME
> +
> +If any other bit changes in the MSR, please still use mtmsr(d).
> +
> +Patched instructions
> +====================
> +
> +The "ld" and "std" instructions are transormed to "lwz" and "stw" instructions
> +respectively on 32 bit systems with an added offset of 4 to accomodate for big
> +endianness.
> +
> +The following is a list of mapping the Linux kernel performs when running as
> +guest. Implementing any of those mappings is optional, as the instruction traps
> +also act on the shared page. So calling privileged instructions still works as
> +before.
> +
> +From                   To
> +====                   ==
> +
> +mfmsr  rX              ld      rX, magic_page->msr
> +mfsprg rX, 0           ld      rX, magic_page->sprg0
> +mfsprg rX, 1           ld      rX, magic_page->sprg1
> +mfsprg rX, 2           ld      rX, magic_page->sprg2
> +mfsprg rX, 3           ld      rX, magic_page->sprg3
> +mfsrr0 rX              ld      rX, magic_page->srr0
> +mfsrr1 rX              ld      rX, magic_page->srr1
> +mfdar  rX              ld      rX, magic_page->dar
> +mfdsisr        rX              lwz     rX, magic_page->dsisr
> +
> +mtmsr  rX              std     rX, magic_page->msr
> +mtsprg 0, rX           std     rX, magic_page->sprg0
> +mtsprg 1, rX           std     rX, magic_page->sprg1
> +mtsprg 2, rX           std     rX, magic_page->sprg2
> +mtsprg 3, rX           std     rX, magic_page->sprg3
> +mtsrr0 rX              std     rX, magic_page->srr0
> +mtsrr1 rX              std     rX, magic_page->srr1
> +mtdar  rX              std     rX, magic_page->dar
> +mtdsisr        rX              stw     rX, magic_page->dsisr
> +
> +tlbsync                        nop
> +
> +mtmsrd rX, 0           b       <special mtmsr section>
> +mtmsr                  b       <special mtmsr section>
> +
> +mtmsrd rX, 1           b       <special mtmsrd section>
> +
> +[BookE only]
> +wrteei [0|1]           b       <special wrteei section>
> +
> +
> +Some instructions require more logic to determine what's going on than a load
> +or store instruction can deliver. To enable patching of those, we keep some
> +RAM around where we can live translate instructions to. What happens is the
> +following:
> +
> +       1) copy emulation code to memory
> +       2) patch that code to fit the emulated instruction
> +       3) patch that code to return to the original pc + 4
> +       4) patch the original instruction to branch to the new code
> +
> +That way we can inject an arbitrary amount of code as replacement for a single
> +instruction. This allows us to check for pending interrupts when setting EE=1
> +for example.
> +

Which patch does this mapping ? Can you please point to that.


> --
> 1.6.0.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Alexander Graf - July 9, 2010, 9:15 a.m.
On 09.07.2010, at 11:11, MJ embd wrote:

> On Thu, Jul 1, 2010 at 4:13 PM, Alexander Graf <agraf@suse.de> wrote:
>> We just introduced a new PV interface that screams for documentation. So here
>> it is - a shiny new and awesome text file describing the internal works of
>> the PPC KVM paravirtual interface.
>> 
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>> 
>> +
>> +
>> +Some instructions require more logic to determine what's going on than a load
>> +or store instruction can deliver. To enable patching of those, we keep some
>> +RAM around where we can live translate instructions to. What happens is the
>> +following:
>> +
>> +       1) copy emulation code to memory
>> +       2) patch that code to fit the emulated instruction
>> +       3) patch that code to return to the original pc + 4
>> +       4) patch the original instruction to branch to the new code
>> +
>> +That way we can inject an arbitrary amount of code as replacement for a single
>> +instruction. This allows us to check for pending interrupts when setting EE=1
>> +for example.
>> +
> 
> Which patch does this mapping ? Can you please point to that.

The branch patching is in patch 22/27. For the respective users, see patch 23-26/27.


Alex

Patch

diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt
new file mode 100644
index 0000000..82de6c6
--- /dev/null
+++ b/Documentation/kvm/ppc-pv.txt
@@ -0,0 +1,185 @@ 
+The PPC KVM paravirtual interface
+=================================
+
+The basic execution principle by which KVM on PowerPC works is to run all kernel
+space code in PR=1 which is user space. This way we trap all privileged
+instructions and can emulate them accordingly.
+
+Unfortunately that is also the downfall. There are quite some privileged
+instructions that needlessly return us to the hypervisor even though they
+could be handled differently.
+
+This is what the PPC PV interface helps with. It takes privileged instructions
+and transforms them into unprivileged ones with some help from the hypervisor.
+This cuts down virtualization costs by about 50% on some of my benchmarks.
+
+The code for that interface can be found in arch/powerpc/kernel/kvm*
+
+Querying for existence
+======================
+
+To find out if we're running on KVM or not, we overlay the PVR register. Usually
+the PVR register contains an id that identifies your CPU type. If, however, you
+pass KVM_PVR_PARA in the register that you want the PVR result in, the register
+still contains KVM_PVR_PARA after the mfpvr call.
+
+	LOAD_REG_IMM(r5, KVM_PVR_PARA)
+	mfpvr	r5
+	[r5 still contains KVM_PVR_PARA]
+
+Once determined to run under a PV capable KVM, you can now use hypercalls as
+described below.
+
+PPC hypercalls
+==============
+
+The only viable ways to reliably get from guest context to host context are:
+
+	1) Call an invalid instruction
+	2) Call the "sc" instruction with a parameter to "sc"
+	3) Call the "sc" instruction with parameters in GPRs
+
+Method 1 is always a bad idea. Invalid instructions can be replaced later on
+by valid instructions, rendering the interface broken.
+
+Method 2 also has downfalls. If the parameter to "sc" is != 0 the spec is
+rather unclear if the sc is targeted directly for the hypervisor or the
+supervisor. It would also require that we read the syscall issuing instruction
+every time a syscall is issued, slowing down guest syscalls.
+
+Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R0 and
+KVM_SC_MAGIC_R3) in r0 and r3 respectively. If a syscall instruction with these
+magic values arrives from the guest's kernel mode, we take the syscall as a
+hypercall.
+
+The parameters are as follows:
+
+	r0		KVM_SC_MAGIC_R0
+	r3		KVM_SC_MAGIC_R3		Return code
+	r4		Hypercall number
+	r5		First parameter
+	r6		Second parameter
+	r7		Third parameter
+	r8		Fourth parameter
+
+Hypercall definitions are shared in generic code, so the same hypercall numbers
+apply for x86 and powerpc alike.
+
+The magic page
+==============
+
+To enable communication between the hypervisor and guest there is a new shared
+page that contains parts of supervisor visible register state. The guest can
+map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
+
+With this hypercall issued the guest always gets the magic page mapped at the
+desired location in effective and physical address space. For now, we always
+map the page to -4096. This way we can access it using absolute load and store
+functions. The following instruction reads the first field of the magic page:
+
+	ld	rX, -4096(0)
+
+The interface is designed to be extensible should there be need later to add
+additional registers to the magic page. If you add fields to the magic page,
+also define a new hypercall feature to indicate that the host can give you more
+registers. Only if the host supports the additional features, make use of them.
+
+The magic page has the following layout as described in
+arch/powerpc/include/asm/kvm_para.h:
+
+struct kvm_vcpu_arch_shared {
+	__u64 scratch1;
+	__u64 scratch2;
+	__u64 scratch3;
+	__u64 critical;		/* Guest may not get interrupts if == r1 */
+	__u64 sprg0;
+	__u64 sprg1;
+	__u64 sprg2;
+	__u64 sprg3;
+	__u64 srr0;
+	__u64 srr1;
+	__u64 dar;
+	__u64 msr;
+	__u32 dsisr;
+	__u32 int_pending;	/* Tells the guest if we have an interrupt */
+};
+
+Additions to the page must only occur at the end. Struct fields are always 32
+bit aligned.
+
+MSR bits
+========
+
+The MSR contains bits that require hypervisor intervention and bits that do
+not require direct hypervisor intervention because they only get interpreted
+when entering the guest or don't have any impact on the hypervisor's behavior.
+
+The following bits are safe to be set inside the guest:
+
+  MSR_EE
+  MSR_RI
+  MSR_CR
+  MSR_ME
+
+If any other bit changes in the MSR, please still use mtmsr(d).
+
+Patched instructions
+====================
+
+The "ld" and "std" instructions are transormed to "lwz" and "stw" instructions
+respectively on 32 bit systems with an added offset of 4 to accomodate for big
+endianness.
+
+The following is a list of mapping the Linux kernel performs when running as
+guest. Implementing any of those mappings is optional, as the instruction traps
+also act on the shared page. So calling privileged instructions still works as
+before.
+
+From			To
+====			==
+
+mfmsr	rX		ld	rX, magic_page->msr
+mfsprg	rX, 0		ld	rX, magic_page->sprg0
+mfsprg	rX, 1		ld	rX, magic_page->sprg1
+mfsprg	rX, 2		ld	rX, magic_page->sprg2
+mfsprg	rX, 3		ld	rX, magic_page->sprg3
+mfsrr0	rX		ld	rX, magic_page->srr0
+mfsrr1	rX		ld	rX, magic_page->srr1
+mfdar	rX		ld	rX, magic_page->dar
+mfdsisr	rX		lwz	rX, magic_page->dsisr
+
+mtmsr	rX		std	rX, magic_page->msr
+mtsprg	0, rX		std	rX, magic_page->sprg0
+mtsprg	1, rX		std	rX, magic_page->sprg1
+mtsprg	2, rX		std	rX, magic_page->sprg2
+mtsprg	3, rX		std	rX, magic_page->sprg3
+mtsrr0	rX		std	rX, magic_page->srr0
+mtsrr1	rX		std	rX, magic_page->srr1
+mtdar	rX		std	rX, magic_page->dar
+mtdsisr	rX		stw	rX, magic_page->dsisr
+
+tlbsync			nop
+
+mtmsrd	rX, 0		b	<special mtmsr section>
+mtmsr			b	<special mtmsr section>
+
+mtmsrd	rX, 1		b	<special mtmsrd section>
+
+[BookE only]
+wrteei	[0|1]		b	<special wrteei section>
+
+
+Some instructions require more logic to determine what's going on than a load
+or store instruction can deliver. To enable patching of those, we keep some
+RAM around where we can live translate instructions to. What happens is the
+following:
+
+	1) copy emulation code to memory
+	2) patch that code to fit the emulated instruction
+	3) patch that code to return to the original pc + 4
+	4) patch the original instruction to branch to the new code
+
+That way we can inject an arbitrary amount of code as replacement for a single
+instruction. This allows us to check for pending interrupts when setting EE=1
+for example.
+