Patchwork [26/26] KVM: PPC: Add Documentation about PV interface

Submitter Alexander Graf
Date June 25, 2010, 11:25 p.m.
Message ID <1277508314-915-27-git-send-email-agraf@suse.de>
Permalink /patch/57028/
State Not Applicable

Comments

Alexander Graf - June 25, 2010, 11:25 p.m.
We just introduced a new PV interface that screams for documentation. So here
it is - a shiny new and awesome text file describing the internal works of
the PPC KVM paravirtual interface.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 Documentation/kvm/ppc-pv.txt |  164 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 164 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/kvm/ppc-pv.txt
Avi Kivity - June 27, 2010, 8:14 a.m.
On 06/26/2010 02:25 AM, Alexander Graf wrote:
> We just introduced a new PV interface that screams for documentation. So here
> it is - a shiny new and awesome text file describing the internal works of
> the PPC KVM paravirtual interface.
>    

Good, that lets people who have no idea what they're talking about 
participate in the review.

> +
> +PPC hypercalls
> +==============
> +
> +The only viable ways to reliably get from guest context to host context are:
> +
> +	1) Call an invalid instruction
> +	2) Call the "sc" instruction with a parameter to "sc"
> +	3) Call the "sc" instruction with parameters in GPRs
> +
> +Method 1 is always a bad idea. Invalid instructions can be replaced later on
> +by valid instructions, rendering the interface broken.
> +
> +Method 2 also has downfalls. If the parameter to "sc" is != 0 the spec is
> +rather unclear if the sc is targeted directly for the hypervisor or the
> +supervisor. It would also require that we read the syscall issuing instruction
> +every time a syscall is issued, slowing down guest syscalls.
> +
> +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R3 and
> +KVM_SC_MAGIC_R4) in r3 and r4 respectively. If a syscall instruction with these
> +magic values arrives from the guest's kernel mode, we take the syscall as a
> +hypercall.
>    

Is there any chance a normal syscall will have those values in r3 and r4?

If so, maybe it's better to use pc as the key for hypercalls.  Let the 
guest designate one instruction address as the hypercall call point; kvm 
can easily check it and reflect it back to the guest if it doesn't match.

Is it valid and useful to issue sc from privileged mode anyway, except 
for calling the hypervisor?

> +
> +The parameters are as follows:
> +
> +	r3		KVM_SC_MAGIC_R3
> +	r4		KVM_SC_MAGIC_R4
> +	r5		Hypercall number
> +	r6		First parameter
> +	r7		Second parameter
> +	r8		Third parameter
> +	r9		Fourth parameter
> +
> +Hypercall definitions are shared in generic code, so the same hypercall numbers
> +apply for x86 and powerpc alike.
>    

Addresses passed in hypercall parameters are guest physical addresses.

Do you have >32 bit physical addresses on 32-bit guests?  If so, you'll 
need to pass physical addresses in two registers.

> +
> +The magic page
> +==============
> +
> +To enable communication between the hypervisor and guest there is a new shared
> +page that contains parts of supervisor visible register state. The guest can
> +map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
> +
> +With this hypercall issued the guest always gets the magic page mapped at the
> +desired location in effective and physical address space. For now, we always
> +map the page to -4096. This way we can access it using absolute load and store
> +functions. The following instruction reads the first field of the magic page:
> +
> +	ld	rX, -4096(0)
>    

Is the address guest controlled or host controlled?

> +
> +The interface is designed to be extensible should there be need later to add
> +additional registers to the magic page. If you add fields to the magic page,
> +also define a new hypercall feature to indicate that the host can give you more
> +registers. Only if the host supports the additional features, make use of them.
> +
> +The magic page has the following layout as described in
> +arch/powerpc/include/asm/kvm_para.h:
> +
> +struct kvm_vcpu_arch_shared {
> +	__u64 scratch1;
> +	__u64 scratch2;
> +	__u64 scratch3;
> +	__u64 critical;		/* Guest may not get interrupts if == r1 */
>    

Elaborate?

> +	__u64 sprg0;
> +	__u64 sprg1;
> +	__u64 sprg2;
> +	__u64 sprg3;
> +	__u64 srr0;
> +	__u64 srr1;
> +	__u64 dar;
> +	__u64 msr;
> +	__u32 dsisr;
> +	__u32 int_pending;	/* Tells the guest if we have an interrupt */
> +};
> +
> +Additions to the page must only occur at the end. Struct fields are always 32
> +bit aligned.
> +
> +Patched instructions
> +====================
> +
> +The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
> +respectively on 32 bit systems with an added offset of 4 to accommodate for big
> +endianness.
>    

Who does the patching? guest or host?

> +
> +From			To
> +====			==
> +
> +mfmsr	rX		ld	rX, magic_page->msr
> +mfsprg	rX, 0		ld	rX, magic_page->sprg0
> +mfsprg	rX, 1		ld	rX, magic_page->sprg1
> +mfsprg	rX, 2		ld	rX, magic_page->sprg2
> +mfsprg	rX, 3		ld	rX, magic_page->sprg3
> +mfsrr0	rX		ld	rX, magic_page->srr0
> +mfsrr1	rX		ld	rX, magic_page->srr1
> +mfdar	rX		ld	rX, magic_page->dar
> +mfdsisr	rX		ld	rX, magic_page->dsisr
> +
> +mtmsr	rX		std	rX, magic_page->msr
> +mtsprg	0, rX		std	rX, magic_page->sprg0
> +mtsprg	1, rX		std	rX, magic_page->sprg1
> +mtsprg	2, rX		std	rX, magic_page->sprg2
> +mtsprg	3, rX		std	rX, magic_page->sprg3
> +mtsrr0	rX		std	rX, magic_page->srr0
> +mtsrr1	rX		std	rX, magic_page->srr1
> +mtdar	rX		std	rX, magic_page->dar
> +mtdsisr	rX		std	rX, magic_page->dsisr
> +
> +tlbsync			nop
> +
> +mtmsrd	rX, 0		b	<special mtmsr section>
> +mtmsr			b	<special mtmsr section>
> +
> +mtmsrd	rX, 1		b	<special mtmsrd section>
> +
> +[BookE only]
> +wrteei	[0|1]		b	<special wrteei section>
>    

Probably the guest, as only it can arrange for special * sections.  Good.

> +
> +Some instructions require more logic to determine what's going on than a load
> +or store instruction can deliver. To enable patching of those, we keep some
> +RAM around where we can live translate instructions to. What happens is the
> +following:
> +
> +	1) copy emulation code to memory
> +	2) patch that code to fit the emulated instruction
> +	3) patch that code to return to the original pc + 4
> +	4) patch the original instruction to branch to the new code
> +
> +That way we can inject an arbitrary amount of code as replacement for a single
> +instruction. This allows us to check for pending interrupts when setting EE=1
> +for example.
> +
>    

Or not.

What about transitions from paravirt to non-paravirt?  For example, a 
system reset.
Avi Kivity - June 27, 2010, 8:34 a.m.
On 06/26/2010 02:25 AM, Alexander Graf wrote:
> We just introduced a new PV interface that screams for documentation. So here
> it is - a shiny new and awesome text file describing the internal works of
> the PPC KVM paravirtual interface.
>
>
> +Querying for existence
> +======================
> +
> +To find out if we're running on KVM or not, we overlay the PVR register. Usually
> +the PVR register contains an id that identifies your CPU type. If, however, you
> +pass KVM_PVR_PARA in the register that you want the PVR result in, the register
> +still contains KVM_PVR_PARA after the mfpvr call.
> +
> +	LOAD_REG_IMM(r5, KVM_PVR_PARA)
> +	mfpvr	r5
> +	[r5 still contains KVM_PVR_PARA]
> +
> +Once determined to run under a PV capable KVM, you can now use hypercalls as
> +described below.
>    

On x86 we allow host userspace to determine whether the guest sees the 
paravirt interface (and what features are exposed).  This allows you to 
live migrate from a newer host to an older host, by not exposing the 
newer features.
Alexander Graf - June 27, 2010, 9:33 a.m.
Am 27.06.2010 um 10:14 schrieb Avi Kivity <avi@redhat.com>:

> On 06/26/2010 02:25 AM, Alexander Graf wrote:
>> We just introduced a new PV interface that screams for  
>> documentation. So here
>> it is - a shiny new and awesome text file describing the internal  
>> works of
>> the PPC KVM paravirtual interface.
>>
>
> Good, that lets people who have no idea what they're talking about  
> participate in the review.

Heh, I knew you'd like this :).

>
>> +
>> +PPC hypercalls
>> +==============
>> +
>> +The only viable ways to reliably get from guest context to host  
>> context are:
>> +
>> +    1) Call an invalid instruction
>> +    2) Call the "sc" instruction with a parameter to "sc"
>> +    3) Call the "sc" instruction with parameters in GPRs
>> +
>> +Method 1 is always a bad idea. Invalid instructions can be  
>> replaced later on
>> +by valid instructions, rendering the interface broken.
>> +
>> +Method 2 also has downfalls. If the parameter to "sc" is != 0 the  
>> spec is
>> +rather unclear if the sc is targeted directly for the hypervisor  
>> or the
>> +supervisor. It would also require that we read the syscall issuing  
>> instruction
>> +every time a syscall is issued, slowing down guest syscalls.
>> +
>> +Method 3 is what KVM uses. We pass magic constants  
>> (KVM_SC_MAGIC_R3 and
>> +KVM_SC_MAGIC_R4) in r3 and r4 respectively. If a syscall  
>> instruction with these
>> +magic values arrives from the guest's kernel mode, we take the  
>> syscall as a
>> +hypercall.
>>
>
> Is there any chance a normal syscall will have those values in r3  
> and r4?

r3 is the syscall number. So as long as the guest doesn't reuse that  
value, we're safe. Since in general syscall numbers are not randomly  
scattered throughout the number range, we should be ok here.

>
> If so, maybe it's better to use pc as the key for hypercalls.  Let  
> the guest designate one instruction address as the hypercall call  
> point; kvm can easily check it and reflect it back to the guest if  
> it doesn't match.
>

You mean the guest would tell the hv where the hypercall lies? That  
would require a hypercall, no? Defining it statically is tricky. I  
want to PV'nize osx using a kernel module later, so I don't have  
control over the physical layout.

> Is it valid and useful to issue sc from privileged mode anyway,  
> except for calling the hypervisor?

Same as a syscall on x86 really. The kernel can and does issue  
syscalls within itself.

>
>> +
>> +The parameters are as follows:
>> +
>> +    r3        KVM_SC_MAGIC_R3
>> +    r4        KVM_SC_MAGIC_R4
>> +    r5        Hypercall number
>> +    r6        First parameter
>> +    r7        Second parameter
>> +    r8        Third parameter
>> +    r9        Fourth parameter
>> +
>> +Hypercall definitions are shared in generic code, so the same  
>> hypercall numbers
>> +apply for x86 and powerpc alike.
>>
>
> Addresses passed in hypercall parameters are guest physical addresses.
>
> Do you have >32 bit physical addresses on 32-bit guests?  if so,  
> you'll need to pass physical addresses in two registers.

I think theoretically it's possible. Will we ever support it?  
Doubtful. Do we need to pass high memory addresses to the hv? Even  
more doubtful.

If we hit such a case, I'd just disable the hypercall for 32 bit. Or  
define param1 and param2 to contain the address if the guest is in  
32-bit mode. No need to always make all params 64 bit imho.
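
A minimal sketch of that two-register variant, assuming the upper half goes
in the first parameter and the lower half in the second. The helper names
and the r6/r7 assignment are made up for illustration and are not part of
the patch:

```c
#include <stdint.h>

/* Hypothetical helper type: one guest physical address carried in two
 * 32-bit hypercall parameters. */
struct hc_addr_pair {
	uint32_t hi;	/* upper 32 bits, e.g. first parameter (r6) */
	uint32_t lo;	/* lower 32 bits, e.g. second parameter (r7) */
};

/* Guest side: split a 64-bit guest physical address. */
static struct hc_addr_pair hc_split_addr(uint64_t gpa)
{
	struct hc_addr_pair p = {
		.hi = (uint32_t)(gpa >> 32),
		.lo = (uint32_t)(gpa & 0xffffffffu),
	};
	return p;
}

/* Host side: rebuild the address from the two parameters. */
static uint64_t hc_join_addr(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}
```

A 64-bit guest would keep passing the whole address in one register, so only
the 32-bit path pays for the extra parameter.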

>
>> +
>> +The magic page
>> +==============
>> +
>> +To enable communication between the hypervisor and guest there is  
>> a new shared
>> +page that contains parts of supervisor visible register state. The  
>> guest can
>> +map this shared page using the KVM hypercall  
>> KVM_HC_PPC_MAP_MAGIC_PAGE.
>> +
>> +With this hypercall issued the guest always gets the magic page  
>> mapped at the
>> +desired location in effective and physical address space. For now,  
>> we always
>> +map the page to -4096. This way we can access it using absolute  
>> load and store
>> +functions. The following instruction reads the first field of the  
>> magic page:
>> +
>> +    ld    rX, -4096(0)
>>
>
> Is the address guest controlled or host controlled?

Guest controlled. It's passed in to the map_magic_page hypercall.

>
>> +
>> +The interface is designed to be extensible should there be need  
>> later to add
>> +additional registers to the magic page. If you add fields to the  
>> magic page,
>> +also define a new hypercall feature to indicate that the host can  
>> give you more
>> +registers. Only if the host supports the additional features, make  
>> use of them.
>> +
>> +The magic page has the following layout as described in
>> +arch/powerpc/include/asm/kvm_para.h:
>> +
>> +struct kvm_vcpu_arch_shared {
>> +    __u64 scratch1;
>> +    __u64 scratch2;
>> +    __u64 scratch3;
>> +    __u64 critical;        /* Guest may not get interrupts if ==  
>> r1 */
>>
>
> Elaborate?

I think I have a description in the respective patch. Probably a good  
idea to add it to the documentation.

>
>> +    __u64 sprg0;
>> +    __u64 sprg1;
>> +    __u64 sprg2;
>> +    __u64 sprg3;
>> +    __u64 srr0;
>> +    __u64 srr1;
>> +    __u64 dar;
>> +    __u64 msr;
>> +    __u32 dsisr;
>> +    __u32 int_pending;    /* Tells the guest if we have an  
>> interrupt */
>> +};
>> +
>> +Additions to the page must only occur at the end. Struct fields  
>> are always 32
>> +bit aligned.
>> +
>> +Patched instructions
>> +====================
>> +
>> +The "ld" and "std" instructions are transformed to "lwz" and "stw"  
>> instructions
>> +respectively on 32 bit systems with an added offset of 4 to  
>> accommodate for big
>> +endianness.
>>
>
> Who does the patching? guest or host?

All patching is done by the guest. Probably worth mentioning, yeah.
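
For reference, a rough sketch of the displacement arithmetic the guest
patcher would need. The struct layout is the one quoted above, written with
fixed-width C types. The helper names are hypothetical, and leaving the
32-bit dsisr field without the extra offset is an assumption, since the text
only states the +4 rule in general:

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as quoted in the documentation patch. */
struct kvm_vcpu_arch_shared {
	uint64_t scratch1;
	uint64_t scratch2;
	uint64_t scratch3;
	uint64_t critical;
	uint64_t sprg0;
	uint64_t sprg1;
	uint64_t sprg2;
	uint64_t sprg3;
	uint64_t srr0;
	uint64_t srr1;
	uint64_t dar;
	uint64_t msr;
	uint32_t dsisr;
	uint32_t int_pending;
};

/* The page is mapped at -4096, so fields are reachable as disp(0). */
#define MAGIC_PAGE_BASE (-4096)

/* Displacement for "ld/std rX, disp(0)" on a 64-bit guest. */
static int32_t magic_disp64(size_t field_off)
{
	return MAGIC_PAGE_BASE + (int32_t)field_off;
}

/* Displacement for "lwz/stw rX, disp(0)" on a 32-bit guest. On big
 * endian the low word of a 64-bit field sits 4 bytes in. Assumption:
 * fields that are already 32 bits wide need no adjustment. */
static int32_t magic_disp32(size_t field_off, size_t field_size)
{
	int32_t disp = MAGIC_PAGE_BASE + (int32_t)field_off;

	if (field_size == 8)
		disp += 4;
	return disp;
}
```

With this layout msr sits at byte 88 of the page, so a 64-bit guest would
use a displacement of -4008 and a 32-bit guest -4004.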

>
>> +
>> +From            To
>> +====            ==
>> +
>> +mfmsr    rX        ld    rX, magic_page->msr
>> +mfsprg    rX, 0        ld    rX, magic_page->sprg0
>> +mfsprg    rX, 1        ld    rX, magic_page->sprg1
>> +mfsprg    rX, 2        ld    rX, magic_page->sprg2
>> +mfsprg    rX, 3        ld    rX, magic_page->sprg3
>> +mfsrr0    rX        ld    rX, magic_page->srr0
>> +mfsrr1    rX        ld    rX, magic_page->srr1
>> +mfdar    rX        ld    rX, magic_page->dar
>> +mfdsisr    rX        ld    rX, magic_page->dsisr
>> +
>> +mtmsr    rX        std    rX, magic_page->msr
>> +mtsprg    0, rX        std    rX, magic_page->sprg0
>> +mtsprg    1, rX        std    rX, magic_page->sprg1
>> +mtsprg    2, rX        std    rX, magic_page->sprg2
>> +mtsprg    3, rX        std    rX, magic_page->sprg3
>> +mtsrr0    rX        std    rX, magic_page->srr0
>> +mtsrr1    rX        std    rX, magic_page->srr1
>> +mtdar    rX        std    rX, magic_page->dar
>> +mtdsisr    rX        std    rX, magic_page->dsisr
>> +
>> +tlbsync            nop
>> +
>> +mtmsrd    rX, 0        b    <special mtmsr section>
>> +mtmsr            b    <special mtmsr section>
>> +
>> +mtmsrd    rX, 1        b    <special mtmsrd section>
>> +
>> +[BookE only]
>> +wrteei    [0|1]        b    <special wrteei section>
>>
>
> Probably the guest, as only it can arrange for special * sections.   
> Good.
>
>> +
>> +Some instructions require more logic to determine what's going on  
>> than a load
>> +or store instruction can deliver. To enable patching of those, we  
>> keep some
>> +RAM around where we can live translate instructions to. What  
>> happens is the
>> +following:
>> +
>> +    1) copy emulation code to memory
>> +    2) patch that code to fit the emulated instruction
>> +    3) patch that code to return to the original pc + 4
>> +    4) patch the original instruction to branch to the new code
>> +
>> +That way we can inject an arbitrary amount of code as replacement  
>> for a single
>> +instruction. This allows us to check for pending interrupts when  
>> setting EE=1
>> +for example.
>> +
>>
>
> Or not.
>
> What about transitions from paravirt to non-paravirt?  For example,  
> a system reset.

That ... eh ... good question. It would leave the map pending, but  
everything still continues working.

I don't really know in kvm when a reset occurred. So we have to make  
qemu set the map to 0 on reset. Let's add that when we add migration  
support and actually expose all those missing states to userspace.  
Currently we only expose half the necessary state for migration  
anyway :).


Alex
Alexander Graf - June 27, 2010, 9:49 a.m.
Am 27.06.2010 um 10:34 schrieb Avi Kivity <avi@redhat.com>:

> On 06/26/2010 02:25 AM, Alexander Graf wrote:
>> We just introduced a new PV interface that screams for  
>> documentation. So here
>> it is - a shiny new and awesome text file describing the internal  
>> works of
>> the PPC KVM paravirtual interface.
>>
>>
>> +Querying for existence
>> +======================
>> +
>> +To find out if we're running on KVM or not, we overlay the PVR  
>> register. Usually
>> +the PVR register contains an id that identifies your CPU type. If,  
>> however, you
>> +pass KVM_PVR_PARA in the register that you want the PVR result in,  
>> the register
>> +still contains KVM_PVR_PARA after the mfpvr call.
>> +
>> +    LOAD_REG_IMM(r5, KVM_PVR_PARA)
>> +    mfpvr    r5
>> +    [r5 still contains KVM_PVR_PARA]
>> +
>> +Once determined to run under a PV capable KVM, you can now use  
>> hypercalls as
>> +described below.
>>
>
> On x86 we allow host userspace to determine whether the guest sees  
> the paravirt interface (and what features are exposed).  This allows  
> you to live migrate from a newer host to an older host, by not  
> exposing the newer features.

A very good idea indeed. Let's postpone that to when we expose enough  
state to make live migration possible.

Alex
Milton Miller - June 28, 2010, 7:18 a.m.
On Sun Jun 27 around 19:33:52 EST 2010 Alexander Graf wrote:
> Am 27.06.2010 um 10:14 schrieb Avi Kivity <avi at redhat.com>:
> > On 06/26/2010 02:25 AM, Alexander Graf wrote:

> > > +
> > > +PPC hypercalls
> > > +==============
> > > +
> > > +The only viable ways to reliably get from guest context to host  
> > > context are:
> > > +
> > > +    1) Call an invalid instruction
> > > +    2) Call the "sc" instruction with a parameter to "sc"
> > > +    3) Call the "sc" instruction with parameters in GPRs
> > > +
> > > +Method 1 is always a bad idea. Invalid instructions can be  
> > > replaced later on
> > > +by valid instructions, rendering the interface broken.
> > > +
> > > +Method 2 also has downfalls. If the parameter to "sc" is != 0 the  
> > > spec is
> > > +rather unclear if the sc is targeted directly for the hypervisor  
> > > or the
> > > +supervisor. It would also require that we read the syscall issuing  
> > > instruction
> > > +every time a syscall is issued, slowing down guest syscalls.
> > > +

It goes to the hypervisor, and it would require the hypervisor to
return to the supervisor, but I believe it just returns to the user with
permission denied.

> > > +Method 3 is what KVM uses. We pass magic constants  
> > > (KVM_SC_MAGIC_R3 and
> > > +KVM_SC_MAGIC_R4) in r3 and r4 respectively. If a syscall  
> > > instruction with these
> > > +magic values arrives from the guest's kernel mode, we take the  
> > > syscall as a
> > > +hypercall.
> > >
> >
> > Is there any chance a normal syscall will have those values in r3  
> > and r4?
> 
> r3 is the syscall number. So as long as the guest doesn't reuse that  
> value, we're safe. Since in general syscall numbers are not randomly  
> scattered throughout the number range, we should be ok here.
> 

No, r0 has the system call number.  Registers 3 and 4 are the first
2 args in c abi (or first 64 bit arg in 32 bit c abi), but the linux
syscall abi is special.  (In addition, it returns success or failure in
cr0).
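
To make the aliasing concern concrete, here is a sketch of the check as the
documentation describes it. The magic values are placeholders and not the
ones from the patch, and struct guest_regs is a made-up stand-in for the
vcpu state:

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder constants for illustration only, NOT the real
 * KVM_SC_MAGIC_R3 / KVM_SC_MAGIC_R4 values. */
#define EXAMPLE_SC_MAGIC_R3 0x11111111u
#define EXAMPLE_SC_MAGIC_R4 0x22222222u

/* Hypothetical view of the guest state at the time of the sc trap. */
struct guest_regs {
	uint64_t gpr[32];
	bool pr;	/* guest-visible MSR[PR]: true means guest user mode */
};

/* Treat an "sc" as a hypercall only when it comes from guest kernel
 * mode and r3/r4 hold the magic values. r0 is not looked at. */
static bool sc_is_hypercall(const struct guest_regs *regs)
{
	if (regs->pr)
		return false;
	return (uint32_t)regs->gpr[3] == EXAMPLE_SC_MAGIC_R3 &&
	       (uint32_t)regs->gpr[4] == EXAMPLE_SC_MAGIC_R4;
}
```

Because r0 is ignored, a guest kernel sc whose first two arguments happen to
match the magic values would be taken as a hypercall, whatever syscall
number it carries.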

> >
> > If so, maybe it's better to use pc as the key for hypercalls.  Let  
> > the guest designate one instruction address as the hypercall call  
> > point; kvm can easily check it and reflect it back to the guest if  
> > it doesn't match.
> >
> 
> You mean the guest would tell the hv where the hypercall lies? That  
> would require a hypercall, no? Defining it statically is tricky. I  
> want to PV'nize osx using a kernel module later, so I don't have  
> control over the physical layout.
> 
> > Is it valid and useful to issue sc from privileged mode anyway,  
> > except for calling the hypervisor?
> 
> Same as a syscall on x86 really. The kernel can and does issue  
> syscalls within itself.
> 
> 

I don't believe we support the kernel actually doing a syscall to itself
anymore, at least on powerpc.  The callers call the underlying system
call function, or kernel_thread.

That said, I would suggest we allocate a syscall number for this, as it
would document the usage.  (In addition to 0..nr_syscalls - 1 we have
0x1ebe in use).

Also, is there any desire to nest such emulation?

milton
Alexander Graf - June 28, 2010, 7:49 a.m.
On 28.06.2010, at 09:18, Milton Miller wrote:

> On Sun Jun 27 around 19:33:52 EST 2010 Alexander Graf wrote:
>> Am 27.06.2010 um 10:14 schrieb Avi Kivity <avi at redhat.com>:
>>> On 06/26/2010 02:25 AM, Alexander Graf wrote:
> 
>>>> +
>>>> +PPC hypercalls
>>>> +==============
>>>> +
>>>> +The only viable ways to reliably get from guest context to host  
>>>> context are:
>>>> +
>>>> +    1) Call an invalid instruction
>>>> +    2) Call the "sc" instruction with a parameter to "sc"
>>>> +    3) Call the "sc" instruction with parameters in GPRs
>>>> +
>>>> +Method 1 is always a bad idea. Invalid instructions can be  
>>>> replaced later on
>>>> +by valid instructions, rendering the interface broken.
>>>> +
>>>> +Method 2 also has downfalls. If the parameter to "sc" is != 0 the  
>>>> spec is
>>>> +rather unclear if the sc is targeted directly for the hypervisor  
>>>> or the
>>>> +supervisor. It would also require that we read the syscall issuing  
>>>> instruction
>>>> +every time a syscall is issued, slowing down guest syscalls.
>>>> +
> 
> It goes to the hypervisor, and it would require the hypervisor to
> return to the supervisor, but I believe it just returns to the user with
> permission denied.

That's what I assumed, yeah :(.

> 
>>>> +Method 3 is what KVM uses. We pass magic constants  
>>>> (KVM_SC_MAGIC_R3 and
>>>> +KVM_SC_MAGIC_R4) in r3 and r4 respectively. If a syscall  
>>>> instruction with these
>>>> +magic values arrives from the guest's kernel mode, we take the  
>>>> syscall as a
>>>> +hypercall.
>>>> 
>>> 
>>> Is there any chance a normal syscall will have those values in r3  
>>> and r4?
>> 
>> r3 is the syscall number. So as long as the guest doesn't reuse that  
>> value, we're safe. Since in general syscall numbers are not randomly  
>> scattered throughout the number range, we should be ok here.
>> 
> 
> No, r0 has the system call number.  Registers 3 and 4 are the first
> 2 args in c abi (or first 64 bit arg in 32 bit c abi), but the linux
> syscall abi special.  (In addition, it returns success or failure in
> cr0).

Oh. Ahem :)

> 
>>> 
>>> If so, maybe it's better to use pc as they key for hypercalls.  Let  
>>> the guest designate one instruction address as the hypercall call  
>>> point; kvm can easily check it and reflect it back to the guest if  
>>> it doesn't match.
>>> 
>> 
>> You mean the guest would tell the hv where the hypercall lies? That  
>> would require a hypercall, no? Defining it statically is tricky. I  
>> want to PV'nize osx using a kernel module later, so I don't have  
>> control over the physical layout.
>> 
>>> Is it valid and useful to issue sc from privileged mode anyway,  
>>> except for calling the hypervisor?
>> 
>> Same as a syscall on x86 really. The kernel can and does issue  
>> syscalls within itself.
>> 
>> 
> 
> I don't believe we support the kernel actually doing a syscall to itself
> anymore, at least on powerpc.  The callers call the underlying system
> call function, or kernel_thread.
> 
> That said, I would suggest we allocate a syscall number for this, as it
> would document the usage.  (In additon to 0..nr_syscalls - 1 we have
> 0x1ebe in use).

That's actually a pretty good idea.

> 
> Also, is there any desire to nest such emulation?

Nesting should just work, right? Since we only accept hypercalls from PR=0 and guests run in PR=1, we get the sc interrupt in the l1 guest by then.

The only issue I'm aware of that completely breaks when using nested KVM on PPC is the MSR_IR != MSR_DR logic. We fetch the instruction we got an interrupt on for certain interrupts in the world switch handler by keeping MSR_IR=0, but setting MSR_DR=1. And KVM speeds up MSR_DR != MSR_IR by mapping both of them lazily in a special address space. So if you access the same page as instruction and as data, you get an invalid result.

Alex
Avi Kivity - June 28, 2010, 8:13 a.m.
On 06/28/2010 10:49 AM, Alexander Graf wrote:
>
>> I don't believe we support the kernel actually doing a syscall to itself
>> anymore, at least on powerpc.  The callers call the underlying system
>> call function, or kernel_thread.
>>
>> That said, I would suggest we allocate a syscall number for this, as it
>> would document the usage.  (In additon to 0..nr_syscalls - 1 we have
>> 0x1ebe in use).
>>      
> That's actually a pretty good idea.
>    

Since the syscall register is not architectural (or rather it is 
architectural but Linux ignores it) I don't see the point.  It would 
work for Linux but may alias some random parameter for a different 
guest.  We need a reliable method of distinguishing between syscalls and 
hypercalls.  Matching pc would work (but is defeated by inlining) so 
long as we find some other way of identifying the hc pc to the hypervisor.
Alexander Graf - June 28, 2010, 8:21 a.m.
On 28.06.2010, at 10:13, Avi Kivity wrote:

> On 06/28/2010 10:49 AM, Alexander Graf wrote:
>> 
>>> I don't believe we support the kernel actually doing a syscall to itself
>>> anymore, at least on powerpc.  The callers call the underlying system
>>> call function, or kernel_thread.
>>> 
>>> That said, I would suggest we allocate a syscall number for this, as it
>>> would document the usage.  (In additon to 0..nr_syscalls - 1 we have
>>> 0x1ebe in use).
>>>     
>> That's actually a pretty good idea.
>>   
> 
> Since the syscall register is not architectual (or rather it is architectural but Linux ignores it) I don't see the point.  It would work for Linux but may alias some random parameter for a different guest.  We need a reliable method of distinguishing between syscalls and hypercalls.  Matching pc would work (but is defeated by inlining) so long as we find some other way of identifying the hc pc to the hypervisor.

The other alternative I'd see is to reuse an instruction that is not sc. We could for example pull the mfpvr trick again, but pass a different magic value in the register this time that tells the hypervisor "this is a hypercall".

Or we could reserve a different SPR. But from what I've seen there are already quite a lot of SPRs out there. More than available numbers :).

The hypercall technique I used here is actually inspired by MOL. They use magic constants in r3 and r4 for their "OSI" identification. I'm frankly not sure what the best approach is, but considering that syscalls from the kernel lie in the guest kernel's hand, we could just declare any breakage a guest kernel bug.


Alex
Avi Kivity - June 28, 2010, 8:32 a.m.
On 06/28/2010 11:21 AM, Alexander Graf wrote:
>
> The other alternative I'd see is to reuse an instruction that is not sc. We could for example pull the mfpvr trick again, but pass a different magic value in the register this time that tells the hypervisor "this is a hypercall".
>
> Or we could reserve a different SPR. But from what I've seen there are already quite a lot of SPRs out there. More than available numbers :).
>
> The hypercall technique I used here is actually inspired by MOL. They use magic constants in r3 and r4 for their "OSI" identification. I'm frankly not sure what the best approach is, but considering that syscalls from the kernel lie in the guest kernel's hand, we could just declare any breakage a guest kernel bug.
>
>    

Magic = liable to break without notice.

Given r0 is the architectural syscall number, and r3 is the Linux 
syscall number, we can use a combination of r0 and r3, reserve r3 in 
Linux, and hope that no one else uses our selection of r0.

Still smelly, but not as bad.

Patch

diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt
new file mode 100644
index 0000000..7cbcd51
--- /dev/null
+++ b/Documentation/kvm/ppc-pv.txt
@@ -0,0 +1,164 @@ 
+The PPC KVM paravirtual interface
+=================================
+
+The basic execution principle by which KVM on PowerPC works is to run all kernel
+space code in PR=1 which is user space. This way we trap all privileged
+instructions and can emulate them accordingly.
+
+Unfortunately that is also the downfall. There are quite some privileged
+instructions that needlessly return us to the hypervisor even though they
+could be handled differently.
+
+This is what the PPC PV interface helps with. It takes privileged instructions
+and transforms them into unprivileged ones with some help from the hypervisor.
+This cuts down virtualization costs by about 50% on some of my benchmarks.
+
+The code for that interface can be found in arch/powerpc/kernel/kvm*
+
+Querying for existence
+======================
+
+To find out if we're running on KVM or not, we overlay the PVR register. Usually
+the PVR register contains an id that identifies your CPU type. If, however, you
+pass KVM_PVR_PARA in the register that you want the PVR result in, the register
+still contains KVM_PVR_PARA after the mfpvr call.
+
+	LOAD_REG_IMM(r5, KVM_PVR_PARA)
+	mfpvr	r5
+	[r5 still contains KVM_PVR_PARA]
+
+Once determined to run under a PV capable KVM, you can now use hypercalls as
+described below.
+
+PPC hypercalls
+==============
+
+The only viable ways to reliably get from guest context to host context are:
+
+	1) Call an invalid instruction
+	2) Call the "sc" instruction with a parameter to "sc"
+	3) Call the "sc" instruction with parameters in GPRs
+
+Method 1 is always a bad idea. Invalid instructions can be replaced later on
+by valid instructions, rendering the interface broken.
+
+Method 2 also has downfalls. If the parameter to "sc" is != 0 the spec is
+rather unclear if the sc is targeted directly for the hypervisor or the
+supervisor. It would also require that we read the syscall issuing instruction
+every time a syscall is issued, slowing down guest syscalls.
+
+Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R3 and
+KVM_SC_MAGIC_R4) in r3 and r4 respectively. If a syscall instruction with these
+magic values arrives from the guest's kernel mode, we take the syscall as a
+hypercall.
+
+The parameters are as follows:
+
+	r3		KVM_SC_MAGIC_R3
+	r4		KVM_SC_MAGIC_R4
+	r5		Hypercall number
+	r6		First parameter
+	r7		Second parameter
+	r8		Third parameter
+	r9		Fourth parameter
+
+Hypercall definitions are shared in generic code, so the same hypercall numbers
+apply for x86 and powerpc alike.
+
+The magic page
+==============
+
+To enable communication between the hypervisor and guest there is a new shared
+page that contains parts of supervisor visible register state. The guest can
+map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
+
+With this hypercall issued the guest always gets the magic page mapped at the
+desired location in effective and physical address space. For now, we always
+map the page to -4096. This way we can access it using absolute load and store
+functions. The following instruction reads the first field of the magic page:
+
+	ld	rX, -4096(0)
+
+The interface is designed to be extensible should there be need later to add
+additional registers to the magic page. If you add fields to the magic page,
+also define a new hypercall feature to indicate that the host can give you more
+registers. Only if the host supports the additional features, make use of them.
+
+The magic page has the following layout as described in
+arch/powerpc/include/asm/kvm_para.h:
+
+struct kvm_vcpu_arch_shared {
+	__u64 scratch1;
+	__u64 scratch2;
+	__u64 scratch3;
+	__u64 critical;		/* Guest may not get interrupts if == r1 */
+	__u64 sprg0;
+	__u64 sprg1;
+	__u64 sprg2;
+	__u64 sprg3;
+	__u64 srr0;
+	__u64 srr1;
+	__u64 dar;
+	__u64 msr;
+	__u32 dsisr;
+	__u32 int_pending;	/* Tells the guest if we have an interrupt */
+};
+
+Additions to the page must only occur at the end. Struct fields are always 32
+bit aligned.
+
+Patched instructions
+====================
+
+The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
+respectively on 32 bit systems with an added offset of 4 to accommodate for big
+endianness.
+
+From			To
+====			==
+
+mfmsr	rX		ld	rX, magic_page->msr
+mfsprg	rX, 0		ld	rX, magic_page->sprg0
+mfsprg	rX, 1		ld	rX, magic_page->sprg1
+mfsprg	rX, 2		ld	rX, magic_page->sprg2
+mfsprg	rX, 3		ld	rX, magic_page->sprg3
+mfsrr0	rX		ld	rX, magic_page->srr0
+mfsrr1	rX		ld	rX, magic_page->srr1
+mfdar	rX		ld	rX, magic_page->dar
+mfdsisr	rX		ld	rX, magic_page->dsisr
+
+mtmsr	rX		std	rX, magic_page->msr
+mtsprg	0, rX		std	rX, magic_page->sprg0
+mtsprg	1, rX		std	rX, magic_page->sprg1
+mtsprg	2, rX		std	rX, magic_page->sprg2
+mtsprg	3, rX		std	rX, magic_page->sprg3
+mtsrr0	rX		std	rX, magic_page->srr0
+mtsrr1	rX		std	rX, magic_page->srr1
+mtdar	rX		std	rX, magic_page->dar
+mtdsisr	rX		std	rX, magic_page->dsisr
+
+tlbsync			nop
+
+mtmsrd	rX, 0		b	<special mtmsr section>
+mtmsr			b	<special mtmsr section>
+
+mtmsrd	rX, 1		b	<special mtmsrd section>
+
+[BookE only]
+wrteei	[0|1]		b	<special wrteei section>
+
+
+Some instructions require more logic to determine what's going on than a load
+or store instruction can deliver. To enable patching of those, we keep some
+RAM around where we can live translate instructions to. What happens is the
+following:
+
+	1) copy emulation code to memory
+	2) patch that code to fit the emulated instruction
+	3) patch that code to return to the original pc + 4
+	4) patch the original instruction to branch to the new code
+
+That way we can inject an arbitrary amount of code as replacement for a single
+instruction. This allows us to check for pending interrupts when setting EE=1
+for example.
+
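
Steps 3 and 4 of the list above both come down to encoding a relative "b"
instruction, once from the copied code back to the original pc + 4 and once
from the original instruction to the copied code. A minimal sketch of that
encoding follows, assuming both addresses are within the +/-32MB reach of an
I-form branch. The function names are made up, and returning 0 on failure is
a convention of this sketch only:

```c
#include <stdbool.h>
#include <stdint.h>

#define PPC_INST_BRANCH		0x48000000u	/* primary opcode 18, AA=0, LK=0 */
#define PPC_BRANCH_LI_MASK	0x03fffffcu	/* 24-bit LI field, word aligned */

/* A relative branch reaches -2^25 .. 2^25 - 4 bytes, in 4-byte steps. */
static bool ppc_branch_in_range(int64_t offset)
{
	return offset >= -0x2000000 && offset <= 0x1fffffc &&
	       (offset & 3) == 0;
}

/* Encode "b <to>" for an instruction located at <from>.
 * Returns 0, which is not a valid instruction, if out of range. */
static uint32_t ppc_encode_branch(uint64_t from, uint64_t to)
{
	int64_t offset = (int64_t)(to - from);

	if (!ppc_branch_in_range(offset))
		return 0;
	return PPC_INST_BRANCH | ((uint32_t)offset & PPC_BRANCH_LI_MASK);
}
```

The return branch would be encoded as ppc_encode_branch(end_of_copy, pc + 4)
and the redirect as ppc_encode_branch(pc, start_of_copy), where end_of_copy
and start_of_copy stand for wherever the emulation code was placed. If the
scratch RAM is too far from the patched instruction, the encoder fails and
the instruction is best left unpatched so that it keeps trapping.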