[STRAWMAN] KVM: PPC: Add ioctl to specify interrupt controller architecture to emulate

Message ID 20130307032913.GC2385@drongo
State New, archived

Commit Message

Paul Mackerras March 7, 2013, 3:29 a.m. UTC
This adds a new ioctl, KVM_SET_IRQ_ARCHITECTURE, which is intended to
be called by userspace to specify that it wishes the kernel to emulate
a specific interrupt controller architecture.  This doesn't imply the
creation of any specific device, but does indicate that when vcpus are
created, space for per-vcpu interrupt controller state should be
allocated.  Having this ioctl enables userspace to defer creation of
the actual interrupt controller device(s) until after the vcpus are
created.

The KVM_SET_IRQ_ARCHITECTURE ioctl takes a single 32-bit unsigned int
as its parameter.  Values for this parameter will be defined in
subsequent patches.  The value of 0 means no in-kernel interrupt
controller emulation.

In order to ensure that this ioctl will fail once any vcpu has been
created, we use an arch.vcpus_created field to indicate that vcpu
creation has commenced.  We don't use the online_vcpus field because
that is only incremented after vcpu creation; if we used that there
would be a window during the first KVM_CREATE_VCPU ioctl where the
first vcpu had been created but the interrupt architecture could still
be changed.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
So, should this all be done in generic code?

 Documentation/virtual/kvm/api.txt   |   22 ++++++++++++
 arch/powerpc/include/asm/kvm_host.h |    2 ++
 arch/powerpc/kvm/powerpc.c          |   66 +++++++++++++++++++++++++++++++++--
 include/uapi/linux/kvm.h            |    3 ++
 4 files changed, 90 insertions(+), 3 deletions(-)
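
The guard described in the commit message can be modelled outside the kernel. The following is only a minimal userspace sketch: `struct vm_model`, `model_set_irq_arch()` and `model_vcpu_create()` are hypothetical names, and a pthread mutex stands in for `kvm->lock`. It assumes, as in the strawman, that 0 is the only architecture value accepted so far.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields the patch adds to struct kvm_arch. */
struct vm_model {
	pthread_mutex_t lock;	/* plays the role of kvm->lock */
	uint32_t irq_arch;	/* 0 means no in-kernel emulation */
	int vcpus_created;	/* set as soon as vcpu creation has commenced */
};

#define VM_MODEL_INIT { PTHREAD_MUTEX_INITIALIZER, 0, 0 }

/* Models the KVM_SET_IRQ_ARCHITECTURE handler: validate first, then set under the lock. */
static int model_set_irq_arch(struct vm_model *vm, uint32_t arch_id)
{
	int r = 0;

	if (arch_id != 0)	/* the strawman defines no non-zero values yet */
		return -EINVAL;

	pthread_mutex_lock(&vm->lock);
	if (!vm->vcpus_created)
		vm->irq_arch = arch_id;
	else
		r = -EBUSY;	/* too late: a vcpu already exists or is being created */
	pthread_mutex_unlock(&vm->lock);
	return r;
}

/* Models the hunk in kvm_arch_vcpu_create(): flag creation under the same lock. */
static void model_vcpu_create(struct vm_model *vm)
{
	pthread_mutex_lock(&vm->lock);
	vm->vcpus_created = 1;
	/* per-vcpu irq controller state would be allocated here, keyed on irq_arch */
	pthread_mutex_unlock(&vm->lock);
}
```

Because the flag is set and tested under the same mutex, a call to `model_set_irq_arch()` either completes before the first vcpu is marked as created or fails with -EBUSY; there is no window in between.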

Comments

Alexander Graf March 7, 2013, 2 p.m. UTC | #1
On 07.03.2013, at 04:29, Paul Mackerras wrote:

> This adds a new ioctl, KVM_SET_IRQ_ARCHITECTURE, which is intended to
> be called by userspace to specify that it wishes the kernel to emulate
> a specific interrupt controller architecture.  This doesn't imply the
> creation of any specific device, but does indicate that when vcpus are
> created, space for per-vcpu interrupt controller state should be
> allocated.  Having this ioctl enables userspace to defer creation of
> the actual interrupt controller device(s) until after the vcpus are
> created.
> 
> The KVM_SET_IRQ_ARCHITECTURE ioctl takes a single 32-bit unsigned int
> as its parameter.  Values for this parameter will be defined in
> subsequent patches.  The value of 0 means no in-kernel interrupt
> controller emulation.
> 
> In order to ensure that this ioctl will fail once any vcpu has been
> created, we use an arch.vcpus_created field to indicate that vcpu
> creation has commenced.  We don't use the online_vcpus field because
> that is only incremented after vcpu creation; if we used that there
> would be a window during the first KVM_CREATE_VCPU ioctl where the
> first vcpu had been created but the interrupt architecture could still
> be changed.
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>

Could you please (in a quick and drafty way) try and see if setting the IRQ arch (using enable_cap) after the vcpu got created would work for you?

That enable_cap would then have to loop through all devices and notify irq controllers that a new cpu got spawned.
All vcpu local payloads would have to get allocated and initialized outside of vcpu_create too then.

I don't have a good feeling for how hard this would be and whether locking would become overly difficult. I think it's fair to restrict the enable_cap to only work when no other vcpu is running. Of course, not requiring a stopped machine would make hotplug easier for user space :).

If it turns out to be hard and / or so complex that it's likely to make things buggier than it's worth, this patch is the way to go IMHO.


Alex

> ---
> So, should this all be done in generic code?
> 
> Documentation/virtual/kvm/api.txt   |   22 ++++++++++++
> arch/powerpc/include/asm/kvm_host.h |    2 ++
> arch/powerpc/kvm/powerpc.c          |   66 +++++++++++++++++++++++++++++++++--
> include/uapi/linux/kvm.h            |    3 ++
> 4 files changed, 90 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index cce500a..d550313 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2126,6 +2126,28 @@ header; first `n_valid' valid entries with contents from the data
> written, then `n_invalid' invalid entries, invalidating any previously
> valid entries found.
> 
> +4.79 KVM_SET_IRQ_ARCHITECTURE
> +
> +Capability: KVM_CAP_IRQ_ARCH
> +Architecture: powerpc
> +Type: vm ioctl
> +Parameters: Pointer to u32 (in)
> +Returns: 0 on success, -1 on error
> +
> +This is called before any vcpus are created, if in-kernel interrupt
> +controller emulation is desired, to specify what overall interrupt
> +controller architecture should be emulated.  Having this called before
> +any vcpus are created allows per-vcpu interrupt controller state to be
> +allocated at vcpu creation time, and allows the creation of the actual
> +interrupt controller device to be deferred until after the vcpus are
> +created.
> +
> +The parameter is a 32-bit unsigned integer specifying the
> +architecture, or the value 0 to specify that no emulation should be
> +done.  If the requested architecture is not supported by the kernel,
> +this ioctl returns an EINVAL error.  Otherwise, if this ioctl is
> +called after any vcpus have been created, it returns an EBUSY error.
> +
> 
> 5. The kvm_run structure
> ------------------------
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 8a72d59..e21ea1f 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -227,6 +227,8 @@ struct kvm_arch_memory_slot {
> };
> 
> struct kvm_arch {
> +	unsigned int irq_arch;
> +	int vcpus_created;
> 	unsigned int lpid;
> #ifdef CONFIG_KVM_BOOK3S_64_HV
> 	unsigned long hpt_virt;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 934413c..b681746 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -441,10 +441,21 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id)
> {
> 	struct kvm_vcpu *vcpu;
> 	vcpu = kvmppc_core_vcpu_create(kvm, id);
> -	if (!IS_ERR(vcpu)) {
> -		vcpu->arch.wqp = &vcpu->wq;
> -		kvmppc_create_vcpu_debugfs(vcpu, id);
> +	if (IS_ERR(vcpu))
> +		goto out;
> +
> +	/* Create per-vcpu irq controller state if needed */
> +	mutex_lock(&kvm->lock);
> +	kvm->arch.vcpus_created = 1;
> +	switch (kvm->arch.irq_arch) {
> +	default:
> +		break;
> 	}
> +	mutex_unlock(&kvm->lock);
> +
> +	vcpu->arch.wqp = &vcpu->wq;
> +	kvmppc_create_vcpu_debugfs(vcpu, id);
> + out:
> 	return vcpu;
> }
> 
> @@ -459,6 +470,12 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> 	hrtimer_cancel(&vcpu->arch.dec_timer);
> 	tasklet_kill(&vcpu->arch.tasklet);
> 
> +	/* Destroy per-vcpu irq controller state */
> +	switch (vcpu->kvm->arch.irq_arch) {
> +	default:
> +		break;
> +	}
> +
> 	kvmppc_remove_vcpu_debugfs(vcpu);
> 	kvmppc_core_vcpu_free(vcpu);
> }
> @@ -996,6 +1013,49 @@ long kvm_arch_vm_ioctl(struct file *filp,
> 		break;
> 	}
> #endif /* CONFIG_PPC_BOOK3S_64 */
> +
> +	case KVM_IRQ_LINE: {
> +		struct kvm *kvm = filp->private_data;
> +		struct kvm_irq_level args;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&args, argp, sizeof(args)))
> +			break;
> +		switch (kvm->arch.irq_arch) {
> +		default:
> +			r = -EINVAL;
> +		}
> +		break;
> +	}
> +
> +	case KVM_SET_IRQ_ARCHITECTURE: {
> +		struct kvm *kvm = filp->private_data;
> +		u32 arch_id;
> +
> +		r = -EFAULT;
> +		if (get_user(arch_id, (u32 __user *)argp))
> +			break;
> +		r = 0;
> +		switch (arch_id) {
> +		case 0:		/* no emulation */
> +			break;
> +		default:
> +			r = -EINVAL;
> +			break;
> +		}
> +		if (r)
> +			break;
> +
> +		/* mutex ensures we don't race with vcpu creation */
> +		mutex_lock(&kvm->lock);
> +		if (!kvm->arch.vcpus_created)
> +			kvm->arch.irq_arch = arch_id;
> +		else
> +			r = -EBUSY;
> +		mutex_unlock(&kvm->lock);
> +		break;
> +	}
> +
> 	default:
> 		r = -ENOTTY;
> 	}
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 9a2db57..2c0ea1c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -662,6 +662,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_PPC_HTAB_FD 84
> #define KVM_CAP_S390_CSS_SUPPORT 85
> #define KVM_CAP_PPC_EPR 86
> +#define KVM_CAP_IRQ_ARCH 87
> 
> #ifdef KVM_CAP_IRQ_ROUTING
> 
> @@ -889,6 +890,8 @@ struct kvm_s390_ucas_mapping {
> #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
> /* Available with KVM_CAP_PPC_HTAB_FD */
> #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
> +/* Available with KVM_CAP_IRQ_ARCH */
> +#define KVM_SET_IRQ_ARCHITECTURE  _IOW(KVMIO,  0xab, __u32)
> 
> /*
>  * ioctls for vcpu fds
> -- 
> 1.7.10.4
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Mackerras March 8, 2013, 10:37 a.m. UTC | #2
On Thu, Mar 07, 2013 at 03:00:52PM +0100, Alexander Graf wrote:
> 
> Could you please (in a quick and drafty way) try and see if setting the IRQ arch (using enable_cap) after the vcpu got created would work for you?
> 
> That enable_cap would then have to loop through all devices and notify irq controllers that a new cpu got spawned.
> All vcpu local payloads would have to get allocated and initialized outside of vcpu_create too then.

So, the first thing I noticed is that KVM_ENABLE_CAP is a vcpu ioctl,
not a vm ioctl.  Apparently qemu calls it once for every vcpu when it
calls it on ppc targets.  That means that it doesn't have to loop
through all vcpus; it just needs to connect up the one it's called
for, which simplifies things.

I'm coding it up now and porting my XICS emulation to the kvm device
API proposed by Scott.  It looks like it's going to be OK.  I have
used the first argument (cap->args[0]) to specify which interrupt
controller you want to connect the vcpu to.

> I don't have a good feeling for how hard this would be and whether locking would become overly difficult. I think it's fair to restrict the enable_cap to only work when no other vcpu is running. Of course, not requiring a stopped machine would make hotplug easier for user space :).

Since it's a per-vcpu ioctl, I'm expecting userspace to enable the
capability for each hotplugged vcpu.

Paul.
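
The per-vcpu behaviour described above can be illustrated with a small model. This is a sketch under stated assumptions, not the code from the series: `struct enable_cap_model` is a simplified local stand-in for the argument block of KVM_ENABLE_CAP, and `struct vcpu_model`, its `irq_ctrl` field and `model_vcpu_enable_cap()` are hypothetical. Only the capability number 87 comes from the patch.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_CAP_IRQ_ARCH 87u	/* KVM_CAP_IRQ_ARCH as numbered in the patch */

/* Simplified stand-in for the KVM_ENABLE_CAP argument block. */
struct enable_cap_model {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
};

/* Hypothetical per-vcpu state: which controller this vcpu is wired to. */
struct vcpu_model {
	int id;
	uint64_t irq_ctrl;	/* 0 = not connected */
};

/*
 * Models a vcpu-level enable_cap handler. It touches only the vcpu it was
 * called on, so no loop over all vcpus is needed.
 */
static int model_vcpu_enable_cap(struct vcpu_model *vcpu,
				 const struct enable_cap_model *cap)
{
	if (cap->flags)
		return -EINVAL;
	if (cap->cap != MODEL_CAP_IRQ_ARCH)
		return -EINVAL;
	if (cap->args[0] == 0)	/* args[0] selects the interrupt controller */
		return -EINVAL;

	vcpu->irq_ctrl = cap->args[0];
	return 0;
}
```

In this model userspace would call the handler once per vcpu, including once for each hotplugged vcpu; a vcpu that is never passed to it simply stays unconnected.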
Alexander Graf March 8, 2013, 11:04 a.m. UTC | #3
On 08.03.2013, at 11:37, Paul Mackerras <paulus@samba.org> wrote:

> On Thu, Mar 07, 2013 at 03:00:52PM +0100, Alexander Graf wrote:
>> 
>> Could you please (in a quick and drafty way) try and see if setting the IRQ arch (using enable_cap) after the vcpu got created would work for you?
>> 
>> That enable_cap would then have to loop through all devices and notify irq controllers that a new cpu got spawned.
>> All vcpu local payloads would have to get allocated and initialized outside of vcpu_create too then.
> 
> So, the first thing I noticed is that KVM_ENABLE_CAP is a vcpu ioctl,
> not a vm ioctl.  Apparently qemu calls it once for every vcpu when it
> calls it on ppc targets.  That means that it doesn't have to loop
> through all vcpus; it just needs to connect up the one it's called
> for, which simplifies things.

That's the point, yes :). And if for some weird reason one vcpu isn't connected to the interrupt controller (or to a different one), we can model that too ;).

> I'm coding it up now and porting my XICS emulation to the kvm device
> API proposed by Scott.  It looks like it's going to be OK.  

Awesome! Scott is going to prototype whether using fds as tokens makes sense. But even if we change it to an fd model, there should be very little work to do to move xics to it too if it's already modeled for create_device.

> I have
> used the first argument (cap->args[0]) to specify which interrupt
> controller you want to connect the vcpu to.

Ah, nice idea. So you basically make the vcpu connection explicit. Perfect! Then just pass the interrupt controller pin id in cap->args[1] so we don't need to guess which vcpu we're talking about and all is well :). No implicit assumptions left in the kernel.

> 
>> I don't have a good feeling for how hard this would be and whether locking would become overly difficult. I think it's fair to restrict the enable_cap to only work when no other vcpu is running. Of course, not requiring a stopped machine would make hotplug easier for user space :).
> 
> Since it's a per-vcpu ioctl, I'm expecting userspace to enable the
> capability for each hotplugged vcpu.

Yes, and also establish the link to the interrupt controller manually through that, so configuration is 100% in user space's hands.

This whole thing is starting to get a consistent structure - very nice!


Alex

Paul Mackerras March 9, 2013, 2:26 a.m. UTC | #4
On Fri, Mar 08, 2013 at 12:04:30PM +0100, Alexander Graf wrote:
> 
> 
> On 08.03.2013, at 11:37, Paul Mackerras <paulus@samba.org> wrote:
> 
> > I have
> > used the first argument (cap->args[0]) to specify which interrupt
> > controller you want to connect the vcpu to.
> 
> Ah, nice idea. So you basically make the vcpu connection explicit. Perfect! Then just pass the interrupt controller pin id in cap->args[1] so we don't need to guess which vcpu we're talking about and all is well :). No implicit assumptions left in the kernel.

By "pin", do you mean output pin of the interrupt controller, or input
pin of the CPU (i.e. external/debug/critical)?

Paul.
Alexander Graf March 11, 2013, 9:15 a.m. UTC | #5
On 09.03.2013, at 03:26, Paul Mackerras <paulus@samba.org> wrote:

> On Fri, Mar 08, 2013 at 12:04:30PM +0100, Alexander Graf wrote:
>> 
>> 
>> On 08.03.2013, at 11:37, Paul Mackerras <paulus@samba.org> wrote:
>> 
>>> I have
>>> used the first argument (cap->args[0]) to specify which interrupt
>>> controller you want to connect the vcpu to.
>> 
>> Ah, nice idea. So you basically make the vcpu connection explicit. Perfect! Then just pass the interrupt controller pin id in cap->args[1] so we don't need to guess which vcpu we're talking about and all is well :). No implicit assumptions left in the kernel.
> 
> By "pin", do you mean output pin of the interrupt controller, or input
> pin of the CPU (i.e. external/debug/critical)?

I basically mean the interrupt controller's notion of that cpu's ID. That can just as well be multiple output pins :).


Alex

> 
> Paul.
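
To make the args[1] suggestion concrete: if the second argument carries the interrupt controller's own notion of the CPU's ID, the controller can keep an explicit table instead of inferring the mapping. The sketch below is purely illustrative; `struct irq_ctrl_model`, `MODEL_MAX_CPUS`, `model_ctrl_init()` and `model_ctrl_connect()` are hypothetical names, and the choice of error codes is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_CPUS 8	/* hypothetical controller limit */

/* Hypothetical controller state, indexed by the controller's own cpu ID. */
struct irq_ctrl_model {
	int vcpu_for_slot[MODEL_MAX_CPUS];	/* -1 = no vcpu connected */
};

static void model_ctrl_init(struct irq_ctrl_model *c)
{
	for (size_t i = 0; i < MODEL_MAX_CPUS; i++)
		c->vcpu_for_slot[i] = -1;
}

/*
 * ctrl_cpu_id plays the role of cap->args[1]: userspace states explicitly
 * which controller-side cpu ID this vcpu occupies.
 */
static int model_ctrl_connect(struct irq_ctrl_model *c, int vcpu_id,
			      uint64_t ctrl_cpu_id)
{
	if (ctrl_cpu_id >= MODEL_MAX_CPUS)
		return -EINVAL;
	if (c->vcpu_for_slot[ctrl_cpu_id] != -1)
		return -EEXIST;	/* that ID is already taken */

	c->vcpu_for_slot[ctrl_cpu_id] = vcpu_id;
	return 0;
}
```

With the ID passed in by userspace, the vcpu number and the controller-side number are free to differ, which is the "no implicit assumptions" property asked for above.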
Paul Mackerras March 14, 2013, 1:26 a.m. UTC | #6
On Wed, Mar 13, 2013 at 07:14:48PM -0500, Scott Wood wrote:
> On 03/08/2013 05:04:30 AM, Alexander Graf wrote:
> >
> >
> >On 08.03.2013, at 11:37, Paul Mackerras <paulus@samba.org> wrote:
> >
> >> On Thu, Mar 07, 2013 at 03:00:52PM +0100, Alexander Graf wrote:
> >>>
> >>> Could you please (in a quick and drafty way) try and see if
> >setting the IRQ arch (using enable_cap) after the vcpu got created
> >would work for you?
> >>>
> >>> That enable_cap would then have to loop through all devices and
> >notify irq controllers that a new cpu got spawned.
> >>> All vcpu local payloads would have to get allocated and
> >initialized outside of vcpu_create too then.
> >>
> >> So, the first thing I noticed is that KVM_ENABLE_CAP is a vcpu
> >ioctl,
> >> not a vm ioctl.  Apparently qemu calls it once for every vcpu
> >when it
> >> calls it on ppc targets.  That means that it doesn't have to loop
> >> through all vcpus; it just needs to connect up the one it's called
> >> for, which simplifies things.
> >
> >That's the point, yes :). And if for some weird reason one vcpu
> >isn't connected to the interrupt controller (or to a different
> >one), we can model that too ;).
> >
> >> I'm coding it up now and porting my XICS emulation to the kvm device
> >> API proposed by Scott.  It looks like it's going to be OK.
> >
> >Awesome! Scott is going to prototype whether using fds as tokens
> >makes sense. But even if we change it to an fd model, there should
> >be very little work to do to move xics to it too if it's already
> >modeled for create_device.
> 
> It looks like the fd approach will be workable.  Paul, do you want
> to post what you have in terms of the capability approach, so I can
> base an fd version of the device control patchset on it, or should I
> fd-ize the current patchset without it, and then rework mpic on top
> of the capability stuff once you've posted your device-control-using
> patchset?

I have a complete patchset based on your "kvm: add device control API"
patch, tested and ready to go. :)  I just posted the first patch of
that series, the one that adds the KVM_CAP_IRQ_ARCH capability.  If
you're going to change the device API then I'll hold off posting the
rest of the series for now.

> >> I have
> >> used the first argument (cap->args[0]) to specify which interrupt
> >> controller you want to connect the vcpu to.
> >
> >Ah, nice idea. So you basically make the vcpu connection explicit.
> >Perfect! Then just pass the interrupt controller pin id in
> >cap->args[1] so we don't need to guess which vcpu we're talking
> >about and all is well :). No implicit assumptions left in the
> >kernel.
> 
> Is the IRQ architecture now implicit based on what sort of irqchip
> you point at, or is there a separate capability for each IRQ
> architecture?  The latter may make more sense -- you can test for
> specific architectures, provide architecture-specific arguments,
> some architectures may not require pointing at a device (e.g. the
> "LAPIC in kernel, IO-APIC in userspace" model), etc.

The way I have done it, there is one capability, and args[0] is a
token for the IRQ architecture (not a device ID).  I arbitrarily
assigned 0x58494353 for KVM_CAP_IRQ_XICS as the args[0] value to
indicate XICS.  I think it would be better if we don't have to get a
new capability number assigned every time we want to add a new type of
interrupt controller.

Paul.
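
The token value is described as arbitrary, but it is worth noting that 0x58494353 is the four ASCII bytes 'X', 'I', 'C', 'S' packed most significant byte first. A small sketch makes that visible; `MODEL_IRQ_XICS` and `model_token_to_text()` are hypothetical names used only here.

```c
#include <stdint.h>

#define MODEL_IRQ_XICS 0x58494353u	/* args[0] token value quoted above */

/* Unpack a 32-bit token into four characters, most significant byte first. */
static void model_token_to_text(uint32_t token, char out[5])
{
	out[0] = (char)((token >> 24) & 0xff);
	out[1] = (char)((token >> 16) & 0xff);
	out[2] = (char)((token >> 8) & 0xff);
	out[3] = (char)(token & 0xff);
	out[4] = '\0';
}
```

Passing `MODEL_IRQ_XICS` to this helper yields the string "XICS", so a token of this kind stays recognisable in a hex dump without needing a central enumeration.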
Paul Mackerras March 14, 2013, 10:02 p.m. UTC | #7
On Thu, Mar 14, 2013 at 01:15:35PM -0500, Scott Wood wrote:
> On 03/13/2013 08:26:20 PM, Paul Mackerras wrote:

> >I arbitrarily
> >assigned 0x58494353 for KVM_CAP_IRQ_XICS as the args[0] value to
> >indicate XICS.
> 
> Why is it called KVM_CAP_<whatever> if it's not a capability?

Because it's associated with a capability.  I'm not wedded to the name.

> >I think it would be better if we don't have to get a
> >new capability number assigned every time we want to add a new type of
> >interrupt controller.
> 
> How often does it really happen?  If a simple enumeration is good
> enough for identifying the main IRQ controller device type, it
> should be good enough for identifying the vcpu irq arch.

Whatever.  I really don't care at this point, I'm just getting
extremely tired of the bikeshedding.  If you don't like it, propose
something.

Paul.
Alexander Graf March 14, 2013, 10:44 p.m. UTC | #8
On 14.03.2013, at 23:02, Paul Mackerras wrote:

> On Thu, Mar 14, 2013 at 01:15:35PM -0500, Scott Wood wrote:
>> On 03/13/2013 08:26:20 PM, Paul Mackerras wrote:
> 
>>> I arbitrarily
>>> assigned 0x58494353 for KVM_CAP_IRQ_XICS as the args[0] value to
>>> indicate XICS.
>> 
>> Why is it called KVM_CAP_<whatever> if it's not a capability?
> 
> Because it's associated with a capability.  I'm not wedded to the name.
> 
>>> I think it would be better if we don't have to get a
>>> new capability number assigned every time we want to add a new type of
>>> interrupt controller.
>> 
>> How often does it really happen?  If a simple enumeration is good
>> enough for identifying the main IRQ controller device type, it
>> should be good enough for identifying the vcpu irq arch.
> 
> Whatever.  I really don't care at this point, I'm just getting
> extremely tired of the bikeshedding.  If you don't like it, propose
> something.

So far most comments I've seen haven't been bikeshedding, but each and every one got us closer to something that we can hopefully rely on for the next few years.

However, I agree. Scott, you seem to have a pretty good picture by now of what this should look like. Could you please take this patch as a basis, make each controller type its own CAP and have the enable_cap pass a token (fd) to the interrupt controller the CPU should get connected to? I think we're getting very close to having something that looks gorgeous.


Alex

Patch

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index cce500a..d550313 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2126,6 +2126,28 @@  header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_SET_IRQ_ARCHITECTURE
+
+Capability: KVM_CAP_IRQ_ARCH
+Architecture: powerpc
+Type: vm ioctl
+Parameters: Pointer to u32 (in)
+Returns: 0 on success, -1 on error
+
+This is called before any vcpus are created, if in-kernel interrupt
+controller emulation is desired, to specify what overall interrupt
+controller architecture should be emulated.  Having this called before
+any vcpus are created allows per-vcpu interrupt controller state to be
+allocated at vcpu creation time, and allows the creation of the actual
+interrupt controller device to be deferred until after the vcpus are
+created.
+
+The parameter is a 32-bit unsigned integer specifying the
+architecture, or the value 0 to specify that no emulation should be
+done.  If the requested architecture is not supported by the kernel,
+this ioctl returns an EINVAL error.  Otherwise, if this ioctl is
+called after any vcpus have been created, it returns an EBUSY error.
+
 
 5. The kvm_run structure
 ------------------------
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8a72d59..e21ea1f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -227,6 +227,8 @@  struct kvm_arch_memory_slot {
 };
 
 struct kvm_arch {
+	unsigned int irq_arch;
+	int vcpus_created;
 	unsigned int lpid;
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 	unsigned long hpt_virt;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 934413c..b681746 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -441,10 +441,21 @@  struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id)
 {
 	struct kvm_vcpu *vcpu;
 	vcpu = kvmppc_core_vcpu_create(kvm, id);
-	if (!IS_ERR(vcpu)) {
-		vcpu->arch.wqp = &vcpu->wq;
-		kvmppc_create_vcpu_debugfs(vcpu, id);
+	if (IS_ERR(vcpu))
+		goto out;
+
+	/* Create per-vcpu irq controller state if needed */
+	mutex_lock(&kvm->lock);
+	kvm->arch.vcpus_created = 1;
+	switch (kvm->arch.irq_arch) {
+	default:
+		break;
 	}
+	mutex_unlock(&kvm->lock);
+
+	vcpu->arch.wqp = &vcpu->wq;
+	kvmppc_create_vcpu_debugfs(vcpu, id);
+ out:
 	return vcpu;
 }
 
@@ -459,6 +470,12 @@  void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 	hrtimer_cancel(&vcpu->arch.dec_timer);
 	tasklet_kill(&vcpu->arch.tasklet);
 
+	/* Destroy per-vcpu irq controller state */
+	switch (vcpu->kvm->arch.irq_arch) {
+	default:
+		break;
+	}
+
 	kvmppc_remove_vcpu_debugfs(vcpu);
 	kvmppc_core_vcpu_free(vcpu);
 }
@@ -996,6 +1013,49 @@  long kvm_arch_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
+
+	case KVM_IRQ_LINE: {
+		struct kvm *kvm = filp->private_data;
+		struct kvm_irq_level args;
+
+		r = -EFAULT;
+		if (copy_from_user(&args, argp, sizeof(args)))
+			break;
+		switch (kvm->arch.irq_arch) {
+		default:
+			r = -EINVAL;
+		}
+		break;
+	}
+
+	case KVM_SET_IRQ_ARCHITECTURE: {
+		struct kvm *kvm = filp->private_data;
+		u32 arch_id;
+
+		r = -EFAULT;
+		if (get_user(arch_id, (u32 __user *)argp))
+			break;
+		r = 0;
+		switch (arch_id) {
+		case 0:		/* no emulation */
+			break;
+		default:
+			r = -EINVAL;
+			break;
+		}
+		if (r)
+			break;
+
+		/* mutex ensures we don't race with vcpu creation */
+		mutex_lock(&kvm->lock);
+		if (!kvm->arch.vcpus_created)
+			kvm->arch.irq_arch = arch_id;
+		else
+			r = -EBUSY;
+		mutex_unlock(&kvm->lock);
+		break;
+	}
+
 	default:
 		r = -ENOTTY;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 9a2db57..2c0ea1c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -662,6 +662,7 @@  struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_HTAB_FD 84
 #define KVM_CAP_S390_CSS_SUPPORT 85
 #define KVM_CAP_PPC_EPR 86
+#define KVM_CAP_IRQ_ARCH 87
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -889,6 +890,8 @@  struct kvm_s390_ucas_mapping {
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
+/* Available with KVM_CAP_IRQ_ARCH */
+#define KVM_SET_IRQ_ARCHITECTURE  _IOW(KVMIO,  0xab, __u32)
 
 /*
  * ioctls for vcpu fds