Patchwork Reset problem vs. MMIO emulation, hypercalls, etc...

login
register
mail settings
Submitter Benjamin Herrenschmidt
Date Aug. 3, 2012, 10:30 p.m.
Message ID <1344033008.24037.67.camel@pasglop>
Download mbox | patch
Permalink /patch/175052/
State New
Headers show

Comments

Benjamin Herrenschmidt - Aug. 3, 2012, 10:30 p.m.
On Fri, 2012-08-03 at 14:41 -0300, Marcelo Tosatti wrote:

> > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > and I couldn't quite find the logic in qemu to do it ... but I might
> > just have missed it. I can see indeed that in the migration case you
> > want to actually complete the operation rather than just "abort it".
> > 
> > Any chance you can point me to the code that performs that trick qemu
> > side for migration ?
> 
> kvm-all.c:
> 
>         kvm_arch_pre_run(env, run);
>         if (env->exit_request) {
>             DPRINTF("interrupt exit requested\n");
>             /*
>              * KVM requires us to reenter the kernel after IO exits to
>              * complete
>              * instruction emulation. This self-signal will ensure that
>              * we
>              * leave ASAP again.
>              */
>             qemu_cpu_kick_self();
>         }
> 
> 
> > Anthony seems to think that for reset we can just abort the operation
> > state in the kernel when the MP state changes.

Ok, I see now, this is scary really... set a flag somewhere, pray that
we are in the right part of the loop on elsewhere... oh well.

In fact there's some fun (through harmless) bits to be found, look at
x86 for example:

        if (env->exception_injected == EXCP08_DBLE) {
            /* this means triple fault */
            qemu_system_reset_request();
            env->exit_request = 1;
            return 0;
        }

Well, qemu_system_reset_request() calls cpu_stop_current() which calls
cpu_exit() which also does:

   env->exit_request = 1;
 
So obviously it will be well set by that time :-)

Now I can see how having it set during kvm_arch_process_async_events()
will work as this is called right before the run loop. Similarily,
EXIT_MMIO and EXIT_IO would work so they are a non issue.

Our problem I suspect with PAPR doing resets via hcalls is that
our kvm_arch_handle_exit() does:

    case KVM_EXIT_PAPR_HCALL:
        dprintf("handle PAPR hypercall\n");
        run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
                                              run->papr_hcall.args);
        ret = 1;
        break;

See the ret = 1 here ? That means that the caller (kvm_cpu_exec.c main
loop) will exit immediately upon return. If we had instead returned 0,
it would have looped, seen the "exit_requested" set by
qemu_system_reset_request(), which would have then done the signal + run
trick, completed the hypercall, and returned this time in a clean state.

So it seems (I don't have the machine at hand to test right now) that by
just changing that ret = 1 to ret = 0, we might be fixing our problem,
including the case where another vcpu is taking a hypercall at the time
of the reset (this other CPU will also need to complete the operation
before stopping).

David, is there any reason why you did ret = 1 to begin with ? For
things like reset etc... the exit will happen as a result of
env->exit_requested being set by cpu_exit().

Are there other cases where you wish the hcall to go back to the main
loop ? In that case, shouldn't it set env->exit_requested rather than
returning 1 at that point ? (H_CEDE for example ?)

Paul, David, I don't have time to test that before Tuesday (I'm away on
monday) but if you have a chance, revert the kernel change we did and
try this qemu patch for reset:

 }

Though I don't think H_CEDE ever goes down to qemu, does it ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Gibson - Aug. 6, 2012, 3:13 a.m.
On Sat, Aug 04, 2012 at 08:30:08AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2012-08-03 at 14:41 -0300, Marcelo Tosatti wrote:
> 
> > > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > > and I couldn't quite find the logic in qemu to do it ... but I might
> > > just have missed it. I can see indeed that in the migration case you
> > > want to actually complete the operation rather than just "abort it".
> > > 
> > > Any chance you can point me to the code that performs that trick qemu
> > > side for migration ?
> > 
> > kvm-all.c:
> > 
> >         kvm_arch_pre_run(env, run);
> >         if (env->exit_request) {
> >             DPRINTF("interrupt exit requested\n");
> >             /*
> >              * KVM requires us to reenter the kernel after IO exits to
> >              * complete
> >              * instruction emulation. This self-signal will ensure that
> >              * we
> >              * leave ASAP again.
> >              */
> >             qemu_cpu_kick_self();
> >         }
> > 
> > 
> > > Anthony seems to think that for reset we can just abort the operation
> > > state in the kernel when the MP state changes.
> 
> Ok, I see now, this is scary really... set a flag somewhere, pray that
> we are in the right part of the loop on elsewhere... oh well.
> 
> In fact there's some fun (through harmless) bits to be found, look at
> x86 for example:
> 
>         if (env->exception_injected == EXCP08_DBLE) {
>             /* this means triple fault */
>             qemu_system_reset_request();
>             env->exit_request = 1;
>             return 0;
>         }
> 
> Well, qemu_system_reset_request() calls cpu_stop_current() which calls
> cpu_exit() which also does:
> 
>    env->exit_request = 1;
>  
> So obviously it will be well set by that time :-)
> 
> Now I can see how having it set during kvm_arch_process_async_events()
> will work as this is called right before the run loop. Similarily,
> EXIT_MMIO and EXIT_IO would work so they are a non issue.
> 
> Our problem I suspect with PAPR doing resets via hcalls is that
> our kvm_arch_handle_exit() does:
> 
>     case KVM_EXIT_PAPR_HCALL:
>         dprintf("handle PAPR hypercall\n");
>         run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
>                                               run->papr_hcall.args);
>         ret = 1;
>         break;
> 
> See the ret = 1 here ? That means that the caller (kvm_cpu_exec.c main
> loop) will exit immediately upon return. If we had instead returned 0,
> it would have looped, seen the "exit_requested" set by
> qemu_system_reset_request(), which would have then done the signal + run
> trick, completed the hypercall, and returned this time in a clean state.
> 
> So it seems (I don't have the machine at hand to test right now) that by
> just changing that ret = 1 to ret = 0, we might be fixing our problem,
> including the case where another vcpu is taking a hypercall at the time
> of the reset (this other CPU will also need to complete the operation
> before stopping).
> 
> David, is there any reason why you did ret = 1 to begin with ? For
> things like reset etc... the exit will happen as a result of
> env->exit_requested being set by cpu_exit().

Erm, there might have been, but I don't remember what it was.

> Are there other cases where you wish the hcall to go back to the main
> loop ? In that case, shouldn't it set env->exit_requested rather than
> returning 1 at that point ? (H_CEDE for example ?)

So, I'm still trying to nut out the implications for H_CEDE, and think
if there are any other hypercalls that might want to block the guest
for a time.  We were considering blocking H_PUT_TCE if qemu devices
had active dma maps on the previously mapped iovas.  I'm not sure if
the discussions that led to the inclusion of the qemu IOMMU code
decided that was wholly unnnecessary or just not necessary for the
time being.

> Paul, David, I don't have time to test that before Tuesday (I'm away on
> monday) but if you have a chance, revert the kernel change we did and
> try this qemu patch for reset:
> 
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -766,7 +766,7 @@ int kvm_arch_handle_exit(CPUPPCState *env, struct
> kvm_run *r
>          dprintf("handle PAPR hypercall\n");
>          run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
>                                                run->papr_hcall.args);
> -        ret = 1;
> +        ret = 0;
>          break;
>  #endif
>      default:

So, usually I think ret = 0 is correct, it lets us handle the
hypercall and get back to the guest without further fuss.  I think
right now, ret = 0 is always correct, since H_CEDE is always handled
in kernel.  But if we ever have qemu implemented hypercalls which want
to "block" then we might need a facility to do something else.  Maybe,
like I say, I haven't convinved myself I've thought of all the
implications yet.

> It might also need something like:
> 
> diff --git a/hw/spapr_hcall.c b/hw/spapr_hcall.c
> index a5990a9..d4d7fb0 100644
> --- a/hw/spapr_hcall.c
> +++ b/hw/spapr_hcall.c
> @@ -545,6 +545,7 @@ static target_ulong h_cede(CPUPPCState *env,
> sPAPREnvironmen
>      hreg_compute_hflags(env);
>      if (!cpu_has_work(env)) {
>          env->halted = 1;
> +        env->exit_request = 1;
>      }
>      return H_SUCCESS;
>  }
> 
> Though I don't think H_CEDE ever goes down to qemu, does it ?

No, though it might be worth doing that anyway, just in case we ever
have some whacky case where the H_CEDE does get handed on from the
kernel.

Ok, I'll send a patch and we can worry about the blocking qemu
hypercall case later, since it doesn't exist now.
Benjamin Herrenschmidt - Aug. 6, 2012, 8:57 p.m.
On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> So, I'm still trying to nut out the implications for H_CEDE, and think
> if there are any other hypercalls that might want to block the guest
> for a time.  We were considering blocking H_PUT_TCE if qemu devices
> had active dma maps on the previously mapped iovas.  I'm not sure if
> the discussions that led to the inclusion of the qemu IOMMU code
> decided that was wholly unnnecessary or just not necessary for the
> time being.

For "sleeping hcalls" they will simply have to set exit_request to
complete the hcall from the kernel perspective, leaving us in a state
where the kernel is about to restart at srr0 + 4, along with some other
flag (stop or halt) to actually freeze the vcpu.

If such an "async" hcall decides to return an error, it can then set
gpr3 directly using ioctls before restarting the vcpu.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Gibson - Aug. 7, 2012, 1:32 a.m.
On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> > So, I'm still trying to nut out the implications for H_CEDE, and think
> > if there are any other hypercalls that might want to block the guest
> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> > had active dma maps on the previously mapped iovas.  I'm not sure if
> > the discussions that led to the inclusion of the qemu IOMMU code
> > decided that was wholly unnnecessary or just not necessary for the
> > time being.
> 
> For "sleeping hcalls" they will simply have to set exit_request to
> complete the hcall from the kernel perspective, leaving us in a state
> where the kernel is about to restart at srr0 + 4, along with some other
> flag (stop or halt) to actually freeze the vcpu.
> 
> If such an "async" hcall decides to return an error, it can then set
> gpr3 directly using ioctls before restarting the vcpu.

Yeah, I'd pretty much convinced myself of that by the end of
yesterday.  I hope to send patches implementing these fixes today.

There are also some questions about why our in-kernel H_CEDE works
kind of differently from x86's hlt instruction implementation (which
comes out to qemu unless the irqchip is in-kernel as well).  I don't
think we have an urgent problem there though.
Avi Kivity - Aug. 7, 2012, 8:46 a.m.
On 08/07/2012 04:32 AM, David Gibson wrote:
> On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
>> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
>> > So, I'm still trying to nut out the implications for H_CEDE, and think
>> > if there are any other hypercalls that might want to block the guest
>> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
>> > had active dma maps on the previously mapped iovas.  I'm not sure if
>> > the discussions that led to the inclusion of the qemu IOMMU code
>> > decided that was wholly unnnecessary or just not necessary for the
>> > time being.
>> 
>> For "sleeping hcalls" they will simply have to set exit_request to
>> complete the hcall from the kernel perspective, leaving us in a state
>> where the kernel is about to restart at srr0 + 4, along with some other
>> flag (stop or halt) to actually freeze the vcpu.
>> 
>> If such an "async" hcall decides to return an error, it can then set
>> gpr3 directly using ioctls before restarting the vcpu.
> 
> Yeah, I'd pretty much convinced myself of that by the end of
> yesterday.  I hope to send patches implementing these fixes today.
> 
> There are also some questions about why our in-kernel H_CEDE works
> kind of differently from x86's hlt instruction implementation (which
> comes out to qemu unless the irqchip is in-kernel as well).  I don't
> think we have an urgent problem there though.

It's the other way round, hlt sleeps in the kernel unless the irqchip is
not in the kernel.  Meaning the normal state of things is to sleep in
the kernel (whether or not you have an emulated interrupt controller in
the kernel -- the term irqchip in kernel is overloaded for x86).
David Gibson - Aug. 7, 2012, 12:14 p.m.
On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
> On 08/07/2012 04:32 AM, David Gibson wrote:
> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
> >> > if there are any other hypercalls that might want to block the guest
> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
> >> > the discussions that led to the inclusion of the qemu IOMMU code
> >> > decided that was wholly unnnecessary or just not necessary for the
> >> > time being.
> >> 
> >> For "sleeping hcalls" they will simply have to set exit_request to
> >> complete the hcall from the kernel perspective, leaving us in a state
> >> where the kernel is about to restart at srr0 + 4, along with some other
> >> flag (stop or halt) to actually freeze the vcpu.
> >> 
> >> If such an "async" hcall decides to return an error, it can then set
> >> gpr3 directly using ioctls before restarting the vcpu.
> > 
> > Yeah, I'd pretty much convinced myself of that by the end of
> > yesterday.  I hope to send patches implementing these fixes today.
> > 
> > There are also some questions about why our in-kernel H_CEDE works
> > kind of differently from x86's hlt instruction implementation (which
> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
> > think we have an urgent problem there though.
> 
> It's the other way round, hlt sleeps in the kernel unless the irqchip is
> not in the kernel.

That's the same as what I said.

We never have irqchip in kernel (because we haven't written that yet)
but we still sleep in-kernel for CEDE.  I haven't spotted any problem
with that, but now I'm wondering if there is one, since x86 don't do
it in what seems like the analogous situation.

It's possible this works because our decrementer (timer) interrupts
are different at the core level from external interrupts coming from
the PIC, and *are* handled in kernel, but I haven't actually followed
the logic to work out if this is the case.

>  Meaning the normal state of things is to sleep in
> the kernel (whether or not you have an emulated interrupt controller in
> the kernel -- the term irqchip in kernel is overloaded for x86).

Uh.. overloaded in what way.
Avi Kivity - Aug. 7, 2012, 1:13 p.m.
On 08/07/2012 03:14 PM, David Gibson wrote:
> On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
>> On 08/07/2012 04:32 AM, David Gibson wrote:
>> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
>> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
>> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
>> >> > if there are any other hypercalls that might want to block the guest
>> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
>> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
>> >> > the discussions that led to the inclusion of the qemu IOMMU code
>> >> > decided that was wholly unnnecessary or just not necessary for the
>> >> > time being.
>> >> 
>> >> For "sleeping hcalls" they will simply have to set exit_request to
>> >> complete the hcall from the kernel perspective, leaving us in a state
>> >> where the kernel is about to restart at srr0 + 4, along with some other
>> >> flag (stop or halt) to actually freeze the vcpu.
>> >> 
>> >> If such an "async" hcall decides to return an error, it can then set
>> >> gpr3 directly using ioctls before restarting the vcpu.
>> > 
>> > Yeah, I'd pretty much convinced myself of that by the end of
>> > yesterday.  I hope to send patches implementing these fixes today.
>> > 
>> > There are also some questions about why our in-kernel H_CEDE works
>> > kind of differently from x86's hlt instruction implementation (which
>> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
>> > think we have an urgent problem there though.
>> 
>> It's the other way round, hlt sleeps in the kernel unless the irqchip is
>> not in the kernel.
> 
> That's the same as what I said.

I meant to stress that the normal way which other archs should emulate
is sleep-in-kernel.

> 
> We never have irqchip in kernel (because we haven't written that yet)
> but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> with that, but now I'm wondering if there is one, since x86 don't do
> it in what seems like the analogous situation.
> 
> It's possible this works because our decrementer (timer) interrupts
> are different at the core level from external interrupts coming from
> the PIC, and *are* handled in kernel, but I haven't actually followed
> the logic to work out if this is the case.
> 
>>  Meaning the normal state of things is to sleep in
>> the kernel (whether or not you have an emulated interrupt controller in
>> the kernel -- the term irqchip in kernel is overloaded for x86).
> 
> Uh.. overloaded in what way.

On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
correspond to non-x86 interrupt controllers, but the local APIC is more
tightly coupled to the core.  Interrupt acceptance by the core is an
operation that involved synchronous communication with the local APIC:
the APIC presents the interrupt, the core accepts it based on the value
of the interrupt enable flag and possible a register (CR8), then the
APIC updates the ISR and IRR.

The upshot is that if the local APIC is in userspace, interrupts must be
synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
and HLT is emulated in userspace (so that local APIC emulation can check
if an interrupt wakes it up or not).  As soon as the local APIC is
emulated in the kernel, HLT can be emulated there as well, and
interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).

So irqchip_in_kernel, for most discussions, really means whether
interrupt queuing is synchronous or asynchronous.  It has nothing to do
with the interrupt controllers per se.  All non-x86 archs always have
irqchip_in_kernel() in this sense.

Peter has started to fix up this naming mess in qemu.  I guess we should
do the same for the kernel (except for ABIs) and document it, because it
keeps generating confusion.
Benjamin Herrenschmidt - Aug. 7, 2012, 9:09 p.m.
On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
> 

> Peter has started to fix up this naming mess in qemu.  I guess we should
> do the same for the kernel (except for ABIs) and document it, because it
> keeps generating confusion. 

Ok so our current situation is that the XICS and MPIC are emulated in
userspace entirely but the link between them and the VCPU is the
asynchronous EE line so we are fine.

I'm currently working on moving the XICS into the kernel for performance
reasons, however, just like ARM VGIC, I can't seem to find a way to
"fit" in the "generic" irqchip code in there. It's just not generic at
all and quite x86 centric :-)

So for now I'm just doing my own version of CREATE_IRQCHIP to create it
and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
stuff (which we really don't need).

That's a bit of a problem vs. some of the code qemu-side such as in
virtio-pci which does seem to be written around the model exposed by the
x86 stuff and relies on doing such mappings so I think we'll have to
butcher some of that.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Gibson - Aug. 8, 2012, 12:49 a.m.
On Tue, Aug 07, 2012 at 04:13:49PM +0300, Avi Kivity wrote:
> On 08/07/2012 03:14 PM, David Gibson wrote:
> > On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
> >> On 08/07/2012 04:32 AM, David Gibson wrote:
> >> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> >> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> >> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
> >> >> > if there are any other hypercalls that might want to block the guest
> >> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> >> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
> >> >> > the discussions that led to the inclusion of the qemu IOMMU code
> >> >> > decided that was wholly unnnecessary or just not necessary for the
> >> >> > time being.
> >> >> 
> >> >> For "sleeping hcalls" they will simply have to set exit_request to
> >> >> complete the hcall from the kernel perspective, leaving us in a state
> >> >> where the kernel is about to restart at srr0 + 4, along with some other
> >> >> flag (stop or halt) to actually freeze the vcpu.
> >> >> 
> >> >> If such an "async" hcall decides to return an error, it can then set
> >> >> gpr3 directly using ioctls before restarting the vcpu.
> >> > 
> >> > Yeah, I'd pretty much convinced myself of that by the end of
> >> > yesterday.  I hope to send patches implementing these fixes today.
> >> > 
> >> > There are also some questions about why our in-kernel H_CEDE works
> >> > kind of differently from x86's hlt instruction implementation (which
> >> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
> >> > think we have an urgent problem there though.
> >> 
> >> It's the other way round, hlt sleeps in the kernel unless the irqchip is
> >> not in the kernel.
> > 
> > That's the same as what I said.
> 
> I meant to stress that the normal way which other archs should emulate
> is sleep-in-kernel.

Ok.

> > We never have irqchip in kernel (because we haven't written that yet)
> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> > with that, but now I'm wondering if there is one, since x86 don't do
> > it in what seems like the analogous situation.
> > 
> > It's possible this works because our decrementer (timer) interrupts
> > are different at the core level from external interrupts coming from
> > the PIC, and *are* handled in kernel, but I haven't actually followed
> > the logic to work out if this is the case.
> > 
> >>  Meaning the normal state of things is to sleep in
> >> the kernel (whether or not you have an emulated interrupt controller in
> >> the kernel -- the term irqchip in kernel is overloaded for x86).
> > 
> > Uh.. overloaded in what way.
> 
> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
> correspond to non-x86 interrupt controllers, but the local APIC is more
> tightly coupled to the core.  Interrupt acceptance by the core is an
> operation that involved synchronous communication with the local APIC:
> the APIC presents the interrupt, the core accepts it based on the value
> of the interrupt enable flag and possible a register (CR8), then the
> APIC updates the ISR and IRR.
> 
> The upshot is that if the local APIC is in userspace, interrupts must be
> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
> and HLT is emulated in userspace (so that local APIC emulation can check
> if an interrupt wakes it up or not).

Sorry, still not 100% getting it.  When the vcpu is actually running
code, that synchronous communication must still be accomplished via
the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
communication can't be accomplished in that case.

> As soon as the local APIC is
> emulated in the kernel, HLT can be emulated there as well, and
> interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).
> 
> So irqchip_in_kernel, for most discussions, really means whether
> interrupt queuing is synchronous or asynchronous.  It has nothing to do
> with the interrupt controllers per se.  All non-x86 archs always have
> irqchip_in_kernel() in this sense.
> 
> Peter has started to fix up this naming mess in qemu.  I guess we should
> do the same for the kernel (except for ABIs) and document it, because it
> keeps generating confusion.

Ok.
Avi Kivity - Aug. 8, 2012, 8:52 a.m.
On 08/08/2012 12:09 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
>> 
> 
>> Peter has started to fix up this naming mess in qemu.  I guess we should
>> do the same for the kernel (except for ABIs) and document it, because it
>> keeps generating confusion. 
> 
> Ok so our current situation is that the XICS and MPIC are emulated in
> userspace entirely but the link between them and the VCPU is the
> asynchronous EE line so we are fine.
> 
> I'm currently working on moving the XICS into the kernel for performance
> reasons, however, just like ARM VGIC, I can't seem to find a way to
> "fit" in the "generic" irqchip code in there. It's just not generic at
> all and quite x86 centric :-)

The generic code is for the two apic architectures: x64 and ia64.  Don't
try to shoehorn it in, you'll damage both the shoe and your foot.

> 
> So for now I'm just doing my own version of CREATE_IRQCHIP to create it
> and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
> stuff (which we really don't need).

You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

> 
> That's a bit of a problem vs. some of the code qemu-side such as in
> virtio-pci which does seem to be written around the model exposed by the
> x86 stuff and relies on doing such mappings so I think we'll have to
> butcher some of that.

Can you elaborate? virtio-pci is pci-centric, there should be nothing
x86 specific there.
Avi Kivity - Aug. 8, 2012, 8:58 a.m.
On 08/08/2012 03:49 AM, David Gibson wrote:
>> > We never have irqchip in kernel (because we haven't written that yet)
>> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
>> > with that, but now I'm wondering if there is one, since x86 don't do
>> > it in what seems like the analogous situation.
>> > 
>> > It's possible this works because our decrementer (timer) interrupts
>> > are different at the core level from external interrupts coming from
>> > the PIC, and *are* handled in kernel, but I haven't actually followed
>> > the logic to work out if this is the case.
>> > 
>> >>  Meaning the normal state of things is to sleep in
>> >> the kernel (whether or not you have an emulated interrupt controller in
>> >> the kernel -- the term irqchip in kernel is overloaded for x86).
>> > 
>> > Uh.. overloaded in what way.
>> 
>> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
>> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
>> correspond to non-x86 interrupt controllers, but the local APIC is more
>> tightly coupled to the core.  Interrupt acceptance by the core is an
>> operation that involved synchronous communication with the local APIC:
>> the APIC presents the interrupt, the core accepts it based on the value
>> of the interrupt enable flag and possible a register (CR8), then the
>> APIC updates the ISR and IRR.
>> 
>> The upshot is that if the local APIC is in userspace, interrupts must be
>> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
>> and HLT is emulated in userspace (so that local APIC emulation can check
>> if an interrupt wakes it up or not).
> 
> Sorry, still not 100% getting it.  When the vcpu is actually running
> code, that synchronous communication must still be accomplished via
> the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
> communication can't be accomplished in that case.

No, you're correct.  HLT could have been emulated in userspace, it just
wasn't.  The correct statement is that HLT was arbitrarily chosen to be
emulated in userspace with the synchronous model, but the asynchronous
model forced it into the kernel.
Benjamin Herrenschmidt - Aug. 8, 2012, 9:27 a.m.
On Wed, 2012-08-08 at 11:52 +0300, Avi Kivity wrote:
> > So for now I'm just doing my own version of CREATE_IRQCHIP to create it
> > and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
> > stuff (which we really don't need).
> 
> You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

Yes, sorry, brain and fingers not agreeing :-)

> > That's a bit of a problem vs. some of the code qemu-side such as in
> > virtio-pci which does seem to be written around the model exposed by the
> > x86 stuff and relies on doing such mappings so I think we'll have to
> > butcher some of that.
> 
> Can you elaborate? virtio-pci is pci-centric, there should be nothing
> x86 specific there.

The kvm_irqchip_add_msi_route & co is a problem, it more/less dives into
the x86/ia64 specific routing stuff at a generic level, I'll have to
hook things up differently in qemu I suspect.

Not a huge deal, we can sort it out.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Gibson - Aug. 8, 2012, 11:59 a.m.
On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote:
> On 08/08/2012 03:49 AM, David Gibson wrote:
> >> > We never have irqchip in kernel (because we haven't written that yet)
> >> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> >> > with that, but now I'm wondering if there is one, since x86 don't do
> >> > it in what seems like the analogous situation.
> >> > 
> >> > It's possible this works because our decrementer (timer) interrupts
> >> > are different at the core level from external interrupts coming from
> >> > the PIC, and *are* handled in kernel, but I haven't actually followed
> >> > the logic to work out if this is the case.
> >> > 
> >> >>  Meaning the normal state of things is to sleep in
> >> >> the kernel (whether or not you have an emulated interrupt controller in
> >> >> the kernel -- the term irqchip in kernel is overloaded for x86).
> >> > 
> >> > Uh.. overloaded in what way.
> >> 
> >> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
> >> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
> >> correspond to non-x86 interrupt controllers, but the local APIC is more
> >> tightly coupled to the core.  Interrupt acceptance by the core is an
> >> operation that involved synchronous communication with the local APIC:
> >> the APIC presents the interrupt, the core accepts it based on the value
> >> of the interrupt enable flag and possible a register (CR8), then the
> >> APIC updates the ISR and IRR.
> >> 
> >> The upshot is that if the local APIC is in userspace, interrupts must be
> >> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
> >> and HLT is emulated in userspace (so that local APIC emulation can check
> >> if an interrupt wakes it up or not).
> > 
> > Sorry, still not 100% getting it.  When the vcpu is actually running
> > code, that synchronous communication must still be accomplished via
> > the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
> > communication can't be accomplished in that case.
> 
> No, you're correct.  HLT could have been emulated in userspace, it just
> wasn't.  The correct statement is that HLT was arbitrarily chosen to be
> emulated in userspace with the synchronous model, but the asynchronous
> model forced it into the kernel.

Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
userspace in the first line there, anyway.  Ok, so I am now reassured
that our current handling of CEDE in kernelspace does not cause
problems.
Avi Kivity - Aug. 8, 2012, 12:42 p.m.
On 08/08/2012 02:59 PM, David Gibson wrote:
>> 
>> No, you're correct.  HLT could have been emulated in userspace, it just
>> wasn't.  The correct statement is that HLT was arbitrarily chosen to be
>> emulated in userspace with the synchronous model, but the asynchronous
>> model forced it into the kernel.
> 
> Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
> userspace in the first line there, anyway.  

I did.

> Ok, so I am now reassured
> that our current handling of CEDE in kernelspace does not cause
> problems.

Great.  It's a real pity the original local APIC implementation was in
userspace, it causes no end of confusion, and in fact was broken for smp
until recently even though it is 7 years old.

Patch

--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -766,7 +766,7 @@  int kvm_arch_handle_exit(CPUPPCState *env, struct
kvm_run *r
         dprintf("handle PAPR hypercall\n");
         run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
                                               run->papr_hcall.args);
-        ret = 1;
+        ret = 0;
         break;
 #endif
     default:

It might also need something like:

diff --git a/hw/spapr_hcall.c b/hw/spapr_hcall.c
index a5990a9..d4d7fb0 100644
--- a/hw/spapr_hcall.c
+++ b/hw/spapr_hcall.c
@@ -545,6 +545,7 @@  static target_ulong h_cede(CPUPPCState *env,
sPAPREnvironmen
     hreg_compute_hflags(env);
     if (!cpu_has_work(env)) {
         env->halted = 1;
+        env->exit_request = 1;
     }
     return H_SUCCESS;