Message ID: 200809231743.23828.ossthema@de.ibm.com
State: Deferred, archived
Delegated to: Jeff Garzik
Jan-Bernd wrote:
> Ben, can you / your team look into the implementation
> of the set_irq_type functionality needed for XICS?

I'm not volunteering to look at or implement any changes for how xics works with generic irq, but I'm trying to understand what the rt kernel is trying to accomplish with this statement:

On Mon Sep 15 at 18:04:06 EST in 2008, Sebastien Dugue wrote:
> When entering the low level handler, level sensitive interrupts are
> masked, then eoi'd in interrupt context and then unmasked at the
> end of hardirq processing. That's fine as any interrupt coming
> in-between will still be processed since the kernel replays those
> pending interrupts.

Is this to generate some kind of software managed nesting and priority of the hardware level interrupts?

The reason I ask is the xics controller can do unlimited nesting of hardware interrupts. In fact, the hardware has 255 levels of priority, of which 16 or so are reserved by the hypervisor, leaving over 200 for the os to manage. Higher numbers are lower in priority, and the hardware will only dispatch an interrupt to a given cpu if it is currently at a lower priority. If it is at a higher priority and the interrupt is not bound to a specific cpu, it will look for another cpu to dispatch it. The hardware will not re-present an irq until it is EOId (managed by a small state machine per interrupt at the source, which also handles the no-cpu-available try-again-later case), but software can return its cpu priority to the previous level to receive other interrupt sources at the same level. The hardware also supports lazy update of the cpu priority register when an interrupt is presented; as long as the cpu is hard-irq enabled, it can take the irq, then write its real priority and let the hw decide if the irq is still pending or if it must defer or try another cpu in the rejection scenario. The only restriction is that the EOI can not cause an interrupt reject by raising the priority while sending the EOI command.
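The priority rule described here (higher number = less favored, dispatch only to a cpu currently at a less favored priority, nesting by raising the cpu to the accepted interrupt's priority) can be sketched as a small userspace simulation. All names below are invented for illustration; this is not kernel or hypervisor code.

```c
#include <assert.h>
#include <stdbool.h>

#define CPPR_BASE 0xff            /* least favored: cpu accepts anything */

struct sim_cpu {
    unsigned char cppr;           /* current processor priority register */
};

/* Would the controller present an irq of priority 'prio' to this cpu?
 * Lower numbers are more favored, so present only while prio < cppr. */
static bool xics_would_dispatch(const struct sim_cpu *cpu, unsigned char prio)
{
    return prio < cpu->cppr;
}

/* Accepting an interrupt raises the cpu to that interrupt's priority,
 * which is what allows unlimited nesting by more-favored sources only.
 * The previous priority is returned so it can be restored at EOI. */
static unsigned char xics_accept(struct sim_cpu *cpu, unsigned char prio)
{
    unsigned char prev = cpu->cppr;
    cpu->cppr = prio;
    return prev;
}
```

With this model, an interrupt at the same priority as the one being handled must wait until the EOI restores the previous (less favored) priority, while a more favored one nests immediately.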
The per-interrupt mask and unmask calls have to go through RTAS, a single-threaded global context, which in addition to increasing path length will really limit scalability. The interrupt controller poll and reject facilities are accessed through hypervisor calls which are comparable to a fast syscall, and parallel to all cpus.

We used to lower the priority to allow other interrupts in, but we realized that in addition to the questionable latency in doing so, it only caused unlimited stack nesting and overflow without per-irq stacks. We currently set IPIs above other irqs so we typically only process them during a hard irq (but we return to base level after IPI and could take another base irq, a bug).

So, Sebastien, with this information, does the RT kernel have a strategy that better matches this hardware?

milton
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
On Wed, 2008-09-24 at 04:58 -0500, Milton Miller wrote:
> The per-interrupt mask and unmask calls have to go through RTAS, a
> single-threaded global context, which in addition to increasing
> path length will really limit scalability. The interrupt controller
> poll and reject facilities are accessed through hypervisor calls
> which are comparable to a fast syscall, and parallel to all cpus.

Note also that the XICS code thus assumes, iirc, as does the cell IIC code, that eoi is called on the -same- cpu that fetched the interrupt initially. That assumption can be broken with IRQ threads, no?

Ben.
On Sep 24, 2008, at 5:17 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2008-09-24 at 04:58 -0500, Milton Miller wrote:
>> The per-interrupt mask and unmask calls have to go through RTAS, a
>> single-threaded global context, which in addition to increasing
>> path length will really limit scalability. The interrupt controller
>> poll and reject facilities are accessed through hypervisor calls
>> which are comparable to a fast syscall, and parallel to all cpus.
>
> Note also that the XICS code thus assumes, iirc, as does the cell IIC
> code, that eoi is called on the -same- cpu that fetched the interrupt
> initially. That assumption can be broken with IRQ threads no ?

There may be some implicit assumption in that we expect the cpu priority to be returned to normal by the EOI, but there is nothing in the hardware that requires the EOI to come from the same cpu that accepted the interrupt for processing, with the exception of the IPI, which is per-cpu (and the only interrupt that is per-cpu). It would probably mean adding the concept of the current cpu priority vs interrupts and making sure we write it to hardware at irq_exit() time when deferring the actual irq handlers.

The MPIC hardware, on the other hand, maintains a queue of pending interrupts (it has been about a decade, but the number 4-5 comes to mind), and the hardware must receive the EOI on the cpu that took the interrupt. I am guessing that the handling described (take level irq, mask it, eoi it, dispatch the thread, then unmask it after processing) is a result of working within those limitations. Do you know the cell IIC well enough to say whether it is mpic-like or xics-like in this regard?

The other unknown is the (very few) platforms that present as xics but are really firmware on mpic. If they do a full virtual layer and don't take shortcuts but do eoi/mask as described here they should work, but I would not be surprised if that does not hold true :-(.

milton
Hi Milton,

On Wed, 24 Sep 2008 04:58:22 -0500 (CDT) Milton Miller <miltonm@bga.com> wrote:
> Jan-Bernd wrote:
> > Ben, can you / your team look into the implementation
> > of the set_irq_type functionality needed for XICS?
>
> I'm not volunteering to look at or implement any changes for how xics
> works with generic irq, but I'm trying to understand what the rt kernel
> is trying to accomplish with this statement:
>
> On Mon Sep 15 at 18:04:06 EST in 2008, Sebastien Dugue wrote:
> > When entering the low level handler, level sensitive interrupts are
> > masked, then eoi'd in interrupt context and then unmasked at the
> > end of hardirq processing. That's fine as any interrupt coming
> > in-between will still be processed since the kernel replays those
> > pending interrupts.
>
> Is this to generate some kind of software managed nesting and priority
> of the hardware level interrupts?

No, not really. This is only to be sure not to miss interrupts coming from the same source that were received during threaded hardirq processing. Some instrumentation showed that it never seems to happen in the eHEA interrupt case, so I think we can forget this aspect.

Also, the problem only manifests with the eHEA RX interrupt. For example, the IBM Power Raid (ipr) SCSI exhibits absolutely no problem under an RT kernel. From this I conclude that:

  IPR - PCI - XICS is OK
  eHEA - IBMEBUS - XICS is broken with hardirq preemption.

I also checked that forcing the eHEA interrupt to take the non-threaded path does work.
Here is a side by side comparison of the fasteoi flow with and without hardirq threading (sorry it's a bit wide)

fasteoi flow:
------------

Non threaded hardirq                  | threaded hardirq
                                      |
interrupt context                     | interrupt context            hardirq thread
-----------------                     | -----------------            --------------
clear IRQ_REPLAY and IRQ_WAITING      | clear IRQ_REPLAY and IRQ_WAITING
increment percpu interrupt count      | increment percpu interrupt count
if no action or IRQ_INPROGRESS        | if no action or IRQ_INPROGRESS
    or IRQ_DISABLED                   |     or IRQ_DISABLED
    set IRQ_PENDING                   |     set IRQ_PENDING
    mask                              |     mask
    eoi                               |     eoi
    done                              |     done
set IRQ_INPROGRESS                    | set IRQ_INPROGRESS
                                      | wakeup IRQ thread
                                      | mask
                                      | eoi
                                      | done -----------------------> loop
clear IRQ_PENDING                     |                                 clear IRQ_PENDING
call handle_IRQ_event                 |                                 call handle_IRQ_event
                                      |                                 check for preempt
                                      |                               until IRQ_PENDING cleared
clear IRQ_INPROGRESS                  |                               clear IRQ_INPROGRESS
if not IRQ_DISABLED                   |                               if not IRQ_DISABLED
    unmask                            |                                   unmask
eoi                                   |
done                                  |

the non-threaded flow does (in interrupt context):

  mask
  handle interrupt
  unmask
  eoi

the threaded flow does:

  mask
  eoi
  handle interrupt
  unmask

If I remove the mask() call, then the eHEA is no longer hanging.

> The reason I ask is the xics controller can do unlimited nesting
> of hardware interrupts. In fact, the hardware has 255 levels of
> priority, of which 16 or so are reserved by the hypervisor, leaving
> over 200 for the os to manage. Higher numbers are lower in priority,
> and the hardware will only dispatch an interrupt to a given cpu if
> it is currently at a lower priority. If it is at a higher priority
> and the interrupt is not bound to a specific cpu it will look for
> another cpu to dispatch it.
> The hardware will not re-present an
> irq until it is EOId (managed by a small state machine per
> interrupt at the source, which also handles no cpu available try
> again later), but software can return its cpu priority to the
> previous level to receive other interrupt sources at the same level.
> The hardware also supports lazy update of the cpu priority register
> when an interrupt is presented; as long as the cpu is hard-irq
> enabled it can take the irq then write its real priority and let the
> hw decide if the irq is still pending or it must defer or try another
> cpu in the rejection scenario. The only restriction is that the
> EOI can not cause an interrupt reject by raising the priority while
> sending the EOI command.
>
> The per-interrupt mask and unmask calls have to go through RTAS, a
> single-threaded global context, which in addition to increasing
> path length will really limit scalability. The interrupt controller
> poll and reject facilities are accessed through hypervisor calls
> which are comparable to a fast syscall, and parallel to all cpus.
>
> We used to lower the priority to allow other interrupts in, but we
> realized that in addition to the questionable latency in doing so,
> it only caused unlimited stack nesting and overflow without per-irq
> stacks. We currently set IPIs above other irqs so we typically
> only process them during a hard irq (but we return to base level
> after IPI and could take another base irq, a bug).
>
> So, Sebastien, with this information, does the RT kernel have
> a strategy that better matches this hardware?

Don't think so. I think that the problem may be elsewhere, as everything is fine with PCI devices (well, at least SCSI). As I said earlier in another mail, it seems that the eHEA is behaving as if it were generating edge interrupts which do not support masking. Don't know.

Thanks a lot for the explanation, looks like the xics + hypervisor combo is way more complex than I thought.

  Sebastien.
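The replay of pending interrupts that the quoted fasteoi flow relies on can be sketched as a toy model: an event arriving while the line is IRQ_INPROGRESS is only marked IRQ_PENDING (after the mask + eoi), and the handling loop runs again until the flag stays clear. Flag names follow the 2008-era kernel; everything else here is invented for illustration.

```c
#include <assert.h>

enum { SIM_INPROGRESS = 1, SIM_PENDING = 2 };

struct sim_irq {
    unsigned status;
    int to_inject;   /* extra "hardware" events to fire mid-handling */
    int handled;     /* times handle_IRQ_event() ran */
};

static void sim_irq_arrive(struct sim_irq *irq)
{
    if (irq->status & SIM_INPROGRESS) {
        irq->status |= SIM_PENDING;   /* recorded for replay, not lost */
        return;
    }
    irq->status |= SIM_INPROGRESS | SIM_PENDING;
    do {
        irq->status &= ~SIM_PENDING;
        irq->handled++;               /* handle_IRQ_event() */
        if (irq->to_inject > 0) {     /* event fires during handling */
            irq->to_inject--;
            sim_irq_arrive(irq);      /* only sets SIM_PENDING */
        }
    } while (irq->status & SIM_PENDING);
    irq->status &= ~SIM_INPROGRESS;
}
```

This is the mechanism Sebastien's original quote appeals to: any interrupt coming in between the eoi and the unmask is replayed by the kernel rather than dropped.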
Hi Ben,

On Wed, 24 Sep 2008 20:17:47 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Wed, 2008-09-24 at 04:58 -0500, Milton Miller wrote:
> > The per-interrupt mask and unmask calls have to go through RTAS, a
> > single-threaded global context, which in addition to increasing
> > path length will really limit scalability. The interrupt controller
> > poll and reject facilities are accessed through hypervisor calls
> > which are comparable to a fast syscall, and parallel to all cpus.
>
> Note also that the XICS code thus assumes, iirc, as does the cell IIC
> code, that eoi is called on the -same- cpu that fetched the interrupt
> initially. That assumption can be broken with IRQ threads no ?

No, the fetch and the eoi are both done in interrupt context before the hardirq thread is woken up.

On the other hand, the mask+eoi and the unmask may well happen on different cpus, as there's only one hardirq thread per irq on the system. Don't know if this is a problem with the XICS though.

Thanks,

  Sebastien.

> Ben.
On Sep 24, 2008, at 7:30 AM, Sebastien Dugue wrote:
> Hi Milton,
>
> On Wed, 24 Sep 2008 04:58:22 -0500 (CDT) Milton Miller <miltonm@bga.com> wrote:
>> On Mon Sep 15 at 18:04:06 EST in 2008, Sebastien Dugue wrote:
>>> When entering the low level handler, level sensitive interrupts are
>>> masked, then eoi'd in interrupt context and then unmasked at the
>>> end of hardirq processing. That's fine as any interrupt coming
>>> in-between will still be processed since the kernel replays those
>>> pending interrupts.
>>
>> Is this to generate some kind of software managed nesting and priority
>> of the hardware level interrupts?
>
> No, not really. This is only to be sure not to miss interrupts coming
> from the same source that were received during threaded hardirq
> processing.
> Some instrumentation showed that it never seems to happen in the eHEA
> interrupt case, so I think we can forget this aspect.

I don't trust "the interrupt can never happen during hea hardirq", because I think there will be a race between their rearming the next interrupt and the unmask being called.

I was trying to understand why the mask and early eoi, but I guess it's to handle other more limited interrupt controllers where the interrupts stack in hardware instead of software.

> Also, the problem only manifests with the eHEA RX interrupt. For example,
> the IBM Power Raid (ipr) SCSI exhibits absolutely no problem under an RT
> kernel. From this I conclude that:
>
>   IPR - PCI - XICS is OK
>   eHEA - IBMEBUS - XICS is broken with hardirq preemption.
>
> I also checked that forcing the eHEA interrupt to take the non-threaded
> path does work.

For a long period of time, XICS dealt only with level interrupts. First Micro Channel, and later PCI buses. The IPI is made level by software conventions. Recently, EHCA, EHEA, and MSI interrupts were added, which by their nature are edge based.
The logic that converts those interrupts to the XICS layer is responsible for the resend when no cpu can accept them, but not for the retrigger after an EOI.

> Here is a side by side comparison of the fasteoi flow with and without hardirq
> threading (sorry it's a bit wide)
(removed)
> the non-threaded flow does (in interrupt context):
>
>   mask
>   handle interrupt
>   unmask
>   eoi
>
> the threaded flow does:
>
>   mask
>   eoi
>   handle interrupt
>   unmask
>
> If I remove the mask() call, then the eHEA is no longer hanging.

Hmm, I guess I'm confused. You are saying the irq does not appear if it occurs while it is masked? Well, in that case, I would guess that the hypervisor is checking whether the irq was pending while it was masked and resetting it as part of the unmask. It can't do that on level, but it can on the true edge sources. I would further say the justification for this might be that the hardware might make it pending from some previous stale event that would result in a false interrupt on startup were it not to do this clear.

>> The reason I ask is the xics controller can do unlimited nesting
>> of hardware interrupts. In fact, the hardware has 255 levels of
>> priority, of which 16 or so are reserved by the hypervisor, leaving
>> over 200 for the os to manage. Higher numbers are lower in priority,
>> and the hardware will only dispatch an interrupt to a given cpu if
>> it is currently at a lower priority. If it is at a higher priority
>> and the interrupt is not bound to a specific cpu it will look for
>> another cpu to dispatch it. The hardware will not re-present an
>> irq until it is EOId (managed by a small state machine per
>> interrupt at the source, which also handles no cpu available try
>> again later), but software can return its cpu priority to the
>> previous level to receive other interrupt sources at the same level.
>> The hardware also supports lazy update of the cpu priority register
>> when an interrupt is presented; as long as the cpu is hard-irq
>> enabled it can take the irq then write its real priority and let the
>> hw decide if the irq is still pending or it must defer or try another
>> cpu in the rejection scenario. The only restriction is that the
>> EOI can not cause an interrupt reject by raising the priority while
>> sending the EOI command.
>>
>> The per-interrupt mask and unmask calls have to go through RTAS, a
>> single-threaded global context, which in addition to increasing
>> path length will really limit scalability. The interrupt controller
>> poll and reject facilities are accessed through hypervisor calls
>> which are comparable to a fast syscall, and parallel to all cpus.
>>
>> We used to lower the priority to allow other interrupts in, but we
>> realized that in addition to the questionable latency in doing so,
>> it only caused unlimited stack nesting and overflow without per-irq
>> stacks. We currently set IPIs above other irqs so we typically
>> only process them during a hard irq (but we return to base level
>> after IPI and could take another base irq, a bug).
>>
>> So, Sebastien, with this information, does the RT kernel have
>> a strategy that better matches this hardware?
>
> Don't think so. I think that the problem may be elsewhere as
> everything is fine with PCI devices (well at least SCSI).

Those are true level sources, and not edge.

> As I said earlier in another mail, it seems that the eHEA
> is behaving as if it was generating edge interrupts which do not
> support masking. Don't know.

(I wrote this next paragraph before parsing the "remove mask and it works" / I'm confused paragraph above, so it may not be a problem.)

These sources are truly edge. Once you do an EOI you are taking responsibility to do the replay yourself. In your threaded case, you EOI and therefore the hardware will arm for the next event.
When you add the mask, the delivery is deferred until it is unmasked at the end of your EOI loop. When you do not, the new interrupt may come in, but you just EOI it without telling the running thread that it happened, so you are dropping the irq event. Since the source is truly edge, there is no hardware replay and the interrupt is lost. (I think the pci express gigabit is one of the few msi interrupt adapters that both IBM and Linux support).

> Thanks a lot for the explanation, looks like the xics + hypervisor
> combo is way more complex than I thought.

While the hypervisor adds a bit of path length (an hcall vs a single mmio access for get_irq/eoi with multiple priority irq nesting), the model is no more or less complicated than native xics.

The path length for mask and unmask is always VERY slow: a single-threaded global lock and single context in xics. It is designed and tuned to run at driver startup and shutdown (and adapter reset and reinitialize during pci error processing), not during normal irq processing.

The XICS hardware implicitly masks the specific source as part of interrupt ack (get_irq), and implicitly undoes this mask at eoi. In addition, it helps to manage the cpu priority by supplying the previous priority as part of the get_irq process and providing for the priority to be restored (lowered only) as part of the eoi. The hardware does support setting the cpu priority independently.

We should only be using this implicit masking for xics, and not the explicit masking, for any normal interrupt processing. I don't know if this means making mask/unmask set a bit in software, with enable/disable actually calling what we do now on mask/unmask, or if it means we need a new flow type on real time. While calls to mask and unmask might work on level interrupts, they are really slow and will limit performance if done on every interrupt.
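The edge-event loss described above can be illustrated with a toy model that follows Milton's guess about the hypervisor: an event latched while an edge source is masked is treated as stale and cleared on unmask, so the threaded mask/eoi/handle/unmask ordering silently drops it. All names here are invented for illustration; this is not the eHEA or hypervisor interface.

```c
#include <assert.h>
#include <stdbool.h>

struct edge_src {
    bool masked;
    bool pending;    /* latched edge waiting to be presented */
    int presented;   /* events actually delivered to a cpu */
};

/* The device fires an edge: the source latches it. */
static void edge_fire(struct edge_src *s) { s->pending = true; }

/* The controller presents a latched event only while unmasked. */
static void edge_present(struct edge_src *s)
{
    if (s->pending && !s->masked) {
        s->pending = false;
        s->presented++;
    }
}

/* Per the guess above: unmask clears a "stale" latched event. */
static void edge_unmask(struct edge_src *s)
{
    s->masked = false;
    s->pending = false;   /* the lost irq: no hardware replay */
}
```

In the threaded flow the source is masked before the handler thread runs, so an edge firing mid-handler is discarded at unmask and never presented again, matching the observed eHEA hang once the mask() call is added.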
> the non-threaded flow does (in interrupt context):
>
>   mask
>   handle interrupt
>   unmask
>   eoi
>
> the threaded flow does:
>
>   mask
>   eoi
>   handle interrupt
>   unmask

I think the flows we want on xics are:

(non-threaded)
  getirq (implicit source specific mask until eoi)
  handle interrupt
  eoi (implicit cpu priority restore)

(threaded)
  getirq (implicit source specific mask until eoi)
  explicit cpu priority restore
  handle interrupt
  eoi (implicit cpu priority restore to same as explicit level)

Where the cpu priority restore allows receiving other interrupts of the same priority from the hardware.

So I guess the question is: can the rt kernel interrupt processing take advantage of xics auto mask, or does someone need to write state tracking in the xics code to work around this, changing mask under interrupt to "defer eoi to unmask" (which I can not see as clean, and which would have shutdown problems)?

milton
> There may be some implicit assumption in that we expect the cpu
> priority to be returned to normal by the EOI, but there is nothing in
> the hardware that requires the EOI to come from the same cpu as
> accepted the interrupt for processing, with the exception of the IPI
> which is per-cpu (and the only interrupt that is per-cpu).

Well, there is one fundamental one: The XIRR register we access is per-CPU, so if we are to return the right processor priority, we must make sure we write the right XIRR. Same with Cell, and MPIC actually, and a few others. In general I'd say most fast_eoi type PICs have this requirement.

> It would probably mean adding the concept of the current cpu priority
> vs interrupts and making sure we write it to hardware at irq_exit()
> time when deferring the actual irq handlers.

I think we need something like a special -rt variant of the fast_eoi handler that masks & eoi's in ack() before the thread is spun off, and unmasks instead of eoi() when the irq processing is complete.

Ben.
On Wed, 2008-09-24 at 14:35 +0200, Sebastien Dugue wrote:
> Hi Ben,
>
> On Wed, 24 Sep 2008 20:17:47 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > On Wed, 2008-09-24 at 04:58 -0500, Milton Miller wrote:
> > > The per-interrupt mask and unmask calls have to go through RTAS, a
> > > single-threaded global context, which in addition to increasing
> > > path length will really limit scalability. The interrupt controller
> > > poll and reject facilities are accessed through hypervisor calls
> > > which are comparable to a fast syscall, and parallel to all cpus.
> >
> > Note also that the XICS code thus assumes, iirc, as does the cell IIC
> > code, that eoi is called on the -same- cpu that fetched the interrupt
> > initially. That assumption can be broken with IRQ threads no ?
>
> No, the fetch and the eoi are both done in interrupt context before
> the hardirq thread is woken up.
>
> On the other hand, the mask+eoi and the unmask may well happen
> on different cpus as there's only one hardirq thread per irq on
> the system. Don't know if this is a problem with the XICS though.

Ok, that's the right approach then. It should work. I don't know what the specific problems with HEA are at this stage. It doesn't seem to make sense to implement a set_irq_type(); what would it do? The XICS doesn't expose any concept of interrupt type...

Ben.
On Wed, 2008-09-24 at 11:42 -0500, Milton Miller wrote:
>
> I was trying to understand why the mask and early eoi, but I guess it's
> to handle other more limited interrupt controllers where the interrupts
> stack in hardware instead of software.

No Milton, we must do it that way, because the EOI must be done on the right CPU even on XICS, or we won't get the CPU priority back properly.

Ben.
On Sep 24, 2008, at 4:16 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2008-09-24 at 11:42 -0500, Milton Miller wrote:
>>
>> I was trying to understand why the mask and early eoi, but I guess it's
>> to handle other more limited interrupt controllers where the interrupts
>> stack in hardware instead of software.
>
> No Milton, we must do it that way, because the EOI must be done on the
> right CPU even on XICS, or we won't get the CPU priority back properly.

Ben and I had an online chat, and he pointed out I needed to be more specific in saying what I was thinking.

>> I think the flows we want on xics are:
>>
>> (non-threaded)
>>   getirq (implicit source specific mask until eoi)
>>   handle interrupt
>>   eoi (implicit cpu priority restore)
>>
>> (threaded)
>>   getirq (implicit source specific mask until eoi)
>>   explicit cpu priority restore
>>   handle interrupt
>>   eoi (implicit cpu priority restore to same as explicit level)

cpu takes interrupt, checks soft disabled
    if so, set hard disabled
    else call get_irq
        if threaded
            write cppr to restore this cpu irq dispatch state to non-interrupt
            mark irq thread as irq pending
        else
            handle interrupt
            eoi (cppr = base)

irq thread will
    handle interrupt
    eoi
    wait for marked pending again

The part Ben did not follow was that the cppr write to base priority is done by the interrupted cpu (like the mask and eoi in the current flow) and only the final eoi (where the unmask is in the existing flow) is done on whichever cpu happens to run the irq thread.

(optional) As I was discussing with Paul, when taking an irq when soft-disabled but still hard enabled, it is possible to write the cppr such that it would reject the pending irq and have it be considered for dispatch to another cpu.
But it would increase path length on both the go-to-hard-disabled and return-from-hard-disabled paths, and the hardware will have some latency as it will likely send the irq back to the io source for retry, so we would only want to do this if the hard-disable period is sufficiently long.

milton
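The threaded flow Milton sketches above can be modeled as a small userspace toy: get_irq implicitly masks the source until eoi, the interrupted cpu explicitly restores its priority (cppr) before the irq thread runs, and the eoi can later come from whichever cpu runs the thread. All names and state here are illustrative, not the xics code.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_CPPR_BASE 0xff        /* least favored: accept anything */

struct sim_xics {
    unsigned char cppr;           /* per-cpu current priority register */
    bool source_masked;           /* implicit mask from get_irq to eoi */
};

static void sim_get_irq(struct sim_xics *x, unsigned char prio)
{
    x->cppr = prio;               /* accepting raises the cpu priority */
    x->source_masked = true;      /* source can't re-present until eoi */
}

/* Done by the interrupted cpu in interrupt context, before the irq
 * thread runs the handler, so other sources can interrupt meanwhile. */
static void sim_cppr_restore(struct sim_xics *x)
{
    x->cppr = SIM_CPPR_BASE;
}

/* Done later, on whichever cpu runs the irq thread. */
static void sim_eoi(struct sim_xics *x)
{
    x->source_masked = false;     /* hw re-arms the source */
    x->cppr = SIM_CPPR_BASE;      /* restore to the same level as the
                                   * explicit write: never raised, so
                                   * no interrupt reject at eoi */
}
```

The key property the test below checks: after the explicit cppr restore, the cpu can take other interrupts, yet the handled source stays implicitly masked until the thread's eoi.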
On Thu, 25 Sep 2008 07:15:17 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Wed, 2008-09-24 at 14:35 +0200, Sebastien Dugue wrote:
> > Hi Ben,
> >
> > On Wed, 24 Sep 2008 20:17:47 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > On Wed, 2008-09-24 at 04:58 -0500, Milton Miller wrote:
> > > > The per-interrupt mask and unmask calls have to go through RTAS, a
> > > > single-threaded global context, which in addition to increasing
> > > > path length will really limit scalability. The interrupt controller
> > > > poll and reject facilities are accessed through hypervisor calls
> > > > which are comparable to a fast syscall, and parallel to all cpus.
> > >
> > > Note also that the XICS code thus assumes, iirc, as does the cell IIC
> > > code, that eoi is called on the -same- cpu that fetched the interrupt
> > > initially. That assumption can be broken with IRQ threads no ?
> >
> > No, the fetch and the eoi are both done in interrupt context before
> > the hardirq thread is woken up.
> >
> > On the other hand, the mask+eoi and the unmask may well happen
> > on different cpus as there's only one hardirq thread per irq on
> > the system. Don't know if this is a problem with the XICS though.
>
> Ok, that's the right approach then. It should work. I don't know what
> the specific problems with HEA are at this stage.

Yep, except that it behaves in a way that the current -rt fasteoi flow cannot handle.

> It doesn't seem to
> make sense to implement a set_irq_type(), what would it do ? The
> XICS doesn't expose any concept of interrupt type...

That's what I gathered from looking at the sources.

Thanks,

  Sebastien.
On Thu, 2008-09-25 at 09:18 +0200, Sebastien Dugue wrote:
> > Ok, that's the right approach then. It should work. I don't know what
> > the specific problems with HEA are at this stage.
>
> Yep, except that it behaves in a way that the current -rt fasteoi flow
> cannot handle.

We probably need to make a special xics flow handler for -rt that does what Milton suggested, ie, bring down the CPU priority right away and only EOI later, or something like that, instead of masking/unmasking. I don't know what the other potential issues with the HEA are, though.

Ben.
On Thu, 25 Sep 2008 07:14:07 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > There may be some implicit assumption in that we expect the cpu
> > priority to be returned to normal by the EOI, but there is nothing in
> > the hardware that requires the EOI to come from the same cpu as
> > accepted the interrupt for processing, with the exception of the IPI
> > which is per-cpu (and the only interrupt that is per-cpu).
>
> Well, there is one fundamental one: The XIRR register we access is
> per-CPU, so if we are to return the right processor priority, we must
> make sure we write the right XIRR.

That's already the case, as the irq fetch (xx_xirr_info_get()) and eoi (xx_xirr_info_set()) are both done in interrupt context, therefore on the same cpu.

> Same with Cell, MPIC, actually and a few others. In general I'd say most
> fast_eoi type PICs have this requirement.
>
> > It would probably mean adding the concept of the current cpu priority
> > vs interrupts and making sure we write it to hardware at irq_exit()
> > time when deferring the actual irq handlers.
>
> I think we need something like a special -rt variant of the fast_eoi
> handler that masks & eoi's in ack() before the thread is spun off, and
> unmasks instead of eoi() when the irq processing is complete.

This is what is already done in the threaded case:

  - fetch + mask + eoi in interrupt context
  - unmask in the thread when processing is complete.

  Sebastien.
On Thu, 25 Sep 2008 17:22:41 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Thu, 2008-09-25 at 09:18 +0200, Sebastien Dugue wrote:
> > > Ok, that's the right approach then. It should work. I don't know what
> > > the specific problems with HEA are at this stage.
> >
> > Yep, except that it behaves in a way that the current -rt fasteoi flow
> > cannot handle.
>
> We probably need to make a special xics flow handler for -rt that does
> what Milton suggested, ie, bring down the CPU priority right away

Do you mean creating a custom fasteoi handler in xics.c? Also, it's not clear to me from looking at the code how you go about changing the cpu priority.

> and
> only EOI later or something like that, instead of masking/unmasking.
>
> I don't know what are the other potential issues with the HEA though.

Don't know either, but that I can test.

Thanks,

  Sebastien.
> Do you mean creating a custom fasteoi handler in xics.c? Also, it's
> not clear to me from looking at the code how you go about changing the
> cpu priority.

Yup. I think the priority is the CPPR. Milton can give you more details; if not, I'll pick it up tomorrow when at the office.

Ben.
On Thu, 25 Sep 2008 18:36:19 +1000 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > Do you mean creating a custom fasteoi handler in xics.c? Also, it's
> > not clear to me from looking at the code how you go about changing the
> > cpu priority.
>
> Yup. I think the priority is the CPPR. Milton can give you more
> details; if not, I'll pick it up tomorrow when at the office.

Thanks Ben, will look into this.

Nite

  Sebastien.
On Wed, 24 Sep 2008 11:42:15 -0500 Milton Miller <miltonm@bga.com> wrote:
> On Sep 24, 2008, at 7:30 AM, Sebastien Dugue wrote:
> > Hi Milton,
> >
> > On Wed, 24 Sep 2008 04:58:22 -0500 (CDT) Milton Miller <miltonm@bga.com> wrote:
> >> On Mon Sep 15 at 18:04:06 EST in 2008, Sebastien Dugue wrote:
> >>> When entering the low level handler, level sensitive interrupts are
> >>> masked, then eoi'd in interrupt context and then unmasked at the
> >>> end of hardirq processing. That's fine as any interrupt coming
> >>> in-between will still be processed since the kernel replays those
> >>> pending interrupts.
> >>
> >> Is this to generate some kind of software managed nesting and priority
> >> of the hardware level interrupts?
> >
> > No, not really. This is only to be sure not to miss interrupts coming
> > from the same source that were received during threaded hardirq
> > processing.
> > Some instrumentation showed that it never seems to happen in the eHEA
> > interrupt case, so I think we can forget this aspect.
>
> I don't trust "the interrupt can never happen during hea hardirq",
> because I think there will be a race between their rearming the next
> interrupt and the unmask being called.

So do I. It was just to make sure I was not being hit by another interrupt while handling the previous one and thus reduce the number of hypotheses. I sure do not say that it cannot happen, just that that path is not taken when I have the eHEA hang.

> I was trying to understand why the mask and early eoi, but I guess it's
> to handle other more limited interrupt controllers where the interrupts
> stack in hardware instead of software.
>
> > Also, the problem only manifests with the eHEA RX interrupt. For example,
> > the IBM Power Raid (ipr) SCSI exhibits absolutely no problem under an RT
> > kernel. From this I conclude that:
> >
> >   IPR - PCI - XICS is OK
> >   eHEA - IBMEBUS - XICS is broken with hardirq preemption.
> > I also checked that forcing the eHEA interrupt to take the
> > non-threaded path does work.
>
> For a long period of time, XICS dealt only with level interrupts.
> First Micro Channel, and later PCI buses. The IPI is made level by
> software conventions. Recently, EHCA, EHEA, and MSI interrupts were
> added, which by their nature are edge based. The logic that converts
> those interrupts to the XICS layer is responsible for the resend when
> no cpu can accept them, but not for retriggering after an EOI.

OK

> > Here is a side by side comparison of the fasteoi flow with and
> > without hardirq threading (sorry it's a bit wide)
(removed)
> > the non-threaded flow does (in interrupt context):
> >
> >   mask

Whoops, my bad: in the non-threaded case there's no mask at all, only
an unmask + eoi at the end. Maybe that's an oversight!

> >   handle interrupt
> >   unmask
> >   eoi
> >
> > the threaded flow does:
> >
> >   mask
> >   eoi
> >   handle interrupt
> >   unmask
> >
> > If I remove the mask() call, then the eHEA is no longer hanging.
>
> Hmm, I guess I'm confused. You are saying the irq does not appear if
> it occurs while it is masked?

Looks like it, but I cannot say for sure; the only observable effect
is that I do not get any more interrupts coming from the eHEA.

> Well, in that case, I would guess that
> the hypervisor is checking whether the irq became pending while it
> was masked and resetting it as part of the unmask. It can't do that
> for level sources, but it can for the true edge sources. I would
> further say the justification for this might be that the hardware
> could make it pending from some previous stale event, which would
> result in a false interrupt on startup were it not to do this clear.
>
> >> The reason I ask is the xics controller can do unlimited nesting
> >> of hardware interrupts. In fact, the hardware has 255 levels of
> >> priority, of which 16 or so are reserved by the hypervisor, leaving
> >> over 200 for the os to manage.
> >> Higher numbers are lower in priority,
> >> and the hardware will only dispatch an interrupt to a given cpu if
> >> it is currently at a lower priority. If it is at a higher priority
> >> and the interrupt is not bound to a specific cpu, it will look for
> >> another cpu to dispatch it. The hardware will not re-present an
> >> irq until it is EOI'd (managed by a small state machine per
> >> interrupt at the source, which also handles "no cpu available, try
> >> again later"), but software can return its cpu priority to the
> >> previous level to receive other interrupt sources at the same
> >> level. The hardware also supports lazy update of the cpu priority
> >> register when an interrupt is presented; as long as the cpu is
> >> hard-irq enabled it can take the irq, then write its real priority
> >> and let the hw decide if the irq is still pending or it must defer
> >> or try another cpu in the rejection scenario. The only restriction
> >> is that the EOI can not cause an interrupt reject by raising the
> >> priority while sending the EOI command.
> >>
> >> The per-interrupt mask and unmask calls have to go through RTAS, a
> >> single-threaded global context, which in addition to increasing
> >> path length will really limit scalability. The interrupt controller
> >> poll and reject facilities are accessed through hypervisor calls
> >> which are comparable to a fast syscall, and parallel across all
> >> cpus.
> >>
> >> We used to lower the priority to allow other interrupts in, but we
> >> realized that in addition to the questionable latency in doing so,
> >> it only caused unlimited stack nesting and overflow without per-irq
> >> stacks. We currently set IPIs above other irqs so we typically
> >> only process them during a hard irq (but we return to base level
> >> after an IPI and could take another base irq, a bug).
> >>
> >> So, Sebastien, with this information, does the RT kernel have
> >> a strategy that better matches this hardware?
> >
> > Don't think so.
> > I think that the problem may be elsewhere, as
> > everything is fine with PCI devices (well, at least SCSI).
>
> Those are true level sources, and not edge.

Right.

> > As I said earlier in another mail, it seems that the eHEA
> > is behaving as if it was generating edge interrupts which do not
> > support masking.

Don't know.

> (I wrote this next paragraph before parsing the "remove mask and it
> works" / I'm confused paragraph above, so it may not be a problem).
>
> These sources are truly edge. Once you do an EOI you are taking
> responsibility to do the replay yourself. In your threaded case, you
> EOI and therefore the hardware will arm for the next event. When you
> add the mask, the delivery is deferred until it is unmasked at the
> end of your EOI loop. When you do not, the new interrupt may come in,
> but you just EOI it and do not tell the running thread that it
> happened, so you are dropping the irq event. Since the source is
> truly edge, there is no hardware replay and the interrupt is lost.
>
> (I think the pci express gigabit is one of the few msi interrupt
> adapters that both IBM and Linux support).
>
> > Thanks a lot for the explanation, looks like the xics + hypervisor
> > combo is way more complex than I thought.
>
> While the hypervisor adds a bit of path length (an hcall vs a single
> mmio access for get_irq/eoi with multiple priority irq nesting), the
> model is no more or less complicated than native xics.

That may be, but I'm only looking at the code (read: no specifications
at hand) and it looks like a black box to me.

> The path length for mask and unmask is always VERY slow: a
> single-threaded global lock and a single context in xics. It is
> designed and tuned to run at driver startup and shutdown (and adapter
> reset and reinitialize during pci error processing), not during
> normal irq processing.

Now, that is quite interesting then.
Those mask() and unmask() should then be called shutdown() and
startup(), and not at each interrupt. Or am I misunderstanding you?

> The XICS hardware implicitly masks the specific source as part of
> interrupt ack (get_irq), and implicitly undoes this mask at eoi. In
> addition, it helps to manage the cpu priority by supplying the
> previous priority as part of the get_irq process and providing for
> the priority to be restored (lowered only) as part of the eoi. The
> hardware does support setting the cpu priority independently.

This confirms, then, that the mask and unmask methods should be empty
for the xics.

> We should only be using this implicit masking for xics, and not the
> explicit masking, for any normal interrupt processing.

OK

> I don't know if
> this means making the mask/unmask setting a bit in software,

Used by whom?

> and the
> enable/disable to actually call what we do now on mask/unmask, or if
> it means we need a new flow type on real time.

Maybe a new flow type is not necessary, considering what you said.

> While calls to mask and unmask might work on level interrupts, they
> are really slow and will limit performance if done on every
> interrupt.
>
> > the non-threaded flow does (in interrupt context):
> >
> >   mask

Same whoops as above: no mask is done in the non-threaded case.

> >   handle interrupt
> >   unmask
> >   eoi
> >
> > the threaded flow does:
> >
> >   mask
> >   eoi
> >   handle interrupt
> >   unmask
>
> I think the flows we want on xics are:
>
> (non-threaded)
> getirq (implicit source specific mask until eoi)
> handle interrupt
> eoi (implicit cpu priority restore)

Yep

> (threaded)
> getirq (implicit source specific mask until eoi)
> explicit cpu priority restore

  ^ How do you go about doing that? Still not clear to me.

> handle interrupt
> eoi (implicit cpu priority restore to same as explicit level)
>
> Where the cpu priority restore allows receiving other interrupts of
> the same priority from the hardware.
> So I guess the question is: can the rt kernel interrupt processing
> take advantage of xics auto mask,

It should, but even mainline could benefit from it, I guess.

> or does someone need to write state
> tracking in the xics code to work around this, changing mask under
> interrupt to "defer eoi to unmask" (which I cannot see as clean, and
> which would have shutdown problems).

Thanks a lot Milton for those explanations,

  Sebastien.
diff -Nurp b/arch/powerpc/kernel/ibmebus.c a/arch/powerpc/kernel/ibmebus.c
--- b/arch/powerpc/kernel/ibmebus.c	2008-09-22 00:29:55.000000000 +0200
+++ a/arch/powerpc/kernel/ibmebus.c	2008-09-23 12:04:53.000000000 +0200
@@ -216,12 +216,16 @@ int ibmebus_request_irq(u32 ist, irq_han
 		unsigned long irq_flags, const char *devname, void *dev_id)
 {
+	int ret;
 	unsigned int irq = irq_create_mapping(NULL, ist);
 
 	if (irq == NO_IRQ)
 		return -EINVAL;
 
-	return request_irq(irq, handler, irq_flags, devname, dev_id);
+	ret = request_irq(irq, handler, irq_flags, devname, dev_id);
+	set_irq_type(irq, IRQ_TYPE_EDGE_RISING);
+
+	return ret;
 }
 
 EXPORT_SYMBOL(ibmebus_request_irq);