Message ID: b1a04379e05e40f9774fb4668609fac4255a6514.1411297686.git.agordeev@redhat.com
State: Not Applicable
Delegated to: David Miller
Hello, Alexander.

On Sun, Sep 21, 2014 at 03:19:28PM +0200, Alexander Gordeev wrote:
> Split interrupt service routine into hardware context handler
> and threaded context handler. That allows to protect ports with
> individual locks rather than with a single host-wide lock and
> move port interrupts handling out of the hardware interrupt
> context.
>
> Testing was done by transferring 8GB on two hard drives in
> parallel using command 'dd if=/dev/sd{a,b} of=/dev/null'. With
> lock_stat statistics I measured access times to ata_host::lock
> spinlock (since interrupt handler code is fully embraced with
> this lock). The average lock's holdtime decreased eight times
> while average waittime decreased two times.
>
> Both before and after the change the transfer time is the same,
> while 'perf record -e cycles:k ...' shows 1%-4% CPU time spent
> in ahci_single_irq_intr() routine before the update and not even
> sampled/shown ahci_single_irq_intr() after the update.

Hmmm... how does it affect single device operation tho? It does make
individual interrupt handling heavier, no?

Thanks.
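The split under discussion is the standard two-halves pattern of threaded IRQ handling: a minimal hard-IRQ half that only reads and acknowledges status registers, and a threaded half that does the per-port servicing. The sketch below is a userspace simulation of that flow, not the driver code itself; the structure loosely mirrors ahci_single_irq_intr()/ahci_thread_fn() from the patch, but the fake register model and all names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PORTS 32
#define IRQ_NONE        0
#define IRQ_HANDLED     1
#define IRQ_WAKE_THREAD 2

/* Fake hardware: one host-level status word plus per-port cause bits. */
struct fake_host {
    uint32_t irq_stat;                 /* which ports raised an interrupt */
    uint32_t port_irq_stat[MAX_PORTS]; /* per-port cause bits */
    uint32_t cached_stat[MAX_PORTS];   /* saved by the hard-IRQ half */
    int serviced[MAX_PORTS];           /* work done by the threaded half */
};

/* Hard-IRQ half: just read-and-clear status, defer the real work. */
static int hardirq_half(struct fake_host *host)
{
    uint32_t pending = host->irq_stat;

    if (!pending)
        return IRQ_NONE;            /* spurious, not ours */

    for (int i = 0; i < MAX_PORTS; i++) {
        if (!(pending & (1u << i)))
            continue;
        /* ack the port: read its status, clear it, cache for later */
        host->cached_stat[i] = host->port_irq_stat[i];
        host->port_irq_stat[i] = 0;
    }
    host->irq_stat = 0;             /* ack host-level status */
    return IRQ_WAKE_THREAD;         /* the kernel would now run the thread */
}

/* Threaded half: heavier per-port servicing, preemptible in real life. */
static int thread_half(struct fake_host *host)
{
    for (int i = 0; i < MAX_PORTS; i++) {
        if (host->cached_stat[i]) {
            host->serviced[i] = 1;  /* stands in for per-port handling */
            host->cached_stat[i] = 0;
        }
    }
    return IRQ_HANDLED;
}
```

In the real driver the cached status lives in the port's private data and the threaded half takes each port's own lock before servicing, which is what allows the host-wide lock to drop out of the hot path.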
On Tue, Sep 23, 2014 at 04:57:10PM -0400, Tejun Heo wrote:
> Hmmm... how does it affect single device operation tho? It does make
> individual interrupt handling heavier, no?

I think it is difficult to assess "individual interrupt handling", since
it depends on both the hardware and the device access pattern. On the
system I use the results are rather counter-intuitive: neither
ahci_thread_fn() nor ahci_single_irq_intr() shows up in the perf report
at all, while before the change ahci_single_irq_intr() reported 0.00%.

But since the handling is split in two parts it is rather incorrect to
apply the same metric to the threaded context. Obviously, the threaded
handler is expected to be slowed down by other interrupt handlers, but
the whole system should benefit from it, which is exactly the aim of
this change.
Hello, Alexander.

On Wed, Sep 24, 2014 at 11:42:15AM +0100, Alexander Gordeev wrote:
> On Tue, Sep 23, 2014 at 04:57:10PM -0400, Tejun Heo wrote:
> > Hmmm... how does it affect single device operation tho? It does make
> > individual interrupt handling heavier, no?
>
> I think it is difficult to assess "individual interrupt handling", since
> it depends from both the hardware and device access pattern. On the system
> I use the results are rather counter-intuitive: ahci_thread_fn() does not
> show up in perf report at all, nor ahci_single_irq_intr(). While before
> the change ahci_single_irq_intr() reported 0.00%.
>
> But since the handling is split in two parts it is rather incorrect to
> apply the same metric to the threaded context. Obviously, the threaded
> handler is expected slowed down by other interrupts handlers, but the
> whole system should benefit from it, which is exactly the aim of this
> change.

Hmmm, how would the whole system benefit from it if there's only a
single device? Each individual servicing of the interrupt does more now,
which includes scheduling, and that may end up adding to completion
latency.

The thing I don't get is why multiple MSI handling and this patchset are
tied to threaded interrupt handling. Splitting locks doesn't necessarily
have much to do with threaded handling, and it's not like ahci interrupt
handling is heavy. The hot path is pretty short actually. The meat of
the work - completing requests and propagating completions - is
offloaded to softirq by the block layer anyway.

Just to be clear, I'm not against the proposed changes but wanna
understand the justifications behind them.

Thanks.
On Wed, 24 Sep 2014 09:04:44 -0400
Tejun Heo <tj@kernel.org> wrote:

> Hello, Alexander.
>
> On Wed, Sep 24, 2014 at 11:42:15AM +0100, Alexander Gordeev wrote:
> > On Tue, Sep 23, 2014 at 04:57:10PM -0400, Tejun Heo wrote:
> > > Hmmm... how does it affect single device operation tho? It does make
> > > individual interrupt handling heavier, no?
> >
> > I think it is difficult to assess "individual interrupt handling", since
> > it depends from both the hardware and device access pattern. On the system
> > I use the results are rather counter-intuitive: ahci_thread_fn() does not
> > show up in perf report at all, nor ahci_single_irq_intr(). While before
> > the change ahci_single_irq_intr() reported 0.00%.
> >
> > But since the handling is split in two parts it is rather incorrect to
> > apply the same metric to the threaded context. Obviously, the threaded
> > handler is expected slowed down by other interrupts handlers, but the
> > whole system should benefit from it, which is exactly the aim of this
> > change.
>
> Hmmm, how would the whole system benefit from it if there's only
> single device? Each individual servicing of the interrupt does more
> now which includes scheduling which may end up adding to completion
> latency.

I think he meant other, non-AHCI, interrupt handlers would benefit. A
good test of this patch might be to stream 10Gb ethernet while also
streaming writes to an AHCI device.

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 24, 2014 at 08:27:12AM -0500, Chuck Ebbert wrote:
> I think he meant other, non-AHCI, interrupt handlers would benefit. A
> good test of this patch might be to stream 10Gb ethernet while also
> streaming writes to an AHCI device.

I'm a bit doubtful this'd make a noticeable difference. The ahci
interrupt handler doesn't do that much to begin with.

Thanks.
On Wed, Sep 24, 2014 at 09:04:44AM -0400, Tejun Heo wrote:
> Hello, Alexander.
>
> On Wed, Sep 24, 2014 at 11:42:15AM +0100, Alexander Gordeev wrote:
> > I think it is difficult to assess "individual interrupt handling", since
> > it depends from both the hardware and device access pattern. On the system
> > I use the results are rather counter-intuitive: ahci_thread_fn() does not
> > show up in perf report at all, nor ahci_single_irq_intr(). While before
> > the change ahci_single_irq_intr() reported 0.00%.
> >
> > But since the handling is split in two parts it is rather incorrect to
> > apply the same metric to the threaded context. Obviously, the threaded
> > handler is expected slowed down by other interrupts handlers, but the
> > whole system should benefit from it, which is exactly the aim of this
> > change.
>
> Hmmm, how would the whole system benefit from it if there's only
> single device? Each individual servicing of the interrupt does more
> now which includes scheduling which may end up adding to completion
> latency.

As Chuck noticed, non-AHCI hardware context handlers will benefit.

> The thing I don't get is why multiple MSI handling and this patchset
> are tied to threaded interrupt handling.

Multiple MSIs were implemented with the above aim (let's say aim #1)
right away. Single MSI/IRQ handling is getting updated with this series.

> Splitting locks don't
> necessarily have much to do with threaded handling and it's not like
> ahci interrupt handling is heavy. The hot path is pretty short
> actually. The meat of the work - completing requests and propagating
> completions - is offloaded to softirq by block layer anyway.

So the aim (let's say aim #2) is to avoid any of those competing with
the hardware context handler. IOW, not to wait on host/port spinlocks
with local interrupts disabled unnecessarily.

I assume that if the two interrupt contexts had existed at the time the
original handlers were written, they would have been written the way I
propose now :)

> Just to be clear, I'm not against the proposed changes but wanna
> understand the justifications behind them.

Should I send the fixed series? ;)
Hello, Alexander.

On Wed, Sep 24, 2014 at 03:08:44PM +0100, Alexander Gordeev wrote:
> > Hmmm, how would the whole system benefit from it if there's only
> > single device? Each individual servicing of the interrupt does more
> > now which includes scheduling which may end up adding to completion
> > latency.
>
> As Chuck noticed, non-AHCI hardware context handlers will benefit.

Maybe I'm off but I'm kinda skeptical that we'd be gaining back the
overhead we pay by punting to a thread.

> > The thing I don't get is why multiple MSI handling and this patchset
> > are tied to threaded interrupt handling.
>
> Multiple MSIs were implemented with the above aim (let's say aim #1)
> right away. Single MSI/IRQ handling is getting updated with this series.

Yeah, I get that. I'm curious whether that was justified.

> > Splitting locks don't
> > necessarily have much to do with threaded handling and it's not like
> > ahci interrupt handling is heavy. The hot path is pretty short
> > actually. The meat of the work - completing requests and propagating
> > completions - is offloaded to softirq by block layer anyway.
>
> So the aim (let's say aim #2) is to avoid any of those to compete with
> hardware context handler. IOW, not to wait on host/port spinlocks with
> local interrupts disabled unnecessarily.
>
> I assume, if at the time of writing of original handlers the two
> interrupt context existed, they were written the way I propose now :)

Maybe it makes sense with many high speed devices attached to a single
host; otherwise, I think we'd prolly be paying more than we're gaining.
Lock splitting itself is likely beneficial, as our issue path is a lot
heavier than the completion path, but I'm not too sure about splitting
completion contexts, especially given that completions for the block
layer and up are already punted to softirq.

Would it be possible for you to compare threaded vs. unthreaded under
relatively heavy load? I.e. let the interrupt handler access the irq
status under the host lock, but release it and then go through per-port
locks from the interrupt handler.

Thanks for doing this!
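Tejun's unthreaded alternative - take the host lock only long enough to snapshot the IRQ status, then walk the pending ports under their own per-port locks, all still in hard-IRQ context - could be sketched roughly as follows. This is a userspace simulation under stated assumptions; the lock flags and names are illustrative stand-ins for ata_host::lock and the per-port spinlocks, not actual libata code.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PORTS 32

struct sim_host {
    int host_lock_held;              /* stands in for ata_host::lock */
    uint32_t irq_stat;               /* pending port mask */
    int port_lock_held[MAX_PORTS];   /* stands in for per-port locks */
    int handled[MAX_PORTS];
};

/* Snapshot host status under the host lock, then drop it and do the
 * per-port work under each port's own lock - no thread hand-off. */
static int sim_host_intr(struct sim_host *h)
{
    h->host_lock_held = 1;           /* spin_lock(&host->lock) */
    uint32_t pending = h->irq_stat;  /* read host IRQ status */
    h->irq_stat = 0;                 /* ack it */
    h->host_lock_held = 0;           /* spin_unlock(&host->lock) */

    int n = 0;
    for (int i = 0; i < MAX_PORTS; i++) {
        if (!(pending & (1u << i)))
            continue;
        h->port_lock_held[i] = 1;    /* spin_lock(&pp->lock) */
        h->handled[i] = 1;           /* per-port interrupt handling */
        h->port_lock_held[i] = 0;    /* spin_unlock(&pp->lock) */
        n++;
    }
    return n;
}
```

The point of the design is that the host-wide lock covers only the cheap status read, so ports on other CPUs contend only on their individual locks while the handler never leaves hard-IRQ context.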
On Wed, Sep 24, 2014 at 10:39:13AM -0400, Tejun Heo wrote:
> Would it be possible for you compare threaded vs. unthreaded under
> relatively heavy load?

I will try, although not quite soon.

In the meantime I could fix and resend patches 1,2,3 and 6 as
they are not related to this topic. Makes sense?
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Tejun Heo
...
> The thing I don't get is why multiple MSI handling and this patchset
> are tied to threaded interrupt handling. Splitting locks don't
> necessarily have much to do with threaded handling and it's not like
> ahci interrupt handling is heavy. The hot path is pretty short
> actually. The meat of the work - completing requests and propagating
> completions - is offloaded to softirq by block layer anyway.

blk-mq/scsi-mq chose to move all completion work into hardirq context,
so this seems headed in a different direction.
On Wed, Sep 24, 2014 at 03:59:07PM +0100, Alexander Gordeev wrote:
> On Wed, Sep 24, 2014 at 10:39:13AM -0400, Tejun Heo wrote:
> > Would it be possible for you compare threaded vs. unthreaded under
> > relatively heavy load?
>
> I will try, although not quite soon.
>
> In the meantime I could fix and resend patches 1,2,3 and 6 as
> they are not related to this topic. Makes sense?

Yeap, sure thing. Thanks.
On Wed, Sep 24, 2014 at 10:39:13AM -0400, Tejun Heo wrote:
> Hello, Alexander.
>
> On Wed, Sep 24, 2014 at 03:08:44PM +0100, Alexander Gordeev wrote:
> > > Hmmm, how would the whole system benefit from it if there's only
> > > single device? Each individual servicing of the interrupt does more
> > > now which includes scheduling which may end up adding to completion
> > > latency.
> >
> > As Chuck noticed, non-AHCI hardware context handlers will benefit.
>
> Maybe I'm off but I'm kinda skeptical that we'd be gaining back the
> overhead we pay by punting to a thread.

Hi Tejun,

As odd as it sounds, I did not mention that there is *no* change in IO
performance at all (on my system): neither with one drive nor with two.
The change is only about how the interrupt handlers co-exist with other
devices.

I am attaching excerpts from some new perf tests I have done (this time
in legacy interrupt mode). As you can notice, ahci_interrupt() CPU time
drops from 4% to none.

As for your concern wrt threaded handler invocation overhead - I am not
quite sure here, but perhaps the SCHED_FIFO policy (which the handler
runs with) makes the difference? Anyway, as said above, the overall IO
does not suffer.
On Wed, Oct 01, 2014 at 04:31:14PM +0100, Alexander Gordeev wrote:
> I am attaching excerpts from some new perf tests I have done (this
Attaching :)
Hey, Alexander.

On Wed, Oct 01, 2014 at 04:31:15PM +0100, Alexander Gordeev wrote:
> As of your concern wrt threaded handler invocation overhead - I am
> not quite sure here, but if SCHED_FIFO policy (the handler runs with)
> makes the difference? Anyway, as said above the overall IO does not
> suffer.

Hmmm.... so, AFAICS, there are no real pros or cons of going either way,
right? The only thing which could be different is possibly slightly
lower latency in servicing other IRQs or RT tasks on the same CPU, but
given that the ahci IRQ handler already doesn't do anything which takes
time, I'm doubtful whether that'd be anything measurable. I just don't
get why ahci bothers with threaded irq, MMSI or not.

Thanks.
A bit of addition.

On Sat, Oct 04, 2014 at 10:23:11PM -0400, Tejun Heo wrote:
> Hmmm.... so, AFAICS, there's no real pros or cons of going either way,
> right? The only thing which could be different is possibly slightly
> lower latency in servicing other IRQs or RT tasks on the same CPU but
> given that the ahci IRQ handler already doesn't do anything which
> takes time, I'm doubtful whether that'd be anything measureable.
>
> I just don't get why ahci bothers with threaded irq, MMSI or not.

I think the thing which bothers me is that due to softirq we end up
bouncing the context twice. The IRQ handler schedules the threaded IRQ
handler after doing a minimal amount of work. The threaded IRQ handler
gets scheduled and again it doesn't do much but basically just schedules
the block softirq to actually run completions, which is the heavier
part. Apparently this doesn't seem to hurt measurably, but it's just
weird. Why are we bouncing the context twice?

Thanks.
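Tejun's objection can be made concrete by counting hand-offs: under the patched scheme one completion crosses two context boundaries (hardirq -> IRQ thread -> block softirq) before the request completion callbacks actually run. The toy model below just traces those stages; the stage names and counter are illustrative, not kernel API.

```c
#include <assert.h>

/* Toy model of the completion path being discussed: each stage does
 * minimal bookkeeping and then defers to the next context. */
enum stage { STAGE_HARDIRQ, STAGE_THREAD, STAGE_SOFTIRQ, STAGE_DONE };

static int bounces; /* context hand-offs before the request completes */

static enum stage run_stage(enum stage s)
{
    switch (s) {
    case STAGE_HARDIRQ:   /* ack hardware, wake the IRQ thread */
        bounces++;
        return STAGE_THREAD;
    case STAGE_THREAD:    /* read port status, raise block softirq */
        bounces++;
        return STAGE_SOFTIRQ;
    case STAGE_SOFTIRQ:   /* run request completion callbacks */
        return STAGE_DONE;
    default:
        return STAGE_DONE;
    }
}

/* Drive one interrupt through all stages; returns the hand-off count. */
static int complete_one_request(void)
{
    enum stage s = STAGE_HARDIRQ;

    bounces = 0;
    while (s != STAGE_DONE)
        s = run_stage(s);
    return bounces;
}
```

Dropping the threaded stage (the unthreaded variant discussed earlier in the thread) would collapse this to a single hand-off, hardirq straight to softirq, which is the shape Tejun is arguing for.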
On Sun, Oct 05, 2014 at 12:16:46PM -0400, Tejun Heo wrote:
> I think the thing which bothers me is that due to softirq we end up
> bouncing the context twice. IRQ schedules threaded IRQ handler after
> doing minimal amount of work. The threaded IRQ handler gets scheduled
> and again it doesn't do much but basically just schedules block
> softirq to actually run completions which is the heavier part.
> Apparently this doesn't seem to hurt measureably but it's just weird.

Hi Tejun,

That is exactly the point I was concerned with when I stated in one of
the changelogs: "The downside of this change is introduction of a kernel
thread".

Splitting the service routine in two parts is a small change (in terms
of code familiarity). Yet it right away provides benefits I could
observe and justify (to myself at least).

> Why are we bouncing the context twice?

I *did* consider moving the threaded handler code to the softirq part.
I just wanted to get the updates in stages: to address hardware
interrupt latency first and possibly the threaded handler next. Getting
these two done together would be too big a change for me ;)
Hello, Alexander.

On Mon, Oct 06, 2014 at 08:27:11AM +0100, Alexander Gordeev wrote:
> > Why are we bouncing the context twice?
>
> I *did* consider moving the threaded handler code to the softirq part.
> I just wanted to get updates in stages: to address hardware interrupts
> latency first and possibly threaded hander next. Getting done these two
> together would be too big change for me ;)

I don't think we'd be able to move libata handling to block softirq and
probably end up just doing it from the irq context. Anyways, as long as
you're gonna keep working on it, I have no objection to the proposed
changes. Do you have a refreshed version or is the current version good
for inclusion?

Thanks.
On Mon, Oct 06, 2014 at 08:58:17AM -0400, Tejun Heo wrote:
> I don't think we'd be able to move libata handling to block softirq
> and probably end up just doing it from the irq context. Anyways, as
> long as you're gonna keep working on it, I have no objection to the
> proposed changes. Do you have a refreshed version or is the current
> version good for inclusion?

No, this one would not apply. I can send an updated version on top of
the v5 I posted earlier. Should I?
On Mon, Oct 06, 2014 at 02:24:46PM +0100, Alexander Gordeev wrote:
> On Mon, Oct 06, 2014 at 08:58:17AM -0400, Tejun Heo wrote:
> > I don't think we'd be able to move libata handling to block softirq
> > and probably end up just doing it from the irq context. Anyways, as
> > long as you're gonna keep working on it, I have no objection to the
> > proposed changes. Do you have a refreshed version or is the current
> > version good for inclusion?
>
> No, this one would not apply. I can send updated version on top of
> v5 I posted earlier. Should I?

Yeap, please do so. Thanks a lot for your patience! :)
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 4a849f8..0a6d112 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1280,6 +1280,31 @@ out_free_irqs:
 	return rc;
 }
 
+static int ahci_host_activate_single_irq(struct ata_host *host, int irq,
+					 struct scsi_host_template *sht)
+{
+	int i, rc;
+
+	rc = ata_host_start(host);
+	if (rc)
+		return rc;
+
+	rc = devm_request_threaded_irq(host->dev, irq, ahci_single_irq_intr,
+				       ahci_thread_fn, IRQF_SHARED,
+				       dev_driver_string(host->dev), host);
+	if (rc)
+		return rc;
+
+	for (i = 0; i < host->n_ports; i++)
+		ata_port_desc(host->ports[i], "irq %d", irq);
+
+	rc = ata_host_register(host, sht);
+	if (rc)
+		devm_free_irq(host->dev, irq, host);
+
+	return rc;
+}
+
 /**
  * ahci_host_activate - start AHCI host, request IRQs and register it
  * @host: target ATA host
@@ -1305,8 +1330,8 @@ int ahci_host_activate(struct ata_host *host, int irq,
 	if (hpriv->flags & AHCI_HFLAG_MULTI_MSI)
 		rc = ahci_host_activate_multi_irqs(host, irq, sht);
 	else
-		rc = ata_host_activate(host, irq, ahci_single_irq_intr,
-				       IRQF_SHARED, sht);
+		rc = ahci_host_activate_single_irq(host, irq, sht);
+
 	return rc;
 }
 EXPORT_SYMBOL_GPL(ahci_host_activate);
diff --git a/drivers/ata/ahci.h b/drivers/ata/ahci.h
index 44c02f7..c12f590 100644
--- a/drivers/ata/ahci.h
+++ b/drivers/ata/ahci.h
@@ -390,6 +390,7 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 int ahci_reset_em(struct ata_host *host);
 irqreturn_t ahci_single_irq_intr(int irq, void *dev_instance);
 irqreturn_t ahci_multi_irqs_intr(int irq, void *dev_instance);
+irqreturn_t ahci_thread_fn(int irq, void *dev_instance);
 irqreturn_t ahci_port_thread_fn(int irq, void *dev_instance);
 void ahci_print_info(struct ata_host *host, const char *scc_s);
 int ahci_host_activate(struct ata_host *host, int irq,
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index cbe7757..169c272 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -1778,17 +1778,6 @@ static void ahci_handle_port_interrupt(struct ata_port *ap, u32 status)
 	}
 }
 
-static void ahci_port_intr(struct ata_port *ap)
-{
-	void __iomem *port_mmio = ahci_port_base(ap);
-	u32 status;
-
-	status = readl(port_mmio + PORT_IRQ_STAT);
-	writel(status, port_mmio + PORT_IRQ_STAT);
-
-	ahci_handle_port_interrupt(ap, status);
-}
-
 irqreturn_t ahci_port_thread_fn(int irq, void *dev_instance)
 {
 	struct ata_port *ap = dev_instance;
@@ -1810,6 +1799,35 @@ irqreturn_t ahci_port_thread_fn(int irq, void *dev_instance)
 }
 EXPORT_SYMBOL_GPL(ahci_port_thread_fn);
 
+irqreturn_t ahci_thread_fn(int irq, void *dev_instance)
+{
+	struct ata_host *host = dev_instance;
+	struct ahci_host_priv *hpriv = host->private_data;
+	u32 irq_masked = hpriv->port_map;
+	unsigned int i;
+
+	for (i = 0; i < host->n_ports; i++) {
+		struct ata_port *ap;
+
+		if (!(irq_masked & (1 << i)))
+			continue;
+
+		ap = host->ports[i];
+		if (ap) {
+			ahci_port_thread_fn(irq, ap);
+			VPRINTK("port %u\n", i);
+		} else {
+			VPRINTK("port %u (no irq)\n", i);
+			if (ata_ratelimit())
+				dev_warn(host->dev,
+					 "interrupt on disabled port %u\n", i);
+		}
+	}
+
+	return IRQ_HANDLED;
+}
+EXPORT_SYMBOL_GPL(ahci_thread_fn);
+
 static void ahci_update_intr_status(struct ata_port *ap)
 {
 	void __iomem *port_mmio = ahci_port_base(ap);
@@ -1908,7 +1926,7 @@ irqreturn_t ahci_single_irq_intr(int irq, void *dev_instance)
 
 		ap = host->ports[i];
 		if (ap) {
-			ahci_port_intr(ap);
+			ahci_update_intr_status(ap);
 			VPRINTK("port %u\n", i);
 		} else {
 			VPRINTK("port %u (no irq)\n", i);
@@ -1935,7 +1953,7 @@ irqreturn_t ahci_single_irq_intr(int irq, void *dev_instance)
 
 	VPRINTK("EXIT\n");
 
-	return IRQ_RETVAL(handled);
+	return handled ? IRQ_WAKE_THREAD : IRQ_NONE;
 }
 EXPORT_SYMBOL_GPL(ahci_single_irq_intr);
 
@@ -2348,13 +2366,8 @@ static int ahci_port_start(struct ata_port *ap)
 	 */
 	pp->intr_mask = DEF_PORT_IRQ;
 
-	/*
-	 * Switch to per-port locking in case each port has its own MSI vector.
-	 */
-	if ((hpriv->flags & AHCI_HFLAG_MULTI_MSI)) {
-		spin_lock_init(&pp->lock);
-		ap->lock = &pp->lock;
-	}
+	spin_lock_init(&pp->lock);
+	ap->lock = &pp->lock;
 
 	ap->private_data = pp;
Split interrupt service routine into hardware context handler
and threaded context handler. That allows protecting ports with
individual locks rather than with a single host-wide lock and
moving port interrupt handling out of the hardware interrupt
context.

Testing was done by transferring 8GB on two hard drives in
parallel using the command 'dd if=/dev/sd{a,b} of=/dev/null'. With
lock_stat statistics I measured access times to the ata_host::lock
spinlock (since the interrupt handler code is fully embraced by
this lock). The average lock holdtime decreased eight times
while the average waittime decreased two times.

Both before and after the change the transfer time is the same,
while 'perf record -e cycles:k ...' shows 1%-4% CPU time spent
in the ahci_single_irq_intr() routine before the update, and
ahci_single_irq_intr() is not even sampled/shown after the update.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: linux-ide@vger.kernel.org
---
 drivers/ata/ahci.c    | 29 ++++++++++++++++++++++++++--
 drivers/ata/ahci.h    |  1 +
 drivers/ata/libahci.c | 53 ++++++++++++++++++++++++++++++++-------------------
 3 files changed, 61 insertions(+), 22 deletions(-)