Message ID | 1399577392-4457-1-git-send-email-nhorman@tuxdriver.com
---|---
State | RFC, archived
Delegated to: | David Miller
Headers | show
From: Neil Horman
> What about something like this? It's not even compile tested, but let me know
> what you think of the idea. The reasoning behind it is that transient resources
> like dma address ranges in the iommu or swiotlb have the following attributes
>
> 1) they are quickly allocated and freed

I'm not sure that is true for iommu entries.
The ones allocated for ethernet receive are effectively permanently allocated.

Imagine a system with 512 iommu entries.
An ethernet driver allocates 128 RX ring entries using one iommu entry each.
There are now no iommu entries left for anything else.

That system will only work if the ethernet driver reduces the number of
active rx buffers.

It is also possible (but less likely) that ethernet transmit will use so many
iommu entries that none are left for more important things. The network will
work with only one active transmit, but you may have to do disk and/or USB
transfers even when resource limited.

	David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 09, 2014 at 08:55:10AM +0000, David Laight wrote:
> From: Neil Horman
> > What about something like this? It's not even compile tested, but let me know
> > what you think of the idea. The reasoning behind it is that transient resources
> > like dma address ranges in the iommu or swiotlb have the following attributes
> >
> > 1) they are quickly allocated and freed
>
> I'm not sure that is true for iommu entries.
> The ones allocated for ethernet receive are effectively permanently allocated.
>
I disagree. A review of several NIC drivers shows the pseudocode for the RX
path to be:

    For each SKB1 on the RX ring:
        If LENGTH(SKB1) < COPYBREAK
            SKB2 = ALLOCATE_SKB
            COPY_DATA(SKB2, SKB1)
            RECEIVE(SKB2)
        Else
            UNMAP(SKB1)
            RECEIVE(SKB1)
            SKB1 = ALLOCATE_SKB
            MAP(SKB1)
    Done

The value of COPYBREAK is configurable, but is never more than 256 bytes, and is
often 128 or fewer bytes (sometimes zero). This will cause some UDP traffic to
get handled as copies, but not more reasonably sized UDP packets, and no well
behaved TCP traffic will ever get copied. Those iommu entries will come and go
very quickly.

> Imagine a system with 512 iommu entries.
> An ethernet driver allocates 128 RX ring entries using one iommu entry each.
> There are now no iommu entries left for anything else.

That actually leaves 384 entries remaining, but that's neither here nor there :).
iommus work like TLBs, in that they don't have a fixed number of entries. Each
iommu has a set of page tables, wherein a set of pages can be mapped. If a
packet fits within a single page, then yes, it takes up a single 'pte'; if a
packet takes multiple pages (say a GSO packet for example), or if some other
device is doing lots of high volume dma (FCoE/iSCSI/RoCE/InfiniBand), then
multiple PTEs will be taken up handling the larger data buffer. It's a limited
resource shared unevenly between all dma-ing devices. That's why we can't
reserve entries: you don't have a lot of space to begin with, and you don't
know how much you'll need until you have the data to send, which can vary
wildly depending on the device.

> That system will only work if the ethernet driver reduces the number of
> active rx buffers.
>
Reducing the number of active rx buffers is tantamount to reducing the ring
size of a NIC, which is already a tunable feature, and not one to be received
overly well by people trying to maximize their network throughput.

> It is also possible (but less likely) that ethernet transmit will
> use so many iommu entries that none are left for more important things.

This is possible in all cases, not just transmit.

> The network will work with only one active transmit, but you may
> have to do disk and/or usb transfers even when resource limited.
>
Hence my RFC patch in my prior note. If we're resource constrained, push back
on the qdisc such that we try not to use as many mappings for a short time
without causing too much overhead. It doesn't affect receive of course, but
it's very hard to deal with managing mapping use when the producer is not
directly controllable by us.

Neil
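The copybreak pattern in the pseudocode above can be sketched as userspace C.
This is purely illustrative: every name here (`rx_poll_one`, `alloc_skb_buf`,
`dma_map`, the `active_mappings` counter) is a hypothetical stand-in, not any
real driver's or kernel API, and the mapping bookkeeping is simulated with a
plain counter.

```c
#include <stdlib.h>
#include <string.h>

#define COPYBREAK 256           /* typical upper bound; often 128 or 0 */

struct skb { size_t len; unsigned char data[2048]; };

static int active_mappings;     /* simulated count of iommu entries in use */

static struct skb *alloc_skb_buf(size_t len)
{
	struct skb *s = calloc(1, sizeof(*s));
	s->len = len;
	return s;
}
static void dma_map(struct skb *s)   { (void)s; active_mappings++; }
static void dma_unmap(struct skb *s) { (void)s; active_mappings--; }
static void receive(struct skb *s)   { free(s); } /* hand up the stack */

/* One iteration of the RX loop described in the pseudocode. */
static void rx_poll_one(struct skb **slot)
{
	struct skb *skb = *slot;

	if (skb->len < COPYBREAK) {
		/* Small packet: copy into a fresh buffer; the ring buffer
		 * and its mapping stay in place, so no iommu churn. */
		struct skb *copy = alloc_skb_buf(skb->len);
		memcpy(copy->data, skb->data, skb->len);
		receive(copy);
	} else {
		/* Large packet: unmap and hand the buffer up, then map a
		 * replacement -- the iommu entry is freed and re-taken. */
		dma_unmap(skb);
		receive(skb);
		*slot = alloc_skb_buf(0);
		dma_map(*slot);
	}
}
```

The point of the model is visible in the two branches: only the else path ever
releases and re-acquires a mapping, which is why sub-COPYBREAK traffic leaves
the iommu entry count untouched.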
From: Neil Horman
> On Fri, May 09, 2014 at 08:55:10AM +0000, David Laight wrote:
> > From: Neil Horman
> > > What about something like this? It's not even compile tested, but let me know
> > > what you think of the idea. The reasoning behind it is that transient resources
> > > like dma address ranges in the iommu or swiotlb have the following attributes
> > >
> > > 1) they are quickly allocated and freed
> >
> > I'm not sure that is true for iommu entries.
> > The ones allocated for ethernet receive are effectively permanently allocated.
> >
> I disagree. A review of several NIC drivers shows the pseudocode for the RX
> path to be: [...]
>
> The value of COPYBREAK is configurable, but is never more than 256 bytes, and is
> often 128 or fewer bytes (sometimes zero). This will cause some UDP traffic to
> get handled as copies, but not more reasonably sized UDP packets, and no well
> behaved TCP traffic will ever get copied. Those iommu entries will come and go
> very quickly.

If I understand correctly, the iommu entries are needed for the ethernet
chip to access main memory. The usual state is that the RX ring is full of
buffers - all of which are mapped for dma.

> > Imagine a system with 512 iommu entries.
> > An ethernet driver allocates 128 RX ring entries using one iommu entry each.
> > There are now no iommu entries left for anything else.
>
> That actually leaves 384 entries remaining, but that's neither here nor there :).

I seem to have failed to write 'include 4 such interfaces and' ...

> iommus work like TLBs, in that they don't have a fixed number of entries.
> Each iommu has a set of page tables, wherein a set of pages can be mapped.
> ...

Yes, I realise that what is actually being allocated is io virtual address
space. But the '1 buffer' == '1 slot' simplification is reasonably appropriate
for some 'thought designs'.

> It's a limited resource shared unevenly between all dma-ing devices. That's
> why we can't reserve entries: you don't have a lot of space to begin with,
> and you don't know how much you'll need until you have the data to send,
> which can vary wildly depending on the device.

You do know how much resource is left though.

> > That system will only work if the ethernet driver reduces the number of
> > active rx buffers.
> >
> Reducing the number of active rx buffers is tantamount to reducing the ring
> size of a NIC, which is already a tunable feature, and not one to be received
> overly well by people trying to maximize their network throughput.

Except that it needs to be done automatically if there is a global constraint
(be it iommu space or dma-able memory). In isolation a large rx ring is almost
always an advantage.

> > It is also possible (but less likely) that ethernet transmit will
> > use so many iommu entries that none are left for more important things.
> This is possible in all cases, not just transmit.
>
> > The network will work with only one active transmit, but you may
> > have to do disc and/or usb transfers even when resource limited.
> >
> Hence my RFC patch in my prior note. If we're resource constrained, push back
> on the qdisc such that we try not to use as many mappings for a short time
> without causing too much overhead. It doesn't affect receive of course, but
> it's very hard to deal with managing mapping use when the producer is not
> directly controllable by us.

But ethernet receive is likely to be the big user of iommu entries.
If you constrain it, then there probably won't be allocation failures
elsewhere.

	David
On Fri, May 09, 2014 at 12:53:29PM +0000, David Laight wrote:
> From: Neil Horman
> > On Fri, May 09, 2014 at 08:55:10AM +0000, David Laight wrote:
> > > From: Neil Horman
> > > > What about something like this? It's not even compile tested, but let me know
> > > > what you think of the idea. The reasoning behind it is that transient resources
> > > > like dma address ranges in the iommu or swiotlb have the following attributes
> > > >
> > > > 1) they are quickly allocated and freed
> > >
> > > I'm not sure that is true for iommu entries.
> > > The ones allocated for ethernet receive are effectively permanently allocated.
> > >
> > I disagree. A review of several NIC drivers shows the pseudocode for the RX
> > path to be: [...]
> >
> > The value of COPYBREAK is configurable, but is never more than 256 bytes,
> > and is often 128 or fewer bytes (sometimes zero). [...] Those iommu entries
> > will come and go very quickly.
>
> If I understand correctly, the iommu entries are needed for the ethernet
> chip to access main memory. The usual state is that the RX ring is full of
> buffers - all of which are mapped for dma.
>
This is true, but those buffers are not static; they are unmapped, and new ones
are mapped in their place on receive, so there's an opportunity for those
unmapped buffers to get used by other entities/hardware during the receive
process. We could do something as you suggest, in which we create an API to
reserve the address space for a buffer and just continually reuse them, but as
Dave points out, it's a limited resource and reserving them may be unfair to
other transient users, especially given that the RX path has to 'over-reserve',
allocating space for the largest packet size receivable, even if we get less
than that.

> > > Imagine a system with 512 iommu entries.
> > > An ethernet driver allocates 128 RX ring entries using one iommu entry each.
> > > There are now no iommu entries left for anything else.
> >
> > That actually leaves 384 entries remaining, but that's neither here nor there :).
> I seem to have failed to write 'include 4 such interfaces and' ...
>
Ah, that makes more sense.

> > iommus work like TLBs, in that they don't have a fixed number of entries.
> > Each iommu has a set of page tables, wherein a set of pages can be mapped.
> > ...
>
> Yes, I realise that what is actually being allocated is io virtual address space.
> But the '1 buffer' == '1 slot' simplification is reasonably appropriate for
> some 'thought designs'.
>
Only partly. Contiguity is also a factor here. You might have 1000 slots
remaining, but under heavy use your largest contiguous slot could be
significantly smaller, which is what a dma-using bit of code is actually
interested in. Keeping track of the largest contiguous slot is significantly
more difficult, and it's all still a bit dicey, because the largest slot might
change between the time a device asks what it is and when it actually
allocates it.

> > It's a limited resource shared unevenly between all dma-ing devices.
> > That's why we can't reserve entries: you don't have a lot of space to begin
> > with, and you don't know how much you'll need until you have the data to
> > send, which can vary wildly depending on the device.
>
> You do know how much resource is left though.
>
See above: you can certainly tell in aggregate how much free space is
available, but not what your largest free chunk is.

> > > That system will only work if the ethernet driver reduces the number of
> > > active rx buffers.
> > >
> > Reducing the number of active rx buffers is tantamount to reducing the ring
> > size of a NIC, which is already a tunable feature, and not one to be
> > received overly well by people trying to maximize their network throughput.
>
> Except that it needs to be done automatically if there is a global constraint
> (be it iommu space or dma-able memory).

2 things:

1) Need is really a strong term here. The penalty for failing a dma mapping is
to drop the frame. That's not unacceptable in many use cases.

2) It seems to me that global constraint here implies a static, well known
number. While it's true we can interrogate an iommu and compare its mapping
size to the ring size of all the NICs/devices on a system to see if we're
likely to exceed the iommu space available, we shouldn't do that. If a given
NIC doesn't produce much traffic, its ring sizes aren't relevant to the
computation. We're not trying to address a static allocation scheme here. If a
system boots, it implies that all the receive rings on all the devices were
able to reserve the amount of space they needed in the iommu (as you note
earlier, they populate their rings on init, effectively doing an iommu
reservation). The problem we're addressing is the periodic lack of space that
arises from temporary exhaustion of iommu space under heavy I/O loads. We
won't know if that happens until it happens, and we can't just allocate for
the worst case, because then we're sure to run out of space as devices scale
up. Sharing is the way to do this whenever possible.

> In isolation a large rx ring is almost always an advantage.
>
No argument there. But requiring a user to size a ring based on expected
traffic patterns seems like it won't be well received.

> > > It is also possible (but less likely) that ethernet transmit will
> > > use so many iommu entries that none are left for more important things.
> > This is possible in all cases, not just transmit.
> >
> > > The network will work with only one active transmit, but you may
> > > have to do disc and/or usb transfers even when resource limited.
> > >
> > Hence my RFC patch in my prior note. If we're resource constrained, push
> > back on the qdisc such that we try not to use as many mappings for a short
> > time without causing too much overhead. [...]
>
> But ethernet receive is likely to be the big user of iommu entries.
> If you constrain it, then there probably won't be allocation failures
> elsewhere.
>
What makes you say that? There's no reason a tx ring can't be just as full as a
receive ring under heavy traffic load. If you want to constrain receive
allocations, do so; we have a knob for that already.

Neil
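The aggregate-free vs. largest-contiguous distinction raised above can be
illustrated with a toy bitmap model. This is only a sketch of the idea, not
the kernel's iova allocator: `slots[i]` nonzero means slot i is mapped, and
both helper names are hypothetical.

```c
#include <stddef.h>

/* Total number of free slots in the map (the number an allocator can
 * report cheaply, in aggregate). */
static size_t total_free(const char *slots, size_t n)
{
	size_t nfree = 0;
	for (size_t i = 0; i < n; i++)
		if (!slots[i])
			nfree++;
	return nfree;
}

/* Largest run of contiguous free slots (what a multi-page dma mapping
 * actually needs, and the harder number to track). */
static size_t largest_contiguous(const char *slots, size_t n)
{
	size_t best = 0, run = 0;
	for (size_t i = 0; i < n; i++) {
		run = slots[i] ? 0 : run + 1;
		if (run > best)
			best = run;
	}
	return best;
}
```

With an alternating map {0,1,0,1,...} both helpers disagree sharply: half the
space is free in aggregate, yet no two-slot mapping can succeed, which is the
fragmentation problem described in the thread.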
From: Neil Horman [mailto:nhorman@tuxdriver.com]
...
> 2 things:
>
> 1) Need is really a strong term here. The penalty for failing a dma mapping
> is to drop the frame. That's not unacceptable in many use cases.

Indeed, but dropping an ethernet frame will be recovered by the higher layers.
While not ideal, it is a 'last resort' action.

Note that I'm not suggesting that your deferred retry of the transmit isn't
a good idea, just that it is probably papering over the cracks.

> 2) It seems to me that global constraint here implies a static, well known
> number. While it's true we can interrogate an iommu and compare its mapping
> size to the ring size of all the NICs/devices on a system to see if we're
> likely to exceed the iommu space available, we shouldn't do that. If a given
> NIC doesn't produce much traffic, its ring sizes aren't relevant to the
> computation.

An idle NIC will be using a lot of iommu entries for its receive buffers.

> We're not trying to address a static allocation scheme here. If a system
> boots, it implies that all the receive rings on all the devices were able to
> reserve the amount of space they needed in the iommu (as you note earlier,
> they populate their rings on init, effectively doing an iommu reservation).
> The problem we're addressing is the periodic lack of space that arises from
> temporary exhaustion of iommu space under heavy I/O loads. We won't know if
> that happens until it happens, and we can't just allocate for the worst case,
> because then we're sure to run out of space as devices scale up. Sharing is
> the way to do this whenever possible.

Do you have any data for which drivers have active iommu entries when an
allocate fails?

I can imagine systems where almost all the iommu entries are being used
for ethernet rx buffers, and everything else is fighting for the last
few entries.
	David
On Fri, May 09, 2014 at 03:46:14PM +0000, David Laight wrote:
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> ...
> > 2 things:
> >
> > 1) Need is really a strong term here. The penalty for failing a dma mapping
> > is to drop the frame. That's not unacceptable in many use cases.
>
> Indeed, but dropping an ethernet frame will be recovered by the higher layers.
> While not ideal, it is a 'last resort' action.

No argument there, but it's the best we have at the moment.

> Note that I'm not suggesting that your deferred retry of the transmit isn't
> a good idea, just that it is probably papering over the cracks.
>
Except that forcing everyone to a lower throughput based on worst case
scenarios doesn't seem like a better solution to me.

> > 2) It seems to me that global constraint here implies a static, well known
> > number. While it's true we can interrogate an iommu and compare its mapping
> > size to the ring size of all the NICs/devices on a system to see if we're
> > likely to exceed the iommu space available, we shouldn't do that. If a given
> > NIC doesn't produce much traffic, its ring sizes aren't relevant to the
> > computation.
>
> An idle NIC will be using a lot of iommu entries for its receive buffers.
>
Yes, and that's really the point! Only an idle NIC will be (mis)using a lot of
extra iommu entries. But we have no way to know if a NIC will be idle; we have
to reserve space for them in the iommu because that's how the hardware is
designed. The only recourse we have here is to reserve less space for each NIC
RX ring, which punishes those NICs that are active, which is the opposite of
what we really want to do here.

> > We're not trying to address a static allocation scheme here. If a system
> > boots, it implies that all the receive rings on all the devices were able
> > to reserve the amount of space they needed in the iommu (as you note
> > earlier, they populate their rings on init, effectively doing an iommu
> > reservation). The problem we're addressing is the periodic lack of space
> > that arises from temporary exhaustion of iommu space under heavy I/O loads.
> > We won't know if that happens until it happens, and we can't just allocate
> > for the worst case, because then we're sure to run out of space as devices
> > scale up. Sharing is the way to do this whenever possible.
>
> Do you have any data for which drivers have active iommu entries when an
> allocate fails?
>
No, but I'm not sure it matters which driver holds DMA descriptors when an
allocation failure occurs. I'm operating under the assumption here that
drivers are attempting to allocate a reasonable number of buffers for the work
they need to do. I'm not arguing that we can't reclaim space by forcing any
given driver to allocate less, only that doing so isn't helpful, in that it
will just decrease receive throughput. We're trading in one problem for
another.

> I can imagine systems where almost all the iommu entries are being used
> for ethernet rx buffers, and everything else is fighting for the last
> few entries.
>
I would argue that it's not quite as unbalanced as all/few, but yes, your
point is sound in that the tx path is fighting for a reduced pool of shared
buffers because of the nature of the RX path's pre-allocation needs. We could
reduce the number of those buffers statically, but that's really an
administrative job, because the kernel never really knows when a NIC is going
to be a high volume producer or consumer.

Hmm, that actually makes me wonder: is this a job for something like tuned, up
in user space? Or a combination of something like my backoff patch and tuned?
The backoff patch helps the tx path in case of an actual exhaustion, and tuned
can be administratively loaded with a profile to indicate which NICs are
likely to be low volume, and therefore can have their ring sizes reduced,
giving the iommu the maximum free pool to service all the other dma users?
What do you think?
Neil
On Thu, May 08, 2014 at 03:29:52PM -0400, Neil Horman wrote:
> What about something like this? It's not even compile tested, but let me know
> what you think of the idea. The reasoning behind it is that transient resources
> like dma address ranges in the iommu or swiotlb have the following attributes
>
> 1) they are quickly allocated and freed
>
> 2) Usually handled by simply trying again at some later point in time
>
> 3) Likely to fail again if tried again immediately.
>
> 4) Not conducive to interlocked signaling mechanisms, as the time it takes to
> find and signal a waiting tx queue that resources are now available may take
> longer than needed, and may still result in failure, as competing allocators
> may race in and claim said resources during the signaling period.
>
> As such, what if we try something more simple like a linear backoff? In this
> example we add a TX return code that indicates that the driver is not busy,
> but unable to transmit due to resource constraints. This signals the qdisc
> layer to skip trying to drain this transmit queue for a short period of time,
> with subsequent similar errors causing increased backoff. Once the condition
> is cleared, the backoff delay is removed and operation returns to normal.
>
> Thoughts?
>
> Neil
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---
>  include/linux/netdevice.h |  4 ++++
>  net/sched/sch_generic.c   | 20 ++++++++++++++++++++
>  2 files changed, 24 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index a803d79..2b88ccd 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -105,6 +105,7 @@ enum netdev_tx {
>  	NETDEV_TX_OK	 = 0x00,	/* driver took care of packet */
>  	NETDEV_TX_BUSY	 = 0x10,	/* driver tx path was busy*/
>  	NETDEV_TX_LOCKED = 0x20,	/* driver tx lock was already taken */
> +	NETDEV_TX_BACKOFF = 0x40,	/* driver resource constrained */
>  };
>  typedef enum netdev_tx netdev_tx_t;
>
> @@ -572,6 +573,9 @@ struct netdev_queue {
>
>  	unsigned long		state;
>
> +	unsigned long		backoff_count;
> +	unsigned long		skip_count;
> +
>  #ifdef CONFIG_BQL
>  	struct dql		dql;
>  #endif
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index e1543b0..64bf6fe 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -30,6 +30,8 @@
>  #include <net/pkt_sched.h>
>  #include <net/dst.h>
>
> +#define MAX_BACKOFF_COUNT 4
> +
>  /* Qdisc to use by default */
>  const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops;
>  EXPORT_SYMBOL(default_qdisc_ops);
> @@ -121,6 +123,15 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>  {
>  	int ret = NETDEV_TX_BUSY;
>
> +	/* Only attempt transmit modulo the backoff_count */
> +	if (txq->backoff_count) {
> +		if (txq->skip_count % txq->backoff_count) {
> +			txq->skip_count++;
> +			return ret;
> +		}
> +		txq->skip_count = 0;
> +	}
> +
>  	/* And release qdisc */
>  	spin_unlock(root_lock);
>
> @@ -135,9 +146,17 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>  	if (dev_xmit_complete(ret)) {
>  		/* Driver sent out skb successfully or skb was consumed */
>  		ret = qdisc_qlen(q);
> +		txq->backoff_count = 0;
>  	} else if (ret == NETDEV_TX_LOCKED) {
>  		/* Driver try lock failed */
>  		ret = handle_dev_cpu_collision(skb, txq, q);
> +		txq->backoff_count = 0;
> +	} else if (ret == NETDEV_TX_BACKOFF) {
> +		/* The driver claims to be resource constrained */
> +		if (txq->backoff_count != MAX_BACKOFF_COUNT)
> +			txq->backoff_count++;
> +		txq->skip_count++;
> +		ret = dev_requeue_skb(skb, q);
>  	} else {
>  		/* Driver returned NETDEV_TX_BUSY - requeue skb */
>  		if (unlikely(ret != NETDEV_TX_BUSY))
> @@ -145,6 +164,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>  				dev->name, ret, q->q.qlen);
>
>  		ret = dev_requeue_skb(skb, q);
> +		txq->backoff_count = 0;
>  	}
>
>  	if (ret && netif_xmit_frozen_or_stopped(txq))

Ping, Dave do you have any thoughts on this approach? Is it worth proposing
formally?

Neil
On Thu, May 8, 2014 at 12:29 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> What about something like this? It's not even compile tested, but let me know
> what you think of the idea. The reasoning behind it is that transient resources
> like dma address ranges in the iommu or swiotlb have the following attributes
>
> 1) they are quickly allocated and freed
>
> 2) Usually handled by simply trying again at some later point in time
>
> 3) Likely to fail again if tried again immediately.
>
> 4) Not conducive to interlocked signaling mechanisms, as the time it takes to
> find and signal a waiting tx queue that resources are now available may take
> longer than needed, and may still result in failure, as competing allocators
> may race in and claim said resources during the signaling period.
>
> As such, what if we try something more simple like a linear backoff? In this
> example we add a TX return code that indicates that the driver is not busy,
> but unable to transmit due to resource constraints. This signals the qdisc
> layer to skip trying to drain this transmit queue for a short period of time,
> with subsequent similar errors causing increased backoff. Once the condition
> is cleared, the backoff delay is removed and operation returns to normal.

Looks like this is a more general issue which should be solved for every
spinlock:

http://thread.gmane.org/gmane.linux.kernel/1437186
On Wed, May 21, 2014 at 11:05:28AM -0700, Cong Wang wrote:
> On Thu, May 8, 2014 at 12:29 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > What about something like this? It's not even compile tested, but let me
> > know what you think of the idea. The reasoning behind it is that transient
> > resources like dma address ranges in the iommu or swiotlb have the
> > following attributes
> >
> > 1) they are quickly allocated and freed
> >
> > 2) Usually handled by simply trying again at some later point in time
> >
> > 3) Likely to fail again if tried again immediately.
> >
> > 4) Not conducive to interlocked signaling mechanisms, as the time it takes
> > to find and signal a waiting tx queue that resources are now available may
> > take longer than needed, and may still result in failure, as competing
> > allocators may race in and claim said resources during the signaling period.
> >
> > As such, what if we try something more simple like a linear backoff? In
> > this example we add a TX return code that indicates that the driver is not
> > busy, but unable to transmit due to resource constraints. This signals the
> > qdisc layer to skip trying to drain this transmit queue for a short period
> > of time, with subsequent similar errors causing increased backoff. Once the
> > condition is cleared, the backoff delay is removed and operation returns to
> > normal.
>
> Looks like this is a more general issue which should be solved for every
> spinlock:
>
> http://thread.gmane.org/gmane.linux.kernel/1437186
>
Good point, though this isn't a locking situation so much as a failed
transmission situation. The same mechanism applies though, and this seems to
be similar to what ticket spinlocks do.

Neil
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a803d79..2b88ccd 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -105,6 +105,7 @@ enum netdev_tx {
 	NETDEV_TX_OK	 = 0x00,	/* driver took care of packet */
 	NETDEV_TX_BUSY	 = 0x10,	/* driver tx path was busy*/
 	NETDEV_TX_LOCKED = 0x20,	/* driver tx lock was already taken */
+	NETDEV_TX_BACKOFF = 0x40,	/* driver resource constrained */
 };
 typedef enum netdev_tx netdev_tx_t;

@@ -572,6 +573,9 @@ struct netdev_queue {

 	unsigned long		state;

+	unsigned long		backoff_count;
+	unsigned long		skip_count;
+
 #ifdef CONFIG_BQL
 	struct dql		dql;
 #endif
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e1543b0..64bf6fe 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -30,6 +30,8 @@
 #include <net/pkt_sched.h>
 #include <net/dst.h>

+#define MAX_BACKOFF_COUNT 4
+
 /* Qdisc to use by default */
 const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops;
 EXPORT_SYMBOL(default_qdisc_ops);
@@ -121,6 +123,15 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 {
 	int ret = NETDEV_TX_BUSY;

+	/* Only attempt transmit modulo the backoff_count */
+	if (txq->backoff_count) {
+		if (txq->skip_count % txq->backoff_count) {
+			txq->skip_count++;
+			return ret;
+		}
+		txq->skip_count = 0;
+	}
+
 	/* And release qdisc */
 	spin_unlock(root_lock);

@@ -135,9 +146,17 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 	if (dev_xmit_complete(ret)) {
 		/* Driver sent out skb successfully or skb was consumed */
 		ret = qdisc_qlen(q);
+		txq->backoff_count = 0;
 	} else if (ret == NETDEV_TX_LOCKED) {
 		/* Driver try lock failed */
 		ret = handle_dev_cpu_collision(skb, txq, q);
+		txq->backoff_count = 0;
+	} else if (ret == NETDEV_TX_BACKOFF) {
+		/* The driver claims to be resource constrained */
+		if (txq->backoff_count != MAX_BACKOFF_COUNT)
+			txq->backoff_count++;
+		txq->skip_count++;
+		ret = dev_requeue_skb(skb, q);
 	} else {
 		/* Driver returned NETDEV_TX_BUSY - requeue skb */
 		if (unlikely(ret != NETDEV_TX_BUSY))
@@ -145,6 +164,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 				dev->name, ret, q->q.qlen);

 		ret = dev_requeue_skb(skb, q);
+		txq->backoff_count = 0;
 	}

 	if (ret && netif_xmit_frozen_or_stopped(txq))
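The skip/backoff accounting proposed in the patch can be modeled in plain
userspace C. This is just a sketch of the intended behavior under the stated
assumptions (the `txq_model` struct and the three helper names are invented
for illustration; the real patch operates on `struct netdev_queue` inside
`sch_direct_xmit`):

```c
#define MAX_BACKOFF_COUNT 4

struct txq_model {
	unsigned long backoff_count;	/* current backoff interval */
	unsigned long skip_count;	/* attempts skipped so far */
};

/* Mirrors the hunk at the top of sch_direct_xmit(): returns 1 when the
 * qdisc should skip draining this queue on this attempt. */
static int should_skip(struct txq_model *txq)
{
	if (txq->backoff_count) {
		if (txq->skip_count % txq->backoff_count) {
			txq->skip_count++;
			return 1;
		}
		txq->skip_count = 0;
	}
	return 0;
}

/* Driver reported NETDEV_TX_BACKOFF: widen the skip interval linearly,
 * capped at MAX_BACKOFF_COUNT. */
static void on_backoff(struct txq_model *txq)
{
	if (txq->backoff_count != MAX_BACKOFF_COUNT)
		txq->backoff_count++;
	txq->skip_count++;
}

/* Any completed (or busy/locked) transmit clears the backoff state. */
static void on_success(struct txq_model *txq)
{
	txq->backoff_count = 0;
}
```

With `backoff_count == N`, only every Nth call to `should_skip()` lets a
transmit attempt through, which is the linear backoff the cover letter
describes: each further NETDEV_TX_BACKOFF stretches the interval, and one
successful transmit resets it.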
What about something like this? It's not even compile tested, but let me know
what you think of the idea. The reasoning behind it is that transient resources
like dma address ranges in the iommu or swiotlb have the following attributes:

1) they are quickly allocated and freed

2) Usually handled by simply trying again at some later point in time

3) Likely to fail again if tried again immediately.

4) Not conducive to interlocked signaling mechanisms, as the time it takes to
find and signal a waiting tx queue that resources are now available may take
longer than needed, and may still result in failure, as competing allocators
may race in and claim said resources during the signaling period.

As such, what if we try something more simple like a linear backoff? In this
example we add a TX return code that indicates that the driver is not busy, but
unable to transmit due to resource constraints. This signals the qdisc layer to
skip trying to drain this transmit queue for a short period of time, with
subsequent similar errors causing increased backoff. Once the condition is
cleared, the backoff delay is removed and operation returns to normal.

Thoughts?

Neil

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 include/linux/netdevice.h |  4 ++++
 net/sched/sch_generic.c   | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+)