
[RFC] net: Provide linear backoff mechanism for constrained resources at the driver

Message ID 1399577392-4457-1-git-send-email-nhorman@tuxdriver.com
State RFC, archived
Delegated to: David Miller

Commit Message

Neil Horman May 8, 2014, 7:29 p.m. UTC
What about something like this?  It's not even compile tested, but let me know
what you think of the idea.  The reasoning behind it is that transient resources
like dma address ranges in the iommu or swiotlb have the following attributes:

1) they are quickly allocated and freed

2) Usually handled by simply trying again at some later point in time

3) Likely to fail again if tried again immediately.

4) Not conducive to interlocked signaling mechanisms, as the time it takes to
find and signal a waiting tx queue that resources are now available may take
longer than needed, and may still result in failure, as competing allocators
may race in and claim said resources during the signaling period.

As such, what if we try something simpler, like a linear backoff? In this
example we add a TX return code that indicates that the driver is not busy, but
unable to transmit due to resource constraints.  This signals the qdisc layer to
skip trying to drain this transmit queue for a short period of time, with
subsequent similar errors causing increased backoff.  Once the condition is
cleared, the backoff delay is removed and operation returns to normal.
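
By way of illustration, a driver would signal the new condition from its
ndo_start_xmit() roughly like this (a sketch only: the foo driver and its
private struct are hypothetical, while the DMA API calls are the standard
ones):

	static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct foo_priv *priv = netdev_priv(dev);
		dma_addr_t mapping;

		mapping = dma_map_single(priv->dmadev, skb->data, skb->len,
					 DMA_TO_DEVICE);
		if (dma_mapping_error(priv->dmadev, mapping)) {
			/* iommu/swiotlb space is exhausted; don't free the skb,
			 * just ask the qdisc layer to back off and retry this
			 * queue a little later.
			 */
			return NETDEV_TX_BACKOFF;
		}

		/* ... post the descriptor and kick the hardware ... */
		return NETDEV_TX_OK;
	}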

Thoughts?

Neil

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 include/linux/netdevice.h |  4 ++++
 net/sched/sch_generic.c   | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+)

Comments

David Laight May 9, 2014, 8:55 a.m. UTC | #1
From: Neil Horman
> What about something like this?  It's not even compile tested, but let me know
> what you think of the idea.  The reasoning behind it is that transient resources
> like dma address ranges in the iommu or swiotlb have the following attributes:
> 
> 1) they are quickly allocated and freed

I'm not sure that is true for iommu entries.
The ones allocated for ethernet receive are effectively permanently allocated.

Imagine a system with 512 iommu entries.
An ethernet driver allocates 128 RX ring entries using one iommu entry each.
There are now no iommu entries left for anything else.
That system will only work if the ethernet driver reduces the number of
active rx buffers.

It is also possible (but less likely) that ethernet transmit will
use so many iommu entries that none are left for more important things.
The network will work with only one active transmit, but you may
still have to do disk and/or usb transfers even when resource limited.

	David



Neil Horman May 9, 2014, 11:17 a.m. UTC | #2
On Fri, May 09, 2014 at 08:55:10AM +0000, David Laight wrote:
> From: Neil Horman
> > What about something like this?  It's not even compile tested, but let me know
> > what you think of the idea.  The reasoning behind it is that transient resources
> > like dma address ranges in the iommu or swiotlb have the following attributes:
> > 
> > 1) they are quickly allocated and freed
> 
> I'm not sure that is true for iommu entries.
> The ones allocated for ethernet receive are effectively permanently allocated.
> 
I disagree.  A review of several NIC drivers shows the pseudocode for the RX
path to be:

For each SKB1 on the RX ring:
	If LENGTH(SKB1) < COPYBREAK
		SKB2 = ALLOCATE_SKB
		COPY_DATA(SKB2, SKB1)
		RECEIVE(SKB2)
	Else
		UNMAP(SKB1)
		RECEIVE(SKB1)
		SKB1 = ALLOCATE_SKB
		MAP(SKB1)
Done

The value of COPYBREAK is configurable, but is never more than 256 bytes, and is
often 128 or fewer bytes (sometimes zero).  This will cause some udp traffic to
get handled as copies, but not more reasonably sized udp packets, and no
well-behaved tcp traffic will ever get copied.  Those iommu entries will come and
go very quickly.
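
To make the pseudocode concrete, the pattern looks roughly like this in C (the
foo_* types and fields are hypothetical stand-ins; the DMA and skb calls are the
standard kernel APIs; skb2 allocation failure and the sync back to the device
are elided):

	static void foo_rx_one(struct foo_priv *priv, struct foo_rx_desc *desc)
	{
		struct sk_buff *skb1 = desc->skb;
		unsigned int len = le16_to_cpu(desc->len);

		if (len < priv->copybreak) {
			/* Small frame: copy into a fresh skb; the original
			 * buffer (and its iommu mapping) stays on the ring.
			 */
			struct sk_buff *skb2 = netdev_alloc_skb(priv->netdev, len);

			dma_sync_single_for_cpu(priv->dmadev, desc->dma, len,
						DMA_FROM_DEVICE);
			memcpy(skb_put(skb2, len), skb1->data, len);
			napi_gro_receive(&priv->napi, skb2);
		} else {
			/* Large frame: unmap, hand it up, then allocate and map
			 * a replacement buffer; the mapping only lived for the
			 * duration of one receive.
			 */
			dma_unmap_single(priv->dmadev, desc->dma,
					 priv->rx_buf_len, DMA_FROM_DEVICE);
			skb_put(skb1, len);
			napi_gro_receive(&priv->napi, skb1);
			desc->skb = netdev_alloc_skb(priv->netdev, priv->rx_buf_len);
			desc->dma = dma_map_single(priv->dmadev, desc->skb->data,
						   priv->rx_buf_len, DMA_FROM_DEVICE);
		}
	}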

> Imagine a system with 512 iommu entries.
> An ethernet driver allocates 128 RX ring entries using one iommu entry each.
> There are now no iommu entries left for anything else.
That actually leaves 384 entries remaining, but that's neither here nor there :).
  iommus work like tlbs, in that they don't have a fixed number of entries.
Each iommu has a set of page tables, wherein a set of pages can be mapped.  If
a packet fits within a single page, then yes, it takes up a single 'pte'.  If a
packet takes multiple pages (say a gso packet for example), or if some other
device is doing lots of high volume dma (FCoE/iscsi/roce/infiniband), then
multiple pte's will be taken up handling the larger data buffer.  It's a
limited resource shared unevenly between all dma-ing devices.  That's why we
can't reserve entries, because you don't have a lot of space to begin with, and
you don't know how much you'll need until you have the data to send, which can
vary wildly depending on the device.
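
For example (hypothetical sizes, error handling omitted), a single map call
over a large buffer consumes many ptes from the same shared pool:

	static void iova_cost_example(struct device *dev, void *frame, void *gso_buf)
	{
		dma_addr_t a, b;

		/* a ~MTU-sized frame costs a single 4K IOVA page ('pte') */
		a = dma_map_single(dev, frame, 1500, DMA_TO_DEVICE);

		/* a 64K gso superpacket costs sixteen contiguous 4K IOVA
		 * pages, all consumed by one call
		 */
		b = dma_map_single(dev, gso_buf, 65536, DMA_TO_DEVICE);

		dma_unmap_single(dev, b, 65536, DMA_TO_DEVICE);
		dma_unmap_single(dev, a, 1500, DMA_TO_DEVICE);
	}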
 
> That system will only work if the ethernet driver reduces the number of
> active rx buffers.
> 
Reducing the number of active rx buffers is tantamount to reducing the ring size
of a NIC, which is already a tunable feature, and not one that would be received
well by people trying to maximize their network throughput.

> It is also possible (but less likely) that ethernet transmit will
> use so many iommu entries that none are left for more important things.
This is possible in all cases, not just transmit.

> The network will work with only one active transmit, but you may
> still have to do disk and/or usb transfers even when resource limited.
> 
Hence my RFC patch in my prior note.  If we're resource constrained, push back
on the qdisc such that we try not to use as many mappings for a short time without
causing too much overhead.  It doesn't affect receive of course, but it's very
hard to deal with managing mapping use when the producer is not directly
controllable by us.

Neil

> 	David
> 
> 
> 
> 
David Laight May 9, 2014, 12:53 p.m. UTC | #3
From: Neil Horman 
> On Fri, May 09, 2014 at 08:55:10AM +0000, David Laight wrote:
> > From: Neil Horman
> > > What about something like this?  It's not even compile tested, but let me know
> > > what you think of the idea.  The reasoning behind it is that transient resources
> > > like dma address ranges in the iommu or swiotlb have the following attributes:
> > >
> > > 1) they are quickly allocated and freed
> >
> > I'm not sure that is true for iommu entries.
> > The ones allocated for ethernet receive are effectively permanently allocated.
> >
> I disagree.  A review of several NIC drivers shows the pseudocode for the RX
> path to be:
> 
> For each SKB1 on the RX ring:
> 	If LENGTH(SKB1) < COPYBREAK
> 		SKB2 = ALLOCATE_SKB
> 		COPY_DATA(SKB2, SKB1)
> 		RECEIVE(SKB2)
> 	Else
> 		UNMAP(SKB1)
> 		RECEIVE(SKB1)
> 		SKB1 = ALLOCATE_SKB
> 		MAP(SKB1)
> Done
> 
> The value of COPYBREAK is configurable, but is never more than 256 bytes, and is
> often 128 or fewer bytes (sometimes zero).  This will cause some udp traffic to
> get handled as copies, but not more reasonably sized udp packets, and no
> well-behaved tcp traffic will ever get copied.  Those iommu entries will come and
> go very quickly.

If I understand correctly the iommu entries are needed for the ethernet
chip to access main memory.  The usual state is that the RX ring is full of
buffers - all of which are mapped for dma.

> > Imagine a system with 512 iommu entries.
> > An ethernet driver allocates 128 RX ring entries using one iommu entry each.
> > There are now no iommu entries left for anything else.

> That actually leaves 384 entries remaining, but that's neither here nor there :).
It seems I failed to write 'include 4 such interfaces and' ...

>   iommus work like tlbs, in that they don't have a fixed number of entries.
> Each iommu has a set of page tables, wherein a set of pages can be mapped.
> ...

Yes I realise that what is actually being allocated is io virtual address space.
But the '1 buffer' == '1 slot' simplification is reasonably appropriate for
some 'thought designs'.


> It's a
> limited resource shared unevenly between all dma-ing devices.  That's why we
> can't reserve entries, because you don't have a lot of space to begin with, and
> you don't know how much you'll need until you have the data to send, which can
> vary wildly depending on the device.

You do know how much resource is left though.

> > That system will only work if the ethernet driver reduces the number of
> > active rx buffers.
> >
> Reducing the number of active rx buffers is tantamount to reducing the ring size
> of a NIC, which is already a tunable feature, and not one that would be received
> well by people trying to maximize their network throughput.

Except that it needs to be done automatically if there is a global constraint
(be it iommu space or dma-able memory).
In isolation a large rx ring is almost always an advantage.

> > It is also possible (but less likely) that ethernet transmit will
> > use so many iommu entries that none are left for more important things.
> This is possible in all cases, not just transmit.
> 
> > The network will work with only one active transmit, but you may
> > still have to do disk and/or usb transfers even when resource limited.
> >
> Hence my RFC patch in my prior note.  If we're resource constrained, push back
> on the qdisc such that we try not to use as many mappings for a short time without
> causing too much overhead.  It doesn't affect receive of course, but it's very
> hard to deal with managing mapping use when the producer is not directly
> controllable by us.

But ethernet receive is likely to be the big user of iommu entries.
If you constrain it, then there probably won't be allocation failures
elsewhere.

	David



Neil Horman May 9, 2014, 3:33 p.m. UTC | #4
On Fri, May 09, 2014 at 12:53:29PM +0000, David Laight wrote:
> From: Neil Horman 
> > On Fri, May 09, 2014 at 08:55:10AM +0000, David Laight wrote:
> > > From: Neil Horman
> > > > What about something like this?  It's not even compile tested, but let me know
> > > > what you think of the idea.  The reasoning behind it is that transient resources
> > > > like dma address ranges in the iommu or swiotlb have the following attributes:
> > > >
> > > > 1) they are quickly allocated and freed
> > >
> > > I'm not sure that is true for iommu entries.
> > > The ones allocated for ethernet receive are effectively permanently allocated.
> > >
> > I disagree.  A review of several NIC drivers shows the pseudocode for the RX
> > path to be:
> > 
> > For each SKB1 on the RX ring:
> > 	If LENGTH(SKB1) < COPYBREAK
> > 		SKB2 = ALLOCATE_SKB
> > 		COPY_DATA(SKB2, SKB1)
> > 		RECEIVE(SKB2)
> > 	Else
> > 		UNMAP(SKB1)
> > 		RECEIVE(SKB1)
> > 		SKB1 = ALLOCATE_SKB
> > 		MAP(SKB1)
> > Done
> > 
> > The value of COPYBREAK is configurable, but is never more than 256 bytes, and is
> > often 128 or fewer bytes (sometimes zero).  This will cause some udp traffic to
> > get handled as copies, but not more reasonably sized udp packets, and no
> > well-behaved tcp traffic will ever get copied.  Those iommu entries will come and
> > go very quickly.
> 
> If I understand correctly the iommu entries are needed for the ethernet
> chip to access main memory.  The usual state is that the RX ring is full of
> buffers - all of which are mapped for dma.
> 
This is true, but those buffers are not static; they are unmapped, and new ones
are mapped in their place on receive, so there's an opportunity for those
unmapped buffers to get used by other entities/hardware during the receive
process.  We could do something as you suggest, in which we create an api to
reserve the address space for a buffer, and just continually reuse them, but as
Dave points out, it's a limited resource and reserving them may be unfair to
other transient users, especially given that the RX path has to 'over-reserve',
allocating space for the largest receivable packet size even if we get less
than that.

> > > Imagine a system with 512 iommu entries.
> > > An ethernet driver allocates 128 RX ring entries using one iommu entry each.
> > > There are now no iommu entries left for anything else.
> 
> > That actually leaves 384 entries remaining, but that's neither here nor there :).
> It seems I failed to write 'include 4 such interfaces and' ...
> 
Ah, that makes more sense

> >   iommus work like tlbs, in that they don't have a fixed number of entries.
> > Each iommu has a set of page tables, wherein a set of pages can be mapped.
> > ...
> 
> Yes I realise that what is actually being allocated is io virtual address space.
> But the '1 buffer' == '1 slot' simplification is reasonably appropriate for
> some 'thought designs'.
> 
Only partly.  Contiguity is also a factor here.  You might have 1000 slots
remaining, but under heavy use, your largest contiguous slot could be
significantly smaller, which is what a dma-using bit of code is actually
interested in.  Keeping track of the largest contiguous slot is significantly
more difficult, and it's all still a bit dicey, because the largest slot might
change between the time a device asks what it is and the time it actually
allocates it.
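
As a toy illustration of that distinction: with every other slot of an
allocation bitmap in use, half the space is free, yet no two-slot request can
succeed.  A sketch (largest_free_run() is hypothetical; test_bit() is the
standard kernel bitop):

	static unsigned int largest_free_run(const unsigned long *map,
					     unsigned int nbits)
	{
		unsigned int best = 0, run = 0, i;

		for (i = 0; i < nbits; i++) {
			run = test_bit(i, map) ? 0 : run + 1;
			if (run > best)
				best = run;
		}
		/* can be far smaller than the total number of free slots */
		return best;
	}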

> 
> > It's a
> > limited resource shared unevenly between all dma-ing devices.  That's why we
> > can't reserve entries, because you don't have a lot of space to begin with, and
> > you don't know how much you'll need until you have the data to send, which can
> > vary wildly depending on the device.
> 
> You do know how much resource is left though.
> 
See above, you can certainly tell in aggregate how much free space is available,
but not what your largest free chunk is.

> > > That system will only work if the ethernet driver reduces the number of
> > > active rx buffers.
> > >
> > Reducing the number of active rx buffers is tantamount to reducing the ring size
> > of a NIC, which is already a tunable feature, and not one that would be received
> > well by people trying to maximize their network throughput.
> 
> Except that it needs to be done automatically if there is a global constraint
> (be it iommu space or dma-able memory).
2 Things:

1) Need is really a strong term here.  The penalty for failing a dma mapping is
to drop the frame.  That's not unacceptable in many use cases.

2) It seems to me that global constraint here implies a static, well known
number.  While it's true we can interrogate an iommu, and compare its mapping
size to the ring size of all the NICs/devices on a system to see if we're likely
to exceed the iommu space available, we shouldn't do that.  If a given NIC
doesn't produce much traffic, its ring sizes aren't relevant to the computation.

We're not trying to address a static allocation scheme here.  If a system boots,
it implies that all the receive rings on all the devices were able to reserve the
amount of space they needed in the iommu (as you note earlier, they populate
their rings on init, effectively doing an iommu reservation).  The problem we're
addressing is the periodic lack of space that arises from temporary exhaustion
of iommu space under heavy I/O loads.  We won't know if that happens until it
happens, and we can't just allocate for the worst case, because then we're sure
to run out of space as devices scale up.  Sharing is the way to do this whenever
possible.

> In isolation a large rx ring is almost always an advantage.
> 
No argument there.  But requiring a user to size a ring based on expected
traffic patterns seems like it won't be well received.

> > > It is also possible (but less likely) that ethernet transmit will
> > > use so many iommu entries that none are left for more important things.
> > This is possible in all cases, not just transmit.
> > 
> > > The network will work with only one active transmit, but you may
> > > still have to do disk and/or usb transfers even when resource limited.
> > >
> > Hence my RFC patch in my prior note.  If we're resource constrained, push back
> > on the qdisc such that we try not to use as many mappings for a short time without
> > causing too much overhead.  It doesn't affect receive of course, but it's very
> > hard to deal with managing mapping use when the producer is not directly
> > controllable by us.
> 
> But ethernet receive is likely to be the big user of iommu entries.
> If you constrain it, then there probably won't be allocation failures
> elsewhere.
> 
What makes you say that?  There's no reason a tx ring can't be just as full as a
receive ring under heavy traffic load.  If you want to constrain receive
allocations, do so; we have a knob for that already.

Neil

> 	David
> 
> 
> 
> 
David Laight May 9, 2014, 3:46 p.m. UTC | #5
From: Neil Horman [mailto:nhorman@tuxdriver.com]
...
> 2 Things:
> 
> 1) Need is really a strong term here.  The penalty for failing a dma mapping is
> to drop the frame.  That's not unacceptable in many use cases.

Indeed, but the higher layers will recover from a dropped ethernet frame.
While not ideal, it is a 'last resort' action.
Note that I'm not suggesting that your deferred retry of the transmit isn't
a good idea, just that it is probably papering over the cracks.

> 2) It seems to me that global constraint here implies a static, well known
> number.  While it's true we can interrogate an iommu, and compare its mapping
> size to the ring size of all the NICs/devices on a system to see if we're likely
> to exceed the iommu space available, we shouldn't do that.  If a given NIC
> doesn't produce much traffic, its ring sizes aren't relevant to the computation.

An idle NIC will be using a lot of iommu entries for its receive buffers.

> We're not trying to address a static allocation scheme here.  If a system boots,
> it implies that all the receive rings on all the devices were able to reserve the
> amount of space they needed in the iommu (as you note earlier, they populate
> their rings on init, effectively doing an iommu reservation).  The problem we're
> addressing is the periodic lack of space that arises from temporary exhaustion
> of iommu space under heavy I/O loads.  We won't know if that happens until it
> happens, and we can't just allocate for the worst case, because then we're sure
> to run out of space as devices scale up.  Sharing is the way to do this whenever
> possible.

Do you have any data for which drivers have active iommu entries when an
allocation fails?

I can imagine systems where almost all the iommu entries are being used
for ethernet rx buffers, and everything else is fighting for the last
few entries.

	David



Neil Horman May 9, 2014, 5:02 p.m. UTC | #6
On Fri, May 09, 2014 at 03:46:14PM +0000, David Laight wrote:
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> ...
> > 2 Things:
> > 
> > 1) Need is really a strong term here.  The penalty for failing a dma mapping is
> > to drop the frame.  That's not unacceptable in many use cases.
> 
> Indeed, but the higher layers will recover from a dropped ethernet frame.
> While not ideal, it is a 'last resort' action.
No argument there, but it's the best we have at the moment.

> Note that I'm not suggesting that your deferred retry of the transmit isn't
> a good idea, just that it is probably papering over the cracks.
> 
Except that forcing everyone to a lower throughput based on worst-case scenarios
doesn't seem like a better solution to me.

> > 2) It seems to me that global constraint here implies a static, well known
> > number.  While it's true we can interrogate an iommu, and compare its mapping
> > size to the ring size of all the NICs/devices on a system to see if we're likely
> > to exceed the iommu space available, we shouldn't do that.  If a given NIC
> > doesn't produce much traffic, its ring sizes aren't relevant to the computation.
> 
> An idle NIC will be using a lot of iommu entries for its receive buffers.
> 
Yes, and that's really the point!  Only an idle NIC will be (mis)using a lot of
extra iommu entries.  But we have no way to know if a NIC will be idle; we have
to reserve space for them in the iommu because that's how the hardware is
designed.  The only recourse we have here is to reserve less space for each NIC
RX ring, which punishes those NICs that are active, which is the opposite of
what we really want to do here.

> > We're not trying to address a static allocation scheme here.  If a system boots,
> > it implies that all the receive rings on all the devices were able to reserve the
> > amount of space they needed in the iommu (as you note earlier, they populate
> > their rings on init, effectively doing an iommu reservation).  The problem we're
> > addressing is the periodic lack of space that arises from temporary exhaustion
> > of iommu space under heavy I/O loads.  We won't know if that happens until it
> > happens, and we can't just allocate for the worst case, because then we're sure
> > to run out of space as devices scale up.  Sharing is the way to do this whenever
> > possible.
> 
> Do you have any data for which drivers have active iommu entries when an
> allocation fails?
> 
No, but I'm not sure it matters which driver holds DMA descriptors when an
allocation failure occurs.  I'm operating under the assumption here that drivers
are attempting to allocate a reasonable number of buffers for the work they need
to do.  I'm not arguing that we can't reclaim space by forcing any given driver
to allocate less, only that doing so isn't helpful, in that it will just decrease
receive throughput.  We're trading one problem for another.

> I can imagine systems where almost all the iommu entries are being used
> for ethernet rx buffers, and everything else is fighting for the last
> few entries.
> 
I would argue that it's not quite as unbalanced as all/few, but yes, your point is
sound in that the tx path is fighting for a reduced pool of shared buffers
because of the nature of the RX path's pre-allocation needs.  We could reduce
the number of those buffers statically, but that's really an administrative job,
because the kernel never really knows when a NIC is going to be a high-volume
producer or consumer.

Hmm, that actually makes me wonder: is this a job for something like tuned in
user space?  Or a combination of something like my backoff patch and tuned?  The
backoff patch helps the tx path in case of an actual exhaustion, and tuned can
be administratively loaded with a profile to indicate which NICs are likely to
be low volume, and therefore can have their ring sizes reduced, giving the iommu
the maximum free pool to service all the other dma users?

What do you think?
Neil

Neil Horman May 18, 2014, 4:19 p.m. UTC | #7
On Thu, May 08, 2014 at 03:29:52PM -0400, Neil Horman wrote:
> What about something like this?  It's not even compile tested, but let me know
> what you think of the idea.  The reasoning behind it is that transient resources
> like dma address ranges in the iommu or swiotlb have the following attributes:
> 
> 1) they are quickly allocated and freed
> 
> 2) Usually handled by simply trying again at some later point in time
> 
> 3) Likely to fail again if tried again immediately.
> 
> 4) Not conducive to interlocked signaling mechanisms, as the time it takes to
> find and signal a waiting tx queue that resources are now available may take
> longer than needed, and may still result in failure, as competing allocators
> may race in and claim said resources during the signaling period.
> 
> As such, what if we try something simpler, like a linear backoff? In this
> example we add a TX return code that indicates that the driver is not busy, but
> unable to transmit due to resource constraints.  This signals the qdisc layer to
> skip trying to drain this transmit queue for a short period of time, with
> subsequent similar errors causing increased backoff.  Once the condition is
> cleared, the backoff delay is removed and operation returns to normal.
> 
> Thoughts?
> 
> Neil
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---
>  include/linux/netdevice.h |  4 ++++
>  net/sched/sch_generic.c   | 20 ++++++++++++++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index a803d79..2b88ccd 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -105,6 +105,7 @@ enum netdev_tx {
>  	NETDEV_TX_OK	 = 0x00,	/* driver took care of packet */
>  	NETDEV_TX_BUSY	 = 0x10,	/* driver tx path was busy*/
>  	NETDEV_TX_LOCKED = 0x20,	/* driver tx lock was already taken */
> +	NETDEV_TX_BACKOFF = 0x40,	/* driver resource constrained */
>  };
>  typedef enum netdev_tx netdev_tx_t;
>  
> @@ -572,6 +573,9 @@ struct netdev_queue {
>  
>  	unsigned long		state;
>  
> +	unsigned long		backoff_count;
> +	unsigned long		skip_count;
> +
>  #ifdef CONFIG_BQL
>  	struct dql		dql;
>  #endif
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index e1543b0..64bf6fe 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -30,6 +30,8 @@
>  #include <net/pkt_sched.h>
>  #include <net/dst.h>
>  
> +#define MAX_BACKOFF_COUNT 4
> +
>  /* Qdisc to use by default */
>  const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops;
>  EXPORT_SYMBOL(default_qdisc_ops);
> @@ -121,6 +123,15 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>  {
>  	int ret = NETDEV_TX_BUSY;
>  
> +	/* Only attempt transmit modulo the backoff_count */
> +	if (txq->backoff_count) {
> +		if (txq->skip_count % txq->backoff_count) {
> +			txq->skip_count++;
> +			return ret;
> +		}
> +		txq->skip_count = 0;
> +	}
> +
>  	/* And release qdisc */
>  	spin_unlock(root_lock);
>  
> @@ -135,9 +146,17 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>  	if (dev_xmit_complete(ret)) {
>  		/* Driver sent out skb successfully or skb was consumed */
>  		ret = qdisc_qlen(q);
> +		txq->backoff_count = 0;
>  	} else if (ret == NETDEV_TX_LOCKED) {
>  		/* Driver try lock failed */
>  		ret = handle_dev_cpu_collision(skb, txq, q);
> +		txq->backoff_count = 0;
> +	} else if (ret == NETDEV_TX_BACKOFF) {
> +		/* The driver claims to be resource constrained */
> +		if (txq->backoff_count != MAX_BACKOFF_COUNT)
> +			txq->backoff_count++;
> +		txq->skip_count++;
> +		ret = dev_requeue_skb(skb, q);
>  	} else {
>  		/* Driver returned NETDEV_TX_BUSY - requeue skb */
>  		if (unlikely(ret != NETDEV_TX_BUSY))
> @@ -145,6 +164,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>  					     dev->name, ret, q->q.qlen);
>  
>  		ret = dev_requeue_skb(skb, q);
> +		txq->backoff_count = 0;
>  	}
>  
>  	if (ret && netif_xmit_frozen_or_stopped(txq))

Ping, Dave, do you have any thoughts on this approach?  Is it worth proposing
formally?
Neil

Cong Wang May 21, 2014, 6:05 p.m. UTC | #8
On Thu, May 8, 2014 at 12:29 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> What about something like this?  It's not even compile tested, but let me know
> what you think of the idea.  The reasoning behind it is that transient resources
> like dma address ranges in the iommu or swiotlb have the following attributes:
>
> 1) they are quickly allocated and freed
>
> 2) Usually handled by simply trying again at some later point in time
>
> 3) Likely to fail again if tried again immediately.
>
> 4) Not conducive to interlocked signaling mechanisms, as the time it takes to
> find and signal a waiting tx queue that resources are now available may take
> longer than needed, and may still result in failure, as competing allocators
> may race in and claim said resources during the signaling period.
>
> As such, what if we try something simpler, like a linear backoff? In this
> example we add a TX return code that indicates that the driver is not busy, but
> unable to transmit due to resource constraints.  This signals the qdisc layer to
> skip trying to drain this transmit queue for a short period of time, with
> subsequent similar errors causing increased backoff.  Once the condition is
> cleared, the backoff delay is removed and operation returns to normal.

Looks like this is a more general issue which should be solved for every
spinlock:

http://thread.gmane.org/gmane.linux.kernel/1437186
Neil Horman May 21, 2014, 6:49 p.m. UTC | #9
On Wed, May 21, 2014 at 11:05:28AM -0700, Cong Wang wrote:
> On Thu, May 8, 2014 at 12:29 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > What about something like this?  It's not even compile tested, but let me know
> > what you think of the idea.  The reasoning behind it is that transient resources
> > like dma address ranges in the iommu or swiotlb have the following attributes:
> >
> > 1) they are quickly allocated and freed
> >
> > 2) Usually handled by simply trying again at some later point in time
> >
> > 3) Likely to fail again if tried again immediately.
> >
> > 4) Not conducive to interlocked signaling mechanisms, as the time it takes to
> > find and signal a waiting tx queue that resources are now available may take
> > longer than needed, and may still result in failure, as competing allocators
> > may race in and claim said resources during the signaling period.
> >
> > As such, what if we try something simpler, like a linear backoff? In this
> > example we add a TX return code that indicates that the driver is not busy, but
> > unable to transmit due to resource constraints.  This signals the qdisc layer to
> > skip trying to drain this transmit queue for a short period of time, with
> > subsequent similar errors causing increased backoff.  Once the condition is
> > cleared, the backoff delay is removed and operation returns to normal.
> 
> Loos like this is a more general issue which should be solved for every
> spinlock:
> 
> http://thread.gmane.org/gmane.linux.kernel/1437186
> 
Good point, though this isn't a locking situation so much as a failed
transmission situation.  The same mechanism applies, though, and this seems to be
similar to what the ticketed spinlocks do.
Neil


Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a803d79..2b88ccd 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -105,6 +105,7 @@  enum netdev_tx {
 	NETDEV_TX_OK	 = 0x00,	/* driver took care of packet */
 	NETDEV_TX_BUSY	 = 0x10,	/* driver tx path was busy*/
 	NETDEV_TX_LOCKED = 0x20,	/* driver tx lock was already taken */
+	NETDEV_TX_BACKOFF = 0x40,	/* driver resource constrained */
 };
 typedef enum netdev_tx netdev_tx_t;
 
@@ -572,6 +573,9 @@  struct netdev_queue {
 
 	unsigned long		state;
 
+	unsigned long		backoff_count;
+	unsigned long		skip_count;
+
 #ifdef CONFIG_BQL
 	struct dql		dql;
 #endif
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e1543b0..64bf6fe 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -30,6 +30,8 @@ 
 #include <net/pkt_sched.h>
 #include <net/dst.h>
 
+#define MAX_BACKOFF_COUNT 4
+
 /* Qdisc to use by default */
 const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops;
 EXPORT_SYMBOL(default_qdisc_ops);
@@ -121,6 +123,15 @@  int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 {
 	int ret = NETDEV_TX_BUSY;
 
+	/* Only attempt transmit modulo the backoff_count */
+	if (txq->backoff_count) {
+		if (txq->skip_count % txq->backoff_count) {
+			txq->skip_count++;
+			return ret;
+		}
+		txq->skip_count = 0;
+	}
+
 	/* And release qdisc */
 	spin_unlock(root_lock);
 
@@ -135,9 +146,17 @@  int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 	if (dev_xmit_complete(ret)) {
 		/* Driver sent out skb successfully or skb was consumed */
 		ret = qdisc_qlen(q);
+		txq->backoff_count = 0;
 	} else if (ret == NETDEV_TX_LOCKED) {
 		/* Driver try lock failed */
 		ret = handle_dev_cpu_collision(skb, txq, q);
+		txq->backoff_count = 0;
+	} else if (ret == NETDEV_TX_BACKOFF) {
+		/* The driver claims to be resource constrained */
+		if (txq->backoff_count != MAX_BACKOFF_COUNT)
+			txq->backoff_count++;
+		txq->skip_count++;
+		ret = dev_requeue_skb(skb, q);
 	} else {
 		/* Driver returned NETDEV_TX_BUSY - requeue skb */
 		if (unlikely(ret != NETDEV_TX_BUSY))
@@ -145,6 +164,7 @@  int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 					     dev->name, ret, q->q.qlen);
 
 		ret = dev_requeue_skb(skb, q);
+		txq->backoff_count = 0;
 	}
 
 	if (ret && netif_xmit_frozen_or_stopped(txq))