Patchwork pppoatm: Fix excessive queue bloat

login
register
mail settings
Submitter David Woodhouse
Date April 8, 2012, 7:53 p.m.
Message ID <1333914837.31812.57.camel@shinybook.infradead.org>
Download mbox | patch
Permalink /patch/151371/
State Superseded
Delegated to: David Miller
Headers show

Comments

David Woodhouse - April 8, 2012, 7:53 p.m.
We discovered that PPPoATM has an excessively deep transmit queue. A
queue the size of the default socket send buffer (wmem_default) is
maintained between the PPP generic core and the ATM device.

Fix it to queue a maximum of *two* packets. The one the ATM device is
currently working on, and one more for the ATM driver to process
immediately in its TX done interrupt handler. The PPP core is designed
to feed packets to the channel with minimal latency, so that really
ought to be enough to keep the ATM device busy.

While we're at it, fix the fact that we were triggering the wakeup
tasklet on *every* pppoatm_pop() call. The comment saying "this is
inefficient, but doing it right is too hard" turns out to be overly
pessimistic... I think :)

On machines like the Traverse Geos, with a slow Geode CPU and two
high-speed ADSL2+ interfaces, there were reports of extremely high CPU
usage which could partly be attributed to the extra wakeups.

(The wakeup handling could actually be made a whole lot easier if we
 stop checking sk->sk_sndbuf altogether. Given that we now only queue
 *two* packets ever, one wonders what the point is. As it is, you could
 already deadlock the thing by setting the sk_sndbuf to a value lower
 than the MTU of the device, and it'd just block for ever.)

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>

---

Seriously, this gets *much* easier if we just ditch the checks against
sk_sndbuf. We just wake up whenever decrementing ->inflight from zero.
Can I?

 net/atm/pppoatm.c |   95 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 85 insertions(+), 10 deletions(-)
chas williams - CONTRACTOR - April 10, 2012, 2:26 p.m.
On Sun, 08 Apr 2012 21:53:57 +0200
David Woodhouse <dwmw2@infradead.org> wrote:

> Seriously, this gets *much* easier if we just ditch the checks against
> sk_sndbuf. We just wake up whenever decrementing ->inflight from zero.
> Can I?

i dont know.  on a 'low' speed connection like queuing 2 packets might
be enough to keep something busy but imagine an interace like oc3 or
oc12.  i dont know anything running pppoatm on such an interface but it
seems like just dropping sk_sndbuf isnt right either.

sk_sndbuf is per vcc and that isnt right either since the transmit
limit is actually the atm device's transmit queue depth.  in your case,
since you are only using a single vcc, this problem does somewhat map
to a sk_sndbuf.  but sk_sndbuf counts bytes and not entries in a queue
which is typically the limit imposed by the hardware.

it makes me wonder if we dont need to add another stub to the atm
driver interface to check if the device can transmit instead of 'can we
queue on the vcc'.  atm_may_send() then would be changed to check to see
if a transmit queue entry is available and there is sk_sndbuf queue
space.  if a driver doesnt have this stub to check for queue space,
atm_may_send() can just assume there is a queue entry and we get the
previous default behavior.  what is the "queue depth" of your atm
device's transmit queue?
 
yes, the vcc->sk_sndbuf should be MAX(vcc->sk_sndbuf, pppoatm->MTU) but
since people dont often change the sk_sndbuf i guess no one has run
into this issue.   pppoatm (et al) should enforce this limit though.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Woodhouse - April 10, 2012, 8:28 p.m.
On Tue, 2012-04-10 at 10:26 -0400, chas williams - CONTRACTOR wrote:
> On Sun, 08 Apr 2012 21:53:57 +0200
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > Seriously, this gets *much* easier if we just ditch the checks against
> > sk_sndbuf. We just wake up whenever decrementing ->inflight from zero.
> > Can I?
> 
> i dont know.  on a 'low' speed connection like queuing 2 packets might
> be enough to keep something busy but imagine an interace like oc3 or
> oc12.  i dont know anything running pppoatm on such an interface but it
> seems like just dropping sk_sndbuf isnt right either.

That looks like a response to my patch, not to the question that you
cited. My patch reduces the buffering to MAX(vcc->sk_sndbuf, 2 packets)
so if there are issues with keeping faster devices busy, they'll happen
anyway with my patch. (My question was just whether we can ditch the
sk_sndbuf bit altogether, and just make it a hard-coded two packets.)

The limit of two packets was chosen on the basis that the PPP core is
designed to feed us new packets when we need them, with *low* latency.
So when the hardware finishes sending packet #1 and starts on packet #2,
we *should* be able to get a packet #3 into its queue by the time it
needs it.

But if that approach could really cause an issue with keeping faster
devices busy, perhaps the limit should be "x ms worth of packets" based
on the upload speed of the device, rather than a hard-coded 2?

Not that we *know* the upload speed of the device, for asymmetric
links... do we?

> sk_sndbuf is per vcc and that isnt right either since the transmit
> limit is actually the atm device's transmit queue depth.

Hm, I don't think that's a limit that I care about. You're talking about
the maximum number of packets that can be queued to the hardware at a
time. What I care about is the *minimum* number of packets that *need*
to be queued to the hardware, to ensure that it doesn't stall waiting
for us to replenish its queue.

Which is largely a factor of how well the PPP core does the job it was
*designed* to do, as I see it.

> what is the "queue depth" of your atm device's transmit queue?

We're still using MMIO for the Solos ADSL2+ devices at the moment, so
there's no descriptor ring. It used to have internal buffering which was
only limited by the device's internal memory — it would continue to
accept packets from the host until it had nowhere else to put them. I
got them to fix that in its firmware, so now it only has two or three
packets queued internally. But as far as the host is concerned, those
packets are *gone* already.
David Miller - April 13, 2012, 5:04 p.m.
From: chas williams - CONTRACTOR <chas@cmf.nrl.navy.mil>
Date: Tue, 10 Apr 2012 10:26:12 -0400

> On Sun, 08 Apr 2012 21:53:57 +0200
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
>> Seriously, this gets *much* easier if we just ditch the checks against
>> sk_sndbuf. We just wake up whenever decrementing ->inflight from zero.
>> Can I?
> 
> i dont know.

Please do not take this discussion private, and off of the netdev list.

Now nobody else on the list can see your response and and reply to it.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - April 13, 2012, 5:05 p.m.
From: David Woodhouse <dwmw2@infradead.org>
Date: Sun, 08 Apr 2012 21:53:57 +0200

> We discovered that PPPoATM has an excessively deep transmit queue. A
> queue the size of the default socket send buffer (wmem_default) is
> maintained between the PPP generic core and the ATM device.
> 
> Fix it to queue a maximum of *two* packets. The one the ATM device is
> currently working on, and one more for the ATM driver to process
> immediately in its TX done interrupt handler. The PPP core is designed
> to feed packets to the channel with minimal latency, so that really
> ought to be enough to keep the ATM device busy.
> 
> While we're at it, fix the fact that we were triggering the wakeup
> tasklet on *every* pppoatm_pop() call. The comment saying "this is
> inefficient, but doing it right is too hard" turns out to be overly
> pessimistic... I think :)
> 
> On machines like the Traverse Geos, with a slow Geode CPU and two
> high-speed ADSL2+ interfaces, there were reports of extremely high CPU
> usage which could partly be attributed to the extra wakeups.
> 
> (The wakeup handling could actually be made a whole lot easier if we
>  stop checking sk->sk_sndbuf altogether. Given that we now only queue
>  *two* packets ever, one wonders what the point is. As it is, you could
>  already deadlock the thing by setting the sk_sndbuf to a value lower
>  than the MTU of the device, and it'd just block for ever.)
> 
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>

Applied to net-next, thanks David.

I always considered the use of socket buffers to handle packets
moving around the ATM layers as a fundamental design error.

Although it's major surgery, fixing that up would be a much
appreciated contribution for someone suitably motivated :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - April 13, 2012, 5:27 p.m.
From: David Miller <davem@davemloft.net>
Date: Fri, 13 Apr 2012 13:04:52 -0400 (EDT)

> From: chas williams - CONTRACTOR <chas@cmf.nrl.navy.mil>
> Date: Tue, 10 Apr 2012 10:26:12 -0400
> 
>> On Sun, 08 Apr 2012 21:53:57 +0200
>> David Woodhouse <dwmw2@infradead.org> wrote:
>> 
>>> Seriously, this gets *much* easier if we just ditch the checks against
>>> sk_sndbuf. We just wake up whenever decrementing ->inflight from zero.
>>> Can I?
>> 
>> i dont know.
> 
> Please do not take this discussion private, and off of the netdev list.
> 
> Now nobody else on the list can see your response and and reply to it.

My bad, you didn't do this, my apologies.

But for some reason your replies didn't show up in the patchwork
entry for this patch, maybe your email client fumbled up the Message-Ids
because that's what patchwork uses to match things up.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/net/atm/pppoatm.c b/net/atm/pppoatm.c
index 614d3fc..68a02a9 100644
--- a/net/atm/pppoatm.c
+++ b/net/atm/pppoatm.c
@@ -62,12 +62,25 @@  struct pppoatm_vcc {
 	void (*old_pop)(struct atm_vcc *, struct sk_buff *);
 					/* keep old push/pop for detaching */
 	enum pppoatm_encaps encaps;
+	atomic_t inflight;
+	unsigned long blocked;
 	int flags;			/* SC_COMP_PROT - compress protocol */
 	struct ppp_channel chan;	/* interface to generic ppp layer */
 	struct tasklet_struct wakeup_tasklet;
 };
 
 /*
+ * We want to allow two packets in the queue. The one that's currently in
+ * flight, and *one* queued up ready for the ATM device to send immediately
+ * from its TX done IRQ. We want to be able to use atomic_inc_not_zero(), so
+ * inflight == -2 represents an empty queue, -1 one packet, and zero means
+ * there are two packets in the queue.
+ */
+#define NONE_INFLIGHT -2
+
+#define BLOCKED 0
+
+/*
  * Header used for LLC Encapsulated PPP (4 bytes) followed by the LCP protocol
  * ID (0xC021) used in autodetection
  */
@@ -102,16 +115,30 @@  static void pppoatm_wakeup_sender(unsigned long arg)
 static void pppoatm_pop(struct atm_vcc *atmvcc, struct sk_buff *skb)
 {
 	struct pppoatm_vcc *pvcc = atmvcc_to_pvcc(atmvcc);
+
 	pvcc->old_pop(atmvcc, skb);
+	atomic_dec(&pvcc->inflight);
+
 	/*
-	 * We don't really always want to do this since it's
-	 * really inefficient - it would be much better if we could
-	 * test if we had actually throttled the generic layer.
-	 * Unfortunately then there would be a nasty SMP race where
-	 * we could clear that flag just as we refuse another packet.
-	 * For now we do the safe thing.
+	 * We always used to run the wakeup tasklet unconditionally here, for
+	 * fear of race conditions where we clear the BLOCKED flag just as we
+	 * refuse another packet in pppoatm_send(). This was quite inefficient.
+	 *
+	 * In fact it's OK. The PPP core will only ever call pppoatm_send()
+	 * while holding the channel->downl lock. And ppp_output_wakeup() as
+	 * called by the tasklet will *also* grab that lock. So even if another
+	 * CPU is in pppoatm_send() right now, the tasklet isn't going to race
+	 * with it. The wakeup *will* happen after the other CPU is safely out
+	 * of pppoatm_send() again.
+	 *
+	 * So if the CPU in pppoatm_send() has already set the BLOCKED bit and
+	 * it about to return, that's fine. We trigger a wakeup which will
+	 * happen later. And if the CPU in pppoatm_send() *hasn't* set the
+	 * BLOCKED bit yet, that's fine too because of the double check in
+	 * pppoatm_may_send() which is commented there.
 	 */
-	tasklet_schedule(&pvcc->wakeup_tasklet);
+	if (test_and_clear_bit(BLOCKED, &pvcc->blocked))
+		tasklet_schedule(&pvcc->wakeup_tasklet);
 }
 
 /*
@@ -184,6 +211,51 @@  error:
 	ppp_input_error(&pvcc->chan, 0);
 }
 
+static inline int pppoatm_may_send(struct pppoatm_vcc *pvcc, int size)
+{
+	/*
+	 * It's not clear that we need to bother with using atm_may_send()
+	 * to check we don't exceed sk->sk_sndbuf. If userspace sets a
+	 * value of sk_sndbuf which is lower than the MTU, we're going to
+	 * block for ever. But the code always did that before we introduced
+	 * the packet count limit, so... 
+	 */
+	if (atm_may_send(pvcc->atmvcc, size) &&
+	    atomic_inc_not_zero_hint(&pvcc->inflight, NONE_INFLIGHT))
+		return 1;
+
+	/*
+	 * We use test_and_set_bit() rather than set_bit() here because
+	 * we need to ensure there's a memory barrier after it. The bit
+	 * *must* be set before we do the atomic_inc() on pvcc->inflight.
+	 * There's no smp_mb__after_set_bit(), so it's this or abuse
+	 * smp_mb__after_clear_bit().
+	 */
+	test_and_set_bit(BLOCKED, &pvcc->blocked);
+
+	/*
+	 * We may have raced with pppoatm_pop(). If it ran for the
+	 * last packet in the queue, *just* before we set the BLOCKED
+	 * bit, then it might never run again and the channel could
+	 * remain permanently blocked. Cope with that race by checking
+	 * *again*. If it did run in that window, we'll have space on
+	 * the queue now and can return success. It's harmless to leave
+	 * the BLOCKED flag set, since it's only used as a trigger to
+	 * run the wakeup tasklet. Another wakeup will never hurt. 
+	 * If pppoatm_pop() is running but hasn't got as far as making
+	 * space on the queue yet, then it hasn't checked the BLOCKED
+	 * flag yet either, so we're safe in that case too. It'll issue
+	 * an "immediate" wakeup... where "immediate" actually involves
+	 * taking the PPP channel's ->downl lock, which is held by the
+	 * code path that calls pppoatm_send(), and is thus going to
+	 * wait for us to finish.
+	 */
+	if (atm_may_send(pvcc->atmvcc, size) &&
+	    atomic_inc_not_zero(&pvcc->inflight))
+		return 1;
+
+	return 0;
+}
 /*
  * Called by the ppp_generic.c to send a packet - returns true if packet
  * was accepted.  If we return false, then it's our job to call
@@ -207,7 +279,7 @@  static int pppoatm_send(struct ppp_channel *chan, struct sk_buff *skb)
 			struct sk_buff *n;
 			n = skb_realloc_headroom(skb, LLC_LEN);
 			if (n != NULL &&
-			    !atm_may_send(pvcc->atmvcc, n->truesize)) {
+			    !pppoatm_may_send(pvcc, n->truesize)) {
 				kfree_skb(n);
 				goto nospace;
 			}
@@ -215,12 +287,12 @@  static int pppoatm_send(struct ppp_channel *chan, struct sk_buff *skb)
 			skb = n;
 			if (skb == NULL)
 				return DROP_PACKET;
-		} else if (!atm_may_send(pvcc->atmvcc, skb->truesize))
+		} else if (!pppoatm_may_send(pvcc, skb->truesize))
 			goto nospace;
 		memcpy(skb_push(skb, LLC_LEN), pppllc, LLC_LEN);
 		break;
 	case e_vc:
-		if (!atm_may_send(pvcc->atmvcc, skb->truesize))
+		if (!pppoatm_may_send(pvcc, skb->truesize))
 			goto nospace;
 		break;
 	case e_autodetect:
@@ -285,6 +357,9 @@  static int pppoatm_assign_vcc(struct atm_vcc *atmvcc, void __user *arg)
 	if (pvcc == NULL)
 		return -ENOMEM;
 	pvcc->atmvcc = atmvcc;
+
+	/* Maximum is zero, so that we can use atomic_inc_not_zero() */
+	atomic_set(&pvcc->inflight, NONE_INFLIGHT);
 	pvcc->old_push = atmvcc->push;
 	pvcc->old_pop = atmvcc->pop;
 	pvcc->encaps = (enum pppoatm_encaps) be.encaps;