diff mbox

[net-next,3/3] qdisc: catch misconfig of attaching qdisc to tx_queue_len zero device

Message ID 20161103135611.28737.39840.stgit@firesoul
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Jesper Dangaard Brouer Nov. 3, 2016, 1:56 p.m. UTC
It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration?  Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device.  The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device.  The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc.  The OpenShift project didn't remember this warning
and attached a qdisc, this were caught and fixed in[2].

[1] https://github.com/docker/libcontainer/pull/193
[2] https://github.com/openshift/origin/pull/11126

Instead of fixing every userspace program that used this loophole, and
forgot to reset the tx_queue_len, prior to attaching a qdisc.  Let's
catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/sched/sch_api.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Phil Sutter Nov. 4, 2016, 9:35 a.m. UTC | #1
Hi,

On Thu, Nov 03, 2016 at 02:56:11PM +0100, Jesper Dangaard Brouer wrote:
[...]
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 206dc24add3a..f337f1bdd1d4 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>  
>  	sch->handle = handle;
>  
> +	/* This exist to keep backward compatible with a userspace
> +	 * loophole, what allowed userspace to get IFF_NO_QUEUE
> +	 * facility on older kernels by setting tx_queue_len=0 (prior
> +	 * to qdisc init), and then forgot to reinit tx_queue_len
> +	 * before again attaching a qdisc.
> +	 */
> +	if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) {
> +		dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
> +		netdev_info(dev, "Caught tx_queue_len zero misconfig\n");
> +	}

I wonder why this is limited to IFF_NO_QUEUE devices. Do you think there
is a valid use case for physical ones?

Also, if we sanitize here, couldn't we then just get rid of the
sanitization you're fixing in patch 2?

Apart from that, ACK to all the patches. Thanks for cleaning up my mess!
:)

Cheers, Phil
Jesper Dangaard Brouer Nov. 4, 2016, 10:10 a.m. UTC | #2
On Fri, 4 Nov 2016 10:35:26 +0100 Phil Sutter <phil@nwl.cc> wrote:

> Hi,
> 
> On Thu, Nov 03, 2016 at 02:56:11PM +0100, Jesper Dangaard Brouer wrote:
> [...]
> > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> > index 206dc24add3a..f337f1bdd1d4 100644
> > --- a/net/sched/sch_api.c
> > +++ b/net/sched/sch_api.c
> > @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> >  
> >  	sch->handle = handle;
> >  
> > +	/* This exist to keep backward compatible with a userspace
> > +	 * loophole, what allowed userspace to get IFF_NO_QUEUE
> > +	 * facility on older kernels by setting tx_queue_len=0 (prior
> > +	 * to qdisc init), and then forgot to reinit tx_queue_len
> > +	 * before again attaching a qdisc.
> > +	 */
> > +	if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) {
> > +		dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
> > +		netdev_info(dev, "Caught tx_queue_len zero misconfig\n");
> > +	}  
> 
> I wonder why this is limited to IFF_NO_QUEUE devices. Do you think there
> is a valid use case for physical ones?

Hmmm, I cannot come up with a useful use-case for physical devices, but
I cannot see why we should save users that had used the loophole on
physical devices, as that is clearly a faulty config to begin with.
See net_crit_ratelimited warning here:
 https://github.com/torvalds/linux/blob/v4.9-rc3/net/core/dev.c#L3403

> Also, if we sanitize here, couldn't we then just get rid of the
> sanitization you're fixing in patch 2?

Without patch 2, then some IFF_NO_QUEUE devices would have a visible
tx_queue_len 0 (e.g. the ones not calling ether_setup()), and that
would be inconsistent (visible from userspace).

> Apart from that, ACK to all the patches. Thanks for cleaning up my mess!
> :)

Thanks for the ACKs
Phil Sutter Nov. 4, 2016, 10:59 a.m. UTC | #3
On Fri, Nov 04, 2016 at 11:10:42AM +0100, Jesper Dangaard Brouer wrote:
> 
> On Fri, 4 Nov 2016 10:35:26 +0100 Phil Sutter <phil@nwl.cc> wrote:
> 
> > Hi,
> > 
> > On Thu, Nov 03, 2016 at 02:56:11PM +0100, Jesper Dangaard Brouer wrote:
> > [...]
> > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> > > index 206dc24add3a..f337f1bdd1d4 100644
> > > --- a/net/sched/sch_api.c
> > > +++ b/net/sched/sch_api.c
> > > @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> > >  
> > >  	sch->handle = handle;
> > >  
> > > +	/* This exist to keep backward compatible with a userspace
> > > +	 * loophole, what allowed userspace to get IFF_NO_QUEUE
> > > +	 * facility on older kernels by setting tx_queue_len=0 (prior
> > > +	 * to qdisc init), and then forgot to reinit tx_queue_len
> > > +	 * before again attaching a qdisc.
> > > +	 */
> > > +	if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) {
> > > +		dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
> > > +		netdev_info(dev, "Caught tx_queue_len zero misconfig\n");
> > > +	}  
> > 
> > I wonder why this is limited to IFF_NO_QUEUE devices. Do you think there
> > is a valid use case for physical ones?
> 
> Hmmm, I cannot come up with a useful use-case for physical devices, but
> I cannot see why we should save users that had used the loophole on
> physical devices, as that is clearly a faulty config to begin with.
> See net_crit_ratelimited warning here:
>  https://github.com/torvalds/linux/blob/v4.9-rc3/net/core/dev.c#L3403

I really feel like nit-picking again, but what differs in between
loophole users of virtual devices (whose broken scripts stopped working)
and loophole users of physical devices (whose broken scripts stopped
working as well)?

I we really take exposing broken userspace scripts as kernel bugs, don't
we have to take this one for the same as well?

> > Also, if we sanitize here, couldn't we then just get rid of the
> > sanitization you're fixing in patch 2?
> 
> Without patch 2, then some IFF_NO_QUEUE devices would have a visible
> tx_queue_len 0 (e.g. the ones not calling ether_setup()), and that
> would be inconsistent (visible from userspace).

Ah, indeed. Although there's no functional difference, I guess it might
confuse people seeing an interface with 0 qlen performing properly.

Thanks, Phil
Jesper Dangaard Brouer Nov. 4, 2016, 12:09 p.m. UTC | #4
On Fri, 4 Nov 2016 11:59:13 +0100
Phil Sutter <phil@nwl.cc> wrote:

> On Fri, Nov 04, 2016 at 11:10:42AM +0100, Jesper Dangaard Brouer wrote:
> > 
> > On Fri, 4 Nov 2016 10:35:26 +0100 Phil Sutter <phil@nwl.cc> wrote:
> >   
> > > Hi,
> > > 
> > > On Thu, Nov 03, 2016 at 02:56:11PM +0100, Jesper Dangaard Brouer wrote:
> > > [...]  
> > > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> > > > index 206dc24add3a..f337f1bdd1d4 100644
> > > > --- a/net/sched/sch_api.c
> > > > +++ b/net/sched/sch_api.c
> > > > @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> > > >  
> > > >  	sch->handle = handle;
> > > >  
> > > > +	/* This exist to keep backward compatible with a userspace
> > > > +	 * loophole, what allowed userspace to get IFF_NO_QUEUE
> > > > +	 * facility on older kernels by setting tx_queue_len=0 (prior
> > > > +	 * to qdisc init), and then forgot to reinit tx_queue_len
> > > > +	 * before again attaching a qdisc.
> > > > +	 */
> > > > +	if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) {
> > > > +		dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
> > > > +		netdev_info(dev, "Caught tx_queue_len zero misconfig\n");
> > > > +	}    
> > > 
> > > I wonder why this is limited to IFF_NO_QUEUE devices. Do you think there
> > > is a valid use case for physical ones?  
> > 
> > Hmmm, I cannot come up with a useful use-case for physical devices, but
> > I cannot see why we should save users that had used the loophole on
> > physical devices, as that is clearly a faulty config to begin with.
> > See net_crit_ratelimited warning here:
> >  [1] https://github.com/torvalds/linux/blob/v4.9-rc3/net/core/dev.c#L3403  
> 
> I really feel like nit-picking again,

Perhaps a follow up patch is better?  This patch does solve a real
issue.

> but what differs in between
> loophole users of virtual devices (whose broken scripts stopped working)
> and loophole users of physical devices (whose broken scripts stopped
> working as well)?

There is a difference.  We basically closed the loophole config, but
fixed that qdisc can be attached to virtual (IFF_NO_QUEUE) devices,
without needing to adjusting tx_queue_len.

Thus, running a loophole-script have no-effect, but for IFF_NO_QUEUE
devices (veth specifically) it looks like it had the desired effect,
thus Docker will/can keep doing that, to work with older kernels, and
on newer kernels it just doesn't have any effect.

The remaining problem is that a "loophole-script" leaves the interface
in a broken state with tx_queue_len==0.  Which this patch address.

So, why only catch misconfig for IFF_NO_QUEUE devices?  Because a
loophole-script on veth brought it into a valid config, thus valid
use-case, while one a physical into a invalid config (hence the
critical warn[1]).

You could (in a followup patch, please) argue that it is a lot simpler,
just to always catch the misconfig of having tx_queue_len==0 when
attaching a qdisc.
Sergei Shtylyov Nov. 4, 2016, 12:53 p.m. UTC | #5
Hello.

On 11/3/2016 4:56 PM, Jesper Dangaard Brouer wrote:

> It is a clear misconfiguration to attach a qdisc to a device with
> tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
> htb, plug and sfb) inherit/copy this value as their queue length.
>
> Why should the kernel catch such a misconfiguration?  Because prior to
> introducing the IFF_NO_QUEUE device flag, userspace found a loophole
> in the qdisc config system that allowed them to achieve the equivalent
> of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
> a device.  The loophole on older kernels is setting tx_queue_len=0,
> *prior* to device qdisc init (the config time is significant, simply
> setting tx_queue_len=0 doesn't trigger the loophole).
>
> This loophole is currently used by Docker[1] to get better performance
> and scalability out of the veth device.  The Docker developers were
> warned[1] that they needed to adjust the tx_queue_len if ever
> attaching a qdisc.  The OpenShift project didn't remember this warning
> and attached a qdisc, this were caught and fixed in[2].
>
> [1] https://github.com/docker/libcontainer/pull/193
> [2] https://github.com/openshift/origin/pull/11126
>
> Instead of fixing every userspace program that used this loophole, and
> forgot to reset the tx_queue_len, prior to attaching a qdisc.  Let's
> catch the misconfiguration on the kernel side.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/sched/sch_api.c |   11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 206dc24add3a..f337f1bdd1d4 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>
>  	sch->handle = handle;
>
> +	/* This exist to keep backward compatible with a userspace

    "Exists" and "compatibility".

> +	 * loophole, what allowed userspace to get IFF_NO_QUEUE
> +	 * facility on older kernels by setting tx_queue_len=0 (prior
> +	 * to qdisc init), and then forgot to reinit tx_queue_len
> +	 * before again attaching a qdisc.
> +	 */
> +	if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) {
> +		dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
> +		netdev_info(dev, "Caught tx_queue_len zero misconfig\n");
> +	}
> +
>  	if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS])) == 0) {
>  		if (qdisc_is_percpu_stats(sch)) {
>  			sch->cpu_bstats =

MBR, Sergei
Maciej Żenczykowski Nov. 8, 2016, 6:14 a.m. UTC | #6
Just FYI:

I'm tangentially aware of internal Google code that:
- expects a bonding device running HTB with non-zero txqueuelen
- wants to remove HTB and get a noqueue interface (the normal default
for bonding)

The code currently removes HTB, which gets us to mq, sets txqueuelen
to 0, adds a pfifo, removes the pfifo, which gets us to noqueue.

After this patch this would ?possibly? break (adding pfifo, would
change txqueuelen, so when we remove it we wouldn't end up with
noqueue).

From what I fuzzily recall, HTB with txquelelen == 0 drops traffic
hard, while pfifo continues to function, hence the ordering...

Obviously our code can be fixed, but I'm worried there's a more
generic backwards compatibility problem here.

(note: this is mostly about 3.11 and 4.3 and might no longer be
relevant with 4.10... maybe the new kernel's default qdisc selection
logic doesn't depend on txqueuelen and checks the flag instead???)
Jesper Dangaard Brouer Nov. 8, 2016, 7:46 a.m. UTC | #7
On Mon, 7 Nov 2016 22:14:37 -0800 Maciej Żenczykowski <zenczykowski@gmail.com> wrote:

> Just FYI:
> 
> I'm tangentially aware of internal Google code that:
> - expects a bonding device running HTB with non-zero txqueuelen
> - wants to remove HTB and get a noqueue interface (the normal default
> for bonding)
> 
> The code currently removes HTB, which gets us to mq, sets txqueuelen
> to 0, adds a pfifo, removes the pfifo, which gets us to noqueue.

This clearly shows that the older userspace interface, of tx_queue_len
having double meaning, was a mess!

> After this patch this would ?possibly? break (adding pfifo, would
> change txqueuelen, so when we remove it we wouldn't end up with
> noqueue).

No, you will still end-up with "noqueue".  It is now the flag
IFF_NO_QUEUE that determine if a device gets "noqueue" when the default
qdisc is attached. The tx_queue_len no longer have any effect on
getting "noqueue".  The IFF_NO_QUEUE system removed this double meaning
of tx_queue_len.


> From what I fuzzily recall, HTB with txquelelen == 0 drops traffic
> hard, while pfifo continues to function, hence the ordering...
> 
> Obviously our code can be fixed, but I'm worried there's a more
> generic backwards compatibility problem here.

It is good you bring it up, but I don't see a backwards compatibility
problem with your usage after the patchset.
 
> (note: this is mostly about 3.11 and 4.3 and might no longer be
> relevant with 4.10... maybe the new kernel's default qdisc selection
> logic doesn't depend on txqueuelen and checks the flag instead???)

If I were you, I would now implement a validation check that reported
the problem if not getting into the expected "noqueue" state.  Then
when you eventually upgrade to a more recent kernel, you would get
alerted of improper state.

Something like:

noqueue=$(ip link show dev $DEV 2> /dev/null | grep -q "noqueue" && echo "noqueue" || echo "bad")
if [[ "$noqueue" != "noqueue" ]]; then
    echo "report-problem";
fi
diff mbox

Patch

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 206dc24add3a..f337f1bdd1d4 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -960,6 +960,17 @@  static struct Qdisc *qdisc_create(struct net_device *dev,
 
 	sch->handle = handle;
 
+	/* This exist to keep backward compatible with a userspace
+	 * loophole, what allowed userspace to get IFF_NO_QUEUE
+	 * facility on older kernels by setting tx_queue_len=0 (prior
+	 * to qdisc init), and then forgot to reinit tx_queue_len
+	 * before again attaching a qdisc.
+	 */
+	if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) {
+		dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
+		netdev_info(dev, "Caught tx_queue_len zero misconfig\n");
+	}
+
 	if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS])) == 0) {
 		if (qdisc_is_percpu_stats(sch)) {
 			sch->cpu_bstats =