Message ID | 20161103135611.28737.39840.stgit@firesoul |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
Hi, On Thu, Nov 03, 2016 at 02:56:11PM +0100, Jesper Dangaard Brouer wrote: [...] > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > index 206dc24add3a..f337f1bdd1d4 100644 > --- a/net/sched/sch_api.c > +++ b/net/sched/sch_api.c > @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev, > > sch->handle = handle; > > + /* This exist to keep backward compatible with a userspace > + * loophole, what allowed userspace to get IFF_NO_QUEUE > + * facility on older kernels by setting tx_queue_len=0 (prior > + * to qdisc init), and then forgot to reinit tx_queue_len > + * before again attaching a qdisc. > + */ > + if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) { > + dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN; > + netdev_info(dev, "Caught tx_queue_len zero misconfig\n"); > + } I wonder why this is limited to IFF_NO_QUEUE devices. Do you think there is a valid use case for physical ones? Also, if we sanitize here, couldn't we then just get rid of the sanitization you're fixing in patch 2? Apart from that, ACK to all the patches. Thanks for cleaning up my mess! :) Cheers, Phil
On Fri, 4 Nov 2016 10:35:26 +0100 Phil Sutter <phil@nwl.cc> wrote: > Hi, > > On Thu, Nov 03, 2016 at 02:56:11PM +0100, Jesper Dangaard Brouer wrote: > [...] > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > > index 206dc24add3a..f337f1bdd1d4 100644 > > --- a/net/sched/sch_api.c > > +++ b/net/sched/sch_api.c > > @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev, > > > > sch->handle = handle; > > > > + /* This exist to keep backward compatible with a userspace > > + * loophole, what allowed userspace to get IFF_NO_QUEUE > > + * facility on older kernels by setting tx_queue_len=0 (prior > > + * to qdisc init), and then forgot to reinit tx_queue_len > > + * before again attaching a qdisc. > > + */ > > + if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) { > > + dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN; > > + netdev_info(dev, "Caught tx_queue_len zero misconfig\n"); > > + } > > I wonder why this is limited to IFF_NO_QUEUE devices. Do you think there > is a valid use case for physical ones? Hmmm, I cannot come up with a useful use-case for physical devices, but I cannot see why we should save users that had used the loophole on physical devices, as that is clearly a faulty config to begin with. See net_crit_ratelimited warning here: https://github.com/torvalds/linux/blob/v4.9-rc3/net/core/dev.c#L3403 > Also, if we sanitize here, couldn't we then just get rid of the > sanitization you're fixing in patch 2? Without patch 2, then some IFF_NO_QUEUE devices would have a visible tx_queue_len 0 (e.g. the ones not calling ether_setup()), and that would be inconsistent (visible from userspace). > Apart from that, ACK to all the patches. Thanks for cleaning up my mess! > :) Thanks for the ACKs
On Fri, Nov 04, 2016 at 11:10:42AM +0100, Jesper Dangaard Brouer wrote: > > On Fri, 4 Nov 2016 10:35:26 +0100 Phil Sutter <phil@nwl.cc> wrote: > > > Hi, > > > > On Thu, Nov 03, 2016 at 02:56:11PM +0100, Jesper Dangaard Brouer wrote: > > [...] > > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > > > index 206dc24add3a..f337f1bdd1d4 100644 > > > --- a/net/sched/sch_api.c > > > +++ b/net/sched/sch_api.c > > > @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev, > > > > > > sch->handle = handle; > > > > > > + /* This exist to keep backward compatible with a userspace > > > + * loophole, what allowed userspace to get IFF_NO_QUEUE > > > + * facility on older kernels by setting tx_queue_len=0 (prior > > > + * to qdisc init), and then forgot to reinit tx_queue_len > > > + * before again attaching a qdisc. > > > + */ > > > + if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) { > > > + dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN; > > > + netdev_info(dev, "Caught tx_queue_len zero misconfig\n"); > > > + } > > > > I wonder why this is limited to IFF_NO_QUEUE devices. Do you think there > > is a valid use case for physical ones? > > Hmmm, I cannot come up with a useful use-case for physical devices, but > I cannot see why we should save users that had used the loophole on > physical devices, as that is clearly a faulty config to begin with. > See net_crit_ratelimited warning here: > https://github.com/torvalds/linux/blob/v4.9-rc3/net/core/dev.c#L3403 I really feel like nit-picking again, but what differs in between loophole users of virtual devices (whose broken scripts stopped working) and loophole users of physical devices (whose broken scripts stopped working as well)? I we really take exposing broken userspace scripts as kernel bugs, don't we have to take this one for the same as well? > > Also, if we sanitize here, couldn't we then just get rid of the > > sanitization you're fixing in patch 2? > > Without patch 2, then some IFF_NO_QUEUE devices would have a visible > tx_queue_len 0 (e.g. the ones not calling ether_setup()), and that > would be inconsistent (visible from userspace). Ah, indeed. Although there's no functional difference, I guess it might confuse people seeing an interface with 0 qlen performing properly. Thanks, Phil
On Fri, 4 Nov 2016 11:59:13 +0100 Phil Sutter <phil@nwl.cc> wrote: > On Fri, Nov 04, 2016 at 11:10:42AM +0100, Jesper Dangaard Brouer wrote: > > > > On Fri, 4 Nov 2016 10:35:26 +0100 Phil Sutter <phil@nwl.cc> wrote: > > > > > Hi, > > > > > > On Thu, Nov 03, 2016 at 02:56:11PM +0100, Jesper Dangaard Brouer wrote: > > > [...] > > > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > > > > index 206dc24add3a..f337f1bdd1d4 100644 > > > > --- a/net/sched/sch_api.c > > > > +++ b/net/sched/sch_api.c > > > > @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev, > > > > > > > > sch->handle = handle; > > > > > > > > + /* This exist to keep backward compatible with a userspace > > > > + * loophole, what allowed userspace to get IFF_NO_QUEUE > > > > + * facility on older kernels by setting tx_queue_len=0 (prior > > > > + * to qdisc init), and then forgot to reinit tx_queue_len > > > > + * before again attaching a qdisc. > > > > + */ > > > > + if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) { > > > > + dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN; > > > > + netdev_info(dev, "Caught tx_queue_len zero misconfig\n"); > > > > + } > > > > > > I wonder why this is limited to IFF_NO_QUEUE devices. Do you think there > > > is a valid use case for physical ones? > > > > Hmmm, I cannot come up with a useful use-case for physical devices, but > > I cannot see why we should save users that had used the loophole on > > physical devices, as that is clearly a faulty config to begin with. > > See net_crit_ratelimited warning here: > > [1] https://github.com/torvalds/linux/blob/v4.9-rc3/net/core/dev.c#L3403 > > I really feel like nit-picking again, Perhaps a follow up patch is better? This patch does solve a real issue. > but what differs in between > loophole users of virtual devices (whose broken scripts stopped working) > and loophole users of physical devices (whose broken scripts stopped > working as well)? There is a difference. We basically closed the loophole config, but fixed that qdisc can be attached to virtual (IFF_NO_QUEUE) devices, without needing to adjusting tx_queue_len. Thus, running a loophole-script have no-effect, but for IFF_NO_QUEUE devices (veth specifically) it looks like it had the desired effect, thus Docker will/can keep doing that, to work with older kernels, and on newer kernels it just doesn't have any effect. The remaining problem is that a "loophole-script" leaves the interface in a broken state with tx_queue_len==0. Which this patch address. So, why only catch misconfig for IFF_NO_QUEUE devices? Because a loophole-script on veth brought it into a valid config, thus valid use-case, while one a physical into a invalid config (hence the critical warn[1]). You could (in a followup patch, please) argue that it is a lot simpler, just to always catch the misconfig of having tx_queue_len==0 when attaching a qdisc.
Hello. On 11/3/2016 4:56 PM, Jesper Dangaard Brouer wrote: > It is a clear misconfiguration to attach a qdisc to a device with > tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, > htb, plug and sfb) inherit/copy this value as their queue length. > > Why should the kernel catch such a misconfiguration? Because prior to > introducing the IFF_NO_QUEUE device flag, userspace found a loophole > in the qdisc config system that allowed them to achieve the equivalent > of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from > a device. The loophole on older kernels is setting tx_queue_len=0, > *prior* to device qdisc init (the config time is significant, simply > setting tx_queue_len=0 doesn't trigger the loophole). > > This loophole is currently used by Docker[1] to get better performance > and scalability out of the veth device. The Docker developers were > warned[1] that they needed to adjust the tx_queue_len if ever > attaching a qdisc. The OpenShift project didn't remember this warning > and attached a qdisc, this were caught and fixed in[2]. > > [1] https://github.com/docker/libcontainer/pull/193 > [2] https://github.com/openshift/origin/pull/11126 > > Instead of fixing every userspace program that used this loophole, and > forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's > catch the misconfiguration on the kernel side. > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > --- > net/sched/sch_api.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c > index 206dc24add3a..f337f1bdd1d4 100644 > --- a/net/sched/sch_api.c > +++ b/net/sched/sch_api.c > @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev, > > sch->handle = handle; > > + /* This exist to keep backward compatible with a userspace "Exists" and "compatibility". > + * loophole, what allowed userspace to get IFF_NO_QUEUE > + * facility on older kernels by setting tx_queue_len=0 (prior > + * to qdisc init), and then forgot to reinit tx_queue_len > + * before again attaching a qdisc. > + */ > + if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) { > + dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN; > + netdev_info(dev, "Caught tx_queue_len zero misconfig\n"); > + } > + > if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS])) == 0) { > if (qdisc_is_percpu_stats(sch)) { > sch->cpu_bstats = MBR, Sergei
Just FYI: I'm tangentially aware of internal Google code that: - expects a bonding device running HTB with non-zero txqueuelen - wants to remove HTB and get a noqueue interface (the normal default for bonding) The code currently removes HTB, which gets us to mq, sets txqueuelen to 0, adds a pfifo, removes the pfifo, which gets us to noqueue. After this patch this would ?possibly? break (adding pfifo, would change txqueuelen, so when we remove it we wouldn't end up with noqueue). From what I fuzzily recall, HTB with txquelelen == 0 drops traffic hard, while pfifo continues to function, hence the ordering... Obviously our code can be fixed, but I'm worried there's a more generic backwards compatibility problem here. (note: this is mostly about 3.11 and 4.3 and might no longer be relevant with 4.10... maybe the new kernel's default qdisc selection logic doesn't depend on txqueuelen and checks the flag instead???)
On Mon, 7 Nov 2016 22:14:37 -0800 Maciej Żenczykowski <zenczykowski@gmail.com> wrote: > Just FYI: > > I'm tangentially aware of internal Google code that: > - expects a bonding device running HTB with non-zero txqueuelen > - wants to remove HTB and get a noqueue interface (the normal default > for bonding) > > The code currently removes HTB, which gets us to mq, sets txqueuelen > to 0, adds a pfifo, removes the pfifo, which gets us to noqueue. This clearly shows that the older userspace interface, of tx_queue_len having double meaning, was a mess! > After this patch this would ?possibly? break (adding pfifo, would > change txqueuelen, so when we remove it we wouldn't end up with > noqueue). No, you will still end-up with "noqueue". It is now the flag IFF_NO_QUEUE that determine if a device gets "noqueue" when the default qdisc is attached. The tx_queue_len no longer have any effect on getting "noqueue". The IFF_NO_QUEUE system removed this double meaning of tx_queue_len. > From what I fuzzily recall, HTB with txquelelen == 0 drops traffic > hard, while pfifo continues to function, hence the ordering... > > Obviously our code can be fixed, but I'm worried there's a more > generic backwards compatibility problem here. It is good you bring it up, but I don't see a backwards compatibility problem with your usage after the patchset. > (note: this is mostly about 3.11 and 4.3 and might no longer be > relevant with 4.10... maybe the new kernel's default qdisc selection > logic doesn't depend on txqueuelen and checks the flag instead???) If I were you, I would now implement a validation check that reported the problem if not getting into the expected "noqueue" state. Then when you eventually upgrade to a more recent kernel, you would get alerted of improper state. Something like: noqueue=$(ip link show dev $DEV 2> /dev/null | grep -q "noqueue" && echo "noqueue" || echo "bad") if [[ "$noqueue" != "noqueue" ]]; then echo "report-problem"; fi
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 206dc24add3a..f337f1bdd1d4 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -960,6 +960,17 @@ static struct Qdisc *qdisc_create(struct net_device *dev, sch->handle = handle; + /* This exist to keep backward compatible with a userspace + * loophole, what allowed userspace to get IFF_NO_QUEUE + * facility on older kernels by setting tx_queue_len=0 (prior + * to qdisc init), and then forgot to reinit tx_queue_len + * before again attaching a qdisc. + */ + if ((dev->priv_flags & IFF_NO_QUEUE) && (dev->tx_queue_len == 0)) { + dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN; + netdev_info(dev, "Caught tx_queue_len zero misconfig\n"); + } + if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS])) == 0) { if (qdisc_is_percpu_stats(sch)) { sch->cpu_bstats =
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] https://github.com/docker/libcontainer/pull/193 [2] https://github.com/openshift/origin/pull/11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> --- net/sched/sch_api.c | 11 +++++++++++ 1 file changed, 11 insertions(+)