diff mbox

[Bloat] shaper team forming up

Message ID 1300164166.2649.70.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet March 15, 2011, 4:42 a.m. UTC
Le lundi 14 mars 2011 à 21:24 +0100, Eric Dumazet a écrit :

remove CC to bloat lists for now, adding David Miller to thread.

> Le lundi 14 mars 2011 à 21:55 +0200, Jonathan Morton a écrit :
> > On 14 Mar, 2011, at 9:26 pm, Dave Täht wrote:
> > 
> > > Over the weekend, Dan Siemons uncovered a possible bad interaction
> > > between ECN and the default pfifo_fast qdisc in Linux.
> > > 
> > > http://www.coverfire.com/archives/2011/03/13/pfifo_fast-and-ecn/
> > 
> > This seems to be more complicated that it appears.  It looks as though
> > Linux has re-used the LSB of the old TOS field for some "link local"
> > flag which is used by routing.
> > 
> > It's not immediately obvious whether pfifo_fast is using this new
> > interpretation though.  If it isn't, the fix should be to remove the
> > RTO_ONLINK bit from the mask it's using on the tos field.  The other
> > half of the mask correctly excludes the ECN bits from the field.
> > 
> 
> CC netdev, where linux network dev can take a look.
> 
> I would say that this is a wrong analysis : 
> 
> 1) ECN uses two low order bits of TOS byte
> 
> 2) pfifo_fast uses skb->priority
> 
> 
> skb->priority = rt_tos2priority(iph->tos);
> 
> #define IPTOS_TOS_MASK            0x1E
> #define IPTOS_TOS(tos)            ((tos)&IPTOS_TOS_MASK)
> 
> static inline char rt_tos2priority(u8 tos)
> {
> 	return ip_tos2prio[IPTOS_TOS(tos)>>1];
> }
> 
> No interference between two mechanisms, unless sysadmin messed up things
> (skb_edit)
> 
> 

David, it seems ip_tos2prio is wrong on its 2nd entry :

#define TC_PRIO_BESTEFFORT              0
#define TC_PRIO_FILLER                  1
#define TC_PRIO_BULK                    2
#define TC_PRIO_INTERACTIVE_BULK        4
#define TC_PRIO_INTERACTIVE             6
#define TC_PRIO_CONTROL                 7

#define TC_PRIO_MAX                     15

net/ipv4/route.c:170:#define ECN_OR_COST(class) TC_PRIO_##class

const __u8 ip_tos2prio[16] = {
	TC_PRIO_BESTEFFORT,   /* 0 : for flow without ECN */
	ECN_OR_COST(FILLER), /* 1 : flow with ECN */
	...
};




This means ECN enabled flows got TC_PRIO_FILLER (what the hell is
that ?)

pfifo_fast has :

static const u8 prio2band[TC_PRIO_MAX+1] =
	{ 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 };

So a non ECN enabled flow goes to band 1, while an ECN enabled one is in
band 2 (!).  Thus, ECN enabled flows have a chance being droped more
often than non ECN flows. Thats not fair...

What do you think ?

Thanks



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave =?utf-8?Q?T=C3=A4ht?= March 15, 2011, 5:27 a.m. UTC | #1
Eric Dumazet <eric.dumazet@gmail.com> writes:

> Le lundi 14 mars 2011 à 21:24 +0100, Eric Dumazet a écrit :
>
> remove CC to bloat lists for now, adding David Miller to thread.
>
>> Le lundi 14 mars 2011 à 21:55 +0200, Jonathan Morton a écrit :
>> > On 14 Mar, 2011, at 9:26 pm, Dave Täht wrote:
>> > 
>> > > Over the weekend, Dan Siemon uncovered a possible bad interaction
>> > > between ECN and the default pfifo_fast qdisc in Linux.
>> > > 
>> > > http://www.coverfire.com/archives/2011/03/13/pfifo_fast-and-ecn/
>> > 

Totally irrelevant bit elided

>> CC netdev, where linux network dev can take a look.
>> 
>> I would say that this is a wrong analysis : 
>> 
>> 1) ECN uses two low order bits of TOS byte
>> 
>> 2) pfifo_fast uses skb->priority
>> 
>> 
>> skb->priority = rt_tos2priority(iph->tos);
>> 
>> #define IPTOS_TOS_MASK            0x1E
>> #define IPTOS_TOS(tos)            ((tos)&IPTOS_TOS_MASK)
>> 
>> static inline char rt_tos2priority(u8 tos)
>> {
>> 	return ip_tos2prio[IPTOS_TOS(tos)>>1];
>> }
>> 
>> No interference between two mechanisms, unless sysadmin messed up things
>> (skb_edit)
>> 
>> 
>
> David, it seems ip_tos2prio is wrong on its 2nd entry :
>
> #define TC_PRIO_BESTEFFORT              0
> #define TC_PRIO_FILLER                  1
> #define TC_PRIO_BULK                    2
> #define TC_PRIO_INTERACTIVE_BULK        4
> #define TC_PRIO_INTERACTIVE             6
> #define TC_PRIO_CONTROL                 7
>
> #define TC_PRIO_MAX                     15
>
> net/ipv4/route.c:170:#define ECN_OR_COST(class) TC_PRIO_##class
>
> const __u8 ip_tos2prio[16] = {
> 	TC_PRIO_BESTEFFORT,   /* 0 : for flow without ECN */
> 	ECN_OR_COST(FILLER), /* 1 : flow with ECN */
> 	...
> };
>
>
>
>
> This means ECN enabled flows got TC_PRIO_FILLER (what the hell is
> that ?)
>
> pfifo_fast has :
>
> static const u8 prio2band[TC_PRIO_MAX+1] =
> 	{ 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 };
>
> So a non ECN enabled flow goes to band 1, while an ECN enabled one is in
> band 2 (!).  Thus, ECN enabled flows have a chance being droped more
> often than non ECN flows. Thats not fair...
>
> What do you think ?

Well, that makes 3 of us that think it's wrong. Can we get more? 

(I'll run through the math again in the morning)

It's most often not actually "enablement" but "assertion", when for
example an ECN bit is put on an ACK packet (by an application, or qdisc)
, it drops that ACK packet into the 2 queue - leaving all the other
non-ECN asserted packets in that flow to flow out ahead of it.

Or so dan siemon & I & now you, think. It's late and I really want to recheck
the math and the shifts in the morning. However, if true... this would
explain much ECN related weirdness precisely where it has been hard to
measure, on heavily loaded systems.

>
> Thanks
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ed6603..fabfe81 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -171,7 +171,7 @@ static struct dst_ops ipv4_dst_ops = {
>  
>  const __u8 ip_tos2prio[16] = {
>  	TC_PRIO_BESTEFFORT,
> -	ECN_OR_COST(FILLER),
> +	ECN_OR_COST(BESTEFFORT),
>  	TC_PRIO_BESTEFFORT,
>  	ECN_OR_COST(BESTEFFORT),
>  	TC_PRIO_BULK,
>

I think this is a good short term fix, but it will mildly upset people
that actually still use minimum cost and don't use ECN. That said, 
RFC1349 has been obsolete for a decade now, and ECN enabled servers are
at 12% penetration according to MIT.

Still, long term, doing a sch_pfifo_dscp that would be fully compliant
with the relevant modern RFCs and eventually making that the standard
would be good.
Eric Dumazet March 15, 2011, 6:15 a.m. UTC | #2
Le lundi 14 mars 2011 à 23:27 -0600, Dave Täht a écrit :

> Well, that makes 3 of us that think it's wrong. Can we get more? 
> 
> (I'll run through the math again in the morning)
> 
> It's most often not actually "enablement" but "assertion", when for
> example an ECN bit is put on an ACK packet (by an application, or qdisc)
> , it drops that ACK packet into the 2 queue - leaving all the other
> non-ECN asserted packets in that flow to flow out ahead of it.
> 

There are two ECN bits, not one.
The low order bit is not taken into account by skb->priority mapping.
The high order bit cannot be changed during flow lifetime.
(So : no OOO (Out Of Order) problems on say TCP flows)

> Or so dan siemon & I & now you, think. It's late and I really want to recheck
> the math and the shifts in the morning. However, if true... this would
> explain much ECN related weirdness precisely where it has been hard to
> measure, on heavily loaded systems.
> 
> >
> > Thanks
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 6ed6603..fabfe81 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -171,7 +171,7 @@ static struct dst_ops ipv4_dst_ops = {
> >  
> >  const __u8 ip_tos2prio[16] = {
> >  	TC_PRIO_BESTEFFORT,
> > -	ECN_OR_COST(FILLER),
> > +	ECN_OR_COST(BESTEFFORT),
> >  	TC_PRIO_BESTEFFORT,
> >  	ECN_OR_COST(BESTEFFORT),
> >  	TC_PRIO_BULK,
> >
> 
> I think this is a good short term fix, but it will mildly upset people
> that actually still use minimum cost and don't use ECN. That said, 
> RFC1349 has been obsolete for a decade now, and ECN enabled servers are
> at 12% penetration according to MIT.
> 

If minimum cost was asked by people, their packets had chance being
dropped. Why should they be upset ?
ECN should be favored anyway in 2011, now everybody is ready.

> Still, long term, doing a sch_pfifo_dscp that would be fully compliant
> with the relevant modern RFCs and eventually making that the standard
> would be good.
> 

sch_pfifo_fast is not the place we perform the TOS -> priority mapping.

Its done in another layer.

That is of litle effect, given TOS values are meaningfull only inside a
domain. Nobody can force everyone to use same semantics on the Internet,
even with a standard RFC. I doubt people using linux machines at home
really need DSCP at all.

What we could do instead is to favor a bit ECN enabled connections,
using 4 bands instead of 3 for pfifo_fast (linux default qdisc, probably
the most used qdisc)

band 0 : high priority packets (like now)
band 1 : (old band 1, ECN capable flows)
band 2 : (old band 1, no ECN flows)
band 3 : low priority packets (old band 2)

Note : pfifo_fast is mostly used on end hosts, not on routers (where
admins setup non default qdiscs), and typical end hosts never experiment
packet drops on their qdiscs, because they are now plugged to Gigabit
LANS, and device queuelength is so big.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jonathan Morton March 15, 2011, 5:09 p.m. UTC | #3
On 15 Mar, 2011, at 8:15 am, Eric Dumazet wrote:

> band 0 : high priority packets (like now)
> band 1 : (old band 1, ECN capable flows)
> band 2 : (old band 1, no ECN flows)
> band 3 : low priority packets (old band 2)

This seems good to me.  It would provide a concrete (if minor) enticement to turn ECN on.

 - Jonathan

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 15, 2011, 10:51 p.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 15 Mar 2011 05:42:46 +0100

> Le lundi 14 mars 2011 à 21:24 +0100, Eric Dumazet a écrit :
> 
> David, it seems ip_tos2prio is wrong on its 2nd entry :

Indeed, and in context, this is simply a thinko which has survived since
ECN support first got added to the tree, here's the relevant hunk:

--------------------
@@ -152,23 +152,29 @@ struct dst_ops ipv4_dst_ops =
 	sizeof(struct rtable),
 };
 
+#ifdef CONFIG_INET_ECN
+#define ECN_OR_COST(class)	TC_PRIO_##class
+#else
+#define ECN_OR_COST(class)	TC_PRIO_FILLER
+#endif
+
 __u8 ip_tos2prio[16] = {
 	TC_PRIO_BESTEFFORT,
-	TC_PRIO_FILLER,
+	ECN_OR_COST(FILLER),
 	TC_PRIO_BESTEFFORT,
-	TC_PRIO_FILLER,
+	ECN_OR_COST(BESTEFFORT),
 	TC_PRIO_BULK,
-	TC_PRIO_FILLER,
+	ECN_OR_COST(BULK),
 	TC_PRIO_BULK,
-	TC_PRIO_FILLER,
+	ECN_OR_COST(BULK),
 	TC_PRIO_INTERACTIVE,
-	TC_PRIO_FILLER,
+	ECN_OR_COST(INTERACTIVE),
 	TC_PRIO_INTERACTIVE,
-	TC_PRIO_FILLER,
+	ECN_OR_COST(INTERACTIVE),
 	TC_PRIO_INTERACTIVE_BULK,
-	TC_PRIO_FILLER,
+	ECN_OR_COST(INTERACTIVE_BULK),
 	TC_PRIO_INTERACTIVE_BULK,
-	TC_PRIO_FILLER
+	ECN_OR_COST(INTERACTIVE_BULK)
 };
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 15, 2011, 10:53 p.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 15 Mar 2011 05:42:46 +0100

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ed6603..fabfe81 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -171,7 +171,7 @@ static struct dst_ops ipv4_dst_ops = {
>  
>  const __u8 ip_tos2prio[16] = {
>  	TC_PRIO_BESTEFFORT,
> -	ECN_OR_COST(FILLER),
> +	ECN_OR_COST(BESTEFFORT),
>  	TC_PRIO_BESTEFFORT,
>  	ECN_OR_COST(BESTEFFORT),
>  	TC_PRIO_BULK,
> 
> 

Eric can you please formally submit this fix?  I'd like to apply it soon.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet March 15, 2011, 11 p.m. UTC | #6
Le mardi 15 mars 2011 à 15:53 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 15 Mar 2011 05:42:46 +0100
> 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 6ed6603..fabfe81 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -171,7 +171,7 @@ static struct dst_ops ipv4_dst_ops = {
> >  
> >  const __u8 ip_tos2prio[16] = {
> >  	TC_PRIO_BESTEFFORT,
> > -	ECN_OR_COST(FILLER),
> > +	ECN_OR_COST(BESTEFFORT),
> >  	TC_PRIO_BESTEFFORT,
> >  	ECN_OR_COST(BESTEFFORT),
> >  	TC_PRIO_BULK,
> > 
> > 
> 
> Eric can you please formally submit this fix?  I'd like to apply it soon.
> 
> Thanks.

Sure, I can do this right now, but I want to fully credit Dan Siemon on
this one, as the author of the patch.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ed6603..fabfe81 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -171,7 +171,7 @@  static struct dst_ops ipv4_dst_ops = {
 
 const __u8 ip_tos2prio[16] = {
 	TC_PRIO_BESTEFFORT,
-	ECN_OR_COST(FILLER),
+	ECN_OR_COST(BESTEFFORT),
 	TC_PRIO_BESTEFFORT,
 	ECN_OR_COST(BESTEFFORT),
 	TC_PRIO_BULK,