ipv4: add DiffServ priority based routing

Message ID 201001121432.43301.schmto@hrz.tu-chemnitz.de
State Rejected, archived
Delegated to: David Miller

Commit Message

Torsten Schmidt Jan. 12, 2010, 1:32 p.m. UTC
Enables IPv4 Differentiated Services support for IP priority based
routing. Note that the IP TOS field was redefined as DiffServ in 1998
(RFC 2474). Type Of Service has been deprecated since 1998!

This patch adds a compliance flag to net/ipv4/Kconfig, which allows
the user to select DiffServ or TOS priority based routing. The default
answer is TOS.

Signed-off-by: Torsten Schmidt <schmto@hrz.tu-chemnitz.de>
---
 include/net/route.h |    4 ++++
 net/ipv4/Kconfig    |   15 +++++++++++++++
 2 files changed, 19 insertions(+), 0 deletions(-)

--

Comments

David Miller Jan. 12, 2010, 8:16 p.m. UTC | #1
You can't do any of these things you are doing, I've basically been
ignoring all of these crazy diffserv patches, they're nuts!

The TOS socket option has a meaning and behavior defined by the BSD
sockets interface many years ago.  And you cannot and must not change
the behavior of those system calls because applications are written to
the current behavior and you will break them.  Protecting the new
behavior with a kernel config option is a non-starter, it's pointless
because no distribution is going to enable a kernel option that
knowingly breaks applications.

And it is also possible to set the TOS field however you desire using
what the kernel currently provides, we do not preclude proper diffserv
support, the BSD socket interfaces allow that just fine.

And you can also do diffserv by classifying traffic and setting the
TOS field using either the packet scheduler, or even netfilter.

Linux supports diffserv fully and just fine, you just can't see it :-)

Please stop submitting these patches without first having at least a
real discussion and understanding of how this stuff works.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
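The point that the existing BSD socket interface already lets an application choose any DiffServ codepoint can be illustrated with a minimal userspace sketch. It assumes a Linux or BSD host and an already created IPv4 socket; the helper names `dscp_to_tos` and `set_dscp` are made up for this example and are not kernel or libc API.

```c
#include <netinet/in.h>   /* IPPROTO_IP, IP_TOS */
#include <sys/socket.h>   /* setsockopt */

/* Hypothetical helper: place a 6-bit DSCP in the upper six bits of the
 * former TOS byte, leaving the two low-order (ECN) bits cleared. */
unsigned char dscp_to_tos(unsigned int dscp)
{
	return (unsigned char)((dscp & 0x3f) << 2);
}

/* Hypothetical helper: ask the kernel to mark packets sent on fd with
 * the given DSCP. Returns 0 on success, -1 on error (errno is set by
 * setsockopt). */
int set_dscp(int fd, unsigned int dscp)
{
	int tos = dscp_to_tos(dscp);

	return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}
```

For example, `set_dscp(fd, 46)` would request Expedited Forwarding by passing the byte value 0xb8 to the existing `IP_TOS` option; no new system call or kernel option is involved.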
Philip Prindeville Jan. 12, 2010, 8:59 p.m. UTC | #2
On 01/12/2010 12:16 PM, David Miller wrote:
> 
> You can't do any of these things you are doing, I've basically been
> ignoring all of these crazy diffserv patches, they're nuts!
> 
> The TOS socket option has a meaning and behavior defined by the BSD
> sockets interface many years ago.  And you cannot and must not change
> the behavior of those system calls because applications are written to
> the current behavior and you will break them.  Protecting the new
> behavior with a kernel config option is a non-starter, it's pointless
> because no distribution is going to enable a kernel option that
> knowingly breaks applications.
> 
> And it is also possible to set the TOS field however you desire using
> what the kernel currently provides, we do not preclude proper diffserv
> support, the BSD socket interfaces allow that just fine.
> 
> And you can also do diffserv by classifying traffic and setting the
> TOS field using either the packet scheduler, or even netfilter.
> 
> Linux supports diffserv fully and just fine, you just can't see it :-)
> 
> Please stop submitting these patches without first having at least a
> real discussion and understanding of how this stuff works.
> 
> Thanks.

I disagree.

The TOS socket option means "use these bits as the value of iphdr->ip_tos exactly as I'm giving them to you".

That hasn't changed.

What has changed is how network equipment is required to interpret the meaning of those bits.

As for "And you cannot and must not change the behavior of those system calls because applications are written to the current behavior and you will break them."  For me, this is the real non-starter.  Even if we pass the bits "as is" to the network, if the network is applying entirely new semantics (and when I say "entirely new", I mean those mandated since 1998), then compatibility in the host kernel API doesn't matter a hoot since the packets will still be handled by every transited router according to the modern semantics.

I note that the lower two bits of the TOS field were appropriated for ECN at the same time, and that hasn't broken a thing.

--
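The disagreement above is about one header byte that can be read in two ways. The following sketch, which is only an illustration and not code from the patch, decodes the same byte under the older precedence/TOS layout (RFC 791 and RFC 1349) and under the DiffServ/ECN layout (RFC 2474 and RFC 3168). The struct and function names are assumptions made for this example.

```c
#include <stdint.h>

/* Older view: 3-bit precedence, 4-bit TOS, lowest bit unused. */
struct tos_view {
	uint8_t precedence;  /* bits 7..5 */
	uint8_t tos;         /* bits 4..1 */
};

/* DiffServ view: 6-bit DSCP, 2-bit ECN. */
struct ds_view {
	uint8_t dscp;        /* bits 7..2 */
	uint8_t ecn;         /* bits 1..0 */
};

/* Decode the byte the way a pre-1998 host or router would. */
struct tos_view as_tos(uint8_t b)
{
	struct tos_view v;

	v.precedence = (b >> 5) & 0x07;
	v.tos = (b >> 1) & 0x0f;
	return v;
}

/* Decode the same byte the way a DiffServ node would. */
struct ds_view as_ds(uint8_t b)
{
	struct ds_view v;

	v.dscp = b >> 2;
	v.ecn = b & 0x03;
	return v;
}
```

The socket option passes the byte through unchanged in both cases; only the decoding differs. The byte 0xb8, for instance, reads as precedence 5 with TOS bits 0xc in the older view and as DSCP 46 (EF) with ECN 0 in the DiffServ view.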
David Miller Jan. 12, 2010, 9:03 p.m. UTC | #3
From: "Philip A. Prindeville" <philipp_subx@redfish-solutions.com>
Date: Tue, 12 Jan 2010 12:59:36 -0800

> What has changed is how network equipment is required to interpret
> the meaning of those bits.
>
> Even if we pass the bits "as is" to the network, if the network is
> applying entirely new semantics (and when I say "entirely new", I
> mean those mandated since 1998), then compatibility in the host
> kernel API doesn't matter a hoot since the packets will still be
> handled by every transited router according to the modern semantics.

People really don't assign global meaning to bits set by applications
in the TOS field.

What they do is they have a set of semantics inside of their cloud of
routers and switch points for diffserv, and when packets come in the
TOS field is rewritten to whatever scheme is being used inside of that
cloud.

And the diffserv bits only have meaning and effect within that cloud.

So really, having a syscall that sets the TOS bits exactly by
applications is just fine.

People are doing diffserv right now with Linux and have done so
for years.
--
Philip Prindeville Jan. 12, 2010, 9:33 p.m. UTC | #4
On 01/12/2010 01:03 PM, David Miller wrote:
> From: "Philip A. Prindeville" <philipp_subx@redfish-solutions.com>
> Date: Tue, 12 Jan 2010 12:59:36 -0800
> 
>> What has changed is how network equipment is required to interpret
>> the meaning of those bits.
>>
>> Even if we pass the bits "as is" to the network, if the network is
>> applying entirely new semantics (and when I say "entirely new", I
>> mean those mandated since 1998), then compatibility in the host
>> kernel API doesn't matter a hoot since the packets will still be
>> handled by every transited router according to the modern semantics.
> 
> People really don't assign global meaning to bits set by applications
> in the TOS field.

Since I'm not a clairvoyant, I can't speak for "people".  But I will say that I do assign such a meaning, and based on that interpretation, other people have code reviewed patches and accepted them, so at least "some people" share my interpretation.

I've submitted QoS fixes for NTP, Proftp, Cyrus, Apache/apr, Sendmail, CURL, Thunderbird, Firefox, and several other packages.

All of which very much depend on host compliance with RFC-2474 and 2597/2598.


> What they do is they have a set of semantics inside of their cloud of
> routers and switch points for diffserv, and when packets come in the
> TOS field is rewritten to whatever scheme is being used inside of that
> cloud.

Uh, no.  Net Neutrality very much requires consistent end-to-end interpretation of ToS bits by backbone carriers.  If you know of a carrier that isn't honoring ToS bits, I have a group of lawyers I'd like them to meet.


> And the diffserv bits only have meaning and effect within that cloud.

Have you read RFC-2474 lately?  You only need to get as far as the Abstract:

   The services may be either end-to-end or intra-domain; they include
   both those that can satisfy quantitative performance requirements (e.g.,
   peak bandwidth) and those based on relative performance (e.g., "class"
   differentiation).

"end-to-end"... seems pretty clear to me.


> So really, having a syscall that sets the TOS bits exactly by
> applications is just fine.
> 
> People are doing diffserv right now with Linux and have done so
> for years.

Right, and I suspect in most cases, the default behavior of the host is to misinterpret the bits and put the packet in the wrong output queue.

-Philip
--
Steven Blake Jan. 13, 2010, 4:47 a.m. UTC | #5
On Tue, 12 Jan 2010 13:33:05 -0800, "Philip A. Prindeville"
<philipp_subx@redfish-solutions.com> wrote:

> On 01/12/2010 01:03 PM, David Miller wrote:

>> What they do is they have a set of semantics inside of their cloud of
>> routers and switch points for diffserv, and when packets come in the
>> TOS field is rewritten to whatever scheme is being used inside of that
>> cloud.
> 
> Uh, no.  Net Neutrality very much requires consistent end-to-end
> interpretation of ToS bits by backbone carriers.  If you know of a
> carrier that isn't honoring ToS bits, I have a group of lawyers I'd
> like them to meet.

Few if any ISPs are honoring any DSCP bits outside of negotiated contracts.

 
>> And the diffserv bits only have meaning and effect within that cloud.
> 
> Have you read RFC-2474 lately?  You only need to get as far as the
> Abstract:
> 
>    The services may be either end-to-end or intra-domain; they include
>    both those that can satisfy quantitative performance requirements
>    (e.g., peak bandwidth) and those based on relative performance
>    (e.g., "class" differentiation).
> 
> "end-to-end"... seems pretty clear to me.

David's understanding of RFC 2474 is correct.


Regards,

// Steve
--
Torsten Schmidt Jan. 14, 2010, 11:50 a.m. UTC | #6
On Tuesday 12 January 2010 21:16:07 you wrote:
> You can't do any of these things you are doing, I've basically been
> ignoring all of these crazy diffserv patches, they're nuts!
> 
> The TOS socket option has a meaning and behavior defined by the BSD
> sockets interface many years ago.  And you cannot and must not change
> the behavior of those system calls because applications are written to
> the current behavior and you will break them.  Protecting the new
> behavior with a kernel config option is a non-starter, it's pointless
> because no distribution is going to enable a kernel option that
> knowingly breaks applications.
> 
> And it is also possible to set the TOS field however you desire using
> what the kernel currently provides, we do not preclude proper diffserv
> support, the BSD socket interfaces allow that just fine.
okay, I noticed.

> And you can also do diffserv by classifying traffic and setting the
> TOS field using either the packet scheduler, or even netfilter.
Our company is more interested in IP DiffServ traffic accounting.
So we simply need a fast mechanism which generates IP DiffServ traffic
statistics without setting the network interface in promiscuous mode
(which softflowd or tcpdump do).

So the first idea was to implement a virtual file, e.g. /proc/net/ip_dscp.
If the mainline people are not interested, I will no longer send these
patches.

Thanks,
Torsten

PS.: Maybe we could implement a more general mechanism to classify TOS/DiffServ 
traffic in the kernel (if needed) ?
--
Eric Dumazet Jan. 14, 2010, 12:50 p.m. UTC | #7
On 14/01/2010 12:50, Torsten Schmidt wrote:
> Our company is more interested in IP DiffServ traffic accounting.
> So we simply need a fast mechanism which generates IP DiffServ traffic
> statistics without setting the network interface in promiscuous mode
> (which softflowd or tcpdump do).

tcpdump can work in non-promiscuous mode (-p). Anyway, switched networks make
promiscuous mode a non-issue nowadays.

> 
> So the first idea was to implement a virtual file, e.g. /proc/net/ip_dscp.
> If the mainline people are not interested, I will no longer send these
> patches.

Problem is, you might need many sets of counters...

1) General (for each net namespace ...)
2) Per interface
3) Per protocol ?

For performance reasons, SNMP counters are per cpu, so each set would be very large.

In any case, don't add yet another /proc/net file; this is considered a hacky way to do
these things today.

Thanks
--
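The concern about the number of counter sets can be made concrete with a small userspace sketch. It is not kernel code: the struct layout, the choice of one packet counter and one byte counter per codepoint, and all names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define DSCP_VALUES 64	/* a 6-bit DSCP has 64 possible values */

/* One packet counter and one byte counter per codepoint (assumed layout). */
struct dscp_counter {
	uint64_t packets;
	uint64_t bytes;
};

/* One complete set of counters. */
struct dscp_stats {
	struct dscp_counter c[DSCP_VALUES];
};

/* Account one packet of len bytes whose DS/TOS byte is tos. */
void dscp_account(struct dscp_stats *s, uint8_t tos, size_t len)
{
	struct dscp_counter *c = &s->c[tos >> 2];

	c->packets++;
	c->bytes += len;
}

/* Memory needed when every set is replicated per CPU, for a given
 * number of sets (for example one per interface). */
size_t dscp_stats_footprint(unsigned int cpus, unsigned int sets)
{
	return sizeof(struct dscp_stats) * cpus * sets;
}
```

Under these assumptions a single set takes 1024 bytes, so keeping one set per interface on a machine with 16 CPUs and 10 interfaces would already need 160 KiB, before adding per-namespace or per-protocol sets.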
David Miller Jan. 15, 2010, 12:51 a.m. UTC | #8
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 14 Jan 2010 13:50:51 +0100

> Problem is, you might need many sets of counters...

Why nobody has suggested using existing kernel facilities for this is
beyond me.

We have a diffserv packet scheduler queueing discipline, so you can
classify traffic arbitrarily based upon the diffserv bits in the
packet and then shape them using that classification into different
packet scheduler queues.

Simply attach those child queues to that default pfifo qdisc, nothing
fancy.

Then you can dump the queue stats using 'tc'.
--
Eric Dumazet Jan. 15, 2010, 8:24 a.m. UTC | #9
On 15/01/2010 01:51, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 14 Jan 2010 13:50:51 +0100
> 
>> Problem is, you might need many sets of counters...
> 
> Why nobody has suggested using existing kernel facilities for this is
> beyond me.
> 
> We have a diffserv packet scheduler queueing discipline, so you can
> classify traffic arbitrarily based upon the diffserv bits in the
> packet and then shape them using that classification into different
> packet scheduler queues.
> 
> Simply attach those child queues to that default pfifo qdisc, nothing
> fancy.
> 
> Then you can dump the queue stats using 'tc'.

Well, it's a good idea but it has drawbacks, for example if your eth0 device
has 64 tx queues. Parsing tc output can be... interesting :)

--
David Miller Jan. 15, 2010, 8:26 a.m. UTC | #10
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 15 Jan 2010 09:24:36 +0100

> Well, it's a good idea but it has drawbacks, for example if your eth0 device
> has 64 tx queues. Parsing tc output can be... interesting :)

Tool problem :-)
--
Philip Prindeville March 11, 2010, 7:25 p.m. UTC | #11
On 01/12/2010 02:03 PM, David Miller wrote:
> From: "Philip A. Prindeville" <philipp_subx@redfish-solutions.com>
> Date: Tue, 12 Jan 2010 12:59:36 -0800
> 
>> What has changed is how network equipment is required to interpret
>> the meaning of those bits.
>>
>> Even if we pass the bits "as is" to the network, if the network is
>> applying entirely new semantics (and when I say "entirely new", I
>> mean those mandated since 1998), then compatibility in the host
>> kernel API doesn't matter a hoot since the packets will still be
>> handled by every transited router according to the modern semantics.
> 
> People really don't assign global meaning to bits set by applications
> in the TOS field.
> 
> What they do is they have a set of semantics inside of their cloud of
> routers and switch points for diffserv, and when packets come in the
> TOS field is rewritten to whatever scheme is being used inside of that
> cloud.
> 
> And the diffserv bits only have meaning and effect within that cloud.
> 
> So really, having a syscall that sets the TOS bits exactly by
> applications is just fine.
> 
> People are doing diffserv right now with Linux and have done so
> for years.

Sorry about coming back to this weeks later... but I hadn't seen RFC 4594 previously.

What if boxes (i.e. the OS) and applications can be preconfigured to use RFC-4594 guidelines by default, and varying from that requires the administrator to make specific changes?

I agree with the notion that certain values should be set site-wide (or at least system-wide) to prevent malicious users from exploiting QoS...  that's why I've been advocating for QoS settings to be specified in a system configuration file, and not a per-user configuration file.

-Philip
--
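As a rough illustration of what system-wide defaults following the RFC 4594 guidelines could look like, the sketch below maps service classes to the DSCP values that RFC recommends (for the AF classes only the first drop precedence is listed). The table name, the lookup function, and the class-name strings are hypothetical; no such kernel or libc interface is implied.

```c
#include <stddef.h>
#include <string.h>

struct service_class {
	const char *name;     /* hypothetical configuration key */
	unsigned char dscp;   /* recommended codepoint */
};

/* Recommended markings per RFC 4594 service class. */
static const struct service_class rfc4594_defaults[] = {
	{ "network-control",         48 },	/* CS6  */
	{ "telephony",               46 },	/* EF   */
	{ "signaling",               40 },	/* CS5  */
	{ "multimedia-conferencing", 34 },	/* AF41 */
	{ "realtime-interactive",    32 },	/* CS4  */
	{ "multimedia-streaming",    26 },	/* AF31 */
	{ "broadcast-video",         24 },	/* CS3  */
	{ "low-latency-data",        18 },	/* AF21 */
	{ "oam",                     16 },	/* CS2  */
	{ "high-throughput-data",    10 },	/* AF11 */
	{ "standard",                 0 },	/* DF   */
	{ "low-priority-data",        8 },	/* CS1  */
};

/* Return the default DSCP for a service class, or -1 if unknown. */
int default_dscp(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(rfc4594_defaults) / sizeof(rfc4594_defaults[0]); i++) {
		if (strcmp(rfc4594_defaults[i].name, name) == 0)
			return rfc4594_defaults[i].dscp;
	}
	return -1;
}
```

An application could then look up its class once at startup and pass the result, shifted left by two bits, to the existing `IP_TOS` socket option, while an administrator overrides individual entries in one system-wide place.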
David Miller March 11, 2010, 7:29 p.m. UTC | #12
From: "Philip A. Prindeville" <philipp_subx@redfish-solutions.com>
Date: Thu, 11 Mar 2010 12:25:24 -0700

> I agree with the notion that certain values should be set site-wide
> (or at least system-wide) to prevent malicious users from exploiting
> QoS...  that's why I've been advocating for QoS settings to be
> specified in a system configuration file, and not a per-user
> configuration file.

So I can set whatever I want on my personal workstation.

I'm sure sysadmins will be happy about that.

Look, this doesn't work.  QoS handling and policy belongs in the
egress point to the network, it's the only way to control this
properly and prevent abuse.
--
Philip Prindeville March 11, 2010, 7:32 p.m. UTC | #13
On 03/11/2010 12:29 PM, David Miller wrote:
> From: "Philip A. Prindeville" <philipp_subx@redfish-solutions.com>
> Date: Thu, 11 Mar 2010 12:25:24 -0700
> 
>> I agree with the notion that certain values should be set site-wide
>> (or at least system-wide) to prevent malicious users from exploiting
>> QoS...  that's why I've been advocating for QoS settings to be
>> specified in a system configuration file, and not a per-user
>> configuration file.
> 
> So I can set whatever I want on my personal workstation.
> 
> I'm sure sysadmins will be happy about that.
> 
> Look, this doesn't work.  QoS handling and policy belongs in the
> egress point to the network, it's the only way to control this
> properly and prevent abuse.


Well, anyone who has 'root' on their workstation can already do a fair amount of damage on a network... we're not letting any new genies out of the bottle... we're just giving them more room to stretch.

"QoS handling and policy belongs in the egress point to the network, it's the only way to control this properly and prevent abuse."

Except that it doesn't.  As I pointed out in another email, TFTP, FTP-Data, and RTP are very hard to categorize correctly.

For that matter, so is SSH, since it can be an interactive shell session, an SCP file transfer, or a mix of various tunneled protocols like X and LPR.

So by the time packets get to the egress point, oftentimes you've lost sufficient information to adequately categorize them.

--
Benny Amorsen March 12, 2010, 11:18 a.m. UTC | #14
David Miller <davem@davemloft.net> writes:

> Look, this doesn't work.  QoS handling and policy belongs in the
> egress point to the network, it's the only way to control this
> properly and prevent abuse.

First, QoS is important even within the network. Modern switches come
pre-configured with sane defaults which ensure that e.g. EF marked
packets get priority over non-EF-marked packets. Cisco, HP, and
Linksys-Cisco at least provide a decent out-of-the-box configuration.

This can obviously be abused, but the solution there is the same as in
network abuses: Either apply the LART or change the configuration of the
switches to be less trusting. We haven't, so far, had a customer where
the LART was necessary, much less had to reconfigure a switch.

So why not let Linux provide the same out-of-the-box experience as the
switches do? If the trust is abused Linux provides lots of tools to make
it less trusting or even to punish the abusers.


/Benny
--
Philip Prindeville Feb. 21, 2011, 6:01 a.m. UTC | #15
On 3/12/10 3:18 AM, Benny Amorsen wrote:
> David Miller<davem@davemloft.net>  writes:
>
>> Look, this doesn't work.  QoS handling and policy belongs in the
>> egress point to the network, it's the only way to control this
>> properly and prevent abuse.
> First, QoS is important even within the network. Modern switches come
> pre-configured with sane defaults which ensure that e.g. EF marked
> packets get priority over non-EF-marked packets. Cisco, HP, and
> Linksys-Cisco at least provide a decent out-of-the-box configuration.
>
> This can obviously be abused, but the solution there is the same as in
> network abuses: Either apply the LART or change the configuration of the
> switches to be less trusting. We haven't, so far, had a customer where
> the LART was necessary, much less had to reconfigure a switch.
>
> So why not let Linux provide the same out-of-the-box experience as the
> switches do? If the trust is abused Linux provides lots of tools to make
> it less trusting or even to punish the abusers.
>
>
> /Benny

For those who want to use DiffServ as the out-of-the-box default configuration, and trust the marking on their traffic, I don't understand why certain folks are so adamant about not supporting this.

Torsten's patch to allow rt_tos2priority() to use IPTOS_PRECEDENCE() instead seemed reasonable.

Especially in a network using 802.1p or 802.1q encapsulation.


--
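To make the effect of the proposed change concrete, here is a userspace sketch that places the two mappings from the patch side by side. The table is reproduced from memory of `ip_tos2prio` in net/ipv4/route.c of that era and the band numbers from linux/pkt_sched.h, so both should be treated as assumptions and checked against the tree; the function names are invented for this example.

```c
#include <stdint.h>

/* Priority bands; numeric values assumed to match linux/pkt_sched.h. */
enum {
	PRIO_BESTEFFORT       = 0,
	PRIO_FILLER           = 1,
	PRIO_BULK             = 2,
	PRIO_INTERACTIVE_BULK = 4,
	PRIO_INTERACTIVE      = 6
};

/* Assumed copy of the kernel's ip_tos2prio[] lookup table. */
static const uint8_t tos2prio[16] = {
	PRIO_BESTEFFORT,       PRIO_FILLER,
	PRIO_BESTEFFORT,       PRIO_BESTEFFORT,
	PRIO_BULK,             PRIO_BULK,
	PRIO_BULK,             PRIO_BULK,
	PRIO_INTERACTIVE,      PRIO_INTERACTIVE,
	PRIO_INTERACTIVE,      PRIO_INTERACTIVE,
	PRIO_INTERACTIVE_BULK, PRIO_INTERACTIVE_BULK,
	PRIO_INTERACTIVE_BULK, PRIO_INTERACTIVE_BULK
};

/* Existing behaviour: index the table with the four TOS bits. */
char prio_legacy(uint8_t tos)
{
	return (char)tos2prio[(tos & 0x1e) >> 1];
}

/* Behaviour under CONFIG_IP_DIFFSERV_COMPLIANT in the patch:
 * use the top three bits directly as the priority. */
char prio_diffserv(uint8_t tos)
{
	return (char)(tos >> 5);
}
```

Under these assumptions a byte of 0x10 (the classic low-delay TOS value) maps to priority 6 with the existing code and to 0 with the patched code, while 0xb8 (EF) maps to 4 and 5 respectively. Note that `tos >> 5` keeps only the three precedence (class selector) bits rather than the full six-bit DSCP.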
Patch

From 224f01d5c0f2ea36efe78d9de4247e756157c445 Mon Sep 17 00:00:00 2001
From: Torsten Schmidt <schmto@hrz.tu-chemnitz.de>
Date: Tue, 12 Jan 2010 14:26:39 +0100
Subject: [PATCH] ipv4: add DiffServ priority based routing

Enables IPv4 Differentiated Services support for IP priority based
routing. Note that the IP TOS field was redefined as DiffServ in 1998
(RFC 2474). Type Of Service has been deprecated since 1998!

This patch adds a compliance flag to net/ipv4/Kconfig, which allows
the user to select DiffServ or TOS priority based routing. The default
answer is TOS.

Signed-off-by: Torsten Schmidt <schmto@hrz.tu-chemnitz.de>
---
 include/net/route.h |    4 ++++
 net/ipv4/Kconfig    |   15 +++++++++++++++
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 40f6346..8bf43a5 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -141,7 +141,11 @@  extern const __u8 ip_tos2prio[16];
 
 static inline char rt_tos2priority(u8 tos)
 {
+#ifdef CONFIG_IP_DIFFSERV_COMPLIANT
+	return tos >> 5;
+#else
 	return ip_tos2prio[IPTOS_TOS(tos)>>1];
+#endif
 }
 
 static inline int ip_route_connect(struct rtable **rp, __be32 dst,
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 70491d9..e1be75c 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -272,6 +272,21 @@  config IP_PIMSM_V2
 	  gated-5). This routing protocol is not used widely, so say N unless
 	  you want to play with it.
 
+config IP_DIFFSERV_COMPLIANT
+	bool "IP: DiffServ priority routing"
+	default n
+	help
+	  Enables IPv4 Differentiated Services support for IP priority based
+	  routing. If you say YES here, TOS priority based routing is disabled.
+	  Notice that the IP TOS field was redefined 1998 to DiffServ (RFC 2474).
+	  Type Of Service is deprecated since 1998 ! So in future default answer
+	  should be YES.
+
+	    Y: DiffServ
+	    N: Type Of Service
+
+	  If unsure, say N.
+
 config ARPD
 	bool "IP: ARP daemon support"
 	---help---
-- 
1.6.3.3