diff mbox

[RFC] gre: conform to RFC6040 ECN progogation

Message ID 20120924144457.0c76bce2@nehalam.linuxnetplumber.net
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

stephen hemminger Sept. 24, 2012, 9:44 p.m. UTC
Linux GRE was likely written before this RFC and therefore does not
conform to one of the rules in Section 4.2.  Default Tunnel Egress Behaviour.

The new code addresses:
 o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
      propagate any other ECN codepoint onwards.  This is because the
      inner Not-ECT marking is set by transports that rely on dropped
      packets as an indication of congestion and would not understand or
      respond to any other ECN codepoint [RFC4774].  Specifically:

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         CE, the decapsulator MUST drop the packet.

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
         outgoing packet with the ECN field cleared to Not-ECT.

This was caught by Chris Wright while reviewing VXLAN.
This code has not been tested with real ECN through tunnel.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Sept. 24, 2012, 10:25 p.m. UTC | #1
On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote:
> Linux GRE was likely written before this RFC and therefore does not
> conform to one of the rules in Section 4.2.  Default Tunnel Egress Behaviour.
> 
> The new code addresses:
>  o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
>       propagate any other ECN codepoint onwards.  This is because the
>       inner Not-ECT marking is set by transports that rely on dropped
>       packets as an indication of congestion and would not understand or
>       respond to any other ECN codepoint [RFC4774].  Specifically:
> 
>       *  If the inner ECN field is Not-ECT and the outer ECN field is
>          CE, the decapsulator MUST drop the packet.
> 
>       *  If the inner ECN field is Not-ECT and the outer ECN field is
>          Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
>          outgoing packet with the ECN field cleared to Not-ECT.
> 
> This was caught by Chris Wright while reviewing VXLAN.
> This code has not been tested with real ECN through tunnel.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

It seems dangerous to me without any logging ?

One could argue that the outer ECN field should not be CE if the inner
was Not-ECT

It means :
1) the encapsulator set ECT(0), or ECT(1) and a congestioned hop set CE
2) the encapsulator set CE

1) or 2) while inner was not-ECT (!!!)

If a router does such a thing, we should log a message to help
diagnostics.

By the way other tunnels probably have the same issues.




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger Sept. 24, 2012, 10:30 p.m. UTC | #2
On Tue, 25 Sep 2012 00:25:36 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote:
> > Linux GRE was likely written before this RFC and therefore does not
> > conform to one of the rules in Section 4.2.  Default Tunnel Egress Behaviour.
> > 
> > The new code addresses:
> >  o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
> >       propagate any other ECN codepoint onwards.  This is because the
> >       inner Not-ECT marking is set by transports that rely on dropped
> >       packets as an indication of congestion and would not understand or
> >       respond to any other ECN codepoint [RFC4774].  Specifically:
> > 
> >       *  If the inner ECN field is Not-ECT and the outer ECN field is
> >          CE, the decapsulator MUST drop the packet.
> > 
> >       *  If the inner ECN field is Not-ECT and the outer ECN field is
> >          Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
> >          outgoing packet with the ECN field cleared to Not-ECT.
> > 
> > This was caught by Chris Wright while reviewing VXLAN.
> > This code has not been tested with real ECN through tunnel.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> It seems dangerous to me without any logging ?
> 
> One could argue that the outer ECN field should not be CE if the inner
> was Not-ECT
> 
> It means :
> 1) the encapsulator set ECT(0), or ECT(1) and a congestioned hop set CE
> 2) the encapsulator set CE
> 
> 1) or 2) while inner was not-ECT (!!!)
> 
> If a router does such a thing, we should log a message to help
> diagnostics.
> 
> By the way other tunnels probably have the same issues.

Logging is a bad idea in this case since the tunnel might be from a remote
host/protocol and the log would be filled with crap.

The tunnels in general do need to have rx_dropped counter, but it looks
like that isn't being done right either.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 25, 2012, 5:17 a.m. UTC | #3
On Mon, 2012-09-24 at 15:30 -0700, Stephen Hemminger wrote:

> Logging is a bad idea in this case since the tunnel might be from a remote
> host/protocol and the log would be filled with crap.
> 
> The tunnels in general do need to have rx_dropped counter, but it looks
> like that isn't being done right either.
> 

I never suggested to log _all_ messages.


Stephen, I personally was hit by some provider playing with TOS bits so
wrong I had to patch linux to fix a minor problem. [1]

I would like to know why my tunnels are going to fail, and what should I
do to get a fallback. Ie reverting your patches.


RFC 6040 states :

      In these cases,
      particularly the more dangerous ones, the decapsulator SHOULD log
      the event and MAY also raise an alarm.

      Just because the highlighted combinations are currently unused,
      does not mean that all the other combinations are always valid.
      Some are only valid if they have arrived from a particular type of
      legacy ingress, and dangerous otherwise.  Therefore, an
      implementation MAY allow an operator to configure logging and
      alarms for such additional header combinations known to be
      dangerous or CU for the particular configuration of tunnel
      endpoints deployed at run-time.

      Alarms SHOULD be rate-limited so that the anomalous combinations
      will not amplify into a flood of alarm messages.  It MUST be
      possible to suppress alarms or logging, e.g., if it becomes
      apparent that a combination that previously was not used has
      started to be used for legitimate purposes such as a new standards
      action.


[1]
commit 7a269ffad72f3604b8982fa09c387670e0d2ee14
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Thu Sep 22 20:02:19 2011 +0000

    tcp: ECN blackhole should not force quickack mode
    
    While playing with a new ADSL box at home, I discovered that ECN
    blackhole can trigger suboptimal quickack mode on linux : We send one
    ACK for each incoming data frame, without any delay and eventual
    piggyback.
    
    This is because TCP_ECN_check_ce() considers that if no ECT is seen on a
    segment, this is because this segment was a retransmit.
    
    Refine this heuristic and apply it only if we seen ECT in a previous
    segment, to detect ECN blackhole at IP level.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings Oct. 1, 2012, 3:55 p.m. UTC | #4
On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote:
[...]
> --- a/net/ipv4/ip_gre.c	2012-09-21 08:45:55.948772761 -0700
> +++ b/net/ipv4/ip_gre.c	2012-09-24 14:35:54.666185603 -0700
[...]
> @@ -703,17 +704,18 @@ static int ipgre_rcv(struct sk_buff *skb
>  			skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
>  		}
>  
> +		__skb_tunnel_rx(skb, tunnel->dev);
> +
> +		skb_reset_network_header(skb);
> +		if (!ipgre_ecn_decapsulate(iph, skb))
> +			goto drop;
> +
>  		tstats = this_cpu_ptr(tunnel->dev->tstats);
>  		u64_stats_update_begin(&tstats->syncp);
>  		tstats->rx_packets++;
>  		tstats->rx_bytes += skb->len;
>  		u64_stats_update_end(&tstats->syncp);

I don't know why you're moving this code above the stats update;
rx_packets/rx_bytes should include dropped packets.

Ben.

> -		__skb_tunnel_rx(skb, tunnel->dev);
> -
> -		skb_reset_network_header(skb);
> -		ipgre_ecn_decapsulate(iph, skb);
> -
>  		netif_rx(skb);
>  
>  		rcu_read_unlock();
stephen hemminger Oct. 1, 2012, 3:56 p.m. UTC | #5
On Mon, 1 Oct 2012 16:55:19 +0100
Ben Hutchings <bhutchings@solarflare.com> wrote:

> On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote:
> [...]
> > --- a/net/ipv4/ip_gre.c	2012-09-21 08:45:55.948772761 -0700
> > +++ b/net/ipv4/ip_gre.c	2012-09-24 14:35:54.666185603 -0700
> [...]
> > @@ -703,17 +704,18 @@ static int ipgre_rcv(struct sk_buff *skb
> >  			skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> >  		}
> >  
> > +		__skb_tunnel_rx(skb, tunnel->dev);
> > +
> > +		skb_reset_network_header(skb);
> > +		if (!ipgre_ecn_decapsulate(iph, skb))
> > +			goto drop;
> > +
> >  		tstats = this_cpu_ptr(tunnel->dev->tstats);
> >  		u64_stats_update_begin(&tstats->syncp);
> >  		tstats->rx_packets++;
> >  		tstats->rx_bytes += skb->len;
> >  		u64_stats_update_end(&tstats->syncp);
> 
> I don't know why you're moving this code above the stats update;
> rx_packets/rx_bytes should include dropped packets.
> 
> Ben.
> 
> > -		__skb_tunnel_rx(skb, tunnel->dev);
> > -
> > -		skb_reset_network_header(skb);
> > -		ipgre_ecn_decapsulate(iph, skb);
> > -
> >  		netif_rx(skb);
> >  
> >  		rcu_read_unlock();
> 

Is that true? I thought they were accounted for by rx_dropped.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings Oct. 1, 2012, 4:49 p.m. UTC | #6
On Mon, 2012-10-01 at 08:56 -0700, Stephen Hemminger wrote:
> On Mon, 1 Oct 2012 16:55:19 +0100
> Ben Hutchings <bhutchings@solarflare.com> wrote:
> 
> > On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote:
> > [...]
> > > --- a/net/ipv4/ip_gre.c	2012-09-21 08:45:55.948772761 -0700
> > > +++ b/net/ipv4/ip_gre.c	2012-09-24 14:35:54.666185603 -0700
> > [...]
> > > @@ -703,17 +704,18 @@ static int ipgre_rcv(struct sk_buff *skb
> > >  			skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> > >  		}
> > >  
> > > +		__skb_tunnel_rx(skb, tunnel->dev);
> > > +
> > > +		skb_reset_network_header(skb);
> > > +		if (!ipgre_ecn_decapsulate(iph, skb))
> > > +			goto drop;
> > > +
> > >  		tstats = this_cpu_ptr(tunnel->dev->tstats);
> > >  		u64_stats_update_begin(&tstats->syncp);
> > >  		tstats->rx_packets++;
> > >  		tstats->rx_bytes += skb->len;
> > >  		u64_stats_update_end(&tstats->syncp);
> > 
> > I don't know why you're moving this code above the stats update;
> > rx_packets/rx_bytes should include dropped packets.
> > 
> > Ben.
> > 
> > > -		__skb_tunnel_rx(skb, tunnel->dev);
> > > -
> > > -		skb_reset_network_header(skb);
> > > -		ipgre_ecn_decapsulate(iph, skb);
> > > -
> > >  		netif_rx(skb);
> > >  
> > >  		rcu_read_unlock();
> > 
> 
> Is that true? I thought they were accounted for by rx_dropped.

I don't think rx_dropped is appropriate for counting invalid packets,
but maybe actual practice is already different.

As for whether packets counted in rx_dropped should also be counted in
rx_packets/rx_bytes, I really don't know.  The current comments on
rtnl_link_stats (inherited from net_device_stats) are totally inadequate
as a specification.

Ben.
Eric Dumazet Oct. 1, 2012, 5:13 p.m. UTC | #7
On Mon, 2012-10-01 at 17:49 +0100, Ben Hutchings wrote:

> I don't think rx_dropped is appropriate for counting invalid packets,
> but maybe actual practice is already different.
> 
> As for whether packets counted in rx_dropped should also be counted in
> rx_packets/rx_bytes, I really don't know.  The current comments on
> rtnl_link_stats (inherited from net_device_stats) are totally inadequate
> as a specification.

rx_dropped is used by core network stack, not the devices themselves.

So a packet is first accounted in rx_bytes/rx_packets by the driver,
and if net/core/dev.c drops it, rx_dropped is incremented as well.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger Oct. 1, 2012, 9:21 p.m. UTC | #8
On Mon, 01 Oct 2012 19:13:47 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2012-10-01 at 17:49 +0100, Ben Hutchings wrote:
> 
> > I don't think rx_dropped is appropriate for counting invalid packets,
> > but maybe actual practice is already different.
> > 
> > As for whether packets counted in rx_dropped should also be counted in
> > rx_packets/rx_bytes, I really don't know.  The current comments on
> > rtnl_link_stats (inherited from net_device_stats) are totally inadequate
> > as a specification.
> 
> rx_dropped is used by core network stack, not the devices themselves.
> 
> So a packet is first accounted in rx_bytes/rx_packets by the driver,
> and if net/core/dev.c drops it, rx_dropped is incremented as well.

The tunnel drivers are consistent in putting any dropped packet because
of protocol problem into a rx_XXX_error value and incrementing rx_errors.

I just made it treat ECN bit breakage like all the other protocol errors.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/net/ipv4/ip_gre.c	2012-09-21 08:45:55.948772761 -0700
+++ b/net/ipv4/ip_gre.c	2012-09-24 14:35:54.666185603 -0700
@@ -567,15 +567,16 @@  out:
 	rcu_read_unlock();
 }
 
-static inline void ipgre_ecn_decapsulate(const struct iphdr *iph, struct sk_buff *skb)
+static int ipgre_ecn_decapsulate(const struct iphdr *iph, struct sk_buff *skb)
 {
 	if (INET_ECN_is_ce(iph->tos)) {
 		if (skb->protocol == htons(ETH_P_IP)) {
-			IP_ECN_set_ce(ip_hdr(skb));
+			return IP_ECN_set_ce(ip_hdr(skb));
 		} else if (skb->protocol == htons(ETH_P_IPV6)) {
-			IP6_ECN_set_ce(ipv6_hdr(skb));
+			return IP6_ECN_set_ce(ipv6_hdr(skb));
 		}
 	}
+	return 1;
 }
 
 static inline u8
@@ -703,17 +704,18 @@  static int ipgre_rcv(struct sk_buff *skb
 			skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
 		}
 
+		__skb_tunnel_rx(skb, tunnel->dev);
+
+		skb_reset_network_header(skb);
+		if (!ipgre_ecn_decapsulate(iph, skb))
+			goto drop;
+
 		tstats = this_cpu_ptr(tunnel->dev->tstats);
 		u64_stats_update_begin(&tstats->syncp);
 		tstats->rx_packets++;
 		tstats->rx_bytes += skb->len;
 		u64_stats_update_end(&tstats->syncp);
 
-		__skb_tunnel_rx(skb, tunnel->dev);
-
-		skb_reset_network_header(skb);
-		ipgre_ecn_decapsulate(iph, skb);
-
 		netif_rx(skb);
 
 		rcu_read_unlock();