Message ID | 20120924144457.0c76bce2@nehalam.linuxnetplumber.net |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote: > Linux GRE was likely written before this RFC and therefore does not > conform to one of the rules in Section 4.2. Default Tunnel Egress Behaviour. > > The new code addresses: > o If the inner ECN field is Not-ECT, the decapsulator MUST NOT > propagate any other ECN codepoint onwards. This is because the > inner Not-ECT marking is set by transports that rely on dropped > packets as an indication of congestion and would not understand or > respond to any other ECN codepoint [RFC4774]. Specifically: > > * If the inner ECN field is Not-ECT and the outer ECN field is > CE, the decapsulator MUST drop the packet. > > * If the inner ECN field is Not-ECT and the outer ECN field is > Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the > outgoing packet with the ECN field cleared to Not-ECT. > > This was caught by Chris Wright while reviewing VXLAN. > This code has not been tested with real ECN through tunnel. > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> It seems dangerous to me without any logging ? One could argue that the outer ECN field should not be CE if the inner was Not-ECT It means : 1) the encapsulator set ECT(0), or ECT(1) and a congestioned hop set CE 2) the encapsulator set CE 1) or 2) while inner was not-ECT (!!!) If a router does such a thing, we should log a message to help diagnostics. By the way other tunnels probably have the same issues. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 25 Sep 2012 00:25:36 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote: > > Linux GRE was likely written before this RFC and therefore does not > > conform to one of the rules in Section 4.2. Default Tunnel Egress Behaviour. > > > > The new code addresses: > > o If the inner ECN field is Not-ECT, the decapsulator MUST NOT > > propagate any other ECN codepoint onwards. This is because the > > inner Not-ECT marking is set by transports that rely on dropped > > packets as an indication of congestion and would not understand or > > respond to any other ECN codepoint [RFC4774]. Specifically: > > > > * If the inner ECN field is Not-ECT and the outer ECN field is > > CE, the decapsulator MUST drop the packet. > > > > * If the inner ECN field is Not-ECT and the outer ECN field is > > Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the > > outgoing packet with the ECN field cleared to Not-ECT. > > > > This was caught by Chris Wright while reviewing VXLAN. > > This code has not been tested with real ECN through tunnel. > > > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> > > It seems dangerous to me without any logging ? > > One could argue that the outer ECN field should not be CE if the inner > was Not-ECT > > It means : > 1) the encapsulator set ECT(0), or ECT(1) and a congestioned hop set CE > 2) the encapsulator set CE > > 1) or 2) while inner was not-ECT (!!!) > > If a router does such a thing, we should log a message to help > diagnostics. > > By the way other tunnels probably have the same issues. Logging is a bad idea in this case since the tunnel might be from a remote host/protocol and the log would be filled with crap. The tunnels in general do need to have rx_dropped counter, but it looks like that isn't being done right either. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2012-09-24 at 15:30 -0700, Stephen Hemminger wrote: > Logging is a bad idea in this case since the tunnel might be from a remote > host/protocol and the log would be filled with crap. > > The tunnels in general do need to have rx_dropped counter, but it looks > like that isn't being done right either. > I never suggested to log _all_ messages. Stephen, I personally was hit by some provider playing with TOS bits so wrong I had to patch linux to fix a minor problem. [1] I would like to know why my tunnels are going to fail, and what should I do to get a fallback. Ie reverting your patches. RFC 6040 states : In these cases, particularly the more dangerous ones, the decapsulator SHOULD log the event and MAY also raise an alarm. Just because the highlighted combinations are currently unused, does not mean that all the other combinations are always valid. Some are only valid if they have arrived from a particular type of legacy ingress, and dangerous otherwise. Therefore, an implementation MAY allow an operator to configure logging and alarms for such additional header combinations known to be dangerous or CU for the particular configuration of tunnel endpoints deployed at run-time. Alarms SHOULD be rate-limited so that the anomalous combinations will not amplify into a flood of alarm messages. It MUST be possible to suppress alarms or logging, e.g., if it becomes apparent that a combination that previously was not used has started to be used for legitimate purposes such as a new standards action. [1] commit 7a269ffad72f3604b8982fa09c387670e0d2ee14 Author: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu Sep 22 20:02:19 2011 +0000 tcp: ECN blackhole should not force quickack mode While playing with a new ADSL box at home, I discovered that ECN blackhole can trigger suboptimal quickack mode on linux : We send one ACK for each incoming data frame, without any delay and eventual piggyback. This is because TCP_ECN_check_ce() considers that if no ECT is seen on a segment, this is because this segment was a retransmit. Refine this heuristic and apply it only if we seen ECT in a previous segment, to detect ECN blackhole at IP level. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote: [...] > --- a/net/ipv4/ip_gre.c 2012-09-21 08:45:55.948772761 -0700 > +++ b/net/ipv4/ip_gre.c 2012-09-24 14:35:54.666185603 -0700 [...] > @@ -703,17 +704,18 @@ static int ipgre_rcv(struct sk_buff *skb > skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > } > > + __skb_tunnel_rx(skb, tunnel->dev); > + > + skb_reset_network_header(skb); > + if (!ipgre_ecn_decapsulate(iph, skb)) > + goto drop; > + > tstats = this_cpu_ptr(tunnel->dev->tstats); > u64_stats_update_begin(&tstats->syncp); > tstats->rx_packets++; > tstats->rx_bytes += skb->len; > u64_stats_update_end(&tstats->syncp); I don't know why you're moving this code above the stats update; rx_packets/rx_bytes should include dropped packets. Ben. > - __skb_tunnel_rx(skb, tunnel->dev); > - > - skb_reset_network_header(skb); > - ipgre_ecn_decapsulate(iph, skb); > - > netif_rx(skb); > > rcu_read_unlock();
On Mon, 1 Oct 2012 16:55:19 +0100 Ben Hutchings <bhutchings@solarflare.com> wrote: > On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote: > [...] > > --- a/net/ipv4/ip_gre.c 2012-09-21 08:45:55.948772761 -0700 > > +++ b/net/ipv4/ip_gre.c 2012-09-24 14:35:54.666185603 -0700 > [...] > > @@ -703,17 +704,18 @@ static int ipgre_rcv(struct sk_buff *skb > > skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > > } > > > > + __skb_tunnel_rx(skb, tunnel->dev); > > + > > + skb_reset_network_header(skb); > > + if (!ipgre_ecn_decapsulate(iph, skb)) > > + goto drop; > > + > > tstats = this_cpu_ptr(tunnel->dev->tstats); > > u64_stats_update_begin(&tstats->syncp); > > tstats->rx_packets++; > > tstats->rx_bytes += skb->len; > > u64_stats_update_end(&tstats->syncp); > > I don't know why you're moving this code above the stats update; > rx_packets/rx_bytes should include dropped packets. > > Ben. > > > - __skb_tunnel_rx(skb, tunnel->dev); > > - > > - skb_reset_network_header(skb); > > - ipgre_ecn_decapsulate(iph, skb); > > - > > netif_rx(skb); > > > > rcu_read_unlock(); > Is that true? I thought they were accounted for by rx_dropped. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2012-10-01 at 08:56 -0700, Stephen Hemminger wrote: > On Mon, 1 Oct 2012 16:55:19 +0100 > Ben Hutchings <bhutchings@solarflare.com> wrote: > > > On Mon, 2012-09-24 at 14:44 -0700, Stephen Hemminger wrote: > > [...] > > > --- a/net/ipv4/ip_gre.c 2012-09-21 08:45:55.948772761 -0700 > > > +++ b/net/ipv4/ip_gre.c 2012-09-24 14:35:54.666185603 -0700 > > [...] > > > @@ -703,17 +704,18 @@ static int ipgre_rcv(struct sk_buff *skb > > > skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > > > } > > > > > > + __skb_tunnel_rx(skb, tunnel->dev); > > > + > > > + skb_reset_network_header(skb); > > > + if (!ipgre_ecn_decapsulate(iph, skb)) > > > + goto drop; > > > + > > > tstats = this_cpu_ptr(tunnel->dev->tstats); > > > u64_stats_update_begin(&tstats->syncp); > > > tstats->rx_packets++; > > > tstats->rx_bytes += skb->len; > > > u64_stats_update_end(&tstats->syncp); > > > > I don't know why you're moving this code above the stats update; > > rx_packets/rx_bytes should include dropped packets. > > > > Ben. > > > > > - __skb_tunnel_rx(skb, tunnel->dev); > > > - > > > - skb_reset_network_header(skb); > > > - ipgre_ecn_decapsulate(iph, skb); > > > - > > > netif_rx(skb); > > > > > > rcu_read_unlock(); > > > > Is that true? I thought they were accounted for by rx_dropped. I don't think rx_dropped is appropriate for counting invalid packets, but maybe actual practice is already different. As for whether packets counted in rx_dropped should also be counted in rx_packets/rx_bytes, I really don't know. The current comments on rtnl_link_stats (inherited from net_device_stats) are totally inadequate as a specification. Ben.
On Mon, 2012-10-01 at 17:49 +0100, Ben Hutchings wrote: > I don't think rx_dropped is appropriate for counting invalid packets, > but maybe actual practice is already different. > > As for whether packets counted in rx_dropped should also be counted in > rx_packets/rx_bytes, I really don't know. The current comments on > rtnl_link_stats (inherited from net_device_stats) are totally inadequate > as a specification. rx_dropped is used by core network stack, not the devices themselves. So a packet is first accounted in rx_bytes/rx_packets by the driver, and if net/core/dev.c drops it, rx_dropped is incremented as well. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 01 Oct 2012 19:13:47 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Mon, 2012-10-01 at 17:49 +0100, Ben Hutchings wrote: > > > I don't think rx_dropped is appropriate for counting invalid packets, > > but maybe actual practice is already different. > > > > As for whether packets counted in rx_dropped should also be counted in > > rx_packets/rx_bytes, I really don't know. The current comments on > > rtnl_link_stats (inherited from net_device_stats) are totally inadequate > > as a specification. > > rx_dropped is used by core network stack, not the devices themselves. > > So a packet is first accounted in rx_bytes/rx_packets by the driver, > and if net/core/dev.c drops it, rx_dropped is incremented as well. The tunnel drivers are consistent in putting any dropped packet because of protocol problem into a rx_XXX_error value and incrementing rx_errors. I just made it treat ECN bit breakage like all the other protocol errors. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--- a/net/ipv4/ip_gre.c 2012-09-21 08:45:55.948772761 -0700 +++ b/net/ipv4/ip_gre.c 2012-09-24 14:35:54.666185603 -0700 @@ -567,15 +567,16 @@ out: rcu_read_unlock(); } -static inline void ipgre_ecn_decapsulate(const struct iphdr *iph, struct sk_buff *skb) +static int ipgre_ecn_decapsulate(const struct iphdr *iph, struct sk_buff *skb) { if (INET_ECN_is_ce(iph->tos)) { if (skb->protocol == htons(ETH_P_IP)) { - IP_ECN_set_ce(ip_hdr(skb)); + return IP_ECN_set_ce(ip_hdr(skb)); } else if (skb->protocol == htons(ETH_P_IPV6)) { - IP6_ECN_set_ce(ipv6_hdr(skb)); + return IP6_ECN_set_ce(ipv6_hdr(skb)); } } + return 1; } static inline u8 @@ -703,17 +704,18 @@ static int ipgre_rcv(struct sk_buff *skb skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); } + __skb_tunnel_rx(skb, tunnel->dev); + + skb_reset_network_header(skb); + if (!ipgre_ecn_decapsulate(iph, skb)) + goto drop; + tstats = this_cpu_ptr(tunnel->dev->tstats); u64_stats_update_begin(&tstats->syncp); tstats->rx_packets++; tstats->rx_bytes += skb->len; u64_stats_update_end(&tstats->syncp); - __skb_tunnel_rx(skb, tunnel->dev); - - skb_reset_network_header(skb); - ipgre_ecn_decapsulate(iph, skb); - netif_rx(skb); rcu_read_unlock();
Linux GRE was likely written before this RFC and therefore does not conform to one of the rules in Section 4.2. Default Tunnel Egress Behaviour. The new code addresses: o If the inner ECN field is Not-ECT, the decapsulator MUST NOT propagate any other ECN codepoint onwards. This is because the inner Not-ECT marking is set by transports that rely on dropped packets as an indication of congestion and would not understand or respond to any other ECN codepoint [RFC4774]. Specifically: * If the inner ECN field is Not-ECT and the outer ECN field is CE, the decapsulator MUST drop the packet. * If the inner ECN field is Not-ECT and the outer ECN field is Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the outgoing packet with the ECN field cleared to Not-ECT. This was caught by Chris Wright while reviewing VXLAN. This code has not been tested with real ECN through tunnel. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html