Message ID | 1317932911.3457.31.camel@edumazet-laptop |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
On 10/06/2011 01:28 PM, Eric Dumazet wrote: > Le mercredi 05 octobre 2011 à 15:35 -0700, Ben Greear a écrit : > >> If someone wants to cook up macvlan-ip-defrag patch I'll be happy >> to test it. But, as far as I can tell, this problem can happen on >> any two interfaces. The reason that some of mine work (.1q vlans) >> and macvlan didn't is probably because those were separated by >> some virtual network links that imparted extra delay...so the >> vlan consumed all its fragments and passed the complete pkt up >> the stack before the mac-vlan ever saw the initial frame. >> >> With this in mind, it seems that using multiple udp multicast >> sockets bound to specific devices is fundamentally broken for >> fragmented packets. >> >> I have no pressing need for this feature, so now that I better understand >> the problem I can just document it and move on to other things. >> >> Thanks for all the help. >> > > Please test following patch (note I had no time to test it, sorry !) I hope to test this soon, but lots of other things piled up all at once, so might be a few days. Thanks, Ben > > Based on net-next tree, might apply on 3.0 kernel... > > [PATCH net-next] macvlan: handle fragmented multicast frames > > Fragmented multicast frames are delivered to a single macvlan port, > because ip defrag logic considers other samples are redundant. > > Implement a defrag step before trying to send the multicast frame.
On 10/06/2011 01:28 PM, Eric Dumazet wrote: > Le mercredi 05 octobre 2011 à 15:35 -0700, Ben Greear a écrit : > >> If someone wants to cook up macvlan-ip-defrag patch I'll be happy >> to test it. But, as far as I can tell, this problem can happen on >> any two interfaces. The reason that some of mine work (.1q vlans) >> and macvlan didn't is probably because those were separated by >> some virtual network links that imparted extra delay...so the >> vlan consumed all its fragments and passed the complete pkt up >> the stack before the mac-vlan ever saw the initial frame. >> >> With this in mind, it seems that using multiple udp multicast >> sockets bound to specific devices is fundamentally broken for >> fragmented packets. >> >> I have no pressing need for this feature, so now that I better understand >> the problem I can just document it and move on to other things. >> >> Thanks for all the help. >> > > Please test following patch (note I had no time to test it, sorry !) > > Based on net-next tree, might apply on 3.0 kernel... > > [PATCH net-next] macvlan: handle fragmented multicast frames > > Fragmented multicast frames are delivered to a single macvlan port, > because ip defrag logic considers other samples are redundant. > > Implement a defrag step before trying to send the multicast frame. I applied this to Linus' top-of-tree this morning and it does appear to fix the problem for mac-vlans. I do see this error, but I doubt it has anything to do with your patch: device eth0 entered promiscuous mode device rddVR10 entered promiscuous mode ADDRCONF(NETDEV_CHANGE): rddVR1b: link becomes ready ================================================ [ BUG: lock held when returning to user space! ] ------------------------------------------------ ip/3452 is leaving the kernel with locks still held! 1 lock held by ip/3452: #0: (rcu_read_lock){.+.+..}, at: [<f8c5336f>] rcu_read_lock+0x0/0x26 [ipv6] ADDRCONF(NETDEV_CHANGE): rddVR4b: link becomes ready ADDRCONF(NETDEV_CHANGE): rddVR5b: link becomes ready I have no idea why it doesn't print out a more useful stack trace. It seems repeatable (2 of 2 reboots so far). I'm configuring a pretty complex virtual network, with veth devices, xorp instances running ipv4 and ipv6 routing protocols, etc. This is a clean upstream kernel with no outside patches aside from your own. Thanks, Ben
Le lundi 10 octobre 2011 à 09:27 -0700, Ben Greear a écrit : > I applied this to Linus' top-of-tree this morning and it does appear > to fix the problem for mac-vlans. > Thanks for testing > I do see this error, but I doubt it has anything to do with your > patch: > > device eth0 entered promiscuous mode > device rddVR10 entered promiscuous mode > ADDRCONF(NETDEV_CHANGE): rddVR1b: link becomes ready > > ================================================ > [ BUG: lock held when returning to user space! ] > ------------------------------------------------ > ip/3452 is leaving the kernel with locks still held! > 1 lock held by ip/3452: > #0: (rcu_read_lock){.+.+..}, at: [<f8c5336f>] rcu_read_lock+0x0/0x26 [ipv6] > ADDRCONF(NETDEV_CHANGE): rddVR4b: link becomes ready > ADDRCONF(NETDEV_CHANGE): rddVR5b: link becomes ready > > > I have no idea why it doesn't print out a more useful stack > trace. It seems repeatable (2 of 2 reboots so far). I'm > configuring a pretty complex virtual network, with veth devices, > xorp instances running ipv4 and ipv6 routing protocols, etc. > Do you have LOCKDEP enabled ? > This is a clean upstream kernel with no outside patches aside from your > own. Hmm, it seems we have an rcu_read_unlock() missing... Any idea what was done by this "ip" command ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/10/2011 09:41 AM, Eric Dumazet wrote: > Le lundi 10 octobre 2011 à 09:27 -0700, Ben Greear a écrit : > >> I applied this to Linus' top-of-tree this morning and it does appear >> to fix the problem for mac-vlans. >> > > Thanks for testing > >> I do see this error, but I doubt it has anything to do with your >> patch: >> >> device eth0 entered promiscuous mode >> device rddVR10 entered promiscuous mode >> ADDRCONF(NETDEV_CHANGE): rddVR1b: link becomes ready >> >> ================================================ >> [ BUG: lock held when returning to user space! ] >> ------------------------------------------------ >> ip/3452 is leaving the kernel with locks still held! >> 1 lock held by ip/3452: >> #0: (rcu_read_lock){.+.+..}, at: [<f8c5336f>] rcu_read_lock+0x0/0x26 [ipv6] >> ADDRCONF(NETDEV_CHANGE): rddVR4b: link becomes ready >> ADDRCONF(NETDEV_CHANGE): rddVR5b: link becomes ready >> >> >> I have no idea why it doesn't print out a more useful stack >> trace. It seems repeatable (2 of 2 reboots so far). I'm >> configuring a pretty complex virtual network, with veth devices, >> xorp instances running ipv4 and ipv6 routing protocols, etc. >> > > Do you have LOCKDEP enabled ? Yes, as far as I can tell: [greearb@build-32 linux-2.6.p4s]$ grep LOCKDEP .config CONFIG_LOCKDEP_SUPPORT=y CONFIG_LOCKDEP=y And it doesn't appear to have turned itself off: [root@lec2010-ath9k-1 ~]# dmesg|grep lockdep RCU lockdep checking is enabled. lockdep: fixing up alternatives. [root@lec2010-ath9k-1 ~]# I looked through the kernel debug section of the config, and it seems normal enough... But, after this splat, if I run sysrq-d, then it says sysrq is off, maybe because the splat disabled it? SysRq : Show Locks Held INFO: lockdep is turned off. sysrq-l does show backtraces, so the backtrace logic in general seems to work fine. > >> This is a clean upstream kernel with no outside patches aside from your >> own. > > Hmm, it seems we have an rcu_read_unlock() missing... > > Any idea what was done by this "ip" command ? No, it's called multiple times by my user-space control logic. Basically, it configures around 30 interfaces, some GRE, veth, mac-vlans, .1q vlans, normal ethernet, etc. Also, I have some ipv6 addrs configured on many of them. And, setting up routing rules, for ipv4 and ipv6 for the virtual routers. Thanks, Ben
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 06 Oct 2011 22:28:31 +0200 > [PATCH net-next] macvlan: handle fragmented multicast frames > > Fragmented multicast frames are delivered to a single macvlan port, > because ip defrag logic considers other samples are redundant. > > Implement a defrag step before trying to send the multicast frame. > > Reported-by: Ben Greear <greearb@candelatech.com> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Applied to net-next, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index b100c90..40366eb 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -169,6 +169,9 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) port = macvlan_port_get_rcu(skb->dev); if (is_multicast_ether_addr(eth->h_dest)) { + skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN); + if (!skb) + return RX_HANDLER_CONSUMED; src = macvlan_hash_lookup(port, eth->h_source); if (!src) /* frame comes from an external address */ diff --git a/include/net/ip.h b/include/net/ip.h index aa76c7a..c7e066a 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -406,9 +406,18 @@ enum ip_defrag_users { IP_DEFRAG_VS_OUT, IP_DEFRAG_VS_FWD, IP_DEFRAG_AF_PACKET, + IP_DEFRAG_MACVLAN, }; int ip_defrag(struct sk_buff *skb, u32 user); +#ifdef CONFIG_INET +struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user); +#else +static inline struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user) +{ + return skb; +} +#endif int ip_frag_mem(struct net *net); int ip_frag_nqueues(struct net *net); diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index 0e0ab98..763589a 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -682,6 +682,42 @@ int ip_defrag(struct sk_buff *skb, u32 user) } EXPORT_SYMBOL(ip_defrag); +struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user) +{ + const struct iphdr *iph; + u32 len; + + if (skb->protocol != htons(ETH_P_IP)) + return skb; + + if (!pskb_may_pull(skb, sizeof(struct iphdr))) + return skb; + + iph = ip_hdr(skb); + if (iph->ihl < 5 || iph->version != 4) + return skb; + if (!pskb_may_pull(skb, iph->ihl*4)) + return skb; + iph = ip_hdr(skb); + len = ntohs(iph->tot_len); + if (skb->len < len || len < (iph->ihl * 4)) + return skb; + + if (ip_is_fragment(ip_hdr(skb))) { + skb = skb_share_check(skb, GFP_ATOMIC); + if (skb) { + if (pskb_trim_rcsum(skb, len)) + return skb; + memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); + if (ip_defrag(skb, user)) + return NULL; + skb->rxhash = 0; + } + } + return skb; +} +EXPORT_SYMBOL(ip_check_defrag); + #ifdef CONFIG_SYSCTL static int zero; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 25e68f5..ff9eed7 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1213,43 +1213,6 @@ static struct sock *fanout_demux_cpu(struct packet_fanout *f, struct sk_buff *sk return f->arr[cpu % num]; } -static struct sk_buff *fanout_check_defrag(struct sk_buff *skb) -{ -#ifdef CONFIG_INET - const struct iphdr *iph; - u32 len; - - if (skb->protocol != htons(ETH_P_IP)) - return skb; - - if (!pskb_may_pull(skb, sizeof(struct iphdr))) - return skb; - - iph = ip_hdr(skb); - if (iph->ihl < 5 || iph->version != 4) - return skb; - if (!pskb_may_pull(skb, iph->ihl*4)) - return skb; - iph = ip_hdr(skb); - len = ntohs(iph->tot_len); - if (skb->len < len || len < (iph->ihl * 4)) - return skb; - - if (ip_is_fragment(ip_hdr(skb))) { - skb = skb_share_check(skb, GFP_ATOMIC); - if (skb) { - if (pskb_trim_rcsum(skb, len)) - return skb; - memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); - if (ip_defrag(skb, IP_DEFRAG_AF_PACKET)) - return NULL; - skb->rxhash = 0; - } - } -#endif - return skb; -} - static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) { @@ -1268,7 +1231,7 @@ static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev, case PACKET_FANOUT_HASH: default: if (f->defrag) { - skb = fanout_check_defrag(skb); + skb = ip_check_defrag(skb, IP_DEFRAG_AF_PACKET); if (!skb) return 0; }