diff mbox

[net-next] macvlan: handle fragmented multicast frames

Message ID 1317932911.3457.31.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Oct. 6, 2011, 8:28 p.m. UTC
Le mercredi 05 octobre 2011 à 15:35 -0700, Ben Greear a écrit :

> If someone wants to cook up macvlan-ip-defrag patch I'll be happy
> to test it.  But, as far as I can tell, this problem can happen on
> any two interfaces.  The reason that some of mine work (.1q vlans)
> and macvlan didn't is probably because those were separated by
> some virtual network links that imparted extra delay...so the
> vlan consumed all its fragments and passed the complete pkt up
> the stack before the mac-vlan ever saw the initial frame.
> 
> With this in mind, it seems that using multiple udp multicast
> sockets bound to specific devices is fundamentally broken for
> fragmented packets.
> 
> I have no pressing need for this feature, so now that I better understand
> the problem I can just document it and move on to other things.
> 
> Thanks for all the help.
> 

Please test following patch (note I had no time to test it, sorry !)

Based on net-next tree, might apply on 3.0 kernel...

[PATCH net-next] macvlan: handle fragmented multicast frames

Fragmented multicast frames are delivered to a single macvlan port,
because ip defrag logic considers other samples are redundant.

Implement a defrag step before trying to send the multicast frame.

Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 drivers/net/macvlan.c  |    3 +++
 include/net/ip.h       |    9 +++++++++
 net/ipv4/ip_fragment.c |   36 ++++++++++++++++++++++++++++++++++++
 net/packet/af_packet.c |   39 +--------------------------------------



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ben Greear Oct. 7, 2011, 4:44 p.m. UTC | #1
On 10/06/2011 01:28 PM, Eric Dumazet wrote:
> Le mercredi 05 octobre 2011 à 15:35 -0700, Ben Greear a écrit :
>
>> If someone wants to cook up macvlan-ip-defrag patch I'll be happy
>> to test it.  But, as far as I can tell, this problem can happen on
>> any two interfaces.  The reason that some of mine work (.1q vlans)
>> and macvlan didn't is probably because those were separated by
>> some virtual network links that imparted extra delay...so the
>> vlan consumed all its fragments and passed the complete pkt up
>> the stack before the mac-vlan ever saw the initial frame.
>>
>> With this in mind, it seems that using multiple udp multicast
>> sockets bound to specific devices is fundamentally broken for
>> fragmented packets.
>>
>> I have no pressing need for this feature, so now that I better understand
>> the problem I can just document it and move on to other things.
>>
>> Thanks for all the help.
>>
>
> Please test following patch (note I had no time to test it, sorry !)

I hope to test this soon, but lots of other things piled up all
at once, so might be a few days.

Thanks,
Ben

>
> Based on net-next tree, might apply on 3.0 kernel...
>
> [PATCH net-next] macvlan: handle fragmented multicast frames
>
> Fragmented multicast frames are delivered to a single macvlan port,
> because ip defrag logic considers other samples are redundant.
>
> Implement a defrag step before trying to send the multicast frame.
Ben Greear Oct. 10, 2011, 4:27 p.m. UTC | #2
On 10/06/2011 01:28 PM, Eric Dumazet wrote:
> Le mercredi 05 octobre 2011 à 15:35 -0700, Ben Greear a écrit :
>
>> If someone wants to cook up macvlan-ip-defrag patch I'll be happy
>> to test it.  But, as far as I can tell, this problem can happen on
>> any two interfaces.  The reason that some of mine work (.1q vlans)
>> and macvlan didn't is probably because those were separated by
>> some virtual network links that imparted extra delay...so the
>> vlan consumed all its fragments and passed the complete pkt up
>> the stack before the mac-vlan ever saw the initial frame.
>>
>> With this in mind, it seems that using multiple udp multicast
>> sockets bound to specific devices is fundamentally broken for
>> fragmented packets.
>>
>> I have no pressing need for this feature, so now that I better understand
>> the problem I can just document it and move on to other things.
>>
>> Thanks for all the help.
>>
>
> Please test following patch (note I had no time to test it, sorry !)
>
> Based on net-next tree, might apply on 3.0 kernel...
>
> [PATCH net-next] macvlan: handle fragmented multicast frames
>
> Fragmented multicast frames are delivered to a single macvlan port,
> because ip defrag logic considers other samples are redundant.
>
> Implement a defrag step before trying to send the multicast frame.

I applied this to Linus' top-of-tree this morning and it does appear
to fix the problem for mac-vlans.

I do see this error, but I doubt it has anything to do with your
patch:

device eth0 entered promiscuous mode
device rddVR10 entered promiscuous mode
ADDRCONF(NETDEV_CHANGE): rddVR1b: link becomes ready

================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
ip/3452 is leaving the kernel with locks still held!
1 lock held by ip/3452:
  #0:  (rcu_read_lock){.+.+..}, at: [<f8c5336f>] rcu_read_lock+0x0/0x26 [ipv6]
ADDRCONF(NETDEV_CHANGE): rddVR4b: link becomes ready
ADDRCONF(NETDEV_CHANGE): rddVR5b: link becomes ready


I have no idea why it doesn't print out a more useful stack
trace.  It seems repeatable (2 of 2 reboots so far).  I'm
configuring a pretty complex virtual network, with veth devices,
xorp instances running ipv4 and ipv6 routing protocols, etc.

This is a clean upstream kernel with no outside patches aside from your
own.

Thanks,
Ben
Eric Dumazet Oct. 10, 2011, 4:41 p.m. UTC | #3
Le lundi 10 octobre 2011 à 09:27 -0700, Ben Greear a écrit :

> I applied this to Linus' top-of-tree this morning and it does appear
> to fix the problem for mac-vlans.
> 

Thanks for testing

> I do see this error, but I doubt it has anything to do with your
> patch:
> 
> device eth0 entered promiscuous mode
> device rddVR10 entered promiscuous mode
> ADDRCONF(NETDEV_CHANGE): rddVR1b: link becomes ready
> 
> ================================================
> [ BUG: lock held when returning to user space! ]
> ------------------------------------------------
> ip/3452 is leaving the kernel with locks still held!
> 1 lock held by ip/3452:
>   #0:  (rcu_read_lock){.+.+..}, at: [<f8c5336f>] rcu_read_lock+0x0/0x26 [ipv6]
> ADDRCONF(NETDEV_CHANGE): rddVR4b: link becomes ready
> ADDRCONF(NETDEV_CHANGE): rddVR5b: link becomes ready
> 
> 
> I have no idea why it doesn't print out a more useful stack
> trace.  It seems repeatable (2 of 2 reboots so far).  I'm
> configuring a pretty complex virtual network, with veth devices,
> xorp instances running ipv4 and ipv6 routing protocols, etc.
> 

Do you have LOCKDEP enabled ?

> This is a clean upstream kernel with no outside patches aside from your
> own.

Hmm, it seems we have an rcu_read_unlock() missing...

Any idea what was done by this "ip" command ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Greear Oct. 10, 2011, 4:53 p.m. UTC | #4
On 10/10/2011 09:41 AM, Eric Dumazet wrote:
> Le lundi 10 octobre 2011 à 09:27 -0700, Ben Greear a écrit :
>
>> I applied this to Linus' top-of-tree this morning and it does appear
>> to fix the problem for mac-vlans.
>>
>
> Thanks for testing
>
>> I do see this error, but I doubt it has anything to do with your
>> patch:
>>
>> device eth0 entered promiscuous mode
>> device rddVR10 entered promiscuous mode
>> ADDRCONF(NETDEV_CHANGE): rddVR1b: link becomes ready
>>
>> ================================================
>> [ BUG: lock held when returning to user space! ]
>> ------------------------------------------------
>> ip/3452 is leaving the kernel with locks still held!
>> 1 lock held by ip/3452:
>>    #0:  (rcu_read_lock){.+.+..}, at: [<f8c5336f>] rcu_read_lock+0x0/0x26 [ipv6]
>> ADDRCONF(NETDEV_CHANGE): rddVR4b: link becomes ready
>> ADDRCONF(NETDEV_CHANGE): rddVR5b: link becomes ready
>>
>>
>> I have no idea why it doesn't print out a more useful stack
>> trace.  It seems repeatable (2 of 2 reboots so far).  I'm
>> configuring a pretty complex virtual network, with veth devices,
>> xorp instances running ipv4 and ipv6 routing protocols, etc.
>>
>
> Do you have LOCKDEP enabled ?

Yes, as far as I can tell:
[greearb@build-32 linux-2.6.p4s]$ grep LOCKDEP .config
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_LOCKDEP=y

And it doesn't appear to have turned itself off:

[root@lec2010-ath9k-1 ~]# dmesg|grep lockdep
	RCU lockdep checking is enabled.
lockdep: fixing up alternatives.
[root@lec2010-ath9k-1 ~]#

I looked through the kernel debug section of the config, and it
seems normal enough...

But, after this splat, if I run sysrq-d, then it says sysrq is off,
maybe because the splat disabled it?

SysRq : Show Locks Held
INFO: lockdep is turned off.

sysrq-l does show backtraces, so the backtrace logic in general
seems to work fine.

>
>> This is a clean upstream kernel with no outside patches aside from your
>> own.
>
> Hmm, it seems we have an rcu_read_unlock() missing...
>
> Any idea what was done by this "ip" command ?

No, it's called multiple times by my user-space control logic.  Basically,
it configures around 30 interfaces, some GRE, veth, mac-vlans, .1q vlans, normal ethernet, etc.
Also, I have some ipv6 addrs configured on many of them.

And, setting up routing rules, for ipv4 and ipv6 for the virtual routers.

Thanks,
Ben
David Miller Oct. 19, 2011, 3:22 a.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 Oct 2011 22:28:31 +0200

> [PATCH net-next] macvlan: handle fragmented multicast frames
> 
> Fragmented multicast frames are delivered to a single macvlan port,
> because ip defrag logic considers other samples are redundant.
> 
> Implement a defrag step before trying to send the multicast frame.
> 
> Reported-by: Ben Greear <greearb@candelatech.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied to net-next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index b100c90..40366eb 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -169,6 +169,9 @@  static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
 
 	port = macvlan_port_get_rcu(skb->dev);
 	if (is_multicast_ether_addr(eth->h_dest)) {
+		skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN);
+		if (!skb)
+			return RX_HANDLER_CONSUMED;
 		src = macvlan_hash_lookup(port, eth->h_source);
 		if (!src)
 			/* frame comes from an external address */
diff --git a/include/net/ip.h b/include/net/ip.h
index aa76c7a..c7e066a 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -406,9 +406,18 @@  enum ip_defrag_users {
 	IP_DEFRAG_VS_OUT,
 	IP_DEFRAG_VS_FWD,
 	IP_DEFRAG_AF_PACKET,
+	IP_DEFRAG_MACVLAN,
 };
 
 int ip_defrag(struct sk_buff *skb, u32 user);
+#ifdef CONFIG_INET
+struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user);
+#else
+static inline struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
+{
+	return skb;
+}
+#endif
 int ip_frag_mem(struct net *net);
 int ip_frag_nqueues(struct net *net);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 0e0ab98..763589a 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -682,6 +682,42 @@  int ip_defrag(struct sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_defrag);
 
+struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
+{
+	const struct iphdr *iph;
+	u32 len;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		return skb;
+
+	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+		return skb;
+
+	iph = ip_hdr(skb);
+	if (iph->ihl < 5 || iph->version != 4)
+		return skb;
+	if (!pskb_may_pull(skb, iph->ihl*4))
+		return skb;
+	iph = ip_hdr(skb);
+	len = ntohs(iph->tot_len);
+	if (skb->len < len || len < (iph->ihl * 4))
+		return skb;
+
+	if (ip_is_fragment(ip_hdr(skb))) {
+		skb = skb_share_check(skb, GFP_ATOMIC);
+		if (skb) {
+			if (pskb_trim_rcsum(skb, len))
+				return skb;
+			memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+			if (ip_defrag(skb, user))
+				return NULL;
+			skb->rxhash = 0;
+		}
+	}
+	return skb;
+}
+EXPORT_SYMBOL(ip_check_defrag);
+
 #ifdef CONFIG_SYSCTL
 static int zero;
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 25e68f5..ff9eed7 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1213,43 +1213,6 @@  static struct sock *fanout_demux_cpu(struct packet_fanout *f, struct sk_buff *sk
 	return f->arr[cpu % num];
 }
 
-static struct sk_buff *fanout_check_defrag(struct sk_buff *skb)
-{
-#ifdef CONFIG_INET
-	const struct iphdr *iph;
-	u32 len;
-
-	if (skb->protocol != htons(ETH_P_IP))
-		return skb;
-
-	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
-		return skb;
-
-	iph = ip_hdr(skb);
-	if (iph->ihl < 5 || iph->version != 4)
-		return skb;
-	if (!pskb_may_pull(skb, iph->ihl*4))
-		return skb;
-	iph = ip_hdr(skb);
-	len = ntohs(iph->tot_len);
-	if (skb->len < len || len < (iph->ihl * 4))
-		return skb;
-
-	if (ip_is_fragment(ip_hdr(skb))) {
-		skb = skb_share_check(skb, GFP_ATOMIC);
-		if (skb) {
-			if (pskb_trim_rcsum(skb, len))
-				return skb;
-			memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
-			if (ip_defrag(skb, IP_DEFRAG_AF_PACKET))
-				return NULL;
-			skb->rxhash = 0;
-		}
-	}
-#endif
-	return skb;
-}
-
 static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
 			     struct packet_type *pt, struct net_device *orig_dev)
 {
@@ -1268,7 +1231,7 @@  static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
 	case PACKET_FANOUT_HASH:
 	default:
 		if (f->defrag) {
-			skb = fanout_check_defrag(skb);
+			skb = ip_check_defrag(skb, IP_DEFRAG_AF_PACKET);
 			if (!skb)
 				return 0;
 		}