
[v2] ipv4: Early TCP socket demux.

Message ID 1340195920.4604.918.camel@edumazet-glaptop
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet June 20, 2012, 12:38 p.m. UTC
On Wed, 2012-06-20 at 03:15 -0700, David Miller wrote:
> 			       dev->ifindex);
> +		if (sk) {
> +			skb_orphan(skb);
> +			skb->sk = sk;
> +			skb->destructor = sock_edemux;
> +			if (!skb_dst(skb) &&
> +			    sk->sk_state != TCP_TIME_WAIT) {
> +				struct dst_entry *dst = sk->sk_rx_dst;
> +				if (dst)
> +					dst = dst_check(dst, 0);
> +				if (dst) {
> +					struct rtable *rt = (struct rtable *) dst;
> +
> +					if (rt->rt_iif == dev->ifindex)
> +						skb_dst_set_noref(skb, dst);
> +				}
> +			}
> +		}
> +	}
> +	return pp;

I am trying to convince myself it's safe.

skb_dst_set_noref() assumes the caller holds rcu_read_lock() until we
are done using the skb dst.

And dev_gro_receive() releases RCU...

Problem could happen if sk->sk_rx_dst is freed while some packets are
still in napi or socket backlog (can happen with some network
reordering)

1) Socket backlog must be flushed before sk->sk_rx_dst freeing

2) Even if we move rcu_read_lock() in net_rx_action(), we need some
napi_gro_forcedstrefs() in case we softnet_break

Or maybe just use napi_gro_flush() ?
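
To make option 2 concrete, here is a toy user-space model (plain C, not
kernel code) of what a helper like napi_gro_forcedstrefs() would have to
do: every packet still held with a borrowed (noref) dst gets a real
reference before the RCU read-side section ends. All toy_* names and
struct layouts are made up for illustration; in the kernel, the per-skb
conversion is, as far as I know, what skb_dst_force() is for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; NOT the real struct layouts. */
struct toy_dst {
	int refcnt;		/* number of references currently held */
};

struct toy_skb {
	struct toy_dst *dst;	/* route attached to the packet, or NULL */
	bool dst_noref;		/* true: pointer is only borrowed under RCU */
};

/* Model of skb_dst_set_noref(): borrow the pointer, take no reference. */
static void toy_dst_set_noref(struct toy_skb *skb, struct toy_dst *dst)
{
	assert(skb != NULL);
	skb->dst = dst;
	skb->dst_noref = true;
}

/* Turn a borrowed dst pointer into a counted one (idempotent). */
static void toy_dst_force(struct toy_skb *skb)
{
	assert(skb != NULL);
	if (skb->dst && skb->dst_noref) {
		skb->dst->refcnt++;
		skb->dst_noref = false;
	}
}

/*
 * Hypothetical equivalent of napi_gro_forcedstrefs(): walk the packets
 * that are still held and force a reference on each, so they stay valid
 * after the RCU read-side section is left.
 */
static void toy_gro_force_dst_refs(struct toy_skb *held, size_t n)
{
	for (size_t i = 0; i < n; i++)
		toy_dst_force(&held[i]);
}
```

The alternative, napi_gro_flush(), avoids the extra refcounting by not
leaving any packet held across the break in the first place.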



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller June 20, 2012, 10:29 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Jun 2012 14:38:40 +0200

> Problem could happen if sk->sk_rx_dst is freed while some packets are
> still in napi or socket backlog (can happen with some network
> reordering)
> 
> 1) Socket backlog must be flushed before sk->sk_rx_dst freeing
> 
> 2) Even if we move rcu_read_lock() in net_rx_action(), we need some
> napi_gro_forcedstrefs() in case we softnet_break
> 
> Or maybe just use napi_gro_flush() ?

Good catch, but I've just figured out a more fundamental issue
with doing this at the GRO layer.

The IPv4 input path is going to undo our early socket demux by
orphaning the SKB in ip_rcv().  So we'll end up looking up the
socket twice.
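
A toy user-space model of that sequence (plain C, not kernel code; every
sim_* name is invented for illustration) shows where the second lookup
comes from: the orphan step clears skb->sk, so the TCP receive step has
nothing cached and must look the socket up again.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; NOT the real kernel structures. */
struct sim_sock {
	int id;
};

struct sim_skb {
	struct sim_sock *sk;
	void (*destructor)(struct sim_skb *skb);
};

static int sim_lookups;				/* lookups performed so far */
static struct sim_sock sim_established = { .id = 1 };

/* Stand-in for the established-socket hash lookup. */
static struct sim_sock *sim_lookup(void)
{
	sim_lookups++;
	return &sim_established;
}

/* Stand-in for sock_edemux(): would drop the socket reference. */
static void sim_edemux(struct sim_skb *skb)
{
	(void)skb;
}

/* Model of skb_orphan(): run the destructor, detach the socket. */
static void sim_skb_orphan(struct sim_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}

/* Early demux at the GRO layer, as in the quoted patch. */
static void sim_gro_early_demux(struct sim_skb *skb)
{
	sim_skb_orphan(skb);
	skb->sk = sim_lookup();
	skb->destructor = sim_edemux;
}

/* The IPv4 input path orphans the skb, discarding the early result. */
static void sim_ip_rcv(struct sim_skb *skb)
{
	sim_skb_orphan(skb);
}

/* TCP receive: reuse skb->sk if present, otherwise look it up. */
static struct sim_sock *sim_tcp_rcv(struct sim_skb *skb)
{
	if (!skb->sk)
		skb->sk = sim_lookup();
	return skb->sk;
}

/* Push one packet through all three stages; return the lookup count. */
static int sim_receive_one(void)
{
	struct sim_skb skb = { NULL, NULL };

	sim_lookups = 0;
	sim_gro_early_demux(&skb);
	sim_ip_rcv(&skb);
	sim_tcp_rcv(&skb);
	return sim_lookups;
}
```

In this model the count only drops to one if the orphan in sim_ip_rcv()
is skipped or the demux happens after it.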

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 57c4f9b..c0f71a0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3861,6 +3861,9 @@  static void net_rx_action(struct softirq_action *h)
 
 		budget -= work;
 
+		if (work == weight)
+			napi_gro_flush(n);
+
 		local_irq_disable();
 
 		/* Drivers must not modify the NAPI state if they