Patchwork [V2,net-next-2.6] net: relax dst refcnt in input path

login
register
mail settings
Submitter Eric Dumazet
Date July 22, 2009, 3:40 p.m.
Message ID <4A6732D6.90809@gmail.com>
Download mbox | patch
Permalink /patch/30082/
State Changes Requested
Delegated to: David Miller
Headers show

Comments

Eric Dumazet - July 22, 2009, 3:40 p.m.
Patrick McHardy a écrit :
> Eric Dumazet wrote:
>> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
>> index db46b4b..23e974b 100644
>> --- a/net/ipv4/ip_input.c
>> +++ b/net/ipv4/ip_input.c
>> @@ -331,7 +331,7 @@ static int ip_rcv_finish(struct sk_buff *skb)
>>  	 */
>>  	if (skb_dst(skb) == NULL) {
>>  		int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
>> -					 skb->dev);
>> +					 skb->dev, 1);
>>  		if (unlikely(err)) {
>>  			if (err == -EHOSTUNREACH)
>>  				IP_INC_STATS_BH(dev_net(skb->dev),
> 
> 
> The packet might get queued by netfilter after this. I think this needs
> another skb_dst_force() in __nf_queue().

Yes, thanks for catching this !

[PATCH net-next-2.6] net: relax dst refcnt in input path

One serious point of contention in network stack is the IP route cache
refcounts in input path, on SMP setups.

On stress situation, one cpu (say A) handles network softirq RX processing.
When a packet is received, we need to find a dst_entry, take
a reference on this dst_entry and associate skb to this dst_entry.
skb is queued on a socket receive queue.

When application (running from another CPU B) dequeues this packet,
it has to release the dst_entry, which refcount is hot and dirty on
another CPU A cache, involving an expensive cache line ping-pong.

Back in November 2008, we tried to keep this cache line only
in CPU A (commit 70355602879229c6f8bd694ec9c0814222bc4936)
(net: release skb->dst in sock_queue_rcv_skb()), but we had
to revert this commit because it broke IP_PKTINFO handling,
as noticed by Mark McLoughlin

Then David suggested not taking the reference at the first place,
which this patch does when possible.

We prepared this work with commit adf30907 (net: skb->dst accessors),
introducing accessors to work on skb->dst

We now can use the low order bit of skb->_skb_dst to tell
if a reference was taken on dst for this skb

Then we add a new 'noref' parameter to ip_route_input(), to tell
this function to take or not a reference count on found dst.

We make ip_rcv_finish() calls ip_route_input() with noref=1.

When queueing a skb to socket receive queue, we drop dst
unless IP_CMSG_PKTINFO is set. 

This is done in ip_queue_rcv_skb(), a new wrapper around
sock_queue_rcv_skb(), for UDP (IPV4/IPV6) and RAW (IPV4)

In sock_queue_rcv_skb(), we check if skb has still a dst,
because we must take a reference on it unless already done.
(The dequeuing might be done long time after queueing, and
RCU wont protect dst that long)

skb_dst_drop() is called from tcp_data_queue() for TCP,
regardless of IP_CMSG_PKTINFO (not meaningfull for tcp)

Net effect of this patch is avoiding two atomic ops per
incoming packet, on possibly contented cache lines.

V2: Forwarding is taken into account by changes in dev_queue_xmit(),
forcing a dst refcount on !IFF_XMIT_DST_RELEASE devices.

V3: As pointed by Patrick, we must force a dst refcount in
__nf_queue(), before queueing a packet.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h    |   35 +++++++++++++++++++++++++++-
 include/net/dst.h         |   45 ++++++++++++++++++++++++++++++++----
 include/net/ip.h          |    1
 include/net/route.h       |    3 +-
 net/bridge/br_netfilter.c |    2 -
 net/core/dev.c            |   18 ++++++++++++--
 net/core/skbuff.c         |    2 -
 net/core/sock.c           |    4 +++
 net/ipv4/arp.c            |    2 -
 net/ipv4/icmp.c           |    8 +++---
 net/ipv4/ip_input.c       |    2 -
 net/ipv4/ip_options.c     |   11 ++++----
 net/ipv4/ip_sockglue.c    |   16 ++++++++++++
 net/ipv4/netfilter.c      |    8 +++---
 net/ipv4/raw.c            |    2 -
 net/ipv4/route.c          |   15 ++++++++----
 net/ipv4/tcp_input.c      |    1
 net/ipv4/udp.c            |    2 -
 net/ipv4/xfrm4_input.c    |    2 -
 net/ipv6/ip6_tunnel.c     |    2 -
 net/ipv6/udp.c            |    2 -
 net/netfilter/nf_queue.c  |    1
 22 files changed, 148 insertions(+), 36 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Aug. 6, 2009, 8:35 p.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 22 Jul 2009 17:40:06 +0200

> [PATCH net-next-2.6] net: relax dst refcnt in input path


Two things:

1) Don't use a boolean name like "noref" it results in using
   double-negatives in one's mind while trying to read and understand
   the code.

   Call these arguments "need_ref" or something like that.

   Also, use "bool" type.

2) I wonder about this:

> @@ -1700,9 +1700,15 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>  		 * If device doesnt need skb->dst, release it right now while
>  		 * its hot in this cpu cache
>  		 */
> -		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> -			skb_dst_drop(skb);
> -
> +		if (skb_dst(skb)) {
> +			if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> +				skb_dst_drop(skb);
> +			else
> +				/*
> +				 * make sure dst was refcounted by caller
> +				 */
> +				WARN_ON(!(skb->_skb_dst & SKB_DST_REFTAKEN));
> +		}
>  		rc = ops->ndo_start_xmit(skb, dev);
>  		if (rc == NETDEV_TX_OK)

So, won't this warning trigger if we are forwarding to a device that
does not set IFF_XMIT_DST_RELEASE?

If I understand things correctly, in IPv4 when we're not delivering to
a socket, we don't take a reference.

We'll get here from the forwarding path with a DST, and the target TX
device has IFF_XMIT_DST_RELEASE clear, the WARN_ON above will trigger.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Aug. 7, 2009, 5:05 a.m.
David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 22 Jul 2009 17:40:06 +0200
> 
>> [PATCH net-next-2.6] net: relax dst refcnt in input path
> 
> 
> Two things:
> 
> 1) Don't use a boolean name like "noref" it results in using
>    double-negatives in one's mind while trying to read and understand
>    the code.
> 
>    Call these arguments "need_ref" or something like that.
> 
>    Also, use "bool" type.

Sure

> 
> 2) I wonder about this:
> 
>> @@ -1700,9 +1700,15 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>>  		 * If device doesnt need skb->dst, release it right now while
>>  		 * its hot in this cpu cache
>>  		 */
>> -		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
>> -			skb_dst_drop(skb);
>> -
>> +		if (skb_dst(skb)) {
>> +			if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
>> +				skb_dst_drop(skb);
>> +			else
>> +				/*
>> +				 * make sure dst was refcounted by caller
>> +				 */
>> +				WARN_ON(!(skb->_skb_dst & SKB_DST_REFTAKEN));
>> +		}
>>  		rc = ops->ndo_start_xmit(skb, dev);
>>  		if (rc == NETDEV_TX_OK)
> 
> So, won't this warning trigger if we are forwarding to a device that
> does not set IFF_XMIT_DST_RELEASE?

If this device has a queue, we do a skb_dst_force() in dev_queue_xmit()

> 
> If I understand things correctly, in IPv4 when we're not delivering to
> a socket, we don't take a reference.

yes, that would be the trick

> 
> We'll get here from the forwarding path with a DST, and the target TX
> device has IFF_XMIT_DST_RELEASE clear, the WARN_ON above will trigger.

I added this warning to make sure I called skb_dst_force(skb)
from dev_queue_xmit() if device has a queue (packet might be queued
and thus escape from rcu lock section)

I suppose I should also skb_dst_force() if device has no queue and
has IFF_XMIT_DST_RELEASE cleared. Or should audit all these devices
and insert the skb_dst_force() if the packet must be queued in a driver queue,
and its dst needed later. 

I'll respin patch anyway since commit bbd8a0d3a3b65d341437f8b99c828fa5cc29c739
(net: Avoid enqueuing skb for default qdiscs)
changed too many things, when I'll come back from vacations in two weeks.

Or if you prefer, I might submit it today (my vacations start tomorrow), but
please dont apply it while I cannot react to any bug report :)

Thanks a lot
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - Aug. 7, 2009, 5:09 a.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 07 Aug 2009 07:05:38 +0200

> I'll respin patch anyway since commit bbd8a0d3a3b65d341437f8b99c828fa5cc29c739
> (net: Avoid enqueuing skb for default qdiscs)
> changed too many things, when I'll come back from vacations in two weeks.

That's fine.

> Or if you prefer, I might submit it today (my vacations start tomorrow), but
> please dont apply it while I cannot react to any bug report :)

Take your time, there is no rush.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2c69a2..9d14368 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -426,13 +426,46 @@  extern void skb_dma_unmap(struct device *dev, struct sk_buff *skb,
 			  enum dma_data_direction dir);
 #endif
 
+/*
+ * skb might have a dst pointer attached, refcounted or not
+ * _skb_dst low order bit is set if refcount was taken
+ */
+#define SKB_DST_REFTAKEN 1UL
+#define SKB_DST_PTRMASK ~1UL
+/**
+ * skb_dst - returns skb dst
+ * @skb: buffer
+ *
+ * Returns skb dst, regardless of reference taken or not.
+ */
 static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
 {
-	return (struct dst_entry *)skb->_skb_dst;
+	return (struct dst_entry *)(skb->_skb_dst & SKB_DST_PTRMASK);
 }
 
+/**
+ * skb_dst_set - sets skb dst
+ * @skb: buffer
+ * @dst: dst entry
+ *
+ * Sets skb dst, assuming a reference was taken on dst and should
+ * be released by skb_dst_drop()
+ */
 static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
 {
+	skb->_skb_dst = (unsigned long)dst | SKB_DST_REFTAKEN;
+}
+
+/**
+ * skb_dst_set_noref - sets skb dst, without a reference
+ * @skb: buffer
+ * @dst: dst entry
+ *
+ * Sets skb dst, assuming a reference was not taken on dst
+ * skb_dst_drop() should not dst_release() this dst
+ */
+static inline void skb_dst_set_noref(struct sk_buff *skb, struct dst_entry *dst)
+{
 	skb->_skb_dst = (unsigned long)dst;
 }
 
diff --git a/include/net/dst.h b/include/net/dst.h
index 7fc409c..9c3f0fc 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -179,13 +179,18 @@  static inline void dst_hold(struct dst_entry * dst)
 	atomic_inc(&dst->__refcnt);
 }
 
-static inline void dst_use(struct dst_entry *dst, unsigned long time)
+static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 {
-	dst_hold(dst);
 	dst->__use++;
 	dst->lastuse = time;
 }
 
+static inline void dst_use(struct dst_entry *dst, unsigned long time)
+{
+	dst_hold(dst);
+	dst_use_noref(dst, time);
+}
+
 static inline
 struct dst_entry * dst_clone(struct dst_entry * dst)
 {
@@ -195,13 +200,45 @@  struct dst_entry * dst_clone(struct dst_entry * dst)
 }
 
 extern void dst_release(struct dst_entry *dst);
+
+static inline void __skb_dst_drop(unsigned long _skb_dst)
+{
+	if (_skb_dst & SKB_DST_REFTAKEN)
+		dst_release((struct dst_entry *)(_skb_dst & SKB_DST_PTRMASK));
+}
+/**
+ * skb_dst_drop - drops skb dst
+ * @skb: buffer
+ *
+ * Drops dst reference count if low order bit is set.
+ */
 static inline void skb_dst_drop(struct sk_buff *skb)
 {
-	if (skb->_skb_dst)
-		dst_release(skb_dst(skb));
+	__skb_dst_drop(skb->_skb_dst);
 	skb->_skb_dst = 0UL;
 }
 
+static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb)
+{
+	nskb->_skb_dst = oskb->_skb_dst;
+	if (nskb->_skb_dst & SKB_DST_REFTAKEN)
+		dst_clone(skb_dst(nskb));
+}
+
+/**
+ * skb_dst_force - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted, let's do it
+ */
+static inline void skb_dst_force(struct sk_buff *skb)
+{
+	if (!(skb->_skb_dst & SKB_DST_REFTAKEN)) {
+		dst_clone(skb_dst(skb));
+		skb->_skb_dst |= SKB_DST_REFTAKEN;
+	}
+}
+
 /* Children define the path of the packet through the
  * Linux networking.  Thus, destinations are stackable.
  */
diff --git a/include/net/ip.h b/include/net/ip.h
index 72c3692..ec94680 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -378,6 +378,7 @@  extern int ip_options_rcv_srr(struct sk_buff *skb);
  *	Functions provided by ip_sockglue.c
  */
 
+extern int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 extern void	ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
 extern int	ip_cmsg_send(struct net *net,
 			     struct msghdr *msg, struct ipcm_cookie *ipc);
diff --git a/include/net/route.h b/include/net/route.h
index 40f6346..726df35 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -115,7 +115,8 @@  extern void		rt_cache_flush(struct net *net, int how);
 extern int		__ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
 extern int		ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
 extern int		ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
-extern int		ip_route_input(struct sk_buff*, __be32 dst, __be32 src, u8 tos, struct net_device *devin);
+extern int		ip_route_input(struct sk_buff*, __be32 dst, __be32 src, u8 tos,
+				       struct net_device *devin, int noref);
 extern unsigned short	ip_rt_frag_needed(struct net *net, struct iphdr *iph, unsigned short new_mtu, struct net_device *dev);
 extern void		ip_rt_send_redirect(struct sk_buff *skb);
 
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 4fde742..5c88c83 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -349,7 +349,7 @@  static int br_nf_pre_routing_finish(struct sk_buff *skb)
 	}
 	nf_bridge->mask ^= BRNF_NF_BRIDGE_PREROUTING;
 	if (dnat_took_place(skb)) {
-		if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))) {
+		if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev, 1))) {
 			struct flowi fl = {
 				.nl_u = {
 					.ip4_u = {
diff --git a/net/core/dev.c b/net/core/dev.c
index dca8b50..e47280d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1700,9 +1700,15 @@  int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		 * If device doesnt need skb->dst, release it right now while
 		 * its hot in this cpu cache
 		 */
-		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
-			skb_dst_drop(skb);
-
+		if (skb_dst(skb)) {
+			if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
+				skb_dst_drop(skb);
+			else
+				/*
+				 * make sure dst was refcounted by caller
+				 */
+				WARN_ON(!(skb->_skb_dst & SKB_DST_REFTAKEN));
+		}
 		rc = ops->ndo_start_xmit(skb, dev);
 		if (rc == NETDEV_TX_OK)
 			txq_trans_update(txq);
@@ -1861,6 +1867,12 @@  gso:
 	if (q->enqueue) {
 		spinlock_t *root_lock = qdisc_lock(q);
 
+		/*
+		 * before queueing skb, we must enforce a refcounted dst
+		 */
+		if (!(dev->priv_flags & IFF_XMIT_DST_RELEASE))
+			skb_dst_force(skb);
+
 		spin_lock(root_lock);
 
 		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9e0597d..37b1fc0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -530,7 +530,7 @@  static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->transport_header	= old->transport_header;
 	new->network_header	= old->network_header;
 	new->mac_header		= old->mac_header;
-	skb_dst_set(new, dst_clone(skb_dst(old)));
+	skb_dst_copy(new, old);
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
diff --git a/net/core/sock.c b/net/core/sock.c
index d9eec15..2d1cf4c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -304,6 +304,10 @@  int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	 * from the queue.
 	 */
 	skb_len = skb->len;
+	/*
+	 * If we still have a dst entry linked, make sure we refcounted it
+	 */
+	skb_dst_force(skb);
 
 	skb_queue_tail(&sk->sk_receive_queue, skb);
 
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index c29d75d..ff436be 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -812,7 +812,7 @@  static int arp_process(struct sk_buff *skb)
 	}
 
 	if (arp->ar_op == htons(ARPOP_REQUEST) &&
-	    ip_route_input(skb, tip, sip, 0, dev) == 0) {
+	    ip_route_input(skb, tip, sip, 0, dev, 0) == 0) {
 
 		rt = skb_rtable(skb);
 		addr_type = rt->rt_type;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 97c410e..08e68d0 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -584,20 +584,20 @@  void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 			err = __ip_route_output_key(net, &rt2, &fl);
 		else {
 			struct flowi fl2 = {};
-			struct dst_entry *odst;
+			unsigned long odst;
 
 			fl2.fl4_dst = fl.fl4_src;
 			if (ip_route_output_key(net, &rt2, &fl2))
 				goto relookup_failed;
 
 			/* Ugh! */
-			odst = skb_dst(skb_in);
+			odst = skb_in->_skb_dst; /* save old dst */
 			err = ip_route_input(skb_in, fl.fl4_dst, fl.fl4_src,
-					     RT_TOS(tos), rt2->u.dst.dev);
+					     RT_TOS(tos), rt2->u.dst.dev, 0);
 
 			dst_release(&rt2->u.dst);
 			rt2 = skb_rtable(skb_in);
-			skb_dst_set(skb_in, odst);
+			skb_in->_skb_dst = odst; /* restore old dst */
 		}
 
 		if (err)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index db46b4b..23e974b 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -331,7 +331,7 @@  static int ip_rcv_finish(struct sk_buff *skb)
 	 */
 	if (skb_dst(skb) == NULL) {
 		int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
-					 skb->dev);
+					 skb->dev, 1);
 		if (unlikely(err)) {
 			if (err == -EHOSTUNREACH)
 				IP_INC_STATS_BH(dev_net(skb->dev),
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 94bf105..e45abbd 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -600,6 +600,7 @@  int ip_options_rcv_srr(struct sk_buff *skb)
 	unsigned char *optptr = skb_network_header(skb) + opt->srr;
 	struct rtable *rt = skb_rtable(skb);
 	struct rtable *rt2;
+	unsigned long odst;
 	int err;
 
 	if (!opt->srr)
@@ -623,16 +624,16 @@  int ip_options_rcv_srr(struct sk_buff *skb)
 		}
 		memcpy(&nexthop, &optptr[srrptr-1], 4);
 
-		rt = skb_rtable(skb);
+		odst = skb->_skb_dst;
 		skb_dst_set(skb, NULL);
-		err = ip_route_input(skb, nexthop, iph->saddr, iph->tos, skb->dev);
+		err = ip_route_input(skb, nexthop, iph->saddr, iph->tos, skb->dev, 0);
 		rt2 = skb_rtable(skb);
 		if (err || (rt2->rt_type != RTN_UNICAST && rt2->rt_type != RTN_LOCAL)) {
-			ip_rt_put(rt2);
-			skb_dst_set(skb, &rt->u.dst);
+			skb_dst_drop(skb);
+			skb->_skb_dst = odst;
 			return -EINVAL;
 		}
-		ip_rt_put(rt);
+		__skb_dst_drop(odst);
 		if (rt2->rt_type != RTN_LOCAL)
 			break;
 		/* Superfast 8) loopback forward */
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index fc7993e..7349554 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -946,6 +946,22 @@  e_inval:
 	return -EINVAL;
 }
 
+/**
+ * ip_queue_rcv_skb - Queue an skb into sock receive queue
+ * @sk: socket
+ * @skb: buffer
+ *
+ * Queues an skb into socket receive queue. If IP_CMSG_PKTINFO option
+ * is not set, we drop skb dst entry now.
+ */
+int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+	if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
+		skb_dst_drop(skb);
+	return sock_queue_rcv_skb(sk, skb);
+}
+EXPORT_SYMBOL(ip_queue_rcv_skb);
+
 int ip_setsockopt(struct sock *sk, int level,
 		int optname, char __user *optval, int optlen)
 {
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index 1725dc0..7eb2a20 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -16,7 +16,7 @@  int ip_route_me_harder(struct sk_buff *skb, unsigned addr_type)
 	const struct iphdr *iph = ip_hdr(skb);
 	struct rtable *rt;
 	struct flowi fl = {};
-	struct dst_entry *odst;
+	unsigned long odst;
 	unsigned int hh_len;
 	unsigned int type;
 
@@ -50,14 +50,14 @@  int ip_route_me_harder(struct sk_buff *skb, unsigned addr_type)
 		if (ip_route_output_key(net, &rt, &fl) != 0)
 			return -1;
 
-		odst = skb_dst(skb);
+		odst = skb->_skb_dst;
 		if (ip_route_input(skb, iph->daddr, iph->saddr,
-				   RT_TOS(iph->tos), rt->u.dst.dev) != 0) {
+				   RT_TOS(iph->tos), rt->u.dst.dev, 0) != 0) {
 			dst_release(&rt->u.dst);
 			return -1;
 		}
 		dst_release(&rt->u.dst);
-		dst_release(odst);
+		__skb_dst_drop(odst);
 	}
 
 	if (skb_dst(skb)->error)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2979f14..ccccd91 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -291,7 +291,7 @@  static int raw_rcv_skb(struct sock * sk, struct sk_buff * skb)
 {
 	/* Charge it to the socket. */
 
-	if (sock_queue_rcv_skb(sk, skb) < 0) {
+	if (ip_queue_rcv_skb(sk, skb) < 0) {
 		atomic_inc(&sk->sk_drops);
 		kfree_skb(skb);
 		return NET_RX_DROP;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 278f46f..e66d85d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2255,7 +2255,7 @@  martian_source:
 }
 
 int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
-		   u8 tos, struct net_device *dev)
+		   u8 tos, struct net_device *dev, int noref)
 {
 	struct rtable * rth;
 	unsigned	hash;
@@ -2281,10 +2281,15 @@  int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		    rth->fl.mark == skb->mark &&
 		    net_eq(dev_net(rth->u.dst.dev), net) &&
 		    !rt_is_expired(rth)) {
-			dst_use(&rth->u.dst, jiffies);
+			if (noref) {
+				dst_use_noref(&rth->u.dst, jiffies);
+				skb_dst_set_noref(skb, &rth->u.dst);
+			} else {
+				dst_use(&rth->u.dst, jiffies);
+				skb_dst_set(skb, &rth->u.dst);
+			}
 			RT_CACHE_STAT_INC(in_hit);
 			rcu_read_unlock();
-			skb_dst_set(skb, &rth->u.dst);
 			return 0;
 		}
 		RT_CACHE_STAT_INC(in_hlist_search);
@@ -2944,7 +2949,7 @@  static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void
 		skb->protocol	= htons(ETH_P_IP);
 		skb->dev	= dev;
 		local_bh_disable();
-		err = ip_route_input(skb, dst, src, rtm->rtm_tos, dev);
+		err = ip_route_input(skb, dst, src, rtm->rtm_tos, dev, 0);
 		local_bh_enable();
 
 		rt = skb_rtable(skb);
@@ -3008,7 +3013,7 @@  int ip_rt_dump(struct sk_buff *skb,  struct netlink_callback *cb)
 				continue;
 			if (rt_is_expired(rt))
 				continue;
-			skb_dst_set(skb, dst_clone(&rt->u.dst));
+			skb_dst_set_noref(skb, &rt->u.dst);
 			if (rt_fill_info(net, skb, NETLINK_CB(cb->skb).pid,
 					 cb->nlh->nlmsg_seq, RTM_NEWROUTE,
 					 1, NLM_F_MULTI) <= 0) {
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2bdb0da..32fb9c6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4297,6 +4297,7 @@  static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
 		goto drop;
 
+	skb_dst_drop(skb);
 	__skb_pull(skb, th->doff * 4);
 
 	TCP_ECN_accept_cwr(tp, skb);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 29ebb0d..094f1ad 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1024,7 +1024,7 @@  static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	int is_udplite = IS_UDPLITE(sk);
 	int rc;
 
-	if ((rc = sock_queue_rcv_skb(sk, skb)) < 0) {
+	if ((rc = ip_queue_rcv_skb(sk, skb)) < 0) {
 		/* Note that an ENOMEM error is charged twice */
 		if (rc == -ENOMEM) {
 			UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index f9f922a..fd5310b 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -27,7 +27,7 @@  static inline int xfrm4_rcv_encap_finish(struct sk_buff *skb)
 		const struct iphdr *iph = ip_hdr(skb);
 
 		if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
-				   skb->dev))
+				   skb->dev, 1))
 			goto drop;
 	}
 	return dst_input(skb);
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a1d6045..94a23b9 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -564,7 +564,7 @@  ip4ip6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	} else {
 		ip_rt_put(rt);
 		if (ip_route_input(skb2, eiph->daddr, eiph->saddr, eiph->tos,
-				   skb2->dev) ||
+				   skb2->dev, 0) ||
 		    skb_dst(skb2)->dev->type != ARPHRD_TUNNEL)
 			goto out;
 	}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d79fa67..5732ff5 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -385,7 +385,7 @@  int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	if ((rc = sock_queue_rcv_skb(sk,skb)) < 0) {
+	if ((rc = ip_queue_rcv_skb(sk, skb)) < 0) {
 		/* Note that an ENOMEM error is charged twice */
 		if (rc == -ENOMEM) {
 			UDP6_INC_STATS_BH(sock_net(sk),
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index 3a6fd77..38fa876 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -169,6 +169,7 @@  static int __nf_queue(struct sk_buff *skb,
 			dev_hold(physoutdev);
 	}
 #endif
+	skb_dst_force(skb);
 	afinfo->saveroute(skb, entry);
 	status = qh->outfn(entry, queuenum);