diff mbox

vxlan/veth performance issues on net.git + latest kernels

Message ID 1386115199.30495.44.camel@edumazet-glaptop2.roam.corp.google.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Dec. 3, 2013, 11:59 p.m. UTC
On Wed, 2013-12-04 at 01:10 +0200, Or Gerlitz wrote:

> Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406
> +  97.13%          swapper  [kernel.kallsyms]  [k] net_tx_action
> +   1.53%            iperf  [kernel.kallsyms]  [k] net_tx_action
> +   1.03%             perf  [kernel.kallsyms]  [k] net_tx_action
> +   0.27%      ksoftirqd/7  [kernel.kallsyms]  [k] net_tx_action
> +   0.03%      kworker/7:1  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%          rpcbind  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%          swapper  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%            sleep  [kernel.kallsyms]  [k] net_tx_action
> +   0.00%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%            iperf  [kernel.kallsyms]  [k] kfree_skb
> +   0.00%             perf  [kernel.kallsyms]  [k] kfree_skb
> 

Right, I actually have a patch for that, but was waiting for net-next
being re-opened :

commit 9a731d750dd8bf0b8c20fb1ca53c42317fb4dd37
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 25 13:09:20 2013 -0800

    net-fixes: introduce dev_consume_skb_any()
    
    Some NIC drivers use dev_kfree_skb_any() generic helper to free skbs,
    both for dropped packets and TX completed ones.
    
    To have "perf record -e skb:kfree_skb" give good hints on where
    packets are dropped (if any), we need to separate the two causes.
    
    dev_consume_skb_any() is a helper to free skbs that were properly sent
    to the wire.
    

Signed-off-by: Eric Dumazet <edumazet@google.com>
Google-Bug-Id: 11634401
---


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Alexei Starovoitov Dec. 4, 2013, 12:26 a.m. UTC | #1
On Tue, Dec 3, 2013 at 3:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> +#define SKB_CONSUMED_MAGIC ((void *)0xDEAD0001)
> +static inline void dev_consume_skb_any(struct sk_buff *skb)
> +{
> +       skb->dev = SKB_CONSUMED_MAGIC;
> +       dev_kfree_skb_any(skb);
> +}
> +
>  int netif_rx(struct sk_buff *skb);
>  int netif_rx_ni(struct sk_buff *skb);
>  int netif_receive_skb(struct sk_buff *skb);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ba3b7ea5ebb3..b2b0e5776ce9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3306,7 +3306,10 @@ static void net_tx_action(struct softirq_action *h)
>                         clist = clist->next;
>
>                         WARN_ON(atomic_read(&skb->users));
> -                       trace_kfree_skb(skb, net_tx_action);
> +                       if (likely(skb->dev == SKB_CONSUMED_MAGIC))
> +                               trace_consume_skb(skb);
> +                       else
> +                               trace_kfree_skb(skb, net_tx_action);

Could you use some other way to mark skb ?
In tracing we might want to examine skb more carefully and not being
able to see the device
will limit the usability of this tracepoint.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Dec. 4, 2013, 12:36 a.m. UTC | #2
On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote:

> 
> Could you use some other way to mark skb ?

I could ;)

> In tracing we might want to examine skb more carefully and not being
> able to see the device
> will limit the usability of this tracepoint.

Unfortunately, using skb->dev as a pointer to device would be buggy or
expensive (you would need to take a reference on device in order not
letting it disappear, as we escape RCU protection)

Current TRACE_EVENT for trace_consume_skb() does not use skb->dev.

Anyway, this magic is pretty easy to change, I am open to suggestions.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov Dec. 4, 2013, 12:55 a.m. UTC | #3
On Tue, Dec 3, 2013 at 4:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2013-12-03 at 16:26 -0800, Alexei Starovoitov wrote:
>
>>
>> Could you use some other way to mark skb ?
>
> I could ;)
>
>> In tracing we might want to examine skb more carefully and not being
>> able to see the device
>> will limit the usability of this tracepoint.
>
> Unfortunately, using skb->dev as a pointer to device would be buggy or
> expensive (you would need to take a reference on device in order not
> letting it disappear, as we escape RCU protection)

well, yes, you might have an skb around when device is already freed
when skb_dst_noref.
but I'm not suggesting anything expensive. Tracing definitely should
not add overhead by doing rcu_lock() or dev_hold(). Instead it can go
through skb, skb->dev, skb->dev->xxx via probe_kernel_read(). If dev
is gone, it's still safe.

> Anyway, this magic is pretty easy to change, I am open to suggestions.

you're the expert :) use skb->mark field, since it's unused during
freeing path... ?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Dec. 4, 2013, 6:39 a.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Dec 2013 15:59:59 -0800

> On Wed, 2013-12-04 at 01:10 +0200, Or Gerlitz wrote:
> 
>> Samples: 883K of event 'skb:kfree_skb', Event count (approx.): 883406
>> +  97.13%          swapper  [kernel.kallsyms]  [k] net_tx_action
>> +   1.53%            iperf  [kernel.kallsyms]  [k] net_tx_action
>> +   1.03%             perf  [kernel.kallsyms]  [k] net_tx_action
>> +   0.27%      ksoftirqd/7  [kernel.kallsyms]  [k] net_tx_action
>> +   0.03%      kworker/7:1  [kernel.kallsyms]  [k] net_tx_action
>> +   0.00%          rpcbind  [kernel.kallsyms]  [k] net_tx_action
>> +   0.00%          swapper  [kernel.kallsyms]  [k] kfree_skb
>> +   0.00%            sleep  [kernel.kallsyms]  [k] net_tx_action
>> +   0.00%  hald-addon-acpi  [kernel.kallsyms]  [k] kfree_skb
>> +   0.00%            iperf  [kernel.kallsyms]  [k] kfree_skb
>> +   0.00%             perf  [kernel.kallsyms]  [k] kfree_skb
>> 
> 
> Right, I actually have a patch for that, but was waiting for net-next
> being re-opened :
> 
> commit 9a731d750dd8bf0b8c20fb1ca53c42317fb4dd37
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Mon Nov 25 13:09:20 2013 -0800
> 
>     net-fixes: introduce dev_consume_skb_any()

I definitely prefer the control block approach to this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Dec. 4, 2013, 5:40 p.m. UTC | #5
On Wed, 2013-12-04 at 01:39 -0500, David Miller wrote:

> I definitely prefer the control block approach to this.

I polished the patch to keep this knowledge in net/core/dev.c



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index ec96130533cc..8b8c2171b187 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -209,9 +209,9 @@  static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fp_txdata *txdata,
 	if (likely(skb)) {
 		(*pkts_compl)++;
 		(*bytes_compl) += skb->len;
+		dev_consume_skb_any(skb);
 	}
 
-	dev_kfree_skb_any(skb);
 	tx_buf->first_bd = 0;
 	tx_buf->skb = NULL;
 
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 8d3945ab7334..03081932e519 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1058,7 +1058,7 @@  static void e1000_put_txbuf(struct e1000_ring *tx_ring,
 		buffer_info->dma = 0;
 	}
 	if (buffer_info->skb) {
-		dev_kfree_skb_any(buffer_info->skb);
+		dev_consume_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
 	buffer_info->time_stamp = 0;
diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index 43aa7acd84a6..294825efb248 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -2037,7 +2037,7 @@  static void sky2_tx_complete(struct sky2_port *sky2, u16 done)
 			bytes_compl += skb->len;
 
 			re->skb = NULL;
-			dev_kfree_skb_any(skb);
+			dev_consume_skb_any(skb);
 
 			sky2->tx_next = RING_NEXT(idx, sky2->tx_ring_size);
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index f54ebd5a1702..653484bfae98 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -317,7 +317,7 @@  static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			}
 		}
 	}
-	dev_kfree_skb_any(skb);
+	dev_consume_skb_any(skb);
 	return tx_info->nr_txbb;
 }
 
diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
index 2d045be4b5cf..d44f7b69a6a0 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -2557,7 +2557,7 @@  static int nv_tx_done(struct net_device *dev, int limit)
 					u64_stats_update_end(&np->swstats_tx_syncp);
 				}
 				bytes_compl += np->get_tx_ctx->skb->len;
-				dev_kfree_skb_any(np->get_tx_ctx->skb);
+				dev_consume_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
 				tx_work++;
 			}
@@ -2574,7 +2574,7 @@  static int nv_tx_done(struct net_device *dev, int limit)
 					u64_stats_update_end(&np->swstats_tx_syncp);
 				}
 				bytes_compl += np->get_tx_ctx->skb->len;
-				dev_kfree_skb_any(np->get_tx_ctx->skb);
+				dev_consume_skb_any(np->get_tx_ctx->skb);
 				np->get_tx_ctx->skb = NULL;
 				tx_work++;
 			}
@@ -2625,7 +2625,7 @@  static int nv_tx_done_optimized(struct net_device *dev, int limit)
 			}
 
 			bytes_cleaned += np->get_tx_ctx->skb->len;
-			dev_kfree_skb_any(np->get_tx_ctx->skb);
+			dev_consume_skb_any(np->get_tx_ctx->skb);
 			np->get_tx_ctx->skb = NULL;
 			tx_work++;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 916241d16c67..ab8970693ff3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -755,7 +755,7 @@  static void free_old_xmit_skbs(struct send_queue *sq)
 		stats->tx_packets++;
 		u64_stats_update_end(&stats->tx_syncp);
 
-		dev_kfree_skb_any(skb);
+		dev_consume_skb_any(skb);
 	}
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f0ed423a360..8a7482fa2656 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2380,6 +2380,13 @@  void dev_kfree_skb_irq(struct sk_buff *skb);
  */
 void dev_kfree_skb_any(struct sk_buff *skb);
 
+#define SKB_CONSUMED_MAGIC ((void *)0xDEAD0001)
+static inline void dev_consume_skb_any(struct sk_buff *skb)
+{
+	skb->dev = SKB_CONSUMED_MAGIC;
+	dev_kfree_skb_any(skb);
+}
+
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index ba3b7ea5ebb3..b2b0e5776ce9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3306,7 +3306,10 @@  static void net_tx_action(struct softirq_action *h)
 			clist = clist->next;
 
 			WARN_ON(atomic_read(&skb->users));
-			trace_kfree_skb(skb, net_tx_action);
+			if (likely(skb->dev == SKB_CONSUMED_MAGIC))
+				trace_consume_skb(skb);
+			else
+				trace_kfree_skb(skb, net_tx_action);
 			__kfree_skb(skb);
 		}
 	}