diff mbox

virtio-net: fix a race on 32bit arches

Message ID 1338971724.2760.3913.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet June 6, 2012, 8:35 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
on 32bit arches.

We must use separate syncp for rx and tx path as they can be run at the
same time on different cpus. Thus one sequence increment can be lost and
readers spin forever.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet June 6, 2012, 8:45 a.m. UTC | #1
On Wed, 2012-06-06 at 10:35 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
> on 32bit arches.
> 
> We must use separate syncp for rx and tx path as they can be run at the
> same time on different cpus. Thus one sequence increment can be lost and
> readers spin forever.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> ---

Just to make clear : even using percpu stats/syncp, we have no guarantee
that write_seqcount_begin() is done with one instruction. [1]

It is OK on x86 if "incl" instruction is generated by the compiler, but
on a RISC cpu, the "load memory,%reg ; inc %reg ; store %reg,memory" can
be interrupted.

So if you are 100% sure all paths are safe against preemption/BH, then
this patch is not needed, but a big comment in the code would avoid
adding possible races in the future.

[1] If done with one instruction, we still have a race, since a reader
might see an even sequence and conclude no writer is inside the critical
section. So read values could be wrong.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Wang June 6, 2012, 9:37 a.m. UTC | #2
On 06/06/2012 04:45 PM, Eric Dumazet wrote:
> On Wed, 2012-06-06 at 10:35 +0200, Eric Dumazet wrote:
>> From: Eric Dumazet<edumazet@google.com>
>>
>> commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
>> on 32bit arches.
>>
>> We must use separate syncp for rx and tx path as they can be run at the
>> same time on different cpus. Thus one sequence increment can be lost and
>> readers spin forever.
>>
>> Signed-off-by: Eric Dumazet<edumazet@google.com>
>> Cc: Stephen Hemminger<shemminger@vyatta.com>
>> Cc: Michael S. Tsirkin<mst@redhat.com>
>> Cc: Jason Wang<jasowang@redhat.com>
>> ---
> Just to make clear : even using percpu stats/syncp, we have no guarantee
> that write_seqcount_begin() is done with one instruction. [1]
>
> It is OK on x86 if "incl" instruction is generated by the compiler, but
> on a RISC cpu, the "load memory,%reg ; inc %reg ; store %reg,memory" can
> be interrupted.
>
> So if you are 100% sure all paths are safe against preemption/BH, then
> this patch is not needed, but a big comment in the code would avoid
> adding possible races in the future.

Thanks for explaing, current virtio-net is safe I think. But the patch 
is still needed as my patch would update the statistics in irq.
>
> [1] If done with one instruction, we still have a race, since a reader
> might see an even sequence and conclude no writer is inside the critical
> section. So read values could be wrong.
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 6, 2012, 11:13 a.m. UTC | #3
On Wed, Jun 06, 2012 at 10:45:41AM +0200, Eric Dumazet wrote:
> On Wed, 2012-06-06 at 10:35 +0200, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
> > on 32bit arches.
> > 
> > We must use separate syncp for rx and tx path as they can be run at the
> > same time on different cpus. Thus one sequence increment can be lost and
> > readers spin forever.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > ---
> 
> Just to make clear : even using percpu stats/syncp, we have no guarantee
> that write_seqcount_begin() is done with one instruction. [1]
> 
> It is OK on x86 if "incl" instruction is generated by the compiler, but
> on a RISC cpu, the "load memory,%reg ; inc %reg ; store %reg,memory" can
> be interrupted.
> 
> So if you are 100% sure all paths are safe against preemption/BH, then
> this patch is not needed, but a big comment in the code would avoid
> adding possible races in the future.

We currently do all stats either on napi callback or from
start_xmit callback.
This makes them safe, yes?

> [1] If done with one instruction, we still have a race, since a reader
> might see an even sequence and conclude no writer is inside the critical
> section. So read values could be wrong.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rusty Russell June 10, 2012, 6:36 a.m. UTC | #4
On Wed, 06 Jun 2012 10:45:41 +0200, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2012-06-06 at 10:35 +0200, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
> > on 32bit arches.
> > 
> > We must use separate syncp for rx and tx path as they can be run at the
> > same time on different cpus. Thus one sequence increment can be lost and
> > readers spin forever.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > ---
> 
> Just to make clear : even using percpu stats/syncp, we have no guarantee
> that write_seqcount_begin() is done with one instruction. [1]
> 
> It is OK on x86 if "incl" instruction is generated by the compiler, but
> on a RISC cpu, the "load memory,%reg ; inc %reg ; store %reg,memory" can
> be interrupted.
> 
> So if you are 100% sure all paths are safe against preemption/BH, then
> this patch is not needed, but a big comment in the code would avoid
> adding possible races in the future.

Too fragile; let's keep them separate as per this patch.

Acked-by: Rusty Russell <rusty@rustcorp.com.au>

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 10, 2012, 7:03 a.m. UTC | #5
On Sun, Jun 10, 2012 at 04:06:34PM +0930, Rusty Russell wrote:
> On Wed, 06 Jun 2012 10:45:41 +0200, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2012-06-06 at 10:35 +0200, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
> > > on 32bit arches.
> > > 
> > > We must use separate syncp for rx and tx path as they can be run at the
> > > same time on different cpus. Thus one sequence increment can be lost and
> > > readers spin forever.
> > > 
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > Cc: Jason Wang <jasowang@redhat.com>
> > > ---
> > 
> > Just to make clear : even using percpu stats/syncp, we have no guarantee
> > that write_seqcount_begin() is done with one instruction. [1]
> > 
> > It is OK on x86 if "incl" instruction is generated by the compiler, but
> > on a RISC cpu, the "load memory,%reg ; inc %reg ; store %reg,memory" can
> > be interrupted.
> > 
> > So if you are 100% sure all paths are safe against preemption/BH, then
> > this patch is not needed, but a big comment in the code would avoid
> > adding possible races in the future.
> 
> Too fragile; let's keep them separate as per this patch.
> 
> Acked-by: Rusty Russell <rusty@rustcorp.com.au>
> 
> Thanks,
> Rusty.

One question though: do we want to lay the structure
out so that the rx sync structure precedes the rx counters?
Eric Dumazet June 10, 2012, 10:21 a.m. UTC | #6
On Sun, 2012-06-10 at 10:03 +0300, Michael S. Tsirkin wrote:

> One question though: do we want to lay the structure
> out so that the rx sync structure precedes the rx counters?
> 

I am not sure its worth having holes in the structure, since its percpu
data.

That would be 8 bytes lost per cpu and per device.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 10, 2012, 10:22 a.m. UTC | #7
On Sun, Jun 10, 2012 at 12:21:14PM +0200, Eric Dumazet wrote:
> On Sun, 2012-06-10 at 10:03 +0300, Michael S. Tsirkin wrote:
> 
> > One question though: do we want to lay the structure
> > out so that the rx sync structure precedes the rx counters?
> > 
> 
> I am not sure its worth having holes in the structure, since its percpu
> data.
> 
> That would be 8 bytes lost per cpu and per device.

Right, I forgot it's per cpu.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 10, 2012, 10:25 a.m. UTC | #8
On Wed, Jun 06, 2012 at 10:35:24AM +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
> on 32bit arches.
> 
> We must use separate syncp for rx and tx path as they can be run at the
> same time on different cpus. Thus one sequence increment can be lost and
> readers spin forever.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>

I'm still thinking about moving tx to take a xmit lock long term,
meanwhile this fix appears appropriate for 3.5.

Acked-by: Michael S. Tsirkin <mst@redhat.com>

Dave, can you pick this up pls?

> ---
>  drivers/net/virtio_net.c |   19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5214b1e..f18149a 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -42,7 +42,8 @@ module_param(gso, bool, 0444);
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
>  struct virtnet_stats {
> -	struct u64_stats_sync syncp;
> +	struct u64_stats_sync tx_syncp;
> +	struct u64_stats_sync rx_syncp;
>  	u64 tx_bytes;
>  	u64 tx_packets;
>  
> @@ -300,10 +301,10 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>  
>  	hdr = skb_vnet_hdr(skb);
>  
> -	u64_stats_update_begin(&stats->syncp);
> +	u64_stats_update_begin(&stats->rx_syncp);
>  	stats->rx_bytes += skb->len;
>  	stats->rx_packets++;
> -	u64_stats_update_end(&stats->syncp);
> +	u64_stats_update_end(&stats->rx_syncp);
>  
>  	if (hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
>  		pr_debug("Needs csum!\n");
> @@ -565,10 +566,10 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>  	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
>  
> -		u64_stats_update_begin(&stats->syncp);
> +		u64_stats_update_begin(&stats->tx_syncp);
>  		stats->tx_bytes += skb->len;
>  		stats->tx_packets++;
> -		u64_stats_update_end(&stats->syncp);
> +		u64_stats_update_end(&stats->tx_syncp);
>  
>  		tot_sgs += skb_vnet_hdr(skb)->num_sg;
>  		dev_kfree_skb_any(skb);
> @@ -703,12 +704,16 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>  		u64 tpackets, tbytes, rpackets, rbytes;
>  
>  		do {
> -			start = u64_stats_fetch_begin(&stats->syncp);
> +			start = u64_stats_fetch_begin(&stats->tx_syncp);
>  			tpackets = stats->tx_packets;
>  			tbytes   = stats->tx_bytes;
> +		} while (u64_stats_fetch_retry(&stats->tx_syncp, start));
> +
> +		do {
> +			start = u64_stats_fetch_begin(&stats->rx_syncp);
>  			rpackets = stats->rx_packets;
>  			rbytes   = stats->rx_bytes;
> -		} while (u64_stats_fetch_retry(&stats->syncp, start));
> +		} while (u64_stats_fetch_retry(&stats->rx_syncp, start));
>  
>  		tot->rx_packets += rpackets;
>  		tot->tx_packets += tpackets;
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 11, 2012, 3:23 a.m. UTC | #9
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Sun, 10 Jun 2012 13:25:12 +0300

> On Wed, Jun 06, 2012 at 10:35:24AM +0200, Eric Dumazet wrote:
>> From: Eric Dumazet <edumazet@google.com>
>> 
>> commit 3fa2a1df909 (virtio-net: per cpu 64 bit stats (v2)) added a race
>> on 32bit arches.
>> 
>> We must use separate syncp for rx and tx path as they can be run at the
>> same time on different cpus. Thus one sequence increment can be lost and
>> readers spin forever.
>> 
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Cc: Stephen Hemminger <shemminger@vyatta.com>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Jason Wang <jasowang@redhat.com>
> 
> I'm still thinking about moving tx to take a xmit lock long term,
> meanwhile this fix appears appropriate for 3.5.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Dave, can you pick this up pls?

Done, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5214b1e..f18149a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -42,7 +42,8 @@  module_param(gso, bool, 0444);
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
-	struct u64_stats_sync syncp;
+	struct u64_stats_sync tx_syncp;
+	struct u64_stats_sync rx_syncp;
 	u64 tx_bytes;
 	u64 tx_packets;
 
@@ -300,10 +301,10 @@  static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 
 	hdr = skb_vnet_hdr(skb);
 
-	u64_stats_update_begin(&stats->syncp);
+	u64_stats_update_begin(&stats->rx_syncp);
 	stats->rx_bytes += skb->len;
 	stats->rx_packets++;
-	u64_stats_update_end(&stats->syncp);
+	u64_stats_update_end(&stats->rx_syncp);
 
 	if (hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
 		pr_debug("Needs csum!\n");
@@ -565,10 +566,10 @@  static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
-		u64_stats_update_begin(&stats->syncp);
+		u64_stats_update_begin(&stats->tx_syncp);
 		stats->tx_bytes += skb->len;
 		stats->tx_packets++;
-		u64_stats_update_end(&stats->syncp);
+		u64_stats_update_end(&stats->tx_syncp);
 
 		tot_sgs += skb_vnet_hdr(skb)->num_sg;
 		dev_kfree_skb_any(skb);
@@ -703,12 +704,16 @@  static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 		u64 tpackets, tbytes, rpackets, rbytes;
 
 		do {
-			start = u64_stats_fetch_begin(&stats->syncp);
+			start = u64_stats_fetch_begin(&stats->tx_syncp);
 			tpackets = stats->tx_packets;
 			tbytes   = stats->tx_bytes;
+		} while (u64_stats_fetch_retry(&stats->tx_syncp, start));
+
+		do {
+			start = u64_stats_fetch_begin(&stats->rx_syncp);
 			rpackets = stats->rx_packets;
 			rbytes   = stats->rx_bytes;
-		} while (u64_stats_fetch_retry(&stats->syncp, start));
+		} while (u64_stats_fetch_retry(&stats->rx_syncp, start));
 
 		tot->rx_packets += rpackets;
 		tot->tx_packets += tpackets;