
[net] net: core: orphan frags before queuing to slow qdisc

Message ID 1389951734-13234-1-git-send-email-jasowang@redhat.com
State Rejected, archived
Delegated to: David Miller

Commit Message

Jason Wang Jan. 17, 2014, 9:42 a.m. UTC
Many qdiscs can queue a packet for a long time, which leads to an issue
with zerocopy skbs: the frags will not be orphaned within the expected
short time, and this breaks the assumption that virtio-net will transmit
the packet promptly.

So if guest packets are queued through such a qdisc and hit the limit on
the maximum number of pending packets for virtio/vhost, all packets from
the guest, even those going to another destination, will be blocked.

A case for reproducing the issue:

- Boot two VMs and connect them to the same bridge kvmbr.
- Set up tbf with a very low rate/burst on eth0, which is a port of kvmbr.
- Let VM1 send lots of packets through eth0.
- After a while, VM1 is unable to send any packets out since the number of
  pending packets (queued to tbf) exceeds the limit of vhost/virtio.

Solve this issue by orphaning the frags before queuing the packet to a slow
qdisc (one without TCQ_F_CAN_BYPASS).

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 net/core/dev.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Eric Dumazet Jan. 17, 2014, 2:28 p.m. UTC | #1
On Fri, 2014-01-17 at 17:42 +0800, Jason Wang wrote:
> Many qdiscs can queue a packet for a long time, which leads to an issue
> with zerocopy skbs: the frags will not be orphaned within the expected
> short time, and this breaks the assumption that virtio-net will transmit
> the packet promptly.
> 
> So if guest packets are queued through such a qdisc and hit the limit on
> the maximum number of pending packets for virtio/vhost, all packets from
> the guest, even those going to another destination, will be blocked.
> 
> A case for reproducing the issue:
> 
> - Boot two VMs and connect them to the same bridge kvmbr.
> - Set up tbf with a very low rate/burst on eth0, which is a port of kvmbr.
> - Let VM1 send lots of packets through eth0.
> - After a while, VM1 is unable to send any packets out since the number of
>   pending packets (queued to tbf) exceeds the limit of vhost/virtio.

So what's the problem? If the limit is low, you cannot send packets.

Solution : increase the limit, or tell the vm to lower its rate.

Oh wait, are you bitten because you did some prior skb_orphan() to allow
the vm to send an unlimited number of skbs???

> 
> Solve this issue by orphaning the frags before queuing it to a slow qdisc (the
> one without TCQ_F_CAN_BYPASS).

Why does orphaning the frags alone solve the problem? A skb without zerocopy
frags should also be blocked for a while.

Seriously, let's admit this zero copy stuff is utterly broken.


TCQ_F_CAN_BYPASS is not enough. Some NICs have separate queues with
strict priorities.

It seems to me that you are pushing to use FIFO (the only qdisc setting
TCQ_F_CAN_BYPASS), by adding yet another test in the fast path (I do not
know how we can still call it a fast path), while we already have smart
qdiscs to avoid the inherent HOL and unfairness problems of FIFO.

> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  net/core/dev.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ce469e..1209774 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2700,6 +2700,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	contended = qdisc_is_running(q);
>  	if (unlikely(contended))
>  		spin_lock(&q->busylock);
> +	if (!(q->flags & TCQ_F_CAN_BYPASS) &&
> +	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
> +		kfree_skb(skb);
> +		rc = NET_XMIT_DROP;
> +		goto out;
> +	}

Are you aware that copying stuff takes time ?

If yes, why is it done after taking the busylock spinlock ?

>  
>  	spin_lock(root_lock);
>  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> @@ -2739,6 +2745,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  		}
>  	}
>  	spin_unlock(root_lock);
> +out:
>  	if (unlikely(contended))
>  		spin_unlock(&q->busylock);
>  	return rc;



Jason Wang Jan. 18, 2014, 5:35 a.m. UTC | #2
On 01/17/2014 10:28 PM, Eric Dumazet wrote:
> On Fri, 2014-01-17 at 17:42 +0800, Jason Wang wrote:
>> Many qdiscs can queue a packet for a long time, which leads to an issue
>> with zerocopy skbs: the frags will not be orphaned within the expected
>> short time, and this breaks the assumption that virtio-net will transmit
>> the packet promptly.
>>
>> So if guest packets are queued through such a qdisc and hit the limit on
>> the maximum number of pending packets for virtio/vhost, all packets from
>> the guest, even those going to another destination, will be blocked.
>>
>> A case for reproducing the issue:
>>
>> - Boot two VMs and connect them to the same bridge kvmbr.
>> - Set up tbf with a very low rate/burst on eth0, which is a port of kvmbr.
>> - Let VM1 send lots of packets through eth0.
>> - After a while, VM1 is unable to send any packets out since the number of
>>    pending packets (queued to tbf) exceeds the limit of vhost/virtio.
> So what's the problem? If the limit is low, you cannot send packets.

It was just an extreme case. The problem is that if zerocopy packets from
VM1 are throttled by the qdisc on eth0, then probably all packets from VM1
are throttled, even those that do not go through eth0.
> Solution : increase the limit, or tell the vm to lower its rate.
>
> Oh wait, are you bitten because you did some prior skb_orphan() to allow
> the vm to send unlimited number of skbs ???
>

The problem is that sndbuf was defaulted to INT_MAX to prevent a similar
issue for non-zerocopy packets. For zerocopy, vhost can notify virtio-net
of tx completion only after the frags have been orphaned, so an INT_MAX
sndbuf is not enough.
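
For reference, skb_orphan_frags() is roughly the following (a from-memory
sketch of the include/linux/skbuff.h helper, not a verbatim copy): it is a
no-op for ordinary skbs, while for zerocopy skbs it copies the userspace
frags, which releases the ubuf and lets vhost report tx completion to the
guest.

	static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
	{
		/* nothing to do unless the skb still refers to guest pages */
		if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
			return 0;
		/* copy the frags; this fires the ubuf callback used by vhost */
		return skb_copy_ubufs(skb, gfp_mask);
	}
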
>> Solve this issue by orphaning the frags before queuing it to a slow qdisc (the
>> one without TCQ_F_CAN_BYPASS).
> Why does orphaning the frags alone solve the problem? A skb without zerocopy
> frags should also be blocked for a while.

It's ok for non-zerocopy packets to be blocked, since VM1 thinks those
packets have been sent rather than pending in the virtqueue. So VM1 can
still send packets to other destinations.
> Seriously, let's admit this zero copy stuff is utterly broken.
>
>
> TCQ_F_CAN_BYPASS is not enough. Some NICs have separate queues with
> strict priorities.
>

Yes, but that looks less serious than traffic shaping.
> It seems to me that you are pushing to use FIFO (the only qdisc setting
> TCQ_F_CAN_BYPASS), by adding yet another test in fast path (I do not
> know how we can still call it a fast path), while we already have smart
> qdisc to avoid the inherent HOL and unfairness problems of FIFO.
>

It was just a workaround, like the sndbuf case, until we have a better
solution. So it looks like using sfq or fq in the guest can mitigate the issue?
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   net/core/dev.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 0ce469e..1209774 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2700,6 +2700,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>   	contended = qdisc_is_running(q);
>>   	if (unlikely(contended))
>>   		spin_lock(&q->busylock);
>> +	if (!(q->flags & TCQ_F_CAN_BYPASS) &&
>> +	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
>> +		kfree_skb(skb);
>> +		rc = NET_XMIT_DROP;
>> +		goto out;
>> +	}
> Are you aware that copying stuff takes time ?
>
> If yes, why is it done after taking the busylock spinlock ?
>

Yes, and it should be done outside the spinlock.
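
Something like the following (an untested sketch, just to show the intended
ordering: the orphan/copy happens before any qdisc lock is taken, and the
out: label in the posted patch then becomes unnecessary):

	/* orphan (and possibly copy) zerocopy frags before contending on
	 * busylock, so the copy cost is not paid while holding the lock
	 */
	if (!(q->flags & TCQ_F_CAN_BYPASS) &&
	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
		kfree_skb(skb);
		return NET_XMIT_DROP;
	}

	contended = qdisc_is_running(q);
	if (unlikely(contended))
		spin_lock(&q->busylock);
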
>>
>>   	spin_lock(root_lock);
>>   	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,&q->state))) {
>> @@ -2739,6 +2745,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>   		}
>>   	}
>>   	spin_unlock(root_lock);
>> +out:
>>   	if (unlikely(contended))
>>   		spin_unlock(&q->busylock);
>>   	return rc;
>
>

Michael S. Tsirkin Jan. 19, 2014, 9:21 a.m. UTC | #3
On Fri, Jan 17, 2014 at 05:42:14PM +0800, Jason Wang wrote:
> Many qdiscs can queue a packet for a long time, which leads to an issue
> with zerocopy skbs: the frags will not be orphaned within the expected
> short time, and this breaks the assumption that virtio-net will transmit
> the packet promptly.
> 
> So if guest packets are queued through such a qdisc and hit the limit on
> the maximum number of pending packets for virtio/vhost, all packets from
> the guest, even those going to another destination, will be blocked.
> 
> A case for reproducing the issue:
> 
> - Boot two VMs and connect them to the same bridge kvmbr.
> - Set up tbf with a very low rate/burst on eth0, which is a port of kvmbr.
> - Let VM1 send lots of packets through eth0.
> - After a while, VM1 is unable to send any packets out since the number of
>   pending packets (queued to tbf) exceeds the limit of vhost/virtio.
> 
> Solve this issue by orphaning the frags before queuing it to a slow qdisc (the
> one without TCQ_F_CAN_BYPASS).

This seems too aggressive to me.
The issue is that a packet can stay queued for an indefinitely long time.
So I think we should only do this for tbf.
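
E.g. something along these lines (a hypothetical, untested sketch; it is the
same hunk as in the patch but keyed off the qdisc ops name, which is "tbf"
for the token bucket filter; matching by name would of course miss a tbf
used as an inner/child qdisc):

	/* only orphan for tbf, which can hold on to packets for a very
	 * long time when configured with a low rate/burst
	 */
	if (!strcmp(q->ops->id, "tbf") &&
	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
		kfree_skb(skb);
		rc = NET_XMIT_DROP;
		goto out;
	}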

> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  net/core/dev.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ce469e..1209774 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2700,6 +2700,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	contended = qdisc_is_running(q);
>  	if (unlikely(contended))
>  		spin_lock(&q->busylock);
> +	if (!(q->flags & TCQ_F_CAN_BYPASS) &&
> +	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
> +		kfree_skb(skb);
> +		rc = NET_XMIT_DROP;
> +		goto out;
> +	}
>  
>  	spin_lock(root_lock);
>  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> @@ -2739,6 +2745,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  		}
>  	}
>  	spin_unlock(root_lock);
> +out:
>  	if (unlikely(contended))
>  		spin_unlock(&q->busylock);
>  	return rc;
> -- 
> 1.8.3.2
Michael S. Tsirkin Jan. 19, 2014, 9:56 a.m. UTC | #4
On Fri, Jan 17, 2014 at 06:28:17AM -0800, Eric Dumazet wrote:
> On Fri, 2014-01-17 at 17:42 +0800, Jason Wang wrote:
> > Many qdiscs can queue a packet for a long time, which leads to an issue
> > with zerocopy skbs: the frags will not be orphaned within the expected
> > short time, and this breaks the assumption that virtio-net will transmit
> > the packet promptly.
> > 
> > So if guest packets are queued through such a qdisc and hit the limit on
> > the maximum number of pending packets for virtio/vhost, all packets from
> > the guest, even those going to another destination, will be blocked.
> > 
> > A case for reproducing the issue:
> > 
> > - Boot two VMs and connect them to the same bridge kvmbr.
> > - Set up tbf with a very low rate/burst on eth0, which is a port of kvmbr.
> > - Let VM1 send lots of packets through eth0.
> > - After a while, VM1 is unable to send any packets out since the number of
> >   pending packets (queued to tbf) exceeds the limit of vhost/virtio.
> 
> So what's the problem? If the limit is low, you cannot send packets.

I think this isn't the main problem.
Say VM1 is using tap1 and VM2 uses tap2, all connected to the same bridge.
Set up tbf on tap2 and have VM1 send packets to VM2.
VM1 will get blocked and won't be able to talk to the host or to eth0 for
a long while, even though the path from VM1 to the host is idle.

The problem should appear both with zerocopy and with TUNSETSNDBUF used
to set sndbuf to a finite value.


> Solution : increase the limit, or tell the vm to lower its rate.

The issue is that different links can have different policies.

A VM that is unable to talk to the host, even though the path from this VM
to the host is completely idle, just because the RX of some other VM it
talked to in the past is throttled, violates the principle of least
surprise.

>
> Oh wait, are you bitten because you did some prior skb_orphan() to allow
> the vm to send unlimited number of skbs ???

I think it's because the limit on the number of skbs
per VM is INT_MAX by default, in any case - for
compatibility reasons.

Further, e.g. QEMU also had to stop using
TUNSETSNDBUF because of these issues.
It would be nice to fix this, yes.
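
(For reference, the knob in question is the TUNSETSNDBUF ioctl on the tap
fd; a minimal userspace sketch, with error handling and the TUNSETIFF
attach step mostly omitted:)

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/if_tun.h>

	int main(void)
	{
		int fd = open("/dev/net/tun", O_RDWR);
		int sndbuf = 1 << 20;	/* finite limit instead of INT_MAX */

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* ... TUNSETIFF would normally attach a tap device here ... */
		if (ioctl(fd, TUNSETSNDBUF, &sndbuf) < 0)
			perror("TUNSETSNDBUF");
		return 0;
	}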

> > 
> > Solve this issue by orphaning the frags before queuing it to a slow qdisc (the
> > one without TCQ_F_CAN_BYPASS).
> 
> Why does orphaning the frags alone solve the problem? A skb without zerocopy
> frags should also be blocked for a while.

I agree that if we want to fix the behaviour for a limited sndbuf,
we have to orphan everything queued on tbf,
possibly only if its limit is set very low.
Do you think that's reasonable?

>
> Seriously, let's admit this zero copy stuff is utterly broken.
> 

zero copy is simply newer; we didn't have compatibility to
worry about, so we set a finite limit there by default.
As you point out, the problem isn't specific to zero copy;
it happens whenever we try to limit the # of skbs.


So do you think the idea of limiting skbs is utterly
broken, and we should just let it rip as fast as it can?

> TCQ_F_CAN_BYPASS is not enough. Some NICs have separate queues with
> strict priorities.
> 
> It seems to me that you are pushing to use FIFO (the only qdisc setting
> TCQ_F_CAN_BYPASS), by adding yet another test in fast path (I do not
> know how we can still call it a fast path), while we already have smart
> qdisc to avoid the inherent HOL and unfairness problems of FIFO.

TCQ_F_CAN_BYPASS doesn't look like a good idea to me either.

I think the biggest problem is with tbf because it's not
work-conserving, so a link can get
throttled even in a system which is mostly idle.


> > 
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  net/core/dev.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0ce469e..1209774 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -2700,6 +2700,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >  	contended = qdisc_is_running(q);
> >  	if (unlikely(contended))
> >  		spin_lock(&q->busylock);
> > +	if (!(q->flags & TCQ_F_CAN_BYPASS) &&
> > +	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
> > +		kfree_skb(skb);
> > +		rc = NET_XMIT_DROP;
> > +		goto out;
> > +	}
> 
> Are you aware that copying stuff takes time ?
> 
> If yes, why is it done after taking the busylock spinlock ?
> 
> >  
> >  	spin_lock(root_lock);
> >  	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> > @@ -2739,6 +2745,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >  		}
> >  	}
> >  	spin_unlock(root_lock);
> > +out:
> >  	if (unlikely(contended))
> >  		spin_unlock(&q->busylock);
> >  	return rc;
> 
> 

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 0ce469e..1209774 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2700,6 +2700,12 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	contended = qdisc_is_running(q);
 	if (unlikely(contended))
 		spin_lock(&q->busylock);
+	if (!(q->flags & TCQ_F_CAN_BYPASS) &&
+	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
+		kfree_skb(skb);
+		rc = NET_XMIT_DROP;
+		goto out;
+	}
 
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
@@ -2739,6 +2745,7 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		}
 	}
 	spin_unlock(root_lock);
+out:
 	if (unlikely(contended))
 		spin_unlock(&q->busylock);
 	return rc;