
Flow Control and Port Mirroring Revisited

Message ID: 20110106124439.GA17004@verge.net.au
State: RFC, archived
Delegated to: David Miller

Commit Message

Simon Horman Jan. 6, 2011, 12:44 p.m. UTC
On Thu, Jan 06, 2011 at 11:22:42AM +0100, Eric Dumazet wrote:
> On Thursday, 6 January 2011 at 18:33 +0900, Simon Horman wrote:
> > Hi,
> > 
> > Back in October I reported that I noticed a problem whereby flow control
> > breaks down when openvswitch is configured to mirror a port[1].
> > 
> > I have (finally) looked into this further and the problem appears to relate
> > to cloning of skbs, as Jesse Gross originally suspected.
> > 
> > More specifically, in do_execute_actions[2] the first n-1 times that an skb
> > needs to be transmitted it is cloned first and the final time the original
> > skb is used.
> > 
> > In the case that there is only one action, which is the normal case, then
> > the original skb will be used. But in the case of mirroring the cloning
> > comes into effect. And in my case the cloned skb seems to go to the (slow)
> > eth1 interface while the original skb goes to the (fast) dummy0 interface
> > that I set up to be a mirror. The result is that dummy0 "paces" the flow,
> > and it's a cracking pace at that.
> > 
> > As an experiment I hacked do_execute_actions() to use the original skb
> > for the first action instead of the last one.  In my case the result was
> > that eth1 "paces" the flow, and things work reasonably nicely.
> > 
> > Well, sort of. Things work well for non-GSO skbs but extremely poorly for
> > GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
> > netserv. I'm unsure why, but I digress.
> > 
> > It seems to me that my hack illustrates the point that the flow ends up
> > being "paced" by one interface. However I think that what would be
> > desirable is that the flow is "paced" by the slowest link. Unfortunately
> > I'm unsure how to achieve that.
> > 
> 
> Hi Simon !
> 
> "pacing" is done because skb is attached to a socket, and a socket has a
> limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
> of all truesize skbs in flight.
> 
> When you enter something that :
> 
> 1) Get a clone of the skb, queue the clone to device X
> 2) queue the original skb to device Y
> 
> Then :	Socket sndbuf is not affected at all by device X queue.
> 	This is speed on device Y that matters.
> 
> You want to get servo control on both X and Y
> 
> You could try to
> 
> 1) Get a clone of skb
>    Attach it to socket too (so that socket get a feedback of final
> orphaning for the clone) with skb_set_owner_w()
>    queue the clone to device X
> 
> Unfortunatly, stacked skb->destructor() makes this possible only for
> known destructor (aka sock_wfree())

Hi Eric !

Thanks for the advice. I had thought about the socket buffer but at some
point it slipped my mind.

In any case the following patch seems to implement the change that I had in
mind. However, my discussions with Michael Tsirkin elsewhere in this thread
are beginning to make me think that perhaps this change isn't the
best solution.

I got a rather nasty panic without the "if (skb->sk)" check;
I guess some skbs don't have a socket.
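
For anyone following along, here is a rough user-space model of the feedback
loop Eric describes above. The structs are simplified stand-ins for the
kernel's struct sock and struct sk_buff, and the two functions only mimic
what skb_set_owner_w() and sock_wfree() do in net/core/sock.c; treat it as a
sketch of the accounting, not the real implementation.

/*
 * Rough model: skb_set_owner_w() charges an skb's truesize to the owning
 * socket and installs sock_wfree() as its destructor, so send-buffer space
 * is only returned once every device that queued a copy has freed it.
 */
#include <stdio.h>

struct sock {
        int sk_wmem_alloc;              /* truesize bytes currently in flight */
        int sk_sndbuf;                  /* configurable send-buffer limit */
};

struct sk_buff {
        struct sock *sk;
        int truesize;
        void (*destructor)(struct sk_buff *skb);
};

/* Roughly what sock_wfree() does: return the charge to the socket. */
static void sock_wfree(struct sk_buff *skb)
{
        skb->sk->sk_wmem_alloc -= skb->truesize;
        /* the kernel also wakes blocked writers via sk->sk_write_space() */
}

/* Roughly what skb_set_owner_w() does: charge the socket for the skb. */
static void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
        skb->sk = sk;
        skb->destructor = sock_wfree;
        sk->sk_wmem_alloc += skb->truesize;
}

int main(void)
{
        struct sock sk = { .sk_wmem_alloc = 0, .sk_sndbuf = 16384 };
        struct sk_buff skb = { .truesize = 2048 };
        struct sk_buff clone = skb;     /* stand-in for skb_clone(): shared
                                           data, separate metadata */

        skb_set_owner_w(&skb, &sk);     /* done when the packet was built */
        skb_set_owner_w(&clone, &sk);   /* the extra step for the mirror copy */
        printf("after queueing both copies: %d/%d bytes in flight\n",
               sk.sk_wmem_alloc, sk.sk_sndbuf);

        skb.destructor(&skb);           /* fast device frees its copy early */
        printf("fast copy freed:            %d/%d bytes in flight\n",
               sk.sk_wmem_alloc, sk.sk_sndbuf);

        clone.destructor(&clone);       /* budget only fully recovers when the
                                           slow device frees the clone */
        printf("slow copy freed:            %d/%d bytes in flight\n",
               sk.sk_wmem_alloc, sk.sk_sndbuf);
        return 0;
}

With only the original skb charged (the current datapath behaviour), the fast
device's completion hands back the whole budget and the slow path never pushes
back on the sender; charging the clone as well is what provides the "servo
control on both X and Y".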

Comments

Eric Dumazet Jan. 6, 2011, 1:28 p.m. UTC | #1
On Thursday, 6 January 2011 at 21:44 +0900, Simon Horman wrote:

> Hi Eric !
> 
> Thanks for the advice. I had thought about the socket buffer but at some
> point it slipped my mind.
> 
> In any case the following patch seems to implement the change that I had in
> mind. However, my discussions with Michael Tsirkin elsewhere in this thread
> are beginning to make me think that perhaps this change isn't the
> best solution.
> 
> diff --git a/datapath/actions.c b/datapath/actions.c
> index 5e16143..505f13f 100644
> --- a/datapath/actions.c
> +++ b/datapath/actions.c
> @@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>  
>  	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
>  		if (prev_port != -1) {
> -			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
> +			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> +			if (nskb) {
> +				if (skb->sk)
> +					skb_set_owner_w(nskb, skb->sk);
> +				do_output(dp, nskb, prev_port);
> +			}
>  			prev_port = -1;
>  		}
> 
> I got a rather nasty panic without the "if (skb->sk)" check;
> I guess some skbs don't have a socket.

Indeed, some packets are not linked to a socket.

(ARP packets for example)

Sorry, I should have mentioned it :)


Simon Horman Jan. 6, 2011, 10:01 p.m. UTC | #2
On Thu, Jan 06, 2011 at 02:28:18PM +0100, Eric Dumazet wrote:
> On Thursday, 6 January 2011 at 21:44 +0900, Simon Horman wrote:
> 
> > Hi Eric !
> > 
> > Thanks for the advice. I had thought about the socket buffer but at some
> > point it slipped my mind.
> > 
> > In any case the following patch seems to implement the change that I had in
> > mind. However, my discussions with Michael Tsirkin elsewhere in this thread
> > are beginning to make me think that perhaps this change isn't the
> > best solution.
> > 
> > diff --git a/datapath/actions.c b/datapath/actions.c
> > index 5e16143..505f13f 100644
> > --- a/datapath/actions.c
> > +++ b/datapath/actions.c
> > @@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> >  
> >  	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
> >  		if (prev_port != -1) {
> > -			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
> > +			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
> > +			if (nskb) {
> > +				if (skb->sk)
> > +					skb_set_owner_w(nskb, skb->sk);
> > +				do_output(dp, nskb, prev_port);
> > +			}
> >  			prev_port = -1;
> >  		}
> > 
> > I got a rather nasty panic without the "if (skb->sk)" check;
> > I guess some skbs don't have a socket.
> 
> Indeed, some packets are not linked to a socket.
> 
> (ARP packets for example)
> 
> Sorry, I should have mentioned it :)

Not at all, the occasional panic during hacking is good for the soul.
Jesse Gross Jan. 6, 2011, 10:38 p.m. UTC | #3
On Thu, Jan 6, 2011 at 7:44 AM, Simon Horman <horms@verge.net.au> wrote:
> On Thu, Jan 06, 2011 at 11:22:42AM +0100, Eric Dumazet wrote:
>> On Thursday, 6 January 2011 at 18:33 +0900, Simon Horman wrote:
>> > Hi,
>> >
>> > Back in October I reported that I noticed a problem whereby flow control
>> > breaks down when openvswitch is configured to mirror a port[1].
>> >
>> > I have (finally) looked into this further and the problem appears to relate
>> > to cloning of skbs, as Jesse Gross originally suspected.
>> >
>> > More specifically, in do_execute_actions[2] the first n-1 times that an skb
>> > needs to be transmitted it is cloned first and the final time the original
>> > skb is used.
>> >
>> > In the case that there is only one action, which is the normal case, then
>> > the original skb will be used. But in the case of mirroring the cloning
>> > comes into effect. And in my case the cloned skb seems to go to the (slow)
>> > eth1 interface while the original skb goes to the (fast) dummy0 interface
>> > that I set up to be a mirror. The result is that dummy0 "paces" the flow,
>> > and it's a cracking pace at that.
>> >
>> > As an experiment I hacked do_execute_actions() to use the original skb
>> > for the first action instead of the last one.  In my case the result was
>> > that eth1 "paces" the flow, and things work reasonably nicely.
>> >
>> > Well, sort of. Things work well for non-GSO skbs but extremely poorly for
>> > GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
>> > netserv. I'm unsure why, but I digress.
>> >
>> > It seems to me that my hack illustrates the point that the flow ends up
>> > being "paced" by one interface. However I think that what would be
>> > desirable is that the flow is "paced" by the slowest link. Unfortunately
>> > I'm unsure how to achieve that.
>> >
>>
>> Hi Simon !
>>
>> "pacing" is done because skb is attached to a socket, and a socket has a
>> limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
>> of all truesize skbs in flight.
>>
>> When you enter something that :
>>
>> 1) Get a clone of the skb, queue the clone to device X
>> 2) queue the original skb to device Y
>>
>> Then :        Socket sndbuf is not affected at all by device X queue.
>>       This is speed on device Y that matters.
>>
>> You want to get servo control on both X and Y
>>
>> You could try to
>>
>> 1) Get a clone of skb
>>    Attach it to socket too (so that socket get a feedback of final
>> orphaning for the clone) with skb_set_owner_w()
>>    queue the clone to device X
>>
>> Unfortunatly, stacked skb->destructor() makes this possible only for
>> known destructor (aka sock_wfree())
>
> Hi Eric !
>
> Thanks for the advice. I had thought about the socket buffer but at some
> point it slipped my mind.
>
> In any case the following patch seems to implement the change that I had in
> mind. However, my discussions with Michael Tsirkin elsewhere in this thread
> are beginning to make me think that perhaps this change isn't the
> best solution.

I know that everyone likes a nice netperf result but I agree with
Michael that this probably isn't the right question to be asking.  I
don't think that socket buffers are a real solution to the flow
control problem: they happen to provide that functionality but it's
more of a side effect than anything.  It's just that the amount of
memory consumed by packets in the queue(s) doesn't really have any
implicit meaning for flow control (think multiple physical adapters,
all with the same speed instead of a virtual device and a physical
device with wildly different speeds).  The analog in the physical
world that you're looking for would be Ethernet flow control.
Obviously, if the question is limiting CPU or memory consumption then
that's a different story.

This patch also double counts memory, since the full size of the
packet will be accounted for by each clone, even though they share the
actual packet data.  Probably not too significant here but it might be
when flooding/mirroring to many interfaces.  This is at least fixable
(the Xen-style accounting through page tracking deals with it, though
it has its own problems).
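
To put illustrative numbers on the double counting (the values below are made
up, nothing was measured): with the proposed change every clone is charged at
its full truesize against the socket's sndbuf even though the packet data is
shared, so flooding to N ports cuts the number of packets the sender can keep
in flight by roughly a factor of N.

/* Illustrative arithmetic only; truesize and sndbuf values are invented. */
#include <stdio.h>

int main(void)
{
        const int truesize = 2048;      /* charged per skb, clones included */
        const int sndbuf = 16384;       /* socket send-buffer limit */

        for (int nports = 1; nports <= 8; nports *= 2)
                printf("%d port(s): %5d bytes charged per packet, "
                       "~%d packets allowed in flight\n",
                       nports, nports * truesize,
                       sndbuf / (nports * truesize));
        return 0;
}

The Xen-style page-tracking accounting mentioned above would avoid this by
charging the shared data only once, at the cost of its own complexity.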
Simon Horman Jan. 7, 2011, 1:23 a.m. UTC | #4
On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:

[ snip ]
> 
> I know that everyone likes a nice netperf result but I agree with
> Michael that this probably isn't the right question to be asking.  I
> don't think that socket buffers are a real solution to the flow
> control problem: they happen to provide that functionality but it's
> more of a side effect than anything.  It's just that the amount of
> memory consumed by packets in the queue(s) doesn't really have any
> implicit meaning for flow control (think multiple physical adapters,
> all with the same speed instead of a virtual device and a physical
> device with wildly different speeds).  The analog in the physical
> world that you're looking for would be Ethernet flow control.
> Obviously, if the question is limiting CPU or memory consumption then
> that's a different story.

Point taken. I will see if I can control CPU (and thus memory) consumption
using cgroups and/or tc.

> This patch also double counts memory, since the full size of the
> packet will be accounted for by each clone, even though they share the
> actual packet data.  Probably not too significant here but it might be
> when flooding/mirroring to many interfaces.  This is at least fixable
> (the Xen-style accounting through page tracking deals with it, though
> it has its own problems).

Agreed on all counts.


Simon Horman Jan. 10, 2011, 9:31 a.m. UTC | #5
On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> 
> [ snip ]
> > 
> > I know that everyone likes a nice netperf result but I agree with
> > Michael that this probably isn't the right question to be asking.  I
> > don't think that socket buffers are a real solution to the flow
> > control problem: they happen to provide that functionality but it's
> > more of a side effect than anything.  It's just that the amount of
> > memory consumed by packets in the queue(s) doesn't really have any
> > implicit meaning for flow control (think multiple physical adapters,
> > all with the same speed instead of a virtual device and a physical
> > device with wildly different speeds).  The analog in the physical
> > world that you're looking for would be Ethernet flow control.
> > Obviously, if the question is limiting CPU or memory consumption then
> > that's a different story.
> 
> Point taken. I will see if I can control CPU (and thus memory) consumption
> using cgroups and/or tc.

I have found that I can successfully control the throughput using
the following techniques

1) Place a tc egress filter on dummy0

2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
   this is effectively the same as one of my hacks to the datapath
   that I mentioned in an earlier mail. The result is that eth1
   "paces" the connection.

3) 2) + place a tc egress filter on eth1

This mostly makes sense to me, although I am a little confused about
why 1) needs a filter on dummy0 (a filter on eth1 has no effect)
while 3) needs a filter on eth1 (a filter on dummy0 has no effect,
even if the skb is sent to dummy0 last).

I also had some limited success using CPU cgroups, though obviously
that targets CPU usage and thus the effect on throughput is fairly coarse.
In short, it's a useful technique but not one that bears further
discussion here.

Simon Horman Jan. 13, 2011, 6:47 a.m. UTC | #6
On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > 
> > [ snip ]
> > > 
> > > I know that everyone likes a nice netperf result but I agree with
> > > Michael that this probably isn't the right question to be asking.  I
> > > don't think that socket buffers are a real solution to the flow
> > > control problem: they happen to provide that functionality but it's
> > > more of a side effect than anything.  It's just that the amount of
> > > memory consumed by packets in the queue(s) doesn't really have any
> > > implicit meaning for flow control (think multiple physical adapters,
> > > all with the same speed instead of a virtual device and a physical
> > > device with wildly different speeds).  The analog in the physical
> > > world that you're looking for would be Ethernet flow control.
> > > Obviously, if the question is limiting CPU or memory consumption then
> > > that's a different story.
> > 
> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > using cgroups and/or tc.
> 
> I have found that I can successfully control the throughput using
> the following techniques
> 
> 1) Place a tc egress filter on dummy0
> 
> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
>    this is effectively the same as one of my hacks to the datapath
>    that I mentioned in an earlier mail. The result is that eth1
>    "paces" the connection.

Further to this, I wonder if there is any interest in providing
a method to switch the action order - using ovs-ofctl is a hack imho -
and/or switching the default action order for mirroring.

> 3) 2) + place a tc egress filter on eth1
> 
> This mostly makes sense to me, although I am a little confused about
> why 1) needs a filter on dummy0 (a filter on eth1 has no effect)
> while 3) needs a filter on eth1 (a filter on dummy0 has no effect,
> even if the skb is sent to dummy0 last).
> 
> I also had some limited success using CPU cgroups, though obviously
> that targets CPU usage and thus the effect on throughput is fairly coarse.
> In short, it's a useful technique but not one that bears further
> discussion here.
Jesse Gross Jan. 13, 2011, 3:45 p.m. UTC | #7
On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
>> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
>> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
>> >
>> > [ snip ]
>> > >
>> > > I know that everyone likes a nice netperf result but I agree with
>> > > Michael that this probably isn't the right question to be asking.  I
>> > > don't think that socket buffers are a real solution to the flow
>> > > control problem: they happen to provide that functionality but it's
>> > > more of a side effect than anything.  It's just that the amount of
>> > > memory consumed by packets in the queue(s) doesn't really have any
>> > > implicit meaning for flow control (think multiple physical adapters,
>> > > all with the same speed instead of a virtual device and a physical
>> > > device with wildly different speeds).  The analog in the physical
>> > > world that you're looking for would be Ethernet flow control.
>> > > Obviously, if the question is limiting CPU or memory consumption then
>> > > that's a different story.
>> >
>> > Point taken. I will see if I can control CPU (and thus memory) consumption
>> > using cgroups and/or tc.
>>
>> I have found that I can successfully control the throughput using
>> the following techniques
>>
>> 1) Place a tc egress filter on dummy0
>>
>> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
>>    this is effectively the same as one of my hacks to the datapath
>>    that I mentioned in an earlier mail. The result is that eth1
>>    "paces" the connection.
>
> Further to this, I wonder if there is any interest in providing
> a method to switch the action order - using ovs-ofctl is a hack imho -
> and/or switching the default action order for mirroring.

I'm not sure that there is a way to do this that is correct in the
generic case.  It's possible that the destination could be a VM while
packets are being mirrored to a physical device or we could be
multicasting or some other arbitrarily complex scenario.  Just think
of what a physical switch would do if it has ports with two different
speeds.
Simon Horman Jan. 13, 2011, 11:41 p.m. UTC | #8
On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> >> >
> >> > [ snip ]
> >> > >
> >> > > I know that everyone likes a nice netperf result but I agree with
> >> > > Michael that this probably isn't the right question to be asking.  I
> >> > > don't think that socket buffers are a real solution to the flow
> >> > > control problem: they happen to provide that functionality but it's
> >> > > more of a side effect than anything.  It's just that the amount of
> >> > > memory consumed by packets in the queue(s) doesn't really have any
> >> > > implicit meaning for flow control (think multiple physical adapters,
> >> > > all with the same speed instead of a virtual device and a physical
> >> > > device with wildly different speeds).  The analog in the physical
> >> > > world that you're looking for would be Ethernet flow control.
> >> > > Obviously, if the question is limiting CPU or memory consumption then
> >> > > that's a different story.
> >> >
> >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> >> > using cgroups and/or tc.
> >>
> >> I have found that I can successfully control the throughput using
> >> the following techniques
> >>
> >> 1) Place a tc egress filter on dummy0
> >>
> >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> >>    this is effectively the same as one of my hacks to the datapath
> >>    that I mentioned in an earlier mail. The result is that eth1
> >>    "paces" the connection.
> >
> > Further to this, I wonder if there is any interest in providing
> > a method to switch the action order - using ovs-ofctl is a hack imho -
> > and/or switching the default action order for mirroring.
> 
> I'm not sure that there is a way to do this that is correct in the
> generic case.  It's possible that the destination could be a VM while
> packets are being mirrored to a physical device or we could be
> multicasting or some other arbitrarily complex scenario.  Just think
> of what a physical switch would do if it has ports with two different
> speeds.

Yes, I have considered that case. And I agree that perhaps there
is no sensible default. But perhaps we could make it configurable somehow?
Michael S. Tsirkin Jan. 14, 2011, 4:58 a.m. UTC | #9
On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > >> >
> > >> > [ snip ]
> > >> > >
> > >> > > I know that everyone likes a nice netperf result but I agree with
> > >> > > Michael that this probably isn't the right question to be asking.  I
> > >> > > don't think that socket buffers are a real solution to the flow
> > >> > > control problem: they happen to provide that functionality but it's
> > >> > > more of a side effect than anything.  It's just that the amount of
> > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > >> > > implicit meaning for flow control (think multiple physical adapters,
> > >> > > all with the same speed instead of a virtual device and a physical
> > >> > > device with wildly different speeds).  The analog in the physical
> > >> > > world that you're looking for would be Ethernet flow control.
> > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > >> > > that's a different story.
> > >> >
> > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > >> > using cgroups and/or tc.
> > >>
> > >> I have found that I can successfully control the throughput using
> > >> the following techniques
> > >>
> > >> 1) Place a tc egress filter on dummy0
> > >>
> > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > >>    this is effectively the same as one of my hacks to the datapath
> > >>    that I mentioned in an earlier mail. The result is that eth1
> > >>    "paces" the connection.

This is actually a bug: it means that one slow connection will
affect fast ones. I intend to change the default for qemu to sndbuf=0;
this will fix it but break your "pacing", so please do not count on this
behaviour.

> > > Further to this, I wonder if there is any interest in providing
> > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > and/or switching the default action order for mirroring.
> > 
> > I'm not sure that there is a way to do this that is correct in the
> > generic case.  It's possible that the destination could be a VM while
> > packets are being mirrored to a physical device or we could be
> > multicasting or some other arbitrarily complex scenario.  Just think
> > of what a physical switch would do if it has ports with two different
> > speeds.
> 
> Yes, I have considered that case. And I agree that perhaps there
> is no sensible default. But perhaps we could make it configurable somehow?

The fix is at the application level. Run netperf with -b and -w flags to
limit the speed to a sensible value.
Simon Horman Jan. 14, 2011, 6:35 a.m. UTC | #10
On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > >> >
> > > >> > [ snip ]
> > > >> > >
> > > >> > > I know that everyone likes a nice netperf result but I agree with
> > > >> > > Michael that this probably isn't the right question to be asking.  I
> > > >> > > don't think that socket buffers are a real solution to the flow
> > > >> > > control problem: they happen to provide that functionality but it's
> > > >> > > more of a side effect than anything.  It's just that the amount of
> > > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > > >> > > implicit meaning for flow control (think multiple physical adapters,
> > > >> > > all with the same speed instead of a virtual device and a physical
> > > >> > > device with wildly different speeds).  The analog in the physical
> > > >> > > world that you're looking for would be Ethernet flow control.
> > > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > > >> > > that's a different story.
> > > >> >
> > > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > > >> > using cgroups and/or tc.
> > > >>
> > > >> I have found that I can successfully control the throughput using
> > > >> the following techniques
> > > >>
> > > >> 1) Place a tc egress filter on dummy0
> > > >>
> > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > > >>    this is effectively the same as one of my hacks to the datapath
> > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > >>    "paces" the connection.
> 
> This is actually a bug: it means that one slow connection will affect
> fast ones. I intend to change the default for qemu to sndbuf=0; this
> will fix it but break your "pacing", so please do not count on this
> behaviour.

Do you have a patch I could test?

> > > > Further to this, I wonder if there is any interest in providing
> > > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > > and/or switching the default action order for mirroring.
> > > 
> > > I'm not sure that there is a way to do this that is correct in the
> > > generic case.  It's possible that the destination could be a VM while
> > > packets are being mirrored to a physical device or we could be
> > > multicasting or some other arbitrarily complex scenario.  Just think
> > > of what a physical switch would do if it has ports with two different
> > > speeds.
> > 
> > Yes, I have considered that case. And I agree that perhaps there
> > is no sensible default. But perhaps we could make it configurable somehow?
> 
> The fix is at the application level. Run netperf with -b and -w flags to
> limit the speed to a sensible value.

Perhaps I should have stated my goals more clearly.
I'm interested in situations where I don't control the application.


Patch

diff --git a/datapath/actions.c b/datapath/actions.c
index 5e16143..505f13f 100644
--- a/datapath/actions.c
+++ b/datapath/actions.c
@@ -384,7 +384,12 @@  static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 
 	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
 		if (prev_port != -1) {
-			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
+			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+			if (nskb) {
+				if (skb->sk)
+					skb_set_owner_w(nskb, skb->sk);
+				do_output(dp, nskb, prev_port);
+			}
 			prev_port = -1;
 		}