
Flow Control and Port Mirroring Revisited

Message ID 20110114065415.GA30300@redhat.com
State RFC, archived
Delegated to: David Miller

Commit Message

Michael S. Tsirkin Jan. 14, 2011, 6:54 a.m. UTC
On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
> On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> > On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > > >> >
> > > > >> > [ snip ]
> > > > >> > >
> > > > >> > > I know that everyone likes a nice netperf result but I agree with
> > > > >> > > Michael that this probably isn't the right question to be asking.  I
> > > > >> > > don't think that socket buffers are a real solution to the flow
> > > > >> > > control problem: they happen to provide that functionality but it's
> > > > >> > > more of a side effect than anything.  It's just that the amount of
> > > > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > > > >> > > implicit meaning for flow control (think multiple physical adapters,
> > > > >> > > all with the same speed instead of a virtual device and a physical
> > > > >> > > device with wildly different speeds).  The analog in the physical
> > > > >> > > world that you're looking for would be Ethernet flow control.
> > > > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > > > >> > > that's a different story.
> > > > >> >
> > > > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > > > >> > using cgroups and/or tc.
> > > > >>
> > > > >> I have found that I can successfully control the throughput using
> > > > >> the following techniques
> > > > >>
> > > > >> 1) Place a tc egress filter on dummy0
> > > > >>
> > > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > > > >>    this is effectively the same as one of my hacks to the datapath
> > > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > > >>    "paces" the connection.
> > 
> > This is actually a bug. This means that one slow connection will affect
> > fast ones. I intend to change the default for qemu to sndbuf=0 : this
> > will fix it but break your "pacing". So pls do not count on this
> > behaviour.
> 
> Do you have a patch I could test?

You can (and users already can) just run qemu with sndbuf=0. But if you
like, below.

> > > > > Further to this, I wonder if there is any interest in providing
> > > > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > > > and/or switching the default action order for mirroring.
> > > > 
> > > > I'm not sure that there is a way to do this that is correct in the
> > > > generic case.  It's possible that the destination could be a VM while
> > > > packets are being mirrored to a physical device or we could be
> > > > multicasting or some other arbitrarily complex scenario.  Just think
> > > > of what a physical switch would do if it has ports with two different
> > > > speeds.
> > > 
> > > Yes, I have considered that case. And I agree that perhaps there
> > > is no sensible default. But perhaps we could make it configurable somehow?
> > 
> > The fix is at the application level. Run netperf with -b and -w flags to
> > limit the speed to a sensible value.
> 
> Perhaps I should have stated my goals more clearly.
> I'm interested in situations where I don't control the application.

Well an application that streams UDP without any throttling
at the application level will break on a physical network, right?
So I am not sure why one should try to make it work on the virtual one.

But let's assume that you do want to throttle the guest
for reasons such as QOS. The proper approach seems
to be to throttle the sender, not have a dummy throttled
receiver "pacing" it. Place the qemu process in the
correct net_cls cgroup, set the class id and apply a rate limit?
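
[ Concretely, that suggestion might look like the sketch below. The cgroup
  mount point, class id, device, and rate are assumptions. ]

```shell
# Sketch: tag the qemu process's traffic via net_cls, then rate-limit the
# tagged class with HTB and the cgroup classifier on the egress device.
mkdir -p /sys/fs/cgroup/net_cls/guest0
echo 0x00010001 > /sys/fs/cgroup/net_cls/guest0/net_cls.classid  # maps to 1:1
echo "$QEMU_PID" > /sys/fs/cgroup/net_cls/guest0/tasks

tc qdisc add dev eth1 root handle 1: htb
tc class add dev eth1 parent 1: classid 1:1 htb rate 100mbit
tc filter add dev eth1 parent 1: protocol ip prio 10 handle 1: cgroup
```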


---

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Simon Horman Jan. 16, 2011, 10:37 p.m. UTC | #1
On Fri, Jan 14, 2011 at 08:54:15AM +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
> > On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> > > On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > > > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > > > >> >
> > > > > >> > [ snip ]
> > > > > >> > >
> > > > > >> > > I know that everyone likes a nice netperf result but I agree with
> > > > > >> > > Michael that this probably isn't the right question to be asking.  I
> > > > > >> > > don't think that socket buffers are a real solution to the flow
> > > > > >> > > control problem: they happen to provide that functionality but it's
> > > > > >> > > more of a side effect than anything.  It's just that the amount of
> > > > > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > > > > >> > > implicit meaning for flow control (think multiple physical adapters,
> > > > > >> > > all with the same speed instead of a virtual device and a physical
> > > > > >> > > device with wildly different speeds).  The analog in the physical
> > > > > >> > > world that you're looking for would be Ethernet flow control.
> > > > > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > > > > >> > > that's a different story.
> > > > > >> >
> > > > > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > > > > >> > using cgroups and/or tc.
> > > > > >>
> > > > > >> I have found that I can successfully control the throughput using
> > > > > >> the following techniques
> > > > > >>
> > > > > >> 1) Place a tc egress filter on dummy0
> > > > > >>
> > > > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > > > > >>    this is effectively the same as one of my hacks to the datapath
> > > > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > > > >>    "paces" the connection.
> > > 
> > > This is actually a bug. This means that one slow connection will affect
> > > fast ones. I intend to change the default for qemu to sndbuf=0 : this
> > > will fix it but break your "pacing". So pls do not count on this
> > > behaviour.
> > 
> > Do you have a patch I could test?
> 
> You can (and users already can) just run qemu with sndbuf=0. But if you
> like, below.

Thanks

> > > > > > Further to this, I wonder if there is any interest in providing
> > > > > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > > > > and/or switching the default action order for mirroring.
> > > > > 
> > > > > I'm not sure that there is a way to do this that is correct in the
> > > > > generic case.  It's possible that the destination could be a VM while
> > > > > packets are being mirrored to a physical device or we could be
> > > > > multicasting or some other arbitrarily complex scenario.  Just think
> > > > > of what a physical switch would do if it has ports with two different
> > > > > speeds.
> > > > 
> > > > Yes, I have considered that case. And I agree that perhaps there
> > > > is no sensible default. But perhaps we could make it configurable somehow?
> > > 
> > > The fix is at the application level. Run netperf with -b and -w flags to
> > > limit the speed to a sensible value.
> > 
> > Perhaps I should have stated my goals more clearly.
> > I'm interested in situations where I don't control the application.
> 
> Well an application that streams UDP without any throttling
> at the application level will break on a physical network, right?
> So I am not sure why one should try to make it work on the virtual one.
> 
> But let's assume that you do want to throttle the guest
> for reasons such as QOS. The proper approach seems
> to be to throttle the sender, not have a dummy throttled
> receiver "pacing" it. Place the qemu process in the
> correct net_cls cgroup, set the class id and apply a rate limit?

I would like to be able to use a class to rate limit egress packets.
That much works fine for me.

What I would also like is for there to be back-pressure such that the guest
doesn't consume lots of CPU, spinning, sending packets as fast as it can,
almost all of which are dropped. That does seem like a lot of wasted
CPU to me.

Unfortunately there are several problems with this and I am fast concluding
that I will need to use a CPU cgroup. Which does make some sense, as what I
am really trying to limit here is CPU usage not network packet rates - even
if the test using the CPU is netperf.  So long as the CPU usage can
(mostly) be attributed to the guest, using a cgroup should work fine.  And
it indeed seems to in my limited testing.

One scenario in which I don't think it is possible for there to be
back-pressure in a meaningful sense is if root in the guest sets
/proc/sys/net/core/wmem_default to a large value, say 2000000.


I do think that to some extent there is back-pressure provided by sockbuf
in the case where a process on the host is sending directly to a physical
interface.  And to my mind it would be "nice" if the same kind of
back-pressure was present in guests.  But through our discussions of the
past week or so I get the feeling that is not your view of things.

Perhaps I could characterise the guest situation by saying:
	Egress packet rates can be controlled using tc on the host;
	Guest CPU usage can be controlled using CPU cgroups on the host;
	Sockbuf controls memory usage on the host;
	Back-pressure is irrelevant.

Rusty Russell Jan. 16, 2011, 11:56 p.m. UTC | #2
On Mon, 17 Jan 2011 09:07:30 am Simon Horman wrote:

[snip]

I've been away, but what concerns me is that socket buffer limits are
bypassed in various configurations, due to skb cloning.  We should probably
drop such limits altogether, or fix them to be consistent.

Simple fix is as someone suggested here, to attach the clone.  That might
seriously reduce your sk limit, though.  I haven't thought about it hard,
but might it make sense to move ownership into skb_shared_info; ie. the
data, rather than the skb head?

Cheers,
Rusty.
Michael S. Tsirkin Jan. 17, 2011, 10:26 a.m. UTC | #3
On Mon, Jan 17, 2011 at 07:37:30AM +0900, Simon Horman wrote:
> On Fri, Jan 14, 2011 at 08:54:15AM +0200, Michael S. Tsirkin wrote:
> > On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
> > > On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
> > > > On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> > > > > On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > > > > > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > > > > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > > > > > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > > > > > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > > > > > >> >
> > > > > > >> > [ snip ]
> > > > > > >> > >
> > > > > > >> > > I know that everyone likes a nice netperf result but I agree with
> > > > > > >> > > Michael that this probably isn't the right question to be asking.  I
> > > > > > >> > > don't think that socket buffers are a real solution to the flow
> > > > > > >> > > control problem: they happen to provide that functionality but it's
> > > > > > >> > > more of a side effect than anything.  It's just that the amount of
> > > > > > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > > > > > >> > > implicit meaning for flow control (think multiple physical adapters,
> > > > > > >> > > all with the same speed instead of a virtual device and a physical
> > > > > > >> > > device with wildly different speeds).  The analog in the physical
> > > > > > >> > > world that you're looking for would be Ethernet flow control.
> > > > > > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > > > > > >> > > that's a different story.
> > > > > > >> >
> > > > > > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > > > > > >> > using cgroups and/or tc.
> > > > > > >>
> > > > > > >> I have found that I can successfully control the throughput using
> > > > > > >> the following techniques
> > > > > > >>
> > > > > > >> 1) Place a tc egress filter on dummy0
> > > > > > >>
> > > > > > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > > > > > >>    this is effectively the same as one of my hacks to the datapath
> > > > > > >>    that I mentioned in an earlier mail. The result is that eth1
> > > > > > >>    "paces" the connection.
> > > > 
> > > > This is actually a bug. This means that one slow connection will affect
> > > > fast ones. I intend to change the default for qemu to sndbuf=0 : this
> > > > will fix it but break your "pacing". So pls do not count on this
> > > > behaviour.
> > > 
> > > Do you have a patch I could test?
> > 
> > You can (and users already can) just run qemu with sndbuf=0. But if you
> > like, below.
> 
> Thanks
> 
> > > > > > > Further to this, I wonder if there is any interest in providing
> > > > > > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > > > > > and/or switching the default action order for mirroring.
> > > > > > 
> > > > > > I'm not sure that there is a way to do this that is correct in the
> > > > > > generic case.  It's possible that the destination could be a VM while
> > > > > > packets are being mirrored to a physical device or we could be
> > > > > > multicasting or some other arbitrarily complex scenario.  Just think
> > > > > > of what a physical switch would do if it has ports with two different
> > > > > > speeds.
> > > > > 
> > > > > Yes, I have considered that case. And I agree that perhaps there
> > > > > is no sensible default. But perhaps we could make it configurable somehow?
> > > > 
> > > > The fix is at the application level. Run netperf with -b and -w flags to
> > > > limit the speed to a sensible value.
> > > 
> > > Perhaps I should have stated my goals more clearly.
> > > I'm interested in situations where I don't control the application.
> > 
> > Well an application that streams UDP without any throttling
> > at the application level will break on a physical network, right?
> > So I am not sure why one should try to make it work on the virtual one.
> > 
> > But let's assume that you do want to throttle the guest
> > for reasons such as QOS. The proper approach seems
> > to be to throttle the sender, not have a dummy throttled
> > receiver "pacing" it. Place the qemu process in the
> > correct net_cls cgroup, set the class id and apply a rate limit?
> 
> I would like to be able to use a class to rate limit egress packets.
> That much works fine for me.
> 
> What I would also like is for there to be back-pressure such that the guest
> doesn't consume lots of CPU, spinning, sending packets as fast as it can,
> almost all of which are dropped. That does seem like a lot of wasted
> CPU to me.
> 
> Unfortunately there are several problems with this and I am fast concluding
> that I will need to use a CPU cgroup. Which does make some sense, as what I
> am really trying to limit here is CPU usage not network packet rates - even
> if the test using the CPU is netperf.  So long as the CPU usage can
> (mostly) be attributed to the guest, using a cgroup should work fine.  And
> it indeed seems to in my limited testing.
> 
> One scenario in which I don't think it is possible for there to be
> back-pressure in a meaningful sense is if root in the guest sets
> /proc/sys/net/core/wmem_default to a large value, say 2000000.
> 
> 
> I do think that to some extent there is back-pressure provided by sockbuf
> in the case where a process on the host is sending directly to a physical
> interface.  And to my mind it would be "nice" if the same kind of
> back-pressure was present in guests.  But through our discussions of the
> past week or so I get the feeling that is not your view of things.

It might be nice. Unfortunately this is not what we have implemented:
the sockbuf backpressure blocks the socket, whereas what we have blocks all
transmit from the guest. Another issue is that the strategy we have
seems to be broken if the target is a guest on another machine.

So it won't be all that simple to implement well, and before we try,
I'd like to know whether there are applications that are helped
by it. For example, we could try to measure latency at various
pps and see whether the backpressure helps. netperf has -b, -w
flags which might help these measurements.

> Perhaps I could characterise the guest situation by saying:
> 	Egress packet rates can be controlled using tc on the host;
> 	Guest CPU usage can be controlled using CPU cgroups on the host;
> 	Sockbuf controls memory usage on the host;

Not really, the memory usage on the host is controlled by the
various queue lengths in the host. E.g. if you send packets to
the physical device, they will get queued there.

> 	Back-pressure is irrelevant.

Or at least, broken :)
Michael S. Tsirkin Jan. 17, 2011, 10:38 a.m. UTC | #4
On Mon, Jan 17, 2011 at 10:26:25AM +1030, Rusty Russell wrote:
> On Mon, 17 Jan 2011 09:07:30 am Simon Horman wrote:
> 
> [snip]
> 
> I've been away, but what concerns me is that socket buffer limits are
> bypassed in various configurations, due to skb cloning.  We should probably
> drop such limits altogether, or fix them to be consistent.

Further, it looks like when the limits are not bypassed, they
easily result in deadlocks. For example, with
multiple tun devices attached to a single bridge in host,
if a number of these have their queues blocked,
others will reach the socket buffer limit and
traffic on the bridge will get blocked altogether.

It might be better to drop the limits altogether
unless we can fix them. Happily, as the limits are off by
default, doing so does not require kernel changes.

> Simple fix is as someone suggested here, to attach the clone.  That might
> seriously reduce your sk limit, though.  I haven't thought about it hard,
> but might it make sense to move ownership into skb_shared_info; ie. the
> data, rather than the skb head?
> 
> Cheers,
> Rusty.

Tracking data ownership might benefit others, such as various zero-copy
strategies. It might need to be done per-page, though, not per-skb.
Rick Jones Jan. 18, 2011, 7:41 p.m. UTC | #5
> So it won't be all that simple to implement well, and before we try,
> I'd like to know whether there are applications that are helped
> by it. For example, we could try to measure latency at various
> pps and see whether the backpressure helps. netperf has -b, -w
> flags which might help these measurements.

Those options are enabled when one adds --enable-burst to the pre-compilation 
./configure  of netperf (one doesn't have to recompile netserver).  However, if 
one is also looking at latency statistics via the -j option in the top-of-trunk, 
or simply at the histogram with --enable-histogram on the ./configure and a 
verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf 
from:

http://www.netperf.org/svn/netperf2/trunk

to get the recently added support for accurate (netperf level) RTT measurements 
on burst-mode request/response tests.

happy benchmarking,

rick jones

PS - the enhanced latency statistics from -j are only available in the "omni" 
version of the TCP_RR test.  To get that add a --enable-omni to the ./configure 
- and in this case both netperf and netserver have to be recompiled.  For very 
basic output one can peruse the output of:

src/netperf -t omni -- -O /?

and then pick those outputs of interest and put them into an output selection 
file which one then passes to either (test-specific) -o, -O or -k to get CSV, 
"Human" or keyval output respectively.  E.G.

raj@tardy:~/netperf2_trunk$ cat foo
THROUGHPUT,THROUGHPUT_UNITS
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY

when foo is passed to -o one will get those all on one line of CSV.  To -O one 
gets three lines of more netperf-classic-like "human" readable output, and when 
one passes that to -k one gets a string of keyval output a la:

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 
AF_INET : histogram
THROUGHPUT=29454.12
THROUGHPUT_UNITS=Trans/s
RT_LATENCY=33.951
MIN_LATENCY=19
MEAN_LATENCY=32.00
MAX_LATENCY=126
P50_LATENCY=32
P90_LATENCY=38
P99_LATENCY=41
STDDEV_LATENCY=5.46

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0: 3553: 45244: 237790: 7859:   86:   10:    3:    0:    0
HUNDRED_USEC  :    0:    2:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_MSEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
 >100_SECS: 0
HIST_TOTAL:      294547

Michael S. Tsirkin Jan. 18, 2011, 8:13 p.m. UTC | #6
On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> >So it won't be all that simple to implement well, and before we try,
> >I'd like to know whether there are applications that are helped
> >by it. For example, we could try to measure latency at various
> >pps and see whether the backpressure helps. netperf has -b, -w
> >flags which might help these measurements.
> 
> Those options are enabled when one adds --enable-burst to the
> pre-compilation ./configure  of netperf (one doesn't have to
> recompile netserver).  However, if one is also looking at latency
> statistics via the -j option in the top-of-trunk, or simply at the
> histogram with --enable-histogram on the ./configure and a verbosity
> level of 2 (global -v 2) then one wants the very top of trunk
> netperf from:
> 
> http://www.netperf.org/svn/netperf2/trunk
> 
> to get the recently added support for accurate (netperf level) RTT
> measurements on burst-mode request/response tests.
> 
> happy benchmarking,
> 
> rick jones
> 
> PS - the enhanced latency statistics from -j are only available in
> the "omni" version of the TCP_RR test.  To get that add a
> --enable-omni to the ./configure - and in this case both netperf and
> netserver have to be recompiled.


Is this TCP only? I would love to get latency data from UDP as well.

>  For very basic output one can
> peruse the output of:
> 
> src/netperf -t omni -- -O /?
> 
> and then pick those outputs of interest and put them into an output
> selection file which one then passes to either (test-specific) -o,
> -O or -k to get CSV, "Human" or keyval output respectively.  E.G.
> 
> raj@tardy:~/netperf2_trunk$ cat foo
> THROUGHPUT,THROUGHPUT_UNITS
> RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
> P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
> 
> when foo is passed to -o one will get those all on one line of CSV.
> To -O one gets three lines of more netperf-classic-like "human"
> readable output, and when one passes that to -k one gets a string of
> keyval output a la:
> 
> raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo
> OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost
> (127.0.0.1) port 0 AF_INET : histogram
> THROUGHPUT=29454.12
> THROUGHPUT_UNITS=Trans/s
> RT_LATENCY=33.951
> MIN_LATENCY=19
> MEAN_LATENCY=32.00
> MAX_LATENCY=126
> P50_LATENCY=32
> P90_LATENCY=38
> P99_LATENCY=41
> STDDEV_LATENCY=5.46
> 
> Histogram of request/response times
> UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> TEN_USEC      :    0: 3553: 45244: 237790: 7859:   86:   10:    3:    0:    0
> HUNDRED_USEC  :    0:    2:    0:    0:    0:    0:    0:    0:    0:    0
> UNIT_MSEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
> >100_SECS: 0
> HIST_TOTAL:      294547
Rick Jones Jan. 18, 2011, 9:28 p.m. UTC | #7
Michael S. Tsirkin wrote:
> On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> 
>>PS - the enhanced latency statistics from -j are only available in
>>the "omni" version of the TCP_RR test.  To get that add a
>>--enable-omni to the ./configure - and in this case both netperf and
>>netserver have to be recompiled.
> 
> Is this TCP only? I would love to get latency data from UDP as well.

I believe it will work with UDP request response as well.  The omni test code 
strives to be protocol agnostic.  (I'm sure there are bugs of course, there 
always are.)

There is though the added complication of there being no specific matching of 
requests to responses.  The code as written takes advantage of TCP's in-order 
semantics and recovery from packet loss.  In a "plain" UDP_RR test, with 
one-at-a-time transactions, if either the request or response is lost, data flow 
effectively stops there until the timer expires.  So, one has "reasonable" RTT 
numbers from before that point.  In a burst UDP RR test, the code doesn't know 
which request/response was lost and so the matching being done to get RTTs will 
be off by each lost datagram.  And if something were re-ordered the timestamps 
would be off even without a datagram loss event.

To "fix" that would require netperf do something it has not yet done in 18-odd 
years :)  That is, actually echo something back from the netserver on the RR test 
- either an id, or a timestamp.  That means "dirtying" the buffers which means 
still more cache misses, from places other than the actual stack. Not beyond the 
realm of the possible, but it would be a bit of a departure from "normal" operation 
(*) and could enforce a minimum request/response size beyond the present single 
byte (ok, perhaps only two or four bytes :).  But that, perhaps, is a discussion 
best left to netperf-talk at netperf.org.

happy benchmarking,

rick jones

(*) netperf does have the concept of reading from and/or dirtying buffers, 
put in back in the days of COW/page-remapping in HP-UX 9.0, but that was mainly 
to force COW and/or show the effect of the required data cache purges/flushes. 
As such it was made conditional on DIRTY being defined.
Simon Horman Jan. 19, 2011, 9:11 a.m. UTC | #8
On Tue, Jan 18, 2011 at 10:13:33PM +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > >So it won't be all that simple to implement well, and before we try,
> > >I'd like to know whether there are applications that are helped
> > >by it. For example, we could try to measure latency at various
> > >pps and see whether the backpressure helps. netperf has -b, -w
> > >flags which might help these measurements.
> > 
> > Those options are enabled when one adds --enable-burst to the
> > pre-compilation ./configure  of netperf (one doesn't have to
> > recompile netserver).  However, if one is also looking at latency
> > statistics via the -j option in the top-of-trunk, or simply at the
> > histogram with --enable-histogram on the ./configure and a verbosity
> > level of 2 (global -v 2) then one wants the very top of trunk
> > netperf from:
> > 
> > http://www.netperf.org/svn/netperf2/trunk
> > 
> > to get the recently added support for accurate (netperf level) RTT
> > measurements on burst-mode request/response tests.
> > 
> > happy benchmarking,
> > 
> > rick jones

Thanks Rick, that is really helpful.

> > PS - the enhanced latency statistics from -j are only available in
> > the "omni" version of the TCP_RR test.  To get that add a
> > --enable-omni to the ./configure - and in this case both netperf and
> > netserver have to be recompiled.
> 
> 
> Is this TCP only? I would love to get latency data from UDP as well.

At a glance, -- -T UDP is what you are after.
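
[ Putting that together with Rick's options, a UDP request/response run of
  the omni test with latency statistics might look like the line below.
  The host address is a placeholder, and "foo" is the output-selection
  file shown earlier; this assumes netperf was built from trunk with
  --enable-omni and --enable-burst. ]

```shell
netperf -t omni -j -v 2 -H 172.17.60.216 -- -T UDP -d rr -r 1 -k foo
```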
Simon Horman Jan. 20, 2011, 8:38 a.m. UTC | #9
[ Trimmed Eric from CC list as vger was complaining that it is too long ]

On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> >So it won't be all that simple to implement well, and before we try,
> >I'd like to know whether there are applications that are helped
> >by it. For example, we could try to measure latency at various
> >pps and see whether the backpressure helps. netperf has -b, -w
> >flags which might help these measurements.
> 
> Those options are enabled when one adds --enable-burst to the
> pre-compilation ./configure  of netperf (one doesn't have to
> recompile netserver).  However, if one is also looking at latency
> statistics via the -j option in the top-of-trunk, or simply at the
> histogram with --enable-histogram on the ./configure and a verbosity
> level of 2 (global -v 2) then one wants the very top of trunk
> netperf from:

Hi,

I have constructed a test where I run an un-paced UDP_STREAM test in
one guest and a paced omni rr test in another guest at the same time.
Briefly, I get the following results from the omni test:

1. Omni test only:		MEAN_LATENCY=272.00
2. Omni and stream test:	MEAN_LATENCY=3423.00
3. cpu and net_cls group:	MEAN_LATENCY=493.00
   As per 2 plus cgroups are created for each guest
   and guest tasks added to the groups
4. 100Mbit/s class:		MEAN_LATENCY=273.00
   As per 3 plus the net_cls groups each have a 100MBit/s HTB class
5. cpu.shares=128:		MEAN_LATENCY=652.00
   As per 4 plus the cpu groups have cpu.shares set to 128
6. Busy CPUS:			MEAN_LATENCY=15126.00
   As per 5 but the CPUs are made busy using a simple shell while loop
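
[ The cgroup/tc setup behind steps 3-5 might be sketched as below for one
  guest; cgroup mount points, device names, class ids, and the guest pid
  are assumptions, and the same is repeated for the second guest. ]

```shell
# Steps 3-5 above, sketched for one guest (assumed paths and names).
mkdir -p /sys/fs/cgroup/cpu/guest0 /sys/fs/cgroup/net_cls/guest0
echo "$GUEST0_PID" > /sys/fs/cgroup/cpu/guest0/tasks          # step 3
echo "$GUEST0_PID" > /sys/fs/cgroup/net_cls/guest0/tasks
echo 0x00010001 > /sys/fs/cgroup/net_cls/guest0/net_cls.classid
echo 128 > /sys/fs/cgroup/cpu/guest0/cpu.shares               # step 5

tc qdisc add dev eth0 root handle 1: htb                      # step 4
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
tc filter add dev eth0 parent 1: protocol ip prio 10 handle 1: cgroup
```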

There is a bit of noise in the results as the two netperf invocations
aren't started at exactly the same moment

For reference, my netperf invocations are:
netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200

foo contains
PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL

Rick Jones Jan. 21, 2011, 2:30 a.m. UTC | #10
Simon Horman wrote:
> [ Trimmed Eric from CC list as vger was complaining that it is too long ]
>...
> I have constructed a test where I run an un-paced  UDP_STREAM test in
> one guest and a paced omni rr test in another guest at the same time.
> Briefly, I get the following results from the omni test:
> 
>...
 >
> There is a bit of noise in the results as the two netperf invocations
> aren't started at exactly the same moment
> 
> For reference, my netperf invocations are:
> netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
> netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200

Since the -b and -w are in the test-specific portion, this test was not actually 
  paced. The -w will have been ignored entirely (IIRC) and the -b will have 
attempted to set the "burst" size of a --enable-burst ./configured netperf.  If 
netperf was ./configured that way, it will have had two rr transactions in 
flight at one time - the "regular" one and then the one additional from the -b 
option.  If netperf was not ./configured with --enable-burst then a warning 
message should have been emitted.

Also, I am guessing you wanted TCP_NODELAY set, and that is -D but not a global 
-D.  I'm reasonably confident the -m 200 will have been ignored, but it would be 
best to drop it. So, I think your second line needs to be:

netperf.omni -p 12866 -c -C -H 172.17.60.216 -t omni -j -v 2 -b 1 -w 200 -- -r 1 -d rr -k foo -D

If you want the request and response sizes to be 200 bytes, use -r 200 
(test-specific).

Also, if you ./configure with --enable-omni first, that netserver will 
understand both omni and non-omni tests at the same time and you don't have to 
have a second netserver on a different control port.  You can also go-in to 
config.h after the ./configure and unset WANT_MIGRATION and then UDP_STREAM in 
netperf will be the "true" classic UDP_STREAM code rather than the migrated to 
omni path.

> foo contains
> PROTOCOL
> THROUGHPUT,THROUGHPUT_UNITS
> LOCAL_SEND_THROUGHPUT
> LOCAL_RECV_THROUGHPUT
> REMOTE_SEND_THROUGHPUT
> REMOTE_RECV_THROUGHPUT
> RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
> P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
> LOCAL_CPU_UTIL,REMOTE_CPU_UTIL

As the -k file parsing option didn't care until recently (within the hour or 
so), I think it didn't matter that you had more than four lines (assuming that 
is a verbatim cat of foo).  However, if you pull the *current* top of trunk, it 
will probably start to care - I'm in the midst of adding support for "direct 
output selection" in the -k, -o and -O options and also cleaning-up the omni 
printing code to the point where there is only the one routine parsing the 
output selection file.  Currently that is the one for "human" output, which has 
a four line restriction.  I will try to make it smarter as I go.

happy benchmarking,

rick jones
Michael S. Tsirkin Jan. 21, 2011, 9:59 a.m. UTC | #11
On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> 
> On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > >So it won't be all that simple to implement well, and before we try,
> > >I'd like to know whether there are applications that are helped
> > >by it. For example, we could try to measure latency at various
> > >pps and see whether the backpressure helps. netperf has -b, -w
> > >flags which might help these measurements.
> > 
> > Those options are enabled when one adds --enable-burst to the
> > pre-compilation ./configure  of netperf (one doesn't have to
> > recompile netserver).  However, if one is also looking at latency
> > statistics via the -j option in the top-of-trunk, or simply at the
> > histogram with --enable-histogram on the ./configure and a verbosity
> > level of 2 (global -v 2) then one wants the very top of trunk
> > netperf from:
> 
> Hi,
> 
> I have constructed a test where I run an un-paced  UDP_STREAM test in
> one guest and a paced omni rr test in another guest at the same time.

Hmm, what is this supposed to measure?  Basically each time you run an
un-paced UDP_STREAM you get some random load on the network.
You can't tell what it was exactly, only that it was between
the send and receive throughput.

> Briefly, I get the following results from the omni test:
> 
> 1. Omni test only:		MEAN_LATENCY=272.00
> 2. Omni and stream test:	MEAN_LATENCY=3423.00
> 3. cpu and net_cls group:	MEAN_LATENCY=493.00
>    As per 2 plus cgroups are created for each guest
>    and guest tasks added to the groups
> 4. 100Mbit/s class:		MEAN_LATENCY=273.00
>    As per 3 plus the net_cls groups each have a 100MBit/s HTB class
> 5. cpu.shares=128:		MEAN_LATENCY=652.00
>    As per 4 plus the cpu groups have cpu.shares set to 128
> 6. Busy CPUS:			MEAN_LATENCY=15126.00
>    As per 5 but the CPUs are made busy using a simple shell while loop
> 
> There is a bit of noise in the results as the two netperf invocations
> aren't started at exactly the same moment
> 
> For reference, my netperf invocations are:
> netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
> netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200
> 
> foo contains
> PROTOCOL
> THROUGHPUT,THROUGHPUT_UNITS
> LOCAL_SEND_THROUGHPUT
> LOCAL_RECV_THROUGHPUT
> REMOTE_SEND_THROUGHPUT
> REMOTE_RECV_THROUGHPUT
> RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
> P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
> LOCAL_CPU_UTIL,REMOTE_CPU_UTIL
Rick Jones Jan. 21, 2011, 6:04 p.m. UTC | #12
>>I have constructed a test where I run an un-paced  UDP_STREAM test in
>>one guest and a paced omni rr test in another guest at the same time.
> 
> 
> Hmm, what is this supposed to measure?  Basically each time you run an
> un-paced UDP_STREAM you get some random load on the network.

Well, if the netperf is (effectively) pinned to a given CPU, presumably it would 
be trying to generate UDP datagrams at the same rate each time.  Indeed though, 
there is no guarantee that rate would consistently get through each time.

But then, that is where one can use the confidence intervals options to get an 
idea by how much the rate varied.

rick jones
Simon Horman Jan. 21, 2011, 11:11 p.m. UTC | #13
On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> > [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> > 
> > On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > >So it won't be all that simple to implement well, and before we try,
> > > >I'd like to know whether there are applications that are helped
> > > >by it. For example, we could try to measure latency at various
> > > >pps and see whether the backpressure helps. netperf has -b, -w
> > > >flags which might help these measurements.
> > > 
> > > Those options are enabled when one adds --enable-burst to the
> > > pre-compilation ./configure  of netperf (one doesn't have to
> > > recompile netserver).  However, if one is also looking at latency
> > > statistics via the -j option in the top-of-trunk, or simply at the
> > > histogram with --enable-histogram on the ./configure and a verbosity
> > > level of 2 (global -v 2) then one wants the very top of trunk
> > > netperf from:
> > 
> > Hi,
> > 
> > I have constructed a test where I run an un-paced  UDP_STREAM test in
> > one guest and a paced omni rr test in another guest at the same time.
> 
> Hmm, what is this supposed to measure?  Basically each time you run an
> un-paced UDP_STREAM you get some random load on the network.
> You can't tell what it was exactly, only that it was between
> the send and receive throughput.

Rick mentioned in another email that I messed up my test parameters a bit,
so I will re-run the tests, incorporating his suggestions.

What I was attempting to measure was the effect of an unpaced UDP_STREAM
on the latency of more moderated traffic, because I am interested in
what effect an abusive guest has on other guests and how that may be
mitigated.

Could you suggest some tests that you feel are more appropriate?

Michael S. Tsirkin Jan. 22, 2011, 9:57 p.m. UTC | #14
On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
> On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
> > On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> > > [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> > > 
> > > On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > > >So it won't be all that simple to implement well, and before we try,
> > > > >I'd like to know whether there are applications that are helped
> > > > >by it. For example, we could try to measure latency at various
> > > > >pps and see whether the backpressure helps. netperf has -b, -w
> > > > >flags which might help these measurements.
> > > > 
> > > > Those options are enabled when one adds --enable-burst to the
> > > > pre-compilation ./configure  of netperf (one doesn't have to
> > > > recompile netserver).  However, if one is also looking at latency
> > > > statistics via the -j option in the top-of-trunk, or simply at the
> > > > histogram with --enable-histogram on the ./configure and a verbosity
> > > > level of 2 (global -v 2) then one wants the very top of trunk
> > > > netperf from:
> > > 
> > > Hi,
> > > 
> > > I have constructed a test where I run an un-paced  UDP_STREAM test in
> > > one guest and a paced omni rr test in another guest at the same time.
> > 
> > Hmm, what is this supposed to measure?  Basically each time you run an
> > un-paced UDP_STREAM you get some random load on the network.
> > You can't tell what it was exactly, only that it was between
> > the send and receive throughput.
> 
> Rick mentioned in another email that I messed up my test parameters a bit,
> so I will re-run the tests, incorporating his suggestions.
> 
> What I was attempting to measure was the effect of an unpaced UDP_STREAM
> on the latency of more moderated traffic. Because I am interested in
> what effect an abusive guest has on other guests and how that may be
> mitigated.
> 
> Could you suggest some tests that you feel are more appropriate?

Yes. To rephrase my concern in these terms: besides the malicious guest
you have other software on the host (netperf) that interferes with
the traffic, and it cooperates with the malicious guest.
Right?

IMO for a malicious guest you would send
UDP packets that then get dropped by the host.

For example, block netperf on the host so that
it does not consume packets from the socket.



Simon Horman Jan. 23, 2011, 6:38 a.m. UTC | #15
On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
> On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
> > On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> > > > [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> > > > 
> > > > On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > > > >So it won't be all that simple to implement well, and before we try,
> > > > > >I'd like to know whether there are applications that are helped
> > > > > >by it. For example, we could try to measure latency at various
> > > > > >pps and see whether the backpressure helps. netperf has -b, -w
> > > > > >flags which might help these measurements.
> > > > > 
> > > > > Those options are enabled when one adds --enable-burst to the
> > > > > pre-compilation ./configure  of netperf (one doesn't have to
> > > > > recompile netserver).  However, if one is also looking at latency
> > > > > statistics via the -j option in the top-of-trunk, or simply at the
> > > > > histogram with --enable-histogram on the ./configure and a verbosity
> > > > > level of 2 (global -v 2) then one wants the very top of trunk
> > > > > netperf from:
> > > > 
> > > > Hi,
> > > > 
> > > > I have constructed a test where I run an un-paced  UDP_STREAM test in
> > > > one guest and a paced omni rr test in another guest at the same time.
> > > 
> > > Hmm, what is this supposed to measure?  Basically each time you run an
> > > un-paced UDP_STREAM you get some random load on the network.
> > > You can't tell what it was exactly, only that it was between
> > > the send and receive throughput.
> > 
> > Rick mentioned in another email that I messed up my test parameters a bit,
> > so I will re-run the tests, incorporating his suggestions.
> > 
> > What I was attempting to measure was the effect of an unpaced UDP_STREAM
> > on the latency of more moderated traffic. Because I am interested in
> > what effect an abusive guest has on other guests and how that may be
> > mitigated.
> > 
> > Could you suggest some tests that you feel are more appropriate?
> 
> Yes. To rephrase my concern in these terms, besides the malicious guest
> you have another software in host (netperf) that interferes with
> the traffic, and it cooperates with the malicious guest.
> Right?

Yes, that is the scenario in this test.

> IMO for a malicious guest you would send
> UDP packets that then get dropped by the host.
> 
> For example block netperf in host so that
> it does not consume packets from the socket.

I'm more interested in rate-limiting netperf than blocking it.
But in any case, do you mean use iptables or tc based on
classification made by net_cls?

Michael S. Tsirkin Jan. 23, 2011, 10:39 a.m. UTC | #16
On Sun, Jan 23, 2011 at 05:38:49PM +1100, Simon Horman wrote:
> On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
> > On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
> > > On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
> > > > > [ Trimmed Eric from CC list as vger was complaining that it is too long ]
> > > > > 
> > > > > On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
> > > > > > >So it won't be all that simple to implement well, and before we try,
> > > > > > >I'd like to know whether there are applications that are helped
> > > > > > >by it. For example, we could try to measure latency at various
> > > > > > >pps and see whether the backpressure helps. netperf has -b, -w
> > > > > > >flags which might help these measurements.
> > > > > > 
> > > > > > Those options are enabled when one adds --enable-burst to the
> > > > > > pre-compilation ./configure  of netperf (one doesn't have to
> > > > > > recompile netserver).  However, if one is also looking at latency
> > > > > > statistics via the -j option in the top-of-trunk, or simply at the
> > > > > > histogram with --enable-histogram on the ./configure and a verbosity
> > > > > > level of 2 (global -v 2) then one wants the very top of trunk
> > > > > > netperf from:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > I have constructed a test where I run an un-paced  UDP_STREAM test in
> > > > > one guest and a paced omni rr test in another guest at the same time.
> > > > 
> > > > Hmm, what is this supposed to measure?  Basically each time you run an
> > > > un-paced UDP_STREAM you get some random load on the network.
> > > > You can't tell what it was exactly, only that it was between
> > > > the send and receive throughput.
> > > 
> > > Rick mentioned in another email that I messed up my test parameters a bit,
> > > so I will re-run the tests, incorporating his suggestions.
> > > 
> > > What I was attempting to measure was the effect of an unpaced UDP_STREAM
> > > on the latency of more moderated traffic. Because I am interested in
> > > what effect an abusive guest has on other guests and how that my be
> > > mitigated.
> > > 
> > > Could you suggest some tests that you feel are more appropriate?
> > 
> > Yes. To rephrase my concern in these terms, besides the malicious guest
> > you have another software in host (netperf) that interferes with
> > the traffic, and it cooperates with the malicious guest.
> > Right?
> 
> Yes, that is the scenario in this test.

Yes, but I think that you want to put some controlled load on the host.
Let's assume that we improve the speed somehow and now you can push more
bytes per second without loss.  Result might be a regression in your
test because you let the guest push "as much as it can" and suddenly it
can push more data through.  OTOH with packet loss the load on host is
anywhere in between send and receive throughput: there's no easy way to
measure it from netperf: the earlier some buffers overrun, the earlier
the packets get dropped and the less the load on host.

This is why I say that to get a specific
load on host you want to limit the sender
to a specific BW and then either
- make sure packet loss % is close to 0.
- make sure packet loss % is close to 100%.

> > IMO for a malicious guest you would send
> > UDP packets that then get dropped by the host.
> > 
> > For example block netperf in host so that
> > it does not consume packets from the socket.
> 
> I'm more interested in rate-limiting netperf than blocking it.

Well, I mean netperf on the host.

> But in any case, do you mean use iptables or tc based on
> classification made by net_cls?

Just to block netperf you can send it SIGSTOP :)
Simon Horman Jan. 23, 2011, 1:53 p.m. UTC | #17
On Sun, Jan 23, 2011 at 12:39:02PM +0200, Michael S. Tsirkin wrote:
> On Sun, Jan 23, 2011 at 05:38:49PM +1100, Simon Horman wrote:
> > On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
> > > On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
> > > > On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:

[snip]

> > > > > Hmm, what is this supposed to measure?  Basically each time you run an
> > > > > un-paced UDP_STREAM you get some random load on the network.
> > > > > You can't tell what it was exactly, only that it was between
> > > > > the send and receive throughput.
> > > > 
> > > > Rick mentioned in another email that I messed up my test parameters a bit,
> > > > so I will re-run the tests, incorporating his suggestions.
> > > > 
> > > > What I was attempting to measure was the effect of an unpaced UDP_STREAM
> > > > on the latency of more moderated traffic. Because I am interested in
> > > > what effect an abusive guest has on other guests and how that may be
> > > > mitigated.
> > > > 
> > > > Could you suggest some tests that you feel are more appropriate?
> > > 
> > > Yes. To rephrase my concern in these terms, besides the malicious guest
> > > you have another software in host (netperf) that interferes with
> > > the traffic, and it cooperates with the malicious guest.
> > > Right?
> > 
> > Yes, that is the scenario in this test.
> 
> Yes but I think that you want to put some controlled load on host.
> Let's assume that we improve the speed somehow and now you can push more
> bytes per second without loss.  Result might be a regression in your
> test because you let the guest push "as much as it can" and suddenly it
> can push more data through.  OTOH with packet loss the load on host is
> anywhere in between send and receive throughput: there's no easy way to
> measure it from netperf: the earlier some buffers overrun, the earlier
> the packets get dropped and the less the load on host.
> 
> This is why I say that to get a specific
> load on host you want to limit the sender
> to a specific BW and then either
> - make sure packet loss % is close to 0.
> - make sure packet loss % is close to 100%.

Thanks, and sorry for being a bit slow.  I now see what you have
been getting at with regards to limiting the tests.
I will see about getting some numbers based on your suggestions.

Rick Jones Jan. 24, 2011, 6:27 p.m. UTC | #18
> 
> Just to block netperf you can send it SIGSTOP :)
> 

Clever :)  One could I suppose achieve the same result by making the remote 
receive socket buffer size smaller than the UDP message size and then not worry 
about having to learn the netserver's PID to send it the SIGSTOP.  I *think* the 
semantics will be substantially the same?  Both will be drops at the socket 
buffer, albeit for different reasons.  The "too small socket buffer" version 
though doesn't require one remember to "wake" the netserver in time to have it 
send results back to netperf without netperf tossing-up an error and not 
reporting any statistics.

Also, netperf has a "no control connection" mode where you can, in effect, cause 
it to send UDP datagrams out into the void - I put it there to allow folks to 
test against the likes of echo, discard and chargen services but it may have a 
use here.  Requires that one specify the destination IP and port for the "data 
connection" explicitly via the test-specific options.  In that mode the only 
stats reported are those local to netperf rather than netserver.

happy benchmarking,

rick jones
Michael S. Tsirkin Jan. 24, 2011, 6:36 p.m. UTC | #19
On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
> >
> >Just to block netperf you can send it SIGSTOP :)
> >
> 
> Clever :)  One could I suppose achieve the same result by making the
> remote receive socket buffer size smaller than the UDP message size
> and then not worry about having to learn the netserver's PID to send
> it the SIGSTOP.  I *think* the semantics will be substantially the
> same?

If you could set it, yes. But at least Linux ignores
any value substantially smaller than 1K, and then
multiplies that by 2:

        case SO_RCVBUF:
                /* Don't error on this BSD doesn't and if you think
                   about it this is right. Otherwise apps have to
                   play 'guess the biggest size' games. RCVBUF/SNDBUF
                   are treated in BSD as hints */

                if (val > sysctl_rmem_max)
                        val = sysctl_rmem_max;
set_rcvbuf:     
                sk->sk_userlocks |= SOCK_RCVBUF_LOCK;

                /*
                 * We double it on the way in to account for
                 * "struct sk_buff" etc. overhead.   Applications
                 * assume that the SO_RCVBUF setting they make will
                 * allow that much actual data to be received on that
                 * socket.
                 *
                 * Applications are unaware that "struct sk_buff" and
                 * other overheads allocate from the receive buffer
                 * during socket buffer allocation. 
                 *
                 * And after considering the possible alternatives,
                 * returning the value we actually used in getsockopt
                 * is the most desirable behavior.
                 */ 
                if ((val * 2) < SOCK_MIN_RCVBUF)
                        sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
                else
                        sk->sk_rcvbuf = val * 2;

and

/*                      
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */             
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))


>  Both will be drops at the socket buffer, albeit for
> different reasons.  The "too small socket buffer" version though
> doesn't require one remember to "wake" the netserver in time to have
> it send results back to netperf without netperf tossing-up an error
> and not reporting any statistics.
> 
> Also, netperf has a "no control connection" mode where you can, in
> effect cause it to send UDP datagrams out into the void - I put it
> there to allow folks to test against the likes of echo discard and
> chargen services but it may have a use here.  Requires that one
> specify the destination IP and port for the "data connection"
> explicitly via the test-specific options.  In that mode the only
> stats reported are those local to netperf rather than netserver.

Ah, sounds perfect.

> happy benchmarking,
> 
> rick jones

Rick Jones Jan. 24, 2011, 7:01 p.m. UTC | #20
Michael S. Tsirkin wrote:
> On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
> 
>>>Just to block netperf you can send it SIGSTOP :)
>>>
>>
>>Clever :)  One could I suppose achieve the same result by making the
>>remote receive socket buffer size smaller than the UDP message size
>>and then not worry about having to learn the netserver's PID to send
>>it the SIGSTOP.  I *think* the semantics will be substantially the
>>same?
> 
> 
> >If you could set it, yes. But at least Linux ignores
> any value substantially smaller than 1K, and then
> multiplies that by 2:
> 
>         case SO_RCVBUF:
>                 /* Don't error on this BSD doesn't and if you think
>                    about it this is right. Otherwise apps have to
>                    play 'guess the biggest size' games. RCVBUF/SNDBUF
>                    are treated in BSD as hints */
> 
>                 if (val > sysctl_rmem_max)
>                         val = sysctl_rmem_max;
> set_rcvbuf:     
>                 sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> 
>                 /*
>                  * We double it on the way in to account for
>                  * "struct sk_buff" etc. overhead.   Applications
>                  * assume that the SO_RCVBUF setting they make will
>                  * allow that much actual data to be received on that
>                  * socket.
>                  *
>                  * Applications are unaware that "struct sk_buff" and
>                  * other overheads allocate from the receive buffer
>                  * during socket buffer allocation. 
>                  *
>                  * And after considering the possible alternatives,
>                  * returning the value we actually used in getsockopt
>                  * is the most desirable behavior.
>                  */ 
>                 if ((val * 2) < SOCK_MIN_RCVBUF)
>                         sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
>                 else
>                         sk->sk_rcvbuf = val * 2;
> 
> and
> 
> /*                      
>  * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
>  * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
>  */             
> #define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))

Pity - seems to work back on 2.6.26:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     2882334      0    2361.17
    256           10.00           0              0.00

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Still, even with that (or SIGSTOP) we don't really know where the packets were 
dropped, right?  There is no guarantee they weren't dropped before they got to 
the socket buffer.

happy benchmarking,
rick jones

PS - here is with a -S 1024 option:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     1679269      0    1375.64
   2048           10.00     1490662           1221.13

showing that there is a decent chance that many of the frames were dropped at 
the socket buffer, but not all - I suppose I could/should be checking netstat 
stats... :)

And just a little more, only because I was curious :)

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     1869134      0     384.29
262142           10.00     1869134            384.29

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     3076363      0     632.49
    256           10.00           0              0.00

Michael S. Tsirkin Jan. 24, 2011, 7:42 p.m. UTC | #21
On Mon, Jan 24, 2011 at 11:01:45AM -0800, Rick Jones wrote:
> Michael S. Tsirkin wrote:
> >On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
> >
> >>>Just to block netperf you can send it SIGSTOP :)
> >>>
> >>
> >>Clever :)  One could I suppose achieve the same result by making the
> >>remote receive socket buffer size smaller than the UDP message size
> >>and then not worry about having to learn the netserver's PID to send
> >>it the SIGSTOP.  I *think* the semantics will be substantially the
> >>same?
> >
> >
> >If you could set, it, yes. But at least linux ignores
> >any value substantially smaller than 1K, and then
> >multiplies that by 2:
> >
> >        case SO_RCVBUF:
> >                /* Don't error on this BSD doesn't and if you think
> >                   about it this is right. Otherwise apps have to
> >                   play 'guess the biggest size' games. RCVBUF/SNDBUF
> >                   are treated in BSD as hints */
> >
> >                if (val > sysctl_rmem_max)
> >                        val = sysctl_rmem_max;
> >set_rcvbuf:
> >                sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> >
> >                /*
> >                 * We double it on the way in to account for
> >                 * "struct sk_buff" etc. overhead.   Applications
> >                 * assume that the SO_RCVBUF setting they make will
> >                 * allow that much actual data to be received on that
> >                 * socket.
> >                 *
> >                 * Applications are unaware that "struct sk_buff" and
> >                 * other overheads allocate from the receive buffer
> >                 * during socket buffer allocation.
> >                 *
> >                 * And after considering the possible alternatives,
> >                 * returning the value we actually used in getsockopt
> >                 * is the most desirable behavior.
> >                 */
> >                if ((val * 2) < SOCK_MIN_RCVBUF)
> >                        sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
> >                else
> >                        sk->sk_rcvbuf = val * 2;
> >
> >and
> >
> >/*
> > * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
> > * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
> > */
> >#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))
> 
> Pity - seems to work back on 2.6.26:

Hmm, that code is there at least as far back as 2.6.12.

> raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost (127.0.0.1) port 0 AF_INET : histogram
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 124928    1024   10.00     2882334      0    2361.17
>    256           10.00           0              0.00
> 
> raj@tardy:~/netperf2_trunk$ uname -a
> Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
> 
> Still, even with that (or SIGSTOP) we don't really know where the
> packets were dropped right?  There is no guarantee they weren't
> dropped before they got to the socket buffer
> 
> happy benchmarking,
> rick jones

Right. Better to send to a port with no socket listening there;
that drops the packet at an early (if not the earliest
possible) opportunity.

> PS - here is with a -S 1024 option:
> 
> raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost (127.0.0.1) port 0 AF_INET : histogram
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 124928    1024   10.00     1679269      0    1375.64
>   2048           10.00     1490662           1221.13
> 
> showing that there is a decent chance that many of the frames were
> dropped at the socket buffer, but not all - I suppose I could/should
> be checking netstat stats... :)
> 
> And just a little more, only because I was curious :)
> 
> raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost (127.0.0.1) port 0 AF_INET : histogram
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 124928     257   10.00     1869134      0     384.29
> 262142           10.00     1869134            384.29
> 
> raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost (127.0.0.1) port 0 AF_INET : histogram
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
> 124928     257   10.00     3076363      0     632.49
>    256           10.00           0              0.00

Patch

diff --git a/net/tap-linux.c b/net/tap-linux.c
index f7aa904..0dbcdd4 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -87,7 +87,7 @@  int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
  * Ethernet NICs generally have txqueuelen=1000, so 1Mb is
  * a good default, given a 1500 byte MTU.
  */
-#define TAP_DEFAULT_SNDBUF 1024*1024
+#define TAP_DEFAULT_SNDBUF 0
 
 int tap_set_sndbuf(int fd, QemuOpts *opts)
 {