[net-next] virtio_net: force_napi_tx module param.
diff mbox series

Message ID 20180723231119.142904-1-caleb.raitto@gmail.com
State Changes Requested
Delegated to: David Miller
Headers show
Series
  • [net-next] virtio_net: force_napi_tx module param.
Related show

Commit Message

Caleb Raitto July 23, 2018, 11:11 p.m. UTC
From: Caleb Raitto <caraitto@google.com>

The driver disables tx napi if it's not certain that completions will
be processed affine with tx service.

Its heuristic doesn't account for some scenarios where it is, such as
when the queue pair count matches the core but not hyperthread count.

Allow userspace to override the heuristic. This is an alternative
solution to that in the linked patch. That added more logic in the
kernel for these cases, but the agreement was that this was better left
to user control.

Do not expand the existing napi_tx variable to a ternary value,
because doing so can break user applications that expect
boolean ('Y'/'N') instead of integer output. Add a new param instead.

Link: https://patchwork.ozlabs.org/patch/725249/
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jon Olson <jonolson@google.com>
Signed-off-by: Caleb Raitto <caraitto@google.com>
---
 drivers/net/virtio_net.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Stephen Hemminger July 24, 2018, 12:53 a.m. UTC | #1
On Mon, 23 Jul 2018 16:11:19 -0700
Caleb Raitto <caleb.raitto@gmail.com> wrote:

> From: Caleb Raitto <caraitto@google.com>
> 
> The driver disables tx napi if it's not certain that completions will
> be processed affine with tx service.
> 
> Its heuristic doesn't account for some scenarios where it is, such as
> when the queue pair count matches the core but not hyperthread count.
> 
> Allow userspace to override the heuristic. This is an alternative
> solution to that in the linked patch. That added more logic in the
> kernel for these cases, but the agreement was that this was better left
> to user control.
> 
> Do not expand the existing napi_tx variable to a ternary value,
> because doing so can break user applications that expect
> boolean ('Y'/'N') instead of integer output. Add a new param instead.
> 
> Link: https://patchwork.ozlabs.org/patch/725249/
> Acked-by: Willem de Bruijn <willemb@google.com>
> Acked-by: Jon Olson <jonolson@google.com>
> Signed-off-by: Caleb Raitto <caraitto@google.com>
> ---

Not a fan of this.
Module parameters are frowned on by the distributions because they
never get well tested and they force the user to do magic things
to enable features. It looks like you are using it to paper
over a bug in this case.
Willem de Bruijn July 24, 2018, 1:23 a.m. UTC | #2
On Mon, Jul 23, 2018 at 8:55 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Mon, 23 Jul 2018 16:11:19 -0700
> Caleb Raitto <caleb.raitto@gmail.com> wrote:
>
> > From: Caleb Raitto <caraitto@google.com>
> >
> > The driver disables tx napi if it's not certain that completions will
> > be processed affine with tx service.
> >
> > Its heuristic doesn't account for some scenarios where it is, such as
> > when the queue pair count matches the core but not hyperthread count.
> >
> > Allow userspace to override the heuristic. This is an alternative
> > solution to that in the linked patch. That added more logic in the
> > kernel for these cases, but the agreement was that this was better left
> > to user control.
> >
> > Do not expand the existing napi_tx variable to a ternary value,
> > because doing so can break user applications that expect
> > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> >
> > Link: https://patchwork.ozlabs.org/patch/725249/
> > Acked-by: Willem de Bruijn <willemb@google.com>
> > Acked-by: Jon Olson <jonolson@google.com>
> > Signed-off-by: Caleb Raitto <caraitto@google.com>
> > ---
>
> Not a fan of this.
> Module parameters are frowned on by the distributions because they
> never get well tested and they force the user to do magic things
> to enable features. It looks like you are using it to paper
> over a bug in this case.

This has actually been a catch-22 that this patch tries to break.

In micro benchmarks napi-tx was an improvement for most cases. We need
wider validation to make it default. Or even to start enabling it for some
users. But we cannot get the data, because understandably no one is
going to make it default without more data.

Enabling the feature selectively to safely roll out is the intent of
(temporary) param napi_tx. But the requirement to have
has_affinity_set is proving a real obstruction.

Especially in cases where we enable the feature to do A:B comparisons,
and thus are monitoring performance metrics closely, we should be able
to override the kernel heuristic.
Michael S. Tsirkin July 24, 2018, 10:42 a.m. UTC | #3
On Mon, Jul 23, 2018 at 04:11:19PM -0700, Caleb Raitto wrote:
> From: Caleb Raitto <caraitto@google.com>
> 
> The driver disables tx napi if it's not certain that completions will
> be processed affine with tx service.
> 
> Its heuristic doesn't account for some scenarios where it is, such as
> when the queue pair count matches the core but not hyperthread count.
> 
> Allow userspace to override the heuristic. This is an alternative
> solution to that in the linked patch. That added more logic in the
> kernel for these cases, but the agreement was that this was better left
> to user control.
> 
> Do not expand the existing napi_tx variable to a ternary value,
> because doing so can break user applications that expect
> boolean ('Y'/'N') instead of integer output. Add a new param instead.
> 
> Link: https://patchwork.ozlabs.org/patch/725249/
> Acked-by: Willem de Bruijn <willemb@google.com>
> Acked-by: Jon Olson <jonolson@google.com>
> Signed-off-by: Caleb Raitto <caraitto@google.com>

Is there a reason the same rule should apply to all devices?
If not shouldn't this be an ethtool option?

> ---
>  drivers/net/virtio_net.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 2ff08bc103a9..d9aca4e90d6b 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -39,10 +39,11 @@
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
>  
> -static bool csum = true, gso = true, napi_tx;
> +static bool csum = true, gso = true, napi_tx, force_napi_tx;
>  module_param(csum, bool, 0444);
>  module_param(gso, bool, 0444);
>  module_param(napi_tx, bool, 0644);
> +module_param(force_napi_tx, bool, 0644);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> @@ -1201,7 +1202,7 @@ static void virtnet_napi_tx_enable(struct virtnet_info *vi,
>  	/* Tx napi touches cachelines on the cpu handling tx interrupts. Only
>  	 * enable the feature if this is likely affine with the transmit path.
>  	 */
> -	if (!vi->affinity_hint_set) {
> +	if (!vi->affinity_hint_set && !force_napi_tx) {
>  		napi->weight = 0;
>  		return;
>  	}
> @@ -2646,7 +2647,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>  		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
>  			       napi_weight);
>  		netif_tx_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
> -				  napi_tx ? napi_weight : 0);
> +				  (napi_tx || force_napi_tx) ? napi_weight : 0);
>  
>  		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
>  		ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);
> -- 
> 2.18.0.233.g985f88cf7e-goog
Willem de Bruijn July 24, 2018, 2:01 p.m. UTC | #4
On Tue, Jul 24, 2018 at 6:44 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Jul 23, 2018 at 04:11:19PM -0700, Caleb Raitto wrote:
> > From: Caleb Raitto <caraitto@google.com>
> >
> > The driver disables tx napi if it's not certain that completions will
> > be processed affine with tx service.
> >
> > Its heuristic doesn't account for some scenarios where it is, such as
> > when the queue pair count matches the core but not hyperthread count.
> >
> > Allow userspace to override the heuristic. This is an alternative
> > solution to that in the linked patch. That added more logic in the
> > kernel for these cases, but the agreement was that this was better left
> > to user control.
> >
> > Do not expand the existing napi_tx variable to a ternary value,
> > because doing so can break user applications that expect
> > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> >
> > Link: https://patchwork.ozlabs.org/patch/725249/
> > Acked-by: Willem de Bruijn <willemb@google.com>
> > Acked-by: Jon Olson <jonolson@google.com>
> > Signed-off-by: Caleb Raitto <caraitto@google.com>
>
> Is there a reason the same rule should apply to all devices?
> If not shouldn't this be an ethtool option?

It not very likely that a guest will have multiple virtio_net devices,
some of which have affinity_hint_set and some do not?

I'd really rather not add the extra option at all, but remove
the affinity_hint_set requirement for now. Without more data,
I understand the concern about cacheline bouncing if napi-tx
would becomes the default at some point and we don't have
data on this by then. But while it isn't default and a user has to
opt in to napi-tx to try it that seems enough guardrail to me.

The original reason was lack of data on whether napi-tx may suffer
from cache invalidations when tx and rx softirq are on different cpus
and we enable tx descriptor cleaning from the rx handler (i.e., on ACK).
From those initial numbers it seemed to be a win even with those
invalidations.

  https://patchwork.ozlabs.org/patch/746232/

In lieu of removing the affinity_hint_set, this boolean is the least amount
of both new interface and implementation to allow experimentation. We
can easily leave it as a noop eventually when we are confident that
napi-tx can be enabled even without affinity. By comparison, an ethtool
param would be quite a bit of new logic.
Michael S. Tsirkin July 24, 2018, 6:38 p.m. UTC | #5
On Tue, Jul 24, 2018 at 10:01:39AM -0400, Willem de Bruijn wrote:
> On Tue, Jul 24, 2018 at 6:44 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Jul 23, 2018 at 04:11:19PM -0700, Caleb Raitto wrote:
> > > From: Caleb Raitto <caraitto@google.com>
> > >
> > > The driver disables tx napi if it's not certain that completions will
> > > be processed affine with tx service.
> > >
> > > Its heuristic doesn't account for some scenarios where it is, such as
> > > when the queue pair count matches the core but not hyperthread count.
> > >
> > > Allow userspace to override the heuristic. This is an alternative
> > > solution to that in the linked patch. That added more logic in the
> > > kernel for these cases, but the agreement was that this was better left
> > > to user control.
> > >
> > > Do not expand the existing napi_tx variable to a ternary value,
> > > because doing so can break user applications that expect
> > > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > >
> > > Link: https://patchwork.ozlabs.org/patch/725249/
> > > Acked-by: Willem de Bruijn <willemb@google.com>
> > > Acked-by: Jon Olson <jonolson@google.com>
> > > Signed-off-by: Caleb Raitto <caraitto@google.com>
> >
> > Is there a reason the same rule should apply to all devices?
> > If not shouldn't this be an ethtool option?
> 
> It not very likely that a guest will have multiple virtio_net devices,
> some of which have affinity_hint_set and some do not?

Just to answer this question, I do hear a lot about guests with multiple
virtio net interfaces.  These might be just the noisy few, but they do
exist.

> I'd really rather not add the extra option at all, but remove
> the affinity_hint_set requirement for now. Without more data,
> I understand the concern about cacheline bouncing if napi-tx
> would becomes the default at some point and we don't have
> data on this by then. But while it isn't default and a user has to
> opt in to napi-tx to try it that seems enough guardrail to me.
> 
> The original reason was lack of data on whether napi-tx may suffer
> from cache invalidations when tx and rx softirq are on different cpus
> and we enable tx descriptor cleaning from the rx handler (i.e., on ACK).
> >From those initial numbers it seemed to be a win even with those
> invalidations.
> 
>   https://patchwork.ozlabs.org/patch/746232/
> 
> In lieu of removing the affinity_hint_set, this boolean is the least amount
> of both new interface and implementation to allow experimentation. We
> can easily leave it as a noop eventually when we are confident that
> napi-tx can be enabled even without affinity. By comparison, an ethtool
> param would be quite a bit of new logic.

So it boils down to this: if we believe napi tx is
always a good idea, then this flag is there just in case,
and the patch is fine. If it's good for some workloads
but not others, and will be for a while, we are better off
with an ethtool flag.

What's the case here?

OTOH if you want to add more trick to the affinity hint, such
as the mentioned above # of queues matching core count,
that is also fine IMHO.
Willem de Bruijn July 24, 2018, 8:52 p.m. UTC | #6
On Tue, Jul 24, 2018 at 2:39 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jul 24, 2018 at 10:01:39AM -0400, Willem de Bruijn wrote:
> > On Tue, Jul 24, 2018 at 6:44 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Mon, Jul 23, 2018 at 04:11:19PM -0700, Caleb Raitto wrote:
> > > > From: Caleb Raitto <caraitto@google.com>
> > > >
> > > > The driver disables tx napi if it's not certain that completions will
> > > > be processed affine with tx service.
> > > >
> > > > Its heuristic doesn't account for some scenarios where it is, such as
> > > > when the queue pair count matches the core but not hyperthread count.
> > > >
> > > > Allow userspace to override the heuristic. This is an alternative
> > > > solution to that in the linked patch. That added more logic in the
> > > > kernel for these cases, but the agreement was that this was better left
> > > > to user control.
> > > >
> > > > Do not expand the existing napi_tx variable to a ternary value,
> > > > because doing so can break user applications that expect
> > > > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > > >
> > > > Link: https://patchwork.ozlabs.org/patch/725249/
> > > > Acked-by: Willem de Bruijn <willemb@google.com>
> > > > Acked-by: Jon Olson <jonolson@google.com>
> > > > Signed-off-by: Caleb Raitto <caraitto@google.com>
> > >
> > > Is there a reason the same rule should apply to all devices?
> > > If not shouldn't this be an ethtool option?
> >
> > It not very likely that a guest will have multiple virtio_net devices,
> > some of which have affinity_hint_set and some do not?
>
> Just to answer this question, I do hear a lot about guests with multiple
> virtio net interfaces.  These might be just the noisy few, but they do
> exist.
>
> > I'd really rather not add the extra option at all, but remove
> > the affinity_hint_set requirement for now. Without more data,
> > I understand the concern about cacheline bouncing if napi-tx
> > would becomes the default at some point and we don't have
> > data on this by then. But while it isn't default and a user has to
> > opt in to napi-tx to try it that seems enough guardrail to me.
> >
> > The original reason was lack of data on whether napi-tx may suffer
> > from cache invalidations when tx and rx softirq are on different cpus
> > and we enable tx descriptor cleaning from the rx handler (i.e., on ACK).
> > >From those initial numbers it seemed to be a win even with those
> > invalidations.
> >
> >   https://patchwork.ozlabs.org/patch/746232/
> >
> > In lieu of removing the affinity_hint_set, this boolean is the least amount
> > of both new interface and implementation to allow experimentation. We
> > can easily leave it as a noop eventually when we are confident that
> > napi-tx can be enabled even without affinity. By comparison, an ethtool
> > param would be quite a bit of new logic.
>
> So it boils down to this: if we believe napi tx is
> always a good idea, then this flag is there just in case,
> and the patch is fine. If it's good for some workloads
> but not others, and will be for a while, we are better off
> with an ethtool flag.
>
> What's the case here?

Based on benchmark results so far, it looks like the first. It's markedly
better for some and in the noise for most.

The crux is the "for a while". We undoubtedly will find some cases that
we need to fix before flipping the default.

The choice is between using napi_tx vs adding a separate ethtool function
to do essentially the same. This patch is the simpler option, and easier to
backport to guest kernels.

> OTOH if you want to add more trick to the affinity hint, such
> as the mentioned above # of queues matching core count,
> that is also fine IMHO.

From the above linked patch, I understand that there are yet
other special cases in production, such as a hard cap on #tx queues to
32 regardless of number of vcpus. I don't think we want to add special
cases for all this kind of business logic in the kernel.
Michael S. Tsirkin July 24, 2018, 10:23 p.m. UTC | #7
On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> >From the above linked patch, I understand that there are yet
> other special cases in production, such as a hard cap on #tx queues to
> 32 regardless of number of vcpus.

I don't think upstream kernels have this limit - we can
now use vmalloc for higher number of queues.
Willem de Bruijn July 24, 2018, 10:31 p.m. UTC | #8
On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > >From the above linked patch, I understand that there are yet
> > other special cases in production, such as a hard cap on #tx queues to
> > 32 regardless of number of vcpus.
>
> I don't think upstream kernels have this limit - we can
> now use vmalloc for higher number of queues.

Yes. that patch* mentioned it as a google compute engine imposed
limit. It is exactly such cloud provider imposed rules that I'm
concerned about working around in upstream drivers.

* for reference, I mean https://patchwork.ozlabs.org/patch/725249/
Michael S. Tsirkin July 24, 2018, 10:46 p.m. UTC | #9
On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > > >From the above linked patch, I understand that there are yet
> > > other special cases in production, such as a hard cap on #tx queues to
> > > 32 regardless of number of vcpus.
> >
> > I don't think upstream kernels have this limit - we can
> > now use vmalloc for higher number of queues.
> 
> Yes. that patch* mentioned it as a google compute engine imposed
> limit. It is exactly such cloud provider imposed rules that I'm
> concerned about working around in upstream drivers.
> 
> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/

Yea. Why does GCE do it btw?
Willem de Bruijn July 25, 2018, 12:02 a.m. UTC | #10
On Tue, Jul 24, 2018 at 6:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > > > >From the above linked patch, I understand that there are yet
> > > > other special cases in production, such as a hard cap on #tx queues to
> > > > 32 regardless of number of vcpus.
> > >
> > > I don't think upstream kernels have this limit - we can
> > > now use vmalloc for higher number of queues.
> >
> > Yes. that patch* mentioned it as a google compute engine imposed
> > limit. It is exactly such cloud provider imposed rules that I'm
> > concerned about working around in upstream drivers.
> >
> > * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>
> Yea. Why does GCE do it btw?

I have no idea, either. I guess some host-side constraint.
Jon Olson July 25, 2018, 12:17 a.m. UTC | #11
On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > > > >From the above linked patch, I understand that there are yet
> > > > other special cases in production, such as a hard cap on #tx queues to
> > > > 32 regardless of number of vcpus.
> > >
> > > I don't think upstream kernels have this limit - we can
> > > now use vmalloc for higher number of queues.
> >
> > Yes. that patch* mentioned it as a google compute engine imposed
> > limit. It is exactly such cloud provider imposed rules that I'm
> > concerned about working around in upstream drivers.
> >
> > * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>
> Yea. Why does GCE do it btw?

There are a few reasons for the limit, some historical, some current.

Historically we did this because of a kernel limit on the number of
TAP queues (in Montreal I thought this limit was 32). To my chagrin,
the limit upstream at the time we did it was actually eight. We had
increased the limit from eight to 32 internally, and it appears in
upstream it has subsequently increased upstream to 256. We no longer
use TAP for networking, so that constraint no longer applies for us,
but when looking at removing/raising the limit we discovered no
workloads that clearly benefited from lifting it, and it also placed
more pressure on our virtual networking stack particularly on the Tx
side. We left it as-is.

In terms of current reasons there are really two. One is memory usage.
As you know, virtio-net uses rx/tx pairs, so there's an expectation
that the guest will have an Rx queue for every Tx queue. We run our
individual virtqueues fairly deep (4096 entries) to give guests a wide
time window for re-posting Rx buffers and avoiding starvation on
packet delivery. Filling an Rx vring with max-sized mergeable buffers
(4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
be up to 512MB of memory posted for network buffers. Scaling this to
the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
all of the Rx rings full would (in the large average Rx packet size
case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
of RAM available, but I don't believe we've observed a situation where
they would have benefited from having 2.5 gigs of buffers posted for
incoming network traffic :)

The second reason is interrupt related -- as I mentioned above, we
have found no workloads that clearly benefit from so many queues, but
we have found workloads that degrade. In particular workloads that do
a lot of small packet processing but which aren't extremely latency
sensitive can achieve higher PPS by taking fewer interrupt across
fewer VCPUs due to better batching (this also incurs higher latency,
but at the limit the "busy" cores end up suppressing most interrupts
and spending most of their cycles farming out work). Memcache is a
good example here, particularly if the latency targets for request
completion are in the ~milliseconds range (rather than the
microseconds we typically strive for with TCP_RR-style workloads).

All of that said, we haven't been forthcoming with data (and
unfortunately I don't have it handy in a useful form, otherwise I'd
simply post it here), so I understand the hesitation to simply run
with napi_tx across the board. As Willem said, this patch seemed like
the least disruptive way to allow us to continue down the road of
"universal" NAPI Tx and to hopefully get data across enough workloads
(with VMs small, large, and absurdly large :) to present a compelling
argument in one direction or another. As far as I know there aren't
currently any NAPI related ethtool commands (based on a quick perusal
of ethtool.h) -- it seems like it would be fairly involved/heavyweight
to plumb one solely for this unless NAPI Tx is something many users
will want to tune (and for which other drivers would support tuning).

--
Jon Olson
David Miller July 29, 2018, 4 p.m. UTC | #12
From: Caleb Raitto <caleb.raitto@gmail.com>
Date: Mon, 23 Jul 2018 16:11:19 -0700

> From: Caleb Raitto <caraitto@google.com>
> 
> The driver disables tx napi if it's not certain that completions will
> be processed affine with tx service.
> 
> Its heuristic doesn't account for some scenarios where it is, such as
> when the queue pair count matches the core but not hyperthread count.
> 
> Allow userspace to override the heuristic. This is an alternative
> solution to that in the linked patch. That added more logic in the
> kernel for these cases, but the agreement was that this was better left
> to user control.
> 
> Do not expand the existing napi_tx variable to a ternary value,
> because doing so can break user applications that expect
> boolean ('Y'/'N') instead of integer output. Add a new param instead.
> 
> Link: https://patchwork.ozlabs.org/patch/725249/
> Acked-by: Willem de Bruijn <willemb@google.com>
> Acked-by: Jon Olson <jonolson@google.com>
> Signed-off-by: Caleb Raitto <caraitto@google.com>

So I looked into the history surrounding these issues.

First of all, it's always ends up turning out crummy when drivers start
to set affinities themselves.  The worst possible case is to do it
_conditionally_, and that is exactly what virtio_net is doing.

From the user's perspective, this provides a really bad experience.

So if I have a 32-queue device and there are 32 cpus, you'll do all
the affinity settings, stopping Irqbalanced from doing anything
right?

So if I add one more cpu, you'll say "oops, no idea what to do in
this situation" and not touch the affinities at all?

That makes no sense at all.

If the driver is going to set affinities at all, OWN that decision
and set it all the time to something reasonable.

Or accept that you shouldn't be touching this stuff in the first place
and leave the affinities alone.

Right now we're kinda in a situation where the driver has been setting
affinities in the ncpus==nqueues cases for some time, so we can't stop
doing it.

Which means we have to set them in all cases to make the user
experience sane again.

I looked at the linked to patch again:

	https://patchwork.ozlabs.org/patch/725249/

And I think the strategy should be made more generic, to get rid of
the hyperthreading assumptions.  I also agree that the "assign
to first N cpus" logic doesn't make much sense either.

Just distribute across the available cpus evenly, and be done with it.
If you have 64 cpus and 32 queues, this assigns queues to every other
cpu.

Then we don't need this weird new module parameter.
Michael S. Tsirkin July 29, 2018, 8:33 p.m. UTC | #13
On Sun, Jul 29, 2018 at 09:00:27AM -0700, David Miller wrote:
> From: Caleb Raitto <caleb.raitto@gmail.com>
> Date: Mon, 23 Jul 2018 16:11:19 -0700
> 
> > From: Caleb Raitto <caraitto@google.com>
> > 
> > The driver disables tx napi if it's not certain that completions will
> > be processed affine with tx service.
> > 
> > Its heuristic doesn't account for some scenarios where it is, such as
> > when the queue pair count matches the core but not hyperthread count.
> > 
> > Allow userspace to override the heuristic. This is an alternative
> > solution to that in the linked patch. That added more logic in the
> > kernel for these cases, but the agreement was that this was better left
> > to user control.
> > 
> > Do not expand the existing napi_tx variable to a ternary value,
> > because doing so can break user applications that expect
> > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > 
> > Link: https://patchwork.ozlabs.org/patch/725249/
> > Acked-by: Willem de Bruijn <willemb@google.com>
> > Acked-by: Jon Olson <jonolson@google.com>
> > Signed-off-by: Caleb Raitto <caraitto@google.com>
> 
> So I looked into the history surrounding these issues.
> 
> First of all, it's always ends up turning out crummy when drivers start
> to set affinities themselves.  The worst possible case is to do it
> _conditionally_, and that is exactly what virtio_net is doing.
> 
> >From the user's perspective, this provides a really bad experience.
> 
> So if I have a 32-queue device and there are 32 cpus, you'll do all
> the affinity settings, stopping Irqbalanced from doing anything
> right?
> 
> So if I add one more cpu, you'll say "oops, no idea what to do in
> this situation" and not touch the affinities at all?
> 
> That makes no sense at all.
> 
> If the driver is going to set affinities at all, OWN that decision
> and set it all the time to something reasonable.
> 
> Or accept that you shouldn't be touching this stuff in the first place
> and leave the affinities alone.
> 
> Right now we're kinda in a situation where the driver has been setting
> affinities in the ncpus==nqueues cases for some time, so we can't stop
> doing it.
> 
> Which means we have to set them in all cases to make the user
> experience sane again.
> 
> I looked at the linked to patch again:
> 
> 	https://patchwork.ozlabs.org/patch/725249/
> 
> And I think the strategy should be made more generic, to get rid of
> the hyperthreading assumptions.  I also agree that the "assign
> to first N cpus" logic doesn't make much sense either.
> 
> Just distribute across the available cpus evenly, and be done with it.
> If you have 64 cpus and 32 queues, this assigns queues to every other
> cpu.
> 
> Then we don't need this weird new module parameter.

Can't we set affinity to a set of CPUs?

The point really is that tx irq handler needs a lock on the tx queue to
free up skbs, so processing it on another CPU while tx is active causes
cache line bounces. So we want affinity to CPUs that submit to this
queue on the theory they have these cache line(s) anyway.

I suspect it's not a uniqueue property of virtio.
David Miller July 29, 2018, 8:36 p.m. UTC | #14
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Sun, 29 Jul 2018 23:33:20 +0300

> The point really is that tx irq handler needs a lock on the tx queue to
> free up skbs, so processing it on another CPU while tx is active causes
> cache line bounces. So we want affinity to CPUs that submit to this
> queue on the theory they have these cache line(s) anyway.
> 
> I suspect it's not a uniqueue property of virtio.

It certainly is not.

I think the objectives are clear, someone just needs to put them
together cleanly into a patch :)
Willem de Bruijn July 29, 2018, 9:09 p.m. UTC | #15
On Sun, Jul 29, 2018 at 4:48 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Jul 29, 2018 at 09:00:27AM -0700, David Miller wrote:
> > From: Caleb Raitto <caleb.raitto@gmail.com>
> > Date: Mon, 23 Jul 2018 16:11:19 -0700
> >
> > > From: Caleb Raitto <caraitto@google.com>
> > >
> > > The driver disables tx napi if it's not certain that completions will
> > > be processed affine with tx service.
> > >
> > > Its heuristic doesn't account for some scenarios where it is, such as
> > > when the queue pair count matches the core but not hyperthread count.
> > >
> > > Allow userspace to override the heuristic. This is an alternative
> > > solution to that in the linked patch. That added more logic in the
> > > kernel for these cases, but the agreement was that this was better left
> > > to user control.
> > >
> > > Do not expand the existing napi_tx variable to a ternary value,
> > > because doing so can break user applications that expect
> > > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > >
> > > Link: https://patchwork.ozlabs.org/patch/725249/
> > > Acked-by: Willem de Bruijn <willemb@google.com>
> > > Acked-by: Jon Olson <jonolson@google.com>
> > > Signed-off-by: Caleb Raitto <caraitto@google.com>
> >
> > So I looked into the history surrounding these issues.
> >
> > First of all, it's always ends up turning out crummy when drivers start
> > to set affinities themselves.  The worst possible case is to do it
> > _conditionally_, and that is exactly what virtio_net is doing.
> >
> > >From the user's perspective, this provides a really bad experience.
> >
> > So if I have a 32-queue device and there are 32 cpus, you'll do all
> > the affinity settings, stopping Irqbalanced from doing anything
> > right?
> >
> > So if I add one more cpu, you'll say "oops, no idea what to do in
> > this situation" and not touch the affinities at all?
> >
> > That makes no sense at all.
> >
> > If the driver is going to set affinities at all, OWN that decision
> > and set it all the time to something reasonable.
> >
> > Or accept that you shouldn't be touching this stuff in the first place
> > and leave the affinities alone.
> >
> > Right now we're kinda in a situation where the driver has been setting
> > affinities in the ncpus==nqueues cases for some time, so we can't stop
> > doing it.
> >
> > Which means we have to set them in all cases to make the user
> > experience sane again.
> >
> > I looked at the linked to patch again:
> >
> >       https://patchwork.ozlabs.org/patch/725249/
> >
> > And I think the strategy should be made more generic, to get rid of
> > the hyperthreading assumptions.  I also agree that the "assign
> > to first N cpus" logic doesn't make much sense either.
> >
> > Just distribute across the available cpus evenly, and be done with it.
> > If you have 64 cpus and 32 queues, this assigns queues to every other
> > cpu.
> >
> > Then we don't need this weird new module parameter.
>
> Can't we set affinity to a set of CPUs?
>
> The point really is that tx irq handler needs a lock on the tx queue to
> free up skbs, so processing it on another CPU while tx is active causes
> cache line bounces. So we want affinity to CPUs that submit to this
> queue on the theory they have these cache line(s) anyway.
>
> I suspect it's not a uniqueue property of virtio.

It is a good heuristic. But as Jon pointed out, there is a trade-off
with other costs, such as increased interrupt load with additional
tx queues. This is particularly stark when multiple hyperthreads
share a cache, but have independent tx queues.
Willem de Bruijn July 29, 2018, 9:32 p.m. UTC | #16
On Sun, Jul 29, 2018 at 12:01 PM David Miller <davem@davemloft.net> wrote:
>
> From: Caleb Raitto <caleb.raitto@gmail.com>
> Date: Mon, 23 Jul 2018 16:11:19 -0700
>
> > From: Caleb Raitto <caraitto@google.com>
> >
> > The driver disables tx napi if it's not certain that completions will
> > be processed affine with tx service.
> >
> > Its heuristic doesn't account for some scenarios where it is, such as
> > when the queue pair count matches the core but not hyperthread count.
> >
> > Allow userspace to override the heuristic. This is an alternative
> > solution to that in the linked patch. That added more logic in the
> > kernel for these cases, but the agreement was that this was better left
> > to user control.
> >
> > Do not expand the existing napi_tx variable to a ternary value,
> > because doing so can break user applications that expect
> > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> >
> > Link: https://patchwork.ozlabs.org/patch/725249/
> > Acked-by: Willem de Bruijn <willemb@google.com>
> > Acked-by: Jon Olson <jonolson@google.com>
> > Signed-off-by: Caleb Raitto <caraitto@google.com>
>
> So I looked into the history surrounding these issues.
>
> First of all, it's always ends up turning out crummy when drivers start
> to set affinities themselves.  The worst possible case is to do it
> _conditionally_, and that is exactly what virtio_net is doing.
>
> From the user's perspective, this provides a really bad experience.
>
> So if I have a 32-queue device and there are 32 cpus, you'll do all
> the affinity settings, stopping Irqbalanced from doing anything
> right?
>
> So if I add one more cpu, you'll say "oops, no idea what to do in
> this situation" and not touch the affinities at all?
>
> That makes no sense at all.
>
> If the driver is going to set affinities at all, OWN that decision
> and set it all the time to something reasonable.
>
> Or accept that you shouldn't be touching this stuff in the first place
> and leave the affinities alone.
>
> Right now we're kinda in a situation where the driver has been setting
> affinities in the ncpus==nqueues cases for some time, so we can't stop
> doing it.
>
> Which means we have to set them in all cases to make the user
> experience sane again.
>
> I looked at the linked to patch again:
>
>         https://patchwork.ozlabs.org/patch/725249/
>
> And I think the strategy should be made more generic, to get rid of
> the hyperthreading assumptions.  I also agree that the "assign
> to first N cpus" logic doesn't make much sense either.
>
> Just distribute across the available cpus evenly, and be done with it.

Sounds good to me.

> If you have 64 cpus and 32 queues, this assigns queues to every other
> cpu.

Striping half the number of queues as cores on a hyperthreaded system
with two logical cores per physical core will allocate all queues on
only half the physical cores.

But it is probably not safe to make any assumptions on virtual to
physical core mapping, anyway, which makes the simplest strategy is
preferable.
Jason Wang July 30, 2018, 6:06 a.m. UTC | #17
On 2018年07月25日 08:17, Jon Olson wrote:
> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
>>>>> >From the above linked patch, I understand that there are yet
>>>>> other special cases in production, such as a hard cap on #tx queues to
>>>>> 32 regardless of number of vcpus.
>>>> I don't think upstream kernels have this limit - we can
>>>> now use vmalloc for higher number of queues.
>>> Yes. that patch* mentioned it as a google compute engine imposed
>>> limit. It is exactly such cloud provider imposed rules that I'm
>>> concerned about working around in upstream drivers.
>>>
>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>> Yea. Why does GCE do it btw?
> There are a few reasons for the limit, some historical, some current.
>
> Historically we did this because of a kernel limit on the number of
> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> the limit upstream at the time we did it was actually eight. We had
> increased the limit from eight to 32 internally, and it appears in
> upstream it has subsequently increased upstream to 256. We no longer
> use TAP for networking, so that constraint no longer applies for us,
> but when looking at removing/raising the limit we discovered no
> workloads that clearly benefited from lifting it, and it also placed
> more pressure on our virtual networking stack particularly on the Tx
> side. We left it as-is.
>
> In terms of current reasons there are really two. One is memory usage.
> As you know, virtio-net uses rx/tx pairs, so there's an expectation
> that the guest will have an Rx queue for every Tx queue. We run our
> individual virtqueues fairly deep (4096 entries) to give guests a wide
> time window for re-posting Rx buffers and avoiding starvation on
> packet delivery. Filling an Rx vring with max-sized mergeable buffers
> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> be up to 512MB of memory posted for network buffers. Scaling this to
> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> all of the Rx rings full would (in the large average Rx packet size
> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> of RAM available, but I don't believe we've observed a situation where
> they would have benefited from having 2.5 gigs of buffers posted for
> incoming network traffic :)

We can work to have async txq and rxq instead of paris if there's a 
strong requirement.

>
> The second reason is interrupt related -- as I mentioned above, we
> have found no workloads that clearly benefit from so many queues, but
> we have found workloads that degrade. In particular workloads that do
> a lot of small packet processing but which aren't extremely latency
> sensitive can achieve higher PPS by taking fewer interrupt across
> fewer VCPUs due to better batching (this also incurs higher latency,
> but at the limit the "busy" cores end up suppressing most interrupts
> and spending most of their cycles farming out work). Memcache is a
> good example here, particularly if the latency targets for request
> completion are in the ~milliseconds range (rather than the
> microseconds we typically strive for with TCP_RR-style workloads).
>
> All of that said, we haven't been forthcoming with data (and
> unfortunately I don't have it handy in a useful form, otherwise I'd
> simply post it here), so I understand the hesitation to simply run
> with napi_tx across the board. As Willem said, this patch seemed like
> the least disruptive way to allow us to continue down the road of
> "universal" NAPI Tx and to hopefully get data across enough workloads
> (with VMs small, large, and absurdly large :) to present a compelling
> argument in one direction or another. As far as I know there aren't
> currently any NAPI related ethtool commands (based on a quick perusal
> of ethtool.h)

As I suggest before, maybe we can (ab)use tx-frames-irq.

Thanks

> -- it seems like it would be fairly involved/heavyweight
> to plumb one solely for this unless NAPI Tx is something many users
> will want to tune (and for which other drivers would support tuning).
>
> --
> Jon Olson
Stephen Hemminger July 30, 2018, 7:51 p.m. UTC | #18
On Sun, 29 Jul 2018 09:00:27 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Caleb Raitto <caleb.raitto@gmail.com>
> Date: Mon, 23 Jul 2018 16:11:19 -0700
> 
> > From: Caleb Raitto <caraitto@google.com>
> > 
> > The driver disables tx napi if it's not certain that completions will
> > be processed affine with tx service.
> > 
> > Its heuristic doesn't account for some scenarios where it is, such as
> > when the queue pair count matches the core but not hyperthread count.
> > 
> > Allow userspace to override the heuristic. This is an alternative
> > solution to that in the linked patch. That added more logic in the
> > kernel for these cases, but the agreement was that this was better left
> > to user control.
> > 
> > Do not expand the existing napi_tx variable to a ternary value,
> > because doing so can break user applications that expect
> > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > 
> > Link: https://patchwork.ozlabs.org/patch/725249/
> > Acked-by: Willem de Bruijn <willemb@google.com>
> > Acked-by: Jon Olson <jonolson@google.com>
> > Signed-off-by: Caleb Raitto <caraitto@google.com>  
> 
> So I looked into the history surrounding these issues.
> 
> First of all, it's always ends up turning out crummy when drivers start
> to set affinities themselves.  The worst possible case is to do it
> _conditionally_, and that is exactly what virtio_net is doing.
> 
> From the user's perspective, this provides a really bad experience.
> 
> So if I have a 32-queue device and there are 32 cpus, you'll do all
> the affinity settings, stopping Irqbalanced from doing anything
> right?
> 
> So if I add one more cpu, you'll say "oops, no idea what to do in
> this situation" and not touch the affinities at all?
> 
> That makes no sense at all.
> 
> If the driver is going to set affinities at all, OWN that decision
> and set it all the time to something reasonable.
> 
> Or accept that you shouldn't be touching this stuff in the first place
> and leave the affinities alone.
> 
> Right now we're kinda in a situation where the driver has been setting
> affinities in the ncpus==nqueues cases for some time, so we can't stop
> doing it.
> 
> Which means we have to set them in all cases to make the user
> experience sane again.
> 
> I looked at the linked to patch again:
> 
> 	https://patchwork.ozlabs.org/patch/725249/
> 
> And I think the strategy should be made more generic, to get rid of
> the hyperthreading assumptions.  I also agree that the "assign
> to first N cpus" logic doesn't make much sense either.
> 
> Just distribute across the available cpus evenly, and be done with it.
> If you have 64 cpus and 32 queues, this assigns queues to every other
> cpu.
> 
> Then we don't need this weird new module parameter.

I wonder if it would be possible to give irqbalanced hints
with irq_set_affinity_hint instead of doing direct affinity setting?
Jason Wang July 31, 2018, 1:41 a.m. UTC | #19
On 2018年07月31日 03:51, Stephen Hemminger wrote:
> On Sun, 29 Jul 2018 09:00:27 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
>
>> From: Caleb Raitto <caleb.raitto@gmail.com>
>> Date: Mon, 23 Jul 2018 16:11:19 -0700
>>
>>> From: Caleb Raitto <caraitto@google.com>
>>>
>>> The driver disables tx napi if it's not certain that completions will
>>> be processed affine with tx service.
>>>
>>> Its heuristic doesn't account for some scenarios where it is, such as
>>> when the queue pair count matches the core but not hyperthread count.
>>>
>>> Allow userspace to override the heuristic. This is an alternative
>>> solution to that in the linked patch. That added more logic in the
>>> kernel for these cases, but the agreement was that this was better left
>>> to user control.
>>>
>>> Do not expand the existing napi_tx variable to a ternary value,
>>> because doing so can break user applications that expect
>>> boolean ('Y'/'N') instead of integer output. Add a new param instead.
>>>
>>> Link: https://patchwork.ozlabs.org/patch/725249/
>>> Acked-by: Willem de Bruijn <willemb@google.com>
>>> Acked-by: Jon Olson <jonolson@google.com>
>>> Signed-off-by: Caleb Raitto <caraitto@google.com>
>> So I looked into the history surrounding these issues.
>>
>> First of all, it's always ends up turning out crummy when drivers start
>> to set affinities themselves.  The worst possible case is to do it
>> _conditionally_, and that is exactly what virtio_net is doing.
>>
>>  From the user's perspective, this provides a really bad experience.
>>
>> So if I have a 32-queue device and there are 32 cpus, you'll do all
>> the affinity settings, stopping Irqbalanced from doing anything
>> right?
>>
>> So if I add one more cpu, you'll say "oops, no idea what to do in
>> this situation" and not touch the affinities at all?
>>
>> That makes no sense at all.
>>
>> If the driver is going to set affinities at all, OWN that decision
>> and set it all the time to something reasonable.
>>
>> Or accept that you shouldn't be touching this stuff in the first place
>> and leave the affinities alone.
>>
>> Right now we're kinda in a situation where the driver has been setting
>> affinities in the ncpus==nqueues cases for some time, so we can't stop
>> doing it.
>>
>> Which means we have to set them in all cases to make the user
>> experience sane again.
>>
>> I looked at the linked to patch again:
>>
>> 	https://patchwork.ozlabs.org/patch/725249/
>>
>> And I think the strategy should be made more generic, to get rid of
>> the hyperthreading assumptions.  I also agree that the "assign
>> to first N cpus" logic doesn't make much sense either.
>>
>> Just distribute across the available cpus evenly, and be done with it.
>> If you have 64 cpus and 32 queues, this assigns queues to every other
>> cpu.
>>
>> Then we don't need this weird new module parameter.
> I wonder if it would be possible to give irqbalanced hints
> with irq_set_affinity_hint instead of doing direct affinity setting?

We do use hints instead of affinity directly.

See vp_set_vq_affinity(), which did:

     if (vp_dev->msix_enabled) {
         mask = vp_dev->msix_affinity_masks[info->msix_vector];
         irq = pci_irq_vector(vp_dev->pci_dev, info->msix_vector);
         if (cpu == -1)
             irq_set_affinity_hint(irq, NULL);
         else {
             cpumask_clear(mask);
             cpumask_set_cpu(cpu, mask);
             irq_set_affinity_hint(irq, mask);
         }
     }

Thanks.
Michael S. Tsirkin July 31, 2018, 12:34 p.m. UTC | #20
On Sun, Jul 29, 2018 at 05:32:56PM -0400, Willem de Bruijn wrote:
> On Sun, Jul 29, 2018 at 12:01 PM David Miller <davem@davemloft.net> wrote:
> >
> > From: Caleb Raitto <caleb.raitto@gmail.com>
> > Date: Mon, 23 Jul 2018 16:11:19 -0700
> >
> > > From: Caleb Raitto <caraitto@google.com>
> > >
> > > The driver disables tx napi if it's not certain that completions will
> > > be processed affine with tx service.
> > >
> > > Its heuristic doesn't account for some scenarios where it is, such as
> > > when the queue pair count matches the core but not hyperthread count.
> > >
> > > Allow userspace to override the heuristic. This is an alternative
> > > solution to that in the linked patch. That added more logic in the
> > > kernel for these cases, but the agreement was that this was better left
> > > to user control.
> > >
> > > Do not expand the existing napi_tx variable to a ternary value,
> > > because doing so can break user applications that expect
> > > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > >
> > > Link: https://patchwork.ozlabs.org/patch/725249/
> > > Acked-by: Willem de Bruijn <willemb@google.com>
> > > Acked-by: Jon Olson <jonolson@google.com>
> > > Signed-off-by: Caleb Raitto <caraitto@google.com>
> >
> > So I looked into the history surrounding these issues.
> >
> > First of all, it's always ends up turning out crummy when drivers start
> > to set affinities themselves.  The worst possible case is to do it
> > _conditionally_, and that is exactly what virtio_net is doing.
> >
> > From the user's perspective, this provides a really bad experience.
> >
> > So if I have a 32-queue device and there are 32 cpus, you'll do all
> > the affinity settings, stopping Irqbalanced from doing anything
> > right?
> >
> > So if I add one more cpu, you'll say "oops, no idea what to do in
> > this situation" and not touch the affinities at all?
> >
> > That makes no sense at all.
> >
> > If the driver is going to set affinities at all, OWN that decision
> > and set it all the time to something reasonable.
> >
> > Or accept that you shouldn't be touching this stuff in the first place
> > and leave the affinities alone.
> >
> > Right now we're kinda in a situation where the driver has been setting
> > affinities in the ncpus==nqueues cases for some time, so we can't stop
> > doing it.
> >
> > Which means we have to set them in all cases to make the user
> > experience sane again.
> >
> > I looked at the linked to patch again:
> >
> >         https://patchwork.ozlabs.org/patch/725249/
> >
> > And I think the strategy should be made more generic, to get rid of
> > the hyperthreading assumptions.  I also agree that the "assign
> > to first N cpus" logic doesn't make much sense either.
> >
> > Just distribute across the available cpus evenly, and be done with it.
> 
> Sounds good to me.

So e.g. we could set an affinity hint to a group of CPUs that
might transmit to this queue.

> > If you have 64 cpus and 32 queues, this assigns queues to every other
> > cpu.
> 
> Striping half the number of queues as cores on a hyperthreaded system
> with two logical cores per physical core will allocate all queues on
> only half the physical cores.
> 
> But it is probably not safe to make any assumptions on virtual to
> physical core mapping, anyway, which makes the simplest strategy is
> preferable.
Willem de Bruijn Aug. 1, 2018, 3:46 p.m. UTC | #21
On Tue, Jul 31, 2018 at 8:34 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sun, Jul 29, 2018 at 05:32:56PM -0400, Willem de Bruijn wrote:
> > On Sun, Jul 29, 2018 at 12:01 PM David Miller <davem@davemloft.net> wrote:
> > >
> > > From: Caleb Raitto <caleb.raitto@gmail.com>
> > > Date: Mon, 23 Jul 2018 16:11:19 -0700
> > >
> > > > From: Caleb Raitto <caraitto@google.com>
> > > >
> > > > The driver disables tx napi if it's not certain that completions will
> > > > be processed affine with tx service.
> > > >
> > > > Its heuristic doesn't account for some scenarios where it is, such as
> > > > when the queue pair count matches the core but not hyperthread count.
> > > >
> > > > Allow userspace to override the heuristic. This is an alternative
> > > > solution to that in the linked patch. That added more logic in the
> > > > kernel for these cases, but the agreement was that this was better left
> > > > to user control.
> > > >
> > > > Do not expand the existing napi_tx variable to a ternary value,
> > > > because doing so can break user applications that expect
> > > > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > > >
> > > > Link: https://patchwork.ozlabs.org/patch/725249/
> > > > Acked-by: Willem de Bruijn <willemb@google.com>
> > > > Acked-by: Jon Olson <jonolson@google.com>
> > > > Signed-off-by: Caleb Raitto <caraitto@google.com>
> > >
> > > So I looked into the history surrounding these issues.
> > >
> > > First of all, it's always ends up turning out crummy when drivers start
> > > to set affinities themselves.  The worst possible case is to do it
> > > _conditionally_, and that is exactly what virtio_net is doing.
> > >
> > > From the user's perspective, this provides a really bad experience.
> > >
> > > So if I have a 32-queue device and there are 32 cpus, you'll do all
> > > the affinity settings, stopping Irqbalanced from doing anything
> > > right?
> > >
> > > So if I add one more cpu, you'll say "oops, no idea what to do in
> > > this situation" and not touch the affinities at all?
> > >
> > > That makes no sense at all.
> > >
> > > If the driver is going to set affinities at all, OWN that decision
> > > and set it all the time to something reasonable.
> > >
> > > Or accept that you shouldn't be touching this stuff in the first place
> > > and leave the affinities alone.
> > >
> > > Right now we're kinda in a situation where the driver has been setting
> > > affinities in the ncpus==nqueues cases for some time, so we can't stop
> > > doing it.
> > >
> > > Which means we have to set them in all cases to make the user
> > > experience sane again.
> > >
> > > I looked at the linked to patch again:
> > >
> > >         https://patchwork.ozlabs.org/patch/725249/
> > >
> > > And I think the strategy should be made more generic, to get rid of
> > > the hyperthreading assumptions.  I also agree that the "assign
> > > to first N cpus" logic doesn't make much sense either.
> > >
> > > Just distribute across the available cpus evenly, and be done with it.
> >
> > Sounds good to me.
>
> So e.g. we could set an affinity hint to a group of CPUs that
> might transmit to this queue.

We also want to set the xps mask for all cpus in the group to this queue.

Is there a benefit over explicitly choosing one cpu from the set, btw?
I assumed striping. Something along the lines of

  int stripe = max_t(int, num_online_cpus() / vi->curr_queue_pairs, 1);
  int vq = 0;

  cpumask_clear(xps_mask);

  for_each_online_cpu(cpu) {
      cpumask_set_cpu(cpu, xps_mask);

      if ((i + 1) % stripe == 0) {
          virtqueue_set_affinity(vi->rq[vq].vq, cpu);
          virtqueue_set_affinity(vi->sq[vq].vq, cpu);
          netif_set_xps_queue(vi->dev, xps_mask, vq);
          cpumask_clear(xps_mask);
          vq++;
     }
      i++;
  }
Willem de Bruijn Aug. 1, 2018, 3:56 p.m. UTC | #22
> > > > Just distribute across the available cpus evenly, and be done with it.
> > >
> > > Sounds good to me.
> >
> > So e.g. we could set an affinity hint to a group of CPUs that
> > might transmit to this queue.
>
> We also want to set the xps mask for all cpus in the group to this queue.
>
> Is there a benefit over explicitly choosing one cpu from the set, btw?
> I assumed striping. Something along the lines of
>
>   int stripe = max_t(int, num_online_cpus() / vi->curr_queue_pairs, 1);
>   int vq = 0;
>
>   cpumask_clear(xps_mask);
>
>   for_each_online_cpu(cpu) {
>       cpumask_set_cpu(cpu, xps_mask);
>
>       if ((i + 1) % stripe == 0) {
>           virtqueue_set_affinity(vi->rq[vq].vq, cpu);
>           virtqueue_set_affinity(vi->sq[vq].vq, cpu);
>           netif_set_xps_queue(vi->dev, xps_mask, vq);
>           cpumask_clear(xps_mask);
>           vq++;
>      }
>       i++;
>   }

.. but handling edge cases correctly, such as #cpu not being a perfect
multiple of #vq.
Michael S. Tsirkin Aug. 1, 2018, 10:25 p.m. UTC | #23
On Wed, Aug 01, 2018 at 11:46:14AM -0400, Willem de Bruijn wrote:
> On Tue, Jul 31, 2018 at 8:34 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Sun, Jul 29, 2018 at 05:32:56PM -0400, Willem de Bruijn wrote:
> > > On Sun, Jul 29, 2018 at 12:01 PM David Miller <davem@davemloft.net> wrote:
> > > >
> > > > From: Caleb Raitto <caleb.raitto@gmail.com>
> > > > Date: Mon, 23 Jul 2018 16:11:19 -0700
> > > >
> > > > > From: Caleb Raitto <caraitto@google.com>
> > > > >
> > > > > The driver disables tx napi if it's not certain that completions will
> > > > > be processed affine with tx service.
> > > > >
> > > > > Its heuristic doesn't account for some scenarios where it is, such as
> > > > > when the queue pair count matches the core but not hyperthread count.
> > > > >
> > > > > Allow userspace to override the heuristic. This is an alternative
> > > > > solution to that in the linked patch. That added more logic in the
> > > > > kernel for these cases, but the agreement was that this was better left
> > > > > to user control.
> > > > >
> > > > > Do not expand the existing napi_tx variable to a ternary value,
> > > > > because doing so can break user applications that expect
> > > > > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > > > >
> > > > > Link: https://patchwork.ozlabs.org/patch/725249/
> > > > > Acked-by: Willem de Bruijn <willemb@google.com>
> > > > > Acked-by: Jon Olson <jonolson@google.com>
> > > > > Signed-off-by: Caleb Raitto <caraitto@google.com>
> > > >
> > > > So I looked into the history surrounding these issues.
> > > >
> > > > First of all, it's always ends up turning out crummy when drivers start
> > > > to set affinities themselves.  The worst possible case is to do it
> > > > _conditionally_, and that is exactly what virtio_net is doing.
> > > >
> > > > From the user's perspective, this provides a really bad experience.
> > > >
> > > > So if I have a 32-queue device and there are 32 cpus, you'll do all
> > > > the affinity settings, stopping Irqbalanced from doing anything
> > > > right?
> > > >
> > > > So if I add one more cpu, you'll say "oops, no idea what to do in
> > > > this situation" and not touch the affinities at all?
> > > >
> > > > That makes no sense at all.
> > > >
> > > > If the driver is going to set affinities at all, OWN that decision
> > > > and set it all the time to something reasonable.
> > > >
> > > > Or accept that you shouldn't be touching this stuff in the first place
> > > > and leave the affinities alone.
> > > >
> > > > Right now we're kinda in a situation where the driver has been setting
> > > > affinities in the ncpus==nqueues cases for some time, so we can't stop
> > > > doing it.
> > > >
> > > > Which means we have to set them in all cases to make the user
> > > > experience sane again.
> > > >
> > > > I looked at the linked to patch again:
> > > >
> > > >         https://patchwork.ozlabs.org/patch/725249/
> > > >
> > > > And I think the strategy should be made more generic, to get rid of
> > > > the hyperthreading assumptions.  I also agree that the "assign
> > > > to first N cpus" logic doesn't make much sense either.
> > > >
> > > > Just distribute across the available cpus evenly, and be done with it.
> > >
> > > Sounds good to me.
> >
> > So e.g. we could set an affinity hint to a group of CPUs that
> > might transmit to this queue.
> 
> We also want to set the xps mask for all cpus in the group to this queue.
> 
> Is there a benefit over explicitly choosing one cpu from the set, btw?

If only one CPU actually transmits on this queue then probably yes.
And virtio doesn't know whether that's the case.


> I assumed striping. Something along the lines of
> 
>   int stripe = max_t(int, num_online_cpus() / vi->curr_queue_pairs, 1);
>   int vq = 0;
> 
>   cpumask_clear(xps_mask);
> 
>   for_each_online_cpu(cpu) {
>       cpumask_set_cpu(cpu, xps_mask);
> 
>       if ((i + 1) % stripe == 0) {
>           virtqueue_set_affinity(vi->rq[vq].vq, cpu);
>           virtqueue_set_affinity(vi->sq[vq].vq, cpu);
>           netif_set_xps_queue(vi->dev, xps_mask, vq);
>           cpumask_clear(xps_mask);
>           vq++;
>      }
>       i++;
>   }
Michael S. Tsirkin Aug. 1, 2018, 10:27 p.m. UTC | #24
On Mon, Jul 30, 2018 at 02:06:50PM +0800, Jason Wang wrote:
> 
> 
> On 2018年07月25日 08:17, Jon Olson wrote:
> > On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > > > On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > > > > > >From the above linked patch, I understand that there are yet
> > > > > > other special cases in production, such as a hard cap on #tx queues to
> > > > > > 32 regardless of number of vcpus.
> > > > > I don't think upstream kernels have this limit - we can
> > > > > now use vmalloc for higher number of queues.
> > > > Yes. that patch* mentioned it as a google compute engine imposed
> > > > limit. It is exactly such cloud provider imposed rules that I'm
> > > > concerned about working around in upstream drivers.
> > > > 
> > > > * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> > > Yea. Why does GCE do it btw?
> > There are a few reasons for the limit, some historical, some current.
> > 
> > Historically we did this because of a kernel limit on the number of
> > TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> > the limit upstream at the time we did it was actually eight. We had
> > increased the limit from eight to 32 internally, and it appears in
> > upstream it has subsequently increased upstream to 256. We no longer
> > use TAP for networking, so that constraint no longer applies for us,
> > but when looking at removing/raising the limit we discovered no
> > workloads that clearly benefited from lifting it, and it also placed
> > more pressure on our virtual networking stack particularly on the Tx
> > side. We left it as-is.
> > 
> > In terms of current reasons there are really two. One is memory usage.
> > As you know, virtio-net uses rx/tx pairs, so there's an expectation
> > that the guest will have an Rx queue for every Tx queue. We run our
> > individual virtqueues fairly deep (4096 entries) to give guests a wide
> > time window for re-posting Rx buffers and avoiding starvation on
> > packet delivery. Filling an Rx vring with max-sized mergeable buffers
> > (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> > be up to 512MB of memory posted for network buffers. Scaling this to
> > the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> > all of the Rx rings full would (in the large average Rx packet size
> > case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> > of RAM available, but I don't believe we've observed a situation where
> > they would have benefited from having 2.5 gigs of buffers posted for
> > incoming network traffic :)
> 
> We can work to have async txq and rxq instead of paris if there's a strong
> requirement.

I think the reason we don't is because RX queueing is programmed by TX
packets.  It might make sense if we support RX queueing policy that
isn't dependent on TX.


> > 
> > The second reason is interrupt related -- as I mentioned above, we
> > have found no workloads that clearly benefit from so many queues, but
> > we have found workloads that degrade. In particular workloads that do
> > a lot of small packet processing but which aren't extremely latency
> > sensitive can achieve higher PPS by taking fewer interrupt across
> > fewer VCPUs due to better batching (this also incurs higher latency,
> > but at the limit the "busy" cores end up suppressing most interrupts
> > and spending most of their cycles farming out work). Memcache is a
> > good example here, particularly if the latency targets for request
> > completion are in the ~milliseconds range (rather than the
> > microseconds we typically strive for with TCP_RR-style workloads).
> > 
> > All of that said, we haven't been forthcoming with data (and
> > unfortunately I don't have it handy in a useful form, otherwise I'd
> > simply post it here), so I understand the hesitation to simply run
> > with napi_tx across the board. As Willem said, this patch seemed like
> > the least disruptive way to allow us to continue down the road of
> > "universal" NAPI Tx and to hopefully get data across enough workloads
> > (with VMs small, large, and absurdly large :) to present a compelling
> > argument in one direction or another. As far as I know there aren't
> > currently any NAPI related ethtool commands (based on a quick perusal
> > of ethtool.h)
> 
> As I suggest before, maybe we can (ab)use tx-frames-irq.
> 
> Thanks
> 
> > -- it seems like it would be fairly involved/heavyweight
> > to plumb one solely for this unless NAPI Tx is something many users
> > will want to tune (and for which other drivers would support tuning).
> > 
> > --
> > Jon Olson
Willem de Bruijn Aug. 1, 2018, 10:43 p.m. UTC | #25
> > So e.g. we could set an affinity hint to a group of CPUs that
> > > might transmit to this queue.
> >
> > We also want to set the xps mask for all cpus in the group to this queue.
> >
> > Is there a benefit over explicitly choosing one cpu from the set, btw?
>
> If only one CPU actually transmits on this queue then probably yes.
> And virtio doesn't know whether that's the case.

Oh, yes, good example.

To plumb that through, vp_set_vq_affinity would have to take a cpu
mask like irq_set_affinity_hint, instead of a single value.

Though I don't suggest doing that as part of the core
virtnet_set_affinity changes.
Michael S. Tsirkin Aug. 1, 2018, 10:48 p.m. UTC | #26
On Wed, Aug 01, 2018 at 06:43:53PM -0400, Willem de Bruijn wrote:
>  > > So e.g. we could set an affinity hint to a group of CPUs that
> > > > might transmit to this queue.
> > >
> > > We also want to set the xps mask for all cpus in the group to this queue.
> > >
> > > Is there a benefit over explicitly choosing one cpu from the set, btw?
> >
> > If only one CPU actually transmits on this queue then probably yes.
> > And virtio doesn't know whether that's the case.
> 
> Oh, yes, good example.
> 
> To plumb that through, vp_set_vq_affinity would have to take a cpu
> mask like irq_set_affinity_hint, instead of a single value.
> 
> Though I don't suggest doing that as part of the core
> virtnet_set_affinity changes.

Why not? It seems a reasonable thing to do.
Willem de Bruijn Aug. 1, 2018, 11:33 p.m. UTC | #27
On Wed, Aug 1, 2018 at 6:48 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Aug 01, 2018 at 06:43:53PM -0400, Willem de Bruijn wrote:
> >  > > So e.g. we could set an affinity hint to a group of CPUs that
> > > > > might transmit to this queue.
> > > >
> > > > We also want to set the xps mask for all cpus in the group to this queue.
> > > >
> > > > Is there a benefit over explicitly choosing one cpu from the set, btw?
> > >
> > > If only one CPU actually transmits on this queue then probably yes.
> > > And virtio doesn't know whether that's the case.
> >
> > Oh, yes, good example.
> >
> > To plumb that through, vp_set_vq_affinity would have to take a cpu
> > mask like irq_set_affinity_hint, instead of a single value.
> >
> > Though I don't suggest doing that as part of the core
> > virtnet_set_affinity changes.
>
> Why not? It seems a reasonable thing to do.

Well, at least let's split it into separate patches.
Willem de Bruijn Aug. 28, 2018, 7:57 p.m. UTC | #28
On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
>
> On 2018年07月25日 08:17, Jon Olson wrote:
> > On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> >>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> >>>>> >From the above linked patch, I understand that there are yet
> >>>>> other special cases in production, such as a hard cap on #tx queues to
> >>>>> 32 regardless of number of vcpus.
> >>>> I don't think upstream kernels have this limit - we can
> >>>> now use vmalloc for higher number of queues.
> >>> Yes. that patch* mentioned it as a google compute engine imposed
> >>> limit. It is exactly such cloud provider imposed rules that I'm
> >>> concerned about working around in upstream drivers.
> >>>
> >>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> >> Yea. Why does GCE do it btw?
> > There are a few reasons for the limit, some historical, some current.
> >
> > Historically we did this because of a kernel limit on the number of
> > TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> > the limit upstream at the time we did it was actually eight. We had
> > increased the limit from eight to 32 internally, and it appears in
> > upstream it has subsequently increased upstream to 256. We no longer
> > use TAP for networking, so that constraint no longer applies for us,
> > but when looking at removing/raising the limit we discovered no
> > workloads that clearly benefited from lifting it, and it also placed
> > more pressure on our virtual networking stack particularly on the Tx
> > side. We left it as-is.
> >
> > In terms of current reasons there are really two. One is memory usage.
> > As you know, virtio-net uses rx/tx pairs, so there's an expectation
> > that the guest will have an Rx queue for every Tx queue. We run our
> > individual virtqueues fairly deep (4096 entries) to give guests a wide
> > time window for re-posting Rx buffers and avoiding starvation on
> > packet delivery. Filling an Rx vring with max-sized mergeable buffers
> > (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> > be up to 512MB of memory posted for network buffers. Scaling this to
> > the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> > all of the Rx rings full would (in the large average Rx packet size
> > case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> > of RAM available, but I don't believe we've observed a situation where
> > they would have benefited from having 2.5 gigs of buffers posted for
> > incoming network traffic :)
>
> We can work to have async txq and rxq instead of paris if there's a
> strong requirement.
>
> >
> > The second reason is interrupt related -- as I mentioned above, we
> > have found no workloads that clearly benefit from so many queues, but
> > we have found workloads that degrade. In particular workloads that do
> > a lot of small packet processing but which aren't extremely latency
> > sensitive can achieve higher PPS by taking fewer interrupt across
> > fewer VCPUs due to better batching (this also incurs higher latency,
> > but at the limit the "busy" cores end up suppressing most interrupts
> > and spending most of their cycles farming out work). Memcache is a
> > good example here, particularly if the latency targets for request
> > completion are in the ~milliseconds range (rather than the
> > microseconds we typically strive for with TCP_RR-style workloads).
> >
> > All of that said, we haven't been forthcoming with data (and
> > unfortunately I don't have it handy in a useful form, otherwise I'd
> > simply post it here), so I understand the hesitation to simply run
> > with napi_tx across the board. As Willem said, this patch seemed like
> > the least disruptive way to allow us to continue down the road of
> > "universal" NAPI Tx and to hopefully get data across enough workloads
> > (with VMs small, large, and absurdly large :) to present a compelling
> > argument in one direction or another. As far as I know there aren't
> > currently any NAPI related ethtool commands (based on a quick perusal
> > of ethtool.h)
>
> As I suggest before, maybe we can (ab)use tx-frames-irq.

I forgot to respond to this originally, but I agree.

How about something like the snippet below. It would be simpler to
reason about if only allow switching while the device is down, but
napi does not strictly require that.

+static int virtnet_set_coalesce(struct net_device *dev,
+                               struct ethtool_coalesce *ec)
+{
+       const u32 tx_coalesce_napi_mask = (1 << 16);
+       const struct ethtool_coalesce ec_default = {
+               .cmd = ETHTOOL_SCOALESCE,
+               .rx_max_coalesced_frames = 1,
+               .tx_max_coalesced_frames = 1,
+       };
+       struct virtnet_info *vi = netdev_priv(dev);
+       int napi_weight = 0;
+       bool running;
+       int i;
+
+       if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
+               ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
+               napi_weight = NAPI_POLL_WEIGHT;
+       }
+
+       /* disallow changes to fields not explicitly tested above */
+       if (memcmp(ec, &ec_default, sizeof(ec_default)))
+               return -EINVAL;
+
+       if (napi_weight ^ vi->sq[0].napi.weight) {
+               running = netif_running(vi->dev);
+
+               for (i = 0; i < vi->max_queue_pairs; i++) {
+                       vi->sq[i].napi.weight = napi_weight;
+
+                       if (!running)
+                               continue;
+
+                       if (napi_weight)
+                               virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+                                                      &vi->sq[i].napi);
+                       else
+                               napi_disable(&vi->sq[i].napi);
+               }
+       }
+
+       return 0;
+}
+
+static int virtnet_get_coalesce(struct net_device *dev,
+                               struct ethtool_coalesce *ec)
+{
+       const u32 tx_coalesce_napi_mask = (1 << 16);
+       const struct ethtool_coalesce ec_default = {
+               .cmd = ETHTOOL_GCOALESCE,
+               .rx_max_coalesced_frames = 1,
+               .tx_max_coalesced_frames = 1,
+       };
+       struct virtnet_info *vi = netdev_priv(dev);
+
+       memcpy(ec, &ec_default, sizeof(ec_default));
+
+       if (vi->sq[0].napi.weight)
+               ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
+
+       return 0;
+}
Jason Wang Aug. 29, 2018, 7:56 a.m. UTC | #29
On 2018年08月29日 03:57, Willem de Bruijn wrote:
> On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>>
>> On 2018年07月25日 08:17, Jon Olson wrote:
>>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
>>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
>>>>>>> >From the above linked patch, I understand that there are yet
>>>>>>> other special cases in production, such as a hard cap on #tx queues to
>>>>>>> 32 regardless of number of vcpus.
>>>>>> I don't think upstream kernels have this limit - we can
>>>>>> now use vmalloc for higher number of queues.
>>>>> Yes. that patch* mentioned it as a google compute engine imposed
>>>>> limit. It is exactly such cloud provider imposed rules that I'm
>>>>> concerned about working around in upstream drivers.
>>>>>
>>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>>>> Yea. Why does GCE do it btw?
>>> There are a few reasons for the limit, some historical, some current.
>>>
>>> Historically we did this because of a kernel limit on the number of
>>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
>>> the limit upstream at the time we did it was actually eight. We had
>>> increased the limit from eight to 32 internally, and it appears in
>>> upstream it has subsequently increased upstream to 256. We no longer
>>> use TAP for networking, so that constraint no longer applies for us,
>>> but when looking at removing/raising the limit we discovered no
>>> workloads that clearly benefited from lifting it, and it also placed
>>> more pressure on our virtual networking stack particularly on the Tx
>>> side. We left it as-is.
>>>
>>> In terms of current reasons there are really two. One is memory usage.
>>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
>>> that the guest will have an Rx queue for every Tx queue. We run our
>>> individual virtqueues fairly deep (4096 entries) to give guests a wide
>>> time window for re-posting Rx buffers and avoiding starvation on
>>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
>>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
>>> be up to 512MB of memory posted for network buffers. Scaling this to
>>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
>>> all of the Rx rings full would (in the large average Rx packet size
>>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
>>> of RAM available, but I don't believe we've observed a situation where
>>> they would have benefited from having 2.5 gigs of buffers posted for
>>> incoming network traffic :)
>> We can work to have async txq and rxq instead of paris if there's a
>> strong requirement.
>>
>>> The second reason is interrupt related -- as I mentioned above, we
>>> have found no workloads that clearly benefit from so many queues, but
>>> we have found workloads that degrade. In particular workloads that do
>>> a lot of small packet processing but which aren't extremely latency
>>> sensitive can achieve higher PPS by taking fewer interrupt across
>>> fewer VCPUs due to better batching (this also incurs higher latency,
>>> but at the limit the "busy" cores end up suppressing most interrupts
>>> and spending most of their cycles farming out work). Memcache is a
>>> good example here, particularly if the latency targets for request
>>> completion are in the ~milliseconds range (rather than the
>>> microseconds we typically strive for with TCP_RR-style workloads).
>>>
>>> All of that said, we haven't been forthcoming with data (and
>>> unfortunately I don't have it handy in a useful form, otherwise I'd
>>> simply post it here), so I understand the hesitation to simply run
>>> with napi_tx across the board. As Willem said, this patch seemed like
>>> the least disruptive way to allow us to continue down the road of
>>> "universal" NAPI Tx and to hopefully get data across enough workloads
>>> (with VMs small, large, and absurdly large :) to present a compelling
>>> argument in one direction or another. As far as I know there aren't
>>> currently any NAPI related ethtool commands (based on a quick perusal
>>> of ethtool.h)
>> As I suggest before, maybe we can (ab)use tx-frames-irq.
> I forgot to respond to this originally, but I agree.
>
> How about something like the snippet below. It would be simpler to
> reason about if only allow switching while the device is down, but
> napi does not strictly require that.
>
> +static int virtnet_set_coalesce(struct net_device *dev,
> +                               struct ethtool_coalesce *ec)
> +{
> +       const u32 tx_coalesce_napi_mask = (1 << 16);
> +       const struct ethtool_coalesce ec_default = {
> +               .cmd = ETHTOOL_SCOALESCE,
> +               .rx_max_coalesced_frames = 1,
> +               .tx_max_coalesced_frames = 1,
> +       };
> +       struct virtnet_info *vi = netdev_priv(dev);
> +       int napi_weight = 0;
> +       bool running;
> +       int i;
> +
> +       if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
> +               ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
> +               napi_weight = NAPI_POLL_WEIGHT;
> +       }
> +
> +       /* disallow changes to fields not explicitly tested above */
> +       if (memcmp(ec, &ec_default, sizeof(ec_default)))
> +               return -EINVAL;
> +
> +       if (napi_weight ^ vi->sq[0].napi.weight) {
> +               running = netif_running(vi->dev);
> +
> +               for (i = 0; i < vi->max_queue_pairs; i++) {
> +                       vi->sq[i].napi.weight = napi_weight;
> +
> +                       if (!running)
> +                               continue;
> +
> +                       if (napi_weight)
> +                               virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> +                                                      &vi->sq[i].napi);
> +                       else
> +                               napi_disable(&vi->sq[i].napi);
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static int virtnet_get_coalesce(struct net_device *dev,
> +                               struct ethtool_coalesce *ec)
> +{
> +       const u32 tx_coalesce_napi_mask = (1 << 16);
> +       const struct ethtool_coalesce ec_default = {
> +               .cmd = ETHTOOL_GCOALESCE,
> +               .rx_max_coalesced_frames = 1,
> +               .tx_max_coalesced_frames = 1,
> +       };
> +       struct virtnet_info *vi = netdev_priv(dev);
> +
> +       memcpy(ec, &ec_default, sizeof(ec_default));
> +
> +       if (vi->sq[0].napi.weight)
> +               ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
> +
> +       return 0;
> +}

Looks good. Just one nit, maybe it's better simply check against zero?

Thanks
Willem de Bruijn Aug. 29, 2018, 1:01 p.m. UTC | #30
On Wed, Aug 29, 2018 at 3:56 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
>
> On 2018年08月29日 03:57, Willem de Bruijn wrote:
> > On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >>
> >> On 2018年07月25日 08:17, Jon Olson wrote:
> >>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> >>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> >>>>>>> >From the above linked patch, I understand that there are yet
> >>>>>>> other special cases in production, such as a hard cap on #tx queues to
> >>>>>>> 32 regardless of number of vcpus.
> >>>>>> I don't think upstream kernels have this limit - we can
> >>>>>> now use vmalloc for higher number of queues.
> >>>>> Yes. that patch* mentioned it as a google compute engine imposed
> >>>>> limit. It is exactly such cloud provider imposed rules that I'm
> >>>>> concerned about working around in upstream drivers.
> >>>>>
> >>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> >>>> Yea. Why does GCE do it btw?
> >>> There are a few reasons for the limit, some historical, some current.
> >>>
> >>> Historically we did this because of a kernel limit on the number of
> >>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> >>> the limit upstream at the time we did it was actually eight. We had
> >>> increased the limit from eight to 32 internally, and it appears in
> >>> upstream it has subsequently increased upstream to 256. We no longer
> >>> use TAP for networking, so that constraint no longer applies for us,
> >>> but when looking at removing/raising the limit we discovered no
> >>> workloads that clearly benefited from lifting it, and it also placed
> >>> more pressure on our virtual networking stack particularly on the Tx
> >>> side. We left it as-is.
> >>>
> >>> In terms of current reasons there are really two. One is memory usage.
> >>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
> >>> that the guest will have an Rx queue for every Tx queue. We run our
> >>> individual virtqueues fairly deep (4096 entries) to give guests a wide
> >>> time window for re-posting Rx buffers and avoiding starvation on
> >>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
> >>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> >>> be up to 512MB of memory posted for network buffers. Scaling this to
> >>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> >>> all of the Rx rings full would (in the large average Rx packet size
> >>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> >>> of RAM available, but I don't believe we've observed a situation where
> >>> they would have benefited from having 2.5 gigs of buffers posted for
> >>> incoming network traffic :)
> >> We can work to have async txq and rxq instead of paris if there's a
> >> strong requirement.
> >>
> >>> The second reason is interrupt related -- as I mentioned above, we
> >>> have found no workloads that clearly benefit from so many queues, but
> >>> we have found workloads that degrade. In particular workloads that do
> >>> a lot of small packet processing but which aren't extremely latency
> >>> sensitive can achieve higher PPS by taking fewer interrupt across
> >>> fewer VCPUs due to better batching (this also incurs higher latency,
> >>> but at the limit the "busy" cores end up suppressing most interrupts
> >>> and spending most of their cycles farming out work). Memcache is a
> >>> good example here, particularly if the latency targets for request
> >>> completion are in the ~milliseconds range (rather than the
> >>> microseconds we typically strive for with TCP_RR-style workloads).
> >>>
> >>> All of that said, we haven't been forthcoming with data (and
> >>> unfortunately I don't have it handy in a useful form, otherwise I'd
> >>> simply post it here), so I understand the hesitation to simply run
> >>> with napi_tx across the board. As Willem said, this patch seemed like
> >>> the least disruptive way to allow us to continue down the road of
> >>> "universal" NAPI Tx and to hopefully get data across enough workloads
> >>> (with VMs small, large, and absurdly large :) to present a compelling
> >>> argument in one direction or another. As far as I know there aren't
> >>> currently any NAPI related ethtool commands (based on a quick perusal
> >>> of ethtool.h)
> >> As I suggest before, maybe we can (ab)use tx-frames-irq.
> > I forgot to respond to this originally, but I agree.
> >
> > How about something like the snippet below. It would be simpler to
> > reason about if only allow switching while the device is down, but
> > napi does not strictly require that.
> >
> > +static int virtnet_set_coalesce(struct net_device *dev,
> > +                               struct ethtool_coalesce *ec)
> > +{
> > +       const u32 tx_coalesce_napi_mask = (1 << 16);
> > +       const struct ethtool_coalesce ec_default = {
> > +               .cmd = ETHTOOL_SCOALESCE,
> > +               .rx_max_coalesced_frames = 1,
> > +               .tx_max_coalesced_frames = 1,
> > +       };
> > +       struct virtnet_info *vi = netdev_priv(dev);
> > +       int napi_weight = 0;
> > +       bool running;
> > +       int i;
> > +
> > +       if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
> > +               ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
> > +               napi_weight = NAPI_POLL_WEIGHT;
> > +       }
> > +
> > +       /* disallow changes to fields not explicitly tested above */
> > +       if (memcmp(ec, &ec_default, sizeof(ec_default)))
> > +               return -EINVAL;
> > +
> > +       if (napi_weight ^ vi->sq[0].napi.weight) {
> > +               running = netif_running(vi->dev);
> > +
> > +               for (i = 0; i < vi->max_queue_pairs; i++) {
> > +                       vi->sq[i].napi.weight = napi_weight;
> > +
> > +                       if (!running)
> > +                               continue;
> > +
> > +                       if (napi_weight)
> > +                               virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> > +                                                      &vi->sq[i].napi);
> > +                       else
> > +                               napi_disable(&vi->sq[i].napi);
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int virtnet_get_coalesce(struct net_device *dev,
> > +                               struct ethtool_coalesce *ec)
> > +{
> > +       const u32 tx_coalesce_napi_mask = (1 << 16);
> > +       const struct ethtool_coalesce ec_default = {
> > +               .cmd = ETHTOOL_GCOALESCE,
> > +               .rx_max_coalesced_frames = 1,
> > +               .tx_max_coalesced_frames = 1,
> > +       };
> > +       struct virtnet_info *vi = netdev_priv(dev);
> > +
> > +       memcpy(ec, &ec_default, sizeof(ec_default));
> > +
> > +       if (vi->sq[0].napi.weight)
> > +               ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
> > +
> > +       return 0;
> > +}
>
> Looks good. Just one nit, maybe it's better simply check against zero?

I wanted to avoid making napi and interrupt moderation mutually
exclusive. If the virtio-net driver ever gets true moderation support,
it should be able to work alongside napi.

But I can make no-napi be 0 and napi be 1. That is future proof, in
the sense that napi is enabled if there is any interrupt moderation.
Willem de Bruijn Sept. 9, 2018, 11:07 p.m. UTC | #31
On Wed, Aug 29, 2018 at 9:01 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Aug 29, 2018 at 3:56 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> >
> > On 2018年08月29日 03:57, Willem de Bruijn wrote:
> > > On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>
> > >>
> > >> On 2018年07月25日 08:17, Jon Olson wrote:
> > >>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > >>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > >>>>>>> >From the above linked patch, I understand that there are yet
> > >>>>>>> other special cases in production, such as a hard cap on #tx queues to
> > >>>>>>> 32 regardless of number of vcpus.
> > >>>>>> I don't think upstream kernels have this limit - we can
> > >>>>>> now use vmalloc for higher number of queues.
> > >>>>> Yes. that patch* mentioned it as a google compute engine imposed
> > >>>>> limit. It is exactly such cloud provider imposed rules that I'm
> > >>>>> concerned about working around in upstream drivers.
> > >>>>>
> > >>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
> > >>>> Yea. Why does GCE do it btw?
> > >>> There are a few reasons for the limit, some historical, some current.
> > >>>
> > >>> Historically we did this because of a kernel limit on the number of
> > >>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
> > >>> the limit upstream at the time we did it was actually eight. We had
> > >>> increased the limit from eight to 32 internally, and it appears in
> > >>> upstream it has subsequently increased upstream to 256. We no longer
> > >>> use TAP for networking, so that constraint no longer applies for us,
> > >>> but when looking at removing/raising the limit we discovered no
> > >>> workloads that clearly benefited from lifting it, and it also placed
> > >>> more pressure on our virtual networking stack particularly on the Tx
> > >>> side. We left it as-is.
> > >>>
> > >>> In terms of current reasons there are really two. One is memory usage.
> > >>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
> > >>> that the guest will have an Rx queue for every Tx queue. We run our
> > >>> individual virtqueues fairly deep (4096 entries) to give guests a wide
> > >>> time window for re-posting Rx buffers and avoiding starvation on
> > >>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
> > >>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
> > >>> be up to 512MB of memory posted for network buffers. Scaling this to
> > >>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
> > >>> all of the Rx rings full would (in the large average Rx packet size
> > >>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
> > >>> of RAM available, but I don't believe we've observed a situation where
> > >>> they would have benefited from having 2.5 gigs of buffers posted for
> > >>> incoming network traffic :)
> > >> We can work to have async txq and rxq instead of paris if there's a
> > >> strong requirement.
> > >>
> > >>> The second reason is interrupt related -- as I mentioned above, we
> > >>> have found no workloads that clearly benefit from so many queues, but
> > >>> we have found workloads that degrade. In particular workloads that do
> > >>> a lot of small packet processing but which aren't extremely latency
> > >>> sensitive can achieve higher PPS by taking fewer interrupt across
> > >>> fewer VCPUs due to better batching (this also incurs higher latency,
> > >>> but at the limit the "busy" cores end up suppressing most interrupts
> > >>> and spending most of their cycles farming out work). Memcache is a
> > >>> good example here, particularly if the latency targets for request
> > >>> completion are in the ~milliseconds range (rather than the
> > >>> microseconds we typically strive for with TCP_RR-style workloads).
> > >>>
> > >>> All of that said, we haven't been forthcoming with data (and
> > >>> unfortunately I don't have it handy in a useful form, otherwise I'd
> > >>> simply post it here), so I understand the hesitation to simply run
> > >>> with napi_tx across the board. As Willem said, this patch seemed like
> > >>> the least disruptive way to allow us to continue down the road of
> > >>> "universal" NAPI Tx and to hopefully get data across enough workloads
> > >>> (with VMs small, large, and absurdly large :) to present a compelling
> > >>> argument in one direction or another. As far as I know there aren't
> > >>> currently any NAPI related ethtool commands (based on a quick perusal
> > >>> of ethtool.h)
> > >> As I suggest before, maybe we can (ab)use tx-frames-irq.
> > > I forgot to respond to this originally, but I agree.
> > >
> > > How about something like the snippet below. It would be simpler to
> > > reason about if only allow switching while the device is down, but
> > > napi does not strictly require that.
> > >
> > > +static int virtnet_set_coalesce(struct net_device *dev,
> > > +                               struct ethtool_coalesce *ec)
> > > +{
> > > +       const u32 tx_coalesce_napi_mask = (1 << 16);
> > > +       const struct ethtool_coalesce ec_default = {
> > > +               .cmd = ETHTOOL_SCOALESCE,
> > > +               .rx_max_coalesced_frames = 1,
> > > +               .tx_max_coalesced_frames = 1,
> > > +       };
> > > +       struct virtnet_info *vi = netdev_priv(dev);
> > > +       int napi_weight = 0;
> > > +       bool running;
> > > +       int i;
> > > +
> > > +       if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
> > > +               ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
> > > +               napi_weight = NAPI_POLL_WEIGHT;
> > > +       }
> > > +
> > > +       /* disallow changes to fields not explicitly tested above */
> > > +       if (memcmp(ec, &ec_default, sizeof(ec_default)))
> > > +               return -EINVAL;
> > > +
> > > +       if (napi_weight ^ vi->sq[0].napi.weight) {
> > > +               running = netif_running(vi->dev);
> > > +
> > > +               for (i = 0; i < vi->max_queue_pairs; i++) {
> > > +                       vi->sq[i].napi.weight = napi_weight;
> > > +
> > > +                       if (!running)
> > > +                               continue;
> > > +
> > > +                       if (napi_weight)
> > > +                               virtnet_napi_tx_enable(vi, vi->sq[i].vq,
> > > +                                                      &vi->sq[i].napi);
> > > +                       else
> > > +                               napi_disable(&vi->sq[i].napi);
> > > +               }
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static int virtnet_get_coalesce(struct net_device *dev,
> > > +                               struct ethtool_coalesce *ec)
> > > +{
> > > +       const u32 tx_coalesce_napi_mask = (1 << 16);
> > > +       const struct ethtool_coalesce ec_default = {
> > > +               .cmd = ETHTOOL_GCOALESCE,
> > > +               .rx_max_coalesced_frames = 1,
> > > +               .tx_max_coalesced_frames = 1,
> > > +       };
> > > +       struct virtnet_info *vi = netdev_priv(dev);
> > > +
> > > +       memcpy(ec, &ec_default, sizeof(ec_default));
> > > +
> > > +       if (vi->sq[0].napi.weight)
> > > +               ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
> > > +
> > > +       return 0;
> > > +}
> >
> > Looks good. Just one nit, maybe it's better simply check against zero?
>
> I wanted to avoid making napi and interrupt moderation mutually
> exclusive. If the virtio-net driver ever gets true moderation support,
> it should be able to work alongside napi.
>
> But I can make no-napi be 0 and napi be 1. That is future proof, in
> the sense that napi is enabled if there is any interrupt moderation.

It's not appearing on patchwork yet, but I just sent a patch.

I implemented the above, but .tx_frames of 0 is technically incorrect
and it would unnecessarily constrain interrupt moderation to one of two
modes. I went back to using a high bit. That said, if you feel strongly
I'll change it.

I also tried various ways of switching between napi and non napi
mode without bringing the device down. This is quite fragile. At the
very least napi.weight has to be updated without any interrupt or
napi callback happening in between. So most of the datapath needs
to be quiesced.

I did code up a variant that manually stops all the queues, masks the
interrupt and waits for napi to complete if scheduled. But in a stress
test it still managed to trigger a BUG in napi_enable on this state.

Napi is not switched at runtime in other devices, nor really needed. So
instead I made this change conditional on the device being down.
Jason Wang Sept. 10, 2018, 5:59 a.m. UTC | #32
On 2018年09月10日 07:07, Willem de Bruijn wrote:
> On Wed, Aug 29, 2018 at 9:01 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Wed, Aug 29, 2018 at 3:56 AM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>
>>> On 2018年08月29日 03:57, Willem de Bruijn wrote:
>>>> On Mon, Jul 30, 2018 at 2:06 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>> On 2018年07月25日 08:17, Jon Olson wrote:
>>>>>> On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
>>>>>>>> On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>>>> On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
>>>>>>>>>> >From the above linked patch, I understand that there are yet
>>>>>>>>>> other special cases in production, such as a hard cap on #tx queues to
>>>>>>>>>> 32 regardless of number of vcpus.
>>>>>>>>> I don't think upstream kernels have this limit - we can
>>>>>>>>> now use vmalloc for higher number of queues.
>>>>>>>> Yes. that patch* mentioned it as a google compute engine imposed
>>>>>>>> limit. It is exactly such cloud provider imposed rules that I'm
>>>>>>>> concerned about working around in upstream drivers.
>>>>>>>>
>>>>>>>> * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>>>>>>> Yea. Why does GCE do it btw?
>>>>>> There are a few reasons for the limit, some historical, some current.
>>>>>>
>>>>>> Historically we did this because of a kernel limit on the number of
>>>>>> TAP queues (in Montreal I thought this limit was 32). To my chagrin,
>>>>>> the limit upstream at the time we did it was actually eight. We had
>>>>>> increased the limit from eight to 32 internally, and it appears in
>>>>>> upstream it has subsequently increased upstream to 256. We no longer
>>>>>> use TAP for networking, so that constraint no longer applies for us,
>>>>>> but when looking at removing/raising the limit we discovered no
>>>>>> workloads that clearly benefited from lifting it, and it also placed
>>>>>> more pressure on our virtual networking stack particularly on the Tx
>>>>>> side. We left it as-is.
>>>>>>
>>>>>> In terms of current reasons there are really two. One is memory usage.
>>>>>> As you know, virtio-net uses rx/tx pairs, so there's an expectation
>>>>>> that the guest will have an Rx queue for every Tx queue. We run our
>>>>>> individual virtqueues fairly deep (4096 entries) to give guests a wide
>>>>>> time window for re-posting Rx buffers and avoiding starvation on
>>>>>> packet delivery. Filling an Rx vring with max-sized mergeable buffers
>>>>>> (4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
>>>>>> be up to 512MB of memory posted for network buffers. Scaling this to
>>>>>> the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
>>>>>> all of the Rx rings full would (in the large average Rx packet size
>>>>>> case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
>>>>>> of RAM available, but I don't believe we've observed a situation where
>>>>>> they would have benefited from having 2.5 gigs of buffers posted for
>>>>>> incoming network traffic :)
>>>>> We can work to have async txq and rxq instead of paris if there's a
>>>>> strong requirement.
>>>>>
>>>>>> The second reason is interrupt related -- as I mentioned above, we
>>>>>> have found no workloads that clearly benefit from so many queues, but
>>>>>> we have found workloads that degrade. In particular workloads that do
>>>>>> a lot of small packet processing but which aren't extremely latency
>>>>>> sensitive can achieve higher PPS by taking fewer interrupt across
>>>>>> fewer VCPUs due to better batching (this also incurs higher latency,
>>>>>> but at the limit the "busy" cores end up suppressing most interrupts
>>>>>> and spending most of their cycles farming out work). Memcache is a
>>>>>> good example here, particularly if the latency targets for request
>>>>>> completion are in the ~milliseconds range (rather than the
>>>>>> microseconds we typically strive for with TCP_RR-style workloads).
>>>>>>
>>>>>> All of that said, we haven't been forthcoming with data (and
>>>>>> unfortunately I don't have it handy in a useful form, otherwise I'd
>>>>>> simply post it here), so I understand the hesitation to simply run
>>>>>> with napi_tx across the board. As Willem said, this patch seemed like
>>>>>> the least disruptive way to allow us to continue down the road of
>>>>>> "universal" NAPI Tx and to hopefully get data across enough workloads
>>>>>> (with VMs small, large, and absurdly large :) to present a compelling
>>>>>> argument in one direction or another. As far as I know there aren't
>>>>>> currently any NAPI related ethtool commands (based on a quick perusal
>>>>>> of ethtool.h)
>>>>> As I suggest before, maybe we can (ab)use tx-frames-irq.
>>>> I forgot to respond to this originally, but I agree.
>>>>
>>>> How about something like the snippet below. It would be simpler to
>>>> reason about if only allow switching while the device is down, but
>>>> napi does not strictly require that.
>>>>
>>>> +static int virtnet_set_coalesce(struct net_device *dev,
>>>> +                               struct ethtool_coalesce *ec)
>>>> +{
>>>> +       const u32 tx_coalesce_napi_mask = (1 << 16);
>>>> +       const struct ethtool_coalesce ec_default = {
>>>> +               .cmd = ETHTOOL_SCOALESCE,
>>>> +               .rx_max_coalesced_frames = 1,
>>>> +               .tx_max_coalesced_frames = 1,
>>>> +       };
>>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>>> +       int napi_weight = 0;
>>>> +       bool running;
>>>> +       int i;
>>>> +
>>>> +       if (ec->tx_max_coalesced_frames & tx_coalesce_napi_mask) {
>>>> +               ec->tx_max_coalesced_frames &= ~tx_coalesce_napi_mask;
>>>> +               napi_weight = NAPI_POLL_WEIGHT;
>>>> +       }
>>>> +
>>>> +       /* disallow changes to fields not explicitly tested above */
>>>> +       if (memcmp(ec, &ec_default, sizeof(ec_default)))
>>>> +               return -EINVAL;
>>>> +
>>>> +       if (napi_weight ^ vi->sq[0].napi.weight) {
>>>> +               running = netif_running(vi->dev);
>>>> +
>>>> +               for (i = 0; i < vi->max_queue_pairs; i++) {
>>>> +                       vi->sq[i].napi.weight = napi_weight;
>>>> +
>>>> +                       if (!running)
>>>> +                               continue;
>>>> +
>>>> +                       if (napi_weight)
>>>> +                               virtnet_napi_tx_enable(vi, vi->sq[i].vq,
>>>> +                                                      &vi->sq[i].napi);
>>>> +                       else
>>>> +                               napi_disable(&vi->sq[i].napi);
>>>> +               }
>>>> +       }
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +static int virtnet_get_coalesce(struct net_device *dev,
>>>> +                               struct ethtool_coalesce *ec)
>>>> +{
>>>> +       const u32 tx_coalesce_napi_mask = (1 << 16);
>>>> +       const struct ethtool_coalesce ec_default = {
>>>> +               .cmd = ETHTOOL_GCOALESCE,
>>>> +               .rx_max_coalesced_frames = 1,
>>>> +               .tx_max_coalesced_frames = 1,
>>>> +       };
>>>> +       struct virtnet_info *vi = netdev_priv(dev);
>>>> +
>>>> +       memcpy(ec, &ec_default, sizeof(ec_default));
>>>> +
>>>> +       if (vi->sq[0].napi.weight)
>>>> +               ec->tx_max_coalesced_frames |= tx_coalesce_napi_mask;
>>>> +
>>>> +       return 0;
>>>> +}
>>> Looks good. Just one nit, maybe it's better simply check against zero?
>> I wanted to avoid making napi and interrupt moderation mutually
>> exclusive. If the virtio-net driver ever gets true moderation support,
>> it should be able to work alongside napi.
>>
>> But I can make no-napi be 0 and napi be 1. That is future proof, in
>> the sense that napi is enabled if there is any interrupt moderation.
> It's not appearing on patchwork yet, but I just sent a patch.
>
> I implemented the above, but .tx_frames of 0 is technically incorrect
> and it would unnecessarily constrain interrupt moderation to one of two
> modes. I went back to using a high bit. That said, if you feel strongly
> I'll change it.

Rethink about this, how about something like:

- UINT_MAX: no tx interrupt
- other value: tx interrupt with possible interrupt moderation

>
> I also tried various ways of switching between napi and non napi
> mode without bringing the device down. This is quite fragile. At the
> very least napi.weight has to be updated without any interrupt or
> napi callback happening in between. So most of the datapath needs
> to be quiesced.
>
> I did code up a variant that manually stops all the queues, masks the
> interrupt and waits for napi to complete if scheduled. But in a stress
> test it still managed to trigger a BUG in napi_enable on this state.
>
> Napi is not switched at runtime in other devices, nor really needed. So
> instead I made this change conditional on the device being down.

I agree to start with this, but I cook a patch on top. Please refer the 
thread of formal patch.

Thanks

Patch
diff mbox series

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2ff08bc103a9..d9aca4e90d6b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -39,10 +39,11 @@ 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
-static bool csum = true, gso = true, napi_tx;
+static bool csum = true, gso = true, napi_tx, force_napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
+module_param(force_napi_tx, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -1201,7 +1202,7 @@  static void virtnet_napi_tx_enable(struct virtnet_info *vi,
 	/* Tx napi touches cachelines on the cpu handling tx interrupts. Only
 	 * enable the feature if this is likely affine with the transmit path.
 	 */
-	if (!vi->affinity_hint_set) {
+	if (!vi->affinity_hint_set && !force_napi_tx) {
 		napi->weight = 0;
 		return;
 	}
@@ -2646,7 +2647,7 @@  static int virtnet_alloc_queues(struct virtnet_info *vi)
 		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
 			       napi_weight);
 		netif_tx_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
-				  napi_tx ? napi_weight : 0);
+				  (napi_tx || force_napi_tx) ? napi_weight : 0);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
 		ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);