diff mbox series

[v4,4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

Message ID 20200928183529.471328-5-nitesh@redhat.com
State New
Headers show
Series isolation: limit msix vectors to housekeeping CPUs | expand

Commit Message

Nitesh Narayan Lal Sept. 28, 2020, 6:35 p.m. UTC
If we have isolated CPUs dedicated for use by real-time tasks, we try to
move IRQs to housekeeping CPUs from the userspace to reduce latency
overhead on the isolated CPUs.

If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
may exceed per-CPU vector limits.

When we have isolated CPUs, limit the number of vectors allocated by
pci_alloc_irq_vectors() to the minimum number required by the driver, or
to one per housekeeping CPU if that is larger.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/pci/msi.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Bjorn Helgaas Sept. 28, 2020, 9:59 p.m. UTC | #1
[to: Christoph in case he has comments, since I think he wrote this code]

On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
> 
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
> 
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/msi.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/irqdomain.h>
>  #include <linux/of_irq.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "pci.h"
>  
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   struct irq_affinity *affd)
>  {
>  	struct irq_affinity msi_default_affd = {0};
> +	unsigned int hk_cpus;
>  	int nvecs = -ENOSPC;
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */
> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}
> +
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
>  			affd = &msi_default_affd;
> -- 
> 2.18.2
>
Christoph Hellwig Sept. 29, 2020, 5:46 p.m. UTC | #2
On Mon, Sep 28, 2020 at 04:59:31PM -0500, Bjorn Helgaas wrote:
> [to: Christoph in case he has comments, since I think he wrote this code]

I think I actually suggested this a few iterations back.

> > +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> > +
> > +	/*
> > +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> > +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> > +	 * to the housekeeping CPUs from the userspace by changing their
> > +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> > +	 * running out of IRQ vectors.
> > +	 */
> > +	if (hk_cpus < num_online_cpus()) {

I woukd have moved the assignment to hk_cpus below the comment and
just above the if, but that is really just a minor style preference.

Otherwise this looks good:

Acked-by: Christoph Hellwig <hch@lst.de>
Peter Zijlstra Oct. 16, 2020, 12:20 p.m. UTC | #3
On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
> 
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
> 
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/pci/msi.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/irqdomain.h>
>  #include <linux/of_irq.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "pci.h"
>  
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   struct irq_affinity *affd)
>  {
>  	struct irq_affinity msi_default_affd = {0};
> +	unsigned int hk_cpus;
>  	int nvecs = -ENOSPC;
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */
> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;

is that:

		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Also, do we really need to have that conditional on hk_cpus <
num_online_cpus()? That is, why can't we do this unconditionally?

And what are the (desired) semantics vs hotplug? Using a cpumask without
excluding hotplug is racy.

> +	}
> +
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
>  			affd = &msi_default_affd;
> -- 
> 2.18.2
>
Nitesh Narayan Lal Oct. 18, 2020, 6:14 p.m. UTC | #4
On 10/16/20 8:20 AM, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
>> If we have isolated CPUs dedicated for use by real-time tasks, we try to
>> move IRQs to housekeeping CPUs from the userspace to reduce latency
>> overhead on the isolated CPUs.
>>
>> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
>> may exceed per-CPU vector limits.
>>
>> When we have isolated CPUs, limit the number of vectors allocated by
>> pci_alloc_irq_vectors() to the minimum number required by the driver, or
>> to one per housekeeping CPU if that is larger.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/pci/msi.c | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 30ae4ffda5c1..8c156867803c 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/irqdomain.h>
>>  #include <linux/of_irq.h>
>> +#include <linux/sched/isolation.h>
>>  
>>  #include "pci.h"
>>  
>> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>>  				   struct irq_affinity *affd)
>>  {
>>  	struct irq_affinity msi_default_affd = {0};
>> +	unsigned int hk_cpus;
>>  	int nvecs = -ENOSPC;
>>  
>> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +	/*
>> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
>> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
>> +	 * to the housekeeping CPUs from the userspace by changing their
>> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> +	 * running out of IRQ vectors.
>> +	 */
>> +	if (hk_cpus < num_online_cpus()) {
>> +		if (hk_cpus < min_vecs)
>> +			max_vecs = min_vecs;
>> +		else if (hk_cpus < max_vecs)
>> +			max_vecs = hk_cpus;
> is that:
>
> 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Yes, I think this will do.

>
> Also, do we really need to have that conditional on hk_cpus <
> num_online_cpus()? That is, why can't we do this unconditionally?


FWIU most of the drivers using this API already restricts the number of
vectors based on the num_online_cpus, if we do it unconditionally we can
unnecessary duplicate the restriction for cases where we don't have any
isolated CPUs.

Also, different driver seems to take different factors into consideration
along with num_online_cpus while finding the max_vecs to request, for
example in the case of mlx5:
MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
               MLX5_EQ_VEC_COMP_BASE

Having hk_cpus < num_online_cpus() helps us ensure that we are only
changing the behavior when we have isolated CPUs.

Does that make sense?

>
> And what are the (desired) semantics vs hotplug? Using a cpumask without
> excluding hotplug is racy.

The housekeeping_mask should still remain constant, isn't?
In any case, I can double check this.

>
>> +	}
>> +
>>  	if (flags & PCI_IRQ_AFFINITY) {
>>  		if (!affd)
>>  			affd = &msi_default_affd;
>> -- 
>> 2.18.2
>>
Peter Zijlstra Oct. 19, 2020, 11:11 a.m. UTC | #5
On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> >> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> >> +
> >> +	/*
> >> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> >> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> >> +	 * to the housekeeping CPUs from the userspace by changing their
> >> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> >> +	 * running out of IRQ vectors.
> >> +	 */
> >> +	if (hk_cpus < num_online_cpus()) {
> >> +		if (hk_cpus < min_vecs)
> >> +			max_vecs = min_vecs;
> >> +		else if (hk_cpus < max_vecs)
> >> +			max_vecs = hk_cpus;
> > is that:
> >
> > 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> 
> Yes, I think this will do.
> 
> >
> > Also, do we really need to have that conditional on hk_cpus <
> > num_online_cpus()? That is, why can't we do this unconditionally?
> 
> FWIU most of the drivers using this API already restricts the number of
> vectors based on the num_online_cpus, if we do it unconditionally we can
> unnecessary duplicate the restriction for cases where we don't have any
> isolated CPUs.

unnecessary isn't really a concern here, this is a slow path. What's
important is code clarity.

> Also, different driver seems to take different factors into consideration
> along with num_online_cpus while finding the max_vecs to request, for
> example in the case of mlx5:
> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>                MLX5_EQ_VEC_COMP_BASE
> 
> Having hk_cpus < num_online_cpus() helps us ensure that we are only
> changing the behavior when we have isolated CPUs.
> 
> Does that make sense?

That seems to want to allocate N interrupts per cpu (plus some random
static amount, which seems weird, but whatever). This patch breaks that.

So I think it is important to figure out what that driver really wants
in the nohz_full case. If it wants to retain N interrupts per CPU, and
only reduce the number of CPUs, the proposed interface is wrong.

> > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > excluding hotplug is racy.
> 
> The housekeeping_mask should still remain constant, isn't?
> In any case, I can double check this.

The goal is very much to have that dynamically configurable.
Marcelo Tosatti Oct. 19, 2020, 2 p.m. UTC | #6
On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> > >> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> > >> +
> > >> +	/*
> > >> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> > >> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> > >> +	 * to the housekeeping CPUs from the userspace by changing their
> > >> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> > >> +	 * running out of IRQ vectors.
> > >> +	 */
> > >> +	if (hk_cpus < num_online_cpus()) {
> > >> +		if (hk_cpus < min_vecs)
> > >> +			max_vecs = min_vecs;
> > >> +		else if (hk_cpus < max_vecs)
> > >> +			max_vecs = hk_cpus;
> > > is that:
> > >
> > > 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> > 
> > Yes, I think this will do.
> > 
> > >
> > > Also, do we really need to have that conditional on hk_cpus <
> > > num_online_cpus()? That is, why can't we do this unconditionally?
> > 
> > FWIU most of the drivers using this API already restricts the number of
> > vectors based on the num_online_cpus, if we do it unconditionally we can
> > unnecessary duplicate the restriction for cases where we don't have any
> > isolated CPUs.
> 
> unnecessary isn't really a concern here, this is a slow path. What's
> important is code clarity.
> 
> > Also, different driver seems to take different factors into consideration
> > along with num_online_cpus while finding the max_vecs to request, for
> > example in the case of mlx5:
> > MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
> >                MLX5_EQ_VEC_COMP_BASE
> > 
> > Having hk_cpus < num_online_cpus() helps us ensure that we are only
> > changing the behavior when we have isolated CPUs.
> > 
> > Does that make sense?
> 
> That seems to want to allocate N interrupts per cpu (plus some random
> static amount, which seems weird, but whatever). This patch breaks that.

On purpose. For the isolated CPUs we don't want network device 
interrupts (in this context).

> So I think it is important to figure out what that driver really wants
> in the nohz_full case. If it wants to retain N interrupts per CPU, and
> only reduce the number of CPUs, the proposed interface is wrong.

It wants N interrupts per non-isolated (AKA housekeeping) CPU.
Zero interrupts for isolated interrupts.

> > > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > > excluding hotplug is racy.
> > 
> > The housekeeping_mask should still remain constant, isn't?
> > In any case, I can double check this.
> 
> The goal is very much to have that dynamically configurable.

Yes, but this patch is a fix for customer bug in the old, static on-boot 
isolation CPU configuration.

---

Discussing the dynamic configuration (not this patch!) case:

Would need to enable/disable interrupts for a particular device 
on a per-CPU basis. Such interface does not exist yet.

Perhaps that is what you are looking for when writing "proposed interface
is wrong" Peter?
Frederic Weisbecker Oct. 19, 2020, 2:21 p.m. UTC | #7
On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
> > > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > > excluding hotplug is racy.
> > 
> > The housekeeping_mask should still remain constant, isn't?
> > In any case, I can double check this.
> 
> The goal is very much to have that dynamically configurable.

Right but unfortunately we are not there before a little while. And the
existing code in these drivers allocating vectors doesn't even take into
account hotplug as you spotted. So I agreed to let Nitesh fix this issue
on top of the existing code until he can look into providing some infrastructure
for these kind of vectors allocation. The first step would be to consolidate
similar code from other drivers, then maybe handle hotplug and later
dynamic housekeeping.
Nitesh Narayan Lal Oct. 19, 2020, 2:25 p.m. UTC | #8
On 10/19/20 10:00 AM, Marcelo Tosatti wrote:
> On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
>> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:

[...]

>>>> Also, do we really need to have that conditional on hk_cpus <
>>>> num_online_cpus()? That is, why can't we do this unconditionally?
>>> FWIU most of the drivers using this API already restricts the number of
>>> vectors based on the num_online_cpus, if we do it unconditionally we can
>>> unnecessary duplicate the restriction for cases where we don't have any
>>> isolated CPUs.
>> unnecessary isn't really a concern here, this is a slow path. What's
>> important is code clarity.

Right, I can skip that check then.

>>
>>> Also, different driver seems to take different factors into consideration
>>> along with num_online_cpus while finding the max_vecs to request, for
>>> example in the case of mlx5:
>>> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>>>                MLX5_EQ_VEC_COMP_BASE
>>>
>>> Having hk_cpus < num_online_cpus() helps us ensure that we are only
>>> changing the behavior when we have isolated CPUs.
>>>
>>> Does that make sense?
>> That seems to want to allocate N interrupts per cpu (plus some random
>> static amount, which seems weird, but whatever). This patch breaks that.
> On purpose. For the isolated CPUs we don't want network device 
> interrupts (in this context).
>
>> So I think it is important to figure out what that driver really wants
>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>> only reduce the number of CPUs, the proposed interface is wrong.
> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> Zero interrupts for isolated interrupts.

Right, otherwise we may end up in a situation where we run out of per CPU
vectors while we move the IRQs from isolated CPUs to housekeeping.
Peter Zijlstra Oct. 20, 2020, 7:30 a.m. UTC | #9
On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
> > So I think it is important to figure out what that driver really wants
> > in the nohz_full case. If it wants to retain N interrupts per CPU, and
> > only reduce the number of CPUs, the proposed interface is wrong.
> 
> It wants N interrupts per non-isolated (AKA housekeeping) CPU.

Then the patch is wrong and the interface needs changing from @min_vecs,
@max_vecs to something that expresses the N*nr_cpus relation.
Nitesh Narayan Lal Oct. 20, 2020, 1 p.m. UTC | #10
On 10/20/20 3:30 AM, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>> So I think it is important to figure out what that driver really wants
>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>> only reduce the number of CPUs, the proposed interface is wrong.
>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> Then the patch is wrong and the interface needs changing from @min_vecs,
> @max_vecs to something that expresses the N*nr_cpus relation.

Reading Marcelo's comment again I think what is really expected is 1
interrupt per non-isolated (housekeeping) CPU (not N interrupts).

My bad that I missed it initially.
Peter Zijlstra Oct. 20, 2020, 1:41 p.m. UTC | #11
On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
> 
> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
> > On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
> >>> So I think it is important to figure out what that driver really wants
> >>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
> >>> only reduce the number of CPUs, the proposed interface is wrong.
> >> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> > Then the patch is wrong and the interface needs changing from @min_vecs,
> > @max_vecs to something that expresses the N*nr_cpus relation.
> 
> Reading Marcelo's comment again I think what is really expected is 1
> interrupt per non-isolated (housekeeping) CPU (not N interrupts).

Then what is the point of them asking for N*nr_cpus when there is no
isolation?

Either everybody wants 1 interrupts per CPU and we can do the clamp
unconditionally, in which case we should go fix this user, or they want
multiple per cpu and we should go fix the interface.

It cannot be both.
Thomas Gleixner Oct. 20, 2020, 2:16 p.m. UTC | #12
On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */

This is not true for managed interrupts. The interrupts affinity of
those cannot be changed by user space.

> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}

So now with that assume a 16 core machine (HT off for simplicity)

17 Requested interrupts (1 general, 16 queues)

Managed interrupts will allocate

   1  general interrupt which is free movable by user space
   16 managed interrupts for queues (one per CPU)

This allows the driver to have 16 queues, i.e. one queue per CPU. These
interrupts are only used when an application on a CPU issues I/O.

With the above change this will result

   1  general interrupt which is free movable by user space
   1  managed interrupts (possible affinity to all 16 CPUs, but routed
      to housekeeping CPU as long as there is one online)

So the device is now limited to a single queue which also affects the
housekeeping CPUs because now they have to share a single queue.

With larger machines this gets even worse.

So no. This needs way more thought for managed interrupts and you cannot
do that at the PCI layer. Only the affinity spreading mechanism can do
the right thing here.

Thanks,

        tglx
Nitesh Narayan Lal Oct. 20, 2020, 2:39 p.m. UTC | #13
On 10/20/20 9:41 AM, Peter Zijlstra wrote:
> On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
>> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
>>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>>>> So I think it is important to figure out what that driver really wants
>>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>>>> only reduce the number of CPUs, the proposed interface is wrong.
>>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
>>> Then the patch is wrong and the interface needs changing from @min_vecs,
>>> @max_vecs to something that expresses the N*nr_cpus relation.
>> Reading Marcelo's comment again I think what is really expected is 1
>> interrupt per non-isolated (housekeeping) CPU (not N interrupts).
> Then what is the point of them asking for N*nr_cpus when there is no
> isolation?
>
> Either everybody wants 1 interrupts per CPU and we can do the clamp
> unconditionally, in which case we should go fix this user, or they want
> multiple per cpu and we should go fix the interface.
>
> It cannot be both.

Based on my understanding I don't think this is consistent, the number
of interrupts any driver can request varies to an extent that some
consumer of this API even request just one interrupt for its use.

This was one of the reasons why I thought of having a conditional
restriction.

But I agree there is a lack of consistency.
Nitesh Narayan Lal Oct. 20, 2020, 4:18 p.m. UTC | #14
On 10/20/20 10:16 AM, Thomas Gleixner wrote:
> On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>>  
>> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +	/*
>> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
>> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
>> +	 * to the housekeeping CPUs from the userspace by changing their
>> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> +	 * running out of IRQ vectors.
>> +	 */
> This is not true for managed interrupts. The interrupts affinity of
> those cannot be changed by user space.

Ah Yes. Perhaps
s/IRQs/non-managed IRQ ?

>
>> +	if (hk_cpus < num_online_cpus()) {
>> +		if (hk_cpus < min_vecs)
>> +			max_vecs = min_vecs;
>> +		else if (hk_cpus < max_vecs)
>> +			max_vecs = hk_cpus;
>> +	}
> So now with that assume a 16 core machine (HT off for simplicity)
>
> 17 Requested interrupts (1 general, 16 queues)
>
> Managed interrupts will allocate
>
>    1  general interrupt which is free movable by user space
>    16 managed interrupts for queues (one per CPU)
>
> This allows the driver to have 16 queues, i.e. one queue per CPU. These
> interrupts are only used when an application on a CPU issues I/O.

Correct.

>
> With the above change this will result
>
>    1  general interrupt which is free movable by user space
>    1  managed interrupts (possible affinity to all 16 CPUs, but routed
>       to housekeeping CPU as long as there is one online)
>
> So the device is now limited to a single queue which also affects the
> housekeeping CPUs because now they have to share a single queue.
>
> With larger machines this gets even worse.

Yes, the change can impact the performance, however, if we don't do that we
may have a latency impact instead. Specifically, on larger systems where
most of the CPUs are isolated as we will definitely fail in moving all of the
IRQs away from the isolated CPUs to the housekeeping.

>
> So no. This needs way more thought for managed interrupts and you cannot
> do that at the PCI layer.

Maybe we should not be doing anything in the case of managed IRQs as they
are anyways pinned to the housekeeping CPUs as long as we have the
'managed_irq' option included in the kernel cmdline.

>  Only the affinity spreading mechanism can do
> the right thing here.

I can definitely explore this further.

However, IMHO we would still need a logic to prevent the devices from
creating excess vectors.

Do you agree?

>
> Thanks,
>
>         tglx
>
Thomas Gleixner Oct. 20, 2020, 6:07 p.m. UTC | #15
On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
> On 10/20/20 10:16 AM, Thomas Gleixner wrote:
>> With the above change this will result
>>
>>    1  general interrupt which is free movable by user space
>>    1  managed interrupts (possible affinity to all 16 CPUs, but routed
>>       to housekeeping CPU as long as there is one online)
>>
>> So the device is now limited to a single queue which also affects the
>> housekeeping CPUs because now they have to share a single queue.
>>
>> With larger machines this gets even worse.
>
> Yes, the change can impact the performance, however, if we don't do that we
> may have a latency impact instead. Specifically, on larger systems where
> most of the CPUs are isolated as we will definitely fail in moving all of the
> IRQs away from the isolated CPUs to the housekeeping.

For non managed interrupts I agree.

>> So no. This needs way more thought for managed interrupts and you cannot
>> do that at the PCI layer.
>
> Maybe we should not be doing anything in the case of managed IRQs as they
> are anyways pinned to the housekeeping CPUs as long as we have the
> 'managed_irq' option included in the kernel cmdline.

Exactly. For the PCI side this vector limiting has to be restricted to
the non managed case.

>>  Only the affinity spreading mechanism can do
>> the right thing here.
>
> I can definitely explore this further.
>
> However, IMHO we would still need a logic to prevent the devices from
> creating excess vectors.

Managed interrupts are preventing exactly that by pinning the interrupts
and queues to one or a set of CPUs, which prevents vector exhaustion on
CPU hotplug.

Non-managed, yes that is and always was a problem. One of the reasons
why managed interrupts exist.

Thanks,

        tglx
Thomas Gleixner Oct. 21, 2020, 8:25 p.m. UTC | #16
On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
>> However, IMHO we would still need a logic to prevent the devices from
>> creating excess vectors.
>
> Managed interrupts are preventing exactly that by pinning the interrupts
> and queues to one or a set of CPUs, which prevents vector exhaustion on
> CPU hotplug.
>
> Non-managed, yes that is and always was a problem. One of the reasons
> why managed interrupts exist.

But why is this only a problem for isolation? The very same problem
exists vs. CPU hotplug and therefore hibernation.

On x86 we have at max. 204 vectors available for device interrupts per
CPU. So assumed the only device interrupt in use is networking then any
machine which has more than 204 network interrupts (queues, aux ...)
active will prevent the machine from hibernation.

Aside of that it's silly to have multiple queues targeted at a single
CPU in case of hotplug. And that's not a theoretical problem.  Some
power management schemes shut down sockets when the utilization of a
system is low enough, e.g. outside of working hours.

The whole point of multi-queue is to have locality so that traffic from
a CPU goes through the CPU local queue. What's the point of having two
or more queues on a CPU in case of hotplug?

The right answer to this is to utilize managed interrupts and have
according logic in your network driver to handle CPU hotplug. When a CPU
goes down, then the queue which is associated to that CPU is quiesced
and the interrupt core shuts down the relevant interrupt instead of
moving it to an online CPU (which causes the whole vector exhaustion
problem on x86). When the CPU comes online again, then the interrupt is
reenabled in the core and the driver reactivates the queue.

Thanks,

        tglx
Nitesh Narayan Lal Oct. 21, 2020, 9:04 p.m. UTC | #17
On 10/21/20 4:25 PM, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
>> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
>>> However, IMHO we would still need a logic to prevent the devices from
>>> creating excess vectors.
>> Managed interrupts are preventing exactly that by pinning the interrupts
>> and queues to one or a set of CPUs, which prevents vector exhaustion on
>> CPU hotplug.
>>
>> Non-managed, yes that is and always was a problem. One of the reasons
>> why managed interrupts exist.
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
>
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.

Yes, that is indeed the case.

>
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.
>
> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
>
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

IIRC then i40e does have something like that where it suspends all IRQs
before hibernation and restores them when the CPU is back online.

I am not particularly sure about the other drivers.

This brings me to another discussion that Peter initiated that is to
perform the proposed restriction without any condition for all non-managed
IRQs.

Something on the lines:

+       if (!pci_is_managed(dev))
+               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);


I am not particularly sure about this because I am not sure what kind of
performance penalty this will have on the drivers in general and if
that will be acceptable at all. Any thoughts?

However, this still doesn't solve the generic problem, and an ideal solution
will be something that you suggested.

Will it be sensible to think about having a generic API that can be
consumed by all the drivers and that can do both the things you mentioned?

>
> Thanks,
>
>         tglx
>
>
>
Jakub Kicinski Oct. 22, 2020, 12:02 a.m. UTC | #18
On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:  
> >> However, IMHO we would still need a logic to prevent the devices from
> >> creating excess vectors.  
> >
> > Managed interrupts are preventing exactly that by pinning the interrupts
> > and queues to one or a set of CPUs, which prevents vector exhaustion on
> > CPU hotplug.
> >
> > Non-managed, yes that is and always was a problem. One of the reasons
> > why managed interrupts exist.  
> 
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
> 
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.
> 
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.
> 
> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
> 
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

I think Mellanox folks made some forays into managed irqs, but I don't
remember/can't find the details now.

For networking the locality / queue per core does not always work,
since the incoming traffic is usually spread based on a hash. Many
applications perform better when network processing is done on a small
subset of CPUs, and application doesn't get interrupted every 100us. 
So we do need extra user control here.

We have a bit of a uAPI problem since people had grown to depend on
IRQ == queue == NAPI to configure their systems. "The right way" out
would be a proper API which allows associating queues with CPUs rather
than IRQs, then we can use managed IRQs and solve many other problems.

Such new API has been in the works / discussions for a while now.

(Magnus keep me honest here, if you disagree the queue API solves this.)
Jacob Keller Oct. 22, 2020, 12:27 a.m. UTC | #19
On 10/21/2020 5:02 PM, Jakub Kicinski wrote:
> On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
>> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
>>> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:  
>>>> However, IMHO we would still need a logic to prevent the devices from
>>>> creating excess vectors.  
>>>
>>> Managed interrupts are preventing exactly that by pinning the interrupts
>>> and queues to one or a set of CPUs, which prevents vector exhaustion on
>>> CPU hotplug.
>>>
>>> Non-managed, yes that is and always was a problem. One of the reasons
>>> why managed interrupts exist.  
>>
>> But why is this only a problem for isolation? The very same problem
>> exists vs. CPU hotplug and therefore hibernation.
>>
>> On x86 we have at max. 204 vectors available for device interrupts per
>> CPU. So assumed the only device interrupt in use is networking then any
>> machine which has more than 204 network interrupts (queues, aux ...)
>> active will prevent the machine from hibernation.
>>
>> Aside of that it's silly to have multiple queues targeted at a single
>> CPU in case of hotplug. And that's not a theoretical problem.  Some
>> power management schemes shut down sockets when the utilization of a
>> system is low enough, e.g. outside of working hours.
>>
>> The whole point of multi-queue is to have locality so that traffic from
>> a CPU goes through the CPU local queue. What's the point of having two
>> or more queues on a CPU in case of hotplug?
>>
>> The right answer to this is to utilize managed interrupts and have
>> according logic in your network driver to handle CPU hotplug. When a CPU
>> goes down, then the queue which is associated to that CPU is quiesced
>> and the interrupt core shuts down the relevant interrupt instead of
>> moving it to an online CPU (which causes the whole vector exhaustion
>> problem on x86). When the CPU comes online again, then the interrupt is
>> reenabled in the core and the driver reactivates the queue.
> 
> I think Mellanox folks made some forays into managed irqs, but I don't
> remember/can't find the details now.
> 

I remember looking into this a few years ago, and not getting very far
either.

> For networking the locality / queue per core does not always work,
> since the incoming traffic is usually spread based on a hash. Many
> applications perform better when network processing is done on a small
> subset of CPUs, and application doesn't get interrupted every 100us. 
> So we do need extra user control here.
> 
> We have a bit of a uAPI problem since people had grown to depend on
> IRQ == queue == NAPI to configure their systems. "The right way" out
> would be a proper API which allows associating queues with CPUs rather
> than IRQs, then we can use managed IRQs and solve many other problems.
> 

I think we (Intel) hit some of the same issues you mention.

I know I personally would like to see something that lets a lot of the
current driver-specific policy be moved out. I think it should be
possible to significantly simplify the abstraction used by the drivers.

> Such new API has been in the works / discussions for a while now.
> 
> (Magnus keep me honest here, if you disagree the queue API solves this.)
>
Thomas Gleixner Oct. 22, 2020, 8:28 a.m. UTC | #20
On Wed, Oct 21 2020 at 17:02, Jakub Kicinski wrote:
> On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
>> The right answer to this is to utilize managed interrupts and have
>> according logic in your network driver to handle CPU hotplug. When a CPU
>> goes down, then the queue which is associated to that CPU is quiesced
>> and the interrupt core shuts down the relevant interrupt instead of
>> moving it to an online CPU (which causes the whole vector exhaustion
>> problem on x86). When the CPU comes online again, then the interrupt is
>> reenabled in the core and the driver reactivates the queue.
>
> I think Mellanox folks made some forays into managed irqs, but I don't
> remember/can't find the details now.
>
> For networking the locality / queue per core does not always work,
> since the incoming traffic is usually spread based on a hash. Many

That makes it problematic and is fundamentally different from block I/O.

> applications perform better when network processing is done on a small
> subset of CPUs, and application doesn't get interrupted every 100us. 
> So we do need extra user control here.

Ok.

> We have a bit of a uAPI problem since people had grown to depend on
> IRQ == queue == NAPI to configure their systems. "The right way" out
> would be a proper API which allows associating queues with CPUs rather
> than IRQs, then we can use managed IRQs and solve many other problems.
>
> Such new API has been in the works / discussions for a while now.

If there is anything which needs to be done/extended on the irq side
please let me know.

Thanks

        tglx
Marcelo Tosatti Oct. 22, 2020, 12:28 p.m. UTC | #21
On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
> >> However, IMHO we would still need a logic to prevent the devices from
> >> creating excess vectors.
> >
> > Managed interrupts are preventing exactly that by pinning the interrupts
> > and queues to one or a set of CPUs, which prevents vector exhaustion on
> > CPU hotplug.
> >
> > Non-managed, yes that is and always was a problem. One of the reasons
> > why managed interrupts exist.
> 
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
> 
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.
> 
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.

Exactly. It seems the proper way to do handle this is to disable
individual vectors rather than moving them. And that is needed for 
dynamic isolate / unisolate anyway...

> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
> 
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

Aha... But it would be necessary to do that from userspace (for runtime
isolate/unisolate).
Nitesh Narayan Lal Oct. 22, 2020, 5:47 p.m. UTC | #22
On 10/20/20 10:39 AM, Nitesh Narayan Lal wrote:
> On 10/20/20 9:41 AM, Peter Zijlstra wrote:
>> On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
>>> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
>>>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>>>>> So I think it is important to figure out what that driver really wants
>>>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>>>>> only reduce the number of CPUs, the proposed interface is wrong.
>>>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
>>>> Then the patch is wrong and the interface needs changing from @min_vecs,
>>>> @max_vecs to something that expresses the N*nr_cpus relation.
>>> Reading Marcelo's comment again I think what is really expected is 1
>>> interrupt per non-isolated (housekeeping) CPU (not N interrupts).
>> Then what is the point of them asking for N*nr_cpus when there is no
>> isolation?
>>
>> Either everybody wants 1 interrupts per CPU and we can do the clamp
>> unconditionally, in which case we should go fix this user, or they want
>> multiple per cpu and we should go fix the interface.
>>
>> It cannot be both.
> Based on my understanding I don't think this is consistent, the number
> of interrupts any driver can request varies to an extent that some
> consumer of this API even request just one interrupt for its use.
>
> This was one of the reasons why I thought of having a conditional
> restriction.
>
> But I agree there is a lack of consistency.
>

Hi Peter,

So based on the suggestions from you and Thomas, I think something like the
following should do the job within pci_alloc_irq_vectors_affinity():

+       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))
+               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"
and to an extent I agree that it does degrade the code clarity.

However, since there is a certain inconsistency in the number of vectors
that drivers request through this API IMHO we will need this, otherwise
we could cause an impact on the drivers even in setups that doesn't
have any isolated CPUs.

If you agree, I can send the next version of the patch-set.
Thomas Gleixner Oct. 22, 2020, 10:39 p.m. UTC | #23
On Thu, Oct 22 2020 at 09:28, Marcelo Tosatti wrote:
> On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote:
>> The right answer to this is to utilize managed interrupts and have
>> according logic in your network driver to handle CPU hotplug. When a CPU
>> goes down, then the queue which is associated to that CPU is quiesced
>> and the interrupt core shuts down the relevant interrupt instead of
>> moving it to an online CPU (which causes the whole vector exhaustion
>> problem on x86). When the CPU comes online again, then the interrupt is
>> reenabled in the core and the driver reactivates the queue.
>
> Aha... But it would be necessary to do that from userspace (for runtime
> isolate/unisolate).

For anything which uses managed interrupts this is a non-problem and
userspace has absolutely no business with it.

Isolation does not shut down queues, at least not the block multi-queue
ones which are only active when I/O is issued from that isolated CPU.

So transitioning out of isolation requires no action at all.

Transitioning in or changing the housekeeping mask needs some trivial
tweak to handle the case where there is an overlap in the cpuset of a
queue (housekeeping and isolated). This is handled already for setup and
affinity changes, but of course not for runtime isolation mask changes,
but that's a trivial thing to do.

What's more interesting is how to deal with the network problem where
there is no guarantee that the "response" ends up on the same queue as
the "request" which is what the block people rely on. And that problem
is not really an interrupt affinity problem in the first place.

Thanks,

        tglx
Peter Zijlstra Oct. 23, 2020, 8:58 a.m. UTC | #24
On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:

> Hi Peter,
> 
> So based on the suggestions from you and Thomas, I think something like the
> following should do the job within pci_alloc_irq_vectors_affinity():
> 
> +       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))
> +               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> 
> I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"
> and to an extent I agree that it does degrade the code clarity.

It's not just code clarity; I simply don't understand it. It feels like
a band-aid that breaks thing.

At the very least it needs a ginormous (and coherent) comment that
explains:

 - the interface
 - the usage
 - this hack

> However, since there is a certain inconsistency in the number of vectors
> that drivers request through this API IMHO we will need this, otherwise
> we could cause an impact on the drivers even in setups that doesn't
> have any isolated CPUs.

So shouldn't we then fix the drivers / interface first, to get rid of
this inconsistency?

> If you agree, I can send the next version of the patch-set.

Well, it's not just me you have to convince.
Nitesh Narayan Lal Oct. 23, 2020, 1:10 p.m. UTC | #25
On 10/23/20 4:58 AM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>
>> Hi Peter,
>>
>> So based on the suggestions from you and Thomas, I think something like the
>> following should do the job within pci_alloc_irq_vectors_affinity():
>>
>> +       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))
>> +               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
>>
>> I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"
>> and to an extent I agree that it does degrade the code clarity.
> It's not just code clarity; I simply don't understand it. It feels like
> a band-aid that breaks thing.
>
> At the very least it needs a ginormous (and coherent) comment that
> explains:
>
>  - the interface
>  - the usage
>  - this hack

That make sense.

>
>> However, since there is a certain inconsistency in the number of vectors
>> that drivers request through this API IMHO we will need this, otherwise
>> we could cause an impact on the drivers even in setups that doesn't
>> have any isolated CPUs.
> So shouldn't we then fix the drivers / interface first, to get rid of
> this inconsistency?
>

Considering we agree that excess vector is a problem that needs to be
solved across all the drivers and that you are comfortable with the other
three patches in the set. If I may suggest the following:

- We can pick those three patches for now, as that will atleast fix a
  driver that is currently impacting RT workloads. Is that a fair
  expectation?

- In the meanwhile, I will start looking into individual drivers that
  consume this API to find out if there is a co-relation that can be
  derived between the max_vecs and number of CPUs. If that exists then I
  can go ahead and tweak the API's max_vecs accordingly. However, if this
  is absolutely random then I can come up with a sane comment
  before this check that covers the list of items you suggested.

- I also want to explore the comments made by Thomas which may take
  some time.
Thomas Gleixner Oct. 23, 2020, 9 p.m. UTC | #26
On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
> On 10/23/20 4:58 AM, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>> So shouldn't we then fix the drivers / interface first, to get rid of
>> this inconsistency?
>>
> Considering we agree that excess vector is a problem that needs to be
> solved across all the drivers and that you are comfortable with the other
> three patches in the set. If I may suggest the following:
>
> - We can pick those three patches for now, as that will atleast fix a
>   driver that is currently impacting RT workloads. Is that a fair
>   expectation?

No. Blindly reducing the maximum vectors to the number of housekeeping
CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide
what the right number of interrupts for this situation is.

Many of these drivers need more than queue interrupts, admin, error
interrupt and some operate best with seperate RX/TX interrupts per
queue. They all can "work" with a single PCI interrupt of course, but
the price you pay is performance.

An isolated setup, which I'm familiar with, has two housekeeping
CPUs. So far I restricted the number of network queues with a module
argument to two, which allocates two management interrupts for the
device and two interrupts (RX/TX) per queue, i.e. a total of six.

Now I reduced the number of available interrupts to two according to
your hack, which makes it use one queue RX/TX combined and one
management interrupt. Guess what happens? Network performance tanks to
the points that it breaks a carefully crafted setup.

The same applies to a device which is application specific and wants one
channel including an interrupt per isolated application core. Today I
can isolate 8 out of 12 CPUs and let the device create 8 channels and
set one interrupt and channel affine to each isolated CPU. With your
hack, I get only 4 interrupts and channels. Fail!

You cannot declare that all this is perfectly fine, just because it does
not matter for your particular use case.

So without information from the driver which tells what the best number
of interrupts is with a reduced number of CPUs, this cutoff will cause
more problems than it solves. Regressions guaranteed.

Managed interrupts base their interrupt allocation and spreading on
information which is handed in by the individual driver and not on crude
assumptions. They are not imposing restrictions on the use case.

It's perfectly fine for isolated work to save a data set to disk after
computation has finished and that just works with the per-cpu I/O queue
which is otherwise completely silent. All isolated workers can do the
same in parallel without trampling on each other toes by competing for a
reduced number of queues which are affine to the housekeeper CPUs.

Unfortunately network multi-queue is substantially different from block
multi-queue (as I learned in this conversation), so the concept cannot
be applied one-to-one to networking as is. But there are certainly part
of it which can be reused.

This needs a lot more thought than just these crude hacks.

Especially under the aspect that there are talks about making isolation
runtime switchable. Are you going to rmmod/insmod the i40e network
driver to do so? That's going to work fine if you do that
reconfiguration over network...

Thanks,

        tglx
Nitesh Narayan Lal Oct. 26, 2020, 1:35 p.m. UTC | #27
On 10/23/20 5:00 PM, Thomas Gleixner wrote:
> On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
>> On 10/23/20 4:58 AM, Peter Zijlstra wrote:
>>> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>>> So shouldn't we then fix the drivers / interface first, to get rid of
>>> this inconsistency?
>>>
>> Considering we agree that excess vector is a problem that needs to be
>> solved across all the drivers and that you are comfortable with the other
>> three patches in the set. If I may suggest the following:
>>
>> - We can pick those three patches for now, as that will atleast fix a
>>   driver that is currently impacting RT workloads. Is that a fair
>>   expectation?
> No. Blindly reducing the maximum vectors to the number of housekeeping
> CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide
> what the right number of interrupts for this situation is.
>
> Many of these drivers need more than queue interrupts, admin, error
> interrupt and some operate best with seperate RX/TX interrupts per
> queue. They all can "work" with a single PCI interrupt of course, but
> the price you pay is performance.
>
> An isolated setup, which I'm familiar with, has two housekeeping
> CPUs. So far I restricted the number of network queues with a module
> argument to two, which allocates two management interrupts for the
> device and two interrupts (RX/TX) per queue, i.e. a total of six.

Does it somehow take num_online_cpus() into consideration while deciding
the number of interrupts to create?

>
> Now I reduced the number of available interrupts to two according to
> your hack, which makes it use one queue RX/TX combined and one
> management interrupt. Guess what happens? Network performance tanks to
> the points that it breaks a carefully crafted setup.
>
> The same applies to a device which is application specific and wants one
> channel including an interrupt per isolated application core. Today I
> can isolate 8 out of 12 CPUs and let the device create 8 channels and
> set one interrupt and channel affine to each isolated CPU. With your
> hack, I get only 4 interrupts and channels. Fail!
>
> You cannot declare that all this is perfectly fine, just because it does
> not matter for your particular use case.

I agree that does sound wrong.

>
> So without information from the driver which tells what the best number
> of interrupts is with a reduced number of CPUs, this cutoff will cause
> more problems than it solves. Regressions guaranteed.

Indeed.
I think one commonality among the drivers at the moment is the usage of
num_online_cpus() to determine the vectors to create.

So, maybe instead of doing this kind of restrictions in a generic level
API, it will make more sense to do this on a per-device basis by replacing
the number of online CPUs with the housekeeping CPUs?

This is what I have done in the i40e patch.
But that still sounds hackish and will impact the performance.

>
> Managed interrupts base their interrupt allocation and spreading on
> information which is handed in by the individual driver and not on crude
> assumptions. They are not imposing restrictions on the use case.

Right, FWIU it is irq_do_set_affinity that prevents the spreading of
managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled,
isn't?

>
> It's perfectly fine for isolated work to save a data set to disk after
> computation has finished and that just works with the per-cpu I/O queue
> which is otherwise completely silent. All isolated workers can do the
> same in parallel without trampling on each other toes by competing for a
> reduced number of queues which are affine to the housekeeper CPUs.
>
> Unfortunately network multi-queue is substantially different from block
> multi-queue (as I learned in this conversation), so the concept cannot
> be applied one-to-one to networking as is. But there are certainly part
> of it which can be reused.

So this is one of the areas that I don't understand well yet and will
explore now.

>
> This needs a lot more thought than just these crude hacks.

Got it.
I am indeed not in the favor of pushing a solution that has a possibility
of regressing/breaking things afterward.

>
> Especially under the aspect that there are talks about making isolation
> runtime switchable. Are you going to rmmod/insmod the i40e network
> driver to do so? That's going to work fine if you do that
> reconfiguration over network...

That's an interesting point. However, for some of the drivers which uses
something like cpu_online/possible_mask while creating its affinity mask
reloading will still associate jobs to the isolated CPUs.
Thomas Gleixner Oct. 26, 2020, 1:57 p.m. UTC | #28
On Mon, Oct 26 2020 at 09:35, Nitesh Narayan Lal wrote:
> On 10/23/20 5:00 PM, Thomas Gleixner wrote:
>> An isolated setup, which I'm familiar with, has two housekeeping
>> CPUs. So far I restricted the number of network queues with a module
>> argument to two, which allocates two management interrupts for the
>> device and two interrupts (RX/TX) per queue, i.e. a total of six.
>
> Does it somehow take num_online_cpus() into consideration while deciding
> the number of interrupts to create?

No, I just tell it to create two queues :)

>> So without information from the driver which tells what the best number
>> of interrupts is with a reduced number of CPUs, this cutoff will cause
>> more problems than it solves. Regressions guaranteed.
>
> Indeed.
> I think one commonality among the drivers at the moment is the usage of
> num_online_cpus() to determine the vectors to create.
>
> So, maybe instead of doing this kind of restrictions in a generic level
> API, it will make more sense to do this on a per-device basis by replacing
> the number of online CPUs with the housekeeping CPUs?
>
> This is what I have done in the i40e patch.
> But that still sounds hackish and will impact the performance.

You want an interface which allows the driver to say:

  I need N interrupts for general management and ideally M interrupts
  per queue.

This is similar to the way drivers tell the core code about their
requirements for managed interrupts for the spreading calculation.

>> Managed interrupts base their interrupt allocation and spreading on
>> information which is handed in by the individual driver and not on crude
>> assumptions. They are not imposing restrictions on the use case.
>
> Right, FWIU it is irq_do_set_affinity that prevents the spreading of
> managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled,
> isn't?

No. Spreading takes possible CPUs into account. HK_FLAG_MANAGED_IRQ does
not influence spreading at all.

It only handles the case that an interrupt is affine to more than one
CPUs and the resulting affinity mask spawns both housekeeping and
isolated CPUs. It then steers the interrupt to the housekeeping CPUs (as
long as there is one online).

Thanks,

        tglx
Marcelo Tosatti Oct. 26, 2020, 5:30 p.m. UTC | #29
On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
> On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
> > On 10/23/20 4:58 AM, Peter Zijlstra wrote:
> >> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
> >> So shouldn't we then fix the drivers / interface first, to get rid of
> >> this inconsistency?
> >>
> > Considering we agree that excess vector is a problem that needs to be
> > solved across all the drivers and that you are comfortable with the other
> > three patches in the set. If I may suggest the following:
> >
> > - We can pick those three patches for now, as that will atleast fix a
> >   driver that is currently impacting RT workloads. Is that a fair
> >   expectation?
> 
> No. Blindly reducing the maximum vectors to the number of housekeeping
> CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide
> what the right number of interrupts for this situation is.
> 
> Many of these drivers need more than queue interrupts, admin, error
> interrupt and some operate best with seperate RX/TX interrupts per
> queue. They all can "work" with a single PCI interrupt of course, but
> the price you pay is performance.
> 
> An isolated setup, which I'm familiar with, has two housekeeping
> CPUs. So far I restricted the number of network queues with a module
> argument to two, which allocates two management interrupts for the
> device and two interrupts (RX/TX) per queue, i.e. a total of six.
> 
> Now I reduced the number of available interrupts to two according to
> your hack, which makes it use one queue RX/TX combined and one
> management interrupt. Guess what happens? Network performance tanks to
> the points that it breaks a carefully crafted setup.
> 
> The same applies to a device which is application specific and wants one
> channel including an interrupt per isolated application core. Today I
> can isolate 8 out of 12 CPUs and let the device create 8 channels and
> set one interrupt and channel affine to each isolated CPU. With your
> hack, I get only 4 interrupts and channels. Fail!

Good point.

> You cannot declare that all this is perfectly fine, just because it does
> not matter for your particular use case.
> 
> So without information from the driver which tells what the best number
> of interrupts is with a reduced number of CPUs, this cutoff will cause
> more problems than it solves. Regressions guaranteed.

One might want to move from one interrupt per isolated app core
to zero, or vice versa. It seems that "best number of interrupts 
is with reduced number of CPUs" information, is therefore in userspace, 
not in driver...

No?

> Managed interrupts base their interrupt allocation and spreading on
> information which is handed in by the individual driver and not on crude
> assumptions. They are not imposing restrictions on the use case.
> 
> It's perfectly fine for isolated work to save a data set to disk after
> computation has finished and that just works with the per-cpu I/O queue
> which is otherwise completely silent. 

Userspace could only change the mask of interrupts which are not 
triggered by requests from the local CPU (admin, error, mgmt, etc),
to avoid the vector exhaustion problem.

However, there is no explicit way for userspace to know that, as far as
i know.

 130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1
 131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2
 132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3
 133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4
 134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5
 135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6
 136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7
 137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8


Can that be retrieved from PCI-MSI information, or drivers
have to inform this? 

> All isolated workers can do the
> same in parallel without trampling on each other toes by competing for a
> reduced number of queues which are affine to the housekeeper CPUs.
> 
> Unfortunately network multi-queue is substantially different from block
> multi-queue (as I learned in this conversation), so the concept cannot
> be applied one-to-one to networking as is. But there are certainly part
> of it which can be reused.
> 
> This needs a lot more thought than just these crude hacks.
> 
> Especially under the aspect that there are talks about making isolation
> runtime switchable. Are you going to rmmod/insmod the i40e network
> driver to do so? That's going to work fine if you do that
> reconfiguration over network...
> 
> Thanks,
> 
>         tglx
Thomas Gleixner Oct. 26, 2020, 7 p.m. UTC | #30
On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote:
> On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
>> So without information from the driver which tells what the best number
>> of interrupts is with a reduced number of CPUs, this cutoff will cause
>> more problems than it solves. Regressions guaranteed.
>
> One might want to move from one interrupt per isolated app core
> to zero, or vice versa. It seems that "best number of interrupts 
> is with reduced number of CPUs" information, is therefore in userspace, 
> not in driver...

How does userspace know about the driver internals? Number of management
interrupts, optimal number of interrupts per queue?

>> Managed interrupts base their interrupt allocation and spreading on
>> information which is handed in by the individual driver and not on crude
>> assumptions. They are not imposing restrictions on the use case.
>> 
>> It's perfectly fine for isolated work to save a data set to disk after
>> computation has finished and that just works with the per-cpu I/O queue
>> which is otherwise completely silent. 
>
> Userspace could only change the mask of interrupts which are not 
> triggered by requests from the local CPU (admin, error, mgmt, etc),
> to avoid the vector exhaustion problem.
>
> However, there is no explicit way for userspace to know that, as far as
> i know.
>
>  130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1
>  131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2
>  132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3
>  133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4
>  134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5
>  135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6
>  136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7
>  137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8
>
> Can that be retrieved from PCI-MSI information, or drivers
> have to inform this?

The driver should use a different name for the admin queues.

Thanks,

        tglx
Marcelo Tosatti Oct. 26, 2020, 7:11 p.m. UTC | #31
On Mon, Oct 26, 2020 at 08:00:39PM +0100, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote:
> > On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
> >> So without information from the driver which tells what the best number
> >> of interrupts is with a reduced number of CPUs, this cutoff will cause
> >> more problems than it solves. Regressions guaranteed.
> >
> > One might want to move from one interrupt per isolated app core
> > to zero, or vice versa. It seems that "best number of interrupts 
> > is with reduced number of CPUs" information, is therefore in userspace, 
> > not in driver...
> 
> How does userspace know about the driver internals? Number of management
> interrupts, optimal number of interrupts per queue?
> 
> >> Managed interrupts base their interrupt allocation and spreading on
> >> information which is handed in by the individual driver and not on crude
> >> assumptions. They are not imposing restrictions on the use case.
> >> 
> >> It's perfectly fine for isolated work to save a data set to disk after
> >> computation has finished and that just works with the per-cpu I/O queue
> >> which is otherwise completely silent. 
> >
> > Userspace could only change the mask of interrupts which are not 
> > triggered by requests from the local CPU (admin, error, mgmt, etc),
> > to avoid the vector exhaustion problem.
> >
> > However, there is no explicit way for userspace to know that, as far as
> > i know.
> >
> >  130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1
> >  131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2
> >  132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3
> >  133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4
> >  134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5
> >  135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6
> >  136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7
> >  137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8
> >
> > Can that be retrieved from PCI-MSI information, or drivers
> > have to inform this?
> 
> The driver should use a different name for the admin queues.

Works for me.

Sounds more like a heuristic which can break, so documenting this 
as an "interface" seems appropriate.
Jacob Keller Oct. 26, 2020, 7:21 p.m. UTC | #32
On 10/26/2020 12:00 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote:
>> On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
>>> So without information from the driver which tells what the best number
>>> of interrupts is with a reduced number of CPUs, this cutoff will cause
>>> more problems than it solves. Regressions guaranteed.
>>
>> One might want to move from one interrupt per isolated app core
>> to zero, or vice versa. It seems that "best number of interrupts 
>> is with reduced number of CPUs" information, is therefore in userspace, 
>> not in driver...
> 
> How does userspace know about the driver internals? Number of management
> interrupts, optimal number of interrupts per queue?
> 

I guess this is the problem solved in part by the queue management work
that would make queues a thing that userspace is aware of.

Are there drivers which use more than one interrupt per queue? I know
drivers have multiple management interrupts.. and I guess some drivers
do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
have multiple queues for one interrupt .. I'm not sure how a single
queue with multiple interrupts would work though.

>>> Managed interrupts base their interrupt allocation and spreading on
>>> information which is handed in by the individual driver and not on crude
>>> assumptions. They are not imposing restrictions on the use case.
>>>
>>> It's perfectly fine for isolated work to save a data set to disk after
>>> computation has finished and that just works with the per-cpu I/O queue
>>> which is otherwise completely silent. 
>>
>> Userspace could only change the mask of interrupts which are not 
>> triggered by requests from the local CPU (admin, error, mgmt, etc),
>> to avoid the vector exhaustion problem.
>>
>> However, there is no explicit way for userspace to know that, as far as
>> i know.
>>
>>  130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1
>>  131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2
>>  132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3
>>  133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4
>>  134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5
>>  135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6
>>  136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7
>>  137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8
>>
>> Can that be retrieved from PCI-MSI information, or drivers
>> have to inform this?
> 
> The driver should use a different name for the admin queues.
> 
> Thanks,
> 
>         tglx
>
Thomas Gleixner Oct. 26, 2020, 8:11 p.m. UTC | #33
On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
> On 10/26/2020 12:00 PM, Thomas Gleixner wrote:
>> How does userspace know about the driver internals? Number of management
>> interrupts, optimal number of interrupts per queue?
>
> I guess this is the problem solved in part by the queue management work
> that would make queues a thing that userspace is aware of.
>
> Are there drivers which use more than one interrupt per queue? I know
> drivers have multiple management interrupts.. and I guess some drivers
> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
> have multiple queues for one interrupt .. I'm not sure how a single
> queue with multiple interrupts would work though.

For block there is always one interrupt per queue. Some Network drivers
seem to have seperate RX and TX interrupts per queue.

Thanks,

        tglx
Jacob Keller Oct. 26, 2020, 9:11 p.m. UTC | #34
On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
>> On 10/26/2020 12:00 PM, Thomas Gleixner wrote:
>>> How does userspace know about the driver internals? Number of management
>>> interrupts, optimal number of interrupts per queue?
>>
>> I guess this is the problem solved in part by the queue management work
>> that would make queues a thing that userspace is aware of.
>>
>> Are there drivers which use more than one interrupt per queue? I know
>> drivers have multiple management interrupts.. and I guess some drivers
>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>> have multiple queues for one interrupt .. I'm not sure how a single
>> queue with multiple interrupts would work though.
> 
> For block there is always one interrupt per queue. Some Network drivers
> seem to have seperate RX and TX interrupts per queue.
> 
> Thanks,
> 
>         tglx
> 

That's true when thinking of Tx and Rx as a single queue. Another way to
think about it is "one rx queue" and "one tx queue" each with their own
interrupt...

Even if there are devices which force there to be exactly queue pairs,
you could still think of them as separate entities?

Hmm.
Thomas Gleixner Oct. 26, 2020, 9:50 p.m. UTC | #35
On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
>>> Are there drivers which use more than one interrupt per queue? I know
>>> drivers have multiple management interrupts.. and I guess some drivers
>>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>>> have multiple queues for one interrupt .. I'm not sure how a single
>>> queue with multiple interrupts would work though.
>> 
>> For block there is always one interrupt per queue. Some Network drivers
>> seem to have seperate RX and TX interrupts per queue.
> That's true when thinking of Tx and Rx as a single queue. Another way to
> think about it is "one rx queue" and "one tx queue" each with their own
> interrupt...
>
> Even if there are devices which force there to be exactly queue pairs,
> you could still think of them as separate entities?

Interesting thought.

But as Jakub explained networking queues are fundamentally different
from block queues on the RX side. For block the request issued on queue
X will raise the complete interrupt on queue X.

For networking the TX side will raise the TX interrupt on the queue on
which the packet was queued obviously or should I say hopefully. :)

But incoming packets will be directed to some receive queue based on a
hash or whatever crystallball logic the firmware decided to implement.

Which makes this not really suitable for the managed interrupt and
spreading approach which is used by block-mq. Hrm...

But I still think that for curing that isolation stuff we want at least
some information from the driver. Alternative solution would be to grant
the allocation of interrupts and queues and have some sysfs knob to shut
down queues at runtime. If that shutdown results in releasing the queue
interrupt (via free_irq()) then the vector exhaustion problem goes away.

Needs more thought and information (for network oblivious folks like
me).

Thanks,

        tglx
Jakub Kicinski Oct. 26, 2020, 10:13 p.m. UTC | #36
On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
> > On 10/26/2020 1:11 PM, Thomas Gleixner wrote:  
> >> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:  
> >>> Are there drivers which use more than one interrupt per queue? I know
> >>> drivers have multiple management interrupts.. and I guess some drivers
> >>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
> >>> have multiple queues for one interrupt .. I'm not sure how a single
> >>> queue with multiple interrupts would work though.  
> >> 
> >> For block there is always one interrupt per queue. Some Network drivers
> >> seem to have seperate RX and TX interrupts per queue.  
> > That's true when thinking of Tx and Rx as a single queue. Another way to
> > think about it is "one rx queue" and "one tx queue" each with their own
> > interrupt...
> >
> > Even if there are devices which force there to be exactly queue pairs,
> > you could still think of them as separate entities?  
> 
> Interesting thought.
> 
> But as Jakub explained networking queues are fundamentally different
> from block queues on the RX side. For block the request issued on queue
> X will raise the complete interrupt on queue X.
> 
> For networking the TX side will raise the TX interrupt on the queue on
> which the packet was queued obviously or should I say hopefully. :)
> 
> But incoming packets will be directed to some receive queue based on a
> hash or whatever crystallball logic the firmware decided to implement.
> 
> Which makes this not really suitable for the managed interrupt and
> spreading approach which is used by block-mq. Hrm...
> 
> But I still think that for curing that isolation stuff we want at least
> some information from the driver. Alternative solution would be to grant
> the allocation of interrupts and queues and have some sysfs knob to shut
> down queues at runtime. If that shutdown results in releasing the queue
> interrupt (via free_irq()) then the vector exhaustion problem goes away.
> 
> Needs more thought and information (for network oblivious folks like
> me).

One piece of information that may be useful is that even tho the RX
packets may be spread semi-randomly the user space can still control
which queues are included in the mechanism. There is an indirection
table in the HW which allows to weigh queues differently, or exclude
selected queues from the spreading. Other mechanisms exist to filter
flows onto specific queues.

IOW just because a core has an queue/interrupt does not mean that
interrupt will ever fire, provided its excluded from RSS.

Another piece is that by default we suggest drivers allocate 8 RX
queues, and online_cpus TX queues. The number of queues can be
independently controlled via ethtool -L. Drivers which can't support
separate queues will default to online_cpus queue pairs, and let
ethtool -L only set the "combined" parameter.

There are drivers which always allocate online_cpus interrupts, 
and then some of them will go unused if #qs < #cpus.


My unpopular opinion is that for networking devices all the heuristics
we may come up with are going to be a dead end. We need an explicit API
to allow users placing queues on cores, and use managed IRQs for data
queues. (I'm assuming that managed IRQs will let us reliably map a MSI-X
vector to a core :))
Nitesh Narayan Lal Oct. 26, 2020, 10:22 p.m. UTC | #37
On 10/26/20 5:50 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
>> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
>>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
>>>> Are there drivers which use more than one interrupt per queue? I know
>>>> drivers have multiple management interrupts.. and I guess some drivers
>>>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>>>> have multiple queues for one interrupt .. I'm not sure how a single
>>>> queue with multiple interrupts would work though.
>>> For block there is always one interrupt per queue. Some Network drivers
>>> seem to have seperate RX and TX interrupts per queue.
>> That's true when thinking of Tx and Rx as a single queue. Another way to
>> think about it is "one rx queue" and "one tx queue" each with their own
>> interrupt...
>>
>> Even if there are devices which force there to be exactly queue pairs,
>> you could still think of them as separate entities?
> Interesting thought.
>
> But as Jakub explained networking queues are fundamentally different
> from block queues on the RX side. For block the request issued on queue
> X will raise the complete interrupt on queue X.
>
> For networking the TX side will raise the TX interrupt on the queue on
> which the packet was queued obviously or should I say hopefully. :)

This is my impression as well.

> But incoming packets will be directed to some receive queue based on a
> hash or whatever crystallball logic the firmware decided to implement.
>
> Which makes this not really suitable for the managed interrupt and
> spreading approach which is used by block-mq. Hrm...
>
> But I still think that for curing that isolation stuff we want at least
> some information from the driver. Alternative solution would be to grant
> the allocation of interrupts and queues and have some sysfs knob to shut
> down queues at runtime. If that shutdown results in releasing the queue
> interrupt (via free_irq()) then the vector exhaustion problem goes away.

I think this is close to what I and Marcelo were discussing earlier today
privately.

I don't think there is currently a way to control the enablement/disablement of
interrupts from the userspace.

I think in terms of the idea we need something similar to what i40e does,
that is shutdown all IRQs when CPU is suspended and restores the interrupt
schema when the CPU is back online.

The two key difference will be that this API needs to be generic and also
needs to be exposed to the userspace through something like sysfs as you
have mentioned.
Thomas Gleixner Oct. 26, 2020, 10:46 p.m. UTC | #38
On Mon, Oct 26 2020 at 15:13, Jakub Kicinski wrote:
> On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote:
>> But I still think that for curing that isolation stuff we want at least
>> some information from the driver. Alternative solution would be to grant
>> the allocation of interrupts and queues and have some sysfs knob to shut
>> down queues at runtime. If that shutdown results in releasing the queue
>> interrupt (via free_irq()) then the vector exhaustion problem goes away.
>> 
>> Needs more thought and information (for network oblivious folks like
>> me).
>
> One piece of information that may be useful is that even tho the RX
> packets may be spread semi-randomly the user space can still control
> which queues are included in the mechanism. There is an indirection
> table in the HW which allows to weigh queues differently, or exclude
> selected queues from the spreading. Other mechanisms exist to filter
> flows onto specific queues.
>
> IOW just because a core has an queue/interrupt does not mean that
> interrupt will ever fire, provided its excluded from RSS.
>
> Another piece is that by default we suggest drivers allocate 8 RX
> queues, and online_cpus TX queues. The number of queues can be
> independently controlled via ethtool -L. Drivers which can't support
> separate queues will default to online_cpus queue pairs, and let
> ethtool -L only set the "combined" parameter.
>
> There are drivers which always allocate online_cpus interrupts, 
> and then some of them will go unused if #qs < #cpus.

Thanks for the enlightment.

> My unpopular opinion is that for networking devices all the heuristics
> we may come up with are going to be a dead end.

I agree. Heuristics suck.

> We need an explicit API to allow users placing queues on cores, and
> use managed IRQs for data queues. (I'm assuming that managed IRQs will
> let us reliably map a MSI-X vector to a core :))

Yes, they allow you to do that. That will need some tweaks to theway
they work today (coming from the strict block mq semantics). You also
need to be aware that managed irqs have also strict semantics vs. CPU
hotplug. If the last CPU in the managed affinity set goes down then the
interrupt is shut down by the irq core which means that you need to
quiesce the associated queue before that happens. When the first CPU
comes online again the interrupt is reenabled, so the queue should be
able to handle it or has ensured that the device does not raise one
before it is able to do so.

Thanks,

        tglx
Thomas Gleixner Oct. 26, 2020, 10:49 p.m. UTC | #39
On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote:
> On 10/26/20 5:50 PM, Thomas Gleixner wrote:
>> But I still think that for curing that isolation stuff we want at least
>> some information from the driver. Alternative solution would be to grant
>> the allocation of interrupts and queues and have some sysfs knob to shut
>> down queues at runtime. If that shutdown results in releasing the queue
>> interrupt (via free_irq()) then the vector exhaustion problem goes away.
>
> I think this is close to what I and Marcelo were discussing earlier today
> privately.
>
> I don't think there is currently a way to control the enablement/disablement of
> interrupts from the userspace.

You cannot just disable the interrupt. You need to make sure that the
associated queue is shutdown or quiesced _before_ the interrupt is shut
down.

Thanks,

        tglx
Jacob Keller Oct. 26, 2020, 10:52 p.m. UTC | #40
On 10/26/2020 3:13 PM, Jakub Kicinski wrote:
> On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote:
>> On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
>>> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:  
>>>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:  
>>>>> Are there drivers which use more than one interrupt per queue? I know
>>>>> drivers have multiple management interrupts.. and I guess some drivers
>>>>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>>>>> have multiple queues for one interrupt .. I'm not sure how a single
>>>>> queue with multiple interrupts would work though.  
>>>>
>>>> For block there is always one interrupt per queue. Some Network drivers
>>>> seem to have seperate RX and TX interrupts per queue.  
>>> That's true when thinking of Tx and Rx as a single queue. Another way to
>>> think about it is "one rx queue" and "one tx queue" each with their own
>>> interrupt...
>>>
>>> Even if there are devices which force there to be exactly queue pairs,
>>> you could still think of them as separate entities?  
>>
>> Interesting thought.
>>
>> But as Jakub explained networking queues are fundamentally different
>> from block queues on the RX side. For block the request issued on queue
>> X will raise the complete interrupt on queue X.
>>
>> For networking the TX side will raise the TX interrupt on the queue on
>> which the packet was queued obviously or should I say hopefully. :)
>>
>> But incoming packets will be directed to some receive queue based on a
>> hash or whatever crystallball logic the firmware decided to implement.
>>
>> Which makes this not really suitable for the managed interrupt and
>> spreading approach which is used by block-mq. Hrm...
>>
>> But I still think that for curing that isolation stuff we want at least
>> some information from the driver. Alternative solution would be to grant
>> the allocation of interrupts and queues and have some sysfs knob to shut
>> down queues at runtime. If that shutdown results in releasing the queue
>> interrupt (via free_irq()) then the vector exhaustion problem goes away.
>>
>> Needs more thought and information (for network oblivious folks like
>> me).
> 
> One piece of information that may be useful is that even tho the RX
> packets may be spread semi-randomly the user space can still control
> which queues are included in the mechanism. There is an indirection
> table in the HW which allows to weigh queues differently, or exclude
> selected queues from the spreading. Other mechanisms exist to filter
> flows onto specific queues.
> 
> IOW just because a core has an queue/interrupt does not mean that
> interrupt will ever fire, provided its excluded from RSS.
> 
> Another piece is that by default we suggest drivers allocate 8 RX
> queues, and online_cpus TX queues. The number of queues can be
> independently controlled via ethtool -L. Drivers which can't support
> separate queues will default to online_cpus queue pairs, and let
> ethtool -L only set the "combined" parameter.
> 

I know the Intel drivers usually have defaulted to trying to maintain
queue pairs. I do not believe this is technically a HW restriction, but
it is heavily built into the way the drivers work today.

> There are drivers which always allocate online_cpus interrupts, 
> and then some of them will go unused if #qs < #cpus.
> 
> 

Right.

> My unpopular opinion is that for networking devices all the heuristics
> we may come up with are going to be a dead end. We need an explicit API
> to allow users placing queues on cores, and use managed IRQs for data
> queues. (I'm assuming that managed IRQs will let us reliably map a MSI-X
> vector to a core :))
> 

I don't think it is that unpopular... This is the direction I'd like to
see us go as well.
Jacob Keller Oct. 26, 2020, 11:08 p.m. UTC | #41
On 10/26/2020 3:49 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote:
>> On 10/26/20 5:50 PM, Thomas Gleixner wrote:
>>> But I still think that for curing that isolation stuff we want at least
>>> some information from the driver. Alternative solution would be to grant
>>> the allocation of interrupts and queues and have some sysfs knob to shut
>>> down queues at runtime. If that shutdown results in releasing the queue
>>> interrupt (via free_irq()) then the vector exhaustion problem goes away.
>>
>> I think this is close to what I and Marcelo were discussing earlier today
>> privately.
>>
>> I don't think there is currently a way to control the enablement/disablement of
>> interrupts from the userspace.
> 
> You cannot just disable the interrupt. You need to make sure that the
> associated queue is shutdown or quiesced _before_ the interrupt is shut
> down.
> 
> Thanks,
> 
>         tglx
> 

Could this be handled with a callback to the driver/hw? I know Intel HW
should support this type of quiesce/shutdown.

Thanks,
Jake
Marcelo Tosatti Oct. 27, 2020, 11:47 a.m. UTC | #42
On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote:
> 
> On 10/26/20 5:50 PM, Thomas Gleixner wrote:
> > On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
> >> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
> >>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
> >>>> Are there drivers which use more than one interrupt per queue? I know
> >>>> drivers have multiple management interrupts.. and I guess some drivers
> >>>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
> >>>> have multiple queues for one interrupt .. I'm not sure how a single
> >>>> queue with multiple interrupts would work though.
> >>> For block there is always one interrupt per queue. Some Network drivers
> >>> seem to have seperate RX and TX interrupts per queue.
> >> That's true when thinking of Tx and Rx as a single queue. Another way to
> >> think about it is "one rx queue" and "one tx queue" each with their own
> >> interrupt...
> >>
> >> Even if there are devices which force there to be exactly queue pairs,
> >> you could still think of them as separate entities?
> > Interesting thought.
> >
> > But as Jakub explained networking queues are fundamentally different
> > from block queues on the RX side. For block the request issued on queue
> > X will raise the complete interrupt on queue X.
> >
> > For networking the TX side will raise the TX interrupt on the queue on
> > which the packet was queued obviously or should I say hopefully. :)
> 
> This is my impression as well.
> 
> > But incoming packets will be directed to some receive queue based on a
> > hash or whatever crystallball logic the firmware decided to implement.
> >
> > Which makes this not really suitable for the managed interrupt and
> > spreading approach which is used by block-mq. Hrm...
> >
> > But I still think that for curing that isolation stuff we want at least
> > some information from the driver. Alternative solution would be to grant
> > the allocation of interrupts and queues and have some sysfs knob to shut
> > down queues at runtime. If that shutdown results in releasing the queue
> > interrupt (via free_irq()) then the vector exhaustion problem goes away.
> 
> I think this is close to what I and Marcelo were discussing earlier today
> privately.
> 
> I don't think there is currently a way to control the enablement/disablement of
> interrupts from the userspace.

As long as the interrupt obeys the "trigger when request has been
performed by local CPU" rule (#1) (for MSI type interrupts, where driver allocates
one I/O interrupt per CPU), don't see a need for the interface.

For other types of interrupts, interrupt controller should be programmed
to not include the isolated CPU on its "destination CPU list".

About the block VS network discussion, what we are trying to do at skb
level (Paolo Abeni CC'ed, author of the suggestion) is to use RPS to
avoid skbs from being queued to a CPU (on RX), and to queue skbs
on housekeeping CPUs for processing (TX).

However, if per-CPU interrupts are not disabled, then the (for example)
network device is free to include the CPU in its list of destinations.
Which would require one to say, configure RPS (or whatever mechanism
is distributing interrupts).

Hum, it would feel safer (rather than trust the #1 rule to be valid
in all cases) to ask the driver to disable the interrupt (after shutting
down queue) for that particular CPU.

BTW, Thomas, software is free to configure a particular MSI-X interrupt
to point to any CPU:

10.11 MESSAGE SIGNALLED INTERRUPTS
The PCI Local Bus Specification, Rev 2.2 (www.pcisig.com) introduces the concept of message signalled interrupts.
As the specification indicates:
“Message signalled interrupts (MSI) is an optional feature that enables PCI devices to request
service by writing a system-specified message to a system-specified address (PCI DWORD memory
write transaction). The transaction address specifies the message destination while the transaction
data specifies the message. System software is expected to initialize the message destination and
message during device configuration, allocating one or more non-shared messages to each MSI
capable function.”

Fields in the Message Address Register are as follows:
1. Bits 31-20 — These bits contain a fixed value for interrupt messages (0FEEH). This value locates interrupts at
the 1-MByte area with a base address of 4G – 18M. All accesses to this region are directed as interrupt
messages. Care must to be taken to ensure that no other device claims the region as I/O space.
2. Destination ID — This field contains an 8-bit destination ID. It identifies the message’s target processor(s).
The destination ID corresponds to bits 63:56 of the I/O APIC Redirection Table Entry if the IOAPIC is used to
dispatch the interrupt to the processor(s).

---

So taking the example where computation happens while isolated and later
stored via block interface, aren't we restricting the usage scenarios
by enforcing the "per-CPU queue has interrupt pointing to owner CPU" rule?

> I think in terms of the idea we need something similar to what i40e does,
> that is shutdown all IRQs when CPU is suspended and restores the interrupt
> schema when the CPU is back online.
> 
> The two key difference will be that this API needs to be generic and also
> needs to be exposed to the userspace through something like sysfs as you
> have mentioned.


> 
> -- 
> Thanks
> Nitesh
>
Thomas Gleixner Oct. 27, 2020, 2:28 p.m. UTC | #43
On Mon, Oct 26 2020 at 16:08, Jacob Keller wrote:
> On 10/26/2020 3:49 PM, Thomas Gleixner wrote:
>> On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote:
>>> I don't think there is currently a way to control the enablement/disablement of
>>> interrupts from the userspace.
>> 
>> You cannot just disable the interrupt. You need to make sure that the
>> associated queue is shutdown or quiesced _before_ the interrupt is shut
>> down.
>
> Could this be handled with a callback to the driver/hw? I know Intel HW
> should support this type of quiesce/shutdown.

We can't have a callback from the interrupt shutdown code as you have to
wait for the queue to drain packets in flight. Something like this

     mark queue as going down (no more tx queueing)
     tell hardware not to route RX packets to it
     consume pending RX
     wait for already queued TX packets to be sent

Look what the block people did. They have a common multi-instance
hotplug state and they register each context (queue) as an instance. The
hotplug core invokes the corresponding callbacks when bringing a CPU up
or when shutting it down.

Thanks,

        tglx
Thomas Gleixner Oct. 27, 2020, 2:43 p.m. UTC | #44
On Tue, Oct 27 2020 at 08:47, Marcelo Tosatti wrote:
> On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote:
> However, if per-CPU interrupts are not disabled, then the (for example)
> network device is free to include the CPU in its list of destinations.
> Which would require one to say, configure RPS (or whatever mechanism
> is distributing interrupts).

And why is that a problem? If that's possible then you can prevent
getting RX interrupts already today.

> Hum, it would feel safer (rather than trust the #1 rule to be valid
> in all cases) to ask the driver to disable the interrupt (after shutting
> down queue) for that particular CPU.
>
> BTW, Thomas, software is free to configure a particular MSI-X interrupt
> to point to any CPU:
>
> 10.11 MESSAGE SIGNALLED INTERRUPTS

I know how MSI works :)

> So taking the example where computation happens while isolated and later
> stored via block interface, aren't we restricting the usage scenarios
> by enforcing the "per-CPU queue has interrupt pointing to owner CPU"
> rule?

No. For block this is the ideal configuration (think locality) and it
prevents vector exhaustion. If you make these interrupts freely routable
then you bring back the vector exhaustion problem right away.

Now we already established that networking has different requirements,
so you have to come up with a different solution for it which allows to
work for all use cases.

Thanks,

        tglx
diff mbox series

Patch

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 30ae4ffda5c1..8c156867803c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -23,6 +23,7 @@ 
 #include <linux/slab.h>
 #include <linux/irqdomain.h>
 #include <linux/of_irq.h>
+#include <linux/sched/isolation.h>
 
 #include "pci.h"
 
@@ -1191,8 +1192,25 @@  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   struct irq_affinity *affd)
 {
 	struct irq_affinity msi_default_affd = {0};
+	unsigned int hk_cpus;
 	int nvecs = -ENOSPC;
 
+	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
+
+	/*
+	 * If we have isolated CPUs for use by real-time tasks, to keep the
+	 * latency overhead to a minimum, device-specific IRQ vectors are moved
+	 * to the housekeeping CPUs from the userspace by changing their
+	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
+	 * running out of IRQ vectors.
+	 */
+	if (hk_cpus < num_online_cpus()) {
+		if (hk_cpus < min_vecs)
+			max_vecs = min_vecs;
+		else if (hk_cpus < max_vecs)
+			max_vecs = hk_cpus;
+	}
+
 	if (flags & PCI_IRQ_AFFINITY) {
 		if (!affd)
 			affd = &msi_default_affd;