PCI MSI: allow alignment restrictions on vector allocation

Message ID	20170925041153.26574-1-drake@endlessm.com
State	Not Applicable
Headers	show Return-Path: <linux-pci-owner@vger.kernel.org> From: Daniel Drake <drake@endlessm.com> To: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com Cc: linux-pci@vger.kernel.org, x86@kernel.org, linux-wireless@vger.kernel.org, ath9k-devel@qca.qualcomm.com, linux@endlessm.com Subject: [PATCH] PCI MSI: allow alignment restrictions on vector allocation Date: Mon, 25 Sep 2017 12:11:53 +0800 Message-Id: <20170925041153.26574-1-drake@endlessm.com> Sender: linux-pci-owner@vger.kernel.org Precedence: bulk
Series	PCI MSI: allow alignment restrictions on vector allocation \| expand PCI MSI: allow alignment restrictions on vector allocation

Daniel Drake Sept. 25, 2017, 4:11 a.m. UTC

ath9k hardware claims to support up to 4 MSI vectors, and when run in
that configuration, it would be allowed to modify the lower bits of the
MSI Message Data when generating interrupts in order to signal which
of the 4 vectors the interrupt is being raised on.

Linux's PCI-MSI irqchip only supports a single MSI vector for each
device, and it tells the device this, but the device appears to assume
it is working with 4, as it will unset the lower 2 bits of Message Data
presumably to indicate that it is an IRQ for the first of 4 possible
vectors.

Linux will then receive an interrupt on the wrong vector, so the
ath9k interrupt handler will not be invoked.

To work around this, introduce a mechanism where the vector assignment
algorithm can be restricted to only a subset of available vector numbers
based on a bitmap.

As a user of this bitmap, introduce a pci_dev.align_msi_vector flag which
can be used to state that MSI vector numbers must be aligned to a specific
amount. If we 4-align the ath9k MSI vector then the lower bits will
already be 0 and hence the device will not modify the Message Data away
from its original value.

This is needed in order to support the wifi card in at least 8 new Acer
consumer laptop models which come with the Foxconn NFA335 WiFi module.
Legacy interrupts do not work on that module, so MSI support is required.

Signed-off-by: Daniel Drake <drake@endlessm.com>

https://phabricator.endlessm.com/T16988
---
 arch/x86/include/asm/hw_irq.h |  1 +
 arch/x86/kernel/apic/msi.c    | 15 +++++++++++++++
 arch/x86/kernel/apic/vector.c | 32 +++++++++++++++++++++++++-------
 include/linux/pci.h           |  2 ++
 4 files changed, 43 insertions(+), 7 deletions(-)

This solves the issue described here:
https://marc.info/?l=linux-pci&m=150238260826803&w=2

If this approach looks good I'll follow up with the ath9k patch
to enable MSI interrupts.

kernel test robot Sept. 26, 2017, 8:32 p.m. UTC | #1

Hi Daniel,

[auto build test ERROR on pci/next]
[also build test ERROR on v4.14-rc2]
[cannot apply to tip/x86/core next-20170926]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Daniel-Drake/PCI-MSI-allow-alignment-restrictions-on-vector-allocation/20170927-035611
base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: i386-randconfig-x000-201739 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/kernel/apic/vector.c: In function 'assign_irq_vector_policy':
>> arch/x86/kernel/apic/vector.c:266:25: error: 'struct irq_alloc_info' has no member named 'allowed_vectors'
      allowed_vectors = info->allowed_vectors;
                            ^~

vim +266 arch/x86/kernel/apic/vector.c

   252	
   253	static int assign_irq_vector_policy(int irq, int node,
   254					    struct apic_chip_data *data,
   255					    struct irq_alloc_info *info,
   256					    struct irq_data *irqdata)
   257	{
   258		unsigned long *allowed_vectors = NULL;
   259	
   260		/* Some MSI interrupts have restrictions on which vector numbers
   261		 * can be used.
   262		 */
   263		if (info &&
   264			(info->type == X86_IRQ_ALLOC_TYPE_MSI ||
   265			 info->type == X86_IRQ_ALLOC_TYPE_MSIX))
 > 266			allowed_vectors = info->allowed_vectors;
   267	
   268		if (info && info->mask)
   269			return assign_irq_vector(irq, data, info->mask, irqdata,
   270						 allowed_vectors);
   271		if (node != NUMA_NO_NODE &&
   272		    assign_irq_vector(irq, data, cpumask_of_node(node), irqdata,
   273				      allowed_vectors) == 0)
   274			return 0;
   275		return assign_irq_vector(irq, data, apic->target_cpus(), irqdata,
   276					 allowed_vectors);
   277	}
   278	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

Thomas Gleixner Sept. 27, 2017, 3:28 p.m. UTC | #2

On Mon, 25 Sep 2017, Daniel Drake wrote:

Please send x86 related patches to LKML as documented in MAINTAINERS. That
patch is mainly x86 and not PCI.

> ath9k hardware claims to support up to 4 MSI vectors, and when run in
> that configuration, it would be allowed to modify the lower bits of the
> MSI Message Data when generating interrupts in order to signal which
> of the 4 vectors the interrupt is being raised on.
> 
> Linux's PCI-MSI irqchip only supports a single MSI vector for each
> device, and it tells the device this, but the device appears to assume
> it is working with 4, as it will unset the lower 2 bits of Message Data
> presumably to indicate that it is an IRQ for the first of 4 possible
> vectors.
> 
> Linux will then receive an interrupt on the wrong vector, so the
> ath9k interrupt handler will not be invoked.
> 
> To work around this, introduce a mechanism where the vector assignment
> algorithm can be restricted to only a subset of available vector numbers
> based on a bitmap.
> 
> As a user of this bitmap, introduce a pci_dev.align_msi_vector flag which
> can be used to state that MSI vector numbers must be aligned to a specific
> amount. If we 4-align the ath9k MSI vector then the lower bits will
> already be 0 and hence the device will not modify the Message Data away
> from its original value.
> 
> This is needed in order to support the wifi card in at least 8 new Acer
> consumer laptop models which come with the Foxconn NFA335 WiFi module.
> Legacy interrupts do not work on that module, so MSI support is required.

Sigh, what a mess.

> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
> index 6dfe366a8804..7f35178586a1 100644
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -77,6 +77,7 @@ struct irq_alloc_info {
>  		struct {
>  			struct pci_dev	*msi_dev;
>  			irq_hw_number_t	msi_hwirq;
> +			DECLARE_BITMAP(allowed_vectors, NR_VECTORS);

Uurgh. No. You want to express an alignment requirement. Why would you need
a bitmap for that? A simple

	unsigned int align_vector;

is sufficient.


> @@ -178,6 +179,9 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
>  		if (test_bit(vector, used_vectors))
>  			goto next;
>  
> +		if (allowed_vectors && !test_bit(vector, allowed_vectors))
> +			goto next;

This code is gone in -next and replaced by a new vector allocator.

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/apic

> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index f68c58a93dd0..7755450be02d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -419,6 +419,8 @@ struct pci_dev {
>  #endif
>  #ifdef CONFIG_PCI_MSI
>  	const struct attribute_group **msi_irq_groups;
> +	int align_msi_vector;	/* device requires MSI vector numbers aligned
> +				 * by this amount */

This wants to be unsigned int and please get rid of this butt ugly
formatted comment.

But the real question is how this is supposed to work with

 - interrupt remapping

 - non X86 machines

I doubt that this can be made generic enough to make it work all over the
place. Tell Acer/Foxconn to fix their stupid firmware.

Thanks,

	tglx

Daniel Drake Oct. 2, 2017, 8:57 a.m. UTC | #3

Hi Thomas,

Thanks a lot for looking at this.

On Wed, Sep 27, 2017 at 11:28 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> This code is gone in -next and replaced by a new vector allocator.
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/apic

Nice, thanks for cleaning that up!

> But the real question is how this is supposed to work with
>
>  - interrupt remapping

On another system, I have multiple devices using IR-PCI-MSI according
to /proc/interrupts, and lspci shows that a MSI Message Data value 0
is used for every single MSI-enabled device.

I don't know if that's just luck, but its a value that would work
fine for ath9k.

Unfortunately interrupt remapping is not available on the affected
Acer products though.
https://lists.linuxfoundation.org/pipermail/iommu/2017-August/023717.html

>  - non X86 machines

After checking out the new code and thinking this through a bit, I think
perhaps the only generic approach that would work is to make the
ath9k driver require a vector allocation that enables the entire block
of 4 MSI IRQs that the hardware supports (which is what Windows is doing).
This way the alignment requirement is automatically met and ath9k is
free to stamp all over the lower 2 bits of the MSI Message Data.

MSI_FLAG_MULTI_PCI_MSI is already supported by a couple of non-x86 drivers
and the interrupt remapping code, but it is not supported by the generic
pci_msi_controller, and the new vector allocator still rejects
X86_IRQ_ALLOC_CONTIGUOUS_VECTORS.

I will try to take a look at enabling this for the generic allocator,
unless you have any other ideas.

In the mean time, at least for reference, below is an updated version
of the previous patch based on the new allocator and incorporating
your feedback. (It's still rather ugly though)

> I doubt that this can be made generic enough to make it work all over the
> place. Tell Acer/Foxconn to fix their stupid firmware.

We're also working on this approach ever since the problematic models
first appeared months ago, but since these products are in mass production,
I think ultimately we are going to need a Linux workaround.

---
 arch/x86/include/asm/hw_irq.h |  1 +
 arch/x86/kernel/apic/msi.c    |  2 ++
 arch/x86/kernel/apic/vector.c | 14 +++++++++++---
 include/linux/irq.h           |  6 +++---
 include/linux/pci.h           |  1 +
 kernel/irq/matrix.c           | 22 ++++++++++++++--------
 6 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 661540a93072..9e5e1bb62121 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -79,6 +79,7 @@ struct irq_alloc_info {
 		struct {
 			struct pci_dev	*msi_dev;
 			irq_hw_number_t	msi_hwirq;
+			unsigned int vector_align;
 		};
 #endif
 #ifdef	CONFIG_X86_IO_APIC
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 5b6dd1a85ec4..9b5277309c29 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -111,6 +111,8 @@ int pci_msi_prepare(struct irq_domain *domain, struct device *dev, int nvec,
 		arg->flags |= X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;
 	}
 
+	arg->vector_align = pdev->align_msi_vector;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(pci_msi_prepare);
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6789e286def9..e85f19c09c34 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -31,6 +31,7 @@ struct apic_chip_data {
 	unsigned int		cpu;
 	unsigned int		prev_cpu;
 	unsigned int		irq;
+	unsigned int		vector_align;
 	struct hlist_node	clist;
 	unsigned int		move_in_progress	: 1,
 				is_managed		: 1,
@@ -171,7 +172,8 @@ static int reserve_managed_vector(struct irq_data *irqd)
 
 	raw_spin_lock_irqsave(&vector_lock, flags);
 	apicd->is_managed = true;
-	ret = irq_matrix_reserve_managed(vector_matrix, affmsk);
+	ret = irq_matrix_reserve_managed(vector_matrix, affmsk,
+					 apicd->vector_align);
 	raw_spin_unlock_irqrestore(&vector_lock, flags);
 	trace_vector_reserve_managed(irqd->irq, ret);
 	return ret;
@@ -215,7 +217,8 @@ static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest)
 	if (vector && cpu_online(cpu) && cpumask_test_cpu(cpu, dest))
 		return 0;
 
-	vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu);
+	vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu,
+				  apicd->vector_align);
 	if (vector > 0)
 		apic_update_vector(irqd, vector, cpu);
 	trace_vector_alloc(irqd->irq, vector, resvd, vector);
@@ -299,7 +302,8 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
 	/* set_affinity might call here for nothing */
 	if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
 		return 0;
-	vector = irq_matrix_alloc_managed(vector_matrix, cpu);
+	vector = irq_matrix_alloc_managed(vector_matrix, cpu,
+					  apicd->vector_align);
 	trace_vector_alloc_managed(irqd->irq, vector, vector);
 	if (vector < 0)
 		return vector;
@@ -511,6 +515,10 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 			goto error;
 		}
 
+		if (info->type == X86_IRQ_ALLOC_TYPE_MSI
+				|| info->type == X86_IRQ_ALLOC_TYPE_MSIX)
+			apicd->vector_align = info->vector_align;
+
 		apicd->irq = virq + i;
 		irqd->chip = &lapic_controller;
 		irqd->chip_data = apicd;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index fda8da7c45e7..a8c249df5b1e 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1120,13 +1120,13 @@ struct irq_matrix *irq_alloc_matrix(unsigned int matrix_bits,
 void irq_matrix_online(struct irq_matrix *m);
 void irq_matrix_offline(struct irq_matrix *m);
 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace);
-int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk);
+int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk, unsigned int align);
 void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk);
-int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu);
+int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu, unsigned int align);
 void irq_matrix_reserve(struct irq_matrix *m);
 void irq_matrix_remove_reserved(struct irq_matrix *m);
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
-		     bool reserved, unsigned int *mapped_cpu);
+		     bool reserved, unsigned int *mapped_cpu, unsigned int align);
 void irq_matrix_free(struct irq_matrix *m, unsigned int cpu,
 		     unsigned int bit, bool managed);
 void irq_matrix_assign(struct irq_matrix *m, unsigned int bit);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f68c58a93dd0..e7408c133115 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -419,6 +419,7 @@ struct pci_dev {
 #endif
 #ifdef CONFIG_PCI_MSI
 	const struct attribute_group **msi_irq_groups;
+	unsigned int align_msi_vector;
 #endif
 	struct pci_vpd *vpd;
 #ifdef CONFIG_PCI_ATS
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index a3cbbc8191c5..d904819820a8 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -106,14 +106,17 @@ void irq_matrix_offline(struct irq_matrix *m)
 }
 
 static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm,
-				      unsigned int num, bool managed)
+				      unsigned int num, bool managed, unsigned int align)
 {
 	unsigned int area, start = m->alloc_start;
 	unsigned int end = m->alloc_end;
 
+	if (align > 0)
+		align--;
+
 	bitmap_or(m->scratch_map, cm->managed_map, m->system_map, end);
 	bitmap_or(m->scratch_map, m->scratch_map, cm->alloc_map, end);
-	area = bitmap_find_next_zero_area(m->scratch_map, end, start, num, 0);
+	area = bitmap_find_next_zero_area(m->scratch_map, end, start, num, align);
 	if (area >= end)
 		return area;
 	if (managed)
@@ -163,7 +166,7 @@ void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
  * on all CPUs in @msk, but it's not guaranteed that the bits are at the
  * same offset on all CPUs
  */
-int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk)
+int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk, unsigned int align)
 {
 	unsigned int cpu, failed_cpu;
 
@@ -171,7 +174,7 @@ int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk)
 		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
 		unsigned int bit;
 
-		bit = matrix_alloc_area(m, cm, 1, true);
+		bit = matrix_alloc_area(m, cm, 1, true, align);
 		if (bit >= m->alloc_end)
 			goto cleanup;
 		cm->managed++;
@@ -238,14 +241,17 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk)
  * @m:		Matrix pointer
  * @cpu:	On which CPU the interrupt should be allocated
  */
-int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu)
+int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu, unsigned int align)
 {
 	struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
 	unsigned int bit, end = m->alloc_end;
 
+	if (align > 0)
+		align--;
+
 	/* Get managed bit which are not allocated */
 	bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end);
-	bit = find_first_bit(m->scratch_map, end);
+	bit = bitmap_find_next_zero_area(m->scratch_map, end, 0, 1, align);
 	if (bit >= end)
 		return -ENOSPC;
 	set_bit(bit, cm->alloc_map);
@@ -319,7 +325,7 @@ void irq_matrix_remove_reserved(struct irq_matrix *m)
  * @mapped_cpu: Pointer to store the CPU for which the irq was allocated
  */
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
-		     bool reserved, unsigned int *mapped_cpu)
+		     bool reserved, unsigned int *mapped_cpu, unsigned int align)
 {
 	unsigned int cpu;
 
@@ -330,7 +336,7 @@ int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
 		if (!cm->online)
 			continue;
 
-		bit = matrix_alloc_area(m, cm, 1, false);
+		bit = matrix_alloc_area(m, cm, 1, false, align);
 		if (bit < m->alloc_end) {
 			cm->allocated++;
 			cm->available--;

Thomas Gleixner Oct. 2, 2017, 2:38 p.m. UTC | #4

Daniel,

On Mon, 2 Oct 2017, Daniel Drake wrote:
> On Wed, Sep 27, 2017 at 11:28 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On another system, I have multiple devices using IR-PCI-MSI according
> to /proc/interrupts, and lspci shows that a MSI Message Data value 0
> is used for every single MSI-enabled device.
> 
> I don't know if that's just luck, but its a value that would work
> fine for ath9k.

It's an implementation detail...

> After checking out the new code and thinking this through a bit, I think
> perhaps the only generic approach that would work is to make the
> ath9k driver require a vector allocation that enables the entire block
> of 4 MSI IRQs that the hardware supports (which is what Windows is doing).

I wonder how Windows deals with the affinity problem for multi-MSI. Or does
it just not allow to set it at all?

> This way the alignment requirement is automatically met and ath9k is
> free to stamp all over the lower 2 bits of the MSI Message Data.
> 
> MSI_FLAG_MULTI_PCI_MSI is already supported by a couple of non-x86 drivers
> and the interrupt remapping code, but it is not supported by the generic
> pci_msi_controller, and the new vector allocator still rejects
> X86_IRQ_ALLOC_CONTIGUOUS_VECTORS.

Yes, and it does so because Multi-MSI on x86 requires single destination
for all vectors, that means the affinity is the same for all vectors. This
has two implications:

  1) The generic interrupt code and its affinity management are not able to
     deal with that. Probably a solvable problem, but not trivial either.

  2) The affinity setting of straight MSI interrupts (w/o remapping) on x86
     requires to make the affinity change from the interrupt context of the
     current active vector in order not to lose interrupts or worst case
     getting into a stale state.

     That works for single vectors, but trying to move all vectors in one
     go is more or less impossible, as there is no reliable way to
     determine that none of the other vectors is on flight.

     There might be some 'workarounds' for that, but I rather avoid that
     unless we get an official documented one from Intel/AMD.

With interrupt remapping this is a different story as the remapping unit
removes all these problems and things just work.

The switch to best effort reservation mode for vectors is another
interesting problem. This cannot work with non remapped multi-MSI, unless
we just give up for these type of interrupts and associate them straight
away. I could be persuaded to do so, but the above problems (especially #2)
wont magically go away.

> I will try to take a look at enabling this for the generic allocator,
> unless you have any other ideas.

See above.

> In the mean time, at least for reference, below is an updated version
> of the previous patch based on the new allocator and incorporating
> your feedback. (It's still rather ugly though)
> 
> > I doubt that this can be made generic enough to make it work all over the
> > place. Tell Acer/Foxconn to fix their stupid firmware.
> 
> We're also working on this approach ever since the problematic models
> first appeared months ago, but since these products are in mass production,

I really wonder how they managed to screw that up. The spec is pretty
strict about that:

  "The Multiple Message Enable field (bits 6-4 of the Message Control
   register) defines the number of low order message data bits the function
   is permitted to modify to generate its system software allocated
   vectors. For example, a Multiple Message Enable encoding of “010”
   indicates the function has been allocated four vectors and is permitted
   to modify message data bits 1 and 0 (a function modifies the lower
   message data bits to generate the allocated number of vectors). If the
   Multiple Message Enable field is “000”, the function is not permitted to
   modify the message data."

Not permitted is the keyword here.

> I think ultimately we are going to need a Linux workaround.

What's wrong with just using the legacy INTx emulation if you cannot
allocate 4 MSI vectors?

Thanks,

	tglx

Thomas Gleixner Oct. 2, 2017, 4:19 p.m. UTC | #5

On Mon, 2 Oct 2017, Thomas Gleixner wrote:
> On Mon, 2 Oct 2017, Daniel Drake wrote:
>   2) The affinity setting of straight MSI interrupts (w/o remapping) on x86
>      requires to make the affinity change from the interrupt context of the
>      current active vector in order not to lose interrupts or worst case
>      getting into a stale state.
> 
>      That works for single vectors, but trying to move all vectors in one
>      go is more or less impossible, as there is no reliable way to
>      determine that none of the other vectors is on flight.
> 
>      There might be some 'workarounds' for that, but I rather avoid that
>      unless we get an official documented one from Intel/AMD.

Thinking more about it. That might be actually a non issue for MSI, but we
have that modus operandi in the current code and we need to address that
first before even thinking about multi MSI support.

Thanks,

	tglx

Thomas Gleixner Oct. 3, 2017, 9:07 p.m. UTC | #6

On Mon, 2 Oct 2017, Thomas Gleixner wrote:
> On Mon, 2 Oct 2017, Thomas Gleixner wrote:
> > On Mon, 2 Oct 2017, Daniel Drake wrote:
> >   2) The affinity setting of straight MSI interrupts (w/o remapping) on x86
> >      requires to make the affinity change from the interrupt context of the
> >      current active vector in order not to lose interrupts or worst case
> >      getting into a stale state.
> > 
> >      That works for single vectors, but trying to move all vectors in one
> >      go is more or less impossible, as there is no reliable way to
> >      determine that none of the other vectors is on flight.
> > 
> >      There might be some 'workarounds' for that, but I rather avoid that
> >      unless we get an official documented one from Intel/AMD.
> 
> Thinking more about it. That might be actually a non issue for MSI, but we
> have that modus operandi in the current code and we need to address that
> first before even thinking about multi MSI support.

But even if its possible, it's very debatable whether it's worth the effort
when this driver just can use the legacy INTx.and be done with it.

Thanks,

	tglx

Bjorn Helgaas Oct. 3, 2017, 9:34 p.m. UTC | #7

On Tue, Oct 03, 2017 at 11:07:58PM +0200, Thomas Gleixner wrote:
> On Mon, 2 Oct 2017, Thomas Gleixner wrote:
> > On Mon, 2 Oct 2017, Thomas Gleixner wrote:
> > > On Mon, 2 Oct 2017, Daniel Drake wrote:
> > >   2) The affinity setting of straight MSI interrupts (w/o remapping) on x86
> > >      requires to make the affinity change from the interrupt context of the
> > >      current active vector in order not to lose interrupts or worst case
> > >      getting into a stale state.
> > > 
> > >      That works for single vectors, but trying to move all vectors in one
> > >      go is more or less impossible, as there is no reliable way to
> > >      determine that none of the other vectors is on flight.
> > > 
> > >      There might be some 'workarounds' for that, but I rather avoid that
> > >      unless we get an official documented one from Intel/AMD.
> > 
> > Thinking more about it. That might be actually a non issue for MSI, but we
> > have that modus operandi in the current code and we need to address that
> > first before even thinking about multi MSI support.
> 
> But even if its possible, it's very debatable whether it's worth the effort
> when this driver just can use the legacy INTx.and be done with it.

Daniel said "Legacy interrupts do not work on that module, so MSI
support is required," so I assume he means INTx doesn't work.  Maybe
the driver could poll?  I don't know how much slower that would be,
but at least it would penalize the broken device, not everybody.

Bjorn

Thomas Gleixner Oct. 3, 2017, 9:46 p.m. UTC | #8

On Tue, 3 Oct 2017, Bjorn Helgaas wrote:
> On Tue, Oct 03, 2017 at 11:07:58PM +0200, Thomas Gleixner wrote:
> > On Mon, 2 Oct 2017, Thomas Gleixner wrote:
> > > On Mon, 2 Oct 2017, Thomas Gleixner wrote:
> > > > On Mon, 2 Oct 2017, Daniel Drake wrote:
> > > >   2) The affinity setting of straight MSI interrupts (w/o remapping) on x86
> > > >      requires to make the affinity change from the interrupt context of the
> > > >      current active vector in order not to lose interrupts or worst case
> > > >      getting into a stale state.
> > > > 
> > > >      That works for single vectors, but trying to move all vectors in one
> > > >      go is more or less impossible, as there is no reliable way to
> > > >      determine that none of the other vectors is on flight.
> > > > 
> > > >      There might be some 'workarounds' for that, but I rather avoid that
> > > >      unless we get an official documented one from Intel/AMD.
> > > 
> > > Thinking more about it. That might be actually a non issue for MSI, but we
> > > have that modus operandi in the current code and we need to address that
> > > first before even thinking about multi MSI support.
> > 
> > But even if its possible, it's very debatable whether it's worth the effort
> > when this driver just can use the legacy INTx.and be done with it.
> 
> Daniel said "Legacy interrupts do not work on that module, so MSI
> support is required," so I assume he means INTx doesn't work.  Maybe

Hmm, I missed that detail obviously.

> the driver could poll?  I don't know how much slower that would be,
> but at least it would penalize the broken device, not everybody.

That would definitely be prefered.

Thanks,

	tglx

Daniel Drake Oct. 5, 2017, 4:23 a.m. UTC | #9

On Mon, Oct 2, 2017 at 10:38 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> After checking out the new code and thinking this through a bit, I think
>> perhaps the only generic approach that would work is to make the
>> ath9k driver require a vector allocation that enables the entire block
>> of 4 MSI IRQs that the hardware supports (which is what Windows is doing).
>
> I wonder how Windows deals with the affinity problem for multi-MSI. Or does
> it just not allow to set it at all?

https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/interrupt-affinity-and-priority
Looks like IRQ affinity can only be set by registry or inf files. I assume
that means it is not dynamic and hence avoids the challenges related to
moving interrupts around at runtime.

> What's wrong with just using the legacy INTx emulation if you cannot
> allocate 4 MSI vectors?

The Legacy interrupt simply doesn't work for the wifi on at least 8 new Acer
laptop products based on Intel Apollo Lake.
Plus 4 Dell systems included in the patches in this thread:
https://lkml.org/lkml/2017/9/26/55
(the 2 which I can find specs for are also Apollo Lake)

We have tried taking the mini-PCIe wifi module out of one of the affected
Acer products and moved it to another computer, where it is working fine
with legacy interrupts. So this suggests that the wifi module itself is OK,
but we are facing a hardware limitation or BIOS limitation on the affected
products. In the Dell thread it says "Some platform(BIOS) blocks legacy
interrupts (INTx)".

If you have any suggestions for how we might solve this without getting into
the MSI mess then that would be much appreciated. If the BIOS blocks the
interrupts, can Linux unblock them?

Just for reference I'm attaching my latest attempt at enabling MULTI_PCI_MSI.
It would definitely need further work if we proceed here - so far I've
ignored the affinity considerations that you explained, and it's not
particularly clean.

I'll now have a look at polling for interrupts in the ath9k driver.

---
 arch/x86/kernel/apic/msi.c    |  3 +-
 arch/x86/kernel/apic/vector.c | 75 ++++++++++++++++++++++++++++++++-----------
 include/linux/irq.h           |  3 +-
 kernel/irq/matrix.c           | 23 +++++++------
 4 files changed, 74 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 5b6dd1a85ec4..c57b6a7b9317 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -129,7 +129,8 @@ static struct msi_domain_ops pci_msi_domain_ops = {
 
 static struct msi_domain_info pci_msi_domain_info = {
 	.flags		= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
-			  MSI_FLAG_PCI_MSIX | MSI_FLAG_MUST_REACTIVATE,
+			  MSI_FLAG_PCI_MSIX | MSI_FLAG_MUST_REACTIVATE |
+			  MSI_FLAG_MULTI_PCI_MSI,
 	.ops		= &pci_msi_domain_ops,
 	.chip		= &pci_msi_controller,
 	.handler	= handle_edge_irq,
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6789e286def9..2926fd92ea1c 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -35,7 +35,8 @@ struct apic_chip_data {
 	unsigned int		move_in_progress	: 1,
 				is_managed		: 1,
 				can_reserve		: 1,
-				has_reserved		: 1;
+				has_reserved		: 1,
+				contig_allocation	: 1;
 };
 
 struct irq_domain *x86_vector_domain;
@@ -198,7 +199,8 @@ static int reserve_irq_vector(struct irq_data *irqd)
 	return 0;
 }
 
-static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest)
+static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest,
+			   unsigned int num, unsigned int align_mask)
 {
 	struct apic_chip_data *apicd = apic_chip_data(irqd);
 	bool resvd = apicd->has_reserved;
@@ -215,18 +217,21 @@ static int allocate_vector(struct irq_data *irqd, const struct cpumask *dest)
 	if (vector && cpu_online(cpu) && cpumask_test_cpu(cpu, dest))
 		return 0;
 
-	vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu);
+	vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu,
+				  num, align_mask);
 	if (vector > 0)
 		apic_update_vector(irqd, vector, cpu);
+
 	trace_vector_alloc(irqd->irq, vector, resvd, vector);
 	return vector;
 }
 
 static int assign_vector_locked(struct irq_data *irqd,
-				const struct cpumask *dest)
+				const struct cpumask *dest,
+				unsigned int num, unsigned int align_mask)
 {
 	struct apic_chip_data *apicd = apic_chip_data(irqd);
-	int vector = allocate_vector(irqd, dest);
+	int vector = allocate_vector(irqd, dest, num, align_mask);
 
 	if (vector < 0)
 		return vector;
@@ -235,14 +240,15 @@ static int assign_vector_locked(struct irq_data *irqd,
 	return 0;
 }
 
-static int assign_irq_vector(struct irq_data *irqd, const struct cpumask *dest)
+static int assign_irq_vector(struct irq_data *irqd, const struct cpumask *dest,
+			     unsigned int num, unsigned int align_mask)
 {
 	unsigned long flags;
 	int ret;
 
 	raw_spin_lock_irqsave(&vector_lock, flags);
 	cpumask_and(vector_searchmask, dest, cpu_online_mask);
-	ret = assign_vector_locked(irqd, vector_searchmask);
+	ret = assign_vector_locked(irqd, vector_searchmask, num, align_mask);
 	raw_spin_unlock_irqrestore(&vector_lock, flags);
 	return ret;
 }
@@ -257,18 +263,18 @@ static int assign_irq_vector_any_locked(struct irq_data *irqd)
 		goto all;
 	/* Try the intersection of @affmsk and node mask */
 	cpumask_and(vector_searchmask, cpumask_of_node(node), affmsk);
-	if (!assign_vector_locked(irqd, vector_searchmask))
+	if (!assign_vector_locked(irqd, vector_searchmask, 1, 0))
 		return 0;
 	/* Try the node mask */
-	if (!assign_vector_locked(irqd, cpumask_of_node(node)))
+	if (!assign_vector_locked(irqd, cpumask_of_node(node), 1, 0))
 		return 0;
 all:
 	/* Try the full affinity mask */
 	cpumask_and(vector_searchmask, affmsk, cpu_online_mask);
-	if (!assign_vector_locked(irqd, vector_searchmask))
+	if (!assign_vector_locked(irqd, vector_searchmask, 1, 0))
 		return 0;
 	/* Try the full online mask */
-	return assign_vector_locked(irqd, cpu_online_mask);
+	return assign_vector_locked(irqd, cpu_online_mask, 1, 0);
 }
 
 static int
@@ -277,7 +283,7 @@ assign_irq_vector_policy(struct irq_data *irqd, struct irq_alloc_info *info)
 	if (irqd_affinity_is_managed(irqd))
 		return reserve_managed_vector(irqd);
 	if (info->mask)
-		return assign_irq_vector(irqd, info->mask);
+		return assign_irq_vector(irqd, info->mask, 1, 0);
 	/*
 	 * Make only a global reservation with no guarantee. A real vector
 	 * is associated at activation time.
@@ -353,6 +359,9 @@ static void x86_vector_deactivate(struct irq_domain *dom, struct irq_data *irqd)
 	if (apicd->has_reserved)
 		return;
 
+	if (apicd->contig_allocation)
+		return;
+
 	raw_spin_lock_irqsave(&vector_lock, flags);
 	clear_irq_vector(irqd);
 	if (apicd->can_reserve)
@@ -411,6 +420,9 @@ static int x86_vector_activate(struct irq_domain *dom, struct irq_data *irqd,
 	if (!apicd->can_reserve && !apicd->is_managed)
 		return 0;
 
+	if (apicd->contig_allocation)
+		return 0;
+
 	raw_spin_lock_irqsave(&vector_lock, flags);
 	if (early || irqd_is_managed_and_shutdown(irqd))
 		vector_assign_managed_shutdown(irqd);
@@ -489,16 +501,25 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 				 unsigned int nr_irqs, void *arg)
 {
 	struct irq_alloc_info *info = arg;
-	struct apic_chip_data *apicd;
+	struct apic_chip_data *apicd, *first_apicd;
 	struct irq_data *irqd;
 	int i, err, node;
+	bool contig_allocation = false;
+	unsigned int align_mask = 0;
 
 	if (disable_apic)
 		return -ENXIO;
 
-	/* Currently vector allocator can't guarantee contiguous allocations */
-	if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
-		return -ENOSYS;
+	if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1) {
+		contig_allocation = true;
+
+		if (info->type == X86_IRQ_ALLOC_TYPE_MSI) {
+			/* Contiguous allocations must be aligned for MSI */
+			align_mask = 1 << fls(nr_irqs - 1);
+			/* Convert from align requirement to align mask */
+			align_mask--;
+		}
+	}
 
 	for (i = 0; i < nr_irqs; i++) {
 		irqd = irq_domain_get_irq_data(domain, virq + i);
@@ -512,6 +533,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 		}
 
 		apicd->irq = virq + i;
+		apicd->contig_allocation = contig_allocation;
 		irqd->chip = &lapic_controller;
 		irqd->chip_data = apicd;
 		irqd->hwirq = virq + i;
@@ -528,7 +550,24 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 				continue;
 		}
 
-		err = assign_irq_vector_policy(irqd, info);
+		if (contig_allocation) {
+			/* Automatically activate */
+			if (i == 0) {
+				first_apicd = apicd;
+				err = assign_irq_vector(irqd, cpu_online_mask,
+							nr_irqs, align_mask);
+			} else {
+				apic_update_vector(irqd,
+						   first_apicd->vector + i,
+						   first_apicd->cpu);
+				apic_update_irq_cfg(irqd,
+						    first_apicd->vector + i,
+						    first_apicd->cpu);
+				err = 0;
+			}
+		} else {
+			err = assign_irq_vector_policy(irqd, info);
+		}
 		trace_vector_setup(virq + i, false, err);
 		if (err)
 			goto error;
@@ -733,7 +772,7 @@ static int apic_set_affinity(struct irq_data *irqd,
 	if (irqd_affinity_is_managed(irqd))
 		err = assign_managed_vector(irqd, vector_searchmask);
 	else
-		err = assign_vector_locked(irqd, vector_searchmask);
+		err = assign_vector_locked(irqd, vector_searchmask, 1, 0);
 	raw_spin_unlock(&vector_lock);
 	return err ? err : IRQ_SET_MASK_OK;
 }
diff --git a/include/linux/irq.h b/include/linux/irq.h
index fda8da7c45e7..2cb5c3a9c96f 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1126,7 +1126,8 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu);
 void irq_matrix_reserve(struct irq_matrix *m);
 void irq_matrix_remove_reserved(struct irq_matrix *m);
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
-		     bool reserved, unsigned int *mapped_cpu);
+		     bool reserved, unsigned int *mapped_cpu, unsigned int num,
+		     unsigned int align_mask);
 void irq_matrix_free(struct irq_matrix *m, unsigned int cpu,
 		     unsigned int bit, bool managed);
 void irq_matrix_assign(struct irq_matrix *m, unsigned int bit);
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index a3cbbc8191c5..b5c4f054a650 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -106,14 +106,16 @@ void irq_matrix_offline(struct irq_matrix *m)
 }
 
 static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm,
-				      unsigned int num, bool managed)
+				      unsigned int num, bool managed,
+				      unsigned int align_mask)
 {
 	unsigned int area, start = m->alloc_start;
 	unsigned int end = m->alloc_end;
 
 	bitmap_or(m->scratch_map, cm->managed_map, m->system_map, end);
 	bitmap_or(m->scratch_map, m->scratch_map, cm->alloc_map, end);
-	area = bitmap_find_next_zero_area(m->scratch_map, end, start, num, 0);
+	area = bitmap_find_next_zero_area(m->scratch_map, end, start, num,
+					  align_mask);
 	if (area >= end)
 		return area;
 	if (managed)
@@ -171,7 +173,7 @@ int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk)
 		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
 		unsigned int bit;
 
-		bit = matrix_alloc_area(m, cm, 1, true);
+		bit = matrix_alloc_area(m, cm, 1, true, 0);
 		if (bit >= m->alloc_end)
 			goto cleanup;
 		cm->managed++;
@@ -319,7 +321,8 @@ void irq_matrix_remove_reserved(struct irq_matrix *m)
  * @mapped_cpu: Pointer to store the CPU for which the irq was allocated
  */
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
-		     bool reserved, unsigned int *mapped_cpu)
+		     bool reserved, unsigned int *mapped_cpu,
+		     unsigned int num, unsigned int align_mask)
 {
 	unsigned int cpu;
 
@@ -330,14 +333,14 @@ int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
 		if (!cm->online)
 			continue;
 
-		bit = matrix_alloc_area(m, cm, 1, false);
+		bit = matrix_alloc_area(m, cm, num, false, align_mask);
 		if (bit < m->alloc_end) {
-			cm->allocated++;
-			cm->available--;
-			m->total_allocated++;
-			m->global_available--;
+			cm->allocated += num;
+			cm->available -= num;
+			m->total_allocated += num;
+			m->global_available -= num;
 			if (reserved)
-				m->global_reserved--;
+				m->global_reserved -= num;
 			*mapped_cpu = cpu;
 			trace_irq_matrix_alloc(bit, cpu, m, cm);
 			return bit;

Thomas Gleixner Oct. 5, 2017, 10:13 a.m. UTC | #10

On Thu, 5 Oct 2017, Daniel Drake wrote:
> On Mon, Oct 2, 2017 at 10:38 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > What's wrong with just using the legacy INTx emulation if you cannot
> > allocate 4 MSI vectors?
> 
> The Legacy interrupt simply doesn't work for the wifi on at least 8 new Acer
> laptop products based on Intel Apollo Lake.
> Plus 4 Dell systems included in the patches in this thread:
> https://lkml.org/lkml/2017/9/26/55
> (the 2 which I can find specs for are also Apollo Lake)
>
> We have tried taking the mini-PCIe wifi module out of one of the affected
> Acer products and moved it to another computer, where it is working fine
> with legacy interrupts. So this suggests that the wifi module itself is OK,
> but we are facing a hardware limitation or BIOS limitation on the affected
> products. In the Dell thread it says "Some platform(BIOS) blocks legacy
> interrupts (INTx)".
>
> If you have any suggestions for how we might solve this without getting into
> the MSI mess then that would be much appreciated. If the BIOS blocks the
> interrupts, can Linux unblock them?

I'm pretty sure we can. Cc'ed Rafael and Andy. They might know, if not they
certainly know whom to ask @Intel.

> Just for reference I'm attaching my latest attempt at enabling MULTI_PCI_MSI.
> It would definitely need further work if we proceed here - so far I've
> ignored the affinity considerations that you explained, and it's not
> particularly clean.

Yeah, I know how that looks. When I rewrote the allocator I initialy had
that multi-vector mode implemented and then ditched it due to the affinity
setting mess and because its hard vs. the global best effort reservation
mode, which is way more important to have than multi MSI.

>  int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
> -		     bool reserved, unsigned int *mapped_cpu);
> +		     bool reserved, unsigned int *mapped_cpu, unsigned int num,
> +		     unsigned int align_mask);

That's not needed. We can assume that multivector allocations must be
aligned in the following way:

	count = __roundup_pow_of_two(count);
	mask = ilog2(count);

That's a sane assumption in general.

Thanks,

	tglx

PCI MSI: allow alignment restrictions on vector allocation

Commit Message

Comments

Patch