4.13.0-rc4 sparc64: can't allocate MSI-X affinity masks for 2 vectors

Message ID 20170821.112747.1532639515902173100.davem@davemloft.net
State Not Applicable

Commit Message

David Miller Aug. 21, 2017, 6:27 p.m. UTC
From: Bjorn Helgaas <helgaas@kernel.org>
Date: Wed, 16 Aug 2017 14:02:41 -0500

> On Wed, Aug 16, 2017 at 09:39:08PM +0300, Meelis Roos wrote:
>> > > > I noticed that in 4.13.0-rc4 there is a new error in dmesg on my sparc64 
>> > > > t5120 server: can't allocate MSI-X affinity masks.
>> > > > 
>> > > > [   30.274284] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.00.00.00-k.
>> > > > [   30.274648] qla2xxx [0000:10:00.0]-001d: : Found an ISP2432 irq 21 iobase 0x000000c100d00000.
>> > > > [   30.275447] qla2xxx 0000:10:00.0: can't allocate MSI-X affinity masks for 2 vectors
>> > > > [   30.816882] scsi host1: qla2xxx
>> > > > [   30.877294] qla2xxx: probe of 0000:10:00.0 failed with error -22
>> > > > [   30.877578] qla2xxx [0000:10:00.1]-001d: : Found an ISP2432 irq 22 iobase 0x000000c100d04000.
>> > > > [   30.878387] qla2xxx 0000:10:00.1: can't allocate MSI-X affinity masks for 2 vectors
>> > > > [   31.367083] scsi host1: qla2xxx
>> > > > [   31.427500] qla2xxx: probe of 0000:10:00.1 failed with error -22
>> > > > 
>> > > > I do not know if the driver works since nothing is attached to the FC 
>> > > > HBA at the moment, but from the error messages it looks like the driver 
>> > > > fails to load.
>> > > > 
>> > > > I booted 4.12 and 4.11 - the red error is not there but the failure 
>> > > > seems to be the same error -22:
>> > 
>> > 4.10.0 works, 4.11.0 errors out with EINVAL and 4.13-rc4 errors out 
>> > with more verbose MSI messages. So something between 4.10 and 4.11 has 
>> > broken it.
>> 
>> I cannot reproduce the failures with the older kernels. I checked out 
>> earlier kernels and recompiled them (old config lost, nothing changed 
>> AFAIK), everything works up to 4.12 inclusive.
>> 
>> > Also, 4.13-rc4 is broken on another sun4v here (T1000). So it seems to 
>> > be sun4v interrupt related.
>> 
>> This still holds - 4.13-rc4 has MSI trouble on at least 2 of my sun4v 
>> machines.
> 
> IIUC, that means v4.12 works and v4.13-rc4 does not, so this is a
> regression we introduced this cycle.
> 
> If nobody steps up with a theory, bisecting might be the easiest path
> forward.

I suspect the test added by:

commit 6f9a22bc5775d231ab8fbe2c2f3c88e45e3e7c28
Author: Michael Hernandez <michael.hernandez@cavium.com>
Date:   Thu May 18 10:47:47 2017 -0700

    PCI/MSI: Ignore affinity if pre/post vector count is more than min_vecs

is triggering.

The rest of the failure cases are memory allocation failures which should
not be happening here.

There have only been 5 commits to kernel/irq/affinity.c since v4.10.

I suppose we have been getting away with something that has silently
been allowed in the past, or something like that.

Meelis, can you run with the following debugging patch?

Comments

Christoph Hellwig Aug. 21, 2017, 6:34 p.m. UTC | #1
I think with this patch from -rc6 the symptoms should be cured:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c005390374957baacbc38eef96ea360559510aa7

if that theory is right.
Meelis Roos Aug. 21, 2017, 7:20 p.m. UTC | #2
> I think with this patch from -rc6 the symptoms should be cured:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c005390374957baacbc38eef96ea360559510aa7
> 
> if that theory is right.

The result with 4.13-rc6 is positive but mixed: the messages about MSI-X 
affinity masks are still there, but the rest of the detection works and the 
driver is loaded successfully:

[   29.924282] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.00.00.00-k.
[   29.924710] qla2xxx [0000:10:00.0]-001d: : Found an ISP2432 irq 21 iobase 0x000000c100d00000.
[   29.925581] qla2xxx 0000:10:00.0: can't allocate MSI-X affinity masks for 2 vectors
[   30.483422] scsi host1: qla2xxx
[   35.495031] qla2xxx [0000:10:00.0]-00fb:1: QLogic QLE2462 - SG-(X)PCIE2FC-QF4, Sun StorageTek 4 Gb FC Enterprise PCI-Express Dual Channel H.
[   35.495274] qla2xxx [0000:10:00.0]-00fc:1: ISP2432: PCIe (2.5GT/s x4) @ 0000:10:00.0 hdma- host#=1 fw=7.03.00 (9496).
[   35.495615] qla2xxx [0000:10:00.1]-001d: : Found an ISP2432 irq 22 iobase 0x000000c100d04000.
[   35.496409] qla2xxx 0000:10:00.1: can't allocate MSI-X affinity masks for 2 vectors
[   35.985355] scsi host2: qla2xxx
[   40.996991] qla2xxx [0000:10:00.1]-00fb:2: QLogic QLE2462 - SG-(X)PCIE2FC-QF4, Sun StorageTek 4 Gb FC Enterprise PCI-Express Dual Channel H.
[   40.997251] qla2xxx [0000:10:00.1]-00fc:2: ISP2432: PCIe (2.5GT/s x4) @ 0000:10:00.1 hdma- host#=2 fw=7.03.00 (9496).
[   51.880945] qla2xxx [0000:10:00.0]-8038:1: Cable is unplugged...
[   57.402900] qla2xxx [0000:10:00.1]-8038:2: Cable is unplugged...

With Dave Miller's patch on top of 4.13-rc6, I see the following before 
both MSI-X messages:

irq_create_affinity_masks: nvecs[2] affd->pre_vectors[2] affd->post_vectors[0]
David Miller Aug. 21, 2017, 8:35 p.m. UTC | #3
From: mroos@linux.ee
Date: Mon, 21 Aug 2017 22:20:22 +0300 (EEST)

>> I think with this patch from -rc6 the symptoms should be cured:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c005390374957baacbc38eef96ea360559510aa7
>> 
>> if that theory is right.
> 
> The result with 4.13-rc6 is positive but mixed: the messages about MSI-X 
> affinity masks are still there, but the rest of the detection works and the 
> driver is loaded successfully:

Is this an SMP system?

I ask because the commit log message indicates that this failure is
not expected to ever happen on SMP.

We really need to root cause this.
Meelis Roos Aug. 22, 2017, 5:02 a.m. UTC | #4
> 
> >> I think with this patch from -rc6 the symptoms should be cured:
> >> 
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c005390374957baacbc38eef96ea360559510aa7
> >> 
> >> if that theory is right.
> > 
> > > The result with 4.13-rc6 is positive but mixed: the messages about MSI-X 
> > > affinity masks are still there, but the rest of the detection works and the 
> > > driver is loaded successfully:
> 
> Is this an SMP system?

Yes, T5120.
Christoph Hellwig Aug. 22, 2017, 6:35 a.m. UTC | #5
On Mon, Aug 21, 2017 at 01:35:49PM -0700, David Miller wrote:
> I ask because the commit log message indicates that this failure is
> not expected to ever happen on SMP.

I fear my commit message (but not the code) might be wrong.
irq_create_affinity_masks can return NULL any time we don't have any
affinity masks.  I've already had a discussion about this elsewhere
with Bjorn, and I suspect we need to kill the warning or move it
to irq_create_affinity_masks only for genuine failure cases.

> 
> We really need to root cause this.
---end quoted text---
David Miller Aug. 22, 2017, 4:31 p.m. UTC | #6
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 22 Aug 2017 08:35:05 +0200

> On Mon, Aug 21, 2017 at 01:35:49PM -0700, David Miller wrote:
>> I ask because the commit log message indicates that this failure is
>> not expected to ever happen on SMP.
> 
> I fear my commit message (but not the code) might be wrong.
> irq_create_affinity_masks can return NULL any time we don't have any
> affinity masks.  I've already had a discussion about this elsewhere
> with Bjorn, and I suspect we need to kill the warning or move it
> to irq_create_affinity_masks only for genuine failure cases.

This is a rather large machine with 64 or more cpus and several NUMA
nodes.  Why wouldn't there be any affinity masks available?

That's why I want to root cause this.
Meelis Roos Aug. 22, 2017, 4:33 p.m. UTC | #7
> > On Mon, Aug 21, 2017 at 01:35:49PM -0700, David Miller wrote:
> >> I ask because the commit log message indicates that this failure is
> >> not expected to ever happen on SMP.
> > 
> > I fear my commit message (but not the code) might be wrong.
> > irq_create_affinity_masks can return NULL any time we don't have any
> > affinity masks.  I've already had a discussion about this elsewhere
> > with Bjorn, and I suspect we need to kill the warning or move it
> > to irq_create_affinity_masks only for genuine failure cases.
> 
> This is a rather large machine with 64 or more cpus and several NUMA
> nodes.  Why wouldn't there be any affinity masks available?

T5120 with 1 slot and 32 threads total. I have not configured any NUMA on 
it; is there any reason for that?
Christoph Hellwig Aug. 22, 2017, 4:39 p.m. UTC | #8
On Tue, Aug 22, 2017 at 09:31:39AM -0700, David Miller wrote:
> > I fear my commit message (but not the code) might be wrong.
> > irq_create_affinity_masks can return NULL any time we don't have any
> > affinity masks.  I've already had a discussion about this elsewhere
> > with Bjorn, and I suspect we need to kill the warning or move it
> > to irq_create_affinity_masks only for genuine failure cases.
> 
> This is a rather large machine with 64 or more cpus and several NUMA
> nodes.  Why wouldn't there be any affinity masks available?

The drivers only asked for two MSI-X vectors, and marked both of them
as pre-vectors that should not be spread.  So there is no actual
vector left that we want to actually spread.
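
The explanation above can be checked with a little arithmetic. What follows is a minimal userspace sketch, not kernel code: `struct affinity_model`, `spreadable_vectors()` and `nothing_to_spread()` are hypothetical names introduced for illustration, and only the `pre_vectors`/`post_vectors` field names come from the debug output earlier in the thread. It assumes the allocator returns early whenever no vectors remain after the pre and post vectors are subtracted, as the comment in the patch context suggests.

```c
#include <assert.h>

/* Hypothetical userspace model; field names mirror the debug output. */
struct affinity_model {
	int pre_vectors;   /* leading vectors excluded from spreading */
	int post_vectors;  /* trailing vectors excluded from spreading */
};

/* Number of vectors left over that could be spread across CPUs. */
static int spreadable_vectors(int nvecs, const struct affinity_model *affd)
{
	assert(nvecs >= 0 && affd->pre_vectors >= 0 && affd->post_vectors >= 0);
	return nvecs - affd->pre_vectors - affd->post_vectors;
}

/* Returns 1 when there is nothing to spread, 0 otherwise. */
static int nothing_to_spread(int nvecs, int pre, int post)
{
	struct affinity_model affd = { pre, post };

	return spreadable_vectors(nvecs, &affd) <= 0;
}
```

With the values reported for the qla2xxx device (nvecs 2, pre_vectors 2, post_vectors 0), `nothing_to_spread(2, 2, 0)` is true, so under this model the absence of masks does not indicate a memory allocation failure.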
David Miller Aug. 22, 2017, 4:45 p.m. UTC | #9
From: Meelis Roos <mroos@linux.ee>
Date: Tue, 22 Aug 2017 19:33:55 +0300 (EEST)

>> > On Mon, Aug 21, 2017 at 01:35:49PM -0700, David Miller wrote:
>> >> I ask because the commit log message indicates that this failure is
>> >> not expected to ever happen on SMP.
>> > 
>> > I fear my commit message (but not the code) might be wrong.
>> > irq_create_affinity_masks can return NULL any time we don't have any
>> > affinity masks.  I've already had a discussion about this elsewhere
>> > with Bjorn, and I suspect we need to kill the warning or move it
>> > to irq_create_affinity_masks only for genuine failure cases.
>> 
>> This is a rather large machine with 64 or more cpus and several NUMA
>> nodes.  Why wouldn't there be any affinity masks available?
> 
> T5120 with 1 slot and 32 threads total. I have not configured any NUMA on 
> it; is there any reason for that?

Ok 32 cpus and 1 NUMA node, my bad :-)
David Miller Aug. 22, 2017, 4:52 p.m. UTC | #10
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 22 Aug 2017 18:39:16 +0200

> On Tue, Aug 22, 2017 at 09:31:39AM -0700, David Miller wrote:
>> > I fear my commit message (but not the code) might be wrong.
>> > irq_create_affinity_masks can return NULL any time we don't have any
>> > affinity masks.  I've already had a discussion about this elsewhere
>> > with Bjorn, and I suspect we need to kill the warning or move it
>> > to irq_create_affinity_masks only for genuine failure cases.
>> 
>> This is a rather large machine with 64 or more cpus and several NUMA
>> nodes.  Why wouldn't there be any affinity masks available?
> 
> The drivers only asked for two MSI-X vectors, and marked both of them
> as pre-vectors that should not be spread.  So there is no actual
> vector left that we want to actually spread.

Ok, now it makes more sense, and yes the warning should be removed.
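
One way to picture the alternative mentioned in the thread, warning only for genuine failure cases, is sketched below. This is a hypothetical userspace model and not a proposed kernel patch: `enum mask_outcome`, `classify()` and `should_warn()` are invented for illustration, and `alloc_ok` stands in for whether the memory allocations succeeded.

```c
#include <assert.h>

/* Hypothetical outcome codes; none of these names exist in the kernel. */
enum mask_outcome {
	MASKS_ALLOCATED,    /* masks were built for the spreadable vectors */
	MASKS_NOT_NEEDED,   /* every vector is a pre/post vector */
	MASKS_ALLOC_FAILED  /* a memory allocation failed */
};

/* Separate "nothing to spread" from a real allocation failure. */
static enum mask_outcome classify(int nvecs, int pre, int post, int alloc_ok)
{
	assert(nvecs >= 0 && pre >= 0 && post >= 0);
	if (nvecs - pre - post <= 0)
		return MASKS_NOT_NEEDED;
	return alloc_ok ? MASKS_ALLOCATED : MASKS_ALLOC_FAILED;
}

/* Only a genuine failure would justify a message in dmesg. */
static int should_warn(int nvecs, int pre, int post, int alloc_ok)
{
	return classify(nvecs, pre, post, alloc_ok) == MASKS_ALLOC_FAILED;
}
```

In this model the qla2xxx case (2 vectors, both pre-vectors) is classified as `MASKS_NOT_NEEDED` and stays silent, while a failed allocation with vectors left to spread would still be reported.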

Patch

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index d69bd77252a7..d16c6326000a 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -110,6 +110,9 @@  irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	struct cpumask *masks;
 	cpumask_var_t nmsk, *node_to_present_cpumask;
 
+	pr_info("irq_create_affinity_masks: nvecs[%d] affd->pre_vectors[%d] "
+		"affd->post_vectors[%d]\n",
+		nvecs, affd->pre_vectors, affd->post_vectors);
 	/*
 	 * If there aren't any vectors left after applying the pre/post
 	 * vectors don't bother with assigning affinity.