
4.13.0-rc4 sparc64: can't allocate MSI-X affinity masks for 2 vectors

Message ID 20170821.112747.1532639515902173100.davem@davemloft.net
State RFC
Delegated to: David Miller

Commit Message

David Miller Aug. 21, 2017, 6:27 p.m. UTC
From: Bjorn Helgaas <helgaas@kernel.org>
Date: Wed, 16 Aug 2017 14:02:41 -0500

> On Wed, Aug 16, 2017 at 09:39:08PM +0300, Meelis Roos wrote:
>> > > > I noticed that in 4.13.0-rc4 there is a new error in dmesg on my sparc64 
>> > > > t5120 server: can't allocate MSI-X affinity masks.
>> > > > 
>> > > > [   30.274284] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.00.00.00-k.
>> > > > [   30.274648] qla2xxx [0000:10:00.0]-001d: : Found an ISP2432 irq 21 iobase 0x000000c100d00000.
>> > > > [   30.275447] qla2xxx 0000:10:00.0: can't allocate MSI-X affinity masks for 2 vectors
>> > > > [   30.816882] scsi host1: qla2xxx
>> > > > [   30.877294] qla2xxx: probe of 0000:10:00.0 failed with error -22
>> > > > [   30.877578] qla2xxx [0000:10:00.1]-001d: : Found an ISP2432 irq 22 iobase 0x000000c100d04000.
>> > > > [   30.878387] qla2xxx 0000:10:00.1: can't allocate MSI-X affinity masks for 2 vectors
>> > > > [   31.367083] scsi host1: qla2xxx
>> > > > [   31.427500] qla2xxx: probe of 0000:10:00.1 failed with error -22
>> > > > 
>> > > > I do not know if the driver works since nothing is attached to the FC 
>> > > > HBA at the moment, but from the error messages it looks like the driver 
>> > > > fails to load.
>> > > > 
>> > > > I booted 4.12 and 4.11 - the red error is not there but the failure 
>> > > > seems to be the same error -22:
>> > 
>> > 4.10.0 works, 4.11.0 errors out with EINVAL, and 4.13-rc4 errors out
>> > with more verbose MSI messages. So something between 4.10 and 4.11 has
>> > broken it.
>> 
>> I cannot reproduce the misbehaviour with the older kernels. I checked out
>> earlier kernels and recompiled them (old config lost, nothing changed
>> AFAIK); everything works up to 4.12 inclusive.
>> 
>> > Also, 4.13-rc4 is broken on another sun4v here (T1000). So it seems to 
>> > be sun4v interrupt related.
>> 
>> This still holds - 4.13-rc4 has MSI trouble on at least 2 of my sun4v 
>> machines.
> 
> IIUC, that means v4.12 works and v4.13-rc4 does not, so this is a
> regression we introduced this cycle.
> 
> If nobody steps up with a theory, bisecting might be the easiest path
> forward.

I suspect the test added by:

commit 6f9a22bc5775d231ab8fbe2c2f3c88e45e3e7c28
Author: Michael Hernandez <michael.hernandez@cavium.com>
Date:   Thu May 18 10:47:47 2017 -0700

    PCI/MSI: Ignore affinity if pre/post vector count is more than min_vecs

is triggering.

The rest of the failure cases are memory allocation failures which should
not be happening here.
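
For reference, the check that commit adds near the top of
irq_create_affinity_masks() is roughly this (a paraphrased sketch, not a
verbatim copy of the upstream code):

	/*
	 * If there aren't any vectors left after applying the pre/post
	 * vectors, don't bother with assigning affinity.
	 */
	if (nvecs == affd->pre_vectors + affd->post_vectors)
		return NULL;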

There have only been 5 commits to kernel/irq/affinity.c since v4.10.

I suppose we have been getting away with something that has silently
been allowed in the past, or something like that.

Meelis, can you run with the following debugging patch?


Comments

Christoph Hellwig Aug. 21, 2017, 6:34 p.m. UTC | #1
I think with this patch from -rc6 the symptoms should be cured:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c005390374957baacbc38eef96ea360559510aa7

if that theory is right.
Meelis Roos Aug. 21, 2017, 7:20 p.m. UTC | #2
> I think with this patch from -rc6 the symptoms should be cured:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c005390374957baacbc38eef96ea360559510aa7
> 
> if that theory is right.

The result with 4.13-rc6 is positive but mixed: the messages about MSI-X
affinity masks are still there, but the rest of the detection works and the
driver is loaded successfully:

[   29.924282] qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.00.00.00-k.
[   29.924710] qla2xxx [0000:10:00.0]-001d: : Found an ISP2432 irq 21 iobase 0x000000c100d00000.
[   29.925581] qla2xxx 0000:10:00.0: can't allocate MSI-X affinity masks for 2 vectors
[   30.483422] scsi host1: qla2xxx
[   35.495031] qla2xxx [0000:10:00.0]-00fb:1: QLogic QLE2462 - SG-(X)PCIE2FC-QF4, Sun StorageTek 4 Gb FC Enterprise PCI-Express Dual Channel H.
[   35.495274] qla2xxx [0000:10:00.0]-00fc:1: ISP2432: PCIe (2.5GT/s x4) @ 0000:10:00.0 hdma- host#=1 fw=7.03.00 (9496).
[   35.495615] qla2xxx [0000:10:00.1]-001d: : Found an ISP2432 irq 22 iobase 0x000000c100d04000.
[   35.496409] qla2xxx 0000:10:00.1: can't allocate MSI-X affinity masks for 2 vectors
[   35.985355] scsi host2: qla2xxx
[   40.996991] qla2xxx [0000:10:00.1]-00fb:2: QLogic QLE2462 - SG-(X)PCIE2FC-QF4, Sun StorageTek 4 Gb FC Enterprise PCI-Express Dual Channel H.
[   40.997251] qla2xxx [0000:10:00.1]-00fc:2: ISP2432: PCIe (2.5GT/s x4) @ 0000:10:00.1 hdma- host#=2 fw=7.03.00 (9496).
[   51.880945] qla2xxx [0000:10:00.0]-8038:1: Cable is unplugged...
[   57.402900] qla2xxx [0000:10:00.1]-8038:2: Cable is unplugged...

With Dave Miller's patch on top of 4.13-rc6, I see the following before 
both MSI-X messages:

irq_create_affinity_masks: nvecs[2] affd->pre_vectors[2] affd->post_vectors[0]
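
If that check is what fires here, these values make nvecs equal to
affd->pre_vectors + affd->post_vectors (2 == 2 + 0), which is exactly the
condition under which irq_create_affinity_masks() skips mask allocation
and returns NULL.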
David Miller Aug. 21, 2017, 8:35 p.m. UTC | #3
From: mroos@linux.ee
Date: Mon, 21 Aug 2017 22:20:22 +0300 (EEST)

>> I think with this patch from -rc6 the symptoms should be cured:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c005390374957baacbc38eef96ea360559510aa7
>> 
>> if that theory is right.
> 
> The result with 4.13-rc6 is positive but mixed: the messages about MSI-X
> affinity masks are still there, but the rest of the detection works and the
> driver is loaded successfully:

Is this an SMP system?

I ask because the commit log message indicates that this failure is
not expected to ever happen on SMP.

We really need to root cause this.
Meelis Roos Aug. 22, 2017, 5:02 a.m. UTC | #4
> 
> >> I think with this patch from -rc6 the symptoms should be cured:
> >> 
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c005390374957baacbc38eef96ea360559510aa7
> >> 
> >> if that theory is right.
> > 
> > The result with 4.13-rc6 is positive but mixed: the messages about MSI-X
> > affinity masks are still there, but the rest of the detection works and the
> > driver is loaded successfully:
> 
> Is this an SMP system?

Yes, T5120.
Christoph Hellwig Aug. 22, 2017, 6:35 a.m. UTC | #5
On Mon, Aug 21, 2017 at 01:35:49PM -0700, David Miller wrote:
> I ask because the commit log message indicates that this failure is
> not expected to ever happen on SMP.

I fear my commit message (but not the code) might be wrong.
irq_create_affinity_masks can return NULL any time we don't have any
affinity masks.  I've already had a discussion about this elsewhere
with Bjorn, and I suspect we need to kill the warning or move it
to irq_create_affinity_masks only for genuine failure cases.
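
The "can't allocate MSI-X affinity masks" message itself comes from the
PCI MSI-X setup path, which currently warns whenever
irq_create_affinity_masks() returns NULL; from memory the call site in
drivers/pci/msi.c looks roughly like this (a sketch, not the exact
upstream code):

	if (affd) {
		masks = irq_create_affinity_masks(nvec, affd);
		if (!masks)
			dev_err(&dev->dev,
				"can't allocate MSI-X affinity masks for %d vectors\n",
				nvec);
	}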

> 
> We really need to root cause this.
---end quoted text---
David Miller Aug. 22, 2017, 4:31 p.m. UTC | #6
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 22 Aug 2017 08:35:05 +0200

> On Mon, Aug 21, 2017 at 01:35:49PM -0700, David Miller wrote:
>> I ask because the commit log message indicates that this failure is
>> not expected to ever happen on SMP.
> 
> I fear my commit message (but not the code) might be wrong.
> irq_create_affinity_masks can return NULL any time we don't have any
> affinity masks.  I've already had a discussion about this elsewhere
> with Bjorn, and I suspect we need to kill the warning or move it
> to irq_create_affinity_masks only for genuine failure cases.

This is a rather large machine with 64 or more cpus and several NUMA
nodes.  Why wouldn't there be any affinity masks available?

That's why I want to root cause this.
Meelis Roos Aug. 22, 2017, 4:33 p.m. UTC | #7
> > On Mon, Aug 21, 2017 at 01:35:49PM -0700, David Miller wrote:
> >> I ask because the commit log message indicates that this failure is
> >> not expected to ever happen on SMP.
> > 
> > I fear my commit message (but not the code) might be wrong.
> > irq_create_affinity_masks can return NULL any time we don't have any
> > affinity masks.  I've already had a discussion about this elsewhere
> > with Bjorn, and I suspect we need to kill the warning or move it
> > to irq_create_affinity_masks only for genuine failure cases.
> 
> This is a rather large machine with 64 or more cpus and several NUMA
> nodes.  Why wouldn't there be any affinity masks available?

T5120 with 1 socket and 32 threads total. I have not configured any NUMA on
it; is there any reason for that?
Christoph Hellwig Aug. 22, 2017, 4:39 p.m. UTC | #8
On Tue, Aug 22, 2017 at 09:31:39AM -0700, David Miller wrote:
> > I fear my commit message (but not the code) might be wrong.
> > irq_create_affinity_masks can return NULL any time we don't have any
> > affinity masks.  I've already had a discussion about this elsewhere
> > with Bjorn, and I suspect we need to kill the warning or move it
> > to irq_create_affinity_masks only for genuine failure cases.
> 
> This is a rather large machine with 64 or more cpus and several NUMA
> nodes.  Why wouldn't there be any affinity masks available?

The driver only asked for two MSI-X vectors, and marked both of them
as pre-vectors that should not be spread.  So there is no vector left
that we actually want to spread.
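
As an illustrative sketch (not the exact qla2xxx code, and pdev here is
just the device being probed), the request looks something like:

	struct irq_affinity desc = {
		.pre_vectors = 2,	/* both vectors handled specially */
	};
	int ret;

	/* exactly two MSI-X vectors, both excluded from spreading */
	ret = pci_alloc_irq_vectors_affinity(pdev, 2, 2,
					     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					     &desc);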
David Miller Aug. 22, 2017, 4:45 p.m. UTC | #9
From: Meelis Roos <mroos@linux.ee>
Date: Tue, 22 Aug 2017 19:33:55 +0300 (EEST)

>> > On Mon, Aug 21, 2017 at 01:35:49PM -0700, David Miller wrote:
>> >> I ask because the commit log message indicates that this failure is
>> >> not expected to ever happen on SMP.
>> > 
>> > I fear my commit message (but not the code) might be wrong.
>> > irq_create_affinity_masks can return NULL any time we don't have any
>> > affinity masks.  I've already had a discussion about this elsewhere
>> > with Bjorn, and I suspect we need to kill the warning or move it
>> > to irq_create_affinity_masks only for genuine failure cases.
>> 
>> This is a rather large machine with 64 or more cpus and several NUMA
>> nodes.  Why wouldn't there be any affinity masks available?
> 
> T5120 with 1 socket and 32 threads total. I have not configured any NUMA on
> it; is there any reason for that?

Ok 32 cpus and 1 NUMA node, my bad :-)
David Miller Aug. 22, 2017, 4:52 p.m. UTC | #10
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 22 Aug 2017 18:39:16 +0200

> On Tue, Aug 22, 2017 at 09:31:39AM -0700, David Miller wrote:
>> > I fear my commit message (but not the code) might be wrong.
>> > irq_create_affinity_masks can return NULL any time we don't have any
>> > affinity masks.  I've already had a discussion about this elsewhere
>> > with Bjorn, and I suspect we need to kill the warning or move it
>> > to irq_create_affinity_masks only for genuine failure cases.
>> 
>> This is a rather large machine with 64 or more cpus and several NUMA
>> nodes.  Why wouldn't there be any affinity masks available?
> 
> The driver only asked for two MSI-X vectors, and marked both of them
> as pre-vectors that should not be spread.  So there is no vector left
> that we actually want to spread.

Ok, now it makes more sense, and yes the warning should be removed.



Patch

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index d69bd77252a7..d16c6326000a 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -110,6 +110,9 @@  irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	struct cpumask *masks;
 	cpumask_var_t nmsk, *node_to_present_cpumask;
 
+	pr_info("irq_create_affinity_masks: nvecs[%d] affd->pre_vectors[%d] "
+		"affd->post_vectors[%d]\n",
+		nvecs, affd->pre_vectors, affd->post_vectors);
 	/*
 	 * If there aren't any vectors left after applying the pre/post
 	 * vectors don't bother with assigning affinity.