diff mbox

[arch-x86] Allow SRAT integrity check to be skipped

Message ID 20100901213318.19353.54619.stgit@localhost.localdomain
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Waskiewicz Jr, Peter P Sept. 1, 2010, 9:33 p.m. UTC
On certain BIOSes, SRAT enumeration isn't exported correctly.
This leads to NUMA node enumeration failure, and causes the kernel
to fall back onto a single node treated as flat memory.  This
can happen on large, multi-socket systems (4 or more sockets), and
becomes problematic for performance.

This patch adds a boot parameter to allow a kernel to be booted
with the option to skip the SRAT check.  There are BIOSes in
production that have these failures, so this will allow people
in the field to work around these BIOS issues.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 Documentation/x86/x86_64/boot-options.txt |    4 ++++
 arch/x86/mm/srat_64.c                     |   20 +++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andi Kleen Sept. 1, 2010, 9:33 p.m. UTC | #1
On Wed, Sep 01, 2010 at 02:33:18PM -0700, Peter P Waskiewicz Jr wrote:
> On certain BIOSes, SRAT enumeration isn't exported correctly.
> This leads to NUMA node enumeration failure, and causes the kernel
> to fall back onto a single node treated as flat memory.  This
> can happen on large, multi-socket systems (4 or more sockets), and
> becomes problematic for performance.
> 
> This patch adds a boot parameter to allow a kernel to be booted
> with the option to skip the SRAT check.  There are BIOSes in
> production that have these failures, so this will allow people
> in the field to work around these BIOS issues.

Looks good.

Reviewed-by: Andi Kleen <ak@linux.intel.com>

-Andi

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Randy.Dunlap Sept. 1, 2010, 10:04 p.m. UTC | #2
On Wed, September 1, 2010 2:33 pm, Peter P Waskiewicz Jr wrote:
> Documentation/x86/x86_64/boot-options.txt |    4 ++++
> arch/x86/mm/srat_64.c                     |   20 +++++++++++++++++--- 2
> files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/srat_64.c b/arch/x86/mm/srat_64.c index
> f9897f7..8719472 100644 --- a/arch/x86/mm/srat_64.c
> +++ b/arch/x86/mm/srat_64.c
> @@ -351,6 +351,15 @@ int __init acpi_get_nodes(struct bootnode *physnodes)
>  return ret; }
>
>
> +int srat_bypass_bios = 0;

static int srat_bypass_bios;

Don't need to init to 0.

> +
> +static int __init srat_bypass_bios_setup(char *str)
> +{
> +        srat_bypass_bios = 1;
> +        return 0;
> +}
> +early_param("sratbypassbios", srat_bypass_bios_setup);
> +
> /* Use the information discovered above to actually set up the nodes. */
> int __init acpi_scan_nodes(unsigned long start, unsigned long end) {
> @@ -425,9 +434,14 @@ int __init acpi_scan_nodes(unsigned long start,
> unsigned long end) nodes[i].end >> PAGE_SHIFT); /* for out of order entries
> in SRAT */ sort_node_map(); -	if (!nodes_cover_memory(nodes)) {
> -		bad_srat();
> -		return -1;
> +	if (!srat_bypass_bios) {
> +		if (!nodes_cover_memory(nodes)) {
> +			bad_srat();
> +			return -1;
> +		}
> +	} else {
> +		printk(KERN_INFO
> +		           "SRAT: Bypassing NUMA sanity check...bad BIOS...\n");
> }


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waskiewicz Jr, Peter P Sept. 1, 2010, 10:11 p.m. UTC | #3
On Wed, 2010-09-01 at 15:04 -0700, rdunlap@xenotime.net wrote:
> On Wed, September 1, 2010 2:33 pm, Peter P Waskiewicz Jr wrote:
> > Documentation/x86/x86_64/boot-options.txt |    4 ++++
> > arch/x86/mm/srat_64.c                     |   20 +++++++++++++++++--- 2
> > files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/mm/srat_64.c b/arch/x86/mm/srat_64.c index
> > f9897f7..8719472 100644 --- a/arch/x86/mm/srat_64.c
> > +++ b/arch/x86/mm/srat_64.c
> > @@ -351,6 +351,15 @@ int __init acpi_get_nodes(struct bootnode *physnodes)
> >  return ret; }
> >
> >
> > +int srat_bypass_bios = 0;
> 
> static int srat_bypass_bios;
> 
> Don't need to init to 0.

I'll fix and send a second version.  Thanks Randy.

-PJ

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Sept. 2, 2010, 6:57 a.m. UTC | #4
* Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> wrote:

> On certain BIOSes, SRAT enumeration isn't exported correctly. This 
> leads to NUMA node enumeration failure, and causes the kernel to fall 
> back onto a single node treated as flat memory.  This can happen on 
> large, multi-socket systems (4 or more sockets), and becomes 
> problematic for performance.
> 
> This patch adds a boot parameter to allow a kernel to be booted with 
> the option to skip the SRAT check.  There are BIOSes in production 
> that have these failures, so this will allow people in the field to 
> work around these BIOS issues.

> + 	sratbypassbios
> +		If specified, will skip an SRAT check for PXM coverage
> +		from BIOS enumeration.  Only to be used on systems with
> +		buggy BIOSes that munge the SRAT enumeration.

This isnt a particularly useful solution to users of said systems - they 
have to figure out that this option exists, and then they have to enter 
this option on the boot line.

A better solution would be to match these systems using DMI filters - 
_and_ to also have the boot time option to cover the case where a new 
system comes out with such a breakage. (or there's some system not yet 
mapped)


	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen Sept. 2, 2010, 10:03 a.m. UTC | #5
> This isnt a particularly useful solution to users of said systems - they 
> have to figure out that this option exists, and then they have to enter 
> this option on the boot line.

This usually only happens in early preproduction systems. So far
the BIOS always got fixed before they shipped to users.

But it's still useful to have this option so that the early
users have a workaround.

It's difficult to maintain DMI strings for early systems,
they are too bleeding edge and the entries would be stale
as soon as the BIOS is fixed.

-Andi
Ingo Molnar Sept. 3, 2010, 6:39 a.m. UTC | #6
* Andi Kleen <andi@firstfloor.org> wrote:

> > This isnt a particularly useful solution to users of said systems - 
> > they have to figure out that this option exists, and then they have 
> > to enter this option on the boot line.
> 
> This usually only happens in early preproduction systems. So far the 
> BIOS always got fixed before they shipped to users.

'Usually' != 'always'. Read the changelog:

  ' There are BIOSes in production that have these failures, so this will 
    allow people in the field to work around these BIOS issues. '

Peter, which system in production that has this problem? That one needs 
a DMI match.

Preproduction systems can certainly be hacked around with the boot 
option and are not worth the DMI match. (preproduction systems also tend 
to have have very unspecific DMI strings in most cases - things like 
'1234567890' or 'to be filled by OEM')

In important cases a PCI ID match can be done for serious bugs in 
widespread preproduction systems - we have a handful of examples in-tree 
for such workarounds - but i doubt this is such a case - we use it for 
crasher bugs, not for 'boots up slow' cases.

Nevertheless, production systems need a DMI match.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen Sept. 3, 2010, 10:04 a.m. UTC | #7
> > +int srat_bypass_bios = 0;
> 
> static int srat_bypass_bios;
> 
> Don't need to init to 0.

It actually doesn't matter to gcc, it generates the same
code for both cases.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waskiewicz Jr, Peter P Sept. 7, 2010, 7:38 p.m. UTC | #8
On Thu, 2010-09-02 at 23:39 -0700, Ingo Molnar wrote:
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > This isnt a particularly useful solution to users of said systems - 
> > > they have to figure out that this option exists, and then they have 
> > > to enter this option on the boot line.
> > 
> > This usually only happens in early preproduction systems. So far the 
> > BIOS always got fixed before they shipped to users.
> 
> 'Usually' != 'always'. Read the changelog:
> 
>   ' There are BIOSes in production that have these failures, so this will 
>     allow people in the field to work around these BIOS issues. '
> 
> Peter, which system in production that has this problem? That one needs 
> a DMI match.

It's one SKU of a Nehalem-EX system.  The BIOS for that SKU has an issue
with resolving SRAT hotplug enumeration, and screws up the table.  Other
SKU's of this same platform do not have the issue.  Efforts are underway
to get this BIOS fixed, but in the meantime, there's nothing for users
to work around the bug (aside from disabling memory hotplug in the
BIOS).  Another platform almost shipped with the same symptoms, but
caught it and had it fixed before it shipped (didn't catch it early
because Windows wasn't failing, and most of the testing on that platform
was done under Windows).

I agree with Andi that adding DMI strings would be overkill and would
leave clutter once the BIOS is fixed.  I look at this patch as a
stop-gap measure for people to fall back on until a newer BIOS is
available to correct the NUMA enumeration issues.  Without it, we have
nothing to point users to when they run into this, waiting for a new
BIOS.

Cheers,
-PJ

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Sept. 7, 2010, 7:56 p.m. UTC | #9
* Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> wrote:

> On Thu, 2010-09-02 at 23:39 -0700, Ingo Molnar wrote:
> > * Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > > This isnt a particularly useful solution to users of said systems - 
> > > > they have to figure out that this option exists, and then they have 
> > > > to enter this option on the boot line.
> > > 
> > > This usually only happens in early preproduction systems. So far the 
> > > BIOS always got fixed before they shipped to users.
> > 
> > 'Usually' != 'always'. Read the changelog:
> > 
> >   ' There are BIOSes in production that have these failures, so this will 
> >     allow people in the field to work around these BIOS issues. '
> > 
> > Peter, which system in production that has this problem? That one needs 
> > a DMI match.
> 
> It's one SKU of a Nehalem-EX system.  The BIOS for that SKU has an 
> issue with resolving SRAT hotplug enumeration, and screws up the 
> table.  Other SKU's of this same platform do not have the issue.  
> Efforts are underway to get this BIOS fixed, but in the meantime, 
> there's nothing for users to work around the bug (aside from disabling 
> memory hotplug in the BIOS).  Another platform almost shipped with the 
> same symptoms, but caught it and had it fixed before it shipped 
> (didn't catch it early because Windows wasn't failing, and most of the 
> testing on that platform was done under Windows).
>
> I agree with Andi that adding DMI strings would be overkill and would 
> leave clutter once the BIOS is fixed. [...]

We use the following policy for hardware/firmware workarounds in 
upstream arch/x86: if the system got shipped and if the vendor/OEM wants 
it fixed, then it has real DMI info (or some PCI ID match method, etc.) 
and an automatic workaround is very well possible and desirable.

If the vendor cannot be bothered to add a few lines based on a simple 
reading of dmidecode output and test it, then we dont really want/need 
the rest of the patch upstream either.

It should be literally 5 minutes of work to add a DMI match.

> I look at this patch as a stop-gap measure for people to fall back on 
> until a newer BIOS is available to correct the NUMA enumeration 
> issues. [...]

We dont do half-done stop-gap measures in the upstream kernel like that, 
and for various good reasons.

Furthermore, since Windows doesnt have a problem booting with this, i'm 
afraid that we are bound to see repeat problems of this sort, so we 
better have the DMI path beaten out - even if in this case it's a single 
model.

> [...]  Without it, we have nothing to point users to when they run 
> into this, waiting for a new BIOS.

I by all means support you to give users a real fix - one that applies 
the workaround automatically with a DMI match. Also, as i said, we can 
also add the boot option in the same patch.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
H. Peter Anvin Sept. 7, 2010, 8:03 p.m. UTC | #10
On 09/07/2010 12:38 PM, Peter P Waskiewicz Jr wrote:
> 
> It's one SKU of a Nehalem-EX system.  The BIOS for that SKU has an issue
> with resolving SRAT hotplug enumeration, and screws up the table.  Other
> SKU's of this same platform do not have the issue.  Efforts are underway
> to get this BIOS fixed, but in the meantime, there's nothing for users
> to work around the bug (aside from disabling memory hotplug in the
> BIOS).  Another platform almost shipped with the same symptoms, but
> caught it and had it fixed before it shipped (didn't catch it early
> because Windows wasn't failing, and most of the testing on that platform
> was done under Windows).
> 
> I agree with Andi that adding DMI strings would be overkill and would
> leave clutter once the BIOS is fixed.  I look at this patch as a
> stop-gap measure for people to fall back on until a newer BIOS is
> available to correct the NUMA enumeration issues.  Without it, we have
> nothing to point users to when they run into this, waiting for a new
> BIOS.
> 

No, this is exactly the kind of stuff for which a DMI match is ideal.  A
specific system with bounded propagation of the problem.  Thus, the DMI
match acts as a whitelist -- "we know this system and it is safe to
activate this hack on it."  This is a very good thing.

If this is a production BIOS it should have this information.

	-hpa
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waskiewicz Jr, Peter P Sept. 7, 2010, 8:16 p.m. UTC | #11
On Tue, 2010-09-07 at 13:03 -0700, H. Peter Anvin wrote:
> On 09/07/2010 12:38 PM, Peter P Waskiewicz Jr wrote:
> > 
> > It's one SKU of a Nehalem-EX system.  The BIOS for that SKU has an issue
> > with resolving SRAT hotplug enumeration, and screws up the table.  Other
> > SKU's of this same platform do not have the issue.  Efforts are underway
> > to get this BIOS fixed, but in the meantime, there's nothing for users
> > to work around the bug (aside from disabling memory hotplug in the
> > BIOS).  Another platform almost shipped with the same symptoms, but
> > caught it and had it fixed before it shipped (didn't catch it early
> > because Windows wasn't failing, and most of the testing on that platform
> > was done under Windows).
> > 
> > I agree with Andi that adding DMI strings would be overkill and would
> > leave clutter once the BIOS is fixed.  I look at this patch as a
> > stop-gap measure for people to fall back on until a newer BIOS is
> > available to correct the NUMA enumeration issues.  Without it, we have
> > nothing to point users to when they run into this, waiting for a new
> > BIOS.
> > 
> 
> No, this is exactly the kind of stuff for which a DMI match is ideal.  A
> specific system with bounded propagation of the problem.  Thus, the DMI
> match acts as a whitelist -- "we know this system and it is safe to
> activate this hack on it."  This is a very good thing.
> 
> If this is a production BIOS it should have this information.

Responding for both your and Ingo's last email, I'll work on getting a
DMI match for this system.

Thanks,
-PJ
H. Peter Anvin Sept. 7, 2010, 8:48 p.m. UTC | #12
On 09/07/2010 01:16 PM, Peter P Waskiewicz Jr wrote:
>>
>> If this is a production BIOS it should have this information.
> 
> Responding for both your and Ingo's last email, I'll work on getting a
> DMI match for this system.
> 

Great.

	-hpa
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen Sept. 8, 2010, 6:55 a.m. UTC | #13
"H. Peter Anvin" <hpa@zytor.com> writes:
>
> If this is a production BIOS it should have this information.

The point was -- it wasn't a production BIOS and the production
BIOS will be fixed.

If you want you call fill the DMI tables with all bugs in preproduction
BIOS, but I suspect that will fill them up rather quickly. Preproduction
BIOS tend to be buggy (that is why they are pre-production)

Another problem also with adding stuff for early BIOS is that the later
BIOS likely have this fixed and then you have a quirk that runs but is
not needed (unless you add another DMI match for the fixed production
BIOS etc.) While for this case it would be probably harmless, 
this has caused problems in the past with other quirks.

Normally early users are also ok with having to specify an option
as a workaround, they just need something they can specify.

IMHO also it's good coding practice to have a command line option
for every DMI quirk anyways. This makes it easier to let 
users test if they need a particular quirk and helps
with users running older kernels.

-Andi
H. Peter Anvin Sept. 8, 2010, 6:03 p.m. UTC | #14
On 09/07/2010 11:55 PM, Andi Kleen wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>>
>> If this is a production BIOS it should have this information.
> 
> The point was -- it wasn't a production BIOS and the production
> BIOS will be fixed.
> 

From the original description:

This patch adds a boot parameter to allow a kernel to be booted
with the option to skip the SRAT check.  There are BIOSes in
                                         ^^^^^^^^^^^^^^^^^^^
production that have these failures, so this will allow people
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
in the field to work around these BIOS issues.


Agreed that command-line option is a good thing to have in general and
that it is not necessary to have DMI quirks for preproduction BIOSes.

	-hpa
Waskiewicz Jr, Peter P Sept. 8, 2010, 6:51 p.m. UTC | #15
On Wed, 2010-09-08 at 11:03 -0700, H. Peter Anvin wrote:
> On 09/07/2010 11:55 PM, Andi Kleen wrote:
> > "H. Peter Anvin" <hpa@zytor.com> writes:
> >>
> >> If this is a production BIOS it should have this information.
> > 
> > The point was -- it wasn't a production BIOS and the production
> > BIOS will be fixed.
> > 
> 
> From the original description:
> 
> This patch adds a boot parameter to allow a kernel to be booted
> with the option to skip the SRAT check.  There are BIOSes in
>                                          ^^^^^^^^^^^^^^^^^^^
> production that have these failures, so this will allow people
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> in the field to work around these BIOS issues.
> 
> 
> Agreed that command-line option is a good thing to have in general and
> that it is not necessary to have DMI quirks for preproduction BIOSes.

Right.  The new patchset (when I finish it) will have the DMI quirk for
this specific production NHM-EX, plus the command-line option.

-PJ
H. Peter Anvin Sept. 8, 2010, 7:09 p.m. UTC | #16
On 09/01/2010 03:11 PM, Peter P Waskiewicz Jr wrote:
> On Wed, 2010-09-01 at 15:04 -0700, rdunlap@xenotime.net wrote:
>> On Wed, September 1, 2010 2:33 pm, Peter P Waskiewicz Jr wrote:
>>> Documentation/x86/x86_64/boot-options.txt |    4 ++++
>>> arch/x86/mm/srat_64.c                     |   20 +++++++++++++++++--- 2
>>> files changed, 21 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/x86/mm/srat_64.c b/arch/x86/mm/srat_64.c index
>>> f9897f7..8719472 100644 --- a/arch/x86/mm/srat_64.c
>>> +++ b/arch/x86/mm/srat_64.c
>>> @@ -351,6 +351,15 @@ int __init acpi_get_nodes(struct bootnode *physnodes)
>>>  return ret; }
>>>
>>>
>>> +int srat_bypass_bios = 0;
>>
>> static int srat_bypass_bios;
>>
>> Don't need to init to 0.
> 
> I'll fix and send a second version.  Thanks Randy.
> 

Should be static *bool* srat_bypass_bios;

Please use bool for actual booleans; it prevents someone from being
"smart" and using values like 2 or -1 for "special cases".

	-hpa
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
index 7fbbaf8..7863d9c 100644
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -316,3 +316,7 @@  Miscellaneous
 		Do not use GB pages for kernel direct mappings.
 	gbpages
 		Use GB pages for kernel direct mappings.
+ 	sratbypassbios
+		If specified, will skip an SRAT check for PXM coverage
+		from BIOS enumeration.  Only to be used on systems with
+		buggy BIOSes that munge the SRAT enumeration.
diff --git a/arch/x86/mm/srat_64.c b/arch/x86/mm/srat_64.c
index f9897f7..8719472 100644
--- a/arch/x86/mm/srat_64.c
+++ b/arch/x86/mm/srat_64.c
@@ -351,6 +351,15 @@  int __init acpi_get_nodes(struct bootnode *physnodes)
 	return ret;
 }
 
+int srat_bypass_bios = 0;
+
+static int __init srat_bypass_bios_setup(char *str)
+{
+        srat_bypass_bios = 1;
+        return 0;
+}
+early_param("sratbypassbios", srat_bypass_bios_setup);
+
 /* Use the information discovered above to actually set up the nodes. */
 int __init acpi_scan_nodes(unsigned long start, unsigned long end)
 {
@@ -425,9 +434,14 @@  int __init acpi_scan_nodes(unsigned long start, unsigned long end)
 						nodes[i].end >> PAGE_SHIFT);
 	/* for out of order entries in SRAT */
 	sort_node_map();
-	if (!nodes_cover_memory(nodes)) {
-		bad_srat();
-		return -1;
+	if (!srat_bypass_bios) {
+		if (!nodes_cover_memory(nodes)) {
+			bad_srat();
+			return -1;
+		}
+	} else {
+		printk(KERN_INFO
+		           "SRAT: Bypassing NUMA sanity check...bad BIOS...\n");
 	}
 
 	/* Account for nodes with cpus and no memory */