diff mbox series

[SRU,Bionic,1/1] x86, sched: Allow topologies where NUMA nodes share an LLC

Message ID 20200608041814.23501-2-matthew.ruffell@canonical.com
State New
Headers show
Series smpboot: don't call topology_sane() when Sub-NUMA-Clustering is enabled | expand

Commit Message

Matthew Ruffell June 8, 2020, 4:18 a.m. UTC
From: Alison Schofield <alison.schofield@intel.com>

BugLink: https://bugs.launchpad.net/bugs/1882478

Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
into two "slices", each containing half the cores, half the LLC, and one
memory controller and each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This means
that the portion of the LLC *available* to a CPU depends on the data being
accessed:

    Remote socket: entire package LLC is shared
    Local socket->local slice: data goes into local slice LLC
    Local socket->remote slice: data goes into remote-slice LLC. Slightly
                    		higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One of
the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates the LLC
as being shared by the entire package. This is not 100% precise because the
entire cache is not usable by all accesses. But, it *is* the way the
hardware enumerates itself, and this is not likely to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As noted
above, this is not true for local socket accesses. This patch does not
correct the sysfs info. It is the same, pre and post patch.

The current code emits the following warning:

 sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

The warning is coming from the topology_sane() check in smpboot.c because
the topology is not matching the expectations of the model for obvious
reasons.

To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
the "coregroup" sched_domain_topology_level and use NUMA information from
the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package level,
which are also NUMA boundaries.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: brice.goglin@gmail.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com
(backported from commit 1340ccfa9a9afefdbab90d7935d4ed19817e37c2)
[mruffell: re-arrange #includes to match upstream]
Signed-off-by: Matthew Ruffell <matthew.ruffell@canonical.com>
---
 arch/x86/kernel/smpboot.c | 45 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 5 deletions(-)

Comments

Stefan Bader June 10, 2020, 7:36 a.m. UTC | #1
On 08.06.20 06:18, Matthew Ruffell wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1882478
> 
> Intel's Skylake Server CPUs have a different LLC topology than previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
> into two "slices", each containing half the cores, half the LLC, and one
> memory controller and each slice is enumerated to Linux as a NUMA
> node. This is similar to how the cores and LLC were arranged for the
> Cluster-On-Die (CoD) feature.
> 
> CoD allowed the same cache line to be present in each half of the LLC.
> But, with SNC, each line is only ever present in *one* slice. This means
> that the portion of the LLC *available* to a CPU depends on the data being
> accessed:
> 
>     Remote socket: entire package LLC is shared
>     Local socket->local slice: data goes into local slice LLC
>     Local socket->remote slice: data goes into remote-slice LLC. Slightly
>                     		higher latency than local slice LLC.
> 
> The biggest implication from this is that a process accessing all
> NUMA-local memory only sees half the LLC capacity.
> 
> The CPU describes its cache hierarchy with the CPUID instruction. One of
> the CPUID leaves enumerates the "logical processors sharing this
> cache". This information is used for scheduling decisions so that tasks
> move more freely between CPUs sharing the cache.
> 
> But, the CPUID for the SNC configuration discussed above enumerates the LLC
> as being shared by the entire package. This is not 100% precise because the
> entire cache is not usable by all accesses. But, it *is* the way the
> hardware enumerates itself, and this is not likely to change.
> 
> The userspace visible impact of all the above is that the sysfs info
> reports the entire LLC as being available to the entire package. As noted
> above, this is not true for local socket accesses. This patch does not
> correct the sysfs info. It is the same, pre and post patch.
> 
> The current code emits the following warning:
> 
>  sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
> 
> The warning is coming from the topology_sane() check in smpboot.c because
> the topology is not matching the expectations of the model for obvious
> reasons.
> 
> To fix this, add a vendor and model specific check to never call
> topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
> the "coregroup" sched_domain_topology_level and use NUMA information from
> the SRAT alone.
> 
> This is OK at least on the hardware we are immediately concerned about
> because the LLC sharing happens at both the slice and at the package level,
> which are also NUMA boundaries.
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Cc: Prarit Bhargava <prarit@redhat.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: brice.goglin@gmail.com
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com
> (backported from commit 1340ccfa9a9afefdbab90d7935d4ed19817e37c2)
> [mruffell: re-arrange #includes to match upstream]
> Signed-off-by: Matthew Ruffell <matthew.ruffell@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  arch/x86/kernel/smpboot.c | 45 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index ce8d67c3ed44..c4bf8e1543eb 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -78,6 +78,8 @@
>  #include <asm/realmode.h>
>  #include <asm/misc.h>
>  #include <asm/qspinlock.h>
> +#include <asm/intel-family.h>
> +#include <asm/cpu_device_id.h>
>  #include <asm/spec-ctrl.h>
>  #include <asm/hw_irq.h>
>  
> @@ -410,15 +412,47 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>  	return false;
>  }
>  
> +/*
> + * Define snc_cpu[] for SNC (Sub-NUMA Cluster) CPUs.
> + *
> + * These are Intel CPUs that enumerate an LLC that is shared by
> + * multiple NUMA nodes. The LLC on these systems is shared for
> + * off-package data access but private to the NUMA node (half
> + * of the package) for on-package access.
> + *
> + * CPUID (the source of the information about the LLC) can only
> + * enumerate the cache as being shared *or* unshared, but not
> + * this particular configuration. The CPU in this case enumerates
> + * the cache to be shared across the entire package (spanning both
> + * NUMA nodes).
> + */
> +
> +static const struct x86_cpu_id snc_cpu[] = {
> +	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X },
> +	{}
> +};
> +
>  static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>  {
>  	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
>  
> -	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
> -	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
> -		return topology_sane(c, o, "llc");
> +	/* Do not match if we do not have a valid APICID for cpu: */
> +	if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
> +		return false;
>  
> -	return false;
> +	/* Do not match if LLC id does not match: */
> +	if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
> +		return false;
> +
> +	/*
> +	 * Allow the SNC topology without warning. Return of false
> +	 * means 'c' does not share the LLC of 'o'. This will be
> +	 * reflected to userspace.
> +	 */
> +	if (!topology_same_node(c, o) && x86_match_cpu(snc_cpu))
> +		return false;
> +
> +	return topology_sane(c, o, "llc");
>  }
>  
>  /*
> @@ -476,7 +510,8 @@ static struct sched_domain_topology_level x86_topology[] = {
>  
>  /*
>   * Set if a package/die has multiple NUMA nodes inside.
> - * AMD Magny-Cours and Intel Cluster-on-Die have this.
> + * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
> + * Sub-NUMA Clustering have this.
>   */
>  static bool x86_has_numa_in_package;
>  
>
Sultan Alsawaf June 23, 2020, 4:50 p.m. UTC | #2
Acked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com>

On Mon, Jun 08, 2020 at 04:18:14PM +1200, Matthew Ruffell wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1882478
> 
> Intel's Skylake Server CPUs have a different LLC topology than previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
> into two "slices", each containing half the cores, half the LLC, and one
> memory controller and each slice is enumerated to Linux as a NUMA
> node. This is similar to how the cores and LLC were arranged for the
> Cluster-On-Die (CoD) feature.
> 
> CoD allowed the same cache line to be present in each half of the LLC.
> But, with SNC, each line is only ever present in *one* slice. This means
> that the portion of the LLC *available* to a CPU depends on the data being
> accessed:
> 
>     Remote socket: entire package LLC is shared
>     Local socket->local slice: data goes into local slice LLC
>     Local socket->remote slice: data goes into remote-slice LLC. Slightly
>                     		higher latency than local slice LLC.
> 
> The biggest implication from this is that a process accessing all
> NUMA-local memory only sees half the LLC capacity.
> 
> The CPU describes its cache hierarchy with the CPUID instruction. One of
> the CPUID leaves enumerates the "logical processors sharing this
> cache". This information is used for scheduling decisions so that tasks
> move more freely between CPUs sharing the cache.
> 
> But, the CPUID for the SNC configuration discussed above enumerates the LLC
> as being shared by the entire package. This is not 100% precise because the
> entire cache is not usable by all accesses. But, it *is* the way the
> hardware enumerates itself, and this is not likely to change.
> 
> The userspace visible impact of all the above is that the sysfs info
> reports the entire LLC as being available to the entire package. As noted
> above, this is not true for local socket accesses. This patch does not
> correct the sysfs info. It is the same, pre and post patch.
> 
> The current code emits the following warning:
> 
>  sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
> 
> The warning is coming from the topology_sane() check in smpboot.c because
> the topology is not matching the expectations of the model for obvious
> reasons.
> 
> To fix this, add a vendor and model specific check to never call
> topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
> the "coregroup" sched_domain_topology_level and use NUMA information from
> the SRAT alone.
> 
> This is OK at least on the hardware we are immediately concerned about
> because the LLC sharing happens at both the slice and at the package level,
> which are also NUMA boundaries.
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Borislav Petkov <bp@suse.de>
> Cc: Prarit Bhargava <prarit@redhat.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: brice.goglin@gmail.com
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com
> (backported from commit 1340ccfa9a9afefdbab90d7935d4ed19817e37c2)
> [mruffell: re-arrange #includes to match upstream]
> Signed-off-by: Matthew Ruffell <matthew.ruffell@canonical.com>
> ---
>  arch/x86/kernel/smpboot.c | 45 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index ce8d67c3ed44..c4bf8e1543eb 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -78,6 +78,8 @@
>  #include <asm/realmode.h>
>  #include <asm/misc.h>
>  #include <asm/qspinlock.h>
> +#include <asm/intel-family.h>
> +#include <asm/cpu_device_id.h>
>  #include <asm/spec-ctrl.h>
>  #include <asm/hw_irq.h>
>  
> @@ -410,15 +412,47 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>  	return false;
>  }
>  
> +/*
> + * Define snc_cpu[] for SNC (Sub-NUMA Cluster) CPUs.
> + *
> + * These are Intel CPUs that enumerate an LLC that is shared by
> + * multiple NUMA nodes. The LLC on these systems is shared for
> + * off-package data access but private to the NUMA node (half
> + * of the package) for on-package access.
> + *
> + * CPUID (the source of the information about the LLC) can only
> + * enumerate the cache as being shared *or* unshared, but not
> + * this particular configuration. The CPU in this case enumerates
> + * the cache to be shared across the entire package (spanning both
> + * NUMA nodes).
> + */
> +
> +static const struct x86_cpu_id snc_cpu[] = {
> +	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X },
> +	{}
> +};
> +
>  static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>  {
>  	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
>  
> -	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
> -	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
> -		return topology_sane(c, o, "llc");
> +	/* Do not match if we do not have a valid APICID for cpu: */
> +	if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
> +		return false;
>  
> -	return false;
> +	/* Do not match if LLC id does not match: */
> +	if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
> +		return false;
> +
> +	/*
> +	 * Allow the SNC topology without warning. Return of false
> +	 * means 'c' does not share the LLC of 'o'. This will be
> +	 * reflected to userspace.
> +	 */
> +	if (!topology_same_node(c, o) && x86_match_cpu(snc_cpu))
> +		return false;
> +
> +	return topology_sane(c, o, "llc");
>  }
>  
>  /*
> @@ -476,7 +510,8 @@ static struct sched_domain_topology_level x86_topology[] = {
>  
>  /*
>   * Set if a package/die has multiple NUMA nodes inside.
> - * AMD Magny-Cours and Intel Cluster-on-Die have this.
> + * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
> + * Sub-NUMA Clustering have this.
>   */
>  static bool x86_has_numa_in_package;
>  
> -- 
> 2.25.1
> 
> 
> -- 
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
diff mbox series

Patch

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ce8d67c3ed44..c4bf8e1543eb 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -78,6 +78,8 @@ 
 #include <asm/realmode.h>
 #include <asm/misc.h>
 #include <asm/qspinlock.h>
+#include <asm/intel-family.h>
+#include <asm/cpu_device_id.h>
 #include <asm/spec-ctrl.h>
 #include <asm/hw_irq.h>
 
@@ -410,15 +412,47 @@  static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return false;
 }
 
+/*
+ * Define snc_cpu[] for SNC (Sub-NUMA Cluster) CPUs.
+ *
+ * These are Intel CPUs that enumerate an LLC that is shared by
+ * multiple NUMA nodes. The LLC on these systems is shared for
+ * off-package data access but private to the NUMA node (half
+ * of the package) for on-package access.
+ *
+ * CPUID (the source of the information about the LLC) can only
+ * enumerate the cache as being shared *or* unshared, but not
+ * this particular configuration. The CPU in this case enumerates
+ * the cache to be shared across the entire package (spanning both
+ * NUMA nodes).
+ */
+
+static const struct x86_cpu_id snc_cpu[] = {
+	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X },
+	{}
+};
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
-	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
-	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
-		return topology_sane(c, o, "llc");
+	/* Do not match if we do not have a valid APICID for cpu: */
+	if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
+		return false;
 
-	return false;
+	/* Do not match if LLC id does not match: */
+	if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
+		return false;
+
+	/*
+	 * Allow the SNC topology without warning. Return of false
+	 * means 'c' does not share the LLC of 'o'. This will be
+	 * reflected to userspace.
+	 */
+	if (!topology_same_node(c, o) && x86_match_cpu(snc_cpu))
+		return false;
+
+	return topology_sane(c, o, "llc");
 }
 
 /*
@@ -476,7 +510,8 @@  static struct sched_domain_topology_level x86_topology[] = {
 
 /*
  * Set if a package/die has multiple NUMA nodes inside.
- * AMD Magny-Cours and Intel Cluster-on-Die have this.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
  */
 static bool x86_has_numa_in_package;