Patchwork [v2] sparc64: fix and optimize irq distribution

Submitter Hong H. Pham
Date May 13, 2009, 4:52 p.m.
Message ID <1242233551-3369-1-git-send-email-hong.pham@windriver.com>
Permalink /patch/27159/
State Changes Requested
Delegated to: David Miller

Comments

Hong H. Pham - May 13, 2009, 4:52 p.m.
irq_choose_cpu() should compare the affinity mask against cpu_online_map
rather than CPU_MASK_ALL, since irq_select_affinity() sets the interrupt's
affinity mask to cpu_online_map "and" CPU_MASK_ALL (which ends up being
just cpu_online_map).  The mask comparison in irq_choose_cpu() will always
fail since the two masks are not the same.  So the CPU chosen is the first CPU
in the intersection of cpu_online_map and CPU_MASK_ALL, which is always CPU0.
That means all interrupts are reassigned to CPU0...
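
To make the failure mode concrete, here is a minimal userspace sketch. It assumes a 64-bit integer as a toy stand-in for cpumask_t and fewer CPUs online than the mask can hold; every toy_* name is made up for illustration and is not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for cpumask_t: bit n set means CPU n is in the mask.
 * All toy_* names are illustrative and do not exist in the kernel. */
typedef uint64_t toy_mask_t;

#define TOY_MASK_ALL UINT64_MAX

/* Lowest-numbered CPU in the mask, or -1 if the mask is empty. */
static int toy_first_cpu(toy_mask_t mask)
{
	int cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Returns -1 when the "distribute" branch would be taken (masks equal,
 * or nothing online in the affinity mask).  Otherwise returns the CPU
 * the fallback branch picks: the first CPU in online & affinity. */
static int toy_choose_cpu(toy_mask_t affinity, toy_mask_t online,
			  toy_mask_t reference)
{
	assert(online != 0);
	if (affinity == reference)
		return -1;
	return toy_first_cpu(online & affinity);
}
```

With 8 CPUs online (mask 0xFF) and affinity set to online & TOY_MASK_ALL, comparing against TOY_MASK_ALL falls through to the fallback and yields CPU 0, while comparing against the online mask takes the distribution branch.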

Distributing interrupts to CPUs in a linearly increasing round robin fashion
is not optimal for the UltraSPARC T1/T2.  Also, the irq_rover in
irq_choose_cpu() causes an interrupt to be assigned to a different
processor each time the interrupt is allocated and released.  This may lead
to an unbalanced distribution over time.
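
A small sketch of why a rover drifts while a static mapping does not. The struct and function names are hypothetical, and the static mapping here is a plain modulo, which is simpler than what the patch does.

```c
#include <assert.h>

/* Hypothetical rover: every allocation takes the next CPU in sequence,
 * regardless of which interrupt is asking. */
struct toy_rover {
	unsigned int next;
	unsigned int nr_cpus;
};

static unsigned int toy_rover_pick(struct toy_rover *r)
{
	unsigned int cpu;

	assert(r->nr_cpus > 0);
	cpu = r->next;
	r->next = (r->next + 1) % r->nr_cpus;
	return cpu;
}

/* Static mapping: the CPU depends only on the interrupt number, so
 * releasing and reallocating an interrupt lands on the same CPU. */
static unsigned int toy_static_pick(unsigned int irq, unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return irq % nr_cpus;
}
```

Allocating the same interrupt twice through toy_rover_pick() returns two different CPUs, whereas toy_static_pick() returns the same CPU each time.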

A static mapping of interrupts to processors is done to optimize and balance
interrupt distribution.  For the T1/T2, interrupts are spread to different
cores first, and then to strands within a core.
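
The "cores first, then strands" ordering can be sketched as follows. This is an illustration only: it assumes CPU ids are laid out as core * strands_per_core + strand, it ignores the T2 pipeline staggering, and toy_cores_first() is a made-up name.

```c
#include <assert.h>

/* ASSUMPTION: CPU ids are laid out as core * strands_per_core + strand.
 * Consecutive indices walk across cores; only after every core has been
 * used once does the strand number advance. */
static unsigned int toy_cores_first(unsigned int index,
				    unsigned int nr_cores,
				    unsigned int strands_per_core)
{
	unsigned int core, strand;

	assert(nr_cores > 0 && strands_per_core > 0);
	core = index % nr_cores;
	strand = (index / nr_cores) % strands_per_core;
	return core * strands_per_core + strand;
}
```

For 8 cores of 8 strands, indices 0 through 7 map to CPUs 0, 8, 16, ..., 56, and index 8 wraps back to core 0 on its next strand, CPU 1.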

The following are benchmarks showing the effects of interrupt distribution
on a T2.  The test was done with iperf using a pair of T5220 boxes, each
with a 10GbE NIU (XAUI) connected back to back.

  TCP     | Stock       Linear RR IRQ  Optimized IRQ
  Streams | 2.6.30-rc5  Distribution   Distribution
          | GBits/sec   GBits/sec      GBits/sec
  --------+-----------------------------------------
    1       0.839       0.862          0.868
    8       1.16        4.96           5.88
   16       1.15        6.40           8.04
  100       1.09        7.28           8.68

Signed-off-by: Hong H. Pham <hong.pham@windriver.com>
---
 arch/sparc/kernel/Makefile |    1 +
 arch/sparc/kernel/cpumap.c |  110 ++++++++++++++++++++++++++++++++++++++++++++
 arch/sparc/kernel/cpumap.h |   15 ++++++
 arch/sparc/kernel/irq_64.c |   29 ++----------
 arch/sparc/kernel/smp_64.c |    2 +
 5 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 arch/sparc/kernel/cpumap.c
 create mode 100644 arch/sparc/kernel/cpumap.h
David Miller - May 22, 2009, 12:14 a.m.
From: "Hong H. Pham" <hong.pham@windriver.com>
Date: Wed, 13 May 2009 12:52:31 -0400

> irq_choose_cpu() should compare the affinity mask against cpu_online_map
> rather than CPU_MASK_ALL, since irq_select_affinity() sets the interrupt's
> affinity mask to cpu_online_map "and" CPU_MASK_ALL (which ends up being
> just cpu_online_map).  The mask comparison in irq_choose_cpu() will always
> fail since the two masks are not the same.  So the CPU chosen is the first CPU
> in the intersection of cpu_online_map and CPU_MASK_ALL, which is always CPU0.
> That means all interrupts are reassigned to CPU0...
> 
> Distributing interrupts to CPUs in a linearly increasing round robin fashion
> is not optimal for the UltraSPARC T1/T2.  Also, the irq_rover in
> irq_choose_cpu() causes an interrupt to be assigned to a different
> processor each time the interrupt is allocated and released.  This may lead
> to an unbalanced distribution over time.
> 
> A static mapping of interrupts to processors is done to optimize and balance
> interrupt distribution.  For the T1/T2, interrupts are spread to different
> cores first, and then to strands within a core.
> 
> The following are benchmarks showing the effects of interrupt distribution
> on a T2.  The test was done with iperf using a pair of T5220 boxes, each
> with a 10GbE NIU (XAUI) connected back to back.
> 
>   TCP     | Stock       Linear RR IRQ  Optimized IRQ
>   Streams | 2.6.30-rc5  Distribution   Distribution
>           | GBits/sec   GBits/sec      GBits/sec
>   --------+-----------------------------------------
>     1       0.839       0.862          0.868
>     8       1.16        4.96           5.88
>    16       1.15        6.40           8.04
>   100       1.09        7.28           8.68
> 
> Signed-off-by: Hong H. Pham <hong.pham@windriver.com>

I like this patch a lot but it's going to do the wrong thing on
virtualized guests.

There is absolutely no connection between virtual cpu numbers
and the hierarchy in which they sit in the cores and higher
level hierarchy of the processor.  So you can't just say
(cpu_id / 4) is the core number or anything like that.

You must use the machine description to determine this kind of
information, just as we do in arch/sparc/kernel/mdesc.c to figure out
the CPU scheduler grouping maps.  (see mark_proc_ids() and
mark_core_ids())
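
As a rough illustration of the difference, the sketch below contrasts guessing the core from the cpu number with looking it up in a per-CPU topology table. The struct, the function names, and the table values used with it are all hypothetical; the real information would come from the machine description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU topology record, standing in for what the
 * machine description reports.  Not the kernel's actual layout. */
struct toy_cpu_topo {
	int core_id;
	int proc_id;
};

/* Naive guess: assumes cpu numbers are packed by core. */
static int toy_guess_core(int cpu, int strands_per_core)
{
	assert(strands_per_core > 0);
	return cpu / strands_per_core;
}

/* Lookup: trusts only what the topology table says. */
static int toy_lookup_core(const struct toy_cpu_topo *topo,
			   size_t nr_cpus, int cpu)
{
	assert(cpu >= 0 && (size_t)cpu < nr_cpus);
	return topo[cpu].core_id;
}

/* Number of distinct core ids among the CPUs in the table. */
static int toy_count_cores(const struct toy_cpu_topo *topo, size_t nr_cpus)
{
	size_t i, j;
	int cores = 0;

	for (i = 0; i < nr_cpus; i++) {
		for (j = 0; j < i; j++)
			if (topo[j].core_id == topo[i].core_id)
				break;
		if (j == i)
			cores++;
	}
	return cores;
}
```

In a guest whose four virtual cpus sit on cores 0, 3, 5 and 5 (made-up values), the naive guess puts all four on core 0, while the lookup reports three distinct cores.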

This will also allow your code to transparently work on ROCK and other
future cpus without any changes.

I'm happy to apply this patch once you change it to use the MDESC
properly to probe the cpu hierarchy information.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller - May 22, 2009, 12:18 a.m.
From: David Miller <davem@davemloft.net>
Date: Thu, 21 May 2009 17:14:24 -0700 (PDT)

> I'm happy to apply this patch once you change it to use the MDESC
> properly to probe the cpu hierarchy information.

BTW, you could also use the precomputed scheduler grouping
cpu masks in your distribution table building too.
Hong H. Pham - June 3, 2009, 4:41 p.m.
Hi,

Here's a revised patch to fix and optimize interrupt distribution.  The major
change since the last patch is that a tree representation of the CPU hierarchy
is built from the per CPU cpu_data.  Each iteration through the CPU tree
picks the next optimal CPU.  The following are example CPU distribution maps
for various Niagara2/2+ machines.

  T5220 (64 cpus)
  { 0 8 16 24 32 40 48 56 4 12 20 28 36 44 52 60 1 9 17 25 33 41 49 57 5 13 21 29 37 45 53 61 2 10 18 26 34 42 50 58 6 14 22 30 38 46 54 62 3 11 19 27 35 43 51 59 7 15 23 31 39 47 55 63}

  T5440 (2 way, 96 cpus)
  { 0 8 16 24 32 40 72 80 88 96 104 112 4 12 20 28 36 44 76 84 92 100 108 116 1 9 17 25 33 41 73 81 89 97 105 113 5 13 21 29 37 45 77 85 93 101 109 117 2 10 18 26 34 42 74 82 90 98 106 114 6 14 22 30 38 46 78 86 94 102 110 118 3 11 19 27 35 43 75 83 91 99 107 115 7 15 23 31 39 47 79 87 95 103 111 119}

  LDOM (on a T5220)
  { 0 3 1 4 2 5 0 6}
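
One way to picture the traversal is a tree in which every inner node keeps its own round-robin cursor. The sketch below is not the patch's code: the toy_* names, the explicit pipeline level (2 pipelines of 4 strands per core), and the sequential CPU numbering are assumptions chosen so that the output matches the start of the T5220 map above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 8

/* Hypothetical tree node.  A leaf (nr_children == 0) carries a CPU id;
 * an inner node carries children and a round-robin cursor. */
struct toy_node {
	int cpu;
	int nr_children;
	int rover;
	struct toy_node *child[TOY_MAX_CHILDREN];
};

/* Build root -> cores -> pipelines -> strands from caller-provided
 * storage, numbering CPUs sequentially.  The pool must hold
 * 1 + cores + cores*pipes + cores*pipes*strands nodes. */
static struct toy_node *toy_build(struct toy_node *pool, int cores,
				  int pipes, int strands)
{
	struct toy_node *root = pool++;
	int c, p, s, cpu = 0;

	assert(cores >= 1 && cores <= TOY_MAX_CHILDREN);
	assert(pipes >= 1 && pipes <= TOY_MAX_CHILDREN);
	assert(strands >= 1 && strands <= TOY_MAX_CHILDREN);

	root->nr_children = cores;
	root->rover = 0;
	for (c = 0; c < cores; c++) {
		struct toy_node *core = pool++;

		core->nr_children = pipes;
		core->rover = 0;
		root->child[c] = core;
		for (p = 0; p < pipes; p++) {
			struct toy_node *pipe = pool++;

			pipe->nr_children = strands;
			pipe->rover = 0;
			core->child[p] = pipe;
			for (s = 0; s < strands; s++) {
				struct toy_node *leaf = pool++;

				leaf->nr_children = 0;
				leaf->rover = 0;
				leaf->cpu = cpu++;
				pipe->child[s] = leaf;
			}
		}
	}
	return root;
}

/* Descend from the root, taking each level's next child in turn and
 * advancing that level's cursor on the way down. */
static int toy_next_cpu(struct toy_node *n)
{
	while (n->nr_children > 0) {
		struct toy_node *next = n->child[n->rover];

		n->rover = (n->rover + 1) % n->nr_children;
		n = next;
	}
	return n->cpu;
}
```

Because the cursor at the root advances on every pick, successive picks land on different cores; a core's own cursor only advances when that core is visited again, which alternates its pipelines before reusing a strand group.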

An assumption used when building the CPU tree is that cpu_data is sorted
by node, core_id, and proc_id (in order of significance).  This is the case
for the Niagara2 machines I have available.  If this isn't true for all
sparc64 machines, a copy of cpu_data would need to be sorted prior to
building the CPU tree.
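
If such a sort were ever needed, it could look roughly like the following userspace sketch. The struct is a hypothetical subset of cpu_data, and qsort() stands in for whatever sorting helper kernel code would actually use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical copy of the per-CPU fields that matter for ordering. */
struct toy_cpuinfo {
	int cpu;
	int node;
	int core_id;
	int proc_id;
};

/* Three-way compare that cannot overflow, unlike a - b. */
static int toy_cmp_int(int a, int b)
{
	return (a > b) - (a < b);
}

/* Order by node, then core_id, then proc_id. */
static int toy_cmp_cpuinfo(const void *a, const void *b)
{
	const struct toy_cpuinfo *x = a;
	const struct toy_cpuinfo *y = b;
	int r;

	r = toy_cmp_int(x->node, y->node);
	if (r)
		return r;
	r = toy_cmp_int(x->core_id, y->core_id);
	if (r)
		return r;
	return toy_cmp_int(x->proc_id, y->proc_id);
}

/* Sort a private copy in place; the original table is left untouched
 * by the caller making the copy first. */
static void toy_sort_cpuinfo(struct toy_cpuinfo *v, size_t n)
{
	assert(v != NULL || n == 0);
	if (n > 1)
		qsort(v, n, sizeof(*v), toy_cmp_cpuinfo);
}
```

Entries that share all three keys may end up in either order, since qsort() is not guaranteed to be stable; adding the cpu number as a final tie-breaker would make the result fully deterministic.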

Regards,
Hong

David Miller - June 4, 2009, 4:57 a.m.
From: "Hong H. Pham" <hong.pham@windriver.com>
Date: Wed,  3 Jun 2009 12:41:01 -0400

> Here's a revised patch to fix and optimize interrupt distribution.  The major
> change since the last patch is that a tree representation of the CPU hierarchy
> is built from the per CPU cpu_data.  Each iteration through the CPU tree
> picks the next optimal CPU.  The following are example CPU distribution maps
> for various Niagara2/2+ machines.
> 
>   T5220 (64 cpus)
>   { 0 8 16 24 32 40 48 56 4 12 20 28 36 44 52 60 1 9 17 25 33 41 49 57 5 13 21 29 37 45 53 61 2 10 18 26 34 42 50 58 6 14 22 30 38 46 54 62 3 11 19 27 35 43 51 59 7 15 23 31 39 47 55 63}
> 
>   T5440 (2 way, 96 cpus)
>   { 0 8 16 24 32 40 72 80 88 96 104 112 4 12 20 28 36 44 76 84 92 100 108 116 1 9 17 25 33 41 73 81 89 97 105 113 5 13 21 29 37 45 77 85 93 101 109 117 2 10 18 26 34 42 74 82 90 98 106 114 6 14 22 30 38 46 78 86 94 102 110 118 3 11 19 27 35 43 75 83 91 99 107 115 7 15 23 31 39 47 79 87 95 103 111 119}
> 
>   LDOM (on a T5220)
>   { 0 3 1 4 2 5 0 6}

This looks great!

> An assumption used when building the CPU tree is that cpu_data is sorted
> by node, core_id, and proc_id (in order of significance).  This is the case
> for the Niagara2 machines I have available.  If this isn't true for all
> sparc64 machines, a copy of cpu_data would need to be sorted prior to
> building the CPU tree.

The MDESC and OF cpu scanners allocate the node, core_id, and proc_id
values linearly as the cpus are scanned linearly, so this should be OK
at least for now.

Patch

diff --git a/arch/sparc/kernel/Makefile b/arch/sparc/kernel/Makefile
index 54742e5..47029c6 100644
--- a/arch/sparc/kernel/Makefile
+++ b/arch/sparc/kernel/Makefile
@@ -53,8 +53,9 @@  obj-$(CONFIG_SPARC64)   += hvapi.o
 obj-$(CONFIG_SPARC64)   += sstate.o
 obj-$(CONFIG_SPARC64)   += mdesc.o
 obj-$(CONFIG_SPARC64)	+= pcr.o
 obj-$(CONFIG_SPARC64)	+= nmi.o
+obj-$(CONFIG_SPARC64_SMP) += cpumap.o
 
 # sparc32 do not use GENERIC_HARDIRQS but uses the generic devres implementation
 obj-$(CONFIG_SPARC32)     += devres.o
 devres-y                  := ../../../kernel/irq/devres.o
diff --git a/arch/sparc/kernel/cpumap.c b/arch/sparc/kernel/cpumap.c
new file mode 100644
index 0000000..0b1dce7
--- /dev/null
+++ b/arch/sparc/kernel/cpumap.c
@@ -0,0 +1,110 @@ 
+/* cpumap.c: used for optimizing CPU assignment
+ *
+ * Copyright (C) 2009 Hong H. Pham <hong.pham@windriver.com>
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/cpumask.h>
+#include <linux/spinlock.h>
+#include "cpumap.h"
+
+
+static u16 cpu_distribution_map[NR_CPUS];
+static int cpu_map_entries = 0;
+static DEFINE_SPINLOCK(cpu_map_lock);
+
+
+static int strands_per_core(void)
+{
+	int n;
+
+	switch (sun4v_chip_type) {
+	case SUN4V_CHIP_NIAGARA1:
+		n = 4;
+		break;
+
+	case SUN4V_CHIP_NIAGARA2:
+		n = 8;
+		break;
+
+	default:
+		n = 1;
+		break;
+	}
+	return n;
+}
+
+static int iterate_cpu(unsigned int index)
+{
+	static unsigned int num_cpus  = 0;
+	static unsigned int num_cores = 0;
+	unsigned int strand, s_per_core;
+
+	s_per_core = strands_per_core();
+
+	/* num_cpus must be a multiple of strands_per_core. */
+	if (unlikely(num_cores == 0)) {
+		num_cpus  = num_possible_cpus();
+		num_cores = ((num_cpus / s_per_core) +
+		             (num_cpus % s_per_core ? 1 : 0));
+		num_cpus  = num_cores * s_per_core;
+	}
+
+	strand = (index * s_per_core) / num_cpus;
+
+	/* Optimize for the T2.  Each core in the T2 has two instruction
+	 * pipelines.  Stagger the CPU distribution across different cores
+	 * first, and then across different pipelines.
+	 */
+	if (sun4v_chip_type == SUN4V_CHIP_NIAGARA2) {
+		if ((index / num_cores) & 0x01)
+			strand = s_per_core - strand;
+	}
+
+	return ((index * s_per_core) % num_cpus) + strand;
+}
+
+void cpu_map_init(void)
+{
+	int i, cpu, cpu_rover = 0;
+	unsigned long flag;
+
+	spin_lock_irqsave(&cpu_map_lock, flag);
+	for (i = 0; i < num_online_cpus(); i++) {
+		do {
+			cpu = iterate_cpu(cpu_rover++);
+		} while (!cpu_online(cpu));
+
+		cpu_distribution_map[i] = cpu;
+	}
+	cpu_map_entries = i;
+	spin_unlock_irqrestore(&cpu_map_lock, flag);
+}
+
+int map_to_cpu(unsigned int index)
+{
+	unsigned int mapped_cpu;
+	unsigned long flag;
+
+	spin_lock_irqsave(&cpu_map_lock, flag);
+	if (unlikely(cpu_map_entries != num_online_cpus())) {
+		spin_unlock_irqrestore(&cpu_map_lock, flag);
+		cpu_map_init();
+		spin_lock_irqsave(&cpu_map_lock, flag);
+	}
+
+	mapped_cpu = cpu_distribution_map[index % cpu_map_entries];
+#ifdef CONFIG_HOTPLUG_CPU
+	while (!cpu_online(mapped_cpu)) {
+		spin_unlock_irqrestore(&cpu_map_lock, flag);
+		cpu_map_init();
+		spin_lock_irqsave(&cpu_map_lock, flag);
+		mapped_cpu = cpu_distribution_map[index % cpu_map_entries];
+	}
+#endif /* CONFIG_HOTPLUG_CPU */
+	spin_unlock_irqrestore(&cpu_map_lock, flag);
+	return mapped_cpu;
+}
+EXPORT_SYMBOL(map_to_cpu);
diff --git a/arch/sparc/kernel/cpumap.h b/arch/sparc/kernel/cpumap.h
new file mode 100644
index 0000000..524b207
--- /dev/null
+++ b/arch/sparc/kernel/cpumap.h
@@ -0,0 +1,15 @@ 
+#ifndef _CPUMAP_H
+#define _CPUMAP_H
+
+#ifdef CONFIG_SMP
+extern void cpu_map_init(void);
+extern int  map_to_cpu(unsigned int index);
+#else
+#define cpu_map_init() do {} while (0)
+static inline int map_to_cpu(unsigned int index)
+{
+	return raw_smp_processor_id();
+}
+#endif
+
+#endif
diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index 5deabe9..b68386d 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -44,8 +44,9 @@ 
 #include <asm/hypervisor.h>
 #include <asm/cacheflush.h>
 
 #include "entry.h"
+#include "cpumap.h"
 
 #define NUM_IVECS	(IMAP_INR + 1)
 
 struct ino_bucket *ivector_table;
@@ -255,37 +256,15 @@  static int irq_choose_cpu(unsigned int virt_irq)
 	cpumask_t mask;
 	int cpuid;
 
 	cpumask_copy(&mask, irq_desc[virt_irq].affinity);
-	if (cpus_equal(mask, CPU_MASK_ALL)) {
-		static int irq_rover;
-		static DEFINE_SPINLOCK(irq_rover_lock);
-		unsigned long flags;
-
-		/* Round-robin distribution... */
-	do_round_robin:
-		spin_lock_irqsave(&irq_rover_lock, flags);
-
-		while (!cpu_online(irq_rover)) {
-			if (++irq_rover >= nr_cpu_ids)
-				irq_rover = 0;
-		}
-		cpuid = irq_rover;
-		do {
-			if (++irq_rover >= nr_cpu_ids)
-				irq_rover = 0;
-		} while (!cpu_online(irq_rover));
-
-		spin_unlock_irqrestore(&irq_rover_lock, flags);
+	if (cpus_equal(mask, cpu_online_map)) {
+		cpuid = map_to_cpu(virt_irq);
 	} else {
 		cpumask_t tmp;
 
 		cpus_and(tmp, cpu_online_map, mask);
-
-		if (cpus_empty(tmp))
-			goto do_round_robin;
-
-		cpuid = first_cpu(tmp);
+		cpuid = cpus_empty(tmp) ? map_to_cpu(virt_irq) : first_cpu(tmp);
 	}
 
 	return cpuid;
 }
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index f7642e5..54906aa 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1314,8 +1314,10 @@  int __cpu_disable(void)
 	ipi_call_lock();
 	cpu_clear(cpu, cpu_online_map);
 	ipi_call_unlock();
 
+	cpu_map_init();
+
 	return 0;
 }
 
 void __cpu_die(unsigned int cpu)