Patchwork genirq: Set initial default irq affinity to just CPU0

Submitter Kumar Gala
Date Oct. 24, 2008, 3:57 p.m.
Message ID <1224863858-7933-1-git-send-email-galak@kernel.crashing.org>
Permalink /patch/5688/
State Superseded, archived
Delegated to: Benjamin Herrenschmidt

Comments

Kumar Gala - Oct. 24, 2008, 3:57 p.m.
Commit 18404756765c713a0be4eb1082920c04822ce588 introduced a regression
on a subset of SMP-based PPC systems whose interrupt controllers only
allow an irq to be routed to a single processor.  Previously, only CPU0
was initially set up to receive interrupts.  Revert to that behavior.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
---
 kernel/irq/manage.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
David Miller - Oct. 24, 2008, 11:18 p.m.
From: Kumar Gala <galak@kernel.crashing.org>
Date: Fri, 24 Oct 2008 10:57:38 -0500

> Commit 18404756765c713a0be4eb1082920c04822ce588 introduced a regression
> on a subset of SMP based PPC systems whose interrupt controller only
> allow setting an irq to a single processor.  The previous behavior
> was only CPU0 was initially setup to get interrupts.  Revert back
> to that behavior.
> 
> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>

I really don't remember getting all of my interrupts only on cpu 0
on sparc64 before any of these changes.  I therefore find all of
this quite mysterious. :-)
Benjamin Herrenschmidt - Oct. 25, 2008, 9:33 p.m.
On Fri, 2008-10-24 at 16:18 -0700, David Miller wrote:
> From: Kumar Gala <galak@kernel.crashing.org>
> Date: Fri, 24 Oct 2008 10:57:38 -0500
> 
> > Commit 18404756765c713a0be4eb1082920c04822ce588 introduced a regression
> > on a subset of SMP based PPC systems whose interrupt controller only
> > allow setting an irq to a single processor.  The previous behavior
> > was only CPU0 was initially setup to get interrupts.  Revert back
> > to that behavior.
> > 
> > Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
> 
> I really don't remember getting all of my interrupts only on cpu 0
> on sparc64 before any of these changes.  I therefore find all of
> this quite mysterious. :-)

Well, I don't know how you do it but on powerpc, we explicitly fill the
affinity masks at boot time when we can spread interrupts... Maybe we
should change it the other way around and limit the mask when we can't?
It's hard to tell for sure at this stage.

Ben.
Kevin Diggs - Oct. 25, 2008, 10:53 p.m.
Benjamin Herrenschmidt wrote:
> On Fri, 2008-10-24 at 16:18 -0700, David Miller wrote:
> 
>>From: Kumar Gala <galak@kernel.crashing.org>
>>Date: Fri, 24 Oct 2008 10:57:38 -0500
>>
>>
>>>Commit 18404756765c713a0be4eb1082920c04822ce588 introduced a regression
>>>on a subset of SMP based PPC systems whose interrupt controller only
>>>allow setting an irq to a single processor.  The previous behavior
>>>was only CPU0 was initially setup to get interrupts.  Revert back
>>>to that behavior.
>>>
>>>Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
>>
>>I really don't remember getting all of my interrupts only on cpu 0
>>on sparc64 before any of these changes.  I therefore find all of
>>this quite mysterious. :-)
> 
> 
> Well, I don't know how you do it but on powerpc, we explicitly fill the
> affinity masks at boot time when we can spread interrupts... Maybe we
> should change it the other way around and limit the mask when we can't?
> It's hard to tell for sure at this stage.
> 
> Ben.
> 
What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
thing supposed to be able to spread irq between its cpus?

kevin
David Miller - Oct. 26, 2008, 4:04 a.m.
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sun, 26 Oct 2008 08:33:09 +1100

> Well, I don't know how you do it but on powerpc, we explicitly fill the
> affinity masks at boot time when we can spread interrupts... Maybe we
> should change it the other way around and limit the mask when we can't?
> It's hard to tell for sure at this stage.

On sparc64 we look at the cpu mask configured for the interrupt and do
one of two things:

1) If all bits are set, we round robin assign a cpu at IRQ enable time.

2) Else we pick the first bit set in the mask.
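
The two cases above can be sketched in plain C. This is only an illustration of the described policy, not the sparc64 code: the 32-bit mask, `NR_CPUS`, `next_rr_cpu` and `choose_cpu()` are all hypothetical names, and a real kernel would use `cpumask_t` and proper locking.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS  4u                        /* hypothetical CPU count */
#define ALL_CPUS ((1u << NR_CPUS) - 1u)    /* mask with every CPU bit set */

/* Hypothetical round-robin cursor; real code would need locking. */
static unsigned int next_rr_cpu;

/* Pick the CPU an IRQ should target, given its affinity mask. */
static unsigned int choose_cpu(uint32_t mask)
{
    unsigned int cpu;

    assert((mask & ALL_CPUS) != 0);        /* at least one CPU must be allowed */

    if ((mask & ALL_CPUS) == ALL_CPUS) {
        /* Case 1: all bits set, so hand out CPUs in round-robin order. */
        cpu = next_rr_cpu;
        next_rr_cpu = (next_rr_cpu + 1u) % NR_CPUS;
        return cpu;
    }

    /* Case 2: otherwise take the first (lowest) bit set in the mask. */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (mask & (1u << cpu))
            return cpu;

    return 0;                              /* not reached for a non-empty mask */
}
```

With this model, successive calls with a full mask walk through the CPUs in order, while a restricted mask such as 0x4 always yields CPU 2.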

One modification I want to make is to make case #1 NUMA aware.

But back to my original wonder, since I've always tipped off of this
generic IRQ layer cpu mask, when was it ever defaulting to zero
and causing the behavior you powerpc guys actually want? :-)
David Miller - Oct. 26, 2008, 4:05 a.m.
From: Kevin Diggs <kevdig@hypersurf.com>
Date: Sat, 25 Oct 2008 15:53:46 -0700

> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> thing supposed to be able to spread irq between its cpus?

Networking interrupts should lock onto a single CPU, unconditionally.
That's the optimal way to handle networking interrupts, especially
with multiqueue chips.

This is what the userland IRQ balance daemon does.
Benjamin Herrenschmidt - Oct. 26, 2008, 6:33 a.m.
On Sat, 2008-10-25 at 21:04 -0700, David Miller wrote:
> But back to my original wonder, since I've always tipped off of this
> generic IRQ layer cpu mask, when was it ever defaulting to zero
> and causing the behavior you powerpc guys actually want? :-)

Well, I'm not sure what Kumar wants. Most powerpc SMP setups actually
want to spread interrupts to all CPUs, and those who can't tend to just
not implement set_affinity... So Kumar must have a special case of MPIC
usage here on FSL platforms.

In any case, the platform limitations should be dealt with there or the
user could break it by manipulating affinity via /proc anyway.

But yeah, I do expect default affinity to be all CPUs and in fact, I even
have an -OLD- comment in the code that says

 	/* let the mpic know we want intrs. default affinity is 0xffffffff ...

Now, I've tried to track that down but it's hard because the generic code
seems to have changed in many ways around affinity handling...

So it looks like nowadays, the generic setup_irq() will call
irq_select_affinity() when an interrupt is first requested. Unless
you set CONFIG_AUTO_IRQ_AFFINITY and implement your own
irq_select_affinity(), you get the default one, which copies
the content of the global irq_default_affinity to the interrupt.

However it does that _after_ your IRQ startup() has been called
(yes, this is very fishy), and so after you did your irq_choose_cpu()...
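
A toy model of that ordering, as a sketch only: none of this is the kernel's code. `struct irq_model`, `model_startup()`, `model_select_affinity()` and `model_setup_irq()` are hypothetical stand-ins, the mask is a plain 32-bit word, and the model assumes the second step only records the mask without reprogramming the controller (the real code may well do more).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the global irq_default_affinity (CPU0 only here). */
static uint32_t default_affinity = 0x1;

struct irq_model {
    uint32_t affinity;      /* mask recorded by the generic layer */
    unsigned int routed;    /* CPU the controller was programmed with */
};

/* Stand-in for a controller startup hook: route to the lowest CPU in the mask. */
static void model_startup(struct irq_model *irq)
{
    unsigned int cpu = 0;

    assert(irq->affinity != 0);
    while (!(irq->affinity & (1u << cpu)))
        cpu++;
    irq->routed = cpu;
}

/* Stand-in for the default selection: copy the global default into the irq. */
static void model_select_affinity(struct irq_model *irq)
{
    irq->affinity = default_affinity;   /* assumption: hardware is not touched */
}

/* Order described above: startup runs first, default affinity is applied second. */
static void model_setup_irq(struct irq_model *irq)
{
    model_startup(irq);
    model_select_affinity(irq);
}
```

In this model, an interrupt that starts with mask 0x2 ends up routed to CPU 1 while its recorded mask says CPU 0, which is the kind of mismatch the ordering makes possible.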

This is all very messy, along with hooks for balancing and other confusing
stuff that I suspect keeps changing. I'll have to spend more time next
week to sort out what exactly is happening on powerpc and whether we
get our interrupts spread or not...

That's the downside of having more generic irq code I suppose: now people
keep rewriting half of the generic code with x86 exclusively in mind and
we have to be extra careful :-)

Cheers,
Ben.
Benjamin Herrenschmidt - Oct. 26, 2008, 6:48 a.m.
> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> thing supposed to be able to spread irq between its cpus?

Depends on the interrupt controller. I don't know that machine
but for example the Apple Dual G5's use an MPIC that can spread
based on an internal HW round robin scheme. This isn't always
the best idea tho for cache reasons... depends if and at what level
your caches are shared between CPUs.

Ben.
David Miller - Oct. 26, 2008, 7:16 a.m.
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sun, 26 Oct 2008 17:48:43 +1100

> 
> > What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> > thing supposed to be able to spread irq between its cpus?
> 
> Depends on the interrupt controller. I don't know that machine
> but for example the Apple Dual G5's use an MPIC that can spread
> based on an internal HW round robin scheme. This isn't always
> the best idea tho for cache reasons... depends if and at what level
> your caches are shared between CPUs.

It's always going to be the wrong thing to do for networking cards,
especially once we start doing RX flow separation in software.
Benjamin Herrenschmidt - Oct. 26, 2008, 8:29 a.m.
On Sun, 2008-10-26 at 00:16 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Sun, 26 Oct 2008 17:48:43 +1100
> 
> > 
> > > What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> > > thing supposed to be able to spread irq between its cpus?
> > 
> > Depends on the interrupt controller. I don't know that machine
> > but for example the Apple Dual G5's use an MPIC that can spread
> > based on an internal HW round robin scheme. This isn't always
> > the best idea tho for cache reasons... depends if and at what level
> > your caches are shared between CPUs.
> 
> It's always going to be the wrong thing to do for networking cards,
> especially once we start doing RX flow separation in software.

True, though I don't have policy in the kernel for that, i.e., it's pretty
much irqbalanced's job to do that. At this stage, the kernel always tries
to spread when it can... at least on powerpc.

Ben.
Kevin Diggs - Oct. 27, 2008, 2:30 a.m.
Benjamin Herrenschmidt wrote:
>>What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
>>thing supposed to be able to spread irq between its cpus?
> 
> 
> Depends on the interrupt controller. I don't know that machine
> but for example the Apple Dual G5's use an MPIC that can spread
> based on an internal HW round robin scheme. This isn't always
> the best idea tho for cache reasons... depends if and at what level
> your caches are shared between CPUs.
> 
> Ben.
> 
Sorry. I thought GigE was a common name for the machine. It is a dual
450 MHz G4 powermac with a gigabit ethernet and AGP. It now has a
PowerLogix dual 1.1 GHz 7455 in it. I think the L3 caches are
separate? Not sure about the original cpu card. Can the OS tell?

The reason I asked is that I seem to remember a config option that
would restrict the irqs to cpu 0? Help suggested it was needed for
certain PowerMacs. Didn't provide any help as to which ones. My GigE
currently spreads them between the two. I have not noticed any
additional holes in the space-time continuum.

kevin
Benjamin Herrenschmidt - Oct. 27, 2008, 2:49 a.m.
On Sun, 2008-10-26 at 18:30 -0800, Kevin Diggs wrote:
> The reason I asked is that I seem to remember a config option that
> would restrict the irqs to cpu 0? Help suggested it was needed for
> certain PowerMacs. Didn't provide any help as to which ones. My GigE
> currently spreads them between the two. I have not noticed any
> additional holes in the space-time continuum.

Yeah, a long time ago we had unexplained lockups when spreading
interrupts, hence the config option. I think it's all been fixed since
then.

Ben.
Kumar Gala - Oct. 27, 2008, 1:43 p.m.
On Oct 26, 2008, at 1:33 AM, Benjamin Herrenschmidt wrote:

> On Sat, 2008-10-25 at 21:04 -0700, David Miller wrote:
>> But back to my original wonder, since I've always tipped off of this
>> generic IRQ layer cpu mask, when was it ever defaulting to zero
>> and causing the behavior you powerpc guys actually want? :-)
>
> Well, I'm not sure what Kumar wants. Most powerpc SMP setups actually
> want to spread interrupts to all CPUs, and those who can't tend to just
> not implement set_affinity... So Kumar must have a special case of MPIC
> usage here on FSL platforms.
>
> In any case, the platform limitations should be dealt with there or the
> user could break it by manipulating affinity via /proc anyway.
>
> But yeah, I do expect default affinity to be all CPUs and in fact, I even
> have an -OLD- comment in the code that says
>
> 	/* let the mpic know we want intrs. default affinitya is  
> 0xffffffff ...

While we have the comment the code appears not to really follow it.   
We appear to write 1 << hard_smp_processor_id().

- k
Chris Friesen - Oct. 27, 2008, 5:36 p.m.
David Miller wrote:
> From: Kevin Diggs <kevdig@hypersurf.com>
> Date: Sat, 25 Oct 2008 15:53:46 -0700
> 
>> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
>> thing supposed to be able to spread irq between its cpus?
> 
> Networking interrupts should lock onto a single CPU, unconditionally.
> That's the optimal way to handle networking interrupts, especially
> with multiqueue chips.

What about something like the Cavium Octeon, where we have 16 cores but a 
single core isn't powerful enough to keep up with a gigE device?

Chris
David Miller - Oct. 27, 2008, 6:28 p.m.
From: "Chris Friesen" <cfriesen@nortel.com>
Date: Mon, 27 Oct 2008 11:36:21 -0600

> David Miller wrote:
> > From: Kevin Diggs <kevdig@hypersurf.com>
> > Date: Sat, 25 Oct 2008 15:53:46 -0700
> > 
> >> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> >> thing supposed to be able to spread irq between its cpus?
> > Networking interrupts should lock onto a single CPU, unconditionally.
> > That's the optimal way to handle networking interrupts, especially
> > with multiqueue chips.
> 
> What about something like the Cavium Octeon, where we have 16 cores but a single core isn't powerful enough to keep up with a gigE device?

Hello, we either have hardware that does flow separation and has multiple RX queues
going to multiple MSI-X interrupts or we do flow separation in software (work
in progress patches were posted for that about a month ago, maybe something final
will land in 2.6.29).

Just moving the interrupt around when not doing flow separation is as
suboptimal as you can possibly get.  You'll get out of order packet
processing within the same flow, TCP will retransmit when the
reordering gets deep enough, and then you're totally screwed
performance wise.
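
To illustrate why flow separation avoids that reordering, here is a minimal sketch of hashing a flow to a fixed CPU, so every packet of one flow is processed in order on the same CPU while different flows can land on different CPUs. It is an assumption-laden illustration, not the posted netdev patches or any driver's code: `struct flow_key`, `flow_hash()` and `flow_to_cpu()` are hypothetical, and the mixing function is deliberately simplistic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow identifier: the usual address/port tuple. */
struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Simplistic mixing function, for illustration only. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = k->saddr;

    h = h * 31u + k->daddr;
    h = h * 31u + (((uint32_t)k->sport << 16) | k->dport);
    h ^= h >> 16;
    return h;
}

/* Map a flow to one CPU; the same key always yields the same CPU. */
static unsigned int flow_to_cpu(const struct flow_key *k, unsigned int ncpus)
{
    assert(ncpus > 0);
    return flow_hash(k) % ncpus;
}
```

The point of the design is that the mapping depends only on the flow key, never on which CPU happened to take the interrupt, so packets within a flow cannot overtake each other.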
Chris Friesen - Oct. 27, 2008, 7:10 p.m.
David Miller wrote:
> From: "Chris Friesen" <cfriesen@nortel.com>

>> What about something like the Cavium Octeon, where we have 16 cores but a
>> single core isn't powerful enough to keep up with a gigE device?
> 
> Hello, we either have hardware that does flow separation and has multiple
> RX queues going to multiple MSI-X interrupts or we do flow separation in
> software (work in progress patches were posted for that about a month ago,
> maybe something final will land in 2.6.29).

Are there any plans for a mechanism to allow the kernel to figure out (or be 
told) what packets cpu-affined tasks are interested in and route the 
interrupts appropriately?

> Just moving the interrupt around when not doing flow separation is as
> suboptimal as you can possibly get.  You'll get out of order packet 
> processing within the same flow, TCP will retransmit when the reordering
> gets deep enough, and then you're totally screwed performance wise.

Ideally I agree with you.  In this particular case however the hardware is 
capable of doing flow separation, but the vendor driver doesn't support it 
(and isn't in mainline).  Packet rates are high enough that a single core 
cannot keep up, but are low enough that they can be handled by multiple cores 
without reordering if interrupt mitigation is not used.

It's not an ideal situation, but we're sort of stuck unless we do custom 
driver work.

Chris
David Miller - Oct. 27, 2008, 7:25 p.m.
From: "Chris Friesen" <cfriesen@nortel.com>
Date: Mon, 27 Oct 2008 13:10:55 -0600

> David Miller wrote:
> > From: "Chris Friesen" <cfriesen@nortel.com>
> 
> > Hello, we either have hardware that does flow separation and has multiple
> > RX queues going to multiple MSI-X interrupts or we do flow separation in
> > software (work in progress patches were posted for that about a month ago,
> > maybe something final will land in 2.6.29).
>
> Are there any plans for a mechanism to allow the kernel to figure
> out (or be told) what packets cpu-affined tasks are interested in
> and route the interrupts appropriately?

No, not at all.

Now there are plans to allow the user to add classification rules into
the chip for specific flows, on hardware that supports this, via ethtool.

> Ideally I agree with you.  In this particular case however the
> hardware is capable of doing flow separation, but the vendor driver
> doesn't support it (and isn't in mainline).  Packet rates are high
> enough that a single core cannot keep up, but are low enough that
> they can be handled by multiple cores without reordering if
> interrupt mitigation is not used.

Your driver is weak and doesn't support the hardware correctly, and you
want to put the onus on everyone else with sane hardware and drivers?

> It's not an ideal situation, but we're sort of stuck unless we do
> custom driver work.

Wouldn't want you to get your hands dirty or anything like that now,
would we?  :-)))
Kumar Gala - Oct. 27, 2008, 7:43 p.m.
On Oct 27, 2008, at 1:28 PM, David Miller wrote:

> From: "Chris Friesen" <cfriesen@nortel.com>
> Date: Mon, 27 Oct 2008 11:36:21 -0600
>
>> David Miller wrote:
>>> From: Kevin Diggs <kevdig@hypersurf.com>
>>> Date: Sat, 25 Oct 2008 15:53:46 -0700
>>>
>>>> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
>>>> thing supposed to be able to spread irq between its cpus?
>>> Networking interrupts should lock onto a single CPU, unconditionally.
>>> That's the optimal way to handle networking interrupts, especially
>>> with multiqueue chips.
>>
>> What about something like the Cavium Octeon, where we have 16 cores but a
>> single core isn't powerful enough to keep up with a gigE device?
>
> Hello, we either have hardware that does flow separation and has multiple
> RX queues going to multiple MSI-X interrupts or we do flow separation in
> software (work in progress patches were posted for that about a month ago,
> maybe something final will land in 2.6.29).
>
> Just moving the interrupt around when not doing flow separation is as
> suboptimal as you can possibly get.  You'll get out of order packet
> processing within the same flow, TCP will retransmit when the
> reordering gets deep enough, and then you're totally screwed
> performance wise.

I haven't been following the netdev patches, but what about HW that  
does flow separation w/o multiple interrupts?

We (Freescale) are working on such a device:

http://www.freescale.com/webapp/sps/site/prod_summary.jsp?fastpreview=1&code=P4080

- k
David Miller - Oct. 27, 2008, 7:49 p.m.
From: Kumar Gala <galak@kernel.crashing.org>
Date: Mon, 27 Oct 2008 14:43:29 -0500

> I haven't been following the netdev patches, but what about HW that does flow separation w/o multiple interrupts?
> 
> We (Freescale) are working on such a device:
> 
> http://www.freescale.com/webapp/sps/site/prod_summary.jsp?fastpreview=1&code=P4080

It could probably tie into the software-based flow separation support.
Benjamin Herrenschmidt - Oct. 27, 2008, 8:27 p.m.
On Mon, 2008-10-27 at 08:43 -0500, Kumar Gala wrote:
> 
> While we have the comment the code appears not to really follow it.   
> We appear to write 1 << hard_smp_processor_id().

That code is called by each CPU that gets onlined and ORs its
bit into the mask.

Ben.
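
A minimal sketch of that pattern, for illustration only: each CPU contributes `1 << id` as it comes online, so the accumulated mask ends up covering every online CPU even though each individual write is a single bit. `irq_dest_mask` and `cpu_online_hook()` are hypothetical names and do not correspond to the MPIC driver's actual registers or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical destination mask, standing in for a controller register. */
static uint32_t irq_dest_mask;

/* Called once by each CPU as it comes online; ORs that CPU's bit in. */
static void cpu_online_hook(unsigned int hw_cpu_id)
{
    assert(hw_cpu_id < 32);                 /* mask is 32 bits wide here */
    irq_dest_mask |= (uint32_t)1 << hw_cpu_id;
}
```

After CPUs 0 and 1 have both run the hook the mask is 0x3, and running it again on an already-online CPU changes nothing.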
Kumar Gala - Oct. 27, 2008, 8:45 p.m.
On Oct 27, 2008, at 3:27 PM, Benjamin Herrenschmidt wrote:

> On Mon, 2008-10-27 at 08:43 -0500, Kumar Gala wrote:
>>
>> While we have the comment the code appears not to really follow it.
>> We appear to write 1 << hard_smp_processor_id().
>
> That code is called by each CPU that gets onlined and ORs its
> bit into the mask.

ahh, I see now.

- k
Kumar Gala - Oct. 27, 2008, 8:46 p.m.
On Oct 27, 2008, at 2:49 PM, David Miller wrote:

> From: Kumar Gala <galak@kernel.crashing.org>
> Date: Mon, 27 Oct 2008 14:43:29 -0500
>
>> I haven't been following the netdev patches, but what about HW that  
>> does flow separation w/o multiple interrupts?
>>
>> We (Freescale) are working on such a device:
>>
>> http://www.freescale.com/webapp/sps/site/prod_summary.jsp?fastpreview=1&code=P4080
>
> It could probably tie into the software-based flow separation support.

Will have to look at the code... we might be able to fit in the HW irq
scheme.  We effectively have a way of getting a per-cpu interrupt for
the flow.

- k
Chris Friesen - Oct. 28, 2008, 3:46 a.m.
David Miller wrote:
> From: "Chris Friesen" <cfriesen@nortel.com>

>>Are there any plans for a mechanism to allow the kernel to figure
>>out (or be told) what packets cpu-affined tasks are interested in
>>and route the interrupts appropriately?
> 
> No, not at all.

> Now there are plans to allow the user to add classification rules into
> the chip for specific flows, on hardware that supports this, via ethtool.

Okay, that sounds reasonable.  Good to know where you're planning on going.

> Your driver is weak and doesn't support the hardware correctly, and you
> want to put the onus on everyone else with sane hardware and drivers?

I'm not expecting any action...I was just objecting somewhat to 
"Networking interrupts should lock onto a single CPU, unconditionally." 
  Add "for a particular flow" into that and I wouldn't have said anything.

>>It's not an ideal situation, but we're sort of stuck unless we do
>>custom driver work.

> Wouldn't want you to get your hands dirty or anything like that now,
> would we?  :-)))

I'd love to.  But other things take time too, so we live with it for now.

Chris

Patch

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index c498a1b..728d36a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -17,7 +17,7 @@ 
 
 #ifdef CONFIG_SMP
 
-cpumask_t irq_default_affinity = CPU_MASK_ALL;
+cpumask_t irq_default_affinity = CPU_MASK_CPU0;
 
 /**
  *	synchronize_irq - wait for pending IRQ handlers (on other CPUs)