
[6/7,RFC] enable early TLBs for BG/P

Message ID 1305753895-24845-6-git-send-email-ericvh@gmail.com (mailing list archive)
State Changes Requested

Commit Message

Eric Van Hensbergen May 18, 2011, 9:24 p.m. UTC
BG/P maps firmware with an early TLB

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
---
 arch/powerpc/include/asm/mmu-44x.h |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

Comments

Benjamin Herrenschmidt May 20, 2011, 12:39 a.m. UTC | #1
On Wed, 2011-05-18 at 16:24 -0500, Eric Van Hensbergen wrote:
> BG/P maps firmware with an early TLB

That's a bit gross. How often do you call that firmware in practice?
Aren't you better off inserting a TLB entry for it at the point where
you call it? A simple tlbsx. + tlbwe sequence would do. That would free
up a TLB entry for normal use.

Cheers,
Ben.
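
The on-demand mapping suggested above can be sketched as a small
user-space model. This is only an illustration under stated assumptions,
not kernel code: tlb_search() and tlb_write() stand in for the tlbsx. and
tlbwe instructions, and the addresses, the 4 KiB page size, the scratch
slot and every function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64            /* 64 software-loaded entries, as noted in the thread */
#define PAGE_SHIFT  12            /* assumption: 4 KiB pages, for the model only */
#define FW_EA       0xfff00000u   /* made-up firmware effective address */
#define FW_PA       0x7ff00000u   /* made-up firmware physical address */

struct tlb_entry {
    bool     valid;
    uint32_t epn;                 /* effective page number */
    uint32_t rpn;                 /* real page number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Models tlbsx.: index of the entry that translates ea, or -1 on a miss. */
int tlb_search(uint32_t ea)
{
    uint32_t epn = ea >> PAGE_SHIFT;

    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].epn == epn)
            return i;
    return -1;
}

/* Models tlbwe: overwrite one TLB slot. */
void tlb_write(int slot, struct tlb_entry e)
{
    assert(slot >= 0 && slot < TLB_ENTRIES);
    tlb[slot] = e;
}

/* Stand-in for a firmware entry point: returns 0 only if its own
 * mapping is present while it runs. */
int fake_firmware_call(void)
{
    return tlb_search(FW_EA) >= 0 ? 0 : -1;
}

/* Map the firmware page only for the duration of the call. Whatever
 * was in scratch_slot is lost; with a software-loaded TLB the normal
 * miss handler would simply reload it later. */
int call_firmware_mapped(uint32_t fw_ea, uint32_t fw_pa,
                         int scratch_slot, int (*fw_fn)(void))
{
    bool temporary = false;
    int ret;

    if (tlb_search(fw_ea) < 0) {
        struct tlb_entry e = { true, fw_ea >> PAGE_SHIFT, fw_pa >> PAGE_SHIFT };

        tlb_write(scratch_slot, e);
        temporary = true;
    }

    ret = fw_fn();

    if (temporary) {
        struct tlb_entry none = { false, 0, 0 };

        tlb_write(scratch_slot, none);   /* give the slot back */
    }
    return ret;
}
```

A caller would use call_firmware_mapped(FW_EA, FW_PA, 63, fake_firmware_call).
The trade-off is a search plus one or two TLB writes per firmware call, in
exchange for not pinning an entry permanently.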

> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
> ---
>  arch/powerpc/include/asm/mmu-44x.h |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu-44x.h b/arch/powerpc/include/asm/mmu-44x.h
> index ca1b90c..2807d6e 100644
> --- a/arch/powerpc/include/asm/mmu-44x.h
> +++ b/arch/powerpc/include/asm/mmu-44x.h
> @@ -115,8 +115,12 @@ typedef struct {
>  #endif /* !__ASSEMBLY__ */
>  
>  #ifndef CONFIG_PPC_EARLY_DEBUG_44x
> +#ifndef CONFIG_BGP
>  #define PPC44x_EARLY_TLBS	1
> -#else
> +#else /* CONFIG_BGP */
> +#define PPC44x_EARLY_TLBS	2
> +#endif /* CONFIG_BGP */
> +#else /* CONFIG_PPC_EARLY_DEBUG_44x */
>  #define PPC44x_EARLY_TLBS	2
>  #define PPC44x_EARLY_DEBUG_VIRTADDR	(ASM_CONST(0xf0000000) \
>  	| (ASM_CONST(CONFIG_PPC_EARLY_DEBUG_44x_PHYSLOW) & 0xffff))
Eric Van Hensbergen May 20, 2011, 1:21 a.m. UTC | #2
On Thu, May 19, 2011 at 7:39 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2011-05-18 at 16:24 -0500, Eric Van Hensbergen wrote:
>> BG/P maps firmware with an early TLB
>
> That's a bit gross. How often do you call that firmware in practice ?
> Aren't you better off instead inserting a TLB entry for it when you call
> it instead ? A simple tlbsx. + tlbwe sequence would do. That would free
> up a TLB entry for normal use.
>

Well, it depends on who you talk to.  The production software BG/P guys
use the firmware constantly; it's the primary interface to the networks,
the console, and the management software which runs the machine.  As
such, the IO Node guys, the Compute Node Kernel guys and the ZeptoOS
guys use it quite a bit.  The kittyhawk guys, on the other hand, barely
use it at all; in fact I believe they do all the interaction with it
during u-boot and then shut it off.

IIRC, the sticky question is RAS support: there are certain things it
wants to jump to firmware to deal with, and it expects things to be
mapped and pinned into memory.  Furthermore, I think it may make
assumptions about where in the TLB the mappings are.  Since the
kittyhawk guys obviously ignore this by shutting it down, it's not clear
just how important this is.  I'm game to try the dynamic mapping as you
suggest if you would prefer it.

It's worth mentioning that I believe with BG/Q, the plan is to rely on
the firmware even more extensively, but I haven't looked at any of the
code yet to verify whether or not this is true.

     -eric
Benjamin Herrenschmidt May 20, 2011, 1:54 a.m. UTC | #3
On Thu, 2011-05-19 at 20:21 -0500, Eric Van Hensbergen wrote:
> On Thu, May 19, 2011 at 7:39 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Wed, 2011-05-18 at 16:24 -0500, Eric Van Hensbergen wrote:
> >> BG/P maps firmware with an early TLB
> >
> > That's a bit gross. How often do you call that firmware in practice ?
> > Aren't you better off instead inserting a TLB entry for it when you call
> > it instead ? A simple tlbsx. + tlbwe sequence would do. That would free
> > up a TLB entry for normal use.
> >
> 
> Well, it depends on who you talk to.  The production software BG/P
> guys use the firmware constantly, its the primary interface to the networks, the console,
> and the management software which runs the machine.

Yuck.

> As such the IO Node guys, the Compute Node Kernel guys and the
> ZeptoOS guys use it quite a bit.  The kittyhawk guys on the other hand
> barely use it at all, in fact I believe they do all the interaction with
> it during uboot and then shut it off.

I would prefer that approach.

> IIRC, the sticky question is RAS support, there are certain things it
> wants to jump to firmware to deal with and expects things to be mapped
> an pinned into memory.
>
> Furthermore, I think it may make assumptions about where in the TLB the
> mappings are.  

This is gross, especially on a system with only 64 SW loaded TLB
entries :-(

> Since the kittyhawk guys
> obviously ignore this by shutting it down, its not clear just how
> important this is.  I'm game to
> try the dynamic mapping as you suggest if you would prefer it.

I would, yes. We can sort things out later for RAS.

> Its worth mentioning that I believe with BG/Q, the plan is to rely on
> the firmware even more extensively, but I haven't looked at any of the code yet to verify
> whether or not this is true.

This is tantamount to linking a binary blob with the kernel ... it's a
fine line. At some point we might refuse the patches if they go too far
in that direction.

Cheers,
Ben.

>      -eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
Kazutomo Yoshii May 20, 2011, 3:38 a.m. UTC | #4
On 05/19/2011 08:54 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2011-05-19 at 20:21 -0500, Eric Van Hensbergen wrote:
>    
>> On Thu, May 19, 2011 at 7:39 PM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org>  wrote:
>>      
>>> On Wed, 2011-05-18 at 16:24 -0500, Eric Van Hensbergen wrote:
>>>        
>>>> BG/P maps firmware with an early TLB
>>>>          
>>> That's a bit gross. How often do you call that firmware in practice ?
>>> Aren't you better off instead inserting a TLB entry for it when you call
>>> it instead ? A simple tlbsx. + tlbwe sequence would do. That would free
>>> up a TLB entry for normal use.
>>>
>>>        
>> Well, it depends on who you talk to.  The production software BG/P
>> guys use the firmware constantly, its the primary interface to the networks, the console,
>> and the management software which runs the machine.
>>      
> Yuck.
>    

Unfortunately, the firmware is also required:
- to configure the Blue Gene Interrupt Controller (BIC)
- to configure the Torus DMA unit, e.g. the FIFOs
- to configure the global interrupt (even if we don't use it, we need
  to disable some channels correctly)
- to access node personality information (node id, DDR size, HZ, etc.),
  or maybe we can directly access SRAM?
etc, etc.

>> As such the IO Node guys, the Compute Node Kernel guys and the
>> ZeptoOS guys use it quite a bit.  The kittyhawk guys on the other hand
>> barely use it at all, in fact I believe they do all the interaction with
>> it during uboot and then shut it off.
>>      
>    
(I'm one of the ZeptoOS guys, btw)

For regular ppc Linux usage, our firmware dependency is minimal as well.
However, with our HPC extension, the firmware functions are called when
it configures BGP-specific network hardware.

We are not planning to submit our HPC extension here anytime soon
because our work is very special-purpose and includes lots of dirty
hacks right now.

Thanks,
Kaz
> I would prefer that approach.
>
>    
>> IIRC, the sticky question is RAS support, there are certain things it
>> wants to jump to firmware to deal with and expects things to be mapped
>> an pinned into memory.
>>
>> Furthermore, I think it may make assumptions about where in the TLB the
>> mappings are.
>>      
> This is gross, especially on a system with only 64 SW loaded TLB
> entries :-(
>
>    
>> Since the kittyhawk guys
>> obviously ignore this by shutting it down, its not clear just how
>> important this is.  I'm game to
>> try the dynamic mapping as you suggest if you would prefer it.
>>      
> I would yes, we can sort things out later for RAS.
>
>    
>> Its worth mentioning that I believe with BG/Q, the plan is to rely on
>> the firmware even more extensively, but I haven't looked at any of the code yet to verify
>> whether or not this is true.
>>      
> This is tantamount to linking a binary blob with the kernel ... it's a
> fine line. At some point we might refuse the patches if they go too far
> in that direction.
>
> Cheers,
> Ben.
>
>    
>>       -eric
>
> _______________________________________________
> bg-linux mailing list
> bg-linux@lists.anl-external.org
> https://lists.anl-external.org/mailman/listinfo/bg-linux
> http://bg-linux.anl-external.org/wiki
>
Benjamin Herrenschmidt May 20, 2011, 3:52 a.m. UTC | #5
> Unfortunately, the firmware is also required:
> - to configure Blue Gene Interrupt Controller(BIC)

Can't we just write bare metal code for that ?

> - to configure Torus DMA unit. e.g. fifo

Same

> - to configure global interrupt (even we don't use, we need to disable 
> some channel correctly)

Same

> - to access node personality information (node id, DDR size, HZ, etc) or 
> maybe we can directly access SRAM?

That should be turned into device-tree at boot, possibly from a
bootloader or from the zImage wrapper.

> etc, etc.
> 
> >> As such the IO Node guys, the Compute Node Kernel guys and the
> >> ZeptoOS guys use it quite a bit.  The kittyhawk guys on the other hand
> >> barely use it at all, in fact I believe they do all the interaction with
> >> it during uboot and then shut it off.
> >>      
> >    
> (I'm one of the ZeptoOS guys, btw)

Heh ok.

> As a regular ppc linux usage, our firmware dependency is minimum as well.
> However,  with our HPC extension, the firmware functions are called when
> it configures BGP specific network hardware.
> 
> We are not planning to submit our HPC extension here anytime soon
> because our work is very special purpose and includes lots of dirty hack 
> right now.

Ok.

Cheers,
Ben.

> Thanks,
> Kaz
> > I would prefer that approach.
> >
> >    
> >> IIRC, the sticky question is RAS support, there are certain things it
> >> wants to jump to firmware to deal with and expects things to be mapped
> >> an pinned into memory.
> >>
> >> Furthermore, I think it may make assumptions about where in the TLB the
> >> mappings are.
> >>      
> > This is gross, especially on a system with only 64 SW loaded TLB
> > entries :-(
> >
> >    
> >> Since the kittyhawk guys
> >> obviously ignore this by shutting it down, its not clear just how
> >> important this is.  I'm game to
> >> try the dynamic mapping as you suggest if you would prefer it.
> >>      
> > I would yes, we can sort things out later for RAS.
> >
> >    
> >> Its worth mentioning that I believe with BG/Q, the plan is to rely on
> >> the firmware even more extensively, but I haven't looked at any of the code yet to verify
> >> whether or not this is true.
> >>      
> > This is tantamount to linking a binary blob with the kernel ... it's a
> > fine line. At some point we might refuse the patches if they go too far
> > in that direction.
> >
> > Cheers,
> > Ben.
> >
> >    
> >>       -eric
Eric Van Hensbergen May 20, 2011, 1:01 p.m. UTC | #6
On Thu, May 19, 2011 at 10:52 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>> Unfortunately, the firmware is also required:
>> - to configure Blue Gene Interrupt Controller(BIC)
>> - to configure Torus DMA unit. e.g. fifo
>> - to configure global interrupt (even we don't use, we need to disable
>> some channel correctly)
>
> Can't we just write bare metal code for that ?
>

The kittyhawk code has the bare-metal equivalents for all of these.
When I get to the drivers, I'll favor the kittyhawk versions for
submission and then we'll see if it would be possible to adapt the HPC
extensions to use the bare-metal versions of the drivers versus the
firmware interface.

>> - to access node personality information (node id, DDR size, HZ, etc) or
>> maybe we can directly access SRAM?
>
> That should be turned into device-tree at boot, possibly from a
> bootloader or from the zImage wrapper.
>

This is the approach used by the kittyhawk u-boot setup.
However, it would be just as easy to construct an in-memory
device-tree within Linux by mapping the personality page and copying
the relevant bits out.  This has the advantage of being able to boot
Linux directly on the nodes without an intermediary boot loader (which
kittyhawk uses just to allow us to customize which kernel boots on a
node-to-node basis, whereas the stock system boots the same kernel on
all the nodes within a partition allocation (64-40,000 nodes)).

         -eric
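
Copying the relevant bits out of the personality page could look
roughly like the sketch below. It is a guess at the shape only: the
struct layout, its field names and the "bgp,node-id" property are
hypothetical (the thread names only node id, DDR size and HZ), and it
emits device-tree source text rather than a flattened blob to keep the
example short.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical subset of the personality data; the real layout is not
 * given in this thread. */
struct bgp_personality_sketch {
    uint32_t node_id;
    uint32_t ddr_size_mb;
    uint32_t clock_mhz;
};

/* Render the fields as device-tree source text into buf.
 * Returns what snprintf() returns: the length the full text needs,
 * which is >= len if the output was truncated. */
int personality_to_dts(const struct bgp_personality_sketch *p,
                       char *buf, size_t len)
{
    /* DDR size in bytes, split into two 32-bit cells (#size-cells = 2). */
    uint64_t mem_bytes = (uint64_t)p->ddr_size_mb << 20;
    unsigned int size_hi = (unsigned int)(mem_bytes >> 32);
    unsigned int size_lo = (unsigned int)(mem_bytes & 0xffffffffu);

    return snprintf(buf, len,
                    "/ {\n"
                    "\t#address-cells = <2>;\n"
                    "\t#size-cells = <2>;\n"
                    "\tbgp,node-id = <%u>;\n"
                    "\tclock-frequency = <%u>;\n"
                    "\tmemory {\n"
                    "\t\tdevice_type = \"memory\";\n"
                    "\t\treg = <0x0 0x0 0x%x 0x%x>;\n"
                    "\t};\n"
                    "};\n",
                    (unsigned int)p->node_id,
                    (unsigned int)p->clock_mhz * 1000000u,
                    size_hi, size_lo);
}
```

The same function could run in a boot loader, in a boot wrapper or in
early kernel code; only the place where the personality data is read
would differ.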
Benjamin Herrenschmidt May 20, 2011, 10:20 p.m. UTC | #7
On Fri, 2011-05-20 at 08:01 -0500, Eric Van Hensbergen wrote:
> On Thu, May 19, 2011 at 10:52 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >> Unfortunately, the firmware is also required:
> >> - to configure Blue Gene Interrupt Controller(BIC)
> >> - to configure Torus DMA unit. e.g. fifo
> >> - to configure global interrupt (even we don't use, we need to disable
> >> some channel correctly)
> >
> > Can't we just write bare metal code for that ?
> >
> 
> The kittyhawk code has the bare-metal equivalents for all of these.
> When I get to the drivers, I'll favor the kittyhawk versions for
> submission and then we'll see if it would be possible to adapt the HPC
> extensions to use the bare-metal versions of the drivers versus the
> firmware interface.

Ok. We can also start with using the FW and then migrate to bare metal.

> >> - to access node personality information (node id, DDR size, HZ, etc) or
> >> maybe we can directly access SRAM?
> >
> > That should be turned into device-tree at boot, possibly from a
> > bootloader or from the zImage wrapper.
> >
> 
> This is the approach is used by the kittyhawk u-boot approach.
> However, it would also be just as easy to construct an in-memory
> device-tree within Linux by mapping the personality page and copying
> the relevant bits out.  This has the advantage of being able to boot
> Linux directly on the nodes without an intermediary boot loader (which
> kittyhawk uses just to allow us customize which kernel boots on a
> node-to-node basis whereas the stock system boots the same kernel on
> all the nodes within a partition allocation (64-40,000 nodes)).

We can do that from the zImage wrapper... that would be nicer than doing
it from the kernel itself, unless there are good reasons to do it there,
as with iSeries.

Cheers,
Ben.

Patch

diff --git a/arch/powerpc/include/asm/mmu-44x.h b/arch/powerpc/include/asm/mmu-44x.h
index ca1b90c..2807d6e 100644
--- a/arch/powerpc/include/asm/mmu-44x.h
+++ b/arch/powerpc/include/asm/mmu-44x.h
@@ -115,8 +115,12 @@  typedef struct {
 #endif /* !__ASSEMBLY__ */
 
 #ifndef CONFIG_PPC_EARLY_DEBUG_44x
+#ifndef CONFIG_BGP
 #define PPC44x_EARLY_TLBS	1
-#else
+#else /* CONFIG_BGP */
+#define PPC44x_EARLY_TLBS	2
+#endif /* CONFIG_BGP */
+#else /* CONFIG_PPC_EARLY_DEBUG_44x */
 #define PPC44x_EARLY_TLBS	2
 #define PPC44x_EARLY_DEBUG_VIRTADDR	(ASM_CONST(0xf0000000) \
 	| (ASM_CONST(CONFIG_PPC_EARLY_DEBUG_44x_PHYSLOW) & 0xffff))
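
The nested conditionals in the patch resolve PPC44x_EARLY_TLBS to 2
whenever either CONFIG_PPC_EARLY_DEBUG_44x or CONFIG_BGP is defined,
and to 1 otherwise. A flatter spelling of the same truth table is
sketched below as a standalone file, not as a proposed replacement
for the hunk. The early_tlbs() helper is hypothetical and exists only
so the four combinations can be checked at run time.

```c
/* Define either symbol (or pass -DCONFIG_BGP on the command line)
 * to see the value change. Both are left undefined here. */
/* #define CONFIG_BGP 1 */
/* #define CONFIG_PPC_EARLY_DEBUG_44x 1 */

#if defined(CONFIG_PPC_EARLY_DEBUG_44x) || defined(CONFIG_BGP)
#define PPC44x_EARLY_TLBS 2   /* one extra early entry: debug UART or firmware */
#else
#define PPC44x_EARLY_TLBS 1
#endif

/* Run-time mirror of the preprocessor logic above. */
int early_tlbs(int early_debug, int bgp)
{
    return (early_debug || bgp) ? 2 : 1;
}
```

In the real header the debug-only definitions, such as
PPC44x_EARLY_DEBUG_VIRTADDR, would still need their own
CONFIG_PPC_EARLY_DEBUG_44x guard. Note also that with both options
enabled the count stays at 2, so the debug mapping and the firmware
mapping would compete for the same extra entry.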