[v1,0/4] Context domains

Message ID 20170725134515.GI26577@zareason
State RFC
Delegated to: David Miller

Commit Message

Bob Picco July 25, 2017, 1:45 p.m. UTC
David Miller wrote:	[Thu Jul 20 2017, 11:06:55PM EDT]
> From: David Miller <davem@davemloft.net>
> Date: Fri, 21 Jul 2017 03:50:05 +0100 (WEST)
> 
> > Having to allocate a full trap frame just to TLB flush one page or an
> > MM is a serious regression.
> > 
> > Next, allocating a whole new data structure and clearing it out on
> > every new address creation is going to be a significant new cost as
> > well.
None of this work has had any performance analysis yet; we rushed to post it
because of forthcoming job reassignments.
> 
> So, just thinking out loud:
> 
> 1) You can retain the cross call TLB flush assembler by passing in the
>    appropriate context value for each individual cpu from the cross
>    call dispatcher.
Agreed and reinstated.
> 
> 2) If you have some constant bounds on the upper number of context
>    domains, you can simply inline them into the existing mmu_context
>    structure.  This avoids the memory allocation per mm creation.
Yes, that is the direction I was moving toward on Friday (7/14). I have
something similar on top of what Pasha posted; it still has to be debugged
and tested. Pasha made some nice identifier renames and cleanups before
posting.
> 
> You can also make the context domain salting extremely cheap.
> Perhaps something like "(cpuid>>x) & y".
> 
> No, you won't map cores to context domains so precisely like the
> code does now, but you will make up for it in code simplicity and
> overall new costs added by these changes for the more common
> cases.
> 
> I suggest "(cpuid>>x) & y" and a very small number of context domains
> (which determines 'y') because we don't need something perfect, we
> need something which divides the problem by some order of magnitude.
> 
> The hash of locks caught my eye as well.  I think you don't need that
> and we really steer clear of hashed spinlock tables in the Linux
> kernel because they never scale properly.
Agreed, and witnessed during experimentation (7/7 - 7/11) with the
CTX_NR_BITS reduction.
> 
> Instead, I think you can use something like RCU to provide the
> necessary synchronization.  So you could first make sure X isn't
> referenced on the local cpu any more, and then do call_rcu() to do the
> actual clearing of the bitmap which allows X to be allocated again.
Making the context id array a compile-time member of mm_context_t eliminates
the need for the hashed spin lock. That elimination, plus the code
simplicity, is why I changed direction on 7/14. I am not sure we will need
anything RCU-like. This should also make mm_cd_destroy() more scalable, with
fewer context domains to inspect and, possibly, context ids to release. The
context id release aspect will decrease wrap events.
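
To make this concrete, here is a rough sketch, not the posted code, of how
the compile-time per-domain array might be looked up; cds, NR_CD and
CPU_TO_CD come from the snippet in the Patch section below, while
get_mm_ctx() is a made-up helper shown purely for illustration:

/* Sketch only: cds[] is the compile-time per-domain context id array from
 * the snippet below; get_mm_ctx() is a hypothetical helper used just to
 * illustrate the lookup.
 */
static inline unsigned long get_mm_ctx(mm_context_t *ctx, int cpu)
{
	/* The cheap "(cpuid>>x) & y" salting picks the domain; no hashed
	 * lock is needed because the array lives in the mm itself.
	 */
	return ctx->cds[CPU_TO_CD(cpu)];
}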

It is difficult to satisfy all topologies; to me it seems all but
impossible.

This is a snippet of the patch series (the diff is shown in the Patch
section below). I just booted it on my t5-2. We may want to reduce
CPU_TO_CD_SHIFT to, say, 7; I will have to do that just for testing on my
t5-2, since the current value of 8 will likely leave Debian in a single
context domain. Please allow a couple more days for testing.
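
For concreteness, a quick userspace illustration of the shift choice,
assuming 256 cpuids (0-255), roughly what a two-socket t5-2 exposes; the
two-argument CPU_TO_CD() below is only a local variant for this example:

#include <stdio.h>

/* Userspace illustration of the context domain salting; not kernel code. */
#define CD_MASK			0x0fu
#define CPU_TO_CD(cpuid, shift)	(((cpuid) >> (shift)) & CD_MASK)

int main(void)
{
	/* Shift 8: cpuids 0..255 all land in domain 0 (a single domain). */
	printf("shift 8: cpu 0 -> %u, cpu 255 -> %u\n",
	       CPU_TO_CD(0u, 8), CPU_TO_CD(255u, 8));

	/* Shift 7: cpuids split across domains 0 and 1, better for testing. */
	printf("shift 7: cpu 127 -> %u, cpu 128 -> %u\n",
	       CPU_TO_CD(127u, 7), CPU_TO_CD(128u, 7));

	return 0;
}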

To summarize: cross calls are reinstated, mm_cd_destroy() will now release
valid context ids (mitigating wrap), and mm.context.cds is a compile-time
array.
> 
> Just some ideas...
thanx,

bob

Comments

Pavel Tatashin July 25, 2017, 2:09 p.m. UTC | #1
On 07/25/2017 09:45 AM, Bob Picco wrote:
> David Miller wrote:	[Thu Jul 20 2017, 11:06:55PM EDT]
>> From: David Miller <davem@davemloft.net>
>> Date: Fri, 21 Jul 2017 03:50:05 +0100 (WEST)
>>
>>> Having to allocate a full trap frame just to TLB flush one page or an
>>> MM is a serious regression.
>>>
>>> Next, allocating a whole new data structure and clearing it out on
>>> every new address creation is going to be a significant new cost as
>>> well.
> None of this work has had any performance analysis yet; we rushed to post
> it because of forthcoming job reassignments.
>>
>> So, just thinking out loud:
>>
>> 1) You can retain the cross call TLB flush assembler by passing in the
>>     appropriate context value for each individual cpu from the cross
>>     call dispatcher.
> Agreed and reinstated.
>>
>> 2) If you have some constant bounds on the upper number of context
>>     domains, you can simply inline them into the existing mmu_context
>>     structure.  This avoids the memory allocation per mm creation.
> Yes, that is the direction I was moving toward on Friday (7/14). I have
> something similar on top of what Pasha posted; it still has to be debugged
> and tested. Pasha made some nice identifier renames and cleanups before
> posting.
>>
>> You can also make the context domain salting extremely cheap.
>> Perhaps something like "(cpuid>>x) & y".
>>
>> No, you won't map cores to context domains so precisely like the
>> code does now, but you will make up for it in code simplicity and
>> overall new costs added by these changes for the more common
>> cases.
>>
>> I suggest "(cpuid>>x) & y" and a very small number of context domains
>> (which determines 'y') because we don't need something perfect, we
>> need something which divides the problem by some order of magnitude.
>>
>> The hash of locks caught my eye as well.  I think you don't need that
>> and we really steer clear of hashed spinlock tables in the Linux
>> kernel because they never scale properly.
> Agreed, and witnessed during experimentation (7/7 - 7/11) with the
> CTX_NR_BITS reduction.
>>
>> Instead, I think you can use something like RCU to provide the
>> necessary synchronization.  So you could first make sure X isn't
>> referenced on the local cpu any more, and then do call_rcu() to do the
>> actual clearing of the bitmap which allows X to be allocated again.
> Making the context id array a compile-time member of mm_context_t
> eliminates the need for the hashed spin lock. That elimination, plus the
> code simplicity, is why I changed direction on 7/14. I am not sure we will
> need anything RCU-like. This should also make mm_cd_destroy() more
> scalable, with fewer context domains to inspect and, possibly, context ids
> to release. The context id release aspect will decrease wrap events.
> 
> It is difficult to satisfy all topologies; to me it seems all but
> impossible.

Another approach to simplify the initial putback is to remove all hash
locks entirely and keep only one global wrap lock. We would not regress
compared to the baseline, as a global lock is already taken during mm
destroy. Later, if this area is found to be a bottleneck, the locking can
be scaled, either with hash locks or some other scheme.
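
A minimal sketch of what that could look like, assuming the existing global
context-allocation lock (ctx_alloc_lock) is reused and mm_cd_destroy() walks
the cds[] array from the snippet below; free_context_id() is a hypothetical
helper, not something in the posted patch:

/* Sketch only: release all of an mm's per-domain context ids under the one
 * global context-allocation lock, instead of a hash of locks.
 */
void mm_cd_destroy(struct mm_struct *mm)
{
	unsigned long flags;
	int cd;

	spin_lock_irqsave(&ctx_alloc_lock, flags);
	for (cd = 0; cd < NR_CD; cd++) {
		unsigned long ctx = mm->context.cds[cd];

		/* Only a still-valid id needs to go back to the allocator;
		 * releasing it early helps delay the next wrap.
		 */
		if (ctx)
			free_context_id(cd, ctx);	/* hypothetical helper */
	}
	spin_unlock_irqrestore(&ctx_alloc_lock, flags);
}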

> 
> This is a snippet of the patch series. I just booted it on my t5-2:
> diff --git a/arch/sparc/include/asm/mmu_64.h b/arch/sparc/include/asm/mmu_64.h
> index 83b36a5..11f2c07 100644
> --- a/arch/sparc/include/asm/mmu_64.h
> +++ b/arch/sparc/include/asm/mmu_64.h
> @@ -53,10 +53,6 @@
>   #define CTX_HW_MASK		(CTX_NR_MASK | CTX_PGSZ_MASK)
>   
>   #define CTX_FIRST_VERSION	BIT(CTX_VERSION_SHIFT)
> -#define CTX_VALID(__ctx)	\
> -	 (!(((__ctx.sparc64_ctx_val) ^ tlb_context_cache) & CTX_VERSION_MASK))
> -#define CTX_HWBITS(__ctx)	((__ctx.sparc64_ctx_val) & CTX_HW_MASK)
> -#define CTX_NRBITS(__ctx)	((__ctx.sparc64_ctx_val) & CTX_NR_MASK)
>   
>   #ifndef __ASSEMBLY__
>   
> @@ -89,9 +85,21 @@ struct tsb_config {
>   #define MM_NUM_TSBS	1
>   #endif
>   
> +int alloc_context_domain(int cpu);
> +void cd_cpu_online(int cpu);
> +void cd_cpu_offline(int cpu);
> +void mm_cd_destroy(struct mm_struct *mm);
> +
> +#define CPU_TO_CD_SHIFT		(8)
> +#define CD_MAX			(0x10u)
> +#define CD_MASK			(0x0fu)
> +#define CPU_TO_CD(CPUID)	(((CPUID) >> CPU_TO_CD_SHIFT) & CD_MASK)
> +
> +#define NR_CD	(CPU_TO_CD((CONFIG_NR_CPUS - 1)) + 1)
> +
>   typedef struct {
>   	spinlock_t		lock;
> -	unsigned long		sparc64_ctx_val;
> +	unsigned long		cds[NR_CD];
>   	unsigned long		hugetlb_pte_count;
>   	unsigned long		thp_pte_count;
>   	struct tsb_config	tsb_block[MM_NUM_TSBS];
> We may want to reduce CPU_TO_CD_SHIFT to, say, 7; I will have to do that
> just for testing on my t5-2, since the current value of 8 will likely
> leave Debian in a single context domain. Please allow a couple more days
> for testing.
> 
> To summarize: cross calls are reinstated, mm_cd_destroy() will now release
> valid context ids (mitigating wrap), and mm.context.cds is a compile-time
> array.
>>
>> Just some ideas...
> thanx,
> 
> bob

Patch

diff --git a/arch/sparc/include/asm/mmu_64.h b/arch/sparc/include/asm/mmu_64.h
index 83b36a5..11f2c07 100644
--- a/arch/sparc/include/asm/mmu_64.h
+++ b/arch/sparc/include/asm/mmu_64.h
@@ -53,10 +53,6 @@ 
 #define CTX_HW_MASK		(CTX_NR_MASK | CTX_PGSZ_MASK)
 
 #define CTX_FIRST_VERSION	BIT(CTX_VERSION_SHIFT)
-#define CTX_VALID(__ctx)	\
-	 (!(((__ctx.sparc64_ctx_val) ^ tlb_context_cache) & CTX_VERSION_MASK))
-#define CTX_HWBITS(__ctx)	((__ctx.sparc64_ctx_val) & CTX_HW_MASK)
-#define CTX_NRBITS(__ctx)	((__ctx.sparc64_ctx_val) & CTX_NR_MASK)
 
 #ifndef __ASSEMBLY__
 
@@ -89,9 +85,21 @@  struct tsb_config {
 #define MM_NUM_TSBS	1
 #endif
 
+int alloc_context_domain(int cpu);
+void cd_cpu_online(int cpu);
+void cd_cpu_offline(int cpu);
+void mm_cd_destroy(struct mm_struct *mm);
+
+#define CPU_TO_CD_SHIFT		(8)
+#define CD_MAX			(0x10u)
+#define CD_MASK			(0x0fu)
+#define CPU_TO_CD(CPUID)	(((CPUID) >> CPU_TO_CD_SHIFT) & CD_MASK)
+
+#define NR_CD	(CPU_TO_CD((CONFIG_NR_CPUS - 1)) + 1)
+
 typedef struct {
 	spinlock_t		lock;
-	unsigned long		sparc64_ctx_val;
+	unsigned long		cds[NR_CD];
 	unsigned long		hugetlb_pte_count;
 	unsigned long		thp_pte_count;
 	struct tsb_config	tsb_block[MM_NUM_TSBS];
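
The removed CTX_VALID/CTX_HWBITS/CTX_NRBITS macros presumably gain
per-domain replacements elsewhere in the series; a rough sketch of what
indexed forms over cds[] might look like (names and details are guesses,
not taken from the series):

/* Sketch only: per-context-domain variants of the removed macros, indexing
 * the new cds[] array instead of sparc64_ctx_val.  The version cache
 * (tlb_context_cache) may itself become per-domain in the full series.
 */
#define CTX_VALID_CD(__ctx, __cd)	\
	(!(((__ctx).cds[__cd] ^ tlb_context_cache) & CTX_VERSION_MASK))
#define CTX_HWBITS_CD(__ctx, __cd)	((__ctx).cds[__cd] & CTX_HW_MASK)
#define CTX_NRBITS_CD(__ctx, __cd)	((__ctx).cds[__cd] & CTX_NR_MASK)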