
ext4: set the SLAB_MEM_SPREAD flag on slab caches

Message ID 1321455848-1929-1-git-send-email-linkinjeon@gmail.com
State Superseded, archived

Commit Message

Namjae Jeon Nov. 16, 2011, 3:04 p.m. UTC
If slab caches are created with the SLAB_MEM_SPREAD flag, allocations are spread evenly over all memory nodes instead of favoring the node local to the current CPU.

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
---
 fs/ext4/mballoc.c |    9 +++++----
 fs/ext4/page-io.c |    6 ++++--
 2 files changed, 9 insertions(+), 6 deletions(-)

Comments

Theodore Ts'o Nov. 16, 2011, 3:51 p.m. UTC | #1
On Nov 16, 2011, at 10:04 AM, Namjae Jeon wrote:

> If slab caches are created with the SLAB_MEM_SPREAD flag, allocations are spread evenly over all memory nodes instead of favoring the node local to the current CPU.

And why do you think this is a good thing?   For mballoc in particular, the data structures are used immediately and then freed immediately --- on the local node, so using non-local memory just makes things worse in a NUMA system.

For the pageio structures, they are used immediately, and may end up getting freed on the same CPU if the I/O interrupt (and hence the completion callbacks) happens on the same CPU they were submitted on...

Regards,

-- Ted
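
For context on the lifetime argument: in fs/ext4/mballoc.c of this era the allocation context follows an allocate-use-free pattern within a single call. A simplified sketch of the body of ext4_mb_new_blocks(), with the actual allocator logic omitted (errp is the function's error out-parameter):

	struct ext4_allocation_context *ac;

	/* Allocated on entry, on whatever node is local to this CPU. */
	ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
	if (!ac) {
		*errp = -ENOMEM;
		return 0;
	}
	/* ... the context is initialized and used by the allocator
	 * on this CPU for the duration of the call ... */

	/* Freed again before the function returns. */
	kmem_cache_free(ext4_ac_cachep, ac);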


David Rientjes Nov. 16, 2011, 9:52 p.m. UTC | #2
On Wed, 16 Nov 2011, Theodore Tso wrote:

> On Nov 16, 2011, at 10:04 AM, Namjae Jeon wrote:
> 
> > If slab caches are created with the SLAB_MEM_SPREAD flag, allocations
> > are spread evenly over all memory nodes instead of favoring the node
> > local to the current CPU.
> 
> And why do you think this is a good thing?   For mballoc in particular, 
> the data structures are used immediately and then freed immediately --- 
> on the local node, so using non-local memory just makes things worse 
> in a NUMA system.
> 

I don't think this has the effect that Namjae thinks it does: this is only 
useful for CONFIG_SLAB and when you have cpusets enabled with 
cpuset.memory_spread_slab set.

To test how useful it is, you should enable CONFIG_SLAB and then mount 
cpusets, set cpuset.memory_spread_slab, and create an MPOL_INTERLEAVE 
mempolicy over all online nodes.  This will have the same effect as adding 
SLAB_MEM_SPREAD to these slab caches (it just doesn't require the 
mempolicy) and will let you quantify the effects without any changes to 
the kernel at all.
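
A minimal sketch of the experiment David describes, assuming a CONFIG_SLAB kernel with cpusets available and libnuma installed (link with -lnuma); the mount point and the workload step are illustrative placeholders, not part of his message:

/*
 * From a shell first (legacy cgroup-v1 cpuset layout assumed):
 *   mount -t cgroup -o cpuset none /sys/fs/cgroup/cpuset
 *   echo 1 > /sys/fs/cgroup/cpuset/cpuset.memory_spread_slab
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	if (numa_available() < 0) {
		fprintf(stderr, "libnuma: NUMA not available\n");
		return 1;
	}
	/* An MPOL_INTERLEAVE policy over all online nodes, as suggested. */
	numa_set_interleave_mask(numa_all_nodes_ptr);

	/* ... run an ext4 workload here (create/write/delete files) and
	 * compare throughput and /proc/slabinfo against a run without
	 * the interleave policy ... */
	return 0;
}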
Amit Sahrawat Nov. 17, 2011, 5:11 a.m. UTC | #3
Hi All,

To share with you all, this was picked up as part of a review process.
We looked into the details regarding the introduction of
SLAB_MEM_SPREAD (http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-02/msg01409.html
- provide the slab cache infrastructure to support cpuset memory
spreading), and the changes that were subsequently introduced across all
filesystems - http://lwn.net/Articles/173654/.
Since we do not have direct access to a NUMA machine, we could not
verify the changes ourselves - but looking at the respective changes,
the changes in this patch do seem relevant and in keeping with them.
And yes, David - these changes will only make sense in the CONFIG_SLAB case.

Please share your opinion on the above references, irrespective of
the code changes done in this patch.

Thanks & Regards,
Amit Sahrawat

On Thu, Nov 17, 2011 at 3:22 AM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 16 Nov 2011, Theodore Tso wrote:
>
>> On Nov 16, 2011, at 10:04 AM, Namjae Jeon wrote:
>>
>> > If slab caches are created with the SLAB_MEM_SPREAD flag, allocations
>> > are spread evenly over all memory nodes instead of favoring the node
>> > local to the current CPU.
>>
>> And why do you think this is a good thing?   For mballoc in particular,
>> the data structures are used immediately and then freed immediately ---
>> on the local node, so using non-local memory just makes things worse
>> in a NUMA system.
>>
>
> I don't think this has the effect that Namjae thinks it does: this is only
> useful for CONFIG_SLAB and when you have cpusets enabled with
> cpuset.memory_spread_slab set.
>
> To test how useful it is, you should enable CONFIG_SLAB and then mount
> cpusets, set cpuset.memory_spread_slab, and create an MPOL_INTERLEAVE
> mempolicy over all online nodes.  This will have the same effect as adding
> SLAB_MEM_SPREAD to these slab caches (it just doesn't require the
> mempolicy) and will let you quantify the effects without any changes to
> the kernel at all.
>
Namjae Jeon Nov. 17, 2011, 5:35 a.m. UTC | #4
2011/11/17 Amit Sahrawat <amit.sahrawat83@gmail.com>:
> Hi All,
>
> To share with you all, this was picked up as part of a review process.
> We looked into the details regarding the introduction of
> SLAB_MEM_SPREAD (http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-02/msg01409.html
> - provide the slab cache infrastructure to support cpuset memory
> spreading), and the changes that were subsequently introduced across all
> filesystems - http://lwn.net/Articles/173654/.
> Since we do not have direct access to a NUMA machine, we could not
> verify the changes ourselves - but looking at the respective changes,
> the changes in this patch do seem relevant and in keeping with them.
> And yes, David - these changes will only make sense in the CONFIG_SLAB case.
>
> Please share your opinion on the above references, irrespective of
> the code changes done in this patch.
>
> Thanks & Regards,
> Amit Sahrawat
>
> On Thu, Nov 17, 2011 at 3:22 AM, David Rientjes <rientjes@google.com> wrote:
>> On Wed, 16 Nov 2011, Theodore Tso wrote:
>>
>>> On Nov 16, 2011, at 10:04 AM, Namjae Jeon wrote:
>>>
>>> > If slab caches are created with the SLAB_MEM_SPREAD flag, allocations
>>> > are spread evenly over all memory nodes instead of favoring the node
>>> > local to the current CPU.
>>>
>>> And why do you think this is a good thing?   For mballoc in particular,
>>> the data structures are used immediately and then freed immediately ---
>>> on the local node, so using non-local memory just makes things worse
>>> in a NUMA system.
>>>
>>
>> I don't think this has the effect that Namjae thinks it does: this is only
>> useful for CONFIG_SLAB and when you have cpusets enabled with
>> cpuset.memory_spread_slab set.
>>
>> To test how useful it is, you should enable CONFIG_SLAB and then mount
>> cpusets, set cpuset.memory_spread_slab, and create an MPOL_INTERLEAVE
>> mempolicy over all online nodes.  This will have the same effect as adding
>> SLAB_MEM_SPREAD to these slab caches (it just doesn't require the
>> mempolicy) and will let you quantify the effects without any changes to
>> the kernel at all.
Regarding Ted's question:
On the performance side, we prevent one node from being overloaded with
redundant alloc/free work by balancing the workload across the other nodes,
and we can also slightly reduce the chance of reclaim that comes from
allocating only on the local node.
Also, the slab caches for the btrfs extent I/O structures are created with
the SLAB_MEM_SPREAD flag together with the reclaim flag.
>>
>
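
For reference, the btrfs caches Namjae mentions are created in extent_io_init() in fs/btrfs/extent_io.c; a from-memory excerpt of roughly this vintage, so treat the exact names and form as approximate:

	extent_state_cache = kmem_cache_create("extent_state",
			sizeof(struct extent_state), 0,
			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
	if (!extent_state_cache)
		return -ENOMEM;

	extent_buffer_cache = kmem_cache_create("extent_buffers",
			sizeof(struct extent_buffer), 0,
			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);

Whether these btrfs structures are as short-lived as the ext4 page-io and mballoc objects is exactly what the rest of the thread turns on.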
Theodore Ts'o Nov. 17, 2011, 3:11 p.m. UTC | #5
On Nov 17, 2011, at 12:35 AM, NamJae Jeon wrote:

> On the performance side, we prevent one node from being overloaded with
> redundant alloc/free work by balancing the workload across the other nodes,
> and we can also slightly reduce the chance of reclaim that comes from
> allocating only on the local node.

Can you say more about this?   I'm afraid I don't understand what you
are trying to do here.

What I'm worried about is systems where the memory latency for
local versus remote nodes might be 2-3 to 1.  That is, it takes 2-3 times
as much time to access memory on a remote node.   (On older machines,
the memory access time ratios could have been as bad as 12:1 or 15:1.)

So accessing non-local memory can be a really, really big deal.  And this
isn't just theoretical, but have you considered what might happen on an 8
core AMD machine?

Regards,

-- Ted

Theodore Ts'o Nov. 17, 2011, 9:34 p.m. UTC | #6
On Nov 17, 2011, at 10:11 AM, Theodore Tso wrote:
> 
> So accessing non-local memory can be a really, really big deal.  And this
> isn't just theoretical, but have you considered what might happen on an 8
> core AMD machine?

Sorry, typo.  This should have read, "have you considered what might happen on an 8 _socket_ AMD machine"?

-- Ted

Namjae Jeon Nov. 19, 2011, 1:52 p.m. UTC | #7
2011/11/18 Theodore Tso <tytso@mit.edu>:
>
> On Nov 17, 2011, at 10:11 AM, Theodore Tso wrote:
>>
>> So accessing non-local memory can be a really, really big deal.  And this
>> isn't just theoretical, but have you considered what might happen on an 8
>> core AMD machine?
>
> Sorry, typo.  This should have read, "have you considered what might happen on an 8 _socket_ AMD machine"?
>
You're right, but it is only useful when using cpusets.  And have you
read the cpuset spreading history links that Amit provided?
And why is the spread flag still used for the inode cache?
Thanks.

> -- Ted
>
>
Theodore Ts'o Nov. 19, 2011, 2:42 p.m. UTC | #8
On Nov 19, 2011, at 8:52 AM, NamJae Jeon wrote:

> 2011/11/18 Theodore Tso <tytso@mit.edu>:
>> 
>> On Nov 17, 2011, at 10:11 AM, Theodore Tso wrote:
>>> 
>>> So accessing non-local memory can be a really, really big deal.  And this
>>> isn't just theoretical, but have you considered what might happen on an 8
>>> core AMD machine?
>> 
>> Sorry, typo.  This should have read, "have you considered what might happen on an 8 _socket_ AMD machine"?
>> 
> You're right, but it is only useful when using cpusets.  And have you
> read the cpuset spreading history links that Amit provided?
> And why is the spread flag still used for the inode cache?

The inode cache is different for the following reasons:

(1) The memory allocations are long-lived, and there is a good chance that many of them will be used by other CPUs on different NUMA nodes.

(2)  There are a very large number of inodes, so uneven allocation of the inodes has a significantly larger impact.

In contrast, the page_io and mballoc allocations are very short-lived, there aren't a whole lot of them (check /proc/slabinfo), and they are guaranteed to be used during their very short lifetime on the local CPU node.   So the benefits of spreading them around are not that great, since there aren't that many of them, and the downside of a potential 2x or 3x memory access time is large.

-- Ted
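
For comparison, the inode cache Ted describes is the case where ext4 already sets the flag; the creation in fs/ext4/super.c at this time reads roughly as follows (a from-memory excerpt, so approximate):

	ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
					      sizeof(struct ext4_inode_info),
					      0, (SLAB_RECLAIM_ACCOUNT |
						  SLAB_MEM_SPREAD),
					      init_once);
	if (ext4_inode_cachep == NULL)
		return -ENOMEM;

Here the objects routinely outlive the allocating task and are touched from many nodes, so spreading them is defensible in a way it is not for the short-lived caches above.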

Namjae Jeon Nov. 19, 2011, 3:40 p.m. UTC | #9
2011/11/19 Theodore Tso <tytso@mit.edu>:
>
> On Nov 19, 2011, at 8:52 AM, NamJae Jeon wrote:
>
>> 2011/11/18 Theodore Tso <tytso@mit.edu>:
>>>
>>> On Nov 17, 2011, at 10:11 AM, Theodore Tso wrote:
>>>>
>>>> So accessing non-local memory can be a really, really big deal.  And this
>>>> isn't just theoretical, but have you considered what might happen on an 8
>>>> core AMD machine?
>>>
>>> Sorry, typo.  This should have read, "have you considered what might happen on an 8 _socket_ AMD machine"?
>>>
>> You're right, but it is only useful when using cpusets.  And have you
>> read the cpuset spreading history links that Amit provided?
>> And why is the spread flag still used for the inode cache?
>
> The inode cache is different for the following reasons:
>
> (1) The memory allocations are long-lived, and there is a good chance that many of them will be used by other CPUs on different NUMA nodes.
>
> (2)  There are a very large number of inodes, so uneven allocation of the inodes has a significantly larger impact.
>
> In contrast, the page_io and mballoc allocations are very short-lived, there aren't a whole lot of them (check /proc/slabinfo), and they are guaranteed to be used during their very short lifetime on the local CPU node.   So the benefits of spreading them around are not that great, since there aren't that many of them, and the downside of a potential 2x or 3x memory access time is large.
>
Thanks for your opinion.
Have you considered whether the mem spread flag is also unneeded for the
group info caches, the same as for page_io and mballoc?

> -- Ted
>
>
Theodore Ts'o Nov. 19, 2011, 9:06 p.m. UTC | #10
On Nov 17, 2011, at 12:11 AM, Amit Sahrawat wrote:
> We looked into the details regarding the introduction of
> SLAB_MEM_SPREAD (http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-02/msg01409.html
> - provide the slab cache infrastructure to support cpuset memory
> spreading), and the changes that were subsequently introduced across all
> filesystems - http://lwn.net/Articles/173654/.

The main thing I see when I look at this list is that it was focused only on the file systems' inode structures, or other structures which were long-lived and likely to be accessed from NUMA nodes other than the one where they were originally allocated.

It's unfortunate that this has gotten turned into a much broader generalization that all slab caches should have this flag set.   Perhaps it might be interesting to reflect on the fact that if it were always good to do this, a flag wouldn't be used to control the behavior; it would simply be done unconditionally.

It's all very good to send lots of cleanup patches, perhaps as a way to learn how to submit kernel patches.   But I would ask people who do this to understand a bit more deeply what is going on.  What's most important for people who are getting started with kernel development is not so much the mechanics of submitting patches, but understanding why things work the way they do.

Best regards,

-- Ted

Theodore Ts'o Nov. 19, 2011, 9:14 p.m. UTC | #11
On Nov 19, 2011, at 10:40 AM, NamJae Jeon wrote:

> Thanks for your opinion.
> Have you considered whether the mem spread flag is also unneeded for the
> group info caches, the same as for page_io and mballoc?

The group info structures are long-lived and will likely be accessed from many NUMA nodes.   So setting the mem spread flag for the group info cache is more likely to be useful.   My guess is that any benefit will be extremely hard to measure (especially since the amount of memory used by the group info structures is a pittance compared to the memory used by the inode cache), but at least theoretically it would be sound to have that flag set on the group info slab cache.

Regards,

-- Ted

Namjae Jeon Nov. 19, 2011, 10:16 p.m. UTC | #12
2011/11/20 Theodore Tso <tytso@mit.edu>:
>
> On Nov 17, 2011, at 12:11 AM, Amit Sahrawat wrote:
>> We looked into the details regarding the introduction of
>> SLAB_MEM_SPREAD (http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-02/msg01409.html
>> - provide the slab cache infrastructure to support cpuset memory
>> spreading), and the changes that were subsequently introduced across all
>> filesystems - http://lwn.net/Articles/173654/.
>
> The main thing I see when I look at this list is that it was focused only on the file systems' inode structures, or other structures which were long-lived and likely to be accessed from NUMA nodes other than the one where they were originally allocated.
>
> It's unfortunate that this has gotten turned into a much broader generalization that all slab caches should have this flag set.   Perhaps it might be interesting to reflect on the fact that if it were always good to do this, a flag wouldn't be used to control the behavior; it would simply be done unconditionally.
>
> It's all very good to send lots of cleanup patches, perhaps as a way to learn how to submit kernel patches.   But I would ask people who do this to understand a bit more deeply what is going on.  What's most important for people who are getting started with kernel development is not so much the mechanics of submitting patches, but understanding why things work the way they do.

Actually, I checked btrfs before I posted this patch; I was further
confused because short-lived slab caches in btrfs, similar to the
page_io caches of ext4, were not even using the spread flags.
I apologize for missing the point.  I will post a patch after
understanding things a bit more deeply.

>
> Best regards,
>
> -- Ted
>
>
Namjae Jeon Nov. 19, 2011, 10:22 p.m. UTC | #13
2011/11/20 Theodore Tso <tytso@mit.edu>:
>
> On Nov 19, 2011, at 10:40 AM, NamJae Jeon wrote:
>
>> Thanks for your opinion.
>> Have you considered whether the mem spread flag is also unneeded for the
>> group info caches, the same as for page_io and mballoc?
>
> The group info structures are long-lived and will likely be accessed from many NUMA nodes.   So setting the mem spread flag for the group info cache is more likely to be useful.   My guess is that any benefit will be extremely hard to measure (especially since the amount of memory used by the group info structures is a pittance compared to the memory used by the inode cache), but at least theoretically it would be sound to have that flag set on the group info slab cache.
You're right.  I will post a new patch that includes the spread flag for
the group info caches.
Thanks, Ted.
> Regards,
>
> -- Ted
>
>

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e2d8be8..86efde7 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2407,7 +2407,8 @@  static int ext4_groupinfo_create_slab(size_t size)
 				bb_counters[blocksize_bits + 2]);
 
 	cachep = kmem_cache_create(ext4_groupinfo_slab_names[cache_index],
-					slab_size, 0, SLAB_RECLAIM_ACCOUNT,
+					slab_size, 0, SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD,
 					NULL);
 
 	ext4_groupinfo_caches[cache_index] = cachep;
@@ -2730,19 +2731,19 @@  static void ext4_remove_debugfs_entry(void)
 int __init ext4_init_mballoc(void)
 {
 	ext4_pspace_cachep = KMEM_CACHE(ext4_prealloc_space,
-					SLAB_RECLAIM_ACCOUNT);
+					SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD);
 	if (ext4_pspace_cachep == NULL)
 		return -ENOMEM;
 
 	ext4_ac_cachep = KMEM_CACHE(ext4_allocation_context,
-				    SLAB_RECLAIM_ACCOUNT);
+				SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD);
 	if (ext4_ac_cachep == NULL) {
 		kmem_cache_destroy(ext4_pspace_cachep);
 		return -ENOMEM;
 	}
 
 	ext4_free_ext_cachep = KMEM_CACHE(ext4_free_data,
-					  SLAB_RECLAIM_ACCOUNT);
+					SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD);
 	if (ext4_free_ext_cachep == NULL) {
 		kmem_cache_destroy(ext4_pspace_cachep);
 		kmem_cache_destroy(ext4_ac_cachep);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 7ce1d0b..4f848e0 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -34,10 +34,12 @@  static struct kmem_cache *io_page_cachep, *io_end_cachep;
 
 int __init ext4_init_pageio(void)
 {
-	io_page_cachep = KMEM_CACHE(ext4_io_page, SLAB_RECLAIM_ACCOUNT);
+	io_page_cachep = KMEM_CACHE(ext4_io_page, SLAB_RECLAIM_ACCOUNT |
+				SLAB_MEM_SPREAD);
 	if (io_page_cachep == NULL)
 		return -ENOMEM;
-	io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT);
+	io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT |
+				SLAB_MEM_SPREAD);
 	if (io_end_cachep == NULL) {
 		kmem_cache_destroy(io_page_cachep);
 		return -ENOMEM;