
[7/7] slub: initial bulk free implementation

Message ID 20150615155256.18824.42651.stgit@devil
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Jesper Dangaard Brouer June 15, 2015, 3:52 p.m. UTC
This implements the SLUB-specific kmem_cache_free_bulk().  The SLUB allocator
now has both bulk alloc and bulk free implemented.

Play nice and reenable local IRQs while calling slowpath.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)
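
For context, a minimal usage sketch of the free-side API this patch wires up
(illustrative only, not part of the series; the cache pointer and batch size
below are placeholders):

/* Illustrative only: allocate a small batch the usual way, then
 * release it in one call via kmem_cache_free_bulk().
 */
#include <linux/slab.h>

#define BATCH 16	/* arbitrary example batch size */

static void example_bulk_free(struct kmem_cache *cachep)
{
	void *objs[BATCH];
	size_t n = 0;

	while (n < BATCH) {
		objs[n] = kmem_cache_alloc(cachep, GFP_KERNEL);
		if (!objs[n])
			break;
		n++;
	}

	/* ... use the objects ... */

	kmem_cache_free_bulk(cachep, n, objs);
}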



Comments

Christoph Lameter (Ampere) June 15, 2015, 4:34 p.m. UTC | #1
On Mon, 15 Jun 2015, Jesper Dangaard Brouer wrote:

> +	for (i = 0; i < size; i++) {
> +		void *object = p[i];
> +
> +		if (unlikely(!object))
> +			continue; // HOW ABOUT BUG_ON()???

Sure BUG_ON would be fitting here.

> +
> +		page = virt_to_head_page(object);
> +		BUG_ON(s != page->slab_cache); /* Check if valid slab page */

This checks whether the slab page belongs to the slab cache we are
interested in.

> +
> +		if (c->page == page) {
> +			/* Fastpath: local CPU free */
> +			set_freepointer(s, object, c->freelist);
> +			c->freelist = object;
> +		} else {
> +			c->tid = next_tid(c->tid);

tids are only useful for the fastpath. No need to fiddle around with them
for the slowpath.

> +			local_irq_enable();
> +			/* Slowpath: overhead locked cmpxchg_double_slab */
Alexander H Duyck June 15, 2015, 5:04 p.m. UTC | #2
On 06/15/2015 08:52 AM, Jesper Dangaard Brouer wrote:
> This implements SLUB specific kmem_cache_free_bulk().  SLUB allocator
> now both have bulk alloc and free implemented.
>
> Play nice and reenable local IRQs while calling slowpath.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   mm/slub.c |   32 +++++++++++++++++++++++++++++++-
>   1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 98d0e6f73ec1..cc4f870677bb 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2752,7 +2752,37 @@ EXPORT_SYMBOL(kmem_cache_free);
>   
>   void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>   {
> -	__kmem_cache_free_bulk(s, size, p);
> +	struct kmem_cache_cpu *c;
> +	struct page *page;
> +	int i;
> +
> +	local_irq_disable();
> +	c = this_cpu_ptr(s->cpu_slab);
> +
> +	for (i = 0; i < size; i++) {
> +		void *object = p[i];
> +
> +		if (unlikely(!object))
> +			continue; // HOW ABOUT BUG_ON()???
> +
> +		page = virt_to_head_page(object);
> +		BUG_ON(s != page->slab_cache); /* Check if valid slab page */
> +
> +		if (c->page == page) {
> +			/* Fastpath: local CPU free */
> +			set_freepointer(s, object, c->freelist);
> +			c->freelist = object;
> +		} else {
> +			c->tid = next_tid(c->tid);
> +			local_irq_enable();
> +			/* Slowpath: overhead locked cmpxchg_double_slab */
> +			__slab_free(s, page, object, _RET_IP_);
> +			local_irq_disable();
> +			c = this_cpu_ptr(s->cpu_slab);
> +		}
> +	}
> +	c->tid = next_tid(c->tid);
> +	local_irq_enable();
>   }
>   EXPORT_SYMBOL(kmem_cache_free_bulk);

So if the idea is to batch the freeing maybe you should look at doing 
the freeing in two passes.  The first would be to free all those buffers 
that share their page with the percpu slab.  Then you could just free 
everything else in the second pass after you have re-enabled IRQs.

- Alex
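
A rough sketch of the two-pass approach Alex describes above, for illustration
only (it is not part of the posted series, reuses the same internal helpers as
the patch, and clobbers the caller's array to mark objects handled in the
first pass):

/* Illustrative two-pass variant, not part of the posted patch.
 * Pass 1 (IRQs off): free objects belonging to the current per-cpu slab.
 * Pass 2 (IRQs on):  push everything else through the locked slowpath.
 */
void kmem_cache_free_bulk_twopass(struct kmem_cache *s, size_t size, void **p)
{
	struct kmem_cache_cpu *c;
	struct page *page;
	int i;

	local_irq_disable();
	c = this_cpu_ptr(s->cpu_slab);

	for (i = 0; i < size; i++) {
		void *object = p[i];

		if (unlikely(!object))
			continue;

		page = virt_to_head_page(object);
		if (c->page == page) {
			/* Fastpath: local CPU free */
			set_freepointer(s, object, c->freelist);
			c->freelist = object;
			p[i] = NULL;	/* mark as handled for pass 2 */
		}
	}
	c->tid = next_tid(c->tid);
	local_irq_enable();

	/* Pass 2: remaining objects take the slowpath with IRQs enabled */
	for (i = 0; i < size; i++) {
		if (p[i])
			__slab_free(s, virt_to_head_page(p[i]), p[i], _RET_IP_);
	}
}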
Joonsoo Kim June 16, 2015, 7:23 a.m. UTC | #3
On Mon, Jun 15, 2015 at 05:52:56PM +0200, Jesper Dangaard Brouer wrote:
> This implements SLUB specific kmem_cache_free_bulk().  SLUB allocator
> now both have bulk alloc and free implemented.
> 
> Play nice and reenable local IRQs while calling slowpath.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  mm/slub.c |   32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 98d0e6f73ec1..cc4f870677bb 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2752,7 +2752,37 @@ EXPORT_SYMBOL(kmem_cache_free);
>  
>  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
> -	__kmem_cache_free_bulk(s, size, p);
> +	struct kmem_cache_cpu *c;
> +	struct page *page;
> +	int i;
> +
> +	local_irq_disable();
> +	c = this_cpu_ptr(s->cpu_slab);
> +
> +	for (i = 0; i < size; i++) {
> +		void *object = p[i];
> +
> +		if (unlikely(!object))
> +			continue; // HOW ABOUT BUG_ON()???
> +
> +		page = virt_to_head_page(object);
> +		BUG_ON(s != page->slab_cache); /* Check if valid slab page */

You need to use cache_from_obj() to support kmemcg accounting.
And slab_free_hook() should be called before the free.

Thanks.
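
A sketch of where those two calls could slot into the loop (illustrative only;
it assumes the cache_from_obj() and slab_free_hook() helpers as they existed
in mm/slab.h and mm/slub.c at the time, and it omits refreshing the per-cpu
pointer when cache_from_obj() returns a different per-memcg cache):

/* Illustrative only: per-object kmemcg resolution and debug hook,
 * placed before touching any freelist.
 */
for (i = 0; i < size; i++) {
	void *object = p[i];

	if (unlikely(!object))
		continue;

	/* Map a memcg-accounted object back to the cache it belongs to */
	s = cache_from_obj(s, object);

	page = virt_to_head_page(object);
	BUG_ON(s != page->slab_cache); /* Check if valid slab page */

	/* kasan/kmemleak/debug notifications before the actual free */
	slab_free_hook(s, object);

	/* ... fastpath/slowpath free as in the patch above ... */
}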

Joonsoo Kim June 16, 2015, 7:28 a.m. UTC | #4
On Mon, Jun 15, 2015 at 05:52:56PM +0200, Jesper Dangaard Brouer wrote:
> This implements SLUB specific kmem_cache_free_bulk().  SLUB allocator
> now both have bulk alloc and free implemented.
> 
> Play nice and reenable local IRQs while calling slowpath.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  mm/slub.c |   32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 98d0e6f73ec1..cc4f870677bb 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2752,7 +2752,37 @@ EXPORT_SYMBOL(kmem_cache_free);
>  
>  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
> -	__kmem_cache_free_bulk(s, size, p);
> +	struct kmem_cache_cpu *c;
> +	struct page *page;
> +	int i;
> +
> +	local_irq_disable();
> +	c = this_cpu_ptr(s->cpu_slab);
> +
> +	for (i = 0; i < size; i++) {
> +		void *object = p[i];
> +
> +		if (unlikely(!object))
> +			continue; // HOW ABOUT BUG_ON()???
> +
> +		page = virt_to_head_page(object);
> +		BUG_ON(s != page->slab_cache); /* Check if valid slab page */
> +
> +		if (c->page == page) {
> +			/* Fastpath: local CPU free */
> +			set_freepointer(s, object, c->freelist);
> +			c->freelist = object;
> +		} else {
> +			c->tid = next_tid(c->tid);
> +			local_irq_enable();
> +			/* Slowpath: overhead locked cmpxchg_double_slab */
> +			__slab_free(s, page, object, _RET_IP_);
> +			local_irq_disable();
> +			c = this_cpu_ptr(s->cpu_slab);

The SLUB free path doesn't need IRQ management in many cases even though
it uses cmpxchg_double_slab. Is this really better than just calling
__kmem_cache_free_bulk()?

Thanks.

Jesper Dangaard Brouer June 16, 2015, 8:04 a.m. UTC | #5
On Mon, 15 Jun 2015 11:34:44 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Mon, 15 Jun 2015, Jesper Dangaard Brouer wrote:
> 
> > +	for (i = 0; i < size; i++) {
> > +		void *object = p[i];
> > +
> > +		if (unlikely(!object))
> > +			continue; // HOW ABOUT BUG_ON()???
> 
> Sure BUG_ON would be fitting here.

Okay, will do in V2.

> > +
> > +		page = virt_to_head_page(object);
> > +		BUG_ON(s != page->slab_cache); /* Check if valid slab page */
> 
> This is the check if the slab page belongs to the slab cache we are
> interested in.

Is this appropriate to keep on this fastpath? (I copied the check from
one of your earlier patches)

> > +
> > +		if (c->page == page) {
> > +			/* Fastpath: local CPU free */
> > +			set_freepointer(s, object, c->freelist);
> > +			c->freelist = object;
> > +		} else {
> > +			c->tid = next_tid(c->tid);
> 
> tids are only useful for the fastpath. No need to fiddle around with them
> for the slowpath.

Okay, understood.

> > +			local_irq_enable();
> > +			/* Slowpath: overhead locked cmpxchg_double_slab */
Jesper Dangaard Brouer June 16, 2015, 8:21 a.m. UTC | #6
On Tue, 16 Jun 2015 16:28:06 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> Is this really better than just calling __kmem_cache_free_bulk()?

Yes, as can be seen from the cover letter, but my cover letter does not seem
to have reached the mm list.

Measurements for the entire patchset:

Bulk - Fallback bulking           - fastpath-bulking
   1 -  47 cycles(tsc) 11.921 ns  -  45 cycles(tsc) 11.461 ns   improved  4.3%
   2 -  46 cycles(tsc) 11.649 ns  -  28 cycles(tsc)  7.023 ns   improved 39.1%
   3 -  46 cycles(tsc) 11.550 ns  -  22 cycles(tsc)  5.671 ns   improved 52.2%
   4 -  45 cycles(tsc) 11.398 ns  -  19 cycles(tsc)  4.967 ns   improved 57.8%
   8 -  45 cycles(tsc) 11.303 ns  -  17 cycles(tsc)  4.298 ns   improved 62.2%
  16 -  44 cycles(tsc) 11.221 ns  -  17 cycles(tsc)  4.423 ns   improved 61.4%
  30 -  75 cycles(tsc) 18.894 ns  -  57 cycles(tsc) 14.497 ns   improved 24.0%
  32 -  73 cycles(tsc) 18.491 ns  -  56 cycles(tsc) 14.227 ns   improved 23.3%
  34 -  75 cycles(tsc) 18.962 ns  -  58 cycles(tsc) 14.638 ns   improved 22.7%
  48 -  80 cycles(tsc) 20.049 ns  -  64 cycles(tsc) 16.247 ns   improved 20.0%
  64 -  87 cycles(tsc) 21.929 ns  -  74 cycles(tsc) 18.598 ns   improved 14.9%
 128 -  98 cycles(tsc) 24.511 ns  -  89 cycles(tsc) 22.295 ns   improved  9.2%
 158 - 101 cycles(tsc) 25.389 ns  -  93 cycles(tsc) 23.390 ns   improved  7.9%
 250 - 104 cycles(tsc) 26.170 ns  - 100 cycles(tsc) 25.112 ns   improved  3.8%

I'll do a compare against the previous patch, and post the results.
Jesper Dangaard Brouer June 16, 2015, 8:57 a.m. UTC | #7
On Tue, 16 Jun 2015 10:21:10 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> 
> On Tue, 16 Jun 2015 16:28:06 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> > Is this really better than just calling __kmem_cache_free_bulk()?
> 
> Yes, as can be seen by cover-letter, but my cover-letter does not seem
> to have reached mm-list.
> 
> Measurements for the entire patchset:
> 
> Bulk - Fallback bulking           - fastpath-bulking
>    1 -  47 cycles(tsc) 11.921 ns  -  45 cycles(tsc) 11.461 ns   improved  4.3%
>    2 -  46 cycles(tsc) 11.649 ns  -  28 cycles(tsc)  7.023 ns   improved 39.1%
>    3 -  46 cycles(tsc) 11.550 ns  -  22 cycles(tsc)  5.671 ns   improved 52.2%
>    4 -  45 cycles(tsc) 11.398 ns  -  19 cycles(tsc)  4.967 ns   improved 57.8%
>    8 -  45 cycles(tsc) 11.303 ns  -  17 cycles(tsc)  4.298 ns   improved 62.2%
>   16 -  44 cycles(tsc) 11.221 ns  -  17 cycles(tsc)  4.423 ns   improved 61.4%
>   30 -  75 cycles(tsc) 18.894 ns  -  57 cycles(tsc) 14.497 ns   improved 24.0%
>   32 -  73 cycles(tsc) 18.491 ns  -  56 cycles(tsc) 14.227 ns   improved 23.3%
>   34 -  75 cycles(tsc) 18.962 ns  -  58 cycles(tsc) 14.638 ns   improved 22.7%
>   48 -  80 cycles(tsc) 20.049 ns  -  64 cycles(tsc) 16.247 ns   improved 20.0%
>   64 -  87 cycles(tsc) 21.929 ns  -  74 cycles(tsc) 18.598 ns   improved 14.9%
>  128 -  98 cycles(tsc) 24.511 ns  -  89 cycles(tsc) 22.295 ns   improved  9.2%
>  158 - 101 cycles(tsc) 25.389 ns  -  93 cycles(tsc) 23.390 ns   improved  7.9%
>  250 - 104 cycles(tsc) 26.170 ns  - 100 cycles(tsc) 25.112 ns   improved  3.8%
> 
> I'll do a compare against the previous patch, and post the results.

Compare against previous patch:

Run:   previous-patch            - this patch
  1 -   49 cycles(tsc) 12.378 ns -  43 cycles(tsc) 10.775 ns  improved 12.2%
  2 -   37 cycles(tsc)  9.297 ns -  26 cycles(tsc)  6.652 ns  improved 29.7%
  3 -   33 cycles(tsc)  8.348 ns -  21 cycles(tsc)  5.347 ns  improved 36.4%
  4 -   31 cycles(tsc)  7.930 ns -  18 cycles(tsc)  4.669 ns  improved 41.9%
  8 -   30 cycles(tsc)  7.693 ns -  17 cycles(tsc)  4.404 ns  improved 43.3%
 16 -   32 cycles(tsc)  8.059 ns -  17 cycles(tsc)  4.493 ns  improved 46.9%
 30 -   65 cycles(tsc) 16.345 ns -  59 cycles(tsc) 14.858 ns  improved  9.2%
 32 -   64 cycles(tsc) 16.170 ns -  56 cycles(tsc) 14.074 ns  improved 12.5%
 34 -   66 cycles(tsc) 16.645 ns -  55 cycles(tsc) 13.882 ns  improved 16.7%
 48 -   78 cycles(tsc) 19.581 ns -  65 cycles(tsc) 16.266 ns  improved 16.7%
 64 -   81 cycles(tsc) 20.428 ns -  77 cycles(tsc) 19.432 ns  improved  4.9%
128 -   92 cycles(tsc) 23.030 ns -  78 cycles(tsc) 19.650 ns  improved 15.2%
158 -   94 cycles(tsc) 23.581 ns -  83 cycles(tsc) 20.953 ns  improved 11.7%
250 -   96 cycles(tsc) 24.175 ns -  93 cycles(tsc) 23.444 ns  improved  3.1%

This mostly amortizes the less-heavy non-locked cmpxchg_double used on the
fastpath.  As I demonstrated earlier, this fastpath cmpxchg is approx. 38%
of the fastpath cost.

Thus, yes, doing bulk free this way is faster ;-)
Jesper Dangaard Brouer June 16, 2015, 9:20 a.m. UTC | #8
On Tue, 16 Jun 2015 16:23:28 +0900
Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> On Mon, Jun 15, 2015 at 05:52:56PM +0200, Jesper Dangaard Brouer wrote:
> > This implements SLUB specific kmem_cache_free_bulk().  SLUB allocator
> > now both have bulk alloc and free implemented.
> > 
> > Play nice and reenable local IRQs while calling slowpath.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  mm/slub.c |   32 +++++++++++++++++++++++++++++++-
> >  1 file changed, 31 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 98d0e6f73ec1..cc4f870677bb 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2752,7 +2752,37 @@ EXPORT_SYMBOL(kmem_cache_free);
> >  
> >  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >  {
> > -	__kmem_cache_free_bulk(s, size, p);
> > +	struct kmem_cache_cpu *c;
> > +	struct page *page;
> > +	int i;
> > +
> > +	local_irq_disable();
> > +	c = this_cpu_ptr(s->cpu_slab);
> > +
> > +	for (i = 0; i < size; i++) {
> > +		void *object = p[i];
> > +
> > +		if (unlikely(!object))
> > +			continue; // HOW ABOUT BUG_ON()???
> > +
> > +		page = virt_to_head_page(object);
> > +		BUG_ON(s != page->slab_cache); /* Check if valid slab page */
> 
> You need to use cache_from_obj() to support kmemcg accounting.
> And, slab_free_hook() should be called before free.

Okay, but Christoph chose not to support kmem_cache_debug() in patch 2/7.

Should we/I try to add kmem cache debugging support?

If adding these, then I would also need to add those on alloc path...
Joonsoo Kim June 16, 2015, noon UTC | #9
2015-06-16 18:20 GMT+09:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Tue, 16 Jun 2015 16:23:28 +0900
> Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
>
>> On Mon, Jun 15, 2015 at 05:52:56PM +0200, Jesper Dangaard Brouer wrote:
>> > This implements SLUB specific kmem_cache_free_bulk().  SLUB allocator
>> > now both have bulk alloc and free implemented.
>> >
>> > Play nice and reenable local IRQs while calling slowpath.
>> >
>> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> > ---
>> >  mm/slub.c |   32 +++++++++++++++++++++++++++++++-
>> >  1 file changed, 31 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index 98d0e6f73ec1..cc4f870677bb 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -2752,7 +2752,37 @@ EXPORT_SYMBOL(kmem_cache_free);
>> >
>> >  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>> >  {
>> > -   __kmem_cache_free_bulk(s, size, p);
>> > +   struct kmem_cache_cpu *c;
>> > +   struct page *page;
>> > +   int i;
>> > +
>> > +   local_irq_disable();
>> > +   c = this_cpu_ptr(s->cpu_slab);
>> > +
>> > +   for (i = 0; i < size; i++) {
>> > +           void *object = p[i];
>> > +
>> > +           if (unlikely(!object))
>> > +                   continue; // HOW ABOUT BUG_ON()???
>> > +
>> > +           page = virt_to_head_page(object);
>> > +           BUG_ON(s != page->slab_cache); /* Check if valid slab page */
>>
>> You need to use cache_from_obj() to support kmemcg accounting.
>> And, slab_free_hook() should be called before free.
>
> Okay, but Christoph choose to not support kmem_cache_debug() in patch2/7.
>
> Should we/I try to add kmem cache debugging support?

kmem_cache_debug() is the check for the slab-internal debugging feature.
slab_free_hook() and the others I mentioned relate to external
debugging features such as kasan and kmemleak. So, even if a
debugged kmem_cache isn't supported by the bulk API, the external
debugging features should still be supported.

> If adding these, then I would also need to add those on alloc path...

Yes, please.

Thanks.
Joonsoo Kim June 16, 2015, 12:05 p.m. UTC | #10
2015-06-16 17:57 GMT+09:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Tue, 16 Jun 2015 10:21:10 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
>>
>> On Tue, 16 Jun 2015 16:28:06 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
>>
>> > Is this really better than just calling __kmem_cache_free_bulk()?
>>
>> Yes, as can be seen by cover-letter, but my cover-letter does not seem
>> to have reached mm-list.
>>
>> Measurements for the entire patchset:
>>
>> Bulk - Fallback bulking           - fastpath-bulking
>>    1 -  47 cycles(tsc) 11.921 ns  -  45 cycles(tsc) 11.461 ns   improved  4.3%
>>    2 -  46 cycles(tsc) 11.649 ns  -  28 cycles(tsc)  7.023 ns   improved 39.1%
>>    3 -  46 cycles(tsc) 11.550 ns  -  22 cycles(tsc)  5.671 ns   improved 52.2%
>>    4 -  45 cycles(tsc) 11.398 ns  -  19 cycles(tsc)  4.967 ns   improved 57.8%
>>    8 -  45 cycles(tsc) 11.303 ns  -  17 cycles(tsc)  4.298 ns   improved 62.2%
>>   16 -  44 cycles(tsc) 11.221 ns  -  17 cycles(tsc)  4.423 ns   improved 61.4%
>>   30 -  75 cycles(tsc) 18.894 ns  -  57 cycles(tsc) 14.497 ns   improved 24.0%
>>   32 -  73 cycles(tsc) 18.491 ns  -  56 cycles(tsc) 14.227 ns   improved 23.3%
>>   34 -  75 cycles(tsc) 18.962 ns  -  58 cycles(tsc) 14.638 ns   improved 22.7%
>>   48 -  80 cycles(tsc) 20.049 ns  -  64 cycles(tsc) 16.247 ns   improved 20.0%
>>   64 -  87 cycles(tsc) 21.929 ns  -  74 cycles(tsc) 18.598 ns   improved 14.9%
>>  128 -  98 cycles(tsc) 24.511 ns  -  89 cycles(tsc) 22.295 ns   improved  9.2%
>>  158 - 101 cycles(tsc) 25.389 ns  -  93 cycles(tsc) 23.390 ns   improved  7.9%
>>  250 - 104 cycles(tsc) 26.170 ns  - 100 cycles(tsc) 25.112 ns   improved  3.8%
>>
>> I'll do a compare against the previous patch, and post the results.
>
> Compare against previous patch:
>
> Run:   previous-patch            - this patch
>   1 -   49 cycles(tsc) 12.378 ns -  43 cycles(tsc) 10.775 ns  improved 12.2%
>   2 -   37 cycles(tsc)  9.297 ns -  26 cycles(tsc)  6.652 ns  improved 29.7%
>   3 -   33 cycles(tsc)  8.348 ns -  21 cycles(tsc)  5.347 ns  improved 36.4%
>   4 -   31 cycles(tsc)  7.930 ns -  18 cycles(tsc)  4.669 ns  improved 41.9%
>   8 -   30 cycles(tsc)  7.693 ns -  17 cycles(tsc)  4.404 ns  improved 43.3%
>  16 -   32 cycles(tsc)  8.059 ns -  17 cycles(tsc)  4.493 ns  improved 46.9%

So, in your test, most objects probably come from one or two slabs, and your
algorithm is well optimized for that case. But is this workload the normal case?
If most objects come from many different slabs, the bulk free API would
enable/disable interrupts very often, so I guess it would work worse than
just calling __kmem_cache_free_bulk(). Could you test this case?

Thanks.
Christoph Lameter (Ampere) June 16, 2015, 3:06 p.m. UTC | #11
On Tue, 16 Jun 2015, Joonsoo Kim wrote:

> > If adding these, then I would also need to add those on alloc path...
>
> Yes, please.

Let's fall back to the generic implementation for any of these things. We
need to focus on maximum performance in these functions. The more special
cases we have to handle, the more all of this gets compromised.

Christoph Lameter (Ampere) June 16, 2015, 3:10 p.m. UTC | #12
On Tue, 16 Jun 2015, Joonsoo Kim wrote:

> So, in your test, most of objects may come from one or two slabs and your
> algorithm is well optimized for this case. But, is this workload normal case?

It is normal if the objects were bulk allocated because SLUB ensures that
all objects are first allocated from one page before moving to another.

> If most of objects comes from many different slabs, bulk free API does
> enabling/disabling interrupt very much so I guess it work worse than
> just calling __kmem_cache_free_bulk(). Could you test this case?

In the case of SLAB this would be an issue, since the queueing mechanism
destroys spatial locality. This is much less of an issue for SLUB.


Jesper Dangaard Brouer June 16, 2015, 3:52 p.m. UTC | #13
On Tue, 16 Jun 2015 10:10:25 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Tue, 16 Jun 2015, Joonsoo Kim wrote:
> 
> > So, in your test, most of objects may come from one or two slabs and your
> > algorithm is well optimized for this case. But, is this workload normal case?
> 
> It is normal if the objects were bulk allocated because SLUB ensures that
> all objects are first allocated from one page before moving to another.

Yes, exactly.  Maybe SLAB is different? If so, we can handle that
in the SLAB-specific bulk implementation.


> > If most of objects comes from many different slabs, bulk free API does
> > enabling/disabling interrupt very much so I guess it work worse than
> > just calling __kmem_cache_free_bulk(). Could you test this case?
> 
> In case of SLAB this would be an issue since the queueing mechanism
> destroys spatial locality. This is much less an issue for SLUB.

I think Kim is worried about the cost of the enable/disable calls when
the slowpath gets called.  But it is not a problem, because the cost of
local_irq_{disable,enable} is very low (7 cycles in total).

It is very important that everybody realizes that the save+restore
variant is very expensive; this is key:

CPU: i7-4790K CPU @ 4.00GHz
 * local_irq_{disable,enable}:  7 cycles(tsc) - 1.821 ns
 * local_irq_{save,restore}  : 37 cycles(tsc) - 9.443 ns

Even if EVERY object needs to go through the slowpath/__slab_free(), it will
be faster than calling the fallback, because I've demonstrated that the call
to this_cpu_cmpxchg_double() costs 9 cycles.
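
For reference, a rough sketch of the kind of microbenchmark that produces
per-call cycle numbers like these (illustrative only; Jesper's figures come
from his own benchmarking modules, and a naive loop like this ignores
measurement overhead and CPU frequency effects):

/* Illustrative only: estimate the cost of an IRQ disable/enable pair
 * by timing a tight loop with get_cycles().
 */
#include <linux/kernel.h>
#include <linux/irqflags.h>
#include <asm/timex.h>

#define LOOPS 1000000ULL

static void measure_irq_toggle_cost(void)
{
	cycles_t start, stop;
	unsigned long long i;

	start = get_cycles();
	for (i = 0; i < LOOPS; i++) {
		local_irq_disable();
		local_irq_enable();
	}
	stop = get_cycles();

	pr_info("local_irq_{disable,enable}: ~%llu cycles per pair\n",
		(unsigned long long)(stop - start) / LOOPS);
}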
Christoph Lameter (Ampere) June 16, 2015, 4:04 p.m. UTC | #14
On Tue, 16 Jun 2015, Jesper Dangaard Brouer wrote:

> It is very important that everybody realizes that the save+restore
> variant is very expensive, this is key:
>
> CPU: i7-4790K CPU @ 4.00GHz
>  * local_irq_{disable,enable}:  7 cycles(tsc) - 1.821 ns
>  * local_irq_{save,restore}  : 37 cycles(tsc) - 9.443 ns
>
> Even if EVERY object need to call slowpath/__slab_free() it will be
> faster than calling the fallback.  Because I've demonstrated the call
> this_cpu_cmpxchg_double() costs 9 cycles.

But the cmpxchg also stores a value. You need to add the cost of the store
to the cycles.


Patch

diff --git a/mm/slub.c b/mm/slub.c
index 98d0e6f73ec1..cc4f870677bb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2752,7 +2752,37 @@  EXPORT_SYMBOL(kmem_cache_free);
 
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
-	__kmem_cache_free_bulk(s, size, p);
+	struct kmem_cache_cpu *c;
+	struct page *page;
+	int i;
+
+	local_irq_disable();
+	c = this_cpu_ptr(s->cpu_slab);
+
+	for (i = 0; i < size; i++) {
+		void *object = p[i];
+
+		if (unlikely(!object))
+			continue; // HOW ABOUT BUG_ON()???
+
+		page = virt_to_head_page(object);
+		BUG_ON(s != page->slab_cache); /* Check if valid slab page */
+
+		if (c->page == page) {
+			/* Fastpath: local CPU free */
+			set_freepointer(s, object, c->freelist);
+			c->freelist = object;
+		} else {
+			c->tid = next_tid(c->tid);
+			local_irq_enable();
+			/* Slowpath: overhead locked cmpxchg_double_slab */
+			__slab_free(s, page, object, _RET_IP_);
+			local_irq_disable();
+			c = this_cpu_ptr(s->cpu_slab);
+		}
+	}
+	c->tid = next_tid(c->tid);
+	local_irq_enable();
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);