From patchwork Thu Jun 4 10:31:59 2015
X-Patchwork-Submitter: Jesper Dangaard Brouer
X-Patchwork-Id: 480587
X-Patchwork-Delegate: davem@davemloft.net
From: Jesper Dangaard Brouer
To: Christoph Lameter
Cc: Jesper Dangaard Brouer, Joonsoo Kim, Alexander Duyck,
 linux-mm@kvack.org, netdev@vger.kernel.org
Date: Thu, 04 Jun 2015 12:31:59 +0200
Subject: [RFC PATCH] slub: RFC: Improving SLUB performance with 38% on NO-PREEMPT
Message-ID: <20150604103159.4744.75870.stgit@ivy>
User-Agent: StGIT/0.14.3
X-Mailing-List: netdev@vger.kernel.org

This patch improves the performance of the SLUB allocator fastpath by 38%,
by avoiding the call to this_cpu_cmpxchg_double() on NO-PREEMPT kernels.

Reviewers, please point out why this change is wrong, as such a large
improvement should not be possible ;-)

My primary motivation for this patch is to understand and microbenchmark
the MM layer of the kernel, due to increasing demand from the networking
stack. This "microbenchmark" merely demonstrates the cost of the
CMPXCHG16B instruction (without the LOCK prefix).

My microbenchmark is available on GitHub[1] (it reuses "qmempool_bench").

The fastpath-reuse (alloc+free) cost (CPU E5-2695):
 * 47 cycles(tsc) - 18.948 ns (normal, with this_cpu_cmpxchg_double)
 * 29 cycles(tsc) - 11.791 ns (with patch)

Thus, from the difference we can deduce the cost of CMPXCHG16B:
 * Total saved: 18 cycles - 7.157 ns
 * For the two CMPXCHG16B (alloc+free): 9 cycles - 3.579 ns saved per instruction
 * http://instlatx64.atw.hu/ lists a 9 cycle cost for CMPXCHG16B

This also shows that the cost of this_cpu_cmpxchg_double() in SLUB is
approx 38% of the fastpath cost.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/qmempool_bench.c

The cunning reviewer will also want to know the cost of disabling
interrupts on this CPU. Here it is interesting to see that the
save/restore variant is significantly more expensive:

Cost of local IRQ toggling (CPU E5-2695):
 * local_irq_{disable,enable}: 7 cycles(tsc) - 2.861 ns
 * local_irq_{save,restore} : 37 cycles(tsc) - 14.846 ns

Even with the additional overhead of local_irq_{disable,enable}, there
would still be a saving of 11 cycles (out of 47), i.e. 23%.
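For reference, below is a minimal sketch of the kind of measurement loop
involved. It is a simplified, hypothetical stand-in for qmempool_bench[1]
(the module and function names are made up), not the code that produced the
numbers above: it times back-to-back kmem_cache_alloc()/kmem_cache_free()
pairs with the TSC, which is the fastpath-reuse case since freeing the
just-allocated object keeps it on the per-cpu freelist.

/*
 * Illustrative sketch only: time N alloc+free round-trips in the
 * SLUB fastpath and report the average cost in TSC cycles.
 */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/preempt.h>
#include <linux/timex.h>	/* get_cycles() */

static int __init slab_fastpath_bench_init(void)
{
	struct kmem_cache *cache;
	unsigned long i, loops = 1000000;
	cycles_t start, stop;
	void *obj;

	cache = kmem_cache_create("fastpath_bench", 256, 0, 0, NULL);
	if (!cache)
		return -ENOMEM;

	preempt_disable();	/* keep the measurement on one CPU */
	start = get_cycles();
	for (i = 0; i < loops; i++) {
		obj = kmem_cache_alloc(cache, GFP_ATOMIC);
		if (!obj)
			break;
		kmem_cache_free(cache, obj);	/* stays in the fastpath */
	}
	stop = get_cycles();
	preempt_enable();

	pr_info("alloc+free fastpath: %llu cycles per round-trip\n",
		(unsigned long long)(stop - start) / loops);

	kmem_cache_destroy(cache);
	return 0;
}

static void __exit slab_fastpath_bench_exit(void)
{
}

module_init(slab_fastpath_bench_init);
module_exit(slab_fastpath_bench_exit);
MODULE_LICENSE("GPL");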
---
 mm/slub.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 54c0876..b31991f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2489,13 +2489,32 @@ redo:
	 * against code executing on this cpu *not* from access by
	 * other cpus.
	 */
-	if (unlikely(!this_cpu_cmpxchg_double(
-			s->cpu_slab->freelist, s->cpu_slab->tid,
-			object, tid,
-			next_object, next_tid(tid)))) {
-
-		note_cmpxchg_failure("slab_alloc", s, tid);
-		goto redo;
+	if (IS_ENABLED(CONFIG_PREEMPT)) {
+		if (unlikely(!this_cpu_cmpxchg_double(
+				s->cpu_slab->freelist, s->cpu_slab->tid,
+				object, tid,
+				next_object, next_tid(tid)))) {
+
+			note_cmpxchg_failure("slab_alloc", s, tid);
+			goto redo;
+		}
+	} else {
+		// HACK - On a NON-PREEMPT cmpxchg is not necessary(?)
+		__this_cpu_write(s->cpu_slab->tid, next_tid(tid));
+		__this_cpu_write(s->cpu_slab->freelist, next_object);
+		/*
+		 * Q: What happens in-case called from interrupt handler?
+		 *
+		 * If we need to disable (local) IRQs then most of the
+		 * saving is lost. E.g. the local_irq_{save,restore}
+		 * is too costly.
+		 *
+		 * Saved (alloc+free): 18 cycles - 7.157ns
+		 *
+		 * Cost of (CPU E5-2695):
+		 *  local_irq_{disable,enable}: 7 cycles(tsc) - 2.861 ns
+		 *  local_irq_{save,restore} : 37 cycles(tsc) - 14.846 ns
+		 */
 	}
 	prefetch_freepointer(s, next_object);
 	stat(s, ALLOC_FASTPATH);
@@ -2726,14 +2745,21 @@ redo:
 	if (likely(page == c->page)) {
 		set_freepointer(s, object, c->freelist);
 
-		if (unlikely(!this_cpu_cmpxchg_double(
-				s->cpu_slab->freelist, s->cpu_slab->tid,
-				c->freelist, tid,
-				object, next_tid(tid)))) {
+		if (IS_ENABLED(CONFIG_PREEMPT)) {
+			if (unlikely(!this_cpu_cmpxchg_double(
+					s->cpu_slab->freelist, s->cpu_slab->tid,
+					c->freelist, tid,
+					object, next_tid(tid)))) {
 
-			note_cmpxchg_failure("slab_free", s, tid);
-			goto redo;
+				note_cmpxchg_failure("slab_free", s, tid);
+				goto redo;
+			}
+		} else {
+			// HACK - On a NON-PREEMPT cmpxchg is not necessary(?)
+			__this_cpu_write(s->cpu_slab->tid, next_tid(tid));
+			__this_cpu_write(s->cpu_slab->freelist, object);
 		}
+
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
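For the open question in the code comment above (what happens when the
allocation or free is invoked from an interrupt handler), a possible
fallback, sketched here for discussion only and not part of the patch,
would be to bracket the non-PREEMPT branch with local_irq_{disable,enable}.
The sketch shows the slab_alloc side and assumes IRQs are known to be
enabled on entry; otherwise local_irq_{save,restore} would be needed, and
per the numbers above that variant would eat the entire saving:

	} else {
		/*
		 * Sketch only: guard the lockless per-cpu update against
		 * interrupt-context alloc/free. local_irq_{disable,enable}
		 * costs ~7 cycles, leaving ~11 of the 18 saved cycles,
		 * whereas local_irq_{save,restore} (~37 cycles) would wipe
		 * out the saving entirely.
		 */
		local_irq_disable();
		__this_cpu_write(s->cpu_slab->tid, next_tid(tid));
		__this_cpu_write(s->cpu_slab->freelist, next_object);
		local_irq_enable();
	}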