[MM,V4.1,5/6] slub: support for bulk free with SLUB freelists

Message ID 20151001151015.c59a1360c7720a257f655578@linux-foundation.org
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Andrew Morton Oct. 1, 2015, 10:10 p.m. UTC
On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Make it possible to free a freelist with several objects by adjusting
> API of slab_free() and __slab_free() to have head, tail and an objects
> counter (cnt).
> 
> Tail being NULL indicates a single-object free of the head object.  This
> allows compiler inline constant propagation in slab_free() and
> slab_free_freelist_hook() to avoid adding any overhead in the case of
> a single-object free.
> 
> This allows a freelist with several objects (all within the same
> slab-page) to be freed using a single locked cmpxchg_double in
> __slab_free() and with an unlocked cmpxchg_double in slab_free().
> 
> Object debugging on the free path is also extended to handle these
> freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> objects don't belong to the same slab-page.
> 
> These changes are needed for the next patch to bulk free the detached
> freelists it introduces and constructs.
> 
> Micro benchmarking showed no performance reduction due to this change,
> when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> 

checkpatch says

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#205: FILE: mm/slub.c:2888:
+       BUG_ON(!size);


Linus will get mad at you if he finds out, and we wouldn't want that.

Comments

Jesper Dangaard Brouer Oct. 2, 2015, 9:41 a.m. UTC | #1
On Thu, 1 Oct 2015 15:10:15 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > Make it possible to free a freelist with several objects by adjusting
> > API of slab_free() and __slab_free() to have head, tail and an objects
> > counter (cnt).
> > 
> > Tail being NULL indicate single object free of head object.  This
> > allow compiler inline constant propagation in slab_free() and
> > slab_free_freelist_hook() to avoid adding any overhead in case of
> > single object free.
> > 
> > This allows a freelist with several objects (all within the same
> > slab-page) to be free'ed using a single locked cmpxchg_double in
> > __slab_free() and with an unlocked cmpxchg_double in slab_free().
> > 
> > Object debugging on the free path is also extended to handle these
> > freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> > objects don't belong to the same slab-page.
> > 
> > These changes are needed for the next patch to bulk free the detached
> > freelists it introduces and constructs.
> > 
> > Micro benchmarking showed no performance reduction due to this change,
> > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> > 
> 
> checkpatch says
> 
> WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
> #205: FILE: mm/slub.c:2888:
> +       BUG_ON(!size);
> 
> 
> Linus will get mad at you if he finds out, and we wouldn't want that.
> 
> --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
> +++ a/mm/slub.c
> @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
>  /* Note that interrupts must be enabled when calling this function. */
>  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
> -	BUG_ON(!size);
> +	if (WARN_ON(!size))
> +		return;
>  
>  	do {
>  		struct detached_freelist df;
> _

My problem with this change is that WARN_ON generates (slightly) larger
code size, which is critical for instruction-cache usage...

 [net-next-mm]$ ./scripts/bloat-o-meter vmlinux-with_BUG_ON vmlinux-with_WARN_ON 
 add/remove: 0/0 grow/shrink: 1/0 up/down: 17/0 (17)
 function                                     old     new   delta
 kmem_cache_free_bulk                         438     455     +17

My IP-forwarding benchmark is actually a very challenging use-case,
because the code path "size" a packet has to travel is larger than the
instruction-cache of the CPU.

Thus, I need to introduce new code like this patch and at the same time
reduce the number of instruction-cache misses/usage.  In this
case we solve the problem by kmem_cache_free_bulk() not getting called
too often. Thus, +17 bytes will hopefully not matter too much... but on
the other hand we sort-of know that calling kmem_cache_free_bulk() will
cause icache misses.
Christoph Lameter (Ampere) Oct. 2, 2015, 10:10 a.m. UTC | #2
On Fri, 2 Oct 2015, Jesper Dangaard Brouer wrote:

> Thus, I need introducing new code like this patch and at the same time
> have to reduce the number of instruction-cache misses/usage.  In this
> case we solve the problem by kmem_cache_free_bulk() not getting called
> too often. Thus, +17 bytes will hopefully not matter too much... but on
> the other hand we sort-of know that calling kmem_cache_free_bulk() will
> cause icache misses.

Can we just drop the WARN/BUG here? Nothing untoward happens if size == 0,
right?

Jesper Dangaard Brouer Oct. 2, 2015, 10:40 a.m. UTC | #3
On Fri, 2 Oct 2015 05:10:02 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Fri, 2 Oct 2015, Jesper Dangaard Brouer wrote:
> 
> > Thus, I need introducing new code like this patch and at the same time
> > have to reduce the number of instruction-cache misses/usage.  In this
> > case we solve the problem by kmem_cache_free_bulk() not getting called
> > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > cause icache misses.
> 
> Can we just drop the WARN/BUG here? Nothing untoward happens if size == 0
> right?

I think we crash if size == 0, as we deref p[--size] in build_detached_freelist().


(Aside note: the code does handle NULL pointers in the p array.)
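
A minimal sketch of that failure mode (illustrative helper names only,
not the actual mm/slub.c code; it just assumes the detached-freelist
loop pops objects from the end of the array with p[--size], as
described above):

	#include <stddef.h>

	/* Illustrative sketch (not the real mm/slub.c code): size_t is
	 * unsigned, so with *size == 0 the "--(*size)" below wraps to
	 * SIZE_MAX and p[SIZE_MAX] is an out-of-bounds access.
	 */
	static void *pop_last_object(void **p, size_t *size)
	{
		return p[--(*size)];
	}

	static void free_bulk_sketch(size_t size, void **p)
	{
		if (!size)	/* the BUG_ON(!size) / WARN_ON(!size) guard */
			return;
		do {
			void *object = pop_last_object(p, &size);
			(void)object;	/* ... build detached freelist and free it ... */
		} while (size);
	}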
Jesper Dangaard Brouer Oct. 2, 2015, 1:40 p.m. UTC | #4
On Fri, 2 Oct 2015 11:41:18 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Thu, 1 Oct 2015 15:10:15 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > 
> > > Make it possible to free a freelist with several objects by adjusting
> > > API of slab_free() and __slab_free() to have head, tail and an objects
> > > counter (cnt).
> > > 
> > > Tail being NULL indicate single object free of head object.  This
> > > allow compiler inline constant propagation in slab_free() and
> > > slab_free_freelist_hook() to avoid adding any overhead in case of
> > > single object free.
> > > 
> > > This allows a freelist with several objects (all within the same
> > > slab-page) to be free'ed using a single locked cmpxchg_double in
> > > __slab_free() and with an unlocked cmpxchg_double in slab_free().
> > > 
> > > Object debugging on the free path is also extended to handle these
> > > freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> > > objects don't belong to the same slab-page.
> > > 
> > > These changes are needed for the next patch to bulk free the detached
> > > freelists it introduces and constructs.
> > > 
> > > Micro benchmarking showed no performance reduction due to this change,
> > > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> > > 
> > 
> > checkpatch says
> > 
> > WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
> > #205: FILE: mm/slub.c:2888:
> > +       BUG_ON(!size);
> > 
> > 
> > Linus will get mad at you if he finds out, and we wouldn't want that.
> > 
> > --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
> > +++ a/mm/slub.c
> > @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
> >  /* Note that interrupts must be enabled when calling this function. */
> >  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >  {
> > -	BUG_ON(!size);
> > +	if (WARN_ON(!size))
> > +		return;
> >  
> >  	do {
> >  		struct detached_freelist df;
> > _
> 
> My problem with this change is that WARN_ON generates (slightly) larger
> code size, which is critical for instruction-cache usage...
> 
>  [net-next-mm]$ ./scripts/bloat-o-meter vmlinux-with_BUG_ON vmlinux-with_WARN_ON 
>  add/remove: 0/0 grow/shrink: 1/0 up/down: 17/0 (17)
>  function                                     old     new   delta
>  kmem_cache_free_bulk                         438     455     +17
> 
> My IP-forwarding benchmark is actually a very challenging use-case,
> because the code path "size" a packet have to travel is larger than the
> instruction-cache of the CPU.
> 
> Thus, I need introducing new code like this patch and at the same time
> have to reduce the number of instruction-cache misses/usage.  In this
> case we solve the problem by kmem_cache_free_bulk() not getting called
> too often. Thus, +17 bytes will hopefully not matter too much... but on
> the other hand we sort-of know that calling kmem_cache_free_bulk() will
> cause icache misses.

I just tested this change on top of my net-use-case patchset... and for
some strange reason the code with this WARN_ON is faster and has far
fewer icache misses (1,278,276 vs 2,719,158 L1-icache-load-misses).

Thus, I think we should keep your fix.

I cannot explain why using WARN_ON() is better and causes fewer icache
misses.  And I hate it when I don't understand every detail.

My theory, after reading the assembler code, is that the UD2
instruction (from BUG_ON) causes some kind of icache decoder stall
(Intel experts???).  Now that should not be a problem, as UD2 is
obviously placed as an unlikely branch and left at the end of the asm
function.  But the call to __slab_free() is also placed at the end
of the asm function (it gets inlined from slab_free() as unlikely).  And
it is actually fairly likely that bulking calls __slab_free() (the slub
slowpath call).
Andrew Morton Oct. 2, 2015, 9:50 p.m. UTC | #5
On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > Thus, I need introducing new code like this patch and at the same time
> > have to reduce the number of instruction-cache misses/usage.  In this
> > case we solve the problem by kmem_cache_free_bulk() not getting called
> > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > cause icache misses.
> 
> I just tested this change on top of my net-use-case patchset... and for
> some strange reason the code with this WARN_ON is faster and have much
> less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> 
> Thus, I think we should keep your fix.
> 
> I cannot explain why using WARN_ON() is better and cause less icache
> misses.  And I hate when I don't understand every detail.
> 
>  My theory is, after reading the assembler code, that the UD2
> instruction (from BUG_ON) cause some kind of icache decoder stall
> (Intel experts???).  Now that should not be a problem, as UD2 is
> obviously placed as an unlikely branch and left at the end of the asm
> function call.  But the call to __slab_free() is also placed at the end
> of the asm function (gets inlined from slab_free() as unlikely).  And
> it is actually fairly likely that bulking is calling __slab_free (slub
> slowpath call).

Yes, I was looking at the asm code and the difference is pretty small:
a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

But I wouldn't assume that the microbenchmarking is meaningful.  I've
seen shockingly large (and quite repeatable) microbenchmarking
differences from small changes in code which isn't even executed (and
this is one such case, actually).  You add or remove just one byte of
text and half the kernel (or half the .o file?) gets a different
alignment and this seems to change everything.

Deleting the BUG altogether sounds like the best solution.  As long as the
kernel crashes in some manner, we'll be able to work out what happened.
And it's a can't-happen case anyway, isn't it?

Jesper Dangaard Brouer Oct. 5, 2015, 7:26 p.m. UTC | #6
On Fri, 2 Oct 2015 14:50:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > > Thus, I need introducing new code like this patch and at the same time
> > > have to reduce the number of instruction-cache misses/usage.  In this
> > > case we solve the problem by kmem_cache_free_bulk() not getting called
> > > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > > cause icache misses.
> > 
> > I just tested this change on top of my net-use-case patchset... and for
> > some strange reason the code with this WARN_ON is faster and have much
> > less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> > 
> > Thus, I think we should keep your fix.
> > 
> > I cannot explain why using WARN_ON() is better and cause less icache
> > misses.  And I hate when I don't understand every detail.
> > 
> >  My theory is, after reading the assembler code, that the UD2
> > instruction (from BUG_ON) cause some kind of icache decoder stall
> > (Intel experts???).  Now that should not be a problem, as UD2 is
> > obviously placed as an unlikely branch and left at the end of the asm
> > function call.  But the call to __slab_free() is also placed at the end
> > of the asm function (gets inlined from slab_free() as unlikely).  And
> > it is actually fairly likely that bulking is calling __slab_free (slub
> > slowpath call).
> 
> Yes, I was looking at the asm code and the difference is pretty small:
> a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

I actually managed to find documentation that the UD2 instruction does
stop the instruction prefetcher.

From "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
 Section 3.4.1.6 "Branch Type Selection":
 Cite:
  "[...] follow the indirect branch with a UD2 instruction, which will
   stop the processor from decoding down the fall-through path."

And from LWN article: https://lwn.net/Articles/255364/
 Cite: "[...] like using the ud2 instruction [...]. This instruction,
 which cannot be executed itself, is used after an indirect jump
 instruction; it is used as a signal to the instruction fetcher that
 the processor should not waste efforts decoding the following memory
 since the execution will continue at a different location."

Thus, we were hitting a very unfortunate combination of code, where the
UD2 instruction was located before the call to __slab_free() in the
asm code, causing instruction prefetching to fail/stop.

Now I understand all the details again :-)

My only remaining problem is that I want a perf measurement that
pinpoints these kinds of spots.  The difference in L1-icache-load-misses
was significant (1,278,276 vs 2,719,158).  I tried to perf record this
with different perf events without being able to pinpoint the location
(even though I know the spot now).  I even tried Andi's ocperf.py...
maybe he will know what event I should try?


> But I wouldn't assume that the microbenchmarking is meaningful.  I've
> seen shockingly large (and quite repeatable) microbenchmarking
> differences from small changes in code which isn't even executed (and
> this is one such case, actually).  You add or remove just one byte of
> text and half the kernel (or half the .o file?) gets a different
> alignment and this seems to change everything.

I do know alignment of code is important, but this performance impact
was just too large.  And I think I found documentation to back my theory.

 
> Deleting the BUG altogether sounds the best solution.  As long as the
> kernel crashes in some manner, we'll be able to work out what happened.
> And it's cant-happen anyway, isn't it?

To me WARN_ON() seems like a good "documentation" if it does not hurt
performance.  I don't think removing the WARN_ON() will improve
performance, but I'm willing to actually test if it matters.

Below is how it will crash; it is fairly obvious (if you know the x86_64
calling convention).

API is: void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)

Below it clearly shows the second arg "size" in %RSI == 0.

[ 2986.977708] BUG: unable to handle kernel paging request at ffffffffa03c4d58
[ 2986.977760] IP: [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.977805] PGD 19a4067 PUD 19a5063 PMD 408a66067 PTE 3f6407161
[ 2986.977879] Oops: 0003 [#1] SMP 
[ 2986.977936] Modules linked in: slab_bulk_test03(O+) time_bench(O) netconsole tun coretemp kvm_intel kvm mxm_wmi microcode sg i2c_i801 i2c_core pcspkr wmi video shpchp acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc serio_raw sd_mod hid_generic ixgbe mdio vxlan ip6_udp_tunnel e1000e udp_tunnel ptp pps_core mlx5_core [last unloaded: slab_bulk_test03]
[ 2986.978190] CPU: 1 PID: 12495 Comm: modprobe Tainted: G           O    4.3.0-rc2-net-next-mm+ #442
[ 2986.978265] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[ 2986.978341] task: ffff88040a89d200 ti: ffff8803e3c50000 task.ti: ffff8803e3c50000
[ 2986.978414] RIP: 0010:[<ffffffff8116d044>]  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.978490] RSP: 0018:ffff8803e3c53aa8  EFLAGS: 00010282
[ 2986.978528] RAX: 0000000000000000 RBX: ffff8803e3c53c30 RCX: 0000000100200016
[ 2986.978571] RDX: ffff8803e3c53ad8 RSI: 0000000000000000 RDI: 000077ff80000000
[ 2986.978614] RBP: ffff8803e3c53ad0 R08: ffff8803e3db3600 R09: 00000000e3db3b01
[ 2986.978655] R10: ffffffffa03c4d58 R11: 0000000000000001 R12: ffffffffffffffff
[ 2986.978698] R13: ffff8803e3c53ae0 R14: 000077ff80000000 R15: ffff88040d774c00
[ 2986.978740] FS:  00007fbf31190740(0000) GS:ffff88041fa40000(0000) knlGS:0000000000000000
[ 2986.978814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2986.978853] CR2: ffffffffa03c4d58 CR3: 00000003f6406000 CR4: 00000000001406e0
[ 2986.978895] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2986.978938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2986.978979] Stack:
[ 2986.979014]  ffff8803e3c53c30 0000000000000009 0000000000000000 ffffffffa002e000
[ 2986.979099]  ffff8803fc40db40 ffff8803e3c53cf0 ffffffffa03c4d58 0000000000000000
[ 2986.979184]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2986.979259] Call Trace:
[ 2986.979296]  [<ffffffffa002e000>] ? 0xffffffffa002e000
[ 2986.979335]  [<ffffffffa03c4d58>] run_try_crash_tests+0x378/0x3a6 [slab_bulk_test03]
[ 2986.979409]  [<ffffffffa002e12d>] slab_bulk_test03_module_init+0x12d/0x1000 [slab_bulk_test03]
[ 2986.979485]  [<ffffffff810002ed>] do_one_initcall+0xad/0x1d0
[ 2986.979526]  [<ffffffff8111c117>] do_init_module+0x60/0x1e8
[ 2986.979567]  [<ffffffff810c2778>] load_module+0x1bb8/0x2420
[ 2986.979608]  [<ffffffff810bf1b0>] ? __symbol_put+0x40/0x40
[ 2986.979647]  [<ffffffff810c31b0>] SyS_finit_module+0x80/0xb0
[ 2986.979690]  [<ffffffff8166b697>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 2986.979732] Code: f8 eb 05 4d 85 e4 74 13 4c 8b 10 49 83 ec 01 48 89 c2 48 83 e8 08 4d 85 d2 74 e8 4d 85 d2 0f 84 2e 01 00 00 49 63 47 20 4c 89 f7 <49> c7 04 02 00 00 00 00 b8 00 00 00 80 4c 01 d0 48 0f 42 3d b4 
[ 2986.979904] RIP  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.979945]  RSP <ffff8803e3c53aa8>
[ 2986.979982] CR2: ffffffffa03c4d58
[ 2986.980130] ---[ end trace 64f42b02f4347220 ]---


(gdb) list *(kmem_cache_free_bulk)+0x54
0xffffffff8116d044 is in kmem_cache_free_bulk (mm/slub.c:269).
264		return p;
265	}
266	
267	static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
268	{
269		*(void **)(object + s->offset) = fp;
270	}
271	
272	/* Loop over all objects in a slab */
273	#define for_each_object(__p, __s, __addr, __objects) \
Andi Kleen Oct. 5, 2015, 9:20 p.m. UTC | #7
> My only problem left, is I want a perf measurement that pinpoint these
> kind of spots.  The difference in L1-icache-load-misses were significant
> (1,278,276 vs 2,719,158).  I tried to somehow perf record this with
> different perf events without being able to pinpoint the location (even
> though I know the spot now).  Even tried Andi's ocperf.py... maybe he
> will know what event I should try?

Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
bottleneck is and what to sample for if there is a suitable event and
even prints the command line.

https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev

However frontend issues are difficult to sample, as they happen very far
away from instruction retirement where the sampling happens. So you may
have large skid and the sampling points may be far away. Skylake has new
special FRONTEND_* PEBS events for this, but before it was often difficult. 

BTW, if your main goal is icache: I wrote a gcc patch to help the kernel
by enabling function splitting: Apply the patch in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66890 to gcc 5,
make sure 9bebe9e5b0f (now in mainline) is applied and build with
-freorder-blocks-and-partition. That will split all functions into
statically predicted hot and cold parts and generally relieves
icache pressure. Any testing of this on your workload welcome.
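
A possible concrete invocation (an untested sketch: it assumes the
patched gcc 5 is installed as "gcc-5" and relies on the standard KCFLAGS
hook for appending extra compiler flags to a kernel build):

	make CC=gcc-5 KCFLAGS="-freorder-blocks-and-partition"

Adjust the compiler name to match your setup.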

-Andi

Jesper Dangaard Brouer Oct. 5, 2015, 11:07 p.m. UTC | #8
(trimmed Cc list a little)

On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:

> > My only problem left, is I want a perf measurement that pinpoint these
> > kind of spots.  The difference in L1-icache-load-misses were significant
> > (1,278,276 vs 2,719,158).  I tried to somehow perf record this with
> > different perf events without being able to pinpoint the location (even
> > though I know the spot now).  Even tried Andi's ocperf.py... maybe he
> > will know what event I should try?
> 
> Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> bottle neck is and what to sample for if there is a suitable event and
> even prints the command line.
> 
> https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev
> 

My result (from an IP-forward flow hitting CPU 0):
 $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0

So, what does this tell me?

 C0    BAD     Bad_Speculation:                                 0.00 % [  5.50%]
 C0    BE      Backend_Bound:                                 100.00 % [  5.50%]
 C0    BE/Mem  Backend_Bound.Memory_Bound:                     53.06 % [  5.50%]
 C0    BE/Core Backend_Bound.Core_Bound:                       46.94 % [  5.50%]
 C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [  5.50%]
 C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:            54.51 % [  5.50%]
 C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:     20.99 % [  5.60%]
 C0-T0         CPU utilization: 1.00 CPUs   	[100.00%]
 C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [  5.50%]
 C0-T1         CPU utilization: 1.00 CPUs   	[100.00%]

Unfortunately the perf command it gives me fails with:
 "invalid or unsupported event".

Perf command:

 perf record -g -e cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp -C 0,4 -a


> However frontend issues are difficult to sample, as they happen very far
> away from instruction retirement where the sampling happens. So you may
> have large skid and the sampling points may be far away. Skylake has new
> special FRONTEND_* PEBS events for this, but before it was often difficult. 

This testlab CPU is an i7-4790K @ 4.00GHz.  Maybe I should get a Skylake...


p.s. thanks for your pmu-tools[1], even though I don't know how to use
most of them ;-)
Jesper Dangaard Brouer Oct. 5, 2015, 11:53 p.m. UTC | #9
On Mon, 5 Oct 2015 21:26:39 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> My only problem left, is I want a perf measurement that pinpoint these
> kind of spots.  The difference in L1-icache-load-misses were significant
> (1,278,276 vs 2,719,158).  I tried to somehow perf record this with
> different perf events without being able to pinpoint the location (even
> though I know the spot now).  Even tried Andi's ocperf.py... maybe he
> will know what event I should try?

Using 'ocperf.py -e icache_misses', looking closer at the perf
annotate output, and considering "skid", I think I can see the icache
misses happening at the end of the function, due to the UD2 instruction.

Annotation of kmem_cache_free_bulk (last/end of func):

       │17b:   test   %r12,%r12
       │     ↑ jne    2e
       │184:   pop    %rbx
       │       pop    %r12
       │       pop    %r13
       │       pop    %r14
       │       pop    %r15
       │       pop    %rbp
       │     ← retq
  8.57 │18f:   mov    0x30(%rdx),%rdx
  5.71 │     ↑ jmp    116
       │195:   ud2
  2.86 │197:   mov    %rdi,%rsi
       │       mov    %r11d,%r8d
       │       mov    %r10,%rcx
       │       mov    %rbx,%rdx
       │       mov    %r15,%rdi
       │     → callq  __slab_free
       │     ↑ jmp    17b
  2.86 │1ad:   mov    0x30(%rdi),%rdi
       │     ↑ jmpq   99
Jesper Dangaard Brouer Oct. 7, 2015, 10:39 a.m. UTC | #10
On Mon, 5 Oct 2015 21:26:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> On Fri, 2 Oct 2015 14:50:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
[...] 
>  
> > Deleting the BUG altogether sounds the best solution.  As long as the
> > kernel crashes in some manner, we'll be able to work out what happened.
> > And it's cant-happen anyway, isn't it?
> 
> To me WARN_ON() seems like a good "documentation" if it does not hurt
> performance.  I don't think removing the WARN_ON() will improve
> performance, but I'm willing to actually test if it matters.

I tested removing the BUG/WARN_ON altogether, and it gives slightly worse
performance. The icache misses only increase by approx 14% (not 112% as
before).  This I'm willing to attribute to some code alignment issue.

Thus, let us just keep the WARN_ON() and move along.
Jesper Dangaard Brouer Oct. 7, 2015, 12:31 p.m. UTC | #11
On Tue, 6 Oct 2015 01:07:03 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> (trimmed Cc list a little)
> 
> On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:
> 
> > > My only problem left, is I want a perf measurement that pinpoint these
> > > kind of spots.  The difference in L1-icache-load-misses were significant
> > > (1,278,276 vs 2,719,158).  I tried to somehow perf record this with
> > > different perf events without being able to pinpoint the location (even
> > > though I know the spot now).  Even tried Andi's ocperf.py... maybe he
> > > will know what event I should try?
> > 
> > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> > bottle neck is and what to sample for if there is a suitable event and
> > even prints the command line.
> > 
> > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev
> > 
> 
> My result from (IP-forward flow hitting CPU 0):
>  $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0
> 
> So, what does this tell me?:
>
>  C0    BAD     Bad_Speculation:                                 0.00 % [  5.50%]
>  C0    BE      Backend_Bound:                                 100.00 % [  5.50%]
>  C0    BE/Mem  Backend_Bound.Memory_Bound:                     53.06 % [  5.50%]
>  C0    BE/Core Backend_Bound.Core_Bound:                       46.94 % [  5.50%]
>  C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [  5.50%]
>  C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:            54.51 % [  5.50%]
>  C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:     20.99 % [  5.60%]
>  C0-T0         CPU utilization: 1.00 CPUs   	[100.00%]
>  C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [  5.50%]
>  C0-T1         CPU utilization: 1.00 CPUs   	[100.00%]

Reading https://github.com/andikleen/pmu-tools/wiki/toplev-manual
helped me understand most of the above.

My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
limited "Frontend" support. E.g. 

 # perf record -g -a -e stalled-cycles-frontend
 Error:
 The stalled-cycles-frontend event is not supported.

And AFAIK icache misses are part of "frontend".


> Unfortunately the perf command it gives me fails with:
>  "invalid or unsupported event".
> 
> Perf command:
> 
>  sudo ./ocperf.py record -g -e \
  cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,\
  cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
  cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,\
  cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp \
  -C 0,4 -a

I fixed the problem with this perf command by removing the ":pp" part.
Perhaps your tool needs to fix that?

A working command line looks like this:

 sudo ./ocperf.py record -g -e \
cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,\
cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT,period=2000003/pp,\
cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB,period=100003/pp \
  -C 0,4 -a
Arnaldo Carvalho de Melo Oct. 7, 2015, 1:36 p.m. UTC | #12
On Wed, Oct 07, 2015 at 02:31:20PM +0200, Jesper Dangaard Brouer wrote:
> On Tue, 6 Oct 2015 01:07:03 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > (trimmed Cc list a little)
> > 
> > On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:
> > 
> > > > My only problem left, is I want a perf measurement that pinpoint these
> > > > kind of spots.  The difference in L1-icache-load-misses were significant
> > > > (1,278,276 vs 2,719,158).  I tried to somehow perf record this with
> > > > different perf events without being able to pinpoint the location (even
> > > > though I know the spot now).  Even tried Andi's ocperf.py... maybe he
> > > > will know what event I should try?
> > > 
> > > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> > > bottle neck is and what to sample for if there is a suitable event and
> > > even prints the command line.
> > > 
> > > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev
> > > 
> > 
> > My result from (IP-forward flow hitting CPU 0):
> >  $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0
> > 
> > So, what does this tell me?:
> >
> >  C0    BAD     Bad_Speculation:                                 0.00 % [  5.50%]
> >  C0    BE      Backend_Bound:                                 100.00 % [  5.50%]
> >  C0    BE/Mem  Backend_Bound.Memory_Bound:                     53.06 % [  5.50%]
> >  C0    BE/Core Backend_Bound.Core_Bound:                       46.94 % [  5.50%]
> >  C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [  5.50%]
> >  C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:            54.51 % [  5.50%]
> >  C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:     20.99 % [  5.60%]
> >  C0-T0         CPU utilization: 1.00 CPUs   	[100.00%]
> >  C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [  5.50%]
> >  C0-T1         CPU utilization: 1.00 CPUs   	[100.00%]
> 
> Reading: https://github.com/andikleen/pmu-tools/wiki/toplev-manual
> Helped me understand most of above.
> 
> My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
> limited "Frontend" support. E.g. 
> 
>  # perf record -g -a -e stalled-cycles-frontend
>  Error:
>  The stalled-cycles-frontend event is not supported.
> 
> And AFAIK icache misses are part of "frontend".
> 
> 
> > Unfortunately the perf command it gives me fails with:
> >  "invalid or unsupported event".
> > 
> > Perf command:
> > 
> >  sudo ./ocperf.py record -g -e \
>   cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,\
>   cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
>   cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,\
>   cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp \
>   -C 0,4 -a
> 
> I fixed the problem with this perf command by removing the ":pp" part.
> Perhaps your tool need to fix that?
> 
> A working command line looks like this:
> 
>  sudo ./ocperf.py record -g -e \
> cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,\
> cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
> cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT,period=2000003/pp,\
> cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB,period=100003/pp \
>   -C 0,4 -a

There is a recent patch that may help here, see below, but maybe it's
just a matter of removing that :pp, as it ends with a /pp anyway, no
need to state that twice :)

With the patch below all those /pp would be replaced with /P.

- Arnaldo


https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/tools/perf?id=7f94af7a489fada17d28cc60e8f4409ce216bd6d

----------------------------------------------------------------------
perf tools: Introduce 'P' modifier to request max precision
The 'P' will cause the event to get maximum possible detected precise
level.

Following record:
  $ perf record -e cycles:P ...

will detect maximum precise level for 'cycles' event and use it.
----------------------------------------------------------------------

- Arnaldo
Andi Kleen Oct. 7, 2015, 3:44 p.m. UTC | #13
> There is a recent patch that may help here, see below, but maybe its
> just a matter of removing that :pp, as it ends with a /pp anyway, no
> need to state that twice :)

Yes the extra :pp was a regression in toplev. I fixed it now. Thanks.

-Andi
Andi Kleen Oct. 7, 2015, 4:06 p.m. UTC | #14
> My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
> limited "Frontend" support. E.g. 
> 
>  # perf record -g -a -e stalled-cycles-frontend
>  Error:
>  The stalled-cycles-frontend event is not supported.
> 
> And AFAIK icache misses are part of "frontend".

Ignore stalled-cycles-frontend. It is very unreliable and has never worked right.
toplev gives you much more reliable output.


-Andi

Patch

--- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
+++ a/mm/slub.c
@@ -2885,7 +2885,8 @@  static int build_detached_freelist(struc
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
-	BUG_ON(!size);
+	if (WARN_ON(!size))
+		return;
 
 	do {
 		struct detached_freelist df;