Message ID: 20151001151015.c59a1360c7720a257f655578@linux-foundation.org
State: Not Applicable, archived
Delegated to: David Miller
On Thu, 1 Oct 2015 15:10:15 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> > Make it possible to free a freelist with several objects by adjusting
> > API of slab_free() and __slab_free() to have head, tail and an objects
> > counter (cnt).
> >
> > Tail being NULL indicate single object free of head object. This
> > allow compiler inline constant propagation in slab_free() and
> > slab_free_freelist_hook() to avoid adding any overhead in case of
> > single object free.
> >
> > This allows a freelist with several objects (all within the same
> > slab-page) to be free'ed using a single locked cmpxchg_double in
> > __slab_free() and with an unlocked cmpxchg_double in slab_free().
> >
> > Object debugging on the free path is also extended to handle these
> > freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
> > objects don't belong to the same slab-page.
> >
> > These changes are needed for the next patch to bulk free the detached
> > freelists it introduces and constructs.
> >
> > Micro benchmarking showed no performance reduction due to this change,
> > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
>
> checkpatch says
>
> WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
> #205: FILE: mm/slub.c:2888:
> +	BUG_ON(!size);
>
> Linus will get mad at you if he finds out, and we wouldn't want that.
>
> --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
> +++ a/mm/slub.c
> @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
>  /* Note that interrupts must be enabled when calling this function. */
>  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
> -	BUG_ON(!size);
> +	if (WARN_ON(!size))
> +		return;
>
>  	do {
>  		struct detached_freelist df;
> _

My problem with this change is that WARN_ON generates (slightly) larger code size, which is critical for instruction-cache usage...

[net-next-mm]$ ./scripts/bloat-o-meter vmlinux-with_BUG_ON vmlinux-with_WARN_ON
add/remove: 0/0 grow/shrink: 1/0 up/down: 17/0 (17)
function                     old     new   delta
kmem_cache_free_bulk         438     455     +17

My IP-forwarding benchmark is actually a very challenging use-case, because the code path "size" a packet has to travel is larger than the instruction-cache of the CPU.

Thus, I need to introduce new code like this patch and at the same time have to reduce the number of instruction-cache misses/usage. In this case we solve the problem by kmem_cache_free_bulk() not getting called too often. Thus, +17 bytes will hopefully not matter too much... but on the other hand we sort-of know that calling kmem_cache_free_bulk() will cause icache misses.
On Fri, 2 Oct 2015, Jesper Dangaard Brouer wrote:

> Thus, I need to introduce new code like this patch and at the same time
> have to reduce the number of instruction-cache misses/usage. In this
> case we solve the problem by kmem_cache_free_bulk() not getting called
> too often. Thus, +17 bytes will hopefully not matter too much... but on
> the other hand we sort-of know that calling kmem_cache_free_bulk() will
> cause icache misses.

Can we just drop the WARN/BUG here? Nothing untoward happens if size == 0, right?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2 Oct 2015 05:10:02 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:

> On Fri, 2 Oct 2015, Jesper Dangaard Brouer wrote:
>
> > Thus, I need to introduce new code like this patch and at the same time
> > have to reduce the number of instruction-cache misses/usage. In this
> > case we solve the problem by kmem_cache_free_bulk() not getting called
> > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > cause icache misses.
>
> Can we just drop the WARN/BUG here? Nothing untoward happens if size == 0, right?

I think we crash if size == 0, as we deref p[--size] in build_detached_freelist().

(Aside note: the code does handle NULL pointers in the p array.)
On Fri, 2 Oct 2015 11:41:18 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Thu, 1 Oct 2015 15:10:15 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >
> > > Make it possible to free a freelist with several objects by adjusting
> > > API of slab_free() and __slab_free() to have head, tail and an objects
> > > counter (cnt).
> > >
> > > Tail being NULL indicate single object free of head object. This
> > > allow compiler inline constant propagation in slab_free() and
> > > slab_free_freelist_hook() to avoid adding any overhead in case of
> > > single object free.
> > >
> > > This allows a freelist with several objects (all within the same
> > > slab-page) to be free'ed using a single locked cmpxchg_double in
> > > __slab_free() and with an unlocked cmpxchg_double in slab_free().
> > >
> > > Object debugging on the free path is also extended to handle these
> > > freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
> > > objects don't belong to the same slab-page.
> > >
> > > These changes are needed for the next patch to bulk free the detached
> > > freelists it introduces and constructs.
> > >
> > > Micro benchmarking showed no performance reduction due to this change,
> > > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> >
> > checkpatch says
> >
> > WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
> > #205: FILE: mm/slub.c:2888:
> > +	BUG_ON(!size);
> >
> > Linus will get mad at you if he finds out, and we wouldn't want that.
> >
> > --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
> > +++ a/mm/slub.c
> > @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
> >  /* Note that interrupts must be enabled when calling this function. */
> >  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >  {
> > -	BUG_ON(!size);
> > +	if (WARN_ON(!size))
> > +		return;
> >
> >  	do {
> >  		struct detached_freelist df;
> > _
>
> My problem with this change is that WARN_ON generates (slightly) larger
> code size, which is critical for instruction-cache usage...
>
> [net-next-mm]$ ./scripts/bloat-o-meter vmlinux-with_BUG_ON vmlinux-with_WARN_ON
> add/remove: 0/0 grow/shrink: 1/0 up/down: 17/0 (17)
> function                     old     new   delta
> kmem_cache_free_bulk         438     455     +17
>
> My IP-forwarding benchmark is actually a very challenging use-case,
> because the code path "size" a packet has to travel is larger than the
> instruction-cache of the CPU.
>
> Thus, I need to introduce new code like this patch and at the same time
> have to reduce the number of instruction-cache misses/usage. In this
> case we solve the problem by kmem_cache_free_bulk() not getting called
> too often. Thus, +17 bytes will hopefully not matter too much... but on
> the other hand we sort-of know that calling kmem_cache_free_bulk() will
> cause icache misses.

I just tested this change on top of my net-use-case patchset... and for some strange reason the code with this WARN_ON is faster and has far fewer icache misses (1,278,276 vs 2,719,158 L1-icache-load-misses).

Thus, I think we should keep your fix.

I cannot explain why using WARN_ON() is better and causes fewer icache misses. And I hate when I don't understand every detail.

My theory is, after reading the assembler code, that the UD2 instruction (from BUG_ON) causes some kind of icache decoder stall (Intel experts???). Now that should not be a problem, as UD2 is obviously placed as an unlikely branch and left at the end of the asm function. But the call to __slab_free() is also placed at the end of the asm function (gets inlined from slab_free() as unlikely). And it is actually fairly likely that bulking is calling __slab_free (the slub slowpath call).
On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > Thus, I need to introduce new code like this patch and at the same time
> > have to reduce the number of instruction-cache misses/usage. In this
> > case we solve the problem by kmem_cache_free_bulk() not getting called
> > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > cause icache misses.
>
> I just tested this change on top of my net-use-case patchset... and for
> some strange reason the code with this WARN_ON is faster and has far
> fewer icache misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
>
> Thus, I think we should keep your fix.
>
> I cannot explain why using WARN_ON() is better and causes fewer icache
> misses. And I hate when I don't understand every detail.
>
> My theory is, after reading the assembler code, that the UD2
> instruction (from BUG_ON) causes some kind of icache decoder stall
> (Intel experts???). Now that should not be a problem, as UD2 is
> obviously placed as an unlikely branch and left at the end of the asm
> function. But the call to __slab_free() is also placed at the end
> of the asm function (gets inlined from slab_free() as unlikely). And
> it is actually fairly likely that bulking is calling __slab_free (the
> slub slowpath call).

Yes, I was looking at the asm code and the difference is pretty small: a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

But I wouldn't assume that the microbenchmarking is meaningful. I've seen shockingly large (and quite repeatable) microbenchmarking differences from small changes in code which isn't even executed (and this is one such case, actually). You add or remove just one byte of text and half the kernel (or half the .o file?) gets a different alignment, and this seems to change everything.

Deleting the BUG altogether sounds the best solution. As long as the kernel crashes in some manner, we'll be able to work out what happened. And it's can't-happen anyway, isn't it?
On Fri, 2 Oct 2015 14:50:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> > > Thus, I need to introduce new code like this patch and at the same time
> > > have to reduce the number of instruction-cache misses/usage. In this
> > > case we solve the problem by kmem_cache_free_bulk() not getting called
> > > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > > cause icache misses.
> >
> > I just tested this change on top of my net-use-case patchset... and for
> > some strange reason the code with this WARN_ON is faster and has far
> > fewer icache misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> >
> > Thus, I think we should keep your fix.
> >
> > I cannot explain why using WARN_ON() is better and causes fewer icache
> > misses. And I hate when I don't understand every detail.
> >
> > My theory is, after reading the assembler code, that the UD2
> > instruction (from BUG_ON) causes some kind of icache decoder stall
> > (Intel experts???). Now that should not be a problem, as UD2 is
> > obviously placed as an unlikely branch and left at the end of the asm
> > function. But the call to __slab_free() is also placed at the end
> > of the asm function (gets inlined from slab_free() as unlikely). And
> > it is actually fairly likely that bulking is calling __slab_free (the
> > slub slowpath call).
>
> Yes, I was looking at the asm code and the difference is pretty small:
> a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

Actually managed to find documentation that the UD2 instruction does stop the instruction prefetcher.

From the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", Section 3.4.1.6 "Branch Type Selection":

Cite: "[...] follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path."

And from the LWN article https://lwn.net/Articles/255364/:

Cite: "[...] like using the ud2 instruction [...]. This instruction, which cannot be executed itself, is used after an indirect jump instruction; it is used as a signal to the instruction fetcher that the processor should not waste efforts decoding the following memory since the execution will continue at a different location."

Thus, we were hitting a very unfortunate combination of code, where the UD2 instruction was located before the call to __slab_free() in the ASM code, causing instruction prefetching to fail/stop. Now I understand all the details again :-)

My only problem left is that I want a perf measurement that pinpoints these kinds of spots. The difference in L1-icache-load-misses was significant (1,278,276 vs 2,719,158). I tried to somehow perf record this with different perf events without being able to pinpoint the location (even though I know the spot now). Even tried Andi's ocperf.py... maybe he will know what event I should try?

> But I wouldn't assume that the microbenchmarking is meaningful. I've
> seen shockingly large (and quite repeatable) microbenchmarking
> differences from small changes in code which isn't even executed (and
> this is one such case, actually). You add or remove just one byte of
> text and half the kernel (or half the .o file?) gets a different
> alignment, and this seems to change everything.

I do know alignment of code is important, but this performance impact was just too large. And I think I found documentation to back my theory.

> Deleting the BUG altogether sounds the best solution. As long as the
> kernel crashes in some manner, we'll be able to work out what happened.
> And it's can't-happen anyway, isn't it?

To me WARN_ON() seems like good "documentation" if it does not hurt performance.
I don't think removing the WARN_ON() will improve performance, but I'm willing to actually test if it matters.

Below is how it will crash; it is fairly obvious (if you know the x86_64 call convention). The API is:

 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)

Below it clearly shows the second arg "size" in %RSI == 0.

[ 2986.977708] BUG: unable to handle kernel paging request at ffffffffa03c4d58
[ 2986.977760] IP: [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.977805] PGD 19a4067 PUD 19a5063 PMD 408a66067 PTE 3f6407161
[ 2986.977879] Oops: 0003 [#1] SMP
[ 2986.977936] Modules linked in: slab_bulk_test03(O+) time_bench(O) netconsole tun coretemp kvm_intel kvm mxm_wmi microcode sg i2c_i801 i2c_core pcspkr wmi video shpchp acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc serio_raw sd_mod hid_generic ixgbe mdio vxlan ip6_udp_tunnel e1000e udp_tunnel ptp pps_core mlx5_core [last unloaded: slab_bulk_test03]
[ 2986.978190] CPU: 1 PID: 12495 Comm: modprobe Tainted: G O 4.3.0-rc2-net-next-mm+ #442
[ 2986.978265] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[ 2986.978341] task: ffff88040a89d200 ti: ffff8803e3c50000 task.ti: ffff8803e3c50000
[ 2986.978414] RIP: 0010:[<ffffffff8116d044>]  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.978490] RSP: 0018:ffff8803e3c53aa8  EFLAGS: 00010282
[ 2986.978528] RAX: 0000000000000000 RBX: ffff8803e3c53c30 RCX: 0000000100200016
[ 2986.978571] RDX: ffff8803e3c53ad8 RSI: 0000000000000000 RDI: 000077ff80000000
[ 2986.978614] RBP: ffff8803e3c53ad0 R08: ffff8803e3db3600 R09: 00000000e3db3b01
[ 2986.978655] R10: ffffffffa03c4d58 R11: 0000000000000001 R12: ffffffffffffffff
[ 2986.978698] R13: ffff8803e3c53ae0 R14: 000077ff80000000 R15: ffff88040d774c00
[ 2986.978740] FS:  00007fbf31190740(0000) GS:ffff88041fa40000(0000) knlGS:0000000000000000
[ 2986.978814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2986.978853] CR2: ffffffffa03c4d58 CR3: 00000003f6406000 CR4: 00000000001406e0
[ 2986.978895] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2986.978938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2986.978979] Stack:
[ 2986.979014]  ffff8803e3c53c30 0000000000000009 0000000000000000 ffffffffa002e000
[ 2986.979099]  ffff8803fc40db40 ffff8803e3c53cf0 ffffffffa03c4d58 0000000000000000
[ 2986.979184]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2986.979259] Call Trace:
[ 2986.979296]  [<ffffffffa002e000>] ? 0xffffffffa002e000
[ 2986.979335]  [<ffffffffa03c4d58>] run_try_crash_tests+0x378/0x3a6 [slab_bulk_test03]
[ 2986.979409]  [<ffffffffa002e12d>] slab_bulk_test03_module_init+0x12d/0x1000 [slab_bulk_test03]
[ 2986.979485]  [<ffffffff810002ed>] do_one_initcall+0xad/0x1d0
[ 2986.979526]  [<ffffffff8111c117>] do_init_module+0x60/0x1e8
[ 2986.979567]  [<ffffffff810c2778>] load_module+0x1bb8/0x2420
[ 2986.979608]  [<ffffffff810bf1b0>] ? __symbol_put+0x40/0x40
[ 2986.979647]  [<ffffffff810c31b0>] SyS_finit_module+0x80/0xb0
[ 2986.979690]  [<ffffffff8166b697>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 2986.979732] Code: f8 eb 05 4d 85 e4 74 13 4c 8b 10 49 83 ec 01 48 89 c2 48 83 e8 08 4d 85 d2 74 e8 4d 85 d2 0f 84 2e 01 00 00 49 63 47 20 4c 89 f7 <49> c7 04 02 00 00 00 00 b8 00 00 00 80 4c 01 d0 48 0f 42 3d b4
[ 2986.979904] RIP  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.979945]  RSP <ffff8803e3c53aa8>
[ 2986.979982] CR2: ffffffffa03c4d58
[ 2986.980130] ---[ end trace 64f42b02f4347220 ]---

(gdb) list *(kmem_cache_free_bulk)+0x54
0xffffffff8116d044 is in kmem_cache_free_bulk (mm/slub.c:269).
264			return p;
265	}
266
267	static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
268	{
269		*(void **)(object + s->offset) = fp;
270	}
271
272	/* Loop over all objects in a slab */
273	#define for_each_object(__p, __s, __addr, __objects) \
> My only problem left is that I want a perf measurement that pinpoints these
> kinds of spots. The difference in L1-icache-load-misses was significant
> (1,278,276 vs 2,719,158). I tried to somehow perf record this with
> different perf events without being able to pinpoint the location (even
> though I know the spot now). Even tried Andi's ocperf.py... maybe he
> will know what event I should try?

Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the bottleneck is and what to sample for if there is a suitable event, and even prints the command line.

https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev

However, frontend issues are difficult to sample, as they happen very far away from instruction retirement, where the sampling happens. So you may have large skid and the sampling points may be far away. Skylake has new special FRONTEND_* PEBS events for this, but before that it was often difficult.

BTW, if your main goal is icache: I wrote a gcc patch to help the kernel by enabling function splitting. Apply the patch in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66890 to gcc 5, make sure 9bebe9e5b0f (now in mainline) is applied, and build with -freorder-blocks-and-partition. That will split all functions into statically predicted hot and cold parts and generally relieves icache pressure.

Any testing of this on your workload welcome.

-Andi
(trimmed Cc list a little)

On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:

> > My only problem left is that I want a perf measurement that pinpoints these
> > kinds of spots. The difference in L1-icache-load-misses was significant
> > (1,278,276 vs 2,719,158). I tried to somehow perf record this with
> > different perf events without being able to pinpoint the location (even
> > though I know the spot now). Even tried Andi's ocperf.py... maybe he
> > will know what event I should try?
>
> Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> bottleneck is and what to sample for if there is a suitable event, and
> even prints the command line.
>
> https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev

My result from (IP-forward flow hitting CPU 0):

 $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0

So, what does this tell me?

C0    BAD     Bad_Speculation:                                   0.00 %  [ 5.50%]
C0    BE      Backend_Bound:                                   100.00 %  [ 5.50%]
C0    BE/Mem  Backend_Bound.Memory_Bound:                       53.06 %  [ 5.50%]
C0    BE/Core Backend_Bound.Core_Bound:                         46.94 %  [ 5.50%]
C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers:   5.42 %  [ 5.50%]
C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:              54.51 %  [ 5.50%]
C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:       20.99 %  [ 5.60%]
C0-T0         CPU utilization:  1.00 CPUs  [100.00%]
C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers:   6.04 %  [ 5.50%]
C0-T1         CPU utilization:  1.00 CPUs  [100.00%]

Unfortunately the perf command it gives me fails with "invalid or unsupported event".

Perf command:

 perf record -g -e cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp -C 0,4 -a

> However frontend issues are difficult to sample, as they happen very far
> away from instruction retirement where the sampling happens. So you may
> have large skid and the sampling points may be far away. Skylake has new
> special FRONTEND_* PEBS events for this, but before it was often difficult.

This testlab CPU is an i7-4790K @ 4.00GHz. Maybe I should get a Skylake...

p.s. thanks for your pmu-tools[1], even though I don't know how to use most of them ;-)
On Mon, 5 Oct 2015 21:26:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> My only problem left is that I want a perf measurement that pinpoints these
> kinds of spots. The difference in L1-icache-load-misses was significant
> (1,278,276 vs 2,719,158). I tried to somehow perf record this with
> different perf events without being able to pinpoint the location (even
> though I know the spot now). Even tried Andi's ocperf.py... maybe he
> will know what event I should try?

Using 'ocperf.py -e icache_misses', looking closer at the perf annotate, and accounting for "skid", I think I can see the icache misses happening at the end of the function, due to the UD2 instruction.

Annotation of kmem_cache_free_bulk (last/end of func):

      │17b:   test   %r12,%r12
      │     ↑ jne    2e
      │184:   pop    %rbx
      │       pop    %r12
      │       pop    %r13
      │       pop    %r14
      │       pop    %r15
      │       pop    %rbp
      │     ← retq
 8.57 │18f:   mov    0x30(%rdx),%rdx
 5.71 │     ↑ jmp    116
      │195:   ud2
 2.86 │197:   mov    %rdi,%rsi
      │       mov    %r11d,%r8d
      │       mov    %r10,%rcx
      │       mov    %rbx,%rdx
      │       mov    %r15,%rdi
      │     → callq  __slab_free
      │     ↑ jmp    17b
 2.86 │1ad:   mov    0x30(%rdi),%rdi
      │     ↑ jmpq   99
On Mon, 5 Oct 2015 21:26:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Fri, 2 Oct 2015 14:50:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> [...]
> > Deleting the BUG altogether sounds the best solution. As long as the
> > kernel crashes in some manner, we'll be able to work out what happened.
> > And it's can't-happen anyway, isn't it?
>
> To me WARN_ON() seems like good "documentation" if it does not hurt
> performance. I don't think removing the WARN_ON() will improve
> performance, but I'm willing to actually test if it matters.

I tested removing the BUG/WARN_ON altogether, and it gives slightly worse performance. The icache misses only increase approx 14% (not 112% as before). This I'm willing to attribute to some code-alignment issue. Thus, let us just keep the WARN_ON() and move along.
On Tue, 6 Oct 2015 01:07:03 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> (trimmed Cc list a little)
>
> On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:
>
> > > My only problem left is that I want a perf measurement that pinpoints these
> > > kinds of spots. The difference in L1-icache-load-misses was significant
> > > (1,278,276 vs 2,719,158). I tried to somehow perf record this with
> > > different perf events without being able to pinpoint the location (even
> > > though I know the spot now). Even tried Andi's ocperf.py... maybe he
> > > will know what event I should try?
> >
> > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> > bottleneck is and what to sample for if there is a suitable event, and
> > even prints the command line.
> >
> > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev
>
> My result from (IP-forward flow hitting CPU 0):
> $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0
>
> So, what does this tell me?
>
> C0    BAD     Bad_Speculation:                                   0.00 %  [ 5.50%]
> C0    BE      Backend_Bound:                                   100.00 %  [ 5.50%]
> C0    BE/Mem  Backend_Bound.Memory_Bound:                       53.06 %  [ 5.50%]
> C0    BE/Core Backend_Bound.Core_Bound:                         46.94 %  [ 5.50%]
> C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers:   5.42 %  [ 5.50%]
> C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:              54.51 %  [ 5.50%]
> C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:       20.99 %  [ 5.60%]
> C0-T0         CPU utilization:  1.00 CPUs  [100.00%]
> C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers:   6.04 %  [ 5.50%]
> C0-T1         CPU utilization:  1.00 CPUs  [100.00%]

Reading https://github.com/andikleen/pmu-tools/wiki/toplev-manual helped me understand most of the above.

My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have limited "Frontend" support. E.g.

 # perf record -g -a -e stalled-cycles-frontend
 Error:
 The stalled-cycles-frontend event is not supported.

And AFAIK icache misses are part of the "frontend".

> Unfortunately the perf command it gives me fails with:
> "invalid or unsupported event".
>
> Perf command:

sudo ./ocperf.py record -g -e \
cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,\
cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,\
cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp \
-C 0,4 -a

I fixed the problem with this perf command by removing the ":pp" part. Perhaps your tool needs to fix that?

A working command line looks like this:

sudo ./ocperf.py record -g -e \
cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,\
cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT,period=2000003/pp,\
cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB,period=100003/pp \
-C 0,4 -a
Em Wed, Oct 07, 2015 at 02:31:20PM +0200, Jesper Dangaard Brouer escreveu:

> On Tue, 6 Oct 2015 01:07:03 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> > (trimmed Cc list a little)
> >
> > On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:
> >
> > > > My only problem left is that I want a perf measurement that pinpoints these
> > > > kinds of spots. The difference in L1-icache-load-misses was significant
> > > > (1,278,276 vs 2,719,158). I tried to somehow perf record this with
> > > > different perf events without being able to pinpoint the location (even
> > > > though I know the spot now). Even tried Andi's ocperf.py... maybe he
> > > > will know what event I should try?
> > >
> > > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> > > bottleneck is and what to sample for if there is a suitable event, and
> > > even prints the command line.
> > >
> > > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev
> >
> > My result from (IP-forward flow hitting CPU 0):
> > $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0
> >
> > So, what does this tell me?
> >
> > C0    BAD     Bad_Speculation:                                   0.00 %  [ 5.50%]
> > C0    BE      Backend_Bound:                                   100.00 %  [ 5.50%]
> > C0    BE/Mem  Backend_Bound.Memory_Bound:                       53.06 %  [ 5.50%]
> > C0    BE/Core Backend_Bound.Core_Bound:                         46.94 %  [ 5.50%]
> > C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers:   5.42 %  [ 5.50%]
> > C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:              54.51 %  [ 5.50%]
> > C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:       20.99 %  [ 5.60%]
> > C0-T0         CPU utilization:  1.00 CPUs  [100.00%]
> > C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers:   6.04 %  [ 5.50%]
> > C0-T1         CPU utilization:  1.00 CPUs  [100.00%]
>
> Reading https://github.com/andikleen/pmu-tools/wiki/toplev-manual helped me
> understand most of the above.
>
> My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
> limited "Frontend" support. E.g.
>
> # perf record -g -a -e stalled-cycles-frontend
> Error:
> The stalled-cycles-frontend event is not supported.
>
> And AFAIK icache misses are part of the "frontend".
>
> > Unfortunately the perf command it gives me fails with:
> > "invalid or unsupported event".
> >
> > Perf command:
>
> sudo ./ocperf.py record -g -e \
> cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,\
> cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
> cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,\
> cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp \
> -C 0,4 -a
>
> I fixed the problem with this perf command by removing the ":pp" part.
> Perhaps your tool needs to fix that?
>
> A working command line looks like this:
>
> sudo ./ocperf.py record -g -e \
> cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,\
> cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
> cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT,period=2000003/pp,\
> cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB,period=100003/pp \
> -C 0,4 -a

There is a recent patch that may help here, see below, but maybe it's just a matter of removing that ":pp", as it ends with a "/pp" anyway, no need to state that twice :)

With the patch below all those /pp would be replaced with /P.

- Arnaldo

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/tools/perf?id=7f94af7a489fada17d28cc60e8f4409ce216bd6d

----------------------------------------------------------------------
perf tools: Introduce 'P' modifier to request max precision

The 'P' will cause the event to get maximum possible detected precise
level.

Following record:

  $ perf record -e cycles:P ...

will detect maximum precise level for 'cycles' event and use it.
----------------------------------------------------------------------
> There is a recent patch that may help here, see below, but maybe it's
> just a matter of removing that ":pp", as it ends with a "/pp" anyway, no
> need to state that twice :)

Yes, the extra ":pp" was a regression in toplev. I fixed it now. Thanks.

-Andi
> My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
> limited "Frontend" support. E.g.
>
> # perf record -g -a -e stalled-cycles-frontend
> Error:
> The stalled-cycles-frontend event is not supported.
>
> And AFAIK icache misses are part of the "frontend".

Ignore stalled-cycles-frontend. It is very unreliable and has never worked right. toplev gives you much more reliable output.

-Andi
--- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
+++ a/mm/slub.c
@@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
-	BUG_ON(!size);
+	if (WARN_ON(!size))
+		return;
 
 	do {
 		struct detached_freelist df;