Message ID | 20180210081139.27236-1-npiggin@gmail.com (mailing list archive)
---|---
Series | powerpc/mm/slice: improve slice speed and stack use
On 10/02/2018 09:11, Nicholas Piggin wrote:
> This series intends to improve performance and reduce stack
> consumption in the slice allocation code. It does this by keeping slice
> masks in the mm_context rather than computing them for each allocation,
> and by keeping bitmaps and slice_masks off the stack, using pointers
> instead where possible.
>
> checkstack.pl gives, before:
> 0x00000de4 slice_get_unmapped_area [slice.o]:		656
> 0x00001b4c is_hugepage_only_range [slice.o]:		512
> 0x0000075c slice_find_area_topdown [slice.o]:		416
> 0x000004c8 slice_find_area_bottomup.isra.1 [slice.o]:	272
> 0x00001aa0 slice_set_range_psize [slice.o]:		240
> 0x00000a64 slice_find_area [slice.o]:			176
> 0x00000174 slice_check_fit [slice.o]:			112
>
> after:
> 0x00000d70 slice_get_unmapped_area [slice.o]:		320
> 0x000008f8 slice_find_area [slice.o]:			144
> 0x00001860 slice_set_range_psize [slice.o]:		144
> 0x000018ec is_hugepage_only_range [slice.o]:		144
> 0x00000750 slice_find_area_bottomup.isra.4 [slice.o]:	128
>
> The benchmark in https://github.com/linuxppc/linux/issues/49 gives, before:
> $ time ./slicemask
> real	0m20.712s
> user	0m5.830s
> sys	0m15.105s
>
> after:
> $ time ./slicemask
> real	0m13.197s
> user	0m5.409s
> sys	0m7.779s

Hi,

I tested your series on an 8xx, on top of patch
https://patchwork.ozlabs.org/patch/871675/

I don't get a result as significant as yours, but there is some
improvement anyway:

ITERATION 500000

Before:

root@vgoip:~# time ./slicemask
real	0m 33.26s
user	0m 1.94s
sys	0m 30.85s

After:
root@vgoip:~# time ./slicemask
real	0m 29.69s
user	0m 2.11s
sys	0m 27.15s

The most significant improvement is obtained with the first patch of
your series:
root@vgoip:~# time ./slicemask
real	0m 30.85s
user	0m 1.80s
sys	0m 28.57s

I had to modify your series a bit; if you are interested, I can post it.

Christophe

> Thanks,
> Nick
>
> Nicholas Piggin (5):
>   powerpc/mm/slice: pass pointers to struct slice_mask where possible
>   powerpc/mm/slice: implement a slice mask cache
>   powerpc/mm/slice: implement slice_check_range_fits
>   powerpc/mm/slice: Use const pointers to cached slice masks where
>     possible
>   powerpc/mm/slice: use the dynamic high slice size to limit bitmap
>     operations
>
>  arch/powerpc/include/asm/book3s/64/mmu.h |  20 +-
>  arch/powerpc/mm/slice.c                  | 302 +++++++++++++++++++------------
>  2 files changed, 204 insertions(+), 118 deletions(-)
On Mon, 12 Feb 2018 16:02:23 +0100
Christophe LEROY <christophe.leroy@c-s.fr> wrote:

> On 10/02/2018 09:11, Nicholas Piggin wrote:
>> This series intends to improve performance and reduce stack
>> consumption in the slice allocation code. [...]
>
> Hi,
>
> I tested your series on an 8xx, on top of patch
> https://patchwork.ozlabs.org/patch/871675/
>
> I don't get a result as significant as yours, but there is some
> improvement anyway:
>
> ITERATION 500000
>
> Before:
>
> root@vgoip:~# time ./slicemask
> real	0m 33.26s
> user	0m 1.94s
> sys	0m 30.85s
>
> After:
> root@vgoip:~# time ./slicemask
> real	0m 29.69s
> user	0m 2.11s
> sys	0m 27.15s
>
> The most significant improvement is obtained with the first patch of
> your series:
> root@vgoip:~# time ./slicemask
> real	0m 30.85s
> user	0m 1.80s
> sys	0m 28.57s

Okay, thanks. Are you still spending significant time in the slice
code?

> I had to modify your series a bit; if you are interested, I can post it.

Sure, that would be good.

Thanks,
Nick
On 12/02/2018 16:24, Nicholas Piggin wrote:
> On Mon, 12 Feb 2018 16:02:23 +0100
> Christophe LEROY <christophe.leroy@c-s.fr> wrote:
>
>> [...]
>>
>> The most significant improvement is obtained with the first patch of
>> your series:
>> root@vgoip:~# time ./slicemask
>> real	0m 30.85s
>> user	0m 1.80s
>> sys	0m 28.57s
>
> Okay, thanks. Are you still spending significant time in the slice
> code?

Do you mean, am I still updating my patches? No, I hope we are at the
last run with v4, now that Aneesh has tagged all of them with his
Reviewed-by. Once the series has been accepted, my next step will be to
backport at least the first three to kernel 4.14.

>> I had to modify your series a bit; if you are interested, I can post it.
>
> Sure, that would be good.

OK, let's share it. The patches are not 100% clean.

Christophe
On Mon, 12 Feb 2018 18:42:21 +0100
Christophe LEROY <christophe.leroy@c-s.fr> wrote:

> On 12/02/2018 16:24, Nicholas Piggin wrote:
>> [...]
>>
>> Okay, thanks. Are you still spending significant time in the slice
>> code?
>
> Do you mean, am I still updating my patches? No, I hope we are at the

Actually I was wondering about CPU time spent for the microbenchmark :)

> last run with v4, now that Aneesh has tagged all of them with his
> Reviewed-by. Once the series has been accepted, my next step will be to
> backport at least the first three to kernel 4.14.
>
>> I had to modify your series a bit; if you are interested, I can post it.
>
> OK, let's share it. The patches are not 100% clean.

Those look pretty good, thanks for doing that work.

Thanks,
Nick
On 13/02/2018 09:40, Nicholas Piggin wrote:
> On Mon, 12 Feb 2018 18:42:21 +0100
> Christophe LEROY <christophe.leroy@c-s.fr> wrote:
>
>> [...]
>>
>> Do you mean, am I still updating my patches? No, I hope we are at the
>
> Actually I was wondering about CPU time spent for the microbenchmark :)

Lol. I've got the following perf report (functions over 0.50%):

# Overhead  Command    Shared Object      Symbol
# ........  .........  .................  ..................................
     7.13%  slicemask  [kernel.kallsyms]  [k] do_brk_flags
     6.19%  slicemask  [kernel.kallsyms]  [k] DoSyscall
     5.81%  slicemask  [kernel.kallsyms]  [k] perf_event_mmap
     5.55%  slicemask  [kernel.kallsyms]  [k] do_munmap
     4.55%  slicemask  [kernel.kallsyms]  [k] sys_brk
     4.43%  slicemask  [kernel.kallsyms]  [k] find_vma
     3.42%  slicemask  [kernel.kallsyms]  [k] vma_compute_subtree_gap
     3.08%  slicemask  libc-2.23.so       [.] __brk
     2.95%  slicemask  [kernel.kallsyms]  [k] slice_get_unmapped_area
     2.81%  slicemask  [kernel.kallsyms]  [k] __vm_enough_memory
     2.78%  slicemask  [kernel.kallsyms]  [k] kmem_cache_free
     2.51%  slicemask  [kernel.kallsyms]  [k] perf_iterate_ctx.constprop.84
     2.40%  slicemask  [kernel.kallsyms]  [k] unmap_page_range
     2.27%  slicemask  [kernel.kallsyms]  [k] perf_iterate_sb
     2.21%  slicemask  [kernel.kallsyms]  [k] vmacache_find
     2.04%  slicemask  [kernel.kallsyms]  [k] vma_gap_update
     1.91%  slicemask  [kernel.kallsyms]  [k] unmap_region
     1.81%  slicemask  [kernel.kallsyms]  [k] memset_nocache_branch
     1.59%  slicemask  [kernel.kallsyms]  [k] kmem_cache_alloc
     1.57%  slicemask  [kernel.kallsyms]  [k] get_unmapped_area.part.7
     1.55%  slicemask  [kernel.kallsyms]  [k] up_write
     1.44%  slicemask  [kernel.kallsyms]  [k] vma_merge
     1.28%  slicemask  slicemask          [.] main
     1.27%  slicemask  [kernel.kallsyms]  [k] lru_add_drain
     1.22%  slicemask  [kernel.kallsyms]  [k] vma_link
     1.19%  slicemask  [kernel.kallsyms]  [k] tlb_gather_mmu
     1.17%  slicemask  [kernel.kallsyms]  [k] tlb_flush_mmu_free
     1.15%  slicemask  libc-2.23.so       [.] got_label
     1.11%  slicemask  [kernel.kallsyms]  [k] unlink_anon_vmas
     1.06%  slicemask  [kernel.kallsyms]  [k] lru_add_drain_cpu
     1.02%  slicemask  [kernel.kallsyms]  [k] free_pgtables
     1.01%  slicemask  [kernel.kallsyms]  [k] remove_vma
     0.98%  slicemask  [kernel.kallsyms]  [k] strlcpy
     0.98%  slicemask  [kernel.kallsyms]  [k] perf_event_mmap_output
     0.95%  slicemask  [kernel.kallsyms]  [k] may_expand_vm
     0.90%  slicemask  [kernel.kallsyms]  [k] unmap_vmas
     0.86%  slicemask  [kernel.kallsyms]  [k] down_write_killable
     0.83%  slicemask  [kernel.kallsyms]  [k] __vma_link_list
     0.83%  slicemask  [kernel.kallsyms]  [k] arch_vma_name
     0.81%  slicemask  [kernel.kallsyms]  [k] __vma_rb_erase
     0.80%  slicemask  [kernel.kallsyms]  [k] __rcu_read_unlock
     0.71%  slicemask  [kernel.kallsyms]  [k] tlb_flush_mmu
     0.70%  slicemask  [kernel.kallsyms]  [k] tlb_finish_mmu
     0.68%  slicemask  [kernel.kallsyms]  [k] __rb_insert_augmented
     0.63%  slicemask  [kernel.kallsyms]  [k] cap_capable
     0.61%  slicemask  [kernel.kallsyms]  [k] free_pgd_range
     0.59%  slicemask  [kernel.kallsyms]  [k] arch_tlb_finish_mmu
     0.59%  slicemask  [kernel.kallsyms]  [k] __vma_link_rb
     0.56%  slicemask  [kernel.kallsyms]  [k] __rcu_read_lock
     0.55%  slicemask  [kernel.kallsyms]  [k] arch_get_unmapped_area_topdown
     0.53%  slicemask  [kernel.kallsyms]  [k] unlink_file_vma
     0.51%  slicemask  [kernel.kallsyms]  [k] vmacache_update
     0.50%  slicemask  [kernel.kallsyms]  [k] kfree

Unfortunately I didn't run a perf report before applying the patch
series. If you are interested in the comparison, I won't be able to do
it before next week.

>> I hope we are at the last run with v4, now that Aneesh has tagged all
>> of them with his Reviewed-by. Once the series has been accepted, my
>> next step will be to backport at least the first three to kernel 4.14.
>
>> I had to modify your series a bit; if you are interested, I can post it.
>
> Those look pretty good, thanks for doing that work.

You are welcome. I wanted to try your series on the 8xx. It is untested
on book3s64; I am not sure it even compiles.

Christophe

> Thanks,
> Nick