From patchwork Sat Nov 28 15:25:58 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nicholas Piggin X-Patchwork-Id: 1407561 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4Cjwjz6rXMz9s0b for ; Sun, 29 Nov 2020 02:42:47 +1100 (AEDT) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=FSb4XXcA; dkim-atps=neutral Received: from bilbo.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 4Cjwjz66m8zDrcN for ; Sun, 29 Nov 2020 02:42:47 +1100 (AEDT) X-Original-To: linuxppc-dev@lists.ozlabs.org Delivered-To: linuxppc-dev@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gmail.com (client-ip=2607:f8b0:4864:20::541; helo=mail-pg1-x541.google.com; envelope-from=npiggin@gmail.com; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: lists.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=FSb4XXcA; dkim-atps=neutral Received: from mail-pg1-x541.google.com (mail-pg1-x541.google.com [IPv6:2607:f8b0:4864:20::541]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 4CjwN70jGYzDrRy for ; Sun, 29 Nov 2020 02:27:07 +1100 (AEDT) Received: by mail-pg1-x541.google.com with SMTP id t21so6693040pgl.3 for ; Sat, 28 Nov 2020 07:27:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=BDXZ4Heko2aWtgeiWos6UOAsGGuX0rv29sJJvlqYtwQ=; b=FSb4XXcArnY8lb/u/2/iWrh22BaFOXo9rpE48Xz1QKtwDiw/HtTVBNrpE6fswxzmuO sgIZkw3Szg2nEjED2WhNNkHogCbAcaL5WL5CdF+a4s7PsuhDCt2a8ZVr2NgaUUPzK8op y0LdtrgaSe61JWideumVRsDfMwtI7yTPjq9fHoH+EiEyRGF46dIM67LzZWrVuUdEtcwp c73twirSUQ4KE5EAtxM1diclyrGc3VVnCiT6AJbDRt84J+PHCN9VfJnI4V7kryTcfekq AzSdtDn92q6pmKjYmB0TymyjGdaOQEFgA0Do1h5yukfBaqxHYjKMYZLCzHoqMQ0Wd1oZ wEKw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=BDXZ4Heko2aWtgeiWos6UOAsGGuX0rv29sJJvlqYtwQ=; b=tOJQyczDfZcdieX/KTj/vgKDn+xDk8VuHtJ2FRpg+OORjHMW2YuajP2Gj7jvoAbIym nbl2Ubmb2xo16LYZC3fNs+MxBqw2OIa+a6iyzgGuo22sEkag1wydl4T2PH/tkPyZyJu3 L21eRS9CYkPW+Y7z89bpAvmB7k0v3eOGSMZTWGSzzQN+Fbk4W2Tr7XwBR3l/iTnnoB1a pTDRCtk7hLGHmpwT/VTFLil0ba9LVz0ophpG27/Fi5CbKT/m8sDMbOY5Qf5Mm38kMO25 UyQ+Ks6Yso08D29ON99ytYIBFv7fmbtzyAgyaWSTxcoISHjCWm4R0vuYuNTpEazHHUz+ Fa9Q== X-Gm-Message-State: AOAM530QHGNIrZcNJJpWJPNdpnTAPDyR5/Z/epXVO5772ekKQhLQYyBx wmrTgcGsdHVzLA5LCM8bL+E= X-Google-Smtp-Source: ABdhPJxMW/vljGK7yJDeugnrIwpgJMhZDca4fbNJykdt6US6/uwMJwU7ZxjIr4fJyUR13WKo6i0WeA== X-Received: by 2002:a65:4187:: with SMTP id a7mr2205551pgq.16.1606577225416; Sat, 28 Nov 2020 07:27:05 -0800 (PST) Received: from bobo.ibm.com (193-116-103-132.tpgi.com.au. [193.116.103.132]) by smtp.gmail.com with ESMTPSA id d22sm15500173pjw.11.2020.11.28.07.27.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 28 Nov 2020 07:27:05 -0800 (PST) From: Nicholas Piggin To: linux-mm@kvack.org, Andrew Morton Subject: [PATCH v8 11/12] mm/vmalloc: Hugepage vmalloc mappings Date: Sun, 29 Nov 2020 01:25:58 +1000 Message-Id: <20201128152559.999540-12-npiggin@gmail.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20201128152559.999540-1-npiggin@gmail.com> References: <20201128152559.999540-1-npiggin@gmail.com> MIME-Version: 1.0 X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, Nicholas Piggin , Christoph Hellwig , Zefan Li , Jonathan Cameron , linuxppc-dev@lists.ozlabs.org Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org Sender: "Linuxppc-dev" Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC enables support on architectures that define HAVE_ARCH_HUGE_VMAP and supports PMD sized vmap mappings. vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or larger, and fall back to small pages if that was unsuccessful. Allocations that do not use PAGE_KERNEL prot are not permitted to use huge pages, because not all callers expect this (e.g., module allocations vs strict module rwx). When hugepage vmalloc mappings are enabled in the next patch, this reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%. This can result in more internal fragmentation and memory overhead for a given allocation, an option nohugevmalloc is added to disable at boot. Signed-off-by: Nicholas Piggin Reported-by: kernel test robot Reported-by: kernel test robot --- arch/Kconfig | 4 + include/linux/vmalloc.h | 1 + mm/page_alloc.c | 5 +- mm/vmalloc.c | 190 ++++++++++++++++++++++++++++++---------- 4 files changed, 154 insertions(+), 46 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 56b6ccc0e32d..94f5093fa6f5 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -662,6 +662,10 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD config HAVE_ARCH_HUGE_VMAP bool +config HAVE_ARCH_HUGE_VMALLOC + depends on HAVE_ARCH_HUGE_VMAP + bool + config ARCH_WANT_HUGE_PMD_SHARE bool diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index a5ae791dc1e0..72b106c2b4da 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -59,6 +59,7 @@ struct vm_struct { unsigned long size; unsigned long flags; struct page **pages; + unsigned int page_order; unsigned int nr_pages; phys_addr_t phys_addr; const void *caller; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eaa227a479e4..35bdc411bd49 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -70,6 +70,7 @@ #include #include #include +#include #include #include @@ -8171,6 +8172,7 @@ void *__init alloc_large_system_hash(const char *tablename, void *table = NULL; gfp_t gfp_flags; bool virt; + bool huge; /* allow the kernel cmdline to have a say */ if (!numentries) { @@ -8238,6 +8240,7 @@ void *__init alloc_large_system_hash(const char *tablename, } else if (get_order(size) >= MAX_ORDER || hashdist) { table = __vmalloc(size, gfp_flags); virt = true; + huge = (find_vm_area(table)->page_order > 0); } else { /* * If bucketsize is not a power-of-two, we may free @@ -8254,7 +8257,7 @@ void *__init alloc_large_system_hash(const char *tablename, pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n", tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size, - virt ? "vmalloc" : "linear"); + virt ? (huge ? "vmalloc hugepage" : "vmalloc") : "linear"); if (_hash_shift) *_hash_shift = log2qty; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ee9c3bee67f5..f650b26a0e83 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -42,6 +42,19 @@ #include "internal.h" #include "pgalloc-track.h" +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC +static bool __ro_after_init vmap_allow_huge = true; + +static int __init set_nohugevmalloc(char *str) +{ + vmap_allow_huge = false; + return 0; +} +early_param("nohugevmalloc", set_nohugevmalloc); +#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ +static const bool vmap_allow_huge = false; +#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ + bool is_vmalloc_addr(const void *x) { unsigned long addr = (unsigned long)x; @@ -477,31 +490,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, return 0; } -/** - * map_kernel_range_noflush - map kernel VM area with the specified pages - * @addr: start of the VM area to map - * @size: size of the VM area to map - * @prot: page protection flags to use - * @pages: pages to map - * - * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should - * have been allocated using get_vm_area() and its friends. - * - * NOTE: - * This function does NOT do any cache flushing. The caller is responsible for - * calling flush_cache_vmap() on to-be-mapped areas before calling this - * function. - * - * RETURNS: - * 0 on success, -errno on failure. - */ -int map_kernel_range_noflush(unsigned long addr, unsigned long size, - pgprot_t prot, struct page **pages) +static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages) { unsigned long start = addr; - unsigned long end = addr + size; - unsigned long next; pgd_t *pgd; + unsigned long next; int err = 0; int nr = 0; pgtbl_mod_mask mask = 0; @@ -523,6 +517,65 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size, return 0; } +static int vmap_pages_range_noflush(unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, unsigned int page_shift) +{ + unsigned int i, nr = (end - addr) >> PAGE_SHIFT; + + WARN_ON(page_shift < PAGE_SHIFT); + + if (page_shift == PAGE_SHIFT) + return vmap_small_pages_range_noflush(addr, end, prot, pages); + + for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { + int err; + + err = vmap_range_noflush(addr, addr + (1UL << page_shift), + __pa(page_address(pages[i])), prot, + page_shift); + if (err) + return err; + + addr += 1UL << page_shift; + } + + return 0; +} + +static int vmap_pages_range(unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, unsigned int page_shift) +{ + int err; + + err = vmap_pages_range_noflush(addr, end, prot, pages, page_shift); + flush_cache_vmap(addr, end); + return err; +} + +/** + * map_kernel_range_noflush - map kernel VM area with the specified pages + * @addr: start of the VM area to map + * @size: size of the VM area to map + * @prot: page protection flags to use + * @pages: pages to map + * + * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should + * have been allocated using get_vm_area() and its friends. + * + * NOTE: + * This function does NOT do any cache flushing. The caller is responsible for + * calling flush_cache_vmap() on to-be-mapped areas before calling this + * function. + * + * RETURNS: + * 0 on success, -errno on failure. + */ +int map_kernel_range_noflush(unsigned long addr, unsigned long size, + pgprot_t prot, struct page **pages) +{ + return vmap_pages_range_noflush(addr, addr + size, prot, pages, PAGE_SHIFT); +} + int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot, struct page **pages) { @@ -2400,6 +2453,7 @@ static inline void set_area_direct_map(const struct vm_struct *area, { int i; + /* HUGE_VMALLOC passes small pages to set_direct_map */ for (i = 0; i < area->nr_pages; i++) if (page_address(area->pages[i])) set_direct_map(area->pages[i]); @@ -2433,11 +2487,12 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages) * map. Find the start and end range of the direct mappings to make sure * the vm_unmap_aliases() flush includes the direct map. */ - for (i = 0; i < area->nr_pages; i++) { + for (i = 0; i < area->nr_pages; i += 1U << area->page_order) { unsigned long addr = (unsigned long)page_address(area->pages[i]); if (addr) { + unsigned long page_size = PAGE_SIZE << area->page_order; start = min(addr, start); - end = max(addr + PAGE_SIZE, end); + end = max(addr + page_size, end); flush_dmap = 1; } } @@ -2480,11 +2535,11 @@ static void __vunmap(const void *addr, int deallocate_pages) if (deallocate_pages) { int i; - for (i = 0; i < area->nr_pages; i++) { + for (i = 0; i < area->nr_pages; i += 1U << area->page_order) { struct page *page = area->pages[i]; BUG_ON(!page); - __free_pages(page, 0); + __free_pages(page, area->page_order); } atomic_long_sub(area->nr_pages, &nr_vmalloc_pages); @@ -2674,12 +2729,17 @@ EXPORT_SYMBOL_GPL(vmap_pfn); #endif /* CONFIG_VMAP_PFN */ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, - pgprot_t prot, int node) + pgprot_t prot, unsigned int page_shift, + int node) { const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; - unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; - unsigned int array_size = nr_pages * sizeof(struct page *), i; + unsigned int page_order = page_shift - PAGE_SHIFT; + unsigned long addr = (unsigned long)area->addr; + unsigned long size = get_vm_area_size(area); + unsigned int nr_small_pages = size >> PAGE_SHIFT; + unsigned int array_size = nr_small_pages * sizeof(struct page *); struct page **pages; + unsigned int i; gfp_mask |= __GFP_NOWARN; if (!(gfp_mask & (GFP_DMA | GFP_DMA32))) @@ -2700,30 +2760,35 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, } area->pages = pages; - area->nr_pages = nr_pages; + area->nr_pages = nr_small_pages; + area->page_order = page_order; - for (i = 0; i < area->nr_pages; i++) { + /* + * Careful, we allocate and map page_order pages, but tracking is done + * per PAGE_SIZE page so as to keep the vm_struct APIs independent of + * the physical/mapped size. + */ + for (i = 0; i < area->nr_pages; i += 1U << page_order) { struct page *page; + int p; - if (node == NUMA_NO_NODE) - page = alloc_page(gfp_mask); - else - page = alloc_pages_node(node, gfp_mask, 0); - + page = alloc_pages_node(node, gfp_mask, page_order); if (unlikely(!page)) { /* Successfully allocated i pages, free them in __vfree() */ area->nr_pages = i; atomic_long_add(area->nr_pages, &nr_vmalloc_pages); goto fail; } - area->pages[i] = page; + + for (p = 0; p < (1U << page_order); p++) + area->pages[i + p] = page + p; + if (gfpflags_allow_blocking(gfp_mask)) cond_resched(); } atomic_long_add(area->nr_pages, &nr_vmalloc_pages); - if (map_kernel_range((unsigned long)area->addr, get_vm_area_size(area), - prot, pages) < 0) + if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0) goto fail; return area->addr; @@ -2731,7 +2796,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, fail: warn_alloc(gfp_mask, NULL, "vmalloc: allocation failure, allocated %ld of %ld bytes", - (area->nr_pages*PAGE_SIZE), area->size); + (area->nr_pages*PAGE_SIZE), size); __vfree(area->addr); return NULL; } @@ -2762,19 +2827,43 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, struct vm_struct *area; void *addr; unsigned long real_size = size; + unsigned long real_align = align; + unsigned int shift = PAGE_SHIFT; - size = PAGE_ALIGN(size); if (!size || (size >> PAGE_SHIFT) > totalram_pages()) goto fail; - area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED | + if (vmap_allow_huge && arch_vmap_pmd_supported(prot) && + (pgprot_val(prot) == pgprot_val(PAGE_KERNEL))) { + unsigned long size_per_node; + + /* + * Try huge pages. Only try for PAGE_KERNEL allocations, + * others like modules don't yet expect huge pages in + * their allocations due to apply_to_page_range not + * supporting them. + */ + + size_per_node = size; + if (node == NUMA_NO_NODE) + size_per_node /= num_online_nodes(); + if (size_per_node >= PMD_SIZE) { + shift = PMD_SHIFT; + align = max(real_align, 1UL << shift); + size = ALIGN(real_size, 1UL << shift); + } + } + +again: + size = PAGE_ALIGN(size); + area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED | vm_flags, start, end, node, gfp_mask, caller); if (!area) goto fail; - addr = __vmalloc_area_node(area, gfp_mask, prot, node); + addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); if (!addr) - return NULL; + goto fail; /* * In this function, newly allocated vm_struct has VM_UNINITIALIZED @@ -2788,8 +2877,19 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, return addr; fail: - warn_alloc(gfp_mask, NULL, + if (shift > PAGE_SHIFT) { + free_vm_area(area); + shift = PAGE_SHIFT; + align = real_align; + size = real_size; + goto again; + } + + if (!area) { + /* Warn for area allocation, page allocations already warn */ + warn_alloc(gfp_mask, NULL, "vmalloc: allocation failure: %lu bytes", real_size); + } return NULL; }