From patchwork Fri Nov 8 18:20:22 2019
Subject: [net-next v1 PATCH 1/2] xdp: revert forced mem allocator removal for page_pool
From: Jesper Dangaard Brouer
Cc: Toke Høiland-Jørgensen, netdev@vger.kernel.org, Ilias Apalodimas,
 Jesper Dangaard Brouer, Saeed Mahameed, Matteo Croce, Jonathan Lemon,
 Lorenzo Bianconi, Tariq Toukan
Date: Fri, 08 Nov 2019 19:20:22 +0100
Message-ID: <157323722276.10408.11333995838112864686.stgit@firesoul>
In-Reply-To: <157323719180.10408.3472322881536070517.stgit@firesoul>
References: <157323719180.10408.3472322881536070517.stgit@firesoul>

Forced removal of the XDP mem allocator, specifically the page_pool variant,
turned out to be the wrong approach. Special thanks to Jonathan Lemon for
convincing me. This patch is a partial revert of commit d956a048cd3f
("xdp: force mem allocator removal and periodic warning").

It is much better to guarantee that the page_pool object stays valid until
the number of 'inflight' pages reaches zero, making it safe to remove.

We keep the periodic warning via a work-queue, but increase the interval to
5 minutes. The reason is to have a way to catch bugs where inflight
pages/packets never reach zero, indicating some kind of leak. These kinds of
bugs have been observed while converting drivers over to the page_pool API.

Details on when the kernel will crash: if the page_pool API is misused and
__page_pool_free() is somehow invoked while there are still inflight frames,
then (as before) a WARN() is triggered and not a BUG(). This can potentially
lead to use-after-free, which we try to catch by poisoning the page_pool
object memory with NULL pointers. Done this way, the WARN() pinpoints the
driver that (likely) freed the page_pool prematurely, while the crash dump
for an inflight page/packet shows who to blame for the late return. (A
standalone illustration of this poisoning appears after the diff below.)

Fixes: d956a048cd3f ("xdp: force mem allocator removal and periodic warning")
Signed-off-by: Jesper Dangaard Brouer
---
 include/trace/events/xdp.h |   35 +++--------------------------------
 net/core/page_pool.c       |    8 ++++++--
 net/core/xdp.c             |   36 +++++++++++++-----------------------
 3 files changed, 22 insertions(+), 57 deletions(-)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index c7e3c9c5bad3..a3ead2b1f00e 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -318,9 +318,9 @@ __MEM_TYPE_MAP(__MEM_TYPE_TP_FN)
 TRACE_EVENT(mem_disconnect,
 
         TP_PROTO(const struct xdp_mem_allocator *xa,
-                 bool safe_to_remove, bool force),
+                 bool safe_to_remove),
 
-        TP_ARGS(xa, safe_to_remove, force),
+        TP_ARGS(xa, safe_to_remove),
 
         TP_STRUCT__entry(
                 __field(const struct xdp_mem_allocator *, xa)
@@ -328,7 +328,6 @@ TRACE_EVENT(mem_disconnect,
                 __field(u32, mem_type)
                 __field(const void *, allocator)
                 __field(bool, safe_to_remove)
-                __field(bool, force)
                 __field(int, disconnect_cnt)
         ),
 
@@ -338,17 +337,15 @@ TRACE_EVENT(mem_disconnect,
                 __entry->mem_type = xa->mem.type;
                 __entry->allocator = xa->allocator;
                 __entry->safe_to_remove = safe_to_remove;
-                __entry->force = force;
                 __entry->disconnect_cnt = xa->disconnect_cnt;
         ),
 
         TP_printk("mem_id=%d mem_type=%s allocator=%p"
-                  " safe_to_remove=%s force=%s disconnect_cnt=%d",
+                  " safe_to_remove=%s disconnect_cnt=%d",
                   __entry->mem_id,
                   __print_symbolic(__entry->mem_type, __MEM_TYPE_SYM_TAB),
                   __entry->allocator,
                   __entry->safe_to_remove ? "true" : "false",
-                  __entry->force ? "true" : "false",
                   __entry->disconnect_cnt
         )
 );
@@ -387,32 +384,6 @@ TRACE_EVENT(mem_connect,
         )
 );
 
-TRACE_EVENT(mem_return_failed,
-
-        TP_PROTO(const struct xdp_mem_info *mem,
-                 const struct page *page),
-
-        TP_ARGS(mem, page),
-
-        TP_STRUCT__entry(
-                __field(const struct page *, page)
-                __field(u32, mem_id)
-                __field(u32, mem_type)
-        ),
-
-        TP_fast_assign(
-                __entry->page = page;
-                __entry->mem_id = mem->id;
-                __entry->mem_type = mem->type;
-        ),
-
-        TP_printk("mem_id=%d mem_type=%s page=%p",
-                  __entry->mem_id,
-                  __print_symbolic(__entry->mem_type, __MEM_TYPE_SYM_TAB),
-                  __entry->page
-        )
-);
-
 #endif /* _TRACE_XDP_H */
 
 #include <trace/define_trace.h>
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5bc65587f1c4..226f2eb30418 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -346,7 +346,7 @@ static void __warn_in_flight(struct page_pool *pool)
 
         distance = _distance(hold_cnt, release_cnt);
 
-        /* Drivers should fix this, but only problematic when DMA is used */
+        /* BUG but warn as kernel should crash later */
         WARN(1, "Still in-flight pages:%d hold:%u released:%u",
              distance, hold_cnt, release_cnt);
 }
@@ -360,12 +360,16 @@ void __page_pool_free(struct page_pool *pool)
 
         WARN(pool->alloc.count, "API usage violation");
         WARN(!ptr_ring_empty(&pool->ring), "ptr_ring is not empty");
 
-        /* Can happen due to forced shutdown */
         if (!__page_pool_safe_to_destroy(pool))
                 __warn_in_flight(pool);
 
         ptr_ring_cleanup(&pool->ring, NULL);
 
+        /* Make sure kernel will crash on use-after-free */
+        pool->ring.queue = NULL;
+        pool->alloc.cache[PP_ALLOC_CACHE_SIZE - 1] = NULL;
+        pool->alloc.count = PP_ALLOC_CACHE_SIZE;
+
         if (pool->p.flags & PP_FLAG_DMA_MAP)
                 put_device(pool->p.dev);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 20781ad5f9c3..8673f199d9f4 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -85,7 +85,7 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
         kfree(xa);
 }
 
-static bool __mem_id_disconnect(int id, bool force)
+static bool __mem_id_disconnect(int id)
 {
         struct xdp_mem_allocator *xa;
         bool safe_to_remove = true;
@@ -104,30 +104,26 @@ static bool __mem_id_disconnect(int id, bool force)
         if (xa->mem.type == MEM_TYPE_PAGE_POOL)
                 safe_to_remove = page_pool_request_shutdown(xa->page_pool);
 
-        trace_mem_disconnect(xa, safe_to_remove, force);
+        trace_mem_disconnect(xa, safe_to_remove);
 
-        if ((safe_to_remove || force) &&
+        if ((safe_to_remove) &&
             !rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params))
                 call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
 
         mutex_unlock(&mem_id_lock);
-        return (safe_to_remove|force);
+        return safe_to_remove;
 }
 
-#define DEFER_TIME (msecs_to_jiffies(1000))
-#define DEFER_WARN_INTERVAL (30 * HZ)
-#define DEFER_MAX_RETRIES 120
+#define DEFER_TIME (msecs_to_jiffies(1000UL))
+#define DEFER_WARN_INTERVAL (600UL * HZ)
 
 static void mem_id_disconnect_defer_retry(struct work_struct *wq)
 {
         struct delayed_work *dwq = to_delayed_work(wq);
         struct xdp_mem_allocator *xa = container_of(dwq, typeof(*xa), defer_wq);
-        bool force = false;
+        unsigned long defer_time;
 
-        if (xa->disconnect_cnt > DEFER_MAX_RETRIES)
-                force = true;
-
-        if (__mem_id_disconnect(xa->mem.id, force))
+        if (__mem_id_disconnect(xa->mem.id))
                 return;
 
         /* Periodic warning */
@@ -140,7 +136,8 @@ static void mem_id_disconnect_defer_retry(struct work_struct *wq)
         }
 
         /* Still not ready to be disconnected, retry later */
-        schedule_delayed_work(&xa->defer_wq, DEFER_TIME);
+        defer_time = min(DEFER_WARN_INTERVAL, DEFER_TIME * xa->disconnect_cnt);
+        schedule_delayed_work(&xa->defer_wq, defer_time);
 }
 
 void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
@@ -161,7 +158,7 @@ void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
         if (id == 0)
                 return;
 
-        if (__mem_id_disconnect(id, false))
+        if (__mem_id_disconnect(id))
                 return;
 
         /* Could not disconnect, defer new disconnect attempt to later */
@@ -402,15 +399,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
                 /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
                 xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
                 page = virt_to_head_page(data);
-                if (likely(xa)) {
-                        napi_direct &= !xdp_return_frame_no_direct();
-                        page_pool_put_page(xa->page_pool, page, napi_direct);
-                } else {
-                        /* Hopefully stack show who to blame for late return */
-                        WARN_ONCE(1, "page_pool gone mem.id=%d", mem->id);
-                        trace_mem_return_failed(mem, page);
-                        put_page(page);
-                }
+                napi_direct &= !xdp_return_frame_no_direct();
+                page_pool_put_page(xa->page_pool, page, napi_direct);
                 rcu_read_unlock();
                 break;
         case MEM_TYPE_PAGE_SHARED:
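
As a standalone illustration of the poisoning scheme described in the commit
message above (a userspace model with invented names, not the kernel code):
after free, the ring queue pointer is NULLed and the alloc cache is marked
full with a NULL last element, so a late user of a stale pool trips over a
NULL pointer instead of silently reusing freed memory.

/* Userspace model only -- invented names, not the kernel implementation */
#include <stdio.h>
#include <stddef.h>

#define TOY_CACHE_SIZE 128

struct toy_pool {
        void **ring_queue;                  /* models pool->ring.queue */
        unsigned int alloc_count;           /* models pool->alloc.count */
        void *alloc_cache[TOY_CACHE_SIZE];  /* models pool->alloc.cache[] */
};

static void toy_pool_free(struct toy_pool *pool)
{
        /* Poison the object so any use-after-free hits a NULL pointer */
        pool->ring_queue = NULL;
        pool->alloc_cache[TOY_CACHE_SIZE - 1] = NULL;
        pool->alloc_count = TOY_CACHE_SIZE;
}

int main(void)
{
        static void *queue_storage[256];
        struct toy_pool pool = { .ring_queue = queue_storage };

        toy_pool_free(&pool);

        /* A late fast-path consumer now reads a NULL "page" ... */
        printf("stale fast-path alloc: %p\n",
               pool.alloc_cache[--pool.alloc_count]);
        /* ... and a late producer falling back to the ring would
         * dereference the NULL queue pointer and crash immediately,
         * producing the backtrace that identifies the guilty party.
         */
        printf("stale ring queue: %p\n", (void *)pool.ring_queue);
        return 0;
}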

From patchwork Fri Nov 8 18:20:27 2019
Subject: [net-next v1 PATCH 2/2] page_pool: make inflight returns more robust via blocking alloc cache
From: Jesper Dangaard Brouer
Cc: Toke Høiland-Jørgensen, netdev@vger.kernel.org, Ilias Apalodimas,
 Jesper Dangaard Brouer, Saeed Mahameed, Matteo Croce, Jonathan Lemon,
 Lorenzo Bianconi, Tariq Toukan
Date: Fri, 08 Nov 2019 19:20:27 +0100
Message-ID: <157323722783.10408.5060384163093162553.stgit@firesoul>
In-Reply-To: <157323719180.10408.3472322881536070517.stgit@firesoul>
References: <157323719180.10408.3472322881536070517.stgit@firesoul>

When requesting page_pool shutdown, it is a requirement that the consumer
(RX-side) has already been disconnected, but __page_pool_request_shutdown()
contains a loop that empties the RX alloc cache on every invocation.
Producers can still be inflight, but they MUST NOT return pages into the RX
alloc cache. Thus, the RX alloc cache MUST remain empty after the first
clearing; anything else is considered a bug. Let's make it clear that the
cache is only cleared once.

This patch empties the RX alloc cache only once and then blocks the alloc
cache. The cache is blocked by pretending it is full and also poisoning its
last element. This keeps producers off the fast-path, and a consumer (which
is not allowed at this point) will see a NULL pointer.
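
As a standalone illustration (a userspace model with invented names, not the
kernel code), the blocking trick reduces to: a producer's fast-path check
sees a "full" cache and falls back to the ptr_ring, while a forbidden
consumer pop returns the poisoned NULL entry.

/* Userspace model only -- invented names, not the kernel implementation */
#include <stdio.h>
#include <stddef.h>

#define TOY_CACHE_SIZE 128

struct toy_cache {
        unsigned int count;
        void *slots[TOY_CACHE_SIZE];
};

static void toy_cache_block(struct toy_cache *c)
{
        c->slots[TOY_CACHE_SIZE - 1] = NULL;  /* poison last element */
        c->count = TOY_CACHE_SIZE;            /* pretend it is full */
}

/* Producer fast-path: only recycle into the cache when there is room */
static int toy_producer_recycle(struct toy_cache *c, void *page)
{
        if (c->count < TOY_CACHE_SIZE) {
                c->slots[c->count++] = page;
                return 1;                /* took the fast-path */
        }
        return 0;                        /* "full": fall back to ptr_ring */
}

/* Consumer pop, which is not allowed once shutdown has been requested */
static void *toy_consumer_pop(struct toy_cache *c)
{
        return c->count ? c->slots[--c->count] : NULL;
}

int main(void)
{
        struct toy_cache cache = { .count = 0 };
        int dummy_page;

        toy_cache_block(&cache);
        printf("producer used fast-path: %d\n",
               toy_producer_recycle(&cache, &dummy_page));   /* prints 0 */
        printf("forbidden consumer sees: %p\n",
               toy_consumer_pop(&cache));                     /* prints (nil) */
        return 0;
}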

Signed-off-by: Jesper Dangaard Brouer
---
 include/net/page_pool.h |    2 ++
 net/core/page_pool.c    |   28 +++++++++++++++++++++-------
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2cbcdbdec254..ab39b7955f9b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -107,6 +107,8 @@ struct page_pool {
          * refcnt serves purpose is to simplify drivers error handling.
          */
         refcount_t user_cnt;
+
+        bool shutdown_in_progress;
 };
 
 struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 226f2eb30418..89feee635d08 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -357,7 +357,7 @@ void __page_pool_free(struct page_pool *pool)
         if (!page_pool_put(pool))
                 return;
 
-        WARN(pool->alloc.count, "API usage violation");
+        WARN(!pool->shutdown_in_progress, "API usage violation");
         WARN(!ptr_ring_empty(&pool->ring), "ptr_ring is not empty");
 
         if (!__page_pool_safe_to_destroy(pool))
@@ -367,8 +367,6 @@ void __page_pool_free(struct page_pool *pool)
 
         /* Make sure kernel will crash on use-after-free */
         pool->ring.queue = NULL;
-        pool->alloc.cache[PP_ALLOC_CACHE_SIZE - 1] = NULL;
-        pool->alloc.count = PP_ALLOC_CACHE_SIZE;
 
         if (pool->p.flags & PP_FLAG_DMA_MAP)
                 put_device(pool->p.dev);
@@ -377,22 +375,38 @@ void __page_pool_free(struct page_pool *pool)
 }
 EXPORT_SYMBOL(__page_pool_free);
 
-/* Request to shutdown: release pages cached by page_pool, and check
- * for in-flight pages
- */
-bool __page_pool_request_shutdown(struct page_pool *pool)
+/* Empty alloc cache once and block it */
+void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
         struct page *page;
 
+        if (pool->shutdown_in_progress)
+                return;
+
         /* Empty alloc cache, assume caller made sure this is
          * no-longer in use, and page_pool_alloc_pages() cannot be
          * call concurrently.
          */
         while (pool->alloc.count) {
                 page = pool->alloc.cache[--pool->alloc.count];
+                pool->alloc.cache[pool->alloc.count] = NULL;
                 __page_pool_return_page(pool, page);
         }
 
+        /* Block alloc cache, pretend it's full and poison last element */
+        pool->alloc.cache[PP_ALLOC_CACHE_SIZE - 1] = NULL;
+        pool->alloc.count = PP_ALLOC_CACHE_SIZE;
+
+        pool->shutdown_in_progress = true;
+}
+
+/* Request to shutdown: release pages cached by page_pool, and check
+ * for in-flight pages. RX-alloc side MUST be stopped prior to this.
+ */
+bool __page_pool_request_shutdown(struct page_pool *pool)
+{
+        page_pool_empty_alloc_cache_once(pool);
+
         /* No more consumers should exist, but producers could still
          * be in-flight.
          */