Message ID | 1e177bb63c858acdf5aeac9198c2815448d37820.1573844190.git.lorenzo@kernel.org |
---|---|
State | Changes Requested |
Delegated to | David Miller |
Series | add DMA-sync-for-device capability to page_pool API |
On 15 Nov 2019, at 11:01, Lorenzo Bianconi wrote:

> Introduce the following parameters in order to add the possibility to sync
> DMA memory for device before putting allocated pages in the page_pool
> caches:
>
> - PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that
>   the driver gets from page_pool will be DMA-synced-for-device according
>   to the length provided by the device driver. Please note DMA-sync-for-CPU
>   is still the device driver's responsibility
> - offset: DMA address offset where the DMA engine starts copying rx data
> - max_len: maximum DMA memory size page_pool is allowed to flush. This
>   is currently used in the __page_pool_alloc_pages_slow routine when pages
>   are allocated from the page allocator
>
> These parameters are supposed to be set by device drivers.
>
> This optimization reduces the length of the DMA-sync-for-device.
> The optimization is valid because pages are initially
> DMA-synced-for-device, as defined via max_len. At RX time, the driver
> will perform a DMA-sync-for-CPU on the memory for the packet length.
> What is important is the memory occupied by the packet payload, because
> this is the area the CPU is allowed to read and modify. As we don't track
> cache lines written into by the CPU, simply use the packet payload length
> as dma_sync_size at page_pool recycle time. This also takes into account
> any tail extend.
>
> Tested-by: Matteo Croce <mcroce@redhat.com>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
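For reference, a driver opting into the new flag would fill the added page_pool_params fields along these lines. This is a sketch against the API in this series; the pool sizing, the XDP_PACKET_HEADROOM offset, and the `pdev` device pointer are illustrative assumptions, not values from the patch:

```c
#include <linux/bpf.h>		/* XDP_PACKET_HEADROOM */
#include <net/page_pool.h>

/* Sketch: create an RX page_pool that asks page_pool to both map pages
 * and DMA-sync them for the device on allocation/recycle. */
struct page_pool_params pp_params = {
	.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
	.order		= 0,
	.pool_size	= 256,			/* illustrative */
	.nid		= NUMA_NO_NODE,
	.dev		= &pdev->dev,		/* the NIC's struct device (assumed) */
	.dma_dir	= DMA_FROM_DEVICE,
	.offset		= XDP_PACKET_HEADROOM,	/* where rx DMA starts */
	.max_len	= PAGE_SIZE - XDP_PACKET_HEADROOM, /* worst-case rx len */
};
struct page_pool *pool = page_pool_create(&pp_params);

if (IS_ERR(pool))
	return PTR_ERR(pool);
```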
On Fri, 15 Nov 2019 21:01:38 +0200 Lorenzo Bianconi <lorenzo@kernel.org> wrote:

>  static bool __page_pool_recycle_into_ring(struct page_pool *pool,
> -					  struct page *page)
> +					  struct page *page,
> +					  unsigned int dma_sync_size)
>  {
>  	int ret;
>  	/* BH protection not needed if current is serving softirq */
> @@ -264,6 +285,9 @@ static bool __page_pool_recycle_into_ring(struct page_pool *pool,
>  	else
>  		ret = ptr_ring_produce_bh(&pool->ring, page);
>
> +	if (ret == 0 && (pool->p.flags & PP_FLAG_DMA_SYNC_DEV))
> +		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
> +
>  	return (ret == 0) ? true : false;
>  }

I do wonder if we should DMA-sync-for-device BEFORE putting the page into
the ptr_ring, as this is a channel between several concurrent CPUs.

> @@ -273,18 +297,22 @@ static bool __page_pool_recycle_into_ring(struct page_pool *pool,
>   * Caller must provide appropriate safe context.
>   */
>  static bool __page_pool_recycle_direct(struct page *page,
> -				       struct page_pool *pool)
> +				       struct page_pool *pool,
> +				       unsigned int dma_sync_size)
>  {
>  	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
>  		return false;
>
>  	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
>  	pool->alloc.cache[pool->alloc.count++] = page;
> +
> +	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> +		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
>  	return true;
>  }

We know __page_pool_recycle_direct() is concurrency safe, and only a
single (NAPI processing) CPU can enter. (So, the DMA-sync order is not
wrong here, but it could be swapped.)
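The reordering Jesper suggests would look roughly like this inside __page_pool_recycle_into_ring() (a sketch of the suggestion, not code from the posted series):

```c
/* Sketch: sync for device BEFORE the page is published to other CPUs
 * through the ptr_ring, so no consumer can hand it to the device while
 * the producer is still syncing it. */
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
	page_pool_dma_sync_for_device(pool, page, dma_sync_size);

/* BH protection not needed if current is serving softirq */
if (in_serving_softirq())
	ret = ptr_ring_produce(&pool->ring, page);
else
	ret = ptr_ring_produce_bh(&pool->ring, page);

return ret == 0;
```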
> On Fri, 15 Nov 2019 21:01:38 +0200
> Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> [...]
>
> I do wonder if we should DMA-sync-for-device BEFORE putting the page into
> the ptr_ring, as this is a channel between several concurrent CPUs.

Hi Jesper,

in this way we can end up syncing the DMA page even if it is unmapped in
__page_pool_clean_page (e.g. if the ptr_ring is full), right?

> [...]
>
> We know __page_pool_recycle_direct() is concurrency safe, and only a
> single (NAPI processing) CPU can enter. (So, the DMA-sync order is not
> wrong here, but it could be swapped.)

do you mean move it before putting the page in the cache?

	pool->alloc.cache[pool->alloc.count++] = page;

Regards,
Lorenzo
On Sat, 16 Nov 2019 13:36:30 +0200 Lorenzo Bianconi <lorenzo@kernel.org> wrote:

> > I do wonder if we should DMA-sync-for-device BEFORE putting the page into
> > the ptr_ring, as this is a channel between several concurrent CPUs.
>
> Hi Jesper,
>
> in this way we can end up syncing the DMA page even if it is unmapped in
> __page_pool_clean_page (e.g. if the ptr_ring is full), right?

Yes. The call __page_pool_clean_page() will do a dma_unmap_page, so it
should still be safe/correct. I can see that it is not optimal
performance-wise: in case the ptr_ring is full, the DMA-sync-for-device
is wasted work.

I don't know if you can find an argument that proves it cannot happen
that a remote CPU dequeues/consumes the page from the ptr_ring and gives
it to the device while you (the CPU that enqueued it) are still doing
the DMA-sync-for-device.

> [...]
>
> > We know __page_pool_recycle_direct() is concurrency safe, and only a
> > single (NAPI processing) CPU can enter. (So, the DMA-sync order is not
> > wrong here, but it could be swapped.)
>
> do you mean move it before putting the page in the cache?
>
> 	pool->alloc.cache[pool->alloc.count++] = page;

Yes, but here the order doesn't matter.

If you choose to do the DMA-sync-for-device earlier/before, then look at
the code and see if it makes sense to do it in __page_pool_put_page()?
(I've not checked the details.)
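The race Jesper is pointing at can be made concrete with a small interleaving sketch (a hypothetical timeline, for illustration only):

```c
/*
 * With the posted order (enqueue first, sync after), a consumer on
 * another CPU can see the page before the producer has synced it:
 *
 *   CPU A (recycling)                     CPU B (refilling RX)
 *   ---------------------------------    --------------------------------
 *   ptr_ring_produce(&pool->ring, page)
 *                                        ptr_ring_consume(&pool->ring)
 *                                        hand page to device for DMA
 *   page_pool_dma_sync_for_device(...)   <-- races with device DMA
 *
 * Syncing before the enqueue closes this window; the cost is one wasted
 * sync when the ring turns out to be full and the page is freed instead.
 */
```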
> On Sat, 16 Nov 2019 13:36:30 +0200
> Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> [...]
>
> Yes. The call __page_pool_clean_page() will do a dma_unmap_page, so it
> should still be safe/correct. I can see that it is not optimal
> performance-wise: in case the ptr_ring is full, the DMA-sync-for-device
> is wasted work.
>
> I don't know if you can find an argument that proves it cannot happen
> that a remote CPU dequeues/consumes the page from the ptr_ring and gives
> it to the device while you (the CPU that enqueued it) are still doing
> the DMA-sync-for-device.

right, I can see it now :)

> [...]
>
> > do you mean move it before putting the page in the cache?
> >
> > 	pool->alloc.cache[pool->alloc.count++] = page;
>
> Yes, but here the order doesn't matter.
>
> If you choose to do the DMA-sync-for-device earlier/before, then look at
> the code and see if it makes sense to do it in __page_pool_put_page()?
> (I've not checked the details.)

I guess we can move page_pool_dma_sync_for_device() before
__page_pool_recycle_direct() and __page_pool_recycle_into_ring(), since
even if __page_pool_put_page() is not running in NAPI context, or if
alloc.cache is full, we will end up calling page_pool_dma_sync_for_device()
as the first task in __page_pool_recycle_into_ring(). I will fix this in v4.

Regards,
Lorenzo
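Concretely, the v4 shape Lorenzo describes might look like the sketch below (abridged and hypothetical: error handling and the multi-user fallback path are elided, and the recycle helpers revert to their pre-series two-argument signatures):

```c
/* Sketch only: hoist the DMA-sync-for-device into __page_pool_put_page()
 * so every recycle path syncs while this CPU still owns the page. */
void __page_pool_put_page(struct page_pool *pool, struct page *page,
			  unsigned int dma_sync_size, bool allow_direct)
{
	if (likely(page_ref_count(page) == 1)) {
		/* Sync before the page enters either the lockless
		 * per-CPU cache or the shared ptr_ring. */
		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
			page_pool_dma_sync_for_device(pool, page,
						      dma_sync_size);

		if (allow_direct && in_serving_softirq() &&
		    __page_pool_recycle_direct(page, pool))
			return;

		if (!__page_pool_recycle_into_ring(pool, page)) {
			/* Cache full, fallback to free pages */
			__page_pool_return_page(pool, page);
		}
		return;
	}
	/* ... multi-user fallback elided ... */
}
```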
```diff
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2cbcdbdec254..8856d2c815d0 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -34,8 +34,15 @@
 #include <linux/ptr_ring.h>
 #include <linux/dma-direction.h>
 
-#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
-#define PP_FLAG_ALL	PP_FLAG_DMA_MAP
+#define PP_FLAG_DMA_MAP		1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_DMA_SYNC_DEV	2 /* if set all pages that the driver gets
+				   * from page_pool will be
+				   * DMA-synced-for-device according to the
+				   * length provided by the device driver.
+				   * Please note DMA-sync-for-CPU is still
+				   * device driver responsibility
+				   */
+#define PP_FLAG_ALL	(PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV)
 
 /*
  * Fast allocation side cache array/stack
@@ -65,6 +72,8 @@ struct page_pool_params {
 	int	nid;  /* Numa node id to allocate from pages from */
 	struct device *dev; /* device, for DMA pre-mapping purposes */
 	enum dma_data_direction dma_dir; /* DMA mapping direction */
+	unsigned int	max_len; /* max DMA sync memory size */
+	unsigned int	offset;  /* DMA addr offset */
 };
 
 struct page_pool {
@@ -150,8 +159,8 @@ static inline void page_pool_destroy(struct page_pool *pool)
 }
 
 /* Never call this directly, use helpers below */
-void __page_pool_put_page(struct page_pool *pool,
-			  struct page *page, bool allow_direct);
+void __page_pool_put_page(struct page_pool *pool, struct page *page,
+			  unsigned int dma_sync_size, bool allow_direct);
 
 static inline void page_pool_put_page(struct page_pool *pool,
 				      struct page *page, bool allow_direct)
@@ -160,14 +169,14 @@ static inline void page_pool_put_page(struct page_pool *pool,
 	 * allow registering MEM_TYPE_PAGE_POOL, but shield linker.
 	 */
 #ifdef CONFIG_PAGE_POOL
-	__page_pool_put_page(pool, page, allow_direct);
+	__page_pool_put_page(pool, page, -1, allow_direct);
 #endif
 }
 
 /* Very limited use-cases allow recycle direct */
 static inline void page_pool_recycle_direct(struct page_pool *pool,
 					    struct page *page)
 {
-	__page_pool_put_page(pool, page, true);
+	__page_pool_put_page(pool, page, -1, true);
 }
 
 /* API user MUST have disconnected alloc-side (not allowed to call
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5bc65587f1c4..e4ea607f2098 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -44,6 +44,13 @@ static int page_pool_init(struct page_pool *pool,
 	    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
 		return -EINVAL;
 
+	/* In order to request DMA-sync-for-device the page needs to
+	 * be mapped
+	 */
+	if ((pool->p.flags & PP_FLAG_DMA_SYNC_DEV) &&
+	    !(pool->p.flags & PP_FLAG_DMA_MAP))
+		return -EINVAL;
+
 	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
 		return -ENOMEM;
 
@@ -112,6 +119,16 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
 	return page;
 }
 
+static void page_pool_dma_sync_for_device(struct page_pool *pool,
+					  struct page *page,
+					  unsigned int dma_sync_size)
+{
+	dma_sync_size = min(dma_sync_size, pool->p.max_len);
+	dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
+					 pool->p.offset, dma_sync_size,
+					 pool->p.dma_dir);
+}
+
 /* slow path */
 noinline
 static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
@@ -156,6 +173,9 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	}
 	page->dma_addr = dma;
 
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
+
 skip_dma_map:
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
@@ -255,7 +275,8 @@ static void __page_pool_return_page(struct page_pool *pool, struct page *page)
 }
 
 static bool __page_pool_recycle_into_ring(struct page_pool *pool,
-					  struct page *page)
+					  struct page *page,
+					  unsigned int dma_sync_size)
 {
 	int ret;
 	/* BH protection not needed if current is serving softirq */
@@ -264,6 +285,9 @@ static bool __page_pool_recycle_into_ring(struct page_pool *pool,
 	else
 		ret = ptr_ring_produce_bh(&pool->ring, page);
 
+	if (ret == 0 && (pool->p.flags & PP_FLAG_DMA_SYNC_DEV))
+		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
+
 	return (ret == 0) ? true : false;
 }
 
@@ -273,18 +297,22 @@ static bool __page_pool_recycle_into_ring(struct page_pool *pool,
  * Caller must provide appropriate safe context.
  */
 static bool __page_pool_recycle_direct(struct page *page,
-				       struct page_pool *pool)
+				       struct page_pool *pool,
+				       unsigned int dma_sync_size)
 {
 	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
 		return false;
 
 	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
 	pool->alloc.cache[pool->alloc.count++] = page;
+
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		page_pool_dma_sync_for_device(pool, page, dma_sync_size);
 	return true;
 }
 
-void __page_pool_put_page(struct page_pool *pool,
-			  struct page *page, bool allow_direct)
+void __page_pool_put_page(struct page_pool *pool, struct page *page,
+			  unsigned int dma_sync_size, bool allow_direct)
 {
 	/* This allocator is optimized for the XDP mode that uses
 	 * one-frame-per-page, but have fallbacks that act like the
@@ -296,10 +324,12 @@ void __page_pool_put_page(struct page_pool *pool,
 
 		/* Read barrier done in page_ref_count / READ_ONCE */
 		if (allow_direct && in_serving_softirq())
-			if (__page_pool_recycle_direct(page, pool))
+			if (__page_pool_recycle_direct(page, pool,
+						       dma_sync_size))
 				return;
 
-		if (!__page_pool_recycle_into_ring(pool, page)) {
+		if (!__page_pool_recycle_into_ring(pool, page,
+						   dma_sync_size)) {
 			/* Cache full, fallback to free pages */
 			__page_pool_return_page(pool, page);
 		}
```
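One subtlety in the hunks above: the inline helpers pass -1 as dma_sync_size. Because the parameter is unsigned int, -1 wraps to UINT_MAX, and the min() in page_pool_dma_sync_for_device() clamps it, so -1 effectively means "sync the full max_len the driver configured":

```c
/* From the hunk above: -1 (UINT_MAX) is clamped to the driver's limit,
 * so callers without a precise payload length sync the whole region. */
dma_sync_size = min(dma_sync_size, pool->p.max_len);
```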