Message ID | 20180614005309.17357-1-saeedm@mellanox.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
Series | [net,RFC] net/mlx4_en: Use frag stride in crossing page boundary condition | expand |
On 06/13/2018 05:53 PM, Saeed Mahameed wrote: > When a new rx packet arrives, the rx path will decide whether to reuse > the same page or not according to the current rx frag page offset and > frag size, i.e: > release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; > > Martin debugged this and he showed that this can cause mlx4 XDP to reuse > buffers when it shouldn't. > > Using frag_info->frag_size is wrong since frag size is always the port > mtu and the frag stride might be larger, especially in XDP case where > frag_stride == PAGE_SIZE. Hmmm... why frag_size is not PAGE_SIZE for XDP then ? This patch is a bit strange, since we do : u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES); frags->page_offset += sz_align; release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; So the @release condition is really to have enough space from frags->page_offset to hold up to frag_info->frag_size bytes. (NIC wont push more than frag_info->frag_size bytes, regardless of how big is out frag_stride ) Your patch would prevent a page being reused if say the amount of available bytes is 1536, but frag_stride = 2048 I would suggest a patch like the following instead. diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 5edd0cf2999cbde37b3404aafd700584696af940..83d6b17b102bc9a22bfd8e68863d079f38670501 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -1082,7 +1082,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev) * This only works when num_frags == 1. */ if (priv->tx_ring_num[TX_XDP]) { - priv->frag_info[0].frag_size = eff_mtu; + priv->frag_info[0].frag_size = PAGE_SIZE; /* This will gain efficient xdp frame recycling at the * expense of more costly truesize accounting */
On Wed, 2018-06-13 at 19:30 -0700, Eric Dumazet wrote: > > On 06/13/2018 05:53 PM, Saeed Mahameed wrote: > > When a new rx packet arrives, the rx path will decide whether to > > reuse > > the same page or not according to the current rx frag page offset > > and > > frag size, i.e: > > release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; > > > > Martin debugged this and he showed that this can cause mlx4 XDP to > > reuse > > buffers when it shouldn't. > > > > Using frag_info->frag_size is wrong since frag size is always the > > port > > mtu and the frag stride might be larger, especially in XDP case > > where > > frag_stride == PAGE_SIZE. > > Hmmm... why frag_size is not PAGE_SIZE for XDP then ? > I thought about this, but i see some problems with this that i would like to avoid. 1. definition of frag_size v.s. frag_stride: frag_size should be equal to the effective mtu, since the descriptor frag size depends solely on it. and we don't want to end up receiving packets larger than the stack mtu (think of a vf with smaller mtu than the wire). e.g: rx_desc->data[i].byte_count = cpu_to_be32(priv->frag_info[i].frag_size); frag_stride defines how much the frag_size can safely grow inside of a page or how much padding we need for this frag. so yes frag_size = PAGE_SIZE can work but it also kind of weird and might cause incorrectness in terms of RX packets sizes. > This patch is a bit strange, since we do : > > u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES); > > frags->page_offset += sz_align; > release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; > > So the @release condition is really to have enough space from frags- > >page_offset > to hold up to frag_info->frag_size bytes. > > (NIC wont push more than frag_info->frag_size bytes, regardless of > how big is out frag_stride ) > yes, but if frag_size > mtu, we might have a problem. > Your patch would prevent a page being reused if say the amount of > available bytes is 1536, > but frag_stride = 2048 > Interestingly for this exact frag_stride we don't have an issue :) since it goes through a different condition branch (the page flipping thing): if (frag_info->frag_stride == PAGE_SIZE / 2) { frags->page_offset ^= PAGE_SIZE / 2; release = page_count(page) != 1 || page_is_pfmemalloc(page) || page_to_nid(page) != numa_mem_id(); but yes there is a small price to pay for when 1536 < mtu < 2k but for other cases there is no change in behavior. overall i am not against your suggestion i am just laying off the problems with the two suggestions. > > I would suggest a patch like the following instead. > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > index > 5edd0cf2999cbde37b3404aafd700584696af940..83d6b17b102bc9a22bfd8e68863 > d079f38670501 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -1082,7 +1082,7 @@ void mlx4_en_calc_rx_buf(struct net_device > *dev) > * This only works when num_frags == 1. > */ > if (priv->tx_ring_num[TX_XDP]) { > - priv->frag_info[0].frag_size = eff_mtu; > + priv->frag_info[0].frag_size = PAGE_SIZE; > /* This will gain efficient xdp frame recycling at > the > * expense of more costly truesize accounting > */ >
On 06/14/2018 11:56 AM, Saeed Mahameed wrote: > Interestingly for this exact frag_stride we don't have an issue :) > since it goes through a different condition branch > (the page flipping thing): > > if (frag_info->frag_stride == PAGE_SIZE / 2) { > frags->page_offset ^= PAGE_SIZE / 2; > release = page_count(page) != 1 || > page_is_pfmemalloc(page) || > page_to_nid(page) != numa_mem_id(); > I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ? On PowerPC, the first branch is never taken.
On Thu, 2018-06-14 at 12:12 -0700, Eric Dumazet wrote: > > On 06/14/2018 11:56 AM, Saeed Mahameed wrote: > > > Interestingly for this exact frag_stride we don't have an issue :) > > since it goes through a different condition branch > > (the page flipping thing): > > > > if (frag_info->frag_stride == PAGE_SIZE / 2) { > > frags->page_offset ^= PAGE_SIZE / 2; > > release = page_count(page) != 1 || > > page_is_pfmemalloc(page) || > > page_to_nid(page) != numa_mem_id(); > > > > I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ? > > On PowerPC, the first branch is never taken. > This code is already in the driver, i guess as optimization for when a page can be reused only twice (PAGE_SIZE=4K, strides_size = 2k). frag_stride is never more than 2k. for power PC we should always take the other branch which will reuse as much 2k strides as possible in a page regardless of PAGE_SIZE. so are we good with my change ?
On Thu, 2018-06-14 at 13:47 -0700, saeedm@mellanox.com wrote: > On Thu, 2018-06-14 at 12:12 -0700, Eric Dumazet wrote: > > > > On 06/14/2018 11:56 AM, Saeed Mahameed wrote: > > > > > Interestingly for this exact frag_stride we don't have an issue > > > :) > > > since it goes through a different condition branch > > > (the page flipping thing): > > > > > > if (frag_info->frag_stride == PAGE_SIZE / 2) { > > > frags->page_offset ^= PAGE_SIZE / 2; > > > release = page_count(page) != 1 || > > > page_is_pfmemalloc(page) || > > > page_to_nid(page) != numa_mem_id(); > > > > > > > I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ? > > > > On PowerPC, the first branch is never taken. > > Oh, sorry i see where you are going with this, you mean for xdp in the other branch we need to take care of PAGE_SIZE > 4k .. You are totally right, good catch! I will have to figure this out :) .. > > This code is already in the driver, i guess as optimization for when > a > page can be reused only twice (PAGE_SIZE=4K, strides_size = 2k). > frag_stride is never more than 2k. > > for power PC we should always take the other branch which will reuse > as > much 2k strides as possible in a page regardless of PAGE_SIZE. > > so are we good with my change ? > >
On Thu, 2018-06-14 at 20:53 +0000, Saeed Mahameed wrote: > On Thu, 2018-06-14 at 13:47 -0700, saeedm@mellanox.com wrote: > > On Thu, 2018-06-14 at 12:12 -0700, Eric Dumazet wrote: > > > > > > On 06/14/2018 11:56 AM, Saeed Mahameed wrote: > > > > > > > Interestingly for this exact frag_stride we don't have an issue > > > > :) > > > > since it goes through a different condition branch > > > > (the page flipping thing): > > > > > > > > if (frag_info->frag_stride == PAGE_SIZE / 2) { > > > > frags->page_offset ^= PAGE_SIZE / 2; > > > > release = page_count(page) != 1 || > > > > page_is_pfmemalloc(page) || > > > > page_to_nid(page) != numa_mem_id(); > > > > > > > > > > I guess you forgot to test on PowerPC where PAGE_SIZE=65536 ? > > > > > > On PowerPC, the first branch is never taken. > > > > > Oh, sorry i see where you are going with this, you mean for xdp in > the > other branch we need to take care of PAGE_SIZE > 4k .. > I was looking at the code without my fix :) with my fix: release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE; for XDP: frag_info->frag_stride is PAGE_SIZE, so release will always be true regardless of PAGE_SIZE. So i guess i didn't quite understand your PowerPC concern.. can you elaborate ? > You are totally right, good catch! I will have to figure this out :) > .. > > > > > This code is already in the driver, i guess as optimization for > > when > > a > > page can be reused only twice (PAGE_SIZE=4K, strides_size = 2k). > > frag_stride is never more than 2k. > > > > for power PC we should always take the other branch which will > > reuse > > as > > much 2k strides as possible in a page regardless of PAGE_SIZE. > > > > so are we good with my change ? > >
On 06/14/2018 02:04 PM, Saeed Mahameed wrote: > I was looking at the code without my fix :) > > with my fix: > release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE; > > for XDP: frag_info->frag_stride is PAGE_SIZE, so release will always be > true regardless of PAGE_SIZE. > > So i guess i didn't quite understand your PowerPC concern.. can you > elaborate ? > So your maths with PAGE_SIZE=65536 and MTU 9000 frag_stride is about 9344 So if the last chunk of the page has 9100 bytes, we wont be able to use it, while really we should be able to use it.
On Thu, 2018-06-14 at 16:49 -0700, Eric Dumazet wrote: > > On 06/14/2018 02:04 PM, Saeed Mahameed wrote: > > > I was looking at the code without my fix :) > > > > with my fix: > > release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE; > > > > for XDP: frag_info->frag_stride is PAGE_SIZE, so release will > > always be > > true regardless of PAGE_SIZE. > > > > So i guess i didn't quite understand your PowerPC concern.. can you > > elaborate ? > > > > So your maths with PAGE_SIZE=65536 and MTU 9000 > > frag_stride is about 9344 > > So if the last chunk of the page has 9100 bytes, we wont be able to > use it, while really we should be able to use it. > > this is only true for XDP setup, for non XDP max stride_size can only be around ~3k and only for mtu > ~6k For XDP setup you suggested: - priv->frag_info[0].frag_size = eff_mtu; + priv->frag_info[0].frag_size = PAGE_SIZE; currently the condition is: release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; so my solution and yours have the same problem you described above. the problem is not with the initial values or with stride/farg size math, it just that in XDP we shouldn't reuse at ALL. I agree with you that we need to optimize and maybe for PAGE_SIZE > 8k we need to allow XDP setup to reuses. but for now there is a data corruption to handle.
On 06/19/2018 11:05 AM, Saeed Mahameed wrote: > this is only true for XDP setup, for non XDP max stride_size can only > be around ~3k and only for mtu > ~6k > > For XDP setup you suggested: > - priv->frag_info[0].frag_size = eff_mtu; > + priv->frag_info[0].frag_size = PAGE_SIZE; > > currently the condition is: > > release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; > > so my solution and yours have the same problem you described above. > > the problem is not with the initial values or with stride/farg size > math, it just that in XDP we shouldn't reuse at ALL. I agree with you > that we need to optimize and maybe for PAGE_SIZE > 8k we need to allow > XDP setup to reuses. but for now there is a data corruption to handle. Sure, we all agree there is a bug to fix. The way you are fixing it is kind of illogical. The NIC can use a frag if its _size_ is big enough to receive the frame. The _stride_ is an abstraction created by the driver to report an estimation of the _truesize_, or memory consumption, so that linux can better track overall memory usage. For example, if MTU=1500, the size of the fragment is 1536 bytes, but since we can put only 2 fragments per 4KB page (on x86), we declare the _stride_ to be 2048 bytes. Declaring that a final blob of a page, being 1600 bytes, not able to receive a frame because _stride_ is 2048 is illogical and waste resources.
On Tue, 2018-06-19 at 17:25 -0700, Eric Dumazet wrote: > > On 06/19/2018 11:05 AM, Saeed Mahameed wrote: > > > this is only true for XDP setup, for non XDP max stride_size can > > only > > be around ~3k and only for mtu > ~6k > > > > For XDP setup you suggested: > > - priv->frag_info[0].frag_size = eff_mtu; > > + priv->frag_info[0].frag_size = PAGE_SIZE; > > > > currently the condition is: > > > > release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; > > > > so my solution and yours have the same problem you described above. > > > > the problem is not with the initial values or with stride/farg size > > math, it just that in XDP we shouldn't reuse at ALL. I agree with > > you > > that we need to optimize and maybe for PAGE_SIZE > 8k we need to > > allow > > XDP setup to reuses. but for now there is a data corruption to > > handle. > > > Sure, we all agree there is a bug to fix. > > The way you are fixing it is kind of illogical. > > The NIC can use a frag if its _size_ is big enough to receive the > frame. > > The _stride_ is an abstraction created by the driver to report an > estimation of the _truesize_, > or memory consumption, so that linux can better track overall memory > usage. > > For example, if MTU=1500, the size of the fragment is 1536 bytes, but > since we can put only > 2 fragments per 4KB page (on x86), we declare the _stride_ to be 2048 > bytes. > > Declaring that a final blob of a page, being 1600 bytes, not able to > receive a frame because > _stride_ is 2048 is illogical and waste resources. > > I see, I wanted to use _stride_ as grantee for how much a page frag can grow, for example in mlx5 we need the whole stride to build_skb around the frag, since we always need the trailer, but it is different in here and we can avoid resource waste. so how a bout this: (As suggested by Martin). currently as mlx4_en_complete_rx_desc assumes that priv->rx_headroom is always 0 in non-XDP setup, hence: frags->page_offset += sz_align; where it really should be: frags->page_offset += sz_align + priv->rx_headroom; we can use it as a hint to not reuse as below: what do you think ? diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 9f54ccbddea7..f14c7a574cc8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -474,10 +474,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, { const struct mlx4_en_frag_info *frag_info = priv->frag_info; unsigned int truesize = 0; + bool release = true; int nr, frag_size; struct page *page; dma_addr_t dma; - bool release; index 9f54ccbddea7..f14c7a574cc8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c /* Collect used fragments while replacing them in the HW descriptors */ for (nr = 0;; frags++) { @@ -500,7 +500,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, release = page_count(page) != 1 || page_is_pfmemalloc(page) || page_to_nid(page) != numa_mem_id(); - } else { + } elseif(!priv->rx_headroom) { u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES); frags->page_offset += sz_align;
On 06/20/2018 04:41 PM, Saeed Mahameed wrote: > > I see, I wanted to use _stride_ as grantee for how much a page frag can > grow, for example in mlx5 we need the whole stride to build_skb around > the frag, since we always need the trailer, but it is different in here > and we can avoid resource waste. > > so how a bout this: (As suggested by Martin). > currently as mlx4_en_complete_rx_desc assumes that priv->rx_headroom > is always 0 in non-XDP setup, hence: > > frags->page_offset += sz_align; > > where it really should be: > frags->page_offset += sz_align + priv->rx_headroom; > > we can use it as a hint to not reuse as below: > what do you think ? > > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > index 9f54ccbddea7..f14c7a574cc8 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -474,10 +474,10 @@ static int mlx4_en_complete_rx_desc(struct > mlx4_en_priv *priv, > { > const struct mlx4_en_frag_info *frag_info = priv->frag_info; > unsigned int truesize = 0; > + bool release = true; > int nr, frag_size; > struct page *page; > dma_addr_t dma; > - bool release; > index 9f54ccbddea7..f14c7a574cc8 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > > /* Collect used fragments while replacing them in the HW > descriptors */ > for (nr = 0;; frags++) { > @@ -500,7 +500,7 @@ static int mlx4_en_complete_rx_desc(struct > mlx4_en_priv *priv, > release = page_count(page) != 1 || > page_is_pfmemalloc(page) || > page_to_nid(page) != numa_mem_id(); > - } else { > + } elseif(!priv->rx_headroom) { > u32 sz_align = ALIGN(frag_size, > SMP_CACHE_BYTES); > > frags->page_offset += sz_align; > I guess that would work, please double check priv->rx_headroom wont need another cache line, thanks !
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 5c613c6663da..f63dde0288b7 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -504,7 +504,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES); frags->page_offset += sz_align; - release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; + release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE; } if (release) { dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
When a new rx packet arrives, the rx path will decide whether to reuse the same page or not according to the current rx frag page offset and frag size, i.e: release = frags->page_offset + frag_info->frag_size > PAGE_SIZE; Martin debugged this and he showed that this can cause mlx4 XDP to reuse buffers when it shouldn't. Using frag_info->frag_size is wrong since frag size is always the port mtu and the frag stride might be larger, especially in XDP case where frag_stride == PAGE_SIZE. In XDP there is an assumption to have a page per packet and reuse can break such assumption and might cause packet data corruptions, since in XDP frag_offset will always reset to the beginning of the page when refilling the rx buffer. Fix this by using the stride size rather than frag size in "release" condition evaluation. For non XDP setup this will yield the same behavior since frag_stride already equals to ALIGN(frag_size, SMP_CACHE_BYTES) and on XDP setup the "release" condition will always be true as it is supposed to be since frag_stride == PAGE_SIZE. Fixes: 34db548bfb95 ("mlx4: add page recycling in receive path") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reported-by: Martin KaFai Lau <kafai@fb.com> CC: Eric Dumazet <edumazet@google.com> --- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)