[net] ixgbe: fix potential RX buffer starvation for AF_XDP

Message ID 1548770630-16189-1-git-send-email-magnus.karlsson@intel.com
State Under Review
Delegated to: Jeff Kirsher
Headers show
Series
  • [net] ixgbe: fix potential RX buffer starvation for AF_XDP
Related show

Commit Message

Magnus Karlsson Jan. 29, 2019, 2:03 p.m.
When the RX rings are created they are also populated with buffers so
that packets can be received. Usually these are kernel buffers, but
for AF_XDP in zero-copy mode, these are user-space buffers and in this
case the application might not have sent down any buffers to the
driver at this point. And if no buffers are allocated at ring creation
time, no packets can be received and no interupts will be generated so
the napi poll function that allocates buffers to the rings will never
get executed.

To recitfy this, we kick the NAPI context of any queue with an
attached AF_XDP zero-copy socket in two places in the code. Once after
an XDP program has loaded and once after the umem is registered.  This
take care of both cases: XDP program gets loaded first then AF_XDP
socket is created, and the reverse, AF_XDP socket is created first,
then XDP program is loaded.

Fixes: d0bcacd0a130 ("ixgbe: add AF_XDP zero-copy Rx support")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 +++++++++++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 12 ++++++++++--
 2 files changed, 21 insertions(+), 3 deletions(-)

Comments

Magnus Karlsson Jan. 30, 2019, 9:35 a.m. | #1
On Tue, Jan 29, 2019 at 8:15 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Tue, 29 Jan 2019 15:03:50 +0100, Magnus Karlsson wrote:
> > When the RX rings are created they are also populated with buffers so
> > that packets can be received. Usually these are kernel buffers, but
> > for AF_XDP in zero-copy mode, these are user-space buffers and in this
> > case the application might not have sent down any buffers to the
> > driver at this point. And if no buffers are allocated at ring creation
> > time, no packets can be received and no interupts will be generated so
> > the napi poll function that allocates buffers to the rings will never
> > get executed.
> >
> > To recitfy this, we kick the NAPI context of any queue with an
> > attached AF_XDP zero-copy socket in two places in the code. Once after
> > an XDP program has loaded and once after the umem is registered.  This
> > take care of both cases: XDP program gets loaded first then AF_XDP
> > socket is created, and the reverse, AF_XDP socket is created first,
> > then XDP program is loaded.
> >
> > Fixes: d0bcacd0a130 ("ixgbe: add AF_XDP zero-copy Rx support")
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>
> I may not understand the problem fully, but isn't it kind of normal
> that if you create a ring empty you'll never receive packets?  And it
> should be reasonably easy to catch while writing an app from scratch
> (i.e. it behaves deterministically).

Agree that this should be the normal behavior for a NIC. The question
is how to get out of this situation.There are two options: punt this
to the application writer or fix this in the driver. I chose to fix
the driver since this removes complexity in the application.

> Plus user space can already do the kick manually: create the ring empty,
> attach prog, put packets on the ring, and kick xmit - that would work,
> no?

Unfortunately there is no way such a kick could be guaranteed to work
since it only guarantees to wake up Tx. If the Rx is in another NAPI
context, then this will not work. On i40e and ixgbe, the Rx and Tx
processing is in the same NAPI so this would indeed work. But can I
guarantee that this is true for all current and future NICs?
Personally I would like to have Tx in a separate NAPI as I could get
better performance that way, at least in the i40e driver. You also do
not need to have any Tx at all in your socket, as it could be Rx only.
These arguments are why I went with the approach to fix this in the
driver, instead of a hard-to-explain extra step in the application
that might or might not work depending on your driver.

I believe the crux of the problem is that we need to tell the driver
to take buffers from the fill ring then put references to them in the
HW Rx ring so that packets can be received from the wire. In this
patch I went for installing an XDP program and calling bind() as "the"
two places. But there could be a scenario where I do bind then install
the XDP program and after that populate my fill queue and not call any
other syscall ever! This would not be covered by this patch :-(.

So you might be correct that the only way to fix this is from the
application. But how to do it in a simple and elegant way? I do not
think sendto() is the right answer, but what about poll()? Can we wire
that up to kick Rx through the busy poll mechanism? I would like the
solution to be simple and not complicate things for the user, if
possible. Another idea would be to use a timer in the driver to
periodically check the fill queue in the case it is empty. This would
be easy for application writers but add complexity for people
implementing AF_XDP ZC support in their drivers. Any ideas? What do
you prefer?

Thanks: Magnus

> Putting the kick in ixgbe_xdp_setup() seems a tiny bit random to me,
> but perhaps I'm not seeing the rationale clearly.
Björn Töpel Jan. 30, 2019, 2:16 p.m. | #2
Den ons 30 jan. 2019 kl 10:35 skrev Magnus Karlsson <magnus.karlsson@gmail.com>:
>
> On Tue, Jan 29, 2019 at 8:15 PM Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
[...]
> >
> > I may not understand the problem fully, but isn't it kind of normal
> > that if you create a ring empty you'll never receive packets?  And it
> > should be reasonably easy to catch while writing an app from scratch
> > (i.e. it behaves deterministically).
>
> Agree that this should be the normal behavior for a NIC. The question
> is how to get out of this situation.There are two options: punt this
> to the application writer or fix this in the driver. I chose to fix
> the driver since this removes complexity in the application.
>

Magnus' fix addresses a race/timing issue. At zero-copy initialization
point, if the fill ring was empty, the driver (both i40e and ixgbe)
would stop retrying to "allocate" zero-copy frames from the fill
ring. So, frames would never be received, even if the fill ring was
filled at a later point.

If the driver runs-dry in terms of Rx buffer if one or more frames has
been received, the driver will retry polling the fill-ring. However at
initialization point, if the fill-ring was empty, the driver would
just give up and never retry.

As Magnus stated, there is no "notify the kernel that the items has
appeared in the fill ring" (other than a HW mechanism where the tail
pointer is a door bell) on the Rx side, so for the Intel drivers it's
up to the driver to solve this.

> > Plus user space can already do the kick manually: create the ring empty,
> > attach prog, put packets on the ring, and kick xmit - that would work,
> > no?
>
> Unfortunately there is no way such a kick could be guaranteed to work
> since it only guarantees to wake up Tx. If the Rx is in another NAPI
> context, then this will not work. On i40e and ixgbe, the Rx and Tx
> processing is in the same NAPI so this would indeed work. But can I
> guarantee that this is true for all current and future NICs?
> Personally I would like to have Tx in a separate NAPI as I could get
> better performance that way, at least in the i40e driver. You also do
> not need to have any Tx at all in your socket, as it could be Rx only.
> These arguments are why I went with the approach to fix this in the
> driver, instead of a hard-to-explain extra step in the application
> that might or might not work depending on your driver.
>
> I believe the crux of the problem is that we need to tell the driver
> to take buffers from the fill ring then put references to them in the
> HW Rx ring so that packets can be received from the wire. In this
> patch I went for installing an XDP program and calling bind() as "the"
> two places. But there could be a scenario where I do bind then install
> the XDP program and after that populate my fill queue and not call any
> other syscall ever! This would not be covered by this patch :-(.
>
> So you might be correct that the only way to fix this is from the
> application. But how to do it in a simple and elegant way? I do not
> think sendto() is the right answer, but what about poll()? Can we wire
> that up to kick Rx through the busy poll mechanism? I would like the
> solution to be simple and not complicate things for the user, if
> possible. Another idea would be to use a timer in the driver to
> periodically check the fill queue in the case it is empty. This would
> be easy for application writers but add complexity for people
> implementing AF_XDP ZC support in their drivers. Any ideas? What do
> you prefer?
>

Hmm, actually your patch *does* cover the starve Rx scenarios. One
downside, which is related to the fact that the AF_XDP ZC ndo doens't
state if there actually is a socket that needs Rx (we'll just enable
Rx zero unconditionally), is that the driver will try to allocate
buffers from the fill ring even if the are no sockets with Rx
queues. In this case, the fill ring probably never be filled from
userland. :-) But fixing that would go in to another patch IMO!

I think we can leave the NAPI discussion a side for now. Currently,
the SPSC queue setup require a single NAPI context. Let's relax that
later (if ever). (Or did I misunderstand your points?)


Cheers,
Björn


> Thanks: Magnus
>
> > Putting the kick in ixgbe_xdp_setup() seems a tiny bit random to me,
> > but perhaps I'm not seeing the rationale clearly.
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
Magnus Karlsson Jan. 30, 2019, 2:25 p.m. | #3
On Wed, Jan 30, 2019 at 3:16 PM Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> Den ons 30 jan. 2019 kl 10:35 skrev Magnus Karlsson <magnus.karlsson@gmail.com>:
> >
> > On Tue, Jan 29, 2019 at 8:15 PM Jakub Kicinski
> > <jakub.kicinski@netronome.com> wrote:
> [...]
> > >
> > > I may not understand the problem fully, but isn't it kind of normal
> > > that if you create a ring empty you'll never receive packets?  And it
> > > should be reasonably easy to catch while writing an app from scratch
> > > (i.e. it behaves deterministically).
> >
> > Agree that this should be the normal behavior for a NIC. The question
> > is how to get out of this situation.There are two options: punt this
> > to the application writer or fix this in the driver. I chose to fix
> > the driver since this removes complexity in the application.
> >
>
> Magnus' fix addresses a race/timing issue. At zero-copy initialization
> point, if the fill ring was empty, the driver (both i40e and ixgbe)
> would stop retrying to "allocate" zero-copy frames from the fill
> ring. So, frames would never be received, even if the fill ring was
> filled at a later point.
>
> If the driver runs-dry in terms of Rx buffer if one or more frames has
> been received, the driver will retry polling the fill-ring. However at
> initialization point, if the fill-ring was empty, the driver would
> just give up and never retry.
>
> As Magnus stated, there is no "notify the kernel that the items has
> appeared in the fill ring" (other than a HW mechanism where the tail
> pointer is a door bell) on the Rx side, so for the Intel drivers it's
> up to the driver to solve this.
>
> > > Plus user space can already do the kick manually: create the ring empty,
> > > attach prog, put packets on the ring, and kick xmit - that would work,
> > > no?
> >
> > Unfortunately there is no way such a kick could be guaranteed to work
> > since it only guarantees to wake up Tx. If the Rx is in another NAPI
> > context, then this will not work. On i40e and ixgbe, the Rx and Tx
> > processing is in the same NAPI so this would indeed work. But can I
> > guarantee that this is true for all current and future NICs?
> > Personally I would like to have Tx in a separate NAPI as I could get
> > better performance that way, at least in the i40e driver. You also do
> > not need to have any Tx at all in your socket, as it could be Rx only.
> > These arguments are why I went with the approach to fix this in the
> > driver, instead of a hard-to-explain extra step in the application
> > that might or might not work depending on your driver.
> >
> > I believe the crux of the problem is that we need to tell the driver
> > to take buffers from the fill ring then put references to them in the
> > HW Rx ring so that packets can be received from the wire. In this
> > patch I went for installing an XDP program and calling bind() as "the"
> > two places. But there could be a scenario where I do bind then install
> > the XDP program and after that populate my fill queue and not call any
> > other syscall ever! This would not be covered by this patch :-(.
> >
> > So you might be correct that the only way to fix this is from the
> > application. But how to do it in a simple and elegant way? I do not
> > think sendto() is the right answer, but what about poll()? Can we wire
> > that up to kick Rx through the busy poll mechanism? I would like the
> > solution to be simple and not complicate things for the user, if
> > possible. Another idea would be to use a timer in the driver to
> > periodically check the fill queue in the case it is empty. This would
> > be easy for application writers but add complexity for people
> > implementing AF_XDP ZC support in their drivers. Any ideas? What do
> > you prefer?
> >
>
> Hmm, actually your patch *does* cover the starve Rx scenarios. One
> downside, which is related to the fact that the AF_XDP ZC ndo doens't
> state if there actually is a socket that needs Rx (we'll just enable
> Rx zero unconditionally), is that the driver will try to allocate
> buffers from the fill ring even if the are no sockets with Rx
> queues. In this case, the fill ring probably never be filled from
> userland. :-) But fixing that would go in to another patch IMO!
>
> I think we can leave the NAPI discussion a side for now. Currently,
> the SPSC queue setup require a single NAPI context. Let's relax that
> later (if ever). (Or did I misunderstand your points?)

OK, good. Thanks. And yes, the NAPI discussion was only for a possible
future state, if ever.

/Magnus

> Cheers,
> Björn
>
>
> > Thanks: Magnus
> >
> > > Putting the kick in ixgbe_xdp_setup() seems a tiny bit random to me,
> > > but perhaps I'm not seeing the rationale clearly.
> > _______________________________________________
> > Intel-wired-lan mailing list
> > Intel-wired-lan@osuosl.org
> > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
Jakub Kicinski Jan. 30, 2019, 6:22 p.m. | #4
On Wed, 30 Jan 2019 15:16:15 +0100, Björn Töpel wrote:
> Den ons 30 jan. 2019 kl 10:35 skrev Magnus Karlsson:
> >
> > On Tue, Jan 29, 2019 at 8:15 PM Jakub Kicinski wrote: 
> > >
> > > I may not understand the problem fully, but isn't it kind of normal
> > > that if you create a ring empty you'll never receive packets?  And it
> > > should be reasonably easy to catch while writing an app from scratch
> > > (i.e. it behaves deterministically).  
> >
> > Agree that this should be the normal behavior for a NIC. The question
> > is how to get out of this situation.There are two options: punt this
> > to the application writer or fix this in the driver. I chose to fix
> > the driver since this removes complexity in the application.
> >  
> 
> Magnus' fix addresses a race/timing issue. At zero-copy initialization
> point, if the fill ring was empty, the driver (both i40e and ixgbe)
> would stop retrying to "allocate" zero-copy frames from the fill
> ring. So, frames would never be received, even if the fill ring was
> filled at a later point.
> 
> If the driver runs-dry in terms of Rx buffer if one or more frames has
> been received, the driver will retry polling the fill-ring. However at
> initialization point, if the fill-ring was empty, the driver would
> just give up and never retry.
> 
> As Magnus stated, there is no "notify the kernel that the items has
> appeared in the fill ring" (other than a HW mechanism where the tail
> pointer is a door bell) on the Rx side, so for the Intel drivers it's
> up to the driver to solve this.

Oh, I see, so its a missing piece in an otherwise very fault tolerant
implementation :)  Okay, no objection.

The nfp prototype I did simply fails the program load if there are not
enough buffers to do the initial fill, but that's just a personal
preference.  I tend to return errors in unclear situations more than
most people.
Bowers, AndrewX Jan. 30, 2019, 7:41 p.m. | #5
> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Magnus Karlsson
> Sent: Tuesday, January 29, 2019 6:04 AM
> To: Karlsson, Magnus <magnus.karlsson@intel.com>; Topel, Bjorn
> <bjorn.topel@intel.com>; intel-wired-lan@lists.osuosl.org
> Cc: netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH net] ixgbe: fix potential RX buffer
> starvation for AF_XDP
> 
> When the RX rings are created they are also populated with buffers so that
> packets can be received. Usually these are kernel buffers, but for AF_XDP in
> zero-copy mode, these are user-space buffers and in this case the
> application might not have sent down any buffers to the driver at this point.
> And if no buffers are allocated at ring creation time, no packets can be
> received and no interupts will be generated so the napi poll function that
> allocates buffers to the rings will never get executed.
> 
> To recitfy this, we kick the NAPI context of any queue with an attached
> AF_XDP zero-copy socket in two places in the code. Once after an XDP
> program has loaded and once after the umem is registered.  This take care of
> both cases: XDP program gets loaded first then AF_XDP socket is created,
> and the reverse, AF_XDP socket is created first, then XDP program is loaded.
> 
> Fixes: d0bcacd0a130 ("ixgbe: add AF_XDP zero-copy Rx support")
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 +++++++++++-
> drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 12 ++++++++++--
>  2 files changed, 21 insertions(+), 3 deletions(-)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>

Patch

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index daff818..017c930 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -10225,6 +10225,7 @@  static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog)
 	int i, frame_size = dev->mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	struct bpf_prog *old_prog;
+	bool need_reset;
 
 	if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
 		return -EINVAL;
@@ -10247,9 +10248,10 @@  static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog)
 		return -ENOMEM;
 
 	old_prog = xchg(&adapter->xdp_prog, prog);
+	need_reset = (!!prog != !!old_prog);
 
 	/* If transitioning XDP modes reconfigure rings */
-	if (!!prog != !!old_prog) {
+	if (need_reset) {
 		int err = ixgbe_setup_tc(dev, adapter->hw_tcs);
 
 		if (err) {
@@ -10265,6 +10267,14 @@  static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog)
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
+	/* Kick start the NAPI context if there is an AF_XDP socket open
+	 * on that queue id. This so that receiving will start.
+	 */
+	if (need_reset && prog)
+		for (i = 0; i < adapter->num_rx_queues; i++)
+			if (adapter->xdp_ring[i]->xsk_umem)
+				(void)ixgbe_xsk_async_xmit(adapter->netdev, i);
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 65c3e2c..654ae92 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -144,11 +144,19 @@  static int ixgbe_xsk_umem_enable(struct ixgbe_adapter *adapter,
 		ixgbe_txrx_ring_disable(adapter, qid);
 
 	err = ixgbe_add_xsk_umem(adapter, umem, qid);
+	if (err)
+		return err;
 
-	if (if_running)
+	if (if_running) {
 		ixgbe_txrx_ring_enable(adapter, qid);
 
-	return err;
+		/* Kick start the NAPI context so that receiving will start */
+		err = ixgbe_xsk_async_xmit(adapter->netdev, qid);
+		if (err)
+			return err;
+	}
+
+	return 0;
 }
 
 static int ixgbe_xsk_umem_disable(struct ixgbe_adapter *adapter, u16 qid)