
[v4,net-next,RFC] net: Generic XDP

Message ID 20170419142903.GJ4730@C02RW35GFVH8.dhcp.broadcom.net
State RFC, archived
Delegated to: David Miller

Commit Message

Andy Gospodarek April 19, 2017, 2:29 p.m. UTC
On Tue, Apr 18, 2017 at 03:29:16PM -0400, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Tue, 18 Apr 2017 15:07:08 -0400 (EDT)
> 
> > From: Andy Gospodarek <andy@greyhouse.net>
> > Date: Tue, 18 Apr 2017 15:05:35 -0400
> > 
> >> On Fri, Apr 14, 2017 at 05:59:51PM -0700, Alexei Starovoitov wrote:
> >>> On Thu, Apr 13, 2017 at 04:23:15PM -0400, David Miller wrote:
> >>> > +
> >>> > +	switch (act) {
> >>> > +	case XDP_TX:
> >>> > +		__skb_push(skb, skb->mac_len);
> >>> 
> >>> s/skb->mac_len/mac_len/
> >>> 
> >> 
> >> I was away from my keyboard for a few days, but was able to get some
> >> time to test this today.
> >> 
> >> When using this change above suggested by Alexei, XDP_DROP and XDP_TX
> >> actions appear to work well with xdp1 and xdp2.
> >> 
> >> I'm seeing some rather odd behavior with xdp_tx_tunnel so it might be
> >> good to hold off on committing this just yet.
> >> 
> >> At first glance, it looks like there is enough headroom for the new
> >> frame, but the resulting packet data do not look right and I'm actually
> >> seeing some data that may be left on the stack from a previous caller.
> > 
> > Thanks for testing Andy, I'll take a look and see if I can figure it out.
> 
> Andy, I think we might be getting burnt by signedness issues in the
> offset handling when the XDP program adjusts the packet data pointer.
> 
> In netif_receive_generic_xdp(), try changing the offset handling code to
> read something like:
> 
> 	off = xdp.data - orig_data;
> 	if (off > 0)
> 		__skb_pull(skb, off);
> 	else if (off < 0)
> 		__skb_push(skb, -off);
> 
> If that doesn't work try adding:
> 
> 	__skb_cow(skb, XDP_PACKET_HEADROOM, 0);
> 
> right after the skb_linearize() call in that same function.
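
To make the sign convention concrete, here is a minimal sketch of the handling
being suggested (the helper name and factoring are illustrative only, not the
code as committed):

/* Sketch: `data` is where the XDP program left xdp.data, `orig_data` is
 * where bpf_prog_run_xdp() started.  A positive offset means the program
 * trimmed bytes off the front of the packet, so pull them from the skb;
 * a negative offset means it prepended bytes, so push the skb head down
 * to cover them.
 */
static void generic_xdp_offset_fixup(struct sk_buff *skb,
				     void *data, void *orig_data)
{
	int off = data - orig_data;

	if (off > 0)
		__skb_pull(skb, off);
	else if (off < 0)
		__skb_push(skb, -off);
}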

So I tried a variety of things; the simplest change on top of yours that
works well for xdp1, xdp2, and xdp_tx_iptunnel is the diff shown in the
Patch section below.


I ran this on top of a card that uses the bnxt_en driver on a desktop
class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
UDP traffic with flow control disabled and saw the following (all stats
in Million PPS).

                 xdp1                xdp2       xdp_tx_tunnel
Generic XDP       7.8    5.5 (1.3 actual)    4.6 (1.1 actual)
Optimized XDP    11.7                 9.7                 4.6

One thing to note is that in the Generic XDP case the rate reported by the
application differs from the actual rate seen on the wire.  I did not debug
where the drops are happening or which counter needs to be incremented to
record them -- I'll add that to my TODO list.  The Optimized XDP case shows
no difference between reported and actual frames on the wire.

I agree with all those who have asserted that this is a great tool for
those who want to get started with XDP but do not have hardware, so I'd
say it's ready to have the 'RFC' tag dropped.  Thanks for pushing this
forward, Dave!  :-)

Comments

Alexei Starovoitov April 19, 2017, 5:17 p.m. UTC | #1
On Wed, Apr 19, 2017 at 10:29:03AM -0400, Andy Gospodarek wrote:
> 
> I ran this on top of a card that uses the bnxt_en driver on a desktop
> class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
> UDP traffic with flow control disabled and saw the following (all stats
> in Million PPS).
> 
>                 xdp1                xdp2            xdp_tx_tunnel
> Generic XDP      7.8    5.5 (1.3 actual)         4.6 (1.1 actual)
> Optimized XDP   11.7		     9.7                      4.6

Nice! Thanks for testing.

> One thing to note is that the Generic XDP case shows some different
> results for reported by the application vs actual (seen on the wire).  I
> did not debug where the drops are happening and what counter needs to be
> incremented to note this -- I'll add that to my TODO list.  The
> Optimized XDP case does not have a difference in reported vs actual
> frames on the wire.

The missed packets are probably due to the xmit queue being full.
We need an 'xdp_tx_full' counter in:
+       if (free_skb) {
+               trace_xdp_exception(dev, xdp_prog, XDP_TX);
+               kfree_skb(skb);
+       }
like in-driver xdp does.
It's surprising that the tx queue becomes full so often. Maybe this is
bnxt-specific behavior?
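
A hypothetical sketch of what that could look like in the transmit path quoted
above -- the counter name and where it is stored are assumptions, not code from
the posted patch:

	if (free_skb) {
		/* The TX queue was full or xmit failed: count it so the
		 * gap between reported and on-the-wire packets is visible
		 * somewhere, then drop the skb as before. */
		dev->xdp_tx_full++;	/* assumed counter field */
		trace_xdp_exception(dev, xdp_prog, XDP_TX);
		kfree_skb(skb);
	}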

> I agree with all those who have asserted that this is great tool for
> those that want to get started with XDP but do not have hardware, so I'd
> say it's ready to have the 'RFC' tag dropped.  Thanks for pushing this
> forward, Dave!  :-)

+1
John Fastabend April 19, 2017, 5:44 p.m. UTC | #2
On 17-04-19 10:17 AM, Alexei Starovoitov wrote:
> On Wed, Apr 19, 2017 at 10:29:03AM -0400, Andy Gospodarek wrote:
>>
>> I ran this on top of a card that uses the bnxt_en driver on a desktop
>> class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
>> UDP traffic with flow control disabled and saw the following (all stats
>> in Million PPS).
>>
>>                 xdp1                xdp2            xdp_tx_tunnel
>> Generic XDP      7.8    5.5 (1.3 actual)         4.6 (1.1 actual)
>> Optimized XDP   11.7		     9.7                      4.6
> 
> Nice! Thanks for testing.
> 
>> One thing to note is that the Generic XDP case shows some different
>> results for reported by the application vs actual (seen on the wire).  I
>> did not debug where the drops are happening and what counter needs to be
>> incremented to note this -- I'll add that to my TODO list.  The
>> Optimized XDP case does not have a difference in reported vs actual
>> frames on the wire.
> 
> The missed packets are probably due to xmit queue being full.
> We need 'xdp_tx_full' counter in:
> +       if (free_skb) {
> +               trace_xdp_exception(dev, xdp_prog, XDP_TX);
> +               kfree_skb(skb);
> +       }
> like in-driver xdp does.
> It's surprising that tx becomes full so often. May be bnxt specific behavior?

Hmm, as a data point I get better numbers than 1.3Mpps running through the qdisc
layer with pktgen, so perhaps something is wrong with the driver? If I get a
chance I'll take a look with my setup here, although it likely won't be until
the weekend. I don't think this needs to hold up dropping the RFC tag and
getting the patch applied, though.

> 
>> I agree with all those who have asserted that this is great tool for
>> those that want to get started with XDP but do not have hardware, so I'd
>> say it's ready to have the 'RFC' tag dropped.  Thanks for pushing this
>> forward, Dave!  :-)
> 
> +1
>  
>
Andy Gospodarek April 19, 2017, 8:25 p.m. UTC | #3
On Wed, Apr 19, 2017 at 10:44:59AM -0700, John Fastabend wrote:
> On 17-04-19 10:17 AM, Alexei Starovoitov wrote:
> > On Wed, Apr 19, 2017 at 10:29:03AM -0400, Andy Gospodarek wrote:
> >>
> >> I ran this on top of a card that uses the bnxt_en driver on a desktop
> >> class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
> >> UDP traffic with flow control disabled and saw the following (all stats
> >> in Million PPS).
> >>
> >>                 xdp1                xdp2            xdp_tx_tunnel
> >> Generic XDP      7.8    5.5 (1.3 actual)         4.6 (1.1 actual)
> >> Optimized XDP   11.7		     9.7                      4.6
> > 
> > Nice! Thanks for testing.
> > 
> >> One thing to note is that the Generic XDP case shows some different
> >> results for reported by the application vs actual (seen on the wire).  I
> >> did not debug where the drops are happening and what counter needs to be
> >> incremented to note this -- I'll add that to my TODO list.  The
> >> Optimized XDP case does not have a difference in reported vs actual
> >> frames on the wire.
> > 
> > The missed packets are probably due to xmit queue being full.
> > We need 'xdp_tx_full' counter in:
> > +       if (free_skb) {
> > +               trace_xdp_exception(dev, xdp_prog, XDP_TX);
> > +               kfree_skb(skb);
> > +       }
> > like in-driver xdp does.
> > It's surprising that tx becomes full so often. May be bnxt specific behavior?
> 
> hmm as a data point I get better numbers than 1.3Mpps running through the qdisc
> layer with pktgen so seems like something is wrong with the driver perhaps? If

I get ~6.5Mpps on a single core with pktgen, so inconclusive for now....

> I get a chance I'll take a look with my setup here, although it likely wont be
> until the weekend. I don't think it needs to slow down dropping the RFC tag
> and getting the patch applied though.
> 
> > 
> >> I agree with all those who have asserted that this is great tool for
> >> those that want to get started with XDP but do not have hardware, so I'd
> >> say it's ready to have the 'RFC' tag dropped.  Thanks for pushing this
> >> forward, Dave!  :-)
> > 
> > +1
> >  
> > 
>
Alexei Starovoitov April 20, 2017, 12:13 a.m. UTC | #4
On Wed, Apr 19, 2017 at 04:25:43PM -0400, Andy Gospodarek wrote:
> On Wed, Apr 19, 2017 at 10:44:59AM -0700, John Fastabend wrote:
> > On 17-04-19 10:17 AM, Alexei Starovoitov wrote:
> > > On Wed, Apr 19, 2017 at 10:29:03AM -0400, Andy Gospodarek wrote:
> > >>
> > >> I ran this on top of a card that uses the bnxt_en driver on a desktop
> > >> class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
> > >> UDP traffic with flow control disabled and saw the following (all stats
> > >> in Million PPS).
> > >>
> > >>                 xdp1                xdp2            xdp_tx_tunnel
> > >> Generic XDP      7.8    5.5 (1.3 actual)         4.6 (1.1 actual)
> > >> Optimized XDP   11.7		     9.7                      4.6
> > > 
> > > Nice! Thanks for testing.
> > > 
> > >> One thing to note is that the Generic XDP case shows some different
> > >> results for reported by the application vs actual (seen on the wire).  I
> > >> did not debug where the drops are happening and what counter needs to be
> > >> incremented to note this -- I'll add that to my TODO list.  The
> > >> Optimized XDP case does not have a difference in reported vs actual
> > >> frames on the wire.
> > > 
> > > The missed packets are probably due to xmit queue being full.
> > > We need 'xdp_tx_full' counter in:
> > > +       if (free_skb) {
> > > +               trace_xdp_exception(dev, xdp_prog, XDP_TX);
> > > +               kfree_skb(skb);
> > > +       }
> > > like in-driver xdp does.
> > > It's surprising that tx becomes full so often. May be bnxt specific behavior?
> > 
> > hmm as a data point I get better numbers than 1.3Mpps running through the qdisc
> > layer with pktgen so seems like something is wrong with the driver perhaps? If
> 
> I get ~6.5Mpps on a single core with pktgen, so inconclusive for now....

Maybe your tx queue is simply smaller than your rx queue?
David Miller April 20, 2017, 1:40 a.m. UTC | #5
From: Andy Gospodarek <andy@greyhouse.net>
Date: Wed, 19 Apr 2017 10:29:03 -0400

> So I tried a variety of things and the simplest change on top of yours that
> works well for xdp1, xdp2, and xdp_tx_iptunnel. 
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b3d3a6e..1bab3dc 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4316,11 +4316,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
>  
>  	off = xdp.data - orig_data;
>  	if (off)
> -		__skb_push(skb, off);
> +		__skb_push(skb, -off);

We have to handle both pushing and popping headers, so could you
please test the snippet I asked you to try?

> 	if (off > 0)
> 		__skb_pull(skb, off);
> 	else if (off < 0)
> 		__skb_push(skb, -off);

Thanks.
Jesper Dangaard Brouer April 20, 2017, 2:30 p.m. UTC | #6
On Wed, 19 Apr 2017 10:29:03 -0400
Andy Gospodarek <andy@greyhouse.net> wrote:

> I ran this on top of a card that uses the bnxt_en driver on a desktop
> class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
> UDP traffic with flow control disabled and saw the following (all stats
> in Million PPS).
> 
>                 xdp1                xdp2            xdp_tx_tunnel
> Generic XDP      7.8    5.5 (1.3 actual)         4.6 (1.1 actual)
> Optimized XDP   11.7		     9.7                      4.6
> 
> One thing to note is that the Generic XDP case shows some different
> results for reported by the application vs actual (seen on the wire).  I
> did not debug where the drops are happening and what counter needs to be
> incremented to note this -- I'll add that to my TODO list.  The
> Optimized XDP case does not have a difference in reported vs actual
> frames on the wire.

The gap between the application-reported numbers and the actual (seen on the wire) numbers sounds scary.
How do you evaluate/measure "seen on the wire"?

Perhaps you could use ethtool -S stats to see if anything is fishy?
I recommend using my tool[1] like:

 ~/git/network-testing/bin/ethtool_stats.pl --dev mlx5p2 --sec 2

[1] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

I'm evaluating this patch on an mlx5 NIC, and something is not right...
I'm seeing:

 Ethtool(mlx5p2) stat:     349599 (        349,599) <= tx_multicast_phy /sec
 Ethtool(mlx5p2) stat:    4940185 (      4,940,185) <= tx_packets /sec
 Ethtool(mlx5p2) stat:     349596 (        349,596) <= tx_packets_phy /sec
 [...]
 Ethtool(mlx5p2) stat:      36898 (         36,898) <= rx_cache_busy /sec
 Ethtool(mlx5p2) stat:      36898 (         36,898) <= rx_cache_full /sec
 Ethtool(mlx5p2) stat:    4903287 (      4,903,287) <= rx_cache_reuse /sec
 Ethtool(mlx5p2) stat:    4940185 (      4,940,185) <= rx_csum_complete /sec
 Ethtool(mlx5p2) stat:    4940185 (      4,940,185) <= rx_packets /sec

Something is wrong... when I tcpdump on the generator machine, I see
garbled packets with IPv6 multicast addresses.

And it looks like I'm only sending 349,596 tx_packets_phy/sec on the "wire".
Andy Gospodarek April 20, 2017, 10:09 p.m. UTC | #7
On Wed, Apr 19, 2017 at 09:40:49PM -0400, David Miller wrote:
> From: Andy Gospodarek <andy@greyhouse.net>
> Date: Wed, 19 Apr 2017 10:29:03 -0400
> 
> > So I tried a variety of things and the simplest change on top of yours that
> > works well for xdp1, xdp2, and xdp_tx_iptunnel. 
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index b3d3a6e..1bab3dc 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4316,11 +4316,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
> >  
> >  	off = xdp.data - orig_data;
> >  	if (off)
> > -		__skb_push(skb, off);
> > +		__skb_push(skb, -off);
> 
> We have to handle both pushing and popping headers, so could you
> please test the snippet I asked you to try?
> 

I will tomorrow or by Monday of next week.

I'm also going to hack^W write a quick test app to exercise it as well.


> > 	if (off > 0)
> > 		__skb_pull(skb, off);
> > 	else if (off < 0)
> > 		__skb_push(skb, -off);
> 
> Thanks.
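
The kind of quick test Andy mentions could be as small as an XDP program that
moves the data pointer and sends the frame back out.  A rough sketch follows
(not Andy's actual app; the include paths, section name, and 20-byte size are
assumptions):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Prepend 20 bytes to the front of the packet (roughly an outer IPv4
 * header, as xdp_tx_iptunnel does).  xdp.data moves backward, so the
 * generic XDP path has to __skb_push() the skb by the same amount.  A
 * real test would also fill in the new header bytes before XDP_TX. */
SEC("xdp")
int xdp_adjust_head_test(struct xdp_md *ctx)
{
	if (bpf_xdp_adjust_head(ctx, -20))
		return XDP_DROP;
	return XDP_TX;
}

char _license[] SEC("license") = "GPL";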
Jesper Dangaard Brouer April 24, 2017, 1:18 p.m. UTC | #8
On Thu, 20 Apr 2017 16:30:34 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Wed, 19 Apr 2017 10:29:03 -0400
> Andy Gospodarek <andy@greyhouse.net> wrote:
> 
> > I ran this on top of a card that uses the bnxt_en driver on a desktop
> > class system with an i7-6700 CPU @ 3.40GHz, sending a single stream of
> > UDP traffic with flow control disabled and saw the following (all stats
> > in Million PPS).
> > 
> >                 xdp1                xdp2            xdp_tx_tunnel
> > Generic XDP      7.8    5.5 (1.3 actual)         4.6 (1.1 actual)
> > Optimized XDP   11.7		     9.7                      4.6
> > 
> > One thing to note is that the Generic XDP case shows some different
> > results for reported by the application vs actual (seen on the wire).  I
> > did not debug where the drops are happening and what counter needs to be
> > incremented to note this -- I'll add that to my TODO list.  The
> > Optimized XDP case does not have a difference in reported vs actual
> > frames on the wire.  
> 
> The reported application vs actual (seen on the wire) number sound scary.
> How do you evaluate/measure "seen on the wire"?
> 
> Perhaps you could use ethtool -S stats to see if anything is fishy?
> I recommend using my tool[1] like:
> 
>  ~/git/network-testing/bin/ethtool_stats.pl --dev mlx5p2 --sec 2
> 
> [1] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> 
> I'm evaluating this patch on a mlx5 NIC, and something is not right...
> I'm seeing:
> 
>  Ethtool(mlx5p2) stat:     349599 (        349,599) <= tx_multicast_phy /sec
>  Ethtool(mlx5p2) stat:    4940185 (      4,940,185) <= tx_packets /sec
>  Ethtool(mlx5p2) stat:     349596 (        349,596) <= tx_packets_phy /sec
>  [...]
>  Ethtool(mlx5p2) stat:      36898 (         36,898) <= rx_cache_busy /sec
>  Ethtool(mlx5p2) stat:      36898 (         36,898) <= rx_cache_full /sec
>  Ethtool(mlx5p2) stat:    4903287 (      4,903,287) <= rx_cache_reuse /sec
>  Ethtool(mlx5p2) stat:    4940185 (      4,940,185) <= rx_csum_complete /sec
>  Ethtool(mlx5p2) stat:    4940185 (      4,940,185) <= rx_packets /sec
> 
> Something is wrong... when I tcpdump on the generator machine, I see
> garbled packets with IPv6 multicast addresses.
> 
> And it looks like I'm only sending 349,596 tx_packets_phy/sec on the "wire".
> 

Not seeing packets on the TX wire was caused by the NIC HW dropping the
packets, because the Ethernet MAC addresses were not changed/swapped.

Fixed this XDP_TX bug in my test program xdp_bench01_mem_access_cost.
https://github.com/netoptimizer/prototype-kernel/commit/85f7ba2f0ea2

I also added a new option, --swapmac, as another test case that modifies
the packet.
https://github.com/netoptimizer/prototype-kernel/commit/fe080e6f3ccf
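
The swap itself can be a tiny helper along these lines (a sketch in the style
of the samples/bpf XDP programs, not necessarily the exact code in the commits
above):

/* Swap destination and source MAC addresses in place; `data` must point
 * at the Ethernet header and the caller must have verified that at least
 * 12 bytes are accessible. */
static __always_inline void swap_src_dst_mac(void *data)
{
	unsigned short *p = data;
	unsigned short dst[3];

	dst[0] = p[0]; dst[1] = p[1]; dst[2] = p[2];
	p[0] = p[3]; p[1] = p[4]; p[2] = p[5];
	p[3] = dst[0]; p[4] = dst[1]; p[5] = dst[2];
}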

I will shortly publish a full report of testing this patch.

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index b3d3a6e..1bab3dc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4316,11 +4316,11 @@  static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 
 	off = xdp.data - orig_data;
 	if (off)
-		__skb_push(skb, off);
+		__skb_push(skb, -off);
 
 	switch (act) {
 	case XDP_TX:
-		__skb_push(skb, skb->mac_len);
+		__skb_push(skb, mac_len);
 		/* fall through */
 	case XDP_PASS:
 		break;
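
Read together with the review comments above, the two hunks amount to the
following (the comments are added here for explanation and are not part of the
posted patch):

	/* Undo whatever head adjustment the XDP program made.  As posted,
	 * this only handles a program that prepended bytes at the front
	 * (off < 0); David's follow-up asks for the explicit pull/push
	 * split so that off > 0 (bytes consumed from the front) is
	 * handled as well. */
	off = xdp.data - orig_data;
	if (off)
		__skb_push(skb, -off);

	switch (act) {
	case XDP_TX:
		/* The skb has already been through eth_type_trans(), so
		 * skb->data points past the Ethernet header; push the
		 * locally computed mac_len back on before transmitting
		 * (the second hunk, per Alexei's review). */
		__skb_push(skb, mac_len);
		/* fall through */
	case XDP_PASS:
		break;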