
[RFC,00/12] Implement XDP bpf_redirect variants

Message ID 59650F6D.4070202@gmail.com
State RFC, archived
Delegated to: David Miller

Commit Message

John Fastabend July 11, 2017, 5:48 p.m. UTC
On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:
> On Sat, 8 Jul 2017 21:06:17 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
>> My plan is to test this latest patchset again, Monday and Tuesday.
>> I'll try to assess stability and provide some performance numbers.
> 
> Performance numbers:
> 
>  14378479 pkt/s = XDP_DROP without touching memory
>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> 
> The performance drop between xdp2 and xdp_redirect, was expected due
> to the HW-tailptr flush per packet, which is costly.
> 
>  (1/6344472-1/4595574)*10^9 = -59.98 ns
> 
> The performance drop between xdp2 and xdp_redirect_map, is higher than
> I expected, which is not good!  The avoidance of the tailptr flush per
> packet was expected to give a higher boost.  The cost increased with
> 40 ns, which is too high compared to the code added (on a 4GHz machine
> approx 160 cycles).
> 
>  (1/6344472-1/5066243)*10^9 = -39.77 ns
> 
> This system doesn't have DDIO, thus we are stalling on cache-misses,
> but I was actually expecting that the added code could "hide" behind
> these cache-misses.
> 
> I'm somewhat surprised to see this large a performance drop.
> 

Yep, although there is certainly room for optimization in this code path.
And since 5 Mpps is not horrible, my preference is to get this series in,
plus any small optimizations we come up with while the merge window is
closed. Then follow-up patches can do further optimizations.

One easy optimization is to get rid of the atomic bitops. They are not
needed here, since we have a per-CPU unsigned long. Another easy one would
be to move some of the checks out of the hotpath. For example, checking for
the ndo_xdp_xmit and flush ops on the net device really should be done in
the slow path, not the hotpath.

Comments

Jesper Dangaard Brouer July 11, 2017, 6:01 p.m. UTC | #1
On Tue, 11 Jul 2017 10:48:29 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:
> > On Sat, 8 Jul 2017 21:06:17 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> >> My plan is to test this latest patchset again, Monday and Tuesday.
> >> I'll try to assess stability and provide some performance numbers.  
> > 
> > Performance numbers:
> > 
> >  14378479 pkt/s = XDP_DROP without touching memory
> >   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
> >   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
> >   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
> >   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> > 
> > The performance drop between xdp2 and xdp_redirect, was expected due
> > to the HW-tailptr flush per packet, which is costly.
> > 
> >  (1/6344472-1/4595574)*10^9 = -59.98 ns
> > 
> > The performance drop between xdp2 and xdp_redirect_map, is higher than
> > I expected, which is not good!  The avoidance of the tailptr flush per
> > packet was expected to give a higher boost.  The cost increased with
> > 40 ns, which is too high compared to the code added (on a 4GHz machine
> > approx 160 cycles).
> > 
> >  (1/6344472-1/5066243)*10^9 = -39.77 ns
> > 
> > This system doesn't have DDIO, thus we are stalling on cache-misses,
> > but I was actually expecting that the added code could "hide" behind
> > these cache-misses.
> > 
> > I'm somewhat surprised to see this large a performance drop.
> >   
> 
> Yep, although there is room for optimizations in the code path for sure. And
> 5mpps is not horrible my preference is to get this series in plus any
> small optimization we come up with while the merge window is closed. Then
> follow up patches can do optimizations.

IMHO 5 Mpps is a very bad number for XDP.

> One easy optimization is to get rid of the atomic bitops. They are not needed
> here we have a per cpu unsigned long. Another easy one would be to move
> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
> and flush ops on the net device in the hotpath really should be done in the
> slow path.

I'm already running with a similar patch as below, but it
(surprisingly) only gave me a 3 ns improvement.  I also tried a
prefetchw() on xdp.data, which gave me 10 ns (quite good).

I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
has DDIO ... I have high hopes for this, as the major bottleneck on
the current i7-4790K CPU @ 4.00GHz is clearly cache misses.

Something is definitely wrong on this CPU, as perf stat shows a very
bad utilization of the CPU pipeline with 0.89 insn per cycle.


> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 1191060..899364d 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -213,7 +213,7 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 key)
>         struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
>         unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
>  
> -       set_bit(key, bitmap);
> +       __set_bit(key, bitmap);
>  }
>  
>  struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
> @@ -253,7 +253,7 @@ void __dev_map_flush(struct bpf_map *map)
>  
>                 netdev = dev->dev;
>  
> -               clear_bit(bit, bitmap);
> +               __clear_bit(bit, bitmap);
>                 if (unlikely(!netdev || !netdev->netdev_ops->ndo_xdp_flush))
>                         continue;
>  
> @@ -287,7 +287,7 @@ static void dev_map_flush_old(struct bpf_dtab_netdev *old_dev)
>  
>                 for_each_online_cpu(cpu) {
>                         bitmap = per_cpu_ptr(old_dev->dtab->flush_needed, cpu);
> -                       clear_bit(old_dev->key, bitmap);
> +                       __clear_bit(old_dev->key, bitmap);
>  
>                         fl->netdev_ops->ndo_xdp_flush(old_dev->dev);
>                 }
>
John Fastabend July 11, 2017, 6:29 p.m. UTC | #2
On 07/11/2017 11:01 AM, Jesper Dangaard Brouer wrote:
> On Tue, 11 Jul 2017 10:48:29 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:
>>> On Sat, 8 Jul 2017 21:06:17 +0200
>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>   
>>>> My plan is to test this latest patchset again, Monday and Tuesday.
>>>> I'll try to assess stability and provide some performance numbers.  
>>>
>>> Performance numbers:
>>>
>>>  14378479 pkt/s = XDP_DROP without touching memory
>>>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>>>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>>>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
>>>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
>>>
>>> The performance drop between xdp2 and xdp_redirect, was expected due
>>> to the HW-tailptr flush per packet, which is costly.
>>>
>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
>>>
>>> The performance drop between xdp2 and xdp_redirect_map, is higher than
>>> I expected, which is not good!  The avoidance of the tailptr flush per
>>> packet was expected to give a higher boost.  The cost increased with
>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
>>> approx 160 cycles).
>>>
>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
>>>
>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
>>> but I was actually expecting that the added code could "hide" behind
>>> these cache-misses.
>>>
>>> I'm somewhat surprised to see this large a performance drop.
>>>   
>>
>> Yep, although there is room for optimizations in the code path for sure. And
>> 5mpps is not horrible my preference is to get this series in plus any
>> small optimization we come up with while the merge window is closed. Then
>> follow up patches can do optimizations.
> 
> IMHO 5Mpps is a very bad number for XDP.
> 
>> One easy optimization is to get rid of the atomic bitops. They are not needed
>> here we have a per cpu unsigned long. Another easy one would be to move
>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
>> and flush ops on the net device in the hotpath really should be done in the
>> slow path.
> 
> I'm already running with a similar patch as below, but it
> (surprisingly) only gave my 3 ns improvement.  I also tried a
> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
> 

Ah OK, good. Do the above numbers use both the bitops changes and the
prefetchw?

> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
> have DDIO ... I have high hopes for this, as the major bottleneck on
> this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses.
> 
> Something is definitely wrong on this CPU, as perf stats shows, a very
> bad utilization of the CPU pipeline with 0.89 insn per cycle.
> 

Interesting, the E5-1650 numbers will be good to know. If you have the
perf trace, posting it might help track down some hot spots.

.John
Jesper Dangaard Brouer July 11, 2017, 6:44 p.m. UTC | #3
On Tue, 11 Jul 2017 20:01:36 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 11 Jul 2017 10:48:29 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:  
> > > On Sat, 8 Jul 2017 21:06:17 +0200
> > > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >     
> > >> My plan is to test this latest patchset again, Monday and Tuesday.
> > >> I'll try to assess stability and provide some performance numbers.    
> > > 
> > > Performance numbers:
> > > 
> > >  14378479 pkt/s = XDP_DROP without touching memory
> > >   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
> > >   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
> > >   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
> > >   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> > > 
> > > The performance drop between xdp2 and xdp_redirect, was expected due
> > > to the HW-tailptr flush per packet, which is costly.
> > > 
> > >  (1/6344472-1/4595574)*10^9 = -59.98 ns
> > > 
> > > The performance drop between xdp2 and xdp_redirect_map, is higher than
> > > I expected, which is not good!  The avoidance of the tailptr flush per
> > > packet was expected to give a higher boost.  The cost increased with
> > > 40 ns, which is too high compared to the code added (on a 4GHz machine
> > > approx 160 cycles).
> > > 
> > >  (1/6344472-1/5066243)*10^9 = -39.77 ns
> > > 
> > > This system doesn't have DDIO, thus we are stalling on cache-misses,
> > > but I was actually expecting that the added code could "hide" behind
> > > these cache-misses.
> > > 
> > > I'm somewhat surprised to see this large a performance drop.
> > >     
> > 
> > Yep, although there is room for optimizations in the code path for sure. And
> > 5mpps is not horrible my preference is to get this series in plus any
> > small optimization we come up with while the merge window is closed. Then
> > follow up patches can do optimizations.  
> 
> IMHO 5Mpps is a very bad number for XDP.
> 
> > One easy optimization is to get rid of the atomic bitops. They are not needed
> > here we have a per cpu unsigned long. Another easy one would be to move
> > some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
> > and flush ops on the net device in the hotpath really should be done in the
> > slow path.  
> 
> I'm already running with a similar patch as below, but it
> (surprisingly) only gave my 3 ns improvement.  I also tried a
> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
> 
> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
> have DDIO ... I have high hopes for this, as the major bottleneck on
> this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses.
> 
> Something is definitely wrong on this CPU, as perf stats shows, a very
> bad utilization of the CPU pipeline with 0.89 insn per cycle.

Wow, getting DDIO working and avoiding the cache misses was really
_the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz, things look really
good for XDP_REDIRECT with maps (p.s. with the __set_bit()
optimization).

13,939,674 pkt/s = XDP_DROP without touching memory
14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
13,221,812 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
 7,596,576 pkt/s = xdp_redirect:    XDP_REDIRECT with swap mac (like XDP_TX)
13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap

Surprisingly, on this DDIO-capable CPU it is slightly slower NOT to
read packet memory.

The large performance gap to xdp_redirect is due to the tailptr flush,
which really shows up on this system.  The CPU efficiency is 1.36 insn
per cycle, while for the map variant it is 2.15 insn per cycle.

 Gap (1/13221812-1/7596576)*10^9 = -56 ns

The xdp_redirect_map performance is really really good, almost 10G
wirespeed on a single CPU!!!  This is amazing, and we know that this
code is not even optimal yet.  The performance difference to xdp2 is
only around 1 ns.
John Fastabend July 11, 2017, 6:56 p.m. UTC | #4
On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote:
> On Tue, 11 Jul 2017 20:01:36 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
>> On Tue, 11 Jul 2017 10:48:29 -0700
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>
>>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:  
>>>> On Sat, 8 Jul 2017 21:06:17 +0200
>>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>>     
>>>>> My plan is to test this latest patchset again, Monday and Tuesday.
>>>>> I'll try to assess stability and provide some performance numbers.    
>>>>
>>>> Performance numbers:
>>>>
>>>>  14378479 pkt/s = XDP_DROP without touching memory
>>>>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>>>>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>>>>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
>>>>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
>>>>
>>>> The performance drop between xdp2 and xdp_redirect, was expected due
>>>> to the HW-tailptr flush per packet, which is costly.
>>>>
>>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
>>>>
>>>> The performance drop between xdp2 and xdp_redirect_map, is higher than
>>>> I expected, which is not good!  The avoidance of the tailptr flush per
>>>> packet was expected to give a higher boost.  The cost increased with
>>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
>>>> approx 160 cycles).
>>>>
>>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
>>>>
>>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
>>>> but I was actually expecting that the added code could "hide" behind
>>>> these cache-misses.
>>>>
>>>> I'm somewhat surprised to see this large a performance drop.
>>>>     
>>>
>>> Yep, although there is room for optimizations in the code path for sure. And
>>> 5mpps is not horrible my preference is to get this series in plus any
>>> small optimization we come up with while the merge window is closed. Then
>>> follow up patches can do optimizations.  
>>
>> IMHO 5Mpps is a very bad number for XDP.
>>
>>> One easy optimization is to get rid of the atomic bitops. They are not needed
>>> here we have a per cpu unsigned long. Another easy one would be to move
>>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
>>> and flush ops on the net device in the hotpath really should be done in the
>>> slow path.  
>>
>> I'm already running with a similar patch as below, but it
>> (surprisingly) only gave my 3 ns improvement.  I also tried a
>> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
>>
>> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
>> have DDIO ... I have high hopes for this, as the major bottleneck on
>> this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses.
>>
>> Something is definitely wrong on this CPU, as perf stats shows, a very
>> bad utilization of the CPU pipeline with 0.89 insn per cycle.
> 
> Wow, getting DDIO working and avoiding the cache-miss, was really
> _the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz things look really
> really good for XDP_REDIRECT with maps. (p.s. with __set_bit()
> optimization)
> 

Very nice :) this was with the prefetchw() removed, right?

> 13,939,674 pkt/s = XDP_DROP without touching memory
> 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
> 13,221,812 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>  7,596,576 pkt/s = xdp_redirect:    XDP_REDIRECT with swap mac (like XDP_TX)
> 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap
> 
> Surprisingly, on this DDIO capable CPU it is slightly slower NOT to
> read packet memory.
> 
> The large performance gap to xdp_redirect is due to the tailptr flush,
> which really show up on this system.  The CPU efficiency is 1.36 insn
> per cycle, which for map variant is 2.15 insn per cycle.
> 
>  Gap (1/13221812-1/7596576)*10^9 = -56 ns
> 
> The xdp_redirect_map performance is really really good, almost 10G
> wirespeed on a single CPU!!!  This is amazing, and we know that this
> code is not even optimal yet.  The performance difference to xdp2 is
> only around 1 ns.
> 

Great, yeah there are some more likely()/unlikely() hints we could add,
and we could also remove some of the checks in the hotpath, etc.

Thanks for doing this!
Jesper Dangaard Brouer July 11, 2017, 7:19 p.m. UTC | #5
On Tue, 11 Jul 2017 11:56:21 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 11 Jul 2017 20:01:36 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> >> On Tue, 11 Jul 2017 10:48:29 -0700
> >> John Fastabend <john.fastabend@gmail.com> wrote:
> >>  
> >>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:    
> >>>> On Sat, 8 Jul 2017 21:06:17 +0200
> >>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >>>>       
> >>>>> My plan is to test this latest patchset again, Monday and Tuesday.
> >>>>> I'll try to assess stability and provide some performance numbers.      
> >>>>
> >>>> Performance numbers:
> >>>>
> >>>>  14378479 pkt/s = XDP_DROP without touching memory
> >>>>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
> >>>>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
> >>>>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
> >>>>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> >>>>
> >>>> The performance drop between xdp2 and xdp_redirect, was expected due
> >>>> to the HW-tailptr flush per packet, which is costly.
> >>>>
> >>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
> >>>>
> >>>> The performance drop between xdp2 and xdp_redirect_map, is higher than
> >>>> I expected, which is not good!  The avoidance of the tailptr flush per
> >>>> packet was expected to give a higher boost.  The cost increased with
> >>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
> >>>> approx 160 cycles).
> >>>>
> >>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
> >>>>
> >>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
> >>>> but I was actually expecting that the added code could "hide" behind
> >>>> these cache-misses.
> >>>>
> >>>> I'm somewhat surprised to see this large a performance drop.
> >>>>       
> >>>
> >>> Yep, although there is room for optimizations in the code path for sure. And
> >>> 5mpps is not horrible my preference is to get this series in plus any
> >>> small optimization we come up with while the merge window is closed. Then
> >>> follow up patches can do optimizations.    
> >>
> >> IMHO 5Mpps is a very bad number for XDP.
> >>  
> >>> One easy optimization is to get rid of the atomic bitops. They are not needed
> >>> here we have a per cpu unsigned long. Another easy one would be to move
> >>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
> >>> and flush ops on the net device in the hotpath really should be done in the
> >>> slow path.    
> >>
> >> I'm already running with a similar patch as below, but it
> >> (surprisingly) only gave my 3 ns improvement.  I also tried a
> >> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
> >>
> >> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
> >> have DDIO ... I have high hopes for this, as the major bottleneck on
> >> this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses.
> >>
> >> Something is definitely wrong on this CPU, as perf stats shows, a very
> >> bad utilization of the CPU pipeline with 0.89 insn per cycle.  
> > 
> > Wow, getting DDIO working and avoiding the cache-miss, was really
> > _the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz things look really
> > really good for XDP_REDIRECT with maps. (p.s. with __set_bit()
> > optimization)
> >   
> 
> Very nice :) this was with the prefecthw() removed right?

Yes, prefetchw removed.

> > 13,939,674 pkt/s = XDP_DROP without touching memory
> > 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
> > 13,221,812 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
> >  7,596,576 pkt/s = xdp_redirect:    XDP_REDIRECT with swap mac (like XDP_TX)
> > 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap
> > 
> > Surprisingly, on this DDIO capable CPU it is slightly slower NOT to
> > read packet memory.
> > 
> > The large performance gap to xdp_redirect is due to the tailptr flush,
> > which really show up on this system.  The CPU efficiency is 1.36 insn
> > per cycle, which for map variant is 2.15 insn per cycle.
> > 
> >  Gap (1/13221812-1/7596576)*10^9 = -56 ns
> > 
> > The xdp_redirect_map performance is really really good, almost 10G
> > wirespeed on a single CPU!!!  This is amazing, and we know that this
> > code is not even optimal yet.  The performance difference to xdp2 is
> > only around 1 ns.
> >   
> 
> Great, yeah there are some more likely()/unlikely() hints we could add and
> also remove some of the checks in the hotpath, etc.

Yes, plus inlining some function calls.
 
> Thanks for doing this!

I have a really strange observation... if I change the CPU powersave
settings, then the xdp_redirect_map performance drops by half!  The
numbers above were with "tuned-adm profile powersave" (because this is
a really noisy server, and I'm sitting next to it).  I can see that the
CPU under load goes into "turbo mode", with the rest going into
low-power states, including the Hyper-Thread siblings.

If I change the profile to: # tuned-adm profile network-latency

ifindex 6:   12964879 pkt/s
ifindex 6:   12964683 pkt/s
ifindex 6:   12961497 pkt/s
ifindex 6:   11779966 pkt/s <-- change to tuned-adm profile network-latency
ifindex 6:    6853959 pkt/s
ifindex 6:    6851120 pkt/s
ifindex 6:    6856934 pkt/s
ifindex 6:    6857344 pkt/s
ifindex 6:    6857161 pkt/s

The CPU efficiency goes from 2.35 to 1.24 insn per cycle.

John, do you know some Intel people who could help me understand what
is going on? This is very strange...

I tried Andi's toplev tool, which AFAIK indicates that this is a
frontend problem, e.g. in decoding the instructions.
John Fastabend July 11, 2017, 7:37 p.m. UTC | #6
On 07/11/2017 12:19 PM, Jesper Dangaard Brouer wrote:
> On Tue, 11 Jul 2017 11:56:21 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote:
>>> On Tue, 11 Jul 2017 20:01:36 +0200
>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>   
>>>> On Tue, 11 Jul 2017 10:48:29 -0700
>>>> John Fastabend <john.fastabend@gmail.com> wrote:
>>>>  
>>>>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:    
>>>>>> On Sat, 8 Jul 2017 21:06:17 +0200
>>>>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>>>>       
>>>>>>> My plan is to test this latest patchset again, Monday and Tuesday.
>>>>>>> I'll try to assess stability and provide some performance numbers.      
>>>>>>
>>>>>> Performance numbers:
>>>>>>
>>>>>>  14378479 pkt/s = XDP_DROP without touching memory
>>>>>>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>>>>>>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>>>>>>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
>>>>>>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
>>>>>>
>>>>>> The performance drop between xdp2 and xdp_redirect, was expected due
>>>>>> to the HW-tailptr flush per packet, which is costly.
>>>>>>
>>>>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
>>>>>>
>>>>>> The performance drop between xdp2 and xdp_redirect_map, is higher than
>>>>>> I expected, which is not good!  The avoidance of the tailptr flush per
>>>>>> packet was expected to give a higher boost.  The cost increased with
>>>>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
>>>>>> approx 160 cycles).
>>>>>>
>>>>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
>>>>>>
>>>>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
>>>>>> but I was actually expecting that the added code could "hide" behind
>>>>>> these cache-misses.
>>>>>>
>>>>>> I'm somewhat surprised to see this large a performance drop.
>>>>>>       
>>>>>
>>>>> Yep, although there is room for optimizations in the code path for sure. And
>>>>> 5mpps is not horrible my preference is to get this series in plus any
>>>>> small optimization we come up with while the merge window is closed. Then
>>>>> follow up patches can do optimizations.    
>>>>
>>>> IMHO 5Mpps is a very bad number for XDP.
>>>>  
>>>>> One easy optimization is to get rid of the atomic bitops. They are not needed
>>>>> here we have a per cpu unsigned long. Another easy one would be to move
>>>>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
>>>>> and flush ops on the net device in the hotpath really should be done in the
>>>>> slow path.    
>>>>
>>>> I'm already running with a similar patch as below, but it
>>>> (surprisingly) only gave my 3 ns improvement.  I also tried a
>>>> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
>>>>
>>>> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
>>>> have DDIO ... I have high hopes for this, as the major bottleneck on
>>>> this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses.
>>>>
>>>> Something is definitely wrong on this CPU, as perf stats shows, a very
>>>> bad utilization of the CPU pipeline with 0.89 insn per cycle.  
>>>
>>> Wow, getting DDIO working and avoiding the cache-miss, was really
>>> _the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz things look really
>>> really good for XDP_REDIRECT with maps. (p.s. with __set_bit()
>>> optimization)
>>>   
>>
>> Very nice :) this was with the prefecthw() removed right?
> 
> Yes, prefetchw removed.

Great, I wanted to avoid playing prefetch games for as long as possible,
at least in this initial submission.

> 
>>> 13,939,674 pkt/s = XDP_DROP without touching memory
>>> 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
>>> 13,221,812 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>>>  7,596,576 pkt/s = xdp_redirect:    XDP_REDIRECT with swap mac (like XDP_TX)
>>> 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap
>>>
>>> Surprisingly, on this DDIO capable CPU it is slightly slower NOT to
>>> read packet memory.
>>>
>>> The large performance gap to xdp_redirect is due to the tailptr flush,
>>> which really show up on this system.  The CPU efficiency is 1.36 insn
>>> per cycle, which for map variant is 2.15 insn per cycle.
>>>
>>>  Gap (1/13221812-1/7596576)*10^9 = -56 ns
>>>
>>> The xdp_redirect_map performance is really really good, almost 10G
>>> wirespeed on a single CPU!!!  This is amazing, and we know that this
>>> code is not even optimal yet.  The performance difference to xdp2 is
>>> only around 1 ns.
>>>   
>>
>> Great, yeah there are some more likely()/unlikely() hints we could add and
>> also remove some of the checks in the hotpath, etc.
> 
> Yes, plus inlining some function call.
>  

Yep.

>> Thanks for doing this!
> 
> I have a really strange observation... if I change the CPU powersave
> settings, then the xdp_redirect_map performance drops in half!  Above
> was with "tuned-adm profile powersave" (because, this is a really noisy
> server, and I'm sitting next to it).  I can see that the CPU under-load
> goes into "turbomode", rest going into low-power, including the
> Hyper-thread siblings.
> 
> If I change the profile to: # tuned-adm profile network-latency
> 
> ifindex 6:   12964879 pkt/s
> ifindex 6:   12964683 pkt/s
> ifindex 6:   12961497 pkt/s
> ifindex 6:   11779966 pkt/s <-- change to tuned-adm profile network-latency
> ifindex 6:    6853959 pkt/s
> ifindex 6:    6851120 pkt/s
> ifindex 6:    6856934 pkt/s
> ifindex 6:    6857344 pkt/s
> ifindex 6:    6857161 pkt/s
> 
> The CPU efficiency goes from 2.35 to 1.24 insn per cycle.
> 
> John do you know some Intel people that could help me understand what
> is going on?!? This is very strange...
> 
> I tried Andi's toplev tool, which AFAIK indicate that this is a
> Frontend problem, e.g. in decoding the instructions?!?
> 

Hmm, maybe Jesse or Alex have some clues. Adding them to the CC list.
Jesper Dangaard Brouer July 16, 2017, 8:23 a.m. UTC | #7
On Tue, 11 Jul 2017 12:37:10 -0700 John Fastabend <john.fastabend@gmail.com> wrote:

> > I have a really strange observation... if I change the CPU powersave
> > settings, then the xdp_redirect_map performance drops in half!  Above
> > was with "tuned-adm profile powersave" (because, this is a really noisy
> > server, and I'm sitting next to it).  I can see that the CPU under-load
> > goes into "turbomode", rest going into low-power, including the
> > Hyper-thread siblings.
> > 
> > If I change the profile to: # tuned-adm profile network-latency
> > 
> > ifindex 6:   12964879 pkt/s
> > ifindex 6:   12964683 pkt/s
> > ifindex 6:   12961497 pkt/s
> > ifindex 6:   11779966 pkt/s <-- change to tuned-adm profile network-latency
> > ifindex 6:    6853959 pkt/s
> > ifindex 6:    6851120 pkt/s
> > ifindex 6:    6856934 pkt/s
> > ifindex 6:    6857344 pkt/s
> > ifindex 6:    6857161 pkt/s
> > 
> > The CPU efficiency goes from 2.35 to 1.24 insn per cycle.
> > 
> > John do you know some Intel people that could help me understand what
> > is going on?!? This is very strange...
> > 
> > I tried Andi's toplev tool, which AFAIK indicate that this is a
> > Frontend problem, e.g. in decoding the instructions?!?
> >   
> 
> hmm maybe Jesse or Alex have some clues. Adding them to the CC list.

This seems related to Hyper-Threading.  I tried disabling
Hyper-Threading in the BIOS, and the problem goes away.  That is, the
benchmarks are no longer affected by the tuned-adm CPU profile.
Jesse Brandeburg July 17, 2017, 5:04 p.m. UTC | #8
On Sun, 16 Jul 2017 10:23:08 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 11 Jul 2017 12:37:10 -0700 John Fastabend <john.fastabend@gmail.com> wrote:
> 
>  [...]  
> > 
> > hmm maybe Jesse or Alex have some clues. Adding them to the CC list.  
> 
> This seems related to Hyper-Threading.  I tried to disable
> hyperthreading in the BIOS, and the problem goes away.  That is, the
> benchmarks are no-longer affected by the CPU tuned-adm profile.

Wow, nice job figuring that out.  I went and took a look at the
tuned-adm documentation but didn't see anything that stood out that
would cause the behavior you were seeing.

I suspect your toplev results, showing this is a frontend problem, mesh
nicely with the Hyper-Threading configuration inducing the bad behavior,
since the sibling threads share the core's frontend and the fetchers
would have to stall.  I think you've rediscovered why the forwarding /
routing crowd generally turns off Hyper-Threading.

Patch

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 1191060..899364d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -213,7 +213,7 @@  void __dev_map_insert_ctx(struct bpf_map *map, u32 key)
        struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
        unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
 
-       set_bit(key, bitmap);
+       __set_bit(key, bitmap);
 }
 
 struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
@@ -253,7 +253,7 @@  void __dev_map_flush(struct bpf_map *map)
 
                netdev = dev->dev;
 
-               clear_bit(bit, bitmap);
+               __clear_bit(bit, bitmap);
                if (unlikely(!netdev || !netdev->netdev_ops->ndo_xdp_flush))
                        continue;
 
@@ -287,7 +287,7 @@  static void dev_map_flush_old(struct bpf_dtab_netdev *old_dev)
 
                for_each_online_cpu(cpu) {
                        bitmap = per_cpu_ptr(old_dev->dtab->flush_needed, cpu);
-                       clear_bit(old_dev->key, bitmap);
+                       __clear_bit(old_dev->key, bitmap);
 
                        fl->netdev_ops->ndo_xdp_flush(old_dev->dev);
                }