diff mbox

UDP sockets oddities

Message ID 1503535426.2499.66.camel@edumazet-glaptop3.roam.corp.google.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Aug. 24, 2017, 12:43 a.m. UTC
On Wed, 2017-08-23 at 17:03 -0700, Florian Fainelli wrote:

> Attached is the perf report --stdio of:
> 
> # perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60
> WARNING: option -b implies udp testing
> ------------------------------------------------------------
> Client connecting to 192.168.1.23, UDP port 5001
> Sending 1470 byte datagrams
> UDP buffer size:  160 KByte (default)
> ------------------------------------------------------------
> [  4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  4]  0.0-60.0 sec   685 MBytes  95.8 Mbits/sec
> [  4] Sent 488633 datagrams
> [  4] Server Report:
> [  4]  0.0-74.4 sec   685 MBytes  77.2 Mbits/sec   0.187 ms  300/488632
> (0.061%)
> [  4]  0.0-74.4 sec  58 datagrams received out-of-order
> 
> It measured 488644 events which is greater than the number of packets
> transmitted by iperf but I only count 8 non-IP packets being sent
> (protocl=2054 -> ETH_P_ARP) so I am not sure what the other 4 were and
> why there are not accounted for.
> 
> Almost all events came from net_tx_action() except 2 that came from
> net/core/neighbour.c::neigh_probe() and 65 that came from
> arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!?

Oh you have too much noise and need this fix :

Comments

Florian Fainelli Aug. 24, 2017, 2:23 a.m. UTC | #1
On 08/23/2017 05:43 PM, Eric Dumazet wrote:
> On Wed, 2017-08-23 at 17:03 -0700, Florian Fainelli wrote:
> 
>> Attached is the perf report --stdio of:
>>
>> # perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60
>> WARNING: option -b implies udp testing
>> ------------------------------------------------------------
>> Client connecting to 192.168.1.23, UDP port 5001
>> Sending 1470 byte datagrams
>> UDP buffer size:  160 KByte (default)
>> ------------------------------------------------------------
>> [  4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001
>> [ ID] Interval       Transfer     Bandwidth
>> [  4]  0.0-60.0 sec   685 MBytes  95.8 Mbits/sec
>> [  4] Sent 488633 datagrams
>> [  4] Server Report:
>> [  4]  0.0-74.4 sec   685 MBytes  77.2 Mbits/sec   0.187 ms  300/488632
>> (0.061%)
>> [  4]  0.0-74.4 sec  58 datagrams received out-of-order
>>
>> It measured 488644 events which is greater than the number of packets
>> transmitted by iperf but I only count 8 non-IP packets being sent
>> (protocl=2054 -> ETH_P_ARP) so I am not sure what the other 4 were and
>> why there are not accounted for.
>>
>> Almost all events came from net_tx_action() except 2 that came from
>> net/core/neighbour.c::neigh_probe() and 65 that came from
>> arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!?
> 
> Oh you have too much noise and need this fix :
> 
> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
> index dc3052751bc1..fcfa8d991850 100644
> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
> @@ -597,7 +597,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
>  
>  static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb)
>  {
> -	dev_kfree_skb_any(cb->skb);
> +	dev_consume_skb_any(cb->skb);
>  	cb->skb = NULL;
>  	dma_unmap_addr_set(cb, dma_addr, 0);
>  }
> 

Yay, now I am getting some sensible information, thanks!

iperf reports 143 packets lost, and perf report gets me 146 entries for
kfree_skb() looking like that:
#
     2.74%     2.74%  skbaddr=0xee5f30c0 protocol=2048 location=0xc0855cdc
            |
            ---0x2d
               write
               0xe9fc2368
               kfree_skb


What is annoying is that 0xc0855cdc is resolved to
arch/arm/include/asm/irqflags.h::arch_local_irq_save() which does not
really help telling me where exactly in the stack the drop is happening,
although now I know it is somewhere down the path from write(2)... of
course it is :)

Now what is disturbing too is that iperf does  have its write() system
call get an error returned, write happily returned the number of bytes
written even though the perf trace indicates there was packet drops down
the road..

I will keep investigating, thanks a lot Eric!
Florian Fainelli Aug. 25, 2017, 11:18 p.m. UTC | #2
On 08/23/2017 07:23 PM, Florian Fainelli wrote:
> On 08/23/2017 05:43 PM, Eric Dumazet wrote:
>> On Wed, 2017-08-23 at 17:03 -0700, Florian Fainelli wrote:
>>
>>> Attached is the perf report --stdio of:
>>>
>>> # perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60
>>> WARNING: option -b implies udp testing
>>> ------------------------------------------------------------
>>> Client connecting to 192.168.1.23, UDP port 5001
>>> Sending 1470 byte datagrams
>>> UDP buffer size:  160 KByte (default)
>>> ------------------------------------------------------------
>>> [  4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001
>>> [ ID] Interval       Transfer     Bandwidth
>>> [  4]  0.0-60.0 sec   685 MBytes  95.8 Mbits/sec
>>> [  4] Sent 488633 datagrams
>>> [  4] Server Report:
>>> [  4]  0.0-74.4 sec   685 MBytes  77.2 Mbits/sec   0.187 ms  300/488632
>>> (0.061%)
>>> [  4]  0.0-74.4 sec  58 datagrams received out-of-order
>>>
>>> It measured 488644 events which is greater than the number of packets
>>> transmitted by iperf but I only count 8 non-IP packets being sent
>>> (protocl=2054 -> ETH_P_ARP) so I am not sure what the other 4 were and
>>> why there are not accounted for.
>>>
>>> Almost all events came from net_tx_action() except 2 that came from
>>> net/core/neighbour.c::neigh_probe() and 65 that came from
>>> arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!?
>>
>> Oh you have too much noise and need this fix :
>>
>> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
>> index dc3052751bc1..fcfa8d991850 100644
>> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
>> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
>> @@ -597,7 +597,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
>>  
>>  static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb)
>>  {
>> -	dev_kfree_skb_any(cb->skb);
>> +	dev_consume_skb_any(cb->skb);
>>  	cb->skb = NULL;
>>  	dma_unmap_addr_set(cb, dma_addr, 0);
>>  }
>>
> 
> Yay, now I am getting some sensible information, thanks!
> 
> iperf reports 143 packets lost, and perf report gets me 146 entries for
> kfree_skb() looking like that:
> #
>      2.74%     2.74%  skbaddr=0xee5f30c0 protocol=2048 location=0xc0855cdc
>             |
>             ---0x2d
>                write
>                0xe9fc2368
>                kfree_skb
> 
> 
> What is annoying is that 0xc0855cdc is resolved to
> arch/arm/include/asm/irqflags.h::arch_local_irq_save() which does not
> really help telling me where exactly in the stack the drop is happening,
> although now I know it is somewhere down the path from write(2)... of
> course it is :)
> 
> Now what is disturbing too is that iperf does  have its write() system
> call get an error returned, write happily returned the number of bytes
> written even though the perf trace indicates there was packet drops down
> the road..

Because perf was not able to track down exactly where the location was
with builtin_return_address(0) and always pointed to
arch_local_irq_save(), I added a WARN() in kfree_skb(), and of course
this made it impossible to see packet loss under the same conditions
anymore. After increasing dramatically wmem_{default,max} I could start
seeing significant drop made largely worse by the printks, so not very
conclusive either.

I took another approach and recorded net_dev_queue() events, and I can
see only 976704 (gphy + eth0 devices) / 2 = 488352 dev_queue_events()
were recorded. The total sent by iperf was 488898 and we lost 558
packets according to the server 488352 + 558 > 488898 which makes sense
because neigh_probe() and others still run. This still confirms there is
a drop occurring between UDP socket sendmsg and the SKB enqueuing...

So my take away is that the CPU has a bursty write(2) behavior, and the
larger the socket write buffer size, the more datagrams you can burst
into and the more of these SKBs can be flat out being dropped when the
transmit queue is congested. Adding a WARN() is enough of a slow down
for the CPU that we are not able to reliably reproduce this burst.

Eric, are there areas of the stack where we are allowed to drop packets,
not propagate that back to write(2) and also not increment any counter
either, or maybe I am not looking where I should...

Thanks!
Eric Dumazet Aug. 25, 2017, 11:57 p.m. UTC | #3
On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:

> Eric, are there areas of the stack where we are allowed to drop packets,
> not propagate that back to write(2) and also not increment any counter
> either, or maybe I am not looking where I should...

What happens if you increase these sysctls ?

grep .  `find /proc/sys|grep unres_qlen`


unres_qlen_bytes -> 2000000
unres_qlen -> 10000
Florian Fainelli Aug. 26, 2017, 1:17 a.m. UTC | #4
On 08/25/2017 04:57 PM, Eric Dumazet wrote:
> On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:
> 
>> Eric, are there areas of the stack where we are allowed to drop packets,
>> not propagate that back to write(2) and also not increment any counter
>> either, or maybe I am not looking where I should...
> 
> What happens if you increase these sysctls ?

I don't see packet loss after I tweak these two sysctls according to
your suggestions.

Tweaking eth0's sysctls did not change anything, but tweaking gphy's
sysctl resolved the loss. This was a little surprising considering that
gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that
does the real transmission.

Does that make sense with respect to what I reported earlier? Should I
try to dump the neigh stats?

Thanks!

> 
> grep .  `find /proc/sys|grep unres_qlen`
> 
> 
> unres_qlen_bytes -> 2000000
> unres_qlen -> 10000
> 
>
Eric Dumazet Aug. 26, 2017, 1:52 a.m. UTC | #5
On Fri, 2017-08-25 at 18:17 -0700, Florian Fainelli wrote:
> On 08/25/2017 04:57 PM, Eric Dumazet wrote:
> > On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:
> > 
> >> Eric, are there areas of the stack where we are allowed to drop packets,
> >> not propagate that back to write(2) and also not increment any counter
> >> either, or maybe I am not looking where I should...
> > 
> > What happens if you increase these sysctls ?
> 
> I don't see packet loss after I tweak these two sysctls according to
> your suggestions.
> 
> Tweaking eth0's sysctls did not change anything, but tweaking gphy's
> sysctl resolved the loss. This was a little surprising considering that
> gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that
> does the real transmission.
> 
> Does that make sense with respect to what I reported earlier? Should I
> try to dump the neigh stats?

Note that if you had TCP traffic, the neighbour would be constantly
confirmed and no losses would happen.

I guess we should an SNMP counter for packets dropped in neigh queues.
Florian Fainelli Aug. 26, 2017, 3:25 a.m. UTC | #6
On 08/25/2017 06:52 PM, Eric Dumazet wrote:
> On Fri, 2017-08-25 at 18:17 -0700, Florian Fainelli wrote:
>> On 08/25/2017 04:57 PM, Eric Dumazet wrote:
>>> On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:
>>>
>>>> Eric, are there areas of the stack where we are allowed to drop packets,
>>>> not propagate that back to write(2) and also not increment any counter
>>>> either, or maybe I am not looking where I should...
>>>
>>> What happens if you increase these sysctls ?
>>
>> I don't see packet loss after I tweak these two sysctls according to
>> your suggestions.
>>
>> Tweaking eth0's sysctls did not change anything, but tweaking gphy's
>> sysctl resolved the loss. This was a little surprising considering that
>> gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that
>> does the real transmission.
>>
>> Does that make sense with respect to what I reported earlier? Should I
>> try to dump the neigh stats?
> 
> Note that if you had TCP traffic, the neighbour would be constantly
> confirmed and no losses would happen.

OK, that still sounds like quite a lot for a not so long UDP session (60
seconds).

I was finally able to get a better capture by switching to an ARM64
kernel, and as confirmed this is all coming from the neighbour code:

# Event count (approx.): 1970
#
# Children      Self  Trace output

# ........  ........
....................................................................
#
     3.10%     3.10%  skbaddr=0xffffffc2fa22a800 protocol=2048
location=0xffffff80086e53f4
            |
            ---write
               el0_svc_naked
               sys_write
               vfs_write
               __vfs_write
               sock_write_iter
               sock_sendmsg
               inet_sendmsg
               udp_sendmsg
               udp_send_skb
               ip_send_skb
               ip_local_out
               ip_output
               ip_finish_output
               ip_finish_output2
               neigh_resolve_output
               __neigh_event_send
               kfree_skb
               kfree_skb

     3.10%     3.10%  skbaddr=0xffffffc2fa22a900 protocol=2048
location=0xffffff80086e53f4
            |
            ---write
               el0_svc_naked
               sys_write
               vfs_write
               __vfs_write
               sock_write_iter
               sock_sendmsg
               inet_sendmsg
               udp_sendmsg
               udp_send_skb
               ip_send_skb
               ip_local_out
               ip_output
               ip_finish_output
               ip_finish_output2
               neigh_resolve_output
               __neigh_event_send
               kfree_skb
               kfree_skb

     3.10%     3.10%  skbaddr=0xffffffc2fa22aa00 protocol=2048
location=0xffffff80086e53f4
            |
            ---write
               el0_svc_naked
               sys_write
               vfs_write
               __vfs_write
               sock_write_iter
               sock_sendmsg
               inet_sendmsg
               udp_sendmsg
               udp_send_skb
               ip_send_skb
               ip_local_out
               ip_output
               ip_finish_output
               ip_finish_output2
               neigh_resolve_output
               __neigh_event_send
               kfree_skb
               kfree_skb

> 
> I guess we should an SNMP counter for packets dropped in neigh queues.

It would. Since the call trace involves udp_send_skb() how come we are
not returning an error to write(2)? are there other code paths where the
neighbor code can do drops like these?
Eric Dumazet Aug. 26, 2017, 3:40 a.m. UTC | #7
On Fri, 2017-08-25 at 20:25 -0700, Florian Fainelli wrote:

> It would. Since the call trace involves udp_send_skb() how come we are
> not returning an error to write(2)? are there other code paths where the
> neighbor code can do drops like these?

Are you suggesting write(2) should block until ARP resolution is
done ? :)

What about non blocking writes ?

Honestly UDP is not a protocol for which we must absolutely be sure
packets are sent or not.
David Miller Aug. 26, 2017, 4:17 a.m. UTC | #8
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Fri, 25 Aug 2017 20:25:26 -0700

> It would. Since the call trace involves udp_send_skb() how come we are
> not returning an error to write(2)? are there other code paths where the
> neighbor code can do drops like these?

Keep in mind that the neighbour code isn't dropping the current 'skb'
coming from the IP stack, it's dropping the oldest packet in the
resolution queue.
David Miller Aug. 26, 2017, 4:19 a.m. UTC | #9
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Aug 2017 20:40:44 -0700

> On Fri, 2017-08-25 at 20:25 -0700, Florian Fainelli wrote:
> 
>> It would. Since the call trace involves udp_send_skb() how come we are
>> not returning an error to write(2)? are there other code paths where the
>> neighbor code can do drops like these?
> 
> Are you suggesting write(2) should block until ARP resolution is
> done ? :)
> 
> What about non blocking writes ?
> 
> Honestly UDP is not a protocol for which we must absolutely be sure
> packets are sent or not.

Agreed, but the ARP resolution queue really needs to scale it's backlog
to the physical technology it is attached to.
Eric Dumazet Aug. 26, 2017, 5:20 a.m. UTC | #10
On Fri, 2017-08-25 at 18:52 -0700, Eric Dumazet wrote:

> I guess we should an SNMP counter for packets dropped in neigh queues.

Info is already there :

cat /proc/net/stat/arp_cache
diff mbox

Patch

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
index dc3052751bc1..fcfa8d991850 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -597,7 +597,7 @@  static int bcm_sysport_set_coalesce(struct net_device *dev,
 
 static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb)
 {
-	dev_kfree_skb_any(cb->skb);
+	dev_consume_skb_any(cb->skb);
 	cb->skb = NULL;
 	dma_unmap_addr_set(cb, dma_addr, 0);
 }