Message ID | 1503535426.2499.66.camel@edumazet-glaptop3.roam.corp.google.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On 08/23/2017 05:43 PM, Eric Dumazet wrote: > On Wed, 2017-08-23 at 17:03 -0700, Florian Fainelli wrote: > >> Attached is the perf report --stdio of: >> >> # perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60 >> WARNING: option -b implies udp testing >> ------------------------------------------------------------ >> Client connecting to 192.168.1.23, UDP port 5001 >> Sending 1470 byte datagrams >> UDP buffer size: 160 KByte (default) >> ------------------------------------------------------------ >> [ 4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001 >> [ ID] Interval Transfer Bandwidth >> [ 4] 0.0-60.0 sec 685 MBytes 95.8 Mbits/sec >> [ 4] Sent 488633 datagrams >> [ 4] Server Report: >> [ 4] 0.0-74.4 sec 685 MBytes 77.2 Mbits/sec 0.187 ms 300/488632 >> (0.061%) >> [ 4] 0.0-74.4 sec 58 datagrams received out-of-order >> >> It measured 488644 events which is greater than the number of packets >> transmitted by iperf but I only count 8 non-IP packets being sent >> (protocl=2054 -> ETH_P_ARP) so I am not sure what the other 4 were and >> why there are not accounted for. >> >> Almost all events came from net_tx_action() except 2 that came from >> net/core/neighbour.c::neigh_probe() and 65 that came from >> arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!? > > Oh you have too much noise and need this fix : > > diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c > index dc3052751bc1..fcfa8d991850 100644 > --- a/drivers/net/ethernet/broadcom/bcmsysport.c > +++ b/drivers/net/ethernet/broadcom/bcmsysport.c > @@ -597,7 +597,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev, > > static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb) > { > - dev_kfree_skb_any(cb->skb); > + dev_consume_skb_any(cb->skb); > cb->skb = NULL; > dma_unmap_addr_set(cb, dma_addr, 0); > } > Yay, now I am getting some sensible information, thanks! iperf reports 143 packets lost, and perf report gets me 146 entries for kfree_skb() looking like that: # 2.74% 2.74% skbaddr=0xee5f30c0 protocol=2048 location=0xc0855cdc | ---0x2d write 0xe9fc2368 kfree_skb What is annoying is that 0xc0855cdc is resolved to arch/arm/include/asm/irqflags.h::arch_local_irq_save() which does not really help telling me where exactly in the stack the drop is happening, although now I know it is somewhere down the path from write(2)... of course it is :) Now what is disturbing too is that iperf does have its write() system call get an error returned, write happily returned the number of bytes written even though the perf trace indicates there was packet drops down the road.. I will keep investigating, thanks a lot Eric!
On 08/23/2017 07:23 PM, Florian Fainelli wrote: > On 08/23/2017 05:43 PM, Eric Dumazet wrote: >> On Wed, 2017-08-23 at 17:03 -0700, Florian Fainelli wrote: >> >>> Attached is the perf report --stdio of: >>> >>> # perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60 >>> WARNING: option -b implies udp testing >>> ------------------------------------------------------------ >>> Client connecting to 192.168.1.23, UDP port 5001 >>> Sending 1470 byte datagrams >>> UDP buffer size: 160 KByte (default) >>> ------------------------------------------------------------ >>> [ 4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001 >>> [ ID] Interval Transfer Bandwidth >>> [ 4] 0.0-60.0 sec 685 MBytes 95.8 Mbits/sec >>> [ 4] Sent 488633 datagrams >>> [ 4] Server Report: >>> [ 4] 0.0-74.4 sec 685 MBytes 77.2 Mbits/sec 0.187 ms 300/488632 >>> (0.061%) >>> [ 4] 0.0-74.4 sec 58 datagrams received out-of-order >>> >>> It measured 488644 events which is greater than the number of packets >>> transmitted by iperf but I only count 8 non-IP packets being sent >>> (protocl=2054 -> ETH_P_ARP) so I am not sure what the other 4 were and >>> why there are not accounted for. >>> >>> Almost all events came from net_tx_action() except 2 that came from >>> net/core/neighbour.c::neigh_probe() and 65 that came from >>> arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!? >> >> Oh you have too much noise and need this fix : >> >> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c >> index dc3052751bc1..fcfa8d991850 100644 >> --- a/drivers/net/ethernet/broadcom/bcmsysport.c >> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c >> @@ -597,7 +597,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev, >> >> static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb) >> { >> - dev_kfree_skb_any(cb->skb); >> + dev_consume_skb_any(cb->skb); >> cb->skb = NULL; >> dma_unmap_addr_set(cb, dma_addr, 0); >> } >> > > Yay, now I am getting some sensible information, thanks! > > iperf reports 143 packets lost, and perf report gets me 146 entries for > kfree_skb() looking like that: > # > 2.74% 2.74% skbaddr=0xee5f30c0 protocol=2048 location=0xc0855cdc > | > ---0x2d > write > 0xe9fc2368 > kfree_skb > > > What is annoying is that 0xc0855cdc is resolved to > arch/arm/include/asm/irqflags.h::arch_local_irq_save() which does not > really help telling me where exactly in the stack the drop is happening, > although now I know it is somewhere down the path from write(2)... of > course it is :) > > Now what is disturbing too is that iperf does have its write() system > call get an error returned, write happily returned the number of bytes > written even though the perf trace indicates there was packet drops down > the road.. Because perf was not able to track down exactly where the location was with builtin_return_address(0) and always pointed to arch_local_irq_save(), I added a WARN() in kfree_skb(), and of course this made it impossible to see packet loss under the same conditions anymore. After increasing dramatically wmem_{default,max} I could start seeing significant drop made largely worse by the printks, so not very conclusive either. I took another approach and recorded net_dev_queue() events, and I can see only 976704 (gphy + eth0 devices) / 2 = 488352 dev_queue_events() were recorded. The total sent by iperf was 488898 and we lost 558 packets according to the server 488352 + 558 > 488898 which makes sense because neigh_probe() and others still run. This still confirms there is a drop occurring between UDP socket sendmsg and the SKB enqueuing... So my take away is that the CPU has a bursty write(2) behavior, and the larger the socket write buffer size, the more datagrams you can burst into and the more of these SKBs can be flat out being dropped when the transmit queue is congested. Adding a WARN() is enough of a slow down for the CPU that we are not able to reliably reproduce this burst. Eric, are there areas of the stack where we are allowed to drop packets, not propagate that back to write(2) and also not increment any counter either, or maybe I am not looking where I should... Thanks!
On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote: > Eric, are there areas of the stack where we are allowed to drop packets, > not propagate that back to write(2) and also not increment any counter > either, or maybe I am not looking where I should... What happens if you increase these sysctls ? grep . `find /proc/sys|grep unres_qlen` unres_qlen_bytes -> 2000000 unres_qlen -> 10000
On 08/25/2017 04:57 PM, Eric Dumazet wrote: > On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote: > >> Eric, are there areas of the stack where we are allowed to drop packets, >> not propagate that back to write(2) and also not increment any counter >> either, or maybe I am not looking where I should... > > What happens if you increase these sysctls ? I don't see packet loss after I tweak these two sysctls according to your suggestions. Tweaking eth0's sysctls did not change anything, but tweaking gphy's sysctl resolved the loss. This was a little surprising considering that gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that does the real transmission. Does that make sense with respect to what I reported earlier? Should I try to dump the neigh stats? Thanks! > > grep . `find /proc/sys|grep unres_qlen` > > > unres_qlen_bytes -> 2000000 > unres_qlen -> 10000 > >
On Fri, 2017-08-25 at 18:17 -0700, Florian Fainelli wrote: > On 08/25/2017 04:57 PM, Eric Dumazet wrote: > > On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote: > > > >> Eric, are there areas of the stack where we are allowed to drop packets, > >> not propagate that back to write(2) and also not increment any counter > >> either, or maybe I am not looking where I should... > > > > What happens if you increase these sysctls ? > > I don't see packet loss after I tweak these two sysctls according to > your suggestions. > > Tweaking eth0's sysctls did not change anything, but tweaking gphy's > sysctl resolved the loss. This was a little surprising considering that > gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that > does the real transmission. > > Does that make sense with respect to what I reported earlier? Should I > try to dump the neigh stats? Note that if you had TCP traffic, the neighbour would be constantly confirmed and no losses would happen. I guess we should an SNMP counter for packets dropped in neigh queues.
On 08/25/2017 06:52 PM, Eric Dumazet wrote: > On Fri, 2017-08-25 at 18:17 -0700, Florian Fainelli wrote: >> On 08/25/2017 04:57 PM, Eric Dumazet wrote: >>> On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote: >>> >>>> Eric, are there areas of the stack where we are allowed to drop packets, >>>> not propagate that back to write(2) and also not increment any counter >>>> either, or maybe I am not looking where I should... >>> >>> What happens if you increase these sysctls ? >> >> I don't see packet loss after I tweak these two sysctls according to >> your suggestions. >> >> Tweaking eth0's sysctls did not change anything, but tweaking gphy's >> sysctl resolved the loss. This was a little surprising considering that >> gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that >> does the real transmission. >> >> Does that make sense with respect to what I reported earlier? Should I >> try to dump the neigh stats? > > Note that if you had TCP traffic, the neighbour would be constantly > confirmed and no losses would happen. OK, that still sounds like quite a lot for a not so long UDP session (60 seconds). I was finally able to get a better capture by switching to an ARM64 kernel, and as confirmed this is all coming from the neighbour code: # Event count (approx.): 1970 # # Children Self Trace output # ........ ........ .................................................................... # 3.10% 3.10% skbaddr=0xffffffc2fa22a800 protocol=2048 location=0xffffff80086e53f4 | ---write el0_svc_naked sys_write vfs_write __vfs_write sock_write_iter sock_sendmsg inet_sendmsg udp_sendmsg udp_send_skb ip_send_skb ip_local_out ip_output ip_finish_output ip_finish_output2 neigh_resolve_output __neigh_event_send kfree_skb kfree_skb 3.10% 3.10% skbaddr=0xffffffc2fa22a900 protocol=2048 location=0xffffff80086e53f4 | ---write el0_svc_naked sys_write vfs_write __vfs_write sock_write_iter sock_sendmsg inet_sendmsg udp_sendmsg udp_send_skb ip_send_skb ip_local_out ip_output ip_finish_output ip_finish_output2 neigh_resolve_output __neigh_event_send kfree_skb kfree_skb 3.10% 3.10% skbaddr=0xffffffc2fa22aa00 protocol=2048 location=0xffffff80086e53f4 | ---write el0_svc_naked sys_write vfs_write __vfs_write sock_write_iter sock_sendmsg inet_sendmsg udp_sendmsg udp_send_skb ip_send_skb ip_local_out ip_output ip_finish_output ip_finish_output2 neigh_resolve_output __neigh_event_send kfree_skb kfree_skb > > I guess we should an SNMP counter for packets dropped in neigh queues. It would. Since the call trace involves udp_send_skb() how come we are not returning an error to write(2)? are there other code paths where the neighbor code can do drops like these?
On Fri, 2017-08-25 at 20:25 -0700, Florian Fainelli wrote: > It would. Since the call trace involves udp_send_skb() how come we are > not returning an error to write(2)? are there other code paths where the > neighbor code can do drops like these? Are you suggesting write(2) should block until ARP resolution is done ? :) What about non blocking writes ? Honestly UDP is not a protocol for which we must absolutely be sure packets are sent or not.
From: Florian Fainelli <f.fainelli@gmail.com> Date: Fri, 25 Aug 2017 20:25:26 -0700 > It would. Since the call trace involves udp_send_skb() how come we are > not returning an error to write(2)? are there other code paths where the > neighbor code can do drops like these? Keep in mind that the neighbour code isn't dropping the current 'skb' coming from the IP stack, it's dropping the oldest packet in the resolution queue.
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Fri, 25 Aug 2017 20:40:44 -0700 > On Fri, 2017-08-25 at 20:25 -0700, Florian Fainelli wrote: > >> It would. Since the call trace involves udp_send_skb() how come we are >> not returning an error to write(2)? are there other code paths where the >> neighbor code can do drops like these? > > Are you suggesting write(2) should block until ARP resolution is > done ? :) > > What about non blocking writes ? > > Honestly UDP is not a protocol for which we must absolutely be sure > packets are sent or not. Agreed, but the ARP resolution queue really needs to scale it's backlog to the physical technology it is attached to.
On Fri, 2017-08-25 at 18:52 -0700, Eric Dumazet wrote:
> I guess we should an SNMP counter for packets dropped in neigh queues.
Info is already there :
cat /proc/net/stat/arp_cache
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c index dc3052751bc1..fcfa8d991850 100644 --- a/drivers/net/ethernet/broadcom/bcmsysport.c +++ b/drivers/net/ethernet/broadcom/bcmsysport.c @@ -597,7 +597,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev, static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb) { - dev_kfree_skb_any(cb->skb); + dev_consume_skb_any(cb->skb); cb->skb = NULL; dma_unmap_addr_set(cb, dma_addr, 0); }