Message ID | 543CAD2A.3070701@parallels.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
From: Vasily Averin <vvs@parallels.com> Date: Tue, 14 Oct 2014 08:57:14 +0400 > v2: adjust the indentation of the arguments __ip_append_data() call > > Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()") > > If sk_write_queue is empty ip_append_data() executes ip_setup_cork() > that "steals" dst entry from rt to cork. Later it calls __ip_append_data() > that creates skb and adds it to sk_write_queue. > > If skb was added successfully following ip_push_pending_frames() call > reassign dst entries from cork to skb, and kfree_skb frees dst_entry. > > However nobody frees stolen dst_entry if skb was not added into sk_write_queue. > > Signed-off-by: Vasily Averin <vvs@parallels.com> Why doesn't ip_make_skb() need the same fix? It seems to do the same thing. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2014-10-14 at 08:57 +0400, Vasily Averin wrote: > v2: adjust the indentation of the arguments __ip_append_data() call > > Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()") > > If sk_write_queue is empty ip_append_data() executes ip_setup_cork() > that "steals" dst entry from rt to cork. Later it calls __ip_append_data() > that creates skb and adds it to sk_write_queue. > > If skb was added successfully following ip_push_pending_frames() call > reassign dst entries from cork to skb, and kfree_skb frees dst_entry. > > However nobody frees stolen dst_entry if skb was not added into sk_write_queue. I thought this was done by ip_flush_pending_frames() ? Can you describe the issue more precisely ? Thanks ! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 15.10.2014 08:46, Eric Dumazet wrote: > On Tue, 2014-10-14 at 08:57 +0400, Vasily Averin wrote: >> v2: adjust the indentation of the arguments __ip_append_data() call >> >> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()") >> >> If sk_write_queue is empty ip_append_data() executes ip_setup_cork() >> that "steals" dst entry from rt to cork. Later it calls __ip_append_data() >> that creates skb and adds it to sk_write_queue. >> >> If skb was added successfully following ip_push_pending_frames() call >> reassign dst entries from cork to skb, and kfree_skb frees dst_entry. >> >> However nobody frees stolen dst_entry if skb was not added into sk_write_queue. > > I thought this was done by ip_flush_pending_frames() ? Take look at ip_send_unicast_reply(): ip_flush_pending_frames() is not called if skb was not added to sk_write_queue. And ip_rt_put() does not work, because dst entry was stolen in ip_setup_cork(). Probably it can happen in raw_sendmsg() and udp_sendmsg() too. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 15.10.2014 00:12, David Miller wrote: > From: Vasily Averin <vvs@parallels.com> > Date: Tue, 14 Oct 2014 08:57:14 +0400 > >> v2: adjust the indentation of the arguments __ip_append_data() call >> >> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()") >> >> If sk_write_queue is empty ip_append_data() executes ip_setup_cork() >> that "steals" dst entry from rt to cork. Later it calls __ip_append_data() >> that creates skb and adds it to sk_write_queue. >> >> If skb was added successfully following ip_push_pending_frames() call >> reassign dst entries from cork to skb, and kfree_skb frees dst_entry. >> >> However nobody frees stolen dst_entry if skb was not added into sk_write_queue. >> >> Signed-off-by: Vasily Averin <vvs@parallels.com> > > Why doesn't ip_make_skb() need the same fix? It seems to do the same > thing. It seems for me ip_make_skb() works (almost) correctly, but seems refcounting can be is incorrect if queue can be not empty (Please see details below). If __ip_append_data() returns errors ip_make_skb() calls __ip_flush_pending_frames() that calls ip_cork_release() inside and frees stolen dst_entry. If __ip_append_data() returns success -- dst refcounter changes are not required. In this case skb will be created and added to queue (and it will not be empty) Later in __ip_make_skb() these skb will get dst reference, and refcounter will be decremented during kfree_skb(). I do not like that there is such unclear dependency between functions, but seems currently it works correctly. However I afraid dst refcountng can work incorrectly if sk_write_queue can be not empty at the moment of ip_append_data() call. It was not happen in case ip_send_unicast_reply() but probably can happen in other places. Let's calculate dst refcounters changes in this case. First packet: dst_refcount increment was happen in ip_append_data() caller, taken during rt lookup - ip_append_data(): -- sk_write_queue is empty, ip_setup_cork() steals dst entry -- __ip_append_data() adds skb to queue, queue is not flushed, waiting for next packets. ip_rt_put in ip_append_data() caller does not work, because dst reference was stolen. dst refcount here +1 then we want to sent 2nd packet: dst refcount increment was happen in ip_append_data() caller - ip_append_data(): -- sk_write_queue is NOT empty, dst was not stolen -- __ip_append_data() adds skb to queue ip_rt_put in ip_append_data() caller decrements dst refcount, because it as not stolen dst refcount here +1 Then we handle new packets, all of them are added to queue dst refcount is still +1 Then queue is flushed. Each packet in queue get dst reference from cork, Each kfree_skb decrements dst refcounter, and it may become negative. Am I wrong probably? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2014-10-15 at 10:56 +0400, Vasily Averin wrote: > On 15.10.2014 08:46, Eric Dumazet wrote: > > On Tue, 2014-10-14 at 08:57 +0400, Vasily Averin wrote: > >> v2: adjust the indentation of the arguments __ip_append_data() call > >> > >> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()") > >> > >> If sk_write_queue is empty ip_append_data() executes ip_setup_cork() > >> that "steals" dst entry from rt to cork. Later it calls __ip_append_data() > >> that creates skb and adds it to sk_write_queue. > >> > >> If skb was added successfully following ip_push_pending_frames() call > >> reassign dst entries from cork to skb, and kfree_skb frees dst_entry. > >> > >> However nobody frees stolen dst_entry if skb was not added into sk_write_queue. > > > > I thought this was done by ip_flush_pending_frames() ? > > Take look at ip_send_unicast_reply(): So maybe the bug is here ? > > ip_flush_pending_frames() is not called if skb was not added to sk_write_queue. > And ip_rt_put() does not work, because dst entry was stolen in ip_setup_cork(). > > Probably it can happen in raw_sendmsg() and udp_sendmsg() too. UDP & RAW do : err = ip_append_data(...); if (err) udp_flush_pending_frames(sk); It seems you chose to add a test in fast path, with not even adding an unlikely() clause, while it seems that we took care of all the cases but missed a single one : ip_send_unicast_reply() I am suggesting to fix this bug in another way. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 15.10.2014 13:30, Eric Dumazet wrote: > On Wed, 2014-10-15 at 10:56 +0400, Vasily Averin wrote: >> On 15.10.2014 08:46, Eric Dumazet wrote: >>> On Tue, 2014-10-14 at 08:57 +0400, Vasily Averin wrote: >>>> v2: adjust the indentation of the arguments __ip_append_data() call >>>> >>>> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()") >>>> >>>> If sk_write_queue is empty ip_append_data() executes ip_setup_cork() >>>> that "steals" dst entry from rt to cork. Later it calls __ip_append_data() >>>> that creates skb and adds it to sk_write_queue. >>>> >>>> If skb was added successfully following ip_push_pending_frames() call >>>> reassign dst entries from cork to skb, and kfree_skb frees dst_entry. >>>> >>>> However nobody frees stolen dst_entry if skb was not added into sk_write_queue. >>> >>> I thought this was done by ip_flush_pending_frames() ? >> >> Take look at ip_send_unicast_reply(): > > So maybe the bug is here ? Thank you, I'll remake my patch. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index e35b712..3ba2291 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1120,6 +1120,15 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork, return 0; } +static void ip_cork_release(struct inet_cork *cork) +{ + cork->flags &= ~IPCORK_OPT; + kfree(cork->opt); + cork->opt = NULL; + dst_release(cork->dst); + cork->dst = NULL; +} + /* * ip_append_data() and ip_append_page() can make one large IP datagram * from many pieces of data. Each pieces will be holded on the socket @@ -1152,9 +1161,14 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4, transhdrlen = 0; } - return __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base, - sk_page_frag(sk), getfrag, - from, length, transhdrlen, flags); + err = __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base, + sk_page_frag(sk), getfrag, + from, length, transhdrlen, flags); + + if (skb_queue_empty(&sk->sk_write_queue)) + ip_cork_release(&inet->cork.base); + + return err; } ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page, @@ -1304,15 +1318,6 @@ error: return err; } -static void ip_cork_release(struct inet_cork *cork) -{ - cork->flags &= ~IPCORK_OPT; - kfree(cork->opt); - cork->opt = NULL; - dst_release(cork->dst); - cork->dst = NULL; -} - /* * Combined all pending IP fragments on the socket as one IP datagram * and push them out.
v2: adjust the indentation of the arguments __ip_append_data() call Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()") If sk_write_queue is empty ip_append_data() executes ip_setup_cork() that "steals" dst entry from rt to cork. Later it calls __ip_append_data() that creates skb and adds it to sk_write_queue. If skb was added successfully following ip_push_pending_frames() call reassign dst entries from cork to skb, and kfree_skb frees dst_entry. However nobody frees stolen dst_entry if skb was not added into sk_write_queue. Signed-off-by: Vasily Averin <vvs@parallels.com> --- net/ipv4/ip_output.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-)