rcu locking issue in mpls output code?

Message ID 20160620004546.GP20238@wantstofly.org
State RFC, archived
Delegated to: David Miller

Commit Message

Lennert Buytenhek June 20, 2016, 12:45 a.m. UTC
Hi!

While trying to chase down a memory corruption issue that only occurs
when originating large amounts of MPLS tagged IP traffic, I came across
something in the MPLS output code for which I'm not entirely sure that
it's correct.

Specifically, there is the code path dst_output() -> lwtunnel_output()
-> mpls_output() -> neigh_xmit() -> ___neigh_lookup_noref(), where the
latter accesses a RCU-bh protected struct neigh_table pointer, but there
is no RCU-bh protection being arranged anywhere in this call chain.

Since this is locally generated IP traffic, we're running in process
context.  lwtunnel_output() does hold rcu_read_lock() across its call
to lwtunnel_encap_ops::output() (which is mpls_output() here), but
nothing in the chain disables BHs.  In RCU-bh, the completion of a
softirq signals the end of any pending read-side critical sections,
and a softirq can interrupt this call chain at any time because it
runs with hardirqs and softirqs both enabled.  That would mean that
neighbour table entries can be zapped at any time even while we hold
rcu_read_lock().  I think.
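
To make the reasoning above concrete, here is a minimal userspace sketch.
It is a toy model only: every toy_* name is hypothetical and none of it is
kernel API. It assumes, as the paragraph above argues, that an RCU-bh
reader is safe only while no softirq can complete on the current CPU,
that is, while BHs are disabled or the code is itself running in softirq
context.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; every toy_* name is hypothetical, not kernel API. */
struct toy_ctx {
	bool in_softirq;   /* true when the caller runs in softirq context */
	int  bh_disabled;  /* nesting depth of BH-disabled sections */
	int  rcu_nesting;  /* nesting depth of plain RCU read sections */
};

static void toy_rcu_read_lock(struct toy_ctx *c)
{
	c->rcu_nesting++;
}

static void toy_rcu_read_unlock(struct toy_ctx *c)
{
	assert(c->rcu_nesting > 0);
	c->rcu_nesting--;
}

static void toy_rcu_read_lock_bh(struct toy_ctx *c)
{
	c->bh_disabled++;
}

static void toy_rcu_read_unlock_bh(struct toy_ctx *c)
{
	assert(c->bh_disabled > 0);
	c->bh_disabled--;
}

/* Assumption of the model: an RCU-bh reader is protected only if no
 * softirq can complete underneath it.  Plain rcu_nesting is ignored. */
static bool toy_rcu_bh_reader_safe(const struct toy_ctx *c)
{
	return c->in_softirq || c->bh_disabled > 0;
}

/* Models the output path: the caller holds only the plain read lock,
 * and take_bh_lock selects whether the proposed _bh pair is added. */
static bool toy_output_path_safe(struct toy_ctx *c, bool take_bh_lock)
{
	bool safe;

	toy_rcu_read_lock(c);
	if (take_bh_lock)
		toy_rcu_read_lock_bh(c);

	/* This is the point where the neighbour lookup would happen. */
	safe = toy_rcu_bh_reader_safe(c);

	if (take_bh_lock)
		toy_rcu_read_unlock_bh(c);
	toy_rcu_read_unlock(c);
	return safe;
}
```

In this model a process-context caller is unprotected without the _bh
pair and protected with it, while a softirq-context caller is protected
either way.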

The mpls_forward() path doesn't seem susceptible to the same issue,
as it runs from softirq, where rcu_read_lock() suffices, so I figured
that mpls_output() would be a good place to deal with this and that
something like the patch below would do the trick.  I can't say yet if
this makes my memory corruption issues go away, as they don't reproduce
that easily, but I'll keep testing.  Any thoughts so far?


Thanks,
Lennert

Comments

David Ahern June 20, 2016, 2:19 a.m. UTC | #1
On 6/19/16 6:45 PM, Lennert Buytenhek wrote:
> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
> index fb31aa8..802956b 100644
> --- a/net/mpls/mpls_iptunnel.c
> +++ b/net/mpls/mpls_iptunnel.c
> @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>  		bos = false;
>  	}
>
> +	rcu_read_lock_bh();
>  	if (rt)
>  		err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
>  				 skb);
>  	else if (rt6)
>  		err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
>  				 skb);
> +	rcu_read_unlock_bh();
> +
>  	if (err)
>  		net_dbg_ratelimited("%s: packet transmission failed: %d\n",
>  				    __func__, err);
>

I think those need to be added to neigh_xmit in the

	if (likely(index < NEIGH_NR_TABLES)) {

	}

block.
Lennert Buytenhek June 20, 2016, 6:30 a.m. UTC | #2
On Sun, Jun 19, 2016 at 08:19:20PM -0600, David Ahern wrote:

> > diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
> > index fb31aa8..802956b 100644
> > --- a/net/mpls/mpls_iptunnel.c
> > +++ b/net/mpls/mpls_iptunnel.c
> > @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >  		bos = false;
> >  	}
> > 
> > +	rcu_read_lock_bh();
> >  	if (rt)
> >  		err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
> >  				 skb);
> >  	else if (rt6)
> >  		err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
> >  				 skb);
> > +	rcu_read_unlock_bh();
> > +
> >  	if (err)
> >  		net_dbg_ratelimited("%s: packet transmission failed: %d\n",
> >  				    __func__, err);
> > 
> 
> I think those need to be added to neigh_xmit in the
> 
> 	if (likely(index < NEIGH_NR_TABLES)) {
> 
> 	}

That'll force callers that don't need the extra protection to eat the
useless overhead of an extra rcu_read_{,un}lock_bh() pair.
mpls_forward() is one such caller: it always runs from softirq, where
rcu_read_lock() is enough to protect the neigh state, and we're
already running under rcu_read_lock() when we get to neigh_xmit().
But sure, functionally that's correct, I think, and in my workload I
don't care about MPLS forwarding performance anyway. ;-)

Want me to send a patch moving it to neigh_xmit() ?

Thank you for having a look!


Cheers,
Lennert
David Ahern June 20, 2016, 3:19 p.m. UTC | #3
On 6/20/16 12:30 AM, Lennert Buytenhek wrote:
> On Sun, Jun 19, 2016 at 08:19:20PM -0600, David Ahern wrote:
>
>>> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
>>> index fb31aa8..802956b 100644
>>> --- a/net/mpls/mpls_iptunnel.c
>>> +++ b/net/mpls/mpls_iptunnel.c
>>> @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>>  		bos = false;
>>>  	}
>>>
>>> +	rcu_read_lock_bh();
>>>  	if (rt)
>>>  		err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
>>>  				 skb);
>>>  	else if (rt6)
>>>  		err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
>>>  				 skb);
>>> +	rcu_read_unlock_bh();
>>> +
>>>  	if (err)
>>>  		net_dbg_ratelimited("%s: packet transmission failed: %d\n",
>>>  				    __func__, err);
>>>
>>
>> I think those need to be added to neigh_xmit in the
>>
>> 	if (likely(index < NEIGH_NR_TABLES)) {
>>
>> 	}
>
> That'll force callers that don't need the extra protection (i.e.
> mpls_forward(), since that always runs from softirq and it's enough
> to protect the neigh state with rcu_read_lock() from softirq and we're
> already running under rcu_read_lock() when we get to neigh_xmit()) to
> eat the useless overhead of an extra rcu_read_{,un}lock_bh() pair, but
> sure, functionally that's correct, I think, and in my workload I don't
> care about MPLS forwarding performance anyway. ;-)

__neigh_lookup_noref expects bh level protection. Since the if block in
neigh_xmit requires the locking, this seems like the appropriate place
for it.

>
> Want me to send a patch moving it to neigh_xmit() ?

Roopa/Robert: agree?
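
The placement discussed in this message can be sketched in userspace as
follows. This is a toy model under stated assumptions, not the kernel's
neigh_xmit(): all toy_* names, the nr_tables parameter, and the bh_pairs
counter are hypothetical. It only illustrates that taking the _bh pair
inside the table branch covers every caller, at the cost of one extra
pair for callers that already run in softirq context.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; all toy_* names are hypothetical, not kernel API. */
struct toy_cpu {
	bool in_softirq;   /* context of the caller */
	int  bh_disabled;  /* nesting depth of BH-disabled sections */
	int  bh_pairs;     /* completed lock_bh/unlock_bh pairs */
};

static void toy_rcu_read_lock_bh(struct toy_cpu *c)
{
	c->bh_disabled++;
}

static void toy_rcu_read_unlock_bh(struct toy_cpu *c)
{
	assert(c->bh_disabled > 0);
	c->bh_disabled--;
	c->bh_pairs++;
}

/* Stand-in for the lookup: reports whether BH-level protection is in
 * effect at the moment the lookup would run. */
static bool toy_lookup_protected(const struct toy_cpu *c)
{
	return c->in_softirq || c->bh_disabled > 0;
}

/* Stand-in for the transmit helper with the lock pair placed inside
 * the "index is a valid table" branch. */
static bool toy_neigh_xmit(struct toy_cpu *c, int index, int nr_tables)
{
	bool ok = false;

	if (index < nr_tables) {
		toy_rcu_read_lock_bh(c);
		ok = toy_lookup_protected(c);
		toy_rcu_read_unlock_bh(c);
	}
	return ok;
}
```

With this placement both a process-context caller (the output path) and
a softirq-context caller (the forwarding path) see a protected lookup.
The softirq caller pays for one pair it did not strictly need, which is
the overhead raised earlier in the thread.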
Roopa Prabhu June 20, 2016, 4:13 p.m. UTC | #4
On Mon, Jun 20, 2016 at 8:19 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 6/20/16 12:30 AM, Lennert Buytenhek wrote:
>>
>> On Sun, Jun 19, 2016 at 08:19:20PM -0600, David Ahern wrote:
>>
>>>> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
>>>> index fb31aa8..802956b 100644
>>>> --- a/net/mpls/mpls_iptunnel.c
>>>> +++ b/net/mpls/mpls_iptunnel.c
>>>> @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct
>>>> sock *sk, struct sk_buff *skb)
>>>>                 bos = false;
>>>>         }
>>>>
>>>> +       rcu_read_lock_bh();
>>>>         if (rt)
>>>>                 err = neigh_xmit(NEIGH_ARP_TABLE, out_dev,
>>>> &rt->rt_gateway,
>>>>                                  skb);
>>>>         else if (rt6)
>>>>                 err = neigh_xmit(NEIGH_ND_TABLE, out_dev,
>>>> &rt6->rt6i_gateway,
>>>>                                  skb);
>>>> +       rcu_read_unlock_bh();
>>>> +
>>>>         if (err)
>>>>                 net_dbg_ratelimited("%s: packet transmission failed:
>>>> %d\n",
>>>>                                     __func__, err);
>>>>
>>>
>>> I think those need to be added to neigh_xmit in the
>>>
>>>         if (likely(index < NEIGH_NR_TABLES)) {
>>>
>>>         }
>>
>>
>> That'll force callers that don't need the extra protection (i.e.
>> mpls_forward(), since that always runs from softirq and it's enough
>> to protect the neigh state with rcu_read_lock() from softirq and we're
>> already running under rcu_read_lock() when we get to neigh_xmit()) to
>> eat the useless overhead of an extra rcu_read_{,un}lock_bh() pair, but
>> sure, functionally that's correct, I think, and in my workload I don't
>> care about MPLS forwarding performance anyway. ;-)
>
>
> __neigh_lookup_noref expects bh level protection. Since the if block in
> neigh_xmit requires the locking, this seems like the appropriate place for
> it.
>
>>
>> Want me to send a patch moving it to neigh_xmit() ?
>
>
> Roopa/Robert: agree?
>

Yes, that seems like an appropriate place for it, provided it does not
add unnecessary overhead for others.
But then neigh_xmit seems to be called only from mpls_output and mpls_forward.

thanks!
Lennert Buytenhek June 20, 2016, 4:33 p.m. UTC | #5
On Mon, Jun 20, 2016 at 09:13:36AM -0700, Roopa Prabhu wrote:

> >>>> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
> >>>> index fb31aa8..802956b 100644
> >>>> --- a/net/mpls/mpls_iptunnel.c
> >>>> +++ b/net/mpls/mpls_iptunnel.c
> >>>> @@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct
> >>>> sock *sk, struct sk_buff *skb)
> >>>>                 bos = false;
> >>>>         }
> >>>>
> >>>> +       rcu_read_lock_bh();
> >>>>         if (rt)
> >>>>                 err = neigh_xmit(NEIGH_ARP_TABLE, out_dev,
> >>>> &rt->rt_gateway,
> >>>>                                  skb);
> >>>>         else if (rt6)
> >>>>                 err = neigh_xmit(NEIGH_ND_TABLE, out_dev,
> >>>> &rt6->rt6i_gateway,
> >>>>                                  skb);
> >>>> +       rcu_read_unlock_bh();
> >>>> +
> >>>>         if (err)
> >>>>                 net_dbg_ratelimited("%s: packet transmission failed:
> >>>> %d\n",
> >>>>                                     __func__, err);
> >>>>
> >>>
> >>> I think those need to be added to neigh_xmit in the
> >>>
> >>>         if (likely(index < NEIGH_NR_TABLES)) {
> >>>
> >>>         }
> >>
> >>
> >> That'll force callers that don't need the extra protection (i.e.
> >> mpls_forward(), since that always runs from softirq and it's enough
> >> to protect the neigh state with rcu_read_lock() from softirq and we're
> >> already running under rcu_read_lock() when we get to neigh_xmit()) to
> >> eat the useless overhead of an extra rcu_read_{,un}lock_bh() pair, but
> >> sure, functionally that's correct, I think, and in my workload I don't
> >> care about MPLS forwarding performance anyway. ;-)
> >
> >
> > __neigh_lookup_noref expects bh level protection. Since the if block in
> > neigh_xmit requires the locking, this seems like the appropriate place for
> > it.
> >
> >>
> >> Want me to send a patch moving it to neigh_xmit() ?
> >
> >
> > Roopa/Robert: agree?
> 
> yes, seems like an appropriate place for it.  provided it does not add
> unnecessary overhead for others.
> But then neigh_xmit seems to be only called from mpls_output and mpls_forward.

OK, patch coming up.  Thanks!
David Ahern June 20, 2016, 4:38 p.m. UTC | #6
On 6/20/16 10:33 AM, Lennert Buytenhek wrote:
> OK, patch coming up.  Thanks!

Can you also build a kernel with RCU debugging enabled and run it
through your tests?

Thanks,
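
The message does not say which options "rcu debugging" refers to. As an
assumption, a config fragment along these lines enables the lockdep-based
RCU checks on kernels of this era; the exact option names and their
dependencies vary between kernel versions, so check them against the
tree being built.

```
# Assumed options, not taken from the thread.
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_RCU=y
```

With checks like these enabled, an rcu_dereference_bh() that runs without
BH-level protection should produce a warning about suspicious RCU usage
in the kernel log.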
Patch

diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index fb31aa8..802956b 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -105,12 +105,15 @@ static int mpls_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		bos = false;
 	}
 
+	rcu_read_lock_bh();
 	if (rt)
 		err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
 				 skb);
 	else if (rt6)
 		err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
 				 skb);
+	rcu_read_unlock_bh();
+
 	if (err)
 		net_dbg_ratelimited("%s: packet transmission failed: %d\n",
 				    __func__, err);