diff mbox

[net-next,v2,2/5] mpls: Remove incorrect PHP comment

Message ID 1426866170-28739-3-git-send-email-rshearma@brocade.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Robert Shearman March 20, 2015, 3:42 p.m. UTC
Popping the last label on the stack does not necessarily imply
performing penultimate hop popping. There is no reason why this
couldn't be the last hop in the network, so remove the comment.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 net/mpls/af_mpls.c | 1 -
 1 file changed, 1 deletion(-)

Comments

Eric W. Biederman March 22, 2015, 7:12 p.m. UTC | #1
Robert Shearman <rshearma@brocade.com> writes:

> Popping the last label on the stack does not necessarily imply
> performing penultimate hop popping. There is no reason why this
> couldn't be the last hop in the network, so remove the comment.

So this change I will disagree with.

What the code implements is Penultimate hop popping.  Even if you send
the packets over loopback that is what the code is doing.

This is relevant because I think the code may actually be wrong in the
local reception case.  By preforming penultimate hop popping and
receving the code on loopback I think this code allows bypassing
iptables rules that apply to incoming ip packets.  Certainly there is a
loss of information as to which hardware interface the packet came in on
that it may be desirable to correct.

Eric


> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Signed-off-by: Robert Shearman <rshearma@brocade.com>
> ---
>  net/mpls/af_mpls.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 0d6763a..bf3459a 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -199,7 +199,6 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
>  	skb->protocol = htons(ETH_P_MPLS_UC);
>  
>  	if (unlikely(!new_header_size && dec.bos)) {
> -		/* Penultimate hop popping */
>  		if (!mpls_egress(rt, skb, dec))
>  			goto drop;
>  	} else {
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert Shearman March 23, 2015, 11:32 a.m. UTC | #2
On 22/03/15 19:12, Eric W. Biederman wrote:
> Robert Shearman <rshearma@brocade.com> writes:
>
>> Popping the last label on the stack does not necessarily imply
>> performing penultimate hop popping. There is no reason why this
>> couldn't be the last hop in the network, so remove the comment.
>
> So this change I will disagree with.
>
> What the code implements is Penultimate hop popping.  Even if you send
> the packets over loopback that is what the code is doing.

No, RFC3031 s3.16 (https://tools.ietf.org/html/rfc3031#page-18) talks in 
terms of LSRs (label switch routers), not passes through the forwarding 
code.

> This is relevant because I think the code may actually be wrong in the
> local reception case.  By preforming penultimate hop popping and
> receving the code on loopback I think this code allows bypassing
> iptables rules that apply to incoming ip packets.  Certainly there is a
> loss of information as to which hardware interface the packet came in on
> that it may be desirable to correct.

Indeed, but network operators may well want to apply different rules to 
traffic coming in as IP versus traffic coming in as MPLS.

This may well merit a comment of its own, but this isn't directly 
relevant to the comment I'm removing.

Thanks,
Rob

>
> Eric
>
>
>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
>> ---
>>   net/mpls/af_mpls.c | 1 -
>>   1 file changed, 1 deletion(-)
>>
>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
>> index 0d6763a..bf3459a 100644
>> --- a/net/mpls/af_mpls.c
>> +++ b/net/mpls/af_mpls.c
>> @@ -199,7 +199,6 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
>>   	skb->protocol = htons(ETH_P_MPLS_UC);
>>
>>   	if (unlikely(!new_header_size && dec.bos)) {
>> -		/* Penultimate hop popping */
>>   		if (!mpls_egress(rt, skb, dec))
>>   			goto drop;
>>   	} else {
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 23, 2015, 6:16 p.m. UTC | #3
Robert Shearman <rshearma@brocade.com> writes:

> On 22/03/15 19:12, Eric W. Biederman wrote:
>> Robert Shearman <rshearma@brocade.com> writes:
>>
>>> Popping the last label on the stack does not necessarily imply
>>> performing penultimate hop popping. There is no reason why this
>>> couldn't be the last hop in the network, so remove the comment.
>>
>> So this change I will disagree with.
>>
>> What the code implements is Penultimate hop popping.  Even if you send
>> the packets over loopback that is what the code is doing.
>
> No, RFC3031 s3.16 (https://tools.ietf.org/html/rfc3031#page-18) talks in terms
> of LSRs (label switch routers), not passes through the forwarding
> code.

In very simple terms the code always removes the labels and forwards the
code.  That is by definition penultimate hop popping.  That is all that
is implmeneted in the code today.  And that is what the comment is
noting.

>> This is relevant because I think the code may actually be wrong in the
>> local reception case.  By preforming penultimate hop popping and
>> receving the code on loopback I think this code allows bypassing
>> iptables rules that apply to incoming ip packets.  Certainly there is a
>> loss of information as to which hardware interface the packet came in on
>> that it may be desirable to correct.
>
> Indeed, but network operators may well want to apply different rules to traffic
> coming in as IP versus traffic coming in as MPLS.
>
> This may well merit a comment of its own, but this isn't directly relevant to
> the comment I'm removing.

My point is and what is directly relevant is the case of local delivery
is a hack.  A hack that a pretty strong case can be made that it does
the wrong thing and something we probably should fix before the code
makes it to Linus for 4.1 so the bug does not get cast in stone.

In other words the disparity between the comment and the code indicates
an actual bug, not just a wrong comment.

Eric

>>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
>>> ---
>>>   net/mpls/af_mpls.c | 1 -
>>>   1 file changed, 1 deletion(-)
>>>
>>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
>>> index 0d6763a..bf3459a 100644
>>> --- a/net/mpls/af_mpls.c
>>> +++ b/net/mpls/af_mpls.c
>>> @@ -199,7 +199,6 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
>>>   	skb->protocol = htons(ETH_P_MPLS_UC);
>>>
>>>   	if (unlikely(!new_header_size && dec.bos)) {
>>> -		/* Penultimate hop popping */
>>>   		if (!mpls_egress(rt, skb, dec))
>>>   			goto drop;
>>>   	} else {
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert Shearman March 24, 2015, 3:18 p.m. UTC | #4
On 23/03/15 18:16, Eric W. Biederman wrote:
> Robert Shearman <rshearma@brocade.com> writes:
>
>> On 22/03/15 19:12, Eric W. Biederman wrote:
>>> Robert Shearman <rshearma@brocade.com> writes:
>>>
>>>> Popping the last label on the stack does not necessarily imply
>>>> performing penultimate hop popping. There is no reason why this
>>>> couldn't be the last hop in the network, so remove the comment.
>>>
>>> So this change I will disagree with.
>>>
>>> What the code implements is Penultimate hop popping.  Even if you send
>>> the packets over loopback that is what the code is doing.
>>
>> No, RFC3031 s3.16 (https://tools.ietf.org/html/rfc3031#page-18) talks in terms
>> of LSRs (label switch routers), not passes through the forwarding
>> code.
>
> In very simple terms the code always removes the labels and forwards the
> code.  That is by definition penultimate hop popping.  That is all that
> is implmeneted in the code today.  And that is what the comment is
> noting.
>

I think we might be talking at cross purposes here so I'd like to back 
up a little bit and make sure that we are using the same terminology and 
are talking about the same problem.

In terms of terminology this is what I mean -

PHP means popping the label in the penultimate hop router in the 
*signalled* LSP. Note: There may be more hops beyond this LSP - either 
MPLS using an inner label or payload - such as IP.

UHP means popping the label in the ultimate hop router in the signalled 
LSP. Again there may be further hops that are outside of the LSP.

Orthogonal to this is the relation of the FEC to the UH router.

a) All traffic for this FEC needs to be sent to another router (which is 
beyond this LSP).
b) The FEC requires the UH router to do a payload lookup in order to 
identify where the payload should go. This can be a label lookup on the 
next label in (e.g. where the outer label is a TE tunnel over which 
further LSPs have been setup). Or it can be a payload protocol lookup 
e.g. IP. There are a couple of sub-cases here - the common case is where 
the FEC is for a prefix that is directly connected to the UH - in which 
case the lookup we need is conceptually a neighbour lookup. Another case 
is where the FEC corresponds to a summary prefix that this router has 
advertised to its peer - in which case the lookup is a routing table lookup.
c) The FEC is destined for the UH router and needs to be "received" by 
it. Typically this case can be handled in exactly the same way as B 
above but mention it here for completeness.

Does this terminology work for you?

If it does then I hope you will also agree that PHP is really aimed at 
decreasing the work for the UH router in the cases where the FEC 
corresponds to B above. (And it also allows C to be handled using 
existing e.g. IP path mechanisms). If we want to do UHP then we'd 
require the UH router to first lookup the outer label (which would be 
exp-null) and then an additional protocol lookup.

But there are a couple of cases where you either can't or may not want 
to do PHP.

Firstly with an application like MPLS VPN the UH router is the PE and it 
needs the label in order to identify which routing table to do the (IP) 
payload lookup in.

Secondly, in case A, PHP doesn't offer any benefit. The PH is already 
able to use the incoming label to determine how to forward without 
needing a further lookup. It's a choice, but there are certainly stacks 
out there that will choose to advertise a real label for the FEC so that 
we do UHP here. (I'd actually expect that to be the norm).

Does this description of behaviours work for you?

If we are still together at this point then my last question is why you 
say that "remov[ing] a label and forward[ing] the [packet]" is "by 
definition penultimate-hop popping"?

I'm actually more interested in getting aligned on this than the comment 
itself. I also have some comments on what you say below about local 
delivery but I suspect it will all make more sense once we sort this out.

Thanks,
Rob

>>> This is relevant because I think the code may actually be wrong in the
>>> local reception case.  By preforming penultimate hop popping and
>>> receving the code on loopback I think this code allows bypassing
>>> iptables rules that apply to incoming ip packets.  Certainly there is a
>>> loss of information as to which hardware interface the packet came in on
>>> that it may be desirable to correct.
>>
>> Indeed, but network operators may well want to apply different rules to traffic
>> coming in as IP versus traffic coming in as MPLS.
>>
>> This may well merit a comment of its own, but this isn't directly relevant to
>> the comment I'm removing.
>
> My point is and what is directly relevant is the case of local delivery
> is a hack.  A hack that a pretty strong case can be made that it does
> the wrong thing and something we probably should fix before the code
> makes it to Linus for 4.1 so the bug does not get cast in stone.
>
> In other words the disparity between the comment and the code indicates
> an actual bug, not just a wrong comment.
>
> Eric
>
>>>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>>>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
>>>> ---
>>>>    net/mpls/af_mpls.c | 1 -
>>>>    1 file changed, 1 deletion(-)
>>>>
>>>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
>>>> index 0d6763a..bf3459a 100644
>>>> --- a/net/mpls/af_mpls.c
>>>> +++ b/net/mpls/af_mpls.c
>>>> @@ -199,7 +199,6 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
>>>>    	skb->protocol = htons(ETH_P_MPLS_UC);
>>>>
>>>>    	if (unlikely(!new_header_size && dec.bos)) {
>>>> -		/* Penultimate hop popping */
>>>>    		if (!mpls_egress(rt, skb, dec))
>>>>    			goto drop;
>>>>    	} else {
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vivek Venkatraman March 24, 2015, 6:43 p.m. UTC | #5
On Tue, Mar 24, 2015 at 8:18 AM, Robert Shearman <rshearma@brocade.com> wrote:
> On 23/03/15 18:16, Eric W. Biederman wrote:
>>
>> Robert Shearman <rshearma@brocade.com> writes:
>>
>>> On 22/03/15 19:12, Eric W. Biederman wrote:
>>>>
>>>> Robert Shearman <rshearma@brocade.com> writes:
>>>>
>>>>> Popping the last label on the stack does not necessarily imply
>>>>> performing penultimate hop popping. There is no reason why this
>>>>> couldn't be the last hop in the network, so remove the comment.
>>>>
>>>>
>>>> So this change I will disagree with.
>>>>
>>>> What the code implements is Penultimate hop popping.  Even if you send
>>>> the packets over loopback that is what the code is doing.
>>>
>>>
>>> No, RFC3031 s3.16 (https://tools.ietf.org/html/rfc3031#page-18) talks in
>>> terms
>>> of LSRs (label switch routers), not passes through the forwarding
>>> code.
>>
>>
>> In very simple terms the code always removes the labels and forwards the
>> code.  That is by definition penultimate hop popping.  That is all that
>> is implmeneted in the code today.  And that is what the comment is
>> noting.
>>

Yes, this is what the code implements today. I guess the point that
Rob might be making
is that this is not necessarily PHP. The terminus of the LSP (ultimate
hop) need not always
signal an implicit-null or an explicit-null label upstream. In such a
case, it is the ultimate hop
that would see a labeled packet and would pop the label.

What the code implements today can probably be better called as "pop
and forward" - the
label directly identifies the nexthop to forward to. There are other
paradigms needed as Rob
describes below.

>
> I think we might be talking at cross purposes here so I'd like to back up a
> little bit and make sure that we are using the same terminology and are
> talking about the same problem.
>
> In terms of terminology this is what I mean -
>
> PHP means popping the label in the penultimate hop router in the *signalled*
> LSP. Note: There may be more hops beyond this LSP - either MPLS using an
> inner label or payload - such as IP.
>
> UHP means popping the label in the ultimate hop router in the signalled LSP.
> Again there may be further hops that are outside of the LSP.
>
> Orthogonal to this is the relation of the FEC to the UH router.
>
> a) All traffic for this FEC needs to be sent to another router (which is
> beyond this LSP).
> b) The FEC requires the UH router to do a payload lookup in order to
> identify where the payload should go. This can be a label lookup on the next
> label in (e.g. where the outer label is a TE tunnel over which further LSPs
> have been setup). Or it can be a payload protocol lookup e.g. IP. There are
> a couple of sub-cases here - the common case is where the FEC is for a
> prefix that is directly connected to the UH - in which case the lookup we
> need is conceptually a neighbour lookup. Another case is where the FEC
> corresponds to a summary prefix that this router has advertised to its peer
> - in which case the lookup is a routing table lookup.
> c) The FEC is destined for the UH router and needs to be "received" by it.
> Typically this case can be handled in exactly the same way as B above but
> mention it here for completeness.
>
> Does this terminology work for you?
>
> If it does then I hope you will also agree that PHP is really aimed at
> decreasing the work for the UH router in the cases where the FEC corresponds
> to B above. (And it also allows C to be handled using existing e.g. IP path
> mechanisms). If we want to do UHP then we'd require the UH router to first
> lookup the outer label (which would be exp-null) and then an additional
> protocol lookup.
>
> But there are a couple of cases where you either can't or may not want to do
> PHP.
>
> Firstly with an application like MPLS VPN the UH router is the PE and it
> needs the label in order to identify which routing table to do the (IP)
> payload lookup in.
>

I can relate to the terminology above. The one clarification I'll add
to the comment
on MPLS VPN is that it is the VPN label that will identify the routing table in
which to do the IP/payload lookup. The PH cannot pop this label as it doesn't
even know about it. The same would be the case for PWE - the PW label needs
to reach the UH for it to be able to do the forwarding. The PH may pop the LSP
label (on which the L3VPN or PW is setup) or it may not - based on what the
UH signaled it to do.

Of course, the comment is one thing - the implementation would be enhancements
to what is already available. IMO, one key aspect of this would be the
ability to
specify a label operation such as "pop" or "swap" - or even "no-op" which may
have a use case when MPLS dataplane is used to achieve Segment Routing.

For example,
ip -f mpls route add 200 pop
ip -f mpls route add 1400 nexthop via inet 192.168.12.1 dev eth1

would handle a label stack {200, 1400(S)} on the UH, pop the LSP label 200 and
(implicitly) pop the VPN label 1400 and forward to nexthop specified.

For example,
ip -f mpls route add 200 pop
ip -f mpls route add 1600 pop dev lo20

could handle a similar L3VPN scenario that requires a payload (IP) lookup. The
VRF (table to use) could be identified by rules associated with
'lo20'. The incoming
interface information is lost, but in this case, the most relevant
information is the
VRF which is determined by 'lo20' (which is pointed to by the VPN label 1600).

> Secondly, in case A, PHP doesn't offer any benefit. The PH is already able
> to use the incoming label to determine how to forward without needing a
> further lookup. It's a choice, but there are certainly stacks out there that
> will choose to advertise a real label for the FEC so that we do UHP here.
> (I'd actually expect that to be the norm).
>
> Does this description of behaviours work for you?
>
> If we are still together at this point then my last question is why you say
> that "remov[ing] a label and forward[ing] the [packet]" is "by definition
> penultimate-hop popping"?
>
> I'm actually more interested in getting aligned on this than the comment
> itself. I also have some comments on what you say below about local delivery
> but I suspect it will all make more sense once we sort this out.
>
> Thanks,
> Rob
>
>>>> This is relevant because I think the code may actually be wrong in the
>>>> local reception case.  By preforming penultimate hop popping and
>>>> receving the code on loopback I think this code allows bypassing
>>>> iptables rules that apply to incoming ip packets.  Certainly there is a
>>>> loss of information as to which hardware interface the packet came in on
>>>> that it may be desirable to correct.
>>>
>>>
>>> Indeed, but network operators may well want to apply different rules to
>>> traffic
>>> coming in as IP versus traffic coming in as MPLS.
>>>
>>> This may well merit a comment of its own, but this isn't directly
>>> relevant to
>>> the comment I'm removing.
>>
>>
>> My point is and what is directly relevant is the case of local delivery
>> is a hack.  A hack that a pretty strong case can be made that it does
>> the wrong thing and something we probably should fix before the code
>> makes it to Linus for 4.1 so the bug does not get cast in stone.
>>
>> In other words the disparity between the comment and the code indicates
>> an actual bug, not just a wrong comment.
>>
>> Eric
>>
>>>>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>> Signed-off-by: Robert Shearman <rshearma@brocade.com>
>>>>> ---
>>>>>    net/mpls/af_mpls.c | 1 -
>>>>>    1 file changed, 1 deletion(-)
>>>>>
>>>>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
>>>>> index 0d6763a..bf3459a 100644
>>>>> --- a/net/mpls/af_mpls.c
>>>>> +++ b/net/mpls/af_mpls.c
>>>>> @@ -199,7 +199,6 @@ static int mpls_forward(struct sk_buff *skb, struct
>>>>> net_device *dev,
>>>>>         skb->protocol = htons(ETH_P_MPLS_UC);
>>>>>
>>>>>         if (unlikely(!new_header_size && dec.bos)) {
>>>>> -               /* Penultimate hop popping */
>>>>>                 if (!mpls_egress(rt, skb, dec))
>>>>>                         goto drop;
>>>>>         } else {
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 0d6763a..bf3459a 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -199,7 +199,6 @@  static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 	skb->protocol = htons(ETH_P_MPLS_UC);
 
 	if (unlikely(!new_header_size && dec.bos)) {
-		/* Penultimate hop popping */
 		if (!mpls_egress(rt, skb, dec))
 			goto drop;
 	} else {