diff mbox series

[RFC] Support for UNARP (RFC 1868)

Message ID 1507782977-2443-1-git-send-email-girish.moodalbail@oracle.com
State RFC, archived
Delegated to: David Miller
Headers show
Series [RFC] Support for UNARP (RFC 1868) | expand

Commit Message

Girish Moodalbail Oct. 12, 2017, 4:36 a.m. UTC
Add support for UNARP, as detailed in the IETF RFC 1868 (ARP Extension -
UNARP). The central idea here is for a node to announce that it is
leaving the network and that all the nodes on the L2 broadcast domain to
update their ARP tables accordingly (i.e., mark the neighbor entry state
to FAILED). Even though the ARP timers on nodes would eventually  mark
such entries as FAILED it will be more robust if those entries gets
marked FAILED sooner with the help from the host that is going away.

Besides providing a solution for an usecase, as captured in RFC, of an
IP address moving across a proxy server, this feature is even more
important for certain use cases in the Cloud. Imagine a tenant who is
bringing up and down VM instances for some workload of theirs. If these
instances are part of a small subnet, then the new VM instances may be
assigned the same IP address (since the subnet pool is small) but with a
different MAC address. So, if there is a client which has a stale
mapping of the IP address to the old MAC address, then that client will
fail to communicate with the new VM instance for some time.

Another usecase that comes to mind is that of the Live VM
Migration. Imagine a client that is communicating with a VM. Now, let us
migrate this VM to a destination machine. The IP address to MAC address
mapping for a VM doesn't change after the Live Migration. However, there
will be a small amount of time (till the VM sends gratuitous ARP from
the destination machine) during which packets from a client will be
forwarded to the source machine. This occurs because:

 - the ARP entry in the client is not invalidated yet and it continues
   to use the same MAC address and

 - the MAC address table of all of the intermediate switches between the
   client and the source machine are not updated yet for the MAC address
   move.

This issue of forwarding the packets to wrong target could be avoided by
sending UNARP packets from the source machine. This would invalidate the
ARP entry on the client and forces it to resolve the IP address again by
broadcasting an ARP request to the network. The VM on the destination
machine would then respond back with an ARP response. The ARP response
back from the VM should also clean up the MAC address table of the
intermediate switches.

The following changes implements the UNARP receive processing in the
kernel. Once the changes are in the kernel, arping(8) program can be
updated to send UNARP packets.

Any Thoughts/Comments?

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
---

Compile-tested only.

 net/ipv4/arp.c | 46 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 11 deletions(-)

Comments

Eric Dumazet Oct. 12, 2017, 2:57 p.m. UTC | #1
On Wed, 2017-10-11 at 21:36 -0700, Girish Moodalbail wrote:
> Add support for UNARP, as detailed in the IETF RFC 1868 (ARP Extension -
> UNARP). The central idea here is for a node to announce that it is
> leaving the network and that all the nodes on the L2 broadcast domain to
> update their ARP tables accordingly (i.e., mark the neighbor entry state
> to FAILED). Even though the ARP timers on nodes would eventually  mark
> such entries as FAILED it will be more robust if those entries gets
> marked FAILED sooner with the help from the host that is going away.
> 
> Besides providing a solution for an usecase, as captured in RFC, of an
> IP address moving across a proxy server, this feature is even more
> important for certain use cases in the Cloud. Imagine a tenant who is
> bringing up and down VM instances for some workload of theirs. If these
> instances are part of a small subnet, then the new VM instances may be
> assigned the same IP address (since the subnet pool is small) but with a
> different MAC address. So, if there is a client which has a stale
> mapping of the IP address to the old MAC address, then that client will
> fail to communicate with the new VM instance for some time.
> 
> Another usecase that comes to mind is that of the Live VM
> Migration. Imagine a client that is communicating with a VM. Now, let us
> migrate this VM to a destination machine. The IP address to MAC address
> mapping for a VM doesn't change after the Live Migration. However, there
> will be a small amount of time (till the VM sends gratuitous ARP from
> the destination machine) during which packets from a client will be
> forwarded to the source machine. This occurs because:
> 
>  - the ARP entry in the client is not invalidated yet and it continues
>    to use the same MAC address and
> 
>  - the MAC address table of all of the intermediate switches between the
>    client and the source machine are not updated yet for the MAC address
>    move.
> 
> This issue of forwarding the packets to wrong target could be avoided by
> sending UNARP packets from the source machine. This would invalidate the
> ARP entry on the client and forces it to resolve the IP address again by
> broadcasting an ARP request to the network. The VM on the destination
> machine would then respond back with an ARP response. The ARP response
> back from the VM should also clean up the MAC address table of the
> intermediate switches.
> 
> The following changes implements the UNARP receive processing in the
> kernel. Once the changes are in the kernel, arping(8) program can be
> updated to send UNARP packets.
> 
> Any Thoughts/Comments?
> 
> Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
> ---

Hi Girish

Your description (or patch title) is misleading. You apparently
implement the receive side of the RFC.

And the RFC had Proxy ARP in mind.

What about security implications ? Will TCP flows be terminated, instead
of being smoothly migrated (TCP_REPAIR)

What about IPv6 ? Or maybe more abruptly, do we still need to add
features to IPv4 in 2017,  22 years after this RFC came ? ;)

Thanks.
Girish Moodalbail Oct. 12, 2017, 11:06 p.m. UTC | #2
Hello Eric,

The basic idea is to mark the ARP entry either FAILED or STALE as soon as we can 
so that the subsequent packets that depend on that ARP entry will take the slow 
path (neigh_resolve_output()).

Say, if base_reachable_time is 30 seconds, then an ARP entry will be in 
reachable state somewhere between 15 to 45 seconds. Assuming the worst case, the 
ARP entry will be in REACHABLE state for 45 seconds and the packets continue to 
traverse the network towards the source machine and gets dropped their since the 
VM has moved to destination machine.

Instead, based on the received UNARP packet if we mark the ARP entry

(a) FAILED
    - we move to INCOMPLETE state and start the address resolution by sending
      out ARP packets (up to allowed maximum number) until we get ARP response
      back at which point we move the ARP entry state to reachable.

(b) STALE
    - we move to DELAY state and set the next timer to DELAY_PROBE_TIME
      (1 second) and continue to send queued packets in arp_queue.
    - After 1 sec we move to PROBE state and start the address resolution like
      in the case(a) above.

I was leaning towards (a). Please see in-line..

<snip>

> 
> Hi Girish
> 
> Your description (or patch title) is misleading. You apparently
> implement the receive side of the RFC.

You are right, it implements only the receive side of the RFC. If this RFC is 
accepted, then we can change arping(8) to generate UNARP requests. We could also 
add an option to ip-address(8) delete subcommand to generate UNARP whenever an 
address is deleted from the interface.

> And the RFC had Proxy ARP in mind.
> 
> What about security implications ? 

Yes, this feature will extend the attack surface for L2 networks. However, the 
attack vectors for this feature should be same as that of the gratuitous ARP, 
right? The same attack mitigation techniques for gratuitous ARPs is equally 
applicable here.

> Will TCP flows be terminated, instead
> of being smoothly migrated (TCP_REPAIR)

The TCP flows will not be terminated. Upon receiving UNARP packet, the ARP entry 
will be marked FAILED. The subsequent TCP packets from the client (towards that 
IP) will be queued (the first 3 packets in arp_queue and then other packets get 
dropped) until the IP address is resolved again through the slow path 
neigh_resolve_output().

The slow path marks the entry as INCOMPLETE and will start sending several ARP 
requests (ucast_solicit + app_solicit + mcast_solicit) to resolve the IP. If the 
resolution is successful, then the TCP packets will be sent out. If not, we will 
invalidate the ARP entry and call arp_error_report() on the queued packets 
(which will end up sending ICMP_HOST_UNREACH error). This behavior is same as 
what will occur if TCP server disappears in the middle of a connection.

> 
> What about IPv6 ? Or maybe more abruptly, do we still need to add
> features to IPv4 in 2017,  22 years after this RFC came ? ;)

Legit question :). Well one thing I have seen in Networking is that an old idea 
circles back around later and turns out to be useful in new contexts and use 
cases. Like I enumerated in my initial email there are certain use cases in 
Cloud that might benefit from UNARP.

regards,
~Girish

> 
> Thanks.
> 
>
On Thu, Oct 12, 2017 at 4:06 PM, Girish Moodalbail
<girish.moodalbail@oracle.com> wrote:
> Hello Eric,
>
> The basic idea is to mark the ARP entry either FAILED or STALE as soon as we
> can so that the subsequent packets that depend on that ARP entry will take
> the slow path (neigh_resolve_output()).
>
> Say, if base_reachable_time is 30 seconds, then an ARP entry will be in
> reachable state somewhere between 15 to 45 seconds. Assuming the worst case,
> the ARP entry will be in REACHABLE state for 45 seconds and the packets
> continue to traverse the network towards the source machine and gets dropped
> their since the VM has moved to destination machine.
>
> Instead, based on the received UNARP packet if we mark the ARP entry
>
> (a) FAILED
>    - we move to INCOMPLETE state and start the address resolution by sending
>      out ARP packets (up to allowed maximum number) until we get ARP
> response
>      back at which point we move the ARP entry state to reachable.
>
> (b) STALE
>    - we move to DELAY state and set the next timer to DELAY_PROBE_TIME
>      (1 second) and continue to send queued packets in arp_queue.
>    - After 1 sec we move to PROBE state and start the address resolution
> like
>      in the case(a) above.
>
> I was leaning towards (a).
One could arbitrarily increase the stale timeout (by changing no of
probes). So sender
will continue sending traffic to something that has already gone away.
STALE doesn't
mean bad but here the sender is clearly indicating it's going away so
FAILED seems to
be the only logical option.

> Please see in-line..
>
> <snip>
>
>>
>> Hi Girish
>>
>> Your description (or patch title) is misleading. You apparently
>> implement the receive side of the RFC.
>
>
> You are right, it implements only the receive side of the RFC. If this RFC
> is accepted, then we can change arping(8) to generate UNARP requests. We
> could also add an option to ip-address(8) delete subcommand to generate
> UNARP whenever an address is deleted from the interface.
>
>> And the RFC had Proxy ARP in mind.
>>
>> What about security implications ?
>
>
> Yes, this feature will extend the attack surface for L2 networks. However,
> the attack vectors for this feature should be same as that of the gratuitous
> ARP, right? The same attack mitigation techniques for gratuitous ARPs is
> equally applicable here.
>
>> Will TCP flows be terminated, instead
>> of being smoothly migrated (TCP_REPAIR)
>
>
> The TCP flows will not be terminated. Upon receiving UNARP packet, the ARP
> entry will be marked FAILED. The subsequent TCP packets from the client
> (towards that IP) will be queued (the first 3 packets in arp_queue and then
> other packets get dropped) until the IP address is resolved again through
> the slow path neigh_resolve_output().
>
> The slow path marks the entry as INCOMPLETE and will start sending several
> ARP requests (ucast_solicit + app_solicit + mcast_solicit) to resolve the
> IP. If the resolution is successful, then the TCP packets will be sent out.
> If not, we will invalidate the ARP entry and call arp_error_report() on the
> queued packets (which will end up sending ICMP_HOST_UNREACH error). This
> behavior is same as what will occur if TCP server disappears in the middle
> of a connection.
>
>>
>> What about IPv6 ? Or maybe more abruptly, do we still need to add
>> features to IPv4 in 2017,  22 years after this RFC came ? ;)
>
>
> Legit question :). Well one thing I have seen in Networking is that an old
> idea circles back around later and turns out to be useful in new contexts
> and use cases. Like I enumerated in my initial email there are certain use
> cases in Cloud that might benefit from UNARP.
>
It doesn't make sense to have this implemented only for IPv4. At this time if
equivalent IPv6 feature is missing, I don't see it being useful / acceptable.

> regards,
> ~Girish
>
>>
>> Thanks.
>>
>>
>
Girish Moodalbail Oct. 13, 2017, 6:23 p.m. UTC | #4
On 10/12/17 5:53 PM, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Thu, Oct 12, 2017 at 4:06 PM, Girish Moodalbail
> <girish.moodalbail@oracle.com> wrote:
>> Hello Eric,
>>
>> The basic idea is to mark the ARP entry either FAILED or STALE as soon as we
>> can so that the subsequent packets that depend on that ARP entry will take
>> the slow path (neigh_resolve_output()).
>>
>> Say, if base_reachable_time is 30 seconds, then an ARP entry will be in
>> reachable state somewhere between 15 to 45 seconds. Assuming the worst case,
>> the ARP entry will be in REACHABLE state for 45 seconds and the packets
>> continue to traverse the network towards the source machine and gets dropped
>> their since the VM has moved to destination machine.
>>
>> Instead, based on the received UNARP packet if we mark the ARP entry
>>
>> (a) FAILED
>>     - we move to INCOMPLETE state and start the address resolution by sending
>>       out ARP packets (up to allowed maximum number) until we get ARP
>> response
>>       back at which point we move the ARP entry state to reachable.
>>
>> (b) STALE
>>     - we move to DELAY state and set the next timer to DELAY_PROBE_TIME
>>       (1 second) and continue to send queued packets in arp_queue.
>>     - After 1 sec we move to PROBE state and start the address resolution
>> like
>>       in the case(a) above.
>>
>> I was leaning towards (a).
> One could arbitrarily increase the stale timeout (by changing no of
> probes). So sender
> will continue sending traffic to something that has already gone away.
> STALE doesn't
> mean bad but here the sender is clearly indicating it's going away so
> FAILED seems to
> be the only logical option.

I agree.

>>
>>> Will TCP flows be terminated, instead
>>> of being smoothly migrated (TCP_REPAIR)
>>
>>
>> The TCP flows will not be terminated. Upon receiving UNARP packet, the ARP
>> entry will be marked FAILED. The subsequent TCP packets from the client
>> (towards that IP) will be queued (the first 3 packets in arp_queue and then
>> other packets get dropped) until the IP address is resolved again through
>> the slow path neigh_resolve_output().
>>
>> The slow path marks the entry as INCOMPLETE and will start sending several
>> ARP requests (ucast_solicit + app_solicit + mcast_solicit) to resolve the
>> IP. If the resolution is successful, then the TCP packets will be sent out.
>> If not, we will invalidate the ARP entry and call arp_error_report() on the
>> queued packets (which will end up sending ICMP_HOST_UNREACH error). This
>> behavior is same as what will occur if TCP server disappears in the middle
>> of a connection.
>>
>>>
>>> What about IPv6 ? Or maybe more abruptly, do we still need to add
>>> features to IPv4 in 2017,  22 years after this RFC came ? ;)
>>
>>
>> Legit question :). Well one thing I have seen in Networking is that an old
>> idea circles back around later and turns out to be useful in new contexts
>> and use cases. Like I enumerated in my initial email there are certain use
>> cases in Cloud that might benefit from UNARP.
>>
> It doesn't make sense to have this implemented only for IPv4. At this time if
> equivalent IPv6 feature is missing, I don't see it being useful / acceptable.

Obviously UNARP is IPv4 only. I am not aware of any standard way of doing the 
same for IPv6. If such a standard doesn't exist, then we will have to go through 
IETF to get one done? If that is the case, can we please do this in a phased 
manner? This will atleast help the Cloud that are IPv4 only.

thanks,
~Girish
diff mbox series

Patch

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 7c45b88..8cb9aa1 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -686,6 +686,7 @@  static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
 	struct neighbour *n;
 	struct dst_entry *reply_dst = NULL;
 	bool is_garp = false;
+	bool is_unarp;
 
 	/* arp_rcv below verifies the ARP header and verifies the device
 	 * is ARP'able.
@@ -695,6 +696,8 @@  static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
 		goto out_free_skb;
 
 	arp = arp_hdr(skb);
+	/* arp_rcv has already verified the header for the UNARP case */
+	is_unarp = arp->ar_hln == 0;
 
 	switch (dev_type) {
 	default:
@@ -741,8 +744,8 @@  static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
  *	Extract fields
  */
 	arp_ptr = (unsigned char *)(arp + 1);
-	sha	= arp_ptr;
-	arp_ptr += dev->addr_len;
+	sha = is_unarp ? NULL : arp_ptr;
+	arp_ptr += arp->ar_hln;
 	memcpy(&sip, arp_ptr, 4);
 	arp_ptr += 4;
 	switch (dev_type) {
@@ -751,8 +754,8 @@  static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
 		break;
 #endif
 	default:
-		tha = arp_ptr;
-		arp_ptr += dev->addr_len;
+		tha = is_unarp ? NULL : arp_ptr;
+		arp_ptr += arp->ar_hln;
 	}
 	memcpy(&tip, arp_ptr, 4);
 /*
@@ -874,7 +877,10 @@  static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
 		   It is possible, that this option should be enabled for some
 		   devices (strip is candidate)
 		 */
-		if (!n &&
+		/* If the packet is UNARP and we don't have the corresponding
+		 * neighbour entry, then there is nothing to do.
+		 */
+		if (!n && !is_unarp &&
 		    (is_garp ||
 		     (arp->ar_op == htons(ARPOP_REPLY) &&
 		      (addr_type == RTN_UNICAST ||
@@ -899,12 +905,15 @@  static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
 				      NEIGH_VAR(n->parms, LOCKTIME)) ||
 			   is_garp;
 
-		/* Broadcast replies and request packets
-		   do not assert neighbour reachability.
-		 */
-		if (arp->ar_op != htons(ARPOP_REPLY) ||
-		    skb->pkt_type != PACKET_HOST)
+		if (is_unarp) {
+			state = NUD_FAILED;
+		} else if (arp->ar_op != htons(ARPOP_REPLY) ||
+			   skb->pkt_type != PACKET_HOST) {
+			/* Broadcast replies and request packets
+			 * do not assert neighbour reachability.
+			 */
 			state = NUD_STALE;
+		}
 		neigh_update(n, sha, state,
 			     override ? NEIGH_UPDATE_F_OVERRIDE : 0, 0);
 		neigh_release(n);
@@ -936,6 +945,7 @@  static int arp_rcv(struct sk_buff *skb, struct net_device *dev,
 		   struct packet_type *pt, struct net_device *orig_dev)
 {
 	const struct arphdr *arp;
+	bool is_unarp = false;
 
 	/* do not tweak dropwatch on an ARP we will ignore */
 	if (dev->flags & IFF_NOARP ||
@@ -952,7 +962,21 @@  static int arp_rcv(struct sk_buff *skb, struct net_device *dev,
 		goto freeskb;
 
 	arp = arp_hdr(skb);
-	if (arp->ar_hln != dev->addr_len || arp->ar_pln != 4)
+	/* RFC 1868 (UNARP) allows zero-length hardware address in
+	 * ARPOP_REPLY and target protocol address will be set to
+	 * 255.255.255.255.
+	 */
+	if (unlikely(arp->ar_hln == 0)) {
+		unsigned char *arp_ptr;
+
+		arp_ptr = (unsigned char *)(arp + 1);
+		if (arp->ar_op != htons(ARPOP_REPLY) ||
+		    !ipv4_is_lbcast(*(__be32 *)(arp_ptr + 4)))
+			goto freeskb;
+		is_unarp = true;
+	}
+
+	if ((!is_unarp && arp->ar_hln != dev->addr_len) || arp->ar_pln != 4)
 		goto freeskb;
 
 	memset(NEIGH_CB(skb), 0, sizeof(struct neighbour_cb));