Patchwork [RESEND] bonding: delete migrated IP addresses from the rlb hash table

login
register
mail settings
Submitter Jiri Bohac
Date Nov. 23, 2012, 12:44 p.m.
Message ID <20121123124419.GA3002@midget.suse.cz>
Download mbox | patch
Permalink /patch/201312/
State Superseded
Delegated to: David Miller
Headers show

Comments

Jiri Bohac - Nov. 23, 2012, 12:44 p.m.
Hi, 

This is another resend of the patch discussed in June. The only
changes over the previous version are improved comments.

Bonding with balance_rlb keeps poisoning other machines' ARP
caches and I whink we need to fix this.

On Thu, Jun 21, 2012 at 04:05:19PM -0700, Jay Vosburgh wrote:
> Jiri Bohac <jbohac@suse.cz> wrote:
> 
> >Hi, this is a resend of the patch discussed here:
> >	http://thread.gmane.org/gmane.linux.network/228076
> >It has been updated to apply to the lastest net-next.
> [...]
> >The hash table is hashed by ip_dst. To be able to do the above
> >check efficiently (not walking the whole hash table), we need a
> >reverse mapping (by ip_src).
> 
> 	Just a note that I'm doing some testing with this patch.  Seems
> to be ok for the "direct" case (wherein the IP in question is assigned
> to the local system); I haven't tried the "bridge" case yet.  I've
> extended some of the debugfs stuff to dump the new information, and I'm
> trying some of the corner cases (e.g., breaking the linkages in the
> middle) to see if it all hangs together.

Were there any results of your testing?  Good or bad?

> 	I am thinking that the layout of the "hash"-ish table is now
> sufficiently complicated that there should be a comment block somewhere
> describing what's going on (because I didn't really quite get it until I
> dumped the whole thing and looked at it).  With this patch, there is one
> "used" linkage for all of the elements in use, plus some number of "src"
> linkages, one for each active source hash.  The "src" linkages are also
> notable in that they are separate from the "assigned" state.

I updated the comments in drivers/net/bonding/bond_alb.h to
describe the structure.

> >+	 * have a dirrerent mac_src.
> 
> 	Typo here; should be "different."

Fixed. 
Any chance we could finally get this merged?:






Bonding in balance-alb mode records information from ARP packets
passing through the bond in a hash table (rx_hashtbl).

At certain situations (e.g. link change of a slave),
rlb_update_rx_clients() will send out ARP packets to update ARP
caches of other hosts on the network to achieve RX load
balancing.

The problem is that once an IP address is recorded in the hash
table, it stays there indefinitely. If this IP address is
migrated to a different host in the network, bonding still sends
out ARP packets that poison other systems' ARP caches with
invalid information.

This patch solves this by looking at all incoming ARP packets,
and checking if the source IP address is one of the source
addresses stored in the rx_hashtbl. If it is, but the MAC
addresses differ, the corresponding hash table entries are
removed. Thus, when an IP address is migrated, the first ARP
broadcast by its new owner will purge the offending entries of
rx_hashtbl.

The hash table is hashed by ip_dst. To be able to do the above
check efficiently (not walking the whole hash table), we need a
reverse mapping (by ip_src).

I added three new members in struct rlb_client_info:
   rx_hashtbl[x].src_first will point to the start of a list of
      entries for which hash(ip_src) == x.
   The list is linked with src_next and src_prev.

When an incoming ARP packet arrives at rlb_arp_recv()
rlb_purge_src_ip() can quickly walk only the entries on the
corresponding lists, i.e. the entries that are likely to contain
the offending IP address.

To avoid confusion, I renamed these existing fields of struct 
rlb_client_info:
	next -> used_next
	prev -> used_prev
	rx_hashtbl_head -> rx_hashtbl_used_head

(The current linked list is _not_ a list of hash table
entries with colliding ip_dst. It's a list of entries that are
being used; its purpose is to avoid walking the whole hash table
when looking for used entries.)

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Jay Vosburgh - Nov. 28, 2012, 1:05 a.m.
Jiri Bohac <jbohac@suse.cz> wrote:

>Hi, 
>
>This is another resend of the patch discussed in June. The only
>changes over the previous version are improved comments.
>
>Bonding with balance_rlb keeps poisoning other machines' ARP
>caches and I whink we need to fix this.
>
>On Thu, Jun 21, 2012 at 04:05:19PM -0700, Jay Vosburgh wrote:
>> Jiri Bohac <jbohac@suse.cz> wrote:
>> 
>> >Hi, this is a resend of the patch discussed here:
>> >	http://thread.gmane.org/gmane.linux.network/228076
>> >It has been updated to apply to the lastest net-next.
>> [...]
>> >The hash table is hashed by ip_dst. To be able to do the above
>> >check efficiently (not walking the whole hash table), we need a
>> >reverse mapping (by ip_src).
>> 
>> 	Just a note that I'm doing some testing with this patch.  Seems
>> to be ok for the "direct" case (wherein the IP in question is assigned
>> to the local system); I haven't tried the "bridge" case yet.  I've
>> extended some of the debugfs stuff to dump the new information, and I'm
>> trying some of the corner cases (e.g., breaking the linkages in the
>> middle) to see if it all hangs together.
>
>Were there any results of your testing?  Good or bad?

	I did test it quite a bit (and then neglected to follow up).  I
tried various deliberate hash collisions to try and make it fail in a
corner case, but was unable to induce incorrect behavior.

>> 	I am thinking that the layout of the "hash"-ish table is now
>> sufficiently complicated that there should be a comment block somewhere
>> describing what's going on (because I didn't really quite get it until I
>> dumped the whole thing and looked at it).  With this patch, there is one
>> "used" linkage for all of the elements in use, plus some number of "src"
>> linkages, one for each active source hash.  The "src" linkages are also
>> notable in that they are separate from the "assigned" state.
>
>I updated the comments in drivers/net/bonding/bond_alb.h to
>describe the structure.
>
>> >+	 * have a dirrerent mac_src.
>> 
>> 	Typo here; should be "different."
>
>Fixed. 
>Any chance we could finally get this merged?:

	The only issue I see is that a number of added lines run past 80
columns, e.g.,

+		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
+			/* ip_src is going to be updated, fix the src hash list */
+			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));

+	 * sending out client updates with this IP address and the old MAC address.

+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {

	... and so on.  I did not compile and test this version, just
applied it and inspected it; presumably it is functionally identical to
the prior version.  There's also one typo I noted near the end of the
patch.

	-J

>Bonding in balance-alb mode records information from ARP packets
>passing through the bond in a hash table (rx_hashtbl).
>
>At certain situations (e.g. link change of a slave),
>rlb_update_rx_clients() will send out ARP packets to update ARP
>caches of other hosts on the network to achieve RX load
>balancing.
>
>The problem is that once an IP address is recorded in the hash
>table, it stays there indefinitely. If this IP address is
>migrated to a different host in the network, bonding still sends
>out ARP packets that poison other systems' ARP caches with
>invalid information.
>
>This patch solves this by looking at all incoming ARP packets,
>and checking if the source IP address is one of the source
>addresses stored in the rx_hashtbl. If it is, but the MAC
>addresses differ, the corresponding hash table entries are
>removed. Thus, when an IP address is migrated, the first ARP
>broadcast by its new owner will purge the offending entries of
>rx_hashtbl.
>
>The hash table is hashed by ip_dst. To be able to do the above
>check efficiently (not walking the whole hash table), we need a
>reverse mapping (by ip_src).
>
>I added three new members in struct rlb_client_info:
>   rx_hashtbl[x].src_first will point to the start of a list of
>      entries for which hash(ip_src) == x.
>   The list is linked with src_next and src_prev.
>
>When an incoming ARP packet arrives at rlb_arp_recv()
>rlb_purge_src_ip() can quickly walk only the entries on the
>corresponding lists, i.e. the entries that are likely to contain
>the offending IP address.
>
>To avoid confusion, I renamed these existing fields of struct 
>rlb_client_info:
>	next -> used_next
>	prev -> used_prev
>	rx_hashtbl_head -> rx_hashtbl_used_head
>
>(The current linked list is _not_ a list of hash table
>entries with colliding ip_dst. It's a list of entries that are
>being used; its purpose is to avoid walking the whole hash table
>when looking for used entries.)
>
>Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index e15cc11..8505a24 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -84,6 +84,9 @@ static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb)
>
> /* Forward declaration */
> static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]);
>+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp);
>+static void rlb_src_unlink(struct bonding *bond, u32 index);
>+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash);
>
> static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
> {
>@@ -354,6 +357,17 @@ static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond,
> 	if (!arp)
> 		goto out;
>
>+	/* We received an ARP from arp->ip_src.
>+	 * We might have used this IP address previously (on the bonding host
>+	 * itself or on a system that is bridged together with the bond).
>+	 * However, if arp->mac_src is different than what is stored in
>+	 * rx_hashtbl, some other host is now using the IP and we must prevent
>+	 * sending out client updates with this IP address and the old MAC address.
>+	 * Clean up all hash table entries that have this address as ip_src but
>+	 * have a different mac_src.
>+	 */
>+	rlb_purge_src_ip(bond, arp);
>+
> 	if (arp->op_code == htons(ARPOP_REPLY)) {
> 		/* update rx hash table for this ARP */
> 		rlb_update_entry_from_arp(bond, arp);
>@@ -432,9 +446,9 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
> 	_lock_rx_hashtbl_bh(bond);
>
> 	rx_hash_table = bond_info->rx_hashtbl;
>-	index = bond_info->rx_hashtbl_head;
>+	index = bond_info->rx_hashtbl_used_head;
> 	for (; index != RLB_NULL_INDEX; index = next_index) {
>-		next_index = rx_hash_table[index].next;
>+		next_index = rx_hash_table[index].used_next;
> 		if (rx_hash_table[index].slave == slave) {
> 			struct slave *assigned_slave = rlb_next_rx_slave(bond);
>
>@@ -519,8 +533,8 @@ static void rlb_update_rx_clients(struct bonding *bond)
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		if (client_info->ntt) {
> 			rlb_update_client(client_info);
>@@ -548,8 +562,8 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
>
> 		if ((client_info->slave == slave) &&
>@@ -578,8 +592,8 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
>
> 	_lock_rx_hashtbl(bond);
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
>
> 		if (!client_info->slave) {
>@@ -625,6 +639,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 				/* update mac address from arp */
> 				memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
> 			}
>+			memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
>
> 			assigned_slave = client_info->slave;
> 			if (assigned_slave) {
>@@ -647,6 +662,13 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 	assigned_slave = rlb_next_rx_slave(bond);
>
> 	if (assigned_slave) {
>+		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
>+			/* ip_src is going to be updated, fix the src hash list */
>+			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));
>+			rlb_src_unlink(bond, hash_index);
>+			rlb_src_link(bond, hash_src, hash_index);
>+		}
>+
> 		client_info->ip_src = arp->ip_src;
> 		client_info->ip_dst = arp->ip_dst;
> 		/* arp->mac_dst is broadcast for arp reqeusts.
>@@ -654,6 +676,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 		 * upon receiving an arp reply.
> 		 */
> 		memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
>+		memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
> 		client_info->slave = assigned_slave;
>
> 		if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
>@@ -669,11 +692,11 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> 		}
>
> 		if (!client_info->assigned) {
>-			u32 prev_tbl_head = bond_info->rx_hashtbl_head;
>-			bond_info->rx_hashtbl_head = hash_index;
>-			client_info->next = prev_tbl_head;
>+			u32 prev_tbl_head = bond_info->rx_hashtbl_used_head;
>+			bond_info->rx_hashtbl_used_head = hash_index;
>+			client_info->used_next = prev_tbl_head;
> 			if (prev_tbl_head != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[prev_tbl_head].prev =
>+				bond_info->rx_hashtbl[prev_tbl_head].used_prev =
> 					hash_index;
> 			}
> 			client_info->assigned = 1;
>@@ -740,8 +763,8 @@ static void rlb_rebalance(struct bonding *bond)
> 	_lock_rx_hashtbl_bh(bond);
>
> 	ntt = 0;
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		assigned_slave = rlb_next_rx_slave(bond);
> 		if (assigned_slave && (client_info->slave != assigned_slave)) {
>@@ -759,11 +782,113 @@ static void rlb_rebalance(struct bonding *bond)
> }
>
> /* Caller must hold rx_hashtbl lock */
>+static void rlb_init_table_entry_dst(struct rlb_client_info *entry)
>+{
>+	entry->used_next = RLB_NULL_INDEX;
>+	entry->used_prev = RLB_NULL_INDEX;
>+	entry->assigned = 0;
>+	entry->slave = NULL;
>+	entry->tag = 0;
>+}
>+static void rlb_init_table_entry_src(struct rlb_client_info *entry)
>+{
>+	entry->src_first = RLB_NULL_INDEX;
>+	entry->src_prev = RLB_NULL_INDEX;
>+	entry->src_next = RLB_NULL_INDEX;
>+}
>+
> static void rlb_init_table_entry(struct rlb_client_info *entry)
> {
> 	memset(entry, 0, sizeof(struct rlb_client_info));
>-	entry->next = RLB_NULL_INDEX;
>-	entry->prev = RLB_NULL_INDEX;
>+	rlb_init_table_entry_dst(entry);
>+	rlb_init_table_entry_src(entry);
>+}
>+
>+static void rlb_delete_table_entry_dst(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next_index = bond_info->rx_hashtbl[index].used_next;
>+	u32 prev_index = bond_info->rx_hashtbl[index].used_prev;
>+
>+	if (index == bond_info->rx_hashtbl_used_head)
>+		bond_info->rx_hashtbl_used_head = next_index;
>+	if (prev_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[prev_index].used_next = next_index;
>+	if (next_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next_index].used_prev = prev_index;
>+}
>+
>+/* unlink a rlb hash table entry from the src list */
>+static void rlb_src_unlink(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next_index = bond_info->rx_hashtbl[index].src_next;
>+	u32 prev_index = bond_info->rx_hashtbl[index].src_prev;
>+
>+	bond_info->rx_hashtbl[index].src_next = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl[index].src_prev = RLB_NULL_INDEX;
>+
>+	if (next_index != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next_index].src_prev = prev_index;
>+
>+	if (prev_index == RLB_NULL_INDEX)
>+		return;
>+
>+	/* is prev_index pointing to the head of this list? */
>+	if (bond_info->rx_hashtbl[prev_index].src_first == index)
>+		bond_info->rx_hashtbl[prev_index].src_first = next_index;
>+	else
>+		bond_info->rx_hashtbl[prev_index].src_next = next_index;
>+
>+}
>+
>+static void rlb_delete_table_entry(struct bonding *bond, u32 index)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
>+
>+	rlb_delete_table_entry_dst(bond, index);
>+	rlb_init_table_entry_dst(entry);
>+
>+	rlb_src_unlink(bond, index);
>+}
>+
>+/* add the rx_hashtbl[ip_dst_hash] entry to the list
>+ * of entries with identical ip_src_hash
>+ */
>+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 next;
>+
>+	bond_info->rx_hashtbl[ip_dst_hash].src_prev = ip_src_hash;
>+	next = bond_info->rx_hashtbl[ip_src_hash].src_first;
>+	bond_info->rx_hashtbl[ip_dst_hash].src_next = next;
>+	if (next != RLB_NULL_INDEX)
>+		bond_info->rx_hashtbl[next].src_prev = ip_dst_hash;
>+	bond_info->rx_hashtbl[ip_src_hash].src_first = ip_dst_hash;
>+}
>+
>+/* deletes all rx_hashtbl entries with  arp->ip_src if their mac_src does
>+ * not match arp->mac_src */
>+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp)
>+{
>+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+	u32 ip_src_hash = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
>+	u32 index;
>+
>+	_lock_rx_hashtbl_bh(bond);
>+
>+	index = bond_info->rx_hashtbl[ip_src_hash].src_first;
>+	while (index != RLB_NULL_INDEX) {
>+		struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
>+		u32 next_index = entry->src_next;
>+		if (entry->ip_src == arp->ip_src &&
>+		    !ether_addr_equal_64bits(arp->mac_src, entry->mac_src))
>+				rlb_delete_table_entry(bond, index);
>+		index = next_index;
>+	}
>+	_unlock_rx_hashtbl_bh(bond);
> }
>
> static int rlb_initialize(struct bonding *bond)
>@@ -781,7 +906,7 @@ static int rlb_initialize(struct bonding *bond)
>
> 	bond_info->rx_hashtbl = new_hashtbl;
>
>-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
>
> 	for (i = 0; i < RLB_HASH_TABLE_SIZE; i++) {
> 		rlb_init_table_entry(bond_info->rx_hashtbl + i);
>@@ -803,7 +928,7 @@ static void rlb_deinitialize(struct bonding *bond)
>
> 	kfree(bond_info->rx_hashtbl);
> 	bond_info->rx_hashtbl = NULL;
>-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
>+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
>
> 	_unlock_rx_hashtbl_bh(bond);
> }
>@@ -815,25 +940,13 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
>
> 	_lock_rx_hashtbl_bh(bond);
>
>-	curr_index = bond_info->rx_hashtbl_head;
>+	curr_index = bond_info->rx_hashtbl_used_head;
> 	while (curr_index != RLB_NULL_INDEX) {
> 		struct rlb_client_info *curr = &(bond_info->rx_hashtbl[curr_index]);
>-		u32 next_index = bond_info->rx_hashtbl[curr_index].next;
>-		u32 prev_index = bond_info->rx_hashtbl[curr_index].prev;
>-
>-		if (curr->tag && (curr->vlan_id == vlan_id)) {
>-			if (curr_index == bond_info->rx_hashtbl_head) {
>-				bond_info->rx_hashtbl_head = next_index;
>-			}
>-			if (prev_index != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[prev_index].next = next_index;
>-			}
>-			if (next_index != RLB_NULL_INDEX) {
>-				bond_info->rx_hashtbl[next_index].prev = prev_index;
>-			}
>+		u32 next_index = bond_info->rx_hashtbl[curr_index].used_next;
>
>-			rlb_init_table_entry(curr);
>-		}
>+		if (curr->tag && (curr->vlan_id == vlan_id))
>+			rlb_delete_table_entry(bond, curr_index);
>
> 		curr_index = next_index;
> 	}
>diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
>index 90f140a..de831ba 100644
>--- a/drivers/net/bonding/bond_alb.h
>+++ b/drivers/net/bonding/bond_alb.h
>@@ -94,15 +94,35 @@ struct tlb_client_info {
>
> /* -------------------------------------------------------------------------
>  * struct rlb_client_info contains all info related to a specific rx client
>- * connection. This is the Clients Hash Table entry struct
>+ * connection. This is the Clients Hash Table entry struct.
>+ * Note that this is not a proper hash table; if a new client's IP address
>+ * hash collides with an existing client entry, the old entry is replaced.
>+ *
>+ * There is a linked list (linked by the used_next and used_prev members)
>+ * linking all the used entries of the hash table. This allows updating
>+ * all the clients without walking over all the unused elements of the table.
>+ *
>+ * There are also linked lists of entries with identical hash(ip_src). These
>+ * allow cleaning up the table from ip_src<->mac_src associatins that have

	Typo here, "associations"

>+ * become outdated and would cause sending out invalid ARP updates to the
>+ * network. These are linked by the (src_next and src_prev members).
>  * -------------------------------------------------------------------------
>  */
> struct rlb_client_info {
> 	__be32 ip_src;		/* the server IP address */
> 	__be32 ip_dst;		/* the client IP address */
>+	u8  mac_src[ETH_ALEN];	/* the server MAC address */
> 	u8  mac_dst[ETH_ALEN];	/* the client MAC address */
>-	u32 next;		/* The next Hash table entry index */
>-	u32 prev;		/* The previous Hash table entry index */
>+
>+	/* list of used hash table entries, starting at rx_hashtbl_used_head */
>+	u32 used_next;
>+	u32 used_prev;
>+
>+	/* ip_src based hashing */
>+	u32 src_next;	/* next entry with same hash(ip_src) */
>+	u32 src_prev;	/* prev entry with same hash(ip_src) */
>+	u32 src_first;	/* first entry with hash(ip_src) == this entry's index */
>+
> 	u8  assigned;		/* checking whether this entry is assigned */
> 	u8  ntt;		/* flag - need to transmit client info */
> 	struct slave *slave;	/* the slave assigned to this client */
>@@ -131,7 +151,7 @@ struct alb_bond_info {
> 	int rlb_enabled;
> 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
> 	spinlock_t		rx_hashtbl_lock;
>-	u32			rx_hashtbl_head;
>+	u32			rx_hashtbl_used_head;
> 	u8			rx_ntt;	/* flag - need to transmit
> 					 * to all rx clients
> 					 */
>diff --git a/drivers/net/bonding/bond_debugfs.c b/drivers/net/bonding/bond_debugfs.c
>index 2cf084e..6ac855f 100644
>--- a/drivers/net/bonding/bond_debugfs.c
>+++ b/drivers/net/bonding/bond_debugfs.c
>@@ -31,8 +31,8 @@ static int bond_debug_rlb_hash_show(struct seq_file *m, void *v)
>
> 	spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
>
>-	hash_index = bond_info->rx_hashtbl_head;
>-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+	hash_index = bond_info->rx_hashtbl_used_head;
>+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 		client_info = &(bond_info->rx_hashtbl[hash_index]);
> 		seq_printf(m, "%-15pI4 %-15pI4 %-17pM %s\n",
> 			&client_info->ip_src,
>
>-- 
>Jiri Bohac <jbohac@suse.cz>
>SUSE Labs, SUSE CZ
>

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Bohac - Nov. 28, 2012, 2:38 p.m.
On Tue, Nov 27, 2012 at 05:05:25PM -0800, Jay Vosburgh wrote:
> Jiri Bohac <jbohac@suse.cz> wrote:

> >Were there any results of your testing?  Good or bad?
> 
> 	I did test it quite a bit (and then neglected to follow up).  I
> tried various deliberate hash collisions to try and make it fail in a
> corner case, but was unable to induce incorrect behavior.

Thanks for the review!

> 	The only issue I see is that a number of added lines run past 80
> columns, e.g.,
> 
> +		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
> +			/* ip_src is going to be updated, fix the src hash list */
> +			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));
> 
> +	 * sending out client updates with this IP address and the old MAC address.
> 
> +	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> 
> 	... and so on. 

I'll re-send a new version with all the >80 column lines fixed.

> I did not compile and test this version, just
> applied it and inspected it; presumably it is functionally identical to
> the prior version. 

Yes, it's identical, only the comments and formatting changed.

> There's also one typo I noted near the end of the
> patch.

Fixed as well. Thanks!

Patch

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index e15cc11..8505a24 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -84,6 +84,9 @@  static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb)
 
 /* Forward declaration */
 static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]);
+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp);
+static void rlb_src_unlink(struct bonding *bond, u32 index);
+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash);
 
 static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
 {
@@ -354,6 +357,17 @@  static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond,
 	if (!arp)
 		goto out;
 
+	/* We received an ARP from arp->ip_src.
+	 * We might have used this IP address previously (on the bonding host
+	 * itself or on a system that is bridged together with the bond).
+	 * However, if arp->mac_src is different than what is stored in
+	 * rx_hashtbl, some other host is now using the IP and we must prevent
+	 * sending out client updates with this IP address and the old MAC address.
+	 * Clean up all hash table entries that have this address as ip_src but
+	 * have a different mac_src.
+	 */
+	rlb_purge_src_ip(bond, arp);
+
 	if (arp->op_code == htons(ARPOP_REPLY)) {
 		/* update rx hash table for this ARP */
 		rlb_update_entry_from_arp(bond, arp);
@@ -432,9 +446,9 @@  static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
 	_lock_rx_hashtbl_bh(bond);
 
 	rx_hash_table = bond_info->rx_hashtbl;
-	index = bond_info->rx_hashtbl_head;
+	index = bond_info->rx_hashtbl_used_head;
 	for (; index != RLB_NULL_INDEX; index = next_index) {
-		next_index = rx_hash_table[index].next;
+		next_index = rx_hash_table[index].used_next;
 		if (rx_hash_table[index].slave == slave) {
 			struct slave *assigned_slave = rlb_next_rx_slave(bond);
 
@@ -519,8 +533,8 @@  static void rlb_update_rx_clients(struct bonding *bond)
 
 	_lock_rx_hashtbl_bh(bond);
 
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 		if (client_info->ntt) {
 			rlb_update_client(client_info);
@@ -548,8 +562,8 @@  static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
 
 	_lock_rx_hashtbl_bh(bond);
 
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 
 		if ((client_info->slave == slave) &&
@@ -578,8 +592,8 @@  static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
 
 	_lock_rx_hashtbl(bond);
 
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 
 		if (!client_info->slave) {
@@ -625,6 +639,7 @@  static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 				/* update mac address from arp */
 				memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
 			}
+			memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
 
 			assigned_slave = client_info->slave;
 			if (assigned_slave) {
@@ -647,6 +662,13 @@  static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 	assigned_slave = rlb_next_rx_slave(bond);
 
 	if (assigned_slave) {
+		if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
+			/* ip_src is going to be updated, fix the src hash list */
+			u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));
+			rlb_src_unlink(bond, hash_index);
+			rlb_src_link(bond, hash_src, hash_index);
+		}
+
 		client_info->ip_src = arp->ip_src;
 		client_info->ip_dst = arp->ip_dst;
 		/* arp->mac_dst is broadcast for arp reqeusts.
@@ -654,6 +676,7 @@  static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		 * upon receiving an arp reply.
 		 */
 		memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
+		memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
 		client_info->slave = assigned_slave;
 
 		if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
@@ -669,11 +692,11 @@  static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		}
 
 		if (!client_info->assigned) {
-			u32 prev_tbl_head = bond_info->rx_hashtbl_head;
-			bond_info->rx_hashtbl_head = hash_index;
-			client_info->next = prev_tbl_head;
+			u32 prev_tbl_head = bond_info->rx_hashtbl_used_head;
+			bond_info->rx_hashtbl_used_head = hash_index;
+			client_info->used_next = prev_tbl_head;
 			if (prev_tbl_head != RLB_NULL_INDEX) {
-				bond_info->rx_hashtbl[prev_tbl_head].prev =
+				bond_info->rx_hashtbl[prev_tbl_head].used_prev =
 					hash_index;
 			}
 			client_info->assigned = 1;
@@ -740,8 +763,8 @@  static void rlb_rebalance(struct bonding *bond)
 	_lock_rx_hashtbl_bh(bond);
 
 	ntt = 0;
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 		assigned_slave = rlb_next_rx_slave(bond);
 		if (assigned_slave && (client_info->slave != assigned_slave)) {
@@ -759,11 +782,113 @@  static void rlb_rebalance(struct bonding *bond)
 }
 
 /* Caller must hold rx_hashtbl lock */
+static void rlb_init_table_entry_dst(struct rlb_client_info *entry)
+{
+	entry->used_next = RLB_NULL_INDEX;
+	entry->used_prev = RLB_NULL_INDEX;
+	entry->assigned = 0;
+	entry->slave = NULL;
+	entry->tag = 0;
+}
+static void rlb_init_table_entry_src(struct rlb_client_info *entry)
+{
+	entry->src_first = RLB_NULL_INDEX;
+	entry->src_prev = RLB_NULL_INDEX;
+	entry->src_next = RLB_NULL_INDEX;
+}
+
 static void rlb_init_table_entry(struct rlb_client_info *entry)
 {
 	memset(entry, 0, sizeof(struct rlb_client_info));
-	entry->next = RLB_NULL_INDEX;
-	entry->prev = RLB_NULL_INDEX;
+	rlb_init_table_entry_dst(entry);
+	rlb_init_table_entry_src(entry);
+}
+
+static void rlb_delete_table_entry_dst(struct bonding *bond, u32 index)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	u32 next_index = bond_info->rx_hashtbl[index].used_next;
+	u32 prev_index = bond_info->rx_hashtbl[index].used_prev;
+
+	if (index == bond_info->rx_hashtbl_used_head)
+		bond_info->rx_hashtbl_used_head = next_index;
+	if (prev_index != RLB_NULL_INDEX)
+		bond_info->rx_hashtbl[prev_index].used_next = next_index;
+	if (next_index != RLB_NULL_INDEX)
+		bond_info->rx_hashtbl[next_index].used_prev = prev_index;
+}
+
+/* unlink a rlb hash table entry from the src list */
+static void rlb_src_unlink(struct bonding *bond, u32 index)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	u32 next_index = bond_info->rx_hashtbl[index].src_next;
+	u32 prev_index = bond_info->rx_hashtbl[index].src_prev;
+
+	bond_info->rx_hashtbl[index].src_next = RLB_NULL_INDEX;
+	bond_info->rx_hashtbl[index].src_prev = RLB_NULL_INDEX;
+
+	if (next_index != RLB_NULL_INDEX)
+		bond_info->rx_hashtbl[next_index].src_prev = prev_index;
+
+	if (prev_index == RLB_NULL_INDEX)
+		return;
+
+	/* is prev_index pointing to the head of this list? */
+	if (bond_info->rx_hashtbl[prev_index].src_first == index)
+		bond_info->rx_hashtbl[prev_index].src_first = next_index;
+	else
+		bond_info->rx_hashtbl[prev_index].src_next = next_index;
+
+}
+
+static void rlb_delete_table_entry(struct bonding *bond, u32 index)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
+
+	rlb_delete_table_entry_dst(bond, index);
+	rlb_init_table_entry_dst(entry);
+
+	rlb_src_unlink(bond, index);
+}
+
+/* add the rx_hashtbl[ip_dst_hash] entry to the list
+ * of entries with identical ip_src_hash
+ */
+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	u32 next;
+
+	bond_info->rx_hashtbl[ip_dst_hash].src_prev = ip_src_hash;
+	next = bond_info->rx_hashtbl[ip_src_hash].src_first;
+	bond_info->rx_hashtbl[ip_dst_hash].src_next = next;
+	if (next != RLB_NULL_INDEX)
+		bond_info->rx_hashtbl[next].src_prev = ip_dst_hash;
+	bond_info->rx_hashtbl[ip_src_hash].src_first = ip_dst_hash;
+}
+
+/* deletes all rx_hashtbl entries with  arp->ip_src if their mac_src does
+ * not match arp->mac_src */
+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	u32 ip_src_hash = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
+	u32 index;
+
+	_lock_rx_hashtbl_bh(bond);
+
+	index = bond_info->rx_hashtbl[ip_src_hash].src_first;
+	while (index != RLB_NULL_INDEX) {
+		struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
+		u32 next_index = entry->src_next;
+		if (entry->ip_src == arp->ip_src &&
+		    !ether_addr_equal_64bits(arp->mac_src, entry->mac_src))
+				rlb_delete_table_entry(bond, index);
+		index = next_index;
+	}
+	_unlock_rx_hashtbl_bh(bond);
 }
 
 static int rlb_initialize(struct bonding *bond)
@@ -781,7 +906,7 @@  static int rlb_initialize(struct bonding *bond)
 
 	bond_info->rx_hashtbl = new_hashtbl;
 
-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
 
 	for (i = 0; i < RLB_HASH_TABLE_SIZE; i++) {
 		rlb_init_table_entry(bond_info->rx_hashtbl + i);
@@ -803,7 +928,7 @@  static void rlb_deinitialize(struct bonding *bond)
 
 	kfree(bond_info->rx_hashtbl);
 	bond_info->rx_hashtbl = NULL;
-	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
+	bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
 
 	_unlock_rx_hashtbl_bh(bond);
 }
@@ -815,25 +940,13 @@  static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
 
 	_lock_rx_hashtbl_bh(bond);
 
-	curr_index = bond_info->rx_hashtbl_head;
+	curr_index = bond_info->rx_hashtbl_used_head;
 	while (curr_index != RLB_NULL_INDEX) {
 		struct rlb_client_info *curr = &(bond_info->rx_hashtbl[curr_index]);
-		u32 next_index = bond_info->rx_hashtbl[curr_index].next;
-		u32 prev_index = bond_info->rx_hashtbl[curr_index].prev;
-
-		if (curr->tag && (curr->vlan_id == vlan_id)) {
-			if (curr_index == bond_info->rx_hashtbl_head) {
-				bond_info->rx_hashtbl_head = next_index;
-			}
-			if (prev_index != RLB_NULL_INDEX) {
-				bond_info->rx_hashtbl[prev_index].next = next_index;
-			}
-			if (next_index != RLB_NULL_INDEX) {
-				bond_info->rx_hashtbl[next_index].prev = prev_index;
-			}
+		u32 next_index = bond_info->rx_hashtbl[curr_index].used_next;
 
-			rlb_init_table_entry(curr);
-		}
+		if (curr->tag && (curr->vlan_id == vlan_id))
+			rlb_delete_table_entry(bond, curr_index);
 
 		curr_index = next_index;
 	}
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index 90f140a..de831ba 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -94,15 +94,35 @@  struct tlb_client_info {
 
 /* -------------------------------------------------------------------------
  * struct rlb_client_info contains all info related to a specific rx client
- * connection. This is the Clients Hash Table entry struct
+ * connection. This is the Clients Hash Table entry struct.
+ * Note that this is not a proper hash table; if a new client's IP address
+ * hash collides with an existing client entry, the old entry is replaced.
+ *
+ * There is a linked list (linked by the used_next and used_prev members)
+ * linking all the used entries of the hash table. This allows updating
+ * all the clients without walking over all the unused elements of the table.
+ *
+ * There are also linked lists of entries with identical hash(ip_src). These
+ * allow cleaning up the table from ip_src<->mac_src associatins that have
+ * become outdated and would cause sending out invalid ARP updates to the
+ * network. These are linked by the (src_next and src_prev members).
  * -------------------------------------------------------------------------
  */
 struct rlb_client_info {
 	__be32 ip_src;		/* the server IP address */
 	__be32 ip_dst;		/* the client IP address */
+	u8  mac_src[ETH_ALEN];	/* the server MAC address */
 	u8  mac_dst[ETH_ALEN];	/* the client MAC address */
-	u32 next;		/* The next Hash table entry index */
-	u32 prev;		/* The previous Hash table entry index */
+
+	/* list of used hash table entries, starting at rx_hashtbl_used_head */
+	u32 used_next;
+	u32 used_prev;
+
+	/* ip_src based hashing */
+	u32 src_next;	/* next entry with same hash(ip_src) */
+	u32 src_prev;	/* prev entry with same hash(ip_src) */
+	u32 src_first;	/* first entry with hash(ip_src) == this entry's index */
+
 	u8  assigned;		/* checking whether this entry is assigned */
 	u8  ntt;		/* flag - need to transmit client info */
 	struct slave *slave;	/* the slave assigned to this client */
@@ -131,7 +151,7 @@  struct alb_bond_info {
 	int rlb_enabled;
 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
 	spinlock_t		rx_hashtbl_lock;
-	u32			rx_hashtbl_head;
+	u32			rx_hashtbl_used_head;
 	u8			rx_ntt;	/* flag - need to transmit
 					 * to all rx clients
 					 */
diff --git a/drivers/net/bonding/bond_debugfs.c b/drivers/net/bonding/bond_debugfs.c
index 2cf084e..6ac855f 100644
--- a/drivers/net/bonding/bond_debugfs.c
+++ b/drivers/net/bonding/bond_debugfs.c
@@ -31,8 +31,8 @@  static int bond_debug_rlb_hash_show(struct seq_file *m, void *v)
 
 	spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
 
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
 		client_info = &(bond_info->rx_hashtbl[hash_index]);
 		seq_printf(m, "%-15pI4 %-15pI4 %-17pM %s\n",
 			&client_info->ip_src,