[ovs-dev,1/2] bond: Fix broken rebalancing after link state changes.

Message ID: 1500571312-19049-2-git-send-email-i.maximets@samsung.com
State: Changes Requested

Commit Message

Ilya Maximets July 20, 2017, 5:21 p.m. UTC
There are 3 constraints for moving hashes from one slave to another
(sketched in code below the list):

	1. The load difference is larger than ~3% of one slave's load.
	2. The load difference between slaves exceeds 100000 bytes.
	3. Moving of the hash makes the load difference lower by > 10%.
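
The sketch (a paraphrase of the logic in bond_rebalance() and
choose_entry_to_migrate() from the patch below, not the verbatim
source; 'e' is a candidate entry to move):

    uint64_t overload = from->tx_bytes - to->tx_bytes;

    if (overload < to->tx_bytes >> 5 || overload < 100000) {
        /* #1 and #2: the difference is within ~3% of the lighter
         * member's load or under 100000 bytes: nothing is moved. */
        return;
    }

    /* #3: moving entry 'e' must bring the lighter member closer to
     * the mid point by at least 10% of its current deviation. */
    uint64_t ideal = (from->tx_bytes + to->tx_bytes) / 2;
    uint64_t new_low = MIN(from->tx_bytes - e->tx_bytes,
                           to->tx_bytes + e->tx_bytes);
    bool can_move = new_low > to->tx_bytes
                    && new_low - to->tx_bytes >= (ideal - to->tx_bytes) / 10;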

In the current implementation, if one of the slaves goes DOWN, all
the hashes assigned to it are moved to other slaves.  After that, if
the slave goes UP again, it has to wait for rebalancing to get some
hashes back.  But when there are more than 10 equally loaded hashes,
constraint #3 can never be satisfied, because each hash handles less
than 10% of the load.  The situation gets worse as the number of
flows grows: it becomes almost impossible to migrate any hash once
all 256 hash entries are in use, which is very likely with a few
hundred or a few thousand flows.  The example below illustrates this.
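
To make the numbers concrete (a hypothetical scenario): suppose slave
B has just come back up with 0 bytes of load, while slave A carries
1,000,000 bytes spread over 25 equally loaded hashes of 40,000 bytes
each.  The ideal split is 500,000 bytes per slave, so constraint #3
requires a single move to improve the low side by at least 50,000
bytes, but moving any one hash improves it by only 40,000 bytes.  No
hash ever qualifies, and B stays idle.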

As a result, if one of the slaves goes down and comes back up while
traffic flows, it is never used for packet transmission again.  The
situation is not fixed even by stopping the traffic completely and
restarting it, because the first two constraints block rebalancing
at the earlier stages, while the traffic volume is still low.

Moving a single hash when the destination has no hashes, as was done
before commit c460a6a7bc75 ("ofproto/bond: simplify rebalancing
logic"), would not help either: one hash is not enough to bring the
load difference below 10% of the total load, so that slave would
handle only that one hash forever.

To fix this, let's try to move a few hashes simultaneously so that
constraint #3 can be satisfied.

The implementation sorts 'entries' so that it can collect a subset of
entries whose accumulated load is close enough to the ideal value, as
the continued example below shows.
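
Continuing the hypothetical example above: with the entries sorted by
load, the new code accumulates hashes while their total stays within
the 500,000-byte ideal delta, collecting 12 hashes (480,000 bytes).
The remaining 20,000-byte gap is below the 50,000-byte threshold, so
collection stops and all 12 hashes migrate in one rebalancing step,
leaving A with 520,000 bytes and B with 480,000.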

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
---

I guess the following tag should be correct, but I'm not sure; it was
too long ago ...

Fixes: 5422a9e189c6 ("bonding: Balance bond slaves based on ratio.")

 ofproto/bond.c | 128 ++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 90 insertions(+), 38 deletions(-)

Comments

Andy Zhou Aug. 2, 2017, 9:09 p.m. UTC | #1
On Thu, Jul 20, 2017 at 10:21 AM, Ilya Maximets <i.maximets@samsung.com> wrote:
> [... commit message quoted in full; see above ...]

Thanks for working on this.

The current interface tries to rebalance only one hash bucket at a time.
There are two reasons I can think of:

In general, it is a good idea to keep flow binding stable.  Moving a
flow introduces side effects, such as packet re-ordering, that can
impact application performance significantly.

The second reason is that the bandwidth each flow consumes tends to
be dynamic in practice.  It is usually pointless to use a snapshot of
bandwidth consumption to come up with a "perfect" bond port binding.
For example, TCP flows rate-adapt to the available bandwidth.  It is
probably better to make smaller changes in each rebalancing step.

Is it still possible to address the issue mentioned while maintaining
the current interface?  Can we find some way to break the tie when the
current algorithm fails to nominate any bucket to move?

Thanks.
Ilya Maximets June 7, 2021, 11:01 a.m. UTC | #2
On 8/2/17 11:09 PM, Andy Zhou wrote:
> On Thu, Jul 20, 2017 at 10:21 AM, Ilya Maximets <i.maximets@samsung.com> wrote:
>> [... commit message quoted in full; see above ...]
> 
> Thanks for working on this.

Sorry for not replying for almost 4 years. :)

And sorry for resurrecting the thread, but the issue still exists and
I think that we still need to fix it.  The first patch needs a
minor rebase, but it still works fine.  The test in the second patch
is still valid.

> 
> The current interface tries to rebalance only one hash bucket at a time.
> There are two reasons I can think of:
> 
> In general, it is a good idea to keep flow binding stable.  Moving a
> flow introduces side effects, such as packet re-ordering, that can
> impact application performance significantly.

Yes, I agree with that.  We should try to minimize the moving of
flows.  The current algorithm keeps moving hashes while the difference
between the heaviest member and the lightest one is more than ~3% or
more than ~1Mbps, and there is at least one hash that could be moved
to improve the load difference by 10%.  That is, in general, a lot of
hash migrations.  The new algorithm keeps all the previous
restrictions, but allows moving more than one hash at a time to
fulfill the last 10% restriction.  It's hard to tell whether this
results in more movements in the general case.  But, I think, we can
tune it to stay as close to the current behavior as possible.  See
below.

> 
> The second reason is that the bandwidth each flow consumes tends to
> be dynamic in practice.  It is usually pointless to use a snapshot of
> bandwidth consumption to come up with a "perfect" bond port binding.
> For example, TCP flows rate-adapt to the available bandwidth.  It is
> probably better to make smaller changes in each rebalancing step.

Yes, I agree with this statement too.  But the new algorithm doesn't
try to create a perfect distribution.  It stops collecting hashes for
migration once the distribution is close enough; in this patch that
means within 10% of the "ideal" value.  I agree that this might be
too much, and we can probably tune this value.  For example, a 90%
difference from the "ideal" for a single movement would be somewhat
close to the current algorithm, as we would move hashes that reduce
the difference by 10%.  I'm afraid, though, that this would result in
many more movements, so it might make sense to settle somewhere in
the middle, e.g. 40-50% of improvement for a single migration.  This
should be fair enough.  (A sketch of such a tunable threshold is
below.)
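
A minimal sketch of how such a knob could look (the 'pct' parameter
is hypothetical; the patch itself hard-codes 10):

    /* 'pct' is the required improvement, as a percentage of the current
     * deviation from the ideal split.  pct == 10 matches this patch;
     * pct == 40..50 stops collecting hashes earlier and therefore moves
     * fewer of them per rebalancing step. */
    uint64_t migrating_threshold = ideal_delta * pct / 100;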

Thoughts?

> 
> Is it still possible to address the issue mentioned while maintaining
> the current interface?  Can we find some way to break the tie when the
> current algorithm fails to nominate any bucket to move?

It's hard to tell.  I had been thinking about this for a long time
before sending this patch and didn't come up with anything elegant.
It would be easy to identify this kind of case, where all hashes are
smaller than 10% of the difference but the difference itself is high,
and use the new algorithm only then.  But I suspect that we would then
always move hashes one by one several times with the old algorithm and
use the new one only on the last iteration to move a bunch of smaller
hashes.  The end result might not be any different from just running
the new algorithm from the beginning, and this scheme would also take
much more processing time.

For now I'm going to rebase this patch set and, probably, change the
moving threshold from 10% to 40-50%, so that we don't move too many
small hashes at a time.  Will send v2.

Best regards, Ilya Maximets.
Ben Pfaff June 10, 2021, 11:24 p.m. UTC | #3
On Mon, Jun 07, 2021 at 01:01:33PM +0200, Ilya Maximets wrote:
> On 8/2/17 11:09 PM, Andy Zhou wrote:
> > On Thu, Jul 20, 2017 at 10:21 AM, Ilya Maximets <i.maximets@samsung.com> wrote:
> >> [... commit message quoted in full; see above ...]
> > 
> > Thanks for working on this.
> 
> Sorry for not replying for almost 4 years. :)
> 
> And sorry for resurrecting the thread, but the issue still exists and
> I think that we still need to fix it.  The first patch needs a
> minor rebase, but it still works fine.  The test in the second patch
> is still valid.

I don't think Andy is working on OVS these days.  I'm the original
author of the rebalancing algorithm, so I went back and took a look at
patch 1.  I see a little bit of coding style I'd do differently these
days (e.g. declare 'i' in the 'for' loops rather than at the top of a
block), but the code and the rationale for it seem solid to me.

I'll read v2 but I expect to ack it.

Thanks,

Ben.
Ilya Maximets July 13, 2021, 3:35 p.m. UTC | #4
On 6/11/21 1:24 AM, Ben Pfaff wrote:
> On Mon, Jun 07, 2021 at 01:01:33PM +0200, Ilya Maximets wrote:
>> On 8/2/17 11:09 PM, Andy Zhou wrote:
>>> On Thu, Jul 20, 2017 at 10:21 AM, Ilya Maximets <i.maximets@samsung.com> wrote:
>>>> [... commit message quoted in full; see above ...]
>>>
>>> Thanks for working on this.
>>
>> Sorry for not replying for almost 4 years. :)
>>
>> And sorry for resurrecting the thread, but the issue still exists and
>> I think that we still need to fix it.  The first patch needs a
>> minor rebase, but it still works fine.  The test in the second patch
>> is still valid.
> 
> I don't think Andy is working on OVS these days.  I'm the original
> author of the rebalancing algorithm, so I went back and took a look at
> patch 1.  I see a little bit of coding style I'd do differently these
> days (e.g. declare 'i' in the 'for' loops rather than at the top of a
> block), but the code and the rationale for it seem solid to me.
> 
> I'll read v2 but I expect to ack it.

Thanks for taking a look!  I rebased the patch and adjusted the
coding style a little bit:
  https://patchwork.ozlabs.org/project/openvswitch/patch/20210713153206.304369-1-i.maximets@ovn.org/

I didn't observe any benefit from changing the migration threshold,
though, so I kept it as it is in v1.

Best regards, Ilya Maximets.

Patch

diff --git a/ofproto/bond.c b/ofproto/bond.c
index cb25a1d..75f7551 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -1073,49 +1073,72 @@  bond_shift_load(struct bond_entry *hash, struct bond_slave *to)
     bond->bond_revalidate = true;
 }
 
-/* Picks and returns a bond_entry to migrate from 'from' (the most heavily
+/* Picks 'bond_entry's to migrate from 'from' (the most heavily
  * loaded bond slave) to a bond slave that has 'to_tx_bytes' bytes of load,
  * given that doing so must decrease the ratio of the load on the two slaves by
- * at least 0.1.  Returns NULL if there is no appropriate entry.
+ * at least 0.1.  Returns the number of entries filled in 'to_migrate'.
  *
- * The list of entries isn't sorted.  I don't know of a reason to prefer to
- * shift away small hashes or large hashes. */
-static struct bond_entry *
-choose_entry_to_migrate(const struct bond_slave *from, uint64_t to_tx_bytes)
+ * The list of entries is sorted in descending order of load.  This allows us
+ * to collect a subset of entries whose accumulated load is close to ideal.  */
+static size_t
+choose_entries_to_migrate(const struct bond_slave *from, uint64_t to_tx_bytes,
+                          struct bond_entry **to_migrate)
     OVS_REQ_WRLOCK(rwlock)
 {
     struct bond_entry *e;
+    /* Note, the ideal traffic is the mid point between 'from' and 'to'.
+     * This value is not changed by rebalancing.  */
+    uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes) / 2;
+    uint64_t ideal_delta = ideal_tx_bytes - to_tx_bytes;
+    uint64_t delta = 0;         /* The amount to rebalance. */
+    uint64_t new_low;           /* The lower bandwidth between 'to' and 'from'
+                                 * after rebalancing. */
+    uint64_t migrating_threshold = ideal_delta / 10; /* 10% */
+    size_t cnt = 0;
 
     if (ovs_list_is_short(&from->entries)) {
         /* 'from' carries no more than one MAC hash, so shifting load away from
          * it would be pointless. */
-        return NULL;
+        return 0;
     }
 
     LIST_FOR_EACH (e, list_node, &from->entries) {
-        uint64_t delta = e->tx_bytes;  /* The amount to rebalance.  */
-        uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes)/2;
-                             /* Note, the ideal traffic is the mid point
-                              * between 'from' and 'to'. This value does
-                              * not change by rebalancing.  */
-        uint64_t new_low;    /* The lower bandwidth between 'to' and 'from'
-                                after rebalancing. */
-
-        new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
-
-        if ((new_low > to_tx_bytes) &&
-            (new_low - to_tx_bytes >= (ideal_tx_bytes - to_tx_bytes) / 10)) {
-            /* Only rebalance if the new 'low' is closer to to the mid point,
-             * and the improvement exceeds 10% of current traffic
-             * deviation from the ideal split.
-             *
-             * The improvement on the 'high' side is always the same as the
-             * 'low' side. Thus consider 'low' side is sufficient.  */
-            return e;
+        if (delta + e->tx_bytes <= ideal_delta) {
+            /* Take this entry if the amount to rebalance stays within ideal. */
+            to_migrate[cnt++] = e;
+            delta += e->tx_bytes;
+        }
+        if (ideal_delta - delta < migrating_threshold) {
+            /* Stop collecting hashes once we're close enough to the ideal
+             * value, to avoid frequently moving light ones.  */
+            break;
         }
     }
 
-    return NULL;
+    if (!cnt) {
+        /* There is no entry whose load is less than or equal to 'ideal_delta'.
+         * Let's try the closest one, i.e. the last in the sorted list. */
+        struct bond_entry *closest;
+
+        ASSIGN_CONTAINER(closest, ovs_list_back(&from->entries), list_node);
+
+        delta = closest->tx_bytes;
+        to_migrate[cnt++] = closest;
+    }
+
+    new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
+    if ((new_low > to_tx_bytes) &&
+        (new_low - to_tx_bytes >= migrating_threshold)) {
+        /* Only rebalance if the new 'low' is closer to the mid point and the
+         * improvement of traffic deviation from the ideal split exceeds 10%
+         * (the migrating threshold).
+         *
+         * The improvement on the 'high' side is always the same as on the
+         * 'low' side, so considering the 'low' side is sufficient. */
+        return cnt;
+    }
+
+    return 0;
 }
 
 /* Inserts 'slave' into 'bals' so that descending order of 'tx_bytes' is
@@ -1142,6 +1165,21 @@  reinsert_bal(struct ovs_list *bals, struct bond_slave *slave)
     insert_bal(bals, slave);
 }
 
+static int
+compare_bond_entries(const void *a_, const void *b_)
+{
+     const struct bond_entry *const *ap = a_;
+     const struct bond_entry *const *bp = b_;
+     const struct bond_entry *a = *ap;
+     const struct bond_entry *b = *bp;
+
+     if (a->tx_bytes != b->tx_bytes) {
+         return a->tx_bytes > b->tx_bytes ? -1 : 1;
+     } else {
+         return 0;
+     }
+}
+
 /* If 'bond' needs rebalancing, does so.
  *
  * The caller should have called bond_account() for each active flow, or in case
@@ -1152,10 +1190,11 @@  void
 bond_rebalance(struct bond *bond)
 {
     struct bond_slave *slave;
-    struct bond_entry *e;
+    struct bond_entry *e, *hashes[BOND_BUCKETS];
     struct ovs_list bals;
     bool rebalanced = false;
     bool use_recirc;
+    int i;
 
     ovs_rwlock_wrlock(&rwlock);
     if (!bond_is_balanced(bond) || time_msec() < bond->next_rebalance) {
@@ -1176,7 +1215,15 @@  bond_rebalance(struct bond *bond)
         slave->tx_bytes = 0;
         ovs_list_init(&slave->entries);
     }
-    for (e = &bond->hash[0]; e <= &bond->hash[BOND_MASK]; e++) {
+
+    for (i = 0; i < BOND_BUCKETS; i++) {
+        hashes[i] = &bond->hash[i];
+    }
+    qsort(hashes, BOND_BUCKETS, sizeof *hashes, compare_bond_entries);
+
+    /* Iteration over sorted bond hashes will give us sorted 'entries'. */
+    for (i = 0; i < BOND_BUCKETS; i++) {
+        e = hashes[i];
         if (e->slave && e->tx_bytes) {
             e->slave->tx_bytes += e->tx_bytes;
             ovs_list_push_back(&e->slave->entries, &e->list_node);
@@ -1200,6 +1247,7 @@  bond_rebalance(struct bond *bond)
         struct bond_slave *from = bond_slave_from_bal_node(ovs_list_front(&bals));
         struct bond_slave *to = bond_slave_from_bal_node(ovs_list_back(&bals));
         uint64_t overload;
+        size_t cnt;
 
         overload = from->tx_bytes - to->tx_bytes;
         if (overload < to->tx_bytes >> 5 || overload < 100000) {
@@ -1209,15 +1257,23 @@  bond_rebalance(struct bond *bond)
             break;
         }
 
-        /* 'from' is carrying significantly more load than 'to'.  Pick a hash
+        /* 'from' is carrying significantly more load than 'to'.  Pick hashes
          * to move from 'from' to 'to'. */
-        e = choose_entry_to_migrate(from, to->tx_bytes);
-        if (e) {
+        cnt = choose_entries_to_migrate(from, to->tx_bytes, hashes);
+        if (!cnt) {
+            /* Can't usefully migrate anything away from 'from'.
+             * Don't reconsider it. */
+            ovs_list_remove(&from->bal_node);
+            continue;
+        }
+
+        for (i = 0; i < cnt; i++) {
+            e = hashes[i];
             bond_shift_load(e, to);
 
             /* Delete element from from->entries.
              *
-             * We don't add the element to to->hashes.  That would only allow
+             * We don't add the element to to->entries.  That would only allow
              * 'e' to be migrated to another slave in this rebalancing run, and
              * there is no point in doing that. */
             ovs_list_remove(&e->list_node);
@@ -1225,12 +1281,8 @@  bond_rebalance(struct bond *bond)
             /* Re-sort 'bals'. */
             reinsert_bal(&bals, from);
             reinsert_bal(&bals, to);
-            rebalanced = true;
-        } else {
-            /* Can't usefully migrate anything away from 'from'.
-             * Don't reconsider it. */
-            ovs_list_remove(&from->bal_node);
         }
+        rebalanced = true;
     }
 
     /* Implement exponentially weighted moving average.  A weight of 1/2 causes