Message ID: 1500571312-19049-2-git-send-email-i.maximets@samsung.com
State: Changes Requested
On Thu, Jul 20, 2017 at 10:21 AM, Ilya Maximets <i.maximets@samsung.com> wrote:
> There are 3 constraints for moving hashes from one slave to another:
>
>   1. The load difference is larger than ~3% of one slave's load.
>   2. The load difference between slaves exceeds 100000 bytes.
>   3. Moving of the hash makes the load difference lower by > 10%.
>
> In current implementation if one of the slaves goes DOWN state, all
> the hashes assigned to it will be moved to other slaves. After that,
> if slave will go UP it will wait for rebalancing to get some hashes.
> But in case where we have more than 10 equally loaded hashes it
> will never fit constraint #3, because each hash will handle less than
> 10% of the load. Situation become worse when number of flows grows
> higher and it's almost impossible to migrate any hash when all the
> 256 hash entries are used which is very likely when we have few
> hundreds/thousands of flows.
>
> As a result, if one of the slaves goes down and up while traffic
> flows, it will never be used again for packet transmission.
> Situation will not be fixed even if we'll stop traffic completely
> and start it again because first two constraints will block
> rebalancing on the earlier stages while we have low amount of traffic.
>
> Moving of one hash if destination has no hashes as it was before
> commit c460a6a7bc75 ("ofproto/bond: simplify rebalancing logic")
> will not help because having one hash isn't enough to make load
> difference less than 10% of total load and this slave will
> handle only that one hash forever.
>
> To fix this lets try to move few hashes simultaniously to fit
> constraint #3.

Thanks for working on this.

The current interface tries to rebalance only one hash bucket at a time.
There are two reasons I can think of:

In general, it is a good idea to keep flow binding stable. Moving a flow
introduces side effects, such as packet reordering, that impact
application performance significantly.

The second reason is that the bandwidth each flow consumes tends to be
more dynamic in practice. It is usually pointless to use a snapshot of
bandwidth consumption to come up with a "perfect" bond port binding.
For example, TCP flows rate-adapt to the available bandwidth. It is
probably better to make smaller changes in each rebalancing step.

Is it still possible to address the issue mentioned while maintaining
the current interface? Can we find some way to break the tie in case the
current algorithm fails to nominate any bucket to move?

Thanks.
On 8/2/17 11:09 PM, Andy Zhou wrote:
> On Thu, Jul 20, 2017 at 10:21 AM, Ilya Maximets <i.maximets@samsung.com> wrote:
>> [... patch description snipped ...]
>
> Thanks for working on this.

Sorry for not replying for almost 4 years. :)

And sorry for resurrecting the thread, but the issue still exists and
I think that we still need to fix it. The first patch needs a minor
rebase, but it still works fine. The test in the second patch is
still valid.

> Current interface tries to only rebalance one hash bucket at a time.
> There are two reasons I can think of:
>
> In general, It is good idea to keep flow binding stable. Moving flow
> introduces side effects, such as packet re-ordering, that impacts
> application performances significantly.

Yes, I agree with that. We should try to minimize moving of flows.

The current algorithm moves hashes while the difference between the
heaviest member and the lightest one is more than ~3% or more than
~1 Mbps, and there is at least one hash that could be moved to make the
load difference better by 10%. That is, in general, a lot of hash
migrations. The new algorithm keeps all the previous restrictions, but
allows moving more than one hash at a time to fulfill the last 10%
restriction. It's hard to tell if it will result in more movements in
the general case. But, I think, we can tune it to stay as close to the
current behavior as possible. See below.

> The second reason is the bandwidth each flow consumes tends to
> be more dynamic in practice. It is usually pointless to use a snapshot
> of bandwidth consumption to come up a "perfect" bond port binding.
> For example, TCP flows rate adapt to available bandwidth. It is
> probably better to make smaller changes in each rebalancing step.

Yes, I agree with this statement too. But the new algorithm doesn't try
to create a perfect distribution. It stops collecting hashes for
migration when the distribution is close enough. In this patch that is
a 10% difference from the "ideal" value. I agree that this might be too
aggressive, and we can probably tune this value. For example, a 90%
difference from the "ideal" for a single movement would be somewhat
close to the current algorithm, as we would only move hashes that
reduce the difference by 10%. I'm afraid, though, that this would
result in many more movements, so it might make sense to settle
somewhere in the middle, e.g. 40-50% of improvement for a single
migration. This should be fair enough. Thoughts?

> Is it still possible to address the issue mentioned while maintaining
> current interface? Can we find some way to break the tie in case current
> algorithm fail to nominate any bucket to move?

It's hard to tell. I thought about this for a long time before sending
this patch and didn't come up with anything elegant. It would be easy
to identify the kind of case where all hashes are smaller than 10% of
the difference while the difference itself is high, and use the new
algorithm only for that case. But I suspect that we would then always
move hashes one by one several times with the old algorithm and use the
new one only on the last iteration to move a bunch of smaller hashes.
The end result might not be any different from just running the new
algorithm from the beginning, and this scheme would take much more
processing time.

For now I'm going to rebase this patch set and, probably, change the
moving threshold from 10% to 40-50%, so we will not move too many small
hashes at a time. Will send v2.

Best regards, Ilya Maximets.
On Mon, Jun 07, 2021 at 01:01:33PM +0200, Ilya Maximets wrote:
> On 8/2/17 11:09 PM, Andy Zhou wrote:
> > [... patch description and earlier discussion snipped ...]
>
> And sorry for resurrecting the thread, but the issue still exists and
> I think that we still need to fix it. The first patch needs a
> minor rebase, but it still works fine. The test in the second patch
> is still valid.

I don't think Andy is working on OVS these days. I'm the original
author of the rebalancing algorithm, so I went back and took a look at
patch 1. I see a little bit of coding style I'd do differently these
days (e.g. declare 'i' in the 'for' loops rather than at the top of a
block) but the code and the rationale for it seem solid to me.

I'll read v2 but I expect to ack it.

Thanks,
Ben.
On 6/11/21 1:24 AM, Ben Pfaff wrote:
> On Mon, Jun 07, 2021 at 01:01:33PM +0200, Ilya Maximets wrote:
>> [... earlier discussion snipped ...]
>
> I don't think Andy is working on OVS these days. I'm the original
> author of the rebalancing algorithm, so I went back and took a look at
> patch 1. I see a little bit of coding style I'd do differently these
> days (e.g. declare 'i' in the 'for' loops rather than at the top of a
> block) but the code and the rationale for it seems solid to me.
>
> I'll read v2 but I expect to ack it.

Thanks for taking a look! I rebased the patch and adjusted the coding
style a little bit:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210713153206.304369-1-i.maximets@ovn.org/

I didn't observe any benefits from changing the migration threshold,
though, so I kept it as it is in v1.

Best regards, Ilya Maximets.
diff --git a/ofproto/bond.c b/ofproto/bond.c
index cb25a1d..75f7551 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -1073,49 +1073,72 @@ bond_shift_load(struct bond_entry *hash, struct bond_slave *to)
     bond->bond_revalidate = true;
 }
 
-/* Picks and returns a bond_entry to migrate from 'from' (the most heavily
+/* Picks and returns a 'bond_entry's to migrate from 'from' (the most heavily
  * loaded bond slave) to a bond slave that has 'to_tx_bytes' bytes of load,
  * given that doing so must decrease the ratio of the load on the two slaves by
- * at least 0.1.  Returns NULL if there is no appropriate entry.
+ * at least 0.1.  Returns number of entries filled in 'to_migrate'.
  *
- * The list of entries isn't sorted.  I don't know of a reason to prefer to
- * shift away small hashes or large hashes. */
-static struct bond_entry *
-choose_entry_to_migrate(const struct bond_slave *from, uint64_t to_tx_bytes)
+ * The list of entries is sorted in descending order of load.  This allows us
+ * to collect subset of entries with accumulated load close to ideal. */
+static size_t
+choose_entries_to_migrate(const struct bond_slave *from, uint64_t to_tx_bytes,
+                          struct bond_entry **to_migrate)
     OVS_REQ_WRLOCK(rwlock)
 {
     struct bond_entry *e;
+    /* Note, the ideal traffic is the mid point between 'from' and 'to'.
+     * This value does not change by rebalancing. */
+    uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes) / 2;
+    uint64_t ideal_delta = ideal_tx_bytes - to_tx_bytes;
+    uint64_t delta = 0;    /* The amount to rebalance. */
+    uint64_t new_low;      /* The lower bandwidth between 'to' and 'from'
+                            * after rebalancing. */
+    uint64_t migrating_threshold = ideal_delta / 10; /* 10% */
+    size_t cnt = 0;
 
     if (ovs_list_is_short(&from->entries)) {
         /* 'from' carries no more than one MAC hash, so shifting load away from
          * it would be pointless. */
-        return NULL;
+        return 0;
     }
 
     LIST_FOR_EACH (e, list_node, &from->entries) {
-        uint64_t delta = e->tx_bytes;  /* The amount to rebalance. */
-        uint64_t ideal_tx_bytes = (from->tx_bytes + to_tx_bytes)/2;
-                             /* Note, the ideal traffic is the mid point
-                              * between 'from' and 'to'. This value does
-                              * not change by rebalancing. */
-        uint64_t new_low;    /* The lower bandwidth between 'to' and 'from'
-                                after rebalancing. */
-
-        new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
-
-        if ((new_low > to_tx_bytes) &&
-            (new_low - to_tx_bytes >= (ideal_tx_bytes - to_tx_bytes) / 10)) {
-            /* Only rebalance if the new 'low' is closer to to the mid point,
-             * and the improvement exceeds 10% of current traffic
-             * deviation from the ideal split.
-             *
-             * The improvement on the 'high' side is always the same as the
-             * 'low' side.  Thus consider 'low' side is sufficient. */
-            return e;
+        if (delta + e->tx_bytes <= ideal_delta) {
+            /* Take next entry if amount to rebalance will not exceed ideal. */
+            to_migrate[cnt++] = e;
+            delta += e->tx_bytes;
+        }
+        if (ideal_delta - delta < migrating_threshold) {
+            /* Stop collecting hashes if we're close enough to ideal value
+             * to avoid frequent moving of light ones. */
+            break;
         }
     }
 
-    return NULL;
+    if (!cnt) {
+        /* There is no entry which load less than or equal to 'ideal_delta'.
+         * Lets try closest one. The closest is the last in sorted list. */
+        struct bond_entry *closest;
+
+        ASSIGN_CONTAINER(closest, ovs_list_back(&from->entries), list_node);
+
+        delta = closest->tx_bytes;
+        to_migrate[cnt++] = closest;
+    }
+
+    new_low = MIN(from->tx_bytes - delta, to_tx_bytes + delta);
+    if ((new_low > to_tx_bytes) &&
+        (new_low - to_tx_bytes >= migrating_threshold)) {
+        /* Only rebalance if the new 'low' is closer to to the mid point and the
+         * improvement of traffic deviation from the ideal split exceeds 10%
+         * (migrating threshold).
+         *
+         * The improvement on the 'high' side is always the same as the 'low'
+         * side.  Thus consider 'low' side is sufficient. */
+        return cnt;
+    }
+
+    return 0;
 }
 
 /* Inserts 'slave' into 'bals' so that descending order of 'tx_bytes' is
@@ -1142,6 +1165,21 @@ reinsert_bal(struct ovs_list *bals, struct bond_slave *slave)
     insert_bal(bals, slave);
 }
 
+static int
+compare_bond_entries(const void *a_, const void *b_)
+{
+    const struct bond_entry *const *ap = a_;
+    const struct bond_entry *const *bp = b_;
+    const struct bond_entry *a = *ap;
+    const struct bond_entry *b = *bp;
+
+    if (a->tx_bytes != b->tx_bytes) {
+        return a->tx_bytes > b->tx_bytes ? -1 : 1;
+    } else {
+        return 0;
+    }
+}
+
 /* If 'bond' needs rebalancing, does so.
  *
  * The caller should have called bond_account() for each active flow, or in case
@@ -1152,10 +1190,11 @@ void
 bond_rebalance(struct bond *bond)
 {
     struct bond_slave *slave;
-    struct bond_entry *e;
+    struct bond_entry *e, *hashes[BOND_BUCKETS];
     struct ovs_list bals;
     bool rebalanced = false;
     bool use_recirc;
+    int i;
 
     ovs_rwlock_wrlock(&rwlock);
     if (!bond_is_balanced(bond) || time_msec() < bond->next_rebalance) {
@@ -1176,7 +1215,15 @@ bond_rebalance(struct bond *bond)
         slave->tx_bytes = 0;
         ovs_list_init(&slave->entries);
     }
-    for (e = &bond->hash[0]; e <= &bond->hash[BOND_MASK]; e++) {
+
+    for (i = 0; i < BOND_BUCKETS; i++) {
+        hashes[i] = &bond->hash[i];
+    }
+    qsort(hashes, BOND_BUCKETS, sizeof *hashes, compare_bond_entries);
+
+    /* Iteration over sorted bond hashes will give us sorted 'entries'. */
+    for (i = 0; i < BOND_BUCKETS; i++) {
+        e = hashes[i];
         if (e->slave && e->tx_bytes) {
             e->slave->tx_bytes += e->tx_bytes;
             ovs_list_push_back(&e->slave->entries, &e->list_node);
@@ -1200,6 +1247,7 @@ bond_rebalance(struct bond *bond)
         struct bond_slave *from = bond_slave_from_bal_node(ovs_list_front(&bals));
         struct bond_slave *to = bond_slave_from_bal_node(ovs_list_back(&bals));
         uint64_t overload;
+        size_t cnt;
 
         overload = from->tx_bytes - to->tx_bytes;
         if (overload < to->tx_bytes >> 5 || overload < 100000) {
@@ -1209,15 +1257,23 @@ bond_rebalance(struct bond *bond)
             break;
         }
 
-        /* 'from' is carrying significantly more load than 'to'.  Pick a hash
+        /* 'from' is carrying significantly more load than 'to'.  Pick a hashes
          * to move from 'from' to 'to'. */
-        e = choose_entry_to_migrate(from, to->tx_bytes);
-        if (e) {
+        cnt = choose_entries_to_migrate(from, to->tx_bytes, hashes);
+        if (!cnt) {
+            /* Can't usefully migrate anything away from 'from'.
+             * Don't reconsider it. */
+            ovs_list_remove(&from->bal_node);
+            continue;
+        }
+
+        for (i = 0; i < cnt; i++) {
+            e = hashes[i];
             bond_shift_load(e, to);
 
             /* Delete element from from->entries.
              *
-             * We don't add the element to to->hashes.  That would only allow
+             * We don't add the element to to->entries.  That would only allow
              * 'e' to be migrated to another slave in this rebalancing run, and
              * there is no point in doing that. */
             ovs_list_remove(&e->list_node);
 
             /* Re-sort 'bals'. */
             reinsert_bal(&bals, from);
             reinsert_bal(&bals, to);
-            rebalanced = true;
-        } else {
-            /* Can't usefully migrate anything away from 'from'.
-             * Don't reconsider it. */
-            ovs_list_remove(&from->bal_node);
         }
+        rebalanced = true;
     }
 
     /* Implement exponentially weighted moving average.  A weight of 1/2 causes
There are 3 constraints for moving hashes from one slave to another:

  1. The load difference is larger than ~3% of one slave's load.
  2. The load difference between slaves exceeds 100000 bytes.
  3. Moving of the hash makes the load difference lower by > 10%.

In the current implementation, if one of the slaves goes to the DOWN
state, all the hashes assigned to it will be moved to other slaves.
After that, if the slave goes UP, it will wait for rebalancing to get
some hashes. But in the case where we have more than 10 equally loaded
hashes, constraint #3 will never be satisfied, because each hash will
handle less than 10% of the load. The situation becomes worse as the
number of flows grows, and it's almost impossible to migrate any hash
when all the 256 hash entries are used, which is very likely when we
have a few hundreds or thousands of flows.

As a result, if one of the slaves goes down and up while traffic flows,
it will never be used again for packet transmission. The situation will
not be fixed even if we stop traffic completely and start it again,
because the first two constraints will block rebalancing at the earlier
stages while we have a low amount of traffic.

Moving one hash when the destination has no hashes, as it was before
commit c460a6a7bc75 ("ofproto/bond: simplify rebalancing logic"), will
not help, because having one hash isn't enough to make the load
difference less than 10% of the total load, and that slave will handle
only that one hash forever.

To fix this, let's try to move a few hashes simultaneously to satisfy
constraint #3.

The implementation includes sorting of 'entries' to be able to collect
entries with accumulated load close enough to the ideal value.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
---
I guess the following tag should be correct, but I'm not sure, it was
too long ago ...

Fixes: 5422a9e189c6 ("bonding: Balance bond slaves based on ratio.")

 ofproto/bond.c | 128 ++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 90 insertions(+), 38 deletions(-)