diff mbox series

[ovs-dev] ovsdb: raft: Don't forward more than one command to the leader.

Message ID 20240626220232.3948706-1-i.maximets@ovn.org
State Accepted
Commit f8ed13355d9d01f5437e5c27cf5b3a2f094543e5
Delegated to: Ilya Maximets
Headers show
Series [ovs-dev] ovsdb: raft: Don't forward more than one command to the leader. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Ilya Maximets June 26, 2024, 10:02 p.m. UTC
Every transaction has RAFT log prerequisites.  Even if transactions
are not related (because RAFT doesn't actually know what data it is
handling).  When leader writes a new record to a RAFT storage, it is
getting appended to the log right away and changes current 'eid',
i.e., changes prerequisites.  The leader will not try to write new
records until the current one is committed, because until then the
pre-check will be failing.

However, that is different for the follower.  Followers do not add
records to the RAFT log until the leader sends an append request back.
So, if there are multiple transactions pending on a follower, it will
create a command for each of them and prerequisites will be set to the
same values.  All these commands will be sent to the leader, but only
one can succeed at a time, because accepting one command immediately
changes prerequisites and all other commands become non-applicable.
So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
failure is a transient failure, so the follower will re-process all the
failed transactions and send them again.  1 will succeed and N - 2 will
fail.  And so on, until there are no more transactions.  In the end,
instead of processing N transactions, the follower is performing
N * (N - 1) / 2 transaction processing iterations.  That is consuming
a huge amount of CPU resources completely unnecessarily.

Since there is no real chance for multiple transactions from the same
follower to succeed, it's better to not send them in the first place.
This also eliminates prerequisite mismatch messages on a leader in
this particular case.

In a test with 30 parallel shell threads executing 12K transactions
total with separate ovsdb-client calls through the same follower there
is about 60% performance improvement.  The test takes ~100 seconds to
complete without this change and ~40 seconds with this change applied.
The new time is very close to what it takes to execute the same test
through the cluster leader.

Note: prerequisite failures on a leader are still possible, but mostly
in a case of simultaneous transactions from different followers.  It's
a normal thing for a distributed database due to its nature.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 ovsdb/raft.c        | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 ovsdb/raft.h        |  2 +-
 ovsdb/storage.c     |  9 +++++----
 ovsdb/storage.h     |  5 ++++-
 ovsdb/transaction.c |  6 +-----
 5 files changed, 55 insertions(+), 12 deletions(-)

Comments

Dumitru Ceara June 28, 2024, 7:20 a.m. UTC | #1
On 6/27/24 00:02, Ilya Maximets wrote:
> Every transaction has RAFT log prerequisites.  Even if transactions
> are not related (because RAFT doesn't actually know what data it is
> handling).  When leader writes a new record to a RAFT storage, it is
> getting appended to the log right away and changes current 'eid',
> i.e., changes prerequisites.  The leader will not try to write new
> records until the current one is committed, because until then the
> pre-check will be failing.
> 
> However, that is different for the follower.  Followers do not add
> records to the RAFT log until the leader sends an append request back.
> So, if there are multiple transactions pending on a follower, it will
> create a command for each of them and prerequisites will be set to the
> same values.  All these commands will be sent to the leader, but only
> one can succeed at a time, because accepting one command immediately
> changes prerequisites and all other commands become non-applicable.
> So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
> failure is a transient failure, so the follower will re-process all the
> failed transactions and send them again.  1 will succeed and N - 2 will
> fail.  And so on, until there are no more transactions.  In the end,
> instead of processing N transactions, the follower is performing
> N * (N - 1) / 2 transaction processing iterations.  That is consuming
> a huge amount of CPU resources completely unnecessarily.
> 
> Since there is no real chance for multiple transactions from the same
> follower to succeed, it's better to not send them in the first place.
> This also eliminates prerequisite mismatch messages on a leader in
> this particular case.
> 

Makes sense!

> In a test with 30 parallel shell threads executing 12K transactions
> total with separate ovsdb-client calls through the same follower there
> is about 60% performance improvement.  The test takes ~100 seconds to
> complete without this change and ~40 seconds with this change applied.
> The new time is very close to what it takes to execute the same test
> through the cluster leader.
> 

Maybe a public link to the test (if possible) would be a nice reference
to have in the future?

> Note: prerequisite failures on a leader are still possible, but mostly
> in a case of simultaneous transactions from different followers.  It's
> a normal thing for a distributed database due to its nature.
> 
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  ovsdb/raft.c        | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>  ovsdb/raft.h        |  2 +-
>  ovsdb/storage.c     |  9 +++++----
>  ovsdb/storage.h     |  5 ++++-
>  ovsdb/transaction.c |  6 +-----
>  5 files changed, 55 insertions(+), 12 deletions(-)
> 
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index ac3d37ac4..116924058 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t index)
>      return &raft->snap.eid;
>  }
>  
> -const struct uuid *
> +static const struct uuid *
>  raft_current_eid(const struct raft *raft)
>  {
>      return raft_get_eid(raft, raft->log_end - 1);
>  }
>  
> +bool
> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
> +{
> +    if (!uuid_equals(raft_current_eid(raft), prereq)) {
> +        VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
> +                 "do not match current eid (" UUID_FMT ")",
> +                 __func__, UUID_ARGS(prereq),
> +                 UUID_ARGS(raft_current_eid(raft)));
> +        return false;
> +    }
> +
> +    /* Having incomplete commands on a follower means that the leader has
> +     * these commands and they will change the prerequisites once added to
> +     * the leader's log.
> +     *
> +     * There is a chance that all these commands will actually fail and the
> +     * record with current prerequisites will in fact succeed, but, since
> +     * these are our own commands, the chances are low.
> +     *
> +     * Incomplete commands on a leader will not change the leader's current
> +     * 'eid' on commit as they are already part of the leader's log. */
> +    if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
> +        struct raft_command *cmd;
> +
> +        HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {

Nit: I'd rewrite this as:

if (raft->role == RAFT_LEADER) {
    return true;
}

HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
    ...
}

> +            /* Skip commands that are already part of the log (have non-zero
> +             * index) and ones that do not carry any data (have zero 'eid'),
> +             * as they can't change prerequisites.
> +             *
> +             * Database will not re-run triggers unless the data changes or
> +             * one of the data-carrying triggers completes.  So, pre-check must
> +             * not fail if there are no outstanding data-carrying commands. */
> +            if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
> +                VLOG_DBG("%s: follower still has an incomplete command "
> +                         UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
> +                return false;
> +            }
> +        }
> +    }
> +
> +    return true;
> +}
> +
>  static struct raft_command *
>  raft_command_create_completed(enum raft_command_status status)
>  {
> diff --git a/ovsdb/raft.h b/ovsdb/raft.h
> index a5b55d9bf..5833aaf23 100644
> --- a/ovsdb/raft.h
> +++ b/ovsdb/raft.h
> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *,
>  void raft_take_leadership(struct raft *);
>  void raft_transfer_leadership(struct raft *, const char *reason);
>  
> -const struct uuid *raft_current_eid(const struct raft *);
> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
>  #endif /* lib/raft.h */
> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
> index 6c395106c..c5aec5459 100644
> --- a/ovsdb/storage.c
> +++ b/ovsdb/storage.c
> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct ovsdb_storage *storage,
>      return w;
>  }
>  
> -const struct uuid *
> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
> +bool
> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
> +                              const struct uuid *prereq)
>  {
>      if (!storage->raft) {
> -        return NULL;
> +        return true;
>      }
> -    return raft_current_eid(storage->raft);
> +    return raft_precheck_prereq(storage->raft, prereq);
>  }
> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
> index 05f40ce93..7079ea261 100644
> --- a/ovsdb/storage.h
> +++ b/ovsdb/storage.h
> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const char *filename,
>                                                      bool rw);
>  struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
>  
> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
> +/* Checks that there is a chance for a record with specified prerequisites
> + * to be successfully written to the storage. */
> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
> +                                   const struct uuid *prereq);
>  
>  #endif /* ovsdb/storage.h */
> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
> index 484a88e1c..65eca6478 100644
> --- a/ovsdb/transaction.c
> +++ b/ovsdb/transaction.c
> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress {
>  bool
>  ovsdb_txn_precheck_prereq(const struct ovsdb *db)
>  {
> -    const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
> -    if (!eid) {
> -        return true;
> -    }
> -    return uuid_equals(&db->prereq, eid);
> +    return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
>  }

Aside from the comment above, the patch looks good to me.  If you
address my comment.. or if you decide to dismiss it too :) then:

Acked-by: Dumitru Ceara <dceara@redhat.com>

This part is not related to the patch itself, but reviewing the change I
was wondering:

ovsdb_txn_precheck_prereq() is called in ovsdb_trigger_try(), where we
return early if the prereq failed for the transaction.  But
ovsdb_trigger_try() is called by ovsdb_trigger_run() for each pending
trigger.  If the first trigger failed precheck is there any reason to
not break early from the loop there?

bool
ovsdb_trigger_run(struct ovsdb *db, long long int now)
{
    struct ovsdb_trigger *t;

    bool run_triggers = db->run_triggers;
    db->run_triggers_now = db->run_triggers = false;

    bool disconnect_all = false;

    LIST_FOR_EACH_SAFE (t, node, &db->triggers) {
        cooperative_multitasking_yield();

        if (run_triggers
            || now - t->created >= t->timeout_msec
            || t->progress || t->txn_forward) {
            if (ovsdb_trigger_try(t, now)) {
                disconnect_all = true;
            }
        }
    }
    return disconnect_all;
}

Regards,
Dumitru
Ilya Maximets June 28, 2024, 9:43 a.m. UTC | #2
On 6/28/24 09:20, Dumitru Ceara wrote:
> On 6/27/24 00:02, Ilya Maximets wrote:
>> Every transaction has RAFT log prerequisites.  Even if transactions
>> are not related (because RAFT doesn't actually know what data it is
>> handling).  When leader writes a new record to a RAFT storage, it is
>> getting appended to the log right away and changes current 'eid',
>> i.e., changes prerequisites.  The leader will not try to write new
>> records until the current one is committed, because until then the
>> pre-check will be failing.
>>
>> However, that is different for the follower.  Followers do not add
>> records to the RAFT log until the leader sends an append request back.
>> So, if there are multiple transactions pending on a follower, it will
>> create a command for each of them and prerequisites will be set to the
>> same values.  All these commands will be sent to the leader, but only
>> one can succeed at a time, because accepting one command immediately
>> changes prerequisites and all other commands become non-applicable.
>> So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
>> failure is a transient failure, so the follower will re-process all the
>> failed transactions and send them again.  1 will succeed and N - 2 will
>> fail.  And so on, until there are no more transactions.  In the end,
>> instead of processing N transactions, the follower is performing
>> N * (N - 1) / 2 transaction processing iterations.  That is consuming
>> a huge amount of CPU resources completely unnecessarily.
>>
>> Since there is no real chance for multiple transactions from the same
>> follower to succeed, it's better to not send them in the first place.
>> This also eliminates prerequisite mismatch messages on a leader in
>> this particular case.
>>
> 
> Makes sense!
> 
>> In a test with 30 parallel shell threads executing 12K transactions
>> total with separate ovsdb-client calls through the same follower there
>> is about 60% performance improvement.  The test takes ~100 seconds to
>> complete without this change and ~40 seconds with this change applied.
>> The new time is very close to what it takes to execute the same test
>> through the cluster leader.
>>
> 
> Maybe a public link to the test (if possible) would be a nice reference
> to have in the future?

I'm not sure where to post it, so I'll just leave it here, and I can
add a link to this post into the commit message.

The script is adding 12K ports across 120 logical switches in a style
of ovn-kubernetes (sets up port security, some external IDs and adds
to an address set):

---
$ cat ../transaction-benchmark-parallel.sh
#set -x
set -o errexit

# Creating 120 logical switches
for i in $(seq 120); do
    ovn-nbctl ls-add ip-10-10-177-$i.us-west-2.compute.internal
done
IFS=$'\r\n' GLOBIGNORE='*' \
    command eval  'uuids=($(ovn-nbctl --bare --columns _uuid list logical_switch | sed "/^$/d"))'

for i in $(seq 120); do
    echo "uuid #${i}: ${uuids[$(($i - 1))]}"
done

# Creating one addres set
as_name="a0123456789012345678"
as_uuid=$(ovn-nbctl create address_set name=${as_name})

# Adding ports
function add_400() {
    rm -rf txn_results$1.txt
    for i in $(seq $1 $(($1 + 399)) ); do
        echo $i
        port_name="node-density-00000000-0000-0000-0000-000000000000_node-density-$i"
        namespace="node-density-00000000-0000-0000-0000-000000000000"
        ls_name="ip-10-10-177-$((${i} % 120 + 1)).us-west-2.compute.internal"
        uuid_name="row00000000_0000_0000_0000_000000000000"
        mac=$(printf '0a:58:0a:9b:%02x:%02x' $((${i} >> 8)) $((${i} & 255)))
        ip=$(printf "10.11.%d.%d\n" $((${i} / 255)) $((${i} % 255 + 1)))

        time ovsdb-client transact unix:$(pwd)/sandbox/nb2.ovsdb "[\"OVN_Northbound\",
        {
          \"uuid-name\":\"${uuid_name}\",
          \"row\":{
              \"name\":\"${port_name}\",
              \"options\":[\"map\",[[\"requested-chassis\",\"${ls_name}\"]]],
              \"addresses\":[\"set\",[\"${mac} ${ip}\"]],
              \"external_ids\":[\"map\",[[\"pod\",\"true\"],[\"namespace\",\"${namespace}\"]]],
              \"port_security\":[\"set\",[\"${mac} ${ip}\"]]
            },
          \"op\":\"insert\", \"table\":\"Logical_Switch_Port\"
        },
        {
          \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${uuids[$(($i % 120))]}\"]]],
          \"mutations\":[[\"ports\",\"insert\",[\"set\",[[\"named-uuid\",\"${uuid_name}\"]]]]],
          \"op\":\"mutate\", \"table\":\"Logical_Switch\"
        },
        {
          \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${as_uuid}\"]]],
          \"mutations\":[[\"addresses\",\"insert\",[\"set\",[\"${ip}\"]]]],
          \"op\":\"mutate\", \"table\":\"Address_Set\"
        }]" >> txn_results$1.txt
    done
    echo "done" >> txn_results$1.txt
}

rm -f txn_results*.txt

n=0
for i in $(seq 0 400 11600); do
    n=$(($n + 1))
    add_400 $i &
done

while [ $(grep 'done' txn_results*.txt | wc -l) != $n ]; do
    sleep 1
done
---

It is not the most generic script and it is designed for sandbox,
where nb1 is a leader by default:

# make sandbox SANDBOXFLAGS="--nbdb-model=clustered"
# pkill ovn-northd # to remove the noise
# time bash ../transaction-benchmark-parallel.sh

Note that the script doesn't perform a lot of correctness checks, so
it's better to examine the txn_results*.txt files, or check the
database content afterwards to be sure.

> 
>> Note: prerequisite failures on a leader are still possible, but mostly
>> in a case of simultaneous transactions from different followers.  It's
>> a normal thing for a distributed database due to its nature.
>>
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---
>>  ovsdb/raft.c        | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>>  ovsdb/raft.h        |  2 +-
>>  ovsdb/storage.c     |  9 +++++----
>>  ovsdb/storage.h     |  5 ++++-
>>  ovsdb/transaction.c |  6 +-----
>>  5 files changed, 55 insertions(+), 12 deletions(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index ac3d37ac4..116924058 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t index)
>>      return &raft->snap.eid;
>>  }
>>  
>> -const struct uuid *
>> +static const struct uuid *
>>  raft_current_eid(const struct raft *raft)
>>  {
>>      return raft_get_eid(raft, raft->log_end - 1);
>>  }
>>  
>> +bool
>> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
>> +{
>> +    if (!uuid_equals(raft_current_eid(raft), prereq)) {
>> +        VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
>> +                 "do not match current eid (" UUID_FMT ")",
>> +                 __func__, UUID_ARGS(prereq),
>> +                 UUID_ARGS(raft_current_eid(raft)));
>> +        return false;
>> +    }
>> +
>> +    /* Having incomplete commands on a follower means that the leader has
>> +     * these commands and they will change the prerequisites once added to
>> +     * the leader's log.
>> +     *
>> +     * There is a chance that all these commands will actually fail and the
>> +     * record with current prerequisites will in fact succeed, but, since
>> +     * these are our own commands, the chances are low.
>> +     *
>> +     * Incomplete commands on a leader will not change the leader's current
>> +     * 'eid' on commit as they are already part of the leader's log. */
>> +    if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
>> +        struct raft_command *cmd;
>> +
>> +        HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
> 
> Nit: I'd rewrite this as:
> 
> if (raft->role == RAFT_LEADER) {
>     return true;
> }
> 
> HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>     ...
> }

Makes sense, I can do that con commit, unless there will be further comments.

> 
>> +            /* Skip commands that are already part of the log (have non-zero
>> +             * index) and ones that do not carry any data (have zero 'eid'),
>> +             * as they can't change prerequisites.
>> +             *
>> +             * Database will not re-run triggers unless the data changes or
>> +             * one of the data-carrying triggers completes.  So, pre-check must
>> +             * not fail if there are no outstanding data-carrying commands. */
>> +            if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
>> +                VLOG_DBG("%s: follower still has an incomplete command "
>> +                         UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>  static struct raft_command *
>>  raft_command_create_completed(enum raft_command_status status)
>>  {
>> diff --git a/ovsdb/raft.h b/ovsdb/raft.h
>> index a5b55d9bf..5833aaf23 100644
>> --- a/ovsdb/raft.h
>> +++ b/ovsdb/raft.h
>> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *,
>>  void raft_take_leadership(struct raft *);
>>  void raft_transfer_leadership(struct raft *, const char *reason);
>>  
>> -const struct uuid *raft_current_eid(const struct raft *);
>> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
>>  #endif /* lib/raft.h */
>> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
>> index 6c395106c..c5aec5459 100644
>> --- a/ovsdb/storage.c
>> +++ b/ovsdb/storage.c
>> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct ovsdb_storage *storage,
>>      return w;
>>  }
>>  
>> -const struct uuid *
>> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
>> +bool
>> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
>> +                              const struct uuid *prereq)
>>  {
>>      if (!storage->raft) {
>> -        return NULL;
>> +        return true;
>>      }
>> -    return raft_current_eid(storage->raft);
>> +    return raft_precheck_prereq(storage->raft, prereq);
>>  }
>> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
>> index 05f40ce93..7079ea261 100644
>> --- a/ovsdb/storage.h
>> +++ b/ovsdb/storage.h
>> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const char *filename,
>>                                                      bool rw);
>>  struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
>>  
>> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
>> +/* Checks that there is a chance for a record with specified prerequisites
>> + * to be successfully written to the storage. */
>> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
>> +                                   const struct uuid *prereq);
>>  
>>  #endif /* ovsdb/storage.h */
>> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
>> index 484a88e1c..65eca6478 100644
>> --- a/ovsdb/transaction.c
>> +++ b/ovsdb/transaction.c
>> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress {
>>  bool
>>  ovsdb_txn_precheck_prereq(const struct ovsdb *db)
>>  {
>> -    const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
>> -    if (!eid) {
>> -        return true;
>> -    }
>> -    return uuid_equals(&db->prereq, eid);
>> +    return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
>>  }
> 
> Aside from the comment above, the patch looks good to me.  If you
> address my comment.. or if you decide to dismiss it too :) then:
> 
> Acked-by: Dumitru Ceara <dceara@redhat.com>
> 
> This part is not related to the patch itself, but reviewing the change I
> was wondering:
> 
> ovsdb_txn_precheck_prereq() is called in ovsdb_trigger_try(), where we
> return early if the prereq failed for the transaction.  But
> ovsdb_trigger_try() is called by ovsdb_trigger_run() for each pending
> trigger.  If the first trigger failed precheck is there any reason to
> not break early from the loop there?

The ovsdb_trigger_try() handles also transactions that are already
in progress, so we have to run it for those.

There is also an unrelated issue that we block even read-only transactions
today if there is an incomplete change in-flight.  This applies to both
the leader and the followers since introduction of the pre-check.
We could think of some optimization that would allow read-only transactions
in this case, the problem is that in the current design we need to process
the transaction in order to know if it is read-only or not and that might
be expensive.  However, if we find a way to avoid this somehow, we can
still execute some transactions while pre-check for others fails.  So, I'd
like to keep the loop as it is for now.

> 
> bool
> ovsdb_trigger_run(struct ovsdb *db, long long int now)
> {
>     struct ovsdb_trigger *t;
> 
>     bool run_triggers = db->run_triggers;
>     db->run_triggers_now = db->run_triggers = false;
> 
>     bool disconnect_all = false;
> 
>     LIST_FOR_EACH_SAFE (t, node, &db->triggers) {
>         cooperative_multitasking_yield();
> 
>         if (run_triggers
>             || now - t->created >= t->timeout_msec
>             || t->progress || t->txn_forward) {
>             if (ovsdb_trigger_try(t, now)) {
>                 disconnect_all = true;
>             }
>         }
>     }
>     return disconnect_all;
> }
> 
> Regards,
> Dumitru
>
Ilya Maximets July 8, 2024, 9:52 p.m. UTC | #3
On 6/28/24 11:43, Ilya Maximets wrote:
> On 6/28/24 09:20, Dumitru Ceara wrote:
>> On 6/27/24 00:02, Ilya Maximets wrote:
>>> Every transaction has RAFT log prerequisites.  Even if transactions
>>> are not related (because RAFT doesn't actually know what data it is
>>> handling).  When leader writes a new record to a RAFT storage, it is
>>> getting appended to the log right away and changes current 'eid',
>>> i.e., changes prerequisites.  The leader will not try to write new
>>> records until the current one is committed, because until then the
>>> pre-check will be failing.
>>>
>>> However, that is different for the follower.  Followers do not add
>>> records to the RAFT log until the leader sends an append request back.
>>> So, if there are multiple transactions pending on a follower, it will
>>> create a command for each of them and prerequisites will be set to the
>>> same values.  All these commands will be sent to the leader, but only
>>> one can succeed at a time, because accepting one command immediately
>>> changes prerequisites and all other commands become non-applicable.
>>> So, out of N commands, 1 will succeed and N - 1 will fail.  The cluster
>>> failure is a transient failure, so the follower will re-process all the
>>> failed transactions and send them again.  1 will succeed and N - 2 will
>>> fail.  And so on, until there are no more transactions.  In the end,
>>> instead of processing N transactions, the follower is performing
>>> N * (N - 1) / 2 transaction processing iterations.  That is consuming
>>> a huge amount of CPU resources completely unnecessarily.
>>>
>>> Since there is no real chance for multiple transactions from the same
>>> follower to succeed, it's better to not send them in the first place.
>>> This also eliminates prerequisite mismatch messages on a leader in
>>> this particular case.
>>>
>>
>> Makes sense!
>>
>>> In a test with 30 parallel shell threads executing 12K transactions
>>> total with separate ovsdb-client calls through the same follower there
>>> is about 60% performance improvement.  The test takes ~100 seconds to
>>> complete without this change and ~40 seconds with this change applied.
>>> The new time is very close to what it takes to execute the same test
>>> through the cluster leader.
>>>
>>
>> Maybe a public link to the test (if possible) would be a nice reference
>> to have in the future?
> 
> I'm not sure where to post it, so I'll just leave it here, and I can
> add a link to this post into the commit message.
> 
> The script is adding 12K ports across 120 logical switches in a style
> of ovn-kubernetes (sets up port security, some external IDs and adds
> to an address set):
> 
> ---
> $ cat ../transaction-benchmark-parallel.sh
> #set -x
> set -o errexit
> 
> # Creating 120 logical switches
> for i in $(seq 120); do
>     ovn-nbctl ls-add ip-10-10-177-$i.us-west-2.compute.internal
> done
> IFS=$'\r\n' GLOBIGNORE='*' \
>     command eval  'uuids=($(ovn-nbctl --bare --columns _uuid list logical_switch | sed "/^$/d"))'
> 
> for i in $(seq 120); do
>     echo "uuid #${i}: ${uuids[$(($i - 1))]}"
> done
> 
> # Creating one addres set
> as_name="a0123456789012345678"
> as_uuid=$(ovn-nbctl create address_set name=${as_name})
> 
> # Adding ports
> function add_400() {
>     rm -rf txn_results$1.txt
>     for i in $(seq $1 $(($1 + 399)) ); do
>         echo $i
>         port_name="node-density-00000000-0000-0000-0000-000000000000_node-density-$i"
>         namespace="node-density-00000000-0000-0000-0000-000000000000"
>         ls_name="ip-10-10-177-$((${i} % 120 + 1)).us-west-2.compute.internal"
>         uuid_name="row00000000_0000_0000_0000_000000000000"
>         mac=$(printf '0a:58:0a:9b:%02x:%02x' $((${i} >> 8)) $((${i} & 255)))
>         ip=$(printf "10.11.%d.%d\n" $((${i} / 255)) $((${i} % 255 + 1)))
> 
>         time ovsdb-client transact unix:$(pwd)/sandbox/nb2.ovsdb "[\"OVN_Northbound\",
>         {
>           \"uuid-name\":\"${uuid_name}\",
>           \"row\":{
>               \"name\":\"${port_name}\",
>               \"options\":[\"map\",[[\"requested-chassis\",\"${ls_name}\"]]],
>               \"addresses\":[\"set\",[\"${mac} ${ip}\"]],
>               \"external_ids\":[\"map\",[[\"pod\",\"true\"],[\"namespace\",\"${namespace}\"]]],
>               \"port_security\":[\"set\",[\"${mac} ${ip}\"]]
>             },
>           \"op\":\"insert\", \"table\":\"Logical_Switch_Port\"
>         },
>         {
>           \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${uuids[$(($i % 120))]}\"]]],
>           \"mutations\":[[\"ports\",\"insert\",[\"set\",[[\"named-uuid\",\"${uuid_name}\"]]]]],
>           \"op\":\"mutate\", \"table\":\"Logical_Switch\"
>         },
>         {
>           \"where\":[[\"_uuid\",\"==\",[\"uuid\",\"${as_uuid}\"]]],
>           \"mutations\":[[\"addresses\",\"insert\",[\"set\",[\"${ip}\"]]]],
>           \"op\":\"mutate\", \"table\":\"Address_Set\"
>         }]" >> txn_results$1.txt
>     done
>     echo "done" >> txn_results$1.txt
> }
> 
> rm -f txn_results*.txt
> 
> n=0
> for i in $(seq 0 400 11600); do
>     n=$(($n + 1))
>     add_400 $i &
> done
> 
> while [ $(grep 'done' txn_results*.txt | wc -l) != $n ]; do
>     sleep 1
> done
> ---
> 
> It is not the most generic script and it is designed for sandbox,
> where nb1 is a leader by default:
> 
> # make sandbox SANDBOXFLAGS="--nbdb-model=clustered"
> # pkill ovn-northd # to remove the noise
> # time bash ../transaction-benchmark-parallel.sh
> 
> Note that the script doesn't perform a lot of correctness checks, so
> it's better to examine the txn_results*.txt files, or check the
> database content afterwards to be sure.
> 
>>
>>> Note: prerequisite failures on a leader are still possible, but mostly
>>> in a case of simultaneous transactions from different followers.  It's
>>> a normal thing for a distributed database due to its nature.
>>>
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> ---
>>>  ovsdb/raft.c        | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>>>  ovsdb/raft.h        |  2 +-
>>>  ovsdb/storage.c     |  9 +++++----
>>>  ovsdb/storage.h     |  5 ++++-
>>>  ovsdb/transaction.c |  6 +-----
>>>  5 files changed, 55 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>>> index ac3d37ac4..116924058 100644
>>> --- a/ovsdb/raft.c
>>> +++ b/ovsdb/raft.c
>>> @@ -2307,12 +2307,55 @@ raft_get_eid(const struct raft *raft, uint64_t index)
>>>      return &raft->snap.eid;
>>>  }
>>>  
>>> -const struct uuid *
>>> +static const struct uuid *
>>>  raft_current_eid(const struct raft *raft)
>>>  {
>>>      return raft_get_eid(raft, raft->log_end - 1);
>>>  }
>>>  
>>> +bool
>>> +raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
>>> +{
>>> +    if (!uuid_equals(raft_current_eid(raft), prereq)) {
>>> +        VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
>>> +                 "do not match current eid (" UUID_FMT ")",
>>> +                 __func__, UUID_ARGS(prereq),
>>> +                 UUID_ARGS(raft_current_eid(raft)));
>>> +        return false;
>>> +    }
>>> +
>>> +    /* Having incomplete commands on a follower means that the leader has
>>> +     * these commands and they will change the prerequisites once added to
>>> +     * the leader's log.
>>> +     *
>>> +     * There is a chance that all these commands will actually fail and the
>>> +     * record with current prerequisites will in fact succeed, but, since
>>> +     * these are our own commands, the chances are low.
>>> +     *
>>> +     * Incomplete commands on a leader will not change the leader's current
>>> +     * 'eid' on commit as they are already part of the leader's log. */
>>> +    if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
>>> +        struct raft_command *cmd;
>>> +
>>> +        HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>>
>> Nit: I'd rewrite this as:
>>
>> if (raft->role == RAFT_LEADER) {
>>     return true;
>> }
>>
>> HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
>>     ...
>> }
> 
> Makes sense, I can do that con commit, unless there will be further comments.
> 
>>
>>> +            /* Skip commands that are already part of the log (have non-zero
>>> +             * index) and ones that do not carry any data (have zero 'eid'),
>>> +             * as they can't change prerequisites.
>>> +             *
>>> +             * Database will not re-run triggers unless the data changes or
>>> +             * one of the data-carrying triggers completes.  So, pre-check must
>>> +             * not fail if there are no outstanding data-carrying commands. */
>>> +            if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
>>> +                VLOG_DBG("%s: follower still has an incomplete command "
>>> +                         UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
>>> +                return false;
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>>  static struct raft_command *
>>>  raft_command_create_completed(enum raft_command_status status)
>>>  {
>>> diff --git a/ovsdb/raft.h b/ovsdb/raft.h
>>> index a5b55d9bf..5833aaf23 100644
>>> --- a/ovsdb/raft.h
>>> +++ b/ovsdb/raft.h
>>> @@ -189,5 +189,5 @@ struct ovsdb_error *raft_store_snapshot(struct raft *,
>>>  void raft_take_leadership(struct raft *);
>>>  void raft_transfer_leadership(struct raft *, const char *reason);
>>>  
>>> -const struct uuid *raft_current_eid(const struct raft *);
>>> +bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
>>>  #endif /* lib/raft.h */
>>> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
>>> index 6c395106c..c5aec5459 100644
>>> --- a/ovsdb/storage.c
>>> +++ b/ovsdb/storage.c
>>> @@ -661,11 +661,12 @@ ovsdb_storage_write_schema_change(struct ovsdb_storage *storage,
>>>      return w;
>>>  }
>>>  
>>> -const struct uuid *
>>> -ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
>>> +bool
>>> +ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
>>> +                              const struct uuid *prereq)
>>>  {
>>>      if (!storage->raft) {
>>> -        return NULL;
>>> +        return true;
>>>      }
>>> -    return raft_current_eid(storage->raft);
>>> +    return raft_precheck_prereq(storage->raft, prereq);
>>>  }
>>> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
>>> index 05f40ce93..7079ea261 100644
>>> --- a/ovsdb/storage.h
>>> +++ b/ovsdb/storage.h
>>> @@ -96,6 +96,9 @@ struct ovsdb_storage *ovsdb_storage_open_standalone(const char *filename,
>>>                                                      bool rw);
>>>  struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
>>>  
>>> -const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
>>> +/* Checks that there is a chance for a record with specified prerequisites
>>> + * to be successfully written to the storage. */
>>> +bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
>>> +                                   const struct uuid *prereq);
>>>  
>>>  #endif /* ovsdb/storage.h */
>>> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
>>> index 484a88e1c..65eca6478 100644
>>> --- a/ovsdb/transaction.c
>>> +++ b/ovsdb/transaction.c
>>> @@ -1277,11 +1277,7 @@ struct ovsdb_txn_progress {
>>>  bool
>>>  ovsdb_txn_precheck_prereq(const struct ovsdb *db)
>>>  {
>>> -    const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
>>> -    if (!eid) {
>>> -        return true;
>>> -    }
>>> -    return uuid_equals(&db->prereq, eid);
>>> +    return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
>>>  }
>>
>> Aside from the comment above, the patch looks good to me.  If you
>> address my comment.. or if you decide to dismiss it too :) then:
>>
>> Acked-by: Dumitru Ceara <dceara@redhat.com>

I made the change above and applied the patch.  Thanks!

Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index ac3d37ac4..116924058 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -2307,12 +2307,55 @@  raft_get_eid(const struct raft *raft, uint64_t index)
     return &raft->snap.eid;
 }
 
-const struct uuid *
+static const struct uuid *
 raft_current_eid(const struct raft *raft)
 {
     return raft_get_eid(raft, raft->log_end - 1);
 }
 
+bool
+raft_precheck_prereq(const struct raft *raft, const struct uuid *prereq)
+{
+    if (!uuid_equals(raft_current_eid(raft), prereq)) {
+        VLOG_DBG("%s: prerequisites (" UUID_FMT ") "
+                 "do not match current eid (" UUID_FMT ")",
+                 __func__, UUID_ARGS(prereq),
+                 UUID_ARGS(raft_current_eid(raft)));
+        return false;
+    }
+
+    /* Having incomplete commands on a follower means that the leader has
+     * these commands and they will change the prerequisites once added to
+     * the leader's log.
+     *
+     * There is a chance that all these commands will actually fail and the
+     * record with current prerequisites will in fact succeed, but, since
+     * these are our own commands, the chances are low.
+     *
+     * Incomplete commands on a leader will not change the leader's current
+     * 'eid' on commit as they are already part of the leader's log. */
+    if (raft->role != RAFT_LEADER && hmap_count(&raft->commands)) {
+        struct raft_command *cmd;
+
+        HMAP_FOR_EACH (cmd, hmap_node, &raft->commands) {
+            /* Skip commands that are already part of the log (have non-zero
+             * index) and ones that do not carry any data (have zero 'eid'),
+             * as they can't change prerequisites.
+             *
+             * Database will not re-run triggers unless the data changes or
+             * one of the data-carrying triggers completes.  So, pre-check must
+             * not fail if there are no outstanding data-carrying commands. */
+            if (!cmd->index && !uuid_is_zero(&cmd->eid)) {
+                VLOG_DBG("%s: follower still has an incomplete command "
+                         UUID_FMT, __func__, UUID_ARGS(&cmd->eid));
+                return false;
+            }
+        }
+    }
+
+    return true;
+}
+
 static struct raft_command *
 raft_command_create_completed(enum raft_command_status status)
 {
diff --git a/ovsdb/raft.h b/ovsdb/raft.h
index a5b55d9bf..5833aaf23 100644
--- a/ovsdb/raft.h
+++ b/ovsdb/raft.h
@@ -189,5 +189,5 @@  struct ovsdb_error *raft_store_snapshot(struct raft *,
 void raft_take_leadership(struct raft *);
 void raft_transfer_leadership(struct raft *, const char *reason);
 
-const struct uuid *raft_current_eid(const struct raft *);
+bool raft_precheck_prereq(const struct raft *, const struct uuid *prereq);
 #endif /* lib/raft.h */
diff --git a/ovsdb/storage.c b/ovsdb/storage.c
index 6c395106c..c5aec5459 100644
--- a/ovsdb/storage.c
+++ b/ovsdb/storage.c
@@ -661,11 +661,12 @@  ovsdb_storage_write_schema_change(struct ovsdb_storage *storage,
     return w;
 }
 
-const struct uuid *
-ovsdb_storage_peek_last_eid(struct ovsdb_storage *storage)
+bool
+ovsdb_storage_precheck_prereq(const struct ovsdb_storage *storage,
+                              const struct uuid *prereq)
 {
     if (!storage->raft) {
-        return NULL;
+        return true;
     }
-    return raft_current_eid(storage->raft);
+    return raft_precheck_prereq(storage->raft, prereq);
 }
diff --git a/ovsdb/storage.h b/ovsdb/storage.h
index 05f40ce93..7079ea261 100644
--- a/ovsdb/storage.h
+++ b/ovsdb/storage.h
@@ -96,6 +96,9 @@  struct ovsdb_storage *ovsdb_storage_open_standalone(const char *filename,
                                                     bool rw);
 struct ovsdb_schema *ovsdb_storage_read_schema(struct ovsdb_storage *);
 
-const struct uuid *ovsdb_storage_peek_last_eid(struct ovsdb_storage *);
+/* Checks that there is a chance for a record with specified prerequisites
+ * to be successfully written to the storage. */
+bool ovsdb_storage_precheck_prereq(const struct ovsdb_storage *,
+                                   const struct uuid *prereq);
 
 #endif /* ovsdb/storage.h */
diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
index 484a88e1c..65eca6478 100644
--- a/ovsdb/transaction.c
+++ b/ovsdb/transaction.c
@@ -1277,11 +1277,7 @@  struct ovsdb_txn_progress {
 bool
 ovsdb_txn_precheck_prereq(const struct ovsdb *db)
 {
-    const struct uuid *eid = ovsdb_storage_peek_last_eid(db->storage);
-    if (!eid) {
-        return true;
-    }
-    return uuid_equals(&db->prereq, eid);
+    return ovsdb_storage_precheck_prereq(db->storage, &db->prereq);
 }
 
 struct ovsdb_txn_progress *