
[ovs-dev,v5] ovsdb raft: Sync commit index to followers without delay.

Message ID 1553359466-40630-1-git-send-email-hzhou8@ebay.com
State Superseded
Series [ovs-dev,v5] ovsdb raft: Sync commit index to followers without delay.

Commit Message

Han Zhou March 23, 2019, 4:44 p.m. UTC
From: Han Zhou <hzhou8@ebay.com>

When an update is requested from a follower, the leader sends an
AppendRequest to all followers and waits until an AppendReply is
received from the majority; only then does it update the commit index -
the new entry is now regarded as committed in the raft log. However,
this commit is not notified to the followers (including the one that
initiated the request) until the next heartbeat (ping timeout), if
there are no other pending requests. This results in long latency for
updates made through followers, especially when a batch of updates is
requested through the same follower.

$ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done

real    0m34.154s
user    0m0.083s
sys 0m0.250s

This patch solves the problem by sending a heartbeat as soon as the
commit index is updated on the leader. It also avoids unnecessary
heartbeats by resetting the ping timer whenever an AppendRequest is
broadcast. With this patch the performance is improved more than 50
times in the same test:

$ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done

real    0m0.564s
user    0m0.080s
sys 0m0.199s

Torture test cases are also updated, because otherwise they would all
be skipped due to the improved performance.

Signed-off-by: Han Zhou <hzhou8@ebay.com>
---

Notes:
    v4->v5: no change. Just to trigger CI again to make sure torture tests pass
            since the bug blocking the patch has been merged:
            https://github.com/openvswitch/ovs/commit/23b16a03fc0717eb4025da0d1710fb5d972ebe86

 ovsdb/raft.c           | 43 +++++++++++++++++++++++++++++--------------
 tests/ovsdb-cluster.at | 16 +++++++++++-----
 2 files changed, 40 insertions(+), 19 deletions(-)

Comments

Ben Pfaff March 25, 2019, 9 p.m. UTC | #1
On Sat, Mar 23, 2019 at 09:44:26AM -0700, Han Zhou wrote:
> [...]

The tests pass now, but each one of them ends up with several ovn-sbctl
and ovsdb-server processes all trying to use 100% of a CPU.  If I run
the tests with 10-way parallelism, the load average goes above 100.
Surely something has to be wrong in the implementation here.

Each of the ovn-sbctl processes is trying to push through only a single
transaction; after that, it exits.  If we think of ovsdb-server as
giving each of its clients one chance to execute an RPC in round-robin
order, which is approximately correct, then one of those transactions
should succeed per round.  I don't understand why, if this model is
correct, the ovn-sbctls would burn so much CPU.  If the model is wrong,
then we need to understand why it is wrong and how we can fix it.  Maybe
the ovn-sbctl processes are retrying blindly without waiting for an
update from the server; if so, that's a bug and it should be fixed.
Han Zhou March 25, 2019, 11:24 p.m. UTC | #2
On Mon, Mar 25, 2019 at 2:00 PM Ben Pfaff <blp@ovn.org> wrote:
>
> [...]

Hi Ben, to make the test case more effective, in v4/v5 I enlarged the
size of each transaction 50 times, controlled by the variable "n3".
This is just to create load in the test and slow down transaction
execution, to ensure there are enough parallel requests in flight and
that we can test server membership changes during this period. Without
this change (in the test case only), the CPU load was low, but that
made the torture test not torturing at all (either skipped because
execution was too fast, or, with a sleep added, it lost the
parallelism). The test case change also proved to be more effective -
it found an existing bug, as mentioned in the notes.

If your question is why increasing the size of a transaction causes
high CPU load, it seems expected given what each client does:
- download the initial data, which grows as more and more transactions
execute, each adding 50 entries (10 * 5 * 50 = 2500); later clients
take more CPU because of this.
- send a transaction with wait + update operations, which is double
the size of the original data, plus the newly added 50 entries.
- occasionally hit a conflicting update (the wait operation fails),
which triggers a retry. This doesn't happen often (judging from the
logs), and it is expected given the purpose of the test - to increase
parallelism.
- process updates, at least once for its own update, but possibly
others' updates too. (This step is not CPU-costly since each update
notification is not big)

So I think increasing the size of a transaction causing higher CPU
load is expected due to these operations, considering the JSON-RPC
cost. I ran the tests many times and I didn't observe any client
occupying a CPU for a long time - there were short spikes (not 100%)
shown by "top", but I never saw the same client there twice. If you
see any behavior that looks like continuous retrying, could you please
share the x-y.log file of that client?

I agree the RPC performance of ovsdb/IDL could be improved, but that
is not related to this patch (or maybe we can say the test in this
patch exploits this slowness).

Thanks,
Han
Han Zhou March 27, 2019, 11:10 p.m. UTC | #3
On Mon, Mar 25, 2019 at 4:24 PM Han Zhou <zhouhan@gmail.com> wrote:
> [...]

Hi Ben, does my explanation make sense? To show that the CPU cost is
expected (at least not caused by this patch), I compared against the
master branch using leader_only = true for the ovn-sbctl command in
the torture test, and ran the same updated torture test case (i.e. the
same numbers: n1 = 10, n2 = 5, n3 = 50). When leader_only = true,
there is no latency in transaction execution even without this patch,
because there is no idle waiting for syncing from the leader to the
followers, so all transactions execute at full speed and the
performance is comparable.

time make check TESTSUITEFLAGS="2529"
(2529: OVSDB 3-server torture test - kill/restart leader)

- *Master branch* with new test case, and use leader_only=true:
real    0m15.448s
user    0m15.384s
sys    0m13.687s

- With this patch (leader_only=false):
real    0m13.904s
user    0m15.578s
sys    0m11.817s

- With this patch but leader_only=true (exactly same test case as for
*Master branch*):
real    0m14.419s
user    0m14.972s
sys    0m12.467s

As we can see, there is no obvious difference. During the tests I also
watched the "top" output and I didn't notice any difference either.
The "leader_only" update on top of this patch is simply:
------8><------------------------------------------><8-----------
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index 5550a19..94ec9c7 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -140,7 +140,7 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
              for k in $(seq $n3); do
                  txn="$txn -- add SB_Global . external_ids $i-$j-$k=$i-$j-$k"
              done
-             run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only $txn
+             run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 $txn
              status=$?
              if test $status != 0; then
                  echo "$i-$j exited with status $status" > $i-$j:$status
--------------------------------------------------------------------------
Please try it and let me know if you still have concerns.

Thanks,
Han
Han Zhou April 8, 2019, 11:08 p.m. UTC | #4
On Wed, Mar 27, 2019 at 4:10 PM Han Zhou <zhouhan@gmail.com> wrote:
> [...]

Hi Ben, this one is small but important; I am just waiting for your
final approval. Sorry for pushing this, but I am also working on UTs
for different failure scenarios of Raft, and this patch helps with
some of those test cases, so I hope it can be merged before I send out
the UT series.

Thanks,
Han
Han Zhou April 10, 2019, 1:48 a.m. UTC | #5
On Mon, Apr 8, 2019 at 4:08 PM Han Zhou <zhouhan@gmail.com> wrote:
> [...]

I embedded this one as the first patch in a new series:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=101873
The new series is mainly for raft UTs and one bug fix, which is not
closely related to, but depends on, this patch.

Patch

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index eee4f33..31e9e72 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -320,7 +320,8 @@  static void raft_send_append_request(struct raft *,
 
 static void raft_become_leader(struct raft *);
 static void raft_become_follower(struct raft *);
-static void raft_reset_timer(struct raft *);
+static void raft_reset_election_timer(struct raft *);
+static void raft_reset_ping_timer(struct raft *);
 static void raft_send_heartbeats(struct raft *);
 static void raft_start_election(struct raft *, bool leadership_transfer);
 static bool raft_truncate(struct raft *, uint64_t new_end);
@@ -376,8 +377,8 @@  raft_alloc(void)
     hmap_init(&raft->add_servers);
     hmap_init(&raft->commands);
 
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
-    raft_reset_timer(raft);
+    raft_reset_ping_timer(raft);
+    raft_reset_election_timer(raft);
 
     return raft;
 }
@@ -865,7 +866,7 @@  raft_read_log(struct raft *raft)
 }
 
 static void
-raft_reset_timer(struct raft *raft)
+raft_reset_election_timer(struct raft *raft)
 {
     unsigned int duration = (ELECTION_BASE_MSEC
                              + random_range(ELECTION_RANGE_MSEC));
@@ -874,6 +875,12 @@  raft_reset_timer(struct raft *raft)
 }
 
 static void
+raft_reset_ping_timer(struct raft *raft)
+{
+    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
+}
+
+static void
 raft_add_conn(struct raft *raft, struct jsonrpc_session *js,
               const struct uuid *sid, bool incoming)
 {
@@ -1603,7 +1610,7 @@  raft_start_election(struct raft *raft, bool leadership_transfer)
             VLOG_INFO("term %"PRIu64": starting election", raft->term);
         }
     }
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     struct raft_server *peer;
     HMAP_FOR_EACH (peer, hmap_node, &raft->servers) {
@@ -1782,8 +1789,8 @@  raft_run(struct raft *raft)
                     raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
                 }
             }
+            raft_reset_ping_timer(raft);
         }
-        raft->ping_timeout = time_msec() + PING_TIME_MSEC;
     }
 
     /* Do this only at the end; if we did it as soon as we set raft->left or
@@ -1963,6 +1970,7 @@  raft_command_initiate(struct raft *raft,
             s->next_index++;
         }
     }
+    raft_reset_ping_timer(raft);
 
     return cmd;
 }
@@ -2313,7 +2321,7 @@  raft_become_follower(struct raft *raft)
     }
 
     raft->role = RAFT_FOLLOWER;
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /* Notify clients about lost leadership.
      *
@@ -2387,6 +2395,8 @@  raft_send_heartbeats(struct raft *raft)
                                             RAFT_CMD_INCOMPLETE, 0);
         }
     }
+
+    raft_reset_ping_timer(raft);
 }
 
 /* Initializes the fields in 's' that represent the leader's view of the
@@ -2413,7 +2423,7 @@  raft_become_leader(struct raft *raft)
     raft->role = RAFT_LEADER;
     raft->leader_sid = raft->sid;
     raft->election_timeout = LLONG_MAX;
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
+    raft_reset_ping_timer(raft);
 
     struct raft_server *s;
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
@@ -2573,11 +2583,13 @@  raft_get_next_entry(struct raft *raft, struct uuid *eid)
     return data;
 }
 
-static void
+/* Updates the commit index in the raft log.  If the commit index is already
+ * up-to-date, does nothing and returns false; otherwise, returns true. */
+static bool
 raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
 {
     if (new_commit_index <= raft->commit_index) {
-        return;
+        return false;
     }
 
     if (raft->role == RAFT_LEADER) {
@@ -2610,6 +2622,7 @@  raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
         .commit_index = raft->commit_index,
     };
     ignore(ovsdb_log_write_and_free(raft->log, raft_record_to_json(&r)));
+    return true;
 }
 
 /* This doesn't use rq->entries (but it does use rq->n_entries). */
@@ -2809,7 +2822,7 @@  raft_handle_append_request(struct raft *raft,
                                "usurped leadership");
         return;
     }
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /* First check for the common case, where the AppendEntries request is
      * entirely for indexes covered by 'log_start' ... 'log_end - 1', something
@@ -3045,7 +3058,9 @@  raft_consider_updating_commit_index(struct raft *raft)
             }
         }
     }
-    raft_update_commit_index(raft, new_commit_index);
+    if (raft_update_commit_index(raft, new_commit_index)) {
+        raft_send_heartbeats(raft);
+    }
 }
 
 static void
@@ -3274,7 +3289,7 @@  raft_handle_vote_request__(struct raft *raft,
         return false;
     }
 
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     return true;
 }
@@ -3697,7 +3712,7 @@  static bool
 raft_handle_install_snapshot_request__(
     struct raft *raft, const struct raft_install_snapshot_request *rq)
 {
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /*
      * Our behavior here depend on new_log_start in the snapshot compared to
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index c7f1e34..5550a19 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -131,12 +131,16 @@  ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
     done
     export OVN_SB_DB
 
-    n1=10 n2=5
+    n1=10 n2=5 n3=50
     echo "starting $n1*$n2 ovn-sbctl processes..."
     for i in $(seq 0 $(expr $n1 - 1) ); do
         (for j in $(seq $n2); do
              : > $i-$j.running
-             run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only add SB_Global . external_ids $i-$j=$i-$j
+             txn="add SB_Global . external_ids $i-$j=$i-$j"
+             for k in $(seq $n3); do
+                 txn="$txn -- add SB_Global . external_ids $i-$j-$k=$i-$j-$k"
+             done
+             run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only $txn
              status=$?
              if test $status != 0; then
                  echo "$i-$j exited with status $status" > $i-$j:$status
@@ -146,13 +150,12 @@  ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
          : > $i.done)&
     done
     echo "...done"
-    sleep 2
 
     echo "waiting for ovn-sbctl processes to exit..."
     # Use file instead of var because code inside "while" runs in a subshell.
     echo 0 > phase
     i=0
-    (while :; do echo; sleep 1; done) | while read REPLY; do
+    (while :; do echo; sleep 0.1; done) | while read REPLY; do
         printf "t=%2d s:" $i
         done=0
         for j in $(seq 0 $(expr $n1 - 1)); do
@@ -168,7 +171,7 @@  ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
 
         case $(cat phase) in # (
         0)
-            if test $done -ge $(expr $n1 / 4); then
+            if test $done -ge $(expr $n1 / 10); then
                 if test $variant = kill; then
                     stop_server $victim
                 else
@@ -199,6 +202,9 @@  ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
     for i in $(seq 0 $(expr $n1 - 1) ); do
         for j in `seq $n2`; do
             echo "$i-$j=$i-$j"
+            for k in `seq $n3`; do
+                echo "$i-$j-$k=$i-$j-$k"
+            done
         done
     done | sort > expout
     AT_CHECK([ovn-sbctl --timeout=30 --log-file=finalize.log -vtimeval:off -vfile -vsyslog:off --bare get SB_Global . external-ids | tr ',' '\n' | sed 's/[[{}"" ]]//g' | sort], [0], [expout])