diff mbox series

[ovs-dev,v2] ovsdb raft: Sync commit index to followers without delay.

Message ID 1552691855-114951-1-git-send-email-hzhou8@ebay.com
State Superseded
Headers show
Series [ovs-dev,v2] ovsdb raft: Sync commit index to followers without delay. | expand

Commit Message

Han Zhou March 15, 2019, 11:17 p.m. UTC
From: Han Zhou <hzhou8@ebay.com>

When update is requested from follower, the leader sends AppendRequest
to all followers and wait until AppendReply received from majority, and
then it will update commit index - the new entry is regarded as committed
in raft log. However, this commit will not be notified to followers
(including the one initiated the request) until next heartbeat (ping
timeout), if no other pending requests. This results in long latency
for updates made through followers, especially when a batch of updates
are requested through the same follower.

$ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done

real    0m34.154s
user    0m0.083s
sys 0m0.250s

This patch solves the problem by sending heartbeat as soon as the commit
index is updated in leader. It also avoids unnessary heartbeat by resetting
the ping timer whenever AppendRequest is broadcasted. With this patch
the performance is improved more than 50 times in same test:

$ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done

real    0m0.564s
user    0m0.080s
sys 0m0.199s

Some sleep is added in torture test cases because of the improved
performance, otherwise the tests will all be skipped.

Signed-off-by: Han Zhou <hzhou8@ebay.com>
---

Notes:
    v1->v2: adjust torture test case so that it passes without overload CPU.

 ovsdb/raft.c           | 43 +++++++++++++++++++++++++++++--------------
 tests/ovsdb-cluster.at |  3 ++-
 2 files changed, 31 insertions(+), 15 deletions(-)

Comments

Ben Pfaff March 18, 2019, 9:49 p.m. UTC | #1
On Fri, Mar 15, 2019 at 04:17:35PM -0700, Han Zhou wrote:
> From: Han Zhou <hzhou8@ebay.com>
> 
> When update is requested from follower, the leader sends AppendRequest
> to all followers and wait until AppendReply received from majority, and
> then it will update commit index - the new entry is regarded as committed
> in raft log. However, this commit will not be notified to followers
> (including the one initiated the request) until next heartbeat (ping
> timeout), if no other pending requests. This results in long latency
> for updates made through followers, especially when a batch of updates
> are requested through the same follower.
> 
> $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> 
> real    0m34.154s
> user    0m0.083s
> sys 0m0.250s
> 
> This patch solves the problem by sending heartbeat as soon as the commit
> index is updated in leader. It also avoids unnessary heartbeat by resetting
> the ping timer whenever AppendRequest is broadcasted. With this patch
> the performance is improved more than 50 times in same test:
> 
> $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> 
> real    0m0.564s
> user    0m0.080s
> sys 0m0.199s
> 
> Some sleep is added in torture test cases because of the improved
> performance, otherwise the tests will all be skipped.
> 
> Signed-off-by: Han Zhou <hzhou8@ebay.com>
> ---
> 
> Notes:
>     v1->v2: adjust torture test case so that it passes without overload CPU.

With this patch, on my laptop, running test 2525 seems to always skip
it, with results similar to the following:

    ## ------------------------------- ##
    ## openvswitch 2.11.90 test suite. ##
    ## ------------------------------- ##
    2525: OVSDB 3-server torture test - kill/restart leader skipped (ovsdb-cluster.at:198)

    ## ------------- ##
    ## Test results. ##
    ## ------------- ##

    0 tests were successful.
    1 test was skipped.
    make[3]: Leaving directory '/home/blp/nicira/ovs/_build'
    make[2]: Leaving directory '/home/blp/nicira/ovs/_build'
    make[1]: Leaving directory '/home/blp/nicira/ovs/_build'

    real    0m9.194s
    user    0m3.693s
    sys     0m1.658s
    blp@sigill:~/nicira/ovs/_build(0)$
Han Zhou March 19, 2019, 12:02 a.m. UTC | #2
On Mon, Mar 18, 2019 at 2:49 PM Ben Pfaff <blp@ovn.org> wrote:
>
> On Fri, Mar 15, 2019 at 04:17:35PM -0700, Han Zhou wrote:
> > From: Han Zhou <hzhou8@ebay.com>
> >
> > When update is requested from follower, the leader sends AppendRequest
> > to all followers and wait until AppendReply received from majority, and
> > then it will update commit index - the new entry is regarded as committed
> > in raft log. However, this commit will not be notified to followers
> > (including the one initiated the request) until next heartbeat (ping
> > timeout), if no other pending requests. This results in long latency
> > for updates made through followers, especially when a batch of updates
> > are requested through the same follower.
> >
> > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> >
> > real    0m34.154s
> > user    0m0.083s
> > sys 0m0.250s
> >
> > This patch solves the problem by sending heartbeat as soon as the commit
> > index is updated in leader. It also avoids unnessary heartbeat by resetting
> > the ping timer whenever AppendRequest is broadcasted. With this patch
> > the performance is improved more than 50 times in same test:
> >
> > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> >
> > real    0m0.564s
> > user    0m0.080s
> > sys 0m0.199s
> >
> > Some sleep is added in torture test cases because of the improved
> > performance, otherwise the tests will all be skipped.
> >
> > Signed-off-by: Han Zhou <hzhou8@ebay.com>
> > ---
> >
> > Notes:
> >     v1->v2: adjust torture test case so that it passes without overload CPU.
>
> With this patch, on my laptop, running test 2525 seems to always skip
> it, with results similar to the following:
>
>     ## ------------------------------- ##
>     ## openvswitch 2.11.90 test suite. ##
>     ## ------------------------------- ##
>     2525: OVSDB 3-server torture test - kill/restart leader skipped (ovsdb-cluster.at:198)
>
>     ## ------------- ##
>     ## Test results. ##
>     ## ------------- ##
>
>     0 tests were successful.
>     1 test was skipped.
>     make[3]: Leaving directory '/home/blp/nicira/ovs/_build'
>     make[2]: Leaving directory '/home/blp/nicira/ovs/_build'
>     make[1]: Leaving directory '/home/blp/nicira/ovs/_build'
>
>     real    0m9.194s
>     user    0m3.693s
>     sys     0m1.658s
>     blp@sigill:~/nicira/ovs/_build(0)$

Sorry to hear :(. It was pretty stable on my laptop - maybe my laptop
is slower than yours :). I just sent V3 to make the test case more
stable. I reduced the interval of the checking loop so that it can
detect phase changes and trigger the operations asap. I ran all
torture tests with -j1, -j5 and -j10. All cases passed without
skipping. I hope it is stable on your laptop, too. Could you try
again?

Thanks,
Han
Ben Pfaff March 20, 2019, 1:42 a.m. UTC | #3
On Mon, Mar 18, 2019 at 05:02:15PM -0700, Han Zhou wrote:
> On Mon, Mar 18, 2019 at 2:49 PM Ben Pfaff <blp@ovn.org> wrote:
> >
> > On Fri, Mar 15, 2019 at 04:17:35PM -0700, Han Zhou wrote:
> > > From: Han Zhou <hzhou8@ebay.com>
> > >
> > > When update is requested from follower, the leader sends AppendRequest
> > > to all followers and wait until AppendReply received from majority, and
> > > then it will update commit index - the new entry is regarded as committed
> > > in raft log. However, this commit will not be notified to followers
> > > (including the one initiated the request) until next heartbeat (ping
> > > timeout), if no other pending requests. This results in long latency
> > > for updates made through followers, especially when a batch of updates
> > > are requested through the same follower.
> > >
> > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > >
> > > real    0m34.154s
> > > user    0m0.083s
> > > sys 0m0.250s
> > >
> > > This patch solves the problem by sending heartbeat as soon as the commit
> > > index is updated in leader. It also avoids unnessary heartbeat by resetting
> > > the ping timer whenever AppendRequest is broadcasted. With this patch
> > > the performance is improved more than 50 times in same test:
> > >
> > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > >
> > > real    0m0.564s
> > > user    0m0.080s
> > > sys 0m0.199s
> > >
> > > Some sleep is added in torture test cases because of the improved
> > > performance, otherwise the tests will all be skipped.
> > >
> > > Signed-off-by: Han Zhou <hzhou8@ebay.com>
> > > ---
> > >
> > > Notes:
> > >     v1->v2: adjust torture test case so that it passes without overload CPU.
> >
> > With this patch, on my laptop, running test 2525 seems to always skip
> > it, with results similar to the following:
> >
> >     ## ------------------------------- ##
> >     ## openvswitch 2.11.90 test suite. ##
> >     ## ------------------------------- ##
> >     2525: OVSDB 3-server torture test - kill/restart leader skipped (ovsdb-cluster.at:198)
> >
> >     ## ------------- ##
> >     ## Test results. ##
> >     ## ------------- ##
> >
> >     0 tests were successful.
> >     1 test was skipped.
> >     make[3]: Leaving directory '/home/blp/nicira/ovs/_build'
> >     make[2]: Leaving directory '/home/blp/nicira/ovs/_build'
> >     make[1]: Leaving directory '/home/blp/nicira/ovs/_build'
> >
> >     real    0m9.194s
> >     user    0m3.693s
> >     sys     0m1.658s
> >     blp@sigill:~/nicira/ovs/_build(0)$
> 
> Sorry to hear :(. It was pretty stable on my laptop - maybe my laptop
> is slower than yours :). I just sent V3 to make the test case more
> stable. I reduced the interval of the checking loop so that it can
> detect phase changes and trigger the operations asap. I ran all
> torture tests with -j1, -j5 and -j10. All cases passed without
> skipping. I hope it is stable on your laptop, too. Could you try
> again?

I do tend to buy nice laptops, current one is i7-8565U.

This version (v3) does not skip the test and does not use excessive CPU.
Splendid.

However, I am concerned that it makes the test a lot easier.
My design goal in this test was to try to provoke tons of races by
throwing many transactions at the server at once.  That is why it
invoked all of the ovn-sbctl calls without any "sleep"s.  By adding
sleeps, I think that the test becomes easier: aren't you basically
serializing all of the transactions?
Han Zhou March 20, 2019, 6:42 a.m. UTC | #4
On Tue, Mar 19, 2019 at 6:42 PM Ben Pfaff <blp@ovn.org> wrote:
>
> On Mon, Mar 18, 2019 at 05:02:15PM -0700, Han Zhou wrote:
> > On Mon, Mar 18, 2019 at 2:49 PM Ben Pfaff <blp@ovn.org> wrote:
> > >
> > > On Fri, Mar 15, 2019 at 04:17:35PM -0700, Han Zhou wrote:
> > > > From: Han Zhou <hzhou8@ebay.com>
> > > >
> > > > When update is requested from follower, the leader sends AppendRequest
> > > > to all followers and wait until AppendReply received from majority, and
> > > > then it will update commit index - the new entry is regarded as committed
> > > > in raft log. However, this commit will not be notified to followers
> > > > (including the one initiated the request) until next heartbeat (ping
> > > > timeout), if no other pending requests. This results in long latency
> > > > for updates made through followers, especially when a batch of updates
> > > > are requested through the same follower.
> > > >
> > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > >
> > > > real    0m34.154s
> > > > user    0m0.083s
> > > > sys 0m0.250s
> > > >
> > > > This patch solves the problem by sending heartbeat as soon as the commit
> > > > index is updated in leader. It also avoids unnessary heartbeat by resetting
> > > > the ping timer whenever AppendRequest is broadcasted. With this patch
> > > > the performance is improved more than 50 times in same test:
> > > >
> > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > >
> > > > real    0m0.564s
> > > > user    0m0.080s
> > > > sys 0m0.199s
> > > >
> > > > Some sleep is added in torture test cases because of the improved
> > > > performance, otherwise the tests will all be skipped.
> > > >
> > > > Signed-off-by: Han Zhou <hzhou8@ebay.com>
> > > > ---
> > > >
> > > > Notes:
> > > >     v1->v2: adjust torture test case so that it passes without overload CPU.
> > >
> > > With this patch, on my laptop, running test 2525 seems to always skip
> > > it, with results similar to the following:
> > >
> > >     ## ------------------------------- ##
> > >     ## openvswitch 2.11.90 test suite. ##
> > >     ## ------------------------------- ##
> > >     2525: OVSDB 3-server torture test - kill/restart leader skipped (ovsdb-cluster.at:198)
> > >
> > >     ## ------------- ##
> > >     ## Test results. ##
> > >     ## ------------- ##
> > >
> > >     0 tests were successful.
> > >     1 test was skipped.
> > >     make[3]: Leaving directory '/home/blp/nicira/ovs/_build'
> > >     make[2]: Leaving directory '/home/blp/nicira/ovs/_build'
> > >     make[1]: Leaving directory '/home/blp/nicira/ovs/_build'
> > >
> > >     real    0m9.194s
> > >     user    0m3.693s
> > >     sys     0m1.658s
> > >     blp@sigill:~/nicira/ovs/_build(0)$
> >
> > Sorry to hear :(. It was pretty stable on my laptop - maybe my laptop
> > is slower than yours :). I just sent V3 to make the test case more
> > stable. I reduced the interval of the checking loop so that it can
> > detect phase changes and trigger the operations asap. I ran all
> > torture tests with -j1, -j5 and -j10. All cases passed without
> > skipping. I hope it is stable on your laptop, too. Could you try
> > again?
>
> I do tend to buy nice laptops, current one is i7-8565U.

Mine is i7-7920HQ, and I am running in a VM ...

>
> This version (v3) does not skip the test and does not use excessive CPU.
> Splendid.
>
> However, I am concerned that it makes the test a lot easier.
> My design goal in this test was to try to provoke tons of races by
> throwing many transactions at the server at once.  That is why it
> invoked all of the ovn-sbctl calls without any "sleep"s.  By adding
> sleeps, I think that the test becomes easier: aren't you basically
> serializing all of the transactions?

Yes the degree of parallelism is reduced with sleep. The V1 was trying
to keep the parallelism simply by adding more clients, but it causes
high CPU.
I just sent V4, which keeps the original parallelism without
increasing much CPU cost by increasing the size of each transaction. I
tried several times and it never skips tests. In fact I see it more
effective than before, because the test fails more frequently than
before. The failure is not caused by this patch, but a bug before this
change, since I have noticed it before (and was planning to debug it
but didn't got much time yet)

The error is: db_ctl_base|ERR|transaction error: {"details":"transact
request specifies unknown database OVN_Southbound","error":"unknown
database"
It is triggered when a server is disconnected from the cluster, but
still communicates with clients. In fact, there are at least two
problems:
1. When client retrying connection, it didn't pick another server, but
connected to same server
2. I fixed 1) by calling jsonrpc_session_pick_remote() in
jsonrpc_session_force_reconnect(), but the client still fails.
I will need more time to debug, and submit separate patches since this
is not directly related to current patch. Any hints are welcome.

Thanks,
Han
Ben Pfaff March 20, 2019, 6:30 p.m. UTC | #5
On Tue, Mar 19, 2019 at 11:42:30PM -0700, Han Zhou wrote:
> On Tue, Mar 19, 2019 at 6:42 PM Ben Pfaff <blp@ovn.org> wrote:
> >
> > On Mon, Mar 18, 2019 at 05:02:15PM -0700, Han Zhou wrote:
> > > On Mon, Mar 18, 2019 at 2:49 PM Ben Pfaff <blp@ovn.org> wrote:
> > > >
> > > > On Fri, Mar 15, 2019 at 04:17:35PM -0700, Han Zhou wrote:
> > > > > From: Han Zhou <hzhou8@ebay.com>
> > > > >
> > > > > When update is requested from follower, the leader sends AppendRequest
> > > > > to all followers and wait until AppendReply received from majority, and
> > > > > then it will update commit index - the new entry is regarded as committed
> > > > > in raft log. However, this commit will not be notified to followers
> > > > > (including the one initiated the request) until next heartbeat (ping
> > > > > timeout), if no other pending requests. This results in long latency
> > > > > for updates made through followers, especially when a batch of updates
> > > > > are requested through the same follower.
> > > > >
> > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > >
> > > > > real    0m34.154s
> > > > > user    0m0.083s
> > > > > sys 0m0.250s
> > > > >
> > > > > This patch solves the problem by sending heartbeat as soon as the commit
> > > > > index is updated in leader. It also avoids unnessary heartbeat by resetting
> > > > > the ping timer whenever AppendRequest is broadcasted. With this patch
> > > > > the performance is improved more than 50 times in same test:
> > > > >
> > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > >
> > > > > real    0m0.564s
> > > > > user    0m0.080s
> > > > > sys 0m0.199s
> > > > >
> > > > > Some sleep is added in torture test cases because of the improved
> > > > > performance, otherwise the tests will all be skipped.
> > > > >
> > > > > Signed-off-by: Han Zhou <hzhou8@ebay.com>
> > > > > ---
> > > > >
> > > > > Notes:
> > > > >     v1->v2: adjust torture test case so that it passes without overload CPU.
> > > >
> > > > With this patch, on my laptop, running test 2525 seems to always skip
> > > > it, with results similar to the following:
> > > >
> > > >     ## ------------------------------- ##
> > > >     ## openvswitch 2.11.90 test suite. ##
> > > >     ## ------------------------------- ##
> > > >     2525: OVSDB 3-server torture test - kill/restart leader skipped (ovsdb-cluster.at:198)
> > > >
> > > >     ## ------------- ##
> > > >     ## Test results. ##
> > > >     ## ------------- ##
> > > >
> > > >     0 tests were successful.
> > > >     1 test was skipped.
> > > >     make[3]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > >     make[2]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > >     make[1]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > >
> > > >     real    0m9.194s
> > > >     user    0m3.693s
> > > >     sys     0m1.658s
> > > >     blp@sigill:~/nicira/ovs/_build(0)$
> > >
> > > Sorry to hear :(. It was pretty stable on my laptop - maybe my laptop
> > > is slower than yours :). I just sent V3 to make the test case more
> > > stable. I reduced the interval of the checking loop so that it can
> > > detect phase changes and trigger the operations asap. I ran all
> > > torture tests with -j1, -j5 and -j10. All cases passed without
> > > skipping. I hope it is stable on your laptop, too. Could you try
> > > again?
> >
> > I do tend to buy nice laptops, current one is i7-8565U.
> 
> Mine is i7-7920HQ, and I am running in a VM ...
> 
> >
> > This version (v3) does not skip the test and does not use excessive CPU.
> > Splendid.
> >
> > However, I am concerned that it makes the test a lot easier.
> > My design goal in this test was to try to provoke tons of races by
> > throwing many transactions at the server at once.  That is why it
> > invoked all of the ovn-sbctl calls without any "sleep"s.  By adding
> > sleeps, I think that the test becomes easier: aren't you basically
> > serializing all of the transactions?
> 
> Yes the degree of parallelism is reduced with sleep. The V1 was trying
> to keep the parallelism simply by adding more clients, but it causes
> high CPU.

I assumed that the high CPU must be caused by some kind of busy-looping
bug.  Are you sure that it's really just due to a lot of clients?

> The error is: db_ctl_base|ERR|transaction error: {"details":"transact
> request specifies unknown database OVN_Southbound","error":"unknown
> database"
> It is triggered when a server is disconnected from the cluster, but
> still communicates with clients. In fact, there are at least two
> problems:
> 1. When client retrying connection, it didn't pick another server, but
> connected to same server
> 2. I fixed 1) by calling jsonrpc_session_pick_remote() in
> jsonrpc_session_force_reconnect(), but the client still fails.
> I will need more time to debug, and submit separate patches since this
> is not directly related to current patch. Any hints are welcome.

I'm glad to hear about more bug fixes.
Han Zhou March 20, 2019, 6:38 p.m. UTC | #6
On Wed, Mar 20, 2019 at 11:30 AM Ben Pfaff <blp@ovn.org> wrote:
>
> On Tue, Mar 19, 2019 at 11:42:30PM -0700, Han Zhou wrote:
> > On Tue, Mar 19, 2019 at 6:42 PM Ben Pfaff <blp@ovn.org> wrote:
> > >
> > > On Mon, Mar 18, 2019 at 05:02:15PM -0700, Han Zhou wrote:
> > > > On Mon, Mar 18, 2019 at 2:49 PM Ben Pfaff <blp@ovn.org> wrote:
> > > > >
> > > > > On Fri, Mar 15, 2019 at 04:17:35PM -0700, Han Zhou wrote:
> > > > > > From: Han Zhou <hzhou8@ebay.com>
> > > > > >
> > > > > > When update is requested from follower, the leader sends AppendRequest
> > > > > > to all followers and wait until AppendReply received from majority, and
> > > > > > then it will update commit index - the new entry is regarded as committed
> > > > > > in raft log. However, this commit will not be notified to followers
> > > > > > (including the one initiated the request) until next heartbeat (ping
> > > > > > timeout), if no other pending requests. This results in long latency
> > > > > > for updates made through followers, especially when a batch of updates
> > > > > > are requested through the same follower.
> > > > > >
> > > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > > >
> > > > > > real    0m34.154s
> > > > > > user    0m0.083s
> > > > > > sys 0m0.250s
> > > > > >
> > > > > > This patch solves the problem by sending heartbeat as soon as the commit
> > > > > > index is updated in leader. It also avoids unnessary heartbeat by resetting
> > > > > > the ping timer whenever AppendRequest is broadcasted. With this patch
> > > > > > the performance is improved more than 50 times in same test:
> > > > > >
> > > > > > $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
> > > > > >
> > > > > > real    0m0.564s
> > > > > > user    0m0.080s
> > > > > > sys 0m0.199s
> > > > > >
> > > > > > Some sleep is added in torture test cases because of the improved
> > > > > > performance, otherwise the tests will all be skipped.
> > > > > >
> > > > > > Signed-off-by: Han Zhou <hzhou8@ebay.com>
> > > > > > ---
> > > > > >
> > > > > > Notes:
> > > > > >     v1->v2: adjust torture test case so that it passes without overload CPU.
> > > > >
> > > > > With this patch, on my laptop, running test 2525 seems to always skip
> > > > > it, with results similar to the following:
> > > > >
> > > > >     ## ------------------------------- ##
> > > > >     ## openvswitch 2.11.90 test suite. ##
> > > > >     ## ------------------------------- ##
> > > > >     2525: OVSDB 3-server torture test - kill/restart leader skipped (ovsdb-cluster.at:198)
> > > > >
> > > > >     ## ------------- ##
> > > > >     ## Test results. ##
> > > > >     ## ------------- ##
> > > > >
> > > > >     0 tests were successful.
> > > > >     1 test was skipped.
> > > > >     make[3]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > > >     make[2]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > > >     make[1]: Leaving directory '/home/blp/nicira/ovs/_build'
> > > > >
> > > > >     real    0m9.194s
> > > > >     user    0m3.693s
> > > > >     sys     0m1.658s
> > > > >     blp@sigill:~/nicira/ovs/_build(0)$
> > > >
> > > > Sorry to hear :(. It was pretty stable on my laptop - maybe my laptop
> > > > is slower than yours :). I just sent V3 to make the test case more
> > > > stable. I reduced the interval of the checking loop so that it can
> > > > detect phase changes and trigger the operations asap. I ran all
> > > > torture tests with -j1, -j5 and -j10. All cases passed without
> > > > skipping. I hope it is stable on your laptop, too. Could you try
> > > > again?
> > >
> > > I do tend to buy nice laptops, current one is i7-8565U.
> >
> > Mine is i7-7920HQ, and I am running in a VM ...
> >
> > >
> > > This version (v3) does not skip the test and does not use excessive CPU.
> > > Splendid.
> > >
> > > However, I am concerned that it makes the test a lot easier.
> > > My design goal in this test was to try to provoke tons of races by
> > > throwing many transactions at the server at once.  That is why it
> > > invoked all of the ovn-sbctl calls without any "sleep"s.  By adding
> > > sleeps, I think that the test becomes easier: aren't you basically
> > > serializing all of the transactions?
> >
> > Yes the degree of parallelism is reduced with sleep. The V1 was trying
> > to keep the parallelism simply by adding more clients, but it causes
> > high CPU.
>
> I assumed that the high CPU must be caused by some kind of busy-looping
> bug.  Are you sure that it's really just due to a lot of clients?

Yes, I believe it was due to a lot of clients in test. I didn't see
anything potentially triggering busy-looping in this change. The CPU
spike disappears after reverting the number of clients to the old
value. (also, in scale test with this change the overall CPU was as
low as in standalone mode)

>
> > The error is: db_ctl_base|ERR|transaction error: {"details":"transact
> > request specifies unknown database OVN_Southbound","error":"unknown
> > database"
> > It is triggered when a server is disconnected from the cluster, but
> > still communicates with clients. In fact, there are at least two
> > problems:
> > 1. When client retrying connection, it didn't pick another server, but
> > connected to same server
> > 2. I fixed 1) by calling jsonrpc_session_pick_remote() in
> > jsonrpc_session_force_reconnect(), but the client still fails.
> > I will need more time to debug, and submit separate patches since this
> > is not directly related to current patch. Any hints are welcome.
>
> I'm glad to hear about more bug fixes.
diff mbox series

Patch

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index eee4f33..31e9e72 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -320,7 +320,8 @@  static void raft_send_append_request(struct raft *,
 
 static void raft_become_leader(struct raft *);
 static void raft_become_follower(struct raft *);
-static void raft_reset_timer(struct raft *);
+static void raft_reset_election_timer(struct raft *);
+static void raft_reset_ping_timer(struct raft *);
 static void raft_send_heartbeats(struct raft *);
 static void raft_start_election(struct raft *, bool leadership_transfer);
 static bool raft_truncate(struct raft *, uint64_t new_end);
@@ -376,8 +377,8 @@  raft_alloc(void)
     hmap_init(&raft->add_servers);
     hmap_init(&raft->commands);
 
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
-    raft_reset_timer(raft);
+    raft_reset_ping_timer(raft);
+    raft_reset_election_timer(raft);
 
     return raft;
 }
@@ -865,7 +866,7 @@  raft_read_log(struct raft *raft)
 }
 
 static void
-raft_reset_timer(struct raft *raft)
+raft_reset_election_timer(struct raft *raft)
 {
     unsigned int duration = (ELECTION_BASE_MSEC
                              + random_range(ELECTION_RANGE_MSEC));
@@ -874,6 +875,12 @@  raft_reset_timer(struct raft *raft)
 }
 
 static void
+raft_reset_ping_timer(struct raft *raft)
+{
+    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
+}
+
+static void
 raft_add_conn(struct raft *raft, struct jsonrpc_session *js,
               const struct uuid *sid, bool incoming)
 {
@@ -1603,7 +1610,7 @@  raft_start_election(struct raft *raft, bool leadership_transfer)
             VLOG_INFO("term %"PRIu64": starting election", raft->term);
         }
     }
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     struct raft_server *peer;
     HMAP_FOR_EACH (peer, hmap_node, &raft->servers) {
@@ -1782,8 +1789,8 @@  raft_run(struct raft *raft)
                     raft_command_complete(raft, cmd, RAFT_CMD_TIMEOUT);
                 }
             }
+            raft_reset_ping_timer(raft);
         }
-        raft->ping_timeout = time_msec() + PING_TIME_MSEC;
     }
 
     /* Do this only at the end; if we did it as soon as we set raft->left or
@@ -1963,6 +1970,7 @@  raft_command_initiate(struct raft *raft,
             s->next_index++;
         }
     }
+    raft_reset_ping_timer(raft);
 
     return cmd;
 }
@@ -2313,7 +2321,7 @@  raft_become_follower(struct raft *raft)
     }
 
     raft->role = RAFT_FOLLOWER;
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /* Notify clients about lost leadership.
      *
@@ -2387,6 +2395,8 @@  raft_send_heartbeats(struct raft *raft)
                                             RAFT_CMD_INCOMPLETE, 0);
         }
     }
+
+    raft_reset_ping_timer(raft);
 }
 
 /* Initializes the fields in 's' that represent the leader's view of the
@@ -2413,7 +2423,7 @@  raft_become_leader(struct raft *raft)
     raft->role = RAFT_LEADER;
     raft->leader_sid = raft->sid;
     raft->election_timeout = LLONG_MAX;
-    raft->ping_timeout = time_msec() + PING_TIME_MSEC;
+    raft_reset_ping_timer(raft);
 
     struct raft_server *s;
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
@@ -2573,11 +2583,13 @@  raft_get_next_entry(struct raft *raft, struct uuid *eid)
     return data;
 }
 
-static void
+/* Updates commit index in raft log. If commit index is already up-to-date
+ * it does nothing and return false, otherwise, returns true. */
+static bool
 raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
 {
     if (new_commit_index <= raft->commit_index) {
-        return;
+        return false;
     }
 
     if (raft->role == RAFT_LEADER) {
@@ -2610,6 +2622,7 @@  raft_update_commit_index(struct raft *raft, uint64_t new_commit_index)
         .commit_index = raft->commit_index,
     };
     ignore(ovsdb_log_write_and_free(raft->log, raft_record_to_json(&r)));
+    return true;
 }
 
 /* This doesn't use rq->entries (but it does use rq->n_entries). */
@@ -2809,7 +2822,7 @@  raft_handle_append_request(struct raft *raft,
                                "usurped leadership");
         return;
     }
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /* First check for the common case, where the AppendEntries request is
      * entirely for indexes covered by 'log_start' ... 'log_end - 1', something
@@ -3045,7 +3058,9 @@  raft_consider_updating_commit_index(struct raft *raft)
             }
         }
     }
-    raft_update_commit_index(raft, new_commit_index);
+    if (raft_update_commit_index(raft, new_commit_index)) {
+        raft_send_heartbeats(raft);
+    }
 }
 
 static void
@@ -3274,7 +3289,7 @@  raft_handle_vote_request__(struct raft *raft,
         return false;
     }
 
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     return true;
 }
@@ -3697,7 +3712,7 @@  static bool
 raft_handle_install_snapshot_request__(
     struct raft *raft, const struct raft_install_snapshot_request *rq)
 {
-    raft_reset_timer(raft);
+    raft_reset_election_timer(raft);
 
     /*
      * Our behavior here depend on new_log_start in the snapshot compared to
diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
index c7f1e34..889fb7f 100644
--- a/tests/ovsdb-cluster.at
+++ b/tests/ovsdb-cluster.at
@@ -142,11 +142,12 @@  ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
                  echo "$i-$j exited with status $status" > $i-$j:$status
              fi
              rm $i-$j.running
+             sleep 0.5
          done
          : > $i.done)&
+         sleep 0.3
     done
     echo "...done"
-    sleep 2
 
     echo "waiting for ovn-sbctl processes to exit..."
     # Use file instead of var because code inside "while" runs in a subshell.