[ovs-dev] ovn pacemaker: Provide the option to configure inactivity probe value

Message ID 20171011085233.9645-1-nusiddiq@redhat.com
State New

Commit Message

Numan Siddique Oct. 11, 2017, 8:52 a.m.
From: Numan Siddique <nusiddiq@redhat.com>

In the case of OVN HA deployments with openstack, it has been noticed
that the 5 seconds inactivity probe interval is not enough and ovsdb-servers
time out.
This patch
   - provides an option to configure this value.
   - creates a connection row in NB/SB dbs and sets the target and
     inactivity_probe values when the node is promoted to master.

CC: Andy Zhou <azhou@ovn.org>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
---
 ovn/utilities/ovndb-servers.ocf | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
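
A minimal sketch of the arguments the patch hands to ovn-sbctl when the node is promoted, written as a dry run so it can be read without an OVN installation. The address 192.0.2.10 and the helper name build_target are illustrative assumptions; the protocol, port and 60000 ms value mirror the defaults shown in the diff.

```shell
#!/bin/sh
# Dry run only: prints the ovn-sbctl arguments instead of executing them.

# Build a passive-listener target with the colons escaped as in the patch,
# e.g. ptcp\:6642\:192.0.2.10
build_target() {
    printf 'p%s\\:%s\\:%s\n' "$1" "$2" "$3"
}

SB_MASTER_PROTO=tcp      # default from the diff
SB_MASTER_PORT=6642      # default from the diff
MASTER_IP=192.0.2.10     # hypothetical address (RFC 5737 documentation range)
INACTIVE_PROBE=60000     # milliseconds, i.e. 60 s

target=$(build_target "$SB_MASTER_PROTO" "$SB_MASTER_PORT" "$MASTER_IP")

# Print the command line that the promoted node would run.
printf '%s\n' "ovn-sbctl -- --id=@conn_uuid create Connection target=\"$target\" inactivity_probe=$INACTIVE_PROBE -- set SB_Global . connections=@conn_uuid"
```

The northbound case would differ only in the tool name, the NB_Global table and the NB port and protocol variables.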

Comments

Ben Pfaff Oct. 12, 2017, 5:49 p.m. | #1
Hi Andy.  In the IRC meeting today, Numan suggested that you might be an
appropriate reviewer for this patch, so if you agree and you have a
chance to look at this then it would be appreciated.

Thanks,

Ben.

On Wed, Oct 11, 2017 at 02:22:33PM +0530, nusiddiq@redhat.com wrote:
> From: Numan Siddique <nusiddiq@redhat.com>
> 
> In the case of OVN HA deployments with openstack, it has been noticed
> that the 5 seconds inactivity probe interval is not enough and ovsdb-servers
> time out.
> This patch
>    - provides an option to configure this value.
>    - creates a connection row in NB/SB dbs and sets the target and
>      inactivity_probe values when the node is promoted to master.
> 
> CC: Andy Zhou <azhou@ovn.org>
> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> ---
>  ovn/utilities/ovndb-servers.ocf | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
> index fe1207c22..92620af6a 100755
> --- a/ovn/utilities/ovndb-servers.ocf
> +++ b/ovn/utilities/ovndb-servers.ocf
> @@ -8,6 +8,8 @@
>  : ${SB_MASTER_PORT_DEFAULT="6642"}
>  : ${SB_MASTER_PROTO_DEFAULT="tcp"}
>  : ${MANAGE_NORTHD_DEFAULT="no"}
> +: ${INACTIVE_PROBE_DEFAULT="60000"}
> +
>  CRM_MASTER="${HA_SBIN_DIR}/crm_master -l reboot"
>  CRM_ATTR_REPL_INFO="${HA_SBIN_DIR}/crm_attribute --type crm_config --name OVN_REPL_INFO -s ovn_ovsdb_master_server"
>  OVN_CTL=${OCF_RESKEY_ovn_ctl:-${OVN_CTL_DEFAULT}}
> @@ -17,6 +19,7 @@ NB_MASTER_PROTO=${OCF_RESKEY_nb_master_protocol:-${NB_MASTER_PROTO_DEFAULT}}
>  SB_MASTER_PORT=${OCF_RESKEY_sb_master_port:-${SB_MASTER_PORT_DEFAULT}}
>  SB_MASTER_PROTO=${OCF_RESKEY_sb_master_protocol:-${SB_MASTER_PROTO_DEFAULT}}
>  MANAGE_NORTHD=${OCF_RESKEY_manage_northd:-${MANAGE_NORTHD_DEFAULT}}
> +INACTIVE_PROBE=${OCF_RESKEY_inactive_probe_interval:-${INACTIVE_PROBE_DEFAULT}}
>  
>  # Invalid IP address is an address that can never exist in the network, as
>  # mentioned in rfc-5737. The ovsdb servers connects to this IP address till
> @@ -101,6 +104,14 @@ ovsdb_server_metadata() {
>    <content type="string" />
>    </parameter>
>  
> +  <parameter name="inactive_probe_interval" unique="1">
> +  <longdesc lang="en">
> +  Inactive probe interval to set for ovsdb-server.
> +  </longdesc>
> +  <shortdesc lang="en">Set inactive probe interval</shortdesc>
> +  <content type="string" />
> +  </parameter>
> +
>    </parameters>
>  
>    <actions>
> @@ -138,6 +149,22 @@ ovsdb_server_notify() {
>              ${OVN_CTL} --ovn-manage-ovsdb=no start_northd
>          fi
>  
> +        conn=`ovn-nbctl get NB_global . connections`
> +        if [ "$conn" == "[]" ]
> +        then
> +            ovn-nbctl -- --id=@conn_uuid create Connection \
> +target="p${NB_MASTER_PROTO}\:${NB_MASTER_PORT}\:${MASTER_IP}" \
> +inactivity_probe=$INACTIVE_PROBE -- set NB_Global . connections=@conn_uuid
> +        fi
> +
> +        conn=`ovn-sbctl get SB_global . connections`
> +        if [ "$conn" == "[]" ]
> +        then
> +            ovn-sbctl -- --id=@conn_uuid create Connection \
> +target="p${SB_MASTER_PROTO}\:${SB_MASTER_PORT}\:${MASTER_IP}" \
> +inactivity_probe=$INACTIVE_PROBE -- set SB_Global . connections=@conn_uuid
> +        fi
> +
>      else
>          if [ "$MANAGE_NORTHD" = "yes" ]; then
>              # Stop ovn-northd service. Set --ovn-manage-ovsdb=no so that
> -- 
> 2.13.5
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Andy Zhou Oct. 12, 2017, 6:08 p.m. | #2
Sure, I will take a look.

Andy Zhou Oct. 13, 2017, 12:35 a.m. | #3
Hi, Numan,

I am curious why the default 5-second inactivity probe does not work. Do
you have more details?

Does the glitch usually happen around the HA switchover?  If this
happens during normal operation, then this is not an HA-specific issue
but an indication of some connectivity issue.


Numan Siddique Oct. 13, 2017, 12:30 p.m. | #4
On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org> wrote:

> Hi, Numan,
>
> I am curious why default 5 seconds inactivity time does not work? Do
> you have more details?
>
> Does the glitch usually happen around the HA switch over?  If this
> happens during normal operation,
> Then this is not HA specific issue, but an indication of some
> connectivity issues.
>

Hi Andy. This happens in an OpenStack deployment when the
neutron-server is busy handling lots of API requests.
Normally the deployment has 3 controller nodes, with
neutron-server running on each node.  On each controller node,
neutron-server starts around 10 - 12 neutron workers (which are separate
processes).  The number of API workers is a configuration option; if it is
not configured, it normally equals the number of cores.

I have tested in both a physical deployment and a virtual deployment (3
controllers running as VMs on one node). Around 40 connections are opened to
the OVN northbound ovsdb-server by all the neutron workers in the physical
deployment and around 15 connections are opened in the virtual deployment.
When neutron-server is loaded with many API requests, I have noticed that
ovsdb-server drops the connections when it doesn't get the echo reply every
5 seconds. This leads to a lot of reconnections to the ovsdb-server, and the
response from the neutron-server is very slow.  With this patch it
seems to work fine.

The issue is not caused by the network but by the large number of
connections from the neutron-server workers to the ovsdb-server and the
failure of the IDL clients to reply to the echo request every 5 seconds when
the neutron-server is loaded.

I can change the patch to provide a configuration option to override the
inactivity probe value so that it doesn't affect others who use the OVN OCF
pacemaker script.

Let me know your comments.

Thanks
Numan
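
To make the timing argument above concrete, here is a toy model in shell. It is only a sketch under stated assumptions: a busy client is treated as unable to answer an echo request for stall_ms milliseconds, and the connection is assumed to be dropped once that stall reaches the probe interval times a multiplier. The multiplier, the 8000 ms stall and the function name survives_stall are hypothetical and are not taken from the ovsdb-server source.

```shell
#!/bin/sh
# Toy model only: succeeds if a client stalled for stall_ms would still
# answer the echo request before the assumed timeout expires.
survives_stall() {
    stall_ms=$1
    probe_ms=$2
    multiplier=${3:-1}   # assumed factor applied to the probe interval
    [ "$stall_ms" -lt $((probe_ms * multiplier)) ]
}

# Compare the two probe intervals discussed in the thread for a hypothetical
# 8-second stall of a loaded neutron worker.
for probe in 5000 60000; do
    if survives_stall 8000 "$probe"; then
        echo "probe=${probe}ms: connection kept"
    else
        echo "probe=${probe}ms: connection dropped"
    fi
done
```

Under these assumptions the 5000 ms interval drops the connection and the 60000 ms interval keeps it, which matches the behaviour reported above.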


Russell Bryant Oct. 13, 2017, 4:06 p.m. | #5
On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <nusiddiq@redhat.com> wrote:
> On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org> wrote:
>
>> Hi, Numan,
>>
>> I am curious why default 5 seconds inactivity time does not work? Do
>> you have more details?
>>
>> Does the glitch usually happen around the HA switch over?  If this
>> happens during normal operation,
>> Then this is not HA specific issue, but an indication of some
>> connectivity issues.
>>
>
> Hi Andy. This happens in the openstack deployment and when the
> neutron-server is busy handling lots of API requests.
> Normally the deployment would be having 3 controller nodes and
> neutron-server would be running in each node.  On each controller node,
> neutron-server starts around 10 - 12 neutron workers (which are separate
> processes).  Number of API workers is a configuration option and normally
> number of cores = no of neutron works if not configured.
>
> I have tested  in both physical nodes deployment and virtual deployment (3
> controllers running as vms in a node). Around 40 connections are opened to
> the OVN north ovsdb-server by all the neutron workers in the physical
> deployment and around 15 connections are opened in the virtual deployment.
> When neutron-server is loaded with many API requests, I have noticed that,
> ovsdb-server drops the connections when it doesn't get the echo reply every
> 5 seconds. This leads to lot of reconnections to the ovsdb-server and the
> response from the neutron-server is very slow and bad.  With this patch it
> seems to work fine.
>
> The issue is not because of any network issues but because of lots of
> connections from the neutron-server workers to the ovsdb-server and failure
> by the idl clients to reply to the echo request every 5 seconds when the
> neutron-server is loaded.

We have to disable the inactivity probe everywhere each time we have
done performance testing so far.

> I can make the patch to provide the configuration option to override the
> inactivity probe value so that it doesn't affect others who use the OVN OCF
> pacemaker script.
>
> Let me know your comments.

I think the default through this script should match the normal
default.  It looks like it defaults to 60s in this patch instead of
5s?  I would make it match.  I do like exposing the ability to change
it, though.  We could consider setting a different default through our
OpenStack work.
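
A minimal sketch of that suggestion, assuming the script keeps the OCF_RESKEY_inactive_probe_interval parameter from the patch but falls back to 5000 ms so that the default matches the normal 5 s probe. The function name resolve_probe and the integer check are illustrative additions, not part of the posted patch.

```shell
#!/bin/sh
# Default matches the normal 5 s probe interval, expressed in milliseconds.
INACTIVE_PROBE_DEFAULT=5000

# Print the probe interval to use: the operator's override if one is set,
# otherwise the default. Rejects values that are not plain non-negative integers.
resolve_probe() {
    value=${OCF_RESKEY_inactive_probe_interval:-${INACTIVE_PROBE_DEFAULT}}
    case $value in
        ''|*[!0-9]*)
            echo "invalid inactive_probe_interval: $value" >&2
            return 1
            ;;
    esac
    echo "$value"
}
```

With no override set, resolve_probe prints 5000; a deployment that needs the longer interval, such as the OpenStack case discussed here, would set the parameter to 60000 on the pacemaker resource.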

Ben Pfaff Oct. 13, 2017, 9:26 p.m. | #6
On Fri, Oct 13, 2017 at 12:06:56PM -0400, Russell Bryant wrote:
> On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <nusiddiq@redhat.com> wrote:
> > On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org> wrote:
> >
> >> Hi, Numan,
> >>
> >> I am curious why default 5 seconds inactivity time does not work? Do
> >> you have more details?
> >>
> >> Does the glitch usually happen around the HA switch over?  If this
> >> happens during normal operation,
> >> Then this is not HA specific issue, but an indication of some
> >> connectivity issues.
> >>
> >
> > Hi Andy. This happens in the openstack deployment and when the
> > neutron-server is busy handling lots of API requests.
> > Normally the deployment would be having 3 controller nodes and
> > neutron-server would be running in each node.  On each controller node,
> > neutron-server starts around 10 - 12 neutron workers (which are separate
> > processes).  Number of API workers is a configuration option and normally
> > number of cores = no of neutron works if not configured.
> >
> > I have tested  in both physical nodes deployment and virtual deployment (3
> > controllers running as vms in a node). Around 40 connections are opened to
> > the OVN north ovsdb-server by all the neutron workers in the physical
> > deployment and around 15 connections are opened in the virtual deployment.
> > When neutron-server is loaded with many API requests, I have noticed that,
> > ovsdb-server drops the connections when it doesn't get the echo reply every
> > 5 seconds. This leads to lot of reconnections to the ovsdb-server and the
> > response from the neutron-server is very slow and bad.  With this patch it
> > seems to work fine.
> >
> > The issue is not because of any network issues but because of lots of
> > connections from the neutron-server workers to the ovsdb-server and failure
> > by the idl clients to reply to the echo request every 5 seconds when the
> > neutron-server is loaded.
> 
> We have to disable the inactivity probe everywhere each time we have
> done performance testing so far.

Really, this seems like a bug (or inadequacy) in ovsdb-server.  It's
pretty sad that ovsdb-server can't reply within 5 seconds (maybe there's
a 2x or 3x multiplier on the response time, I don't recall).  I hope
that the clustered database does better here.

That said, if in the real world we need 60 seconds for now, let's use it
but remember that we should get our act together later.  (Maybe a
comment would be helpful.)
Numan Siddique Oct. 16, 2017, 8:32 a.m. | #7
On Sat, Oct 14, 2017 at 2:56 AM, Ben Pfaff <blp@ovn.org> wrote:

> On Fri, Oct 13, 2017 at 12:06:56PM -0400, Russell Bryant wrote:
> > On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <nusiddiq@redhat.com> wrote:
> > > On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org> wrote:
> > >
> > >> Hi, Numan,
> > >>
> > >> I am curious why default 5 seconds inactivity time does not work? Do
> > >> you have more details?
> > >>
> > >> Does the glitch usually happen around the HA switch over?  If this
> > >> happens during normal operation,
> > >> Then this is not HA specific issue, but an indication of some
> > >> connectivity issues.
> > >>
> > >
> > > Hi Andy. This happens in the openstack deployment and when the
> > > neutron-server is busy handling lots of API requests.
> > > Normally the deployment would be having 3 controller nodes and
> > > neutron-server would be running in each node.  On each controller node,
> > > neutron-server starts around 10 - 12 neutron workers (which are separate
> > > processes).  Number of API workers is a configuration option and normally
> > > number of cores = no of neutron works if not configured.
> > >
> > > I have tested  in both physical nodes deployment and virtual deployment (3
> > > controllers running as vms in a node). Around 40 connections are opened to
> > > the OVN north ovsdb-server by all the neutron workers in the physical
> > > deployment and around 15 connections are opened in the virtual deployment.
> > > When neutron-server is loaded with many API requests, I have noticed that,
> > > ovsdb-server drops the connections when it doesn't get the echo reply every
> > > 5 seconds. This leads to lot of reconnections to the ovsdb-server and the
> > > response from the neutron-server is very slow and bad.  With this patch it
> > > seems to work fine.
> > >
> > > The issue is not because of any network issues but because of lots of
> > > connections from the neutron-server workers to the ovsdb-server and failure
> > > by the idl clients to reply to the echo request every 5 seconds when the
> > > neutron-server is loaded.
> >
> > We have to disable the inactivity probe everywhere each time we have
> > done performance testing so far.
>
> Really this seems that it's a bug (or inadequacy) in ovsdb-server.  It's
> pretty sad that ovsdb-server can't reply within 5 seconds (maybe there's
> a 2x or 3x multiplier on the response time, I don't recall).  I hope
> that the clustered database does better here.
>
> That said, if in the real world we need 60 seconds for now, let's use it
> but remember that we should get our act together later.  (Maybe a
> comment would be helpful.)
>

Thanks. I will add relevant comments in my next patch.
Numan Siddique Oct. 16, 2017, 8:32 a.m. | #8
On Fri, Oct 13, 2017 at 9:36 PM, Russell Bryant <russell@ovn.org> wrote:

> On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <nusiddiq@redhat.com> wrote:
> > On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org> wrote:
> >
> >> Hi, Numan,
> >>
> >> I am curious why default 5 seconds inactivity time does not work? Do
> >> you have more details?
> >>
> >> Does the glitch usually happen around the HA switch over?  If this
> >> happens during normal operation,
> >> Then this is not HA specific issue, but an indication of some
> >> connectivity issues.
> >>
> >
> > Hi Andy. This happens in the openstack deployment and when the
> > neutron-server is busy handling lots of API requests.
> > Normally the deployment would be having 3 controller nodes and
> > neutron-server would be running in each node.  On each controller node,
> > neutron-server starts around 10 - 12 neutron workers (which are separate
> > processes).  Number of API workers is a configuration option and normally
> > number of cores = no of neutron works if not configured.
> >
> > I have tested  in both physical nodes deployment and virtual deployment (3
> > controllers running as vms in a node). Around 40 connections are opened to
> > the OVN north ovsdb-server by all the neutron workers in the physical
> > deployment and around 15 connections are opened in the virtual deployment.
> > When neutron-server is loaded with many API requests, I have noticed that,
> > ovsdb-server drops the connections when it doesn't get the echo reply every
> > 5 seconds. This leads to lot of reconnections to the ovsdb-server and the
> > response from the neutron-server is very slow and bad.  With this patch it
> > seems to work fine.
> >
> > The issue is not because of any network issues but because of lots of
> > connections from the neutron-server workers to the ovsdb-server and failure
> > by the idl clients to reply to the echo request every 5 seconds when the
> > neutron-server is loaded.
>
> We have to disable the inactivity probe everywhere each time we have
> done performance testing so far.
>
> > I can make the patch to provide the configuration option to override the
> > inactivity probe value so that it doesn't affect others who use the OVN
> OCF
> > pacemaker script.
> >
> > Let me know your comments.
>
> I think the default through this script should match the normal
> default.  It looks like it defaults to 60s in this patch instead of
> 5s?  I would make it match.


Ack. Will do that in the next patch.

Thanks
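
For reference, the agreed change could look roughly like the sketch below. It assumes the 5 second (5000 ms) value mentioned in this thread as the normal default, and it wraps the lookup in a hypothetical helper, resolve_probe_interval, so that it can be exercised outside pacemaker. Only OCF_RESKEY_inactive_probe_interval and INACTIVE_PROBE_DEFAULT come from the patch.

```shell
#!/bin/sh
# Sketch only. 5000 ms mirrors the 5 second default discussed in this thread.
: ${INACTIVE_PROBE_DEFAULT="5000"}

# resolve_probe_interval is a hypothetical helper, not part of the OCF script.
# It prints the operator-supplied value if one is set, otherwise the default.
resolve_probe_interval() {
    echo "${OCF_RESKEY_inactive_probe_interval:-${INACTIVE_PROBE_DEFAULT}}"
}
```

With this shape, deployments that need a longer interval (for example the OpenStack case) override the resource parameter, and everyone else keeps the normal behaviour.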


> I do like exposing the ability to change
> it, though.  We could consider setting a different default through our
> OpenStack work.
>
> >
> > Thanks
> > Numan
> >
> >
> >>
> >> On Thu, Oct 12, 2017 at 11:08 AM, Andy Zhou <azhou@ovn.org> wrote:
> >> > Sure, I will take a look.
> >> >
> >> > On Thu, Oct 12, 2017 at 10:49 AM, Ben Pfaff <blp@ovn.org> wrote:
> >> >> Hi Andy.  In the IRC meeting today, Numan suggested that you might
> be an
> >> >> appropriate reviewer for this patch, so if you agree and you have a
> >> >> chance to look at this then it would be appreciated.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Ben.
> >> >>
> >> >> On Wed, Oct 11, 2017 at 02:22:33PM +0530, nusiddiq@redhat.com wrote:
> >> >>> From: Numan Siddique <nusiddiq@redhat.com>
> >> >>>
> >> >>> In the case of OVN HA deployments with openstack, it has been
> noticed
> >> >>> that the 5 seconds inactivity probe interval is not enough and
> >> ovsdb-servers
> >> >>> time out.
> >> >>> This patch
> >> >>>    - provides an option to configure this value.
> >> >>>    - creates a connection row in NB/SB dbs and sets the target and
> >> >>>      inactivity_probe values when the node is promoted to master.
> >> >>>
> >> >>> CC: Andy Zhou <azhou@ovn.org>
> >> >>> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> >> >>> ---
> >> >>>  ovn/utilities/ovndb-servers.ocf | 27 +++++++++++++++++++++++++++
> >> >>>  1 file changed, 27 insertions(+)
> >> >>>
> >> >>> diff --git a/ovn/utilities/ovndb-servers.ocf
> >> b/ovn/utilities/ovndb-servers.ocf
> >> >>> index fe1207c22..92620af6a 100755
> >> >>> --- a/ovn/utilities/ovndb-servers.ocf
> >> >>> +++ b/ovn/utilities/ovndb-servers.ocf
> >> >>> @@ -8,6 +8,8 @@
> >> >>>  : ${SB_MASTER_PORT_DEFAULT="6642"}
> >> >>>  : ${SB_MASTER_PROTO_DEFAULT="tcp"}
> >> >>>  : ${MANAGE_NORTHD_DEFAULT="no"}
> >> >>> +: ${INACTIVE_PROBE_DEFAULT="60000"}
> >> >>> +
> >> >>>  CRM_MASTER="${HA_SBIN_DIR}/crm_master -l reboot"
> >> >>>  CRM_ATTR_REPL_INFO="${HA_SBIN_DIR}/crm_attribute --type crm_config
> >> --name OVN_REPL_INFO -s ovn_ovsdb_master_server"
> >> >>>  OVN_CTL=${OCF_RESKEY_ovn_ctl:-${OVN_CTL_DEFAULT}}
> >> >>> @@ -17,6 +19,7 @@ NB_MASTER_PROTO=${OCF_RESKEY_
> >> nb_master_protocol:-${NB_MASTER_PROTO_DEFAULT}}
> >> >>>  SB_MASTER_PORT=${OCF_RESKEY_sb_master_port:-${SB_MASTER_
> >> PORT_DEFAULT}}
> >> >>>  SB_MASTER_PROTO=${OCF_RESKEY_sb_master_protocol:-${SB_
> >> MASTER_PROTO_DEFAULT}}
> >> >>>  MANAGE_NORTHD=${OCF_RESKEY_manage_northd:-${MANAGE_
> NORTHD_DEFAULT}}
> >> >>> +INACTIVE_PROBE=${OCF_RESKEY_inactive_probe_interval:-${
> >> INACTIVE_PROBE_DEFAULT}}
> >> >>>
> >> >>>  # Invalid IP address is an address that can never exist in the
> >> network, as
> >> >>>  # mentioned in rfc-5737. The ovsdb servers connects to this IP
> >> address till
> >> >>> @@ -101,6 +104,14 @@ ovsdb_server_metadata() {
> >> >>>    <content type="string" />
> >> >>>    </parameter>
> >> >>>
> >> >>> +  <parameter name="inactive_probe_interval" unique="1">
> >> >>> +  <longdesc lang="en">
> >> >>> +  Inactive probe interval to set for ovsdb-server.
> >> >>> +  </longdesc>
> >> >>> +  <shortdesc lang="en">Set inactive probe interval</shortdesc>
> >> >>> +  <content type="string" />
> >> >>> +  </parameter>
> >> >>> +
> >> >>>    </parameters>
> >> >>>
> >> >>>    <actions>
> >> >>> @@ -138,6 +149,22 @@ ovsdb_server_notify() {
> >> >>>              ${OVN_CTL} --ovn-manage-ovsdb=no start_northd
> >> >>>          fi
> >> >>>
> >> >>> +        conn=`ovn-nbctl get NB_global . connections`
> >> >>> +        if [ "$conn" == "[]" ]
> >> >>> +        then
> >> >>> +            ovn-nbctl -- --id=@conn_uuid create Connection \
> >> >>> +target="p${NB_MASTER_PROTO}\:${NB_MASTER_PORT}\:${MASTER_IP}" \
> >> >>> +inactivity_probe=$INACTIVE_PROBE -- set NB_Global .
> >> connections=@conn_uuid
> >> >>> +        fi
> >> >>> +
> >> >>> +        conn=`ovn-sbctl get SB_global . connections`
> >> >>> +        if [ "$conn" == "[]" ]
> >> >>> +        then
> >> >>> +            ovn-sbctl -- --id=@conn_uuid create Connection \
> >> >>> +target="p${SB_MASTER_PROTO}\:${SB_MASTER_PORT}\:${MASTER_IP}" \
> >> >>> +inactivity_probe=$INACTIVE_PROBE -- set SB_Global .
> >> connections=@conn_uuid
> >> >>> +        fi
> >> >>> +
> >> >>>      else
> >> >>>          if [ "$MANAGE_NORTHD" = "yes" ]; then
> >> >>>              # Stop ovn-northd service. Set --ovn-manage-ovsdb=no so
> >> that
> >> >>> --
> >> >>> 2.13.5
> >> >>>
> >>
>
>
>
> --
> Russell Bryant
>
Numan Siddique Oct. 16, 2017, 9:20 a.m. | #9
On Sat, Oct 14, 2017 at 2:56 AM, Ben Pfaff <blp@ovn.org> wrote:

> On Fri, Oct 13, 2017 at 12:06:56PM -0400, Russell Bryant wrote:
> > On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <nusiddiq@redhat.com>
> wrote:
> > > On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org> wrote:
> > >
> > >> Hi, Numan,
> > >>
> > >> I am curious why default 5 seconds inactivity time does not work? Do
> > >> you have more details?
> > >>
> > >> Does the glitch usually happen around the HA switch over?  If this
> > >> happens during normal operation,
> > >> Then this is not HA specific issue, but an indication of some
> > >> connectivity issues.
> > >>
> > >
> > > Hi Andy. This happens in the openstack deployment and when the
> > > neutron-server is busy handling lots of API requests.
> > > Normally the deployment would be having 3 controller nodes and
> > > neutron-server would be running in each node.  On each controller node,
> > > neutron-server starts around 10 - 12 neutron workers (which are
> separate
> > > processes).  Number of API workers is a configuration option and
> normally
> > > number of cores = no of neutron works if not configured.
> > >
> > > I have tested  in both physical nodes deployment and virtual
> deployment (3
> > > controllers running as vms in a node). Around 40 connections are
> opened to
> > > the OVN north ovsdb-server by all the neutron workers in the physical
> > > deployment and around 15 connections are opened in the virtual
> deployment.
> > > When neutron-server is loaded with many API requests, I have noticed
> that,
> > > ovsdb-server drops the connections when it doesn't get the echo reply
> every
> > > 5 seconds. This leads to lot of reconnections to the ovsdb-server and
> the
> > > response from the neutron-server is very slow and bad.  With this
> patch it
> > > seems to work fine.
> > >
> > > The issue is not because of any network issues but because of lots of
> > > connections from the neutron-server workers to the ovsdb-server and
> failure
> > > by the idl clients to reply to the echo request every 5 seconds when
> the
> > > neutron-server is loaded.
> >
> > We have to disable the inactivity probe everywhere each time we have
> > done performance testing so far.
>
> Really this seems that it's a bug (or inadequacy) in ovsdb-server.  It's
> pretty sad that ovsdb-server can't reply within 5 seconds


It's actually the ovsdb python idl client which is not able to reply within
5 seconds to the echo request from ovsdb-server.

> (maybe there's
> a 2x or 3x multiplier on the response time, I don't recall).  I hope
> that the clustered database does better here.
>
> That said, if in the real world we need 60 seconds for now, let's use it
> but remember that we should get our act together later.  (Maybe a
> comment would be helpful.)
>
Ben Pfaff Oct. 16, 2017, 5:58 p.m. | #10
On Mon, Oct 16, 2017 at 02:50:48PM +0530, Numan Siddique wrote:
> On Sat, Oct 14, 2017 at 2:56 AM, Ben Pfaff <blp@ovn.org> wrote:
> 
> > On Fri, Oct 13, 2017 at 12:06:56PM -0400, Russell Bryant wrote:
> > > On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <nusiddiq@redhat.com>
> > wrote:
> > > > On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org> wrote:
> > > >
> > > >> Hi, Numan,
> > > >>
> > > >> I am curious why default 5 seconds inactivity time does not work? Do
> > > >> you have more details?
> > > >>
> > > >> Does the glitch usually happen around the HA switch over?  If this
> > > >> happens during normal operation,
> > > >> Then this is not HA specific issue, but an indication of some
> > > >> connectivity issues.
> > > >>
> > > >
> > > > Hi Andy. This happens in the openstack deployment and when the
> > > > neutron-server is busy handling lots of API requests.
> > > > Normally the deployment would be having 3 controller nodes and
> > > > neutron-server would be running in each node.  On each controller node,
> > > > neutron-server starts around 10 - 12 neutron workers (which are
> > separate
> > > > processes).  Number of API workers is a configuration option and
> > normally
> > > > number of cores = no of neutron works if not configured.
> > > >
> > > > I have tested  in both physical nodes deployment and virtual
> > deployment (3
> > > > controllers running as vms in a node). Around 40 connections are
> > opened to
> > > > the OVN north ovsdb-server by all the neutron workers in the physical
> > > > deployment and around 15 connections are opened in the virtual
> > deployment.
> > > > When neutron-server is loaded with many API requests, I have noticed
> > that,
> > > > ovsdb-server drops the connections when it doesn't get the echo reply
> > every
> > > > 5 seconds. This leads to lot of reconnections to the ovsdb-server and
> > the
> > > > response from the neutron-server is very slow and bad.  With this
> > patch it
> > > > seems to work fine.
> > > >
> > > > The issue is not because of any network issues but because of lots of
> > > > connections from the neutron-server workers to the ovsdb-server and
> > failure
> > > > by the idl clients to reply to the echo request every 5 seconds when
> > the
> > > > neutron-server is loaded.
> > >
> > > We have to disable the inactivity probe everywhere each time we have
> > > done performance testing so far.
> >
> > Really this seems that it's a bug (or inadequacy) in ovsdb-server.  It's
> > pretty sad that ovsdb-server can't reply within 5 seconds
> 
> 
> It's actually the ovsdb python idl client which is not able to reply within
> 5 seconds for the
> echo request from ovsdb-server.

Oh, I'm surprised that ovsdb-server is doing the echo-requests.  I
thought that we generally did them from the client end.
Ben Pfaff Oct. 16, 2017, 7:48 p.m. | #11
On Mon, Oct 16, 2017 at 10:58:43AM -0700, Ben Pfaff wrote:
> On Mon, Oct 16, 2017 at 02:50:48PM +0530, Numan Siddique wrote:
> > On Sat, Oct 14, 2017 at 2:56 AM, Ben Pfaff <blp@ovn.org> wrote:
> > 
> > > On Fri, Oct 13, 2017 at 12:06:56PM -0400, Russell Bryant wrote:
> > > > On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <nusiddiq@redhat.com>
> > > wrote:
> > > > > On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org> wrote:
> > > > >
> > > > >> Hi, Numan,
> > > > >>
> > > > >> I am curious why default 5 seconds inactivity time does not work? Do
> > > > >> you have more details?
> > > > >>
> > > > >> Does the glitch usually happen around the HA switch over?  If this
> > > > >> happens during normal operation,
> > > > >> Then this is not HA specific issue, but an indication of some
> > > > >> connectivity issues.
> > > > >>
> > > > >
> > > > > Hi Andy. This happens in the openstack deployment and when the
> > > > > neutron-server is busy handling lots of API requests.
> > > > > Normally the deployment would be having 3 controller nodes and
> > > > > neutron-server would be running in each node.  On each controller node,
> > > > > neutron-server starts around 10 - 12 neutron workers (which are
> > > separate
> > > > > processes).  Number of API workers is a configuration option and
> > > normally
> > > > > number of cores = no of neutron works if not configured.
> > > > >
> > > > > I have tested  in both physical nodes deployment and virtual
> > > deployment (3
> > > > > controllers running as vms in a node). Around 40 connections are
> > > opened to
> > > > > the OVN north ovsdb-server by all the neutron workers in the physical
> > > > > deployment and around 15 connections are opened in the virtual
> > > deployment.
> > > > > When neutron-server is loaded with many API requests, I have noticed
> > > that,
> > > > > ovsdb-server drops the connections when it doesn't get the echo reply
> > > every
> > > > > 5 seconds. This leads to lot of reconnections to the ovsdb-server and
> > > the
> > > > > response from the neutron-server is very slow and bad.  With this
> > > patch it
> > > > > seems to work fine.
> > > > >
> > > > > The issue is not because of any network issues but because of lots of
> > > > > connections from the neutron-server workers to the ovsdb-server and
> > > failure
> > > > > by the idl clients to reply to the echo request every 5 seconds when
> > > the
> > > > > neutron-server is loaded.
> > > >
> > > > We have to disable the inactivity probe everywhere each time we have
> > > > done performance testing so far.
> > >
> > > Really this seems that it's a bug (or inadequacy) in ovsdb-server.  It's
> > > pretty sad that ovsdb-server can't reply within 5 seconds
> > 
> > 
> > It's actually the ovsdb python idl client which is not able to reply within
> > 5 seconds for the
> > echo request from ovsdb-server.
> 
> Oh, I'm surprised that ovsdb-server is doing the echo-requests, I
> thought that we generally did them from the client end.

One perfectly acceptable approach might be to simply disable
echo-requests on the server side entirely and do them from the client.
Miguel Angel Ajo Pelayo Oct. 17, 2017, 3:26 p.m. | #12
Acked-By: Miguel Angel Ajo <majopela@redhat.com>

It makes sense to be able to configure the inactivity probe interval. I
agree with Ben that disabling the echo requests on the server would also
make sense in a future patch.
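
Since the interval becomes operator-configurable, a small amount of validation before it reaches ovn-nbctl/ovn-sbctl may be worthwhile. The sketch below is only an illustration: is_valid_probe_interval and connection_target are hypothetical helpers, not part of ovndb-servers.ocf. A value of 0 is accepted on the assumption that it disables probing, which is how the OVSDB Connection table documents inactivity_probe; that is worth verifying against the schema in use.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not part of ovndb-servers.ocf.

# Accept only a non-negative integer number of milliseconds.
is_valid_probe_interval() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
        *) return 0 ;;
    esac
}

# Build the passive target string in the same form the patch uses,
# with the colons escaped for the ovn-nbctl/ovn-sbctl column syntax.
# $1 = protocol, $2 = port, $3 = master IP
connection_target() {
    printf 'p%s\\:%s\\:%s\n' "$1" "$2" "$3"
}
```

In the notify path, the script could then log an error and skip the "create Connection" call when is_valid_probe_interval fails, rather than passing a malformed value to the database.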

On Mon, Oct 16, 2017 at 9:48 PM, Ben Pfaff <blp@ovn.org> wrote:

> On Mon, Oct 16, 2017 at 10:58:43AM -0700, Ben Pfaff wrote:
> > On Mon, Oct 16, 2017 at 02:50:48PM +0530, Numan Siddique wrote:
> > > On Sat, Oct 14, 2017 at 2:56 AM, Ben Pfaff <blp@ovn.org> wrote:
> > >
> > > > On Fri, Oct 13, 2017 at 12:06:56PM -0400, Russell Bryant wrote:
> > > > > On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <
> nusiddiq@redhat.com>
> > > > wrote:
> > > > > > On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou@ovn.org>
> wrote:
> > > > > >
> > > > > >> Hi, Numan,
> > > > > >>
> > > > > >> I am curious why default 5 seconds inactivity time does not
> work? Do
> > > > > >> you have more details?
> > > > > >>
> > > > > >> Does the glitch usually happen around the HA switch over?  If
> this
> > > > > >> happens during normal operation,
> > > > > >> Then this is not HA specific issue, but an indication of some
> > > > > >> connectivity issues.
> > > > > >>
> > > > > >
> > > > > > Hi Andy. This happens in the openstack deployment and when the
> > > > > > neutron-server is busy handling lots of API requests.
> > > > > > Normally the deployment would be having 3 controller nodes and
> > > > > > neutron-server would be running in each node.  On each
> controller node,
> > > > > > neutron-server starts around 10 - 12 neutron workers (which are
> > > > separate
> > > > > > processes).  Number of API workers is a configuration option and
> > > > normally
> > > > > > number of cores = no of neutron works if not configured.
> > > > > >
> > > > > > I have tested  in both physical nodes deployment and virtual
> > > > deployment (3
> > > > > > controllers running as vms in a node). Around 40 connections are
> > > > opened to
> > > > > > the OVN north ovsdb-server by all the neutron workers in the
> physical
> > > > > > deployment and around 15 connections are opened in the virtual
> > > > deployment.
> > > > > > When neutron-server is loaded with many API requests, I have
> noticed
> > > > that,
> > > > > > ovsdb-server drops the connections when it doesn't get the echo
> reply
> > > > every
> > > > > > 5 seconds. This leads to lot of reconnections to the
> ovsdb-server and
> > > > the
> > > > > > response from the neutron-server is very slow and bad.  With this
> > > > patch it
> > > > > > seems to work fine.
> > > > > >
> > > > > > The issue is not because of any network issues but because of
> lots of
> > > > > > connections from the neutron-server workers to the ovsdb-server
> and
> > > > failure
> > > > > > by the idl clients to reply to the echo request every 5 seconds
> when
> > > > the
> > > > > > neutron-server is loaded.
> > > > >
> > > > > We have to disable the inactivity probe everywhere each time we
> have
> > > > > done performance testing so far.
> > > >
> > > > Really this seems that it's a bug (or inadequacy) in ovsdb-server.
> It's
> > > > pretty sad that ovsdb-server can't reply within 5 seconds
> > >
> > >
> > > It's actually the ovsdb python idl client which is not able to reply
> within
> > > 5 seconds for the
> > > echo request from ovsdb-server.
> >
> > Oh, I'm surprised that ovsdb-server is doing the echo-requests, I
> > thought that we generally did them from the client end.
>
> One perfectly acceptable approach might be to simply disable
> echo-requests on the server side entirely and do them from the client.

Patch

diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
index fe1207c22..92620af6a 100755
--- a/ovn/utilities/ovndb-servers.ocf
+++ b/ovn/utilities/ovndb-servers.ocf
@@ -8,6 +8,8 @@ 
 : ${SB_MASTER_PORT_DEFAULT="6642"}
 : ${SB_MASTER_PROTO_DEFAULT="tcp"}
 : ${MANAGE_NORTHD_DEFAULT="no"}
+: ${INACTIVE_PROBE_DEFAULT="60000"}
+
 CRM_MASTER="${HA_SBIN_DIR}/crm_master -l reboot"
 CRM_ATTR_REPL_INFO="${HA_SBIN_DIR}/crm_attribute --type crm_config --name OVN_REPL_INFO -s ovn_ovsdb_master_server"
 OVN_CTL=${OCF_RESKEY_ovn_ctl:-${OVN_CTL_DEFAULT}}
@@ -17,6 +19,7 @@  NB_MASTER_PROTO=${OCF_RESKEY_nb_master_protocol:-${NB_MASTER_PROTO_DEFAULT}}
 SB_MASTER_PORT=${OCF_RESKEY_sb_master_port:-${SB_MASTER_PORT_DEFAULT}}
 SB_MASTER_PROTO=${OCF_RESKEY_sb_master_protocol:-${SB_MASTER_PROTO_DEFAULT}}
 MANAGE_NORTHD=${OCF_RESKEY_manage_northd:-${MANAGE_NORTHD_DEFAULT}}
+INACTIVE_PROBE=${OCF_RESKEY_inactive_probe_interval:-${INACTIVE_PROBE_DEFAULT}}
 
 # Invalid IP address is an address that can never exist in the network, as
 # mentioned in rfc-5737. The ovsdb servers connects to this IP address till
@@ -101,6 +104,14 @@  ovsdb_server_metadata() {
   <content type="string" />
   </parameter>
 
+  <parameter name="inactive_probe_interval" unique="1">
+  <longdesc lang="en">
+  Inactive probe interval to set for ovsdb-server.
+  </longdesc>
+  <shortdesc lang="en">Set inactive probe interval</shortdesc>
+  <content type="string" />
+  </parameter>
+
   </parameters>
 
   <actions>
@@ -138,6 +149,22 @@  ovsdb_server_notify() {
             ${OVN_CTL} --ovn-manage-ovsdb=no start_northd
         fi
 
+        conn=`ovn-nbctl get NB_global . connections`
+        if [ "$conn" == "[]" ]
+        then
+            ovn-nbctl -- --id=@conn_uuid create Connection \
+target="p${NB_MASTER_PROTO}\:${NB_MASTER_PORT}\:${MASTER_IP}" \
+inactivity_probe=$INACTIVE_PROBE -- set NB_Global . connections=@conn_uuid
+        fi
+
+        conn=`ovn-sbctl get SB_global . connections`
+        if [ "$conn" == "[]" ]
+        then
+            ovn-sbctl -- --id=@conn_uuid create Connection \
+target="p${SB_MASTER_PROTO}\:${SB_MASTER_PORT}\:${MASTER_IP}" \
+inactivity_probe=$INACTIVE_PROBE -- set SB_Global . connections=@conn_uuid
+        fi
+
     else
         if [ "$MANAGE_NORTHD" = "yes" ]; then
             # Stop ovn-northd service. Set --ovn-manage-ovsdb=no so that