Message ID: 20180517100409.834-1-nusiddiq@redhat.com
State: Accepted
Delegated to: Russell Bryant
Series: [ovs-dev] ovn pacemaker: Fix the promotion issue in other cluster nodes when the master node is reset
Hi:

I tried and it didn't help; the IP resource is always showing Stopped. My private VIP IP is 192.168.220.108.

# kernel panic on active node
root@test7:~# echo c > /proc/sysrq-trigger

root@test6:~# crm stat
Last updated: Thu May 17 22:46:38 2018    Last change: Thu May 17 22:45:03 2018 by root via cibadmin on test6
Stack: corosync
Current DC: test7 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 3 resources configured

Online: [ test6 test7 ]

Full list of resources:

 VirtualIP      (ocf::heartbeat:IPaddr2):       Started test7
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ test7 ]
     Slaves: [ test6 ]

root@test6:~# crm stat
Last updated: Thu May 17 22:46:38 2018    Last change: Thu May 17 22:45:03 2018 by root via cibadmin on test6
Stack: corosync
Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum
2 nodes and 3 resources configured

Online: [ test6 ]
OFFLINE: [ test7 ]

Full list of resources:

 VirtualIP      (ocf::heartbeat:IPaddr2):       Stopped
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Slaves: [ test6 ]
     Stopped: [ test7 ]

root@test6:~# crm stat
Last updated: Thu May 17 22:49:26 2018    Last change: Thu May 17 22:45:03 2018 by root via cibadmin on test6
Stack: corosync
Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum
2 nodes and 3 resources configured

Online: [ test6 ]
OFFLINE: [ test7 ]

Full list of resources:

 VirtualIP      (ocf::heartbeat:IPaddr2):       Stopped
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Stopped: [ test6 test7 ]

I think this change is not needed, or something else is wrong when using a virtual IP resource. Maybe you need promotion logic similar to what we have for LB with pacemaker in the discussion (will submit a formal patch soon). I did test a kernel panic with the LB code change and it works fine: node2 gets promoted.
Below works fine for LB, even if there is a kernel panic, without this change:

root@test-pace1-2365293:~# echo c > /proc/sysrq-trigger

root@test-pace2-2365308:~# crm stat
Last updated: Thu May 17 15:15:45 2018    Last change: Wed May 16 23:10:52 2018 by root via cibadmin on test-pace2-2365308
Stack: corosync
Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 2 resources configured

Online: [ test-pace1-2365293 test-pace2-2365308 ]

Full list of resources:

 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ test-pace1-2365293 ]
     Slaves: [ test-pace2-2365308 ]

root@test-pace2-2365308:~# crm stat
Last updated: Thu May 17 15:15:45 2018    Last change: Wed May 16 23:10:52 2018 by root via cibadmin on test-pace2-2365308
Stack: corosync
Current DC: test-pace2-2365308 (version 1.1.14-70404b0) - partition WITHOUT quorum
2 nodes and 2 resources configured

Online: [ test-pace2-2365308 ]
OFFLINE: [ test-pace1-2365293 ]

Full list of resources:

 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Slaves: [ test-pace2-2365308 ]
     Stopped: [ test-pace1-2365293 ]

root@test-pace2-2365308:~# ps aux | grep ovs
root     15175  0.0  0.0  18048   372 ?   Ss   15:15  0:00 ovsdb-server: monitoring pid 15176 (healthy)
root     15176  0.0  0.0  18312  4096 ?   S    15:15  0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-nb.log --remote=punix:/var/run/openvswitch/ovnnb_db.sock --pidfile=/var/run/openvswitch/ovnnb_db.pid --unixctl=ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers --remote=ptcp:6641:0.0.0.0 --sync-from=tcp:192.0.2.254:6641 /etc/openvswitch/ovnnb_db.db
root     15184  0.0  0.0  18048   376 ?   Ss   15:15  0:00 ovsdb-server: monitoring pid 15185 (healthy)
root     15185  0.0  0.0  18300  4480 ?   S    15:15  0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log --remote=punix:/var/run/openvswitch/ovnsb_db.sock --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections --private-key=db:OVN_Southbound,SSL,private_key --certificate=db:OVN_Southbound,SSL,certificate --ca-cert=db:OVN_Southbound,SSL,ca_cert --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers --remote=ptcp:6642:0.0.0.0 --sync-from=tcp:192.0.2.254:6642 /etc/openvswitch/ovnsb_db.db
root     15398  0.0  0.0  12940   972 pts/0 S+  15:15  0:00 grep --color=auto ovs

>>> I just want to point out that I am also seeing the errors below when setting the target with the master IP using the IPaddr2 resource too!

2018-05-17T21:58:51.889Z|00011|ovsdb_jsonrpc_server|ERR|ptcp:6641:192.168.220.108: listen failed: Cannot assign requested address
2018-05-17T21:58:51.889Z|00012|socket_util|ERR|6641:192.168.220.108: bind: Cannot assign requested address

That needs to be handled too, since the existing code does throw this error! Only if I skip setting the target does the error go away.

Regards,
Aliasgar

On Thu, May 17, 2018 at 3:04 AM, <nusiddiq@redhat.com> wrote:
> From: Numan Siddique <nusiddiq@redhat.com>
>
> When a node 'A' in the pacemaker cluster running OVN db servers in master is
> brought down ungracefully ('echo b > /proc/sysrq_trigger' for example),
> pacemaker is not able to promote any other node to master in the cluster.
> When pacemaker selects a node 'B' to promote, it moves the IPaddr2 resource
> (i.e. the master ip) to node 'B'. As soon as the node is configured with the
> IP address, when the issue is seen, the OVN db servers which were running as
> standby earlier transition to active. Ideally this should not have happened.
> The ovsdb-servers are expected to remain in standby until they are promoted.
> (This needs separate investigation.) When pacemaker calls the OVN OCF
> script's promote action, the ovsdb_server_promote function returns almost
> immediately without recording the present master. And later in the notify
> action it demotes the OVN db servers back, since the last known master
> doesn't match node 'B's hostname. This results in pacemaker
> promoting/demoting in a loop.
>
> This patch fixes the issue by not returning immediately when the promote
> action is called if the OVN db servers are running as active. Now it
> continues with the ovsdb_server_promote function and records the new master
> by setting the proper master score ($CRM_MASTER -N $host_name -v
> ${master_score}).
>
> This issue is not seen when a node is brought down gracefully, as pacemaker
> calls the stop, start and then promote actions before promoting a node. Not
> sure why pacemaker doesn't call the stop, start and promote actions when a
> node is reset ungracefully.
>
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1579025
> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
> ---
>  ovn/utilities/ovndb-servers.ocf | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
> index 164b6bce6..23dc70056 100755
> --- a/ovn/utilities/ovndb-servers.ocf
> +++ b/ovn/utilities/ovndb-servers.ocf
> @@ -409,7 +409,7 @@ ovsdb_server_promote() {
>      rc=$?
>      case $rc in
>          ${OCF_SUCCESS}) ;;
> -        ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};;
> +        ${OCF_RUNNING_MASTER}) ;;
>          *)
>              ovsdb_server_master_update $OCF_RUNNING_MASTER
>              return ${rc}
> --
> 2.17.0
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
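The one-line diff above is easier to follow as control flow. Below is a hypothetical standalone model of the patched promote action's return-code handling; `record_master` stands in for the `$CRM_MASTER` call in the real script, and the constants use the standard OCF values. With the patch, an `OCF_RUNNING_MASTER` status falls through the case statement instead of returning early, so the new master score still gets recorded:

```shell
#!/bin/sh
# Sketch of the patched return-code handling in ovsdb_server_promote().
OCF_SUCCESS=0
OCF_RUNNING_MASTER=8

recorded=""
record_master() { recorded="yes"; }   # stands in for $CRM_MASTER -N $host_name -v ${master_score}

promote_new() {
    rc=$1
    case $rc in
        ${OCF_SUCCESS}) ;;
        ${OCF_RUNNING_MASTER}) ;;     # patched: fall through instead of "return ${OCF_SUCCESS}"
        *) return $rc ;;
    esac
    record_master                     # reached even when the servers already report active
    return ${OCF_SUCCESS}
}

promote_new ${OCF_RUNNING_MASTER}
echo "rc=$? recorded=$recorded"
```

Before the patch, the `OCF_RUNNING_MASTER` arm returned immediately, so `record_master` was never reached and the subsequent notify action demoted the servers again, producing the promote/demote loop described in the commit message.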
On Fri, May 18, 2018 at 4:24 AM, aginwala <aginwala@asu.edu> wrote:
> Hi:
>
> I tried and it didn't help; the IP resource is always showing Stopped,
> where my private VIP IP is 192.168.220.108.
>
> [crm status output snipped]
>
> I think this change not needed or something else is wrong when using
> virtual IP resource.

Hi Aliasgar, I think you haven't created the resource properly, or haven't set the colocation constraints properly. What pcs/crm commands did you use to create the OVN db resources?
Can you share the output of "pcs resource show ovndb_servers" and "pcs constraint"? In case of tripleo we create the resource like this - https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L80

> Maybe you need a similar promotion logic to what we have for LB with
> pacemaker in the discussion (will submit formal patch soon). I did test
> with kernel panic with the LB code change and it works fine: node2 gets
> promoted.

This issue is not seen all the time. I have another setup where I don't see this issue at all. The issue is seen when the IPaddr2 resource is moved to another slave node and the ovsdb-servers start reporting as master as soon as the IP address is configured.

When the issue is seen we hit the code here - https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L412. Ideally, when the promote action is called, the ovsdb servers will be running as slaves/standby and the promote action promotes them to master. But when the issue is seen, the ovsdb servers report their status as active. Because of that we don't complete the full promote action and return at L412. And later, when the notify action is called, we demote the servers because of this - https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L176

For a use case like yours (where a load balancer VIP is used), you may not see this issue at all, since you will not be using the IPaddr2 resource as the master ip.
> [LB crm status and "ps aux | grep ovs" output snipped]
>
> >>> I just want to point out that I am also seeing the errors below when
> >>> setting the target with the master IP using the IPaddr2 resource too!
> >>> 2018-05-17T21:58:51.889Z|00011|ovsdb_jsonrpc_server|ERR|ptcp:6641:192.168.220.108: listen failed: Cannot assign requested address
> >>> 2018-05-17T21:58:51.889Z|00012|socket_util|ERR|6641:192.168.220.108: bind: Cannot assign requested address
> >>> That needs to be handled too, since the existing code does throw this
> >>> error! Only if I skip setting the target does the error go away.

In the case of tripleo, we handle this error by setting the sysctl value net.ipv4.ip_nonlocal_bind to 1 - https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L67

> Regards,
> Aliasgar
>
> On Thu, May 17, 2018 at 3:04 AM, <nusiddiq@redhat.com> wrote:
>> [quoted patch snipped]
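For reference, the tripleo setting mentioned above can be applied by hand roughly like this (a sketch; the drop-in file name is illustrative and the commands need root):

```shell
# Let ovsdb-server bind/listen on the master VIP even before IPaddr2 has
# configured that address on this node, avoiding the "Cannot assign
# requested address" errors quoted above.
sysctl -w net.ipv4.ip_nonlocal_bind=1

# Persist across reboots; this file name is just an example.
echo 'net.ipv4.ip_nonlocal_bind = 1' > /etc/sysctl.d/99-ovn-nonlocal-bind.conf
```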
On Thu, May 17, 2018 at 11:23 PM, Numan Siddique <nusiddiq@redhat.com> wrote:
>
> On Fri, May 18, 2018 at 4:24 AM, aginwala <aginwala@asu.edu> wrote:
>
>> [crm status output snipped]
>>
>> I think this change not needed or something else is wrong when using
>> virtual IP resource.
> Hi Aliasgar, I think you haven't created the resource properly, or haven't
> set the colocation constraints properly. What pcs/crm commands did you use
> to create the OVN db resources?
> Can you share the output of "pcs resource show ovndb_servers" and "pcs
> constraint"?
> In case of tripleo we create the resource like this -
> https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L80

>>>>> # I am using the same commands suggested upstream in the OVS document to create the resources. I am skipping the manage-northd option, with the default inactivity probe interval:
http://docs.openvswitch.org/en/latest/topics/integration/#ha-for-ovn-db-servers-using-pacemaker

# cat pcs_with_ipaddr2.sh
pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
    params ip="192.168.220.108" op monitor interval="30s"
pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
    master_ip="192.168.220.108" \
    op monitor interval="10s" \
    op monitor role=Master interval="15s" --debug
pcs resource master ovndb_servers-master ovndb_servers \
    meta notify="true"
pcs constraint order promote ovndb_servers-master then VirtualIP
pcs constraint colocation add VirtualIP with master ovndb_servers-master \
    score=INFINITY

# pcs resource show ovndb_servers
 Resource: ovndb_servers (class=ocf provider=ovn type=ovndb-servers)
  Attributes: master_ip=192.168.220.108
  Operations: start interval=0s timeout=30s (ovndb_servers-start-interval-0s)
              stop interval=0s timeout=20s (ovndb_servers-stop-interval-0s)
              promote interval=0s timeout=50s (ovndb_servers-promote-interval-0s)
              demote interval=0s timeout=50s (ovndb_servers-demote-interval-0s)
              monitor interval=10s (ovndb_servers-monitor-interval-10s)
              monitor interval=15s role=Master (ovndb_servers-monitor-interval-15s)

# pcs constraint
Location Constraints:
Ordering Constraints:
  promote ovndb_servers-master then start VirtualIP (kind:Mandatory)
Colocation Constraints:
  VirtualIP with ovndb_servers-master (score:INFINITY) (rsc-role:Started)
  (with-rsc-role:Master)

> This issue is not seen all the time. I have another setup where I don't
> see this issue at all. [explanation of the early return at L412 and the
> demote at L176 snipped]

>>> Yes, I agree! As you said, the settings work fine in one cluster, and if you use another cluster with the same settings you may see surprises.

> For a use case like yours (where a load balancer VIP is used), you may
> not see this issue at all, since you will not be using the IPaddr2
> resource as the master ip.

>>> Correct, I just wanted to report both setups to let you know pacemaker's behavior with IPaddr2 vs. an LB VIP.
>> [LB crm status and "ps aux | grep ovs" output snipped]
>>
>> >>> I just want to point out that I am also seeing the errors below when
>> >>> setting the target with the master IP using the IPaddr2 resource too!
>> >>> [bind error log lines snipped]

> In the case of tripleo, we handle this error by setting the sysctl value
> net.ipv4.ip_nonlocal_bind to 1 -
> https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L67

>>> Sweet, I can try setting this to get rid of the socket error.
>> Regards,
>> Aliasgar
>>
>> On Thu, May 17, 2018 at 3:04 AM, <nusiddiq@redhat.com> wrote:
>>> [quoted patch snipped]
On Fri, May 18, 2018 at 11:53 PM, aginwala <aginwala@asu.edu> wrote:
>
> [earlier discussion and crm status output snipped]
>
> # cat pcs_with_ipaddr2.sh
> pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
>     params ip="192.168.220.108" op monitor interval="30s"
> pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
>     master_ip="192.168.220.108" \
>     op monitor interval="10s" \
>     op monitor role=Master interval="15s" --debug
> pcs resource master ovndb_servers-master ovndb_servers \
>     meta notify="true"
> pcs constraint order promote ovndb_servers-master then VirtualIP

I think the ordering should be reversed. We want pacemaker to start the IPaddr2 resource first and then start the ovndb_servers resource. Maybe we need to update the document. Can you please try with the command "pcs constraint order VirtualIP then ovndb_servers-master"? I think that's why, in your setup, the IPaddr2 resource is not started.
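Concretely, the suggested correction could be applied like this (a sketch only: the constraint id passed to `pcs constraint remove` is a guess at the auto-generated name; check the real id with `pcs constraint --full` first):

```shell
# Show current constraints with their ids; the ordering rule created by the
# script above gets an auto-generated id (the one used below is an assumed
# example for this cluster).
pcs constraint --full

# Drop the old "promote ovndb_servers-master then VirtualIP" ordering rule...
pcs constraint remove order-ovndb_servers-master-VirtualIP-mandatory

# ...and add the reversed ordering suggested in this thread: bring up the
# VIP first, then start/promote the OVN db servers on that node.
pcs constraint order VirtualIP then ovndb_servers-master
```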
Thanks Numan > pcs constraint colocation add VirtualIP with master ovndb_servers-master \ > score=INFINITY > > # pcs resource show ovndb_servers > Resource: ovndb_servers (class=ocf provider=ovn type=ovndb-servers) > Attributes: master_ip=192.168.220.108 > Operations: start interval=0s timeout=30s (ovndb_servers-start-interval- > 0s) > stop interval=0s timeout=20s (ovndb_servers-stop-interval-0 > s) > promote interval=0s timeout=50s > (ovndb_servers-promote-interval-0s) > demote interval=0s timeout=50s (ovndb_servers-demote-interval > -0s) > monitor interval=10s (ovndb_servers-monitor-interval-10s) > monitor interval=15s role=Master > (ovndb_servers-monitor-interval-15s) > # pcs constraint > Location Constraints: > Ordering Constraints: > promote ovndb_servers-master then start VirtualIP (kind:Mandatory) > Colocation Constraints: > VirtualIP with ovndb_servers-master (score:INFINITY) (rsc-role:Started) > (with-rsc-role:Master) > >> >> >>> >>> May we you need a similar promotion logic that we have for LB with >>> pacemaker in the discussion (will submit formal patch soon). I did test >>> with kernel panic with LB code change and it works fine where node2 gets >>> promoted. Below works fine for LB even if there is kernel panic without >>> this change: >>> >> >> This issue is not seen all the time. I have another setup where I don't >> see this issue at all. The issue is seen when the IPAddr2 resource is moved >> to another slave node and ovsdb-server's start reporting as master as soon >> as the IP address is configured. >> >> When the issue is seen we hit the code here - >> https://github.com/openvswitch/ovs/blob/master/ovn/utiliti >> es/ovndb-servers.ocf#L412. Ideally when promot action is called, ovsdb >> servers will be running as slaves/standby and the promote action promotes >> them to master. But when the issue is seen, the ovsdb servers report the >> status as active. Because of which we don't complete the full promote >> action and return at L412. 
And later when notify action is called, we >> demote the servers because of this - https://github.com/openvswit >> ch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L176 >> >> >>> Yes I agree! As you said settings work fine in one cluster and if you > use other cluster with same settings, you may see surprises . > > >> For the use case like your's (where load balancer VIP is used), you may >> not see this issue at all since you will not be using the IPaddr2 resource >> as master ip. >> > >>> Correct, I just wanted to update both the settings to let you know > pacemaker behavior with IPaddr2 vs LB VIP IP. > >> >> >>> root@test-pace1-2365293:~# echo c > /proc/sysrq-trigger >>> root@test-pace2-2365308:~# crm stat >>> Last updated: Thu May 17 15:15:45 2018 Last change: Wed May 16 23:10:52 >>> 2018 by root via cibadmin on test-pace2-2365308 >>> Stack: corosync >>> Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with >>> quorum >>> 2 nodes and 2 resources configured >>> >>> Online: [ test-pace1-2365293 test-pace2-2365308 ] >>> >>> Full list of resources: >>> >>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>> Masters: [ test-pace1-2365293 ] >>> Slaves: [ test-pace2-2365308 ] >>> >>> root@test-pace2-2365308:~# crm stat >>> Last updated: Thu May 17 15:15:45 2018 Last change: Wed May 16 23:10:52 >>> 2018 by root via cibadmin on test-pace2-2365308 >>> Stack: corosync >>> Current DC: test-pace2-2365308 (version 1.1.14-70404b0) - partition >>> WITHOUT quorum >>> 2 nodes and 2 resources configured >>> >>> Online: [ test-pace2-2365308 ] >>> OFFLINE: [ test-pace1-2365293 ] >>> >>> Full list of resources: >>> >>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>> Slaves: [ test-pace2-2365308 ] >>> Stopped: [ test-pace1-2365293 ] >>> >>> root@test-pace2-2365308:~# ps aux | grep ovs >>> root 15175 0.0 0.0 18048 372 ? Ss 15:15 0:00 >>> ovsdb-server: monitoring pid 15176 (healthy) >>> root 15176 0.0 0.0 18312 4096 ? 
S 15:15 0:00 >>> ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-nb.log >>> --remote=punix:/var/run/openvswitch/ovnnb_db.sock >>> --pidfile=/var/run/openvswitch/ovnnb_db.pid --unixctl=ovnnb_db.ctl >>> --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections >>> --private-key=db:OVN_Northbound,SSL,private_key >>> --certificate=db:OVN_Northbound,SSL,certificate >>> --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols >>> --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers >>> --remote=ptcp:6641:0.0.0.0 --sync-from=tcp:192.0.2.254:6641 >>> /etc/openvswitch/ovnnb_db.db >>> root 15184 0.0 0.0 18048 376 ? Ss 15:15 0:00 >>> ovsdb-server: monitoring pid 15185 (healthy) >>> root 15185 0.0 0.0 18300 4480 ? S 15:15 0:00 >>> ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log >>> --remote=punix:/var/run/openvswitch/ovnsb_db.sock >>> --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl >>> --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections >>> --private-key=db:OVN_Southbound,SSL,private_key >>> --certificate=db:OVN_Southbound,SSL,certificate >>> --ca-cert=db:OVN_Southbound,SSL,ca_cert --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols >>> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers >>> --remote=ptcp:6642:0.0.0.0 --sync-from=tcp:192.0.2.254:6642 >>> /etc/openvswitch/ovnsb_db.db >>> root 15398 0.0 0.0 12940 972 pts/0 S+ 15:15 0:00 grep >>> --color=auto ovs >>> >>> >>>I just want to point out that I am also seeing below errors when >>> setting target with master IP using ipaddr2 resource too! >>> 2018-05-17T21:58:51.889Z|00011|ovsdb_jsonrpc_server|ERR|ptcp:6641: >>> 192.168.220.108: listen failed: Cannot assign requested address >>> 2018-05-17T21:58:51.889Z|00012|socket_util|ERR|6641:192.168.220.108: >>> bind: Cannot assign requested address >>> That needs to be handled too since existing code do throw this error! 
>>> Only if I skip setting target then it the error is gone.? >>> >> >> In the case of tripleo, we handle this error by setting the sysctl >> value net.ipv4.ip_nonlocal_bind to 1 - https://github.com/openstack >> /puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L67 >> >>> Sweet, I can try to set this to get rid of socket error. >> >> >>> >>> >>> >>> Regards, >>> Aliasgar >>> >>> >>> On Thu, May 17, 2018 at 3:04 AM, <nusiddiq@redhat.com> wrote: >>> >>>> From: Numan Siddique <nusiddiq@redhat.com> >>>> >>>> When a node 'A' in the pacemaker cluster running OVN db servers in >>>> master is >>>> brought down ungracefully ('echo b > /proc/sysrq_trigger' for example), >>>> pacemaker >>>> is not able to promote any other node to master in the cluster. When >>>> pacemaker selects >>>> a node B for instance to promote, it moves the IPAddr2 resource (i.e >>>> the master ip) >>>> to node 'B'. As soon the node is configured with the IP address, when >>>> the issue is >>>> seen, the OVN db servers which were running as standy earlier, >>>> transitions to active. >>>> Ideally this should not have happened. The ovsdb-servers are expected >>>> to remain in >>>> standby until there are promoted. (This needs separate investigation). >>>> When the pacemaker >>>> calls the OVN OCF script's promote action, the ovsdb_server_promot >>>> function returns >>>> almost immediately without recording the present master. And later in >>>> the notify action >>>> it demotes back the OVN db servers since the last known master doesn't >>>> match with >>>> node 'B's hostname. This results in pacemaker promoting/demoting in a >>>> loop. >>>> >>>> This patch fixes the issue by not returning immediately when promote >>>> action is >>>> called if the OVN db servers are running as active. 
>>>> Now it would continue with the ovsdb_server_promote function and record
>>>> the new master by setting the proper master score
>>>> ($CRM_MASTER -N $host_name -v ${master_score}).
>>>>
>>>> This issue is not seen when a node is brought down gracefully, as
>>>> pacemaker, before promoting a node, calls the stop, start and then
>>>> promote actions. Not sure why pacemaker doesn't call the stop, start
>>>> and promote actions when a node is reset ungracefully.
>>>>
>>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1579025
>>>> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
>>>> ---
>>>>  ovn/utilities/ovndb-servers.ocf | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
>>>> index 164b6bce6..23dc70056 100755
>>>> --- a/ovn/utilities/ovndb-servers.ocf
>>>> +++ b/ovn/utilities/ovndb-servers.ocf
>>>> @@ -409,7 +409,7 @@ ovsdb_server_promote() {
>>>>      rc=$?
>>>>      case $rc in
>>>>          ${OCF_SUCCESS}) ;;
>>>> -        ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};;
>>>> +        ${OCF_RUNNING_MASTER}) ;;
>>>>          *)
>>>>             ovsdb_server_master_update $OCF_RUNNING_MASTER
>>>>             return ${rc}
>>>> --
>>>> 2.17.0
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev@openvswitch.org
>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
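To make the effect of the one-line change concrete, here is a runnable, heavily simplified sketch of the promote flow. The status check is stubbed to always report "already running as master" (the buggy situation described above); the function names follow ovndb-servers.ocf, but this is an illustration, not the actual agent.

```shell
# Standard OCF return codes used by the agent.
OCF_SUCCESS=0
OCF_RUNNING_MASTER=8

ovsdb_server_check_status() {
    # Stub for illustration: pretend the ovsdb servers already report
    # active when promote is called, which is what triggers the bug.
    return $OCF_RUNNING_MASTER
}

ovsdb_server_master_update() {
    # Stand-in for the real master-score update ($CRM_MASTER ...).
    echo "master score recorded (state=$1)"
}

ovsdb_server_promote() {
    ovsdb_server_check_status
    rc=$?
    case $rc in
        ${OCF_SUCCESS}) ;;
        # Pre-patch this branch read "return ${OCF_SUCCESS};;", skipping
        # the master-score update below; the later notify action then
        # demoted the server because no master had been recorded.
        ${OCF_RUNNING_MASTER}) ;;
        *)
            ovsdb_server_master_update $OCF_RUNNING_MASTER
            return $rc
            ;;
    esac
    # With the patch we fall through and record the new master.
    ovsdb_server_master_update $OCF_RUNNING_MASTER
    return $OCF_SUCCESS
}

ovsdb_server_promote
```

Running the sketch prints the "master score recorded" line, i.e. the promote action now completes instead of returning early.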
Sure.

I tried with the settings you suggested, but it is still not able to promote a new master during a kernel panic :( :

Current DC: test7 (version 1.1.14-70404b0) - partition WITHOUT quorum
2 nodes and 3 resources configured

Online: [ test7 ]
OFFLINE: [ test6 ]

Full list of resources:

 VirtualIP (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Stopped: [ test6 test7 ]

As long as the master is stuck in the panic, no new leader gets promoted, which is bad; the cluster only recovers after I force-reboot the box. The promote logic should proceed without manual intervention, and that is not working as expected. Without your patch I see the same results: only if I reboot the stuck master does the cluster restore.

Also, I noticed the IPaddr2 resource shows the errors below:
2018-05-18T21:36:13.794Z|00005|ovsdb_error|ERR|unexpected ovsdb error: Server ID check failed: Self replicating is not allowed
2018-05-18T21:36:13.795Z|00006|ovsdb_jsonrpc_server|INFO|tcp:192.168.220.107:59864: disconnecting (making server read/write)

So I think this kind of race condition is expected, as I am seeing it with the LB code too.

Let me know further.
Regards,
On Sat, May 19, 2018 at 3:12 AM, aginwala <aginwala@asu.edu> wrote:
> Sure.
>
> I tried with the settings you suggested, but it is still not able to
> promote a new master during a kernel panic.
> [...]
> So I think this kind of race condition is expected, as I am seeing it
> with the LB code too.

I tried your commands, and for some reason they didn't work for me. I tested with the commands below and they work as expected: when I trigger a kernel panic on the master node, pacemaker promotes another master. I suspect there is some issue with the ordering of the resources. If you want, you can give it a shot using the commands below.
*******************************************
$ cat setup_pcs_resources.sh
rm -f tmp-cib*
pcs resource delete ip-192.168.121.100
pcs resource delete ovndb_servers
sleep 5
pcs status
pcs cluster cib tmp-cib.xml
cp tmp-cib.xml tmp-cib.xml.deltasrc
pcs -f tmp-cib.xml resource create ip-192.168.121.100 ocf:heartbeat:IPaddr2 ip=192.168.121.100 op monitor interval=30s
pcs -f tmp-cib.xml resource create ovndb_servers ocf:ovn:ovndb-servers manage_northd=no master_ip=192.168.121.100 nb_master_port=6641 sb_master_port=6642 --master
pcs -f tmp-cib.xml resource meta ovndb_servers-master notify=true
pcs -f tmp-cib.xml constraint order start ip-192.168.121.100 then promote ovndb_servers-master
pcs -f tmp-cib.xml constraint colocation add ip-192.168.121.100 with master ovndb_servers-master
pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc
***********************************************************
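One note related to the VIP setup above: earlier in the thread, the standby node's "Cannot assign requested address" bind errors were traced to ovsdb-server trying to bind the master VIP before IPaddr2 configures it, and the tripleo workaround of allowing non-local binds was mentioned. A minimal sketch of that workaround follows; the sysctl.d file name is illustrative, not taken from the thread.

```shell
# Let ovsdb-server bind the master VIP even before IPaddr2 has configured
# the address on this node (the tripleo workaround referenced earlier).
sysctl -w net.ipv4.ip_nonlocal_bind=1

# Persist the setting across reboots (file name is illustrative).
echo "net.ipv4.ip_nonlocal_bind = 1" > /etc/sysctl.d/99-ovn-nonlocal-bind.conf
```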
[ovndb_servers] >>>>> Masters: [ test7 ] >>>>> Slaves: [ test6 ] >>>>> >>>>> root@test6:~# crm stat >>>>> Last updated: Thu May 17 22:46:38 2018 Last change: Thu May 17 >>>>> 22:45:03 2018 by root via cibadmin on test6 >>>>> Stack: corosync >>>>> Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum >>>>> 2 nodes and 3 resources configured >>>>> >>>>> Online: [ test6 ] >>>>> OFFLINE: [ test7 ] >>>>> >>>>> Full list of resources: >>>>> >>>>> VirtualIP (ocf::heartbeat:IPaddr2): Stopped >>>>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>>>> Slaves: [ test6 ] >>>>> Stopped: [ test7 ] >>>>> >>>>> root@test6:~# crm stat >>>>> Last updated: Thu May 17 22:49:26 2018 Last change: Thu May 17 >>>>> 22:45:03 2018 by root via cibadmin on test6 >>>>> Stack: corosync >>>>> Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum >>>>> 2 nodes and 3 resources configured >>>>> >>>>> Online: [ test6 ] >>>>> OFFLINE: [ test7 ] >>>>> >>>>> Full list of resources: >>>>> >>>>> VirtualIP (ocf::heartbeat:IPaddr2): Stopped >>>>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>>>> Stopped: [ test6 test7 ] >>>>> >>>>> I think this change not needed or something else is wrong when using >>>>> virtual IP resource. >>>>> >>>> >>>> Hi Aliasgar, I think you haven't created the resource properly. Or >>>> haven't set the colocation constraints properly. What pcs/crm commands you >>>> used to create OVN db resources ? 
>>>> Can you share the output of "pcs resource show ovndb_servers" and "pcs >>>> constraint" >>>> In case of tripleo we create resource like this - >>>> https://github.com/openstack/puppet-tripleo/blob/master/ma >>>> nifests/profile/pacemaker/ovn_northd.pp#L80 >>>> >>> >>> >>>>> # I am using the same commands suggested upstream in the ovs >>> document to create resource: >>> I am skipping manage northd option with default inactivity probe interval >>> http://docs.openvswitch.org/en/latest/topics/integration/#ha >>> -for-ovn-db-servers-using-pacemaker >>> # cat pcs_with_ipaddr2.sh >>> pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \ >>> params ip="192.168.220.108" op monitor interval="30s" >>> pcs resource create ovndb_servers ocf:ovn:ovndb-servers \ >>> master_ip="192.168.220.108" \ >>> op monitor interval="10s" \ >>> op monitor role=Master interval="15s" --debug >>> pcs resource master ovndb_servers-master ovndb_servers \ >>> meta notify="true" >>> pcs constraint order promote ovndb_servers-master then VirtualIP >>> >> >> I think ordering should be reversed. We want pacemaker to start IPAddr2 >> resource first and then start ovndb_servers resource. May be we need to >> update the document. >> >> Can you please try with the command "pcs constraint order VirtualIP then >> ovndb_servers-master". I think that's why in your setup, IPAddr2 resource >> is not started. 
>> >> Thanks >> Numan >> >> >> >> >>> pcs constraint colocation add VirtualIP with master ovndb_servers-master >>> \ >>> score=INFINITY >>> >>> # pcs resource show ovndb_servers >>> Resource: ovndb_servers (class=ocf provider=ovn type=ovndb-servers) >>> Attributes: master_ip=192.168.220.108 >>> Operations: start interval=0s timeout=30s >>> (ovndb_servers-start-interval-0s) >>> stop interval=0s timeout=20s (ovndb_servers-stop-interval-0 >>> s) >>> promote interval=0s timeout=50s >>> (ovndb_servers-promote-interval-0s) >>> demote interval=0s timeout=50s >>> (ovndb_servers-demote-interval-0s) >>> monitor interval=10s (ovndb_servers-monitor-interval-10s) >>> monitor interval=15s role=Master >>> (ovndb_servers-monitor-interval-15s) >>> # pcs constraint >>> Location Constraints: >>> Ordering Constraints: >>> promote ovndb_servers-master then start VirtualIP (kind:Mandatory) >>> Colocation Constraints: >>> VirtualIP with ovndb_servers-master (score:INFINITY) >>> (rsc-role:Started) (with-rsc-role:Master) >>> >>>> >>>> >>>>> >>>>> May we you need a similar promotion logic that we have for LB with >>>>> pacemaker in the discussion (will submit formal patch soon). I did test >>>>> with kernel panic with LB code change and it works fine where node2 gets >>>>> promoted. Below works fine for LB even if there is kernel panic without >>>>> this change: >>>>> >>>> >>>> This issue is not seen all the time. I have another setup where I don't >>>> see this issue at all. The issue is seen when the IPAddr2 resource is moved >>>> to another slave node and ovsdb-server's start reporting as master as soon >>>> as the IP address is configured. >>>> >>>> When the issue is seen we hit the code here - >>>> https://github.com/openvswitch/ovs/blob/master/ovn/utiliti >>>> es/ovndb-servers.ocf#L412. Ideally when promot action is called, ovsdb >>>> servers will be running as slaves/standby and the promote action promotes >>>> them to master. 
>>>> But when the issue is seen, the ovsdb servers report the
>>>> status as active. Because of that, we don't complete the full promote
>>>> action and return at L412. And later, when the notify action is called, we
>>>> demote the servers because of this -
>>>> https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L176
>>>>
>>> Yes, I agree! As you said, the settings work fine in one cluster, and if
>>> you use another cluster with the same settings, you may see surprises.
>>>
>>>> For a use case like yours (where a load balancer VIP is used), you may
>>>> not see this issue at all, since you will not be using the IPaddr2 resource
>>>> as the master ip.
>>>
>>> Correct, I just wanted to update both the settings to let you know
>>> pacemaker behavior with IPaddr2 vs LB VIP IP.
>>>
>>>>> root@test-pace1-2365293:~# echo c > /proc/sysrq-trigger
>>>>> root@test-pace2-2365308:~# crm stat
>>>>> Last updated: Thu May 17 15:15:45 2018    Last change: Wed May 16 23:10:52 2018 by root via cibadmin on test-pace2-2365308
>>>>> Stack: corosync
>>>>> Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with quorum
>>>>> 2 nodes and 2 resources configured
>>>>>
>>>>> Online: [ test-pace1-2365293 test-pace2-2365308 ]
>>>>>
>>>>> Full list of resources:
>>>>>
>>>>>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>>>>      Masters: [ test-pace1-2365293 ]
>>>>>      Slaves: [ test-pace2-2365308 ]
>>>>>
>>>>> root@test-pace2-2365308:~# crm stat
>>>>> Last updated: Thu May 17 15:15:45 2018    Last change: Wed May 16 23:10:52 2018 by root via cibadmin on test-pace2-2365308
>>>>> Stack: corosync
>>>>> Current DC: test-pace2-2365308 (version 1.1.14-70404b0) - partition WITHOUT quorum
>>>>> 2 nodes and 2 resources configured
>>>>>
>>>>> Online: [ test-pace2-2365308 ]
>>>>> OFFLINE: [ test-pace1-2365293 ]
>>>>>
>>>>> Full list of resources:
>>>>>
>>>>>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>>>>      Slaves: [ test-pace2-2365308 ]
>>>>>      Stopped: [ test-pace1-2365293 ]
>>>>>
>>>>> root@test-pace2-2365308:~# ps aux | grep ovs
>>>>> root     15175  0.0  0.0  18048   372 ?   Ss   15:15   0:00 ovsdb-server: monitoring pid 15176 (healthy)
>>>>> root     15176  0.0  0.0  18312  4096 ?   S    15:15   0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-nb.log --remote=punix:/var/run/openvswitch/ovnnb_db.sock --pidfile=/var/run/openvswitch/ovnnb_db.pid --unixctl=ovnnb_db.ctl --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers --remote=ptcp:6641:0.0.0.0 --sync-from=tcp:192.0.2.254:6641 /etc/openvswitch/ovnnb_db.db
>>>>> root     15184  0.0  0.0  18048   376 ?   Ss   15:15   0:00 ovsdb-server: monitoring pid 15185 (healthy)
>>>>> root     15185  0.0  0.0  18300  4480 ?   S    15:15   0:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log --remote=punix:/var/run/openvswitch/ovnsb_db.sock --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections --private-key=db:OVN_Southbound,SSL,private_key --certificate=db:OVN_Southbound,SSL,certificate --ca-cert=db:OVN_Southbound,SSL,ca_cert --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers --remote=ptcp:6642:0.0.0.0 --sync-from=tcp:192.0.2.254:6642 /etc/openvswitch/ovnsb_db.db
>>>>> root     15398  0.0  0.0  12940   972 pts/0  S+   15:15   0:00 grep --color=auto ovs
>>>>>
>>>>> I just want to point out that I am also seeing the below errors when
>>>>> setting the target with the master IP using the IPaddr2 resource too!
>>>>> 2018-05-17T21:58:51.889Z|00011|ovsdb_jsonrpc_server|ERR|ptcp:6641:192.168.220.108: listen failed: Cannot assign requested address
>>>>> 2018-05-17T21:58:51.889Z|00012|socket_util|ERR|6641:192.168.220.108: bind: Cannot assign requested address
>>>>>
>>>>> That needs to be handled too, since the existing code does throw this error!
>>>>> Only if I skip setting the target does the error go away.
>>>>
>>>> In the case of tripleo, we handle this error by setting the sysctl
>>>> value net.ipv4.ip_nonlocal_bind to 1 -
>>>> https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L67
>>>>
>>> Sweet, I can try setting this to get rid of the socket error.
>>>>
>>>>> Regards,
>>>>> Aliasgar
>>>>>
>>>>> On Thu, May 17, 2018 at 3:04 AM, <nusiddiq@redhat.com> wrote:
>>>>>
>>>>>> From: Numan Siddique <nusiddiq@redhat.com>
>>>>>>
>>>>>> When a node 'A' in the pacemaker cluster running the OVN db servers as master is
>>>>>> brought down ungracefully ('echo b > /proc/sysrq-trigger' for example), pacemaker
>>>>>> is not able to promote any other node to master in the cluster. When pacemaker selects
>>>>>> a node 'B', for instance, to promote, it moves the IPaddr2 resource (i.e. the master ip)
>>>>>> to node 'B'. As soon as the node is configured with the IP address, when the issue is
>>>>>> seen, the OVN db servers which were running as standby earlier transition to active.
>>>>>> Ideally this should not have happened. The ovsdb-servers are expected to remain in
>>>>>> standby until they are promoted. (This needs separate investigation.) When pacemaker
>>>>>> calls the OVN OCF script's promote action, the ovsdb_server_promote function returns
>>>>>> almost immediately without recording the present master. And later, in the notify action,
>>>>>> it demotes the OVN db servers back, since the last known master doesn't match
>>>>>> node 'B's hostname. This results in pacemaker promoting/demoting in a loop.
>>>>>>
>>>>>> This patch fixes the issue by not returning immediately when the promote action is
>>>>>> called if the OVN db servers are running as active. Now it continues with
>>>>>> the ovsdb_server_promote function and records the new master by setting the proper
>>>>>> master score ($CRM_MASTER -N $host_name -v ${master_score}).
>>>>>>
>>>>>> This issue is not seen when a node is brought down gracefully, as pacemaker,
>>>>>> before promoting a node, calls the stop, start and then promote actions. Not sure
>>>>>> why pacemaker doesn't call the stop, start and promote actions when a node is
>>>>>> reset ungracefully.
>>>>>>
>>>>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1579025
>>>>>> Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
>>>>>> ---
>>>>>>  ovn/utilities/ovndb-servers.ocf | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
>>>>>> index 164b6bce6..23dc70056 100755
>>>>>> --- a/ovn/utilities/ovndb-servers.ocf
>>>>>> +++ b/ovn/utilities/ovndb-servers.ocf
>>>>>> @@ -409,7 +409,7 @@ ovsdb_server_promote() {
>>>>>>      rc=$?
>>>>>>      case $rc in
>>>>>>          ${OCF_SUCCESS}) ;;
>>>>>> -        ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};;
>>>>>> +        ${OCF_RUNNING_MASTER}) ;;
>>>>>>          *)
>>>>>>              ovsdb_server_master_update $OCF_RUNNING_MASTER
>>>>>>              return ${rc}
>>>>>> --
>>>>>> 2.17.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> dev mailing list
>>>>>> dev@openvswitch.org
>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
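The tripleo workaround discussed in the thread can also be applied by hand; a minimal sketch (the sysctl key is the real one, the drop-in file name is just a conventional choice):

```shell
# Let ovsdb-server bind the master IP even before IPaddr2 has moved it to
# this node (avoids the "Cannot assign requested address" errors above).
sysctl -w net.ipv4.ip_nonlocal_bind=1

# Persist the setting across reboots.
echo 'net.ipv4.ip_nonlocal_bind = 1' > /etc/sysctl.d/99-ovn-nonlocal-bind.conf
```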
On Thu, May 17, 2018 at 6:04 AM, <nusiddiq@redhat.com> wrote:
> From: Numan Siddique <nusiddiq@redhat.com>
>
> [...]

Thanks, Numan.  I tweaked commit message formatting and applied this
to master and branch-2.9
On Sat, May 26, 2018 at 12:02 AM, Russell Bryant <russell@ovn.org> wrote:
> On Thu, May 17, 2018 at 6:04 AM, <nusiddiq@redhat.com> wrote:
> > [...]
>
> Thanks, Numan.  I tweaked commit message formatting and applied this
> to master and branch-2.9

Thanks Russell.

Numan
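The master-score recording the commit message refers to ($CRM_MASTER) boils down to a crm_master call; roughly (a sketch with illustrative values — in the real OCF script the hostname and score come from the agent's environment):

```shell
# crm_master is Pacemaker's helper for setting a node's promotion score;
# the resource agent runs the equivalent of:
host_name=$(hostname -s)   # assumption: node name is the short hostname
master_score=10            # illustrative value
crm_master -l reboot -N "$host_name" -v "$master_score"
```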
diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
index 164b6bce6..23dc70056 100755
--- a/ovn/utilities/ovndb-servers.ocf
+++ b/ovn/utilities/ovndb-servers.ocf
@@ -409,7 +409,7 @@ ovsdb_server_promote() {
     rc=$?
     case $rc in
         ${OCF_SUCCESS}) ;;
-        ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};;
+        ${OCF_RUNNING_MASTER}) ;;
         *)
             ovsdb_server_master_update $OCF_RUNNING_MASTER
             return ${rc}
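The effect of this one-line change can be illustrated with a self-contained mock of the case statement (a sketch only — `promote_before`/`promote_after` are stand-ins for the two versions of `ovsdb_server_promote`, taking the status probe's return code as an argument instead of calling the real probe):

```shell
#!/bin/sh
# Mock of the ovsdb_server_promote control flow, before and after the patch.
OCF_SUCCESS=0
OCF_RUNNING_MASTER=8

promote_before() {   # old behavior: early return skips master-score recording
    rc=$1
    case $rc in
        ${OCF_SUCCESS}) ;;
        ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};;
        *) return $rc;;
    esac
    echo "master score recorded"
}

promote_after() {    # patched behavior: fall through and record the score
    rc=$1
    case $rc in
        ${OCF_SUCCESS}) ;;
        ${OCF_RUNNING_MASTER}) ;;
        *) return $rc;;
    esac
    echo "master score recorded"
}

# The buggy scenario: the node already reports OCF_RUNNING_MASTER because the
# db servers went active as soon as the VIP landed on them.
promote_before ${OCF_RUNNING_MASTER}   # prints nothing; master never recorded
promote_after  ${OCF_RUNNING_MASTER}   # prints "master score recorded"
```

With the old code the promote action succeeds without ever recording the new master, so the subsequent notify action demotes the node again; the patched version reaches the score-recording step in both the standby and the already-active case.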