SLES for SAP maintenance procedure for Scale-Out Perf-Opt HANA cluster
This blog post covers specific maintenance scenarios for a scale-out performance-optimized HANA cluster. It illustrates the procedures defined in the man page SAPHanaSR_maintenance_examples that apply to the scale-out topology.
We are going to cover the following three scenarios:
- HANA takeover procedures
- HANA maintenance (or Linux OS maintenance when a reboot of the node is not required)
- Linux maintenance with reboot
HANA takeover procedures
- Check status of Linux cluster and HANA, show current site names.
- Set SAPHanaController multi-state resource (and the IP and load balancer resources) into maintenance.
- Perform the takeover, make sure to use the suspend primary feature.
- Check if the new primary is working.
- Stop suspended old primary.
- Register old primary as new secondary, make sure to use the correct site name.
- Start the new secondary.
- Check new secondary and its system replication.
- Refresh SAPHanaController multi-state resource.
- Set SAPHanaController multi-state resource (and the IP and load balancer resources) to managed.
- Finally check status of Linux cluster and HANA.
1. Check status of Linux cluster and HANA, show current site names.
suse11:~ # cs_clusterstate -i
### suse11 - 2024-01-11 08:37:23 ###
Cluster state: S_IDLE
suse11:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Thu Jan 11 08:36:59 2024
* Last change: Thu Jan 11 08:36:54 2024 by root via crm_attribute on suse11
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00:
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse11
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse11
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable):
* Masters: [ suse11 ]
* Slaves: [ suse12 suse21 suse22 ]
suse11:~ # SAPHanaSR-showAttr
Global cib-time maintenance prim sec sync_state upd
--------------------------------------------------------------------
TST Thu Jan 11 08:37:58 2024 false ONE TWO SOK ok
Resource maintenance
-------------------------------------
msl_SAPHanaCon_TST_HDB00 false
g_ip_TST_HDB00 false
Sites lpt lss mns srHook srr
---------------------------------------
ONE 1704962278 4 suse11 PRIM P
TWO 30 4 suse21 SOK S
Hosts clone_state gra gsh node_state roles score site
-------------------------------------------------------------------------------
suse11 PROMOTED 2.0 2.2 online master1:master:worker:master 150 ONE
suse12 DEMOTED 2.0 2.2 online slave:slave:worker:slave -10000 ONE
suse21 DEMOTED 2.0 2.2 online master1:master:worker:master 100 TWO
suse22 DEMOTED 2.0 2.2 online slave:slave:worker:slave -12200 TWO
susemm online
suse11:~ #
DISCUSSIONS: Before starting, always check that the running system is in a state that allows the maintenance procedure. The cluster may still be busy with background tasks, so wait until it reports S_IDLE before executing any step of the procedure.
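The waiting described above can be scripted. The following is a minimal sketch, assuming that the "Cluster state: S_IDLE" line printed by cs_clusterstate -i is a sufficient signal. The function name wait_for_idle, the STATE_CMD override and the default retry values are assumptions made for illustration, not part of any SUSE tool.

```shell
# Sketch: poll the cluster state until it reports S_IDLE.
# STATE_CMD is a hypothetical override hook (useful for dry runs);
# by default the cs_clusterstate call from step 1 is used.
STATE_CMD="${STATE_CMD:-cs_clusterstate -i}"

wait_for_idle() {
    wfi_tries=${1:-30}   # maximum number of polls (assumed default)
    wfi_pause=${2:-5}    # seconds between polls (assumed default)
    wfi_i=1
    while [ "$wfi_i" -le "$wfi_tries" ]; do
        # Succeed as soon as the state line shows S_IDLE.
        if $STATE_CMD 2>/dev/null | grep -q 'Cluster state: S_IDLE'; then
            return 0
        fi
        sleep "$wfi_pause"
        wfi_i=$((wfi_i + 1))
    done
    echo "cluster did not reach S_IDLE after $wfi_tries polls" >&2
    return 1
}

# Usage (as root on a cluster node):
#   wait_for_idle 60 5 && echo "cluster is idle, continue with next step"
```

The cs_wait_for_idle command used later in this post serves the same purpose; the sketch only shows the idea behind it.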
2. Set SAPHanaController multi-state resource into maintenance.
suse11:~ # crm resource maintenance msl_SAPHanaCon_TST_HDB00
suse11:~ # crm resource maintenance g_ip_TST_HDB00
suse11:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Thu Jan 11 08:39:39 2024
* Last change: Thu Jan 11 08:39:37 2024 by root via cibadmin on suse11
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00: (unmanaged)
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse11 (unmanaged)
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse11 (unmanaged)
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse12 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Master suse11 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
suse11:~ # SAPHanaSR-showAttr
Global cib-time maintenance prim sec sync_state upd
--------------------------------------------------------------------
TST Thu Jan 11 08:39:37 2024 false ONE TWO SOK ok
Resource maintenance
-------------------------------------
msl_SAPHanaCon_TST_HDB00 true
g_ip_TST_HDB00 true
Sites lpt lss mns srHook srr
---------------------------------------
ONE 1704962342 4 suse11 PRIM P
TWO 30 4 suse21 SOK S
Hosts clone_state gra gsh node_state roles score site
-------------------------------------------------------------------------------
suse11 PROMOTED 2.0 2.2 online master1:master:worker:master 150 ONE
suse12 DEMOTED 2.0 2.2 online slave:slave:worker:slave -10000 ONE
suse21 DEMOTED 2.0 2.2 online master1:master:worker:master 100 TWO
suse22 DEMOTED 2.0 2.2 online slave:slave:worker:slave -12200 TWO
susemm online
suse11:~ #
DISCUSSIONS: Putting the multi-state resource into maintenance first is the best-practice way to start maintenance on a HANA cluster; it is no longer necessary to put the whole cluster into maintenance mode. Putting the virtual IP resource into maintenance is also important: the cluster must not migrate this resource, and it should keep running on its current node. During the maintenance period both resources are managed manually.
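Before moving on, it can help to verify that both resources really show maintenance "true". The sketch below reads SAPHanaSR-showAttr output from stdin and checks the "Resource maintenance" table shown above. The function name in_maintenance is hypothetical, and the parsing assumes the table layout printed in this post.

```shell
# Sketch: succeed only if every resource named on the command line is
# listed with maintenance "true" in SAPHanaSR-showAttr output (stdin).
in_maintenance() {
    awk -v want="$*" '
        BEGIN { n = split(want, names, " ") }
        /^Resource/ { sec = 1; next }          # start of the resource table
        /^(Sites|Hosts|Global)/ { sec = 0 }    # any other table ends it
        sec && NF == 2 { state[$1] = $2 }      # "<resource> <true|false>"
        END {
            for (i = 1; i <= n; i++)
                if (state[names[i]] != "true") exit 1
            exit 0
        }'
}

# Usage:
#   SAPHanaSR-showAttr | in_maintenance msl_SAPHanaCon_TST_HDB00 g_ip_TST_HDB00
```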
3. Perform the takeover, make sure to use the suspend primary feature:
suse11:~ # cs_clusterstate -i
### suse11 - 2024-01-11 08:41:45 ###
Cluster state: S_IDLE
suse11:~ #
tstadm@suse21:/usr/sap/TST/HDB00> hdbnsutil -sr_takeover --suspendPrimary
done.
tstadm@suse21:/usr/sap/TST/HDB00>
DISCUSSIONS: The takeover changes the role of the secondary site to primary. The --suspendPrimary option ensures that the old primary database is not used by applications during this process.
4. Check if the new primary is working.
tstadm@suse21:/usr/sap/TST/HDB00> hdbnsutil -sr_state
System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~
online: true
mode: primary
operation mode: primary
site id: 2
site name: TWO
is source system: true
is secondary/consumer system: false
has secondaries/consumers attached: false
is a takeover active: false
is primary suspended: false
Host Mappings:
~~~~~~~~~~~~~~
suse22 -> [TWO] suse22
suse21 -> [TWO] suse21
Site Mappings:
~~~~~~~~~~~~~~
TWO (primary/primary)
Tier of TWO: 1
Replication mode of TWO: primary
Operation mode of TWO: primary
done.
tstadm@suse21:/usr/sap/TST/HDB00>
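The fields that matter in the output above are "online", "mode" and "is primary suspended". As a rough sketch, they can be checked automatically. The function name is_active_primary is an assumption, and the parsing relies on the "key: value" layout shown in this post.

```shell
# Sketch: read "hdbnsutil -sr_state" output on stdin and succeed if the
# local site reports itself as an online, unsuspended primary.
is_active_primary() {
    awk -F': ' '
        { gsub(/^[ \t]+|[ \t\r]+$/, "") }      # trim surrounding whitespace
        $1 == "online"               { online = $2 }
        $1 == "mode"                 { mode = $2 }
        $1 == "is primary suspended" { susp = $2 }
        END { exit !(online == "true" && mode == "primary" && susp == "false") }'
}

# Usage (as <sid>adm on the new primary):
#   hdbnsutil -sr_state | is_active_primary && echo "new primary looks fine"
```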
5. Stop suspended old primary.
tstadm@suse11:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function StopSystem
11.01.2024 08:45:03
StopSystem
OK
tstadm@suse11:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function WaitforStopped 300 20
11.01.2024 08:49:43
WaitforStopped
OK
tstadm@suse11:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetSystemInstanceList
11.01.2024 08:50:29
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
suse11, 0, 50013, 50014, 0.3, HDB|HDB_WORKER, GRAY
suse12, 0, 50013, 50014, 0.3, HDB|HDB_WORKER, GRAY
tstadm@suse11:/usr/sap/TST/HDB00>
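In a scale-out system every host of the site has to reach the expected state, not only the master nameserver node. The following sketch checks the dispstatus column of GetSystemInstanceList for all listed hosts. The function name all_instances_are is hypothetical, and the parsing assumes the comma-separated layout shown above.

```shell
# Sketch: read GetSystemInstanceList output on stdin and succeed only if
# at least one instance is listed and all of them show the wanted status.
all_instances_are() {
    awk -F', *' -v want="$1" '
        { sub(/[ \t\r]+$/, "") }               # trim trailing whitespace
        NF == 7 && $1 != "hostname" {          # data rows only, skip header
            total++
            if ($7 == want) hits++
        }
        END { exit !(total > 0 && hits == total) }'
}

# Usage (as <sid>adm):
#   sapcontrol -nr 00 -function GetSystemInstanceList | all_instances_are GRAY
# The same check with GREEN can be used after starting the system again.
```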
6. Register old primary as new secondary, make sure to use the correct site name.
tstadm@suse11:/usr/sap/TST/HDB00> hdbnsutil -sr_register --name=ONE --remoteHost=suse21 --remoteInstance=00 --replicationMode=sync --operationMode=logreplay
adding site ...
nameserver suse11:30001 not responding.
collecting information ...
updating local ini files ...
done.
tstadm@suse11:/usr/sap/TST/HDB00>
DISCUSSIONS: This step makes the old primary the new secondary. A common mistake is to register the old primary under a new site name, so check that the existing site name is used.
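One way to guard against a wrong site name is to compare it with the names the cluster already knows before running the registration. The sketch below extracts the site names from the "Sites" table of SAPHanaSR-showAttr. The function names known_sites and site_is_known are assumptions for illustration, and the parsing relies on the table layout shown in this post.

```shell
# Sketch: print the site names listed in the "Sites" table of
# SAPHanaSR-showAttr output (read from stdin), one per line.
known_sites() {
    awk '
        /^Sites/ { sec = 1; next }                 # start of the site table
        /^(Hosts|Global|Resource)/ { sec = 0 }     # any other table ends it
        sec && NF > 1 { print $1 }                 # skip the dashed ruler line
    '
}

# Succeed if the given name is one of the known site names.
site_is_known() {
    known_sites | grep -qxF -- "$1"
}

# Usage (as root, before hdbnsutil -sr_register --name=ONE ...):
#   SAPHanaSR-showAttr | site_is_known ONE || echo "unknown site name, stop"
```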
TODO: Also include the following steps from section “* Check the two site names that are known to the Linux cluster.” from manpage SAPHanaSR_maintenance_examples(7)
# crm configure show suse11 suse21
# crm configure show SAPHanaSR | grep hana_ha1_site_mns
# ssh suse21
# su - ha1adm -c "hdbnsutil -sr_state; echo rc: $?"
# exit
7. Start the new secondary.
tstadm@suse11:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function StartSystem
11.01.2024 08:52:40
StartSystem
OK
tstadm@suse11:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function WaitforStarted 300 20
11.01.2024 08:54:07
WaitforStarted
OK
tstadm@suse11:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetSystemInstanceList
11.01.2024 08:54:29
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
suse11, 0, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
suse12, 0, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
tstadm@suse11:/usr/sap/TST/HDB00>
8. Check new secondary and its system replication.
tstadm@suse11:/usr/sap/TST/HDB00> hdbnsutil -sr_state
System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~
online: true
mode: sync
operation mode: logreplay
site id: 1
site name: ONE
is source system: false
is secondary/consumer system: true
has secondaries/consumers attached: false
is a takeover active: false
is primary suspended: false
is timetravel enabled: false
replay mode: auto
active primary site: 2
primary masters: suse21
Host Mappings:
~~~~~~~~~~~~~~
suse12 -> [TWO] suse22
suse12 -> [ONE] suse12
suse11 -> [TWO] suse21
suse11 -> [ONE] suse11
Site Mappings:
~~~~~~~~~~~~~~
TWO (primary/primary)
|---ONE (sync/logreplay)
Tier of TWO: 1
Tier of ONE: 2
Replication mode of TWO: primary
Replication mode of ONE: sync
Operation mode of TWO: primary
Operation mode of ONE: logreplay
Mapping: TWO -> ONE
done.
tstadm@suse11:/usr/sap/TST/HDB00>
tstadm@suse21:/usr/sap/TST/HDB00/exe/python_support> python systemReplicationStatus.py
| Database | Host | Port | Service Name | Volume ID | Site ID | Site Name | Secondary | Secondary | Secondary | Secondary | Secondary | Replication | Replication | Replication |
| | | | | | | | Host | Port | Site ID | Site Name | Active Status | Mode | Status | Status Details |
| -------- | ------ | ----- | ------------ | --------- | ------- | --------- | --------- | --------- | --------- | --------- | ------------- | ----------- | ----------- | -------------- |
| TST | suse22 | 30003 | indexserver | 4 | 2 | TWO | suse12 | 30003 | 1 | ONE | YES | SYNC | ACTIVE | |
| SYSTEMDB | suse21 | 30001 | nameserver | 1 | 2 | TWO | suse11 | 30001 | 1 | ONE | YES | SYNC | ACTIVE | |
| TST | suse21 | 30007 | xsengine | 3 | 2 | TWO | suse11 | 30007 | 1 | ONE | YES | SYNC | ACTIVE | |
| TST | suse21 | 30003 | indexserver | 2 | 2 | TWO | suse11 | 30003 | 1 | ONE | YES | SYNC | ACTIVE | |
status system replication site "1": ACTIVE
overall system replication status: ACTIVE
Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mode: PRIMARY
site id: 2
site name: TWO
tstadm@suse21:/usr/sap/TST/HDB00/exe/python_support>
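The line to look for in the output above is "overall system replication status: ACTIVE". A small sketch for picking it out follows; the function names overall_sr_status and sr_is_active are hypothetical, and only the text output shown in this post is parsed.

```shell
# Sketch: read systemReplicationStatus.py output on stdin and print the
# overall replication status word (e.g. ACTIVE).
overall_sr_status() {
    sed -n 's/^overall system replication status:[[:space:]]*\([A-Za-z_]*\).*/\1/p'
}

# Succeed only if the overall status is ACTIVE.
sr_is_active() {
    [ "$(overall_sr_status)" = "ACTIVE" ]
}

# Usage (as <sid>adm on the primary, in exe/python_support):
#   python systemReplicationStatus.py | sr_is_active || echo "replication not in sync yet"
```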
9. Refresh SAPHanaController multi-state resource.
suse11:~ # crm resource refresh msl_SAPHanaCon_TST_HDB00
Cleaned up rsc_SAPHanaCon_TST_HDB00:0 on suse12
Cleaned up rsc_SAPHanaCon_TST_HDB00:1 on suse21
Cleaned up rsc_SAPHanaCon_TST_HDB00:2 on susemm
Cleaned up rsc_SAPHanaCon_TST_HDB00:2 on suse11
Cleaned up rsc_SAPHanaCon_TST_HDB00:3 on suse22
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on suse22
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on suse12
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on susemm
... got reply
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on suse21
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on suse11
Waiting for 9 replies from the controller
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply (done)
suse11:~ #
suse11:~ # SAPHanaSR-showAttr
Global cib-time maintenance prim sec sync_state upd
--------------------------------------------------------------------
TST Thu Jan 11 08:56:55 2024 false ONE TWO SOK ok
Resource maintenance
-------------------------------------
msl_SAPHanaCon_TST_HDB00 true
g_ip_TST_HDB00 true
Sites lpt lss mns srHook srr
--------------------------------
ONE 30 4 suse11 SOK S
TWO 30 4 suse21 PRIM P
Hosts clone_state gra gsh node_state roles score site
-------------------------------------------------------------------------------
suse11 DEMOTED 2.0 2.2 online master1:master:worker:master 100 ONE
suse12 DEMOTED 2.0 2.2 online slave:slave:worker:slave -12200 ONE
suse21 DEMOTED 2.0 2.2 online master1:master:worker:master 150 TWO
suse22 DEMOTED 2.0 2.2 online slave:slave:worker:slave -10000 TWO
susemm online
suse11:~ #
DISCUSSIONS: Refreshing the resource ensures that the resource agents receive the new state and attribute values.
10. Set SAPHanaController multi-state resource to managed.
suse11:~ # crm resource maintenance g_ip_TST_HDB00 off
suse11:~ # crm resource maintenance msl_SAPHanaCon_TST_HDB00 off
suse11:~ # SAPHanaSR-showAttr
Global cib-time maintenance prim sec sync_state upd
--------------------------------------------------------------------
TST Thu Jan 11 08:58:27 2024 false ONE TWO SOK ok
Resource maintenance
-------------------------------------
msl_SAPHanaCon_TST_HDB00 false
g_ip_TST_HDB00 false
Sites lpt lss mns srHook srr
--------------------------------
ONE 30 4 suse11 SOK S
TWO 30 4 suse21 PRIM P
Hosts clone_state gra gsh node_state roles score site
-------------------------------------------------------------------------------
suse11 DEMOTED 2.0 2.2 online master1:master:worker:master 100 ONE
suse12 DEMOTED 2.0 2.2 online slave:slave:worker:slave -12200 ONE
suse21 PROMOTED 2.0 2.2 online master1:master:worker:master 150 TWO
suse22 DEMOTED 2.0 2.2 online slave:slave:worker:slave -10000 TWO
susemm online
suse11:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Thu Jan 11 08:58:38 2024
* Last change: Thu Jan 11 08:58:36 2024 by root via crm_attribute on suse21
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00:
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable):
* Masters: [ suse21 ]
* Slaves: [ suse11 suse12 suse22 ]
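After the resources are managed again, the promoted instance should be on the new primary master node (suse21 in this example). The sketch below pulls the promoted node out of the cluster status text. The function name promoted_nodes is an assumption, and the parsing only handles the collapsed "Masters: [ ... ]" form shown above; the "Promoted:" label is accepted as well because other Pacemaker builds may print it that way.

```shell
# Sketch: read cluster status text on stdin and print the node(s)
# listed as promoted for a promotable clone.
promoted_nodes() {
    awk '
        /^[ \t]*\*?[ \t]*(Masters|Promoted):/ {
            sub(/^[^[]*\[[ \t]*/, "")   # drop everything up to and including "["
            sub(/[ \t]*\].*$/, "")      # drop the closing "]" and the rest
            print
        }'
}

# Usage (the status text is whatever command produced the summary above):
#   [ "$(crm_mon -1r | promoted_nodes)" = "suse21" ] && echo "promoted on new primary"
```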
11. Finally check status of Linux cluster.
suse11:~ # cs_clusterstate -i
### suse11 - 2024-01-11 08:59:45 ###
Cluster state: S_IDLE
suse11:~ #
HANA maintenance (or Linux OS maintenance when a reboot of the node is not required)
- Check if everything looks fine.
- Set the SAPHanaController multi-state resource into maintenance mode.
- Perform the HANA maintenance, e.g. update to latest SPS.
- Tell the cluster to forget about HANA status and to reprobe the resources.
- Set the SAPHanaController multi-state resource back to managed.
- Remove the meta attribute from CIB, optional.
- Check if everything looks fine
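The order of the steps above matters, so it can be useful to print the command sequence for review before running anything. The following sketch only prints the commands used in this post and executes nothing. The function name hana_maintenance_plan is hypothetical, and the optional removal of the meta attribute is left out because its command is not shown here.

```shell
# Sketch: print (do not execute) the command sequence for HANA
# maintenance without reboot, for the given multi-state resource.
hana_maintenance_plan() {
    hmp_rsc=$1
    cat <<EOF
cs_clusterstate -i
crm resource maintenance $hmp_rsc
# ... perform the HANA maintenance here ...
crm resource refresh $hmp_rsc
crm resource maintenance $hmp_rsc off
cs_clusterstate -i
EOF
}

# Usage:
#   hana_maintenance_plan msl_SAPHanaCon_TST_HDB00
```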
1. Check if everything looks fine.
suse21:/home/azureuser # cs_clusterstate -i
### suse21 - 2024-01-21 11:19:19 ###
Cluster state: S_IDLE
suse21:/home/azureuser #
TODO: Also include the following steps from section “* Check status of Linux cluster and HANA system replication pair.” from manpage SAPHanaSR_maintenance_examples(7)
# cs_clusterstate
# crm_mon -1r
# crm configure show | grep cli-
# SAPHanaSR-showAttr
# cs_clusterstate -i
2. Set the SAPHanaController multi-state resource into maintenance mode.
suse21:/home/azureuser # crm resource maintenance msl_SAPHanaCon_TST_HDB00
suse21:/home/azureuser #
Cluster Summary:
* Stack: corosync
* Current DC: suse21 (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Sun Jan 21 11:20:16 2024
* Last change: Sun Jan 21 11:20:14 2024 by root via cibadmin on suse21
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00:
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Master suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse11 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse12 (unmanaged)
3. Perform the HANA maintenance, e.g. update to latest SPS
4. Tell the cluster to forget about HANA status and to reprobe the resources.
suse22:~ # crm resource refresh msl_SAPHanaCon_TST_HDB00
Cleaned up rsc_SAPHanaCon_TST_HDB00:0 on suse22
Cleaned up rsc_SAPHanaCon_TST_HDB00:1 on suse21
Cleaned up rsc_SAPHanaCon_TST_HDB00:2 on suse11
Cleaned up rsc_SAPHanaCon_TST_HDB00:3 on suse12
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on susemm
Waiting for 5 replies from the controller
... got reply
... got reply
... got reply
... got reply
... got reply (done)
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: suse21 (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Sun Jan 21 11:22:20 2024
* Last change: Sun Jan 21 11:22:14 2024 by hacluster via crmd on suse22
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00:
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse11 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse12 (unmanaged)
5. Set the SAPHanaController multi-state resource back to managed.
suse22:~ # crm resource maintenance msl_SAPHanaCon_TST_HDB00 off
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: suse21 (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Sun Jan 21 11:24:30 2024
* Last change: Sun Jan 21 11:24:28 2024 by root via crm_attribute on suse21
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00:
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable):
* Masters: [ suse21 ]
* Slaves: [ suse11 suse12 suse22 ]
6. Check if everything looks fine.
suse22:~ # cs_clusterstate -i
### suse22 - 2024-01-21 11:26:33 ###
Cluster state: S_IDLE
suse22:~ #
TODO: Also include the following steps from section “* Check status of Linux cluster and HANA system replication pair.” from manpage SAPHanaSR_maintenance_examples(7)
# cs_clusterstate
# crm_mon -1r
# crm configure show | grep cli-
# SAPHanaSR-showAttr
# cs_clusterstate -i
Linux maintenance with reboot
- Check the cluster and put the multi-state resource and the ip group resource into maintenance
- Set the maintenance on the whole cluster
- Stop the cluster on the secondary site nodes where the maintenance is supposed to take place
- Manually stop HANA on the node where maintenance is supposed to be done
- Disable the pacemaker on the node where the reboot is required after the maintenance
- Perform the maintenance and reboot if required
- Enable the pacemaker after the reboot
- Start the HANA manually
- Start the cluster on the secondary site nodes
- Refresh the cln_ and msl_ resources
- Remove the global maintenance from cluster
- Remove maintenance from the multi-state resource and ip group resource
- Check the Cluster Status
- Perform the takeover as described in "HANA takeover procedures" above, then rerun steps 1 to 13 on the new secondary site nodes
- Perform the maintenance on the majority maker node
1. Check the cluster and put the multi-state resource and the ip group resource into maintenance
suse22:~ # cs_clusterstate -i
### suse22 - 2024-01-29 15:26:09 ###
Cluster state: S_IDLE
suse22:~ # crm resource maintenance msl_SAPHanaCon_TST_HDB00
suse22:~ # crm resource maintenance g_ip_TST_HDB00
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Mon Jan 29 15:26:33 2024
* Last change: Mon Jan 29 15:26:31 2024 by root via cibadmin on suse22
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00 (unmanaged):
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21 (unmanaged)
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21 (unmanaged)
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse11 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse12 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Master suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
2. Set the maintenance on the whole cluster
suse22:~ # crm maintenance on
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Mon Jan 29 15:27:09 2024
* Last change: Mon Jan 29 15:27:06 2024 by root via cibadmin on suse22
* 5 nodes configured
* 13 resource instances configured
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm (unmanaged)
* Resource Group: g_ip_TST_HDB00 (unmanaged):
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21 (unmanaged)
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21 (unmanaged)
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00] (unmanaged):
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse11 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse12 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse21 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse22 (unmanaged)
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse11 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse12 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Master suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
3. Stop the cluster on the secondary site nodes where the maintenance is supposed to take place
suse22:~ # crm cluster stop suse11 suse12
INFO: The cluster stack stopped on suse11
INFO: The cluster stack stopped on suse12
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Mon Jan 29 15:28:20 2024
* Last change: Mon Jan 29 15:27:06 2024 by root via cibadmin on suse22
* 5 nodes configured
* 13 resource instances configured
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
Node List:
* Online: [ suse21 suse22 susemm ]
* OFFLINE: [ suse11 suse12 ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm (unmanaged)
* Resource Group: g_ip_TST_HDB00 (unmanaged):
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21 (unmanaged)
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21 (unmanaged)
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00] (unmanaged):
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse11 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse12 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse21 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse22 (unmanaged)
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse11 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse12 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Master suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
DISCUSSIONS: There are two reasons to stop the cluster on the nodes where the maintenance will be performed.
First, this maintenance procedure is meant for patching the OS as well as the cluster software stack, which is best done while the cluster is stopped.
Second, if anything goes wrong during the maintenance, the cluster can be ruled out as the source of the problem because it is stopped.
4. Manually stop HANA on the node where maintenance is supposed to be done
suse11:~ # su - tstadm
tstadm@suse11:/usr/sap/TST/HDB00> HDB info
USER PID PPID %CPU VSZ RSS COMMAND
tstadm 28839 28837 1.2 14404 7348 -sh
tstadm 29030 28839 0.0 8284 3960 _ /bin/sh /usr/sap/TST/HDB00/HDB info
tstadm 29061 29030 0.0 17848 3984 _ ps fx -U tstadm -o user:8,pid:8,ppid:8,pcpu:5,vsz:10,rss:10,args
tstadm 6037 1 0.0 686404 51136 hdbrsutil --start --port 30003 --volume 2 --volumesuffix mnt00001/hdb00002.00003 --identi
tstadm 5103 1 0.0 686076 50864 hdbrsutil --start --port 30001 --volume 1 --volumesuffix mnt00001/hdb00001 --identifier 1
tstadm 4584 1 0.0 9572 3240 sapstart pf=/usr/sap/TST/SYS/profile/TST_HDB00_suse11
tstadm 4591 4584 0.0 432728 72492 _ /usr/sap/TST/HDB00/suse11/trace/hdb.sapTST_HDB00 -d -nw -f /usr/sap/TST/HDB00/suse11/d
tstadm 4609 4591 0.7 9770660 1652796 _ hdbnameserver
tstadm 4928 4591 0.2 424136 126520 _ hdbcompileserver
tstadm 4931 4591 0.2 692988 155156 _ hdbpreprocessor
tstadm 5053 4591 0.6 9781348 1813164 _ hdbindexserver -port 30003
tstadm 5064 4591 0.4 5054852 1079784 _ hdbxsengine -port 30007
tstadm 5695 4591 0.2 2386460 419944 _ hdbwebdispatcher
tstadm 2181 1 0.0 482660 31228 /usr/sap/TST/HDB00/exe/sapstartsrv pf=/usr/sap/TST/SYS/profile/TST_HDB00_suse11 -D -u tsta
tstadm 2097 1 0.0 46864 11096 /usr/lib/systemd/systemd --user
tstadm 2098 2097 0.0 78448 3992 _ (sd-pam)
tstadm@suse11:/usr/sap/TST/HDB00> HDB stop
hdbdaemon will wait maximal 300 seconds for NewDB services finishing.
Stopping instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function Stop 400
29.01.2024 15:29:33
Stop
OK
Waiting for stopped instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function WaitforStopped 600 2
29.01.2024 15:34:43
WaitforStopped
OK
hdbdaemon is stopped.
tstadm@suse11:/usr/sap/TST/HDB00>
DISCUSSIONS: This is only required when the node needs to be rebooted. The reboot would stop all processes, including HANA, but stopping HANA manually beforehand is advised so that any HANA-related problem shows up during the manual stop.
5. Disable the pacemaker on the node where the reboot is required after the maintenance
suse11:~ # systemctl disable pacemaker.service
Removed /etc/systemd/system/multi-user.target.wants/pacemaker.service.
suse11:~ #
DISCUSSIONS: Disabling the pacemaker service prevents an unintended start of the cluster after the node reboots.
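Before rebooting, a quick check that the service is really disabled can save a surprise. This is a minimal sketch based on the output of systemctl is-enabled; the function name pacemaker_autostart_disabled and the IS_ENABLED_CMD override are assumptions for illustration.

```shell
# Sketch: succeed only if pacemaker is not enabled for automatic start.
# IS_ENABLED_CMD is a hypothetical override hook (useful for dry runs).
IS_ENABLED_CMD="${IS_ENABLED_CMD:-systemctl is-enabled pacemaker.service}"

pacemaker_autostart_disabled() {
    # systemctl is-enabled prints the unit file state, e.g. "disabled".
    [ "$($IS_ENABLED_CMD 2>/dev/null)" = "disabled" ]
}

# Usage (as root on the node that will be rebooted):
#   pacemaker_autostart_disabled || echo "pacemaker would start at boot"
```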
6. Perform the maintenance and reboot if required
7. Enable the pacemaker after the reboot
suse11:~ # systemctl enable pacemaker.service
Created symlink /etc/systemd/system/multi-user.target.wants/pacemaker.service → /usr/lib/systemd/system/pacemaker.service.
suse11:~ #
8. Start the HANA manually
suse11:~ # su - tstadm
tstadm@suse11:/usr/sap/TST/HDB00> HDB info
USER PID PPID %CPU VSZ RSS COMMAND
tstadm 5017 5016 0.2 14404 7304 -sh
tstadm 5319 5017 0.0 8284 3968 _ /bin/sh /usr/sap/TST/HDB00/HDB info
tstadm 5350 5319 0.0 17848 3908 _ ps fx -U tstadm -o user:8,pid:8,ppid:8,pcpu:5,vsz:10,rss:10,args
tstadm 2191 1 0.2 416392 30192 /usr/sap/TST/HDB00/exe/sapstartsrv pf=/usr/sap/TST/SYS/profile/TST_HDB00_suse11 -D -u tsta
tstadm 2100 1 0.0 46808 10972 /usr/lib/systemd/systemd --user
tstadm 2101 2100 0.0 78472 3996 _ (sd-pam)
tstadm@suse11:/usr/sap/TST/HDB00> HDB start
StartService
Impromptu CCC initialization by 'rscpCInit'.
See SAP note 1266393.
OK
OK
Starting instance using: /usr/sap/TST/SYS/exe/hdb/sapcontrol -prot NI_HTTP -nr 00 -function StartWait 2700 2
29.01.2024 15:39:24
Start
OK
29.01.2024 15:40:19
StartWait
OK
tstadm@suse11:/usr/sap/TST/HDB00>
DISCUSSIONS: It is best practice to start HANA manually after a reboot and before the cluster is started, so that the cluster later finds HANA in a state as close as possible to the one it was in when maintenance was set.
9. Start the cluster on the secondary site nodes
suse22:~ # crm cluster start suse11 suse12
INFO: The cluster stack started on suse11
INFO: The cluster stack started on suse12
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Mon Jan 29 15:41:22 2024
* Last change: Mon Jan 29 15:40:11 2024 by root via crm_attribute on suse21
* 5 nodes configured
* 13 resource instances configured
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm (unmanaged)
* Resource Group: g_ip_TST_HDB00 (unmanaged):
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21 (unmanaged)
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21 (unmanaged)
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00] (unmanaged):
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse12 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse21 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse22 (unmanaged)
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Master suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
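The status output above shows all five nodes online again. As a hedged sketch, this check could be scripted by parsing the "Online:" line of crm_mon. The function name nodes_online is hypothetical, and the sketch assumes the one-line "Online: [ ... ]" node list format shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: reads crm_mon output on stdin and succeeds only if
# every node name passed as an argument appears in the "Online: [ ... ]" line.
nodes_online() {
    # Extract the text between the square brackets of the Online line.
    online=$(sed -n 's/.*Online: \[\(.*\)].*/\1/p')
    for node in "$@"; do
        case " $online " in
            *" $node "*) ;;                                  # node is listed
            *) echo "not online: $node" >&2; return 1 ;;
        esac
    done
}

# Usage after starting the cluster on the secondary site nodes:
#   crm_mon -1r | nodes_online suse11 suse12 && echo "secondary site is online"
```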
10. Refresh the cln_ and msl_ resources
suse22:~ # cs_clusterstate -i
### suse22 - 2024-01-29 15:42:35 ###
Cluster state: S_IDLE
suse22:~ # crm resource refresh cln_SAPHanaTop_TST_HDB00
Cleaned up rsc_SAPHanaTop_TST_HDB00:0 on suse12
Cleaned up rsc_SAPHanaTop_TST_HDB00:0 on suse11
Cleaned up rsc_SAPHanaTop_TST_HDB00:1 on suse21
Cleaned up rsc_SAPHanaTop_TST_HDB00:2 on suse22
Cleaned up rsc_SAPHanaTop_TST_HDB00:3 on susemm
Cleaned up rsc_SAPHanaTop_TST_HDB00:4 on suse22
Cleaned up rsc_SAPHanaTop_TST_HDB00:4 on suse12
Cleaned up rsc_SAPHanaTop_TST_HDB00:4 on susemm
Cleaned up rsc_SAPHanaTop_TST_HDB00:4 on suse21
... got reply
... got reply
... got reply
... got reply
... got reply
Cleaned up rsc_SAPHanaTop_TST_HDB00:4 on suse11
Waiting for 5 replies from the controller
... got reply
... got reply
... got reply
... got reply
... got reply (done)
suse22:~ # cs_wait_for_idle -s 5
Cluster state: S_IDLE
suse22:~ # crm resource refresh msl_SAPHanaCon_TST_HDB00
Cleaned up rsc_SAPHanaCon_TST_HDB00:0 on suse12
Cleaned up rsc_SAPHanaCon_TST_HDB00:0 on suse21
Cleaned up rsc_SAPHanaCon_TST_HDB00:0 on suse11
Cleaned up rsc_SAPHanaCon_TST_HDB00:1 on suse22
Cleaned up rsc_SAPHanaCon_TST_HDB00:2 on susemm
Cleaned up rsc_SAPHanaCon_TST_HDB00:3 on suse22
Cleaned up rsc_SAPHanaCon_TST_HDB00:3 on suse12
Cleaned up rsc_SAPHanaCon_TST_HDB00:3 on susemm
Cleaned up rsc_SAPHanaCon_TST_HDB00:3 on suse21
Cleaned up rsc_SAPHanaCon_TST_HDB00:3 on suse11
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on suse22
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on suse12
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on susemm
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on suse21
Cleaned up rsc_SAPHanaCon_TST_HDB00:4 on suse11
Waiting for 15 replies from the controller
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply
... got reply (done)
suse22:~ # cs_wait_for_idle -s 5
Cluster state: S_IDLE
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Mon Jan 29 15:44:11 2024
* Last change: Mon Jan 29 15:44:03 2024 by hacluster via crmd on susemm
* 5 nodes configured
* 13 resource instances configured
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm (unmanaged)
* Resource Group: g_ip_TST_HDB00 (unmanaged):
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21 (unmanaged)
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21 (unmanaged)
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00] (unmanaged):
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse12 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse21 (unmanaged)
* rsc_SAPHanaTop_TST_HDB00 (ocf::suse:SAPHanaTopology): Started suse22 (unmanaged)
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
DISCUSSIONS: Refreshing the resources makes the cluster probe their current state again and update the attribute values to match it.
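The refresh sequence of step 10 pairs every refresh with a wait for the cluster to become idle. A minimal sketch of that pattern follows. The function name refresh_plan is hypothetical, and it only prints the commands so that they can be reviewed before anything is executed. The commands and resource names are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: prints, for each resource name given as an argument,
# the refresh command followed by a wait until the cluster is idle again.
# Nothing is executed here; the output is a plan to review or pipe into sh.
refresh_plan() {
    for rsc in "$@"; do
        printf 'crm resource refresh %s\n' "$rsc"
        printf 'cs_wait_for_idle -s 5\n'
    done
}

# Usage as root on a cluster node:
#   refresh_plan cln_SAPHanaTop_TST_HDB00 msl_SAPHanaCon_TST_HDB00          # review
#   refresh_plan cln_SAPHanaTop_TST_HDB00 msl_SAPHanaCon_TST_HDB00 | sh -e  # run
```

Running the plan with sh -e stops at the first failing command, so a failed refresh is not followed by the next one.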
11. Remove the global maintenance from the cluster
suse22:~ # crm maintenance off
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Mon Jan 29 15:45:14 2024
* Last change: Mon Jan 29 15:45:07 2024 by root via cibadmin on suse22
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00 (unmanaged):
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21 (unmanaged)
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21 (unmanaged)
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable, unmanaged):
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse21 (unmanaged)
* rsc_SAPHanaCon_TST_HDB00 (ocf::suse:SAPHanaController): Slave suse22 (unmanaged)
12. Remove maintenance from the multi-state resource and the IP group resource
suse22:~ # crm resource maintenance g_ip_TST_HDB00 off
suse22:~ # crm resource maintenance msl_SAPHanaCon_TST_HDB00 off
suse22:~ #
Cluster Summary:
* Stack: corosync
* Current DC: susemm (version 2.1.5+20221208.a3f44794f-150500.6.5.8-2.1.5+20221208.a3f44794f) - partition with quorum
* Last updated: Mon Jan 29 15:46:54 2024
* Last change: Mon Jan 29 15:46:52 2024 by root via crm_attribute on suse21
* 5 nodes configured
* 13 resource instances configured
Node List:
* Online: [ suse11 suse12 suse21 suse22 susemm ]
Active Resources:
* stonith-sbd (stonith:external/sbd): Started susemm
* Resource Group: g_ip_TST_HDB00:
* rsc_ip_TST_HDB00 (ocf::heartbeat:IPaddr2): Started suse21
* rsc_nc_TST_HDB00 (ocf::heartbeat:azure-lb): Started suse21
* Clone Set: cln_SAPHanaTop_TST_HDB00 [rsc_SAPHanaTop_TST_HDB00]:
* Started: [ suse11 suse12 suse21 suse22 ]
* Clone Set: msl_SAPHanaCon_TST_HDB00 [rsc_SAPHanaCon_TST_HDB00] (promotable):
* Masters: [ suse21 ]
* Slaves: [ suse11 suse12 suse22 ]
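After steps 11 and 12, no line of the status output should carry the "unmanaged" marker any more. As a hedged sketch, this can be checked by counting those lines. The function name count_unmanaged is hypothetical, and the sketch assumes the crm_mon output format shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: reads crm_mon output on stdin and prints the number of
# lines that still mention "unmanaged" or the global maintenance banner.
count_unmanaged() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c -e 'unmanaged' -e 'Resource management is DISABLED' || true
}

# Usage after removing all maintenance settings:
#   [ "$(crm_mon -1r | count_unmanaged)" = "0" ] && echo "all resources are managed"
```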
13. Check the cluster status
suse22:~ # cs_clusterstate -i
### suse22 - 2024-01-29 15:47:11 ###
Cluster state: S_IDLE
suse22:~ #
In addition, run the following checks from the section “* Check status of Linux cluster and HANA system replication pair.” of the man page SAPHanaSR_maintenance_examples(7):
# cs_clusterstate
# crm_mon -1r
# crm configure show | grep cli-
# SAPHanaSR-showAttr
# cs_clusterstate -i
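The grep for "cli-" in the configuration looks for location constraints whose IDs start with cli-, such as those left behind by a resource move or ban. A minimal sketch of a slightly stricter filter follows. The function name list_cli_constraints is hypothetical, and the sketch assumes that crm configure show prints such constraints as "location <id> <resource> ...".

```shell
#!/bin/sh
# Hypothetical helper: reads "crm configure show" output on stdin and prints
# the IDs of location constraints whose ID starts with "cli-".
list_cli_constraints() {
    awk '$1 == "location" && $2 ~ /^cli-/ { print $2 }'
}

# Usage as root on a cluster node:
#   crm configure show | list_cli_constraints
```

Empty output means no such constraint is present. Any ID that is printed deserves a closer look before the maintenance is considered finished.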
14. Perform the takeover as described in the section “HANA takeover procedures”, then repeat steps 1 to 13 on the nodes of the new secondary site
15. Perform the maintenance on the majority maker node
Please also read our other blogs about #TowardsZeroDowntime.