Pacemaker
Pacemaker uses [[Corosync|Corosync]] or [[Heartbeat|heartbeat]] as its cluster messaging layer; Corosync (it seems) is the one to go for.

= Links =
*[http://clusterlabs.org/ Cluster Labs]
*[https://github.com/ClusterLabs/resource-agents/ Pacemaker resource agents on github]
*[http://www.linux-ha.org/doc/man-pages/ap-ra-man-pages.html Linux-HA manpages]
*[https://clusterlabs.org/quickstart.html Pacemaker quickstart]
*[https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md pcs/crmsh quick reference]
*[https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_pacemaker_architecture.html Pacemaker Architecture]
*[http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/ Pacemaker Explained]
*[http://fibrevillage.com/sysadmin/304-pcs-command-reference pcs command reference]
*[http://fibrevillage.com/sysadmin/321-pacemaker-and-pcs-on-linux-example-managing-cluster-resource Pacemaker and pcs on Linux example, managing cluster resources]
*[http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs Building a highly available failover cluster with Pacemaker, Corosync & PCS]
*[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/High_Availability_Add-On_Administration/index.html High Availability Add-On Administration (RHEL 7)]
*[https://www.digitalocean.com/community/tutorials/how-to-create-a-high-availability-setup-with-corosync-pacemaker-and-floating-ips-on-ubuntu-14-04 How To Create a High Availability Setup with Corosync, Pacemaker, and Floating IPs on Ubuntu 14.04]
*[http://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services Creating Pacemaker Resources for Lustre Storage Services]
*[http://djlab.com/2013/04/pacemaker-corosync-drbd-cheatsheet/ Pacemaker/Corosync/DRBD cheatsheet]
*[https://redhatlinux.guru/2018/05/22/pacemaker-cheat-sheet/ Pacemaker cheat sheet]
*[https://www.freesoftwareservers.com/wiki/pcs-tips-n-tricks-constraints-delete-resources-3965539.html PCS tips & tricks]
*[https://www.hastexo.com/resources/hints-and-kinks/mandatory-and-advisory-ordering-pacemaker/ Mandatory and advisory ordering in Pacemaker]
*[http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_specifying_a_preferred_location.html Specifying a preferred location]
*[https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-sets-ordering.html Resource sets]
*[https://www.alteeve.com/w/History_of_HA_Clustering History of HA clustering]
*[http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html The OCF Resource Agent Developer’s Guide]
*[https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Administration/_visualizing_the_action_sequence.html Visualizing the action sequence]
*[https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-maintenance.html#sec-ha-maint-shutdown-node Implications of Taking Down a Cluster Node]
*[https://www.programmerall.com/article/11571745093/ Corosync + Pacemaker + CRMSH: build a highly available web cluster]
= Notes =
By specifying a score of -INFINITY, a constraint is mandatory: the resource is never allowed to run where the constraint forbids it.
= Quickstart =
Keep in mind you might want to use dedicated IPs for the sync traffic, so define those in /etc/hosts. On both nodes:
 # set a password for the hacluster user
 passwd hacluster
 systemctl start pcsd.service
 systemctl enable pcsd.service
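From here the usual next steps are to authenticate the nodes to each other and create the cluster. A minimal sketch, assuming pcs 0.10+ syntax and hypothetical node names node-1 and node-2 (run on one node only):
 pcs host auth node-1 node-2 -u hacluster
 pcs cluster setup mycluster node-1 node-2
 pcs cluster start --all
 pcs cluster enable --all
On older pcs (0.9.x) the first two steps are pcs cluster auth and pcs cluster setup --name mycluster instead.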
= Commands/tools =
*crm
*crmadmin
*cibadmin
*pcs
*[[Corosync|corosync]]
= Useful commands =
== save entire config ==
 pcs config backup configfile
== Dump entire crm ==
 cibadmin -Q
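Since cibadmin -Q just prints the live CIB as XML, it doubles as a quick ad-hoc backup (the file name here is only an illustration):
 cibadmin -Q > /root/cib-backup.xml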
= HOWTO =
== Groups ==
=== Add existing resource to group ===
 pcs resource group add GROUPID RESOURCEID
=== Stop resource group ===
 pcs resource disable MYGROUP
=== See if entire group is disabled ===
 pcs resource show MYGROUP
and look for:
 Meta Attrs: target-role=Stopped
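As a usage note, pcs resource group add also creates the group if it does not exist yet, and placement inside the group can be controlled; a sketch with hypothetical resource names:
 pcs resource group add MYGROUP VIP1
 pcs resource group add MYGROUP FS1 --after VIP1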
= FAQ =
== Update resource ==
 pcs resource update resourcename variablename=newvalue
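For example, to change the address of a hypothetical IPaddr2 resource called MyVIP:
 pcs resource update MyVIP ip=192.168.122.50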

== Current DC ==
In the output of
 pcs status
this is the Designated Controller, the node that currently coordinates cluster-wide decisions.

== Remove resource group + members ==
 pcs resource delete whateverresource
== Move resource to node ==
 pcs resource move RES NODE
== Show default resource stickiness ==
 pcs resource defaults
== Set resource stickiness ==
 pcs resource meta <resource_id> resource-stickiness=100
and to check:
 pcs resource show <resource_id>
Or better yet:
 crm_simulate -Ls
== Undo resource move ==
 pcs constraint --full
<pre>
Location Constraints:
  Resource: FOO
    Enabled on: santest-a (score:INFINITY) (role: Started) (id:cli-prefer-FOO)
</pre>
 pcs constraint remove cli-prefer-FOO
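Newer pcs versions also have a shortcut that removes the constraints created by move/ban (assuming pcs 0.10+; check pcs resource help on your version):
 pcs resource clear FOO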
== pcs status: Error: cluster is not currently running on this node ==
Don't panic until after
 sudo pcs status
== show detailed resources ==
 pcs resource --full
== stop node (standby) ==
The following command puts the specified node into standby mode, meaning it can no longer host resources; any resources currently active on it are moved to another node. With --all it puts every node into standby mode.

 pcs cluster standby node-1
or, on the node itself,
 pcs node standby
and undo this with
 pcs cluster unstandby node-1
or
 pcs node unstandby
== set maintenance mode ==
This puts the cluster in maintenance mode, so it stops managing resources:
 pcs property set maintenance-mode=true
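and resume normal management afterwards with:
 pcs property set maintenance-mode=false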
== Error: cluster is not currently running on this node ==
 pcs cluster start [<node name>]
Without a node name this starts cluster services on the local node only.
== Remove a constraint ==
 pcs constraint list --full
to identify the constraint IDs, and then
 pcs constraint remove <whatever-constraint-id>
== Clear error messages ==
 pcs resource cleanup
== Call cib_replace failed (-205): Update was older than existing configuration ==
Whatever you were pushing into the CIB can be applied only once; the live CIB has already moved past it.
== Error signing on to the CIB service: Transport endpoint is not connected ==
Probably SELinux.
== Show allocation scores ==
 crm_simulate -sL
== Show resource failcount ==
 pcs resource failcount show <resource>
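and the companion subcommand resets it after you fix the underlying problem:
 pcs resource failcount reset <resource>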

== export current configuration as commands ==
 pcs config export pcs-commands
== debug resource ==
 pcs resource debug-start resource
== *** Resource management is DISABLED *** The cluster will not attempt to start, stop or recover services ==
The cluster is in maintenance mode.
== Found meta data is "unclean", please apply-al first ==
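This one is a DRBD message rather than Pacemaker; going by the message itself, the fix is presumably to apply the DRBD activity log first:
 drbdadm apply-al <resource>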
== Troubleshooting ==
*[http://blog.clusterlabs.org/blog/2013/debugging-pengine Debugging the policy engine]
== pcs status all resources stopped ==
Probably a bad ordering constraint.
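To check, list the constraints and their IDs, and if an ordering constraint is the culprit, consider recreating it as advisory (kind=Optional); resources A and B here are hypothetical:
 pcs constraint list --full
 pcs constraint remove <order-constraint-id>
 pcs constraint order start A then start B kind=Optional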
== Fencing and resource management disabled due to lack of quorum ==
Probably means you forgot to pcs cluster start the other node.
== Resource cannot run anywhere ==
Check if some stickiness was set.
== pcs resource update unable to find resource ==
Trying to unset stickiness:
 pcs resource update ISCSIgroupTEST1 meta resource-stickiness=
caused:
 Error: Unable to find resource: ISCSIgroupTEST1
What this means is: try it on the host where the stickiness was set :)
== Difference between maintenance-mode and standby ==
Still not entirely clear, but roughly: standby is per-node and evacuates that node's resources to the others, while maintenance-mode is cluster-wide and leaves resources running where they are, unmanaged.
== drbdadm create-md test3: 'test3' not defined in your config (for this host). ==
You're supposed to use the machine's actual hostname (the output of `hostname`) in the 'on ...' stanza.
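A minimal sketch of the relevant config stanza, with hypothetical hostnames, devices and addresses:
<pre>
resource test3 {
  device    /dev/drbd0;
  disk      /dev/sdb1;
  meta-disk internal;
  on node-1 {    # must match `hostname` output on that machine
    address 192.168.10.1:7789;
  }
  on node-2 {
    address 192.168.10.2:7789;
  }
}
</pre>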

== corosync: active/disabled ==
As far as I can tell this (in the Daemon Status part of pcs status) means the corosync service is running but not enabled at boot.
== ocf-exit-reason:Undefined iSCSI target implementation ==
Install scsi-target-utils.
== moving RES away after 1000000 failures ==
1000000 is how Pacemaker represents INFINITY. If the failcount is 0, try pcs resource cleanup.