File bug-921102_pacemaker-crmd-reset-stonith-failcount.patch of Package pacemaker.9287
commit 53d7d54d5a33857c3331f0dbc44eadf6d092c90c
Author: Gao,Yan <ygao@suse.com>
Date: Tue Mar 10 16:02:33 2015 +0100
Fix: crmd: Reset stonith failcount to recover transitioner when the node rejoins
CRMd transitioner could not recover from "Too many failures to fence".
Steps to reproduce:
1. Two-node cluster with stonith, for example using IPMI.
2. Node-1 has a complete power outage for a couple of minutes. The
IPMI device is also without power, which causes the fencing to fail.
3. Node-2 tries to fence node-1 several times but fails each time.
4. Node-2 reports "Too many failures to fence node-1 (11), giving up".
5. The power returns and node-1 boots up normally.
6. Node-1 rejoins the cluster, but resources are not started on it.
Expected result:
The stonith failcount for node-1 should be reset and resources should
be started on node-1.
Actual result:
Node-2 still logs "Too many failures to fence" and resources are not
started on node-1.
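The failure mode described above can be modelled with a small per-node
counter. This is only an illustrative sketch, not the crmd implementation:
the demo_* names, the fixed-size table, and the limit of 10 (inferred from
the "(11), giving up" log message) are all assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_NODES 8
/* Assumed limit: the log shows "(11), giving up", so give up once count > 10. */
#define DEMO_MAX_FENCE_FAILURES 10

/* Hypothetical record: one fencing failure counter per node name. */
struct demo_fail_entry {
    char uname[64];
    int count;
};

static struct demo_fail_entry demo_table[DEMO_MAX_NODES];
static int demo_entries = 0;

/* Find the entry for uname; optionally create it if the table has room. */
static struct demo_fail_entry *demo_lookup(const char *uname, bool create)
{
    for (int i = 0; i < demo_entries; i++) {
        if (strcmp(demo_table[i].uname, uname) == 0) {
            return &demo_table[i];
        }
    }
    if (!create || demo_entries == DEMO_MAX_NODES) {
        return NULL;
    }
    struct demo_fail_entry *e = &demo_table[demo_entries++];
    snprintf(e->uname, sizeof(e->uname), "%s", uname);
    e->count = 0;
    return e;
}

/* Record one failed fencing attempt against uname. */
void demo_fail_count_increment(const char *uname)
{
    assert(uname != NULL);
    struct demo_fail_entry *e = demo_lookup(uname, true);
    if (e != NULL) {
        e->count++;
    }
}

/* Forget earlier failures, e.g. when the node is seen alive again. */
void demo_fail_count_reset(const char *uname)
{
    assert(uname != NULL);
    struct demo_fail_entry *e = demo_lookup(uname, false);
    if (e != NULL) {
        e->count = 0;
    }
}

/* True once the node has failed fencing more often than the assumed limit. */
bool demo_too_many_failures(const char *uname)
{
    assert(uname != NULL);
    struct demo_fail_entry *e = demo_lookup(uname, false);
    return e != NULL && e->count > DEMO_MAX_FENCE_FAILURES;
}
```

In this model, eleven calls to demo_fail_count_increment("node-1") make
demo_too_many_failures("node-1") return true, and it stays true until
demo_fail_count_reset("node-1") is called. Without a reset on rejoin, the
counter remains over the limit, which corresponds to the "Actual result"
above.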
Index: pacemaker/crmd/callbacks.c
===================================================================
--- pacemaker.orig/crmd/callbacks.c
+++ pacemaker/crmd/callbacks.c
@@ -204,6 +204,9 @@ peer_update_callback(enum crm_status_typ
if (alive && safe_str_eq(task, CRM_OP_FENCE)) {
crm_notice("Node return implies stonith of %s (action %d) completed", node->uname,
down->id);
+
+ st_fail_count_reset(node->uname);
+
erase_status_tag(node->uname, XML_CIB_TAG_LRM, cib_scope_local);
erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local);
/* down->confirmed = TRUE; Only stonith-ng returning should imply completion */