File 0001-Don-t-die-in-case-of-faulty-node.patch of Package rabbitmq-server

From: Peter Lemenkov <lemenkov@gmail.com>
Date: Fri, 22 Jul 2016 17:15:02 +0200
Subject: [PATCH] Don't die in case of faulty node

This patch fixes a TOCTOU issue similar to the one introduced in the
following commit:

* rabbitmq/rabbitmq-server@93b9e37c3ea0cade4e30da0aa1f14fa97c82e669

If a node was just removed from the cluster, there is a small window
during which it is still listed locally as a member of the Mnesia
cluster. We retrieve the list of nodes with a local call:

```erlang
 unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
```

However, the status of that particular failed node can no longer be
retrieved, and the attempt throws an exception. See the
`action(Action, N, Args, Opts, Inform)` function, which simply calls
`unsafe_rpc(M, F, A)` for this node.

This `unsafe_rpc/4` function is basically a wrapper around
`rabbit_misc:rpc_call/4` which translates `{badrpc, nodedown}` into an
exception. The exception generated by the `action(Action, N, Args, Opts,
Inform)` call surfaces at a very high level, so rabbitmqctl thinks that
the entire cluster is down and prints a very bizarre message:

	Cluster status of node 'rabbit@overcloud-controller-0' ...
	Error: unable to connect to node 'rabbit@overcloud-controller-0':
	nodedown

	DIAGNOSTICS
	===========

	attempted to contact: ['rabbit@overcloud-controller-0']

	rabbit@overcloud-controller-0:
	  * connected to epmd (port 4369) on overcloud-controller-0
	  * node rabbit@overcloud-controller-0 up, 'rabbit' application running

	current node details:
	- node name: 'rabbitmq-cli-31@overcloud-controller-0'
	- home dir: /var/lib/rabbitmq
	- cookie hash: PB31uPq3vzeQeZ+MHv+wgg==

Note that it reports a failure to connect to node
'rabbit@overcloud-controller-0' (because it catches an exception from
`action(Action, N, Args, Opts, Inform)`), even though the attempt to
connect to this node was successful ('rabbit' application running).

In order to fix that, we should not throw an exception during the
subsequent calls (`[action(Action, N, Args, Opts, Inform) || N <-
nodes_in_cluster(Node)]`), only during the first one (`unsafe_rpc(Node,
rabbit_mnesia, status, [])`).
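
The intended behaviour can be sketched in isolation. The module below is
only an illustration under stated assumptions, not code from
rabbit_control_main.erl: `badrpc_sketch`, `unsafe_call/1`, `safe_call/1`
and `demo/0` are hypothetical names, and `unsafe_call/1` merely imitates
how `unsafe_rpc/4` turns a `{badrpc, _}` result into a throw. It uses
`try ... catch` rather than the `case catch` form in the diff, so that
only the thrown `{badrpc, nodedown}` is intercepted.

```erlang
-module(badrpc_sketch).
-export([demo/0]).

%% Hypothetical stand-in for unsafe_rpc/4: takes an already computed
%% RPC result and throws it if it is a {badrpc, _} tuple.
unsafe_call(Result) ->
    case Result of
        {badrpc, _} = Err -> throw(Err);
        Normal            -> Normal
    end.

%% Per-node wrapper: a thrown {badrpc, nodedown} becomes the plain
%% value 'nodedown'; any other exception still propagates.
safe_call(Result) ->
    try unsafe_call(Result) of
        Value -> {ok, Value}
    catch
        throw:{badrpc, nodedown} -> nodedown
    end.

%% Simulates one healthy node and one node that has just gone away.
demo() ->
    [safe_call(R) || R <- [ok, {badrpc, nodedown}]].
```

Calling `badrpc_sketch:demo()` returns `[{ok,ok},nodedown]`: the faulty
node is reported, and the remaining nodes are still processed.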

See these issues for further details and a real-world example:

* https://bugzilla.redhat.com/1300728
* https://bugzilla.redhat.com/1356169

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>

diff --git a/src/rabbit_control_main.erl b/src/rabbit_control_main.erl
index d2a1eb3..c84f854 100644
--- a/src/rabbit_control_main.erl
+++ b/src/rabbit_control_main.erl
@@ -526,7 +526,13 @@ action(list_policies, Node, [], Opts, Inform) ->
 
 action(report, Node, _Args, _Opts, Inform) ->
     Inform("Reporting server status on ~p~n~n", [erlang:universaltime()]),
-    [begin ok = action(Action, N, [], [], Inform), io:nl() end ||
+    [begin
+		case catch action(Action, N, [], [], Inform) of
+			{badrpc,nodedown} -> io:format("nodedown~n");
+			ok -> ok
+		end,
+		io:nl()
+	end ||
         N      <- unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]),
         Action <- [status, cluster_status, environment]],
     VHosts = unsafe_rpc(Node, rabbit_vhost, list, []),