File 0745-Fix-possible-hangings-in-ct_telnet_client.patch of Package erlang
From 204ff73d7d063d13a9b801f9ee3565335ad9df96 Mon Sep 17 00:00:00 2001
From: Anders Svensson <anders@erlang.org>
Date: Tue, 13 Apr 2021 10:20:54 +0200
Subject: [PATCH 4/4] Fix possible hangings in ct_telnet_client
In particular, receive on messages that will never arrive if the
ct_telnet_client process that should send them is no more, which can
happen if the peer closes the TCP connection for example. Hanging in
open/6 has been seen in practice in the case of network disturbances
that has presumably lead to such a scenario.
This is a minimal fix that just monitors on the process expected to
answer so that receive can't hang indefinitely, but there's much room
for improvement.
For one, ct_telnet isn't expecting an error return so getting one will
currently result in a failure in that module: this is a conscious choice
instead of exiting since ct_telnet should handle the error. (ct_telnet
is just a callback module for a ct_gen_conn process, so it should
probably leave it up to the latter, but the mechanics of receiving data
in ct_telnet are fairly tortured.)
For two, get_data/1 is a bit broken: it waits for a hardcoded (and
undocumented) 100 ms to return when there is no data return when the
request is received, regardless of whether data is then received after 1
ms, 99 ms, or 101 ms, and calls to get_data/1 affect the keep_alive NOP
since any message to the ct_telnet_client process is regarded as traffic
that delays the keep_alive message. Moreover, the 100 ms is on top of
poll_interval that was introduced in commit c3dd39ed. A timeout that
returns data as soon as it arrives up to a timeout would likely be a
better solution if a delay is really needed. As it is, the documented
behaviour of keep_alive is reasonable, but when NOP is actually sent
depends on more than just lack of traffic on the TCP connection.
(Sending it even after every 10th call to get_data/1 that returns
nothing was possibly meant to counteract that, but when NOP is sent is
unpredictable in practice.)
For three, close/1 will wait 5 seconds unnecessarily if the process in
question is already dead, and can leave a closed message in the message
queue after a timeout.
For four, implementing all of this more sanely as a gen_server instead
of reinventing the wheel with an own process loop would be an
improvement.
---
lib/common_test/src/ct_telnet_client.erl | 33 ++++++++++++++----------
1 file changed, 19 insertions(+), 14 deletions(-)
diff --git a/lib/common_test/src/ct_telnet_client.erl b/lib/common_test/src/ct_telnet_client.erl
index 6f9d05e504..f3b11c7ed5 100644
--- a/lib/common_test/src/ct_telnet_client.erl
+++ b/lib/common_test/src/ct_telnet_client.erl
@@ -83,15 +83,16 @@ open(Server, Port, Timeout, KeepAlive, ConnName) ->
open(Server, Port, Timeout, KeepAlive, NoDelay, ConnName) ->
Self = self(),
- Pid = spawn(fun() ->
- init(Self, Server, Port, Timeout,
- KeepAlive, NoDelay, ConnName)
- end),
- receive
- {open,Pid} ->
- {ok,Pid};
- {Error,Pid} ->
- Error
+ {Pid, MRef} = spawn_monitor(fun() ->
+ init(Self, Server, Port, Timeout,
+ KeepAlive, NoDelay, ConnName)
+ end),
+ receive
+ {Result, Pid} ->
+ demonitor(MRef, [flush]),
+ if Result == open -> {ok, Pid}; true -> Result end;
+ {'DOWN', MRef, process, _, _} = T ->
+ {error, T}
end.
close(Pid) ->
@@ -111,10 +112,14 @@ send_data(Pid, Data, false) ->
ok.
get_data(Pid) ->
- Pid ! {get_data,self()},
- receive
- {data,Data} ->
- {ok,Data}
+ MRef = monitor(process, Pid),
+ Pid ! {get_data, self()},
+ receive
+ {data, Data} ->
+ demonitor(MRef, [flush]),
+ {ok, Data};
+ {'DOWN', MRef, process, _, _} = T ->
+ {error, T}
end.
%%%-----------------------------------------------------------------
--
2.26.2