Debugging sporadically encountered “SSH encountered an unknown error” in ansible runs

Half way through setting up continuous deployment with jenkins and ansible, I sporadically spotted “unknown errors” with ssh connections when ansible tried to run things on a series of hosts. Searching for “SSH encountered an unknown error during the connection” obviously revealed a LOT of results but nothing that really helped. So I had to dig deeper and eventually found the problem.

The playbook log said sth like this:

10:30:30 fatal: [bob] => SSH encountered an unknown error during the connection.

At first, I could not find a red thread throughout the occurences. This happened with hosts of the same and of different sshd versions. This happened on servers with high and also on ones with low load. This happened with different playbooks and it also didn’t matter which machine ran the playbook.

When taking a more detailed look at the logs of one particular playbook, I spotted something I first considered a coincidence. Look at this log excerpt and especially at the timestamps.

The fact gathering is the first thing the playbook does and this is done in parallel. It works just fine:

10:28:33 GATHERING FACTS **************************** 
10:28:36 ok: [alice]
10:28:36 ok: [bob]
10:28:36 ok: [charlie]
10:28:37 ok: [daisy]

Then several tasks are done serially on one host after the other on alice, bob, charlie and then daisy. The runs take about 30 seconds each.

  • run on alice finished at 10:29:05
  • run on bob finished at 10:29:30
  • run on charlie finished at 10:29:57
  • run on daisy finished at 10:30:30

Next thing the playbook attempts to do is run another role on the same set of servers. So it has to gather facts once again. All of a sudden, the ssh connection to bob breaks:

10:30:30 GATHERING FACTS **************************** 
10:30:30 fatal: [bob] => SSH encountered an unknown error during the connection. We recommend you re-run the command using -vvvv, which will enable SSH debugging output to help diagnose the issue
10:30:31 ok: [charlie]
10:30:32 ok: [daisy]
10:30:33 ok: [alice]

The thing I noticed was that the failure was exactly a minute after things had last been run on bob (tasks finished at 10:29:30, ssh error at 10:30:30). But honestly, I didn’t pay much attention to that in the first place. Then, as suggested, I turned on debugging in ansible (run it with -vvvv) and this revealed ssh options used by ansible that I did not know about:

10:18:32  ConnectTimeout=10 'sudo -k && sudo -H -S -p <snip> 
ControlPath=~/.ansible/cp/ansible-ssh-%h-%p-%r <snip> 
ControlMaster=auto Port=22 ControlPersist=60s

I did not know what ControlMaster and ControlPersist was, but the 60s value for ControlPersist looked promising. So I looked at the man page:

SSH_CONFIG(5) 
 ControlMaster
  Enables the sharing of multiple sessions over a single network 
  connection.  When set to “yes”, ssh(1) will listen for connections 
  on a control socket specified using the ControlPath argument. 
  Additional sessions can connect to this socket using the same 
  ControlPath with ControlMaster set to “no” (the default).  
  These sessions will try to reuse the master instance's network 
  connection rather than initiating new ones, but will fall back to 
  connecting normally if the control socket does not exist, or is not 
  listening. Setting this to “ask” will cause ssh to listen for control 
  connections, but require confirmation using the SSH_ASKPASS program
  before they are accepted (see ssh-add(1) for details).  If the 
  ControlPath cannot be opened, ssh will continue without connecting 
  to a master instance. X11 and ssh-agent(1) forwarding is supported 
  over these multiplexed connections, however the display and agent 
  forwarded will be the one belonging to the master connection i.e. 
  it is not possible to forward multiple displays or agents. Two 
  additional options allow for opportunistic multiplexing: try to 
  use a master connection but fall back to creating a new one if 
  one does not already exist.  These options are: “auto” and 
  “autoask”.  The latter requires confirmation like the “ask” option.

 ControlPersist
  When used in conjunction with ControlMaster, specifies that the 
  master connection should remain open in the background (waiting 
  for future client connections) after the initial client 
  connection has been closed.  If set to “no”, then the master 
  connection will not be placed into the background, and will 
  close as soon as the initial client connection is closed.  
  If set to “yes”, then the master connection will remain in 
  the background indefinitely (until killed or closed via a 
  mechanism such as the ssh(1) “-O exit” option).  If set to 
  a time in seconds, or a time in any of the formats documented 
  in sshd_config(5), then the backgrounded master connection 
  will automatically terminate after it has remained idle 
  (with no client connections) for the specified time.

Knowing this and then re-reading the timestamps from the log formed a theory: That there might be a race condition between the new ssh client looking for the control socket, the timeout expiring (and therefore deleting the socket) and the new client actually using the socket. I proved this theory using this simple playbook:

---
- hosts: servers
  serial: 1

  tasks:
  - command: sleep 5

For completeness: This is the inventory:

[servers]
alice
bob
charlie
daisy

and this is ansible.cfg

[ssh_connection]
ssh_args=-o ControlPersist=5s -o ControlMaster=auto

I ran this 200 times in parallel

for i in {1..200}; do ansible-playbook -vvvv -i sshinv sshtest.yml &> $i.log & done

and found this in 13 of the log files:

debug1: auto-mux: Trying existing master
debug1: Control socket "/home/klein/.ansible/cp/ansible-ssh-alice-22-klein" does not exist

So, taking another look at the timestamps, here’s how it explains:

ansible opens ssh connections to all 4 hosts at 10:28:3x:

10:28:33 GATHERING FACTS **************************** 
10:28:36 ok: [alice]
10:28:36 ok: [bob]
10:28:36 ok: [charlie]
10:28:37 ok: [daisy]

Then, ansible does work on alice for 30s, nothing is done on bob, charlie or daisy during these 30s. After that, it does work on bob for 30 seconds while doing nothing with alice, charlie and daisy. The same for charlie and daisy afterwards.

run on alice finished at 10:29:05
run on bob finished at 10:29:30
run on charlie finished at 10:29:57
run on daisy finished at 10:30:30

Then ansible wants to gather facts again and tries to re-use the existing ssh connection. On alice, it has been about 90s since it has last run anything. So the socket is very likely closed and not usable. So, since ssh uses ControlMaster=auto, it opens a new connection. On bob however, about 60 seconds passed. And when looking for the socket, ssh finds one. Very closely after that, the initial ssh’s ControlPersist times out and removes the socket. And then, still thinking the socket exists, ansible wants to use the socket but now it is gone. And that’s your “unknown error”.

So how did I fix this? I read through the ansible configuration documentation, especially on the ssh args and then rewrote my ansible.cfg to hold these lines:

[ssh_connection]
control_path=%(directory)s/ansible-ssh-REPLACEME-%%h-%%p-%%r
ssh_args=-o ControlPersist=60m -o ControlMaster=auto

Then, in the jenkins job firing the playbook, after I clone my playbooks from git, I replace “REPLACEME” with a unique string for each run. This means each run has its individual socket and does not silently re-use other playbook’s sockets (which, with more and more playbooks running more frequently, might end up causing problems with sshd’s default value of 10 for “MaxSessions” anyway). Each run’s socket will stay alive for 1 hour, which no playbook should ever run for. This does sometimes leave quite a number of ssh processes on my deployment master machine though.

Considering it fixed.

Advertisements

One response to “Debugging sporadically encountered “SSH encountered an unknown error” in ansible runs

  1. Hi, thx for article, FIY I’ve just found another solution
    ANSIBLE_SSH_CONTROL_PATH=”./ansible-ssh-$(< /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c8)-%%h-%%p-%%r"

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s