Cloud VM Reachability: Why Running Does Not Mean Healthy

In cloud operations, one of the most common incident reports sounds simple: “The VM is running, but SSH is not working.”

That phrasing is useful as a symptom, but useless as a diagnosis. SSH failure is not an incident category. It is an access-path symptom. A Linux server can be powered on, visible in the cloud control plane, reachable at the network layer, and still be operationally unavailable because something inside the guest OS is blocked. The failure may be in sshd, PAM, DNS, routing, storage, NFS, memory reclaim, disk I/O, or the scheduler.

The deeper SRE lesson is this: a running instance is not the same thing as a healthy operating system, and an open TCP port is not the same thing as a usable login path.

This distinction matters because many cloud incidents are initially misclassified as network problems. Engineers check security lists, route tables, NSGs, firewalls, public IPs, and port 22. Those checks are necessary, but they only validate part of the path. A failed SSH session can absolutely be caused by a cloud networking issue, but it can also be caused by a Linux host that is alive enough to respond partially while user-space work is stuck behind kernel waits.

The misleading symptom

When SSH fails, the first reaction is usually to ask whether port 22 is open, whether sshd is running, whether the security group changed, whether the route table changed, or whether the instance is still running. Those are reasonable first checks, but they can lead to a false conclusion: “This must be a network issue.”

In production, “SSH timeout” can describe very different failure modes. The client may not reach the VM at all. It may reach the VM, but fail to establish a TCP connection. TCP may connect, but the SSH banner may be delayed. Authentication may succeed, but session creation may hang. The shell may open, but commands may block immediately. These are not the same incident.

The SSH access path is not simply:

network --> port 22 --> sshd

It is closer to:

network --> host firewall --> sshd listener --> key exchange --> authentication --> PAM/account/session modules --> name service lookups --> home directory access --> shell startup --> terminal/session allocation

Any dependency in that chain can make SSH look “down.” If TCP never establishes, the focus should be on routing, firewalling, security rules, host firewall, or listener availability. But if TCP connects and the login hangs later, the problem may be inside the guest: PAM, LDAP or SSSD, DNS, home directory mounts, shell startup files, disk I/O, or remote filesystems.

That is why “instance running” is a weak health signal. The cloud control plane can tell you that the VM exists, is powered on, and is manageable by the provider. It does not prove that the guest OS can schedule work, complete I/O, run login sessions, or serve the application. From an SRE perspective, control-plane state, guest health, and service health are separate layers.

When Linux is alive but not making progress

A particularly misleading pattern appears when the OS is technically alive, but important tasks are stuck in uninterruptible sleep. In Linux process state output, D means the process is waiting in uninterruptible sleep, commonly associated with disk or filesystem I/O.

You can inspect this with:

ps -eo pid,ppid,stat,wchan:32,comm,args | awk '$3 ~ /D/ {print}'

Or, using /proc directly:

for p in /proc/[0-9]*; do
  state=$(awk '/^State:/ {print $2}' "$p/status" 2>/dev/null)
  if [ "$state" = "D" ]; then
    pid=${p#/proc/}
    comm=$(cat "$p/comm" 2>/dev/null)
    wchan=$(cat "$p/wchan" 2>/dev/null)
    printf "%s %-24s %s\n" "$pid" "$comm" "$wchan"
  fi
done

A few D-state tasks are not necessarily an incident. But a growing number of blocked tasks, especially around filesystem or network-storage wait channels, is a serious reliability signal. The kernel’s hung task detector exists for this class of problem, and one of the typical log patterns is:

INFO: task <process>:<pid> blocked for more than <N> seconds
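
To judge whether blocked tasks are accumulating, and where they are waiting, it can help to group them by wait channel and compare a few snapshots. The following is a minimal sketch, not a definitive tool. The function name dstate_by_wchan is hypothetical, and it assumes input lines of the form "PID STAT WCHAN COMM".

```shell
#!/usr/bin/env bash
# Summarize D-state tasks by wait channel.
# Input (stdin): lines of "PID STAT WCHAN COMM", for example from
#   ps -eo pid,stat,wchan:32,comm --no-headers
# Output: "<count> <wchan>", highest count first.
dstate_by_wchan() {
  awk '$2 ~ /^D/ { n[$3]++ }
       END { for (w in n) print n[w], w }' | sort -k1,1nr -k2,2
}
```

Running it a few times, several seconds apart, shows whether the counts are stable or growing. A count that keeps rising on one storage-related wait channel is more interesting than a single task that is briefly in D state.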

In one sanitized production pattern, SSH was reported as unavailable while the instance still appeared to be running. The useful evidence was not “port 22 failed.” The useful evidence was that the guest OS had symptoms of blocked I/O and remote filesystem instability.

Remote filesystems are a classic way to turn a partial infrastructure failure into a confusing host-level symptom. NFS is especially important because its retry behavior depends on mount options. Hard mounts can preserve data-integrity expectations, but they can also cause processes to wait for a long time when the server or network path is unavailable. Soft mounts may improve responsiveness in some cases, but they can return errors in ways that applications are not designed to handle.

So the better question is not “Should we use hard or soft mounts?” The better question is: what is the operational role of this filesystem, and what should fail when it becomes unavailable?

An application data path has different correctness requirements than a read-only reference mount. A user home directory has different availability implications than a backup mount. A boot-time mount has different blast-radius concerns than an on-demand mount. A remote backup mount should not make the whole server unmanageable, and a non-critical filesystem should not silently become a login-path dependency.
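
One way to start that review is to list which NFS mounts are hard or soft and what retry settings they carry. This is a sketch under stated assumptions: the function name nfs_mount_policy is hypothetical, and it relies on the hard/soft, timeo, and retrans options appearing in the options column of /proc/mounts, which is typical for NFS mounts on Linux.

```shell
#!/usr/bin/env bash
# Print mountpoint, type, hard/soft mode, timeo and retrans for each NFS mount.
# Reads a file in /proc/mounts format (default: /proc/mounts).
nfs_mount_policy() {
  awk '$3 ~ /^nfs4?$/ {
         mode = "unknown"; timeo = "-"; retrans = "-"
         n = split($4, opt, ",")
         for (i = 1; i <= n; i++) {
           if (opt[i] == "hard" || opt[i] == "soft") mode = opt[i]
           else if (opt[i] ~ /^timeo=/)   timeo = substr(opt[i], 7)
           else if (opt[i] ~ /^retrans=/) retrans = substr(opt[i], 9)
         }
         print $2, $3, mode, "timeo=" timeo, "retrans=" retrans
       }' "${1:-/proc/mounts}"
}
```

The timeo value is expressed in tenths of a second. The output is only an inventory. Whether a given mount should be hard or soft still depends on its operational role.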

This is how a remote filesystem can affect SSH indirectly:

User authenticates successfully --> PAM/session setup runs --> shell starts --> home directory or profile is accessed --> remote filesystem blocks --> login appears hung

From the user’s point of view, “SSH is down.” From the system’s point of view, SSH may have done its job, and the session is blocked later.

Another clue in I/O-related incidents is high load average with low CPU utilization. This surprises engineers who mentally equate load with CPU. On Linux, load average can include tasks waiting in uninterruptible sleep, which means a system with many blocked I/O tasks can show high load while CPUs are mostly idle.
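
A quick way to see what is feeding the load figure is to count runnable and uninterruptible tasks side by side. This is a rough sketch. The function name load_contributors is hypothetical, and it expects one process state value per line.

```shell
#!/usr/bin/env bash
# Count runnable (R) and uninterruptible (D) tasks.
# Input (stdin): one STAT value per line, for example from
#   ps -eo stat --no-headers
load_contributors() {
  awk '$1 ~ /^R/ { r++ }
       $1 ~ /^D/ { d++ }
       END { printf "runnable=%d uninterruptible=%d\n", r, d }'
}
```

If the uninterruptible count is large while the runnable count stays near the number of CPUs or below, the load average is probably describing waiting rather than computing. Compare the result with cat /proc/loadavg and nproc before drawing conclusions, since this is a point-in-time sample.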

A better signal is pressure, not just utilization. Linux Pressure Stall Information, or PSI, helps quantify time lost because tasks are stalled on CPU, memory, or I/O resource pressure:

cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io

For this incident pattern, /proc/pressure/io is especially useful because it can show whether workloads are losing time waiting on I/O, even when CPU graphs look normal.
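
The PSI files are simple enough to parse in a triage script. The sketch below extracts one average from a PSI file and compares the "full" 10-second I/O average against a threshold. The function names and the 10 percent default are illustrative assumptions, not recommended values.

```shell
#!/usr/bin/env bash
# Print one PSI average.
# Usage: psi_avg FILE some|full avg10|avg60|avg300
psi_avg() {
  awk -v kind="$2" -v key="$3" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) print kv[2]
      }
    }' "$1"
}

# Succeed (exit 0) if the "full" avg10 value is at or above a threshold.
# PSI averages are percentages of wall time; the default of 10 is a placeholder.
io_pressure_high() {
  local file="${1:-/proc/pressure/io}" threshold="${2:-10}"
  local v
  v=$(psi_avg "$file" full avg10)
  [ -n "$v" ] && awk -v v="$v" -v t="$threshold" 'BEGIN { exit !(v + 0 >= t + 0) }'
}
```

For example, io_pressure_high && echo "I/O pressure elevated" could be added to a triage script. The "some" line reports time in which at least one task was stalled, and the "full" line reports time in which all non-idle tasks were stalled at once, so "full" is the stronger signal for a host that is not making progress.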

A better diagnostic workflow

When SSH fails, avoid jumping straight to remediation. Build an evidence hierarchy. Start with the precise client-side failure mode:

ssh -vvv user@host
nc -vz host 22
timeout 5 bash -c '</dev/tcp/host/22' && echo open || echo closed

A connection timeout, connection refusal, banner delay, authentication delay, and session-creation delay point to different layers. Treat them differently.
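
To make that distinction less subjective, the verbose client log can be reduced to the furthest stage it reached. This is a minimal sketch. The function name ssh_furthest_stage is hypothetical, and the marker strings are assumptions based on common OpenSSH client debug output, which may differ between versions.

```shell
#!/usr/bin/env bash
# Report the furthest stage seen in an OpenSSH client debug log (stdin).
# Marker strings are assumptions and may vary by OpenSSH version.
ssh_furthest_stage() {
  awk '
    function reach(r, name) { if (r > rank) { rank = r; stage = name } }
    /Connection timed out|Connection refused/     { reach(1, "tcp-failed") }
    /Connection established/                      { reach(2, "tcp-connected") }
    /Remote protocol version/                     { reach(3, "banner-received") }
    /Authenticated to |Authentication succeeded/  { reach(4, "authenticated") }
    /Entering interactive session/                { reach(5, "session-started") }
    END { print (stage == "" ? "no-markers" : stage) }
  '
}
```

A possible invocation is timeout 20 ssh -vvv -o ConnectTimeout=5 user@host true 2>&1 | ssh_furthest_stage. A result of tcp-connected on a session that then hangs suggests a delayed banner, while authenticated without session-started points toward session setup inside the guest.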

Then validate the cloud-side basics: instance state, VNIC attachment, security rules, route tables, public or private IP association, network ACLs, cloud firewall constructs, serial console availability, and recent changes. Also test whether SSH succeeds from another instance in the same network, subnet, and ACL scope. But do not stop there. These checks validate the infrastructure path, not the guest’s ability to complete work.

From a console, serial console, recovery shell, or another access path, check the listener and host firewall (on Debian and Ubuntu the service unit is named ssh rather than sshd):

systemctl status sshd
ss -lntp | grep ':22'
journalctl -u sshd --since -2h
iptables -S
nft list ruleset
firewall-cmd --list-all 2>/dev/null

Then check login-path dependencies:

time getent passwd user
time id user
time getent hosts example.internal
ls -ld /home/user
mount | grep -E 'nfs|cifs|fuse'

If id, getent, or ls /home/user hangs, SSH is probably not the primary issue.
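
Because these lookups are exactly the commands that may hang, it is safer to run them under a time limit and record the outcome. The wrapper below is a sketch. The function name timed_check is hypothetical, and it assumes the coreutils timeout command, which exits with 124 when the limit is hit and 137 when it had to send SIGKILL.

```shell
#!/usr/bin/env bash
# Run a command with a time limit and report OK, FAILED or TIMEOUT.
# Usage: timed_check SECONDS COMMAND [ARGS...]
timed_check() {
  local limit="$1"; shift
  local rc=0
  # -k 1: send SIGKILL one second after SIGTERM if the command is still alive.
  timeout -k 1 "$limit" "$@" >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0)       echo "OK $*" ;;
    124|137) echo "TIMEOUT $*" ;;   # 137 can also mean an external SIGKILL
    *)       echo "FAILED($rc) $*" ;;
  esac
}
```

Typical use would be timed_check 5 getent passwd user, timed_check 5 id user, and timed_check 5 stat /home/user. A TIMEOUT on a name-service or home-directory check is strong evidence that the login path, not sshd, is where the session stalls. One caveat: a task stuck in a wait that does not respond to signals can outlive the limit, so run these from a session you can afford to lose.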

Finally, look for guest-level blocking:

dmesg -T | grep -Ei 'blocked for more than|hung task|nfs|not responding|I/O error|reset|timeout'
journalctl -k --since -2h
ps -eo pid,ppid,stat,wchan:32,comm,args | awk '$3 ~ /D/ {print}'
cat /proc/loadavg
cat /proc/pressure/io
iostat -xz 1 5

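Filesystem commands deserve special care because they can hang when remote mounts are unhealthy. Wrap them with timeout during incident response. findmnt is left unwrapped here because it normally reads the kernel’s mount table rather than touching the mounted filesystems: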

findmnt
timeout 5 df -hT
timeout 5 ls -la /home
timeout 5 stat /home/user

This workflow helps separate “network cannot reach the host” from “the guest is reachable but not making progress.” That distinction prevents false fixes and makes the postmortem more accurate.

Monitoring and guardrails

Weak signals include instance state, ping success, an open port 22, low CPU usage, normal disk capacity, or a correct-looking cloud route table. Those signals are still useful, but they are incomplete. Stronger signals include synthetic SSH command success, D-state task count, kernel hung task warnings, NFS timeout messages, I/O PSI, disk latency, filesystem operation latency, login latency, PAM/SSSD/DNS latency, and serial console kernel logs.

For Prometheus-style host monitoring, the useful categories are not only CPU and disk fullness. For this pattern, the more interesting signals are load, disk I/O time, filesystem state, and pressure metrics such as:

node_load*
node_disk_io_time_seconds_total
node_disk_read_time_seconds_total
node_disk_write_time_seconds_total
node_filesystem_avail_bytes
node_filesystem_readonly
node_pressure_io_waiting_seconds_total
node_pressure_memory_waiting_seconds_total
node_pressure_cpu_waiting_seconds_total

Metric names vary by node exporter version and enabled collectors, so treat this as a monitoring checklist rather than a copy-paste dashboard.
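
The pressure metrics are counters of stalled seconds, so the useful quantity is their rate of change, which approximates the fraction of wall time lost to stalls. As a rough sketch for checking this by hand, the function below compares two saved scrapes. The function name counter_rate is hypothetical, and it assumes an unlabelled metric line of the form "name value".

```shell
#!/usr/bin/env bash
# Compute the per-second rate of an unlabelled counter between two scrapes.
# Usage: counter_rate METRIC FILE_BEFORE FILE_AFTER INTERVAL_SECONDS
counter_rate() {
  local metric="$1" before="$2" after="$3" interval="$4"
  awk -v m="$metric" -v dt="$interval" -v first="$before" '
    $1 == m { if (FILENAME == first) a = $2; else b = $2 }
    END { printf "%.4f\n", (b - a) / dt }' "$before" "$after"
}
```

For example, save the exporter output twice, 30 seconds apart (the address http://localhost:9100/metrics is an assumed default), then run counter_rate node_pressure_io_waiting_seconds_total before.prom after.prom 30. A result of 0.1000 would mean roughly 10 percent of that interval was spent with tasks stalled on I/O. In Prometheus itself, the same idea is usually expressed with rate() over the counter.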

This class of incident also benefits from small automation. A lightweight triage script can collect safe, non-identifying evidence while the system is degraded:

#!/usr/bin/env bash
set -o pipefail

echo "## Time"
date -Is

echo "## Uptime/load"
uptime
cat /proc/loadavg

echo "## PSI"
for f in /proc/pressure/{cpu,memory,io}; do
  echo "# $f"
  cat "$f" 2>/dev/null || true
done

echo "## sshd"
systemctl --no-pager --full status sshd 2>/dev/null || true
ss -lntp 2>/dev/null | grep -E '(:22)\b' || true

echo "## D-state tasks"
ps -eo pid,ppid,stat,wchan:32,comm,args | awk '$3 ~ /D/ {print}'

echo "## Kernel warnings"
journalctl -k --since -2h --no-pager 2>/dev/null | \
  grep -Ei 'blocked for more than|hung task|nfs|not responding|timeout|I/O error' || true

echo "## Mounts"
findmnt -t nfs,nfs4,cifs,fuse 2>/dev/null || true
timeout 5 df -hT || echo "df timed out"

The goal is not to expose sensitive incident data. The goal is to capture the shape of the failure while the system is still in the failed state.

Remote filesystem guardrails are equally important. Ask whether a mount is required for boot, whether it should be mounted on demand, whether it should fail independently, whether systemd automount should be used, and whether application startup or SSH login should depend on it. These are platform design questions, not only Linux tuning questions.
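
As one illustration of the on-demand option, the sketch below prints a candidate /etc/fstab line that uses systemd automount for a non-critical NFS mount. The function name, the server and path in the usage note, and the timeout values are all hypothetical placeholders. The options themselves (nofail, _netdev, x-systemd.automount, x-systemd.mount-timeout, x-systemd.idle-timeout) are standard fstab and systemd mount options.

```shell
#!/usr/bin/env bash
# Print a candidate /etc/fstab line for an on-demand NFS mount.
# Usage: fstab_automount_line SERVER:/EXPORT MOUNTPOINT
# The 30 s mount timeout and 600 s idle timeout are placeholder values.
fstab_automount_line() {
  local source="$1" mountpoint="$2"
  local opts="nofail,_netdev,x-systemd.automount,x-systemd.mount-timeout=30,x-systemd.idle-timeout=600"
  printf '%s %s nfs %s 0 0\n' "$source" "$mountpoint" "$opts"
}
```

Calling fstab_automount_line nfs.example.internal:/export/backup /mnt/backup prints a line to review, not to paste blindly. With this style of entry the mount is attempted on first access rather than at boot, and nofail keeps a missing server from failing the boot. It does not change how an already mounted hard mount behaves when the server disappears, so it complements the hard-versus-soft decision rather than replacing it.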

A TCP port check is also not enough for administrative access monitoring. A synthetic SSH check should prove that a command can actually run:

ssh -o BatchMode=yes \
    -o ConnectTimeout=5 \
    -o ServerAliveInterval=2 \
    -o ServerAliveCountMax=2 \
    monitor@host 'echo ok'

A successful TCP connection proves that the listener accepted a connection. A synthetic command proves more of the access path.
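
For monitoring, the result of that probe is more useful when it is mapped to a small set of states rather than a single pass or fail. The following is a sketch under stated assumptions: the function names, the monitor account, and the 15-second overall limit are placeholders. It relies on ssh exiting with 255 for its own errors and on the coreutils timeout command exiting with 124 or 137 when the limit is hit.

```shell
#!/usr/bin/env bash
# Map the exit status and output of a synthetic SSH probe to a coarse state.
# Usage: classify_probe EXIT_STATUS OUTPUT
classify_probe() {
  local rc="$1" out="$2"
  case "$rc" in
    0)       if [ "$out" = "ok" ]; then echo "healthy"; else echo "unexpected-output"; fi ;;
    124|137) echo "hung" ;;                   # overall time limit reached
    255)     echo "ssh-error" ;;              # connection or authentication failure
    *)       echo "remote-command-failed" ;;  # remote command's own exit status
  esac
}

# Run the probe with an overall time limit and classify the result.
# "monitor" and the 15 second limit are placeholders.
synthetic_ssh_check() {
  local host="$1" out rc=0
  out=$(timeout -k 2 15 ssh -o BatchMode=yes -o ConnectTimeout=5 \
          -o ServerAliveInterval=2 -o ServerAliveCountMax=2 \
          "monitor@$host" 'echo ok' 2>/dev/null) || rc=$?
  classify_probe "$rc" "$out"
}
```

The outer timeout matters because ConnectTimeout only bounds connection setup. A login that authenticates and then stalls in session setup would otherwise be limited only by the keepalive settings. Reporting "hung" separately from "ssh-error" keeps guest-side stalls from being filed as network failures.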