Trace

Linux production pathologies, one failure at a time

A production network incident often begins with a sentence that sounds simple:

“The server cannot reach the peer.”

That sentence is almost never simple.

It could be a route problem. It could be a firewall problem. It could be a subnet problem, a security-list problem, a bad gateway, a dead interface, a broken bond, an asymmetric return path, a DNS mistake, or a service that is not actually listening.

In one incident pattern, the first clues pointed toward routing. Several Linux nodes had connectivity problems after a maintenance window. The affected traffic used a dedicated interface, the expected routes appeared to exist, and policy routing had already been reviewed. One node worked. Other nodes did not.

That comparison changed the investigation.

The route was there.

The neighbor was not.

The route only tells Linux where to try

When we troubleshoot Linux connectivity, it is natural to start at Layer 3.

We ask:

Is the destination reachable through the expected interface?

Does ip route get <peer> select the right path?

Are the ip rule entries ordered correctly?

Is the source address correct?

Is the gateway correct?

Those are valid questions. But they only answer part of the problem.

A route lookup tells the kernel which egress interface and next hop should be used. It does not prove that the next hop is actually reachable on the local link. Before an IPv4 packet can leave an Ethernet interface, Linux still needs a link-layer address. That mapping lives in the neighbor table, historically exposed as the ARP table for IPv4. The ip neigh command is the modern interface for inspecting and manipulating these neighbor entries.

That distinction matters in production.

A route can be perfectly correct while packet delivery still fails one layer below it.

The operating system may know:

Send traffic for this destination out this interface.

But it may not know:

Which MAC address should receive the Ethernet frame?

When that second question fails, the symptom can look deceptively similar to a routing or firewall problem.

The clue was not in the route table

The useful evidence came from comparing a working node with failing nodes.

On the working node, the relevant peer addresses had neighbor entries with real link-layer addresses. Their states were not all permanently REACHABLE, and that was fine. Some entries were STALE, which does not automatically mean broken. A STALE neighbor entry can still contain a valid link-layer address; it means the entry is valid but its reachability may need to be confirmed again.

On the failing nodes, the same peer addresses did not resolve.

The neighbor entries were in FAILED state, or the legacy ARP view showed an all-zero hardware address. In practical terms, Linux had tried to resolve the next hop and did not have a usable MAC address.
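The distinction between a usable entry and an unresolved one can be checked mechanically. The following is a minimal sketch, assuming the default iproute2 output format (an optional "lladdr <mac>" pair, with the state as the last field). The function name neigh_usable is hypothetical, not part of any tool.

```shell
# Classify `ip neigh show` lines read from stdin.
# Assumption: default iproute2 text output, state is the last field.
neigh_usable() {
  awk '{
    mac = "-"
    for (i = 1; i < NF; i++) if ($i == "lladdr") mac = $(i + 1)
    state = $NF
    verdict = "unresolved"
    # STALE, DELAY, PROBE and REACHABLE all carry a link-layer address.
    if (mac != "-" && state != "FAILED" && state != "INCOMPLETE") verdict = "usable"
    print $1, state, mac, verdict
  }'
}

# Usage on a real host (interface name is a placeholder):
#   ip neigh show dev <interface> | neigh_usable
```

The sketch deliberately treats STALE as usable, which matches the point above: only a missing link-layer address or a FAILED/INCOMPLETE state counts as unresolved.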

A sanitized version of the evidence looked like this:

Check                          Working node          Failing node
Interface up                   yes                   yes
Expected route present         yes                   yes
Policy rule suspected          reviewed              reviewed
Peer in same network segment   yes                   yes
Neighbor entry                 MAC address present   no usable MAC address
Neighbor state                 STALE / usable        FAILED
Connectivity                   succeeds              fails

That table is the heart of the investigation.

It moves the question away from:

“Why is Linux choosing the wrong route?”

and toward:

“Why can this node not resolve the peer at Layer 2?”

That is a different failure domain.

It may involve VLAN placement, switch port policy, bond path behavior, upstream filtering, dynamic ARP inspection, private VLAN behavior, port security, or a fabric-side inconsistency. The exact cause depends on the environment, but the Linux-side evidence is already meaningful: route selection is not the failing step. Neighbor resolution is.

A failed neighbor is stronger evidence than a vague timeout

Timeouts are weak evidence.

A timeout tells us that a response did not come back within a certain period. It does not tell us where the request died.

A failed neighbor entry is stronger.

It says Linux could not establish the local link-layer mapping required to transmit to that peer or next hop.

That does not mean every FAILED entry is automatically a network-team problem. Linux can contribute to ARP failure through local configuration issues too.

Examples include:

  • wrong interface selection
  • incorrect source address
  • ARP filtering behavior
  • multiple interfaces on the same subnet
  • bonding or VLAN misconfiguration
  • network namespace confusion
  • stale manual neighbor entries

But in this incident pattern, the comparison mattered. The same intended network, same class of nodes, same peer set, and same routing expectation produced different neighbor behavior. One node could resolve peers. Others could not.

That is why comparative evidence is so powerful. A single broken node can mislead you into guessing. A working peer gives you a control sample.
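A control sample is easiest to use when the comparison is automated. This sketch assumes the output of ip neigh show dev <interface> has been saved to one file per node; the function name neigh_diff and the file names are hypothetical.

```shell
# Print peers that resolve on the working node but not on the failing node.
# $1: saved `ip neigh show` output from the working node
# $2: saved `ip neigh show` output from the failing node
neigh_diff() {
  awk '
    {
      ok = 0
      for (i = 1; i < NF; i++) if ($i == "lladdr") ok = 1
      if ($NF == "FAILED" || $NF == "INCOMPLETE") ok = 0
    }
    FNR == NR { good[$1] = ok; next }   # first file: remember working state
    good[$1] == 1 && ok == 0 {
      print $1, "resolves on working node,", $NF, "on failing node"
    }
  ' "$1" "$2"
}

# Usage (file names are placeholders):
#   neigh_diff working-node.txt failing-node.txt
```

Peers that are absent from the failing node's table altogether are not reported by this sketch; generating traffic to each peer before capturing the table avoids that gap.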

The diagnostic workflow should be boring and layered:

  1. Start with ip route get <peer>.
  2. Confirm the selected egress interface and source address.
  3. Check ip rule show only if policy routing is involved.
  4. Inspect the neighbor table with ip neigh show dev <interface>.
  5. Compare a working node and a failing node.
  6. Use tcpdump -ni <interface> arp to determine whether ARP requests leave and whether replies return. The -n flag avoids name lookups that can stall during a network incident.

The last step is especially important. If ARP requests leave the Linux host and no replies return, the investigation has crossed an operational boundary. At that point, the useful handoff is not “network issue.” The useful handoff is:

“This host sends ARP requests for this peer on this interface. No ARP replies are observed. A comparable node on the same intended network resolves the peer successfully.”

That is a precise statement. It is falsifiable. It gives the network or platform team something concrete to verify.
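The handoff statement can be backed by a count rather than an impression. The sketch below assumes tcpdump's usual text format for ARP ("Request who-has <ip> tell <ip>" and "Reply <ip> is-at <mac>"); the function name arp_summary is hypothetical.

```shell
# Count ARP requests and replies for one peer in tcpdump text read from stdin.
# Assumption: lines contain "who-has <peer>" for requests and
# "Reply <peer> is-at" for replies, as printed by `tcpdump -n arp`.
arp_summary() {
  peer=$1
  awk -v peer="$peer" '
    {
      for (i = 1; i < NF; i++) {
        if ($i == "who-has" && $(i + 1) == peer) req++
        if ($i == "Reply" && $(i + 1) == peer) rep++
      }
    }
    END { printf "requests=%d replies=%d\n", req, rep }'
}

# Usage on a real host, while traffic to the peer is being generated:
#   timeout 10 tcpdump -nli <interface> arp 2>/dev/null | arp_summary <peer>
```

A result such as requests=5 replies=0 on the failing node, next to a non-zero reply count on the working node, is the evidence the handoff sentence describes.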

The trap after maintenance windows

The timing of an incident can distort the investigation.

When something breaks after patching, the operating system becomes the first suspect. That is reasonable, but it can also create tunnel vision. A maintenance window is rarely a single change. It may include reboot ordering, interface renegotiation, bond failover, switch reconvergence, VM movement, hypervisor-side events, security updates, driver changes, or network policy refreshes.

The last visible change is not always the root cause.

In this kind of incident, the temptation is to say:

“Patching broke networking.”

A better statement is:

“After the maintenance window, some nodes could no longer resolve required peers at Layer 2, while at least one comparable node could. Linux route selection did not explain the difference.”

That wording is more disciplined. It avoids false causality while preserving the operational sequence.

This is also why post-maintenance validation should not stop at service reachability. A service check can tell you that something is broken, but not why. For critical cluster or east-west paths, validation should include the substrate:

  • interface carrier
  • bond state
  • VLAN presence
  • route selection
  • source address
  • neighbor resolution
  • packet counters
  • ARP/NDP behavior

A Linux server can have the right IP configuration and still be unable to put a frame on the wire for the intended peer.
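The first items on that list can be read straight from sysfs. This is a minimal sketch, assuming the standard /sys/class/net/<interface>/carrier and operstate files; the function name link_state and its optional second argument (an alternate sysfs root, useful for testing) are hypothetical.

```shell
# Report carrier and operstate for one interface.
# $1: interface name
# $2: optional sysfs root (defaults to /sys/class/net)
link_state() {
  iface=$1
  root=${2:-/sys/class/net}
  # Reading carrier fails when the interface is administratively down,
  # so an unreadable file is reported as "unknown" rather than as 0.
  carrier=$(cat "$root/$iface/carrier" 2>/dev/null || echo unknown)
  oper=$(cat "$root/$iface/operstate" 2>/dev/null || echo unknown)
  echo "$iface carrier=$carrier operstate=$oper"
}

# Usage on a real host:
#   link_state <interface>
```

Carrier and operstate only cover the lowest layer. Route selection and neighbor resolution still need their own checks, because a link can be up while the peer remains unresolved.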

What to monitor before the next incident

Most monitoring systems are good at detecting that a host is unreachable. Fewer are good at explaining the layer at which reachability failed.

That is where neighbor-state checks can help, especially for critical private interconnects, storage paths, replication networks, and cluster heartbeat networks.

This does not mean alerting on every STALE neighbor. That would be noisy and technically wrong. STALE can be normal. The better signal is persistent failure for known critical peers or gateways.

A useful guardrail might check:

  • Does ip route get <critical-peer> select the expected interface?
  • Does ip neigh show to <critical-peer> dev <interface> produce a usable link-layer address after traffic is generated?
  • Do ARP failures persist for more than one probe interval?
  • Are multiple peers failing on the same interface?
  • Did the failure start after a reboot, failover, or maintenance window?

For dashboards, neighbor state should be correlated with interface and bond telemetry. Interface carrier, packet drops, errors, and bond state show whether the path is physically and logically healthy. The neighbor table adds a missing piece: whether the host can resolve specific local-link dependencies.

The automation idea is simple:

For each critical peer, generate traffic, inspect route selection, inspect neighbor resolution, and report the failing layer.
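One way to sketch that idea is a function that takes the evidence as plain strings and names the failing layer. The function name failing_layer, its messages, and its return codes are hypothetical, and the sketch assumes an on-link peer (for a routed peer, the neighbor entry to inspect is the gateway's).

```shell
# Name the failing layer for one peer.
# $1: expected egress interface
# $2: output of `ip route get <peer>`
# $3: output of `ip neigh show to <peer> dev <interface>` (may be empty)
failing_layer() {
  expected=$1
  route_line=$2
  neigh_line=$3
  dev=$(printf '%s\n' "$route_line" |
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }')
  if [ "$dev" != "$expected" ]; then
    echo "route: selected ${dev:-none}, expected $expected"
    return 1
  fi
  case "$neigh_line" in
    *FAILED | *INCOMPLETE | "") resolved=no ;;
    *lladdr*) resolved=yes ;;
    *) resolved=no ;;
  esac
  if [ "$resolved" = yes ]; then
    echo "ok: route selected $dev and the neighbor resolved"
    return 0
  fi
  echo "neighbor: route selected $dev, but neighbor resolution failed"
  return 2
}

# Usage on a real host (placeholders in angle brackets):
#   ping -c 1 -W 1 <peer> >/dev/null 2>&1   # generate traffic first
#   failing_layer <interface> "$(ip route get <peer>)" \
#     "$(ip neigh show to <peer> dev <interface>)"
```

Keeping the decision logic separate from the commands that gather evidence makes it possible to exercise the logic against saved output from an earlier incident.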

The runbook should not say:

“Ping failed.”

It should say:

“Route selected the expected interface, but neighbor resolution failed.”

That one sentence can save hours.

Because the real lesson is not about ARP as a command. It is about evidence hierarchy.

In Linux networking incidents, routing correctness is necessary, but not sufficient. The route can be right, the interface can be up, the firewall can be innocent, and the packet can still fail to leave the host because the next hop never resolved at Layer 2.

Sometimes the most important clue is not the route Linux chose.

It is the neighbor Linux could not find.

The incident started with a familiar symptom: SSH timeout.

  • Not “permission denied.”
  • Not “connection refused.”
  • Not a host key warning.

Just a timeout.

That detail mattered. A timeout usually means the client sent something and did not get the response it expected. But it does not say where the silence began.

At first, the obvious checks did not explain it. The instance was running. The cloud console showed the expected network attachment. The private IP was present. Security rules allowed access. The subnet route table looked correct. There was no immediate sign of a failed service, a missing route, or a broken instance.

That made the problem more interesting, not less.

If the cloud network looked correct, and the instance was alive, why did the connection still disappear?

The easy explanation was to keep calling it an SSH issue. But SSH was only where the symptom appeared. The failure could have been lower in the path: before the packet reached the server, inside the guest network stack, or on the way back to the client.

So the investigation had to become narrower.

Could the packet reach the instance? Could Linux see it? Could the service respond? And most importantly: when Linux responded, which path would the reply take?

That last question changed the shape of the problem.

Cloud Routing Explains Ingress, Not the Whole Conversation

Cloud networking gives us a clean mental model.

There are subnets, route tables, gateways, security rules, private IPs, public IPs, and virtual NICs. The model is useful because it makes infrastructure look legible. You can point to a route table and say traffic should go this way. You can point to a security rule and say TCP/22 is allowed. You can point to a VNIC and say this IP belongs to this instance.

But a cloud route table does not fully describe the packet’s life inside the guest OS.

It may explain how traffic reaches the instance. It does not prove how the instance sends traffic back.

That distinction becomes important when a Linux instance has more than one interface, more than one private IP, or more than one possible route to the same destination. The cloud control plane can attach the interface and assign the address, but the guest operating system still has to configure the interface, install routes, assign metrics, select source addresses, and decide which path a reply packet should use.

Those decisions are made by Linux, not by the cloud route table.

This is where many investigations drift in the wrong direction. The external view looks consistent. The subnet has a route. The security rule permits the port. The instance has the expected private IP. The service may even be listening.

But none of that answers the return-path question.

A connection is not complete when the packet arrives. It is complete when the response returns to the client through a path the client accepts.

In a single-interface system, this detail is easy to ignore. There is usually only one reasonable way out. In a multi-interface system, Linux may have several possible answers. A secondary VNIC, a changed NetworkManager profile, a new default route, or a different route metric can quietly change the preferred path.

Nothing has to crash for this to fail.

The system can be alive, configured, and still wrong.

The Investigation Turns on One Question

The investigation changed once the focus moved away from “is SSH available?” and toward “what would Linux do with the reply?”

That is a different class of question.

Checking sshd tells you whether the service is running. Checking security rules tells you whether the cloud edge is expected to allow the connection. Checking the cloud route table tells you how the subnet should forward traffic.

But checking route selection tells you how the kernel will actually behave for a specific destination.

The most useful command in this situation is often:

ip route get <client-ip>

This does not show a generic diagram. It asks the Linux kernel a direct question:

If this machine sends traffic to that client right now, which route, interface, and source address will it use?

The answer may include the selected gateway, device, and source IP, for example:

<client-ip> via <gateway> dev <interface> src <source-ip>

That output can immediately narrow the investigation.

If the client is connecting to an address on one interface, but Linux says the reply will leave through another interface, the timeout starts to make sense. The packet may have reached the server. The service may have responded. But the response did not follow the expected path back.
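That comparison can be scripted. The sketch below parses ip route get output and compares the kernel's choice with the interface and address the client actually used; the function names route_dev_src and check_reply_path are hypothetical.

```shell
# Extract the "dev" and "src" values from `ip route get` output on stdin.
route_dev_src() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($i == "dev") dev = $(i + 1)
        if ($i == "src") src = $(i + 1)
      }
    }
    END {
      if (dev == "") dev = "-"
      if (src == "") src = "-"
      print dev, src
    }'
}

# Compare the selected path with the expected interface and source address.
# $1: interface the client connected through, $2: address on that interface
check_reply_path() {
  expected_dev=$1
  expected_src=$2
  set -- $(route_dev_src)
  if [ "$1" = "$expected_dev" ] && [ "$2" = "$expected_src" ]; then
    echo "symmetric: reply leaves $1 from $2"
  else
    echo "asymmetric: reply leaves $1 from $2, expected $expected_dev from $expected_src"
  fi
}

# Usage on a real host:
#   ip route get <client-ip> | check_reply_path <interface-a> <address-on-a>
```

When policy routing is in play, ip route get <client-ip> from <address-on-a> is the closer question, because rules can match on the source address. In that form the output typically carries a from field instead of src, so only the device comparison is meaningful.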

At that point, the incident is no longer primarily about SSH.

It is about asymmetric routing.

The useful supporting checks are simple, but they need to be interpreted as a path, not as isolated facts.

ip addr shows which addresses are assigned to which interfaces. ip route shows default routes, subnet routes, gateways, and metrics. ip rule shows whether policy routing is involved.

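nmcli -f NAME,DEVICE connection show --active lists the active NetworkManager profiles, and nmcli -g ipv4.route-metric connection show <profile> shows the route metric configured on a profile (-1 means the default is used). Together they show how NetworkManager may be influencing route preference.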

ip neigh show shows neighbor resolution state.

tcpdump -ni <interface-a> host <client-ip> shows whether packets arrive on one interface. tcpdump -ni <interface-b> host <client-ip> shows whether replies leave somewhere else.

That last comparison is often decisive. A request arriving on one interface and a reply leaving through another is not a theory. It is the packet path explaining the symptom.

The timeout was not silent. The investigation was just listening at the wrong layer.

The Actual Failure Pattern

The failure pattern is straightforward once you see it.

A client connects to an IP address associated with interface A.

The cloud network delivers the packet to the instance.

Linux receives the packet.

The service is available and can respond.

Then the kernel chooses the route back to the client.

Because of route metrics, a default route, policy rules, or interface configuration, Linux selects interface B as the return path. The reply leaves through an interface the client was not expecting. For an established TCP connection the reply still carries the address the client connected to, which now does not belong to the egress interface, so the cloud fabric or an upstream filter may drop it.

From the server’s point of view, this may be a valid routing decision.

From the client’s point of view, the connection disappears.

This is why the early checks can be misleading. The instance is not necessarily down. SSH is not necessarily broken. The cloud route table is not necessarily wrong. The security rule is not necessarily missing.

The failure is in the relationship between ingress and egress.

In cloud environments, this often appears after secondary interfaces are attached, routes are modified, systems are patched, network profiles are regenerated, or instances are moved between images and configurations. The trigger may look administrative, but the reliability impact is real: management access fails, application traffic becomes inconsistent, or a host appears unreachable even though several layers are technically healthy.

The uncomfortable part is that the machine may be doing exactly what it was configured to do.

That is why this class of issue is harder than a simple outage. A broken service is obvious. A missing firewall rule is usually visible. But an unintended route preference can hide behind a mostly healthy system.

The phrase “the route table looks correct” also becomes too vague.

Which route table?

The cloud subnet route table?

The Linux main routing table?

A policy routing table?

A route installed by NetworkManager?

The route selected for this exact destination?

In multi-interface incidents, “the route table” is not one object. It is a sequence of decisions across layers.

The Reliability Lesson

The lesson is not only to remember ip route get, although it deserves a permanent place in the troubleshooting toolbox.

The bigger lesson is that reachability must be validated as a full path.

A system is not reachable only because the instance is running. It is not reachable only because a VNIC is attached. It is not reachable only because TCP/22 is allowed. It is not reachable only because sshd is listening.

Those are useful checks, but they are not the complete story.

A connection succeeds when the full path works: client to cloud network, cloud network to VNIC, VNIC to guest OS, guest OS to service, service back to kernel, kernel through the correct route, and response back to the client.

For SRE and platform teams, this changes how multi-NIC systems should be designed and operated.

Avoid same-subnet multi-interface designs unless there is a clear reason and the routing model is documented.

Use separate subnets when traffic classes are genuinely separate.

Use source-based policy routing when different interfaces must handle different traffic paths.

Validate route metrics after reboots, patching, image changes, VNIC attachment, and NetworkManager profile updates.

Add post-change checks that compare the expected egress interface and source IP against actual ip route get output.

Monitor for failed neighbor entries, unexpected default routes, and route metric drift.
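Source-based policy routing usually comes down to two pieces per secondary interface: a dedicated routing table with its own default route, and a rule that sends traffic from that interface's address to that table. The sketch below only prints the commands so they can be reviewed first. The function name policy_route_cmds, the table number, and the addresses in the usage line are hypothetical.

```shell
# Print (not run) the commands for a per-interface routing table.
# $1: source address on the secondary interface
# $2: gateway for that interface's subnet
# $3: interface name
# $4: routing table number (any unused table id)
policy_route_cmds() {
  src=$1
  gw=$2
  dev=$3
  table=$4
  echo "ip route add default via $gw dev $dev table $table"
  echo "ip rule add from $src/32 lookup $table"
}

# Usage:
#   policy_route_cmds 10.0.1.10 10.0.1.1 eth1 101
```

Commands applied this way do not survive a reboot. How they are persisted depends on the distribution and its network manager, which is one reason to re-validate route selection after every change listed above.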

At fleet scale, this should become a guardrail, not a troubleshooting trick remembered by one engineer. A small validation script after provisioning can prevent a long incident later. A runbook that asks “which route will Linux use to reply?” can stop an investigation from spending hours on SSH, firewall rules, and cloud route tables when the real issue is guest route selection.

The cloud route table is important.

It tells part of the story.

But it is not the whole story.

When a Linux instance has more than one possible path, the final answer is inside the guest OS. The question is not only whether the cloud can deliver the packet.

The question is whether Linux knows how to send it back.

In cloud operations, one of the most common incident reports sounds simple: “The VM is running, but SSH is not working.”

That phrasing is useful as a symptom, but useless as a diagnosis. SSH failure is not an incident category. It is an access-path symptom. A Linux server can be powered on, visible in the cloud control plane, reachable at the network layer, and still be operationally unavailable because something inside the guest OS is blocked. The failure may be in sshd, PAM, DNS, routing, storage, NFS, memory reclaim, disk I/O, or the scheduler.

The deeper SRE lesson is this: a running instance is not the same thing as a healthy operating system, and an open TCP port is not the same thing as a usable login path.

This distinction matters because many cloud incidents are initially misclassified as network problems. Engineers check security lists, route tables, NSGs, firewalls, public IPs, and port 22. Those checks are necessary, but they only validate part of the path. A failed SSH session can absolutely be caused by a cloud networking issue, but it can also be caused by a Linux host that is alive enough to respond partially while user-space work is stuck behind kernel waits.

The misleading symptom

When SSH fails, the first reaction is usually to ask whether port 22 is open, whether sshd is running, whether the security group changed, whether the route table changed, or whether the instance is still running. Those are reasonable first checks, but they can lead to a false conclusion: “This must be a network issue.”

In production, “SSH timeout” can describe very different failure modes. The client may not reach the VM at all. It may reach the VM, but fail to establish a TCP connection. TCP may connect, but the SSH banner may be delayed. Authentication may succeed, but session creation may hang. The shell may open, but commands may block immediately. These are not the same incident.

The SSH access path is not simply:

network –> port 22 –> sshd

It is closer to:

network –> host firewall –> sshd listener –> key exchange –> authentication –> PAM/account/session modules –> name service lookups –> home directory access –> shell startup –> terminal/session allocation

Any dependency in that chain can make SSH look “down.” If TCP never establishes, the focus should be on routing, firewalling, security rules, host firewall, or listener availability. But if TCP connects and the login hangs later, the problem may be inside the guest: PAM, LDAP or SSSD, DNS, home directory mounts, shell startup files, disk I/O, or remote filesystems.

That is why “instance running” is a weak health signal. The cloud control plane can tell you that the VM exists, is powered on, and is manageable by the provider. It does not prove that the guest OS can schedule work, complete I/O, run login sessions, or serve the application. From an SRE perspective, control-plane state, guest health, and service health are separate layers.

When Linux is alive but not making progress

A particularly misleading pattern appears when the OS is technically alive, but important tasks are stuck in uninterruptible sleep. In Linux process state output, D means the process is waiting in uninterruptible sleep, commonly associated with disk or filesystem I/O.

You can inspect this with:

ps -eo pid,ppid,stat,wchan:32,comm,args | awk '$3 ~ /D/ {print}'

Or, using /proc directly:

for p in /proc/[0-9]*; do
  state=$(awk '/^State:/ {print $2}' "$p/status" 2>/dev/null)
  if [ "$state" = "D" ]; then
    pid=${p#/proc/}
    comm=$(cat "$p/comm" 2>/dev/null)
    wchan=$(cat "$p/wchan" 2>/dev/null)
    printf "%s %-24s %s\n" "$pid" "$comm" "$wchan"
  fi
done

A few D state tasks are not always an incident. But a growing number of blocked tasks, especially around filesystem or network-storage wait channels, is a serious reliability signal. The kernel’s hung task detector exists for this class of problem, and one of the typical log patterns is:

INFO: task <process>:<pid> blocked for more than <N> seconds

In one sanitized production pattern, SSH was reported as unavailable while the instance still appeared to be running. The useful evidence was not “port 22 failed.” The useful evidence was that the guest OS had symptoms of blocked I/O and remote filesystem instability.

Remote filesystems are a classic way to turn a partial infrastructure failure into a confusing host-level symptom. NFS is especially important because its retry behavior depends on mount options. Hard mounts can preserve data-integrity expectations, but they can also cause processes to wait for a long time when the server or network path is unavailable. Soft mounts may improve responsiveness in some cases, but they can return errors in ways that applications are not designed to handle.

So the better question is not “Should we use hard or soft mounts?” The better question is: what is the operational role of this filesystem, and what should fail when it becomes unavailable?

An application data path has different correctness requirements than a read-only reference mount. A user home directory has different availability implications than a backup mount. A boot-time mount has different blast-radius concerns than an on-demand mount. A remote backup mount should not make the whole server unmanageable, and a non-critical filesystem should not silently become a login-path dependency.
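Answering that question starts with knowing how each remote filesystem is actually mounted. This is a minimal sketch that reads the mount table and reports whether each NFS mount is hard or soft. It assumes the usual /proc/mounts layout (device, mount point, type, options); the function name nfs_mount_modes is hypothetical.

```shell
# List NFS mounts with their hard/soft retry mode.
# $1: optional mount table file (defaults to /proc/mounts)
nfs_mount_modes() {
  awk '$3 == "nfs" || $3 == "nfs4" {
    mode = "unknown"
    n = split($4, opt, ",")
    for (i = 1; i <= n; i++)
      if (opt[i] == "hard" || opt[i] == "soft") mode = opt[i]
    print $2, mode
  }' "${1:-/proc/mounts}"
}

# Usage on a real host:
#   nfs_mount_modes
```

Reading /proc/mounts does not touch the remote server, so this check stays safe to run even when the mounts themselves are unresponsive.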

This is how a remote filesystem can affect SSH indirectly:

User authenticates successfully –> PAM/session setup runs –> shell starts –> home directory or profile is accessed –> remote filesystem blocks –> login appears hung

From the user’s point of view, “SSH is down.” From the system’s point of view, SSH may have done its job, and the session is blocked later.

Another clue in I/O-related incidents is high load average with low CPU utilization. This surprises engineers who mentally equate load with CPU. On Linux, load average can include tasks waiting in uninterruptible sleep, which means a system with many blocked I/O tasks can show high load while CPUs are mostly idle.

A better signal is pressure, not just utilization. Linux Pressure Stall Information, or PSI, helps quantify time lost because tasks are stalled on CPU, memory, or I/O resource pressure:

cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io

For this incident pattern, /proc/pressure/io is especially useful because it can show whether workloads are losing time waiting on I/O, even when CPU graphs look normal.
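PSI files are simple enough to parse in a health check. The sketch below assumes the documented format ("some" and "full" lines with avg10, avg60, avg300, and total fields). The function names psi_avg10 and psi_over are hypothetical, and any threshold passed to psi_over is a local tuning choice, not a recommended value.

```shell
# Print the avg10 value from the "some" or "full" line of a PSI file.
# $1: "some" or "full"
# $2: optional PSI file (defaults to /proc/pressure/io)
psi_avg10() {
  kind=$1
  file=${2:-/proc/pressure/io}
  awk -v kind="$kind" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$file"
}

# Succeed when avg10 is at or above a threshold (percent of time stalled).
# $1: "some" or "full", $2: threshold, $3: optional PSI file
psi_over() {
  awk -v v="$(psi_avg10 "$1" "$3")" -v t="$2" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Usage on a real host (threshold is an illustrative assumption):
#   psi_over full 10 && echo "sustained I/O stalls in the last 10 seconds"
```

The "full" line is the more alarming of the two for this pattern, because it measures time in which all non-idle tasks were stalled at once.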

A better diagnostic workflow

When SSH fails, avoid jumping straight to remediation. Build an evidence hierarchy. Start with the precise client-side failure mode:

ssh -vvv user@host
nc -vz host 22
timeout 5 bash -c '</dev/tcp/host/22' && echo open || echo "closed or timed out"

A connection timeout, connection refusal, banner delay, authentication delay, and session-creation delay point to different layers. Treat them differently.
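The client's own error text is often enough for a first classification. This sketch maps common OpenSSH client messages to the layer worth checking first. The function name classify_ssh_failure and the hint wording are hypothetical, and the message list is not exhaustive.

```shell
# Map an ssh client error message to a first-look layer.
# $1: the error line printed by the ssh client
classify_ssh_failure() {
  case "$1" in
    # Checked before the generic timeout, since this message also says "timed out".
    *"banner exchange"*) echo "banner-delay: TCP connected, sshd slow to answer" ;;
    *"Connection refused"*) echo "tcp-refused: host reachable, check listener and host firewall" ;;
    *"Connection timed out"* | *"Operation timed out"*) echo "tcp-timeout: check routing, security rules, return path" ;;
    *"No route to host"*) echo "no-route: check local routing and neighbor resolution" ;;
    *"Permission denied"*) echo "auth-rejected: path works, check credentials and account state" ;;
    *) echo "unclassified: inspect ssh -vvv output" ;;
  esac
}

# Usage:
#   classify_ssh_failure "$(ssh -o BatchMode=yes -o ConnectTimeout=5 user@host true 2>&1 | tail -n 1)"
```

A session that authenticates and then hangs usually prints no error at all, so the absence of a message after authentication is itself a signal that points at the guest rather than the network.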

Then validate the cloud-side basics: instance state, VNIC attachment, security rules, route tables, public or private IP association, network ACLs, cloud firewall constructs, serial console availability, and recent changes. Also test whether SSH succeeds from another instance in the same network, subnet, and ACL scope. But do not stop there. These checks validate the infrastructure path, not the guest’s ability to complete work.

From a console, serial console, recovery shell, or another access path, check the listener and host firewall:

systemctl status sshd
ss -lntp | grep ':22'
journalctl -u sshd --since -2h
iptables -S
nft list ruleset
firewall-cmd --list-all 2>/dev/null

Then check login-path dependencies:

getent passwd user
id user
time getent hosts example.internal
time id user
ls -ld /home/user
mount | grep -E 'nfs|cifs|fuse'

If id, getent, or ls /home/user hangs, SSH is probably not the primary issue.

Finally, look for guest-level blocking:

dmesg -T | grep -Ei 'blocked for more than|hung task|nfs|not responding|I/O error|reset|timeout'
journalctl -k --since -2h
ps -eo pid,ppid,stat,wchan:32,comm,args | awk '$3 ~ /D/ {print}'
cat /proc/loadavg
cat /proc/pressure/io
iostat -xz 1 5

Filesystem commands deserve special care because they can hang when remote mounts are unhealthy. Wrap them with timeout during incident response:

findmnt
timeout 5 df -hT
timeout 5 ls -la /home
timeout 5 stat /home/user

This workflow helps separate “network cannot reach the host” from “the guest is reachable but not making progress.” That distinction prevents false fixes and makes the postmortem more accurate.

Monitoring and guardrails

Weak signals include instance state, ping success, an open port 22, low CPU usage, normal disk capacity, or a correct-looking cloud route table. Those signals are still useful, but they are incomplete. Stronger signals include synthetic SSH command success, D-state task count, kernel hung task warnings, NFS timeout messages, I/O PSI, disk latency, filesystem operation latency, login latency, PAM/SSSD/DNS latency, and serial console kernel logs.

For Prometheus-style host monitoring, the useful categories are not only CPU and disk fullness. For this pattern, the more interesting signals are load, disk I/O time, filesystem state, and pressure metrics such as:

node_load*
node_disk_io_time_seconds_total
node_disk_read_time_seconds_total
node_disk_write_time_seconds_total
node_filesystem_avail_bytes
node_filesystem_readonly
node_pressure_io_waiting_seconds_total
node_pressure_memory_waiting_seconds_total
node_pressure_cpu_waiting_seconds_total

Metric names vary by node exporter version and enabled collectors, so treat this as a monitoring checklist rather than a copy-paste dashboard.

This class of incident also benefits from small automation. A lightweight triage script can collect safe, non-identifying evidence while the system is degraded:

#!/usr/bin/env bash
set -o pipefail

echo "## Time"
date -Is

echo "## Uptime/load"
uptime
cat /proc/loadavg

echo "## PSI"
for f in /proc/pressure/{cpu,memory,io}; do
  echo "# $f"
  cat "$f" 2>/dev/null || true
done

echo "## sshd"
systemctl --no-pager --full status sshd 2>/dev/null || true
ss -lntp 2>/dev/null | grep -E '(:22)\b' || true

echo "## D-state tasks"
ps -eo pid,ppid,stat,wchan:32,comm,args | awk '$3 ~ /D/ {print}'

echo "## Kernel warnings"
journalctl -k --since -2h --no-pager 2>/dev/null | \
  grep -Ei 'blocked for more than|hung task|nfs|not responding|timeout|I/O error' || true

echo "## Mounts"
findmnt -t nfs,nfs4,cifs,fuse 2>/dev/null || true
timeout 5 df -hT || echo "df timed out"

The goal is not to expose sensitive incident data. The goal is to capture the shape of the failure while the system is still in the failed state.

Remote filesystem guardrails are equally important. Ask whether a mount is required for boot, whether it should be mounted on demand, whether it should fail independently, whether systemd automount should be used, and whether application startup or SSH login should depend on it. These are platform design questions, not only Linux tuning questions.

A TCP port check is also not enough for administrative access monitoring. A synthetic SSH check should prove that a command can actually run:

ssh -o BatchMode=yes \
    -o ConnectTimeout=5 \
    -o ServerAliveInterval=2 \
    -o ServerAliveCountMax=2 \
    monitor@host 'echo ok'

A successful TCP connection proves that the listener accepted a connection. A synthetic command proves more of the access path.
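For monitoring, the probe also needs a hard upper bound and a machine-readable result. ConnectTimeout only bounds the connection phase, so a login that authenticates and then blocks during session setup may not be cut off by it. The sketch below wraps the probe in an outer timeout and classifies the exit status. The function names, the 15-second bound, and the result labels are hypothetical; the sketch relies on GNU timeout returning 124 on expiry and ssh returning 255 on its own errors.

```shell
# Translate the probe's exit status into a result label.
classify_probe_exit() {
  case "$1" in
    0) echo "ok: command ran" ;;
    124) echo "hung: probe exceeded the outer timeout" ;;
    255) echo "ssh-error: connection or authentication failed" ;;
    *) echo "remote-exit: remote command returned $1" ;;
  esac
}

# Run the synthetic check against one target such as monitor@host.
ssh_probe() {
  timeout 15 ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" 'echo ok' >/dev/null 2>&1
  classify_probe_exit $?
}

# Usage:
#   ssh_probe monitor@host
```

The "hung" result is the one a port check can never produce. It is the case where the listener answered and the session still never became usable.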