The Cloud Route Table Is Not the Route Table
The incident started with a familiar symptom: SSH timeout.
- Not “permission denied.”
- Not “connection refused.”
- Not a host key warning.
Just a timeout.
That detail mattered. A timeout means the client sent packets and heard nothing back: no reset, no rejection, no banner. It does not say where the silence began.
At first, the obvious checks did not explain it. The instance was running. The cloud console showed the expected network attachment. The private IP was present. Security rules allowed access. The subnet route table looked correct. There was no immediate sign of a failed service, a missing route, or a broken instance.
That made the problem more interesting, not less.
If the cloud network looked correct, and the instance was alive, why did the connection still disappear?
The easy explanation was to keep calling it an SSH issue. But SSH was only where the symptom appeared. The failure could have been lower in the path: before the packet reached the server, inside the guest network stack, or on the way back to the client.
So the investigation had to become narrower.
Could the packet reach the instance? Could Linux see it? Could the service respond? And most importantly: when Linux responded, which path would the reply take?
That last question changed the shape of the problem.
Cloud Routing Explains Ingress, Not the Whole Conversation
Cloud networking gives us a clean mental model.
There are subnets, route tables, gateways, security rules, private IPs, public IPs, and virtual NICs. The model is useful because it makes infrastructure look legible. You can point to a route table and say traffic should go this way. You can point to a security rule and say TCP/22 is allowed. You can point to a VNIC and say this IP belongs to this instance.
But a cloud route table does not fully describe the packet’s life inside the guest OS.
It may explain how traffic reaches the instance. It does not prove how the instance sends traffic back.
That distinction becomes important when a Linux instance has more than one interface, more than one private IP, or more than one possible route to the same destination. The cloud control plane can attach the interface and assign the address, but the guest operating system still has to configure the interface, install routes, assign metrics, select source addresses, and decide which path a reply packet should use.
Those decisions are made by Linux, not by the cloud route table.
This is where many investigations drift in the wrong direction. The external view looks consistent. The subnet has a route. The security rule permits the port. The instance has the expected private IP. The service may even be listening.
But none of that answers the return-path question.
A connection is not complete when the packet arrives. It is complete when the response returns to the client through a path the client accepts.
In a single-interface system, this detail is easy to ignore. There is usually only one reasonable way out. In a multi-interface system, Linux may have several possible answers. A secondary VNIC, a changed NetworkManager profile, a new default route, or a different route metric can quietly change the preferred path.
Nothing has to crash for this to fail.
The system can be alive, configured, and still wrong.
The Investigation Turns on One Question
The investigation changed once the focus moved away from “is SSH available?” and toward “what would Linux do with the reply?”
That is a different class of question.
Checking sshd tells you whether the service is running. Checking security rules tells you whether the cloud edge is expected to allow the connection. Checking the cloud route table tells you how the subnet should forward traffic.
But checking route selection tells you how the kernel will actually behave for a specific destination.
The most useful command in this situation is often:
ip route get <client-ip>
This does not show a generic diagram. It asks the Linux kernel a direct question:
If this machine sends traffic to that client right now, which route, interface, and source address will it use?
The answer may include the selected gateway, device, and source IP, for example:
<client-ip> via <gateway> dev <interface> src <source-ip>
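That output can immediately narrow the investigation. When policy routing is configured, the variant ip route get <client-ip> from <server-ip> asks the same question for a reply sourced from the address the client connected to.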
If the client is connecting to an address on one interface, but Linux says the reply will leave through another interface, the timeout starts to make sense. The packet may have reached the server. The service may have responded. But the response did not follow the expected path back.
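To make that comparison mechanical, the output can be parsed and checked against the interface the client is connecting to. The following is a minimal sketch, not a definitive tool: the function names, the sample output, the documentation-range addresses, and the interface names ens3 and ens5 are all assumptions made for illustration.

```python
# Route types that `ip route get` may print before the destination address.
ROUTE_TYPES = {"local", "broadcast", "multicast", "unreachable", "prohibit", "blackhole"}


def parse_route_get(output: str) -> dict:
    """Extract destination, gateway, device and source from `ip route get` output."""
    lines = output.strip().splitlines()
    if not lines:
        raise ValueError("empty ip route get output")
    tokens = lines[0].split()
    if tokens[0] in ROUTE_TYPES:
        tokens = tokens[1:]  # drop the leading route type, e.g. "local"
    route = {"dst": tokens[0], "via": None, "dev": None, "src": None}
    for key in ("via", "dev", "src"):
        # Only accept the keyword if a value follows it.
        if key in tokens[:-1]:
            route[key] = tokens[tokens.index(key) + 1]
    return route


def leaves_other_interface(route: dict, ingress_dev: str) -> bool:
    """True when the kernel's egress device differs from the ingress interface."""
    return route["dev"] is not None and route["dev"] != ingress_dev


# Hypothetical output: the client connects to an address on ens3,
# but the kernel would answer through ens5.
sample = "203.0.113.10 via 10.0.1.1 dev ens5 src 10.0.1.20 uid 0\n    cache\n"
route = parse_route_get(sample)
print(route)
print(leaves_other_interface(route, ingress_dev="ens3"))
```

The parser reads only the first line of output and ignores fields it does not know, so it should be treated as a starting point and checked against the output format of the iproute2 version in use.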
At that point, the incident is no longer primarily about SSH.
It is about asymmetric routing.
The useful supporting checks are simple, but they need to be interpreted as a path, not as isolated facts.
- ip addr: which addresses are assigned to which interfaces.
- ip route: default routes, subnet routes, gateways, and metrics.
- ip rule: whether policy routing is involved.
- nmcli -f NAME,DEVICE connection show: which NetworkManager profile is bound to which device.
- nmcli -f ipv4.route-metric connection show <profile-name>: the route metric configured on a profile, which is how NetworkManager may be influencing route preference.
- ip neigh show: neighbor resolution state.
- tcpdump -ni <interface-a> host <client-ip>: whether packets arrive on one interface.
- tcpdump -ni <interface-b> host <client-ip>: whether replies leave somewhere else.
That last comparison is often decisive. A request arriving on one interface and a reply leaving through another is not a theory. It is the packet path explaining the symptom.
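The two captures can also be compared programmatically. The sketch below assumes the default one-line IPv4 text format that tcpdump -n prints ("IP src.port > dst.port:"). The capture lines, addresses, and interface names are invented for illustration, and the function names are hypothetical.

```python
import re

# Matches the "IP <src>.<port> > <dst>.<port>:" part of a tcpdump -n line (IPv4 only).
TCPDUMP_IPV4 = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def directions_seen(lines, client_ip):
    """Return which directions ("request", "reply") appear in one capture."""
    seen = set()
    for line in lines:
        match = TCPDUMP_IPV4.search(line)
        if not match:
            continue  # skip non-IPv4 or unparsed lines
        src, dst = match.group(1), match.group(3)
        if src == client_ip:
            seen.add("request")
        elif dst == client_ip:
            seen.add("reply")
    return seen


def find_asymmetry(captures, client_ip):
    """Given {interface: capture lines}, list ingress and egress interfaces."""
    ingress, egress = [], []
    for iface, lines in captures.items():
        seen = directions_seen(lines, client_ip)
        if "request" in seen:
            ingress.append(iface)
        if "reply" in seen:
            egress.append(iface)
    return sorted(ingress), sorted(egress)


# Hypothetical captures: the SYN arrives on ens3, the SYN-ACK leaves on ens5.
captures = {
    "ens3": ["12:00:00.000001 IP 203.0.113.10.51514 > 10.0.0.20.22: Flags [S], seq 1, win 64240, length 0"],
    "ens5": ["12:00:00.000090 IP 10.0.0.20.22 > 203.0.113.10.51514: Flags [S.], seq 2, ack 2, win 65160, length 0"],
}
ingress, egress = find_asymmetry(captures, "203.0.113.10")
print("requests arrive on:", ingress)
print("replies leave on:", egress)
```

If the two lists differ, the capture is showing the asymmetric path directly rather than implying it.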
The timeout was not silent. The investigation was just listening at the wrong layer.
The Actual Failure Pattern
The failure pattern is straightforward once you see it.
A client connects to an IP address associated with interface A.
The cloud network delivers the packet to the instance.
Linux receives the packet.
The service is available and can respond.
Then the kernel chooses the route back to the client.
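Because of route metrics, a default route, policy rules, or interface configuration, Linux selects interface B as the return path. For a TCP reply, the source address is still the one the client connected to, but the packet now leaves through an interface that does not own that address. A cloud network may drop such a packet at the VNIC, and even if it is forwarded, it returns along a path that nothing upstream was expecting.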
From the server’s point of view, this may be a valid routing decision.
From the client’s point of view, the connection disappears.
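Route metrics are one common way this happens: when two default routes exist, the kernel prefers the one with the lower metric. The sketch below illustrates that ordering by parsing ip route style text. The sample routes, addresses, and interface names are hypothetical, and it deliberately ignores policy rules, more specific prefixes, and multipath routes, all of which can change the real outcome.

```python
def value_after(tokens, key):
    """Return the token following `key`, or None if the key is absent."""
    return tokens[tokens.index(key) + 1] if key in tokens[:-1] else None


def default_routes(ip_route_output: str) -> list:
    """Return default routes from `ip route` text, lowest metric first."""
    routes = []
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue  # only default routes matter for this comparison
        metric = value_after(tokens, "metric")
        routes.append({
            "via": value_after(tokens, "via"),
            "dev": value_after(tokens, "dev"),
            # A route printed without a metric has metric 0.
            "metric": int(metric) if metric is not None else 0,
        })
    return sorted(routes, key=lambda r: r["metric"])


# Hypothetical table after a secondary interface (ens5) was attached
# and its profile ended up with a lower metric than the primary (ens3).
sample = """\
default via 10.0.0.1 dev ens3 proto dhcp src 10.0.0.20 metric 100
default via 10.0.1.1 dev ens5 proto dhcp src 10.0.1.20 metric 50
10.0.0.0/24 dev ens3 proto kernel scope link src 10.0.0.20 metric 100
10.0.1.0/24 dev ens5 proto kernel scope link src 10.0.1.20 metric 50
"""
routes = default_routes(sample)
print("preferred default route:", routes[0])
```

In this invented table, a client outside both subnets that connects to the ens3 address would be answered through ens5, which is the pattern described above.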
This is why the early checks can be misleading. The instance is not necessarily down. SSH is not necessarily broken. The cloud route table is not necessarily wrong. The security rule is not necessarily missing.
The failure is in the relationship between ingress and egress.
In cloud environments, this often appears after secondary interfaces are attached, routes are modified, systems are patched, network profiles are regenerated, or instances are moved between images and configurations. The trigger may look administrative, but the reliability impact is real: management access fails, application traffic becomes inconsistent, or a host appears unreachable even though several layers are technically healthy.
The uncomfortable part is that the machine may be doing exactly what it was configured to do.
That is why this class of issue is harder than a simple outage. A broken service is obvious. A missing firewall rule is usually visible. But an unintended route preference can hide behind a mostly healthy system.
The phrase “the route table looks correct” also becomes too vague.
Which route table?
The cloud subnet route table?
The Linux main routing table?
A policy routing table?
A route installed by NetworkManager?
The route selected for this exact destination?
In multi-interface incidents, “the route table” is not one object. It is a sequence of decisions across layers.
The Reliability Lesson
The lesson is not only to remember ip route get, although it deserves a permanent place in the troubleshooting toolbox.
The bigger lesson is that reachability must be validated as a full path.
A system is not reachable only because the instance is running. It is not reachable only because a VNIC is attached. It is not reachable only because TCP/22 is allowed. It is not reachable only because sshd is listening.
Those are useful checks, but they are not the complete story.
A connection succeeds when the full path works: client to cloud network, cloud network to VNIC, VNIC to guest OS, guest OS to service, service back to kernel, kernel through the correct route, and response back to the client.
For SRE and platform teams, this changes how multi-NIC systems should be designed and operated.
- Avoid same-subnet multi-interface designs unless there is a clear reason and the routing model is documented.
- Use separate subnets when traffic classes are genuinely separate.
- Use source-based policy routing when different interfaces must handle different traffic paths.
- Validate route metrics after reboots, patching, image changes, VNIC attachment, and NetworkManager profile updates.
- Add post-change checks that compare the expected egress interface and source IP against actual ip route get output.
- Monitor for failed neighbor entries, unexpected default routes, and route metric drift.
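As an illustration of the source-based policy routing recommendation, the sketch below generates the two iproute2 commands that send replies sourced from a secondary address out through that interface's own gateway. It only builds the command strings and does not run them. The function name, table number, addresses, and interface name are assumptions, and a real deployment would also need to make the configuration persistent, for example through NetworkManager.

```python
import ipaddress


def policy_routing_commands(src_ip: str, gateway: str, dev: str, table: int) -> list:
    """Build iproute2 commands for source-based routing of one address."""
    # Validate the addresses early; ip_address raises ValueError on bad input.
    ipaddress.ip_address(src_ip)
    ipaddress.ip_address(gateway)
    return [
        # A dedicated table whose default route uses this interface's gateway.
        f"ip route add default via {gateway} dev {dev} table {table}",
        # Traffic sourced from this address consults that table first.
        f"ip rule add from {src_ip} table {table}",
    ]


# Hypothetical secondary interface ens5 holding 10.0.1.20.
for command in policy_routing_commands("10.0.1.20", "10.0.1.1", "ens5", 101):
    print(command)
```

With rules like these in place, ip rule should list the new entry, and ip route get <client-ip> from 10.0.1.20 should report ens5 as the egress device.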
At fleet scale, this should become a guardrail, not a troubleshooting trick remembered by one engineer. A small validation script after provisioning can prevent a long incident later. A runbook that asks “which route will Linux use to reply?” can stop an investigation from spending hours on SSH, firewall rules, and cloud route tables when the real issue is guest route selection.
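A post-provisioning check along those lines might look like the following sketch. It assumes the iproute2 ip command is available on the host. The Expectation type, the function names, and the sample values are hypothetical. The command runner is injectable so that the comparison logic can be exercised without touching a real network stack.

```python
import subprocess
from typing import Callable, NamedTuple


class Expectation(NamedTuple):
    client_ip: str  # destination to ask the kernel about
    dev: str        # interface the traffic is expected to leave through
    src: str        # source address the kernel is expected to select


def run_route_get(dst: str) -> str:
    """Ask the kernel which route it would use for dst (requires iproute2)."""
    result = subprocess.run(
        ["ip", "route", "get", dst], capture_output=True, text=True, check=True
    )
    return result.stdout


def value_after(tokens, key):
    """Return the token following `key`, or None if the key is absent."""
    return tokens[tokens.index(key) + 1] if key in tokens[:-1] else None


def check_routes(expectations, runner: Callable[[str], str] = run_route_get) -> list:
    """Return one message per expectation the kernel's answer does not meet."""
    problems = []
    for exp in expectations:
        lines = runner(exp.client_ip).strip().splitlines()
        tokens = lines[0].split() if lines else []
        actual = (value_after(tokens, "dev"), value_after(tokens, "src"))
        if actual != (exp.dev, exp.src):
            problems.append(
                f"{exp.client_ip}: expected dev={exp.dev} src={exp.src}, "
                f"kernel chose dev={actual[0]} src={actual[1]}"
            )
    return problems


# Demonstration with canned output instead of a live `ip` call.
def canned(dst: str) -> str:
    return f"{dst} via 10.0.1.1 dev ens5 src 10.0.1.20 uid 0\n    cache\n"


for problem in check_routes([Expectation("203.0.113.10", "ens3", "10.0.0.20")], runner=canned):
    print(problem)
```

An empty result means every expectation matched. Anything else is a reason to stop and look at route selection before the host goes into service.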
The cloud route table is important.
It tells part of the story.
But it is not the whole story.
When a Linux instance has more than one possible path, the final answer is inside the guest OS. The question is not only whether the cloud can deliver the packet.
The question is whether Linux knows how to send it back.