When Routing Is Correct but Packets Still Disappear

A production network incident often begins with a sentence that sounds simple:

“The server cannot reach the peer.”

That sentence is almost never simple.

It could be a route problem. It could be a firewall problem. It could be a subnet problem, a security-list problem, a bad gateway, a dead interface, a broken bond, an asymmetric return path, a DNS mistake, or a service that is not actually listening.

In one incident pattern, the first clues pointed toward routing. Several Linux nodes had connectivity problems after a maintenance window. The affected traffic used a dedicated interface, the expected routes appeared to exist, and policy routing had already been reviewed. One node worked. Other nodes did not.

That comparison changed the investigation.

The route was there.

The neighbor was not.

The route only tells Linux where to try

When we troubleshoot Linux connectivity, it is natural to start at Layer 3.

We ask:

Is the destination reachable through the expected interface?

Does ip route get <peer> select the right path?

Are the ip rule entries ordered correctly?

Is the source address correct?

Is the gateway correct?

Those are valid questions. But they only answer part of the problem.

A route lookup tells the kernel which egress interface and next hop should be used. It does not prove that the next hop is actually reachable on the local link. Before an IPv4 packet can leave an Ethernet interface, Linux still needs a link-layer address. That mapping lives in the neighbor table, historically exposed as the ARP table for IPv4. The ip neigh command is the modern interface for inspecting and manipulating these neighbor entries.

That distinction matters in production.

A route can be perfectly correct while packet delivery still fails one layer below it.

The operating system may know:

Send traffic for this destination out this interface.

But it may not know:

Which MAC address should receive the Ethernet frame?

When that second question fails, the symptom can look deceptively similar to a routing or firewall problem.
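The gap between "which interface" and "which MAC address" can be made concrete with a small sketch. It assumes the usual text layout of ip route get output; the function names, the sample addresses (documentation ranges), and the interface name eth1 are illustrative and not taken from the incident.

```python
import ipaddress


def _is_ip(token):
    """Return True if the token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def parse_route_get(output):
    """Extract destination, gateway, device and source from `ip route get` text."""
    lines = output.strip().splitlines()
    tokens = lines[0].split() if lines else []
    # A route type such as "local" or "broadcast" may precede the destination.
    if tokens and not _is_ip(tokens[0]):
        tokens = tokens[1:]
    if not tokens:
        raise ValueError("empty `ip route get` output")
    route = {"dst": tokens[0], "via": None, "dev": None, "src": None}
    # Walk adjacent token pairs and keep the keywords we care about.
    for key, value in zip(tokens[1:], tokens[2:]):
        if key in ("via", "dev", "src"):
            route[key] = value
    return route


def address_to_resolve(route):
    """The neighbor Linux must resolve: the gateway if any, else the peer itself."""
    return route["via"] or route["dst"]


# Hypothetical sample outputs: one on-link peer, one reached through a gateway.
ON_LINK = "192.0.2.10 dev eth1 src 192.0.2.5 uid 0 \n    cache "
VIA_GATEWAY = "198.51.100.7 via 192.0.2.1 dev eth1 src 192.0.2.5 uid 0 \n    cache "

for sample in (ON_LINK, VIA_GATEWAY):
    route = parse_route_get(sample)
    print(route["dst"], "leaves via", route["dev"],
          "and needs a MAC address for", address_to_resolve(route))
```

The route lookup ends once the device and next hop are known. The address returned by address_to_resolve is the one that must then appear, with a real link-layer address, in the neighbor table.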

The clue was not in the route table

The useful evidence came from comparing a working node with failing nodes.

On the working node, the relevant peer addresses had neighbor entries with real link-layer addresses. Their states were not all permanently REACHABLE, and that was fine. Some entries were STALE, which does not automatically mean broken. A STALE neighbor entry can still contain a valid link-layer address; it means the entry is valid but its reachability may need to be confirmed again.

On the failing nodes, the same peer addresses did not resolve.

The neighbor entries were in FAILED state, or the legacy ARP view showed an all-zero hardware address. In practical terms, Linux had tried to resolve the next hop and did not have a usable MAC address.

A sanitized version of the evidence looked like this:

| Check | Working node | Failing node |
| --- | --- | --- |
| Interface up | yes | yes |
| Expected route present | yes | yes |
| Policy rule suspected | reviewed | reviewed |
| Peer in same network segment | yes | yes |
| Neighbor entry | MAC address present | no usable MAC address |
| Neighbor state | STALE / usable | FAILED |
| Connectivity | succeeds | fails |

That table is the heart of the investigation.

It moves the question away from:

“Why is Linux choosing the wrong route?”

and toward:

“Why can this node not resolve the peer at Layer 2?”

That is a different failure domain.

It may involve VLAN placement, switch port policy, bond path behavior, upstream filtering, dynamic ARP inspection, private VLAN behavior, port security, or a fabric-side inconsistency. The exact cause depends on the environment, but the Linux-side evidence is already meaningful: route selection is not the failing step. Neighbor resolution is.
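The distinction between a STALE entry that still carries a MAC address and a FAILED entry that carries none can be checked mechanically. The following is a minimal sketch, assuming the plain-text output of ip neigh show for a single interface; the helper names and sample entries are hypothetical, and the parser keys entries by address only, so it is not suitable for output that mixes several interfaces.

```python
# States in which the kernel holds a link-layer address it can transmit to.
USABLE_STATES = {"PERMANENT", "REACHABLE", "STALE", "DELAY", "PROBE"}
KNOWN_STATES = USABLE_STATES | {"NOARP", "INCOMPLETE", "FAILED"}
ZERO_MAC = "00:00:00:00:00:00"


def parse_neigh(output):
    """Parse `ip neigh show` text into {address: {dev, lladdr, state}}."""
    entries = {}
    for line in output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # The state is normally the last keyword on the line.
        state = next((t for t in reversed(tokens) if t in KNOWN_STATES), None)
        entry = {"dev": None, "lladdr": None, "state": state}
        for key, value in zip(tokens[1:], tokens[2:]):
            if key in ("dev", "lladdr"):
                entry[key] = value
        entries[tokens[0]] = entry
    return entries


def is_usable(entry):
    """True when the entry has a real MAC address and a transmit-capable state."""
    return (
        entry is not None
        and entry["lladdr"] not in (None, ZERO_MAC)
        and entry["state"] in USABLE_STATES
    )


# Hypothetical samples; `dev` is omitted when the command filters by interface.
WORKING = """\
192.0.2.10 lladdr 52:54:00:12:34:56 STALE
192.0.2.11 lladdr 52:54:00:12:34:57 REACHABLE
"""
FAILING = """\
192.0.2.10 dev eth1  FAILED
192.0.2.11 dev eth1  INCOMPLETE
"""

for label, text in (("working", WORKING), ("failing", FAILING)):
    for address, entry in parse_neigh(text).items():
        print(label, address, entry["state"], "usable" if is_usable(entry) else "not usable")
```

Treating STALE as usable is deliberate: it matches the observation above that a STALE entry can still hold a valid link-layer address.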

A failed neighbor is stronger evidence than a vague timeout

Timeouts are weak evidence.

A timeout tells us that a response did not come back within a certain period. It does not tell us where the request died.

A failed neighbor entry is stronger.

It says Linux could not establish the local link-layer mapping required to transmit to that peer or next hop.

That does not mean every FAILED entry is automatically a network-team problem. Linux can contribute to ARP failure through local configuration issues too.

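Examples include a wrong prefix length that makes an off-link peer look on-link, an interface attached to the wrong VLAN or missing its VLAN tag, a bond member cabled into a different segment, and local arptables or nftables rules that drop ARP.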

But in this incident pattern, the comparison mattered. The same intended network, same class of nodes, same peer set, and same routing expectation produced different neighbor behavior. One node could resolve peers. Others could not.

That is why comparative evidence is so powerful. A single broken node can mislead you into guessing. A working peer gives you a control sample.
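The control-sample idea can be sketched as a comparison of two nodes' neighbor views for the same peer set. This is an illustration under stated assumptions: each node's view is a dictionary mapping a peer address to a (lladdr, state) pair collected by some other means, and the function names and sample data are hypothetical.

```python
ZERO_MAC = "00:00:00:00:00:00"
UNRESOLVED_STATES = ("FAILED", "INCOMPLETE")


def resolves(entry):
    """entry is a (lladdr, state) pair, or None when the node has no entry."""
    if entry is None:
        return False
    lladdr, state = entry
    return bool(lladdr) and lladdr != ZERO_MAC and state not in UNRESOLVED_STATES


def compare_nodes(control, suspect, peers):
    """Group peers by which of the two nodes can resolve them."""
    report = {"suspect_only": [], "control_only": [], "both_fail": [], "both_ok": []}
    for peer in peers:
        ok_control = resolves(control.get(peer))
        ok_suspect = resolves(suspect.get(peer))
        if ok_control and ok_suspect:
            report["both_ok"].append(peer)
        elif ok_control:
            # The control resolves the peer and the suspect does not:
            # this is the comparative evidence worth handing off.
            report["suspect_only"].append(peer)
        elif ok_suspect:
            report["control_only"].append(peer)
        else:
            # Neither node resolves the peer, so the peer itself may be down.
            report["both_fail"].append(peer)
    return report


# Hypothetical views from a working node and a failing node.
control_view = {
    "192.0.2.10": ("52:54:00:12:34:56", "STALE"),
    "192.0.2.11": ("52:54:00:12:34:57", "REACHABLE"),
    "192.0.2.12": (None, "FAILED"),
}
suspect_view = {
    "192.0.2.10": (None, "FAILED"),
    "192.0.2.11": ("52:54:00:12:34:57", "STALE"),
    "192.0.2.12": (None, "FAILED"),
}

print(compare_nodes(control_view, suspect_view, ["192.0.2.10", "192.0.2.11", "192.0.2.12"]))
```

Peers listed under suspect_only are the interesting ones. A peer under both_fail says little about the suspect node, because the control cannot resolve it either.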

The diagnostic workflow should be boring and layered:

  1. Start with ip route get <peer>.
  2. Confirm the selected egress interface and source address.
  3. Check ip rule show only if policy routing is involved.
  4. Inspect the neighbor table with ip neigh show dev <interface>.
  5. Compare a working node and a failing node.
  6. Use tcpdump -i <interface> arp to determine whether ARP requests leave and whether replies return.

The last step is especially important. If ARP requests leave the Linux host and no replies return, the investigation has crossed an operational boundary. At that point, the useful handoff is not “network issue.” The useful handoff is:

“This host sends ARP requests for this peer on this interface. No ARP replies are observed. A comparable node on the same intended network resolves the peer successfully.”

That is a precise statement. It is falsifiable. It gives the network or platform team something concrete to verify.
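Producing that handoff from a capture can be sketched as follows. The sketch assumes tcpdump was run with -n so addresses are numeric, and that ARP lines follow the common "Request who-has ... tell ..." and "Reply ... is-at ..." wording; the function names, sample capture, and message wording are illustrative.

```python
import re

# Patterns for the usual tcpdump ARP summary lines (numeric output assumed).
REQUEST = re.compile(r"ARP, Request who-has (\S+?)(?: \(\S+\))? tell (\S+?),")
REPLY = re.compile(r"ARP, Reply (\S+) is-at")


def count_arp(capture, host_ip, peer_ip):
    """Count requests this host sent for the peer and replies naming the peer."""
    requests = replies = 0
    for line in capture.splitlines():
        request = REQUEST.search(line)
        if request and request.group(1) == peer_ip and request.group(2) == host_ip:
            requests += 1
            continue
        reply = REPLY.search(line)
        if reply and reply.group(1) == peer_ip:
            replies += 1
    return requests, replies


def handoff(host_ip, peer_ip, dev, requests, replies):
    """Turn the counts into a statement another team can verify."""
    if requests == 0:
        return (f"No ARP requests for {peer_ip} were seen leaving {dev}; "
                "keep investigating on the host.")
    if replies == 0:
        return (f"Host {host_ip} sent {requests} ARP requests for {peer_ip} "
                f"on {dev}. No ARP replies were observed.")
    return (f"ARP replies for {peer_ip} arrive on {dev}; "
            "check why the host does not keep the entry.")


# Hypothetical capture taken on the failing node.
CAPTURE = """\
12:00:01.000000 ARP, Request who-has 192.0.2.10 tell 192.0.2.5, length 28
12:00:02.000000 ARP, Request who-has 192.0.2.10 tell 192.0.2.5, length 28
12:00:03.000000 ARP, Request who-has 192.0.2.10 tell 192.0.2.5, length 28
"""

sent, received = count_arp(CAPTURE, "192.0.2.5", "192.0.2.10")
print(handoff("192.0.2.5", "192.0.2.10", "eth1", sent, received))
```

The three branches map to three different owners: no requests points back at the host, requests without replies points at the link or fabric, and replies that arrive but do not stick point at host-side filtering or configuration.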

The trap after maintenance windows

The timing of an incident can distort the investigation.

When something breaks after patching, the operating system becomes the first suspect. That is reasonable, but it can also create tunnel vision. A maintenance window is rarely a single change. It may include reboot ordering, interface renegotiation, bond failover, switch reconvergence, VM movement, hypervisor-side events, security updates, driver changes, or network policy refreshes.

The last visible change is not always the root cause.

In this kind of incident, the temptation is to say:

“Patching broke networking.”

A better statement is:

“After the maintenance window, some nodes could no longer resolve required peers at Layer 2, while at least one comparable node could. Linux route selection did not explain the difference.”

That wording is more disciplined. It avoids false causality while preserving the operational sequence.

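This is also why post-maintenance validation should not stop at service reachability. A service check can tell you that something is broken, but not why. For critical cluster or east-west paths, validation should include the substrate: interface and bond state, route selection toward each critical peer, and neighbor resolution for those peers and their gateways.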

A Linux server can have the right IP configuration and still be unable to put a frame on the wire for the intended peer.

What to monitor before the next incident

Most monitoring systems are good at detecting that a host is unreachable. Fewer are good at explaining the layer at which reachability failed.

That is where neighbor-state checks can help, especially for critical private interconnects, storage paths, replication networks, and cluster heartbeat networks.

This does not mean alerting on every STALE neighbor. That would be noisy and technically wrong. STALE can be normal. The better signal is persistent failure for known critical peers or gateways.

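A useful guardrail might check whether any known critical peer or gateway stays unresolved (FAILED or INCOMPLETE) across several consecutive samples, rather than reacting to a single observation.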

For dashboards, neighbor state should be correlated with interface and bond telemetry. Interface carrier, packet drops, errors, and bond state show whether the path is physically and logically healthy. The neighbor table adds a missing piece: whether the host can resolve specific local-link dependencies.
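The "persistent failure for known critical peers" signal can be sketched as a consecutive-sample counter. The class name, the threshold of three samples, and the peer addresses are assumptions chosen for illustration; the states fed to it would come from whatever collector already samples the neighbor table.

```python
from collections import defaultdict

FAILURE_STATES = ("FAILED", "INCOMPLETE")


class NeighborFailureTracker:
    """Alert only when a critical peer stays unresolved for several samples."""

    def __init__(self, critical_peers, threshold=3):
        self.critical = set(critical_peers)
        self.threshold = threshold
        self.streak = defaultdict(int)

    def observe(self, states):
        """Record one sample ({peer: state}) and return the peers now alerting."""
        alerts = []
        for peer in sorted(self.critical):
            state = states.get(peer)
            if state in FAILURE_STATES:
                self.streak[peer] += 1
            elif state is not None:
                # Any other state, STALE included, counts as resolved.
                self.streak[peer] = 0
            # A missing entry is treated as no evidence: the streak is unchanged.
            if self.streak[peer] >= self.threshold:
                alerts.append(peer)
        return alerts


# Hypothetical sequence of samples for one critical peer.
tracker = NeighborFailureTracker(["192.0.2.10"], threshold=3)
for sample in ("STALE", "FAILED", "FAILED", "FAILED", "REACHABLE"):
    print(sample, tracker.observe({"192.0.2.10": sample}))
```

Non-critical peers are ignored entirely, which keeps the check quiet on hosts whose neighbor tables contain many short-lived entries.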

The automation idea is simple:

For each critical peer, generate traffic, inspect route selection, inspect neighbor resolution, and report the failing layer.
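A minimal sketch of that loop for one peer is shown below. It assumes a Linux host with ping and ip on the PATH; the function names, the expected-interface argument, and the wording of the returned messages are hypothetical. The command runner is injectable so the decision logic can be exercised without touching a real network.

```python
import subprocess


def run_command(argv):
    """Run a command and return whatever it wrote to stdout."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


def failing_layer(peer, expected_dev, run=run_command):
    """Generate traffic, then report which layer fails first for this peer."""
    # 1. Generate traffic so the kernel attempts neighbor resolution.
    run(["ping", "-c", "1", "-W", "1", peer])

    # 2. Route selection: which interface and next hop does the kernel choose?
    tokens = run(["ip", "route", "get", peer]).split()
    fields = dict(zip(tokens, tokens[1:]))
    dev = fields.get("dev")
    if dev != expected_dev:
        return f"Route selected {dev}, expected {expected_dev}: routing is the failing layer."
    next_hop = fields.get("via", peer)

    # 3. Neighbor resolution for the next hop on that interface.
    neigh = run(["ip", "neigh", "show", "to", next_hop, "dev", dev]).split()
    if "lladdr" not in neigh or "FAILED" in neigh or "INCOMPLETE" in neigh:
        return "Route selected the expected interface, but neighbor resolution failed."

    return "Route and neighbor resolution look healthy; check filtering and the service next."


def fake_runner(outputs):
    """Build a stand-in runner that returns canned `ip route` / `ip neigh` text."""
    def run(argv):
        return outputs.get(argv[1], "") if argv[0] == "ip" else ""
    return run


# Hypothetical canned output reproducing the failing-node pattern.
canned = {
    "route": "192.0.2.10 dev eth1 src 192.0.2.5 uid 0",
    "neigh": "192.0.2.10 FAILED",
}
print(failing_layer("192.0.2.10", "eth1", run=fake_runner(canned)))
```

On a real host the default runner would be used instead of fake_runner. An empty neighbor lookup is reported as a resolution failure here, which is a simplification: immediately after traffic has been generated, a missing entry is unusual enough to be worth flagging.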

The runbook should not say:

“Ping failed.”

It should say:

“Route selected the expected interface, but neighbor resolution failed.”

That one sentence can save hours.

Because the real lesson is not about ARP as a command. It is about evidence hierarchy.

In Linux networking incidents, routing correctness is necessary, but not sufficient. The route can be right, the interface can be up, the firewall can be innocent, and the packet can still disappear because the kernel never learns a MAC address to put on the frame.

Sometimes the most important clue is not the route Linux chose.

It is the neighbor Linux could not find...