Skip to content

Suppress errors from gateway re-config when disconnecting a network#51592

Merged
robmry merged 3 commits intomoby:masterfrom
robmry:sbleave_gw_config_error
Nov 26, 2025
Merged

Suppress errors from gateway re-config when disconnecting a network#51592
robmry merged 3 commits intomoby:masterfrom
robmry:sbleave_gw_config_error

Conversation

@robmry
Copy link
Contributor

@robmry robmry commented Nov 25, 2025

- What I did

When a network is disconnected from a container, if that network provides the container's default gateway, another gateway is selected from the remaining network endpoints (if there are any). That selection happens after the endpoint has been removed, but there's no proper rollback if the newly selected gateway doesn't work.

- How I did it

The first commit skips selection of a new gateway if the container is being deleted. So setup of a new gateway can't fail. This fixes the bug as-reported.

Prior to 53390f8, an error setting up the new gateway was only logged ... the second commit restores that behaviour for the network disconnect case, where the container is not being deleted. The container won't have a working gateway, but the Network will not be left with a broken reference to the disconnected Endpoint, so it can be reconnected.

- How to verify it

New integration test, following the repro steps outlined at #51578 (comment).

Without the fixes, the new test fails with ...

=== RUN   TestGatewayErrorOnNetDisconnect
    bridge_linux_test.go:2093: assertion failed: error is not nil: Error response from daemon: container c081ced8f0a5658709ec6cb4571c01eb51d389e5858220c2babedee8eae35261 failed to leave network n1: updating gateway endpoint: failed to set gateway: route for the gateway 172.22.0.1 could not be found: network is unreachable
    bridge_linux_test.go:2097: assertion failed: error is not nil: Error response from daemon: endpoint with name nervous_morse already exists in network n1
    bridge_linux_test.go:2102: assertion failed: error is not nil: Error response from daemon: Cannot restart container c081ced8f0a5658709ec6cb4571c01eb51d389e5858220c2babedee8eae35261: failed to set up container networking: endpoint with name nervous_morse already exists in network n1
    bridge_linux_test.go:2109: assertion failed: error is not nil: Error response from daemon: error while removing network: network n1 has active endpoints (name:"nervous_morse" id:"8f477b1b9e52")
--- FAIL: TestGatewayErrorOnNetDisconnect (5.44s)
FAIL

With the fixes:

  • restarting a container produces no warning in the log, because there's no attempt to select the new gateway. Release 28.x logged "Failed to clean up network resources on container disconnect".
  • if gateway config fails when disconnecting an endpoint, the a warning is logged. For example:
    • Configuring gateway after network disconnect cid=5b73d9f67743 eid=f2df6ddb1ac5 ep=c1 error="failed to set gateway: route for the gateway 172.20.0.1 could not be found: network is unreachable" gw4=ce9bf8761a88 gw6= net=n1 nid=1a3331eaf7df sid=c44c4f4d7982

- Human readable description for the release notes

- Fix an issue that prevented container restart or network reconnection when gateway configuration failed during container stop or network disconnect.

When the endpoint providing a container's default gateway
is removed, there's no need to select a new gateway if the
container is being removed.

Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry added this to the 29.1.0 milestone Nov 25, 2025
@robmry robmry marked this pull request as ready for review November 25, 2025 19:25
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a regression where disconnecting a network from a container could fail and leave the system in a broken state if gateway reconfiguration encountered an error. The fix ensures that gateway selection is skipped when the container is being deleted, and downgrades gateway configuration errors to warnings (rather than fatal errors) during network disconnection.

  • Skip gateway selection entirely when the container is being deleted, preventing configuration failures during cleanup
  • Downgrade gateway configuration errors from fatal to logged warnings during network disconnect operations
  • Add comprehensive integration test covering the regression scenario

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
integration/networking/bridge_linux_test.go Adds regression test that verifies network disconnect with gateway config failure doesn't prevent reconnection or restart
daemon/libnetwork/endpoint.go Implements the fix by checking if sandbox is being deleted before gateway selection and logging gateway errors instead of failing
Comments suppressed due to low confidence (1)

daemon/libnetwork/endpoint.go:1

  • Potential nil pointer dereference on line 887. If gwepAfter4 is nil but gwepAfter6 is not nil, the condition gwepAfter6.ID() != gwepAfter4.ID() will panic when calling ID() on the nil gwepAfter4.
package libnetwork

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

gwepAfter4, gwepAfter6 = sb.getGatewayEndpoint()
if err := sb.updateGateway(gwepAfter4, gwepAfter6); err != nil {
return fmt.Errorf("updating gateway endpoint: %w", err)
log.G(ctx).WithFields(log.Fields{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth adding a comment explaining why we're logging instead of returning an error. There's still a bunch of code paths in libnetwork that adopt this error handling strategy, so readers may just think this is one of these injustified instances and not even git blame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ... although this is one of those paths using the error handling strategy of not handling errors. (As it was in 28.x - I moved this out of clearNetworkResources, which returned an error that was only logged.)

The whole network-disconnect operation should have proper error handling.

During a network disconnect, log rather than returning an error
if it's not possible to set up a new gateway.

This restores the behaviour from before commit 53390f8 ("Put
clearNetworkResources() inline in its only caller"). It's not
ideal, but by the time new gateways are selected the old
endpoint has been disconnected - and nothing puts things back.
Until that's cleaned up, a broken state is inevitable, but
letting endpoint deletion complete means the container can
be restarted or re-connected to the network without a zombie
endpoint causing further issues.

Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry force-pushed the sbleave_gw_config_error branch from bdb4880 to 163cc95 Compare November 26, 2025 14:46
@robmry robmry merged commit baf59d6 into moby:master Nov 26, 2025
189 of 190 checks passed
@robmry robmry deleted the sbleave_gw_config_error branch November 26, 2025 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docker stop-start fails for container with multiple networks in v29

5 participants