Suppress errors from gateway re-config when disconnecting a network#51592
Suppress errors from gateway re-config when disconnecting a network#51592robmry merged 3 commits intomoby:masterfrom
Conversation
When the endpoint providing a container's default gateway is removed, there's no need to select a new gateway if the container is being removed. Signed-off-by: Rob Murray <rob.murray@docker.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes a regression where disconnecting a network from a container could fail and leave the system in a broken state if gateway reconfiguration encountered an error. The fix ensures that gateway selection is skipped when the container is being deleted, and downgrades gateway configuration errors to warnings (rather than fatal errors) during network disconnection.
- Skip gateway selection entirely when the container is being deleted, preventing configuration failures during cleanup
- Downgrade gateway configuration errors from fatal to logged warnings during network disconnect operations
- Add comprehensive integration test covering the regression scenario
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| integration/networking/bridge_linux_test.go | Adds regression test that verifies network disconnect with gateway config failure doesn't prevent reconnection or restart |
| daemon/libnetwork/endpoint.go | Implements the fix by checking if sandbox is being deleted before gateway selection and logging gateway errors instead of failing |
Comments suppressed due to low confidence (1)
daemon/libnetwork/endpoint.go:1
- Potential nil pointer dereference on line 887. If
gwepAfter4is nil butgwepAfter6is not nil, the conditiongwepAfter6.ID() != gwepAfter4.ID()will panic when callingID()on the nilgwepAfter4.
package libnetwork
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| gwepAfter4, gwepAfter6 = sb.getGatewayEndpoint() | ||
| if err := sb.updateGateway(gwepAfter4, gwepAfter6); err != nil { | ||
| return fmt.Errorf("updating gateway endpoint: %w", err) | ||
| log.G(ctx).WithFields(log.Fields{ |
There was a problem hiding this comment.
It's probably worth adding a comment explaining why we're logging instead of returning an error. There's still a bunch of code paths in libnetwork that adopt this error handling strategy, so readers may just think this is one of these injustified instances and not even git blame.
There was a problem hiding this comment.
Done ... although this is one of those paths using the error handling strategy of not handling errors. (As it was in 28.x - I moved this out of clearNetworkResources, which returned an error that was only logged.)
The whole network-disconnect operation should have proper error handling.
During a network disconnect, log rather than returning an error if it's not possible to set up a new gateway. This restores the behaviour from before commit 53390f8 ("Put clearNetworkResources() inline in its only caller"). It's not ideal, but by the time new gateways are selected the old endpoint has been disconnected - and nothing puts things back. Until that's cleaned up, a broken state is inevitable, but letting endpoint deletion complete means the container can be restarted or re-connected to the network without a zombie endpoint causing further issues. Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
bdb4880 to
163cc95
Compare
- What I did
When a network is disconnected from a container, if that network provides the container's default gateway, another gateway is selected from the remaining network endpoints (if there are any). That selection happens after the endpoint has been removed, but there's no proper rollback if the newly selected gateway doesn't work.
- How I did it
The first commit skips selection of a new gateway if the container is being deleted. So setup of a new gateway can't fail. This fixes the bug as-reported.
Prior to 53390f8, an error setting up the new gateway was only logged ... the second commit restores that behaviour for the
network disconnectcase, where the container is not being deleted. The container won't have a working gateway, but theNetworkwill not be left with a broken reference to the disconnectedEndpoint, so it can be reconnected.- How to verify it
New integration test, following the repro steps outlined at #51578 (comment).
Without the fixes, the new test fails with ...
With the fixes:
Configuring gateway after network disconnect cid=5b73d9f67743 eid=f2df6ddb1ac5 ep=c1 error="failed to set gateway: route for the gateway 172.20.0.1 could not be found: network is unreachable" gw4=ce9bf8761a88 gw6= net=n1 nid=1a3331eaf7df sid=c44c4f4d7982- Human readable description for the release notes
- Fix an issue that prevented container restart or network reconnection when gateway configuration failed during container stop or network disconnect.