Tidy up when endpoint join fails by robmry · Pull Request #50945 · moby/moby

robmry · 2025-09-09T17:51:48Z

- What I did

related to macvlan: "failed to set IPv6 gateway: file exists" on network connect #50898

The repro in #50898 results in Endpoint.Join returning an error, and the container isn't connected to the network. But ...

- The failed connection is left behind in container inspect / NetworkSettings.
  - The container isn't listed in network inspect.
- The veth device is left in the container.

- How I did it

See individual commit messages.

- How to verify it

After running the steps in #50898, the mv2 network is not listed in container inspect, and its eth device is removed.

New integration test.

- Human readable description for the release notes

- Improved error handling for connection of a container to a network.

Signed-off-by: Rob Murray <rob.murray@docker.com>

Because it loaded the Endpoint object from store and operated on that copy rather than its own receiver, sbJoin couldn't successfully roll back on error if the Endpoint was not included in the Sandbox's list of endpoints, or its current state had not been written to store after the error occurred. So, for example, releaseOSSboxResources() would not be called to delete interfaces created in the container's netns.

Signed-off-by: Rob Murray <rob.murray@docker.com>
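The failure mode described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the libnetwork code: the `endpoint`, `store`, and `netns` types, the `join` function, and the `rollbackFromStore` switch are all hypothetical stand-ins, assumed only to show why rolling back on a copy reloaded from the store can miss state that exists only on the receiver.

```go
package main

import (
	"errors"
	"fmt"
)

// endpoint is a hypothetical, much-reduced stand-in for libnetwork's Endpoint.
type endpoint struct {
	id        string
	ifaceName string // interface created in the container's netns, if any
}

// store is a hypothetical datastore; get always returns a fresh copy.
type store map[string]endpoint

func (s store) get(id string) *endpoint {
	ep := s[id]
	return &ep
}

// netns is a hypothetical set of interface names present in the container.
type netns map[string]bool

var errJoin = errors.New("join failed")

// join creates an interface, records it on the receiver-like ep, then fails
// before the updated state is written to the store. The deferred rollback
// uses either a copy reloaded from the store or ep itself.
func join(ep *endpoint, s store, ns netns, rollbackFromStore bool) (retErr error) {
	defer func() {
		if retErr == nil {
			return
		}
		target := ep
		if rollbackFromStore {
			target = s.get(ep.id) // stale copy: it never saw ifaceName
		}
		if target.ifaceName != "" {
			delete(ns, target.ifaceName)
		}
	}()

	ep.ifaceName = "eth1"
	ns[ep.ifaceName] = true
	return errJoin
}

// leakedAfterFailedJoin reports whether the interface is still present in the
// netns after a failed join.
func leakedAfterFailedJoin(rollbackFromStore bool) bool {
	s := store{"ep1": {id: "ep1"}}
	ns := netns{}
	ep := s.get("ep1")
	_ = join(ep, s, ns, rollbackFromStore)
	return ns["eth1"]
}

func main() {
	fmt.Println("rollback on store copy leaks interface:", leakedAfterFailedJoin(true))
	fmt.Println("rollback on receiver leaks interface:", leakedAfterFailedJoin(false))
}
```

In this sketch the rollback that reloads from the store leaves `eth1` behind, while the rollback that works on the same object the join mutated removes it.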

If an endpoint is still attached to a Sandbox when Endpoint.Delete is called with force=true, sbLeave is called. It may change the Sandbox's gateway, which may conflict with a concurrent Join. So, acquire the Sandbox's joinLeaveMu to do that, and clarify the purpose of that mutex in struct Sandbox comments.

Signed-off-by: Rob Murray <rob.murray@docker.com>
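As a rough illustration of the locking described in this commit message, the sketch below serialises a join and a forced delete with one mutex so that gateway selection always sees a settled endpoint list. Only the name `joinLeaveMu` comes from the commit message; the `sandbox` type, its fields, and the `join`, `forceDelete`, `updateGateway`, and `consistent` methods are hypothetical simplifications, not the real Sandbox API.

```go
package main

import (
	"fmt"
	"sync"
)

// sandbox is a hypothetical, much-reduced stand-in for libnetwork's Sandbox.
type sandbox struct {
	// joinLeaveMu serialises operations that may change the gateway.
	joinLeaveMu sync.Mutex
	endpoints   []string
	gateway     string
}

// updateGateway picks the first endpoint as gateway (an assumed, simplified rule).
func (sb *sandbox) updateGateway() {
	sb.gateway = ""
	if len(sb.endpoints) > 0 {
		sb.gateway = sb.endpoints[0]
	}
}

func (sb *sandbox) join(id string) {
	sb.joinLeaveMu.Lock()
	defer sb.joinLeaveMu.Unlock()
	sb.endpoints = append(sb.endpoints, id)
	sb.updateGateway()
}

// forceDelete detaches the endpoint under the same mutex as join, so the
// gateway is never recomputed while a join is half-way through.
func (sb *sandbox) forceDelete(id string) {
	sb.joinLeaveMu.Lock()
	defer sb.joinLeaveMu.Unlock()
	kept := sb.endpoints[:0]
	for _, e := range sb.endpoints {
		if e != id {
			kept = append(kept, e)
		}
	}
	sb.endpoints = kept
	sb.updateGateway()
}

// consistent reports whether the gateway matches the current endpoint list.
func (sb *sandbox) consistent() bool {
	sb.joinLeaveMu.Lock()
	defer sb.joinLeaveMu.Unlock()
	if len(sb.endpoints) == 0 {
		return sb.gateway == ""
	}
	return sb.gateway == sb.endpoints[0]
}

// runConcurrently races n joins against n forced deletes.
func runConcurrently(n int) *sandbox {
	sb := &sandbox{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		id := fmt.Sprintf("ep%d", i)
		wg.Add(2)
		go func() { defer wg.Done(); sb.join(id) }()
		go func() { defer wg.Done(); sb.forceDelete(id) }()
	}
	wg.Wait()
	return sb
}

func main() {
	sb := runConcurrently(50)
	fmt.Println("gateway consistent:", sb.consistent())
}
```

Which endpoints remain attached depends on goroutine scheduling, but because both paths hold the mutex, the gateway always agrees with the final endpoint list.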

Signed-off-by: Rob Murray <rob.murray@docker.com>

The old deferred error handling cleared ep.sandboxID, but only in a copy of the Endpoint loaded from the store, not stored or returned - so the modification was immediately lost. It also tried to remove the endpoint from the Sandbox's 'endpoints', but the remove function compared pointers rather than ids, so nothing was removed. Removing it would have broken rollback anyway.

Signed-off-by: Rob Murray <rob.murray@docker.com>
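The pointer-versus-id comparison mentioned in this commit message is easy to reproduce in isolation. The sketch below uses hypothetical `endpoint`, `removeByPointer`, and `removeByID` names; it only shows why a pointer comparison never matches a copy reloaded from a store. It does not suggest that removing by id was the right fix, since the commit message notes that removal would have broken rollback anyway.

```go
package main

import "fmt"

// endpoint is a hypothetical stand-in with just an identifier.
type endpoint struct{ id string }

// removeByPointer drops entries that are the very same object as ep.
func removeByPointer(eps []*endpoint, ep *endpoint) []*endpoint {
	out := make([]*endpoint, 0, len(eps))
	for _, e := range eps {
		if e != ep {
			out = append(out, e)
		}
	}
	return out
}

// removeByID drops entries whose id equals ep's id.
func removeByID(eps []*endpoint, ep *endpoint) []*endpoint {
	out := make([]*endpoint, 0, len(eps))
	for _, e := range eps {
		if e.id != ep.id {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	attached := []*endpoint{{id: "ep1"}}
	reloaded := &endpoint{id: "ep1"} // same id, different object, as if loaded from a store

	fmt.Println("left after pointer comparison:", len(removeByPointer(attached, reloaded)))
	fmt.Println("left after id comparison:", len(removeByID(attached, reloaded)))
}
```

The pointer comparison leaves the list untouched because `reloaded` is a distinct object, which matches the "nothing was removed" behaviour the commit message describes.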

Signed-off-by: Rob Murray <rob.murray@docker.com>

akerouanton

LGTM. We can probably keep my suggestion for a follow-up.

akerouanton · 2025-09-12T09:39:04Z

daemon/libnetwork/endpoint.go

 	}

-	ep, err = n.getEndpointFromStore(ep.ID())
+	storedEp, err := n.getEndpointFromStore(ep.ID())


(*Endpoint).Leave() is called in 3 different places:

- (*Daemon).disconnectFromNetwork(), which obtains the Endpoint by loading the list of endpoints connected to the given network. Supposedly, they're fully hydrated already.
- (*Sandbox).delete(), which deletes all endpoints of the sandbox being deleted.
  - Endpoints are obtained via (*Sandbox).Endpoints(), which returns a copy of sb.endpoints.
  - sb.endpoints is mutated by a single function, (*Sandbox).addEndpoint(), which is called by two code paths: (*Controller).sandboxRestore() and (*Endpoint).sbJoin().
    - In the former case, the Endpoint is loaded from store before it's passed to (*Sandbox).addEndpoint().
    - In the latter case, the Endpoint is being created, so it's fully hydrated when it's passed to (*Sandbox).addEndpoint().
- (*Sandbox).Refresh(), which removes all the sandbox's endpoints and then re-connects them.
  - The sandbox's endpoints list is obtained by calling (*Sandbox).Endpoints().

Thus, I believe re-loading the stored endpoint here isn't needed. Maybe we can just drop this line? Or do you think it's too risky, and prefer keeping this line out of caution?

Yes, agreed, I went back-and-forth on this. Also thought about doing an sb.GetEndpoint() to dig it out of sb.endpoints, rather than fetching from the store.

In the end though, decided the safest thing would be to leave it alone - we know we need to overhaul all this to get rid of all the reloading from store (after startup). The current implementation is bound to lead to bugs like the one fixed here.

I'll get this merged as it should be an improvement anyway.

robmry added this to the 29.0.0 milestone Sep 9, 2025

robmry self-assigned this Sep 9, 2025

robmry added the area/networking (Networking), impact/changelog, version/17.03, and kind/bugfix (PR's that fix bugs) labels Sep 9, 2025

robmry force-pushed the cleanup_network_settings_on_join_err branch 3 times, most recently from 27a82a7 to 3b8424f, September 11, 2025 11:52

robmry added 5 commits September 11, 2025 13:02

Remove network info from container when endpoint join fails

b192d06

Signed-off-by: Rob Murray <rob.murray@docker.com>

Put clearNetworkResources() inline in its only caller

53390f8

Signed-off-by: Rob Murray <rob.murray@docker.com>

robmry force-pushed the cleanup_network_settings_on_join_err branch from 3b8424f to c504c94, September 11, 2025 12:03

robmry changed the title ~~Remove network info from container when endpoint join fails~~ Tidy up when endpoint join fails Sep 11, 2025

robmry marked this pull request as ready for review September 11, 2025 13:11

robmry requested review from akerouanton and corhere September 11, 2025 13:11

robmry added 2 commits September 11, 2025 14:18

bridge_linux_test.go: gofumpt

73413ea

Signed-off-by: Rob Murray <rob.murray@docker.com>

Add TestJoinError

8efe6b0

Signed-off-by: Rob Murray <rob.murray@docker.com>

robmry force-pushed the cleanup_network_settings_on_join_err branch from c504c94 to 8efe6b0, September 11, 2025 13:19

corhere approved these changes Sep 11, 2025

akerouanton approved these changes Sep 12, 2025

robmry merged commit af6d59e into moby:master Sep 12, 2025
251 of 252 checks passed

robmry mentioned this pull request Sep 12, 2025

macvlan, ipvlan-l2: only configure a default route when a gateway address is supplied #50929

Merged

robmry deleted the cleanup_network_settings_on_join_err branch September 18, 2025 14:27

robmry mentioned this pull request Nov 25, 2025

Suppress errors from gateway re-config when disconnecting a network #51592

Merged

robmry merged 7 commits into moby:master from robmry:cleanup_network_settings_on_join_err
