Skip to content

Tidy up when endpoint join fails#50945

Merged
robmry merged 7 commits intomoby:masterfrom
robmry:cleanup_network_settings_on_join_err
Sep 12, 2025
Merged

Tidy up when endpoint join fails#50945
robmry merged 7 commits intomoby:masterfrom
robmry:cleanup_network_settings_on_join_err

Conversation

@robmry
Copy link
Contributor

@robmry robmry commented Sep 9, 2025

- What I did

The repro in #50898 results in Endpoint.Join returning an error, the container isn't connected to the network. But ...

  • The failed connection is left behind in container inspect / NetworkSettings.
    • The container isn't listed in network inspect.
  • The veth device is left in the container.

- How I did it

See individual commit messages.

- How to verify it

After running the steps in #50898, the mv2 network is not listed in container inspect, and its eth device is removed.

New integration test.

- Human readable description for the release notes

- Improved error handling for connection of a container to a network.

@robmry robmry added this to the 29.0.0 milestone Sep 9, 2025
@robmry robmry self-assigned this Sep 9, 2025
@robmry robmry force-pushed the cleanup_network_settings_on_join_err branch 3 times, most recently from 27a82a7 to 3b8424f Compare September 11, 2025 11:52
Signed-off-by: Rob Murray <rob.murray@docker.com>
Because it loaded the Endpoint object from store and operated
on that copy rather than its own receiver, sbJoin couldn't
successfully roll back on error if the Endpoint was not included
in the Sandbox's list of endpoints, or its current state had
not been written to store after the error occurred.

So, for example, releaseOSSboxResources() would not be called
to delete interfaces created in the container's netns.

Signed-off-by: Rob Murray <rob.murray@docker.com>
If an endpoint is still attached to a Sandbox when
Endpoint.Delete is called with force=true, sbLeave is
called. It may change the Sandbox's gateway, which may
conflict with a concurrent Join.

So, acquire the Sandbox's joinLeaveMu to do that, and
clarify the purpose of that mutex in struct Sandbox
comments.

Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
The old deferred error handling cleared ep.sandboxID, but only
in a copy of the Endpoint loaded from the store, not stored or
returned - so the modification was immediately lost.

It also tried to remove the endpoint from the Sandbox's 'endpoints',
but the remove function compared pointers rather than ids, so
nothing was removed. Removing it would have broken rollback anyway.

Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry force-pushed the cleanup_network_settings_on_join_err branch from 3b8424f to c504c94 Compare September 11, 2025 12:03
@robmry robmry changed the title Remove network info from container when endpoint join fails Tidy up when endpoint join fails Sep 11, 2025
@robmry robmry marked this pull request as ready for review September 11, 2025 13:11
Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry force-pushed the cleanup_network_settings_on_join_err branch from c504c94 to 8efe6b0 Compare September 11, 2025 13:19
Copy link
Member

@akerouanton akerouanton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can probably keep my suggestion for a follow-up.

}

ep, err = n.getEndpointFromStore(ep.ID())
storedEp, err := n.getEndpointFromStore(ep.ID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • (*Endpoint).Leave() is called in 3 different places:
    • (*Daemon).disconnectFromNetwork() which obtains the Endpoint by loading the list of endpoints connected to the given network. Supposedly, they're fully hydrated already.
    • (*Sandbox).delete() which deletes all endpoints of the sandbox being deleted.
      • Endpoints are obtained via (*Sandbox).Endpoints() which returns a copy of sb.endpoints.
      • sb.endpoints is mutated by a single function: (*Sandbox).addEndpoint() which is called by two code paths: (*Controller).sandboxRestore() and (*Endpoint).sbJoin()
      • In the former case, the Endpoint is loaded from store before it's passed to (*Sandbox).addEndpoint().
      • In the latter case, the Endpoint is being created - so it's fully hydrated when it's passed to (*Sandbox).addEndpoint().
    • (*Sandbox).Refresh() which removes all the sandbox's endpoint, and then re-connect them.
      • The sandbox's endpoints list is obtained by calling (*Sandbox).Endpoints()

Thus, I believe re-loading the stored endpoint here isn't needed. Maybe we can just drop this line? Or do you think it's too risky, and prefer keeping this line out of caution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed, I went back-and-forth on this. Also thought about doing an sb.GetEndpoint() to dig it out of sb.endpoints, rather than fetching from the store.

In the end though, decided the safest thing would be to leave it alone - we know we need to overhaul all this to get rid of all the reloading from store (after startup). The current implementation is bound to lead to bugs like the one fixed here.

I'll get this merged as it should be an improvement anyway.

@robmry robmry merged commit af6d59e into moby:master Sep 12, 2025
251 of 252 checks passed
@robmry robmry deleted the cleanup_network_settings_on_join_err branch September 18, 2025 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants