Tidy up when endpoint join fails#50945
Conversation
27a82a7 to
3b8424f
Compare
Signed-off-by: Rob Murray <rob.murray@docker.com>
Because it loaded the Endpoint object from store and operated on that copy rather than its own receiver, sbJoin couldn't successfully roll back on error if the Endpoint was not included in the Sandbox's list of endpoints, or its current state had not been written to store after the error occurred. So, for example, releaseOSSboxResources() would not be called to delete interfaces created in the container's netns. Signed-off-by: Rob Murray <rob.murray@docker.com>
If an endpoint is still attached to a Sandbox when Endpoint.Delete is called with force=true, sbLeave is called. It may change the Sandbox's gateway, which may conflict with a concurrent Join. So, acquire the Sandbox's joinLeaveMu to do that, and clarify the purpose of that mutex in struct Sandbox comments. Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
The old deferred error handling cleared ep.sandboxID, but only in a copy of the Endpoint loaded from the store, not stored or returned - so the modification was immediately lost. It also tried to remove the endpoint from the Sandbox's 'endpoints', but the remove function compared pointers rather than ids, so nothing was removed. Removing it would have broken rollback anyway. Signed-off-by: Rob Murray <rob.murray@docker.com>
3b8424f to
c504c94
Compare
Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
c504c94 to
8efe6b0
Compare
akerouanton
left a comment
There was a problem hiding this comment.
LGTM. We can probably keep my suggestion for a follow-up.
| } | ||
|
|
||
| ep, err = n.getEndpointFromStore(ep.ID()) | ||
| storedEp, err := n.getEndpointFromStore(ep.ID()) |
There was a problem hiding this comment.
(*Endpoint).Leave()is called in 3 different places:(*Daemon).disconnectFromNetwork()which obtains the Endpoint by loading the list of endpoints connected to the given network. Supposedly, they're fully hydrated already.(*Sandbox).delete()which deletes all endpoints of the sandbox being deleted.- Endpoints are obtained via
(*Sandbox).Endpoints()which returns a copy ofsb.endpoints. sb.endpointsis mutated by a single function:(*Sandbox).addEndpoint()which is called by two code paths:(*Controller).sandboxRestore()and(*Endpoint).sbJoin()- In the former case, the Endpoint is loaded from store before it's passed to
(*Sandbox).addEndpoint(). - In the latter case, the Endpoint is being created - so it's fully hydrated when it's passed to
(*Sandbox).addEndpoint().
- Endpoints are obtained via
(*Sandbox).Refresh()which removes all the sandbox's endpoint, and then re-connect them.- The sandbox's endpoints list is obtained by calling
(*Sandbox).Endpoints()
- The sandbox's endpoints list is obtained by calling
Thus, I believe re-loading the stored endpoint here isn't needed. Maybe we can just drop this line? Or do you think it's too risky, and prefer keeping this line out of caution?
There was a problem hiding this comment.
Yes, agreed, I went back-and-forth on this. Also thought about doing an sb.GetEndpoint() to dig it out of sb.endpoints, rather than fetching from the store.
In the end though, decided the safest thing would be to leave it alone - we know we need to overhaul all this to get rid of all the reloading from store (after startup). The current implementation is bound to lead to bugs like the one fixed here.
I'll get this merged as it should be an improvement anyway.
- What I did
The repro in #50898 results in
Endpoint.Joinreturning an error, the container isn't connected to the network. But ...container inspect/NetworkSettings.network inspect.- How I did it
See individual commit messages.
- How to verify it
After running the steps in #50898, the
mv2network is not listed incontainer inspect, and its eth device is removed.New integration test.
- Human readable description for the release notes
- Improved error handling for connection of a container to a network.