
Wait for container dependencies upon daemon start up#50327

Merged
robmry merged 1 commit into moby:master from Adrien-Atmosphere:50326-wait-for-dependent-containers
Jul 17, 2025

Conversation

@Adrien-Atmosphere
Contributor

@Adrien-Atmosphere Adrien-Atmosphere commented Jul 4, 2025

- What I did
Wait for container dependencies upon daemon start up

- How I did it

  • List all dependent containers upon registration
  • Dependent containers can be either from legacy link or container network
  • Wait on a best effort basis for the dependent containers

- How to verify it

  1. Create a docker-dind container with a bind-mounted /var/lib/docker:
    docker run -d --privileged --name docker -v ./docker:/var/lib/docker docker-dind:latest --debug

  2. Launch a Compose application inside the docker-dind container that creates 500 containers, to maximize the chances of encountering the issue

services:
  net:
    image: alpine
    command: sleep infinity
    restart: always

  app:
    image: alpine
    command: sleep infinity
    network_mode: "service:net"
    restart: always
    scale: 500

  3. Stop and restart the docker-dind container and check for stopped containers
CONTAINER=docker
EXEC="docker exec $CONTAINER"
while true; do
    echo "Stopping $CONTAINER"
    docker stop $CONTAINER
    echo "Removing $CONTAINER"
    docker rm -v $CONTAINER
    echo "Starting $CONTAINER"
    docker run -d --privileged --name docker -v ./docker:/var/lib/docker docker-dind:latest --debug
    timeout 30 bash -c "until $EXEC docker info >/dev/null 2>&1; do sleep 1; done"
    echo "$CONTAINER started successfully"

    containers=$($EXEC docker ps -aq --filter status=created --filter status=exited --filter status=dead)
    if [ -n "$containers" ]; then
        echo "Inspecting stopped/exited/dead containers:"
        for i in $containers; do
            echo "Container ID: $i"
            $EXEC docker inspect "$i" -f "{{ .State | json }}" | jq
        done
        exit
    else
        echo "No created/exited/dead containers found."
    fi
done

- Human readable description for the release notes

- On daemon startup, restart containers that share their network stacks before containers that need those stacks.


@thaJeztah
Member

Thanks for contributing. This may need some review/discussion; I'm a bit concerned about possible consequences of this - at least it could use more eyes.

Currently, we create somewhat of a hard dependency when joining namespaces with another container; while this "works", it's also causing issues (like the one this PR is trying to address). Ultimately, we should try to make things more dynamic and resilient against situations where (e.g.) the linked container is restarted, removed, fails to start, or otherwise; we could still have some dependency, but ideally, it would be more dynamic.

For the legacy-links, it's definitely something we want to move away from; it's a broken concept, has security implications, and has many limitations and side-effects (also see #22295); non-legacy (DNS-based) links are more flexible in this respect.

The --network=container:XX is something we should look at, and possibly re-architect (I recall I brought this idea up in some discussions we had, but never wrote it up; I'll probably write up something in a separate ticket for further discussion).

Effectively, --network=container:XX indicates that a container wants to share its networking-namespace with an existing network-namespace; in this case a namespace that was created as part of another container.

  • container-A is created, and a network namespace is created for it
  • container-B -> network=container-A (IOW: "share network-namespace with container A") joins the network namespace that was created for container-A
  • container-A and container-B now both use the same network namespace

For convenience (and UX), it uses the --network flag, but to some extent it's a bit of an odd-one-out; for other namespaces, we use a flag more specific to "namespace". From the docker run --help output;

--pid string                       PID namespace to use
--userns string                    User namespace to use
--uts string                       UTS namespace to use

While the original "donor" container is required when creating the container that joins its networking namespace, that's only at creation time. Currently we still consider that namespace to be owned by the original container, but I think we should treat it more as "co-ownership". Both containers share the same namespace, and we can consider the namespace to be no longer needed once the last container is gone.

If we take that approach, things become more similar to, for example, --volumes-from;

--volumes-from list                Mount volumes from the specified container(s)

Similar to --network=container:XXX, the container using the --volumes-from option needs the donor container to exist at creation time. Volumes are created (or defined) on the donor container:

docker run -d -v shared:/foo --name container-a nginx:alpine
docker run -d --volumes-from=container-a --name container-b nginx:alpine

But once created, there's no longer a hard dependency; container-b now just shares the same volume(s);

docker system df --verbose
...
Local Volumes space usage:

VOLUME NAME                                                        LINKS     SIZE
docker-dev-cache                                                   1         2.209GB
shared                                                             2         0B
acd3650aa324548e56d29276001b13c8ad7e493a881b63106aaec0b6be542e9d   0         3.997MB

The original "donor" container (container-a) can be removed without impacting container-b;

docker container rm -f container-a
container-a

docker system df --verbose
...
Local Volumes space usage:

VOLUME NAME                                                        LINKS     SIZE
docker-dev-cache                                                   1         2.209GB
shared                                                             1         0B
acd3650aa324548e56d29276001b13c8ad7e493a881b63106aaec0b6be542e9d   0         3.997MB

I'm wondering if we could get to a similar state for networking; the devil may be in the detail though!

@Adrien-Atmosphere
Contributor Author

Adrien-Atmosphere commented Jul 7, 2025

Thank you for your feedback. I thought a simple "wait for dependent containers" mechanism could be useful, and would leave room to add a --depends_on flag in the future, like the one in Docker Compose.
I understand that you would prefer to completely rework the container network namespace sharing functionality. In that case, making it interoperate with Linux "ip netns" would be great.

@robmry
Contributor

robmry commented Jul 16, 2025

I'll take a more detailed look at the code if we can agree this change is needed - but, I think it is ...

@thaJeztah - as you say, un-hooking netns's is probably a separate discussion ... but, as you also say, the container that shares its netns would need to start first - even if we allow it to exit before its dependents and leave its netns behind. (Maybe the first container has CAP_NET_ADMIN and needs to do some setup before other containers arrive.) So, we'd still need restart dependencies like this? (At least on initial startup, maybe we could allow a restart without the dependency if another container sharing the netns has held it open - but that's about design for a future feature, and this change wouldn't hold it back.)

At the moment - if containers with netns dependencies don't necessarily restart when they should, that's a bug. Even if we want to re-design the way container-networking works in future, it should work properly until then. The mechanism for queuing container-starts exists, this change adds a missing dependency. I think it makes sense?

@akerouanton
Member

I think this PR makes sense overall. As @robmry outlined, whether we want to rearchitect --network=container:… is a separate matter, and until we settle on that, we should treat the restart scenario as a bug that should be fixed.

@robmry
Contributor

robmry commented Jul 16, 2025

(The Windows test failures are unrelated to this PR.)

Contributor

@corhere corhere left a comment


I agree that this change makes a lot of sense. The PR could use another iteration, though.

// It is used to determine which containers need to be restarted when the given container is restarted.
// It returns a list of containers that depend on the given container.
// Upon error, it returns the last known dependent containers, which may be empty.
dependentContainers := make([]*container.Container, 0)

Nit: https://go.dev/wiki/CodeReviewComments#declaring-empty-slices

Suggested change
dependentContainers := make([]*container.Container, 0)
var dependentContainers []*container.Container

Comment on lines 220 to 225
	if c.HostConfig.Links != nil {
		for _, link := range c.HostConfig.Links {
			name, _, err := opts.ParseLink(link)
			if err != nil {
				log.G(context.TODO()).WithError(err).Errorf("Could not parse link %s for %s", link, c.ID)
				return dependentContainers
			}
			child, err := daemon.GetContainer(name)
			if err != nil {
				if cerrdefs.IsNotFound(err) {
					// Trying to link to a non-existing container is not valid, and
					// should return an "invalid parameter" error. Returning a "not
					// found" error here would make the client report the container's
					// image could not be found (see moby/moby#39823)
					err = errdefs.InvalidParameter(err)
				}
				log.G(context.TODO()).WithError(err).Errorf("Could not get container for %s", name)
				return dependentContainers
			}

			dependentContainers = append(dependentContainers, child)
		}
	}
	return dependentContainers


There's a more efficient way to do this.

Suggested change
	if c.HostConfig.Links != nil {
		for _, link := range c.HostConfig.Links {
			name, _, err := opts.ParseLink(link)
			if err != nil {
				log.G(context.TODO()).WithError(err).Errorf("Could not parse link %s for %s", link, c.ID)
				return dependentContainers
			}
			child, err := daemon.GetContainer(name)
			if err != nil {
				if cerrdefs.IsNotFound(err) {
					// Trying to link to a non-existing container is not valid, and
					// should return an "invalid parameter" error. Returning a "not
					// found" error here would make the client report the container's
					// image could not be found (see moby/moby#39823)
					err = errdefs.InvalidParameter(err)
				}
				log.G(context.TODO()).WithError(err).Errorf("Could not get container for %s", name)
				return dependentContainers
			}
			dependentContainers = append(dependentContainers, child)
		}
	}
	return dependentContainers
	return append(dependentContainers, maps.Values(daemon.linkIndex.children(c)))

// This function is used to retrieve the containers that depend on the given container.
// It is used to determine which containers need to be restarted when the given container is restarted.
// It returns a list of containers that depend on the given container.
// Upon error, it returns the last known dependent containers, which may be empty.

How come this isn't part of the function's doc comment?

daemon/daemon.go Outdated
// (legacy links) to be running before we try to start the container
if children := daemon.linkIndex.children(c); len(children) > 0 {
// (legacy links or container network) to be running before we try to start the container
if children := dependentContainers[c]; len(children) > 0 {

Accumulating the dependent containers into a map doesn't really afford us anything, at the cost of some memory.

Suggested change
if children := dependentContainers[c]; len(children) > 0 {
if children := daemon.GetDependentContainers(c); len(children) > 0 {


@robmry robmry left a comment


Looks good - thank you ... but please could you squash the commits? Looks like it needs a rebase too, just a conflict in imports.

@Adrien-Atmosphere force-pushed the 50326-wait-for-dependent-containers branch from 3c948af to 3a3b9f9 on July 17, 2025 15:03
@robmry
Contributor

robmry commented Jul 17, 2025

Looks like something odd happened in the rebase, an already merged commit ended up in this PR?

@Adrien-Atmosphere force-pushed the 50326-wait-for-dependent-containers branch from 3a3b9f9 to 8f677ab on July 17, 2025 15:55
@Adrien-Atmosphere
Contributor Author

Thanks, I didn't notice

Contributor

@corhere corhere left a comment


LGTM!

@robmry
Contributor

robmry commented Jul 17, 2025

There's a test failure in https://github.com/moby/moby/actions/runs/16349928056/job/46194686893?pr=50327 ...

# github.com/docker/docker/daemon
../../daemon/container.go:223:52: implicit function instantiation requires go1.18 or later (-lang was set to go1.16; check go.mod)

That means we need this comment at the top of daemon/container.go (because of the generic slices/maps functions) ...

// FIXME(thaJeztah): remove once we are a module; the go:build directive prevents go from downgrading language version to go1.16:
//go:build go1.23

@robmry
Contributor

robmry commented Jul 17, 2025

I've added a changelog comment to the description - hopefully it makes sense?!

@robmry added the area/networking, impact/changelog, area/daemon, and kind/bugfix labels Jul 17, 2025
@robmry robmry added this to the 29.0.0 milestone Jul 17, 2025
@Adrien-Atmosphere
Contributor Author

For the comment at the top of container.go, do I make a separate commit or do I squash them?

@robmry
Contributor

robmry commented Jul 17, 2025

For the comment at the top of container.go, do I make a separate commit or do I squash them?

Squash please!

@Adrien-Atmosphere force-pushed the 50326-wait-for-dependent-containers branch from 8f677ab to 2e492de on July 17, 2025 16:54
@Adrien-Atmosphere force-pushed the 50326-wait-for-dependent-containers branch from 2e492de to d2f8e7b on July 17, 2025 16:59
@robmry
Contributor

robmry commented Jul 17, 2025

Sorry this is getting a bit tedious - nearly there, but ...

daemon/container.go:1:1: File is not properly formatted (gofmt)
// FIXME(thaJeztah): remove once we are a module; the go:build directive prevents go from downgrading language version to go1.16:

I think it wants a blank line after the build directive ...

// FIXME(thaJeztah): remove once we are a module; the go:build directive prevents go from downgrading language version to go1.16:
//go:build go1.23

package daemon

@Adrien-Atmosphere
Contributor Author

I just ran gofmt -s and it swapped the two comments, like so:

//go:build go1.23

// FIXME(thaJeztah): remove once we are a module; the go:build directive prevents go from downgrading language version to go1.16:

I think I should push this, no?

@robmry
Contributor

robmry commented Jul 17, 2025

It should be ok the original way around - we've got quite a few of these now (and should be able to remove them soon), so it's best if they're all the same. For example ...

moby/daemon/create.go

Lines 1 to 5 in c616e76

// TODO(thaJeztah): remove once we are a module; the go:build directive prevents go from downgrading language version to go1.16:
//go:build go1.23
package daemon

(Oh, except I see we have a mix of TODO and FIXME! Either is ok, most are FIXME.)

- Get dependent containers before starting containers
- Dependent containers can be either from legacy link or container network
- Wait on a best effort basis for the dependent containers

Fixes: moby#50326

Signed-off-by: Adrien Pompée <adrien.pompee@atmosphere.aero>
@Adrien-Atmosphere force-pushed the 50326-wait-for-dependent-containers branch from d2f8e7b to d4e026f on July 17, 2025 17:42
Contributor

@robmry robmry left a comment


LGTM - thank you @Adrien-Atmosphere!

The BuildKit Windows test failure is unrelated ... I'll get this merged.

@robmry robmry merged commit 2f4f78b into moby:master Jul 17, 2025
227 of 232 checks passed
@Adrien-Atmosphere Adrien-Atmosphere deleted the 50326-wait-for-dependent-containers branch July 17, 2025 19:28
@thompson-shaun thompson-shaun moved this from Open to Complete in 🔦 Maintainer spotlight Jul 24, 2025
Successfully merging this pull request may close these issues.

Containers with "restart: always" and shared network namespace may fail to start with "cannot join network namespace of a non running container"

5 participants