Initialising a container and starting the user process

This is Part 4 of the series Building a container runtime from scratch in Go.

In the fourth part of the series we learn about fork/exec and IPC over Unix domain sockets to initialise the container, then run the user-defined process.

The source code to accompany this post is available on GitHub.

When we start a container, the process that gets run in the container is defined in the process field of the config.json of the container bundle. This contains a bunch of fields we’ll cover in future posts, but the one we’re concerned about today is args.

// Process contains information to start a specific application 
// inside the container.
type Process struct {
    // Args specifies the binary and arguments for the application to execute.
    Args []string `json:"args,omitempty"`

    // ...
}

The args field is an array of strings with the same semantics as the argv argument of the exec family of functions. Essentially what this means is that the first item in the array is the command to execute and the remaining items are the arguments to pass to it.
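
For example, a process block like the following (a minimal, illustrative fragment, not a complete config.json) would run the equivalent of sleep 1h inside the container - sleep is the command, 1h is its single argument:

{
  "process": {
    "args": ["sleep", "1h"]
  }
}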

There are two steps to getting the user process defined in the config.json spec running in a containerised environment.

  1. Initialising the container (anocir create <container-id>)
    • In this step the runtime creates a container process and applies the required configuration. It then sets up an IPC channel over a Unix domain socket so that it can continue to communicate with the container once the container process is detached.
  2. Starting the container (anocir start <container-id>)
    • In this step the runtime sends a message over the IPC channel to the container process instructing the container to start. The container process applies the remaining configuration, then execs the user process defined in the config.

Let’s get to it…

Initialising the container

Below is a diagram of the steps involved in ‘initialising’ a container.

sequenceDiagram
    autonumber
    actor CLI
    participant anocir as runtime
    CLI ->> anocir: anocir create <container-id>
    activate anocir
    anocir->>anocir: <br>configure container
    box rgba(255, 184, 108, 0.5) unix domain socket
    participant ipc
    end
    anocir->>ipc: <br><br>create ipc socket
    note over ipc: unix domain<br>socket created
    activate anocir
    anocir--)ipc: listen (async)
    deactivate anocir
    box rgba(80, 250, 123, 0.5) container process
    participant container
    end
    deactivate anocir
    anocir->>anocir: <br>reexec
    activate anocir
    note over container: container process<br>forked
    anocir-->container: release container process
    deactivate anocir
    activate container
    container->>container: <br>configure container
    container->>ipc: <br><br>send 'ready'
    container--)ipc: listen (sync)
    deactivate container
    ipc--)anocir: receive 'ready'
    anocir->>CLI: exit
    note over container: container process<br>persists after<br>runtime exits
  1. The user (either directly or via some higher-level runtime tooling, such as Docker) issues the anocir create <container-id> command to the runtime.
  2. The runtime does its part of the configuration of the container based on the config.json spec.
  3. The runtime creates a Unix domain socket to use for communication between the runtime and the container process.
  4. The runtime listens (asynchronously) on the socket for any messages coming from the container.
  5. The runtime ‘reexecs’, applying the configuration from step 2 to the new process.
  6. The runtime releases the container process, and continues listening on the socket, waiting to receive a ready message from the container.
  7. The container process does its part of the configuration based on the config.json spec.
  8. The container process sends a ready message to the socket to indicate it’s completed configuring and is ready to receive further commands.
  9. The container process listens (synchronously) on the socket for further instructions.
  10. The runtime receives the ready message from the socket.
  11. The runtime exits, leaving the container process running in the ‘background’.

Note: Steps 2 & 7 are where the bulk of our work in future posts is going to be. Today, we’ll be working on everything else above.

We’re going to be referencing a few values quite frequently, so let’s first create some constants to store them in.

const (
    containerRootDir      = "/var/lib/anocir/containers"
    initSockFilename      = "init.sock"
    containerSockFilename = "container.sock"
)

The containerRootDir is the directory where the runtime is storing container metadata. The initSockFilename and containerSockFilename are the filenames of the sockets we’ll be using for IPC during initialisation and container start, respectively.

Next up, we can start creating the Init function on the container. I’ll annotate in comments the steps to which the different sections relate.

func (c *Container) Init() error {
    // 2. configure container
    // TODO: configure container

    // 3. create ipc socket
    listener, err := net.Listen(
	"unix",
	filepath.Join(containerRootDir, c.State.ID, initSockFilename),
    )
    if err != nil {
	return fmt.Errorf("listen on init sock: %w", err)
    }
    defer listener.Close()

    // 5. reexec
    cmd := exec.Command("/proc/self/exe", "reexec", c.State.ID)

    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Start(); err != nil {
	return fmt.Errorf("reexec container process: %w", err)
    }

    c.State.Pid = cmd.Process.Pid

    // 6. release container process
    if err := cmd.Process.Release(); err != nil {
	return fmt.Errorf("release container process: %w", err)
    }

    // 4. listen
    conn, err := listener.Accept()
    if err != nil {
	return fmt.Errorf("accept on init sock: %w", err)
    }
    defer conn.Close()

    b := make([]byte, 128)
    n, err := conn.Read(b)
    if err != nil {
	return fmt.Errorf("read bytes from init sock connection: %w", err)
    }

    // 10. receive 'ready'
    msg := string(b[:n])
    if msg != "ready" {
	return fmt.Errorf("expecting 'ready' but received '%s'", msg)
    }

    c.State.Status = specs.StateCreated

    // 11. exit
    return nil
}

If you’re familiar with the net and os/exec packages from the standard library, that should all look pretty familiar. What might look strange is the command that’s being executed - /proc/self/exe.

On a Linux system, the /proc filesystem is a pseudo-filesystem that provides a view into the current state of the kernel, including details about system hardware and configuration. In addition, details of all of the processes running on the system are available, each in a subdirectory named after the process’s PID.

ls -la -d /proc/*/

dr-xr-xr-x - root    12 Jan 04:06  /proc/1
dr-xr-xr-x - root    12 Jan 04:06  /proc/2
dr-xr-xr-x - root    12 Jan 04:06  /proc/3
dr-xr-xr-x - root    12 Jan 04:06  /proc/4
dr-xr-xr-x - root    12 Jan 04:06  /proc/5
dr-xr-xr-x - root    12 Jan 04:06  /proc/6
dr-xr-xr-x - root    12 Jan 04:06  /proc/7
dr-xr-xr-x - root    12 Jan 04:06  /proc/9
# etc...

For example, to view the status of the process with PID 1 (in my case, systemd).

cat /proc/1/status

Name:   systemd
Umask:  0000
State:  S (sleeping)
Tgid:   1
Ngid:   0
Pid:    1
PPid:   0
# truncated for brevity

As a kind of ‘handy utility’ to get details about the current process, rather than specifying its PID, it can be referenced as self.

ls -la /proc/self

.r--r--r-- 0 nixpig 12 Jan 07:05  arch_status
dr-xr-xr-x - nixpig 12 Jan 07:05  attr
.rw-r--r-- 0 nixpig 12 Jan 07:05  autogroup
.r-------- 0 nixpig 12 Jan 07:05  auxv
.r--r--r-- 0 nixpig 12 Jan 07:05  cgroup
.-w------- 0 nixpig 12 Jan 07:05  clear_refs
.r--r--r-- 0 nixpig 12 Jan 07:05  cmdline
.rw-r--r-- 0 nixpig 12 Jan 07:05  comm
# etc...

One of the entries (omitted above due to truncating) is exe, which is a link to the executable of the process. Thus, from our container runtime /proc/self/exe is a link to the container runtime process currently executing.
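
If you want to see this for yourself from Go, a tiny throwaway program like the one below (illustrative only, not part of the runtime) resolves the link for the current process:

package main

import (
    "fmt"
    "os"
)

func main() {
    // /proc/self/exe is a symlink to the executable of the current process;
    // Readlink resolves it to the path of the binary that's actually running.
    exe, err := os.Readlink("/proc/self/exe")
    if err != nil {
        panic(err)
    }

    fmt.Println(exe)
}

(The standard library also exposes this via os.Executable, which is the portable way to get the same path.)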

So, you might ask: “Why do we need to re-execute the process we’re already running? Why don’t we just execute the process defined in config.json directly?” The nitty-gritty details will be elucidated in a future installment when we get into applying configuration to the container process. For now, understand that there are elements of the configuration (such as namespaces) which need to be set up by the parent process that spawns the container process and can’t all be applied from within an already-running process.
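
As a rough preview of where that’s heading (this isn’t part of today’s code, and containerID below is just a stand-in for the container’s ID), namespace flags are typically set by the parent on the command before it’s started, via SysProcAttr:

cmd := exec.Command("/proc/self/exe", "reexec", containerID)

// Hypothetical preview: ask the kernel to create the child in new UTS and
// PID namespaces. A new PID namespace, in particular, only applies to the
// children of the process that requests it, which is part of why the
// runtime re-execs itself rather than reconfiguring the current process.
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
}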

Next, we can update the Create operation to call Init on the container and save its state after it’s been initialised.

func Create(opts *CreateOpts) error {

    // ...

    cntr, err := container.New(&container.NewContainerOpts{
	ID:     opts.ID,
	Bundle: bundle,
	Spec:   spec,
    })
    if err != nil {
	return fmt.Errorf("create container: %w", err)
    }
    if err := cntr.Save(); err != nil {
	return fmt.Errorf("save container: %w", err)
    }

    if err := cntr.Init(); err != nil {
	return fmt.Errorf("initialise container: %w", err)
    }

    if err := cntr.Save(); err != nil {
	return fmt.Errorf("save container: %w", err)
    }

    return nil
}

Now, when we create a container with ./anocir create --bundle <bundle-path> <container-id>, our runtime is going to fork off a process for the container and then block in the runtime process, waiting for a ready message on the socket.

Reexec’ing the container process

The /proc/self/exe process that is forked is essentially the same as executing ./anocir reexec <container-id>, so now we need to handle what happens when that’s called. We’ll create a new Reexec operation* to do that.

* It’s not technically an operation using our current definition, i.e. an operation defined by the OCI Runtime Spec. But it feels appropriate to use the same abstraction, and many other container runtimes do the same thing.

package operations

import (
    "fmt"

    "github.com/nixpig/anocir/internal/container"
)

type ReexecOpts struct {
    ID string
}

func Reexec(opts *ReexecOpts) error {
    cntr, err := container.Load(opts.ID)
    if err != nil {
	return fmt.Errorf("load container: %w", err)
    }

    if err := cntr.Reexec(); err != nil {
	return fmt.Errorf("reexec container: %w", err)
    }

    return nil
}

First, we load the container, then we call Reexec on it.

We haven’t implemented Reexec on the container yet, so let’s do that now. Again, I’ll annotate the code with the step numbers from above.

func (c *Container) Reexec() error {
    // 7. configure container
    // TODO: configure container

    // 8. send 'ready'
    initConn, err := net.Dial(
	"unix",
	filepath.Join(containerRootDir, c.State.ID, initSockFilename),
    )
    if err != nil {
	return fmt.Errorf("dial init sock: %w", err)
    }

    if _, err := initConn.Write([]byte("ready")); err != nil {
	return fmt.Errorf("write 'ready' msg to init sock: %w", err)
    }
    // close immediately, rather than deferring
    initConn.Close()

    listener, err := net.Listen(
	"unix",
	filepath.Join(containerRootDir, c.State.ID, containerSockFilename),
    )
    if err != nil {
	return fmt.Errorf("listen on container sock: %w", err)
    }

    // 9. listen for 'start'
    containerConn, err := listener.Accept()
    if err != nil {
	return fmt.Errorf("accept on container sock: %w", err)
    }

    // ...coming in just a moment!
}

The first thing we do after configuring is write the ready message to the init socket, which will be picked up by the calling runtime process. We need to immediately close the connection (often this would be done in a defer) so that the socket doesn’t leak into the container process when it’s started (remember, the container process should be isolated from the host environment).

Then, we start listening on a new socket. Spoiler: in a couple of paragraphs we’ll see what we’re listening for and handle it 😉

For now, let’s wire that up to the CLI in the same way we’ve done for all of the other operations.

package cli

import (
    "github.com/nixpig/anocir/internal/operations"
    "github.com/spf13/cobra"
)

func reexecCmd() *cobra.Command {
    cmd := &cobra.Command{
	Use:    "reexec [flags] CONTAINER_ID",
	Args:   cobra.ExactArgs(1),
	Hidden: true, // this command is only used internally
	RunE: func(cmd *cobra.Command, args []string) error {
	    containerID := args[0]

	    return operations.Reexec(&operations.ReexecOpts{
		ID: containerID,
	    })
	},
    }

    return cmd
}

First, creating the Cobra command. Then, adding it to the root command.

func RootCmd() *cobra.Command {

    // ...

    cmd.AddCommand(
	stateCmd(),
	createCmd(),
	startCmd(),
	deleteCmd(),
	killCmd(),
	reexecCmd(),
    )

    return cmd
}

Starting the container

Below is a diagram of the steps involved in ‘starting’ a container.

sequenceDiagram
    autonumber
    actor CLI
    participant anocir as runtime
    box rgba(255, 184, 108, 0.5) unix domain socket
    participant ipc
    end
    note over container: process persisted<br>from 'create'
    ipc-->container: listening (sync)
    CLI->>anocir: <br>anocir start <container-id>
    anocir->>ipc: send 'start'
    anocir->>CLI: exit
    note over CLI,anocir: exit immediately after sending 'start'
    ipc--)container: receive 'start'
    activate container
    container->>container: <br>exec user process
    box rgba(80, 250, 123, 0.5) container process
    participant container
    end
    deactivate container
    note over container: exit
  1. The container process is already running in the ‘background’ and listening on the socket created by the runtime to receive instructions.
  2. The user (either directly or via some higher-level runtime tooling, such as Docker) issues the anocir start <container-id> command to the runtime.
  3. The runtime sends the start message to the socket.
  4. The runtime exits, as it’s done its job of instructing the container what to do.
  5. The container receives the start message.
  6. The container execs the user process defined in the config, then exits when the user process exits.

Exec’ing the user process

Let’s pick back up where we left the last section…listening on the socket in the reexec.

func (c *Container) Reexec() error {

    // ...

    // 9. listen for 'start'
    containerConn, err := listener.Accept()
    if err != nil {
	return fmt.Errorf("accept on container sock: %w", err)
    }

    b := make([]byte, 128)
    n, err := containerConn.Read(b)
    if err != nil {
	return fmt.Errorf("read bytes from container sock: %w", err)
    }

    msg := string(b[:n])
    if msg != "start" {
	return fmt.Errorf("expecting 'start' but received '%s'", msg)
    }

    // close before exec'ing the user process
    containerConn.Close()
    listener.Close()

    bin, err := exec.LookPath(c.Spec.Process.Args[0])
    if err != nil {
	return fmt.Errorf("find path of user process binary: %w", err)
    }

    args := c.Spec.Process.Args
    env := os.Environ()

    if err := syscall.Exec(bin, args, env); err != nil {
	return fmt.Errorf("execve (%s, %s, %v): %w", bin, args, env, err)
    }

    panic("if you got here then something went horribly wrong")
}

First up, we wait to receive the start message from the runtime on the socket. When we receive the start message, we immediately close the listener and connection, so as not to leak these into the container process.

Now, we get to actually execute the user process defined in the config!

The command specified in the first item of the Args array may or may not (often not) be an absolute path to the binary, so we need to get an absolute path to it (if it’s available in the $PATH).

Then, we get the environment. Bear in mind, at the moment this is the same as the host environment.
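
As a sketch of a later refinement (not part of today’s changes), the spec’s Process.Env field is already a list of KEY=value strings in the form execve expects, so eventually we could prefer it over the host environment:

env := os.Environ() // today: inherit the host environment wholesale

// Sketch: if the spec defines an environment, use that instead. Process.Env
// in the OCI spec is a []string of "KEY=value" entries.
if c.Spec.Process != nil && len(c.Spec.Process.Env) > 0 {
    env = c.Spec.Process.Env
}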

We take the absolute path to the binary, the full arguments array, and the environment array, and execute an Exec syscall with them.

At this point, the process should be replaced by the execution of the user process from the config. Nothing in our code past this point should ever get executed, thus we panic if it does.

Why use syscall.Exec? Why not use the exec package?

The exec package, when executing a command, creates a new process for that command. What we need is to replace the existing process with the new command.

execve(2) is how we do that. From the Linux man pages:

execve() executes the program referred to by pathname. This causes the program that is currently being run by the calling process to be replaced with a new program, with newly initialized stack, heap, and (initialized and uninitialized) data segments.

From Go, that means making the syscall.Exec syscall. From the Go docs:

Exec invokes the execve(2) system call.

For a slightly more in depth explanation and example, see Go By Example: Exec’ing Processes.
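
To make the difference concrete, here’s a small, self-contained program (purely illustrative) that replaces itself with ls:

package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

func main() {
    // Resolve "ls" against $PATH, just like we do for the user process.
    bin, err := exec.LookPath("ls")
    if err != nil {
        panic(err)
    }

    fmt.Println("about to be replaced, pid:", os.Getpid())

    // Replaces the current program image in place; on success this never
    // returns, so nothing after this line will run.
    if err := syscall.Exec(bin, []string{"ls", "-la"}, os.Environ()); err != nil {
        panic(err)
    }

    fmt.Println("you should never see this")
}

The ls output appears under the same PID that the Go program printed, which is exactly the behaviour we want when exec’ing the container’s user process.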

Our container can now handle the start message when it receives it from the runtime. Let’s go ahead and actually send it!

Sending the start message

Before we send the start message, we need to check that the container is in a state that can be started. Per the spec, the runtime may only start the container if it’s in the created state. So, let’s create a receiver function on the Container struct to check that.

func (c *Container) canBeStarted() bool {
    return c.State.Status == specs.StateCreated
}

We can use this to check if the container can be started when we call Start, which we’ll implement now.

func (c *Container) Start() error {
    if c.Spec.Process == nil {
	// nothing to do; silent return
	return nil
    }

    if !c.canBeStarted() {
	return fmt.Errorf("container cannot be started in current state (%s)", c.State.Status)
    }

    conn, err := net.Dial(
	"unix",
	filepath.Join(containerRootDir, c.State.ID, containerSockFilename),
    )
    if err != nil {
	return fmt.Errorf("dial container sock: %w", err)
    }

    if _, err := conn.Write([]byte("start")); err != nil {
	return fmt.Errorf("write 'start' msg to container sock: %w", err)
    }
    conn.Close()

    c.State.Status = specs.StateRunning

    return nil
}

In Start, we first check if the Spec for the container has a Process defined. Then we check if the container can be started.

Assuming there’s a process defined and the container is in a state that can be started, we dial the container socket and write the start message to it. After that, we update the status of the container and return.

Let’s update our Start operation to load the container, call Start on it, and save its state.

func Start(opts *StartOpts) error {
    cntr, err := container.Load(opts.ID)
    if err != nil {
	return fmt.Errorf("load container: %w", err)
    }

    if err := cntr.Start(); err != nil {
	return fmt.Errorf("start container: %w", err)
    }

    if err := cntr.Save(); err != nil {
	return fmt.Errorf("save container: %w", err)
    }

    return nil
}

With everything apparently wired up, it might be tempting to try this out now, but hold on!

Updating the delete operation

You may have noticed, way back up the top, that once we start the container process we grab its PID and add it to the state of the container.

Now that we have a PID for our container, we can update the delete operation to ensure any running container process is killed before we remove its resources.

We’re going to use the unix package for the kill signal, so make sure to go get it: go get golang.org/x/sys/unix.

func (c *Container) Delete(force bool) error {
    if !force && !c.canBeDeleted() {
	return fmt.Errorf("container cannot be deleted in current state (%s) try using '--force'", c.State.Status)
    }

    process, err := os.FindProcess(c.State.Pid)
    if err != nil {
	return fmt.Errorf("find container process to delete: %w", err)
    }
    if process != nil {
	process.Signal(unix.SIGKILL)
    }

    if err := os.RemoveAll(
	filepath.Join(containerRootDir, c.State.ID),
    ); err != nil {
	return fmt.Errorf("remove container directory: %w", err)
    }

    return nil
}

Before removing the container directory, we check if we can find the container process. If we can, we send a kill signal to it to ensure it’s no longer running before removing any associated resources.
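
One aside worth knowing (not a change to the code above): on Unix, os.FindProcess always succeeds and returns a handle regardless of whether the process actually exists, so the nil check is really only guarding the lookup itself. If you wanted a genuine liveness probe, a common trick (sketched below with the golang.org/x/sys/unix package we just pulled in; this helper isn’t part of the runtime) is to send signal 0, which performs kill(2)’s existence and permission checks without delivering a signal:

// processExists reports whether a process with the given PID currently exists.
// Note: an EPERM error would also mean the process exists (we just can't
// signal it), but the simple nil check is enough for this sketch.
func processExists(pid int) bool {
    return unix.Kill(pid, 0) == nil
}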

Now we get to the fun part - testing it works!

To recap on the expectations:

  1. Create a container using ./anocir create --bundle <bundle-path> <container-id>
  2. Start the container using ./anocir start <container-id>
  3. Delete the container using ./anocir delete <container-id>

Let’s go through those steps one-by-one.

Prerequisites

  • We need to have a container bundle from which to create a container. If you’ve been following along, you’ll know how to do this from Reading a bundle config and saving a container’s state. To recap, we need to:
    1. Create a directory for the bundle: mkdir alpinefs
    2. Change into the bundle directory: cd alpinefs
    3. Create the rootfs directory: mkdir rootfs
    4. Use Docker to export a rootfs: docker export $(docker create alpine) | tar -C rootfs -xvf -
    5. Generate a configuration: runc spec
  • Configure the process to run.
    1. Open the bundle config: nvim config.json
    2. Configure the process.args array: ["touch", "/container_test.txt"]

Create a container

./anocir create --bundle alpinefs test1
./anocir state test1

{
  "ociVersion": "1.2.0",
  "id": "test1",
  "status": "created", # <- container is created
  "pid": 87105, # <- container has a PID
  "bundle": "/home/nixpig/projects/alpinefs"
}

Start the container

./anocir start test1

…and, we get nothing back in the terminal. While you might have been hoping (or even expecting) to see the result of running the command in process.args in the terminal, remember - it’s running in a detached process with no stdio hooked up (we’ll be hooking that up in a future installment of the series).

We can verify that it did, in fact, execute by checking for the presence of the /container_test.txt file that should have been touch’d.

ls -la /container_test.txt

.rw-r--r-- 0 root 12 Jan 08:54  container_test.txt

…and check the state of the container.

./anocir state test1

{
  "ociVersion": "1.2.0",
  "id": "test1",
  "status": "running", # <- container is running
  "pid": 87105,
  "bundle": "/home/nixpig/projects/alpinefs"
}

Delete the container

./anocir delete test1 --force

It’s not very container-like

At this point, you’re probably seeing some red flags. This ‘container’ was able to write to the host filesystem. If you do some more playing around, you’ll realise it’s able to do a whole lot more that a container shouldn’t be able to.

In fact, at this point the container is just a process running on the host system with absolutely no restrictions applied to what it can and can’t do.

You’ll remember, we haven’t actually applied any of the configuration in config.json which tells the runtime what restrictions should be in place. These are what actually containerise a process. Things like namespaces, cgroups, capabilities, the list goes on…

We’re getting there, so stick around for the next post!

Part 5: Executing container runtime lifecycle hooks    🔜 Coming soon!
