Initialising a container and starting the user process
This is Part 4 of the series Building a container runtime from scratch in Go.
In the fourth part of the series, we learn about fork/exec and IPC over Unix domain sockets to initialise the container, then run the user-defined process.
The source code to accompany this post is available on GitHub.
When we start a container, the process that gets run in the container is defined in the process field[1] of the config.json of the container bundle. This contains a bunch of fields we’ll cover in future posts, but the one we’re concerned with today is args.
// Process contains information to start a specific application
// inside the container.
type Process struct {
// Args specifies the binary and arguments for the application to execute.
Args []string `json:"args,omitempty"`
// ...
}
The args field is an array of strings with the same semantics as the argv argument of the exec[2] family of functions. Essentially, what this means is that the first item in the array is the command to execute and the remaining items are the arguments to pass to it.
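To make that concrete, here’s a standalone sketch (not code from the runtime, and the args value is just an example) showing how the field decodes and how the first item maps onto the binary and its arguments:

package main

import (
    "encoding/json"
    "fmt"
)

// Process mirrors just the field we care about from the spec struct above.
type Process struct {
    Args []string `json:"args,omitempty"`
}

func main() {
    raw := `{"args": ["echo", "hello", "world"]}`

    var p Process
    if err := json.Unmarshal([]byte(raw), &p); err != nil {
        panic(err)
    }

    fmt.Println("binary:   ", p.Args[0])  // echo
    fmt.Println("arguments:", p.Args[1:]) // [hello world]
}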
There are two steps to getting the user process defined in the config.json spec running in a containerised environment.
- Initialising the container (anocir create <container-id>) - In this step the runtime creates a container process and applies the required configuration. It then sets up an IPC channel over a Unix domain socket so that it can continue to communicate with the container once the container process is detached.
- Starting the container (anocir start <container-id>) - In this step the runtime sends a message over the IPC channel to the container process instructing the container to start. The container process applies the remaining configuration, then execs the user process defined in the config.
Let’s get to it…
Initialising the container
Below is a diagram of the steps involved in ‘initialising’ a container.
sequenceDiagram
  autonumber
  actor CLI
  participant anocir as runtime
  CLI ->> anocir: anocir create <container-id>
  activate anocir
  anocir->>anocir: <br>configure container
  box rgba(255, 184, 108, 0.5) unix domain socket
    participant ipc
  end
  anocir->>ipc: <br><br>create ipc socket
  note over ipc: unix domain<br>socket created
  activate anocir
  anocir--)ipc: listen (async)
  deactivate anocir
  box rgba(80, 250, 123, 0.5) container process
    participant container
  end
  deactivate anocir
  anocir->>anocir: <br>reexec
  activate anocir
  note over container: container process<br>forked
  anocir-->container: release container process
  deactivate anocir
  activate container
  container->>container: <br>configure container
  container->>ipc: <br><br>send 'ready'
  container--)ipc: listen (sync)
  deactivate container
  ipc--)anocir: receive 'ready'
  anocir->>CLI: exit
  note over container: container process<br>persists after<br>runtime exits
1. The user (either directly or via some higher-level runtime tooling, such as Docker) issues the anocir create <container-id> command to the runtime.
2. The runtime does its part of the configuration of the container based on the config.json spec.
3. The runtime creates a Unix domain socket to use for communication between the runtime and the container process.
4. The runtime listens (asynchronously) on the socket for any messages coming from the container.
5. The runtime ‘reexecs’, applying the configuration from step 2 to the new process.
6. The runtime releases the container process, and continues listening on the socket, waiting to receive a ready message from the container.
7. The container process does its part of the configuration based on the config.json spec.
8. The container process sends a ready message to the socket to indicate it’s completed configuring and is ready to receive further commands.
9. The container process listens (synchronously) on the socket for further instructions.
10. The runtime receives the ready message from the socket.
11. The runtime exits, leaving the container process running in the ‘background’.
Note: Steps 2 & 7 are where the bulk of our work in future posts is going to be. Today, we’ll be working on everything else above.
We’re going to be referencing a few values quite frequently, so let’s first create some constants to store them in.
const (
containerRootDir = "/var/lib/anocir/containers"
initSockFilename = "init.sock"
containerSockFilename = "container.sock"
)
The containerRootDir is the directory where the runtime is storing container metadata. The initSockFilename and containerSockFilename are the filenames of the sockets we’ll be using for IPC during initialisation and container start, respectively.
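To make that concrete, for the test1 container we’ll create later in this post, the two sockets end up at the following paths (shown here purely as an illustration):

// Illustrative: where the sockets live for a container with ID "test1".
fmt.Println(filepath.Join(containerRootDir, "test1", initSockFilename))
// /var/lib/anocir/containers/test1/init.sock

fmt.Println(filepath.Join(containerRootDir, "test1", containerSockFilename))
// /var/lib/anocir/containers/test1/container.sock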
Next up, we can start creating the Init function on the container. I’ll annotate in comments the steps to which the different sections relate.
func (c *Container) Init() error {
// 2. configure container
// TODO: configure container
// 3. create ipc socket
listener, err := net.Listen(
"unix",
filepath.Join(containerRootDir, c.State.ID, initSockFilename),
)
if err != nil {
return fmt.Errorf("listen on init sock: %w", err)
}
defer listener.Close()
// 5. reexec
cmd := exec.Command("/proc/self/exe", "reexec", c.State.ID)
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Start(); err != nil {
return fmt.Errorf("reexec container process: %w", err)
}
c.State.Pid = cmd.Process.Pid
// 6. release container process
if err := cmd.Process.Release(); err != nil {
return fmt.Errorf("release container process: %w", err)
}
// 4. listen
conn, err := listener.Accept()
if err != nil {
return fmt.Errorf("accept on init sock: %w", err)
}
defer conn.Close()
b := make([]byte, 128)
n, err := conn.Read(b)
if err != nil {
return fmt.Errorf("read bytes from init sock connection: %w", err)
}
// 10. receive 'ready'
msg := string(b[:n])
if msg != "ready" {
return fmt.Errorf("expecting 'ready' but received '%s'", msg)
}
c.State.Status = specs.StateCreated
// 11. exit
return nil
}
If you’re familiar with the net[3] and os/exec[4] packages from the standard library, that should all look pretty familiar. What might look strange is the command that’s being executed: /proc/self/exe.
On a Linux system, the /proc filesystem[5] is a pseudo-filesystem that provides a view into the current state of the kernel, including details about system hardware and configuration. In addition, details of all of the processes running on the system are available, each in a subdirectory with the name of the process’ PID.
ls -la -d /proc/*/
dr-xr-xr-x - root 12 Jan 04:06 /proc/1
dr-xr-xr-x - root 12 Jan 04:06 /proc/2
dr-xr-xr-x - root 12 Jan 04:06 /proc/3
dr-xr-xr-x - root 12 Jan 04:06 /proc/4
dr-xr-xr-x - root 12 Jan 04:06 /proc/5
dr-xr-xr-x - root 12 Jan 04:06 /proc/6
dr-xr-xr-x - root 12 Jan 04:06 /proc/7
dr-xr-xr-x - root 12 Jan 04:06 /proc/9
# etc...
For example, to view the status of the process with PID 1 (in my case, systemd):
cat /proc/1/status
Name: systemd
Umask: 0000
State: S (sleeping)
Tgid: 1
Ngid: 0
Pid: 1
PPid: 0
# truncated for brevity
As a kind of ‘handy utility’ to get details about the current process, rather than specifying its PID, it can be referenced as self.
ls -la /proc/self
.r--r--r-- 0 nixpig 12 Jan 07:05 arch_status
dr-xr-xr-x - nixpig 12 Jan 07:05 attr
.rw-r--r-- 0 nixpig 12 Jan 07:05 autogroup
.r-------- 0 nixpig 12 Jan 07:05 auxv
.r--r--r-- 0 nixpig 12 Jan 07:05 cgroup
.-w------- 0 nixpig 12 Jan 07:05 clear_refs
.r--r--r-- 0 nixpig 12 Jan 07:05 cmdline
.rw-r--r-- 0 nixpig 12 Jan 07:05 comm
# etc...
One of the entries (omitted above due to truncation) is exe, which is a link to the executable of the process. Thus, from our container runtime, /proc/self/exe is a link to the container runtime binary that is currently executing.
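If you want to see that for yourself from Go, here’s a tiny throwaway program (not part of the runtime) that resolves the link to show which binary is actually executing:

package main

import (
    "fmt"
    "os"
)

func main() {
    // Resolve the /proc/self/exe symlink to the executable of this process.
    exe, err := os.Readlink("/proc/self/exe")
    if err != nil {
        panic(err)
    }

    fmt.Println(exe) // prints wherever this binary happens to live on disk
}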
So, you might ask: “why do we need to reexecute the process we’re already running? Why don’t we just execute the process defined in config.json directly?”. The nitty-gritty details will be elucidated in a future installment when we get into applying configuration to the container process. For now, understand that there are elements of the configuration (such as namespaces) which have to be applied by the process that spawns the container process, at the point it’s spawned, and which cannot be applied from within the running process itself.
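To give a rough flavour of why (this is only a sketch - we’ll do it properly in a later post, and the real flags will come from the config.json rather than being hard-coded like this), new namespaces are requested on the exec.Cmd by the parent before the reexec’d child is started:

// Sketch only: the parent asks the kernel to place the *new* process into new
// namespaces at spawn time, which is why we reexec rather than exec'ing the
// user process directly.
cmd := exec.Command("/proc/self/exe", "reexec", c.State.ID)
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
}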
Next, we can update the Create operation to call Init on the container and save its state after it’s been initialised.
func Create(opts *CreateOpts) error {
// ...
cntr, err := container.New(&container.NewContainerOpts{
ID: opts.ID,
Bundle: bundle,
Spec: spec,
})
if err != nil {
return fmt.Errorf("create container: %w", err)
}
if err := cntr.Save(); err != nil {
return fmt.Errorf("save container: %w", err)
}
if err := cntr.Init(); err != nil {
return fmt.Errorf("initialise container: %w", err)
}
if err := cntr.Save(); err != nil {
return fmt.Errorf("save container: %w", err)
}
return nil
}
Now, when we create a container with ./anocir create --bundle <bundle-path> <container-id>, our runtime is going to fork off a process for the container, and the runtime process will block waiting for a ready message on the socket.
Reexec’ing the container process
The /proc/self/exe process that is forked is essentially the same as executing ./anocir reexec <container-id>, so now we need to handle what happens when that’s called. We’ll create a new Reexec operation* to do that.
* It’s not technically an operation using our current definition, i.e. an operation defined by the OCI Runtime Spec. But it feels appropriate to use the same abstraction, and many other container runtimes do the same thing.
package operations
import (
"fmt"
"github.com/nixpig/anocir/internal/container"
)
type ReexecOpts struct {
ID string
}
func Reexec(opts *ReexecOpts) error {
cntr, err := container.Load(opts.ID)
if err != nil {
return fmt.Errorf("load container: %w", err)
}
if err := cntr.Reexec(); err != nil {
return fmt.Errorf("reexec container: %w", err)
}
return nil
}
First, we load the container, then we call Reexec on it.
We haven’t implemented Reexec on the container yet, so let’s do that now. Again, I’ll annotate the code with the step numbers from above.
func (c *Container) Reexec() error {
// 7. configure container
// TODO: configure container
// 8. send 'ready'
initConn, err := net.Dial(
"unix",
filepath.Join(containerRootDir, c.State.ID, initSockFilename),
)
if err != nil {
return fmt.Errorf("dial init sock: %w", err)
}
if _, err := initConn.Write([]byte("ready")); err != nil {
return fmt.Errorf("write 'ready' msg to init sock: %w", err)
}
// close immediately, rather than deferring
initConn.Close()
listener, err := net.Listen(
"unix",
filepath.Join(containerRootDir, c.State.ID, containerSockFilename),
)
if err != nil {
return fmt.Errorf("listen on container sock: %w", err)
}
// 9. listen for 'start'
containerConn, err := listener.Accept()
if err != nil {
return fmt.Errorf("accept on container sock: %w", err)
}
// ...coming in just a moment!
}
The first thing we do after configuring is write the ready message to the init socket, which will be picked up by the calling runtime process. We need to immediately close the connection (often this would be done in a defer) so that the socket doesn’t leak into the container process when it’s started (remember, the container process should be isolated from the host environment).
Then, we start listening on a new socket. Spoiler: in a couple of paragraphs we’ll see what we’re listening for and handle it 😉
For now, let’s wire that up to the CLI in the same way we’ve done for all of the other operations.
package cli
import (
"github.com/nixpig/anocir/internal/operations"
"github.com/spf13/cobra"
)
func reexecCmd() *cobra.Command {
cmd := &cobra.Command{
Use: "reexec [flags] CONTAINER_ID",
Args: cobra.ExactArgs(1),
Hidden: true, // this command is only used internally
RunE: func(cmd *cobra.Command, args []string) error {
containerID := args[0]
return operations.Reexec(&operations.ReexecOpts{
ID: containerID,
})
},
}
return cmd
}
First, creating the Cobra command. Then, adding it to the root command.
func RootCmd() *cobra.Command {
// ...
cmd.AddCommand(
stateCmd(),
createCmd(),
startCmd(),
deleteCmd(),
killCmd(),
reexecCmd(),
)
return cmd
}
Starting the container
Below is a diagram of the steps involved in ‘starting’ a container.
sequenceDiagram
  autonumber
  actor CLI
  participant anocir as runtime
  box rgba(255, 184, 108, 0.5) unix domain socket
    participant ipc
  end
  note over container: process persisted<br>from 'create'
  ipc-->container: listening (sync)
  CLI->>anocir: <br>anocir start <container-id>
  anocir->>ipc: send 'start'
  anocir->>CLI: exit
  note over CLI,anocir: exit immediately after sending 'start'
  ipc--)container: receive 'start'
  activate container
  container->>container: <br>exec user process
  box rgba(80, 250, 123, 0.5) container process
    participant container
  end
  deactivate container
  note over container: exit
1. The container process is already running in the ‘background’ and listening on the socket created by the runtime to receive instructions.
2. The user (either directly or via some higher-level runtime tooling, such as Docker) issues the anocir start <container-id> command to the runtime.
3. The runtime sends the start message to the socket.
4. The runtime exits, as it’s done its job of instructing the container what to do.
5. The container receives the start message.
6. The container execs the user process defined in the config, then exits when the user process exits.
Exec’ing the user process
Let’s pick back up where we left off in the last section… listening on the socket in the reexec.
func (c *Container) Reexec() error {
// ...
// 9. listen for 'start'
containerConn, err := listener.Accept()
if err != nil {
return fmt.Errorf("accept on container sock: %w", err)
}
b := make([]byte, 128)
n, err := containerConn.Read(b)
if err != nil {
return fmt.Errorf("read bytes from container sock: %w", err)
}
msg := string(b[:n])
if msg != "start" {
return fmt.Errorf("expecting 'start' but received '%s'", msg)
}
// close before exec'ing the user process
containerConn.Close()
listener.Close()
bin, err := exec.LookPath(c.Spec.Process.Args[0])
if err != nil {
return fmt.Errorf("find path of user process binary: %w", err)
}
args := c.Spec.Process.Args
env := os.Environ()
if err := syscall.Exec(bin, args, env); err != nil {
return fmt.Errorf("execve (%s, %s, %v): %w", bin, args, env, err)
}
panic("if you got here then something went horribly wrong")
}
First up, we wait to receive the start message from the runtime on the socket. When we receive the start message, we immediately close the listener and connection, so as not to leak these into the container process.
Now, we get to actually execute the user process defined in the config!
The command specified in the first item of the Args array may or may not (often not) be an absolute path to the binary, so we need to get an absolute path to it (if it’s available in the $PATH).
Then, we get the environment. Bear in mind, at the moment this is the same as the host environment.
We take the absolute path to the binary, the full arguments array, and the environment array, and execute an Exec syscall with them.
At this point, the process should be replaced by the execution of the user process from the config. Nothing in our code past this point should ever get executed, thus we panic if it does.
Why use syscall.Exec? Why not use the exec package?
The exec package, when executing a command, creates a new process for that command. What we need is to replace the existing process with the new command.
execve(2) is how we do that. From the Linux man pages:
execve() executes the program referred to by pathname. This causes the program that is currently being run by the calling process to be replaced with a new program, with newly initialized stack, heap, and (initialized and uninitialized) data segments.
From Go, that means calling syscall.Exec. From the Go docs:
Exec invokes the execve(2) system call.
For a slightly more in-depth explanation and example, see Go By Example: Exec’ing Processes.
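As a standalone illustration (a minimal sketch, separate from the runtime), the following program replaces itself with ls; if the Exec succeeds, the final Println never runs:

package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

func main() {
    // Resolve "ls" against $PATH, just like we do for the user process.
    bin, err := exec.LookPath("ls")
    if err != nil {
        panic(err)
    }

    // Replace the current process image with "ls -la".
    if err := syscall.Exec(bin, []string{"ls", "-la"}, os.Environ()); err != nil {
        panic(err)
    }

    fmt.Println("never reached: this process has been replaced by ls")
}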
Our container can now handle the start message when it receives it from the runtime. Let’s go ahead and actually send it!
Sending the start message
Before we send the start message, we need to check that the container is in a state that can be started. Per the spec, the runtime may only start the container if it’s in the created state[6]. So, let’s create a receiver function on the Container struct to check that.
func (c *Container) canBeStarted() bool {
return c.State.Status == specs.StateCreated
}
We can use this to check if the container can be started when we call Start, which we’ll implement now.
func (c *Container) Start() error {
if c.Spec.Process == nil {
// nothing to do; silent return
return nil
}
if !c.canBeStarted() {
return fmt.Errorf("container cannot be started in current state (%s)", c.State.Status)
}
conn, err := net.Dial(
"unix",
filepath.Join(containerRootDir, c.State.ID, containerSockFilename),
)
if err != nil {
return fmt.Errorf("dial container sock: %w", err)
}
if _, err := conn.Write([]byte("start")); err != nil {
return fmt.Errorf("write 'start' msg to container sock: %w", err)
}
conn.Close()
c.State.Status = specs.StateRunning
return nil
}
In Start, we first check if the Spec for the container has a Process defined. Then we check if the container can be started.
Assuming there’s a process defined and the container is in a state that can be started, we dial the container socket and write the start message to it. After that, we update the status of the container and return.
Let’s update our Start operation to load the container, call Start on it, and save its state.
func Start(opts *StartOpts) error {
cntr, err := container.Load(opts.ID)
if err != nil {
return fmt.Errorf("load container: %w", err)
}
if err := cntr.Start(); err != nil {
return fmt.Errorf("start container: %w", err)
}
if err := cntr.Save(); err != nil {
return fmt.Errorf("save container: %w", err)
}
return nil
}
With everything apparently wired up, it might be tempting to try this out now, but hold on!
Updating the delete operation
You may have noticed way back up the top that once we start the container process, we get its PID and add it to the state of the container.
Now that we have a PID for our container, we can update the delete operation to ensure any running container process is killed before we remove its resources.
We’re going to use the unix package for the kill signal, so make sure to go get it: go get golang.org/x/sys/unix.
func (c *Container) Delete(force bool) error {
if !force && !c.canBeDeleted() {
return fmt.Errorf("container cannot be deleted in current state (%s) try using '--force'", c.State.Status)
}
process, err := os.FindProcess(c.State.Pid)
if err != nil {
return fmt.Errorf("find container process to delete: %w", err)
}
if process != nil {
process.Signal(unix.SIGKILL)
}
if err := os.RemoveAll(
filepath.Join(containerRootDir, c.State.ID),
); err != nil {
return fmt.Errorf("remove container directory: %w", err)
}
return nil
}
Before removing the container directory, we check if we can find the container process. If we can, we send a kill signal to it to ensure it’s no longer running before removing any associated resources.
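One nuance worth knowing: on Unix, os.FindProcess always succeeds whether or not the PID is still alive, so the error check above will never actually trip on Linux. A common trick (not used in the code above - shown purely as an illustration) is to send signal 0, which performs error checking without delivering a signal, to test whether the process still exists:

// Illustrative only: check whether the process is still alive before killing it.
process, err := os.FindProcess(c.State.Pid)
if err != nil {
    return fmt.Errorf("find container process: %w", err)
}

// Signal 0 delivers nothing; it only reports whether the process exists.
if err := process.Signal(syscall.Signal(0)); err == nil {
    process.Signal(unix.SIGKILL)
}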
Now we get to the fun part - testing it works!
To recap on the expectations:
- Create a container using ./anocir create --bundle <bundle-path> <container-id>
- Start the container using ./anocir start <container-id>
- Delete the container using ./anocir delete <container-id>
Let’s go through those steps one-by-one.
Prerequisites
- We need to have a container bundle from which to create a container. If you’ve been following along, you’ll know how to do this from Reading a bundle config and saving a container’s state. To recap, we need to:
  - Create a directory for the bundle: mkdir alpinefs
  - Change into the bundle directory: cd alpinefs
  - Create the rootfs directory: mkdir rootfs
  - Use Docker to export a rootfs: docker export $(docker create alpine) | tar -C rootfs -xvf -
  - Generate a configuration: runc spec
- Configure the process to run.
  - Open the bundle config: nvim config.json
  - Configure the process.args array: ["touch", "/container_test.txt"]
Create a container
./anocir create --bundle alpinefs test1
./anocir state test1
{
"ociVersion": "1.2.0",
"id": "test1",
"status": "created", # <- container is created
"pid": 87105, # <- container has a PID
"bundle": "/home/nixpig/projects/alpinefs"
}
Start the container
./anocir start test1
…and, we get nothing back in the terminal. While you might have been hoping (or even expecting) to see the result of running the command in process.args in the terminal, remember - it’s running in a detached process with no stdio hooked up (we’ll hook that up in a future installment of the series).
We can verify that it did, in fact, execute by checking for the presence of the /container_test.txt file that should have been touch’d.
ls -la /container_test.txt
.rw-r--r-- 0 root 12 Jan 08:54 container_test.txt
…and check the state of the container.
./anocir state test1
{
"ociVersion": "1.2.0",
"id": "test1",
"status": "running", # <- container is running
"pid": 87105,
"bundle": "/home/nixpig/projects/alpinefs"
}
Delete the container
./anocir delete test1 --force
It’s not very container-like
At this point, you’re probably seeing some red flags. This ‘container’ was able to write to the host filesystem. If you do some more playing around, you’ll realise it’s able to do a whole lot more than a container should be able to.
In fact, at this point the container is just a process running on the host system with absolutely no restrictions applied to what it can and can’t do.
You’ll remember, we haven’t actually applied any of the configuration in config.json which tells the runtime what restrictions should be in place. These are what actually containerise a process - things like namespaces, cgroups, capabilities, the list goes on…
We’re getting there, so stick around for the next post!
Part 5: Executing container runtime lifecycle hooks 🔜 Coming soon!
References
1. https://github.com/opencontainers/runtime-spec/blob/main/config.md#process "Process config"
2. https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html "exec"
3. https://pkg.go.dev/net "net package"
4. https://pkg.go.dev/os/exec "os/exec package"
5. https://docs.kernel.org/filesystems/proc.html "The /proc filesystem"
6. https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#start "Start operation"