Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --overlay and related options #547

Merged
merged 1 commit into from
Oct 16, 2024
Merged

Conversation

rhendric
Copy link
Contributor

@rhendric rhendric commented Jan 6, 2023

This commit adds --overlay, --tmp-overlay, --ro-overlay, and --overlay-src options to enable bubblewrap to create overlay mounts. These options are only permitted when bubblewrap is not installed setuid.


This is a continuation/partial rewrite of #167, addressing the feedback given there.

Other improvements of note:

  • Tests!
  • Characters treated specially by overlayfs are escaped; there are now no (known?) restrictions on the characters that can be in an overlay path.
  • --tmp-overlay is a new contribution for a use case I frequently have: I want to mount an overlay that is writable but I don't want to persist the writes between runs.

Copy link
Collaborator

@smcv smcv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a Signed-off-by to your commit(s) to indicate that you agree to the Developer Certificate of Origin terms.

bubblewrap.c Show resolved Hide resolved
bubblewrap.c Outdated Show resolved Hide resolved
bwrap.xml Outdated Show resolved Hide resolved
bubblewrap.c Outdated Show resolved Hide resolved
bubblewrap.c Outdated Show resolved Hide resolved
tests/test-run.sh Show resolved Hide resolved
@smcv
Copy link
Collaborator

smcv commented Jan 6, 2023

The commit message should have Resolves: #412 or similar, to link it up to the feature request that you're implementing here.

@rhendric rhendric force-pushed the rhendric/overlayfs branch 2 times, most recently from 742d3af to f995ca0 Compare January 6, 2023 20:14
@rhendric rhendric force-pushed the rhendric/overlayfs branch 2 times, most recently from bc8044a to 23ff0f8 Compare March 5, 2023 00:24
@rhendric
Copy link
Contributor Author

rhendric commented Mar 5, 2023

The only outstanding thread here should be getting feedback on my reasoning for option ordering and naming. Please let me know what you think and if there's anything I can do to move this forward.

Copy link
Collaborator

@smcv smcv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this. A re-review is on my list, but the PR is non-trivial and my list is not short.

bubblewrap.c Show resolved Hide resolved
tests/test-run.sh Show resolved Hide resolved
@joanbm
Copy link

joanbm commented Apr 10, 2023

I ran into a race when running two bwrap processes consecutively using the same overlay, and the kernel has CONFIG_OVERLAY_FS_INDEX=y (this is the default on Arch Linux).

On my Arch Linux system, this example:

#!/usr/bin/env sh
set -eu

rm -rf $HOME/bwrap_ofs_cleanup_test
mkdir $HOME/bwrap_ofs_cleanup_test && cd $HOME/bwrap_ofs_cleanup_test
mkdir -p rw wrk

bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  dd if=/dev/zero of=dummy bs=1M count=10 status=none
bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  echo SUCCESS

Fails with this error:

bwrap: Can't make overlay mount on /newroot/home/zealcharm/bwrap_ofs_cleanup_test with options upperdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test/rw,workdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test/wrk,lowerdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test: Device or resource busy

And the error on dmesg is:

overlayfs: upperdir is in-use as upperdir/workdir of another mount, mount with '-o index=off' to override exclusive upperdir protection.

Inserting a short sleep or a sync in between thebwrap fixes the race.


Cause

I have taken a look and I believe that the cause is: Since I'm using --unshare-pid (without --as-pid-1), bwrap launches a reaper process as PID 1. The monitor bwrap process quits immediately after PID 2 (dd in the example above) terminates, but the PID 1 takes a while longer to terminate (it is visible in ps aux), so the overlayfs mount still lingers a little longer.

Trying --sync-fd

I tried using --sync-fd (I think that's the right way to use it, apologies if not) but it doesn't seem to fix it, the race still happens:

#!/usr/bin/env bash
set -eu

rm -rf $HOME/bwrap_ofs_cleanup_test
mkdir $HOME/bwrap_ofs_cleanup_test && cd $HOME/bwrap_ofs_cleanup_test
mkdir -p rw wrk

syncfile="$(mktemp)"
exec {syncfd_for_bwrap}<> "$syncfile"
exec {syncfd_for_wait}<> "$syncfile"
rm -f "$syncfile"
flock -x "$syncfd_for_bwrap"
bwrap --unshare-user --unshare-pid --sync-fd "$syncfd_for_bwrap" --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  dd if=/dev/zero of=dummy bs=1M count=10 status=none
exec {syncfd_for_bwrap}>&-
flock -x "$syncfd_for_wait"

bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  echo SUCCESS

I think the problem is the same, there's nothing that guarantees that the sync FD is closed after all mounts are torn down.

Waiting for PID 1 to terminate

I tried waiting for PID 1 to terminate, and this seems to work:

diff --git a/bubblewrap.c b/bubblewrap.c
index 5e90e5e..e53f04b 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -560,6 +560,7 @@ monitor_child (int event_fd, pid_t child_pid, int setup_finished_fd)
             {
               exitc = (int) val - 1;
               report_child_exit_status (exitc, setup_finished_fd);
+              waitpid (child_pid, NULL, 0);
               return exitc;
             }
         }

However, I'm not sure if this is a bulletproof fix, and it is not a proper fix since this will also cause bwrap to wait until all backgrounded processes finish instead of just the main process.

Workaround

For now I'm explicitly disabling the inode index to avoid running into the issue:

diff --git a/bubblewrap.c b/bubblewrap.c
index 5e90e5e..a177c75 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1268,6 +1268,7 @@ setup_newroot (bool unshare_pid,
             if (mkdir (dest, 0755) != 0 && errno != EEXIST)
               die_with_error ("Can't mkdir %s", op->dest);
 
+            strappend (&sb, "index=off,");
             if (op->source != NULL)
               {
                 strappend (&sb, "upperdir=/oldroot");

@rhendric
Copy link
Contributor Author

it is not a proper fix since this will also cause bwrap to wait until all backgrounded processes finish instead of just the main process.

Honestly, this is what I thought bwrap already did. Is there a reason this isn't an improvement?

Maybe it could be enabled by a separate flag? I would imagine that waiting for everything in the sandbox to be well and truly finished is a feature that would be useful in other cases.

@joanbm
Copy link

joanbm commented Apr 10, 2023

TBH I thought the same, but you can see that that's not the case by running bwrap --unshare-all --dev-bind / / sh -c 'sleep 50 &' then ps aux.

I guess it could be useful if you are launching something that daemonizes, but it's going to be trouble for anything that locks a resource like the overlayfs inode index.

So I also think that having flag(s) to wait or kill all processes instead of just for the main process could be useful for those scenarios.

Also, I suppose the waitpid solution above can be implemented without breaking compatibility by doing it only if just PID 1 remains after the main process exits.

@rusty-snake
Copy link
Contributor

doing it only if just PID 1 remains after the main process exits.

FWIW: if pid1 exists (no matter how) everything in the pid namespace will get a SIGKILL. I.e. there is no way pid1 has exited and deamons are still running.

Of course you only have a pid1 when you use unshare-pid.

@smcv
Copy link
Collaborator

smcv commented Apr 11, 2023

if pid1 exists

You mean if pid 1 exits, I think (unfortunately either word makes sense here, but only one is correct).

Yes, if we are unsharing the pid namespace for the sandbox, when pid 1 within that namespace exits (that's the user-specified process if you used --as-pid1, or bubblewrap's reaper process otherwise), everything else in the sandbox's pid namespace is killed (see pid_namespaces(7)).

However, I don't know whether they are killed synchronously before pid 1 is allowed to exit, or whether pid 1 can exit while the overlayfs resource is still in use...

Maybe [waiting for pid 1 to exit] could be enabled by a separate flag?

If there is any situation in which the reaper process would not exit promptly, then that would make sense as an opt-in thing. Or if the reaper process would always exit promptly in any case, then it might as well be on by default, in all cases where the reaper process gets run. (In situations where the reaper is not run, of course we must not wait for it.)

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

Or, you might find that your use-case is better served by creating a long-running sandbox containing an IPC server, and then poking commands into it, analogous to docker run somecontainer sleep infinity followed by one or more docker exec to poke commands into the container.

@rhendric
Copy link
Contributor Author

If there is any situation in which the reaper process would not exit promptly, then that would make sense as an opt-in thing. Or if the reaper process would always exit promptly in any case, then it might as well be on by default, in all cases where the reaper process gets run. (In situations where the reaper is not run, of course we must not wait for it.)

If I understand the relevant terms correctly, we know that the reaper process doesn't exit until all children have terminated (the bwrap --unshare-all --dev-bind / / sh -c 'sleep 50 &' example demonstrates this). So having an opt-in flag for waiting would be beneficial, and it sounds like it would resolve the issue at hand without being coupled to the rest of the --overlay work. I propose tracking that as a separate issue/PR.

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

It might indeed make sense to have this sort of option (as well as for the other options overlayfs exposes—there are a few of them), but I'm reluctant to design and implement them without an actual need. Turning off the index for this issue is more of a workaround for not having --wait-for-pid-1 functionality (or whatever it should be called), at least as I understand it. A user would have similar problems if background processes in the sandbox held any other sort of host resource past the lifetime of the parent sandbox command—I don't think we'd want to tell such users to disable any features their sandboxed programs are using that lock resources, as a long-term solution.

@joanbm
Copy link

joanbm commented Apr 28, 2023

Or, you might find that your use-case is better served by creating a long-running sandbox containing an IPC server, and then poking commands into it

I found the problem while trying to create a sandbox wrapper over an existing command line tool (makepkg, Arch Linux's package builder), with the idea that it works as a drop-in replacement for the original tool. Then, I have some scripts that run makepkg a few times consecutively that I switched to the wrapper, so that's how I ran into the issue. And sure, creating and tearing down the sandbox every time is suboptimal, but it's simple and works fast enough.

Perhaps my specific use case is a bit obscure, but I think the idea of being able to use overlays in drop-in tool wrappers that are used from scripts is something that ought to work.

So having an opt-in flag for waiting would be beneficial, [...]. I propose tracking that as a separate issue/PR.

I agree that creating a new issue/PR is the right approach, ultimately the cause of the problem I ran into is not related to this PR, but rather a more general issue with the exit/cleanup code and locked resources, and the conditions to run into it are pretty specific.

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

It might indeed make sense to have this sort of option (as well as for the other options overlayfs exposes—there are a few of them), but I'm reluctant to design and implement them without an actual need. Turning off the index for this issue is more of a workaround for not having --wait-for-pid-1 functionality (or whatever it should be called), at least as I understand it.

Agreed, I think bwrap should strive for simplicity and turning off the overlayfs index is just a workaround. If someone needs to tune their mounts for some specific use case, this can be done like in #412 (comment).

utils.c Outdated Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
utils.c Show resolved Hide resolved
bubblewrap.c Show resolved Hide resolved
@swick
Copy link
Contributor

swick commented Sep 22, 2023

A few nits but this looks great!

Just from a simplicity perspective I think it would be easier to follow what's going on if the ops were added directly and then processed in setup_newroot by, e.g. prending them to a list and then walk the list.

The other big thing is how a user like flatpak should test for availability of this feature. How is this done with other features?

@rhendric
Copy link
Contributor Author

@smcv, in #596 (comment) you said ‘solving its remaining issues’. As far as I'm aware the only outstanding issue here is your decision on whether the design choices we discussed in #547 (comment) are acceptable. Is there anything else for anyone to do? Let me know and I'll be happy to work on it.

@swick
Copy link
Contributor

swick commented Sep 23, 2023

FWIW, the order as implemented right now makes a lot of sense when you build the command line for it from a program like flatpak.

@swick
Copy link
Contributor

swick commented Sep 24, 2023

I have a flatpak branch now which is basically doing --overlay-src $path-to-runtime-src --overlay-src $path-to-flatpak-helper-monitor --tmp-overlay /etc and any subsequent attempt to mount something onto existing directories in /etc fails with a permission denied error because the directories are owned by nobody.

Letting bwrap chown the directory doesn't work and fails with EINVAL.

For me --tmp-overlay is not useful. What works here is to either add the file directly in a new upper layer:

--overlay-src $path-to-runtime-src --overlay-src $masking-dir --overlay-src $path-to-flatpak-helper-monitor --tmp-overlay /etc --file 14 /etc/existing-dir/file

or just bind mounting it without the tmp overlay like this:

--overlay-src $path-to-runtime-src --overlay-src $masking-dir --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --bind-data 14 /etc/existing-dir/file

In both cases $masking-dir contains a the file /etc/existing-dir/file with workable uid:gid.

Unfortunately the masking dir layer is on the oldroot. What I want, really is a way to add a layer from the newroot:

--overlay-src $path-to-runtime-src --overlay-src-newroot /run/etc-overlay --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --file 14 /run/etc-overlay/existing-dir/file

This also works for bind mounting:

--overlay-src $path-to-runtime-src --overlay-src-newroot /run/etc-overlay --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --dir /run/etc-overlay/mount-point --bind $some-mount /etc/mount-point

Long story short, I think an option to create an overlayfs layer from a directory from the newroot would be really helpful.

@rhendric
Copy link
Contributor Author

I've rebased onto the 0.9.0 release and implemented the userxattr flag; thanks for the suggestion.

@smcv, any chance this can get your eyes again?

bubblewrap.c Outdated Show resolved Hide resolved
bubblewrap.c Show resolved Hide resolved
@smcv
Copy link
Collaborator

smcv commented Sep 16, 2024

Sorry, this got preempted by a security vulnerability in Flatpak which required a new bubblewrap feature to be added under embargo. Please could you rebase onto 0.10.0?

Is this otherwise in a ready-to-review state, or are there implementation concerns remaining?

Please mark the threads that various people have raised as "Resolved", if you believe those concerns have been addressed (or if you can't do that because you aren't a maintainer, the next best thing is to reply to the thread and say "Resolved").

Frontear added a commit to Frontear/nix-wrap that referenced this pull request Sep 16, 2024
Originally I was trying to ensure some level of isolation, but now I
realize that's contrary to my original goal: a clean $HOME.

Given that, I have changed the bwrap to bind all relevant mounts from
root and the user home. In order to prevent propagation of the managed
configs, directories are recursively bind mounted, which at the moment
is incurring a performance penalty, but may be improved if
containers/bubblewrap#547 gets merged.
@rhendric
Copy link
Contributor Author

Rebased and ready for review.

tests/libtest.sh Outdated Show resolved Hide resolved
Copy link
Collaborator

@smcv smcv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience with this!

bwrap.xml Outdated Show resolved Hide resolved
bwrap.xml Outdated Show resolved Hide resolved
tests/test-run.sh Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
utils.c Outdated Show resolved Hide resolved
bubblewrap.c Show resolved Hide resolved
@tu-maurice
Copy link

Sorry if I have to barge in again, but this seems to be a related issue:

Overlayfs (in the kernel) seems to have a few lines of code where it checks whether the user and group of a file that is modified have a mapping in the current userns (see torvalds/linux@4f11ada). In my use case (trying to alter files of a different uid, but same gid) this leads to various errors with the error text "Value too large for defined data type" (which is imho not really helpful).

So far I don't see how I can make bwrap identity map all users and groups (except maybe for root) into the user namespace used to setup the unprivileged mount namespace, in order to not trigger this error in overlayfs. (The solution might even be that bwrap does this automatically when using an overlayfs?)

Any thoughts?

@smcv
Copy link
Collaborator

smcv commented Oct 4, 2024

I don't see how I can make bwrap identity map all users and groups (except maybe for root)

As far as I'm aware, this is impossible: the kernel does not allow it. While unprivileged, we are only allowed to map one uid and one gid (our own uid and primary gid). Everyone else is mapped to the overflow uid/gid (usually nobody), which you should interpret as "not me".

Sorry, but in this project we have a strict policy of not doing impossible things.

In my use case (trying to alter files of a different uid, but same gid) this leads to various errors with the error text "Value too large for defined data type" (which is imho not really helpful)

We do not have any control over that error message, it comes from the kernel (which reports EOVERFLOW) and glibc (which translates it to the human-readable text you quoted). We can only influence the error message if it's code that is running in the bubblewrap process, and we have enough information to know the real context (which often we do not).

Unfortunately, kernel-level error reporting does not have a way to clarify the real meaning of error messages like this. You can see other effects of this design in various places inside bubblewrap, for example where we have to substitute a different error message when mount() fails with ENOSPC.

So I think your use-case is going to be impossible, even after merging this PR. You can't use the overlay to modify files that are owned by a different user. The closest you can get is that if you know in advance which files will be modified, you can make a copy in advance that is owned by you, and mount it over the top of the original file.

For more advanced use-cases with multiple uids, please consider using a fully-featured container manager like podman instead.

@tu-maurice
Copy link

I don't see how I can make bwrap identity map all users and groups (except maybe for root)

As far as I'm aware, this is impossible: the kernel does not allow it. While unprivileged, we are only allowed to map one uid and one gid (our own uid and primary gid). Everyone else is mapped to the overflow uid/gid (usually nobody), which you should interpret as "not me".

Fair point. I didn't know that.

Sorry, but in this project we have a strict policy of not doing impossible things.

No need to be an obnoxious muppet. ❤️

In my use case (trying to alter files of a different uid, but same gid) this leads to various errors with the error text "Value too large for defined data type" (which is imho not really helpful)

We do not have any control over that error message, it comes from the kernel (which reports EOVERFLOW) and glibc (which translates it to the human-readable text you quoted). We can only influence the error message if it's code that is running in the bubblewrap process, and we have enough information to know the real context (which often we do not).

Okay, buddy. Just for you in simple terms: I pointed to the source code of the kernel, because the check there was relevant to my comment. I did complain about the error message as a general side note, not with the expectation for you to fix it. I wanted to point out how the issue I ran into with this PR manifests, to hopefully prevent this happening to others further down the line. There's really no need to be so condescending.

So I think your use-case is going to be impossible, even after merging this PR. You can't use the overlay to modify files that are owned by a different user. The closest you can get is that if you know in advance which files will be modified, you can make a copy in advance that is owned by you, and mount it over the top of the original file.

Fair point. I thought accessing files of the same group is not too extravagant of a use-case, but as this seems to require a lot more preparation, I can see how it is out of scope. I'll think about your suggestion. Thanks.

@smcv
Copy link
Collaborator

smcv commented Oct 4, 2024

I did complain about the error message as a general side note, not with the expectation for you to fix it

I'm sorry I didn't initially understand that. There does seem to be a tendency in other issue reports for bubblewrap users to assume that any container problem is a bug that should be fixed in bubblewrap (including kernel limitations that are outside our control), so it's difficult to distinguish between that expectation and an awareness that not everything can be in-scope here.

@smcv
Copy link
Collaborator

smcv commented Oct 4, 2024

Back to the subject of this PR, the updated version with overflow checks is looking good - this is a relatively complicated addition, so I want to give it one more review pass from first principles before hitting merge, but it's certainly close.

Copy link
Collaborator

@smcv smcv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a version that has been rebased onto #660 (using true and false for booleans).

bubblewrap.c Outdated Show resolved Hide resolved
@smcv
Copy link
Collaborator

smcv commented Oct 15, 2024

I think the last thing blocking this now is that it needs testing in older build environments, to make sure the test is run or skipped appropriately. I'll aim to do that soon.

Fixing #547 (comment) would be nice but is not a blocker.

@smcv smcv added this to the 0.11 milestone Oct 15, 2024
@smcv smcv self-requested a review October 15, 2024 18:11
This commit adds --overlay, --tmp-overlay, --ro-overlay, and
--overlay-src options to enable bubblewrap to create overlay mounts.
These options are only permitted when bubblewrap is not installed
setuid.

Resolves: containers#412
Co-authored-by: William Manley <[email protected]>
Signed-off-by: Ryan Hendrickson <[email protected]>
[smcv: Fix merge conflicts with containers#660]
Signed-off-by: Simon McVittie <[email protected]>
@smcv smcv merged commit ff33964 into containers:main Oct 16, 2024
4 checks passed
@rhendric rhendric deleted the rhendric/overlayfs branch October 16, 2024 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants