diff mbox series

[5/8] kernfs: let objects opt-in to propagating from the initial namespace

Message ID 20200408152151.5780-6-christian.brauner@ubuntu.com
State Not Applicable
Delegated to: David Miller
Headers show
Series loopfs | expand

Commit Message

Christian Brauner April 8, 2020, 3:21 p.m. UTC
The initial namespace is special in many ways. One feature it always has
had is that it propagates all its devices into all non-initial
namespaces. This is e.g. true for all device classes under /sys/class/
except the net_class. Even though none of the propagated files can be
used there are still a lot of read-only values that are accessed or read
by tools running in non-initial namespaces. To not regress such
workloads we introduce the ability to tell kernfs to continue
propagating devices from the initial namespace even when the kernfs_node
is tagged with a non-initial namespace. Note that this is a purely
opt-in feature, i.e. if there were a new device class that wanted to
make use of this new infrastructure and did not want to propagate any
devices into non-initial namespaces it could simply not implement the
relevant callback.
When a new directory in sysfs is created sysfs now can simply check
whether the relevant device wants to propagate objects from the initial
namespace or not.

Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/kernfs/dir.c             | 34 +++++++++++++++++++++++++++++-----
 fs/kernfs/kernfs-internal.h | 14 ++++++++++++++
 include/linux/kernfs.h      | 22 ++++++++++++++++++++++
 include/linux/kobject_ns.h  |  3 +++
 lib/kobject.c               |  2 ++
 5 files changed, 70 insertions(+), 5 deletions(-)

Comments

Tejun Heo April 13, 2020, 7:02 p.m. UTC | #1
Hello,

On Wed, Apr 08, 2020 at 05:21:48PM +0200, Christian Brauner wrote:
> The initial namespace is special in many ways. One feature it always has
> had is that it propagates all its devices into all non-initial
> namespaces. This is e.g. true for all device classes under /sys/class/

Maybe I'm missing your point but I've always thought of it the other way
around. Some namespaces make all objects visible in init_ns so that all
non-init namespaces are subset of the init one, which sometimes requires
creating aliases. Other namespaces don't do that. At least in my experience,
the former is a lot easier to administer.

The current namespace support in kernfs behaves the way it does because the
only namespace it supports is netns, but if we're expanding it, I think it
might be better to default to init_ns is superset of all others model and make
netns opt for the disjointing behavior.

Thanks.
Christian Brauner April 13, 2020, 7:39 p.m. UTC | #2
On Mon, Apr 13, 2020 at 03:02:39PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 08, 2020 at 05:21:48PM +0200, Christian Brauner wrote:
> > The initial namespace is special in many ways. One feature it always has
> > had is that it propagates all its devices into all non-initial
> > namespaces. This is e.g. true for all device classes under /sys/class/
> 
> Maybe I'm missing your point but I've always thought of it the other way
> around. Some namespaces make all objects visible in init_ns so that all
> non-init namespaces are subset of the init one, which sometimes requires
> creating aliases. Other namespaces don't do that. At least in my experience,
> the former is a lot easier to administer.
> 
> The current namespace support in kernfs behaves the way it does because the
> only namespace it supports is netns, but if we're expanding it, I think it
> might be better to default to init_ns is superset of all others model and make
> netns opt for the disjointing behavior.

Hey Tejun,

The point was that devices have always been shown in all namespaces. You
can see all devices everywhere. Sure that wasn't ideal but we can't
really change that behavior since it would break userspace significantly
as a lot of tools are used to that behavior.

Another problem is that you might have two devices of the same class
with the same name that belong to different namespaces and if you shown
them all in the initial namespace you get clashes. This was one of the
original reasons why network devices are only shown in the namespace
they belong to but not in any other.

The network model of only showing the device in the namespace they belong
to also has the advantage that tools do not stomp on each others feet
when using them.
Tejun Heo April 13, 2020, 7:45 p.m. UTC | #3
Hello,

On Mon, Apr 13, 2020 at 09:39:50PM +0200, Christian Brauner wrote:
> Another problem is that you might have two devices of the same class
> with the same name that belong to different namespaces and if you shown
> them all in the initial namespace you get clashes. This was one of the
> original reasons why network devices are only shown in the namespace
> they belong to but not in any other.

For example, pid namespace has the same issue but it doesn't solve the problem
by breaking up visibility at the root level - it makes everything visiable at
root but give per-ns aliases which are selectively visble depending on the
namespace. From administration POV, this is way easier and less error-prone to
deal with and I was hoping that we could head that way rather than netdev way
for new things.

Thanks.
Christian Brauner April 13, 2020, 7:59 p.m. UTC | #4
On Mon, Apr 13, 2020 at 03:45:50PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 13, 2020 at 09:39:50PM +0200, Christian Brauner wrote:
> > Another problem is that you might have two devices of the same class
> > with the same name that belong to different namespaces and if you shown
> > them all in the initial namespace you get clashes. This was one of the
> > original reasons why network devices are only shown in the namespace
> > they belong to but not in any other.
> 
> For example, pid namespace has the same issue but it doesn't solve the problem
> by breaking up visibility at the root level - it makes everything visiable at
> root but give per-ns aliases which are selectively visble depending on the
> namespace. From administration POV, this is way easier and less error-prone to
> deal with and I was hoping that we could head that way rather than netdev way
> for new things.

Right, pid namespaces deal with a single random identifier about which
userspace makes no assumptions other than that it's a positive number so
generating aliases is fine. In addition pid namespaces are nicely
hierarchical. I fear that we might introduce unneeded complexity if we
go this way and start generating aliases for devices that userspace
already knows about and has expectations of. We also still face some of
the other problems I mentioned.
I do think that what you say might make sense to explore in more detail
for a new device class (or type under a given class) that userspace does
not yet know about and were we don't regress anything.
Tejun Heo April 13, 2020, 8:37 p.m. UTC | #5
Hello,

On Mon, Apr 13, 2020 at 09:59:15PM +0200, Christian Brauner wrote:
> Right, pid namespaces deal with a single random identifier about which
> userspace makes no assumptions other than that it's a positive number so
> generating aliases is fine. In addition pid namespaces are nicely

I don't see any fundamental differences between pids and device numbers. One
of the reasons pid namespace does aliasing instead of just showing subsets is
because applications can have expectations on what the specific numbers should
be - e.g. for checkpoint-restarting.

> hierarchical. I fear that we might introduce unneeded complexity if we
> go this way and start generating aliases for devices that userspace

It adds complexity for sure but the other side of the scale is losing
visiblity into what devices are on the system, which can become really nasty
in practice, so I do think it can probably justify some additional complexity
especially if it's something which can be used by different devices. Even just
for block, if we end up expanding ns support to regular block devices for some
reason, it's kinda dreadful to think a situation where all devices can't be
discovered at the system level.

> already knows about and has expectations of. We also still face some of
> the other problems I mentioned.
> I do think that what you say might make sense to explore in more detail
> for a new device class (or type under a given class) that userspace does
> not yet know about and were we don't regress anything.

I don't quite follow why adding namespace support would break existing users.
Wouldn't namespace usage be opt-in?

Thanks.
Christian Brauner April 14, 2020, 10:39 a.m. UTC | #6
On Mon, Apr 13, 2020 at 04:37:16PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 13, 2020 at 09:59:15PM +0200, Christian Brauner wrote:
> > Right, pid namespaces deal with a single random identifier about which
> > userspace makes no assumptions other than that it's a positive number so
> > generating aliases is fine. In addition pid namespaces are nicely
> 
> I don't see any fundamental differences between pids and device numbers. One
> of the reasons pid namespace does aliasing instead of just showing subsets is
> because applications can have expectations on what the specific numbers should
> be - e.g. for checkpoint-restarting.

One difference is that ownership is hierarchial in a pid namespace. This
becomes clear when looking at the parent child relationship when
creating new processes in nested pid namespaces. All processes created
in the innermost pid namespace are owned by that pid namespaces's init
process. If that pid namespace's init/subreaper process dies all
processes get zapped and autoreaped. In essence, unless the ancestor pid
namespace has setns()ed a process in there, ownership of that process is
clearly defined. I don't think that model is transferable to a device.
What seems most important to me here is that a pid namespace completely
defines ownership of a process. But there's not necessarily
a single namespace that guarantees ownership for all device types.
Network devices, imho are a good example for that. Their full ownership
is network namespace + user namespace actually. You could easily
envision other device classes where a combination of namespaces would
make sense.

> 
> > hierarchical. I fear that we might introduce unneeded complexity if we
> > go this way and start generating aliases for devices that userspace
> 
> It adds complexity for sure but the other side of the scale is losing
> visiblity into what devices are on the system, which can become really nasty
> in practice, so I do think it can probably justify some additional complexity
> especially if it's something which can be used by different devices. Even just
> for block, if we end up expanding ns support to regular block devices for some
> reason, it's kinda dreadful to think a situation where all devices can't be
> discovered at the system level.

Hm, it is already the case that we can't see all devices at the system
level. That includes network devices and also binderfs devices (the
latter don't have sysfs entries though which is what this is about).
And for virtual devices just as loop, binder, and network devices this
is fine imho. They are not physicall attached to your computer. Actual
disk devices where this would matter wouldn't be namespaced anyway imho.

We also need to consider that it is potentially dangerous for a
namespace to trigger a device event tricking the host into performing an
action on it. If e.g. the creation of a network device were to propagate
into all namespaces and there'd be a rule matching it you could trick
the host into performing privileged actions it. So it also isn't
obviously safe propagating devices out of their namespace. (I fixed
something similar to this just recently in a sysfs series.)

I addition the file ownership permissions would propagate from the inner
to all outer sysfs instances as well which would mean you'd suddenly
have 100000:100000 entries in your host's sysfs in the initial
namespace.

> 
> > already knows about and has expectations of. We also still face some of
> > the other problems I mentioned.
> > I do think that what you say might make sense to explore in more detail
> > for a new device class (or type under a given class) that userspace does
> > not yet know about and were we don't regress anything.
> 
> I don't quite follow why adding namespace support would break existing users.
> Wouldn't namespace usage be opt-in?

For sysfs, this change is opt-in per device type and it only applies to
loop devices here, i.e. if you don't e.g. use loopfs nothing changes
for you at all. If you use it, all that you get is correct ownership for
sysfs entries for those loop devices accounted to you in addition to all
the other entries that have always been there. This way we can handle
legacy workloads cleanly which we really want for our use-case.

Your model would effectively require a new version of sysfs where you
e.g. mount it with a new option that zaps all device entries that don't
belong to non-initial user namespaces. Which would mean most major tools
in containers will break completely. We can still totally try to bring
up a change like this independent of this patchset. This patchset
doesn't rule this out at all.

Christian
diff mbox series

Patch

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1f2d894ae454..02796ba6521a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -575,10 +575,15 @@  static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
 		goto out_bad;
 
 	/* The kernfs node has been moved to a different namespace */
-	if (kn->parent && kernfs_ns_enabled(kn->parent) &&
-	    kernfs_info(dentry->d_sb)->ns[kn->ns_type] != kn->ns)
-		goto out_bad;
+	if (kn->parent && kernfs_ns_enabled(kn->parent)) {
+		if (kernfs_init_ns_propagates(kn->parent) &&
+		    kn->ns == kernfs_init_ns(kn->parent->ns_type))
+			goto out_good;
+		if (kernfs_info(dentry->d_sb)->ns[kn->parent->ns_type] != kn->ns)
+			goto out_bad;
+	}
 
+out_good:
 	mutex_unlock(&kernfs_mutex);
 	return 1;
 out_bad:
@@ -1090,6 +1095,10 @@  static struct dentry *kernfs_iop_lookup(struct inode *dir,
 		ns = kernfs_info(dir->i_sb)->ns[parent->ns_type];
 
 	kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	if (!kn && kernfs_init_ns_propagates(parent)) {
+		ns = kernfs_init_ns(parent->ns_type);
+		kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+	}
 
 	/* no such entry */
 	if (!kn || !kernfs_active(kn)) {
@@ -1614,6 +1623,8 @@  static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
 static struct kernfs_node *kernfs_dir_pos(const void *ns,
 	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
 {
+	const void *init_ns;
+
 	if (pos) {
 		int valid = kernfs_active(pos) &&
 			pos->parent == parent && hash == pos->hash;
@@ -1621,6 +1632,12 @@  static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		if (!valid)
 			pos = NULL;
 	}
+
+	if (kernfs_init_ns_propagates(parent))
+		init_ns = kernfs_init_ns(parent->ns_type);
+	else
+		init_ns = NULL;
+
 	if (!pos && (hash > 1) && (hash < INT_MAX)) {
 		struct rb_node *node = parent->dir.children.rb_node;
 		while (node) {
@@ -1635,7 +1652,7 @@  static struct kernfs_node *kernfs_dir_pos(const void *ns,
 		}
 	}
 	/* Skip over entries which are dying/dead or in the wrong namespace */
-	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+	while (pos && (!kernfs_active(pos) || (pos->ns != ns && pos->ns != init_ns))) {
 		struct rb_node *node = rb_next(&pos->rb);
 		if (!node)
 			pos = NULL;
@@ -1650,13 +1667,20 @@  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
 {
 	pos = kernfs_dir_pos(ns, parent, ino, pos);
 	if (pos) {
+		const void *init_ns;
+		if (kernfs_init_ns_propagates(parent))
+			init_ns = kernfs_init_ns(parent->ns_type);
+		else
+			init_ns = NULL;
+
 		do {
 			struct rb_node *node = rb_next(&pos->rb);
 			if (!node)
 				pos = NULL;
 			else
 				pos = rb_to_kn(node);
-		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
+		} while (pos && (!kernfs_active(pos) ||
+				 (pos->ns != ns && pos->ns != init_ns)));
 	}
 	return pos;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 6c375eb59460..4ba7b36103de 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -78,6 +78,20 @@  static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
 	return d_inode(dentry)->i_private;
 }
 
+extern struct net init_net;
+
+static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
+{
+	switch (ns_type) {
+	case KOBJ_NS_TYPE_NET:
+		return &init_net;
+	default:
+		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
+	}
+
+	return NULL;
+}
+
 extern const struct super_operations kernfs_sops;
 extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
 
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 0e4414bd7007..5e2143e69c1c 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -51,6 +51,7 @@  enum kernfs_node_flag {
 	KERNFS_SUICIDED		= 0x0800,
 	KERNFS_EMPTY_DIR	= 0x1000,
 	KERNFS_HAS_RELEASE	= 0x2000,
+	KERNFS_NS_PROPAGATE	= 0x4000,
 };
 
 /* @flags for kernfs_create_root() */
@@ -330,6 +331,27 @@  static inline void kernfs_enable_ns(struct kernfs_node *kn,
 	kn->ns_type = ns_type;
 }
 
+static inline void kernfs_enable_init_ns_propagates(struct kernfs_node *kn)
+{
+	WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&kn->dir.children));
+	WARN_ON_ONCE(!(kn->flags & KERNFS_NS));
+	kn->flags |= KERNFS_NS_PROPAGATE;
+}
+
+/**
+ * kernfs_init_ns_propagates - test whether init ns propagates
+ * @kn: the node to test
+ *
+ * Test whether kernfs entries created in the init namespace propagate into
+ * other namespaces.
+ */
+static inline bool kernfs_init_ns_propagates(const struct kernfs_node *kn)
+{
+	return ((kn->flags & (KERNFS_NS | KERNFS_NS_PROPAGATE)) ==
+		(KERNFS_NS | KERNFS_NS_PROPAGATE));
+}
+
 /**
  * kernfs_ns_enabled - test whether namespace is enabled
  * @kn: the node to test
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 991a9286bcea..216f9112ee1d 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -34,6 +34,8 @@  enum kobj_ns_type {
  *   @grab_current_ns: return a new reference to calling task's namespace
  *   @initial_ns: return the initial namespace (i.e. init_net_ns)
  *   @drop_ns: drops a reference to namespace
+ *   @initial_ns_propagates: whether devices in the initial namespace propagate
+ *			to all other namespaces
  */
 struct kobj_ns_type_operations {
 	enum kobj_ns_type type;
@@ -41,6 +43,7 @@  struct kobj_ns_type_operations {
 	void *(*grab_current_ns)(void);
 	const void *(*initial_ns)(void);
 	void (*drop_ns)(void *);
+	bool (*initial_ns_propagates)(void);
 };
 
 int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
diff --git a/lib/kobject.c b/lib/kobject.c
index c58c62d49a10..96bb8c732d1c 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -121,6 +121,8 @@  static int create_dir(struct kobject *kobj)
 		BUG_ON(!kobj_ns_type_registered(ops->type));
 
 		sysfs_enable_ns(kobj->sd, ops->type);
+		if (ops->initial_ns_propagates && ops->initial_ns_propagates())
+			kernfs_enable_init_ns_propagates(kobj->sd);
 	}
 
 	return 0;