[RFC] ns: Syscalls for better namespace sharing control.

Message ID m11vfuvi1t.fsf@fess.ebiederm.org
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Eric W. Biederman March 8, 2010, 5:29 p.m. UTC
Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>> I have take an snapshot of my development tree and placed it at.
>>
>>
>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>   
>
> Hi Eric,
>
> thanks for the pointer.
>
> I tried to boot the kernel under qemu and I got this oops:

I am clearly running an old userspace on my test machine.  No udev.
It looks like udev has a long-standing netlink misfeature, where
it does not initialize NETLINK_CB....


From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
From: Eric W. Biederman <ebiederm@xmission.com>
Date: Mon, 8 Mar 2010 09:25:20 -0800
Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 lib/kobject_uevent.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Comments

Daniel Lezcano March 8, 2010, 7:57 p.m. UTC | #1
Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> I have take an snapshot of my development tree and placed it at.
>>>
>>>
>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>   
>>>       
>> Hi Eric,
>>
>> thanks for the pointer.
>>
>> I tried to boot the kernel under qemu and I got this oops:
>>     
>
> I am clearly running an old userspace on my test machine.  No udev.
> It looks like udev has a long standing netlink misfeature, where
> it does not initializing NETLINK_CB....
>
>
> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
> From: Eric W. Biederman <ebiederm@xmission.com>
> Date: Mon, 8 Mar 2010 09:25:20 -0800
> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>   
Thanks.

I was able to boot but I have the following warning:

------------[ cut here ]------------
WARNING: at net/netlink/af_netlink.c:198 netlink_sock_destruct+0x72/0xac()
Hardware name:
Modules linked in: [last unloaded: scsi_wait_scan]
Pid: 840, comm: nash-hotplug Tainted: G        W  2.6.33 #2
Call Trace:
 [<ffffffff812df182>] ? netlink_sock_destruct+0x72/0xac
 [<ffffffff8102ca29>] warn_slowpath_common+0x77/0xa4
 [<ffffffff8102ca65>] warn_slowpath_null+0xf/0x11
 [<ffffffff812df182>] netlink_sock_destruct+0x72/0xac
 [<ffffffff812bb2a4>] __sk_free+0x1e/0x118
 [<ffffffff812bb40d>] sk_free+0x19/0x1b
 [<ffffffff812e0dc2>] netlink_release+0x246/0x253
 [<ffffffff812b825a>] sock_release+0x1a/0x6b
 [<ffffffff812b82cd>] sock_close+0x22/0x26
 [<ffffffff810c7823>] __fput+0x11b/0x1d7
 [<ffffffff810c78f6>] fput+0x17/0x19
 [<ffffffff810c4ae2>] filp_close+0x67/0x72
 [<ffffffff8102e75c>] put_files_struct+0x6a/0xd4
 [<ffffffff8102e80d>] exit_files+0x47/0x4f
 [<ffffffff8102fe59>] do_exit+0x1eb/0x693
 [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
 [<ffffffff81030373>] do_group_exit+0x72/0x9b
 [<ffffffff8103f37c>] get_signal_to_deliver+0x3a1/0x3c1
 [<ffffffff81001e8e>] do_notify_resume+0x8d/0x6ea
 [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
 [<ffffffff8102851e>] ? finish_task_switch+0x6a/0xb3
 [<ffffffff810284b4>] ? finish_task_switch+0x0/0xb3
 [<ffffffff813867aa>] ? retint_signal+0x11/0x87
 [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
 [<ffffffff813867df>] retint_signal+0x46/0x87
---[ end trace d4a1e4cbaa70d63d ]---


And I have a kernel panic when exiting a network namespace using a macvlan:

linux-swk0 login: BUG: unable to handle kernel paging request at ffff880035475678
IP: [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
PGD 160b063 PUD 160f063 PMD 2aa067 PTE 35475160
Oops: 0002 [#1] DEBUG_PAGEALLOC
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/net/eth0/flags
CPU 0
Pid: 10, comm: netns Tainted: G        W  2.6.33 #2 /
RIP: 0010:[<ffffffff8128dbef>]  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
RSP: 0018:ffff88003f92bc50  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880035440800 RCX: ffff880035440800
RDX: ffff880035475678 RSI: ffff88003f913710 RDI: ffff88003cde9800
RBP: ffff88003f92bc70 R08: 0000000000000004 R09: 0000000000000000
R10: 0080000000000000 R11: ffff88003f92bbf0 R12: ffff88003cde9800
R13: ffff880035440de0 R14: 0080000000000000 R15: 0000000800000000
FS:  0000000000000000(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880035475678 CR3: 000000003eb41000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process netns (pid: 10, threadinfo ffff88003f92a000, task ffff88003f913058)
Stack:
 ffffffff814328a0 ffff880035440800 ffffffff814328a0 ffff88003553a800
<0> ffff88003f92bc90 ffffffff812c9150 ffff880035440800 ffff88003f92bd00
<0> ffff88003f92bcd0 ffffffff812c9259 ffff88003f92bcd0 ffff88003f92bd00
Call Trace:
 [<ffffffff812c9150>] dev_close+0x86/0xa8
 [<ffffffff812c9259>] rollback_registered_many+0xe7/0x208
 [<ffffffff812c9390>] unregister_netdevice_many+0x16/0x62
 [<ffffffff812c952d>] default_device_exit_batch+0x9f/0xb3
 [<ffffffff812c3906>] ops_exit_list+0x4e/0x56
 [<ffffffff812c40f4>] cleanup_net+0xfe/0x1b7
 [<ffffffff81042db6>] worker_thread+0x227/0x32d
 [<ffffffff81042d60>] ? worker_thread+0x1d1/0x32d
 [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
 [<ffffffff812c3ff6>] ? cleanup_net+0x0/0x1b7
 [<ffffffff810466ae>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff81042b8f>] ? worker_thread+0x0/0x32d
 [<ffffffff810462e0>] kthread+0x7c/0x84
 [<ffffffff810035b4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8138673a>] ? restore_args+0x0/0x30
 [<ffffffff81046264>] ? kthread+0x0/0x84
 [<ffffffff810035b0>] ? kernel_thread_helper+0x0/0x10
Code: 01 00 00 02 74 0b 83 ce ff 4c 89 e7 e8 a1 8f 03 00 48 8b b3 50 02 00 00 4c 89 e7 e8 df 8e 03 00 49 8b 45 18 49 8b 55 20 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 be 00 02 20 00 00 00 ad de 49 89
RIP  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
 RSP <ffff88003f92bc50>
CR2: ffff880035475678
---[ end trace d4a1e4cbaa70d63e ]---

addr2line -e ./vmlinux ffffffff812c9150 gives net/core/dev.c:1252
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric W. Biederman March 8, 2010, 8:24 p.m. UTC | #2
Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>
>>   
>>> Eric W. Biederman wrote:
>>>     
>>>> I have take an snapshot of my development tree and placed it at.
>>>>
>>>>
>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>         
>>> Hi Eric,
>>>
>>> thanks for the pointer.
>>>
>>> I tried to boot the kernel under qemu and I got this oops:
>>>     
>>
>> I am clearly running an old userspace on my test machine.  No udev.
>> It looks like udev has a long standing netlink misfeature, where
>> it does not initializing NETLINK_CB....
>>
>>
>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>> From: Eric W. Biederman <ebiederm@xmission.com>
>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>   
> Thanks.
>
> I was able to boot but I have the following warning:

Thanks for the bug report.

For the moment you might want to drop:
af_netlink:  Allow credentials to work across namespaces.
af_netlink: Debugging in case I have missed something.

Although I am curious if you hit my debugging messages in
netlink recv.

I guess if the goal is to test my nsfd bits you can drop everything
starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
it takes to get uids, gids and pids translated when they cross
namespaces on an af_unix or an af_netlink socket.

At least in the af_netlink case it appears clear I have missed
something.

This is a warning that netlink throws when the packet accounting is messed
up.  So it sounds like you are exercising another path that I failed
to exercise and fix.
> ------------[ cut here ]------------
> WARNING: at net/netlink/af_netlink.c:198 netlink_sock_destruct+0x72/0xac()
> Hardware name:
> Modules linked in: [last unloaded: scsi_wait_scan]
> Pid: 840, comm: nash-hotplug Tainted: G        W  2.6.33 #2
> Call Trace:
> [<ffffffff812df182>] ? netlink_sock_destruct+0x72/0xac
> [<ffffffff8102ca29>] warn_slowpath_common+0x77/0xa4
> [<ffffffff8102ca65>] warn_slowpath_null+0xf/0x11
> [<ffffffff812df182>] netlink_sock_destruct+0x72/0xac
> [<ffffffff812bb2a4>] __sk_free+0x1e/0x118
> [<ffffffff812bb40d>] sk_free+0x19/0x1b
> [<ffffffff812e0dc2>] netlink_release+0x246/0x253
> [<ffffffff812b825a>] sock_release+0x1a/0x6b
> [<ffffffff812b82cd>] sock_close+0x22/0x26
> [<ffffffff810c7823>] __fput+0x11b/0x1d7
> [<ffffffff810c78f6>] fput+0x17/0x19
> [<ffffffff810c4ae2>] filp_close+0x67/0x72
> [<ffffffff8102e75c>] put_files_struct+0x6a/0xd4
> [<ffffffff8102e80d>] exit_files+0x47/0x4f
> [<ffffffff8102fe59>] do_exit+0x1eb/0x693
> [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
> [<ffffffff81030373>] do_group_exit+0x72/0x9b
> [<ffffffff8103f37c>] get_signal_to_deliver+0x3a1/0x3c1
> [<ffffffff81001e8e>] do_notify_resume+0x8d/0x6ea
> [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
> [<ffffffff8102851e>] ? finish_task_switch+0x6a/0xb3
> [<ffffffff810284b4>] ? finish_task_switch+0x0/0xb3
> [<ffffffff813867aa>] ? retint_signal+0x11/0x87
> [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
> [<ffffffff813867df>] retint_signal+0x46/0x87
> ---[ end trace d4a1e4cbaa70d63d ]---
>
>
> And I have a kernel panic when exiting a network namespace using a macvlan:

I wonder/hope this is simply the result of corruption from earlier problems.
I haven't touched anything that should affect the macvlan driver in 2.6.33.

> linux-swk0 login: BUG: unable to handle kernel paging request at
> ffff880035475678
> IP: [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> PGD 160b063 PUD 160f063 PMD 2aa067 PTE 35475160
> Oops: 0002 [#1] DEBUG_PAGEALLOC
> last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/net/eth0/flags
> CPU 0
> Pid: 10, comm: netns Tainted: G        W  2.6.33 #2 /
> RIP: 0010:[<ffffffff8128dbef>]  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> RSP: 0018:ffff88003f92bc50  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff880035440800 RCX: ffff880035440800
> RDX: ffff880035475678 RSI: ffff88003f913710 RDI: ffff88003cde9800
> RBP: ffff88003f92bc70 R08: 0000000000000004 R09: 0000000000000000
> R10: 0080000000000000 R11: ffff88003f92bbf0 R12: ffff88003cde9800
> R13: ffff880035440de0 R14: 0080000000000000 R15: 0000000800000000
> FS:  0000000000000000(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff880035475678 CR3: 000000003eb41000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process netns (pid: 10, threadinfo ffff88003f92a000, task ffff88003f913058)
> Stack:
> ffffffff814328a0 ffff880035440800 ffffffff814328a0 ffff88003553a800
> <0> ffff88003f92bc90 ffffffff812c9150 ffff880035440800 ffff88003f92bd00
> <0> ffff88003f92bcd0 ffffffff812c9259 ffff88003f92bcd0 ffff88003f92bd00
> Call Trace:
> [<ffffffff812c9150>] dev_close+0x86/0xa8
> [<ffffffff812c9259>] rollback_registered_many+0xe7/0x208
> [<ffffffff812c9390>] unregister_netdevice_many+0x16/0x62
> [<ffffffff812c952d>] default_device_exit_batch+0x9f/0xb3
> [<ffffffff812c3906>] ops_exit_list+0x4e/0x56
> [<ffffffff812c40f4>] cleanup_net+0xfe/0x1b7
> [<ffffffff81042db6>] worker_thread+0x227/0x32d
> [<ffffffff81042d60>] ? worker_thread+0x1d1/0x32d
> [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
> [<ffffffff812c3ff6>] ? cleanup_net+0x0/0x1b7
> [<ffffffff810466ae>] ? autoremove_wake_function+0x0/0x38
> [<ffffffff81042b8f>] ? worker_thread+0x0/0x32d
> [<ffffffff810462e0>] kthread+0x7c/0x84
> [<ffffffff810035b4>] kernel_thread_helper+0x4/0x10
> [<ffffffff8138673a>] ? restore_args+0x0/0x30
> [<ffffffff81046264>] ? kthread+0x0/0x84
> [<ffffffff810035b0>] ? kernel_thread_helper+0x0/0x10
> Code: 01 00 00 02 74 0b 83 ce ff 4c 89 e7 e8 a1 8f 03 00 48 8b b3 50 02 00 00 4c
> 89 e7 e8 df 8e 03 00 49 8b 45 18 49 8b 55 20 48 85 c0 <48> 89 02 74 04 48 89 50
> 08 48 be 00 02 20 00 00 00 ad de 49 89
> RIP  [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
> RSP <ffff88003f92bc50>
> CR2: ffff880035475678
> ---[ end trace d4a1e4cbaa70d63e ]---
>
> addr2line -e ./vmlinux ffffffff812c9150 gives net/core/dev.c:1252

Eric
Daniel Lezcano March 8, 2010, 8:42 p.m. UTC | #3
Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>>
>>>   
>>>       
>>>> Eric W. Biederman wrote:
>>>>     
>>>>         
>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>
>>>>>
>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>         
>>>>>           
>>>> Hi Eric,
>>>>
>>>> thanks for the pointer.
>>>>
>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>     
>>>>         
>>> I am clearly running an old userspace on my test machine.  No udev.
>>> It looks like udev has a long standing netlink misfeature, where
>>> it does not initializing NETLINK_CB....
>>>
>>>
>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>> From: Eric W. Biederman <ebiederm@xmission.com>
>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>
>>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>>   
>>>       
>> Thanks.
>>
>> I was able to boot but I have the following warning:
>>     
>
> Thanks for the bug report.
>   
Thanks to you for the patchset :)

> For the moment you might want to drop:
> af_netlink:  Allow credentials to work across namespaces.
> af_netlink: Debugging in case I have missed something.
>
> Although I am curious if you hit my debugging messages in
> netlink recv.
>   
No, it does not appear (looked for "missing NETLINK_CB proto").

> I guess if the goal is to test my nsfd bits you can drop everything
> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
> it takes to get get uids, gid and pids translated when the cross
> namespaces on an af_unix of an af_netlink socket.
>
> At least in the af_netlink case it appears clear I am have missed
> something.
>
> This is a warning that netlink throws when the packet accounting messed
> up.  So it sounds like you are exercising another path that I failed
> to exercise and fix.
>   
I will keep looking to see if I can find more clues about this warning.

In the meantime, I was able to enter the container with the following ugly
program:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/param.h>

#define __NR_setns 300

int setns(int nstype, int fd)
{
    return syscall (__NR_setns, nstype, fd);
}

int main(int argc, char *argv[])
{
    char path[MAXPATHLEN];
    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
    const int size = sizeof(ns) / sizeof(char *);
    int fd[size];
    int i;

    if (argc != 3) {
        fprintf(stderr, "mynsenter <pid> <command>\n");
        exit(1);
    }

    for (i = 0; i < size; i++) {
       
        sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);

        fd[i] = open(path, O_RDONLY);
        if (fd[i] < 0) {
            perror("open");
            return -1;
        }

    }

    for (i = 0; i < size; i++) {

        if (setns(0, fd[i])) {
            perror("setns");
            return -1;
        }
    }

    execve(argv[2], &argv[2], NULL);
    perror("execve");

    return 0;
}

At first glance, no problem :)
Eric W. Biederman March 8, 2010, 8:47 p.m. UTC | #4
Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>
>>   
>>> Eric W. Biederman wrote:
>>>     
>>>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>>>
>>>>         
>>>>> Eric W. Biederman wrote:
>>>>>             
>>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>>
>>>>>>
>>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>>                   
>>>>> Hi Eric,
>>>>>
>>>>> thanks for the pointer.
>>>>>
>>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>>             
>>>> I am clearly running an old userspace on my test machine.  No udev.
>>>> It looks like udev has a long standing netlink misfeature, where
>>>> it does not initializing NETLINK_CB....
>>>>
>>>>
>>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>>> From: Eric W. Biederman <ebiederm@xmission.com>
>>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>>
>>>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>>>         
>>> Thanks.
>>>
>>> I was able to boot but I have the following warning:
>>>     
>>
>> Thanks for the bug report.
>>   
> Thanks to you for the patchset :)
>
>> For the moment you might want to drop:
>> af_netlink:  Allow credentials to work across namespaces.
>> af_netlink: Debugging in case I have missed something.
>>
>> Although I am curious if you hit my debugging messages in
>> netlink recv.
>>   
> No, it does not appear (looked for "missing NETLINK_CB proto").
>
>> I guess if the goal is to test my nsfd bits you can drop everything
>> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
>> it takes to get get uids, gid and pids translated when the cross
>> namespaces on an af_unix of an af_netlink socket.
>>
>> At least in the af_netlink case it appears clear I am have missed
>> something.
>>
>> This is a warning that netlink throws when the packet accounting messed
>> up.  So it sounds like you are exercising another path that I failed
>> to exercise and fix.
>>   
> I will look forward if I find more clues for this warning.
>
> In the meantime  was able to enter the container with the ugly following
> program:
>
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <syscall.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/param.h>
>
> #define __NR_setns 300
>
> int setns(int nstype, int fd)
> {
>    return syscall (__NR_setns, nstype, fd);
> }
>
> int main(int argc, char *argv[])
> {
>    char path[MAXPATHLEN];
>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>    const int size = sizeof(ns) / sizeof(char *);
>    int fd[size];
>    int i;
>
>    if (argc != 3) {
>        fprintf(stderr, "mynsenter <pid> <command>\n");
>        exit(1);
>    }
>
>    for (i = 0; i < size; i++) {
>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>
>        fd[i] = open(path, O_RDONLY);
>        if (fd[i] < 0) {
>            perror("open");
>            return -1;
>        }
>
>    }
>
>    for (i = 0; i < size; i++) {
>
>        if (setns(0, fd[i])) {
>            perror("setns");
>            return -1;
>        }
>    }
>
>    execve(argv[2], &argv[2], NULL);
>    perror("execve");
>
>    return 0;
> }
>
> At the fist glance, no problem :)

No fork(), so your process is completely in the pid namespace?

Eric

Daniel Lezcano March 8, 2010, 9:12 p.m. UTC | #5
Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>
>   
>> Eric W. Biederman wrote:
>>     
>>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>>
>>>   
>>>       
>>>> Eric W. Biederman wrote:
>>>>     
>>>>         
>>>>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>>>>
>>>>>         
>>>>>           
>>>>>> Eric W. Biederman wrote:
>>>>>>             
>>>>>>             
>>>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>>>
>>>>>>>
>>>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>>>                   
>>>>>>>               
>>>>>> Hi Eric,
>>>>>>
>>>>>> thanks for the pointer.
>>>>>>
>>>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>>>             
>>>>>>             
>>>>> I am clearly running an old userspace on my test machine.  No udev.
>>>>> It looks like udev has a long standing netlink misfeature, where
>>>>> it does not initializing NETLINK_CB....
>>>>>
>>>>>
>>>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>>>> From: Eric W. Biederman <ebiederm@xmission.com>
>>>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>>>
>>>>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>>>>         
>>>>>           
>>>> Thanks.
>>>>
>>>> I was able to boot but I have the following warning:
>>>>     
>>>>         
>>> Thanks for the bug report.
>>>   
>>>       
>> Thanks to you for the patchset :)
>>
>>     
>>> For the moment you might want to drop:
>>> af_netlink:  Allow credentials to work across namespaces.
>>> af_netlink: Debugging in case I have missed something.
>>>
>>> Although I am curious if you hit my debugging messages in
>>> netlink recv.
>>>   
>>>       
>> No, it does not appear (looked for "missing NETLINK_CB proto").
>>
>>     
>>> I guess if the goal is to test my nsfd bits you can drop everything
>>> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
>>> it takes to get get uids, gid and pids translated when the cross
>>> namespaces on an af_unix of an af_netlink socket.
>>>
>>> At least in the af_netlink case it appears clear I am have missed
>>> something.
>>>
>>> This is a warning that netlink throws when the packet accounting messed
>>> up.  So it sounds like you are exercising another path that I failed
>>> to exercise and fix.
>>>   
>>>       
>> I will look forward if I find more clues for this warning.
>>
>> In the meantime  was able to enter the container with the ugly following
>> program:
>>
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <syscall.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sys/param.h>
>>
>> #define __NR_setns 300
>>
>> int setns(int nstype, int fd)
>> {
>>    return syscall (__NR_setns, nstype, fd);
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>    char path[MAXPATHLEN];
>>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>>    const int size = sizeof(ns) / sizeof(char *);
>>    int fd[size];
>>    int i;
>>
>>    if (argc != 3) {
>>        fprintf(stderr, "mynsenter <pid> <command>\n");
>>        exit(1);
>>    }
>>
>>    for (i = 0; i < size; i++) {
>>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>>
>>        fd[i] = open(path, O_RDONLY);
>>        if (fd[i] < 0) {
>>            perror("open");
>>            return -1;
>>        }
>>
>>    }
>>
>>    for (i = 0; i < size; i++) {
>>
>>        if (setns(0, fd[i])) {
>>            perror("setns");
>>            return -1;
>>        }
>>    }
>>
>>    execve(argv[2], &argv[2], NULL);
>>    perror("execve");
>>
>>    return 0;
>> }
>>
>> At the fist glance, no problem :)
>>     
>
> No fork() so your processes is completely in the pid namespace?
>   
What I do is attach "/bin/sh" to the container with this program.
The container is a VPS running busybox with full isolation.

echo $$ gives the real pid.
All the forked processes appear in the pid namespace; they are visible
through /proc with their virtual pids.
I am not able to change to the /proc/self directory (I assume this is
normal).


Eric W. Biederman March 8, 2010, 9:25 p.m. UTC | #6
Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>
>>   
>>> Eric W. Biederman wrote:
>>>     
>>>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>>>
>>>>         
>>>>> Eric W. Biederman wrote:
>>>>>             
>>>>>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>>>>>>
>>>>>>                   
>>>>>>> Eric W. Biederman wrote:
>>>>>>>                         
>>>>>>>> I have take an snapshot of my development tree and placed it at.
>>>>>>>>
>>>>>>>>
>>>>>>>> git://git.kernel.org/pub/scm/linux/people/ebiederm/linux-2.6.33-nsfd-v5.git
>>>>>>>>                                 
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> thanks for the pointer.
>>>>>>>
>>>>>>> I tried to boot the kernel under qemu and I got this oops:
>>>>>>>                         
>>>>>> I am clearly running an old userspace on my test machine.  No udev.
>>>>>> It looks like udev has a long standing netlink misfeature, where
>>>>>> it does not initializing NETLINK_CB....
>>>>>>
>>>>>>
>>>>>> From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
>>>>>> From: Eric W. Biederman <ebiederm@xmission.com>
>>>>>> Date: Mon, 8 Mar 2010 09:25:20 -0800
>>>>>> Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...
>>>>>>
>>>>>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>>>>>                   
>>>>> Thanks.
>>>>>
>>>>> I was able to boot but I have the following warning:
>>>>>             
>>>> Thanks for the bug report.
>>>>         
>>> Thanks to you for the patchset :)
>>>
>>>     
>>>> For the moment you might want to drop:
>>>> af_netlink:  Allow credentials to work across namespaces.
>>>> af_netlink: Debugging in case I have missed something.
>>>>
>>>> Although I am curious if you hit my debugging messages in
>>>> netlink recv.
>>>>         
>>> No, it does not appear (looked for "missing NETLINK_CB proto").
>>>
>>>     
>>>> I guess if the goal is to test my nsfd bits you can drop everything
>>>> starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
>>>> it takes to get get uids, gid and pids translated when the cross
>>>> namespaces on an af_unix of an af_netlink socket.
>>>>
>>>> At least in the af_netlink case it appears clear I am have missed
>>>> something.
>>>>
>>>> This is a warning that netlink throws when the packet accounting messed
>>>> up.  So it sounds like you are exercising another path that I failed
>>>> to exercise and fix.
>>>>         
>>> I will look forward if I find more clues for this warning.
>>>
>>> In the meantime  was able to enter the container with the ugly following
>>> program:
>>>
>>> #include <unistd.h>
>>> #include <stdlib.h>
>>> #include <stdio.h>
>>> #include <syscall.h>
>>> #include <sys/types.h>
>>> #include <sys/stat.h>
>>> #include <fcntl.h>
>>> #include <sys/param.h>
>>>
>>> #define __NR_setns 300
>>>
>>> int setns(int nstype, int fd)
>>> {
>>>    return syscall (__NR_setns, nstype, fd);
>>> }
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>    char path[MAXPATHLEN];
>>>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>>>    const int size = sizeof(ns) / sizeof(char *);
>>>    int fd[size];
>>>    int i;
>>>
>>>    if (argc != 3) {
>>>        fprintf(stderr, "mynsenter <pid> <command>\n");
>>>        exit(1);
>>>    }
>>>
>>>    for (i = 0; i < size; i++) {
>>>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>>>
>>>        fd[i] = open(path, O_RDONLY);
>>>        if (fd[i] < 0) {
>>>            perror("open");
>>>            return -1;
>>>        }
>>>
>>>    }
>>>
>>>    for (i = 0; i < size; i++) {
>>>
>>>        if (setns(0, fd[i])) {
>>>            perror("setns");
>>>            return -1;
>>>        }
>>>    }
>>>
>>>    execve(argv[2], &argv[2], NULL);
>>>    perror("execve");
>>>
>>>    return 0;
>>> }
>>>
>>> At the fist glance, no problem :)
>>>     
>>
>> No fork() so your processes is completely in the pid namespace?
>>   
> What I do is to attach "/bin/sh" to the container with this program.
> The container is a VPS running busybox with the full isolation.
>
> echo $$ gives the real pid.
> All the forked processes appears in the pid namespace, they are visible through
> /proc with the virtual pid.
> I am not able to change to the /proc/self directory (I assume this is normal).

I guess what I meant is that I was expecting:
child = fork();
if (child == 0) {
	execve(...);
}
waitpid(child);

This puts /bin/sh in the container as well.
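
The fork/exec/wait pattern outlined above can be written out as a small helper. This is a minimal sketch, not the code from the patchset: the name run_and_wait() is hypothetical, and it uses execvp() (which searches PATH) rather than the execve() of the outline so that the example is easy to try.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, exec argv[0] in the child, and wait for it in the parent.
 * Returns the child's exit status, or -1 if fork/waitpid failed or the
 * child did not exit normally. */
int run_and_wait(char *const argv[])
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* only reached if the exec failed */
    }

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called after the setns() loop, a helper like this would leave the parent outside the container's pid namespace and run the command in a child that is created inside it, which is the behaviour the outline above asks for.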

I'm not certain about the /proc/self thing; I have never encountered that.
But I guess if your pid is outside of the pid namespace of that instance
of proc, /proc/self will be a broken symlink.

Eric

Serge E. Hallyn March 8, 2010, 9:49 p.m. UTC | #7
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
> I guess my meaning is I was expecting.
> child = fork();
> if (child == 0) {
> 	execve(...);
> }
> waitpid(child);
> 
> This puts /bin/sh in the container as well.
> 
> I'm not certain about the /proc/self thing I have never encountered that.
> But I guess if your pid is outside of the pid namespace of that instance
> of proc /proc/self will be a broken symlink.
> 
> Eric

Hmm, worse than a broken symlink, will it be a wrong symlink if just
the right pid is created in the container?

-serge
Eric W. Biederman March 8, 2010, 10:24 p.m. UTC | #8
"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>> I guess my meaning is I was expecting.
>> child = fork();
>> if (child == 0) {
>> 	execve(...);
>> }
>> waitpid(child);
>> 
>> This puts /bin/sh in the container as well.
>> 
>> I'm not certain about the /proc/self thing I have never encountered that.
>> But I guess if your pid is outside of the pid namespace of that instance
>> of proc /proc/self will be a broken symlink.
>> 
>> Eric
>
> Hmm, worse than a broken symlink, will it be a wrong symlink if just
> the right pid is created in the container?

It won't happen.  readlink and followlink are both based on
task_tgid_nr_ns(current, ns_of_proc), which fails if your process is
not known in that pid namespace.
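
From userspace, the behaviour described above can be observed by reading the /proc/self link and comparing it with getpid(). The helper below is a sketch only; proc_self_pid() is a hypothetical name, and the thread does not say which errno the kernel returns when the lookup fails, so the code just reports failure as -1.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the pid that /proc/self points to, or -1 if the link cannot be
 * read or its target is not a plain number. */
long proc_self_pid(void)
{
    char target[32];
    ssize_t n = readlink("/proc/self", target, sizeof(target) - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';  /* readlink() does not terminate the string */

    char *end;
    long pid = strtol(target, &end, 10);
    return (end != target && *end == '\0') ? pid : -1;
}
```

In the ordinary case the result equals getpid(). For a process that is not known in the pid namespace of the mounted /proc, the explanation above implies the read fails rather than resolving to some other process.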

Eric
Daniel Lezcano March 9, 2010, 10:03 a.m. UTC | #9
Eric W. Biederman wrote:

[ ... ]
> I guess my meaning is I was expecting.
> child = fork();
> if (child == 0) {
> 	execve(...);
> }
> waitpid(child);
>
> This puts /bin/sh in the container as well.
>   
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/param.h>

#define __NR_setns 300

int setns(int nstype, int fd)
{
    return syscall (__NR_setns, nstype, fd);
}

int main(int argc, char *argv[])
{
    char path[MAXPATHLEN];
    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
    const int size = sizeof(ns) / sizeof(char *);
    int fd[size];
    int i;
    pid_t pid;
   
    if (argc != 3) {
        fprintf(stderr, "mynsenter <pid> <command>\n");
        exit(1);
    }

    for (i = 0; i < size; i++) {
       
        sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);

        fd[i] = open(path, O_RDONLY| FD_CLOEXEC);
        if (fd[i] < 0) {
            perror("open");
            return -1;
        }

    }
   
    for (i = 0; i < size; i++)
        if (setns(0, fd[i])) {
            perror("setns");
            return -1;
        }

    pid = fork();
    if (!pid) {

        fprintf(stderr, "mypid is %d\n", syscall(__NR_getpid));

        execve(argv[2], &argv[2], NULL);
        perror("execve");

    }

    if (pid < 0) {
        perror("fork");
        return -1;
    }

    if (waitpid(&pid, NULL, 0) < 0) {
        perror("waitpid");
    }

    return 0;
}

Waitpid returns an error:

waitpid: No child processes

The pid number returned by fork is the pid from the init pid namespace 
but it seems waitpid is waiting for a pid belonging to the child pid 
namespace.

waitpid
 -> wait4
   -> find_get_pid
     -> find_vpid
       -> find_pid_ns(nr, current->nsproxy->pid_ns);

The current->nsproxy->pid_ns is that of the namespace we attached to,
so the real pid returned by fork does not exist in this pid namespace.
Maybe fork should return a pid number belonging to the pid namespace
we are currently attached to, no?




Eric W. Biederman March 9, 2010, 10:13 a.m. UTC | #10
Daniel Lezcano <daniel.lezcano@free.fr> writes:

> Eric W. Biederman wrote:
>
> [ ... ]
>> I guess my meaning is I was expecting.
>> child = fork();
>> if (child == 0) {
>> 	execve(...);
>> }
>> waitpid(child);
>>
>> This puts /bin/sh in the container as well.
>>   
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <syscall.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/param.h>
>
> #define __NR_setns 300
>
> int setns(int nstype, int fd)
> {
>    return syscall (__NR_setns, nstype, fd);
> }
>
> int main(int argc, char *argv[])
> {
>    char path[MAXPATHLEN];
>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>    const int size = sizeof(ns) / sizeof(char *);
>    int fd[size];
>    int i;
>    pid_t pid;
>    if (argc != 3) {
>        fprintf(stderr, "mynsenter <pid> <command>\n");
>        exit(1);
>    }
>
>    for (i = 0; i < size; i++) {
>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>
>        fd[i] = open(path, O_RDONLY| FD_CLOEXEC);
>        if (fd[i] < 0) {
>            perror("open");
>            return -1;
>        }
>
>    }
>    for (i = 0; i < size; i++)
>        if (setns(0, fd[i])) {
>            perror("setns");
>            return -1;
>        }
>
>    pid = fork();
>    if (!pid) {
>
>        fprintf(stderr, "mypid is %d\n", syscall(__NR_getpid));
>
>        execve(argv[2], &argv[2], NULL);
>        perror("execve");
>
>    }
>
>    if (pid < 0) {
>        perror("fork");
>        return -1;
>    }
>
>    if (waitpid(&pid, NULL, 0) < 0) {
>        perror("waitpid");
>    }
>
>    return 0;
> }

&pid ???  Isn't that a type error?

> Waitpid returns an error:
>
> waitpid: No child processes
>
> The pid number returned by fork is the pid from the init pid namespace but it
> seems waitpid is waiting for a pid belonging to the child pid namespace.
>
> waitpid
> -> wait4
>   -> find_get_pid
>     -> find_vpid
>       -> find_pid_ns(nr, current->nsproxy->pid_ns);

But it isn't.  It is:
           find_pid_ns(nr, task_active_pid_ns(current));
which is:
           find_pid_ns(nr, ns_of_pid(task_pid(current)));

That namespace is a value that doesn't change when we attach to a pid
namespace.

> The current->nsproxy->pid_ns is that of the namespace we attached to, so the
> real pid returned by fork does not exist in this pid namespace.
> Maybe fork should return a pid number belonging to the pid namespace we
> are currently attached to, no?

Do you not have my patch that changed that?

Eric

Daniel Lezcano March 9, 2010, 10:26 a.m. UTC | #11
Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>
>   
>> Eric W. Biederman wrote:
>>
>> [ ... ]
>>     
>>> I guess my meaning is I was expecting.
>>> child = fork();
>>> if (child == 0) {
>>> 	execve(...);
>>> }
>>> waitpid(child);
>>>
>>> This puts /bin/sh in the container as well.
>>>   
>>>       
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <syscall.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sys/param.h>
>>
>> #define __NR_setns 300
>>
>> int setns(int nstype, int fd)
>> {
>>    return syscall (__NR_setns, nstype, fd);
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>    char path[MAXPATHLEN];
>>    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
>>    const int size = sizeof(ns) / sizeof(char *);
>>    int fd[size];
>>    int i;
>>    pid_t pid;
>>    if (argc != 3) {
>>        fprintf(stderr, "mynsenter <pid> <command>\n");
>>        exit(1);
>>    }
>>
>>    for (i = 0; i < size; i++) {
>>            sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);
>>
>>        fd[i] = open(path, O_RDONLY| FD_CLOEXEC);
>>        if (fd[i] < 0) {
>>            perror("open");
>>            return -1;
>>        }
>>
>>    }
>>    for (i = 0; i < size; i++)
>>        if (setns(0, fd[i])) {
>>            perror("setns");
>>            return -1;
>>        }
>>
>>    pid = fork();
>>    if (!pid) {
>>
>>        fprintf(stderr, "mypid is %d\n", syscall(__NR_getpid));
>>
>>        execve(argv[2], &argv[2], NULL);
>>        perror("execve");
>>
>>    }
>>
>>    if (pid < 0) {
>>        perror("fork");
>>        return -1;
>>    }
>>
>>    if (waitpid(&pid, NULL, 0) < 0) {
>>        perror("waitpid");
>>    }
>>
>>    return 0;
>> }
>>     
>
> &pid ???  Isn't that a type error?
>   
argh ! right :)

Sorry for the noise. Works well now.
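
For reference, a minimal sketch of the fork/exec/wait part with the pid
passed to waitpid by value. The helper name run_and_wait is hypothetical,
and the sketch leaves out the setns calls entirely. It also makes the
child exit if exec fails, which is an added assumption about the desired
behaviour rather than something taken from the test program.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not from the thread): fork, exec `path` in the
 * child, and wait for it in the parent.
 * Returns the child's exit status, or -1 on error or abnormal exit.
 */
int run_and_wait(const char *path, char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return -1;
    }

    if (pid == 0) {
        execv(path, argv);
        /* Only reached if exec failed: leave the child right away. */
        perror("execv");
        _exit(127);
    }

    /* Pass the pid itself: waitpid(pid, ...), not waitpid(&pid, ...). */
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with "/bin/sh" and an argv of { "sh", "-c", "exit 3", NULL }, the
helper returns 3.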

Daniel Lezcano March 10, 2010, 9:16 p.m. UTC | #12
Eric W. Biederman wrote:
> Daniel Lezcano <daniel.lezcano@free.fr> writes:
>   

[ ... ]

> I guess my meaning is I was expecting.
> child = fork();
> if (child == 0) {
> 	execve(...);
> }
> waitpid(child);
>
> This puts /bin/sh in the container as well.
>   

Eric,

At this point I have not hit any obvious bug, and I was able to enter
the container and execute commands directly inside it.

Excellent !

Thanks
  -- Daniel


Patch

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 920a3ca..b8229cc 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -216,7 +216,7 @@  int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 
 		/* allocate message with the maximum possible size */
 		len = strlen(action_string) + strlen(devpath) + 2;
-		skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+		skb = nlmsg_new(len + env->buflen, GFP_KERNEL);
 		if (skb) {
 			char *scratch;