
bpf: fix size of copy_to_user in percpu map.

Message ID 1469752941-7140-1-git-send-email-u9012063@gmail.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

William Tu July 29, 2016, 12:42 a.m. UTC
The total size of value copy_to_user() writes to userspace should
be the (current number of cpu) * (value size), instead of
num_possible_cpus() * (value size).  Found by samples/bpf/test_maps.c,
which always copies 512 bytes to userspace, crashing the userspace
program stack.

Signed-off-by: William Tu <u9012063@gmail.com>
---
 kernel/bpf/syscall.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Alexei Starovoitov July 29, 2016, 6:47 a.m. UTC | #1
On Thu, Jul 28, 2016 at 05:42:21PM -0700, William Tu wrote:
> The total size of value copy_to_user() writes to userspace should
> be the (current number of cpu) * (value size), instead of
> num_possible_cpus() * (value size).  Found by samples/bpf/test_maps.c,
> which always copies 512 bytes to userspace, crashing the userspace
> program stack.

hmm. I'm missing something. The sample code assumes no cpu hotplug,
so sysconf(_SC_NPROCESSORS_CONF) == num_possible_cpu == num_online_cpu,
unless the crazy INIT_ALL_POSSIBLE config option is used.
Daniel Borkmann July 29, 2016, 7:54 a.m. UTC | #2
On 07/29/2016 08:47 AM, Alexei Starovoitov wrote:
> On Thu, Jul 28, 2016 at 05:42:21PM -0700, William Tu wrote:
>> The total size of value copy_to_user() writes to userspace should
>> be the (current number of cpu) * (value size), instead of
>> num_possible_cpus() * (value size).  Found by samples/bpf/test_maps.c,
>> which always copies 512 bytes to userspace, crashing the userspace
>> program stack.
>
> hmm. I'm missing something. The sample code assumes no cpu hotplug,
> so sysconf(_SC_NPROCESSORS_CONF) == num_possible_cpu == num_online_cpu,
> unless the crazy INIT_ALL_POSSIBLE config option is used.

Are you using ARM by chance? What is the count that you get in
user space and from kernel side?

http://lists.infradead.org/pipermail/linux-arm-kernel/2011-June/054177.html
William Tu July 29, 2016, 8:03 p.m. UTC | #3
Hi Daniel and Alexei,

Thanks for the reply. My apologies for the too-brief description. In short,
in my environment, running samples/bpf/test_maps always segfaults in the
percpu array/hash map operations. I think it's due to stack
corruption.

I'm not using ARM. It's x86 in a VM with 2 vcpus. Via printk() in the kernel, I got:
  num_possible_cpu == 64
  num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)

So at samples/bpf/test_maps.c, test_percpu_arraymap_sanity(),
we define:
  long values[nr_cpus]; //nr_cpus=2

  ... // create map and update map ...

  /* check that key=0 is also found and zero initialized */
  assert(bpf_lookup_elem(map_fd, &key, values) == 0 &&
            values[0] == 0 && values[nr_cpus - 1] == 0);

Here we enter the bpf syscall, which calls into the kernel's
map_lookup_elem(), where we calculate:
  value_size = round_up(map->value_size, 8) * num_possible_cpus();
  // which in my case is 8 * 64 = 512
  ...
  // then copy to user, which writes 512B to the "values[nr_cpus]" on stack
  if (copy_to_user(uvalue, value, value_size) != 0)

And I think this 512B write to userspace corrupts the userspace stack
and causes a coredump. After bpf_lookup_elem() returns, gdb shows
'values' pointing to memory address 0x0.

To fix it, I could either
1) declare the values array based on num_possible_cpus() in test_maps.c:
  long values[64];
or 2) in the kernel, copy only 8*2 = 16 bytes from kernel to user.
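
To make option 1 concrete, here is a rough userspace sketch (my own
illustrative code, not part of the patch) that sizes the buffer from the
kernel's /sys/devices/system/cpu/possible rather than sysconf(), so it
always matches what the kernel's copy_to_user() writes:

```c
#include <stdio.h>
#include <stdlib.h>

/* Count possible CPUs the way the kernel reports them, by parsing
 * /sys/devices/system/cpu/possible (e.g. "0-63" -> 64, "0" -> 1).
 * Sketch only: assumes a single contiguous range and does minimal
 * error handling. */
static int possible_cpus(void)
{
	FILE *f = fopen("/sys/devices/system/cpu/possible", "r");
	int first = 0, last = 0;

	if (!f)
		return -1;
	if (fscanf(f, "%d-%d", &first, &last) < 1)
		first = last = -1;	/* unexpected format */
	fclose(f);
	return first < 0 ? -1 : last - first + 1;
}
```

The test could then declare the lookup buffer as, say,
`long values[possible_cpus()];` so the 512-byte copy_to_user() no longer
overruns a 16-byte stack array.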

Regards,
William


On Fri, Jul 29, 2016 at 12:54 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 07/29/2016 08:47 AM, Alexei Starovoitov wrote:
>>
>> On Thu, Jul 28, 2016 at 05:42:21PM -0700, William Tu wrote:
>>>
>>> The total size of value copy_to_user() writes to userspace should
>>> be the (current number of cpu) * (value size), instead of
>>> num_possible_cpus() * (value size).  Found by samples/bpf/test_maps.c,
>>> which always copies 512 bytes to userspace, crashing the userspace
>>> program stack.
>>
>>
>> hmm. I'm missing something. The sample code assumes no cpu hotplug,
>> so sysconf(_SC_NPROCESSORS_CONF) == num_possible_cpu == num_online_cpu,
>> unless the crazy INIT_ALL_POSSIBLE config option is used.
>
>
> Are you using ARM by chance? What is the count that you get in
> user space and from kernel side?
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2011-June/054177.html
Daniel Borkmann July 30, 2016, 12:19 a.m. UTC | #4
On 07/29/2016 10:03 PM, William Tu wrote:
> Hi Daniel and Alexei,
>
> Thanks for the reply. My apology for too brief description. In short,
> in my environment, running samples/bpf/test_map always segfault under
> percpu array/hash map operations. I think it's due to stack
> corruption.
>
> I'm not using ARM. It's x86 in a VM with 2 vcpu. By printk() in kernel, I got
>    num_possible_cpu == 64
>    num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)

Ok, thanks for the data!

> So at samples/bpf/test_maps.c, test_percpu_arraymap_sanity(),
> we define:
>    long values[nr_cpus]; //nr_cpus=2
>
>    ... // create map and update map ...
>
>    /* check that key=0 is also found and zero initialized */
>    assert(bpf_lookup_elem(map_fd, &key, values) == 0 &&
>              values[0] == 0 && values[nr_cpus - 1] == 0);
>
> Here we enter the bpf syscall, calls into kernel "map_lookup_elem()"
> and we calculate:
>    value_size = round_up(map->value_size, 8) * num_possible_cpus();
>    // which in my case 8 * 64 = 512
>    ...
>    // then copy to user, which writes 512B to the "values[nr_cpus]" on stack
>    if (copy_to_user(uvalue, value, value_size) != 0)
>
> And I think this 512B write to userspace corrupts the userspace stack
> and causes a coredump. After bpf_lookup_elem() calls, gdb shows
> 'values' points to memory address 0x0.
>
> To fix it, I could either
> 1). declare values array based on num_possible_cpu in test_map.c,
>    long values[64];
> or 2) in kernel, only copying 8*2 = 16 byte from kernel to user.

But I think the patch of using num_online_cpus() would also not be correct,
in the sense that e.g. your application could alloc an array at time X
while a map lookup at time Y no longer fits those expectations due to
CPU hotplug (since apparently _SC_NPROCESSORS_CONF maps to online
CPUs in some cases). So there, too, you could potentially corrupt your
application or the memory allocator in user space, or not all of your
valid data might get copied, hmm.

> Regards,
> William
>
>
> On Fri, Jul 29, 2016 at 12:54 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 07/29/2016 08:47 AM, Alexei Starovoitov wrote:
>>>
>>> On Thu, Jul 28, 2016 at 05:42:21PM -0700, William Tu wrote:
>>>>
>>>> The total size of value copy_to_user() writes to userspace should
>>>> be the (current number of cpu) * (value size), instead of
>>>> num_possible_cpus() * (value size).  Found by samples/bpf/test_maps.c,
>>>> which always copies 512 bytes to userspace, crashing the userspace
>>>> program stack.
>>>
>>>
>>> hmm. I'm missing something. The sample code assumes no cpu hotplug,
>>> so sysconf(_SC_NPROCESSORS_CONF) == num_possible_cpu == num_online_cpu,
>>> unless the crazy INIT_ALL_POSSIBLE config option is used.
>>
>>
>> Are you using ARM by chance? What is the count that you get in
>> user space and from kernel side?
>>
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2011-June/054177.html
William Tu July 30, 2016, 5:23 a.m. UTC | #5
On Fri, Jul 29, 2016 at 5:19 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 07/29/2016 10:03 PM, William Tu wrote:
>>
>> Hi Daniel and Alexei,
>>
>> Thanks for the reply. My apology for too brief description. In short,
>> in my environment, running samples/bpf/test_map always segfault under
>> percpu array/hash map operations. I think it's due to stack
>> corruption.
>>
>> I'm not using ARM. It's x86 in a VM with 2 vcpu. By printk() in kernel, I
>> got
>>    num_possible_cpu == 64
>>    num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)
>
>
> Ok, thanks for the data!
>
>> So at samples/bpf/test_maps.c, test_percpu_arraymap_sanity(),
>> we define:
>>    long values[nr_cpus]; //nr_cpus=2
>>
>>    ... // create map and update map ...
>>
>>    /* check that key=0 is also found and zero initialized */
>>    assert(bpf_lookup_elem(map_fd, &key, values) == 0 &&
>>              values[0] == 0 && values[nr_cpus - 1] == 0);
>>
>> Here we enter the bpf syscall, calls into kernel "map_lookup_elem()"
>> and we calculate:
>>    value_size = round_up(map->value_size, 8) * num_possible_cpus();
>>    // which in my case 8 * 64 = 512
>>    ...
>>    // then copy to user, which writes 512B to the "values[nr_cpus]" on
>> stack
>>    if (copy_to_user(uvalue, value, value_size) != 0)
>>
>> And I think this 512B write to userspace corrupts the userspace stack
>> and causes a coredump. After bpf_lookup_elem() calls, gdb shows
>> 'values' points to memory address 0x0.
>>
>> To fix it, I could either
>> 1). declare values array based on num_possible_cpu in test_map.c,
>>    long values[64];
>> or 2) in kernel, only copying 8*2 = 16 byte from kernel to user.
>
>
> But I think the patch of using num_online_cpus() would also not be correct
> in the sense that f.e. your application could alloc an array at time X
> where map lookup at time Y would not fit to the expectations anymore due
> to CPU hotplugging (since apparently _SC_NPROCESSORS_CONF maps to online
> CPUs in some cases). So also there you could potentially corrupt your
> application or mem allocator in user space, or not all your valid data
> might get copied, hmm.
>
Yes, you're right. CPU hotplug could cause the same issue.

Since the percpu map adds a variable length of data passed between kernel
and userspace, I wonder if we should add a 'value_len' field in 'union
bpf_attr' so the kernel knows how much data to copy to the user?

Regards,
William
Alexei Starovoitov July 30, 2016, 5:34 a.m. UTC | #6
On Fri, Jul 29, 2016 at 10:23:06PM -0700, William Tu wrote:
> On Fri, Jul 29, 2016 at 5:19 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> > On 07/29/2016 10:03 PM, William Tu wrote:
> >>
> >> I'm not using ARM. It's x86 in a VM with 2 vcpu. By printk() in kernel, I
> >> got
> >>    num_possible_cpu == 64
> >>    num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)
...
> >> To fix it, I could either
> >> 1). declare values array based on num_possible_cpu in test_map.c,
> >>    long values[64];
> >> or 2) in kernel, only copying 8*2 = 16 byte from kernel to user.
...
> Since percpu array adds variable length of data passing between kernel
> and userspace, I wonder if we should add a 'value_len' field in 'union
> bpf_attr' so kernel knows how much data to copy to user?

I think the first step is to figure out why num_possible is 64,
since it hurts all per-cpu allocations. If it is a widespread issue,
it hurts a lot of VMs.
Hopefully it's not the case, since in my kvm setup num_possible==num_online
qemu version 2.4.0
booting with -enable-kvm -smp N
William Tu July 31, 2016, 3:25 p.m. UTC | #7
>> >>    num_possible_cpu == 64
>> >>    num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)
> ...
>> >> To fix it, I could either
>> >> 1). declare values array based on num_possible_cpu in test_map.c,
>> >>    long values[64];
>> >> or 2) in kernel, only copying 8*2 = 16 byte from kernel to user.
> ...
>> Since percpu array adds variable length of data passing between kernel
>> and userspace, I wonder if we should add a 'value_len' field in 'union
>> bpf_attr' so kernel knows how much data to copy to user?
>
> I think the first step is to figure out why num_possible is 64,
> since it hurts all per-cpu allocations. If it is a widespread issue,
> it hurts a lot of VMs.
> Hopefully it's not the case, since in my kvm setup num_possible==num_online
> qemu version 2.4.0
> booting with -enable-kvm -smp N
>
Thanks. I'm using VMware Fusion with 2 vcpus, running Fedora 23.

I tried another physical machine (Xeon E3), and indeed
num_possible == num_online. In fact, num_online shouldn't be the issue:
as long as num_possible == sysconf(_SC_NPROCESSORS_CONF), the kernel
and user are consistent about the size of the data copied.

Diving into more detail:
when calling sysconf(_SC_NPROCESSORS_CONF), strace shows that it does
  open("/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
And in my /sys/devices/system/cpu, I have cpu0 and cpu1:
  kernel_max = 63
  possible = 0-63
  present = 0-1

So sysconf simply reads these entries configured by the kernel. Looking at
the kernel code, arch/x86/configs/x86_64_defconfig sets
CONFIG_NR_CPUS=64, and later set_cpu_possible() is called from
arch/x86/kernel/smpboot.c, which parses the ACPI multiprocessor table
and configures the new value. Based on these observations, I think
different hypervisors may emulate the ACPI processor table or BIOS
differently, and thus these values differ.
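
As a quick diagnostic (again an illustrative sketch of mine, not
something from the patch), one can compare the two views
programmatically; whether they agree depends on the glibc version, the
hypervisor, and the ACPI tables:

```c
#include <stdio.h>
#include <unistd.h>

/* Return 1 if glibc's configured-CPU count differs from the kernel's
 * possible-CPU count, 0 if they agree, -1 on error. Sketch: assumes
 * /sys/devices/system/cpu/possible holds a single "first-last" range
 * (or a lone "first"). */
static int cpu_view_mismatch(void)
{
	FILE *f = fopen("/sys/devices/system/cpu/possible", "r");
	int first = 0, last = 0;

	if (!f)
		return -1;
	if (fscanf(f, "%d-%d", &first, &last) < 1) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return (last - first + 1) != (int)sysconf(_SC_NPROCESSORS_CONF);
}
```

On the Fusion VM above this would return 1 (64 possible vs 2
configured); on the Xeon E3 box, 0.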

Regards,
William
William Tu Aug. 12, 2016, 4:58 p.m. UTC | #8
Hi,

I've tested on ESXi version 5.5 and it seems OK.
- VM1: Ubuntu 14.04, kernel 3.19 ---> OK, 3 cpu dirs, possible = 0-2
- VM2: CentOS 7, kernel 3.10 ---> OK, 8 cpu dirs, possible = 0-7

I tried another MacBook with Fusion and the same issue happens: the
cpu[0-9] dirs do not match /sys/devices/system/cpu/possible.

Regards,
William

On Mon, Aug 1, 2016 at 9:30 AM, William Tu <u9012063@gmail.com> wrote:
>>> And in my /sys/devices/system/cpu, I have cpu0 and cpu1,
>>> kernel_max = 63
>>> possible = 0-63
>>> present = 0-1
>>
>> glibc is doing
>> ls -d /sys/devices/system/cpu/cpu*
>> http://osxr.org:8080/glibc/source/sysdeps/unix/sysv/linux/getsysstats.c?v=glibc-2.14#0180
>> And /sys/devices/system/cpu/possible shows 0-63 while only two dirs 'cpu0' and 'cpu1'
>> are there?!
>
> yes, that's right.
> I only have cpu0 and cpu1 directories, while 'possible' shows 0-63.
>
>> If my understanding of cpu_dev_register_generic() in drivers/base/cpu.c
>> is correct the number of 'cpu*' dirs should be equal to possible_cpu.
>> Could you please debug why is that the case, because then it's probably
>> a bug on the kernel side.
>
> Thanks, I will do it.
>
>> I think it's correct for glibc to rely on the number of 'cpu*' dirs.
>> Did you boot with possible_cpus=64 command line arg by any chance?
>>
> No.
>
>>> So sysconf simply reads these entries configured by kernel. Looking at
>>> kernel code, "arch/x86/configs/x86_64_defconfig" sets
>>> CONFIG_NR_CPUS=64, and later on set_cpu_possible() is called at
>>> arch/x86/kernel/smpboot.c, which parses the ACPI multiprocessor table
>>> and configured new value. Based on these observations, I think
>>> different hypervisor may have different ways of emulating ACPI
>>> processor table or BIOS implementation thus these values differ.
>>
>> What behavior do you see in ESX ?
>> btw, rhel7 ships with nr_cpus=5120 and ubuntu default is 256,
>> so this lack of acpi in vmware fusion will lead to possible_cpu=5120,
>> a lot of pain in per-cpu allocator and linux VMs will not be happy.
>> I think vmware has to be fixed first regardless of what we find
>> out about 'cpu*' vs /sys/devices/system/cpu/possible
>>
>
> Thanks, I will debug it and update when I have more info.
> Regards,
> William
Alexei Starovoitov Aug. 12, 2016, 11:08 p.m. UTC | #9
On Fri, Aug 12, 2016 at 09:58:51AM -0700, William Tu wrote:
> Hi,
> 
> I've tested on ESXi version 5.5 and it seems OK.
> - VM1: Ubuntu 14.04, kernel 3.19 ---> OK 3 cpu dirs, possible = 0-2
> - VM2: Centos7, kernel 3.10 ---> OK 8 cpu dirs, possible = 0-7
> 
> I tried another MacBook with Fusion, same issue happens, the cpu[0-9]
> dirs are not equal to /sys/devices/system/cpu/possible

great. thanks for testing. I think the issue is closed and
hopefully you can follow up with fusion guys ;)
William Tu Aug. 19, 2016, 1:38 a.m. UTC | #10
Hi Alexei and Daniel,

I got feedback from the Fusion bios/chipset team. In short, the value
'possible' includes empty CPU sockets. To verify, I tested on a
physical Xeon machine with 2 CPU sockets, one of them empty. I got
'possible' = 0-239 while the number of 'cpu*' dirs = 12. As a result,
extra bytes are copied from kernel to userspace.

As for Fusion, there is a configuration option in *.vmx. If we disable
CPU hot plug with "vcpu.hotadd=FALSE", then 'possible' == the number of
'cpu*' dirs, which is the configuration in ESX. If "vcpu.hotadd=TRUE",
then 'possible' is larger than the number of 'cpu*' dirs, allowing users
to add more vcpus.

Regards,
William


On Fri, Aug 12, 2016 at 4:08 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Aug 12, 2016 at 09:58:51AM -0700, William Tu wrote:
>> Hi,
>>
>> I've tested on ESXi version 5.5 and it seems OK.
>> - VM1: Ubuntu 14.04, kernel 3.19 ---> OK 3 cpu dirs, possible = 0-2
>> - VM2: Centos7, kernel 3.10 ---> OK 8 cpu dirs, possible = 0-7
>>
>> I tried another MacBook with Fusion, same issue happens, the cpu[0-9]
>> dirs are not equal to /sys/devices/system/cpu/possible
>
> great. thanks for testing. I think the issue is closed and
> hopefully you can follow up with fusion guys ;)
>
William Tu Aug. 24, 2016, 11:46 p.m. UTC | #11
Hi,

I've tested on kvm and encountered a similar issue. If I boot up the VM
with CPU hotplug enabled like below:
./qemu-system-x86_64 -smp 2,maxcpus=4 <other boot parameters>

then /sys/devices/system/cpu/possible does not equal the
number of cpu* dirs in /sys/devices/system/cpu/, which will crash
the percpu map testcase. Should we consider fixing it, or as a simple
workaround, disable CONFIG_HOTPLUG_CPU?

Thanks
William

On Thu, Aug 18, 2016 at 6:38 PM, William Tu <u9012063@gmail.com> wrote:
> Hi Alexei and Daniel,
>
> I got feedback from Fusion bios/chipset team. In short, the value
> 'possible' includes empty CPU socket. To verify, I tested on a
> physical Xeon machine with 2 CPU sockets, one of them is empty. I got
> 'possible' = 0-239, the number of 'cpu*' =12. As a result, extra bytes
> are copied from kernel to userspace.
>
> As for Fusion, there is a configuration in *.vmx. If we disable cpu
> hot plug by "vcpu.hotadd=FALSE", then 'possible' ==  the number of
> 'cpu*' dirs, which is the configuration in ESX. If "vcpu.hotadd=TRUE",
> then 'possible' is larger than 'cpu*' dirs, allowing users to add more
> vcpus.
>
> Regards,
> William
>
>
> On Fri, Aug 12, 2016 at 4:08 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Fri, Aug 12, 2016 at 09:58:51AM -0700, William Tu wrote:
>>> Hi,
>>>
>>> I've tested on ESXi version 5.5 and it seems OK.
>>> - VM1: Ubuntu 14.04, kernel 3.19 ---> OK 3 cpu dirs, possible = 0-2
>>> - VM2: Centos7, kernel 3.10 ---> OK 8 cpu dirs, possible = 0-7
>>>
>>> I tried another MacBook with Fusion, same issue happens, the cpu[0-9]
>>> dirs are not equal to /sys/devices/system/cpu/possible
>>
>> great. thanks for testing. I think the issue is closed and
>> hopefully you can follow up with fusion guys ;)
>>

Patch

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..47f738e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -324,7 +324,8 @@  static int map_lookup_elem(union bpf_attr *attr)
 		goto free_value;
 
 	err = -EFAULT;
-	if (copy_to_user(uvalue, value, value_size) != 0)
+	if (copy_to_user(uvalue, value,
+		map->value_size * num_online_cpus()) != 0)
 		goto free_value;
 
 	err = 0;