Message ID: 1469752941-7140-1-git-send-email-u9012063@gmail.com
State: Changes Requested, archived
Delegated to: David Miller
On Thu, Jul 28, 2016 at 05:42:21PM -0700, William Tu wrote:
> The total size of value copy_to_user() writes to userspace should
> be the (current number of cpus) * (value size), instead of
> num_possible_cpus() * (value size). Found by samples/bpf/test_maps.c,
> which always copies 512 bytes to userspace, crashing the userspace
> program stack.

hmm. I'm missing something. The sample code assumes no cpu hotplug, so
sysconf(_SC_NPROCESSORS_CONF) == num_possible_cpu == num_online_cpu,
unless the crazy INIT_ALL_POSSIBLE config option is used.
On 07/29/2016 08:47 AM, Alexei Starovoitov wrote:
> On Thu, Jul 28, 2016 at 05:42:21PM -0700, William Tu wrote:
>> The total size of value copy_to_user() writes to userspace should
>> be the (current number of cpus) * (value size), instead of
>> num_possible_cpus() * (value size). Found by samples/bpf/test_maps.c,
>> which always copies 512 bytes to userspace, crashing the userspace
>> program stack.
>
> hmm. I'm missing something. The sample code assumes no cpu hotplug, so
> sysconf(_SC_NPROCESSORS_CONF) == num_possible_cpu == num_online_cpu,
> unless the crazy INIT_ALL_POSSIBLE config option is used.

Are you using ARM by chance? What is the count that you get in
user space and from the kernel side?

http://lists.infradead.org/pipermail/linux-arm-kernel/2011-June/054177.html
Hi Daniel and Alexei,

Thanks for the reply. My apologies for the too-brief description. In short,
in my environment, running samples/bpf/test_maps always segfaults under
percpu array/hash map operations. I think it's due to stack corruption.

I'm not using ARM. It's x86 in a VM with 2 vcpus. By printk() in the kernel, I got
  num_possible_cpu == 64
  num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)

So in samples/bpf/test_maps.c, test_percpu_arraymap_sanity(), we define:

  long values[nr_cpus]; // nr_cpus == 2

  ... // create map and update map ...

  /* check that key=0 is also found and zero initialized */
  assert(bpf_lookup_elem(map_fd, &key, values) == 0 &&
         values[0] == 0 && values[nr_cpus - 1] == 0);

Here we enter the bpf syscall, which calls into the kernel's
map_lookup_elem(), where we calculate:

  value_size = round_up(map->value_size, 8) * num_possible_cpus();
  // which in my case is 8 * 64 = 512
  ...
  // then copy to user, which writes 512 bytes to "values[nr_cpus]" on the stack
  if (copy_to_user(uvalue, value, value_size) != 0)

And I think this 512-byte write to userspace corrupts the userspace stack
and causes a coredump. After the bpf_lookup_elem() call, gdb shows
'values' points to memory address 0x0.

To fix it, I could either
1) declare the values array based on num_possible_cpu in test_maps.c:
     long values[64];
or 2) in the kernel, only copy 8 * 2 = 16 bytes from kernel to user.

Regards,
William

On Fri, Jul 29, 2016 at 12:54 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 07/29/2016 08:47 AM, Alexei Starovoitov wrote:
>> On Thu, Jul 28, 2016 at 05:42:21PM -0700, William Tu wrote:
>>> The total size of value copy_to_user() writes to userspace should
>>> be the (current number of cpus) * (value size), instead of
>>> num_possible_cpus() * (value size). Found by samples/bpf/test_maps.c,
>>> which always copies 512 bytes to userspace, crashing the userspace
>>> program stack.
>>
>> hmm. I'm missing something. The sample code assumes no cpu hotplug, so
>> sysconf(_SC_NPROCESSORS_CONF) == num_possible_cpu == num_online_cpu,
>> unless the crazy INIT_ALL_POSSIBLE config option is used.
>
> Are you using ARM by chance? What is the count that you get in
> user space and from the kernel side?
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2011-June/054177.html
On 07/29/2016 10:03 PM, William Tu wrote:
> Hi Daniel and Alexei,
>
> Thanks for the reply. My apologies for the too-brief description. In short,
> in my environment, running samples/bpf/test_maps always segfaults under
> percpu array/hash map operations. I think it's due to stack corruption.
>
> I'm not using ARM. It's x86 in a VM with 2 vcpus. By printk() in the kernel, I got
>   num_possible_cpu == 64
>   num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)

Ok, thanks for the data!

> So in samples/bpf/test_maps.c, test_percpu_arraymap_sanity(), we define:
>
>   long values[nr_cpus]; // nr_cpus == 2
>
>   ... // create map and update map ...
>
>   /* check that key=0 is also found and zero initialized */
>   assert(bpf_lookup_elem(map_fd, &key, values) == 0 &&
>          values[0] == 0 && values[nr_cpus - 1] == 0);
>
> Here we enter the bpf syscall, which calls into the kernel's
> map_lookup_elem(), where we calculate:
>
>   value_size = round_up(map->value_size, 8) * num_possible_cpus();
>   // which in my case is 8 * 64 = 512
>   ...
>   // then copy to user, which writes 512 bytes to "values[nr_cpus]" on the stack
>   if (copy_to_user(uvalue, value, value_size) != 0)
>
> And I think this 512-byte write to userspace corrupts the userspace stack
> and causes a coredump. After the bpf_lookup_elem() call, gdb shows
> 'values' points to memory address 0x0.
>
> To fix it, I could either
> 1) declare the values array based on num_possible_cpu in test_maps.c:
>      long values[64];
> or 2) in the kernel, only copy 8 * 2 = 16 bytes from kernel to user.

But I think the patch of using num_online_cpus() would also not be correct,
in the sense that e.g. your application could alloc an array at time X
where a map lookup at time Y would no longer fit the expectations due to
CPU hotplugging (since apparently _SC_NPROCESSORS_CONF maps to online
CPUs in some cases). So there, too, you could potentially corrupt your
application or the memory allocator in user space, or not all your valid
data might get copied, hmm.

> Regards,
> William
>
> On Fri, Jul 29, 2016 at 12:54 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> [...]
On Fri, Jul 29, 2016 at 5:19 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 07/29/2016 10:03 PM, William Tu wrote:
>> [...]
>> To fix it, I could either
>> 1) declare the values array based on num_possible_cpu in test_maps.c:
>>      long values[64];
>> or 2) in the kernel, only copy 8 * 2 = 16 bytes from kernel to user.
>
> But I think the patch of using num_online_cpus() would also not be correct,
> in the sense that e.g. your application could alloc an array at time X
> where a map lookup at time Y would no longer fit the expectations due to
> CPU hotplugging (since apparently _SC_NPROCESSORS_CONF maps to online
> CPUs in some cases). So there, too, you could potentially corrupt your
> application or the memory allocator in user space, or not all your valid
> data might get copied, hmm.

Yes, you're right. CPU hotplugging might cause the same issue.

Since the percpu array adds a variable length of data passed between
kernel and userspace, I wonder if we should add a 'value_len' field in
'union bpf_attr' so the kernel knows how much data to copy to user?

Regards,
William
On Fri, Jul 29, 2016 at 10:23:06PM -0700, William Tu wrote:
> On Fri, Jul 29, 2016 at 5:19 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> > On 07/29/2016 10:03 PM, William Tu wrote:
> >>   num_possible_cpu == 64
> >>   num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)
...
> >> To fix it, I could either
> >> 1) declare the values array based on num_possible_cpu in test_maps.c:
> >>      long values[64];
> >> or 2) in the kernel, only copy 8 * 2 = 16 bytes from kernel to user.
...
> Since the percpu array adds a variable length of data passed between
> kernel and userspace, I wonder if we should add a 'value_len' field in
> 'union bpf_attr' so the kernel knows how much data to copy to user?

I think the first step is to figure out why num_possible is 64,
since it hurts all per-cpu allocations. If it is a widespread issue,
it hurts a lot of VMs.
Hopefully it's not the case, since in my kvm setup num_possible == num_online:
  qemu version 2.4.0
  booting with -enable-kvm -smp N
>>>>   num_possible_cpu == 64
>>>>   num_online_cpu == 2 == sysconf(_SC_NPROCESSORS_CONF)
> ...
>>>> To fix it, I could either
>>>> 1) declare the values array based on num_possible_cpu in test_maps.c:
>>>>      long values[64];
>>>> or 2) in the kernel, only copy 8 * 2 = 16 bytes from kernel to user.
> ...
>> Since the percpu array adds a variable length of data passed between
>> kernel and userspace, I wonder if we should add a 'value_len' field in
>> 'union bpf_attr' so the kernel knows how much data to copy to user?
>
> I think the first step is to figure out why num_possible is 64,
> since it hurts all per-cpu allocations. If it is a widespread issue,
> it hurts a lot of VMs.
> Hopefully it's not the case, since in my kvm setup num_possible == num_online:
>   qemu version 2.4.0
>   booting with -enable-kvm -smp N

Thanks. I'm using VMware Fusion with 2 vcpus, running Fedora 23. I tried
on another physical machine (Xeon E3); indeed num_possible == num_online.

In fact, num_online shouldn't be an issue. As long as num_possible ==
sysconf(_SC_NPROCESSORS_CONF), kernel and user are consistent about the
size of the data copied.

Diving into more details: when calling sysconf(_SC_NPROCESSORS_CONF),
strace shows that it does

  open("/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3

And in my /sys/devices/system/cpu, I have cpu0 and cpu1, with
  kernel_max = 63
  possible = 0-63
  present = 0-1

So sysconf simply reads these entries configured by the kernel. Looking
at the kernel code, arch/x86/configs/x86_64_defconfig sets
CONFIG_NR_CPUS=64, and later set_cpu_possible() is called in
arch/x86/kernel/smpboot.c, which parses the ACPI multiprocessor table
and configures the new value. Based on these observations, I think
different hypervisors may have different ways of emulating the ACPI
processor table or BIOS, and thus these values differ.

Regards,
William
Hi,

I've tested on ESXi version 5.5 and it seems OK.
- VM1: Ubuntu 14.04, kernel 3.19 ---> OK, 3 cpu dirs, possible = 0-2
- VM2: CentOS 7, kernel 3.10 ---> OK, 8 cpu dirs, possible = 0-7

I tried another MacBook with Fusion; the same issue happens, the cpu[0-9]
dirs are not equal to /sys/devices/system/cpu/possible.

Regards,
William

On Mon, Aug 1, 2016 at 9:30 AM, William Tu <u9012063@gmail.com> wrote:
>>> And in my /sys/devices/system/cpu, I have cpu0 and cpu1, with
>>>   kernel_max = 63
>>>   possible = 0-63
>>>   present = 0-1
>>
>> glibc is doing
>>   ls -d /sys/devices/system/cpu/cpu*
>> http://osxr.org:8080/glibc/source/sysdeps/unix/sysv/linux/getsysstats.c?v=glibc-2.14#0180
>> And /sys/devices/system/cpu/possible shows 0-63 while only two dirs
>> 'cpu0' and 'cpu1' are there?!
>
> yes, that's right.
> I only have cpu0 and cpu1 directories, while 'possible' shows 0-63.
>
>> If my understanding of cpu_dev_register_generic() in drivers/base/cpu.c
>> is correct, the number of 'cpu*' dirs should be equal to possible_cpu.
>> Could you please debug why that is the case, because then it's probably
>> a bug on the kernel side.
>
> Thanks, I will do it.
>
>> I think it's correct for glibc to rely on the number of 'cpu*' dirs.
>> Did you boot with the possible_cpus=64 command line arg by any chance?
>
> No.
>
>>> So sysconf simply reads these entries configured by the kernel. Looking
>>> at the kernel code, arch/x86/configs/x86_64_defconfig sets
>>> CONFIG_NR_CPUS=64, and later set_cpu_possible() is called in
>>> arch/x86/kernel/smpboot.c, which parses the ACPI multiprocessor table
>>> and configures the new value. Based on these observations, I think
>>> different hypervisors may have different ways of emulating the ACPI
>>> processor table or BIOS, and thus these values differ.
>>
>> What behavior do you see in ESX?
>> btw, rhel7 ships with nr_cpus=5120 and the ubuntu default is 256,
>> so this lack of acpi in vmware fusion will lead to possible_cpu=5120,
>> a lot of pain in the per-cpu allocator, and linux VMs will not be happy.
>> I think vmware has to be fixed first regardless of what we find
>> out about 'cpu*' vs /sys/devices/system/cpu/possible
>
> Thanks, I will debug it and update when I have more info.
> Regards,
> William
On Fri, Aug 12, 2016 at 09:58:51AM -0700, William Tu wrote:
> Hi,
>
> I've tested on ESXi version 5.5 and it seems OK.
> - VM1: Ubuntu 14.04, kernel 3.19 ---> OK, 3 cpu dirs, possible = 0-2
> - VM2: CentOS 7, kernel 3.10 ---> OK, 8 cpu dirs, possible = 0-7
>
> I tried another MacBook with Fusion; the same issue happens, the cpu[0-9]
> dirs are not equal to /sys/devices/system/cpu/possible.

great. thanks for testing. I think the issue is closed and
hopefully you can follow up with the fusion guys ;)
Hi Alexei and Daniel,

I got feedback from the Fusion bios/chipset team. In short, the value
'possible' includes empty CPU sockets. To verify, I tested on a physical
Xeon machine with 2 CPU sockets, one of them empty. I got 'possible' =
0-239 while the number of 'cpu*' dirs = 12. As a result, extra bytes are
copied from kernel to userspace.

As for Fusion, there is a configuration in *.vmx. If we disable cpu hot
plug with "vcpu.hotadd=FALSE", then 'possible' == the number of 'cpu*'
dirs, which is the configuration in ESX. If "vcpu.hotadd=TRUE", then
'possible' is larger than the number of 'cpu*' dirs, allowing users to
add more vcpus.

Regards,
William

On Fri, Aug 12, 2016 at 4:08 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Aug 12, 2016 at 09:58:51AM -0700, William Tu wrote:
>> [...]
>
> great. thanks for testing. I think the issue is closed and
> hopefully you can follow up with the fusion guys ;)
Hi,

I've tested on kvm and encountered a similar issue. If I boot up the VM
with CPU hotplug enabled like below:

  ./qemu-system-x86_64 -smp 2,maxcpus=4 <other boot parameters>

then /sys/devices/system/cpu/possible does not equal the number of cpu*
dirs in /sys/devices/system/cpu/, which will crash the percpu map
testcase. Should we consider fixing it, or as a simple workaround, should
we disable CONFIG_HOTPLUG_CPU?

Thanks
William

On Thu, Aug 18, 2016 at 6:38 PM, William Tu <u9012063@gmail.com> wrote:
> Hi Alexei and Daniel,
>
> I got feedback from the Fusion bios/chipset team. In short, the value
> 'possible' includes empty CPU sockets. To verify, I tested on a physical
> Xeon machine with 2 CPU sockets, one of them empty. I got 'possible' =
> 0-239 while the number of 'cpu*' dirs = 12. As a result, extra bytes are
> copied from kernel to userspace.
>
> As for Fusion, there is a configuration in *.vmx. If we disable cpu hot
> plug with "vcpu.hotadd=FALSE", then 'possible' == the number of 'cpu*'
> dirs, which is the configuration in ESX. If "vcpu.hotadd=TRUE", then
> 'possible' is larger than the number of 'cpu*' dirs, allowing users to
> add more vcpus.
>
> Regards,
> William
>
> On Fri, Aug 12, 2016 at 4:08 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> [...]
>> great. thanks for testing. I think the issue is closed and
>> hopefully you can follow up with the fusion guys ;)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..47f738e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -324,7 +324,8 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto free_value;
 
 	err = -EFAULT;
-	if (copy_to_user(uvalue, value, value_size) != 0)
+	if (copy_to_user(uvalue, value,
+			 map->value_size * num_online_cpus()) != 0)
 		goto free_value;
 
 	err = 0;
The total size of value copy_to_user() writes to userspace should
be the (current number of cpus) * (value size), instead of
num_possible_cpus() * (value size). Found by samples/bpf/test_maps.c,
which always copies 512 bytes to userspace, crashing the userspace
program stack.

Signed-off-by: William Tu <u9012063@gmail.com>
---
 kernel/bpf/syscall.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)