diff mbox series

[SRU,Focal,1/1] bpf: Fix toctou on read-only map's constant scalar tracking

Message ID 20230807132123.14731-2-jacob.martin@canonical.com
State New
Headers show
Series CVE-2021-4001 | expand

Commit Message

Jacob Martin Aug. 7, 2023, 1:21 p.m. UTC
From: Daniel Borkmann <daniel@iogearbox.net>

Commit a23740ec43ba ("bpf: Track contents of read-only maps as scalars") is
checking whether maps are read-only both from BPF program side and user space
side, and then, given their content is constant, reading out their data via
map->ops->map_direct_value_addr() which is then subsequently used as known
scalar value for the register, that is, it is marked as __mark_reg_known()
with the read value at verification time. Before a23740ec43ba, the register
content was marked as an unknown scalar so the verifier could not make any
assumptions about the map content.

The current implementation however is prone to a TOCTOU race, meaning, the
value read as known scalar for the register is not guaranteed to be exactly
the same at a later point when the program is executed, and as such, the
prior made assumptions of the verifier with regards to the program will be
invalid which can cause issues such as OOB access, etc.

While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
specified at map creation time, the map->frozen property is initially set to
false for the map given the map value needs to be populated, e.g. for global
data sections. Once complete, the loader "freezes" the map from user space
such that no subsequent updates/deletes are possible anymore. For the rest
of the lifetime of the map, this freeze one-time trigger cannot be undone
anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
cmd calls which would update/delete map entries will be rejected with -EPERM
since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
means that pending update/delete map entries must still complete before this
guarantee is given. This corner case is not an issue for loaders since they
create and prepare such program private map in successive steps.

However, a malicious user is able to trigger this TOCTOU race in two different
ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
used to expand the competition interval, so that map_update_elem() can modify
the contents of the map after map_freeze() and bpf_prog_load() were executed.
This works, because userfaultfd halts the parallel thread which triggered a
map_update_elem() at the time where we copy key/value from the user buffer and
this already passed the FMODE_CAN_WRITE capability test given at that time the
map was not "frozen". Then, the main thread performs the map_freeze() and
bpf_prog_load(), and once that had completed successfully, the other thread
is woken up to complete the pending map_update_elem() which then changes the
map content. For ii) the idea of the batched update is similar, meaning, when
there are a large number of updates to be processed, it can increase the
competition interval between the two. It is therefore possible in practice to
modify the contents of the map after executing map_freeze() and bpf_prog_load().

One way to fix both i) and ii) at the same time is to expand the use of the
map's map->writecnt. The latter was introduced in fc9702273e2e ("bpf: Add mmap()
support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2e2 ("bpf:
Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
the rationale to make a writable mmap()'ing of a map mutually exclusive with
read-only freezing. The counter indicates writable mmap() mappings and then
prevents/fails the freeze operation. Its semantics can be expanded beyond
just mmap() by generally indicating ongoing write phases. This would essentially
span any parallel regular and batched flavor of update/delete operation and
then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
last pending writes have completed via bpf_map_write_active() test. Once the
map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
only then we are really guaranteed to use the map's data as known constants.
For map->frozen being set and pending writes in process of still being completed
we fall back to marking that register as unknown scalar so we don't end up
making assumptions about it. With this, both TOCTOU reproducers from i) and
ii) are fixed.

Note that the map->writecnt has been converted into a atomic64 in the fix in
order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
the freeze_mutex over entire map update/delete operations in syscall side
would not be possible due to then causing everything to be serialized.
Similarly, something like synchronize_rcu() after setting map->frozen to wait
for update/deletes to complete is not possible either since it would also
have to span the user copy which can sleep. On the libbpf side, this won't
break d66562fba1ce ("libbpf: Add BPF object skeleton support") as the
anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
mmap()-ed memory where for .rodata it's non-writable.

Fixes: a23740ec43ba ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
(backported from commit 353050be4c19e102178ccc05988101887c25ae53)
[jacobmartin: bpf map mmap and batch support are not present, so omit fixes for them]
CVE-2021-4001
Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
---
 include/linux/bpf.h   |  2 ++
 kernel/bpf/syscall.c  | 25 +++++++++++++++++++++++++
 kernel/bpf/verifier.c | 18 +++++++++++++++++-
 3 files changed, 44 insertions(+), 1 deletion(-)

Comments

Stefan Bader Aug. 10, 2023, 7:18 a.m. UTC | #1
On 07.08.23 15:21, Jacob Martin wrote:
> From: Daniel Borkmann <daniel@iogearbox.net>
> 
> Commit a23740ec43ba ("bpf: Track contents of read-only maps as scalars") is
> checking whether maps are read-only both from BPF program side and user space
> side, and then, given their content is constant, reading out their data via
> map->ops->map_direct_value_addr() which is then subsequently used as known
> scalar value for the register, that is, it is marked as __mark_reg_known()
> with the read value at verification time. Before a23740ec43ba, the register
> content was marked as an unknown scalar so the verifier could not make any
> assumptions about the map content.
> 
> The current implementation however is prone to a TOCTOU race, meaning, the
> value read as known scalar for the register is not guaranteed to be exactly
> the same at a later point when the program is executed, and as such, the
> prior made assumptions of the verifier with regards to the program will be
> invalid which can cause issues such as OOB access, etc.
> 
> While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
> specified at map creation time, the map->frozen property is initially set to
> false for the map given the map value needs to be populated, e.g. for global
> data sections. Once complete, the loader "freezes" the map from user space
> such that no subsequent updates/deletes are possible anymore. For the rest
> of the lifetime of the map, this freeze one-time trigger cannot be undone
> anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
> cmd calls which would update/delete map entries will be rejected with -EPERM
> since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
> means that pending update/delete map entries must still complete before this
> guarantee is given. This corner case is not an issue for loaders since they
> create and prepare such program private map in successive steps.
> 
> However, a malicious user is able to trigger this TOCTOU race in two different
> ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
> used to expand the competition interval, so that map_update_elem() can modify
> the contents of the map after map_freeze() and bpf_prog_load() were executed.
> This works, because userfaultfd halts the parallel thread which triggered a
> map_update_elem() at the time where we copy key/value from the user buffer and
> this already passed the FMODE_CAN_WRITE capability test given at that time the
> map was not "frozen". Then, the main thread performs the map_freeze() and
> bpf_prog_load(), and once that had completed successfully, the other thread
> is woken up to complete the pending map_update_elem() which then changes the
> map content. For ii) the idea of the batched update is similar, meaning, when
> there are a large number of updates to be processed, it can increase the
> competition interval between the two. It is therefore possible in practice to
> modify the contents of the map after executing map_freeze() and bpf_prog_load().
> 
> One way to fix both i) and ii) at the same time is to expand the use of the
> map's map->writecnt. The latter was introduced in fc9702273e2e ("bpf: Add mmap()
> support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2e2 ("bpf:
> Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
> the rationale to make a writable mmap()'ing of a map mutually exclusive with
> read-only freezing. The counter indicates writable mmap() mappings and then
> prevents/fails the freeze operation. Its semantics can be expanded beyond
> just mmap() by generally indicating ongoing write phases. This would essentially
> span any parallel regular and batched flavor of update/delete operation and
> then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
> the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
> last pending writes have completed via bpf_map_write_active() test. Once the
> map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
> only then we are really guaranteed to use the map's data as known constants.
> For map->frozen being set and pending writes in process of still being completed
> we fall back to marking that register as unknown scalar so we don't end up
> making assumptions about it. With this, both TOCTOU reproducers from i) and
> ii) are fixed.
> 
> Note that the map->writecnt has been converted into a atomic64 in the fix in
> order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
> map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
> the freeze_mutex over entire map update/delete operations in syscall side
> would not be possible due to then causing everything to be serialized.
> Similarly, something like synchronize_rcu() after setting map->frozen to wait
> for update/deletes to complete is not possible either since it would also
> have to span the user copy which can sleep. On the libbpf side, this won't
> break d66562fba1ce ("libbpf: Add BPF object skeleton support") as the
> anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
> mmap()-ed memory where for .rodata it's non-writable.
> 
> Fixes: a23740ec43ba ("bpf: Track contents of read-only maps as scalars")
> Reported-by: w1tcher.bupt@gmail.com
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

CVE-2021-4001
> (backported from commit 353050be4c19e102178ccc05988101887c25ae53)
> [jacobmartin: bpf map mmap and batch support are not present, so omit fixes for them]
> Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
> ---

Just as a personal note that I find things more readable if the new 
s-o-b block is detached and follows a <reason>, <origin>, <s-o-b> order.

-Stefan

>   include/linux/bpf.h   |  2 ++
>   kernel/bpf/syscall.c  | 25 +++++++++++++++++++++++++
>   kernel/bpf/verifier.c | 18 +++++++++++++++++-
>   3 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 5705cda3c4c4..e6ed8cbda837 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -103,6 +103,7 @@ struct bpf_map {
>   	atomic_t usercnt;
>   	struct work_struct work;
>   	char name[BPF_OBJ_NAME_LEN];
> +	atomic64_t writecnt;
>   };
>   
>   static inline bool map_value_has_spin_lock(const struct bpf_map *map)
> @@ -663,6 +664,7 @@ void bpf_map_charge_move(struct bpf_map_memory *dst,
>   			 struct bpf_map_memory *src);
>   void *bpf_map_area_alloc(u64 size, int numa_node);
>   void bpf_map_area_free(void *base);
> +bool bpf_map_write_active(const struct bpf_map *map);
>   void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
>   
>   extern int sysctl_unprivileged_bpf_disabled;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index de788761b708..56eee42da0d0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -127,6 +127,21 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
>   	return map;
>   }
>   
> +static void bpf_map_write_active_inc(struct bpf_map *map)
> +{
> +	atomic64_inc(&map->writecnt);
> +}
> +
> +static void bpf_map_write_active_dec(struct bpf_map *map)
> +{
> +	atomic64_dec(&map->writecnt);
> +}
> +
> +bool bpf_map_write_active(const struct bpf_map *map)
> +{
> +	return atomic64_read(&map->writecnt) != 0;
> +}
> +
>   void *bpf_map_area_alloc(u64 size, int numa_node)
>   {
>   	/* We really just want to fail instead of triggering OOM killer
> @@ -890,6 +905,7 @@ static int map_update_elem(union bpf_attr *attr)
>   	map = __bpf_map_get(f);
>   	if (IS_ERR(map))
>   		return PTR_ERR(map);
> +	bpf_map_write_active_inc(map);
>   	if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
>   		err = -EPERM;
>   		goto err_put;
> @@ -979,6 +995,7 @@ static int map_update_elem(union bpf_attr *attr)
>   free_key:
>   	kfree(key);
>   err_put:
> +	bpf_map_write_active_dec(map);
>   	fdput(f);
>   	return err;
>   }
> @@ -1001,6 +1018,7 @@ static int map_delete_elem(union bpf_attr *attr)
>   	map = __bpf_map_get(f);
>   	if (IS_ERR(map))
>   		return PTR_ERR(map);
> +	bpf_map_write_active_inc(map);
>   	if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
>   		err = -EPERM;
>   		goto err_put;
> @@ -1028,6 +1046,7 @@ static int map_delete_elem(union bpf_attr *attr)
>   out:
>   	kfree(key);
>   err_put:
> +	bpf_map_write_active_dec(map);
>   	fdput(f);
>   	return err;
>   }
> @@ -1119,6 +1138,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
>   	map = __bpf_map_get(f);
>   	if (IS_ERR(map))
>   		return PTR_ERR(map);
> +	bpf_map_write_active_inc(map);
>   	if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ) ||
>   	    !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
>   		err = -EPERM;
> @@ -1160,6 +1180,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
>   free_key:
>   	kfree(key);
>   err_put:
> +	bpf_map_write_active_dec(map);
>   	fdput(f);
>   	return err;
>   }
> @@ -1179,6 +1200,10 @@ static int map_freeze(const union bpf_attr *attr)
>   	map = __bpf_map_get(f);
>   	if (IS_ERR(map))
>   		return PTR_ERR(map);
> +	if (bpf_map_write_active(map)) {
> +		err = -EBUSY;
> +		goto err_put;
> +	}
>   	if (READ_ONCE(map->frozen)) {
>   		err = -EBUSY;
>   		goto err_put;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 530664693ac4..cda7907d08d9 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2773,7 +2773,23 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
>   
>   static bool bpf_map_is_rdonly(const struct bpf_map *map)
>   {
> -	return (map->map_flags & BPF_F_RDONLY_PROG) && map->frozen;
> +	/* A map is considered read-only if the following condition are true:
> +	 *
> +	 * 1) BPF program side cannot change any of the map content. The
> +	 *    BPF_F_RDONLY_PROG flag is throughout the lifetime of a map
> +	 *    and was set at map creation time.
> +	 * 2) The map value(s) have been initialized from user space by a
> +	 *    loader and then "frozen", such that no new map update/delete
> +	 *    operations from syscall side are possible for the rest of
> +	 *    the map's lifetime from that point onwards.
> +	 *    the map's lifetime from that point onwards.
> +	 * 3) Any parallel/pending map update/delete operations from syscall
> +	 *    side have been completed. Only after that point, it's safe to
> +	 *    assume that map value(s) are immutable.
> +	 */
> +	return (map->map_flags & BPF_F_RDONLY_PROG) &&
> +		READ_ONCE(map->frozen) &&
> +		!bpf_map_write_active(map);
>   }
>   
>   static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val)
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5705cda3c4c4..e6ed8cbda837 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -103,6 +103,7 @@  struct bpf_map {
 	atomic_t usercnt;
 	struct work_struct work;
 	char name[BPF_OBJ_NAME_LEN];
+	atomic64_t writecnt;
 };
 
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
@@ -663,6 +664,7 @@  void bpf_map_charge_move(struct bpf_map_memory *dst,
 			 struct bpf_map_memory *src);
 void *bpf_map_area_alloc(u64 size, int numa_node);
 void bpf_map_area_free(void *base);
+bool bpf_map_write_active(const struct bpf_map *map);
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 
 extern int sysctl_unprivileged_bpf_disabled;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index de788761b708..56eee42da0d0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -127,6 +127,21 @@  static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 	return map;
 }
 
+static void bpf_map_write_active_inc(struct bpf_map *map)
+{
+	atomic64_inc(&map->writecnt);
+}
+
+static void bpf_map_write_active_dec(struct bpf_map *map)
+{
+	atomic64_dec(&map->writecnt);
+}
+
+bool bpf_map_write_active(const struct bpf_map *map)
+{
+	return atomic64_read(&map->writecnt) != 0;
+}
+
 void *bpf_map_area_alloc(u64 size, int numa_node)
 {
 	/* We really just want to fail instead of triggering OOM killer
@@ -890,6 +905,7 @@  static int map_update_elem(union bpf_attr *attr)
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
+	bpf_map_write_active_inc(map);
 	if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
 		err = -EPERM;
 		goto err_put;
@@ -979,6 +995,7 @@  static int map_update_elem(union bpf_attr *attr)
 free_key:
 	kfree(key);
 err_put:
+	bpf_map_write_active_dec(map);
 	fdput(f);
 	return err;
 }
@@ -1001,6 +1018,7 @@  static int map_delete_elem(union bpf_attr *attr)
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
+	bpf_map_write_active_inc(map);
 	if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
 		err = -EPERM;
 		goto err_put;
@@ -1028,6 +1046,7 @@  static int map_delete_elem(union bpf_attr *attr)
 out:
 	kfree(key);
 err_put:
+	bpf_map_write_active_dec(map);
 	fdput(f);
 	return err;
 }
@@ -1119,6 +1138,7 @@  static int map_lookup_and_delete_elem(union bpf_attr *attr)
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
+	bpf_map_write_active_inc(map);
 	if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ) ||
 	    !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
 		err = -EPERM;
@@ -1160,6 +1180,7 @@  static int map_lookup_and_delete_elem(union bpf_attr *attr)
 free_key:
 	kfree(key);
 err_put:
+	bpf_map_write_active_dec(map);
 	fdput(f);
 	return err;
 }
@@ -1179,6 +1200,10 @@  static int map_freeze(const union bpf_attr *attr)
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
+	if (bpf_map_write_active(map)) {
+		err = -EBUSY;
+		goto err_put;
+	}
 	if (READ_ONCE(map->frozen)) {
 		err = -EBUSY;
 		goto err_put;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 530664693ac4..cda7907d08d9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2773,7 +2773,23 @@  static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
 
 static bool bpf_map_is_rdonly(const struct bpf_map *map)
 {
-	return (map->map_flags & BPF_F_RDONLY_PROG) && map->frozen;
+	/* A map is considered read-only if the following condition are true:
+	 *
+	 * 1) BPF program side cannot change any of the map content. The
+	 *    BPF_F_RDONLY_PROG flag is throughout the lifetime of a map
+	 *    and was set at map creation time.
+	 * 2) The map value(s) have been initialized from user space by a
+	 *    loader and then "frozen", such that no new map update/delete
+	 *    operations from syscall side are possible for the rest of
+	 *    the map's lifetime from that point onwards.
+	 *    the map's lifetime from that point onwards.
+	 * 3) Any parallel/pending map update/delete operations from syscall
+	 *    side have been completed. Only after that point, it's safe to
+	 *    assume that map value(s) are immutable.
+	 */
+	return (map->map_flags & BPF_F_RDONLY_PROG) &&
+		READ_ONCE(map->frozen) &&
+		!bpf_map_write_active(map);
 }
 
 static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val)