diff mbox series

[uclibc-ng-devel] fix possible overflow in pointer arithmetics strnlen()

Message ID 20250107120956.1846-1-marcus.haehnel@kernkonzept.com
State Accepted
Headers show
Series [uclibc-ng-devel] fix possible overflow in pointer arithmetics strnlen() | expand

Commit Message

Marcus Haehnel Jan. 7, 2025, 12:09 p.m. UTC
From: Frank Mehnert <frank.mehnert@kernkonzept.com>

It is undefined behavior to compare two pointers belonging to different
objects. This includes the case where the addition overflows. Clang-20
seems to follow this rule more eagerly and optimizes away the old test.

Fix the test by performing the addition on uintptr_t values rather than
on on char pointers.

See also https://github.com/llvm/llvm-project/issues/121909.

Signed-off-by: Marcus Haehnel <marcus.haehnel@kernkonzept.com>
---
 libc/string/generic/strnlen.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Kjetil Oftedal Jan. 7, 2025, 12:29 p.m. UTC | #1
Hi,

Can you point to where in the C standard this is treated as undefined behavior?

Best regards,
Kjetil Oftedal

On Tue, 7 Jan 2025 at 13:10, Marcus Haehnel
<marcus.haehnel@kernkonzept.com> wrote:
>
> From: Frank Mehnert <frank.mehnert@kernkonzept.com>
>
> It is undefined behavior to compare two pointers belonging to different
> objects. This includes the case where the addition overflows. Clang-20
> seems to follow this rule more eagerly and optimizes away the old test.
>
> Fix the test by performing the addition on uintptr_t values rather than
> on on char pointers.
>
> See also https://github.com/llvm/llvm-project/issues/121909.
>
> Signed-off-by: Marcus Haehnel <marcus.haehnel@kernkonzept.com>
> ---
>  libc/string/generic/strnlen.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/libc/string/generic/strnlen.c b/libc/string/generic/strnlen.c
> index 4d4cde84f..82d4122ec 100644
> --- a/libc/string/generic/strnlen.c
> +++ b/libc/string/generic/strnlen.c
> @@ -29,15 +29,17 @@
>     '\0' terminator is found in that many characters, return MAXLEN.  */
>  size_t strnlen (const char *str, size_t maxlen)
>  {
> -  const char *char_ptr, *end_ptr = str + maxlen;
> +  const char *char_ptr, *end_ptr;
>    const unsigned long int *longword_ptr;
>    unsigned long int longword, himagic, lomagic;
>
>    if (maxlen == 0)
>      return 0;
>
> -  if (__builtin_expect (end_ptr < str, 0))
> +  if (__builtin_expect ((uintptr_t)str + maxlen < (uintptr_t)str, 0))
>      end_ptr = (const char *) ~0UL;
> +  else
> +    end_ptr = str + maxlen;
>
>    /* Handle the first few characters by reading one character at a time.
>       Do this until CHAR_PTR is aligned on a longword boundary.  */
> --
> 2.47.1
>
> _______________________________________________
> devel mailing list -- devel@uclibc-ng.org
> To unsubscribe send an email to devel-leave@uclibc-ng.org
Frank Mehnert Jan. 7, 2025, 12:38 p.m. UTC | #2
There is a longer discussion on stack overflow:

  https://stackoverflow.com/questions/6702161/pointer-comparisons-in-c-are-they-signed-or-unsigned

See the remarks about "comparison of pointers belonging to different
objects".

See also the remark in the LLVM ticket from 'nikic' about the overflow
being the same case of undefined behavior. IMO this makes sense: If we
add a huge offset to a pointer triggering an overflow, the compiler may
indeed assume that the result does not point to the same object as the
original pointer.

Kind regards,

Frank

On Dienstag, 7. Januar 2025 13:29:33 MEZ Kjetil Oftedal wrote:
> Hi,
> 
> Can you point to where in the C standard this is treated as undefined behavior?
> 
> Best regards,
> Kjetil Oftedal
> 
> On Tue, 7 Jan 2025 at 13:10, Marcus Haehnel
> <marcus.haehnel@kernkonzept.com> wrote:
> >
> > From: Frank Mehnert <frank.mehnert@kernkonzept.com>
> >
> > It is undefined behavior to compare two pointers belonging to different
> > objects. This includes the case where the addition overflows. Clang-20
> > seems to follow this rule more eagerly and optimizes away the old test.
> >
> > Fix the test by performing the addition on uintptr_t values rather than
> > on on char pointers.
> >
> > See also https://github.com/llvm/llvm-project/issues/121909.
> >
> > Signed-off-by: Marcus Haehnel <marcus.haehnel@kernkonzept.com>
> > ---
> >  libc/string/generic/strnlen.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/libc/string/generic/strnlen.c b/libc/string/generic/strnlen.c
> > index 4d4cde84f..82d4122ec 100644
> > --- a/libc/string/generic/strnlen.c
> > +++ b/libc/string/generic/strnlen.c
> > @@ -29,15 +29,17 @@
> >     '\0' terminator is found in that many characters, return MAXLEN.  */
> >  size_t strnlen (const char *str, size_t maxlen)
> >  {
> > -  const char *char_ptr, *end_ptr = str + maxlen;
> > +  const char *char_ptr, *end_ptr;
> >    const unsigned long int *longword_ptr;
> >    unsigned long int longword, himagic, lomagic;
> >
> >    if (maxlen == 0)
> >      return 0;
> >
> > -  if (__builtin_expect (end_ptr < str, 0))
> > +  if (__builtin_expect ((uintptr_t)str + maxlen < (uintptr_t)str, 0))
> >      end_ptr = (const char *) ~0UL;
> > +  else
> > +    end_ptr = str + maxlen;
> >
> >    /* Handle the first few characters by reading one character at a time.
> >       Do this until CHAR_PTR is aligned on a longword boundary.  */
> > --
> > 2.47.1
> >
> > _______________________________________________
> > devel mailing list -- devel@uclibc-ng.org
> > To unsubscribe send an email to devel-leave@uclibc-ng.org
>
Kjetil Oftedal Jan. 7, 2025, 1 p.m. UTC | #3
Hi,

How can the compiler in this case assume that endptr and str do not
point to the same array object?

ISO C99 6.5,8  Relational operators
5)
...
"If the expression P points to an element of an array object and the
expression Q points to the last element of the same array object,
the pointer expression Q+1 compares greater than P. In all other
cases, the behavior is undefined"

How does the compiler at compile time know that endptr which is
equivalent is to str[maxlen] is not within the
range of string declared by str?
Or in the terms of the above standard, how does it know that endptr is not Q+1?


Best regards,
Kjetil Oftedal



On Tue, 7 Jan 2025 at 13:38, Frank Mehnert
<frank.mehnert@kernkonzept.com> wrote:
>
> There is a longer discussion on stack overflow:
>
>   https://stackoverflow.com/questions/6702161/pointer-comparisons-in-c-are-they-signed-or-unsigned
>
> See the remarks about "comparison of pointers belonging to different
> objects".
>
> See also the remark in the LLVM ticket from 'nikic' about the overflow
> being the same case of undefined behavior. IMO this makes sense: If we
> add a huge offset to a pointer triggering an overflow, the compiler may
> indeed assume that the result does not point to the same object as the
> original pointer.
>
> Kind regards,
>
> Frank
>
> On Dienstag, 7. Januar 2025 13:29:33 MEZ Kjetil Oftedal wrote:
> > Hi,
> >
> > Can you point to where in the C standard this is treated as undefined behavior?
> >
> > Best regards,
> > Kjetil Oftedal
> >
> > On Tue, 7 Jan 2025 at 13:10, Marcus Haehnel
> > <marcus.haehnel@kernkonzept.com> wrote:
> > >
> > > From: Frank Mehnert <frank.mehnert@kernkonzept.com>
> > >
> > > It is undefined behavior to compare two pointers belonging to different
> > > objects. This includes the case where the addition overflows. Clang-20
> > > seems to follow this rule more eagerly and optimizes away the old test.
> > >
> > > Fix the test by performing the addition on uintptr_t values rather than
> > > on on char pointers.
> > >
> > > See also https://github.com/llvm/llvm-project/issues/121909.
> > >
> > > Signed-off-by: Marcus Haehnel <marcus.haehnel@kernkonzept.com>
> > > ---
> > >  libc/string/generic/strnlen.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/libc/string/generic/strnlen.c b/libc/string/generic/strnlen.c
> > > index 4d4cde84f..82d4122ec 100644
> > > --- a/libc/string/generic/strnlen.c
> > > +++ b/libc/string/generic/strnlen.c
> > > @@ -29,15 +29,17 @@
> > >     '\0' terminator is found in that many characters, return MAXLEN.  */
> > >  size_t strnlen (const char *str, size_t maxlen)
> > >  {
> > > -  const char *char_ptr, *end_ptr = str + maxlen;
> > > +  const char *char_ptr, *end_ptr;
> > >    const unsigned long int *longword_ptr;
> > >    unsigned long int longword, himagic, lomagic;
> > >
> > >    if (maxlen == 0)
> > >      return 0;
> > >
> > > -  if (__builtin_expect (end_ptr < str, 0))
> > > +  if (__builtin_expect ((uintptr_t)str + maxlen < (uintptr_t)str, 0))
> > >      end_ptr = (const char *) ~0UL;
> > > +  else
> > > +    end_ptr = str + maxlen;
> > >
> > >    /* Handle the first few characters by reading one character at a time.
> > >       Do this until CHAR_PTR is aligned on a longword boundary.  */
> > > --
> > > 2.47.1
> > >
> > > _______________________________________________
> > > devel mailing list -- devel@uclibc-ng.org
> > > To unsubscribe send an email to devel-leave@uclibc-ng.org
> >
>
>
> --
> Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
>
> Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> Geschäftsführer: Dr.-Ing. Michael Hohmuth
>
>
> _______________________________________________
> devel mailing list -- devel@uclibc-ng.org
> To unsubscribe send an email to devel-leave@uclibc-ng.org
Frank Mehnert Jan. 7, 2025, 1:38 p.m. UTC | #4
Hi Kjetil,

On Dienstag, 7. Januar 2025 14:00:16 MEZ Kjetil Oftedal wrote:
> How can the compiler in this case assume that endptr and str do not
> point to the same array object?
> 
> ISO C99 6.5,8  Relational operators
> 5)
> ...
> "If the expression P points to an element of an array object and the
> expression Q points to the last element of the same array object,
> the pointer expression Q+1 compares greater than P. In all other
> cases, the behavior is undefined"
> 
> How does the compiler at compile time know that endptr which is
> equivalent is to str[maxlen] is not within the
> range of string declared by str?
> Or in the terms of the above standard, how does it know that endptr is not Q+1?
>
> [...]

The compiler does not need to know!

Let's consider the pointer 'x', the offset 'offs' and

  y = x + offset

with y < x (because offset is huge).

So if y < (x + offset) then y points to a different object by definition.
Hence, the compiler is allowed to remove this test.

Frank
Kjetil Oftedal Jan. 7, 2025, 1:54 p.m. UTC | #5
On Tue, 7 Jan 2025 at 14:39, Frank Mehnert
<frank.mehnert@kernkonzept.com> wrote:
>
> Hi Kjetil,
>
> On Dienstag, 7. Januar 2025 14:00:16 MEZ Kjetil Oftedal wrote:
> > How can the compiler in this case assume that endptr and str do not
> > point to the same array object?
> >
> > ISO C99 6.5,8  Relational operators
> > 5)
> > ...
> > "If the expression P points to an element of an array object and the
> > expression Q points to the last element of the same array object,
> > the pointer expression Q+1 compares greater than P. In all other
> > cases, the behavior is undefined"
> >
> > How does the compiler at compile time know that endptr which is
> > equivalent is to str[maxlen] is not within the
> > range of string declared by str?
> > Or in the terms of the above standard, how does it know that endptr is not Q+1?
> >
> > [...]
>
> The compiler does not need to know!
>
> Let's consider the pointer 'x', the offset 'offs' and
>
>   y = x + offset
>
> with y < x (because offset is huge).
>
> So if y < (x + offset) then y points to a different object by definition.
> Hence, the compiler is allowed to remove this test.
>
> Frank
> --
> Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
>
> Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> Geschäftsführer: Dr.-Ing. Michael Hohmuth
>
>

Correspondingly
 y = x + offset
with y < x (Offset is small E.g 1).

So if y < (x + offset) and offset is 1, then the Q+1 rule holds, and
the compiler is not allowed to remove this test as it is well defined.

This cannot be assumed at compile time, thus the test needs to be done
to respect the
requirements of the C standards.

Best regards,
Kjetil Oftedal
Frank Mehnert Jan. 7, 2025, 2:33 p.m. UTC | #6
On Dienstag, 7. Januar 2025 14:54:54 MEZ Kjetil Oftedal wrote:
> On Tue, 7 Jan 2025 at 14:39, Frank Mehnert
> <frank.mehnert@kernkonzept.com> wrote:
> >
> > Hi Kjetil,
> >
> > On Dienstag, 7. Januar 2025 14:00:16 MEZ Kjetil Oftedal wrote:
> > > How can the compiler in this case assume that endptr and str do not
> > > point to the same array object?
> > >
> > > ISO C99 6.5,8  Relational operators
> > > 5)
> > > ...
> > > "If the expression P points to an element of an array object and the
> > > expression Q points to the last element of the same array object,
> > > the pointer expression Q+1 compares greater than P. In all other
> > > cases, the behavior is undefined"
> > >
> > > How does the compiler at compile time know that endptr which is
> > > equivalent is to str[maxlen] is not within the
> > > range of string declared by str?
> > > Or in the terms of the above standard, how does it know that endptr is not Q+1?
> > >
> > > [...]
> >
> > The compiler does not need to know!
> >
> > Let's consider the pointer 'x', the offset 'offs' and
> >
> >   y = x + offset
> >
> > with y < x (because offset is huge).
> >
> > So if y < (x + offset) then y points to a different object by definition.
> > Hence, the compiler is allowed to remove this test.
> >
> > Frank
> > --
> > Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
> >
> > Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> > Geschäftsführer: Dr.-Ing. Michael Hohmuth
> >
> >
> 
> Correspondingly
>  y = x + offset
> with y < x (Offset is small E.g 1).
> 
> So if y < (x + offset) and offset is 1, then the Q+1 rule holds, and
> the compiler is not allowed to remove this test as it is well defined.
> 
> This cannot be assumed at compile time, thus the test needs to be done
> to respect the
> requirements of the C standards.

I think you mix things up: The error here is the pointer arithmetic, not
the comparison.

With offset=1 and x=0xffffffff'fffffff, y will 0, so y does not belong to
the same object as x. There is no other value of x which could lead to a
result y<x with offset=1!

The comparison tests the wrap around and adapts end_ptr in that case.

Again: If there was a wrap around, then the pointer arithmetic operation
was wrong (undefined behavior). Hence, we can remove the test plus the code
which is executed if the test succeeds.

Frank
Kjetil Oftedal Jan. 7, 2025, 2:55 p.m. UTC | #7
On Tue, 7 Jan 2025 at 15:33, Frank Mehnert
<frank.mehnert@kernkonzept.com> wrote:
>
> On Dienstag, 7. Januar 2025 14:54:54 MEZ Kjetil Oftedal wrote:
> > On Tue, 7 Jan 2025 at 14:39, Frank Mehnert
> > <frank.mehnert@kernkonzept.com> wrote:
> > >
> > > Hi Kjetil,
> > >
> > > On Dienstag, 7. Januar 2025 14:00:16 MEZ Kjetil Oftedal wrote:
> > > > How can the compiler in this case assume that endptr and str do not
> > > > point to the same array object?
> > > >
> > > > ISO C99 6.5,8  Relational operators
> > > > 5)
> > > > ...
> > > > "If the expression P points to an element of an array object and the
> > > > expression Q points to the last element of the same array object,
> > > > the pointer expression Q+1 compares greater than P. In all other
> > > > cases, the behavior is undefined"
> > > >
> > > > How does the compiler at compile time know that endptr which is
> > > > equivalent is to str[maxlen] is not within the
> > > > range of string declared by str?
> > > > Or in the terms of the above standard, how does it know that endptr is not Q+1?
> > > >
> > > > [...]
> > >
> > > The compiler does not need to know!
> > >
> > > Let's consider the pointer 'x', the offset 'offs' and
> > >
> > >   y = x + offset
> > >
> > > with y < x (because offset is huge).
> > >
> > > So if y < (x + offset) then y points to a different object by definition.
> > > Hence, the compiler is allowed to remove this test.
> > >
> > > Frank
> > > --
> > > Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
> > >
> > > Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> > > Geschäftsführer: Dr.-Ing. Michael Hohmuth
> > >
> > >
> >
> > Correspondingly
> >  y = x + offset
> > with y < x (Offset is small E.g 1).
> >
> > So if y < (x + offset) and offset is 1, then the Q+1 rule holds, and
> > the compiler is not allowed to remove this test as it is well defined.
> >
> > This cannot be assumed at compile time, thus the test needs to be done
> > to respect the
> > requirements of the C standards.
>
> I think you mix things up: The error here is the pointer arithmetic, not
> the comparison.
>
> With offset=1 and x=0xffffffff'fffffff, y will 0, so y does not belong to
> the same object as x. There is no other value of x which could lead to a
> result y<x with offset=1!
>
> The comparison tests the wrap around and adapts end_ptr in that case.
>
> Again: If there was a wrap around, then the pointer arithmetic operation
> was wrong (undefined behavior). Hence, we can remove the test plus the code
> which is executed if the test succeeds.
>
> Frank
> --
> Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
>
> Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> Geschäftsführer: Dr.-Ing. Michael Hohmuth
>
>


Hi,

I see the point. I just don't agree with the conclusion.
As I see it the LLVM group is claiming that if there is any possiblity for UB,
even if it is dependent on input arguments, then it is always UB, and
the compiler can
remove code as it sees fit. (Which will be a nightmare for security code)

And this is what I disagree with. If the code have defined behaviour
by the C standard,
for some input arguments, then it must treat the code as being well
behaved and omit code
as such.

If I have some special machine where I can use every single byte of
the address space..
(For a 32-bit machine that could provide the code with the ability to
declare a 4GB continuous string/char array)
And in this case should not str < end_of_str and str+X < end_of_str
always be true, even with overflow wraparounds?

Best regards,
Kjetil Oftedal
Frank Mehnert Jan. 7, 2025, 3:18 p.m. UTC | #8
On Dienstag, 7. Januar 2025 15:55:28 MEZ Kjetil Oftedal wrote:
> On Tue, 7 Jan 2025 at 15:33, Frank Mehnert
> <frank.mehnert@kernkonzept.com> wrote:
> > [...]
> >
> > With offset=1 and x=0xffffffff'fffffff, y will 0, so y does not belong to
> > the same object as x. There is no other value of x which could lead to a
> > result y<x with offset=1!
> >
> > The comparison tests the wrap around and adapts end_ptr in that case.
> >
> > Again: If there was a wrap around, then the pointer arithmetic operation
> > was wrong (undefined behavior). Hence, we can remove the test plus the code
> > which is executed if the test succeeds.
>
> [...]
> 
> I see the point. I just don't agree with the conclusion.
> As I see it the LLVM group is claiming that if there is any possiblity for UB,
> even if it is dependent on input arguments, then it is always UB, and
> the compiler can
> remove code as it sees fit. (Which will be a nightmare for security code)

I admit I thought so some time ago but in my opinion you do a wrong
conclusion:

There is a test in the code.

The result of the test can _only_ be true if the code before the test
triggered undefined behavior.

In other words, if the test succeeded, then we can be sure that the test
compared two pointers belonging to different objects -- which is undefined
behavior.

Therefore, the compiler is fine removing the test because the compiler does
not support the case where the test result is true.

If the test result is false, everything is fine, both objects may or may not
point to different objects, but the compiler assumes that they do. But: If
the test result is false, end_ptr does not need an adaption, therefore that
line can be optimized out.

TBH, I'm still convinced that the compiler is right. :-)

Frank
Kjetil Oftedal Jan. 7, 2025, 3:45 p.m. UTC | #9
On Tue, 7 Jan 2025 at 16:19, Frank Mehnert
<frank.mehnert@kernkonzept.com> wrote:
>
> On Dienstag, 7. Januar 2025 15:55:28 MEZ Kjetil Oftedal wrote:
> > On Tue, 7 Jan 2025 at 15:33, Frank Mehnert
> > <frank.mehnert@kernkonzept.com> wrote:
> > > [...]
> > >
> > > With offset=1 and x=0xffffffff'fffffff, y will 0, so y does not belong to
> > > the same object as x. There is no other value of x which could lead to a
> > > result y<x with offset=1!
> > >
> > > The comparison tests the wrap around and adapts end_ptr in that case.
> > >
> > > Again: If there was a wrap around, then the pointer arithmetic operation
> > > was wrong (undefined behavior). Hence, we can remove the test plus the code
> > > which is executed if the test succeeds.
> >
> > [...]
> >
> > I see the point. I just don't agree with the conclusion.
> > As I see it the LLVM group is claiming that if there is any possiblity for UB,
> > even if it is dependent on input arguments, then it is always UB, and
> > the compiler can
> > remove code as it sees fit. (Which will be a nightmare for security code)
>
> I admit I thought so some time ago but in my opinion you do a wrong
> conclusion:
>
> There is a test in the code.
>
> The result of the test can _only_ be true if the code before the test
> triggered undefined behavior.
>
> In other words, if the test succeeded, then we can be sure that the test
> compared two pointers belonging to different objects -- which is undefined
> behavior.
>
> Therefore, the compiler is fine removing the test because the compiler does
> not support the case where the test result is true.
>
> If the test result is false, everything is fine, both objects may or may not
> point to different objects, but the compiler assumes that they do. But: If
> the test result is false, end_ptr does not need an adaption, therefore that
> line can be optimized out.
>
> TBH, I'm still convinced that the compiler is right. :-)
>
> Frank
> --
> Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
>
> Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> Geschäftsführer: Dr.-Ing. Michael Hohmuth
>

Let us flip the variables around and review:
E.g.
if ( str < end_ptr )
instead of
if ( end_ptr < str )

Should the compiler then always evaluate the clause to true?
end_ptr might still be lower than str due to overflow in the artihmetic.

Best regards,
Kjetil Oftedal
Frank Mehnert Jan. 7, 2025, 4:09 p.m. UTC | #10
On Dienstag, 7. Januar 2025 16:45:03 MEZ Kjetil Oftedal wrote:
> On Tue, 7 Jan 2025 at 16:19, Frank Mehnert
> <frank.mehnert@kernkonzept.com> wrote:
> >
> > On Dienstag, 7. Januar 2025 15:55:28 MEZ Kjetil Oftedal wrote:
> > > On Tue, 7 Jan 2025 at 15:33, Frank Mehnert
> > > <frank.mehnert@kernkonzept.com> wrote:
> > > > [...]
> > > >
> > > > With offset=1 and x=0xffffffff'fffffff, y will 0, so y does not belong to
> > > > the same object as x. There is no other value of x which could lead to a
> > > > result y<x with offset=1!
> > > >
> > > > The comparison tests the wrap around and adapts end_ptr in that case.
> > > >
> > > > Again: If there was a wrap around, then the pointer arithmetic operation
> > > > was wrong (undefined behavior). Hence, we can remove the test plus the code
> > > > which is executed if the test succeeds.
> > >
> > > [...]
> > >
> > > I see the point. I just don't agree with the conclusion.
> > > As I see it the LLVM group is claiming that if there is any possiblity for UB,
> > > even if it is dependent on input arguments, then it is always UB, and
> > > the compiler can
> > > remove code as it sees fit. (Which will be a nightmare for security code)
> >
> > I admit I thought so some time ago but in my opinion you do a wrong
> > conclusion:
> >
> > There is a test in the code.
> >
> > The result of the test can _only_ be true if the code before the test
> > triggered undefined behavior.
> >
> > In other words, if the test succeeded, then we can be sure that the test
> > compared two pointers belonging to different objects -- which is undefined
> > behavior.
> >
> > Therefore, the compiler is fine removing the test because the compiler does
> > not support the case where the test result is true.
> >
> > If the test result is false, everything is fine, both objects may or may not
> > point to different objects, but the compiler assumes that they do. But: If
> > the test result is false, end_ptr does not need an adaption, therefore that
> > line can be optimized out.
> >
> > [...]
> 
> Let us flip the variables around and review:
> E.g.
> if ( str < end_ptr )
> instead of
> if ( end_ptr < str )
> 
> Should the compiler then always evaluate the clause to true?
> end_ptr might still be lower than str due to overflow in the artihmetic.

Not sure what you mean by "then always evaluate the clause to true".
Are you asking if the compiler may assume that (str < end_ptr) is always
true because only this is "defined behavior"?

No, the compiler cannot assume that. But this is no contradiction to what
I said before.

The compiler does the following steps for the original problem,
(end_ptr < str):

  1. Generate code for the test.
  2. The test result can either be true or false. The compiler generates
     code for both cases.
  3. No code needs to be generated for the case "false".
  4. Generate code for the case "true".
  5. Optimize the code. The compiler observes that the "false" case can
     only happen if the previous calculation triggered undefined behavior
     (str and end_ptr point to different objects).

In your case (str < end_ptr), the compiler still needs to generate code for
the test because the compiler must assume that both pointers (str and end_ptr)
belong to the same object, therefore the test is valid.

In the original case (end_ptr < str), the compiler knows for sure that both
pointers belong to different objects!

Kind regards,

Frank
Kjetil Oftedal Jan. 7, 2025, 4:38 p.m. UTC | #11
On Tue, 7 Jan 2025 at 17:09, Frank Mehnert
<frank.mehnert@kernkonzept.com> wrote:
>
> On Dienstag, 7. Januar 2025 16:45:03 MEZ Kjetil Oftedal wrote:
> > On Tue, 7 Jan 2025 at 16:19, Frank Mehnert
> > <frank.mehnert@kernkonzept.com> wrote:
> > >
> > > On Dienstag, 7. Januar 2025 15:55:28 MEZ Kjetil Oftedal wrote:
> > > > On Tue, 7 Jan 2025 at 15:33, Frank Mehnert
> > > > <frank.mehnert@kernkonzept.com> wrote:
> > > > > [...]
> > > > >
> > > > > With offset=1 and x=0xffffffff'fffffff, y will 0, so y does not belong to
> > > > > the same object as x. There is no other value of x which could lead to a
> > > > > result y<x with offset=1!
> > > > >
> > > > > The comparison tests the wrap around and adapts end_ptr in that case.
> > > > >
> > > > > Again: If there was a wrap around, then the pointer arithmetic operation
> > > > > was wrong (undefined behavior). Hence, we can remove the test plus the code
> > > > > which is executed if the test succeeds.
> > > >
> > > > [...]
> > > >
> > > > I see the point. I just don't agree with the conclusion.
> > > > As I see it the LLVM group is claiming that if there is any possiblity for UB,
> > > > even if it is dependent on input arguments, then it is always UB, and
> > > > the compiler can
> > > > remove code as it sees fit. (Which will be a nightmare for security code)
> > >
> > > I admit I thought so some time ago but in my opinion you do a wrong
> > > conclusion:
> > >
> > > There is a test in the code.
> > >
> > > The result of the test can _only_ be true if the code before the test
> > > triggered undefined behavior.
> > >
> > > In other words, if the test succeeded, then we can be sure that the test
> > > compared two pointers belonging to different objects -- which is undefined
> > > behavior.
> > >
> > > Therefore, the compiler is fine removing the test because the compiler does
> > > not support the case where the test result is true.
> > >
> > > If the test result is false, everything is fine, both objects may or may not
> > > point to different objects, but the compiler assumes that they do. But: If
> > > the test result is false, end_ptr does not need an adaption, therefore that
> > > line can be optimized out.
> > >
> > > [...]
> >
> > Let us flip the variables around and review:
> > E.g.
> > if ( str < end_ptr )
> > instead of
> > if ( end_ptr < str )
> >
> > Should the compiler then always evaluate the clause to true?
> > end_ptr might still be lower than str due to overflow in the artihmetic.
>
> Not sure what you mean by "then always evaluate the clause to true".
> Are you asking if the compiler may assume that (str < end_ptr) is always
> true because only this is "defined behavior"?
>
> No, the compiler cannot assume that. But this is no contradiction to what
> I said before.
>
> The compiler does the following steps for the original problem,
> (end_ptr < str):
>
>   1. Generate code for the test.
>   2. The test result can either be true or false. The compiler generates
>      code for both cases.
>   3. No code needs to be generated for the case "false".
>   4. Generate code for the case "true".
>   5. Optimize the code. The compiler observes that the "false" case can
>      only happen if the previous calculation triggered undefined behavior
>      (str and end_ptr point to different objects).
>
> In your case (str < end_ptr), the compiler still needs to generate code for
> the test because the compiler must assume that both pointers (str and end_ptr)
> belong to the same object, therefore the test is valid.
>
> In the original case (end_ptr < str), the compiler knows for sure that both
> pointers belong to different objects!
>
> Kind regards,
>
> Frank
> --
> Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
>
> Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> Geschäftsführer: Dr.-Ing. Michael Hohmuth
>

Is this true for all cases for strings?
> In the original case (end_ptr < str), the compiler knows for sure that both
pointers belong to different objects!

---
const char* str = "Hello World!"
const char* sub_str = strstr(str, "Wor");

strnlen(sub_str, X);
---

For some values of X end_ptr is still pointing to a valid array entry
for the original "Hello World" string,
even if it is not within the "World!" substring. Is it then not
pointing to a valid array of objects?
(E.g if X is the unsigned representation of -1,-2,...-6)

Best regards,
Kjetil Oftedal
Frank Mehnert Jan. 7, 2025, 5:07 p.m. UTC | #12
On Dienstag, 7. Januar 2025 17:38:52 MEZ Kjetil Oftedal wrote:
> On Tue, 7 Jan 2025 at 17:09, Frank Mehnert
> <frank.mehnert@kernkonzept.com> wrote:
> >
> > On Dienstag, 7. Januar 2025 16:45:03 MEZ Kjetil Oftedal wrote:
> > > On Tue, 7 Jan 2025 at 16:19, Frank Mehnert
> > > <frank.mehnert@kernkonzept.com> wrote:
> > > >
> > > > On Dienstag, 7. Januar 2025 15:55:28 MEZ Kjetil Oftedal wrote:
> > > > > On Tue, 7 Jan 2025 at 15:33, Frank Mehnert
> > > > > <frank.mehnert@kernkonzept.com> wrote:
> > > > > > [...]
> > > > > >
> > > > > > With offset=1 and x=0xffffffff'fffffff, y will 0, so y does not belong to
> > > > > > the same object as x. There is no other value of x which could lead to a
> > > > > > result y<x with offset=1!
> > > > > >
> > > > > > The comparison tests the wrap around and adapts end_ptr in that case.
> > > > > >
> > > > > > Again: If there was a wrap around, then the pointer arithmetic operation
> > > > > > was wrong (undefined behavior). Hence, we can remove the test plus the code
> > > > > > which is executed if the test succeeds.
> > > > >
> > > > > [...]
> > > > >
> > > > > I see the point. I just don't agree with the conclusion.
> > > > > As I see it the LLVM group is claiming that if there is any possiblity for UB,
> > > > > even if it is dependent on input arguments, then it is always UB, and
> > > > > the compiler can
> > > > > remove code as it sees fit. (Which will be a nightmare for security code)
> > > >
> > > > I admit I thought so some time ago but in my opinion you do a wrong
> > > > conclusion:
> > > >
> > > > There is a test in the code.
> > > >
> > > > The result of the test can _only_ be true if the code before the test
> > > > triggered undefined behavior.
> > > >
> > > > In other words, if the test succeeded, then we can be sure that the test
> > > > compared two pointers belonging to different objects -- which is undefined
> > > > behavior.
> > > >
> > > > Therefore, the compiler is fine removing the test because the compiler does
> > > > not support the case where the test result is true.
> > > >
> > > > If the test result is false, everything is fine, both objects may or may not
> > > > point to different objects, but the compiler assumes that they do. But: If
> > > > the test result is false, end_ptr does not need an adaption, therefore that
> > > > line can be optimized out.
> > > >
> > > > [...]
> > >
> > > Let us flip the variables around and review:
> > > E.g.
> > > if ( str < end_ptr )
> > > instead of
> > > if ( end_ptr < str )
> > >
> > > Should the compiler then always evaluate the clause to true?
> > > end_ptr might still be lower than str due to overflow in the artihmetic.
> >
> > Not sure what you mean by "then always evaluate the clause to true".
> > Are you asking if the compiler may assume that (str < end_ptr) is always
> > true because only this is "defined behavior"?
> >
> > No, the compiler cannot assume that. But this is no contradiction to what
> > I said before.
> >
> > The compiler does the following steps for the original problem,
> > (end_ptr < str):
> >
> >   1. Generate code for the test.
> >   2. The test result can either be true or false. The compiler generates
> >      code for both cases.
> >   3. No code needs to be generated for the case "false".
> >   4. Generate code for the case "true".
> >   5. Optimize the code. The compiler observes that the "false" case can
> >      only happen if the previous calculation triggered undefined behavior
> >      (str and end_ptr point to different objects).
> >
> > In your case (str < end_ptr), the compiler still needs to generate code for
> > the test because the compiler must assume that both pointers (str and end_ptr)
> > belong to the same object, therefore the test is valid.
> >
> > In the original case (end_ptr < str), the compiler knows for sure that both
> > pointers belong to different objects!
>
> [...]
>
> Is this true for all cases for strings?
>
> > In the original case (end_ptr < str), the compiler knows for sure that
> > both pointers belong to different objects!
> 
> ---
> const char* str = "Hello World!"
> const char* sub_str = strstr(str, "Wor");
> 
> strnlen(sub_str, X);
> ---
> 
> For some values of X end_ptr is still pointing to a valid array entry
> for the original "Hello World" string,
> even if it is not within the "World!" substring. Is it then not
> pointing to a valid array of objects?
> (E.g if X is the unsigned representation of -1,-2,...-6)

OK, I think you have a point.

Basically this derives into the question if adding a huge offset to a pointer
is the same as subtracting a small offset from a pointer. So far I couldn't
find any C++ rule related to this problem.

Kind regards,

Frank
--
Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224

Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
Geschäftsführer: Dr.-Ing. Michael Hohmuth
Kjetil Oftedal Jan. 7, 2025, 5:18 p.m. UTC | #13
On Tue, 7 Jan 2025 at 18:07, Frank Mehnert
<frank.mehnert@kernkonzept.com> wrote:
>
> On Dienstag, 7. Januar 2025 17:38:52 MEZ Kjetil Oftedal wrote:
> > On Tue, 7 Jan 2025 at 17:09, Frank Mehnert
> > <frank.mehnert@kernkonzept.com> wrote:
> > >
> > > On Dienstag, 7. Januar 2025 16:45:03 MEZ Kjetil Oftedal wrote:
> > > > On Tue, 7 Jan 2025 at 16:19, Frank Mehnert
> > > > <frank.mehnert@kernkonzept.com> wrote:
> > > > >
> > > > > On Dienstag, 7. Januar 2025 15:55:28 MEZ Kjetil Oftedal wrote:
> > > > > > On Tue, 7 Jan 2025 at 15:33, Frank Mehnert
> > > > > > <frank.mehnert@kernkonzept.com> wrote:
> > > > > > > [...]
> > > > > > >
> > > > > > > With offset=1 and x=0xffffffff'fffffff, y will 0, so y does not belong to
> > > > > > > the same object as x. There is no other value of x which could lead to a
> > > > > > > result y<x with offset=1!
> > > > > > >
> > > > > > > The comparison tests the wrap around and adapts end_ptr in that case.
> > > > > > >
> > > > > > > Again: If there was a wrap around, then the pointer arithmetic operation
> > > > > > > was wrong (undefined behavior). Hence, we can remove the test plus the code
> > > > > > > which is executed if the test succeeds.
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > I see the point. I just don't agree with the conclusion.
> > > > > > As I see it the LLVM group is claiming that if there is any possiblity for UB,
> > > > > > even if it is dependent on input arguments, then it is always UB, and
> > > > > > the compiler can
> > > > > > remove code as it sees fit. (Which will be a nightmare for security code)
> > > > >
> > > > > I admit I thought so some time ago but in my opinion you do a wrong
> > > > > conclusion:
> > > > >
> > > > > There is a test in the code.
> > > > >
> > > > > The result of the test can _only_ be true if the code before the test
> > > > > triggered undefined behavior.
> > > > >
> > > > > In other words, if the test succeeded, then we can be sure that the test
> > > > > compared two pointers belonging to different objects -- which is undefined
> > > > > behavior.
> > > > >
> > > > > Therefore, the compiler is fine removing the test because the compiler does
> > > > > not support the case where the test result is true.
> > > > >
> > > > > If the test result is false, everything is fine, both objects may or may not
> > > > > point to different objects, but the compiler assumes that they do. But: If
> > > > > the test result is false, end_ptr does not need an adaption, therefore that
> > > > > line can be optimized out.
> > > > >
> > > > > [...]
> > > >
> > > > Let us flip the variables around and review:
> > > > E.g.
> > > > if ( str < end_ptr )
> > > > instead of
> > > > if ( end_ptr < str )
> > > >
> > > > Should the compiler then always evaluate the clause to true?
> > > > end_ptr might still be lower than str due to overflow in the artihmetic.
> > >
> > > Not sure what you mean by "then always evaluate the clause to true".
> > > Are you asking if the compiler may assume that (str < end_ptr) is always
> > > true because only this is "defined behavior"?
> > >
> > > No, the compiler cannot assume that. But this is no contradiction to what
> > > I said before.
> > >
> > > The compiler does the following steps for the original problem,
> > > (end_ptr < str):
> > >
> > >   1. Generate code for the test.
> > >   2. The test result can either be true or false. The compiler generates
> > >      code for both cases.
> > >   3. No code needs to be generated for the case "false".
> > >   4. Generate code for the case "true".
> > >   5. Optimize the code. The compiler observes that the "false" case can
> > >      only happen if the previous calculation triggered undefined behavior
> > >      (str and end_ptr point to different objects).
> > >
> > > In your case (str < end_ptr), the compiler still needs to generate code for
> > > the test because the compiler must assume that both pointers (str and end_ptr)
> > > belong to the same object, therefore the test is valid.
> > >
> > > In the original case (end_ptr < str), the compiler knows for sure that both
> > > pointers belong to different objects!
> >
> > [...]
> >
> > Is this true for all cases for strings?
> >
> > > In the original case (end_ptr < str), the compiler knows for sure that
> > > both pointers belong to different objects!
> >
> > ---
> > const char* str = "Hello World!"
> > const char* sub_str = strstr(str, "Wor");
> >
> > strnlen(sub_str, X);
> > ---
> >
> > For some values of X end_ptr is still pointing to a valid array entry
> > for the original "Hello World" string,
> > even if it is not within the "World!" substring. Is it then not
> > pointing to a valid array of objects?
> > (E.g if X is the unsigned representation of -1,-2,...-6)
>
> OK, I think you have a point.
>
> Basically this derives into the question if adding a huge offset to a pointer
> is the same as subtracting a small offset from a pointer. So far I couldn't
> find any C++ rule related to this problem.
>
> Kind regards,
>
> Frank
> --
> Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
>
> Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> Geschäftsführer: Dr.-Ing. Michael Hohmuth
>
>
Hi,

I think it is purely academic as this point though :)

The change is probably fine. Just a bit annoyed that clang forces
changes to the code
in a lot of projects, instead of clang accepting defacto standard patterns.

Best regards,
Kjetil Oftedal
Frank Mehnert Jan. 8, 2025, 10:42 a.m. UTC | #14
Hi Kjetil,

On Dienstag, 7. Januar 2025 18:18:14 MEZ Kjetil Oftedal wrote:
> On Tue, 7 Jan 2025 at 18:07, Frank Mehnert
> <frank.mehnert@kernkonzept.com> wrote:
> >
> > On Dienstag, 7. Januar 2025 17:38:52 MEZ Kjetil Oftedal wrote:
> > > On Tue, 7 Jan 2025 at 17:09, Frank Mehnert
> > > <frank.mehnert@kernkonzept.com> wrote:
> > >
> > > [...]
> > >
> > > Is this true for all cases for strings?
> > >
> > > > In the original case (end_ptr < str), the compiler knows for sure that
> > > > both pointers belong to different objects!
> > >
> > > ---
> > > const char* str = "Hello World!"
> > > const char* sub_str = strstr(str, "Wor");
> > >
> > > strnlen(sub_str, X);
> > > ---
> > >
> > > For some values of X end_ptr is still pointing to a valid array entry
> > > for the original "Hello World" string,
> > > even if it is not within the "World!" substring. Is it then not
> > > pointing to a valid array of objects?
> > > (E.g if X is the unsigned representation of -1,-2,...-6)
> >
> > OK, I think you have a point.
> >
> > Basically this derives into the question if adding a huge offset to a pointer
> > is the same as subtracting a small offset from a pointer. So far I couldn't
> > find any C++ rule related to this problem.
>
> Hi,
> 
> I think it is purely academic as this point though :)
> 
> The change is probably fine. Just a bit annoyed that clang forces
> changes to the code
> in a lot of projects, instead of clang accepting defacto standard patterns.

The LLVM ticket [1] was updated and Andrew Pinski noted that the the Linux
kernel, -fwrapv-pointer was added to keep the "defacto standard patterns".
So compiling with -fno-strict-overflow (which included -fwrapv-pointer) would
most likely make my fix superfluous.

So it's basically about pointer arithmetics warp around.

But I hope there is agreement now that my change makes the code "more correct".

Kind regards

Frank

[1] https://github.com/llvm/llvm-project/issues/121909
Kjetil Oftedal Jan. 8, 2025, 12:57 p.m. UTC | #15
On Wed, 8 Jan 2025 at 11:42, Frank Mehnert
<frank.mehnert@kernkonzept.com> wrote:
>
> Hi Kjetil,
>
> On Dienstag, 7. Januar 2025 18:18:14 MEZ Kjetil Oftedal wrote:
> > On Tue, 7 Jan 2025 at 18:07, Frank Mehnert
> > <frank.mehnert@kernkonzept.com> wrote:
> > >
> > > On Dienstag, 7. Januar 2025 17:38:52 MEZ Kjetil Oftedal wrote:
> > > > On Tue, 7 Jan 2025 at 17:09, Frank Mehnert
> > > > <frank.mehnert@kernkonzept.com> wrote:
> > > >
> > > > [...]
> > > >
> > > > Is this true for all cases for strings?
> > > >
> > > > > In the original case (end_ptr < str), the compiler knows for sure that
> > > > > both pointers belong to different objects!
> > > >
> > > > ---
> > > > const char* str = "Hello World!"
> > > > const char* sub_str = strstr(str, "Wor");
> > > >
> > > > strnlen(sub_str, X);
> > > > ---
> > > >
> > > > For some values of X end_ptr is still pointing to a valid array entry
> > > > for the original "Hello World" string,
> > > > even if it is not within the "World!" substring. Is it then not
> > > > pointing to a valid array of objects?
> > > > (E.g if X is the unsigned representation of -1,-2,...-6)
> > >
> > > OK, I think you have a point.
> > >
> > > Basically this derives into the question if adding a huge offset to a pointer
> > > is the same as subtracting a small offset from a pointer. So far I couldn't
> > > find any C++ rule related to this problem.
> >
> > Hi,
> >
> > I think it is purely academic as this point though :)
> >
> > The change is probably fine. Just a bit annoyed that clang forces
> > changes to the code
> > in a lot of projects, instead of clang accepting defacto standard patterns.
>
> The LLVM ticket [1] was updated and Andrew Pinski noted that the the Linux
> kernel, -fwrapv-pointer was added to keep the "defacto standard patterns".
> So compiling with -fno-strict-overflow (which included -fwrapv-pointer) would
> most likely make my fix superfluous.
>
> So it's basically about pointer arithmetics warp around.
>
> But I hope there is agreement now that my change makes the code "more correct".
>
> Kind regards
>
> Frank
>
> [1] https://github.com/llvm/llvm-project/issues/121909
> --
> Dr.-Ing. Frank Mehnert, frank.mehnert@kernkonzept.com, +49-351-41 883 224
>
> Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
> Geschäftsführer: Dr.-Ing. Michael Hohmuth
>

Sure. I enjoyed the discussion. There is no objection from me on the change.

Best regards,
Kjetil Oftedal
diff mbox series

Patch

diff --git a/libc/string/generic/strnlen.c b/libc/string/generic/strnlen.c
index 4d4cde84f..82d4122ec 100644
--- a/libc/string/generic/strnlen.c
+++ b/libc/string/generic/strnlen.c
@@ -29,15 +29,17 @@ 
    '\0' terminator is found in that many characters, return MAXLEN.  */
 size_t strnlen (const char *str, size_t maxlen)
 {
-  const char *char_ptr, *end_ptr = str + maxlen;
+  const char *char_ptr, *end_ptr;
   const unsigned long int *longword_ptr;
   unsigned long int longword, himagic, lomagic;
 
   if (maxlen == 0)
     return 0;
 
-  if (__builtin_expect (end_ptr < str, 0))
+  if (__builtin_expect ((uintptr_t)str + maxlen < (uintptr_t)str, 0))
     end_ptr = (const char *) ~0UL;
+  else
+    end_ptr = str + maxlen;
 
   /* Handle the first few characters by reading one character at a time.
      Do this until CHAR_PTR is aligned on a longword boundary.  */