diff mbox series

libstdc++: Fix infinite loop in std::istream::ignore(n, delim) [PR93672]

Message ID 20240404153158.313297-1-jwakely@redhat.com
State New
Headers show
Series libstdc++: Fix infinite loop in std::istream::ignore(n, delim) [PR93672] | expand

Commit Message

Jonathan Wakely April 4, 2024, 3:29 p.m. UTC
I would appreciate more eyes on this to confirm my conclusions about
negative int_type values, and the proposed fix, make sense.

Tested x86_64-linux.

-- >8 --

A negative value for the delim value passed to std::istream::ignore can
never match any character in the stream, because the comparison is done
using traits_type::eq_int_type(sb->sgetc(), delim) and sgetc() never
returns negative values (except at EOF). The optimized version of
ignore for the std::istream specialization uses traits_type::find to
locate the delim character in the streambuf, which _can_ match a
negative delim on platforms where char is signed, but then we do another
comparison using eq_int_type which fails. The code then keeps looping
forever, with traits_type::find saying the character is present and
eq_int_type saying it's not.

A possible fix would be to check with eq_int_type after a successful
find, to see whether we really have a match. However, that would be
suboptimal since we know that a negative delimiter will never match
using eq_int_type. So a better fix is to adjust the check at the top of
the function that handles delim==eof(), so that we treat all negative
delim values as equivalent to EOF. That way we don't bother using find
to search for something that will never match with eq_int_type.

The version of ignore in the primary template doesn't need a change,
because it doesn't use traits_type::find, instead characters are
extracted one-by-one and always matched using eq_int_type. That avoids
the inconsistency between find and eq_int_type.

libstdc++-v3/ChangeLog:

	PR libstdc++/93672
	* src/c++98/istream.cc (istream::ignore(streamsize, int_type)):
	Treat all negative delimiter values as eof().
	* testsuite/27_io/basic_istream/ignore/char/93672.cc: New test.
---
 libstdc++-v3/src/c++98/istream.cc                 |  5 ++++-
 .../27_io/basic_istream/ignore/char/93672.cc      | 15 +++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)
 create mode 100644 libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc

Comments

Iain Sandoe April 4, 2024, 3:40 p.m. UTC | #1
> On 4 Apr 2024, at 16:29, Jonathan Wakely <jwakely@redhat.com> wrote:
> 
> I would appreciate more eyes on this to confirm my conclusions about
> negative int_type values, and the proposed fix, make sense.
> 
> Tested x86_64-linux.
> 
> -- >8 --
> 
> A negative value for the delim value passed to std::istream::ignore can
> never match any character in the stream, because the comparison is done
> using traits_type::eq_int_type(sb->sgetc(), delim) and sgetc() never
> returns negative values (except at EOF). The optimized version of
> ignore for the std::istream specialization uses traits_type::find to
> locate the delim character in the streambuf, which _can_ match a
> negative delim on platforms where char is signed, but then we do another
> comparison using eq_int_type which fails. The code then keeps looping
> forever, with traits_type::find saying the character is present and
> eq_int_type saying it's not.
> 
> A possible fix would be to check with eq_int_type after a successful
> find, to see whether we really have a match. However, that would be
> suboptimal since we know that a negative delimiter will never match
> using eq_int_type. So a better fix is to adjust the check at the top of
> the function that handles delim==eof(), so that we treat all negative
> delim values as equivalent to EOF. That way we don't bother using find
> to search for something that will never match with eq_int_type.

Is the corollary to this that a platform with signed chars can never use a
negative value as a delimiter - since that we always be treated as EOF?

- I am not sure it there’s an actual use-case where that matters, but,
Iain

> 
> The version of ignore in the primary template doesn't need a change,
> because it doesn't use traits_type::find, instead characters are
> extracted one-by-one and always matched using eq_int_type. That avoids
> the inconsistency between find and eq_int_type.
> 
> libstdc++-v3/ChangeLog:
> 
> 	PR libstdc++/93672
> 	* src/c++98/istream.cc (istream::ignore(streamsize, int_type)):
> 	Treat all negative delimiter values as eof().
> 	* testsuite/27_io/basic_istream/ignore/char/93672.cc: New test.
> ---
> libstdc++-v3/src/c++98/istream.cc                 |  5 ++++-
> .../27_io/basic_istream/ignore/char/93672.cc      | 15 +++++++++++++++
> 2 files changed, 19 insertions(+), 1 deletion(-)
> create mode 100644 libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc
> 
> diff --git a/libstdc++-v3/src/c++98/istream.cc b/libstdc++-v3/src/c++98/istream.cc
> index 07ac739c26a..aa1069dea07 100644
> --- a/libstdc++-v3/src/c++98/istream.cc
> +++ b/libstdc++-v3/src/c++98/istream.cc
> @@ -112,7 +112,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>     basic_istream<char>::
>     ignore(streamsize __n, int_type __delim)
>     {
> -      if (traits_type::eq_int_type(__delim, traits_type::eof()))
> +      // sgetc() returns either (int_type)(unsigned char)c or -1 for EOF.
> +      // If __delim is negative, then eq_int_type(sgetc(), __delim) can only
> +      // be true for EOF, so just treat all negative values as eof().
> +      if (__delim < 0)
> 	return ignore(__n);
> 
>       _M_gcount = 0;
> diff --git a/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc
> new file mode 100644
> index 00000000000..6d11f5622c8
> --- /dev/null
> +++ b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc
> @@ -0,0 +1,15 @@
> +// { dg-do run }
> +
> +#include <sstream>
> +#include <testsuite_hooks.h>
> +
> +int main()
> +{
> +  std::istringstream in("x\xfdxxx\xfex");
> +  in.ignore(10, std::char_traits<char>::to_int_type('\xfd'));
> +  VERIFY( in.gcount() == 2 );
> +  VERIFY( ! in.eof() );
> +  in.ignore(10, '\xfe');
> +  VERIFY( in.gcount() == 5 );
> +  VERIFY( in.eof() );
> +}
> -- 
> 2.44.0
>
Ulrich Drepper April 4, 2024, 4:28 p.m. UTC | #2
On Thu, Apr 4, 2024 at 5:29 PM Jonathan Wakely <jwakely@redhat.com> wrote:
> I would appreciate more eyes on this to confirm my conclusions about
> negative int_type values, and the proposed fix, make sense.

The way something like this is handled in glibc's ctype functions is
that both branches are considered.  For isXXX(c) whether c is -v or
256-v the same value is returned (except for EOF which is -1).  This
caused the least number of bad surprises.

You could here also perform similar actions.
Jonathan Wakely April 4, 2024, 4:30 p.m. UTC | #3
On Thu, 4 Apr 2024 at 16:40, Iain Sandoe <idsandoe@googlemail.com> wrote:
>
>
>
> > On 4 Apr 2024, at 16:29, Jonathan Wakely <jwakely@redhat.com> wrote:
> >
> > I would appreciate more eyes on this to confirm my conclusions about
> > negative int_type values, and the proposed fix, make sense.
> >
> > Tested x86_64-linux.
> >
> > -- >8 --
> >
> > A negative value for the delim value passed to std::istream::ignore can
> > never match any character in the stream, because the comparison is done
> > using traits_type::eq_int_type(sb->sgetc(), delim) and sgetc() never
> > returns negative values (except at EOF). The optimized version of
> > ignore for the std::istream specialization uses traits_type::find to
> > locate the delim character in the streambuf, which _can_ match a
> > negative delim on platforms where char is signed, but then we do another
> > comparison using eq_int_type which fails. The code then keeps looping
> > forever, with traits_type::find saying the character is present and
> > eq_int_type saying it's not.
> >
> > A possible fix would be to check with eq_int_type after a successful
> > find, to see whether we really have a match. However, that would be
> > suboptimal since we know that a negative delimiter will never match
> > using eq_int_type. So a better fix is to adjust the check at the top of
> > the function that handles delim==eof(), so that we treat all negative
> > delim values as equivalent to EOF. That way we don't bother using find
> > to search for something that will never match with eq_int_type.
>
> Is the corollary to this that a platform with signed chars can never use a
> negative value as a delimiter - since that we always be treated as EOF?

That's what the C++ standard says (and is what libc++ does).

The delimiter argument to ignore is an int_type, not a char. So
formally you should call it like:

std::cin.ignore(n, std::istream::traits_type::to_int_type('a'));

where to_int_type will cast to unsigned char and then to int, so that
no char can ever produce a negative value for that argument.

If you happen to know that casting 'a' to unsigned char and then to
int doesn't change its value (because it's a 7-bit ASCII value), then
you can be lazy and do:

std::cin.ignore(n, 'a');

That works fine.

But if your delimiter character is the MSB set, *and* char is signed
on your platform, then you can't be lazy. The implicit conversion from
char to the stream's int_type is not the same as the result of calling
traits_type::to_int_type, and so these are NOT equivalent on a
platform with signed char:
std::cin.ignore(n, '\x80');
std::cin.ignore(n, (unsigned char)'\x80');

The former is wrong, the latter is correct.
The former will never match a '\x80' in the stream, because the ignore
function will cast each char extracted from the stream to
(int)(unsigned char) and so never match -128.

So the change to treat all negative values as EOF is just an
optimization. Since they can never match, there's no point searching
for them. Just skip n chars.



>
> - I am not sure it there’s an actual use-case where that matters, but,
> Iain
>
> >
> > The version of ignore in the primary template doesn't need a change,
> > because it doesn't use traits_type::find, instead characters are
> > extracted one-by-one and always matched using eq_int_type. That avoids
> > the inconsistency between find and eq_int_type.
> >
> > libstdc++-v3/ChangeLog:
> >
> >       PR libstdc++/93672
> >       * src/c++98/istream.cc (istream::ignore(streamsize, int_type)):
> >       Treat all negative delimiter values as eof().
> >       * testsuite/27_io/basic_istream/ignore/char/93672.cc: New test.
> > ---
> > libstdc++-v3/src/c++98/istream.cc                 |  5 ++++-
> > .../27_io/basic_istream/ignore/char/93672.cc      | 15 +++++++++++++++
> > 2 files changed, 19 insertions(+), 1 deletion(-)
> > create mode 100644 libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc
> >
> > diff --git a/libstdc++-v3/src/c++98/istream.cc b/libstdc++-v3/src/c++98/istream.cc
> > index 07ac739c26a..aa1069dea07 100644
> > --- a/libstdc++-v3/src/c++98/istream.cc
> > +++ b/libstdc++-v3/src/c++98/istream.cc
> > @@ -112,7 +112,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >     basic_istream<char>::
> >     ignore(streamsize __n, int_type __delim)
> >     {
> > -      if (traits_type::eq_int_type(__delim, traits_type::eof()))
> > +      // sgetc() returns either (int_type)(unsigned char)c or -1 for EOF.
> > +      // If __delim is negative, then eq_int_type(sgetc(), __delim) can only
> > +      // be true for EOF, so just treat all negative values as eof().
> > +      if (__delim < 0)
> >       return ignore(__n);
> >
> >       _M_gcount = 0;
> > diff --git a/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc
> > new file mode 100644
> > index 00000000000..6d11f5622c8
> > --- /dev/null
> > +++ b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc
> > @@ -0,0 +1,15 @@
> > +// { dg-do run }
> > +
> > +#include <sstream>
> > +#include <testsuite_hooks.h>
> > +
> > +int main()
> > +{
> > +  std::istringstream in("x\xfdxxx\xfex");
> > +  in.ignore(10, std::char_traits<char>::to_int_type('\xfd'));
> > +  VERIFY( in.gcount() == 2 );
> > +  VERIFY( ! in.eof() );
> > +  in.ignore(10, '\xfe');
> > +  VERIFY( in.gcount() == 5 );
> > +  VERIFY( in.eof() );
> > +}
> > --
> > 2.44.0
> >
>
Jonathan Wakely April 4, 2024, 4:33 p.m. UTC | #4
On Thu, 4 Apr 2024 at 17:29, Ulrich Drepper <drepper.fsp@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 5:29 PM Jonathan Wakely <jwakely@redhat.com> wrote:
> > I would appreciate more eyes on this to confirm my conclusions about
> > negative int_type values, and the proposed fix, make sense.
>
> The way something like this is handled in glibc's ctype functions is
> that both branches are considered.  For isXXX(c) whether c is -v or
> 256-v the same value is returned (except for EOF which is -1).  This
> caused the least number of bad surprises.
>
> You could here also perform similar actions.

Yes, my first attempt to fix PR93672 did exactly that, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93672#c1

But since it doesn't work for '\xff' (because that's EOF when char is
signed) it only handles 127 of the 128 possible bugs ;-)
I'm also not sure it's conforming, since the standard specifies how
the matching is done, and that won't match negative chars.
Jonathan Wakely April 4, 2024, 4:35 p.m. UTC | #5
On Thu, 4 Apr 2024 at 17:33, Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Thu, 4 Apr 2024 at 17:29, Ulrich Drepper <drepper.fsp@gmail.com> wrote:
> >
> > On Thu, Apr 4, 2024 at 5:29 PM Jonathan Wakely <jwakely@redhat.com> wrote:
> > > I would appreciate more eyes on this to confirm my conclusions about
> > > negative int_type values, and the proposed fix, make sense.
> >
> > The way something like this is handled in glibc's ctype functions is
> > that both branches are considered.  For isXXX(c) whether c is -v or
> > 256-v the same value is returned (except for EOF which is -1).  This
> > caused the least number of bad surprises.
> >
> > You could here also perform similar actions.
>
> Yes, my first attempt to fix PR93672 did exactly that, see
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93672#c1
>
> But since it doesn't work for '\xff' (because that's EOF when char is
> signed) it only handles 127 of the 128 possible bugs ;-)
> I'm also not sure it's conforming, since the standard specifies how
> the matching is done, and that won't match negative chars.

I might suggest relaxing the standard to allow it though...
Jonathan Wakely April 4, 2024, 4:55 p.m. UTC | #6
On Thu, 4 Apr 2024 at 17:30, Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Thu, 4 Apr 2024 at 16:40, Iain Sandoe <idsandoe@googlemail.com> wrote:
> >
> >
> >
> > > On 4 Apr 2024, at 16:29, Jonathan Wakely <jwakely@redhat.com> wrote:
> > >
> > > I would appreciate more eyes on this to confirm my conclusions about
> > > negative int_type values, and the proposed fix, make sense.
> > >
> > > Tested x86_64-linux.
> > >
> > > -- >8 --
> > >
> > > A negative value for the delim value passed to std::istream::ignore can
> > > never match any character in the stream, because the comparison is done
> > > using traits_type::eq_int_type(sb->sgetc(), delim) and sgetc() never
> > > returns negative values (except at EOF). The optimized version of
> > > ignore for the std::istream specialization uses traits_type::find to
> > > locate the delim character in the streambuf, which _can_ match a
> > > negative delim on platforms where char is signed, but then we do another
> > > comparison using eq_int_type which fails. The code then keeps looping
> > > forever, with traits_type::find saying the character is present and
> > > eq_int_type saying it's not.
> > >
> > > A possible fix would be to check with eq_int_type after a successful
> > > find, to see whether we really have a match. However, that would be
> > > suboptimal since we know that a negative delimiter will never match
> > > using eq_int_type. So a better fix is to adjust the check at the top of
> > > the function that handles delim==eof(), so that we treat all negative
> > > delim values as equivalent to EOF. That way we don't bother using find
> > > to search for something that will never match with eq_int_type.
> >
> > Is the corollary to this that a platform with signed chars can never use a
> > negative value as a delimiter - since that we always be treated as EOF?
>
> That's what the C++ standard says (and is what libc++ does).
>
> The delimiter argument to ignore is an int_type, not a char. So
> formally you should call it like:
>
> std::cin.ignore(n, std::istream::traits_type::to_int_type('a'));
>
> where to_int_type will cast to unsigned char and then to int, so that
> no char can ever produce a negative value for that argument.
>
> If you happen to know that casting 'a' to unsigned char and then to
> int doesn't change its value (because it's a 7-bit ASCII value), then
> you can be lazy and do:
>
> std::cin.ignore(n, 'a');
>
> That works fine.
>
> But if your delimiter character is the MSB set, *and* char is signed
> on your platform, then you can't be lazy. The implicit conversion from
> char to the stream's int_type is not the same as the result of calling
> traits_type::to_int_type, and so these are NOT equivalent on a
> platform with signed char:
> std::cin.ignore(n, '\x80');
> std::cin.ignore(n, (unsigned char)'\x80');
>
> The former is wrong, the latter is correct.
> The former will never match a '\x80' in the stream, because the ignore
> function will cast each char extracted from the stream to
> (int)(unsigned char) and so never match -128.
>
> So the change to treat all negative values as EOF is just an
> optimization. Since they can never match, there's no point searching
> for them. Just skip n chars.

And FWIW, I don't think libc++ does that optimization. They extract
char-by-char and compare them, even though that can never match a
negative value. So they get the same result as with my patch, but much
slower ;-)

The status quo is that libstdc++ loops forever given a negative
delimiter. That's obviously wrong!

We need to fix the buggy logic that loops forever, and the "correct"
standard conforming fix means that we never match the negative value,
because the chars are extracted from the stream as non-negative
int_type values. As an optimization, we can skip the loop that keeps
trying impossible matches, because we know it won't match. That's what
my patch does. (It occurs to me now that we could also treat any delim
value greater than 255 as EOF too, since that can't match either).

Ulrich's suggestion is to allow the buggy user code to Just Work (for
all cases except (char)-1 on a platform where char is signed). That is
maybe the least surprising behaviour for users.

On the other hand, I'm not sure we really want to support:

cin.ignore(n, -10);
cin.ignore(n, -999);
cin.ignore(n, +999);

What are these trying to do? Those are just nonsense arguments to this
function. But if we try to make the testcase in the bug report Just
Work, we end up also accepting (at least) the -10 case, because it's
indistinguishable from the char '\xf6'. Depending how we do it, we
might also make the other cases work, treating -999 as '\x19', and
+999 as '\xe7'.
Jonathan Wakely April 4, 2024, 7:53 p.m. UTC | #7
On Thu, 4 Apr 2024 at 17:55, Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Thu, 4 Apr 2024 at 17:30, Jonathan Wakely <jwakely@redhat.com> wrote:
> Ulrich's suggestion is to allow the buggy user code to Just Work (for
> all cases except (char)-1 on a platform where char is signed). That is
> maybe the least surprising behaviour for users.
>
> On the other hand, I'm not sure we really want to support:
>
> cin.ignore(n, -10);
> cin.ignore(n, -999);
> cin.ignore(n, +999);
>
> What are these trying to do? Those are just nonsense arguments to this
> function. But if we try to make the testcase in the bug report Just
> Work, we end up also accepting (at least) the -10 case, because it's
> indistinguishable from the char '\xf6'. Depending how we do it, we
> might also make the other cases work, treating -999 as '\x19', and
> +999 as '\xe7'.

So maybe what we want is to add a new overload:

basic_istream&
ignore(streamsize n, char_type c)
{
  return ignore(n, traits_type::to_int_type(c));
}

This will only accept the stream's char_type, not -999 or +999, and
will do the required conversion to int_type correctly, not as an
implicit conversion.

Calls that pass eof() will still select the other overload and do the
right thing. Calls using a char (or for wstream, a wchar_t) will
select the new overload.

This new overload will make some calls ambiguous, e.g. ignore(1, 1L)
compiles today, but would become ambiguous. That seems fine, since the
second argument should not be type long (what is it even trying to
do?)

If that's a problem, we can make it a constrained template that only
gets called for the exact char_type:

basic_istream&
ignore(streamsize n, same_as<char_type> auto c)

This would still Do The Right Thing for ignore(n, '\x80') but would
not change the result of ignore(1, 1L) which would still select the
original overload taking int_type for the second parameter.

I think I'll raise this idea with the committee. For now though, I
think my patch fixes the bug and conforms to the standard, and doesn't
do anything inventive.
diff mbox series

Patch

diff --git a/libstdc++-v3/src/c++98/istream.cc b/libstdc++-v3/src/c++98/istream.cc
index 07ac739c26a..aa1069dea07 100644
--- a/libstdc++-v3/src/c++98/istream.cc
+++ b/libstdc++-v3/src/c++98/istream.cc
@@ -112,7 +112,10 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
     basic_istream<char>::
     ignore(streamsize __n, int_type __delim)
     {
-      if (traits_type::eq_int_type(__delim, traits_type::eof()))
+      // sgetc() returns either (int_type)(unsigned char)c or -1 for EOF.
+      // If __delim is negative, then eq_int_type(sgetc(), __delim) can only
+      // be true for EOF, so just treat all negative values as eof().
+      if (__delim < 0)
 	return ignore(__n);
 
       _M_gcount = 0;
diff --git a/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc
new file mode 100644
index 00000000000..6d11f5622c8
--- /dev/null
+++ b/libstdc++-v3/testsuite/27_io/basic_istream/ignore/char/93672.cc
@@ -0,0 +1,15 @@ 
+// { dg-do run }
+
+#include <sstream>
+#include <testsuite_hooks.h>
+
+int main()
+{
+  std::istringstream in("x\xfdxxx\xfex");
+  in.ignore(10, std::char_traits<char>::to_int_type('\xfd'));
+  VERIFY( in.gcount() == 2 );
+  VERIFY( ! in.eof() );
+  in.ignore(10, '\xfe');
+  VERIFY( in.gcount() == 5 );
+  VERIFY( in.eof() );
+}