
[libstdc++] Improve M_check_len

Message ID ZI9MmdQ+OMehcdeg@kam.mff.cuni.cz
State New
Series: [libstdc++] Improve M_check_len

Commit Message

Jan Hubicka June 18, 2023, 6:27 p.m. UTC
Hi,
_M_check_len is used in vector reallocations.  It computes __n + __s while
checking that (__n + __s) * sizeof(_Tp) would not overflow ptrdiff_t.
Since we know that __s is the size of an already allocated memory block,
if __n is not too large this overflow can never happen on 64bit systems,
because memory is not that large.  This patch adds __builtin_constant_p
checks for this case.  This reduces the size of the fully inlined push_back
function, which is critical for loops controlled by a std::vector-based stack.

With the patches to optimize std::max and to handle SRA candidates, we
now fully inline push_back with -O3 (though not with -O2); however, there
are still quite a few silly things, for example:

  //  _78 is original size of the allocated vector.

  _76 = stack$_M_end_of_storage_177 - _142;
  _77 = _76 /[ex] 8; 
  _78 = (long unsigned int) _77;
  _79 = MAX_EXPR <_78, 1>;   
  _80 = _78 + _79; // this is result of _M_check_len doubling the allocated vector size.
  if (_80 != 0)    // result will always be non-zero.  
    goto <bb 7>; [54.67%]
  else
    goto <bb 13>; [45.33%]
  
  <bb 7> [local count: 30795011]:
  if (_80 > 1152921504606846975)  // doubling successfully allocated memory will never get so large.
    goto <bb 8>; [10.00%]
  else
    goto <bb 11>; [90.00%]
  
  <bb 8> [local count: 3079501]:
  if (_80 > 2305843009213693951)  // I wonder if we really want to have two different throws
    goto <bb 9>; [50.00%]
  else 
    goto <bb 10>; [50.00%]
  
  <bb 9> [local count: 1539750]:
  std::__throw_bad_array_new_length ();
  
  <bb 10> [local count: 1539750]:
  std::__throw_bad_alloc ();
  
  <bb 11> [local count: 27715510]:
  _108 = _80 * 8;
  _109 = operator new (_108);

Maybe we want to add an assumption that the result of the function is never
greater than max_size to get rid of the two checks above.  However, this
will still be recognized only after inlining and will continue confusing
the inliner heuristics.
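For illustration, such an assumption could be injected with __builtin_unreachable (a GCC/Clang builtin).  This is a hypothetical standalone helper sketching the idea, not part of the patch:

```cpp
#include <cstddef>

// Hypothetical helper: promise the optimizer that a computed length never
// exceeds a cap.  If the condition were ever true at runtime the behavior
// would be undefined, so this is only valid when the cap really holds.
inline std::size_t assume_at_most(std::size_t len, std::size_t cap)
{
    if (len > cap)
        __builtin_unreachable(); // dead upper-bound checks on len fold away
    return len;
}
```

With the range known, value-range propagation should be able to delete the `_80 != 0` test and both upper-bound throw checks shown in the dump above.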

Bootstrapped/regtested on x86_64-linux.  I am not too familiar with libstdc++
internals, so I would welcome comments and ideas.

libstdc++-v3/ChangeLog:

	PR tree-optimization/110287
	* include/bits/stl_vector.h: Optimize _M_check_len for constantly sized
	types and allocations.

Comments

Jonathan Wakely June 19, 2023, 10:12 a.m. UTC | #1
On Sun, 18 Jun 2023 at 19:37, Jan Hubicka <hubicka@ucw.cz> wrote:

> Hi,
> _M_check_len is used in vector reallocations. It computes __n + __s but
> does
> checking for case that (__n + __s) * sizeof (Tp) would overflow ptrdiff_t.
> Since we know that __s is a size of already allocated memory block if __n
> is
> not too large, this will never happen on 64bit systems since memory is not
> that
> large.  This patch adds __builtin_constant_p checks for this case.  This
> size
> of fully inlined push_back function that is critical for loops that are
> controlled by std::vector based stack.
>
> With the patch to optimize std::max and to handle SRA candidates, we
> fully now inline push_back with -O3 (not with -O2), however there are still
> quite few silly things for example:
>
>   //  _78 is original size of the allocated vector.
>
>   _76 = stack$_M_end_of_storage_177 - _142;
>   _77 = _76 /[ex] 8;
>   _78 = (long unsigned int) _77;
>   _79 = MAX_EXPR <_78, 1>;
>   _80 = _78 + _79; // this is result of _M_check_len doubling the
> allocated vector size.
>   if (_80 != 0)    // result will always be non-zero.
>     goto <bb 7>; [54.67%]
>   else
>     goto <bb 13>; [45.33%]
>
>   <bb 7> [local count: 30795011]:
>   if (_80 > 1152921504606846975)  // doubling successfully allocated
> memory will never get so large.
>     goto <bb 8>; [10.00%]
>   else
>     goto <bb 11>; [90.00%]
>
>   <bb 8> [local count: 3079501]:
>   if (_80 > 2305843009213693951)  // I wonder if we really want to have
> two different throws
>     goto <bb 9>; [50.00%]
>   else
>     goto <bb 10>; [50.00%]
>
>   <bb 9> [local count: 1539750]:
>   std::__throw_bad_array_new_length ();
>
>   <bb 10> [local count: 1539750]:
>   std::__throw_bad_alloc ();
>
>   <bb 11> [local count: 27715510]:
>   _108 = _80 * 8;
>   _109 = operator new (_108);
>
> Maybe we want to add assumption that result of the function is never
> greater than max_size to get rid of the two checks above.  However this
> will still be recognized only after inlining and will continue confusing
> inliner heuristics.
>
> Bootstrapped/regtested x86_64-linux.  I am not too familiar with libstdc++
> internals,
> so would welcome comments and ideas.
>
> libstdc++-v3/ChangeLog:
>
>         PR tree-optimization/110287
>         * include/bits/stl_vector.h: Optimize _M_check_len for constantly
> sized
>         types and allocations.
>
> diff --git a/libstdc++-v3/include/bits/stl_vector.h
> b/libstdc++-v3/include/bits/stl_vector.h
> index 70ced3d101f..3ad59fe3e2b 100644
> --- a/libstdc++-v3/include/bits/stl_vector.h
> +++ b/libstdc++-v3/include/bits/stl_vector.h
> @@ -1895,11 +1895,22 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
>        size_type
>        _M_check_len(size_type __n, const char* __s) const
>        {
> -       if (max_size() - size() < __n)
> -         __throw_length_error(__N(__s));
> +       // On 64bit systems vectors of small sizes can not
> +       // reach overflow by growing by small sizes; before
> +       // this happens, we will run out of memory.
> +       if (__builtin_constant_p (sizeof (_Tp))
>

This shouldn't be here, of course sizeof is a constant.

No space before the opening parens, libstdc++ doesn't follow GNU style.



> +           && __builtin_constant_p (__n)
> +           && sizeof (ptrdiff_t) >= 8
> +           && __n < max_size () / 2)
>

This check is not OK. As I said in Bugzilla just now, max_size() depends on
the allocator, which could return something much smaller than PTRDIFF_MAX.
You can't make this assumption for all specializations of std::vector.

If Alloc::max_size() == 100 and this->size() == 100 then this function
needs to throw length_error for *any* n. In the general case you cannot
remove size() from this condition.

For std::allocator<T> it's safe to assume that max_size() is related to
PTRDIFF_MAX/sizeof(T), but this patch would apply to all allocators.
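A hypothetical allocator of this sort (illustrative sketch; the name and the value 100 are invented here) shows why the size() check cannot be dropped in general:

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <stdexcept>
#include <vector>

// Illustrative allocator with a deliberately silly max_size(); a
// conforming container must still diagnose growth beyond it.
template <class T>
struct TinyAlloc
{
    using value_type = T;

    TinyAlloc() = default;
    template <class U> TinyAlloc(const TinyAlloc<U>&) noexcept {}

    std::size_t max_size() const noexcept { return 100; }

    T* allocate(std::size_t n)
    {
        if (n > max_size())
            throw std::bad_alloc();
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept
    { std::allocator<T>().deallocate(p, n); }
};

template <class T, class U>
bool operator==(const TinyAlloc<T>&, const TinyAlloc<U>&) noexcept
{ return true; }
template <class T, class U>
bool operator!=(const TinyAlloc<T>&, const TinyAlloc<U>&) noexcept
{ return false; }

// With size() == max_size() == 100, any push_back must throw length_error.
bool push_back_throws_length_error()
{
    std::vector<int, TinyAlloc<int>> v(100);
    try { v.push_back(1); }
    catch (const std::length_error&) { return true; }
    catch (...) { return false; }
    return false;
}
```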



> +         return size() + (std::max)(size(), __n);
>
+       else
> +         {
> +           if (max_size() - size() < __n)
> +             __throw_length_error(__N(__s));
>
> -       const size_type __len = size() + (std::max)(size(), __n);
> -       return (__len < size() || __len > max_size()) ? max_size() : __len;
> +           const size_type __len = size() + (std::max)(size(), __n);
> +           return (__len < size() || __len > max_size()) ? max_size() :
> __len;
> +         }
>        }
>
>        // Called by constructors to check initial size.
>
>
Jan Hubicka June 19, 2023, 11:05 a.m. UTC | #2
> > -       if (max_size() - size() < __n)
> > -         __throw_length_error(__N(__s));
> > +       // On 64bit systems vectors of small sizes can not
> > +       // reach overflow by growing by small sizes; before
> > +       // this happens, we will run out of memory.
> > +       if (__builtin_constant_p (sizeof (_Tp))
> >
> 
> This shouldn't be here, of course sizeof is a constant.
OK :)
> 
> No space before the opening parens, libstdc++ doesn't follow GNU style.
Fixed.
> 
> 
> 
> > +           && __builtin_constant_p (__n)
> > +           && sizeof (ptrdiff_t) >= 8
> > +           && __n < max_size () / 2)
> >
> 
> This check is not OK. As I said in Bugzilla just now, max_size() depends on
> the allocator, which could return something much smaller than PTRDIFF_MAX.
> You can't make this assumption for all specializations of std::vector.
> 
> If Alloc::max_size() == 100 and this->size() == 100 then this function
> needs to throw length_error for *any* n. In the general case you cannot
> remove size() from this condition.
> 
> For std::allocator<T> it's safe to assume that max_size() is related to
> PTRDIFF_MAX/sizeof(T), but this patch would apply to all allocators.

Here is an updated version.  I simply test __builtin_constant_p on max_size
and check that it is large enough.  For that we need to copy it into a
temporary variable, since we fold __builtin_constant_p (function (x)) to a
constant early, before the function gets inlined.
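The temporary-variable pattern can be sketched outside the patch like this (standalone toy; cap() stands in for max_size(), and both branches compute the same value, so correctness does not depend on whether the constant fast path is taken):

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>

// cap() stands in for max_size(); inline so its value can become a
// compile-time constant once the caller is optimized.
inline std::size_t cap() { return std::size_t(-1) / 8; }

std::size_t check_len(std::size_t size, std::size_t n)
{
    // Copy into a local first: __builtin_constant_p(cap()) would be
    // folded to 0 too early, but the local can be proven constant later.
    const std::size_t max_size = cap();

    if (__builtin_constant_p(n) && __builtin_constant_p(max_size)
        && n < max_size / 2)
        return size + std::max(size, n);  // fast path, checks elided

    if (max_size - size < n)
        throw std::length_error("vector too long");
    const std::size_t len = size + std::max(size, n);
    return (len < size || len > max_size) ? max_size : len;
}
```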

I also added __builtin_unreachable to let the compiler determine the
return value range, as discussed in the PR.

Honza

diff --git a/libstdc++-v3/include/bits/stl_vector.h b/libstdc++-v3/include/bits/stl_vector.h
index 70ced3d101f..7a1966405ca 100644
--- a/libstdc++-v3/include/bits/stl_vector.h
+++ b/libstdc++-v3/include/bits/stl_vector.h
@@ -1895,11 +1895,29 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
       size_type
       _M_check_len(size_type __n, const char* __s) const
       {
-	if (max_size() - size() < __n)
-	  __throw_length_error(__N(__s));
+	const size_type __max_size = max_size();
+	// On 64bit systems vectors can not reach overflow by growing
+	// by small sizes; before this happens, we will run out of memory.
+	if (__builtin_constant_p(__n)
+	    && __builtin_constant_p(__max_size)
+	    && sizeof(ptrdiff_t) >= 8
+	    && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)
+	    && __n < __max_size / 2)
+	  {
+	    const size_type __len = size() + (std::max)(size(), __n);
+	    // let compiler know that __len has sane value range.
+	    if (__len < __n || __len >= __max_size)
+	      __builtin_unreachable();
+	    return __len;
+	  }
+	else
+	  {
+	    if (__max_size - size() < __n)
+	      __throw_length_error(__N(__s));
 
-	const size_type __len = size() + (std::max)(size(), __n);
-	return (__len < size() || __len > max_size()) ? max_size() : __len;
+	    const size_type __len = size() + (std::max)(size(), __n);
+	    return (__len < size() || __len > __max_size) ? __max_size : __len;
+	  }
       }
 
       // Called by constructors to check initial size.
Jakub Jelinek June 19, 2023, 11:20 a.m. UTC | #3
On Mon, Jun 19, 2023 at 01:05:36PM +0200, Jan Hubicka via Gcc-patches wrote:
> -	if (max_size() - size() < __n)
> -	  __throw_length_error(__N(__s));
> +	const size_type __max_size = max_size();
> +	// On 64bit systems vectors can not reach overflow by growing
> +	// by small sizes; before this happens, we will run out of memory.
> +	if (__builtin_constant_p(__n)
> +	    && __builtin_constant_p(__max_size)
> +	    && sizeof(ptrdiff_t) >= 8
> +	    && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)

Isn't there a risk of overflow in the __max_size * sizeof(_Tp) computation?

	Jakub
Jonathan Wakely June 19, 2023, 3:13 p.m. UTC | #4
On Mon, 19 Jun 2023 at 12:20, Jakub Jelinek wrote:

> On Mon, Jun 19, 2023 at 01:05:36PM +0200, Jan Hubicka via Gcc-patches
> wrote:
> > -     if (max_size() - size() < __n)
> > -       __throw_length_error(__N(__s));
> > +     const size_type __max_size = max_size();
> > +     // On 64bit systems vectors can not reach overflow by growing
> > +     // by small sizes; before this happens, we will run out of memory.
> > +     if (__builtin_constant_p(__n)
> > +         && __builtin_constant_p(__max_size)
> > +         && sizeof(ptrdiff_t) >= 8
> > +         && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)
>
> Isn't there a risk of overlow in the __max_size * sizeof(_Tp) computation?
>

For std::allocator, no, because max_size() is size_t(-1) / sizeof(_Tp). But
for a user-defined allocator that has a silly max_size(), yes, that's
possible.

I still don't really understand why any change is needed here. The PR says
that the current _M_check_len brings in the EH code, but how/why does that
happen? The __throw_length_error function is not inline, it's defined in
libstdc++.so, so why isn't it just an extern call? Is the problem that it
makes _M_check_len potentially-throwing? Because that's basically the
entire point of _M_check_len: to throw the exception that is required by
the C++ standard. We need to be very careful about removing that required
throw! And after we call _M_check_len we call allocate unconditionally, so
_M_realloc_insert can always throw (we only call _M_realloc_insert in the
case where we've already decided a reallocation is definitely needed).

Would this version of _M_check_len help?

      size_type
      _M_check_len(size_type __n, const char* __s) const
      {
        const size_type __size = size();
        const size_type __max_size = max_size();

        if (__is_same(allocator_type, allocator<_Tp>)
              && __size > __max_size / 2)
          __builtin_unreachable(); // Assume std::allocator can't fill memory.
        else if (__size > __max_size)
          __builtin_unreachable();

        if (__max_size - __size < __n)
          __throw_length_error(__N(__s));

        const size_type __len = __size + (std::max)(__size, __n);
        return (__len < __size || __len > __max_size) ? __max_size : __len;
      }

This only applies to std::allocator, not user-defined allocators (because
we don't know their semantics). It also seems like less of a big hack!
Jonathan Wakely June 19, 2023, 3:14 p.m. UTC | #5
P.S. please CC libstdc++@gcc.gnu.org for all libstdc++ patches.

On Mon, 19 Jun 2023 at 16:13, Jonathan Wakely <jwakely@redhat.com> wrote:

> On Mon, 19 Jun 2023 at 12:20, Jakub Jelinek wrote:
>
>> On Mon, Jun 19, 2023 at 01:05:36PM +0200, Jan Hubicka via Gcc-patches
>> wrote:
>> > -     if (max_size() - size() < __n)
>> > -       __throw_length_error(__N(__s));
>> > +     const size_type __max_size = max_size();
>> > +     // On 64bit systems vectors can not reach overflow by growing
>> > +     // by small sizes; before this happens, we will run out of memory.
>> > +     if (__builtin_constant_p(__n)
>> > +         && __builtin_constant_p(__max_size)
>> > +         && sizeof(ptrdiff_t) >= 8
>> > +         && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)
>>
>> Isn't there a risk of overlow in the __max_size * sizeof(_Tp) computation?
>>
>
> For std::allocator, no, because max_size() is size_t(-1) / sizeof(_Tp).
> But for a user-defined allocator that has a silly max_size(), yes, that's
> possible.
>
> I still don't really understand why any change is needed here. The PR says
> that the current _M_check_len brings in the EH code, but how/why does that
> happen? The __throw_length_error function is not inline, it's defined in
> libstdc++.so, so why isn't it just an extern call? Is the problem that it
> makes _M_check_len potentially-throwing? Because that's basically the
> entire point of _M_check_len: to throw the exception that is required by
> the C++ standard. We need to be very careful about removing that required
> throw! And after we call _M_check_len we call allocate unconditionally, so
> _M_realloc_insert can always throw (we only call _M_realloc_insert in the
> case where we've already decided a reallocation is definitely needed).
>
> Would this version of _M_check_len help?
>
>       size_type
>       _M_check_len(size_type __n, const char* __s) const
>       {
>         const size_type __size = size();
>         const size_type __max_size = max_size();
>
>         if (__is_same(allocator_type, allocator<_Tp>)
>               && __size > __max_size / 2)
>           __builtin_unreachable(); // Assume std::allocator can't fill
> memory.
>         else if (__size > __max_size)
>           __builtin_unreachable();
>
>         if (__max_size - __size < __n)
>           __throw_length_error(__N(__s));
>
>         const size_type __len = __size + (std::max)(__size, __n);
>         return (__len < __size || __len > __max_size) ? __max_size : __len;
>       }
>
> This only applies to std::allocator, not user-defined allocators (because
> we don't know their semantics). It also seems like less of a big hack!
>
>
>
Jonathan Wakely June 19, 2023, 3:35 p.m. UTC | #6
On Mon, 19 Jun 2023 at 16:13, Jonathan Wakely <jwakely@redhat.com> wrote:

> On Mon, 19 Jun 2023 at 12:20, Jakub Jelinek wrote:
>
>> On Mon, Jun 19, 2023 at 01:05:36PM +0200, Jan Hubicka via Gcc-patches
>> wrote:
>> > -     if (max_size() - size() < __n)
>> > -       __throw_length_error(__N(__s));
>> > +     const size_type __max_size = max_size();
>> > +     // On 64bit systems vectors can not reach overflow by growing
>> > +     // by small sizes; before this happens, we will run out of memory.
>> > +     if (__builtin_constant_p(__n)
>> > +         && __builtin_constant_p(__max_size)
>> > +         && sizeof(ptrdiff_t) >= 8
>> > +         && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)
>>
>> Isn't there a risk of overlow in the __max_size * sizeof(_Tp) computation?
>>
>
> For std::allocator, no, because max_size() is size_t(-1) / sizeof(_Tp).
> But for a user-defined allocator that has a silly max_size(), yes, that's
> possible.
>
> I still don't really understand why any change is needed here. The PR says
> that the current _M_check_len brings in the EH code, but how/why does that
> happen? The __throw_length_error function is not inline, it's defined in
> libstdc++.so, so why isn't it just an extern call? Is the problem that it
> makes _M_check_len potentially-throwing? Because that's basically the
> entire point of _M_check_len: to throw the exception that is required by
> the C++ standard. We need to be very careful about removing that required
> throw! And after we call _M_check_len we call allocate unconditionally, so
> _M_realloc_insert can always throw (we only call _M_realloc_insert in the
> case where we've already decided a reallocation is definitely needed).
>
> Would this version of _M_check_len help?
>
>       size_type
>       _M_check_len(size_type __n, const char* __s) const
>       {
>         const size_type __size = size();
>         const size_type __max_size = max_size();
>
>         if (__is_same(allocator_type, allocator<_Tp>)
>               && __size > __max_size / 2)
>

This check is wrong for C++17 and older standards, because max_size()
changed value in C++20.

In C++17 it was PTRDIFF_MAX / sizeof(T) but in C++20 it's SIZE_MAX /
sizeof(T). So on 32-bit targets using C++17, it's possible a std::vector
could use PTRDIFF_MAX/2 bytes, and then the size <= max_size/2 assumption
would not hold.
Jan Hubicka June 19, 2023, 4:14 p.m. UTC | #7
> On Mon, 19 Jun 2023 at 12:20, Jakub Jelinek wrote:
> 
> > On Mon, Jun 19, 2023 at 01:05:36PM +0200, Jan Hubicka via Gcc-patches
> > wrote:
> > > -     if (max_size() - size() < __n)
> > > -       __throw_length_error(__N(__s));
> > > +     const size_type __max_size = max_size();
> > > +     // On 64bit systems vectors can not reach overflow by growing
> > > +     // by small sizes; before this happens, we will run out of memory.
> > > +     if (__builtin_constant_p(__n)
> > > +         && __builtin_constant_p(__max_size)
> > > +         && sizeof(ptrdiff_t) >= 8
> > > +         && __max_size * sizeof(_Tp) >= ((ptrdiff_t)1 << 60)
> >
> > Isn't there a risk of overlow in the __max_size * sizeof(_Tp) computation?
> >
> 
> For std::allocator, no, because max_size() is size_t(-1) / sizeof(_Tp). But
> for a user-defined allocator that has a silly max_size(), yes, that's
> possible.
> 
> I still don't really understand why any change is needed here. The PR says
> that the current _M_check_len brings in the EH code, but how/why does that
> happen? The __throw_length_error function is not inline, it's defined in
> libstdc++.so, so why isn't it just an extern call? Is the problem that it

It is really quite an interesting performance problem which does affect real
code.  An extra extern call counts (especially since it is seen as
3 calls by the inliner).

This is _M_check_len after early optimizations (so as seen by inline
heuristics):

  <bb 2> [local count: 1073741824]:
  _15 = this_7(D)->D.26656._M_impl.D.25963._M_finish;
  _14 = this_7(D)->D.26656._M_impl.D.25963._M_start;
  _13 = _15 - _14;
  _10 = _13 /[ex] 8;
  _8 = (long unsigned int) _10;
  _1 = 1152921504606846975 - _8;
  __n.3_2 = __n;
  if (_1 < __n.3_2)
    goto <bb 3>; [0.04%]
  else
    goto <bb 4>; [99.96%]

  <bb 3> [local count: 429496]:
  std::__throw_length_error (__s_16(D));

  <bb 4> [local count: 1073312329]:
  D.27696 = _8;
  if (__n.3_2 > _8)
    goto <bb 5>; [34.00%]
  else
    goto <bb 6>; [66.00%]

  <bb 5> [local count: 364926196]:

  <bb 6> [local count: 1073312330]:
  # _18 = PHI <&D.27696(4), &__n(5)>
  _3 = *_18;
  __len_11 = _3 + _8;
  D.27696 ={v} {CLOBBER(eol)};
  if (_8 > __len_11)
    goto <bb 8>; [35.00%]
  else
    goto <bb 7>; [65.00%]

  <bb 7> [local count: 697653013]:
  _5 = MIN_EXPR <__len_11, 1152921504606846975>;

  <bb 8> [local count: 1073312330]:
  # iftmp.4_4 = PHI <1152921504606846975(6), _5(7)>
  return iftmp.4_4;

So there is a lot of code that is essentially semantically equivalent to:

   return __size + MAX_EXPR (__n, __size)

at least with the default allocator.

The early inliner decides that it is not a good idea to inline early.
At this stage we mostly inline calls where we expect the code to get
smaller after inlining, and since the function contains another
uninlinable call, this does not seem likely.

With -O3 we will inline it later at the IPA stage, but only when the code is
considered hot.
With -O2 we decide to keep it offline if the unit contains multiple
calls to the function; otherwise we inline it, since it wins in the code
size estimation model.

The problem is that _M_check_len is used by _M_realloc_insert, which later
feeds the result to the allocator.  There is extra redundancy, since the
allocator can call std::__throw_bad_array_new_length and
std::__throw_bad_alloc for bad sizes, but _M_check_len will never produce
such sizes, which is something we won't work out before inlining it.

As a result _M_realloc_insert is seen as a very large function by the
inliner heuristics (71 instructions).  Functions that are not
declared inline are inlined if smaller than 15 instructions with -O2
and 30 instructions with -O3.  So we don't inline.

This hurts common loops that use a vector as a stack and call push_back
in the inner loop.  Not inlining prevents SRA, and we end up saving and
loading the end-of-vector pointer on every iteration of the loop.

The following testcase:

typedef unsigned int uint32_t;
std::pair<uint32_t, uint32_t> pair;
void
test()
{
        std::vector<std::pair<uint32_t, uint32_t>> stack;
        stack.push_back (pair);
        while (!stack.empty()) {
                std::pair<uint32_t, uint32_t> cur = stack.back();
                stack.pop_back();
                if (!cur.first)
                {
                        cur.second++;
                        stack.push_back (cur);
                }
                if (cur.second > 10000)
                        break;
        }
}
int
main()
{
        for (int i = 0; i < 10000; i++)
          test();
}

This runs for me in 0.5s with _M_realloc_insert not inlined and 0.048s with
_M_realloc_insert inlined.  Clang inlines it even at -O2 and does
0.063s.  I believe this is the reason why the jpegxl library is slower
when built with GCC, and since such loops are quite common in, say, a
DFS walk, I think it is a frequent problem.
> makes _M_check_len potentially-throwing? Because that's basically the
> entire point of _M_check_len: to throw the exception that is required by
> the C++ standard. We need to be very careful about removing that required
> throw! And after we call _M_check_len we call allocate unconditionally, so
> _M_realloc_insert can always throw (we only call _M_realloc_insert in the
> case where we've already decided a reallocation is definitely needed).
> 
> Would this version of _M_check_len help?
> 
>       size_type
>       _M_check_len(size_type __n, const char* __s) const
>       {
>         const size_type __size = size();
>         const size_type __max_size = max_size();
> 
>         if (__is_same(allocator_type, allocator<_Tp>)
>               && __size > __max_size / 2)
>           __builtin_unreachable(); // Assume std::allocator can't fill
> memory.
>         else if (__size > __max_size)
>           __builtin_unreachable();
> 
>         if (__max_size - __size < __n)
>           __throw_length_error(__N(__s));
> 
>         const size_type __len = __size + (std::max)(__size, __n);
>         return (__len < __size || __len > __max_size) ? __max_size : __len;
>       }
> 
> This only applies to std::allocator, not user-defined allocators (because
> we don't know their semantics). It also seems like less of a big hack!

This helps my testcase :)
_M_check_len is now:

size_type std::vector<std::pair<unsigned int, unsigned int> >::_M_check_len (const struct vector * const this, size_type __n, const char * __s)
{
  const size_type __len;
  long unsigned int _1;
  long unsigned int __n.5_2;
  long int _5;
  long unsigned int _6;
  struct pair * _7;
  struct pair * _8;
  long int _12;
  long unsigned int _13;
  long unsigned int _14;
  long unsigned int _15;

  <bb 2> [local count: 1073741824]:
  _8 = this_4(D)->D.26651._M_impl.D.25958._M_finish;
  _7 = this_4(D)->D.26651._M_impl.D.25958._M_start;
  _5 = _8 - _7;
  _12 = _5 /[ex] 8;
  _13 = (long unsigned int) _12;
  _15 = (long unsigned int) _5;
  if (_15 > 4611686018427387896)
    goto <bb 3>; [0.00%]
  else
    goto <bb 4>; [100.00%]

  <bb 3> [count: 0]:
  __builtin_unreachable ();

  <bb 4> [local count: 1073741824]:
  _1 = 1152921504606846975 - _13;
  __n.5_2 = __n;
  if (_1 < __n.5_2)
    goto <bb 5>; [0.04%]
  else
    goto <bb 6>; [99.96%]

  <bb 5> [local count: 429496]:
  std::__throw_length_error (__s_10(D));

  <bb 6> [local count: 1073312329]:
  _14 = MAX_EXPR <__n.5_2, _13>;
  __len_9 = _13 + _14;
  _6 = MIN_EXPR <__len_9, 1152921504606846975>;
  return _6;
}

Originally it counted as 32 instructions in the inliner metrics (loads and
calls count as 2), while now it is 28, and 13 of them are expected to
survive optimizations after inlining.

With the patch we inline _M_realloc_insert at -O3.  With -O2 we still do
not inline it (and I am not convinced we should), but we reduce its code
size from 347 bytes to 227 bytes in the final binary, which also counts.

The estimated size of the offlined _M_realloc_insert is still 67 instructions
instead of the former 71 (still a long way from 15).

How common are nonstandard allocators?

Honza
Jan Hubicka June 20, 2023, 7:50 a.m. UTC | #8
> >
> >       size_type
> >       _M_check_len(size_type __n, const char* __s) const
> >       {
> >         const size_type __size = size();
> >         const size_type __max_size = max_size();
> >
> >         if (__is_same(allocator_type, allocator<_Tp>)
> >               && __size > __max_size / 2)
> >
> 
> This check is wrong for C++17 and older standards, because max_size()
> changed value in C++20.
> 
> In C++17 it was PTRDIFF_MAX / sizeof(T) but in C++20 it's SIZE_MAX /
> sizeof(T). So on 32-bit targets using C++17, it's possible a std::vector
> could use PTRDIFF_MAX/2 bytes, and then the size <= max_size/2 assumption
> would not hold.

Can we go with this perhaps only for 64bit targets?
I am not sure how completely safe this idea is in the 32bit world: I guess
one can have an OS that lets you allocate half of the address space as one
allocation.

Thanks!
Honza
Jan Hubicka June 20, 2023, 8:05 a.m. UTC | #9
> > >
> > >       size_type
> > >       _M_check_len(size_type __n, const char* __s) const
> > >       {
> > >         const size_type __size = size();
> > >         const size_type __max_size = max_size();
> > >
> > >         if (__is_same(allocator_type, allocator<_Tp>)
> > >               && __size > __max_size / 2)
> > >
> > 
> > This check is wrong for C++17 and older standards, because max_size()
> > changed value in C++20.
> > 
> > In C++17 it was PTRDIFF_MAX / sizeof(T) but in C++20 it's SIZE_MAX /
> > sizeof(T). So on 32-bit targets using C++17, it's possible a std::vector
> > could use PTRDIFF_MAX/2 bytes, and then the size <= max_size/2 assumption
> > would not hold.
> 
> Can we go with this perhaps only for 64bit targets?
> I am not sure how completely safe this idea is in 32bit world: I guess
> one can have OS that lets you to allocate half of address space as one
> allocation.

Perhaps something like:
  size > std::min ((uint64_t)__max_size, ((uint64_t)1 << 62) / sizeof (_Tp))
is safe for all allocators and 32bit, so we won't need the __is_same test
and the test for 64bit?
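As a standalone sketch (hypothetical helper; the 2^62 constant is taken from the suggestion above, not from the library):

```cpp
#include <algorithm>
#include <cstdint>

// True when `size` is small enough that doubling-style growth cannot
// overflow byte counts on any target: it must stay under both the
// allocator's max_size and 2^62 / sizeof(T).  Doing the division first
// avoids the overflowing multiplication from the earlier version.
template <class T>
bool growth_clearly_safe(std::uint64_t size, std::uint64_t max_size)
{
    const std::uint64_t bound =
        std::min<std::uint64_t>(max_size,
                                (std::uint64_t(1) << 62) / sizeof(T));
    return size <= bound;
}
```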

Honza
> 
> Thanks!
> Honza
Jakub Jelinek June 20, 2023, 8:07 a.m. UTC | #10
On Tue, Jun 20, 2023 at 09:50:25AM +0200, Jan Hubicka wrote:
> > >
> > >       size_type
> > >       _M_check_len(size_type __n, const char* __s) const
> > >       {
> > >         const size_type __size = size();
> > >         const size_type __max_size = max_size();
> > >
> > >         if (__is_same(allocator_type, allocator<_Tp>)
> > >               && __size > __max_size / 2)
> > >
> > 
> > This check is wrong for C++17 and older standards, because max_size()
> > changed value in C++20.
> > 
> > In C++17 it was PTRDIFF_MAX / sizeof(T) but in C++20 it's SIZE_MAX /
> > sizeof(T). So on 32-bit targets using C++17, it's possible a std::vector
> > could use PTRDIFF_MAX/2 bytes, and then the size <= max_size/2 assumption
> > would not hold.
> 
> Can we go with this perhaps only for 64bit targets?
> I am not sure how completely safe this idea is in 32bit world: I guess
> one can have OS that lets you to allocate half of address space as one
> allocation.

Is it safe even on 64bit targets?  I mean, doesn't say PowerPC already allow
the full 64-bit virtual address space?  The assumption that one can't have
more than half of the virtual address space in allocations is true right now
at least on x86-64, aarch64 and others, but isn't that something that can
change with newer versions of CPUs without the need to recompile
applications (add another level or two of page tables)?
By being hardcoded in libstdc++ headers, those assumptions will be hardcoded
in lots of applications.

	Jakub
Andreas Schwab June 20, 2023, 8:21 a.m. UTC | #11
On Jun 20 2023, Jakub Jelinek via Gcc-patches wrote:

> Is it safe even on 64bit targets?  I mean, doesn't say PowerPC already allow
> full 64-bit virtual address space?  The assumption that one can't have
> more than half of virtual address space allocations is true right now at
> least on x86-64, aarch64 and others, but isn't that something that can
> change with newer versions of CPUs without the need to recompile
> applications (add another level or two of page tables)?

At least s390 can allocate more than half the address space.  That
triggered a failure in gawk.
Jonathan Wakely June 20, 2023, 10:45 a.m. UTC | #12
On Tue, 20 Jun 2023 at 09:21, Andreas Schwab wrote:

> On Jun 20 2023, Jakub Jelinek via Gcc-patches wrote:
>
> > Is it safe even on 64bit targets?  I mean, doesn't say PowerPC already
> allow
> > full 64-bit virtual address space?  The assumption that one can't have
> > more than half of virtual address space allocations is true right now at
> > least on x86-64, aarch64 and others, but isn't that something that can
> > change with newer versions of CPUs without the need to recompile
> > applications (add another level or two of page tables)?
>
> At least s390 can allocate more than half the address space.  That
> triggered a failure in gawk.
>

Is PTRDIFF_MAX large enough to represent the difference between any two
pointers?

What we're considering for libstdc++ is treating PTRDIFF_MAX as an upper
limit on allocation size. If there are targets that can really allocate a
2^63 byte array, they won't be able to represent the difference between the
first element and the last element unless ptrdiff_t is wider than 64 bits.
Jonathan Wakely June 20, 2023, 10:50 a.m. UTC | #13
On Tue, 20 Jun 2023 at 11:45, Jonathan Wakely <jwakely@redhat.com> wrote:

> On Tue, 20 Jun 2023 at 09:21, Andreas Schwab wrote:
>
>> On Jun 20 2023, Jakub Jelinek via Gcc-patches wrote:
>>
>> > Is it safe even on 64bit targets?  I mean, doesn't say PowerPC already
>> allow
>> > full 64-bit virtual address space?  The assumption that one can't have
>> > more than half of virtual address space allocations is true right now at
>> > least on x86-64, aarch64 and others, but isn't that something that can
>> > change with newer versions of CPUs without the need to recompile
>> > applications (add another level or two of page tables)?
>>
>> At least s390 can allocate more than half the address space.  That
>> triggered a failure in gawk.
>>
>
> Is PTRDIFF_MAX large enough to represent the difference between any two
> pointers?
>
> What we're considering for libstdc++ is treating PTRDIFF_MAX as an upper
> limit on allocation size. If there are targets that can really allocate a
> 2^63 byte array, they won't be able to represent the difference between the
> first element and the last element unless ptrdiff_t is wider than 64 bits.
>

Of course if we're talking about 32-bit targets then you only need a 64-bit
ptrdiff_t to allow arrays larger than 2^31 bytes. In any case, PTRDIFF_MAX
is still an upper limit (although for a 64-bit ptrdiff_t and a 32-bit
address space, it's not a useful limit, because it's much much larger than
the real limit).
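The arithmetic behind that limit can be spelled out with a small illustrative helper (not library code):

```cpp
#include <cstddef>
#include <limits>

// Largest element count for which the byte distance between the first
// element and one-past-the-last still fits in ptrdiff_t; this is the cap
// the discussion above treats as the effective allocation limit.
template <class T>
constexpr std::size_t max_elems_for_ptrdiff()
{
    return static_cast<std::size_t>(
               std::numeric_limits<std::ptrdiff_t>::max()) / sizeof(T);
}
```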

Patch

diff --git a/libstdc++-v3/include/bits/stl_vector.h b/libstdc++-v3/include/bits/stl_vector.h
index 70ced3d101f..3ad59fe3e2b 100644
--- a/libstdc++-v3/include/bits/stl_vector.h
+++ b/libstdc++-v3/include/bits/stl_vector.h
@@ -1895,11 +1895,22 @@  _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
       size_type
       _M_check_len(size_type __n, const char* __s) const
       {
-	if (max_size() - size() < __n)
-	  __throw_length_error(__N(__s));
+	// On 64bit systems vectors of small sizes can not
+	// reach overflow by growing by small sizes; before
+	// this happens, we will run out of memory.
+	if (__builtin_constant_p (sizeof (_Tp))
+	    && __builtin_constant_p (__n)
+	    && sizeof (ptrdiff_t) >= 8
+	    && __n < max_size () / 2)
+	  return size() + (std::max)(size(), __n);
+	else
+	  {
+	    if (max_size() - size() < __n)
+	      __throw_length_error(__N(__s));
 
-	const size_type __len = size() + (std::max)(size(), __n);
-	return (__len < size() || __len > max_size()) ? max_size() : __len;
+	    const size_type __len = size() + (std::max)(size(), __n);
+	    return (__len < size() || __len > max_size()) ? max_size() : __len;
+	  }
       }
 
       // Called by constructors to check initial size.