diff mbox series

[C++] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)

Message ID CALnYPH-iONCpF595=MpWP=w_+4xGqU11=Br4Yh=g=wdf2Ken3A@mail.gmail.com
State New
Headers show
Series [C++] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938) | expand

Commit Message

chang jc Jan. 19, 2018, 8:02 a.m. UTC
Current std::inplace_merge() suffers from performance issue by inefficient
logic under limited memory,

It leads to performance downgrade.

Please help to review it.





Thanks

Comments

Jason Merrill Jan. 19, 2018, 8:05 p.m. UTC | #1
This is a libstdc++ bug and patch, not the C++ front end.  So I'm
adding the libstdc++ list to CC.

On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote:
> Current std::inplace_merge() suffers from performance issue by inefficient
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h     (revision 256871)
> +++ include/bits/stl_algo.h     (working copy)
> @@ -2437,7 +2437,7 @@
>           _BidirectionalIterator __second_cut = __middle;
>           _Distance __len11 = 0;
>           _Distance __len22 = 0;
> -         if (__len1 > __len2)
> +         if (__len1 < __len2)
>             {
>               __len11 = __len1 / 2;
>               std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
>        const _DistanceType __len1 = std::distance(__first, __middle);
>        const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
>        typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
> -      _TmpBuf __buf(__first, __last);
> -
> +      _BidirectionalIterator __start, __end;
> +      if (__len1 < __len2) {
> +       __start = __first; __end = __middle;
> +      } else {
> +       __start = __middle; __end = __last;
> +      }
> +      _TmpBuf __buf(__start, ___end);
>        if (__buf.begin() == 0)
>         std::__merge_without_buffer
>           (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h  (revision 256871)
> +++ include/bits/stl_tempbuf.h  (working copy)
> @@ -95,7 +95,7 @@
>                                                         std::nothrow));
>           if (__tmp != 0)
>             return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> -         __len /= 2;
> +         __len = (__len + 1) / 2;
>         }
>        return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>      }
>
>
>
>
> Thanks
chang jc Jan. 25, 2018, 10:37 p.m. UTC | #2
Hi:

1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
__len = (__len == 1) ? 0 : ((__len + 1) / 2);

2. The coding gain has been shown  PR c++/83938; I re-post here




  21
  20
  19
  18
  17
  16


  0.471136
  0.625695
  0.767262
  0.907461
  1.04838
  1.19508


  0.340845
  0.48651
  0.639139
  0.770133
  0.898454
  1.04632

it means when Merge [0, 4325376, 16777216); A is a sorted integer with
4325376 & B with 12451840 elements, total with 16M integers

The proposed method has the speed up under given buffer size =, ex
2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
given sizof(int)*64K bytes.

3. As your suggestion, _TmpBuf __buf should be rewrite.

4. It represents a fact that the intuitive idea to split from larger
part is wrong.

For example, if you have an input sorted array A & B, A has 8 integers
& B has 24 integers. Given tmp buffer whose capacity = 4 integers.

Current it tries to split from B, right?

Then we have:

A1 | A2  B1 | B2

B1 & B2 has 12 integers each, right?

Current algorithm selects pivot as 13th integer from B, right?

If the corresponding upper bound of A is 6th integer.

Then it split in

A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12

After rotate, we have two arrays to merge

[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]

Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.

Sadly, [A1 = 5 | B1 = 12] CANNOT.

So we do rotate again, split & merge the two split arrays from [A1 = 5
| B1 = 12] again.


But wait, if we always split from the smaller one instead of larger one.

After rotate, it promises two split arrays both contain ceiling[small/2].

Since tmp buffer also allocate by starting from sizeof(small) &
recursively downgrade by ceiling[small/2^(# of fail allocate)].

It means the allocated tmp buffer promises to be sufficient at the
level of (# of fail allocate).

Instead, you can see if split from large at level (# of fail allocate)
several split array still CANNOT use  tmp buffer to do buffered merge.


As you know, buffered merge is far speed then (split, rotate, and
merge two sub-arrays) (PR c++/83938 gives the profiling figures),

the way should provide speedup.


Thanks.










On 24/01/2018 18:23, François Dumont wrote:

Hi


    It sounds like a very sensitive change to make but nothing worth figures.
Do you have any bench showing the importance of the gain ?

    At least the memory usage optimization is obvious.

On 19/01/2018 10:43, chang jc wrote:

Current std::inplace_merge() suffers from performance issue by inefficient

logic under limited memory,

It leads to performance downgrade.

Please help to review it.

Index: include/bits/stl_algo.h
===================================================================
--- include/bits/stl_algo.h    (revision 256871)
+++ include/bits/stl_algo.h    (working copy)
@@ -2437,7 +2437,7 @@
        _BidirectionalIterator __second_cut = __middle;
        _Distance __len11 = 0;
        _Distance __len22 = 0;
-      if (__len1 > __len2)
+      if (__len1 < __len2)
          {
            __len11 = __len1 / 2;
            std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
        const _DistanceType __len1 = std::distance(__first, __middle);
        const _DistanceType __len2 = std::distance(__middle, __last);

+

        typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
_TmpBuf;

-      _TmpBuf __buf(__first, __last);
-
+      _BidirectionalIterator __start, __end;
+      if (__len1 < __len2) {
+    __start = __first; __end = __middle;
+      } else {
+    __start = __middle; __end = __last;
+      }
+      _TmpBuf __buf(__start, ___end);

Note another optimization, we could introduce a _Temporary_buffer<> constructor
in order to write:

_TmpBuf __buf(std::min(__len1, __len2), __first);

So that std::distance do not need to be called again.

        if (__buf.begin() == 0)
      std::__merge_without_buffer
        (__first, __middle, __last, __len1, __len2, __comp);
Index: include/bits/stl_tempbuf.h
===================================================================
--- include/bits/stl_tempbuf.h    (revision 256871)
+++ include/bits/stl_tempbuf.h    (working copy)
@@ -95,7 +95,7 @@
                              std::nothrow));
        if (__tmp != 0)
          return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-      __len /= 2;
+      __len = (__len + 1) / 2;

This part is problematic, if __len is 1 and allocation fails it will loop
forever.

It doesn't seem really necessary for your patch.


2018-01-20 4:05 GMT+08:00 Jason Merrill <jason@redhat.com>:

> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
> adding the libstdc++ list to CC.
>
> On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote:
> > Current std::inplace_merge() suffers from performance issue by
> inefficient
> > logic under limited memory,
> >
> > It leads to performance downgrade.
> >
> > Please help to review it.
> >
> > Index: include/bits/stl_algo.h
> > ===================================================================
> > --- include/bits/stl_algo.h     (revision 256871)
> > +++ include/bits/stl_algo.h     (working copy)
> > @@ -2437,7 +2437,7 @@
> >           _BidirectionalIterator __second_cut = __middle;
> >           _Distance __len11 = 0;
> >           _Distance __len22 = 0;
> > -         if (__len1 > __len2)
> > +         if (__len1 < __len2)
> >             {
> >               __len11 = __len1 / 2;
> >               std::advance(__first_cut, __len11);
> > @@ -2539,9 +2539,15 @@
> >        const _DistanceType __len1 = std::distance(__first, __middle);
> >        const _DistanceType __len2 = std::distance(__middle, __last);
> >
> > +
> >        typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
> > -      _TmpBuf __buf(__first, __last);
> > -
> > +      _BidirectionalIterator __start, __end;
> > +      if (__len1 < __len2) {
> > +       __start = __first; __end = __middle;
> > +      } else {
> > +       __start = __middle; __end = __last;
> > +      }
> > +      _TmpBuf __buf(__start, ___end);
> >        if (__buf.begin() == 0)
> >         std::__merge_without_buffer
> >           (__first, __middle, __last, __len1, __len2, __comp);
> > Index: include/bits/stl_tempbuf.h
> > ===================================================================
> > --- include/bits/stl_tempbuf.h  (revision 256871)
> > +++ include/bits/stl_tempbuf.h  (working copy)
> > @@ -95,7 +95,7 @@
> >                                                         std::nothrow));
> >           if (__tmp != 0)
> >             return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> > -         __len /= 2;
> > +         __len = (__len + 1) / 2;
> >         }
> >        return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
> >      }
> >
> >
> >
> >
> > Thanks
>
François Dumont Jan. 30, 2018, 9:41 p.m. UTC | #3
Your change looks good, don't hesitate to re-submit a proper patch with 
the point 1 below fix. You should add gcc-patches@gcc.gnu.org when you 
do so.

Remaining questions will be:
Copyright assignment, your change seems limited enough to avoid it but 
you will need an official maintainer confirmation.

Stage 1: We are in a Fix-Only development phase. Your change brings such 
an important improvement that it could be considered as a bug fix but 
once again an official maintainer will have to decide. Otherwise you 
will have to wait for a couple of month to see you patch committed.

François

On 25/01/2018 23:37, chang jc wrote:
> Hi:
>
> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>
> 2. The coding gain has been shown  PR c++/83938; I re-post here
>
>
>
>
>    21
>    20
>    19
>    18
>    17
>    16
>
>
>    0.471136
>    0.625695
>    0.767262
>    0.907461
>    1.04838
>    1.19508
>
>
>    0.340845
>    0.48651
>    0.639139
>    0.770133
>    0.898454
>    1.04632
>
> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
> 4325376 & B with 12451840 elements, total with 16M integers
>
> The proposed method has the speed up under given buffer size =, ex
> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
> given sizof(int)*64K bytes.
>
> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>
> 4. It represents a fact that the intuitive idea to split from larger
> part is wrong.
>
> For example, if you have an input sorted array A & B, A has 8 integers
> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>
> Current it tries to split from B, right?
>
> Then we have:
>
> A1 | A2  B1 | B2
>
> B1 & B2 has 12 integers each, right?
>
> Current algorithm selects pivot as 13th integer from B, right?
>
> If the corresponding upper bound of A is 6th integer.
>
> Then it split in
>
> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>
> After rotate, we have two arrays to merge
>
> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>
> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>
> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>
> So we do rotate again, split & merge the two split arrays from [A1 = 5
> | B1 = 12] again.
>
>
> But wait, if we always split from the smaller one instead of larger one.
>
> After rotate, it promises two split arrays both contain ceiling[small/2].
>
> Since tmp buffer also allocate by starting from sizeof(small) &
> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>
> It means the allocated tmp buffer promises to be sufficient at the
> level of (# of fail allocate).
>
> Instead, you can see if split from large at level (# of fail allocate)
> several split array still CANNOT use  tmp buffer to do buffered merge.
>
>
> As you know, buffered merge is far speed then (split, rotate, and
> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>
> the way should provide speedup.
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
> On 24/01/2018 18:23, François Dumont wrote:
>
> Hi
>
>
>      It sounds like a very sensitive change to make but nothing worth figures.
> Do you have any bench showing the importance of the gain ?
>
>      At least the memory usage optimization is obvious.
>
> On 19/01/2018 10:43, chang jc wrote:
>
> Current std::inplace_merge() suffers from performance issue by inefficient
>
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h    (revision 256871)
> +++ include/bits/stl_algo.h    (working copy)
> @@ -2437,7 +2437,7 @@
>          _BidirectionalIterator __second_cut = __middle;
>          _Distance __len11 = 0;
>          _Distance __len22 = 0;
> -      if (__len1 > __len2)
> +      if (__len1 < __len2)
>            {
>              __len11 = __len1 / 2;
>              std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
>          const _DistanceType __len1 = std::distance(__first, __middle);
>          const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
>
>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
>
> -      _TmpBuf __buf(__first, __last);
> -
> +      _BidirectionalIterator __start, __end;
> +      if (__len1 < __len2) {
> +    __start = __first; __end = __middle;
> +      } else {
> +    __start = __middle; __end = __last;
> +      }
> +      _TmpBuf __buf(__start, ___end);
>
> Note another optimization, we could introduce a _Temporary_buffer<> constructor
> in order to write:
>
> _TmpBuf __buf(std::min(__len1, __len2), __first);
>
> So that std::distance do not need to be called again.
>
>          if (__buf.begin() == 0)
>        std::__merge_without_buffer
>          (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h    (revision 256871)
> +++ include/bits/stl_tempbuf.h    (working copy)
> @@ -95,7 +95,7 @@
>                                std::nothrow));
>          if (__tmp != 0)
>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> -      __len /= 2;
> +      __len = (__len + 1) / 2;
>
> This part is problematic, if __len is 1 and allocation fails it will loop
> forever.
>
> It doesn't seem really necessary for your patch.
>
>
> 2018-01-20 4:05 GMT+08:00 Jason Merrill <jason@redhat.com>:
>
>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>> adding the libstdc++ list to CC.
>>
>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote:
>>> Current std::inplace_merge() suffers from performance issue by
>> inefficient
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h     (revision 256871)
>>> +++ include/bits/stl_algo.h     (working copy)
>>> @@ -2437,7 +2437,7 @@
>>>            _BidirectionalIterator __second_cut = __middle;
>>>            _Distance __len11 = 0;
>>>            _Distance __len22 = 0;
>>> -         if (__len1 > __len2)
>>> +         if (__len1 < __len2)
>>>              {
>>>                __len11 = __len1 / 2;
>>>                std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>>         const _DistanceType __len1 = std::distance(__first, __middle);
>>>         const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>> _TmpBuf;
>>> -      _TmpBuf __buf(__first, __last);
>>> -
>>> +      _BidirectionalIterator __start, __end;
>>> +      if (__len1 < __len2) {
>>> +       __start = __first; __end = __middle;
>>> +      } else {
>>> +       __start = __middle; __end = __last;
>>> +      }
>>> +      _TmpBuf __buf(__start, ___end);
>>>         if (__buf.begin() == 0)
>>>          std::__merge_without_buffer
>>>            (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h  (revision 256871)
>>> +++ include/bits/stl_tempbuf.h  (working copy)
>>> @@ -95,7 +95,7 @@
>>>                                                          std::nothrow));
>>>            if (__tmp != 0)
>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> -         __len /= 2;
>>> +         __len = (__len + 1) / 2;
>>>          }
>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>       }
>>>
>>>
>>>
>>>
>>> Thanks
François Dumont June 6, 2018, 4:39 p.m. UTC | #4
Hi

     I review and rework this proposal. I noticed that the same idea to 
limit buffer size within inplace_merge also apply to stable_sort.

     I also change the decision when buffer is too small to consider the 
buffer size rather than going through successive cuts of the original 
ranges. This way we won't cut the range more than necessary. Note that 
if you prefer this second part could be committed seperatly.

     PR libstdc++/83938
     * include/bits/stl_algo.h:
     (__stable_partition_adaptive): When buffer to small cut range using
     buffer size.
     (__inplace_merge): Take temporary buffer length from smallest range.
     (__merge_adaptive): When buffer too small consider smallest range first
     and cut based on buffer size.
     (__stable_sort_adaptive): When buffer too small cut based on buffer
     size.
     (__stable_sort): Limit temporary buffer length.
     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
     to reduce requested buffer length on allocation failure.

Tested under Linux x86_64.

Ok to commit ?

François


On 25/01/2018 23:37, chang jc wrote:
> Hi:
>
> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>
> 2. The coding gain has been shown  PR c++/83938; I re-post here
>
>
>
>
>    21
>    20
>    19
>    18
>    17
>    16
>
>
>    0.471136
>    0.625695
>    0.767262
>    0.907461
>    1.04838
>    1.19508
>
>
>    0.340845
>    0.48651
>    0.639139
>    0.770133
>    0.898454
>    1.04632
>
> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
> 4325376 & B with 12451840 elements, total with 16M integers
>
> The proposed method has the speed up under given buffer size =, ex
> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
> given sizof(int)*64K bytes.
>
> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>
> 4. It represents a fact that the intuitive idea to split from larger
> part is wrong.
>
> For example, if you have an input sorted array A & B, A has 8 integers
> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>
> Current it tries to split from B, right?
>
> Then we have:
>
> A1 | A2  B1 | B2
>
> B1 & B2 has 12 integers each, right?
>
> Current algorithm selects pivot as 13th integer from B, right?
>
> If the corresponding upper bound of A is 6th integer.
>
> Then it split in
>
> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>
> After rotate, we have two arrays to merge
>
> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>
> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>
> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>
> So we do rotate again, split & merge the two split arrays from [A1 = 5
> | B1 = 12] again.
>
>
> But wait, if we always split from the smaller one instead of larger one.
>
> After rotate, it promises two split arrays both contain ceiling[small/2].
>
> Since tmp buffer also allocate by starting from sizeof(small) &
> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>
> It means the allocated tmp buffer promises to be sufficient at the
> level of (# of fail allocate).
>
> Instead, you can see if split from large at level (# of fail allocate)
> several split array still CANNOT use  tmp buffer to do buffered merge.
>
>
> As you know, buffered merge is far speed then (split, rotate, and
> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>
> the way should provide speedup.
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
> On 24/01/2018 18:23, François Dumont wrote:
>
> Hi
>
>
>      It sounds like a very sensitive change to make but nothing worth figures.
> Do you have any bench showing the importance of the gain ?
>
>      At least the memory usage optimization is obvious.
>
> On 19/01/2018 10:43, chang jc wrote:
>
> Current std::inplace_merge() suffers from performance issue by inefficient
>
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h    (revision 256871)
> +++ include/bits/stl_algo.h    (working copy)
> @@ -2437,7 +2437,7 @@
>          _BidirectionalIterator __second_cut = __middle;
>          _Distance __len11 = 0;
>          _Distance __len22 = 0;
> -      if (__len1 > __len2)
> +      if (__len1 < __len2)
>            {
>              __len11 = __len1 / 2;
>              std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
>          const _DistanceType __len1 = std::distance(__first, __middle);
>          const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
>
>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
>
> -      _TmpBuf __buf(__first, __last);
> -
> +      _BidirectionalIterator __start, __end;
> +      if (__len1 < __len2) {
> +    __start = __first; __end = __middle;
> +      } else {
> +    __start = __middle; __end = __last;
> +      }
> +      _TmpBuf __buf(__start, ___end);
>
> Note another optimization, we could introduce a _Temporary_buffer<> constructor
> in order to write:
>
> _TmpBuf __buf(std::min(__len1, __len2), __first);
>
> So that std::distance do not need to be called again.
>
>          if (__buf.begin() == 0)
>        std::__merge_without_buffer
>          (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h    (revision 256871)
> +++ include/bits/stl_tempbuf.h    (working copy)
> @@ -95,7 +95,7 @@
>                                std::nothrow));
>          if (__tmp != 0)
>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> -      __len /= 2;
> +      __len = (__len + 1) / 2;
>
> This part is problematic, if __len is 1 and allocation fails it will loop
> forever.
>
> It doesn't seem really necessary for your patch.
>
>
> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>
>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>> adding the libstdc++ list to CC.
>>
>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com>  wrote:
>>> Current std::inplace_merge() suffers from performance issue by
>> inefficient
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h     (revision 256871)
>>> +++ include/bits/stl_algo.h     (working copy)
>>> @@ -2437,7 +2437,7 @@
>>>            _BidirectionalIterator __second_cut = __middle;
>>>            _Distance __len11 = 0;
>>>            _Distance __len22 = 0;
>>> -         if (__len1 > __len2)
>>> +         if (__len1 < __len2)
>>>              {
>>>                __len11 = __len1 / 2;
>>>                std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>>         const _DistanceType __len1 = std::distance(__first, __middle);
>>>         const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>> _TmpBuf;
>>> -      _TmpBuf __buf(__first, __last);
>>> -
>>> +      _BidirectionalIterator __start, __end;
>>> +      if (__len1 < __len2) {
>>> +       __start = __first; __end = __middle;
>>> +      } else {
>>> +       __start = __middle; __end = __last;
>>> +      }
>>> +      _TmpBuf __buf(__start, ___end);
>>>         if (__buf.begin() == 0)
>>>          std::__merge_without_buffer
>>>            (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h  (revision 256871)
>>> +++ include/bits/stl_tempbuf.h  (working copy)
>>> @@ -95,7 +95,7 @@
>>>                                                          std::nothrow));
>>>            if (__tmp != 0)
>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> -         __len /= 2;
>>> +         __len = (__len + 1) / 2;
>>>          }
>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>       }
>>>
>>>
>>>
>>>
>>> Thanks
diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index d8488a7..bf5b25f 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -1577,15 +1577,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	}
 
       _ForwardIterator __middle = __first;
-      std::advance(__middle, __len / 2);
+      std::advance(__middle, __buffer_size);
       _ForwardIterator __left_split =
-	std::__stable_partition_adaptive(__first, __middle, __pred,
-					 __len / 2, __buffer,
-					 __buffer_size);
+	std::__stable_partition_adaptive(__first, __middle,
+					 __pred, __buffer_size,
+					 __buffer, __buffer_size);
 
       // Advance past true-predicate values to satisfy this
       // function's preconditions.
-      _Distance __right_len = __len - __len / 2;
+      _Distance __right_len = __len - __buffer_size;
       _ForwardIterator __right_split =
 	std::__find_if_not_n(__middle, __right_len, __pred);
 
@@ -2432,9 +2432,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  _BidirectionalIterator __second_cut = __middle;
 	  _Distance __len11 = 0;
 	  _Distance __len22 = 0;
-	  if (__len1 > __len2)
+	  if (__len1 < __len2)
 	    {
-	      __len11 = __len1 / 2;
+	      __len11 = __buffer_size;
 	      std::advance(__first_cut, __len11);
 	      __second_cut
 		= std::__lower_bound(__middle, __last, *__first_cut,
@@ -2443,7 +2443,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	    }
 	  else
 	    {
-	      __len22 = __len2 / 2;
+	      __len22 = __buffer_size;
 	      std::advance(__second_cut, __len22);
 	      __first_cut
 		= std::__upper_bound(__first, __middle, *__second_cut,
@@ -2453,14 +2453,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
 	  _BidirectionalIterator __new_middle
 	    = std::__rotate_adaptive(__first_cut, __middle, __second_cut,
-				     __len1 - __len11, __len22, __buffer,
-				     __buffer_size);
-	  std::__merge_adaptive(__first, __first_cut, __new_middle, __len11,
-				__len22, __buffer, __buffer_size, __comp);
+				     __len1 - __len11, __len22,
+				     __buffer, __buffer_size);
+	  std::__merge_adaptive(__first, __first_cut, __new_middle,
+				__len11, __len22,
+				__buffer, __buffer_size, __comp);
 	  std::__merge_adaptive(__new_middle, __second_cut, __last,
-				__len1 - __len11,
-				__len2 - __len22, __buffer,
-				__buffer_size, __comp);
+				__len1 - __len11, __len2 - __len22,
+				__buffer, __buffer_size, __comp);
 	}
     }
 
@@ -2535,7 +2535,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       const _DistanceType __len2 = std::distance(__middle, __last);
 
       typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, __len1 + __len2);
+      _TmpBuf __buf(__first, std::min(__len1, __len2));
 
       if (__buf.begin() == 0)
 	std::__merge_without_buffer
@@ -2730,9 +2730,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 			   _Compare __comp)
     {
       const _Distance __len = (__last - __first + 1) / 2;
-      const _RandomAccessIterator __middle = __first + __len;
+      _RandomAccessIterator __middle;
       if (__len > __buffer_size)
 	{
+	  __middle = __first + __buffer_size + __buffer_size;
 	  std::__stable_sort_adaptive(__first, __middle, __buffer,
 				      __buffer_size, __comp);
 	  std::__stable_sort_adaptive(__middle, __last, __buffer,
@@ -2740,9 +2741,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	}
       else
 	{
+	  __middle = __first + __len;
 	  std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
 	  std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
 	}
+
       std::__merge_adaptive(__first, __middle, __last,
 			    _Distance(__middle - __first),
 			    _Distance(__last - __middle),
@@ -4992,8 +4995,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
       typedef typename iterator_traits<_RandomAccessIterator>::difference_type
 	_DistanceType;
 
+      if (__first == __last)
+	return;
+
       typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, std::distance(__first, __last));
+      _TmpBuf __buf(__first, (__last - __first + 1) / 2);
 
       if (__buf.begin() == 0)
 	std::__inplace_stable_sort(__first, __last, __comp);
diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
index 159ee27..7fb97d0 100644
--- a/libstdc++-v3/include/bits/stl_tempbuf.h
+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 							std::nothrow));
 	  if (__tmp != 0)
 	    return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-	  __len /= 2;
+	  __len = __len == 1 ? 0 : ((__len + 1) / 2);
 	}
       return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
     }
François Dumont July 24, 2018, 10:22 a.m. UTC | #5
Hi

     Any chance to review this patch ?

François


On 06/06/2018 18:39, François Dumont wrote:
> Hi
>
>     I review and rework this proposal. I noticed that the same idea to 
> limit buffer size within inplace_merge also apply to stable_sort.
>
>     I also change the decision when buffer is too small to consider 
> the buffer size rather than going through successive cuts of the 
> original ranges. This way we won't cut the range more than necessary. 
> Note that if you prefer this second part could be committed seperatly.
>
>     PR libstdc++/83938
>     * include/bits/stl_algo.h:
>     (__stable_partition_adaptive): When buffer to small cut range using
>     buffer size.
>     (__inplace_merge): Take temporary buffer length from smallest range.
>     (__merge_adaptive): When buffer too small consider smallest range 
> first
>     and cut based on buffer size.
>     (__stable_sort_adaptive): When buffer too small cut based on buffer
>     size.
>     (__stable_sort): Limit temporary buffer length.
>     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
>     to reduce requested buffer length on allocation failure.
>
> Tested under Linux x86_64.
>
> Ok to commit ?
>
> François
>
>
> On 25/01/2018 23:37, chang jc wrote:
>> Hi:
>>
>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>
>> 2. The coding gain has been shown  PR c++/83938; I re-post here
>>
>>
>>
>>
>>    21
>>    20
>>    19
>>    18
>>    17
>>    16
>>
>>
>>    0.471136
>>    0.625695
>>    0.767262
>>    0.907461
>>    1.04838
>>    1.19508
>>
>>
>>    0.340845
>>    0.48651
>>    0.639139
>>    0.770133
>>    0.898454
>>    1.04632
>>
>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>> 4325376 & B with 12451840 elements, total with 16M integers
>>
>> The proposed method has the speed up under given buffer size =, ex
>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>> given sizof(int)*64K bytes.
>>
>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>
>> 4. It represents a fact that the intuitive idea to split from larger
>> part is wrong.
>>
>> For example, if you have an input sorted array A & B, A has 8 integers
>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>
>> Current it tries to split from B, right?
>>
>> Then we have:
>>
>> A1 | A2  B1 | B2
>>
>> B1 & B2 has 12 integers each, right?
>>
>> Current algorithm selects pivot as 13th integer from B, right?
>>
>> If the corresponding upper bound of A is 6th integer.
>>
>> Then it split in
>>
>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>
>> After rotate, we have two arrays to merge
>>
>> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>>
>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>
>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>
>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>> | B1 = 12] again.
>>
>>
>> But wait, if we always split from the smaller one instead of larger one.
>>
>> After rotate, it promises two split arrays both contain 
>> ceiling[small/2].
>>
>> Since tmp buffer also allocate by starting from sizeof(small) &
>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>
>> It means the allocated tmp buffer promises to be sufficient at the
>> level of (# of fail allocate).
>>
>> Instead, you can see if split from large at level (# of fail allocate)
>> several split array still CANNOT use  tmp buffer to do buffered merge.
>>
>>
>> As you know, buffered merge is far speed then (split, rotate, and
>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>
>> the way should provide speedup.
>>
>>
>> Thanks.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 24/01/2018 18:23, François Dumont wrote:
>>
>> Hi
>>
>>
>>      It sounds like a very sensitive change to make but nothing worth 
>> figures.
>> Do you have any bench showing the importance of the gain ?
>>
>>      At least the memory usage optimization is obvious.
>>
>> On 19/01/2018 10:43, chang jc wrote:
>>
>> Current std::inplace_merge() suffers from performance issue by 
>> inefficient
>>
>> logic under limited memory,
>>
>> It leads to performance downgrade.
>>
>> Please help to review it.
>>
>> Index: include/bits/stl_algo.h
>> ===================================================================
>> --- include/bits/stl_algo.h    (revision 256871)
>> +++ include/bits/stl_algo.h    (working copy)
>> @@ -2437,7 +2437,7 @@
>>          _BidirectionalIterator __second_cut = __middle;
>>          _Distance __len11 = 0;
>>          _Distance __len22 = 0;
>> -      if (__len1 > __len2)
>> +      if (__len1 < __len2)
>>            {
>>              __len11 = __len1 / 2;
>>              std::advance(__first_cut, __len11);
>> @@ -2539,9 +2539,15 @@
>>          const _DistanceType __len1 = std::distance(__first, __middle);
>>          const _DistanceType __len2 = std::distance(__middle, __last);
>>
>> +
>>
>>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>> _TmpBuf;
>>
>> -      _TmpBuf __buf(__first, __last);
>> -
>> +      _BidirectionalIterator __start, __end;
>> +      if (__len1 < __len2) {
>> +    __start = __first; __end = __middle;
>> +      } else {
>> +    __start = __middle; __end = __last;
>> +      }
>> +      _TmpBuf __buf(__start, ___end);
>>
>> Note another optimization, we could introduce a _Temporary_buffer<> 
>> constructor
>> in order to write:
>>
>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>
>> So that std::distance do not need to be called again.
>>
>>          if (__buf.begin() == 0)
>>        std::__merge_without_buffer
>>          (__first, __middle, __last, __len1, __len2, __comp);
>> Index: include/bits/stl_tempbuf.h
>> ===================================================================
>> --- include/bits/stl_tempbuf.h    (revision 256871)
>> +++ include/bits/stl_tempbuf.h    (working copy)
>> @@ -95,7 +95,7 @@
>>                                std::nothrow));
>>          if (__tmp != 0)
>>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>> -      __len /= 2;
>> +      __len = (__len + 1) / 2;
>>
>> This part is problematic, if __len is 1 and allocation fails it will 
>> loop
>> forever.
>>
>> It doesn't seem really necessary for your patch.
>>
>>
>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>
>>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>>> adding the libstdc++ list to CC.
>>>
>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com>  wrote:
>>>> Current std::inplace_merge() suffers from performance issue by
>>> inefficient
>>>> logic under limited memory,
>>>>
>>>> It leads to performance downgrade.
>>>>
>>>> Please help to review it.
>>>>
>>>> Index: include/bits/stl_algo.h
>>>> ===================================================================
>>>> --- include/bits/stl_algo.h     (revision 256871)
>>>> +++ include/bits/stl_algo.h     (working copy)
>>>> @@ -2437,7 +2437,7 @@
>>>>            _BidirectionalIterator __second_cut = __middle;
>>>>            _Distance __len11 = 0;
>>>>            _Distance __len22 = 0;
>>>> -         if (__len1 > __len2)
>>>> +         if (__len1 < __len2)
>>>>              {
>>>>                __len11 = __len1 / 2;
>>>>                std::advance(__first_cut, __len11);
>>>> @@ -2539,9 +2539,15 @@
>>>>         const _DistanceType __len1 = std::distance(__first, __middle);
>>>>         const _DistanceType __len2 = std::distance(__middle, __last);
>>>>
>>>> +
>>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>> _TmpBuf;
>>>> -      _TmpBuf __buf(__first, __last);
>>>> -
>>>> +      _BidirectionalIterator __start, __end;
>>>> +      if (__len1 < __len2) {
>>>> +       __start = __first; __end = __middle;
>>>> +      } else {
>>>> +       __start = __middle; __end = __last;
>>>> +      }
>>>> +      _TmpBuf __buf(__start, ___end);
>>>>         if (__buf.begin() == 0)
>>>>          std::__merge_without_buffer
>>>>            (__first, __middle, __last, __len1, __len2, __comp);
>>>> Index: include/bits/stl_tempbuf.h
>>>> ===================================================================
>>>> --- include/bits/stl_tempbuf.h  (revision 256871)
>>>> +++ include/bits/stl_tempbuf.h  (working copy)
>>>> @@ -95,7 +95,7 @@
>>>> std::nothrow));
>>>>            if (__tmp != 0)
>>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>> -         __len /= 2;
>>>> +         __len = (__len + 1) / 2;
>>>>          }
>>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>       }
>>>>
>>>>
>>>>
>>>>
>>>> Thanks
>
>
François Dumont Aug. 21, 2018, 8:34 p.m. UTC | #6
I missed a test that was failing because of this patch. So I revert a 
small part of it and here is the new proposal.

Tested under Linux x86_64, ok to commit ?

François


On 24/07/2018 12:22, François Dumont wrote:
> Hi
>
>     Any chance to review this patch ?
>
> François
>
>
> On 06/06/2018 18:39, François Dumont wrote:
>> Hi
>>
>>     I review and rework this proposal. I noticed that the same idea 
>> to limit buffer size within inplace_merge also apply to stable_sort.
>>
>>     I also change the decision when buffer is too small to consider 
>> the buffer size rather than going through successive cuts of the 
>> original ranges. This way we won't cut the range more than necessary. 
>> Note that if you prefer this second part could be committed seperatly.
>>
>>     PR libstdc++/83938
>>     * include/bits/stl_algo.h:
>>     (__stable_partition_adaptive): When buffer to small cut range using
>>     buffer size.
>>     (__inplace_merge): Take temporary buffer length from smallest range.
>>     (__merge_adaptive): When buffer too small consider smallest range 
>> first
>>     and cut based on buffer size.
>>     (__stable_sort_adaptive): When buffer too small cut based on buffer
>>     size.
>>     (__stable_sort): Limit temporary buffer length.
>>     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
>>     to reduce requested buffer length on allocation failure.
>>
>> Tested under Linux x86_64.
>>
>> Ok to commit ?
>>
>> François
>>
>>
>> On 25/01/2018 23:37, chang jc wrote:
>>> Hi:
>>>
>>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>
>>> 2. The coding gain has been shown  PR c++/83938; I re-post here
>>>
>>>
>>>
>>>
>>>    21
>>>    20
>>>    19
>>>    18
>>>    17
>>>    16
>>>
>>>
>>>    0.471136
>>>    0.625695
>>>    0.767262
>>>    0.907461
>>>    1.04838
>>>    1.19508
>>>
>>>
>>>    0.340845
>>>    0.48651
>>>    0.639139
>>>    0.770133
>>>    0.898454
>>>    1.04632
>>>
>>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>> 4325376 & B with 12451840 elements, total with 16M integers
>>>
>>> The proposed method has the speed up under given buffer size =, ex
>>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>> given sizof(int)*64K bytes.
>>>
>>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>
>>> 4. It represents a fact that the intuitive idea to split from larger
>>> part is wrong.
>>>
>>> For example, if you have an input sorted array A & B, A has 8 integers
>>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>
>>> Current it tries to split from B, right?
>>>
>>> Then we have:
>>>
>>> A1 | A2  B1 | B2
>>>
>>> B1 & B2 has 12 integers each, right?
>>>
>>> Current algorithm selects pivot as 13th integer from B, right?
>>>
>>> If the corresponding upper bound of A is 6th integer.
>>>
>>> Then it split in
>>>
>>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>
>>> After rotate, we have two arrays to merge
>>>
>>> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>>>
>>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>
>>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>
>>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>>> | B1 = 12] again.
>>>
>>>
>>> But wait, if we always split from the smaller one instead of larger 
>>> one.
>>>
>>> After rotate, it promises two split arrays both contain 
>>> ceiling[small/2].
>>>
>>> Since tmp buffer also allocate by starting from sizeof(small) &
>>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>
>>> It means the allocated tmp buffer promises to be sufficient at the
>>> level of (# of fail allocate).
>>>
>>> Instead, you can see if split from large at level (# of fail allocate)
>>> several split array still CANNOT use  tmp buffer to do buffered merge.
>>>
>>>
>>> As you know, buffered merge is far speed then (split, rotate, and
>>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>
>>> the way should provide speedup.
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 24/01/2018 18:23, François Dumont wrote:
>>>
>>> Hi
>>>
>>>
>>>      It sounds like a very sensitive change to make but nothing 
>>> worth figures.
>>> Do you have any bench showing the importance of the gain ?
>>>
>>>      At least the memory usage optimization is obvious.
>>>
>>> On 19/01/2018 10:43, chang jc wrote:
>>>
>>> Current std::inplace_merge() suffers from performance issue by 
>>> inefficient
>>>
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h    (revision 256871)
>>> +++ include/bits/stl_algo.h    (working copy)
>>> @@ -2437,7 +2437,7 @@
>>>          _BidirectionalIterator __second_cut = __middle;
>>>          _Distance __len11 = 0;
>>>          _Distance __len22 = 0;
>>> -      if (__len1 > __len2)
>>> +      if (__len1 < __len2)
>>>            {
>>>              __len11 = __len1 / 2;
>>>              std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>>          const _DistanceType __len1 = std::distance(__first, __middle);
>>>          const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>>
>>>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>> _TmpBuf;
>>>
>>> -      _TmpBuf __buf(__first, __last);
>>> -
>>> +      _BidirectionalIterator __start, __end;
>>> +      if (__len1 < __len2) {
>>> +    __start = __first; __end = __middle;
>>> +      } else {
>>> +    __start = __middle; __end = __last;
>>> +      }
>>> +      _TmpBuf __buf(__start, ___end);
>>>
>>> Note another optimization, we could introduce a _Temporary_buffer<> 
>>> constructor
>>> in order to write:
>>>
>>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>>
>>> So that std::distance do not need to be called again.
>>>
>>>          if (__buf.begin() == 0)
>>>        std::__merge_without_buffer
>>>          (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h    (revision 256871)
>>> +++ include/bits/stl_tempbuf.h    (working copy)
>>> @@ -95,7 +95,7 @@
>>>                                std::nothrow));
>>>          if (__tmp != 0)
>>>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> -      __len /= 2;
>>> +      __len = (__len + 1) / 2;
>>>
>>> This part is problematic, if __len is 1 and allocation fails it will 
>>> loop
>>> forever.
>>>
>>> It doesn't seem really necessary for your patch.
>>>
>>>
>>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>>
>>>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>>>> adding the libstdc++ list to CC.
>>>>
>>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com>  wrote:
>>>>> Current std::inplace_merge() suffers from performance issue by
>>>> inefficient
>>>>> logic under limited memory,
>>>>>
>>>>> It leads to performance downgrade.
>>>>>
>>>>> Please help to review it.
>>>>>
>>>>> Index: include/bits/stl_algo.h
>>>>> ===================================================================
>>>>> --- include/bits/stl_algo.h     (revision 256871)
>>>>> +++ include/bits/stl_algo.h     (working copy)
>>>>> @@ -2437,7 +2437,7 @@
>>>>>            _BidirectionalIterator __second_cut = __middle;
>>>>>            _Distance __len11 = 0;
>>>>>            _Distance __len22 = 0;
>>>>> -         if (__len1 > __len2)
>>>>> +         if (__len1 < __len2)
>>>>>              {
>>>>>                __len11 = __len1 / 2;
>>>>>                std::advance(__first_cut, __len11);
>>>>> @@ -2539,9 +2539,15 @@
>>>>>         const _DistanceType __len1 = std::distance(__first, 
>>>>> __middle);
>>>>>         const _DistanceType __len2 = std::distance(__middle, __last);
>>>>>
>>>>> +
>>>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>> _TmpBuf;
>>>>> -      _TmpBuf __buf(__first, __last);
>>>>> -
>>>>> +      _BidirectionalIterator __start, __end;
>>>>> +      if (__len1 < __len2) {
>>>>> +       __start = __first; __end = __middle;
>>>>> +      } else {
>>>>> +       __start = __middle; __end = __last;
>>>>> +      }
>>>>> +      _TmpBuf __buf(__start, ___end);
>>>>>         if (__buf.begin() == 0)
>>>>>          std::__merge_without_buffer
>>>>>            (__first, __middle, __last, __len1, __len2, __comp);
>>>>> Index: include/bits/stl_tempbuf.h
>>>>> ===================================================================
>>>>> --- include/bits/stl_tempbuf.h  (revision 256871)
>>>>> +++ include/bits/stl_tempbuf.h  (working copy)
>>>>> @@ -95,7 +95,7 @@
>>>>> std::nothrow));
>>>>>            if (__tmp != 0)
>>>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>> -         __len /= 2;
>>>>> +         __len = (__len + 1) / 2;
>>>>>          }
>>>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>       }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>
>>
>
diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
index 04e152a..eb58efe 100644
--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -2416,9 +2416,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  _BidirectionalIterator __second_cut = __middle;
 	  _Distance __len11 = 0;
 	  _Distance __len22 = 0;
-	  if (__len1 > __len2)
+	  if (__len1 < __len2)
 	    {
-	      __len11 = __len1 / 2;
+	      __len11 = __buffer_size;
 	      std::advance(__first_cut, __len11);
 	      __second_cut
 		= std::__lower_bound(__middle, __last, *__first_cut,
@@ -2427,7 +2427,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	    }
 	  else
 	    {
-	      __len22 = __len2 / 2;
+	      __len22 = __buffer_size;
 	      std::advance(__second_cut, __len22);
 	      __first_cut
 		= std::__upper_bound(__first, __middle, *__second_cut,
@@ -2437,14 +2437,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
 	  _BidirectionalIterator __new_middle
 	    = std::__rotate_adaptive(__first_cut, __middle, __second_cut,
-				     __len1 - __len11, __len22, __buffer,
-				     __buffer_size);
-	  std::__merge_adaptive(__first, __first_cut, __new_middle, __len11,
-				__len22, __buffer, __buffer_size, __comp);
+				     __len1 - __len11, __len22,
+				     __buffer, __buffer_size);
+	  std::__merge_adaptive(__first, __first_cut, __new_middle,
+				__len11, __len22,
+				__buffer, __buffer_size, __comp);
 	  std::__merge_adaptive(__new_middle, __second_cut, __last,
-				__len1 - __len11,
-				__len2 - __len22, __buffer,
-				__buffer_size, __comp);
+				__len1 - __len11, __len2 - __len22,
+				__buffer, __buffer_size, __comp);
 	}
     }
 
@@ -2518,7 +2518,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       const _DistanceType __len2 = std::distance(__middle, __last);
 
       typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, __len1 + __len2);
+      _TmpBuf __buf(__first, std::min(__len1, __len2));
 
       if (__buf.begin() == 0)
 	std::__merge_without_buffer
@@ -2713,9 +2713,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 			   _Compare __comp)
     {
       const _Distance __len = (__last - __first + 1) / 2;
-      const _RandomAccessIterator __middle = __first + __len;
+      _RandomAccessIterator __middle;
       if (__len > __buffer_size)
 	{
+	  __middle = __first + __buffer_size + __buffer_size;
 	  std::__stable_sort_adaptive(__first, __middle, __buffer,
 				      __buffer_size, __comp);
 	  std::__stable_sort_adaptive(__middle, __last, __buffer,
@@ -2723,9 +2724,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	}
       else
 	{
+	  __middle = __first + __len;
 	  std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
 	  std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
 	}
+
       std::__merge_adaptive(__first, __middle, __last,
 			    _Distance(__middle - __first),
 			    _Distance(__last - __middle),
@@ -4975,8 +4978,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
       typedef typename iterator_traits<_RandomAccessIterator>::difference_type
 	_DistanceType;
 
+      if (__first == __last)
+	return;
+
       typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, std::distance(__first, __last));
+      _TmpBuf __buf(__first, (__last - __first + 1) / 2);
 
       if (__buf.begin() == 0)
 	std::__inplace_stable_sort(__first, __last, __comp);
diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
index 0abd3c12..e177272 100644
--- a/libstdc++-v3/include/bits/stl_tempbuf.h
+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 							std::nothrow));
 	  if (__tmp != 0)
 	    return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-	  __len /= 2;
+	  __len = __len == 1 ? 0 : ((__len + 1) / 2);
 	}
       return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
     }
François Dumont Oct. 29, 2018, 6:06 a.m. UTC | #7
Hi

     Some feedback regarding this patch ?

Thanks,
François

On 8/21/18 10:34 PM, François Dumont wrote:
> I missed a test that was failing because of this patch. So I revert a 
> small part of it and here is the new proposal.
>
> Tested under Linux x86_64, ok to commit ?
>
> François
>
>
> On 24/07/2018 12:22, François Dumont wrote:
>> Hi
>>
>>     Any chance to review this patch ?
>>
>> François
>>
>>
>> On 06/06/2018 18:39, François Dumont wrote:
>>> Hi
>>>
>>>     I review and rework this proposal. I noticed that the same idea 
>>> to limit buffer size within inplace_merge also apply to stable_sort.
>>>
>>>     I also change the decision when buffer is too small to consider 
>>> the buffer size rather than going through successive cuts of the 
>>> original ranges. This way we won't cut the range more than 
>>> necessary. Note that if you prefer this second part could be 
>>> committed seperatly.
>>>
>>>     PR libstdc++/83938
>>>     * include/bits/stl_algo.h:
>>>     (__stable_partition_adaptive): When buffer to small cut range using
>>>     buffer size.
>>>     (__inplace_merge): Take temporary buffer length from smallest 
>>> range.
>>>     (__merge_adaptive): When buffer too small consider smallest 
>>> range first
>>>     and cut based on buffer size.
>>>     (__stable_sort_adaptive): When buffer too small cut based on buffer
>>>     size.
>>>     (__stable_sort): Limit temporary buffer length.
>>>     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change 
>>> function
>>>     to reduce requested buffer length on allocation failure.
>>>
>>> Tested under Linux x86_64.
>>>
>>> Ok to commit ?
>>>
>>> François
>>>
>>>
>>> On 25/01/2018 23:37, chang jc wrote:
>>>> Hi:
>>>>
>>>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>>
>>>> 2. The coding gain has been shown  PR c++/83938; I re-post here
>>>>
>>>>
>>>>
>>>>
>>>>    21
>>>>    20
>>>>    19
>>>>    18
>>>>    17
>>>>    16
>>>>
>>>>
>>>>    0.471136
>>>>    0.625695
>>>>    0.767262
>>>>    0.907461
>>>>    1.04838
>>>>    1.19508
>>>>
>>>>
>>>>    0.340845
>>>>    0.48651
>>>>    0.639139
>>>>    0.770133
>>>>    0.898454
>>>>    1.04632
>>>>
>>>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>>> 4325376 & B with 12451840 elements, total with 16M integers
>>>>
>>>> The proposed method has the speed up under given buffer size =, ex
>>>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>>> given sizof(int)*64K bytes.
>>>>
>>>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>>
>>>> 4. It represents a fact that the intuitive idea to split from larger
>>>> part is wrong.
>>>>
>>>> For example, if you have an input sorted array A & B, A has 8 integers
>>>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>>
>>>> Current it tries to split from B, right?
>>>>
>>>> Then we have:
>>>>
>>>> A1 | A2  B1 | B2
>>>>
>>>> B1 & B2 has 12 integers each, right?
>>>>
>>>> Current algorithm selects pivot as 13th integer from B, right?
>>>>
>>>> If the corresponding upper bound of A is 6th integer.
>>>>
>>>> Then it split in
>>>>
>>>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>>
>>>> After rotate, we have two arrays to merge
>>>>
>>>> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>>>>
>>>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>>
>>>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>>
>>>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>>>> | B1 = 12] again.
>>>>
>>>>
>>>> But wait, if we always split from the smaller one instead of larger 
>>>> one.
>>>>
>>>> After rotate, it promises two split arrays both contain 
>>>> ceiling[small/2].
>>>>
>>>> Since tmp buffer also allocate by starting from sizeof(small) &
>>>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>>
>>>> It means the allocated tmp buffer promises to be sufficient at the
>>>> level of (# of fail allocate).
>>>>
>>>> Instead, you can see if split from large at level (# of fail allocate)
>>>> several split array still CANNOT use  tmp buffer to do buffered merge.
>>>>
>>>>
>>>> As you know, buffered merge is far speed then (split, rotate, and
>>>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>>
>>>> the way should provide speedup.
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 24/01/2018 18:23, François Dumont wrote:
>>>>
>>>> Hi
>>>>
>>>>
>>>>      It sounds like a very sensitive change to make but nothing 
>>>> worth figures.
>>>> Do you have any bench showing the importance of the gain ?
>>>>
>>>>      At least the memory usage optimization is obvious.
>>>>
>>>> On 19/01/2018 10:43, chang jc wrote:
>>>>
>>>> Current std::inplace_merge() suffers from performance issue by 
>>>> inefficient
>>>>
>>>> logic under limited memory,
>>>>
>>>> It leads to performance downgrade.
>>>>
>>>> Please help to review it.
>>>>
>>>> Index: include/bits/stl_algo.h
>>>> ===================================================================
>>>> --- include/bits/stl_algo.h    (revision 256871)
>>>> +++ include/bits/stl_algo.h    (working copy)
>>>> @@ -2437,7 +2437,7 @@
>>>>          _BidirectionalIterator __second_cut = __middle;
>>>>          _Distance __len11 = 0;
>>>>          _Distance __len22 = 0;
>>>> -      if (__len1 > __len2)
>>>> +      if (__len1 < __len2)
>>>>            {
>>>>              __len11 = __len1 / 2;
>>>>              std::advance(__first_cut, __len11);
>>>> @@ -2539,9 +2539,15 @@
>>>>          const _DistanceType __len1 = std::distance(__first, 
>>>> __middle);
>>>>          const _DistanceType __len2 = std::distance(__middle, __last);
>>>>
>>>> +
>>>>
>>>>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>> _TmpBuf;
>>>>
>>>> -      _TmpBuf __buf(__first, __last);
>>>> -
>>>> +      _BidirectionalIterator __start, __end;
>>>> +      if (__len1 < __len2) {
>>>> +    __start = __first; __end = __middle;
>>>> +      } else {
>>>> +    __start = __middle; __end = __last;
>>>> +      }
>>>> +      _TmpBuf __buf(__start, ___end);
>>>>
>>>> Note another optimization, we could introduce a _Temporary_buffer<> 
>>>> constructor
>>>> in order to write:
>>>>
>>>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>>>
>>>> So that std::distance do not need to be called again.
>>>>
>>>>          if (__buf.begin() == 0)
>>>>        std::__merge_without_buffer
>>>>          (__first, __middle, __last, __len1, __len2, __comp);
>>>> Index: include/bits/stl_tempbuf.h
>>>> ===================================================================
>>>> --- include/bits/stl_tempbuf.h    (revision 256871)
>>>> +++ include/bits/stl_tempbuf.h    (working copy)
>>>> @@ -95,7 +95,7 @@
>>>>                                std::nothrow));
>>>>          if (__tmp != 0)
>>>>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>> -      __len /= 2;
>>>> +      __len = (__len + 1) / 2;
>>>>
>>>> This part is problematic, if __len is 1 and allocation fails it 
>>>> will loop
>>>> forever.
>>>>
>>>> It doesn't seem really necessary for your patch.
>>>>
>>>>
>>>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>>>
>>>>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>>>>> adding the libstdc++ list to CC.
>>>>>
>>>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com>  
>>>>> wrote:
>>>>>> Current std::inplace_merge() suffers from performance issue by
>>>>> inefficient
>>>>>> logic under limited memory,
>>>>>>
>>>>>> It leads to performance downgrade.
>>>>>>
>>>>>> Please help to review it.
>>>>>>
>>>>>> Index: include/bits/stl_algo.h
>>>>>> ===================================================================
>>>>>> --- include/bits/stl_algo.h     (revision 256871)
>>>>>> +++ include/bits/stl_algo.h     (working copy)
>>>>>> @@ -2437,7 +2437,7 @@
>>>>>>            _BidirectionalIterator __second_cut = __middle;
>>>>>>            _Distance __len11 = 0;
>>>>>>            _Distance __len22 = 0;
>>>>>> -         if (__len1 > __len2)
>>>>>> +         if (__len1 < __len2)
>>>>>>              {
>>>>>>                __len11 = __len1 / 2;
>>>>>>                std::advance(__first_cut, __len11);
>>>>>> @@ -2539,9 +2539,15 @@
>>>>>>         const _DistanceType __len1 = std::distance(__first, 
>>>>>> __middle);
>>>>>>         const _DistanceType __len2 = std::distance(__middle, 
>>>>>> __last);
>>>>>>
>>>>>> +
>>>>>>         typedef _Temporary_buffer<_BidirectionalIterator, 
>>>>>> _ValueType>
>>>>> _TmpBuf;
>>>>>> -      _TmpBuf __buf(__first, __last);
>>>>>> -
>>>>>> +      _BidirectionalIterator __start, __end;
>>>>>> +      if (__len1 < __len2) {
>>>>>> +       __start = __first; __end = __middle;
>>>>>> +      } else {
>>>>>> +       __start = __middle; __end = __last;
>>>>>> +      }
>>>>>> +      _TmpBuf __buf(__start, ___end);
>>>>>>         if (__buf.begin() == 0)
>>>>>>          std::__merge_without_buffer
>>>>>>            (__first, __middle, __last, __len1, __len2, __comp);
>>>>>> Index: include/bits/stl_tempbuf.h
>>>>>> ===================================================================
>>>>>> --- include/bits/stl_tempbuf.h  (revision 256871)
>>>>>> +++ include/bits/stl_tempbuf.h  (working copy)
>>>>>> @@ -95,7 +95,7 @@
>>>>>> std::nothrow));
>>>>>>            if (__tmp != 0)
>>>>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>> -         __len /= 2;
>>>>>> +         __len = (__len + 1) / 2;
>>>>>>          }
>>>>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>>       }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>
>>>
>>
>
Jonathan Wakely Dec. 21, 2018, 8:57 p.m. UTC | #8
On 29/10/18 07:06 +0100, François Dumont wrote:
>Hi
>
>    Some feedback regarding this patch ?

Sorry this got missed, please resubmit during stage 1.

You haven't CC'd the original patch author (chang jc) to give them a
chance to comment on your proposed changes to the patch.

The attached PDF on PR libstdc++/83938 has extensive discussion of the
performance issue, but I don't see any for your version. Some
performance benchmarks for your version would be helpful.



>
>Thanks,
>François
>
>On 8/21/18 10:34 PM, François Dumont wrote:
>>I missed a test that was failing because of this patch. So I revert 
>>a small part of it and here is the new proposal.
>>
>>Tested under Linux x86_64, ok to commit ?
>>
>>François
>>
>>
>>On 24/07/2018 12:22, François Dumont wrote:
>>>Hi
>>>
>>>    Any chance to review this patch ?
>>>
>>>François
>>>
>>>
>>>On 06/06/2018 18:39, François Dumont wrote:
>>>>Hi
>>>>
>>>>    I review and rework this proposal. I noticed that the same 
>>>>idea to limit buffer size within inplace_merge also apply to 
>>>>stable_sort.
>>>>
>>>>    I also change the decision when buffer is too small to 
>>>>consider the buffer size rather than going through successive 
>>>>cuts of the original ranges. This way we won't cut the range 
>>>>more than necessary. Note that if you prefer this second part 
>>>>could be committed seperatly.
>>>>
>>>>    PR libstdc++/83938
>>>>    * include/bits/stl_algo.h:
>>>>    (__stable_partition_adaptive): When buffer to small cut range using
>>>>    buffer size.
>>>>    (__inplace_merge): Take temporary buffer length from 
>>>>smallest range.
>>>>    (__merge_adaptive): When buffer too small consider smallest 
>>>>range first
>>>>    and cut based on buffer size.
>>>>    (__stable_sort_adaptive): When buffer too small cut based on buffer
>>>>    size.
>>>>    (__stable_sort): Limit temporary buffer length.
>>>>    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change 
>>>>function
>>>>    to reduce requested buffer length on allocation failure.
>>>>
>>>>Tested under Linux x86_64.
>>>>
>>>>Ok to commit ?
>>>>
>>>>François
>>>>
>>>>
>>>>On 25/01/2018 23:37, chang jc wrote:
>>>>>Hi:
>>>>>
>>>>>1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>>>>__len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>>>
>>>>>2. The coding gain has been shown  PR c++/83938; I re-post here
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   21
>>>>>   20
>>>>>   19
>>>>>   18
>>>>>   17
>>>>>   16
>>>>>
>>>>>
>>>>>   0.471136
>>>>>   0.625695
>>>>>   0.767262
>>>>>   0.907461
>>>>>   1.04838
>>>>>   1.19508
>>>>>
>>>>>
>>>>>   0.340845
>>>>>   0.48651
>>>>>   0.639139
>>>>>   0.770133
>>>>>   0.898454
>>>>>   1.04632
>>>>>
>>>>>it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>>>>4325376 & B with 12451840 elements, total with 16M integers
>>>>>
>>>>>The proposed method has the speed up under given buffer size =, ex
>>>>>2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>>>>given sizof(int)*64K bytes.
>>>>>
>>>>>3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>>>
>>>>>4. It represents a fact that the intuitive idea to split from larger
>>>>>part is wrong.
>>>>>
>>>>>For example, if you have an input sorted array A & B, A has 8 integers
>>>>>& B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>>>
>>>>>Current it tries to split from B, right?
>>>>>
>>>>>Then we have:
>>>>>
>>>>>A1 | A2  B1 | B2
>>>>>
>>>>>B1 & B2 has 12 integers each, right?
>>>>>
>>>>>Current algorithm selects pivot as 13th integer from B, right?
>>>>>
>>>>>If the corresponding upper bound of A is 6th integer.
>>>>>
>>>>>Then it split in
>>>>>
>>>>>A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>>>
>>>>>After rotate, we have two arrays to merge
>>>>>
>>>>>[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>>>>>
>>>>>Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>>>
>>>>>Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>>>
>>>>>So we do rotate again, split & merge the two split arrays from [A1 = 5
>>>>>| B1 = 12] again.
>>>>>
>>>>>
>>>>>But wait, if we always split from the smaller one instead of 
>>>>>larger one.
>>>>>
>>>>>After rotate, it promises two split arrays both contain 
>>>>>ceiling[small/2].
>>>>>
>>>>>Since tmp buffer also allocate by starting from sizeof(small) &
>>>>>recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>>>
>>>>>It means the allocated tmp buffer promises to be sufficient at the
>>>>>level of (# of fail allocate).
>>>>>
>>>>>Instead, you can see if split from large at level (# of fail allocate)
>>>>>several split array still CANNOT use  tmp buffer to do buffered merge.
>>>>>
>>>>>
>>>>>As you know, buffered merge is far speed then (split, rotate, and
>>>>>merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>>>
>>>>>the way should provide speedup.
>>>>>
>>>>>
>>>>>Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>On 24/01/2018 18:23, François Dumont wrote:
>>>>>
>>>>>Hi
>>>>>
>>>>>
>>>>>     It sounds like a very sensitive change to make but 
>>>>>nothing worth figures.
>>>>>Do you have any bench showing the importance of the gain ?
>>>>>
>>>>>     At least the memory usage optimization is obvious.
>>>>>
>>>>>On 19/01/2018 10:43, chang jc wrote:
>>>>>
>>>>>Current std::inplace_merge() suffers from performance issue by 
>>>>>inefficient
>>>>>
>>>>>logic under limited memory,
>>>>>
>>>>>It leads to performance downgrade.
>>>>>
>>>>>Please help to review it.
>>>>>
>>>>>Index: include/bits/stl_algo.h
>>>>>===================================================================
>>>>>--- include/bits/stl_algo.h    (revision 256871)
>>>>>+++ include/bits/stl_algo.h    (working copy)
>>>>>@@ -2437,7 +2437,7 @@
>>>>>         _BidirectionalIterator __second_cut = __middle;
>>>>>         _Distance __len11 = 0;
>>>>>         _Distance __len22 = 0;
>>>>>-      if (__len1 > __len2)
>>>>>+      if (__len1 < __len2)
>>>>>           {
>>>>>             __len11 = __len1 / 2;
>>>>>             std::advance(__first_cut, __len11);
>>>>>@@ -2539,9 +2539,15 @@
>>>>>         const _DistanceType __len1 = std::distance(__first, 
>>>>>__middle);
>>>>>         const _DistanceType __len2 = std::distance(__middle, __last);
>>>>>
>>>>>+
>>>>>
>>>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>>>_TmpBuf;
>>>>>
>>>>>-      _TmpBuf __buf(__first, __last);
>>>>>-
>>>>>+      _BidirectionalIterator __start, __end;
>>>>>+      if (__len1 < __len2) {
>>>>>+    __start = __first; __end = __middle;
>>>>>+      } else {
>>>>>+    __start = __middle; __end = __last;
>>>>>+      }
>>>>>+      _TmpBuf __buf(__start, ___end);
>>>>>
>>>>>Note another optimization, we could introduce a 
>>>>>_Temporary_buffer<> constructor
>>>>>in order to write:
>>>>>
>>>>>_TmpBuf __buf(std::min(__len1, __len2), __first);
>>>>>
>>>>>So that std::distance do not need to be called again.
>>>>>
>>>>>         if (__buf.begin() == 0)
>>>>>       std::__merge_without_buffer
>>>>>         (__first, __middle, __last, __len1, __len2, __comp);
>>>>>Index: include/bits/stl_tempbuf.h
>>>>>===================================================================
>>>>>--- include/bits/stl_tempbuf.h    (revision 256871)
>>>>>+++ include/bits/stl_tempbuf.h    (working copy)
>>>>>@@ -95,7 +95,7 @@
>>>>>                               std::nothrow));
>>>>>         if (__tmp != 0)
>>>>>           return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>-      __len /= 2;
>>>>>+      __len = (__len + 1) / 2;
>>>>>
>>>>>This part is problematic, if __len is 1 and allocation fails 
>>>>>it will loop
>>>>>forever.
>>>>>
>>>>>It doesn't seem really necessary for your patch.
>>>>>
>>>>>
>>>>>2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>:
>>>>>
>>>>>>This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>>>>>>adding the libstdc++ list to CC.
>>>>>>
>>>>>>On Fri, Jan 19, 2018 at 3:02 AM, chang 
>>>>>>jc<r97922153@gmail.com>  wrote:
>>>>>>>Current std::inplace_merge() suffers from performance issue by
>>>>>>inefficient
>>>>>>>logic under limited memory,
>>>>>>>
>>>>>>>It leads to performance downgrade.
>>>>>>>
>>>>>>>Please help to review it.
>>>>>>>
>>>>>>>Index: include/bits/stl_algo.h
>>>>>>>===================================================================
>>>>>>>--- include/bits/stl_algo.h     (revision 256871)
>>>>>>>+++ include/bits/stl_algo.h     (working copy)
>>>>>>>@@ -2437,7 +2437,7 @@
>>>>>>>           _BidirectionalIterator __second_cut = __middle;
>>>>>>>           _Distance __len11 = 0;
>>>>>>>           _Distance __len22 = 0;
>>>>>>>-         if (__len1 > __len2)
>>>>>>>+         if (__len1 < __len2)
>>>>>>>             {
>>>>>>>               __len11 = __len1 / 2;
>>>>>>>               std::advance(__first_cut, __len11);
>>>>>>>@@ -2539,9 +2539,15 @@
>>>>>>>        const _DistanceType __len1 = 
>>>>>>>std::distance(__first, __middle);
>>>>>>>        const _DistanceType __len2 = 
>>>>>>>std::distance(__middle, __last);
>>>>>>>
>>>>>>>+
>>>>>>>        typedef _Temporary_buffer<_BidirectionalIterator, 
>>>>>>>_ValueType>
>>>>>>_TmpBuf;
>>>>>>>-      _TmpBuf __buf(__first, __last);
>>>>>>>-
>>>>>>>+      _BidirectionalIterator __start, __end;
>>>>>>>+      if (__len1 < __len2) {
>>>>>>>+       __start = __first; __end = __middle;
>>>>>>>+      } else {
>>>>>>>+       __start = __middle; __end = __last;
>>>>>>>+      }
>>>>>>>+      _TmpBuf __buf(__start, ___end);
>>>>>>>        if (__buf.begin() == 0)
>>>>>>>         std::__merge_without_buffer
>>>>>>>           (__first, __middle, __last, __len1, __len2, __comp);
>>>>>>>Index: include/bits/stl_tempbuf.h
>>>>>>>===================================================================
>>>>>>>--- include/bits/stl_tempbuf.h  (revision 256871)
>>>>>>>+++ include/bits/stl_tempbuf.h  (working copy)
>>>>>>>@@ -95,7 +95,7 @@
>>>>>>>std::nothrow));
>>>>>>>           if (__tmp != 0)
>>>>>>>             return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>>>-         __len /= 2;
>>>>>>>+         __len = (__len + 1) / 2;
>>>>>>>         }
>>>>>>>        return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>>>      }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>Thanks
>>>>
>>>>
>>>
>>
>
François Dumont June 9, 2019, 2:27 p.m. UTC | #9
On 12/21/18 9:57 PM, Jonathan Wakely wrote:
> On 29/10/18 07:06 +0100, François Dumont wrote:
>> Hi
>>
>>     Some feedback regarding this patch ?
>
> Sorry this got missed, please resubmit during stage 1.
>
> You haven't CC'd the original patch author (chang jc) to give them a
> chance to comment on your proposed changes to the patch.
>
> The attached PDF on PR libstdc++/83938 has extensive discussion of the
> performance issue, but I don't see any for your version. Some
> performance benchmarks for your version would be helpful.

Here is this patch again.

This time it is much closer to John's original one, I just kept my 
change on the size of the temporary buffer which doesn't need to be as 
large as it is currently allocated, especially with John's patch.

The performance tests are showing the good improvements, attached are 
the before/after. Surprisingly I do not see any change regarding 
allocated memory, I guess measures are not accurate enough. However 
working on them also show that current implementation is fragile. If you 
reduce the allowed allocated memory too much you end up with a stack 
overflow because of the recursive implementation.

I already have a patch to keep on trying to allocate memory as long as 
above a given threshold rather than 0 but I am also working on making 
implementation non-recursive. We'll see if even a buffer of size 1 is 
better than no buffer at all then.

     PR libstdc++/83938
     * include/bits/stl_algo.h:
     (__merge_adaptive): When buffer too small consider smallest range first
     and cut based on buffer size.
     (__inplace_merge): Take temporary buffer length from smallest range.
     (__stable_sort): Limit temporary buffer length.
     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
     to reduce requested buffer length on allocation failure.
     * testsuite/performance/25_algorithms/inplace_merge.cc: New.
     * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
     execution with different level of memory.

François
inplace_merge.cc         	reverse                  	  19r   19u    0s        48mem    0pf 
inplace_merge.cc         	forwards                 	   0r    0u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  18r   18u    0s         0mem    0pf 
inplace_merge.cc         	reverse                  	  10r   10u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   8r    8u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  22r   22u    0s         0mem    0pf 
inplace_merge.cc         	bench 1/2 memory         	1891r 1884u    7s       928mem    0pf 
inplace_merge.cc         	reverse                  	  19r   18u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   0r    0u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  18r   18u    0s         0mem    0pf 
inplace_merge.cc         	reverse                  	  10r   10u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   3r    3u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  22r   22u    0s         0mem    0pf 
inplace_merge.cc         	bench 1/4 memory         	1867r 1861u    6s       192mem    0pf 
inplace_merge.cc         	reverse                  	  17r   17u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   0r    0u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  25r   24u    0s         0mem    0pf 
inplace_merge.cc         	reverse                  	  28r   28u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   0r    0u    0s         0mem    0pf 
inplace_merge.cc         	random                   	 289r  289u    0s         0mem    0pf 
inplace_merge.cc         	bench no memory          	2155r 2149u    6s       192mem    0pf 
stable_sort.cc           	reverse                  	 240r  240u    1s         0mem    0pf 
stable_sort.cc           	forwards                 	 205r  205u    0s         0mem    0pf 
stable_sort.cc           	random                   	 493r  493u    0s         0mem    0pf 
stable_sort.cc           	bench full memory        	 945r  939u    5s       752mem    0pf 
stable_sort.cc           	reverse                  	 236r  236u    0s         0mem    0pf 
stable_sort.cc           	forwards                 	 199r  199u    0s         0mem    0pf 
stable_sort.cc           	random                   	 487r  487u    0s         0mem    0pf 
stable_sort.cc           	bench 1/2 memory         	 928r  925u    3s        96mem    0pf 
stable_sort.cc           	reverse                  	 240r  240u    0s         0mem    0pf 
stable_sort.cc           	forwards                 	 203r  203u    0s         0mem    0pf 
stable_sort.cc           	random                   	 486r  487u    0s         0mem    0pf 
stable_sort.cc           	bench 1/4 memory         	 935r  932u    5s        96mem    0pf 
stable_sort.cc           	reverse                  	 829r  829u    0s         0mem    0pf 
stable_sort.cc           	forwards                 	 143r  143u    0s         0mem    0pf 
stable_sort.cc           	random                   	 487r  486u    0s         0mem    0pf 
stable_sort.cc           	bench no memory          	1465r 1461u    2s        96mem    0pf
inplace_merge.cc         	reverse                  	  19r   19u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   0r    0u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  19r   19u    0s         0mem    0pf 
inplace_merge.cc         	reverse                  	  11r   10u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   8r    8u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  22r   22u    0s         0mem    0pf 
inplace_merge.cc         	bench 1/2 memory         	1938r 1925u   12s       928mem    0pf 
inplace_merge.cc         	reverse                  	  19r   19u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   0r    0u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  18r   19u    0s         0mem    0pf 
inplace_merge.cc         	reverse                  	  10r   11u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   3r    2u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  22r   22u    0s         0mem    0pf 
inplace_merge.cc         	bench 1/4 memory         	1922r 1915u    7s       192mem    0pf 
inplace_merge.cc         	reverse                  	  18r   17u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   0r    0u    0s         0mem    0pf 
inplace_merge.cc         	random                   	  23r   24u    0s         0mem    0pf 
inplace_merge.cc         	reverse                  	  28r   27u    0s         0mem    0pf 
inplace_merge.cc         	forwards                 	   0r    0u    0s         0mem    0pf 
inplace_merge.cc         	random                   	 288r  289u    0s         0mem    0pf 
inplace_merge.cc         	bench no memory          	2204r 2196u    9s       192mem    0pf 
stable_sort.cc           	reverse                  	 234r  234u    0s         0mem    0pf 
stable_sort.cc           	forwards                 	 201r  200u    1s         0mem    0pf 
stable_sort.cc           	random                   	 485r  484u    1s         0mem    0pf 
stable_sort.cc           	bench full memory        	 927r  920u    5s       752mem    0pf 
stable_sort.cc           	reverse                  	 233r  231u    1s         0mem    0pf 
stable_sort.cc           	forwards                 	 198r  198u    0s         0mem    0pf 
stable_sort.cc           	random                   	 478r  478u    0s         0mem    0pf 
stable_sort.cc           	bench 1/2 memory         	 915r  910u    4s        96mem    0pf 
stable_sort.cc           	reverse                  	 236r  236u    0s         0mem    0pf 
stable_sort.cc           	forwards                 	 201r  200u    0s         0mem    0pf 
stable_sort.cc           	random                   	 478r  478u    1s         0mem    0pf 
stable_sort.cc           	bench 1/4 memory         	 921r  916u    4s        96mem    0pf 
stable_sort.cc           	reverse                  	 830r  830u    0s         0mem    0pf 
stable_sort.cc           	forwards                 	 146r  146u    0s         0mem    0pf 
stable_sort.cc           	random                   	 479r  478u    0s         0mem    0pf 
stable_sort.cc           	bench no memory          	1461r 1456u    5s        96mem    0pf
François Dumont July 16, 2019, 4:40 p.m. UTC | #10
Hi

     I eventually spent much more time working on the inplace_merge 
performance bench.

     And the results do not confirm the theory:

Before patch:
inplace_merge.cc             bench 1 / 1 memory            243r  227u   
17s      1216mem    5pf
inplace_merge.cc             bench 1 / 4 memory            297r 278u   
18s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory           373r 354u   
18s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory            12r 11u     
0s       480mem    0pf

After the patch to reduce memory allocation:
inplace_merge.cc             bench 1 / 1 memory            245r  227u   
18s      1216mem    0pf
inplace_merge.cc             bench 1 / 4 memory            292r 273u   
19s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory           373r 356u   
17s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory            11r 11u     
0s       480mem    0pf

With the __len1 > __len2 condition change:
inplace_merge.cc             bench 1 / 1 memory            244r  225u   
20s      1216mem    0pf
inplace_merge.cc             bench 1 / 4 memory            379r 361u   
17s       480mem    0pf
inplace_merge.cc             bench 1 /64 memory            640r 625u   
16s       480mem    0pf
inplace_merge.cc             bench 0 / 1 memory             11r 11u    
0s       480mem    0pf

When there is no memory limit the results are identical of course. 
Otherwise as soon as memory is limited performance starts to decrease 
with the condition change on __len1 vs __len2.

Could you give the bench you use to demonstrate the enhancement ? I also 
wonder why your patch doesn't change consistently the same condition in 
__merge_without_buffer ?

For the moment I'd like to propose the attached patch that is to say the 
reduction on the amount of allocated memory and the new/modified benches.

Note that as soon as I forbid any memory allocation I also reduce the 
size of the range to merge cause the implementation rely on recursivity 
and so could end-up in a stack overflow. Maybe I need to check for 
simulator targets like several other tests ? Unless simulators do not 
run the performance tests ?

Regarding this stack overflow issue, is there some routine to find out 
how many levels of function calls can be added before reaching the stack 
overflow ? I could perhaps call __builtin_alloca and check the result 
but that smells. If I could find out this we could fallback on an 
iterative approach to complete the merge.

     PR libstdc++/83938
     * include/bits/stl_algo.h:
     (__inplace_merge): Take temporary buffer length from smallest range.
     (__stable_sort): Limit temporary buffer length.
     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
     computation in the loop.
     * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
     pivot positions.
     * testsuite/performance/25_algorithms/inplace_merge.cc: New.
     * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
     execution with different memory constraints.

Ok to commit ?

François

On 6/9/19 4:27 PM, François Dumont wrote:
> On 12/21/18 9:57 PM, Jonathan Wakely wrote:
>> On 29/10/18 07:06 +0100, François Dumont wrote:
>>> Hi
>>>
>>>     Some feedback regarding this patch ?
>>
>> Sorry this got missed, please resubmit during stage 1.
>>
>> You haven't CC'd the original patch author (chang jc) to give them a
>> chance to comment on your proposed changes to the patch.
>>
>> The attached PDF on PR libstdc++/83938 has extensive discussion of the
>> performance issue, but I don't see any for your version. Some
>> performance benchmarks for your version would be helpful.
>
> Here is this patch again.
>
> This time it is much closer to John's original one, I just kept my 
> change on the size of the temporary buffer which doesn't need to be as 
> large as it is currently allocated, especially with John's patch.
>
> The performance tests are showing the good improvements, attached are 
> the before/after. Surprisingly I do not see any change regarding 
> allocated memory, I guess measures are not accurate enough. However 
> working on them also show that current implementation is fragile. If 
> you reduce the allowed allocated memory too much you end up with a 
> stack overflow because of the recursive implementation.
>
> I already have a patch to keep on trying to allocate memory as long as 
> above a given threshold rather than 0 but I am also working on making 
> implementation non-recursive. We'll see if even a buffer of size 1 is 
> better than no buffer at all then.
>
>     PR libstdc++/83938
>     * include/bits/stl_algo.h:
>     (__merge_adaptive): When buffer too small consider smallest range 
> first
>     and cut based on buffer size.
>     (__inplace_merge): Take temporary buffer length from smallest range.
>     (__stable_sort): Limit temporary buffer length.
>     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
>     to reduce requested buffer length on allocation failure.
>     * testsuite/performance/25_algorithms/inplace_merge.cc: New.
>     * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
>     execution with different level of memory.
>
> François
>
Jonathan Wakely Nov. 19, 2020, 12:08 p.m. UTC | #11
On 16/07/19 18:40 +0200, François Dumont wrote:
>Hi
>
>    I eventually spent much more time working on the inplace_merge 
>performance bench.
>
>    And the results do not confirm the theory:
>
>Before patch:
>inplace_merge.cc             bench 1 / 1 memory            243r  
>227u   17s      1216mem    5pf
>inplace_merge.cc             bench 1 / 4 memory            297r 278u   
>18s       480mem    0pf
>inplace_merge.cc             bench 1 /64 memory           373r 354u   
>18s       480mem    0pf
>inplace_merge.cc             bench 0 / 1 memory            12r 11u     
>0s       480mem    0pf
>
>After the patch to reduce memory allocation:
>inplace_merge.cc             bench 1 / 1 memory            245r  
>227u   18s      1216mem    0pf
>inplace_merge.cc             bench 1 / 4 memory            292r 273u   
>19s       480mem    0pf
>inplace_merge.cc             bench 1 /64 memory           373r 356u   
>17s       480mem    0pf
>inplace_merge.cc             bench 0 / 1 memory            11r 11u     
>0s       480mem    0pf
>
>With the __len1 > __len2 condition change:
>inplace_merge.cc             bench 1 / 1 memory            244r  
>225u   20s      1216mem    0pf
>inplace_merge.cc             bench 1 / 4 memory            379r 361u   
>17s       480mem    0pf
>inplace_merge.cc             bench 1 /64 memory            640r 625u   
>16s       480mem    0pf
>inplace_merge.cc             bench 0 / 1 memory             11r 11u    
>0s       480mem    0pf
>
>When there is no memory limit the results are identical of course. 
>Otherwise as soon as memory is limited performance starts to decrease 
>with the condition change on __len1 vs __len2.
>
>Could you give the bench you use to demonstrate the enhancement ? I 
>also wonder why your patch doesn't change consistently the same 
>condition in __merge_without_buffer ?
>
>For the moment I'd like to propose the attached patch that is to say 
>the reduction on the amount of allocated memory and the new/modified 
>benches.
>
>Note that as soon as I forbid any memory allocation I also reduce the 
>size of the range to merge cause the implementation rely on 
>recursivity and so could end-up in a stack overflow. Maybe I need to 
>check for simulator targets like several other tests ? Unless 
>simulators do not run the performance tests ?

The performance tests are never run by default.

I don't think we should spend too much time caring about performance
of sorting close to Out Of Memory conditions. We don't try to optimize
std::vector or other cases to work when close to OOM.

So I think just reducing the memory usage is the right approach here.

>Regarding this stack overflow issue, is there some routine to find out 
>how many levels of function calls can be added before reaching the 
>stack overflow ? I could perhaps call __builtin_alloca and check the 
>result but that smells. If I could find out this we could fallback on 
>an iterative approach to complete the merge.

No, alloca is inherently unsafe. Please don't even consider it :-)


>    PR libstdc++/83938
>    * include/bits/stl_algo.h:
>    (__inplace_merge): Take temporary buffer length from smallest range.
>    (__stable_sort): Limit temporary buffer length.
>    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len
>    computation in the loop.

Please use "Change __len computation in the loop to avoid truncation."

It's a bit more descriptive.

>    * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible
>    pivot positions.
>    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
>    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
>    execution with different memory constraints.
>
>Ok to commit ?
>
>François
>
>On 6/9/19 4:27 PM, François Dumont wrote:
>>On 12/21/18 9:57 PM, Jonathan Wakely wrote:
>>>On 29/10/18 07:06 +0100, François Dumont wrote:
>>>>Hi
>>>>
>>>>    Some feedback regarding this patch ?
>>>
>>>Sorry this got missed, please resubmit during stage 1.
>>>
>>>You haven't CC'd the original patch author (chang jc) to give them a
>>>chance to comment on your proposed changes to the patch.
>>>
>>>The attached PDF on PR libstdc++/83938 has extensive discussion of the
>>>performance issue, but I don't see any for your version. Some
>>>performance benchmarks for your version would be helpful.
>>
>>Here is this patch again.
>>
>>This time it is much closer to John's original one, I just kept my 
>>change on the size of the temporary buffer which doesn't need to be 
>>as large as it is currently allocated, especially with John's patch.
>>
>>The performance tests are showing the good improvements, attached 
>>are the before/after. Surprisingly I do not see any change regarding 
>>allocated memory, I guess measures are not accurate enough. However 
>>working on them also show that current implementation is fragile. If 
>>you reduce the allowed allocated memory too much you end up with a 
>>stack overflow because of the recursive implementation.
>>
>>I already have a patch to keep on trying to allocate memory as long 
>>as above a given threshold rather than 0 but I am also working on 
>>making implementation non-recursive. We'll see if even a buffer of 
>>size 1 is better than no buffer at all then.
>>
>>    PR libstdc++/83938
>>    * include/bits/stl_algo.h:
>>    (__merge_adaptive): When buffer too small consider smallest 
>>range first
>>    and cut based on buffer size.
>>    (__inplace_merge): Take temporary buffer length from smallest range.
>>    (__stable_sort): Limit temporary buffer length.
>>    * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
>>    to reduce requested buffer length on allocation failure.
>>    * testsuite/performance/25_algorithms/inplace_merge.cc: New.
>>    * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow
>>    execution with different level of memory.
>>
>>François
>>
>

>diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h
>index ec651e2cc45..b396437443f 100644
>--- a/libstdc++-v3/include/bits/stl_algo.h
>+++ b/libstdc++-v3/include/bits/stl_algo.h
>@@ -2530,7 +2530,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>       const _DistanceType __len2 = std::distance(__middle, __last);
> 
>       typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
>-      _TmpBuf __buf(__first, __len1 + __len2);
>+      _TmpBuf __buf(__first, std::min(__len1, __len2));

Please add a comment before this saying something like:

       // __merge_adaptive will use a buffer for the smaller of
       // [first,middle) and [middle,last).

>       if (__buf.begin() == 0)
> 	std::__merge_without_buffer
>@@ -2738,6 +2738,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> 	  std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp);
> 	  std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp);
> 	}
>+
>       std::__merge_adaptive(__first, __middle, __last,
> 			    _Distance(__middle - __first),
> 			    _Distance(__last - __middle),
>@@ -5023,8 +5024,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
>       typedef typename iterator_traits<_RandomAccessIterator>::difference_type
> 	_DistanceType;
> 
>+      if (__first == __last)
>+	return;
>+
>       typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf;
>-      _TmpBuf __buf(__first, std::distance(__first, __last));
>+      _TmpBuf __buf(__first, (__last - __first + 1) / 2);

Please add a comment before this saying something like:

         // __stable_sort_adaptive sorts the range in two halves,
         // so the buffer only needs to fit half the range at once.

That explains why (last - first + 1)/2 is correct, and if we ever
change __stable_sort_adaptive we'll know to change this too.

> 
>       if (__buf.begin() == 0)
> 	std::__inplace_stable_sort(__first, __last, __comp);
>diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h
>index fa78ee53eb4..b6ad9ee6a46 100644
>--- a/libstdc++-v3/include/bits/stl_tempbuf.h
>+++ b/libstdc++-v3/include/bits/stl_tempbuf.h
>@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> 							std::nothrow));
> 	  if (__tmp != 0)
> 	    return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>-	  __len /= 2;
>+	  __len = __len == 1 ? 0 : ((__len + 1) / 2);
> 	}
>       return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>     }
>diff --git a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
>index 0b0c9c59640..9c393ce8fe3 100644
>--- a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
>+++ b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc
>@@ -28,7 +28,7 @@ using std::inplace_merge;
> 
> typedef test_container<int, bidirectional_iterator_wrapper> container;
> 
>-void 
>+void
> test1()
> {
>   int array[] = { 1 };
>@@ -39,7 +39,7 @@ test1()
>   inplace_merge(con2.begin(), con2.begin(), con2.end());
> }
> 
>-void 
>+void
> test2()
> {
>   int array[] = { 0, 2, 4, 1, 3, 5 };
>@@ -60,30 +60,34 @@ struct S
>   { return a < _s.a; }
> };
> 
>-void 
>+void
> test3()
> {
>   S s[8];
>-  s[0].a = 0;
>-  s[1].a = 1;
>-  s[2].a = 2;
>-  s[3].a = 3;
>-  s[4].a = 0;
>-  s[5].a = 1;
>-  s[6].a = 2;
>-  s[7].a = 3;
>-
>-  s[0].b = 0;
>-  s[1].b = 1;
>-  s[2].b = 2;
>-  s[3].b = 3;
>-  s[4].b = 4;
>-  s[5].b = 5;
>-  s[6].b = 6;
>-  s[7].b = 7;
>-
>-  inplace_merge(s, s + 4, s + 8);
>-  VERIFY( s[0].b == 0 && s[1].b == 4 && s[2].b == 1 && s[3].b == 5 );
>+  for (int pivot_idx = 0; pivot_idx < 8; ++pivot_idx)
>+    {
>+      int bval = 0;
>+      for (int i = 0; i != pivot_idx; ++i)
>+	{
>+	  s[i].a = i;
>+	  s[i].b = bval++;
>+	}
>+
>+      for (int i = pivot_idx; i != 8; ++i)
>+	{
>+	  s[i].a = i - pivot_idx;
>+	  s[i].b = bval++;
>+	}
>+
>+      inplace_merge(s, s + pivot_idx, s + 8);
>+
>+      for (int i = 1; i < 8; ++i)
>+	{
>+	  VERIFY( !(s[i] < s[i - 1]) );
>+	  if (s[i - 1].a == s[i].a)
>+	    VERIFY( s[i - 1].b < s[i].b );
>+	}
>+    }
> }

I know it's redundant, but I think I'd prefer to keep test3 unchanged
and add test4 for the more general test that uses different pivots.

OK for trunk with those changes.

Thanks for persisting with this.
diff mbox series

Patch

Index: include/bits/stl_algo.h
===================================================================
--- include/bits/stl_algo.h	(revision 256871)
+++ include/bits/stl_algo.h	(working copy)
@@ -2437,7 +2437,7 @@ 
 	  _BidirectionalIterator __second_cut = __middle;
 	  _Distance __len11 = 0;
 	  _Distance __len22 = 0;
-	  if (__len1 > __len2)
+	  if (__len1 < __len2)
 	    {
 	      __len11 = __len1 / 2;
 	      std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@ 
       const _DistanceType __len1 = std::distance(__first, __middle);
       const _DistanceType __len2 = std::distance(__middle, __last);

+
       typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
-      _TmpBuf __buf(__first, __last);
-
+      _BidirectionalIterator __start, __end;
+      if (__len1 < __len2) {
+	__start = __first; __end = __middle;
+      } else {
+	__start = __middle; __end = __last;
+      }
+      _TmpBuf __buf(__start, ___end);
       if (__buf.begin() == 0)
 	std::__merge_without_buffer
 	  (__first, __middle, __last, __len1, __len2, __comp);
Index: include/bits/stl_tempbuf.h
===================================================================
--- include/bits/stl_tempbuf.h	(revision 256871)
+++ include/bits/stl_tempbuf.h	(working copy)
@@ -95,7 +95,7 @@ 
 							std::nothrow));
 	  if (__tmp != 0)
 	    return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-	  __len /= 2;
+	  __len = (__len + 1) / 2;
 	}
       return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
     }