Message ID | CALnYPH-iONCpF595=MpWP=w_+4xGqU11=Br4Yh=g=wdf2Ken3A@mail.gmail.com |
---|---|
State | New |
Headers | show |
Series | [C++] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938) | expand |
This is a libstdc++ bug and patch, not the C++ front end. So I'm adding the libstdc++ list to CC. On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote: > Current std::inplace_merge() suffers from performance issue by inefficient > logic under limited memory, > > It leads to performance downgrade. > > Please help to review it. > > Index: include/bits/stl_algo.h > =================================================================== > --- include/bits/stl_algo.h (revision 256871) > +++ include/bits/stl_algo.h (working copy) > @@ -2437,7 +2437,7 @@ > _BidirectionalIterator __second_cut = __middle; > _Distance __len11 = 0; > _Distance __len22 = 0; > - if (__len1 > __len2) > + if (__len1 < __len2) > { > __len11 = __len1 / 2; > std::advance(__first_cut, __len11); > @@ -2539,9 +2539,15 @@ > const _DistanceType __len1 = std::distance(__first, __middle); > const _DistanceType __len2 = std::distance(__middle, __last); > > + > typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf; > - _TmpBuf __buf(__first, __last); > - > + _BidirectionalIterator __start, __end; > + if (__len1 < __len2) { > + __start = __first; __end = __middle; > + } else { > + __start = __middle; __end = __last; > + } > + _TmpBuf __buf(__start, ___end); > if (__buf.begin() == 0) > std::__merge_without_buffer > (__first, __middle, __last, __len1, __len2, __comp); > Index: include/bits/stl_tempbuf.h > =================================================================== > --- include/bits/stl_tempbuf.h (revision 256871) > +++ include/bits/stl_tempbuf.h (working copy) > @@ -95,7 +95,7 @@ > std::nothrow)); > if (__tmp != 0) > return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); > - __len /= 2; > + __len = (__len + 1) / 2; > } > return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); > } > > > > > Thanks
Hi: 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as __len = (__len == 1) ? 0 : ((__len + 1) / 2); 2. The coding gain has been shown PR c++/83938; I re-post here 21 20 19 18 17 16 0.471136 0.625695 0.767262 0.907461 1.04838 1.19508 0.340845 0.48651 0.639139 0.770133 0.898454 1.04632 it means when Merge [0, 4325376, 16777216); A is a sorted integer with 4325376 & B with 12451840 elements, total with 16M integers The proposed method has the speed up under given buffer size =, ex 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means given sizof(int)*64K bytes. 3. As your suggestion, _TmpBuf __buf should be rewrite. 4. It represents a fact that the intuitive idea to split from larger part is wrong. For example, if you have an input sorted array A & B, A has 8 integers & B has 24 integers. Given tmp buffer whose capacity = 4 integers. Current it tries to split from B, right? Then we have: A1 | A2 B1 | B2 B1 & B2 has 12 integers each, right? Current algorithm selects pivot as 13th integer from B, right? If the corresponding upper bound of A is 6th integer. Then it split in A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 After rotate, we have two arrays to merge [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12] Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. Sadly, [A1 = 5 | B1 = 12] CANNOT. So we do rotate again, split & merge the two split arrays from [A1 = 5 | B1 = 12] again. But wait, if we always split from the smaller one instead of larger one. After rotate, it promises two split arrays both contain ceiling[small/2]. Since tmp buffer also allocate by starting from sizeof(small) & recursively downgrade by ceiling[small/2^(# of fail allocate)]. It means the allocated tmp buffer promises to be sufficient at the level of (# of fail allocate). Instead, you can see if split from large at level (# of fail allocate) several split array still CANNOT use tmp buffer to do buffered merge. As you know, buffered merge is far speed then (split, rotate, and merge two sub-arrays) (PR c++/83938 gives the profiling figures), the way should provide speedup. Thanks. On 24/01/2018 18:23, François Dumont wrote: Hi It sounds like a very sensitive change to make but nothing worth figures. Do you have any bench showing the importance of the gain ? At least the memory usage optimization is obvious. On 19/01/2018 10:43, chang jc wrote: Current std::inplace_merge() suffers from performance issue by inefficient logic under limited memory, It leads to performance downgrade. Please help to review it. Index: include/bits/stl_algo.h =================================================================== --- include/bits/stl_algo.h (revision 256871) +++ include/bits/stl_algo.h (working copy) @@ -2437,7 +2437,7 @@ _BidirectionalIterator __second_cut = __middle; _Distance __len11 = 0; _Distance __len22 = 0; - if (__len1 > __len2) + if (__len1 < __len2) { __len11 = __len1 / 2; std::advance(__first_cut, __len11); @@ -2539,9 +2539,15 @@ const _DistanceType __len1 = std::distance(__first, __middle); const _DistanceType __len2 = std::distance(__middle, __last); + typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf; - _TmpBuf __buf(__first, __last); - + _BidirectionalIterator __start, __end; + if (__len1 < __len2) { + __start = __first; __end = __middle; + } else { + __start = __middle; __end = __last; + } + _TmpBuf __buf(__start, ___end); Note another optimization, we could introduce a _Temporary_buffer<> constructor in order to write: _TmpBuf __buf(std::min(__len1, __len2), __first); So that std::distance do not need to be called again. if (__buf.begin() == 0) std::__merge_without_buffer (__first, __middle, __last, __len1, __len2, __comp); Index: include/bits/stl_tempbuf.h =================================================================== --- include/bits/stl_tempbuf.h (revision 256871) +++ include/bits/stl_tempbuf.h (working copy) @@ -95,7 +95,7 @@ std::nothrow)); if (__tmp != 0) return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); - __len /= 2; + __len = (__len + 1) / 2; This part is problematic, if __len is 1 and allocation fails it will loop forever. It doesn't seem really necessary for your patch. 2018-01-20 4:05 GMT+08:00 Jason Merrill <jason@redhat.com>: > This is a libstdc++ bug and patch, not the C++ front end. So I'm > adding the libstdc++ list to CC. > > On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote: > > Current std::inplace_merge() suffers from performance issue by > inefficient > > logic under limited memory, > > > > It leads to performance downgrade. > > > > Please help to review it. > > > > Index: include/bits/stl_algo.h > > =================================================================== > > --- include/bits/stl_algo.h (revision 256871) > > +++ include/bits/stl_algo.h (working copy) > > @@ -2437,7 +2437,7 @@ > > _BidirectionalIterator __second_cut = __middle; > > _Distance __len11 = 0; > > _Distance __len22 = 0; > > - if (__len1 > __len2) > > + if (__len1 < __len2) > > { > > __len11 = __len1 / 2; > > std::advance(__first_cut, __len11); > > @@ -2539,9 +2539,15 @@ > > const _DistanceType __len1 = std::distance(__first, __middle); > > const _DistanceType __len2 = std::distance(__middle, __last); > > > > + > > typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> > _TmpBuf; > > - _TmpBuf __buf(__first, __last); > > - > > + _BidirectionalIterator __start, __end; > > + if (__len1 < __len2) { > > + __start = __first; __end = __middle; > > + } else { > > + __start = __middle; __end = __last; > > + } > > + _TmpBuf __buf(__start, ___end); > > if (__buf.begin() == 0) > > std::__merge_without_buffer > > (__first, __middle, __last, __len1, __len2, __comp); > > Index: include/bits/stl_tempbuf.h > > =================================================================== > > --- include/bits/stl_tempbuf.h (revision 256871) > > +++ include/bits/stl_tempbuf.h (working copy) > > @@ -95,7 +95,7 @@ > > std::nothrow)); > > if (__tmp != 0) > > return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); > > - __len /= 2; > > + __len = (__len + 1) / 2; > > } > > return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); > > } > > > > > > > > > > Thanks >
Your change looks good, don't hesitate to re-submit a proper patch with the point 1 below fix. You should add gcc-patches@gcc.gnu.org when you do so. Remaining questions will be: Copyright assignment, your change seems limited enough to avoid it but you will need an official maintainer confirmation. Stage 1: We are in a Fix-Only development phase. Your change brings such an important improvement that it could be considered as a bug fix but once again an official maintainer will have to decide. Otherwise you will have to wait for a couple of month to see you patch committed. François On 25/01/2018 23:37, chang jc wrote: > Hi: > > 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as > __len = (__len == 1) ? 0 : ((__len + 1) / 2); > > 2. The coding gain has been shown PR c++/83938; I re-post here > > > > > 21 > 20 > 19 > 18 > 17 > 16 > > > 0.471136 > 0.625695 > 0.767262 > 0.907461 > 1.04838 > 1.19508 > > > 0.340845 > 0.48651 > 0.639139 > 0.770133 > 0.898454 > 1.04632 > > it means when Merge [0, 4325376, 16777216); A is a sorted integer with > 4325376 & B with 12451840 elements, total with 16M integers > > The proposed method has the speed up under given buffer size =, ex > 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means > given sizof(int)*64K bytes. > > 3. As your suggestion, _TmpBuf __buf should be rewrite. > > 4. It represents a fact that the intuitive idea to split from larger > part is wrong. > > For example, if you have an input sorted array A & B, A has 8 integers > & B has 24 integers. Given tmp buffer whose capacity = 4 integers. > > Current it tries to split from B, right? > > Then we have: > > A1 | A2 B1 | B2 > > B1 & B2 has 12 integers each, right? > > Current algorithm selects pivot as 13th integer from B, right? > > If the corresponding upper bound of A is 6th integer. > > Then it split in > > A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 > > After rotate, we have two arrays to merge > > [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12] > > Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. > > Sadly, [A1 = 5 | B1 = 12] CANNOT. > > So we do rotate again, split & merge the two split arrays from [A1 = 5 > | B1 = 12] again. > > > But wait, if we always split from the smaller one instead of larger one. > > After rotate, it promises two split arrays both contain ceiling[small/2]. > > Since tmp buffer also allocate by starting from sizeof(small) & > recursively downgrade by ceiling[small/2^(# of fail allocate)]. > > It means the allocated tmp buffer promises to be sufficient at the > level of (# of fail allocate). > > Instead, you can see if split from large at level (# of fail allocate) > several split array still CANNOT use tmp buffer to do buffered merge. > > > As you know, buffered merge is far speed then (split, rotate, and > merge two sub-arrays) (PR c++/83938 gives the profiling figures), > > the way should provide speedup. > > > Thanks. > > > > > > > > > > > On 24/01/2018 18:23, François Dumont wrote: > > Hi > > > It sounds like a very sensitive change to make but nothing worth figures. > Do you have any bench showing the importance of the gain ? > > At least the memory usage optimization is obvious. > > On 19/01/2018 10:43, chang jc wrote: > > Current std::inplace_merge() suffers from performance issue by inefficient > > logic under limited memory, > > It leads to performance downgrade. > > Please help to review it. > > Index: include/bits/stl_algo.h > =================================================================== > --- include/bits/stl_algo.h (revision 256871) > +++ include/bits/stl_algo.h (working copy) > @@ -2437,7 +2437,7 @@ > _BidirectionalIterator __second_cut = __middle; > _Distance __len11 = 0; > _Distance __len22 = 0; > - if (__len1 > __len2) > + if (__len1 < __len2) > { > __len11 = __len1 / 2; > std::advance(__first_cut, __len11); > @@ -2539,9 +2539,15 @@ > const _DistanceType __len1 = std::distance(__first, __middle); > const _DistanceType __len2 = std::distance(__middle, __last); > > + > > typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> > _TmpBuf; > > - _TmpBuf __buf(__first, __last); > - > + _BidirectionalIterator __start, __end; > + if (__len1 < __len2) { > + __start = __first; __end = __middle; > + } else { > + __start = __middle; __end = __last; > + } > + _TmpBuf __buf(__start, ___end); > > Note another optimization, we could introduce a _Temporary_buffer<> constructor > in order to write: > > _TmpBuf __buf(std::min(__len1, __len2), __first); > > So that std::distance do not need to be called again. > > if (__buf.begin() == 0) > std::__merge_without_buffer > (__first, __middle, __last, __len1, __len2, __comp); > Index: include/bits/stl_tempbuf.h > =================================================================== > --- include/bits/stl_tempbuf.h (revision 256871) > +++ include/bits/stl_tempbuf.h (working copy) > @@ -95,7 +95,7 @@ > std::nothrow)); > if (__tmp != 0) > return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); > - __len /= 2; > + __len = (__len + 1) / 2; > > This part is problematic, if __len is 1 and allocation fails it will loop > forever. > > It doesn't seem really necessary for your patch. > > > 2018-01-20 4:05 GMT+08:00 Jason Merrill <jason@redhat.com>: > >> This is a libstdc++ bug and patch, not the C++ front end. So I'm >> adding the libstdc++ list to CC. >> >> On Fri, Jan 19, 2018 at 3:02 AM, chang jc <r97922153@gmail.com> wrote: >>> Current std::inplace_merge() suffers from performance issue by >> inefficient >>> logic under limited memory, >>> >>> It leads to performance downgrade. >>> >>> Please help to review it. >>> >>> Index: include/bits/stl_algo.h >>> =================================================================== >>> --- include/bits/stl_algo.h (revision 256871) >>> +++ include/bits/stl_algo.h (working copy) >>> @@ -2437,7 +2437,7 @@ >>> _BidirectionalIterator __second_cut = __middle; >>> _Distance __len11 = 0; >>> _Distance __len22 = 0; >>> - if (__len1 > __len2) >>> + if (__len1 < __len2) >>> { >>> __len11 = __len1 / 2; >>> std::advance(__first_cut, __len11); >>> @@ -2539,9 +2539,15 @@ >>> const _DistanceType __len1 = std::distance(__first, __middle); >>> const _DistanceType __len2 = std::distance(__middle, __last); >>> >>> + >>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >> _TmpBuf; >>> - _TmpBuf __buf(__first, __last); >>> - >>> + _BidirectionalIterator __start, __end; >>> + if (__len1 < __len2) { >>> + __start = __first; __end = __middle; >>> + } else { >>> + __start = __middle; __end = __last; >>> + } >>> + _TmpBuf __buf(__start, ___end); >>> if (__buf.begin() == 0) >>> std::__merge_without_buffer >>> (__first, __middle, __last, __len1, __len2, __comp); >>> Index: include/bits/stl_tempbuf.h >>> =================================================================== >>> --- include/bits/stl_tempbuf.h (revision 256871) >>> +++ include/bits/stl_tempbuf.h (working copy) >>> @@ -95,7 +95,7 @@ >>> std::nothrow)); >>> if (__tmp != 0) >>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>> - __len /= 2; >>> + __len = (__len + 1) / 2; >>> } >>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); >>> } >>> >>> >>> >>> >>> Thanks
Hi I review and rework this proposal. I noticed that the same idea to limit buffer size within inplace_merge also apply to stable_sort. I also change the decision when buffer is too small to consider the buffer size rather than going through successive cuts of the original ranges. This way we won't cut the range more than necessary. Note that if you prefer this second part could be committed seperatly. PR libstdc++/83938 * include/bits/stl_algo.h: (__stable_partition_adaptive): When buffer to small cut range using buffer size. (__inplace_merge): Take temporary buffer length from smallest range. (__merge_adaptive): When buffer too small consider smallest range first and cut based on buffer size. (__stable_sort_adaptive): When buffer too small cut based on buffer size. (__stable_sort): Limit temporary buffer length. * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function to reduce requested buffer length on allocation failure. Tested under Linux x86_64. Ok to commit ? François On 25/01/2018 23:37, chang jc wrote: > Hi: > > 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as > __len = (__len == 1) ? 0 : ((__len + 1) / 2); > > 2. The coding gain has been shown PR c++/83938; I re-post here > > > > > 21 > 20 > 19 > 18 > 17 > 16 > > > 0.471136 > 0.625695 > 0.767262 > 0.907461 > 1.04838 > 1.19508 > > > 0.340845 > 0.48651 > 0.639139 > 0.770133 > 0.898454 > 1.04632 > > it means when Merge [0, 4325376, 16777216); A is a sorted integer with > 4325376 & B with 12451840 elements, total with 16M integers > > The proposed method has the speed up under given buffer size =, ex > 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means > given sizof(int)*64K bytes. > > 3. As your suggestion, _TmpBuf __buf should be rewrite. > > 4. It represents a fact that the intuitive idea to split from larger > part is wrong. > > For example, if you have an input sorted array A & B, A has 8 integers > & B has 24 integers. Given tmp buffer whose capacity = 4 integers. > > Current it tries to split from B, right? > > Then we have: > > A1 | A2 B1 | B2 > > B1 & B2 has 12 integers each, right? > > Current algorithm selects pivot as 13th integer from B, right? > > If the corresponding upper bound of A is 6th integer. > > Then it split in > > A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 > > After rotate, we have two arrays to merge > > [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12] > > Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. > > Sadly, [A1 = 5 | B1 = 12] CANNOT. > > So we do rotate again, split & merge the two split arrays from [A1 = 5 > | B1 = 12] again. > > > But wait, if we always split from the smaller one instead of larger one. > > After rotate, it promises two split arrays both contain ceiling[small/2]. > > Since tmp buffer also allocate by starting from sizeof(small) & > recursively downgrade by ceiling[small/2^(# of fail allocate)]. > > It means the allocated tmp buffer promises to be sufficient at the > level of (# of fail allocate). > > Instead, you can see if split from large at level (# of fail allocate) > several split array still CANNOT use tmp buffer to do buffered merge. > > > As you know, buffered merge is far speed then (split, rotate, and > merge two sub-arrays) (PR c++/83938 gives the profiling figures), > > the way should provide speedup. > > > Thanks. > > > > > > > > > > > On 24/01/2018 18:23, François Dumont wrote: > > Hi > > > It sounds like a very sensitive change to make but nothing worth figures. > Do you have any bench showing the importance of the gain ? > > At least the memory usage optimization is obvious. > > On 19/01/2018 10:43, chang jc wrote: > > Current std::inplace_merge() suffers from performance issue by inefficient > > logic under limited memory, > > It leads to performance downgrade. > > Please help to review it. > > Index: include/bits/stl_algo.h > =================================================================== > --- include/bits/stl_algo.h (revision 256871) > +++ include/bits/stl_algo.h (working copy) > @@ -2437,7 +2437,7 @@ > _BidirectionalIterator __second_cut = __middle; > _Distance __len11 = 0; > _Distance __len22 = 0; > - if (__len1 > __len2) > + if (__len1 < __len2) > { > __len11 = __len1 / 2; > std::advance(__first_cut, __len11); > @@ -2539,9 +2539,15 @@ > const _DistanceType __len1 = std::distance(__first, __middle); > const _DistanceType __len2 = std::distance(__middle, __last); > > + > > typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> > _TmpBuf; > > - _TmpBuf __buf(__first, __last); > - > + _BidirectionalIterator __start, __end; > + if (__len1 < __len2) { > + __start = __first; __end = __middle; > + } else { > + __start = __middle; __end = __last; > + } > + _TmpBuf __buf(__start, ___end); > > Note another optimization, we could introduce a _Temporary_buffer<> constructor > in order to write: > > _TmpBuf __buf(std::min(__len1, __len2), __first); > > So that std::distance do not need to be called again. > > if (__buf.begin() == 0) > std::__merge_without_buffer > (__first, __middle, __last, __len1, __len2, __comp); > Index: include/bits/stl_tempbuf.h > =================================================================== > --- include/bits/stl_tempbuf.h (revision 256871) > +++ include/bits/stl_tempbuf.h (working copy) > @@ -95,7 +95,7 @@ > std::nothrow)); > if (__tmp != 0) > return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); > - __len /= 2; > + __len = (__len + 1) / 2; > > This part is problematic, if __len is 1 and allocation fails it will loop > forever. > > It doesn't seem really necessary for your patch. > > > 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>: > >> This is a libstdc++ bug and patch, not the C++ front end. So I'm >> adding the libstdc++ list to CC. >> >> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com> wrote: >>> Current std::inplace_merge() suffers from performance issue by >> inefficient >>> logic under limited memory, >>> >>> It leads to performance downgrade. >>> >>> Please help to review it. >>> >>> Index: include/bits/stl_algo.h >>> =================================================================== >>> --- include/bits/stl_algo.h (revision 256871) >>> +++ include/bits/stl_algo.h (working copy) >>> @@ -2437,7 +2437,7 @@ >>> _BidirectionalIterator __second_cut = __middle; >>> _Distance __len11 = 0; >>> _Distance __len22 = 0; >>> - if (__len1 > __len2) >>> + if (__len1 < __len2) >>> { >>> __len11 = __len1 / 2; >>> std::advance(__first_cut, __len11); >>> @@ -2539,9 +2539,15 @@ >>> const _DistanceType __len1 = std::distance(__first, __middle); >>> const _DistanceType __len2 = std::distance(__middle, __last); >>> >>> + >>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >> _TmpBuf; >>> - _TmpBuf __buf(__first, __last); >>> - >>> + _BidirectionalIterator __start, __end; >>> + if (__len1 < __len2) { >>> + __start = __first; __end = __middle; >>> + } else { >>> + __start = __middle; __end = __last; >>> + } >>> + _TmpBuf __buf(__start, ___end); >>> if (__buf.begin() == 0) >>> std::__merge_without_buffer >>> (__first, __middle, __last, __len1, __len2, __comp); >>> Index: include/bits/stl_tempbuf.h >>> =================================================================== >>> --- include/bits/stl_tempbuf.h (revision 256871) >>> +++ include/bits/stl_tempbuf.h (working copy) >>> @@ -95,7 +95,7 @@ >>> std::nothrow)); >>> if (__tmp != 0) >>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>> - __len /= 2; >>> + __len = (__len + 1) / 2; >>> } >>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); >>> } >>> >>> >>> >>> >>> Thanks diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h index d8488a7..bf5b25f 100644 --- a/libstdc++-v3/include/bits/stl_algo.h +++ b/libstdc++-v3/include/bits/stl_algo.h @@ -1577,15 +1577,15 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } _ForwardIterator __middle = __first; - std::advance(__middle, __len / 2); + std::advance(__middle, __buffer_size); _ForwardIterator __left_split = - std::__stable_partition_adaptive(__first, __middle, __pred, - __len / 2, __buffer, - __buffer_size); + std::__stable_partition_adaptive(__first, __middle, + __pred, __buffer_size, + __buffer, __buffer_size); // Advance past true-predicate values to satisfy this // function's preconditions. - _Distance __right_len = __len - __len / 2; + _Distance __right_len = __len - __buffer_size; _ForwardIterator __right_split = std::__find_if_not_n(__middle, __right_len, __pred); @@ -2432,9 +2432,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _BidirectionalIterator __second_cut = __middle; _Distance __len11 = 0; _Distance __len22 = 0; - if (__len1 > __len2) + if (__len1 < __len2) { - __len11 = __len1 / 2; + __len11 = __buffer_size; std::advance(__first_cut, __len11); __second_cut = std::__lower_bound(__middle, __last, *__first_cut, @@ -2443,7 +2443,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } else { - __len22 = __len2 / 2; + __len22 = __buffer_size; std::advance(__second_cut, __len22); __first_cut = std::__upper_bound(__first, __middle, *__second_cut, @@ -2453,14 +2453,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _BidirectionalIterator __new_middle = std::__rotate_adaptive(__first_cut, __middle, __second_cut, - __len1 - __len11, __len22, __buffer, - __buffer_size); - std::__merge_adaptive(__first, __first_cut, __new_middle, __len11, - __len22, __buffer, __buffer_size, __comp); + __len1 - __len11, __len22, + __buffer, __buffer_size); + std::__merge_adaptive(__first, __first_cut, __new_middle, + __len11, __len22, + __buffer, __buffer_size, __comp); std::__merge_adaptive(__new_middle, __second_cut, __last, - __len1 - __len11, - __len2 - __len22, __buffer, - __buffer_size, __comp); + __len1 - __len11, __len2 - __len22, + __buffer, __buffer_size, __comp); } } @@ -2535,7 +2535,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION const _DistanceType __len2 = std::distance(__middle, __last); typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf; - _TmpBuf __buf(__first, __len1 + __len2); + _TmpBuf __buf(__first, std::min(__len1, __len2)); if (__buf.begin() == 0) std::__merge_without_buffer @@ -2730,9 +2730,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _Compare __comp) { const _Distance __len = (__last - __first + 1) / 2; - const _RandomAccessIterator __middle = __first + __len; + _RandomAccessIterator __middle; if (__len > __buffer_size) { + __middle = __first + __buffer_size + __buffer_size; std::__stable_sort_adaptive(__first, __middle, __buffer, __buffer_size, __comp); std::__stable_sort_adaptive(__middle, __last, __buffer, @@ -2740,9 +2741,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } else { + __middle = __first + __len; std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp); std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp); } + std::__merge_adaptive(__first, __middle, __last, _Distance(__middle - __first), _Distance(__last - __middle), @@ -4992,8 +4995,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO typedef typename iterator_traits<_RandomAccessIterator>::difference_type _DistanceType; + if (__first == __last) + return; + typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf; - _TmpBuf __buf(__first, std::distance(__first, __last)); + _TmpBuf __buf(__first, (__last - __first + 1) / 2); if (__buf.begin() == 0) std::__inplace_stable_sort(__first, __last, __comp); diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h index 159ee27..7fb97d0 100644 --- a/libstdc++-v3/include/bits/stl_tempbuf.h +++ b/libstdc++-v3/include/bits/stl_tempbuf.h @@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION std::nothrow)); if (__tmp != 0) return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); - __len /= 2; + __len = __len == 1 ? 0 : ((__len + 1) / 2); } return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); }
Hi Any chance to review this patch ? François On 06/06/2018 18:39, François Dumont wrote: > Hi > > I review and rework this proposal. I noticed that the same idea to > limit buffer size within inplace_merge also apply to stable_sort. > > I also change the decision when buffer is too small to consider > the buffer size rather than going through successive cuts of the > original ranges. This way we won't cut the range more than necessary. > Note that if you prefer this second part could be committed seperatly. > > PR libstdc++/83938 > * include/bits/stl_algo.h: > (__stable_partition_adaptive): When buffer to small cut range using > buffer size. > (__inplace_merge): Take temporary buffer length from smallest range. > (__merge_adaptive): When buffer too small consider smallest range > first > and cut based on buffer size. > (__stable_sort_adaptive): When buffer too small cut based on buffer > size. > (__stable_sort): Limit temporary buffer length. > * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function > to reduce requested buffer length on allocation failure. > > Tested under Linux x86_64. > > Ok to commit ? > > François > > > On 25/01/2018 23:37, chang jc wrote: >> Hi: >> >> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as >> __len = (__len == 1) ? 0 : ((__len + 1) / 2); >> >> 2. The coding gain has been shown PR c++/83938; I re-post here >> >> >> >> >> 21 >> 20 >> 19 >> 18 >> 17 >> 16 >> >> >> 0.471136 >> 0.625695 >> 0.767262 >> 0.907461 >> 1.04838 >> 1.19508 >> >> >> 0.340845 >> 0.48651 >> 0.639139 >> 0.770133 >> 0.898454 >> 1.04632 >> >> it means when Merge [0, 4325376, 16777216); A is a sorted integer with >> 4325376 & B with 12451840 elements, total with 16M integers >> >> The proposed method has the speed up under given buffer size =, ex >> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means >> given sizof(int)*64K bytes. >> >> 3. As your suggestion, _TmpBuf __buf should be rewrite. >> >> 4. It represents a fact that the intuitive idea to split from larger >> part is wrong. >> >> For example, if you have an input sorted array A & B, A has 8 integers >> & B has 24 integers. Given tmp buffer whose capacity = 4 integers. >> >> Current it tries to split from B, right? >> >> Then we have: >> >> A1 | A2 B1 | B2 >> >> B1 & B2 has 12 integers each, right? >> >> Current algorithm selects pivot as 13th integer from B, right? >> >> If the corresponding upper bound of A is 6th integer. >> >> Then it split in >> >> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 >> >> After rotate, we have two arrays to merge >> >> [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12] >> >> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. >> >> Sadly, [A1 = 5 | B1 = 12] CANNOT. >> >> So we do rotate again, split & merge the two split arrays from [A1 = 5 >> | B1 = 12] again. >> >> >> But wait, if we always split from the smaller one instead of larger one. >> >> After rotate, it promises two split arrays both contain >> ceiling[small/2]. >> >> Since tmp buffer also allocate by starting from sizeof(small) & >> recursively downgrade by ceiling[small/2^(# of fail allocate)]. >> >> It means the allocated tmp buffer promises to be sufficient at the >> level of (# of fail allocate). >> >> Instead, you can see if split from large at level (# of fail allocate) >> several split array still CANNOT use tmp buffer to do buffered merge. >> >> >> As you know, buffered merge is far speed then (split, rotate, and >> merge two sub-arrays) (PR c++/83938 gives the profiling figures), >> >> the way should provide speedup. >> >> >> Thanks. >> >> >> >> >> >> >> >> >> >> >> On 24/01/2018 18:23, François Dumont wrote: >> >> Hi >> >> >> It sounds like a very sensitive change to make but nothing worth >> figures. >> Do you have any bench showing the importance of the gain ? >> >> At least the memory usage optimization is obvious. >> >> On 19/01/2018 10:43, chang jc wrote: >> >> Current std::inplace_merge() suffers from performance issue by >> inefficient >> >> logic under limited memory, >> >> It leads to performance downgrade. >> >> Please help to review it. >> >> Index: include/bits/stl_algo.h >> =================================================================== >> --- include/bits/stl_algo.h (revision 256871) >> +++ include/bits/stl_algo.h (working copy) >> @@ -2437,7 +2437,7 @@ >> _BidirectionalIterator __second_cut = __middle; >> _Distance __len11 = 0; >> _Distance __len22 = 0; >> - if (__len1 > __len2) >> + if (__len1 < __len2) >> { >> __len11 = __len1 / 2; >> std::advance(__first_cut, __len11); >> @@ -2539,9 +2539,15 @@ >> const _DistanceType __len1 = std::distance(__first, __middle); >> const _DistanceType __len2 = std::distance(__middle, __last); >> >> + >> >> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >> _TmpBuf; >> >> - _TmpBuf __buf(__first, __last); >> - >> + _BidirectionalIterator __start, __end; >> + if (__len1 < __len2) { >> + __start = __first; __end = __middle; >> + } else { >> + __start = __middle; __end = __last; >> + } >> + _TmpBuf __buf(__start, ___end); >> >> Note another optimization, we could introduce a _Temporary_buffer<> >> constructor >> in order to write: >> >> _TmpBuf __buf(std::min(__len1, __len2), __first); >> >> So that std::distance do not need to be called again. >> >> if (__buf.begin() == 0) >> std::__merge_without_buffer >> (__first, __middle, __last, __len1, __len2, __comp); >> Index: include/bits/stl_tempbuf.h >> =================================================================== >> --- include/bits/stl_tempbuf.h (revision 256871) >> +++ include/bits/stl_tempbuf.h (working copy) >> @@ -95,7 +95,7 @@ >> std::nothrow)); >> if (__tmp != 0) >> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >> - __len /= 2; >> + __len = (__len + 1) / 2; >> >> This part is problematic, if __len is 1 and allocation fails it will >> loop >> forever. >> >> It doesn't seem really necessary for your patch. >> >> >> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>: >> >>> This is a libstdc++ bug and patch, not the C++ front end. So I'm >>> adding the libstdc++ list to CC. >>> >>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com> wrote: >>>> Current std::inplace_merge() suffers from performance issue by >>> inefficient >>>> logic under limited memory, >>>> >>>> It leads to performance downgrade. >>>> >>>> Please help to review it. >>>> >>>> Index: include/bits/stl_algo.h >>>> =================================================================== >>>> --- include/bits/stl_algo.h (revision 256871) >>>> +++ include/bits/stl_algo.h (working copy) >>>> @@ -2437,7 +2437,7 @@ >>>> _BidirectionalIterator __second_cut = __middle; >>>> _Distance __len11 = 0; >>>> _Distance __len22 = 0; >>>> - if (__len1 > __len2) >>>> + if (__len1 < __len2) >>>> { >>>> __len11 = __len1 / 2; >>>> std::advance(__first_cut, __len11); >>>> @@ -2539,9 +2539,15 @@ >>>> const _DistanceType __len1 = std::distance(__first, __middle); >>>> const _DistanceType __len2 = std::distance(__middle, __last); >>>> >>>> + >>>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >>> _TmpBuf; >>>> - _TmpBuf __buf(__first, __last); >>>> - >>>> + _BidirectionalIterator __start, __end; >>>> + if (__len1 < __len2) { >>>> + __start = __first; __end = __middle; >>>> + } else { >>>> + __start = __middle; __end = __last; >>>> + } >>>> + _TmpBuf __buf(__start, ___end); >>>> if (__buf.begin() == 0) >>>> std::__merge_without_buffer >>>> (__first, __middle, __last, __len1, __len2, __comp); >>>> Index: include/bits/stl_tempbuf.h >>>> =================================================================== >>>> --- include/bits/stl_tempbuf.h (revision 256871) >>>> +++ include/bits/stl_tempbuf.h (working copy) >>>> @@ -95,7 +95,7 @@ >>>> std::nothrow)); >>>> if (__tmp != 0) >>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>> - __len /= 2; >>>> + __len = (__len + 1) / 2; >>>> } >>>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); >>>> } >>>> >>>> >>>> >>>> >>>> Thanks > >
I missed a test that was failing because of this patch. So I revert a small part of it and here is the new proposal. Tested under Linux x86_64, ok to commit ? François On 24/07/2018 12:22, François Dumont wrote: > Hi > > Any chance to review this patch ? > > François > > > On 06/06/2018 18:39, François Dumont wrote: >> Hi >> >> I review and rework this proposal. I noticed that the same idea >> to limit buffer size within inplace_merge also apply to stable_sort. >> >> I also change the decision when buffer is too small to consider >> the buffer size rather than going through successive cuts of the >> original ranges. This way we won't cut the range more than necessary. >> Note that if you prefer this second part could be committed seperatly. >> >> PR libstdc++/83938 >> * include/bits/stl_algo.h: >> (__stable_partition_adaptive): When buffer to small cut range using >> buffer size. >> (__inplace_merge): Take temporary buffer length from smallest range. >> (__merge_adaptive): When buffer too small consider smallest range >> first >> and cut based on buffer size. >> (__stable_sort_adaptive): When buffer too small cut based on buffer >> size. >> (__stable_sort): Limit temporary buffer length. >> * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function >> to reduce requested buffer length on allocation failure. >> >> Tested under Linux x86_64. >> >> Ok to commit ? >> >> François >> >> >> On 25/01/2018 23:37, chang jc wrote: >>> Hi: >>> >>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as >>> __len = (__len == 1) ? 0 : ((__len + 1) / 2); >>> >>> 2. The coding gain has been shown PR c++/83938; I re-post here >>> >>> >>> >>> >>> 21 >>> 20 >>> 19 >>> 18 >>> 17 >>> 16 >>> >>> >>> 0.471136 >>> 0.625695 >>> 0.767262 >>> 0.907461 >>> 1.04838 >>> 1.19508 >>> >>> >>> 0.340845 >>> 0.48651 >>> 0.639139 >>> 0.770133 >>> 0.898454 >>> 1.04632 >>> >>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with >>> 4325376 & B with 12451840 elements, total with 16M integers >>> >>> The proposed method has the speed up under given buffer size =, ex >>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means >>> given sizof(int)*64K bytes. >>> >>> 3. As your suggestion, _TmpBuf __buf should be rewrite. >>> >>> 4. It represents a fact that the intuitive idea to split from larger >>> part is wrong. >>> >>> For example, if you have an input sorted array A & B, A has 8 integers >>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers. >>> >>> Current it tries to split from B, right? >>> >>> Then we have: >>> >>> A1 | A2 B1 | B2 >>> >>> B1 & B2 has 12 integers each, right? >>> >>> Current algorithm selects pivot as 13th integer from B, right? >>> >>> If the corresponding upper bound of A is 6th integer. >>> >>> Then it split in >>> >>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 >>> >>> After rotate, we have two arrays to merge >>> >>> [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12] >>> >>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. >>> >>> Sadly, [A1 = 5 | B1 = 12] CANNOT. >>> >>> So we do rotate again, split & merge the two split arrays from [A1 = 5 >>> | B1 = 12] again. >>> >>> >>> But wait, if we always split from the smaller one instead of larger >>> one. >>> >>> After rotate, it promises two split arrays both contain >>> ceiling[small/2]. >>> >>> Since tmp buffer also allocate by starting from sizeof(small) & >>> recursively downgrade by ceiling[small/2^(# of fail allocate)]. >>> >>> It means the allocated tmp buffer promises to be sufficient at the >>> level of (# of fail allocate). >>> >>> Instead, you can see if split from large at level (# of fail allocate) >>> several split array still CANNOT use tmp buffer to do buffered merge. >>> >>> >>> As you know, buffered merge is far speed then (split, rotate, and >>> merge two sub-arrays) (PR c++/83938 gives the profiling figures), >>> >>> the way should provide speedup. >>> >>> >>> Thanks. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On 24/01/2018 18:23, François Dumont wrote: >>> >>> Hi >>> >>> >>> It sounds like a very sensitive change to make but nothing >>> worth figures. >>> Do you have any bench showing the importance of the gain ? >>> >>> At least the memory usage optimization is obvious. >>> >>> On 19/01/2018 10:43, chang jc wrote: >>> >>> Current std::inplace_merge() suffers from performance issue by >>> inefficient >>> >>> logic under limited memory, >>> >>> It leads to performance downgrade. >>> >>> Please help to review it. >>> >>> Index: include/bits/stl_algo.h >>> =================================================================== >>> --- include/bits/stl_algo.h (revision 256871) >>> +++ include/bits/stl_algo.h (working copy) >>> @@ -2437,7 +2437,7 @@ >>> _BidirectionalIterator __second_cut = __middle; >>> _Distance __len11 = 0; >>> _Distance __len22 = 0; >>> - if (__len1 > __len2) >>> + if (__len1 < __len2) >>> { >>> __len11 = __len1 / 2; >>> std::advance(__first_cut, __len11); >>> @@ -2539,9 +2539,15 @@ >>> const _DistanceType __len1 = std::distance(__first, __middle); >>> const _DistanceType __len2 = std::distance(__middle, __last); >>> >>> + >>> >>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >>> _TmpBuf; >>> >>> - _TmpBuf __buf(__first, __last); >>> - >>> + _BidirectionalIterator __start, __end; >>> + if (__len1 < __len2) { >>> + __start = __first; __end = __middle; >>> + } else { >>> + __start = __middle; __end = __last; >>> + } >>> + _TmpBuf __buf(__start, ___end); >>> >>> Note another optimization, we could introduce a _Temporary_buffer<> >>> constructor >>> in order to write: >>> >>> _TmpBuf __buf(std::min(__len1, __len2), __first); >>> >>> So that std::distance do not need to be called again. >>> >>> if (__buf.begin() == 0) >>> std::__merge_without_buffer >>> (__first, __middle, __last, __len1, __len2, __comp); >>> Index: include/bits/stl_tempbuf.h >>> =================================================================== >>> --- include/bits/stl_tempbuf.h (revision 256871) >>> +++ include/bits/stl_tempbuf.h (working copy) >>> @@ -95,7 +95,7 @@ >>> std::nothrow)); >>> if (__tmp != 0) >>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>> - __len /= 2; >>> + __len = (__len + 1) / 2; >>> >>> This part is problematic, if __len is 1 and allocation fails it will >>> loop >>> forever. >>> >>> It doesn't seem really necessary for your patch. >>> >>> >>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>: >>> >>>> This is a libstdc++ bug and patch, not the C++ front end. So I'm >>>> adding the libstdc++ list to CC. >>>> >>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com> wrote: >>>>> Current std::inplace_merge() suffers from performance issue by >>>> inefficient >>>>> logic under limited memory, >>>>> >>>>> It leads to performance downgrade. >>>>> >>>>> Please help to review it. >>>>> >>>>> Index: include/bits/stl_algo.h >>>>> =================================================================== >>>>> --- include/bits/stl_algo.h (revision 256871) >>>>> +++ include/bits/stl_algo.h (working copy) >>>>> @@ -2437,7 +2437,7 @@ >>>>> _BidirectionalIterator __second_cut = __middle; >>>>> _Distance __len11 = 0; >>>>> _Distance __len22 = 0; >>>>> - if (__len1 > __len2) >>>>> + if (__len1 < __len2) >>>>> { >>>>> __len11 = __len1 / 2; >>>>> std::advance(__first_cut, __len11); >>>>> @@ -2539,9 +2539,15 @@ >>>>> const _DistanceType __len1 = std::distance(__first, >>>>> __middle); >>>>> const _DistanceType __len2 = std::distance(__middle, __last); >>>>> >>>>> + >>>>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >>>> _TmpBuf; >>>>> - _TmpBuf __buf(__first, __last); >>>>> - >>>>> + _BidirectionalIterator __start, __end; >>>>> + if (__len1 < __len2) { >>>>> + __start = __first; __end = __middle; >>>>> + } else { >>>>> + __start = __middle; __end = __last; >>>>> + } >>>>> + _TmpBuf __buf(__start, ___end); >>>>> if (__buf.begin() == 0) >>>>> std::__merge_without_buffer >>>>> (__first, __middle, __last, __len1, __len2, __comp); >>>>> Index: include/bits/stl_tempbuf.h >>>>> =================================================================== >>>>> --- include/bits/stl_tempbuf.h (revision 256871) >>>>> +++ include/bits/stl_tempbuf.h (working copy) >>>>> @@ -95,7 +95,7 @@ >>>>> std::nothrow)); >>>>> if (__tmp != 0) >>>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>>> - __len /= 2; >>>>> + __len = (__len + 1) / 2; >>>>> } >>>>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); >>>>> } >>>>> >>>>> >>>>> >>>>> >>>>> Thanks >> >> > diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h index 04e152a..eb58efe 100644 --- a/libstdc++-v3/include/bits/stl_algo.h +++ b/libstdc++-v3/include/bits/stl_algo.h @@ -2416,9 +2416,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _BidirectionalIterator __second_cut = __middle; _Distance __len11 = 0; _Distance __len22 = 0; - if (__len1 > __len2) + if (__len1 < __len2) { - __len11 = __len1 / 2; + __len11 = __buffer_size; std::advance(__first_cut, __len11); __second_cut = std::__lower_bound(__middle, __last, *__first_cut, @@ -2427,7 +2427,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } else { - __len22 = __len2 / 2; + __len22 = __buffer_size; std::advance(__second_cut, __len22); __first_cut = std::__upper_bound(__first, __middle, *__second_cut, @@ -2437,14 +2437,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _BidirectionalIterator __new_middle = std::__rotate_adaptive(__first_cut, __middle, __second_cut, - __len1 - __len11, __len22, __buffer, - __buffer_size); - std::__merge_adaptive(__first, __first_cut, __new_middle, __len11, - __len22, __buffer, __buffer_size, __comp); + __len1 - __len11, __len22, + __buffer, __buffer_size); + std::__merge_adaptive(__first, __first_cut, __new_middle, + __len11, __len22, + __buffer, __buffer_size, __comp); std::__merge_adaptive(__new_middle, __second_cut, __last, - __len1 - __len11, - __len2 - __len22, __buffer, - __buffer_size, __comp); + __len1 - __len11, __len2 - __len22, + __buffer, __buffer_size, __comp); } } @@ -2518,7 +2518,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION const _DistanceType __len2 = std::distance(__middle, __last); typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf; - _TmpBuf __buf(__first, __len1 + __len2); + _TmpBuf __buf(__first, std::min(__len1, __len2)); if (__buf.begin() == 0) std::__merge_without_buffer @@ -2713,9 +2713,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _Compare __comp) { const _Distance __len = (__last - __first + 1) / 2; - const _RandomAccessIterator __middle = __first + __len; + _RandomAccessIterator __middle; if (__len > __buffer_size) { + __middle = __first + __buffer_size + __buffer_size; std::__stable_sort_adaptive(__first, __middle, __buffer, __buffer_size, __comp); std::__stable_sort_adaptive(__middle, __last, __buffer, @@ -2723,9 +2724,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } else { + __middle = __first + __len; std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp); std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp); } + std::__merge_adaptive(__first, __middle, __last, _Distance(__middle - __first), _Distance(__last - __middle), @@ -4975,8 +4978,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO typedef typename iterator_traits<_RandomAccessIterator>::difference_type _DistanceType; + if (__first == __last) + return; + typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf; - _TmpBuf __buf(__first, std::distance(__first, __last)); + _TmpBuf __buf(__first, (__last - __first + 1) / 2); if (__buf.begin() == 0) std::__inplace_stable_sort(__first, __last, __comp); diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h index 0abd3c12..e177272 100644 --- a/libstdc++-v3/include/bits/stl_tempbuf.h +++ b/libstdc++-v3/include/bits/stl_tempbuf.h @@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION std::nothrow)); if (__tmp != 0) return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); - __len /= 2; + __len = __len == 1 ? 0 : ((__len + 1) / 2); } return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); }
Hi Some feedback regarding this patch ? Thanks, François On 8/21/18 10:34 PM, François Dumont wrote: > I missed a test that was failing because of this patch. So I revert a > small part of it and here is the new proposal. > > Tested under Linux x86_64, ok to commit ? > > François > > > On 24/07/2018 12:22, François Dumont wrote: >> Hi >> >> Any chance to review this patch ? >> >> François >> >> >> On 06/06/2018 18:39, François Dumont wrote: >>> Hi >>> >>> I review and rework this proposal. I noticed that the same idea >>> to limit buffer size within inplace_merge also apply to stable_sort. >>> >>> I also change the decision when buffer is too small to consider >>> the buffer size rather than going through successive cuts of the >>> original ranges. This way we won't cut the range more than >>> necessary. Note that if you prefer this second part could be >>> committed seperatly. >>> >>> PR libstdc++/83938 >>> * include/bits/stl_algo.h: >>> (__stable_partition_adaptive): When buffer to small cut range using >>> buffer size. >>> (__inplace_merge): Take temporary buffer length from smallest >>> range. >>> (__merge_adaptive): When buffer too small consider smallest >>> range first >>> and cut based on buffer size. >>> (__stable_sort_adaptive): When buffer too small cut based on buffer >>> size. >>> (__stable_sort): Limit temporary buffer length. >>> * include/bits/stl_tempbuf.h (get_temporary_buffer): Change >>> function >>> to reduce requested buffer length on allocation failure. >>> >>> Tested under Linux x86_64. >>> >>> Ok to commit ? >>> >>> François >>> >>> >>> On 25/01/2018 23:37, chang jc wrote: >>>> Hi: >>>> >>>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as >>>> __len = (__len == 1) ? 0 : ((__len + 1) / 2); >>>> >>>> 2. The coding gain has been shown PR c++/83938; I re-post here >>>> >>>> >>>> >>>> >>>> 21 >>>> 20 >>>> 19 >>>> 18 >>>> 17 >>>> 16 >>>> >>>> >>>> 0.471136 >>>> 0.625695 >>>> 0.767262 >>>> 0.907461 >>>> 1.04838 >>>> 1.19508 >>>> >>>> >>>> 0.340845 >>>> 0.48651 >>>> 0.639139 >>>> 0.770133 >>>> 0.898454 >>>> 1.04632 >>>> >>>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with >>>> 4325376 & B with 12451840 elements, total with 16M integers >>>> >>>> The proposed method has the speed up under given buffer size =, ex >>>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means >>>> given sizof(int)*64K bytes. >>>> >>>> 3. As your suggestion, _TmpBuf __buf should be rewrite. >>>> >>>> 4. It represents a fact that the intuitive idea to split from larger >>>> part is wrong. >>>> >>>> For example, if you have an input sorted array A & B, A has 8 integers >>>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers. >>>> >>>> Current it tries to split from B, right? >>>> >>>> Then we have: >>>> >>>> A1 | A2 B1 | B2 >>>> >>>> B1 & B2 has 12 integers each, right? >>>> >>>> Current algorithm selects pivot as 13th integer from B, right? >>>> >>>> If the corresponding upper bound of A is 6th integer. >>>> >>>> Then it split in >>>> >>>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 >>>> >>>> After rotate, we have two arrays to merge >>>> >>>> [A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12] >>>> >>>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. >>>> >>>> Sadly, [A1 = 5 | B1 = 12] CANNOT. >>>> >>>> So we do rotate again, split & merge the two split arrays from [A1 = 5 >>>> | B1 = 12] again. >>>> >>>> >>>> But wait, if we always split from the smaller one instead of larger >>>> one. >>>> >>>> After rotate, it promises two split arrays both contain >>>> ceiling[small/2]. >>>> >>>> Since tmp buffer also allocate by starting from sizeof(small) & >>>> recursively downgrade by ceiling[small/2^(# of fail allocate)]. >>>> >>>> It means the allocated tmp buffer promises to be sufficient at the >>>> level of (# of fail allocate). >>>> >>>> Instead, you can see if split from large at level (# of fail allocate) >>>> several split array still CANNOT use tmp buffer to do buffered merge. >>>> >>>> >>>> As you know, buffered merge is far speed then (split, rotate, and >>>> merge two sub-arrays) (PR c++/83938 gives the profiling figures), >>>> >>>> the way should provide speedup. >>>> >>>> >>>> Thanks. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 24/01/2018 18:23, François Dumont wrote: >>>> >>>> Hi >>>> >>>> >>>> It sounds like a very sensitive change to make but nothing >>>> worth figures. >>>> Do you have any bench showing the importance of the gain ? >>>> >>>> At least the memory usage optimization is obvious. >>>> >>>> On 19/01/2018 10:43, chang jc wrote: >>>> >>>> Current std::inplace_merge() suffers from performance issue by >>>> inefficient >>>> >>>> logic under limited memory, >>>> >>>> It leads to performance downgrade. >>>> >>>> Please help to review it. >>>> >>>> Index: include/bits/stl_algo.h >>>> =================================================================== >>>> --- include/bits/stl_algo.h (revision 256871) >>>> +++ include/bits/stl_algo.h (working copy) >>>> @@ -2437,7 +2437,7 @@ >>>> _BidirectionalIterator __second_cut = __middle; >>>> _Distance __len11 = 0; >>>> _Distance __len22 = 0; >>>> - if (__len1 > __len2) >>>> + if (__len1 < __len2) >>>> { >>>> __len11 = __len1 / 2; >>>> std::advance(__first_cut, __len11); >>>> @@ -2539,9 +2539,15 @@ >>>> const _DistanceType __len1 = std::distance(__first, >>>> __middle); >>>> const _DistanceType __len2 = std::distance(__middle, __last); >>>> >>>> + >>>> >>>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >>>> _TmpBuf; >>>> >>>> - _TmpBuf __buf(__first, __last); >>>> - >>>> + _BidirectionalIterator __start, __end; >>>> + if (__len1 < __len2) { >>>> + __start = __first; __end = __middle; >>>> + } else { >>>> + __start = __middle; __end = __last; >>>> + } >>>> + _TmpBuf __buf(__start, ___end); >>>> >>>> Note another optimization, we could introduce a _Temporary_buffer<> >>>> constructor >>>> in order to write: >>>> >>>> _TmpBuf __buf(std::min(__len1, __len2), __first); >>>> >>>> So that std::distance do not need to be called again. >>>> >>>> if (__buf.begin() == 0) >>>> std::__merge_without_buffer >>>> (__first, __middle, __last, __len1, __len2, __comp); >>>> Index: include/bits/stl_tempbuf.h >>>> =================================================================== >>>> --- include/bits/stl_tempbuf.h (revision 256871) >>>> +++ include/bits/stl_tempbuf.h (working copy) >>>> @@ -95,7 +95,7 @@ >>>> std::nothrow)); >>>> if (__tmp != 0) >>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>> - __len /= 2; >>>> + __len = (__len + 1) / 2; >>>> >>>> This part is problematic, if __len is 1 and allocation fails it >>>> will loop >>>> forever. >>>> >>>> It doesn't seem really necessary for your patch. >>>> >>>> >>>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>: >>>> >>>>> This is a libstdc++ bug and patch, not the C++ front end. So I'm >>>>> adding the libstdc++ list to CC. >>>>> >>>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<r97922153@gmail.com> >>>>> wrote: >>>>>> Current std::inplace_merge() suffers from performance issue by >>>>> inefficient >>>>>> logic under limited memory, >>>>>> >>>>>> It leads to performance downgrade. >>>>>> >>>>>> Please help to review it. >>>>>> >>>>>> Index: include/bits/stl_algo.h >>>>>> =================================================================== >>>>>> --- include/bits/stl_algo.h (revision 256871) >>>>>> +++ include/bits/stl_algo.h (working copy) >>>>>> @@ -2437,7 +2437,7 @@ >>>>>> _BidirectionalIterator __second_cut = __middle; >>>>>> _Distance __len11 = 0; >>>>>> _Distance __len22 = 0; >>>>>> - if (__len1 > __len2) >>>>>> + if (__len1 < __len2) >>>>>> { >>>>>> __len11 = __len1 / 2; >>>>>> std::advance(__first_cut, __len11); >>>>>> @@ -2539,9 +2539,15 @@ >>>>>> const _DistanceType __len1 = std::distance(__first, >>>>>> __middle); >>>>>> const _DistanceType __len2 = std::distance(__middle, >>>>>> __last); >>>>>> >>>>>> + >>>>>> typedef _Temporary_buffer<_BidirectionalIterator, >>>>>> _ValueType> >>>>> _TmpBuf; >>>>>> - _TmpBuf __buf(__first, __last); >>>>>> - >>>>>> + _BidirectionalIterator __start, __end; >>>>>> + if (__len1 < __len2) { >>>>>> + __start = __first; __end = __middle; >>>>>> + } else { >>>>>> + __start = __middle; __end = __last; >>>>>> + } >>>>>> + _TmpBuf __buf(__start, ___end); >>>>>> if (__buf.begin() == 0) >>>>>> std::__merge_without_buffer >>>>>> (__first, __middle, __last, __len1, __len2, __comp); >>>>>> Index: include/bits/stl_tempbuf.h >>>>>> =================================================================== >>>>>> --- include/bits/stl_tempbuf.h (revision 256871) >>>>>> +++ include/bits/stl_tempbuf.h (working copy) >>>>>> @@ -95,7 +95,7 @@ >>>>>> std::nothrow)); >>>>>> if (__tmp != 0) >>>>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>>>> - __len /= 2; >>>>>> + __len = (__len + 1) / 2; >>>>>> } >>>>>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); >>>>>> } >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Thanks >>> >>> >> >
On 29/10/18 07:06 +0100, François Dumont wrote: >Hi > > Some feedback regarding this patch ? Sorry this got missed, please resubmit during stage 1. You haven't CC'd the original patch author (chang jc) to give them a chance to comment on your proposed changes to the patch. The attached PDF on PR libstdc++/83938 has extensive discussion of the performance issue, but I don't see any for your version. Some performance benchmarks for your version would be helpful. > >Thanks, >François > >On 8/21/18 10:34 PM, François Dumont wrote: >>I missed a test that was failing because of this patch. So I revert >>a small part of it and here is the new proposal. >> >>Tested under Linux x86_64, ok to commit ? >> >>François >> >> >>On 24/07/2018 12:22, François Dumont wrote: >>>Hi >>> >>> Any chance to review this patch ? >>> >>>François >>> >>> >>>On 06/06/2018 18:39, François Dumont wrote: >>>>Hi >>>> >>>> I review and rework this proposal. I noticed that the same >>>>idea to limit buffer size within inplace_merge also apply to >>>>stable_sort. >>>> >>>> I also change the decision when buffer is too small to >>>>consider the buffer size rather than going through successive >>>>cuts of the original ranges. This way we won't cut the range >>>>more than necessary. Note that if you prefer this second part >>>>could be committed seperatly. >>>> >>>> PR libstdc++/83938 >>>> * include/bits/stl_algo.h: >>>> (__stable_partition_adaptive): When buffer to small cut range using >>>> buffer size. >>>> (__inplace_merge): Take temporary buffer length from >>>>smallest range. >>>> (__merge_adaptive): When buffer too small consider smallest >>>>range first >>>> and cut based on buffer size. >>>> (__stable_sort_adaptive): When buffer too small cut based on buffer >>>> size. >>>> (__stable_sort): Limit temporary buffer length. >>>> * include/bits/stl_tempbuf.h (get_temporary_buffer): Change >>>>function >>>> to reduce requested buffer length on allocation failure. >>>> >>>>Tested under Linux x86_64. >>>> >>>>Ok to commit ? >>>> >>>>François >>>> >>>> >>>>On 25/01/2018 23:37, chang jc wrote: >>>>>Hi: >>>>> >>>>>1. The __len = (__len + 1) / 2; is as you suggested, need to modify as >>>>>__len = (__len == 1) ? 0 : ((__len + 1) / 2); >>>>> >>>>>2. The coding gain has been shown PR c++/83938; I re-post here >>>>> >>>>> >>>>> >>>>> >>>>> 21 >>>>> 20 >>>>> 19 >>>>> 18 >>>>> 17 >>>>> 16 >>>>> >>>>> >>>>> 0.471136 >>>>> 0.625695 >>>>> 0.767262 >>>>> 0.907461 >>>>> 1.04838 >>>>> 1.19508 >>>>> >>>>> >>>>> 0.340845 >>>>> 0.48651 >>>>> 0.639139 >>>>> 0.770133 >>>>> 0.898454 >>>>> 1.04632 >>>>> >>>>>it means when Merge [0, 4325376, 16777216); A is a sorted integer with >>>>>4325376 & B with 12451840 elements, total with 16M integers >>>>> >>>>>The proposed method has the speed up under given buffer size =, ex >>>>>2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means >>>>>given sizof(int)*64K bytes. >>>>> >>>>>3. As your suggestion, _TmpBuf __buf should be rewrite. >>>>> >>>>>4. It represents a fact that the intuitive idea to split from larger >>>>>part is wrong. >>>>> >>>>>For example, if you have an input sorted array A & B, A has 8 integers >>>>>& B has 24 integers. Given tmp buffer whose capacity = 4 integers. >>>>> >>>>>Current it tries to split from B, right? >>>>> >>>>>Then we have: >>>>> >>>>>A1 | A2 B1 | B2 >>>>> >>>>>B1 & B2 has 12 integers each, right? >>>>> >>>>>Current algorithm selects pivot as 13th integer from B, right? >>>>> >>>>>If the corresponding upper bound of A is 6th integer. >>>>> >>>>>Then it split in >>>>> >>>>>A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12 >>>>> >>>>>After rotate, we have two arrays to merge >>>>> >>>>>[A1 = 5 | B1 = 12] & [A2 = 3 | B2 = 12] >>>>> >>>>>Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge. >>>>> >>>>>Sadly, [A1 = 5 | B1 = 12] CANNOT. >>>>> >>>>>So we do rotate again, split & merge the two split arrays from [A1 = 5 >>>>>| B1 = 12] again. >>>>> >>>>> >>>>>But wait, if we always split from the smaller one instead of >>>>>larger one. >>>>> >>>>>After rotate, it promises two split arrays both contain >>>>>ceiling[small/2]. >>>>> >>>>>Since tmp buffer also allocate by starting from sizeof(small) & >>>>>recursively downgrade by ceiling[small/2^(# of fail allocate)]. >>>>> >>>>>It means the allocated tmp buffer promises to be sufficient at the >>>>>level of (# of fail allocate). >>>>> >>>>>Instead, you can see if split from large at level (# of fail allocate) >>>>>several split array still CANNOT use tmp buffer to do buffered merge. >>>>> >>>>> >>>>>As you know, buffered merge is far speed then (split, rotate, and >>>>>merge two sub-arrays) (PR c++/83938 gives the profiling figures), >>>>> >>>>>the way should provide speedup. >>>>> >>>>> >>>>>Thanks. >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>On 24/01/2018 18:23, François Dumont wrote: >>>>> >>>>>Hi >>>>> >>>>> >>>>> It sounds like a very sensitive change to make but >>>>>nothing worth figures. >>>>>Do you have any bench showing the importance of the gain ? >>>>> >>>>> At least the memory usage optimization is obvious. >>>>> >>>>>On 19/01/2018 10:43, chang jc wrote: >>>>> >>>>>Current std::inplace_merge() suffers from performance issue by >>>>>inefficient >>>>> >>>>>logic under limited memory, >>>>> >>>>>It leads to performance downgrade. >>>>> >>>>>Please help to review it. >>>>> >>>>>Index: include/bits/stl_algo.h >>>>>=================================================================== >>>>>--- include/bits/stl_algo.h (revision 256871) >>>>>+++ include/bits/stl_algo.h (working copy) >>>>>@@ -2437,7 +2437,7 @@ >>>>> _BidirectionalIterator __second_cut = __middle; >>>>> _Distance __len11 = 0; >>>>> _Distance __len22 = 0; >>>>>- if (__len1 > __len2) >>>>>+ if (__len1 < __len2) >>>>> { >>>>> __len11 = __len1 / 2; >>>>> std::advance(__first_cut, __len11); >>>>>@@ -2539,9 +2539,15 @@ >>>>> const _DistanceType __len1 = std::distance(__first, >>>>>__middle); >>>>> const _DistanceType __len2 = std::distance(__middle, __last); >>>>> >>>>>+ >>>>> >>>>> typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> >>>>>_TmpBuf; >>>>> >>>>>- _TmpBuf __buf(__first, __last); >>>>>- >>>>>+ _BidirectionalIterator __start, __end; >>>>>+ if (__len1 < __len2) { >>>>>+ __start = __first; __end = __middle; >>>>>+ } else { >>>>>+ __start = __middle; __end = __last; >>>>>+ } >>>>>+ _TmpBuf __buf(__start, ___end); >>>>> >>>>>Note another optimization, we could introduce a >>>>>_Temporary_buffer<> constructor >>>>>in order to write: >>>>> >>>>>_TmpBuf __buf(std::min(__len1, __len2), __first); >>>>> >>>>>So that std::distance do not need to be called again. >>>>> >>>>> if (__buf.begin() == 0) >>>>> std::__merge_without_buffer >>>>> (__first, __middle, __last, __len1, __len2, __comp); >>>>>Index: include/bits/stl_tempbuf.h >>>>>=================================================================== >>>>>--- include/bits/stl_tempbuf.h (revision 256871) >>>>>+++ include/bits/stl_tempbuf.h (working copy) >>>>>@@ -95,7 +95,7 @@ >>>>> std::nothrow)); >>>>> if (__tmp != 0) >>>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>>>- __len /= 2; >>>>>+ __len = (__len + 1) / 2; >>>>> >>>>>This part is problematic, if __len is 1 and allocation fails >>>>>it will loop >>>>>forever. >>>>> >>>>>It doesn't seem really necessary for your patch. >>>>> >>>>> >>>>>2018-01-20 4:05 GMT+08:00 Jason Merrill<jason@redhat.com>: >>>>> >>>>>>This is a libstdc++ bug and patch, not the C++ front end. So I'm >>>>>>adding the libstdc++ list to CC. >>>>>> >>>>>>On Fri, Jan 19, 2018 at 3:02 AM, chang >>>>>>jc<r97922153@gmail.com> wrote: >>>>>>>Current std::inplace_merge() suffers from performance issue by >>>>>>inefficient >>>>>>>logic under limited memory, >>>>>>> >>>>>>>It leads to performance downgrade. >>>>>>> >>>>>>>Please help to review it. >>>>>>> >>>>>>>Index: include/bits/stl_algo.h >>>>>>>=================================================================== >>>>>>>--- include/bits/stl_algo.h (revision 256871) >>>>>>>+++ include/bits/stl_algo.h (working copy) >>>>>>>@@ -2437,7 +2437,7 @@ >>>>>>> _BidirectionalIterator __second_cut = __middle; >>>>>>> _Distance __len11 = 0; >>>>>>> _Distance __len22 = 0; >>>>>>>- if (__len1 > __len2) >>>>>>>+ if (__len1 < __len2) >>>>>>> { >>>>>>> __len11 = __len1 / 2; >>>>>>> std::advance(__first_cut, __len11); >>>>>>>@@ -2539,9 +2539,15 @@ >>>>>>> const _DistanceType __len1 = >>>>>>>std::distance(__first, __middle); >>>>>>> const _DistanceType __len2 = >>>>>>>std::distance(__middle, __last); >>>>>>> >>>>>>>+ >>>>>>> typedef _Temporary_buffer<_BidirectionalIterator, >>>>>>>_ValueType> >>>>>>_TmpBuf; >>>>>>>- _TmpBuf __buf(__first, __last); >>>>>>>- >>>>>>>+ _BidirectionalIterator __start, __end; >>>>>>>+ if (__len1 < __len2) { >>>>>>>+ __start = __first; __end = __middle; >>>>>>>+ } else { >>>>>>>+ __start = __middle; __end = __last; >>>>>>>+ } >>>>>>>+ _TmpBuf __buf(__start, ___end); >>>>>>> if (__buf.begin() == 0) >>>>>>> std::__merge_without_buffer >>>>>>> (__first, __middle, __last, __len1, __len2, __comp); >>>>>>>Index: include/bits/stl_tempbuf.h >>>>>>>=================================================================== >>>>>>>--- include/bits/stl_tempbuf.h (revision 256871) >>>>>>>+++ include/bits/stl_tempbuf.h (working copy) >>>>>>>@@ -95,7 +95,7 @@ >>>>>>>std::nothrow)); >>>>>>> if (__tmp != 0) >>>>>>> return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >>>>>>>- __len /= 2; >>>>>>>+ __len = (__len + 1) / 2; >>>>>>> } >>>>>>> return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); >>>>>>> } >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>Thanks >>>> >>>> >>> >> >
On 12/21/18 9:57 PM, Jonathan Wakely wrote: > On 29/10/18 07:06 +0100, François Dumont wrote: >> Hi >> >> Some feedback regarding this patch ? > > Sorry this got missed, please resubmit during stage 1. > > You haven't CC'd the original patch author (chang jc) to give them a > chance to comment on your proposed changes to the patch. > > The attached PDF on PR libstdc++/83938 has extensive discussion of the > performance issue, but I don't see any for your version. Some > performance benchmarks for your version would be helpful. Here is this patch again. This time it is much closer to John's original one, I just kept my change on the size of the temporary buffer which doesn't need to be as large as it is currently allocated, especially with John's patch. The performance tests are showing the good improvements, attached are the before/after. Surprisingly I do not see any change regarding allocated memory, I guess measures are not accurate enough. However working on them also show that current implementation is fragile. If you reduce the allowed allocated memory too much you end up with a stack overflow because of the recursive implementation. I already have a patch to keep on trying to allocate memory as long as above a given threshold rather than 0 but I am also working on making implementation non-recursive. We'll see if even a buffer of size 1 is better than no buffer at all then. PR libstdc++/83938 * include/bits/stl_algo.h: (__merge_adaptive): When buffer too small consider smallest range first and cut based on buffer size. (__inplace_merge): Take temporary buffer length from smallest range. (__stable_sort): Limit temporary buffer length. * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function to reduce requested buffer length on allocation failure. * testsuite/performance/25_algorithms/inplace_merge.cc: New. * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow execution with different level of memory. François inplace_merge.cc reverse 19r 19u 0s 48mem 0pf inplace_merge.cc forwards 0r 0u 0s 0mem 0pf inplace_merge.cc random 18r 18u 0s 0mem 0pf inplace_merge.cc reverse 10r 10u 0s 0mem 0pf inplace_merge.cc forwards 8r 8u 0s 0mem 0pf inplace_merge.cc random 22r 22u 0s 0mem 0pf inplace_merge.cc bench 1/2 memory 1891r 1884u 7s 928mem 0pf inplace_merge.cc reverse 19r 18u 0s 0mem 0pf inplace_merge.cc forwards 0r 0u 0s 0mem 0pf inplace_merge.cc random 18r 18u 0s 0mem 0pf inplace_merge.cc reverse 10r 10u 0s 0mem 0pf inplace_merge.cc forwards 3r 3u 0s 0mem 0pf inplace_merge.cc random 22r 22u 0s 0mem 0pf inplace_merge.cc bench 1/4 memory 1867r 1861u 6s 192mem 0pf inplace_merge.cc reverse 17r 17u 0s 0mem 0pf inplace_merge.cc forwards 0r 0u 0s 0mem 0pf inplace_merge.cc random 25r 24u 0s 0mem 0pf inplace_merge.cc reverse 28r 28u 0s 0mem 0pf inplace_merge.cc forwards 0r 0u 0s 0mem 0pf inplace_merge.cc random 289r 289u 0s 0mem 0pf inplace_merge.cc bench no memory 2155r 2149u 6s 192mem 0pf stable_sort.cc reverse 240r 240u 1s 0mem 0pf stable_sort.cc forwards 205r 205u 0s 0mem 0pf stable_sort.cc random 493r 493u 0s 0mem 0pf stable_sort.cc bench full memory 945r 939u 5s 752mem 0pf stable_sort.cc reverse 236r 236u 0s 0mem 0pf stable_sort.cc forwards 199r 199u 0s 0mem 0pf stable_sort.cc random 487r 487u 0s 0mem 0pf stable_sort.cc bench 1/2 memory 928r 925u 3s 96mem 0pf stable_sort.cc reverse 240r 240u 0s 0mem 0pf stable_sort.cc forwards 203r 203u 0s 0mem 0pf stable_sort.cc random 486r 487u 0s 0mem 0pf stable_sort.cc bench 1/4 memory 935r 932u 5s 96mem 0pf stable_sort.cc reverse 829r 829u 0s 0mem 0pf stable_sort.cc forwards 143r 143u 0s 0mem 0pf stable_sort.cc random 487r 486u 0s 0mem 0pf stable_sort.cc bench no memory 1465r 1461u 2s 96mem 0pf inplace_merge.cc reverse 19r 19u 0s 0mem 0pf inplace_merge.cc forwards 0r 0u 0s 0mem 0pf inplace_merge.cc random 19r 19u 0s 0mem 0pf inplace_merge.cc reverse 11r 10u 0s 0mem 0pf inplace_merge.cc forwards 8r 8u 0s 0mem 0pf inplace_merge.cc random 22r 22u 0s 0mem 0pf inplace_merge.cc bench 1/2 memory 1938r 1925u 12s 928mem 0pf inplace_merge.cc reverse 19r 19u 0s 0mem 0pf inplace_merge.cc forwards 0r 0u 0s 0mem 0pf inplace_merge.cc random 18r 19u 0s 0mem 0pf inplace_merge.cc reverse 10r 11u 0s 0mem 0pf inplace_merge.cc forwards 3r 2u 0s 0mem 0pf inplace_merge.cc random 22r 22u 0s 0mem 0pf inplace_merge.cc bench 1/4 memory 1922r 1915u 7s 192mem 0pf inplace_merge.cc reverse 18r 17u 0s 0mem 0pf inplace_merge.cc forwards 0r 0u 0s 0mem 0pf inplace_merge.cc random 23r 24u 0s 0mem 0pf inplace_merge.cc reverse 28r 27u 0s 0mem 0pf inplace_merge.cc forwards 0r 0u 0s 0mem 0pf inplace_merge.cc random 288r 289u 0s 0mem 0pf inplace_merge.cc bench no memory 2204r 2196u 9s 192mem 0pf stable_sort.cc reverse 234r 234u 0s 0mem 0pf stable_sort.cc forwards 201r 200u 1s 0mem 0pf stable_sort.cc random 485r 484u 1s 0mem 0pf stable_sort.cc bench full memory 927r 920u 5s 752mem 0pf stable_sort.cc reverse 233r 231u 1s 0mem 0pf stable_sort.cc forwards 198r 198u 0s 0mem 0pf stable_sort.cc random 478r 478u 0s 0mem 0pf stable_sort.cc bench 1/2 memory 915r 910u 4s 96mem 0pf stable_sort.cc reverse 236r 236u 0s 0mem 0pf stable_sort.cc forwards 201r 200u 0s 0mem 0pf stable_sort.cc random 478r 478u 1s 0mem 0pf stable_sort.cc bench 1/4 memory 921r 916u 4s 96mem 0pf stable_sort.cc reverse 830r 830u 0s 0mem 0pf stable_sort.cc forwards 146r 146u 0s 0mem 0pf stable_sort.cc random 479r 478u 0s 0mem 0pf stable_sort.cc bench no memory 1461r 1456u 5s 96mem 0pf
Hi I eventually spent much more time working on the inplace_merge performance bench. And the results do not confirm the theory: Before patch: inplace_merge.cc bench 1 / 1 memory 243r 227u 17s 1216mem 5pf inplace_merge.cc bench 1 / 4 memory 297r 278u 18s 480mem 0pf inplace_merge.cc bench 1 /64 memory 373r 354u 18s 480mem 0pf inplace_merge.cc bench 0 / 1 memory 12r 11u 0s 480mem 0pf After the patch to reduce memory allocation: inplace_merge.cc bench 1 / 1 memory 245r 227u 18s 1216mem 0pf inplace_merge.cc bench 1 / 4 memory 292r 273u 19s 480mem 0pf inplace_merge.cc bench 1 /64 memory 373r 356u 17s 480mem 0pf inplace_merge.cc bench 0 / 1 memory 11r 11u 0s 480mem 0pf With the __len1 > __len2 condition change: inplace_merge.cc bench 1 / 1 memory 244r 225u 20s 1216mem 0pf inplace_merge.cc bench 1 / 4 memory 379r 361u 17s 480mem 0pf inplace_merge.cc bench 1 /64 memory 640r 625u 16s 480mem 0pf inplace_merge.cc bench 0 / 1 memory 11r 11u 0s 480mem 0pf When there is no memory limit the results are identical of course. Otherwise as soon as memory is limited performance starts to decrease with the condition change on __len1 vs __len2. Could you give the bench you use to demonstrate the enhancement ? I also wonder why your patch doesn't change consistently the same condition in __merge_without_buffer ? For the moment I'd like to propose the attached patch that is to say the reduction on the amount of allocated memory and the new/modified benches. Note that as soon as I forbid any memory allocation I also reduce the size of the range to merge cause the implementation rely on recursivity and so could end-up in a stack overflow. Maybe I need to check for simulator targets like several other tests ? Unless simulators do not run the performance tests ? Regarding this stack overflow issue, is there some routine to find out how many levels of function calls can be added before reaching the stack overflow ? I could perhaps call __builtin_alloca and check the result but that smells. If I could find out this we could fallback on an iterative approach to complete the merge. PR libstdc++/83938 * include/bits/stl_algo.h: (__inplace_merge): Take temporary buffer length from smallest range. (__stable_sort): Limit temporary buffer length. * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len computation in the loop. * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible pivot positions. * testsuite/performance/25_algorithms/inplace_merge.cc: New. * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow execution with different memory constraints. Ok to commit ? François On 6/9/19 4:27 PM, François Dumont wrote: > On 12/21/18 9:57 PM, Jonathan Wakely wrote: >> On 29/10/18 07:06 +0100, François Dumont wrote: >>> Hi >>> >>> Some feedback regarding this patch ? >> >> Sorry this got missed, please resubmit during stage 1. >> >> You haven't CC'd the original patch author (chang jc) to give them a >> chance to comment on your proposed changes to the patch. >> >> The attached PDF on PR libstdc++/83938 has extensive discussion of the >> performance issue, but I don't see any for your version. Some >> performance benchmarks for your version would be helpful. > > Here is this patch again. > > This time it is much closer to John's original one, I just kept my > change on the size of the temporary buffer which doesn't need to be as > large as it is currently allocated, especially with John's patch. > > The performance tests are showing the good improvements, attached are > the before/after. Surprisingly I do not see any change regarding > allocated memory, I guess measures are not accurate enough. However > working on them also show that current implementation is fragile. If > you reduce the allowed allocated memory too much you end up with a > stack overflow because of the recursive implementation. > > I already have a patch to keep on trying to allocate memory as long as > above a given threshold rather than 0 but I am also working on making > implementation non-recursive. We'll see if even a buffer of size 1 is > better than no buffer at all then. > > PR libstdc++/83938 > * include/bits/stl_algo.h: > (__merge_adaptive): When buffer too small consider smallest range > first > and cut based on buffer size. > (__inplace_merge): Take temporary buffer length from smallest range. > (__stable_sort): Limit temporary buffer length. > * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function > to reduce requested buffer length on allocation failure. > * testsuite/performance/25_algorithms/inplace_merge.cc: New. > * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow > execution with different level of memory. > > François >
On 16/07/19 18:40 +0200, François Dumont wrote: >Hi > >   I eventually spent much more time working on the inplace_merge >performance bench. > >   And the results do not confirm the theory: > >Before patch: >inplace_merge.cc           bench 1 / 1 memory          243r >227u  17s     1216mem   5pf >inplace_merge.cc           bench 1 / 4 memory          297r 278u  >18s      480mem   0pf >inplace_merge.cc           bench 1 /64 memory         373r 354u  >18s      480mem   0pf >inplace_merge.cc           bench 0 / 1 memory          12r 11u    >0s      480mem   0pf > >After the patch to reduce memory allocation: >inplace_merge.cc           bench 1 / 1 memory          245r >227u  18s     1216mem   0pf >inplace_merge.cc           bench 1 / 4 memory          292r 273u  >19s      480mem   0pf >inplace_merge.cc           bench 1 /64 memory         373r 356u  >17s      480mem   0pf >inplace_merge.cc           bench 0 / 1 memory          11r 11u    >0s      480mem   0pf > >With the __len1 > __len2 condition change: >inplace_merge.cc           bench 1 / 1 memory          244r >225u  20s     1216mem   0pf >inplace_merge.cc           bench 1 / 4 memory          379r 361u  >17s      480mem   0pf >inplace_merge.cc           bench 1 /64 memory          640r 625u  >16s      480mem   0pf >inplace_merge.cc           bench 0 / 1 memory          11r 11u   >0s      480mem   0pf > >When there is no memory limit the results are identical of course. >Otherwise as soon as memory is limited performance starts to decrease >with the condition change on __len1 vs __len2. > >Could you give the bench you use to demonstrate the enhancement ? I >also wonder why your patch doesn't change consistently the same >condition in __merge_without_buffer ? > >For the moment I'd like to propose the attached patch that is to say >the reduction on the amount of allocated memory and the new/modified >benches. > >Note that as soon as I forbid any memory allocation I also reduce the >size of the range to merge cause the implementation rely on >recursivity and so could end-up in a stack overflow. Maybe I need to >check for simulator targets like several other tests ? Unless >simulators do not run the performance tests ? The performance tests are never run by default. I don't think we should spend too much time caring about performance of sorting close to Out Of Memory conditions. We don't try to optimize std::vector or other cases to work when close to OOM. So I think just reducing the memory usage is the right approach here. >Regarding this stack overflow issue, is there some routine to find out >how many levels of function calls can be added before reaching the >stack overflow ? I could perhaps call __builtin_alloca and check the >result but that smells. If I could find out this we could fallback on >an iterative approach to complete the merge. No, alloca is inherently unsafe. Please don't even consider it :-) >   PR libstdc++/83938 >   * include/bits/stl_algo.h: >   (__inplace_merge): Take temporary buffer length from smallest range. >   (__stable_sort): Limit temporary buffer length. >   * include/bits/stl_tempbuf.h (get_temporary_buffer): Change __len >   computation in the loop. Please use "Change __len computation in the loop to avoid truncation." It's a bit more descriptive. >   * testsuite/25_algorithms/inplace_merge/1.cc (test3): Test all possible >   pivot positions. >   * testsuite/performance/25_algorithms/inplace_merge.cc: New. >   * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow >   execution with different memory constraints. > >Ok to commit ? > >François > >On 6/9/19 4:27 PM, François Dumont wrote: >>On 12/21/18 9:57 PM, Jonathan Wakely wrote: >>>On 29/10/18 07:06 +0100, François Dumont wrote: >>>>Hi >>>> >>>>   Some feedback regarding this patch ? >>> >>>Sorry this got missed, please resubmit during stage 1. >>> >>>You haven't CC'd the original patch author (chang jc) to give them a >>>chance to comment on your proposed changes to the patch. >>> >>>The attached PDF on PR libstdc++/83938 has extensive discussion of the >>>performance issue, but I don't see any for your version. Some >>>performance benchmarks for your version would be helpful. >> >>Here is this patch again. >> >>This time it is much closer to John's original one, I just kept my >>change on the size of the temporary buffer which doesn't need to be >>as large as it is currently allocated, especially with John's patch. >> >>The performance tests are showing the good improvements, attached >>are the before/after. Surprisingly I do not see any change regarding >>allocated memory, I guess measures are not accurate enough. However >>working on them also show that current implementation is fragile. If >>you reduce the allowed allocated memory too much you end up with a >>stack overflow because of the recursive implementation. >> >>I already have a patch to keep on trying to allocate memory as long >>as above a given threshold rather than 0 but I am also working on >>making implementation non-recursive. We'll see if even a buffer of >>size 1 is better than no buffer at all then. >> >>   PR libstdc++/83938 >>   * include/bits/stl_algo.h: >>   (__merge_adaptive): When buffer too small consider smallest >>range first >>   and cut based on buffer size. >>   (__inplace_merge): Take temporary buffer length from smallest range. >>   (__stable_sort): Limit temporary buffer length. >>   * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function >>   to reduce requested buffer length on allocation failure. >>   * testsuite/performance/25_algorithms/inplace_merge.cc: New. >>   * testsuite/performance/25_algorithms/stable_sort.cc: Rework to allow >>   execution with different level of memory. >> >>François >> > >diff --git a/libstdc++-v3/include/bits/stl_algo.h b/libstdc++-v3/include/bits/stl_algo.h >index ec651e2cc45..b396437443f 100644 >--- a/libstdc++-v3/include/bits/stl_algo.h >+++ b/libstdc++-v3/include/bits/stl_algo.h >@@ -2530,7 +2530,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > const _DistanceType __len2 = std::distance(__middle, __last); > > typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf; >- _TmpBuf __buf(__first, __len1 + __len2); >+ _TmpBuf __buf(__first, std::min(__len1, __len2)); Please add a comment before this saying something like: // __merge_adaptive will use a buffer for the smaller of // [first,middle) and [middle,last). > if (__buf.begin() == 0) > std::__merge_without_buffer >@@ -2738,6 +2738,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > std::__merge_sort_with_buffer(__first, __middle, __buffer, __comp); > std::__merge_sort_with_buffer(__middle, __last, __buffer, __comp); > } >+ > std::__merge_adaptive(__first, __middle, __last, > _Distance(__middle - __first), > _Distance(__last - __middle), >@@ -5023,8 +5024,11 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO > typedef typename iterator_traits<_RandomAccessIterator>::difference_type > _DistanceType; > >+ if (__first == __last) >+ return; >+ > typedef _Temporary_buffer<_RandomAccessIterator, _ValueType> _TmpBuf; >- _TmpBuf __buf(__first, std::distance(__first, __last)); >+ _TmpBuf __buf(__first, (__last - __first + 1) / 2); Please add a comment before this saying something like: // __stable_sort_adaptive sorts the range in two halves, // so the buffer only needs to fit half the range at once. That explains why (last - first + 1)/2 is correct, and if we ever change __stable_sort_adaptive we'll know to change this too. > > if (__buf.begin() == 0) > std::__inplace_stable_sort(__first, __last, __comp); >diff --git a/libstdc++-v3/include/bits/stl_tempbuf.h b/libstdc++-v3/include/bits/stl_tempbuf.h >index fa78ee53eb4..b6ad9ee6a46 100644 >--- a/libstdc++-v3/include/bits/stl_tempbuf.h >+++ b/libstdc++-v3/include/bits/stl_tempbuf.h >@@ -95,7 +95,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > std::nothrow)); > if (__tmp != 0) > return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); >- __len /= 2; >+ __len = __len == 1 ? 0 : ((__len + 1) / 2); > } > return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); > } >diff --git a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc >index 0b0c9c59640..9c393ce8fe3 100644 >--- a/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc >+++ b/libstdc++-v3/testsuite/25_algorithms/inplace_merge/1.cc >@@ -28,7 +28,7 @@ using std::inplace_merge; > > typedef test_container<int, bidirectional_iterator_wrapper> container; > >-void >+void > test1() > { > int array[] = { 1 }; >@@ -39,7 +39,7 @@ test1() > inplace_merge(con2.begin(), con2.begin(), con2.end()); > } > >-void >+void > test2() > { > int array[] = { 0, 2, 4, 1, 3, 5 }; >@@ -60,30 +60,34 @@ struct S > { return a < _s.a; } > }; > >-void >+void > test3() > { > S s[8]; >- s[0].a = 0; >- s[1].a = 1; >- s[2].a = 2; >- s[3].a = 3; >- s[4].a = 0; >- s[5].a = 1; >- s[6].a = 2; >- s[7].a = 3; >- >- s[0].b = 0; >- s[1].b = 1; >- s[2].b = 2; >- s[3].b = 3; >- s[4].b = 4; >- s[5].b = 5; >- s[6].b = 6; >- s[7].b = 7; >- >- inplace_merge(s, s + 4, s + 8); >- VERIFY( s[0].b == 0 && s[1].b == 4 && s[2].b == 1 && s[3].b == 5 ); >+ for (int pivot_idx = 0; pivot_idx < 8; ++pivot_idx) >+ { >+ int bval = 0; >+ for (int i = 0; i != pivot_idx; ++i) >+ { >+ s[i].a = i; >+ s[i].b = bval++; >+ } >+ >+ for (int i = pivot_idx; i != 8; ++i) >+ { >+ s[i].a = i - pivot_idx; >+ s[i].b = bval++; >+ } >+ >+ inplace_merge(s, s + pivot_idx, s + 8); >+ >+ for (int i = 1; i < 8; ++i) >+ { >+ VERIFY( !(s[i] < s[i - 1]) ); >+ if (s[i - 1].a == s[i].a) >+ VERIFY( s[i - 1].b < s[i].b ); >+ } >+ } > } I know it's redundant, but I think I'd prefer to keep test3 unchanged and add test4 for the more general test that uses different pivots. OK for trunk with those changes. Thanks for persisting with this.
Index: include/bits/stl_algo.h =================================================================== --- include/bits/stl_algo.h (revision 256871) +++ include/bits/stl_algo.h (working copy) @@ -2437,7 +2437,7 @@ _BidirectionalIterator __second_cut = __middle; _Distance __len11 = 0; _Distance __len22 = 0; - if (__len1 > __len2) + if (__len1 < __len2) { __len11 = __len1 / 2; std::advance(__first_cut, __len11); @@ -2539,9 +2539,15 @@ const _DistanceType __len1 = std::distance(__first, __middle); const _DistanceType __len2 = std::distance(__middle, __last); + typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf; - _TmpBuf __buf(__first, __last); - + _BidirectionalIterator __start, __end; + if (__len1 < __len2) { + __start = __first; __end = __middle; + } else { + __start = __middle; __end = __last; + } + _TmpBuf __buf(__start, ___end); if (__buf.begin() == 0) std::__merge_without_buffer (__first, __middle, __last, __len1, __len2, __comp); Index: include/bits/stl_tempbuf.h =================================================================== --- include/bits/stl_tempbuf.h (revision 256871) +++ include/bits/stl_tempbuf.h (working copy) @@ -95,7 +95,7 @@ std::nothrow)); if (__tmp != 0) return std::pair<_Tp*, ptrdiff_t>(__tmp, __len); - __len /= 2; + __len = (__len + 1) / 2; } return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0); }