Message ID | 016101d2682b$136dc890$3a4959b0$@samsung.com |
---|---|
State | New |
Headers | show |
On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote: > Yes, we do. > Sorry for the mistake, it happened because I first wrote this for > libcxx (https://reviews.llvm.org/D27068) and while porting that line > got missed. Shouldn't find at least in the case where it is narrow char string just use C library memmem? That implements a Two-Way searching algorithm with some improvements from Boyer-Moore. Otherwise, consider what even your modified version will do for #include <string> int main() { (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b"); } Or does the C++ library need to reinvent everything implemented in the C library? Jakub
On 06/01/17 16:09 +0100, Jakub Jelinek wrote: >On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote: >> Yes, we do. >> Sorry for the mistake, it happened because I first wrote this for >> libcxx (https://reviews.llvm.org/D27068) and while porting that line >> got missed. > >Shouldn't find at least in the case where it is narrow char string >just use C library memmem? That implements a Two-Way searching algorithm >with some improvements from Boyer-Moore. >Otherwise, consider what even your modified version will do for > >#include <string> >int main() { > (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b"); >} Yes, I have an incomplete local patch that checks for memmem and uses it where possible. This change would still be valuable for non-GNU targets though. >Or does the C++ library need to reinvent everything implemented in the C >library? In this case yes, because strstr doesn't take the length of the strings into account, and memmem isn't standard. Because std::string knows its length we can do better than strstr for some cases, such as std::string(1000, 'a').find(std::string(1001, 'a')).
On Fri, Jan 06, 2017 at 03:16:12PM +0000, Jonathan Wakely wrote: > On 06/01/17 16:09 +0100, Jakub Jelinek wrote: > > On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote: > > > Yes, we do. > > > Sorry for the mistake, it happened because I first wrote this for > > > libcxx (https://reviews.llvm.org/D27068) and while porting that line > > > got missed. > > > > Shouldn't find at least in the case where it is narrow char string > > just use C library memmem? That implements a Two-Way searching algorithm > > with some improvements from Boyer-Moore. > > Otherwise, consider what even your modified version will do for > > > > #include <string> > > int main() { > > (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b"); > > } > > Yes, I have an incomplete local patch that checks for memmem and uses > it where possible. This change would still be valuable for non-GNU > targets though. The description of Two-Way algorithm is in http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 Boyer-Moore in http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm Of course, as you know the lengths of both strings, you can quite cheaply decide whether it is worth spending some time building data structures to speed up the following search or not (but so does memmem already, e.g. handle with no searching zero length needle, or needle longer than haystack). Or for short needles start with memchr of the first char, while for longer needles don't do that etc. > > Or does the C++ library need to reinvent everything implemented in the C > > library? > > In this case yes, because strstr doesn't take the length of the > strings into account, and memmem isn't standard. Because std::string > knows its length we can do better than strstr for some cases, such as > std::string(1000, 'a').find(std::string(1001, 'a')). Jakub
On 06/01/17 16:26 +0100, Jakub Jelinek wrote: >On Fri, Jan 06, 2017 at 03:16:12PM +0000, Jonathan Wakely wrote: >> On 06/01/17 16:09 +0100, Jakub Jelinek wrote: >> > On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote: >> > > Yes, we do. >> > > Sorry for the mistake, it happened because I first wrote this for >> > > libcxx (https://reviews.llvm.org/D27068) and while porting that line >> > > got missed. >> > >> > Shouldn't find at least in the case where it is narrow char string >> > just use C library memmem? That implements a Two-Way searching algorithm >> > with some improvements from Boyer-Moore. >> > Otherwise, consider what even your modified version will do for >> > >> > #include <string> >> > int main() { >> > (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b"); >> > } >> >> Yes, I have an incomplete local patch that checks for memmem and uses >> it where possible. This change would still be valuable for non-GNU >> targets though. > >The description of Two-Way algorithm is in >http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 >Boyer-Moore in >http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm And in std::boyer_moore_searcher in <functional> :-) We also have std::boyer_moore_horspool_searcher in <functional>. We could try allocating memory in std::string::find() and fall back to the dumb algorithm if it fails, but we don't want to throw bad_alloc.
On Fri, Jan 06, 2017 at 03:34:22PM +0000, Jonathan Wakely wrote: > > The description of Two-Way algorithm is in > > http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 > > Boyer-Moore in > > http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm > > And in std::boyer_moore_searcher in <functional> :-) > We also have std::boyer_moore_horspool_searcher in <functional>. > > We could try allocating memory in std::string::find() and fall back to > the dumb algorithm if it fails, but we don't want to throw bad_alloc. glibc memmem algorithm doesn't allocate any memory for short needles (< 32 chars) and for longer needles uses size_t shift_table[1U << CHAR_BIT]; automatic array (i.e. 1KB on 32-bits, 2KB on 64-bits). Jakub
diff --git a/libstdc++-v3/include/bits/basic_string.tcc b/libstdc++-v3/include/bits/basic_string.tcc index df1e8dd..7942ee6 100644 --- a/libstdc++-v3/include/bits/basic_string.tcc +++ b/libstdc++-v3/include/bits/basic_string.tcc @@ -1194,14 +1194,29 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION if (__n == 0) return __pos <= __size ? __pos : npos; - if (__n <= __size) - { - for (; __pos <= __size - __n; ++__pos) - if (traits_type::eq(__data[__pos], __s[0]) - && traits_type::compare(__data + __pos + 1, - __s + 1, __n - 1) == 0) - return __pos; - } + if (__n > __size) + return npos; + + const _CharT __elem0 = __s[0]; + const _CharT* __first1 = __data; + const _CharT* __first2 = __s; + const _CharT* __last1 = __data + __size; + ptrdiff_t __len2 = __n - __pos; + + while (true) { + ptrdiff_t __len1 = __last1 - __first1; + if (__len1 < __len2) + return npos; + + // Find the first match. + __first1 = traits_type::find(__first1, __len1 - __len2 + 1, __elem0); + if (__first1 == 0) + return npos; + // Compare the full string when first match is found. + if (traits_type::compare(__first1, __first2, __len2) == 0) + return __first1 - __data; + + ++__first1; + } + return npos; }