diff mbox

improve string find algorithm

Message ID 016101d2682b$136dc890$3a4959b0$@samsung.com
State New
Headers show

Commit Message

Aditya Kumar Jan. 6, 2017, 2:42 p.m. UTC
Yes, we do.
Sorry for the mistake, it happened because I first wrote this for
libcxx (https://reviews.llvm.org/D27068) and while porting that line
got missed.

Thanks,
-Aditya

Comments

Jakub Jelinek Jan. 6, 2017, 3:09 p.m. UTC | #1
On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote:
> Yes, we do.
> Sorry for the mistake, it happened because I first wrote this for
> libcxx (https://reviews.llvm.org/D27068) and while porting that line
> got missed.

Shouldn't find at least in the case where it is narrow char string
just use C library memmem?  That implements a Two-Way searching algorithm
with some improvements from Boyer-Moore.
Otherwise, consider what even your modified version will do for

#include <string>
int main() {
 (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b");
}

Or does the C++ library need to reinvent everything implemented in the C
library?

	Jakub
Jonathan Wakely Jan. 6, 2017, 3:16 p.m. UTC | #2
On 06/01/17 16:09 +0100, Jakub Jelinek wrote:
>On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote:
>> Yes, we do.
>> Sorry for the mistake, it happened because I first wrote this for
>> libcxx (https://reviews.llvm.org/D27068) and while porting that line
>> got missed.
>
>Shouldn't find at least in the case where it is narrow char string
>just use C library memmem?  That implements a Two-Way searching algorithm
>with some improvements from Boyer-Moore.
>Otherwise, consider what even your modified version will do for
>
>#include <string>
>int main() {
> (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b");
>}

Yes, I have an incomplete local patch that checks for memmem and uses
it where possible. This change would still be valuable for non-GNU
targets though.

>Or does the C++ library need to reinvent everything implemented in the C
>library?

In this case yes, because strstr doesn't take the length of the
strings into account, and memmem isn't standard. Because std::string
knows its length we can do better than strstr for some cases, such as
std::string(1000, 'a').find(std::string(1001, 'a')).
Jakub Jelinek Jan. 6, 2017, 3:26 p.m. UTC | #3
On Fri, Jan 06, 2017 at 03:16:12PM +0000, Jonathan Wakely wrote:
> On 06/01/17 16:09 +0100, Jakub Jelinek wrote:
> > On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote:
> > > Yes, we do.
> > > Sorry for the mistake, it happened because I first wrote this for
> > > libcxx (https://reviews.llvm.org/D27068) and while porting that line
> > > got missed.
> > 
> > Shouldn't find at least in the case where it is narrow char string
> > just use C library memmem?  That implements a Two-Way searching algorithm
> > with some improvements from Boyer-Moore.
> > Otherwise, consider what even your modified version will do for
> > 
> > #include <string>
> > int main() {
> > (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b");
> > }
> 
> Yes, I have an incomplete local patch that checks for memmem and uses
> it where possible. This change would still be valuable for non-GNU
> targets though.

The description of Two-Way algorithm is in
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
Boyer-Moore in
http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm

Of course, as you know the lengths of both strings, you can quite cheaply
decide whether it is worth spending some time building data structures
to speed up the following search or not (but so does memmem already,
e.g. handle with no searching zero length needle, or needle longer than
haystack).  Or for short needles start with memchr of the first char, while
for longer needles don't do that etc.

> > Or does the C++ library need to reinvent everything implemented in the C
> > library?
> 
> In this case yes, because strstr doesn't take the length of the
> strings into account, and memmem isn't standard. Because std::string
> knows its length we can do better than strstr for some cases, such as
> std::string(1000, 'a').find(std::string(1001, 'a')).

	Jakub
Jonathan Wakely Jan. 6, 2017, 3:34 p.m. UTC | #4
On 06/01/17 16:26 +0100, Jakub Jelinek wrote:
>On Fri, Jan 06, 2017 at 03:16:12PM +0000, Jonathan Wakely wrote:
>> On 06/01/17 16:09 +0100, Jakub Jelinek wrote:
>> > On Fri, Jan 06, 2017 at 08:42:15AM -0600, Aditya Kumar wrote:
>> > > Yes, we do.
>> > > Sorry for the mistake, it happened because I first wrote this for
>> > > libcxx (https://reviews.llvm.org/D27068) and while porting that line
>> > > got missed.
>> >
>> > Shouldn't find at least in the case where it is narrow char string
>> > just use C library memmem?  That implements a Two-Way searching algorithm
>> > with some improvements from Boyer-Moore.
>> > Otherwise, consider what even your modified version will do for
>> >
>> > #include <string>
>> > int main() {
>> > (std::string(10000000, 'a')+"b").find(std::string(1000000, 'a')+"b");
>> > }
>>
>> Yes, I have an incomplete local patch that checks for memmem and uses
>> it where possible. This change would still be valuable for non-GNU
>> targets though.
>
>The description of Two-Way algorithm is in
>http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
>Boyer-Moore in
>http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm

And in std::boyer_moore_searcher in <functional> :-)
We also have std::boyer_moore_horspool_searcher in <functional>.

We could try allocating memory in std::string::find() and fall back to
the dumb algorithm if it fails, but we don't want to throw bad_alloc.
Jakub Jelinek Jan. 6, 2017, 3:43 p.m. UTC | #5
On Fri, Jan 06, 2017 at 03:34:22PM +0000, Jonathan Wakely wrote:
> > The description of Two-Way algorithm is in
> > http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
> > Boyer-Moore in
> > http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
> 
> And in std::boyer_moore_searcher in <functional> :-)
> We also have std::boyer_moore_horspool_searcher in <functional>.
> 
> We could try allocating memory in std::string::find() and fall back to
> the dumb algorithm if it fails, but we don't want to throw bad_alloc.

glibc memmem algorithm doesn't allocate any memory for short needles (< 32
chars) and for longer needles uses size_t shift_table[1U << CHAR_BIT];
automatic array (i.e. 1KB on 32-bits, 2KB on 64-bits).

	Jakub
diff mbox

Patch

diff --git a/libstdc++-v3/include/bits/basic_string.tcc
b/libstdc++-v3/include/bits/basic_string.tcc
index df1e8dd..7942ee6 100644
--- a/libstdc++-v3/include/bits/basic_string.tcc
+++ b/libstdc++-v3/include/bits/basic_string.tcc
@@ -1194,14 +1194,29 @@  _GLIBCXX_BEGIN_NAMESPACE_VERSION
       if (__n == 0)
         return __pos <= __size ? __pos : npos;
 
-      if (__n <= __size)
-       {
-         for (; __pos <= __size - __n; ++__pos)
-           if (traits_type::eq(__data[__pos], __s[0])
-               && traits_type::compare(__data + __pos + 1,
-                                       __s + 1, __n - 1) == 0)
-             return __pos;
-       }
+      if (__n > __size)
+       return npos;
+
+      const _CharT __elem0 = __s[0];
+      const _CharT* __first1 = __data;
+      const _CharT* __first2 = __s;
+      const _CharT* __last1 = __data + __size;
+      ptrdiff_t __len2 = __n -  __pos;
+
+      while (true) {
+        ptrdiff_t __len1 = __last1 - __first1;
+       if (__len1 < __len2)
+         return npos;
+
+       // Find the first match.
+       __first1 = traits_type::find(__first1, __len1 - __len2 + 1,
__elem0);
+       if (__first1 == 0)
+         return npos;
+       // Compare the full string when first match is found.
+        if (traits_type::compare(__first1, __first2, __len2) == 0)
+          return __first1 - __data;
+
+        ++__first1;
+      }
+
       return npos;
     }