Message ID | DB5PR08MB1030D79438AF0E17282F3E1783020@DB5PR08MB1030.eurprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
Series | Improve strstr performance of short needles | expand |
On Wed, 05 Sep 2018, Wilco Dijkstra wrote: >Improve strstr performance for the common case of short needles. For 2-4 >characters a small loop is typically fastest. On large strings the speedup >with a needle size of 4 is about 65%. I ran some tests on a powerpc64le machine and I noticed that, as the haystack becomes larger, the search for a needle that is *not* in the haystack, became slower with this change. For instance, when the haystack is 32 chars long, searching for a 4 chars long needle that is not in the haystack [1] took ~11% to ~22% longer to finish, when compared to the pristine code. When the needle *is* in the haystack, that did not happen and the performance was usually 30% better. In my tests, this is also true for needles 2 and 3 chars long. Have you observed similar results in your tests? [1] Tests with the summary similar to this: Length 32/4, alignment 0/ 0, fail:
Gabriel F. T. Gomes wrote: > On Wed, 05 Sep 2018, Wilco Dijkstra wrote: > >>Improve strstr performance for the common case of short needles. For 2-4 >>characters a small loop is typically fastest. On large strings the speedup >>with a needle size of 4 is about 65%. > > I ran some tests on a powerpc64le machine and I noticed that, as the > haystack becomes larger, the search for a needle that is *not* in the > haystack, became slower with this change. That seems odd, particularly since the Two-Way algorithm is pretty slow for small needles (large preprocessing overhead and small skips). > For instance, when the haystack is 32 chars long, searching for a 4 chars > long needle that is not in the haystack [1] took ~11% to ~22% longer to > finish, when compared to the pristine code. When the needle *is* in the > haystack, that did not happen and the performance was usually 30% better. > > Have you observed similar results in your tests? I haven't looked in detail at the GLIBC strstr benchmark results. I'm running tests with natural language texts from 64 bytes to 256KB. Generally speaking the short needle Two-Way algorithm is terribly slow on both short and long haystacks. Short needles make it worse since it can't skip much, so there isn't any point in keeping it (I posted a much faster replacement in newlib). Wilco
On 29/09/18 5:51 AM, Wilco Dijkstra wrote: > I haven't looked in detail at the GLIBC strstr benchmark results. I'm running > tests with natural language texts from 64 bytes to 256KB. Generally speaking Would you be able to add a test that emulates that workload to glibc benchtests? Thanks, Siddhesh
Hi Gabriel, >> I ran some tests on a powerpc64le machine and I noticed that, as the >> haystack becomes larger, the search for a needle that is *not* in the >> haystack, became slower with this change. > > That seems odd, particularly since the Two-Way algorithm is pretty slow > for small needles (large preprocessing overhead and small skips). I looked at bench-strstr again. There are cases where Two-Way wins by 10-15% when the string is found but for short strings the linear loop wins. On average I can't see any performance difference. However when iterating through 16K needles randomly chosen from natural text I get the following speedups for different haystack and needle sizes: H64 H256 H4096 N2: 64% 73% 46% N3: 74% 66% 37% N4: 94% 92% 66% As you can see the speedup is larger for small needles since the preprocessing overhead of the Two-Way algorithm is significant. This is completely hidden by the GLIBC test due to using just 1 needle. Wilco
On Mon, 01 Oct 2018, Wilco Dijkstra wrote: >>> I ran some tests on a powerpc64le machine and I noticed that, as the >>> haystack becomes larger, the search for a needle that is *not* in the >>> haystack, became slower with this change. >> >> That seems odd, particularly since the Two-Way algorithm is pretty slow >> for small needles (large preprocessing overhead and small skips). > >I looked at bench-strstr again. There are cases where Two-Way wins by >10-15% when the string is found but for short strings the linear loop wins. >On average I can't see any performance difference. I ran bench tests with your patch on an x86 machine, and I noticed the same behaviour that I described previously, however on fewer occasions (never when needle is 2-chars long, and less frequently for 3- and 4-chars long needles). Both on powerpc64le and x86, the overall performance gain for bench tests looks positive, so I don't actually see a compelling reason to not commit this patch. The intention of my question was to know if you have noticed something similar in your tests, and I'm glad to hear that you have used a different benchmark. >However when iterating through 16K needles randomly chosen from natural >text I get the following speedups for different haystack and needle sizes: > > H64 H256 H4096 >N2: 64% 73% 46% >N3: 74% 66% 37% >N4: 94% 92% 66% Could you add this information (data and explanation) to the commit message? It would be good to be able to read it some years in the future and know that a different benchmark (not bench tests) was used to produce these numbers. >As you can see the speedup is larger for small needles since the preprocessing >overhead of the Two-Way algorithm is significant. This is completely hidden by >the GLIBC test due to using just 1 needle. Agreed. Reviewed-by: Gabriel F. T. Gomes <gabriel@inconstante.eti.br>
On 01/10/18 10:35 PM, Wilco Dijkstra wrote: > However when iterating through 16K needles randomly chosen from natural > text I get the following speedups for different haystack and needle sizes: > > H64 H256 H4096 > N2: 64% 73% 46% > N3: 74% 66% 37% > N4: 94% 92% 66% > > As you can see the speedup is larger for small needles since the preprocessing > overhead of the Two-Way algorithm is significant. This is completely hidden by > the GLIBC test due to using just 1 needle. Can you please add this test to benchtests? Thanks, Siddhesh
diff --git a/string/strstr.c b/string/strstr.c index 33acdc5442d3e9031298a56e4f52a19e2ffa7d84..dd8663494ef391f34f7ed0a961d6c0e399bfdda9 100644 --- a/string/strstr.c +++ b/string/strstr.c @@ -46,6 +46,50 @@ #define STRSTR strstr #endif + +static inline char * +strstr2 (const char *hs, const char *ne) +{ + uint32_t h1 = (ne[0] << 16) | ne[1]; + uint32_t h2 = 0; + int c = hs[0]; + while (h1 != h2 && c != 0) + { + h2 = (h2 << 16) | c; + c = *++hs; + } + return h1 == h2 ? (char *)hs - 2 : NULL; +} + +static inline char * +strstr3 (const char *hs, const char *ne) +{ + uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8); + uint32_t h2 = 0; + int c = hs[0]; + while (h1 != h2 && c != 0) + { + h2 = (h2 | c) << 8; + c = *++hs; + } + return h1 == h2 ? (char *)hs - 3 : NULL; +} + +static inline char * +strstr4 (const char *hs, const char *ne) +{ + uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8) | ne[3]; + uint32_t h2 = 0; + int c = hs[0]; + while (h1 != h2 && c != 0) + { + h2 = (h2 << 8) | c; + c = *++hs; + } + return h1 == h2 ? (char *)hs - 4 : NULL; +} + + /* Return the first occurrence of NEEDLE in HAYSTACK. Return HAYSTACK if NEEDLE is empty, otherwise NULL if NEEDLE is not found in HAYSTACK. */ @@ -64,6 +108,13 @@ STRSTR (const char *haystack, const char *needle) if (haystack == NULL || needle[1] == '\0') return (char *) haystack; + if (needle[2] == '\0') + return strstr2 (haystack, needle); + if (needle[3] == '\0') + return strstr3 (haystack, needle); + if (needle[4] == '\0') + return strstr4 (haystack, needle); + /* Ensure HAYSTACK length is at least as long as NEEDLE length. Since a match may occur early on in a huge HAYSTACK, use strnlen and read ahead a few cachelines for improved performance. */