Message ID | 556C36D8.2070208@linux.vnet.ibm.com |
---|---|
State | New |
Headers | show |
On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote: > > This patch optimizes strcasestr function for power >= 7 systems. > This patch uses optimized strlen and strnlen for calculating > string length and the average improvement of this optimization is ~40%. > This patch is tested on powerpc64 and powerpc64le. > Attached the benchresults with this new patch. > Thats not enough. As strcasestr that I submited is around three times slower your implementation would likely be regression over generic one. A problem here is that you use moronic algorithm. Fix algorithm first before trying to optimize it.
On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote: > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > This patch optimizes strcasestr function for power >= 7 systems. > > This patch uses optimized strlen and strnlen for calculating > > string length and the average improvement of this optimization is ~40%. > > This patch is tested on powerpc64 and powerpc64le. > > Attached the benchresults with this new patch. > > > Thats not enough. As strcasestr that I submited is around three times > slower your implementation would likely be regression over generic one. > > A problem here is that you use moronic algorithm. Fix algorithm first > before trying to optimize it. > This is not very helpful. You are demanding changes without clear explanation and justification. What is wrong with Raja's algorithm? What is insufficient in the benchmark data she has provided? And why do you think your specific design applies to PowerISA and POWER7/POWER8 micro-architecture. What data do you have that justified this objection?
On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote: > On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote: > > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > > > This patch optimizes strcasestr function for power >= 7 systems. > > > This patch uses optimized strlen and strnlen for calculating > > > string length and the average improvement of this optimization is ~40%. > > > This patch is tested on powerpc64 and powerpc64le. > > > Attached the benchresults with this new patch. > > > > > Thats not enough. As strcasestr that I submited is around three times > > slower your implementation would likely be regression over generic one. > > > > A problem here is that you use moronic algorithm. Fix algorithm first > > before trying to optimize it. > > > > This is not very helpful. You are demanding changes without clear > explanation and justification. > > What is wrong with Raja's algorithm? What is insufficient in the > benchmark data she has provided? And why do you think your specific > design applies to PowerISA and POWER7/POWER8 micro-architecture. > > What data do you have that justified this objection? I replied on strstr patch thread on why what she submitted is performance regression. So I will repeat arguments from other thread which still apply. First was problem with quadratic behaviour. She tried to fix it but it isn't a fix at all. Just benchmark strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab") That call would take around hundred times than before which is unacceptable. If we ignore that red flag second problem was that benchmark she used is bogus. It test with periodic haystacks, needle is copy of first bytes of haystack with last byte set to something else. 1234567890abcxyz1234567890abcxyz1234567890abcxyz... These are far from real inputs. Periodicity could introduce artefacts in algorithms (like that its worst case for my digraph search which relies on fact that in most strings characters are relatively independent.) My worst fear that some implementation would be ten times faster than in practice. There could be branch that is normally unpredictable. But as input is regular it could be alternatively taken and not taken which cpu recognizes and will predict it perfectly. This isn't main problem which is that result of benchmark depends on parameter and each submitter could make his benchmark best by selecting appropriate value of that parameter. This could apply by selecting period to string n or if we fix previous paragraph by using random strings by selecting probability of character. Performance would differ by order of magnitude depending on choice of that n. A simple comparison for strstr would be compare two-way algorithm with naive algorithm that looks like while (s = strchr(s, n[0])) check_rest If you set n=2 or random strings with only a or b then strchr would cause big overhead per call and it could be five times slower than two-way algorithm whose performance is mostly constant. On other hand with n=255 or using all values a strchr would quickly skip 256 bytes on average and be say 5 times faster than two-way algorithm. If you draw performance characteristic they meet at same point where they run at same speed. If we assume that random strings model real ones well how can you be sure that you selected n constiderably less than encountered in real programs and it shows wrong implementation as best? Just use same patch like I send with ((unsigned) rand())%16 + 1 and you will see completely different numbers in benchmark. I didn't touched that benchmarks couldn't produce a correct result due additional factors, like that in practice most inputs are small so a branch prediction in implementation header matters a lot. But this benchmark doesn't measure it as it calls functions with same arguments over and over again which makes these branches. That briefly covers what was wrong with benchmark. As microarchitecture one doesn't need to know details to see that something is fundamentally wrong. A algorithm here is essentially byte-by-byte check which essentially does following uint64_t haystack_window, needle_window; while (!end) { haystack_window = haystack_window << 8 | tolower (haystack[i]) if (haystack_window == needle_window) match_rest i++; } A problem of that is that operations there are useless most of time. For original benchmark 15 times out of 16 you don't have to use shifts and other instructions to compare needle with haystack. At that position they differ on first byte. So her algorithm is worse than naive doing following with same optimizations as in previous case. while (!end) { if (tolower(haystack[i]) == tolower(needle[0])) match_rest i++; } As that was covered my final comment was that it isn't substantial speedup. My implemantation is generic and gives large boost so why should be powerpc different. As I don't have powerpc access now apply my patches [PATCH v5] Generic string skeleton [PATCH v5 4*] Generic string search functions (strstr, strcasestr, memmem) and run (preferably fixed) benchmark with these. As gains that I see on x64 are bigger than ones gained by this assembly you will likely see that generic implementation is indeed better and it would be pointless to try review that only to remove it shortly after adding to improve performance.
On Mon, 2015-06-01 at 18:22 +0200, Ondřej Bílka wrote: > On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote: > > On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote: > > > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > > > > > This patch optimizes strcasestr function for power >= 7 systems. > > > > This patch uses optimized strlen and strnlen for calculating > > > > string length and the average improvement of this optimization is ~40%. > > > > This patch is tested on powerpc64 and powerpc64le. > > > > Attached the benchresults with this new patch. > > > > > > > Thats not enough. As strcasestr that I submited is around three times > > > slower your implementation would likely be regression over generic one. > > > > > > A problem here is that you use moronic algorithm. Fix algorithm first > > > before trying to optimize it. > > > > > > > This is not very helpful. You are demanding changes without clear > > explanation and justification. > > > > What is wrong with Raja's algorithm? What is insufficient in the > > benchmark data she has provided? And why do you think your specific > > design applies to PowerISA and POWER7/POWER8 micro-architecture. > > > > What data do you have that justified this objection? > > I replied on strstr patch thread on why what she submitted is > performance regression. So I will repeat arguments from other thread > which still apply. > > First was problem with quadratic behaviour. She tried to fix it but it > isn't a fix at all. Just benchmark Which is fixed in the latest patch > > strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab") > > That call would take around hundred times than before which is > unacceptable. > > If we ignore that red flag second problem was that benchmark she used is > bogus. It test with periodic haystacks, needle is copy of first bytes of > haystack with last byte set to something else. > Have you looked at he latest patch? If the benchmark provided does not cover your assertion, then it is your responsibility to provide and upstream the appropriate benchmark. And convince the community that benchmark covers a valid issue. > snip > > As microarchitecture one doesn't need to know details to see that > something is fundamentally wrong. A algorithm here is essentially > byte-by-byte check which essentially does following > That assertion is at odds with my experience. Every day I see code that somebody thinks is optimized but in reality is not. Current processors are vastly more complicated then most realize and RISC processors have stronger interactions between compiler and micro-architecture. So it is on you have to prove that your generic optimization actually works for more then one platform. > snip > > As that was covered my final comment was that it isn't substantial > speedup. My implemantation is generic and gives large boost > so why should be powerpc different. > > As I don't have powerpc access now apply my patches > > [PATCH v5] Generic string skeleton > [PATCH v5 4*] Generic string search functions (strstr, strcasestr, memmem) > Every one has access to the the GCC Compile farm https://gcc.gnu.org/wiki/CompileFarm So there is no excuse to not have data to back you assertion.
On Tue, Jun 02, 2015 at 01:47:03PM -0500, Steven Munroe wrote: > On Mon, 2015-06-01 at 18:22 +0200, Ondřej Bílka wrote: > > On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote: > > > On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote: > > > > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > > > > > > > This patch optimizes strcasestr function for power >= 7 systems. > > > > > This patch uses optimized strlen and strnlen for calculating > > > > > string length and the average improvement of this optimization is ~40%. > > > > > This patch is tested on powerpc64 and powerpc64le. > > > > > Attached the benchresults with this new patch. > > > > > > > > > Thats not enough. As strcasestr that I submited is around three times > > > > slower your implementation would likely be regression over generic one. > > > > > > > > A problem here is that you use moronic algorithm. Fix algorithm first > > > > before trying to optimize it. > > > > > > > > > > This is not very helpful. You are demanding changes without clear > > > explanation and justification. > > > > > > What is wrong with Raja's algorithm? What is insufficient in the > > > benchmark data she has provided? And why do you think your specific > > > design applies to PowerISA and POWER7/POWER8 micro-architecture. > > > > > > What data do you have that justified this objection? > > > > I replied on strstr patch thread on why what she submitted is > > performance regression. So I will repeat arguments from other thread > > which still apply. > > > > First was problem with quadratic behaviour. She tried to fix it but it > > isn't a fix at all. Just benchmark > > Which is fixed in the latest patch > > > > strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab") > > Did you read also following line? It isn't solved at all. > > That call would take around hundred times than before which is > > unacceptable. > > > > If we ignore that red flag second problem was that benchmark she used is > > bogus. It test with periodic haystacks, needle is copy of first bytes of > > haystack with last byte set to something else. > > > Have you looked at he latest patch? If the benchmark provided does not > cover your assertion, then it is your responsibility to provide and > upstream the appropriate benchmark. And convince the community that > benchmark covers a valid issue. > That was my comment why benchmark isn't feasible to prove performance as I could for almost any patch write benchmark that shows its regression by selecting suitable data. Question is what happens in reality and for that you shouldn't use artifical benchmarks. > > snip > > > > As microarchitecture one doesn't need to know details to see that > > something is fundamentally wrong. A algorithm here is essentially > > byte-by-byte check which essentially does following > > > > That assertion is at odds with my experience. Every day I see code that > somebody thinks is optimized but in reality is not. Current processors > are vastly more complicated then most realize and RISC processors have > stronger interactions between compiler and micro-architecture. > > So it is on you have to prove that your generic optimization actually > works for more then one platform. > I have same especially when author submits an assembly improvement where algorithm is completely wrong and doesn't cover known performance issues. > > snip > > > > > As that was covered my final comment was that it isn't substantial > > speedup. My implemantation is generic and gives large boost > > so why should be powerpc different. > > > > As I don't have powerpc access now apply my patches > > > > [PATCH v5] Generic string skeleton > > [PATCH v5 4*] Generic string search functions (strstr, strcasestr, memmem) > > > Every one has access to the the GCC Compile farm > > https://gcc.gnu.org/wiki/CompileFarm > > So there is no excuse to not have data to back you assertion. Except that all power machines except osuosl are offline. There may be osuosl ones but gcc11[012].osuosl.org which don't resolve
On Tue, 2015-06-02 at 23:00 +0200, Ondřej Bílka wrote: > On Tue, Jun 02, 2015 at 01:47:03PM -0500, Steven Munroe wrote: > > On Mon, 2015-06-01 at 18:22 +0200, Ondřej Bílka wrote: > > > On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote: > > > > On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote: > > > > > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote: > > > > > > > > > > > > This patch optimizes strcasestr function for power >= 7 systems. > > > > > > This patch uses optimized strlen and strnlen for calculating > > > > > > string length and the average improvement of this optimization is ~40%. > > > > > > This patch is tested on powerpc64 and powerpc64le. > > > > > > Attached the benchresults with this new patch. > > > > > > > > > > > Thats not enough. As strcasestr that I submited is around three times > > > > > slower your implementation would likely be regression over generic one. > > > > > > > > > > A problem here is that you use moronic algorithm. Fix algorithm first > > > > > before trying to optimize it. > > > > > > > > > > > > > This is not very helpful. You are demanding changes without clear > > > > explanation and justification. > > > > > > > > What is wrong with Raja's algorithm? What is insufficient in the > > > > benchmark data she has provided? And why do you think your specific > > > > design applies to PowerISA and POWER7/POWER8 micro-architecture. > > > > > > > > What data do you have that justified this objection? > > > > > > I replied on strstr patch thread on why what she submitted is > > > performance regression. So I will repeat arguments from other thread > > > which still apply. > > > > > > First was problem with quadratic behaviour. She tried to fix it but it > > > isn't a fix at all. Just benchmark > > > > Which is fixed in the latest patch > > > > > > strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab") > > > > Did you read also following line? It isn't solved at all. > > > That call would take around hundred times than before which is > > > unacceptable. > > > Ok we need to make progress, either you need to provide data from a POWER machine that proves your point. Or you need to proved a benchmark that can be include in GLIBC and we can run, that proves your point. So far this only your opinion from a abstract perspective vs my opinion based on my experience as a platform maintainer. It is past time to break this logjam! Anyone disagree with this approach? > > So there is no excuse to not have data to back you assertion. > > Except that all power machines except osuosl are offline. There may be > osuosl ones but gcc11[012].osuosl.org which don't resolve > My team assures me that both gcc110 gcc112 are up and available. -bash-4.3$ uptime 01:30PM up 127 days, 3:38, 0 users, load average: 2.36, 2.11, 1.94 you might try via the fsffrance.org domain.
Our policy is that all such questions should be settled objectively in a reproducible manner. That means a new benchmark should be added that can demonstrate the effect of the library change in question, and consensus reached on the utility of that particular benchmark (which is always a somewhat subjective issue). Once the benchmark is in, then objective reports of benchmark results before and after a change on various hardware must be shown to justify the library change. The onus for making all this happen is on the person who wants to change the code from its status quo.
On Tue, 2015-06-09 at 14:05 -0700, Roland McGrath wrote: > Our policy is that all such questions should be settled objectively in a > reproducible manner. That means a new benchmark should be added that can > demonstrate the effect of the library change in question, and consensus > reached on the utility of that particular benchmark (which is always a > somewhat subjective issue). Once the benchmark is in, then objective > reports of benchmark results before and after a change on various hardware > must be shown to justify the library change. The onus for making all this > happen is on the person who wants to change the code from its status quo. > In https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html Raja provided the results for the existing benchmark. So we have meet your and the community requirement. Ondřej Bílka is asserting that not good enough based based on his opinion that the algorithm is flawed. But he has not provide a benchmark that we can use to verify this. Since we do not understand what the problem is, we can not provide a new benchmark. If Ondřej Bílka believes there is a problem that only he understands then the onus is on him to provide the appropriate benchmark.
> If Ond ej Bílka believes there is a problem that only he understands > then the onus is on him to provide the appropriate benchmark. Agreed. It's reasonable to hold up the change based on announced intent to quickly supply a benchmark that changes the available objective analysis of the change's performance effect. But if the objector doesn't commit to delivering that benchmark soon, then he doesn't have grounds to object to the change. OTOH, if it's the case that the only benchmarks we have for a particular function are very poor representatives and the consensus agrees on that assessment, then it's reasonable to demand that better benchmarks be achieved before changes claimed to improve performance of that function can be entertained. (I don't know if that's the situation here; I'm just talking about the general policy of the project.) In short, if you intend either to propose a lot of performance-improvement changes or to be skeptical of such proposals by others, then you should expect to spend a lot of effort on writing better benchmarks and reaching consensus on their quality. Thanks, Roland
On 06/10/2015 04:18 PM, Ondřej Bílka wrote: > > The onus for making all this happen is on the person who wants to change the code from its status quo. > > That would mean that its Rajalakshmi and responsibility to > write benchmark that shows its performance improvement. I stated my > objections explicitly described which inputs to use. > > Problem is that Steven failed to do his duty as a powerpc maintainer > which is to read and review change, check benchmark inputs and that > they are adequate. I could supply really bad implementation of strstr > which first checks special case that needle and haystack are periodic > abcdef sequences and returns result expected from benchmark. > > I did just simple request that every freshmen should know and Steven > should as powerpc maintainer check how its handled. Its that naive > algorithm and proposed algorithm are slow on inputs of form > > strstr("aaaaa....aaaa","aaaa...aaab"); > > As from your previous mail its their responsibility to add that case for > benchmark. A 'solution' on v2 would be to switch into two-way algorithm > when size is bigger than 2048. That isn't solution as you still have > problem with size 2047. I seen somewhere comment that its security > problem when you have website with search function on user comments and > implementation internally uses strstr then attacker could just post > comment with aaaaa...a and search few times for aa...ab. > 2048 number is chosen by checking various length inputs and comparing the performance. So for 2047 case, there wont be worst performance. > Roland as its really basic issue I think that its proposer > responsibility to check. Also for strstr thread Carlos said about strstr > quadratic case: > >> We should add gnulib's test to the microbenchmarks. > > > Then as I commented that algorithm is flawed its from my experience, > Raji's description was: > > For case 1, I always compare 8 characters at a time using cmpb instruction. > instruction. > -> perform basic strlen, strnlen, checks. Move haystack based on strchr result. > strchr result. > -> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them. > needle and compare them. > If they are same proceed to next 8 bytes of haystack and needle. > Else load next doubleword from haystack.Form a doubleword with last > seven characters from first load and first character from > second load and compare again with needle. Proceed in a similar way. > This gives good improvement. > Lets not mix the review comments of strstr and strcasestr.(as the algorithm is different) > My comment was that it wastes time most of time as in common case these > already differ in first byte so all these shifts were completely > unnecessary, a simple optimization of naive algorithm using > > while(s=strchr(s,n[0])) If I use while(s=strchr(s,n[0]) as mentioned in https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html the quadratic testcase gives worst peformance. Attached the testcase. > > would be faster. Steven instead of doing his duty as powerpc maintainer > and checking result of benchtest that I send instead tried to bullshit > himself by telling that powerpc architecture is too special. Then he > asked me do do benchmark instead as stalling tactics. > > If he did his duty of reading benchmark results there is already > evidence. First to notice is there are regressions in original > benchmark, like following: > > Length 30/2, alignment 0/ 9, found: 20.7812 32.2656 13.125 19.7188 > Length 30/2, alignment 0/ 9, fail: 32.4688 33.9219 36.75 31.4688 > I agree there are few regression cases(79 out of 13442 cases) in 'proposed_benchresults.txt' attached in https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html. This is minimal when compared to the overall performance gain. > That he should comment, as strstr on average end on 10.6 byte after > haystack start its quite relevant. > > But if he read proposed benchmark where needle and haystack are randomly > generated from list of 128 characters then he would easily see it. I have used your proposed benchtest, and it always gives improvement for strcasestr.Please check the file 'newresults' in https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html > > Here is trimmed version to see pattern. Note that first there are lot of > times where performance is less than 13. Thats case where strchr didn't > find first character and we just return(generic case now does > unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.) > > Its natural to ask what would happen if you called strchr again. In 120 > bytes it would probably didn't found second character and we would > finish in 26 cycles at most not 111.5 > > I expect that average loop performance would be 13.4375/135 = 0.1 cycles > per byte not cycle per byte that this algorithm exhibits after he loses > performance gains from first correct strchr call. > > simple_strstr stupid_strstr __strstr_power7 > __strstr_ppc > > Length 4/2, alignment 0/ 3, found: 9.35938 8.09375 7.90625 8.5 > Length 12/2, alignment 3/ 9, fail: 9.23438 16.5625 7.90625 9.20312 > Length 24/2, alignment 3/ 9, fail: 9.54688 26.2812 8.32812 10.1875 > Length 44/4, alignment 9/ 3, found: 13.4688 40.9219 9.59375 13.9688 > Length 44/4, alignment 9/ 3, fail: 13.4688 40.9531 50.6875 13.9844 > Length 52/4, alignment 3/ 9, found: 13.5312 47.2812 9.5625 14.7344 > Length 52/4, alignment 3/ 9, fail: 45.9688 53.625 35.9688 47.5938 > Length 75/5, alignment 15/ 9, fail: 73.75 67.9375 64.2188 75.9062 > Length 75/5, alignment 15/15, found: 15.2188 65.9062 10.7344 17.8438 > Length 75/5, alignment 15/15, fail: 61.5469 68.6562 50.375 63.0312 > Length 120/8, alignment 15/ 3, fail: 113.281 111.5 103.5 114.406 > Length 120/8, alignment 15/ 9, found: 21.25 99.1406 12.8438 28.1406 > Length 135/9, alignment 15/15, found: 23.125 110.734 13.4375 30.7969 > Length 135/9, alignment 15/15, fail: 147.484 113.781 110.516 165.984 > Length 168/12, alignment 3/15, found: 31.2344 134.172 13.6719 39.9375 > Length 168/12, alignment 3/15, fail: 170.766 137.609 131.859 179.859 > Length 168/12, alignment 9/ 0, found: 100.562 137.922 76.6562 112.578 > Length 168/12, alignment 9/ 0, fail: 168.703 144.078 144.828 181.203 > Length 168/12, alignment 9/ 3, found: 129.938 137.703 109.828 136.266 > Length 168/12, alignment 9/ 3, fail: 178.734 137.875 165.469 183.609 > Length 465/31, alignment 15/ 0, fail: 73.0469 363.375 32.2969 87.0625 > Length 465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625 > Length 465/31, alignment 15/ 3, fail: 512.922 373.188 437.047 556.844 > Length 465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75 > Length 465/31, alignment 15/ 9, fail: 484.312 376.406 423.703 513.672 > Length 465/31, alignment 15/15, found: 326.859 376 256.625 350.422 > Length 465/31, alignment 15/15, fail: 512.641 376.203 439.234 545.906 > Length 131071/16, alignment 0/ 0, found: 116583 108378 122125 > 127809 > Length 131071/16, alignment 0/ 0, fail: 118298 108639 122208 > 128700 > >
On Thu, 2015-06-11 at 00:04 +0530, Rajalakshmi Srinivasaraghavan wrote: > > On 06/10/2015 04:18 PM, Ondřej Bílka wrote: > > > > The onus for making all this happen is on the person who wants to change the code from its status quo. > > > > That would mean that its Rajalakshmi and responsibility to > > write benchmark that shows its performance improvement. I stated my > > objections explicitly described which inputs to use. > > > > Problem is that Steven failed to do his duty as a powerpc maintainer > > which is to read and review change, check benchmark inputs and that > > they are adequate. I could supply really bad implementation of strstr > > which first checks special case that needle and haystack are periodic > > abcdef sequences and returns result expected from benchmark. > > > > I did just simple request that every freshmen should know and Steven > > should as powerpc maintainer check how its handled. Its that naive > > algorithm and proposed algorithm are slow on inputs of form > > > > strstr("aaaaa....aaaa","aaaa...aaab"); > > > > As from your previous mail its their responsibility to add that case for > > benchmark. A 'solution' on v2 would be to switch into two-way algorithm > > when size is bigger than 2048. That isn't solution as you still have > > problem with size 2047. I seen somewhere comment that its security > > problem when you have website with search function on user comments and > > implementation internally uses strstr then attacker could just post > > comment with aaaaa...a and search few times for aa...ab. > > Thanks Raja for correcting my misconception. So Raja did responded (in https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html) with a new benchmark that we believe should address Ondřej Bílka concerns. I apologize if this error on my part cause any undo increases in blood pressure or heart burn. > 2048 number is chosen by checking various length inputs and comparing > the performance. So for 2047 case, there wont be worst performance. > > > Roland as its really basic issue I think that its proposer > > responsibility to check. Also for strstr thread Carlos said about strstr > > quadratic case: > > > >> We should add gnulib's test to the microbenchmarks. > > > > > > Then as I commented that algorithm is flawed its from my experience, > > Raji's description was: > > > > For case 1, I always compare 8 characters at a time using cmpb instruction. > > instruction. > > -> perform basic strlen, strnlen, checks. Move haystack based on strchr result. > > strchr result. > > -> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them. > > needle and compare them. > > If they are same proceed to next 8 bytes of haystack and needle. > > Else load next doubleword from haystack.Form a doubleword with last > > seven characters from first load and first character from > > second load and compare again with needle. Proceed in a similar way. > > This gives good improvement. > > > Lets not mix the review comments of strstr and strcasestr.(as the > algorithm is different) > > My comment was that it wastes time most of time as in common case these > > already differ in first byte so all these shifts were completely > > unnecessary, a simple optimization of naive algorithm using > > > > while(s=strchr(s,n[0])) > If I use while(s=strchr(s,n[0]) as mentioned in > https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html > the quadratic testcase gives worst peformance. Attached the testcase. > > > > would be faster. Steven instead of doing his duty as powerpc maintainer > > and checking result of benchtest that I send instead tried to bullshit > > himself by telling that powerpc architecture is too special. Then he > > asked me do do benchmark instead as stalling tactics. > > > > If he did his duty of reading benchmark results there is already > > evidence. First to notice is there are regressions in original > > benchmark, like following: > > > > Length 30/2, alignment 0/ 9, found: 20.7812 32.2656 13.125 19.7188 > > Length 30/2, alignment 0/ 9, fail: 32.4688 33.9219 36.75 31.4688 > > > I agree there are few regression cases(79 out of 13442 cases) in > 'proposed_benchresults.txt' attached in > https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html. > This is minimal when compared to the overall performance gain. > > > That he should comment, as strstr on average end on 10.6 byte after > > haystack start its quite relevant. > > > > But if he read proposed benchmark where needle and haystack are randomly > > generated from list of 128 characters then he would easily see it. > I have used your proposed benchtest, and it always gives improvement for > strcasestr.Please check the file 'newresults' in > https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html > > > > Here is trimmed version to see pattern. Note that first there are lot of > > times where performance is less than 13. Thats case where strchr didn't > > find first character and we just return(generic case now does > > unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.) > > > > Its natural to ask what would happen if you called strchr again. In 120 > > bytes it would probably didn't found second character and we would > > finish in 26 cycles at most not 111.5 > > > > I expect that average loop performance would be 13.4375/135 = 0.1 cycles > > per byte not cycle per byte that this algorithm exhibits after he loses > > performance gains from first correct strchr call. > > > > simple_strstr stupid_strstr __strstr_power7 > > __strstr_ppc > > > > Length 4/2, alignment 0/ 3, found: 9.35938 8.09375 7.90625 8.5 > > Length 12/2, alignment 3/ 9, fail: 9.23438 16.5625 7.90625 9.20312 > > Length 24/2, alignment 3/ 9, fail: 9.54688 26.2812 8.32812 10.1875 > > Length 44/4, alignment 9/ 3, found: 13.4688 40.9219 9.59375 13.9688 > > Length 44/4, alignment 9/ 3, fail: 13.4688 40.9531 50.6875 13.9844 > > Length 52/4, alignment 3/ 9, found: 13.5312 47.2812 9.5625 14.7344 > > Length 52/4, alignment 3/ 9, fail: 45.9688 53.625 35.9688 47.5938 > > Length 75/5, alignment 15/ 9, fail: 73.75 67.9375 64.2188 75.9062 > > Length 75/5, alignment 15/15, found: 15.2188 65.9062 10.7344 17.8438 > > Length 75/5, alignment 15/15, fail: 61.5469 68.6562 50.375 63.0312 > > Length 120/8, alignment 15/ 3, fail: 113.281 111.5 103.5 114.406 > > Length 120/8, alignment 15/ 9, found: 21.25 99.1406 12.8438 28.1406 > > Length 135/9, alignment 15/15, found: 23.125 110.734 13.4375 30.7969 > > Length 135/9, alignment 15/15, fail: 147.484 113.781 110.516 165.984 > > Length 168/12, alignment 3/15, found: 31.2344 134.172 13.6719 39.9375 > > Length 168/12, alignment 3/15, fail: 170.766 137.609 131.859 179.859 > > Length 168/12, alignment 9/ 0, found: 100.562 137.922 76.6562 112.578 > > Length 168/12, alignment 9/ 0, fail: 168.703 144.078 144.828 181.203 > > Length 168/12, alignment 9/ 3, found: 129.938 137.703 109.828 136.266 > > Length 168/12, alignment 9/ 3, fail: 178.734 137.875 165.469 183.609 > > Length 465/31, alignment 15/ 0, fail: 73.0469 363.375 32.2969 87.0625 > > Length 465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625 > > Length 465/31, alignment 15/ 3, fail: 512.922 373.188 437.047 556.844 > > Length 465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75 > > Length 465/31, alignment 15/ 9, fail: 484.312 376.406 423.703 513.672 > > Length 465/31, alignment 15/15, found: 326.859 376 256.625 350.422 > > Length 465/31, alignment 15/15, fail: 512.641 376.203 439.234 545.906 > > Length 131071/16, alignment 0/ 0, found: 116583 108378 122125 > > 127809 > > Length 131071/16, alignment 0/ 0, fail: 118298 108639 122208 > > 128700 > > > > >
On Thu, Jun 11, 2015 at 12:04:08AM +0530, Rajalakshmi Srinivasaraghavan wrote: > > > On 06/10/2015 04:18 PM, Ondřej Bílka wrote: > > > > The onus for making all this happen is on the person who wants to change the code from its status quo. > > > >That would mean that its Rajalakshmi and responsibility to > >write benchmark that shows its performance improvement. I stated my > >objections explicitly described which inputs to use. > > > >Problem is that Steven failed to do his duty as a powerpc maintainer > >which is to read and review change, check benchmark inputs and that > >they are adequate. I could supply really bad implementation of strstr > >which first checks special case that needle and haystack are periodic > >abcdef sequences and returns result expected from benchmark. > > > >I did just simple request that every freshmen should know and Steven > >should as powerpc maintainer check how its handled. Its that naive > >algorithm and proposed algorithm are slow on inputs of form > > > >strstr("aaaaa....aaaa","aaaa...aaab"); > > > >As from your previous mail its their responsibility to add that case for > >benchmark. A 'solution' on v2 would be to switch into two-way algorithm > >when size is bigger than 2048. That isn't solution as you still have > >problem with size 2047. I seen somewhere comment that its security > >problem when you have website with search function on user comments and > >implementation internally uses strstr then attacker could just post > >comment with aaaaa...a and search few times for aa...ab. > > > 2048 number is chosen by checking various length inputs and > comparing the performance. So for 2047 case, there wont be worst > performance. > What are you talking about? I am saying that aaaa haystack with aaa...ab needle of size 2047 is worst case for your algorithm performance. > >Roland as its really basic issue I think that its proposer > >responsibility to check. Also for strstr thread Carlos said about strstr > >quadratic case: > > > >>We should add gnulib's test to the microbenchmarks. > > > > > >Then as I commented that algorithm is flawed its from my experience, > >Raji's description was: > > > >For case 1, I always compare 8 characters at a time using cmpb instruction. > >instruction. > >-> perform basic strlen, strnlen, checks. Move haystack based on strchr result. > >strchr result. > >-> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them. > >needle and compare them. > >If they are same proceed to next 8 bytes of haystack and needle. > >Else load next doubleword from haystack.Form a doubleword with last > >seven characters from first load and first character from > >second load and compare again with needle. Proceed in a similar way. > >This gives good improvement. > > > Lets not mix the review comments of strstr and strcasestr.(as the > algorithm is different) From webster dictionary "Full Definition of ALGORITHM : a procedure for solving a mathematical problem (as of finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation; broadly : a step-by-step procedure for solving a problem or accomplishing some end especially by a computer" Changing techical few details like applying parallel tolower does not make new algorithm. Your strstr and strcasestr are really same algorithm. And my point still applies, as you could change generic strcasestr from while(s=strchr(s,n[0])) to if (bijective_locale) { char fn[3]; fn[0]=tolower(n[0]); fn[1]=toupper(n[0]); fn[2]=0; while(s=strpbrk(s,fn)) and add optimized strpbrk(x,"xy") path. That as strpbrk would be at most twice slower than strchr and match is twice likely would be faster than what you gain and easier to implement. > >My comment was that it wastes time most of time as in common case these > >already differ in first byte so all these shifts were completely > >unnecessary, a simple optimization of naive algorithm using > > > >while(s=strchr(s,n[0])) > If I use while(s=strchr(s,n[0]) as mentioned in > https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html > the quadratic testcase gives worst peformance. Attached the testcase. Thats irrelevant testcase. I asked two things. First is show how big regression you get on pathological inputs. So submit data when you use your testcase with m=2047 and strstr executed 10000 times in loop with current algorithm and with your proposed one. That is what I ask. Second is about average case perforance. There worst case is irrelevant as you use buy-or-rent techniques to handle it. I sumbitted following patch with easy way to handle that. [PATCH] Improve generic strstr performance. its currently superseeded by generic vector bigraph search but it should beat your strstr. > > > >would be faster. Steven instead of doing his duty as powerpc maintainer > >and checking result of benchtest that I send instead tried to bullshit > >himself by telling that powerpc architecture is too special. Then he > >asked me do do benchmark instead as stalling tactics. > > > >If he did his duty of reading benchmark results there is already > >evidence. First to notice is there are regressions in original > >benchmark, like following: > > > >Length 30/2, alignment 0/ 9, found: 20.7812 32.2656 13.125 19.7188 > >Length 30/2, alignment 0/ 9, fail: 32.4688 33.9219 36.75 31.4688 > > > I agree there are few regression cases(79 out of 13442 cases) in > 'proposed_benchresults.txt' attached in > https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html. > This is minimal when compared to the overall performance gain. > What overall performance gain? Did you try to find application and measure performance difference before and after? Big sizes are not very relevant in practice. You look behind 32'th byte of haystack only in 1.5% of calls so 10% regression at these sizes is much worse that 30% gain on 256 byte range. In my profiler a simple collection shows following calls 13561 success probability 3.0% average needle size: 8.9 ns <= 0: 0.4% ns <= 4: 0.8% ns <= 8: 5.9% ns <= 16: 99.9% ns <= 24: 100.0% ns <= 32: 100.0% ns <= 48: 100.0% ns <= 64: 100.0% average n: 10.6 n <= 0: 0.4% n <= 4: 1.3% n <= 8: 1.3% n <= 16: 97.0% n <= 24: 97.8% n <= 32: 98.5% n <= 48: 99.9% n <= 64: 99.9% s aligned to 4 bytes: 98.6% 8 bytes: 98.6% 16 bytes: 96.0% > >That he should comment, as strstr on average end on 10.6 byte after > >haystack start its quite relevant. > > > >But if he read proposed benchmark where needle and haystack are randomly > >generated from list of 128 characters then he would easily see it. > I have used your proposed benchtest, and it always gives improvement > for strcasestr.Please check the file 'newresults' in > https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html > > > >Here is trimmed version to see pattern. Note that first there are lot of > >times where performance is less than 13. Thats case where strchr didn't > >find first character and we just return(generic case now does > >unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.) > > > >Its natural to ask what would happen if you called strchr again. In 120 > >bytes it would probably didn't found second character and we would > >finish in 26 cycles at most not 111.5 > > > >I expect that average loop performance would be 13.4375/135 = 0.1 cycles > >per byte not cycle per byte that this algorithm exhibits after he loses > >performance gains from first correct strchr call. > > > > simple_strstr stupid_strstr __strstr_power7 > >__strstr_ppc > > > >Length 4/2, alignment 0/ 3, found: 9.35938 8.09375 7.90625 8.5 > >Length 12/2, alignment 3/ 9, fail: 9.23438 16.5625 7.90625 9.20312 > >Length 24/2, alignment 3/ 9, fail: 9.54688 26.2812 8.32812 10.1875 > >Length 44/4, alignment 9/ 3, found: 13.4688 40.9219 9.59375 13.9688 > >Length 44/4, alignment 9/ 3, fail: 13.4688 40.9531 50.6875 13.9844 > >Length 52/4, alignment 3/ 9, found: 13.5312 47.2812 9.5625 14.7344 > >Length 52/4, alignment 3/ 9, fail: 45.9688 53.625 35.9688 47.5938 > >Length 75/5, alignment 15/ 9, fail: 73.75 67.9375 64.2188 75.9062 > >Length 75/5, alignment 15/15, found: 15.2188 65.9062 10.7344 17.8438 > >Length 75/5, alignment 15/15, fail: 61.5469 68.6562 50.375 63.0312 > >Length 120/8, alignment 15/ 3, fail: 113.281 111.5 103.5 114.406 > >Length 120/8, alignment 15/ 9, found: 21.25 99.1406 12.8438 28.1406 > >Length 135/9, alignment 15/15, found: 23.125 110.734 13.4375 30.7969 > >Length 135/9, alignment 15/15, fail: 147.484 113.781 110.516 165.984 > >Length 168/12, alignment 3/15, found: 31.2344 134.172 13.6719 39.9375 > >Length 168/12, alignment 3/15, fail: 170.766 137.609 131.859 179.859 > >Length 168/12, alignment 9/ 0, found: 100.562 137.922 76.6562 112.578 > >Length 168/12, alignment 9/ 0, fail: 168.703 144.078 144.828 181.203 > >Length 168/12, alignment 9/ 3, found: 129.938 137.703 109.828 136.266 > >Length 168/12, alignment 9/ 3, fail: 178.734 137.875 165.469 183.609 > >Length 465/31, alignment 15/ 0, fail: 73.0469 363.375 32.2969 87.0625 > >Length 465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625 > >Length 465/31, alignment 15/ 3, fail: 512.922 373.188 437.047 556.844 > >Length 465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75 > >Length 465/31, alignment 15/ 9, fail: 484.312 376.406 423.703 513.672 > >Length 465/31, alignment 15/15, found: 326.859 376 256.625 350.422 > >Length 465/31, alignment 15/15, fail: 512.641 376.203 439.234 545.906 > >Length 131071/16, alignment 0/ 0, found: 116583 108378 122125 > >127809 > >Length 131071/16, alignment 0/ 0, fail: 118298 108639 122208 > >128700 > > > >
On 06/16/2015 12:51 PM, Ondřej Bílka wrote: > On Thu, Jun 11, 2015 at 12:04:08AM +0530, Rajalakshmi Srinivasaraghavan wrote: >> >> >> On 06/10/2015 04:18 PM, Ondřej Bílka wrote: >>> >>> The onus for making all this happen is on the person who wants to change the code from its status quo. >>> >>> That would mean that its Rajalakshmi and responsibility to >>> write benchmark that shows its performance improvement. I stated my >>> objections explicitly described which inputs to use. >>> >>> Problem is that Steven failed to do his duty as a powerpc maintainer >>> which is to read and review change, check benchmark inputs and that >>> they are adequate. I could supply really bad implementation of strstr >>> which first checks special case that needle and haystack are periodic >>> abcdef sequences and returns result expected from benchmark. >>> >>> I did just simple request that every freshmen should know and Steven >>> should as powerpc maintainer check how its handled. Its that naive >>> algorithm and proposed algorithm are slow on inputs of form >>> >>> strstr("aaaaa....aaaa","aaaa...aaab"); >>> >>> As from your previous mail its their responsibility to add that case for >>> benchmark. A 'solution' on v2 would be to switch into two-way algorithm >>> when size is bigger than 2048. That isn't solution as you still have >>> problem with size 2047. I seen somewhere comment that its security >>> problem when you have website with search function on user comments and >>> implementation internally uses strstr then attacker could just post >>> comment with aaaaa...a and search few times for aa...ab. >>> >> 2048 number is chosen by checking various length inputs and >> comparing the performance. So for 2047 case, there wont be worst >> performance. >> > What are you talking about? I am saying that aaaa haystack with aaa...ab > needle of size 2047 is worst case for your algorithm performance. > I ran the attached test.c in a loop 10000 times. proposed patch took 19s and existing code took 20s on power machine. > >>> Roland as its really basic issue I think that its proposer >>> responsibility to check. Also for strstr thread Carlos said about strstr >>> quadratic case: >>> >>>> We should add gnulib's test to the microbenchmarks. >>> >>> >>> Then as I commented that algorithm is flawed its from my experience, >>> Raji's description was: >>> >>> For case 1, I always compare 8 characters at a time using cmpb instruction. >>> instruction. >>> -> perform basic strlen, strnlen, checks. Move haystack based on strchr result. >>> strchr result. >>> -> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them. >>> needle and compare them. >>> If they are same proceed to next 8 bytes of haystack and needle. >>> Else load next doubleword from haystack.Form a doubleword with last >>> seven characters from first load and first character from >>> second load and compare again with needle. Proceed in a similar way. >>> This gives good improvement. >>> >> Lets not mix the review comments of strstr and strcasestr.(as the >> algorithm is different) > >>From webster dictionary > > "Full Definition of ALGORITHM : > a procedure for solving a mathematical problem (as of finding the > greatest common divisor) in a finite number of steps that frequently > involves repetition of an operation; broadly : a step-by-step procedure > for solving a problem or accomplishing some end especially by a computer" > > Changing techical few details like applying parallel tolower does not > make new algorithm. > > Your strstr and strcasestr are really same algorithm. > The reason I mentioned them as different is, in strstr() patch the input is loaded as double word ie. 8 bytes at a time by byte and in strcasestr its byte by byte load.The input is handled in different way. > And my point still applies, as you could change generic strcasestr from > > while(s=strchr(s,n[0])) > > to > > if (bijective_locale) > { > char fn[3]; > fn[0]=tolower(n[0]); > fn[1]=toupper(n[0]); > fn[2]=0; > while(s=strpbrk(s,fn)) > > and add optimized strpbrk(x,"xy") path. That as strpbrk would be at most > twice slower than strchr and match is twice likely would be faster than > what you gain and easier to implement. > >>> My comment was that it wastes time most of time as in common case these >>> already differ in first byte so all these shifts were completely >>> unnecessary, a simple optimization of naive algorithm using >>> >>> while(s=strchr(s,n[0])) >> If I use while(s=strchr(s,n[0]) as mentioned in >> https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html >> the quadratic testcase gives worst peformance. Attached the testcase. > > Thats irrelevant testcase. I asked two things. First is show how big > regression you get on pathological inputs. So submit data when you use > your testcase with m=2047 and strstr executed 10000 times in loop with > current algorithm and with your proposed one. That is what I ask. > > Second is about average case perforance. There worst case is irrelevant > as you use buy-or-rent techniques to handle it. I sumbitted following > patch with easy way to handle that. > > [PATCH] Improve generic strstr performance. > I applied your proposed patches and the benchtests and I could see proposed patch(__strstr_power7) is better than existing code (__strstr_ppc). 13363 out of 13442 combinations shows improvement. Attached benchtest results. git status shows: On branch master Your branch is up-to-date with 'origin/master'. Changes not staged for commit: (use "git add <file>..." to update what will be committed) (use "git checkout -- <file>..." to discard changes in working directory) modified: benchtests/bench-strcasestr.c modified: benchtests/bench-strstr.c modified: locale/C-ctype.c modified: locale/categories.def modified: locale/langinfo.h modified: locale/programs/ld-ctype.c modified: string/memmem.c modified: string/strcasestr.c modified: string/strstr.c modified: string/test-strcasestr.c modified: sysdeps/powerpc/powerpc64/multiarch/Makefile modified: sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c Untracked files: (use "git add <file>..." to include in what will be committed) sysdeps/generic/string_vector.h sysdeps/generic/string_vector_search.h sysdeps/generic/string_vector_skeleton.h sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c sysdeps/powerpc/powerpc64/multiarch/strcasestr.c sysdeps/powerpc/powerpc64/multiarch/strstr-power7.S sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c sysdeps/powerpc/powerpc64/multiarch/strstr.c sysdeps/powerpc/powerpc64/power7/strcasestr.S sysdeps/powerpc/powerpc64/power7/strstr.S > its currently superseeded by generic vector bigraph search but it should > beat your strstr. > >>> >>> would be faster. Steven instead of doing his duty as powerpc maintainer >>> and checking result of benchtest that I send instead tried to bullshit >>> himself by telling that powerpc architecture is too special. Then he >>> asked me do do benchmark instead as stalling tactics. >>> >>> If he did his duty of reading benchmark results there is already >>> evidence. First to notice is there are regressions in original >>> benchmark, like following: >>> >>> Length 30/2, alignment 0/ 9, found: 20.7812 32.2656 13.125 19.7188 >>> Length 30/2, alignment 0/ 9, fail: 32.4688 33.9219 36.75 31.4688 >>> >> I agree there are few regression cases(79 out of 13442 cases) in >> 'proposed_benchresults.txt' attached in >> https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html. >> This is minimal when compared to the overall performance gain. >> > What overall performance gain? Did you try to find application and > measure performance difference before and after? Big sizes are not very > relevant in practice. You look behind 32'th byte of haystack only in > 1.5% of calls so 10% regression at these sizes is much worse that 30% > gain on 256 byte range. In my profiler a simple collection shows > following No I don't check on any specific application and I use the glibc benchtests for performance measurement which is considered to be the standard way of performance measurement before proposing a patch. > > > calls 13561 > success probability 3.0% > average needle size: 8.9 ns <= 0: 0.4% ns <= 4: 0.8% ns <= 8: 5.9% ns <= 16: 99.9% ns <= 24: 100.0% ns <= 32: 100.0% ns <= 48: 100.0% > ns <= 64: 100.0% > average n: 10.6 n <= 0: 0.4% n <= 4: 1.3% n <= 8: 1.3% n <= 16: 97.0% n <= 24: 97.8% n <= 32: 98.5% n <= 48: 99.9% n <= 64: 99.9% > s aligned to 4 bytes: 98.6% 8 bytes: 98.6% 16 bytes: 96.0% > >
On Thu, Jun 18, 2015 at 03:59:55PM +0530, Rajalakshmi Srinivasaraghavan wrote: > >>2048 number is chosen by checking various length inputs and > >>comparing the performance. So for 2047 case, there wont be worst > >>performance. > >> > >What are you talking about? I am saying that aaaa haystack with aaa...ab > > needle of size 2047 is worst case for your algorithm performance. > > > I ran the attached test.c in a loop 10000 times. > proposed patch took 19s and existing code took 20s on power machine. > How could you ran it? You have two problems with it 1. It doesn't compile, its missing final } to end main 2. It doesn't measure strstr at all. Its pure function so gcc optimized that away. As when I fixed these two issues just 100 iterations take 1.002s with my x64 vector implementation and 44.538s on generic implemenation your test couldn't take just 20s for existing code. > The reason I mentioned them as different is, in strstr() patch the > input is loaded as double word ie. 8 bytes at a time by byte and in > strcasestr its byte by byte load.The input is handled in different > way. > I see now that you wrote basically long i=0; n0 = tolower(n[0]) while (s[i]) { int j=1; if (tolower(s[i])==n0) while (n[j] && tolower(n[j]) == tolower(s[i+j])) j++; if (n[j]=='\0') return s+i; i++; } return NULL; For some reason I skimmed that patch and though that you did obvious optimization once you have vector strstr. You use vectorized tolower to convert 8 bytes at once. Example of conversion is in sysdeps/x86_64/multiarch/strcmp-sse42.S > > > >Second is about average case perforance. There worst case is irrelevant > >as you use buy-or-rent techniques to handle it. I sumbitted following > >patch with easy way to handle that. > > > >[PATCH] Improve generic strstr performance. > > > I applied your proposed patches and the benchtests and I could see > proposed patch(__strstr_power7) is better than existing code > (__strstr_ppc). > 13363 out of 13442 combinations shows improvement. > Attached benchtest results. > I said use that patch. Next line of my previous mail explains that. > >its currently superseeded by generic vector bigraph search but it should > >beat your strstr. So you should do also comparison with that. Your benchmark is invalid for simple reason, I guard vector patch with: if (__BYTE_ORDER == __LITTLE_ENDIAN) { So as powerpc is big endian my patch does nothing at all and you are just measuring generic strstr again. If you want test bigraph search to that on power64le. > >What overall performance gain? Did you try to find application and > >measure performance difference before and after? Big sizes are not very > >relevant in practice. You look behind 32'th byte of haystack only in > >1.5% of calls so 10% regression at these sizes is much worse that 30% > >gain on 256 byte range. In my profiler a simple collection shows > >following > No I don't check on any specific application and I use the glibc > benchtests for performance measurement which is considered to be the > standard way of performance measurement before proposing a patch. Then you have problem that you don't know which data are important and which are not. Proof by numbers is quite suspect. Input ranges are completely arbitrary. So I could decide that I want to check for sizes 1..64 all possible 64byte needle and haystack alignments. That generates 262144 test cases which should be enough.
On 06/18/2015 06:28 PM, Ondřej Bílka wrote: > On Thu, Jun 18, 2015 at 03:59:55PM +0530, Rajalakshmi Srinivasaraghavan wrote: >>>> 2048 number is chosen by checking various length inputs and >>>> comparing the performance. So for 2047 case, there wont be worst >>>> performance. >>>> >>> What are you talking about? I am saying that aaaa haystack with aaa...ab >>> needle of size 2047 is worst case for your algorithm performance. >>> >> I ran the attached test.c in a loop 10000 times. >> proposed patch took 19s and existing code took 20s on power machine. >> > How could you ran it? You have two problems with it > 1. It doesn't compile, its missing final } to end main > 2. It doesn't measure strstr at all. Its pure function so gcc optimized > that away. > > As when I fixed these two issues just 100 iterations take 1.002s with my > x64 vector implementation and 44.538s on generic implemenation your test > couldn't take just 20s for existing code. > > >> The reason I mentioned them as different is, in strstr() patch the >> input is loaded as double word ie. 8 bytes at a time by byte and in >> strcasestr its byte by byte load.The input is handled in different >> way. >> > > I see now that you wrote basically > > long i=0; > n0 = tolower(n[0]) > while (s[i]) > { > int j=1; > if (tolower(s[i])==n0) > while (n[j] && tolower(n[j]) == tolower(s[i+j])) > j++; > if (n[j]=='\0') > return s+i; > i++; > } > return NULL; > > For some reason I skimmed that patch and though that you did obvious > optimization once you have vector strstr. You use vectorized tolower to > convert 8 bytes at once. Example of conversion is in > sysdeps/x86_64/multiarch/strcmp-sse42.S > For strcasestr() optimization I am trying to use tolower to convert 8 bytes and will send the modified patch. For strstr() I have sent the new version https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html >>> >>> Second is about average case perforance. There worst case is irrelevant >>> as you use buy-or-rent techniques to handle it. I sumbitted following >>> patch with easy way to handle that. >>> >>> [PATCH] Improve generic strstr performance. >>> >> I applied your proposed patches and the benchtests and I could see >> proposed patch(__strstr_power7) is better than existing code >> (__strstr_ppc). >> 13363 out of 13442 combinations shows improvement. >> Attached benchtest results. >> > I said use that patch. Next line of my previous mail explains that. > >>> its currently superseeded by generic vector bigraph search but it should >>> beat your strstr. > > So you should do also comparison with that. > > > Your benchmark is invalid for simple reason, I guard vector patch with: > > if (__BYTE_ORDER == __LITTLE_ENDIAN) > { > Modified new version works better for both LE and BE. Results attached in https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html > So as powerpc is big endian my patch does nothing at all and you are > just measuring generic strstr again. If you want test bigraph search to > that on power64le. > Used your proposed code on powerpc64le and observed that quadratic worst case still takes more time with your proposed code. It took more than 2 minutes when the needle size is 100000 and haystack is of size 1000000. >>> What overall performance gain? Did you try to find application and >>> measure performance difference before and after? Big sizes are not very >>> relevant in practice. You look behind 32'th byte of haystack only in >>> 1.5% of calls so 10% regression at these sizes is much worse that 30% >>> gain on 256 byte range. In my profiler a simple collection shows >>> following >> No I don't check on any specific application and I use the glibc >> benchtests for performance measurement which is considered to be the >> standard way of performance measurement before proposing a patch. > > Then you have problem that you don't know which data are important and > which are not. > > Proof by numbers is quite suspect. Input ranges are completely > arbitrary. So I could decide that I want to check for sizes 1..64 all > possible 64byte needle and haystack alignments. That generates 262144 > test cases which should be enough. > >
On Wed, Jun 24, 2015 at 12:47:23PM +0530, Rajalakshmi Srinivasaraghavan wrote: > > > On 06/18/2015 06:28 PM, Ondřej Bílka wrote: > >On Thu, Jun 18, 2015 at 03:59:55PM +0530, Rajalakshmi Srinivasaraghavan wrote: > >>>>2048 number is chosen by checking various length inputs and > >>>>comparing the performance. So for 2047 case, there wont be worst > >>>>performance. > >>>> > >>>What are you talking about? I am saying that aaaa haystack with aaa...ab > >>> needle of size 2047 is worst case for your algorithm performance. > >>> > >>I ran the attached test.c in a loop 10000 times. > >>proposed patch took 19s and existing code took 20s on power machine. > >> > >How could you ran it? You have two problems with it > >1. It doesn't compile, its missing final } to end main > >2. It doesn't measure strstr at all. Its pure function so gcc optimized > >that away. > > > >As when I fixed these two issues just 100 iterations take 1.002s with my > >x64 vector implementation and 44.538s on generic implemenation your test > >couldn't take just 20s for existing code. > > > > > >>The reason I mentioned them as different is, in strstr() patch the > >>input is loaded as double word ie. 8 bytes at a time by byte and in > >>strcasestr its byte by byte load.The input is handled in different > >>way. > >> > > > >I see now that you wrote basically > > > >long i=0; > >n0 = tolower(n[0]) > >while (s[i]) > > { > > int j=1; > > if (tolower(s[i])==n0) > > while (n[j] && tolower(n[j]) == tolower(s[i+j])) > > j++; > > if (n[j]=='\0') > > return s+i; > > i++; > > } > >return NULL; > > > >For some reason I skimmed that patch and though that you did obvious > >optimization once you have vector strstr. You use vectorized tolower to > >convert 8 bytes at once. Example of conversion is in > > sysdeps/x86_64/multiarch/strcmp-sse42.S > > > For strcasestr() optimization I am trying to use tolower to convert > 8 bytes and will send the modified patch. > For strstr() I have sent the new version > https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html Doing tolower is still bit slow. You should improve strspn. Then you only need to convert first byte of needle and call strspn. > >>> > >>>Second is about average case perforance. There worst case is irrelevant > >>>as you use buy-or-rent techniques to handle it. I sumbitted following > >>>patch with easy way to handle that. > >>> > >>>[PATCH] Improve generic strstr performance. > >>> > >>I applied your proposed patches and the benchtests and I could see > >>proposed patch(__strstr_power7) is better than existing code > >>(__strstr_ppc). > >>13363 out of 13442 combinations shows improvement. > >>Attached benchtest results. > >> > >I said use that patch. Next line of my previous mail explains that. > > > >>>its currently superseeded by generic vector bigraph search but it should > >>>beat your strstr. > > > >So you should do also comparison with that. > > > > > >Your benchmark is invalid for simple reason, I guard vector patch with: > > > >if (__BYTE_ORDER == __LITTLE_ENDIAN) > > { > > > Modified new version works better for both LE and BE. Results > attached in > https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html > Yes I was saying all time that you should use that. > >So as powerpc is big endian my patch does nothing at all and you are > >just measuring generic strstr again. If you want test bigraph search to > >that on power64le. > > > Used your proposed code on powerpc64le and observed that quadratic > worst case still takes more time with your proposed code. > It took more than 2 minutes when the needle size is 100000 and > haystack is of size 1000000. > You still didn't answer how big degradation has your implementation. For mine its a typo disabled quadratic detection at all. All that needs to change would be rent += i + 10; into *rent += i + 10; > >>>What overall performance gain? Did you try to find application and > >>>measure performance difference before and after? Big sizes are not very > >>>relevant in practice. You look behind 32'th byte of haystack only in > >>>1.5% of calls so 10% regression at these sizes is much worse that 30% > >>>gain on 256 byte range. In my profiler a simple collection shows > >>>following > >>No I don't check on any specific application and I use the glibc > >>benchtests for performance measurement which is considered to be the > >>standard way of performance measurement before proposing a patch. > > > >Then you have problem that you don't know which data are important and > >which are not. > > > >Proof by numbers is quite suspect. Input ranges are completely > >arbitrary. So I could decide that I want to check for sizes 1..64 all > >possible 64byte needle and haystack alignments. That generates 262144 > >test cases which should be enough. > > > > > > -- > Thanks > Rajalakshmi S
diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile index 17265bd..06b2c67 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile @@ -19,7 +19,7 @@ sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \ strcmp-power8 strcmp-power7 strcmp-ppc64 \ strcat-power8 strcat-power7 strcat-ppc64 \ memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \ - strncpy-power8 + strncpy-power8 strcasestr-power7 strcasestr-ppc64 \ CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c index f5fdea5..0fd2bd2 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c @@ -322,5 +322,12 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, IFUNC_IMPL_ADD (array, i, strcat, 1, __strcat_ppc)) + /* Support sysdeps/powerpc/powerpc64/multiarch/strcasestr.c. */ + IFUNC_IMPL (i, name, strcasestr, + IFUNC_IMPL_ADD (array, i, strcasestr, + hwcap & PPC_FEATURE_HAS_VSX, + __strcasestr_power7) + IFUNC_IMPL_ADD (array, i, strcasestr, 1, + __strcasestr_ppc)) return i; } diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S b/sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S new file mode 100644 index 0000000..e13f575 --- /dev/null +++ b/sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S @@ -0,0 +1,43 @@ +/* Optimized strcasestr implementation for POWER7. + Copyright (C) 2015 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <sysdep.h> + +#undef EALIGN +#define EALIGN(name, alignt, words) \ + .section ".text"; \ + ENTRY_2(__strcasestr_power7) \ + .align ALIGNARG(alignt); \ + EALIGN_W_##words; \ + BODY_LABEL(__strcasestr_power7): \ + cfi_startproc; \ + LOCALENTRY(__strcasestr_power7) + +#undef END +#define END(name) \ + cfi_endproc; \ + TRACEBACK(__strcasestr_power7) \ + END_2(__strcasestr_power7) + +#undef libc_hidden_builtin_def +#define libc_hidden_builtin_def(name) + +#define STRLEN __strlen_power7 +#define STRNLEN __strnlen_power7 + +#include <sysdeps/powerpc/powerpc64/power7/strcasestr.S> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c b/sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c new file mode 100644 index 0000000..614c7bf --- /dev/null +++ b/sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c @@ -0,0 +1,34 @@ +/* PowerPC64 default implementation of strcasestr. + Copyright (C) 2015 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <string.h> + +#define STRCASESTR __strcasestr_ppc +#if IS_IN (libc) && defined(SHARED) +# undef libc_hidden_builtin_def +# define libc_hidden_builtin_def(name) \ + __hidden_ver1(__strstr_ppc, __GI_strstr, __strstr_ppc); +#endif + + +#undef weak_alias +#define weak_alias(a,b ) + +extern __typeof (strcasestr) __strcasestr_ppc attribute_hidden; + +#include <string/strcasestr.c> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcasestr.c b/sysdeps/powerpc/powerpc64/multiarch/strcasestr.c new file mode 100644 index 0000000..6564314 --- /dev/null +++ b/sysdeps/powerpc/powerpc64/multiarch/strcasestr.c @@ -0,0 +1,37 @@ +/* Multiple versions of strcasestr. + Copyright (C) 2015 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#if IS_IN (libc) +# include <string.h> +# include <shlib-compat.h> +# include "init-arch.h" + +extern __typeof (__strcasestr) __strcasestr_ppc attribute_hidden; +extern __typeof (__strcasestr) __strcasestr_power7 attribute_hidden; + +/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle + ifunc symbol properly. */ +libc_ifunc (__strcasestr, + (hwcap & PPC_FEATURE_HAS_VSX) + ? __strcasestr_power7 + : __strcasestr_ppc); + +weak_alias (__strcasestr, strcasestr) +#else +#include <string/strcasestr.c> +#endif diff --git a/sysdeps/powerpc/powerpc64/power7/strcasestr.S b/sysdeps/powerpc/powerpc64/power7/strcasestr.S new file mode 100644 index 0000000..521eadb --- /dev/null +++ b/sysdeps/powerpc/powerpc64/power7/strcasestr.S @@ -0,0 +1,177 @@ +/* Optimized strcasestr implementation for PowerPC64. + Copyright (C) 2015 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <http://www.gnu.org/licenses/>. */ + +#include <sysdep.h> +#include <locale-defines.h> + +#ifndef STRLEN +/* For builds with no IFUNC support, local calls should be made to internal + GLIBC symbol (created by libc_hidden_builtin_def). */ +# ifdef SHARED +# define STRLEN __GI_strlen +# else +# define STRLEN strlen +# endif +#endif + +#ifndef STRNLEN +/* For builds with no IFUNC support, local calls should be made to internal + GLIBC symbol (created by libc_hidden_builtin_def). */ +# ifdef SHARED +# define STRNLEN __GI_strnlen +# else +# define STRNLEN strnlen +# endif +#endif + +#undef strcasestr +#undef __strcasestr + +#ifndef STRCASESTR +#define STRCASESTR __strcasestr +#endif + +/* char * [r3] strcasestr (char *s [r3], char * pat[r4]) */ + +/* +* Load byte from input string and search substring and convert +* each character to lower case character and then compare both. +* If they are same, load byte from both r3 and r4 and proceed, +* Else, load next byte from r3 and compare with current r4. +*/ + +#define FRAMESIZE (FRAME_MIN_SIZE+32) + .machine power7 +EALIGN (STRCASESTR, 4, 0) + CALL_MCOUNT 2 + mflr r0 /* Load link register LR to r0. */ + std r31, -8(r1) /* Save callers register r31. */ + cfi_offset(r31, -8) + std r30, -16(r1) /* Save callers register r30. */ + cfi_offset(r30, -16) + std r29, -24(r1) /* Save callers register r29. */ + cfi_offset(r29, -24) + std r0, 16(r1) /* Store the link register. */ + cfi_offset(lr, 16) + stdu r1, -FRAMESIZE(r1) /* Create the stack frame. */ + cfi_adjust_cfa_offset(FRAMESIZE) + + dcbt 0, r3 + dcbt 0, r4 + + and r0, r3, r4 + cmpdi cr7, r0, 0 + beq cr7, L(retnull) + + mr r29, r3 + mr r30, r4 + mr r3, r4 + bl STRLEN + nop + + /* Call __strcasestr_ppc if needle len > 2048. */ + cmpdi cr7, r3, 2048 + bgt cr7, L(default) + + cmpdi cr7, r3, 0 /* If search str is null. */ + beq cr7, L(ret_r3) + mr r31, r3 + mr r4, r3 + mr r3, r29 + bl STRNLEN + nop + + cmpd cr7, r3, r31 /* If len(r3) < len(r4). */ + blt cr7, L(retnull) + + mr r3, r29 + mr r8, r3 /* Save r3. */ + addi r8, r8, -1 + ld r10, __libc_tsd_LOCALE@got@tprel(r2) + add r11, r10, __libc_tsd_LOCALE@tls + ld r11, 0(r11) + ld r11, LOCALE_CTYPE_TOLOWER(r11) + + mr r4, r30 + lbz r6, 0(r4) /* Load next byte from r4. */ + cmpdi cr7, r6, 0 /* Is it null? */ + beq cr7, L(updater3) + sldi r7, r6, 2 /* Convert to lower case. */ + lwzx r7, r11, r7 + mr r12, r7 /* Save it for next loop. */ +L(loop1): + addi r8, r8, 1 + mr r3, r8 /* Restore r3. */ + mr r4, r30 /* Restore r4. */ + mr r7, r12 +L(loop): + lbz r5, 0(r3) /* Load byte from r3. */ + cmpdi cr7, r5, 0 /* Is it null? */ + beq cr7, L(retnull) /* If yes, return. */ + sldi r10, r5, 2 /* Convert to lower case. */ + lwzx r10, r11, r10 + cmpw cr7, r7, r10 /* Compare with byte from r4. */ + bne cr7, L(loop1) + addi r3, r3, 1 /* Increment r3. */ + addi r4, r4, 1 /* Increment r4. */ + lbz r6, 0(r4) /* Load next byte from r4. */ + cmpdi cr7, r6, 0 /* Is it null? */ + beq cr7, L(updater3) + sldi r7, r6, 2 /* Convert to lower case. */ + lwzx r7, r11, r7 + b L(loop) + + /* Handling return values. */ + .align 4 +L(updater3): + subf r3, r31, r3 /* Reduce len of r4 from r3. */ + b L(end) + + .align 4 +L(ret_r3): + mr r3, r29 /* Return r3. */ + b L(end) + + .align 4 +L(retnull): + li r3, 0 /* Return NULL. */ + b L(end) + + .align 4 +L(default): + mr r3, r29 + mr r4, r30 + bl __strcasestr_ppc + nop + + .align 4 +L(end): + addi r1, r1, FRAMESIZE /* Restore stack pointer. */ + cfi_adjust_cfa_offset(-FRAMESIZE) + ld r0, 16(r1) /* Restore the saved link register. */ + ld r29, -24(r1) /* Restore callers save register r29. */ + ld r30, -16(r1) /* Restore callers save register r30. */ + ld r31, -8(r1) /* Restore callers save register r31. */ + mtlr r0 /* Branch to link register. */ + blr +END (STRCASESTR) +#ifndef NO_ALIAS +weak_alias (__strcasestr, strcasestr) +#endif + +libc_hidden_builtin_def (strcasestr)