diff mbox

[COMMITTED] Bug 18589: Revert strcoll optimization using strdiff.

Message ID 5616D304.8010909@redhat.com
State New
Headers show

Commit Message

Carlos O'Donell Oct. 8, 2015, 8:33 p.m. UTC
The optimization introduced in commit
f13c2a8dff2329c6692a80176262ceaaf8a6f74e, causes regressions in
sorting for languages that have digraphs that change sort order, like
cs_CZ which sorts ch between h and i.

My analysis shows the fast-forwarding optimization in STRCOLL advances
through a digraph while possibly stopping in the middle which results
in a subsequent skipping of the digraph and incorrect sorting. The
optimization is incorrect as implemented and because of that I'm
removing it for 2.23, and I will also commit this fix for 2.22 where
it was originally introduced.

This patch reverts the optimization, introduces a new bug-strcoll2.c
regression test that tests both cs_CZ.UTF-8 and da_DK.ISO-8859-1 and
ensures they sort one digraph each correctly. The optimization can't be
applied without regressing this test.

Checked on x86_64, bug-strcoll2.c fails without this patch and passes
after.

Checked in for 2.23.

2015-10-08  Carlos O'Donell  <carlos@redhat.com>

        [BZ #18589]     
        * string/bug-strcoll2.c: New file.
        * locale/categories.def: Revert commit
        f13c2a8dff2329c6692a80176262ceaaf8a6f74e.
        * locale/langinfo.h: Likewise.
        * locale/localeinfo.h: Likewise.
        * locale/C-collate.c: Likewise.
        * programs/ld-collate.c (collate_output): Likewise.
        * string/strcoll_l.c (STRDIFF): Likewise.
        (STRCOLL): Likewise.
        * wcsmbs/wcscoll_l.c: Likewise.

---

Cheers,
Carlos.

Comments

Carlos O'Donell Oct. 8, 2015, 8:50 p.m. UTC | #1
On 10/08/2015 04:33 PM, Carlos O'Donell wrote:
> The optimization introduced in commit
> f13c2a8dff2329c6692a80176262ceaaf8a6f74e, causes regressions in
> sorting for languages that have digraphs that change sort order, like
> cs_CZ which sorts ch between h and i.
> 
> My analysis shows the fast-forwarding optimization in STRCOLL advances
> through a digraph while possibly stopping in the middle which results
> in a subsequent skipping of the digraph and incorrect sorting. The
> optimization is incorrect as implemented and because of that I'm
> removing it for 2.23, and I will also commit this fix for 2.22 where
> it was originally introduced.
> 
> This patch reverts the optimization, introduces a new bug-strcoll2.c
> regression test that tests both cs_CZ.UTF-8 and da_DK.ISO-8859-1 and
> ensures they sort one digraph each correctly. The optimization can't be
> applied without regressing this test.
> 
> Checked on x86_64, bug-strcoll2.c fails without this patch and passes
> after.
> 
> Checked in for 2.23.

Forgot to add string/Makefile. Checking that in now.

c.
Carlos O'Donell Oct. 8, 2015, 8:56 p.m. UTC | #2
On 10/08/2015 04:50 PM, Carlos O'Donell wrote:
> On 10/08/2015 04:33 PM, Carlos O'Donell wrote:
>> The optimization introduced in commit
>> f13c2a8dff2329c6692a80176262ceaaf8a6f74e, causes regressions in
>> sorting for languages that have digraphs that change sort order, like
>> cs_CZ which sorts ch between h and i.
>>
>> My analysis shows the fast-forwarding optimization in STRCOLL advances
>> through a digraph while possibly stopping in the middle which results
>> in a subsequent skipping of the digraph and incorrect sorting. The
>> optimization is incorrect as implemented and because of that I'm
>> removing it for 2.23, and I will also commit this fix for 2.22 where
>> it was originally introduced.
>>
>> This patch reverts the optimization, introduces a new bug-strcoll2.c
>> regression test that tests both cs_CZ.UTF-8 and da_DK.ISO-8859-1 and
>> ensures they sort one digraph each correctly. The optimization can't be
>> applied without regressing this test.
>>
>> Checked on x86_64, bug-strcoll2.c fails without this patch and passes
>> after.
>>
>> Checked in for 2.23.
> 
> Forgot to add string/Makefile. Checking that in now.
> 
> c.
> 

Pushed.

commit 233127a79e74c1490cae021877c0213337893dcf
Author: Carlos O'Donell <carlos@systemhalted.org>
Date:   Thu Oct 8 16:54:30 2015 -0400

    strcoll: Add bug-strcoll2 to testsuite.
    
    Adds bug-strcoll2 to the string tests, along with the
    generation of required locales.

Cheers,
Carlos.
Leonhard Holz Oct. 13, 2015, 5:41 a.m. UTC | #3
Sorry for coming up late with that, but the UTF-8 detection part is needed in the patch for #18441. How can I handle
this correctly?

Maybe it is also possible to fix the STRDIFF-patch.

Leonhard

Am 08.10.2015 um 22:33 schrieb Carlos O'Donell:
> The optimization introduced in commit
> f13c2a8dff2329c6692a80176262ceaaf8a6f74e, causes regressions in
> sorting for languages that have digraphs that change sort order, like
> cs_CZ which sorts ch between h and i.
> 
> My analysis shows the fast-forwarding optimization in STRCOLL advances
> through a digraph while possibly stopping in the middle which results
> in a subsequent skipping of the digraph and incorrect sorting. The
> optimization is incorrect as implemented and because of that I'm
> removing it for 2.23, and I will also commit this fix for 2.22 where
> it was originally introduced.
> 
> This patch reverts the optimization, introduces a new bug-strcoll2.c
> regression test that tests both cs_CZ.UTF-8 and da_DK.ISO-8859-1 and
> ensures they sort one digraph each correctly. The optimization can't be
> applied without regressing this test.
> 
> Checked on x86_64, bug-strcoll2.c fails without this patch and passes
> after.
> 
> Checked in for 2.23.
> 
> 2015-10-08  Carlos O'Donell  <carlos@redhat.com>
> 
>         [BZ #18589]     
>         * string/bug-strcoll2.c: New file.
>         * locale/categories.def: Revert commit
>         f13c2a8dff2329c6692a80176262ceaaf8a6f74e.
>         * locale/langinfo.h: Likewise.
>         * locale/localeinfo.h: Likewise.
>         * locale/C-collate.c: Likewise.
>         * programs/ld-collate.c (collate_output): Likewise.
>         * string/strcoll_l.c (STRDIFF): Likewise.
>         (STRCOLL): Likewise.
>         * wcsmbs/wcscoll_l.c: Likewise.
>
Carlos O'Donell Oct. 13, 2015, 1:18 p.m. UTC | #4
On 10/13/2015 01:41 AM, Leonhard Holz wrote:
> Sorry for coming up late with that, but the UTF-8 detection part is
> needed in the patch for #18441. How can I handle this correctly?
> 
> Maybe it is also possible to fix the STRDIFF-patch.

To fix the STRDIFF patch you would have to advance by units of
comparison including advancing by entire digraphs. It would slow
down the fast-forward pass, and you would have to see if it makes
it feasible to use.

Cheers,
Carlos.
diff mbox

Patch

diff --git a/string/bug-strcoll2.c b/string/bug-strcoll2.c
new file mode 100644
index 0000000..5ce2f94
--- /dev/null
+++ b/string/bug-strcoll2.c
@@ -0,0 +1,93 @@ 
+/* Bug 18589: sort-test.sh fails at random.
+   Copyright (C) 1998-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper <drepper@cygnus.com>, 1998.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdio.h>
+#include <string.h>
+#include <locale.h>
+
+/* An incorrect strcoll optimization resulted in incorrect
+   results from strcoll for cs_CZ and da_DK.  */
+
+int
+test_cs_CZ (void)
+{
+  const char t1[] = "config";
+  const char t2[] = "choose";
+  if (setlocale (LC_ALL, "cs_CZ.UTF-8") == NULL)
+    {
+      perror ("setlocale");
+      return 1;
+    }
+  /* In Czech the digraph ch sorts after c, therefore we expect
+     config to sort before choose.  */
+  int a = strcoll (t1, t2);
+  int b = strcoll (t2, t1);
+  printf ("strcoll (\"%s\", \"%s\") = %d\n", t1, t2, a);
+  printf ("strcoll (\"%s\", \"%s\") = %d\n", t2, t1, b);
+  if (a < 0 && b > 0)
+    {
+      puts ("PASS: config < choose");
+      return 0;
+    }
+  else
+    {
+      puts ("FAIL: Wrong sorting in cz_CZ.UTF-8.");
+      return 1;
+    }
+}
+
+int
+test_da_DK (void)
+{
+  const char t1[] = "AS";
+  const char t2[] = "AA";
+  if (setlocale (LC_ALL, "da_DK.ISO-8859-1") == NULL)
+    {
+      perror ("setlocale");
+      return 1;
+    }
+  /* AA should be treated as the last letter of the Danish alphabet,
+     hence sorting after AS.  */
+  int a = strcoll (t1, t2);
+  int b = strcoll (t2, t1);
+  printf ("strcoll (\"%s\", \"%s\") = %d\n", t1, t2, a);
+  printf ("strcoll (\"%s\", \"%s\") = %d\n", t2, t1, b);
+  if (a < 0 && b > 0)
+    {
+      puts ("PASS: AS < AA");
+      return 0;
+    }
+  else
+    {
+      puts ("FAIL: Wrong sorting in da_DK.ISO-8859-1");
+      return 1;
+    }
+}
+
+static int
+do_test (void)
+{
+  int err = 0;
+  err |= test_cs_CZ ();
+  err |= test_da_DK ();
+  return err;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"