From patchwork Fri Apr 17 13:24:54 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Leonhard Holz <leonhard.holz@web.de>
X-Patchwork-Id: 462048
Return-Path: 
 <libc-alpha-return-58483-incoming=patchwork.ozlabs.org@sourceware.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 6E4A51400B7
	for <incoming@patchwork.ozlabs.org>;
	Fri, 17 Apr 2015 23:25:08 +1000 (AEST)
Authentication-Results: ozlabs.org; dkim=pass
	reason="1024-bit key; unprotected key"
	header.d=sourceware.org header.i=@sourceware.org header.b=dlLt/xgS;
	dkim-adsp=none (unprotected policy); dkim-atps=neutral
DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:message-id:date:from:mime-version:to:subject
	:content-type:content-transfer-encoding; q=dns; s=default; b=sRl
	3bmWjS+ewqhAmTm/9emc9/Na/avUHudmH2tNcZRxgBkDKfwWASxdAWe0yR8V2Lfv
	4dqRIX/6bg0cDDLZ+wOMZ9x1Fn5Ar+GeR8JkTzGag5ifhhCEa8py2084xF1z28t4
	XZ6ivWbW9Tz4OcoNAnEOYsbAL83BYMVMZU6aRopw=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:message-id:date:from:mime-version:to:subject
	:content-type:content-transfer-encoding; s=default; bh=HJTOMf9NY
	7lmhQJ6qO3N6oFDt50=; b=dlLt/xgScK+SBJPyF7xyvTC+lAq32vt26mnElcaVF
	+R05re23Q+iP+ThkQ8UScO6O07Y4GC11QixssEIRwUyJXJqzaPc9UkD8eeOfrgrt
	jR/dIjd09lTOA7Dg+bqnQkSADnSGWXqGyjPFjLzVQQVj0xx6sCk1o/NFYCz2PHzA
	HA=
Received: (qmail 22455 invoked by alias); 17 Apr 2015 13:25:00 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: 
 <mailto:libc-alpha-unsubscribe-incoming=patchwork.ozlabs.org@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 22434 invoked by uid 89); 17 Apr 2015 13:25:00 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.1 required=5.0 tests=AWL, BAYES_00,
	FREEMAIL_FROM, RCVD_IN_DNSWL_NONE, SPF_PASS,
	T_RP_MATCHES_RCVD autolearn=ham version=3.3.2
X-HELO: mout.web.de
Message-ID: <553109A6.9070809@web.de>
Date: Fri, 17 Apr 2015 15:24:54 +0200
From: Leonhard Holz <leonhard.holz@web.de>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
	rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: libc-alpha@sourceware.org
Subject: [PATCH v3] speedup strcoll by using strdiff
X-UI-Out-Filterresults: notjunk:1;

This patch uses strdiff to skip over equal prefixes in strcoll. With this patch
the time for collating a filelist drops to twice the time of byte-comparison. It
is implemented for 8-bit character sets and UTF-8. The new attribute
_NL_COLLATE_ENCODING_TYPE is added to the locale data to state the charset
encoding. I opt for commiting this and do the factoring out of strdiff in a
follow-up patch. The tests show no problems although one needs to remove the old
locale data manually.

				change		current		patched
filelist#C			+1.49%		23,724,500	24,078,400
filelist#en_US.UTF-8		-43.06%		95,092,000	54,150,100
lorem_ipsum#vi_VN.UTF-8		-0.29%		1,995,540	1,989,810
lorem_ipsum#en_US.UTF-8		-15.96%		2,081,250	1,749,150
lorem_ipsum#ar_SA.UTF-8		-24.48%		2,618,660	1,977,630
lorem_ipsum#zh_CN.UTF-8		+15.03%		793,925		913,224
lorem_ipsum#cs_CZ.UTF-8		-10.68%		2,749,650	2,455,860
lorem_ipsum#en_GB.UTF-8		-6.12%		2,451,930	2,301,790
lorem_ipsum#da_DK.UTF-8		-18.40%		2,128,290	1,736,710
lorem_ipsum#pl_PL.UTF-8		-7.76%		2,005,750	1,850,190
lorem_ipsum#fr_FR.UTF-8		-3.95%		2,799,710	2,689,040
lorem_ipsum#pt_PT.UTF-8		-8.94%		2,803,570	2,552,890
lorem_ipsum#el_GR.UTF-8		-26.48%		4,063,660	2,987,760
lorem_ipsum#ru_RU.UTF-8		-14.48%		2,908,900	2,487,730
lorem_ipsum#iw_IL.UTF-8		-36.47%		2,707,980	1,720,460
lorem_ipsum#es_ES.UTF-8		-11.02%		2,539,390	2,259,460
lorem_ipsum#hi_IN.UTF-8		-61.35%		231,126,000	89,330,100
lorem_ipsum#sv_SE.UTF-8		-12.04%		1,971,310	1,733,890
lorem_ipsum#hu_HU.UTF-8		-14.46%		3,708,090	3,172,060
lorem_ipsum#tr_TR.UTF-8		-9.95%		2,409,160	2,169,550
lorem_ipsum#is_IS.UTF-8		-14.22%		2,032,850	1,743,860
lorem_ipsum#it_IT.UTF-8		-10.03%		2,696,880	2,426,500
lorem_ipsum#sr_RS.UTF-8		-11.96%		2,360,780	2,078,510
lorem_ipsum#ja_JP.UTF-8		-3.45%		1,141,530	1,102,100


	* locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE.
	* locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list.
	* locale/localeinfo.h: Add enum collation_encoding_type.
	* locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit.
	* programs/ld-collate.c (collate_output): Add encoding type info.
	* string/strcoll_l.c (STRDIFF): New function.
	* (STRCOLL): Use STRDIFF to skip over equal prefix.
	* wcsmbs/wcscoll_l.c: Define STRDIFF.

diff --git a/locale/C-collate.c b/locale/C-collate.c
index 06dfdfa..d7f3c55 100644
--- a/locale/C-collate.c
+++ b/locale/C-collate.c
@@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
     /* _NL_COLLATE_COLLSEQWC */
     { .string = (const char *) collseqwc },
     /* _NL_COLLATE_CODESET */
-    { .string = _nl_C_codeset }
+    { .string = _nl_C_codeset },
+    /* _NL_COLLATE_ENCODING_TYPE */
+    { .word = __cet_8bit }
   }
 };
diff --git a/locale/categories.def b/locale/categories.def
index a8dda53..45d644e 100644
--- a/locale/categories.def
+++ b/locale/categories.def
@@ -58,6 +58,7 @@ DEFINE_CATEGORY
   DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB,        "collate-collseqmb",        std,
wstring)
   DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC,        "collate-collseqwc",        std,
wstring)
   DEFINE_ELEMENT (_NL_COLLATE_CODESET,		"collate-codeset",	    std, string)
+  DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE,	"collate-encoding-type",    std, word)
   ), NO_POSTLOAD)


diff --git a/locale/langinfo.h b/locale/langinfo.h
index a565d9d..ffc5c7f 100644
--- a/locale/langinfo.h
+++ b/locale/langinfo.h
@@ -255,6 +255,7 @@ enum
   _NL_COLLATE_COLLSEQMB,
   _NL_COLLATE_COLLSEQWC,
   _NL_COLLATE_CODESET,
+  _NL_COLLATE_ENCODING_TYPE,
   _NL_NUM_LC_COLLATE,

   /* LC_CTYPE category: character classification.
diff --git a/locale/localeinfo.h b/locale/localeinfo.h
index 1d2ee00..bdab9fe 100644
--- a/locale/localeinfo.h
+++ b/locale/localeinfo.h
@@ -110,6 +110,14 @@ enum coll_sort_rule
   sort_mask
 };

+/* Collation encoding type.  */
+enum collation_encoding_type
+{
+  __cet_other,
+  __cet_8bit,
+  __cet_utf8
+};
+
 /* We can map the types of the entries into a few categories.  */
 enum value_type
 {
diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
index dc0fe30..a39a94f 100644
--- a/locale/programs/ld-collate.c
+++ b/locale/programs/ld-collate.c
@@ -32,6 +32,7 @@
 #include "linereader.h"
 #include "locfile.h"
 #include "elem-hash.h"
+#include "../localeinfo.h"

 /* Uncomment the following line in the production version.  */
 /* #define NDEBUG 1 */
@@ -2130,6 +2131,8 @@ collate_output (struct localedef_t *locale, const struct
charmap_t *charmap,
 	  /* The words have to be handled specially.  */
 	  if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB))
 	    add_locale_uint32 (&file, 0);
+	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE))
+	    add_locale_uint32 (&file, __cet_other);
 	  else
 	    add_locale_empty (&file);
 	}
@@ -2493,6 +2496,12 @@ collate_output (struct localedef_t *locale, const struct
charmap_t *charmap,
   add_locale_raw_data (&file, collate->mbseqorder, 256);
   add_locale_collseq_table (&file, &collate->wcseqorder);
   add_locale_string (&file, charmap->code_set_name);
+  if (strcmp (charmap->code_set_name, "UTF-8") == 0)
+    add_locale_uint32 (&file, __cet_utf8);
+  else if (charmap->mb_cur_max == 1)
+    add_locale_uint32 (&file, __cet_8bit);
+  else
+    add_locale_uint32 (&file, __cet_other);
   write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file);

   obstack_free (&weightpool, NULL);
diff --git a/string/strcoll_l.c b/string/strcoll_l.c
index 658d5b9..0fa005f 100644
--- a/string/strcoll_l.c
+++ b/string/strcoll_l.c
@@ -29,6 +29,7 @@
 # define STRING_TYPE char
 # define USTRING_TYPE unsigned char
 # define STRCOLL __strcoll_l
+# define STRDIFF __strdiff
 # define STRCMP strcmp
 # define WEIGHT_H "../locale/weight.h"
 # define SUFFIX	MB
@@ -41,6 +42,20 @@
 #include "../locale/localeinfo.h"
 #include WEIGHT_H

+#define MASK_UTF8_7BIT  (1 << 7)
+#define MASK_UTF8_START (3 << 6)
+
+size_t
+STRDIFF (const STRING_TYPE *s, const STRING_TYPE *t)
+{
+  size_t n;
+
+  for (n = 0; *s != '\0' && *s++ == *t++; ++n)
+    continue;
+
+  return n;
+}
+
 /* Track status while looking for sequences in a string.  */
 typedef struct
 {
@@ -255,9 +270,29 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2,
__locale_t l)
   const USTRING_TYPE *extra;
   const int32_t *indirect;

+  /* In case there is no locale specific sort order (C / POSIX).  */
   if (nrules == 0)
     return STRCMP (s1, s2);

+  /* Fast forward to the position of the first difference.  Needs to be
+     encoding aware as the byte-by-byte comparison can stop in the middle
+     of a char sequence for multibyte encodings like UTF-8.  */
+  uint_fast32_t encoding =
+    current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
+  if (encoding != __cet_other)
+    {
+      size_t diff = STRDIFF (s1, s2);
+      if (diff > 0)
+	{
+	  if (encoding == __cet_utf8 && (*(s1 + diff) & MASK_UTF8_7BIT) != 0)
+	    do
+	      diff--;
+	    while (diff > 0 && (*(s1 + diff) & MASK_UTF8_START) != MASK_UTF8_START);
+	  s1 += diff;
+	  s2 += diff;
+	}
+    }
+
   /* Catch empty strings.  */
   if (__glibc_unlikely (*s1 == '\0') || __glibc_unlikely (*s2 == '\0'))
     return (*s1 != '\0') - (*s2 != '\0');
@@ -321,7 +356,8 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2,
__locale_t l)
 		     byte-level comparison to ensure that we don't waste time
 		     going through multiple passes for totally equal strings
 		     before proceeding to subsequent passes.  */
-		  if (pass == 0 && STRCMP (s1, s2) == 0)
+		  if (pass == 0 && encoding == __cet_other &&
+		      STRCMP (s1, s2) == 0)
 		    return result;
 		  else
 		    break;
diff --git a/wcsmbs/wcscoll_l.c b/wcsmbs/wcscoll_l.c
index 106ec93..9f60cee 100644
--- a/wcsmbs/wcscoll_l.c
+++ b/wcsmbs/wcscoll_l.c
@@ -23,6 +23,7 @@
 #define STRING_TYPE wchar_t
 #define USTRING_TYPE wint_t
 #define STRCOLL __wcscoll_l
+#define STRDIFF __wcsdiff
 #define STRCMP wcscmp
 #define WEIGHT_H "../locale/weightwc.h"
 #define SUFFIX	WC