Message ID | 56D3F8F0.8070401@web.de |
---|---|
State | New |
Headers | show |
On 02/29/2016 02:53 AM, Leonhard Holz wrote: > In BZ #18441 sorting a thai text with the en_US.UTF-8 locale causes a performance > regression. The cause of the problem is that I think we need a V5 with a few more changes as noted below: - More comments. - Change signature of utf8index. - Probably one more define for magic 255 constant. > a) en_US.UTF-8 has no informations for thai chars and so always reports a zero > sort weight which causes the comparison to check the whole string instead of > breaking up early and OK. > b) the sequence-to-weight list is partitioned by the first byte of the first > character (TABLEMB); this generates long lists for multibyte UTF-8 characters as > they tend to have an equal starting byte (e.g. all thai chars start with E0). Agreed. > The approach of the patch is to interprete TABLEMB as a hashtable and find a > better hash key. My first try was to somehow "fold" a multibyte character into one > byte but that worsened the overall performance a lot. Enhancing the table to 2 > byte keys works much better while needing a reasonable amount of extra memory. How much memory on average? Worst case is all possible 2-byte hash values? e.g. 256*256*(sizeof(struct element_t)) == ~10MB? Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB? > The patch vastly improves the performance of languages with multibyte chars (see > zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars > get a bit slower because of the extra check for the first byte while finding the right > sequence in the sequence list . It cannot be avoided since the hash key is not > longer equal to the first byte of the sequence. Tests are ok. Can we use UTF-8-specific knowledge to accelerate the lookup? For example, you know that E0 is always the start of a 3-byte UTF-8 sequence. Could you do two checks? (a) Are we in a 1, 2, 3, 4, 5, or 6 bytes sequence? (b) If in a 1 byte sequence use a one-byte table. (c) If in a 2-6 byte sequence use the hash-table? > filelist#C 1.75% 23,396,200 23,805,700 > filelist#en_US.UTF-8 1.42% 77,186,200 78,285,200 > lorem_ipsum#vi_VN.UTF-8 -1.70% 1,680,740 1,652,110 > lorem_ipsum#ar_SA.UTF-8 -7.71% 2,134,780 1,970,170 > lorem_ipsum#en_US.UTF-8 2.61% 1,685,120 1,729,160 > lorem_ipsum#zh_CN.UTF-8 -88.66% 806,176 91,423 > lorem_ipsum#cs_CZ.UTF-8 -4.89% 2,150,120 2,045,030 > lorem_ipsum#en_GB.UTF-8 -1.47% 2,061,960 2,031,620 > lorem_ipsum#da_DK.UTF-8 3.15% 1,703,710 1,757,390 > lorem_ipsum#pl_PL.UTF-8 0.86% 1,634,890 1,648,870 > lorem_ipsum#fr_FR.UTF-8 -2.06% 2,232,030 2,186,030 > lorem_ipsum#pt_PT.UTF-8 -2.60% 2,238,410 2,180,210 > lorem_ipsum#el_GR.UTF-8 -34.52% 3,413,330 2,235,010 > lorem_ipsum#ru_RU.UTF-8 -9.88% 2,403,370 2,165,950 > lorem_ipsum#iw_IL.UTF-8 -9.56% 2,209,740 1,998,500 > lorem_ipsum#es_ES.UTF-8 4.92% 1,983,470 2,081,050 > lorem_ipsum#hi_IN.UTF-8 -98.88% 220,453,000 2,458,620 > lorem_ipsum#sv_SE.UTF-8 1.79% 1,645,370 1,674,760 > lorem_ipsum#hu_HU.UTF-8 4.86% 3,179,620 3,334,290 > lorem_ipsum#tr_TR.UTF-8 -23.59% 2,473,330 1,889,870 > lorem_ipsum#is_IS.UTF-8 2.49% 1,620,370 1,660,680 > lorem_ipsum#it_IT.UTF-8 -2.67% 2,186,160 2,127,710 > lorem_ipsum#sr_RS.UTF-8 2.70% 1,930,520 1,982,720 > lorem_ipsum#ja_JP.UTF-8 -97.43% 958,411 24,664 > wikipedia-th#en_US.UTF-8 -99.61% 10,511,700,000 40,577,100 > > The performance numbers and the size of the patch changed due to the removal of the strdiff optimization (#18589) and > the included thai test. Performance degration for locales in the ASCII plane is still minor. It does increase the speed > of strcoll for all languages that mostly use multiple byte UTF-8 encoding a lot. Note that it should affect the regex > performance of these languages too, though there is no benchmark for that. >> Will this always work? I'm just wondering about a user generated charmap that they >> call 'utf8', which is the other common alias for instance where the dash is not valid >> syntax. Probably not since the official name is UTF-8, and that's what you should use. > > Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8". Could you check to see what value 'code_set_name' uses internally? Is it always 'UTF-8'? If it is, then the check you have is just fine. Otherwise we should check for utf8 and UTF-8. We don't know until you verify the values code_set_name can have. > > * benchtests/bench-strcoll.c: Add thai text with en_US.UTF-8 locale. > * benchtests/strcoll-inputs/wikipedia-th#en_US.UTF-8: New file. > * locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE. > * locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list. > * locale/localeinfo.h: Add enum collation_encoding_type. > * locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit. > * locale/programs/ld-collate.c (struct locale_collate_t): > Expand mbheads array from 256 to 16384 entries. > (collate_finish): Generate 2-byte key for mbheads if UTF-8 locale. > (collate_output): Output larger table and sequences including first byte. > (collate_output): Add encoding type info. > * locale/weight.h (utf8index): New function to calculate 2 byte index. > (findidx): Use 2-byte index for table if UTF-8 locale. > * locale/weightwc.h (findidx): Accept encoding parameter, not used. > * posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter. > * posix/regcomp.c (build_equiv_class): Likewise. > * posix/regex_internal.h (re_string_elem_size_at): Likewise. > * posix/regexec.c (check_node_accept_bytes): Likewise. > * string/strcoll_l.c (get_next_seq): Likewise. > (STRCOLL): Call get_next_seq with encoding parameter. > * string/strxfrm_l.c (find_idx): Call findidx with encoding parameter. > (STRXFRM): Call find_idx with encoding parameter. > > > diff --git a/benchtests/bench-strcoll.c b/benchtests/bench-strcoll.c > index 22ae87c..6ce5b2a 100644 > --- a/benchtests/bench-strcoll.c > +++ b/benchtests/bench-strcoll.c > @@ -53,7 +53,8 @@ static const char *const input_files[] = { > "lorem_ipsum#is_IS.UTF-8", > "lorem_ipsum#it_IT.UTF-8", > "lorem_ipsum#sr_RS.UTF-8", > - "lorem_ipsum#ja_JP.UTF-8" > + "lorem_ipsum#ja_JP.UTF-8", > + "wikipedia-th#en_US.UTF-8" OK. > }; > > #define TEXTFILE_DELIMITER " \n\r\t.,?!" > diff --git a/locale/C-collate.c b/locale/C-collate.c > index 8214ff5..5a9ed6a 100644 > --- a/locale/C-collate.c > +++ b/locale/C-collate.c > @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden = > /* _NL_COLLATE_COLLSEQWC */ > { .string = (const char *) collseqwc }, > /* _NL_COLLATE_CODESET */ > - { .string = _nl_C_codeset } > + { .string = _nl_C_codeset }, > + /* _NL_COLLATE_ENCODING_TYPE */ > + { .word = __cet_8bit } This makes locale-archive incompatible again right? Users have to regenerate the locale-archive after the upgrade? We need a release note for that under "Packaging Changes" https://sourceware.org/glibc/wiki/Release/2.24#Packaging_Changes The release note should mention the binary locale-archive format has changed and that locale-archive must be removed before upgrading and then recompiled after upgrade. > } > }; > diff --git a/locale/categories.def b/locale/categories.def > index d8a3ab8..cb57eae 100644 > --- a/locale/categories.def > +++ b/locale/categories.def > @@ -58,6 +58,7 @@ DEFINE_CATEGORY > DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB, "collate-collseqmb", std, wstring) > DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC, "collate-collseqwc", std, wstring) > DEFINE_ELEMENT (_NL_COLLATE_CODESET, "collate-codeset", std, string) > + DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE, "collate-encoding-type", std, word) OK. > ), NO_POSTLOAD) > > > diff --git a/locale/langinfo.h b/locale/langinfo.h > index 481e226..0906a6a 100644 > --- a/locale/langinfo.h > +++ b/locale/langinfo.h > @@ -255,6 +255,7 @@ enum > _NL_COLLATE_COLLSEQMB, > _NL_COLLATE_COLLSEQWC, > _NL_COLLATE_CODESET, > + _NL_COLLATE_ENCODING_TYPE, OK. > _NL_NUM_LC_COLLATE, > > /* LC_CTYPE category: character classification. > diff --git a/locale/localeinfo.h b/locale/localeinfo.h > index 5c4e6ef..bd284df 100644 > --- a/locale/localeinfo.h > +++ b/locale/localeinfo.h > @@ -110,6 +110,14 @@ enum coll_sort_rule > sort_mask > }; > > +/* Collation encoding type. */ > +enum collation_encoding_type > +{ > + __cet_other, > + __cet_8bit, > + __cet_utf8 > +}; OK. > + > /* We can map the types of the entries into a few categories. */ > enum value_type > { > diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c > index 1e125f6..efaacf6 100644 > --- a/locale/programs/ld-collate.c > +++ b/locale/programs/ld-collate.c > @@ -32,6 +32,8 @@ > #include "linereader.h" > #include "locfile.h" > #include "elem-hash.h" > +#include "../localeinfo.h" > +#include "../locale/weight.h" OK. > > /* Uncomment the following line in the production version. */ > /* #define NDEBUG 1 */ > @@ -243,9 +245,10 @@ struct locale_collate_t > Therefore we keep all relevant input in a list. */ > struct locale_collate_t *next; > > - /* Arrays with heads of the list for each of the leading bytes in > + /* Arrays with heads of the list for the leading bytes in > the multibyte sequences. */ > - struct element_t *mbheads[256]; Needs a comment explaining why this is '256*256'. > + #define MBHEADS_SZ (256 * 256) > + struct element_t *mbheads[MBHEADS_SZ]; OK. > > /* Arrays with heads of the list for each of the leading bytes in > the multibyte sequences. */ > @@ -1557,6 +1560,7 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap) > struct section_list *sect; > int ruleidx; > int nr_wide_elems = 0; > + bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0; As discussed above you need to verify that code_set_name is always "UTF-8" if it's a UTF-8 charmap. If it is always "UTF-8" then this check is just fine. I will note that `locale -a -v` always prints 'UTF-8' for codeset and never 'utf8', but the printed name is still 'C.utf8' or 'es_EC.utf8' etc because glibc normalizes the name internally (downcase and strip dashes e.g. pt_PT.iso88591). I suggest just putting a breakpoint there and verify that it's UTF-8 for a few locale names. You could then avoid strcmp: bool is_utf8 == strlen (charmap->code_set_name) == 5 ? (charmap->code_set_name[0] == 'U' && charmap->code_set_name[1] == 'T' && charmap->code_set_name[2] == 'F' && charmap->code_set_name[3] == '-' && charmap->code_set_name[4] == '8') : 0; Since more than 5 chars means it's not 'UTF-8' e.g. BIG5-HKSCS, or ISO-8859-1. The only locales with 5 or less are GBK and PT154. So you could skip the strcmp in all cases, but still need a comparison to disambiguate PT154 vs. UTF-8. > > if (collate == NULL) > { > @@ -1663,7 +1667,22 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap) > struct element_t *lastp = NULL; > > /* Find the point where to insert in the list. */ > - eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]]; > + uint16_t index = ((unsigned char *) runp->mbs)[0]; > + > + /* Special handling of UTF-8: Generate a 2-byte index to mbheads. */ > + if (is_utf8 && index > 0) > + { > + index = utf8index((unsigned char *) runp->mbs, runp->nmbs); > + if (index == 0) > + { > + WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line, > + _("\ > +malformed UTF-8 character in `%s'"), runp->name);); > + goto dont_insert; > + } > + } > + > + eptr = &collate->mbheads[index]; OK. > while (*eptr != NULL) > { > if ((*eptr)->nmbs < runp->nmbs) > @@ -1734,7 +1753,7 @@ symbol `%s' has the same encoding as"), (*eptr)->name); > > /* Find out whether any of the `mbheads' entries is unset. In this > case we use the UNDEFINED entry. */ > - for (i = 1; i < 256; ++i) > + for (i = 1; i < MBHEADS_SZ; ++i) OK. > if (collate->mbheads[i] == NULL) > { > need_undefined = 1; > @@ -2107,7 +2126,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE); > struct locale_file file; > size_t ch; > - int32_t tablemb[256]; > + int32_t tablemb[MBHEADS_SZ]; OK. > struct obstack weightpool; > struct obstack extrapool; > struct obstack indirectpool; > @@ -2130,6 +2149,8 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > /* The words have to be handled specially. */ > if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB)) > add_locale_uint32 (&file, 0); > + else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)) > + add_locale_uint32 (&file, __cet_other); OK. > else > add_locale_empty (&file); > } > @@ -2183,7 +2204,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > if (collate->undefined.used_in_level != 0) > output_weight (&weightpool, collate, &collate->undefined); > > - for (ch = 1; ch < 256; ++ch) > + for (ch = 1; ch < MBHEADS_SZ; ++ch) OK. > if (collate->mbheads[ch]->mbnext == NULL > && collate->mbheads[ch]->nmbs <= 1) > { > @@ -2208,7 +2229,6 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > and add only one index into the weight table. We can find the > consecutive entries since they are also consecutive in the list. */ > struct element_t *runp = collate->mbheads[ch]; > - struct element_t *lastp; > > assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); > > @@ -2236,7 +2256,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > > /* Compute how much space we will need. */ > added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 > - + 2 * (runp->nmbs - 1)); > + + 2 * runp->nmbs); > assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); > obstack_make_room (&extrapool, added); > > @@ -2259,9 +2279,9 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > /* Now walk backward from here to the beginning. */ > curp = runp; > > - assert (runp->nmbs <= 256); > - obstack_1grow_fast (&extrapool, curp->nmbs - 1); > - for (i = 1; i < curp->nmbs; ++i) > + assert (runp->nmbs <= 255); Another magic constant. Where does 256 or 255 come from? These are struct element_t structures with entries chained from there. Why do we limit them to 255? When might they be larger? We should use a define here and document why the limit exists. > + obstack_1grow_fast (&extrapool, curp->nmbs); > + for (i = 0; i < curp->nmbs; ++i) OK. > obstack_1grow_fast (&extrapool, curp->mbs[i]); > > /* Now find the end of the consecutive sequence and > @@ -2281,7 +2301,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > > /* And add the end byte sequence. Without length this > time. */ > - for (i = 1; i < curp->nmbs; ++i) > + for (i = 0; i < curp->nmbs; ++i) > obstack_1grow_fast (&extrapool, curp->mbs[i]); > } > else > @@ -2295,15 +2315,15 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > weightidx = output_weight (&weightpool, collate, runp); > > added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 > - + runp->nmbs - 1); > + + runp->nmbs); > assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); > obstack_make_room (&extrapool, added); > > obstack_int32_grow_fast (&extrapool, weightidx); > - assert (runp->nmbs <= 256); > - obstack_1grow_fast (&extrapool, runp->nmbs - 1); > + assert (runp->nmbs <= 255); > + obstack_1grow_fast (&extrapool, runp->nmbs); > > - for (i = 1; i < runp->nmbs; ++i) > + for (i = 0; i < runp->nmbs; ++i) > obstack_1grow_fast (&extrapool, runp->mbs[i]); > } > > @@ -2312,30 +2332,25 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > obstack_1grow_fast (&extrapool, '\0'); > > /* Next entry. */ > - lastp = runp; > runp = runp->mbnext; > } > while (runp != NULL); > > assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); > > - /* If the final entry in the list is not a single character we > - add an UNDEFINED entry here. */ > - if (lastp->nmbs != 1) > - { > - int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1); > - obstack_make_room (&extrapool, added); > + /* Add an UNDEFINED entry at the end of the list. */ > + int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1); > + obstack_make_room (&extrapool, added); > > - obstack_int32_grow_fast (&extrapool, 0); > - /* XXX What rule? We just pick the first. */ > - obstack_1grow_fast (&extrapool, 0); > - /* Length is zero. */ > - obstack_1grow_fast (&extrapool, 0); > + obstack_int32_grow_fast (&extrapool, 0); > + /* XXX What rule? We just pick the first. */ > + obstack_1grow_fast (&extrapool, 0); > + /* Length is zero. */ > + obstack_1grow_fast (&extrapool, 0); > > - /* Add alignment bytes if necessary. */ > - while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))) > - obstack_1grow_fast (&extrapool, '\0'); > - } > + /* Add alignment bytes if necessary. */ > + while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))) > + obstack_1grow_fast (&extrapool, '\0'); > } > > /* Add padding to the tables if necessary. */ > @@ -2343,7 +2358,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > obstack_1grow (&weightpool, 0); > > /* Now add the four tables. */ > - add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256); > + add_locale_uint32_array (&file, (const uint32_t *) tablemb, MBHEADS_SZ); > add_locale_raw_obstack (&file, &weightpool); > add_locale_raw_obstack (&file, &extrapool); > add_locale_raw_obstack (&file, &indirectpool); > @@ -2493,6 +2508,12 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, > add_locale_raw_data (&file, collate->mbseqorder, 256); > add_locale_collseq_table (&file, &collate->wcseqorder); > add_locale_string (&file, charmap->code_set_name); > + if (strcmp (charmap->code_set_name, "UTF-8") == 0) Similar discussion as above regarding UTF-8 and optimziation. > + add_locale_uint32 (&file, __cet_utf8); > + else if (charmap->mb_cur_max == 1) > + add_locale_uint32 (&file, __cet_8bit); > + else > + add_locale_uint32 (&file, __cet_other); > write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file); > > obstack_free (&weightpool, NULL); > diff --git a/locale/weight.h b/locale/weight.h > index c99730c..5b4103b 100644 > --- a/locale/weight.h > +++ b/locale/weight.h > @@ -19,26 +19,81 @@ > #ifndef _WEIGHT_H_ > #define _WEIGHT_H_ 1 > > +/* Generate 2 byte code for the next UTF-8 encoded char. > + Returns zero on UTF-8 encoding errors. */ Plase change the prototype to be: int utf8index (const unsigned char *cp, size_t len, uint16_t *index); Return 0 on success. Return !0 on error. Store index into *index. This way we avoid conflating index value 0 and error. > +static __always_inline uint16_t > +utf8index (const unsigned char *cp, size_t len) > +{ > + uint16_t index = cp[0]; > + > + if (index >= 0x80) > + { > + if (index < 0xE0) > + { > + if (len < 2) > + return 0; > + uint16_t byte2 = cp[1]; > + index = (index << 6) + byte2 - 0x3080; > + } > + else if (index < 0xF0) > + { > + if (len < 3) > + return 0; > + uint16_t byte2 = cp[1]; > + uint16_t byte3 = cp[2]; > + index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080; > + } > + else if (index < 0xF8) > + { > + if (len < 4) > + return 0; > + uint16_t byte2 = cp[1]; > + uint16_t byte3 = cp[2]; > + uint16_t byte4 = cp[3]; > + index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080; I believe this is technically inaccurate since it allows all 4-byte sequences, when in reality the limit is at U+10FFFF? You need not fix it, but we should add a comment saying that for the sake of simpler code we're allowing those 4-byte sequences which are not normally accepted. > + } > + else > + return 0; OK. > + } > + > + return index; > +} > + > /* Find index of weight. */ > static inline int32_t __attribute__ ((always_inline)) > -findidx (const int32_t *table, > +findidx (uint_fast32_t locale_encoding, > + const int32_t *table, > const int32_t *indirect, > const unsigned char *extra, > const unsigned char **cpp, size_t len) > { > - int_fast32_t i = table[*(*cpp)++]; > const unsigned char *cp; > const unsigned char *usrc; > + uint16_t index = (*cpp)[0]; > + > + /* Special handling of UTF-8: Generate a 2-byte index for table. */ > + if (index >= 0x80 && locale_encoding == __cet_utf8) > + { > + index = utf8index(*cpp, len); > + if (index == 0) > + { > + *cpp += 1; > + return 0; > + } > + } OK. > > + int_fast32_t i = table[index]; > if (i >= 0) > - /* This is an index into the weight table. Cool. */ > - return i; > + { > + /* This is an index into the weight table. Cool. */ > + *cpp += 1; > + return i; > + } > > /* Oh well, more than one sequence starting with this byte. > Search for the correct one. */ > cp = &extra[-i]; > usrc = *cpp; > - --len; > while (1) > { > size_t nhere; > @@ -57,8 +112,7 @@ findidx (const int32_t *table, > /* It is a single character. If it matches we found our > index. Note that at the end of each list there is an > entry of length zero which represents the single byte > - sequence. The first (and here only) byte was tested > - already. */ > + sequence. */ > size_t cnt; > > for (cnt = 0; cnt < nhere && cnt < len; ++cnt) > @@ -68,7 +122,7 @@ findidx (const int32_t *table, > if (cnt == nhere) > { > /* Found it. */ > - *cpp += nhere; > + *cpp += nhere > 0 ? nhere : 1; > return i; > } > > @@ -127,7 +181,7 @@ findidx (const int32_t *table, > while (++cnt < nhere); > } > > - *cpp += nhere; > + *cpp += nhere > 0 ? nhere : 1; > return indirect[-i + offset]; > } > } OK. > diff --git a/locale/weightwc.h b/locale/weightwc.h > index ab26482..4101dc8 100644 > --- a/locale/weightwc.h > +++ b/locale/weightwc.h > @@ -21,7 +21,8 @@ > > /* Find index of weight. */ > static inline int32_t __attribute__ ((always_inline)) > -findidx (const int32_t *table, > +findidx (uint_fast32_t encoding, > + const int32_t *table, > const int32_t *indirect, > const wint_t *extra, > const wint_t **cpp, size_t len) > diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c > index 229904e..07b60fb 100644 > --- a/posix/fnmatch_loop.c > +++ b/posix/fnmatch_loop.c > @@ -383,6 +383,8 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, > const int32_t *indirect; > int32_t idx; > const UCHAR *cp = (const UCHAR *) &str; > + uint_fast32_t encoding = > + _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); > > # if WIDE_CHAR_VERSION > table = (const int32_t *) > @@ -404,7 +406,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, > _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); > # endif > > - idx = FINDIDX (table, indirect, extra, &cp, 1); > + idx = FINDIDX (encoding, table, indirect, extra, &cp, 1); > if (idx != 0) > { > /* We found a table entry. Now see whether the > @@ -414,7 +416,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, > int32_t idx2; > const UCHAR *np = (const UCHAR *) n; > > - idx2 = FINDIDX (table, indirect, extra, > + idx2 = FINDIDX (encoding, table, indirect, extra, > &np, string_end - n); > if (idx2 != 0 > && (idx >> 24) == (idx2 >> 24) OK. > diff --git a/posix/regcomp.c b/posix/regcomp.c > index b6126b7..011ef92 100644 > --- a/posix/regcomp.c > +++ b/posix/regcomp.c > @@ -3414,6 +3414,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) > uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES); > if (nrules != 0) > { > + uint_fast32_t encoding; > const int32_t *table, *indirect; > const unsigned char *weights, *extra, *cp; > unsigned char char_buf[2]; > @@ -3422,6 +3423,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) > size_t len; > /* Calculate the index for equivalence class. */ > cp = name; > + encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); > table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); > weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE, > _NL_COLLATE_WEIGHTMB); > @@ -3429,7 +3431,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) > _NL_COLLATE_EXTRAMB); > indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE, > _NL_COLLATE_INDIRECTMB); > - idx1 = findidx (table, indirect, extra, &cp, -1); > + idx1 = findidx (encoding, table, indirect, extra, &cp, -1); > if (BE (idx1 == 0 || *cp != '\0', 0)) > /* This isn't a valid character. */ > return REG_ECOLLATE; > @@ -3440,7 +3442,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) > { > char_buf[0] = ch; > cp = char_buf; > - idx2 = findidx (table, indirect, extra, &cp, 1); > + idx2 = findidx (encoding, table, indirect, extra, &cp, 1); > /* > idx2 = table[ch]; > */ OK > diff --git a/posix/regex_internal.h b/posix/regex_internal.h > index 02e040b..993c7c3 100644 > --- a/posix/regex_internal.h > +++ b/posix/regex_internal.h > @@ -743,17 +743,19 @@ re_string_elem_size_at (const re_string_t *pstr, int idx) > # ifdef _LIBC > const unsigned char *p, *extra; > const int32_t *table, *indirect; > + uint_fast32_t encoding; > uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES); > > if (nrules != 0) > { > + encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); > table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); > extra = (const unsigned char *) > _NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB); > indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE, > _NL_COLLATE_INDIRECTMB); > p = pstr->mbs + idx; > - findidx (table, indirect, extra, &p, pstr->len - idx); > + findidx (encoding, table, indirect, extra, &p, pstr->len - idx); > return p - pstr->mbs - idx; > } > else OK. > diff --git a/posix/regexec.c b/posix/regexec.c > index ec46c3a..3d3ad9a 100644 > --- a/posix/regexec.c > +++ b/posix/regexec.c > @@ -3843,6 +3843,7 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, > if (nrules != 0) > { > unsigned int in_collseq = 0; > + uint_fast32_t encoding; > const int32_t *table, *indirect; > const unsigned char *weights, *extra; > const char *collseqwc; > @@ -3893,6 +3894,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, > if (cset->nequiv_classes) > { > const unsigned char *cp = pin; > + encoding = > + _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); > table = (const int32_t *) > _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); > weights = (const unsigned char *) > @@ -3901,7 +3904,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, > _NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB); > indirect = (const int32_t *) > _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); > - int32_t idx = findidx (table, indirect, extra, &cp, elem_len); > + int32_t idx = findidx (encoding, table, indirect, extra, &cp, > + elem_len); > if (idx > 0) > for (i = 0; i < cset->nequiv_classes; ++i) > { OK. > diff --git a/string/strcoll_l.c b/string/strcoll_l.c > index 4d1e3ab..2c2cab0 100644 > --- a/string/strcoll_l.c > +++ b/string/strcoll_l.c > @@ -63,9 +63,9 @@ typedef struct > /* Get next sequence. Traverse the string as required. */ > static __always_inline void > get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, > - const USTRING_TYPE *weights, const int32_t *table, > - const USTRING_TYPE *extra, const int32_t *indirect, > - int pass) > + const USTRING_TYPE *weights, uint_fast32_t encoding, > + const int32_t *table, const USTRING_TYPE *extra, > + const int32_t *indirect, int pass) > { > size_t val = seq->val = 0; > int len = seq->len; > @@ -109,7 +109,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, > us = seq->back_us; > while (i < backw) > { > - int32_t tmp = findidx (table, indirect, extra, &us, -1); > + int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1); > idx = tmp & 0xffffff; > i++; > } > @@ -124,7 +124,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, > > while (*us != L('\0')) > { > - int32_t tmp = findidx (table, indirect, extra, &us, -1); > + int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1); > unsigned char rule = tmp >> 24; > prev_idx = idx; > idx = tmp & 0xffffff; > @@ -253,6 +253,7 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) > const USTRING_TYPE *weights; > const USTRING_TYPE *extra; > const int32_t *indirect; > + uint_fast32_t encoding; > > if (nrules == 0) > return STRCMP (s1, s2); > @@ -271,6 +272,8 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) > current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_EXTRA,SUFFIX))].string; > indirect = (const int32_t *) > current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_INDIRECT,SUFFIX))].string; > + encoding = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word; > + > > assert (((uintptr_t) table) % __alignof__ (table[0]) == 0); > assert (((uintptr_t) weights) % __alignof__ (weights[0]) == 0); > @@ -310,9 +313,9 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) > > while (1) > { > - get_next_seq (&seq1, nrules, rulesets, weights, table, > + get_next_seq (&seq1, nrules, rulesets, weights, encoding, table, > extra, indirect, pass); > - get_next_seq (&seq2, nrules, rulesets, weights, table, > + get_next_seq (&seq2, nrules, rulesets, weights, encoding, table, > extra, indirect, pass); > /* See whether any or both strings are empty. */ > if (seq1.len == 0 || seq2.len == 0) OK. > diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c > index 22e24d3..5c89b15 100644 > --- a/string/strxfrm_l.c > +++ b/string/strxfrm_l.c > @@ -53,6 +53,7 @@ typedef struct > uint_fast32_t nrules; > unsigned char *rulesets; > USTRING_TYPE *weights; > + uint_fast32_t encoding; > int32_t *table; > USTRING_TYPE *extra; > int32_t *indirect; > @@ -100,8 +101,8 @@ static __always_inline size_t > find_idx (const USTRING_TYPE **us, int32_t *weight_idx, > unsigned char *rule_idx, const locale_data_t *l_data, const int pass) > { > - int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us, > - -1); > + int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect, > + l_data->extra, us, -1); > *rule_idx = tmp >> 24; > int32_t idx = tmp & 0xffffff; > size_t len = l_data->weights[idx++]; > @@ -693,6 +694,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l) > /* Get the locale data. */ > l_data.rulesets = (unsigned char *) > current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string; > + l_data.encoding = > + current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word; > l_data.table = (int32_t *) > current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string; > l_data.weights = (USTRING_TYPE *) > @@ -721,8 +724,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l) > > do > { > - int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur, > - -1); > + int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect, > + l_data.extra, &cur, -1); > rulearr[idxmax] = tmp >> 24; > idxarr[idxmax] = tmp & 0xffffff; > OK.
On Tue, 29 Mar 2016, Carlos O'Donell wrote: > I believe this is technically inaccurate since it allows all 4-byte > sequences, when in reality the limit is at U+10FFFF? That glibc accepts UTF-8 according to the definition in the 2003 edition of ISO 10646 rather than the definition in the 2011 and later editions is a known issue. I've filed bug 19883 for it since I couldn't find an existing bug report in Bugzilla. I don't think it's particularly relevant to any patch not aiming to fix that bug, but: > You need not fix it, but we should add a comment saying that for the > sake of simpler code we're allowing those 4-byte sequences which are > not normally accepted. I'd think a reference to this code in bug 19883 might be more useful - or something in that bug giving a standard (greppable) wording for a comment identifying places needing updating for the current UTF-8 (or in some cases UCS-4) definition, with such a comment added in this code.
On Tue, Mar 29, 2016 at 6:05 PM, Joseph Myers <joseph@codesourcery.com> wrote: > On Tue, 29 Mar 2016, Carlos O'Donell wrote: > >> I believe this is technically inaccurate since it allows all 4-byte >> sequences, when in reality the limit is at U+10FFFF? > > That glibc accepts UTF-8 according to the definition in the 2003 edition > of ISO 10646 rather than the definition in the 2011 and later editions is > a known issue. I've filed bug 19883 for it since I couldn't find an > existing bug report in Bugzilla. Note that the U+10FFFF limit equates to a Y2541 bug, at the present (post-2000) rate of codepoint assignment. See https://gist.github.com/zackw/f2e74a8d7b31baa88002 for calculations and pretty graph. zw
diff --git a/benchtests/bench-strcoll.c b/benchtests/bench-strcoll.c index 22ae87c..6ce5b2a 100644 --- a/benchtests/bench-strcoll.c +++ b/benchtests/bench-strcoll.c @@ -53,7 +53,8 @@ static const char *const input_files[] = { "lorem_ipsum#is_IS.UTF-8", "lorem_ipsum#it_IT.UTF-8", "lorem_ipsum#sr_RS.UTF-8", - "lorem_ipsum#ja_JP.UTF-8" + "lorem_ipsum#ja_JP.UTF-8", + "wikipedia-th#en_US.UTF-8" }; #define TEXTFILE_DELIMITER " \n\r\t.,?!" diff --git a/locale/C-collate.c b/locale/C-collate.c index 8214ff5..5a9ed6a 100644 --- a/locale/C-collate.c +++ b/locale/C-collate.c @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden = /* _NL_COLLATE_COLLSEQWC */ { .string = (const char *) collseqwc }, /* _NL_COLLATE_CODESET */ - { .string = _nl_C_codeset } + { .string = _nl_C_codeset }, + /* _NL_COLLATE_ENCODING_TYPE */ + { .word = __cet_8bit } } }; diff --git a/locale/categories.def b/locale/categories.def index d8a3ab8..cb57eae 100644 --- a/locale/categories.def +++ b/locale/categories.def @@ -58,6 +58,7 @@ DEFINE_CATEGORY DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB, "collate-collseqmb", std, wstring) DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC, "collate-collseqwc", std, wstring) DEFINE_ELEMENT (_NL_COLLATE_CODESET, "collate-codeset", std, string) + DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE, "collate-encoding-type", std, word) ), NO_POSTLOAD) diff --git a/locale/langinfo.h b/locale/langinfo.h index 481e226..0906a6a 100644 --- a/locale/langinfo.h +++ b/locale/langinfo.h @@ -255,6 +255,7 @@ enum _NL_COLLATE_COLLSEQMB, _NL_COLLATE_COLLSEQWC, _NL_COLLATE_CODESET, + _NL_COLLATE_ENCODING_TYPE, _NL_NUM_LC_COLLATE, /* LC_CTYPE category: character classification. diff --git a/locale/localeinfo.h b/locale/localeinfo.h index 5c4e6ef..bd284df 100644 --- a/locale/localeinfo.h +++ b/locale/localeinfo.h @@ -110,6 +110,14 @@ enum coll_sort_rule sort_mask }; +/* Collation encoding type. */ +enum collation_encoding_type +{ + __cet_other, + __cet_8bit, + __cet_utf8 +}; + /* We can map the types of the entries into a few categories. */ enum value_type { diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c index 1e125f6..efaacf6 100644 --- a/locale/programs/ld-collate.c +++ b/locale/programs/ld-collate.c @@ -32,6 +32,8 @@ #include "linereader.h" #include "locfile.h" #include "elem-hash.h" +#include "../localeinfo.h" +#include "../locale/weight.h" /* Uncomment the following line in the production version. */ /* #define NDEBUG 1 */ @@ -243,9 +245,10 @@ struct locale_collate_t Therefore we keep all relevant input in a list. */ struct locale_collate_t *next; - /* Arrays with heads of the list for each of the leading bytes in + /* Arrays with heads of the list for the leading bytes in the multibyte sequences. */ - struct element_t *mbheads[256]; + #define MBHEADS_SZ (256 * 256) + struct element_t *mbheads[MBHEADS_SZ]; /* Arrays with heads of the list for each of the leading bytes in the multibyte sequences. */ @@ -1557,6 +1560,7 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap) struct section_list *sect; int ruleidx; int nr_wide_elems = 0; + bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0; if (collate == NULL) { @@ -1663,7 +1667,22 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap) struct element_t *lastp = NULL; /* Find the point where to insert in the list. */ - eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]]; + uint16_t index = ((unsigned char *) runp->mbs)[0]; + + /* Special handling of UTF-8: Generate a 2-byte index to mbheads. */ + if (is_utf8 && index > 0) + { + index = utf8index((unsigned char *) runp->mbs, runp->nmbs); + if (index == 0) + { + WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line, + _("\ +malformed UTF-8 character in `%s'"), runp->name);); + goto dont_insert; + } + } + + eptr = &collate->mbheads[index]; while (*eptr != NULL) { if ((*eptr)->nmbs < runp->nmbs) @@ -1734,7 +1753,7 @@ symbol `%s' has the same encoding as"), (*eptr)->name); /* Find out whether any of the `mbheads' entries is unset. In this case we use the UNDEFINED entry. */ - for (i = 1; i < 256; ++i) + for (i = 1; i < MBHEADS_SZ; ++i) if (collate->mbheads[i] == NULL) { need_undefined = 1; @@ -2107,7 +2126,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE); struct locale_file file; size_t ch; - int32_t tablemb[256]; + int32_t tablemb[MBHEADS_SZ]; struct obstack weightpool; struct obstack extrapool; struct obstack indirectpool; @@ -2130,6 +2149,8 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, /* The words have to be handled specially. */ if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB)) add_locale_uint32 (&file, 0); + else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)) + add_locale_uint32 (&file, __cet_other); else add_locale_empty (&file); } @@ -2183,7 +2204,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, if (collate->undefined.used_in_level != 0) output_weight (&weightpool, collate, &collate->undefined); - for (ch = 1; ch < 256; ++ch) + for (ch = 1; ch < MBHEADS_SZ; ++ch) if (collate->mbheads[ch]->mbnext == NULL && collate->mbheads[ch]->nmbs <= 1) { @@ -2208,7 +2229,6 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, and add only one index into the weight table. We can find the consecutive entries since they are also consecutive in the list. */ struct element_t *runp = collate->mbheads[ch]; - struct element_t *lastp; assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); @@ -2236,7 +2256,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, /* Compute how much space we will need. */ added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 - + 2 * (runp->nmbs - 1)); + + 2 * runp->nmbs); assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); obstack_make_room (&extrapool, added); @@ -2259,9 +2279,9 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, /* Now walk backward from here to the beginning. */ curp = runp; - assert (runp->nmbs <= 256); - obstack_1grow_fast (&extrapool, curp->nmbs - 1); - for (i = 1; i < curp->nmbs; ++i) + assert (runp->nmbs <= 255); + obstack_1grow_fast (&extrapool, curp->nmbs); + for (i = 0; i < curp->nmbs; ++i) obstack_1grow_fast (&extrapool, curp->mbs[i]); /* Now find the end of the consecutive sequence and @@ -2281,7 +2301,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, /* And add the end byte sequence. Without length this time. */ - for (i = 1; i < curp->nmbs; ++i) + for (i = 0; i < curp->nmbs; ++i) obstack_1grow_fast (&extrapool, curp->mbs[i]); } else @@ -2295,15 +2315,15 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, weightidx = output_weight (&weightpool, collate, runp); added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 - + runp->nmbs - 1); + + runp->nmbs); assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); obstack_make_room (&extrapool, added); obstack_int32_grow_fast (&extrapool, weightidx); - assert (runp->nmbs <= 256); - obstack_1grow_fast (&extrapool, runp->nmbs - 1); + assert (runp->nmbs <= 255); + obstack_1grow_fast (&extrapool, runp->nmbs); - for (i = 1; i < runp->nmbs; ++i) + for (i = 0; i < runp->nmbs; ++i) obstack_1grow_fast (&extrapool, runp->mbs[i]); } @@ -2312,30 +2332,25 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, obstack_1grow_fast (&extrapool, '\0'); /* Next entry. */ - lastp = runp; runp = runp->mbnext; } while (runp != NULL); assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); - /* If the final entry in the list is not a single character we - add an UNDEFINED entry here. */ - if (lastp->nmbs != 1) - { - int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1); - obstack_make_room (&extrapool, added); + /* Add an UNDEFINED entry at the end of the list. */ + int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1); + obstack_make_room (&extrapool, added); - obstack_int32_grow_fast (&extrapool, 0); - /* XXX What rule? We just pick the first. */ - obstack_1grow_fast (&extrapool, 0); - /* Length is zero. */ - obstack_1grow_fast (&extrapool, 0); + obstack_int32_grow_fast (&extrapool, 0); + /* XXX What rule? We just pick the first. */ + obstack_1grow_fast (&extrapool, 0); + /* Length is zero. */ + obstack_1grow_fast (&extrapool, 0); - /* Add alignment bytes if necessary. */ - while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))) - obstack_1grow_fast (&extrapool, '\0'); - } + /* Add alignment bytes if necessary. */ + while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))) + obstack_1grow_fast (&extrapool, '\0'); } /* Add padding to the tables if necessary. */ @@ -2343,7 +2358,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, obstack_1grow (&weightpool, 0); /* Now add the four tables. */ - add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256); + add_locale_uint32_array (&file, (const uint32_t *) tablemb, MBHEADS_SZ); add_locale_raw_obstack (&file, &weightpool); add_locale_raw_obstack (&file, &extrapool); add_locale_raw_obstack (&file, &indirectpool); @@ -2493,6 +2508,12 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, add_locale_raw_data (&file, collate->mbseqorder, 256); add_locale_collseq_table (&file, &collate->wcseqorder); add_locale_string (&file, charmap->code_set_name); + if (strcmp (charmap->code_set_name, "UTF-8") == 0) + add_locale_uint32 (&file, __cet_utf8); + else if (charmap->mb_cur_max == 1) + add_locale_uint32 (&file, __cet_8bit); + else + add_locale_uint32 (&file, __cet_other); write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file); obstack_free (&weightpool, NULL); diff --git a/locale/weight.h b/locale/weight.h index c99730c..5b4103b 100644 --- a/locale/weight.h +++ b/locale/weight.h @@ -19,26 +19,81 @@ #ifndef _WEIGHT_H_ #define _WEIGHT_H_ 1 +/* Generate 2 byte code for the next UTF-8 encoded char. + Returns zero on UTF-8 encoding errors. */ +static __always_inline uint16_t +utf8index (const unsigned char *cp, size_t len) +{ + uint16_t index = cp[0]; + + if (index >= 0x80) + { + if (index < 0xE0) + { + if (len < 2) + return 0; + uint16_t byte2 = cp[1]; + index = (index << 6) + byte2 - 0x3080; + } + else if (index < 0xF0) + { + if (len < 3) + return 0; + uint16_t byte2 = cp[1]; + uint16_t byte3 = cp[2]; + index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080; + } + else if (index < 0xF8) + { + if (len < 4) + return 0; + uint16_t byte2 = cp[1]; + uint16_t byte3 = cp[2]; + uint16_t byte4 = cp[3]; + index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080; + } + else + return 0; + } + + return index; +} + /* Find index of weight. */ static inline int32_t __attribute__ ((always_inline)) -findidx (const int32_t *table, +findidx (uint_fast32_t locale_encoding, + const int32_t *table, const int32_t *indirect, const unsigned char *extra, const unsigned char **cpp, size_t len) { - int_fast32_t i = table[*(*cpp)++]; const unsigned char *cp; const unsigned char *usrc; + uint16_t index = (*cpp)[0]; + + /* Special handling of UTF-8: Generate a 2-byte index for table. */ + if (index >= 0x80 && locale_encoding == __cet_utf8) + { + index = utf8index(*cpp, len); + if (index == 0) + { + *cpp += 1; + return 0; + } + } + int_fast32_t i = table[index]; if (i >= 0) - /* This is an index into the weight table. Cool. */ - return i; + { + /* This is an index into the weight table. Cool. */ + *cpp += 1; + return i; + } /* Oh well, more than one sequence starting with this byte. Search for the correct one. */ cp = &extra[-i]; usrc = *cpp; - --len; while (1) { size_t nhere; @@ -57,8 +112,7 @@ findidx (const int32_t *table, /* It is a single character. If it matches we found our index. Note that at the end of each list there is an entry of length zero which represents the single byte - sequence. The first (and here only) byte was tested - already. */ + sequence. */ size_t cnt; for (cnt = 0; cnt < nhere && cnt < len; ++cnt) @@ -68,7 +122,7 @@ findidx (const int32_t *table, if (cnt == nhere) { /* Found it. */ - *cpp += nhere; + *cpp += nhere > 0 ? nhere : 1; return i; } @@ -127,7 +181,7 @@ findidx (const int32_t *table, while (++cnt < nhere); } - *cpp += nhere; + *cpp += nhere > 0 ? nhere : 1; return indirect[-i + offset]; } } diff --git a/locale/weightwc.h b/locale/weightwc.h index ab26482..4101dc8 100644 --- a/locale/weightwc.h +++ b/locale/weightwc.h @@ -21,7 +21,8 @@ /* Find index of weight. */ static inline int32_t __attribute__ ((always_inline)) -findidx (const int32_t *table, +findidx (uint_fast32_t encoding, + const int32_t *table, const int32_t *indirect, const wint_t *extra, const wint_t **cpp, size_t len) diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c index 229904e..07b60fb 100644 --- a/posix/fnmatch_loop.c +++ b/posix/fnmatch_loop.c @@ -383,6 +383,8 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, const int32_t *indirect; int32_t idx; const UCHAR *cp = (const UCHAR *) &str; + uint_fast32_t encoding = + _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); # if WIDE_CHAR_VERSION table = (const int32_t *) @@ -404,7 +406,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); # endif - idx = FINDIDX (table, indirect, extra, &cp, 1); + idx = FINDIDX (encoding, table, indirect, extra, &cp, 1); if (idx != 0) { /* We found a table entry. Now see whether the @@ -414,7 +416,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, int32_t idx2; const UCHAR *np = (const UCHAR *) n; - idx2 = FINDIDX (table, indirect, extra, + idx2 = FINDIDX (encoding, table, indirect, extra, &np, string_end - n); if (idx2 != 0 && (idx >> 24) == (idx2 >> 24) diff --git a/posix/regcomp.c b/posix/regcomp.c index b6126b7..011ef92 100644 --- a/posix/regcomp.c +++ b/posix/regcomp.c @@ -3414,6 +3414,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES); if (nrules != 0) { + uint_fast32_t encoding; const int32_t *table, *indirect; const unsigned char *weights, *extra, *cp; unsigned char char_buf[2]; @@ -3422,6 +3423,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) size_t len; /* Calculate the index for equivalence class. */ cp = name; + encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_WEIGHTMB); @@ -3429,7 +3431,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) _NL_COLLATE_EXTRAMB); indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); - idx1 = findidx (table, indirect, extra, &cp, -1); + idx1 = findidx (encoding, table, indirect, extra, &cp, -1); if (BE (idx1 == 0 || *cp != '\0', 0)) /* This isn't a valid character. */ return REG_ECOLLATE; @@ -3440,7 +3442,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) { char_buf[0] = ch; cp = char_buf; - idx2 = findidx (table, indirect, extra, &cp, 1); + idx2 = findidx (encoding, table, indirect, extra, &cp, 1); /* idx2 = table[ch]; */ diff --git a/posix/regex_internal.h b/posix/regex_internal.h index 02e040b..993c7c3 100644 --- a/posix/regex_internal.h +++ b/posix/regex_internal.h @@ -743,17 +743,19 @@ re_string_elem_size_at (const re_string_t *pstr, int idx) # ifdef _LIBC const unsigned char *p, *extra; const int32_t *table, *indirect; + uint_fast32_t encoding; uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES); if (nrules != 0) { + encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); extra = (const unsigned char *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB); indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); p = pstr->mbs + idx; - findidx (table, indirect, extra, &p, pstr->len - idx); + findidx (encoding, table, indirect, extra, &p, pstr->len - idx); return p - pstr->mbs - idx; } else diff --git a/posix/regexec.c b/posix/regexec.c index ec46c3a..3d3ad9a 100644 --- a/posix/regexec.c +++ b/posix/regexec.c @@ -3843,6 +3843,7 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, if (nrules != 0) { unsigned int in_collseq = 0; + uint_fast32_t encoding; const int32_t *table, *indirect; const unsigned char *weights, *extra; const char *collseqwc; @@ -3893,6 +3894,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, if (cset->nequiv_classes) { const unsigned char *cp = pin; + encoding = + _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); weights = (const unsigned char *) @@ -3901,7 +3904,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, _NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB); indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); - int32_t idx = findidx (table, indirect, extra, &cp, elem_len); + int32_t idx = findidx (encoding, table, indirect, extra, &cp, + elem_len); if (idx > 0) for (i = 0; i < cset->nequiv_classes; ++i) { diff --git a/string/strcoll_l.c b/string/strcoll_l.c index 4d1e3ab..2c2cab0 100644 --- a/string/strcoll_l.c +++ b/string/strcoll_l.c @@ -63,9 +63,9 @@ typedef struct /* Get next sequence. Traverse the string as required. */ static __always_inline void get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, - const USTRING_TYPE *weights, const int32_t *table, - const USTRING_TYPE *extra, const int32_t *indirect, - int pass) + const USTRING_TYPE *weights, uint_fast32_t encoding, + const int32_t *table, const USTRING_TYPE *extra, + const int32_t *indirect, int pass) { size_t val = seq->val = 0; int len = seq->len; @@ -109,7 +109,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, us = seq->back_us; while (i < backw) { - int32_t tmp = findidx (table, indirect, extra, &us, -1); + int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1); idx = tmp & 0xffffff; i++; } @@ -124,7 +124,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, while (*us != L('\0')) { - int32_t tmp = findidx (table, indirect, extra, &us, -1); + int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1); unsigned char rule = tmp >> 24; prev_idx = idx; idx = tmp & 0xffffff; @@ -253,6 +253,7 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) const USTRING_TYPE *weights; const USTRING_TYPE *extra; const int32_t *indirect; + uint_fast32_t encoding; if (nrules == 0) return STRCMP (s1, s2); @@ -271,6 +272,8 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_EXTRA,SUFFIX))].string; indirect = (const int32_t *) current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_INDIRECT,SUFFIX))].string; + encoding = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word; + assert (((uintptr_t) table) % __alignof__ (table[0]) == 0); assert (((uintptr_t) weights) % __alignof__ (weights[0]) == 0); @@ -310,9 +313,9 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) while (1) { - get_next_seq (&seq1, nrules, rulesets, weights, table, + get_next_seq (&seq1, nrules, rulesets, weights, encoding, table, extra, indirect, pass); - get_next_seq (&seq2, nrules, rulesets, weights, table, + get_next_seq (&seq2, nrules, rulesets, weights, encoding, table, extra, indirect, pass); /* See whether any or both strings are empty. */ if (seq1.len == 0 || seq2.len == 0) diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c index 22e24d3..5c89b15 100644 --- a/string/strxfrm_l.c +++ b/string/strxfrm_l.c @@ -53,6 +53,7 @@ typedef struct uint_fast32_t nrules; unsigned char *rulesets; USTRING_TYPE *weights; + uint_fast32_t encoding; int32_t *table; USTRING_TYPE *extra; int32_t *indirect; @@ -100,8 +101,8 @@ static __always_inline size_t find_idx (const USTRING_TYPE **us, int32_t *weight_idx, unsigned char *rule_idx, const locale_data_t *l_data, const int pass) { - int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us, - -1); + int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect, + l_data->extra, us, -1); *rule_idx = tmp >> 24; int32_t idx = tmp & 0xffffff; size_t len = l_data->weights[idx++]; @@ -693,6 +694,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l) /* Get the locale data. */ l_data.rulesets = (unsigned char *) current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string; + l_data.encoding = + current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word; l_data.table = (int32_t *) current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string; l_data.weights = (USTRING_TYPE *) @@ -721,8 +724,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l) do { - int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur, - -1); + int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect, + l_data.extra, &cur, -1); rulearr[idxmax] = tmp >> 24; idxarr[idxmax] = tmp & 0xffffff;