Message ID | 20181126221949.12172-1-krisman@collabora.com |
---|---|
Series | Support encoding awareness and casefold |
On Mon, Nov 26, 2018 at 05:19:45PM -0500, Gabriel Krisman Bertazi wrote:
> +static int utf8_casefold(const struct nls_table *table,
> +			 const unsigned char *str, size_t len,
> +			 unsigned char *dest, size_t dlen)
> +{
> +	const struct utf8data *data = utf8nfkdicf(UNICODE_AGE(10,0,0));
> +	struct utf8cursor cur;
> +	size_t nlen = 0;
> +
> +	if (utf8ncursor(&cur, data, str, len) < 0)
> +		goto invalid_seq;
> +
> +	for (nlen = 0; nlen < dlen; nlen++) {
> +		dest[nlen] = utf8byte(&cur);
> +		if (!dest[nlen])
> +			return nlen;
> +		if (dest[nlen] == -1)
> +			break;
> +	}
> +invalid_seq:
> +	/* Treat the sequence as a binary blob. */
> +	memcpy(dest, str, len);
> +	return len;
> +
> +}

So it looks like the interface is that if the destination buffer is too small OR the string is not a valid UTF-8 string, we treat it as a binary blob.

I wonder if we would be better off if this function actually signaled that there is a problem (buffer too small, invalid UTF-8 string)? It's fine to treat it as a binary blob and copy it out to the destination buffer, but I can imagine use cases where knowing this would be useful. *Especially* the destination-buffer-too-small case; I'm actually a little nervous about having it silently ignore that error condition and just copy the binary blob.

Also, there *really* needs to be a check that dlen >= len before the memcpy after the invalid_seq label.

- Ted
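One way the interface suggested above could look is sketched below. This is an illustration only, not the code from the patch: `fold_byte()` is a hypothetical stand-in for the kernel's `utf8ncursor()`/`utf8byte()` pair (it lowercases ASCII and rejects every byte >= 0x80), and the names `casefold_checked()` and `copy_blob()` as well as the choice of `EILSEQ`/`ERANGE` as error codes are assumptions. The point is that both failure modes are reported to the caller, and the binary-blob fallback checks the destination size before copying.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the real trie walk: lowercases ASCII and
 * treats any byte >= 0x80 as invalid. It exists only so that the
 * error paths below can be exercised.
 */
static int fold_byte(unsigned char c)
{
	if (c >= 0x80)
		return -1;
	if (c >= 'A' && c <= 'Z')
		return c - 'A' + 'a';
	return c;
}

/*
 * Returns the folded length on success, -EILSEQ if the input is
 * rejected by fold_byte(), or -ERANGE if dest is too small.
 * Nothing is silently copied on failure.
 */
static int casefold_checked(const unsigned char *str, size_t len,
			    unsigned char *dest, size_t dlen)
{
	size_t i;

	for (i = 0; i < len; i++) {
		int c = fold_byte(str[i]);

		if (c < 0)
			return -EILSEQ;
		if (i >= dlen)
			return -ERANGE;
		dest[i] = (unsigned char)c;
	}
	return (int)i;
}

/*
 * Explicit binary-blob fallback: the caller opts in, and the copy is
 * refused when the destination cannot hold the whole input.
 */
static int copy_blob(const unsigned char *str, size_t len,
		     unsigned char *dest, size_t dlen)
{
	if (len > dlen)
		return -ERANGE;
	memcpy(dest, str, len);
	return (int)len;
}
```

A caller that wants the original behavior could call `copy_blob()` only when `casefold_checked()` returns `-EILSEQ`, and propagate `-ERANGE` instead of hiding it.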
On Mon, Nov 26, 2018 at 05:19:45PM -0500, Gabriel Krisman Bertazi wrote:
> From: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
>
> We need this such that we can do normalization and casefolding
> compatible with the kernel, in order to properly support fsck
> verification and rehashing.
>
> The UTF-8 11.0 implementation is copied and adapted from the kernel code
> to ensure maximum compatibility. The decode trie in utf8data.h is
> generated using a script and the UCD sources in the kernel code.
>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>

One more thought. Are there any test cases we can add here? I assume the SGI folks must have had some test code that they used when they were developing their trie code. Was any of that released?

Maybe there are some Unicode normalization and case folding test vectors we can grab?

Thanks,

- Ted
"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> One more thought. Are there any test cases we can add here? I assume
> the SGI folks must have had some test code that they used when they
> were developing their trie code. Was any of that released?
>
> Maybe there are some Unicode normalization and case folding test
> vectors we can grab?

Since these files are generated and imported from the kernel code, I added the tests there instead. Should I duplicate them here, or move them here?
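On the question of test vectors: the Unicode Character Database ships NormalizationTest.txt, whose data lines hold five semicolon-separated columns (source, NFC, NFD, NFKC, NFKD), each a space-separated list of hexadecimal code points. A minimal sketch of turning one column into UTF-8 bytes, so that it could be fed to a normalization routine and compared, might look like the following. The helper names `put_utf8()` and `vector_field()` are hypothetical and are not part of the patch or of the kernel tests mentioned above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Encode one code point as UTF-8. Returns the number of bytes written,
 * or 0 if the code point is not encodable or does not fit in `room`. */
static size_t put_utf8(unsigned long cp, unsigned char *out, size_t room)
{
	size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;

	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) || n > room)
		return 0;
	switch (n) {
	case 1:
		out[0] = (unsigned char)cp;
		break;
	case 2:
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		break;
	case 3:
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		break;
	default:
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		break;
	}
	return n;
}

/* Convert column `col` (0-based) of a test line to UTF-8.
 * Returns the byte length, or -1 on a malformed line or short buffer. */
static int vector_field(const char *line, int col,
			unsigned char *out, size_t room)
{
	size_t len = 0;

	/* Skip to the requested column. */
	for (; col > 0; col--) {
		while (*line && *line != ';')
			line++;
		if (*line != ';')
			return -1;
		line++;
	}
	/* Parse hex code points until the closing ';'. */
	while (*line && *line != ';') {
		char *end;
		unsigned long cp = strtoul(line, &end, 16);
		size_t n;

		if (end == line)
			return -1;
		n = put_utf8(cp, out + len, room - len);
		if (n == 0)
			return -1;
		len += n;
		line = end;
		while (*line == ' ')
			line++;
	}
	if (*line != ';')
		return -1;
	return (int)len;
}
```

A test driver could then, for each data line, normalize the bytes from column 0 and compare the result against the bytes from the column matching the normalization form under test.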