mbox series

[0/4] : C++ P1423R3 char8_t remediation implementation

Message ID 4491be7d-2cea-c437-b991-b1cc43e344ee@honermann.net
Headers show
Series : C++ P1423R3 char8_t remediation implementation | expand

Message

Tom Honermann Sept. 15, 2019, 7:39 p.m. UTC
This series of patches provides an implementation of the changes for C++ 
proposal P1423R3 [1].

These changes do not impact default libstdc++ behavior for C++17 and 
earlier; they are only active for C++2a or when the -fchar8_t option is 
specified.

Tested x86_64-linux.

Patch 1: Decouple constraints for u8path from path constructors.
Patch 2: Update __cpp_lib_char8_t feature test macro value, add deleted 
operators, update u8path.
Patch 3: Updates to existing tests.
Patch 4: New tests.

Tom.

[1]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r3.html

Comments

Jonathan Wakely Nov. 29, 2019, 5:45 p.m. UTC | #1
On 15/09/19 15:39 -0400, Tom Honermann wrote:
>This series of patches provides an implementation of the changes for 
>C++ proposal P1423R3 [1].
>
>These changes do not impact default libstdc++ behavior for C++17 and 
>earlier; they are only active for C++2a or when the -fchar8_t option 
>is specified.
>
>Tested x86_64-linux.
>
>Patch 1: Decouple constraints for u8path from path constructors.
>Patch 2: Update __cpp_lib_char8_t feature test macro value, add 
>deleted operators, update u8path.
>Patch 3: Updates to existing tests.
>Patch 4: New tests.
>
>Tom.
>
>[1]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r3.html

It took a while, but I've committed these four patches, with just some
minor whitespace changes and changelog tweaks.

I'm also following it up with this patch, which corrects some
pre-existing problems that got worse with the new deleted operator<<
overloads.

Tested powerpc64le-linux, committed to trunk.
Jonathan Wakely Nov. 29, 2019, 7:48 p.m. UTC | #2
On 29/11/19 17:45 +0000, Jonathan Wakely wrote:
>On 15/09/19 15:39 -0400, Tom Honermann wrote:
>>This series of patches provides an implementation of the changes for 
>>C++ proposal P1423R3 [1].
>>
>>These changes do not impact default libstdc++ behavior for C++17 and 
>>earlier; they are only active for C++2a or when the -fchar8_t option 
>>is specified.
>>
>>Tested x86_64-linux.
>>
>>Patch 1: Decouple constraints for u8path from path constructors.
>>Patch 2: Update __cpp_lib_char8_t feature test macro value, add 
>>deleted operators, update u8path.
>>Patch 3: Updates to existing tests.
>>Patch 4: New tests.
>>
>>Tom.
>>
>>[1]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r3.html
>
>It took a while, but I've committed these four patches, with just some
>minor whitespace changes and changelog tweaks.

Running the new tests revealed a latent bug on Windows, where
experimental::filesystem::u8path(const Source&) assumed the input
was an iterator over a NTCTS. That worked for a const char* but not a
std::string or experimental::string_view.

The attached patch fixes that (and simplifies the #if and if-constexpr
conditions for Windows) but there's a remaining bug. Constructing a
experimental::filesystem::path from a char8_t string doesn't do the
right thing on Windows, so these cases fails:

  fs::path p(u8"\xf0\x9d\x84\x9e");
  VERIFY( p.u8string() == u8"\U0001D11E" );

  p = fs::u8path(u8"\xf0\x9d\x84\x9e");
  VERIFY( p.u8string() == u8"\U0001D11E" );

It works correctly for std::filesystem::path, just not the TS version.

I plan to commit the attached patch after some more testing. I'll
backport something like it too.
Jonathan Wakely Nov. 29, 2019, 9:45 p.m. UTC | #3
On 29/11/19 19:48 +0000, Jonathan Wakely wrote:
>On 29/11/19 17:45 +0000, Jonathan Wakely wrote:
>>On 15/09/19 15:39 -0400, Tom Honermann wrote:
>>>This series of patches provides an implementation of the changes 
>>>for C++ proposal P1423R3 [1].
>>>
>>>These changes do not impact default libstdc++ behavior for C++17 
>>>and earlier; they are only active for C++2a or when the -fchar8_t 
>>>option is specified.
>>>
>>>Tested x86_64-linux.
>>>
>>>Patch 1: Decouple constraints for u8path from path constructors.
>>>Patch 2: Update __cpp_lib_char8_t feature test macro value, add 
>>>deleted operators, update u8path.
>>>Patch 3: Updates to existing tests.
>>>Patch 4: New tests.
>>>
>>>Tom.
>>>
>>>[1]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r3.html
>>
>>It took a while, but I've committed these four patches, with just some
>>minor whitespace changes and changelog tweaks.
>
>Running the new tests revealed a latent bug on Windows, where
>experimental::filesystem::u8path(const Source&) assumed the input
>was an iterator over a NTCTS. That worked for a const char* but not a
>std::string or experimental::string_view.
>
>The attached patch fixes that (and simplifies the #if and if-constexpr
>conditions for Windows) but there's a remaining bug. Constructing a
>experimental::filesystem::path from a char8_t string doesn't do the
>right thing on Windows, so these cases fails:
>
> fs::path p(u8"\xf0\x9d\x84\x9e");
> VERIFY( p.u8string() == u8"\U0001D11E" );
>
> p = fs::u8path(u8"\xf0\x9d\x84\x9e");
> VERIFY( p.u8string() == u8"\U0001D11E" );
>
>It works correctly for std::filesystem::path, just not the TS version.

I think this is the fix needed for the TS code:

--- a/libstdc++-v3/include/experimental/bits/fs_path.h
+++ b/libstdc++-v3/include/experimental/bits/fs_path.h
@@ -765,7 +765,14 @@ namespace __detail
       {
 #ifdef _GLIBCXX_USE_CHAR8_T
        if constexpr (is_same<_CharT, char8_t>::value)
-         return _S_wconvert((const char*)__f, (const char*)__l, true_type());
+         {
+           const char* __f2 = (const char*)__f;
+           const char* __l2 = (const char*)__l;
+           std::wstring __wstr;
+           std::codecvt_utf8_utf16<wchar_t> __wcvt;
+           if (__str_codecvt_in_all(__f2, __l2, __wstr, __wcvt))
+             return __wstr;
+         }
        else
 #endif
          {

The current code uses std::codecvt<wchar_t, char, mbstate_t> but when
we know the input is UTF-8 encoded we should use codecvt_utf8_utf16
(which is what the C++17 code already does for char8_t input).

I'll add that the patch I'm testing.
Jonathan Wakely Nov. 30, 2019, 1:03 a.m. UTC | #4
On 29/11/19 21:45 +0000, Jonathan Wakely wrote:
>On 29/11/19 19:48 +0000, Jonathan Wakely wrote:
>>On 29/11/19 17:45 +0000, Jonathan Wakely wrote:
>>>On 15/09/19 15:39 -0400, Tom Honermann wrote:
>>>>This series of patches provides an implementation of the changes 
>>>>for C++ proposal P1423R3 [1].
>>>>
>>>>These changes do not impact default libstdc++ behavior for C++17 
>>>>and earlier; they are only active for C++2a or when the 
>>>>-fchar8_t option is specified.
>>>>
>>>>Tested x86_64-linux.
>>>>
>>>>Patch 1: Decouple constraints for u8path from path constructors.
>>>>Patch 2: Update __cpp_lib_char8_t feature test macro value, add 
>>>>deleted operators, update u8path.
>>>>Patch 3: Updates to existing tests.
>>>>Patch 4: New tests.
>>>>
>>>>Tom.
>>>>
>>>>[1]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r3.html
>>>
>>>It took a while, but I've committed these four patches, with just some
>>>minor whitespace changes and changelog tweaks.
>>
>>Running the new tests revealed a latent bug on Windows, where
>>experimental::filesystem::u8path(const Source&) assumed the input
>>was an iterator over a NTCTS. That worked for a const char* but not a
>>std::string or experimental::string_view.
>>
>>The attached patch fixes that (and simplifies the #if and if-constexpr
>>conditions for Windows) but there's a remaining bug. Constructing a
>>experimental::filesystem::path from a char8_t string doesn't do the
>>right thing on Windows, so these cases fails:
>>
>>fs::path p(u8"\xf0\x9d\x84\x9e");
>>VERIFY( p.u8string() == u8"\U0001D11E" );
>>
>>p = fs::u8path(u8"\xf0\x9d\x84\x9e");
>>VERIFY( p.u8string() == u8"\U0001D11E" );
>>
>>It works correctly for std::filesystem::path, just not the TS version.
>
>I think this is the fix needed for the TS code:
>
>--- a/libstdc++-v3/include/experimental/bits/fs_path.h
>+++ b/libstdc++-v3/include/experimental/bits/fs_path.h
>@@ -765,7 +765,14 @@ namespace __detail
>      {
>#ifdef _GLIBCXX_USE_CHAR8_T
>       if constexpr (is_same<_CharT, char8_t>::value)
>-         return _S_wconvert((const char*)__f, (const char*)__l, true_type());
>+         {
>+           const char* __f2 = (const char*)__f;
>+           const char* __l2 = (const char*)__l;
>+           std::wstring __wstr;
>+           std::codecvt_utf8_utf16<wchar_t> __wcvt;
>+           if (__str_codecvt_in_all(__f2, __l2, __wstr, __wcvt))
>+             return __wstr;
>+         }
>       else
>#endif
>         {
>
>The current code uses std::codecvt<wchar_t, char, mbstate_t> but when
>we know the input is UTF-8 encoded we should use codecvt_utf8_utf16
>(which is what the C++17 code already does for char8_t input).
>
>I'll add that the patch I'm testing.

Here's the final patch I'm committing. Tested powerpc64le-linux and
x86_64-w64-mingw, committed to trunk.