[0/3] : C N2653 char8_t implementation

Message ID 770e6d4f-2dae-e2f9-c6c8-bb3d458ef796@honermann.net

Message

Tom Honermann June 7, 2021, 2:31 a.m. UTC
This series of patches implements the core language features for the 
WG14 N2653 [1] proposal to provide char8_t support in C.  These changes 
are intended to align char8_t support in C with the support provided in 
C++20 via WG21 P0482R6 [2].

These changes do not impact default gcc behavior.  The existing 
-fchar8_t option is extended to C compilation to enable the N2653 
changes, and -fno-char8_t is extended to explicitly disable them.  N2653 
has not yet been accepted by WG14, so no changes are made to handling of 
the C2X language dialect.

Patch 1: Language support
Patch 2: New tests
Patch 3: Documentation updates

Tom.

[1]: WG14 N2653
      "char8_t: A type for UTF-8 characters and strings (Revision 1)"
      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm

[2]: WG21 P0482R6
      "char8_t: A type for UTF-8 characters and strings (Revision 6)"
      https://wg21.link/p0482r6
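
To make the literal-type change concrete, here is a minimal sketch (not part of the patch series) that reports which element type a u8 string literal has in the dialect the translation unit is compiled under.  The enum and function names are hypothetical.  It assumes, per N2653, that char8_t is a typedef for unsigned char, and that the compiler applies array-to-pointer conversion to the controlling expression of _Generic, as GCC and Clang do.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum u8_element { U8_CHAR, U8_UNSIGNED_CHAR, U8_OTHER };

/* The literal decays to a pointer to its element type when used as the
   controlling expression of _Generic, so match on pointer types. */
#define U8_ELEMENT_OF(lit)                 \
    _Generic((lit),                        \
        char *: U8_CHAR,                   \
        unsigned char *: U8_UNSIGNED_CHAR, \
        default: U8_OTHER)

/* The size is the same in both dialects; only the element type differs. */
static_assert(sizeof(u8"a") == 2, "one code unit plus the terminating NUL");

/* Returns U8_CHAR where u8"" is an array of char (C11/C17 rules) and
   U8_UNSIGNED_CHAR where it is an array of char8_t (N2653 rules). */
enum u8_element u8_literal_element(void)
{
    return U8_ELEMENT_OF(u8"text");
}
```

Which value comes back depends on the compiler and the selected language version, so the sketch states no expected result.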

Comments

Joseph Myers June 7, 2021, 9:03 p.m. UTC | #1
On Sun, 6 Jun 2021, Tom Honermann via Gcc-patches wrote:

> These changes do not impact default gcc behavior.  The existing -fchar8_t
> option is extended to C compilation to enable the N2653 changes, and
> -fno-char8_t is extended to explicitly disable them.  N2653 has not yet been
> accepted by WG14, so no changes are made to handling of the C2X language
> dialect.

Why is that option needed?  Normally I'd expect features to be enabled or 
disabled based on the selected language version, rather than having 
separate options to adjust the configuration for one very specific feature 
in a language version.  Adding extra language dialects not corresponding 
to any standard version but to some peculiar mix of versions (such as C17 
with a changed type for u8"", or C2X with a changed type for u8'') needs a 
strong reason for those language dialects to be useful (for example, the 
-fgnu89-inline option was justified by widespread use of GNU-style extern 
inline in headers).

I think the whole patch series would best wait until after the proposal 
has been considered by a WG14 meeting, in addition to not increasing the 
number of language dialects supported.

Tom Honermann June 11, 2021, 3:42 p.m. UTC | #2
On 6/7/21 5:03 PM, Joseph Myers wrote:
> On Sun, 6 Jun 2021, Tom Honermann via Gcc-patches wrote:
>
>> These changes do not impact default gcc behavior.  The existing -fchar8_t
>> option is extended to C compilation to enable the N2653 changes, and
>> -fno-char8_t is extended to explicitly disable them.  N2653 has not yet been
>> accepted by WG14, so no changes are made to handling of the C2X language
>> dialect.
> Why is that option needed?  Normally I'd expect features to be enabled or
> disabled based on the selected language version, rather than having
> separate options to adjust the configuration for one very specific feature
> in a language version.  Adding extra language dialects not corresponding
> to any standard version but to some peculiar mix of versions (such as C17
> with a changed type for u8"", or C2X with a changed type for u8'') needs a
> strong reason for those language dialects to be useful (for example, the
> -fgnu89-inline option was justified by widespread use of GNU-style extern
> inline in headers).

The option is needed because it impacts core language backward 
compatibility (for both C and C++, the type of u8 string literals; for 
C++, the type of u8 character literals and the new char8_t fundamental 
type).

The ability to opt in to or opt out of the feature eases migration by 
enabling source code compatibility.  C and C++ standards are not 
published at the same cadence.  A project that targets C++20 and C17 may 
therefore need either to opt out of char8_t support on the C++ side 
(already possible via -fno-char8_t) or to opt in to char8_t support on 
the C side until such time as the targets change to C++20(+) and 
C23(+), assuming WG14 approval at some point.

>
> I think the whole patch series would best wait until after the proposal
> has been considered by a WG14 meeting, in addition to not increasing the
> number of language dialects supported.

As an opt-in feature, this is useful to gain implementation and 
deployment experience for WG14.

It would be appropriate to document this as an experimental feature 
pending WG14 approval.  If WG14 declines it or approves it with 
different behavior, the feature can then be removed or changed.

The option could also be introduced as -fexperimental-char8_t if that 
eases concerns, though I do not favor that approach due to misalignment 
with the existing option for C++.

Tom.

Joseph Myers June 11, 2021, 5:27 p.m. UTC | #3
On Fri, 11 Jun 2021, Tom Honermann via Gcc-patches wrote:

> The option is needed because it impacts core language backward compatibility
> (for both C and C++, the type of u8 string literals; for C++, the type of u8
> character literals and the new char8_t fundamental type).

Lots of new features in new standard versions can affect backward 
compatibility.  We generally bundle all of those up into a single -std 
option rather than having an explosion of different language variants with 
different features enabled or disabled.  I don't think this feature, for 
C, reaches the threshold that would justify having a separate option to 
control it, especially given that people can use -Wno-pointer-sign or 
pointer casts or their own local char8_t typedef as an intermediate step 
if they want code using u8"" strings to work for both old and new standard 
versions.
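
As a rough illustration of the "local typedef plus pointer cast" intermediate step, the sketch below compiles the same way whether u8"" has element type char or unsigned char.  The names utf8_char, U8 and utf8_length are hypothetical and chosen to avoid clashing with a char8_t that <uchar.h> may declare.

```c
#include <stddef.h>

/* Hypothetical local stand-in for char8_t. */
typedef unsigned char utf8_char;

/* Wrap a string literal so it always has type const utf8_char *,
   regardless of the element type the dialect gives u8"" literals.
   The empty u8"" prefix literal makes the concatenated result a
   UTF-8 string literal. */
#define U8(lit) ((const utf8_char *)u8"" lit)

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumes well-formed, NUL-terminated input. */
size_t utf8_length(const utf8_char *s)
{
    size_t n = 0;
    for (; *s != 0; ++s)
        if ((*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

A call such as utf8_length(U8("h\u00e9llo")) then needs no -Wno-pointer-sign and no per-dialect conditional code; the explicit cast hides the difference in literal type.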

I don't think u8"" strings are widely used in C library headers in a way 
where the choice of type matters.  (Use of a feature in library headers is 
a key thing that can justify options such as -fgnu89-inline, because it 
means the choice of language version is no longer fully under control of a 
single project.)

The only feature proposed for C2x that I think is likely to have 
significant compatibility implications in practice for a lot of code is 
making bool, true and false into keywords.  I still don't think a separate 
option makes sense there.  (If that feature is accepted for C2x, what 
would be useful is for people to do distribution rebuilds with -std=gnu2x 
as the default to find and fix code that breaks, in advance of the default 
actually changing in GCC.  But the workaround for not-yet-fixed code would 
be -std=gnu11, not a separate option for that one feature.)

> > I think the whole patch series would best wait until after the proposal
> > has been considered by a WG14 meeting, in addition to not increasing the
> > number of language dialects supported.
> 
> As an opt-in feature, this is useful to gain implementation and deployment
> experience for WG14.

I think this feature is one of the cases where experience in C++ is 
sufficiently relevant for C (although there are certainly cases of other 
language features where the languages are sufficiently different that 
using C++ experience like that can be problematic).

E.g. we didn't need -fdigit-separators for C before digit separators were 
added to C2x, and we don't need -fno-digit-separators now they are in C2x 
(the feature is just enabled or disabled based on the language version), 
although that's one of many features that do affect compatibility in 
corner cases.

Tom Honermann June 13, 2021, 3:35 p.m. UTC | #4
On 6/11/21 1:27 PM, Joseph Myers wrote:
> On Fri, 11 Jun 2021, Tom Honermann via Gcc-patches wrote:
>
>> The option is needed because it impacts core language backward compatibility
>> (for both C and C++, the type of u8 string literals; for C++, the type of u8
>> character literals and the new char8_t fundamental type).
> Lots of new features in new standard versions can affect backward
> compatibility.  We generally bundle all of those up into a single -std
> option rather than having an explosion of different language variants with
> different features enabled or disabled.  I don't think this feature, for
> C, reaches the threshold that would justify having a separate option to
> control it, especially given that people can use -Wno-pointer-sign or
> pointer casts or their own local char8_t typedef as an intermediate step
> if they want code using u8"" strings to work for both old and new standard
> versions.
Ok, I'm happy to defer to your experience.  My perspective is likely 
biased by the C++20 changes being more disruptive for that language.
>
> I don't think u8"" strings are widely used in C library headers in a way
> where the choice of type matters.  (Use of a feature in library headers is
> a key thing that can justify options such as -fgnu89-inline, because it
> means the choice of language version is no longer fully under control of a
> single project.)
That aligns with my expectations.
>
> The only feature proposed for C2x that I think is likely to have
> significant compatibility implications in practice for a lot of code is
> making bool, true and false into keywords.  I still don't think a separate
> option makes sense there.  (If that feature is accepted for C2x, what
> would be useful is for people to do distribution rebuilds with -std=gnu2x
> as the default to find and fix code that breaks, in advance of the default
> actually changing in GCC.  But the workaround for not-yet-fixed code would
> be -std=gnu11, not a separate option for that one feature.)
Ok, that comparison is helpful.
>
>>> I think the whole patch series would best wait until after the proposal
>>> has been considered by a WG14 meeting, in addition to not increasing the
>>> number of language dialects supported.
>> As an opt-in feature, this is useful to gain implementation and deployment
>> experience for WG14.
> I think this feature is one of the cases where experience in C++ is
> sufficiently relevant for C (although there are certainly cases of other
> language features where the languages are sufficiently different that
> using C++ experience like that can be problematic).
>
> E.g. we didn't need -fdigit-separators for C before digit separators were
> added to C2x, and we don't need -fno-digit-separators now they are in C2x
> (the feature is just enabled or disabled based on the language version),
> although that's one of many features that do affect compatibility in
> corner cases.

Got it, thanks again, that comparison is helpful.

Per this and prior messages, I'll revise the gcc patch series as follows 
(I'll likewise revise the glibc changes, but will detail that in the 
corresponding glibc mailing list thread).

 1. Remove the proposed use of -fchar8_t and -fno-char8_t for C code.
 2. Remove the updated documentation for the -fchar8_t option since it
    won't be applicable to C code.
 3. Remove the _CHAR8_T_SOURCE macro.
 4. Enable the change of u8 string literal type based on -std=[gnu|c]2x
    (by setting flag_char8_t if flag_isoc2x is set).
 5. Condition the declarations of atomic_char8_t and
    __GCC_ATOMIC_CHAR8_T_LOCK_FREE on _GNU_SOURCE or _ISOC2X_SOURCE.
 6. Remove the char8 data member from cpp_options that I had added and
    forgot to remove.
 7. Revise the tests and rename them for consistency with other C2x tests.
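
Item 4 in the list above can be pictured with a small stand-alone model.  This is only a sketch of the idea, not GCC's code: the struct, enum and function names are hypothetical, and the two fields merely mirror the flag_isoc2x and flag_char8_t variables named in the message.

```c
#include <stdbool.h>

/* Hypothetical model of the language versions selectable via -std. */
enum c_std { STD_C11, STD_C17, STD_C2X };

/* Hypothetical stand-in for the relevant compiler state. */
struct dialect_flags {
    bool isoc2x;        /* models flag_isoc2x: -std=c2x or -std=gnu2x */
    bool char8_enabled; /* models flag_char8_t */
};

/* Derive the char8_t behavior from the language version alone, with no
   separate command-line option for C. */
struct dialect_flags flags_for_std(enum c_std std)
{
    struct dialect_flags f;
    f.isoc2x = (std == STD_C2X);
    f.char8_enabled = f.isoc2x;
    return f;
}
```

With this shape the only way to get the new u8 literal type in C is to select the C2x dialect, which matches the "one -std option per language version" approach argued for in the review.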

If I've forgotten anything, please let me know.

Thank you for the thorough review!

Tom.