diff mbox series

[v2] vl: set LC_CTYPE early in main() for all code

Message ID 20190415141547.15444-1-berrange@redhat.com
State New
Headers show
Series [v2] vl: set LC_CTYPE early in main() for all code | expand

Commit Message

Daniel P. Berrangé April 15, 2019, 2:15 p.m. UTC
Localization is not a feature whose impact is limited to the UI
frontends. Other parts of QEMU rely in localization. In particular the
USB MTP driver needs to be able to convert filenames from the locale
specified character set into UTF-16 / UCS-2 encoded wide characters.
setlocale() is only set from two of the UI frontends though, and worse,
there is inconsistent behaviour with GTK setting LC_CTYPE to C.UTF-8,
while ncurses honours whatever is set in the user's environment.

This causes the USP MTP driver to behave differently depending on which
UI frontend is activated.

Furthermore, the curses settings are dangerous because LC_CTYPE will affect
the is{upper,lower,alnum} functions which much QEMU code assumes to have
C locale sorting behaviour. This also breaks QMP if the env requests a
non-UTF-8 locale, since QMP is defined to use UTF-8 encoding for JSON.
This problematic curses code was introduced in:

  commit 2f8b7cd587558944532f587abb5203ce54badba9
  Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
  Date:   Mon Mar 11 14:51:27 2019 +0100

    curses: add option to specify VGA font encoding

This patch moves the GTK frontend setlocale() handling into the main()
method. This ensures QMP and other QEMU code has a predictable C.UTF-8.

Eventually QEMU should set LC_ALL, honouring the full user environment,
but this needs various cleanups in QEMU code first. Hardcoding LC_CTYPE
to C.UTF-8 is a partial regression vs the above curses commit, since it
will break the curses wide character handling for non-UTF-8 locales but
this is unavoidable until QEMU is cleaned up to cope with non-UTF-8
locales fully.

Setting of LC_MESSAGES is left in the GTK code since only the GTK
frontend is using translation of strings. This lets us avoid the
platform portability problem where LC_MESSAGES is not provided by
locale.h on MinGW. GTK pulls it in indirectly from libintl.h via
gi18n.h header, but we don't want to pull that into the global
QEMU namespace.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---

Changed in v2:

 - Leave LC_MESSAGES setting in gtk code to avoid platform
   portability problems.

 ui/curses.c |  2 --
 ui/gtk.c    | 32 +++++++++++---------------------
 vl.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 23 deletions(-)

Comments

Peter Maydell April 15, 2019, 2:25 p.m. UTC | #1
On Mon, 15 Apr 2019 at 15:17, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> Localization is not a feature whose impact is limited to the UI
> frontends. Other parts of QEMU rely in localization. In particular the
> USB MTP driver needs to be able to convert filenames from the locale
> specified character set into UTF-16 / UCS-2 encoded wide characters.
> setlocale() is only set from two of the UI frontends though, and worse,
> there is inconsistent behaviour with GTK setting LC_CTYPE to C.UTF-8,
> while ncurses honours whatever is set in the user's environment.
>
> This causes the USP MTP driver to behave differently depending on which
> UI frontend is activated.
>
> Furthermore, the curses settings are dangerous because LC_CTYPE will affect
> the is{upper,lower,alnum} functions which much QEMU code assumes to have
> C locale sorting behaviour. This also breaks QMP if the env requests a
> non-UTF-8 locale, since QMP is defined to use UTF-8 encoding for JSON.
> This problematic curses code was introduced in:
>
>   commit 2f8b7cd587558944532f587abb5203ce54badba9
>   Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
>   Date:   Mon Mar 11 14:51:27 2019 +0100
>
>     curses: add option to specify VGA font encoding
>
> This patch moves the GTK frontend setlocale() handling into the main()
> method. This ensures QMP and other QEMU code has a predictable C.UTF-8.
>
> Eventually QEMU should set LC_ALL, honouring the full user environment,
> but this needs various cleanups in QEMU code first. Hardcoding LC_CTYPE
> to C.UTF-8 is a partial regression vs the above curses commit, since it
> will break the curses wide character handling for non-UTF-8 locales but
> this is unavoidable until QEMU is cleaned up to cope with non-UTF-8
> locales fully.
>
> Setting of LC_MESSAGES is left in the GTK code since only the GTK
> frontend is using translation of strings. This lets us avoid the
> platform portability problem where LC_MESSAGES is not provided by
> locale.h on MinGW. GTK pulls it in indirectly from libintl.h via
> gi18n.h header, but we don't want to pull that into the global
> QEMU namespace.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

A few typo nits below...

>
> +    /*
> +     * Ideally we would set LC_ALL, but QEMU currently isn't able to cope
> +     * with arbitrary localization settings. In particular there are two
> +     * known problems
> +     *
> +     *   - The QMP monitor needs to use the C locale rules for numeric
> +     *     formatting. This would need a double/int -> string formatter
> +     *     that is locale independant.

"independent"

> +     *
> +     *   - The QMP monitor needs to encode all data as UTF-8. This needs
> +     *     to be updated to use iconv(3) to explicitly convert the current
> +     *     locale's charset into utf-8
> +     *
> +     *   - Lots of codes uses is{upper,lower,alnum,...} functions, expecting

"code"

> +     *     C locale sorting behaviour. Most QEMU usage should likely be
> +     *     changed to g_ascii_is{upper,lower,alnum...} to match code
> +     *     assumptions, without being broken by locale settnigs.

"settings"

> +     *
> +     * We do still have two requirements
> +     *
> +     *   - Ability to correct display translated text according to the
> +     *     user's locale
> +     *
> +     *   - Ability to handle multibyte characters, ideally according to
> +     *     user's locale specified character set. This affects ability
> +     *     of usb-mtp to correctly convert filenames to UCS16 and curses
> +     *     & GTK frontends wide character display.
> +     *
> +     * The second requirement would need LC_CTYPE to be honoured, but
> +     * this conflicts with the 2nd & 3rd problems listed earlier. For
> +     * now we make a tradeoff, trying to set an explicit UTF-8 localee

"locale"

> +     *
> +     * Note we can't set LC_MESSAGES here, since mingw doesn't define
> +     * this constant in locale.h Fortunately we only need it for the
> +     * GTK frontend and that uses gi18n.h which pulls in a definition
> +     * of LC_MESSAGES.
> +     */

thanks
-- PMM
Eric Blake April 15, 2019, 3:25 p.m. UTC | #2
On 4/15/19 9:25 AM, Peter Maydell wrote:
> On Mon, 15 Apr 2019 at 15:17, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>

> A few typo nits below...

Also:

> 
>>
>> +    /*
>> +     * Ideally we would set LC_ALL, but QEMU currently isn't able to cope
>> +     * with arbitrary localization settings. In particular there are two
>> +     * known problems

"two known problems" but...

>> +     *
>> +     *   - The QMP monitor needs to use the C locale rules for numeric
>> +     *     formatting. This would need a double/int -> string formatter
>> +     *     that is locale independant.
> 
> "independent"
> 
>> +     *
>> +     *   - The QMP monitor needs to encode all data as UTF-8. This needs
>> +     *     to be updated to use iconv(3) to explicitly convert the current
>> +     *     locale's charset into utf-8
>> +     *
>> +     *   - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
> 
> "code"
> 
>> +     *     C locale sorting behaviour. Most QEMU usage should likely be
>> +     *     changed to g_ascii_is{upper,lower,alnum...} to match code
>> +     *     assumptions, without being broken by locale settnigs.
> 
> "settings"

...three bullet points.
Peter Maydell April 15, 2019, 3:46 p.m. UTC | #3
On Mon, 15 Apr 2019 at 16:25, Eric Blake <eblake@redhat.com> wrote:
> "two known problems" but...
>
> ...three bullet points.

Nobody expects the Spanish Inquisition to be code reviewed :-)

thanks
-- PMM
Daniel P. Berrangé April 15, 2019, 3:51 p.m. UTC | #4
On Mon, Apr 15, 2019 at 10:25:03AM -0500, Eric Blake wrote:
> On 4/15/19 9:25 AM, Peter Maydell wrote:
> > On Mon, 15 Apr 2019 at 15:17, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>
> 
> > A few typo nits below...
> 
> Also:
> 
> > 
> >>
> >> +    /*
> >> +     * Ideally we would set LC_ALL, but QEMU currently isn't able to cope
> >> +     * with arbitrary localization settings. In particular there are two
> >> +     * known problems
> 
> "two known problems" but...
> 
> >> +     *
> >> +     *   - The QMP monitor needs to use the C locale rules for numeric
> >> +     *     formatting. This would need a double/int -> string formatter
> >> +     *     that is locale independant.
> > 
> > "independent"
> > 
> >> +     *
> >> +     *   - The QMP monitor needs to encode all data as UTF-8. This needs
> >> +     *     to be updated to use iconv(3) to explicitly convert the current
> >> +     *     locale's charset into utf-8
> >> +     *
> >> +     *   - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
> > 
> > "code"
> > 
> >> +     *     C locale sorting behaviour. Most QEMU usage should likely be
> >> +     *     changed to g_ascii_is{upper,lower,alnum...} to match code
> >> +     *     assumptions, without being broken by locale settnigs.
> > 
> > "settings"
> 
> ...three bullet points.

Heh, I knew I should have used "some" instead of "two" earlier when
first writing this. It was inevitable I'd think of another problem
to add :-)

Regards,
Daniel
Markus Armbruster April 16, 2019, 7:49 a.m. UTC | #5
Daniel P. Berrangé <berrange@redhat.com> writes:

> Localization is not a feature whose impact is limited to the UI
> frontends. Other parts of QEMU rely in localization.

Typically by accident :)

>                                                      In particular the
> USB MTP driver needs to be able to convert filenames from the locale
> specified character set into UTF-16 / UCS-2 encoded wide characters.

Oookay.  I guess that makes some sense.

> setlocale() is only set from two of the UI frontends though, and worse,
> there is inconsistent behaviour with GTK setting LC_CTYPE to C.UTF-8,
> while ncurses honours whatever is set in the user's environment.
>
> This causes the USP MTP driver to behave differently depending on which
> UI frontend is activated.
>
> Furthermore, the curses settings are dangerous because LC_CTYPE will affect
> the is{upper,lower,alnum} functions which much QEMU code assumes to have
> C locale sorting behaviour. This also breaks QMP if the env requests a
> non-UTF-8 locale, since QMP is defined to use UTF-8 encoding for JSON.

Can you elaborate on how QMP breaks with a non-UTF-8 locale?  Ah, you do
in the comment you add to vl.c.

> This problematic curses code was introduced in:
>
>   commit 2f8b7cd587558944532f587abb5203ce54badba9
>   Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
>   Date:   Mon Mar 11 14:51:27 2019 +0100
>
>     curses: add option to specify VGA font encoding
>
> This patch moves the GTK frontend setlocale() handling into the main()
> method. This ensures QMP and other QEMU code has a predictable C.UTF-8.
>
> Eventually QEMU should set LC_ALL, honouring the full user environment,
> but this needs various cleanups in QEMU code first.

Eventually, we'll all be dead.

>                                                     Hardcoding LC_CTYPE
> to C.UTF-8 is a partial regression vs the above curses commit, since it
> will break the curses wide character handling for non-UTF-8 locales but
> this is unavoidable until QEMU is cleaned up to cope with non-UTF-8
> locales fully.
>
> Setting of LC_MESSAGES is left in the GTK code since only the GTK
> frontend is using translation of strings. This lets us avoid the
> platform portability problem where LC_MESSAGES is not provided by
> locale.h on MinGW. GTK pulls it in indirectly from libintl.h via
> gi18n.h header, but we don't want to pull that into the global
> QEMU namespace.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>
> Changed in v2:
>
>  - Leave LC_MESSAGES setting in gtk code to avoid platform
>    portability problems.
>
>  ui/curses.c |  2 --
>  ui/gtk.c    | 32 +++++++++++---------------------
>  vl.c        | 40 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 51 insertions(+), 23 deletions(-)
>
> diff --git a/ui/curses.c b/ui/curses.c
> index cc6d6da684..403cd57913 100644
> --- a/ui/curses.c
> +++ b/ui/curses.c
> @@ -27,7 +27,6 @@
>  #include <sys/ioctl.h>
>  #include <termios.h>
>  #endif
> -#include <locale.h>
>  #include <wchar.h>
>  #include <langinfo.h>
>  #include <iconv.h>
> @@ -716,7 +715,6 @@ static void curses_display_init(DisplayState *ds, DisplayOptions *opts)
>      }
>  #endif
>  
> -    setlocale(LC_CTYPE, "");
>      if (opts->u.curses.charset) {
>          font_charset = opts->u.curses.charset;
>      }
> diff --git a/ui/gtk.c b/ui/gtk.c
> index e96e15435a..e6a41c79ab 100644
> --- a/ui/gtk.c
> +++ b/ui/gtk.c
> @@ -2208,12 +2208,12 @@ static void gtk_display_init(DisplayState *ds, DisplayOptions *opts)
>  
>      s->free_scale = FALSE;
>  
> -    /* Mostly LC_MESSAGES only. See early_gtk_display_init() for details. For
> -     * LC_CTYPE, we need to make sure that non-ASCII characters are considered
> -     * printable, but without changing any of the character classes to make
> -     * sure that we don't accidentally break implicit assumptions.  */
> +    /*
> +     * See comment in main() for why it has only setup LC_CTYPE,
> +     * as opposed to LC_ALL. Given that we need to enable
> +     * LC_MESSAGES too for menu translations.
> +     */
>      setlocale(LC_MESSAGES, "");
> -    setlocale(LC_CTYPE, "C.UTF-8");
>      bindtextdomain("qemu", CONFIG_QEMU_LOCALEDIR);
>      textdomain("qemu");
>  
> @@ -2262,22 +2262,12 @@ static void gtk_display_init(DisplayState *ds, DisplayOptions *opts)
>  
>  static void early_gtk_display_init(DisplayOptions *opts)
>  {
> -    /* The QEMU code relies on the assumption that it's always run in
> -     * the C locale. Therefore it is not prepared to deal with
> -     * operations that produce different results depending on the
> -     * locale, such as printf's formatting of decimal numbers, and
> -     * possibly others.
> -     *
> -     * Since GTK+ calls setlocale() by default -importing the locale
> -     * settings from the environment- we must prevent it from doing so
> -     * using gtk_disable_setlocale().
> -     *
> -     * QEMU's GTK+ UI, however, _does_ have translations for some of
> -     * the menu items. As a trade-off between a functionally correct
> -     * QEMU and a fully internationalized UI we support importing
> -     * LC_MESSAGES from the environment (see the setlocale() call
> -     * earlier in this file). This allows us to display translated
> -     * messages leaving everything else untouched.
> +    /*
> +     * GTK calls setlocale() by default, importing the locale
> +     * settings from the environment. QEMU's main() method will
> +     * have set LC_MESSAGES and LC_CTYPE to allow GTK to display
> +     * translated messages, including use of wide characters. We
> +     * must not allow GTK to change other aspects of the locale.
>       */
>      gtk_disable_setlocale();
>      gtkinit = gtk_init_check(NULL, NULL);
> diff --git a/vl.c b/vl.c
> index c696ad2a13..87a55cddb3 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -49,6 +49,7 @@ int main(int argc, char **argv)
>  #define main qemu_main
>  #endif /* CONFIG_COCOA */
>  
> +#include <locale.h>
>  
>  #include "qemu/error-report.h"
>  #include "qemu/sockets.h"
> @@ -3022,6 +3023,45 @@ int main(int argc, char **argv, char **envp)
>      char *dir, **dirs;
>      BlockdevOptionsQueue bdo_queue = QSIMPLEQ_HEAD_INITIALIZER(bdo_queue);
>  
> +    /*
> +     * Ideally we would set LC_ALL, but QEMU currently isn't able to cope
> +     * with arbitrary localization settings. In particular there are two
> +     * known problems
> +     *
> +     *   - The QMP monitor needs to use the C locale rules for numeric
> +     *     formatting. This would need a double/int -> string formatter
> +     *     that is locale independant.

Aha.

Reminds me of the story of a FORTRAN compiler that rejected only
far-away customers' programs...

> +     *
> +     *   - The QMP monitor needs to encode all data as UTF-8. This needs
> +     *     to be updated to use iconv(3) to explicitly convert the current
> +     *     locale's charset into utf-8

Really?  What locale-dependent text gets sent to QMP?

> +     *
> +     *   - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
> +     *     C locale sorting behaviour. Most QEMU usage should likely be
> +     *     changed to g_ascii_is{upper,lower,alnum...} to match code
> +     *     assumptions, without being broken by locale settnigs.
> +     *
> +     * We do still have two requirements
> +     *
> +     *   - Ability to correct display translated text according to the
> +     *     user's locale
> +     *
> +     *   - Ability to handle multibyte characters, ideally according to
> +     *     user's locale specified character set. This affects ability
> +     *     of usb-mtp to correctly convert filenames to UCS16 and curses
> +     *     & GTK frontends wide character display.
> +     *
> +     * The second requirement would need LC_CTYPE to be honoured, but
> +     * this conflicts with the 2nd & 3rd problems listed earlier. For
> +     * now we make a tradeoff, trying to set an explicit UTF-8 localee
> +     *
> +     * Note we can't set LC_MESSAGES here, since mingw doesn't define
> +     * this constant in locale.h Fortunately we only need it for the
> +     * GTK frontend and that uses gi18n.h which pulls in a definition
> +     * of LC_MESSAGES.
> +     */
> +    setlocale(LC_CTYPE, "C.UTF-8");
> +
>      module_call_init(MODULE_INIT_TRACE);
>  
>      qemu_init_cpu_list();

We should've stayed out of the GUI business.
Daniel P. Berrangé April 16, 2019, 9:03 a.m. UTC | #6
On Tue, Apr 16, 2019 at 09:49:09AM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > Localization is not a feature whose impact is limited to the UI
> > frontends. Other parts of QEMU rely in localization.
> 
> Typically by accident :)
> 
> >                                                      In particular the
> > USB MTP driver needs to be able to convert filenames from the locale
> > specified character set into UTF-16 / UCS-2 encoded wide characters.
> 
> Oookay.  I guess that makes some sense.
> 
> > setlocale() is only set from two of the UI frontends though, and worse,
> > there is inconsistent behaviour with GTK setting LC_CTYPE to C.UTF-8,
> > while ncurses honours whatever is set in the user's environment.
> >
> > This causes the USP MTP driver to behave differently depending on which
> > UI frontend is activated.
> >
> > Furthermore, the curses settings are dangerous because LC_CTYPE will affect
> > the is{upper,lower,alnum} functions which much QEMU code assumes to have
> > C locale sorting behaviour. This also breaks QMP if the env requests a
> > non-UTF-8 locale, since QMP is defined to use UTF-8 encoding for JSON.
> 
> Can you elaborate on how QMP breaks with a non-UTF-8 locale?  Ah, you do
> in the comment you add to vl.c.
> 
> > This problematic curses code was introduced in:
> >
> >   commit 2f8b7cd587558944532f587abb5203ce54badba9
> >   Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
> >   Date:   Mon Mar 11 14:51:27 2019 +0100
> >
> >     curses: add option to specify VGA font encoding
> >
> > This patch moves the GTK frontend setlocale() handling into the main()
> > method. This ensures QMP and other QEMU code has a predictable C.UTF-8.
> >
> > Eventually QEMU should set LC_ALL, honouring the full user environment,
> > but this needs various cleanups in QEMU code first.
> 
> Eventually, we'll all be dead.
> 
> >                                                     Hardcoding LC_CTYPE
> > to C.UTF-8 is a partial regression vs the above curses commit, since it
> > will break the curses wide character handling for non-UTF-8 locales but
> > this is unavoidable until QEMU is cleaned up to cope with non-UTF-8
> > locales fully.
> >
> > Setting of LC_MESSAGES is left in the GTK code since only the GTK
> > frontend is using translation of strings. This lets us avoid the
> > platform portability problem where LC_MESSAGES is not provided by
> > locale.h on MinGW. GTK pulls it in indirectly from libintl.h via
> > gi18n.h header, but we don't want to pull that into the global
> > QEMU namespace.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >
> > Changed in v2:
> >
> >  - Leave LC_MESSAGES setting in gtk code to avoid platform
> >    portability problems.
> >
> >  ui/curses.c |  2 --
> >  ui/gtk.c    | 32 +++++++++++---------------------
> >  vl.c        | 40 ++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 51 insertions(+), 23 deletions(-)
> >
> > diff --git a/ui/curses.c b/ui/curses.c
> > index cc6d6da684..403cd57913 100644
> > --- a/ui/curses.c
> > +++ b/ui/curses.c
> > @@ -27,7 +27,6 @@
> >  #include <sys/ioctl.h>
> >  #include <termios.h>
> >  #endif
> > -#include <locale.h>
> >  #include <wchar.h>
> >  #include <langinfo.h>
> >  #include <iconv.h>
> > @@ -716,7 +715,6 @@ static void curses_display_init(DisplayState *ds, DisplayOptions *opts)
> >      }
> >  #endif
> >  
> > -    setlocale(LC_CTYPE, "");
> >      if (opts->u.curses.charset) {
> >          font_charset = opts->u.curses.charset;
> >      }
> > diff --git a/ui/gtk.c b/ui/gtk.c
> > index e96e15435a..e6a41c79ab 100644
> > --- a/ui/gtk.c
> > +++ b/ui/gtk.c
> > @@ -2208,12 +2208,12 @@ static void gtk_display_init(DisplayState *ds, DisplayOptions *opts)
> >  
> >      s->free_scale = FALSE;
> >  
> > -    /* Mostly LC_MESSAGES only. See early_gtk_display_init() for details. For
> > -     * LC_CTYPE, we need to make sure that non-ASCII characters are considered
> > -     * printable, but without changing any of the character classes to make
> > -     * sure that we don't accidentally break implicit assumptions.  */
> > +    /*
> > +     * See comment in main() for why it has only setup LC_CTYPE,
> > +     * as opposed to LC_ALL. Given that we need to enable
> > +     * LC_MESSAGES too for menu translations.
> > +     */
> >      setlocale(LC_MESSAGES, "");
> > -    setlocale(LC_CTYPE, "C.UTF-8");
> >      bindtextdomain("qemu", CONFIG_QEMU_LOCALEDIR);
> >      textdomain("qemu");
> >  
> > @@ -2262,22 +2262,12 @@ static void gtk_display_init(DisplayState *ds, DisplayOptions *opts)
> >  
> >  static void early_gtk_display_init(DisplayOptions *opts)
> >  {
> > -    /* The QEMU code relies on the assumption that it's always run in
> > -     * the C locale. Therefore it is not prepared to deal with
> > -     * operations that produce different results depending on the
> > -     * locale, such as printf's formatting of decimal numbers, and
> > -     * possibly others.
> > -     *
> > -     * Since GTK+ calls setlocale() by default -importing the locale
> > -     * settings from the environment- we must prevent it from doing so
> > -     * using gtk_disable_setlocale().
> > -     *
> > -     * QEMU's GTK+ UI, however, _does_ have translations for some of
> > -     * the menu items. As a trade-off between a functionally correct
> > -     * QEMU and a fully internationalized UI we support importing
> > -     * LC_MESSAGES from the environment (see the setlocale() call
> > -     * earlier in this file). This allows us to display translated
> > -     * messages leaving everything else untouched.
> > +    /*
> > +     * GTK calls setlocale() by default, importing the locale
> > +     * settings from the environment. QEMU's main() method will
> > +     * have set LC_MESSAGES and LC_CTYPE to allow GTK to display
> > +     * translated messages, including use of wide characters. We
> > +     * must not allow GTK to change other aspects of the locale.
> >       */
> >      gtk_disable_setlocale();
> >      gtkinit = gtk_init_check(NULL, NULL);
> > diff --git a/vl.c b/vl.c
> > index c696ad2a13..87a55cddb3 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -49,6 +49,7 @@ int main(int argc, char **argv)
> >  #define main qemu_main
> >  #endif /* CONFIG_COCOA */
> >  
> > +#include <locale.h>
> >  
> >  #include "qemu/error-report.h"
> >  #include "qemu/sockets.h"
> > @@ -3022,6 +3023,45 @@ int main(int argc, char **argv, char **envp)
> >      char *dir, **dirs;
> >      BlockdevOptionsQueue bdo_queue = QSIMPLEQ_HEAD_INITIALIZER(bdo_queue);
> >  
> > +    /*
> > +     * Ideally we would set LC_ALL, but QEMU currently isn't able to cope
> > +     * with arbitrary localization settings. In particular there are two
> > +     * known problems
> > +     *
> > +     *   - The QMP monitor needs to use the C locale rules for numeric
> > +     *     formatting. This would need a double/int -> string formatter
> > +     *     that is locale independant.
> 
> Aha.
> 
> Reminds me of the story of a FORTRAN compiler that rejected only
> far-away customers' programs...
> 
> > +     *
> > +     *   - The QMP monitor needs to encode all data as UTF-8. This needs
> > +     *     to be updated to use iconv(3) to explicitly convert the current
> > +     *     locale's charset into utf-8
> 
> Really?  What locale-dependent text gets sent to QMP?

The main thing I can see would be filenames.

Though having said it is UTF-8 on looking more closely I think QEMU is
probably 8-bit clean in its handling, so will just be blindly passing
whatever filename string it get from libvirt straight on to the kernel
with no interpretation.

Libvirt has enabled UTF-8 validation in its JSON library when encoding
data it sends to QEMU, so any data libvirt is sending will be a valid
UTF-8 byte sequence at least. Libvirt doesn't axctually do any charset
conversion though, so if libvirt runs in a non-UTF8 locale it will
likely trip over this UTF-8 validation.



> 
> > +     *
> > +     *   - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
> > +     *     C locale sorting behaviour. Most QEMU usage should likely be
> > +     *     changed to g_ascii_is{upper,lower,alnum...} to match code
> > +     *     assumptions, without being broken by locale settnigs.
> > +     *
> > +     * We do still have two requirements
> > +     *
> > +     *   - Ability to correct display translated text according to the
> > +     *     user's locale
> > +     *
> > +     *   - Ability to handle multibyte characters, ideally according to
> > +     *     user's locale specified character set. This affects ability
> > +     *     of usb-mtp to correctly convert filenames to UCS16 and curses
> > +     *     & GTK frontends wide character display.
> > +     *
> > +     * The second requirement would need LC_CTYPE to be honoured, but
> > +     * this conflicts with the 2nd & 3rd problems listed earlier. For
> > +     * now we make a tradeoff, trying to set an explicit UTF-8 localee
> > +     *
> > +     * Note we can't set LC_MESSAGES here, since mingw doesn't define
> > +     * this constant in locale.h Fortunately we only need it for the
> > +     * GTK frontend and that uses gi18n.h which pulls in a definition
> > +     * of LC_MESSAGES.
> > +     */
> > +    setlocale(LC_CTYPE, "C.UTF-8");
> > +
> >      module_call_init(MODULE_INIT_TRACE);
> >  
> >      qemu_init_cpu_list();
> 
> We should've stayed out of the GUI business.

This isn't only a GUI problem as above, it affects USB MTP.

Regards,
Daniel
Markus Armbruster April 16, 2019, 4:01 p.m. UTC | #7
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Apr 16, 2019 at 09:49:09AM +0200, Markus Armbruster wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > Localization is not a feature whose impact is limited to the UI
>> > frontends. Other parts of QEMU rely in localization.

Typo: rely on

>> Typically by accident :)
>> 
>> >                                                      In particular the
>> > USB MTP driver needs to be able to convert filenames from the locale
>> > specified character set into UTF-16 / UCS-2 encoded wide characters.
>> 
>> Oookay.  I guess that makes some sense.
>> 
>> > setlocale() is only set from two of the UI frontends though, and worse,
>> > there is inconsistent behaviour with GTK setting LC_CTYPE to C.UTF-8,
>> > while ncurses honours whatever is set in the user's environment.
>> >
>> > This causes the USP MTP driver to behave differently depending on which
>> > UI frontend is activated.
>> >
>> > Furthermore, the curses settings are dangerous because LC_CTYPE will affect
>> > the is{upper,lower,alnum} functions which much QEMU code assumes to have
>> > C locale sorting behaviour. This also breaks QMP if the env requests a
>> > non-UTF-8 locale, since QMP is defined to use UTF-8 encoding for JSON.
>> 
>> Can you elaborate on how QMP breaks with a non-UTF-8 locale?  Ah, you do
>> in the comment you add to vl.c.
>> 
>> > This problematic curses code was introduced in:
>> >
>> >   commit 2f8b7cd587558944532f587abb5203ce54badba9
>> >   Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
>> >   Date:   Mon Mar 11 14:51:27 2019 +0100
>> >
>> >     curses: add option to specify VGA font encoding
>> >
>> > This patch moves the GTK frontend setlocale() handling into the main()
>> > method. This ensures QMP and other QEMU code has a predictable C.UTF-8.
>> >
>> > Eventually QEMU should set LC_ALL, honouring the full user environment,
>> > but this needs various cleanups in QEMU code first.
>> 
>> Eventually, we'll all be dead.
>> 
>> >                                                     Hardcoding LC_CTYPE
>> > to C.UTF-8 is a partial regression vs the above curses commit, since it
>> > will break the curses wide character handling for non-UTF-8 locales but
>> > this is unavoidable until QEMU is cleaned up to cope with non-UTF-8
>> > locales fully.
>> >
>> > Setting of LC_MESSAGES is left in the GTK code since only the GTK
>> > frontend is using translation of strings. This lets us avoid the
>> > platform portability problem where LC_MESSAGES is not provided by
>> > locale.h on MinGW. GTK pulls it in indirectly from libintl.h via
>> > gi18n.h header, but we don't want to pull that into the global
>> > QEMU namespace.
>> >
>> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>> > ---
>> >
>> > Changed in v2:
>> >
>> >  - Leave LC_MESSAGES setting in gtk code to avoid platform
>> >    portability problems.
>> >
>> >  ui/curses.c |  2 --
>> >  ui/gtk.c    | 32 +++++++++++---------------------
>> >  vl.c        | 40 ++++++++++++++++++++++++++++++++++++++++
>> >  3 files changed, 51 insertions(+), 23 deletions(-)
>> >
>> > diff --git a/ui/curses.c b/ui/curses.c
>> > index cc6d6da684..403cd57913 100644
>> > --- a/ui/curses.c
>> > +++ b/ui/curses.c
>> > @@ -27,7 +27,6 @@
>> >  #include <sys/ioctl.h>
>> >  #include <termios.h>
>> >  #endif
>> > -#include <locale.h>
>> >  #include <wchar.h>
>> >  #include <langinfo.h>
>> >  #include <iconv.h>
>> > @@ -716,7 +715,6 @@ static void curses_display_init(DisplayState *ds, DisplayOptions *opts)
>> >      }
>> >  #endif
>> >  
>> > -    setlocale(LC_CTYPE, "");
>> >      if (opts->u.curses.charset) {
>> >          font_charset = opts->u.curses.charset;
>> >      }
>> > diff --git a/ui/gtk.c b/ui/gtk.c
>> > index e96e15435a..e6a41c79ab 100644
>> > --- a/ui/gtk.c
>> > +++ b/ui/gtk.c
>> > @@ -2208,12 +2208,12 @@ static void gtk_display_init(DisplayState *ds, DisplayOptions *opts)
>> >  
>> >      s->free_scale = FALSE;
>> >  
>> > -    /* Mostly LC_MESSAGES only. See early_gtk_display_init() for details. For
>> > -     * LC_CTYPE, we need to make sure that non-ASCII characters are considered
>> > -     * printable, but without changing any of the character classes to make
>> > -     * sure that we don't accidentally break implicit assumptions.  */
>> > +    /*
>> > +     * See comment in main() for why it has only setup LC_CTYPE,
>> > +     * as opposed to LC_ALL. Given that we need to enable
>> > +     * LC_MESSAGES too for menu translations.
>> > +     */
>> >      setlocale(LC_MESSAGES, "");
>> > -    setlocale(LC_CTYPE, "C.UTF-8");
>> >      bindtextdomain("qemu", CONFIG_QEMU_LOCALEDIR);
>> >      textdomain("qemu");
>> >  
>> > @@ -2262,22 +2262,12 @@ static void gtk_display_init(DisplayState *ds, DisplayOptions *opts)
>> >  
>> >  static void early_gtk_display_init(DisplayOptions *opts)
>> >  {
>> > -    /* The QEMU code relies on the assumption that it's always run in
>> > -     * the C locale. Therefore it is not prepared to deal with
>> > -     * operations that produce different results depending on the
>> > -     * locale, such as printf's formatting of decimal numbers, and
>> > -     * possibly others.
>> > -     *
>> > -     * Since GTK+ calls setlocale() by default -importing the locale
>> > -     * settings from the environment- we must prevent it from doing so
>> > -     * using gtk_disable_setlocale().
>> > -     *
>> > -     * QEMU's GTK+ UI, however, _does_ have translations for some of
>> > -     * the menu items. As a trade-off between a functionally correct
>> > -     * QEMU and a fully internationalized UI we support importing
>> > -     * LC_MESSAGES from the environment (see the setlocale() call
>> > -     * earlier in this file). This allows us to display translated
>> > -     * messages leaving everything else untouched.
>> > +    /*
>> > +     * GTK calls setlocale() by default, importing the locale
>> > +     * settings from the environment. QEMU's main() method will
>> > +     * have set LC_MESSAGES and LC_CTYPE to allow GTK to display
>> > +     * translated messages, including use of wide characters. We
>> > +     * must not allow GTK to change other aspects of the locale.
>> >       */
>> >      gtk_disable_setlocale();
>> >      gtkinit = gtk_init_check(NULL, NULL);
>> > diff --git a/vl.c b/vl.c
>> > index c696ad2a13..87a55cddb3 100644
>> > --- a/vl.c
>> > +++ b/vl.c
>> > @@ -49,6 +49,7 @@ int main(int argc, char **argv)
>> >  #define main qemu_main
>> >  #endif /* CONFIG_COCOA */
>> >  
>> > +#include <locale.h>
>> >  
>> >  #include "qemu/error-report.h"
>> >  #include "qemu/sockets.h"
>> > @@ -3022,6 +3023,45 @@ int main(int argc, char **argv, char **envp)
>> >      char *dir, **dirs;
>> >      BlockdevOptionsQueue bdo_queue = QSIMPLEQ_HEAD_INITIALIZER(bdo_queue);
>> >  
>> > +    /*
>> > +     * Ideally we would set LC_ALL, but QEMU currently isn't able to cope
>> > +     * with arbitrary localization settings. In particular there are two
>> > +     * known problems
>> > +     *
>> > +     *   - The QMP monitor needs to use the C locale rules for numeric
>> > +     *     formatting. This would need a double/int -> string formatter
>> > +     *     that is locale independant.
>> 
>> Aha.
>> 
>> Reminds me of the story of a FORTRAN compiler that rejected only
>> far-away customers' programs...
>> 
>> > +     *
>> > +     *   - The QMP monitor needs to encode all data as UTF-8. This needs
>> > +     *     to be updated to use iconv(3) to explicitly convert the current
>> > +     *     locale's charset into utf-8
>> 
>> Really?  What locale-dependent text gets sent to QMP?
>
> The main thing I can see would be filenames.
>
> Though having said it is UTF-8 on looking more closely I think QEMU is
> probably 8-bit clean in its handling, so will just be blindly passing
> whatever filename string it get from libvirt straight on to the kernel
> with no interpretation.

Sounds good to me.

> Libvirt has enabled UTF-8 validation in its JSON library when encoding
> data it sends to QEMU, so any data libvirt is sending will be a valid
> UTF-8 byte sequence at least. Libvirt doesn't axctually do any charset
> conversion though, so if libvirt runs in a non-UTF8 locale it will
> likely trip over this UTF-8 validation.

QMP input must be encoded in UTF-8.  Converting from other encodings to
UTF-8 is the QMP client's problem.

The more interesting direction is the one I inquired about: QMP output.
If locale-dependent text gets sent to QMP, converting it to UTF-8 is
QEMU's problem.

On closer look, anything but JSON string contents is plain ASCII by
construction.  JSON string contents gets assembled in to_json() case
QTYPE_QSTRING.  It expects QString to use UTF-8[*].  You can have any
locale as long as it uses ASCII or UTF-8.

>> > +     *
>> > +     *   - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
>> > +     *     C locale sorting behaviour. Most QEMU usage should likely be
>> > +     *     changed to g_ascii_is{upper,lower,alnum...} to match code
>> > +     *     assumptions, without being broken by locale settnigs.
>> > +     *
>> > +     * We do still have two requirements
>> > +     *
>> > +     *   - Ability to correct display translated text according to the
>> > +     *     user's locale
>> > +     *
>> > +     *   - Ability to handle multibyte characters, ideally according to
>> > +     *     user's locale specified character set. This affects ability
>> > +     *     of usb-mtp to correctly convert filenames to UCS16 and curses
>> > +     *     & GTK frontends wide character display.
>> > +     *
>> > +     * The second requirement would need LC_CTYPE to be honoured, but
>> > +     * this conflicts with the 2nd & 3rd problems listed earlier. For
>> > +     * now we make a tradeoff, trying to set an explicit UTF-8 localee
>> > +     *
>> > +     * Note we can't set LC_MESSAGES here, since mingw doesn't define
>> > +     * this constant in locale.h Fortunately we only need it for the
>> > +     * GTK frontend and that uses gi18n.h which pulls in a definition
>> > +     * of LC_MESSAGES.
>> > +     */
>> > +    setlocale(LC_CTYPE, "C.UTF-8");
>> > +
>> >      module_call_init(MODULE_INIT_TRACE);
>> >  
>> >      qemu_init_cpu_list();
>> 
>> We should've stayed out of the GUI business.
>
> This isn't only a GUI problem as above, it affects USB MTP.

I believe setlocale() in QEMU is basically wrong.  Finding all the
places that rely on the current locale when they shouldn't and
converting them to locale-independent alternatives is a huge amount of
work.  Even if we managed to complete it, it wouldn't stay complete.

Instead, find the places that have reason to use the locale, and fix
them to uselocale().



[*] Actually modified UTF-8, but I don't think we rely on the "modified"
part.
Daniel P. Berrangé April 16, 2019, 4:09 p.m. UTC | #8
On Tue, Apr 16, 2019 at 06:01:46PM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Tue, Apr 16, 2019 at 09:49:09AM +0200, Markus Armbruster wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> > The main thing I can see would be filenames.
> >
> > Though having said it is UTF-8 on looking more closely I think QEMU is
> > probably 8-bit clean in its handling, so will just be blindly passing
> > whatever filename string it get from libvirt straight on to the kernel
> > with no interpretation.
> 
> Sounds good to me.
> 
> > Libvirt has enabled UTF-8 validation in its JSON library when encoding
> > data it sends to QEMU, so any data libvirt is sending will be a valid
> > UTF-8 byte sequence at least. Libvirt doesn't axctually do any charset
> > conversion though, so if libvirt runs in a non-UTF8 locale it will
> > likely trip over this UTF-8 validation.
> 
> QMP input must be encoded in UTF-8.  Converting from other encodings to
> UTF-8 is the QMP client's problem.

Ok, so consider the host OS is globally running in a non-UTF-8 locale
such as ISO8859-1. This means that any multibyte filenames in the
filesystem are assumed to be in ISO8859-1  encoding.

Since QMP input must be UTF-8, libvirt must convert the filename
from the current locale (ISO8859-1) to UTF-8 otherwise it might
be putting an invalid UTF-8 sequence in the JSON.

For QEMU to be able to open the file, QEMU must be honouring the
host OS LC_CTYPE, and converting from UTF-8 back to LC_CTYPE
character set.

> 
> The more interesting direction is the one I inquired about: QMP output.
> If locale-dependent text gets sent to QMP, converting it to UTF-8 is
> QEMU's problem.
> 
> On closer look, anything but JSON string contents is plain ASCII by
> construction.  JSON string contents gets assembled in to_json() case
> QTYPE_QSTRING.  It expects QString to use UTF-8[*].  You can have any
> locale as long as it uses ASCII or UTF-8.

IOW

> 
> >> > +     *
> >> > +     *   - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
> >> > +     *     C locale sorting behaviour. Most QEMU usage should likely be
> >> > +     *     changed to g_ascii_is{upper,lower,alnum...} to match code
> >> > +     *     assumptions, without being broken by locale settnigs.
> >> > +     *
> >> > +     * We do still have two requirements
> >> > +     *
> >> > +     *   - Ability to correct display translated text according to the
> >> > +     *     user's locale
> >> > +     *
> >> > +     *   - Ability to handle multibyte characters, ideally according to
> >> > +     *     user's locale specified character set. This affects ability
> >> > +     *     of usb-mtp to correctly convert filenames to UCS16 and curses
> >> > +     *     & GTK frontends wide character display.
> >> > +     *
> >> > +     * The second requirement would need LC_CTYPE to be honoured, but
> >> > +     * this conflicts with the 2nd & 3rd problems listed earlier. For
> >> > +     * now we make a tradeoff, trying to set an explicit UTF-8 localee
> >> > +     *
> >> > +     * Note we can't set LC_MESSAGES here, since mingw doesn't define
> >> > +     * this constant in locale.h Fortunately we only need it for the
> >> > +     * GTK frontend and that uses gi18n.h which pulls in a definition
> >> > +     * of LC_MESSAGES.
> >> > +     */
> >> > +    setlocale(LC_CTYPE, "C.UTF-8");
> >> > +
> >> >      module_call_init(MODULE_INIT_TRACE);
> >> >  
> >> >      qemu_init_cpu_list();
> >> 
> >> We should've stayed out of the GUI business.
> >
> > This isn't only a GUI problem as above, it affects USB MTP.
> 
> I believe setlocale() in QEMU is basically wrong.  Finding all the
> places that rely on the current locale when they shouldn't and
> converting them to locale-independent alternatives is a huge amount of
> work.  Even if we managed to complete it, it wouldn't stay complete.
> 
> Instead, find the places that have reason to use the locale, and fix
> them to uselocale().

I think that's fundamentally the wrong way around. Most stuff *should*
be locale dependant, otherwise any interaction with the host OS is
likely to use incorrect localization. It isn't practical to put a
uselocale() call around every place that opens a filename. There are
a few places where QEMU should be locale indepandant such as the QMP
and guest OS ABI sensitive things, which should take account of it.

Regards,
Daniel
Markus Armbruster April 16, 2019, 7:19 p.m. UTC | #9
Daniel P. Berrangé <berrange@redhat.com> writes:

[...]
> I think that's fundamentally the wrong way around. Most stuff *should*
> be locale dependant, otherwise any interaction with the host OS is
> likely to use incorrect localization. It isn't practical to put a
> uselocale() call around every place that opens a filename. There are
> a few places where QEMU should be locale indepandant such as the QMP
> and guest OS ABI sensitive things, which should take account of it.

All right, you convinced me.  The problem is even bigger than I thought.
That means the return on investment in this area is even poorer than I
thought.

I don't think we should try to support anything but ASCII and UTF-8.
diff mbox series

Patch

diff --git a/ui/curses.c b/ui/curses.c
index cc6d6da684..403cd57913 100644
--- a/ui/curses.c
+++ b/ui/curses.c
@@ -27,7 +27,6 @@ 
 #include <sys/ioctl.h>
 #include <termios.h>
 #endif
-#include <locale.h>
 #include <wchar.h>
 #include <langinfo.h>
 #include <iconv.h>
@@ -716,7 +715,6 @@  static void curses_display_init(DisplayState *ds, DisplayOptions *opts)
     }
 #endif
 
-    setlocale(LC_CTYPE, "");
     if (opts->u.curses.charset) {
         font_charset = opts->u.curses.charset;
     }
diff --git a/ui/gtk.c b/ui/gtk.c
index e96e15435a..e6a41c79ab 100644
--- a/ui/gtk.c
+++ b/ui/gtk.c
@@ -2208,12 +2208,12 @@  static void gtk_display_init(DisplayState *ds, DisplayOptions *opts)
 
     s->free_scale = FALSE;
 
-    /* Mostly LC_MESSAGES only. See early_gtk_display_init() for details. For
-     * LC_CTYPE, we need to make sure that non-ASCII characters are considered
-     * printable, but without changing any of the character classes to make
-     * sure that we don't accidentally break implicit assumptions.  */
+    /*
+     * See comment in main() for why it has only setup LC_CTYPE,
+     * as opposed to LC_ALL. Given that we need to enable
+     * LC_MESSAGES too for menu translations.
+     */
     setlocale(LC_MESSAGES, "");
-    setlocale(LC_CTYPE, "C.UTF-8");
     bindtextdomain("qemu", CONFIG_QEMU_LOCALEDIR);
     textdomain("qemu");
 
@@ -2262,22 +2262,12 @@  static void gtk_display_init(DisplayState *ds, DisplayOptions *opts)
 
 static void early_gtk_display_init(DisplayOptions *opts)
 {
-    /* The QEMU code relies on the assumption that it's always run in
-     * the C locale. Therefore it is not prepared to deal with
-     * operations that produce different results depending on the
-     * locale, such as printf's formatting of decimal numbers, and
-     * possibly others.
-     *
-     * Since GTK+ calls setlocale() by default -importing the locale
-     * settings from the environment- we must prevent it from doing so
-     * using gtk_disable_setlocale().
-     *
-     * QEMU's GTK+ UI, however, _does_ have translations for some of
-     * the menu items. As a trade-off between a functionally correct
-     * QEMU and a fully internationalized UI we support importing
-     * LC_MESSAGES from the environment (see the setlocale() call
-     * earlier in this file). This allows us to display translated
-     * messages leaving everything else untouched.
+    /*
+     * GTK calls setlocale() by default, importing the locale
+     * settings from the environment. QEMU's main() method will
+     * have set LC_MESSAGES and LC_CTYPE to allow GTK to display
+     * translated messages, including use of wide characters. We
+     * must not allow GTK to change other aspects of the locale.
      */
     gtk_disable_setlocale();
     gtkinit = gtk_init_check(NULL, NULL);
diff --git a/vl.c b/vl.c
index c696ad2a13..87a55cddb3 100644
--- a/vl.c
+++ b/vl.c
@@ -49,6 +49,7 @@  int main(int argc, char **argv)
 #define main qemu_main
 #endif /* CONFIG_COCOA */
 
+#include <locale.h>
 
 #include "qemu/error-report.h"
 #include "qemu/sockets.h"
@@ -3022,6 +3023,45 @@  int main(int argc, char **argv, char **envp)
     char *dir, **dirs;
     BlockdevOptionsQueue bdo_queue = QSIMPLEQ_HEAD_INITIALIZER(bdo_queue);
 
+    /*
+     * Ideally we would set LC_ALL, but QEMU currently isn't able to cope
+     * with arbitrary localization settings. In particular there are two
+     * known problems
+     *
+     *   - The QMP monitor needs to use the C locale rules for numeric
+     *     formatting. This would need a double/int -> string formatter
+     *     that is locale independant.
+     *
+     *   - The QMP monitor needs to encode all data as UTF-8. This needs
+     *     to be updated to use iconv(3) to explicitly convert the current
+     *     locale's charset into utf-8
+     *
+     *   - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
+     *     C locale sorting behaviour. Most QEMU usage should likely be
+     *     changed to g_ascii_is{upper,lower,alnum...} to match code
+     *     assumptions, without being broken by locale settnigs.
+     *
+     * We do still have two requirements
+     *
+     *   - Ability to correct display translated text according to the
+     *     user's locale
+     *
+     *   - Ability to handle multibyte characters, ideally according to
+     *     user's locale specified character set. This affects ability
+     *     of usb-mtp to correctly convert filenames to UCS16 and curses
+     *     & GTK frontends wide character display.
+     *
+     * The second requirement would need LC_CTYPE to be honoured, but
+     * this conflicts with the 2nd & 3rd problems listed earlier. For
+     * now we make a tradeoff, trying to set an explicit UTF-8 localee
+     *
+     * Note we can't set LC_MESSAGES here, since mingw doesn't define
+     * this constant in locale.h Fortunately we only need it for the
+     * GTK frontend and that uses gi18n.h which pulls in a definition
+     * of LC_MESSAGES.
+     */
+    setlocale(LC_CTYPE, "C.UTF-8");
+
     module_call_init(MODULE_INIT_TRACE);
 
     qemu_init_cpu_list();