From patchwork Sun Mar 30 04:58:28 2014 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jerry DeLisle X-Patchwork-Id: 335081 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 1F0F11400B9 for ; Sun, 30 Mar 2014 15:59:02 +1100 (EST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:content-type; q=dns; s=default; b=iT+FQwh2DNNcEfsqUWFIVY03xAqm3xxz7M2jtBjZxtM HxdG7iS0VbhuWwV2MIeOeKKdytshCE5/JCIi/S8YO+Uqi+JQ+eb2I1+a10WV0DRf 6ISmJj5AHH8buXgagPbI76xOPL+h5IN9bDojywJLjooYCY0JGKr0DTY9HSX1H+1I = DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:content-type; s=default; bh=GeoKm/iMZjL5BMHYDogb2w+uCGw=; b=WACHxmx+/3WjbjT/Y MtBFQCzhfsTlMuaHaa06pvZJSeFpudk48HTpJMwwetr6T3kDdQkut91WwhhA/fYm pwfBenUXcU2FoVUrkLhAxSj8ggiXfilWnNXD7QYvQqBPtCAgSMrKb9RlfVi2kGiU OipExLA+cpvHdjuyuH2zd4OOww= Received: (qmail 3164 invoked by alias); 30 Mar 2014 04:58:37 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 3121 invoked by uid 89); 30 Mar 2014 04:58:32 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=0.1 required=5.0 tests=AWL, BAYES_00, LIKELY_SPAM_SUBJECT, RCVD_IN_DNSWL_NONE, RP_MATCHES_RCVD, SPF_PASS, UNSUBSCRIBE_BODY autolearn=no version=3.3.2 X-Spam-User: qpsmtpd, 2 recipients X-HELO: mta31.charter.net Received: from mta31.charter.net (HELO mta31.charter.net) (216.33.127.82) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Sun, 30 Mar 2014 04:58:30 +0000 Received: from imp10 ([10.20.200.15]) by mta31.charter.net (InterMail vM.8.01.05.02 201-2260-151-103-20110920) with ESMTP id <20140330045828.ORYX6769.mta31.charter.net@imp10>; Sun, 30 Mar 2014 00:58:28 -0400 Received: from pavilion.localdomain ([68.5.43.244]) by imp10 with smtp.charter.net id jgyT1n0095G55b005gyTzi; Sun, 30 Mar 2014 00:58:28 -0400 X-Authority-Analysis: v=2.0 cv=Q7eKePKa c=1 sm=1 a=mw+G4YjGptsaRR05zBLClw==:17 a=3C7RHhrQF8gA:10 a=gyntC5ncBPQA:10 a=yUnIBFQkZM0A:10 a=hOpmn2quAAAA:8 a=soDD_djyaZSXWtTf4U4A:9 a=wPNLvfGTeEIA:10 a=h0MbPZiqXlO5ZBQrrYgA:9 a=vAUQDKi7BZUe09FQ:21 a=s5Xpx4CYH_b-FE7Y:21 a=C9hoFXZZBOjo8A3KwocA:9 a=mw+G4YjGptsaRR05zBLClw==:117 X-Auth-id: anZkZWxpc2xlQGNoYXJ0ZXIubmV0 Message-ID: <5337A474.8090004@charter.net> Date: Sat, 29 Mar 2014 21:58:28 -0700 From: Jerry DeLisle User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0 MIME-Version: 1.0 To: gfortran CC: gcc patches Subject: [patch, libgfortran] Wrong result for UTF-8/UCS-4 list-directed and namelist read and nml write Hi all, The attached patch fixes namelist read/write and list directed read/write to support UTF-8. I have attached a preliminary test case to use to experiment with this. I will need to set it up for the testsuite still. Regression tested on x86-64-linux-gnu. OK for trunk or wait? Regards, Jerry 2014-03-29 Jerry DeLisle PR libfortran/52539 * io/list_read.c: Add uchar typedef. (push_char4): New function to save kind=4 character. (next_char_utf8): New function to read a single UTF-8 encoded character value. (read_chracter): Update to use the new functions for reading UTF-8 strings. (list_formatted_read_scalar): Update to handle list directed reads of UTF-8 strings. (nml_read_obj): Likewise update for UTF-8 strings in namelists. * io/write.c (nml_write_obj): Add kind=4 character support for namelist writes. Index: list_read.c =================================================================== --- list_read.c (revision 208931) +++ list_read.c (working copy) @@ -32,7 +32,9 @@ see the files COPYING3 and COPYING.RUNTIME respect #include #include +typedef unsigned char uchar; + /* List directed input. Several parsing subroutines are practically reimplemented from formatted input, the reason being that there are all kinds of small differences between formatted and list directed @@ -97,7 +99,38 @@ push_char (st_parameter_dt *dtp, char c) dtp->u.p.saved_string[dtp->u.p.saved_used++] = c; } +/* Save a KIND=4 character to a string buffer, enlarging the buffer + as necessary. */ +static void +push_char4 (st_parameter_dt *dtp, gfc_char4_t c) +{ + gfc_char4_t *new, *p = (gfc_char4_t *) dtp->u.p.saved_string; + + if (p == NULL) + { + dtp->u.p.saved_string = xcalloc (SCRATCH_SIZE, sizeof (gfc_char4_t)); + dtp->u.p.saved_length = SCRATCH_SIZE; + dtp->u.p.saved_used = 0; + p = (gfc_char4_t *) dtp->u.p.saved_string; + } + + if (dtp->u.p.saved_used >= dtp->u.p.saved_length) + { + dtp->u.p.saved_length = 2 * dtp->u.p.saved_length; + new = realloc (p, dtp->u.p.saved_length); + if (new == NULL) + generate_error (&dtp->common, LIBERROR_OS, NULL); + p = new; + + memset (new + dtp->u.p.saved_used, 0, + dtp->u.p.saved_length - dtp->u.p.saved_used); + } + + p[dtp->u.p.saved_used++] = c; +} + + /* Free the input buffer if necessary. */ static void @@ -247,6 +280,57 @@ done: } +static gfc_char4_t +next_char_utf8 (st_parameter_dt *dtp) +{ + static const uchar masks[6] = { 0x7F, 0x1F, 0x0F, 0x07, 0x02, 0x01 }; + static const uchar patns[6] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC }; + int i, nb; + gfc_char4_t c; + + c = next_char (dtp); + if (c < 0x80) + return c; + + /* The number of leading 1-bits in the first byte indicates how many + bytes follow. */ + for (nb = 2; nb < 7; nb++) + if ((c & ~masks[nb-1]) == patns[nb-1]) + goto found; + goto invalid; + + found: + c = (c & masks[nb-1]); + + /* Decode the bytes read. */ + for (i = 1; i < nb; i++) + { + gfc_char4_t n = next_char (dtp); + + if ((n & 0xC0) != 0x80) + goto invalid; + + c = ((c << 6) + (n & 0x3F)); + } + + /* Make sure the shortest possible encoding was used. */ + if (c <= 0x7F && nb > 1) goto invalid; + if (c <= 0x7FF && nb > 2) goto invalid; + if (c <= 0xFFFF && nb > 3) goto invalid; + if (c <= 0x1FFFFF && nb > 4) goto invalid; + if (c <= 0x3FFFFFF && nb > 5) goto invalid; + + /* Make sure the character is valid. */ + if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) + goto invalid; + + return c; + + invalid: + generate_error (&dtp->common, LIBERROR_READ_VALUE, "Invalid UTF-8 encoding"); + return (gfc_char4_t) '?'; +} + /* Push a character back onto the input. */ static void @@ -1087,51 +1171,98 @@ read_character (st_parameter_dt *dtp, int length _ } get_string: - for (;;) - { - if ((c = next_char (dtp)) == EOF) - goto done_eof; - switch (c) - { - case '"': - case '\'': - if (c != quote) - { + + if (dtp->u.p.current_unit->flags.encoding == ENCODING_UTF8) + for (;;) + { + if ((c = next_char_utf8 (dtp)) == EOF) + goto done_eof; + switch (c) + { + case '"': + case '\'': + if (c != quote) + { + push_char4 (dtp, c); + break; + } + + /* See if we have a doubled quote character or the end of + the string. */ + + if ((c = next_char_utf8 (dtp)) == EOF) + goto done_eof; + if (c == quote) + { + push_char4 (dtp, quote); + break; + } + + unget_char (dtp, c); + goto done; + + CASE_SEPARATORS: + if (quote == ' ') + { + unget_char (dtp, c); + goto done; + } + + if (c != '\n' && c != '\r') + push_char4 (dtp, c); + break; + + default: + push_char4 (dtp, c); + break; + } + } + else + for (;;) + { + if ((c = next_char (dtp)) == EOF) + goto done_eof; + switch (c) + { + case '"': + case '\'': + if (c != quote) + { + push_char (dtp, c); + break; + } + + /* See if we have a doubled quote character or the end of + the string. */ + + if ((c = next_char (dtp)) == EOF) + goto done_eof; + if (c == quote) + { + push_char (dtp, quote); + break; + } + + unget_char (dtp, c); + goto done; + + CASE_SEPARATORS: + if (quote == ' ') + { + unget_char (dtp, c); + goto done; + } + + if (c != '\n' && c != '\r') push_char (dtp, c); - break; - } - - /* See if we have a doubled quote character or the end of - the string. */ - - if ((c = next_char (dtp)) == EOF) - goto done_eof; - if (c == quote) - { - push_char (dtp, quote); - break; - } - - unget_char (dtp, c); - goto done; - - CASE_SEPARATORS: - if (quote == ' ') - { - unget_char (dtp, c); - goto done; - } - - if (c != '\n' && c != '\r') + break; + + default: push_char (dtp, c); - break; + break; + } + } - default: - push_char (dtp, c); - break; - } - } - /* At this point, we have to have a separator, or else the string is invalid. */ done: @@ -1903,7 +2034,7 @@ static int list_formatted_read_scalar (st_parameter_dt *dtp, bt type, void *p, int kind, size_t size) { - gfc_char4_t *q; + gfc_char4_t *q, *r; int c, i, m; int err = 0; @@ -2031,13 +2162,19 @@ list_formatted_read_scalar (st_parameter_dt *dtp, { m = ((int) size < dtp->u.p.saved_used) ? (int) size : dtp->u.p.saved_used; - if (kind == 1) - memcpy (p, dtp->u.p.saved_string, m); + + q = (gfc_char4_t *) p; + r = (gfc_char4_t *) dtp->u.p.saved_string; + if (dtp->u.p.current_unit->flags.encoding == ENCODING_UTF8) + for (i = 0; i < m; i++) + *q++ = *r++; else { - q = (gfc_char4_t *) p; - for (i = 0; i < m; i++) - q[i] = (unsigned char) dtp->u.p.saved_string[i]; + if (kind == 1) + memcpy (p, dtp->u.p.saved_string, m); + else + for (i = 0; i < m; i++) + *q++ = (unsigned char) dtp->u.p.saved_string[i]; } } else @@ -2771,10 +2908,27 @@ nml_read_obj (st_parameter_dt *dtp, namelist_info } else m = dtp->u.p.saved_used; - pdata = (void*)( pdata + clow - 1 ); - memcpy (pdata, dtp->u.p.saved_string, m); - if (m < dlen) - memset ((void*)( pdata + m ), ' ', dlen - m); + + if (dtp->u.p.current_unit->flags.encoding == ENCODING_UTF8) + { + gfc_char4_t *q4, *p4 = pdata; + int i; + + q4 = (gfc_char4_t *) dtp->u.p.saved_string; + p4 += clow -1; + for (i = 0; i < m; i++) + *p4++ = *q4++; + if (m < dlen) + for (i = 0; i < dlen - m; i++) + *p4++ = (gfc_char4_t) ' '; + } + else + { + pdata = (void*)( pdata + clow - 1 ); + memcpy (pdata, dtp->u.p.saved_string, m); + if (m < dlen) + memset ((void*)( pdata + m ), ' ', dlen - m); + } break; default: Index: write.c =================================================================== --- write.c (revision 208931) +++ write.c (working copy) @@ -1835,7 +1835,10 @@ nml_write_obj (st_parameter_dt *dtp, namelist_info break; case BT_CHARACTER: - write_character (dtp, p, 1, obj->string_length, DELIM); + if (dtp->u.p.current_unit->flags.encoding == ENCODING_UTF8) + write_character (dtp, p, 4, obj->string_length, DELIM); + else + write_character (dtp, p, 1, obj->string_length, DELIM); break; case BT_REAL: