diff mbox

PR78888 - add value range info for tolower/toupper

Message ID CAAgBjMmtn1_bTRbKv=LdaqVioaDTObjs+ubnS+QSvcqR6fecNg@mail.gmail.com
State New
Headers show

Commit Message

Prathamesh Kulkarni Aug. 3, 2017, 7:28 a.m. UTC
Hi,
The attached patch adds value-range info for __builtin_tolower and
__builtin_toupper.
In the patch, I have just settled for anti-range ~['a', 'z'] for
return value of toupper.
Would that be correct albeit imprecise ?

A more precise range would be:
[0, UCHAR_MAX] intersect ~['a', 'z'] union EOF
as mentioned in PR.
I am not sure though if it's possible for a SSA_NAME to have multiple
disjoint ranges ?

Bootstrap+tested on x86_64-unknown-linux-gnu.

Thanks,
Prathamesh

Comments

Jakub Jelinek Aug. 3, 2017, 7:51 a.m. UTC | #1
On Thu, Aug 03, 2017 at 12:58:06PM +0530, Prathamesh Kulkarni wrote:
> --- a/gcc/tree-vrp.c
> +++ b/gcc/tree-vrp.c
> @@ -3778,6 +3778,19 @@ extract_range_basic (value_range *vr, gimple *stmt)
>  		return;
>  	      }
>  	  break;
> +	case CFN_BUILT_IN_TOUPPER:
> +	case CFN_BUILT_IN_TOLOWER:
> +	  if (tree lhs = gimple_call_lhs (stmt))
> +	    {
> +	      tree type = TREE_TYPE (lhs);
> +	      unsigned char min = (cfn == CFN_BUILT_IN_TOUPPER) ? 'a' : 'A';
> +	      unsigned char max = (cfn == CFN_BUILT_IN_TOUPPER) ? 'z' : 'Z';
> +	      tree range_min = build_int_cstu (type, min);
> +	      tree range_max = build_int_cstu (type, max);
> +	      set_value_range (vr, VR_ANTI_RANGE, range_min, range_max, NULL);
> +	      return;

You are hardcoding here host characters and using it for target.
I think you need to use
lang_hooks.to_target_charset
(really no idea how it works or doesn't in LTO, but gimple-fold.c is already
using it among others).
Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,
which isn't the case for e.g. EBCDIC.  So I think you'd need to verify
(once?) that the target charset has this property.

	Jakub
Prathamesh Kulkarni Aug. 3, 2017, 10:43 a.m. UTC | #2
On 3 August 2017 at 13:21, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Aug 03, 2017 at 12:58:06PM +0530, Prathamesh Kulkarni wrote:
>> --- a/gcc/tree-vrp.c
>> +++ b/gcc/tree-vrp.c
>> @@ -3778,6 +3778,19 @@ extract_range_basic (value_range *vr, gimple *stmt)
>>               return;
>>             }
>>         break;
>> +     case CFN_BUILT_IN_TOUPPER:
>> +     case CFN_BUILT_IN_TOLOWER:
>> +       if (tree lhs = gimple_call_lhs (stmt))
>> +         {
>> +           tree type = TREE_TYPE (lhs);
>> +           unsigned char min = (cfn == CFN_BUILT_IN_TOUPPER) ? 'a' : 'A';
>> +           unsigned char max = (cfn == CFN_BUILT_IN_TOUPPER) ? 'z' : 'Z';
>> +           tree range_min = build_int_cstu (type, min);
>> +           tree range_max = build_int_cstu (type, max);
>> +           set_value_range (vr, VR_ANTI_RANGE, range_min, range_max, NULL);
>> +           return;
>
> You are hardcoding here host characters and using it for target.
> I think you need to use
> lang_hooks.to_target_charset
> (really no idea how it works or doesn't in LTO, but gimple-fold.c is already
> using it among others).
> Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,
> which isn't the case for e.g. EBCDIC.  So I think you'd need to verify
> (once?) that the target charset has this property.
Hi Jakub,
Thanks for the suggestions, I wasn't aware about lang_hooks.to_target_charset.
Would following be a correct check that target has ascii charset and only then
set range to ~[a, z] ?

target_a = lang_hooks.to_target_charset ('a');
target_z = lang_hooks.to_target_charset('z');
if (target_a == 'a' && target_z == 'z')
  {
     // set vr to ~['a', 'z']
  }

Thanks,
Prathamesh
>
>         Jakub
Jakub Jelinek Aug. 3, 2017, 10:55 a.m. UTC | #3
On Thu, Aug 03, 2017 at 04:13:38PM +0530, Prathamesh Kulkarni wrote:
> > You are hardcoding here host characters and using it for target.
> > I think you need to use
> > lang_hooks.to_target_charset
> > (really no idea how it works or doesn't in LTO, but gimple-fold.c is already
> > using it among others).
> > Also, you're assuming that the 'a'-'z' and 'A'-'Z' ranges are without gaps,
> > which isn't the case for e.g. EBCDIC.  So I think you'd need to verify
> > (once?) that the target charset has this property.
> Hi Jakub,
> Thanks for the suggestions, I wasn't aware about lang_hooks.to_target_charset.
> Would following be a correct check that target has ascii charset and only then
> set range to ~[a, z] ?
> 
> target_a = lang_hooks.to_target_charset ('a');
> target_z = lang_hooks.to_target_charset('z');
> if (target_a == 'a' && target_z == 'z')
>   {
>      // set vr to ~['a', 'z']
>   }

No.  E.g. if host == target charset is EBCDIC, then the condition is true,
but it isn't a consecutive range.
A rough check that would maybe work (at least would be false for EBCDIC
and true for ASCII) could be something like:
if (target_z > target_a && target_z - target_a == 25)
(which is not exact, some strange charset could have say '0'-'9' block in
the middle of 'a'-'z' block, and say 'h'-'r' outside of the range), but
perhaps good enough for real-world charsets.
In any case, you should probably investigate all the locales say in glibc or
some other big locale repository whether tolower/toupper have the expected
properties there.
Plus of course, the set vr to ~[ ] needs to use target_a/target_z.

	Jakub
Joseph Myers Aug. 3, 2017, 2:38 p.m. UTC | #4
On Thu, 3 Aug 2017, Jakub Jelinek wrote:

> In any case, you should probably investigate all the locales say in glibc or
> some other big locale repository whether tolower/toupper have the expected
> properties there.

They don't.  In tr_TR.UTF-8, toupper ('i') == 'i', because 'İ', the 
correct uppercase version (as returned in locale tr_TR.ISO-8859-9), is a 
multibyte character and toupper can only return single-byte characters.
Jakub Jelinek Aug. 3, 2017, 2:50 p.m. UTC | #5
On Thu, Aug 03, 2017 at 02:38:35PM +0000, Joseph Myers wrote:
> On Thu, 3 Aug 2017, Jakub Jelinek wrote:
> 
> > In any case, you should probably investigate all the locales say in glibc or
> > some other big locale repository whether tolower/toupper have the expected
> > properties there.
> 
> They don't.  In tr_TR.UTF-8, toupper ('i') == 'i', because 'İ', the 
> correct uppercase version (as returned in locale tr_TR.ISO-8859-9), is a 
> multibyte character and toupper can only return single-byte characters.

Indeed,
#include <ctype.h>
#include <locale.h>

int
main ()
{
  setlocale (LC_ALL, "");
  int i;
  for (i = -1000; i < 1000; i++)
    if (tolower (i) >= 'A' && tolower (i) <= 'Z')
      __builtin_abort ();
    else if (toupper (i) >= 'a' && toupper (i) <= 'z')
      __builtin_abort ();
  return 0;
}
fails for LC_ALL=tr_TR.UTF-8, because tolower ('I') is 'I'.
Not to mention that the result is unspecified if the functions are called
with a value outside of the range of unsigned char or EOF.
Therefore, this optimization is invalid.

	Jakub
diff mbox

Patch

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr78888.c b/gcc/testsuite/gcc.dg/tree-ssa/pr78888.c
new file mode 100644
index 00000000000..2bcddf1f2c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr78888.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-evrp-slim" } */
+
+void g (int x)
+{
+  if (__builtin_toupper ((unsigned char)x) == 'a')
+    __builtin_abort ();
+}
+
+void h (int x)
+{
+  if (__builtin_tolower ((unsigned char)x) == 'A')
+    __builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump-not "__builtin_abort" "evrp" } } */
diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c
index 79a29bf0efb..7137a4c52ec 100644
--- a/gcc/tree-vrp.c
+++ b/gcc/tree-vrp.c
@@ -3778,6 +3778,19 @@  extract_range_basic (value_range *vr, gimple *stmt)
 		return;
 	      }
 	  break;
+	case CFN_BUILT_IN_TOUPPER:
+	case CFN_BUILT_IN_TOLOWER:
+	  if (tree lhs = gimple_call_lhs (stmt))
+	    {
+	      tree type = TREE_TYPE (lhs);
+	      unsigned char min = (cfn == CFN_BUILT_IN_TOUPPER) ? 'a' : 'A';
+	      unsigned char max = (cfn == CFN_BUILT_IN_TOUPPER) ? 'z' : 'Z';
+	      tree range_min = build_int_cstu (type, min);
+	      tree range_max = build_int_cstu (type, max);
+	      set_value_range (vr, VR_ANTI_RANGE, range_min, range_max, NULL);
+	      return;
+	    }
+	  break;
 	default:
 	  break;
 	}