regexec: Fix off-by-one bug in weight comparison [BZ #23036]

Message ID 20180709172046.D1DA743994575@oldenburg.str.redhat.com
State New
Headers show
Series
  • regexec: Fix off-by-one bug in weight comparison [BZ #23036]
Related show

Commit Message

Florian Weimer July 9, 2018, 5:20 p.m.
2018-07-09  Florian Weimer  <fweimer@redhat.com>

	[BZ #23036]
	* posix/regexec.c (check_node_accept_bytes): When comparing
	weights, do not compare an extra byte after the end of the
	weights.

Comments

Carlos O'Donell July 9, 2018, 6 p.m. | #1
On 07/09/2018 01:20 PM, Florian Weimer wrote:
> 2018-07-09  Florian Weimer  <fweimer@redhat.com>
> 
> 	[BZ #23036]
> 	* posix/regexec.c (check_node_accept_bytes): When comparing
> 	weights, do not compare an extra byte after the end of the
> 	weights.

I assume that this fixes the issue and that you have minimally passed
the regression testing for x86_64. I would like a test for this, but it
can wait. This regression is a blocker for the release, and I appreciate
your root cause analysis of the issue.

If I understand correctly this is a "off-by-one" error and I'll explain below.

In the future please try to keep the change to the absolute minimum instead
of rewriting it and hoisting constants out of loops. It certainly made the
review longer than it should have been. We can always clean this up later.

Given all of my assumptions and admonitions, this looks OK.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> diff --git a/posix/regexec.c b/posix/regexec.c
> index 63aef97535..73644c2341 100644
> --- a/posix/regexec.c
> +++ b/posix/regexec.c
> @@ -3878,30 +3878,27 @@ check_node_accept_bytes (const re_dfa_t *dfa, Idx node_idx,
>  	      indirect = (const int32_t *)
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
>  	      int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
> +	      int32_t rule = idx >> 24;
> +	      idx &= 0xffffff;

OK.

>  	      if (idx > 0)
> -		for (i = 0; i < cset->nequiv_classes; ++i)
> -		  {
> -		    int32_t equiv_class_idx = cset->equiv_classes[i];
> -		    size_t weight_len = weights[idx & 0xffffff];
> -		    if (weight_len == weights[equiv_class_idx & 0xffffff]
> -			&& (idx >> 24) == (equiv_class_idx >> 24))

Check length and rule is the same.

> -		      {
> -			Idx cnt = 0;
> -
> -			idx &= 0xffffff;
> -			equiv_class_idx &= 0xffffff;
> -
> -			while (cnt <= weight_len
> -			       && (weights[equiv_class_idx + 1 + cnt]
> -				   == weights[idx + 1 + cnt]))

Here we start at count 0 and go to <= weight_len.

This is one byte too far.

In an N-length weight:

|L01234...[N-1]|
 ^^        ^
 ||        |--- weights [idx + 1 + (weight_len - 1)]
 ||--- weights[idx + 1]
 |--- weights

L == N == weight_len

So the loop for cnt <= weight_len goes one byte beyond the weights array.

> -			  ++cnt;
> -			if (cnt > weight_len)
> -			  {
> -			    match_len = elem_len;
> -			    goto check_node_accept_bytes_match;
> -			  }
> -		      }
> -		  }
> +		{
> +		  size_t weight_len = weights[idx];
> +		  for (i = 0; i < cset->nequiv_classes; ++i)
> +		    {
> +		      int32_t equiv_class_idx = cset->equiv_classes[i];
> +		      int32_t equiv_class_rule = equiv_class_idx >> 24;
> +		      equiv_class_idx &= 0xffffff;
> +		      if (weights[equiv_class_idx] == weight_len

OK, check the lengths are the same.

> +			  && equiv_class_rule == rule

OK, check the rule is the same.

> +			  && memcmp (weights + idx + 1,
> +				     weights + equiv_class_idx + 1,
> +				     weight_len) == 0)

OK. Use memcmp to compare only weight_len bytes, not weight_len + 1 bytes.

> +			{
> +			  match_len = elem_len;
> +			  goto check_node_accept_bytes_match;


OK.

> +			}
> +		    }
> +		}
>  	    }
>  	}
>        else
> 

The new code is much more readable too, you can clearly see what is being
compared and why. It is better than what we had before.
Adhemerval Zanella July 9, 2018, 6:55 p.m. | #2
On 09/07/2018 15:00, Carlos O'Donell wrote:
> On 07/09/2018 01:20 PM, Florian Weimer wrote:
>> 2018-07-09  Florian Weimer  <fweimer@redhat.com>
>>
>> 	[BZ #23036]
>> 	* posix/regexec.c (check_node_accept_bytes): When comparing
>> 	weights, do not compare an extra byte after the end of the
>> 	weights.
> 
> I assume that this fixes the issue and that you have minimally passed
> the regression testing for x86_64. I would like a test for this, but it
> can wait. This regression is a blocker for the release, and I appreciate
> your root cause analysis of the issue.

I still think we should add a regression test for it, the bug report
have one quite simple which we can add as an independent test (which
should not invalidate testsuite validation already done).
Florian Weimer July 9, 2018, 7:14 p.m. | #3
* Adhemerval Zanella:

> On 09/07/2018 15:00, Carlos O'Donell wrote:
>> On 07/09/2018 01:20 PM, Florian Weimer wrote:
>>> 2018-07-09  Florian Weimer  <fweimer@redhat.com>
>>>
>>> 	[BZ #23036]
>>> 	* posix/regexec.c (check_node_accept_bytes): When comparing
>>> 	weights, do not compare an extra byte after the end of the
>>> 	weights.
>> 
>> I assume that this fixes the issue and that you have minimally passed
>> the regression testing for x86_64. I would like a test for this, but it
>> can wait. This regression is a blocker for the release, and I appreciate
>> your root cause analysis of the issue.
>
> I still think we should add a regression test for it, the bug report
> have one quite simple which we can add as an independent test (which
> should not invalidate testsuite validation already done).

I'm working on a test which constructs all matching single-character
multi-byte strings matching a given pattern in a given locale, so we
can check that we get the expected result.
Florian Weimer July 9, 2018, 7:18 p.m. | #4
* Carlos O'Donell:

>> -			while (cnt <= weight_len
>> -			       && (weights[equiv_class_idx + 1 + cnt]
>> -				   == weights[idx + 1 + cnt]))
>
> Here we start at count 0 and go to <= weight_len.
>
> This is one byte too far.
>
> In an N-length weight:
>
> |L01234...[N-1]|
>  ^^        ^
>  ||        |--- weights [idx + 1 + (weight_len - 1)]
>  ||--- weights[idx + 1]
>  |--- weights
>
> L == N == weight_len
>
> So the loop for cnt <= weight_len goes one byte beyond the weights array.

Right.  I wasn't able to derive the data layout from
locale/programs/ld-collate.c, but there's this code in
string/strxfrm_l.c:

/* Find next weight and rule index.  Inlined since called for every char.  */
static __always_inline size_t
find_idx (const USTRING_TYPE **us, int32_t *weight_idx,
	  unsigned char *rule_idx, const locale_data_t *l_data, const int pass)
{
  int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us,
			 -1);
  *rule_idx = tmp >> 24;
  int32_t idx = tmp & 0xffffff;
  size_t len = l_data->weights[idx++];

  /* Skip over indices of previous levels.  */
  for (int i = 0; i < pass; i++)
    {
      idx += len;
      len = l_data->weights[idx++];
    }

  *weight_idx = idx;
  return len;
}

This makes it abundantly clear that the length element does not count
itself in the length.
Carlos O'Donell July 9, 2018, 7:19 p.m. | #5
On 07/09/2018 02:55 PM, Adhemerval Zanella wrote:
> 
> 
> On 09/07/2018 15:00, Carlos O'Donell wrote:
>> On 07/09/2018 01:20 PM, Florian Weimer wrote:
>>> 2018-07-09  Florian Weimer  <fweimer@redhat.com>
>>>
>>> 	[BZ #23036]
>>> 	* posix/regexec.c (check_node_accept_bytes): When comparing
>>> 	weights, do not compare an extra byte after the end of the
>>> 	weights.
>>
>> I assume that this fixes the issue and that you have minimally passed
>> the regression testing for x86_64. I would like a test for this, but it
>> can wait. This regression is a blocker for the release, and I appreciate
>> your root cause analysis of the issue.
> 
> I still think we should add a regression test for it, the bug report
> have one quite simple which we can add as an independent test (which
> should not invalidate testsuite validation already done).

I agree. However, the testsuite can take all the way until August 1st
for us to get done. I imagine Florian posted this early because it's
a release blocker and we want review right away.

Patch

diff --git a/posix/regexec.c b/posix/regexec.c
index 63aef97535..73644c2341 100644
--- a/posix/regexec.c
+++ b/posix/regexec.c
@@ -3878,30 +3878,27 @@  check_node_accept_bytes (const re_dfa_t *dfa, Idx node_idx,
 	      indirect = (const int32_t *)
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
 	      int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
+	      int32_t rule = idx >> 24;
+	      idx &= 0xffffff;
 	      if (idx > 0)
-		for (i = 0; i < cset->nequiv_classes; ++i)
-		  {
-		    int32_t equiv_class_idx = cset->equiv_classes[i];
-		    size_t weight_len = weights[idx & 0xffffff];
-		    if (weight_len == weights[equiv_class_idx & 0xffffff]
-			&& (idx >> 24) == (equiv_class_idx >> 24))
-		      {
-			Idx cnt = 0;
-
-			idx &= 0xffffff;
-			equiv_class_idx &= 0xffffff;
-
-			while (cnt <= weight_len
-			       && (weights[equiv_class_idx + 1 + cnt]
-				   == weights[idx + 1 + cnt]))
-			  ++cnt;
-			if (cnt > weight_len)
-			  {
-			    match_len = elem_len;
-			    goto check_node_accept_bytes_match;
-			  }
-		      }
-		  }
+		{
+		  size_t weight_len = weights[idx];
+		  for (i = 0; i < cset->nequiv_classes; ++i)
+		    {
+		      int32_t equiv_class_idx = cset->equiv_classes[i];
+		      int32_t equiv_class_rule = equiv_class_idx >> 24;
+		      equiv_class_idx &= 0xffffff;
+		      if (weights[equiv_class_idx] == weight_len
+			  && equiv_class_rule == rule
+			  && memcmp (weights + idx + 1,
+				     weights + equiv_class_idx + 1,
+				     weight_len) == 0)
+			{
+			  match_len = elem_len;
+			  goto check_node_accept_bytes_match;
+			}
+		    }
+		}
 	    }
 	}
       else