[WIP] Use Levenshtein distance for various misspellings in C frontend v2

Updated patch attached, which is now independent of the rest of the
patch kit; see below.  Various other comments inline.

On Fri, 2015-09-11 at 17:30 +0200, Manuel López-Ibáñez wrote:
> On 10/09/15 22:28, David Malcolm wrote:
> > There are a couple of FIXMEs here:
> > * where to call levenshtein_distance_unit_tests
>
> Should this be part of make check? Perhaps a small program that is compiled and
> linked with spellcheck.c? This would be possible if spellcheck.c did not depend
> on tree.h or tm.h, which I doubt it needs to.

Ideally I'd like to put them into a unittest plugin I've been working on:
 https://gcc.gnu.org/ml/gcc-patches/2015-06/msg00765.html
In the meantime, they only get run in an ENABLE_CHECKING build.

> > * should we attempt error-recovery in c-typeck.c:build_component_ref
>
> I would say yes, but why not leave this discussion to a later patch? The
> current one seems useful enough.

(nods)

> > +
> > +/* Look for the closest match for NAME within the currently valid
> > +   scopes.
> > +
> > +   This finds the identifier with the lowest Levenshtein distance to
> > +   NAME.  If there are multiple candidates with equal minimal distance,
> > +   the first one found is returned.  Scopes are searched from innermost
> > +   outwards, and within a scope in reverse order of declaration, thus
> > +   benefiting candidates "near" to the current scope.  */
> > +
> > +tree
> > +lookup_name_fuzzy (tree name)
> > +{
> > +  gcc_assert (TREE_CODE (name) == IDENTIFIER_NODE);
> > +
> > +  c_binding *best_binding = NULL;
> > +  int best_distance = INT_MAX;
> > +
> > +  for (c_scope *scope = current_scope; scope; scope = scope->outer)
> > +    for (c_binding *binding = scope->bindings; binding; binding = binding->prev)
> > +      {
> > +	if (!binding->id)
> > +	  continue;
> > +	int dist = levenshtein_distance (name, binding->id);
> > +	if (dist < best_distance)
>
> I guess 'dist' cannot be negative. Can it be zero? If not, wouldn't be
> appropriate to exit as soon as it becomes 1?

It can't be negative, so I've converted it to unsigned int, and introduced an
"edit_distance_t" typedef for it.

It would be appropriate to exit as soon as we reach 1 if we agree
that lookup_name_fuzzy isn't intended to find exact matches (since
otherwise we might fail to return an exact match if we see a
distance 1 match first).

I haven't implemented that early bailout in this iteration of the
patch; should I?
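To make the trade-off concrete, here is a minimal sketch of the scan
with an optional early exit.  It works on plain C strings rather than
trees; closest_candidate, naive_distance and the stop_at_one flag are
hypothetical names for illustration, not part of the patch, and the
recursive distance is only a stand-in suitable for short identifiers.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int edit_distance_t;

/* Stand-in distance: naive recursive Levenshtein.  Exponential time,
   so only suitable for the short strings used in this sketch.  */
static edit_distance_t
naive_distance (const char *s, const char *t)
{
  if (*s == '\0')
    return (edit_distance_t) strlen (t);
  if (*t == '\0')
    return (edit_distance_t) strlen (s);
  if (*s == *t)
    return naive_distance (s + 1, t + 1);

  edit_distance_t del = naive_distance (s + 1, t);
  edit_distance_t ins = naive_distance (s, t + 1);
  edit_distance_t sub = naive_distance (s + 1, t + 1);
  edit_distance_t best = del < ins ? del : ins;
  return 1 + (best < sub ? best : sub);
}

/* Return the first candidate with minimal distance to NAME, or NULL if
   there are no candidates.  If STOP_AT_ONE is nonzero, stop scanning as
   soon as the best distance seen is 1 (hypothetical early bailout).  */
const char *
closest_candidate (const char *name, const char *const *candidates,
                   size_t num_candidates, int stop_at_one)
{
  assert (name);
  const char *best = NULL;
  edit_distance_t best_distance = UINT_MAX;

  for (size_t i = 0; i < num_candidates; i++)
    {
      edit_distance_t dist = naive_distance (name, candidates[i]);
      /* Strict "<" keeps the first candidate found on ties.  */
      if (dist < best_distance)
        {
          best = candidates[i];
          best_distance = dist;
        }
      if (stop_at_one && best_distance == 1)
        break;
    }
  return best;
}
```

With candidates "inr", "ins", "inp" and the name "inp", the full scan
returns the exact match "inp", whereas the early-exit variant stops at
"inr".  That only matters if exact matches can reach this function.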

> Is this code discriminating between types and names? That is, what happens for:
>
> typedef int ins;
>
> int foo(void)
> {
>     int inr;
>     inp x;
> }

Thanks.  I've fixed that.

> > +/* Recursively append candidate IDENTIFIER_NODEs to CANDIDATES.  */
> > +
> > +static void
> > +lookup_field_fuzzy_find_candidates (tree type, tree component,
> > +				    vec<tree> *candidates)
> > +{
> > +  tree field;
> > +  for (field = TYPE_FIELDS (type); field; field = DECL_CHAIN (field))
> > +    {
> > +      if (DECL_NAME (field) == NULL_TREE
> > +	  && (TREE_CODE (TREE_TYPE (field)) == RECORD_TYPE
> > +	      || TREE_CODE (TREE_TYPE (field)) == UNION_TYPE))
> > +	{
> > +	  lookup_field_fuzzy_find_candidates (TREE_TYPE (field),
> > +					      component,
> > +					      candidates);
> > +	}
> > +
> > +      if (DECL_NAME (field))
> > +	candidates->safe_push (field);
> > +    }
> > +}
>
> This is appending inner-most, isn't it? Thus, given:

Yes.

> struct s{
>      struct j { int aa; } kk;
>      int aa;
> };
>
> void foo(struct s x)
> {
>      x.ab;
> }
>
> it will find s::j::aa before s::aa, no?

AIUI, it doesn't look inside "kk"; it only recurses into anonymous
structs and unions.

I added a test for this.
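For reference, the distinction the quoted code relies on (it recurses
only when DECL_NAME (field) is NULL) can be seen in plain C11.  This is
an illustrative sketch, not the test added to the patch; the member
"bb" and the function name are made up for the example.

```c
#include <assert.h>

struct s
{
  struct j { int aa; } kk;  /* Named member: reached as x.kk.aa.  */
  struct { int bb; };       /* C11 anonymous struct: reached as x.bb.  */
  int aa;                   /* Does not clash with kk.aa.  */
};

/* Touch each field through the only spelling that is valid for it.  */
int
sum_named_and_anonymous (void)
{
  struct s x;
  x.kk.aa = 1;  /* "x.aa" would name the outer field instead.  */
  x.bb = 2;     /* Visible directly, as if declared in struct s.  */
  x.aa = 4;
  return x.kk.aa + x.bb + x.aa;
}
```

Since "bb" is usable directly on a "struct s", it is a sensible
candidate for a misspelled member of "struct s"; "kk.aa" is not,
because the user would have had to type "kk" first.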

> >   tree
> > -build_component_ref (location_t loc, tree datum, tree component)
> > +build_component_ref (location_t loc, tree datum, tree component,
> > +		     source_range *ident_range)
> >   {
> >     tree type = TREE_TYPE (datum);
> >     enum tree_code code = TREE_CODE (type);
> > @@ -2294,7 +2356,31 @@ build_component_ref (location_t loc, tree datum, tree component)
> >
> >         if (!field)
> >   	{
> > -	  error_at (loc, "%qT has no member named %qE", type, component);
> > +	  if (!ident_range)
> > +	    {
> > +	      error_at (loc, "%qT has no member named %qE",
> > +			type, component);
> > +	      return error_mark_node;
> > +	    }
> > +	  gcc_rich_location richloc (*ident_range);
> > +	  if (TREE_CODE (datum) == INDIRECT_REF)
> > +	    richloc.add_expr (TREE_OPERAND (datum, 0));
> > +	  else
> > +	    richloc.add_expr (datum);
> > +	  field = lookup_field_fuzzy (type, component);
> > +	  if (field)
> > +	    {
> > +	      error_at_rich_loc
> > +		(&richloc,
> > +		 "%qT has no member named %qE; did you mean %qE?",
> > +		 type, component, field);
> > +	      /* FIXME: error recovery: should we try to keep going,
> > +		 with "field"? (having issued an error, and hence no
> > +		 output).  */
> > +	    }
> > +	  else
> > +	    error_at_rich_loc (&richloc, "%qT has no member named %qE",
> > +			       type, component);
> >   	  return error_mark_node;
> >   	}
>
> I don't understand why looking for a candidate or not depends on ident_range.

This is because the old patch was integrated with the source_range
ideas from the rest of the patch kit.  I've taken that out in the new
version.

> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/spellcheck.c
> > @@ -0,0 +1,36 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-fdiagnostics-show-caret" } */
> > +
> > +struct foo
> > +{
> > +  int foo;
> > +  int bar;
> > +  int baz;
> > +};
> > +
> > +int test (struct foo *ptr)
> > +{
> > +  return ptr->m_bar; /* { dg-error "'struct foo' has no member named 'm_bar'; did you mean 'bar'?" } */
> > +
> > +/* { dg-begin-multiline-output "" }
> > +   return ptr->m_bar;
> > +          ~~~  ^~~~~
> > +   { dg-end-multiline-output "" } */
> > +}
> > +
> > +int test2 (void)
> > +{
> > +  struct foo instance = {};
> > +  return instance.m_bar; /* { dg-error "'struct foo' has no member named 'm_bar'; did you mean 'bar'?" } */
> > +
> > +/* { dg-begin-multiline-output "" }
> > +   return instance.m_bar;
> > +          ~~~~~~~~ ^~~~~
> > +   { dg-end-multiline-output "" } */
> > +}
> > +
> > +int64 foo; /* { dg-error "unknown type name 'int64'; did you mean 'int'?" } */
> > +/* { dg-begin-multiline-output "" }
> > + int64 foo;
> > + ^~~~~
> > +   { dg-end-multiline-output "" } */
> >
>
>
> These tests could also test different scopes, clashes between types and fields
> and variables, and the correct behavior for nested struct/unions.

Thanks; added to TODO list below.

> I wonder whether it would be worth it to extend existing tests if now they emit
> the "do you mean" part to be sure they are doing the right thing.

Thanks; added to TODO list below.  These are passing now due to the
dg-error regexp not caring about the exact message.

Many of the field names in these tests are very short; it's not clear
to me that there's a good single suggestion that can be made if there
are several 1-char field names to choose from.

I noticed that the old patch could sometimes offer unhelpful
suggestions; I added a test for this:

  nonsensical_suggestion_t var;

where it would suggest something unrelated.  I suppressed that in
lookup_name_fuzzy by only offering a suggestion if the distance is less
than half of the length of what the user typed.  That seemed to work
well in the few cases I tried.  I suspect that we may want a similar
suppression for lookup_field_fuzzy.
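A minimal sketch of that cutoff follows.  The function name is
hypothetical, and the rounding is an assumption: this mail does not say
whether the patch compares against an integer-divided half, so the
sketch compares 2 * distance against the length to avoid rounding.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int edit_distance_t;

/* Return nonzero if a candidate DISTANCE edits away from an identifier
   of TYPED_LEN characters is close enough to be worth suggesting,
   i.e. if the distance is strictly less than half the typed length.
   Assumption: no rounding, so 2 * distance is compared directly.  */
int
suggestion_is_plausible (edit_distance_t distance, size_t typed_len)
{
  return (size_t) distance * 2 < typed_len;
}
```

Under this reading, "int64" against "int" (distance 2, length 5) still
gets a suggestion, while "int64_t" against "int" (distance 4, length 7)
does not.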

> Cheers,
>
> Manuel.

Thanks.

Updated version of the patch follows.

This version of the patch is independent of the rest of the kit,
and applies directly on top of trunk (r227562, specifically).

Changes since previous version:
- it's now independent of the rest of the patch kit.
- removal of tracking of fieldname range "ident_range" from calls
  to build_component_ref, just using the location_t.
- removal of show-caret/multiline tests from testcase
- introduced a typedef "edit_distance_t", using it to convert
  the underlying type from "int" to "unsigned int".
- "lookup_name_fuzzy" now only considers bindings of a TYPE_DECL,
  thus matching "ins" rather than "inr" for the example given by Manu
  here:
    https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00813.html
- lookup_name_fuzzy: don't offer a suggestion if the distance is too
  high, since such a suggestion is likely to be bogus
- added test coverage to try to cover the above
- reimplemented levenshtein_distance to avoid allocating and building
  an (m + 1) * (n + 1) matrix in favor of just tracking two rows
  at once
- made levenshtein_distance_unit_tests automatically run each test
  both ways; added some more tests
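
The two-row idea mentioned above can be sketched as follows.  This is
not the code from the patch (which operates on IDENTIFIER_NODEs); it is
a standalone illustration on C strings, and two_row_levenshtein and
min3 are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int edit_distance_t;

/* Smallest of three values.  */
static edit_distance_t
min3 (edit_distance_t a, edit_distance_t b, edit_distance_t c)
{
  edit_distance_t m = a < b ? a : b;
  return m < c ? m : c;
}

/* Levenshtein distance between S and T, keeping only two rows of the
   dynamic-programming table: O(strlen (T)) space rather than
   O(strlen (S) * strlen (T)).  */
edit_distance_t
two_row_levenshtein (const char *s, const char *t)
{
  assert (s && t);
  size_t m = strlen (s), n = strlen (t);

  edit_distance_t *prev = malloc ((n + 1) * sizeof *prev);
  edit_distance_t *curr = malloc ((n + 1) * sizeof *curr);
  assert (prev && curr);

  /* Row 0: turning "" into the first j chars of T costs j insertions.  */
  for (size_t j = 0; j <= n; j++)
    prev[j] = (edit_distance_t) j;

  for (size_t i = 1; i <= m; i++)
    {
      /* Column 0: deleting the first i chars of S.  */
      curr[0] = (edit_distance_t) i;
      for (size_t j = 1; j <= n; j++)
        {
          edit_distance_t cost = s[i - 1] == t[j - 1] ? 0 : 1;
          curr[j] = min3 (prev[j] + 1,          /* deletion */
                          curr[j - 1] + 1,      /* insertion */
                          prev[j - 1] + cost);  /* substitution or match */
        }
      /* The row just computed becomes the previous row.  */
      edit_distance_t *tmp = prev;
      prev = curr;
      curr = tmp;
    }

  edit_distance_t result = prev[n];
  free (prev);
  free (curr);
  return result;
}
```

Checking each pair in both directions, as the unit tests now do, is a
cheap way to catch an off-by-one in the row or column setup, since the
distance is symmetric.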

I attempted the error-recovery in build_component_ref, but I found it
could make things worse.  For example, in
gcc/testsuite/gcc.dg/anon-struct-11.c:
  f3 (&e.D);		/* { dg-error "no member" } */
becomes:
  error: 'struct E' has no member named 'D'; did you mean 'b'?
but if we try to use "b", this then leads to these additional bogus
messages:
  warning: passing argument 1 of 'f3' from incompatible
    pointer type [-Wincompatible-pointer-types]
  note: expected 'D * {aka struct <anonymous> *}' but
    argument is of type 'char *'

Similarly, in gcc/testsuite/gcc.dg/c11-anon-struct-2.c:
  x.i = 0; /* { dg-error "has no member" } */
this becomes:
  error: 'struct s5' has no member named 'i'; did you mean 'a'?
which then leads to:
  error: incompatible types when assigning to type
   'struct <anonymous>' from type 'int'

So this version of the patch doesn't attempt to use the suggested
field.

Successfully bootstrapped&regrtested on x86_64-pc-linux-gnu; adds
9 PASSes to gcc.sum.

I'm posting it here as a work-in-progress.

Remaining work:
  * the FIXME about where to call levenshtein_distance_unit_tests;
    there's an argument that this could be moved to libiberty (is C++
    allowed in libiberty?); I'd prefer to get the unittest idea from
      https://gcc.gnu.org/ml/gcc-patches/2015-06/msg00765.html
    into trunk, and then move it into there.  Right now it's all
    gcc_assert, so it optimizes away in a production build.
  * more testcases as noted by Manu above
  * try existing testcases as noted by Manu above
  * possible early return when distance == 1
  * perhaps some kind of limit on the number of iterations inside
    levenshtein_distance (e.g. governed by a param).
  * perhaps some ability to pass in a limit on the
    distance we care about, so we can immediately reject distances
    that will be above this
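
One cheap way to get the "immediately reject" behaviour in the last
item is a length check before running the full algorithm: each edit
changes the length by at most one, so the difference in lengths is a
lower bound on the distance.  A sketch, with a hypothetical function
name and assuming C strings rather than trees:

```c
#include <assert.h>
#include <string.h>

typedef unsigned int edit_distance_t;

/* Return nonzero if the Levenshtein distance between S and T is
   certain to exceed LIMIT, judging from the lengths alone.  A zero
   result means "unknown": the full computation is still needed.  */
int
distance_must_exceed (const char *s, const char *t, edit_distance_t limit)
{
  assert (s && t);
  size_t m = strlen (s), n = strlen (t);
  size_t gap = m > n ? m - n : n - m;
  return gap > limit;
}
```

This rejects "int64_t" against "int" for any limit below 4 without
touching the characters, but it says nothing about same-length strings.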

It also strikes me that sometimes a "misspelling" is a missing
header file, and that the most helpful thing to do might be to
suggest including that header file.  For instance given:
  $ cat /tmp/foo.c
  int64_t i;

  $ ./xgcc -B. /tmp/foo.c
  /tmp/foo.c:1:1: error: unknown type name ‘int64_t’
  int64_t i;
  ^
(where the suggestion of "int" is suppressed due to the distance
being too high) it might be helpful to print:
  /tmp/foo.c:1:1: error: unknown type name 'int64_t'; did you mean to include '<inttypes.h>'?
  int64_t i;
  ^
That does seem like a separate enhancement, though.
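
As a rough illustration of what such a hint could be driven by, here is
a sketch of a fixed lookup table from well-known type names to the
standard header that declares them.  Everything here is hypothetical
(the struct, the table contents, the function name); nothing like it is
in the patch.  It lists <stdint.h> for int64_t; <inttypes.h> would also
work, since that header includes <stdint.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from a type name to the header declaring it.  */
struct header_hint
{
  const char *name;
  const char *header;
};

static const struct header_hint hints[] = {
  { "int64_t", "<stdint.h>" },
  { "size_t", "<stddef.h>" },
  { "FILE", "<stdio.h>" },
};

/* Return the header to suggest for the unknown type NAME, or NULL if
   the table has no entry for it.  */
const char *
header_for_type_name (const char *name)
{
  assert (name);
  for (size_t i = 0; i < sizeof hints / sizeof hints[0]; i++)
    if (strcmp (hints[i].name, name) == 0)
      return hints[i].header;
  return NULL;
}
```

An exact-match table like this would be consulted before (or instead
of) the fuzzy lookup, since a missing header is not a typo.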

gcc/ChangeLog:
	* Makefile.in (OBJS): Add spellcheck.o.
	* spellcheck.c: New file.
	* spellcheck.h: New file.

gcc/c-family/ChangeLog:
	* c-common.h (lookup_name_fuzzy): New decl.

gcc/c/ChangeLog:
	* c-decl.c: Include spellcheck.h.
	(lookup_name_fuzzy): New.
	* c-parser.c: Include spellcheck.h.
	(c_parser_declaration_or_fndef): If "unknown type name",
	attempt to suggest a close match using lookup_name_fuzzy.
	* c-typeck.c: Include spellcheck.h.
	(lookup_field_fuzzy_find_candidates): New function.
	(lookup_field_fuzzy): New function.
	(build_component_ref): Use lookup_field_fuzzy to suggest close
	matches when printing field-not-found error.

gcc/testsuite/ChangeLog:
	* gcc.dg/spellcheck.c: New file.
---
 gcc/Makefile.in                   |   1 +
 gcc/c-family/c-common.h           |   1 +
 gcc/c/c-decl.c                    |  45 +++++++++++
 gcc/c/c-parser.c                  |  11 ++-
 gcc/c/c-typeck.c                  |  66 ++++++++++++++-
 gcc/spellcheck.c                  | 166 ++++++++++++++++++++++++++++++++++++++
 gcc/spellcheck.h                  |  35 ++++++++
 gcc/testsuite/gcc.dg/spellcheck.c |  49 +++++++++++
 8 files changed, 371 insertions(+), 3 deletions(-)
 create mode 100644 gcc/spellcheck.c
 create mode 100644 gcc/spellcheck.h
 create mode 100644 gcc/testsuite/gcc.dg/spellcheck.c
