diff mbox

pch bug fix

Message ID F756187A-91EC-46CD-888A-EF3D923DA887@comcast.net
State New
Headers show

Commit Message

Mike Stump Jan. 1, 2014, 6:39 a.m. UTC
In testing for wide-int, we discovered that someone seems to have blown pch.  The problem is that the optabs field is not a string.  It's size is not determined by strlen.  strlen are the semantics gty attaches to unsigned char * data.  By defeating the type of optabs to being any other type, then the length is determined by the ggc subsystem which just knows the length of the data (2K on my system).

I don't really like the type lying going on.  I'd prefer if a tree-core person would get honest with the types.

Another style of fix would be to decide that atomic in this case means something else, then, we'd have to have a gty person implement the new semantics.  Until that is done, this problem is so bad, that the below should go in until someone can fix gty, if someone doesn't want to be honest with the type system.

This bug comes across as a random memory smasher and is incredibly annoying to track down.  There is non-determinism in the process and will causes non-deterministic memory crashes, but, only on pch use.  To track down why, you have to trace back through the pch writing process and realize the data in memory is completely valid, is it only the meta information about that data that pch uses from the GTY world that causes any problem.

Ok?

Comments

Jakub Jelinek Jan. 1, 2014, 7:46 a.m. UTC | #1
On Tue, Dec 31, 2013 at 10:39:58PM -0800, Mike Stump wrote:
> In testing for wide-int, we discovered that someone seems to have blown
> pch.  The problem is that the optabs field is not a string.  It's size is
> not determined by strlen.  strlen are the semantics gty attaches to
> unsigned char * data.  By defeating the type of optabs to being any other
> type, then the length is determined by the ggc subsystem which just knows
> the length of the data (2K on my system).
> 
> I don't really like the type lying going on.  I'd prefer if a tree-core
> person would get honest with the types.
> 
> Another style of fix would be to decide that atomic in this case means
> something else, then, we'd have to have a gty person implement the new
> semantics.  Until that is done, this problem is so bad, that the below
> should go in until someone can fix gty, if someone doesn't want to be
> honest with the type system.
> 
> This bug comes across as a random memory smasher and is incredibly
> annoying to track down.  There is non-determinism in the process and will
> causes non-deterministic memory crashes, but, only on pch use.  To track
> down why, you have to trace back through the pch writing process and
> realize the data in memory is completely valid, is it only the meta
> information about that data that pch uses from the GTY world that causes
> any problem.

Thanks for tracking this down, this sounds like PR59436.  How have you
managed to track it down?  I also wonder why it doesn't seem to affect 4.8
when it also has the same change.

Based on the comments in gengtype.c, I'd expect
  unsigned char *GTY ((atomic, length ("sizeof (struct target_optabs)"))) optabs;
to work (or, perhaps instead of sizeof directly call a function that is
defined in optabs.c and returns that value, perhaps optabs.h isn't included
where it should be), but unfortunately it seems to be rejected right now.

So, the question is how hard would it be to support that.

	Jakub
Mike Stump Jan. 2, 2014, 12:20 a.m. UTC | #2
On Dec 31, 2013, at 11:46 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Dec 31, 2013 at 10:39:58PM -0800, Mike Stump wrote:
>> In testing for wide-int, we discovered that someone seems to have blown
>> pch

> Thanks for tracking this down, this sounds like PR59436.

To confirm that, one would need to either, investigate those issues, if it is the same, it is just plain hard work, or statistically work up a probability of failure, and run the test case enough to be 99.9% certain in a change of test case result and then drop my patch in, and then check.  In my case, I was near 30% failure, so, I could run it 20 times, and be 99.99999% certain.  I didn't judge my work by the probabilistic method; rather I only used it to double check after I was certain I understood and fixed the problem.  I read through the PR and could not rise above 10% confidence that it is the same problem.

> How have you managed to track it down?

:-)  You know, I'd love to teach an AI how to debug gcc, then, we could just use it to debug gcc, and it would have found the problem in 20 minutes on my crappy computer (16 core), I'd love to use my nice computer (64 cores) on a hard problem with such a program.  One day, this will come to pass, but, we seem so far away still from that goal.

I'll give you an overview of how it should be done, and I mostly did this, though, at times not quite in this order:

First, was the determination that the issue, when pch isn't used, doesn't occur.  Next, was to find the data that is wrong and the address it is at.  You see if the wrong data is memory smashed or created that way purposefully.  In my case, it was smashed by an unrelated data structure.  Next, why, what formed that address for both pieces of data?  In my case, both were ggc allocated, and valid addresses, wait, that's can't be.  Yup, it is the case.  The allocations are correct, and purposefully given out by ggc and are correct and both simultaneously live, wait, that can't be.  Size, it turns out, the person allocation the data structure was pch importing, and it has a size in mind, that size is way to small, and the write off the end of the allocated data structure clobbers the other unrelated data structure.  Now, why is the size wrong?  Comes from the pch file that way.  Why?  During initial allocation on the pch build size, the size is correct.  If pch is off, the size is correct, what?  Walk writing, notice that someone calls strlen on the data in question.  Wait, that's wrong, why does it do that?  Ah, char * is a string.  char[500], when you gty atomic it and takes it's address, it persisted out as a string, but that's wrong.  Make it not use strlen.

The rest of the process is then about fixing it.

I'm old skool, so, I remember the days of debugging gcc when we had memory smashers, lifetime issues and other such fun.  ggc was invented to get rid of a large class of errors in one wave of the magic wand, and today, those issues do seem rarer indeed.

> I also wonder why it doesn't seem to affect 4.8 when it also has the same change.

Don't know.  If no one writes to the pat_enable after pch load, then, trivially, clobbering can't happen; though, if someone ever reads pat_enable, then I can't imagine how it could work.

> Based on the comments in gengtype.c, I'd expect
>  unsigned char *GTY ((atomic, length ("sizeof (struct target_optabs)"))) optabs;

The comments should be fixed...

> to work (or, perhaps instead of sizeof directly call a function that is
> defined in optabs.c and returns that value, perhaps optabs.h isn't included
> where it should be), but unfortunately it seems to be rejected right now.

Yup, I thought the same, and first tried the same.  I too noticed it didn't work.

> So, the question is how hard would it be to support that.

I don't know.  All I know is that if a single other person stubs his toe on this issue, it will be a complete waste of their time trying to track it down.
diff mbox

Patch

Index: optabs.c
===================================================================
--- optabs.c    (revision 206086)
+++ optabs.c    (working copy)
@@ -6242,7 +6242,7 @@ 
 
   /* If the optabs changed, record it.  */
   if (memcmp (tmp_optabs, this_target_optabs, sizeof (struct target_optabs)))
-    TREE_OPTIMIZATION_OPTABS (optnode) = (unsigned char *) tmp_optabs;
+    TREE_OPTIMIZATION_OPTABS (optnode) = (struct target_optabs_S *) tmp_optabs;
   else
     {
       TREE_OPTIMIZATION_OPTABS (optnode) = NULL;
Index: tree-core.h
===================================================================
--- tree-core.h (revision 206086)
+++ tree-core.h (working copy)
@@ -1559,6 +1559,9 @@ 
   struct tree_statement_list_node *tail;
 };
 
+struct GTY(()) target_optabs_S {
+  unsigned char e[1];
+};
 
 /* Optimization options used by a function.  */
 
@@ -1570,7 +1573,7 @@ 
 
   /* Target optabs for this set of optimization options.  This is of
      type `struct target_optabs *'.  */
-  unsigned char *GTY ((atomic)) optabs;
+  struct target_optabs_S *GTY((atomic)) optabs;
 
   /* The value of this_target_optabs against which the optabs above were
      generated.  */