From patchwork Sat Nov 16 01:22:49 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Malcolm X-Patchwork-Id: 1195998 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-513722-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="R6Lw9bpd"; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.b="PeKEeKXG"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 47FHRR3vdGz9s7T for ; Sat, 16 Nov 2019 12:18:43 +1100 (AEDT) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:in-reply-to:references :content-type:content-transfer-encoding; q=dns; s=default; b=DSy xlqJ4gNx3+OhMZ4Os4RJxRwe76RtyLLk+UPx+bmeLd/mZFSzbcWMjX3gngzHwd0b K8zkSwijmHStajTSxnTKKw9Q+Bym4Otk4h+QD8bw42OC0mENOrh4VMneHnuKmDuM E9XLVOZ/PiQJC7BM2DrLeCofdBUcTQI5/gxSAasI= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:in-reply-to:references :content-type:content-transfer-encoding; s=default; bh=TV3suWowp BPhY/jCuyOuQxVxUCg=; b=R6Lw9bpdDA8O7PlOZNpVz6vQLG737VAw9iCcyIdBg Dz+/QbRhs+JMsrAi/lpZxl8s8QisOosq/s/5YkZQnVRqTrnqYuM/Vzm36hCLzTQi af/3jgqM5AjU+CHm+2uRR2gLimGqavlrkQFt/7wSo+aEOSpqMczbp4wUL37PADFQ 0o= Received: (qmail 55555 invoked by alias); 16 Nov 2019 01:17:13 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 55483 invoked by uid 89); 16 Nov 2019 01:17:13 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-15.1 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, SPAM_BODY, SPF_PASS, TLD_CHINA autolearn=ham version=3.3.1 spammy=Experience, defaulting, poisoned, Susan X-HELO: us-smtp-1.mimecast.com Received: from us-smtp-delivery-1.mimecast.com (HELO us-smtp-1.mimecast.com) (205.139.110.120) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Sat, 16 Nov 2019 01:17:08 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1573867026; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=v0/Qeqmv0badreqBM1gv7Z7Ezz9xkCd7wJ1FhdxU76Y=; b=PeKEeKXGuR/Ftyzvk1+YC/Qk0qKQWAnfShD7eiy6hARi+89UGmDjEUw4yjDtog+0jIP9+q +q/oqtmYubQnb0CoUjdiPprUqWR82rnBIX+I+QpS/MZqTa48zZgIAJype1CO071N664sDc XAY41BWl90P6eCMhXxXusYMVtQfxYZI= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-146-1PkJGMnPNI2Q8wmNDoScXw-1; Fri, 15 Nov 2019 20:17:03 -0500 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 96417107ACC4 for ; Sat, 16 Nov 2019 01:17:02 +0000 (UTC) Received: from c64.redhat.com (ovpn-112-32.phx2.redhat.com [10.3.112.32]) by smtp.corp.redhat.com (Postfix) with ESMTP id 93AEA10246FB; Sat, 16 Nov 2019 01:17:01 +0000 (UTC) From: David Malcolm To: gcc-patches@gcc.gnu.org Cc: David Malcolm Subject: [PATCH 02/49] analyzer: internal documentation Date: Fri, 15 Nov 2019 20:22:49 -0500 Message-Id: <1573867416-55618-3-git-send-email-dmalcolm@redhat.com> In-Reply-To: <1573867416-55618-1-git-send-email-dmalcolm@redhat.com> References: <1573867416-55618-1-git-send-email-dmalcolm@redhat.com> X-Mimecast-Spam-Score: 0 X-IsSubscribed: yes gcc/ChangeLog: * Makefile.in (TEXI_GCCINT_FILES): Add analyzer.texi. * doc/analyzer.texi: New file. * doc/gccint.texi ("Static Analyzer") New menu item. (analyzer.texi): Include it. --- gcc/Makefile.in | 2 +- gcc/doc/analyzer.texi | 470 ++++++++++++++++++++++++++++++++++++++++++++++++++ gcc/doc/gccint.texi | 2 + 3 files changed, 473 insertions(+), 1 deletion(-) create mode 100644 gcc/doc/analyzer.texi diff --git a/gcc/Makefile.in b/gcc/Makefile.in index c82858f..53d74e2 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -3182,7 +3182,7 @@ TEXI_GCCINT_FILES = gccint.texi gcc-common.texi gcc-vers.texi \ gnu.texi gpl_v3.texi fdl.texi contrib.texi languages.texi \ sourcebuild.texi gty.texi libgcc.texi cfg.texi tree-ssa.texi \ loop.texi generic.texi gimple.texi plugins.texi optinfo.texi \ - match-and-simplify.texi ux.texi poly-int.texi + match-and-simplify.texi analyzer.texi ux.texi poly-int.texi TEXI_GCCINSTALL_FILES = install.texi install-old.texi fdl.texi \ gcc-common.texi gcc-vers.texi diff --git a/gcc/doc/analyzer.texi b/gcc/doc/analyzer.texi new file mode 100644 index 0000000..8e7b26f --- /dev/null +++ b/gcc/doc/analyzer.texi @@ -0,0 +1,470 @@ +@c Copyright (C) 2019 Free Software Foundation, Inc. +@c This is part of the GCC manual. +@c For copying conditions, see the file gcc.texi. +@c Contributed by David Malcolm . + +@node Static Analyzer +@chapter Static Analyzer +@cindex analyzer +@cindex static analysis +@cindex static analyzer + +@menu +* Analyzer Internals:: Analyzer Internals +* Debugging the Analyzer:: Useful debugging tips +@end menu + +@node Analyzer Internals +@section Analyzer Internals +@cindex analyzer, internals +@cindex static analyzer, internals + +@subsection Overview + +The analyzer implementation works on the gimple-SSA representation. +(I chose this in the hopes of making it easy to work with LTO to +do whole-program analysis). + +The implementation is read-only: it doesn't attempt to change anything, +just emit warnings. + +First, we build a @code{supergraph} which combines the callgraph and all +of the CFGs into a single directed graph, with both interprocedural and +intraprocedural edges. The nodes and edges in the supergraph are called +``supernodes'' and ``superedges'', and often referred to in code as +@code{snodes} and @code{sedges}. Basic blocks in the CFGs are split at +interprocedural calls, so there can be more than one supernode per +basic block. Most statements will be in just one supernode, but a call +statement can appear in two supernodes: at the end of one for the call, +and again at the start of another for the return. + +The supergraph can be seen using @option{-fdump-analyzer-supergraph}. + +We then build an @code{analysis_plan} which walks the callgraph to +determine which calls might be suitable for being summarized (rather +than fully explored) and thus in what order to explore the functions. + +Next is the heart of the analyzer: we use a worklist to explore state +within the supergraph, building an "exploded graph". +Nodes in the exploded graph correspond to pairs, as in + "Precise Interprocedural Dataflow Analysis via Graph Reachability" + (Thomas Reps, Susan Horwitz and Mooly Sagiv). + +We reuse nodes for pairs we've already seen, and avoid +tracking state too closely, so that (hopefully) we rapidly converge +on a final exploded graph, and terminate the analysis. We also bail +out if the number of exploded nodes gets +larger than a particular multiple of the total number of basic blocks +(to ensure termination in the face of pathological state-explosion +cases, or bugs). We also stop exploring a point once we hit a limit +of states for that point. + +We can identify problems directly when processing a +instance. For example, if we're finding the successors of + +@smallexample + +@end smallexample + +then we can detect a double-free of "ptr". We can then emit a path +to reach the problem by finding the simplest route through the graph. + +Program points in the analysis are much more fine-grained than in the +CFG and supergraph, with points (and thus potentially exploded nodes) +for various events, including before individual statements. +By default the exploded graph merges multiple consecutive statements +in a supernode into one exploded edge to minimize the size of the +exploded graph. This can be suppressed via +@option{-fanalyzer-fine-grained}. +The fine-grained approach seems to make things simpler and more debuggable +that other approaches I tried, in that each point is responsible for one +thing. + +Program points in the analysis also have a "call string" identifying the +stack of callsites below them, so that paths in the exploded graph +correspond to interprocedurally valid paths: we always return to the +correct call site, propagating state information accordingly. +We avoid infinite recursion by stopping the analysis if a callsite +appears more than @code{analyzer-max-recursion-depth} in a callstring +(defaulting to 2). + +@subsection Graphs + +Nodes and edges in the exploded graph are called ``exploded nodes'' and +``exploded edges'' and often referred to in the code as +@code{enodes} and @code{eedges} (especially when distinguishing them +from the @code{snodes} and @code{sedges} in the supergraph). + +Each graph numbers its nodes, giving unique identifiers - supernodes +are referred to throughout dumps in the form @samp{SN': @var{index}} and +exploded nodes in the form @samp{EN: @var{index}} (e.g. @samp{SN: 2} and +@samp{EN:29}). + +The supergraph can be seen using @option{-fdump-analyzer-supergraph-graph}. + +The exploded graph can be seen using @option{-fdump-analyzer-exploded-graph} +and other dump options. Exploded nodes are color-coded in the .dot output +based on state-machine states to make it easier to see state changes at +a glance. + +@subsection State Tracking + +There's a tension between: +@itemize @bullet +@item +precision of analysis in the straight-line case, vs +@item +exponential blow-up in the face of control flow. +@end itemize + +For example, in general, given this CFG: + +@smallexample + A + / \ + B C + \ / + D + / \ + E F + \ / + G +@end smallexample + +we want to avoid differences in state-tracking in B and C from +leading to blow-up. If we don't prevent state blowup, we end up +with exponential growth of the exploded graph like this: + +@smallexample + + 1:A + / \ + / \ + / \ + 2:B 3:C + | | + 4:D 5:D (2 exploded nodes for D) + / \ / \ + 6:E 7:F 8:E 9:F + | | | | + 10:G 11:G 12:G 13:G (4 exploded nodes for G) + +@end smallexample + +Similar issues arise with loops. + +To prevent this, we follow various approaches: + +@enumerate a +@item +state pruning: which tries to discard state that won't be relevant +later on withing the function. +This can be disabled via @option{-fno-analyzer-state-purge}. + +@item +state merging. We can try to find the commonality between two +program_state instances to make a third, simpler program_state. +We have two strategies here: + + @enumerate + @item + the worklist keeps new nodes for the same program_point together, + and tries to merge them before processing, and thus before they have + successors. Hence, in the above, the two nodes for D (4 and 5) reach + the front of the worklist together, and we create a node for D with + the merger of the incoming states. + + @item + try merging with the state of existing enodes for the program_point + (which may have already been explored). There will be duplication, + but only one set of duplication; subsequent duplicates are more likely + to hit the cache. In particular, (hopefully) all merger chains are + finite, and so we guarantee termination. + This is intended to help with loops: we ought to explore the first + iteration, and then have a "subsequent iterations" exploration, + which uses a state merged from that of the first, to be more abstract. + @end enumerate + +We avoid merging pairs of states that have state-machine differences, +as these are the kinds of differences that are likely to be most +interesting. So, for example, given: + +@smallexample + if (condition) + ptr = malloc (size); + else + ptr = local_buf; + + .... do things with 'ptr' + + if (condition) + free (ptr); + + ...etc +@end smallexample + +then we end up with an exploded graph that looks like this: + +@smallexample + + if (condition) + / T \ F + --------- ---------- + / \ + ptr = malloc (size) ptr = local_buf + | | + copy of copy of + "do things with 'ptr'" "do things with 'ptr'" + with ptr: heap-allocated with ptr: stack-allocated + | | + if (condition) if (condition) + | known to be T | known to be F + free (ptr); | + \ / + ----------------------------- + | ('ptr' is pruned, so states can be merged) + etc + +@end smallexample + +where some duplication has occurred, but only for the places where the +the different paths are worth exploringly separately. + +Merging can be disabled via @option{-fno-analyzer-state-merge}. +@end enumerate + +@subsection Region Model + +Part of the state stored at a @code{exploded_node} is a @code{region_model}. +This is an implementation of the region-based ternary model described in +@url{http://lcs.ios.ac.cn/~xuzb/canalyze/memmodel.pdf, +"A Memory Model for Static Analysis of C Programs"} +(Zhongxing Xu, Ted Kremenek, and Jian Zhang). + +A @code{region_model} encapsulates a representation of the state of +memory, with a tree of @code{region} instances, along with their associated +values. The representation is graph-like because values can be pointers +to regions. It also stores a constraint_manager, capturing relationships +between the values. + +Because each node in the @code{exploded_graph} has a @code{region_model}, +and each of the latter is graph-like, the @code{exploded_graph} is in some +ways a graph of graphs. + +Here's an example of printing a @code{region_model}, showing the ASCII-art +used to visualize the region hierarchy (colorized when printing to stderr): + +@smallexample +(gdb) call debug (*this) +r0: @{kind: 'root', parent: null, sval: null@} +|-stack: r1: @{kind: 'stack', parent: r0, sval: sv1@} +| |: sval: sv1: @{poisoned: uninit@} +| |-frame for 'test': r2: @{kind: 'frame', parent: r1, sval: null, map: @{'ptr_3': r3@}, function: 'test', depth: 0@} +| | `-'ptr_3': r3: @{kind: 'map', parent: r2, sval: sv3, type: 'void *', map: @{@}@} +| | |: sval: sv3: @{type: 'void *', unknown@} +| | |: type: 'void *' +| `-frame for 'calls_malloc': r4: @{kind: 'frame', parent: r1, sval: null, map: @{'result_3': r7, '_4': r8, '': r5@}, function: 'calls_malloc', depth: 1@} +| |-'': r5: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@} +| | |: sval: sv4: @{type: 'void *', &r6@} +| | |: type: 'void *' +| |-'result_3': r7: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@} +| | |: sval: sv4: @{type: 'void *', &r6@} +| | |: type: 'void *' +| `-'_4': r8: @{kind: 'map', parent: r4, sval: sv4, type: 'void *', map: @{@}@} +| |: sval: sv4: @{type: 'void *', &r6@} +| |: type: 'void *' +`-heap: r9: @{kind: 'heap', parent: r0, sval: sv2@} + |: sval: sv2: @{poisoned: uninit@} + `-r6: @{kind: 'symbolic', parent: r9, sval: null, map: @{@}@} +svalues: + sv0: @{type: 'size_t', '1024'@} + sv1: @{poisoned: uninit@} + sv2: @{poisoned: uninit@} + sv3: @{type: 'void *', unknown@} + sv4: @{type: 'void *', &r6@} +constraint manager: + equiv classes: + ec0: @{sv0 == '1024'@} + ec1: @{sv4@} + constraints: +@end smallexample + +This is the state at the point of returning from @code{calls_malloc} back +to @code{test} in the following: + +@smallexample +void * +calls_malloc (void) +@{ + void *result = malloc (1024); + return result; +@} + +void test (void) +@{ + void *ptr = calls_malloc (); + /* etc. */ +@} +@end smallexample + +The ``root'' region (``r0'') has a ``stack'' child (``r1''), with two +children: a frame for @code{test} (``r2''), and a frame for +@code{calls_malloc} (``r4''). These frame regions have child regions for +storing their local variables. For example, the return region +and that of various other regions within the ``calls_malloc'' frame all have +value ``sv4'', a pointer to a heap-allocated region ``r6''. Within the parent +frame, @code{ptr_3} has value ``sv3'', an unknown @code{void *}. + +@subsection Analyzer Paths + +We need to explain to the user what the problem is, and to persuade them +that there really is a problem. Hence having a @code{diagnostic_path} +isn't just an incidental detail of the analyzer; it's required. + +Paths ought to be: +@itemize @bullet +@item +interprocedurally-valid +@item +feasible +@end itemize + +Without state-merging, all paths in the exploded graph are feasible +(in terms of constraints being satisified). +With state-merging, paths in the exploded graph can be infeasible. + +We collate warnings and only emit them for the simplest path +e.g. for a bug in a utility function, with lots of routes to calling it, +we only emit the simplest path (which could be intraprocedural, if +it can be reproduced without a caller). We apply a check that +each duplicate warning's shortest path is feasible, rejecting any +warnings for which the shortest path is infeasible (which could lead to +false negatives). + +We use the shortest feasible path through the exploded_graph (a list of +edges) to build a @code{diagnostic_path} (a list of events for the +diagnostic subsystem). + +@subsection Limitations + +@itemize @bullet +@item +Only for C so far +@item +The implementation of call summaries is currently very simplistic. +@item +Lack of function pointer analysis +@item +The region model code creates lots of little mutable objects at each +@code{region_model} (and thus per @code{exploded_node}) rather than +sharing immutable objects and having the mutable state in the +@code{program_state} or @code{region_model}. The latter approach might be +more efficient, and might avoid dealing with IDs rather than pointers +(which requires us to impose an ordering to get meaningful equality). +@item +The region model code doesn't yet support @code{memcpy}. At the +gimple-ssa level these have been optimized to statements like this: +@smallexample +_10 = MEM [(char * @{ref-all@})&c] +MEM [(char * @{ref-all@})&d] = _10; +@end smallexample +Perhaps they could be supported via a new @code{compound_svalue} type. +@item +There are various other limitations in the region model (grep for TODO/xfail +in the testsuite). +@item +The constraint_manager's implementation of transitivity is currently too +expensive to enable by default and so must be manually enabled via +@option{-fanalyzer-transitivity}). +@item +The checkers are currently hardcoded and don't allow for user extensibility +(e.g. adding allocate/release pairs). +@item +Although the analyzer's test suite has a proof-of-concept test case for +LTO, it will crash on anything non-trivial. There are various +lang-specific things in the analyzer that assume C rather than LTO. +For example, casts are currently implemented via @code{convert}, and this +leads to an ICE on LTO. Similarly, SSA names are printed to the user in +``raw'' form, rather than printing the underlying variable name. +@end itemize + +Some ideas for other checkers +@itemize @bullet +@item +File-descriptor-based APIs +@item +Linux kernel internal APIs +@item +Signal handling +@end itemize + +@node Debugging the Analyzer +@section Debugging the Analyzer +@cindex analyzer, debugging +@cindex static analyzer, debugging + +@subsection Builtins for Debugging the Analyzer + +Add: +@smallexample + __analyzer_break (); +@end smallexample +to the source being analyzed to trigger a breakpoint in the analyzer when +that source is reached. By putting a series of these in the source, it's +much easier to effectively step through the program state as it's analyzed. + +@smallexample +__analyzer_dump (); +@end smallexample + +will dump the copious information about the analyzer's state each time it +reaches the call in its traversal of the source. + +@smallexample +__analyzer_dump_path (); +@end smallexample + +will emit a placeholder ``note'' diagnostic with a path to that call site, +if the analyzer finds a feasible path to it. + +The builtin @code{__analyzer_dump_exploded_nodes} will dump information +after analysis on all of the exploded nodes at that program point: + +@smallexample + __analyzer_dump_exploded_nodes (0); +@end smallexample + +will dump just the number of nodes, and their IDs. + +@smallexample + __analyzer_dump_exploded_nodes (1); +@end smallexample + +will also dump all of the states within those nodes. + +@code{region_model::get_value_by_name} can be used when inserting custom +debugging code (e.g. adding it @code{region_model::validate} to detect the +point at which a named variable acquires an unexpected state). + +@subsection Other Debugging Techniques + +One approach when tracking down where a particular bogus state is +introduced into the @code{exploded_graph} is to add custom code to +@code{region_model::validate}. + +For example, this custom code breaks with an assertion failure when a +variable called @code{ptr} acquires a value that's unknown: + +@smallexample + + /* Find a variable matching "ptr". */ + svalue_id sid = get_value_by_name ("ptr"); + if (!sid.null_p ()) + @{ + svalue *sval = get_svalue (sid); + gcc_assert (sval->get_kind () != SK_UNKNOWN); + @} +@end smallexample + +making it easier to investigate further in a debugger when this occurs. diff --git a/gcc/doc/gccint.texi b/gcc/doc/gccint.texi index 99ec005..9dd935d 100644 --- a/gcc/doc/gccint.texi +++ b/gcc/doc/gccint.texi @@ -125,6 +125,7 @@ Additional tutorial information is linked to from * LTO:: Using Link-Time Optimization. * Match and Simplify:: How to write expression simplification patterns for GIMPLE and GENERIC +* Static Analyzer:: Working with the static analyzer. * User Experience Guidelines:: Guidelines for implementing diagnostics and options. * Funding:: How to help assure funding for free software. * GNU Project:: The GNU Project and GNU/Linux. @@ -163,6 +164,7 @@ Additional tutorial information is linked to from @include plugins.texi @include lto.texi @include match-and-simplify.texi +@include analyzer.texi @include ux.texi @include funding.texi