Lecture Notes On Static Analysis

8m ago
364.44 KB
58 Pages

Lecture Notes on Static AnalysisMichael I. SchwartzbachBRICS, Department of Computer ScienceUniversity of Aarhus, [email protected] notes present principles and applications of static analysis ofprograms. We cover type analysis, lattice theory, control flow graphs,dataflow analysis, fixed-point algorithms, narrowing and widening, interprocedural analysis, control flow analysis, and pointer analysis. A tinyimperative programming language with heap pointers and function pointers is subjected to numerous different static analyses illustrating the techniques that are presented.The style of presentation is intended to be precise but not overly formal. The readers are assumed to be familiar with advanced programminglanguage concepts and the basics of compiler construction.1

Contents12IntroductionA Tiny Example LanguageExample Programs3Type AnalysisTypesType ConstraintsSolving ConstraintsSlack and Limitations4Lattice TheoryLatticesFixed-PointsClosure PropertiesEquations and Inequations5Control Flow GraphsControl Flow Graphs for Statements6Dataflow AnalysisFixed-Point AlgorithmsExample: LivenessExample: Available ExpressionsExample: Very Busy ExpressionsExample: Reaching DefinitionsForwards, Backwards, May, and MustExample: Initialized VariablesExample: Sign AnalysisExample: Constant Propagation789Widening and NarrowingConditions and AssertionsInterprocedural AnalysisFlow Graphs for ProgramsPolyvarianceExample: Tree Shaking10Control Flow AnalysisControl Flow Analysis for the λ-CalculusThe Cubic AlgorithmControl Flow Graphs for Function PointersClass Hierarchy Analysis11Pointer AnalysisPoints-To AnalysisAndersen’s AlgorithmSteensgaard’s AlgorithmInterprocedural Points-To AnalysisExample: Null Pointer AnalysisExample: Shape AnalysisExample: Better Shape AnalysisExample: Escape 555858

1IntroductionThere are many interesting questions that can be asked about a given program: does the program terminate? how large can the heap become during execution? what is the possible output?Other questions concern individual program points in the source code: does the variable x always have the same value? will the value of x be read in the future? can the pointer p be null? which variables can p point to? is the variable x initialized before it is read? is the value of the integer variable x always positive? what is a lower and upper bound on the value of the integer variable x? at which program points could x be assigned its current value? do p and q point to disjoint structures in the heap?Rice’s theorem is a general result from 1953 that informally can be paraphrasedas stating that all interesting questions about the behavior of programs areundecidable. This is easily seen for any special case. Assume for example theexistence of an analyzer that decides if a variable in a program has a constantvalue. We could exploit this analyzer to also decide the halting problem byusing as input the program:x 17; if (TM(j)) x 18;Here x has a constant value if and only if the j’th Turing machine halts onempty input.This seems like a discouraging result. However, our real focus is not to decidesuch properties but rather to solve practical problems like making the programrun faster or use less space, or finding bugs in the program. The solution isto settle for approximative answers that are still precise enough to fuel ourapplications.Most often, such approximations are conservative, meaning that all errorslean to the same side, which is determined by our intended application.Consider again the problem of determining if a variable has a constant value.If our intended application is to perform constant propagation, then the analysismay only answer yes if the variable really is a constant and must answer no ifthe variable may or may not be a constant. The trivial solution is of course toanswer no all the time, so we are facing the engineering challenge of answeringyes as often as possible while obtaining a reasonable performance.3

A different example is the question: to which variables may the pointer ppoint? If our intended application is to replace *p with x in order to save adereference operation, then the analysis may only answer “&x” if p certainlymust point to x and must answer “?” if this is false or the answer cannot bedetermined. If our intended application is instead to determine the maximal sizeof *p, then the analysis must reply with a possibly too large set {&x,&y,&z,.}that is guaranteed to contain all targets.In general, all optimization applications need conservative approximations.If we are given false information, then the optimization is unsound and changesthe semantics of the program. Conversely, if we are given trivial information,then the optimization fails to do anything.Approximative answers may also be useful for finding bugs in programs,which may be viewed as a weak form of program verification. As a case inpoint, consider programming with pointers in the C language. This is fraughtwith dangers such as null dereferences, dangling pointers, leaking memory, andunintended aliases. The standard compiler technology, based on type checking,offers little protection from pointer errors. Consider the following small programwhich performs every kind of error:int main() {char *p,*q;p NULL;printf("%s",p);q (char *)malloc(100);p q;free(q);*p ’x’;free(p);p (char *)malloc(100);p (char *)malloc(100);q p;strcat(p,q);}The standard tools such as gcc -Wall and lint detect no errors. If we hadeven approximative answers to questions about null values and pointer targets,then many of the above errors could be caught.Exercise 1.1: Describe all the errors in the above program.2A Tiny Example LanguageWe use a tiny imperative programming language, called TIP, throughout thefollowing sections. It is designed to have a minimal syntax and yet to containall the constructions that make static analyses interesting and challenging.4

ExpressionsThe basic expressions all denote integer values:E intconstidE E E - E E * E E / E E E E E(E )inputThe input expression reads an integer from the input stream. The comparisonoperators yield 0 for false and 1 for true. Pointer expressions will be added later.StatementsThe simple statements are familiar:S id E ;output E ;S Sif (E ) { S }if (E ) { S } else { S }while (E ) { S }var id 1 ,. . . ,,id n ;In the conditions we interpret 0 as false and all other values as true. The outputstatement writes an integer value to the output stream. The var statementdeclares a collection of uninitialized variables.FunctionsFunctions take any number of arguments and return a single value:F id ( id ,. . . ,id ) { var id ,. . . ,id ; S return E ; }Function calls are an extra kind of expression:E id ( E ,. . . ,E )PointersFinally, to allow dynamic memory, we introduce pointers into a heap:E &idmalloc*Enull5

The first expression creates a pointer to a variable, the second expression allocates a new cell in the heap, and the third expression dereferences a pointervalue. In order to assign values to heap cells we allow another form of assignment:S *id E ;Note that pointers and integers are distinct values, so pointer arithmetic is notpermitted. It is of course limiting that malloc only allocates a single heap cell,but this is sufficient to illustrate the challenges that pointers impose.We also allow function pointers to be denoted by function names. In orderto use those, we generalize function calls to:E (E )( E ,. . . ,E )Function pointers serve as a simple model for objects or higher-order functions.ProgramsA program is just a collection of functions:P F .FThe final function is the main one that initiates execution. Its arguments aresupplied in sequence from the beginning of the input stream, and the valuethat it returns is appended to the output stream. We make the notationallysimplifying assumption that all declared identifiers are unique in a program.Exercise 2.1: Argue that any program can be normalized so that all declaredidentifiers are unique.Example ProgramsThe following TIP programs all compute the factorial of a given integer. Thefirst one is iterative:ite(n) {var f;f 1;while (n 0) {f f*n;n n-1;}return f;}The second program is recursive:6

rec(n) {var f;if (n 0) { f 1; }else { f n*rec(n-1); }return f;}The third program is unnecessarily complicated:foo(p,x) {var f,q;if (*p 0) { f 1; }else {q malloc;*q (*p)-1;f (*p)*((x)(q,x));}return f;}3main() {var n;n input;return foo(&n,foo);}Type AnalysisOur programming language is untyped, but of course the various operations areintended to be applied only to certain arguments. Specifically, the followingrestrictions seem reasonable: arithmetic operations and comparisons apply only to integers; only integers can be input and output; conditions in control structures must be integers; only functions can be called; and the * operator only applies to pointers.We assume that their violation results in runtime errors. Thus, for a givenprogram we would like to know that these requirements hold during execution.Since this is an interesting question, we immediately know that it is undecidable.Instead of giving up, we resort to a conservative approximation: typability. Aprogram is typable if it satisfies a collection of type constraints that is systematically derived from the syntax tree of the given program. This condition impliesthat the above requirements are guaranteed to hold during execution, but theconverse is not true. Thus, our type-checker will be conservative and reject someprograms that in fact will not violate any requirements during execution.TypesWe first define a language of types that will describe possible values:7

τ int &τ (τ ,. . . ,τ )- τThe type terms describe respectively integers, pointers, and function pointers.The grammar would normally generate finite types, but for recursive functionsand data structures we need regular types. Those are defined as regular treesdefined over the above constructors. Recall that a possibly infinite tree is regularif it contains only finitely many different subtrees.Exercise 3.1: Show how regular types can be represented by finite automataso that two types are equal if their automata accept the same language.Type ConstraintsFor a given program we generate a constraint system and define the programto be typable when the constraints are solvable. In our case we only need toconsider equality constraints over regular type terms with variables. This classof constraints can be efficiently solved using the unification algorithm.For each identifier id we introduce a type variable [[id ]], and for each expression E a type variable [[E ]]. Here, E refers to a concrete node in the syntaxtree—not to the syntax it corresponds to. This makes our notation slightlyambiguous but simpler than a pedantically correct approach. The constraintsare systematically defined for each construction in our language:intconst :E 1 op E 2 :E 1 E 2 :input:id E :output E :if (E ) S :if (E ) S 1 else S 2 :while (E ) S :id (id 1 ,. . . ,id n ){ . . . return E ; }:id (E 1 ,. . . ,E n ):(E )(E 1 ,. . . ,E n ):&id :malloc:null:*E :*id E :[[intconst ]] int[[E 1 ]] [[E 2 ]] [[E 1 op E 2 ]] int[[E 1 ]] [[E 2 ]] [[E 1 E 2 ]] int[[input]] int[[id ]] [[E ]][[E ]] int[[E ]] int[[E ]] int[[E ]] int[[id ]] ([[id 1 ]],. . . ,[[id n ]])- [[E ]][[id ]] ([[E 1 ]],. . . ,[[E n ]])- [[id (E 1 ,. . . ,E n )]][[E ]] ([[E 1 ]],. . . ,[[E n ]])- [[(E )(E 1 ,. . . ,E n )]][[&id ]] &[[id ]][[malloc]] &α[[null]] &α[[E ]] &[[*E ]][[id ]] &[[E ]]In the above, each occurrence of α denotes a fresh type variable. Note thatvariable references and declarations do not yield any constraints and that parenthesized expression are not present in the abstract syntax.8

Thus, a given program gives rise to a collection of equality constraints ontype terms with variables.Exercise 3.2: Explain each of the above type constraints.A solution assigns to each type variable a type, such that all equality constraintsare satisfied. The correctness claim for this algorithm is that the existence of asolution implies that the specified runtime errors cannot occur during execution.Solving ConstraintsIf solutions exist, then they can be computed in almost linear time using the unification algorithm for regular terms. Since the constraints may also be extractedin linear time, the whole type analysis is quite efficient.The complicated factorial program generates the following constraints, whereduplicates are not shown:[[foo]] ([[p]],[[x]])- [[f]][[*p]] int[[1]] int[[p]] &[[*p]][[malloc]] &α[[q]] &[[*q]][[f]] [[(*p)*((x)(q,x))]][[(x)(q,x)]] int[[input]] int[[n]] [[input]][[foo]] ([[&n]],[[foo]])- [[foo(&n,foo)]][[*p 0]] int[[f]] [[1]][[0]] int[[q]] [[malloc]][[q]] &[[(*p)-1]][[*p]] int[[(*p)*((x)(q,x))]] int[[x]] ([[q]],[[x]])- [[(x)(q,x)]][[main]] ()- [[foo(&n,foo)]][[&n]] &[[n]][[*p]] [[0]]These constraints have a solution, where most variables are assigned int, except:[[p]] &int[[q]] &int[[malloc]] &int[[x]] φ[[foo]] φ[[&n]] &int[[main]] ()- intwhere φ is the regular type that corresponds to the infinite unfolding of:φ (&int,φ)- intExercise 3.3: Draw a picture of the unfolding of φ.Since this solution exists, we conclude that our program is type correct. Recursive types are also required for data structures. The example program:9

var p;p malloc;*p p;creates the constraints:[[p]] &α[[p]] &[[p]]which has the solution [[p]] ψ where ψ &ψ. Some constraints admit infinitelymany solutions. For example, the function:poly(x) {return *x;}has type &α- α for any type α, which corresponds to the polymorphic behaviorit displays.Slack and LimitationsThe type analysis is of course only approximate, which mean