Publications

Understanding Database Usage in PHP Systems: Current and Future Work

Most PHP applications use databases to store application data. To manipulate this data, developers create queries in their programs using database APIs. These queries are constructed in source code using a combination of static query text and dynamically computed values, including values based on user input. In this abstract, we briefly explain our ongoing work on understanding how developers use PHP language features and query libraries to create database queries, and on how query languages are used in source code.

Escaping the Clone Zone: Java Runtime-Managed Snapshots Current and Future Work

Mutable objects shared across modules may lead to unexpected results as changes to the object in one module are visible to other modules sharing the object. It is common practice in Java to defensively create a private copy of the input object state via cloning, serialization, copy constructor, or external library. No universal approach exists and each common solution has limitations or problems. The subject of this paper is a novel runtime-managed approach to declaratively unshare input object state. This approach is evaluated using an experimental OpenJDK 10. The paper summarizes the motivation for the work, describes the experimental implementation, discusses ongoing empirical work regarding desirable properties of the snapshot function, and describes plans for future research.

Supporting Analysis of SQL Queries in PHP AiR

The code behind dynamic webpages often includes calls to database libraries, with queries formed using a combination of static text and values computed at runtime. In this paper, we describe our work on a program analysis for extracting models of database queries that can compactly represent all queries that could be used in a specific database library call. We also describe our work on parsing partial queries, with holes representing parts of the query that are computed dynamically. Implemented in Rascal as part of the PHP AiR framework, the goal of this work is to enable empirical research on database usage in PHP scripts, to support developer tools for understanding existing queries, and to support program transformation tools to evolve existing systems and to improve the security of existing code.

Query Construction Patterns in PHP

Most PHP applications use databases, with developers including both static queries, given directly in the code, and dynamic queries, which are based on a mixture of static text, computed values, and user input. In this paper, we focus specifically on how developers create queries that are then used with the original MySQL API library. Based on a collection of open-source PHP applications, our initial results show that many of these queries are created according to a small collection of query construction patterns. We believe that identifying these patterns provides a solid base for program analysis, comprehension, and transformation tools that need to reason about database queries, including tools to support renovating existing PHP code to support safer, more modern database access APIs.

Enabling PHP Software Engineering Research in Rascal

Today, PHP is one of the most popular programming languages, and is commonly used in the open source community and in industry to build large application frameworks and web applications. In this paper, we discuss our ongoing work on PHP AiR, a framework for PHP Analysis in Rascal. PHP AiR is focused especially on program analysis and empirical software engineering, and is being used actively and effectively in work on evaluating PHP feature usage and system evolution, on program analysis for refactoring and security validation, and on source code metrics. We describe the requirements and design decisions for PHP AiR, summarize current research using PHP AiR, discuss lessons learned, and briefly sketch future work.

Navigating the WordPress Plugin Landscape

WordPress includes a plugin mechanism that allows user-provided code to be executed in response to specific system events and input/output requests. The large number of extension points provided by WordPress makes it challenging for new plugin developers to understand which extension points they should use, while the thousands of existing plugins make it hard to find existing extension point handler implementations for use as examples when creating a new plugin. In this paper, we present a lightweight analysis, supplemented with information mined from source comments and the webpages hosted by WordPress for each plugin, that guides developers to the right extension points and to existing implementations of handlers for these extension points. We also present empirical information about how plugins are used in practice, providing guidance to both tool and prospective plugin developers.

Modular language implementation in Rascal - experience report

All software evolves, and programming languages and programming language tools are no exception. And just like in ordinary software construction, modular implementations can help ease the process of changing a language implementation and its dependent tools. However, the syntactic and semantic dependencies between language features make this a challenging problem. In this paper we detail how programming languages can be implemented in a modular fashion using the Rascal meta-programming language. Rascal supports extensible definition of concrete syntax, abstract syntax and operations on concrete and abstract syntax trees like matching, traversal and transformation. As a result, new language features can be added without having to change existing code. As a case study, we detail our solution of the LDTA'11 Tool Challenge: a modular implementation of Oberon-0, a relatively simple imperative programming language. The approach we sketch can be applied equally well to the implementation of domain-specific languages.

Variable Feature Usage Patterns in PHP

PHP allows the names of variables, classes, functions, methods, and properties to be given dynamically, as expressions that, when evaluated, return an identifier as a string. While this provides greater flexibility for programmers, it also makes PHP programs harder to precisely analyze and understand. In this paper we present a number of patterns designed to recognize idiomatic uses of these features that can be statically resolved to a precise set of possible names. We then evaluate these patterns across a corpus of 20 open-source systems totaling more than 3.7 million lines of PHP, showing how often these patterns occur in actual PHP code, demonstrating their effectiveness at statically determining the names that can be used at runtime, and exploring anti-patterns that indicate when the identifier computation is truly dynamic.

Supporting PHP Dynamic Analysis in PHP AiR

The PHP AiR framework is currently being developed to support software metrics, empirical software engineering, and program analysis for real-world PHP systems. While most of the work on program analysis has focused on static analysis, to help address the dynamic nature of the language we have also started to extend PHP AiR with support for dynamic program analysis. This extended abstract highlights two parts of this support: integration with xdebug for trace analysis, and instrumentation of an open-source PHP interpreter with a focus on supporting string origins, allowing us to explore how strings are created in security-sensitive areas such as database calls and HTML generation.

M3: A General Model for Source Code Analytics in Rascal

This short paper introduces M3, a simple and extensible model for capturing facts about source code for future analysis. M3 is a core part of the standard library of the Rascal meta programming language. We motivate it, position it to related work and detail the key design aspects.

Evolution of Dynamic Feature Usage in PHP

PHP includes a number of dynamic features that, if used, make it challenging for both programmers and tools to reason about programs. In this paper we examine how usage of these features has changed over time, looking at usage trends for three categories of dynamic features across the release histories of two popular open-source PHP systems, WordPress and MediaWiki. Our initial results suggest that, while features such as eval are being removed over time, more constrained dynamic features such as variable properties are becoming more common. We believe the results of this analysis provide useful insights for researchers and tool developers into the evolving use of dynamic features in real PHP programs.

Streamlining Control Flow Graph Construction with DCFlow

A control flow graph (CFG) is used to model possible paths through a program, and is an essential part of many program analysis algorithms. While programs to construct CFGs can be written in meta-programming languages such as Rascal, writing such programs is currently quite tedious. With the goal of streamlining this process, in this paper we present DCFlow, a domain-specific language and Rascal library for defining control flow rules and building control flow graphs. Control flow rules in DCFlow are defined declaratively, based directly on the abstract syntax of the language under analysis and a number of operations representing types of control flow. Standard Rascal code is then generated based on the DCFlow definition. This code makes use of the DCFlow libraries to build CFGs for programs, which can then be visualized or used inside program analysis algorithms. To demonstrate the design of DCFlow we apply it to Pico—a very simple imperative language—and to a significant subset of PHP.

Static, Lightweight Includes Resolution for PHP

Dynamic languages include a number of features that are challenging to model properly in static analysis tools. In PHP, one of these features is the include expression, where an arbitrary expression provides the path of the file to include at runtime. In this paper we present two complementary analyses for statically resolving PHP includes, one that works at the level of individual PHP files, and one targeting PHP programs possibly consisting of multiple scripts. To evaluate the effectiveness of these analyses we have applied the first to a corpus of 20 open-source systems, totaling more than 4.5 million lines of PHP, and the second to a number of programs from a subset of these systems. Our results show that, in many cases, includes can be resolved to a specific file or a small subset of possible files, enabling better IDE features and more advanced program analysis tools for PHP.

PHP AiR: Analyzing PHP Systems with Rascal

PHP is currently one of the most popular programming languages, widely used in both the open source community and in industry to build large web-focused applications and application frameworks. To provide a solid framework for working with large PHP systems in areas such as evaluating how language features are used, studying how PHP systems evolve, program analysis for refactoring and security validation, and software metrics, we have developed PHP AiR, a framework for PHP Analysis in Rascal. Here we briefly describe features available in PHP AiR, integration with the Eclipse PHP Development Tools, and usage scenarios in program analysis, metrics, and empirical software engineering.

Enabling PHP Software Engineering Research in Rascal

Today, PHP is one of the most popular programming languages and is commonly used in the open source community and in industry to build large application frameworks and web applications. In this paper, we discuss our ongoing work on PHP AiR, a framework for PHP Analysis in Rascal. PHP AiR is focused especially on program analysis and empirical software engineering, and is being used actively and effectively in work on evaluating PHP feature usage, program analysis for refactoring and security validation, and source code metrics. We describe the requirements and design decisions for PHP AiR, summarize current research using PHP AiR, discuss lessons learned, and briefly sketch future work.

An Empirical Study of PHP Feature Usage: A Static Analysis Perspective

PHP is one of the most popular languages for server-side application development. The language is highly dynamic, providing programmers with a large amount of flexibility. However, these dynamic features also have a cost, making it difficult to apply traditional static analysis techniques used in standard code analysis and transformation tools. As part of our work on creating analysis tools for PHP, we have conducted a study over a significant corpus of open-source PHP systems, looking at the sizes of actual PHP programs, which features of PHP are actually used, how often dynamic features appear, and how distributed these features are across the files that make up a PHP website. We have also looked at whether uses of these dynamic features are truly dynamic or are, in some cases, statically understandable, allowing us to identify specific patterns of use which can then be taken into account to build more precise tools. We believe this work will be of interest to creators of analysis tools for PHP, and that the methodology we present can be leveraged for other dynamic languages with similar features.

Meta-Language Support for Type-Safe Access to External Resources

Meta-programming applications often require access to heterogeneous sources of information, often from different technological spaces (grammars, models, ontologies, databases), that have specialized ways of defining their respective data schemas. Without direct language support, obtaining typed access to this external, potentially changing, information is a tedious and error-prone engineering task. The Rascal meta-programming language aims to support the import and manipulation of all of these kinds of data in a type- safe manner. The goal is to lower the engineering effort to build new meta programs that combine information about software in unforeseen ways. In this paper we describe built-in language support, so called resources, for incorporating external sources of data and their corresponding data-types while maintaining type safety. We demonstrate the applicability of Rascal resources by example, showing resources for RSF files, CSV files, JDBC- accessible SQL databases, and SDF2 grammars. For RSF and CSV files this requires a type inference step, allowing the data in the files to be loaded in a type-safe manner without requiring the type to be declared in advance. For SQL and SDF2 a direct translation from their respective schema languages into Rascal is instead constructed, providing a faithful translation of the declared types or sorts into equivalent types in the Rascal type system. An overview of related work and a discussion conclude the paper.

Scripting a Refactoring with Rascal and Eclipse

To facilitate experimentation with creating new, complex refactorings, we want to reuse existing transformation and analysis code as orchestrated parts of a larger refactoring: i.e., to script refactorings. The language we use to perform this scripting must be able to deal with the diversity of languages, tools, analyses, and transformations that arise in practice. To illustrate one solution to this problem, in this paper we describe, in detail, a specific refactoring script for switching from the Visitor design pattern to the Interpreter design pattern. This script, written in the meta-programming language Rascal, and targeting an interpreter written in Java, extracts facts from the interpreter code using the Eclipse JDT, performs the needed analysis in Rascal, and then transforms the interpreter code using a combination of Rascal code and existing JDT refactorings. Using this script we illustrate how a new, real and complex refactoring can be scripted in a few hundred lines of code and within a short timeframe. We believe the key to successfully building such refactorings is the ability to pair existing tools, focused on specific languages, with general-purpose meta-programming languages.

Program Analysis Scenarios in Rascal

Rascal is a meta programming language focused on the implementation of domain-specific languages and on the rapid construction of tools for software analysis and software transformation. In this paper we focus on the use of Rascal for software analysis. We illustrate a range of scenarios for building new software analysis tools through a number of examples, including one showing integration with an existing Maude-based analysis. We then focus on ongoing work on alias analysis and type inference for PHP, showing how Rascal is being used, and sketching a hypothetical solution in Maude. We conclude with a high-level discussion on the commonalities and differences between Rascal and Maude when applied to program analysis.

RLSRunner: Linking Rascal with K for Program Analysis

The Rascal meta-programming language provides a number of features supporting the development of program analysis tools. However, sometimes the analysis to be developed is already implemented by another system. In this case, Rascal can provide a useful front-end for this system, handling the parsing of the input program, any transformation (if needed) of this program into individual analysis tasks, and the display of the results generated by the analysis. In this paper we describe a tool, RLSRunner, which provides this integration with static analysis tools defined using the K framework, a rewriting-based framework for defining the semantics of programming languages.

Rascal: From Algebraic Specification to Meta-Programming (Invited Paper)

Algebraic specification has a long tradition in bridging the gap between specification and programming by making specifications executable. Building on extensive experience in designing, implementing and using specification formalisms that are based on algebraic specification and term rewriting (namely ASF and ASF+SDF), we are now focusing on using the best concepts from algebraic specification and integrating these into a new programming language: Rascal. This language is easy to learn by non-experts but is also scalable to very large meta-programming applications. We explain the algebraic roots of Rascal and its main application areas: software analysis, software transformation, and design and implementation of domain-specific languages. Some example applications in the domain of Model-Driven Engineering (MDE) are described to illustrate this.

A Case of Visitor versus Interpreter Pattern

We compare the Visitor pattern with the Interpreter pattern, investigating a single case in point for the Java language. We have produced and compared two versions of an interpreter for a programming language. The first version makes use of the Visitor pattern. The second version was obtained by using an automated refactoring to transform uses of the Visitor pattern to uses of the Interpreter pattern. We compare these two nearly equivalent versions on their maintenance characteristics and execution efficiency. Using a tailored experimental research method we can highlight differences and the causes thereof. The contributions of this paper are that it isolates the choice between Visitor and Interpreter in a realistic software project and makes the difference experimentally observable.

A Rewriting Logic Semantics Approach to Modular Program Analysis

The K framework, based on rewriting logic semantics, provides a powerful logic for defining the semantics of programming languages. While most work in this area has focused on defining an evaluation semantics for a language, it is also possible to define an abstract semantics that can be used for program analysis. Using the SILF language (Hills, Serbanuta and Rosu, 2007), this paper describes one technique for defining such a semantics: policy frameworks. In policy frameworks, an analysis-generic, modular framework is first defined for a language. Individual analyses, called policies, are then defined as extensions of this framework, with each policy defining analysis-specific semantic rules and an annotation language which, in combination with support in the language front- end, allows users to annotate program types and functions with information used during program analysis. Standard term rewriting techniques are used to analyze programs by evaluating them in the policy semantics.

A Modular Rewriting Approach to Language Design, Evolution and Analysis

Software is becoming a pervasive presence in our lives, powering computing systems in the home, in businesses, and in safety-critical settings. In response, languages are being defined with support for new domains and complex computational abstractions. The need for formal techniques to help better understand the languages we use, correctly design new language abstractions, and reason about the behavior and correctness of programs is now more urgent then ever.

In this dissertation we focus on research in programming language semantics and program analysis, aimed at building and reasoning about programming languages and applications. In language semantics, we first show how to use formal techniques during language design, presenting definitional techniques for object-oriented languages with concurrency features, including the Beta language and a paradigmatic language called KOOL. Since reuse is important, we then present a module system for K, a formalism for language definition that takes advantage of the strengths of rewriting logic and term rewriting techniques. Although currently specific to K, parts of this module system are also aimed at other formalisms, with the goal of providing a reuse mechanism for different forms of modular semantics in the future. Finally, since performance is also important, we show techniques for improving the executable and analysis performance of rewriting logic semantics definitions, specifically focused on decisions around the representation of program values and configurations used in semantics definitions.

The work on performance, with a discussion of analysis performance, provides a good bridge to the second major topic, program analysis. We present a new technique aimed at annotation-driven static analysis called policy frameworks. A policy framework consists of analysis domains, an analysis generic front-end, an analysis-generic abstract language semantics, and an abstract analysis semantics that defines the semantics of the domain and the annotation language. After illustrating the technique using SILF, a simple imperative language, we then describe a policy framework for C. To provide a real example of using this framework, we have defined a units of measurement policy for C. This policy allows both type and code annotations to be added to standard C programs, which are then used to generate modular analysis tasks checked using the CPF semantics in Maude.

A Rewriting Logic Approach to Static Checking of Units of Measurement in C

Many C programs assume the use of implicit domain-specific information. A common example is units of measurement, where values can have both a standard C type and an associated unit. However, since there is no way in the C language to represent this additional information, violations of domain-specific policies, such as unit safety violations, can be difficult to detect. In this paper we present a static analysis, based on the use of an abstract C semantics defined using rewriting logic, for the detection of unit violations in C programs. In contrast to typed approaches, the analysis makes use of annotations present in C comments on function headers and in function bodies, leaving the C language unchanged. Initial evaluation results show that performance scales well, and that errors can be detected without imposing a heavy annotation burden.

Towards a Module System for K

To create reusable definitions of language features, isolated from changes to other features in the language, it is important that feature definitions be modular. This abstract introduces ongoing work on modularity features in K, an algebraic, rewriting logic based formalism for defining the semantics of programming languages.

Memory Representations in Rewriting Logic Semantics Definitions

The executability of rewriting logic makes it a compelling environment for language design and experimentation; the ability to interpret programs directly in the language semantics develops assurance that definitions are correct and that language features work well together. However, this executability raises new questions about language semantics that don’t necessarily make sense with non-executable definitions. For instance, suddenly the performance of the semantics, not just language interpreters or compilers based on the semantics, can be important, and representations must be chosen carefully to ensure that executing programs directly in language definitions is still feasible. Unfortunately, many obvious representations common in other semantic formalisms can lead to poor performance, including those used to represent program memory. This paper describes two different memory representations designed to improve performance: the first, which has been fully developed, is designed for use in imperative programs, while the second, still being developed, is intended for use in a variety of languages, with a special focus on pure object-oriented languages. Each representation is described and compared to the initial representation used in the language semantics, with thoughts on reuse also presented.

On Formal Analysis of OO Languages using Rewriting Logic: Designing for Performance

Rewriting logic provides a powerful, flexible mechanism for language definition and analysis. This flexibility in design can lead to problems during analysis, as different designs for the same language feature can cause drastic differences in analysis performance. This paper describes some of these design decisions in the context of KOOL, a concurrent, dynamic, object-oriented language. Also described is a general mechanism used in KOOL to support model checking while still allowing for ongoing, sometimes major, changes to the language definition.

KOOL: An Application of Rewriting Logic to Language Prototyping and Analysis

This paper presents KOOL, a concurrent, dynamic, object-oriented language defined in rewriting logic. KOOL has been designed as an experimental language, with a focus on making the language easy to extend. This is done by taking advantage of the flexibility provided by rewriting logic, which allows for the rapid prototyping of new language features. An example of this process is illustrated by sketching the addition of synchronized methods. KOOL also provides support for program analysis through language extensions and the underlying capabilities of rewriting logic. This support is illustrated with several examples.

A Rewrite Framework for Language Definitions and for Generation of Efficient Interpreters

A rewrite logic semantic definitional framework for programming languages is introduced, called K, together with partially automated translations of K language definitions into rewriting logic and into C. The framework is exemplified by defining SILF, a simple imperative language with functions. The translation of K definitions into rewriting logic enables the use of the various analysis tools developed for rewrite logic specifications, while the translation into C allows for very efficient interpreters. A suite of tests show the performance of interpreters compiled from K definitions.

An Orchestration Language for Parallel Objects

Charm++, a parallel object language based on the idea of virtual processors, has attained significant success in efficient parallelization of applications. Requiring the user to only decompose the computation into a large number of objects (“virtual processors”), Charm++ empowers its intelligent adaptive runtime system to assign and reassign the objects to processors at runtime. This facility is used to optimize execution, including dynamic load balancing. However, in complex applications, Charm++ programs obscure the overall flow of control: one must look at the code of multiple objects to discern how the sets of objects are orchestrated in a given application. In this paper, we present an orchestration notation that allows expression of Charm++ functionality without its fragmented flow of control.