The information content of software is expressible in a variety of representations, such as trees (abstract syntax), graphs (control-flow) and relations (definition-use pairs). The independence of these representations from the text of the source-code necessitates the use of a map to relate extracted information to the code. To avoid the need for an explicit map between information and code, this thesis examines the use of a source-based textual model that represents information within the context of source-code.
Transformations applied to source-code, such as preprocessing, complicate both the extraction of information and its representation within the original unprocessed source files. These issues are examined for the case of preprocessing in order to develop techniques that support source-based representations. An algorithm is presented that accurately back-locates information from preprocessed code to unprocessed code. Hierarchical lexical analysis is examined as an information extraction technique for code that is discarded by a preprocessor or that cannot be parsed for other reasons.
To explore the sufficiency of a text-based representation, the Jupiter source-code repository system was designed and implemented. Jupiter, an application of the MultiText structured text database system, reveals the need for enhancements to MultiText for managing source-code. The implicit data-model of MultiText is made explicit and extended with attributes, a general-purpose facility for representing relationships and properties. To retrieve attributes and to query data that is not hierarchically structured, GCL, the query language of MultiText, is extended. Examples demonstrating the use of Jupiter for some typical program exploration tasks are provided, along with an overview of the issues that shaped its design.