Determining the Limits of Automated Program Recognition
This working paper was submitted as a Ph.D. thesis proposal.
Program recognition is a program understanding technique in which stereotypical computational structures are identified in a program. From this identification and the known relationships between the structures, a hierarchical description of the program's design is recovered. Several researchers have shown that this technique is feasible for small programs. However, it seems unlikely that existing program recognition systems will scale up to realistic, full-sized programs without some guidance (e.g., from a person using the recognition system as an assistant). One reason is that there are limits to what a purely code-driven approach can recover: some of the information about a program that is useful for common software engineering tasks, particularly maintenance, is missing from the code. Another reason is that guidance is needed to reduce the cost of recognition. To determine what guidance is appropriate, therefore, we must know what information is recoverable from the code and where the complexity of program recognition lies. I propose to study the limits of program recognition, both empirically and analytically. First, I will build an experimental system that performs recognition on realistic programs on the order of thousands of lines long. This will allow me to characterize the information that can be recovered by this code-driven technique. Second, I will formally analyze the complexity of the recognition process. This will help determine how guidance can be applied most profitably to improve the efficiency of program recognition.
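
To make the idea of identifying a stereotypical structure concrete, the following is a minimal sketch, not the system proposed here. It assumes a single hypothetical structure, labeled "accumulation-loop" (a variable initialized to a constant and then incremented inside the loop that immediately follows), and it matches on Python syntax trees using the standard ast module. The function name and the label are illustrative assumptions.

```python
import ast

# Hypothetical label for the one structure this sketch looks for.
ACCUMULATION = "accumulation-loop"


def find_accumulation_loops(source):
    """Return (label, variable, loop line) for each `v = <constant>`
    that is immediately followed by a `for` loop containing `v += ...`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        body = getattr(node, "body", None)
        if not isinstance(body, list):
            continue  # node has no statement list (e.g. an expression)
        # Look at each pair of adjacent statements in the block.
        for init, loop in zip(body, body[1:]):
            if not (isinstance(init, ast.Assign) and isinstance(loop, ast.For)):
                continue
            if len(init.targets) != 1 or not isinstance(init.targets[0], ast.Name):
                continue
            if not isinstance(init.value, ast.Constant):
                continue
            name = init.targets[0].id
            # The loop body must update the same variable with `+=`.
            for stmt in loop.body:
                if (isinstance(stmt, ast.AugAssign)
                        and isinstance(stmt.op, ast.Add)
                        and isinstance(stmt.target, ast.Name)
                        and stmt.target.id == name):
                    found.append((ACCUMULATION, name, loop.lineno))
                    break
    return found


SAMPLE = """
def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / len(xs)
"""

print(find_accumulation_loops(SAMPLE))
```

This sketch is purely syntactic: it would not report the loop if the update were written as total = total + x, and it says nothing about why the sum is being computed. Both gaps illustrate, on a very small scale, the kind of limit on code-driven recovery that the proposal sets out to characterize.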