Manuel Hermenegildo
herme@fi.upm.es
http://clip.dia.fi.upm.es/~herme/
Computer Science Department
Technical University of Madrid (UPM), Spain
Multiprocessing hardware is already available which offers significant advantages in either performance or cost/performance over uniprocessors. For example, departmental servers using fast, inexpensive off-the-shelf processors, are being produced at a fraction of the cost of the mainframes they replace and even multiprocessor workstations are now relatively commonplace. However, beyond some manually parallelized high volume applications and scientific codes, still comparatively few programs exploit parallelism. The traditional explanation that parallelization is a difficult and error-prone task [14] seems to remain valid, and still points to the necessity of improving the tools used in the process. Thus, despite much progress, significant challenges remain in automatic parallelization, including dealing well with both regular and irregular computations, performing efficient partitioning for both types of computations, automatically changing data structures for more efficient exploitation of parallelism, and developing parallelization techniques for new, more complex programming paradigms.
The goal of developing effective parallelizing compilers is being sought after in parallel and somewhat independently in the context of different programming paradigms or even individual languages. We argue that despite the apparent differences among imperative, functional, logic, and object oriented languages, the fundamental issues being tackled are quite similar and progress towards more effective parallelizing compilers for all programming paradigms can be sped up by cross fertilization. In order to elaborate this point we will center on two well known programming paradigms on which extensive research on automatic parallelization has been performed: imperative programming (e.g., in the context of FORTRAN) and logic (and constraint) programming.
Some very significant progress has been made in parallelizing compilers for regular, numerical computations, generally based on the FORTRAN language [2]. This research has resulted in well known concepts and techniques including a well understood notion of independence (based on the Bernstein conditions or more recent notions of ``semantic independence''), sophisticated syntactic loop transformations, other transformations based on polytope models, extensive work on partitioning and placement, etc. On the other hand, the applicability of these techniques has remained quite limited for irregular or symbolic computations, and few systems deal with parallelization of procedure calls. Also, the techniques often rely on the relative cleanliness as a programming language of traditional FORTRAN and additional work is needed in order to extend them to other common languages like C or C++. These languages include features such as complex data structures and pointer manipulation which complicate determining independence among statements or procedure calls.
We argue that logic programming and constraint programming languages [13] offer a particularly interesting case study for automatic parallelization. On the one hand, these programming paradigms pose significant challenges to the parallelization task, which happen to closely relate to the more difficult challenges faced in imperative language parallelization. Examples are a non-trivial notion of independence (due to the reversibility of programs and the possibility of providing multiple answers to a given call, which they share with database systems), the symbolic nature of many of their applications, highly irregular computations, dynamic control flow, meta-programming (higher order), and the presence of dynamically allocated, complex data structures containing (well behaved) pointers. Such pointers are represented by logical variables and have similar behavior to those used, for example, in ``clean C'' or Java. On the other hand, due to their high-level nature these languages also facilitate the study of parallelization issues. In addition to significantly reducing the time needed to code and debug complex applications, such languages encourage coding in a way which expresses computations (i.e., the intended algorithm, which can thus be efficiently implemented) while at the same time using formulations which remain close to specifications, and make the parallelism in the problem much more accessible to the compiler. The relatively clean semantics of these languages also makes it comparatively easy to use formal methods and prove the transformations performed by the parallelizing compiler both correct (in terms of computed outputs) and efficient (in terms of computational cost). Functional programming is another paradigm which also generally facilitates exploitation of parallelism. In fact, the lack of certain features, such as pointers and nondeterminism, makes the parallelization problem easier. On the other hand the absence of such features also precludes studying some interesting issues.
Very significant progress has been made in the past decade in the area of automatic program parallelization for logic programs [6] and some of the challenges have been tackled quite effectively. As a result, robust, publicly available compilers for the Prolog language exist which automatically exploit parallelism in complex applications. The capabilities of such compilers include for example automatically exploiting the parallelism stemming from Prolog's built-in search (``or-parallelism'') [16, 1, 9]. While determining independence of tasks is trivial in this case, quite interesting dynamic scheduling and granularity control techniques have been developed for these systems. More interestingly for our discussion, parallelizing compilers are available which exploit the parallelism arising from independent computations [12, 17]. This corresponds to traditional parallelism, and is called ``and-parallelism'' in this context. Interesting developments in this area include quite sophisticated, incremental inter-procedural analyzers for detecting sharing patterns among complex data structures containing pointers [4]. These analyzers also derive other important properties such as variable instantiation state, determinism, non-failure, and cardinality of answers, and the parallelizers that use this information have been able to effectively parallelize application class programs. The correctness of the process has been shown using the theory of abstract interpretation [7]. The speed and robustness of these compilers has demonstrated that abstract interpretation provides a very adequate framework for developing provably correct, powerful, and efficient global analyzers and, consequently, parallelizers. In addition, significant progress has also been made in run-time systems including dynamic task allocation, distributed dynamic memory management, and parallel implementation of backtracking search.
While, as mentioned above, much progress has been made in the context of logic programming in the parallelization of irregular computations, the techniques developed are weaker than those developed in the context of numerical computing for regular problems. Logic programming techniques can discover the parallelism in complex recursive traversals of data structures, but do not handle well traversals that are based on integer (i.e., array subscript) arithmetic, for which much work exists in the area of imperative languages. It thus appears that it would be quite interesting to merge the complementary work done in these areas by the different communities. Some progress has been made in this direction in the context of ``data parallelism'' [3, 11], but it still seems like a very promising avenue for future research.
Another important area which has also received a significant amount of attention in the context of logic programming parallelization is granularity control (task partitioning). Controlling the granularity of tasks to be executed in parallel is simply an optimization in machines with a fast interconnection switch, such as shared memory multiprocessors, but a necessity when parallelizing for machines with slower interconnections. The problem is challenging because in the logic programming context the tasks being parallelized are often procedure calls whose computational cost greatly depends on dynamic characteristics of the input data. One of the solutions currently used in logic programming is to derive at compile time complexity cost functions which give upper and lower bounds on task execution time as a function of certain measures of input data [8, 15]. Such measures include procedure argument values as well as more complex ones such as length or depth of the structures being passed as arguments. Programs are transformed at compile-time into equivalent counterparts which automatically control granularity at run-time based on such functions. The performance improvements resulting from the incorporation of this type of grain size control have been shown to be quite good, specially for systems with medium to large parallel execution overheads. However, while current systems are reasonably good at dealing with tasks with dynamic costs, the techniques currently used are again comparatively weaker for the static case than the partitioning algorithms used in imperative program parallelization. Again, cross-fertilization looks like an interesting avenue for further progress. Ideally, a parallelizing compiler should perform optimal partitioning and placement for any kind of architecture, using dynamic techniques when unavoidable.
Finally, we would like to point to the parallelization of the very powerful framework of constraint programming as an interesting area for future work. The growing acceptance of this programming paradigm is underlined by significant industrial use of constraint logic programming systems [13] and also of libraries which provide constraint solving capabilities (through interfaces with constraint programming systems) in conventional languages such as C++. Constraint logic programming extends the high-level programming paradigm that logic programming offers in symbolic applications to numerical domains and offers a natural platform in which to study the merging of the parallelization techniques used in the numerical and symbolic programming fields, which, for example, entails studying new concepts of independence [10]. In fact, concurrent constraint programming languages already exist which can be used as a compilation target and which bring, among other things, a very simple and powerful synchronization mechanism based on constraint entailment [18]. Also, these systems, due to their combination of features, appear to be well suited as tools for the distributed/network-wide form of programming which may become mainstream in the coming years [5].