housel


Not Quite Research

Peter S. Housel's weblog


Previous Entry Share Next Entry
Project snapshot: an exercise in yak shaving
housel
Projects directed graph

At any given time, paid workload permitting, I tend to be working on one or more different free or open-source projects, most of them related to the Dylan programming language. Much of this work frequently ends up split up into a large number of subtasks with various complex interrelationships. So that others, and I myself, are clear about these...

Topic-oriented documentation

This is the only project I'm working on that isn't directly connected to anything else.  When Open Dylan was open-sourced, it came with a large amount of high-quality documentation in FrameMaker format. FrameMaker is a fairly expensive tool, so it makes sense to convert all of this documentation to some other format in order to allow wider participation.  In 2004, DITA was starting to become known as a hot new documentation schema/architecture, designed from the ground up for customization.  By May 2005 I had finished a customization for Dylan, built a prototype system for delivering it on the web, and converted a number of fragments of the Open Dylan documentation to use it.

Since then, it has continued as a background task.  It's one of those relatively relaxing and stress-free things I can do when I don't feel like coding, or when something prevents the development environment from running on one of my machines.

Collage

One of the nifty features of the original Apple Dylan compiler was Creole. Creole was a foreign-function interface facility for interfacing to C routines, simply by including in your program an interface definition such as:
define interface
  #include "sampleheader.h",
    import: { "sqrt" => square-root, 
              "expt" => exponent,
              "divide",
              "remainder" };
end interface;

Creole was built into the compiler and worked by parsing the named C header files.

The CMU Gwydion Dylan developers implemented something very similar called Melange. One of the main differences was that Melange was implemented as a separate tool which took an interface file (containing a definition like the one above) and turned it into a Dylan source file.  This allowed it to work with Mindy, and also made rebuilding the compiler faster than it would have been if the interface generator had been built in (a major consideration back then).

Melange contains its own C preprocessor (so that it can see #define definitions) and its own incomplete C parser.  It does reasonably well, but it has often been unable to parse system headers on Linux, BSD, and Mac OS systems, especially when they make use of gcc-specific C extensions.  For this reason, almost all of the people who have hacked on Gwydion Dylan have had their turn at improving Melange, or at least adding their own stopgap measures.

Fairly early on, Erik Kidd got fed up with adding bandages to Melange and started a new tool, Pidgin.  Pidgin concentrated on providing a sensible representation for C types, as the C type representation in Melange was a source of frequent problems.  The preprocessor and parser were adapted from the Melange code base and shared the same failings as Melange's.  Pidgin was designed to be made up of reusable libraries; a command-line tool was also included that could parse one or more C header files and generate code for Harlequin Dylan C-FFI.  (In 2001 I also made use of the Pidgin libraries as part of a tool to generate interfaces for glib 1.x and gtk 1.x.)

The C-FFI library was the Harlequin Dylan (later Functional Objects Functional Developer, now Open Dylan) take on a low-level foreign function interface facility.  It provides a fairly elegant way of working with C types, functions, and values.  The library was originally designed to be the lower-level substrate for a higher-level facility called Collage... but unfortunately Collage was never implemented.  Instead, they used gema along with a number of custom transformation patterns to generate C-FFI definitions from the Win32 headers.

At one point I worked on implementing C-FFI in d2c, the Gwydion Dylan compiler.  I implemented a lot of the compiler metaclass support needed for C-FFI designator classes, but never got any further than that.

Since Harlequin/Functional Objects never finished designing and implementing Collage, I've taken it upon myself to do so.  In particular, it will be a foreign function interface, built into the compiler (like Creole), and designed to work well with C-FFI.  The programmer's interface will look as much like Creole and Melange as possible, making it possible to adapt existing definitions with little work.

Collage needs to be built on top of a high quality C parser/internal representation.  Inspired by the SML-NJ ckit library and by George Necula's CIL, in 2004 I started work on CPR, the C Program Representation library. CPR consists of representation classes for C types, expressions, and programs, along with a C preprocessor, C parser, and a pretty-printer for writing out C language textual representations of the intermediate language representation.  In addition to being useful for Collage, CPR could also be used for constructing a C compiler (by connecting it to the Open Dylan HARP backend, say) or for writing the same sort of static analysis tools that people use CIL for.  It is designed to support multiple C dialects, including C89, C99, GNU C extensions, and Microsoft Visual C.

CPR has been an exercise in test-driven development using the excellent Testworks framework for Dylan. In lieu of a commercial test suite for C, I've been testing using the test suites included with mcpp and with gcc.  Both of these are based on Dejagnu, a test framework written using Tcl (Expect, actually).  Most of the tests are written as C source files containing embedded Dejagnu directives (which use Tcl syntax) in comments.  Dealing with these directives required writing my own Tcl parser, with (since I insisted on doing test-driven development) its own test suite based on the tests included in the Tcl source code.  (These currently only implement just enough to handle embedded Dejagnu, though it could be expanded if some application required it.)

Currently the CPR preprocessor is almost complete, modulo some difficulties with integer arithmetic (as described below).  The expression representation classes, parser, and pretty-printer are complete, and I've begun work on the parsing and representation of type names and declarations.

Numbers

The Dylan Reference Manual states that

Implementations are required to support integers with at least 28 bits of precision. The overflow and underflow behavior is implementation-defined. (Some implementations may choose to have integers of unlimited size, but this is not required.)

Gwydion Dylan implements 32 bits of <integer> precision.  This makes use of d2c's two-word tagged representation for objects.  No checking for overflow or underflow is done.  Infinite-precision integers (bignums) are provided via the <extended-integer> class, included in an extensions module in the Dylan library.

Open Dylan, on the other hand, uses a tagged pointer representation with two tag bits:

  • #b00 => Object pointer
  • #b01 => <integer>
  • #b10 => <character>
  • #b11 => <unicode-character>

This leaves 30 bits for the integer values.  (Or 62 bits for 64-bit procesors, if we supported them.) As described here, integer overflows and underflows cause an exception to be raised.

Both implementations support a <double-integer> class, corresponding roughly to a C "long long int" (usually 64 bits).  In the case of Open Dylan, this is provided in the context of the big-integers and generic-arithmetic libraries, which allow users to re-map the <integer> class and the arithmetic functions to generic versions that allow more bits of precision than the standard 30 bits.

For CPR to support integer arithmetic in a generic way, infinite-precision integers are needed.  So that I can rely on having <extended-integer> with both compilers, I'm currently working on implementing the necessary support in Open Dylan.  Most of this is being placed in the big-integers library, but some changes to the compiler (DFMC) are needed too.  When I last left off I had implemented addition, subtraction, O(n2) multiplication, some of the logical arithmetic functions, and had started implementing division.  TAoCP volume 2 and the GMP manual detail some of  the more sophisticated algorithms that I'll add once the basic functionality is out of the way.

Several years ago I proposed, and then implemented in Gwydion Dylan, a number of extensions to the Common Dylan library for doing floating point arithmetic.  Implementing these extensions, in particular guaranteed-accurate conversions between decimal and floating-point, requires infinite-precision integers.  Once <extended-integer> support is complete these extensions can be added to Open Dylan as well.

(Actually, Bruce Hoult recently described in IRC a clever new decimal-to-floating point algorithm that allows one to limit the size of the integers involved to 1152 bits.)

There are additional things that can be done to improve integer arithmetic.  Using the #b01 tag means that when you add two tagged integers, you need to either mask the tag off of one of them, or subtract one from the result, in order to yield a valid tagged integer.  That's a waste of an instruction in a frequently-executed operation.  We can do better than that with the following tag scheme, inspired by something I first read in Appel's Compiling with Continuations:

  • #b00 => Even <integer> 
  • #b01 => Object pointer
  • #b10 => Odd <integer>
  • #b11 => <character>

Now adding or subtracting integers doesn't require any tag fixups, instance?(<integer>, x) is a single-bit test, and we get 31 bits (or 63 for 64-bit processors) instead of just 30 (or 62).  It doesn't matter that object pointer references are to one more than the actual (32-bit aligned) address, since most accesses to object fields use an immediate offset anyway, so we just reduce the offset in the addressing mode by 1.  (This would also require a change to the garbage collector so that it knows how to recognize offset pointers.)

I think we could do even better by making bignums transparent: integer arithmetic would transparently overflow into bignums when needed, making it possible for programmers to think of <integer> as the mathematical set of integers. The DRM could be changed to read:

Implementations are required to support integers of unlimited size.

Doing this efficiently would require that the compiler do integer range analysis in order to prove when fixed-precision integer arithmetic is sufficient.  This sort of analysis is currently done by d2c but not by the Open Dylan compiler.

Note that since the size field of most sequence types (such as <simple-object-vector>) will necessarily fit in a fixed-precision integer since the maximum number of elements is 1/4 (or 1/8) the size of the address space.  This constraint would allow a significant fraction of arithmetic operations to be optimized to use fixnum-only arithmetic. 

Open Dylan Project-Manager

The project-manager component (currently the projects, registry-projects, and user-projects libraries) of the Open Dylan compiler and development environment is responsible for:

  • Locating Dylan libraries given their names
  • Locating the database (*.ddb) file for a compiled library
  • Managing the project file (*.hdp or *.lid)
  • Keeping track of dependencies between libraries, and determining when a library needs to be recompiled
  • Determining whether particular cross-library bindings should be done in "tight" or "loose" mode
  • Interfacing with the compiler and build system to control compilation and linking
  • Interfacing with the "tools" plugin interface for constructing Dylan source files (or entire subprojects) out of some other source.  Current tools include tool-scepter (the CORBA IDL compiler), tool-parser-generator (the LALR(1) parser generator for *.dylgram files), and motley (the OLE type library interface generator).

There are currently three types of projects: binary-only projects, user projects, and registry projects.  Binary-only projects have only a *.ddb file and a shared library (*.dll/*.so), and of course can never be recompiled.  Binary projects belonged to library packs. Registry projects are opened via a registry, which is a directory tree containing one-line files which map library names to LID or *.hdp project files in a source code hierarchy.  User projects are opened by referencing their LID or *.hdp files directly.

Before r9982, things worked like this:

  • When opening a user project (i.e., via a LID or HDP file), the results of compilation (build temporaries, linked DLL, import library, etc.) were placed in subdirectories of the directory where the project file was found.
  • System projects were never recompiled, even if the source code was modified.

Since at least r10747, the project manager works like this:

  • No matter what kind of library you open, the results of compilation are stored in subdirectories of ~/Open-Dylan (on Linux/FreeBSD/Mac OS) or %AppData%\Open-Dylan (on Win32)
  • This means that the first time you compile something with Open Dylan, it recompiles all of the libraries on which it depends (such as dylan, common-dylan, io, system, etc.) into your personal root (such as ~/Open-Dylan)
  • Warning messages from system projects (dylan, common-dylan, etc.) show up in the Warnings tab in the IDE whenever you compile something that uses them
  • Registry projects are easier to use than they were before

I like this less than the old way, and have been known to revert a few of the relevant revisions in my local working copy during development. Of course, either way is short of ideal.

In order to solve this in a more flexible way, allowing people to customize the environment to get the behavior they want, I came up with the concept of Dylan Library Catalogs. So far I've defined the requirements, but don't have all of the design ready.

Other improvements can be made to the project manager, such as:

  • It should be possible to support projects containing no Dylan code, only (say) a Jamfile.  This would allow projects with significant amounts of non-Dylan code (such as Open Dylan itself) to be built entirely under the control of the Dylan environment.
  • It ought to be possible to do a re-link of a *.so shared library to change its rpath (as is often done using "libtool install") without re-compiling the library
  • The tools interface should be supported for registry projects as well as for user projects.  (Working around this currently makes the build process more complex than it needs to be.)

At one point Andy Armstrong began implementing a new project manager (the currently unused projects-protocol and dfmc-projects libraries).  I haven't yet tried hooking this up to the build to see how well it works.

Runtime Cleanup

The Open Dylan runtime source code is a bit of a mess. Problems include:

  • The run-time for the native code generator is found under sources/lib/run-time, whereas the run-time for the C code generator is under sources/dfmc/c-run-time
  • The source directories and their subdirectories contain a haphazard mix of platform-dependent and platform-independent files
  • In some cases source files get copied from one directory to another during build
  • Some *.c files #include other *.c files
  • Either MPS or the Boehm-Demers-Weiser garbage collector is supported; the choice is handled via a set of haphazard #ifdefs in the runtime, rather than providing a clean interface
  • Certain structures and interface functions are declared both in the C code and (in Dylan) in the run-time generator, requiring that they be manually kept in sync
  • The Makefiles for each platform are completely separate, and probably don't accurately reflect the inter-file dependencies
What it mainly needs is just a little refactoring care and attention.  I've been holding off doing any until I can replace the Makefiles with Jamfiles.  This could be done, I guess, except that the Jam automatic dependency mechanism depends on using a regular expression with \(...\) submatch addressing.  That is, unfortunately, one of the few things missing from the Open Dylan Jam implementation.  I have been planning to rectify this by adapting the Jam interpreter's regular-expression implementation to use one of the algorithms in Ville Laurikari's master's thesis or in his TRE regular expression library.

It's worth noting that we could also solve the problem of keeping the C and Dylan sides of the run-time interface in sync by using CPR to parse header files in the run-time.

As far as directory layout is concerned, I think it might make sense to place both source directories under lib/dylan, since the code is statically linked into the Dylan library.

Tags:

  • 1

What's the current state of CPR?

(Anonymous)
CPR is the central project for any dylan redesign project.

Where can I find the code and the latest status information about the CPR Project?

(Maybe christmas brings a new computer and time for coding)

pet-ro

  • 1
?

Log in