Summary
One of the nicer aspects of programming in Java is that once the foundation is solid, building upon it is easy. Now that we have our interpreted language parsed into Java objects, implementing the execution engine is straightforward. This column, the third installment on building an interpreter, shows you how. (4,650 words)
For those of you just joining us, in my "Java In Depth" column over the past couple months, I've been discussing how one might go about building an interpreter in Java. In the first interpreter column, we covered some of the desirable attributes of an interpreter; in the second column we discussed both parsing and the layout of a class package for implementing the interpreter. In this column we'll look at running the interpreter, and the support classes necessary to accomplish that. Finally, I'll wrap up the series here with a discussion of how an interpreter can be hooked into other Java classes, thus enhancing their abilities.
Reviewing the relevant bits
Let me start by sketching out what we've covered so far and point out
those parts of the design that will become more important as we discuss
execution mode. For a more detailed description of these classes, refer
to my previous columns or to the source code links that are in the Resources section below.
There are three foundation classes in the implementation of the
interpreter, Program
, Statement
, and
Expression
. The following shows how the three are
related:
Program
ThisProgram
class glues together the parsing and execution components of the parser. This class defines two principal methods,load
andrun
. Theload
method reads statements from an input stream and parses them into a collection of statements, therun
method iterates over the collection and executes each of the statements. TheProgram
class also provides a collection of variables for the program to use as well as a stack for storing data.
Statement
TheStatement
class contains a single parsed statement. This class is actually subclassed into a specific type of statement (PRINT, GOTO, IF, and so on) but all statements contain the methodexecute
which is called to execute the statement in the context of aProgram
class instance.
Expression
TheExpression
class contains the parse tree of an expression. During execution, thevalue
method is used to evaluate the expression and return its value. LikeStatement
, theExpression
class is primarily designed to be subclassed by specific types of expressions.
All of these classes work together to form the basis of an
interpreter. The Program
class simultaneously encapsulates
the parsing operation and execution operation, whereas the
Statement
and Expression
classes encapsulate
the actual computational concepts of the language we've implemented.
For these three articles on building interpreters, the example language
has been BASIC.
Facilities for computation
There are two executable classes in the interpreter,
Statement
and Expression
. First let's take a
look at Expression
.
The instances of Expression
are created by the method
expression
in the class ParseExpression
. The
ParseExpression
class implements the expression parser in
this interpreter. This class is a peer of the
ParseStatement
class, which uses the
statement
method to parse BASIC statements. Instances of
Expression
have an internal type
that
identifies what operator the instance represents, and two methods,
value
and stringValue
, that return the
computed value of the expression. Additionally, when an Expression
instance is created, it is nominally given two parameters representing
the left and right side of the expression's operation. Shown in source
form, the first part of the Expression is as follows:
class Expression { Expression arg1, arg2; int oper; final static int OP_ADD = 1; // Addition '+' final static int OP_SUB = 2; // Subtraction '-' final static int OP_MUL = 3; // Multiplication '*' final static int OP_DIV = 4; // Division '/' [... etc for all of the operators ...] final static int OP_BNOT = 19; // Boolean negation '.NOT.' final static int OP_NEG = 20; // Unary minus
As you can see in the code above, there are the instance variables, an operator type named oper, and two halves of the operation in arg1 and arg2, and then some constants to define the various types. Also, there are two constructors that are used by the expression parser; these create a new expression from two expressions. Their source is shown below:
protected Expression(int op, Expression a, Expression b) throws BASICSyntaxError { arg1 = a; arg2 = b; oper = op; /* * If the operator is a boolean, both arguments must be boolean. */ if (op > OP_GE) { if ( (! (arg1 instanceof BooleanExpression)) || (! (arg2 instanceof BooleanExpression)) ) throw new BASICSyntaxError(typeError); } else { if ((arg1 instanceof BooleanExpression) || (arg2 instanceof BooleanExpression)) throw new BASICSyntaxError(typeError); } } protected Expression(int op, Expression a) throws BASICSyntaxError { arg2 = a; oper = op; if ((oper == OP_BNOT) && (! (arg2 instanceof BooleanExpression))) throw new BASICSyntaxError(typeError); }
The first constructor builds an arbitrary expression object, and the second one builds a "unary" expression object -- such as unary minus. One thing to note is that if there is only one argument, arg2 is used to store its value.
The method that is used the most frequently in the
Expression
class is value
, which is defined
as follows:
double value(Program pgm) throws BASICRuntimeError { switch (oper) { case OP_ADD : return arg1.value(pgm) + arg2.value(pgm); case OP_SUB : return arg1.value(pgm) - arg2.value(pgm); ... etc for all of the other operator types. ...
You can see that each expression object represents a tuple
consisting of an operator and one or two arguments. The fun part of
designing the expression execution engine in this way is that when you
construct this set of expression tuples based on the
Expression
object, you can compute the value of the
expression by simply invoking the value
method. The
value
method recursively invokes the value
method of the two arguments that compose this expression, applies the
operation to them, and returns the result. This design was used so that
expressions would be easy to understand.
To keep the class structure clean, all computational units -- from
constants to trigonometric functions -- are subclasses of
Expression
. This idea, stolen shamelessly from Lisp,
completely encapsulates the notion of "causing" an evaluation
to occur, from the actual implementation of "how" that
evaluation occurs. To demonstrate how this principle is applied, we
need only look at some of the specialized subclasses of
Expression
.
Constants in my version of BASIC, which I named COCOA are
represented by the class ConstantExpression
, which
subclasses Expression
and simply stores the numeric value
in a member value. The source code to ConstantExpression
is shown conceptually below. I say "conceptually" because I
did choose to bundle what would have been
StringConstantExpression
and
NumericConstantExpression
into a single class. So the real
class includes a constructor for creating a constant with a string
argument and for returning its value as a string. The following code
shows how the ConstantExpression
class handles numeric
constants.
class ConstantExpression extends Expression { private double v; ConstantExpression(double a) { super(); v = a; } double value(Program pgm) throws BASICRuntimeError { return v; } }
The code shown above replaces the more complicated constructors of
Expression
with a simple store of an instance variable;
the value
method is simply replaced with a return of the
stored value.
It is true that you could code the Expression
class to
accept in its constructors a constant that would save you a class.
However, one advantage of designing Expression
the way I
have is that the code in Expression
remains maximally
generic. I find this style of coding helps me eliminate the complexity
of special cases and thus when I am "done" with the
Expression
code, I can move on to other aspects of
expressions without revisiting the base class again and again. The
advantage becomes clearer when we delve into another subclass of
Expression
named FunctionExpression
.
In the class FunctionExpression
, there were two design
requirements I felt should be met to keep the interpreter flexible. The
first was to implement the standard BASIC functions; the other was to
encapsulate the parsing of the functions arguments into the same class
that implemented these functions. The second requirement, parsing, was
motivated by a desire to make this BASIC extensible by creating
additional function libraries that could be passed to the parser as
subclasses of FunctionExpression
. Further, those
passed-in classes could be used by the parser to increase the number of
functions available to the user's program.
The FunctionExpression
class is only moderately
more complicated than a ConstantExpression
and is
shown in condensed form below:
1 class FunctionExpression extends Expression { 2 [ ... parsing stuff removed ...] 3 double value(Program p) throws BASICRuntimeError { 4 try { 5 switch (oper) { 6 case RND : 7 if (r == null) 8 r = p.getRandom(); 9 return (r.nextDouble() * arg2.value(p)); 10 case INT : 11 return Math.floor(arg2.value(p)); 12 case SIN : 13 return Math.sin(arg2.value(p)); 14 [ ... and so on for all of the functions implemented ... ] 15 default : 16 throw new BASICRuntimeError("Unknown or non-numeric function."); 17 } 18 } catch (Exception e) { 19 if (e instanceof BASICRuntimeError) 20 throw (BASICRuntimeError) e; 21 else 22 throw new BASICRuntimeError("Arithmetic Exception."); 23 } 24 }
The above source shows how the value
method is
implemented. The oper variable is reused to hold the function
identity, and the Expression objects referenced by arg1 and
arg2 are used as arguments to the functions themselves.
Finally, there is a large switch statement that dispatches the
request. One interesting aspect is that the value
method
does catch the potential arithmetic exceptions and converts them into
instances of BASICRuntimeError
.
The parsing code in FunctionExpression
is shown below,
again condensed to save space. (Remember, all of the source code is
available using links in the Resources
section.)
1 static FunctionExpression parse(int ty, LexicalTokenizer lt) throws BASICSyntaxError { 2 FunctionExpression result; 3 Expression a; 4 Expression b; 5 Expression se; 6 Token t; 7 8 t = lt.nextToken(); 9 if (! t.isSymbol('(')) { 10 if (ty == RND) { 11 lt.unGetToken(); 12 return new FunctionExpression(ty, new ConstantExpression(1)); 13 } else if (ty == FRE) { 14 lt.unGetToken(); 15 return new FunctionExpression(ty, new ConstantExpression(0)); 16 } 17 throw new BASICSyntaxError("Missing argument for function."); 18 } 19 switch (ty) { 20 case RND: 21 case INT: 22 case SIN: 23 case COS: 24 case TAN: 25 case ATN: 26 case SQR: 27 case ABS: 28 case CHR: 29 case VAL: 30 case STR: 31 case SPC: 32 case TAB: 33 case LOG: 34 a = ParseExpression.expression(lt); 35 if (a instanceof BooleanExpression) { 36 throw new BASICSyntaxError(functions[ty].toUpperCase()+" function cannot accept boolean expression."); 37 } 38 if ((ty == VAL) && (! a.isString())) 39 throw new BASICSyntaxError(functions[ty].toUpperCase()+" requires a string valued argument."); 40 result = new FunctionExpression(ty, a); 41 break; [ ... other cases omitted for brevity ... ] 42 default: 43 throw new BASICSyntaxError("Unknown function on input."); 44 45 } 46 t = lt.nextToken(); 47 if (! t.isSymbol(')')) { 48 throw new BASICSyntaxError("Missing closing parenthesis for function."); 49 } 50 return result; 51 }
Note that this code takes advantage of the fact that the expression
parser in ParseStatement
has already figured out that it
is looking at an expression and has passed the identity of the
expression in as parameter ty. This parser then need only
locate the opening parenthesis and the closing parenthesis that contain
the argument(s). But look carefully: In lines #9 through #18, the
parser allows some functions to have no arguments (in this case RND and
FRE). This demonstrates the flexibility afforded by having the
function subparser built into this class, rather than forcing all
functions to conform to some predefined template. Given a function type
in the parameter ty, the switch statement selects a branch
that can parse the arguments required for that function, be they
strings, numbers, other expressions, and so on.
Other aspects: Strings and
arrays
Two other parts of the BASIC language are implemented by the COCOA
interpreter: strings and arrays. Let's look at the implementation of
strings first.
To implement strings as variables, the Expression
class
was modified to include the notion of "string" expressions.
This modification took the form of two additions:
isString
and stringValue
. The source for
these two new methods is shown below.
String stringValue(Program pgm) throws BASICRuntimeError { throw new BASICRuntimeError("No String representation for this."); } boolean isString() { return false; }
Clearly, it isn't too useful to a BASIC program to get the string
value of a base expression (which is always either numeric or boolean
expression). You might conclude from the lack of utility that these
methods then did not belong in Expression
and belonged in
a subclass of Expression
instead. However, by putting
these two methods in the base class, all Expression
objects can be tested to see if, in fact, they are strings.
Another design approach is to return the numeric values as strings
using a StringBuffer
object to generate a value. So,
for example, the same code could be rewritten as:
String stringValue(Program pgm) throws BASICRuntimeError { StringBuffer sb = new StringBuffer(); sb.append(this.value(pgm)); return sb.toString(); }
And if the code above is used, you can eliminate the use of
isString
because every expression can return a string
value. Further, you can modify the value
method to try to
return a number if the expression evaluates to a string by running it
through the valueOf
method of
java.lang.Double
. In many languages such as Perl, TCL, and
REXX, this sort of amorphous typing is used to great advantage. Both
approaches are valid, and you should make your choice based on the
design of your interpreter. In BASIC, the interpreter needs to return
an error when a string is assigned to a numeric variable, so I chose the
first approach (returning an error).
As for arrays, there are different ways in which you can design your
language to interpret them. C uses the square brackets around array
elements to distinguish the array's index references from function
references that have parentheses around their arguments. However, the
language designers for BASIC chose to use parentheses for both
functions and arrays so when the text NAME(V1, V2)
is seen
by the parser, it could be either a function call or an array
reference.
The lexical analyzer discriminates between tokens that are followed
by parentheses by first assuming they are functions and testing for
that. Then it goes on to see if they are keywords or variables. It is
this decision that prevents your program from defining a variable named
"SIN." Any variable whose name matched a function name would be
returned by the lexical analyzer as a function token instead. The
second trick the lexical analyzer uses is to check to see if the
variable name is immediately followed by `('. If it is, the analyzer
assumes it is an array reference. By parsing this in the lexical
analyzer, we eliminate the string `MYARRAY ( 2 )
' from
being interpreted as a valid array (note the space between the variable
name and the open parenthesis).
The final trick to implementing arrays is in the
Variable
class. This class is used for an instance of a
variable, and as I discussed in last month's
column, it is a subclass of Token
. However, it also
has some machinery to support arrays and that is what I'll show below:
class Variable extends Token { // Legal variable sub types final static int NUMBER = 0; final static int STRING = 1; final static int NUMBER_ARRAY = 2; final static int STRING_ARRAY = 4; String name; int subType; /* * If the variable is in the symbol table these values are * initialized. */ int ndx[]; // array indices. int mult[]; // array multipliers double nArrayValues[]; String sArrayValues[];
The above code shows the instance variables associated with a
variable, as in the ConstantExpression
class. One has to
make a choice about the number of classes to be used versus the
complexity of a class. One design choice might be to build a
Variable
class that holds only scalar variables and then
add an ArrayVariable
subclass to deal with the intricacies
of arrays. I chose to combine them, turning scalar variables
essentially into arrays of length 1.
If you read the above code, you will see array indices and multipliers. These are here because multidimensional arrays in BASIC are implemented using a single linear Java array. The linear index into the Java array is computed manually using the elements of the multiplier array. The indices used in the BASIC program are checked for validity by comparing them to the maximum legal index in the indices' ndx array.
For example, a BASIC array with three dimensions of 10, 10, and 8, would have the values 10, 10, and 8 stored in ndx. This allows the expression evaluator to test for an "index out of bounds" condition by comparing the number used in the BASIC program to the maximum legal number that is now stored in ndx. The multiplier array in our example would contain the values 1, 10, and 100. These constants represent the numbers one uses to map from a multidimensional array index specification into a linear array index specification. The actual equation is:
Java Index = Index1 + Index2 * Max Size of Index1 + Index3 * (MaxSize of Index1 * MaxSizeIndex 2)
The next Java array in the Variable
class is shown
below.
Expression expns[];
The expns array is used to deal with arrays that are
written as "A(10*B, i)
." In that case, the
indices are actually expressions rather than constants, so the
reference has to contain pointers to those expressions that are
evaluated at run time. Finally there is this fairly ugly-looking piece
of code that calculates the index depending on what was passed in the
program. This private method is shown below.
private int computeIndex(int ii[]) throws BASICRuntimeError { int offset = 0; if ((ndx == null) || (ii.length != ndx.length)) throw new BASICRuntimeError("Wrong number of indices."); for (int i = 0; i < ndx.length; i++) { if ((ii[i] < 1) || (ii[i] > ndx[i])) throw new BASICRuntimeError("Index out of range."); offset = offset + (ii[i]-1) * mult[i]; } return offset; }
Looking at the code above, you'll note that the code first checks to
see that the correct number of indices were used when referencing the
array, and then that each index was within the legal range for that
index. If an error is detected, an exception is thrown to the
interpreter. The methods numValue
and
stringValue
return a value from the variable as a number
or a string respectively. These two methods are shown below.
double numValue(int ii[]) throws BASICRuntimeError { return nArrayValues[computeIndex(ii)]; } String stringValue(int ii[]) throws BASICRuntimeError { if (subType == NUMBER_ARRAY) return ""+nArrayValues[computeIndex(ii)]; return sArrayValues[computeIndex(ii)]; }
There are additional methods for setting the value of a variable that are not shown here.
By hiding much of the complexity of how each piece is implemented, when it finally comes time to execute the BASIC program, the Java code is quite straightforward.
Running the code
The
code to interpret the BASIC statements and execute them is contained in
the run
method of the Program
class. The code
for this method is shown below, and I'll step through it to point out
the interesting parts.
1 public void run(InputStream in, OutputStream out) throws BASICRuntimeError { 2 PrintStream pout; 3 Enumeration e = stmts.elements(); 4 stmtStack = new Stack(); // assume no stacked statements ... 5 dataStore = new Vector(); // ... and no data to be read. 6 dataPtr = 0; 7 Statement s; 8 9 vars = new RedBlackTree(); 10 11 // if the program isn't yet valid. 12 if (! e.hasMoreElements()) 13 return; 14 15 if (out instanceof PrintStream) { 16 pout = (PrintStream) out; 17 } else { 18 pout = new PrintStream(out); 19 }
The above code shows that the run
method takes an
InputStream
and an OutputStream
for use as
the "console" for the executing program. In line 3, the
enumeration object e is set to the set of statements from the
collection named stmts. For this collection I used a variation
on a binary search tree called a "red-black" tree. (For further
information on binary search trees, see my previous column on
building generic collections.) Following that, two additional
collections are created -- one using a Stack
and one using
a Vector
. The stack is used like the stack in any
computer, but the vector is used expressly for the DATA statements in
the BASIC program. The final collection is another red-black tree that
holds the references for the variables defined by the BASIC program.
This tree is the symbol table that is used by the program while it is
executing.
Following the initialization, the input and output streams are set up, and then if e is not null, we start by collecting any data that has been declared. That is done as shown in the following code.
/* First we load all of the data statements */ while (e.hasMoreElements()) { s = (Statement) e.nextElement(); if (s.keyword == Statement.DATA) { s.execute(this, in, pout); } }
The above loop simply looks at all of the statements, and any DATA statements it finds are then executed. The execution of each DATA statement inserts the values declared by that statement into the dataStore vector. Next we execute the program proper, which is done using this next piece of code:
e = stmts.elements(); s = (Statement) e.nextElement(); do { int yyy; /* While running we skip Data statements. */ try { yyy = in.available(); } catch (IOException ez) { yyy = 0; } if (yyy != 0) { pout.println("Stopped at :"+s); push(s); break; } if (s.keyword != Statement.DATA) { if (traceState) { s.trace(this, (traceFile != null) ? traceFile : pout); } s = s.execute(this, in, pout); } else s = nextStatement(s); } while (s != null); }
As you can see in the code above, the first step is to reinitialize
e. The next step is to fetch the first statement into the
variable s and then to enter the execution loop. There is some
code to check for pending input on the input stream to allow the
progress of the program to be interrupted by typing at the program, and
then the loop checks to see if the statement to execute would be a DATA
statement. If it is, the loop skips the statement as it was already
executed. The rather convoluted technique of executing all data
statements first is required because BASIC allows the DATA statements
that satisfy a READ statement to appear anywhere in the source code.
Finally, if tracing is enabled, a trace record is printed and the very
uninpressive statement s = s.execute(this, in, pout);
is
invoked. The beauty is that all the effort of encapsulating the base
concepts into easy-to-understand classes makes the final code trivial.
If it isn't trivial then perhaps you have a clue that there might be
another way to split your design.
Wrapping up and further
thoughts
The interpreter was designed so that it could run as a thread, thus
there can be several COCOA interpreter threads running simultaneously
in your program space at the same time. Further, with the use of
function expansion we can provide a means whereby those threads can
interact with each other. There was a program for the Apple II and
later for the PC and Unix called C-robots that was a system of
interacting "robotic" entities that were programmed using a simple
BASIC derivative language. The game provided me and others with many
hours of entertainment but was also an excellent way to introduce the
basic principles of computation to younger students (who mistakenly
believed they were just playing and not learning). Java based
interpreter subsystems are much more powerful than their pre-Java
counterparts because they are instantly available on any Java
platform. COCOA ran on Unix systems and Macintoshes the same day I got
working on a Windows 95 based PC. While Java gets beaten up by
incompatibilities in the thread or window toolkit implementations, what
is often overlooked is this: A lot of code "just works."
About the author
Chuck McManis currently is the director of system software at FreeGate
Corp., a venture-funded start-up that is exploring opportunities in the
Internet marketplace. Before joining FreeGate, Chuck was a member of
the Java Group. He joined the Java Group just after the formation of
FirstPerson Inc. and was a member of the portable OS group (the group
responsible for the OS portion of Java). Later, when FirstPerson was
dissolved, he stayed with the group through the development of the
alpha and beta versions of the Java platform. He created the first "all
Java" home page on the Internet when he did the programming for the
Java version of the Sun home page in May 1995. He also developed a
cryptographic library for Java and versions of the Java class loader
that could screen classes based on digital signatures. Before joining
FirstPerson, Chuck worked in the operating systems area of SunSoft,
developing networking applications, where he did the initial design of
NIS+. You can reach him at chuck.mcmanis@javaworld.com.
Also, check out his home page.
If you have problems with this magazine, contact
webmaster@javaworld.com
URL: http://www.javaworld.com/javaworld/jw-07-1997/jw-07-indepth.html
Last modified: