
The Command Interpreter

The command interpreter is built around a grammar: it is passed a stream of tokens (analogous to words in English), which it parses according to a set of grammatical rules. As it recognises each rule, it executes the code associated with that rule. See section The SM Grammar.

An example would be:

        aa : BB CC
           {
              printf("Rule BB CC found\n");
           }
which specifies that the rule aa consists of the token BB followed by the token CC, and that if rule aa is recognised the programme should print that fact out. Conventionally, uppercase names are reserved for `terminal symbols', and lowercase for `non-terminal symbols', where terminal symbols are those that are passed to the parser (analogous to words), and non-terminal symbols are tokens that the parser has constructed out of terminal symbols (analogous to phrases). The right hand side of a rule may contain a mixture of terminal and non-terminal symbols, and symbols may be assigned a value. (The grammar is actually specified using YACC; see S.C. Johnson, YACC: Yet Another Compiler Compiler, Computing Science Technical Report No. 32, 1975, AT&T Bell Laboratories, Murray Hill, NJ 07974. This report is reprinted in section 2b of the UNIX manual, and is rather difficult reading at first. We do not in fact use the AT&T code, which is proprietary, but rather a public-domain compiler-compiler called Bison, written by the Free Software Foundation.)

Token Generation

SM generates tokens for the grammar roughly as follows. When characters are typed at the keyboard, they are read by a routine which runs in CBREAK mode (PASSALL for VMS), and receives each character as it is typed. It is this routine that handles command line editing, the history system, and key bindings. (Specifying -s on the command line bypasses all of this, and makes SM read input one line at a time.) Following a carriage return, it passes the whole line to the lexical analyser, which divides the input stream into integers, floats, strings, or words. In addition it recognises $, {, }, and ^ as having special meanings (see below under variables ($) and history (^)). As in C, the escape sequence `\n' is replaced by a newline, which means that commands which read to the end of the line may be fooled into thinking that they have found it; see the examples at the end of the section. A { sets the flag `noexpand', which turns off the interpretation of all special symbols, and causes all tokens to be returned as WORD; the matching } unsets this flag. This mechanism is used in defining macros and various lists.

A word is anything which is not otherwise recognised, so for example `hello_there.c' or `1.2e' would be considered words. Symbols are separated by white space, taken to be spaces, tabs, or newlines, or by the characters {, }, +, -, *, /, =, ?, !, ,, <, >, (, or ). This behaviour can be modified by enclosing a string in double quotes, when no characters (except ^) are special, and tokens are delimited only by the end of the line, or by some character after the closing quote. Enclosing in quotes is rather similar to enclosing in {}, except that quotes have no grammatical significance. A string in double quotes is always treated as a word, but only if the quotes have not been discarded by the time that the lexical analysis occurs. For example, "2.80" is a float, as SM will have digested the " before looking at the string. You can fool it with "2.80 ".
A string begins with a ' and continues to the next ': strings are used in certain contexts where SM needs to know whether a WORD or a STRING is involved, for example in a PRINT command. It is worth noting that the '...' are stripped when the string is recognised -- if you need to preserve them, make sure that noexpand is set (e.g. SET s={ 'a' 'b' 'c' }).

The output from this programme is passed to a second stage of lexical analysis. This passes integers and floats through unaltered, while words are passed through a filter to see if they are external tokens from the grammar (such as CONNECT). If a word is recognised as being a token then that token is returned, otherwise the token WORD is passed, and the text of the word is stored. Tokens may be written in either lower or upper case, but for clarity they are written in upper case in this document. The overloading of lowercase tokens is achieved at this stage by simply refusing to recognise them as keywords.

The input stream is now fully analysed into tokens and is passed to the parser, which is written in YACC. If the sequence of tokens seen corresponds to a grammar rule, the parser executes the appropriate section of code, which is written in C. If the parser doesn't understand, it tells you that you have a syntax error and prints the last logical line that it was processing, with the error underlined. If you can't figure out which command it really failed on, try setting the VERBOSE flag to be 4 or more. This produces a voluminous output, which will stop suddenly when the error re-occurs. One simple rule in the grammar is that a WORD should be treated as a possible macro.

Peculiarities of the Grammar

If the command interpreter is faced with a pair of grammar rules such as
        AA BB CC
and
        AA BB
it cannot know, without examining the next token, whether to treat the tokens AA BB as the first part of AA BB CC, or as the complete command AA BB followed by the token CC beginning the next command. This ambiguity only arises if a command can begin with CC, and may be dealt with by defining the second rule as
        AA BB \n
This should be borne in mind whenever SM complains about a syntax error in an apparently valid command (such as LIST MACRO HELP, intended as first LIST MACRO and then the valid command HELP). The presence of a required carriage return also sometimes means that macros must be spread over a number of lines rather than written as one long list of commands, although a carriage return may always be written as `\n', which makes SM think that it has found a real one. There is also a requirement that an ELSEless IF statement should end with a newline; this is produced by a subtlety of the way that IFs are processed, and is discussed under IF.

SM places a restriction upon commands such as RELOCATE which expect more than one argument: the arguments must be numbers rather than (scalar) expressions. This is forced by the unary minus, for if the grammar sees expr1 - expr2 it cannot know whether this is the two expressions expr1 and -expr2, or the single expression expr1-expr2. Unless the grammar is changed, for instance by using commas to separate arguments, this restriction cannot be lifted; it can, however, frequently be circumvented using macros such as rel, discussed under `Useful Macros'. Alternatively, in almost all cases the expression can be enclosed in parentheses, for example connect (lg(x)) (-lg(rho)).

The Macro Processor

Executing a macro consists of substituting the text of the macro for its name. To understand how SM does this you have to know a bit more about how it processes input characters. We said above that it `passed the whole line' to the lexical analyser. What it actually does is to pass a pointer to the line, and start reading from the beginning of the line. If you now execute a macro, all that happens is that SM passes a pointer to the text of the macro, and starts reading from it instead; the old pointer is pushed onto the top of a stack. When SM comes to the `\0' at the end of the macro text, the stack is popped and input continues as if the macro had never been seen. When it comes to the end of the `whole line' mentioned at the start of this paragraph, that too is popped, and SM gives you a prompt for more input. Of course, if a second macro is seen while the first macro is being executed, the first one is pushed onto the stack, and attention transferred to the new one. If a macro has any arguments, their definitions are pushed onto an argument stack which is popped at the proper times. To jump ahead a little, variables are implemented in a very similar way, being pushed onto the stack, as are DO and FOREACH loops, and, perhaps more surprisingly, IF statements.

The strange behaviour of RETURN at the end of macros comes about because when the input routine is reading the RETURN it has to read one character beyond it, so as to know that it isn't dealing with, say, RETURN_OF_THE_NATIVE. But in looking for that next character it has to pop the macro off the stack, so by the time the RETURN is acted upon we have already returned from the macro we wanted to return from, and the RETURN takes effect one level too high. In a similar way, an IF at the end of a macro will cause the parser to look for an ELSE, thereby popping the macro stack if there isn't one. If the IF test was true, and its body contained references to macro arguments, there will be a problem, as either no macro arguments will be defined, or the arguments of the previous macro on the stack will be supplied.

Macro definitions are currently stored in the form of a weight-balanced tree (specifically a BB(1 - sqrt(2)/2) tree). This means that the access time for a given macro grows only as the logarithm of the total number defined. In the future it may become possible to choose the weights depending on the access probability for a given macro, but this is not currently supported. Definitions of variables and vectors are stored in the same way.

The DO, FOREACH, and IF commands

It seems worth discussing the implementation of these commands. Both loops consist of a definition of a variable, together with instructions about what to do with it, followed by a list of commands within a set of {}, while IF just has the command list. It is not possible for the main grammar to execute commands or macros, as the YACC implementation is non-reentrant, so the best that it can do is to push the commands onto the input stack as a sort of temporary macro, after defining the initial value of the loop parameter. When the `\0' at the end of the loop body appears, instead of popping the macro stack we simply define the loop parameter to have its next value, and jump back to the beginning. This means that you can't change the value of a loop parameter, as it'll be reset anyway, but you can use it as a sort of local variable.

IF statements are similar, in that SM reads the entire list before executing it. Once more, a temporary buffer is pushed onto the stack, with instructions to delete it after use. The reason that a newline is required after an ELSEless IF is that the grammar will already have read the next token to see if it was ELSE; if it wasn't, that token will seem to have been typed before the body of the IF. For example, IF( test ) { echo Hello } PROMPT : will be parsed as IF( test ) { PROMPT echo Hello } : if test is true, but correctly as IF( test ) { echo Hello } PROMPT : if it is false. Because an extra \n does no harm, we demand it.

Examples

If you want to watch SM thinking about these examples, the command VERBOSE 4 will make it print out in detail each token as it reads it, and each macro or variable as it expands it. To turn this off, use VERBOSE 0 or VERBOSE 1. To really see the parser at work, try a negative value of verbosity; this will report every step that the parser takes, provided that SM was compiled with DEBUG defined. A second negative value will turn the information off again.

PROMPT @
PROMPT is an external token, so PROMPT is passed to the grammar which recognises the rule PROMPT WORD, and sets the prompt to be `@'. When it has finished, control is passed back to the input routine.
MACRO p { PROMPT }
This is a simple macro, defining p to be PROMPT.
p @
The lexical analyser doesn't recognise p as a keyword, so it returns WORD, and as the grammar has no other interpretation of a WORD in this context, it passes p to the macro interpreter, which replaces it by PROMPT (i.e. pushes PROMPT onto the input stack). SM now thinks that you have just typed PROMPT @, and behaves as described in the first example.
MACRO pp 1 { PROMPT $1 }
The macro pp is declared to have one argument, which is referred to as $1. After pp is invoked it reads the next (whitespace delimited) word from the input stream, and replaces $1 by that word.
pp @
Just like the first example, the prompt is set to @.
pp
You are prompted for the missing argument to PROMPT.
PRMPT
As PRMPT isn't an external token, it is a WORD, so SM tries to execute it as a macro and complains if it isn't defined.
DEFINE Hi Hello
The variable Hi is defined to have the value Hello.
WRITE STANDARD $Hi Gentle User
When it has read $Hi SM pushes the value of the variable Hi onto the stack and then reads it, popping it off again when it has finished. The WRITE STANDARD command writes Hello Gentle User (i.e. up to the end of the line) to the terminal.
WRITE STANDARD $Hi Gentle User \n pp "SM>"
As above, the rest of the line is written to the terminal (up to the carriage return `\n'), then the prompt is changed yet again.
