The command interpreter is built around a grammar. The grammar is passed a stream of tokens (analogous to words in English), which it parses according to a set of grammatical rules. As it recognises each rule, it executes the code associated with that rule. See section The SM Grammar.
An example would be:
aa : BB CC { printf("Rule BB CC found\n"); }
which specifies that the rule aa consists of the token BB followed by CC, and that if rule aa is recognised the programme should print that fact out. Conventionally, uppercase names are reserved for `terminal symbols', and lowercase for `non-terminal symbols', where terminal symbols are those that are passed to the parser (analogous to words), and non-terminal symbols are tokens that the parser has constructed out of terminal symbols (analogous to phrases). The right hand side of a rule may contain a mixture of terminal and non-terminal symbols, and symbols may be assigned a value.
[Footnote: The grammar is actually specified using YACC; see S.C. Johnson, YACC: Yet Another Compiler Compiler, Computing Science Technical Report No. 32, 1975, AT&T Bell Laboratories, Murray Hill, NJ 07974. This report is reprinted in section 2b of the UNIX manual, and is rather difficult reading at first. We do not in fact use the AT&T code, which is proprietary, but rather a free compiler-compiler called Bison, written by the Free Software Foundation.]
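The idea of a rule with attached code can be sketched outside YACC. The following Python fragment is only a toy illustration, not the table-driven parser that YACC or Bison generates: the names RULES and recognise are hypothetical, and a rule is matched only against a complete token list.

```python
# Hypothetical rule table: each non-terminal maps to
# (right-hand side, action to run when the rule is recognised).
RULES = {
    "aa": (["BB", "CC"], lambda: "Rule BB CC found"),
}

def recognise(tokens):
    """Return (rule name, action result) for the first rule whose
    right-hand side equals the token list, or None if none matches."""
    for name, (rhs, action) in RULES.items():
        if tokens == rhs:
            return name, action()   # execute the code attached to the rule
    return None

print(recognise(["BB", "CC"]))   # ('aa', 'Rule BB CC found')
print(recognise(["BB", "DD"]))   # None
```

A real parser works incrementally on a stream of tokens and may nest rules inside one another; the sketch shows only the pairing of a rule with its action.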
[Footnote: The flag -s on the command line bypasses all of this, and makes SM read input one line at a time.]
Following a carriage return, it passes the whole line to the lexical analyser, which divides the input stream into integers, floats, strings, or words. In addition it recognises the characters $, {, }, and ^ as having special meanings (see below under variables ($) and history (^)). As in C, the escape sequence `\n' is replaced by a newline, which means that commands which read to the end of the line may be fooled into thinking that they have found it; see the examples at the end of the section.
A { sets the flag `noexpand', which turns off the interpretation of all special symbols, and causes all tokens to be returned as WORD. The matching } unsets this flag. This mechanism is used in defining macros and various lists.
A word is anything which is not otherwise recognised, so for example `hello_there.c' or `1.2e' would be considered words. Symbols are separated by white space (taken to be spaces, tabs, or newlines), or by one of the characters !, {, }, +, -, *, /, =, ?, `,' (comma), <, >, (, or ). This behaviour can be modified by enclosing a string in double quotes, when no characters (except ^) are special, and tokens are delimited only by the end of the line, or some character after the closing quote.
Enclosing in quotes is rather similar to enclosing in {}, except that quotes have no grammatical significance. A string in double quotes is always treated as a word, but the quotes must not have been discarded by the time that the lexical analysis occurs. For example, "2.80" is a float, as SM will have digested the " before looking at the string. You can fool it with "2.80 ".
A string begins with a ' and continues to the next '. Strings are used in certain contexts where SM needs to know if a WORD or STRING is involved, for example in a PRINT command. It's worth noting that the '...' are stripped when the string is recognised; if you need to preserve them make sure that noexpand is set (e.g. SET s={ 'a' 'b' 'c' }).
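The classification performed by this first stage can be sketched as follows. This is a simplified model and not SM's own code: the names classify and first_stage and the two regular expressions are assumptions, tokens are split on white space only, and quoting, strings, $ and ^ are ignored.

```python
import re

# Assumed patterns for unsigned numbers; '-' is a separator, not part of a number.
INTEGER = re.compile(r"\d+")
FLOAT = re.compile(r"(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")

def classify(text):
    """Anything that is not a complete integer or float is a WORD."""
    if INTEGER.fullmatch(text):
        return "INTEGER"
    if FLOAT.fullmatch(text):
        return "FLOAT"
    return "WORD"

def first_stage(line):
    """Split a line on white space and label each token.
    Between { and } the noexpand flag is set, so every token is a WORD."""
    noexpand = False
    tokens = []
    for text in line.split():
        if text == "{":
            noexpand = True
            tokens.append(("{", text))
        elif text == "}":
            noexpand = False
            tokens.append(("}", text))
        elif noexpand:
            tokens.append(("WORD", text))
        else:
            tokens.append((classify(text), text))
    return tokens

print(classify("1.2e"))             # WORD
print(first_stage("{ 12 } 12"))
```

Note how `1.2e' falls through both number patterns and is therefore a word, and how the same text 12 is a WORD inside braces but an INTEGER outside them.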
The output from this programme is passed to a second stage of lexical analysis. This passes integers and floats through unaltered, while words are passed through a filter to see if they are external tokens from the grammar (such as CONNECT). If a word is recognised as being a token then that token is returned, otherwise the token WORD is passed, and the text of the word is stored. Tokens may be written in either lower or upper case, but for clarity they are written in upper case in this document. The overloading of lowercase tokens is achieved at this stage by simply refusing to recognise them as keywords.
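A minimal sketch of this second stage, assuming a small keyword table and a set recording which keywords have had their lowercase spelling overloaded. Both names (KEYWORDS, overloaded) and the function second_stage are hypothetical; how SM actually records an overload is not described here.

```python
# A few external tokens named in this document; the real list is much longer.
KEYWORDS = {"CONNECT", "PROMPT", "RELOCATE"}
# Hypothetical record of keywords whose lowercase form has been overloaded.
overloaded = set()

def second_stage(kind, text):
    """Map a first-stage token to the token handed to the parser."""
    if kind != "WORD":
        return kind, text                  # integers and floats pass through unaltered
    keyword = text.upper()
    if keyword in KEYWORDS and text in (keyword, keyword.lower()):
        if text != keyword and keyword in overloaded:
            return "WORD", text            # refuse to recognise the lowercase form
        return keyword, text
    return "WORD", text                    # not a keyword: store the text as a WORD

print(second_stage("WORD", "connect"))     # ('CONNECT', 'connect')
overloaded.add("CONNECT")
print(second_stage("WORD", "connect"))     # ('WORD', 'connect')
print(second_stage("WORD", "CONNECT"))     # ('CONNECT', 'CONNECT')
```

Once the lowercase form is returned as a WORD, the grammar is free to treat it as a possible macro, while the uppercase form keeps its original meaning.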
The input stream is now fully analysed into tokens and is passed to the parser, which is written in YACC. If the sequence of tokens seen corresponds to a grammar rule, the parser executes the appropriate section of code, which is written in C. If the parser doesn't understand, it tells you that you have a syntax error and prints the last logical line that it was processing, with the error underlined. If you can't figure out which command it really failed on, try setting the VERBOSE flag to be 4 or more. This produces voluminous output, which will stop suddenly when the error re-occurs.
One simple rule in the grammar is that a WORD should be treated as a possible macro.
If the grammar contains two rules such as
AA BB CC
and
AA BB
the parser may not know whether to treat the tokens AA BB as the first part of AA BB CC, or as the complete command AA BB followed by the token CC beginning the next command, without examining the next token. This ambiguity only arises if a command can begin with CC, and may be dealt with by defining the second rule as
AA BB \n
This should be borne in mind whenever SM complains about a syntax error in an apparently valid command (such as LIST MACRO HELP, intended as first LIST MACRO and then the valid command HELP). The presence of a required carriage return also sometimes requires that macros be spread over a number of lines rather than written as one long list of commands, although a carriage return may always be written as `\n', which makes SM think that it has found a carriage return. There is also a requirement that an ELSE-less IF statement should end with a newline; this is produced by a subtlety of the way that IFs are processed and is discussed under IF.
SM places a restriction upon commands such as RELOCATE which expect more than one argument, which is that the arguments must be numbers rather than (scalar) expressions. This is required by the unary minus, as if the grammar sees expr1 - expr2 it cannot know whether this is the two expressions expr1 and -expr2, or the single expression expr1-expr2. Unless the grammar is changed, for instance by using commas to separate arguments, this restriction cannot be lifted; it can, however, frequently be circumvented using macros such as rel discussed under `Useful Macros'. As an alternative, in almost all cases the expression can be enclosed in parentheses, for example connect (lg(x)) (-lg(rho)).
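The ambiguity caused by the unary minus can be made concrete by counting the ways a token list splits into two arguments. The sketch below uses a deliberately tiny, assumed expression syntax (operands joined by + - * /, each optionally preceded by a unary minus, no parentheses); is_expr and two_argument_readings are hypothetical names.

```python
BINARY_OPS = {"+", "-", "*", "/"}

def is_expr(tokens):
    """True if tokens read as: ['-'] operand (op ['-'] operand)* ."""
    want_operand = True
    i = 0
    while i < len(tokens):
        if want_operand:
            if tokens[i] == "-":                  # unary minus
                i += 1
            if i == len(tokens) or tokens[i] in BINARY_OPS:
                return False                      # operand missing
        elif tokens[i] not in BINARY_OPS:
            return False                          # two operands in a row
        want_operand = not want_operand
        i += 1
    return bool(tokens) and not want_operand

def two_argument_readings(tokens):
    """Every way of splitting tokens into two valid expressions."""
    return [(tokens[:k], tokens[k:])
            for k in range(1, len(tokens))
            if is_expr(tokens[:k]) and is_expr(tokens[k:])]

for first, second in two_argument_readings("a - b - c".split()):
    print(" ".join(first), "|", " ".join(second))
```

For the input a - b - c this finds two readings, a followed by -b - c, and a - b followed by -c, which is why the arguments must be plain numbers or be enclosed in parentheses. Two plain numbers such as 3 4 have exactly one reading.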
DO and FOREACH loops, and perhaps more surprisingly IF statements.
The strange behaviour of RETURN at the end of macros comes about because when the input routine is reading the RETURN it has to read one character beyond it, so as to know that it isn't dealing with, say, RETURN_OF_THE_NATIVE. But in looking for the next character it has to pop the macro off the stack, so when the RETURN is acted upon we have already returned from where we wanted to return from, and we now RETURN from the wrong place. In a similar way, an IF at the end of a macro will cause the parser to look for an ELSE, thereby popping the macro stack if there isn't one. If the IF test was true, and contained references to macro arguments, there will be a problem as either there will be no macros defined, or the arguments to the previous macro on the stack will be supplied.
Macro definitions are currently stored in the form of a weight-balanced tree (actually a BB($1-\sqrt{2}/2$) tree). This means that the access time for a given macro only grows as the logarithm of the total number defined. In the future it may be possible to choose the weights depending on the access probability for a given macro, but this is not currently possible. Definitions of variables and vectors are stored in the same way.
IF just has the command list.
It is not possible for the main grammar to execute commands or macros, as the YACC implementation is non-reentrant, so the best that it can do is to push the commands onto the input stack as a sort of temporary macro, after defining the initial value of the loop parameter. When the `\0' at the end of the loop appears, instead of popping the macro stack we simply define the loop parameter to have its next value, and jump back to the beginning. This means that you can't change the value of a loop parameter, as it'll be reset anyway, but you can use it as a sort of local variable.
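The consequence for the loop parameter can be seen in a small model. This is only a sketch under assumptions: run_do, execute, and the commands show and clobber are hypothetical, and the stored command list is replayed with an ordinary Python loop rather than by re-reading a buffer on the input stack.

```python
def run_do(name, start, end, body, variables, execute):
    """Replay a stored command list once for each value of the loop parameter."""
    value = start
    while value <= end:
        variables[name] = value        # (re)define the loop parameter
        for command in body:           # read the temporary macro again
            execute(command, variables)
        value += 1                     # next value ignores anything the body did

seen = []

def execute(command, variables):
    """Toy command set: record the parameter, or overwrite it."""
    if command == "show":
        seen.append(variables["i"])
    elif command == "clobber":
        variables["i"] = 100

variables = {}
run_do("i", 1, 3, ["show", "clobber"], variables, execute)
print(seen)   # [1, 2, 3]
```

Although the body overwrites i on every pass, the next pass redefines it, so the loop still runs with the values 1, 2, and 3.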
IF statements are similar, in that we read the entire list before executing it. Once more, a temporary buffer is pushed onto the stack, with instructions to delete it after use. The reason that a newline is required after an ELSE-less IF is that the grammar will have already read the next token to see if it was ELSE. If it wasn't, then it will seem to have been typed before the body of the IF. For example, IF( test ) { echo Hello } PROMPT : will be parsed as IF( test ) { PROMPT echo Hello } : if test is true, but correctly as IF( test ) { echo Hello } PROMPT : if it is false. Because an extra \n does no harm, we demand it.
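The reordering can be modelled in a few lines. The function below is a hypothetical sketch, not SM's implementation: it assumes the token after the closing brace is not ELSE, and simply returns the order in which the parser ends up seeing the tokens.

```python
def tokens_seen_after_if(test, body, following):
    """Order in which tokens are seen after an ELSE-less IF.

    `following` holds the tokens after the closing brace. The first of
    them has already been read (to check for ELSE) by the time the body
    is pushed onto the input stack, so it is seen before the body.
    """
    lookahead, rest = following[:1], following[1:]
    if test:
        return lookahead + body + rest
    return lookahead + rest

body = ["echo", "Hello"]
print(tokens_seen_after_if(True, body, ["PROMPT", ":"]))
# ['PROMPT', 'echo', 'Hello', ':']
print(tokens_seen_after_if(True, body, ["\n", "PROMPT", ":"]))
# ['\n', 'echo', 'Hello', 'PROMPT', ':']
```

With the newline in place it is the harmless \n that jumps ahead of the body, and PROMPT stays where it was typed.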
VERBOSE 4 will make SM print out in detail each token as it reads it, and each macro or variable as it expands it. To turn this off, use VERBOSE 0 or VERBOSE 1.
To really see the parser at work, try a negative value of verbosity. This will report every step that the parser takes, provided that it was compiled with DEBUG defined. A second negative value will turn the information off again.
PROMPT @
PROMPT is an external token, so PROMPT is passed to the grammar, which recognises the rule PROMPT WORD and sets the prompt to be `@'. When it has finished, control is passed back to the input routine.
MACRO p { PROMPT }
This defines the macro p to be PROMPT.
p @
The lexical analysis does not recognise p as a keyword, so it returns WORD and, as the grammar has no other interpretation of a WORD in this context, it passes p to the macro interpreter, which replaces it by PROMPT (i.e. pushes PROMPT onto the input stack). SM now thinks that you have just typed PROMPT @, and behaves as described in the first example.
MACRO pp 1 { PROMPT $1 }
The macro pp is declared to have one argument, which is referred to as $1. After pp is invoked it reads the next (whitespace delimited) word from the input stream, and replaces $1 by that word.
pp @
Here $1 is replaced by @, so the prompt is set to @.
pp
... PROMPT.
PRMPT
PRMPT isn't an external token, it is a WORD, so SM tries to execute it as a macro and complains if it isn't defined.
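The macro examples above can be mimicked with a small model of the input stack. Everything here is an assumption made for illustration: the MACROS table stores (number of arguments, body), only the first word of a command is treated as a possible macro, and expand is a hypothetical name.

```python
KEYWORDS = {"PROMPT"}                      # external tokens known to the grammar
MACROS = {
    "p":  (0, ["PROMPT"]),                 # MACRO p { PROMPT }
    "pp": (1, ["PROMPT", "$1"]),           # MACRO pp 1 { PROMPT $1 }
}

def expand(line):
    """Replace a leading macro name by its body until a keyword leads."""
    stack = line.split()
    while stack and stack[0] not in KEYWORDS:
        name = stack[0]
        if name not in MACROS:
            raise NameError(name + " is not a macro")
        nargs, body = MACROS[name]
        args = stack[1:1 + nargs]          # read the words used as arguments
        if len(args) < nargs:
            raise ValueError(name + " needs more arguments")
        body = [args[int(w[1:]) - 1] if w[0] == "$" and w[1:].isdigit() else w
                for w in body]
        stack = body + stack[1 + nargs:]   # push the body onto the input
    return stack

print(expand("p @"))    # ['PROMPT', '@']
print(expand("pp @"))   # ['PROMPT', '@']
```

In this model both p @ and pp @ end up as the tokens PROMPT @, while expand("PRMPT") raises an error because PRMPT is neither a keyword nor a defined macro.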
DEFINE Hi Hello
The variable Hi is defined to have the value Hello.
WRITE STANDARD $Hi Gentle User
On reading $Hi, SM pushes the value of the variable Hi onto the stack and then reads it, popping it off again when it has finished. The WRITE STANDARD command writes Hello Gentle User (i.e. up to the end of the line) to the terminal.
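Variable expansion can be sketched in the same spirit. The function expand_variables is hypothetical and much cruder than SM: it splits on white space and substitutes the value in place, rather than pushing it onto an input stack and reading it back character by character.

```python
def expand_variables(line, variables):
    """Replace each $name by the value of the variable, if it is defined."""
    words = []
    for word in line.split():
        if word.startswith("$") and word[1:] in variables:
            words.extend(variables[word[1:]].split())   # read the pushed value
        else:
            words.append(word)
    return " ".join(words)

variables = {"Hi": "Hello"}                # DEFINE Hi Hello
print(expand_variables("WRITE STANDARD $Hi Gentle User", variables))
# WRITE STANDARD Hello Gentle User
```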
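WRITE STANDARD $Hi Gentle User \n pp "SM>"
The `\n' is replaced by a newline, so WRITE STANDARD believes that it has reached the end of the line after Hello Gentle User, and pp "SM>" is then read as a separate command.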