Tokeniser
BBC BASIC programs are tokenised: BASIC keywords are stored as one- or two-byte values. This makes programs more compact and faster to execute.
A tokenised line can easily be detokenised, or expanded, as there is a one-to-one mapping between token values and the expanded string. For example, code similar to the following would expand a tokenised line:
quote%=FALSE
REPEAT
  IF ?addr%<128 OR quote% THEN VDU ?addr% ELSE PRINT token$(?addr%);
  IF ?addr%=34 quote%=NOT quote%
  addr%=addr%+1
UNTIL ?addr%=13
Tokenising, however, is more fiddly. Tokens can be abbreviated on entry and characters are only tokenised at certain parts of the line. For instance, in the following line:
ON NOON GOTO 1,2
the first 'ON' is the token ON, but the second 'ON' is part of the variable 'NOON'. The second 'ON' must be left untokenised.
The EVAL function tokenises the supplied string and evaluates it as an expression. Usefully, the tokenised string can be retrieved from where BASIC has stored it.
In 6502 BASIC:
A%=EVAL("0:"+A$)
token$=$((!4 AND &FFFF)-LENA$-1)
In Z80 BASIC:
A%=EVAL("0:"+A$)
token$=$(string_buffer)
In 32000 BASIC:
A%=EVAL("0:"+A$)
token$=$(!&1B2+2)
In PDP11 BASIC:
A%=EVAL("0:"+A$)
token$=$(^@%-510)
In ARM BASIC:
SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
B%=EVAL("0:"+A$)
token$=$(A%-14)
(There is an official but unwieldy tokenising routine named MATCH available from the call table provided by the CALL statement.)
In DOS BASIC:
B%=EVAL("0:"+A$)
token$=$&102
In Windows BASIC:
B%=EVAL("0:"+A$)
token$=$(!332+2)
By preceding the code you want to tokenise with "0:" you can safely pass it to EVAL without provoking a Syntax error. You can then extract the tokenised code from memory, so long as you do it immediately after calling EVAL.
In later versions of ARM BASIC the stack has an extra word on it and the string is stored lower in memory; the same applies to later versions of 6502 BASIC. The functions below are written to take this into account.
In Z80 BASIC the string buffer is in a different location in different versions. When machine code is entered with CALL or USR, IX is set pointing to the string buffer, and this can be used to find it.
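The IX technique can be tried on its own before wrapping it in a function. The following is a sketch only, for Z80 BASIC: it pokes the same five bytes that FNTokenise_Z80 in this article uses and calls them with USR. The names code% and ix% are illustrative and not part of any library.

```
REM Sketch only: the same five bytes that FNTokenise_Z80 pokes into memory
DIM code% 4
code%?0=&DD:code%?1=&E5        :REM PUSH IX
code%?2=&E1                    :REM POP HL
code%?3=&D9                    :REM EXX
code%?4=&C9                    :REM RET
ix%=USR(code%)                 :REM Value handed back by USR
PRINT ~ix%                     :REM FNTokenise_Z80 subtracts 254 from this value
```

Running this on different versions of Z80 BASIC should show whether the buffer has moved, which is why the function caches the value rather than hard-coding an address.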
The per-platform snippets above can be written as functions as follows:
DEFFNTokenise_65(A$):LOCAL A%,B%
A%=(!4AND&FFFF)-LENA$-1
B%=EVAL("0:"+A$):=$A%
:
DEFFNTokenise_Z80(A$):LOCAL A%,P%:Tokenise_Z80%=Tokenise_Z80%
IF Tokenise_Z80%=0:DIM A% 4:!A%=&D9E1E5DD:A%?4=&C9:Tokenise_Z80%=USRA%
A%=EVAL("0:"+A$):=$(Tokenise_Z80%-254)
:
DEFFNTokenise_32(A$):LOCAL A%
A%=EVAL("0:"+A$):=$(!&1B2+2)
:
DEFFNTokenise_PDP(A$):LOCAL A%
A%=EVAL("0:"+A$):=$(^@%-510)
:
DEFFNTokenise_ARM(A$):LOCAL A%,B%
SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
A%!-36=0:B%=EVAL("0:"+A$):=$(A%-14+4*(A%!-36<>0))
:
DEFFNTokenise_DOS(A$):LOCAL A%
A%=EVAL("0:"+A$):=$&102
:
DEFFNTokenise_Win(A$):LOCAL A%,B%
WHILELEFT$(A$,1)=" ":A$=MID$(A$,2):ENDWHILE
B%=EVAL("0:"+A$):=$(!332+2)
:
These functions are used in full in the 'Tokenise' BASIC library at mdfs.net.
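To see what one of these functions returns, the result can be dumped byte by byte in hex. This is a minimal sketch, assuming FNTokenise_65 is loaded on a 6502 system (substitute the function for your platform); t$ and i% are illustrative names.

```
t$=FNTokenise_65("ON NOON GOTO 1,2")
FOR i%=1 TO LEN t$
  PRINT ;~ASC(MID$(t$,i%,1));" ";   :REM Each byte in hex, space separated
NEXT
PRINT
```

Bytes of 128 and above are tokens, while the letters of the variable NOON should appear as plain ASCII codes.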
A text file can then be tokenised using the following code, which uses the 'FileIO' library routines FNrd() and PROCwr():
in%=OPENIN(text$)
out%=OPENOUT(basic$)
line%=10                                 :REM Start from an arbitrary line number
REPEAT
  line$=FNTokenise_65(FNrd(in%))         :REM Read line and tokenise it
  BPUT#out%,13                           :REM Output <cr>
  BPUT#out%,line%DIV256:BPUT#out%,line%  :REM Output line number
  BPUT#out%,LENline$+4                   :REM Output line length
  PROCwr(out%,line$)                     :REM Output line
  line%=line%+10                         :REM Increment line number
UNTIL EOF#in%
BPUT#out%,13:BPUT#out%,&FF               :REM Output program terminator
CLOSE#out%:out%=0
CLOSE#in%:in%=0
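The file layout written by the loop above can be checked by reading it back. This is a minimal sketch, assuming the layout produced above: each line is stored as <cr>, line number high byte, line number low byte, length, then the tokenised text, and <cr>,&FF ends the program. The variable names cr%, hi%, lo% and len% are illustrative only.

```
in%=OPENIN(basic$)
REPEAT
  cr%=BGET#in%                           :REM Should be 13, the <cr> that starts each line
  hi%=BGET#in%                           :REM High byte of line number, or &FF at the end
  IF hi%<>&FF THEN lo%=BGET#in%:len%=BGET#in%:PRINT ;hi%*256+lo%;" ";len%:PTR#in%=PTR#in%+len%-4
UNTIL hi%=&FF OR EOF#in%
CLOSE#in%:in%=0
```

Each printed pair is a line number and its stored length. The length byte counts the four header bytes as well as the text, so the sketch skips len%-4 bytes to reach the next line.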
References
Richard Russell, "Using the tokeniser", BBC BASIC for Windows Yahoo! group message 86.
Jgharston 21:20, 23 June 2007 (BST) originally in BB4W Programmer's Reference