Tokeniser

From BeebWiki
Revision as of 20:08, 29 November 2010 by Jgharston (talk) (Added markup to code fragments.)

BBC BASIC programs are tokenised: BASIC keywords are stored as one- or two-byte values rather than as text. This makes programs more compact and faster to execute.

A tokenised line can easily be detokenised, or expanded, as there is a one-to-one mapping between token values and the expanded string. For example, code similar to the following would expand a tokenised line:

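     quote%=FALSE
     REPEAT
       REM Bytes below 128, and anything inside a string, are plain characters
       IF ?addr%<128 OR quote% THEN VDU ?addr% ELSE PRINT token$(?addr%);
       IF ?addr%=34 quote%=NOT quote% :REM Toggle on each double quote
       addr%=addr%+1
     UNTIL ?addr%=13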
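
The fragment above relies on an array token$() indexed by token value, which it does not define. A minimal sketch of filling in part of such a table follows. The values shown are believed to match 6502 BBC BASIC, only a handful of keywords are included, and the array name is simply the one the fragment already uses.

```basic
REM Sketch: fill in a few entries of the token table used by the loop above
REM Token values are assumed to be those of 6502 BBC BASIC
DIM token$(255)
token$(&8B)="ELSE"
token$(&8C)="THEN"
token$(&E5)="GOTO"
token$(&E7)="IF"
token$(&EE)="ON"
token$(&F1)="PRINT"
token$(&F4)="REM"
REM Entries not filled in stay as empty strings, so unknown tokens print nothing
```

A complete table needs an entry for every keyword, and other versions of BASIC may use different or additional token values.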

Tokenising, however, is more fiddly. Keywords can be abbreviated on entry, and characters are only tokenised in certain parts of the line. For instance, in the following line:

     ON NOON GOTO 1,2

the first 'ON' is the token ON, but the second 'ON' is part of the variable 'NOON'. The second 'ON' must be left untokenised.

The EVAL function tokenises the supplied string and evaluates it as an expression. Usefully, the tokenised string can be retrieved from where BASIC has stored it.

In 6502 BASIC:

     A%=EVAL("0:"+A$)
     token$=$((!4 AND &FFFF)-LENA$-1)

In Z80 BASIC:

     A%=EVAL("0:"+A$)
     token$=$(string_buffer)

In 32000 BASIC:

     A%=EVAL("0:"+A$)
     token$=$(!&1B2+2)

In PDP-11 BASIC:

     A%=EVAL("0:"+A$)
     token$=$(^@%-254)

In ARM BASIC:

     SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
     B%=EVAL("0:"+A$)
     token$=$(A%-14)

(There is an official but unwieldy tokenising routine named MATCH available from the call table provided by the CALL statement.)

In DOS BASIC:

     B%=EVAL("0:"+A$)
     token$=$&102

In Windows BASIC:

     B%=EVAL("0:"+A$)
     token$=$(!332+2)

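Preceding the code you want to tokenise with "0:" lets you pass it to EVAL without provoking a Syntax error: EVAL tokenises the whole string, but evaluates only the leading 0 and stops at the colon. The tokenised code can then be extracted from memory, provided this is done immediately after calling EVAL, before anything else overwrites it.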
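
As an illustration, the 6502 fragment can be used to inspect what the tokeniser produces for a line of text. The following is only a sketch: it assumes an early 6502 BASIC where the address calculation shown earlier applies unadjusted, and the loop variable I% is introduced purely for this example.

```basic
REM Sketch for 6502 BASIC: show the bytes produced for a line of text
A$="ON NOON GOTO 1,2"
A%=EVAL("0:"+A$)
token$=$((!4 AND &FFFF)-LENA$-1)  :REM Must follow the EVAL immediately
FOR I%=1 TO LEN token$
  PRINT ;~ASC(MID$(token$,I%,1));" ";  :REM Each byte in hex
NEXT
PRINT
```

In the output the leading ON should appear as a single token byte, while NOON should remain as four ordinary letter codes.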

In later versions of ARM BASIC the stack has an extra word on it and the string is stored lower in memory; the same is true of later versions of 6502 BASIC. The functions below are written to take this into account.

In Z80 BASIC the string buffer is at a different location in different versions. When machine code is entered with CALL or USR, IX points to the string buffer, and this can be used to find it.

These can be written as functions as follows:

     DEFFNTokenise_65(A$):LOCAL A%,B%
     A%=!4AND&FFFF:A%=A%-LENA$-1+4*(A%?-1>0)
     B%=EVAL("0:"+A$):=$A%
     :
     DEFFNTokenise_Z80(A$):LOCAL A%,P%:Tokenise_Z80%=Tokenise_Z80%
     REM Code is PUSH IX:POP HL:EXX:RET, passing IX back through USR
     IF Tokenise_Z80%=0:DIM A% 4:!A%=&D9E1E5DD:A%?4=&C9:Tokenise_Z80%=USRA%
     A%=EVAL("0:"+A$):=$(Tokenise_Z80%-254)
     :
     DEFFNTokenise_32(A$):LOCAL A%
     A%=EVAL("0:"+A$):=$(!&1B2+2)
     :
     DEFFNTokenise_PDP(A$):LOCAL A%
     A%=EVAL("0:"+A$):=$(^@%-254)
     :
     DEFFNTokenise_ARM(A$):LOCAL A%,B%
     SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
     A%!-36=0:B%=EVAL("0:"+A$):=$(A%-14+4*(A%!-36<>0))
     :
     DEFFNTokenise_DOS(A$):LOCAL A%
     A%=EVAL("0:"+A$):=$&102
     :
     DEFFNTokenise_Win(A$):LOCAL A%,B%
     WHILELEFT$(A$,1)=" ":A$=MID$(A$,2):ENDWHILE
     B%=EVAL("0:"+A$):=$(!332+2)
     :

These functions are used in full in the 'Tokenise' BASIC library at mdfs.net.

A text file can then be tokenised using the following code, which uses the 'FileIO' library functions FNrd() and FNwr():

   in%=OPENIN(text$)
   out%=OPENOUT(basic$)
   line%=10                                 :REM Start from an arbitrary line number
   REPEAT
     line$=FNTokenise_65(FNrd(in%))         :REM Read line and tokenise it
     BPUT#out%,13                           :REM Output <cr>
     BPUT#out%,line%DIV256:BPUT#out%,line%  :REM Output line number
     BPUT#out%,LENline$+4                   :REM Output line length
     PROCwr(out%,line$)                     :REM Output line
     line%=line%+10                         :REM Increment line number
   UNTIL EOF#in%
   BPUT#out%,13:BPUT#out%,&FF               :REM Output program terminator
   CLOSE#out%:out%=0
   CLOSE#in%:in%=0
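
The file layout written above (a carriage return, the line number high byte first, a length byte, then the tokenised text, with a final carriage return and &FF) can be checked by walking the file afterwards. The following is a sketch only: the variable names cr%, hi%, lo%, len% and done% are introduced for this example, and it assumes a filing system that supports PTR#.

```basic
REM Sketch: walk the file written above, listing each line number and length
in%=OPENIN(basic$)
done%=FALSE
REPEAT
  cr%=BGET#in%                    :REM Should always be 13
  hi%=BGET#in%                    :REM &FF here marks the end of the program
  IF hi%=&FF THEN done%=TRUE ELSE lo%=BGET#in%:len%=BGET#in%:PRINT hi%*256+lo%,len%:PTR#in%=PTR#in%+len%-4
UNTIL done% OR EOF#in%
CLOSE#in%:in%=0
```

The length byte counts the four header bytes as well as the text, which is why the pointer is advanced by len%-4 to reach the next line.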

References

Richard Russell, "Using the tokeniser", BBC BASIC for Windows Yahoo! group message 86.

Jgharston 21:20, 23 June 2007 (BST) originally in BB4W Programmer's Reference