Qbasicnews.com

Full Version: Parsing Source to Tokens for Interpreter
Hi again everyone -- I need to borrow someone's brain for a bit.

This is actually a VB question, but I rather think that more eyeballs see this forum than the General Programming one buried all the way down at the bottom.

Okay, it's a parser for C-style code. I already have it breaking the source into these kinds of tokens:

- Numeric Constants
- Identifiers
- Operators
- Parentheses
- Braces
- Semicolons

The tokens are stored in a Collection Object. The second stage introduces the CodeBlock token, which is simply a token containing a pointer to a further block of tokens. Any braces are consumed and the tokens inside them are moved into sub-collections, forming a tree.
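
Roughly, each token looks something like this (a simplified sketch -- the member names are just for illustration, not my actual code):

Code:
' members of a hypothetical clsToken class module
Public Kind As Long             ' numeric constant, identifier, operator, code block, etc.
Public Text As String           ' the raw source text of the token
Public Children As Collection   ' set only for CodeBlock tokens: the nested token list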

What I'm having trouble with is what I perceive to be the next phase -- figuring out where the expressions are and turning them into a similar nested-tree structure, where each node basically contains two pointers to subnodes and the operator that connects them.
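
In other words, each node would hold something like this (again just a rough sketch with made-up names):

Code:
' members of a hypothetical clsExprNode class module
Public Op As String              ' the operator joining the two subtrees, e.g. "+" or "*"
Public LeftNode As clsExprNode   ' left operand subtree (Nothing for a leaf)
Public RightNode As clsExprNode  ' right operand subtree (Nothing for a leaf)
Public Leaf As String            ' identifier or numeric constant when this node is a leaf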

I'd like it to recognize order of operations, but I'm finding it all rather overwhelming. Without much computer science background or formal training, is there a good site that explains the logic of turning these linear tokens into a meaningful tree?

A quick Google just turned up a bunch of philosophical guff about RPN. Anyone want to recommend a practical resource?

Thanks!

Mike

ps. The language I'm dealing with doesn't need to account for string constants or compiler directives (e.g., macros).
You might try the Finite State Automata model. See example:
http://www.network54.com/Forum/message?f...1096906004

Mac
Quote:You might try the Finite State Automata model. See example:
http://www.network54.com/Forum/message?f...1096906004

Mac

Hmm... Thanks for the example. I find it rather scary, and it still doesn't properly observe order of operations.

I found an actual VB one that I believe uses a similar state scheme and does observe order of operations, but it's equally incomprehensible. I'd really like this to integrate cleanly into my current framework, so I just need to buckle down and figure out what on earth it does.

Anyhow, thanks!
The dragon book. Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman. ISBN: 0-201-10088-6

Quote:and it still doesn't properly observe order of operations

Usually you construct the grammar to deal with order of operations, for example:
Code:
expr ::= term | expr + term | expr - term

term ::= factor | term * factor | term / factor

factor ::= id | num

When you parse an expression like:
Code:
5 + 3 * 4

It is computed as:
Code:
5 + (3 * 4)
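
For instance, one derivation of 5 + 3 * 4 under that grammar looks like this -- the * is handled inside a term, so it ends up deeper in the tree than the +:

Code:
expr
=> expr + term
=> term + term
=> factor + term
=> 5 + term
=> 5 + term * factor
=> 5 + factor * factor
=> 5 + 3 * factor
=> 5 + 3 * 4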
Aho, Sethi, Ullman... What huge nightmares they gave me at college Big Grin but yeah, that's the source to go to.

How will you be building the grammar? I've found that the easiest way to do it (when you are new to this thingo) is to build an LL(1) automaton.
Quote:Usually you construct the grammar to deal with order of operations...

Well, I'm not using any fancy lex/yacc/[insert grammar tool here] stuff. It's just home-cooked tokenizing. I considered trying to use one of those established tools by wrapping it in an ActiveX control and dropping it on the form, but in the end decided that it was overkill given the task at hand.

LL(1) = ??

Since I'm obviously missing key theory here, I think I'll do my best to understand the existing code and probably end up adapting it to my needs with minimal modification. I'll post a link when I get home from work.
Quote:Well, I'm not using any fancy lex/yacc/[insert grammar tool here] stuff.

Regardless of what tools you use to construct a lexer/parser, you need a grammar; it should be the first thing you write. The grammar I gave you lets you set operator precedence for addition, subtraction, multiplication and division. After tokenisation (the lexer), you can parse the grammar I gave you (roughly, see below) with a simple recursive descent parser:

Code:
' current holds the current token's kind; next_token advances the lexer
' (both are assumed helpers provided by the tokenising stage)
sub expr
  term

  while current = SYM_PLUS or current = SYM_MINUS
    next_token   ' consume the + or -
    term
  wend
end sub

sub term
  factor

  while current = SYM_TIMES or current = SYM_DIVIDE
    next_token   ' consume the * or /
    factor
  wend
end sub

sub factor
  if current <> SYM_ID and current <> SYM_NUM then
    print_error "Expected identifier or constant"
  end if
  next_token     ' consume the identifier or constant
end sub

LL(1) stands for Left-to-right parsing, producing a Leftmost derivation, using 1 token of lookahead. This means that you parse the input from left to right, and that the leftmost non-terminal is replaced at each step. Lookahead is the number of tokens you need to be able to see to correctly parse the grammar; with 1 token of lookahead you only need to know the current token.

The grammar that I gave you isn't actually LL(1) because it contains left recursion, but you can rewrite left recursive grammars so they can be parsed by an LL(1) parser.
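
For example, the left-recursive rules above can be rewritten into an equivalent right-recursive form by introducing helper non-terminals (ε means "nothing"), something like:

Code:
expr  ::= term expr'
expr' ::= + term expr' | - term expr' | ε

term  ::= factor term'
term' ::= * factor term' | / factor term' | ε

factor ::= id | num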

You are going to need at least a little bit of grammar theory to be able to do this, and IMHO you are better off with a book than some web tutorial. You need to construct an LL(1) grammar for the language you are trying to parse; after that, implementing the lexer and parser is easy.
A recursive descent top-down parser is the way to go if you're coding by hand. Using state tables will be a bitch when you want to debug. Try getting your language down in EBNF form, then translating it to code. That part is a piece of cake.

simple expression parser:

Code:
expr        := addexpr
addexpr     := mulexpr ( ( '+' | '-' ) mulexpr )*
mulexpr     := parexpr ( ( '*' | '/' ) parexpr )*
parexpr     := '(' expr ')'
             | atom
atom        := numlit
             | id

Try to convert that to recursive code; piece of cake. * means 0 or more, | means or.
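
For example, parexpr might come out roughly like this (SYM_LPAREN, SYM_RPAREN, current and next_token are assumed names in the same style as the earlier snippet):

Code:
sub parexpr
  if current = SYM_LPAREN then
    next_token                 ' consume the '('
    expr                       ' parse the bracketed sub-expression
    if current <> SYM_RPAREN then
      print_error "Expected ')'"
    end if
    next_token                 ' consume the ')'
  else
    atom
  end if
end sub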
Quote:You are going to need at least a little bit of grammar theory to be able to do this, and IMHO you are better off with a book than some web tutorial. You need to construct an LL(1) grammar for the language you are trying to parse, then implementing a lexer and parser are easy.
Honestly, that is just way beyond the scope of the project. I've got state-machine code that does what I need it to do, so I've just bolted that onto the rest of the parser. This program just isn't complicated enough to warrant such a sophisticated approach. It doesn't need to be extensible in any way; they're just simple 5-10 line scripts that are executed at run-time. It's not a compiler, just an extremely simpleminded interpreter.

The theory interests me, but there are other aspects of the program that require my attention. If it was a personal project, I'd feel differently.

But thanks for all your help, everyone.


btw, neat to look around here again, haven't been by in ages. Looks like the regular crop of homegrown OSes in the Projects forum and Do-My-Homework threads in the Help forum, lol.