LL, LR, RD, paqrat or other?

 

Actually, we don't care. It is even possible to use different kind of line parsers in the "global parsing process".

 

 

High-level process

 

The first thing that is done is to transform the text into a tree of lines. Where indented blocks are the children of their corresponding header line (and blank lines are ignored). In the example above, the green lines would be the children of this is a header line and the orange lines the children of bli bli bli.

 

Since all blue lines are in the same language and each line is a statement, they all can be parsed in parallel. Even if some statements/lines contain errors, it does not affect the parsing of the other statements/lines.

 

The same applies to bodies. Once the header line is parsed, it determines which language is used in the body and the same process applies recursively. Once this is a header line is parsed successfully, the lines of the green block can be parsed in parrallel, and so on.

 

This also means that, for indentation sensitive blocks, the grammar apply solely for a single line! The syntax definition is always divided in two parts:

  • How the line can be parsed
  • What can the body of the parsed line be

Hence, it is more a mini parser applied to every line than one big monolithic parser. In fact it can be divided in two parts:

  • The process of:
    • applying the line parsers on each line
    • selecting the correct other line parsers depending on the header
    • merging the correct result or summarizing the errors
  • The mini line parsers themeselves

 

For sub-languages which are not context sensitive, the sub-tree is simply transformed back to a single stream and parsed classically.

 

 

What's inside?

 

The (very important) question is about the outcome of the parsing.

Either we talk about a "tree of strings", like:

 

"definition"

+-- "symbol"  "x"

+-- "expression"

      +-- "operator"  "+"

             +-- "value"  "2"

             +-- "value"  "3"

 

Or about a tree whose nodes are typed, like:

 

Definition

    symbol = Symbol "x"

    expression = Operator

            op = "+"

            lhs = Value  2

            rhs = Value  3

 

The big difference are thus the kind of nodes. Despite it looks like a peculiar detail, it is absolutely not! Read on to see why.

 

The first kind of tree, the "tree of strings", can be produced dynamically. We could imagine a program running, knowing nothing about what has to be parsed. This program could receive a grammar file and a source file as input to dynamically produce on the fly the desired parse tree à la "tree of strings". On the opposite, this cannot be done with a typed tree. Indeed, to produce such a tree, the definition of all node types must be included in the program beforehand. How could the program output the Operator type if it is not even defined? Hence, we cannot simply "feed it" with a grammar at runtime. The grammar must be hardcoded.

 

 

From one to another

 

To make the distinction, we will call the first the string tree and the second the typed tree.

 

Despite the string tree is very convinient because it is sufficient to provide a grammar file to get it, what we want to obtain in the end is typed data. Strictly speaking, it is not required but it is so tremenduously advised that any sane person would and should do it. Since we will manipulate the content of this tree a lot, typed data is needed for the ease of programming, debugging, safety, static analysis, and so on.

 

In order to get the typed tree, there are two ways:

  • We can either directly parse the text into it.
  • Either parse a string tree which can be used to produce the typed tree from it.

The first solution is more efficient while the second is more comfortable.

Yet, it doesn't solve our problem:

 

 

So how to get flexible / dynamic syntax?