On parsing tokenized wikitext - Wikitext-l

23 Aug 2010

In my previous post I covered the lexer.  Here I will describe the
parser, the parser context and the listener interface.  After the
lexer's extensive job of providing a reasonably well-formed token
stream, the parser's job becomes completely straightforward.

== The parser

For inline elements, the parser just mindlessly reports them
to the context object:

inline_element:
     word|space|special|br|html_entity|link_element|format|nowiki|table_of_contents|html_inline_tag
     ;

space: token = SPACE_TAB  {IE(CX->onSpace(CX, $token->getText($token));)}
     ;

etc.

The lexer guarantees that a closing token will not appear before
a corresponding opening token, and the parser context takes care of
nesting formats and removing empty format tags.

For block elements, the only special thing the parser needs to pay
attention to is the fact that end tokens may be missing.  Therefore,
end-of-file is always accepted instead of the closing token, for
instance:

html_div:
     token = HTML_DIV_OPEN
     {
         CX->beginHtmlDiv(CX, $token->custom);
     }
     block_element_contents
     (HTML_DIV_CLOSE|EOF)
     {
         CX->endHtmlDiv(CX);
     }
     ;

The rule 'block_element_contents' covers all parser productions.  The
lexer will restrict which tokens may appear.  For instance
'HTML_DIV_CLOSE' will never appear before a corresponding
'HTML_DIV_OPEN'.  Also, list items and table cells will not appear
unless the current block context is correct.  I have also introduced
a max nesting level limit in the lexer, so stack space is also not an
issue.
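
To illustrate how accepting end-of-file in place of a closing token
plays out, here is a minimal hand-written sketch of the html_div rule.
It is not the ANTLR-generated parser: the token kinds, the Parser
struct and parse_all are hypothetical names made up for this example,
and the counters merely stand in for the calls to CX->beginHtmlDiv and
CX->endHtmlDiv.  One assumption of the sketch is that EOF is not
consumed, so that every enclosing block can end on it as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds; the real lexer has many more. */
typedef enum { TOK_TEXT, TOK_DIV_OPEN, TOK_DIV_CLOSE, TOK_EOF } TokKind;

typedef struct {
    const TokKind *toks;  /* token stream, terminated by TOK_EOF */
    size_t pos;           /* index of the current token */
    int begins;           /* stands in for calls to CX->beginHtmlDiv */
    int ends;             /* stands in for calls to CX->endHtmlDiv */
} Parser;

static void parse_contents(Parser *p);

/* html_div: DIV_OPEN block_element_contents (DIV_CLOSE | EOF) */
static void parse_div(Parser *p)
{
    assert(p->toks[p->pos] == TOK_DIV_OPEN);
    p->pos++;
    p->begins++;
    parse_contents(p);
    if (p->toks[p->pos] == TOK_DIV_CLOSE)
        p->pos++;         /* consume the explicit closing token */
    /* Otherwise we are at EOF: accept it without consuming it, so
       that enclosing divs can be terminated by the same EOF. */
    p->ends++;
}

static void parse_contents(Parser *p)
{
    for (;;) {
        switch (p->toks[p->pos]) {
        case TOK_TEXT:     p->pos++;     break;
        case TOK_DIV_OPEN: parse_div(p); break;
        default:           return;  /* DIV_CLOSE or EOF ends this level */
        }
    }
}

/* Parses a whole stream; returns the number of end events and
   stores the number of begin events in *begins. */
static int parse_all(const TokKind *toks, int *begins)
{
    Parser p = { toks, 0, 0, 0 };
    parse_contents(&p);
    *begins = p.begins;
    return p.ends;
}
```

Feeding this the stream DIV_OPEN, TEXT, DIV_OPEN, TEXT, EOF produces
two begin events and two end events, even though neither div is
explicitly closed.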

== The parser context

The parser context relays the parser events to a listener, but it will
insert and remove events to produce a well formed output.  For instance:

text '' italic <b><strong /> bold-italic
bold </b> text

will result in an event stream to the listener that will look like this:

text <i> italic <b> bold-italic </b></i>
<b> bold </b> text

Two mechanisms are used to implement this:

* The call to the "begin" method is delayed until some actual inline
   content is produced.  The call is never made if an "end" event is
   received before such content.

* The order of the formats is maintained so that inner formats can be
   closed and reopened when a non-matching end token is received.
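
The two mechanisms can be sketched in a few lines of C.  This is only
an illustration under my own assumptions, not the implementation
behind the macros in the actual context: the names Context,
begin_format, end_format and on_text are hypothetical, and a string
buffer stands in for the listener.  Each open format remembers whether
its begin event has been emitted yet; text flushes the pending begins,
and a non-matching end closes the inner formats and marks them as
pending again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_FORMATS 16

typedef enum { FMT_I, FMT_B, FMT_STRONG } Format;
static const char *const fmt_name[] = { "i", "b", "strong" };

typedef struct {
    Format fmt;
    bool   emitted;   /* has the begin event reached the listener? */
} OpenFormat;

typedef struct {
    OpenFormat open[MAX_FORMATS];  /* open formats, in opening order */
    int        count;
    char       out[256];           /* stands in for the listener */
} Context;

static void emit(Context *cx, const char *s)
{
    assert(strlen(cx->out) + strlen(s) < sizeof cx->out);
    strcat(cx->out, s);
}

static void emit_tag(Context *cx, const char *prefix, Format f)
{
    char tag[16];
    snprintf(tag, sizeof tag, "%s%s>", prefix, fmt_name[f]);
    emit(cx, tag);
}

static void begin_format(Context *cx, Format f)
{
    assert(cx->count < MAX_FORMATS);
    cx->open[cx->count].fmt = f;
    cx->open[cx->count].emitted = false;  /* mechanism 1: delay the begin */
    cx->count++;
}

static void on_text(Context *cx, const char *text)
{
    /* Real content arrived: emit every pending begin, outermost first. */
    for (int i = 0; i < cx->count; i++) {
        if (!cx->open[i].emitted) {
            emit_tag(cx, "<", cx->open[i].fmt);
            cx->open[i].emitted = true;
        }
    }
    emit(cx, text);
}

static void end_format(Context *cx, Format f)
{
    int i = cx->count - 1;
    while (i >= 0 && cx->open[i].fmt != f)
        i--;
    if (i < 0)
        return;                        /* no matching begin: ignore */
    if (cx->open[i].emitted) {
        /* Mechanism 2: close inner formats, innermost first.  They
           become pending again and reopen on the next text. */
        for (int j = cx->count - 1; j > i; j--) {
            if (cx->open[j].emitted) {
                emit_tag(cx, "</", cx->open[j].fmt);
                cx->open[j].emitted = false;
            }
        }
        emit_tag(cx, "</", f);
    }
    /* An entry that was never emitted is dropped silently (empty format). */
    memmove(&cx->open[i], &cx->open[i + 1],
            (size_t)(cx->count - i - 1) * sizeof cx->open[0]);
    cx->count--;
}
```

Starting from a zero-initialized Context and replaying the example
above (with the end of line acting as the end of the italic), the
empty strong never reaches the output, and the bold is closed before
the italic ends and reopened before the next text.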

So, most of the parser context's methods look like this:

static void
beginHtmlStrong(MWPARSERCONTEXT *context, pANTLR3_VECTOR attr)
{
     MW_DELAYED_CALL(        context, beginHtmlStrong, endHtmlStrong, attr, NULL);
     MW_BEGIN_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, attr, NULL, false);
     MWLISTENER *l = &context->listener;
     l->beginHtmlStrong(l, attr);
}

static void
endHtmlStrong(MWPARSERCONTEXT *context)
{
     MW_SKIP_IF_EMPTY(     context, beginHtmlStrong, endHtmlStrong, NULL);
     MW_END_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, NULL);
     MWLISTENER *l = &context->listener;
     l->endHtmlStrong(l);
}

Block elements are already guaranteed by the lexer to be well nested,
so the context typically does not need to do anything special about
those.  Only the wikitext list elements need to be resolved by the
context.

== The listener

The listening application needs to implement the MWLISTENER interface.
I haven't added support for all features yet, but at the moment, there
are 91 methods in this interface.  They are trivial to implement,
though.  The only thing to think about is that it is the listener's
responsibility to escape the contents of nowiki and special
characters, and also to filter the attribute lists.
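
As an example of the escaping a listener has to do, here is a minimal
sketch of a helper that a listener producing HTML might call from its
text-handling methods.  The function escape_html is hypothetical and
not part of the MWLISTENER interface, and the set of characters it
escapes is an assumption based on ordinary HTML output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes an HTML-escaped, NUL-terminated copy of
   src into dst.  Returns the escaped length (excluding the NUL), or
   (size_t)-1 if dst is too small to hold the result. */
static size_t escape_html(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        char one[2] = { *src, '\0' };  /* the character itself */
        const char *rep;
        size_t len;

        switch (*src) {
        case '&': rep = "&amp;";  break;
        case '<': rep = "&lt;";   break;
        case '>': rep = "&gt;";   break;
        case '"': rep = "&quot;"; break;
        default:  rep = one;      break;
        }
        len = strlen(rep);
        if (n + len + 1 > dst_size)    /* keep room for the final NUL */
            return (size_t)-1;
        memcpy(dst + n, rep, len);
        n += len;
    }
    if (n + 1 > dst_size)              /* only reachable for empty input */
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

Filtering of attribute lists is a separate concern and is not covered
by this sketch.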

/Andreas