In my previous post I covered the lexer. Here I will describe the
parser, the parser context and the listener interface. After the
lexer's extensive job at providing a reasonably well-formed token
stream, the parser's job becomes completely straightforward.
== The parser
For inlined elements, the parser will just mindlessly report these
to the context object:
    inline_element:
        word|space|special|br|html_entity|link_element|format|nowiki|table_of_contents|html_inline_tag
        ;

    space: token = SPACE_TAB {IE(CX->onSpace(CX, $token->getText($token));)}
        ;
etc.
The lexer guarantees that a closing token will not appear before
a corresponding opening token, and the parser context takes care of
nesting formats and removing empty format tags.
For block elements, the only special thing the parser needs to pay
attention to is the fact that end tokens may be missing. Therefore,
end-of-file is always accepted instead of the closing token, for
instance:
    html_div:
        token = HTML_DIV_OPEN
        {
            CX->beginHtmlDiv(CX, $token->custom);
        }
        block_element_contents
        (HTML_DIV_CLOSE|EOF)
        {
            CX->endHtmlDiv(CX);
        }
        ;
The rule 'block_element_contents' covers all parser productions. The
lexer restricts which tokens may appear. For instance,
'HTML_DIV_CLOSE' will never appear before a corresponding
'HTML_DIV_OPEN'. Also, list items and table cells will not appear
unless the current block context is correct. I have also introduced
a maximum nesting level in the lexer, so stack space is not an issue
either.
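To illustrate the "EOF instead of the closing token" rule outside of ANTLR, here is a minimal hand-written sketch in C. It is not the actual generated parser: the token kinds, the Parser struct and the function names are all made up for the illustration, and events are simply appended to a string. It assumes, as described above, that the lexer never delivers a close token without a matching open token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; the real lexer produces many more. */
typedef enum { TOK_TEXT, TOK_DIV_OPEN, TOK_DIV_CLOSE, TOK_EOF } TokenKind;

typedef struct {
    const TokenKind *tokens; /* token stream, terminated by TOK_EOF */
    size_t pos;              /* index of the current token */
    char out[256];           /* events are appended here as text */
} Parser;

static void emit(Parser *p, const char *event)
{
    assert(strlen(p->out) + strlen(event) < sizeof p->out);
    strcat(p->out, event);
}

static void parse_div(Parser *p);

/* Consume block contents until a close token or end of input is seen. */
static void parse_contents(Parser *p)
{
    for (;;) {
        switch (p->tokens[p->pos]) {
        case TOK_TEXT:     emit(p, "text "); p->pos++; break;
        case TOK_DIV_OPEN: parse_div(p); break;
        default:           return; /* TOK_DIV_CLOSE or TOK_EOF */
        }
    }
}

/* Mirrors the shape: DIV_OPEN contents (DIV_CLOSE | EOF) */
static void parse_div(Parser *p)
{
    assert(p->tokens[p->pos] == TOK_DIV_OPEN);
    p->pos++;
    emit(p, "<div> ");
    parse_contents(p);
    /* Consume the close token if it is there. EOF is left in place so
       that every enclosing div sees it too and closes itself. */
    if (p->tokens[p->pos] == TOK_DIV_CLOSE)
        p->pos++;
    emit(p, "</div> ");
}

/* Parse a whole token stream and return the resulting event text. */
const char *parse(Parser *p, const TokenKind *tokens)
{
    p->tokens = tokens;
    p->pos = 0;
    p->out[0] = '\0';
    parse_contents(p);
    return p->out;
}
```

With the stream DIV_OPEN, TEXT, DIV_OPEN, TEXT, EOF, both divs are closed when the end of input is reached, so the listener still sees balanced begin and end events.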
== The parser context
The parser context relays the parser events to a listener, but it will
insert and remove events to produce a well formed output. For instance:
    text '' italic <b><strong /> bold-italic
    bold </b> text

will result in an event stream to the listener that will look like this:

    text <i> italic <b> bold-italic </b></i>
    <b> bold </b> text
Two mechanisms are used to implement this:
* The call to the "begin" method is delayed until some actual inlined
  content is produced. The call is never made if an "end" event is
  received before such content.
* The order of the formats is maintained so that inner formats can be
  closed and reopened when a non-matching end token is received.
So, most of the parser context's methods look like this:
    static void
    beginHtmlStrong(MWPARSERCONTEXT *context, pANTLR3_VECTOR attr)
    {
        /* First mechanism: hold the call back until content is produced. */
        MW_DELAYED_CALL(context, beginHtmlStrong, endHtmlStrong,
                        attr, NULL);
        /* Second mechanism: record the format in the ordered list. */
        MW_BEGIN_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong,
                                attr, NULL, false);
        MWLISTENER *l = &context->listener;
        l->beginHtmlStrong(l, attr);
    }

    static void
    endHtmlStrong(MWPARSERCONTEXT *context)
    {
        /* Drop the event pair if no content was produced in between. */
        MW_SKIP_IF_EMPTY(context, beginHtmlStrong, endHtmlStrong, NULL);
        /* Close (and later reopen) any inner formats that are still open. */
        MW_END_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, NULL);
        MWLISTENER *l = &context->listener;
        l->endHtmlStrong(l);
    }
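Since the macros hide the details, here is a small stand-alone sketch in C of how the two mechanisms could fit together. It is an illustration only, not the code behind MW_DELAYED_CALL and friends: the Context and Format types, the function names and the string buffer that stands in for the listener are all assumptions. Each open format carries an "emitted" flag. A begin event is only sent once text arrives, and an end event closes every emitted inner format and clears its flag, so that it is reopened when the next text arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { MAX_FORMATS = 16 };

typedef struct {
    const char *tag; /* e.g. "i", "b", "strong" */
    bool emitted;    /* has the begin event reached the listener? */
} Format;

typedef struct {
    Format stack[MAX_FORMATS]; /* open formats, outermost first */
    size_t depth;
    char out[512];             /* stands in for the listener */
} Context;

static void emit(Context *cx, const char *pre, const char *s, const char *post)
{
    assert(strlen(cx->out) + strlen(pre) + strlen(s) + strlen(post)
           < sizeof cx->out);
    strcat(cx->out, pre);
    strcat(cx->out, s);
    strcat(cx->out, post);
}

void context_init(Context *cx)
{
    cx->depth = 0;
    cx->out[0] = '\0';
}

/* Delayed call: only remember the format, do not tell the listener yet. */
void begin_format(Context *cx, const char *tag)
{
    assert(cx->depth < MAX_FORMATS);
    cx->stack[cx->depth].tag = tag;
    cx->stack[cx->depth].emitted = false;
    cx->depth++;
}

/* Actual content: send all pending begin events, outermost first. */
void on_text(Context *cx, const char *text)
{
    for (size_t i = 0; i < cx->depth; i++) {
        if (!cx->stack[i].emitted) {
            emit(cx, "<", cx->stack[i].tag, "> ");
            cx->stack[i].emitted = true;
        }
    }
    emit(cx, "", text, " ");
}

void end_format(Context *cx, const char *tag)
{
    size_t i = cx->depth;
    while (i > 0 && strcmp(cx->stack[i - 1].tag, tag) != 0)
        i--;
    if (i == 0)
        return; /* no matching begin: ignore the event */
    i--;        /* index of the matching format */

    /* Close the matching format and everything nested inside it. Inner
       formats stay on the stack with the flag cleared, so they are
       reopened by the next on_text call. */
    for (size_t j = cx->depth; j > i; j--) {
        Format *f = &cx->stack[j - 1];
        if (f->emitted)
            emit(cx, "</", f->tag, "> ");
        f->emitted = false;
    }
    memmove(&cx->stack[i], &cx->stack[i + 1],
            (cx->depth - i - 1) * sizeof(Format));
    cx->depth--;
}
```

Feeding this sketch the events from the example above (text, begin i, text, begin b, begin strong, end strong, text, end i, text, end b, text) yields the same shape as the event stream shown earlier: the empty strong never reaches the output, and b is closed before i and reopened afterwards.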
Block elements are already guaranteed by the lexer to be well nested,
so the context typically does not need to do anything special about
those. Only the wikitext list elements need to be resolved by the
context.
== The listener
The listening application needs to implement the MWLISTENER interface.
I haven't added support for all features yet, but at the moment, there
are 91 methods in this interface. They are trivial to implement,
though. The only thing to think about is that it is the listener's
responsibility to escape the contents of nowiki and special
characters, and also to filter the attribute lists.
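
As a rough idea of what the escaping responsibility means for an implementer, here is a sketch of a drastically reduced listener in C. The DemoListener struct, its two callbacks and their signatures are invented for this example and do not match the real MWLISTENER interface. Attribute filtering is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the listener interface: two callbacks
   instead of 91, writing HTML into a fixed buffer. */
typedef struct DemoListener {
    void (*onNowiki)(struct DemoListener *l, const char *text);
    void (*onSpecial)(struct DemoListener *l, const char *text);
    char out[256];
} DemoListener;

static void append(DemoListener *l, const char *s)
{
    assert(strlen(l->out) + strlen(s) < sizeof l->out);
    strcat(l->out, s);
}

/* Append text with the HTML-significant characters escaped. */
static void append_escaped(DemoListener *l, const char *text)
{
    for (const char *c = text; *c != '\0'; c++) {
        switch (*c) {
        case '&': append(l, "&amp;");  break;
        case '<': append(l, "&lt;");   break;
        case '>': append(l, "&gt;");   break;
        case '"': append(l, "&quot;"); break;
        default: {
            char one[2] = { *c, '\0' };
            append(l, one);
            break;
        }
        }
    }
}

/* The parser hands over raw text; escaping happens here. */
static void demo_on_nowiki(DemoListener *l, const char *text)
{
    append_escaped(l, text);
}

static void demo_on_special(DemoListener *l, const char *text)
{
    append_escaped(l, text);
}

void demo_listener_init(DemoListener *l)
{
    l->onNowiki = demo_on_nowiki;
    l->onSpecial = demo_on_special;
    l->out[0] = '\0';
}
```

A nowiki event carrying the text `<b>a & b</b>` would then end up in the output as `&lt;b&gt;a &amp; b&lt;/b&gt;`.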
/Andreas
"Andreas Jonsson" <andreas.jonsson(a)kreablo.se> wrote in message
news:4C72738C.80704@kreablo.se...
> * The call to the "begin" method is delayed until some actual inlined
>   content is produced. The call is never made if an "end" event is
>   received before such content.

Does this mean that constructs such as <span id="JSPlaceholder"></span>
are obliterated by the lexer? Some empty inline (and block) elements may
have an important purpose as a JS DOM hook, and should not be removed
from the output stream.
- Mark Clements (HappyDog)
2010-09-02 15:15, Mark Clements (HappyDog) wrote:
> Does this mean that constructs such as <span id="JSPlaceholder"></span>
> are obliterated by the lexer? Some empty inline (and block) elements may
> have an important purpose as a JS DOM hook, and should not be removed
> from the output stream.
Yes, that is correct. This is what the original parser does for <i> and
<b>. But now that you mention it, I realize that this is probably just
an artefact of cleaning up the apostrophe mess.
I changed it so that inlined empty HTML elements are always included.
/Andreas
"Andreas Jonsson" <andreas.jonsson(a)kreablo.se> wrote in message
news:4C7FD17A.7000906@kreablo.se...
> Yes, that is correct. This is what the original parser does for <i> and
> <b>. But now that you mention it, I realize that this is probably just
> an artefact of cleaning up the apostrophe mess.
>
> I changed it so that inlined empty HTML elements are always included.

That sounds sensible. Any HTML inserted manually should be left in place
(possibly tidied - e.g. addition of closing tags - but not removed). It's
only the generated HTML that should (arguably) be cleaned up in this way.
If the user doesn't want the empty tag, then they can edit the page to
remove it.
- Mark Clements (HappyDog).