In my previous post I covered the lexer. Here I will describe the
parser, the parser context and the listener interface. After the
lexer's extensive job of providing a reasonably well formed token
stream, the parser's job becomes completely straightforward.
== The parser
For inlined elements, the parser will just mindlessly report them
to the context object:
inline_element:
    word | space | special | br | html_entity | link_element | format
    | nowiki | table_of_contents | html_inline_tag
    ;

space:
    token = SPACE_TAB {IE(CX->onSpace(CX, $token->getText($token));)}
    ;
etc.
The lexer guarantees that a closing token will not appear before
a corresponding opening token, and the parser context takes care of
nesting formats and removing empty format tags.
For block elements, the only special thing the parser needs to pay
attention to is the fact that end tokens may be missing. Therefore,
end-of-file is always accepted instead of the closing token, for
instance:
html_div:
    token = HTML_DIV_OPEN
    {
        CX->beginHtmlDiv(CX, $token->custom);
    }
    block_element_contents
    (HTML_DIV_CLOSE|EOF)
    {
        CX->endHtmlDiv(CX);
    }
    ;
The rule 'block_element_contents' covers all parser productions. The
lexer will restrict which tokens may appear. For instance,
'HTML_DIV_CLOSE' will never appear before a corresponding
'HTML_DIV_OPEN'. Also, list items and table cells will not appear
unless the current block context is correct. I have also introduced
a max nesting level limit in the lexer, so stack space is not an
issue.
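For illustration, such a limit check might look like the sketch below;
the constant and the helper name are invented, but the nesting counter
matches the generated lexer code shown in my post about the lexer:

/* Sketch of a nesting guard in the lexer context.  The limit value
 * and helper are illustrative, not the actual libmwparser code. */
#define MW_MAX_NESTING_LEVEL 32

static inline bool
mayOpenWikitextTable(const struct MWLEXERCONTEXT_struct *context)
{
    return context->wikitextTableOpenNestingLevel < MW_MAX_NESTING_LEVEL;
}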
== The parser context
The parser context relays the parser events to a listener, but it will
insert and remove events to produce a well formed output. For instance:
text '' italic <b><strong /> bold-italic
bold </b> text
will result in an event stream to the listener that looks like this:
text <i> italic <b> bold-italic </b></i>
<b> bold </b> text
Two mechanisms are used to implement this:
* The call to the "begin" method is delayed until some actual inlined
content is produced. The call is never taken if an "end" event is
received before such content.
* The order of the formats is maintained so that inner formats can be
closed and reopened when a non-matching end token is received.
So, most of the parser context's methods look like this:
static void
beginHtmlStrong(MWPARSERCONTEXT *context, pANTLR3_VECTOR attr)
{
    /* Delay the begin event until actual inline content is produced. */
    MW_DELAYED_CALL(context, beginHtmlStrong, endHtmlStrong,
                    attr, NULL);
    /* Register the format so it can be closed and reopened when a
     * non-matching end token arrives. */
    MW_BEGIN_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong,
                            attr, NULL, false);
    MWLISTENER *l = &context->listener;
    l->beginHtmlStrong(l, attr);
}

static void
endHtmlStrong(MWPARSERCONTEXT *context)
{
    /* Drop the begin/end pair altogether if no content was produced. */
    MW_SKIP_IF_EMPTY(context, beginHtmlStrong, endHtmlStrong, NULL);
    MW_END_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, NULL);
    MWLISTENER *l = &context->listener;
    l->endHtmlStrong(l);
}
Block elements are already guaranteed by the lexer to be well nested,
so the context typically does not need to do anything special about
them. Only the wikitext list elements need to be resolved by the
context.
== The listener
The listening application needs to implement the MWLISTENER interface.
I haven't added support for all features yet, but at the moment, there
are 91 methods in this interface. They are trivial to implement,
though. The only thing to think about is that it is the listener's
responsibility to escape the contents of nowiki and special
characters, and also to filter the attribute lists.
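As an illustration, a fragment of an html-emitting listener might look
like this; beginHtmlStrong/endHtmlStrong mirror the methods shown
above, while the onWord method and the struct layout are my
assumptions, not the actual interface:

#include <stdio.h>
#include <antlr3.h>

/* Sketch only: the real MWLISTENER has ~91 methods. */
typedef struct MWLISTENER MWLISTENER;
struct MWLISTENER {
    void *userData;
    void (*beginHtmlStrong)(MWLISTENER *l, pANTLR3_VECTOR attr);
    void (*endHtmlStrong)(MWLISTENER *l);
    void (*onWord)(MWLISTENER *l, const char *text);  /* assumed */
    /* ...and so on for the remaining methods. */
};

static void htmlBeginStrong(MWLISTENER *l, pANTLR3_VECTOR attr)
{
    /* A real implementation should filter the attribute list here. */
    fputs("<strong>", (FILE *)l->userData);
}

static void htmlEndStrong(MWLISTENER *l)
{
    fputs("</strong>", (FILE *)l->userData);
}

/* Escaping text and nowiki contents is the listener's job. */
static void htmlOnWord(MWLISTENER *l, const char *text)
{
    for (const char *p = text; *p != '\0'; p++) {
        switch (*p) {
        case '<': fputs("&lt;",  (FILE *)l->userData); break;
        case '>': fputs("&gt;",  (FILE *)l->userData); break;
        case '&': fputs("&amp;", (FILE *)l->userData); break;
        default:  fputc(*p,      (FILE *)l->userData); break;
        }
    }
}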
/Andreas
I have previously written about speculative execution in the lexer. To
exactly reproduce the behavior of image links, not one but two
speculations will be necessary. However, this is very complex, and
since the use case is undocumented I would like to simplify things.
The original behavior is as follows: the option list is split on the
'|' character; the caption is the _last_ non-option in the list, if
any.
So, to reproduce this, a separate speculation has to be initiated for
the caption. If another caption (non-option) is seen in the list, the
speculation will fail.
Furthermore, media links may nest one level. If a MEDIA_LINK or
INTERNAL_LINK appears in the caption at the second level, the
production will completely fail.
I think that the following is a reasonable simplification: image links
may not nest (although internal links and external links may appear in
the caption of a media link). The _first_ non-option in the list is
the caption, and no options may appear after the caption. In this way,
only one speculation is required for media links, and the lexer can
handle the option list. This behavior seems consistent with the
documentation at http://www.mediawiki.org/wiki/Help:Images.
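A sketch of the simplified scan (isOption() and onMediaLinkOption()
are hypothetical placeholders; the real option check is configuration
dependent):

/* Under the proposed rule: everything up to the first non-option is
 * an option, and the rest, up to the closing ']]', is the caption.
 * An option-looking item after that point is just caption text. */
static int scanMediaLinkOptions(const char **items, int nitems,
                                struct MWLEXERCONTEXT_struct *context)
{
    int i = 0;
    while (i < nitems && isOption(items[i])) {
        onMediaLinkOption(context, items[i]);
        i++;
    }
    return i;   /* index of the first caption item, if any */
}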
Is there any known use for putting an image inside an image caption,
or is the restriction I propose here sufficient?
Best regards,
Andreas Jonsson
I have imported the parser implementation to the repository:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
Dependencies:
* antlr snapshot. Be sure to apply the patch to the C runtime.
* libtre. Regexp library for wide character strings. (Not actually
used yet.)
There is no php integration yet.
Below is a list of cases I'm aware of where the behavior differs from
Parser.php. (libmwparser doesn't actually output html at the moment,
but in the below examples I've converted the traces to html in the
obvious way for comparison.)
- Definition lists:
;; item
Parser.php: <dl><dt></dt><dl><dt> item </dt></dl></dl>
libmwparser: <dl><dl><dt> item</dt></dl></dl>
- Html/table attributes:
{| id='a class='b'
| col1
|}
Parser.php: <table class='b'><tbody><tr><td> col1 </td></tr></tbody></table>
libmwparser: <table><tbody><tr><td> col1 </td></tr></tbody></table>
(libmwparser does not backtrack to the space character to try to find
a valid attribute; it just considers id='a class='<junk characters> to
be garbage altogether.)
- libmwparser restricts some block element tokens to the correct
block contexts.
- inline formatting:
<b>'''bold'''</b>
Parser.php: <b><b>bold</b></b>
libmwparser: <b>bold</b>
- long term formatting is applied to all inline text:
<i>text
{|
| col1
|}
text</i>
Parser.php: <p><i>text</i></p><table><tbody><tr><td> col1
</td></tr></tbody></table><p><i>text</i></p>
libmwparser: <p><i>text</i></p><table><tbody><tr><td><i>
col1</i></td></tr></tbody></table><p><i>text</i></p>
- internal links are treated as long term formatting:
[[Link|text
{|
| col1
|}
text]]
Parser.php: <p><a href="...">text</p><table><tbody><tr><td> col1
</td></tr></tbody></table><p>text</a></p>
libmwparser: <p><a href="...">text</a></p><table><tbody><tr><td><a
href="..."> col1</a></td></tr></tbody></table><p><a href="...">text</a></p>
- In general, any case that causes Parser.php to generate invalid html
is likely to differ in libmwparser.
Some benchmarking:
The performance isn't very impressive.
I've quickly tried to make a comparison:
Parser.php:
* Mediawiki 1.15.0 running on a 2.2 GHz AMD Opteron 275
* I'm measuring from just before internalParse to right after
doBlockLevels.
libmwparser:
* 2.5 GHz Core 2 Duo
* The time for outputting the traces to /dev/null is included
128kB of plain text:
Parser.php: 170ms
libmwparser: 180ms
The page http://en.wikipedia.org/wiki/Wikipedia (templates not
installed at the mediawiki test server), size 124kB:
Parser.php: 720ms
libmwparser: 190ms
As expected, Parser.php will take more time the more markup there is
on the page, while libmwparser maintains a fairly constant pace.
/Andreas
I have previously stated that the most complex thing in the MediaWiki
wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about
internal links. Internal links can further be subdivided into regular
links and image links; image links complicate things even further, but
I will not discuss them here.
In its most basic form a link has the pattern:
prefix '[[' (' '|'\t')* title=(LEGAL_TITLE_CHARS+) (' '|'\t')* ']]' trail
Whether the prefix is part of the link or not is configurable. The
rendered html is
<a href="<url>&title=$title"> $prefix$title$trail </a>
The only slight problem here for implementing a parser is that
LEGAL_TITLE_CHARS and the support for the prefix are configurable.
The situation becomes funny when considering the case with custom link
text:
'[[' <link pattern> '|' (.*?) ']]'
(I have substituted the pattern (' '|'\t')* title=(LEGAL_TITLE_CHARS+)
(' '|'\t')* with <link pattern> and removed the prefix and trail for
clarity.) The .* is matched non-greedily, so it will be the first
instance of ']]' following the opening '[[' that will be matched.
(The actual regexp is using (.+?), but using an empty string as link
text will still produce a link for some strange reason.)
Before the pattern matching, the whole article is split into pieces
using the '[[' token as delimiter. A pattern match for all variants
of internal link patterns (sans the '[[') is attempted at the
beginning of each piece of text. The consequence is that the closing
']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is
output and may be processed again to produce other constructs
later on.
The fundamental problem with all this is the fact that when looking at
the wikitext '[[Link|', it cannot easily be determined if this is
actually a link, or if it is just the text '[[Link|'. It is not a
link unless there is a closing ']]' token that is not preceded by
another '[[' token. The closing token may appear almost anywhere:
This is not a link:
---------
[[Link|
text.
---------
But this is a link:
---------
[[Link|
text.]]
---------
And this:
---------
[[Link|
text
* list item]]
----------
The list element will not be rendered as a list. A table will,
however, be rendered as a table:
--------
[[Link|
{|
| col1
|}
text]]
--------
Do you need to put a table inside a definition term? No problem:
--------
;[[l|
{|
| col1
|}]]
--------
This is an especially funny example:
---------
[[Link|
{|
| col1
|- id="row2" class=']]'
|}
--------
The rendered link will be left unterminated.
Links cannot be opened inside table parameters, though. This is not a link:
--------
{| id="[[Link|"
| text]]
|}
--------
Even though the table parameters usually swallow any junk:
--------
{| id="table1" class="[[Link]]"
| No link above.
|}
--------
Column parameters are different:
--------
{|
| class="[[Link]]" | <-- will not actually become column parameters
| class="[[Link" | <-- still not column parameters, but this time no
  link either
| class="[[Link" | but of course, if there happens to be a ']]'-token
  somewhere later in the document, it's a whole different matter.
|}
--------
Trying to reproduce this behavior in a new parser would, of course, be
insane. In fact, the current MediaWiki parser does not seem to parse
links in linear time using linear amount of memory. My test server
failed to process a preview of an article consisting of about 24000
links of the form [[a]]. It was working hard before it, I guess, ran
out of memory. As a comparison, it parsed over 38000 italic
a's, ''a'', without problems.
So, what is the reasonable thing to do? First of all it should be
pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-str…
This suggests that any sane wikitext should not allow a link to
continue past the end of the inlined text where it is located. Even
better is to say that the sequence [[Link| always opens up a new link
and that 'end of inline text' will implicitly close the link if it is
still open. That will not require any lookahead to parse. It would
be consistent with the format parsing to only allow it to run to the
end of the line, though. Also, currently, paragraphs and list elements
aren't rendered inside link text unless enclosed or preceded by a
table. So, unless tables inside link text are a widely used feature,
such a change might not break that many pages.
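Sketched in terms of the parser context (the field and method names
below are invented for illustration):

/* '[[Link|' always opens a link... */
static void
beginInternalLink(MWPARSERCONTEXT *context, pANTLR3_STRING target)
{
    context->internalLinkOpen = true;
    MWLISTENER *l = &context->listener;
    l->beginInternalLink(l, target);
}

/* ...and the end of the inline text implicitly closes it if it is
 * still open, so no lookahead is needed to decide whether '[[Link|'
 * really was a link. */
static void
onEndOfInlineText(MWPARSERCONTEXT *context)
{
    if (context->internalLinkOpen) {
        MWLISTENER *l = &context->listener;
        l->endInternalLink(l);
        context->internalLinkOpen = false;
    }
}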
/Andreas
In my post about tokenizing wikitext, I had overlooked one important
detail in the lexer: the combination of lookahead and context-sensitive
token productions does not play well together.
The token productions execute actions. For the sake of this
discussion, these actions can be divided into two categories: context
updating actions and other actions.
If during lookahead, the same context updating actions are not
executed as they would be when actually consuming the input stream,
the lookahead will be based on an incorrect context, and hence, its
conclusion may be incorrect. Meanwhile, other actions must not be
executed during a lookahead. Furthermore, once the lookahead has
completed, the context must be restored to its state at the point
where the lookahead started.
The problem now is that writing lookahead rules that execute the
actions correctly will become very complex and prone to error. Also,
invoking a new instance of the lexer to perform the lookahead seems
like a lot of overhead. I will instead explore a more elegant
solution: speculative execution.
The implementation is very simple: at a decision point, save a "mark"
that contains a mark in the input stream, a copy of the context, and a
mark in the output channels. Speculatively execute one of the
decisions. The execution will eventually reach either a success
condition or a failure condition. On a success condition, the mark is
removed and the execution permanently accepted. On a failure, the
input stream is rewound, the context is restored, the produced tokens
are taken back, and the alternative decision is taken.
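In pseudo-C, a decision point looks roughly like the sketch below.
MARK() and REWIND() are the runtime's input stream operations; the
context snapshot, the token counter, and takeBackTokens() stand in for
the custom additions, and the MWLEXER type is illustrative:

/* Sketch of a decision point during speculative execution. */
typedef struct {
    ANTLR3_MARKER  input;     /* mark in the input stream       */
    MWLEXERCONTEXT context;   /* copy of the lexer context      */
    int            ntokens;   /* mark in the output channels    */
} SPECULATION_MARK;

static bool speculate(MWLEXER *lexer, bool (*decision)(MWLEXER *))
{
    SPECULATION_MARK m;
    m.input   = MARK();
    m.context = *lexer->context;
    m.ntokens = lexer->nEmittedTokens;

    if (decision(lexer)) {             /* reached a success condition */
        return true;                   /* permanently accepted        */
    }
    REWIND(m.input);                   /* failure: rewind the input,  */
    *lexer->context = m.context;       /* restore the context,        */
    takeBackTokens(lexer, m.ntokens);  /* and take back the tokens    */
    return false;                      /* caller tries the alternative */
}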
Now, the antlr C runtime doesn't actually support taking back tokens
that have been produced. It happens to be easy to add such a feature,
but I don't think that it is likely to be included in an official
antlr release. So the drawback is that we will permanently be
depending on a customized runtime.
There are six different situations where speculative execution would
be initiated:
+------------------+----------------------+---------------------+---------------------+
|Token production  |Start condition       |Success condition    |Failure condition    |
+------------------+----------------------+---------------------+---------------------+
|HEADING_BEGIN     |1-6 '=' characters at |A matching number of |Otherwise EOL or EOF |
|                  |BOL                   |'='s followed by zero|                     |
|                  |                      |or more spaces and   |                     |
|                  |                      |EOL                  |                     |
+------------------+----------------------+---------------------+---------------------+
|INDENT            |1 or more spaces or   |EOL or EOF           |Block element token  |
|                  |tabs at BOL           |                     |                     |
+------------------+----------------------+---------------------+---------------------+
|INTERNAL_LINK     |'[[Link|'             |']]'                 |'[[' or EOF          |
+------------------+----------------------+---------------------+---------------------+
|EXTERNAL_LINK     |'[http://example.com '|']' not inside       |'[' (including start |
|                  |                      |INTERNAL_LINK or     |of INTERNAL_LINK and |
|                  |                      |MEDIA_LINK           |MEDIA_LINK), EOL, EOF|
+------------------+----------------------+---------------------+---------------------+
|MEDIA_LINK        |'[[File:image.png|'   |']]'                 |MEDIA_LINK start     |
|                  |                      |                     |condition, EOF       |
+------------------+----------------------+---------------------+---------------------+
|MEDIA_LINK_CAPTION|'|' <non-option>      |']]'                 |another '|'          |
|                  |                      |                     |<non-option>, EOF    |
+------------------+----------------------+---------------------+---------------------+
Notice that 'EOF' is a condition that must be matched. This is a
problem as it is impossible to match the empty string at EOF---the
runtime simply will not call the lexer rules when the input stream has
reached its end. This means that an additional feature is required
from the runtime: the ability to install an 'end-of-file-action', with
the ability to rewind the input stream. This is, fortunately, trivial
to implement.
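A sketch of what the installed action might do; installEofAction() is
the hypothetical new runtime call, and the other names are
illustrative:

/* Treat EOF as the pending speculation's success or failure
 * condition and rewind the input stream so that the lexer rules get
 * called again. */
static void onEndOfFile(void *userData)
{
    MWLEXER *lexer = (MWLEXER *)userData;
    if (lexer->speculating) {
        resolveSpeculationAtEof(lexer);
    }
}

static void setupLexer(MWLEXER *lexer, pANTLR3_INPUT_STREAM input)
{
    installEofAction(input, onEndOfFile, lexer);
}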
The media link caption speculation is required if the current behavior
is to be reproduced exactly. However, I don't think that the extreme
complexity of reproducing this behavior is justified, so I will write
a separate post on how to make a reasonable simplification of this
later.
One important question is whether speculative execution could
potentially lead to super-linear worst case scanning times. So, what
would the worst possible crafted input look like?
HEADING_BEGIN and INDENT cannot be combined. Also, it is in
all cases impossible for the start condition to occur inside of a
speculative production (assuming that medialinks are restricted
from nesting). On top of that, external links are disabled inside
internal links. So, at most three speculative executions will ever
be initiated at the same time. The worst case should look like this:
= [[File:image.png| [[Link| [http://example.com very long line of text
running to EOF ... [[
The execution would look like this:
try heading1, try media link, try internal link, (external link
disabled so don't try that here), fail internal link, reverse, try
external link, fail external link, reverse, fail media link, reverse,
try internal link, fail internal link, reverse, try external link, fail
external link, reverse, fail heading1, reverse, try media link, try
internal link, fail internal link, reverse, try external link, fail
external link, reverse, fail media link, reverse, try internal
link, fail internal link, reverse, try external link, fail external
link, reverse, succeed.
So, some contents may be scanned 12 times, but this still means worst
case linear time. This should be investigated more rigorously,
however. It would also be possible to use a hash table of failed
speculations to avoid the redundant retries altogether. (I'm actually
storing the last failure point for any speculation, so the line in the
above example would only be scanned 4 times. However, the 12-times
scanning can still be provoked by adding additional start conditions.)
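The bookkeeping can be sketched like this; the layout is my
illustration (a simple exact-position memo), while the real code's
condition is more refined:

/* Remember where each kind of speculation last failed, and refuse to
 * start the same speculation from the same position again. */
enum {
    SPEC_HEADING, SPEC_INDENT, SPEC_INTERNAL_LINK,
    SPEC_EXTERNAL_LINK, SPEC_MEDIA_LINK, SPEC_MEDIA_LINK_CAPTION,
    N_SPECULATIONS
};

static ANTLR3_MARKER lastFailure[N_SPECULATIONS];

static bool worthSpeculating(int spec, ANTLR3_MARKER pos)
{
    return lastFailure[spec] != pos;
}

static void recordFailure(int spec, ANTLR3_MARKER pos)
{
    lastFailure[spec] = pos;
}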
It should be pointed out that the only speculation that is expected to
ever fail inside a well formed wiki page is the INDENT speculation.
Any other failure is either due to a typo or due to a malicious user
trying to DoS attack the server. If 12 scans is the worst
possible scenario, such an attack will not be very efficient.
/Andreas
The parser I am writing follows a four layered design:
- client application
- parser context
- parser
- lexical scanner
In this post I will write about the lexical scanner and what the
special requirements are for wikitext. Since all input is valid
wikitext, the lexer must never fail to accept a character in any
position or state. Also, it is important that the lexer protects the
parser from spurious tokens. The parser grammar will quickly become
very complex if it has to deal with unexpected tokens. It is
therefore beneficial to make the lexer a bit more complex and to make
sure that, for instance, a </li> tag appearing in a particular context
really is supposed to be an html tag token, rather than just the
verbatim characters. To accomplish this, two things are required:
lookahead and the ability to selectively enable/disable individual
token productions.
== Basic tokens
At the very bottom of the lexer there are 7 kinds of tokens:
SPACE_TAB:     SPACE_TAB_CHAR+;
NEWLINE:       '\r\n' | NEWLINE_CHAR;
SPECIAL:       SPECIAL_SYMBOL;
APOS:          '\'';
ASCII_WORD:    ASCII_LETTER+;
WORD:          ~(SPACE_TAB_CHAR|NEWLINE_CHAR|SPECIAL_SYMBOL|'\''|ASCII_LETTER|NON_PRINTABLE_CHAR)+;
NON_PRINTABLE: NON_PRINTABLE_CHAR+ {$channel = HIDDEN;};
These token types cover all characters in the Unicode set; any
character will match exactly one of the above. The SPECIAL rule
matches exactly one character out of a set of characters that may be
used to form more advanced tokens; it captures instances of such
characters that are not part of a token. Other lexer rules may also
fall back to producing SPECIAL tokens, which may contain multiple
characters. The apostrophes are handled by the parser, so each is
passed on as a token of its own. ASCII_WORD needs to be treated
separately in order to support internal link prefixes and trails. A
word is very loosely defined as a sequence of characters that do not
fall into any other category of characters. Non-printable characters
are hidden from the parser.
If we removed all lexer rules except the basic ones above, the
result of the parsing would be a text split into paragraphs where
all markup is rendered verbatim and where non-printable characters
have been stripped away. All other kinds of formatting are triggered
by the lexer producing a more complex token.
== Lookahead
Some tokens aren't tokens unless followed by a proper closing token.
Thus, we must sometimes perform a lookahead in the input stream before
accepting a token.
1. Syntactic predicates cannot be used for lookahead. E.g., the rule
('ab')=> 'a' will match 'a' regardless of what follows. The
trailing 'b' in the lookahead is silently ignored by Antlr.
2. Using Antlr's builtin backtracking has two major problems:
   1. The DFA tables become extremely large.
   2. Passing arguments to lexer rules will not work. (May be specific
      to the C target.)
Instead I am using the following pattern to perform lookahead:
RULE
@init{
    bool success;
}:
    MATCHING_CHARACTERS RULE_LOOKAHEAD[&success]
    {
        if (!success) {
            // Fall back to instead producing a SPECIAL token, or
            // whatever else might be appropriate.
        }
    }
    ;

RULE_LOOKAHEAD[bool *success]
@init {
    ANTLR3_MARKER mark;
}:
    {
        mark = MARK();
    }
    (
          (ALTERNATIVE1 {*success = true;})
        | (ALTERNATIVE2 {*success = false;})
        | ...
    )
    {
        REWIND(mark);
    }
    ;
This is a bit clumsy syntactically, but it does the job.
When falling back to producing different tokens, it may sometimes be
convenient to be able to produce several tokens at once. This is,
however, not supported by the C target. I have provided a patch for
this, but Jim Idle said that it could not be included in the main
distribution of libantlr3c, since, I guess, it would break some other
things, so we may not want to rely on this.
== Selectively enabling and disabling token productions
By maintaining a lexer context and using the semantic predicates in
Antlr, it is possible to completely avoid generating spurious tokens.
The basic pattern is as follows:
RULE: {!CX->ruleDisabled}?=> RULE_PATTERN { onRule(CX); }
;
So, there is a predicate 'ruleDisabled' that guards the rule, and a
method 'onRule' that reports to the context that the rule was taken.
To take a specific example:
BEGIN_TABLE:
    {BOL && !CX->wikitextTableOpenDisabled}?=> (SKIP_SPACE '{|')=>
    SKIP_SPACE
    BEGIN_TABLE_INTERNAL
    ;

fragment
BEGIN_TABLE_INTERNAL
@init{
    pANTLR3_VECTOR attrs = NULL;
}:
    '{|' {onWikitextTableOpen(CX);}
    ATTRIBUTE_LIST_TABLE[&attrs]
    { CUSTOM = attrs; }
    ;
The logic that governs the 'ruleDisabled' predicates is fairly complex,
and writing the code for it directly is a very error-prone process.
Instead, I have opted to write a table with a description of each
predicate and let a script generate the code. For instance, the
description for wikitext table open and close predicates looks like
this:
array(
    'name'               => 'wikitextTableOpen',
    'close'              => 'wikitextTableClose',
    'initiallyDisabled'  => array(),
    'affects'            => array(new TypeEnable('wikitextTable',
                                                 'WIKITEXTTABLE')),
    'mayNest'            => true,
    'haveNestingCount'   => true,
    'types'              => array('block', 'wikitextTable'),
    'pushesBlockContext' => true,
),
(It may not be immediately clear what each attribute means, but I have
provided documentation for these in the source file.) This is one of
the most complex predicates to describe. The script would generate
the methods onWikitextTableOpen/Close and enable/disableWikitextTable.
They will look something like this:
static inline void
onWikitextTableOpen(struct MWLEXERCONTEXT_struct *context)
{
    context->wikitextTableOpenNestingLevel++;
    enableWikitextTable(context, DC_WIKITEXTTABLE);
    context->wikitextTableCloseDisabled &= ~DC_OPENED;
    pushBlockContext(context, &context->wikitextTableOpenDisabled);
}

static inline void
onWikitextTableClose(struct MWLEXERCONTEXT_struct *context)
{
    context->wikitextTableOpenNestingLevel--;
    if (context->wikitextTableOpenNestingLevel == 0) {
        context->wikitextTableCloseDisabled |= DC_OPENED;
        disableWikitextTable(context, DC_WIKITEXTTABLE);
        context->wikitextTableOpenDisabled &= ~DC_WIKITEXTTABLE;
    }
    popBlockContext(context, &context->wikitextTableOpenDisabled);
}
=== Block Context
The block context is a generalization of the table context concept
that I have mentioned in a previous email.
The option 'pushesBlockContext' that is visible in the above example
states that the opening token corresponding to a particular predicate
will create a new "block context" and push this to a stack. The
corresponding end token will pop the stack and restore the previous
block context.
When a particular block context is activated, the corresponding
closing token will be activated, as well as any tokens internal to the
particular block context. E.g., the '<ul>' token will enable the
'</ul>' and the '<li>' token. Deactivating a block context will
disable the tokens again.
So, pushing a new block context includes deactivating the current
context and activating the new one; popping includes deactivating the
current block context and activating the previous one, if any.
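As a sketch of the two stack operations (pushBlockContext and
popBlockContext are called from the generated code shown earlier; the
struct layout and the activate/deactivate helpers are illustrative):

/* Sketch of the block context stack. */
typedef struct BLOCKCONTEXT {
    struct BLOCKCONTEXT *prev;          /* context restored on pop   */
    int                 *openDisabled;  /* the pushing rule's flags  */
} BLOCKCONTEXT;

static void pushBlockContext(struct MWLEXERCONTEXT_struct *cx,
                             int *openDisabled)
{
    if (cx->blockContextTop != NULL) {
        deactivateBlockContext(cx, cx->blockContextTop);
    }
    BLOCKCONTEXT *bc = allocBlockContext(cx);
    bc->prev          = cx->blockContextTop;
    bc->openDisabled  = openDisabled;
    cx->blockContextTop = bc;
    /* Enables the closing token and the tokens internal to the new
     * context, e.g. '</ul>' and '<li>' for a '<ul>'. */
    activateBlockContext(cx, bc);
}

static void popBlockContext(struct MWLEXERCONTEXT_struct *cx,
                            int *openDisabled)
{
    /* openDisabled could be used to verify balanced nesting. */
    BLOCKCONTEXT *bc = cx->blockContextTop;
    deactivateBlockContext(cx, bc);
    cx->blockContextTop = bc->prev;
    if (cx->blockContextTop != NULL) {
        activateBlockContext(cx, cx->blockContextTop);
    }
    freeBlockContext(cx, bc);
}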
As a consequence, the html tokens '<li>', '<td>', '<tr>', etc., and the
wikitext table tokens '|-', '|', '||', etc., will be disabled and the
corresponding text string rendered verbatim if occurring outside the
appropriate context. Also, block elements that push the block
context must be closed with the correct end token. For example:
<ul>
<li><div>text</li>
</ul>
will be rendered as:
<ul><li><div>text</li> </ul> </div></li></ul>
Since the <div> is still open, the end tokens for the list will not
be active. The open elements are instead closed at the end of the
article.
By restricting the block contexts in this way, lots of problems are
avoided in the parser. Since the behavior doesn't seem to be well
defined in the current parser, I would guess that this would be
acceptable.
To summarize this rather long post, the lexer is the most complex
part, but by letting it do a thorough filtering of the input, the
parser's job becomes much simpler.
Best regards,
Andreas Jonsson
Mixing HTML elements with wikitext is a grey area. How the HTML tags
in the wikitext interact with the wikitext elements does not seem very
well defined. Therefore, I will make up some rules
where I try to preserve any legitimate use of html elements, but with
some restrictions to avoid some problems:
1. Do not allow html block elements inside wikitext lists. For example,
this is no longer allowed:
* item1 <li> item2
The problem is that it becomes very hard to determine where the
wikitext list element ends. Should it run until the inner block
elements are properly closed? Should the inner block elements be
implicitly closed at the end of the list element? What if the
inner html is malformed? What if a new wikitext list element is
opened before the inner block elements are closed? The current
behavior does not seem very well defined; it seems to generate
garbage html in most cases. Wikitext lists inside html list elements
should be ok, though:
<ul><li>item1</li>
<li>
* item 2.1 ( </li> tag will not be active here )
</li>
</ul>
2. Do not allow table html tags inside wikitext tables, unless opened
up by a nested html table, which disables wikitext table tokens
until the html table is properly closed:
{|
| col1 <td> tag disabled, so still col1
| <table><td> implicitly open up <tbody> and <tr>
|} wikitext table tokens disabled, thus still in html table.
{|
| However, allow wikitext tables to nest inside html tables. Here
  the html tokens, <td>, <tr>, etc., are once again disabled.
|}
</table> (inner close tags implied).
| col2
|}
So, we'll get two different kinds of table contexts, which may be
arbitrarily nested, but not mixed.
So, the question is: would these restrictions allow sufficient
backwards compatibility with the current parser?
Some other thoughts on parsing of HTML-like tags:
* <br [attributes]> Also allowed in the form <br [attributes]/>. No
</br> tag exists.
* <hr [attributes]> Same as <br> except that it also terminates inline
text.
* <img [attributes]> Same as <br> except that it is enabled/disabled
via a configuration option.
* <p [attributes]> Opening tag enables closing tag </p> and disables
itself until the end of the current inlined text. <p> opens up a
new paragraph, </p> closes the current inlined text.
* Inlined html elements. These can be used for long term formatting.
The context will make sure they are correctly nested, closed on end
of inlined text and reopened at beginning of inlined text. They are
permanently closed at the corresponding end tag, or at end of
article. Variants:
* non-nesting (disables the start tag when entered) / nesting (adds
to a nesting level when entered).
* may be empty / may not be empty (empty instances will be ignored)
* Block html elements. Start and end tags terminate inline text.
(They may _not_ be nested inside paragraphs.) Inline text inside
<ol> and <ul> implies <li>, inlined text inside <dl> implies <dd>,
inline text inside <div> implies <p>, inline text inside <table>
implies <tbody><tr><td>, and <h1>-<h6> disables wikitext block element
tokens, in addition to all html block element tokens except the
corresponding closing </hX> token.
* <pre> disables all html elements and all block elements (both wikitext
and html block elements).
* <ins> and <del> will be inline if occurring inside inlined text.
Otherwise block.
* <a> disables wikitext link tokens.
* Tag extensions are treated like <nowiki>; the contents are passed
verbatim to the corresponding callback function. The parser may be
called recursively if the extension needs to parse wikitext.
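A tag extension hook could then have a shape like this; the signature
is purely illustrative, not the actual libmwparser interface:

/* Sketch of a tag extension callback: the contents arrive verbatim,
 * and the callback may invoke the parser recursively if it wants the
 * contents parsed as wikitext. */
typedef void (*MWTAGEXTENSION)(MWLISTENER *listener,
                               const char *tagName,
                               pANTLR3_VECTOR attributes,
                               pANTLR3_STRING verbatimContents);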
Best Regards,
Andreas Jonsson
List item tokens must appear at the beginning of the line, as lists
aren't allowed inside a preformatted box:
: foo  -> <dl><dd> foo </dd></dl>
 : foo -> <pre> : foo </pre>
However, if you put a table inside it, the list item token may appear
indented:
:{|
|foo
|} -> <dl><dd><table><tbody><tr><td> foo
</td></tr></tbody></table></dd></dl>
/Andreas
Hello,
I am initiating yet another attempt at writing a new parser for
MediaWiki. It seems that more than six months have passed since the
last attempt, so it's about time. :)
Parser functions, magic words and html comments are better handled by
a preprocessor than by trying to integrate them with the parser (at
least if you want to preserve the current behavior). So I am only
aiming at
implementing something that can be plugged in after the preprocessing
stages.
In the wikimodel project (http://code.google.com/p/wikimodel/) we are
using a parser design that works well for wiki syntax; a front end
(implemented using an LL-parser generator) scans the text and feeds
events to a context object, which can be queried by the front end to
enable context sensitive parsing. The context object will in turn
feed a well formed sequence of events to a listener that may build a
tree structure, generate xml, or any other format.
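Expressed as C-style interfaces, the layering might look roughly like
the sketch below (illustrative only, not actual wikimodel or
libmwparser declarations):

#include <stdbool.h>

/* The listener is the client application's side of the interface. */
typedef struct MWLISTENER MWLISTENER;
struct MWLISTENER {
    void *userData;
    /* One function pointer per event, e.g.: */
    void (*beginParagraph)(MWLISTENER *l);
    void (*endParagraph)(MWLISTENER *l);
    void (*onText)(MWLISTENER *l, const char *text);
};

/* The context sits between the front end and the listener. */
typedef struct {
    MWLISTENER listener;  /* receives a well formed event sequence */
    /* State the front end can query for context sensitive parsing,
     * e.g. which formats are currently open: */
    bool italicOpen;
} MWPARSERCONTEXT;

/* The front end (the generated parser) feeds raw events downwards... */
void onItalicToggle(MWPARSERCONTEXT *context);
/* ...and queries the context: */
bool isItalicOpen(const MWPARSERCONTEXT *context);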
As for parser generators, Antlr seems to be the best choice. It has
support for semantic predicates and rather sophisticated options for
backtracking. I'm peeking at Steve Bennet's antlr grammar
(http://www.mediawiki.org/wiki/Markup_spec/ANTLR), but I cannot really
use that one, since the parsing algorithm is fundamentally different.
There are two problems with Antlr:
1. No php back-end
Writing a php back-end to antlr is a matter of providing a set of
templates and porting the runtime. It's a lot of work, but seems
fairly straightforward.
The parser can, of course, be written in C and be deployed as a php
extension. The drawback is that it will be harder to deploy it,
while the advantage is the performance. For MediaWiki it might be
worth maintaining both a php and a C version, though, since both
speed and deployability are important.
2. No UTF-8 support in the C runtime in the latest release of antlr.
In trunk it has support for various character encodings, though, so
it will probably be there in the next release.
My implementation is just at the beginning stages, but I have
successfully reproduced the exact behavior of MediaWiki's parsing of
apostrophes, which seems to be by far the hardest part. :)
I put it up right here if anyone is interested in looking at it:
http://kreablo.se:8080/x/bin/download/Gob/libmwparser/libwikimodel%2D0.1.ta…
Best regards,
Andreas Jonsson