In my previous post I covered the lexer. Here I will describe the
parser, the parser context and the listener interface. After the
lexer's extensive job of providing a reasonably well formed token
stream, the parser's job becomes completely straightforward.
== The parser
For inlined elements, the parser will just mindlessly report them
to the context object:
inline_element:
    word | space | special | br | html_entity | link_element | format
    | nowiki | table_of_contents | html_inline_tag
    ;

space:
    token = SPACE_TAB {IE(CX->onSpace(CX, $token->getText($token));)}
    ;
etc.
The lexer guarantees that a closing token will not appear before
a corresponding opening token, and the parser context takes care of
nesting formats and removing empty format tags.
For block elements, the only special thing the parser needs to pay
attention to is the fact that end tokens may be missing. Therefore,
end-of-file is always accepted instead of the closing token, for
instance:
html_div:
    token = HTML_DIV_OPEN
    {
        CX->beginHtmlDiv(CX, $token->custom);
    }
    block_element_contents
    (HTML_DIV_CLOSE|EOF)
    {
        CX->endHtmlDiv(CX);
    }
    ;
The rule 'block_element_contents' covers all parser productions. The
lexer will restrict which tokens may appear. For instance,
'HTML_DIV_CLOSE' will never appear before a corresponding
'HTML_DIV_OPEN'. Also, list items and table cells will not appear
unless the current block context is correct. I have also introduced
a max nesting level limit in the lexer, so stack space is not an
issue.
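For illustration, such a limit check might look like the sketch below;
the constant and the helper name are invented, but the nesting counter
matches the generated lexer code shown in my post about the lexer:

/* Sketch of a nesting guard in the lexer context.  The limit value
 * and helper are illustrative, not the actual libmwparser code. */
#define MW_MAX_NESTING_LEVEL 32

static inline bool
mayOpenWikitextTable(const struct MWLEXERCONTEXT_struct *context)
{
    return context->wikitextTableOpenNestingLevel < MW_MAX_NESTING_LEVEL;
}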
== The parser context
The parser context relays the parser events to a listener, but it will
insert and remove events to produce a well formed output. For instance:
text '' italic <b><strong /> bold-italic
bold </b> text
will result in an event stream to the listener that looks like this:
text <i> italic <b> bold-italic </b></i>
<b> bold </b> text
Two mechanisms are used to implement this:
* The call to the "begin" method is delayed until some actual inlined
content is produced. The call is never taken if an "end" event is
received before such content.
* The order of the formats is maintained so that inner formats can be
closed and reopened when a non-matching end token is received.
So, most of the parser context's methods look like this:
static void
beginHtmlStrong(MWPARSERCONTEXT *context, pANTLR3_VECTOR attr)
{
    /* Delay the begin event until actual inline content is produced. */
    MW_DELAYED_CALL(context, beginHtmlStrong, endHtmlStrong,
                    attr, NULL);
    /* Register the format so it can be closed and reopened when a
     * non-matching end token arrives. */
    MW_BEGIN_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong,
                            attr, NULL, false);
    MWLISTENER *l = &context->listener;
    l->beginHtmlStrong(l, attr);
}

static void
endHtmlStrong(MWPARSERCONTEXT *context)
{
    /* Drop the begin/end pair altogether if no content was produced. */
    MW_SKIP_IF_EMPTY(context, beginHtmlStrong, endHtmlStrong, NULL);
    MW_END_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, NULL);
    MWLISTENER *l = &context->listener;
    l->endHtmlStrong(l);
}
Block elements are already guaranteed by the lexer to be well nested,
so the context typically does not need to do anything special about
them. Only the wikitext list elements need to be resolved by the
context.
== The listener
The listening application needs to implement the MWLISTENER interface.
I haven't added support for all features yet, but at the moment, there
are 91 methods in this interface. They are trivial to implement,
though. The only thing to think about is that it is the listener's
responsibility to escape the contents of nowiki and special
characters, and also to filter the attribute lists.
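As an illustration, a fragment of an html-emitting listener might look
like this; beginHtmlStrong/endHtmlStrong mirror the methods shown
above, while the onWord method and the struct layout are my
assumptions, not the actual interface:

#include <stdio.h>
#include <antlr3.h>

/* Sketch only: the real MWLISTENER has ~91 methods. */
typedef struct MWLISTENER MWLISTENER;
struct MWLISTENER {
    void *userData;
    void (*beginHtmlStrong)(MWLISTENER *l, pANTLR3_VECTOR attr);
    void (*endHtmlStrong)(MWLISTENER *l);
    void (*onWord)(MWLISTENER *l, const char *text);  /* assumed */
    /* ...and so on for the remaining methods. */
};

static void htmlBeginStrong(MWLISTENER *l, pANTLR3_VECTOR attr)
{
    /* A real implementation should filter the attribute list here. */
    fputs("<strong>", (FILE *)l->userData);
}

static void htmlEndStrong(MWLISTENER *l)
{
    fputs("</strong>", (FILE *)l->userData);
}

/* Escaping text and nowiki contents is the listener's job. */
static void htmlOnWord(MWLISTENER *l, const char *text)
{
    for (const char *p = text; *p != '\0'; p++) {
        switch (*p) {
        case '<': fputs("&lt;",  (FILE *)l->userData); break;
        case '>': fputs("&gt;",  (FILE *)l->userData); break;
        case '&': fputs("&amp;", (FILE *)l->userData); break;
        default:  fputc(*p,      (FILE *)l->userData); break;
        }
    }
}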
/Andreas
I have previously written about speculative execution in the lexer. To
exactly reproduce the behavior of image links, not one but two
speculations will be necessary. However, this is very complex, and
since the use case is undocumented I would like to simplify things.
The original behavior is as follows: the option list is split on the
'|' character; the caption is the _last_ non-option in the list, if
any.
So, to reproduce this, a separate speculation has to be initiated for
the caption. If another caption (non-option) is seen in the list, the
speculation will fail.
Furthermore, media links may nest one level. If a MEDIA_LINK or
INTERNAL_LINK appears in the caption at the second level, the
production will completely fail.
I think that the following is a reasonable simplification: image links
may not nest (although internal links and external links may appear in
the caption of a media link). The _first_ non-option in the list is
the caption, and no options may appear after the caption. In this way,
only one speculation is required for media links, and the lexer can
handle the option list. This behavior seems consistent with the
documentation at http://www.mediawiki.org/wiki/Help:Images.
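A sketch of the simplified scan (isOption() and onMediaLinkOption()
are hypothetical placeholders; the real option check is configuration
dependent):

/* Under the proposed rule: everything up to the first non-option is
 * an option, and the rest, up to the closing ']]', is the caption.
 * An option-looking item after that point is just caption text. */
static int scanMediaLinkOptions(const char **items, int nitems,
                                struct MWLEXERCONTEXT_struct *context)
{
    int i = 0;
    while (i < nitems && isOption(items[i])) {
        onMediaLinkOption(context, items[i]);
        i++;
    }
    return i;   /* index of the first caption item, if any */
}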
Is there any known use for putting an image inside an image caption,
or is the restriction I propose here sufficient?
Best regards,
Andreas Jonsson
I have imported the parser implementation to the repository:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
Dependencies:
* antlr snapshot. Be sure to apply the patch to the C runtime.
* libtre. Regexp library for wide character strings. (Not actually
used yet.)
There is no php integration yet.
Below is a list of cases I'm aware of where the behavior differs from
Parser.php. (libmwparser doesn't actually output html at the moment,
but in the below examples I've converted the traces to html in the
obvious way for comparison.)
- Definition lists:
;; item
Parser.php: <dl><dt></dt><dl><dt> item </dt></dl></dl>
libmwparser: <dl><dl><dt> item</dt></dl></dl>
- Html/table attributes:
{| id='a class='b'
| col1
|}
Parser.php: <table class='b'><tbody><tr><td> col1 </td></tr></tbody></table>
libmwparser: <table><tbody><tr><td> col1 </td></tr></tbody></table>
(libmwparser does not backtrack to the space character to try to find
a valid attribute; it just considers id='a class='<junk characters> to
be garbage altogether.)
- libmwparser restricts some block element tokens to the correct
block contexts.
- inline formatting:
<b>'''bold'''</b>
Parser.php: <b><b>bold</b></b>
libmwparser: <b>bold</b>
- long term formatting is applied to all inline text:
<i>text
{|
| col1
|}
text</i>
Parser.php: <p><i>text</i></p><table><tbody><tr><td> col1
</td></tr></tbody></table><p><i>text</i></p>
libmwparser: <p><i>text</i></p><table><tbody><tr><td><i>
col1</i></td></tr></tbody></table><p><i>text</i></p>
- internal links are treated as long term formatting:
[[Link|text
{|
| col1
|}
text]]
Parser.php: <p><a href="...">text</p><table><tbody><tr><td> col1
</td></tr></tbody></table><p>text</a></p>
libmwparser: <p><a href="...">text</a></p><table><tbody><tr><td><a
href="..."> col1</a></td></tr></tbody></table><p><a href="...">text</a></p>
- In general, any case that causes Parser.php to generate invalid html
is likely to differ in libmwparser.
Some benchmarking:
The performance isn't very impressive.
I've quickly tried to make a comparison:
Parser.php:
* Mediawiki 1.15.0 running on a 2.2 GHz AMD Opteron 275
* I'm measuring from just before internalParse to right after
doBlockLevels.
libmwparser:
* 2.5 GHz Core 2 Duo
* The time for outputting the traces to /dev/null is included
128kB of plain text:
Parser.php: 170ms
libmwparser: 180ms
The page http://en.wikipedia.org/wiki/Wikipedia (templates not
installed at the mediawiki test server), size 124kB:
Parser.php: 720ms
libmwparser: 190ms
As expected, Parser.php will take more time the more markup there is
on the page, while libmwparser maintains a fairly constant pace.
/Andreas
I have previously stated that the most complex thing in the MediaWiki
wikitext seemed to be the apostrophes. I was wrong.
Links are divided into internal and external. Here I will write about
internal links. Internal links can further be subdivided into regular
links and image links; image links complicate things even further, but
I will not discuss them here.
In its most basic form a link has the pattern:
prefix '[[' (' '|'\t')* title=(LEGAL_TITLE_CHARS+) (' '|'\t')* ']]' trail
Whether the prefix is part of the link or not is configurable. The
rendered html is
<a href="<url>&title=$title"> $prefix$title$trail </a>
The only slight problem here for implementing a parser is that
LEGAL_TITLE_CHARS and the support for the prefix are configurable.
The situation becomes funny when considering the case with custom link
text:
'[[' <link pattern> '|' (.*?) ']]'
(I have substituted the pattern (' '|'\t')* title=(LEGAL_TITLE_CHARS+)
(' '|'\t')* with <link pattern> and removed the prefix and trail for
clarity.) The .* is matched non-greedily, so it will be the first
instance of ']]' following the opening '[[' that will be matched.
(The actual regexp is using (.+?), but using an empty string as link
text will still produce a link for some strange reason.)
Before the pattern matching, the whole article is split into pieces
using the '[[' token as delimiter. A pattern match for all variants
of internal link patterns (sans the '[[') is attempted at the
beginning of each piece of text. The consequence is that the closing
']]' will be matched with the last opening '[[' preceding it.
If the pattern match fails, no link is produced; the wikitext is
output and may be processed again to produce other constructs
later on.
The fundamental problem with all this is the fact that when looking at
the wikitext '[[Link|', it cannot easily be determined if this is
actually a link, or if it is just the text '[[Link|'. It is not a
link unless there is a closing ']]' token that is not preceded by
another '[[' token. The closing token may appear almost anywhere:
This is not a link:
---------
[[Link|
text.
---------
But this is a link:
---------
[[Link|
text.]]
---------
And this:
---------
[[Link|
text
* list item]]
----------
The list element will not be rendered as a list. A table will,
however, be rendered as a table:
--------
[[Link|
{|
| col1
|}
text]]
--------
Do you need to put a table inside a definition term? No problem:
--------
;[[l|
{|
| col1
|}]]
--------
This is an especially funny example:
---------
[[Link|
{|
| col1
|- id="row2" class=']]'
|}
--------
The rendered link will be left unterminated.
Links cannot be opened inside table parameters, though. This is not a link:
--------
{| id="[[Link|"
| text]]
|}
--------
Even though the table parameters usually swallow any junk:
--------
{| id="table1" class="[[Link]]"
| No link above.
|}
--------
Column parameters are different:
--------
{|
| class="[[Link]]" | <-- will not actually become column parameters
| class="[[Link" | <-- still not column parameters, but this time no
  link either
| class="[[Link" | but of course, if there happens to be a ']]'-token
  somewhere later in the document, it's a whole different matter.
|}
--------
Trying to reproduce this behavior in a new parser would, of course, be
insane. In fact, the current MediaWiki parser does not seem to parse
links in linear time using linear amount of memory. My test server
failed to process a preview of an article consisting of about 24000
links of the form [[a]]. It was working hard before it, I guess, ran
out of memory. As a comparison, it parsed over 38000 italic
a's, ''a'', without problems.
So, what is the reasonable thing to do? First of all it should be
pointed out that block elements are not allowed inside link text:
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#dtdentry_xhtml1-str…
This suggests that any sane wikitext should not allow a link to
continue past the end of the inlined text where it is located. Even
better is to say that the sequence [[Link| always opens up a new link
and that 'end of inline text' will implicitly close the link if it is
still open. That will not require any lookahead to parse. It would
be consistent with the format parsing to only allow it to run to the
end of the line, though. Also, currently, paragraphs and list elements
aren't rendered inside link text unless enclosed or preceded by a
table. So, unless tables inside link text are a widely used feature,
such a change might not break that many pages.
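Sketched in terms of the parser context (the field and method names
below are invented for illustration):

/* '[[Link|' always opens a link... */
static void
beginInternalLink(MWPARSERCONTEXT *context, pANTLR3_STRING target)
{
    context->internalLinkOpen = true;
    MWLISTENER *l = &context->listener;
    l->beginInternalLink(l, target);
}

/* ...and the end of the inline text implicitly closes it if it is
 * still open, so no lookahead is needed to decide whether '[[Link|'
 * really was a link. */
static void
onEndOfInlineText(MWPARSERCONTEXT *context)
{
    if (context->internalLinkOpen) {
        MWLISTENER *l = &context->listener;
        l->endInternalLink(l);
        context->internalLinkOpen = false;
    }
}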
/Andreas
In my post about tokenizing wikitext, I had overlooked one important
detail in the lexer: the combination of lookahead and context-sensitive
token productions does not play well together.
The token productions execute actions. For the sake of this
discussion, these actions can be divided into two categories: context
updating actions and other actions.
If during lookahead, the same context updating actions are not
executed as they would be when actually consuming the input stream,
the lookahead will be based on an incorrect context, and hence, its
conclusion may be incorrect. Meanwhile, other actions must not be
executed during a lookahead. Furthermore, once the lookahead has
completed, the context must be restored to its state at the point
where the lookahead started.
The problem now is that writing lookahead rules that execute the
actions correctly will become very complex and prone to error. Also,
invoking a new instance of the lexer to perform the lookahead seems
like a lot of overhead. I will instead explore a more elegant
solution: speculative execution.
The implementation is very simple: at a decision point, save a "mark"
that contains a mark in the input stream, a copy of the context, and a
mark in the output channels. Speculatively execute one of the
decisions. The execution will eventually reach either a success
condition or a failure condition. On a success condition, the mark is
removed and the execution permanently accepted. On a failure, the
input stream is rewound, the context is restored, the produced tokens
are taken back, and the alternative decision is taken.
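In pseudo-C, a decision point looks roughly like the sketch below.
MARK() and REWIND() are the runtime's input stream operations; the
context snapshot, the token counter, and takeBackTokens() stand in for
the custom additions, and the MWLEXER type is illustrative:

/* Sketch of a decision point during speculative execution. */
typedef struct {
    ANTLR3_MARKER  input;     /* mark in the input stream       */
    MWLEXERCONTEXT context;   /* copy of the lexer context      */
    int            ntokens;   /* mark in the output channels    */
} SPECULATION_MARK;

static bool speculate(MWLEXER *lexer, bool (*decision)(MWLEXER *))
{
    SPECULATION_MARK m;
    m.input   = MARK();
    m.context = *lexer->context;
    m.ntokens = lexer->nEmittedTokens;

    if (decision(lexer)) {             /* reached a success condition */
        return true;                   /* permanently accepted        */
    }
    REWIND(m.input);                   /* failure: rewind the input,  */
    *lexer->context = m.context;       /* restore the context,        */
    takeBackTokens(lexer, m.ntokens);  /* and take back the tokens    */
    return false;                      /* caller tries the alternative */
}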
Now, the antlr C runtime doesn't actually support taking back tokens
that have been produced. It happens to be easy to add such a feature,
but I don't think that it is likely to be included in an official
antlr release. So the drawback is that we will permanently be
depending on a customized runtime.
There are six different situations where speculative execution would
be initiated:
+------------------+----------------------+---------------------+---------------------+
|Token production  |Start condition       |Success condition    |Failure condition    |
+------------------+----------------------+---------------------+---------------------+
|HEADING_BEGIN     |1-6 '=' characters at |A matching number of |Otherwise EOL or EOF |
|                  |BOL                   |'='s followed by zero|                     |
|                  |                      |or more spaces and   |                     |
|                  |                      |EOL                  |                     |
+------------------+----------------------+---------------------+---------------------+
|INDENT            |1 or more spaces or   |EOL or EOF           |Block element token  |
|                  |tabs at BOL           |                     |                     |
+------------------+----------------------+---------------------+---------------------+
|INTERNAL_LINK     |'[[Link|'             |']]'                 |'[[' or EOF          |
+------------------+----------------------+---------------------+---------------------+
|EXTERNAL_LINK     |'[http://example.com '|']' not inside       |'[' (including start |
|                  |                      |INTERNAL_LINK or     |of INTERNAL_LINK and |
|                  |                      |MEDIA_LINK           |MEDIA_LINK), EOL, EOF|
+------------------+----------------------+---------------------+---------------------+
|MEDIA_LINK        |'[[File:image.png|'   |']]'                 |MEDIA_LINK start     |
|                  |                      |                     |condition, EOF       |
+------------------+----------------------+---------------------+---------------------+
|MEDIA_LINK_CAPTION|'|' <non-option>      |']]'                 |another '|'          |
|                  |                      |                     |<non-option>, EOF    |
+------------------+----------------------+---------------------+---------------------+
Notice that 'EOF' is a condition that must be matched. This is a
problem as it is impossible to match the empty string at EOF---the
runtime simply will not call the lexer rules when the input stream has
reached its end. This means that an additional feature is required
from the runtime: the ability to install an 'end-of-file-action', with
the ability to rewind the input stream. This is, fortunately, trivial
to implement.
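A sketch of what the installed action might do; installEofAction() is
the hypothetical new runtime call, and the other names are
illustrative:

/* Treat EOF as the pending speculation's success or failure
 * condition and rewind the input stream so that the lexer rules get
 * called again. */
static void onEndOfFile(void *userData)
{
    MWLEXER *lexer = (MWLEXER *)userData;
    if (lexer->speculating) {
        resolveSpeculationAtEof(lexer);
    }
}

static void setupLexer(MWLEXER *lexer, pANTLR3_INPUT_STREAM input)
{
    installEofAction(input, onEndOfFile, lexer);
}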
The media link caption speculation is required if the current behavior
is to be reproduced exactly. However, I don't think that the extreme
complexity of reproducing this behavior is justified, so I will write
a separate post on how to make a reasonable simplification of this
later.
One important question is whether speculative execution could
potentially lead to super-linear worst case scanning times. So, what
would the worst possible crafted input look like?
HEADING_BEGIN and INDENT cannot be combined. Also, it is in
all cases impossible for the start condition to occur inside of a
speculative production (assuming that medialinks are restricted
from nesting). On top of that, external links are disabled inside
internal links. So, at most three speculative executions will ever
be initiated at the same time. The worst case should look like this:
= [[File:image.png| [[Link| [http://example.com very long line of text
running to EOF ... [[
The execution would look like this:
try heading1, try media link, try internal link, (external link
disabled so don't try that here), fail internal link, reverse, try
external link, fail external link, reverse, fail media link, reverse,
try internal link, fail internal link, reverse, try external link, fail
external link, reverse, fail heading1, reverse, try media link, try
internal link, fail internal link, reverse, try external link, fail
external link, reverse, fail media link, reverse, try internal
link, fail internal link, reverse, try external link, fail external
link, reverse, succeed.
So, some contents may be scanned 12 times, but this still means worst
case linear time. This should be investigated more rigorously,
however. It would also be possible to use a hash table of failed
speculations to avoid the redundant retries altogether. (I'm actually
storing the last failure point for any speculation, so the line in the
above example would only be scanned 4 times. However, the 12-times
scanning can still be provoked by adding additional start conditions.)
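The bookkeeping can be sketched like this; the layout is my
illustration (a simple exact-position memo), while the real code's
condition is more refined:

/* Remember where each kind of speculation last failed, and refuse to
 * start the same speculation from the same position again. */
enum {
    SPEC_HEADING, SPEC_INDENT, SPEC_INTERNAL_LINK,
    SPEC_EXTERNAL_LINK, SPEC_MEDIA_LINK, SPEC_MEDIA_LINK_CAPTION,
    N_SPECULATIONS
};

static ANTLR3_MARKER lastFailure[N_SPECULATIONS];

static bool worthSpeculating(int spec, ANTLR3_MARKER pos)
{
    return lastFailure[spec] != pos;
}

static void recordFailure(int spec, ANTLR3_MARKER pos)
{
    lastFailure[spec] = pos;
}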
It should be pointed out that the only speculation that is expected to
ever fail inside a well formed wiki page is the INDENT speculation.
Any other failure is either due to a typo or due to a malicious user
trying to DoS attack the server. If 12 scans is the worst
possible scenario, such an attack will not be very efficient.
/Andreas
The parser I am writing follows a four layered design:
- client application
- parser context
- parser
- lexical scanner
In this post I will write about the lexical scanner and what the
special requirements are for wikitext. Since all input is valid
wikitext, the lexer must never fail to accept a character in any
position or state. Also, it is important that the lexer protects the
parser from spurious tokens. The parser grammar will quickly become
very complex if it has to deal with unexpected tokens. It is
therefore beneficial to make the lexer a bit more complex and to make
sure that, for instance, a </li> tag appearing in a particular context
really is supposed to be an html tag token, rather than just the
verbatim characters. To accomplish this, two things are required:
lookahead and the ability to selectively enable/disable individual
token productions.
== Basic tokens
At the very bottom of the lexer there are 7 kinds of tokens:
SPACE_TAB:     SPACE_TAB_CHAR+;
NEWLINE:       '\r\n' | NEWLINE_CHAR;
SPECIAL:       SPECIAL_SYMBOL;
APOS:          '\'';
ASCII_WORD:    ASCII_LETTER+;
WORD:          ~(SPACE_TAB_CHAR|NEWLINE_CHAR|SPECIAL_SYMBOL|'\''|ASCII_LETTER|NON_PRINTABLE_CHAR)+;
NON_PRINTABLE: NON_PRINTABLE_CHAR+ {$channel = HIDDEN;};
These token types cover all characters in the Unicode set; any
character will match exactly one of the above. The SPECIAL rule
matches exactly one character out of a set of characters that may be
used to form more advanced tokens; it captures instances of such
characters that are not part of a token. Other lexer rules may also
fall back to producing SPECIAL tokens, which may contain multiple
characters. The apostrophes are handled by the parser, so each is
passed on as a token of its own. ASCII_WORD needs to be treated
separately in order to support internal link prefixes and trails. A
word is very loosely defined as a sequence of characters that do not
fall into any other category of characters. Non-printable characters
are hidden from the parser.
If we removed all lexer rules except the basic ones above, the
result of the parsing would be a text split into paragraphs where
all markup is rendered verbatim and where non-printable characters
have been stripped away. All other kinds of formatting are triggered
by the lexer producing a more complex token.
== Lookahead
Some tokens aren't tokens unless followed by a proper closing token.
Thus, we must sometimes perform a lookahead in the input stream before
accepting a token.
1. Syntactic predicates cannot be used for lookahead. E.g., the rule
('ab')=> 'a' will match 'a' regardless of what follows. The
trailing 'b' in the lookahead is silently ignored by Antlr.
2. Using Antlr's builtin backtracking has two major problems:
   1. The DFA tables become extremely large.
   2. Passing arguments to lexer rules will not work. (May be specific
      to the C target.)
Instead I am using the following pattern to perform lookahead:
RULE
@init{
    bool success;
}:
    MATCHING_CHARACTERS RULE_LOOKAHEAD[&success]
    {
        if (!success) {
            // Fall back to instead producing a SPECIAL token, or
            // whatever else might be appropriate.
        }
    }
    ;

RULE_LOOKAHEAD[bool *success]
@init {
    ANTLR3_MARKER mark;
}:
    {
        mark = MARK();
    }
    (
          (ALTERNATIVE1 {*success = true;})
        | (ALTERNATIVE2 {*success = false;})
        | ...
    )
    {
        REWIND(mark);
    }
    ;
This is a bit clumsy syntactically, but it does the job.
When falling back to producing different tokens, it may sometimes be
convenient to be able to produce several tokens at once. This is,
however, not supported by the C target. I have provided a patch for
this, but Jim Idle said that it could not be included in the main
distribution of libantlr3c, since, I guess, it would break some other
things, so we may not want to rely on this.
== Selectively enabling and disabling token productions
By maintaining a lexer context and using the semantic predicates in
Antlr, it is possible to completely avoid generating spurious tokens.
The basic pattern is as follows:
RULE: {!CX->ruleDisabled}?=> RULE_PATTERN { onRule(CX); }
;
So, there is a predicate 'ruleDisabled' that guards the rule, and a
method 'onRule' that reports to the context that the rule was taken.
To take a specific example:
BEGIN_TABLE:
    {BOL && !CX->wikitextTableOpenDisabled}?=> (SKIP_SPACE '{|')=>
    SKIP_SPACE
    BEGIN_TABLE_INTERNAL
    ;

fragment
BEGIN_TABLE_INTERNAL
@init{
    pANTLR3_VECTOR attrs = NULL;
}:
    '{|' {onWikitextTableOpen(CX);}
    ATTRIBUTE_LIST_TABLE[&attrs]
    { CUSTOM = attrs; }
    ;
The logic that governs the 'ruleDisabled' predicates is fairly complex,
and writing the code for it directly is a very error-prone process.
Instead, I have opted to write a table with a description of each
predicate and let a script generate the code. For instance, the
description for wikitext table open and close predicates looks like
this:
array(
    'name'               => 'wikitextTableOpen',
    'close'              => 'wikitextTableClose',
    'initiallyDisabled'  => array(),
    'affects'            => array(new TypeEnable('wikitextTable',
                                                 'WIKITEXTTABLE')),
    'mayNest'            => true,
    'haveNestingCount'   => true,
    'types'              => array('block', 'wikitextTable'),
    'pushesBlockContext' => true,
),
(It may not be immediately clear what each attribute means, but I have
provided documentation for these in the source file.) This is one of
the most complex predicates to describe. The script would generate
the methods onWikitextTableOpen/Close and enable/disableWikitextTable.
They will look something like this:
static inline void
onWikitextTableOpen(struct MWLEXERCONTEXT_struct *context)
{
    context->wikitextTableOpenNestingLevel++;
    enableWikitextTable(context, DC_WIKITEXTTABLE);
    context->wikitextTableCloseDisabled &= ~DC_OPENED;
    pushBlockContext(context, &context->wikitextTableOpenDisabled);
}

static inline void
onWikitextTableClose(struct MWLEXERCONTEXT_struct *context)
{
    context->wikitextTableOpenNestingLevel--;
    if (context->wikitextTableOpenNestingLevel == 0) {
        context->wikitextTableCloseDisabled |= DC_OPENED;
        disableWikitextTable(context, DC_WIKITEXTTABLE);
        context->wikitextTableOpenDisabled &= ~DC_WIKITEXTTABLE;
    }
    popBlockContext(context, &context->wikitextTableOpenDisabled);
}
=== Block Context
The block context is a generalization of the table context concept
that I have mentioned in a previous email.
The option 'pushesBlockContext' that is visible in the above example
states that the opening token corresponding to a particular predicate
will create a new "block context" and push this to a stack. The
corresponding end token will pop the stack and restore the previous
block context.
When a particular block context is activated, the corresponding
closing token will be activated, as well as any tokens internal to the
particular block context. E.g., the '<ul>' token will enable the
'</ul>' and the '<li>' token. Deactivating a block context will
disable the tokens again.
So, pushing a new block context includes deactivating the current
context and activating the new one; popping includes deactivating the
current block context and activating the previous one, if any.
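As a sketch of the two stack operations (pushBlockContext and
popBlockContext are called from the generated code shown earlier; the
struct layout and the activate/deactivate helpers are illustrative):

/* Sketch of the block context stack. */
typedef struct BLOCKCONTEXT {
    struct BLOCKCONTEXT *prev;          /* context restored on pop   */
    int                 *openDisabled;  /* the pushing rule's flags  */
} BLOCKCONTEXT;

static void pushBlockContext(struct MWLEXERCONTEXT_struct *cx,
                             int *openDisabled)
{
    if (cx->blockContextTop != NULL) {
        deactivateBlockContext(cx, cx->blockContextTop);
    }
    BLOCKCONTEXT *bc = allocBlockContext(cx);
    bc->prev          = cx->blockContextTop;
    bc->openDisabled  = openDisabled;
    cx->blockContextTop = bc;
    /* Enables the closing token and the tokens internal to the new
     * context, e.g. '</ul>' and '<li>' for a '<ul>'. */
    activateBlockContext(cx, bc);
}

static void popBlockContext(struct MWLEXERCONTEXT_struct *cx,
                            int *openDisabled)
{
    /* openDisabled could be used to verify balanced nesting. */
    BLOCKCONTEXT *bc = cx->blockContextTop;
    deactivateBlockContext(cx, bc);
    cx->blockContextTop = bc->prev;
    if (cx->blockContextTop != NULL) {
        activateBlockContext(cx, cx->blockContextTop);
    }
    freeBlockContext(cx, bc);
}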
As a consequence, the html tokens '<li>', '<td>', '<tr>', etc., and the
wikitext table tokens '|-', '|', '||', etc., will be disabled and the
corresponding text string rendered verbatim if occurring outside the
appropriate context. Also, block elements that push the block
context must be closed with the correct end token. For example:
<ul>
<li><div>text</li>
</ul>
will be rendered as:
<ul><li><div>text</li> </ul> </div></li></ul>
Since the <div> is still open, the end tokens for the list will not
be active. The open elements are instead closed at the end of the
article.
By restricting the block contexts in this way, lots of problems are
avoided in the parser. Since the behavior doesn't seem to be well
defined in the current parser, I would guess that this would be
acceptable.
To summarize this rather long post, the lexer is the most complex
part, but by letting it do a thorough filtering of the input, the
parser's job becomes much simpler.
Best regards,
Andreas Jonsson
Mixing HTML elements with wikitext is a grey area. How the HTML tags
in the wikitext interact with the wikitext elements does not seem very
well defined. Therefore, I will make up some rules
where I try to preserve any legitimate use of html elements, but with
some restrictions to avoid some problems:
1. Do not allow html block elements inside wikitext lists. For example,
this is no longer allowed:
* item1 <li> item2
The problem is that it becomes very hard to determine where the
wikitext list element ends. Should it run until the inner block
elements are properly closed? Should the inner block elements be
implicitly closed at the end of the list element? What if the
inner html is malformed? What if a new wikitext list element is
opened before the inner block elements are closed? The current
behavior does not seem very well defined; it seems to generate
garbage html in most cases. Wikitext lists inside html list elements
should be ok, though:
<ul><li>item1</li>
<li>
* item 2.1 ( </li> tag will not be active here )
</li>
</ul>
2. Do not allow table html tags inside wikitext tables, unless opened
up by a nested html table, which disables wikitext table tokens
until the html table is properly closed:
{|
| col1 <td> tag disabled, so still col1
| <table><td> implicitly open up <tbody> and <tr>
|} wikitext table tokens disabled, thus still in html table.
{|
| However, allow wikitext tables to nest inside html tables. Here
  the html tokens, <td>, <tr>, etc., are once again disabled.
|}
</table> (inner close tags implied).
| col2
|}
So, we'll get two different kinds of table contexts, which may be
arbitrarily nested, but not mixed.
So, the question is: would these restrictions allow sufficient
backwards compatibility with the current parser?
Some other thoughts on parsing of HTML-like tags:
* <br [attributes]> Also allowed in the form <br [attributes]/>. No
</br> tag exists.
* <hr [attributes]> Same as <br> except that it also terminates inline
text.
* <img [attributes]> Same as <br> except that it is enabled/disabled
via a configuration option.
* <p [attributes]> Opening tag enables closing tag </p> and disables
itself until the end of the current inlined text. <p> opens up a
new paragraph, </p> closes the current inlined text.
* Inlined html elements. These can be used for long term formatting.
The context will make sure they are correctly nested, closed on end
of inlined text and reopened at beginning of inlined text. They are
permanently closed at the corresponding end tag, or at end of
article. Variants:
* non-nesting (disables the start tag when entered) / nesting (adds
to a nesting level when entered).
* may be empty / may not be empty (empty instances will be ignored)
* Block html elements. Start and end tags terminate inline text.
(They may _not_ be nested inside paragraphs.) Inline text inside
<ol> and <ul> implies <li>, inlined text inside <dl> implies <dd>,
inline text inside <div> implies <p>, inline text inside <table>
implies <tbody><tr><td>, and <h1>-<h6> disables wikitext block element
tokens, in addition to all html block element tokens except the
corresponding closing </hX> token.
* <pre> disables all html elements and all block elements (both wikitext
and html block elements).
* <ins> and <del> will be inline if occurring inside inlined text.
Otherwise block.
* <a> disables wikitext link tokens.
* Tag extensions are treated like <nowiki>; the contents are passed
verbatim to the corresponding callback function. The parser may be
called recursively if the extension needs to parse wikitext.
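A tag extension hook could then have a shape like this; the signature
is purely illustrative, not the actual libmwparser interface:

/* Sketch of a tag extension callback: the contents arrive verbatim,
 * and the callback may invoke the parser recursively if it wants the
 * contents parsed as wikitext. */
typedef void (*MWTAGEXTENSION)(MWLISTENER *listener,
                               const char *tagName,
                               pANTLR3_VECTOR attributes,
                               pANTLR3_STRING verbatimContents);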
Best Regards,
Andreas Jonsson
List item tokens must appear at the beginning of the line, as lists
aren't allowed inside a preformatted box:
: foo  -> <dl><dd> foo </dd></dl>
 : foo -> <pre> : foo </pre>
However, if you put a table inside it, the list item token may appear
indented:
:{|
|foo
|} -> <dl><dd><table><tbody><tr><td> foo
</td></tr></tbody></table></dd></dl>
/Andreas
Hello,
I am initiating yet another attempt at writing a new parser for
MediaWiki. It seems that more than six months have passed since the
last attempt, so it's about time. :)
Parser functions, magic words and html comments are better handled by
a preprocessor than by trying to integrate them with the parser (at
least if you want to preserve the current behavior). So I am only
aiming at
implementing something that can be plugged in after the preprocessing
stages.
In the wikimodel project (http://code.google.com/p/wikimodel/) we are
using a parser design that works well for wiki syntax; a front end
(implemented using an LL-parser generator) scans the text and feeds
events to a context object, which can be queried by the front end to
enable context sensitive parsing. The context object will in turn
feed a well formed sequence of events to a listener that may build a
tree structure, generate xml, or any other format.
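Expressed as C-style interfaces, the layering might look roughly like
the sketch below (illustrative only, not actual wikimodel or
libmwparser declarations):

#include <stdbool.h>

/* The listener is the client application's side of the interface. */
typedef struct MWLISTENER MWLISTENER;
struct MWLISTENER {
    void *userData;
    /* One function pointer per event, e.g.: */
    void (*beginParagraph)(MWLISTENER *l);
    void (*endParagraph)(MWLISTENER *l);
    void (*onText)(MWLISTENER *l, const char *text);
};

/* The context sits between the front end and the listener. */
typedef struct {
    MWLISTENER listener;  /* receives a well formed event sequence */
    /* State the front end can query for context sensitive parsing,
     * e.g. which formats are currently open: */
    bool italicOpen;
} MWPARSERCONTEXT;

/* The front end (the generated parser) feeds raw events downwards... */
void onItalicToggle(MWPARSERCONTEXT *context);
/* ...and queries the context: */
bool isItalicOpen(const MWPARSERCONTEXT *context);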
As for parser generators, Antlr seems to be the best choice. It has
support for semantic predicates and rather sophisticated options for
backtracking. I'm peeking at Steve Bennet's antlr grammar
(http://www.mediawiki.org/wiki/Markup_spec/ANTLR), but I cannot really
use that one, since the parsing algorithm is fundamentally different.
There are two problems with Antlr:
1. No php back-end
Writing a php back-end to antlr is a matter of providing a set of
templates and porting the runtime. It's a lot of work, but seems
fairly straightforward.
The parser can, of course, be written in C and be deployed as a php
extension. The drawback is that it will be harder to deploy it,
while the advantage is the performance. For MediaWiki it might be
worth maintaining both a php and a C version, though, since both
speed and deployability are important.
2. No UTF-8 support in the C runtime in the latest release of antlr.
In trunk it has support for various character encodings, though, so
it will probably be there in the next release.
My implementation is just at the beginning stages, but I have
successfully reproduced the exact behavior of MediaWiki's parsing of
apostrophes, which seems to be by far the hardest part. :)
I put it up right here if anyone is interested in looking at it:
http://kreablo.se:8080/x/bin/download/Gob/libmwparser/libwikimodel%2D0.1.ta…
Best regards,
Andreas Jonsson