I see that we have a lot of trouble with wikitext. I wonder whether this
idea is feasible: we concentrate on building a WYSIWYG editor that produces
XML-like content, so that:
- people don't have to know any wiki language
- developers can forget about wikitext and its complexity
To prove that it is possible, a friend of mine and I have written an XML-like
language for wikis (in Vietnamese). I can ask him to translate it into
English (because I'm not very good at English).
Here's a half-thought about the difficulties of parsing near-correct
syntax like this:
[[image:foo.jpg|thumb|Some '''valiant''' attempt at including an image
that fails because it only has one trailing square bracket.]
This is quite expensive to parse. According to current practice, we
should render this literally, except that the 'valiant' should be
rendered in bold. If text like this were to be frequently parsed, that
would add up to a lot of computational effort for not much gain -
clearly the user didn't *really* want two square brackets, the word
'image' etc.
So some possibilities:
- Detect the error at save time, and alter the text to some more
parser-friendly but equivalent wikitext, e.g.:
<nowiki>[[</nowiki>image:foo.jpg...
- Detect the error at save time, and wrap it in some new extension/tag
like <error>[[image:...</error> or {{error|[[image:...}}. This could
have some pretty good benefits (render it in red, generate a list of
errors* somewhere...)
- Detect the error at render time, and shortcut to displaying it strictly
literally (i.e., not attempting to parse the bolded text within). This
way at least you're only parsing it once. (This has implications for the
way some security is handled, like escaping & and < chars...)
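To make the first option a bit more concrete, here is a rough Python sketch of a save-time pass that finds any "[[" with no matching "]]" and rewrites it as <nowiki>[[</nowiki>. The function name escape_unclosed_links is made up for illustration, and the bracket matching is deliberately naive (it ignores templates, existing nowiki sections and so on), so treat it as a sketch under those assumptions rather than what MediaWiki actually does.

```python
def escape_unclosed_links(text: str) -> str:
    """Rewrite every '[[' that never meets a matching ']]' as literal text.

    Hypothetical helper: a naive bracket matcher, not MediaWiki's parser.
    """
    open_positions = []  # stack of indices of '[[' still waiting for ']]'
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "[[":
            open_positions.append(i)
            i += 2
        elif pair == "]]" and open_positions:
            open_positions.pop()  # closes the most recent opener
            i += 2
        else:
            i += 1
    # Anything left on the stack never found its closer. Patch from the
    # right so earlier indices stay valid.
    for pos in reversed(open_positions):
        text = text[:pos] + "<nowiki>[[</nowiki>" + text[pos + 2:]
    return text


broken = "[[image:foo.jpg|thumb|Some '''valiant''' attempt.]"
print(escape_unclosed_links(broken))
```

After this pass the renderer would no longer see a link opener at all, so the only markup left to parse in the example is the bold text.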
Probably in reality slow parsing of incorrect syntax is a very minor
issue. But it was just a thought.
Steve
* I mean "generate a list of friendly suggestions to the user". Yes, I
know everyone goes ballistic at the word "error" and assumes that the
user will not be able to save error-ridden syntax. :)
Does this make the ANTLR problem any simpler?
- d.
---------- Forwarded message ----------
From: Tim Starling <tstarling(a)wikimedia.org>
Date: 17 Feb 2008 06:19
Subject: [Wikitech-l] Preprocessor syntax in ABNF
To: wikitech-l(a)lists.wikimedia.org
Just a fun little project for my Sunday afternoon:
http://www.mediawiki.org/wiki/Preprocessor_ABNF
Turns out the production rules are pretty simple. The magic is in the
disambiguation. An EBNF representation of the whole of MediaWiki wikitext,
if such a thing is possible, would only go a small way towards specifying
the language.
-- Tim Starling
Hi all,
I've published what I'm calling (for no good reason) "draft 10" here:
http://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
Mostly, I got to a certain level of feature completeness.
Specifically, the 4 features that were previously missing
(tables, magic words, categories and inline HTML) have now been
implemented.
I redid the table stuff - turns out I was getting too fancy for my own
good. I now do less semantic checking, and am thus much more tolerant
of borderline input.
I've also cleaned it up a bit and have roughly grouped all the rules
into levels, thus:
Top level, block elements:
line:
      (table) => table^
    | (headerline) => headerline^
    | (listmarker) => listline^
    | (hrline) => hrline^
    | (spaceline) => spaceline^
    | paragraph^ ;
Next level, inline text (generally, stuff that appears within a line,
and doesn't contain newlines):
inline_text
@init { text_levels++; }
    :
    (
        (   (LEFT_BRACKET LEFT_BRACKET LEFT_BRACKET) => literal_left_bracket
        |   (literal_left_bracket bracketed_url) => literal_left_bracket
        |   (image) => image
        |   (category) => category
        |   (external_link) => external_link
        |   (internal_link) => internal_link
        |   (magic_link) => magic_link
        |   (magic_word) => magic_word
        |   pre_block
        |   (formatted_text_elem) => formatted_text_elem
        )
        ((nbsp_before_punctuation) => nbsp_before_punctuation)?
        ((ws) => printing_ws)?
    )+;
finally { text_levels--; }
The exception there is <pre> blocks which really do contain newlines.
Next level down is formatted text, which can appear in places like
link captions:
formatted_text
@init { text_levels++; }
    :
    (
        (formatted_text_elem) => formatted_text_elem
        ((nbsp_before_punctuation) => nbsp_before_punctuation)*
        ((printing_ws) => printing_ws)?
    )+;
finally { text_levels--; }
formatted_text_elem:
    (
          (accidental_magic_link) => accidental_magic_link
        | ((punctuation_before_nbsp) => punctuation_before_nbsp)
        | (APOSTROPHES) => bold_and_italics
        | angle_tag
        | ((html_entity) => html_entity)
        | unformatted_characters
    );
And the very lowest level is unformatted characters:
unformatted_characters:
    (     html_dangerous
        | punctuation
        | meaningless_characters
        | digits
    );
Anyway, when I say "feature complete", most of the major features that
I know of are present in some form. None of them is complete in itself
(except perhaps images), but it's a start.
So what next: suggestions for more features to add would be
handy. Also, I need to get around to making it do more than just
generate an AST. Theoretically it's not too much work to take the AST
and spit out some kind of XHTML.
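As a rough idea of what the AST-to-XHTML step could look like, here is a small Python sketch of a recursive tree walk. The Node class, the node kinds and the tag mapping are all invented for illustration and are not the tree that the ANTLR grammar actually produces; the point is only that each node kind maps to a wrapper element and text leaves get escaped.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Node:
    """Hypothetical AST node: a kind, optional text, optional children."""
    kind: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Assumed mapping from node kinds to XHTML element names.
TAGS = {"paragraph": "p", "bold": "b", "italics": "i", "listline": "li"}


def to_xhtml(node: Node) -> str:
    """Render a node and its subtree as an XHTML fragment."""
    if node.kind == "text":
        return escape(node.text)  # escapes &, < and > (and quotes)
    inner = "".join(to_xhtml(child) for child in node.children)
    tag = TAGS.get(node.kind)
    # Unknown kinds contribute only their children's output.
    return f"<{tag}>{inner}</{tag}>" if tag else inner


tree = Node("paragraph", children=[
    Node("text", "Some "),
    Node("bold", children=[Node("text", "valiant")]),
    Node("text", " attempt & more"),
])
print(to_xhtml(tree))
```

Escaping at the text leaves keeps the "html_dangerous" characters from leaking into the output, whatever the surrounding markup turns out to be.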
It would also be nifty if someone could figure out a way of embedding
wikitext into the grammar to mark it up somehow. Does section
inclusion work yet? If so, would it be possible to insert comments
somehow that would allow other pages to transclude sections? Then some
of the documentation could be stored outside the grammar itself, yet
shown alongside...
Steve
I have successfully parsed my first nested table. It's 3 in the
morning but I'm quite happy :)
One of the really complicated bits about the nested table syntax is
that the contents of multi-line cells look exactly like normal text
(with lists, headers, tables and so forth) except that each line can't
begin with a pipe. I tried at least 4 different ways of implementing
that rule (my practical ANTLR knowledge is still pretty weak), and
finally this simple method worked:
nonpipeline:
      (table) => table^
    | (headerline) => headerline^
    | (listmarker) => listline^
    | (hrline) => hrline^
    | (spaceline) => spaceline^
    | (nonpipe paragraph?)^ ;
It's a complete duplicate of the normal "line" rule, except with the
addition of "nonpipe" before the paragraph.
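For anyone who doesn't read ANTLR, the same idea can be sketched in Python as an ordered list of tests tried one after another, with the paragraph fallback refusing lines that start with a pipe when we are inside a table cell. The tests below are my own rough guesses at what each block rule matches, not a translation of the grammar, and classify is a made-up name.

```python
import re

# (name, test) pairs tried in order, mirroring the ordered alternatives.
# The patterns are assumptions for illustration only.
BLOCK_TESTS = [
    ("table", lambda s: s.startswith("{|")),
    ("headerline", lambda s: re.match(r"=+.*=+\s*$", s) is not None),
    ("listline", lambda s: s[:1] in ("*", "#", ":", ";")),
    ("hrline", lambda s: s.startswith("----")),
    ("spaceline", lambda s: s.startswith(" ")),
]


def classify(line, inside_cell=False):
    """Return the block kind for a line, or None if it cannot be cell content."""
    for name, test in BLOCK_TESTS:
        if test(line):
            return name
    if inside_cell and line.startswith("|"):
        return None  # the pipe belongs to the table syntax, not to a paragraph
    return "paragraph"


print(classify("* item", inside_cell=True))
print(classify("|next cell", inside_cell=True))
```

Only the fallback changes between the two modes, which is the same duplication as in the grammar: everything except the final alternative is shared.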
Anyway, now it's onto the next round of "yes, the grammar works, now
to stop ANTLR spewing 5000 warnings at me".
Steve
Table rows:
{|
|-
|You parse and parse and parse and read and read and you have no idea
whether this is a table cell or a style property for the cell, until
you hit either a | or a ||. Oops, it was just a style property, better
go back and parse it again.
|}
That's kind of evil. For big table rows, that could get very expensive to parse.
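Here is a small Python sketch of why this forces a rescan. It splits one single-line table row into (attributes, content) pairs, and for each cell it has to search the whole cell text for a lone "|" before it knows whether what it read so far was attributes or content. The function name split_row is hypothetical and the rules are heavily simplified (for instance, pipes inside links are not handled).

```python
def split_row(line):
    """Split '|attrs|content || content' into (attrs, content) pairs.

    Simplified sketch: ignores links, templates and header cells.
    """
    if not line.startswith("|"):
        raise ValueError("a cell line must start with a pipe")
    cells = []
    for raw in line[1:].split("||"):
        # partition scans the entire cell looking for a single pipe; only
        # its result tells us how to interpret everything before it.
        before, sep, after = raw.partition("|")
        if sep:
            cells.append((before.strip(), after.strip()))
        else:
            cells.append(("", raw.strip()))
    return cells


print(split_row('|style="color:red"|foo || boo'))
print(split_row("|a long cell with no attributes at all"))
```

In the second call the scan runs to the end of the cell and finds nothing, which is the expensive case for very long cells.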
Steve
Consider this:
{|
|-
|foo || boo || moo moo moo moo moo moo moo
zoo
|}
Being able to deal with both single-line and multi-line cell contents
is quite painful. In the case above, we know that foo and boo are
single-liners only by the time we hit the ||. With the moo's, we only
know it's *not* a single-liner once we hit the newline. And we only
know that the 'zoo' is the continuation of that multi-liner once we
work our way through all that whitespace.
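A line-oriented sketch in Python shows the same thing from the other direction: a cell can only be closed once a later line turns out to start with a pipe, so every line that doesn't is glued onto the last cell seen. The name group_cells is made up and the handling is simplified (no attributes, no header cells, no nested tables).

```python
def group_cells(lines):
    """Collect cell contents from the lines of one simple table."""
    cells = []
    for line in lines:
        if line.startswith("{|") or line[:2] in ("|-", "|}"):
            continue  # table and row markers carry no cell content
        if line.startswith("|"):
            # One or more cells on a single line, separated by '||'.
            cells.extend(part.strip() for part in line[1:].split("||"))
        elif cells:
            # No leading pipe: continuation of the most recent cell,
            # blank lines included.
            cells[-1] += "\n" + line
    return cells


table = ["{|", "|-", "|foo || boo || moo moo moo", "", "zoo", "|}"]
print(group_cells(table))
```

Note that "foo" and "boo" are known to be complete as soon as the "||" is seen, while the last cell stays open until the closing "|}" arrives.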
In my delirious state, I'm dreaming up all sorts of friendlier
syntax...maybe I should write one up somewhere. :)
Steve
Never noticed this before. Compare these:
[[foo|]]
[[foo| ]]
Both render as if they were [[foo]], but the first one is replaced by
[[foo|foo]] at save time, while the second one isn't. Feature or
bug?
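The difference is easy to reproduce with a naive save-time rewrite. This Python sketch only fires when the label after the pipe is completely empty, so a label consisting of a single space is left alone. The regex and the name expand_pipe_trick are my own illustration, not the actual MediaWiki code, and the real pipe trick does more (as far as I know it also strips namespace prefixes and parenthesised suffixes from the generated label).

```python
import re

# Matches [[target|]] where the label after the pipe is completely empty.
EMPTY_LABEL = re.compile(r"\[\[([^\[\]|]+)\|\]\]")


def expand_pipe_trick(text):
    """Fill in an empty link label with the link target (simplified)."""
    return EMPTY_LABEL.sub(r"[[\1|\1]]", text)


print(expand_pipe_trick("[[foo|]]"))
print(expand_pipe_trick("[[foo| ]]"))
```

Whether the whitespace-only label should count as empty is exactly the "feature or bug" question; in this sketch it would come down to allowing optional whitespace before the closing brackets in the pattern.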
I'm a bit skeptical about the need to transform pipetricks at
savetime. I think a developer once explained it as not wanting third
party users of wikitext to have to know the transformation rules, but
that sounds pretty flimsy to me.
Steve
There is no shortage of curious behaviour in our parser when you look
hard enough:
 foo
(note the space at left, this is what I call a 'spaceblock' and
renders as <pre>foo</pre>)
 <pre>foo</pre>
(again with space, renders exactly as before - the parser evidently
decides the extra <pre> is redundant)
But what about:
 foo </pre>
Strangely enough, this renders without a <pre> block at all.
Steve