Here's one that uses lex plus a single pass over tokens, and can generate
valid XHTML. It's a proof-of-concept program, not a drop-in replacement
for the current parser.
=== How it works ===
Wiki syntax is line-based and uses a state-transition model:
a token changes the state from anything to X. HTML is free-form
and its states can nest.
This parser maintains a stack of "inline" elements. Every time it
finds </X>, it checks whether <X> is on the stack. If it is, it pops and closes
every element until it gets to <X>; otherwise it prints a raw </X>.
If it finds <X>, it checks whether it conflicts with something on the stack,
and acts accordingly.
When the "paragraph state" has to change, it closes all open inline tags.
It doesn't preserve whitespace unless necessary (<pre> and wiki pre,
for now also <nowiki>).
Example: <<END
Ala <b>ma kota
Ala <b>ma <b>kota
Ala </b>ma kota
Ala <strong> <b>ma kota
Ala <i> <b>ma kota
END
Output (with \ns inserted): <<END
<p>Ala <b>ma kota </b></p>
<p>Ala <b>ma <b>kota </b></p>
<p>Ala </b>ma kota </p>
<p>Ala <strong> <b>ma kota </strong></p>
<p>Ala <i> <b>ma kota </b></i></p>
END
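The inline-stack rules can be sketched roughly as follows. This is a minimal Python illustration, not the parser's actual code: the token format, the names CONFLICTS, conflicts and render_paragraph, and the conflict table itself (<b> clashing with <b> and <strong>) are assumptions inferred only from the examples above. Tags printed "raw" are emitted unescaped here, just as they appear in the example output.

```python
# Hypothetical conflict table, inferred from the examples only:
# <b> is refused while <b> or <strong> is open, and vice versa.
CONFLICTS = {"b": {"b", "strong"}, "strong": {"strong", "b"}}


def conflicts(tag, stack):
    """Return True if `tag` clashes with an element already on the stack."""
    blocked = CONFLICTS.get(tag, {tag})  # by default a tag clashes with itself
    return any(open_tag in blocked for open_tag in stack)


def render_paragraph(tokens):
    """tokens: list of ("text", s), ("open", name) or ("close", name)."""
    stack, out = [], ["<p>"]
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
        elif kind == "open":
            if not conflicts(value, stack):
                stack.append(value)      # a real element: remember it
            out.append(f"<{value}>")     # conflicting tags are printed raw
        elif kind == "close":
            if value in stack:
                # Pop and close everything down to the matching <X>.
                while True:
                    top = stack.pop()
                    out.append(f"</{top}>")
                    if top == value:
                        break
            else:
                out.append(f"</{value}>")  # no matching <X>: print raw
    # The paragraph state changes here, so close all open inline tags.
    while stack:
        out.append(f"</{stack.pop()}>")
    out.append("</p>")
    return "".join(out)
```

Feeding it the tokens for "Ala <i> <b>ma kota" gives <p>Ala <i> <b>ma kota </b></i></p>, and the <strong> case leaves the <b> unclosed because it was never pushed.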
Because it has to support both HTML and wiki paragraph control,
the code is quite ugly.
Example: <<END
<ul>
<li>
Ala
</li>
<li>
Ma
<li>
Kota
<ul>
i
<li> Psa
</ul>
END
Output: <<END
<ul> <li> Ala </li> <li> Ma </li><li> Kota
</li><ul> <li>i </li><li> Psa </li></ul>
</ul>
END
As you can see above, an <li> was automatically opened in the nested list.
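One way to get this list behaviour is sketched below in Python. The rules are guesses reverse-engineered from the single example above, not the parser's real logic: text directly inside <ul> opens an <li>, a new <li> or <ul> closes the <li> it lands in, and anything still open at the end is closed. The name render_list and the token format are made up for illustration, and whitespace is ignored.

```python
def render_list(tokens):
    """tokens: list of ("text", s), ("open", name) or ("close", name)."""
    stack, out = [], []

    def open_tag(name):
        stack.append(name)
        out.append(f"<{name}>")

    def close_top():
        out.append(f"</{stack.pop()}>")

    for kind, value in tokens:
        if kind == "text":
            # Text sitting directly in <ul> gets an implicit <li>.
            if stack and stack[-1] == "ul":
                open_tag("li")
            out.append(value)
        elif kind == "open":
            # A new <li> or <ul> closes the <li> it appears in
            # (assumed from the example output).
            if stack and stack[-1] == "li":
                close_top()
            open_tag(value)
        elif kind == "close":
            if value in stack:
                # Close everything down to the matching element.
                while True:
                    top = stack[-1]
                    close_top()
                    if top == value:
                        break
            else:
                out.append(f"</{value}>")  # assumed: same raw rule as inline tags
    while stack:  # close whatever the input left open
        close_top()
    return "".join(out)
```

With the token sequence of the example above, this yields the same element structure as the output shown, including the final </ul> that the input never wrote.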
=== More examples of magic ===
Example: <<END
=== Foo ===
Bar
END
Output: <<END
<h3> Foo </h3><p>Bar </p>
END
But also:
Example: <<END
=== Foo
Bar
END
Output: <<END
<h3> Foo </h3><p>Bar </p>
END
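Both inputs give the same heading, which suggests the closing === is optional. A minimal Python sketch of such a rule, assuming the level comes from the leading run of = only and any trailing run is simply dropped (the names HEADING and render_heading are hypothetical, not from the parser):

```python
import re

# Leading run of 1-6 "=" sets the level; a trailing run of "=" is optional.
HEADING = re.compile(r"^(={1,6})(.*?)(=*)\s*$")


def render_heading(line):
    """Return the heading markup for `line`, or None if it is not a heading."""
    m = HEADING.match(line)
    if m is None:
        return None
    level = len(m.group(1))
    text = m.group(2).strip()
    return f"<h{level}> {text} </h{level}>"
```

Here render_heading("=== Foo ===") and render_heading("=== Foo") return the same string, while a plain line such as "Bar" returns None and would fall through to paragraph handling.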
=== '''-magic ===
It reopens a quote if necessary.
Example (Quotes.txt from the test suite): <<END
Wikipedia quoting tests:
(1) normal '''bold''' normal
(2) normal ''italic'' normal
(3) normal '''''bold italic''''' normal
(4) normal '''bold ''bold italic'' bold''' normal
(5) normal ''italic '''bold italic''' italic''
normal
(6) normal '''''bold italic'' bold''' normal
(7) normal '''''bold italic''' italic'' normal
(8) normal ''italic '''bold italic''''' normal
(9) normal '''bold ''bold italic''''' normal
(10) normal '''bold's''' normal
(11) normal ''italic's'' normal
(12) normal ''italic's '''bold's italic'''
italic's'' normal
(13) normal '''''bold's italic'' bold's'''
normal
(14) normal ''italic''' normal
(15) normal ''''bold''' normal
(16) normal ''italic'' normal ''italic'' normal
(17) normal ''italic'' normal '''bold''' normal
(18) normal '''bold''' normal '''bold'''
normal
(19) normal '''bold''' normal ''italic'' normal
END
Output (with \ns inserted): <<END
<p>Wikipedia quoting tests: </p>
<p>(1) normal <b>bold</b> normal </p>
<p>(2) normal <i>italic</i> normal </p>
<p>(3) normal <b><i>bold italic</i></b> normal </p>
<p>(4) normal <b>bold <i>bold italic</i> bold</b> normal
</p>
<p>(5) normal <i>italic <b>bold italic</b> italic</i> normal
</p>
<p>(6) normal <b><i>bold italic</i> bold</b> normal
</p>
<p>(7) normal <b><i>bold italic</i></b><i>
italic</i> normal </p>
<p>(8) normal <i>italic <b>bold italic</b></i> normal
</p>
<p>(9) normal <b>bold <i>bold italic</i></b> normal
</p>
<p>(10) normal <b>bold's</b> normal </p>
<p>(11) normal <i>italic's</i> normal </p>
<p>(12) normal <i>italic's <b>bold's italic</b>
italic's</i> normal </p>
<p>(13) normal <b><i>bold's italic</i> bold's</b>
normal </p>
<p>(14) normal <i>italic<b> normal </b></i></p>
<p>(15) normal <b>'bold</b> normal </p>
<p>(16) normal <i>italic</i> normal <i>italic</i> normal
</p>
<p>(17) normal <i>italic</i> normal <b>bold</b> normal
</p>
<p>(18) normal <b>bold</b> normal <b>bold</b> normal
</p>
<p>(19) normal <b>bold</b> normal <i>italic</i> normal
</p>
END
Case 7 is not optimal but still 100% correct.
Case 14 has a different interpretation.
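The reopening behaviour can be sketched like this in Python. It is only an illustration under stated assumptions, not the parser's code: '' toggles <i>, ''' toggles <b>, ''''' toggles both (innermost open tag first), and surplus apostrophes in runs of 4 or more than 5 are emitted as literal text after the tags. The names toggle, tags_for_run and render_quotes are invented here, and the <p> wrapper and whitespace handling are left out.

```python
import re

QUOTE_RUN = re.compile(r"'{2,}")


def toggle(tag, stack, out):
    """Open `tag`, or close it and reopen whatever had to be closed on the way."""
    if tag not in stack:
        stack.append(tag)
        out.append(f"<{tag}>")
        return
    reopen = []
    while True:
        top = stack.pop()
        out.append(f"</{top}>")
        if top == tag:
            break
        reopen.append(top)
    for t in reversed(reopen):  # restore the original nesting order
        stack.append(t)
        out.append(f"<{t}>")


def tags_for_run(n, stack):
    if n == 2:
        return ["i"]
    if n == 3:
        return ["b"]
    # Five quotes: close open tags innermost first, then open the rest.
    open_tags = [t for t in reversed(stack) if t in ("b", "i")]
    return open_tags + [t for t in ("b", "i") if t not in stack]


def render_quotes(line):
    stack, out, pos = [], [], 0
    for m in QUOTE_RUN.finditer(line):
        out.append(line[pos:m.start()])
        pos = m.end()
        n, extra = len(m.group()), 0
        if n == 4:            # assumed: one literal apostrophe plus bold
            n, extra = 3, 1
        elif n > 5:           # assumed: surplus apostrophes are literal
            n, extra = 5, n - 5
        for tag in tags_for_run(n, stack):
            toggle(tag, stack, out)
        out.append("'" * extra)
    out.append(line[pos:])
    while stack:              # end of paragraph closes everything
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

On case 7 this closes <i>, closes <b>, then reopens <i>, which is where the slightly redundant </i></b><i> comes from. On case 14 the unmatched ''' simply opens <b> inside the italic, and both are closed at the end of the paragraph.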