Re: [Wikitech-l] Re: Test my lex/yacc parser!

20 Sep 2004

On Mon, 20 Sep 2004 13:56:36 +0100, Rowan Collins
<rowan.collins(a)gmail.com> wrote:
...
  On Mon, 20 Sep 2004 00:05:23 +0100, Timwi
  <timwi(a)gmx.net> wrote:
  Rowan Collins wrote:

  Have you worked out how to deal with "MagicWord" i18n yet?
 Not entirely, but mostly :)

 Here are my thoughts:

 * Redirects are not passed through the lex/yacc parser at all. They can
    be recognised with a regular expression that takes the magic words
    into account.  
 I know the current version doesn't do anything, but I've been meaning
 for a while to finalise a patch to show a message saying "This is a
 redirect to [[foo]]". It's always annoyed me that it parses as though
 it were a numbered list. I was hoping we could then post-process it to
 say "This is a *broken* redirect", and even "This is a double redirect
 (and therefore broken)" etc. How hard would it be to recognise "first
 token of text begins with #"

How about just changing the first redirect to point to the page the
second one is pointing to? That is, change:
A -> B -> C
to
A -> C
so that when people show up at A they just see "this is a redirect" rather
than "this is a broken redirect", because the software has already solved
that automatically.
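
A rough sketch of that idea in Python, purely for illustration. The
alias list REDIRECT_WORDS, the helper names and the in-memory "pages"
dict are all made up here and are not the actual MediaWiki code or
schema. It recognises a redirect with a regular expression built from
the localised magic words, then follows one extra hop to rewrite
A -> B -> C as A -> C:

```python
import re

# Hypothetical per-language aliases for the redirect magic word.
REDIRECT_WORDS = ["#REDIRECT", "#WEITERLEITUNG"]

# Matches the magic word at the start of the text, then captures the link
# target up to the first "]" or "|".
REDIRECT_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(w) for w in REDIRECT_WORDS) + r")\s*\[\[([^\]|]+)",
    re.IGNORECASE,
)


def redirect_target(text):
    """Return the target title if text is a redirect, else None."""
    m = REDIRECT_RE.match(text)
    return m.group(1).strip() if m else None


def collapse_double_redirects(pages):
    """Return a copy of pages in which A -> B -> C is rewritten as A -> C.

    Only one extra hop is followed; longer chains need repeated passes.
    """
    fixed = dict(pages)
    for title, text in pages.items():
        first = redirect_target(text)
        if first is None or first not in pages:
            continue
        second = redirect_target(pages[first])
        # Skip A -> B -> A loops: there is nothing sensible to point at.
        if second is not None and second != title:
            fixed[title] = "#REDIRECT [[" + second + "]]"
    return fixed


pages = {
    "A": "#REDIRECT [[B]]",
    "B": "#redirect [[C]]",
    "C": "Actual article text.",
}
fixed = collapse_double_redirects(pages)
```

With the sample data, "A" ends up pointing straight at "C" and "B" is
left alone. Text that merely starts with "#" (a numbered list item) is
not treated as a redirect.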
...

  * Things like __NOTOC__ and stuff can be handled like this:
    * Regard *everything* of the form __CAPITALLETTERS__ as a special
      token  
 Actually, it can be lower case currently. Unless we're going to hunt
 the database for examples where it is, best just treat
 __anystringofletters__ as needing to be investigated.
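
 Something along these lines, as a minimal Python sketch. The
 KNOWN_MAGIC_WORDS table and the function name are invented for the
 example, and it assumes plain ASCII letters, which would not be
 enough once the words are translated:

```python
import re

# Any run of letters wrapped in double underscores, upper or lower case.
MAGIC_TOKEN_RE = re.compile(r"__([A-Za-z]+)__")

# Illustrative table only; a real one would come from the language files.
KNOWN_MAGIC_WORDS = {"NOTOC", "FORCETOC", "NOEDITSECTION"}


def split_magic_tokens(text):
    """Return (text with recognised magic words removed, set of words found)."""
    found = set()

    def replace(match):
        word = match.group(1).upper()
        if word in KNOWN_MAGIC_WORDS:
            found.add(word)
            return ""
        # Looked like a magic word but is not one: keep it as literal text.
        return match.group(0)

    return MAGIC_TOKEN_RE.sub(replace, text), found


body, flags = split_magic_tokens("__notoc__Intro __init__ text")
```

 Here every __letters__ run is investigated, but only the ones found in
 the table are consumed; "__init__" stays in the text untouched.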

  * The template pseudo-variables (e.g. CURRENTMONTH) are similarly
    handled in post-processing.
 By which do you mean that they are treated as templates and only
 recognised as magic afterwards? Just curious.

  * HTML tags and extension names are either not internationalised, or all
    translations of them are made to work on all Wikipedias.
 That seems a bit of a step backwards to me. Actually, everything that
 looks like a SGML tag has to be treated one of three ways:

 a) it is an extension, and everything from there to its partner should
 be unparsed / sent somewhere else for parsing
 b) it's an allowed HTML tag, and should be put in the parse-tree as
 that kind of element, with its contents parsed "independently" (sort
 of)
 c) it is neither of the above, and needs entity escaping so that it
 doesn't get as far as the browser still looking like HTML

 Perhaps extensions could be made to return a parse sub-tree (even if
 it only has one element). Then we could use an HTML "extension" bound
 to all allowed HTML tags, which just called the original parser back
 on the contents of the tags. Similarly, a no-match handler would
 escape the tags in question and then pass the whole string back for
 normal parsing. Or would that be hopelessly inefficient?
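
 To make the three cases concrete, here is a small Python sketch. The
 two tag tables and the function names are invented for illustration,
 and only case (c), the no-match handler, is implemented in full;
 cases (a) and (b) are merely classified:

```python
import html
import re

# Hypothetical tag tables; the real lists would come from configuration.
EXTENSION_TAGS = {"math", "nowiki"}
ALLOWED_HTML_TAGS = {"b", "i", "div", "span"}

# An opening or closing tag: "<", optional "/", a name, optional attributes, ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def classify_tag(name):
    """Return 'extension' (case a), 'html' (case b) or 'escape' (case c)."""
    name = name.lower()
    if name in EXTENSION_TAGS:
        return "extension"
    if name in ALLOWED_HTML_TAGS:
        return "html"
    return "escape"


def escape_unknown_tags(text):
    """Entity-escape every tag that is neither an extension nor allowed HTML."""

    def replace(match):
        if classify_tag(match.group(2)) == "escape":
            return html.escape(match.group(0))
        return match.group(0)

    return TAG_RE.sub(replace, text)


out = escape_unknown_tags("<b>bold</b> <script>x</script>")
```

 After this pass the string can go back for normal parsing, with the
 unknown tags no longer looking like HTML to the browser.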

 --
 Rowan Collins BSc
 [IMSoP]

 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)wikimedia.org
 http://mail.wikipedia.org/mailman/listinfo/wikitech-l
