On Sat, Nov 10, 2007 at 05:30:53PM +1100, Steve Bennett wrote:
> On 11/10/07, Jay R. Ashworth <jra(a)baylink.com> wrote:
> > They certainly are, if no one ever examines the corpus. I've just
> > banged up a new server in the office, if no one else who already *has*
> > a mirror of, say, en.wp set up steps up, I may do the testing myself,
> > in my Copious Free Time.
>
> What are you proposing, autobotically replacing ''' with **?

Specifically, I was proposing defining the combinations of the current
parser tokens which are difficult to interpret (primarily, combinations
of bold, italics, and apostrophes), and determining how frequently they
appear in the live corpus.
This will delimit the *actual* size of the Installed Base problem, in
both meanings I gave it earlier. If in 2 megapages, there are only 100
occurrences, you fix them by hand. If 1000, you grind a robot. If
500K, then you take a different approach to the overall problem.
(To USAdians, this is referred to as "Dropping back 10, and punting".)
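
As a rough illustration of the kind of count described above, here is a minimal Python sketch. It is not the actual test; the names `is_difficult`, `count_difficult_runs`, and `sample_pages` are hypothetical. The definition of "difficult" used here (apostrophe runs of length 4, or longer than 5) is only an assumption standing in for whatever list of hard token combinations gets agreed on. A real run would stream page text from a dump or mirror rather than a hard-coded list.

```python
import re
from collections import Counter

# Runs of two or more apostrophes. In wikitext, '' is italic,
# ''' is bold, and ''''' is bold italic.
APOSTROPHE_RUN = re.compile(r"'{2,}")


def is_difficult(run_length):
    # Assumption for illustration only: runs of 4, or of more than 5,
    # apostrophes are treated as the hard-to-interpret cases.
    return run_length == 4 or run_length > 5


def count_difficult_runs(pages):
    """Count difficult apostrophe runs, keyed by run length."""
    counts = Counter()
    for text in pages:
        for match in APOSTROPHE_RUN.finditer(text):
            n = len(match.group())
            if is_difficult(n):
                counts[n] += 1
    return counts


# Hypothetical stand-in for page text pulled from a mirror.
sample_pages = [
    "''italic'' and '''bold''' are fine",
    "''''what is this?''''",
    "'''''bold italic''''' then ''''''six",
]

print(count_difficult_runs(sample_pages))
```

The sum of those counts over the whole corpus is the number to compare against the 100 / 1000 / 500K cases above.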
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra(a)baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates
http://baylink.pitas.com '87 e24
St Petersburg FL USA
http://photo.imageinc.us +1 727 647 1274