This happens fairly frequently on very large pages on our wiki. While I
don't have a good, well-rounded solution, I do know that it's easy to
mark such edits as patrolled. Simply hover over the diff link of the
page in question and note the rcid= value at the end of the URL. Then go
to any other properly displayed diff page, grab the URL of the "mark as
patrolled" link, paste that into the URL field of your browser, and
replace its rcid with the one for the page you want to mark as patrolled.
Press enter and it's patrolled.
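For example (the rcid here is made up, and I'm assuming the standard
markpatrolled action), the URL you end up loading looks something like:

http://www.wikihow.com/index.php?title=Sweep-a-Girl-off-Her-Feet&action=markpatrolled&rcid=12345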
Tim
> -------- Original Message --------
> Subject: [Wikitech-l] 78MB diff?
> From: "Travis Derouin" <travis(a)wikihow.com>
> Date: Wed, November 21, 2007 8:33 am
> To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
>
>
> We have a strange diff on our site that appears to be 78MB in size
> that's causing errors:
>
> http://www.wikihow.com/index.php?title=Sweep-a-Girl-off-Her-Feet&diff=13665…
>
> Between this version:
>
> http://www.wikihow.com/index.php?title=Sweep-a-Girl-off-Her-Feet&oldid=1366…
>
> and this version:
>
> http://www.wikihow.com/index.php?title=Sweep-a-Girl-off-Her-Feet&oldid=1366…
>
> (obviously this is vandalism)
>
> It seems like the large diff is a result of a very long list of
> newlines being entered into the revision. I tried putting some error
> checking into DifferenceEngine to avoid displaying or storing large
> diffs in the cache, but it seems like this affects several areas of
> the code. This is the diff that was being stored:
>
> http://207.97.207.17/x/baddiff.html
>
> Any ideas? Has anyone run into this before?
>
> Travis
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)lists.wikimedia.org
> http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Dear All,
especially Anthony and Platonides,
I'm not techy - so why hasn't it been possible to produce a non-corrupt dump
(one that includes history) in such a long time? A professor of mine asked
whether the problem could be (wo)man-power, and whether it would be
interesting/useful to have the university help out with a programmer to make
the dump happen.
Also - I now have a file from 2006, but I still wonder whether there is any
place where one can access old dumps - these will/could be very important
research-wise.
And last but not least - if the dumps don't work, then it is very important
to be able to dump some articles with their full histories some other way.
So I repeat my plea: do you know who put in the block so that export only
allows 100 revisions? Any way to work around that? Would it be possible to
make an exception to get the data for a research study?
Thanks!
Rut
Hi,
Is there something like an oversight function for log entries?
The automatic summary after you delete an article can give away personal
information if it was in the first line of the deleted article. If you
forget to rewrite the summary, this personal information remains accessible
through the logs.
Is there a way to delete a log entry?
(me hopes this is the right list to ask).
Kind regards
Peter van Londen/Londenp
The parser has to recognise and handle magic words like __TOC__. These words
are defined in languages/messages/MessagesXx.php (and can be overridden).
That theoretically means that *anything* (like "a" or even " ") could be a
magic word. That makes it hard to write a fast parser, as basically you
would have to process every character one at a time, look for a match, move
on to the next character...
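For reference, the definitions in languages/messages/MessagesEn.php look
roughly like this (quoting from memory, so treat it as a sketch rather than
the exact file contents):

    $magicWords = array(
        # first element: case-sensitivity flag (0 = case-insensitive,
        # 1 = case-sensitive), followed by the synonyms to recognise
        'toc'   => array( 0, '__TOC__' ),
        'notoc' => array( 0, '__NOTOC__' ),
    );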
So, two questions:
1) Is it possible/feasible to restrict the range of what could be a magic
word in some way, like requiring that they start with __, or limiting them
to some range of characters?
2) Is it possible to get a complete list of all the magic words currently
used for all the languages of Wikipedia? Do the contents of the
languages/messages directory already represent that?
I realise that the term "magic word" is somewhat ambiguous: I'm primarily
referring to words like __TOC__ that can appear virtually anywhere, rather
than words like "subst:" that require a special context, or magic variables
like PAGENAME, which (afaik) have to be wrapped in {{..}}.
Thanks,
Steve
Platonides <Platonides(a)gmail.com> wrote:
I did a proposal on that line last month
http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/34547
You're also welcome to comment on it ;)
Although the main point seems to be whether the file compression is good
enough... The acceptable compression level varies with things like the disk
space the WMF has available for dumps and the need for a better dump
system.
Well, actually, if you read the previous threads on this list, you will see that this has been a recurring topic over the last two months. AFAIK, it has also got the attention of the Board of Trustees, as it is not a joke. It has now been more than a year since the last time we had a complete and valid stub-meta-history dump for enwiki.
Brion has also heard these complaints, so please don't bother him again about it. Currently, he has no time to fix it properly. He also offered some solutions on his blog (read the previous threads, please).
Other big editions (dewiki, frwiki, plwiki...) also present serious problems with complete history dumps. And I think the whole problem arose because the DB server was too stressed and the dump script lost connectivity to the MySQL backend.
We all agree that:
1. We all would like this problem to be fixed soon. Many of us researchers are stopped right now, waiting for new, fresh data.
2. The admins do not have enough time to fix it, because they have more important issues to attend to, and this is normal in a project as big as Wikipedia (let alone the rest of the Wikimedia Foundation projects).
In short: in my humble opinion we should think about setting up:
1. One or several mirrors to duplicate the stub-meta-history info and thus offer alternative data repositories for research on Wikipedia and related projects. We at the URJC offer our facilities to the Wikimedia Foundation (and I think other people in this thread could do that too).
2. An intermediate board of researchers that would serve as a central point of contact (though mirrored in practice) for requesting research data about Wikipedia, and that would centralize requests to the Wikimedia Foundation tech-masters.
This way, everyone could focus their attention on their own tasks, and we would not slow down interesting research work about Wikipedia.
Regards.
Felipe
Since we're on that topic again :-) I'd like to announce that I've
added a script to my wiki2xml package (svn: wiki2xml/php) that runs
the MediaWiki parser tests on it. At first glance, there are many
errors, but at closer view, the XML is actually pretty good in most
cases; just my XML-to-XHTML script is not entirely up to the task yet.
Also, the "expected results" in the parser tests are sometimes rather
MediaWiki-specific. Does it matter if there's a space after <li>? It's
not rendered anyway. Or, "X\nY" vs. "X Y" in HTML - no difference,
AFAIK (except in <pre>). These "non-errors" make up quite a few
"wrong" results in my tests.
Cheers,
Magnus
I've just read the past couple of days of discussion, and would like to
agree with Merlijn.
One of the points missed is that the pipe trick and many of the other
"end cases" are actually pre-processed, not stored in the database.
The easy examples being:
* [[turkey (bird)|]] is stored as [[turkey (bird)|turkey]]
* [[stuff]]ing is stored as [[stuff|stuffing]]
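Just to make the first case concrete, the save-time expansion amounts to
something like this (a rough sketch only -- the real code in the parser
also handles namespace prefixes, commas and so on):

    # [[turkey (bird)|]]  ->  [[turkey (bird)|turkey]]
    $text = preg_replace(
        '/\[\[([^|\]]+?)( \([^)]+\))\|\]\]/',
        '[[$1$2|$1]]',
        $text
    );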
Other such behaviors could be regularized, and not affect the existing
articles. Some years back, I made some suggestions in this wise, but
they were not accepted.
A case I was concerned with at the time was normalized pre-processing
of [[stuff:]] versus [[:stuff]], and [[|stuff]] versus [[stuff|]],
and their combinations -- [[:stuff (action)|]]. This is the kind of
thing that could most easily be formalized.
In regularizing the grammar, think about how the back-end data could be
normalized to a new grammar for editing, and then stored again in the
back-end form. For example, the // and ** ideas we've talked about
multiple times over the years. No reason that the database couldn't
continue to store them as '' and '''. Or better as <i> and <b>!
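To make that idea concrete, here is a deliberately naive sketch (it ignores
problems like // appearing inside URLs, so take it as an illustration of the
idea rather than a workable implementation):

    # accept ** and // in the edit box, keep storing ''' and '' in the DB
    function normalizeForStorage( $text ) {
        $text = preg_replace( '/\*\*(.+?)\*\*/', "'''$1'''", $text );
        $text = preg_replace( '!//(.+?)//!', "''$1''", $text );
        return $text;
    }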
If we stick to just front-end parsing, the project might be doable in
our lifetimes.
===
And as a final note for the computer scientists, remember that we often
use LR(1) and LALR(1) grammars, but RL(1) is also possible! MW syntax
has often seemed to me more like RL....
(Yes, back in university we were all required to write a parser -- a
year-long project. I've written several for later projects, too.
But university was a very long time ago.)
What's the status of the project to create a grammar for Wikitext in EBNF?
There are two pages:
http://meta.wikimedia.org/wiki/Wikitext_Metasyntax
http://www.mediawiki.org/wiki/Markup_spec
Nothing seems to have happened since January this year. Also the comments on
the latter page seem to indicate a lack of clear goal: is this just a fun
project, is it to improve the existing parser, or is it to facilitate a
new parser? It's obviously a lot of work, so it needs to be of clear
benefit.
Brion requested the grammar IIRC (and there's a comment to that effect at
http://bugzilla.wikimedia.org/show_bug.cgi?id=7
), so I'm wondering what became of it.
Is there still a goal of replacing the parser? Or is there some alternative
plan?
Steve