If you are looking to check for image/file usage, it's better to query the
API for just the images used on the page instead of trying to parse the wikitext.
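
For what it's worth, here is a rough sketch of that kind of query, using only the standard library. It assumes the standard MediaWiki query module with prop=images and its imimages filter; the names build_params, uses_file and fetch, and the User-Agent string, are made up for illustration. Pywikibot's Page.imagelinks() should give the same information without hand-rolling the request.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
APPROVED_FILE = "File:Symbol confirmed.svg"


def build_params(page_title, file_title=APPROVED_FILE):
    """Build a prop=images query restricted to a single file."""
    return {
        "action": "query",
        "prop": "images",
        "titles": page_title,
        "imimages": file_title,  # only report this file, if the page uses it
        "format": "json",
        "formatversion": "2",
    }


def uses_file(response, file_title=APPROVED_FILE):
    """Return True if any page in a formatversion=2 response lists file_title."""
    pages = response.get("query", {}).get("pages", [])
    return any(
        image.get("title") == file_title
        for page in pages
        for image in page.get("images", [])
    )


def fetch(page_title, user_agent="dyk-example/0.1 (hypothetical contact)"):
    """Run the query against the live API (needs network access)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_params(page_title))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Usage would be something like uses_file(fetch("Template:Did you know nominations/Bismarck Kuyon")). One caveat: this only tells you the file appears somewhere on the rendered page, not where or in what order.
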
On Tue, Mar 28, 2023 at 10:29 PM Roy Smith <roy(a)panix.com> wrote:
On Mar 28, 2023, at 9:09 PM, Kunal Mehta <legoktm(a)debian.org> wrote:
I suppose it's also worth asking what you're using expand_text() for in
the first place, to see if there's a better way to do whatever it is you
want to :)
That's a fair question.
What I'm doing is looking at DYK nominations to evaluate if they've been
approved. Like so many wiki things, there's no formal definition, but the
simple version is that I'm looking for "File:Symbol confirmed.svg". The
problem is that it may not appear in the raw wikitext. An example is Bismarck
Kuyon
<https://en.wikipedia.org/wiki/Template:Did_you_know_nominations/Bismarck_Kuyon>.
Looking at the page, it's easy to see the green checkmark indicating
approval. But looking at the wikitext source, there's no such thing. What
there is, is a {{DYK checklist}} template which invokes some Lua code that
generates the checkmark based on the values in the other fields. The
expand_text() forces that to get run on the server side.
From a machine-parsability point of view, it's insane. But I gotta work
with what I've been given.
Ultimately, this is going to run as a bot. The fact that it takes a
couple of minutes to evaluate all the nominations of interest isn't
critical. I was doing an interactive web-based version for review
purposes, and for that, waiting 2 minutes for the page to load sucked.
But, I don't really need to do that, so I'll probably just go back to the
serialized version and leave it at that.
One optimization I can see is that I only really need to do the
expand_text() on the subset of nominations which use {{DYK checklist}}, and
not even all of those (sometimes it's possible to determine the approval
state entirely from the text following the {{DYK checklist}}). That will
add a bit more complexity, which I was trying to avoid.
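
A minimal sketch of that filtering step, under the "simple version" of approval described above (the icon appears in the text). The names needs_expansion and is_approved are hypothetical, and expand stands in for whatever performs the server-side expansion, e.g. the site's expand_text(). It does not try to decide approval from the text following the checklist; it only skips the expensive call when the raw wikitext already settles the question.

```python
import re

APPROVED_ICON = "File:Symbol confirmed.svg"

# Opening of a {{DYK checklist}} call; tolerates case differences,
# an underscore in place of the space, and a newline before the first "|".
CHECKLIST_RE = re.compile(r"\{\{\s*DYK[ _]checklist\s*[|}]", re.IGNORECASE)


def needs_expansion(wikitext):
    """Return True if only a server-side expansion can reveal the icon."""
    if APPROVED_ICON in wikitext:
        return False  # already visible in the raw wikitext
    return CHECKLIST_RE.search(wikitext) is not None


def is_approved(wikitext, expand):
    """Check for the approval icon, calling expand() only when needed.

    expand is any callable taking wikitext and returning expanded wikitext
    (an assumption for this sketch, not a fixed API).
    """
    if needs_expansion(wikitext):
        wikitext = expand(wikitext)
    return APPROVED_ICON in wikitext
```

Nominations without a {{DYK checklist}} never trigger the expansion at all, which is where most of the time would be saved.
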
Even deeper down the complexity rathole, I could re-implement the Lua
logic on the client side and avoid the expand_text() completely. I believe
that's what some existing bots, such as WugBot, do. But I really didn't
want to go there.
I did a little reading about your mwbot-rs project. At one point, I was
actually kind of excited about Rust and might have joined you just for the
excuse to learn it. Maybe some day. I am totally about your goal of
"sustainable development of bots and tools". We've got so many tools
(some of which important processes like DYK are totally dependent on) which are,
frankly, a mess of single-purpose code which can't be easily reused for
anything else. What I've been trying to do with dyk-tools is create a
toolkit of reusable components which other people can build upon. But I
seem to be spending most of my time working around silly things like the
{{DYK checklist}} stuff.
Anyway, I hope that answers your question :-)
BTW, I've mentioned this before, but I really can't recommend viztracer
<https://github.com/gaogaotiantian/viztracer> highly enough as a
performance analysis tool. At one level, it's just cProfile on steroids,
but with a snazzy graphical front end. It's what let me figure out that it
was expand(), not get(), which was the most expensive. I uploaded a
screenshot to Commons
<https://commons.wikimedia.org/wiki/File:Screen_Shot_of_viztracer_output.png>.
_______________________________________________
pywikibot mailing list -- pywikibot(a)lists.wikimedia.org