jenkins-bot has submitted this change. ( https://gerrit.wikimedia.org/r/c/pywikibot/core/+/806854 )
Change subject: Fix regression for interwiki script
......................................................................
Fix regression for interwiki script
Commit 0a5f9135 introduced a regression where the interwiki script removes
all interwiki links instead of adding or replacing them.
Patch submitted by winterheart.
Bug: T310964
Change-Id: I3cccdbb4fb0115ec2270ea001826d9c61389ef9e
---
M scripts/interwiki.py
1 file changed, 1 insertion(+), 1 deletion(-)
Approvals:
Xqt: Looks good to me, approved
jenkins-bot: Verified
diff --git a/scripts/interwiki.py b/scripts/interwiki.py
index e217308..76e75ad 100755
--- a/scripts/interwiki.py
+++ b/scripts/interwiki.py
@@ -1224,7 +1224,7 @@
break
if not self.skipPage(page, linkedPage, counter) \
- and self.conf.followinterwiki or page == self.origin \
+ and (self.conf.followinterwiki or page == self.origin) \
and self.addIfNew(linkedPage, counter, page):
# It is new. Also verify whether it is the second on the
# same site
--
To view, visit https://gerrit.wikimedia.org/r/c/pywikibot/core/+/806854
To unsubscribe, or for help writing mail filters, visit https://gerrit.wikimedia.org/r/settings
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Change-Id: I3cccdbb4fb0115ec2270ea001826d9c61389ef9e
Gerrit-Change-Number: 806854
Gerrit-PatchSet: 1
Gerrit-Owner: Xqt <info@gno.de>
Gerrit-Reviewer: D3r1ck01 <xsavitar.wiki@aol.com>
Gerrit-Reviewer: Xqt <info@gno.de>
Gerrit-Reviewer: jenkins-bot
Gerrit-MessageType: merged
Xqt has submitted this change. ( https://gerrit.wikimedia.org/r/c/pywikibot/core/+/806546 )
Change subject: [fix] move get_closest_memento_url to data/memento.py
......................................................................
[fix] move get_closest_memento_url to data/memento.py
Change-Id: I53e7c99b93389907b985e5223bcc4d4134de42a7
---
C pywikibot/data/memento.py
R scripts/_weblinkchecker.py
2 files changed, 0 insertions(+), 0 deletions(-)
Approvals:
Xqt: Verified; Looks good to me, approved
diff --git a/scripts/weblinkchecker.py b/pywikibot/data/memento.py
old mode 100755
new mode 100644
similarity index 100%
copy from scripts/weblinkchecker.py
copy to pywikibot/data/memento.py
diff --git a/scripts/weblinkchecker.py b/scripts/_weblinkchecker.py
similarity index 100%
rename from scripts/weblinkchecker.py
rename to scripts/_weblinkchecker.py
--
To view, visit https://gerrit.wikimedia.org/r/c/pywikibot/core/+/806546
To unsubscribe, or for help writing mail filters, visit https://gerrit.wikimedia.org/r/settings
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Change-Id: I53e7c99b93389907b985e5223bcc4d4134de42a7
Gerrit-Change-Number: 806546
Gerrit-PatchSet: 1
Gerrit-Owner: Xqt <info@gno.de>
Gerrit-Reviewer: Xqt <info@gno.de>
Gerrit-Reviewer: jenkins-bot
Gerrit-MessageType: merged
Xqt has submitted this change. ( https://gerrit.wikimedia.org/r/c/pywikibot/core/+/803232 )
Change subject: [fix] Add a memento_client fix to the framework
......................................................................
[fix] Add a memento_client fix to the framework
- add memento module and import MementoClient and MementoClientException.
Derive MementoClient and add timeout parameters for several methods. The
implementation was originally made with this commit:
https://github.com/mementoweb/py-memento-client/commit/15dc9f520230aaa0793d…
- move weblinkchecker._get_closest_memento_url() to memento.py
as get_closest_memento_url() but keep the blame history
- use pywikibot logging system
- set memento_client requirement to ==0.6.1
- update weblinkchecker.py script
- update tests and rename weblinkchecker_tests to memento_tests
- update doc
- update license
Bug: T185561
Change-Id: I137a27ad198f0e0aae713c888401265f7aca187b
---
M docs/api_ref/pywikibot.data.rst
M docs/licenses.rst
M pywikibot/CONTENT.rst
M pywikibot/data/memento.py
M requirements.txt
R scripts/weblinkchecker.py
M setup.py
M tests/__init__.py
R tests/memento_tests.py
M tox.ini
10 files changed, 326 insertions(+), 771 deletions(-)
Approvals:
jenkins-bot: Verified
Xqt: Verified; Looks good to me, approved
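As a rough usage sketch of what this change provides (based on the signatures
added in pywikibot/data/memento.py below; the dates and URIs are illustrative
only, and memento_client==0.6.1 must be installed):

    from datetime import datetime

    from pywikibot.data.memento import MementoClient, get_closest_memento_url

    # The derived client threads an explicit timeout through its HEAD requests.
    mc = MementoClient()
    info = mc.get_memento_info('http://www.bbc.com/',
                               accept_datetime=datetime(2010, 4, 1),
                               timeout=10)
    print(info['mementos']['closest']['uri'][0])

    # Convenience helper moved here from scripts/weblinkchecker.py:
    archive = get_closest_memento_url('http://www.bbc.com/',
                                      when=datetime(2010, 4, 1),
                                      timegate_uri='http://web.archive.org/web/')
    print(archive)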
diff --git a/docs/api_ref/pywikibot.data.rst b/docs/api_ref/pywikibot.data.rst
index 038c439..12b7d14 100644
--- a/docs/api_ref/pywikibot.data.rst
+++ b/docs/api_ref/pywikibot.data.rst
@@ -11,6 +11,11 @@
.. automodule:: pywikibot.data.api
+pywikibot.data.memento module
+-----------------------------
+
+.. automodule:: pywikibot.data.memento
+
pywikibot.data.mysql module
---------------------------
diff --git a/docs/licenses.rst b/docs/licenses.rst
index 965e28e..f07112a 100644
--- a/docs/licenses.rst
+++ b/docs/licenses.rst
@@ -7,7 +7,10 @@
:ref:`MIT license`; translations by translators and manual pages on
mediawiki.org are available under the `CC-BY-SA 3.0`_ license. The
Pywikibot logo is Public domain but it includes material that may be
-protected as a trademark.
+protected as a trademark. Parts of the :mod:`memento<pywikibot.data.memento>`
+module are licensed under the `BSD`_ open source software license. You
+may obtain a copy of the License at
+http://mementoweb.github.io/SiteStory/license.html.
MIT License
@@ -18,4 +21,4 @@
.. _CC-BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/
-
+.. _BSD: https://github.com/mementoweb/py-memento-client/blob/master/LICENSE.txt
diff --git a/pywikibot/CONTENT.rst b/pywikibot/CONTENT.rst
index 5f100ee..617b52d 100644
--- a/pywikibot/CONTENT.rst
+++ b/pywikibot/CONTENT.rst
@@ -103,6 +103,8 @@
| +----------------+-------------------------------------+
| | _requests.py | API Requests interface |
+----------------------------+----------------+-------------------------------------+
+ | memento.py | memento_client 0.6.1 package fix |
+ +----------------------------+------------------------------------------------------+
| mysql.py | Miscellaneous helper functions for mysql queries |
+----------------------------+------------------------------------------------------+
| sparql.py | Objects representing SPARQL query API |
diff --git a/pywikibot/data/memento.py b/pywikibot/data/memento.py
index 398ba64..ecaf35e 100644
--- a/pywikibot/data/memento.py
+++ b/pywikibot/data/memento.py
@@ -1,185 +1,311 @@
-#!/usr/bin/python3
-"""
-This bot is used for checking external links found at the wiki.
+"""Fix ups for memento-client package version 0.6.1.
-It checks several pages at once, with a limit set by the config variable
-max_external_links, which defaults to 50.
-
-The bot won't change any wiki pages, it will only report dead links such that
-people can fix or remove the links themselves.
-
-The bot will store all links found dead in a .dat file in the deadlinks
-subdirectory. To avoid the removing of links which are only temporarily
-unavailable, the bot ONLY reports links which were reported dead at least
-two times, with a time lag of at least one week. Such links will be logged to a
-.txt file in the deadlinks subdirectory.
-
-The .txt file uses wiki markup and so it may be useful to post it on the
-wiki and then exclude that page from subsequent runs. For example if the
-page is named Broken Links, exclude it with '-titleregexnot:^Broken Links$'
-
-After running the bot and waiting for at least one week, you can re-check those
-pages where dead links were found, using the -repeat parameter.
-
-In addition to the logging step, it is possible to automatically report dead
-links to the talk page of the article where the link was found. To use this
-feature, set report_dead_links_on_talk = True in your user-config.py, or
-specify "-talk" on the command line. Adding "-notalk" switches this off
-irrespective of the configuration variable.
-
-When a link is found alive, it will be removed from the .dat file.
-
-These command line parameters can be used to specify which pages to work on:
-
--repeat Work on all pages were dead links were found before. This is
- useful to confirm that the links are dead after some time (at
- least one week), which is required before the script will report
- the problem.
-
--namespace Only process templates in the namespace with the given number or
- name. This parameter may be used multiple times.
-
--xml Should be used instead of a simple page fetching method from
- pagegenerators.py for performance and load issues
-
--xmlstart Page to start with when using an XML dump
-
--ignore HTTP return codes to ignore. Can be provided several times :
- -ignore:401 -ignore:500
-
-¶ms;
-
-Furthermore, the following command line parameters are supported:
-
--talk Overrides the report_dead_links_on_talk config variable, enabling
- the feature.
-
--notalk Overrides the report_dead_links_on_talk config variable, disabling
- the feature.
-
--day Do not report broken link if the link is there only since
- x days or less. If not set, the default is 7 days.
-
-The following config variables are supported:
-
- max_external_links The maximum number of web pages that should be
- loaded simultaneously. You should change this
- according to your Internet connection speed.
- Be careful: if it is set too high, the script
- might get socket errors because your network
- is congested, and will then think that the page
- is offline.
-
- report_dead_links_on_talk If set to true, causes the script to report dead
- links on the article's talk page if (and ONLY if)
- the linked page has been unavailable at least two
- times during a timespan of at least one week.
-
- weblink_dead_days sets the timespan (default: one week) after which
- a dead link will be reported
-
-Examples
---------
-
-Loads all wiki pages in alphabetical order using the Special:Allpages
-feature:
-
- python pwb.py weblinkchecker -start:!
-
-Loads all wiki pages using the Special:Allpages feature, starting at
-"Example page":
-
- python pwb.py weblinkchecker -start:Example_page
-
-Loads all wiki pages that link to www.example.org:
-
- python pwb.py weblinkchecker -weblink:www.example.org
-
-Only checks links found in the wiki page "Example page":
-
- python pwb.py weblinkchecker Example page
-
-Loads all wiki pages where dead links were found during a prior run:
-
- python pwb.py weblinkchecker -repeat
+.. versionadded:: 7.4
+.. seealso:: https://github.com/mementoweb/py-memento-client#readme
"""
#
-# (C) Pywikibot team, 2005-2022
+# (C) Shawn M. Jones, Harihar Shankar, Herbert Van de Sompel.
+# -- Los Alamos National Laboratory, 2013
+# Parts of MementoClient class codes are
+# licensed under the BSD open source software license.
+#
+# (C) Pywikibot team, 2015-2022
#
# Distributed under the terms of the MIT license.
#
-import codecs
-import datetime
-import pickle
-import re
-import threading
-import time
-import urllib.parse as urlparse
-from contextlib import suppress
-from functools import partial
-from http import HTTPStatus
+from datetime import datetime
+from typing import Optional
+
+from memento_client.memento_client import MementoClient as OldMementoClient
+from memento_client.memento_client import MementoClientException
import requests
+from requests.exceptions import InvalidSchema, MissingSchema
-import pywikibot
-from pywikibot import comms, config, i18n, pagegenerators, textlib
-from pywikibot.backports import Dict, removeprefix
-from pywikibot.bot import ExistingPageBot, SingleSiteBot, suggest_help
-from pywikibot.exceptions import (
- IsRedirectPageError,
- NoPageError,
- SpamblacklistError,
-)
-from pywikibot.pagegenerators import (
- XMLDumpPageGenerator as _XMLDumpPageGenerator,
-)
-from pywikibot.tools import ThreadList
+from pywikibot import config, debug, sleep, warning
+
+__all__ = ('MementoClient', 'MementoClientException')
-try:
- import memento_client
- from memento_client.memento_client import MementoClientException
- missing_dependencies = None
-except ImportError:
- missing_dependencies = ['memento_client']
+class MementoClient(OldMementoClient):
+
+ """A Memento Client.
+
+ It makes it straightforward to access the Web of the past as it is
+ to access the current Web.
+
+ .. versionchanged:: 7.4
+ `timeout` is used in several methods.
+
+ Basic usage:
+
+ >>> mc = MementoClient()
+ >>> dt = mc.convert_to_datetime("Sun, 01 Apr 2010 12:00:00 GMT")
+ >>> mc = mc.get_memento_info("http://www.bbc.com/", dt)
+ >>> print(mc['original_uri'])
+ http://www.bbc.com/
+ >>> print(mc['timegate_uri'])
+ http://timetravel.mementoweb.org/timegate/http://www.bbc.com/
+ >>> print(sorted(mc['mementos']))
+ ['closest', 'first', 'last', 'next', 'prev']
+ >>> del mc['mementos']['last']
+ >>> from pprint import pprint
+ >>> pprint(mc['mementos'])
+ {'closest': {'datetime': datetime.datetime(2010, 2, 28, 8, 5, 38),
+ 'http_status_code': 200,
+ 'uri': ['https://swap.stanford.edu/20100228080538/http://www.bbc.co.uk/']},
+ 'first': {'datetime': datetime.datetime(1998, 12, 2, 21, 26, 10),
+ 'uri': ['http://wayback.nli.org.il:8080/19981202212610/http://bbc.com/']},
+ 'next': {'datetime': datetime.datetime(2010, 5, 23, 13, 47, 38),
+ 'uri': ['https://web.archive.org/web/20100523134738/http://www.bbc.com/']},
+ 'prev': {'datetime': datetime.datetime(1998, 12, 2, 21, 26, 10),
+ 'uri': ['http://wayback.nli.org.il:8080/19981202212610/http://bbc.com/']}}
+
+ The output conforms to the Memento API format explained here:
+ http://timetravel.mementoweb.org/guide/api/#memento-json
+
+ By default, MementoClient uses the Memento Aggregator:
+ http://mementoweb.org/depot/
+
+ It is also possible to use different TimeGate, simply initialize
+ with a preferred timegate base uri. Toggle check_native_timegate to
+ see if the original uri has its own timegate. The native timegate,
+ if found will be used instead of the timegate_uri preferred. If no
+ native timegate is found, the preferred timegate_uri will be used.
+
+ :param str timegate_uri: A valid HTTP base uri for a timegate.
+ Must start with http(s):// and end with a /.
+ :param int max_redirects: the maximum number of redirects allowed
+ for all HTTP requests to be made.
+ :return: A :class:`MementoClient` obj.
+ """ # noqa: E501
+
+ def __init__(self, *args, **kwargs):
+ """Initializer."""
+ # To prevent documentation inclusion from inherited class
+ # because it is malformed.
+ super().__init__(*args, **kwargs)
+
+ def get_memento_info(self, request_uri: str,
+ accept_datetime: Optional[datetime] = None,
+ timeout: Optional[int] = None,
+ **kwargs) -> dict:
+ """Query the preferred timegate and return the closest memento uri.
+
+ Given an original uri and an accept datetime, this method
+ queries the preferred timegate and returns the closest memento
+ uri, along with prev/next/first/last if available.
+
+ .. seealso:: http://timetravel.mementoweb.org/guide/api/#memento-json
+ for the response format.
+
+ :param request_uri: The input http uri.
+ :param accept_datetime: The datetime object of the accept
+ datetime. The current datetime is used if none is provided.
+ :param timeout: the timeout value for the HTTP connection.
+ :return: A map of uri and datetime for the
+ closest/prev/next/first/last mementos.
+ """
+ # for reading the headers of the req uri to find uri_r
+ req_uri_response = kwargs.get('req_uri_response')
+ # for checking native tg uri in uri_r
+ org_response = kwargs.get('org_response')
+ tg_response = kwargs.get('tg_response')
+ if not tg_response:
+ native_tg = None
+
+ original_uri = self.get_original_uri(
+ request_uri, response=req_uri_response)
+
+ if self.check_native_timegate:
+ native_tg = self.get_native_timegate_uri(
+ original_uri, accept_datetime=accept_datetime,
+ response=org_response)
+
+ timegate_uri = native_tg if native_tg \
+ else self.timegate_uri + original_uri
+
+ http_acc_dt = MementoClient.convert_to_http_datetime(
+ accept_datetime)
+
+ tg_response = MementoClient.request_head(
+ timegate_uri,
+ accept_datetime=http_acc_dt,
+ follow_redirects=True,
+ session=self.session,
+ timeout=timeout
+ )
+
+ return super().get_memento_info(request_uri,
+ accept_datetime=accept_datetime,
+ tg_response=tg_response,
+ **kwargs)
+
+ def get_native_timegate_uri(self,
+ original_uri: str,
+ accept_datetime: Optional[datetime],
+ timeout: Optional[int] = None,
+ **kwargs) -> Optional[str]:
+ """Check the original uri whether the timegate uri is provided.
+
+ Given an original URL and an accept datetime, check the original uri
+ to see if the timegate uri is provided in the Link header.
+
+ :param original_uri: An HTTP uri of the original resource.
+ :param accept_datetime: The datetime object of the accept
+ datetime
+ :param timeout: the timeout value for the HTTP connection.
+ :return: The timegate uri of the original resource, if provided,
+ else None.
+ """
+ org_response = kwargs.pop('response', None)
+ if not org_response:
+ try:
+ org_response = MementoClient.request_head(
+ original_uri,
+ accept_datetime=MementoClient.convert_to_http_datetime(
+ accept_datetime),
+ session=self.session,
+ timeout=timeout
+ )
+ except (requests.exceptions.ConnectTimeout,
+ requests.exceptions.ConnectionError):
+ warning('Could not connect to URI {}, returning no native '
+ 'URI-G'.format(original_uri))
+ return None
+
+ debug('Request headers sent to search for URI-G: '
+ + str(org_response.request.headers))
+
+ return super().get_native_timegate_uri(original_uri, accept_datetime,
+ response=org_response, **kwargs)
+
+ @staticmethod
+ def is_timegate(uri: str,
+ accept_datetime: Optional[str] = None,
+ response: Optional[requests.Response] = None,
+ session: Optional[requests.Session] = None,
+ timeout: Optional[int] = None) -> bool:
+ """Checks if the given uri is a valid timegate according to the RFC.
+
+ :param uri: the http uri to check.
+ :param accept_datetime: the accept datetime string in http date
+ format.
+ :param response: the response object of the uri.
+ :param session: the requests session object.
+ :param timeout: the timeout value for the HTTP connection.
+ :return: True if a valid timegate, else False.
+ """
+ if not response:
+ if not accept_datetime:
+ accept_datetime = MementoClient.convert_to_http_datetime(
+ datetime.now())
+
+ response = MementoClient.request_head(
+ uri,
+ accept_datetime=accept_datetime,
+ session=session,
+ timeout=timeout
+ )
+ return old_is_timegate(
+ uri, accept_datetime, response=response, session=session)
+
+ @staticmethod
+ def is_memento(uri: str,
+ response: Optional[requests.Response] = None,
+ session: Optional[requests.Session] = None,
+ timeout: Optional[int] = None) -> bool:
+ """
+ Determines if the URI given is indeed a Memento.
+
+ The simple case is to look for a Memento-Datetime header in the
+ request, but not all archives are Memento-compliant yet.
+
+ :param uri: an HTTP URI for testing
+ :param response: the response object of the uri.
+ :param session: the requests session object.
+ :param timeout: (int) the timeout value for the HTTP connection.
+ :return: True if a Memento, False otherwise
+ """
+ if not response:
+ response = MementoClient.request_head(uri,
+ follow_redirects=False,
+ session=session,
+ timeout=timeout)
+ return old_is_memento(uri, response=response)
+
+ @staticmethod
+ def convert_to_http_datetime(dt: Optional[datetime]) -> str:
+ """Converts a datetime object to a date string in HTTP format.
+
+ :param dt: A datetime object.
+ :return: The date in HTTP format.
+ :raises TypeError: Expecting dt parameter to be of type datetime.
+ """
+ if dt and not isinstance(dt, datetime):
+ raise TypeError(
+ 'Expecting dt parameter to be of type datetime.')
+ return old_convert_to_http_datetime(dt)
+
+ @staticmethod
+ def request_head(uri: str,
+ accept_datetime: Optional[str] = None,
+ follow_redirects: bool = False,
+ session: Optional[requests.Session] = None,
+ timeout: Optional[int] = None) -> requests.Response:
+ """Makes HEAD requests.
+
+ :param uri: the uri for the request.
+ :param accept_datetime: the accept-datetime in the http format.
+ :param follow_redirects: Toggle to follow redirects. False by
+ default, so does not follow any redirects.
+ :param session: the request session object to avoid opening new
+ connections for every request.
+ :param timeout: the timeout for the HTTP requests.
+ :return: the response object.
+ :raises ValueError: Only HTTP URIs are supported
+ """
+ headers = {
+ 'Accept-Datetime': accept_datetime} if accept_datetime else {}
+
+ # create a session if not supplied
+ session_set = False
+ if not session:
+ session = requests.Session()
+ session_set = True
+ try:
+ response = session.head(uri,
+ headers=headers,
+ allow_redirects=follow_redirects,
+ timeout=timeout or 9)
+ except (InvalidSchema, MissingSchema):
+ raise ValueError('Only HTTP URIs are supported, '
+ 'URI {} unrecognized.'.format(uri))
+ if session_set:
+ session.close()
+
+ return response
-docuReplacements = {'¶ms;': pagegenerators.parameterHelp} # noqa: N816
-
-ignorelist = [
-    # Officially reserved for testing, documentation, etc. in
-    # https://datatracker.ietf.org/doc/html/rfc2606#page-2
-    # top-level domains:
-    re.compile(r'.*[\./@]test(/.*)?'),
-    re.compile(r'.*[\./@]example(/.*)?'),
-    re.compile(r'.*[\./@]invalid(/.*)?'),
-    re.compile(r'.*[\./@]localhost(/.*)?'),
-    # second-level domains:
-    re.compile(r'.*[\./@]example\.com(/.*)?'),
-    re.compile(r'.*[\./@]example\.net(/.*)?'),
-    re.compile(r'.*[\./@]example\.org(/.*)?'),
-
-    # Other special cases
-    re.compile(r'.*[\./@]berlinonline\.de(/.*)?'),
-    # above entry to be manually fixed per request at
-    # [[de:Benutzer:BLueFiSH.as/BZ]]
-    # bot can't handle their redirects:
-
-    # bot rejected on the site, already archived
-    re.compile(r'.*[\./@]web\.archive\.org(/.*)?'),
-
-    # Ignore links containing * in domain name
-    # as they are intentionally fake
-    re.compile(r'https?\:\/\/\*(/.*)?'),
-]
+# Save old static methods and update static methods of parent class
+old_is_timegate = OldMementoClient.is_timegate
+old_is_memento = OldMementoClient.is_memento
+old_convert_to_http_datetime = OldMementoClient.convert_to_http_datetime
+OldMementoClient.is_timegate = MementoClient.is_timegate
+OldMementoClient.is_memento = MementoClient.is_memento
+OldMementoClient.convert_to_http_datetime \
+ = MementoClient.convert_to_http_datetime
+OldMementoClient.request_head = MementoClient.request_head
-def _get_closest_memento_url(url, when=None, timegate_uri=None):
+def get_closest_memento_url(url: str,
+ when: Optional[datetime] = None,
+ timegate_uri: Optional[str] = None):
"""Get most recent memento for url."""
if not when:
when = datetime.datetime.now()
- mc = memento_client.MementoClient()
+ mc = MementoClient()
if timegate_uri:
mc.timegate_uri = timegate_uri
@@ -191,561 +317,17 @@
except (requests.ConnectionError, MementoClientException) as e:
error = e
retry_count += 1
- pywikibot.sleep(config.retry_wait)
+ sleep(config.retry_wait)
else:
raise error
mementos = memento_info.get('mementos')
if not mementos:
- raise Exception(
- 'mementos not found for {} via {}'.format(url, timegate_uri))
- if 'closest' not in mementos:
- raise Exception(
- 'closest memento not found for {} via {}'.format(
- url, timegate_uri))
- if 'uri' not in mementos['closest']:
- raise Exception(
- 'closest memento uri not found for {} via {}'.format(
- url, timegate_uri))
- return mementos['closest']['uri'][0]
-
-
-def get_archive_url(url):
- """Get archive URL."""
- try:
- archive = _get_closest_memento_url(
- url,
- timegate_uri='http://web.archive.org/web/')
- except Exception:
- archive = _get_closest_memento_url(
- url,
- timegate_uri='http://timetravel.mementoweb.org/webcite/timegate/')
-
- # FIXME: Hack for T167463: Use https instead of http for archive.org links
- if archive.startswith('http://web.archive.org'):
- archive = archive.replace('http://', 'https://', 1)
- return archive
-
-
-def weblinks_from_text(
- text,
- without_bracketed: bool = False,
- only_bracketed: bool = False
-):
- """
- Yield web links from text.
-
- Only used as text predicate for XmlDumpPageGenerator to speed up
- generator.
-
- TODO: move to textlib
- """
- text = textlib.removeDisabledParts(text)
-
- # Ignore links in fullurl template
- text = re.sub(r'{{\s?fullurl:.[^}]*}}', '', text)
-
- # MediaWiki parses templates before parsing external links. Thus, there
- # might be a | or a } directly after a URL which does not belong to
- # the URL itself.
-
- # First, remove the curly braces of inner templates:
- nested_template_regex = re.compile(r'{{([^}]*?){{(.*?)}}(.*?)}}')
- while nested_template_regex.search(text):
- text = nested_template_regex.sub(r'{{\1 \2 \3}}', text)
-
- # Then blow up the templates with spaces so that the | and }} will not
- # be regarded as part of the link:.
- template_with_params_regex = re.compile(r'{{([^}]*?[^ ])\|([^ ][^}]*?)}}',
- re.DOTALL)
- while template_with_params_regex.search(text):
- text = template_with_params_regex.sub(r'{{ \1 | \2 }}', text)
-
- # Add <blank> at the end of a template
- # URL as last param of multiline template would not be correct
- text = text.replace('}}', ' }}')
-
- # Remove HTML comments in URLs as well as URLs in HTML comments.
- # Also remove text inside nowiki links etc.
- text = textlib.removeDisabledParts(text)
- link_regex = textlib.compileLinkR(without_bracketed, only_bracketed)
- for m in link_regex.finditer(text):
- if m.group('url'):
- yield m.group('url')
- else:
- yield m.group('urlb')
-
-
-XmlDumpPageGenerator = partial(
- _XMLDumpPageGenerator, text_predicate=weblinks_from_text)
-
-
-class NotAnURLError(BaseException):
-
- """The link is not an URL."""
-
-
-class LinkCheckThread(threading.Thread):
-
- """A thread responsible for checking one URL.
-
- After checking the page, it will die.
- """
-
- #: Collecting start time of a thread for any host
- hosts = {} # type: Dict[str, float]
- lock = threading.Lock()
-
- def __init__(self, page, url, history, http_ignores, day) -> None:
- """Initializer."""
- self.page = page
- self.url = url
- self.history = history
- self.header = {
- 'Accept': 'text/xml,application/xml,application/xhtml+xml,'
- 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
- 'Accept-Language': 'de-de,de;q=0.8,en-us;q=0.5,en;q=0.3',
- 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
- 'Keep-Alive': '30',
- 'Connection': 'keep-alive',
- }
- # identification for debugging purposes
- self.http_ignores = http_ignores
- self._use_fake_user_agent = config.fake_user_agent_default.get(
- 'weblinkchecker', False)
- self.day = day
- super().__init__()
-
- @classmethod
- def get_delay(cls, name: str) -> float:
- """Determine delay from class attribute.
-
- Store the last call for a given hostname with an offset of
- 6 seconds to ensure there are no more than 10 calls per minute
- for the same host. Calculate the delay to start the run.
-
- :param name: The key for the hosts class attribute
- :return: The calulated delay to start the run
- """
- now = time.monotonic()
- with cls.lock:
- timestamp = cls.hosts.get(name, now)
- cls.hosts[name] = max(now, timestamp) + 6
- return max(0, timestamp - now)
-
- def run(self):
- """Run the bot."""
- time.sleep(self.get_delay(self.name))
- try:
- header = self.header
- r = comms.http.fetch(
- self.url, headers=header,
- use_fake_user_agent=self._use_fake_user_agent)
- except requests.exceptions.InvalidURL:
- message = i18n.twtranslate(self.page.site,
- 'weblinkchecker-badurl_msg',
- {'URL': self.url})
- except Exception:
- pywikibot.output('Exception while processing URL {} in page {}'
- .format(self.url, self.page.title()))
- raise
-
- if (
- r.status_code != HTTPStatus.OK
- or r.status_code in self.http_ignores
- ):
- message = HTTPStatus(r.status_code).phrase
- pywikibot.output('*{} links to {} - {}.'
- .format(self.page.title(as_link=True), self.url,
- message))
- self.history.set_dead_link(self.url, message, self.page,
- config.weblink_dead_days)
- elif self.history.set_link_alive(self.url):
- pywikibot.output(
- '*Link to {} in {} is back alive.'
- .format(self.url, self.page.title(as_link=True)))
-
-
-class History:
-
- """
- Store previously found dead links.
-
- The URLs are dictionary keys, and
- values are lists of tuples where each tuple represents one time the URL was
- found dead. Tuples have the form (title, date, error) where title is the
- wiki page where the URL was found, date is an instance of time, and error
- is a string with error code and message.
-
- We assume that the first element in the list represents the first time we
- found this dead link, and the last element represents the last time.
-
- Example::
-
- dict = {
- 'https://www.example.org/page': [
- ('WikiPageTitle', DATE, '404: File not found'),
- ('WikiPageName2', DATE, '404: File not found'),
- ]
- }
- """
-
- def __init__(self, report_thread, site=None) -> None:
- """Initializer."""
- self.report_thread = report_thread
- if not site:
- self.site = pywikibot.Site()
- else:
- self.site = site
- self.semaphore = threading.Semaphore()
- self.datfilename = pywikibot.config.datafilepath(
- 'deadlinks', 'deadlinks-{}-{}.dat'.format(self.site.family.name,
- self.site.code))
- # Count the number of logged links, so that we can insert captions
- # from time to time
- self.log_count = 0
- try:
- with open(self.datfilename, 'rb') as datfile:
- self.history_dict = pickle.load(datfile)
- except (OSError, EOFError):
- # no saved history exists yet, or history dump broken
- self.history_dict = {}
-
- def log(self, url, error, containing_page, archive_url) -> None:
- """Log an error report to a text file in the deadlinks subdirectory."""
- if archive_url:
- error_report = '* {} ([{} archive])\n'.format(url, archive_url)
- else:
- error_report = '* {}\n'.format(url)
- for (page_title, date, error) in self.history_dict[url]:
- # ISO 8601 formulation
- iso_date = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(date))
- error_report += '** In [[{}]] on {}, {}\n'.format(
- page_title, iso_date, error)
- pywikibot.output('** Logging link for deletion.')
- txtfilename = pywikibot.config.datafilepath('deadlinks',
- 'results-{}-{}.txt'
- .format(
- self.site.family.name,
- self.site.lang))
- with codecs.open(txtfilename, 'a', 'utf-8') as txtfile:
- self.log_count += 1
- if self.log_count % 30 == 0:
- # insert a caption
- txtfile.write('=== {} ===\n'
- .format(containing_page.title()[:3]))
- txtfile.write(error_report)
-
- if self.report_thread and not containing_page.isTalkPage():
- self.report_thread.report(url, error_report, containing_page,
- archive_url)
-
- def set_dead_link(self, url, error, page, weblink_dead_days) -> None:
- """Add the fact that the link was found dead to the .dat file."""
- with self.semaphore:
- now = time.time()
- if url in self.history_dict:
- time_since_first_found = now - self.history_dict[url][0][1]
- time_since_last_found = now - self.history_dict[url][-1][1]
- # if the last time we found this dead link is less than an hour
- # ago, we won't save it in the history this time.
- if time_since_last_found > 60 * 60:
- self.history_dict[url].append((page.title(), now, error))
- # if the first time we found this link longer than x day ago
- # (default is a week), it should probably be fixed or removed.
- # We'll list it in a file so that it can be removed manually.
- if time_since_first_found > 60 * 60 * 24 * weblink_dead_days:
- # search for archived page
- try:
- archive_url = get_archive_url(url)
- except Exception as e:
- pywikibot.warning(
- 'get_closest_memento_url({}) failed: {}'.format(
- url, e))
- archive_url = None
- self.log(url, error, page, archive_url)
- else:
- self.history_dict[url] = [(page.title(), now, error)]
-
- def set_link_alive(self, url) -> bool:
- """
- Record that the link is now alive.
-
- If link was previously found dead, remove it from the .dat file.
-
- :return: True if previously found dead, else returns False.
- """
- if url in self.history_dict:
- with self.semaphore, suppress(KeyError):
- del self.history_dict[url]
- return True
-
- return False
-
- def save(self) -> None:
- """Save the .dat file to disk."""
- with open(self.datfilename, 'wb') as f:
- pickle.dump(self.history_dict, f, protocol=config.pickle_protocol)
-
-
-class DeadLinkReportThread(threading.Thread):
-
- """
- A Thread that is responsible for posting error reports on talk pages.
-
- There is only one DeadLinkReportThread, and it is using a semaphore to make
- sure that two LinkCheckerThreads cannot access the queue at the same time.
- """
-
- def __init__(self) -> None:
- """Initializer."""
- super().__init__()
- self.semaphore = threading.Semaphore()
- self.queue = []
- self.finishing = False
- self.killed = False
-
- def report(self, url, error_report, containing_page, archive_url) -> None:
- """Report error on talk page of the page containing the dead link."""
- with self.semaphore:
- self.queue.append((url, error_report, containing_page,
- archive_url))
-
- def shutdown(self) -> None:
- """Finish thread."""
- self.finishing = True
-
- def kill(self) -> None:
- """Kill thread."""
- # TODO: remove if unneeded
- self.killed = True
-
- def run(self) -> None:
- """Run thread."""
- while not self.killed:
- if not self.queue:
- if self.finishing:
- break
- time.sleep(0.1)
- continue
-
- with self.semaphore:
- url, error_report, containing_page, archive_url = self.queue[0]
- self.queue = self.queue[1:]
- talk_page = containing_page.toggleTalkPage()
- pywikibot.output('<<lightaqua>>** Reporting dead link on {}...'
- '<<default>>'.format(talk_page))
- try:
- content = talk_page.get() + '\n\n\n'
- if url in content:
- pywikibot.output('<<lightaqua>>** Dead link seems to '
- 'have already been reported on {}'
- '<<default>>'.format(talk_page))
- continue
- except (NoPageError, IsRedirectPageError):
- content = ''
-
- if archive_url:
- archive_msg = '\n' + i18n.twtranslate(
- containing_page.site, 'weblinkchecker-archive_msg',
- {'URL': archive_url})
- else:
- archive_msg = ''
- # The caption will default to "Dead link". But if there
- # is already such a caption, we'll use "Dead link 2",
- # "Dead link 3", etc.
- caption = i18n.twtranslate(containing_page.site,
- 'weblinkchecker-caption')
- i = 1
- count = ''
- # Check if there is already such a caption on
- # the talk page.
- while re.search('= *{}{} *='
- .format(caption, count), content) is not None:
- i += 1
- count = ' ' + str(i)
- caption += count
- content += '== {0} ==\n\n{3}\n\n{1}{2}\n--~~~~'.format(
- caption, error_report, archive_msg,
- i18n.twtranslate(containing_page.site,
- 'weblinkchecker-report'))
-
- comment = '[[{}#{}|→]] {}'.format(
- talk_page.title(), caption,
- i18n.twtranslate(containing_page.site,
- 'weblinkchecker-summary'))
- try:
- talk_page.put(content, comment)
- except SpamblacklistError as error:
- pywikibot.output(
- '<<lightaqua>>** SpamblacklistError while trying to '
- 'change {}: {}<<default>>'
- .format(talk_page, error.url))
-
-
-class WeblinkCheckerRobot(SingleSiteBot, ExistingPageBot):
-
- """
- Bot which will search for dead weblinks.
-
- It uses several LinkCheckThreads at once to process pages from generator.
- """
-
- use_redirects = False
-
- def __init__(self, http_ignores=None, day: int = 7, **kwargs) -> None:
- """Initializer."""
- super().__init__(**kwargs)
-
- if config.report_dead_links_on_talk:
- pywikibot.log('Starting talk page thread')
- report_thread = DeadLinkReportThread()
- report_thread.start()
- else:
- report_thread = None
- self.history = History(report_thread, site=self.site)
- self.http_ignores = http_ignores or []
- self.day = day
-
- # Limit the number of threads started at the same time
- self.threads = ThreadList(limit=config.max_external_links,
- wait_time=config.retry_wait)
-
- def treat_page(self) -> None:
- """Process one page."""
- page = self.current_page
- for url in page.extlinks():
- for ignore_regex in ignorelist:
- if ignore_regex.match(url):
- break
- else:
- # Each thread will check one page, then die.
- thread = LinkCheckThread(page, url, self.history,
- self.http_ignores, self.day)
- # thread dies when program terminates
- thread.daemon = True
- # use hostname as thread.name
- thread.name = removeprefix(
- urlparse.urlparse(url).hostname, 'www.')
- self.threads.append(thread)
-
- def teardown(self) -> None:
- """Finish remaining threads and save history file."""
- num = self.count_link_check_threads()
- if num:
- pywikibot.info('<<lightblue>>Waiting for remaining {} threads '
- 'to finish, please wait...'.format(num))
-
- while self.count_link_check_threads():
- try:
- time.sleep(0.1)
- except KeyboardInterrupt:
- # Threads will die automatically because they are daemonic.
- if pywikibot.input_yn('There are {} pages remaining in the '
- 'queue. Really exit?'
- .format(self.count_link_check_threads()),
- default=False, automatic_quit=False):
- break
-
- num = self.count_link_check_threads()
- if num:
- pywikibot.info('<<yellow>>>Remaining {} threads will be killed.'
- .format(num))
-
- if self.history.report_thread:
- self.history.report_thread.shutdown()
- # wait until the report thread is shut down; the user can
- # interrupt it by pressing CTRL-C.
- try:
- while self.history.report_thread.is_alive():
- time.sleep(0.1)
- except KeyboardInterrupt:
- pywikibot.info('Report thread interrupted.')
- self.history.report_thread.kill()
-
- pywikibot.info('Saving history...')
- self.history.save()
-
- @staticmethod
- def count_link_check_threads() -> int:
- """Count LinkCheckThread threads.
-
- :return: number of LinkCheckThread threads
- """
- return sum(isinstance(thread, LinkCheckThread)
- for thread in threading.enumerate())
-
-
-def RepeatPageGenerator(): # noqa: N802
- """Generator for pages in History."""
- history = History(None)
- page_titles = set()
- for value in history.history_dict.values():
- for entry in value:
- page_titles.add(entry[0])
- for page_title in sorted(page_titles):
- page = pywikibot.Page(pywikibot.Site(), page_title)
- yield page
-
-
-def main(*args: str) -> None:
- """
- Process command line arguments and invoke bot.
-
- If args is an empty list, sys.argv is used.
-
- :param args: command line arguments
- """
- gen = None
- xml_filename = None
- http_ignores = []
-
- # Process global args and prepare generator args parser
- local_args = pywikibot.handle_args(args)
- gen_factory = pagegenerators.GeneratorFactory()
-
- for arg in local_args:
- if arg == '-talk':
- config.report_dead_links_on_talk = True
- elif arg == '-notalk':
- config.report_dead_links_on_talk = False
- elif arg == '-repeat':
- gen = RepeatPageGenerator()
- elif arg.startswith('-ignore:'):
- http_ignores.append(int(arg[8:]))
- elif arg.startswith('-day:'):
- config.weblink_dead_days = int(arg[5:])
- elif arg.startswith('-xmlstart'):
- if len(arg) == 9:
- xml_start = pywikibot.input(
- 'Please enter the dumped article to start with:')
- else:
- xml_start = arg[10:]
- elif arg.startswith('-xml'):
- if len(arg) == 4:
- xml_filename = i18n.input('pywikibot-enter-xml-filename')
- else:
- xml_filename = arg[5:]
- else:
- gen_factory.handle_arg(arg)
-
- if xml_filename:
- try:
- xml_start
- except NameError:
- xml_start = None
- gen = XmlDumpPageGenerator(xml_filename, xml_start,
- gen_factory.namespaces)
-
- if not gen:
- gen = gen_factory.getCombinedGenerator()
-
- if not suggest_help(missing_generator=not gen,
- missing_dependencies=missing_dependencies):
- bot = WeblinkCheckerRobot(http_ignores, config.weblink_dead_days,
- generator=gen)
- bot.run()
-
-
-if __name__ == '__main__':
- main()
+ err_msg = 'mementos not found for {} via {}'
+ elif 'closest' not in mementos:
+ err_msg = 'closest memento not found for {} via {}'
+ elif 'uri' not in mementos['closest']:
+ err_msg = 'closest memento uri not found for {} via {}'
+ else:
+ return mementos['closest']['uri'][0]
+    raise Exception(err_msg.format(url, timegate_uri))
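The module ends by rebinding the parent class's static methods (is_timegate,
is_memento, convert_to_http_datetime, request_head) onto OldMementoClient, so
the timeout-aware versions also take effect for code paths inside the upstream
memento_client package that resolve those statics through the class. A generic
sketch of that pattern (class and method names are illustrative, not the
memento_client API):

    import requests


    class ThirdParty:
        """Stand-in for an upstream class (illustrative only)."""

        @staticmethod
        def fetch(uri):
            # Upstream behaviour: no timeout, so a stalled server can hang.
            return requests.head(uri)

        def check(self, uri):
            # Upstream code resolves the static method through the class ...
            return ThirdParty.fetch(uri).ok


    class Fixed(ThirdParty):

        @staticmethod
        def fetch(uri, timeout=9):
            # Fixed behaviour: always pass a timeout.
            return requests.head(uri, timeout=timeout)


    # ... so rebinding the attribute on the parent makes even old code paths
    # (including code that still uses ThirdParty directly) pick up the fix.
    ThirdParty.fetch = Fixed.fetch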
diff --git a/requirements.txt b/requirements.txt
index 5a4796f..10eddfb 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -58,4 +58,4 @@
beautifulsoup4
# scripts/weblinkchecker.py
-memento_client>=0.5.1,!=0.6.0
+memento_client==0.6.1
diff --git a/scripts/_weblinkchecker.py b/scripts/weblinkchecker.py
similarity index 94%
rename from scripts/_weblinkchecker.py
rename to scripts/weblinkchecker.py
index 398ba64..215b8d8 100755
--- a/scripts/_weblinkchecker.py
+++ b/scripts/weblinkchecker.py
@@ -109,7 +109,6 @@
# Distributed under the terms of the MIT license.
#
import codecs
-import datetime
import pickle
import re
import threading
@@ -137,9 +136,8 @@
try:
- import memento_client
- from memento_client.memento_client import MementoClientException
- missing_dependencies = None
+ from pywikibot.data.memento import get_closest_memento_url
+ missing_dependencies = []
except ImportError:
missing_dependencies = ['memento_client']
@@ -174,50 +172,13 @@
]
-def _get_closest_memento_url(url, when=None, timegate_uri=None):
- """Get most recent memento for url."""
- if not when:
- when = datetime.datetime.now()
-
- mc = memento_client.MementoClient()
- if timegate_uri:
- mc.timegate_uri = timegate_uri
-
- retry_count = 0
- while retry_count <= config.max_retries:
- try:
- memento_info = mc.get_memento_info(url, when)
- break
- except (requests.ConnectionError, MementoClientException) as e:
- error = e
- retry_count += 1
- pywikibot.sleep(config.retry_wait)
- else:
- raise error
-
- mementos = memento_info.get('mementos')
- if not mementos:
- raise Exception(
- 'mementos not found for {} via {}'.format(url, timegate_uri))
- if 'closest' not in mementos:
- raise Exception(
- 'closest memento not found for {} via {}'.format(
- url, timegate_uri))
- if 'uri' not in mementos['closest']:
- raise Exception(
- 'closest memento uri not found for {} via {}'.format(
- url, timegate_uri))
- return mementos['closest']['uri'][0]
-
-
def get_archive_url(url):
"""Get archive URL."""
try:
- archive = _get_closest_memento_url(
- url,
- timegate_uri='http://web.archive.org/web/')
+ archive = get_closest_memento_url(
+ url, timegate_uri='http://web.archive.org/web/')
except Exception:
- archive = _get_closest_memento_url(
+ archive = get_closest_memento_url(
url,
timegate_uri='http://timetravel.mementoweb.org/webcite/timegate/')
diff --git a/setup.py b/setup.py
index e6e0925..3ab222a 100755
--- a/setup.py
+++ b/setup.py
@@ -60,6 +60,7 @@
'isbn': ['python-stdnum>=1.17'],
'Graphviz': ['pydot>=1.2'],
'Google': ['google>=1.7'],
+ 'memento': ['memento_client==0.6.1'],
'mwparserfromhell': ['mwparserfromhell>=0.5.0'],
'wikitextparser': ['wikitextparser>=0.47.5; python_version < "3.6"',
'wikitextparser>=0.47.0; python_version >= "3.6"'],
@@ -99,7 +100,7 @@
script_deps = {
'commons_information.py': extra_deps['mwparserfromhell'],
'patrol.py': extra_deps['mwparserfromhell'],
- 'weblinkchecker.py': ['memento_client!=0.6.0,>=0.5.1'],
+ 'weblinkchecker.py': extra_deps['memento'],
}
extra_deps.update(script_deps)
diff --git a/tests/__init__.py b/tests/__init__.py
index 6606d55..39ce2b0 100644
--- a/tests/__init__.py
+++ b/tests/__init__.py
@@ -101,6 +101,7 @@
'logentries',
'login',
'mediawikiversion',
+ 'memento',
'mysql',
'namespace',
'oauth',
@@ -158,7 +159,6 @@
'script',
'template_bot',
'uploadscript',
- 'weblinkchecker'
}
disabled_test_modules = {
diff --git a/tests/weblinkchecker_tests.py b/tests/memento_tests.py
similarity index 88%
rename from tests/weblinkchecker_tests.py
rename to tests/memento_tests.py
index 9dd1109..a8e2768 100755
--- a/tests/weblinkchecker_tests.py
+++ b/tests/memento_tests.py
@@ -12,7 +12,6 @@
from requests.exceptions import ConnectionError as RequestsConnectionError
-from scripts import weblinkchecker
from tests.aspects import TestCase, require_modules
@@ -22,15 +21,17 @@
"""Test memento client."""
def _get_archive_url(self, url, date_string=None):
- from memento_client.memento_client import MementoClientException
+ from pywikibot.data.memento import (
+ MementoClientException,
+ get_closest_memento_url,
+ )
if date_string is None:
when = datetime.datetime.now()
else:
when = datetime.datetime.strptime(date_string, '%Y%m%d')
try:
- result = weblinkchecker._get_closest_memento_url(
- url, when, self.timegate_uri)
+ result = get_closest_memento_url(url, when, self.timegate_uri)
except (RequestsConnectionError, MementoClientException) as e:
self.skipTest(e)
return result
@@ -72,8 +73,7 @@
"""Test getting memento for invalid URL."""
# memento_client raises 'Exception', not a subclass.
with self.assertRaisesRegex(
- Exception,
- 'Only HTTP URIs are supported'):
+ ValueError, 'Only HTTP URIs are supported'):
self._get_archive_url('invalid')
diff --git a/tox.ini b/tox.ini
index 8278dfa..bc97202 100644
--- a/tox.ini
+++ b/tox.ini
@@ -70,6 +70,7 @@
nosetests --with-doctest pywikibot {[params]doctest_skip}
deps =
nose
+ .[memento]
.[mwparserfromhell]
[testenv:venv]
--
To view, visit https://gerrit.wikimedia.org/r/c/pywikibot/core/+/803232
To unsubscribe, or for help writing mail filters, visit https://gerrit.wikimedia.org/r/settings
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Change-Id: I137a27ad198f0e0aae713c888401265f7aca187b
Gerrit-Change-Number: 803232
Gerrit-PatchSet: 22
Gerrit-Owner: Xqt <info@gno.de>
Gerrit-Reviewer: D3r1ck01 <xsavitar.wiki@aol.com>
Gerrit-Reviewer: Dvorapa <dvorapa@seznam.cz>
Gerrit-Reviewer: Framawiki <framawiki@tools.wmflabs.org>
Gerrit-Reviewer: Matěj Suchánek <matejsuchanek97@gmail.com>
Gerrit-Reviewer: Shawnmjones <jones.shawn.m@gmail.com>
Gerrit-Reviewer: Zhuyifei1999 <zhuyifei1999@gmail.com>
Gerrit-Reviewer: Zhuyifei1999 <zhuyifei1999(a)gmail.com>
Gerrit-Reviewer: jenkins-bot
Gerrit-MessageType: merged