jenkins-bot has submitted this change. ( https://gerrit.wikimedia.org/r/c/pywikibot/core/+/806854 )
Change subject: Fix regression for interwiki script
......................................................................
Fix regression for interwiki script
Commit 0a5f9135 introduced a regression where the interwiki script removes
all interwiki links instead of adding or replacing them.
Patch submitted by winterheart.
Bug: T310964
Change-Id: I3cccdbb4fb0115ec2270ea001826d9c61389ef9e
---
M scripts/interwiki.py
1 file changed, 1 insertion(+), 1 deletion(-)
Approvals:
Xqt: Looks good to me, approved
jenkins-bot: Verified
diff --git a/scripts/interwiki.py b/scripts/interwiki.py
index e217308..76e75ad 100755
--- a/scripts/interwiki.py
+++ b/scripts/interwiki.py
@@ -1224,7 +1224,7 @@
break
if not self.skipPage(page, linkedPage, counter) \
- and self.conf.followinterwiki or page == self.origin \
+ and (self.conf.followinterwiki or page == self.origin) \
and self.addIfNew(linkedPage, counter, page):
# It is new. Also verify whether it is the second on the
# same site
--
To view, visit https://gerrit.wikimedia.org/r/c/pywikibot/core/+/806854
To unsubscribe, or for help writing mail filters, visit https://gerrit.wikimedia.org/r/settings
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Change-Id: I3cccdbb4fb0115ec2270ea001826d9c61389ef9e
Gerrit-Change-Number: 806854
Gerrit-PatchSet: 1
Gerrit-Owner: Xqt <info@gno.de>
Gerrit-Reviewer: D3r1ck01 <xsavitar.wiki@aol.com>
Gerrit-Reviewer: Xqt <info@gno.de>
Gerrit-Reviewer: jenkins-bot
Gerrit-MessageType: merged
Xqt has submitted this change. ( https://gerrit.wikimedia.org/r/c/pywikibot/core/+/806546 )
Change subject: [fix] move get_closest_memento_url to data/memento.py
......................................................................
[fix] move get_closest_memento_url to data/memento.py
Change-Id: I53e7c99b93389907b985e5223bcc4d4134de42a7
---
C pywikibot/data/memento.py
R scripts/_weblinkchecker.py
2 files changed, 0 insertions(+), 0 deletions(-)
Approvals:
Xqt: Verified; Looks good to me, approved
diff --git a/scripts/weblinkchecker.py b/pywikibot/data/memento.py
old mode 100755
new mode 100644
similarity index 100%
copy from scripts/weblinkchecker.py
copy to pywikibot/data/memento.py
diff --git a/scripts/weblinkchecker.py b/scripts/_weblinkchecker.py
similarity index 100%
rename from scripts/weblinkchecker.py
rename to scripts/_weblinkchecker.py
--
To view, visit https://gerrit.wikimedia.org/r/c/pywikibot/core/+/806546
To unsubscribe, or for help writing mail filters, visit https://gerrit.wikimedia.org/r/settings
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Change-Id: I53e7c99b93389907b985e5223bcc4d4134de42a7
Gerrit-Change-Number: 806546
Gerrit-PatchSet: 1
Gerrit-Owner: Xqt <info@gno.de>
Gerrit-Reviewer: Xqt <info@gno.de>
Gerrit-Reviewer: jenkins-bot
Gerrit-MessageType: merged
Xqt has submitted this change. ( https://gerrit.wikimedia.org/r/c/pywikibot/core/+/803232 )
Change subject: [fix] Add a memento_client fix to the framework
......................................................................
[fix] Add a memento_client fix to the framework
- add memento module and import MementoClient and MementoClientException.
Derive MementoClient and add timeout parameters for several methods. The
implementation was originally made with this commit:
https://github.com/mementoweb/py-memento-client/commit/15dc9f520230aaa0793d…
- move weblinkchecker._get_closest_memento_url() to memento.py
as get_closest_memento_url() but keep the blame history
- use pywikibot logging system
- set memento_client requirement to ==0.6.1
- update weblinkchecker.py script
- update tests and rename weblinkchecker_tests to memento_tests
- update doc
- update license
Bug: T185561
Change-Id: I137a27ad198f0e0aae713c888401265f7aca187b
---
M docs/api_ref/pywikibot.data.rst
M docs/licenses.rst
M pywikibot/CONTENT.rst
M pywikibot/data/memento.py
M requirements.txt
R scripts/weblinkchecker.py
M setup.py
M tests/__init__.py
R tests/memento_tests.py
M tox.ini
10 files changed, 326 insertions(+), 771 deletions(-)
Approvals:
jenkins-bot: Verified
Xqt: Verified; Looks good to me, approved
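As a rough usage sketch of what this change provides (based on the signatures
added in pywikibot/data/memento.py below; the dates and URIs are illustrative
only, and memento_client==0.6.1 must be installed):

    from datetime import datetime

    from pywikibot.data.memento import MementoClient, get_closest_memento_url

    # The derived client threads an explicit timeout through its HEAD requests.
    mc = MementoClient()
    info = mc.get_memento_info('http://www.bbc.com/',
                               accept_datetime=datetime(2010, 4, 1),
                               timeout=10)
    print(info['mementos']['closest']['uri'][0])

    # Convenience helper moved here from scripts/weblinkchecker.py:
    archive = get_closest_memento_url('http://www.bbc.com/',
                                      when=datetime(2010, 4, 1),
                                      timegate_uri='http://web.archive.org/web/')
    print(archive)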
diff --git a/docs/api_ref/pywikibot.data.rst b/docs/api_ref/pywikibot.data.rst
index 038c439..12b7d14 100644
--- a/docs/api_ref/pywikibot.data.rst
+++ b/docs/api_ref/pywikibot.data.rst
@@ -11,6 +11,11 @@
.. automodule:: pywikibot.data.api
+pywikibot.data.memento module
+-----------------------------
+
+.. automodule:: pywikibot.data.memento
+
pywikibot.data.mysql module
---------------------------
diff --git a/docs/licenses.rst b/docs/licenses.rst
index 965e28e..f07112a 100644
--- a/docs/licenses.rst
+++ b/docs/licenses.rst
@@ -7,7 +7,10 @@
:ref:`MIT license`; translations by translators and manual pages on
mediawiki.org are available under the `CC-BY-SA 3.0`_ license. The
Pywikibot logo is Public domain but it includes material that may be
-protected as a trademark.
+protected as a trademark. Parts of the :mod:`memento<pywikibot.data.memento>`
+module are licensed under the `BSD`_ open source software license. You
+may obtain a copy of the License at
+http://mementoweb.github.io/SiteStory/license.html.
MIT License
@@ -18,4 +21,4 @@
.. _CC-BY-SA 3.0: https://creativecommons.org/licenses/by-sa/3.0/
-
+.. _BSD: https://github.com/mementoweb/py-memento-client/blob/master/LICENSE.txt
diff --git a/pywikibot/CONTENT.rst b/pywikibot/CONTENT.rst
index 5f100ee..617b52d 100644
--- a/pywikibot/CONTENT.rst
+++ b/pywikibot/CONTENT.rst
@@ -103,6 +103,8 @@
| +----------------+-------------------------------------+
| | _requests.py | API Requests interface |
+----------------------------+----------------+-------------------------------------+
+ | memento.py | memento_client 0.6.1 package fix |
+ +----------------------------+------------------------------------------------------+
| mysql.py | Miscellaneous helper functions for mysql queries |
+----------------------------+------------------------------------------------------+
| sparql.py | Objects representing SPARQL query API |
diff --git a/pywikibot/data/memento.py b/pywikibot/data/memento.py
index 398ba64..ecaf35e 100644
--- a/pywikibot/data/memento.py
+++ b/pywikibot/data/memento.py
@@ -1,185 +1,311 @@
-#!/usr/bin/python3
-"""
-This bot is used for checking external links found at the wiki.
+"""Fix ups for memento-client package version 0.6.1.
-It checks several pages at once, with a limit set by the config variable
-max_external_links, which defaults to 50.
-
-The bot won't change any wiki pages, it will only report dead links such that
-people can fix or remove the links themselves.
-
-The bot will store all links found dead in a .dat file in the deadlinks
-subdirectory. To avoid the removing of links which are only temporarily
-unavailable, the bot ONLY reports links which were reported dead at least
-two times, with a time lag of at least one week. Such links will be logged to a
-.txt file in the deadlinks subdirectory.
-
-The .txt file uses wiki markup and so it may be useful to post it on the
-wiki and then exclude that page from subsequent runs. For example if the
-page is named Broken Links, exclude it with '-titleregexnot:^Broken Links$'
-
-After running the bot and waiting for at least one week, you can re-check those
-pages where dead links were found, using the -repeat parameter.
-
-In addition to the logging step, it is possible to automatically report dead
-links to the talk page of the article where the link was found. To use this
-feature, set report_dead_links_on_talk = True in your user-config.py, or
-specify "-talk" on the command line. Adding "-notalk" switches this off
-irrespective of the configuration variable.
-
-When a link is found alive, it will be removed from the .dat file.
-
-These command line parameters can be used to specify which pages to work on:
-
--repeat Work on all pages were dead links were found before. This is
- useful to confirm that the links are dead after some time (at
- least one week), which is required before the script will report
- the problem.
-
--namespace Only process templates in the namespace with the given number or
- name. This parameter may be used multiple times.
-
--xml Should be used instead of a simple page fetching method from
- pagegenerators.py for performance and load issues
-
--xmlstart Page to start with when using an XML dump
-
--ignore HTTP return codes to ignore. Can be provided several times :
- -ignore:401 -ignore:500
-
-¶ms;
-
-Furthermore, the following command line parameters are supported:
-
--talk Overrides the report_dead_links_on_talk config variable, enabling
- the feature.
-
--notalk Overrides the report_dead_links_on_talk config variable, disabling
- the feature.
-
--day Do not report broken link if the link is there only since
- x days or less. If not set, the default is 7 days.
-
-The following config variables are supported:
-
- max_external_links The maximum number of web pages that should be
- loaded simultaneously. You should change this
- according to your Internet connection speed.
- Be careful: if it is set too high, the script
- might get socket errors because your network
- is congested, and will then think that the page
- is offline.
-
- report_dead_links_on_talk If set to true, causes the script to report dead
- links on the article's talk page if (and ONLY if)
- the linked page has been unavailable at least two
- times during a timespan of at least one week.
-
- weblink_dead_days sets the timespan (default: one week) after which
- a dead link will be reported
-
-Examples
---------
-
-Loads all wiki pages in alphabetical order using the Special:Allpages
-feature:
-
- python pwb.py weblinkchecker -start:!
-
-Loads all wiki pages using the Special:Allpages feature, starting at
-"Example page":
-
- python pwb.py weblinkchecker -start:Example_page
-
-Loads all wiki pages that link to www.example.org:
-
- python pwb.py weblinkchecker -weblink:www.example.org
-
-Only checks links found in the wiki page "Example page":
-
- python pwb.py weblinkchecker Example page
-
-Loads all wiki pages where dead links were found during a prior run:
-
- python pwb.py weblinkchecker -repeat
+.. versionadded:: 7.4
+.. seealso:: https://github.com/mementoweb/py-memento-client#readme
"""
#
-# (C) Pywikibot team, 2005-2022
+# (C) Shawn M. Jones, Harihar Shankar, Herbert Van de Sompel.
+# -- Los Alamos National Laboratory, 2013
+# Parts of MementoClient class codes are
+# licensed under the BSD open source software license.
+#
+# (C) Pywikibot team, 2015-2022
#
# Distributed under the terms of the MIT license.
#
-import codecs
-import datetime
-import pickle
-import re
-import threading
-import time
-import urllib.parse as urlparse
-from contextlib import suppress
-from functools import partial
-from http import HTTPStatus
+from datetime import datetime
+from typing import Optional
+
+from memento_client.memento_client import MementoClient as OldMementoClient
+from memento_client.memento_client import MementoClientException
import requests
+from requests.exceptions import InvalidSchema, MissingSchema
-import pywikibot
-from pywikibot import comms, config, i18n, pagegenerators, textlib
-from pywikibot.backports import Dict, removeprefix
-from pywikibot.bot import ExistingPageBot, SingleSiteBot, suggest_help
-from pywikibot.exceptions import (
- IsRedirectPageError,
- NoPageError,
- SpamblacklistError,
-)
-from pywikibot.pagegenerators import (
- XMLDumpPageGenerator as _XMLDumpPageGenerator,
-)
-from pywikibot.tools import ThreadList
+from pywikibot import config, debug, sleep, warning
+
+__all__ = ('MementoClient', 'MementoClientException')
-try:
- import memento_client
- from memento_client.memento_client import MementoClientException
- missing_dependencies = None
-except ImportError:
- missing_dependencies = ['memento_client']
+class MementoClient(OldMementoClient):
+
+ """A Memento Client.
+
+ It makes it straightforward to access the Web of the past as it is
+ to access the current Web.
+
+ .. versionchanged:: 7.4
+ `timeout` is used in several methods.
+
+ Basic usage:
+
+ >>> mc = MementoClient()
+ >>> dt = mc.convert_to_datetime("Sun, 01 Apr 2010 12:00:00 GMT")
+ >>> mc = mc.get_memento_info("http://www.bbc.com/", dt)
+ >>> print(mc['original_uri'])
+ http://www.bbc.com/
+ >>> print(mc['timegate_uri'])
+ http://timetravel.mementoweb.org/timegate/http://www.bbc.com/
+ >>> print(sorted(mc['mementos']))
+ ['closest', 'first', 'last', 'next', 'prev']
+ >>> del mc['mementos']['last']
+ >>> from pprint import pprint
+ >>> pprint(mc['mementos'])
+ {'closest': {'datetime': datetime.datetime(2010, 2, 28, 8, 5, 38),
+ 'http_status_code': 200,
+ 'uri': ['https://swap.stanford.edu/20100228080538/http://www.bbc.co.uk/']},
+ 'first': {'datetime': datetime.datetime(1998, 12, 2, 21, 26, 10),
+ 'uri': ['http://wayback.nli.org.il:8080/19981202212610/http://bbc.com/']},
+ 'next': {'datetime': datetime.datetime(2010, 5, 23, 13, 47, 38),
+ 'uri': ['https://web.archive.org/web/20100523134738/http://www.bbc.com/']},
+ 'prev': {'datetime': datetime.datetime(1998, 12, 2, 21, 26, 10),
+ 'uri': ['http://wayback.nli.org.il:8080/19981202212610/http://bbc.com/']}}
+
+ The output conforms to the Memento API format explained here:
+ http://timetravel.mementoweb.org/guide/api/#memento-json
+
+ By default, MementoClient uses the Memento Aggregator:
+ http://mementoweb.org/depot/
+
+ It is also possible to use different TimeGate, simply initialize
+ with a preferred timegate base uri. Toggle check_native_timegate to
+ see if the original uri has its own timegate. The native timegate,
+ if found will be used instead of the timegate_uri preferred. If no
+ native timegate is found, the preferred timegate_uri will be used.
+
+ :param str timegate_uri: A valid HTTP base uri for a timegate.
+ Must start with http(s):// and end with a /.
+ :param int max_redirects: the maximum number of redirects allowed
+ for all HTTP requests to be made.
+ :return: A :class:`MementoClient` obj.
+ """ # noqa: E501
+
+ def __init__(self, *args, **kwargs):
+ """Initializer."""
+ # To prevent documentation inclusion from inherited class
+ # because it is malformed.
+ super().__init__(*args, **kwargs)
+
+ def get_memento_info(self, request_uri: str,
+ accept_datetime: Optional[datetime] = None,
+ timeout: Optional[int] = None,
+ **kwargs) -> dict:
+ """Query the preferred timegate and return the closest memento uri.
+
+ Given an original uri and an accept datetime, this method
+ queries the preferred timegate and returns the closest memento
+ uri, along with prev/next/first/last if available.
+
+ .. seealso:: http://timetravel.mementoweb.org/guide/api/#memento-json
+ for the response format.
+
+ :param request_uri: The input http uri.
+ :param accept_datetime: The datetime object of the accept
+ datetime. The current datetime is used if none is provided.
+ :param timeout: the timeout value for the HTTP connection.
+ :return: A map of uri and datetime for the
+ closest/prev/next/first/last mementos.
+ """
+ # for reading the headers of the req uri to find uri_r
+ req_uri_response = kwargs.get('req_uri_response')
+ # for checking native tg uri in uri_r
+ org_response = kwargs.get('org_response')
+ tg_response = kwargs.get('tg_response')
+ if not tg_response:
+ native_tg = None
+
+ original_uri = self.get_original_uri(
+ request_uri, response=req_uri_response)
+
+ if self.check_native_timegate:
+ native_tg = self.get_native_timegate_uri(
+ original_uri, accept_datetime=accept_datetime,
+ response=org_response)
+
+ timegate_uri = native_tg if native_tg \
+ else self.timegate_uri + original_uri
+
+ http_acc_dt = MementoClient.convert_to_http_datetime(
+ accept_datetime)
+
+ tg_response = MementoClient.request_head(
+ timegate_uri,
+ accept_datetime=http_acc_dt,
+ follow_redirects=True,
+ session=self.session,
+ timeout=timeout
+ )
+
+ return super().get_memento_info(request_uri,
+ accept_datetime=accept_datetime,
+ tg_response=tg_response,
+ **kwargs)
+
+ def get_native_timegate_uri(self,
+ original_uri: str,
+ accept_datetime: Optional[datetime],
+ timeout: Optional[int] = None,
+ **kwargs) -> Optional[str]:
+ """Check the original uri whether the timegate uri is provided.
+
+ Given an original URL and an accept datetime, check the original uri
+ to see if the timegate uri is provided in the Link header.
+
+ :param original_uri: An HTTP uri of the original resource.
+ :param accept_datetime: The datetime object of the accept
+ datetime
+ :param timeout: the timeout value for the HTTP connection.
+ :return: The timegate uri of the original resource, if provided,
+ else None.
+ """
+ org_response = kwargs.pop('response', None)
+ if not org_response:
+ try:
+ org_response = MementoClient.request_head(
+ original_uri,
+ accept_datetime=MementoClient.convert_to_http_datetime(
+ accept_datetime),
+ session=self.session,
+ timeout=timeout
+ )
+ except (requests.exceptions.ConnectTimeout,
+ requests.exceptions.ConnectionError):
+ warning('Could not connect to URI {}, returning no native '
+ 'URI-G'.format(original_uri))
+ return None
+
+ debug('Request headers sent to search for URI-G: '
+ + str(org_response.request.headers))
+
+ return super().get_native_timegate_uri(original_uri, accept_datetime,
+ response=org_response, **kwargs)
+
+ @staticmethod
+ def is_timegate(uri: str,
+ accept_datetime: Optional[str] = None,
+ response: Optional[requests.Response] = None,
+ session: Optional[requests.Session] = None,
+ timeout: Optional[int] = None) -> bool:
+ """Checks if the given uri is a valid timegate according to the RFC.
+
+ :param uri: the http uri to check.
+ :param accept_datetime: the accept datetime string in http date
+ format.
+ :param response: the response object of the uri.
+ :param session: the requests session object.
+ :param timeout: the timeout value for the HTTP connection.
+ :return: True if a valid timegate, else False.
+ """
+ if not response:
+ if not accept_datetime:
+ accept_datetime = MementoClient.convert_to_http_datetime(
+ datetime.now())
+
+ response = MementoClient.request_head(
+ uri,
+ accept_datetime=accept_datetime,
+ session=session,
+ timeout=timeout
+ )
+ return old_is_timegate(
+ uri, accept_datetime, response=response, session=session)
+
+ @staticmethod
+ def is_memento(uri: str,
+ response: Optional[requests.Response] = None,
+ session: Optional[requests.Session] = None,
+ timeout: Optional[int] = None) -> bool:
+ """
+ Determines if the URI given is indeed a Memento.
+
+ The simple case is to look for a Memento-Datetime header in the
+ request, but not all archives are Memento-compliant yet.
+
+ :param uri: an HTTP URI for testing
+ :param response: the response object of the uri.
+ :param session: the requests session object.
+ :param timeout: (int) the timeout value for the HTTP connection.
+ :return: True if a Memento, False otherwise
+ """
+ if not response:
+ response = MementoClient.request_head(uri,
+ follow_redirects=False,
+ session=session,
+ timeout=timeout)
+ return old_is_memento(uri, response=response)
+
+ @staticmethod
+ def convert_to_http_datetime(dt: Optional[datetime]) -> str:
+ """Converts a datetime object to a date string in HTTP format.
+
+ :param dt: A datetime object.
+ :return: The date in HTTP format.
+ :raises TypeError: Expecting dt parameter to be of type datetime.
+ """
+ if dt and not isinstance(dt, datetime):
+ raise TypeError(
+ 'Expecting dt parameter to be of type datetime.')
+ return old_convert_to_http_datetime(dt)
+
+ @staticmethod
+ def request_head(uri: str,
+ accept_datetime: Optional[str] = None,
+ follow_redirects: bool = False,
+ session: Optional[requests.Session] = None,
+ timeout: Optional[int] = None) -> requests.Response:
+ """Makes HEAD requests.
+
+ :param uri: the uri for the request.
+ :param accept_datetime: the accept-datetime in the http format.
+ :param follow_redirects: Toggle to follow redirects. False by
+ default, so does not follow any redirects.
+ :param session: the request session object to avoid opening new
+ connections for every request.
+ :param timeout: the timeout for the HTTP requests.
+ :return: the response object.
+ :raises ValueError: Only HTTP URIs are supported
+ """
+ headers = {
+ 'Accept-Datetime': accept_datetime} if accept_datetime else {}
+
+ # create a session if not supplied
+ session_set = False
+ if not session:
+ session = requests.Session()
+ session_set = True
+ try:
+ response = session.head(uri,
+ headers=headers,
+ allow_redirects=follow_redirects,
+ timeout=timeout or 9)
+ except (InvalidSchema, MissingSchema):
+ raise ValueError('Only HTTP URIs are supported, '
+ 'URI {} unrecognized.'.format(uri))
+ if session_set:
+ session.close()
+
+ return response
-docuReplacements = {'¶ms;': pagegenerators.parameterHelp} # noqa: N816
-
-ignorelist = [
-    # Officially reserved for testing, documentation, etc. in
-    # https://datatracker.ietf.org/doc/html/rfc2606#page-2
-    # top-level domains:
-    re.compile(r'.*[\./@]test(/.*)?'),
-    re.compile(r'.*[\./@]example(/.*)?'),
-    re.compile(r'.*[\./@]invalid(/.*)?'),
-    re.compile(r'.*[\./@]localhost(/.*)?'),
-    # second-level domains:
-    re.compile(r'.*[\./@]example\.com(/.*)?'),
-    re.compile(r'.*[\./@]example\.net(/.*)?'),
-    re.compile(r'.*[\./@]example\.org(/.*)?'),
-
-    # Other special cases
-    re.compile(r'.*[\./@]berlinonline\.de(/.*)?'),
-    # above entry to be manually fixed per request at
-    # [[de:Benutzer:BLueFiSH.as/BZ]]
-    # bot can't handle their redirects:
-
-    # bot rejected on the site, already archived
-    re.compile(r'.*[\./@]web\.archive\.org(/.*)?'),
-
-    # Ignore links containing * in domain name
-    # as they are intentionally fake
-    re.compile(r'https?\:\/\/\*(/.*)?'),
-]
+# Save old static methods and update static methods of parent class
+old_is_timegate = OldMementoClient.is_timegate
+old_is_memento = OldMementoClient.is_memento
+old_convert_to_http_datetime = OldMementoClient.convert_to_http_datetime
+OldMementoClient.is_timegate = MementoClient.is_timegate
+OldMementoClient.is_memento = MementoClient.is_memento
+OldMementoClient.convert_to_http_datetime \
+ = MementoClient.convert_to_http_datetime
+OldMementoClient.request_head = MementoClient.request_head
-def _get_closest_memento_url(url, when=None, timegate_uri=None):
+def get_closest_memento_url(url: str,
+ when: Optional[datetime] = None,
+ timegate_uri: Optional[str] = None):
"""Get most recent memento for url."""
if not when:
when = datetime.datetime.now()
- mc = memento_client.MementoClient()
+ mc = MementoClient()
if timegate_uri:
mc.timegate_uri = timegate_uri
@@ -191,561 +317,17 @@
except (requests.ConnectionError, MementoClientException) as e:
error = e
retry_count += 1
- pywikibot.sleep(config.retry_wait)
+ sleep(config.retry_wait)
else:
raise error
mementos = memento_info.get('mementos')
if not mementos:
- raise Exception(
- 'mementos not found for {} via {}'.format(url, timegate_uri))
- if 'closest' not in mementos:
- raise Exception(
- 'closest memento not found for {} via {}'.format(
- url, timegate_uri))
- if 'uri' not in mementos['closest']:
- raise Exception(
- 'closest memento uri not found for {} via {}'.format(
- url, timegate_uri))
- return mementos['closest']['uri'][0]
-
-
-def get_archive_url(url):
- """Get archive URL."""
- try:
- archive = _get_closest_memento_url(
- url,
- timegate_uri='http://web.archive.org/web/')
- except Exception:
- archive = _get_closest_memento_url(
- url,
- timegate_uri='http://timetravel.mementoweb.org/webcite/timegate/')
-
- # FIXME: Hack for T167463: Use https instead of http for archive.org links
- if archive.startswith('http://web.archive.org'):
- archive = archive.replace('http://', 'https://', 1)
- return archive
-
-
-def weblinks_from_text(
- text,
- without_bracketed: bool = False,
- only_bracketed: bool = False
-):
- """
- Yield web links from text.
-
- Only used as text predicate for XmlDumpPageGenerator to speed up
- generator.
-
- TODO: move to textlib
- """
- text = textlib.removeDisabledParts(text)
-
- # Ignore links in fullurl template
- text = re.sub(r'{{\s?fullurl:.[^}]*}}', '', text)
-
- # MediaWiki parses templates before parsing external links. Thus, there
- # might be a | or a } directly after a URL which does not belong to
- # the URL itself.
-
- # First, remove the curly braces of inner templates:
- nested_template_regex = re.compile(r'{{([^}]*?){{(.*?)}}(.*?)}}')
- while nested_template_regex.search(text):
- text = nested_template_regex.sub(r'{{\1 \2 \3}}', text)
-
- # Then blow up the templates with spaces so that the | and }} will not
- # be regarded as part of the link:.
- template_with_params_regex = re.compile(r'{{([^}]*?[^ ])\|([^ ][^}]*?)}}',
- re.DOTALL)
- while template_with_params_regex.search(text):
- text = template_with_params_regex.sub(r'{{ \1 | \2 }}', text)
-
- # Add <blank> at the end of a template
- # URL as last param of multiline template would not be correct
- text = text.replace('}}', ' }}')
-
- # Remove HTML comments in URLs as well as URLs in HTML comments.
- # Also remove text inside nowiki links etc.
- text = textlib.removeDisabledParts(text)
- link_regex = textlib.compileLinkR(without_bracketed, only_bracketed)
- for m in link_regex.finditer(text):
- if m.group('url'):
- yield m.group('url')
- else:
- yield m.group('urlb')
-
-
-XmlDumpPageGenerator = partial(
- _XMLDumpPageGenerator, text_predicate=weblinks_from_text)
-
-
-class NotAnURLError(BaseException):
-
- """The link is not an URL."""
-
-
-class LinkCheckThread(threading.Thread):
-
- """A thread responsible for checking one URL.
-
- After checking the page, it will die.
- """
-
- #: Collecting start time of a thread for any host
- hosts = {} # type: Dict[str, float]
- lock = threading.Lock()
-
- def __init__(self, page, url, history, http_ignores, day) -> None:
- """Initializer."""
- self.page = page
- self.url = url
- self.history = history
- self.header = {
- 'Accept': 'text/xml,application/xml,application/xhtml+xml,'
- 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
- 'Accept-Language': 'de-de,de;q=0.8,en-us;q=0.5,en;q=0.3',
- 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
- 'Keep-Alive': '30',
- 'Connection': 'keep-alive',
- }
- # identification for debugging purposes
- self.http_ignores = http_ignores
- self._use_fake_user_agent = config.fake_user_agent_default.get(
- 'weblinkchecker', False)
- self.day = day
- super().__init__()
-
- @classmethod
- def get_delay(cls, name: str) -> float:
- """Determine delay from class attribute.
-
- Store the last call for a given hostname with an offset of
- 6 seconds to ensure there are no more than 10 calls per minute
- for the same host. Calculate the delay to start the run.
-
- :param name: The key for the hosts class attribute
- :return: The calulated delay to start the run
- """
- now = time.monotonic()
- with cls.lock:
- timestamp = cls.hosts.get(name, now)
- cls.hosts[name] = max(now, timestamp) + 6
- return max(0, timestamp - now)
-
- def run(self):
- """Run the bot."""
- time.sleep(self.get_delay(self.name))
- try:
- header = self.header
- r = comms.http.fetch(
- self.url, headers=header,
- use_fake_user_agent=self._use_fake_user_agent)
- except requests.exceptions.InvalidURL:
- message = i18n.twtranslate(self.page.site,
- 'weblinkchecker-badurl_msg',
- {'URL': self.url})
- except Exception:
- pywikibot.output('Exception while processing URL {} in page {}'
- .format(self.url, self.page.title()))
- raise
-
- if (
- r.status_code != HTTPStatus.OK
- or r.status_code in self.http_ignores
- ):
- message = HTTPStatus(r.status_code).phrase
- pywikibot.output('*{} links to {} - {}.'
- .format(self.page.title(as_link=True), self.url,
- message))
- self.history.set_dead_link(self.url, message, self.page,
- config.weblink_dead_days)
- elif self.history.set_link_alive(self.url):
- pywikibot.output(
- '*Link to {} in {} is back alive.'
- .format(self.url, self.page.title(as_link=True)))
-
-
-class History:
-
- """
- Store previously found dead links.
-
- The URLs are dictionary keys, and
- values are lists of tuples where each tuple represents one time the URL was
- found dead. Tuples have the form (title, date, error) where title is the
- wiki page where the URL was found, date is an instance of time, and error
- is a string with error code and message.
-
- We assume that the first element in the list represents the first time we
- found this dead link, and the last element represents the last time.
-
- Example::
-
- dict = {
- 'https://www.example.org/page': [
- ('WikiPageTitle', DATE, '404: File not found'),
- ('WikiPageName2', DATE, '404: File not found'),
- ]
- }
- """
-
- def __init__(self, report_thread, site=None) -> None:
- """Initializer."""
- self.report_thread = report_thread
- if not site:
- self.site = pywikibot.Site()
- else:
- self.site = site
- self.semaphore = threading.Semaphore()
- self.datfilename = pywikibot.config.datafilepath(
- 'deadlinks', 'deadlinks-{}-{}.dat'.format(self.site.family.name,
- self.site.code))
- # Count the number of logged links, so that we can insert captions
- # from time to time
- self.log_count = 0
- try:
- with open(self.datfilename, 'rb') as datfile:
- self.history_dict = pickle.load(datfile)
- except (OSError, EOFError):
- # no saved history exists yet, or history dump broken
- self.history_dict = {}
-
- def log(self, url, error, containing_page, archive_url) -> None:
- """Log an error report to a text file in the deadlinks subdirectory."""
- if archive_url:
- error_report = '* {} ([{} archive])\n'.format(url, archive_url)
- else:
- error_report = '* {}\n'.format(url)
- for (page_title, date, error) in self.history_dict[url]:
- # ISO 8601 formulation
- iso_date = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(date))
- error_report += '** In [[{}]] on {}, {}\n'.format(
- page_title, iso_date, error)
- pywikibot.output('** Logging link for deletion.')
- txtfilename = pywikibot.config.datafilepath('deadlinks',
- 'results-{}-{}.txt'
- .format(
- self.site.family.name,
- self.site.lang))
- with codecs.open(txtfilename, 'a', 'utf-8') as txtfile:
- self.log_count += 1
- if self.log_count % 30 == 0:
- # insert a caption
- txtfile.write('=== {} ===\n'
- .format(containing_page.title()[:3]))
- txtfile.write(error_report)
-
- if self.report_thread and not containing_page.isTalkPage():
- self.report_thread.report(url, error_report, containing_page,
- archive_url)
-
- def set_dead_link(self, url, error, page, weblink_dead_days) -> None:
- """Add the fact that the link was found dead to the .dat file."""
- with self.semaphore:
- now = time.time()
- if url in self.history_dict:
- time_since_first_found = now - self.history_dict[url][0][1]
- time_since_last_found = now - self.history_dict[url][-1][1]
- # if the last time we found this dead link is less than an hour
- # ago, we won't save it in the history this time.
- if time_since_last_found > 60 * 60:
- self.history_dict[url].append((page.title(), now, error))
- # if the first time we found this link longer than x day ago
- # (default is a week), it should probably be fixed or removed.
- # We'll list it in a file so that it can be removed manually.
- if time_since_first_found > 60 * 60 * 24 * weblink_dead_days:
- # search for archived page
- try:
- archive_url = get_archive_url(url)
- except Exception as e:
- pywikibot.warning(
- 'get_closest_memento_url({}) failed: {}'.format(
- url, e))
- archive_url = None
- self.log(url, error, page, archive_url)
- else:
- self.history_dict[url] = [(page.title(), now, error)]
-
- def set_link_alive(self, url) -> bool:
- """
- Record that the link is now alive.
-
- If link was previously found dead, remove it from the .dat file.
-
- :return: True if previously found dead, else returns False.
- """
- if url in self.history_dict:
- with self.semaphore, suppress(KeyError):
- del self.history_dict[url]
- return True
-
- return False
-
- def save(self) -> None:
- """Save the .dat file to disk."""
- with open(self.datfilename, 'wb') as f:
- pickle.dump(self.history_dict, f, protocol=config.pickle_protocol)
-
-
-class DeadLinkReportThread(threading.Thread):
-
- """
- A Thread that is responsible for posting error reports on talk pages.
-
- There is only one DeadLinkReportThread, and it is using a semaphore to make
- sure that two LinkCheckerThreads cannot access the queue at the same time.
- """
-
- def __init__(self) -> None:
- """Initializer."""
- super().__init__()
- self.semaphore = threading.Semaphore()
- self.queue = []
- self.finishing = False
- self.killed = False
-
- def report(self, url, error_report, containing_page, archive_url) -> None:
- """Report error on talk page of the page containing the dead link."""
- with self.semaphore:
- self.queue.append((url, error_report, containing_page,
- archive_url))
-
- def shutdown(self) -> None:
- """Finish thread."""
- self.finishing = True
-
- def kill(self) -> None:
- """Kill thread."""
- # TODO: remove if unneeded
- self.killed = True
-
- def run(self) -> None:
- """Run thread."""
- while not self.killed:
- if not self.queue:
- if self.finishing:
- break
- time.sleep(0.1)
- continue
-
- with self.semaphore:
- url, error_report, containing_page, archive_url = self.queue[0]
- self.queue = self.queue[1:]
- talk_page = containing_page.toggleTalkPage()
- pywikibot.output('<<lightaqua>>** Reporting dead link on {}...'
- '<<default>>'.format(talk_page))
- try:
- content = talk_page.get() + '\n\n\n'
- if url in content:
- pywikibot.output('<<lightaqua>>** Dead link seems to '
- 'have already been reported on {}'
- '<<default>>'.format(talk_page))
- continue
- except (NoPageError, IsRedirectPageError):
- content = ''
-
- if archive_url:
- archive_msg = '\n' + i18n.twtranslate(
- containing_page.site, 'weblinkchecker-archive_msg',
- {'URL': archive_url})
- else:
- archive_msg = ''
- # The caption will default to "Dead link". But if there
- # is already such a caption, we'll use "Dead link 2",
- # "Dead link 3", etc.
- caption = i18n.twtranslate(containing_page.site,
- 'weblinkchecker-caption')
- i = 1
- count = ''
- # Check if there is already such a caption on
- # the talk page.
- while re.search('= *{}{} *='
- .format(caption, count), content) is not None:
- i += 1
- count = ' ' + str(i)
- caption += count
- content += '== {0} ==\n\n{3}\n\n{1}{2}\n--~~~~'.format(
- caption, error_report, archive_msg,
- i18n.twtranslate(containing_page.site,
- 'weblinkchecker-report'))
-
- comment = '[[{}#{}|→]] {}'.format(
- talk_page.title(), caption,
- i18n.twtranslate(containing_page.site,
- 'weblinkchecker-summary'))
- try:
- talk_page.put(content, comment)
- except SpamblacklistError as error:
- pywikibot.output(
- '<<lightaqua>>** SpamblacklistError while trying to '
- 'change {}: {}<<default>>'
- .format(talk_page, error.url))
-
-
-class WeblinkCheckerRobot(SingleSiteBot, ExistingPageBot):
-
- """
- Bot which will search for dead weblinks.
-
- It uses several LinkCheckThreads at once to process pages from generator.
- """
-
- use_redirects = False
-
- def __init__(self, http_ignores=None, day: int = 7, **kwargs) -> None:
- """Initializer."""
- super().__init__(**kwargs)
-
- if config.report_dead_links_on_talk:
- pywikibot.log('Starting talk page thread')
- report_thread = DeadLinkReportThread()
- report_thread.start()
- else:
- report_thread = None
- self.history = History(report_thread, site=self.site)
- self.http_ignores = http_ignores or []
- self.day = day
-
- # Limit the number of threads started at the same time
- self.threads = ThreadList(limit=config.max_external_links,
- wait_time=config.retry_wait)
-
- def treat_page(self) -> None:
- """Process one page."""
- page = self.current_page
- for url in page.extlinks():
- for ignore_regex in ignorelist:
- if ignore_regex.match(url):
- break
- else:
- # Each thread will check one page, then die.
- thread = LinkCheckThread(page, url, self.history,
- self.http_ignores, self.day)
- # thread dies when program terminates
- thread.daemon = True
- # use hostname as thread.name
- thread.name = removeprefix(
- urlparse.urlparse(url).hostname, 'www.')
- self.threads.append(thread)
-
- def teardown(self) -> None:
- """Finish remaining threads and save history file."""
- num = self.count_link_check_threads()
- if num:
- pywikibot.info('<<lightblue>>Waiting for remaining {} threads '
- 'to finish, please wait...'.format(num))
-
- while self.count_link_check_threads():
- try:
- time.sleep(0.1)
- except KeyboardInterrupt:
- # Threads will die automatically because they are daemonic.
- if pywikibot.input_yn('There are {} pages remaining in the '
- 'queue. Really exit?'
- .format(self.count_link_check_threads()),
- default=False, automatic_quit=False):
- break
-
- num = self.count_link_check_threads()
- if num:
- pywikibot.info('<<yellow>>>Remaining {} threads will be killed.'
- .format(num))
-
- if self.history.report_thread:
- self.history.report_thread.shutdown()
- # wait until the report thread is shut down; the user can
- # interrupt it by pressing CTRL-C.
- try:
- while self.history.report_thread.is_alive():
- time.sleep(0.1)
- except KeyboardInterrupt:
- pywikibot.info('Report thread interrupted.')
- self.history.report_thread.kill()
-
- pywikibot.info('Saving history...')
- self.history.save()
-
- @staticmethod
- def count_link_check_threads() -> int:
- """Count LinkCheckThread threads.
-
- :return: number of LinkCheckThread threads
- """
- return sum(isinstance(thread, LinkCheckThread)
- for thread in threading.enumerate())
-
-
-def RepeatPageGenerator(): # noqa: N802
- """Generator for pages in History."""
- history = History(None)
- page_titles = set()
- for value in history.history_dict.values():
- for entry in value:
- page_titles.add(entry[0])
- for page_title in sorted(page_titles):
- page = pywikibot.Page(pywikibot.Site(), page_title)
- yield page
-
-
-def main(*args: str) -> None:
- """
- Process command line arguments and invoke bot.
-
- If args is an empty list, sys.argv is used.
-
- :param args: command line arguments
- """
- gen = None
- xml_filename = None
- http_ignores = []
-
- # Process global args and prepare generator args parser
- local_args = pywikibot.handle_args(args)
- gen_factory = pagegenerators.GeneratorFactory()
-
- for arg in local_args:
- if arg == '-talk':
- config.report_dead_links_on_talk = True
- elif arg == '-notalk':
- config.report_dead_links_on_talk = False
- elif arg == '-repeat':
- gen = RepeatPageGenerator()
- elif arg.startswith('-ignore:'):
- http_ignores.append(int(arg[8:]))
- elif arg.startswith('-day:'):
- config.weblink_dead_days = int(arg[5:])
- elif arg.startswith('-xmlstart'):
- if len(arg) == 9:
- xml_start = pywikibot.input(
- 'Please enter the dumped article to start with:')
- else:
- xml_start = arg[10:]
- elif arg.startswith('-xml'):
- if len(arg) == 4:
- xml_filename = i18n.input('pywikibot-enter-xml-filename')
- else:
- xml_filename = arg[5:]
- else:
- gen_factory.handle_arg(arg)
-
- if xml_filename:
- try:
- xml_start
- except NameError:
- xml_start = None
- gen = XmlDumpPageGenerator(xml_filename, xml_start,
- gen_factory.namespaces)
-
- if not gen:
- gen = gen_factory.getCombinedGenerator()
-
- if not suggest_help(missing_generator=not gen,
- missing_dependencies=missing_dependencies):
- bot = WeblinkCheckerRobot(http_ignores, config.weblink_dead_days,
- generator=gen)
- bot.run()
-
-
-if __name__ == '__main__':
- main()
+ err_msg = 'mementos not found for {} via {}'
+ elif 'closest' not in mementos:
+ err_msg = 'closest memento not found for {} via {}'
+ elif 'uri' not in mementos['closest']:
+ err_msg = 'closest memento uri not found for {} via {}'
+ else:
+ return mementos['closest']['uri'][0]
+    raise Exception(err_msg.format(url, timegate_uri))
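The module ends by rebinding the parent class's static methods (is_timegate,
is_memento, convert_to_http_datetime, request_head) onto OldMementoClient, so
the timeout-aware versions also take effect for code paths inside the upstream
memento_client package that resolve those statics through the class. A generic
sketch of that pattern (class and method names are illustrative, not the
memento_client API):

    import requests


    class ThirdParty:
        """Stand-in for an upstream class (illustrative only)."""

        @staticmethod
        def fetch(uri):
            # Upstream behaviour: no timeout, so a stalled server can hang.
            return requests.head(uri)

        def check(self, uri):
            # Upstream code resolves the static method through the class ...
            return ThirdParty.fetch(uri).ok


    class Fixed(ThirdParty):

        @staticmethod
        def fetch(uri, timeout=9):
            # Fixed behaviour: always pass a timeout.
            return requests.head(uri, timeout=timeout)


    # ... so rebinding the attribute on the parent makes even old code paths
    # (including code that still uses ThirdParty directly) pick up the fix.
    ThirdParty.fetch = Fixed.fetch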
diff --git a/requirements.txt b/requirements.txt
index 5a4796f..10eddfb 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -58,4 +58,4 @@
beautifulsoup4
# scripts/weblinkchecker.py
-memento_client>=0.5.1,!=0.6.0
+memento_client==0.6.1
diff --git a/scripts/_weblinkchecker.py b/scripts/weblinkchecker.py
similarity index 94%
rename from scripts/_weblinkchecker.py
rename to scripts/weblinkchecker.py
index 398ba64..215b8d8 100755
--- a/scripts/_weblinkchecker.py
+++ b/scripts/weblinkchecker.py
@@ -109,7 +109,6 @@
# Distributed under the terms of the MIT license.
#
import codecs
-import datetime
import pickle
import re
import threading
@@ -137,9 +136,8 @@
try:
- import memento_client
- from memento_client.memento_client import MementoClientException
- missing_dependencies = None
+ from pywikibot.data.memento import get_closest_memento_url
+ missing_dependencies = []
except ImportError:
missing_dependencies = ['memento_client']
@@ -174,50 +172,13 @@
]
-def _get_closest_memento_url(url, when=None, timegate_uri=None):
- """Get most recent memento for url."""
- if not when:
- when = datetime.datetime.now()
-
- mc = memento_client.MementoClient()
- if timegate_uri:
- mc.timegate_uri = timegate_uri
-
- retry_count = 0
- while retry_count <= config.max_retries:
- try:
- memento_info = mc.get_memento_info(url, when)
- break
- except (requests.ConnectionError, MementoClientException) as e:
- error = e
- retry_count += 1
- pywikibot.sleep(config.retry_wait)
- else:
- raise error
-
- mementos = memento_info.get('mementos')
- if not mementos:
- raise Exception(
- 'mementos not found for {} via {}'.format(url, timegate_uri))
- if 'closest' not in mementos:
- raise Exception(
- 'closest memento not found for {} via {}'.format(
- url, timegate_uri))
- if 'uri' not in mementos['closest']:
- raise Exception(
- 'closest memento uri not found for {} via {}'.format(
- url, timegate_uri))
- return mementos['closest']['uri'][0]
-
-
def get_archive_url(url):
"""Get archive URL."""
try:
- archive = _get_closest_memento_url(
- url,
- timegate_uri='http://web.archive.org/web/')
+ archive = get_closest_memento_url(
+ url, timegate_uri='http://web.archive.org/web/')
except Exception:
- archive = _get_closest_memento_url(
+ archive = get_closest_memento_url(
url,
timegate_uri='http://timetravel.mementoweb.org/webcite/timegate/')
diff --git a/setup.py b/setup.py
index e6e0925..3ab222a 100755
--- a/setup.py
+++ b/setup.py
@@ -60,6 +60,7 @@
'isbn': ['python-stdnum>=1.17'],
'Graphviz': ['pydot>=1.2'],
'Google': ['google>=1.7'],
+ 'memento': ['memento_client==0.6.1'],
'mwparserfromhell': ['mwparserfromhell>=0.5.0'],
'wikitextparser': ['wikitextparser>=0.47.5; python_version < "3.6"',
'wikitextparser>=0.47.0; python_version >= "3.6"'],
@@ -99,7 +100,7 @@
script_deps = {
'commons_information.py': extra_deps['mwparserfromhell'],
'patrol.py': extra_deps['mwparserfromhell'],
- 'weblinkchecker.py': ['memento_client!=0.6.0,>=0.5.1'],
+ 'weblinkchecker.py': extra_deps['memento'],
}
extra_deps.update(script_deps)
diff --git a/tests/__init__.py b/tests/__init__.py
index 6606d55..39ce2b0 100644
--- a/tests/__init__.py
+++ b/tests/__init__.py
@@ -101,6 +101,7 @@
'logentries',
'login',
'mediawikiversion',
+ 'memento',
'mysql',
'namespace',
'oauth',
@@ -158,7 +159,6 @@
'script',
'template_bot',
'uploadscript',
- 'weblinkchecker'
}
disabled_test_modules = {
diff --git a/tests/weblinkchecker_tests.py b/tests/memento_tests.py
similarity index 88%
rename from tests/weblinkchecker_tests.py
rename to tests/memento_tests.py
index 9dd1109..a8e2768 100755
--- a/tests/weblinkchecker_tests.py
+++ b/tests/memento_tests.py
@@ -12,7 +12,6 @@
from requests.exceptions import ConnectionError as RequestsConnectionError
-from scripts import weblinkchecker
from tests.aspects import TestCase, require_modules
@@ -22,15 +21,17 @@
"""Test memento client."""
def _get_archive_url(self, url, date_string=None):
- from memento_client.memento_client import MementoClientException
+ from pywikibot.data.memento import (
+ MementoClientException,
+ get_closest_memento_url,
+ )
if date_string is None:
when = datetime.datetime.now()
else:
when = datetime.datetime.strptime(date_string, '%Y%m%d')
try:
- result = weblinkchecker._get_closest_memento_url(
- url, when, self.timegate_uri)
+ result = get_closest_memento_url(url, when, self.timegate_uri)
except (RequestsConnectionError, MementoClientException) as e:
self.skipTest(e)
return result
@@ -72,8 +73,7 @@
"""Test getting memento for invalid URL."""
# memento_client raises 'Exception', not a subclass.
with self.assertRaisesRegex(
- Exception,
- 'Only HTTP URIs are supported'):
+ ValueError, 'Only HTTP URIs are supported'):
self._get_archive_url('invalid')
diff --git a/tox.ini b/tox.ini
index 8278dfa..bc97202 100644
--- a/tox.ini
+++ b/tox.ini
@@ -70,6 +70,7 @@
nosetests --with-doctest pywikibot {[params]doctest_skip}
deps =
nose
+ .[memento]
.[mwparserfromhell]
[testenv:venv]
--
To view, visit https://gerrit.wikimedia.org/r/c/pywikibot/core/+/803232
To unsubscribe, or for help writing mail filters, visit https://gerrit.wikimedia.org/r/settings
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Change-Id: I137a27ad198f0e0aae713c888401265f7aca187b
Gerrit-Change-Number: 803232
Gerrit-PatchSet: 22
Gerrit-Owner: Xqt <info@gno.de>
Gerrit-Reviewer: D3r1ck01 <xsavitar.wiki@aol.com>
Gerrit-Reviewer: Dvorapa <dvorapa@seznam.cz>
Gerrit-Reviewer: Framawiki <framawiki@tools.wmflabs.org>
Gerrit-Reviewer: Matěj Suchánek <matejsuchanek97@gmail.com>
Gerrit-Reviewer: Shawnmjones <jones.shawn.m@gmail.com>
Gerrit-Reviewer: Zhuyifei1999 <zhuyifei1999@gmail.com>
Gerrit-Reviewer: Zhuyifei1999 <zhuyifei1999(a)gmail.com>
Gerrit-Reviewer: jenkins-bot
Gerrit-MessageType: merged