Last updated on September 19th, 2023 at 01:49 pm
Wikipedia is an educational resource degraded by broken links
Arquivo.pt preserves information published online so that it can be used later for research and educational purposes. For example, Arquivo.pt preserved online information about European projects funded by H2020.
Wikipedia articles are among the most used online resources for educational purposes. However, they sometimes reference external web pages with important complementary information that has since disappeared from the original websites. This problem degrades the quality of Wikipedia as a credible and verifiable source of information.
In August 2023, the Arquivo.pt team carried out an experiment to measure the percentage of broken external links (links outside the wikipedia.org domain) in Portuguese Wikipedia articles. The results showed that 25% of the external links referenced in the Portuguese Wikipedia were broken.
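A measurement of this kind could be sketched roughly as follows. This is only an illustration, not the procedure the Arquivo.pt team used: the function names are hypothetical, and the definition of "broken" (an HTTP error status or no response at all) is an assumption, since the experiment's exact criteria are not described here.

```python
import http.client
import urllib.request
from urllib.parse import urlparse


def is_external(url):
    """True when the link points outside the wikipedia.org domain."""
    host = (urlparse(url).hostname or "").lower()
    return not (host == "wikipedia.org" or host.endswith(".wikipedia.org"))


def http_is_broken(url, timeout=10.0):
    """Assumed definition of 'broken': HTTP error status or no response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status >= 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors (4xx/5xx), DNS failures, timeouts, bad URLs.
        return True


def broken_percentage(links, is_broken=http_is_broken):
    """Percentage of external links for which is_broken() returns True."""
    external = [url for url in links if is_external(url)]
    if not external:
        return 0.0
    broken = sum(1 for url in external if is_broken(url))
    return 100.0 * broken / len(external)
```

The checker is passed in as a parameter so that the counting logic can be tried on a small list of links without making network requests. Some servers reject HEAD requests, so a real measurement would likely need a GET fallback and retries.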
A further problem is that a link may still point to an accessible web page whose content has changed since it was cited, so that it no longer contains the information it was meant to reference (the Content Drift problem). The domain of the hosting website may even have been purchased by third parties in the meantime, for example for malicious purposes.
To mitigate these problems, Arquivo.pt launched a project, in collaboration with Wikimedia Portugal, to preserve the online references contained in Portuguese Wikipedia articles. The objective was to replace references to broken links in Wikipedia articles with references to web-archived content preserved in Arquivo.pt, thus keeping the referenced information accessible to Wikipedia users.
Arquivo.pt archived the pages referenced in Portuguese Wikipedia articles
The Portuguese Wikipedia contains about 1 million articles, and on average 140 pages are edited per day.
Arquivo.pt automatically extracted 14 million links from the references in all Portuguese Wikipedia articles. Only 620 of them referenced Arquivo.pt and 744 553 referenced the Internet Archive (5.3%). Note that Wikipedia’s guide to creating references recommends citing web archives (archiveurl/archive-url parameter).
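To illustrate how such counts can be obtained, the following sketch extracts the URLs found in an article's wikitext and tallies how many point to each web archive. It is a simplified assumption of the approach, not Arquivo.pt's actual extraction process: the regular expression, the host list and the function names are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplified URL matcher: stops at whitespace and common wikitext delimiters.
URL_PATTERN = re.compile(r'https?://[^\s\]|<>{}"]+')

# Assumed mapping from host name to web archive.
ARCHIVE_HOSTS = {
    "arquivo.pt": "Arquivo.pt",
    "web.archive.org": "Internet Archive",
    "archive.org": "Internet Archive",
}


def extract_links(wikitext):
    """Return every http(s) URL found in the wikitext, in order."""
    return URL_PATTERN.findall(wikitext)


def classify(url):
    """Name of the web archive a URL points to, or 'other'."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return ARCHIVE_HOSTS.get(host, "other")


def count_by_archive(wikitext):
    """Count the extracted links per web archive."""
    return Counter(classify(url) for url in extract_links(wikitext))
```

An archived URL that embeds the original address (for example a web.archive.org link followed by the original http:// URL) is matched once, as a single link to the archive. As a check on the figures above, 744 553 out of 14 million links corresponds to about 5.3%.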
On February 15, 2023, Arquivo.pt collected all the pages referenced in Portuguese Wikipedia articles. This resulted in a new collection named EAWP42: Collection of external links from wikipedia using the wikimedia dumps, which contains 12 million files (856 GB).
The main result of this project was the creation of a new automatic process for extracting and collecting the external links cited on Portuguese Wikipedia pages. With this process in place, Arquivo.pt will perform an annual crawl of Wikipedia citations.
Attempt to automatically fix broken links in Wikipedia articles
There are software robots which automatically add links to web-archived content when they find broken links in Wikipedia articles (e.g. Pywikibot, Wayback Medic or InternetArchiveBot).
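What such a robot does to a single citation could be sketched as follows. This is a minimal illustration and not the implementation of any of the bots named above: the function names are hypothetical, the Arquivo.pt replay URL pattern (https://arquivo.pt/wayback/ followed by a timestamp and the original URL) is an assumption, and the archive-date and url-status parameters follow the English Wikipedia cite web template, which may differ from the templates used on the Portuguese Wikipedia.

```python
import re


def arquivo_replay_url(url, timestamp):
    """Build a replay URL, assuming the pattern /wayback/<timestamp>/<url>."""
    return "https://arquivo.pt/wayback/" + timestamp + "/" + url


def add_archive_params(citation, archive_url, archive_date):
    """Append archive parameters to a {{cite web ...}} template string."""
    # Leave the citation untouched if it already cites a web archive.
    if re.search(r"\|\s*archive-?url\s*=", citation):
        return citation
    body = citation.rstrip()
    if not body.endswith("}}"):
        raise ValueError("expected a template ending in '}}'")
    head = body[:-2].rstrip()
    return (
        head
        + " |archive-url=" + archive_url
        + " |archive-date=" + archive_date
        + " |url-status=dead}}"
    )
```

For example, calling add_archive_params on a citation of http://example.com/a with a replay URL built by arquivo_replay_url appends the three parameters before the closing braces, and calling it a second time leaves the citation unchanged.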
An experiment was carried out to create an ArquivoPTBot based on the InternetArchiveBot, because it offers powerful operation and monitoring tools (e.g. Dashboard and Insights) and is maintained by the Internet Archive, the largest web archive in the world.
However, it was not possible to launch this service in production because doing so requires changes to the software so that it can use Arquivo.pt as a source of web-archived information.
If you would like to help resume this project, please contact us!
Preserving Wikipedia references is at your fingertips!
Arquivo.pt remains committed to helping preserve the links cited in Wikipedia articles and offers the following services that may be useful to you.
CitationSaver allows you to submit the source code of a Wikipedia article; Arquivo.pt then automatically extracts the referenced links and archives the corresponding content.
SavePageNow allows you to immediately archive a web page in Arquivo.pt, for example one that is referenced in a Wikipedia article, so that it doesn’t get lost.
Training Arquivo.pt/Wikimedia
Wikimedia Portugal, in collaboration with Arquivo.pt, promoted a set of webinars to draw the community’s attention to the preservation of content published and cited on Wikipedia. The videos and slides of these webinars are available: