Arquivo404 more powerful!

Arquivo.pt has been launching innovative complementary services useful for organizations to optimize their functioning.

The new release of Arquivo.pt named Helios was launched on November 13, 2023 and includes developments in Arquivo404 and CitationSaver.

Arquivo404 with new methods for defining time intervals

Arquivo404 is a service that presents website users with links to web-archived versions, instead of laconic “Page not found” error messages.

However, it is sometimes necessary to restrict which web-archived version of a page is displayed. For example, a website’s domain may have belonged to another entity in the past, in which case only versions archived since the website came under its current owner should be displayed.

For this purpose, 3 new methods for configuring Arquivo404 were released:

  • setMinimumDate( minDate : Date ) – specifies the earliest date of the web-archived version of the URL that can be displayed.
  • setMaximumDate( maxDate : Date ) – specifies the latest date of the web-archived version of the URL that can be displayed.
  • setMostRelevantMemento( criterion : ‘oldest’ | ‘most-recent’ ) – specifies the order of results for the versions retrieved from the web archive. By default, the oldest version is displayed ( ‘oldest’ ).

In short, Arquivo404 now allows you to define whether to display the oldest or most recent web-archived page to the users, within a certain time interval.
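
The selection behaviour described above can be modelled in a few lines. The following is only a sketch, not the Arquivo404 source code: the `Memento` type, the `pickMemento` function and the example addresses are hypothetical names introduced for illustration, and treating the date bounds as inclusive is an assumption.

```typescript
// Hypothetical model of the selection described above; not the Arquivo404 source code.
type Criterion = "oldest" | "most-recent";

interface Memento {
  url: string;      // address of the archived version (illustrative)
  capturedAt: Date; // when the version was archived
}

function pickMemento(
  mementos: Memento[],
  criterion: Criterion = "oldest", // the text states 'oldest' is the default
  minDate?: Date,
  maxDate?: Date,
): Memento | undefined {
  // Keep only versions inside the interval (bounds assumed inclusive).
  const inRange = mementos.filter(
    (m) =>
      (minDate === undefined || m.capturedAt.getTime() >= minDate.getTime()) &&
      (maxDate === undefined || m.capturedAt.getTime() <= maxDate.getTime()),
  );
  // Order chronologically, then take one end of the list.
  inRange.sort((a, b) => a.capturedAt.getTime() - b.capturedAt.getTime());
  return criterion === "oldest" ? inRange[0] : inRange[inRange.length - 1];
}

// Example: the domain changed owner in 2015, so earlier versions are excluded.
const versions: Memento[] = [
  { url: "https://example.org/archived/2009", capturedAt: new Date("2009-05-01") },
  { url: "https://example.org/archived/2016", capturedAt: new Date("2016-03-10") },
  { url: "https://example.org/archived/2021", capturedAt: new Date("2021-11-20") },
];
const chosen = pickMemento(versions, "oldest", new Date("2015-01-01"));
console.log(chosen ? chosen.url : "no version in range");
// prints "https://example.org/archived/2016"
```

If no archived version falls inside the interval, the sketch returns `undefined`, which corresponds to having nothing to offer the user in place of the error page.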

CitationSaver processes HTML documents

CitationSaver is a service that extracts citations in documents to online resources and archives them. This service is particularly useful for maintaining the integrity of scientific articles and the reproducibility of the experiments and studies described in them.

Many open-access articles are published in hypertext format (HTML). CitationSaver now processes documents in HTML format, in addition to PDF and TXT formats.

For example, a user who finds an article on the Web that contains citations to online resources simply needs to submit the URL of the article to CitationSaver. The URLs cited in the article will be extracted and their content will be web-archived for later access.

Example of an article from the Journal of Integrated Coastal Management, available on SciELO

Know more

Give us feedback about our services and if you detect any issue, please contact us.

Arquivo.pt preserves Wikipedia citations

Wikimedia Portugal and Arquivo.pt

Last updated on September 19th, 2023 at 01:49 pm

Wikipedia is an educational resource degraded by broken links

Arquivo.pt preserves information published online so that it can be used later for research and educational purposes. For example, Arquivo.pt preserved online information about European projects funded by H2020.

Wikipedia articles reference external web pages with important complementary information that became unavailable on their original websites.

Wikipedia articles are among the most used online resources for educational purposes. However, Wikipedia articles sometimes reference external web pages with important complementary information that has disappeared from their original websites. This problem degrades the quality of Wikipedia as a credible and verifiable source of information.

In August 2023, the Arquivo.pt team carried out an experiment to measure the percentage of external links (outside the wikipedia.org domain) that were broken in Portuguese Wikipedia articles. The results showed that 25% of the external links referenced in the Portuguese Wikipedia were broken.

A further problem is that a link may still lead to an accessible web page whose content has changed since it was cited, so that it no longer shows what the reference was meant to point to (the Content Drift problem). The domain of the hosting website may even have been purchased by third parties in the meantime, for example for malicious purposes.

To mitigate these problems, Arquivo.pt launched a project, in collaboration with Wikimedia Portugal, to preserve the online references contained in Portuguese Wikipedia articles. The objective was to replace broken links in the references of Wikipedia articles with links to web-archived content preserved in Arquivo.pt, thus keeping the referenced information accessible to Wikipedia users.

Arquivo.pt archived the pages referenced in Portuguese Wikipedia articles

Wikipedia recommends citing web archives (archiveurl/archive-url parameter).

The Portuguese Wikipedia contains about 1 million articles and on average 140 pages are edited per day.

Arquivo.pt automatically extracted 14 million links from references in all Portuguese Wikipedia articles. Only 620 of these links referenced Arquivo.pt and 744 553 (5.3%) referenced the Internet Archive. Note that Wikipedia’s guide to creating references recommends citing web archives (archiveurl/archive-url parameter).
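
As an illustration of what such a citation can look like, the sketch below assembles a reference that carries both the original address and its archived copy. Only the archive-url parameter comes from the text above. The template name `cite web` and the `url`, `title` and `archive-date` parameters are assumptions based on the English Wikipedia citation template (the Portuguese Wikipedia may use different names), and the `ArchivedCitation` type, the `toCiteWeb` function and the example addresses are hypothetical.

```typescript
// Illustrative only: builds a citation in wikitext with an archived copy.
// Template and parameter names other than archive-url are assumptions.
interface ArchivedCitation {
  url: string;         // original address cited in the article
  title: string;       // title of the cited page
  archiveUrl: string;  // address of the web-archived version
  archiveDate: string; // date of the archived version, e.g. "2023-02-15"
}

function toCiteWeb(c: ArchivedCitation): string {
  return (
    "{{cite web" +
    ` |url=${c.url}` +
    ` |title=${c.title}` +
    ` |archive-url=${c.archiveUrl}` +
    ` |archive-date=${c.archiveDate}` +
    "}}"
  );
}

// Example with made-up addresses; the archived address format is an assumption.
console.log(
  toCiteWeb({
    url: "https://example.org/report",
    title: "Report",
    archiveUrl: "https://arquivo.pt/wayback/20230215000000/https://example.org/report",
    archiveDate: "2023-02-15",
  }),
);
```

The sketch does not escape special characters, so a title containing a vertical bar would need extra handling before being placed in a template.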

On February 15, 2023, Arquivo.pt collected all pages referenced in Portuguese Wikipedia articles, which resulted in a new collection named EAWP42: Collection of external links from wikipedia using the wikimedia dumps, which contains 12 million files (856 GB).

The main result of this project was the creation of a new automatic process for extracting and collecting external links cited on Portuguese Wikipedia pages. As a result, Arquivo.pt will perform an annual crawl of Wikipedia citations.

Attempt to automatically fix broken links in Wikipedia articles

InternetArchiveBot offers powerful operation and monitoring tools (e.g. Dashboard and Insights)

There are software robots that automatically add links to web-archived content when they find broken links in Wikipedia articles (e.g. Pywikibot, Wayback Medic or InternetArchiveBot).

An experiment was carried out to create an ArquivoPTBot based on the InternetArchiveBot, because it offers powerful operation and monitoring tools (e.g. Dashboard and Insights) and is maintained by the Internet Archive, the largest web archive in the world.

However, it was not possible to launch this service in production because it would require changes to the software so that it could use Arquivo.pt as a source of web-archived information.

If you want to collaborate to resume this project, please contact us!

Preserving Wikipedia references is at your fingertips!

Arquivo.pt remains committed to helping preserve the links cited in Wikipedia articles and offers the following services that may be useful to you.

Arquivo.pt CitationSaver: preserves citations to online content (https://arquivo.pt/citationsaver).

CitationSaver allows you to submit the source code of a Wikipedia article, and Arquivo.pt will automatically extract the referenced links and archive the corresponding content.

Arquivo.pt SavePageNow: saves pages in Arquivo.pt (https://arquivo.pt/savepagenow).

SavePageNow allows you to immediately archive a web page in Arquivo.pt, for example a page that is referenced in a Wikipedia article, so that it doesn’t get lost.

Training Arquivo.pt/Wikimedia

Wikimedia Portugal, in collaboration with Arquivo.pt, promoted a set of webinars to draw the community’s attention to the preservation of content published and cited on Wikipedia. The videos and slides of these webinars are available:

CitationSaver preserves citations to web resources

Last updated on April 20th, 2023 at 09:37 pm

Documents cite web content by referencing their URLs so that readers can later access them.

In scientific articles, these citations are even more important for maintaining the integrity of research work, because they often reference information that is essential to reproduce an experiment or analysis.

For example, links in a scientific article may cite the datasets, software or web news that supported the research, which are not included in the text of the article.

To respond to the need to preserve the integrity of documents, Arquivo.pt launched CitationSaver.

CitationSaver automatically extracts cited links in a document and preserves their content (e.g. web pages cited in a book) so that they can be retrieved later from Arquivo.pt.
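
The extraction step can be pictured with a short sketch. The rules CitationSaver actually applies are not described here, so the following is only an assumption-laden illustration: the `extractCitedUrls` function and its regular expression are hypothetical, and only addresses starting with http:// or https:// are considered.

```typescript
// Illustrative only: a simple way to pull cited addresses out of plain text.
// This is not CitationSaver's implementation.
function extractCitedUrls(text: string): string[] {
  // Match http(s) addresses up to the next whitespace, quote or angle bracket.
  const pattern = /https?:\/\/[^\s<>"']+/g;
  const urls: string[] = [];
  for (const match of text.match(pattern) || []) {
    // Drop trailing punctuation that usually belongs to the sentence, not the address.
    const url = match.replace(/[.,;:)\]]+$/, "");
    // Keep each address once, in order of first appearance.
    if (urls.indexOf(url) === -1) {
      urls.push(url);
    }
  }
  return urls;
}

const references =
  "Data available at https://example.org/data.csv. " +
  "See also (https://arquivo.pt/citationsaver), and https://example.org/data.csv again.";
console.log(extractCitedUrls(references));
// prints [ 'https://example.org/data.csv', 'https://arquivo.pt/citationsaver' ]
```

Trimming trailing punctuation is a trade-off: it cleans up addresses that end a sentence, but it would also shorten the rare address that legitimately ends with a closing parenthesis.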


Use CitationSaver to preserve the integrity of your documents

Upload a document and CitationSaver will extract the cited URLs, archive their content and make it available on Arquivo.pt after a short while. There are 3 methods to upload a document:

  • insert the address (URL) of the PDF or TXT file, if it is published online
  • upload the file in PDF or TXT format
  • paste the text containing the addresses you want to preserve (e.g. References section of an article or Bibliography of a book).

More information