Arquivo.pt reaches 1 PetaByte of preserved information!

The collection of 1 PetaByte of content, predominantly in Portuguese and accessible to both researchers and ordinary citizens, is a milestone worth celebrating in the month of Arquivo.pt's 16th anniversary.

At Arquivo.pt you can search for information published on the Web in the past, such as:

  • The first European page
  • News from The New York Times in 2008
  • European Film Awards 2014

Discover more pages through the selected pages in the Arquivo.pt Online Exhibitions.

Purpose and mission of the Portuguese Web Archive

Arquivo.pt was created on November 8, 2007 with the aim of preserving content from the Portuguese Web.

In 2013, as a service operated by the Fundação para a Ciência e a Tecnologia (FCT), its mission was formulated as follows: “To promote the preservation of content available on the national Internet, ensuring that it is made available to the scientific community and the general public” (Decreto-Lei no. 55/2013).

In recent years, Arquivo.pt has created new services: CitationSaver, which allows researchers to record references to web content in their scientific articles, and Memorial and Complete page, which facilitate access to content scattered throughout the 1 PetaByte of data.

Where did so much information come from?

In order to reach the 1 PetaByte volume, Arquivo.pt periodically recorded content from websites in the .PT domain and from Portuguese websites in other domains.

In addition, frequent daily and monthly collections were made from a small number of government sites and the main news sites in Portugal.

As part of international collaborations, content was collected from sites in various languages, for example on the 2019 European Elections.

Content prior to 2008 came from the Internet Archive and donations, such as a collection made by the National Library and INESC on the 2005 Legislative Elections.

The largest Portuguese-language dataset available to researchers

By making 1 PetaByte of information available, in open access and through the use of APIs (Application Programming Interfaces), Arquivo.pt is a useful tool for research.

For example, a researcher who wants to do a study on elections in Portugal can use the entire Arquivo.pt collection. Better still, they can focus on just a few special collections dedicated to the elections, choosing the ones that interest them and downloading just a few Terabytes to process automatically with the APIs.
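As a rough illustration of this kind of automated processing, the sketch below builds the address of a full-text search query restricted to a date range. The `textsearch` endpoint and the `q`, `from`, `to` and `maxItems` parameter names are assumptions that should be checked against the Arquivo.pt API documentation, and `buildTextSearchUrl` is a hypothetical helper, not part of any Arquivo.pt library.

```javascript
// Hypothetical helper: builds a query URL for a full-text search.
// The endpoint and parameter names are assumptions, not confirmed by this article.
function buildTextSearchUrl(query, { from, to, maxItems } = {}) {
  const url = new URL("https://arquivo.pt/textsearch");
  url.searchParams.set("q", query);
  // Optional filters are added only when the caller provides them.
  if (from) url.searchParams.set("from", from);
  if (to) url.searchParams.set("to", to);
  if (maxItems) url.searchParams.set("maxItems", String(maxItems));
  return url.toString();
}

// Example: pages about the legislative elections archived during 2005.
const exampleUrl = buildTextSearchUrl("eleições legislativas", {
  from: "20050101000000",
  to: "20051231235959",
  maxItems: 50,
});
console.log(exampleUrl);
```

The helper only assembles the address. Fetching it and paging through the results is left out so that the sketch runs without network access.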

Contributions from the various teams and friends of Arquivo.pt

The development of Arquivo.pt is more than a technological issue: it is due to the dedication and persistence of the various teams that have worked on it since 2007.

It is also due to the contribution of many friends of Arquivo.pt, who were always on hand to help improve it, and to the response of the user community.

Congratulations to all! Thank you.

Arquivo404 more powerful!

Last updated on August 9th, 2024 at 12:59 pm

Arquivo.pt has been launching innovative complementary services that help organizations optimize their operations.

The new release of Arquivo.pt named Helios was launched on November 13, 2023 and includes developments in Arquivo404 and CitationSaver.

Arquivo404 with new methods for defining time intervals

Arquivo404 is a service that presents website users with links to web-archived versions, instead of laconic “Page not found” error messages.

However, sometimes it is necessary to specify the correct version of a web-archived page to be displayed. For example, a website’s domain may have belonged to another entity in the past, and only web-archived versions since the website came under its current owners should be displayed.

For this purpose, three new methods for configuring Arquivo404 were released:

  • setMinimumDate( minDate : Date ) – specifies the earliest date of the web-archived version of the URL that can be displayed.
  • setMaximumDate( maxDate : Date ) – specifies the latest date of the web-archived version of the URL that can be displayed.
  • setMostRelevantMemento( criterion : ‘oldest’ | ‘most-recent’ ) – specifies the order of results for the versions retrieved from the web archive. By default, the oldest version is displayed ( ‘oldest’ ).

In short, Arquivo404 now allows you to define whether to display the oldest or most recent web-archived page to the users, within a certain time interval.
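The selection rule described by these three methods can be sketched as follows. This is only an illustration of the logic, not Arquivo404's source code: `selectMemento` and its option names are hypothetical, and the list of archived dates is made up for the example.

```javascript
// Hypothetical helper: picks one archived version from a list of capture dates.
// minDate / maxDate mirror setMinimumDate / setMaximumDate;
// criterion mirrors setMostRelevantMemento ('oldest' is the default).
function selectMemento(mementoDates, { minDate = null, maxDate = null, criterion = "oldest" } = {}) {
  const inRange = mementoDates
    .filter((d) => (minDate === null || d >= minDate) && (maxDate === null || d <= maxDate))
    .sort((a, b) => a - b); // chronological order, oldest first
  if (inRange.length === 0) return null; // nothing to show in this interval
  return criterion === "most-recent" ? inRange[inRange.length - 1] : inRange[0];
}

// Example: ignore versions captured before the current owners took over in 2012.
const versions = [new Date("2009-05-01"), new Date("2015-03-10"), new Date("2021-11-20")];
const chosen = selectMemento(versions, { minDate: new Date("2012-01-01"), criterion: "most-recent" });
console.log(chosen.toISOString().slice(0, 10)); // 2021-11-20
```

With the default criterion and the same minimum date, the 2015 version would be selected instead.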

CitationSaver processes HTML documents

CitationSaver is a service that extracts citations to online resources from documents and archives the cited content. This service is particularly useful for maintaining the integrity of scientific articles and the reproducibility of the experiments and studies described in them.

Many open-access articles are published in hypertext format (HTML). CitationSaver now processes documents in HTML format, in addition to PDF and TXT formats.

For example, if a user finds an article on the Web which contains citations to online resources, they simply need to submit the URL of the article to CitationSaver. The URLs cited in the article will be extracted and their content will be web-archived for later access.
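The extraction step can be pictured with a minimal sketch like the one below. CitationSaver's actual extraction method is not described here, so this is an assumption-laden simplification: `extractCitedUrls` is a hypothetical function that uses a regular expression, whereas a production service would more likely rely on a proper HTML parser.

```javascript
// Hypothetical sketch: collects the distinct http(s) addresses linked from an HTML string.
// A regular expression is a simplification and will miss links it does not anticipate.
function extractCitedUrls(html) {
  const hrefPattern = /href\s*=\s*["'](https?:\/\/[^"'\s>]+)["']/gi;
  const urls = new Set(); // a Set drops repeated citations of the same address
  for (const match of html.matchAll(hrefPattern)) {
    urls.add(match[1]);
  }
  return [...urls];
}

// Example article fragment with one repeated citation.
const article = `
  <p>Data from <a href="https://example.org/dataset">this dataset</a>,
  described in <a href='http://example.com/report.pdf'>a report</a>
  and reused <a href="https://example.org/dataset">here</a>.</p>`;
console.log(extractCitedUrls(article));
```

Each extracted address would then be submitted for archiving, which is outside the scope of this sketch.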

Example of an article from the Journal of Integrated Coastal Management, available on SciELO

Know more

Give us feedback about our services, and if you detect any issues, please contact us.

Arquivo404 presents web-archived pages instead of “pages not found”

Last updated on November 14th, 2023 at 02:46 pm

Does your website present “Error 404 – Page not found” messages to your users?

Arquivo.pt offers a solution for this problem through Arquivo404.

Just insert a single line of code in the page that generates the 404 error message on your website and web-archived pages will be presented to your users instead of pages not found.

See these examples on websites that installed arquivo404.

How does Arquivo404 work?

When a user tries to access a page that is no longer available on a website, Arquivo404 automatically checks if there is a version of that page preserved in Arquivo.pt.

If the page exists in Arquivo.pt, a link is presented so that the user may visit this version. If it does not exist, the normal error page is displayed.
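The decision described above can be sketched as follows. This is not Arquivo404's source code: `resolveNotFound` and `findArchivedVersion` are hypothetical names, the lookup is replaced by an in-memory table, and the archived address in the example is illustrative only. A real lookup would be an asynchronous request to the web archive.

```javascript
// Hypothetical sketch of the Arquivo404 decision; not the service's actual code.
// `findArchivedVersion` is an assumed lookup that returns the address of a
// preserved version of the page, or null when none exists.
function resolveNotFound(missingUrl, findArchivedVersion) {
  const archivedUrl = findArchivedVersion(missingUrl);
  if (archivedUrl === null) {
    // No preserved version: fall back to the normal error page.
    return { archivedUrl: null, message: "Page not found." };
  }
  // A preserved version exists: offer a link to it.
  return {
    archivedUrl,
    message: "Page not found. An archived version of this page is available.",
  };
}

// Usage with an in-memory stand-in for the archive lookup.
const preserved = new Map([
  ["https://example.org/old-page", "https://arquivo.pt/wayback/20150310000000/https://example.org/old-page"],
]);
const lookup = (url) => preserved.get(url) ?? null;

console.log(resolveNotFound("https://example.org/old-page", lookup).message);
console.log(resolveNotFound("https://example.org/never-archived", lookup).message);
```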

See Arquivo404 at work in this example of an error page that presents a link automatically generated by Arquivo404.

How to install arquivo404 on your website?

The simplest way to install arquivo404 is to insert the following JavaScript in the HTML code that generates the “Page not found” message:

<script type="text/javascript" src="https://arquivo.pt/arquivo404.js" async defer onload="ARQUIVO_NOT_FOUND_404.call();"></script>

The code in Arquivo404 can easily be adapted. You can for example create a customised error message.

Hint for WordPress websites: When editing the 404 error page and inserting the arquivo404 script inside the <body>, you must put the <!-- wp:html --> tag at the beginning and the <!-- /wp:html --> tag at the end, otherwise the script will be deleted.

If you have any questions or issues, please contact us!

Know more

Short link to this page: arquivo.pt/arquivo404en

SavePageNow to record webpages immediately on Arquivo.pt

Last updated on August 23rd, 2022 at 11:51 am

Arquivo.pt launched a new version, called Francisco, on the 19th of January 2022.

The SavePageNow function stands out, allowing anyone to save a Web page to be preserved by Arquivo.pt. It is only necessary to enter a page’s address and browse through its contents.

Arquivo.pt SavePageNow was inspired by the Internet Archive Save Page Now and implemented using Webrecorder pywb.

For example, a publication on the FCCN blog marking the 30th anniversary of the Internet in Portugal was saved with SavePageNow and preserved at Arquivo.pt. In this way, anyone using SavePageNow helps ensure that content published on the Internet is not lost.

Help us to improve!

The user interfaces have been recoded and optimized, so we need your help to test them on different devices of various brands (e.g. mobile phones, tablets, laptops).

If you detect any problems, please contact us!

Remember to always send the address of the page where you detected the problem.

To know more

Create automatic narratives about any topic!

Last updated on August 5th, 2024 at 04:51 pm

Arquivo.pt provides a new function that allows you to automatically create temporal narratives on any topic.

The “Narrative” functionality, integrated into Arquivo.pt in September 2021, is the result of the collaboration between “Conta-me Histórias”, winner of the Arquivo.pt Award 2018, and Arquivo.pt.

The “Conta-me Histórias” (Tell me Stories) project was developed by researchers from the Laboratory of Artificial Intelligence and Decision Support (LIAAD – INESCTEC) affiliated with the Instituto Politécnico de Tomar – Center for Research in Smart Cities (CI2), the University of Porto and the University of Innsbruck.

How does it work?

When a user enters a set of words about a topic in the Arquivo.pt search box and clicks on the “Narrative” button, the user is directed to the “Conta-me Histórias” service, which automatically analyzes the news from 25 websites archived by Arquivo.pt over time and presents a chronology of news related to the topic.

For example, if we search for “Justin Bieber” and click on the “Narrative” button (Figure 1), we will be directed to the “Conta-me Histórias” service, where we will automatically obtain a narrative of archived news (Figure 2).

Figure 1: Search results for pages about “Justin Bieber”.

Figure 2: Narrative of news about “Justin Bieber” from Portuguese news sites preserved by Arquivo.pt generated by the “Conta-me Histórias” service.

Create your narrative now!

“Conta-me Histórias” searches, analyzes and aggregates thousands of results to generate each narrative about a topic. To obtain good narratives, choose descriptive words about well-defined themes, personalities or events.

Creating a narrative is useful for researchers, journalists or citizens who want to quickly get an overview of the evolution of a topic over time, thus saving them a lot of time and effort.

Go to Arquivo.pt and try to create a narrative about a theme of your choice.

Tell us about your experience so we can improve the service!

We improved the Arquivo.pt interface (Basileus release)

Last updated on November 17th, 2020 at 07:00 pm

Arquivo.pt launched a new version, called Basileus, on November 11, 2020.

The purpose of this version was to improve the user experience when browsing through the different interfaces of Arquivo.pt.

Adjustments were made at the level of Web design which resulted in greater consistency in the structure of the code, in the graphic aspects and in the interactions, such as colors, fonts and buttons.

Figure 1: Search and replay interface of Web pages. In the figure, a replay of a Web page from Geocities historical collections on Arquivo.pt.

Help us to improve!

To help us, just search Arquivo.pt using any device (e.g. laptop, mobile phone, tablet).

If you encounter any problems, please contact us!

Remember to always send the address of the page where you detected the problem.

To know more

Replay with old browser and export results with the new version of Arquivo.pt

Last updated on August 4th, 2024 at 05:53 pm

Arquivo.pt launched a new version of its service on July 1, 2020 named Responsive.

The purpose of this version was to improve the user experience between different devices and add new features.

Replay a past webpage using a browser from the past

We added an option to view the archived page using a browser from the past. In the Options, choose Replay with old browser and you will be redirected to the oldweb.today service, which emulates browsers such as Netscape Navigator, Microsoft Internet Explorer or NCSA Mosaic.

This external service is useful for research use cases in areas such as Web design, Art, Communication or History, where it is necessary to access the original visual aspect of a page from the past in the most reliable way possible.

Web page of the European Union in 1996 using the Oldweb.Today service
Web page of the European Map of WWW/NIR sites in 1996 using the Oldweb.Today service

Try this new option from Arquivo.pt to replay the European Map of WWW/NIR sites in 1996, or any other historical page, in a browser from the same period using the Oldweb.Today service.

You may have to wait a while for your request to be processed but it is always faster than having to install a browser from the past on your computer.

Export search results to spreadsheet format

This new function enables users to save their search results for further processing and analysis. This is especially useful for thorough research on a given topic.

After a search, in the Options, just choose one of the available formats to export the obtained results: XLSX, CSV or TXT.

Exported results into an Excel sheet of a search for the word “universidade”, university, limited to 10 items
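To give an idea of what the CSV option involves, here is a minimal sketch that turns a list of search results into CSV text. It is not the exporter used by Arquivo.pt: `toCsv` is a hypothetical helper, and the `title` and `url` field names and sample rows are made up for the example.

```javascript
// Hypothetical sketch: serializes result rows to CSV text.
// Cells containing quotes, commas or line breaks are quoted, with inner quotes doubled.
function toCsv(rows, columns) {
  const escapeCell = (value) => {
    const text = String(value ?? ""); // missing values become empty cells
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const header = columns.map(escapeCell).join(",");
  const lines = rows.map((row) => columns.map((column) => escapeCell(row[column])).join(","));
  return [header, ...lines].join("\r\n");
}

// Example rows with made-up field names and values.
const results = [
  { title: 'Universidade "Nova", Lisboa', url: "https://example.pt/" },
  { title: "Universidade do Porto", url: "https://example.pt/porto" },
];
console.log(toCsv(results, ["title", "url"]));
```

Quoting is the part most often overlooked when writing CSV by hand, since page titles frequently contain commas and quotation marks.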

More on the Responsive milestone

New version of Arquivo.pt (Webapp release)

Last updated on August 5th, 2024 at 04:52 pm

Arquivo.pt launched a new version of its service on April 15, 2020 named WebApp.

The purpose of this version was to standardize the user experience between different devices and reduce maintenance costs by removing components with redundant functions.

Its main novelty is the combination of the desktop and mobile interfaces in a single user interface.

The old desktop version has been disabled and the mobile version has evolved to work on various types of devices and screen sizes.

New design of the homepage

Try the new image and page search

New user interfaces for image or text search

Help us to improve!

To help us, just search Arquivo.pt using any device (e.g. laptop, mobile phone, tablet).

If you encounter any problems, please contact us!

Remember to always send the address of the page where you detected the problem.

More information

Software Engineer Java/Linux needed at Arquivo.pt

Last updated on August 5th, 2024 at 05:01 pm

Arquivo.pt is looking for a Java/Linux Software Engineer to perform the following duties:

  • Maintenance and quality control of the service;
  • Development of large scale distributed computer systems;
  • Software maintenance and corrective evolution;
  • Interaction with the scientific community to establish collaborations.

Requirements

  • Bachelor's, master's or PhD degree in computer science;
  • Experience with: Java J2EE, Java Server Pages, Java Beans, Python, Tomcat, Maven/Ant, Linux and Apache HTTPd servers;
  • Knowledge about architecture, development and operation of distributed computer systems;
  • Good level of English.

Preferences

  • Source code repository management (e.g. Git);
  • Automated testing platform management (e.g. Jenkins, SeleniumHQ, SonarQube);
  • Participation in open source collaboration projects;
  • Experience managing Internet production systems;
  • Knowledge of Big Data/NoSQL (e.g. Hadoop, HBase, Lucene, Solr);
  • Knowledge of Information Retrieval or Machine Learning;
  • Experience in web archive technologies (e.g. Wayback Machine, Heritrix, NutchWAX);
  • Participation in Research & Development projects.

Applications

How to improve an online service (video)?

Last updated on August 5th, 2024 at 05:04 pm

Arquivo.pt recommendations to improve the quality of online services


This presentation provides an overview of the architecture and functioning of the system that supports the Arquivo.pt web archive.

It shares the main lessons learned from the experience of developing and maintaining this service for 10 years.

We believe that these recommendations can be useful to improve the quality of any web-based information system.

Share it with your IT department!