How to create a billion-scale searchable web archive

Last updated on September 28th, 2017 at 01:26 pm

The Portuguese Web Archive published a study that gives an overview of the lessons learned during its development, focusing on web data acquisition, ranking of search results, and user interface design.

Several organizations around the world are struggling to archive information from the web before it vanishes. However, users demand efficient and effective search mechanisms to access the already vast collections of historical information held by web archives. The Portuguese Web Archive is the largest full-text searchable web archive publicly available. It supports search over 1.2 billion files archived from the web since 1996.

The paper “Creating a Billion-Scale Searchable Web Archive” was presented at the Temporal Web Analytics Workshop 2013 in Rio de Janeiro, Brazil.

WWW 2013: Search the Past with the Portuguese Web Archive

Last updated on August 4th, 2024 at 06:10 pm

The Portuguese Web Archive (PWA) is presenting a demo session at the World Wide Web Conference (WWW 2013) in Rio de Janeiro, Brazil.

The demo at WWW 2013 presents the Portuguese Web Archive, which enables search over 1.6 billion files archived from 1996 to 2012.

New video: “The Portuguese Web Archive: An overview”

Last updated on December 20th, 2019 at 05:26 pm

The Portuguese Web Archive preserves and provides access to information published on the web that is of primary interest to the Portuguese community. It offers a free, publicly available full-text search service over 1 billion web files archived since 1996.

This video provides an overview of the services provided by the Portuguese Web Archive:

New video: “The Portuguese Web Archive and the open access to scientific knowledge”

Last updated on August 4th, 2024 at 06:11 pm

Web archiving helps strengthen open access to science.

There is a growing amount of open access scientific knowledge published on the Web.

This video discusses the importance of web archiving in strengthening open access to science.

Satisfaction surveys of the Portuguese Web Archive at Jornadas FCCN 2012

Last updated on September 29th, 2017 at 02:10 pm

The demonstration session of the Portuguese Web Archive invited participants to try out the Archive and its features in an informal, relaxed setting. The goals were to identify faults and points needing correction, and to record users’ suggestions for improving the system.

Interested participants were invited to take a three-step challenge, which included finding historical pages in the Archive.

The participants were asked to fill in a satisfaction survey with questions about their experience with the Portuguese Web Archive, rated on a scale of increasing satisfaction from 1 to 7. The results show that users liked using the service (6.1 average), that they easily learned to use it (5.9 average), and that they easily found the information they were seeking (5.1 average). Users also said that they would use the service in the future (6.1) and would tell their friends about it (6.2).

The results are positive regarding the quality of the new interface and made it possible to set priorities for future service improvements.

Technical report documents the creation of a searchable web archive

Last updated on August 4th, 2024 at 06:22 pm

This report presents some of the work carried out to create an efficient and effective web archive service, from data acquisition to user interface design.

The results of this research were applied to create the Portuguese Web Archive, which has been publicly available since January 2010. It supports full-text search over 1 billion files archived from 1996 to 2010. The software developed is available as an open source project.