How to create a billion-scale searchable web archive

Last updated on September 28th, 2017 at 01:26 pm

The Portuguese Web Archive published a study that gives an overview of the lessons learned while developing the archive, focusing on web data acquisition, ranking of search results, and user interface design.

Several organizations around the world are struggling to archive information from the web before it vanishes. At the same time, users demand efficient and effective search mechanisms to access the already vast collections of historical information held by web archives. The Portuguese Web Archive is the largest publicly available web archive with full-text search. It supports search over 1.2 billion files archived from the web since 1996.

The paper "Creating a Billion-Scale Searchable Web Archive" was presented at the Temporal Web Analytics Workshop 2013, in Rio de Janeiro, Brazil.