Technology

The main technologies used to implement the Portuguese Web Archive system.

Initially, web archiving initiatives worked independently, each developing its own system from scratch. This wasted considerable resources: every initiative faced the same problems, but each tried to solve them alone. Meanwhile, the web kept evolving and new problems kept appearing. It became clear that archiving the web successfully would require a joint effort.

The software used in the Portuguese Web Archive is provided mainly by the Archive-access project, which brings together several free and open-source tools developed to support web archiving. Relying on open-source software helps ensure the long-term maintenance of the system and the preservation of the archived information.

  • The crawler is implemented using Heritrix and the Deduplicator add-on module;
  • Search is based on NutchWAX and the Lucene search engine;
  • PyWb, developed by Ilya Kreymer, is the software component responsible for replaying the archived web pages;
  • The system supports the Memento protocol, enabling search across multiple web archiving initiatives;
  • The spellchecker uses Hunspell;
  • Distributed data processing is done using Hadoop, a free platform for parallel computing supported by the Apache Software Foundation;
  • The main operating system is Linux (CentOS);
  • The main programming language is Java;
  • The database management system is PostgreSQL;
  • Development and web publishing are supported by GitHub, WordPress, Apache HTTP Server, Tomcat, and MediaWiki.
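
To illustrate how the Memento protocol mentioned in the list works, the sketch below builds the request a client would send to a TimeGate to ask for the archived version of a page closest to a given date. Memento (RFC 7089) carries the desired date in an Accept-Datetime HTTP header formatted as an RFC 1123 date in GMT. The TimeGate prefix and the function names are hypothetical, chosen only for this example; they are not taken from the Portuguese Web Archive code, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate prefix, used only for illustration.
EXAMPLE_TIMEGATE_PREFIX = "https://example.org/timegate/"


def accept_datetime_header(when: datetime) -> str:
    """Format a timezone-aware datetime as an Accept-Datetime header value."""
    if when.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    # Memento expects the date expressed in GMT, e.g. "Fri, 01 Jan 2010 12:00:00 GMT".
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def build_timegate_request(original_url: str, when: datetime,
                           timegate_prefix: str = EXAMPLE_TIMEGATE_PREFIX):
    """Return the (url, headers) pair for a Memento datetime negotiation."""
    url = timegate_prefix + original_url
    headers = {"Accept-Datetime": accept_datetime_header(when)}
    return url, headers


if __name__ == "__main__":
    url, headers = build_timegate_request(
        "http://www.example.com/",
        datetime(2010, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    )
    print(url)
    print(headers)
```

A Memento-compliant archive answers such a request by redirecting to the archived copy whose capture date is closest to the requested one. Because every compliant archive understands the same header, a single client can query several web archives in the same way.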

This open-source technology is a valuable basis for the development of the Portuguese Web Archive system. However, tools built specifically for web archiving are constantly evolving and often cannot be used as off-the-shelf products. Their installation and operation are frequently undocumented, and there are errors and incompatibilities between releases. The decision to use the Archive-access tools therefore requires us to take part in improving them and in finding solutions to web preservation problems.

All the software developed by our team is available as free, open-source software: