Architecture

Overview of the system’s architecture and functioning.

Workflow overview

The general functioning of a web archive is similar to that of a web search engine such as Google. It is divided into three main stages:

  1. Crawl: the harvest (crawl) of web content starts from an initial list of URLs (seeds). The systems that automatically harvest the web are called crawlers. A crawler cyclically performs the following tasks (see the sketch after this list):
    • harvests the content referenced by a URL and stores it on disk;
    • extracts embedded URLs to new pages from it;
    • adds the newly found URLs to the list of content to be harvested.
  2. Indexing: when the crawl is finished, all the information collected from the Web is processed to build the indexes that enable fast search over it (a simplified index-and-lookup sketch also follows this list);
  3. Search and Access: after the indexes are built, the search service is made available. The main difference between live-web search engines and web archives is that web archives must preserve the harvested information to keep it accessible over time.
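
To make the crawl loop concrete, below is a minimal, illustrative sketch in Python (standard library only) of a breadth-first crawler: a frontier of URLs seeded with the initial list, a fetch-store-extract cycle, and de-duplication of already-seen URLs. The store function and the regex-based link extraction are placeholders for illustration; production crawlers also handle robots.txt, politeness delays, content types, and archival storage formats such as WARC.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, max_pages=100):
    """Harvest pages breadth-first, starting from a list of seed URLs."""
    frontier = deque(seeds)   # URLs waiting to be harvested
    seen = set(seeds)         # avoid fetching the same URL twice

    harvested = 0
    while frontier and harvested < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                content = response.read()
        except OSError:
            continue          # skip URLs that fail to download

        store(url, content)   # placeholder: persist the record on disk
        harvested += 1

        # Extract embedded links and add newly found URLs to the frontier.
        for href in re.findall(rb'href="([^"]+)"', content):
            link = urljoin(url, href.decode("utf-8", errors="ignore"))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)

def store(url, content):
    """Placeholder for persisting a harvested record (e.g. into a WARC file)."""
    print(f"stored {len(content)} bytes from {url}")

if __name__ == "__main__":
    crawl(["https://example.com/"], max_pages=5)
```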

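The indexing and access stages can be sketched in the same spirit. The snippet below is a deliberately simplified assumption, not the index format of any real web archive (production systems typically use CDX-style indexes and dedicated servers): it groups captures of the same URL, sorts them by capture timestamp, and resolves a request for a URL at a given date to the closest preserved version, which is what lets an archive serve content across time.

```python
import bisect
from collections import defaultdict

# Hypothetical capture records produced by the crawl stage:
# (URL, capture timestamp as YYYYMMDDhhmmss, location of the stored content).
captures = [
    ("http://example.com/", "20200101120000", "archive-0001.warc#1"),
    ("http://example.com/", "20220615080000", "archive-0417.warc#9"),
    ("http://example.com/about", "20210303100000", "archive-0203.warc#4"),
]

def build_index(records):
    """Group captures by URL and sort each group by timestamp,
    so that lookups across time can use binary search."""
    index = defaultdict(list)
    for url, timestamp, location in records:
        index[url].append((timestamp, location))
    for versions in index.values():
        versions.sort()
    return index

def lookup(index, url, requested):
    """Return the capture of the URL whose timestamp is numerically closest
    to the requested timestamp (a rough stand-in for temporal proximity)."""
    versions = index.get(url)
    if not versions:
        return None
    timestamps = [t for t, _ in versions]
    pos = bisect.bisect_left(timestamps, requested)
    candidates = versions[max(0, pos - 1):pos + 1]
    return min(candidates, key=lambda v: abs(int(v[0]) - int(requested)))

index = build_index(captures)
print(lookup(index, "http://example.com/", "20210101000000"))
# -> ('20200101120000', 'archive-0001.warc#1')
```
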
Learn more from our scientific and technical publications