System functioning

The general functioning of a web archive is similar to that of a web search engine such as Google. It is divided into three main stages:

  1. Crawl: the harvest (crawl) of web content starts from an initial list of URLs (seeds). The systems that automatically harvest the web are called crawlers. A crawler cyclically performs the following tasks (see the first sketch after this list):
    • harvests the content referenced by a URL and stores it on disk;
    • extracts the URLs to new pages embedded in that content;
    • adds the newly found URLs to the list of URLs to be harvested.
  2. Indexing: when the crawl is finished, all the information collected from the Web is processed to build the indexes that enable fast searching over it (see the second sketch below);
  3. Search and Access: after the indexes are built, the search service is made available to users (see the third sketch below).
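
To make the crawl loop concrete, here is a minimal sketch in Python. The regex-based link extraction, the `store` helper, and the `max_pages` cap are illustrative assumptions, not how a production crawler works; real crawlers add politeness rules, robots.txt handling, and robust HTML parsing.

```python
import hashlib
import pathlib
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

ARCHIVE_DIR = pathlib.Path("archive")

def store(url, html):
    """Persist harvested content to disk, keyed by a hash of the URL (illustrative helper)."""
    ARCHIVE_DIR.mkdir(exist_ok=True)
    name = hashlib.sha1(url.encode()).hexdigest()
    (ARCHIVE_DIR / f"{name}.html").write_text(html, encoding="utf-8")

def crawl(seeds, max_pages=100):
    """Harvest pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)   # URLs waiting to be harvested
    seen = set(seeds)         # avoids queueing the same URL twice
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue          # skip content that cannot be fetched
        fetched += 1
        store(url, html)      # 1. harvest the content and store it on disk
        # 2. extract embedded URLs to new pages
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            # 3. add the newly found URLs so they are harvested as well
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
```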
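
As an illustration of the indexing stage, the sketch below builds a simple in-memory inverted index, mapping each term to the documents that contain it. This is a toy model under assumed inputs (a dict of document ids to text); real web archives build far larger, disk-based indexes.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Build an inverted index: each term maps to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index

docs = {
    "page-1": "Web archives preserve web content",
    "page-2": "Search engines index web content",
}
index = build_index(docs)
# index["web"] == {"page-1", "page-2"}; index["preserve"] == {"page-1"}
```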
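
Finally, a search service can answer a query by intersecting the document sets of its terms. The sketch below reuses the `index` built in the previous example; ranking, snippets, and access to the archived versions themselves are omitted.

```python
import re

def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())  # keep only documents matching all terms
    return results

# search(index, "web content")      -> {"page-1", "page-2"}
# search(index, "preserve content") -> {"page-1"}
```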

The main difference between live-web search engines and web archives is that web archives must preserve the harvested information to keep it accessible over time.
