The general functioning of a web archive is similar to that of a web search engine like Google. It is divided into three main stages:
- Crawl: the harvest (crawl) of web content starts from an initial list of URLs (seeds). The systems that automatically harvest the web are called crawlers. A crawler cyclically performs the following tasks (see the first sketch after this list):
- harvests the content referenced by a URL and stores it on disk;
- extracts the URLs of new pages embedded in that content;
- adds the newly-found URLs to the list of URLs to be harvested.
- Indexing: when the crawl is finished, all the information collected from the Web is processed to build the indexes that enable fast searching over it (see the second sketch after this list);
- Search and Access: once the indexes are built, the search service is made available to users.
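As a rough illustration of the crawl loop described above, here is a minimal Python sketch. It is not the code of any real crawler: the `requests` and `beautifulsoup4` dependencies, the on-disk layout in `store`, and the `max_pages` limit are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urljoin
import hashlib
import pathlib

import requests
from bs4 import BeautifulSoup


def store(url, content):
    # Store the harvested content on disk, one file per URL (a sketch only;
    # real web archives use container formats such as WARC).
    name = hashlib.sha1(url.encode()).hexdigest()
    pathlib.Path("archive").mkdir(exist_ok=True)
    pathlib.Path("archive", name).write_bytes(content)


def extract_links(base_url, html):
    # Extract embedded URLs, resolving relative links against the page URL.
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl(seeds, max_pages=100):
    frontier = deque(seeds)  # URLs waiting to be harvested
    seen = set(seeds)        # never enqueue the same URL twice
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable URLs
        fetched += 1
        store(url, response.content)                    # 1. store content on disk
        for link in extract_links(url, response.text):  # 2. extract embedded URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)                   # 3. enqueue for harvesting


crawl(["https://example.org/"])
```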
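The indexing and search stages can likewise be illustrated with a toy inverted index that maps each term to the URLs containing it. Production archives rely on full-text engines such as Apache Lucene; the simple word tokenization and AND-style query semantics below are simplifying assumptions for the example.

```python
import re
from collections import defaultdict


def build_index(pages):
    # pages: mapping of URL -> extracted text of the harvested page.
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)  # term -> URLs containing it
    return index


def search(index, query):
    # Return the URLs containing every query term (AND semantics).
    terms = [t.lower() for t in re.findall(r"\w+", query)]
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


pages = {
    "http://example.org/a": "Web archives preserve web content",
    "http://example.org/b": "Search engines index the live web",
}
index = build_index(pages)
print(search(index, "web archives"))  # {'http://example.org/a'}
```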
The main difference between live-web search engines and web archives is that web archives must preserve the harvested information so that it remains accessible over time.