Robots Exclusion Protocol: authorizing the harvest of important content


The Portuguese Web Archive respects the access restrictions imposed by authors through the Robots Exclusion Protocol. It is therefore important that authors authorize the harvest of the important content of their sites, such as image files or CSS style sheets, so that it can be preserved.

The Robots Exclusion Protocol (REP) can be used to list contents that should not be crawled and archived.

The access restrictions should be specified in a single file named robots.txt hosted in the root directory of a site (e.g. http://arquivo.pt/robots.txt).
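
For reference, a robots.txt file is composed of groups, each starting with a User-agent line followed by one or more Disallow (and optionally Allow) rules. The /private/ path below is merely illustrative:

User-agent: *
Disallow: /private/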

Allow web archive crawlers to harvest all the files required to reproduce pages

  • Search engines only need to crawl textual content to present results from a site. However, in 2014 Google started using the appearance of a page as a ranking factor, considering that poorly maintained pages should not appear among the first search results.
  • Web archives need all the files embedded on a web page to later reproduce it (e.g. CSS, JavaScript or image files).
  • The default Robots Exclusion rules of some Content Management Systems (e.g. WordPress, Joomla, Mambo) need to be changed to enable efficient web archiving (see the example after this list).
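
As an illustration, some CMS defaults block the directories that hold the embedded files needed to reproduce pages. The sketch below uses WordPress-style paths purely as hypothetical examples and assumes the crawler honours the Allow directive, a common REP extension; the actual paths depend on the CMS and theme in use:

User-agent: *
Allow: /wp-includes/js/
Allow: /wp-includes/css/
Disallow: /wp-includes/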

Allowing the full crawl of a site by the Portuguese Web Archive

Add the following at the beginning of robots.txt:

User-agent: Arquivo-web-crawler 
Disallow:
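
An empty Disallow directive excludes nothing, so this group allows the Portuguese Web Archive crawler to harvest the entire site.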

Disallow content that is harmful to crawlers

Authors can facilitate web archiving by using the REP to identify content that is irrelevant for archiving or harmful to crawlers. This way:

  • The visited sites save resources (e.g. bandwidth)
  • The Portuguese Web Archive saves resources (e.g. disk space)

Disallowing the crawl of a directory using robots.txt

For instance, a robots.txt file with the following instructions would forbid the Portuguese Web Archive from crawling all the content under the folder /calendar/:

User-agent: Arquivo-web-crawler 
Disallow: /calendar/
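
Note that Disallow rules are URL-path prefixes, so this example blocks every URL whose path starts with /calendar/, and it applies only to the crawler named in the User-agent line.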

Disallowing the crawl and indexing of a page using the ROBOTS meta tag

Alternatively, access restrictions can be declared on each page by including the ROBOTS meta tag in its source code.

The following example would forbid the crawl and indexing of the page by all robots:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />

The exclusions defined through the ROBOTS meta tag apply to all robots, including search engines such as Google.
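
If the goal is to restrict other robots while still allowing the Portuguese Web Archive to preserve a page, the per-crawler rules of robots.txt described above are the appropriate mechanism. A minimal sketch, where /private/ stands for any illustrative restricted path:

User-agent: Arquivo-web-crawler
Disallow:

User-agent: *
Disallow: /private/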