One link for each content

Last updated on August 1st, 2017 at 01:53 pm

To archive a web site efficiently, each individual piece of content (e.g. image, page, file) must be referenced by its own link.

Every piece of content on a web site, including images, videos and pages, must be referenced directly by a URL. For instance, the URL http://arquivo.pt/img/logo-home-pt.png references the Arquivo.pt logo.

Our crawler can only find and archive content that is referenced by at least one link on a page of the web site. Two cases require particular attention:

  • Streaming videos: these are played by specific applications, such as Flash Player, Windows Media Player or Real Player. However, crawlers can only download content available through the HTTP protocol. Hence, for a video on the web to be archived, there must be a link to download the full video file.
  • Content hidden behind forms: crawlers cannot fill out forms. Therefore, content that is available only after authentication, acceptance of terms or any other kind of form cannot be archived, unless links on other pages give direct access to it.
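
The limitation described above can be illustrated with a minimal sketch of link discovery. This is not the Arquivo.pt crawler, only a simplified assumption of how a crawler might collect URLs from a page: it gathers whatever is referenced by href or src attributes and never submits forms. The paths /videos/talk.mp4 and /private/report.pdf are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects URLs referenced by href/src attributes on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Forms are skipped on purpose: this sketch never fills them out,
        # so a form's target is not discovered.
        if tag == "form":
            return
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page: a direct video link, an image, and a form.
page = """
<a href="/videos/talk.mp4">Download the full video</a>
<img src="img/logo-home-pt.png">
<form action="/private/report.pdf" method="post"><input type="submit"></form>
"""
found = extract_links(page, "http://arquivo.pt/")
```

In this sketch, the video and the image are discovered because each has its own link, while the document behind the form is not.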

Patch existing sites

Having each piece of content referenced by a URL brings many advantages. However, it might be impossible to restructure an existing site to comply with this best practice.

A possible solution is to provide alternative information about the location of content on the site through one of the following:

  • User site map (example): it improves usability and enables the crawl of all the pages of a site;
  • RSS feeds archive (example): RSS feeds are used to publish the latest updates on a web site. An archive of feeds helps crawlers to find contents to download;
  • XML Sitemap: a file containing information about each URL of a site (e.g. last modification date, priority, change frequency). Although we do not process XML sitemaps yet, this protocol is supported by companies such as Google, Yahoo! and Microsoft.

It is crucial to keep the information contained in these files up-to-date.
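
As an illustration of the XML Sitemap option, the following sketch builds a small sitemap with Python's standard library. It assumes the element names and namespace of the sitemaps.org protocol (urlset, url, loc, lastmod, changefreq, priority). The function name build_sitemap and the field values in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as a string for a list of URL entries.

    Each entry is a dict with a required "loc" key and optional
    "lastmod", "changefreq" and "priority" keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Optional fields are written only when provided.
        for field in ("lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    return ET.tostring(urlset, encoding="unicode")
```

For example, build_sitemap([{"loc": "http://arquivo.pt/img/logo-home-pt.png", "lastmod": "2017-08-01"}]) produces a urlset with a single url entry. Regenerating the file whenever content changes is one way to keep it up-to-date.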