Crawling and archiving Web content


1. How often do you collect the Portuguese Web and how long does it take?

We perform 3 to 4 crawls per year. About 90% of the contents are crawled within 7 days, but the crawl continues for sites that are slower or that hold more contents. In addition, we collect a selected set of 400 online publications daily.

When relevant events occur, such as elections, we perform extraordinary crawls of selected sites.

2. Do you collect the whole Portuguese Web?

No.

Some constraints are imposed, for instance, on the:

  • maximum size of contents downloaded from the Web
  • number of contents per site
  • number of links the crawler follows from an initial address until it reaches the content
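
The three constraints listed above can be pictured as a simple admission check applied to each content. The following is a minimal sketch in Python; the class names and all numeric limits are assumptions made for illustration, not the archive's actual configuration.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlLimits:
    # All three values are illustrative assumptions, not the archive's settings.
    max_bytes: int = 10 * 1024 * 1024  # maximum size of one downloaded content
    max_per_site: int = 10_000         # maximum number of contents per site
    max_hops: int = 5                  # maximum links followed from the initial address


@dataclass
class LimitChecker:
    limits: CrawlLimits
    per_site: Counter = field(default_factory=Counter)

    def allows(self, url: str, size_bytes: int, hops: int) -> bool:
        """Return True and count the content if it respects every limit."""
        site = urlsplit(url).hostname or ""
        if size_bytes > self.limits.max_bytes:
            return False
        if hops > self.limits.max_hops:
            return False
        if self.per_site[site] >= self.limits.max_per_site:
            return False
        self.per_site[site] += 1
        return True


# Tiny limits so that each constraint is easy to trigger.
checker = LimitChecker(CrawlLimits(max_bytes=100, max_per_site=2, max_hops=3))
print(checker.allows("http://example.pt/a", size_bytes=50, hops=1))
print(checker.allows("http://example.pt/big", size_bytes=500, hops=1))
```

A content rejected by any limit is not counted against its site, so a single oversized file does not use up the site's quota in this sketch.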

In addition, the boundaries of the Portuguese Web are difficult to define accurately. Many contents are hosted outside the .PT domain, and identifying them requires particular effort. If you wish, you may suggest a site to be archived.

3. Which media types do you archive?

All media types.

4. What about the dynamically generated pages?

Dynamically generated pages, such as those generated through PHP scripts, are collected in the same way as static ones, as long as there is a link to them.

5. Do you archive restricted-access data?

No.

The Portuguese Web Archive archives only the public Web. Pages protected by password or other forms of access restriction are not archived.

6. What is the Portuguese Web Archive crawler?

The Portuguese Web Archive crawler is the system that automatically collects contents from the web to be archived. This kind of system is also known as a spider or harvester.

7. How does it work?

A crawler harvests contents from the web in cycles: it downloads a content, extracts the links embedded in it, and follows them to find new contents.

Each new crawl starts from an initial set of web addresses called seeds. For each new crawl of the Portuguese Web, we use as seeds the home pages of all web sites under .PT that were successfully crawled before, user suggestions, and the DNS listing of .PT domains.
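
The cycle described above (download, extract links, find new contents, starting from seeds) can be sketched as a breadth-first loop. This is a minimal illustration in Python, not the archive's crawler: the `crawl` function, the `fetch` callable and the in-memory `FAKE_WEB` are hypothetical names introduced here so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch):
    """Harvest pages breadth-first, starting from the seed addresses.

    `fetch` is a hypothetical callable returning the HTML of a URL, or None.
    """
    frontier = deque(seeds)  # addresses waiting to be downloaded
    seen = set(seeds)        # addresses already queued, to avoid repeats
    archived = {}
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        archived[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return archived


# A three-page stand-in for the web (illustrative data only).
FAKE_WEB = {
    "http://example.pt/": '<a href="/news">News</a> <a href="about.html">About</a>',
    "http://example.pt/news": '<a href="/">Home</a>',
    "http://example.pt/about.html": "<p>No links here.</p>",
}

pages = crawl(["http://example.pt/"], FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the loop from downloading the home page again when the news page links back to it.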

8. Have I been archived?

Webmasters can tell whether their servers have been crawled by checking the logs for requests identified by the following user agent:

Arquivo-web-crawler  (compatible; heritrix/1.14.3 +http://arquivo.pt)

If you detect any unexpected behavior, please contact us, indicating the full User-Agent identification, the dates of access and as thorough a description of the problem as possible.
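
As a rough illustration of the log check, the sketch below scans access log lines for the crawler's user agent. It assumes the server writes logs in the common "combined" format; the `crawler_hits` function and the two sample lines are made up for this example, and only the user agent string comes from the text above.

```python
import re

CRAWLER_TOKEN = "Arquivo-web-crawler"

# Combined format: ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def crawler_hits(lines):
    """Return (time, request, full user agent) for each request made by the crawler."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and CRAWLER_TOKEN in match.group("agent"):
            hits.append((match.group("time"), match.group("request"), match.group("agent")))
    return hits


# Illustrative log lines, not real traffic.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2010:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Arquivo-web-crawler  (compatible; heritrix/1.14.3 +http://arquivo.pt)"',
    '192.0.2.55 - - [01/Jan/2010:10:00:03 +0000] "GET /a.html HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0"',
]

for time, request, agent in crawler_hits(SAMPLE_LOG):
    print(time, request, agent)
```

The tuple returned for each hit carries the access date and the full User-Agent identification, which are the details requested above when reporting a problem.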

9. What is the frequency of requests made to my web site?

The crawler respects a courtesy pause of 10 seconds between requests to the same site so that its actions do not overload web servers.

The current courtesy pause imposes a lower load than a browser does when opening, for example, an HTML page and its images. If you detect any harmful behavior carried out by our crawler, please let us know.
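
One way to picture the courtesy pause is a small scheduler that remembers, per site, when the next request is allowed. This is a hedged sketch, not the crawler's actual implementation: the `PolitenessScheduler` class and its method names are assumptions, and only the 10-second value comes from the text. With that value, a single site receives at most 86400 / 10 = 8640 requests per day.

```python
from urllib.parse import urlsplit

COURTESY_PAUSE = 10.0  # seconds between requests to the same site, as stated above


class PolitenessScheduler:
    """Tracks, per host, the earliest moment at which the next request may be sent."""

    def __init__(self, pause=COURTESY_PAUSE):
        self.pause = pause
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Seconds the crawler still has to wait before requesting `url`."""
        host = urlsplit(url).hostname or ""
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_request(self, url, now):
        """Note that `url` was requested at time `now`."""
        host = urlsplit(url).hostname or ""
        self.next_allowed[host] = now + self.pause


scheduler = PolitenessScheduler()
scheduler.record_request("http://example.pt/a", now=0.0)
print(scheduler.wait_time("http://example.pt/b", now=4.0))  # same site, 4 s later
print(scheduler.wait_time("http://other.pt/", now=4.0))     # different site
```

Times are passed in explicitly to keep the example deterministic; a real loop would typically pass `time.monotonic()` and sleep for the returned wait time.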

10. Does the Archive crawler fill in forms?

No.

If you notice the crawler filling in a form, please let us know.

11. Can I allow comprehensive visits to my site?

Yes.

12. Can I restrict visits to my site?

Yes.

The Portuguese Web Archive crawler respects the Robots Exclusion Protocol. If you want to prevent our crawler from visiting all or part of your web site, and therefore exclude that content from the Portuguese Web Archive, follow the instructions for compliance with the Robots Exclusion Protocol.
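
To preview the effect of a robots.txt file before publishing it, the rules can be tested with Python's standard `urllib.robotparser` module. The rules below are only an example, and the user-agent token `Arquivo-web-crawler` is an assumption taken from the User-Agent string shown earlier; the archive's own instructions remain the authoritative reference for the exact token.

```python
from urllib.robotparser import RobotFileParser

# Example rules: keep the archive crawler out of /private/, allow everything else.
# The token "Arquivo-web-crawler" is an assumption based on the User-Agent string.
ROBOTS_TXT = """\
User-agent: Arquivo-web-crawler
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/index.html", "/private/report.html"):
    url = "http://example.pt" + path
    allowed = parser.can_fetch("Arquivo-web-crawler", url)
    print(path, "allowed" if allowed else "blocked")
```

Replacing `Disallow: /private/` with `Disallow: /` in the first group would exclude the whole site for that user agent.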

13. How do I create a preservable web site?

Follow our recommendations, which will help you publish contents that can be preserved for the future.

14. What is the difference between Web and Internet?

The Internet is the communication infrastructure that links computers worldwide. There are several services on the Internet, and the Web is one of them. Another service is, for instance, email.

The Web consists of pages and contents connected by hyperlinks. One may say that the Internet is equivalent to the roads, and the Web, email and other services are the different vehicles in circulation.

15. What is the Portuguese Web?

Everything hosted under the .PT domain and other contents hosted outside this domain that are of broad interest to the Portuguese community are considered to be part of the Portuguese Web.
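
The definition above can be read as a two-part membership test. The sketch below is only an illustration of that reading: the function `in_portuguese_web` and the `extra_hosts` set (standing in for sites outside .PT judged to be of broad interest, for example through user suggestions) are hypothetical names, not part of the archive's software.

```python
from urllib.parse import urlsplit


def in_portuguese_web(url, extra_hosts):
    """True if the URL is hosted under .PT or on a host listed as of broad interest."""
    host = (urlsplit(url).hostname or "").lower()
    return host == "pt" or host.endswith(".pt") or host in extra_hosts


# Hypothetical list of hosts outside .PT considered part of the Portuguese Web.
EXTRA_HOSTS = {"portuguese-community.example.com"}

print(in_portuguese_web("http://www.example.pt/page", EXTRA_HOSTS))
print(in_portuguese_web("http://portuguese-community.example.com/", EXTRA_HOSTS))
print(in_portuguese_web("http://example.com/", EXTRA_HOSTS))
```

The first part of the test is mechanical, while the second depends entirely on how the list is built, which is where the effort of identifying contents outside .PT lies.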