Crawling web content

Last updated on February 14th, 2023 at 02:50 pm

1. How often do you collect the Portuguese Web and how long does it take?

Arquivo.pt performs 3 to 4 crawls per year. About 90% of the contents are crawled within 7 days, but a crawl continues for longer on slower sites or on sites with a larger amount of content. In addition, a selected set of online publications is collected daily.

When a relevant event occurs, such as elections, we perform extraordinary crawls of selected sites.

2. Do you collect the whole Portuguese Web?

No.

Some constraints are imposed, for instance, on the:

  • maximum size of contents downloaded from the Web
  • number of contents per site
  • number of links the crawler follows from an initial address until it reaches the content
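As an illustration only, the three constraints above could be expressed as a simple check applied to each candidate content. This is a minimal sketch: the `CrawlLimits` class, the `within_limits` function and all numeric values are hypothetical and do not reflect the actual Arquivo.pt configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlLimits:
    # All three defaults are hypothetical placeholders, not Arquivo.pt settings.
    max_bytes_per_content: int = 10_000_000  # maximum size of a downloaded content
    max_contents_per_site: int = 10_000      # number of contents per site
    max_hops_from_seed: int = 5              # links followed from the initial address


def within_limits(limits, size_bytes, contents_already_from_site, hops_from_seed):
    """Return True if a candidate content respects all three constraints."""
    return (
        size_bytes <= limits.max_bytes_per_content
        and contents_already_from_site < limits.max_contents_per_site
        and hops_from_seed <= limits.max_hops_from_seed
    )


limits = CrawlLimits(max_bytes_per_content=100, max_contents_per_site=2, max_hops_from_seed=1)
print(within_limits(limits, size_bytes=80, contents_already_from_site=1, hops_from_seed=1))
print(within_limits(limits, size_bytes=80, contents_already_from_site=0, hops_from_seed=2))
```

The first call passes all three checks; the second is rejected because the content is too many links away from the initial address.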

In addition, the boundaries of the Portuguese Web are difficult to define accurately. Many contents are hosted outside the .PT domain, and identifying them requires particular effort. If you wish, you may suggest a site to be archived.

3. Are archived pages immediately available?

No. They are only available for consultation after one year.

4. How does Arquivo.pt preserve sites protected by passwords?

The person in charge of a restricted-access site will have to provide Arquivo.pt with the information needed to access it, such as the login and password, so that the site can be preserved.

5. Which media types do you archive?

All media types.

6. What about the dynamically generated pages?

Dynamically generated pages, such as those generated through PHP scripts, are collected in the same way as static ones, as long as there is a link to them.

7. Do you archive restricted-access data?

No.

Arquivo.pt archives only the public Web. Pages protected by a password or other forms of access restriction are not archived.

8. What is the Arquivo.pt crawler?

The Arquivo.pt crawler is the system that automatically collects contents from the web to be archived. Systems of this kind are also known as spiders or harvesters.

9. How does it work?

A crawler cyclically harvests contents from the web. It downloads content and extracts its embedded links to find new contents.

Each new crawl starts from an initial set of web addresses called seeds. For each new crawl of the Portuguese Web, we use as seeds the home pages of all websites under .PT that were previously crawled successfully, user suggestions, and the DNS listing of .PT domains.
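The cycle described above (download a content, extract its links, queue the new ones) can be sketched as a breadth-first traversal starting from the seeds. This is a simplified illustration, not the code of the Arquivo.pt crawler: the `crawl` function, the `fetch` callback, the `max_hops` parameter and the in-memory `TOY_WEB` are all assumptions made so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, fetch, max_hops=2):
    """Breadth-first crawl; fetch(url) returns HTML text or None if unavailable."""
    seeds = list(seeds)
    queue = deque((seed, 0) for seed in seeds)
    seen = set(seeds)
    archived = []
    while queue:
        url, hops = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        archived.append(url)
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
    return archived


# Hypothetical in-memory "web" standing in for real HTTP downloads.
TOY_WEB = {
    "https://site.example/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://site.example/a": '<a href="/c">C</a>',
    "https://site.example/b": '<a href="/">home</a>',
    "https://site.example/c": "no links here",
}

print(crawl(["https://site.example/"], TOY_WEB.get))
```

The `seen` set is what keeps the crawl from looping when pages link back to addresses that were already queued, as the `/b` page does here.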

10. Has my website been archived?

Webmasters can tell whether their servers have been crawled by checking their logs for requests identified by the following user agents:

Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling)
Arquivo-web-crawler (compatible; brozzler/1.5 +https://arquivo.pt/faq-crawling)
Arquivo-web-crawler (compatible; browsertrix/0.8 +https://arquivo.pt/faq-crawling)

If you detect any unexpected behaviour, please contact us, indicating the full User-Agent identification, the dates of access and as thorough a description of the problem as possible.
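A webmaster could search the server logs for the user agents listed above with a short script. The sketch below assumes that each log line contains the full User-Agent string, as in the common "combined" access log format; the sample lines, the IP addresses and the `find_crawler_requests` function are hypothetical.

```python
import re

# Matches the common prefix of the user agents listed in this FAQ and
# captures the crawling engine name and its version.
CRAWLER_UA = re.compile(
    r"Arquivo-web-crawler \(compatible; (?P<engine>[A-Za-z]+)/(?P<version>[^\s;)]+)"
)


def find_crawler_requests(log_lines):
    """Return (engine, version, line) for every line sent by the crawler."""
    hits = []
    for line in log_lines:
        match = CRAWLER_UA.search(line)
        if match:
            hits.append((match.group("engine"), match.group("version"), line))
    return hits


# Hypothetical log lines; the layout of real logs depends on the server setup.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Mar/2023:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling)"',
    '192.0.2.55 - - [01/Mar/2023:10:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [01/Mar/2023:10:00:10 +0000] "GET /a HTTP/1.1" 200 128 "-" '
    '"Arquivo-web-crawler (compatible; brozzler/1.5 +https://arquivo.pt/faq-crawling)"',
]

for engine, version, _line in find_crawler_requests(SAMPLE_LOG):
    print(engine, version)
```

The matching lines provide the full User-Agent identification and access dates requested when reporting a problem.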

11. What is the frequency of requests made to my web site?

The crawler respects a courtesy pause of 10 seconds between requests to the same site so that its actions do not overload web servers.

The current value for the courtesy pause imposes a lower load than the one imposed by a browser when opening, for example, an HTML page and the corresponding images. If you detect any harmful behaviour carried out by our crawler, please let us know.
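The courtesy pause can be pictured as a per-site timer that is consulted before every request. The sketch below is an assumption about how such a rule might be implemented, not a description of the crawler's internals: the `PolitenessScheduler` class is hypothetical, a "site" is approximated by the URL hostname, and the pause is counted from the moment the previous request was recorded.

```python
from urllib.parse import urlsplit

COURTESY_PAUSE_SECONDS = 10.0  # value stated in this FAQ


class PolitenessScheduler:
    """Tracks, per hostname, the earliest time the next request may be sent."""

    def __init__(self, pause=COURTESY_PAUSE_SECONDS):
        self.pause = pause
        self._next_allowed = {}  # hostname -> earliest time of the next request

    def wait_time(self, url, now):
        """Seconds still to wait before requesting url at time now."""
        host = urlsplit(url).hostname
        return max(0.0, self._next_allowed.get(host, now) - now)

    def record_request(self, url, now):
        """Note that url was requested at time now."""
        host = urlsplit(url).hostname
        self._next_allowed[host] = now + self.pause


scheduler = PolitenessScheduler()
scheduler.record_request("https://site.example/a", now=100.0)
print(scheduler.wait_time("https://site.example/b", now=103.0))   # same site
print(scheduler.wait_time("https://other.example/", now=103.0))   # different site
```

Times are passed in explicitly so the example runs instantly; a real crawler would read a clock and sleep for the returned duration.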

12. Does the Arquivo.pt crawler fill in forms?

No.

If you notice the crawler filling in forms on your site, please let us know.

13. Can I allow comprehensive visits to my site?

Yes.

14. Can I restrict visits to my site?

Yes.

The Arquivo.pt crawler respects the Robots Exclusion Protocol. If you want to prevent all or part of your web site from being visited by our crawler, and therefore exclude it from Arquivo.pt, follow the instructions for compliance with the Robots Exclusion Protocol.
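Under the Robots Exclusion Protocol, such restrictions are written in a robots.txt file at the root of the site. The sketch below checks a sample file with Python's standard `urllib.robotparser` module. The user-agent token `Arquivo-web-crawler` is an assumption taken from the user agent strings listed in this FAQ, so please confirm it against the official instructions; the `/private/` path and the `is_allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes one directory for the crawler.
# The token "Arquivo-web-crawler" is assumed, not confirmed by this FAQ.
ROBOTS_TXT = """\
User-agent: Arquivo-web-crawler
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "Arquivo-web-crawler", "https://site.example/private/page.html"))
print(is_allowed(ROBOTS_TXT, "Arquivo-web-crawler", "https://site.example/public.html"))
```

Replacing `Disallow: /private/` with `Disallow: /` would exclude the whole site for that user agent.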

15. How do I create a preservable web site?

Follow our recommendations, which will help you publish contents that can be preserved for the future.

16. What is the difference between the Web and the Internet?

The Internet is the communication infrastructure that links computers worldwide. There are several services on the Internet. The Web is one of them; email is another.

The Web consists of pages and contents connected by hyperlinks. One may say that the Internet is equivalent to the roads, and the Web, email and other services are the different vehicles in circulation.

17. What is the Portuguese Web?

Everything hosted under the .PT domain and other contents hosted outside this domain that are of broad interest to the Portuguese community are considered to be part of the Portuguese Web.

18. What if a page is not being crawled as frequently as it should be?

If you find a page that is not being crawled or is not being crawled as often as it should be, please contact us with the page link, so that we can increase crawl frequency.

Short link to this page: arquivo.pt/faq-crawling