Supplying historical Portuguese web contents

Help preserve web knowledge by supplying historical contents.

Arquivo.pt has periodically crawled the Portuguese web since January 2008.

However, contents published before then must be gathered from external sources to be archived in our system.

If you have web contents of interest to the Portuguese community and want to contribute to their preservation, please contact us.

We consider all contents published on sites under the .PT domain to be part of the Portuguese web, and they must be preserved. However, contents hosted under other domains that are considered of interest to the Portuguese community will also be accepted.

Should I supply only old contents?

We are interested in receiving contents that are no longer available online, regardless of their publication date.

The web is extremely dynamic and the lifetime of most contents is very short.

Thus, many contents are lost because they become unavailable before we can gather them, even though we perform periodic crawls of the Portuguese web.

Backups of Portuguese web sites are a good example of contents that may be provided.

How can I supply contents?

Arquivo.pt stores the archived contents using the ARC format. Ideally, contents should be supplied using this format.

However, most people naturally do not keep their files in this format. Therefore, we accept Portuguese web contents kept in any format.

Later, the Portuguese Web Archive team will convert them to the ARC format so that they can be integrated into our system.

To facilitate this task, we would appreciate receiving as much metadata as possible along with the contents, especially:

  • the web site address(es). If there are several web sites, please group the contents belonging to each one of them in a separate directory;
  • the content addresses (URLs). If you are providing a local copy of a site, please maintain the original file names. If you are supplying contents that you gathered from the web, please provide their original URLs;
  • the content dates. Supply the date when each content was published or saved. If you do not know the exact dates, please supply approximate dates;
  • the content media type (MIME). Please maintain the original file name extensions of the contents (e.g. .gif, .html, .jpg). If possible, provide the full HTTP header for each content. It is particularly important to provide the media type for dynamically generated contents whose file names have no extension.
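The metadata listed above can be collected in a simple manifest file delivered alongside the contents. The following is only a minimal sketch in Python, not a format required by Arquivo.pt: the function names build_manifest and write_manifest, the CSV column names, and the example address http://example.pt/ are all assumptions made for illustration. It assumes a local copy of one site that kept its original file names, guesses each media type from the file name extension, and uses the file modification time as an approximate date.

```python
import csv
import mimetypes
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import quote


def build_manifest(site_dir, site_url):
    """Return one metadata row per file found under site_dir (hypothetical layout)."""
    root = Path(site_dir)
    rows = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root).as_posix()
        # Guess the media type from the original file name extension.
        mime, _ = mimetypes.guess_type(path.name)
        # The modification time is only an approximation of the saved date.
        saved = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        rows.append({
            "file": relative,
            # Assumes the local path mirrors the original URL path.
            "url": site_url.rstrip("/") + "/" + quote(relative),
            "date": saved.strftime("%Y-%m-%d"),
            "mime": mime or "unknown",
        })
    return rows


def write_manifest(rows, out_path):
    """Write the rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "url", "date", "mime"])
        writer.writeheader()
        writer.writerows(rows)
```

For example, write_manifest(build_manifest("backup/example.pt", "http://example.pt/"), "example.pt-manifest.csv") would describe one site directory. Writing the manifest outside the site directory keeps it from being listed as content on a later run. Entries marked "unknown" are the ones for which a media type or HTTP header should be supplied by hand.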

Do not hesitate to contact us

Supplying and integrating historical web contents is a complex task.

Please do not hesitate to contact us if you have any questions.

Contributors list

We express our gratitude to the following entities for supplying contents to the Portuguese Web Archive:

  • Dinis Manuel Alves: shared several of his news articles published between 1997 and 2003 (4 000 files; 0.000075 TB; 2000-2007)
  • José Magalhães: author of the book “Novo roteiro prático da Internet: o ciberespaço ao alcance de todos”, which included a CD-ROM with contents gathered from the Web in 1996 (75 174 files; 0.000316 TB; 1996)
  • National Library of Portugal: provided contents archived in 2005 during project RECOLHA (14 373 817 files; 0.165 TB; 2004-2005)
  • XLDB research group at the University of Lisbon: provided crawls of the Portuguese web performed between 2002 and 2006 during the tumba! search engine project (37 000 000 files; 0.360 TB; 2002-2006)
  • Internet Archive: provided about 1.9 TB of contents gathered from the .PT domain between 1996 and 2007 (123 889 349 files; 1.948 TB; 1996-2007)
  • Helder Guerreiro: provided a crawl of the blog platform weblog.com.pt made before it was shut down in 2012 (563 350 files; 0.026 TB; 2012)