Donate historical web content

Last updated on June 21st, 2021 at 12:13 pm

Help preserve web knowledge by donating historical web content to Arquivo.pt.

Arquivo.pt has periodically crawled the Portuguese web since January 2008. However, content that was published online earlier must be gathered from external sources to be preserved by our system.

If you have web content of interest to the Portuguese community or to science and technology, and you want to contribute to its preservation, please contact us.

We consider all content published on sites under the .PT domain to be part of the Portuguese web, and it must be preserved. However, content hosted under other domains that is of interest to the Portuguese community will also be accepted.

Should I supply only old content?

We are interested in receiving content that is no longer available online, regardless of its publication date.

The web is extremely dynamic and the lifetime of most content is very short.

Thus, much content is lost because it becomes unavailable before we can gather it, even though we perform periodic crawls of the Portuguese web.

Backups of Portuguese websites are a good example of content that may be provided.

How can I supply web content?

Arquivo.pt stores the archived content using the ARC format. Ideally, content should be supplied using this format.

However, most people naturally do not keep their files in this format. Therefore, we accept Portuguese web content kept in any format.

Later, the Arquivo.pt team will convert it to the ARC format so that it can be integrated into our system.

To facilitate this task, we would appreciate receiving as much metadata as possible along with the content, especially:

  • the website address(es). If there are several websites, please group the content belonging to each one in a separate directory;
  • the content addresses (URLs). If you are providing a local copy of a site, please keep the original file names. If you are supplying content that you gathered from the web, please provide the original URLs;
  • the content dates. Supply the date when each item was published or saved. If you do not know the exact dates, please supply approximate dates;
  • the content media type (MIME). Please keep the original file name extensions (e.g. .gif, .html, .jpg). If possible, provide the full HTTP header for each item. It is particularly important to provide the media type for dynamically generated content whose file names have no extension.
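
The metadata listed above could be collected into a simple manifest before a donation is sent. The following is a minimal Python sketch, not an official Arquivo.pt tool or a required format: the function names, the CSV column names and the example address http://example.pt/ are all assumptions made for illustration. It walks a local copy of one website, rebuilds each URL from the preserved file name, guesses the media type from the file name extension, and uses the file modification time as an approximate date.

```python
import csv
import mimetypes
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import quote


def build_manifest(site_dir, base_url):
    """Return one metadata row (dict) per file found under site_dir.

    site_dir and base_url are hypothetical inputs: a directory holding a
    local copy of a single website, and that website's original address.
    """
    root = Path(site_dir)
    rows = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Rebuild the original URL from the preserved relative file name.
        relative = path.relative_to(root).as_posix()
        url = base_url.rstrip("/") + "/" + quote(relative)
        # Guess the media type from the extension; left empty if unknown.
        mime, _encoding = mimetypes.guess_type(path.name)
        # Assumption: the modification time approximates the save date.
        saved = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        rows.append({
            "file": relative,
            "url": url,
            "date": saved.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "mime": mime or "",
        })
    return rows


def write_manifest(rows, out_path):
    """Write the rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["file", "url", "date", "mime"]
        )
        writer.writeheader()
        writer.writerows(rows)
```

For example, write_manifest(build_manifest("backup/example.pt", "http://example.pt/"), "example.pt-manifest.csv") would produce one manifest for one website directory. Writing the manifest outside the website directory keeps it from being listed as content on a later run. Rows with an empty media type or a doubtful date are the ones worth completing by hand.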

Software to convert to ARC/WARC format

  • HTTrack2Arc: a tool that converts crawls made by HTTrack to Internet Archive ARC files.
  • Roteiro2Arc: a tool that converts to Internet Archive ARC files the local archive of the Portuguese web included on the CD-ROM distributed with the book “Novo Roteiro Prático da Internet” by José Magalhães.
  • AWPJornaisIntegration: a project to integrate collections of old Portuguese newspaper websites into Arquivo.pt.
  • warcio: a Python library for ARC to WARC record conversion, developed by Ilya Kreymer.
  • warcit: a command-line tool that converts on-disk directories of web documents (commonly HTML, web assets and any other data files) into ISO-standard web archive (WARC) files.
  • har2warc: a tool that converts HTTP Archive (HAR) files into the Web Archive (WARC) format.

Contributors list

We express our gratitude to the following entities for supplying content to Arquivo.pt:

  • Internet Memory Foundation: historical collection donated by Julien Masanès, who led the project until its closure in 2018, to be integrated into and searchable on Arquivo.pt (142 000 000 files; 6,3 TB)
  • Anat Ben-David: collected and donated the collection called “Israblog”, which contains Israeli blogs, between May 2018 and January 2019 to be searchable on Arquivo.pt (24 520 849 files; 0,55 TB)
  • Rui Bebiano: donated the oldest files from the website of the online magazine NON!, of which he was a founder, in Coimbra; the files were converted and integrated into Arquivo.pt in 2020 (8 303 files)
  • Dinis Manuel Alves: shared several of his news articles published between 1997 and 2003 (4 000 files; 0,000075 TB; 2000-2007)
  • José Magalhães: author of the book “Novo roteiro prático da Internet: o ciberespaço ao alcance de todos” that contained a CD-ROM with contents gathered from the Web in 1996 (75 174 files; 0,000316 TB; 1996)
  • National Library of Portugal: provided contents archived in 2005 during project RECOLHA (14 373 817 files; 0,165 TB; 2004-2005)
  • XLDB research group at the University of Lisbon: provided crawls of the Portuguese web performed between 2002 and 2006 during the tumba! search engine project (37 000 000 files; 0,360 TB; 2002-2006)
  • Internet Archive: provided about 1.9 TB of contents gathered from the .PT domain between 1996 and 2007 (123 889 349 files; 1,948 TB; 1996-2007)
  • Helder Guerreiro: crawl of the blog platform weblog.com.pt, made before it was shut down in 2012 (563 350 files; 0,026 TB; 2012)

Do not hesitate to contact us

Supplying and integrating historical web content is a complex task.

Please do not hesitate to contact us with any questions.