Arquivo.pt reaches 1 PetaByte of preserved information!

The collection of 1 PetaByte of content predominantly in Portuguese, accessible to both researchers and ordinary citizens, is a milestone that deserves to be celebrated, in the month of its 16th anniversary.

At Arquivo.pt you can search for information published on the Web in the past, such as:

  • The first European page
  • News from The New York Times in 2008
  • European Film Awards 2014

Discover more pages through the selected pages in the Arquivo.pt Online Exhibitions.

Purpose and mission of the Portuguese Web Archive

Arquivo.pt was created on November 8, 2007 with the aim of preserving content from the Portuguese Web.

In 2013, as a service operated by the Fundação para a Ciência e a Tecnologia (FCT), its mission was formulated as follows: “To promote the preservation of content available on the national Internet, ensuring that it is made available to the scientific community and the general public” (Decreto-Lei no. 55/2013).

In recent years, Arquivo.pt has created new services: CitationSaver, which allows researchers to record references to web content in their scientific articles, and Memorial and Complete page, which facilitate access to content scattered throughout the 1 PetaByte of data.

Where did so much information come from?

In order to reach the 1 PetaByte volume, Arquivo.pt periodically recorded content from websites in the .PT domain and from Portuguese websites in other domains.

In addition, frequent daily and monthly collections were made from a small number of government sites and the main news sites in Portugal.

As part of international collaborations, content was collected from sites in various languages, for example on the 2019 European Elections.

Content prior to 2008 came from the Internet Archive and donations, such as a collection made by the National Library and INESC on the 2005 Legislative Elections.

The largest Portuguese-language dataset available to researchers

By making 1 PetaByte of information available, in open access and through the use of APIs (Application Programming Interfaces), Arquivo.pt is a useful tool for research.

For example, a researcher who wants to do a study on elections in Portugal can use the entire Arquivo.pt collection. Better still, they can focus on just a few special collections dedicated to the elections, choosing the ones that interest them and downloading just a few Terabytes to process automatically with the APIs.
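As a rough illustration of the election example above, the sketch below builds a full-text search request limited to a time window. The endpoint path and the parameter names (`q`, `from`, `to`, `maxItems`) are assumptions about the Arquivo.pt API and should be checked against the API documentation before use. The helper `build_search_url` is hypothetical and no request is sent.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify them in the Arquivo.pt API docs.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, date_from=None, date_to=None, max_items=50):
    """Return a search URL for the query, optionally limited to a time window."""
    params = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from  # assumed timestamp format: YYYYMMDDHHMMSS
    if date_to:
        params["to"] = date_to
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


# Example: pages mentioning the legislative elections, archived during 2005.
url = build_search_url("eleições legislativas", "20050101000000", "20051231235959")
print(url)
```

Keeping the URL construction separate from the HTTP call makes it easy to inspect a query before downloading any data.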

Contributions from the various teams and friends of Arquivo.pt

The development of Arquivo.pt is more than a technological matter: it is owed to the dedication and persistence of the various teams that have worked on it since 2007.

It is also owed to the contribution of many friends of Arquivo.pt, who were always on hand to help improve the service, and to the response of the user community.

Congratulations to all! Thank you.

World Digital Preservation Day dedicated to Justice

Last updated on November 13th, 2023 at 08:59 am

The Instituto de Gestão Financeira e Equipamentos da Justiça (IGFEJ) and Secretaria Geral do Ministério da Justiça (SGMJ), in collaboration with BAD, organized the event “Digital Preservation in Justice” to mark World Digital Preservation Day on November 2, 2023.

The event, which took place in the auditorium of the Polícia Judiciária in Lisbon, was attended by representatives from the government’s justice department and professionals from the archives, communications and IT departments.

How to use Arquivo.pt to preserve institutional websites

Arquivo.pt took part in the presentation “Preserve your website”, which addressed the issue of preserving institutional websites and critical aspects such as cybersecurity.

Justice entities can benefit from Arquivo.pt and its various services to ensure good preservation of their websites, mitigate cybersecurity threats and provide historical content to citizens.

The presentation concluded with the following recommendations:

  • Inventory and publicize your current and historical websites
  • Use Arquivo.pt services collaboratively
  • Save content in a standardized format with ArchiveWeb.page

Resources

Arquivo.pt presentations at IIPC GA/WAC, RESAW 2023 and CLEOPATRA

Last updated on March 10th, 2024 at 05:23 pm

Meeting the Web Archive Community

The International Internet Preservation Consortium (IIPC), a consortium that brings together Web preservation initiatives from around the world, held its General Assembly with its members on May 10, 2023.

On the following days, May 11 and 12, the IIPC Web Archiving Conference (IIPC WAC) was held, an initiative open to the community, where people or entities not associated with the IIPC and interested in the Web preservation domain can participate.

The two events were jointly hosted by KB – National Library of the Netherlands, and by Beeld & Geluid – Netherlands Institute for Sound & Vision.

Contributions from Arquivo.pt at the Web Archiving Conference

Arquivo.pt participated in the IIPC working group meetings (Training Working Group and Curators Working Group) and contributed with presentations in the thematic sessions Collaborations & Outreach and Program infrastructure (sessions 7 and 17).

  • Arquivo.pt updates 2023 (slides)
  • Linking web archiving with arts and humanities: the collaboration between ROSSIO and Arquivo.pt (video, slides)
  • Arquivo.pt behind the curtains (slides)

Meeting the RESAW research community

RESAW (Research Infrastructure for the Study of Archived Web Materials) is an initiative created in 2012 with the aim of promoting studies based on archived Web content, in areas such as Social Sciences, Digital Arts and Humanities.

The RESAW 2023 conference was held at the MUCEM Lab (Mediterranean Institute of Heritage Crafts) in Marseille on June 5-6, 2023, under the theme Exploring the Archived Web During a Highly Transformative Age.

Contributions from Arquivo.pt to RESAW 2023

Arquivo.pt contributed presentations to the sessions Web Archive in Mediterranean area and its merge (4.A), From online Tools to Web Archive (6.B), Towards a participatory approach to collections (9.A) and Digging up the materials for writing web history (9.B).

  • How to research governmental web data? (abstract, slides)
  • Archiving Cryptocurrencies (abstract, slides)
  • Time to explore, time to learn from the archived web: Arquivo.pt training initiative (abstract, slides)
  • Exhibiting Web Memories from Arquivo.pt: a call for community participation (abstract, slides)

CLEOPATRA Project Meeting

The CLEOPATRA Project, led by the L3S Research Center at the Gottfried Wilhelm Leibniz University of Hannover, has run a training programme for doctoral researchers (Early Stage Researcher, PhD) since 2019.

Arquivo.pt has participated in three courses: Incentives design for hybrid multilingual information processing and analytics, in Southampton; National and transnational media coverage of European parliamentary elections, 2004-2014, London; and NLP for under-resourced languages, in Zagreb, Croatia.

In 2022, Arquivo.pt welcomed two researchers to its facilities, where they used the archived resources and received special support from the Arquivo.pt team to develop their research.

The CLEOPATRA Project ended in 2023 with a meeting on May 16 in Hannover, which brought together professors, researchers and representatives of the institutions involved.

Daniel Gomes, Arquivo.pt’s Manager, highlighted the new tools that Arquivo.pt makes available and the results of the work carried out by the researchers who have passed through Arquivo.pt.

Virtual Museum of Tourism MUVITUR created a collection of preserved Websites


Last updated on February 26th, 2024 at 09:07 am

MUVITUR – Virtual Museum of Tourism is a portal that aggregates digital content about Tourism in Portugal.

The platform is maintained by the Celestino Domingues Library of the Estoril Higher Institute for Tourism and Hotel Studies (ESHTE) and has the participation of institutions from various areas of heritage that are content providers.

The digitized content that can be consulted in the catalog and accessed at the provider institutions included sound, image, photography and printed material, but websites were missing.

Thus, the idea for MUVITUR’s new “Web Pages” collection emerged.

Collaboration between MUVITUR and Arquivo.pt

In 2019, a collaboration between Arquivo.pt and MUVITUR began with the aim of identifying websites related to Tourism in Portugal and disseminating the history of content published on the Web since 1996.

In 2022, a list was established with about 400 records of websites of various entities related to tourism, hotels, travel agencies, pages of municipalities’ websites dedicated to tourism and others.

This database resulted in the first collection of preserved websites about Tourism in Portugal.

Collection of records in the MUVITUR catalog with webpages preserved at Arquivo.pt. 

How the integration was done

MUVITUR uses Nyron software, which allows content from different sources to be aggregated using the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) interoperability protocol. This protocol is widely used by libraries, archives and museums to provide content to portals such as Europeana.

Arquivo.pt, however, does not make information available through OAI-PMH, so it was necessary to find alternative ways to create a record in Nyron with descriptive information from preserved sites.

The procedure for integration was as follows:

  • The XML schema with the fields for the metadata, according to what works in Nyron, was exported to an Excel sheet.
  • The information was entered manually, respecting the format and syntax, in collaboration with the computer technicians.
  • The XML file with the inserted data was validated and imported into Nyron.
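The steps above can be sketched in a few lines: rows from a spreadsheet export are turned into XML records and checked for well-formedness before import. This is only an illustration. The column names, the `records`/`record` element names and the sample row are hypothetical, since the actual Nyron schema is not described here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical field names; the real Nyron schema is not given in the text.
FIELDS = ["denomination", "organization", "website", "archived_version"]

# Placeholder row standing in for the manually filled spreadsheet (CSV export).
SHEET = (
    "denomination,organization,website,archived_version\n"
    "Example site,Example organization,http://example.org/,"
    "https://arquivo.pt/wayback/http://example.org/\n"
)


def rows_to_xml(csv_text):
    """Convert spreadsheet rows to XML records, rejecting empty fields."""
    root = ET.Element("records")
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        record = ET.SubElement(root, "record")
        for field in FIELDS:
            value = (row.get(field) or "").strip()
            if not value:
                raise ValueError(f"line {line_no}: field '{field}' is empty")
            ET.SubElement(record, field).text = value
    return root


root = rows_to_xml(SHEET)
xml_text = ET.tostring(root, encoding="unicode")
ET.fromstring(xml_text)  # raises ParseError if the output is not well-formed
print(xml_text)
```

Parsing the serialized text again only checks that the XML is well-formed. Validation against the real schema would still be done with the target system's own tools.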

Creating records in catalogs is largely a manual task and requires human curation. However, it was possible to input information to be automatically processed in the records of the Website collection. For example, the thumbnail was obtained using the Arquivo.pt API, more specifically the linkToScreenShot, visible in the technical details of a preserved page (see the options menu on the top right of a replayed page).

Other elements, such as the site’s title, could also be obtained automatically through the Arquivo.pt API. However, the quality of that information depends on what the site’s producers have inserted, and it may not be accurate. The dates that limit the temporal scope can also be obtained automatically, but the manual method was chosen to control the information presented.

As the project continues, the collection will grow with new records, as there are thousands of websites about the Tourism sector.
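To make the automatic part concrete, here is a minimal sketch that pulls a thumbnail link, title and year out of one search result. The response shape (`response_items`, `title`, `tstamp`, `linkToArchive`) and the sample values are assumptions made for illustration and are not taken from a real response. Field names are matched ignoring case because the exact capitalization of `linkToScreenShot` in the API output is not confirmed here.

```python
import json

# Made-up response in the shape assumed for the API; the values are placeholders.
SAMPLE = json.loads("""
{"response_items": [{
    "title": "  Example site ",
    "tstamp": "20110305120000",
    "linkToArchive": "https://arquivo.pt/wayback/20110305120000/http://example.org/",
    "linkToScreenshot": "https://example.org/screenshot.png"
}]}
""")


def get_field(item, name):
    """Return the value of a field, matching its name case-insensitively."""
    wanted = name.lower()
    for key, value in item.items():
        if key.lower() == wanted:
            return value
    return None


def to_catalog_fields(item):
    """Map one result item to the fields a catalog record could reuse."""
    tstamp = get_field(item, "tstamp") or ""
    return {
        "title": (get_field(item, "title") or "").strip(),
        "year": tstamp[:4],  # assumes a YYYYMMDDHHMMSS timestamp
        "archived_version": get_field(item, "linkToArchive"),
        "thumbnail": get_field(item, "linkToScreenShot"),
    }


fields = to_catalog_fields(SAMPLE["response_items"][0])
print(fields)
```

A title obtained this way would still need human review, for the reason given above: it is only as good as what the site's producers wrote.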

Description of Web contents in the MUVITUR catalog

In the collection “Paginas Web” the following data are used:

  • Denomination – usually the title of the website
  • Organization – the entity to which the publication belongs
  • Website address on the Internet
  • Address for version in Arquivo.pt
  • Moment(s) to remember
  • Link for miniature in Arquivo.pt
  • Descriptors
  • Geographical data (location, coordinates, geographical name)

The presentation of the information was adjusted to be aligned with that of other MUVITUR resources and contains links to Arquivo.pt.

For example, in the record for the Turismo do Algarve site, we find a link to a moment to remember in 2011 and another link to the history in Arquivo.pt under “Consultar objecto”.

Detail of the record for the site "Turismo do Algarve"

Organizations can create collections of Websites from their area

The National Library of Australia, for example, included records of preserved Websites in its catalog. In the Library of Congress there are collections of old Websites alongside traditional resources.

However, websites are rarely included in museums.

With this unprecedented project, we can say that preserved websites have earned their place on digital platforms dedicated to cultural heritage.

MUVITUR has paved the way with this project for other entities to create collections of websites of their interest on their own platforms.

Other results of the collaboration

Afghanistan Websites and the fall of the regime in August 2021


Last updated on September 26th, 2022 at 03:57 pm


Afghanistan Ministry of Economy website with Karima Faryabi (recorded August 17, 2021)

On August 15, 2021 the presidential palace in Kabul was taken over by the Taliban, consummating the fall of the regime that had been in place for 20 years, following the 9/11 attacks on the United States.

The community of Web archivists, through the Content Development Working Group – International Internet Preservation Consortium, was challenged to record the Afghan sites, given the risk that they would disappear with the new regime.

No time to lose when it comes to preserving the Web

Arquivo.pt reacted quickly, launching an automatic content search focused on .af domain sites and on international media news about the ongoing events.

On August 17, the websites began to be recorded.

1800 website addresses from Afghanistan (ending in .af) and 500 media news stories from around the world were used.

The addresses, URLs or “seeds” were obtained through automated search using the Bing Search API and immediately put into recording.
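The filtering step for the .af part of such a seed list can be illustrated as follows. This is a sketch under assumptions, not the procedure Arquivo.pt actually used: the candidate URLs are placeholders, the search API call itself is left out, and `select_seeds` is a hypothetical helper that keeps only http(s) addresses under the target domain and drops duplicates.

```python
from urllib.parse import urlsplit


def select_seeds(urls, tld=".af"):
    """Keep http(s) URLs whose host ends with the given TLD, without duplicates."""
    seen = set()
    seeds = []
    for url in urls:
        url = url.strip()
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if parts.scheme not in ("http", "https") or not host.endswith(tld):
            continue
        # Treat "https://example.af" and "https://example.af/" as the same seed.
        key = (host, parts.path.rstrip("/"))
        if key not in seen:
            seen.add(key)
            seeds.append(url)
    return seeds


# Placeholder candidates, standing in for results returned by a search API.
candidates = [
    "https://example.af/",
    "https://EXAMPLE.af",
    "http://news.example.com/story",
    "ftp://files.example.af/",
    "https://ministry.example.af/about",
]
seeds = select_seeds(candidates)
print(seeds)
```

The news stories from international media would need a different selection rule, since they are not hosted under .af.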

Content available for learning about Afghanistan’s history

As a result of the collection carried out, more than 400 Gigabytes of information became available at Arquivo.pt, which anyone can use for research in the most diverse areas.

The main contribution of Arquivo.pt to the community of Web archivists was the use of automatic search, which allows a quick reaction when recording Web content at imminent risk of being lost.

Learn more

Arquivo.pt open data set (Dados.gov)

Content collected by the Content Development Working Group of the International Internet Preservation Consortium, available at the Archive-It service

Participation of Arquivo.pt in the meetings of the International Internet Preservation Consortium


Last updated on August 1st, 2023 at 05:37 pm

IIPC Web Archiving Conference

The International Internet Preservation Consortium (IIPC), a consortium that brings together Web preservation initiatives from around the world, held its General Assembly with its members between May 17 and 19, 2022.

The following week, on May 24 and 25, the IIPC Web Archiving Conference (IIPC WAC) was held online, as in the previous year, due to the contingencies of the Covid-19 pandemic.

The 2022 edition of these two events was hosted by the Library of Congress.

Arquivo.pt resources and initiatives presented at the IIPC WAC 2022

The IIPC Web Archiving Conference is an initiative open to the community, where people or entities interested in the Web preservation domain may participate.

Arquivo.pt contributed to the Lightning Talks sessions (session 5 and session 13).

The Arquivo.pt presentations focused on the resources and initiatives that the service has recently developed for the community.

Online Cafe with Arquivo.pt continues

Last updated on August 17th, 2022 at 09:27 am


Share this page: arquivo.pt/onlinecafe

Welcome to the third season of the Online Cafe with Arquivo.pt

Talk directly to the Arquivo.pt team and get answers to all your questions! Arquivo.pt has launched a new cycle of online chat sessions with the team. Brief introductory presentations will be given, leaving time for all your questions about how to get more out of Arquivo.pt or how to apply to the Arquivo.pt Awards.

Sessions

February 17, 2022 – Primeiras páginas de jornais online portugueses

“Primeiras páginas de jornais online portugueses” (Front pages of Portuguese online newspapers) presents an interactive graphical analysis of the front pages of Portuguese online newspapers. For this study, specific items within the newspaper design were analysed, thus allowing trends to be observed over time.

Susana Parreira explains how she developed this work as part of her Master's degree, with the collaboration and guidance of Ana Boavida (Universidade de Coimbra), Ana Sabino (Instituto Politécnico de Castelo Branco) and Penousal Machado (Universidade de Coimbra).

22nd session – January 20, 2022 – Politiquices

Politiquices.pt lets you explore relations of support or opposition between political personalities and parties as expressed in news headlines. The application uses information preserved in Arquivo.pt to create an ontology of relations, relying on Natural Language Processing technology. David Batista, who took 2nd place in the Arquivo.pt Awards 2021, will explain how he developed his work and demonstrate its applications for researchers and citizens in general.

Special session – World Digital Preservation Day 2021 – Major minors project – November 5

In November, World Digital Preservation Day is broadly celebrated and, to mark this international initiative, Arquivo.pt held an online session open to the community. Special guests of this session were the winners of the Arquivo.pt Award 2021, Leandro Costa, Paulo Martins and José Carlos Ramalho.

Previous seasons

Presentation at the IIPC Web Archiving Conference 2022

Arquivo.pt certified as an open data provider


Last updated on August 17th, 2022 at 08:39 am

Arquivo.pt has been collaborating with Agência para a Modernização Administrativa (AMA) with the aim of improving the preservation of Public Administration websites.

Collaboration is based on three action points:

AMA is the public organisation responsible for promoting digital means in Public Administration and aims to modernise and simplify citizens’ access to State services.

Arquivo.pt is a service operated by the Fundação para a Ciência e a Tecnologia, I.P. that preserves data published on the Web between 1996 and the present day, making them accessible to any citizen for memory and research purposes.

EU open data directive includes documents on websites

Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information stipulates the following:

“(30) This Directive lays down the definition of the term ‘document’ and that definition should include any part of a document. The term ‘document’ should cover any representation of acts, facts or information — and any compilation of such acts, facts or information — whatever its medium (paper, or electronic form or as a sound, visual or audiovisual recording).

(34) To facilitate re-use, public sector bodies should, where possible and appropriate, make documents, including those published on websites, available through an open and machine-readable format and together with their metadata, at the best level of precision and granularity, in a format that ensures interoperability

(35) A document should be considered to be in a machine-readable format if it is in a file format that is structured in such a way that software applications can easily identify, recognise and extract specific data from it. Data encoded in files that are structured in a machine-readable format should be considered to be machine-readable data. A machine-readable format can be open or proprietary. They can be formal standards or not.

(60) The Commission should facilitate the cooperation among Member States and support the design, testing, implementation and deployment of interoperable electronic interfaces that enable more efficient and secure public services.”

Arquivo.pt is a public service that has the mission of preserving documents published on Internet sites to enable their long-term open access and provides interoperable electronic interfaces (APIs) for their automatic processing.

Portuguese Law No. 68/2021 of August 26, 2021 approves the general principles on open data and transposes the European Directive.

Arquivo.pt was certified as a Public Administration open data provider

AMA recognized Arquivo.pt as a public service and open data provider and awarded its certification seal on the Open Data Portal.

Arquivo.pt collects general information published on the Web of interest to the Portuguese community. However, it is also responsible for the preservation of Public Administration websites, such as the Portal do Governo, in collaboration with the Management Center for the Government Electronic Network (CEGER).

Any citizen can access the open data resulting from these historical archives and, for example, search for official information published on the websites of successive governments.

In 2021, Arquivo.pt provided open access to over 10 billion files (721 TB) from 27 million websites. The open data preserved by Arquivo.pt can be explored through the search interface, automatically through the API (https://arquivo.pt/api) or by reusing derived datasets.

Derived datasets available on the Open Data Portal

Besides the original web artefacts preserved at Arquivo.pt, this service has generated open datasets derived from its activities, which are now available in open access so that they can be reused:

Resources list

Video presentation at the IIPC Web Archiving Conference 2022

Special collection of Portuguese Presidential Elections

Form to suggest a web page, a web site or other web content

Arquivo.pt invites all citizens to suggest web pages related to the 2021 Presidential Elections to be preserved for the future.

The Presidential Elections will take place in Portugal on January 24, 2021.

Your suggestions are important so that Arquivo.pt can keep a more complete memory of this important electoral event.

To suggest web pages use this form (https://tinyurl.com/presidenciais-sugerir)

Arquivo.pt preserves websites of national scientific projects


Last updated on October 1st, 2021 at 09:11 am

Preserving scientific project websites is important

Website content tends to disappear once scientific projects are finished.

The preservation of scientific project websites is important because it:

  • documents the development of projects;
  • ensures access to unique technical and scientific content that researchers have posted on the project websites (e.g. presentations, photographs, data sets);
  • reinforces the visibility of the results of projects financed by FCT.

Experimental collection of scientific projects websites in 2016

Arquivo.pt automatically collected websites for projects financed by FCT in 2016.

The information about these websites was dispersed as it was not recorded during the administrative process.

FCT had been financing scientific projects for about 20 years, so the number of sites was likely too high to be identified manually.

Arquivo.pt therefore developed an automatic methodology for identifying these websites.

The FCT database had a total of 11,996 project entries but did not include references to web addresses. Applying the automatic methodology, 7,956 URLs related to the funded scientific projects were identified.

The collection of content referenced by these addresses resulted in the preservation of 600,721 files (72 GB), including content such as research group web pages, researchers’ personal pages or project-related blogs.

Online references in scientific project reports have been preserved since 2020

From June 2020, the website addresses of projects financed by FCT must be registered in the progress and final reports submitted to FCT.

Arquivo.pt started using these addresses to preserve the contents of websites of national scientific projects in a systematic way.

1st official collection of scientific project websites

In June 2020, Arquivo.pt obtained 263 addresses related to 100 scientific projects from the reports submitted to FCT. Most of the addresses (67%) did not have any version previously preserved in Arquivo.pt.

The addresses obtained point to online resources such as the websites of the projects, R&D units, news in the media, articles in scientific journals or repositories, databases, videos on YouTube or Facebook pages.

In July 2020, a special collection was launched from this set of addresses which resulted in 6.9 GB of information obtained from the visit to 31,606 URLs.

Exhibition about Research & Development projects

The Scientific Research Memory is an online exhibition dedicated to the websites of scientific projects funded by the Foundation for Science and Technology (FCT) that Arquivo.pt has preserved.

There are also websites of the Research & Development Units financed by FCT.

Memorial do Arquivo.pt preserves scientific websites for free

The Memorial do Arquivo.pt service has preserved historic FCT websites that have been disabled. These were created for events or initiatives that have ended and therefore their contents are no longer updated.

To include a website in the Memorial, Arquivo.pt starts by making a high-quality collection of its contents.

Then, the collected contents are validated in collaboration with those responsible for the website.

Finally, the website address is redirected to the contents that have been preserved by Arquivo.pt.

For example, if someone wants to access any page on the Scientific Archives Meeting held in 2014, they will be redirected to Arquivo.pt.

Thus, the contents remain accessible over time, and any links or references to them in scientific communications do not break.

The digital preservation service Memorial do Arquivo.pt is free of charge for websites of the academic and scientific community. Just send a request to contacto@arquivo.pt.

Learn more