The research and education community has been asking for bulk download of web-archived data and index files (CDXJ), for instance to feed AI training models, optimize the routing of web archive requests or recover information from selected websites (e.g. news).
Arquivo.pt has begun making all its CDXJ index files publicly available in real time to facilitate the bulk download of web-archived data. Learn how at:
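As background for readers who have not worked with CDXJ before: each line of a CDXJ index generally holds a sortable URL key, a 14-digit capture timestamp and a JSON block describing where the capture is stored. The sketch below parses one such line in Python. The sample line and its JSON field names (url, status, filename, offset, length) are made up for illustration and are an assumption, not a copy of Arquivo.pt's actual index files.

```python
import json
from typing import NamedTuple


class CdxjRecord(NamedTuple):
    urlkey: str      # sortable URL key, e.g. "pt,example)/"
    timestamp: str   # 14-digit capture time, YYYYMMDDhhmmss
    fields: dict     # JSON block (URL, status, WARC filename, offset, ...)


def parse_cdxj_line(line: str) -> CdxjRecord:
    """Split one CDXJ line into its URL key, timestamp and JSON fields."""
    urlkey, timestamp, json_block = line.rstrip("\n").split(" ", 2)
    return CdxjRecord(urlkey, timestamp, json.loads(json_block))


# Made-up sample line for illustration only (not real Arquivo.pt data).
sample = (
    'pt,example)/ 20200415120000 '
    '{"url": "http://example.pt/", "status": "200", '
    '"filename": "example.warc.gz", "offset": "1024", "length": "512"}'
)

record = parse_cdxj_line(sample)
print(record.timestamp, record.fields["url"])
```

With the filename, offset and length in hand, a downloader can typically fetch only the bytes of the capture it needs instead of a whole archive file.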
One of them was the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt” developed by Daniel Gomes and Ricardo Campos.
Now, Arquivo.pt has contributed to preserving online information that documents R&D projects funded by the Horizon 2020 programme (2014-2021). It preserved 197 million web files (17 TB) related to science for future access.
H2020 projects publish valuable information online, but it is being lost
However, after projects end, the corresponding websites usually disappear, causing a permanent loss of unique and valuable scientific information.
Arquivo.pt automatically identified URLs that document H2020 Research and Development projects
The European Union’s Open Data Portal published a data set from the Community Research and Development Information Service (CORDIS) that documents H2020 research projects. However, of the 31 129 projects listed, only 46% presented a project URL.
Arquivo.pt developed a low-cost methodology that automatically identifies URLs related to R&D projects so that they can be systematically preserved. This automatic identification is achieved by combining open data sets with web search services. The methodology is detailed in a scientific article published at the International Conference on Digital Preservation 2016.
In sum, we extracted 106 300 unique URLs from the following open data sets:
Then, we extracted the acronym and title of the projects from the data sets and automatically searched the web for additional URLs using the Bing Search API.
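The search step can be sketched roughly as follows in Python. This is not the software released by Arquivo.pt: the dictionary keys `acronym` and `title`, the function names and the `fake_search` stand-in are all hypothetical, and no real search API is called. The sketch only illustrates the idea of querying by acronym and title when a project lists no URL and keeping the top 10 results.

```python
from typing import Callable, Dict, List

# Hypothetical signature: takes a query string, returns ranked result URLs.
SearchFn = Callable[[str], List[str]]


def build_query(acronym: str, title: str) -> str:
    """Combine a project's acronym and title into one search query."""
    return f"{acronym} {title}".strip()


def candidate_urls(project: Dict[str, str], search: SearchFn, top_n: int = 10) -> List[str]:
    """Return the listed project URL, or the top search results if none is listed."""
    if project.get("projectUrl"):
        return [project["projectUrl"]]
    return search(build_query(project["acronym"], project["title"]))[:top_n]


def fake_search(query: str) -> List[str]:
    """Stand-in for a real web search client; returns fixed dummy results."""
    return [f"http://example.org/result{i}" for i in range(20)]


# Made-up project record with an empty projectUrl field.
project = {"acronym": "DEMO", "title": "A made-up project title", "projectUrl": ""}
urls = candidate_urls(project, fake_search)
print(len(urls))
```

Passing the search function as a parameter keeps the sketch independent of any particular search provider, so a real client could be plugged in later.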
All the data sets and tools developed have been made publicly available in open access so that they can be reused and collaboratively enhanced. In particular, you can access the software developed to automatically identify additional URLs about H2020 projects.
197 million web files related to science were preserved
Arquivo.pt identified and preserved 197 million web files (17 TB) that document R&D projects funded by Horizon 2020.
Contributions to complement the European Open Data Sets
All the resulting data sets were made publicly available so that they can be improved and reused by other organizations that are also interested in preserving this digital heritage:
URLsBingSearch (column V): top 10 search results returned by Bing API when column projectUrl (column K) in the original data set was empty (e.g. http://extmos.eu/)
ArchivedProjectURLs (column W): direct link to access the preserved version of the projectUrls and URLsBingSearch in Arquivo.pt (e.g. https://arquivo.pt/wayback/http://extmos.eu)
archivedOrganizationUrl (column Y): direct link to access the preserved version of the organizationUrl (column O) in Arquivo.pt (e.g. https://arquivo.pt/wayback/www.it.pt)
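Judging from the two examples above, the archived links appear to be formed by prefixing the original URL with `https://arquivo.pt/wayback/`. A minimal Python sketch under that assumption (the function name is hypothetical):

```python
WAYBACK_PREFIX = "https://arquivo.pt/wayback/"


def archived_link(original_url: str) -> str:
    """Build a link to the preserved version of a URL in Arquivo.pt."""
    return WAYBACK_PREFIX + original_url.strip()


print(archived_link("http://extmos.eu"))
# prints https://arquivo.pt/wayback/http://extmos.eu
```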
When a user enters a set of words about a topic in the Arquivo.pt search box and clicks on the “Narrative” button, the user is directed to the “Conta-me Histórias” service, which automatically analyzes the news from 25 websites archived by Arquivo.pt over time and presents a chronology of news related to the topic.
Figure 1: Search results for pages about “Justin Bieber”.
Figure 2: Narrative of news about “Justin Bieber” from Portuguese news sites preserved by Arquivo.pt generated by the “Conta-me Histórias” service.
Create your narrative now!
“Conta-me Histórias” searches, analyzes and aggregates thousands of results to generate each narrative about a topic. To obtain good narratives, choose descriptive words about well-defined themes, personalities or events.
Creating a narrative is useful for researchers, journalists or citizens who want a quick overview of how a topic evolved over time, saving them a lot of time and effort.
Go to Arquivo.pt and try to create a narrative about a theme of your choice.
Web Archiving Conference 2021 – the most important meeting in the field of Web preservation, where experts share new knowledge and experiences
RESAW Conference – meeting of the European RESAW network (Research Infrastructure for the Study of Archived Web Materials) this year in its 4th edition, mainly addressed to the community of researchers from non-technological scientific areas, such as Social Sciences, Arts and Humanities.
Contributions of Arquivo.pt to the international community
Arquivo.pt presented some results of the work developed in the last year, with emphasis on the functionalities that improve the reproduction of archived contents, such as “Complete the page”.
Two historical collections were integrated into Arquivo.pt: Geocities and the Internet Memory Foundation. Arquivo.pt also created special collections about the 2019 European Elections and Covid-19.
The contents of Arquivo.pt are accessible to any researcher regardless of the country they are in and therefore it is a useful service to the international community.
Presentations
Arquivo.pt updates 2021: presentation at the IIPC – General Assembly, by Daniel Gomes (Video)
Complete the page. 1 minute drop in (presentation at the IIPC – General Assembly “complete the page”), by Daniel Gomes (Slide, Video)
A transnational and cross-lingual crawl of the European Parliamentary Elections 2019, by Ivo Branco (Slides, Video)
Enhancing access to research the Geocities historical collection, by Pedro Gomes (Slides, Video)
Complete the page – demo. Slide used in the IIPC 1 minute presentation, at the IIPC General Assembly 2021
The historical collection of web content generated during the Internet Memory Foundation’s (IMF) activity has been donated to Arquivo.pt and is now searchable!
The IMF was a European organization dedicated to preserving web content that was wound up in 2018.
The 1st web archiving project in Europe (2004-2010) was led by Julien Masanès (who was guest of honour at the celebration of 10 years of Arquivo.pt) and was called European Archive Foundation.
In 2010, Julien Masanès, the “father” of Web archives in Europe, created the IMF.
Examples of pages from the collection donated by the IMF
The collection donated by the IMF has now been integrated in the Arquivo.pt collection to be preserved for posterity.
This collection is composed of 142 million files that total 6.3 TB of historical information whose texts or images can now be searched through Arquivo.pt.
This new collection has been named “InternetMemory” in the Arquivo.pt collections list.
Searches can be made on this collection using the collection search parameter or through the custom search page available at arquivo.pt/InternetMemory.
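As an illustration of restricting a search to this collection, the Python sketch below only builds a request URL; it does not send it. The endpoint `https://arquivo.pt/textsearch` and the parameter names `q`, `collection` and `maxItems` are assumptions that should be checked against the Arquivo.pt API documentation before use. Only the collection name “InternetMemory” comes from the text above.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Arquivo.pt API documentation.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def collection_search_url(query: str, collection: str = "InternetMemory",
                          max_items: int = 10) -> str:
    """Build a full-text search URL limited to a single collection."""
    # Parameter names are assumptions, not confirmed by the source text.
    params = {"q": query, "collection": collection, "maxItems": max_items}
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


print(collection_search_url("european archive"))
```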
This external service is useful for research use cases in areas such as Web design, Art, Communication or History, where it is necessary to access the original visual aspect of a page from the past as faithfully as possible.
Web page of the European Map of WWW/NIR sites in 1996 using the Oldweb.Today service
You may have to wait a while for your request to be processed but it is always faster than having to install a browser from the past on your computer.
Export search results to spreadsheet format
This new function enables users to save their search results for further treatment and analysis. This is especially useful for thorough research on a given topic.
After a search, open the Options and choose one of the available formats to export the results: XLSX, CSV or TXT.
Results of a search for the word “universidade” (university), limited to 10 items, exported to an Excel sheet
Arquivo.pt launched a new version of its service, named WebApp, on April 15, 2020.
The purpose of this version was to standardize the user experience between different devices and reduce maintenance costs by removing components with redundant functions.
Its main novelty is the combination of the desktop and mobile interfaces in a single user interface.
The old desktop version has been disabled and the mobile version has evolved to work on various types of devices and screen sizes.
The Arquivo.pt Award 2020 was officially launched on January 16th, at the Público Newspaper in Lisbon. Público is one of the most well-known newspapers in Portugal.
The event had talks by the Director of Público Newspaper, Manuel Carvalho, the President of the Foundation for Science and Technology, Helena Pereira, and the manager of Arquivo.pt, Daniel Gomes.
Participants in this open event were taken on a guided tour of the newsroom, where they saw how the contents of a newspaper are edited and produced in a real setting.
Applications are open for the Arquivo.pt Award 2020!
In this 3rd edition of the annual Arquivo.pt Award, € 15,000 will be awarded to the 3 best works (1st place: € 10,000).
The deadline for submissions is May 4, 2020.
Works may be developed individually or in groups, on any topic, as long as they use the information provided by Arquivo.pt as the main source of information.