Last updated on June 16th, 2023 at 01:40 pm
The main objective of Arquivo.pt is to preserve online information for research and education purposes.
Arquivo.pt has now helped preserve online information that documents R&D projects funded by the Horizon 2020 programme (2014-2021). It preserved 197 million web files (17 TB) related to science for future access.
H2020 projects publish valuable information online, but it is being lost
Websites about Research and Development (R&D) projects are increasingly being used to publish and disseminate important scientific information that complements published literature (e.g. data sets, documentation or software).
However, after projects end, the corresponding websites usually disappear, causing a permanent loss of unique and valuable scientific information.
Arquivo.pt automatically identified URLs that document H2020 Research and Development projects
The European Union’s Open Data Portal published a data set from the Community Research and Development Information Service (CORDIS) that documents H2020 research projects. However, of the 31 129 projects listed, only 46% included a project URL.
Arquivo.pt developed a low-cost methodology that automatically identifies URLs related to R&D projects so that they can be systematically preserved. This automatic identification is achieved by combining open data sets with web search services. The methodology is detailed in a scientific article published at the International Conference on Digital Preservation 2016.
In sum, we extracted 106 300 unique URLs from the following open data sets:
Then, we extracted the acronym and title of the projects from the data sets and automatically searched the web for additional URLs using the Bing Search API.
All the data sets and tools developed have been made publicly available in open access so that they can be reused and collaboratively enhanced. In particular, you can access the software developed to automatically identify additional URLs about H2020 projects.
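The search step described above can be illustrated with a minimal sketch. This is not the software published by Arquivo.pt: the function names, the query format (acronym followed by title) and the `search` callable are assumptions made for illustration. The `search` parameter stands in for whatever web search client is used, so no specific Bing Search API call is assumed. The column names `projectUrl` and `URLsBingSearch` follow the data sets described in this article.

```python
from typing import Callable, Iterable


def build_query(acronym: str, title: str) -> str:
    """Combine a project's acronym and title into one search query.

    The exact query format is an assumption, not the published method.
    """
    return f"{acronym} {title}".strip()


def complement_project_urls(
    projects: Iterable[dict],
    search: Callable[[str], Iterable[str]],
    top_n: int = 10,
) -> list[dict]:
    """Add a URLsBingSearch field to each project row.

    For rows whose projectUrl is empty, the hypothetical `search`
    callable is queried and the first `top_n` result URLs are kept.
    Rows that already have a projectUrl get an empty list.
    """
    complemented = []
    for project in projects:
        row = dict(project)  # copy so the input rows are not modified
        if not (row.get("projectUrl") or "").strip():
            query = build_query(row.get("acronym", ""), row.get("title", ""))
            row["URLsBingSearch"] = list(search(query))[:top_n]
        else:
            row["URLsBingSearch"] = []
        complemented.append(row)
    return complemented
```

Passing the search client in as a parameter keeps the sketch independent of any particular search service and makes it easy to try with a stub function.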
197 million web files related to science were preserved
Arquivo.pt identified and preserved 197 million web files (17 TB) that document R&D projects funded by Horizon 2020.
By 2021, some project websites were already no longer available online, such as that of the Extended Model of Organic Semiconductors (EXTMOS) project (http://extmos.eu/). However, this website was preserved and can be accessed at Arquivo.pt:
Contributions to complement the European Open Data Sets
All the resulting data sets were made publicly available so that they can be improved and reused by other organizations that are also interested in preserving this digital heritage:
- Cordis-h2020projectsComplementedByArquivoPT.xlsx: contains 2 additional columns in comparison to the original data set:
  - URLsBingSearch (column V): top 10 search results returned by Bing API when column projectUrl (column K) in the original data set was empty (e.g. http://extmos.eu/)
  - ArchivedProjectURLs (column W): direct link to access the preserved version of the projectUrls and URLsBingSearch in Arquivo.pt (e.g. https://arquivo.pt/wayback/http://extmos.eu)
- Cordis-h2020organizationsComplementedByArquivoPT.xlsx: 1 additional column:
  - archivedOrganizationUrl (column Y): direct link to access the preserved version of the organizationUrl (column O) in Arquivo.pt (e.g. https://arquivo.pt/wayback/www.it.pt)
- Cordis-h2020projectDeliverablesComplementedByArquivoPT.xlsx: 1 additional column:
  - archivedUrl (column K): direct link to access the preserved version of the url (column I) in Arquivo.pt (e.g. https://arquivo.pt/wayback/https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=080166e5c0231ec0&appId=PPGMS)
- Cordis-h2020reportsComplementedByArquivoPT.xlsx: 1 additional column:
  - archivedUrl (column P): direct link to access the preserved version of the url (column O) in Arquivo.pt (e.g. https://arquivo.pt/wayback/http://crome.ces.uc.pt)
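Judging from the examples in the list above, each archived link appears to be the original URL prefixed with `https://arquivo.pt/wayback/`. The following minimal sketch reproduces that pattern. The function name `archived_link` is hypothetical, and the prefix rule is inferred from the examples rather than taken from Arquivo.pt's documentation.

```python
# Prefix inferred from the example links in the data set descriptions.
WAYBACK_PREFIX = "https://arquivo.pt/wayback/"


def archived_link(url: str) -> str:
    """Return the assumed Arquivo.pt link for an original URL.

    The original URL is appended unchanged (after trimming surrounding
    whitespace), matching the examples given for the archived columns.
    """
    return WAYBACK_PREFIX + url.strip()


print(archived_link("http://extmos.eu"))
print(archived_link("www.it.pt"))
```

A helper like this could be applied to a column such as projectUrl or organizationUrl to rebuild the corresponding archived column, assuming the pattern holds for every row.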
To learn more about this collection, watch the video Preservation of web content related to Horizon 2020.
- Previous work to preserve online documentation about FP4-FP7 projects
- Software and datasets to automatically identify online information about H2020 projects
- Technical Report about collection H2020 EU projects (in Portuguese)
- Video Preservation of web content related to Horizon 2020, Pedro Gomes, October 2021
- Presentation Preservation of web content related to Horizon 2020, Pedro Gomes, June 2023