Arquivo.pt automatically identified R&D project websites to preserve their content. It preserved 52 million web files (7 TB) related to science for future access.
R&D websites publish valuable information but are being lost
Arquivo.pt automatically identified URLs related to Research and Development projects
The main objective of Arquivo.pt is to preserve online information for scientific and academic purposes. Therefore, it developed a pragmatic and low-cost process that automatically identifies URLs related to R&D projects to be systematically preserved. Automatic identification is achieved through the combination of open data sets with free search services. This work is detailed in an article published at the International Conference on Digital Preservation 2016.
All the data sets and tools developed during this research have been made publicly available in open access so that they can be reused and collaboratively enhanced.
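As an illustration of how an open data set can be combined with a search service, here is a minimal sketch in Python. It is not the tool published by Arquivo.pt. The record fields (`id`, `acronym`, `title`), the excluded-host list, and the `search` callable are all assumptions made for the example. The search service is passed in as a function so that no particular search API is implied.

```python
from typing import Callable, Dict, Iterable, List, Optional
from urllib.parse import urlparse

# Hypothetical list: hosts that describe projects but are not the project's own site.
EXCLUDED_HOSTS = {"cordis.europa.eu", "ec.europa.eu", "en.wikipedia.org"}


def build_query(acronym: str, title: str) -> str:
    """Join the non-empty data set fields into one search query.

    Using acronym plus title is an assumption, not the published method.
    """
    return " ".join(part.strip() for part in (acronym, title) if part and part.strip())


def pick_candidate(results: Iterable[str]) -> Optional[str]:
    """Return the first result URL whose host is not in EXCLUDED_HOSTS."""
    for url in results:
        host = (urlparse(url).hostname or "").lower()
        if host and host not in EXCLUDED_HOSTS:
            return url
    return None


def identify_project_urls(
    projects: Iterable[Dict[str, str]],
    search: Callable[[str], List[str]],
) -> Dict[str, str]:
    """Map each project id to a candidate website URL, when one is found.

    `search` is any function that takes a query string and returns a
    ranked list of result URLs (a stand-in for a free search service).
    """
    found: Dict[str, str] = {}
    for project in projects:
        query = build_query(project.get("acronym", ""), project.get("title", ""))
        if not query:
            continue  # nothing to search for
        url = pick_candidate(search(query))
        if url is not None:
            found[project["id"]] = url
    return found
```

Passing the search service in as a callable keeps the sketch testable offline: a stub that returns a fixed list of URLs is enough to exercise the selection logic.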
52 million web files related to science were preserved
Contributions to complement the European Open Data Portal data sets
The process was applied to the data sets published through the European Open Data Portal to fill in missing project URLs. The results showed that the completeness of the FP7 data set improved by 86.6%.
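The text above does not define how the improvement was measured. One plausible reading, shown in the sketch below, treats completeness as the share of project records with a non-empty URL and reports the relative change after new URLs are added. The field name `url` and the sample records are invented for illustration and are not taken from the FP7 data set.

```python
from typing import Dict, List


def completeness(records: List[Dict[str, str]], field: str = "url") -> float:
    """Fraction of records whose `field` is present and non-blank."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if (record.get(field) or "").strip())
    return filled / len(records)


def relative_improvement(before: float, after: float) -> float:
    """Relative change in completeness, e.g. 0.5 means a 50% improvement."""
    if before == 0:
        raise ValueError("relative improvement is undefined when 'before' is 0")
    return (after - before) / before


# Made-up example: 2 of 4 projects have a URL before, 3 of 4 after.
original = [{"url": "http://a.example"}, {"url": ""}, {"url": "http://c.example"}, {}]
complemented = [
    {"url": "http://a.example"},
    {"url": "http://b.example"},
    {"url": "http://c.example"},
    {},
]

gain = relative_improvement(completeness(original), completeness(complemented))
print(f"completeness improved by {gain:.1%}")
```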
- Unavailability of online supplementary scientific information from articles published in major journals, The FASEB Journal, 2005
- The Case for the Preservation of R&D Project Websites, International Internet Preservation Blog 2016
- Preserving Websites Of Research & Development Projects, International Conference on Digital Preservation 2016 (ppt, bibtex)
- Software and experimental datasets to automatically identify sites of R&D
- European Open Data Portal database complemented by Arquivo.pt using the proposed process. The new project URLs are available in the column “Identified Websites” of the files FP4, FP5, FP6, FP7.