Arquivo.pt automatically identified R&D project websites to preserve their content. It preserved 52 million web files (7 TB) related to science for future access.
R&D websites publish valuable information but are being lost
Arquivo.pt automatically identified URLs related to Research and Development projects
The main objective of Arquivo.pt is to preserve online information for scientific and academic purposes. Therefore, it developed a pragmatic and low-cost process that automatically identifies URLs related to R&D projects to be systematically preserved. Automatic identification is achieved through the combination of open data sets with free search services. This work is detailed in an article published at the International Conference on Digital Preservation 2016.
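The combination of an open data set with a search service can be sketched roughly as follows. This is a minimal illustration, not the process described in the article: the column names (acronym, title, url), the query format, the excluded-host heuristic and the injected search function are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence
from urllib.parse import urlparse

# Hypothetical signature: takes a query string, returns result URLs in rank order.
SearchFn = Callable[[str], List[str]]


def build_query(acronym: str, title: str) -> str:
    """Build a search query from project metadata (format is an assumption)."""
    return f'"{acronym}" {title} project'.strip()


def pick_candidate(results: Iterable[str], excluded_hosts: Sequence[str]) -> Optional[str]:
    """Return the first result whose host is not on the excluded list."""
    for url in results:
        host = urlparse(url).netloc.lower()
        if not host:
            continue
        if any(host == h or host.endswith("." + h) for h in excluded_hosts):
            continue  # e.g. funder portals that describe the project but are not its website
        return url
    return None


def identify_project_urls(
    rows: Iterable[Dict[str, str]],
    search: SearchFn,
    excluded_hosts: Sequence[str] = ("europa.eu",),
) -> Dict[str, str]:
    """Map project acronym to a candidate URL for rows that lack a URL."""
    identified: Dict[str, str] = {}
    for row in rows:
        if row.get("url"):
            continue  # the data set already lists a website for this project
        query = build_query(row["acronym"], row["title"])
        candidate = pick_candidate(search(query), excluded_hosts)
        if candidate is not None:
            identified[row["acronym"]] = candidate
    return identified
```

In practice, rows could come from csv.DictReader over a downloaded data set file, and search would wrap whichever search service is available. Candidates found this way would still need validation before being treated as the official project website.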
52 million web files related to science were preserved
Applying this process has already enabled the preservation of 52 million files (7 TB) obtained from 53 993 websites of R&D projects funded since FP4 (1994). One example is the FP7-funded WEZARD project, aimed at “preparing the future research community in the area of air transport system robustness when it is faced with weather hazards”. The website for this project (www.wezard.eu) is no longer available online. However, it was preserved and can be accessed at Arquivo.pt.
All the websites identified and preserved during this project have been accessible through Arquivo.pt since March 2017.
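To give a sense of scale, the figures above can be turned into rough averages. The sketch below assumes decimal units (1 TB = 10^12 bytes) and treats the published totals as exact, so the results are only approximations.

```python
# Totals as published above.
FILES = 52_000_000
SITES = 53_993
TOTAL_BYTES = 7 * 10**12  # assumption: 7 TB in decimal units

files_per_site = FILES / SITES        # roughly 963 files per website
bytes_per_file = TOTAL_BYTES / FILES  # roughly 135 kB per file

print(f"{files_per_site:.0f} files per site, {bytes_per_file / 1000:.0f} kB per file")
```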
Contributions to complement the European Open Data Portal data sets
The process was applied to the data sets published through the European Open Data Portal to fill in missing project URLs. The results showed that the completeness of the FP7 data set was improved by 86.6%.
All the resulting data sets were made publicly available so that they can be improved and reused by other organizations that are also interested in preserving this digital heritage (FP4, FP5, FP6, FP7).
European Open Data Portal data sets complemented by Arquivo.pt with the proposed process. The new project URLs are available in the column “Identified Websites” of the files FP4, FP5, FP6, FP7.
“How can my blog stay in the Digital History of Portugal?” is the starting question for this meeting dedicated to digital preservation.
On 23 February 2017, the FCCN unit of FCT in Lisbon, responsible for Arquivo.pt, hosted a free training session for bloggers in the areas of technology, lifestyle and fashion. With the aim of securing their blogs a place in the history of the Portuguese Web, the bloggers joined the Arquivo.pt research infrastructure for sessions on digital preservation techniques.
This training will take place during the event Jornadas FCCN 2017 at UTAD from 19 to 21 April.
Training agenda
This training course will take place at Universidade de Trás-os-Montes e Alto Douro (UTAD) on 20 April, 14:30-16:00.
Arquivo.pt: an innovative service at your disposal
How to publish preservable information for the future
Automatic access to Arquivo.pt (APIs)
Do not miss the Zapping session on the other services available to you!
We also highlight the “Zapping session of FCCN projects and services” (20 April, 9:30). In just one hour, anyone can get to know all the services that FCCN offers to the academic community free of charge.
Registrations
Registration is free and includes social events. However, the number of places is limited and registrations are accepted in the order they are received. The main objective of Jornadas FCCN is to interact with the local communities.
Please spread the word to anyone who may be interested.
Arquivo.pt released a new version of its service on the 25th of January 2017.
The new version, named PyCDX, introduces significant improvements in the replay quality of the preserved pages.
These improvements resulted from the adoption of PyWb technology developed by Ilya Kreymer.
The replay of the preserved pages is now more comprehensive, with the loading of additional images, PDF, CSS, and other Web content that previously were not reproduced.
Arquivo.pt released a new version on the 7th of November 2016.
The new version, named Hercules, presents design improvements, especially in the user interface for the replay of preserved Web pages, such as:
Minimize and Maximize options for the toolbar, available by clicking on the upper right corner, so that users can view the preserved Web page in full screen;
New ‘Complete the page’ function, which attempts to obtain missing elements on the preserved page (e.g. images) from external Web archives using the Memento Time Travel reconstruct service;
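For readers curious about how a reconstruct request is addressed, here is a minimal sketch that builds a Memento Time Travel reconstruct URL for a page and a target date. The URL pattern (base address, 14-digit timestamp, then the original URL) reflects the publicly documented Time Travel service as far as we know and should be verified against its current documentation. The function name is hypothetical, and this is not necessarily how the ‘Complete the page’ function is implemented internally.

```python
from datetime import datetime

# Assumed base address of the Time Travel reconstruct service.
RECONSTRUCT_BASE = "http://timetravel.mementoweb.org/reconstruct"


def reconstruct_url(original_url: str, when: datetime) -> str:
    """Build a reconstruct URL for original_url as close as possible to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit timestamp, YYYYMMDDhhmmss
    return f"{RECONSTRUCT_BASE}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(reconstruct_url("http://www.wezard.eu/", datetime(2016, 11, 7)))
```

The function only builds the address and makes no network request, so it can be tried offline.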
Arquivo.pt performed two crawls of Web pages related to the Portuguese Presidential Elections of 2016.
We had asked the community to contribute by suggesting Web pages related to the Presidential Elections of 2016 so that they could be archived.
We performed two crawls, during and after the election campaign period, using the list of 284 Web pages suggested by the community, archiving a total of 551 672 Web resources, which occupy 7 GB.
The collected Web pages include the official sites of the candidates and of political parties, media news about the elections, blogs and opinion articles.
Arquivo.pt respects an embargo period of 1 year, and for that reason the archived collection will only be available by the end of 2016.
However, you can already consult some archived Web pages from the previous Portuguese Parliamentary Elections, such as:
On 12 January 2016 we issued a press release about Arquivo.pt, in which we explain our service, its history, present, collaborations and future challenges.
Arquivo.pt is a free service that allows users to search archived Web data dating back to 1996. The Arquivo.pt research infrastructure is mainly focused on preserving content relevant to the Portuguese community.
There are currently about 2 700 million archived files, which correspond to 95 TB of information. Anyone can suggest interesting sites to be archived by simply going to https://arquivo.pt/suggest.
In 2015, Arquivo.pt collected 580 million files for preservation, and the search service registered an average of 3 692 users per month, of whom 90% were new users.
In 2016, Arquivo.pt will make archived Web pages from the .eu domain available, and the Arquivo.pt team will work on a prototype that will allow the search and visualization of archived images.