Last updated on August 1st, 2019 at 02:35 pm
Arquivo.pt (the Portuguese web archive) performed an experiment to preserve .EU web sites.
The .EU domain is commonly used for sites related to Europe. The strategy adopted to archive the World Wide Web has been to delegate responsibility for each national domain to the respective national archiving institution. However, the .EU domain does not fit this model because it covers multiple nations. Thus, the preservation of .EU sites had not yet been assigned to, or undertaken by, any institution.
RESAW is a European network that aims to create a Research Infrastructure for the Study of Archived Web Materials (resaw.eu).
Arquivo.pt made a first attempt to crawl and preserve web sites hosted under the .EU domain within the scope of RESAW activities. This first crawl began on 21 November 2014 and finished on 16 December 2014.
We performed two more crawls of the .EU domain. Each crawl was indexed and became searchable through Arquivo.pt one year after its finish date. Moreover, we made available a prototype that enables focused search over the .EU crawls, which demonstrates how simple it is to create search engines that target specific collections by using the “collection” search parameter.
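As a rough illustration of the idea, a focused search can amount to adding a “collection” parameter to an ordinary search request. The sketch below only builds such a URL and does not contact any service. The endpoint path, the `q` parameter name, and the collection identifier `EXAMPLE_EU_CRAWL` are assumptions made for illustration and are not taken from this post; check the Arquivo.pt documentation for the actual values.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint, not confirmed by this post.
SEARCH_ENDPOINT = "https://arquivo.pt/textsearch"
# Assumption: hypothetical collection identifier for one of the .EU crawls.
EU_COLLECTION = "EXAMPLE_EU_CRAWL"


def build_focused_search_url(query, collection=EU_COLLECTION,
                             endpoint=SEARCH_ENDPOINT):
    """Return a search URL restricted to a single collection."""
    # "q" is an assumed name for the query parameter;
    # "collection" is the parameter named in the text above.
    params = {"q": query, "collection": collection}
    return f"{endpoint}?{urlencode(params)}"


url = build_focused_search_url("european parliament")
print(url)
```

Under these assumptions, pointing the same function at a different collection identifier would give a search restricted to another crawl, which is what makes this kind of focused search engine simple to set up.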
Collaborations with researchers interested in studying the collected web data sets or crawl logs are welcome.
To learn more
- Prototype of focused search over the .EU crawls implemented using the search operator “collection”
- The Curious Case of Archiving .eu, chapter in the book The Historical Web and Digital Humanities: The Case of National Web Domains
- Opportunities and challenges in collecting and studying national webs (video, PDF)
- A first attempt to archive the .EU domain, technical report
- Heritrix original crawl log (19.6 GB)
- Heritrix generated reports (21.5 MB)
- Analysis sheet generated using the Notebook Python library