Research open datasets on 2019 European Parliamentary Elections

Online documents in several languages about the 2019 European Parliamentary Elections have been preserved.

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned. 

The team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. In parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.

In the second step, the team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content. 

The obtained web data was aggregated into one special collection identified as EAWP23, which became searchable and accessible in July 2020.

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to generate ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students. 

Associated partners contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment in Lisbon from June to August 2022.

The idea was to develop part of his research on the syntactic structures of EU languages using the preserved textual resources, and to exchange knowledge with the web-archiving experts on strategies to extract and process historical web data.

Diego Alves defended his Ph.D. thesis, entitled “Computational typological analysis of syntactic structures in European languages”, in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

  1. Extract text: the textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, so that texts written in different languages could be stored in distinct files;
  2. Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g. repeated instances and empty lines);
  3. Double-check of language identification: the language of each cleaned text was verified again to eliminate possible errors introduced during the previous steps.
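The cleaning and double-check steps can be illustrated with a minimal sketch. It is not the project's script: the function names, the rule of discarding texts whose detected language changes after cleaning, and the toy detector are all assumptions made for illustration. In a real run, the `detect` argument could be the `detect` function of the langdetect library.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def clean_text(text: str) -> str:
    """Drop empty lines and repeated lines, keeping first occurrences in order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def build_corpora(
    texts: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group cleaned texts by language.

    Assumption: a text is kept only when the language detected before
    cleaning matches the language detected after cleaning.
    """
    corpora = defaultdict(list)
    for raw in texts:
        first_guess = detect(raw)
        cleaned = clean_text(raw)
        if cleaned and detect(cleaned) == first_guess:
            corpora[first_guess].append(cleaned)
    return dict(corpora)


def toy_detect(text: str) -> str:
    """Hypothetical stand-in for a real language detector."""
    return "pt" if "eleições" in text.lower() else "en"


sample = [
    "Eleições europeias\n\nEleições europeias\n2019",
    "European elections\n\n2019",
]
corpora = build_corpora(sample, toy_detect)
```

Passing the detector as a parameter keeps the sketch testable without installing any third-party library.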

Two new research datasets are openly available!

The result is a publicly available dataset of cleaned and language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens of each corpus extracted from the preserved collection 2019 European Union Elections (EAWP23).

The aforementioned corpus was automatically annotated regarding part-of-speech and dependency relations to generate a corpus with syntactic information which is useful for linguistic studies. 

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied. 

The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
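For readers unfamiliar with CoNLL-U, the sketch below reads the format with the Python standard library only. It relies on the standard layout: one token per line with ten tab-separated fields, comment lines starting with `#`, and a blank line between sentences. The two-token sample is invented for illustration and is not taken from the dataset.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped.
            continue
        else:
            sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


# Invented sample, for illustration only.
SAMPLE = (
    "# text = Europe votes\n"
    "1\tEurope\tEurope\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tvotes\tvote\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
```

Each token dictionary exposes the part-of-speech tag under `upos` and the dependency relation under `deprel`, which are the two annotation layers mentioned above.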

The syntactically annotated texts about the 2019 European Elections are publicly available!

Preserving Wikipedia citations

Last updated on September 19th, 2023 at 01:49 pm

Wikipedia is an educational resource degraded by broken links

The web archive preserves information published online so that it can be used later for research and educational purposes. For example, it preserved online information about European projects funded by H2020.

Wikipedia articles reference external web pages with important complementary information that became unavailable on their original websites.

Wikipedia articles are among the most used online resources for educational purposes. However, they sometimes reference external web pages with important complementary information that has disappeared from the original websites. This problem degrades the quality of Wikipedia as a credible and verifiable source of information.

In August 2023, the team carried out an experiment to measure the percentage of external links (outside the domain) that were broken in Portuguese Wikipedia articles. The obtained results showed that 25% of the external links referenced in the Portuguese Wikipedia were broken.

There is also the problem that a link may point to an accessible web page whose content has changed in the meantime and no longer matches what was originally referenced (the content drift problem). The domain of the hosting website may also have been purchased by third parties, for example for malicious purposes.
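A measurement of this kind can be sketched as follows. This is an illustrative outline, not the team's actual method: the function names are hypothetical, and a link is treated as broken when it cannot be fetched or answers with an HTTP error. Such a check cannot detect content drift, because a drifted page still answers successfully.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def is_broken(url: str, timeout: float = 10.0) -> bool:
    """Return True when the URL cannot be fetched or answers with an HTTP error."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return False
    except (urllib.error.URLError, ValueError, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError; ValueError covers
        # malformed URLs; OSError covers timeouts and connection failures.
        return True


def broken_link_rate(
    urls: Iterable[str], check: Callable[[str], bool] = is_broken
) -> float:
    """Percentage of distinct URLs for which `check` reports a broken link."""
    unique = set(urls)
    if not unique:
        return 0.0
    broken = sum(1 for url in unique if check(url))
    return 100.0 * broken / len(unique)
```

The `check` parameter makes it possible to swap in a different definition of "broken", or a stub when no network is available.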

To mitigate these problems, a project was launched in collaboration with Wikimedia Portugal to preserve the online references contained in Portuguese Wikipedia articles. The objective was to replace references to broken links in Wikipedia articles with references to web-archived content, thus keeping the referenced information accessible to Wikipedia users.

Archiving the pages referenced in Portuguese Wikipedia articles

Wikipedia recommends citing web archives (archiveurl/archive-url parameter).

The Portuguese Wikipedia contains about 1 million articles, and on average 140 pages are edited per day. About 14 million links were automatically extracted from the references in all Portuguese Wikipedia articles. Only 620 of these links referenced the web archive behind this project, while 744 553 (5.3%) referenced the Internet Archive. Note that Wikipedia’s guide to creating references recommends citing web archives (archiveurl/archive-url parameter).

On February 15, 2023, all pages referenced in Portuguese Wikipedia articles were collected, which resulted in a new collection named EAWP42: Collection of external links from wikipedia using the wikimedia dumps, which contains 12 million files (856 GB).

The main result of this project was the creation of a new automatic process for extracting and collecting external links cited on Portuguese Wikipedia pages. As a result, an annual crawl of Wikipedia citations will be performed.
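The sketch below shows one way links could be extracted from article source and counted per host, which is how the share of links pointing to a web archive can be estimated. The regular expression, the function names and the sample reference are assumptions made for illustration. The only outside fact relied upon is that Internet Archive snapshots are served from `web.archive.org`.

```python
import re
from collections import Counter
from typing import List
from urllib.parse import urlsplit

# Stop at whitespace and at characters that delimit wikitext templates.
URL_PATTERN = re.compile(r'https?://[^\s|\]<>}"]+')


def extract_links(wikitext: str) -> List[str]:
    """Return every http(s) URL found in the article source."""
    return URL_PATTERN.findall(wikitext)


def count_by_host(links: List[str]) -> Counter:
    """Count links per host name."""
    return Counter(urlsplit(link).hostname for link in links)


# Invented reference using the archive-url parameter.
SAMPLE = (
    "<ref>{{cite web |url=http://example.org/news "
    "|archive-url=https://web.archive.org/web/20190526000000/http://example.org/news "
    "|title=News}}</ref>"
)

links = extract_links(SAMPLE)
hosts = count_by_host(links)
archive_share = 100.0 * hosts["web.archive.org"] / len(links)
```

In this sample the archived URL embeds the original one, so the pattern deliberately consumes it as a single link instead of counting the embedded address twice.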

Attempt to automatically fix broken links in Wikipedia articles

InternetArchiveBot offers powerful operation and monitoring tools (e.g. Dashboard and Insights)

There are software robots that automatically add links to web-archived content when they find broken links in Wikipedia articles (e.g. Pywikibot, Wayback Medic or InternetArchiveBot).

An experiment was carried out to create an ArquivoPTBot based on the InternetArchiveBot, because it offers powerful operation and monitoring tools (e.g. Dashboard and Insights) and is maintained by the Internet Archive, the largest web archive in the world.

However, it was not possible to launch this service in production, because it would require changes to the software so that it could use another web archive as its source of web-archived information.

If you want to collaborate to resume this project, please contact us!

Preserving Wikipedia references is at your fingertips!

The team remains committed to helping preserve links cited in Wikipedia articles and offers the following services that may be useful to you.

CitationSaver: preserves citations to online content

CitationSaver allows you to submit the source code of a Wikipedia article; it will automatically extract the referenced links and archive the corresponding content.

SavePageNow: saves pages in the web archive

SavePageNow allows you to immediately archive a web page, for example one that is being referenced in a Wikipedia article, so that it doesn’t get lost.


Wikimedia Portugal, in collaboration with the web archive team, promoted a set of webinars to draw the community’s attention to the preservation of content published and cited on Wikipedia. The videos and slides of these webinars are available.

Meet the winners of the Award 2023!

Last updated on September 4th, 2023 at 02:40 pm

The winners of the Award 2023 were announced by the Público newspaper, the official communication partner of this edition, on 26 June.

40 applications were received.

The award ceremony took place during the closing session of the Encontro Ciência (Portuguese Science Summit), on July 7, at Aveiro University.

1st place – “Viajar no tempo sobre carris”

The winner of the 10 000 euro prize was the work “Viajar no tempo sobre carris” (Travelling through time on rails) developed by Antero Pires, Carlos Cipriano, Diogo Ferreira Nunes and Ruben Martins.

“Viajar no tempo sobre carris” is an online platform that analyses and presents the evolution of train travel times in Portugal, based on preserved timetables.

For example, it allows you to see the duration of the Lisbon-Oporto journey on the Alfa Pendular since the year 2000.

2nd place – “Representatividade das mulheres artistas na imprensa nacional”

The 2nd prize of 3 000 euros was awarded to the work “Representatividade das mulheres artistas na imprensa nacional” (Representativeness of women artists in the national press), by Cláudia Sevivas and Miguel Boavida.

This work resulted in the website Existo that provides information on Portuguese artists, referring to the web pages in which they were mentioned over time. The work is based on an analysis of their representation and visibility that allows several readings.

For example, we can find information about the artist Joana Vasconcelos, news about other artists in a certain year or even get a graphic visualization of women artists compared to men.

3rd place – “Memória Política”

The 3rd prize of 2 000 euros was awarded to the work “Memória Política” (Political Memory), developed by Miguel Lopes, Maria Carneiro and João Andrade.

“Memória Política” is a Web application that processes and presents information taken from the archived web pages of political parties represented in the Portuguese Parliament.

For example, it allows you to search for the term “democracy” and obtain pages from the websites of the Parties related to the search. The results can be grouped by Party and by year.

Honorable Mention granted by Público newspaper

The Público newspaper, official partner of the 6th edition of the Prize, awarded an Honourable Mention to the work “Fábrica do Jornal” (Newspaper factory), carried out by Miguel Almeida.

“Fábrica do Jornal” is a Web application that allows the user to generate a personalized newspaper from preserved news. The user can obtain a version that can be printed or saved in digital format.

Honorable Mention granted by AMCC – Aveiro Media Competence Center

The Aveiro Media Competence Center (AMCC) awarded its Honourable Mention to the work “Imaginarium”, carried out by Diogo Sousa.

“Imaginarium” is a web application that searches for images based on similarities with other images.

For example, after suggesting an image of a car, “Imaginarium” searches for preserved images that have some affinity with the suggested image.

Flash interview – AMCC Executive Director

Award ceremony

The awards ceremony took place at the closing session of the Science 2023 Meeting, at the University of Aveiro, on 7 July 2023.

The awards were presented by the Minister of Science, Technology and Higher Education, Elvira Fortunato, the President of the FCT Board of Directors, Madalena Alves, and the Executive Director of the Aveiro Media Competence Centre, João Moraes Palmeiro.

Flash interviews

Video of the ceremony

Dissemination materials


Presentations at IIPC GA/WAC and RESAW 2023

Last updated on August 1st, 2023 at 12:42 pm

Meeting with the Web Archive Community

The International Internet Preservation Consortium (IIPC), a consortium that brings together Web preservation initiatives from around the world, held its General Assembly with its members on May 10, 2023.

On the following days, May 11 and 12, the IIPC Web Archiving Conference (IIPC WAC) was held, an initiative open to the community, where people or entities not associated with the IIPC and interested in the Web preservation domain can participate.

The two events were jointly hosted by KB – National Library of the Netherlands, and by Beeld & Geluid – Netherlands Institute for Sound & Vision.

Contributions at the Web Archiving Conference

The team participated in the IIPC working group meetings (Training Working Group and Curators Working Group) and contributed presentations to the thematic sessions Collaborations & Outreach and Program infrastructure (sessions 7 and 17).

  • updates 2023 (slides)
  • Linking web archiving with arts and humanities: the collaboration between ROSSIO and (video, slides)
  • behind the curtains (slides)

Meeting with the RESAW research community

RESAW (Research Infrastructure for the Study of Archived Web Materials) is an initiative created in 2012 with the aim of promoting studies based on archived Web content, in areas such as Social Sciences, Digital Arts and Humanities.

The RESAW 2023 conference was held at the MUCEM Lab (Mediterranean Institute of Heritage Crafts) in Marseille on June 5-6, 2023, under the theme Exploring the Archived Web During a Highly Transformative Age.

Contributions at RESAW 2023

The team contributed presentations to the sessions Web Archive in Mediterranean area and its merge (4.A), From online Tools to Web Archive (6.B), Towards a participatory approach to collections (9.A) and Digging up the materials for writing web history (9.B).

  • How to research governmental web data? (abstract, slides)
  • Archiving Cryptocurrencies (abstract, slides)
  • Time to explore, time to learn from the archived web: training initiative (abstract, slides)
  • Exhibiting Web Memories from a call for community participation (abstract, slides)

CLEOPATRA Project Meeting

The CLEOPATRA Project, led by the L3S Research Center at the Gottfried Wilhelm Leibniz University of Hannover, has run a training programme for doctoral researchers (Early-Stage Researchers) since 2019. The team participated in three courses: Incentives design for hybrid multilingual information processing and analytics, in Southampton; National and transnational media coverage of European parliamentary elections, 2004-2014, in London; and NLP for under-resourced languages, in Zagreb, Croatia.

In 2022, the team welcomed two researchers to its facilities, who used the archived resources and received special support to develop their research.

The CLEOPATRA Project ended in 2023 with a meeting on 16 May in Hannover, which brought together professors, researchers and representatives of the institutions involved.

Daniel Gomes, the web archive’s manager, highlighted the new tools made available and the results of the work carried out by the researchers hosted there.

  • New research tools available (Slides)

Jornadas de Computação Científica 2023. Register now!

Last updated on July 26th, 2023 at 09:51 am

Jornadas de Computação Científica 2023 was held at the Naval School in Almada from 27 to 29 June 2023.

This event is a meeting for sharing knowledge among the entities that make up the national higher education and research community.

The event brings together decision-makers from the institutions, people in charge of technical computing services and people responsible for libraries and documentation services, among others.

The team presented two 90-minute sessions on June 28th, from 2:30 p.m. to 6 p.m., under the theme “services for managing citations and cybersecurity”, and presented the Memorial service in the Zapping session.

June 28, 2:30-4 p.m.: available services and system architecture

Session 2, 4:30-6 p.m.: a tool for managing citations and cybersecurity, Memorial


Virtual Museum of Tourism MUVITUR created a collection of preserved Websites

Last updated on April 25th, 2023 at 08:16 pm

MUVITUR – Virtual Museum of Tourism is a portal that aggregates digital content about Tourism in Portugal.

The platform is maintained by the Celestino Domingues Library of The Estoril Higher Institute for Tourism and Hotel Studies (ESHTE) and has the participation of institutions from various areas of heritage that are content providers.

The digitized contents that can be consulted in the catalog and accessed at the provider institutions included sound, image, photography and printed material, but websites were missing.

Thus, the idea for MUVITUR’s new “Web Pages” collection emerged.

Collaboration with MUVITUR

In 2019, a collaboration with MUVITUR began with the aim of identifying websites related to Tourism in Portugal and disseminating the history of content published on the Web since 1996.

In 2022, a list was established with about 400 records of websites of various entities related to tourism, hotels, travel agencies, pages of municipalities’ websites dedicated to tourism and others.

This database resulted in the first collection of preserved websites about Tourism in Portugal.

Collection of records in the MUVITUR catalog with preserved web pages

How the integration was done

MUVITUR uses Nyron software, which allows content from different sources to be aggregated using the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) interoperability protocol. This protocol is very common among libraries, archives and museums for providing content to portals such as Europeana. The web archive, however, does not make information available through OAI-PMH, so it was necessary to find alternative ways to create records in Nyron with descriptive information about preserved sites.
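For context, an OAI-PMH harvest is driven by plain HTTP requests in which a `verb` parameter selects the operation. The sketch below builds a `ListRecords` request that asks for Dublin Core (`oai_dc`) metadata. The base URL is a placeholder, and the helper name is hypothetical; it is not an endpoint of MUVITUR, Nyron or the web archive.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # Optional selective-harvesting bounds (dates as YYYY-MM-DD).
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


# Placeholder repository address, for illustration only.
url = list_records_url("https://repository.example.org/oai", from_date="2022-01-01")
```

A source that does not expose such an endpoint cannot be harvested this way, which is why the integration described here followed a different route.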

The procedure for integration was as follows:

  • The XML schema with the metadata fields supported by Nyron was exported to an Excel sheet.
  • The information was entered manually, respecting the format and syntax, in collaboration with the computer technicians.
  • The XML file with the inserted data was validated and imported into Nyron.
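The last step, checking an XML file before import, can be illustrated with the Python standard library. This is only a sketch: the element names (`record`, `denomination`, `organization`, `url`) are hypothetical and do not reproduce the Nyron schema, and the check only verifies that the file is well formed and that the required fields are filled in.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_record(fields: Dict[str, str]) -> ET.Element:
    """Create a <record> element with one child element per metadata field."""
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return record


def missing_fields(xml_text: str, required: List[str]) -> List[str]:
    """Return the required fields that are absent or empty.

    ET.fromstring raises ET.ParseError when the XML is not well formed.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for name in required:
        element = root.find(name)
        if element is None or not (element.text or "").strip():
            missing.append(name)
    return missing


# Hypothetical field names and values, for illustration only.
record = build_record(
    {
        "denomination": "Turismo do Algarve",
        "organization": "Example organization",
        "url": "http://www.example.org/",
    }
)
xml_text = ET.tostring(record, encoding="unicode")
problems = missing_fields(xml_text, ["denomination", "organization", "url"])
```

An empty `problems` list means the record carries every required field; a real import would additionally validate against the schema expected by the target software.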

Creating records in catalogs is largely a manual task and requires human curation. However, some of the information in the records of the Website collection could be obtained automatically. For example, the thumbnail was obtained using the API, more specifically the linkToScreenShot field, which is visible in the technical details of a preserved page (see Options).

Other elements, such as the site’s title, could also be obtained automatically through the API; however, the quality of this information depends on what the site’s producers inserted and may not be optimal. The dates that limit the temporal scope can also be obtained automatically, but the manual method was chosen to control the information presented.
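As an illustration of this kind of automatic processing, the sketch below reads a JSON response and collects the thumbnail link and title of each result. Only the `linkToScreenShot` field name comes from the text above; the `response_items` envelope, the `title` field and the sample values are assumptions about the response shape and should be checked against the API documentation.

```python
import json
from typing import Dict, List, Optional

# Invented response, for illustration only.
SAMPLE_RESPONSE = """
{
  "response_items": [
    {"title": " Turismo do Algarve ",
     "linkToScreenShot": "https://archive.example.org/screenshot/1"},
    {"title": null,
     "linkToScreenShot": "https://archive.example.org/screenshot/2"},
    {"title": "No thumbnail available"}
  ]
}
"""


def thumbnails(response_text: str) -> List[Dict[str, Optional[str]]]:
    """Collect title and thumbnail link for every item that has a screenshot."""
    payload = json.loads(response_text)
    results = []
    for item in payload.get("response_items", []):
        link = item.get("linkToScreenShot")
        if not link:
            continue
        # Titles come from the original site and may be missing or untidy.
        title = (item.get("title") or "").strip() or None
        results.append({"title": title, "thumbnail": link})
    return results


records = thumbnails(SAMPLE_RESPONSE)
```

The missing title in the second item shows why a curator may still prefer to enter titles by hand.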

In the continuation of the project, the collection will be increased with new records, as there are thousands of websites about the Tourism sector.

Description of Web contents in the MUVITUR catalog

In the collection “Paginas Web” the following data are used:

  • Denomination – usually the title of the website
  • Organization – the entity to which the publication belongs
  • Website address on the Internet
  • Address of the archived version
  • Moment(s) to remember
  • Link to the archived thumbnail
  • Descriptors
  • Geographical data (location, coordinates, geographical name)

The presentation of the information was adjusted to be aligned with that of other MUVITUR resources and contains links to the archived versions.

For example, in the record of the Turismo do Algarve site, we find a link to a moment to remember in 2011 and another link to the version history under “Consultar objecto”.

Detail of the record for the Turismo do Algarve site

Organizations can create collections of Websites from their area

With this unprecedented project we can say that preserved Web sites have gained citizenship in digital platforms dedicated to historical memory.

Websites are rarely included in catalogs or displayed in a museum context in Portugal.

The National Library of Australia, for example, included records of preserved Websites in the catalog. In the Library of Congress there are collections of old Websites alongside traditional resources.

With this project, MUVITUR has paved the way for other entities to create collections of websites of their interest on their platforms.

Other results of the collaboration

CitationSaver preserves citations to web resources

Last updated on April 20th, 2023 at 09:37 pm

Documents cite web content by referencing their URLs so that readers can later access them.

In the case of scientific articles, the importance of these citations is even greater to maintain the integrity of research works because they often reference essential information to enable the reproducibility of an experiment or analysis.

For example, links in a scientific article may cite the datasets, software or web news that supported the research, which are not included in the text of the article.

To respond to the need to preserve the integrity of documents, CitationSaver was launched.

CitationSaver automatically extracts the links cited in a document and preserves their content (e.g. web pages cited in a book) so that they can be retrieved later.


Use CitationSaver to preserve the integrity of your documents

Upload a document and CitationSaver will extract the cited URLs, archive their content and make it available shortly afterwards. There are 3 methods to upload a document:

  • insert the address (URL) of the PDF or TXT file, if it is published online
  • upload the file in PDF or TXT format
  • paste the text containing the addresses you want to preserve (e.g. References section of an article or Bibliography of a book).
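The extraction step can be approximated with a short sketch. It does not reproduce CitationSaver's implementation: the pattern, the function name and the rule of trimming trailing punctuation are assumptions chosen for illustration.

```python
import re
from typing import List

URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def extract_cited_urls(text: str) -> List[str]:
    """Return the distinct URLs cited in a text, in order of first appearance."""
    urls: List[str] = []
    for match in URL_PATTERN.findall(text):
        # Citations often end with punctuation that is not part of the URL.
        url = match.rstrip(".,;:)]")
        if url and url not in urls:
            urls.append(url)
    return urls


# Invented reference list, for illustration only.
references = (
    "[1] Dataset, https://example.org/data.csv. "
    "[2] Tool (https://example.org/tool), accessed 2023. "
    "[3] Same dataset again: https://example.org/data.csv"
)
cited = extract_cited_urls(references)
```

Removing duplicates before archiving avoids submitting the same address several times when a document cites it more than once.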

Project “Renascer” brings back old websites

Last updated on April 17th, 2023 at 06:32 pm

Organizations keep domains that once pointed to websites which are no longer used, either to prevent the domains from being bought by others or simply because they were forgotten.

The aim of project Renascer (Reborn) is to bring back historical websites whose content is no longer available online and whose domain continues to be held by their authors.

“Forgotten” domains can cause cybersecurity problems

In May 2023, the domain of the Harvard Medical School-Portugal project referenced just one default web page hosted on an active server and the domain continued to be owned by its author.

In this situation, the original content of the website was inaccessible despite the fact that the domain continued to be owned by the author of the website.

Furthermore, since the domain was still pointing to an active web server, cybersecurity issues could occur if this server was not being properly maintained.

The domain could be reborn by making it reference the preserved contents of this website.

How are websites Reborn?

The domain owner only has to redirect the domain to its archived version through the Memorial service.

For example, the domain once again references its original preserved contents, and the website was thus reborn.

Examples of Reborn domains

Project Renascer identified active domains managed by FCCN that were not referencing any content, and gave them a new life by making them reference their preserved historical contents.

Contact us to bring the historical websites of your organization back to life.

See the following examples of Reborn websites:




Free training on digital media – webinars

Last updated on June 2nd, 2023 at 05:35 am

The Aveiro Media Competence Center (AMCC) is a platform to support and promote the European Union (EU) Local News Media sector in the implementation of digital transition projects. The consortium includes the PCI Creative Science Park of Aveiro Region, the Associação Portuguesa de Imprensa and the University of Aveiro.

The web archive is a free public service that allows searching and accessing Web pages preserved since the 1990s, for example to view an old news article or an old version of a website.

The collaboration with the AMCC takes the form of a training program entitled Digital Skills for the Media, delivered in four webinars, and of the AMCC Honorable Mention, awarded to work on Portuguese century-old newspapers in the Award 2023.

Webinar cycle: digital skills for media

The webinar cycle aims to equip trainees with digital skills that enable them to solve problems caused by the disappearance of digital information and gain competitive advantage in the production of unique and exclusive content.

  • Webinar 1: A tool for quickly searching the past
    • Date: March 24, 2023, Time: 14h00-15h30 (in Portuguese)
  • Webinar 2: Publishing well for preserving well
    • Date: April 6, 2023, Time: 14h00-15h30 (in Portuguese)
  • Webinar 3: Automated access and processing of preserved Web information through APIs
    • Date: May 4, 2023, Time: 14h00-15h30 (in Portuguese)
    • Slides
    • Video
  • Webinar 4: Web archiving: do-it-yourself!
    • Date: June 1, 2023, Time: 14h00-15h30 (in Portuguese)

Prepare a work for the Award 2023!


Last updated on January 26th, 2023 at 12:21 pm

Until May 4th, you are challenged to create a work based on historical information preserved from the Web.

In this 6th edition of the Award, 15 000 euros will be granted to the three best works (1st place: 10 000 euros).

Works about any subject may be submitted, whether done individually or in a group. The only condition is that the web archive is the main source of information.

The Público newspaper will grant an Honorable Mention for works based on the web-archived content of Público online.

The Aveiro Media Competence Center (AMCC) will also grant an Honorable Mention to one of the submitted works that focuses on the archives of the online version of century-old newspapers.

The Award promotes the visibility of the applicants and their institutions.

Help us spread the word about the Award 2023 among potential candidates!