Cross-lingual research datasets on 2019 European Parliamentary Elections

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

Online documents in several languages about the 2019 European Parliamentary Elections were preserved

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned. 

To preserve the cross-lingual online content that documents this event, the team combined human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. In parallel, a collaborative list was launched to gather relevant seeds contributed by the international community.

In the second step, the team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content. 

The resulting web data was aggregated into one special collection, identified as EAWP23, which became searchable and accessible in July 2020.

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy

The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to find ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students. 

Associated partners contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment in Lisbon from June to August 2022.

The idea was to develop part of his research on the syntactic structures of EU languages using the textual resources preserved by the archive, and to exchange knowledge with the web-archiving experts on strategies to extract and process historical web data.

Diego Alves defended his Ph.D. thesis, entitled “Computational typological analysis of syntactic structures in European languages”, in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

  1. Extract text: the textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, so that texts written in different languages could be separated into distinct files;
  2. Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g. repeated instances and empty lines);
  3. Double-check language identification: the language of each cleaned text was verified again to eliminate possible errors introduced during the previous steps.
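
The cleaning and language-separation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the project: the function names are hypothetical, "repeated instances" is assumed to mean repeated lines, and the language detector is passed in as a plain function (for example, the `detect` function of langdetect) so that the sketch itself needs only the standard library.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def clean_text(raw: str) -> str:
    """Drop empty lines and repeated lines, keeping the first occurrence."""
    seen = set()
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def group_by_language(
    texts: Iterable[str], detect_language: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Clean each text, then file it under the language detected afterwards."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in texts:
        cleaned = clean_text(raw)
        if not cleaned:
            continue
        # The language is checked on the cleaned text (the double-check step).
        groups[detect_language(cleaned)].append(cleaned)
    return dict(groups)


def toy_detector(text: str) -> str:
    """Stand-in detector for the demo only; a real run would use a library."""
    return "pt" if "eleições" in text.lower() else "en"


if __name__ == "__main__":
    pages = [
        "Eleições europeias\n\nEleições europeias\nResultados",
        "European elections\n\n\nResults",
    ]
    for lang, docs in group_by_language(pages, toy_detector).items():
        print(lang, docs)
```

Passing the detector as an argument keeps the cleaning logic testable without any third-party dependency.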

Two new research datasets are openly available!

The result is a publicly available dataset of cleaned and language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens in each corpus extracted from the preserved collection 2019 European Union Elections (EAWP23).

This corpus was automatically annotated with part-of-speech tags and dependency relations to generate a corpus with syntactic information, which is useful for linguistic studies.

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied. 

The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
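
As an illustration of how such annotated files can be read, here is a minimal sketch of a CoNLL-U reader. It relies only on the ten tab-separated columns defined by the format; the function name and the sample sentence are made up for this example and are not taken from the dataset.

```python
from typing import Dict, List

# The ten columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Return a list of sentences, each a list of token dictionaries."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            continue  # skip malformed lines in this sketch
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample sentence, for illustration only.
SAMPLE = (
    "# text = Europe votes.\n"
    "1\tEurope\tEurope\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tvotes\tvote\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token["form"], token["upos"], token["head"], token["deprel"])
```

Each token's `head` column holds the `id` of its syntactic governor (0 for the root), which is what makes the files usable for studies of dependency structure.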

The syntactically annotated texts about the 2019 European Elections are publicly available!

Preserving Wikipedia citations


Last updated on September 19th, 2023 at 01:49 pm

Wikipedia is an educational resource degraded by broken links

The archive preserves information published online so that it can be used later for research and educational purposes. For example, it preserved online information about European projects funded by H2020.

Wikipedia articles refer to external pages with important complementary information that has since become unavailable.

Wikipedia articles are among the most used online resources for educational purposes. However, they sometimes reference external web pages with important complementary information that has disappeared from the original websites. This problem degrades the quality of Wikipedia as a credible and verifiable source of information.

In August 2023, the team carried out an experiment to measure the percentage of external links (links pointing outside Wikipedia) that were broken in Portuguese Wikipedia articles. The results showed that 25% of the external links referenced in the Portuguese Wikipedia were broken.
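
The measurement can be sketched in a few lines. The criteria used in the experiment are not described here, so this sketch assumes that a link counts as broken when the server gives no response or answers with an HTTP status of 400 or above; the function names and sample URLs are hypothetical.

```python
from typing import Dict, Optional


def is_broken(status: Optional[int]) -> bool:
    """Assumed rule: no response (None) or an HTTP error status (>= 400)."""
    return status is None or status >= 400


def broken_percentage(statuses: Dict[str, Optional[int]]) -> float:
    """Percentage of links whose recorded status marks them as broken."""
    if not statuses:
        return 0.0
    broken = sum(1 for status in statuses.values() if is_broken(status))
    return 100.0 * broken / len(statuses)


# Hypothetical check results: URL -> HTTP status, or None if unreachable.
SAMPLE_STATUSES = {
    "http://example.org/a": 200,
    "http://example.org/b": 404,
    "http://example.org/c": 200,
    "http://example.org/d": 301,
}

if __name__ == "__main__":
    print(f"{broken_percentage(SAMPLE_STATUSES):.1f}% of the links are broken")
```

A status-code check like this does not detect pages that still respond but whose content has changed.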

A link may also point to a web page that is still accessible but whose content has changed since it was cited, so that it no longer shows what the reference was meant to support (the Content Drift problem). The domain of the hosting website may even have been purchased by third parties in the meantime, for example for malicious purposes.

To mitigate these problems, a project was launched in collaboration with Wikimedia Portugal to preserve the online references contained in Portuguese Wikipedia articles. The objective was to replace broken links in Wikipedia articles with references to web-archived content, thus keeping the referenced information accessible to Wikipedia users.

Archiving the pages referenced in Portuguese Wikipedia articles

Wikipedia recommends citing web archives (archiveurl/archive-url parameter).

The Portuguese Wikipedia contains about 1 million articles, and on average 140 pages are edited per day. The team automatically extracted 14 million links from references in all Portuguese Wikipedia articles. Only 620 of them referenced the archive, while 744 553 referenced the Internet Archive (5.3%). Note that Wikipedia’s guide to creating references recommends publishing citations to web archives (archiveurl/archive-url parameter).

On February 15, 2023, the team collected all pages referenced in Portuguese Wikipedia articles, which resulted in a new collection named “EAWP42: Collection of external links from wikipedia using the wikimedia dumps”, containing 12 million files (856 GB).

The main result of this project was the creation of a new automatic process for extracting and collecting external links cited on Portuguese Wikipedia pages. As a result, Wikipedia citations will be crawled annually.
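
To give an idea of what extracting cited links involves, here is a minimal sketch that pulls URLs out of a wikitext snippet and computes the share that already points to a web archive. It is not the process used in the project: the regular expression is a simplification, the function names and the sample wikitext are hypothetical, and the set of archive hosts is an assumption to adapt as needed.

```python
import re
from typing import Iterable, List
from urllib.parse import urlparse

# Simplified pattern: a URL ends at whitespace or at wikitext delimiters.
URL_PATTERN = re.compile(r'https?://[^\s|\]\[}<>"]+')


def extract_links(wikitext: str) -> List[str]:
    """Return every http(s) URL found in the wikitext, in order."""
    return URL_PATTERN.findall(wikitext)


def archived_share(links: Iterable[str], archive_hosts: Iterable[str]) -> float:
    """Percentage of links whose host is one of the given archive hosts."""
    links = list(links)
    hosts = set(archive_hosts)
    if not links:
        return 0.0
    hits = sum(1 for link in links if urlparse(link).hostname in hosts)
    return 100.0 * hits / len(links)


# Hypothetical wikitext with one archive-url parameter and one bare link.
SAMPLE_WIKITEXT = (
    "Text.<ref>{{cite web |url=http://example.org/news "
    "|archive-url=https://web.archive.org/web/2019/http://example.org/news "
    "|title=News}}</ref> More.<ref>[https://example.com/report Report]</ref>"
)

if __name__ == "__main__":
    found = extract_links(SAMPLE_WIKITEXT)
    print(found)
    print(f"{archived_share(found, ['web.archive.org']):.1f}% point to an archive")
```

At full scale, the same tally reproduces the figure above: 744 553 out of 14 million links is about 5.3%.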

Attempt to automatically fix broken links in Wikipedia articles

InternetArchiveBot offers powerful operation and monitoring tools (e.g. Dashboard and Insights)

There are software robots that automatically add links to web-archived content when they find broken links in Wikipedia articles (e.g. Pywikibot, Wayback Medic or InternetArchiveBot).

An experiment was carried out to create an ArquivoPTBot based on the InternetArchiveBot, because it offers powerful operation and monitoring tools (e.g. Dashboard and Insights) and is maintained by the Internet Archive, the largest web archive in the world.

However, it was not possible to launch this service in production because it would require changes to the software so that it could use a different web archive as its source of web-archived information.

If you want to collaborate to resume this project, please contact us!

CitationSaver: preserves citations to online content

CitationSaver allows you to submit the source code of a Wikipedia article; it automatically extracts the referenced links and archives the corresponding content.

SavePageNow: saves pages

SavePageNow allows you to immediately archive a web page, for example one that is being referenced in a Wikipedia article, so that it doesn’t get lost.


Wikimedia Portugal, in collaboration with the archive team, promoted a set of webinars to draw the community’s attention to the preservation of content published and cited on Wikipedia. The videos and slides of these webinars are available online.

CDXJ index files are available to support bulk access

All CDXJ index files are now being made publicly available in real time to facilitate the bulk download of web-archived data.
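
For readers unfamiliar with the format, here is a minimal sketch of reading CDXJ lines. It assumes the commonly used layout of one record per line with a sortable URL key, a 14-digit timestamp and a JSON object; the JSON field names in the sample (`url`, `status`, `filename` and so on) are assumptions and may differ in the published files.

```python
import json
from typing import Dict, Iterable, Iterator, NamedTuple


class CdxjRecord(NamedTuple):
    urlkey: str       # sortable (SURT-style) form of the URL
    timestamp: str    # capture time, assumed to be YYYYMMDDhhmmss
    fields: Dict[str, str]  # remaining metadata, parsed from JSON


def parse_cdxj(lines: Iterable[str]) -> Iterator[CdxjRecord]:
    """Yield one record per non-empty CDXJ line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only twice: the JSON payload may itself contain spaces.
        urlkey, timestamp, payload = line.split(" ", 2)
        yield CdxjRecord(urlkey, timestamp, json.loads(payload))


# Hypothetical index line, for illustration only.
SAMPLE_LINE = (
    'pt,example)/ 20190605120000 {"url": "http://example.pt/", '
    '"mime": "text/html", "status": "200", "digest": "ABC", '
    '"length": "1234", "offset": "5678", "filename": "sample.warc.gz"}'
)

if __name__ == "__main__":
    for record in parse_cdxj([SAMPLE_LINE]):
        print(record.timestamp, record.fields["url"], record.fields["status"])
```

In this layout, the `filename`, `offset` and `length` values would tell a client which byte range of which archive file holds the capture, which is what makes bulk access possible without crawling the replay interface.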

Your feedback with comments or suggestions is most welcome to improve this service!

The team went to the biggest international Web archiving conference.

The conference, organized by the International Internet Preservation Consortium (IIPC), occurred from the 5th to the 7th of June 2019.


Interview with Daniel Gomes

The team contributed 5 additional presentations during the conference.

The slides and videos are available in the following links:

Search 17 million images from the past!


At the end of 2018, an experimental image search service was launched, with which it was possible to search around 4 million images from the past, drawn from some of the archive’s collections.

Since April 2019, it has been possible to search for images from all the collections.


You can now search over 17 million unique images (over 50 pixels in width and height) published since 1996.

Find pages from the past through the new image search service.

Try the “Visit Page” option to find the Web page from the past that contained the image you selected.


Try the image search service now!


Visit by students from Coimbra

On April 11, the headquarters at LNEC welcomed 30 students enrolled in the Information Sciences course at the Faculty of Arts and Humanities of the University of Coimbra.
During the visit, the students took part in a technical tour and a training session on digital preservation and on the services offered.
Bring your institution. Contact us!


Visit to the data centre to see the supporting infrastructure


Guided tour of the facilities, located on the LNEC campus


Session in the small auditorium at LNEC




Daniel Gomes, manager of the archive, accompanies the visit to the data centre to see the supporting infrastructure



Ask for an exhibition of posters about historical web pages

If you wish to host an exhibition in a shared space (e.g. a library, a university hallway or a common room in a congress centre), please contact us!

Examples of past exhibitions

These exhibitions presented iconic web pages from Portugal’s recent history.

Universidade do Minho – Gualtar Campus – Braga

  • Where: Espaço B-Lounge of Universidade do Minho Library
  • When: Until April 18. Monday to Friday, from 8h30 to midnight; Saturdays, from 8h30 to 14h30.

Universidade do Minho – Azurém Campus – Guimarães

  • Where: Espaço B-Lounge of Universidade do Minho Library
  • When: From April 23 to May 10. Monday to Friday, from 8h30 to midnight; Saturdays, from 8h30 to 14h30.

ISCTE – Lisboa

  • Where: ISCTE-IUL Library.
  • When: From April 16 until May 11. From 8h to 21h.

Escola Superior de Hotelaria do Estoril

  • Where: Celestino Gomes Library
  • When: Until May 15. Monday to Friday, from 9h30 to 12h00 and from 14h30 to 17h30. From May 6 to May 18: Monday to Friday, from 9h30 to 20h30; Saturdays, from 9h30 to 13h.

NOVA – IMS – Campolide Campus

  • Where: Nova IMS Library and Main Corridor.
  • When: Until April 18. Monday to Friday, from 10h to 18h. From April 22 to 30, from 9h to 20h.


Universidade do Minho Library, Azurém Campus

ISCTE Library, Lisbon

B-Lounge, Universidade do Minho Library

Library of the Faculdade de Ciência e Tecnologia, Universidade Nova de Lisboa

Library of the Escola Superior de Ciências Empresariais (ESCE), Instituto Politécnico de Setúbal

Rectory Library, Universidade do Algarve

Faculdade de Letras da Universidade de Lisboa

Faculdade de Psicologia da Universidade de Lisboa and Instituto da Educação

Library of the Faculdade de Ciências, Universidade de Lisboa

News about our poster exhibitions

Exhibition in FCT – Caparica

Last updated on March 29th, 2019 at 01:37 pm

Next Thursday, 28th March, is the last day to visit the exhibition at the FCT Library, in Caparica.

The exhibition is open to the public all day long, from 9 am to 8 pm, and shows six iconic pages from Portugal’s recent history, such as the results of the 1996 Presidential Elections in Portugal, the Expo’98 page, and the first home page of the RTP television network, published in 1998.

FCT – Library is located at Caparica Campus. For further information call 212 94 78 29.

This exhibition is itinerant, so if you are interested in bringing it to your city or institution, please contact us.



Last updated on August 6th, 2024 at 05:19 pm

The Award 2019 was officially launched during an event at the Museum of the Presidency of the Republic, in Lisbon, on March 13th.

The President of the Portuguese Republic, Marcelo Rebelo de Sousa, the Minister of Science, Technology and Higher Education (MCTES), Manuel Heitor, and the vice-president of FCT, Helena Pereira, attended the ceremony.

In his speech, Marcelo Rebelo de Sousa highlighted the importance of initiatives like the Award, as well as the services offered by the portal.

Submit your work to Award 2019. The 3 best works will receive a total of 15,000 euros in prizes.

Speech by the Minister of Science, Technology and Higher Education, Manuel Heitor

Presentation of the portal and the Award by Daniel Gomes, FCT

Speech by His Excellency the President of the Portuguese Republic

President of the Portuguese Republic launches Award 2019

Last updated on August 6th, 2024 at 05:19 pm

The team is honoured to announce that the launch event of the Award 2019 will take place at the Museum of the Presidency of the Republic, in Lisbon, on March 13th, at 3:30 pm.

The President of the Portuguese Republic, Marcelo Rebelo de Sousa, the Minister of Science, Technology and Higher Education (MCTES), Manuel Heitor, and the FCT vice-president, Helena Pereira, will attend the ceremony.

The event is open to all interested parties, subject to the capacity of the venue.

See the complete schedule below:

  • 3:30 pm – Reception
  • 4 pm – Opening (MPR)
  • 4:05 pm – Speech by the Minister of Science, Technology and Higher Education, Manuel Heitor
  • 4:15 pm – Presentation of the portal and the Award by Daniel Gomes, FCT
  • 4:30 pm – Address by His Excellency the President of the Republic and inauguration of “Time travel with the Presidents of the Republic” exhibit.
  • 4:45 pm – Cocktail and guided tours of the Museum and the exhibition
  • 5:30 pm – Closing

For more information, call 213 614 660.

Submit your work to Award 2019. The 3 best works will receive a total of 15,000 euros in prizes.