Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections
The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned.
The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.
The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.
In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified, and then, automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.
These translations were reviewed in collaboration with the Publications Office of the European Union. Besides that, in parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.
In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content.
The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).
CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy
The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network aimed to generate ways to better understand the massive digital coverage of major events in Europe over the past decades.
The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.
In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students.
Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022.
The idea was to develop part of his research about syntactic structures of EU languages using the textual resources preserved by the Arquivo.pt and exchange knowledge with the web-archiving experts on the strategies to extract and process historical web-data.
Diego Alves defended his Ph.D thesis entitled Computational typological analysis of syntactic structures in European languages in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).
Generating textual datasets for Natural Language Processing
Diego Alves’ work originated cross-lingual datasets about the 2019 European Parliamentary Elections precious for research.
This work will be detailed in chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.
A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:
- Extract text: The textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, to separate the texts written in different languages across distinct files;
- Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g.: repeated instances, empty lines, etc.);
- Double-check of language identification: the language of each cleaned extracted text was verified again to eliminate possible errors originated during the previous steps.
Two new research datasets are openly available!
The result was a dataset of cleaned and language-verified texts publicly available. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:
The aforementioned corpus was automatically annotated regarding part-of-speech and dependency relations to generate a corpus with syntactic information which is useful for linguistic studies.
The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied.
The texts in these annotated corpora followed the same order of the respective raw-texts files. Each sentence is annotated following the Universal Dependencies framework in the CoNNL-U format, which is the reference in terms of syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections
- Secondments@Arquivo.pt and new research tools available and Robustness of Corpus based Typological Strategies for Dependency Parsing”, presentation at CLEOPATRA final event, 2023
- Dataset of cleaned and language-verified texts about the 2019 European Elections (Raw texts)
- Dataset of syntactically annotated texts about the 2019 European Elections (CoNLL-U texts)
- Computational typological analysis of syntactic structures in European languages, Diego Alves PhD thesis, 2023
- Diego Alves personal page
- Arquivo.pt APIs
- “Robustness of Corpus-based Typological Strategies for Dependency Parsing”, Diego Alves and Daniel Gomes, “Event Analytics across Languages and Communities” book, Springer (to appear).