Cross-lingual research datasets on 2019 European Parliamentary Elections

September 18, 2023 by admin

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

Last updated on March 17th, 2024 at 03:04 pm

Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned.

The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. In parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.

In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content.

The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy

The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to generate ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students.

Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022.

The idea was to develop part of his research about syntactic structures of EU languages using the textual resources preserved by Arquivo.pt and to exchange knowledge with the web-archiving experts on strategies to extract and process historical web data.

Diego Alves defended his Ph.D. thesis, entitled “Computational typological analysis of syntactic structures in European languages”, in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

Extract text: the textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, so that texts written in different languages could be separated into distinct files;
Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g. repeated instances and empty lines);
Double-check of language identification: the language of each cleaned extracted text was verified again to eliminate possible errors introduced during the previous steps.
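The cleaning and language-separation steps above can be sketched as follows. This is a minimal illustration and not the script used in the project: `clean_text`, `group_by_language` and `toy_detect` are hypothetical names, and `toy_detect` is only a stand-in for a real detector such as the `detect` function of the langdetect library.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def clean_text(text: str) -> str:
    """Drop empty lines and repeated lines, keeping first occurrences in order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def group_by_language(
    texts: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Clean each text, then group the non-empty results by detected language."""
    corpora = defaultdict(list)
    for text in texts:
        cleaned = clean_text(text)
        if cleaned:
            # Detection runs on the cleaned text, mirroring the double-check step.
            corpora[detect(cleaned)].append(cleaned)
    return dict(corpora)


def toy_detect(text: str) -> str:
    """Hypothetical stand-in for a real language detector (assumption)."""
    return "pt" if "eleições" in text.lower() else "en"


if __name__ == "__main__":
    pages = ["Eleições europeias\n\nEleições europeias", "European elections"]
    for lang, docs in group_by_language(pages, toy_detect).items():
        print(lang, len(docs))
```

Passing the detector as a parameter keeps the sketch runnable without third-party packages; in practice a real language-identification function would be supplied instead of `toy_detect`.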

Two new research datasets are openly available!

The result is a publicly available dataset of cleaned and language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens of each corpus extracted from the collection 2019 European Union Elections preserved by Arquivo.pt (EAWP23).

The aforementioned corpus was automatically annotated regarding part-of-speech and dependency relations to generate a corpus with syntactic information which is useful for linguistic studies.

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied.

The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference in terms of syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
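To give an idea of how such files can be read, here is a minimal sketch of a CoNLL-U reader using only the Python standard library. The ten tab-separated columns are those defined by the CoNLL-U format; the function name `parse_conllu` and the sample sentence are illustrative assumptions, not taken from the dataset.

```python
# The ten tab-separated columns of a CoNLL-U token line.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            continue  # skip malformed lines in this sketch
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample, not copied from the published corpus.
SAMPLE = (
    "# text = Europe votes\n"
    "1\tEurope\tEurope\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tvotes\tvote\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        print([(tok["form"], tok["upos"], tok["deprel"]) for tok in sentence])
```

For real analyses, a dedicated CoNLL-U library may be preferable, since it also handles multiword tokens and feature parsing.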

The syntactically annotated texts about the 2019 European Elections are publicly available!

Know more

Secondments@Arquivo.pt and new research tools available and “Robustness of Corpus-based Typological Strategies for Dependency Parsing”, presentation at CLEOPATRA final event, 2023
Dataset of cleaned and language-verified texts about the 2019 European Elections (Raw texts)
Dataset of syntactically annotated texts about the 2019 European Elections (CoNLL-U texts)
Python script to extract language-specific texts from Arquivo.pt through a list of keywords
Computational typological analysis of syntactic structures in European languages, Diego Alves PhD thesis, 2023
Diego Alves personal page
Arquivo.pt APIs
“Robustness of Corpus-based Typological Strategies for Dependency Parsing”, Diego Alves and Daniel Gomes, “Event Analytics across Languages and Communities” book, Springer (to appear).

CDXJ index files are available to support bulk access

April 18, 2023 by admin

A group of researchers looking at a server rack

Last updated on May 5th, 2023 at 01:39 pm

The research and education community has been requesting support for the bulk download of web-archived data and index files (CDXJ), for instance, to feed AI training models, optimize routing of web archive requests or recover information from selected websites (e.g. news).

Arquivo.pt has begun making all its CDXJ index files publicly available in real time to facilitate the bulk download of web-archived data. Learn how at:

https://arquivo.pt/api#bulk
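As an illustration of what can be done with an index file once downloaded, the sketch below parses one CDXJ line. It assumes the common CDXJ layout (a sortable URL key, a 14-digit timestamp and a JSON block, separated by spaces); the sample line and its JSON field names are invented for the example and should be checked against the documentation linked above.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into its URL key, timestamp and JSON fields."""
    # Split only on the first two spaces: the JSON block may contain spaces.
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = {"urlkey": urlkey, "timestamp": timestamp}
    record.update(json.loads(payload))
    return record


# Invented sample line; field names are assumptions for illustration.
SAMPLE_LINE = (
    'pt,arquivo)/ 20200701120000 '
    '{"url": "https://arquivo.pt/", "mime": "text/html", "status": "200", '
    '"offset": "1234", "length": "5678", "filename": "example.warc.gz"}'
)

if __name__ == "__main__":
    record = parse_cdxj_line(SAMPLE_LINE)
    print(record["timestamp"], record["url"], record["filename"])
```

With the file name, offset and length of a record, a client could then request just that slice of the corresponding archive file, which is what makes index files useful for bulk access.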

Your feedback with comments or suggestions is most welcome to improve this service!

Please disseminate this information among potentially interested parties.

Tutorial: how to explore Arquivo.pt using Python

July 29, 2022 by Daniel Gomes

Last updated on July 17th, 2023 at 01:44 pm

The Programming Historian aims to develop digital skills among Humanities researchers through the publication of practical lessons in several languages.

The call “Computational analysis skills for large-scale humanities data” resulted in 7 new lessons.

One of them was the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt” developed by Daniel Gomes and Ricardo Campos.

It shows how to explore the Arquivo.pt user interface and the Application Programming Interface (API) to execute advanced queries, process large amounts of data or build new services, such as Tell me stories.
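As a rough idea of what querying the API from Python looks like, here is a minimal sketch based on the Arquivo.pt full-text search endpoint. The helper names `build_search_url` and `fetch_items` are hypothetical, and the parameter names (`q`, `from`, `to`, `maxItems`) and the `response_items` key are assumptions to be checked against the current API documentation; the tutorial itself remains the reference.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, start=None, end=None, max_items=10):
    """Build a full-text search URL; start/end are YYYYMMDDHHMMSS strings."""
    params = {"q": query}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    params["maxItems"] = max_items
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def fetch_items(url):
    """Download the JSON response and return its list of result items."""
    with urlopen(url, timeout=30) as response:
        data = json.load(response)
    # "response_items" is assumed to hold the results.
    return data.get("response_items", [])


if __name__ == "__main__":
    url = build_search_url(
        "european elections",
        start="20190101000000",
        end="20191231235959",
        max_items=5,
    )
    print(url)
    # Uncomment to perform the network request:
    # for item in fetch_items(url):
    #     print(item.get("title"), item.get("linkToArchive"))
```

Separating URL construction from the network call makes the query logic easy to inspect and test offline.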

All the developed resources are freely available in open-access.

Open-access resources of the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt”

Colab project that enables editing and running directly the code examples of the tutorial (English, Portuguese)
Official tutorial page on Programming Historian
Video presented on May 5, 2022, as part of the Programming Historian webinars and tutorials “Developing computational skills for digital collections”
- Slides

H2020 projects preserved by Arquivo.pt

November 29, 2021 by Daniel Gomes

Last updated on June 16th, 2023 at 01:40 pm

The main objective of Arquivo.pt is to preserve online information for research and education purposes.

Previously, Arquivo.pt identified and preserved Research & Development project websites funded by the European Union during the FP4, FP5, FP6 and FP7 programmes (1994-2013).

Now, Arquivo.pt has contributed to preserving online information that documents R&D projects funded by the Horizon 2020 programme (2014-2021). It preserved 197 million web files (17 TB) related to science for future access.

H2020 projects publish valuable information online but it is being lost

Websites about Research and Development (R&D) projects are increasingly being used to publish and disseminate important scientific information that complements published literature (e.g. data sets, documentation or software).

However, after projects end, the corresponding websites usually disappear, causing a permanent loss of unique and valuable scientific information.

Arquivo.pt automatically identified URLs that document H2020 Research and Development projects

The European Union’s Open Data Portal published a data set from the Community Research and Development Information Service (CORDIS) that documents H2020 research projects. However, of the 31 129 projects listed, only 46% presented a project URL.

Arquivo.pt developed a low-cost methodology that automatically identifies URLs related to R&D projects to be systematically preserved. This automatic identification is achieved through the combination of open data sets with web search services. This methodology is detailed in a scientific article published at the International Conference on Digital Preservation 2016.

In sum, we extracted 106 300 unique URLs from the following open data sets:

Then, we extracted the acronym and title of the projects from the data sets and automatically searched the web for additional URLs using the Bing Search API.
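The idea of this step can be sketched as below. This is not the software released by Arquivo.pt: the function `candidate_urls` is hypothetical, the field names `acronym`, `title` and `projectUrl` are assumed to match the data set columns, and the `search` argument stands for any function that returns a ranked list of URLs for a query (the actual web search call is deliberately left out).

```python
def candidate_urls(projects, search, top_n=10):
    """For projects without a URL, search the web using acronym and title."""
    results = {}
    for project in projects:
        if project.get("projectUrl"):
            continue  # the data set already provides a URL
        query = f'{project["acronym"]} {project["title"]}'
        # Keep only the top-ranked results returned by the search service.
        results[project["acronym"]] = search(query)[:top_n]
    return results


def fake_search(query):
    """Stand-in for a real web search client (assumption, for testing only)."""
    return [f"http://example.org/{i}?q={query.split()[0]}" for i in range(20)]


if __name__ == "__main__":
    projects = [
        {"acronym": "EXTMOS", "title": "Extended Model of Organic Semiconductors",
         "projectUrl": ""},
        {"acronym": "DEMO", "title": "Project with a known site",
         "projectUrl": "http://example.org/demo"},
    ]
    found = candidate_urls(projects, fake_search)
    print({acronym: len(urls) for acronym, urls in found.items()})
```

Injecting the search function keeps the selection logic independent of any particular search service.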

All the data sets and tools developed have been made publicly available in open access so that they can be reused and collaboratively enhanced. In particular, you can access the software developed to automatically identify additional URLs about H2020 projects.

197 million web files related to science were preserved

Arquivo.pt identified and preserved 197 million web files (17 TB) that document R&D projects funded by Horizon 2020.

In 2021, we can already witness project websites that are no longer available online, such as the Extended Model of Organic Semiconductors (EXTMOS) project (http://extmos.eu/). However, it was preserved and can be accessed at Arquivo.pt:

Archived version at Arquivo.pt (https://arquivo.pt/wayback/20170427182603/http://extmos.eu/) of the home page of the EXTMOS Research and Development project (http://extmos.eu/) funded by H2020.

Contributions to complement the European Open Data Sets

All the resulting data sets were made publicly available so that they can be improved and reused by other organizations also interested in preserving this digital heritage:

Cordis-h2020projectsComplementedByArquivoPT.xlsx: contains 2 additional columns in comparison to the original data set:
- URLsBingSearch (column V): top 10 search results returned by Bing API when column projectUrl (column K) in the original data set was empty (e.g. http://extmos.eu/)
- ArchivedProjectURLs (column W): direct link to access the preserved version of the projectUrls and URLsBingSearch in Arquivo.pt (e.g. https://arquivo.pt/wayback/http://extmos.eu)
Cordis-h2020organizationsComplementedByArquivoPT.xlsx: 1 additional column:
- archivedOrganizationUrl (column Y): direct link to access the preserved version of the organizationUrl (column O) in Arquivo.pt (e.g. https://arquivo.pt/wayback/www.it.pt)
Cordis-h2020projectDeliverablesComplementedByArquivoPT.xlsx: 1 additional column:
- archivedUrl (column K): direct link to access the preserved version of the url (column I) in Arquivo.pt (e.g. https://arquivo.pt/wayback/https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=080166e5c0231ec0&appId=PPGMS)
Cordis-h2020reportsComplementedByArquivoPT.xlsx: 1 additional column:
- archivedUrl (column P): direct link to access the preserved version of the url (column O) in Arquivo.pt (e.g. https://arquivo.pt/wayback/http://crome.ces.uc.pt)
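The archived-link columns listed above follow a simple pattern: the Arquivo.pt replay prefix followed by the original URL, optionally with a 14-digit timestamp in between. A minimal sketch of how such a column could be derived (the helper names `archived_link` and `add_archived_column` are hypothetical):

```python
WAYBACK_PREFIX = "https://arquivo.pt/wayback/"


def archived_link(url, timestamp=None):
    """Build a replay link, optionally pinned to a YYYYMMDDHHMMSS timestamp."""
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{url}"
    return f"{WAYBACK_PREFIX}{url}"


def add_archived_column(rows, source, target):
    """Return copies of the rows with a target column derived from source."""
    enriched = []
    for row in rows:
        new_row = dict(row)
        url = row.get(source, "")
        # Leave the new column empty when there is no original URL.
        new_row[target] = archived_link(url) if url else ""
        enriched.append(new_row)
    return enriched


if __name__ == "__main__":
    rows = [{"organizationUrl": "www.it.pt"}, {"organizationUrl": ""}]
    for row in add_archived_column(rows, "organizationUrl", "archivedOrganizationUrl"):
        print(row)
```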

To learn more about this collection, you can watch the video “Preservation of web content related to Horizon 2020”.

References

Previous work to preserve online documentation about FP4-FP7 projects
Software and datasets to automatically identify online information about H2020 projects
Technical Report about collection H2020 EU projects (in Portuguese)
Video Preservation of web content related to Horizon 2020, Pedro Gomes, October 2021
Presentation Preservation of web content related to Horizon 2020, Pedro Gomes, June 2023

Are you a researcher?

Suggest a website to be preserved

Create automatic narratives about any topic!

October 25, 2021 by Ricardo Basílio

Arquivo.pt provides a new function that allows you to automatically create temporal narratives on any topic.

The “Narrative” functionality, integrated into Arquivo.pt in September 2021, is the result of the collaboration between “Conta-me Histórias”, winner of the Arquivo.pt Award 2018, and Arquivo.pt.

The “Conta-me Histórias” (Tell me Stories) project was developed by researchers from the Laboratory of Artificial Intelligence and Decision Support (LIAAD – INESCTEC) affiliated with the Instituto Politécnico de Tomar – Center for Research in Smart Cities (CI2), the University of Porto and the University of Innsbruck.

How does it work?

When a user enters a set of words about a topic in the Arquivo.pt search box and clicks on the “Narrative” button, the user is directed to the “Conta-me Histórias” service, which automatically analyzes the news from 25 websites archived by Arquivo.pt over time and presents a chronology of news related to the topic.

For example, if we search for “Justin Bieber” and click on the “Narrative” button (Figure 1), we will be directed to the “Conta-me Histórias” service, where we will automatically obtain a narrative of archived news (Figure 2).

Figure 1: Search results for pages about “Justin Bieber”.

Figure 2: Narrative of news about “Justin Bieber” from Portuguese news sites preserved by Arquivo.pt generated by the “Conta-me Histórias” service.

Create your narrative now!

“Conta-me Histórias” searches, analyzes and aggregates thousands of results to generate each narrative about a topic. It is recommended to choose descriptive words about well-defined themes, personalities or events to obtain good narratives.

Creating a narrative is useful for researchers, journalists or citizens who want to quickly get an overview of the evolution of a topic over time, thus saving them a lot of time and effort.

Go to Arquivo.pt and try to create a narrative about a theme of your choice.

Tell us about your experience so we can improve the service!

Presentations in the IIPC Web Archiving Conference and RESAW 2021

August 2, 2021 by Ricardo Basílio

Last updated on November 17th, 2022 at 05:37 pm

During the week of 14 to 18 June 2021, three international meetings were held by videoconference with the participation of Arquivo.pt:

- International Internet Preservation Consortium (IIPC) – General Assembly – general assembly of the consortium that gathers the Web archiving initiatives around the world
- Web Archiving Conference 2021 – the most important meeting in the field of Web preservation, where experts share new knowledge and experiences
- RESAW Conference – meeting of the European RESAW network (Research Infrastructure for the Study of Archived Web Materials) this year in its 4th edition, mainly addressed to the community of researchers from non-technological scientific areas, such as Social Sciences, Arts and Humanities.

Contributions of Arquivo.pt to the international community

Arquivo.pt presented some results of the work developed in the last year, with emphasis on the functionalities that improve the reproduction of the archived contents, such as “Complete the page”.
Two historical collections were integrated into Arquivo.pt: Geocities and the Internet Memory Foundation. Arquivo.pt also created special collections about the 2019 European Elections and Covid-19.
The contents of Arquivo.pt are accessible to any researcher regardless of the country they are in, which makes it a useful service to the international community.

Presentations

Arquivo.pt updates 2021: presentation at the IIPC – General Assembly, by Daniel Gomes (Video)
Complete the page. 1 minute drop in (presentation at the IIPC – General Assembly “complete the page”), by Daniel Gomes (Slide, Video)
A transnational and cross-lingual crawl of the European Parliamentary Elections 2019, by Ivo Branco (Slides, Video)
Enhancing access to research the Geocities historical collection, by Pedro Gomes (Slides, Video)

Complete the page – demo. Slide used in the IIPC 1 minute presentation, at the IIPC General Assembly 2021

Internet Memory Foundation collection available in Arquivo.pt

May 26, 2021 by Ricardo Basílio

Last updated on September 15th, 2021 at 09:29 am

The historical collection of web content generated during the Internet Memory Foundation’s (IMF) activity has been donated to Arquivo.pt and is now searchable!

The IMF was a European organization dedicated to preserving web content. It was wound up in 2018.

The first web archiving project in Europe (2004-2010) was led by Julien Masanès (who was guest of honour at the celebration of 10 years of Arquivo.pt) and was called the European Archive Foundation.

In 2010, Julien Masanès, the “father” of Web archives in Europe, created the IMF.

Examples of pages from the collection donated by the IMF

The collection donated by the IMF has now been integrated in the Arquivo.pt collection to be preserved for posterity.

This collection is composed of 142 million files that total 6.3 TB of historical information whose texts or images can now be searched through Arquivo.pt.

Life Science Competence in Europe portal, 2009.

LIMES project homepage (Land and Sea Monitoring for Environment and Security), 2009.

Project Intelligence-territoriale homepage, 2009.

European Parliament news page on the 20th anniversary of the fall of the Berlin Wall, 2009.

Le Figaro on the French presidential election, 2012.

Reuters with a news story about WikiLeaks, 2011.

Internet Memory Foundation homepage, 2014.

Search this new collection!

This new collection has been named “InternetMemory” in the Arquivo.pt collections list.

Searches can be made on this collection using the collection search parameter or through the custom search page available at arquivo.pt/InternetMemory.

Replay with old browser and export results with the new version of Arquivo.pt

July 23, 2020 by Ivo Branco

Results of a search for the word "universidade" (university), limited to 10 items, exported to an Excel sheet

Arquivo.pt launched a new version of its service on July 1, 2020 named Responsive.

The purpose of this version was to improve the user experience across different devices and add new features.

Replay a past webpage using a browser from the past

We added an option to view the archived page using a browser from the past. In the Options, choose Replay with old browser and you will be redirected to the oldweb.today service that emulates browsers such as Netscape Navigator, Microsoft Internet Explorer or NCSA Mosaic.

This external service is useful for research use cases, in areas such as Web design, Art, Communication or History, where it is necessary to access the original visual aspect of a page from the past in the most reliable way possible.

Web page of the European Map of WWW/NIR sites in 1996 using the Oldweb.Today service

Try this new option from Arquivo.pt to replay the European Map of WWW/NIR sites in 1996 using a browser from that time, or any other historical page, using the Oldweb.Today service.

You may have to wait a while for your request to be processed but it is always faster than having to install a browser from the past on your computer.

Export search results to spreadsheet format

This new function enables users to save their search results for further treatment and analysis. This is especially useful to perform thorough research about a given topic.

After a search, in the Options, just choose one of the available formats to export the obtained results: XLSX, CSV or TXT.

New version of Arquivo.pt (Webapp release)

April 15, 2020 by Ivo Branco

Webapp release on mobile version example

Last updated on October 12th, 2020 at 11:52 am

Arquivo.pt launched a new version of its service on April 15, 2020 named WebApp.

The purpose of this version was to standardize the user experience across different devices and reduce maintenance costs by removing components with redundant functions.

Its main novelty is the combination of the desktop and mobile interfaces in a single user interface.

The old desktop version has been disabled and the mobile version has evolved to work on various types of devices and screen sizes.

New design of the homepage

Try the new image and page search

New user interfaces for image or text search

Help us to improve!

To help us, just search Arquivo.pt using any device (e.g. laptop, mobile phone, tablet).

If you encounter any problems, please contact us!

Remember to always send the address of the page where you detected the problem.

More information

New feature “Complete the page” that tries to complete the page with resources from other web archives.
All user interfaces started using the Arquivo.pt API, which is publicly available.
Replacement of the PhantomJS page screenshot service technology by Puppeteer.
See the list of 145 issues solved

Arquivo.pt Award 2020 launched at Público Newspaper

January 17, 2020 by Ricardo Basílio

Last updated on March 24th, 2020 at 12:21 pm

The Arquivo.pt Award 2020 was officially launched on January 16th, at the Público Newspaper in Lisbon. Público is one of the most well-known newspapers in Portugal.

The event had talks by the Director of Público Newspaper Manuel Carvalho, the President of the Foundation for Science and Technology Helena Pereira and the manager of Arquivo.pt Daniel Gomes.

Participants in this open event were taken on a guided tour of the newsroom. They saw a real scenario where the contents of a newspaper are edited and produced.

Público’s website is crawled daily by Arquivo.pt, which is an important contribution to the future access and use of its contents.

In the 2020 edition of the Arquivo.pt Award, Público Newspaper will grant an Honorable Mention to works based on the newspaper’s content over its 20 years.

Find out how to apply, until May 4: arquivo.pt/award2020