Cross-lingual research datasets on 2019 European Parliamentary Elections

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

Last updated on December 13th, 2024 at 01:55 pm

Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned. 

The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. In parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.

In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content. 

The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy


The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to find ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students. 

Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022. 

The idea was to develop part of his research on the syntactic structures of EU languages using the textual resources preserved by Arquivo.pt, and to exchange knowledge with the web-archiving experts on strategies to extract and process historical web data.

Diego Alves defended his Ph.D. thesis, entitled “Computational typological analysis of syntactic structures in European languages”, in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

  1. Extract text: the textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, so that texts written in different languages could be separated into distinct files;
  2. Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g. repeated instances and empty lines);
  3. Double-check language identification: the language of each cleaned text was verified again to eliminate possible errors introduced during the previous steps.
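
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the script used by the authors: the cleaning rules (drop empty and repeated lines), the choice to discard a text when the two language guesses disagree, and the names clean_text and group_by_language are all assumptions. The language detector is passed in as a function; in practice it could be the detect function of the langdetect library, applied to texts extracted with newspaper3k.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def clean_text(text: str) -> str:
    """Step 2 (assumed rules): drop empty lines and repeated lines."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def group_by_language(
    texts: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group cleaned texts by language, keeping only consistent detections."""
    corpora: Dict[str, List[str]] = defaultdict(list)
    for raw in texts:
        if not raw.strip():
            continue  # nothing was extracted for this URL
        first_guess = detect(raw)   # step 1: language of the extracted text
        cleaned = clean_text(raw)   # step 2: cleaning
        # Step 3 (assumed policy): keep the text only if both guesses agree.
        if detect(cleaned) == first_guess:
            corpora[first_guess].append(cleaned)
    return dict(corpora)
```

Each key of the returned dictionary would then correspond to one output file per language.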

Two new research datasets are openly available!

The result was a publicly available dataset of cleaned and language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens of each corpus extracted from the collection 2019 European Union Elections preserved by Arquivo.pt (EAWP23).

This corpus was automatically annotated with part-of-speech tags and dependency relations to generate a corpus with syntactic information, which is useful for linguistic studies.

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied. 

The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
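
To give an idea of how such annotated files can be consumed, the sketch below reads CoNLL-U text using only the Python standard library. The ten tab-separated columns follow the Universal Dependencies specification; the function name read_conllu and the sample sentence are illustrative and are not taken from the dataset.

```python
from typing import Dict, Iterator, List

# The ten columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():        # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):    # sentence-level comment
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            continue                # skip malformed lines
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:                    # the text may not end with a blank line
        yield sentence


# Hypothetical two-token sentence, for illustration only.
SAMPLE = (
    "# text = Citizens voted\n"
    "1\tCitizens\tcitizen\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tvoted\tvote\tVERB\t_\tTense=Past\t0\troot\t_\t_\n"
)

for tokens in read_conllu(SAMPLE):
    print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
```

For larger studies, a dedicated CoNLL-U library may be more convenient than this hand-written reader.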

The syntactically annotated texts about the 2019 European Elections are publicly available!

Know more

Arquivo.pt preserves Wikipedia citations

Wikimedia Portugal and Arquivo.pt

Last updated on December 11th, 2024 at 10:36 am

Wikipedia is an educational resource degraded by broken links

Arquivo.pt preserves information published online so that it can be used later for research and educational purposes. For example, Arquivo.pt preserved online information about European projects funded by H2020.

Wikipedia articles refer to external pages with important complementary information that has since become unavailable.

Wikipedia articles are among the most used online resources for educational purposes. However, they sometimes reference external web pages with important complementary information that has disappeared from its original website. This problem degrades the quality of Wikipedia as a credible and verifiable source of information.

In August 2023, the Arquivo.pt team carried out an experiment to measure the percentage of external links (outside the wikipedia.org domain) that were broken in Portuguese Wikipedia articles. The obtained results showed that 25% of the external links referenced in the Portuguese Wikipedia were broken.
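
A measurement of this kind can be approximated with a short script. The sketch below is only an assumption about how one might do it, not the procedure used by the team: it treats a link as broken when the request fails or returns an HTTP status of 400 or above, and the function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code   # e.g. 404 or 410
    except (urllib.error.URLError, OSError, ValueError):
        return None         # DNS failure, timeout, invalid URL, ...


def broken_percentage(
    urls: Iterable[str],
    status_of: Callable[[str], Optional[int]] = fetch_status,
) -> float:
    """Percentage of urls considered broken (assumed rule: no answer or >= 400)."""
    total = broken = 0
    for url in urls:
        total += 1
        status = status_of(url)
        if status is None or status >= 400:
            broken += 1
    return 100.0 * broken / total if total else 0.0
```

A check based only on status codes cannot detect pages that still respond but whose content has changed.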

A link may also still lead to an accessible web page whose content has changed since it was cited, so that it no longer shows what the reference was meant to point to (the content drift problem). The domain of the hosting website may even have been purchased by third parties in the meantime, for example for malicious purposes.

To mitigate these problems, Arquivo.pt launched a project, in collaboration with Wikimedia Portugal, to preserve the online references contained in Portuguese Wikipedia articles. The objective was to replace broken links in Wikipedia references with links to web-archived content preserved in Arquivo.pt, thus keeping the referenced information accessible to Wikipedia users.

Arquivo.pt archived the pages referenced in Portuguese Wikipedia articles

Wikipedia recommends citing web archives (archiveurl/archive-url parameter).

The Portuguese Wikipedia contains about 1 million articles and on average 140 pages are edited per day.

Arquivo.pt automatically extracted 14 million links from references in all Portuguese Wikipedia articles. Only 620 referenced Arquivo.pt and 744 553 referenced the Internet Archive (5.3%). Note that Wikipedia’s guide to creating references recommends citing web archives (archiveurl/archive-url parameter).
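
The reported share can be checked with simple arithmetic, and the classification of links by host is easy to sketch. In the example below, the regular expression, the host names used to recognise each web archive, and the function name are illustrative assumptions, not the extraction process used by Arquivo.pt.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Rough pattern for URLs inside wikitext (assumption, not exhaustive).
URL_PATTERN = re.compile(r'https?://[^\s|\]}<>"]+')

# Hosts assumed to identify each web archive.
ARCHIVE_HOSTS = {
    "arquivo.pt": "Arquivo.pt",
    "web.archive.org": "Internet Archive",
}


def count_link_targets(wikitext: str) -> Counter:
    """Count links in wikitext, labelling those that point to a web archive."""
    counts = Counter()
    for url in URL_PATTERN.findall(wikitext):
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        counts[ARCHIVE_HOSTS.get(host, "other")] += 1
    return counts


# Figures quoted in the text: 744 553 of about 14 million links.
share = 100 * 744_553 / 14_000_000
print(f"{share:.1f}%")  # prints 5.3%
```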

On February 15, 2023, Arquivo.pt collected all pages referenced in Portuguese Wikipedia articles, which resulted in a new collection named EAWP42: Collection of external links from wikipedia using the wikimedia dumps, which contains 12 million files (856 GB).

The main result of this project was the creation of a new automatic process for extracting and collecting external links cited on Portuguese Wikipedia pages. With this process in place, Arquivo.pt will perform an annual crawl of Wikipedia citations.

Attempt to automatically fix broken links in Wikipedia articles

InternetArchiveBot offers powerful operation and monitoring tools (e.g. Dashboard and Insights)

There are software robots that automatically add links to web-archived content when they find broken links in Wikipedia articles (e.g. Pywikibot, Wayback Medic or InternetArchiveBot).

An experiment was carried out to create an ArquivoPTBot based on the InternetArchiveBot, because it offers powerful operation and monitoring tools (e.g. Dashboard and Insights) and is maintained by the Internet Archive, the largest web archive in the world.

However, it was not possible to launch this service in production because that would require changes to the software so that it could use Arquivo.pt as a source of web-archived information.

If you want to collaborate to resume this project, please contact us!

Preserving Wikipedia references is at your fingertips!

Arquivo.pt remains committed to helping preserve the links cited in Wikipedia articles and offers the following services that may be useful to you.

Arquivo.pt CitationSaver: preserves citations to online content (https://arquivo.pt/citationsaver).

CitationSaver allows you to submit the source code of a Wikipedia article, and Arquivo.pt will automatically extract the referenced links and archive the corresponding content.

 

Arquivo.pt SavePageNow: saves pages in Arquivo.pt (https://arquivo.pt/savepagenow).

SavePageNow allows you to immediately archive a web page in Arquivo.pt, for example a page referenced in a Wikipedia article, so that it doesn’t get lost.

Training Arquivo.pt/Wikimedia

Wikimedia Portugal, in collaboration with Arquivo.pt, organized a set of webinars to draw the community’s attention to the preservation of content published and cited on Wikipedia. The videos and slides of these webinars are available:

CDXJ index files are available to support bulk access

A group of researchers looking at a server rack

Last updated on August 22nd, 2024 at 10:48 am

The research and education community has been requesting the bulk download of web-archived data and index files (CDXJ), for instance, to feed AI training models, optimize routing of web archive requests or recover information from selected websites (e.g. news).

Arquivo.pt has begun making all its CDXJ index files publicly available in real time to facilitate the bulk download of web-archived data. Learn how at:

Your feedback with comments or suggestions is most welcome to improve this service!

Please disseminate this information among potentially interested parties.
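
As an illustration of what working with such index files may involve, the sketch below parses one CDXJ line. It assumes the common layout of a sort key, a 14-digit timestamp and a JSON block separated by single spaces; the sample line and the JSON field names (url, status, filename, offset) are hypothetical and should be checked against the files actually published by Arquivo.pt.

```python
import json
from datetime import datetime
from typing import NamedTuple


class CdxjRecord(NamedTuple):
    key: str             # sort key of the archived URL
    timestamp: datetime  # capture time
    fields: dict         # remaining metadata, decoded from JSON


def parse_cdxj_line(line: str) -> CdxjRecord:
    """Split a CDXJ line into its key, capture time and JSON fields."""
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return CdxjRecord(
        key=key,
        timestamp=datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        fields=json.loads(payload),
    )


# Hypothetical sample line, for illustration only.
SAMPLE_LINE = (
    'eu,extmos)/ 20170427182603 '
    '{"url": "http://extmos.eu/", "status": "200", '
    '"filename": "example.warc.gz", "offset": "1024"}'
)

record = parse_cdxj_line(SAMPLE_LINE)
print(record.key, record.timestamp.isoformat(), record.fields["url"])
```

Because index files are plain text with one record per line, they can be streamed line by line without loading a whole file into memory.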

Tutorial: how to explore Arquivo.pt using Python

Last updated on August 5th, 2024 at 04:50 pm

The Programming Historian aims to develop digital skills among Humanities researchers through the publication of practical lessons in several languages.

The call “Computational analysis skills for large-scale humanities data” resulted in 7 new lessons.

One of them was the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt” developed by Daniel Gomes and Ricardo Campos.

It shows how to explore the Arquivo.pt user interface and Application Programming Interface (API) to execute advanced queries, process large amounts of data or build new services, such as Tell me stories.
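
As a rough idea of what querying the API from Python can look like, the sketch below builds a full-text search request. The endpoint, the parameter names (q, from, to, maxItems) and the response_items key are assumptions based on the public API as we understand it; they should be verified against the official API documentation and the tutorial itself.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint of the Arquivo.pt full-text search API.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query: str, start: str, end: str, max_items: int = 10) -> str:
    """Build a search URL; the parameter names are assumptions to verify."""
    params = {"q": query, "from": start, "to": end, "maxItems": max_items}
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def search(query: str, start: str, end: str, max_items: int = 10) -> list:
    """Run the query and return the list of results (assumed JSON key)."""
    url = build_search_url(query, start, end, max_items)
    with urllib.request.urlopen(url, timeout=30) as response:
        data = json.load(response)
    return data.get("response_items", [])


# Building the URL requires no network access.
print(build_search_url("european elections", "20190101000000", "20191231235959", 5))
```

Calling search(...) performs a real HTTP request, so it is best run sparingly and with a small max_items value while experimenting.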

All the developed resources are freely available in open access.

Open-access resources of the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt”

 

 

Open dataset about cryptocurrency

Cryptocurrency chart

Last updated on August 17th, 2022 at 09:19 am

(Photo: QuoteInspector)

Since 2008 the cryptocurrency market has revolutionised the world by innovating and expanding into other areas (e.g., finance and art). However, with this rapid expansion, many projects are created every day, giving rise to a wide and varied range of websites, technologies and scams. Markets follow financing stages and it is during an initial stage of euphoria that more projects are created.

We believe that as the cryptocurrency market stabilises, projects and their websites are disappearing because funding diminishes or runs out.

Arquivo.pt initiated a new web archive collection that preserves web content documenting cryptocurrency activities.

This work produced a new open dataset with information documenting each cryptocurrency project, including its original URLs and links to the corresponding web-archived version in Arquivo.pt. The information sources selected to create this dataset were:

We believe that this new dataset related to cryptocurrencies, together with the preservation of all the corresponding web content, has the potential to give rise to innovative scientific contributions in several areas such as Economics or Digital Humanities.

Resources

Researchers who want to carry out studies on the Cryptocurrencies dataset and need earlier access to the collected contents can contact Arquivo.pt.

Presentation at the IIPC Web Archiving Conference 2022

H2020 projects preserved by Arquivo.pt

Thumbnail H2020 projects

Last updated on August 5th, 2024 at 04:50 pm

The main objective of Arquivo.pt is to preserve online information for research and education purposes.

Previously, Arquivo.pt identified and preserved Research & Development project websites funded by the European Union during the FP4, FP5, FP6 and FP7 programmes (1994-2013).

Now, Arquivo.pt has helped preserve online information that documents R&D projects funded by the Horizon 2020 programme (2014-2021). It preserved 197 million web files (17 TB) related to science for future access.

H2020 projects publish valuable information online, but it is being lost

Websites about Research and Development (R&D) projects are increasingly being used to publish and disseminate important scientific information that complements published literature (e.g. data sets, documentation or software).

However, after projects end, the corresponding websites usually disappear, causing a permanent loss of unique and valuable scientific information.

Arquivo.pt automatically identified URLs that document H2020 Research and Development projects

The European Union’s Open Data Portal published a data set from the Community Research and Development Information Service (CORDIS) that documents H2020 research projects. However, of the 31 129 projects listed, only 46% had a project URL.

Arquivo.pt developed a low-cost methodology that automatically identifies URLs related to R&D projects to be systematically preserved. This automatic identification is achieved by combining open data sets with web search services. The methodology is detailed in a scientific article published at the International Conference on Digital Preservation 2016.

In sum, we extracted 106 300 unique URLs from the following open data sets:

Then, we extracted the acronym and title of the projects from the data sets and automatically searched the web for additional URLs using the Bing Search API.
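
The query-building part of this step might look like the sketch below. It is an illustration under stated assumptions rather than the published software: the record fields (acronym, title, url), the quoting of search terms and the function name are hypothetical, and the call to the search service itself is left out.

```python
from typing import Dict, Iterable, List


def build_search_queries(projects: Iterable[Dict[str, str]]) -> List[str]:
    """Build one web search query per project that lacks a URL."""
    queries = []
    for project in projects:
        if project.get("url", "").strip():
            continue  # a URL is already listed in the data set
        acronym = project.get("acronym", "").strip()
        title = project.get("title", "").strip()
        # Assumed query shape: quoted acronym followed by quoted title.
        terms = [f'"{term}"' for term in (acronym, title) if term]
        if terms:
            queries.append(" ".join(terms))
    return queries


# Hypothetical records, for illustration only.
records = [
    {"acronym": "DEMO", "title": "Demonstration Project", "url": ""},
    {"acronym": "HASURL", "title": "Project With Site", "url": "http://example.org/"},
]
print(build_search_queries(records))
```

Each query would then be sent to a web search service, and the returned URLs added to the list of candidates to be preserved.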

All the data sets and tools developed have been made publicly available in open access so that they can be reused and collaboratively enhanced. In particular, you can access the software developed to automatically identify additional URLs about H2020 projects.

197 million web files related to science were preserved

Arquivo.pt identified and preserved 197 million web files (17 TB) that document R&D projects funded by Horizon 2020.

In 2021, we can already witness project websites that are no longer available online, such as the Extended Model of Organic Semiconductors (EXTMOS) project (http://extmos.eu/). However, it was preserved and can be accessed at Arquivo.pt:

Archived version at Arquivo.pt (https://arquivo.pt/wayback/20170427182603/http://extmos.eu/) of the home page of the EXTMOS Research and Development project (http://extmos.eu/) funded by H2020.
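
The address of the archived version above combines a 14-digit capture timestamp with the original URL. The sketch below builds and splits addresses of that shape; the pattern is inferred from this single example, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

WAYBACK_PREFIX = "https://arquivo.pt/wayback"


def archived_url(original_url: str, timestamp: str) -> str:
    """Build the address of a capture from its timestamp and original URL."""
    if not re.fullmatch(r"\d{14}", timestamp):
        raise ValueError("timestamp must have the form YYYYMMDDhhmmss")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def split_archived_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (timestamp, original URL), or None if url has another shape."""
    pattern = rf"{re.escape(WAYBACK_PREFIX)}/(\d{{14}})/(.+)"
    match = re.fullmatch(pattern, url)
    return (match.group(1), match.group(2)) if match else None


print(archived_url("http://extmos.eu/", "20170427182603"))
```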

Contributions to complement the European Open Data Sets

All the resulting data sets were made publicly available so that they can be improved and reused by other organizations also interested in preserving this digital heritage:

To learn more about this collection, you can watch the video “Preservation of web content related to Horizon 2020”.

References

Are you a researcher?

Applications open to the Arquivo.pt Award 2020!

Arquivo.pt Award 2020

Last updated on August 6th, 2024 at 05:18 pm

Applications are open to the Arquivo.pt Award 2020!

In this 3rd edition of the annual Arquivo.pt Award, € 15,000 will be awarded to the 3 best works (1st place: € 10,000).

The deadline for submissions is May 4, 2020.

Works may be developed individually or in groups, on any topic, as long as they use the information provided by Arquivo.pt as the main source of information.

The Público Newspaper is the official media partner of the Arquivo.pt Award in 2020. It was one of the first newspapers to become available online.

Jornal Público celebrates its 30th anniversary on March 5, 2020 and will award an Honorable Mention to one of the works focused on the historical web archive of Público online.

Full details about the applications are available at:
https://arquivo.pt/award2020

The Arquivo.pt Award promotes the visibility of the applicants and their institutions.

Help us to spread the word!

Meet the winners of the Arquivo.pt Award 2019!

Last updated on August 22nd, 2024 at 02:33 pm

The winners of the Arquivo.pt Award 2019 were announced by His Excellency the Prime Minister of Portugal, António Costa, on July 8 at 9:00 am in Auditorium 1 of the Lisbon Congress Center, during the event Science 2019 – Encounter with Science and Technology.

23 applications were received.

Arquivo.pt Award ceremony during the Encontro Ciência 2019


Photos by Valter Gouveia, FCT

1st place: meuParlamento.pt

MeuParlamento.pt is a mobile application that simulates the Portuguese Parliament, calling on all citizens to play the role of a deputy.

Authors: Nuno Moniz, Arian Pasquali, Tomás Amaro

2nd place: Revisionista.pt: Un-cover the news

Revisionista.pt is an online tool that reveals post-publication changes in Portuguese news.

Authors: Flávio Martins and André Mourão

3rd place: Public speeches about violence in private

Analysis of 217 news articles on domestic violence, collected in Arquivo.pt from the three main daily newspapers.

Author: Zélia de Macedo Teixeira

Materials for dissemination about the winners

About the Arquivo.pt Award 2019

Arquivo.pt is a research infrastructure managed by the Foundation for Science and Technology (FCT) that enables search and access to web pages archived since 1996.

The Arquivo.pt Award aims to annually promote innovative works based on historical information preserved by Arquivo.pt. Submissions closed on May 3 and we received works in areas such as: media studies, education, design, information technology, health or cultural and historical heritage.

The Arquivo.pt Award 2019 received the High Patronage of His Excellency the President of the Republic, Marcelo Rebelo de Sousa.

Arquivo.pt Award 2020

The Arquivo.pt Award was approved as an annual initiative of the Foundation for Science and Technology that will run from January to May.

Sign up for the Arquivo.pt international mailing list to be informed about future editions of the Arquivo.pt Award!

Know more

Arquivo.pt in Croatia


Last updated on October 1st, 2020 at 04:44 pm

The Arquivo.pt team went to the biggest international Web archiving conference.

The conference, organized by the International Internet Preservation Consortium (IIPC), occurred from the 5th to the 7th of June 2019.

 

Interview with Daniel Gomes, Head of Arquivo.pt

 

Arquivo.pt contributed with 5 additional presentations during the conference.

The slides and videos are available in the following links:

Java/Linux expert to collaborate with Arquivo.pt

Last updated on August 5th, 2024 at 11:23 am

The ROSSIO Project - Social Sciences, Arts and Humanities has an open call for 1 position to collaborate with Arquivo.pt.

The call is open until June 13th, 2019.

Check the application details at the links below.

Main tasks

  • Development of large-scale distributed computer systems;
  • Analysis, planning and operation of computer services;
  • Corrective and evolutive maintenance of software;
  • Interaction with scientific and technical communities to establish collaborations and identify requirements.

Application details