Cross-lingual research datasets on 2019 European Parliamentary Elections

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

Last updated on March 17th, 2024 at 03:04 pm

Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned. 

The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. In parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.

In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content. 

The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy


The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to find ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students. 

Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022. 

The idea was to develop part of his research on the syntactic structures of EU languages using the textual resources preserved by Arquivo.pt, and to exchange knowledge with web-archiving experts on strategies to extract and process historical web data.

Diego Alves defended his Ph.D. thesis, entitled “Computational typological analysis of syntactic structures in European languages”, in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

  1. Extract text: the textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, so that texts written in different languages could be separated into distinct files;
  2. Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g. repeated instances and empty lines);
  3. Double-check the language identification: the language of each cleaned text was verified again to eliminate possible errors introduced during the previous steps.
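
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the script used in the project. It assumes that text extraction (done with newspaper3k in the project) has already happened, that `detect_language` is a hypothetical stand-in for a language detector such as langdetect, and that a text is discarded when the second language check disagrees with the first. All function names are invented for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def clean_text(text: str) -> str:
    """Drop empty lines and repeated lines, keeping the first occurrence."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def build_corpora(
    texts: Iterable[str],
    detect_language: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group cleaned texts by language code.

    Assumption: a text is kept only when the language detected after
    cleaning matches the language detected before cleaning.
    """
    corpora = defaultdict(list)
    for raw in texts:
        if not raw.strip():
            continue
        first_guess = detect_language(raw)           # step 1: identify language
        cleaned = clean_text(raw)                    # step 2: clean the text
        if not cleaned:
            continue
        if detect_language(cleaned) == first_guess:  # step 3: verify again
            corpora[first_guess].append(cleaned)
    return dict(corpora)
```

Passing the detector as an argument keeps the sketch independent of any particular library; each list in the returned dictionary corresponds to one per-language file.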

Two new research datasets are openly available!

The result is a publicly available dataset of cleaned and language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens of each corpus extracted from the collection 2019 European Union Elections preserved by Arquivo.pt (EAWP23).

This corpus was automatically annotated with part-of-speech tags and dependency relations to generate a corpus with syntactic information, which is useful for linguistic studies.

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied. 

The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
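
For readers unfamiliar with the format, a CoNLL-U file stores one token per line with ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with `#`, and a blank line between sentences. A minimal reader could look like the sketch below. The function name and the sample sentence are invented for illustration and are not taken from the dataset.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped.
            continue
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}")
            sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        yield sentence


# Invented example sentence, not taken from the dataset.
SAMPLE = (
    "# text = Citizens vote.\n"
    "1\tCitizens\tcitizen\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tvote\tvote\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for sent in read_conllu(SAMPLE):
    for tok in sent:
        print(tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

The HEAD column points to the ID of the governing token (0 for the root), which is the information needed for studies of syntactic structure.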

The syntactically annotated texts about the 2019 European Elections are publicly available!

Know more

Arquivo.pt preserves Wikipedia citations

Wikimedia Portugal and Arquivo.pt

Last updated on September 19th, 2023 at 01:49 pm

Wikipedia is an educational resource degraded by broken links

Arquivo.pt preserves information published online so that it can be used later for research and educational purposes. For example, Arquivo.pt preserved online information about European projects funded by H2020.

Wikipedia articles refer to external pages with important complementary information that has since become unavailable.

Wikipedia articles are among the most used online resources for educational purposes. However, they sometimes reference external web pages with important complementary information that has disappeared from the original websites. This problem degrades the quality of Wikipedia as a credible and verifiable source of information.

In August 2023, the Arquivo.pt team carried out an experiment to measure the percentage of external links (outside the wikipedia.org domain) that were broken in Portuguese Wikipedia articles. The obtained results showed that 25% of the external links referenced in the Portuguese Wikipedia were broken.
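
A measurement of this kind can be approximated with a short script. The sketch below is not the tool used by Arquivo.pt. It assumes that a link counts as broken when the request fails or returns an HTTP status of 400 or above, and the function names are hypothetical.

```python
import http.client
import urllib.error
import urllib.request
from typing import Iterable, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request fails.

    A HEAD request is used to avoid downloading the body; some servers
    reject HEAD, so a real measurement may need to fall back to GET.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        return None


def is_broken(status: Optional[int]) -> bool:
    """Assumption: no response or a 4xx/5xx status means the link is broken."""
    return status is None or status >= 400


def broken_percentage(statuses: Iterable[Optional[int]]) -> float:
    """Percentage of broken links among the collected statuses."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    broken = sum(1 for status in statuses if is_broken(status))
    return 100.0 * broken / len(statuses)
```

A status check of this kind only detects links that no longer respond; it does not detect pages that are still online but whose content has changed.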

A further problem is that a link may still point to an accessible web page whose content has changed in the meantime, so that it no longer shows what the reference was meant to cite (the Content Drift problem). The domain of the hosting website may also have been purchased by third parties, for example for malicious purposes.

To mitigate these problems, Arquivo.pt launched a project, in collaboration with Wikimedia Portugal, to preserve the online references contained in Portuguese Wikipedia articles. The objective was to replace references to broken links in Wikipedia articles with references to web-archived content preserved in Arquivo.pt, thus keeping the referenced information accessible to Wikipedia users.

Arquivo.pt archived the pages referenced in Portuguese Wikipedia articles

Wikipedia recommends citing web archives (archiveurl/archive-url parameter).

The Portuguese Wikipedia contains about 1 million articles and on average 140 pages are edited per day.

Arquivo.pt automatically extracted 14 million links from references in all Portuguese Wikipedia articles. Only 620 of them referenced Arquivo.pt and 744 553 referenced the Internet Archive (5.3%). Note that Wikipedia’s guide to creating references recommends citing web archives (archiveurl/archive-url parameter).
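
Counts like these amount to classifying each extracted link by its host. The following sketch is illustrative only: the regular expression is a simplification of how URLs appear in wikitext, the sample text is invented, and the function names are hypothetical. It also reproduces the 5.3% figure from the numbers quoted above.

```python
import re
from collections import Counter
from typing import Iterable, List
from urllib.parse import urlparse

# Simplified pattern: a URL ends at whitespace or at wikitext delimiters.
URL_PATTERN = re.compile(r'https?://[^\s|\]}<>"]+')

ARCHIVE_HOSTS = {
    "arquivo.pt": "Arquivo.pt",
    "web.archive.org": "Internet Archive",
}


def extract_links(wikitext: str) -> List[str]:
    """Return every URL found in a piece of wikitext."""
    return URL_PATTERN.findall(wikitext)


def classify(links: Iterable[str]) -> Counter:
    """Count links per web archive; anything else is counted as 'other'."""
    counts = Counter()
    for link in links:
        host = (urlparse(link).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        counts[ARCHIVE_HOSTS.get(host, "other")] += 1
    return counts


# Invented wikitext fragment with one live link and two archived links.
SAMPLE = (
    "<ref>{{cite web |url=http://example.org/news "
    "|archive-url=https://arquivo.pt/wayback/2019/http://example.org/news}}</ref> "
    "<ref>[https://web.archive.org/web/2019/http://example.com/ Title]</ref>"
)

counts = classify(extract_links(SAMPLE))

# Share of Internet Archive links among the 14 million extracted links.
internet_archive_share = round(100 * 744_553 / 14_000_000, 1)
```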

On February 15, 2023, Arquivo.pt collected all pages referenced in Portuguese Wikipedia articles, which resulted in a new collection named EAWP42: Collection of external links from wikipedia using the wikimedia dumps, which contains 12 million files (856 GB).

The main result of this project was the creation of a new automatic process for extracting and collecting external links cited on Portuguese Wikipedia pages. Therefore, Arquivo.pt will perform an annual crawl of Wikipedia citations.

Attempt to automatically fix broken links in Wikipedia articles

InternetArchiveBot offers powerful operation and monitoring tools (e.g. Dashboard and Insights)

There are software robots that automatically add links to web-archived content when they find broken links in Wikipedia articles (e.g. Pywikibot, Wayback Medic or InternetArchiveBot).

An experiment was carried out to create an ArquivoPTBot based on the InternetArchiveBot, because it offers powerful operation and monitoring tools (e.g. Dashboard and Insights) and is maintained by the Internet Archive, the largest web archive in the world.

However, it was not possible to launch this service in production because doing so requires changes to the software so that it can use Arquivo.pt as a source of web-archived information.

If you want to collaborate to resume this project, please contact us!

Preserving Wikipedia references is at your fingertips!

Arquivo.pt remains committed to helping preserve the links cited in Wikipedia articles and offers the following services that may be useful to you.

Arquivo.pt CitationSaver: preserves citations to online content (https://arquivo.pt/citationsaver).

CitationSaver allows you to submit the source code of a Wikipedia article; Arquivo.pt will then automatically extract the referenced links and archive the corresponding content.

 

Arquivo.pt SavePageNow: saves pages in Arquivo.pt (https://arquivo.pt/savepagenow).

SavePageNow allows you to immediately archive a web page in Arquivo.pt, for example one that is referenced in a Wikipedia article, so that it does not get lost.

Training Arquivo.pt/Wikimedia

Wikimedia Portugal, in collaboration with Arquivo.pt, promoted a set of webinars to draw the community’s attention to the preservation of content published and cited on Wikipedia. The videos and slides of these webinars are available:

CDXJ index files are available to support bulk access

A group of researchers looking at a server rack

Last updated on May 5th, 2023 at 01:39 pm

The research and education community has been requesting support for the bulk download of web-archived data and index files (CDXJ), for instance to feed AI training models, optimize the routing of web archive requests or recover information from selected websites (e.g. news).

Arquivo.pt has begun making all its CDXJ index files publicly available in real time to facilitate the bulk download of web-archived data. Learn how at:

Your comments and suggestions to improve this service are most welcome!

Please disseminate this information among potentially interested parties.
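
As background for anyone planning to work with these index files, the sketch below parses a single CDXJ line. It assumes the common CDXJ layout used by tools such as pywb: a SURT-style URL key, a 14-digit timestamp and a JSON object, separated by spaces. The sample record, its field values and the function name are invented, and the exact fields in the Arquivo.pt files may differ.

```python
import json
from typing import Dict, NamedTuple


class CdxjRecord(NamedTuple):
    urlkey: str        # SURT-style key, e.g. "pt,example)/"
    timestamp: str     # capture time as YYYYMMDDhhmmss
    fields: Dict[str, str]


def parse_cdxj_line(line: str) -> CdxjRecord:
    """Split a CDXJ line into URL key, timestamp and JSON fields."""
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return CdxjRecord(urlkey, timestamp, json.loads(payload))


# Invented record for illustration only.
SAMPLE = (
    'pt,example)/ 20190526120000 {"url": "http://example.pt/", '
    '"mime": "text/html", "status": "200", "offset": "1024", '
    '"length": "2048", "filename": "example.warc.gz"}'
)

record = parse_cdxj_line(SAMPLE)

# Offset and length locate the compressed record inside the WARC file,
# so a client can request only that byte range instead of the whole file.
start = int(record.fields["offset"])
byte_range = (start, start + int(record.fields["length"]) - 1)
```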

Arquivo.pt in Croatia


Last updated on October 1st, 2020 at 04:44 pm

The Arquivo.pt team attended the largest international web archiving conference.

The conference, organized by the International Internet Preservation Consortium (IIPC), occurred from the 5th to the 7th of June 2019.

 

Interview with Daniel Gomes, Head of Arquivo.pt

 

Arquivo.pt contributed with 5 additional presentations during the conference.

The slides and videos are available at the following links:

Search 17 million images from the past with Arquivo.pt!

Image viewer Arnold

At the end of 2018, Arquivo.pt launched an experimental service for searching images from the past, which made it possible to search around 4 million images from some of the Arquivo.pt collections.

Since April 2019, it has been possible to search for images from all the Arquivo.pt collections.

Arnold Schwarzenegger Arquivo.pt image search

You can now search over 17 million unique images (over 50 pixels in width and height) published since 1996.

Find pages from the past through the new image search service.

Try the “Visit Page” option to find the Web page from the past that contained the image you selected.


Try the image search service now!

Search images from the past with Arquivo.pt at https://arquivo.pt/images.jsp?l=en

Students from Coimbra visit Arquivo.pt

On April 11, the Arquivo.pt headquarters at LNEC welcomed 30 students enrolled in the Information Science course at the Faculty of Arts and Humanities of the University of Coimbra.
On this occasion, the students took part in a technical visit and a training session on digital preservation and on the services offered by Arquivo.pt.
Bring your institution. Contact us at arquivo.pt/contact

Visit to the data center to see the infrastructure that supports Arquivo.pt

Guided tour of Arquivo.pt, which operates on the LNEC campus

Session about Arquivo.pt in the small auditorium of LNEC

Daniel Gomes, manager of Arquivo.pt, accompanies the visit to the data center to see the infrastructure that supports Arquivo.pt

Ask for an exhibition of posters about historical web pages

Last updated on October 26th, 2023 at 10:19 am

If you wish to host an exhibition in a shared space (e.g. a library, a university hallway or a common room in a congress center), please contact us!

Examples of past exhibitions

These exhibitions presented iconic web pages from Portugal’s recent history.

Universidade do Minho – Gualtar Campus – Braga

  • Where: Espaço B-Lounge of Universidade do Minho Library
  • When: Until April 18. Monday to Friday, from 8h30 to midnight, and Saturdays from 8h30 to 14h30.

Universidade do Minho – Azurém Campus – Guimarães

  • Where: Espaço B-Lounge of Universidade do Minho Library
  • When: From April 23 to May 10. Monday to Friday, from 8h30 to midnight, and Saturdays from 8h30 to 14h30.

ISCTE – Lisboa

  • Where: ISCTE-IUL Library.
  • When: From April 16 until May 11. From 8h to 21h.

Escola Superior de Hotelaria do Estoril

  • Where: Celestino Gomes Library
  • When: Until May 15. Monday to Friday, from 9h30 to 12h00 and from 14h30 to 17h30. From May 6 to May 18, Monday to Friday, from 9h30 to 20h30, and Saturdays from 9h30 to 13h.

NOVA – IMS – Campolide Campus

  • Where: Nova IMS Library and Main Corridor.
  • When: Until April 18. Monday to Friday, from 10h to 18h. From April 22 to 30, from 9h to 20h.

Arquivo.pt exhibitions

  • Universidade do Minho Library, Azurém Campus
  • ISCTE Library, Lisboa
  • B-Lounge, Universidade do Minho Library
  • Library of the Faculdade de Ciência e Tecnologia da Universidade Nova de Lisboa
  • Library of the Escola Superior de Ciências Empresariais (ESCE), Instituto Politécnico de Setúbal
  • Library of the Reitoria da Universidade do Algarve
  • Faculdade de Letras da Universidade de Lisboa
  • Faculdade de Psicologia da Universidade de Lisboa and Instituto da Educação
  • Library of the Faculdade de Ciências da Universidade de Lisboa

News about our poster exhibitions

Arquivo.pt Exhibition in FCT – Caparica

Last updated on March 29th, 2019 at 01:37 pm

Next Thursday, 28th March, is the last day to visit the Arquivo.pt exhibition at the FCT Library, in Caparica.

The exhibition is open to the public all day long, from 9 am to 8 pm and shows six iconic pages from Portugal’s recent history such as the result of the 1996 Presidential Elections in Portugal, the Expo’98 page, or the first home page of the RTP television network, published in 1998.

FCT – Library is located at Caparica Campus. For further information call 212 94 78 29.

This exhibition is itinerant, so if you are interested in bringing it to your city or institution, please contact Arquivo.pt.

 

Expo Arquivo.pt FCT – Caparica


Arquivo.pt Awards 2019 launching event

Last updated on April 3rd, 2019 at 05:07 pm

The Arquivo.pt Award 2019 was officially launched during an event at the Museum of the Presidency of the Republic, in Lisbon, on March 13th.

The President of the Portuguese Republic, Marcelo Rebelo de Sousa, the Minister of Science, Technology and Higher Education (MCTES), Manuel Heitor, and the vice-president of FCT, Helena Pereira, attended the ceremony.

In his speech, Marcelo Rebelo de Sousa highlighted the importance of initiatives like the Arquivo.pt Award, as well as the services offered by the portal.

Submit your work to the Arquivo.pt Award 2019. The 3 best works will receive a total of 15,000 euros in prizes.

Arquivo.pt Award 2019 launching event


 

Opening

Speech by the Minister of Science, Technology and Higher Education, Manuel Heitor

Presentation of the Arquivo.pt portal and the Arquivo.pt Award by Daniel Gomes, FCT

Speech by His Excellency the President of the Portuguese Republic

President of the Portuguese Republic launches Arquivo.pt Award 2019

Last updated on March 8th, 2019 at 03:05 pm

Arquivo.pt is honoured to announce that the launching event of the Arquivo.pt Award 2019 will take place at the Museum of the Presidency of the Republic, in Lisbon, on March 13th, at 3:30 pm.

The President of the Portuguese Republic, Marcelo Rebelo de Sousa, the Minister of Science, Technology and Higher Education (MCTES), Manuel Heitor, and the FCT vice-president, Helena Pereira, will attend the ceremony.

The event is open to all interested parties, subject to the capacity of the venue.

See the complete schedule below:

  • 3:30 pm – Reception
  • 4 pm – Opening (MPR)
  • 4:05 pm – Speech by the Minister of Science, Technology and Higher Education, Manuel Heitor
  • 4:15 pm – Presentation of the Arquivo.pt portal and the Arquivo.pt Award by Daniel Gomes, FCT
  • 4:30 pm – Address by His Excellency the President of the Republic and inauguration of “Time travel with the Presidents of the Republic” exhibit.
  • 4:45 pm – Cocktail and guided tours of the Museum and the Arquivo.pt exhibition
  • 5:30 pm – Closing

For more information: 213 614 660 or museu@presidencia.pt

Submit your work to the Arquivo.pt Award 2019. The 3 best works will receive a total of 15,000 euros in prizes.