Cross-lingual research datasets on 2019 European Parliamentary Elections

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

Last updated on March 17th, 2024 at 03:04 pm

Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned. 

The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. In parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.

In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content. 

The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy

The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to generate ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students. 

Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022. 

The idea was to develop part of his research on the syntactic structures of EU languages using the textual resources preserved by Arquivo.pt, and to exchange knowledge with web-archiving experts on strategies to extract and process historical web data.

Diego Alves defended his Ph.D. thesis, entitled “Computational typological analysis of syntactic structures in European languages”, in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

  1. Extract text: The textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, to separate the texts written in different languages across distinct files;
  2. Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g. repeated instances and empty lines);
  3. Double-check of language identification: the language of each cleaned extracted text was verified again to eliminate possible errors introduced during the previous steps.
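
Steps 2 and 3 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used to build the datasets: the function names `clean_text` and `group_by_language` are hypothetical, the cleaning rules are limited to the two examples named above (repeated lines and empty lines), and the language detector is passed in as a plain function so the sketch runs without third-party packages.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def clean_text(raw: str) -> str:
    """Drop empty lines and repeated lines, keeping first occurrences in order."""
    seen = set()
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def group_by_language(
    texts: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group cleaned texts by language, keeping a text only if the language
    detected before cleaning matches the one detected after cleaning."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in texts:
        if not raw.strip():
            continue  # nothing to detect or clean
        first_guess = detect(raw)      # language check at extraction time
        cleaned = clean_text(raw)      # step 2: clean the extracted text
        if detect(cleaned) == first_guess:  # step 3: double-check the language
            groups[first_guess].append(cleaned)
    return dict(groups)
```

In practice, `detect` could be the `detect` function of the langdetect library mentioned in step 1, which returns a language code such as `pt` or `en` for a given string. Discarding texts whose two detections disagree is one possible policy; the original pipeline may have resolved such cases differently.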

Two new research datasets are openly available!

The result is a publicly available dataset of cleaned and language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens of each corpus extracted from the collection 2019 European Union Elections preserved by Arquivo.pt (EAWP23).

This corpus was automatically annotated with part-of-speech tags and dependency relations to generate a corpus with syntactic information, which is useful for linguistic studies.

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied. 

The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
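
For readers who want to load these files, here is a minimal sketch of a CoNLL-U reader. It assumes only the standard layout of the format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. The function name `read_conllu` and the sample sentence are invented for illustration and are not taken from the dataset.

```python
from typing import Dict, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]

# Invented sample sentence, built with explicit tabs to avoid whitespace issues.
SAMPLE = "\n".join([
    "# text = Europe votes.",
    "\t".join(["1", "Europe", "Europe", "PROPN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "votes", "vote", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            continue  # skip malformed lines rather than fail
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


if __name__ == "__main__":
    for tokens in read_conllu(SAMPLE):
        print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
```

For real work, a dedicated CoNLL-U library is probably a better choice, since it also handles multiword tokens and empty nodes, which this sketch treats as ordinary lines.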

The syntactically annotated texts about the 2019 European Elections are publicly available!

Know more

Arquivo.pt presentations at IIPC GA/WAC, RESAW 2023 and CLEOPATRA

Last updated on March 10th, 2024 at 05:23 pm

Meeting the Web Archive Community

The International Internet Preservation Consortium (IIPC), a consortium that brings together Web preservation initiatives from around the world, held its General Assembly with its members on May 10, 2023.

On the following days, May 11 and 12, the IIPC Web Archiving Conference (IIPC WAC) took place. This conference is open to the community, so people or entities not associated with the IIPC but interested in Web preservation can participate.

The two events were jointly hosted by KB – National Library of the Netherlands, and by Beeld & Geluid – Netherlands Institute for Sound & Vision.

Contributions from Arquivo.pt to the Web Archiving Conference

Arquivo.pt participated in the IIPC working group meetings (Training Working Group and Curators Working Group) and contributed presentations to the thematic sessions Collaborations & Outreach and Program infrastructure (sessions 7 and 17).

  • Arquivo.pt updates 2023 (slides)
  • Linking web archiving with arts and humanities: the collaboration between ROSSIO and Arquivo.pt (video, slides)
  • Arquivo.pt behind the curtains (slides)

Meeting the RESAW research community

RESAW (Research Infrastructure for the Study of Archived Web Materials) is an initiative created in 2012 with the aim of promoting studies based on archived Web content, in areas such as Social Sciences, Digital Arts and Humanities.

The RESAW 2023 conference was held at the MUCEM Lab (Mediterranean Institute of Heritage Crafts) in Marseille on June 5-6, 2023, under the theme Exploring the Archived Web During a Highly Transformative Age.

Contributions from Arquivo.pt to RESAW 2023

Arquivo.pt contributed presentations to the sessions Web Archive in Mediterranean area and its merge (4.A), From online Tools to Web Archive (6.B), Towards a participatory approach to collections (9.A) and Digging up the materials for writing web history (9.B).

  • How to research governmental web data? (abstract, slides)
  • Archiving Cryptocurrencies (abstract, slides)
  • Time to explore, time to learn from the archived web: Arquivo.pt training initiative (abstract, slides)
  • Exhibiting Web Memories from Arquivo.pt: a call for community participation (abstract, slides)

CLEOPATRA Project Meeting

The CLEOPATRA Project, led by the L3S Research Center at the Gottfried Wilhelm Leibniz University of Hannover, has run a training programme for doctoral researchers (Early Stage Researcher, PhD) since 2019.

Arquivo.pt has participated in three courses: Incentives design for hybrid multilingual information processing and analytics, in Southampton; National and transnational media coverage of European parliamentary elections, 2004-2014, in London; and NLP for under-resourced languages, in Zagreb, Croatia.

In 2022, Arquivo.pt welcomed two researchers to its facilities, who used the archived resources and received special support from the Arquivo.pt team to develop their research.

The CLEOPATRA Project ended in 2023 with a meeting on May 16 in Hannover, which brought together professors, researchers and representatives of the institutions involved.

Daniel Gomes, Arquivo.pt’s Manager, highlighted the new tools that Arquivo.pt makes available and the results of the work carried out by the researchers that have passed through Arquivo.pt.

Free training on digital media – webinars

Last updated on June 2nd, 2023 at 05:35 am

The Aveiro Media Competence Center (AMCC) is a platform to support and promote the European Union (EU) Local News Media sector in the implementation of digital transition projects. The consortium includes the PCI Creative Science Park of Aveiro Region, the Associação Portuguesa de Imprensa and the University of Aveiro.

Arquivo.pt is a free public service that allows searching and accessing Web pages preserved since the 1990s, for example to read an old news story or to access an old version of a website.

The collaboration between the AMCC and Arquivo.pt takes the form of a training program entitled Arquivo.pt: Digital Skills for the Media, delivered in four webinars, and of the AMCC Honorable Mention awarded to work on Portuguese centenary newspapers in the Arquivo.pt Award 2023.

Webinar cycle: Arquivo.pt: digital skills for media

The webinar cycle aims to equip trainees with digital skills that enable them to solve problems caused by the disappearance of digital information and gain competitive advantage in the production of unique and exclusive content.

  • Webinar 1: A tool for quickly searching the past
    • Date: March 24, 2023, Time: 14h00-15h30 (in Portuguese)
  • Webinar 2: Publishing well for preserving well
    • Date: April 6, 2023, Time: 14h00-15h30 (in Portuguese)
  • Webinar 3: Automated access and processing of preserved Web information through APIs
    • Date: May 4, 2023, Time: 14h00-15h30 (in Portuguese)
    • Slides
    • Video
  • Webinar 4: Web archiving: do-it-yourself!
    • Date: June 1, 2023, Time: 14h00-15h30 (in Portuguese)

Tutorial: how to explore Arquivo.pt using Python

Last updated on July 17th, 2023 at 01:44 pm

The Programming Historian aims to develop digital skills among Humanities researchers through the publication of practical lessons in several languages.

The call “Computational analysis skills for large-scale humanities data” resulted in 7 new lessons.

One of them was the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt” developed by Daniel Gomes and Ricardo Campos.

It shows how to explore the Arquivo.pt user interface and the Application Programming Interface (API) to execute advanced queries, process large amounts of data or build new services, such as Tell me stories.

All the developed resources are freely available in open access.
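
As a rough idea of what querying the API from Python can look like, the sketch below builds a full-text search request and reads the JSON response. It is not taken from the tutorial. The endpoint `https://arquivo.pt/textsearch`, the parameters `q`, `from`, `to` and `maxItems`, and the `response_items` key are assumptions based on the public Arquivo.pt API documentation and should be checked there before use; the helper names are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Arquivo.pt full-text search API.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_textsearch_url(
    query: str,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
    max_items: int = 10,
) -> str:
    """Build a search URL; dates are timestamps such as 20190101000000."""
    params: Dict[str, Any] = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from  # assumed parameter name
    if date_to:
        params["to"] = date_to      # assumed parameter name
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def search(query: str, **kwargs: Any) -> List[Dict[str, Any]]:
    """Run the query over the network and return the list of result items."""
    with urlopen(build_textsearch_url(query, **kwargs), timeout=30) as response:
        payload = json.load(response)
    # "response_items" is the assumed name of the result list in the JSON.
    return payload.get("response_items", [])


if __name__ == "__main__":
    print(build_textsearch_url(
        "eleições europeias",
        date_from="20190101000000",
        date_to="20191231235959",
        max_items=5,
    ))
```

Keeping URL construction separate from the network call makes it easy to inspect or log the exact request before sending it.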

Open-access resources of the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt”

 

 

Cultural heritage on the Web: the online presence of museums

Last updated on July 7th, 2022 at 09:26 pm

The Portuguese Museums Network was the community invited to participate in the cycle of three webinars entitled “Cultural Heritage on the Web: online presence of museums”.

The aim is to raise awareness among museum managers and professionals about the importance of preserving content published on the Web and to publicize the services and tools of Arquivo.pt.

This initiative is promoted by the Direção Geral do Património Cultural, through the Departamento de Museus, Conservação e Credenciação and the Divisão de Museus e Credenciação, which incorporated the proposal of Arquivo.pt (FCT, I.P.) into its training offer.

Information and materials

June 21, 2022 – Arquivo.pt and the preservation of digital memory (1st Webinar)

In this session Arquivo.pt is presented as a useful service to museums and institutions that the community can count on to preserve digital cultural heritage, specifically Web content.

  • Speaker: Ricardo Basílio, digital curator (standing in for Daniel Gomes, manager of Arquivo.pt)
  • Duration: 15h30-17h00
  • Slides (PDF)
  • Video

June 22, 2022 – Publishing Well to Preserve Well (2nd Webinar)

This session deals with the aspects that an institution must take into account to create and maintain preservable websites.

  • Speaker: Pedro Gomes, responsible for the Arquivo.pt collections
  • Duration: 15h30-17h00
  • Slides
  • Video

June 27, 2022 – Archiving the Web: DIY (3rd Webinar)

This session offers a tutorial for creating a local web archive, recording content in a standard format with open tools that anyone can use.

  • Speaker: Ricardo Basílio, digital curator
  • Duration: 15h30-17h00
  • Video
  • Slides

June 28, 2022 – Repeat of the first session (extra session)

Open session for those who were not able to participate in the 1st session.

  • Speaker: Ricardo Basílio, digital curator
  • Duration: 15h30-17h00
  • Video
  • Slides

Online exhibition: discover museums’ online presence over time

 

Municipality of Sines and Arquivo.pt together on the International Archives Day

Last updated on June 27th, 2022 at 08:40 am

The Municipal Archive of the Municipality of Sines and Arquivo.pt celebrated the International Archives Day, June 9, at the Salão Nobre dos Paços do Concelho, with a Workshop on preserving the digital memory of Sines (Portugal).

The meeting was broadcast online with the aim of sharing with the community of archivists what has been an experience of collaborative curation of Web content.

Collaboration between a municipal archive and a web archive

This meeting continued a collaboration between the two teams that developed during the pandemic period.

The Arquivo Municipal de Sines made a selective and systematic collection of Web content related to the Municipality of Sines, with the collaboration of local media, such as Rádio Miróbriga and Rádio Sines.

In turn, Arquivo.pt provided training on tools such as Webrecorder.net, which records content in a standardized format, and prepared useful services such as SavePageNow, which allows pages to be recorded on the fly directly on Arquivo.pt.

Local history is better with preserved Web pages

This collaboration resulted in the preservation of thousands of Web pages (about 200 Gigabytes of information) about the experience of the pandemic in the geographical area of Sines and Santiago do Cacém.

The copies of the Web ARChive (WARC) files sent to Arquivo.pt have been integrated so that they can be made available.

Presentations

Training in collaboration with the Lisbon City Council

Last updated on December 13th, 2021 at 12:02 pm

A cycle of webinars was held between October and December 2021, organised by the Department of Development and Training of the Municipality of Lisbon, within the digital skills program Passaporte Competências Digitais (Câmara Municipal de Lisboa), in collaboration with Centro Qualifica +ValorLx, the ROSSIO Infrastructure and Arquivo.pt (Fundação para a Ciência e a Tecnologia, I.P.).

The aim of this initiative was to present the services of Arquivo.pt and disseminate their use so that the historical heritage published on the web can be preserved and exploited by any citizen.

The sessions were open by registration and had a total of 126 participants (average of 31 per session).

The speakers’ presentations were recorded and can now be accessed, along with the slides from each session.

Sessions held

September 15 – Arquivo.pt. What is it? What is it for?

Daniel Gomes, manager of Arquivo.pt, the public Web preservation service operated by the Fundação para a Ciência e a Tecnologia, I.P., explained how any citizen can use it to consult Web pages from the past in a wide range of situations and talked about the importance of preserving digital memory.

November 11 – Arquivo.pt API: automatic access to preserved Web information

Vasco Rato, web developer at Arquivo.pt, presented Arquivo.pt’s APIs (Application Programming Interfaces). These enable the development of innovative and useful applications for organizations through the automatic processing of historical information preserved from the Web.

November 25 – Archive the Web: do-it-yourself!

Ricardo Basílio, digital curator at Arquivo.pt, presented a tutorial on using the Webrecorder.net tools to record Web pages in a standardized format on one's own computer, which allows a person or an organization to set up a small-scale Web archive of their own.

December 9 – Publish on the Web: best practices by Arquivo.pt

Pedro Gomes, the engineer responsible for the Arquivo.pt crawls, addressed the issue of publishing preservable web content. How much content is in formats that make future access difficult or impossible? These situations were illustrated with practical cases and recommendations on how to avoid them. In short, it all boils down to publishing well in order to preserve well.

Know more about Arquivo.pt training

Arquivo.pt is open to collaborations aimed at training professionals in organizations, or members of the public, in Web preservation.

Learn about the training modules and contact us.

 

Book “The Past Web: Exploring Web Archives” available in Green Open access!

Last updated on September 13th, 2022 at 04:15 pm

No book reflecting the state of the art in web preservation and the research conducted on web archives had been published since 2006.

The main goal of the new book The Past Web: Exploring Web Archives was to create a new, up-to-date resource to educate more people in the field of web preservation and to make web archives known to researchers and academics.

As such, the book is primarily aimed at the academic and scientific communities, and presents the most innovative methods for exploring information from the past preserved by web archives.

Daniel Gomes, head of Arquivo.pt, led the book’s editorial team, along with the field specialists Elena Demidova, Jane Winters and Thomas Risse. In total, the book resulted from the contributions of 40 authors from around the world who are experts in web archiving.

The book is divided into 6 parts where we find various resources for exploring pages archived from the Internet since the 1990s.

We can also learn how to preserve our collective memory in the Digital Era, which strategies to use when selecting online content, and what impact web archives have on preserving historical information.

The book aims to support professors in their mission to transmit innovative and adequate knowledge for the digital literacy required to train professionals for the 21st century.

Daniel Gomes, from Arquivo.pt, stresses the need to include web archives in teaching plans and emphasizes that this knowledge brings a great competitive advantage, especially for students of Humanities and Social Sciences.

An innovative detail of this book is that all its cited links have been preserved by Arquivo.pt so that the references remain valid over time.

The book was available to download for free from Portuguese higher education institutions (b-On member entities) until March 6th 2022.

However, you can still download a pre-print version of the book (Green Open Access).

Links

Book launch at Jornadas FCCN 2021

Book presentation

“Art Forever on the Web”: Cycle of Webinars

Last updated on July 6th, 2021 at 01:23 pm

Colectiva de Artistas. 2008.04.19 a 2008.06.07. Galeria Quadrado Azul. Porto. Composition from a Webpage preserved on Arquivo.pt: www.quadradoazul.pt, 22nd October 2008.

On April 29, May 27 and July 1, from 3 to 4:30 pm, webinars geared to the community of artists, curators, gallerists and event producers will be held, open also to anyone interested in learning more about preserving art websites.

Throughout the sessions, participants will learn in detail about the functionalities of Arquivo.pt in order to take advantage of this public Web preservation service. They will have technical information, in the form of recommendations and best practices, to create preservable websites. Finally, they will learn how to use available tools to save their websites in a standardized format so that their contents are not lost.

This cycle of Webinars is an initiative of the “Forever” Project, a collaboration between the Calouste Gulbenkian Foundation Art Library and Arquivo.pt under the ROSSIO infrastructure.

For more details and sharing, please see the program (PDF) (in Portuguese).

Sign up!

April 29 – Arquivo.pt and the preservation of digital memory
May 27 – Recommendations for creating preservable websites for the future
July 1 – Archiving the Web: do-it-yourself!

Presentations from the sessions held

Online archives or archives of the online?

At the end of 2020, we recommend some texts that put the future in perspective.

We highlight the theme of preserving online content presented in the ebook “Tendências 2021” (Trends 2021). The contribution of Daniel Gomes, the Arquivo.pt manager, was entitled “Arquivos online ou do online?” (Online archives or archives of the online?).

I was invited to write about the challenges and threats to online archives. The first question that came to me was what is meant by an “online archive”?

My concern lies in the “archives of the online” because there is not even an established awareness about their need, whether at an academic, governmental or individual level.

It is technologically impossible to preserve all information available online. But it is absurd not to be aware that we have to preserve some of the information online for short, medium and long term access.

The complete text (in Portuguese) is available on pages 23 to 26 of the open-access book “Tendências 2021”.

The challenge is to cultivate awareness about the importance of preserving content online by learning how to do it in practice.

Happy New Year!