Arquivo.pt took part in the RESAW 2025 conference

June 12, 2025 by Ricardo Basílio

Arquivo.pt was present at the 6th RESAW Conference for researchers in the Digital Humanities, Media and Communication and other areas, on the theme of ‘The Datafied Web’, which took place at the University of Siegen, Germany, from 4 to 6 June 2025.

RESAW (Research Infrastructure for the Study of Archived Web Materials) is an informal initiative that brings together researchers who use web archives in their research. RESAW’s first conference was in 2015, and it is now held every two years.

Initially, RESAW gathered European researchers, but it now attracts researchers from all over the world and has become a unique forum of its kind. In 2025, it had more than 100 participants. It gathers the leading researchers in the use of web archives for research.

Niels Brügger, Professor of Media and Communication at Aarhus University in Denmark, has been its main driving force for 10 years.

Other leading researchers with studies on web archives are: Valerie Schafer from the University of Luxembourg, Jane Winters from the University of London, Anne Helmond from the University of Utrecht, Susan Aasman from the University of Groningen, Sophie Gebeil from Aix-Marseille University and Ian Milligan from the University of Waterloo.

This year’s theme, “The Datafied Web”, addressed the datafication of the Web, from its beginnings in the 1990s to the present day, marked by massive data processing and the use of Artificial Intelligence.

Why would a web archive take part in an academic meeting?

Arquivo.pt has been a regular participant in RESAW since 2019, as it wants to make itself increasingly known as a service for national and international researchers.

Thanks to participation in international events such as RESAW, several publications have appeared that use and refer to Arquivo.pt. Any researcher with Internet access can search the information preserved on Arquivo.pt, use the APIs, process information or train their models.

We invite Portuguese researchers to take part in this meeting, as we have been the only Portuguese presence in several editions. We have an accessible web archive, ready to use, which is not the case in other countries. We would like to see researchers in the fields of Digital Humanities and Media and Communication in Portugal using Arquivo.pt more often and actively participating in meetings like RESAW.

Arquivo.pt’s contribution to RESAW 2025

Arquivo.pt contributed two presentations to the 2025 edition of the RESAW meeting, held at the University of Siegen. The first was about the Arquivo.pt APIs and their application in a research context, by Vasco Rato. The second was about the open datasets and lists of websites on topics and events that Arquivo.pt has prepared to help researchers start exploring archived information in greater depth.

Image gallery

RESAW 2025 at the University of Siegen

Arquivo.pt at the University of Coimbra to talk about digital preservation

May 10, 2025 by Ricardo Basílio

Last updated on May 22nd, 2025 at 06:45 pm

Arquivo.pt took part in the workshop “Digital preservation: tools and practices”, promoted by the Faculty of Letters of the University of Coimbra, on the afternoon of May 7, 2025. We highlight the opening panel, moderated by Inês Santos, with excellent talks by Moisés Rockembach (University of Coimbra), Humberto Innarelli (Unicamp, Brazil) and Daniel Gomes (Arquivo.pt, digital service of FCCN-FCT).

The aim of the meeting was to offer the community a critical reflection on new trends in digital preservation tools and practices.

Digital preservation is a cross-cutting issue for organizations, as they all produce and generate information in digital format. There is a growing range of tools and solutions that promise greater efficiency in information processing. Many are labeled Artificial Intelligence. Such an abundance of products and frameworks calls for greater discussion and a critical approach. And this was achieved brilliantly by the panel of speakers.

Three approaches to Artificial Intelligence and Digital Preservation

The meeting brought together three authors of works on digital preservation in Amphitheatre III of the Faculty of Letters of the University of Coimbra to discuss different approaches.

Moisés Rockembach, co-author with Caterina Pavão of Arquivamento da Web e preservação digital (Archiving the Web and Digital Preservation), the first work in Portuguese on web archives, focused his presentation on the impact of Artificial Intelligence on digital preservation systems, for example on searching for and accessing information and on classification and indexing processes. With regard to the impact of the new tools that digital technology offers us, he quoted Demi Getschko: “The process of searching for and capturing information described in the text could certainly be improved in the future, especially when considering the contribution of new tools, such as those of Artificial Intelligence”.

There are Artificial Intelligence tools that offer interesting new ways and formats for accessing information. Archiving must take this reality into account and test the extent to which it can transform the way in which many types of content are disseminated and accessed. One example to illustrate this idea was the presentation of a Podcast generated by Artificial Intelligence, based on chapter 2 of the book on Web Archives, which deals with digital preservation policies.

Link to Podcast generated by Artificial Intelligence (published on Instagram, in Portuguese)

Humberto Innarelli, author of Criptex da preservação digital (Digital preservation cryptex), coordinator of the Arquivo Edgard Leuenroth (AEL) and specialist archival researcher at Unicamp, São Paulo and PhD professor at the Paula Souza Centre, São Paulo, posed the question of the future of digital preservation. Until now, the practice for preserving dynamic digital content has been to convert it into static documents. On the other hand, information is increasingly given to us dynamically, from databases or algorithms and Artificial Intelligence. What’s the next step? Archival practice needs to look not only at metadata, as it has done in recent years, but also at what explains how the information was generated (what we might call paradata). This is the only way to put archives and digital preservation in the long-term perspective. A hundred or two hundred years from now we should still be able to access the digital information produced today.

Daniel Gomes, editor of the book The Past Web and founder of Arquivo.pt, discussed the issue of Artificial Intelligence as it relates to non-artificial, human-produced content. What added value do tools that generate text, images, audio or video bring? If we consider, for example, that a Podcast on digital preservation used a book written by a human author as its basis, what new knowledge did it generate? Little or none. So, what has come to be called Artificial Intelligence can be considered a way of presenting human knowledge and in no way exempts humanity from continuing to think, research and produce new knowledge.

Arquivo.pt preserves content that has been published by individuals and organizations and in this sense is a unique source of its kind. Information published on the web is important for reporting and better understanding recent history, since the 1990s. Any Artificial Intelligence tool will have to go back to the point where the information was created by people. The human origin of the content preserved by Arquivo.pt, and the same can be said of traditional archives, makes them of enormous value, even considering their economic value. How much is the information stored in a web archive worth?

New MOOC (Massive Open Online Course) about web archiving

Daniel Gomes, Manager of Arquivo.pt, gave the first public announcement of the online course on the NAU platform: The Web of the Past: Preservation and Research (in Portuguese).

The online course or MOOC (Massive Open Online Course) is available for those who want to deepen their knowledge of web preservation.

The short link for dissemination is arquivo.pt/mooc

Preserved Arquivo.pt data and its automatic processing by APIs

Vasco Rato, developer of Arquivo.pt, showed how the automatic processing interfaces, Application Programming Interfaces (APIs), work.

Arquivo.pt data can be processed by Artificial Intelligence. The works competing for the Arquivo.pt Award have already demonstrated this, as have projects such as GlórIA, a Large Language Model developed at NOVA-FCT.

Finally, Ricardo Basílio, digital curator, showed how anyone can save a page or an entire website on their own computer in a standardized format, compatible with web archives. ArchiveWeb.page and browsertrix-crawler were used for this, as training tools. This practice allows the community to be increasingly active in preserving institutional information published on the Web.

Agenda

14h30 Panel – Moderator: Inês Santos, University of Coimbra

Digital Preservation and Artificial Intelligence – Moisés Rockembach, University of Coimbra – Slides
Cryptex for Digital Preservation: The Next Step – Humberto Innarelli, Unicamp – Slides
Arquivo.pt and Web Preservation – Daniel Gomes, FCCN-FCT – Slides

16h00 Break

Open Data for Research. Automatic information processing through APIs – Vasco Rato, FCCN-FCT – Slides
Demo – Archiving the Web: do-it-yourself – Ricardo Basílio, FCCN-FCT – Slides
- Manual recording demo with ArchiveWeb.page
- Automatic recording demo with Browsertrix-crawler

17h00 – End

Image gallery

Images on the Coimbra University social media

Video of some moments from the event (published on Facebook)

Workshop at the Faculty of Letters of the University of Coimbra

Arquivo.pt Links Dataset: Unveiling the Web’s Hidden Structure

April 30, 2025 by Ricardo Basílio

Last updated on May 13th, 2025 at 02:29 pm

The interconnected nature of the World Wide Web has long fascinated researchers and technologists alike. Today, we are thrilled to announce the release of the Arquivo.pt Links dataset, a comprehensive collection that opens new possibilities for understanding and analyzing web connectivity patterns.

The dataset encompasses more than 139 million webpage URLs, each accompanied by crucial metadata about their incoming links – both the source URLs and their corresponding anchor texts, i.e., visible and clickable text in hyperlinks. This rich collection of interconnection data provides researchers with a unique window into the web’s underlying structure.

The importance of hyperlinks in web architecture cannot be overstated. They serve as the fundamental building blocks of web navigation and discovery, enabling both users and automated systems to traverse the vast landscape of online content.

Links formed the foundation of Google’s revolutionary PageRank algorithm, which transformed our approach to information retrieval and web search. PageRank’s fundamental insight – that a page’s importance could be measured by analyzing its incoming links – revolutionized search technology and remains influential in modern information retrieval systems.

By making this dataset publicly available, Arquivo.pt enables researchers to explore similar innovative approaches to web analysis and search engine development. The dataset opens up numerous exciting research possibilities across multiple domains:

Researchers can implement and experiment with various ranking algorithms, from classic approaches like PageRank to modern machine learning-based techniques. The inclusion of anchor texts provides valuable semantic context that can enhance search relevance and document classification.
The dataset enables deep analysis of web topology and link structures. Researchers can investigate questions about web connectivity patterns, identify clusters of related content, and study how information spreads across the web through link networks.
The anchor text associated with each link offers a rich source of human-generated descriptions of web content. This data can be particularly valuable for developing and testing document summarization algorithms, semantic analysis tools, and automated classification systems.
For web archiving researchers, this dataset provides insights into how web pages are connected and referenced over time, offering valuable data for studying web preservation strategies and digital heritage maintenance.
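
As a starting point for the ranking experiments mentioned above, here is a minimal sketch of PageRank computed by power iteration over a map of incoming links. The function name, the tiny three-page graph and the in-memory dictionary layout are assumptions made for illustration and do not reflect the dataset's actual file format; the damping factor of 0.85 is the value conventionally used in the literature.

```python
def pagerank(incoming, damping=0.85, iterations=50):
    """Power-iteration PageRank over {target: [source, ...]} incoming-link lists."""
    pages = set(incoming)
    for sources in incoming.values():
        pages.update(sources)
    if not pages:
        return {}

    # Out-degree of each page, derived from the incoming-link lists.
    out_degree = {page: 0 for page in pages}
    for sources in incoming.values():
        for source in sources:
            out_degree[source] += 1

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if out_degree[p] == 0)
        new_rank = {}
        for page in pages:
            link_share = sum(rank[s] / out_degree[s] for s in incoming.get(page, []))
            new_rank[page] = (1 - damping) / n + damping * (link_share + dangling / n)
        rank = new_rank
    return rank


# Hypothetical toy graph: "a" links to "b" and "c", "c" links to "b".
toy_incoming = {"a": [], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(toy_incoming)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links ends up with the highest score. A real experiment would need a sparse or streaming representation, since the dataset holds well over a hundred million URLs.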

Methodology

The process begins with a temporal snapshot of web pages from a specific time period (collection). During this initial phase, our systems analyze each captured page, extracting all outgoing hyperlinks along with their associated anchor texts and capture timestamps. This creates a preliminary mapping of how pages connect to one another within our captured timeframe.

What makes this dataset particularly valuable is its inverted link structure. Rather than organizing the data around source pages and their outgoing links, we’ve created an inverted map that centers on destination pages and their incoming links. This approach is particularly useful for analyzing a page’s importance or authority within the web’s structure, as it provides immediate access to all pages that reference or point to a given URL.

Consider a traditional link structure where Page A links to Pages B, C, and D. In our inverted structure, we instead see entries for Pages B, C, and D, each listing Page A as a source of incoming links. This reorganization of the data facilitates more efficient analysis of page authority and influence, making it particularly valuable for researchers working on ranking algorithms or studying information flow patterns across the web.
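
The inversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to build the dataset; the function name, the example URLs and the anchor texts are made up.

```python
from collections import defaultdict


def invert_links(outgoing):
    """Turn {source: [(target, anchor), ...]} into {target: [(source, anchor), ...]}."""
    incoming = defaultdict(list)
    for source, links in outgoing.items():
        for target, anchor in links:
            incoming[target].append((source, anchor))
    return dict(incoming)


# Page A links to pages B, C and D (hypothetical URLs and anchor texts).
outgoing = {
    "http://a.example/": [
        ("http://b.example/", "about B"),
        ("http://c.example/", "news from C"),
        ("http://d.example/", "D homepage"),
    ],
}
incoming = invert_links(outgoing)
```

After the inversion, `incoming` has one entry for each of B, C and D, and each entry lists page A together with the anchor text it used.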

The Arquivo.pt links dataset combines three distinct web collections:

PWA9609 (1996-2009): 89 million pages capturing early Internet evolution, focused on the .pt domain. This historical collection provides insights into early web linking patterns.
AWP38 (Oct-Nov 2021): 44 million pages offering a contemporary snapshot of web connectivity, with emphasis on the .pt domain while including broader Internet content.
FAWP47 (Oct-Dec 2021): 8 million pages from daily captures of .pt domain content, designed to track short-term changes in link patterns.

Getting Started with the Dataset

Researchers can access the complete dataset. The data is provided in a format that supports efficient processing and analysis, making it suitable for both large-scale studies and focused investigations.

Conclusion

The release of the Arquivo.pt links dataset represents a significant contribution to the web science research community. By making this rich collection of web connectivity data freely available, we hope to facilitate innovative research and deepen our understanding of the web’s complex structure.

We encourage researchers to explore this dataset and look forward to seeing the novel insights and applications that emerge from its analysis. Whether you’re interested in developing new search algorithms, studying web topology, or investigating content relationships, this dataset provides a robust foundation for your research.

Arquivo.pt is a finalist for the DPC Awards 2024

May 13, 2024 by Ricardo Basílio

Last updated on August 12th, 2024 at 11:50 am

The Digital Preservation Coalition Awards

The Digital Preservation Coalition (DPC) is dedicated to promoting digital preservation and associated best practices.

The DPC Awards promote exemplary and innovative digital preservation use cases from all over the world.

The Arquivo.pt team submitted two applications to the DPC Awards 2024 in the categories of “Safeguarding the Digital Legacy” and “Research and Innovation”.

The Award for Safeguarding the Digital Legacy celebrates the practical application of preservation tools to protect at-risk digital objects.

The Award for Research and Innovation recognizes excellence in practical research and innovation activities.

Arquivo.pt applications to the DPC Awards

#1 Arquivo.pt catalog of tools for digital preservation

Information that rules modern-day lives is born-digital and disseminated online. However, invaluable digital objects published online have been continuously lost.

Arquivo.pt is a public infrastructure which supports the preservation of digital objects published online to safeguard this digital legacy for future generations.

Thus, in October 2023 after 15 years of research and development, Arquivo.pt released a Catalog of 13 innovative tools to support the preservation of at-risk online content, from acquisition to dissemination (e.g. search and access, APIs, training, open data sets, exhibitions).

Arquivo.pt safeguards online digital objects of worldwide interest for research and education.

The Arquivo.pt Catalog was selected as a finalist for the Safeguarding the Digital Legacy Award.

See the application documentation

#2 Searching preserved web-images

Images published online are precious digital assets that document contemporary times for future generations.

This initiative describes the research and development of an innovative image search system that enables the discovery and access to billions of preserved images acquired from the web since the 1990s.

This research was applied to enhance the Arquivo.pt web archive with an image search service publicly available to any Internet user, officially launched in August 2022.

The resulting scientific and technical publications are available in open-access and the developed software is available as free open-source to be reused and enhanced by the community.

This work on searching images preserved in web archives applied for the Research and Innovation Award.

See the application documentation

Know more

3 minute pitch (video, slides)
DPC Awards Nomination Pack
Finalists of the Digital Preservation Awards 2024
Arquivo.pt catalog of services
Daniel Gomes, Web archives as research infrastructure for digital societies: the case study of Arquivo.pt, Archeion 123, 2022 (pre-print version, video, slides)
André Mourão, Daniel Gomes, Searching images in a web archive, 10th IEEE International Conference on Data Science and Advanced Analytics 2023 (ppt).
Inside Arquivo.pt image search service (video in English)

Analysis of the Arquivo.pt query dataset

May 6, 2024 by Ricardo Basílio

Last updated on December 10th, 2024 at 03:12 pm

Arquivo.pt query logs are unique resources for research

Arquivo.pt provides a “Google-like” service that enables searching pages and images collected from the web since the 1990s. Note that Arquivo.pt search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites.

Analyzing user behavior is an important research topic to understand users’ information needs and enhance the quality of search results. Thus, when a user interacts with a search engine, the system records the user’s actions in a file called the query log. Query logs from web archives are unique resources for research because they describe the real needs of web-archive users about the historical information published online over time.

Research case study

Flavie Gallois and Adam Jatowt from the University of Innsbruck, and Ricardo Campos from the University of Beira Interior and INESC TEC analyzed user search behavior based on the Arquivo.pt search query log dataset collected over a period of 3 months from June to September 2021 (Analyzing User Search Behaviour in Temporal Web Repositories through Search Query Log Analysis).

This study analyzed query features such as length, type or frequency and compared the obtained results with previous work about user search behavior over web-archives and live-web search engines.

This study revealed interesting trends and patterns about how users search for information within web archives, with strong potential for future research work.

How do web-archive users search?

Figure 1: Distribution of country origin of users

Figure 2: Distribution of languages used in queries

The users came from Portugal in 85.7% of the queries. However, the Portuguese language was identified through automatic language identification of queries as being used in only 37% of the queries. This suggests that users apply other languages than their own to search in web archives.

Users of Arquivo.pt tend to use longer queries with more words and characters in comparison to previous studies, both over web archives and live-web search engines. About 92% of the queries had 5 or fewer terms (average of 25 characters), with 3 being the most common number of submitted terms. In previous work about search behavior in web archives, it was observed that users tended to submit from 1 to 3 terms per query, with 1 term as the most common submission.

Users tend to issue multiple queries within a session instead of a single query, possibly indicating a need for refining their search queries or exploring multiple options for inquiry.

87.7% of the queries submitted to Arquivo.pt came from desktop browsers, despite Arquivo.pt providing mobile-friendly user interfaces. Old web-archived pages are not responsive and render poorly on mobile devices, so it is to be expected that users mostly access web archives from desktop computers.

Figure 3: Arquivo.pt users can refine the time span of their queries by using the From and To datepickers.

Users refined the time span of the search (using the datepickers) in about 50% of queries, which indicates awareness of the temporal needs peculiar to web-archive usage. Interestingly, users modified the From datepicker more frequently than the To datepicker. Note that keeping the default time span may fit the user’s information needs and does not necessarily indicate a lack of awareness of the function to define the time span (peculiar to web-archive search).

Only a small percentage of users included specific years in their query terms (4%), potentially suggesting that in these cases the time span function was insufficient, or unnoticed by some users.

The obtained results suggest that users are more conscious of their information needs and have improved their search techniques to be more effective over web-archives instead of just using them out of curiosity as first-comers.

What is searched in a web-archive?

The authors of the study applied automatic named entity recognition over the user queries and derived a set of word clouds that graphically provide a glimpse of the most common information needs of Arquivo.pt users:

Figure 4: Word cloud of the most frequent query terms submitted to Arquivo.pt.

Figure 5: The most frequent Geographical Locations in query terms submitted to Arquivo.pt.

Figure 6: The most frequent Organizations in query terms submitted to Arquivo.pt

Figure 7: The most frequent Persons in query terms submitted to Arquivo.pt.

Access to research Arquivo.pt query dataset

Arquivo.pt released a set of resources to support research studies over its query logs:

Query log dataset

Query_Dataset_Sample.csv: sheet containing a sample of the query dataset.
Query_Dataset_ArquivoPT.7z (in UTF-8): this file contains the full query log dataset available for research, collected over a period of 3 months from June to September 2021. Take care when opening it, because some programs such as Microsoft Excel may use the wrong charset and damage the content, for instance in column L “QUERY”.
- See How to set character encoding when opening a CSV file in Excel? – Super User
- Generated Logs file in XLSX format
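
One way to avoid the charset problem is to read the CSV programmatically with an explicit UTF-8 encoding. The sketch below does that and counts how many terms each query has, in the spirit of the query-length analysis described earlier. It assumes a comma-separated file with a header row containing a column named QUERY; the actual delimiter and header names should be checked against the released files. The three sample queries are invented so that the example runs without the dataset.

```python
import csv
import io
from collections import Counter


def term_count_distribution(csv_file, query_column="QUERY"):
    """Count how many queries have 1, 2, 3, ... whitespace-separated terms."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        query = (row.get(query_column) or "").strip()
        if query:
            counts[len(query.split())] += 1
    return counts


# Invented sample standing in for the real file. With the released sample one
# would instead pass:
#   open("Query_Dataset_Sample.csv", encoding="utf-8", newline="")
sample = io.StringIO("QUERY\njornal público\neleições europeias 2019\nsapo\n")
distribution = term_count_distribution(sample)

# Share of queries with five or fewer terms.
total = sum(distribution.values())
share_up_to_five = sum(n for terms, n in distribution.items() if terms <= 5) / total
```

Passing `encoding="utf-8"` explicitly keeps accented characters in the queries intact regardless of the operating system's default charset.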

Original log files (samples)

Query_Log_Page_Search_Log4j_Sample.txt: raw sample of the page search query log (Log4j format) randomly selected.
Query_Log_Image_Search_Log4j_Sample.txt: raw sample of the image search query log (Log4j format) randomly selected.
Query_Log_Apache_HTTPD_Sample.txt: raw query log sample of the Apache HTTPd

Documentation

Evaluation Metrics for web-archive search

The first step towards understanding user behavior is to define evaluation metrics. Metrics are a powerful tool for setting long and short-term goals and for deciding which new products and features should be released to users.

We share a work-in-progress report which aggregates information about Web Archive Search Evaluation Metrics. This contributes to comparing users’ search behavior between live-web and web-archive search engines. Feel free to comment directly on the collaborative document or to contact us.

This report also summarizes references to previous work, the query workflows and the structure of the corresponding query logs produced by Arquivo.pt, to make it easier for researchers to study these datasets.


Artificial Intelligence processes data from Arquivo.pt

March 1, 2024 by Ricardo Basílio

Last updated on July 16th, 2024 at 08:33 am

Artificial Intelligence (AI) covers various areas of knowledge, such as linguistics and computing, and is present in the new technologies used by citizens on a daily basis.

One example is when we search for information on the Internet and the computer generates an amazingly accurate response, in a language very close to our own.

Natural Language Processing (NLP) is what allows machines to perfect the algorithm that generates these answers tailored to Internet users.

The problem is that natural language processing models have been developed more for the English language and less for Portuguese and other languages with less representation.

The more the processing models are trained on a language, the better they will be able to interpret the complexities of the language. But this is only possible if quality data is available.

Portuguese text collection on Arquivo.pt available for research

Arquivo.pt appears here as the largest Portuguese-language textual dataset in Portugal, available in open access, for researchers to train NLP models.

In recent years, researchers from various research groups and projects have drawn attention to the usefulness of preserved web data for large-scale processing.

Arquivo.pt has more than 1 Petabyte of preserved web content dating back to the 1990s, including everything that can be found on web pages. It’s not just text, but also images, audio files, video, page code and various metadata.

The content is accessible via the search interface and the Arquivo.pt APIs.
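
As an illustration of programmatic access, the following sketch builds a full-text search request with a time range. The endpoint URL, the parameter names (q, from, to, maxItems) and the 14-digit timestamp format are assumptions based on our reading of the public API and should be checked against the official Arquivo.pt API documentation before use; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and parameter names; verify against the official API docs.
TEXT_SEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, date_from=None, date_to=None, max_items=10):
    """Build a search URL; dates are assumed to be YYYYMMDDHHMMSS strings."""
    params = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    return TEXT_SEARCH_ENDPOINT + "?" + urlencode(params)


def search(query, **kwargs):
    """Send the request and return the decoded JSON (needs network access)."""
    with urlopen(build_search_url(query, **kwargs), timeout=30) as response:
        return json.load(response)


url = build_search_url(
    "eleições europeias",
    date_from="20190101000000",
    date_to="20191231235959",
)
```

Only the URL is built here; calling `search(...)` would perform the actual request.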

In order to make it easier to download archived resources from the web, Arquivo.pt has created indexes for researchers in CDXJ format.
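
For readers unfamiliar with CDXJ, each line of such an index typically holds a sortable URL key, a 14-digit capture timestamp and a JSON block describing the capture. The sketch below parses one line under that assumption. The example line and the JSON field names (url, status, mime) are invented for illustration and may differ from the fields present in the Arquivo.pt indexes.

```python
import json
from datetime import datetime


def parse_cdxj_line(line):
    """Split a CDXJ line into its sort key, capture time and JSON fields."""
    key, timestamp, json_block = line.strip().split(" ", 2)
    return {
        "key": key,
        "captured": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "fields": json.loads(json_block),
    }


# Hypothetical index line, not taken from a real Arquivo.pt index.
line = 'pt,example)/ 20211015123000 {"url": "http://example.pt/", "status": "200", "mime": "text/html"}'
record = parse_cdxj_line(line)
```

Because the key and timestamp come first, the index can be sorted and searched by URL and date without decoding the JSON part of every line.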

GlórIA, a model for the Portuguese language

One of the projects that used Arquivo.pt to obtain large amounts of text is called GlórIA and is a large-scale language model (LLM) focused on the European Portuguese language.

“Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese” as the authors of GlórIA project, Ricardo Lopes, João Magalhães, David Semedo, researchers at the NOVA School of Science and Technology, explain in their article GlórIA – A Generative and Open Large Language Model for Portuguese.

The model used 35 billion tokens, or expressions that machines can process, from various sources.

Arquivo.pt contributed a collection of 1.4M European Portuguese archived news and periodicals.

You can try generating text in European Portuguese using the GlórIA API inference on the Hugging Face Model card.

If you want to develop a project or study using Arquivo.pt, you can start your research and, if you need help, contact us.


Prepare a work for the Arquivo.pt Award 2024!

October 27, 2023 by Ricardo Basílio

Last updated on August 6th, 2024 at 05:15 pm

Arquivo.pt challenges you to create a work based on historical information preserved from the Web. Submissions are open until May 6, 2024.

In this 7th edition of the Arquivo.pt Award, €15,000 will be awarded to the 3 best works (€10,000 for 1st place), plus 3 honorable mentions.

Know more at: arquivo.pt/award

Honorable mentions for authors and professors

To promote the use of Arquivo.pt in teaching, research and professional contexts, three partner institutions sponsor honorable mentions with an associated prize.

The Público newspaper will award an Honorable Mention to works based on the Público online content preserved by Arquivo.pt.
The Aveiro Media Competence Center (AMCC) will award an Honorable Mention to the best work on the web archive of one or more Portuguese online media.
Association DNS.PT will award an Honorable Mention to a professor or teacher who has encouraged the submission of works.

Share and spread the word!

Help us spread the word about the Arquivo.pt Award 2024 among potential candidates!

Cross-lingual research datasets on 2019 European Parliamentary Elections

September 18, 2023 by admin

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

Last updated on December 13th, 2024 at 01:55 pm

Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned.

The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified, and then, automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. Besides that, in parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.

In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content.

The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy

The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to find ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students.

Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022.

The idea was to develop part of his research about syntactic structures of EU languages using the textual resources preserved by Arquivo.pt and to exchange knowledge with the web-archiving experts on strategies to extract and process historical web data.

Diego Alves defended his Ph.D thesis entitled Computational typological analysis of syntactic structures in European languages in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

Extract text: The textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, to separate the texts written in different languages across distinct files;
Clean extracted texts: A Python script was applied to clean the texts by removing unnecessary information (e.g. repeated passages and empty lines);
Double-check of language identification: The language of each cleaned text was verified again to eliminate possible errors introduced during the previous steps.
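The cleaning and language double-check steps can be sketched as follows. This is a minimal illustration, not the script used in the project: the function names `clean_text` and `group_by_language` are hypothetical, and the language detector is passed in as a callable so the sketch runs without extra dependencies. In practice the `detect` function of the langdetect library mentioned above could be supplied.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def clean_text(text: str) -> str:
    """Remove empty lines and repeated lines, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue  # skip blank lines and exact repeats
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def group_by_language(
    texts: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Clean each text, then detect its language again on the cleaned version.

    Running detection after cleaning mirrors the double-check step: noise
    removed by cleaning can no longer mislead the detector.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in texts:
        cleaned = clean_text(raw)
        if not cleaned:
            continue  # nothing left after cleaning
        groups[detect(cleaned)].append(cleaned)
    return dict(groups)
```

Each key of the returned dictionary would then correspond to one output file per language, as in the published dataset.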

Two new research datasets are openly available!

The result was a publicly available dataset of cleaned and language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens of each corpus extracted from the collection 2019 European Union Elections preserved by Arquivo.pt (EAWP23).

The aforementioned corpus was automatically annotated with part-of-speech tags and dependency relations to produce a corpus with syntactic information, which is useful for linguistic studies.

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied.

The texts in these annotated corpora follow the same order as in the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
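For readers unfamiliar with CoNLL-U, the sketch below shows one way to read such a file in plain Python. It relies only on the general layout of the format: one token per line with ten tab-separated fields, comment lines starting with `#`, and a blank line between sentences. The function name `parse_conllu` and the two-token sample sentence are illustrative assumptions and are not taken from the dataset.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Return a list of sentences, each a list of token dictionaries."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Invented two-token sentence, for illustration only.
SAMPLE = (
    "# text = Elections matter\n"
    "1\tElections\telection\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tmatter\tmatter\tVERB\t_\t_\t0\troot\t_\t_\n"
)

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

The `head` field holds the index of the governing token within the sentence (0 for the root), so the dependency tree can be rebuilt from the `id`, `head` and `deprel` columns alone.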

The syntactically annotated texts about the 2019 European Elections are publicly available!

Know more

Robustness of Corpus-Based Typological Strategies for Dependency Parsing, Event Analytics across Languages and Communities, 2024
“Secondments@Arquivo.pt and new research tools available” and “Robustness of Corpus based Typological Strategies for Dependency Parsing”, presentations at CLEOPATRA final event, 2023
Dataset of cleaned and language-verified texts about the 2019 European Elections (Raw texts)
Dataset of syntactically annotated texts about the 2019 European Elections (CoNLL-U texts)
Python script to extract language-specific texts from Arquivo.pt through a list of keywords
Computational typological analysis of syntactic structures in European languages, Diego Alves PhD thesis, 2023
Diego Alves personal page
Arquivo.pt APIs
Robustness of Corpus-based Typological Strategies for Dependency Parsing, Diego Alves and Daniel Gomes, Event Analytics across Languages and Communities book, Springer.

Arquivo.pt presentations at IIPC GA/WAC, RESAW 2023 and CLEOPATRA

June 16, 2023 by Ricardo Basílio

Last updated on March 10th, 2024 at 05:23 pm

Meeting the Web Archive Community

The International Internet Preservation Consortium (IIPC), a consortium that brings together Web preservation initiatives from around the world, held its General Assembly with its members on May 10, 2023.

On the following days, May 11 and 12, the IIPC Web Archiving Conference (IIPC WAC) was held, an initiative open to the community, where people or entities not associated with the IIPC and interested in the Web preservation domain can participate.

The two events were jointly hosted by KB – National Library of the Netherlands, and by Beeld & Geluid – Netherlands Institute for Sound & Vision.

Contributions from Arquivo.pt at the Web Archiving Conference

Arquivo.pt participated in the IIPC working group meetings (Training Working Group and Curators Working Group) and contributed with presentations in the thematic sessions Collaborations & Outreach and Program infrastructure (sessions 7 and 17).

Arquivo.pt updates 2023 (slides)
Linking web archiving with arts and humanities: the collaboration between ROSSIO and Arquivo.pt (video, slides)
Arquivo.pt behind the curtains (slides)

Meeting the RESAW research community

RESAW (Research Infrastructure for the Study of Archived Web Materials) is an initiative created in 2012 with the aim of promoting studies based on archived Web content, in areas such as Social Sciences, Digital Arts and Humanities.

The RESAW 2023 conference was held at the MUCEM Lab (Mediterranean Institute of Heritage Crafts) in Marseille on June 5-6, 2023, under the theme Exploring the Archived Web During a Highly Transformative Age.

Contributions from Arquivo.pt to RESAW 2023

Arquivo.pt contributed with presentations to the sessions Web Archive in Mediterranean area and its merge (4.A), From online Tools to Web Archive (6.B), Towards a participatory approach to collections (9.A) and Digging up the materials for writing web history (9.B).

How to research governmental web data? (abstract, slides)
Archiving Cryptocurrencies (abstract, slides)
Time to explore, time to learn from the archived web: Arquivo.pt training initiative (abstract, slides)
Exhibiting Web Memories from Arquivo.pt: a call for community participation (abstract, slides)

CLEOPATRA Project Meeting

The CLEOPATRA Project, led by the L3S Research Center at the Gottfried Wilhelm Leibniz University of Hannover, has run a training programme for doctoral researchers (Early Stage Researcher, PhD) since 2019.

Arquivo.pt has participated in three courses: Incentives design for hybrid multilingual information processing and analytics, in Southampton; National and transnational media coverage of European parliamentary elections, 2004-2014, London; and NLP for under-resourced languages, in Zagreb, Croatia.

In 2022, Arquivo.pt welcomed two researchers to its facilities, who used the archived resources and received special support from the Arquivo.pt team to develop their research.

The CLEOPATRA Project ended in 2023 with a meeting on 16 May in Hannover, which brought together professors, researchers and representatives of the institutions involved.

Daniel Gomes, Arquivo.pt’s Manager, highlighted the new tools that Arquivo.pt makes available and the results of the work carried out by the researchers that have passed through Arquivo.pt.

Secondments@Arquivo.pt and new research tools available (slides)
Research open datasets on 2019 European Parliamentary Elections

Portuguese municipal elections 2021 preserved by Arquivo.pt

December 16, 2021 by Ricardo Basílio

Last updated on May 8th, 2023 at 05:09 pm

Thousands of pages about the elections to preserve before they disappear

On 26 September 2021, local elections were held in Portugal, an event marked by the Covid-19 pandemic. Candidates communicated mainly through the media and through publications on the Web.

Electoral websites are of manifest historical importance. However, they are difficult to identify because they appear and disappear quickly. In the case of municipal elections, the number of candidates and the variety of channels used make the task even more challenging.

Arquivo.pt, as in previous elections, launched a special collection to preserve contents concerning the municipal elections.

How was the electoral content published on the Web identified?

The first step was the manual identification of election-related content by municipality and parish. For this purpose, help was requested from people and organisations through the following initiatives:

collaborative list “Municipal Elections 2021: we need your help!”
request for collaboration from the archive services of the 308 municipalities in the identification of electoral sites and candidates of the respective municipality;
request to the Parties to send the names of their lead candidates.

The Eyedata – Social Data Lab site, which made the names of candidates from all over the country available on the Web, was used as a source. The Wikipedia page Eleições autárquicas portuguesas de 2021 was also used as a source of information.

This manual identification process resulted in a list of 255 addresses which documented the candidacies for the 2021 Municipal Elections. Note that 61% of the identified addresses pointed to private social media platforms (54% facebook.com, 5% instagram.com and 2% twitter.com).

Much of this content of national interest could not be preserved because these foreign private companies do not allow it.

The list with names of candidates by county, party or coalition was used to create automatic searches in Bing that identified the most relevant electoral contents.

For instance, by combining the term “autárquicas 2021” with the name of a candidate and the respective municipality, one obtains results related to that candidate, such as news, initiatives of his/her campaign or the official page of his/her electoral campaign.
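The query construction described above can be sketched in a few lines. This is only an illustration of combining the fixed term with each candidate and municipality: the function name `build_queries`, the choice to wrap the candidate name in quotation marks, and the placeholder candidate are assumptions, not details of the actual Arquivo.pt workflow.

```python
from typing import Iterable, List, Tuple


def build_queries(
    candidates: Iterable[Tuple[str, str]], term: str = "autárquicas 2021"
) -> List[str]:
    """Build one search query per (candidate name, municipality) pair."""
    # Quoting the name asks the search engine for an exact phrase match
    # (an assumption made for this sketch).
    return [f'{term} "{name}" {municipality}' for name, municipality in candidates]


# Placeholder entry, not a real candidate.
for query in build_queries([("Nome Exemplo", "Lisboa")]):
    print(query)
```

Each generated string would then be submitted to the search engine, and the top results collected as candidate seeds for the crawl.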

This methodology was also applied in the Presidential Elections 2021 and in the European Elections 2019. The technical report A transnational crawl of the European Parliamentary Elections 2019 details the applied methodology.

Content collection and availability in Arquivo.pt

Between 22 August and 8 October 2021, Arquivo.pt exhaustively gathered pages related to the Local Government Elections 2021.

The resulting collection, called “Municipal Elections 2021” (EAWP39), gathers 31 million files that total 2.7 terabytes of information and will be available one year later.

Researchers who want to study the 2021 Local Elections and need early access to the collected contents can contact Arquivo.pt.
