2024 European and Portuguese elections in special Arquivo.pt collections

Last updated on October 9th, 2024 at 05:48 pm

Arquivo.pt made special collections on the three elections that took place this year: the Parliamentary elections on 10 March, the elections in Madeira on 26 May and the European elections on 9 June.

More than 70,000 pages with content related to the elections and political life in Portugal and Europe were identified and around 4 terabytes of information collected.

We would like to thank the people who contributed to the selection of pages. Teachers and students are encouraged to do work using the special collections on elections that Arquivo.pt has produced over the years.

Find out more about the collection procedure and the results obtained.

Portuguese Parliamentary Elections (Legislativas 2024)

The Portuguese Parliamentary Elections took place on 10 March 2024 to elect the members of the Assembly of the Republic for the 16th Legislature of the Third Portuguese Republic.

We would like to highlight the community’s contribution to this collection with a manual selection of 827 pages, which helped to improve the quality of the collection.

Around 500 compound terms or keywords were used to search for content published on the web about the elections. The service used for the automatic search was the Bing Search API. The results were limited to the top 20.

For example, the compound term ‘head-to-head legislative 2024’ found pages relating to debates between candidates. The term ‘legislative housing 2024’ found pages relating to party proposals for housing. The term ‘legislativas 2024 site:expresso.pt’ identified Expresso pages about the elections. The names of the candidates were also used.

After the elections, search terms specific to that period were used, such as ‘legislative victory 2024’, ‘legislative defeat 2024’ or ‘legislative results 2024’, among others.

The automatic search in the Bing Search API resulted in 34,120 addresses obtained before the elections and 5,803 after the elections.
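The way compound terms are assembled and the returned addresses deduplicated can be sketched as follows. This is only an illustration under stated assumptions, not the script used by Arquivo.pt: the helper names `build_queries` and `unique_urls`, the Portuguese topic words and the sample results are all hypothetical, and no request is sent to the Bing Search API.

```python
def build_queries(base, topics, sites=()):
    """Combine a base term with topic words and optional site: restrictions."""
    queries = [f"{base} {topic}" for topic in topics]
    queries += [f"{base} site:{site}" for site in sites]
    return queries


def unique_urls(results_per_query, top_n=20):
    """Keep the top_n results of each query and drop repeated addresses."""
    seen = set()
    ordered = []
    for urls in results_per_query.values():
        for url in urls[:top_n]:
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered


# Hypothetical topics and site, for illustration only.
queries = build_queries(
    "legislativas 2024", ["debates", "habitação"], sites=["expresso.pt"]
)

# Hypothetical ranked results, standing in for search API responses.
sample_results = {
    queries[0]: ["https://example.org/debate-1", "https://example.org/debate-2"],
    queries[1]: ["https://example.org/debate-2", "https://example.org/habitacao"],
}
addresses = unique_urls(sample_results)

print(queries)
print(addresses)
```

Deduplicating across queries matters because different compound terms often return the same page.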

The websites of political parties, including parties without parliamentary seats, were also collected during the election period.

Not all the content identified could actually be recorded, due to the limitations of the recording tools or the restrictions of the websites themselves.

The tools Heritrix, Brozzler and Browsertrix-cloud (beta version), courtesy of Webrecorder.net, were used for the recording.

The recording took place between 6 and 20 March and resulted in 3.2 Terabytes of information. The contents have been included in the EAWP45 special collection and will be available after one year.

To find out more, consult the open dataset:

Madeira Legislative Assembly elections 2024

The elections for the Legislative Assembly of Madeira took place on 26 May. Arquivo.pt carried out a special collection of content published on the web.

We began by automatically searching for news, election pages and websites related to the elections in Madeira, using a list of search terms submitted to the Bing Search API.

The aim was to obtain as many URLs as possible related to the event or topic in question, i.e. the Madeiran elections. To do this, several limits were set for the results: top 10, top 20, top 50 and top 100. The documentation of this process shows that the more results are requested, the greater the number of pages that are of little relevance and sometimes outside the intended target.
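The effect of the different limits on the number of addresses can be sketched as below. The function name `urls_by_limit` and the synthetic ranked lists are hypothetical. The sketch only counts unique addresses per limit, so the relevance of the extra pages would still have to be judged separately.

```python
def urls_by_limit(ranked_results, limits=(10, 20, 50, 100)):
    """Count the unique addresses obtained when keeping the top N of each term."""
    totals = {}
    for limit in limits:
        urls = set()
        for ranked in ranked_results.values():
            urls.update(ranked[:limit])
        totals[limit] = len(urls)
    return totals


# Synthetic ranked results for two hypothetical search terms.
ranked_results = {
    "term a": [f"https://example.org/{i}" for i in range(100)],
    "term b": [f"https://example.org/{i}" for i in range(50, 150)],
}
totals = urls_by_limit(ranked_results)
print(totals)
```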

All the addresses (12,656) were recorded on 7 June with the Heritrix crawler.

Find out more by consulting the open dataset:

European elections 2024 in multilingual collection

The European elections took place on 9 June in Portugal. In some countries, such as Estonia, Czechia and Italy, the elections were held on a different date.

Arquivo.pt collected pages relating to the European Elections in the 27 countries of the European Union and in the 24 official languages.

The same methodology as for the 2019 European Elections collection was used, i.e. a multilingual and semi-automatic search.

A list of 40 compound terms or keywords was used. The terms had been translated into the 24 official EU languages in 2019 by the EU Publications Office. This resulted in a multilingual list of 960 terms (40 terms × 24 languages) to submit to the Bing Search API.
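The size of the multilingual list follows from combining every term with every language. In this minimal sketch the terms are placeholders and `translate` is a hypothetical stand-in for looking up a translation. The language codes are the ISO 639-1 codes of the 24 official EU languages.

```python
# ISO 639-1 codes of the 24 official EU languages.
LANGUAGES = [
    "bg", "hr", "cs", "da", "nl", "en", "et", "fi", "fr", "de", "el", "hu",
    "ga", "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
]

# Placeholder terms; the real list of 40 compound terms is not reproduced here.
TERMS = [f"term-{i}" for i in range(1, 41)]


def translate(term, language):
    """Hypothetical stand-in for a translation lookup."""
    return f"{term} [{language}]"


multilingual_terms = [
    translate(term, language) for term in TERMS for language in LANGUAGES
]
print(len(multilingual_terms))
```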

The first search was carried out before the elections, on 3 June, with the number of results limited to the top 20. It yielded 8,986 unique addresses.

After the elections, new search terms were added with the names of the main candidates for the European Parliament in each country of the European Union. This second post-election search yielded 15,371 unique addresses.

The tool used for this collection was Heritrix. The collection was limited to three ‘hops’, meaning that the crawler follows links at most three steps away from each starting address. We thus opted for a certain restraint in the depth of the recording: three ‘hops’ in the Heritrix crawler is enough to record one page (what other applications call ‘page’ or ‘single page’ recording).
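The idea of a hop limit can be illustrated with a breadth-first traversal over a small in-memory link graph. This is a conceptual sketch, not Heritrix code: the `crawl` function and the graph are hypothetical, and how Heritrix counts the different kinds of links (such as embedded resources or redirects) is not modelled.

```python
from collections import deque


def crawl(seed, links, max_hops=3):
    """Return {url: hops} for every URL reachable within max_hops of the seed."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical link graph: a chain of pages plus one embedded resource.
links = {
    "page-a": ["page-b", "image-a"],
    "page-b": ["page-c"],
    "page-c": ["page-d"],
    "page-d": ["page-e"],
}
reached = crawl("page-a", links, max_hops=3)
print(reached)
```

With a limit of three, `page-e` (four hops from the seed) is never visited.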

The content was recorded between 7 and 20 June and included in the EAWP46 special collection. It will be available after 1 year.

Find out more by consulting the open dataset:

Know more about past collections about elections

Analysis of the Arquivo.pt query dataset

Last updated on October 1st, 2024 at 10:34 am

Arquivo.pt query logs are unique resources for research

Arquivo.pt provides a “Google-like” service that enables searching pages and images collected from the web since the 1990s. Note that Arquivo.pt search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites.

Analyzing user behavior is an important research topic to understand users’ information needs and enhance the quality of search results. Thus, when a user interacts with a search engine, the system records the user’s actions in a file called the query log. Query logs from web archives are unique resources for research because they describe the real needs of web-archive users about the historical information published online over time.

Research case study

Flavie Gallois and Adam Jatowt from the University of Innsbruck, and Ricardo Campos from the University of Beira Interior and INESC TEC analyzed user search behavior based on the Arquivo.pt search query log dataset collected over a period of 3 months from June to September 2021 (Analyzing User Search Behaviour in Temporal Web Repositories through Search Query Log Analysis).

This study analyzed query features such as length, type or frequency and compared the obtained results with previous work about user search behavior over web-archives and live-web search engines.

This study revealed interesting trends and patterns about how users search for information within web archives, with strong potential for future research work.

How do web-archive users search?

Figure 1: Distribution of country origin of users
Figure 2: Distribution of languages used in queries

In 85.7% of the queries, the users came from Portugal. However, automatic language identification found Portuguese in only 37% of the queries. This suggests that users search web archives in languages other than their own.

Users of Arquivo.pt tend to use longer queries with more words and characters in comparison to previous studies, both over web archives and live-web search engines. About 92% of the queries had 5 or fewer terms (average of 25 characters), with 3 being the most common number of submitted terms. In previous work about search behavior in web archives, it was observed that users tended to submit from 1 to 3 terms per query, with 1 term as the most common submission.
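Statistics of this kind can be computed from a list of query strings as sketched below. The function name `query_length_stats` and the sample queries are hypothetical, and splitting on whitespace is an assumption about how terms are counted, not necessarily the method used in the study.

```python
from collections import Counter


def query_length_stats(queries, max_terms=5):
    """Return the share of short queries, the most common term count and the mean length in characters."""
    term_counts = [len(query.split()) for query in queries]
    share_short = sum(1 for n in term_counts if n <= max_terms) / len(queries)
    most_common = Counter(term_counts).most_common(1)[0][0]
    mean_chars = sum(len(query) for query in queries) / len(queries)
    return share_short, most_common, mean_chars


# Hypothetical queries, not taken from the Arquivo.pt dataset.
sample_queries = [
    "arquivo",
    "eleições europeias 2019",
    "jornal público 2005",
    "câmara municipal de lisboa",
    "a b c d e f",
]
share_short, most_common, mean_chars = query_length_stats(sample_queries)
print(share_short, most_common, mean_chars)
```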

Users tend to issue multiple queries within a session instead of a single query, possibly indicating a need for refining their search queries or exploring multiple options for inquiry.

87.7% of the queries submitted to Arquivo.pt came from desktop browsers, even though Arquivo.pt provides mobile-friendly user interfaces. Old web-archived pages are not responsive and render poorly on mobile devices, so it is to be expected that users mostly access web archives from desktop computers.

Figure 3: Arquivo.pt users can refine the time span of their queries by using the From and To datepickers.

Users refined the time span of the search (using the datepickers) in about 50% of queries, which indicates awareness of the temporal needs peculiar to web-archive usage. Interestingly, users modified the From datepicker more frequently than the To datepicker. Note that keeping the default time span may fit the user's information needs and does not necessarily indicate a lack of awareness of the time-span function.

Only a small percentage of users included specific years in their query terms (4%), potentially suggesting that in these cases the time span function was insufficient, or unnoticed by some users.
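One simple way to spot explicit years in query terms is a regular expression, as in the sketch below. The accepted range (1980 to 2029) and the sample queries are assumptions made for illustration. Two-digit years such as "98" are deliberately not matched, so the share is a lower bound.

```python
import re

# Four-digit years from 1980 to 2029; the range is an assumption.
YEAR_PATTERN = re.compile(r"\b(19[89]\d|20[0-2]\d)\b")


def share_with_year(queries):
    """Return the fraction of queries that contain an explicit four-digit year."""
    with_year = [query for query in queries if YEAR_PATTERN.search(query)]
    return len(with_year) / len(queries)


# Hypothetical queries, not taken from the Arquivo.pt dataset.
sample_queries = [
    "eleições 2019",
    "jornal público",
    "expo 98",
    "sapo 1999 homepage",
]
year_share = share_with_year(sample_queries)
print(year_share)
```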

The obtained results suggest that users are more conscious of their information needs and have improved their search techniques to be more effective over web-archives instead of just using them out of curiosity as first-comers.

What is searched in a web-archive?

The authors of the study applied automatic named entity recognition over the user queries and derived a set of word clouds that graphically provide a glimpse of the most common information needs of Arquivo.pt users:

Figure 4: Word cloud of the most frequent query terms submitted to Arquivo.pt.
Figure 5: The most frequent Geographical Locations in query terms submitted to Arquivo.pt.
Figure 6: The most frequent Organizations in query terms submitted to Arquivo.pt.
Figure 7: The most frequent Persons in query terms submitted to Arquivo.pt.

Access to the Arquivo.pt research query dataset

Arquivo.pt released a set of resources to support research studies over its query logs:

Query log dataset

Original log files (samples)

Documentation

Evaluation Metrics for web-archive search

The first step in understanding user behavior is to define evaluation metrics. Metrics are a powerful tool for setting long- and short-term goals and for deciding which new products and features should be released to users.

We share a work-in-progress report which aggregates information about Web Archive Search Evaluation Metrics. This contributes to comparing users’ search behavior between live-web and web-archive search engines. Feel free to comment directly on the collaborative document or to contact us.

This report also summarizes references to previous work, query workflows and the structure of the corresponding query logs produced by Arquivo.pt, to make it easier for researchers to study these datasets.

Know more

Cross-lingual research datasets on 2019 European Parliamentary Elections

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

Last updated on August 5th, 2024 at 04:49 pm

Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned. 

The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. In parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.

In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content. 

The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to develop ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students. 

Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022. 

The idea was to develop part of his research on syntactic structures of EU languages using the textual resources preserved by Arquivo.pt, and to exchange knowledge with the web-archiving experts on strategies to extract and process historical web data.

Diego Alves defended his Ph.D. thesis, entitled Computational typological analysis of syntactic structures in European languages, in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

  1. Extract text: the textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, so that texts written in different languages could be separated into distinct files;
  2. Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g. repeated instances, empty lines, etc.);
  3. Double-check of language identification: the language of each cleaned text was verified again to eliminate possible errors introduced during the previous steps.
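The cleaning and language-verification steps above can be sketched as follows. This is not the script used in the project: `clean_text`, `group_by_language` and `toy_detect` are hypothetical names, the text-extraction step is not shown, and the language detector is passed in as a function so that the example runs without third-party libraries. Keeping only texts whose detected language is the same before and after cleaning is an assumption about how the double-check could work.

```python
from collections import defaultdict


def clean_text(text):
    """Drop empty lines and repeated lines, keeping first occurrences in order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def group_by_language(texts, detect):
    """Clean each text and keep it only if detection agrees before and after cleaning."""
    groups = defaultdict(list)
    for text in texts:
        first_guess = detect(text)
        cleaned = clean_text(text)
        if cleaned and detect(cleaned) == first_guess:
            groups[first_guess].append(cleaned)
    return dict(groups)


def toy_detect(text):
    """Toy detector for this example only; a real language detector would replace it."""
    return "pt" if "eleições" in text.lower() else "en"


# Hypothetical extracted texts.
texts = [
    "Eleições europeias\n\nEleições europeias\nResultados",
    "European elections\nResults\n",
]
corpora = group_by_language(texts, toy_detect)
print(corpora)
```

In practice, a function such as `detect` from the langdetect library could be passed in place of `toy_detect`.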

Two new research datasets are openly available!

The result is a publicly available dataset of cleaned, language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens of each corpus extracted from the collection 2019 European Union Elections preserved by Arquivo.pt (EAWP23).

The aforementioned corpus was automatically annotated regarding part-of-speech and dependency relations to generate a corpus with syntactic information which is useful for linguistic studies. 

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied. 

The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
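For readers unfamiliar with CoNLL-U, the sketch below reads one annotated sentence into Python dictionaries. In CoNLL-U, each token line has ten tab-separated fields and comment lines start with `#`. The sample sentence is hand-written for illustration and is not taken from the dataset, and `parse_conllu_sentence` is a hypothetical helper.

```python
FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu_sentence(block):
    """Parse the token lines of one CoNLL-U sentence into a list of dicts."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected 10 fields, got {len(values)}")
        token = dict(zip(FIELDS, values))
        if "-" in token["id"] or "." in token["id"]:
            continue  # skip multiword token ranges and empty nodes
        tokens.append(token)
    return tokens


# Hand-written sample sentence, for illustration only.
SAMPLE = "\n".join([
    "# text = Elections matter.",
    "\t".join(["1", "Elections", "election", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "matter", "matter", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

tokens = parse_conllu_sentence(SAMPLE)
for token in tokens:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```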

The syntactically annotated texts about the 2019 European Elections are publicly available!

Know more

Collection about Covid-19 in Portugal

Last updated on June 18th, 2021 at 08:26 am

Suggest web pages about Covid-19

Arquivo.pt invites everyone to suggest web pages that document the Covid-19 pandemic to be preserved for future access. Help us to keep a complete memory of Portuguese life during this period.

Suggest pages using this form: https://tinyurl.com/arquivopt-covid19

Thousands of web pages to tell the story of the pandemic in Portugal

Arquivo.pt has been carrying out special collections of web pages related to the Covid-19 pandemic since March 2020.

“Future academics, scientists and journalists who are studying the Portuguese response to the Covid-19 pandemic will want to read first-hand testimonies of those affected, official records of the number of victims, and recommendations from doctors, politicians and scientists at the time”, Público newspaper, May 1, 2020 edition.

Content was collected daily from a set of 106 sites on the theme of Covid-19. This set includes, for example, websites of media outlets, government bodies, associations and university initiatives.

Another set comprises Twitter pages (108 identified in May), YouTube videos (815 identified in May) and pages from Reddit and GitHub.

Suggestions from the community were included. For example, archivists from Sines (Portugal) collected local news related to Covid-19 (9 GB). The Revisionista.pt project also contributed by identifying pages from newspapers. People sent suggestions through the public form.

Collaboration with IIPC for international collection

In February 2020, the International Internet Preservation Consortium (IIPC), the main organization on Web preservation, proposed to its members a collection about the Novel Coronavirus (Covid-19) outbreak.

Arquivo.pt contributed 1,237 seeds, mainly in Portuguese. With successive contributions from other countries, the IIPC collection reached over 7,000 pages in July 2020.

A form is also available for anyone to suggest content for this international collection.

The IIPC collection “Novel Coronavirus (COVID-19)” is accessible via the Internet Archive's Archive-It service.

Arquivo.pt carried out three crawls of the international collection compiled by the IIPC: the first on March 23, the second on June 15 and the third in late August, thus gathering international content useful for researchers worldwide.

Methodology for the selection of pages for the Covid-19 collection

We started by identifying terms related to the Coronavirus theme that included health, economic, political, geographic or organizational aspects.

Then, the Bing Azure service was used to automatically obtain, through a script, the following information for the first 10 results for each term: the page address, the title and the position in the results list.
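Recording the address, title and position of the first results can be sketched as below. The sample dictionary mimics, as an assumption, the shape of a Bing Web Search JSON response (a `webPages` object holding a `value` list whose items have `name` and `url`). The function `top_results`, the search term and the sample data are hypothetical, and no request is sent to the service.

```python
def top_results(response, term, limit=10):
    """Return address, title and position for the first results of one term."""
    pages = response.get("webPages", {}).get("value", [])
    return [
        {"term": term, "position": position, "title": page["name"], "url": page["url"]}
        for position, page in enumerate(pages[:limit], start=1)
    ]


# Hypothetical response with 12 results, shaped like the assumed JSON structure.
sample_response = {
    "webPages": {
        "value": [
            {"name": f"Page {i}", "url": f"https://example.org/covid/{i}"}
            for i in range(1, 13)
        ]
    }
}

rows = top_results(sample_response, "covid-19 portugal")
for row in rows:
    print(row["position"], row["title"], row["url"])
```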

Considering the list of results, it was decided which software would be used and which settings would be the best to collect the pages.

For example, in the case of a newspaper section dedicated to Covid-19, it was necessary to decide whether to record just one page or to collect the entire site exhaustively.

Various types of software were used to collect the pages. Heritrix was used for the daily collections from the 106 sites, Brozzler was chosen to capture the 108 Twitter accounts, and videos were captured manually using Webrecorder and Browsertrix.

Know more

Scientific study presents a search log analysis of a search engine

Last updated on August 4th, 2024 at 06:00 pm

This research presents a characterization of the information-seeking behaviour of the users of a Portuguese web search engine, based on the analysis of its logs.

The paper A Search Log Analysis of a Portuguese Web Search Engine, by Miguel Costa and Mário J. Silva, was presented at INForum 2010 – Simpósio de Informática, in Braga, Portugal.

Paper presented at EPIA 2009

Last updated on August 4th, 2024 at 06:03 pm

An Updated Portrait of the Portuguese Web presented at EPIA 2009

The paper An Updated Portrait of the Portuguese Web, by João Miranda and Daniel Gomes, was presented at the 14th Portuguese Conference on Artificial Intelligence (EPIA 2009) in Aveiro.

This paper presents a characterization of the Portuguese Web derived from a crawl performed by the Portuguese Web Archive in March 2008, comprising 48 million documents and 2.5 TB of data.

Session at ISCTE “Arquivo.pt as an infrastructure for research in Social Sciences and Humanities”

Last updated on August 4th, 2024 at 06:05 pm

Session at ISCTE (Lisbon) “Arquivo.pt as an infrastructure for research in Social Sciences and Humanities”

Did you miss it?

No problem. Here are all the presentations: