Dataset on 2025 Portuguese Local Elections at Arquivo.pt

Last updated on December 3rd, 2025 at 12:56 pm

Local elections (“autárquicas”) were held in Portugal on 12 October 2025, and Arquivo.pt compiled a special collection of electoral content published on the web, resulting in 3.5 terabytes of information for research and academic work.

440 search terms were used to obtain 43,000 page addresses, along with the websites of parishes, municipalities, and political parties.

Here we explain the various steps involved in collecting data on the elections:

How to identify election-related content on the web

To identify content related to the elections, we used a list of search terms, for example “eleições autárquicas 2025”, “habitação autárquicas 2025” and “promessas autárquicas 2025”. After the elections, other terms were added, such as “vitória autárquicas 2025” and “resultados autárquicas 2025”.

The search terms were chosen to cover a range of election-related topics, such as politics, society, and the economy, as well as media outlets, candidate names, and regions of the country.

In the collection on local elections, the Google search engine was used to perform each search. Some advanced search parameters were used: number of results (&num=100), news results (&tbm=nws), image results (&udm=2). After the elections, the results were restricted using the “last week” filter.
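As an illustration of how such search addresses can be put together, here is a minimal sketch. It uses only the three parameters named above (num, tbm, udm); the function name, the dictionary of result kinds, and the choice to send one kind of result per request are assumptions made for this example, not a description of the tooling actually used.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com/search"

# Extra parameters named in the text, keyed by the kind of results wanted.
# The keys "web", "news" and "images" are labels invented for this sketch.
RESULT_KINDS = {
    "web": {},
    "news": {"tbm": "nws"},
    "images": {"udm": "2"},
}


def build_search_url(term: str, kind: str = "web", num: int = 100) -> str:
    """Return a Google search URL for one search term and one kind of results."""
    params = {"q": term, "num": str(num)}
    params.update(RESULT_KINDS[kind])
    # urlencode percent-encodes accented characters as UTF-8.
    return f"{GOOGLE_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    for kind in RESULT_KINDS:
        print(build_search_url("eleições autárquicas 2025", kind))
```

Each search term can be run through the function once per kind of result, which gives one address per query to open in the browser.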

In each search, the addresses listed on the search engine results pages (SERP) were extracted using the Google Rank Checker, Keyword SERP Ranking Tool. This tool works as a browser extension that exports the list of results in JSON format.

In total, 1,400 searches or queries were performed on Google (800 before the elections and 600 after the elections). Finally, the results of all searches (.json files) were compiled into a document and converted into a table. Each result contains various data, such as relevance, the domain from which it was extracted, the link or URL, the title of the publication, the date of the search, and the query.
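The compilation step can be sketched as follows. This is a minimal example under stated assumptions: the column names (rank, domain, url, title, search_date, query) are hypothetical stand-ins for the data listed above, and each export is assumed to be a JSON list of result objects; the real files may be structured differently.

```python
import csv
import io
import json

# Hypothetical field names; the keys in the real exports may differ.
COLUMNS = ["rank", "domain", "url", "title", "search_date", "query"]


def merge_exports(json_texts):
    """Flatten several JSON exports (each a list of results) into table rows."""
    rows = []
    for text in json_texts:
        for result in json.loads(text):
            # Missing fields become empty cells; no rows are dropped.
            rows.append({col: result.get(col, "") for col in COLUMNS})
    return rows


def to_csv(rows) -> str:
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up exports standing in for two .json files.
    before = '[{"rank": 1, "domain": "example.org", "url": "https://example.org/a", "title": "A", "search_date": "2025-10-01", "query": "eleições autárquicas 2025"}]'
    after = '[{"rank": 1, "domain": "example.org", "url": "https://example.org/b", "title": "B", "search_date": "2025-10-14", "query": "resultados autárquicas 2025"}]'
    print(to_csv(merge_exports([before, after])))
```

Keeping every row, including duplicates and false positives, matches the choice described in this section; filtering can be done later on the resulting table.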

It should be noted that the list obtained represents only a small portion of everything published on the Web about the elections. In addition, the same list contains results unrelated to the purpose of the collection (false positives) and some repetitions. To save time, no lines were deleted.

This exercise resulted in 45,000 pages (seeds) with news, articles, and publications related to the elections to be used in the collection process by Arquivo.pt. This dataset, 2025 Local Elections, is available on the open data platform Dados.Gov.

A list of parish councils, municipal councils and political parties with their respective websites has also been added.

How the contents were recorded and limitations to be taken into account

The addresses obtained before and after the elections were recorded with two web crawlers, Heritrix and Browsertrix-crawler. These tools record pages from a given starting address (seed) and then follow the links found there up to a set limit, in this case a maximum of five hops.
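The idea of a hop limit can be illustrated with a small breadth-first traversal. This is a simplified sketch, not how Heritrix or Browsertrix-crawler are implemented: the function names are invented, and the web is replaced by an in-memory dictionary of links so the example runs without network access.

```python
from collections import deque


def crawl(seeds, get_links, max_hops=5):
    """Breadth-first traversal that stops following links after max_hops.

    Returns a dict mapping each visited address to its distance (in hops)
    from the nearest seed.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # page is recorded, but its links are not followed
        for link in get_links(url):
            if link not in hops:
                hops[link] = hops[url] + 1
                queue.append(link)
    return hops


if __name__ == "__main__":
    # Toy link graph: a chain of pages p0 -> p1 -> ... -> p7.
    graph = {f"p{i}": [f"p{i + 1}"] for i in range(7)}
    visited = crawl(["p0"], lambda url: graph.get(url, []))
    print(sorted(visited.items(), key=lambda item: item[1]))
```

With the default limit of five hops, the traversal reaches p5 and stops there; p6 and p7 are never visited.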

Heritrix was used for an initial generic collection of pages, as it is capable of quickly processing lists containing thousands of addresses: 25,858 URLs before the elections and 17,258 URLs after the elections. It generated 541 gigabytes of information.

Browsertrix-crawler was used to improve the collection of dynamic content. This crawler records pages through a browser: recording takes longer, but it captures content that would otherwise escape collection.

The Browsertrix-crawler collection was carried out in stages: first the parish websites were recorded in August and September, and then, between October 9 and November 5, news about the elections and 8,850 social media posts. It generated 2.9 terabytes of information.

As for the limits of the collection, we were able to identify a few: access blocked by some websites that defend themselves against automatic access, despite the Arquivo.pt agent being identified; social media content behind a login that cannot be reproduced on Arquivo.pt; videos that cannot be reproduced due to their format.

How and when to access data for research and work creation

EAWP48 is the identifying name of the collection that will bring together content on the Local Elections of 12 October 2025. It is described in the list of collections at Arquivo.pt.

In the coming months, the content will be indexed and the CDXJ indexes will be available to researchers in the Arquivo.pt dataset list.
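For readers unfamiliar with CDXJ, each line of such an index typically holds a searchable key, a 14-digit capture timestamp, and a JSON object with details of the capture. The sketch below parses one line of that general shape. The sample line is made up for illustration, and the field names inside the JSON part (url, mime, status) are assumptions; the indexes published by Arquivo.pt may use different fields.

```python
import json
from datetime import datetime


def parse_cdxj_line(line: str) -> dict:
    """Split one CDXJ line into its key, capture time and JSON fields."""
    # Only the first two spaces are separators; the JSON part may contain spaces.
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    record["key"] = key
    record["captured"] = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return record


# Made-up sample line; it is not taken from the collection.
SAMPLE = (
    'pt,example)/autarquicas 20251012203000 '
    '{"url": "https://example.pt/autarquicas", "mime": "text/html", "status": "200"}'
)

if __name__ == "__main__":
    rec = parse_cdxj_line(SAMPLE)
    print(rec["captured"].date(), rec["url"], rec["status"])
```

Reading an index this way makes it possible to count captures per site or per day without downloading the archived content itself.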

After one year, the collected content will be accessible through the Arquivo.pt search engine. Anyone will then be able to search election pages by text or image.

For further information, please contact us.

Data collected on the 2025 Local Elections

Find out more about election collections from previous years

Afghanistan Websites and the fall of the regime in August 2021

Last updated on September 26th, 2022 at 03:57 pm

Afghanistan Ministry of Economy website with Karima Faryabi (recorded August 17, 2021)

On August 15, 2021 the presidential palace in Kabul was taken over by the Taliban, consummating the fall of the regime that had been in place for 20 years, following the 9/11 attacks on the United States.

The community of Web archivists, through the Content Development Working Group – International Internet Preservation Consortium, was challenged to record the Afghan sites, given the risk that they would disappear with the new regime.

No time to lose when it comes to preserving the Web

Arquivo.pt reacted quickly, launching an automatic content search focused on .af domain sites and on international media news about the ongoing events.

On August 17, the websites began to be recorded.

The seeds were 1,800 website addresses from Afghanistan (ending in .af) and 500 media news stories from around the world.

The addresses, URLs or “seeds” were obtained through automated search using the Bing Search API and immediately put into recording.

Content available to know Afghanistan’s history

As a result of the collection carried out, more than 400 Gigabytes of information became available at Arquivo.pt, which anyone can use for research in the most diverse areas.

The main contribution of Arquivo.pt to the community of Web archivists was the use of automatic search, which allows a quick reaction when recording Web content at imminent risk of being lost.

Know more

Arquivo.pt open data set (Dados.gov)

Content collected by the Content Development Working Group of the International Internet Preservation Consortium available at the Archive-it service

Online Cafe with Arquivo.pt continues

Last updated on August 6th, 2024 at 02:11 pm

Share this page: arquivo.pt/onlinecafe

Welcome to the third season of the Online Cafe with Arquivo.pt

Talk directly to the Arquivo.pt team and get answers to all your questions! Arquivo.pt has launched a new cycle of online sessions where you can chat with the team. Brief introductory presentations will be given, leaving time for all your questions about how to get more out of Arquivo.pt or how to apply to the Arquivo.pt Awards.

Sessions

February 17, 2022 – Primeiras páginas de jornais online portugueses

“Primeiras páginas de jornais online portugueses” (Front pages of Portuguese online newspapers) presents an interactive graphical analysis of the front pages of Portuguese online newspapers. For this study, specific items within the newspaper design were analysed, allowing trends to be observed over time.

Susana Parreira explains how she developed this work as part of her Master's degree, with the collaboration and guidance of Ana Boavida (Universidade de Coimbra), Ana Sabino (Instituto Politécnico de Castelo Branco) and Penousal Machado (Universidade de Coimbra).

22nd session –  January 20, 2022 – Politiquices

Politiquices.pt lets users explore relations of support or opposition between political personalities and parties, as expressed in news headlines. The application uses information preserved in Arquivo.pt to create an ontology of relations, relying on Natural Language Processing technology. David Batista, 2nd place in the Arquivo.pt Awards 2021, will explain how he developed his work and demonstrate its applications for researchers and citizens in general.

Special session – World Digital Preservation Day 2021 – Major minors project – November 5

In November, World Digital Preservation Day is broadly celebrated and, to mark this international initiative, Arquivo.pt held an online session open to the community. Special guests of this session were the winners of the Arquivo.pt Award 2021, Leandro Costa, Paulo Martins and José Carlos Ramalho.

Previous seasons

Presentation at the IIPC Web Archiving Conference 2022

2019 websites available and Arquivo.pt surpasses 10 billion files

Last updated on December 16th, 2021 at 06:43 pm

The information collected from the Web during 2019 is now available in Arquivo.pt, in accordance with the one-year embargo period.

Screenshot from www.politico.eu preserved by Arquivo.pt, collected on June 18, 2019. Article about the Notre Dame fire in Paris, “Notre Dame fire ‘fully extinguished’ as fundraising starts”.

Remember and research historical events in 2019, such as

Arquivo.pt has visited 2 million sites and collected 1.7 billion files, 131 TB in total, so that you can access the memory of past events.

In 2021, Arquivo.pt provides open access to more than 10 billion files (721 TB) from 27 million websites.

Special collection of Portuguese Presidential Elections

Form to suggest a web page, a web site or other web content

Arquivo.pt invites all citizens to suggest web pages related to the 2021 Presidential Elections to be preserved for the future.

The Presidential Elections will take place in Portugal on January 24, 2021.

Your suggestions are important so that Arquivo.pt can keep a more complete memory of this important electoral event.

To suggest web pages use this form (https://tinyurl.com/presidenciais-sugerir)

Arquivo.pt preserves websites of national scientific projects

Last updated on October 1st, 2021 at 09:11 am

Preserving scientific project websites is important

The contents of the websites tend to disappear when the scientific projects are finished.

The preservation of scientific project websites is important because it:

  • documents the development of projects;
  • ensures access to unique technical and scientific content that researchers have posted on the project websites (e.g. presentations, photographs, data sets);
  • reinforces the visibility of the results of projects financed by FCT.

Experimental collection of scientific projects websites in 2016

Arquivo.pt automatically collected websites for projects financed by FCT in 2016.

The information about these websites was dispersed as it was not recorded during the administrative process.

FCT had been financing scientific projects for about 20 years, so there were too many sites to identify manually.

Arquivo.pt therefore developed an automatic methodology for identifying these websites.

The FCT database had a total of 11,996 project entries but did not include references to web addresses. Applying the automatic methodology, 7,956 URLs related to the funded scientific projects were identified.

The collection of content referenced by these addresses resulted in the preservation of 600,721 files (72 GB), including content such as research group web pages, researchers’ personal pages or project-related blogs.

Online references in scientific project reports have been preserved since 2020

Since June 2020, the website addresses of projects financed by FCT must be registered in the progress and final reports of those projects.

Arquivo.pt started using these addresses to preserve the contents of websites of national scientific projects in a systematic way.

1st official collection of scientific project websites

In June 2020, Arquivo.pt obtained 263 addresses related to 100 scientific projects from the reports submitted to FCT. Most of the addresses (67%) did not have any version previously preserved in Arquivo.pt.

The addresses obtained point to online resources such as the websites of the projects, R&D units, news in the media, articles in scientific journals or repositories, databases, videos on YouTube or Facebook pages.

In July 2020, a special collection was launched from this set of addresses which resulted in 6.9 GB of information obtained from the visit to 31,606 URLs.

Exhibition about Research & Development projects

The Scientific Research Memory is an online exhibition dedicated to the websites of scientific projects funded by the Foundation for Science and Technology (FCT) that Arquivo.pt has preserved.

There are also websites of the Research & Development Units financed by FCT.

Memorial do Arquivo.pt preserves scientific websites for free

The Memorial do Arquivo.pt service has preserved historic FCT websites that have been disabled. These were created for events or initiatives that have ended and therefore their contents are no longer updated.

To include a website in the Memorial, Arquivo.pt starts by making a high quality collection of its contents.

Then, the collected contents are validated in collaboration with those responsible for the website.

Finally, the website address is redirected to the contents that have been preserved by Arquivo.pt.

For example, if someone wants to access any page on the Scientific Archives Meeting held in 2014, they will be redirected to Arquivo.pt.
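The redirection step can be pictured as a mapping from an original address to the address of its preserved version. The sketch below assumes a replay address of the form https://arquivo.pt/wayback/ followed by a 14-digit timestamp and the original address; that pattern, the function name, and the example address are assumptions made for illustration and may not match how the Memorial service is actually configured.

```python
def archived_url(original_url: str, timestamp: str) -> str:
    """Build the replay address a redirect could point to (assumed pattern)."""
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError("timestamp must be 14 digits: YYYYMMDDhhmmss")
    return f"https://arquivo.pt/wayback/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical page of a discontinued event website.
    print(archived_url("http://www.example.pt/encontro/programa.html", "20141120000000"))
```

Because the original address is kept intact at the end of the new one, every page of the old site can be redirected with a single rule instead of one rule per page.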

Thus, the contents remain accessible over time, and any links or references to them in scientific communications do not break.

The digital preservation service Memorial do Arquivo.pt is free of charge for websites of the academic and scientific community. Just send a request to contacto@arquivo.pt.

To know more

Cross-lingual collection about the 2019 European Elections is available

Last updated on August 30th, 2022 at 10:46 am

Screenshot of an archived page on Arquivo.pt: https://www.european-elections.eu

The special collection of web pages about the 2019 European Elections is available for search at Arquivo.pt.

To compile this collection, pages written in 24 European languages were identified through automatic searches on the Bing search engine and suggestions from 17 European countries.

We emphasize the collaboration of the Publications Office of the European Union, which reviewed the list of search terms in the different languages of the European Union.

Between May and July 2019, Arquivo.pt exhaustively collected pages related to the European Elections in several countries.

The resulting collection, named “European Elections 2019”, comprises 99 million web files totalling 4.8 terabytes of information.

The technical report “A transnational crawl of the European Parliamentary Elections 2019” details the methodology applied. This methodology has since been used to generate other thematic collections, such as the one on Covid-19.

We invite all citizens, especially researchers, to try this service, created specifically to search the cross-lingual and international collection on the 2019 European Elections: https://arquivo.pt/ee2019

Video “A transnational and cross-lingual crawl of the European Parliamentary Elections 2019”

A transnational and cross-lingual crawl of the European Parliamentary Elections 2019, Ivo Branco, IIPC Web Archiving Conference and RESAW 2021 (slides)

To know more:

We preserved the Portuguese Local Elections of 2017

Last updated on August 5th, 2024 at 05:05 pm

Arquivo.pt performed two web crawls of information related to the Portuguese Local Elections of 2017.

We appealed to the community to suggest relevant Web pages so that we could preserve them.

The two crawls occurred during and after the campaign period, using the list of 410 Web pages suggested by the community and 13,887 web pages found automatically using search engines.

The manual identification process produced a list of 337 addresses documenting candidacies for the 2017 Municipal Elections. Note that 46% of these addresses pointed to the social media platform Facebook.com. Much of this content of national interest could not be preserved because this foreign private company does not allow it.

The final result was an archive of 2,265,887 Web resources (360 GB).

Among the preserved web pages are the official sites of the candidates, news, blogs and articles with personal opinions about the elections.

Arquivo.pt respects an embargo period of one year, so this collection will only be available by the end of 2018.

Meanwhile, you can consult the preserved pages about the previous elections of 2013, such as:

We would like to thank all the volunteers who collaborated with this initiative.

Trends in the evolution of the Web: we have published a video about our study

Last updated on August 4th, 2024 at 06:21 pm

A talk on the evolution of Web characteristics, based on a scientific study performed by the Portuguese Web Archive.

The presentation focuses on the following points:

  • The Web
  • Web Archiving and crawlers
  • Web characteristics and their evolution over 5 years
  • The importance of the study on Web trends in the design of tools that process its data

Find out more: