Open data sets for research

Last updated on November 26th, 2024 at 05:42 pm

CDXJ indexes

The research and education community has been requesting bulk download of web-archived data and index files (CDXJ), for instance to train AI models, optimize the routing of web archive requests or recover information from selected websites (e.g. news). Arquivo.pt has begun making all its CDXJ index files publicly available in real time to facilitate the bulk download of web-archived data. Learn how at:

Crawl report

Arquivo.pt makes available the web crawling logs of some of its collections. This data is useful for research and supports various types of analysis. Examples of log analysis are also available.

2015 general crawl of .PT domain

2015 EU domain crawl

2019 European Elections crawl

2020 Covid special collection

Link graphs

Query logs

Seeds or initial URLs used for Arquivo.pt

‘Seeds’ are the addresses (URLs) from which the crawler starts collecting data. A special collection at Arquivo.pt is an occasional collection focussed on a specific event or topic (e.g. elections or artists’ websites). Arquivo.pt provides lists of seeds from its special collections, as they can be a starting point for analysing and studying those events or topics.
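As a starting point for that kind of analysis, the following is a minimal Python sketch that counts how many seeds fall under each host. It assumes a seed list is plain text with one URL per line, which may not match every published file; the function names and the sample URLs are hypothetical and only serve as illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def parse_seeds(lines):
    """Return the non-empty, non-comment entries from an iterable of lines.

    Assumption: one seed URL per line; lines starting with '#' are comments.
    """
    seeds = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        seeds.append(line)
    return seeds


def count_hosts(seeds):
    """Count how many seeds belong to each host name."""
    counts = Counter()
    for seed in seeds:
        # urlsplit only fills in the host when a scheme separator is present,
        # so a default scheme is added to bare entries such as "example.org/x".
        url = seed if "://" in seed else "http://" + seed
        host = urlsplit(url).hostname
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample standing in for a downloaded seed list.
sample_lines = [
    "# seeds for an illustrative special collection",
    "https://www.example.pt/",
    "https://www.example.pt/eleicoes/",
    "example.org/news",
    "",
]

seeds = parse_seeds(sample_lines)
host_counts = count_hosts(seeds)
for host, n in host_counts.most_common():
    print(host, n)
```

To use it on a real list, replace `sample_lines` with the lines of a downloaded seed file, for example `open(path, encoding="utf-8")`, where `path` is wherever the file was saved.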

Special collections seeds (EAWPs)

All available seeds of special collections (EAWPs)

Seeds published at Dados.gov open data portal

Dados.gov is a website where public administration entities and citizens can publish open data. Arquivo.pt uses Dados.gov to publish and disseminate its datasets. Dozens of seed lists used in special collections (newspapers, media, art, music, elections, etc.) can be found at Dados.gov. A copy of this information is also available for download from the Arquivo.pt website.

Seeds of Público newspaper (over time)