Last updated on November 26th, 2024 at 05:42 pm
CDXJ indexes
The research and education community has been requesting bulk download of web-archived data and index files (CDXJ), for instance to feed AI training models, optimize the routing of web archive requests, or recover information from selected websites (e.g. news). Arquivo.pt has begun making all its CDXJ index files publicly available in real time to facilitate the bulk download of web-archived data. Learn how at:
- CDXJ indexes (all, available for download)
- Bulk download of web-archived resources (wiki)
- Tutorial: how to explore Arquivo.pt using Python (news)
- Artificial Intelligence processes data from Arquivo.pt (news)
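As a rough illustration of working with a downloaded index file, the sketch below parses a single CDXJ line in Python. It assumes the commonly used CDXJ layout of a URL key, a 14-digit timestamp, and a JSON object separated by single spaces. The sample record and its field names (`url`, `mime`, `status`, `filename`, `offset`, `length`) are made up for illustration and should be checked against the actual Arquivo.pt files.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields)."""
    # Assumed layout: "<urlkey> <timestamp> <JSON object>".
    # maxsplit=2 keeps the JSON payload intact even though it contains spaces.
    urlkey, timestamp, payload = line.strip().split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


# Hypothetical sample record, not taken from a real index file.
sample = (
    'pt,example)/ 20150101120000 '
    '{"url": "http://example.pt/", "mime": "text/html", "status": "200", '
    '"filename": "EXAMPLE.warc.gz", "offset": "1234", "length": "567"}'
)

urlkey, timestamp, fields = parse_cdxj_line(sample)
print(urlkey, timestamp, fields["url"], fields["status"])
```

If the files follow this layout, the `filename`, `offset` and `length` values would be what a client needs to locate the corresponding archived record.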
Crawl report
Arquivo.pt makes available the web crawling logs of some of its collections. This data is useful for research and enables various types of analysis. Examples of log analysis are also available.
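One simple kind of analysis is counting fetch status codes in a crawl log. The Python sketch below assumes the usual Heritrix `crawl.log` layout, in which each line holds whitespace-separated fields beginning with a timestamp, the fetch status code, the document size and the URI. The sample lines are invented for illustration, and the layout should be verified against the downloaded logs.

```python
from collections import Counter


def status_counts(lines):
    """Count fetch status codes in Heritrix-style crawl log lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or truncated lines
        # Assumed field order: timestamp, status code, size, URI, ...
        counts[parts[1]] += 1
    return counts


# Hypothetical sample lines, not taken from a real Arquivo.pt log.
sample_log = [
    "2015-06-01T10:00:00.123Z   200       1234 http://example.pt/ - - text/html #001 20150601100000100+23 sha1:AAAA - -",
    "2015-06-01T10:00:01.456Z   404        512 http://example.pt/missing L http://example.pt/ text/html #002 20150601100001400+56 sha1:BBBB - -",
    "2015-06-01T10:00:02.789Z   200       2048 http://example.pt/about L http://example.pt/ text/html #003 20150601100002700+89 sha1:CCCC - -",
]

counts = status_counts(sample_log)
print(counts.most_common())
```

For the multi-gigabyte logs listed below, the same function can be fed an open file object, so lines are processed one at a time rather than loaded into memory.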
2015 general crawl of .PT domain
- Heritrix crawl log, 2015 (tar, available for download, 20 GB)
- Heritrix crawl report, 2015 (tar, available for download, 21 MB)
2015 EU domain crawl
- Analysing Crawl Heritrix reports (html)
- A first attempt to archive the .EU domain (news)
- A first attempt to archive the .EU domain, technical report, 2015 (pdf)
- Automatic Identification and Preservation of R&D Websites (pdf)
2019 European Elections crawl
- 2019 European Elections crawl (html)
- Cross-lingual collection about the 2019 European Elections is available (html)
- A transnational crawl of the European Parliamentary Elections 2019. Technical report (pdf)
2020 Covid special collection
- Logs of Covid 2020 (zip, available for download, 11 GB)
- Collection about Covid-19 in Portugal (news)
Link graphs
Query logs
- All query logs
- QueryLogs BeforeClean (csv, available for download, 48 GB)
- Querylogs version 1 (zip, available for download, 5 MB)
Seeds or initial URLs used for Arquivo.pt
‘Seeds’ are the addresses or URLs from which the crawler starts collecting data. A special collection at Arquivo.pt is an occasional collection focussed on a specific event or topic (e.g. elections, artists’ websites, etc.). Arquivo.pt provides a list of seeds from special collections, as they can be a starting point for analysing and studying events or topics.
Special collections seeds (EAWPs)
All available seeds of special collections (EAWPs)
- Seeds EAWP40 post-elections, list of seeds about the Portuguese Parliamentary Elections 2022 (txt, download)
- Seeds EAWP40 pre-elections, list of seeds about the Portuguese Parliamentary Elections 2022 (txt, download)
- Seeds EAWP41, about cryptocurrencies (txt, download)
- Seeds EAWP42, external links from Wikipedia (txt, download)
- Seeds EAWP43, citations in scientific articles from RCAAP, the Portuguese open access scientific repository (txt, download)
- Seeds EAWP44, URLs presented in scientific CVs from the Ciencia Vitae portal (txt, download)
- Seeds EAWP45 post-elections, list of seeds about the Portuguese Parliamentary Elections 2024 (txt, download)
- Seeds EAWP45 pre-elections, list of seeds about the Portuguese Parliamentary Elections 2024 (txt, download)
- Seeds EAWP46, about the European Elections 2024, Madeira Elections 2024 and also from Ciencia Vitae CVs (txt, download)
- Seeds EAWP47, about scientific publications, from Ciencia Vitae, RCAAP and Instituto Superior Tecnico repository (txt, download)
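As a starting point for studying a seed list, the Python sketch below groups seeds by host name. It assumes each txt file holds one URL per line. The handling of blank lines, `#` comment lines and seeds without a scheme is a guess about what the files may contain, and the sample seeds are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def hosts_from_seeds(lines):
    """Count how many seeds point at each host name."""
    hosts = Counter()
    for line in lines:
        seed = line.strip()
        if not seed or seed.startswith("#"):
            continue  # assumption: blank and comment lines carry no seed
        if "://" not in seed:
            seed = "http://" + seed  # assumption: bare hosts default to http
        host = urlsplit(seed).hostname
        if host:
            hosts[host] += 1
    return hosts


# Hypothetical seeds, not taken from a real EAWP list.
sample_seeds = [
    "http://example.pt/",
    "https://example.pt/news",
    "www.example.org/page",
    "",
]

host_counts = hosts_from_seeds(sample_seeds)
print(host_counts.most_common())
```

Passing an open file object instead of `sample_seeds` would apply the same count to a downloaded seed list.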
Seeds published at Dados.gov open data portal
Dados.gov is a website where public administration entities and citizens can publish open data. Arquivo.pt uses Dados.gov to publish and disseminate its datasets. Dozens of seed lists used in special collections (newspapers, media, art, music, elections, etc.) can be found at Dados.gov. A copy of this information is also available for download from the Arquivo.pt website.
- All datasets published on Dados.gov (website)
- Summary of all files published on Dados.gov – list (xlsx, available for download)
- Copy of individual files at the arquivo.pt website – list files (available for download)
- Arquivo.pt certified as an open data provider (news)