Arquivo.pt Links Dataset: Unveiling the Web’s Hidden Structure


Last updated on May 5th, 2025 at 02:50 pm

The interconnected nature of the World Wide Web has long fascinated researchers and technologists alike. Today, we are thrilled to announce the release of the Arquivo.pt Links dataset, a comprehensive collection that opens new possibilities for understanding and analyzing web connectivity patterns.

The dataset encompasses more than 139 million webpage URLs, each accompanied by metadata about its incoming links: the source URLs and their corresponding anchor texts, i.e., the visible, clickable text of each hyperlink. This collection of interconnection data gives researchers a unique window into the web’s underlying structure.

The importance of hyperlinks in web architecture cannot be overstated. They serve as the fundamental building blocks of web navigation and discovery, enabling both users and automated systems to traverse the vast landscape of online content.

Links formed the foundation of Google’s revolutionary PageRank algorithm, which transformed our approach to information retrieval and web search. PageRank’s fundamental insight – that a page’s importance could be measured by analyzing its incoming links – revolutionized search technology and remains influential in modern information retrieval systems.

By making this dataset publicly available, Arquivo.pt enables researchers to explore similar innovative approaches to web analysis and search engine development. The dataset opens up numerous exciting research possibilities across multiple domains:

  • Researchers can implement and experiment with various ranking algorithms, from classic approaches like PageRank to modern machine learning-based techniques. The inclusion of anchor texts provides valuable semantic context that can enhance search relevance and document classification.
  • The dataset enables deep analysis of web topology and link structures. Researchers can investigate questions about web connectivity patterns, identify clusters of related content, and study how information spreads across the web through link networks.
  • The anchor text associated with each link offers a rich source of human-generated descriptions of web content. This data can be particularly valuable for developing and testing document summarization algorithms, semantic analysis tools, and automated classification systems.
  • For web archiving researchers, this dataset provides insights into how web pages are connected and referenced over time, offering valuable data for studying web preservation strategies and digital heritage maintenance.
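As an illustration of the first research direction above, here is a minimal PageRank sketch in Python. It assumes the incoming links have already been loaded into a dictionary mapping each destination URL to the list of source URLs that link to it; that in-memory layout, the function name and the parameter defaults are assumptions made for this example, not the dataset's actual file format or an official Arquivo.pt tool.

```python
def pagerank(incoming, damping=0.85, iterations=50):
    """Power-iteration PageRank over an inverted (incoming-link) map.

    incoming: dict mapping a destination page to the list of pages linking to it.
    Returns a dict mapping every page to its rank; the ranks sum to 1.
    """
    # Every page that appears as a destination or as a source.
    pages = set(incoming)
    for sources in incoming.values():
        pages.update(sources)
    n = len(pages)

    # Out-degree of each page, recovered from the inverted map.
    out_degree = {page: 0 for page in pages}
    for sources in incoming.values():
        for source in sources:
            out_degree[source] += 1

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if out_degree[p] == 0)
        new_rank = {}
        for page in pages:
            link_share = sum(rank[s] / out_degree[s] for s in incoming.get(page, []))
            new_rank[page] = (1 - damping) / n + damping * (link_share + dangling / n)
        rank = new_rank
    return rank


# Toy graph: A links to B and C, B links to C, C links to A.
toy_incoming = {"B": ["A"], "C": ["A", "B"], "A": ["C"]}
ranks = pagerank(toy_incoming)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Because the dataset is already organised by destination page, the inner sum can read each page's incoming links directly, without first building a reverse index.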

Methodology

The process begins with a temporal snapshot of web pages from a specific time period (collection). During this initial phase, our systems analyze each captured page, extracting all outgoing hyperlinks along with their associated anchor texts and capture timestamps. This creates a preliminary mapping of how pages connect to one another within our captured timeframe.

What makes this dataset particularly valuable is its inverted link structure. Rather than organizing the data around source pages and their outgoing links, we’ve created an inverted map that centers on destination pages and their incoming links. This approach is particularly useful for analyzing a page’s importance or authority within the web’s structure, as it provides immediate access to all pages that reference or point to a given URL.

Consider a traditional link structure where Page A links to Pages B, C, and D. In our inverted structure, we instead see entries for Pages B, C, and D, each listing Page A as a source of incoming links. This reorganization of the data facilitates more efficient analysis of page authority and influence, making it particularly valuable for researchers working on ranking algorithms or studying information flow patterns across the web.
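The Page A example can be sketched in a few lines of Python. The input layout (a dictionary of source pages, each with a list of target and anchor-text pairs) and the function name are assumptions made for illustration; they do not describe the pipeline Arquivo.pt actually uses.

```python
from collections import defaultdict


def invert_links(outgoing):
    """Turn {source: [(target, anchor), ...]} into {target: [(source, anchor), ...]}."""
    incoming = defaultdict(list)
    for source, links in outgoing.items():
        for target, anchor in links:
            incoming[target].append((source, anchor))
    return dict(incoming)


# Traditional structure: each source page lists its outgoing links.
outgoing = {
    "A": [("B", "about B"), ("C", "see C"), ("D", "D home")],
    "B": [("C", "C again")],
}

# Inverted structure: each destination page lists its incoming links.
incoming = invert_links(outgoing)
for target, sources in incoming.items():
    print(target, sources)
```

In the inverted map, Page A does not appear as a key at all, because no page in this toy example links to it.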

The Arquivo.pt links dataset combines three distinct web collections:

  1. PWA9609 (1996-2009): 89 million pages capturing early Internet evolution, focused on the .pt domain. This historical collection provides insights into early web linking patterns.
  2. AWP38 (Oct-Nov 2021): 44 million pages offering a contemporary snapshot of web connectivity, with emphasis on the .pt domain while including broader Internet content.
  3. FAWP47 (Oct-Dec 2021): 8 million pages from daily captures of .pt domain content, designed to track short-term changes in link patterns.

Getting Started with the Dataset

Researchers can access the complete dataset. The data is provided in a format that supports efficient processing and analysis, making it suitable for both large-scale studies and focused investigations.

Conclusion

The release of the Arquivo.pt links dataset represents a significant contribution to the web science research community. By making this rich collection of web connectivity data freely available, we hope to facilitate innovative research and deepen our understanding of the web’s complex structure.

We encourage researchers to explore this dataset and look forward to seeing the novel insights and applications that emerge from its analysis. Whether you’re interested in developing new search algorithms, studying web topology, or investigating content relationships, this dataset provides a robust foundation for your research.

Arquivo.pt took part in the IIPC Web Archiving Conference in Oslo

Last updated on April 23rd, 2025 at 02:52 pm

Four members of the Arquivo.pt team were in Oslo, Norway, to take part in the General Assembly of the International Internet Preservation Consortium and the Web Archiving Conference, from 8 to 15 April 2025.

The National Library of Norway was the host institution for this international event. The Norwegian Web Archive is part of the Library’s mission and is held in a second location specialising in digital preservation, in the city of Mo i Rana, in the centre of the country.

The first day, 8 April, was dedicated to the General Assembly, exclusively for members of the consortium, and to the working groups in which Arquivo.pt plays an active role. The Content Development Working Group is dedicated to the creation of thematic collections and has the participation of Arquivo.pt in the ‘Street Art’ collection. The Training Working Group creates training content and training actions, such as IIPC webinars and face-to-face workshops.

The Web Archiving Conference was held on 9 and 10 April, an event open to all entities and initiatives related to web preservation and archiving.

Arquivo.pt’s contribution

Arquivo.pt presented its services and initiatives for interacting with the community, such as its collaboration with the Sines Municipal Archive in preserving content of local interest. The concern with access to content, both for researchers and for citizens in general, is an aspect that is highly appreciated by the IIPC community.

  • Arquivo.pt toolkit for web archiving – Lightning talk session 1 – Daniel Gomes – Slides
  • Arquivo.pt Query Logs – Lightning talk session 3 – Pedro Gomes – Slides
  • Collaborative collections at Arquivo.pt: four years of recordings from the city of Sines (Portugal) – Lightning talk session 4 – Ricardo Basílio – Slides, notes
  • API/Bulk access and its usage – Poster slam – Vasco Rato – Poster
  • Arquivo.pt annual awards: a glimpse since 2018 – Poster slam – Daniel Gomes – Slides

Image gallery

IIPC Web Archiving Conference 2025, Oslo


Arquivo.pt training with APDSI. Sign up!

Arquivo.pt Webinar Cycle with APDSI

Last updated on April 5th, 2025 at 01:10 pm


APDSI – Associação para a Promoção e Desenvolvimento da Sociedade da Informação (Association for the Promotion and Development of the Information Society) promoted a Cycle of Webinars on Arquivo.pt, held between March 20 and April 1, 2025.

This Webinar Cycle, dedicated to the preservation of cultural memory published on the Web, is a collaboration between APDSI and Arquivo.pt, the FCCN digital services of the Fundação para a Ciência e a Tecnologia.

Luís Vidigal, Founding Partner of APDSI, Filipa Fixe and João Tavares, Board Members, introduced the theme of each session and the Arquivo.pt team showed how the preservation of web content works, allowing organizations and citizens to access the web of the past.

The four sessions had a total of 121 participants.

Program

  • Webinar 1 – March 20 – Arquivo.pt: a new tool for researching the past. Daniel Gomes, Head of Arquivo.pt – Video, slides
  • Webinar 2 – March 25 – To publish well, to preserve well. Pedro Gomes, Arquivo.pt Collections Manager – Video, slides
  • Webinar 3 – March 27 – Access and automatic processing of information preserved from the Web through APIs. Vasco Rato, Web Developer – Video, slides
  • Webinar 4 – April 1 – Archiving the Web: do-it-yourself! Ricardo Basílio, Digital Curator – Video, slides

Registration (free but required)

Know more

Arquivo.pt took part in E-Archiving Portugal workshop

Professor José Borbinha, eArchiving workshop, 25 February 2025, at the Instituto Superior Técnico in Lisbon (José Tribolet Room)

Last updated on March 11th, 2025 at 04:22 pm


Arquivo.pt took part in the eArchiving Portugal workshop, held at the Instituto Superior Técnico on 25 February 2025, at the invitation of Professor José Borbinha, one of the first people to do web archiving in Portugal when he worked at the Biblioteca Nacional in the 1990s.

Professor José Borbinha knows better than anyone how to tell, in the first person, the small, almost epic episodes and the actions of the first ‘heroes’ that led to the creation of a web archive in Portugal. He sees Arquivo.pt as an essential service for digital preservation and for safeguarding organisations’ communication heritage.

The event had a hybrid format with 50 in-person and 270 online participants and was open to all public and private organisations concerned with digital preservation and information management in any type or format. This includes the content of websites and social networks!

The heads of municipalities and local government organisations took part in the event, responding to the call from the Direção-Geral do Livro, dos Arquivos e das Bibliotecas (DGLAB). The call was an opportunity to show how Arquivo.pt can help preserve institutional websites and comply with Portaria n.º 112/2023, de 27 de abril.

eArchiving, a European initiative born in Portugal

The eArchiving Initiative, whose main focus is digital cultural heritage, was created at a meeting of European partners in Lisbon.

‘It was precisely in this room (the José Tribolet room at the Instituto Superior Técnico) that eArchiving began eleven years ago, on 29 May 2014,’ recalled José Borbinha (INESC-ID), host and organiser of the workshop.

The eArchiving initiative is managed on behalf of the European Commission by the E-ARK Consortium, which includes Portuguese partners KEEP Solutions LDA and INESC-ID. The consortium also includes the AIT Austrian Institute of Technology GmbH, the lead partner, and the DLM Forum MTÜ.

Janet Anderson, manager of eArchiving, showed the progress made in eleven years in the field of digital preservation. The projects funded by the European Union within the consortium have resulted in the development of specifications, software, training and knowledge about digital preservation.

This was followed by presentations of contributions to digital preservation in Portugal: DGLAB, by Pedro Penteado; Centro Hospitalar São João, by Fernanda Gonçalves; Ministério da Justiça, by Alexandra Lourenço and Cristina Soares; and Arquivo.pt, by digital curator Ricardo Basílio.

Finally, Miguel Ferreira spoke on behalf of DLM Forum MTÜ, a community in which KEEP Solutions LDA participates by developing software. Taking a more technical approach, he showed how the metadata in the E-ARK packaging specifications is structured to fulfil the requirements of digital preservation.

How to use Arquivo.pt to preserve institutional websites

Arquivo.pt presentation at the eArchiving workshop

Digital preservation requires collaboration, both internally and externally between organisations, and this workshop served that purpose: sharing good practices, disseminating tools and services and connecting people.

Arquivo.pt highlighted three services from its catalogue for preserving content published on the web:

Arquivo.pt services can be used, for example, by municipalities to preserve content published on institutional websites.

Arquivo.pt training, such as webinars or face-to-face sessions, is useful for empowering organisations to take care of institutional content, including social media content that requires an alternative strategy.

Arquivo.pt presentation

Know more

Video of all speakers, soon at E-ARK

Archived Flash content can now be replayed on Arquivo.pt

Last updated on February 25th, 2025 at 03:08 pm


Arquivo.pt launched a new version named Isis on 7 January 2025.

Support for Flash using the Ruffle emulator

This new release of Arquivo.pt enables the replay of web-archived animations and interactive content in Flash format!

Flash technology was used on websites in the early years of the Web.

However, it became obsolete, and current browsers, such as Google Chrome or Edge, no longer support it, which prevents such content from being displayed. Software emulation is a way of giving access to content produced with obsolete technologies.

Arquivo.pt has therefore included Ruffle, a Flash Player emulator that lets users view Flash content that was previously inaccessible.

Web-archived Flash animations preserved by Arquivo.pt: before and after

Access the following sites on Arquivo.pt, before and after using Ruffle, bearing in mind that they are generally designed to be used on a desktop computer.

Estoril Palace Hotel, 2008


Estoril Casino, 2003


Online games, albinoblacksheep.com, 2009

Video with all examples (download, MP4)

How to add Ruffle

[Image: screenshot of the Ruffle script]
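For readers who want to try something similar on their own archived pages, the sketch below shows one way to inject the Ruffle loader into an HTML document with Python. The script URL is the CDN address given in Ruffle's public documentation; the helper name and the choice to insert just before the closing head tag are assumptions made for this example, and this is not necessarily how Arquivo.pt integrates Ruffle.

```python
import re

# CDN script tag from Ruffle's public documentation (assumed here; a
# self-hosted copy of Ruffle could be referenced instead).
RUFFLE_SCRIPT = '<script src="https://unpkg.com/@ruffle-rs/ruffle"></script>'


def add_ruffle(html):
    """Return html with the Ruffle loader inserted before </head>, once."""
    if "@ruffle-rs/ruffle" in html:
        return html  # already present, nothing to do
    match = re.search(r"</head\s*>", html, re.IGNORECASE)
    if match is None:
        # No head element found: fall back to prepending the script.
        return RUFFLE_SCRIPT + html
    return html[:match.start()] + RUFFLE_SCRIPT + html[match.start():]


page = (
    "<html><head><title>Demo</title></head>"
    "<body><embed src='intro.swf'></body></html>"
)
patched = add_ruffle(page)
print(patched)
```

Calling the helper twice leaves the page unchanged the second time, so it is safe to run over pages that have already been patched.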

Other improvements in the new version of Arquivo.pt

In addition to Flash support, the development of the service involved the following actions:

  • Implementation of middleware to issue requests to the Solr API (development environment)
  • Implementation of a relevance feedback JavaScript layer
  • Improvements to the Arquivo.pt API: it now responds with error 400 Bad Request when the parameter “q” contains a URL
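The last item can be pictured with a small, purely hypothetical validation helper. The regular expression, the function name and the returned tuple are assumptions made for illustration; the actual check performed by the Arquivo.pt API is not described in this post and may differ.

```python
import re

# Hypothetical heuristic: treat anything starting with http://, https://
# or www. as a URL. The real API may use a different rule.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def check_query(q):
    """Return an (HTTP status, reason) pair for a full-text query string."""
    if URL_PATTERN.search(q):
        return 400, "Bad Request"
    return 200, "OK"


print(check_query("eleições autárquicas"))
print(check_query("https://www.fccn.pt"))
```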

Know more

If you find any errors on Arquivo.pt, or have any suggestions, please contact us.

Prepare a work for the Arquivo.pt Award 2025!


Last updated on January 9th, 2025 at 11:09 am


Arquivo.pt challenges you to create a work based on historical information preserved from the Web. Submissions are open until May 6, 2025.

In this 8th edition of the Arquivo.pt Award, €15,000 will be awarded to the 3 best works (€10,000 for 1st place), plus 4 honorable mentions.

Know more at: arquivo.pt/award

Honorable mentions for authors and professors

  • The Público newspaper will award an Honorable Mention to works based on the Público online content preserved by Arquivo.pt. This award includes a two-year subscription to Público online.
  • The Aveiro Media Competence Center (AMCC) will award an Honorable Mention to the best work on the web archive of one or more Portuguese online media (€500).
  • Association DNS.PT will award an Honorable Mention to a professor or teacher who has encouraged the submission of works.
  • The Comissão Comemorativa 50 Anos 25 de Abril will award an Honorable Mention, accompanied by a prize of €5,000, to one of the submitted works that uses Arquivo.pt to address the theme “25 de Abril and Democracy”.

The initiative has the high patronage of the President of the Portuguese Republic.

Share and spread the word!

Help us spread the word about the Arquivo.pt Award 2025 among potential candidates!