Arquivo.pt Links Dataset: Unveiling the Web’s Hidden Structure

Last updated on September 5th, 2025 at 09:49 am

The interconnected nature of the World Wide Web has long fascinated researchers and technologists alike. Today, we are thrilled to announce the release of the Arquivo.pt Links dataset, a comprehensive collection that opens new possibilities for understanding and analyzing web connectivity patterns.

The dataset encompasses more than 139 million webpage URLs, each accompanied by crucial metadata about their incoming links – both the source URLs and their corresponding anchor texts, i.e., visible and clickable text in hyperlinks. This rich collection of interconnection data provides researchers with a unique window into the web’s underlying structure.

The importance of hyperlinks in web architecture cannot be overstated. They serve as the fundamental building blocks of web navigation and discovery, enabling both users and automated systems to traverse the vast landscape of online content.

Links formed the foundation of Google’s revolutionary PageRank algorithm, which transformed our approach to information retrieval and web search. PageRank’s fundamental insight – that a page’s importance could be measured by analyzing its incoming links – revolutionized search technology and remains influential in modern information retrieval systems.
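
To make the idea concrete, here is a minimal sketch of a PageRank-style computation by power iteration. It is an illustration only, not the algorithm as deployed by any search engine: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the in-memory dictionary layout are all assumptions made for this example.

```python
def pagerank(incoming, damping=0.85, iterations=50):
    """Rank pages from a map of destination URL -> list of source URLs.

    Hypothetical helper for illustration; the input layout is an assumption.
    """
    # Collect every page that appears as a destination or as a source.
    pages = set(incoming)
    for sources in incoming.values():
        pages.update(sources)

    # Count outgoing links per page, derived from the incoming-link map.
    out_degree = {page: 0 for page in pages}
    for sources in incoming.values():
        for source in sources:
            out_degree[source] += 1

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages without outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if out_degree[p] == 0)
        new_rank = {}
        for page in pages:
            link_share = sum(rank[s] / out_degree[s] for s in incoming.get(page, []))
            new_rank[page] = (1 - damping) / n + damping * (link_share + dangling / n)
        rank = new_rank
    return rank


# Toy graph: three pages all link to "hub".
ranks = pagerank({"hub": ["a", "b", "c"]})
```

In this toy graph, "hub" ends up with the highest score because it is the only page with incoming links.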

By making this dataset publicly available, Arquivo.pt enables researchers to explore similar innovative approaches to web analysis and search engine development. The dataset opens up numerous exciting research possibilities across multiple domains:

  • Researchers can implement and experiment with various ranking algorithms, from classic approaches like PageRank to modern machine learning-based techniques. The inclusion of anchor texts provides valuable semantic context that can enhance search relevance and document classification.
  • The dataset enables deep analysis of web topology and link structures. Researchers can investigate questions about web connectivity patterns, identify clusters of related content, and study how information spreads across the web through link networks.
  • The anchor text associated with each link offers a rich source of human-generated descriptions of web content. This data can be particularly valuable for developing and testing document summarization algorithms, semantic analysis tools, and automated classification systems.
  • For web archiving researchers, this dataset provides insights into how web pages are connected and referenced over time, offering valuable data for studying web preservation strategies and digital heritage maintenance.
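
As a small illustration of the anchor-text idea above, the sketch below counts the most frequent terms in the anchor texts that point to one page. The function name `anchor_terms`, the `(source URL, anchor text)` pair layout and the example URLs are assumptions for this example, not the dataset's actual schema.

```python
import re
from collections import Counter


def anchor_terms(incoming_links, top_n=3):
    """Return the most frequent words in the anchor texts of one destination.

    incoming_links: list of (source URL, anchor text) pairs (assumed layout).
    """
    counts = Counter()
    for _source, anchor in incoming_links:
        # Lowercase and split on non-word characters to get simple terms.
        counts.update(re.findall(r"\w+", anchor.lower()))
    return [term for term, _count in counts.most_common(top_n)]


# Hypothetical incoming links for a single page.
links = [
    ("http://example.org/1", "Lisbon weather"),
    ("http://example.org/2", "weather forecast"),
    ("http://example.org/3", "Weather"),
]
top_terms = anchor_terms(links, top_n=1)
```

Such term lists could serve as a crude, human-written description of the target page; real experiments would likely need stop-word removal and language-aware tokenisation.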

Methodology

The process begins with a temporal snapshot of web pages from a specific time period (collection). During this initial phase, our systems analyze each captured page, extracting all outgoing hyperlinks along with their associated anchor texts and capture timestamps. This creates a preliminary mapping of how pages connect to one another within our captured timeframe.

What makes this dataset particularly valuable is its inverted link structure. Rather than organizing the data around source pages and their outgoing links, we’ve created an inverted map that centers on destination pages and their incoming links. This approach is particularly useful for analyzing a page’s importance or authority within the web’s structure, as it provides immediate access to all pages that reference or point to a given URL.

Consider a traditional link structure where Page A links to Pages B, C, and D. In our inverted structure, we instead see entries for Pages B, C, and D, each listing Page A as a source of incoming links. This reorganization of the data facilitates more efficient analysis of page authority and influence, making it particularly valuable for researchers working on ranking algorithms or studying information flow patterns across the web.
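
The Page A example can be sketched in a few lines of code. This is a toy illustration of the inversion step, assuming links are held in memory as `(destination, anchor text)` pairs; the function name `invert_links` and the record layout are hypothetical and do not describe the dataset's file format.

```python
from collections import defaultdict


def invert_links(outgoing):
    """Turn source -> [(destination, anchor)] into destination -> [(source, anchor)]."""
    incoming = defaultdict(list)
    for source, links in outgoing.items():
        for destination, anchor in links:
            incoming[destination].append((source, anchor))
    return dict(incoming)


# Page A links to Pages B, C and D (anchor texts are made up).
outgoing = {
    "page-a": [("page-b", "about B"), ("page-c", "see C"), ("page-d", "D docs")],
}
incoming = invert_links(outgoing)
```

After inversion, `incoming` has one entry each for `page-b`, `page-c` and `page-d`, and each entry lists `page-a` as its source.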

The Arquivo.pt links dataset combines three distinct web collections:

  1. PWA9609 (1996-2009): 89 million pages capturing early Internet evolution, focused on the .pt domain. This historical collection provides insights into early web linking patterns.
  2. AWP38 (Oct-Nov 2021): 44 million pages offering a contemporary snapshot of web connectivity, with emphasis on the .pt domain while including broader Internet content.
  3. FAWP47 (Oct-Dec 2021): 8 million pages from daily captures of .pt domain content, designed to track short-term changes in link patterns.

Getting Started with the Dataset

Researchers can access the complete dataset. The data is provided in a format that supports efficient processing and analysis, making it suitable for both large-scale studies and focused investigations.

Conclusion

The release of the Arquivo.pt links dataset represents a significant contribution to the web science research community. By making this rich collection of web connectivity data freely available, we hope to facilitate innovative research and deepen our understanding of the web’s complex structure.

We encourage researchers to explore this dataset and look forward to seeing the novel insights and applications that emerge from its analysis. Whether you’re interested in developing new search algorithms, studying web topology, or investigating content relationships, this dataset provides a robust foundation for your research.

Arquivo.pt took part in the IIPC Web Archiving Conference in Oslo

Last updated on July 4th, 2025 at 08:32 am

Four members of the Arquivo.pt team were in Oslo, Norway, to take part in the General Assembly of the International Internet Preservation Consortium and the Web Archiving Conference, from 8 to 15 April 2025.

The National Library of Norway was the host institution for this international event. The Norwegian Web Archive is part of the Library’s mission and is held in a second location specialising in digital preservation, in the city of Mo i Rana, in the centre of the country.

The first day, 8 April, was dedicated to the General Assembly, exclusively for members of the consortium, and to the working groups in which Arquivo.pt plays an active role. The Content Development Working Group is dedicated to the creation of thematic collections and has the participation of Arquivo.pt in the ‘Street Art’ collection. The Training Working Group creates training content and training actions, such as IIPC webinars and face-to-face workshops.

The Web Archiving Conference was held on 9 and 10 April, an event open to all entities and initiatives related to web preservation and archiving.

Arquivo.pt’s contribution

Arquivo.pt presented its services and initiatives for interacting with the community, such as its collaboration with the Sines Municipal Archive in preserving content of local interest. The concern with access to content, both for researchers and for citizens in general, is an aspect that is highly appreciated by the IIPC community.

  • Arquivo.pt toolkit for web archiving – Lightning talk session 1 – Daniel Gomes – Slides, video
  • Arquivo.pt Query Logs – Lightning talk session 3 – Pedro Gomes – Slides, video
  • Collaborative collections at Arquivo.pt: four years of recordings from the city of Sines (Portugal) – Lightning talk session 4 – Ricardo Basílio – Slides, notes, video
  • API/Bulk access and its usage – Poster slam – Vasco Rato – Poster
  • Arquivo.pt annual awards: a glimpse since 2018 – Poster slam – Daniel Gomes – Slides

Image gallery

IIPC Web Archiving Conference 2025, Oslo


Arquivo.pt training with APDSI. Sign up!

Last updated on April 5th, 2025 at 01:10 pm

APDSI – Associação para a Promoção e Desenvolvimento da Sociedade da Informação (Association for the Promotion and Development of the Information Society) promoted a Cycle of Webinars on Arquivo.pt, held between March 20 and April 1, 2025.

This Webinar Cycle, dedicated to the preservation of cultural memory published on the Web, is a collaboration between APDSI and Arquivo.pt, the FCCN digital services of the Fundação para a Ciência e a Tecnologia.

Luís Vidigal, Founding Partner of APDSI, Filipa Fixe and João Tavares, Board Members, introduced the theme of each session and the Arquivo.pt team showed how the preservation of web content works, allowing organizations and citizens to access the web of the past.

The four sessions had a total of 121 participants.

Program

  • Webinar 1 – March 20 – Arquivo.pt: a new tool for researching the past. Daniel Gomes, Head of Arquivo.pt – Video, slides
  • Webinar 2 – March 25 – To publish well, to preserve well. Pedro Gomes, Arquivo.pt Collections Manager – Video, slides
  • Webinar 3 – March 27 – Access and automatic processing of information preserved from the Web through APIs. Vasco Rato, Web developer – Video, slides
  • Webinar 4 – April 1 – Archiving the Web: do-it-yourself! Ricardo Basílio, Digital Curator – Video, slides

Registration (free but required)

Know more

Arquivo.pt took part in E-Archiving Portugal workshop

Professor José Borbinha, eArchiving workshop, 25 February 2025, at the Instituto Superior Técnico in Lisbon (José Tribolet Room)

Last updated on March 11th, 2025 at 04:22 pm

Arquivo.pt took part in the eArchiving Portugal workshop, which was held at the Instituto Superior Técnico on 25 February 2025, at the invitation of Professor José Borbinha, one of the first people to do web archiving in Portugal when he worked at the Biblioteca Nacional in the 1990s.

Professor José Borbinha knows better than anyone how to tell, in the first person, the small, almost epic episodes and the actions of the first ‘heroes’ that led to the creation of a web archive in Portugal. He sees Arquivo.pt as an essential service when it comes to digital preservation and safeguarding organisations’ communication heritage.

The event had a hybrid format with 50 in-person and 270 online participants and was open to all public and private organisations concerned with digital preservation and information management in any type or format. This includes the content of websites and social networks!

Heads of municipalities and local government organisations took part in the event, responding to the call from the Direção-Geral do Livro, dos Arquivos e das Bibliotecas (DGLAB). Their participation was an opportunity to show how Arquivo.pt can help preserve institutional websites and comply with Portaria n.º 112/2023, de 27 de abril.

eArchiving, a European initiative born in Portugal

The main focus of the eArchiving Initiative is digital cultural heritage. It was created at a meeting of European partners in Lisbon.

‘It was precisely in this room (the José Tribolet room at the Instituto Superior Técnico) that eArchiving began eleven years ago, on 29 May 2014,’ recalled José Borbinha (INESC-ID), host and organiser of the workshop.

The eArchiving initiative is managed on behalf of the European Commission by the E-ARK Consortium, which includes Portuguese partners KEEP Solutions LDA and INESC-ID. The consortium also includes the AIT Austrian Institute of Technology GmbH, the lead partner, and the DLM Forum MTÜ.

Janet Anderson, manager of eArchiving, showed the progress made in eleven years in the field of digital preservation. The projects funded by the European Union within the consortium have resulted in the development of specifications, software, training and knowledge about digital preservation.

This was followed by presentations of contributions to digital preservation in Portugal: DGLAB, by Pedro Penteado; Centro Hospitalar São João, by Fernanda Gonçalves; Ministério da Justiça, by Alexandra Lourenço and Cristina Soares; and Arquivo.pt, by digital curator Ricardo Basílio.

Finally, Miguel Ferreira spoke on behalf of DLM Forum MTÜ, a community in which KEEP Solutions LDA participates by developing software. Taking a more technical approach, he showed how the metadata in the E-ARK packaging specifications is structured to fulfil the requirements of digital preservation.

How to use Arquivo.pt to preserve institutional websites

Digital preservation requires collaboration, both internally and externally between organisations, and this workshop served that purpose: sharing good practices, disseminating tools and services and connecting people.

Arquivo.pt highlighted three services from its catalogue for preserving content published on the web:

Arquivo.pt services can be used, for example, by municipalities to preserve content published on institutional websites.

Arquivo.pt training, such as webinars or face-to-face sessions, is useful for empowering organisations to take care of institutional content, including social media content that requires an alternative strategy.

Arquivo.pt presentation

Know more

Video of all speakers, soon at E-ARK

Archived Flash content can now be replayed on Arquivo.pt

Last updated on February 25th, 2025 at 03:08 pm

Arquivo.pt launched a new version named Isis on 7 January 2025.

Support for Flash using the Ruffle emulator

This new release of Arquivo.pt enables the replay of web-archived animations and interactive content in Flash format!

Flash technology was used on websites in the early years of the Web.

However, it became obsolete and current browsers, such as Google Chrome or Edge, no longer support it, which prevents such content from being displayed. Software emulation is a way of giving access to content produced with obsolete technologies.

Arquivo.pt has therefore included Ruffle, a Flash Player emulator that allows you to view Flash content that was previously inaccessible to users.

Web-archived Flash animations preserved by Arquivo.pt: before and after

Access the following sites on Arquivo.pt, before and after using Ruffle, bearing in mind that they are generally designed to be used on a desktop computer.

Estoril Palace Hotel, 2008

Estoril Casino, 2003

Online games, albinoblacksheep.com, 2009

Video with all examples (download, MP4)

How to add Ruffle

[Screenshot: Ruffle script]

Other improvements in the new version of Arquivo.pt

In addition to Flash support, the development of the service involved the following actions:

  • Implementation of middleware to issue requests to the Solr API (development environment)
  • Implementation of a relevance feedback JavaScript layer
  • Improvements to the Arquivo.pt API: it now responds with a 400 Bad Request error when the “q” parameter contains a URL

Know more

If you find any errors on Arquivo.pt, or have any suggestions, please contact us.

Prepare a work for the Arquivo.pt Award 2025!

Last updated on January 9th, 2025 at 11:09 am

Until May 6, 2025, Arquivo.pt challenges you to create a work based on historical information preserved from the Web.

In this 8th edition of the Arquivo.pt Award, €15,000 will be awarded to the 3 best works (€10,000 for 1st place), plus 4 honorable mentions.

Know more at: arquivo.pt/award

Honorable mentions for authors and professors

  • The Público newspaper will award an Honorable Mention to works based on the Público online content preserved by Arquivo.pt. This award includes a two-year subscription to Público online.
  • The Aveiro Media Competence Center (AMCC) will award an Honorable Mention to the best work on the web archive of one or more Portuguese online media (€500).
  • Association DNS.PT will award an Honorable Mention to a professor or teacher who has encouraged the submission of works.
  • The Comissão Comemorativa 50 Anos 25 de Abril will award an Honorable Mention, accompanied by a prize of €5,000, to one of the submitted works that uses Arquivo.pt to address the theme ‘25 de Abril and Democracy’.

The initiative has the high patronage of the President of the Portuguese Republic.

Share and spread the word!

Help us spread the word about the Arquivo.pt Award 2025 among potential candidates!

Arquivo.pt won the Digital Transformation 2024 award

Last updated on January 6th, 2025 at 03:30 pm

Arquivo.pt, the digital service of the Fundação para a Ciência e a Tecnologia (FCT)-Unidade FCCN, was one of the winners of the Prémio Transformação Digital da APDSI 2024 (APDSI 2024 Digital Transformation Award).

Arquivo.pt was recognised in the ‘Promoting a more Innovative and Digital Society’ category.

This category highlights the innovative aspect of organisations’ digital transition.

The manager of Arquivo.pt, Daniel Gomes, and the web developer in charge of Arquivo.pt’s collections, Pedro Gomes, were present at the ceremony, which took place in Oeiras on 3 December 2024.

Arquivo.pt, a service for digital transformation

Daniel Gomes explains how a web preservation service contributes to a more sustainable information society, in a video prepared for the awards ceremony.

Digital Transformation Award 2024 APDSI

The Associação para a Promoção e Desenvolvimento da Sociedade da Informação (APDSI) (Association for the Promotion and Development of the Information Society) promotes the use of technology in favour of citizens, their inclusion and participation in the development of society.

The Digital Transformation Award (4th edition in 2024) aims to ‘recognise and disseminate best practices in the adoption and implementation of information and communication technologies (ICT), with a view to a more digital society sustained by public and private institutions that are more efficient and closer to the citizen’ (APDSI website).

The 2024 edition of this award received 33 nominations in 3 categories:

  • Effectiveness/Efficiency of Organisations
  • Proximity to Citizens and a More Inclusive Society
  • Promoting a more Innovative and Digital Society

See all the finalist projects

Image gallery


World Digital Preservation Day celebrated at Portuguese National Archive Torre do Tombo

Last updated on December 11th, 2024 at 05:25 pm

Let’s talk about preservation and access!

On November 7, 2024, the New Paths to Information Preservation and Access Meeting was held, organised jointly by Arquivo.pt and the Arquivo de Ciência e Tecnologia, the former located on Avenida do Brasil and the latter on Avenida D. Carlos I, in Lisbon, both services of the Fundação para a Ciência e a Tecnologia (FCT).

The aim of this joint FCT team was to bring together institutions that have to manage information, both in traditional formats such as paper and in digital formats, so that they could share their experiences.

The meeting had 243 participants and 29 speakers throughout the day. Nine of the twenty-seven presentations were submitted for a session called ‘Community Space’.

The Portuguese Association of Librarians, Archivists, Information and Documentation Professionals APBAD made an important contribution to publicising the event to the community and was present with an information stand.

An international day dedicated to digital preservation

World Digital Preservation Day was celebrated on the same day. It is an initiative of the Digital Preservation Coalition (DPC) with which Arquivo.pt has been associated since the first edition in 2017. Jane Winters, Chair of the DPC, sent a video message to join this initiative in Portugal.

Digital information was the main theme of the speeches. At the opening, the Head of the DGLAB – Direção Geral do Livro, dos Arquivos e das Bibliotecas (Directorate for Books, Archives and Libraries), Silvestre Lacerda, recalled that the DGLAB was a pioneer among public organisations in tackling the issue of digital preservation. FCT vice-president Francisco Santos emphasised the economic value of data for scientific research.

Digital preservation is not just about technology, as Henrique São Mamede, Professor at Universidade Aberta, INESC TEC, said at the opening conference. It’s also about people, the human factor, the environment outside organisations and new sensibilities such as sustainability and ecology. Hence the importance of creating bridges, of using Artificial Intelligence, for example, in conjunction with ethics. Presentation.

Throughout the day, four panels brought together presentations on various preservation contexts such as the digitisation of sound, image and video, research data, regulatory frameworks, management systems for digitised or born-digital information, dissemination and access, and use in academic research.

Panel 1: Digital preservation initiatives and realities

The first panel was moderated by João Gomes, Director of Advanced Services at FCT, and brought to the table the diversity of contexts in which the issue of preservation and access arises. Here we highlight one aspect of each presentation and invite you to follow the links to learn more about these initiatives.

Moisés Rockembach, Professor at the University of Coimbra and co-author of the book Arquivamento da web e preservação digital (Web archiving and digital preservation), spoke about the first initiatives carried out in Brazil to preserve content published on the Web. The websites of the candidates in the Brazilian elections, for example, are ephemeral by nature but have become material for historiographical research by being preserved in a web archive. From a more theoretical perspective, he addressed the issue of memory. Preserving the web allows us to bring to light events that were only broadcast on digital media such as the web and, in this sense, postpones the end of history expressed in the metaphor of the ‘Dark Age’, a time of darkness, empty of information. Presentation.

Pedro Penteado, Director of Archival and Standardisation Services, presented a set of instruments that the DGLAB has developed, such as the Macro Estrutura Funcional (MEF) (Macro Functional Structure), the Avaliação Suprainstitucional da Informação Arquivística (ASIA) (Super-institutional Assessment of Archival Information) project and the Lista Consolidada na Plataforma CLAV (Consolidated List on the CLAV Platform), which allow the different public administration bodies to comply with legislation and standardise classification and assessment practices. He recalled that these tools are flexible enough to meet the specific needs of organisations. Presentation.

Pedro Príncipe, Head of the Documentation Services Division at the University of Minho, spoke about research data. The preservation of and access to data is fundamental to the production of science. To achieve this, it is necessary to combine initiatives, work in networks and create communities of practice. The GDI Forum is an example of how useful it is for professionals to meet. Certification is highly recommended: the University of Minho has certified its repository, which adds to its robustness and helps achieve the FAIR (Findable, Accessible, Interoperable, and Reusable) objectives. Presentation.

Hilário Lopes, RTP’s Deputy Director of Institutional Relations and Archive, described the path to digitalisation that has completely changed the way we access the archive of RTP (Portuguese Radio and Television). Until 2001, digitisation was done on request; from that year onwards, the contents were digitised en masse. Since 2007, the contents have been accessible in digital format, which has facilitated access and use. RTP Memória and Portal RTP are two examples of access to the audiovisual heritage of public radio and television. Presentation.

Panel 2: Preserving and reusing Web information

The theme of web archiving was highlighted in the second panel, moderated by Daniel Gomes, manager of Arquivo.pt and its initiator on 8 November 2007.

Ricardo Basílio, digital curator at Arquivo.pt, presented the online exhibition ‘Memories of 25 April on the Internet’, created in collaboration with the 50 Years of 25 April Commemorative Commission, based on preserved web pages. Select pages about the 25 April celebrations across the country were highlighted through a guided tour of the exhibition. Presentation.

Joana Paulino, historian and researcher at the Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa, showed how technologies contribute to the development of studies in areas traditionally far removed from technologies, based on her experience at the Digital Humanities Laboratory. Presentation.

António Campos and Hélder Mestre, from the Arquivo Municipal de Sines (Sines City Council Archive), showed how, since 2020, they have been preserving web content of local interest in collaboration with Arquivo.pt. They record web pages with ArchiveWeb.page, a Webrecorder tool, send a copy of the files to Arquivo.pt, transcribe images and videos verbatim, and also use PDF as the most traditional format for archiving news. The issue of accessibility to content for people with special needs is fundamental in the preservation process. Presentation.

António Ramiro and Carmen Fonseca, winners of the Arquivo.pt 2024 Award, presented their work Noticioso.pt. It’s a project that reuses information from Arquivo.pt to challenge citizens’ critical capacity. Presentation.

Finally, Daniel Gomes emphasised how much has been done in the last 17 years in the field of web preservation, to the point where we now have a functional service that everyone can use. As a testimony to those early days, we found a page from Diário Digital newspaper from November 2006.

Panel 3: Preserving the present and safeguarding the future

The third panel was moderated by Paula Meireles, Coordinator of the Archive, Documentation and Information service at the Foundation for Science and Technology (FCT) and brought four other realities to the table.

Filipe Guimarães Silva, Executive Director of the Fundação Mário Soares e Maria Barroso, and António Coelho, Digital Reproduction Coordinator, delved into the technical issues related to digitisation, based on the case of the foundation’s collection, which is also accessible on the Casa Comum portal. Quality control is the most important factor in obtaining a preservable digital version. You don’t always need expensive technology to get good results. It is essential to follow standards and ensure that quality metadata is generated. Presentation.

Fernanda Gonçalves, Director of Archives at the São João Local Health Unit, showed how the São João Digital Clinical Repository is transforming access to clinical files with advantages in terms of both speed and quality of information. The information management model at this huge institution poses immense challenges for preservation and continued access, as it involves creating interoperability between multiple systems. What’s more, this is sensitive data with different levels of access. This is where the archive comes in as an asset. The archive service must rise to the challenges of any organisation in order to serve all its ‘clients’. Presentation.

Augusto Ribeiro, head of the Documentation and Information Management Service at UPdigital, University of Porto, explained how the university collection is being preserved. From the treatment of paper documents to their digitisation and inclusion in the digital repository, it’s important to guarantee their robustness. This work has been progressive and systematic, i.e. it follows a plan where all the pieces fit together as the work is carried out. Presentation.

Pedro Penteado (DGLAB) presented the ‘Digital Preservation Guide’ project that is being developed in collaboration with the Asociación Latinoamericana de Archivos (ALA). This initiative will structure content on digital preservation in a pragmatic way. Soon, professionals will have a knowledge base to consult whenever they carry out digital preservation activities. Presentation.

Panel 4: Community space

The fourth panel, moderated by Paula Carvalho, from FCT’s Science and Technology Archive, included 9 short presentations submitted by the community. Below, we present the abstracts submitted by the authors:

Celebrating the 50th anniversary of 25 de Abril at the closing session

Maria Inácia Rezola, Executive Commissioner of the Mission Structure for the Commemorations of the 50th Anniversary of the Revolution of 25 de Abril 1974, presented a historical perspective of the impact of 25 April on Portuguese society, namely through the way it is commemorated throughout the country.

She presented the work that the Commission has been doing to identify archives, documentation centres and collections of all kinds with material about 25 April. There are public collections that are practically unknown, and others that are held in private collections. Inventorying and publicising them is therefore the first step in promoting study and knowledge about 25 de Abril.

Finally, Maria Inácia Rezola announced the award of the Honourable Mention ‘25 de Abril and Democracy’, together with a prize of 5,000 euros, in the Arquivo.pt Award 2025, to the best work on 25 April that uses Arquivo.pt.

Image gallery

World Digital Preservation Day Meeting 2024 #WDPD2024

Carmen Fonseca, O Noticioso.pt
Ricardo Basílio, Arquivo.pt -FCT
Hélder Mestre e António Campos, Arquivo Municipal de Sines
Hélder Mestre and António Campos, Arquivo Municipal de Sines
Ricardo Basílio, Arquivo.pt - FCT
Joana Paulino, NOVA-FCSH
António Ramiro, Noticioso.pt
2nd Panel - António Ramiro and Carmen Fonseca, Noticioso.pt
Encontro Novos Caminhos para a Preservação e o Acesso à Informação
2nd Panel - Encontro Novos Caminhos para a Preservação e o Acesso à Informação
1st Panel - Encontro Novos Caminhos para a Preservação e o Acesso à Informação
Pedro Príncipe, Universidade do Minho
Moisés Rockembach, Universidade de Coimbra
Hilário Lopes, Arquivo da RTP
Arquivo.pt stand - Encontro Novos Caminhos para a Preservação e o Acesso à Informação
Pedro Penteado, DGLAB
Moisés Rockembach, Universidade de Coimbra, and Ricardo Basílio, Arquivo.pt
Henrique São Mamede, Universidade Aberta, INESC TEC
Opening Session - Silvestre Lacerda, Director of DGLAB, and Francisco Santos, Vice-President of FCT
3rd Panel - Paula Meireles, FCT
Opening Session - João Gomes, Director of Advanced Services at FCT
Opening Session - Jane Winters, Digital Preservation Coalition (DPC)
Augusto Ribeiro, Universidade do Porto, UPDigital
3rd Panel - Encontro Novos Caminhos para a Preservação e o Acesso à Informação

Credits: photos by Leonor Arrimar (FCT). Also included are some images taken with mobile devices and sent in by participants.

Video


Video by Leonor Arrimar (FCT)

Know more

Previous editions of World Digital Preservation Day with Arquivo.pt

Arquivo.pt received the award for Best Central Public Administration Digital Project

Last updated on October 31st, 2024 at 12:42 pm

Arquivo.pt, a digital service of the FCCN Unit of the Foundation for Science and Technology (FCT), was one of the winners of the 2024 edition of the Navegantes XXI Awards.

Arquivo.pt won the award in the category of “Best Digital Project of Central Public Administration”.

This category annually recognizes a project that has contributed “unequivocally to the development of the Central Public sector through digital means, as well as the Digital Economy in Portugal”.

Daniel Gomes, Head of Arquivo.pt, Salomé Branco, FCCN Deputy General Coordinator, and Francisco Santos, FCT Vice-President, received the award at the ceremony held on October 24 at the Técnico Innovation Center in Lisbon.

Navegantes XXI Awards

The Navegantes XXI (Navigators XXI) Awards are an annual initiative by ACEPI – Digital Economy Association, created with the mission “To Promote and Develop the Digital Economy in Portugal”.

The competition rewards the best of the Digital Economy and Society in Portugal in its most diverse aspects. It currently comprises 20 categories that reward the most innovative and digitally transformative Portuguese projects, ideas and institutions. Three prizes are also awarded for special categories outside the competition.

Meet all the winners.

Save websites before they disappear with the Browsertrix Crawler tool

Last updated on December 11th, 2024 at 12:16 pm

September marks the beginning of a new working year and also the end of many websites. When sites are remodelled or shut down without a good copy of their content being made, historic websites are lost unnecessarily.

There are tools that allow the organisations that manage websites to save them immediately. In addition, Arquivo.pt provides an on-demand, high-quality website archiving service to partner organisations or in occasional collaborations.

This article aims to highlight the Browsertrix Crawler used by Arquivo.pt, without excluding other tools, which can be useful to information managers and IT departments.

Use of Browsertrix Crawler by Arquivo.pt for high-quality collections

Browsertrix Crawler is a tool that lets you record entire websites and lists of web pages automatically and in a format compatible with web archives.

Arquivo.pt uses the Browsertrix Crawler to make high-quality site collections (RAQs) at the request of the community: for example, when a site is about to be shut down, when it is going to undergo remodelling, or periodically, to maintain a good history of a particular site.

An illustrative case is the Almada City Council website, recorded in April 2021 at the request of the Municipal Archive. Another case is the website of the newspaper Notícias de Leiria, which was recorded before its closure in December 2023.

Requests for high-quality collections (RAQs) to Arquivo.pt are increasingly frequent: 77 requests from January to September 2024. This is a sign that there is greater concern about the preservation of web content.

What you need to use Browsertrix Crawler locally

The group that developed the Browsertrix Crawler, Webrecorder.net, led by Ilya Kreymer, has the motto ‘web archiving for all’. Its tools make it possible to record the Internet in a decentralised way and on a small scale.

The Browsertrix Crawler is available and can be installed on your computer for small collections.

The basic version of Browsertrix that Arquivo.pt is using requires basic command line knowledge, which is the only barrier for non-experts.

In Arquivo.pt’s own experience, using the Browsertrix Crawler is easy in multidisciplinary teams, where there is always someone who knows enough about Linux commands to provide occasional support.

Demonstration of recording entire websites on your own computer

To promote the preservation of sites in Web archive format, Arquivo.pt presents a use case for the Browsertrix Crawler. It’s useful for anyone who wants to deepen their knowledge and practice of saving sites in a local environment.
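
As a rough illustration of the command-line step mentioned above, the following is a minimal sketch, not Arquivo.pt's own procedure. It assumes Docker is installed and that the public `webrecorder/browsertrix-crawler` image is used; the helper name `build_crawl_command`, the example URL, the collection name and the `crawls` output folder are all hypothetical. The options shown (`--url`, `--collection`, `--generateWACZ`) are taken from the Browsertrix Crawler documentation and should be checked against the version you install.

```python
import shlex
from pathlib import Path


def build_crawl_command(url, collection, output_dir="crawls"):
    """Build the docker arguments for a single-site crawl (hypothetical helper)."""
    # Local folder that is mounted into the container to receive the recordings.
    crawls = Path(output_dir).resolve()
    return [
        "docker", "run",
        "-v", f"{crawls}:/crawls/",         # mount the local output folder
        "webrecorder/browsertrix-crawler",  # public image (assumed)
        "crawl",
        "--url", url,                       # seed URL where the crawl starts
        "--collection", collection,         # name given to this collection
        "--generateWACZ",                   # also package the result as a WACZ file
    ]


if __name__ == "__main__":
    # Example values only; replace with the site you want to preserve.
    cmd = build_crawl_command("https://example.org/", "example-site")
    # Print the command so it can be reviewed before being run in a terminal.
    print(shlex.join(cmd))
```

The sketch only prints the command so that it can be reviewed first; to start the crawl, paste the printed line into a terminal or pass the list to `subprocess.run(cmd, check=True)`.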

Other tools used by Arquivo.pt to record content

Brozzler: a tool for improving the history of daily and monthly collection sites

Brozzler is a similar tool to Browsertrix Crawler in that it also bases its recording on a browser. It is used and maintained by the Internet Archive.

Arquivo.pt has been using Brozzler since at least 2018 to record web pages with interactive content and for high-quality collections (RAQs).

Lists of up to 200 sites are successfully recorded by Brozzler. For example, the 125 daily collection sites (FAWP) are recorded with Brozzler at the beginning of each month. During the month, another list of 75 monthly collection sites (MAWP) is recorded using Brozzler.

At the end of 2023, Arquivo.pt compared Brozzler and Browsertrix Crawler and chose to keep these two tools.

Heritrix, pywb and ArchiveWeb.page: tools for thousands of sites or one page

The Heritrix crawler is Arquivo.pt’s main recording tool. It is used on huge lists of websites, such as the .PT domain sites, to which other Portuguese sites are added, totalling more than half a million.

At the opposite end of the scale is the ArchiveWeb.page extension, which Arquivo.pt uses for short page-by-page recordings and also for the “Web archiving: do-it-yourself!” training course.

To complete the list of recording tools used by Arquivo.pt, mention should be made of pywb, which comes into play, for example, when an Arquivo.pt user uses the ‘Complete the page’ functionality or the ArchivePageNow service.