MOOC on Arquivo.pt and web archives launched and open to the community

May 16, 2025 by Ricardo Basílio

Last updated on May 21st, 2025 at 11:42 am

The online training programme on Arquivo.pt, entitled The Web of the Past: Preservation and Research, has been launched and is open free of charge on the NAU platform to anyone who wants to deepen their knowledge of web archiving and Arquivo.pt services.

Daniel Gomes, manager of Arquivo.pt, who developed this training programme, announced it first-hand at the Faculdade de Letras da Universidade de Coimbra, during the workshop Digital preservation: tools as practices, held on May 7, 2025.

Registration is open on the NAU platform for the web archiving MOOC

NAU – Sempre a Aprender is the e-learning platform of FCCN, the digital services unit of the Foundation for Science and Technology (FCT). The NAU initiative focuses on supporting the publication and promotion of content in the Massive Open Online Courses (MOOC) format in Portuguese.

The aim of this programme is to develop skills in searching the Web’s digital memory, with an emphasis on using Arquivo.pt both in everyday life and in the context of studies and research.

The programme is divided into four courses:

Preservação da web e arquivos (Preserving the web and archives)
Pesquisar e aceder ao passado com o Arquivo.pt (Search and access the past with Arquivo.pt)
Bem publicar para bem preservar (Publish well to preserve well)
Casos de uso do Arquivo.pt (Arquivo.pt use cases)

No special requirements are needed, apart from a computer with Internet access and a browser such as Google Chrome or Internet Explorer.

Spread the word: arquivo.pt/mooc

Know more

Interview Internet Day, May 17, published on NAU website

Arquivo.pt training with APDSI. Sign up!

March 13, 2025 by Ricardo Basílio

Arquivo.pt Webinar Cycle with APDSI

Last updated on April 5th, 2025 at 01:10 pm

APDSI – Associação para a Promoção e Desenvolvimento da Sociedade da Informação (Association for the Promotion and Development of the Information Society) promoted a Cycle of Webinars on Arquivo.pt, held between March 20 and April 1, 2025.

This Webinar Cycle, dedicated to the preservation of cultural memory published on the Web, is a collaboration between APDSI and Arquivo.pt, the FCCN digital services of the Fundação para a Ciência e a Tecnologia.

Luís Vidigal, Founding Partner of APDSI, Filipa Fixe and João Tavares, Board Members, introduced the theme of each session and the Arquivo.pt team showed how the preservation of web content works, allowing organizations and citizens to access the web of the past.

The four sessions had a total of 121 participants.

Program

Webinar 1 – March 20 – Arquivo.pt: a new tool for researching the past. Daniel Gomes, Head of Arquivo.pt – Video, slides
Webinar 2 – March 25 – To publish well, to preserve well. Pedro Gomes, Arquivo.pt Collections Manager – Video, slides
Webinar 3 – March 27 – Access and automatic processing of information preserved from the Web through APIs. Vasco Rato, Web developer – Video, slides
Webinar 4 – April 1 – Archiving the Web: do-it-yourself! Ricardo Basílio, Digital Curator – Video, slides

Registration (free but required)

Know more

IPL – Politécnico de Lisboa organised a series of webinars with Arquivo.pt

July 15, 2024 (updated July 16, 2024) by Ricardo Basílio

IPL – Politécnico de Lisboa, through its Distance Learning Group (EaD@IPL), organised a series of webinars for its community dedicated to Arquivo.pt and the preservation of content published on the Internet.

This initiative was attended by IPL – Politécnico de Lisboa lecturers and researchers, as well as people linked to the institution’s communications department.

The cycle of webinars took place in three sessions, between May and July 2024, and followed the training programme that Arquivo.pt has been offering for several years.

Presentation materials

1st webinar – Arquivo.pt: a new tool for researching the past. Publish well to preserve well. June 5, 2024.
- Video
- Slides, 1st part; slides, 2nd part
2nd webinar – Automatic processing of information preserved from the Web. June 19, 2024.
- Video
- Slides
3rd webinar – Web archiving: Do-it-yourself! July 3, 2024.
- Video
- Slides

Why training on web preservation is important

Archiving content published on the web and using a web archive on a day-to-day basis is an unusual practice, largely due to the community’s lack of knowledge about the existence and operation of Arquivo.pt.

For example, in this cycle of webinars with the IPL – Politécnico de Lisboa, participants were given tools that allow them to use the web archive immediately and creatively, such as the SavePageNow service, the historical content search service and, for use in interdisciplinary teams, Application Programming Interfaces (APIs).

As a result of this series of webinars, the collaboration between the IPL – Politécnico de Lisboa and Arquivo.pt was strengthened, with a view to preserving its institutional websites and other interesting content that is available on various online media (news, events, references to teachers, researchers and students).

Higher education library mobility program brings professionals to Arquivo.pt

May 29, 2024 (updated May 30, 2024) by Ricardo Basílio

Arquivo.pt headquarters, operated by FCCN FCT, in Lisbon.

On May 24, FCCN welcomed professionals from Higher Education Libraries (HEL) for the first time as part of My library is your library, a program promoted by the Higher Education Libraries Working Group (GT-BES) of the Portuguese Association of Librarians, Archivists, Documentalists and Information Professionals (BAD).

This mobility program organizes short-term visits for exchanging experiences and getting hands-on contact with good practices, fostering collaboration and knowledge of Portuguese higher education institutions (HEIs) among professionals in the field.

Advanced services for knowledge

In this first edition of the program at FCCN, the participating colleagues (3 professionals from the University of Lisbon and 1 from the Catholic University of Porto) were offered a tour of the digital support services for higher education institutions operated by FCCN-FCT.

Some services are familiar to information professionals, such as B-On and RCAAP. Others are back-office services and therefore less visible, but they are essential for higher education institutions. Examples include Eduroam, which guarantees access to the Internet, RCTSaai for authentication, and RCTS CERT for responding to security incidents.

Highlights include the Arquivo.pt and NAU services

The day highlighted Arquivo.pt and the NAU Platform, two services in the field of knowledge that are available to higher education institutions and also to society.

The Arquivo.pt team showed the back office of this Internet preservation service in Portugal and carried out a practical exercise in recording and integrating content into the web archive.

The NAU Platform is a platform for MOOCs (Massive Open Online Courses) created with the aim of democratizing knowledge, promoting digital literacy, enabling education and training for broad communities of users, particularly the Portuguese and Lusophone population.

More recently, with its integration into the North American platform edx.org, it has also been made available to all potential Portuguese-speaking trainees around the world. Participants in the program were shown how to build a MOOC on the edX platform.

The program also included a visit to the Data Center and the professional television studio at the FCCN.

Visit by participants in the Higher Education Libraries mobility program to the FCCN TV Studio

To know more

Program announced

A week of job shadowing at Arquivo.pt: from Prague to Lisbon

May 27, 2024 (updated May 28, 2024) by Ricardo Basílio

By: Marie Haškovcová and Luboš Svoboda, Webarchiv, National Library of the Czech Republic, May 13th to 17th, 2024.

A visit within the EU Erasmus+ programme

Thanks to the EU Erasmus+ programme, focused on adult education – staff mobility, we were able to spend a week job shadowing at the Portuguese web archive Arquivo.pt and compare the strategies of the Czech web archive – Webarchiv with the approaches of our Portuguese colleagues.

In both cases, these are archives focused on national (Czech and Portuguese) content on the Internet.

Arquivo.pt

While the Czech web archive is part of the National Library of the Czech Republic, the Portuguese archive (Arquivo.pt) is part of the FCCN, under the FCT – Foundation for Science and Technology, which aims to contribute to the development of science, technology and knowledge.

FCT provides IT services to the Portuguese higher education and research system, as well as high-speed internet connectivity. The institutional background of both archives is also reflected in the specifics of their concepts.

The visit included a presentation of the team and the campus and departmental spaces, a presentation of the activities of both archives and a discussion of the different aspects of our work – technical and curatorial tools, technologies and processes, the legislative environment and ethical issues, data storage, some services, research activities, perspectives and future plans.

The Czech web archive

The Czech web archive was founded in 2000; its oldest archival copies date back to 2001, and it currently holds more than 580 TB of data. Like Arquivo.pt, it harvests content on a national domain based on a list of URL addresses obtained from its provider. In its acquisition strategy, it supplements these so-called comprehensive harvests with thematic and selective harvests.

Topic collections relate to a specific topic or event, can be one-off or continuously built, and combine manually selected and automatically scraped resources. Selective harvests are intended for long-term harvesting, have detailed cataloging records that are part of the Czech national bibliography, and are licensed, so their archival copies are freely available through the catalogue.

Among Webarchiv’s research activities, we presented our project aimed at detecting so-called dead webs through the Extinct Websites application and creating a database to serve as a basis for monitoring broader changes in the Czech web, and the WACloud project aimed at extracting big data from the web archive.

Exchanging knowledge and experience

Among the Portuguese projects, we were particularly interested in CitationSaver, and we also discussed the Memorial project, the harvesting of the Portuguese Wikipedia, and the activities of the Portuguese archive related to education in web archiving (training courses).

The meeting was enriched by the discussion of specific topic collections.

The Czech net art collection documents digital art and its transformation in the online space, providing a unique art historical perspective.
Another important collection is the Social networks of Members of Parliament of the Czech Republic 2021-2025 collection, which preserves the online communications and interactions of Czech MPs, invaluable for the study of political marketing and public political life.
The GitHub collection archives important repositories from this popular developer platform, preserving key domestic software projects and their code for future generations.
Finally, the Crypto, NFT, Blockchain, Web3, Metaverse collection charts the rise and impact of technology in the digital asset space. These collections are key resources for research and analysis of digital culture, policy, and technology, and the discussion of these collections at web archivist meetings contributes to the further development of archival methods and technological innovation.

We focused on exchanging knowledge and experience in seed acquisition, workflow optimization and sharing technical tips and tricks.

Sharing best practices

We discussed best practices for identifying and collecting key web resources, a critical step in ensuring a comprehensive and representative archive. We shared various strategies for automating and streamlining workflows, including the use of web scraping tools and advanced content filtering.

Technical discussions included solutions to common problems such as harvesting dynamic web pages and overcoming access restrictions. The meeting provided a valuable platform for sharing innovative methods and fostering collaboration among experts, furthering the development of effective and sustainable digital archiving.

Erasmus+ visit to the FCCN TV studio: Luboš Svoboda, web curator, Marie Haškovcová, chief of the Webarchiv, and Ricardo Basílio, Arquivo.pt web curator, visiting the FCCN-FCT TV Studio.

Training about web archiving in Madeira island

May 2, 2024 by Ricardo Basílio

Last updated on November 27th, 2024 at 01:43 pm

The Arquivo.pt team was in Funchal between April 15 and 19, 2024, and presented two different sessions on web preservation. The first took place during the Jornadas FCCN 2024 and the second was a workshop, after the event had ended, at the headquarters of the Regional Agency for the Development of Research, Technology and Innovation (ARDITI).

Arquivo.pt at Jornadas FCCN 2024

The session held during the Jornadas FCCN 2024 was entitled “Arquivo.pt at the service of culture” and aimed to highlight two of Arquivo.pt’s collaborations in the field of culture and knowledge, namely with Wikipedia Portugal and the Virtual Museum of Tourism (MUVITUR).

At the FCCN Zapping session, Arquivo.pt presented the Arquivo404 service, which allows websites to offer historical content instead of the negative “Page not found”.

Arquivo404 – Vasco Rato, Web developer

Workshop with ARDITI

The post-event workshop, promoted by ARDITI, was open to regional institutions and citizens in general. It was entitled “Arquivo.pt and the preservation of Internet memory”.

The contents were structured according to the training program run by Arquivo.pt and preceded by an overview of how it fits among the other services of FCCN – Computação Científica da FCT.

Just as important as the content was the dialog that was established during the sessions between the participants and the Arquivo.pt team to clarify doubts or ask questions.

Web preservation is increasingly important for organizations that want to preserve part of their institutional memory and develop security policies.

ARDITI gave an important signal about preserving the web memory of Madeiran institutions by hosting and promoting the Arquivo.pt training sessions.

If you want to promote the preservation of web content in your organization, check out the Arquivo.pt training and contact us.


Cross-lingual research datasets on 2019 European Parliamentary Elections

September 18, 2023 by admin

Daniel Gomes and Diego Alves presenting at the CLEOPATRA final event

Last updated on December 13th, 2024 at 01:55 pm

Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections

The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned.

The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.

The process of generating the collection about the 2019 European Parliamentary Elections was performed in two steps.

In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.

These translations were reviewed in collaboration with the Publications Office of the European Union. Besides that, in parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.

In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content.

The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).

CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy

The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to generate ways to better understand the massive digital coverage of major events in Europe over the past decades.

The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.

In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students.

Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022.

The idea was to develop part of his research about syntactic structures of EU languages using the textual resources preserved by the Arquivo.pt and exchange knowledge with the web-archiving experts on the strategies to extract and process historical web-data.

Diego Alves defended his Ph.D. thesis entitled Computational typological analysis of syntactic structures in European languages in July 2023 at the Faculty of Humanities and Social Sciences of the University of Zagreb (Croatia).

Generating textual datasets for Natural Language Processing

Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.

This work is detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.

A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:

Extract text: The textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, to separate the texts written in different languages across distinct files;
Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g.: repeated instances, empty lines, etc.);
Double-check of language identification: the language of each cleaned extracted text was verified again to eliminate possible errors originated during the previous steps.
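The cleaning and verification steps above can be sketched as follows. This is a minimal illustration, not the project's published script: the function names are hypothetical, the cleaning rules cover only the two cases named above (repeated lines and empty lines), and dropping a text whose re-detected language no longer matches its label is an assumed policy. The language detector is passed in as a function so the sketch runs without extra packages; in practice `langdetect.detect` could fill that role.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def clean_text(text: str) -> str:
    """Drop empty lines and repeated lines, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def clean_and_verify(
    labelled_texts: Iterable[Tuple[str, str]],
    detect: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Clean each (language, text) pair and keep it only if the language
    detected on the cleaned text still matches the original label
    (assumed policy for the double-check step)."""
    corpora: Dict[str, List[str]] = defaultdict(list)
    for label, raw in labelled_texts:
        cleaned = clean_text(raw)
        if cleaned and detect(cleaned) == label:
            corpora[label].append(cleaned)
    return dict(corpora)


def toy_detect(text: str) -> str:
    """Stand-in detector for the example only; not a real language identifier."""
    return "pt" if "eleições" in text.lower() else "en"


if __name__ == "__main__":
    samples = [
        ("pt", "Eleições europeias\n\nEleições europeias\nResultados"),
        ("pt", "European elections\n\n"),  # mislabelled, so it is dropped
    ]
    print(clean_and_verify(samples, toy_detect))
```

Grouping the surviving texts by language mirrors the one-file-per-language layout of the published datasets.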

Two new research datasets are openly available!

The result was a dataset of cleaned and language-verified texts publicly available. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:

Number of tokens of each corpus extracted from the collection 2019 European Union Elections preserved by Arquivo.pt (EAWP23).

The aforementioned corpus was automatically annotated regarding part-of-speech and dependency relations to generate a corpus with syntactic information which is useful for linguistic studies.

The multilingual model of the UDify tool (Kondratyuk and Straka, 2019) was applied.

The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
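For readers who want to work with the annotated files, the sketch below reads CoNLL-U text with the Python standard library only. CoNLL-U stores one token per line in ten tab-separated columns, separates sentences with blank lines, and marks comment lines with `#`. The sample sentence and the function name are illustrative and are not taken from the dataset.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Return a list of sentences, each a list of token dictionaries."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample, not taken from the published dataset.
SAMPLE = (
    "# text = Europe votes.\n"
    "1\tEurope\tEurope\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tvotes\tvote\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["form"], token["upos"], token["deprel"])
```

From this structure it is straightforward to count part-of-speech tags or dependency relations per language, which is the kind of comparison the annotated corpora are meant to support.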

The syntactically annotated texts about the 2019 European Elections are publicly available!

Know more

Robustness of Corpus-Based Typological Strategies for Dependency Parsing, Event Analytics across Languages and Communities, 2024
“Secondments@Arquivo.pt and new research tools available” and “Robustness of Corpus-based Typological Strategies for Dependency Parsing”, presentations at the CLEOPATRA final event, 2023
Dataset of cleaned and language-verified texts about the 2019 European Elections (Raw texts)
Dataset of syntactically annotated texts about the 2019 European Elections (CoNLL-U texts)
Python script to extract language-specific texts from Arquivo.pt through a list of keywords
Computational typological analysis of syntactic structures in European languages, Diego Alves PhD thesis, 2023
Diego Alves personal page
Arquivo.pt APIs
Robustness of Corpus-based Typological Strategies for Dependency Parsing, Diego Alves and Daniel Gomes, Event Analytics across Languages and Communities book, Springer.

Arquivo.pt presentations at IIPC GA/WAC, RESAW 2023 and CLEOPATRA

June 16, 2023 by Ricardo Basílio

Last updated on March 10th, 2024 at 05:23 pm

Meeting the Web Archive Community

The International Internet Preservation Consortium (IIPC), a consortium that brings together Web preservation initiatives from around the world, held its General Assembly with its members on May 10, 2023.

On the following days, May 11 and 12, the IIPC Web Archiving Conference (IIPC WAC) was held, an initiative open to the community, where people or entities not associated with the IIPC and interested in the Web preservation domain can participate.

The two events were jointly hosted by KB – National Library of the Netherlands, and by Beeld & Geluid – Netherlands Institute for Sound & Vision.

Contributions from Arquivo.pt at the Web Archiving Conference

Arquivo.pt participated in the IIPC working group meetings (Training Working Group and Curators Working Group) and contributed with presentations in the thematic sessions Collaborations & Outreach and Program infrastructure (sessions 7 and 17).

Arquivo.pt updates 2023 (slides)
Linking web archiving with arts and humanities: the collaboration between ROSSIO and Arquivo.pt (video, slides)
Arquivo.pt behind the curtains (slides)

Meeting the RESAW research community

RESAW (Research Infrastructure for the Study of Archived Web Materials) is an initiative created in 2012 with the aim of promoting studies based on archived Web content, in areas such as Social Sciences, Digital Arts and Humanities.

The RESAW 2023 conference was held at the MUCEM Lab (Mediterranean Institute of Heritage Crafts) in Marseille on June 5-6, 2023, under the theme Exploring the Archived Web During a Highly Transformative Age.

Contributions from Arquivo.pt to RESAW 2023

Arquivo.pt contributed with presentations to the sessions Web Archive in Mediterranean area and its merge (4.A), From online Tools to Web Archive (6.B.), Towards a participatory approach to collections (9. A.), Digging up the materials for writing web history (9.B.).

How to research governmental web data? (abstract, slides)
Archiving Cryptocurrencies (abstract, slides)
Time to explore, time to learn from the archived web: Arquivo.pt training initiative (abstract, slides)
Exhibiting Web Memories from Arquivo.pt: a call for community participation (abstract, slides)

CLEOPATRA Project Meeting

The CLEOPATRA Project, led by the L3S Research Center at the Gottfried Wilhelm Leibniz University of Hannover, has run a training programme for doctoral researchers (Early Stage Researcher, PhD) since 2019.

Arquivo.pt has participated in three courses: Incentives design for hybrid multilingual information processing and analytics, in Southampton; National and transnational media coverage of European parliamentary elections, 2004-2014, London; and NLP for under-resourced languages, in Zagreb, Croatia.

In 2022, Arquivo.pt welcomed two researchers to its facilities, who used the archived resources and received special support from the Arquivo.pt team to develop their research.

The CLEOPATRA Project ended in 2023 with a meeting on May 16 in Hannover, which brought together professors, researchers and representatives of the institutions involved.

Daniel Gomes, Arquivo.pt’s Manager, highlighted the new tools that Arquivo.pt makes available and the results of the work carried out by the researchers that have passed through Arquivo.pt.

Secondments@Arquivo.pt and new research tools available (slides)
Research open datasets on 2019 European Parliamentary Elections

Free training on digital media – webinars

March 24, 2023 by Ricardo Basílio

Last updated on August 2nd, 2024 at 12:10 pm

The Aveiro Media Competence Center (AMCC) is a platform to support and promote the European Union (EU) Local News Media sector in the implementation of digital transition projects. The consortium includes the PCI Creative Science Park of Aveiro Region, the Associação Portuguesa de Imprensa and the University of Aveiro.

Arquivo.pt is a free public service that allows searching and accessing Web pages preserved since the 1990s, for example to view an old news article or access an old version of a website.

The collaboration between the AMCC and Arquivo.pt takes the form of a training program entitled Arquivo.pt: Digital Skills for the Media, developed in four webinars, and of the awarding of the AMCC Honorable Mention to work done on Portuguese centenary newspapers in the Arquivo.pt Award 2023.

Webinar cycle: Arquivo.pt: digital skills for media

The webinar cycle aims to equip trainees with digital skills that enable them to solve problems caused by the disappearance of digital information and gain competitive advantage in the production of unique and exclusive content.

Webinar 1: A tool for quickly searching the past
- Date: March 24, 2023, Time: 14h00-15h30 (in Portuguese)
  - Slides
  - Video
Webinar 2: Publishing well for preserving well
- Date: April 6, 2023, Time: 14h00-15h30 (in Portuguese)
  - Slides
  - Video
Webinar 3: Automated access and processing of preserved Web information through APIs
- Date: May 4, 2023, Time: 14h00-15h30 (in Portuguese)
  - Slides
  - Video
Webinar 4: Web archiving: do-it-yourself!
- Date: June 1, 2023, Time: 14h00-15h30 (in Portuguese)
  - Slides
  - Video

Tutorial: how to explore Arquivo.pt using Python

July 29, 2022 by admin

Last updated on August 5th, 2024 at 04:50 pm

The Programming Historian aims to develop digital skills among Humanities researchers through the publication of practical lessons in several languages.

The call Computational analysis skills for large-scale humanities data resulted in 7 new lessons.

One of them was the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt” developed by Daniel Gomes and Ricardo Campos.

It shows how to explore the Arquivo.pt user interface and the Application Programming Interface (API) to execute advanced queries, process large amounts of data or build new services, such as Tell me stories.

All the developed resources are freely available in open access.
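As a rough idea of what such an API query involves, the sketch below builds a full-text search request and summarises a JSON response. It is not taken from the tutorial. The endpoint `https://arquivo.pt/textsearch`, the parameter names (`q`, `from`, `to`, `maxItems`) and the response fields (`response_items`, `tstamp`, `title`, `linkToArchive`) are assumptions based on the public Arquivo.pt API and should be checked against the official documentation. The helper names are hypothetical, and no network call is made here.

```python
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import urlencode

# Assumed endpoint of the Arquivo.pt full-text search API.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(
    query: str,
    start: Optional[str] = None,
    end: Optional[str] = None,
    max_items: int = 10,
) -> str:
    """Build a search URL; start and end are timestamps such as 20190101000000."""
    params: Dict[str, Any] = {"q": query, "maxItems": max_items}
    if start:
        params["from"] = start  # assumed parameter name
    if end:
        params["to"] = end  # assumed parameter name
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def summarize_items(payload: Dict[str, Any]) -> List[Tuple[Any, Any, Any]]:
    """Extract (timestamp, title, archived link) from a decoded JSON response.
    The field names are assumptions about the response layout."""
    return [
        (item.get("tstamp"), item.get("title"), item.get("linkToArchive"))
        for item in payload.get("response_items", [])
    ]


if __name__ == "__main__":
    print(build_search_url("european elections", "20190101000000", "20191231235959", 5))
```

The resulting URL can be fetched with any HTTP client, and the decoded JSON passed to `summarize_items` to list the archived versions found.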

Open-access resources of the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt”

Colab project that enables editing and running the tutorial's code examples directly (English, Portuguese)
Official tutorial page on Programming Historian
Video presented on May 5, 2022, as part of the Programming Historian webinars and tutorials “Developing computational skills for digital collections”
- Slides