IPL – Politécnico de Lisboa, through its Distance Learning Group (EaD@IPL), organised a series of webinars for its community dedicated to Arquivo.pt and the preservation of content published on the Internet.
This initiative was attended by IPL – Politécnico de Lisboa lecturers and researchers, as well as people linked to the institution’s communications department.
The cycle of webinars took place in three sessions, between May and July 2024, and followed the training programme that Arquivo.pt has been offering for several years.
Presentation materials
1st webinar – Arquivo.pt: a new tool for researching the past. Publish well to preserve well. June 5, 2024.
Archiving content published on the web and using a web archive on a day-to-day basis is an unusual practice, largely due to the community’s lack of knowledge about the existence and operation of Arquivo.pt.
As a result of this series of webinars, the collaboration between the IPL – Politécnico de Lisboa and Arquivo.pt was strengthened, with a view to preserving its institutional websites and other interesting content that is available on various online media (news, events, references to teachers, researchers and students).
Arquivo.pt headquarters, operated by FCCN FCT, in Lisbon.
On May 24, FCCN welcomed professionals from Higher Education Libraries (HEL) for the first time as part of My library is your library, a program promoted by the Higher Education Libraries Working Group (GT-BES) of the Portuguese Association of Librarians, Archivists, Documentalists and Information Professionals (BAD).
This is a mobility program of short-term visits for exchanging experiences and gaining hands-on contact with good practices, fostering collaboration and knowledge of Portuguese higher education institutions (HEIs) among professionals in the field.
Advanced services for knowledge
In this first edition of the program at FCCN, the participating colleagues (3 professionals from the University of Lisbon and 1 from the Catholic University of Porto) were offered a tour of the digital support services for higher education institutions operated by FCCN-FCT.
Some services are familiar to information professionals, such as B-On and RCAAP. Others are back-office services and therefore less visible, but they are essential for higher education institutions. Examples include Eduroam, which guarantees access to the Internet, RCTSaai for authentication, and RCTS CERT for responding to security incidents.
Highlights include the Arquivo.pt and NAU services
The day highlighted Arquivo.pt and the NAU Platform, two services in the field of knowledge that are available to higher education institutions and also to society.
The Arquivo.pt team showed the backoffice of this Internet preservation service in Portugal and carried out a practical exercise in recording and integrating content into the web archive.
The NAU Platform is a platform for MOOCs (Massive Open Online Courses) created with the aim of democratizing knowledge, promoting digital literacy, enabling education and training for broad communities of users, particularly the Portuguese and Lusophone population.
More recently, with its integration into the North American platform edx.org, it has also been made available to all potential Portuguese-speaking trainees around the world. Participants in the program were shown how to build a MOOC course on the edX platform.
By: Marie Haškovcová and Luboš Svoboda, Webarchiv, National Library of the Czech Republic, May 13th to 17th, 2024.
A visit within the EU Erasmus+ programme
Thanks to the EU Erasmus+ programme, focused on adult education – staff mobility, we were able to spend a week job shadowing at the Portuguese web archive Arquivo.pt and compare the strategies of the Czech web archive – Webarchiv with the approaches of our Portuguese colleagues.
In both cases, these are archives focused on national (Czech and Portuguese) content on the Internet.
FCT provides IT services to the Portuguese higher education and research system, as well as high-speed internet connectivity. The institutional background of both archives is also reflected in the specifics of their concepts.
The visit included a presentation of the team and the campus and departmental spaces, a presentation of the activities of both archives and a discussion of the different aspects of our work – technical and curatorial tools, technologies and processes, the legislative environment and ethical issues, data storage, some services, research activities, perspectives and future plans.
The Czech web archive
The Czech web archive was founded in 2000; its oldest archival copies date back to 2001, and it currently holds more than 580 TB of data. Like Arquivo.pt, it harvests content on a national domain based on a list of URLs obtained from its provider. Its acquisition strategy supplements these so-called comprehensive harvests with thematic and selective harvests.
Topic collections relate to a specific topic or event, can be one-off or continuously built, and combine manually selected and automatically scraped resources. Selective harvests are intended for long-term harvesting, have detailed cataloguing records that are part of the Czech national bibliography, and are licensed, so their archival copies are freely available through the catalogue.
Among Webarchiv's research activities, we presented our project on detecting so-called dead websites through the Extinct Websites application and building a database to serve as a basis for monitoring broader changes in the Czech web, as well as the WACloud project, which aims to extract big data from the web archive.
Exchanging knowledge and experience
Among the Portuguese projects, we were particularly interested in CitationSaver. We also discussed the Memorial project, the harvesting of the Portuguese Wikipedia, and the Portuguese archive's educational activities in web archiving (training courses).
The meeting was enriched by the discussion of specific topic collections.
The Czech net art collection documents digital art and its transformation in the online space, providing a unique art historical perspective.
Another important collection is the Social networks of Members of Parliament of the Czech Republic 2021-2025 collection, which preserves the online communications and interactions of Czech MPs, invaluable for the study of political marketing and public political life.
The GitHub collection archives important repositories from this popular developer platform, preserving key domestic software projects and their code for future generations.
Finally, the Crypto, NFT, Blockchain, Web3, Metaverse collection charts the rise and impact of technology in the digital asset space. These collections are key resources for research and analysis of digital culture, policy, and technology, and the discussion of these collections at web archivist meetings contributes to the further development of archival methods and technological innovation.
We focused on exchanging knowledge and experience in seed acquisition and workflow optimization, and on sharing technical tips and tricks.
Sharing best practices
We discussed best practices for identifying and collecting key web resources, a critical step in ensuring a comprehensive and representative archive. We shared various strategies for automating and streamlining workflows, including the use of web scraping tools and advanced content filtering.
Technical discussions included solutions to common problems such as harvesting dynamic web pages and overcoming access restrictions. The meeting provided a valuable platform for sharing innovative methods and fostering collaboration among experts, furthering the development of effective and sustainable digital archiving.
The session held during the Jornadas FCCN 2024 was entitled “Arquivo.pt at the service of culture” and aimed to highlight two of Arquivo.pt’s collaborations in the field of culture and knowledge, namely with Wikipedia Portugal and the Virtual Museum of Tourism (MUVITUR).
At the FCCN Zapping session, Arquivo.pt presented the Arquivo404 service, which allows websites to offer historical content instead of the negative “Page not found”.
The post-Day Workshop, promoted by ARDITI, was open to regional institutions and citizens in general. It was entitled “Arquivo.pt and the preservation of Internet memory”.
The contents were structured according to the training program run by Arquivo.pt and were preceded by an overview situating it among the other services of FCCN – Computação Científica da FCT.
Just as important as the content was the dialog that was established during the sessions between the participants and the Arquivo.pt team to clarify doubts or ask questions.
Web preservation is increasingly important for organizations that want to preserve part of their institutional memory and develop security policies.
ARDITI gave an important signal about preserving the web memory of Madeiran institutions by hosting and promoting the Arquivo.pt training sessions.
If you want to promote the preservation of web content in your organization, check out the Arquivo.pt training and contact us.
Arquivo.pt preserved online documents in several languages about the 2019 European Parliamentary Elections
The 2019 European Parliamentary Elections were an event of international relevance. The strategy to preserve the relevant information on the World Wide Web is delegated to national institutions. However, the preservation of web pages that document transnational events is not officially assigned.
The Arquivo.pt team, with the aim of preserving the cross-lingual online content that documents this event, applied a combination of human and automatic selection processes.
In the first step, 40 relevant terms in Portuguese about the 2019 European Parliamentary Elections were identified and then automatically translated into the 24 official languages of the European Union: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish.
These translations were reviewed in collaboration with the Publications Office of the European Union. In parallel, a collaborative list was launched to gather contributions of relevant seeds from the international community.
In the second step, the Arquivo.pt team iteratively ran 6 crawls (99 million web files, 4.8 TB) using different configurations and crawling software, to maximize the quality of the collected content.
The obtained web-data was aggregated into one special collection identified as EAWP23 and became searchable and accessible through Arquivo.pt in July 2020 (https://arquivo.pt/ee2019).
CLEOPATRA project: Cross-lingual Event-centric Open Analytics Research Academy
The CLEOPATRA ITN was a Marie Skłodowska-Curie Innovative Training Network that aimed to generate ways to better understand the massive digital coverage of major events in Europe over the past decades.
The main goal was to facilitate advanced cross-lingual processing of textual and visual information related to key contemporary events at large scale and develop innovative methods for efficient access and interaction with multilingual information.
In total, 14 Early-Stage Researchers hosted across 9 European Universities developed their research while enrolled as Ph.D. students.
Associated partners such as Arquivo.pt contributed to CLEOPATRA by hosting and training early-stage researchers such as Diego Alves. As part of the training program, he conducted a secondment at Arquivo.pt in Lisbon from June to August 2022.
The idea was to develop part of his research on the syntactic structures of EU languages using the textual resources preserved by Arquivo.pt, and to exchange knowledge with the web-archiving experts on strategies to extract and process historical web data.
Generating textual datasets for Natural Language Processing
Diego Alves’ work produced cross-lingual datasets about the 2019 European Parliamentary Elections that are valuable for research.
This work will be detailed in the chapter “Robustness of Corpus-based Typological Strategies for Dependency Parsing” of the open-access CLEOPATRA book entitled “Event Analytics across Languages and Communities”.
A 3-step Natural Language Processing pipeline was developed to generate research textual datasets that can be used in several types of digital humanities studies:
Extract text: The textual content was extracted from each web-archived URL using the newspaper3k Python library. The language of each extracted text was determined using the langdetect library, to separate the texts written in different languages across distinct files;
Clean extracted texts: a Python script was applied to clean the texts by removing unnecessary information (e.g. repeated instances, empty lines, etc.);
Double-check of language identification: the language of each cleaned extracted text was verified again to eliminate possible errors introduced during the previous steps.
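The three steps above can be sketched roughly as follows. This is a minimal illustration and not the script used in the project: the helper names clean_text and group_by_language are hypothetical, the language detector is passed in as a function (the detect function of langdetect could be supplied, as could any other detector), and discarding a text whose detected language changes after cleaning is an assumption about how the double-check handles disagreements. The extraction step with newspaper3k is left out so that the sketch relies only on the standard library.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def clean_text(text: str) -> str:
    """Remove empty lines and repeated lines, keeping first occurrences in order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def group_by_language(
    texts: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group cleaned texts by language code (hypothetical helper).

    `detect` is any function mapping a text to a language code.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in texts:
        if not raw.strip():
            continue  # nothing was extracted for this URL
        first_guess = detect(raw)      # step 1: language of the extracted text
        cleaned = clean_text(raw)      # step 2: clean the extracted text
        # Step 3: check the language again after cleaning.
        # Assumption: a text is dropped when the two detections disagree.
        if detect(cleaned) != first_guess:
            continue
        groups[first_guess].append(cleaned)
    return dict(groups)
```

Passing the detector as an argument keeps the sketch independent of any particular library; each list in the returned dictionary could then be written to one file per language, as in the dataset described here.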
Two new research datasets are openly available!
The result was a publicly available dataset of cleaned and language-verified texts. Each file contains the texts in a given language about the 2019 European Union Elections. The distribution of extracted texts for each language is described in the figure below:
The aforementioned corpus was automatically annotated with part-of-speech tags and dependency relations to generate a corpus with syntactic information, which is useful for linguistic studies.
The texts in these annotated corpora follow the same order as the respective raw-text files. Each sentence is annotated following the Universal Dependencies framework in the CoNLL-U format, which is the reference for syntactic annotation in Natural Language Processing. Thus, each file in this dataset contains the annotated texts in a given language about the 2019 European Union Elections.
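To give an idea of what the CoNLL-U format looks like in practice, the sketch below reads a sentence into Python dictionaries. CoNLL-U stores one token per line in ten tab-separated columns, marks comments with "#" and separates sentences with blank lines. The sample sentence was written for this illustration and is not taken from the dataset, and parse_conllu is a hypothetical helper name; for real work a dedicated library would probably be a better choice.

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Illustrative sentence written for this sketch (not from the dataset).
SAMPLE = (
    "# text = Citizens vote.\n"
    "1\tCitizens\tcitizen\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tvote\tvote\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Return a list of sentences, each a list of token dictionaries."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        token = dict(zip(FIELDS, columns))
        # Keep plain word lines only; multiword ranges (1-2) and
        # empty nodes (1.1) are skipped in this simplified sketch.
        if token["id"].isdigit():
            current.append(token)
    if current:
        sentences.append(current)
    return sentences


for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

The head column holds the id of the governing token (0 for the root), so the dependency tree of each sentence can be rebuilt from the head and deprel values.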
On the following days, May 11 and 12, the IIPC Web Archiving Conference (IIPC WAC) was held, an initiative open to the community, where people or entities not associated with the IIPC and interested in the Web preservation domain can participate.
Contributions from Arquivo.pt at the Web Archiving Conference
Arquivo.pt participated in the IIPC working group meetings (Training Working Group and Curators Working Group) and contributed with presentations in the thematic sessions Collaborations & Outreach and Program infrastructure (sessions 7 and 17).
Arquivo.pt contributed with presentations to the sessions Web Archive in Mediterranean area and its merge (4.A), From online Tools to Web Archive (6.B.), Towards a participatory approach to collections (9. A.), Digging up the materials for writing web history (9.B.).
How to research governmental web data? (abstract, slides)
Arquivo.pt has participated in three courses: Incentives design for hybrid multilingual information processing and analytics, in Southampton; National and transnational media coverage of European parliamentary elections, 2004-2014, London; and NLP for under-resourced languages, in Zagreb, Croatia.
In 2022, Arquivo.pt welcomed two researchers to its facilities, who used the archived resources and received special support from the Arquivo.pt team to develop their research.
The CLEOPATRA Project ended in 2023 with a meeting on 16 May in Hannover, which brought together professors, researchers and representatives of the institutions involved.
Daniel Gomes, Arquivo.pt’s Manager, highlighted the new tools that Arquivo.pt makes available and the results of the work carried out by the researchers that have passed through Arquivo.pt.
Secondments@Arquivo.pt and new research tools available (slides)
Arquivo.pt is a free public service that allows searching and accessing Web pages preserved since the 1990s, for example to view an old news story or access an old version of a website.
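For readers who prefer to script such searches, the sketch below builds the two kinds of address involved: a full-text search request and a link to an archived copy. It is an illustration under assumptions rather than an official client. The endpoint and the parameter names (q, from, to, maxItems) reflect our understanding of the public Arquivo.pt API and should be checked against its documentation, and the function names are hypothetical. No request is sent; the code only assembles URLs.

```python
from urllib.parse import urlencode

# Assumed public endpoints; verify against the Arquivo.pt API documentation.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"
WAYBACK_PREFIX = "https://arquivo.pt/wayback"


def build_search_url(query: str, start: str, end: str, max_items: int = 10) -> str:
    """Build a full-text search URL (hypothetical helper).

    `start` and `end` are timestamps in YYYYMMDDHHMMSS form.
    """
    params = {"q": query, "from": start, "to": end, "maxItems": max_items}
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def build_wayback_url(timestamp: str, original_url: str) -> str:
    """Build the address of an archived version of `original_url`."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


print(build_search_url("European elections", "20190101000000", "20191231235959", 5))
print(build_wayback_url("20190526000000", "https://www.europarl.europa.eu/"))
```

The first address could be fetched with any HTTP client to obtain search results; the second can be opened in a browser to view the preserved page closest to the given timestamp.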
The collaboration between the AMCC and Arquivo.pt is materialized in a training program entitled Arquivo.pt: Digital Skills for the Media, developed in four webinars, and in the attribution of the AMCC Honorable Mention to work done on Portuguese centenary newspapers in the Arquivo.pt Award 2023.
Webinar cycle: Arquivo.pt: digital skills for media
The webinar cycle aims to equip trainees with digital skills that enable them to solve problems caused by the disappearance of digital information and gain competitive advantage in the production of unique and exclusive content.
Webinar 1: A tool for quickly searching the past
Date: March 24, 2023 Time: 14h00-15h30 (in Portuguese)
One of them was the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt” developed by Daniel Gomes and Ricardo Campos.
The Portuguese Museums Network was the community invited to participate in the cycle of three webinars entitled “Cultural Heritage on the Web: online presence of museums”.
The aim is to raise awareness among museum managers and professionals about the importance of preserving content published on the Web and to make known the services and tools of Arquivo.pt.
This initiative is promoted by the Direção Geral do Património Cultural, through the Departamento de Museus, Conservação e Credenciação and Divisão de Museus e Credenciação, which welcomed and integrated in its training offer the proposal of Arquivo.pt (FCT, I.P.) .
Information and materials
June 21st, 2022 – The Arquivo.pt and the preservation of digital memory (1st webinar)
In this session Arquivo.pt is presented as a useful service to museums and institutions that the community can count on to preserve digital cultural heritage, specifically Web content.
Speaker: Ricardo Basílio, digital curator (replacing Daniel Gomes, manager of Arquivo.pt)
June 27, 2022 – Archiving the Web: DIY (3rd Webinar)
This session offers a tutorial for creating a local web archive, recording content in a standard format and using open tools that anyone can use.
The meeting was broadcast online with the aim of sharing with the community of archivists what has been an experience of collaborative curation of Web content.
Collaboration between a municipal archive and a web archive
This meeting continued a collaboration between the two teams that developed during the pandemic period.
The Arquivo Municipal de Sines made a selective and systematic collection of Web content related to the Municipality of Sines, with the collaboration of local media, such as Rádio Miróbriga and Rádio Sines.
In turn, Arquivo.pt contributed training on tools such as Webrecorder.net, which records content in a standardized format, and prepared useful services such as SavePageNow, which allows pages to be recorded on the fly directly on Arquivo.pt.
Local history is better with preserved Web pages
This collaboration resulted in the preservation of thousands of Web pages (about 200 Gigabytes of information) about the experience of the pandemic in the geographical area of Sines and Santiago do Cacém.
The copies of the Web Archive Files (WARCs) sent to Arquivo.pt have been integrated into the archive so that they become available.