Arquivo.pt has contributed to the international collection of web pages on the Summer Olympic Games, which took place in Paris from 26 July to 11 August 2024, and is doing the same for the Summer Paralympics, taking place from 28 August to 8 September.
The pages of this collection will also be available on Arquivo.pt for those who want to carry out studies on sport and Olympism.
How the pages about Portuguese athletes were selected
At the Olympic Games, 73 athletes represented Portugal in 15 sports; at the Paralympic Games, 27 athletes in 10 sports.
The criterion for selecting pages for the international collection was news about the athletes. For each athlete, pages were selected about their expectations before the games, their performance in the competition and their comments during and after the competition.
Some athletes have more news selected than others, and the same goes for the sites from which the news comes. The selection of pages was not limited to the first results presented by the search engine. We looked for a variety of channels and news from regional and local sites, some from the region or city where the athletes came from.
More than 500 pages to remember the Portuguese presence in Paris
The contribution of Arquivo.pt, as you can see in the table, already has more than 500 web pages.
Arquivo.pt headquarters, operated by FCCN FCT, in Lisbon.
On May 24, FCCN welcomed professionals from Higher Education Libraries (HEL) for the first time as part of "My library is your library", a program promoted by the Higher Education Libraries Working Group (GT-BES) of the Portuguese Association of Librarians, Archivists, Documentalists and Information Professionals (BAD).
This is a mobility program built around short-term visits for exchanging experiences and getting hands-on contact with good practices, fostering collaboration and knowledge of Portuguese higher education institutions (HEIs) among professionals in the field.
Advanced services for knowledge
In this first edition of the program at FCCN, the participating colleagues (3 professionals from the University of Lisbon and 1 from the Catholic University of Porto) were offered a tour of the digital support services for higher education institutions operated by FCCN-FCT.
Some services are familiar to information professionals, such as B-On and RCAAP. Others are back-office services and therefore less visible, but they are essential for higher education institutions. Examples include Eduroam, which guarantees access to the Internet, RCTSaai for authentication, and RCTS CERT for responding to security incidents.
Highlights include the Arquivo.pt and NAU services
The day highlighted Arquivo.pt and the NAU Platform, two services in the field of knowledge that are available to higher education institutions and also to society.
The Arquivo.pt team showed the back office of this Internet preservation service in Portugal and carried out a practical exercise in recording and integrating content into the web archive.
The NAU Platform is a platform for MOOCs (Massive Open Online Courses) created with the aim of democratizing knowledge, promoting digital literacy, enabling education and training for broad communities of users, particularly the Portuguese and Lusophone population.
More recently, with its integration into the North American platform edx.org, it has also been made available to all potential Portuguese-speaking trainees around the world. Participants in the program were shown how to build a MOOC course on the edx platform.
By: Marie Haškovcová and Luboš Svoboda, Webarchiv, National Library of the Czech Republic, May 13th to 17th, 2024.
A visit within the EU Erasmus+ programme
Thanks to the EU Erasmus+ programme, focused on adult education – staff mobility, we were able to spend a week job shadowing at the Portuguese web archive Arquivo.pt and compare the strategies of the Czech web archive – Webarchiv with the approaches of our Portuguese colleagues.
In both cases, these are archives focused on national (Czech and Portuguese) content on the Internet.
FCT provides IT services to the Portuguese higher education and research system, as well as high-speed internet connectivity. The institutional background of both archives is also reflected in the specifics of their concepts.
The visit included a presentation of the team and the campus and departmental spaces, a presentation of the activities of both archives and a discussion of the different aspects of our work – technical and curatorial tools, technologies and processes, the legislative environment and ethical issues, data storage, some services, research activities, perspectives and future plans.
The Czech web archive
The Czech web archive was founded in 2000. Its oldest archival copies date back to 2001, and it currently holds more than 580 TB of data. Like Arquivo.pt, it harvests content on the national domain based on a list of URLs obtained from the domain provider. In its acquisition strategy, it supplements these so-called comprehensive harvests with thematic and selective harvests.
Topic collections relate to a specific topic or event, can be one-off or continuously built, and combine manually selected and automatically scraped resources. Selective harvests are intended for the long term, have detailed cataloging records that are part of the Czech national bibliography, and are licensed, so their archival copies are freely available through the catalogue.
From Webarchiv's research activities, we presented two projects: one aimed at detecting so-called dead websites through the Extinct Websites application and creating a database to serve as a basis for monitoring broader changes in the Czech web, and the WACloud project, aimed at extracting big data from the web archive.
Exchanging knowledge and experience
Among the Portuguese projects, we were particularly interested in CitationSaver. We also discussed the Memorial project, the harvesting of the Portuguese Wikipedia, and the activities of the Portuguese archive related to education in web archiving (training courses).
The meeting was enriched by the discussion of specific topic collections.
The Czech net art collection documents digital art and its transformation in the online space, providing a unique art historical perspective.
Another important collection is the Social networks of Members of Parliament of the Czech Republic 2021-2025 collection, which preserves the online communications and interactions of Czech MPs, invaluable for the study of political marketing and public political life.
The GitHub collection archives important repositories from this popular developer platform, preserving key domestic software projects and their code for future generations.
Finally, the Crypto, NFT, Blockchain, Web3, Metaverse collection charts the rise and impact of technology in the digital asset space. These collections are key resources for research and analysis of digital culture, policy, and technology, and the discussion of these collections at web archivist meetings contributes to the further development of archival methods and technological innovation.
We focused on exchanging knowledge and experience in seeds acquisition, workflow optimization and sharing technical tips and tricks.
Sharing best practices
We discussed best practices for identifying and collecting key web resources, a critical step in ensuring a comprehensive and representative archive. We shared various strategies for automating and streamlining workflows, including the use of web scraping tools and advanced content filtering.
Technical discussions included solutions to common problems such as harvesting dynamic web pages and overcoming access restrictions. The meeting provided a valuable platform for sharing innovative methods and fostering collaboration among experts, furthering the development of effective and sustainable digital archiving.
The initiatives were as follows: a journey through time, a special collection on the theme "April 25", a presentation at the "50 years of April International Congress" and the inclusion of a special mention in the 2025 edition of the Arquivo.pt Award.
Memories of April 25 on the Internet exhibition
The exhibition Memories of April 25 on the Internet presents a selection of web pages about the celebrations of April 25 in various regions of the country, since the beginning of the web in the 1990s.
The criteria for choosing the pages for the exhibition were as follows:
Pages relating to the April 25 commemorations;
Pages found on Arquivo.pt on dates close to the anniversary each year;
Diversity to include different areas of the country;
Popular demonstrations and official ceremonies.
A historical memory without web archives is incomplete. The aim of this journey through time is to invite citizens to travel back in time, browsing through old web pages and reliving recent episodes in our life as a democracy.
The dataset contains a list of keywords put into a search engine in order to obtain results on the topic of “April 25”. The search considered names of people, places, political, social and cultural aspects, as well as words associated with the event.
The searches were carried out on March 22, 2024 using the Bing Search API, an automatic search service that returns results according to the relevance criteria of the Bing service itself and others configured by us.
A total of 12,650 unique web page addresses were obtained. It is hoped that the recording of these pages will be useful for the organizations that produced this content, for researchers who want to study our history and for citizens who cultivate a sense of memory and democracy.
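As an illustration of how per-keyword search results can be reduced to a list of unique page addresses, here is a minimal Python sketch. It is not the procedure used to build the dataset: the normalization rule (lowercasing the host and dropping the fragment) and the sample URLs are assumptions made for the example, and no search API is called.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase the host and drop the fragment (an illustrative rule, not the dataset's)."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, ""))


def unique_urls(results_by_keyword):
    """Merge per-keyword result lists, keeping the first occurrence of each address."""
    seen = set()
    merged = []
    for urls in results_by_keyword.values():
        for url in urls:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical sample data, not taken from the real dataset.
sample = {
    "25 de Abril": ["https://Example.org/abril#top", "https://example.org/cravos"],
    "Revolução dos Cravos": ["https://example.org/abril", "https://example.net/"],
}
seeds = unique_urls(sample)
print(len(seeds), "unique addresses")
```

In this sample, two keywords return the same page under slightly different spellings, so four results collapse into three unique addresses.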
Participation in the 50 years of April International Congress
On May 2, 2024, João Gomes, Director of Advanced Services at the FCCN Scientific Computing Unit of the Foundation for Science and Technology I.P., presented Arquivo.pt to the participants of the 50 years of April International Congress, as a distinctive service, open to citizens and useful for organizations.
Arquivo.pt is a web preservation service available to all citizens who want to search for old content published on the web.
Using Arquivo.pt contributes to a better understanding of our history. It also provides useful services for cybersecurity, such as the Arquivo.pt Memorial, which is able to maintain institutions’ old websites, preventing attacks and saving them resources.
Special mention for “April 25 and Democracy” at the Arquivo.pt Awards 2025
In 2025, as part of the celebrations for the 50th anniversary of April 25, a special mention will be made of work on the theme “April 25 and Democracy”.
We therefore challenge researchers and interested citizens to create innovative works using Arquivo.pt.
If you have any questions about the Arquivo.pt Award, please contact us.
The session held during the Jornadas FCCN 2024 was entitled “Arquivo.pt at the service of culture” and aimed to highlight two of Arquivo.pt’s collaborations in the field of culture and knowledge, namely with Wikipedia Portugal and the Virtual Museum of Tourism (MUVITUR).
At the FCCN Zapping session, Arquivo.pt presented the Arquivo404 service, which allows websites to offer historical content instead of the negative “Page not found”.
The post-Day Workshop, promoted by ARDITI, was open to regional institutions and citizens in general. It was entitled “Arquivo.pt and the preservation of Internet memory”.
The contents followed the training program run by Arquivo.pt and were preceded by an overview of where Arquivo.pt fits among the other services of FCCN – Computação Científica da FCT.
Just as important as the content was the dialog that was established during the sessions between the participants and the Arquivo.pt team to clarify doubts or ask questions.
Web preservation is increasingly important for organizations that want to preserve part of their institutional memory and develop security policies.
ARDITI gave an important signal about preserving the web memory of Madeiran institutions by hosting and promoting the Arquivo.pt training sessions.
If you want to promote the preservation of web content in your organization, check out the Arquivo.pt training and contact us.
Artificial Intelligence (AI) covers various areas of knowledge, such as linguistics and computing, and is present in the new technologies used by citizens on a daily basis.
One example is when we search for information on the Internet and the computer generates an amazingly accurate response, in a language very close to our own.
Natural Language Processing (NLP) is what allows machines to perfect the algorithm that generates these answers tailored to Internet users.
The problem is that natural language processing models have been developed more for the English language and less for Portuguese and other languages with less representation.
The more the processing models are trained on a language, the better they will be able to interpret the complexities of the language. But this is only possible if quality data is available.
Portuguese text collection on Arquivo.pt available for research
Arquivo.pt appears here as the largest Portuguese-language textual dataset in Portugal, available in open access, for researchers to train NLP models.
In recent years, researchers from various research groups and projects have drawn attention to the usefulness of preserved web data for large-scale processing.
Arquivo.pt has more than 1 Petabyte of preserved web content dating back to the 1990s, including everything that can be found on web pages. It’s not just text, but also images, audio files, video, page code and various metadata.
The content is accessible via the search interface and the Arquivo.pt APIs.
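For readers who want to try the API route, the following Python sketch builds a full-text search request URL. The endpoint and the parameter names (`q`, `from`, `to`, `maxItems`) reflect our understanding of the public Arquivo.pt text search API and should be treated as assumptions to check against the official API documentation. The helper name `build_textsearch_url` is hypothetical, and the sketch only constructs the URL without sending a request.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint of the Arquivo.pt full-text search API; verify against the official docs.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_textsearch_url(query, start=None, end=None, max_items=50):
    # type: (str, Optional[str], Optional[str], int) -> str
    """Build a search URL; start and end are 14-digit timestamps (YYYYMMDDHHMMSS)."""
    params = {"q": query, "maxItems": max_items}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    return TEXTSEARCH_ENDPOINT + "?" + urlencode(params)


# Example: pages mentioning "25 de Abril" captured between 1996 and 2000.
url = build_textsearch_url("25 de Abril", start="19960101000000", end="20001231235959", max_items=10)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response is expected to be JSON.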
One of the projects that used Arquivo.pt to obtain large amounts of text is GlórIA, a large language model (LLM) focused on European Portuguese.
More than 100 historical websites from the Faculty of Sciences of the University of Lisbon (FCUL) are now accessible through the Memorial service of Arquivo.pt.
FCUL’s IT Department sent to Arquivo.pt a list of old websites hosted on its servers that were no longer updated, but whose historical content continues to be interesting to the community (e.g. websites of research projects or scientific events).
Arquivo.pt preserved these websites in collaboration with their owners, seeking to maintain a faithful representation of the published content for the future.
FCUL redirected the domain of each website to Arquivo.pt and was then able to disconnect the respective servers and start saving the resources spent on their maintenance (e.g. electricity, data center space, human resources).
The MiNEMA scientific program website was the first that FCUL integrated into the Memorial. This website stopped being updated in 2009 when the project ended. FCUL invested resources in maintaining the website for another 10 years until it became necessary to shut it down for cybersecurity reasons.
The Memorial of Arquivo.pt emerged as an option, and since 2020 FCUL only needs to maintain the domain www.minema.di.fc.ul.pt while Arquivo.pt preserves the information contained on the website.
Follow FCUL and preserve your historical websites in the Memorial!
An increasing number of institutions are turning to the Memorial of Arquivo.pt to safely preserve the content of their historical websites. For example, FCUL preserved 116 websites, the Government IT Network Management Center preserved 23 and the Foundation for Science and Technology preserved 40.
Public institutions have priority to benefit from this service. However, other entities can also request it as long as they own the website domain.
Identify your historical websites that are candidates for integration into the Memorial of Arquivo.pt and contact us!
Some web-archived pages are reproduced incompletely because of problems that occurred during the archiving process (e.g. broken formatting or missing embedded images).
Complete page is a function of Arquivo.pt that recovers missing elements of web-archived pages from other web archives or from the original websites.
When viewing a page archived in Arquivo.pt, a user just needs to open the Options menu in the top right corner and choose Complete page.
This process is performed automatically.
How does Complete page work?
If you open a web-archived page that appears incomplete, try the Complete page option and wait.
Arquivo.pt will search for missing elements on the Internet and in other web archives using the Memento protocol. If it succeeds, the obtained elements will be immediately displayed on the web-archived page.
Later, these recovered elements are integrated into the Arquivo.pt collection, so that the web-archived page will appear more complete in the future accesses performed by any user.
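To give an idea of what a Memento lookup involves, here is a minimal Python sketch of the request a client prepares for a TimeGate, the Memento component that redirects to the capture closest to a requested date. This is not Arquivo.pt's implementation: the TimeGate base URI, the helper name `timegate_request` and the example image URL are hypothetical, and the sketch only builds the request without sending it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate base URI; a real web archive or aggregator publishes its own.
TIMEGATE_BASE = "https://timegate.example.org/timegate/"


def timegate_request(original_url, when):
    """Return the (URL, headers) pair for a Memento TimeGate lookup.

    `when` is expected to be a timezone-aware datetime.
    """
    # Accept-Datetime carries the desired capture time as an HTTP-date in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_BASE + original_url, {"Accept-Datetime": accept_datetime}


# Example: ask for the capture of a missing image closest to 1 June 2005.
url, headers = timegate_request(
    "http://example.org/images/logo.gif",
    datetime(2005, 6, 1, 12, 0, tzinfo=timezone.utc),
)
print(url, headers)
```

A client would send a GET request with these headers and follow the redirect to the archived copy, if one exists.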
Using Complete page on the home page of artist Cristina Guerra's website recovered a missing image.
For example, the website of artist Cristina Guerra archived in 2005 had a missing image. By using Complete page, it was possible in 2021 to obtain this missing image from another web archive which preserved it.
Participate in collaborative curation to improve the quality of Arquivo.pt!
Due to the high number of web-archived pages, it is not possible for Arquivo.pt to complete them all automatically. Therefore, the collaboration of users to identify important pages with missing elements and try to complete them is important.
By using Complete page, users help improve the quality of the historical web pages preserved in Arquivo.pt!
Always try to complete web-archived pages that look incomplete. If you detect any problem, contact us.
Spread the word about the Arquivo.pt Complete page!
On the following days, May 11 and 12, the IIPC Web Archiving Conference (IIPC WAC) was held, an initiative open to the community, where people or entities not associated with the IIPC and interested in the Web preservation domain can participate.
Contributions from Arquivo.pt at the Web Archiving Conference
Arquivo.pt participated in the IIPC working group meetings (Training Working Group and Curators Working Group) and contributed with presentations in the thematic sessions Collaborations & Outreach and Program infrastructure (sessions 7 and 17).
Arquivo.pt contributed with presentations to the sessions Web Archive in Mediterranean area and its merge (4.A), From online Tools to Web Archive (6.B.), Towards a participatory approach to collections (9. A.), Digging up the materials for writing web history (9.B.).
How to research governmental web data? (abstract, slides)
Arquivo.pt has participated in three courses: Incentives design for hybrid multilingual information processing and analytics, in Southampton; National and transnational media coverage of European parliamentary elections, 2004-2014, London; and NLP for under-resourced languages, in Zagreb, Croatia.
In 2022, Arquivo.pt welcomed two researchers at its facilities, who used the archived resources and received special support from the Arquivo.pt team to develop their research.
The CLEOPATRA Project ended in 2023 with a meeting on the 16th May, in Hannover, which brought together Professors, Researchers and representatives of the institutions involved.
Daniel Gomes, Arquivo.pt’s Manager, highlighted the new tools that Arquivo.pt makes available and the results of the work carried out by the researchers that have passed through Arquivo.pt.
Secondments@Arquivo.pt and new research tools available (slides)
On November 8, 2007, the Portuguese Web Archive was officially created and later named Arquivo.pt.
To celebrate this date, Wikimedia Portugal and Arquivo.pt jointly organized an online event dedicated to the preservation of digital heritage.
Agenda
Introduction – André Barbosa, Wikimedia Portugal (Video)
15 years of Arquivo.pt – Daniel Gomes, Arquivo.pt (Slides, Video)
Wikimedia at the University: Exploration and Projects at NOVA FCSH – Rute Correia, WMPT Residency at NOVA FCSH (Slides, Video)
GLAM Wiki: a general introduction – Giovanna Fontenelle, Wikimedia Foundation, Brazil (Slides, Video)
Demo of the open-access resources on Arquivo.pt – Daniel Gomes (Video)