Higher education library mobility program brings professionals to Arquivo.pt

Arquivo.pt headquarters, operated by FCCN-FCT, on the LNEC campus in Lisbon.

On May 24, FCCN welcomed professionals from Higher Education Libraries (HEL) for the first time as part of "My library is your library", a program promoted by the Higher Education Libraries Working Group (GT-BES) of the Portuguese Association of Librarians, Archivists, Documentalists and Information Professionals (BAD).

This mobility program organizes short-term visits for exchanging experiences and gaining hands-on contact with good practices, fostering collaboration and knowledge of Portuguese HEIs among professionals in the field.

Advanced services for knowledge

In this first edition of the program at FCCN, the participating colleagues (3 professionals from the University of Lisbon and 1 from the Catholic University of Porto) were offered a tour of the digital support services for higher education institutions operated by FCCN-FCT.

Some services are familiar to information professionals, such as B-On and RCAAP. Others are back-office services and therefore less visible, but they are essential for higher education institutions. Examples include Eduroam, which provides Internet access, RCTSaai for authentication, and RCTS CERT for responding to security incidents.

Highlights include the Arquivo.pt and NAU services

The day highlighted Arquivo.pt and the NAU Platform, two services in the field of knowledge that are available to higher education institutions and also to society.

The Arquivo.pt team showed the back office of this Internet preservation service in Portugal and ran a practical exercise in recording content and integrating it into the web archive.

The NAU Platform is a platform for MOOCs (Massive Open Online Courses) created with the aim of democratizing knowledge, promoting digital literacy, enabling education and training for broad communities of users, particularly the Portuguese and Lusophone population.

More recently, with its integration into the North American platform edX (edx.org), it has also been made available to all potential Portuguese-speaking trainees around the world. Participants in the program were shown how to build a MOOC on the edX platform.

The program also included a visit to the Data Center and the professional television studio at the FCCN.

Visit by participants in the Higher Education Libraries mobility program to the FCCN TV Studio

A week of job shadowing at Arquivo.pt: from Prague to Lisbon

By: Marie Haškovcová and Luboš Svoboda, Webarchiv, National Library of the Czech Republic, May 13th to 17th, 2024.

A visit within the EU Erasmus+ programme

Thanks to the EU Erasmus+ programme, focused on adult education – staff mobility, we were able to spend a week job shadowing at the Portuguese web archive Arquivo.pt and compare the strategies of the Czech web archive – Webarchiv with the approaches of our Portuguese colleagues.

In both cases, these are archives focused on national (Czech and Portuguese) content on the Internet.

Arquivo.pt

While the Czech web archive is part of the National Library of the Czech Republic, the Portuguese archive (Arquivo.pt) is part of the FCCN, under the FCT – Foundation for Science and Technology, which aims to contribute to the development of science, technology and knowledge.

FCT provides IT services to the Portuguese higher education and research system, as well as high-speed internet connectivity. The institutional background of both archives is also reflected in the specifics of their concepts.

The visit included a presentation of the team and the campus and departmental spaces, a presentation of the activities of both archives and a discussion of the different aspects of our work – technical and curatorial tools, technologies and processes, the legislative environment and ethical issues, data storage, some services, research activities, perspectives and future plans.

The Czech web archive

The Czech web archive was founded in 2000; its oldest archival copies date back to 2001, and it currently holds more than 580 TB of data. Like Arquivo.pt, it harvests content on the national domain based on a list of URLs obtained from its provider. Its acquisition strategy supplements these so-called comprehensive harvests with thematic and selective harvests.

Topic collections relate to a specific topic or event, can be one-off or continuously built, and combine manually selected and automatically scraped resources. Selective harvests are intended for the long term, have detailed cataloging records that are part of the Czech national bibliography, and are licensed, so their archival copies are freely available through the catalogue.

Among Webarchiv's research activities, we presented our project for detecting so-called dead websites through the Extinct Websites application and building a database to serve as a basis for monitoring broader changes in the Czech web, as well as the WACloud project, which aims to extract big data from the web archive.

Exchanging knowledge and experience

Among the Portuguese projects, we were particularly interested in CitationSaver. We also discussed the Memorial project, the harvesting of the Portuguese Wikipedia, and Arquivo.pt's activities related to education in web archiving (training courses).

The meeting was enriched by the discussion of specific topic collections.

  • The Czech net art collection documents digital art and its transformation in the online space, providing a unique art historical perspective.
  • Another important collection is the Social networks of Members of Parliament of the Czech Republic 2021-2025 collection, which preserves the online communications and interactions of Czech MPs, invaluable for the study of political marketing and public political life.
  • The GitHub collection archives important repositories from this popular developer platform, preserving key domestic software projects and their code for future generations.
  • Finally, the Crypto, NFT, Blockchain, Web3, Metaverse collection charts the rise and impact of technology in the digital asset space.

These collections are key resources for research and analysis of digital culture, policy, and technology, and the discussion of these collections at web archivist meetings contributes to the further development of archival methods and technological innovation.

We focused on exchanging knowledge and experience in seeds acquisition, workflow optimization and sharing technical tips and tricks.

Sharing best practices

We discussed best practices for identifying and collecting key web resources, a critical step in ensuring a comprehensive and representative archive. We shared various strategies for automating and streamlining workflows, including the use of web scraping tools and advanced content filtering.

Technical discussions included solutions to common problems such as harvesting dynamic web pages and overcoming access restrictions. The meeting provided a valuable platform for sharing innovative methods and fostering collaboration among experts, furthering the development of effective and sustainable digital archiving.

Luboš Svoboda, web curator, Marie Haškovcová, head of Webarchiv, and Ricardo Basílio, Arquivo.pt web curator, visiting the FCCN-FCT TV Studio.

Exhibition of old websites to mark International Museum Day

May 18, International Museum Day, was celebrated all over the country with free admission, guided tours, entertainment and exhibitions related to memory and heritage.

Arquivo.pt contributed with an exhibition of old web pages, entitled “Digital Memory through the Internet of the Past”, which was on display at one of the stands at the National Coach Museum in Lisbon.

The pages were selected to show different aspects of the Alentejo over time. From 2016, pages relating to the Heritales project were selected.

Heritales and Crowd-Recycling drew attention to the preservation of the Internet’s memory

Heritales is a project based in Évora that aims to study and disseminate heritage in all its manifestations. It is known for its main event created in 2016, HERITALES – International Heritage Film Festival.

Crowd-Recycling is a project focused on good practices for sustainability.

Heritales, Crowd-Recycling and Arquivo.pt carried out this action in collaboration with the aim of giving visibility to content published on the web over time. Preserving and giving access to digital content is fundamental to enhancing heritage.

Why an exhibition of old websites is a good idea

Making an exhibition of websites over time is relatively easy: all you have to do is come up with a theme, which can also be the history of an institution, and choose pages preserved on Arquivo.pt.

An exhibition of old websites is an original idea for the target audience. It often features texts and images that only existed on the web.

By drawing attention to the websites, we realize that many things were left unrecorded and this changes our view of the content we publish today. We start taking more care to save important pages, for example by taking action or saving them on the spot with SavePageNow.

Heritales, Crowd-Recycling and Arquivo.pt on International Museum Day at the National Coach Museum

World Internet Day was on May 17th

The day before International Museum Day was World Internet Day (May 17). The proximity of the two commemorations ties in with the theme of preserving memory.

Portugal connected to the Internet for the first time in 1991, with the FCCN project “RCCN IP Service”.

To remember how it all happened, here are the three suggestions that FCCN published on social media for this day:

Arquivo.pt is a finalist for the DPC Awards 2024

Last updated on June 17th, 2024 at 10:23 am

The Digital Preservation Coalition Awards

The Digital Preservation Coalition (DPC) is dedicated to promoting digital preservation and associated best practices.

The DPC Awards promote exemplary and innovative digital preservation use cases from all over the world.

The Arquivo.pt team submitted two applications to the DPC Awards 2024 in the categories of “Safeguarding the Digital Legacy” and “Research and Innovation”.

The Award for Safeguarding the Digital Legacy celebrates the practical application of preservation tools to protect at-risk digital objects.

The Award for Research and Innovation recognizes excellence in practical research and innovation activities.

Arquivo.pt applications to the DPC Awards

#1 Arquivo.pt catalog of tools for digital preservation

Information that rules modern-day lives is born-digital and disseminated online. However, invaluable digital objects published online have been continuously lost.

Arquivo.pt is a public infrastructure which supports the preservation of digital objects published online to safeguard this digital legacy for future generations.

In October 2023, after 15 years of research and development, Arquivo.pt released a Catalog of 13 innovative tools to support the preservation of at-risk online content, from acquisition to dissemination (e.g. search and access, APIs, training, open data sets, exhibitions).

Arquivo.pt safeguards online digital objects of worldwide interest for research and education.

The Arquivo.pt Catalog was selected as a finalist for the Safeguarding the Digital Legacy Award.

#2 Searching preserved web-images

Images published online are precious digital assets that document contemporary times for future generations.

This initiative describes the research and development of an innovative image search system that enables discovery of and access to billions of preserved images acquired from the web since the 1990s.

This research was applied to enhance the Arquivo.pt web archive with an image search service publicly available to any Internet user, officially launched in August 2022.

The resulting scientific and technical publications are available in open access, and the developed software is available as free open-source software to be reused and enhanced by the community.

This work on searching images preserved in web archives was submitted for the Research and Innovation Award.

Analysis of the Arquivo.pt query dataset

Last updated on May 12th, 2024 at 02:40 pm

Arquivo.pt query logs are unique resources for research

Arquivo.pt provides a "Google-like" service that enables searching pages and images collected from the web since the 1990s. Note that Arquivo.pt search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites.

Analyzing user behavior is an important research topic for understanding users' information needs and enhancing the quality of search results. When a user interacts with a search engine, the system records the user's actions in a file called the query log. Query logs from web archives are unique resources for research because they describe the real needs of web-archive users regarding the historical information published online over time.

Research case study

Flavie Gallois and Adam Jatowt from the University of Innsbruck, and Ricardo Campos from the University of Beira Interior and INESC TEC analyzed user search behavior based on the Arquivo.pt search query log dataset collected over a period of 3 months from June to September 2021 (Analyzing User Search Behaviour in Temporal Web Repositories through Search Query Log Analysis).

This study analyzed query features such as length, type or frequency and compared the obtained results with previous work about user search behavior over web-archives and live-web search engines.

This study revealed interesting trends and patterns about how users search for information within web archives, with strong potential for future research work.

How do web-archive users search?

Figure 1: Distribution of country origin of users
Figure 2: Distribution of languages used in queries

The users came from Portugal in 85.7% of the queries. However, the Portuguese language was identified through automatic language identification of queries as being used in only 37% of the queries. This suggests that users apply other languages than their own to search in web archives.

Users of Arquivo.pt tend to use longer queries with more words and characters in comparison to previous studies, both over web archives and live-web search engines. About 92% of the queries had 5 or fewer terms (average of 25 characters), with 3 being the most common number of submitted terms. In previous work about search behavior in web archives, it was observed that users tended to submit from 1 to 3 terms per query, with 1 term as the most common submission.
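
The query-length figures above can be reproduced on any list of query strings. The following is a minimal sketch, assuming that terms are separated by whitespace; the function name `query_length_stats` and the sample queries are hypothetical and are not taken from the study or from the real dataset.

```python
from collections import Counter


def query_length_stats(queries, max_terms=5):
    """Summarize query length for a non-empty list of query strings.

    Terms are split on whitespace, which is an assumption; the study
    may have tokenized queries differently.
    """
    if not queries:
        raise ValueError("queries must not be empty")
    term_counts = [len(q.split()) for q in queries]
    char_counts = [len(q) for q in queries]
    total = len(queries)
    return {
        # Share of queries with at most `max_terms` terms.
        "share_at_most_max_terms": sum(c <= max_terms for c in term_counts) / total,
        # Average query length in characters.
        "mean_chars": sum(char_counts) / total,
        # Most frequent number of terms per query.
        "most_common_term_count": Counter(term_counts).most_common(1)[0][0],
    }


# Hypothetical sample queries, for illustration only.
sample_queries = [
    "jornal publico 2005",
    "camara municipal lisboa",
    "expo 98",
    "universidade de coimbra historia do site antigo",
    "benfica",
]
length_stats = query_length_stats(sample_queries)
print(length_stats)
```

Running the same function over the real query log would give values comparable to the 92%, 25 characters and 3 terms reported above, provided the tokenization matches the one used in the study.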

Users tend to issue multiple queries within a session instead of a single query, possibly indicating a need for refining their search queries or exploring multiple options for inquiry.
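
As an illustration of how such a session-level observation might be measured, here is a minimal sketch. It assumes that each log row can be reduced to a `(session_id, query)` pair; that layout, the function name and the sample rows are hypothetical and do not describe the actual Arquivo.pt log format.

```python
from collections import Counter


def multi_query_session_share(log_rows):
    """Return the share of sessions that contain more than one query.

    `log_rows` is an iterable of (session_id, query) pairs, which is an
    assumed layout rather than the real query log structure.
    """
    queries_per_session = Counter(session_id for session_id, _query in log_rows)
    if not queries_per_session:
        return 0.0
    multi = sum(1 for count in queries_per_session.values() if count > 1)
    return multi / len(queries_per_session)


# Hypothetical rows, for illustration only.
sample_rows = [
    ("s1", "expo 98"),
    ("s1", "expo 98 lisboa"),
    ("s2", "benfica"),
    ("s3", "publico"),
    ("s3", "jornal publico"),
    ("s3", "jornal publico 2001"),
]
session_share = multi_query_session_share(sample_rows)
print(session_share)
```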

87.7% of the queries submitted to Arquivo.pt came from desktop browsers, even though Arquivo.pt provides mobile-friendly user interfaces. Old web-archived pages are not responsive and render poorly on mobile devices, so it is to be expected that users mostly access web archives from desktops.

Figure 3: Arquivo.pt users can refine the time span of their queries by using the From and To datepickers.

Users refined the time span of the search (using the datepickers) in about 50% of queries, which indicates awareness of the temporal needs peculiar to web-archive usage. Interestingly, users modified the From datepicker more frequently than the To datepicker. Note that keeping the default time span may already fit the user's information needs and does not necessarily indicate a lack of awareness of the time span function (peculiar to web-archive search).
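
A From/To restriction of this kind can also be written as a search URL. The sketch below only builds the URL and does not call the service. It assumes the public Arquivo.pt full-text search endpoint at `https://arquivo.pt/textsearch` with `q`, `from`, `to` and `maxItems` parameters and `YYYYMMDDHHMMSS` timestamps; these names should be checked against the current Arquivo.pt API documentation, and `build_search_url` is a hypothetical helper.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed endpoint; verify against the current Arquivo.pt API documentation.
BASE_URL = "https://arquivo.pt/textsearch"


def build_search_url(query, start=None, end=None, max_items=10):
    """Build a full-text search URL, optionally restricted to a time span."""
    params = {"q": query, "maxItems": max_items}
    if start is not None:
        # Beginning of the first day, in the assumed YYYYMMDDHHMMSS format.
        params["from"] = start.strftime("%Y%m%d") + "000000"
    if end is not None:
        # End of the last day, in the assumed YYYYMMDDHHMMSS format.
        params["to"] = end.strftime("%Y%m%d") + "235959"
    return BASE_URL + "?" + urlencode(params)


search_url = build_search_url(
    "expo 98", start=date(1998, 1, 1), end=date(1998, 12, 31)
)
print(search_url)
```

Leaving `start` and `end` unset corresponds to keeping the default time span.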

Only a small percentage of users included specific years in their query terms (4%), potentially suggesting that in these cases the time span function was insufficient, or unnoticed by some users.
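
One simple way to estimate how often queries mention a specific year is a regular expression over the query text. This is a minimal sketch, not the method used in the study: the 1980 to 2029 range, the function names and the sample queries are all assumptions made for illustration.

```python
import re

# Matches standalone four-digit years from 1980 to 2029 (an assumed range).
YEAR_PATTERN = re.compile(r"\b(?:19[89]\d|20[0-2]\d)\b")


def mentions_year(query):
    """Return True if the query contains a standalone four-digit year."""
    return YEAR_PATTERN.search(query) is not None


def share_with_year(queries):
    """Return the share of queries that mention a year."""
    if not queries:
        return 0.0
    return sum(mentions_year(q) for q in queries) / len(queries)


# Hypothetical sample queries, for illustration only.
year_sample = [
    "eleicoes 2009",
    "expo 98",
    "jornal publico",
    "mundial 2006 portugal",
]
year_share = share_with_year(year_sample)
print(year_share)
```

Two-digit years such as "98" are deliberately not matched here, so a query like "expo 98" is not counted; a real analysis would need to decide how to treat such cases.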

The results suggest that users are more conscious of their information needs and have improved their search techniques to be more effective over web archives, instead of just using them out of curiosity as first-time visitors.

What is searched in a web-archive?

The authors of the study applied automatic named entity recognition over the user queries and derived a set of word clouds that graphically provide a glimpse of the most common information needs of Arquivo.pt users:

Figure 4: Word cloud of the most frequent query terms submitted to Arquivo.pt.
Figure 5: The most frequent Geographical Locations in query terms submitted to Arquivo.pt.
Figure 6: The most frequent Organizations in query terms submitted to Arquivo.pt.
Figure 7: The most frequent Persons in query terms submitted to Arquivo.pt.

Access to the Arquivo.pt query dataset for research

Arquivo.pt released a set of resources to support research studies over its query log dataset:

Evaluation Metrics for web-archive search

The first step toward understanding user behavior is to define evaluation metrics. Defining metrics is a powerful way to set long- and short-term goals and to decide which new products and features should be released to users.

We share a work-in-progress report which aggregates information about Web Archive Search Evaluation Metrics. This contributes to comparing users’ search behavior between live-web and web-archive search engines. Feel free to comment directly on the collaborative document or to contact us.

This report also provides a summary of references to previous work, query workflows and the structure of the corresponding query logs produced by Arquivo.pt, to make it easier for researchers to study these datasets.

Commemoration of the 50th anniversary of April 25 – the Portuguese revolution of 1974

Arquivo.pt joined the celebrations of the 50th anniversary of April 25, the Portuguese Revolution of 1974, as part of the initiatives promoted by the Fundação para a Ciência e a Tecnologia (FCT) in partnership with the Estrutura de Missão – Comissão Comemorativa 50 anos 25 de Abril.

The initiatives were as follows: a journey through time, a special collection on the theme "April 25", a presentation at the "50 years of April International Congress" and the inclusion of a special mention in the 2025 edition of the Arquivo.pt Award.

Memories of April 25 on the Internet exhibition

The exhibition Memories of April 25 on the Internet presents a selection of web pages about the celebrations of April 25 in various regions of the country, since the beginning of the web in the 1990s.

The criteria for choosing the pages for the exhibition were as follows:

  • Pages relating to the April 25 commemorations;
  • Pages found on Arquivo.pt on dates close to the anniversary each year;
  • Diversity to include different areas of the country;
  • Popular demonstrations and official ceremonies.

A historical memory without web archives is incomplete. The aim of this journey through time is to invite citizens to travel back in time, browsing through old web pages and reliving recent episodes in our life as a democracy.

See the exhibition: arquivo.pt/50anos25abril

Special collection on April 25 – the Portuguese Revolution of 1974

To mark the anniversary, Arquivo.pt carried out a special collection on the topic of “April 25” and made the results available in an open dataset, published on the Dados.gov portal.

The dataset contains a list of keywords put into a search engine in order to obtain results on the topic of “April 25”. The search considered names of people, places, political, social and cultural aspects, as well as words associated with the event.

The searches were carried out on March 22, 2024 using the Bing Search API, an automatic search service that returns results according to the relevance criteria of the Bing service itself and others configured by us.

A total of 12,650 unique web page addresses were obtained. It is hoped that the recording of these pages will be useful for the organizations that produced this content, for researchers who want to study our history and for citizens who cultivate a sense of memory and democracy.

Participation in the 50 years of April International Congress

João Gomes, Director of Advanced Services, FCCN-FCT presenting the Arquivo.pt Memorial service at the 50 years of April International Congress

On May 2, 2024, João Gomes, Director of Advanced Services at the FCCN Scientific Computing Unit of the Foundation for Science and Technology I.P., presented Arquivo.pt to the participants of the 50 years of April International Congress, as a distinctive service, open to citizens and useful for organizations.

This event, organized by the Estrutura de Missão – Comissão Comemorativa 50 anos 25 de Abril and the University of Lisbon, included a presentation of two FCT services for citizens: Arquivo.pt and NAU's massive open online courses.

Arquivo.pt is a web preservation service available to all citizens who want to search for old content published on the web.

Using Arquivo.pt contributes to a better understanding of our history. It also provides useful services for cybersecurity, such as the Arquivo.pt Memorial, which is able to maintain institutions’ old websites, preventing attacks and saving them resources.

Special mention for “April 25 and Democracy” at the Arquivo.pt Awards 2025

The Arquivo.pt Award is held annually and honors works that use Arquivo.pt.

In 2025, as part of the celebrations for the 50th anniversary of April 25, a special mention will be made of work on the theme “April 25 and Democracy”.

We therefore challenge researchers and interested citizens to create innovative works using Arquivo.pt.

If you have any questions about the Arquivo.pt Award, please contact us.

Arquivo.pt in Paris for international event

The Arquivo.pt team took part in the General Assembly and Web Archiving Conference of the International Internet Preservation Consortium (GA&WAC 2024), an event that annually brings together web archiving initiatives from around the world.

The National Library of France (BnF), in partnership with the Institut national de l'audiovisuel (INA), hosted this meeting, which took place from April 24 to 26, 2024, in the iconic François Mitterrand building in Paris.

For three days, participants were able to share knowledge and experience on the preservation of information published on the Web.

Arquivo.pt contributed the following presentations:

  • Training the Trainers – Helping Web Archiving Professionals become Confident Trainers (Pre-Conference Workshop, Training Working Group) – Ricardo Basílio (Abstract, slides)
  • 80 Thousand Pages On Street Art : Exploring Techniques To Build Thematic Collections (Session#02: unique content) – Ricardo Basílio (Abstract, slides)
  • Renascer Project Brings Back Old Websites at Arquivo.pt (Session#04: Delivery & Access) – Ricardo Basílio, Daniel Gomes and Vasco Rato (Abstract, slides)
  • Arquivo.pt CitationSaver: Preserving Citations for Online Documents (Session#09: Digital Preservation) – Pedro Gomes, Daniel Gomes (Abstract, slides)
  • Fixing Broken Links with Arquivo404 (Poster session 2) – Vasco Rato, Daniel Gomes (Abstract, slides)

Web archiving training on Madeira Island

Last updated on May 8th, 2024 at 07:31 pm

The Arquivo.pt team was in Funchal between April 15 and 19, 2024, and presented two different sessions on web preservation. The first took place during the Jornadas FCCN 2024 and the second was a workshop, after the event had ended, at the headquarters of the Regional Agency for the Development of Research, Technology and Innovation (ARDITI).

Arquivo.pt at Jornadas FCCN 2024

The session held during the Jornadas FCCN 2024 was entitled “Arquivo.pt at the service of culture” and aimed to highlight two of Arquivo.pt’s collaborations in the field of culture and knowledge, namely with Wikipedia Portugal and the Virtual Museum of Tourism (MUVITUR).

At the FCCN Zapping session, Arquivo.pt presented the Arquivo404 service, which allows websites to offer historical content instead of the negative “Page not found”.

Workshop with ARDITI

The workshop held after the Jornadas, promoted by ARDITI, was open to regional institutions and citizens in general. It was entitled "Arquivo.pt and the preservation of Internet memory".

The contents were structured according to the training program run by Arquivo.pt and preceded by an overview placing Arquivo.pt among the other services of FCCN, the scientific computing unit of FCT (FCCN – Computação Científica da FCT).

Just as important as the content was the dialog that was established during the sessions between the participants and the Arquivo.pt team to clarify doubts or ask questions.

Web preservation is increasingly important for organizations that want to preserve part of their institutional memory and develop security policies.

By hosting and promoting the Arquivo.pt training sessions, ARDITI sent an important signal about preserving the web memory of Madeiran institutions.

If you want to promote the preservation of web content in your organization, check out the Arquivo.pt training and contact us.