Arquivo.pt is a finalist for the DPC Awards 2024

Last updated on August 12th, 2024 at 11:50 am

The Digital Preservation Coalition Awards

The Digital Preservation Coalition (DPC) is dedicated to promoting digital preservation and associated best practices.

The DPC Awards promote exemplary and innovative digital preservation use cases from all over the world.

The Arquivo.pt team submitted two applications to the DPC Awards 2024 in the categories of “Safeguarding the Digital Legacy” and “Research and Innovation”.

The Award for Safeguarding the Digital Legacy celebrates the practical application of preservation tools to protect at-risk digital objects.

The Award for Research and Innovation recognizes excellence in practical research and innovation activities.

Arquivo.pt applications to the DPC Awards

#1 Arquivo.pt catalog of tools for digital preservation

Information that rules modern-day lives is born-digital and disseminated online. However, invaluable digital objects published online have been continuously lost.

Arquivo.pt is a public infrastructure which supports the preservation of digital objects published online to safeguard this digital legacy for future generations.

Thus, in October 2023, after 15 years of research and development, Arquivo.pt released a Catalog of 13 innovative tools to support the preservation of at-risk online content, from acquisition to dissemination (e.g. search and access, APIs, training, open data sets, exhibitions).

Arquivo.pt safeguards online digital objects of worldwide interest for research and education.

The Arquivo.pt Catalog was selected as a finalist for the Safeguarding the Digital Legacy Award.

#2 Searching preserved web-images

Images published online are precious digital assets that document contemporary times for future generations.

This initiative describes the research and development of an innovative image search system that enables the discovery of, and access to, billions of preserved images acquired from the web since the 1990s.

This research was applied to enhance the Arquivo.pt web archive with an image search service publicly available to any Internet user, officially launched in August 2022.

The resulting scientific and technical publications are available in open access, and the developed software is available as free open-source software to be reused and enhanced by the community.

This work on searching images preserved in web archives was submitted for the Research and Innovation Award.

Analysis of the Arquivo.pt query dataset

Last updated on October 1st, 2024 at 10:34 am

Arquivo.pt query logs are unique resources for research

Arquivo.pt provides a “Google-like” service that enables searching pages and images collected from the web since the 1990s. Note that Arquivo.pt search complements live-web search engines because it enables temporal search over information that is no longer available online on its original websites.

Analyzing user behavior is an important research topic to understand users’ information needs and enhance the quality of search results. Thus, when a user interacts with a search engine, the system records the user’s actions in a file called the query log. Query logs from web archives are unique resources for research because they describe the real needs of web-archive users about the historical information published online over time.

Research case study

Flavie Gallois and Adam Jatowt from the University of Innsbruck, and Ricardo Campos from the University of Beira Interior and INESC TEC analyzed user search behavior based on the Arquivo.pt search query log dataset collected over a period of 3 months from June to September 2021 (Analyzing User Search Behaviour in Temporal Web Repositories through Search Query Log Analysis).

This study analyzed query features such as length, type and frequency, and compared the results with previous work about user search behavior over web archives and live-web search engines.

This study revealed interesting trends and patterns about how users search for information within web archives, with strong potential for future research work.

How do web-archive users search?

Figure 1: Distribution of country origin of users
Figure 2: Distribution of languages used in queries

The users came from Portugal in 85.7% of the queries. However, automatic language identification detected Portuguese in only 37% of the queries. This suggests that users search web archives in languages other than their own.

Users of Arquivo.pt tend to use longer queries with more words and characters in comparison to previous studies, both over web archives and live-web search engines. About 92% of the queries had 5 or fewer terms (average of 25 characters), with 3 being the most common number of submitted terms. In previous work about search behavior in web archives, it was observed that users tended to submit from 1 to 3 terms per query, with 1 term as the most common submission.
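The query-length figures above can be computed from any list of query strings. The following is a minimal sketch, assuming that terms are separated by whitespace; the function name `query_length_stats` and the sample queries are hypothetical and are not taken from the Arquivo.pt dataset or from the study's own code.

```python
from collections import Counter


def query_length_stats(queries, max_terms=5):
    """Summarize query lengths, assuming terms are separated by whitespace."""
    if not queries:
        raise ValueError("queries must not be empty")
    term_counts = [len(q.split()) for q in queries]
    distribution = Counter(term_counts)
    return {
        # Most common number of terms per query (the mode).
        "most_common_terms": distribution.most_common(1)[0][0],
        # Share of queries with at most `max_terms` terms.
        "share_up_to_max": sum(1 for n in term_counts if n <= max_terms) / len(term_counts),
        # Average query length in characters.
        "avg_chars": sum(len(q) for q in queries) / len(queries),
    }


# Hypothetical sample queries, for illustration only.
sample = [
    "jornal de notícias",
    "câmara municipal lisboa",
    "expo 98",
    "universidade de coimbra",
    "resultados das eleições legislativas de 2005 em portugal",
]
stats = query_length_stats(sample)
```

Applied to a real query log, the same three numbers (most common term count, share of short queries, average characters) would allow a direct comparison with the values reported in earlier studies.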

Users tend to issue multiple queries within a session instead of a single query, possibly indicating a need for refining their search queries or exploring multiple options for inquiry.

87.7% of the queries submitted to Arquivo.pt came from desktop browsers, despite Arquivo.pt providing mobile-friendly user interfaces. Old web-archived pages are not responsive and render poorly on mobile devices. Thus, it is to be expected that users mostly access web archives from their desktops.

Figure 3: Arquivo.pt users can refine the time span of their queries by using the From and To datepickers.

Users refined the time span of the search (using the datepickers) in about 50% of queries, which indicates awareness of the temporal needs peculiar to web-archive usage. Interestingly, users modified the From datepicker more frequently than the To datepicker. Note that keeping the default time span may simply fit the user's information needs; it does not necessarily indicate a lack of awareness of the time-span function.

Only a small percentage of users included specific years in their query terms (4%), potentially suggesting that in these cases the time span function was insufficient, or unnoticed by some users.
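One way to estimate how often queries mention a year explicitly is a simple pattern match. The sketch below is illustrative only: the year range (1990 to 2029), the pattern, and the function name `share_of_queries_with_year` are assumptions, and the study may have identified years differently.

```python
import re

# Matches standalone four-digit years from 1990 to 2029 (an assumed range).
YEAR_PATTERN = re.compile(r"\b(?:199\d|20[0-2]\d)\b")


def share_of_queries_with_year(queries):
    """Return the fraction of queries that mention a year explicitly."""
    if not queries:
        return 0.0
    with_year = sum(1 for q in queries if YEAR_PATTERN.search(q))
    return with_year / len(queries)


# Hypothetical queries: only the first contains a standalone year.
share = share_of_queries_with_year(["eleições 2005", "expo 98", "benfica", "euro2004"])
```

Because the pattern requires word boundaries, a year glued to a word (as in "euro2004") or a two-digit year (as in "expo 98") is not counted; loosening that rule is a design choice that would raise the measured share.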

The results suggest that users are more conscious of their information needs and have improved their search techniques to be more effective over web archives, instead of just using them out of curiosity as first-time visitors.

What is searched in a web-archive?

The authors of the study applied automatic named entity recognition over the user queries and derived a set of word clouds that graphically provide a glimpse of the most common information needs of Arquivo.pt users:

Figure 4: Word cloud of the most frequent query terms submitted to Arquivo.pt.
Figure 5: The most frequent Geographical Locations in query terms submitted to Arquivo.pt.
Figure 6: The most frequent Organizations in query terms submitted to Arquivo.pt.
Figure 7: The most frequent Persons in query terms submitted to Arquivo.pt.

Access to the Arquivo.pt query dataset for research

Arquivo.pt released a set of resources to support research studies over its query logs:

  • Query log dataset
  • Original log files (samples)
  • Documentation

Evaluation Metrics for web-archive search

The first step to understanding user behavior is to define evaluation metrics. Metrics are a powerful tool for setting long- and short-term goals and for deciding which new products and features should be released to users.

We share a work-in-progress report which aggregates information about Web Archive Search Evaluation Metrics. This contributes to comparing users’ search behavior between live-web and web-archive search engines. Feel free to comment directly on the collaborative document or to contact us.

This report also summarizes references to previous work, the query workflows, and the structure of the corresponding query logs produced by Arquivo.pt, to make it easier for researchers to study these datasets.

Commemoration of the 50th anniversary of April 25 – the Portuguese revolution of 1974

Arquivo.pt joined the celebrations of the 50th anniversary of April 25, the Portuguese Revolution of 1974, as part of the initiatives promoted by the Fundação para a Ciência e a Tecnologia (FCT) in partnership with the Estrutura de Missão – Comissão Comemorativa 50 anos 25 de Abril.

The initiatives were as follows: a journey through time, a special collection on the theme “April 25”, a presentation at the “50 years of April International Congress” and the inclusion of a special mention in the 2025 edition of the Arquivo.pt Award.

Memories of April 25 on the Internet exhibition

The exhibition Memories of April 25 on the Internet presents a selection of web pages about the celebrations of April 25 in various regions of the country, since the beginning of the web in the 1990s.

The criteria for choosing the pages for the exhibition were as follows:

  • Pages relating to the April 25 commemorations;
  • Pages found on Arquivo.pt on dates close to the anniversary each year;
  • Diversity to include different areas of the country;
  • Popular demonstrations and official ceremonies.

A historical memory without web archives is incomplete. The aim of this journey through time is to invite citizens to travel back in time, browsing through old web pages and reliving recent episodes in our life as a democracy.

See the exhibition: arquivo.pt/50anos25abril

Special collection on April 25 – the Portuguese Revolution of 1974

To mark the anniversary, Arquivo.pt carried out a special collection on the topic of “April 25” and made the results available in an open dataset, published on the Dados.gov portal.

The dataset contains a list of keywords put into a search engine in order to obtain results on the topic of “April 25”. The search considered names of people, places, political, social and cultural aspects, as well as words associated with the event.

The searches were carried out on March 22, 2024 using the Bing Search API, an automatic search service that returns results according to the relevance criteria of the Bing service itself and others configured by us.

A total of 12,650 unique web page addresses were obtained. It is hoped that the recording of these pages will be useful for the organizations that produced this content, for researchers who want to study our history and for citizens who cultivate a sense of memory and democracy.

Participation in the 50 years of April International Congress

João Gomes, Director of Advanced Services, FCCN-FCT presenting the Arquivo.pt Memorial service at the 50 years of April International Congress

On May 2, 2024, João Gomes, Director of Advanced Services at the FCCN Scientific Computing Unit of the Foundation for Science and Technology I.P., presented Arquivo.pt to the participants of the 50 years of April International Congress, as a distinctive service, open to citizens and useful for organizations.

This event, organized by the Estrutura de Missão – Comissão Comemorativa 50 anos 25 de Abril and the University of Lisbon, included a presentation of two FCT services for citizens: Arquivo.pt and NAU’s massive open online courses.

Arquivo.pt is a web preservation service available to all citizens who want to search for old content published on the web.

Using Arquivo.pt contributes to a better understanding of our history. It also provides useful services for cybersecurity, such as the Arquivo.pt Memorial, which is able to maintain institutions’ old websites, preventing attacks and saving them resources.

Special mention for “April 25 and Democracy” at the Arquivo.pt Awards 2025

The Arquivo.pt Award is held annually and honors works that use Arquivo.pt.

In 2025, as part of the celebrations for the 50th anniversary of April 25, a special mention will be made of work on the theme “April 25 and Democracy”.

We therefore challenge researchers and interested citizens to create innovative works using Arquivo.pt.

If you have any questions about the Arquivo.pt Award, please contact us.

Arquivo.pt in Paris for international event

Last updated on November 21st, 2024 at 11:28 am

The Arquivo.pt team took part in the General Assembly and Web Archiving Conference of the International Internet Preservation Consortium (GA&WAC 2024), an event that annually brings together web archiving initiatives from around the world.

The National Library of France (BnF), in partnership with the Institut national de l’audiovisuel (INA), hosted this meeting, which took place from April 24 to 26, 2024, in the iconic François Mitterrand building in Paris.

For three days, participants were able to share knowledge and experience on the preservation of information published on the Web.

Arquivo.pt contributed the following presentations:

  • Training the Trainers – Helping Web Archiving Professionals become Confident Trainers (Pre-Conference Workshop, Training Working Group) – Ricardo Basílio (Abstract, slides)
  • 80 Thousand Pages On Street Art: Exploring Techniques To Build Thematic Collections (Session#02: Unique Content) – Ricardo Basílio (Abstract, video, slides)
  • Renascer Project Brings Back Old Websites at Arquivo.pt (Session#04: Delivery & Access) – Ricardo Basílio, Daniel Gomes and Vasco Rato (Abstract, video, slides)
  • Arquivo.pt CitationSaver: Preserving Citations for Online Documents (Session#09: Digital Preservation) – Pedro Gomes, Daniel Gomes (Abstract, video, slides)
  • Fixing Broken Links with Arquivo404 (Poster session 2) – Vasco Rato, Daniel Gomes (Abstract, slides)

Image gallery

Arquivo.pt at the IIPC Web Archiving Conference (Paris)

Photos from IIPC WAC 2024 (Web archives in context), including Vasco Rato and Ricardo Basílio presenting the Renascer Project at Arquivo.pt.

Training about web archiving in Madeira island

Last updated on May 8th, 2024 at 07:31 pm

The Arquivo.pt team was in Funchal between April 15 and 19, 2024, and presented two different sessions on web preservation. The first took place during the Jornadas FCCN 2024 and the second was a workshop, after the event had ended, at the headquarters of the Regional Agency for the Development of Research, Technology and Innovation (ARDITI).

Arquivo.pt at Jornadas FCCN 2024

The session held during the Jornadas FCCN 2024 was entitled “Arquivo.pt at the service of culture” and aimed to highlight two of Arquivo.pt’s collaborations in the field of culture and knowledge, namely with Wikipedia Portugal and the Virtual Museum of Tourism (MUVITUR).

At the FCCN Zapping session, Arquivo.pt presented the Arquivo404 service, which allows websites to offer historical content instead of the negative “Page not found”.

Workshop with ARDITI

The post-event workshop, promoted by ARDITI, was open to regional institutions and citizens in general. It was entitled “Arquivo.pt and the preservation of Internet memory”.

The contents followed the training program run by Arquivo.pt and were preceded by an introduction that positioned Arquivo.pt among the other services of FCCN – Computação Científica da FCT.

Just as important as the content was the dialog established during the sessions between the participants and the Arquivo.pt team, which gave room to raise questions and clear up doubts.

Web preservation is increasingly important for organizations that want to preserve part of their institutional memory and develop security policies.

By hosting and promoting the Arquivo.pt training sessions, ARDITI sent an important signal about the value of preserving the web memory of Madeiran institutions.

If you want to promote the preservation of web content in your organization, check out the Arquivo.pt training and contact us.

Artificial Intelligence processes data from Arquivo.pt

Last updated on July 16th, 2024 at 08:33 am

Artificial Intelligence (AI) covers various areas of knowledge, such as linguistics and computing, and is present in the new technologies used by citizens on a daily basis.

For example, it is at work when we search for information on the Internet and the computer generates an amazingly accurate response, in a language very close to our own.

Natural Language Processing (NLP) is what allows machines to perfect the algorithm that generates these answers tailored to Internet users.

The problem is that natural language processing models have been developed more for the English language and less for Portuguese and other languages with less representation.

The more the processing models are trained on a language, the better they will be able to interpret the complexities of the language. But this is only possible if quality data is available.

Portuguese text collection on Arquivo.pt available for research

Arquivo.pt stands out here as the largest Portuguese-language textual dataset in Portugal, available in open access, for researchers to train NLP models.

In recent years, researchers from various research groups and projects have drawn attention to the usefulness of preserved web data for large-scale processing.

Arquivo.pt has more than 1 Petabyte of preserved web content dating back to the 1990s, including everything that can be found on web pages. It’s not just text, but also images, audio files, video, page code and various metadata.

The content is accessible via the search interface and the Arquivo.pt APIs.

In order to make it easier to download archived resources from the web, Arquivo.pt has created indexes for researchers in CDXJ format.
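A CDXJ index is a plain-text file in which each line generally holds a sortable URL key, a 14-digit capture timestamp, and a JSON block describing the capture. The sketch below shows how one such line could be parsed. The sample line and its JSON field names (`url`, `mime`, `status`) are made up for illustration, and the fields present in the Arquivo.pt indexes may differ.

```python
import json
from datetime import datetime


def parse_cdxj_line(line):
    """Split a CDXJ line into its URL key, capture time and JSON fields."""
    # The JSON block may contain spaces, so split on the first two spaces only.
    urlkey, timestamp, json_block = line.strip().split(" ", 2)
    return {
        "urlkey": urlkey,
        "captured_at": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "fields": json.loads(json_block),
    }


# Made-up record for illustration; real index lines may carry other fields.
sample_line = (
    'pt,example)/ 20100315093000 '
    '{"url": "http://example.pt/", "mime": "text/html", "status": "200"}'
)
record = parse_cdxj_line(sample_line)
```

With records parsed this way, a researcher could, for instance, keep only HTML captures from a given year before downloading the corresponding archived resources.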

GlórIA, a model for the Portuguese language

One of the projects that used Arquivo.pt to obtain large amounts of text is GlórIA, a large language model (LLM) focused on European Portuguese.

“Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese”, as the authors of the GlórIA project, Ricardo Lopes, João Magalhães and David Semedo, researchers at the NOVA School of Science and Technology, explain in their article GlórIA – A Generative and Open Large Language Model for Portuguese.

The model used 35 billion tokens, the units of text that machines can process, drawn from various sources.

Arquivo.pt contributed a collection of 1.4M European Portuguese archived news and periodicals.

You can try generating text in European Portuguese using the inference API on the GlórIA model card on Hugging Face.

If you want to develop a project or study using Arquivo.pt, you can start your research and, if you need help, contact us.

Arquivo.pt in the top 3 of government services in Portugal

Last updated on August 6th, 2024 at 05:32 pm

Arquivo.pt, Portugal’s national web preservation service, has earned a prominent position by being named one of the top 3 government services in the 2023 Portugal Digital Awards. This recognition is testimony to the crucial role played by Arquivo.pt in the preservation and accessibility of Portugal’s digital heritage.

The three finalists in the category Best Government Project (best digital transformation project in the public administration sector) were Arquivo.pt, the Porto Digital Association and Banco de Portugal, which won the award.

Mission and recognition

Arquivo.pt, developed by FCCN – National Scientific Computing, stands out as an innovative initiative in the field of digital preservation. Its mission is to collect and archive web content, allowing users to access past versions of web pages, documents and other online resources.

The recognition at the Portugal Digital Awards highlights not only the importance of digital preservation, but also the effectiveness and relevance of Arquivo.pt as a government service. By providing a journey through time via the Internet, this resource becomes a valuable tool for researchers, academics and the general public.

Commitment to digital preservation

Participation in the award underlines Arquivo.pt’s commitment to improving the historical record of the evolution of the Web in Portugal. This service not only contributes to the country’s digital memory, but also facilitates research, promoting understanding of digital evolution over time.

In addition, Arquivo.pt’s distinction reflects FCCN’s ongoing effort to develop and improve innovative services that benefit society. Digital preservation is a crucial component in ensuring that Portugal’s digital heritage is passed on to future generations, and Arquivo.pt fulfills this role in a unique way.

In conclusion, recognition in the Portugal Digital Awards 2023, a competition that received over 300 candidate services, solidifies Arquivo.pt’s position as one of the leading government services at the forefront of digital preservation. This achievement highlights the growing importance of digital preservation in the digital age in which we live.

Arquivo.pt reaches 1 PetaByte of preserved information!

The collection of 1 PetaByte of content predominantly in Portuguese, accessible to both researchers and ordinary citizens, is a milestone that deserves to be celebrated, in the month of its 16th anniversary.

At Arquivo.pt you can search for information published on the Web in the past, such as:

  • The first European page
  • News from The New York Times in 2008
  • European Film Awards 2014

Discover more pages through the selected pages in the Arquivo.pt Online Exhibitions.

Purpose and mission of the Portuguese Web Archive

Arquivo.pt was created on November 8, 2007 with the aim of preserving content from the Portuguese Web.

In 2013, as a service operated by the Fundação para a Ciência e a Tecnologia (FCT), its mission was formulated as follows: “To promote the preservation of content available on the national Internet, ensuring that it is made available to the scientific community and the general public” (Decreto-Lei no. 55/2013).

In recent years, Arquivo.pt has created new services, such as CitationSaver, which allows researchers to record references to web content in their scientific articles, and Memorial and Complete Page, which facilitate access to content scattered throughout the huge 1 PetaByte block of data.

Where did so much information come from?

In order to reach the 1 PetaByte volume, Arquivo.pt periodically recorded content from websites in the .PT domain and from Portuguese websites in other domains.

In addition, frequent daily and monthly collections were made from a small number of government sites and the main news sites in Portugal.

As part of international collaborations, content was collected from sites in various languages, for example on the 2019 European Elections.

Content prior to 2008 came from the Internet Archive and donations, such as a collection made by the National Library and INESC on the 2005 Legislative Elections.

The largest Portuguese-language dataset available to researchers

By making 1 PetaByte of information available, in open access and through the use of APIs (Application Programming Interfaces), Arquivo.pt is a useful tool for research.

For example, a researcher who wants to do a study on elections in Portugal can use the entire Arquivo.pt collection. Better still, they can focus on just a few special collections dedicated to the elections, choosing the ones that interest them and downloading just a few Terabytes to process automatically with the APIs.

Contributions from the various teams and friends of Arquivo.pt

The development of Arquivo.pt is more than a technological issue and has been due to the dedication and persistence of the various teams that have worked on it since 2007.

It was also due to the contribution of many friends of Arquivo.pt, who were always on hand to help improve it, and to the response of the user community.

Congratulations to all! Thank you.

Arquivo404 more powerful!

Last updated on August 9th, 2024 at 12:59 pm

Arquivo.pt has been launching innovative complementary services that help organizations optimize their operations.

The new release of Arquivo.pt named Helios was launched on November 13, 2023 and includes developments in Arquivo404 and CitationSaver.

Arquivo404 with new methods for defining time intervals

Arquivo404 is a service that presents website users with links to web-archived versions, instead of laconic “Page not found” error messages.

However, sometimes it is necessary to specify the correct version of a web-archived page to be displayed. For example, a website’s domain may have belonged to another entity in the past, and only web-archived versions since the website came under its current owners should be displayed.

For this purpose, 3 new methods for configuring Arquivo404 were released:

  • setMinimumDate( minDate : Date ) – specifies the earliest date of the web-archived version of the URL that can be displayed.
  • setMaximumDate( maxDate : Date ) – specifies the latest date of the web-archived version of the URL that can be displayed.
  • setMostRelevantMemento( criterion : ‘oldest’ | ‘most-recent’ ) – specifies the order of results for the versions retrieved from the web archive. By default, the oldest version is displayed ( ‘oldest’ ).

In short, Arquivo404 now allows you to define whether to display the oldest or most recent web-archived page to the users, within a certain time interval.
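The selection rule described by these three methods can be modeled in a few lines. The sketch below is not the Arquivo404 code and does not use its API: it is a Python illustration of the logic, with a hypothetical function `pick_memento`, and it assumes that both date bounds are inclusive.

```python
from datetime import datetime


def pick_memento(capture_dates, minimum_date=None, maximum_date=None, criterion="oldest"):
    """Return the capture date to display, or None if no capture qualifies."""
    if criterion not in ("oldest", "most-recent"):
        raise ValueError("criterion must be 'oldest' or 'most-recent'")
    # Keep only captures inside the configured interval (bounds assumed inclusive).
    candidates = [
        d for d in capture_dates
        if (minimum_date is None or d >= minimum_date)
        and (maximum_date is None or d <= maximum_date)
    ]
    if not candidates:
        return None
    return min(candidates) if criterion == "oldest" else max(candidates)


# Hypothetical capture dates for one URL.
captures = [datetime(2005, 3, 1), datetime(2012, 6, 15), datetime(2020, 1, 10)]
chosen = pick_memento(captures, minimum_date=datetime(2010, 1, 1), criterion="most-recent")
```

In this example the 2005 capture is excluded by the minimum date, which mirrors the case of a domain that changed owners: only versions from the current owner's period remain eligible, and the criterion then decides between the oldest and the most recent of them.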

CitationSaver processes HTML documents

CitationSaver is a service that extracts citations in documents to online resources and archives them. This service is particularly useful for maintaining the integrity of scientific articles and the reproducibility of the experiments and studies described in them.

Many open-access articles are published in hypertext format (HTML). CitationSaver now processes documents in HTML format, in addition to PDF and TXT formats.

For example, if a user finds an article on the Web which contains citations to online resources, they simply need to submit the URL of the article to CitationSaver. The URLs cited in the article will be extracted and their content will be web-archived for later access.
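The extraction step can be illustrated with the Python standard library. This is a minimal sketch and not how CitationSaver is implemented: it only collects absolute links from `<a href="...">` attributes, and the class name `LinkExtractor`, the function `extract_cited_urls` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs found in <a href="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                if value not in self.urls:  # keep the first occurrence only
                    self.urls.append(value)


def extract_cited_urls(html_text):
    """Return the distinct absolute URLs linked from an HTML document."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.urls


# Made-up article fragment with one repeated and one relative link.
sample_html = (
    '<p>See <a href="https://example.org/report">the report</a>, '
    '<a href="/local/page">a local page</a> and '
    '<a href="https://example.org/report">the report again</a>.</p>'
)
cited = extract_cited_urls(sample_html)
```

A real service would also need to handle URLs written as plain text in the article body and relative links, which this sketch deliberately ignores.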

Example of an article from the Journal of Integrated Coastal Management, available on SciELO

Give us feedback about our services and if you detect any issue, please contact us.

World Digital Preservation Day dedicated to Justice

Last updated on November 13th, 2023 at 08:59 am

The Instituto de Gestão Financeira e Equipamentos da Justiça (IGFEJ) and Secretaria Geral do Ministério da Justiça (SGMJ), in collaboration with BAD, organized the event “Digital Preservation in Justice” to mark World Digital Preservation Day on November 2, 2023.

The event, which took place in the auditorium of the Polícia Judiciária in Lisbon, was attended by representatives from the government’s justice department and professionals from the archives, communications and IT departments.

How to use Arquivo.pt to preserve institutional websites

Arquivo.pt took part in the presentation “Preserve your website”, which addressed the issue of preserving institutional websites and critical aspects such as cybersecurity.

Justice entities can benefit from Arquivo.pt and its various services to ensure good preservation of their websites, mitigate cybersecurity threats and provide historical content to citizens.

The presentation concluded with the following recommendations:

  • Inventory and publicize your current and historical websites
  • Use Arquivo.pt services collaboratively
  • Save content in a standardized format with ArchiveWeb.page
