Artificial Intelligence processes data from Arquivo.pt


Artificial Intelligence (AI) covers various areas of knowledge, such as linguistics and computing, and is present in the new technologies that citizens use on a daily basis.

For example, it is at work when we search for information on the Internet and the computer generates an amazingly accurate response, in a language very close to our own.

Natural Language Processing (NLP) is what allows machines to perfect the algorithm that generates these answers tailored to Internet users.

The problem is that natural language processing models have been developed more for the English language and less for Portuguese and other languages with less representation.

The more the processing models are trained on a language, the better they will be able to interpret the complexities of the language. But this is only possible if quality data is available.

Portuguese text collection on Arquivo.pt available for research

Arquivo.pt appears here as the largest Portuguese-language textual dataset in Portugal, available in open access, for researchers to train NLP models.

In recent years, researchers from various research groups and projects have drawn attention to the usefulness of preserved web data for large-scale processing.

Arquivo.pt has more than 1 Petabyte of preserved web content dating back to the 1990s, including everything that can be found on web pages. It’s not just text, but also images, audio files, video, page code and various metadata.

The content is accessible via the search interface and the Arquivo.pt APIs.

In order to make it easier to download archived resources from the web, Arquivo.pt has created indexes for researchers in CDXJ format.
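As a rough illustration of how such an index can be read, the sketch below parses a single CDXJ line. It assumes the common CDXJ layout (a URL key, a 14-digit timestamp and a JSON block separated by spaces). The sample line and its JSON field names are invented for illustration and are not taken from an actual Arquivo.pt index.

```python
import json
from datetime import datetime
from typing import NamedTuple


class CdxjRecord(NamedTuple):
    urlkey: str          # canonicalised URL key used for sorting and lookup
    timestamp: datetime  # capture time parsed from the 14-digit field
    fields: dict         # remaining metadata decoded from the JSON block


def parse_cdxj_line(line: str) -> CdxjRecord:
    """Split one CDXJ line into URL key, timestamp and JSON metadata."""
    # Only the first two spaces are separators; the JSON block may contain spaces.
    urlkey, timestamp, json_block = line.strip().split(" ", 2)
    return CdxjRecord(
        urlkey=urlkey,
        timestamp=datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        fields=json.loads(json_block),
    )


# Hypothetical sample line, invented for illustration only.
SAMPLE = (
    'pt,example)/ 20050101120000 '
    '{"url": "http://example.pt/", "mime": "text/html", "status": "200", '
    '"filename": "example.warc.gz", "offset": "1024", "length": "2048"}'
)

record = parse_cdxj_line(SAMPLE)
print(record.urlkey, record.timestamp.isoformat(), record.fields["url"])
```

In a layout like this, the file name, offset and length would let a researcher fetch one archived record without downloading the whole archive file. The exact fields available should be checked against the Arquivo.pt documentation.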

GlórIA, a model for the Portuguese language

One of the projects that used Arquivo.pt to obtain large amounts of text is GlórIA, a large language model (LLM) focused on European Portuguese.

“Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese”, explain the authors of the GlórIA project, Ricardo Lopes, João Magalhães and David Semedo, researchers at the NOVA School of Science and Technology, in their article GlórIA – A Generative and Open Large Language Model for Portuguese.

The model was trained on 35 billion tokens (the units of text that machines process) from various sources.

Arquivo.pt contributed a collection of 1.4 million archived European Portuguese news articles and periodicals.

You can try generating text in European Portuguese using the GlórIA inference API on the Hugging Face model card.

If you want to develop a project or study using Arquivo.pt, you can start your research and, if you need help, contact us.


University of Lisbon preserved over 100 historical websites in the Arquivo.pt Memorial


Last updated on March 27th, 2024 at 11:17 am

More than 100 historical websites from the Faculty of Sciences of the University of Lisbon (FCUL) are now accessible through the Memorial service of Arquivo.pt.

FCUL’s IT Department sent Arquivo.pt a list of old websites hosted on its servers that were no longer updated, but whose historical content remains of interest to the community (e.g. websites of research projects or scientific events).

Arquivo.pt preserved these websites in collaboration with their owners, seeking to maintain a faithful representation of the published content for the future.

FCUL redirected the domain of each website to Arquivo.pt and was then able to disconnect the respective servers and save the resources spent on their maintenance (e.g. electricity, data center space, human resources).

The MiNEMA showcase


Landing page of www.minema.di.fc.ul.pt at Memorial do Arquivo.pt.

The MiNEMA scientific program website was the first that FCUL integrated into the Memorial. This website stopped being updated in 2009, when the project ended. FCUL invested resources in maintaining the website for another 10 years, until it became necessary to shut it down for cybersecurity reasons.

The Memorial of Arquivo.pt emerged as an option. Since 2020, FCUL has only needed to maintain the domain www.minema.di.fc.ul.pt, while Arquivo.pt preserves the information contained on the website.

Please note that the website’s content continues to be displayed in search engine results.

Follow FCUL and preserve your historical websites in the Memorial!

An increasing number of institutions are turning to the Memorial of Arquivo.pt to safely preserve the content of their historical websites. For example, FCUL preserved 116 websites, the Government IT Network Management Center preserved 23 and the Foundation for Science and Technology preserved 40.

Public institutions have priority to benefit from this service. However, other entities can also request it as long as they own the website domain.

Identify which of your historical websites are candidates for integration into the Memorial of Arquivo.pt and contact us!


Completing webpages from the past: it is possible!

Last updated on October 16th, 2023 at 06:59 pm

Some web-archived pages are reproduced incompletely due to problems that occurred during the archiving process (e.g. broken formatting or missing embedded images).

Complete page is a function of Arquivo.pt that recovers missing elements of web-archived pages from other web archives or from the original websites.

When viewing a page archived in Arquivo.pt, the user just needs to open the Options menu in the top right corner and choose Complete page.

This process is performed automatically.

How does Complete page work?

If you open a web-archived page that appears incomplete, try the Complete page option and wait.

Arquivo.pt will search for missing elements on the Internet and in other web archives using the Memento protocol. If it succeeds, the obtained elements will be immediately displayed on the web-archived page.

Later, these recovered elements are integrated into the Arquivo.pt collection, so that the web-archived page will appear more complete on future visits by any user.
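To give an idea of what a Memento lookup involves, the sketch below builds the request a client could send to a TimeGate, the Memento component that selects the archived version closest to a desired date. It is a minimal illustration under stated assumptions, not Arquivo.pt's implementation: the TimeGate address and the image URL are hypothetical, and the request is only constructed, not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple


def build_timegate_request(
    timegate: str, original_url: str, when: datetime
) -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers for a Memento TimeGate request.

    In the Memento protocol the client states the desired date in the
    Accept-Datetime header, formatted as an HTTP date in GMT.
    """
    if when.tzinfo is None:
        raise ValueError("'when' must be a timezone-aware datetime")
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    # Assumption: this TimeGate takes the original URL appended to its base address.
    return timegate + original_url, {"Accept-Datetime": http_date}


# Hypothetical values, invented for illustration only.
url, headers = build_timegate_request(
    "https://timegate.example.org/timegate/",
    "http://example.pt/logo.png",
    datetime(2005, 1, 1, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

A real client would send this request, follow the redirect returned by the TimeGate and read the Memento-Datetime header of the response to learn when the recovered element was archived.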


Using Complete page on the home page of artist Cristina Guerra’s website recovered a missing image.

For example, the website of artist Cristina Guerra archived in 2005 had a missing image. By using Complete page, it was possible in 2021 to obtain this missing image from another web archive which preserved it.

Participate in collaborative curation to improve the quality of Arquivo.pt!

Due to the high number of web-archived pages, it is not possible for Arquivo.pt to complete them all automatically. It is therefore important that users help by identifying important pages with missing elements and trying to complete them.

By using Complete page, users help improve the quality of the historical webpages preserved in Arquivo.pt!

Always try to complete web-archived pages that look incomplete. If you detect any problem, contact us.

Spread the word about the Arquivo.pt Complete page!

Arquivo.pt presentations at IIPC GA/WAC, RESAW 2023 and CLEOPATRA

Last updated on March 10th, 2024 at 05:23 pm

Meeting the Web Archive Community

The International Internet Preservation Consortium (IIPC), a consortium that brings together Web preservation initiatives from around the world, held its General Assembly with its members on May 10, 2023.

On the following days, May 11 and 12, the IIPC Web Archiving Conference (IIPC WAC) was held, an initiative open to the community, where people or entities not associated with the IIPC and interested in the Web preservation domain can participate.

The two events were jointly hosted by KB – National Library of the Netherlands, and by Beeld & Geluid – Netherlands Institute for Sound & Vision.

Contributions from Arquivo.pt to the Web Archiving Conference

Arquivo.pt participated in the IIPC working group meetings (Training Working Group and Curators Working Group) and contributed with presentations in the thematic sessions Collaborations & Outreach and Program infrastructure (sessions 7 and 17).

  • Arquivo.pt updates 2023 (slides)
  • Linking web archiving with arts and humanities: the collaboration between ROSSIO and Arquivo.pt (video, slides)
  • Arquivo.pt behind the curtains (slides)

Meeting the RESAW research community

RESAW (Research Infrastructure for the Study of Archived Web Materials) is an initiative created in 2012 with the aim of promoting studies based on archived Web content, in areas such as Social Sciences, Digital Arts and Humanities.

The RESAW 2023 conference was held at the MUCEM Lab (Mediterranean Institute of Heritage Crafts) in Marseille on June 5-6, 2023, under the theme Exploring the Archived Web During a Highly Transformative Age.

Contributions from Arquivo.pt to RESAW 2023

Arquivo.pt contributed presentations to the sessions Web Archive in Mediterranean area and its merge (4.A), From online Tools to Web Archive (6.B), Towards a participatory approach to collections (9.A) and Digging up the materials for writing web history (9.B).

  • How to research governmental web data? (abstract, slides)
  • Archiving Cryptocurrencies (abstract, slides)
  • Time to explore, time to learn from the archived web: Arquivo.pt training initiative (abstract, slides)
  • Exhibiting Web Memories from Arquivo.pt: a call for community participation (abstract, slides)

CLEOPATRA Project Meeting

The CLEOPATRA Project, led by the L3S Research Center at the Gottfried Wilhelm Leibniz University of Hannover, has run a training programme for doctoral researchers (Early Stage Researcher, PhD) since 2019.

Arquivo.pt has participated in three courses: Incentives design for hybrid multilingual information processing and analytics, in Southampton; National and transnational media coverage of European parliamentary elections, 2004-2014, in London; and NLP for under-resourced languages, in Zagreb, Croatia.

In 2022, Arquivo.pt welcomed two researchers to its facilities, who used the archived resources and received special support from the Arquivo.pt team to develop their research.

The CLEOPATRA Project ended in 2023 with a meeting on May 16 in Hannover, which brought together professors, researchers and representatives of the institutions involved.

Daniel Gomes, Arquivo.pt’s Manager, highlighted the new tools that Arquivo.pt makes available and the results of the work carried out by the researchers that have passed through Arquivo.pt.

15 years of Arquivo.pt celebrated at an event promoted by Wikimedia


Last updated on August 18th, 2023 at 03:29 pm

On November 8, 2007, the Portuguese Web Archive was officially created and later named Arquivo.pt.

To celebrate this date, Wikimedia Portugal and Arquivo.pt jointly organized an online event dedicated to the preservation of digital heritage.

Agenda

  • Introduction – André Barbosa, Wikimedia Portugal (Video)
  • 15 years of Arquivo.pt – Daniel Gomes, Arquivo.pt (Slides, Video)
  • Wikimedia at the University: Exploration and Projects at NOVA FCSH – Rute Correia, WMPT Residency at NOVA FCSH (Slides, Video)
  • GLAM Wiki: a general introduction – Giovanna Fontenelle, Wikimedia Foundation, Brazil (Slides, Video)
  • Demo of the open-access resources in Arquivo.pt – Daniel Gomes (Video)


Ask for an exhibition of posters about historical web pages

Last updated on October 26th, 2023 at 10:19 am

If you wish to host an exhibition in a shared space (e.g. a library, a university hallway or a common room in a congress center), please contact us!

Examples of past exhibitions

These exhibitions presented iconic web pages from Portugal’s recent history.

Universidade do Minho – Gualtar Campus – Braga

  • Where: Espaço B-Lounge of Universidade do Minho Library
  • When: Until April 18. Monday to Friday, from 8h30 to midnight; Saturdays, from 8h30 to 14h30.

Universidade do Minho – Azurém Campus – Guimarães

  • Where: Espaço B-Lounge of Universidade do Minho Library
  • When: From April 23 to May 10. Monday to Friday, from 8h30 to midnight; Saturdays, from 8h30 to 14h30.

ISCTE – Lisboa

  • Where: ISCTE-IUL Library.
  • When: From April 16 until May 11. From 8h to 21h.

Escola Superior de Hotelaria do Estoril

  • Where: Celestino Gomes Library
  • When: Until May 15. Monday to Friday, from 9h30 to 12h00 and from 14h30 to 17h30. From May 6 to 18: Monday to Friday, from 9h30 to 20h30; Saturdays, from 9h30 to 13h.

NOVA – IMS – Campolide Campus

  • Where: Nova IMS Library and Main Corridor.
  • When: Until April 18. From Monday to Friday, from 10h to 18h. And April 22 to 30, from 9h to 20h.

Arquivo.pt exhibitions

Photo gallery of exhibitions held at:

  • Universidade do Minho Library, Azurém Campus
  • ISCTE Library, Lisboa
  • B-Lounge, Universidade do Minho Library
  • Library of the Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa
  • Library of the Escola Superior de Ciências Empresariais (ESCE), Instituto Politécnico de Setúbal
  • Rectory Library, Universidade do Algarve
  • Faculdade de Letras, Universidade de Lisboa
  • Faculdade de Psicologia and Instituto da Educação, Universidade de Lisboa
  • Library of the Faculdade de Ciências, Universidade de Lisboa

News about our poster exhibitions