Last updated on July 16th, 2024 at 08:33 am
Artificial Intelligence (AI) spans several areas of knowledge, such as linguistics and computing, and is present in the technologies people use every day.
For example, when we search for information on the Internet, the computer can generate a remarkably accurate response in language very close to our own.
Natural Language Processing (NLP) is what allows machines to refine the algorithms that generate these answers tailored to Internet users.
The problem is that natural language processing models have been developed mainly for English, and far less for Portuguese and other under-represented languages.
The more text in a given language a model is trained on, the better it can interpret that language's complexities. But this is only possible if quality data is available.
Portuguese text collection on Arquivo.pt available for research
This is where Arquivo.pt comes in: it is the largest Portuguese-language textual dataset in Portugal, available in open access for researchers to train NLP models.
In recent years, researchers from various research groups and projects have drawn attention to the usefulness of preserved web data for large-scale processing.
Arquivo.pt holds more than 1 petabyte of preserved web content dating back to the 1990s, including everything that can be found on web pages: not just text, but also images, audio files, video, page code and various metadata.
The content is accessible via the search interface and the Arquivo.pt APIs.
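As a rough illustration of programmatic access, the sketch below builds a query URL for a full-text search. The endpoint path and the parameter names (`q`, `maxItems`, `from`, `to`) are assumptions made for this example, as is the helper name `build_search_url`; check them against the Arquivo.pt API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm it in the Arquivo.pt API documentation.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, start=None, end=None, max_items=10):
    """Build a full-text search URL (hypothetical helper for illustration).

    start and end are 14-digit timestamps (YYYYMMDDhhmmss) that bound the
    capture date; both parameter names are assumptions.
    """
    params = {"q": query, "maxItems": max_items}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url(
    "lisboa", start="19960101000000", end="20001231235959", max_items=5
)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response is expected to be JSON, but its exact fields should be taken from the official documentation.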
To make it easier to download archived web resources in bulk, Arquivo.pt provides researchers with indexes in CDXJ format.
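To give an idea of what working with such an index looks like, here is a minimal sketch that parses a single CDXJ line. It assumes the common CDXJ layout (a sort-friendly URL key, a 14-digit capture timestamp, and a JSON block, separated by spaces). The sample line, its field names and the helper `parse_cdxj_line` are hypothetical; the fields in real Arquivo.pt indexes may differ.

```python
import json
from datetime import datetime


def parse_cdxj_line(line):
    """Split one CDXJ line into its URL key, capture time and JSON fields."""
    # Split on the first two spaces only; the JSON block may contain spaces.
    urlkey, timestamp, payload = line.strip().split(" ", 2)
    return {
        "urlkey": urlkey,
        "captured_at": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "fields": json.loads(payload),
    }


# Hypothetical sample line, written for this example only.
sample = (
    'pt,example)/ 20100101120000 '
    '{"url": "http://example.pt/", "mime": "text/html", "status": "200", '
    '"filename": "example.warc.gz", "offset": "1234", "length": "5678"}'
)

record = parse_cdxj_line(sample)
print(record["captured_at"].year, record["fields"]["mime"])
```

Fields such as the file name, offset and length (where present) are what allow a single archived record to be located inside a larger archive file without downloading everything.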
GlórIA, a model for the Portuguese language
One of the projects that used Arquivo.pt to obtain large amounts of text is GlórIA, a large language model (LLM) focused on European Portuguese.
“Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese,” explain the authors of the GlórIA project, Ricardo Lopes, João Magalhães and David Semedo, researchers at the NOVA School of Science and Technology, in their article GlórIA – A Generative and Open Large Language Model for Portuguese.
The model used 35 billion tokens (the units of text that machines can process) drawn from various sources.
Arquivo.pt contributed a collection of 1.4 million archived European Portuguese news and periodical items.
You can try generating text in European Portuguese using the inference API on the GlórIA Hugging Face model card.
If you want to develop a project or study using Arquivo.pt, you can start your research and, if you need help, contact us.
Learn more
- GlórIA – A Generative and Open Large Language Model for Portuguese (Preprint)
- GlórIA – ArquivoPT News PT-PT Dataset – Hugging Face Model card
- Generating a European Portuguese BERT Based Model Using Content from Arquivo.pt Archive (Preprint)
- Research open datasets on 2019 European Parliamentary Elections
- Arquivo.pt APIs (Application Programming Interfaces)
- Bulk download of web-archived resources
- Arquivo.pt Awards (find use cases)