Tutorial: how to explore Arquivo.pt using Python

Last updated on July 17th, 2023 at 01:44 pm

The Programming Historian aims to develop digital skills among Humanities researchers through the publication of practical lessons in several languages.

The call “Computational analysis skills for large-scale humanities data” resulted in 7 new lessons.

One of them was the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt” developed by Daniel Gomes and Ricardo Campos.

It shows how to explore the Arquivo.pt user interface and the Application Programming Interface (API) to execute advanced queries, process large amounts of data or build new services, such as Tell me stories.
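As a rough illustration of what an API query can look like, the sketch below builds a full-text search request URL in Python without sending it. The endpoint https://arquivo.pt/textsearch and the parameter names q, from, to and maxItems are assumptions that should be checked against the Arquivo.pt API documentation before use; the helper name build_search_url and the example query are hypothetical and do not come from the tutorial.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Arquivo.pt full-text search API (verify in the API docs).
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, start=None, end=None, max_items=10):
    """Return a search URL for `query`, optionally limited to a date range.

    `start` and `end` are timestamps in YYYYMMDDHHMMSS format.
    The parameter names used here are assumptions, not a confirmed API contract.
    """
    params = {"q": query, "maxItems": max_items}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    # urlencode escapes spaces and special characters in the query string
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("Jorge Sampaio", start="19960101000000", end="20001231235959")
print(url)
```

The resulting URL could then be fetched with any HTTP client; the tutorial itself remains the reference for the actual workflow.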

All the resources developed are freely available in open access.

Open-access resources of the tutorial “Timeline summarization for large-scale past-web events with Python: the case of Arquivo.pt”

 

 

Open dataset about cryptocurrency

Cryptocurrency chart

Last updated on August 17th, 2022 at 09:19 am

(Photo: QuoteInspector)

Since 2008 the cryptocurrency market has revolutionised the world by innovating and expanding into other areas (e.g., finance and art). However, with this rapid expansion, many projects are created every day, giving rise to a wide and varied range of websites, technologies and scams. Markets follow financing stages and it is during an initial stage of euphoria that more projects are created.

We believe that as the cryptocurrency market stabilises, projects and their websites are disappearing because funding diminishes or runs out.

Arquivo.pt initiated a new web archive collection that preserves web content documenting cryptocurrency activities.

This work produced a new open dataset with information documenting each cryptocurrency project, including its original URLs and links to the corresponding web-archived versions in Arquivo.pt. The information sources selected to create this dataset were:

We believe that this new dataset on cryptocurrencies, together with the preserved web content, has the potential to originate innovative scientific contributions in areas such as Economics or Digital Humanities.

Resources

Researchers who want to carry out studies on the Cryptocurrencies dataset and need earlier access to the collected contents can contact Arquivo.pt.

Presentation at the IIPC Web Archiving Conference 2022

H2020 projects preserved by Arquivo.pt


Last updated on June 16th, 2023 at 01:40 pm

The main objective of Arquivo.pt is to preserve online information for research and education purposes.

Previously, Arquivo.pt identified and preserved Research & Development project websites funded by the European Union during the FP4, FP5, FP6 and FP7 programmes (1994-2013).

Now, Arquivo.pt has contributed to preserving online information that documents R&D projects funded by the Horizon 2020 programme (2014-2021). It preserved 197 million web files (17 TB) related to science for future access.

H2020 projects publish valuable information online but it is being lost

Websites about Research and Development (R&D) projects are increasingly being used to publish and disseminate important scientific information that complements published literature (e.g. data sets, documentation or software).

However, after projects end, the corresponding websites usually disappear, causing a permanent loss of unique and valuable scientific information.

Arquivo.pt automatically identified URLs that document H2020 Research and Development projects

The European Union’s Open Data Portal published a data set from the Community Research and Development Information Service (CORDIS) that documents H2020 research projects. However, of the 31 129 projects listed, only 46% presented a project URL.

Arquivo.pt developed a low-cost methodology that automatically identifies URLs related to R&D projects so that they can be systematically preserved. This automatic identification is achieved by combining open data sets with web search services. The methodology is detailed in a scientific article presented at the International Conference on Digital Preservation 2016.

In sum, we extracted 106 300 unique URLs from the following open data sets:

Then, we extracted the acronym and title of the projects from the data sets and automatically searched the web for additional URLs using the Bing Search API.
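The step above can be sketched in Python as follows. This is only an illustration of turning a project's acronym and title into candidate search queries: the function name build_project_queries and the query formats are hypothetical assumptions, not the ones used by Arquivo.pt, and the call to the search service itself is deliberately left out.

```python
def build_project_queries(acronym, title):
    """Return candidate web-search queries for one project.

    The query formats are illustrative assumptions, not the ones used by Arquivo.pt.
    """
    acronym = acronym.strip()
    title = " ".join(title.split())  # collapse repeated whitespace
    queries = []
    if acronym and title:
        # Acronym plus exact title: the most specific query
        queries.append(f'{acronym} "{title}"')
    if title:
        # Exact title alone, for pages that do not mention the acronym
        queries.append(f'"{title}"')
    if acronym:
        # Acronym with programme context, to disambiguate short acronyms
        queries.append(f"{acronym} H2020 project")
    return queries


queries = build_project_queries("EXTMOS", "Extended Model of Organic Semiconductors")
for q in queries:
    print(q)
```

Each query would then be submitted to a web search service, and the returned URLs filtered before being added to the list of addresses to preserve.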

All the data sets and tools developed have been made publicly available in open access so that they can be reused and collaboratively enhanced. In particular, you can access the software developed to automatically identify additional URLs about H2020 projects.

197 million web files related to science were preserved

Arquivo.pt identified and preserved 197 million web files (17 TB) that document R&D projects funded by Horizon 2020.

In 2021, we can already witness project websites that are no longer available online, such as the Extended Model of Organic Semiconductors (EXTMOS) project (http://extmos.eu/). However, it was preserved and can be accessed at Arquivo.pt:

Archived version at Arquivo.pt (https://arquivo.pt/wayback/20170427182603/http://extmos.eu/) of the home page of the EXTMOS Research and Development project (http://extmos.eu/) funded by H2020.
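The address of the archived EXTMOS page follows a visible pattern: the Arquivo.pt wayback prefix, a 14-digit capture timestamp, and the original URL. The minimal Python sketch below reproduces that pattern. Reading the timestamp as YYYYMMDDHHMMSS is an assumption based on this single example, and the helper name archived_url is hypothetical.

```python
WAYBACK_PREFIX = "https://arquivo.pt/wayback"


def archived_url(original_url, timestamp):
    """Build the address of an archived version of `original_url`.

    `timestamp` is assumed to be a 14-digit string in YYYYMMDDHHMMSS format.
    """
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError("timestamp must have 14 digits: YYYYMMDDHHMMSS")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


link = archived_url("http://extmos.eu/", "20170427182603")
print(link)
```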

Contributions to complement the European Open Data Sets

All the resulting data sets were made publicly available so that they can be improved and reused by other organizations also interested in preserving this digital heritage:

To learn more about this collection, you can watch the video Preservation of web content related to Horizon 2020.

References

Are you a researcher?

Applications open for the Arquivo.pt Award 2020!


Applications are open for the Arquivo.pt Award 2020!

In this 3rd edition of the annual Arquivo.pt Award, € 15,000 will be awarded to the 3 best works (1st place: € 10,000).

The deadline for submissions is May 4, 2020.

Works may be developed individually or in groups, on any topic, as long as they use the information preserved by Arquivo.pt as their main source.

The Público Newspaper is the official media partner of the Arquivo.pt Award in 2020. It was one of the first newspapers to become available online.

The Público newspaper celebrates its 30th anniversary on March 5, 2020 and will award an Honorable Mention to one of the works that focus on the historical web archive of Público online.

Full details about the applications are available at:
https://arquivo.pt/award2020

The Arquivo.pt Award promotes the visibility of the applicants and their institutions.

Help us to spread the word!

Meet the winners of the Arquivo.pt Award 2019!

Last updated on August 26th, 2022 at 02:22 pm

The winners of the Arquivo.pt Award 2019 were announced by His Excellency the Prime Minister of Portugal, António Costa, on July 8 at 9:00 am in Auditorium 1 of the Lisbon Congress Center, during the event Science 2019 – Encounter with Science and Technology.

23 applications were received.

Presentation of the Arquivo.pt Awards during Encontro Ciência 2019


Photos by Valter Gouveia, FCT

1st place: meuParlamento.pt

MeuParlamento.pt is a mobile application that simulates the Portuguese Parliament, calling on all citizens to play the role of a deputy.

Authors: Nuno Moniz, Arian Pasquali, Tomás Amaro

2nd place: Revisionista.pt: Un-cover the news

Revisionista.pt is an online tool that reveals post-publication changes in Portuguese news.

Authors: Flávio Martins and André Mourão

3rd place: Public speeches about violence in private

Analysis of 217 news articles on domestic violence, collected through Arquivo.pt from the three main daily newspapers.

Author: Zélia de Macedo Teixeira

Materials for dissemination about the winners

About the Arquivo.pt Award 2019

Arquivo.pt is a research infrastructure managed by the Foundation for Science and Technology (FCT) that enables search and access to web pages archived since 1996.

The Arquivo.pt Award aims to annually promote innovative works based on historical information preserved by Arquivo.pt. Submissions closed on May 3 and we received works in areas such as media studies, education, design, information technology, health, and cultural and historical heritage.

The Arquivo.pt Award 2019 received the High Patronage of His Excellency the President of the Republic, Marcelo Rebelo de Sousa.

Arquivo.pt Award 2020

The Arquivo.pt Award was approved as an annual initiative of the Foundation for Science and Technology that will run from January to May.

Sign up for the Arquivo.pt international mailing list to be informed about future editions of the Arquivo.pt Award!

 

Java/Linux expert to collaborate with Arquivo.pt

The ROSSIO Project - Social Sciences, Arts and Humanities has an open call for 1 position to collaborate with Arquivo.pt.

The call is open until June 13th, 2019.

Check the application details at the links below.

Main tasks

  • Development of large-scale distributed computer systems;
  • Analysis, planning and operation of computer services;
  • Corrective and evolutionary maintenance of software;
  • Interaction with scientific and technical communities to establish collaborations and identify requirements.

Application details

Arquivo.pt in the Azores, May 2019

Jornadas Computação Científica 2019

Last updated on May 2nd, 2019 at 01:49 pm

Jornadas de Computação Científica is an annual event on scientific computing. In 2019 it will take place from May 6 to 8 at the University of the Azores in Ponta Delgada.

The event is aimed at teachers, students, researchers and service providers in Higher Education institutions.

Arquivo.pt sessions on May 6, from 14h30 to 18h00

14h30 – The Memory of the Web: a forgotten heritage?, Daniel Gomes
15h15 – Preservation of institutional Web sites, Ricardo Basílio

16h00 – Coffee break

16h30 – Recommendations for publishing information on the Web, Daniel Bicho
17h15 – Automatic access to Arquivo.pt (APIs), Fernando Melo
17h45 – A few words from our sponsor Huawei, Henrique Oliveira

The event also includes an Azorean dinner on May 7.

Registration is free but mandatory.

Sign up!

IT specialists to collaborate with Arquivo.pt

Last updated on May 21st, 2019 at 03:59 pm

The ROSSIO Project - Social Sciences, Arts and Humanities has an open call for 2 positions to collaborate with Arquivo.pt.

The call is open until February 5th, 2019.

Check the application details at the links below.

1. Front-end developer

Main tasks

  • Middleware development;
  • Development of user interfaces;
  • User experience (UX) quality control.

Application details

2. Java/Linux specialist

Main tasks

  • Development of large-scale distributed computer systems;
  • Analysis, planning and operation of computer services;
  • Corrective and evolutionary maintenance of software;
  • Interaction with scientific and technical communities to establish collaborations and identify requirements.

Application details

Study on “homosexuality” based on past news

Last updated on October 26th, 2018 at 11:35 am

Framing the concept of “homosexuality” in 20 years of publication of the Expresso newspaper

by João Teixeira Duarte and Zélia Teixeira, winners of the second prize of the Arquivo.pt Awards 2018.

This work highlights the social changes, in Portugal and internationally, in the visibility of LGBT-related social issues. It also explores the role of the media in promoting that visibility and in keeping these issues on the social agenda.

20 years of the Expresso news site under review

To achieve this, 20 years of the online edition of the Expresso newspaper (1997 to 2017), mainly accessed through the Arquivo.pt web archive, were analysed both quantitatively and qualitatively.

In quantitative terms, through grounded analysis the following categories emerged: Legislative-political (81 pieces), Social (37), Discrimination and aggression (23), Arts and culture (18), Popular (12), Religious (15), Opinion (19), Link between homosexuality and pedophilia (5).

In qualitative terms, a trend towards normalising homosexual individuals, relationships and related issues (such as same-sex marriage and adoption) was observed.

In parallel, a tendency to progressively represent the LGBT population’s negative experiences for emotional resonance was also found.

In general, this archive offers a glimpse of the evolution of this social phenomenon in Portuguese society, in tandem and sometimes in contrast with the international realities that framed it.

More Info (in Portuguese)

Tutorial and Workshop in Porto, September

Working session of the Investiga XXI group

Last updated on August 7th, 2018 at 10:32 am

Would you like to know more about web archives?

Then, do not miss the RESAW@Porto2018 workshop and the tutorial Using web archive tools to preserve and research the Past Web.

Workshop RESAW@Porto2018

The RESAW@Porto2018 workshop is aimed at everyone who wishes to explore web archives to search for information about the past. The detailed program is already available.

This workshop will be held on September 13, 2018 (9:00-18:30) in Porto (FEUP), Portugal, as part of the TPDL 2018 international conference.

Price and registration

The registration fee is 120 euros, or 90 euros for students. Lunch is included.

In the registration form, you must:

  1. send the following comment: “Special authorization for a reduced fee to the Web Archive workshop – An introduction to web archives for Humanities and Social Science research”;
  2. choose the payment option “bank transfer”.

Then, you will receive the details to make the payment.

You may register only for the workshop. Registration for the rest of the conference is optional.

Tutorial “Research the Past Web using Web archives”

The tutorial Research the Past Web using Web archives is suitable for researchers, computer scientists, information professionals and webmasters, who wish to gain new insights about preserving information published online.

This tutorial will be held on September 10, 2018 (9:00-12:30) in Porto (FEUP), Portugal, as part of the TPDL 2018 conference.

Price and registration

Registration for the tutorial is free, but registration for the rest of the conference is mandatory.

Spread the word!

Help us disseminate these events among potential participants.