Challenges

The main challenges related to web archiving.

Developing and maintaining a web archiving service is not trivial, and two requirements must be considered:

  • The need for a research component: web archiving technology has not yet stabilized. System development will therefore have to be complemented with a research component that examines existing solutions and solves new problems as they arise;
  • The need for continuity: the project has an initial expected duration of two years. However, the system will have to be designed with the requirements of long-term preservation of information in mind. The effort invested in this project will be wasted unless a long-term plan ensures its continuity and the preservation of the archived information.

Any system that processes information from the web has to deal with the typical unpredictability of this information source. The web is extremely heterogeneous and constantly changing, so it is difficult to make realistic assumptions about its characteristics. The main challenges that this project must address are:

  • Unexpected system updates to keep pace with the evolution of the web. It is difficult to predict exactly how the web will evolve. For example, the amount of information published on the web has increased rapidly in recent years, driven in part by the popularity of blogs and content management systems;
  • Difficulties in automatic information processing. Most Portuguese web pages are of low technical quality and often disregard basic principles and recommendations for web publishing. As a result, tools and systems that perform well on other webs are not efficient when processing Portuguese web contents. These processing problems also affect search engines, which frequently ignore pages containing them. The result has been a progressive loss of visibility of Portuguese web sites. Fortunately, Portuguese publishers are becoming aware of this problem, which may lead to improvements in the future;
  • Creating efficient web archive search mechanisms. Creating a search engine over one web crawl is not straightforward. Doing it over an archive containing multiple web crawls harvested over the years raises this problem to a higher level of complexity.
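
To illustrate why searching over multiple crawls is harder than searching over a single one, the following is a minimal sketch, not a description of the system actually used by this project. It assumes a toy in-memory inverted index in which every posting records the date of the crawl that captured the page, so the same URL can appear many times and a query may need to be restricted to a time span. The class name MultiCrawlIndex, its methods and the example URLs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date


class MultiCrawlIndex:
    """Toy inverted index (hypothetical) whose postings carry a crawl date."""

    def __init__(self):
        # term -> set of (url, crawl_date) captures that contain the term
        self._postings = defaultdict(set)

    def add_capture(self, url, crawl_date, text):
        """Index one archived version of a page, as harvested on crawl_date."""
        for term in set(text.lower().split()):
            self._postings[term].add((url, crawl_date))

    def search(self, query, start=None, end=None):
        """Return captures containing all query terms, optionally
        restricted to crawls dated between start and end (inclusive)."""
        terms = query.lower().split()
        if not terms:
            return []
        # A capture matches only if every query term occurs in it.
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        if start is not None:
            hits = {h for h in hits if h[1] >= start}
        if end is not None:
            hits = {h for h in hits if h[1] <= end}
        # Oldest captures first, then by URL for a stable order.
        return sorted(hits, key=lambda h: (h[1], h[0]))


index = MultiCrawlIndex()
index.add_capture("http://example.pt/", date(2008, 1, 10), "arquivo da web portuguesa")
index.add_capture("http://example.pt/", date(2010, 6, 1), "arquivo web novo")
index.add_capture("http://other.pt/", date(2010, 6, 1), "noticias")

# The same URL is returned once per crawl in which it matched.
for url, crawled in index.search("arquivo web"):
    print(crawled.isoformat(), url)
```

Even in this toy form, every posting is multiplied by the number of crawls in which the page was captured, and ranking has to decide which version of a page to show. A real archive would additionally need on-disk indexes, duplicate detection across crawls and more careful text processing.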

We count on everyone’s collaboration to overcome these challenges.