Goals
This page presents the main goals of the Portuguese Web Archive project.
Goals of the Portuguese Web Archive
The creation of a Portuguese Web Archive represents a historic milestone in the preservation of knowledge for future generations. By creating a system that supports regular crawls of the Portuguese web, along with the long-term storage of and access to the crawled contents, we intend to provide the following services:
- Term search over the archived contents: it will enable the identification of archived contents over the years that contain certain terms;
- URL search over the archived contents: it will make it possible to identify the several versions of a content gathered from a given URL;
- New search engine over the Portuguese web: the archive will enable searching over several Portuguese web collections. Providing a search service over the most recent collection, as current web search engines do, is attainable with relatively little additional effort and would be an interesting service for the Portuguese community;
- Historical collections of web contents for research purposes: the web holds information about the most varied subjects, reflecting how society changes over time. Researchers from different fields use the web as a source of information for their studies. Providing web collections will enable these researchers to store and process web data locally on their computers without having to crawl the web themselves;
- Characterization reports of the Portuguese web: a web archive system must be tuned according to the characteristics of the archived data. Therefore, characterizations of the Portuguese web must be generated periodically. As these studies are of interest to a broader audience, they will be published. Characterizing national webs is useful for measuring the spread of information technologies in different societies and the evolution of the web over time;
- Backup system of the archived information (rARC): it will be a distributed system that enables Internet users to provide disk space to store backup copies of the archived contents by installing a small application on their computers. If a failure happens on the central repository, the archived collection will be recovered from the backup copies stored on the users’ computers. Any individual or institution can contribute to preserving the web by providing some disk space on their computers;
- Archived data parallel processing system: it will allow researchers to execute their programs over the archived web data using several computers in parallel.
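To make the difference between the term search and URL search services more concrete, the following is a minimal in-memory sketch, not the actual design of the archive. The `Capture` record, the `TinyArchive` class, the `example.pt` URLs and the sample texts are all hypothetical and exist only for illustration; a real archive would index far larger collections on disk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """One archived version of a content (hypothetical record layout)."""
    url: str
    timestamp: str  # crawl date as YYYYMMDD, so string order is time order
    text: str


class TinyArchive:
    """Toy archive supporting URL search and term search (illustration only)."""

    def __init__(self):
        self._by_url = defaultdict(list)   # URL -> captures gathered from it
        self._by_term = defaultdict(set)   # term -> captures containing it

    def add(self, capture):
        self._by_url[capture.url].append(capture)
        # Naive tokenization: lowercase and split on whitespace.
        for term in capture.text.lower().split():
            self._by_term[term].add(capture)

    def url_search(self, url):
        """Return all versions gathered from a URL, oldest first."""
        return sorted(self._by_url.get(url, []), key=lambda c: c.timestamp)

    def term_search(self, term):
        """Return all captures, across the years, that contain the term."""
        matches = self._by_term.get(term.lower(), set())
        return sorted(matches, key=lambda c: (c.timestamp, c.url))


archive = TinyArchive()
archive.add(Capture("http://example.pt/", "20080101", "Portuguese web archive launched"))
archive.add(Capture("http://example.pt/", "20090101", "archive now offers term search"))
archive.add(Capture("http://example.pt/news", "20090101", "news from the Portuguese web"))

versions = [c.timestamp for c in archive.url_search("http://example.pt/")]
hits = [(c.timestamp, c.url) for c in archive.term_search("Portuguese")]
print(versions)  # ['20080101', '20090101']
print(hits)      # [('20080101', 'http://example.pt/'), ('20090101', 'http://example.pt/news')]
```

In this sketch, URL search answers "which versions of this address were archived?", while term search answers "which archived contents, from any year, mention this word?".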
We also want to achieve the following goals:
- Train human resources in web archiving to enable the maintenance of the system in the future;
- Export know-how, experience and technology in web archiving to other countries, especially the Portuguese-speaking ones;
- Contribute to increasing the number of domains registered under .PT: the free historical archiving of the information published under this domain could be an additional incentive for those registering domains;
- Publish scientific and technical papers to share the acquired knowledge and to receive feedback from the community on the work performed.