Last updated on August 1st, 2017 at 01:52 pm
Identifying the date of publication of each piece of content makes it possible to place that content correctly in time.
The date of publication should be supplied through:
- Text written on the page. The date of publication of a page should be defined by its author and updated only when the content changes significantly. Scripts that automatically write a modification date on the page should therefore not be used;
- Last-Modified, HTTP header field:
  - It should be available for any content, including content that is dynamically generated (e.g. .php, .asp, .jsp).
  - The Last-Modified date is commonly generated from the date the file was last updated. The dates of the files should therefore remain unchanged when the files are copied or moved between directories.
- The use of frames should be avoided, because a framed page is composed of several pages rendered to the user as a single one. This makes it difficult to assign a single correct date of publication to the content.
- Date and Time, EXIF image metadata field. EXIF is a metadata schema that records several properties of an image, including its date of capture and digitization.
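
The Last-Modified recommendation above can be illustrated with a minimal Python sketch. It assumes the header value is derived from the file's modification time; the helper names `last_modified_header` and `copy_preserving_date` are hypothetical and chosen only for this example. It shows how to format the date as an HTTP header value and how to copy a file without resetting that date.

```python
import os
import shutil
from email.utils import formatdate


def last_modified_header(path):
    """Return an HTTP-date string built from the file's modification time."""
    mtime = os.path.getmtime(path)
    # usegmt=True yields the "GMT" form used in HTTP headers.
    return formatdate(mtime, usegmt=True)


def copy_preserving_date(src, dst):
    """Copy a file and keep its modification time.

    shutil.copy2 copies file metadata along with the data, whereas
    shutil.copy would give the copy a fresh modification time.
    """
    shutil.copy2(src, dst)
    return dst
```

A dynamically generated page has no single file date to rely on, so in that case the application would need to send a Last-Modified value itself, for example from the date stored with the underlying content.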
To avoid repeatedly archiving unchanged content, it is advisable to supply the following additional metadata:
- Content-Length, HTTP header field: it supplies the size of the content measured in number of bytes;
- ETag, HTTP header field: it is an identifier that can be used to detect whether the page has changed in comparison to a previous version.
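
As a rough illustration of how these two fields help detect unchanged content, the following Python sketch derives both values from the response body and compares them with the values recorded on a previous visit. Hashing the body with SHA-256 is an assumption made for this example; servers may generate ETags in other ways. The function names `validators` and `is_unchanged` are hypothetical.

```python
import hashlib


def validators(body: bytes) -> dict:
    """Build Content-Length and an ETag derived from the body bytes."""
    digest = hashlib.sha256(body).hexdigest()
    return {
        "Content-Length": str(len(body)),  # size in bytes
        "ETag": '"' + digest + '"',        # ETag values are quoted strings
    }


def is_unchanged(previous: dict, current: dict) -> bool:
    """Treat content as unchanged only if both fields match."""
    return (
        previous["ETag"] == current["ETag"]
        and previous["Content-Length"] == current["Content-Length"]
    )
```

A crawler that stored the fields from its last visit could call `is_unchanged` with the newly received fields and skip archiving the content again when it returns `True`.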
Learn more
- L. Clausen, Concerning Etags and Datestamps, 2004.
- Kristinn Sigurdsson, Incremental crawling with Heritrix, 2005.