Memento Guide:
This document describes how the Robots Exclusion Protocol (aka robots.txt)
can be leveraged by Web servers to support discovery in the Memento framework.
Use this document alongside the Introduction to Memento. Feedback is welcome on the Memento Development Group list at memento-dev@googlegroups.com.
The Robots Exclusion Protocol
The Robots Exclusion Protocol's robots.txt file is commonly used by Web
site owners to give instructions about their site to Web robots. Typically, it is used to exclude certain directories from crawling.
For example, the following entries in a robots.txt file convey that all pages in the /tmp/ directory on the server are excluded from crawling by any robot:
User-agent: *
Disallow: /tmp/
Various extensions have been proposed for robots.txt. For example, the Sitemap
directive supports discovery of Sitemaps that provide detailed descriptions of a Web server's content
to enhance discoverability via search engines:
Sitemap: http://a.example.org/sitemaps/sitemap_index.xml
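The two kinds of entries described above can be read back with Python's standard urllib.robotparser module. This is a minimal sketch, not part of the Memento guide: the robots.txt text is assembled from the examples in this section, the page URLs being tested are hypothetical, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content assembled from the examples in this section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/
Sitemap: http://a.example.org/sitemaps/sitemap_index.xml
"""


def load_rules(text):
    """Parse robots.txt text (already fetched) into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Hypothetical page URLs, used only to exercise the rules.
tmp_allowed = rules.can_fetch("examplebot", "http://a.example.org/tmp/scratch.html")
index_allowed = rules.can_fetch("examplebot", "http://a.example.org/index.html")
sitemaps = rules.site_maps()

print(tmp_allowed, index_allowed, sitemaps)
```

The Disallow rule applies to every robot because of the wildcard User-agent, while the Sitemap line is independent of any User-agent group.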
Memento Extensions for the Robots Exclusion Protocol
Two extensions to robots.txt are introduced to support discovery in the Memento
framework:
[1] Discovery of TimeGates:
The TimeGate and Archived directives for robots.txt
provide a server-wide mechanism to support TimeGate discovery that can be used by:
The TimeGate and Archived directives MUST be used together, and their meaning is as follows:
A robots.txt file may contain several occurrences of the TimeGate directive, each with one or more associated Archived directives.
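The pairing rule can be checked mechanically. The following is a rough sketch under several assumptions that the text does not confirm: that the directives use the usual "Field: value" robots.txt syntax, that field names are case-insensitive, and that each Archived line is associated with the nearest preceding TimeGate line. The function name parse_timegate_groups and all directive values in the sample are hypothetical.

```python
def parse_timegate_groups(text):
    """Group Archived values under the TimeGate line that precedes them.

    Returns a list of (timegate_value, [archived_values]) tuples and raises
    ValueError when the two directives are not used together.
    Association by line order is an assumption made for this sketch.
    """
    groups = []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop robots.txt comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "timegate":
            if current is not None and not current[1]:
                raise ValueError("TimeGate without an Archived directive")
            current = (value, [])
            groups.append(current)
        elif field == "archived":
            if current is None:
                raise ValueError("Archived without a preceding TimeGate")
            current[1].append(value)
    if current is not None and not current[1]:
        raise ValueError("TimeGate without an Archived directive")
    return groups


# Hypothetical directive values, for illustration only.
SAMPLE = """\
TimeGate: http://a.example.org/timegate/
Archived: /pictures/
Archived: /docs/
TimeGate: http://b.example.org/timegate/
Archived: /news/
"""

groups = parse_timegate_groups(SAMPLE)
print(groups)
```

The sketch only validates the grouping; it does not interpret what the TimeGate or Archived values mean.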
Examples:
[2] Discovery of Mementos:
The combination of the User-agent and Allow directives for robots.txt can provide a server-wide mechanism
to support discovery
of Mementos that can be used by all servers that host Mementos,
i.e. crawler-based Web Archives, Content Management Systems, Transactional Archives, and Snapshot Archives. However, in order to restrict crawling and
mirroring to robots that respect the sticky Memento-Datetime behavior,
a value of memento MUST be used for the User-agent directive.
The sticky Memento-Datetime notion entails that applications that mirror Mementos at a different URI MUST NOT change the Memento-Datetime header and value of those Mementos unless mirroring involves a meaningful state change. This behavior allows, for example, duplicating a Web archive at a new location while preserving the value of the Memento-Datetime header of the archived resources. Hence, the use of the User-agent and Allow directives is as follows:
Allow entries can be associated with the User-agent directive.
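To see how a robot might evaluate such entries, the sketch below parses a hypothetical robots.txt with Python's standard urllib.robotparser module. The /archive/ path, the page URL, the robot name otherbot, and the additional wildcard group that disallows the same path for other robots are all assumptions made for illustration; only the memento User-agent value and the use of Allow come from the text above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: paths and the wildcard group are assumptions.
MEMENTO_ROBOTS_TXT = """\
User-agent: memento
Allow: /archive/

User-agent: *
Disallow: /archive/
"""

memento_rules = RobotFileParser()
memento_rules.parse(MEMENTO_ROBOTS_TXT.splitlines())

# Hypothetical URL of an archived resource.
memento_url = "http://a.example.org/archive/2010/page.html"

# A robot identifying itself as "memento" matches the first group.
memento_ok = memento_rules.can_fetch("memento", memento_url)
# Any other robot falls through to the wildcard group.
other_ok = memento_rules.can_fetch("otherbot", memento_url)

print(memento_ok, other_ok)
```

A robot that announces itself as memento is expected to respect the sticky Memento-Datetime behavior; robots.txt itself cannot enforce that, it only signals which paths such robots may crawl.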
Example: