
Memento Guide:
Discovery Using the Robots Exclusion Protocol




This document describes how the Robots Exclusion Protocol (aka robots.txt) can be leveraged by Web servers to support discovery in the Memento framework.

Use this document alongside the Introduction to Memento.

Feedback is welcome on the Memento Development Group list at memento-dev@googlegroups.com.

The Robots Exclusion Protocol

The Robots Exclusion Protocol's robots.txt file is commonly used by Web site owners to give instructions about their site to Web robots. Typically, it is used to exclude certain directories from crawling. For example, the following entries in a robots.txt file convey that all pages in the /tmp/ directory on the server are excluded from crawling by any robot:

User-agent: *
Disallow: /tmp/


Various extensions have been proposed for robots.txt. For example, the Sitemap directive supports discovery of Sitemaps that provide detailed descriptions of a Web server's content to enhance discoverability via search engines:

Sitemap: http://a.example.org/sitemaps/sitemap_index.xml
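As a rough illustration, both of the entries shown above can be read with Python's standard-library urllib.robotparser module. This is only a sketch: the robot name anybot and the page URLs are placeholders chosen for the example, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt fragments from the text, combined into one file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/

Sitemap: http://a.example.org/sitemaps/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "anybot" is a hypothetical robot name; it falls under the "*" group.
tmp_allowed = parser.can_fetch("anybot", "http://a.example.org/tmp/page.html")
docs_allowed = parser.can_fetch("anybot", "http://a.example.org/docs/page.html")

# URLs listed in Sitemap directives (None if the file has none).
sitemaps = parser.site_maps()
```

Here tmp_allowed is False because the path falls under /tmp/, docs_allowed is True, and sitemaps holds the single Sitemap URL.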

Memento Extensions for the Robots Exclusion Protocol

Two extensions to robots.txt are introduced to support discovery in the Memento framework:
  • To support the discovery of TimeGates: the TimeGate and Archived directives;
  • To support the discovery of Mementos and to allow those to be crawled and mirrored under the appropriate conditions: the memento value for the User-agent directive, combined with one or more Allow directives.
These extensions are discussed in detail below.


Discovery of TimeGates: TimeGate and Archived Directives for robots.txt

The TimeGate and Archived directives for robots.txt provide a server-wide mechanism to support TimeGate discovery that can be used by:
  • Servers of Original Resources;
  • Crawler-based Web Archives, Content Management Systems, and Transactional Archives that provide access to Mementos by exposing TimeGates.
The TimeGate and Archived directives MUST be used together, and their meaning is as follows:
  • TimeGate: Conveys the base URL (that is, the URI scheme, host, and path components) that is shared by the URIs of all TimeGates for a set of Original Resources.
  • Archived: Identifies, by means of the MANDATORY host part and OPTIONAL path part of a URI, the set of Original Resources for which TimeGates are available under the base URL conveyed in the associated TimeGate directive.
A robots.txt file may contain several occurrences of the TimeGate directive, each with one or more associated Archived directives.
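The pairing of directives described above could be extracted from a robots.txt file along the following lines. This is a minimal sketch, not a reference implementation: the function name parse_timegate_directives is hypothetical, and the code assumes that each Archived line belongs to the nearest preceding TimeGate line, that directive names are case-insensitive, and that text after # is a comment.

```python
def parse_timegate_directives(robots_txt: str) -> list[tuple[str, list[str]]]:
    """Return (TimeGate base URL, [Archived values]) pairs in file order."""
    groups: list[tuple[str, list[str]]] = []
    for raw_line in robots_txt.splitlines():
        # Assumption: '#' starts a comment, as in conventional robots.txt files.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        name, value = name.strip().lower(), value.strip()
        if name == "timegate":
            groups.append((value, []))
        elif name == "archived" and groups:
            # Assumption: Archived attaches to the most recent TimeGate line.
            groups[-1][1].append(value)
    # The directives MUST be used together, so a TimeGate line without
    # any Archived line is dropped.
    return [group for group in groups if group[1]]


sample = """\
TimeGate: http://mementoarchive.lanl.gov/ta/timegate/
Archived: mementoweb.org/
Archived: lanlsource.lanl.gov/
TimeGate: http://memento.waybackmachine.org/memento/timegate/
Archived: mementoweb.org/
Archived: lanlsource.lanl.gov/
"""
groups = parse_timegate_directives(sample)
```

For this sample, groups contains two entries, one per TimeGate base URL, each listing the two Archived values that follow it.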

Examples:
  • A MediaWiki server http://a.example.org/w/ has installed the Memento MediaWiki extension, which results in exposing TimeGates to access the wiki's history pages at base URL http://a.example.org/w/index.php/Special:TimeGate/. An actual TimeGate for the wiki's http://a.example.org/w/My_Title page would then be at http://a.example.org/w/index.php/Special:TimeGate/http://a.example.org/w/My_Title. This MediaWiki server can make its TimeGates discoverable by using the following directives in its http://a.example.org/robots.txt file:

    TimeGate: http://a.example.org/w/index.php/Special:TimeGate/
    Archived: a.example.org/w/

  • A crawler-based Web archive recurrently collects resources from Belgian servers (.be domain). It exposes TimeGates to make these archived resources accessible at base URL http://a.belgianarchive.org/timegate/. For example, archived copies of the home page of the newspaper De Avond (http://deavond.be) would be available via TimeGate http://a.belgianarchive.org/timegate/http://deavond.be. This Belgian Web Archive can declare its TimeGates and coverage using the following directives in its http://a.belgianarchive.org/robots.txt file:

    TimeGate: http://a.belgianarchive.org/timegate/
    Archived: .be/

  • The newspaper De Avond (http://deavond.be) is aware that the Belgian Web Archive recurrently archives its pages, and that this archive generally exposes TimeGates at base URL http://a.belgianarchive.org/timegate/. De Avond can make TimeGates for these archived resources discoverable by including the following directives in its http://deavond.be/robots.txt file:

    TimeGate: http://a.belgianarchive.org/timegate/
    Archived: .deavond.be/

  • A server at Los Alamos hosts Original Resources for the http://mementoweb.org and http://lanlsource.lanl.gov domains, and has its resources archived in an associated Transactional Archive that exposes TimeGates such as http://mementoarchive.lanl.gov/ta/timegate/http://mementoweb.org and http://mementoarchive.lanl.gov/ta/timegate/http://lanlsource.lanl.gov/hello.
    In this case, the server's robots.txt file would contain the following Memento-related entries:

    TimeGate: http://mementoarchive.lanl.gov/ta/timegate/
    Archived: mementoweb.org/
    Archived: lanlsource.lanl.gov/

    These resources are also archived by the Internet Archive, and the server could make TimeGates there discoverable by using the following additional entries in its robots.txt file:

    TimeGate: http://memento.waybackmachine.org/memento/timegate/
    Archived: mementoweb.org/
    Archived: lanlsource.lanl.gov/

  • The Internet Archive hosts Mementos for a wide range of Original Resources, and exposes TimeGates for them at the base URL http://memento.waybackmachine.org/memento/timegate/.
    In order to make these TimeGates discoverable, the Internet Archive can include the following lines in its robots.txt file:

    TimeGate: http://memento.waybackmachine.org/memento/timegate/
    Archived: *
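A client that has read such directives could derive candidate TimeGate URIs roughly as sketched below. The matching rules are an interpretation inferred from the examples above, not a normative definition: * is taken to cover every Original Resource, a host part with a leading dot is taken to cover that domain and its subdomains, any other host part must match exactly, and an optional path part is treated as a path prefix. The function names is_covered and timegate_uris are hypothetical.

```python
from urllib.parse import urlsplit


def is_covered(pattern: str, original_uri: str) -> bool:
    """Check an Archived value against an Original Resource URI (assumed rules)."""
    if pattern == "*":
        return True
    host_pattern, _, path_pattern = pattern.partition("/")
    host_pattern = host_pattern.lower()
    parts = urlsplit(original_uri)
    host = (parts.hostname or "").lower()
    if host_pattern.startswith("."):
        # Assumption: ".be" covers deavond.be; ".deavond.be" also covers deavond.be.
        host_ok = host == host_pattern[1:] or host.endswith(host_pattern)
    else:
        host_ok = host == host_pattern
    # An empty URI path is treated as "/" so that http://deavond.be matches ".be/".
    path = parts.path or "/"
    return host_ok and path.startswith("/" + path_pattern)


def timegate_uris(groups: list[tuple[str, list[str]]], original_uri: str) -> list[str]:
    """Append the Original Resource URI to every TimeGate base URL that covers it."""
    return [
        base_url + original_uri
        for base_url, patterns in groups
        if any(is_covered(pattern, original_uri) for pattern in patterns)
    ]


# Directive pairs taken from the MediaWiki and Belgian Web Archive examples.
groups = [
    ("http://a.example.org/w/index.php/Special:TimeGate/", ["a.example.org/w/"]),
    ("http://a.belgianarchive.org/timegate/", [".be/"]),
]
wiki_timegates = timegate_uris(groups, "http://a.example.org/w/My_Title")
news_timegates = timegate_uris(groups, "http://deavond.be")
```

With these inputs, wiki_timegates and news_timegates each hold the single TimeGate URI given for that resource in the corresponding example.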




Discovery of Mementos: User-agent and Allow Directives for robots.txt

The combination of the User-agent and Allow directives for robots.txt can provide a server-wide mechanism to support discovery of Mementos that can be used by all servers that host Mementos, i.e. crawler-based Web Archives, Content Management Systems, Transactional Archives, and Snapshot Archives. However, in order to restrict crawling and mirroring to robots that respect the sticky Memento-Datetime behavior, a value of memento MUST be used for the User-agent directive.

The sticky Memento-Datetime notion entails that applications that mirror Mementos at a different URI MUST NOT change the Memento-Datetime header and value of those Mementos unless mirroring involves a meaningful state change. This behavior allows, for example, duplicating a Web archive at a new location while preserving the value of the Memento-Datetime header of the archived resources.

Hence, the use of the User-agent and Allow directives is as follows:

  • User-agent: Has memento as its value;
  • Allow: Lists the path that contains Mementos that can be crawled, and for which content can be mirrored subject to the sticky Memento-Datetime behavior.
Several Allow entries can be associated with the User-agent directive.

Example:
  • The following is a robots.txt file for a server that generally disallows crawling, yet allows agents that respect the sticky Memento-Datetime behavior to crawl Mementos in its /web/ path:

    User-agent: *
    Disallow: /
    User-agent: memento
    Allow: /web/
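A crawler that identifies itself as memento could check the example above with Python's standard-library urllib.robotparser module before fetching. This is a sketch only: the host archive.example.org, the path below /web/, and the robot name otherbot are placeholders, and passing the check says nothing about whether the crawler actually honors the sticky Memento-Datetime behavior.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
User-agent: memento
Allow: /web/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical Memento URI under the /web/ path.
memento_uri = "http://archive.example.org/web/20100101000000/http://deavond.be"

# An agent named "memento" matches the second group and may crawl /web/.
memento_agent_allowed = parser.can_fetch("memento", memento_uri)

# Any other agent ("otherbot" is a placeholder) falls back to the "*" group.
other_agent_allowed = parser.can_fetch("otherbot", memento_uri)
```

Here memento_agent_allowed is True and other_agent_allowed is False.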