Memento Capabilities for Wikipedia

Last updated: September 19, 2013

Latest version:
   http://mementoweb.org/wikipedia/

Prepared by:
   Herbert Van de Sompel, Martin Klein, Robert Sanderson - Los Alamos National Laboratory
   Michael Nelson - Old Dominion University

Abstract

This document describes capabilities for time travel that could be added to Wikipedia and other MediaWiki platforms. The capabilities all leverage the Memento protocol specified in RFC 7089, an extension of HTTP that introduces negotiation in the datetime dimension and that is now natively supported by several web archives. The document details the requirements for realizing each capability and describes the current state of efforts in this regard.

The Memento team considers Wikipedia, and other MediaWiki platforms, to be a major use case for the Memento protocol because of the availability of a complete version history for pages. Therefore, the Memento team is eager to work with the Wikipedia/MediaWiki community towards adding Memento-related capabilities.

Introduction

The Memento "Time Travel for the Web" effort started in 2009 with the overall goal of making it as easy to navigate the past of the web as it is to navigate the current web. The basic idea underlying the Memento protocol is that an old version of a web page, such as a version of a Wikipedia article - http://en.wikipedia.org/w/index.php?title=Web_archiving&oldid=526371727 - can be retrieved by accessing its original URI - http://en.wikipedia.org/wiki/Web_archiving - and by applying datetime negotiation to it. Datetime negotiation is similar to content negotiation, which browsers frequently use, for example, to ask a server for a version of a page in a specific format, e.g. HTML or PDF. Datetime negotiation asks the server for a version as it existed at a specific date, and uses the special-purpose Accept-Datetime HTTP header to do so. A 101-style description of the technical underpinnings of the Memento protocol is provided in the Introduction to Memento. The protocol details are specified in RFC 7089.
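
As a minimal sketch of what a datetime negotiation request carries, the following Python fragment builds the Accept-Datetime header for a desired date. The function name is an assumption made for illustration; the header name and the HTTP date format in GMT follow RFC 7089. The fragment only constructs the header and does not contact any server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when):
    """Return request headers asking for a version as of `when`.

    `when` must be a timezone-aware datetime; it is converted to GMT
    because Accept-Datetime values are expressed as HTTP dates in GMT.
    """
    if when.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    as_gmt = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_gmt, usegmt=True)}


# Example: ask for the page as it existed on an arbitrary date.
headers = accept_datetime_headers(datetime(2009, 4, 7, tzinfo=timezone.utc))
print(headers)  # {'Accept-Datetime': 'Tue, 07 Apr 2009 00:00:00 GMT'}
```

A client would send these headers along with a regular GET or HEAD request, in the same way it sends an Accept header for content negotiation.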

The Memento protocol is now natively supported by several web archives. In addition, all versions of DBpedia are natively accessible via the Memento protocol, and proxy support for all language versions of Wikipedia has been implemented.

Memento's time travel is not yet natively supported in browsers and hence requires installing an extension. The first such extension was MementoFox for Firefox. It is currently not aligned with the most recent version of the Memento protocol and its use is discouraged. A Memento extension for Chrome was released at the end of September 2013. A preview movie illustrates the extraordinary time travel functionality it provides.

Memento Navigation

The picture below illustrates Memento time travel and highlights capabilities it could provide for Wikipedia and other MediaWiki platforms. The navigation assumes client-side Memento support, i.e. installation of the Chrome extension for Memento. All transitions in the navigation can be achieved using existing Memento infrastructure, notably including the proxy Memento support for Wikipedia.


The depicted navigation starts at the left-hand side of the picture. The following description of the steps assumes that time-based access to a page is available by means of a right-click paradigm, as implemented in the Chrome extension:

Current Status

The addition of the capability illustrated by the Red Arrow transitions 1 and 2 has been discussed in a Wikipedia RFC about Memento that had a positive outcome. Adding Memento support to Wikipedia involves implementing the datetime negotiation that the protocol defines. To that end, a Memento add-on for MediaWiki platforms was implemented. The process aimed at deploying the add-on for Wikipedia led to a negative decision because of concerns about the technical quality of the add-on and the lack of native browser support for Memento. As a result, Memento support for Wikipedia, in all its language versions, is provided by means of a proxy solution. Proxy support has several disadvantages:
  • Response times are longer than would be the case with native support.
  • Lacking the Memento protocol HTTP headers in Wikipedia responses, Memento clients:
    • Need to implement exceptions to be able to distinguish between current and prior versions of Wikipedia pages.
    • Cannot determine the version datetime of a Wikipedia page in the manner intended by the Memento protocol and would need to resort to screen-scraping to do so.
    • Need to keep track of the TimeGate and TimeMap URIs for Wikipedia pages in a hard-coded way.
  • Proxy infrastructure needs to be maintained and is subject to failure if changes are made to the Wikipedia APIs used by the proxy.
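
To make the client-side burden concrete, the sketch below contrasts the kind of Wikipedia-specific exception a client needs today with the generic check that Memento headers would allow. Both function names are assumptions made for illustration, and the URL heuristic relies only on the oldid parameter visible in the version URI shown earlier; a real client may need further rules.

```python
from urllib.parse import parse_qs, urlparse


def looks_like_version_page(uri):
    """Site-specific heuristic: MediaWiki version URIs carry an oldid parameter."""
    return "oldid" in parse_qs(urlparse(uri).query)


def is_memento(response_headers):
    """Protocol-based check: a prior version is served with Memento-Datetime."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    return any(name.lower() == "memento-datetime" for name in response_headers)


print(looks_like_version_page(
    "http://en.wikipedia.org/w/index.php?title=Web_archiving&oldid=526371727"))  # True
print(looks_like_version_page("http://en.wikipedia.org/wiki/Web_archiving"))     # False
```

The first function has to be written and maintained per site, whereas the second works unchanged for any server that returns the Memento headers.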

HTTP Header Support

A relatively simple, yet important step could be taken towards Memento support for Wikipedia by alleviating the problems that clients experience due to the lack of appropriate HTTP headers, while still relying on the proxy solution for datetime negotiation. This can be achieved by adding the Memento HTTP headers to responses from Wikipedia pages. This involves:
  • For current pages, e.g. http://en.wikipedia.org/wiki/Web_archiving: Add an HTTP Link header with a "timegate" link that points to the proxy TimeGate. The response header would then be as follows:



  • For version pages, e.g. http://en.wikipedia.org/w/index.php?title=Web_archiving&oldid=526371727: Add an HTTP Link header with a "timegate" link that points to the proxy TimeGate and an "original" link that points at the current page. Add the Memento-Datetime header that conveys the version datetime of the version page. The response header would then be as follows:
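
The two cases above can be sketched as follows. This is an illustration under stated assumptions, not the headers actually proposed: the TimeGate prefix uses a placeholder host (timegate.example.org) because the proxy address is not given here, the function name is hypothetical, and the version datetime in the usage example is made up. The link relation types "timegate" and "original" and the Memento-Datetime header are those defined in RFC 7089.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Placeholder for the proxy TimeGate prefix (an assumption, not the real address).
TIMEGATE_PREFIX = "http://timegate.example.org/timegate/"


def memento_headers(original_uri, version_datetime=None):
    """Build the Memento response headers for a current or a version page.

    Pass `version_datetime` (timezone-aware) only for version pages.
    """
    timegate = TIMEGATE_PREFIX + original_uri
    if version_datetime is None:
        # Current page: only a pointer to the TimeGate is needed.
        return {"Link": f'<{timegate}>; rel="timegate"'}
    # Version page: point to the current page and the TimeGate, and state
    # the datetime of this version.
    as_gmt = version_datetime.astimezone(timezone.utc)
    return {
        "Memento-Datetime": format_datetime(as_gmt, usegmt=True),
        "Link": f'<{original_uri}>; rel="original", <{timegate}>; rel="timegate"',
    }


current = memento_headers("http://en.wikipedia.org/wiki/Web_archiving")
version = memento_headers(
    "http://en.wikipedia.org/wiki/Web_archiving",
    datetime(2012, 12, 4, 12, 0, 0, tzinfo=timezone.utc),  # made-up datetime
)
for name, value in version.items():
    print(f"{name}: {value}")
```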

HTTP Header and Datetime Negotiation Support

Native support for Memento datetime negotiation in Wikipedia is achieved by a full implementation of the protocol. Lessons learned from the initial MediaWiki add-on have been taken into account, and an ongoing project supported by the Andrew W. Mellon Foundation is aimed at developing a version that addresses the above-mentioned concerns and that fully meets the needs of the MediaWiki community. An initial version will be shared with the MediaWiki community by the end of September 2013, along with a demonstration installation with the add-on installed.

It will be essential to get active involvement from the MediaWiki community to assess the add-on regarding, among others, adherence to MediaWiki coding practice, robustness, performance, and delineation between default and optional functionality. Such involvement will be requested via wikitech-l.

Memento Functionality for External Links

The Blue Arrow transition 4 illustrates the following:
  1. A Memento client can use the URI of a resource and the date it was accessed, as provided in a Wikipedia citation, to navigate to a version of that resource as it existed around the access date.
  2. In order to start the navigation, the user has to manually change the travel date to the access date.
Regarding (1): This Memento functionality is beneficial in the following cases:
  • When the original link no longer works and an access date is provided.
  • When the original link still works but the content of the resource has significantly changed since the provided access date.
  • When there is a need to access the most recently archived version of a resource. This can occur when the original link no longer works, even if an access date is provided. It can also occur when no access date is provided, whether or not the original link still works. The functionality to access the most recently archived version of a resource is provided in the Memento protocol through a datetime negotiation request that does not express a date preference.
Regarding (2): Manually resetting the travel date could be avoided by expressing the access date of the linked page - 7 April 2009 - as provided in the reference, in a machine-actionable manner.
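
As a rough sketch of how a client might turn a citation's access date into a datetime negotiation request, the fragment below parses the two date styles that appear in this document ("7 April 2009" and "2007-02-12") and omits the Accept-Datetime header when no access date is available, which corresponds to a request without a date preference. The function names and the list of accepted formats are assumptions; the "%B" month names assume an English locale.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Date styles seen in citations in this document (an illustrative, partial list).
ACCESS_DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y")


def parse_access_date(text):
    """Return the access date as a UTC datetime, or None if it cannot be parsed."""
    for fmt in ACCESS_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None


def timegate_request_headers(access_date=None):
    """Headers for a TimeGate request; empty when no date preference exists."""
    parsed = parse_access_date(access_date) if access_date else None
    if parsed is None:
        return {}
    return {"Accept-Datetime": format_datetime(parsed, usegmt=True)}


print(timegate_request_headers("7 April 2009"))
# {'Accept-Datetime': 'Tue, 07 Apr 2009 00:00:00 GMT'}
print(timegate_request_headers())
# {}
```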

Machine-Actionable Citation Data in Wikipedia

The addition of machine-actionable citation data expressed in a standardized way, combined with archival web infrastructure such as web archives and the Memento protocol, can help alleviate the link rot problem that is well-known in Wikipedia and beyond. Pro-active archiving of cited/linked resources, as explored for Wikipedia, is an important component of a link rot solution. A consistent way to cite web resources, as intended by Wikipedia's citation style, is a second component. Making citation data machine-actionable by expressing it in a standardized way in HTML pages, and enabling Memento clients to act upon that data, is a third. It would provide seamless time travel capabilities by leveraging the expressed citation information.

Given ongoing efforts aimed at addressing link rot, and the existence of a web citation template that already explicitly distinguishes between the core elements of a citation, the Wikipedia community has a head start when it comes to working towards end-user functionality aimed at alleviating link/reference rot problems. The introduction of a machine-actionable manner to express citation information in HTML, combined with the power of the Memento protocol and Memento client capabilities, can help to move closer towards that goal.

As an example, consider the following reference expressed according to Wikipedia's web citation template:
   <ref name="Liar Society">{{cite web
| url = http://liarsociety.tripod.com/blog/index.blog?from=20041130
| archiveurl = http://web.archive.org/web/20080206210600/http://liarsociety.tripod.com/blog/index.blog?from=20041130
| archivedate = 6 February 2008
| title = Coil: Scatology, Horse Rotorvator, Love's Secret Domain
| publisher = ''Liar Society'' (2004-10-30)
| accessdate= 2007-02-12
}}</ref>
   
This reference is currently automatically rendered into HTML as follows:
<span class="reference-text"><span class="citation web">
<a rel="nofollow" class="external text" 
   href="http://web.archive.org/web/20080206210600/http://liarsociety.tripod.com/blog/index.blog?from=20041130">
"Coil: Scatology, Horse Rotorvator, Love's Secret Domain"</a>. <i>Liar Society</i> (2004-10-30). Archived from 
<a rel="nofollow" class="external text" 
   href="http://liarsociety.tripod.com/blog/index.blog?from=20041130">the original</a> 
on 6 February 2008<span class="reference-accessdate">. 
Retrieved 2007-02-12</span>.</span></span>
   
But, using the data- extensibility mechanism for attributes provided in HTML5, the reference could also be rendered as follows:
<span class="reference-text"><span class="citation web">
<a rel="nofollow" class="external text" 
   href="http://liarsociety.tripod.com/blog/index.blog?from=20041130"
   data-versionurl="http://web.archive.org/web/20080206210600/http://liarsociety.tripod.com/blog/index.blog?from=20041130"
   data-versiondate="2007-02-12">
"Coil: Scatology, Horse Rotorvator, Love's Secret Domain"</a>. <i>Liar Society</i> (2004-10-30). Archived from 
<a rel="nofollow" class="external text" 
   href="http://liarsociety.tripod.com/blog/index.blog?from=20041130">the original</a> 
on 6 February 2008<span class="reference-accessdate">. 
Retrieved 2007-02-12</span>.</span></span>
   
The Memento extension for Chrome does not have built-in support for acting upon machine-actionable citation data, mostly because no agreed-upon approach for expressing it exists. The Memento team is keen to include such support and is most interested in discussing approaches with the Wikipedia/MediaWiki community.
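
To illustrate how a client could act upon such data, the following sketch extracts links that carry data-versionurl or data-versiondate attributes from an HTML fragment. It assumes the attribute names used in the example rendering in this document, which are a suggestion rather than an agreed-upon convention; the class and function names are likewise hypothetical. Only the Python standard library is used.

```python
from html.parser import HTMLParser


class CitationLinkParser(HTMLParser):
    """Collect <a> elements that carry machine-actionable citation data."""

    def __init__(self):
        super().__init__()
        self.citations = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Keep only links annotated with at least one of the data- attributes.
        if "data-versionurl" in attr or "data-versiondate" in attr:
            self.citations.append({
                "href": attr.get("href"),
                "versionurl": attr.get("data-versionurl"),
                "versiondate": attr.get("data-versiondate"),
            })


def extract_citations(html):
    """Return a list of dicts, one per annotated link found in `html`."""
    parser = CitationLinkParser()
    parser.feed(html)
    return parser.citations
```

For each extracted link, a client could offer the data-versionurl directly, or use the href together with the data-versiondate as the input to datetime negotiation.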

Expressing citation data in a machine-actionable manner can be achieved in a Wikipedia/MediaWiki-specific manner. But, since the Wikipedia/MediaWiki community is not the only one that struggles with reference rot problems, looking at this problem from a broader perspective can be beneficial. The document Thoughts on Referencing, Linking, Reference Rot explores the problem domain, suggests possible approaches to express citation data in a machine-actionable manner in HTML, and describes how applications could leverage it. In addition to the cited URI and the access date, the document considers the URI of an archival version of a cited resource to be an element that needs to be involved in a machine-actionable citation.

Thanks for input and inspiration: Peter Brantley, Harihar Shankar, Lyudmila Balakireva