Submissions/Building for Failure (Wikimedia XML Dumps on a Regular Schedule)

From Wikimania 2011 • Haifa, Israel


This is an open submission for Wikimania 2011.

Review no.

115

Title of the submission

Building for Failure: Wikimedia XML Dumps on a Regular Schedule

Type of submission (workshop, tutorial, panel, presentation)

Presentation

Author of the submission

Ariel Glenn

E-mail address or username (if username, please confirm email address in Special:Preferences)

ariel at wikimedia.org

Country of origin

US

Affiliation, if any (organization, company etc.)

WMF

Personal homepage or blog

Not so much, but you can look at el:wikt:Χρήστης:ArielGlenn

Abstract (please use no less than 300 words to describe your proposal)

Over the past few years we've learned a lot about how not to produce the downloadable copies of all Wikipedia content that researchers and other users rely upon. This is the story of those failures, and of our eventual success, as we learned how to build for failure.

When you have a processing task that takes a long time to complete, there are some simple principles you can apply to get the job done. These principles all revolve around planning for failure and managing your task so that you can work around it. This talk grew out of my personal experience, first as a dissatisfied user for several years and then finally as the producer of the XML dumps. For the English-language Wikipedia, these consist of a whopping 340GB of bzip2-compressed data; a run would take months to complete, and usually failed somewhere in the middle. The dumps of the smaller projects (that's all of the other ones) failed periodically as well, for various reasons.

I started out, as many folks do, believing that the way to get the job done was to eliminate the causes of failure, and that we would get it done on a regular schedule by optimizing for speed as much as possible. Neither of these things turned out to be true, and what I learned instead forms the core of this presentation.

Beyond the core topic, there will be some time to talk about obstacles that make the data hard for researchers and bot writers to use, and ways to work around these obstacles. I'll probably also sneak in a few thoughts about "Dumps 2.0", i.e. what next-generation dumps might look like. At the top of that list is how we can make "dailies" available (a dump of the changes for a given day, plus a set of tools to incorporate them into existing "full" dumps).

Track (People and Community/Knowledge and Collaboration/Infrastructure)

Collaboration/Infrastructure

Will you attend Wikimania if your submission is not accepted?

If I can get my travel worked out, I'll come regardless. I have a paperwork issue, that's all.

Slides or further information (optional)

Later... (I have slides but they need some tweaking up.)

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes (~~~~).

  1. Phoebe 00:20, 29 April 2011 (UTC)
  2. Erik Zachte 13:55, 29 April 2011 (UTC)
  3. Catrope
  4. Blahma 12:55, 1 May 2011 (UTC)
  5. Vibhijain 07:04, 8 May 2011 (UTC)
  6. Amir E. Aharoni
  7. Add your username here.