Submissions/Building for Failure (Wikimedia XML Dumps on a Regular Schedule)
This is an open submission for Wikimania 2011.
- Review no.
- Title of the submission
Building for Failure: Wikimedia XML Dumps on a Regular Schedule
- Type of submission (workshop, tutorial, panel, presentation)
- Author of the submission
- E-mail address or username (if username, please confirm email address in Special:Preferences)
ariel at wikimedia.org
- Country of origin
- Affiliation, if any (organization, company etc.)
- Personal homepage or blog
Not so much, but you can look at el:wikt:Χρήστης:ArielGlenn
- Abstract (please use no less than 300 words to describe your proposal)
Over the past few years we've learned a lot about how not to produce the downloadable copies of all Wikipedia content that researchers and other users rely on. This is the story of those failures, and of our eventual success, as we learned how to build for failure.
When you have a processing task that takes a long time to complete, there are some simple principles you can apply to get the job done. These principles all revolve around planning for failure and managing your task so that you can work around it. This talk grew out of my personal experience, first as a dissatisfied user for several years and then as the producer of the XML dumps. For the English language Wikipedia, these consist of a whopping 340GB of bzip2-compressed data; a run would take months to complete and usually failed somewhere in the middle. The dumps of the smaller projects (that's all of the other ones) failed periodically as well, for various reasons.
I started out, as many folks do, believing that the way to get the job done was to eliminate the causes of failure, and that we would get it done on a regular schedule by optimizing for speed as much as possible. Neither of these things turned out to be true, and what I learned instead forms the core of this presentation.
Beyond the core topic, there will be some time to talk about obstacles that make the data hard for researchers and bot writers to use, and ways to work around these obstacles. I'll probably also sneak in a few thoughts about "Dumps 2.0", i.e. what next-generation dumps might look like. At the top of that list is how we can make "dailies" available (a dump of the changes for a given day, plus a set of tools to incorporate them into existing "full" dumps).
- Track (People and Community/Knowledge and Collaboration/Infrastructure)
- Will you attend Wikimania if your submission is not accepted?
If I can get my travel worked out, I'll come regardless. I have a paperwork issue, that's all.
- Slides or further information (optional)
Later... (I have slides but they need some tweaking.)
If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).