Submissions/How to handle data and run statistical analyses in Mediawiki

From Wikimania 2011 • Haifa, Israel

Presentation Media

No slides known
(please upload your slides and/or add it)



Review no.

121

Title of the submission
How to handle data and run statistical analyses in Mediawiki
Type of submission (workshop, tutorial, panel, presentation)
presentation
Author of the submission
Juha Villman
E-mail address or username (if username, please confirm email address in Special:Preferences)
JuhaV
Country of origin
Finland
Affiliation, if any (organization, company etc.)
National Institute for Health and Welfare
Personal homepage or blog
http://en.opasnet.org/w/User:Juha_Villman
Abstract (please use no less than 300 words to describe your proposal)

Opasnet is a website (running on MediaWiki) and workspace for a mass collaboration project that aims to improve societal decision-making. Opasnet is hosted by the Department of Environmental Health at THL (National Institute for Health and Welfare, Finland). The original motivation was to improve environmental health assessments and thus decisions related to environment and health. However, as the methods have developed and the project has grown, the scope has widened to policy-making in any field. One of the major topics in Opasnet is climate change, which is clearly a multidisciplinary field. Opasnet is based on open participation by anyone interested, free distribution of information, and strict application of the scientific method.

Opasnet is based on the idea that assessments should no longer be done in closed expert groups that produce static reports, which may or may not answer the questions a decision-maker actually has and which are only as credible as the expert group itself. Instead, two improvements are needed. First, an assessment should be built on an explicit information need that is defined by open deliberation between experts, decision-makers, and stakeholders. Second, everything in the assessment - including premises, data sources, modelling, and conclusions - must be open to scientific criticism.

MediaWiki is designed mainly for encyclopedia use, so it is good at displaying text and images. Assessments in Opasnet also require text and images, but they often need massive amounts of numerical data as well. Instead of saving all the numerical data in MediaWiki's own database, we use a separate database developed for this purpose, Opasnet Base. Opasnet Base is a MySQL database that serves as a storage and retrieval system for results of variables and data from studies. It is designed to be flexible enough to store information in almost any format: probability distributions or deterministic point estimates; spatially or temporally distributed data; or data with multiple dimensions. It can be used as a direct source of model input data, which makes it possible to use shared input information sources such as population data, climate scenarios, or dose-responses of pollutants. Opasnet Base is integrated into MediaWiki as an extension that has its own special page for browsing the data. It is also possible to download data from Opasnet Base as a CSV file.
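
To illustrate how a single store can hold point estimates, probability distributions, and multidimensional data at the same time, here is a minimal sketch of a "long format" layout. It is not the actual Opasnet Base schema: the table names (variable, cell, location, result), the helper functions, and the example values are all hypothetical, and Python's built-in sqlite3 stands in for MySQL so that the example runs without a server. Each cell of a variable is identified by its locations along named indices, and a cell holds one row per sample, so a deterministic value is simply a cell with one sample.

```python
import sqlite3


def build_demo_store():
    """Create an in-memory database with a hypothetical long-format layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE variable (id INTEGER PRIMARY KEY, name TEXT UNIQUE, unit TEXT);
        CREATE TABLE cell     (id INTEGER PRIMARY KEY,
                               variable_id INTEGER REFERENCES variable(id));
        -- One row per (cell, index): where the cell sits along each dimension.
        CREATE TABLE location (cell_id INTEGER REFERENCES cell(id), idx TEXT, loc TEXT);
        -- One row per sample: a point estimate has one row, a distribution many.
        CREATE TABLE result   (cell_id INTEGER REFERENCES cell(id),
                               sample INTEGER, value REAL);
        """
    )
    return conn


def store_cell(conn, variable, unit, locations, samples):
    """Store one cell of a variable and return the new cell id."""
    conn.execute(
        "INSERT OR IGNORE INTO variable (name, unit) VALUES (?, ?)", (variable, unit)
    )
    var_id = conn.execute(
        "SELECT id FROM variable WHERE name = ?", (variable,)
    ).fetchone()[0]
    cell_id = conn.execute(
        "INSERT INTO cell (variable_id) VALUES (?)", (var_id,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO location VALUES (?, ?, ?)",
        [(cell_id, idx, loc) for idx, loc in locations.items()],
    )
    conn.executemany(
        "INSERT INTO result VALUES (?, ?, ?)",
        [(cell_id, i, value) for i, value in enumerate(samples)],
    )
    return cell_id


def cell_mean(conn, cell_id):
    """Return the mean over all samples stored for a cell."""
    row = conn.execute(
        "SELECT AVG(value) FROM result WHERE cell_id = ?", (cell_id,)
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_demo_store()
    # Made-up numbers, for illustration only.
    point = store_cell(conn, "Example variable", "unit", {"Year": "2010"}, [8.0])
    dist = store_cell(conn, "Example variable", "unit", {"Year": "2020"}, [7.0, 9.0, 11.0])
    print(cell_mean(conn, point), cell_mean(conn, dist))
```

The design choice in this sketch is that adding a new dimension to a variable needs only new rows in location, not a new column, which is one way to keep a store flexible across very different datasets.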

There are currently two methods for uploading data into Opasnet Base. For smaller datasets (fewer than 100 rows) we have developed the Table2Base extension. It enables data upload directly from tables on wiki pages. The syntax of these tables is quite similar to that of standard wikitables. Instead of headers you define indices; cells are separated by a single | sign, and a line break starts a new row of data. Every time a table is updated, a new set of data is stored in Opasnet Base. Old data is also kept in the database, so the full history remains accessible, just like page history in Wikipedia.
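
The table format described above can be sketched with a small parser. This is an illustration only, not the Table2Base implementation: the function name parse_table is hypothetical, and two details are assumptions rather than documented behaviour, namely that the indices are given as a comma-separated list and that the last cell of each row is the numerical result.

```python
def parse_table(indices, body):
    """Parse pipe-separated rows into records keyed by the declared indices.

    indices: comma-separated index names, e.g. "City, Year" (assumed format).
    body:    one data row per line, cells separated by a single "|";
             the last cell is assumed to hold the numerical result.
    """
    names = [name.strip() for name in indices.split(",")]
    rows = []
    for line_no, line in enumerate(body.strip().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between rows
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != len(names) + 1:
            raise ValueError(
                f"row {line_no}: expected {len(names) + 1} cells, got {len(cells)}"
            )
        *locations, value = cells
        rows.append({"locations": dict(zip(names, locations)), "value": float(value)})
    return rows


if __name__ == "__main__":
    # Made-up values, for illustration only.
    table = """
    Kuopio|2010|8.5
    Helsinki|2010|9.1
    """
    for row in parse_table("City, Year", table):
        print(row)
```

Each parsed row carries its own index locations, so a later upload step could store every row independently and keep earlier uploads alongside it.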

The second method for data upload is to use statistical software that has ODBC capabilities. So far we have built the necessary functions for R and Analytica. With R or Analytica it is easy to download data from Opasnet Base, use it in your analyses, and then upload new data directly into the Base. Using external statistical software is recommended if you have a large dataset to upload. We especially recommend R, which is free software available for multiple platforms.

Standard MediaWiki has no real functionality for data analysis, but such analyses can be an important part of Opasnet. So far we have used external software (Analytica) to run Monte Carlo analyses and then uploaded the results into Opasnet Base. The problem is that integrating Analytica into MediaWiki is almost impossible, and Analytica is a proprietary program. We have solved this problem by starting to use R. R is a programming language and software environment for statistical computing and graphics that is widely used for statistical and data analysis. It is free software that compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. Out of the box, and with selected add-on packages, it can be used for Monte Carlo simulation, production of complicated graphics (over which the user has complete control) and a host of other applications.
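
For readers unfamiliar with Monte Carlo analysis, the idea is to draw the uncertain inputs of a model many times, evaluate the model for each draw, and summarise the resulting distribution of outputs. The sketch below shows this in Python using only the standard library; it is a generic illustration, not code from Opasnet, and the function monte_carlo, the example model, and all numbers in it are hypothetical.

```python
import random
import statistics


def monte_carlo(model, samplers, n=10_000, seed=1):
    """Propagate input uncertainty through a model by repeated sampling.

    model:    function taking the inputs as keyword arguments.
    samplers: dict mapping each input name to a function rng -> random draw.
    Returns the mean and an approximate 95 % interval of the model output.
    """
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    draws = []
    for _ in range(n):
        inputs = {name: sample(rng) for name, sample in samplers.items()}
        draws.append(model(**inputs))
    draws.sort()
    return {
        "mean": statistics.fmean(draws),
        "p2.5": draws[int(0.025 * (n - 1))],
        "p97.5": draws[int(0.975 * (n - 1))],
    }


if __name__ == "__main__":
    # Hypothetical model with made-up input distributions, for illustration only.
    def response(exposure, slope):
        return exposure * slope

    summary = monte_carlo(
        response,
        {
            "exposure": lambda rng: rng.lognormvariate(2.0, 0.3),
            "slope": lambda rng: rng.uniform(0.005, 0.015),
        },
    )
    print(summary)
```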

R integration is the newest addition to the Opasnet workspace. We have developed a MediaWiki extension that uses R installed on the server. This extension makes it possible to write R code directly on wiki pages between rcode tags. The tags generate a "Run code" button; clicking it opens a special page, starts an R process, and displays the results. If the R output contains images (diagrams etc.), they are also shown on the result page. It is also possible to create input fields for values used in the R code. This makes it easy to run the same model with different input values without having to change the R code itself.
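
The flow described above (find the code on the page, combine it with the user's input values, hand it to R) can be sketched as follows. This is not the extension's source code, and several details are assumptions: that the tags are written as <rcode>...</rcode>, that input values are numeric, and that they are passed to R by prepending assignments to the code. The function names extract_rcode and with_inputs are hypothetical. Actually running the result would additionally require R on the server, which this sketch leaves out.

```python
import re

# Assumed tag form: <rcode ...> ... </rcode>; the real extension may differ.
RCODE_RE = re.compile(r"<rcode\b[^>]*>(.*?)</rcode>", re.DOTALL | re.IGNORECASE)

# Conservative pattern for R variable names accepted from input fields.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_.]*")


def extract_rcode(wikitext):
    """Return the R code of every rcode block found in the page text."""
    return [code.strip() for code in RCODE_RE.findall(wikitext)]


def with_inputs(code, inputs):
    """Prepend R assignments for user-supplied numeric inputs to the code.

    Values are converted with float() so that arbitrary text from an input
    field can never end up inside the generated R script.
    """
    assignments = []
    for name, value in inputs.items():
        if not NAME_RE.fullmatch(name):
            raise ValueError(f"invalid input name: {name!r}")
        assignments.append(f"{name} <- {float(value)!r}")
    return "\n".join(assignments + [code])


if __name__ == "__main__":
    page = "Some text.\n<rcode>\nprint(mean(rnorm(n)))\n</rcode>\nMore text."
    block = extract_rcode(page)[0]
    print(with_inputs(block, {"n": 100}))
```

Keeping the inputs separate from the stored code is what lets the same model be rerun with different values while the R code on the page stays unchanged.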

The presentation will contain a brief introduction to Opasnet and the ideas behind it. The main focus will be on data-related issues: how to handle and use massive amounts of data in MediaWiki, with a demonstration of how easy it is to run R on a wiki.

Track (People and Community/Knowledge and Collaboration/Infrastructure)
Infrastructure
Will you attend Wikimania if your submission is not accepted?
Probably yes
Slides or further information (optional)


Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes. (~~~~).

  1. JanPaul123 21:54, 26 April 2011 (UTC)
  2. Vibhijain 07:11, 8 May 2011 (UTC)
  3. Add your username here.