I recently came across a wonderful high-profile paper with an intriguing hypothesis and a pretty solid set of analyses. As interesting papers often do... this one got our juices flowing, so a colleague and I decided to follow up on it and test some ideas of our own. Very exciting...
With the recent push towards open science, more journals are requiring that the data used in published projects be made available online as supplementary material. This journal was no exception... we were golden! However, it turned out that critical information for understanding key variables in their data set was completely absent from the online supplement, making it impossible for us to replicate what they had done. Ten emails down the line (first with the senior corresponding author, then with the grad student who ran the analyses, and now back with the corresponding author), we still haven't gotten a straight answer about what these variables mean. The sheer joy of an idea worth pursuing (and my initial admiration for the colleagues who took the first step) has gradually been replaced by frustration. To be honest, this is completely unacceptable and easily avoidable...
Which brings me back to the topic of today's post... the following is my personal view on data sharing and it is by no means a comprehensive review of this topic. Please feel free to comment and engage in discussion.
Why do we share data? As in the case that prompted this post, new and exciting questions can sometimes be answered by building on earlier analyses ('standing on the shoulders of giants'). By sharing our data we enable this kind of scientific progress and we increase the visibility and utility of our work. The thousands of papers coming out of synthesis centers such as NESCent, NCEAS, NIMBioS, iPlant, SESYNC, or BEACon (to name just a few) in the last 15 years have more than demonstrated the value of repurposing data for new analyses. Data sharing is also important because scientific progress relies on our ability to check each other's work and confirm controversial results. Yes, this can sometimes be painful because, after all, we are only human, and nobody enjoys being called out on a mistake! (For prominent recent examples go here). However, since cross-validation is a vital aspect of our profession, I often remind myself and the students I interact with that the best way to stop worrying about being called out on a mistake is to do our best with the analytical tools and the data we have. Honest efforts to answer a question will always be appreciated, even if new techniques or different analyses indicate at a later stage that the answer we initially got is not necessarily the right one. Sloppy mistakes, on the other hand, are a completely different thing... the point is: our colleagues are not out to get us (at least most of them are not!)... we are all in pursuit of the same thing: the truth (or as close as we can get to such an ideal).
How can we share data? Top-tier journals have made this extremely easy by allowing online supplements to accompany our papers. Options are also available for journals that do not support such supplements (e.g., see Dryad).
What is responsible data sharing? Although data sharing is quickly becoming the norm, quality control of these supplements is still lagging behind, and responsibility for the content of these files currently appears to lie solely with the corresponding author. Thus, when you publish your next paper please remember: PUBLISHED DATA ARE AS IMPORTANT AS THE PAPER THEY BELONG TO. Don't just post a spreadsheet and consider yourself done. Please make sure that:
(1) You include all variables used in your paper.
(2) You include information on the units of each variable.
(3) You explain any variable transformations that may be required to reproduce your results.
(4) The contents of your file are up-to-date and accurate.
Other guidelines and a nice FAQ for data sharing can be found on Dryad's website.
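For those who like to automate this sort of thing, here is a minimal sketch of how points (1) to (3) of the checklist above could be checked before uploading a supplement. It assumes you keep a small "data dictionary" alongside your data file; the field names (`description`, `units`, `transformation`), the function names, and the example columns are all hypothetical choices for illustration, not part of any journal's or repository's requirements. Point (4) still needs a human to confirm that the file matches the final analysis.

```python
import csv
import io

# Hypothetical fields every documented variable should carry.
REQUIRED_FIELDS = ("description", "units", "transformation")


def read_columns(csv_text):
    """Return the header row of a CSV file supplied as text."""
    return next(csv.reader(io.StringIO(csv_text)))


def check_data_dictionary(columns, dictionary):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # Every column in the data file must be documented, with no blank fields.
    for name in columns:
        entry = dictionary.get(name)
        if entry is None:
            problems.append(f"{name}: missing from data dictionary")
            continue
        for field in REQUIRED_FIELDS:
            if not str(entry.get(field, "")).strip():
                problems.append(f"{name}: no {field} given")
    # Every documented variable must actually be present in the data file.
    for name in dictionary:
        if name not in columns:
            problems.append(f"{name}: documented but absent from data file")
    return problems


# Made-up example data, for illustration only.
columns = read_columns("site,body_mass,temp\nA,12.1,3.4\n")
dictionary = {
    "site": {"description": "Sampling site code",
             "units": "none", "transformation": "none"},
    "body_mass": {"description": "Adult body mass",
                  "units": "g", "transformation": "log10"},
}
problems = check_data_dictionary(columns, dictionary)
for problem in problems:
    print(problem)
```

In this made-up example the `temp` column has no entry in the dictionary, so the script flags it. Writing "none" explicitly for units or transformations (rather than leaving the field blank) is a deliberate choice here: it tells the reader that the omission is intentional.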