Data sharing is a necessary step towards improving the reproducibility of science, and towards making sure that the horrifying statistic that 85% of research resources are wasted only goes down, and quickly.
In the run-up to the FORCE2015 conference in January, it occurred to me that I ought to endorse the FORCE11 data citation principles. These 8 principles propose a backbone for honest, open, worthwhile data sharing that, it is hoped, all funding bodies, institutions and repositories (including publishers) will get behind. There’s a lot of scope for honing them, and various working groups are trying to do that as we speak; you can get involved too.
Briefly (and with my comments), the proposed data principles are:
- Importance – data should be a valid research output. This does call into question the likes of figshare, which will host absolutely anything and provide a DOI for it. In response they might say that it’s impossible to judge accurately what will be of use in future.
- Credit and attribution – data citations should give accurate credit to those involved.
- Evidence – this is extremely important in my view: wherever data is used as evidence, it ought to be made available. In some cases there are privacy concerns over this but groups like DataFAIRport and DataShield are working towards removing those problems.
- Unique identification – put simply, when someone cites data, you ought to be in no doubt about which data they mean. Repositories all use different ‘accession’ numbering styles, but so long as the identifiers remain distinct there is little problem. Personally I like the idea of providing a DOI for data (as we do at GigaDB).
- Access – the data citation should actually facilitate use of the data. This is technically a bit more challenging but definitely something to aim for. Ideally, each citation would resolve to a web-accessible location AND that location would be interrogable directly, in a standard, machine-and-human-friendly manner. DOIs do resolve to web locations and can carry a bunch of metadata, but the location then has to respond to some commonly agreed protocol, and that’s where the community is a little behind.
- Persistence – hopefully this is a no-brainer. Broken web links are annoying at the best of times; surely we can prevent them for research data! Again, DOIs promote this concept in their own specification. What’s not clear is what would happen if a DOI-distributing company went bust. When does a repository become ‘too big to fail’?
- Specificity and verifiability – citations should be highly specific about exactly which data were used. Although this may seem obvious, it’s not always so easy. If someone uses half of dataset A and half of dataset B to produce a new analysis, the citation would ideally specify which half of each dataset was used. And then would you produce a new citation reference for the combined sub-sets? Also, what happens if someone amends a dataset (adding more data or correcting some corrupted data)? How do you handle versioning?
- Interoperability and flexibility – this is important so as not to produce just ‘yet another standard’. Each community will have its own requirements but it will also be necessary to combine studies across communities. This should be incorporated into ‘the standards’ but it should also be expected that conversion tools will be needed because no one can ever agree on a universal standard.
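To make the identification, access and specificity points a bit more concrete, here is a minimal Python sketch of what a machine-friendly citation record might hold. Everything in it is my own illustration rather than anything the principles prescribe: the `DataCitation` class, its field names and the `10.1234/...` DOIs are made up. The only real-world pieces are the `https://doi.org/` resolver and the idea of asking it for metadata via an `Accept` header; whether a given DOI’s registration agency answers that request is an assumption, so the sketch builds the request without sending it.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.request import Request


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical, minimal record of one cited dataset."""
    doi: str                      # unique identifier for the dataset
    version: str                  # which release of the dataset was used
    subset: Optional[str] = None  # free-text note on the part actually used

    def resolver_url(self) -> str:
        # DOIs are resolved through the doi.org proxy.
        return f"https://doi.org/{self.doi}"

    def metadata_request(self) -> Request:
        # Build (but do not send) a request for machine-readable metadata.
        # Assumes the DOI's registration agency supports content negotiation.
        return Request(
            self.resolver_url(),
            headers={"Accept": "application/vnd.citationstyles.csl+json"},
        )

    def label(self) -> str:
        # Human-readable form that keeps version and subset visible.
        text = f"doi:{self.doi} (version {self.version}"
        if self.subset:
            text += f"; subset: {self.subset}"
        return text + ")"


# A new analysis built from half of dataset A and half of dataset B.
# Both DOIs are placeholders, not real datasets.
half_a = DataCitation("10.1234/dataset-a", "1.0", "samples 1-50")
half_b = DataCitation("10.1234/dataset-b", "2.1", "samples 51-100")
sources = [half_a, half_b]

# A corrected release of dataset A is a different citation, even
# though the DOI and the subset description are unchanged.
half_a_corrected = DataCitation("10.1234/dataset-a", "1.1", "samples 1-50")

for citation in sources:
    print(citation.label())
```

Keeping the version and the subset as part of the record means that two citations of ‘the same’ dataset only compare equal when they really do point at the same data. It doesn’t answer the question of whether the combined sub-sets deserve a citation of their own.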
Individuals and organisations can endorse the data citation principles via a simple web form. I’m pleased to say that Scott Edmunds signed BGI and GigaScience up a long time ago, so I decided to put my endorsement down as an individual.
I initially put my name as “Rob L Davidson” with no institution, but as I looked at my common-as-muck name in the list it dawned on me that this wasn’t much in keeping with the concepts behind principles 2 and 4, to say the least. So, after a little thought, I endorsed again, this time using my ORCID iD. (I should contact them to remove one of the two entries so that I’m not over-represented in the list.) ORCID is basically unique identification for authors and researchers. It disambiguates researchers with common names like mine, those who change their name for some reason, and those whose names have been misspelled. Perhaps I’m just being a smart arse, but I think it’s quite appropriate.
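Part of what makes an ORCID iD more robust than a typed name is that its last character is a check digit, so a mistyped iD can often be caught before it is attached to the wrong person. Below is a minimal Python sketch of that check, assuming the ISO 7064 MOD 11-2 scheme that ORCID documents. The function names are my own, and the iD in the example is the sample one ORCID uses for its fictitious researcher, not mine.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the final character for the first 15 digits of an iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, characters and check digit of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    if any(ch not in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # sample iD from ORCID's documentation
print(is_valid_orcid("0000-0002-1825-0079"))  # last two digits swapped
```

The first call should print True and the second False. A check digit only catches typos, of course; it says nothing about whether the iD belongs to the person claiming it.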