Why do we need data citation? Take two

Background
Data are an essential component of scholarly communication and are the evidence upon which hypothesis-driven research and general scientific inquiry are conducted. Historically, empirical data relating to intellectual inquiry have been included in traditional publications as evidence supporting analytical assertions and informed opinion. Advances in technology have increased our ability to generate data in volumes not previously considered, and we have passed the point at which traditional publication can represent large volumes of supporting or supplementary data. Consequently, researchers and recorders of scholarly communication face a challenge: how to manage, reference and preserve datasets as valid research objects in scholarly communication.

This widening fissure between the evidence and its textual representations risks rendering scholarly communication incomplete; the evidence supporting scholarly publication is in danger of being lost.

Supporting and maintaining a complete scholarly record is not trivial and requires investments in capability and capacity of curation, preservation and access, a role historically filled by libraries, publishers and other information providers. The need for preservation is made more urgent by virtue of the very technology that has facilitated this data deluge. Changing formats, hardware and software have created what has been regularly referred to as the beginnings of a digital black hole. The evidence supporting scholarly publication of only a short period before cannot be easily accessed and re-represented.

Two requisites already employed in traditional publishing are needed to change this situation for data: an infrastructure to preserve and persist data, and an agreed framework that can reference data, namely data citation.
Data Centre
Need to insert a little piece about the similarities and differences between libraries and data centres
Data Citation

Data citation is a complex concept that variously alludes to…
– Unique identification
– Persistent identification
– Location services
– Access services
– Immutability
– Attribution
– Credit
– Authority

There are generally multiple solutions to each of the above, and these are presently driven by business need rather than by an integrated requirement. For example, unique identification addresses data management issues in large collections, while persistent identification concerns the long term management of identification, which is particularly important in scholarly communication. The two are related but neither generally fulfils the complete requirements of the other. Uniqueness is reduced by multiple short term registration authorities, and persistence is not maintained by projects or collaborations that are themselves short term. There is a clear need for both unique and persistent identification, but any single solution will mix concerns and is likely to result in insufficient agreement and acceptance.
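The difference between the two concerns can be illustrated with a minimal Python sketch. It is illustrative only: the PersistentResolver class, the identifier "example:dataset-1" and the example.org URLs are hypothetical and do not describe any existing registration service. A random UUID gives uniqueness without any registry, whereas persistence depends on someone maintaining the mapping from a fixed identifier to the current location of the data.

```python
import uuid


class PersistentResolver:
    """Hypothetical registry mapping a stable identifier to a current location."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, location):
        # An identifier may only be registered once; it never changes afterwards.
        if identifier in self._locations:
            raise ValueError(f"{identifier} is already registered")
        self._locations[identifier] = location

    def relocate(self, identifier, new_location):
        # The data have moved; only the location is updated, not the identifier.
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_location

    def resolve(self, identifier):
        return self._locations[identifier]


# Uniqueness: random UUIDs are enough to tell two datasets apart locally,
# but nothing guarantees that anyone can still resolve them in ten years.
local_id_a = str(uuid.uuid4())
local_id_b = str(uuid.uuid4())

# Persistence: the identifier stays fixed while the location changes.
resolver = PersistentResolver()
resolver.register("example:dataset-1", "https://old.example.org/data/1")
resolver.relocate("example:dataset-1", "https://new.example.org/data/1")
current = resolver.resolve("example:dataset-1")
```

The sketch keeps the two concerns in separate mechanisms, which mirrors the point that a single solution tends to mix them.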

In collaborations like Sage Bionetworks there needs to be a clear incentive to join the collective and share one's datasets. A data citation framework has the capacity to create such an incentive. Researchers are generally happy to share their data provided they receive due credit as the creators of the datasets. In much the same way that traditional citation supports sharing in scholarly publication, data citation can support the sharing of data.

Systems level research requires large amounts of data
As an example, we consider systems level, data driven research: a valuable and testable science concerned with complex network modelling and validation that is based on, and consumes, datasets. These datasets need not be generated de novo as part of a single project; in fact the volumes required generally demand the re-use of existing data together with primary data generation within the project. Systems approaches contrast with reductionist techniques, in which the many relationships and interdependencies in a network are compiled into a theoretical constant in order to focus on a particular component, relationship or pathway. Systems analyses attempt to include many parameters in order to understand network behaviour rather than single components. To date much medical research has tended to employ focussed reductive analyses as there were few alternatives. Technology has changed this: it is now possible to measure many parameters of the same system at the same time, and thus to generate a more accurate and informative model of biological processes, so-called systems biology. The emergence of ‘omics terminologies and datasets in nearly all disciplines of biology is testament to this and provides an opportunity to construct models that more accurately represent biological systems.

In a biomedical setting systems analyses are generally applied to disease models as a method of identifying key nodes or potential targets for intervention strategies or treatment regimes that advance modern treatment. In order to model complex networks, like those of human disease, large amounts of data are required. Generally such volumes are too large, diverse and complex for any one organisation to generate. To tackle this, Sage Bionetworks is implementing a global consortium to generate massive coherent datasets as fuel for systems level analysis of a number of human diseases. Sage Bionetworks is attempting integration of [genetic, transcriptomic, proteomic….etc in the study of x human diseases, what are they????] The intention is to identify intervention targets (potential therapies) that would otherwise be lost using reductionist techniques.

Large collections of these types of data from multiple contributors require robust data management and attribution mechanisms, requirements fulfilled by unique and persistent data identification and appropriate metadata provision. Furthermore, once analysis outputs are offered for independent validation or peer review, appropriate credit for data must be assigned. Stable, scalable and tractable data management and data citation mechanisms are essential to record provenance and create an incentive for individuals and institutes to contribute to this project.

SageCite
SageCite was conceived to determine the data citation requirements of Sage Bionetworks, and in doing so to support the citation of network models and datasets as logical citable units of intellectual value for other projects. It is a 12-month project funded by the JISC in the UK.

DataCite [uri] is interested in promoting and supporting data as first class citizens in scholarly communication. DataCite believes that data are an essential component of scholarly communication and that citation requirements are poorly served in this area.

Need something on the other partners

Together we are exposing and confronting the data citation requirements of Sage Bionetworks and offering pilot demonstrators for possible solutions.

Benefits of data citation for Sage Bionetworks
1. Citing data/datasets creates incentive. It permits attribution to a creator, i.e. it acts as a registration service for data in much the same way that journal articles register intellectual arguments for researchers. Data are recognised as a valuable professional asset to the creator (researcher) and as an output of public funding to research funders.
2. Citing data/datasets promotes preservation of data assets. Data are research assets and should be preserved as good practice. Preservation supports validation and reproducibility in line with scientific best practice.
3. Citing data/datasets encourages openness and transparency.
a. Supports data sharing with agreed citation standards
b. Supports data re-use by removing attribution barriers
c. Enables validation and increases statistical power

Implications of Data Citation
Data citation confers the property of an intellectual artefact to the object of citation, which in turn confers a research value onto that object. Data citation prompts the following questions.

Who owns or is responsible for the dataset?
What provision is there for preservation of the dataset?
At what point does re-used data become a new intellectual object worthy of citation and how are such derivations and abstractions handled in the citation?

Formal citation requires the object of the citation to be static, e.g. a published work is essentially a static representation of intellectual output and remains immutable. Subsequent addenda are permitted, but only as additional declarations (mistakes, withdrawals, etc.).
Formal data citation implies that the data object is immutable and that changes to it are only permitted as supplements or addenda in a metadata record.

Targets of Data Citation
Exactly what constitutes a data citation target is not clear. Datasets can exist as valuable research objects yet not be ‘published’ in the traditional sense. They can be represented as numeric tables, images, code scripts and so on; the list is almost endless. What is clear is that each discipline has a firm idea as to what constitutes a dataset and thus a potential citation target. For the purposes of this exercise datasets are considered organised collections of data that are either generated or consumed during the course of the Sage Bionetworks data analysis pipeline. For example, reference sequence and gene information from NCBI can be considered a data input and citable, and Affymetrix gene chip datasets can also be considered citable. Integration, abstraction or aggregation of these data to produce a verified data mapping can also be considered a new dataset, but does it warrant further citation? Perhaps not; in fact perhaps the methods/workflow scripts that generate the derived dataset from the two input datasets can be provided a citation so that it may be re-created. In any case Sage Bionetworks should be able to provide a citation for the datasets it consumes.

Any data that are produced entirely within the Sage Bionetworks project should be provided a citation to acknowledge credit. The point at which such a citation is provided should be determined by the principle of citation: to declare and make available the object that is being cited. In this case, citation should occur when the datasets are most shareable (generally meaning open) and the metadata are most structured and informative. It must be noted that once cited the datasets should not change and should persist. Any changes require new citation parameters.
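One way to detect that a cited dataset has changed is to record a checksum at the moment of citation and compare it later. The following Python sketch assumes a SHA-256 digest is an acceptable fingerprint; the function names and the small CSV sample are hypothetical and are not part of any Sage Bionetworks or DataCite service.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset's bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_new_citation(cited_fingerprint: str, current_data: bytes) -> bool:
    """True if the data no longer match the fingerprint recorded at citation time."""
    return fingerprint(current_data) != cited_fingerprint


# Hypothetical dataset content, used only to exercise the functions.
original = b"gene,expression\nABC1,0.42\n"
cited = fingerprint(original)  # stored alongside the citation metadata

unchanged = needs_new_citation(cited, original)
changed = needs_new_citation(cited, b"gene,expression\nABC1,0.43\n")
```

Here `unchanged` is False and `changed` is True: a single altered value is enough to require new citation parameters under the rule stated above.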

There will of course be circumstances where significant effort is put into mapping, aggregating, supplementing and integrating datasets that were not created within Sage Bionetworks. It is not, and likely never will be, clear at what point such datasets become independent and citable objects. What is clear is that there needs to be provision for derivations and conglomerations of already citable objects to be recognised as part of the whole. SageCite hopes to be able to lay foundations for a framework for this.

In summary Sage Bionetworks should consider citation targets to fulfil one of the following conditions.
1. Consumed data, i.e. whoever owns the data may wish to have it cited when it is used. If there is no citation presently available to them, Sage Bionetworks should provide one. SageCite aims to implement DataCite services in Taverna for this purpose.
2. Empirical data should be provided a citation at the point where it is most shareable, i.e. most open format, most structured metadata, bearing in mind that citation implies preservation and data management. Again, this service can be provided as an implementation of DataCite in Taverna.
3. Where datasets are derived simply from a protocol or code sequence, that protocol and/or workflow should be provided a citation rather than the resulting dataset.
4. Derivations, abstractions and transformations should be provided a citation if the following conditions are met:
a. Data are new and represent significant intellectual input
b. Citation cannot simply be applied to an algorithm that re-creates the data from its inputs
c. Citation can reference input objects as essential (citations within citations)
5. Any aggregation should contain the relevant citations, e.g. compare with condition 1 above: if a dataset already has a citation (or is provided one) then this should be represented in any dataset generated using it.
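The conditions above can be read as a decision procedure. The Python sketch below is one possible interpretation, not a SageCite implementation: the Dataset fields, the origin labels and the wording of the returned decisions are all hypothetical, and in practice several of the inputs (for example "significant intellectual input") require human judgement.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    name: str
    origin: str  # hypothetical labels: "consumed", "empirical" or "derived"
    existing_citation: Optional[str] = None
    recreatable_from_workflow: bool = False
    significant_intellectual_input: bool = False
    input_citations: List[str] = field(default_factory=list)


def citation_decision(ds: Dataset) -> str:
    """Map a dataset onto one of the five conditions listed above."""
    if ds.origin == "consumed":
        # Condition 1: cite consumed data, providing a citation if none exists.
        if ds.existing_citation:
            return "use the existing citation"
        return "provide a citation for the dataset"
    if ds.origin == "empirical":
        # Condition 2: cite when most open and best described.
        return "cite the dataset at its most shareable point"
    if ds.origin == "derived":
        if ds.recreatable_from_workflow:
            # Condition 3: the workflow is the citable object.
            return "cite the workflow rather than the dataset"
        if ds.significant_intellectual_input and ds.input_citations:
            # Condition 4: new, not re-creatable, and able to reference inputs.
            return "cite the dataset and reference its input citations"
        # Condition 5: aggregations still carry the citations of their inputs.
        return "carry the input citations; no separate citation"
    raise ValueError(f"unknown origin: {ds.origin!r}")


ncbi = Dataset(name="NCBI reference sequence", origin="consumed")
mapping = Dataset(
    name="verified data mapping",
    origin="derived",
    recreatable_from_workflow=True,
    input_citations=["citation for input A", "citation for input B"],
)

ncbi_decision = citation_decision(ncbi)
mapping_decision = citation_decision(mapping)
```

For the two examples, the consumed NCBI data are given a citation (condition 1) and the mapping is covered by a citation for its workflow (condition 3).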

Consequences of Data Citation

Sage Bionetworks data pipeline as data citation test bed
To begin we consider the data analysis pipeline for Sage Bionetworks. It is essential to define and identify citation targets and separate these from data management concerns.

> peters workflow and using the above rules identify citation targets

Workflow 1: Aggregate reference sequences with Affymetrix gene chip mapping. Repeat for each cross-discipline dataset
2 datasets consumed
Effort used to rationalise and clean data
1. Aggregation scripts and output dataset
2. External datasets provided citations

Workflow 2: Take mapping and integrate with other derived datasets

What next
This is an allegory I am working on that describes the barriers to joined-up scholarly communication by representing them as a descent into data hell. All is not lost, as described by the 14th-century poet Dante Alighieri in ‘The Divine Comedy’. Briefly, Dante is guided through the circles of sin as he descends into Hell. The point of this part of the story is that by understanding the sin one is able to work to correct it. If one ignores these sins then Hell beckons and that is the final journey from which there is no return.

Dante’s account of his guided journey through the circles of Hell, part of his Divine Comedy, is a useful template for us. Dante asserts that life ends with entry into Heaven, Hell or Purgatory. To get into Heaven a traveller had first to recognise their sins and then either accept them and work towards atonement or reject them and continue on the path to Hell. Each type of sin was grouped into layers of increasing severity, providing the classification that underlies the 9 circles of Hell, with a final journey from circle 9 to Hell (9+1=10). Along the way notable figures from history are recognised and used to illustrate the fate that follows from committing a particular sin.

Following this template, in order to get into ‘data Heaven’, where scholarly communication is complete and interconnected, we need to understand the barriers, or data sins, that break the cycle of scholarly communication. Once we identify and confront our data sins we end up in one of three places: ‘data Heaven’, where we are pure and enjoy a complete and tractable cycle of scholarly communication; ‘data Purgatory’, where we accept our sins and work towards atoning for them; or ‘data Hell’, where despite the opportunity to absolve ourselves we ignore all sins and promote the widening fissure between scholarly publication and the data that underpin it.

(My opinion is that the result will be an unsupported and unsupportable research knowledge base that will become ever more irrelevant as assertions and opinions from tooled-up and savvy self-promoters win out over scientific method and scholarly discourse, i.e. the re-emergence of pseudoscience.)

A more sober reckoning is given by Kent Anderson in ‘Permanence and Accountability: Why Publishers Need to Modernize Their Approaches’.

What emerges will hopefully be a 10 POINT PLAN for a stable citation framework and an account of why we need it, i.e. recognise and seek to avoid the 9+1 circles of Hell.

The nine circles of data Hell
Dante’s Circle | Data context | How data citation will improve matters | Examples of data sins
First circle (limbo) | Data are unavailable; no one knows they exist. | Data citation is a declaration that a dataset exists, and maintaining the metadata that support the citation is key to this. |
Second circle (lust) | Data impact disequilibrium, where the most attractive data are promoted over the most valid data. | Agreed and authoritative styles and standards in data citation encourage an even field of data discoverability, re-use and interpretation. |
Third circle (gluttony) | Data consumption is voracious, with no regard for consequence or attribution. | It is considered bad professional practice to use data without acknowledgement, but how can one acknowledge it? Data citation can provide an easy, low-overhead way to cite someone else’s data in an individual’s work. |
Fourth circle (greed) | Data are stockpiled. | If data provenance, credit and attribution can be embedded in the community, data stockpiles are less likely as creators get due credit. Such credit can be financial or professional; in both cases transparent and fair access can be backed by IPR as required. | Pharma, but only because there is no simple and well understood IPR and rights framework.
Fifth circle (anger) | Data are closed. The first thing someone does when they are angry is to deny access to or use of their assets. | Joining traditional publication to data publication by citation creates a driver to make data available at the most and findable at the least, e.g. as a condition of publication (this is policy for almost all scholarly communication and is good research practice). Exclusions must exist for sensitive or confidential data. | NHS IC or holders of patient data are particularly susceptible to this; confronting the real rather than perceived risks is much more difficult than simply shutting the door.
Sixth circle (heresy) | Data are misused. | Data citation enables better data management (through extensible metadata capabilities) and promotes appropriate reuse and transparency. Data citation extends from traditional data tables to models that consume these data and forecast (e.g. the graphs section in the Economist example). | Economist example, where the graph in the online article is different from the one that appears in the print article. Data behind the graph are difficult to find and even more difficult to validate.
Seventh circle (violence) | Data are corrupted (possibly not corrupted!). | Data citation metadata can extend to checklists and provenance models to ensure data validation. |
Eighth circle (fraud) | Data are falsified as a matter of pride. | Data citation drives reproducibility by promoting provenance and immutability. | E.g. thalidomide negative/causative data falsification.
Ninth circle (treachery) | Data are falsified as a malicious act. | Transparency is made easier and more efficient, minimising scientific fraud. | Smoking does not cause lung cancer; asbestos does not cause mesothelioma.
HELL | No hope (Abandon all hope ye who enter). | There is always hope. Possible role for data repository/archive of last saloon (e.g. exit strategy for a data repository). Data centre of the oblivion. |

Conclusions and going forward
If we want a joined-up cycle of scholarly communication then we need a data citation framework.

Data citation comes with consequences, most notably the preservation and persistence issues.

Some of these consequences are positive, e.g. registration, attribution and impact potential.

Some of these consequences are negative, e.g. preservation, archiving and persistence.

The benefits outweigh the costs, as data citation creates an important incentive that has been lacking for the greater goal in the data landscape: data sharing.

Compartmentalising data sins means we can disseminate and understand the various and complex requirements of data citation and illustrate its implications and its positive and negative consequences.