Current state of the discussion (2014)

The following extract surveys the current literature and discussion on the topic of sheer curation:

Corubolo, F., Eggers, A-G., Waddington, S., Hasan, A., Kontopoulos, E., Hedges, M., Darányi, S. (2014). Initial Version of Environment Information Extraction Tools (Deliverable 4.1 of the PERICLES project), pp. 29-30.

http://goo.gl/9Uzv9W

Please refer to the further reading list for the works cited in square brackets [ ].


One approach to this challenge has been termed sheer curation (by Alistair Miles of the Science and Technology Facilities Council, UK), and describes a situation in which curation activities are integrated into the workflow of the researchers creating or capturing data. The word ‘sheer’ here describes the ‘lightweight and virtually transparent’ way in which these curation activities are integrated, with minimal disruption. Sheer curation is based on the principle that effective data management at the point of creation and initial use lays a firm foundation for subsequent data publication, sharing, reuse, curation and preservation activities. The sheer curation model has not been extensively discussed in the scientific literature. The term has sometimes been interpreted as motivating the performance of curatorial tasks by data creators and initial users of data by promoting the use of tools and good practice that add immediate value to the data. This is, in particular, the take of [Curry, E. et al], which discusses the role of such an approach in the distributed, community-based curation of business data. However, this interpretation does not really address the challenges outlined above, and a more common understanding of sheer curation depends on data capture being embedded within the data creators’ working practices in such a way that it is automatic and invisible to them. For example, the SCARP project, during which the term ‘sheer curation’ was coined, carried out a number of case studies in which digital curators engaged with researchers in a range of disciplines, with the aim of improving data curation through a close understanding of the researchers’ practice [Lyon, L., et al (2009) and Whyte, A. et al (2008)]. In [Hedges, M., & Blanke, T. (2013)], the concept of sheer curation is extended further to take account of process and provenance as well as the data itself.
The work examined a number of use cases in which scientists processed data through various stages using different tools in turn. Because this processing was not carried out in any formally controlled way (e.g. by a workflow management system), it would have been impossible for a generic preservation environment to understand the significance of the various digital objects produced from the information available: the story of the experiment was represented only implicitly in a variety of opaque sources of information, such as the location of files in the directory hierarchy, metadata embedded in binary files, filenames, and log files. This was addressed by capturing information about changes on the file system as these changes occurred, when a variety of contextual information was still available; the provenance graph was then constructed dynamically from this information using software that embedded the knowledge and expertise of the scientists.

The BW-eLabs Project [Razum, M., et al. (2003)] is an example of sheer curation and of the collection of context information in laboratory environments. The project stores context metadata captured from the laboratory equipment during experiments together with the experiments’ measurements, in order to improve reuse and collaboration between scientists.

The most effective way to capture significant environment information (SEI) is through observation in the environment in which the object is created and used. We look at the interaction between the digital object (DO), the environment and the user over time. This allows us to infer dependencies that are not explicit and to determine information relevant to the use and reuse of the DO. In terms of the DCC lifecycle, we address the ‘create’ and ‘use and reuse’ phases, examining the contexts of creation, use and reuse and trying to extract SEI from them.

There is a close analogy between what we term sheer curation and modern models used in the records management community. In the traditional approach towards record-keeping, the so-called ‘life cycle model’, archivists are only involved subsequent to the period of active use of a record within an organisation, when an object is transferred to a formal archive or otherwise disposed of. Partly in response to the move towards digital rather than paper records, record-keeping practice has increasingly adopted the ‘Records Continuum’ model, in which a record is regarded as existing in a continuum rather than passing through a series of fixed life-cycle stages, and archival practices are involved throughout, from the time the record is first created [McKemmish, S. (1997)]. In this way, contextual information available at the time the record is created and during its period of active use may be captured and subsequently exploited to support archiving, in much the same way as the metadata captured during data creation and reuse in sheer curation models. This metadata may be thought of as constituting records that document, for example, a scientific experiment, thus forming a ‘metadata continuum’.
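
To make the file-system capture idea discussed above more concrete, the following is a minimal sketch in Python, not the software used in any of the cited projects. It assumes that changes can be detected by comparing two snapshots of a directory (a simple stand-in for live monitoring), and it uses one hypothetical domain rule, namely that a newly created file is derived from an existing file with the same name stem, to stand in for the scientists' embedded knowledge. The function names `snapshot`, `diff_snapshots`, `derive_edges` and `demo`, and the example filenames, are illustrative assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to a hash of its content."""
    state = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            state[path.relative_to(root).as_posix()] = digest
    return state


def diff_snapshots(before, after):
    """Return (event, path) pairs describing changes between two snapshots."""
    events = []
    for path in sorted(after):
        if path not in before:
            events.append(("created", path))
        elif before[path] != after[path]:
            events.append(("modified", path))
    for path in sorted(before):
        if path not in after:
            events.append(("deleted", path))
    return events


def derive_edges(events, existing_paths):
    """Hypothetical domain rule: a new file derives from any existing
    file that shares its name stem (e.g. run1.raw -> run1.csv)."""
    edges = []
    for event, path in events:
        if event != "created":
            continue
        stem = Path(path).stem
        for source in sorted(existing_paths):
            if source != path and Path(source).stem == stem:
                edges.append((source, path))
    return edges


def demo():
    """Simulate one processing step and return the captured events and edges."""
    with tempfile.TemporaryDirectory() as root:
        Path(root, "run1.raw").write_text("raw instrument output")
        before = snapshot(root)
        # A (hypothetical) analysis tool writes a derived file.
        Path(root, "run1.csv").write_text("t,value\n0,1.5\n")
        after = snapshot(root)
        events = diff_snapshots(before, after)
        return events, derive_edges(events, before)
```

Calling `demo()` yields one "created" event for `run1.csv` and one provenance edge from `run1.raw` to `run1.csv`. A real deployment would presumably react to changes as they occur rather than poll, and would encode much richer, discipline-specific rules than a shared filename stem.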