How to measure significance?

This section is extracted from the deliverable D4.1, Initial version of environment information extraction tools [Corubolo, F. et al. (2014)].

http://goo.gl/9Uzv9W

Please refer to the further reading list for the sources cited in square brackets [ ].

Now that we have identified the benefit of collecting environment information, the question is how to measure the significance of the collected environment information. The question ‘what for – for what purpose?’ should help us define what is significant.

Weighing significance

Collections of data often have more than one use. Determining what information is significant depends on the use of the data.

For example: the calibration of the solar measurement instrument will require calibration data, which may be a subset of the complete collection of data, as well as applications necessary to read and analyse the calibration data.

For a given collection, not all of its environment information may be necessary for every potential use. To represent this, we propose assigning a weight to each relation between the collection DOs and their environment information. Weights take a value between 0 and 1 inclusive, where a weight of 1 indicates that the information is essential for all intended uses of the data. Monitoring access to the information, together with regular review of the information required for each use, would provide the opportunity to update the weights and could also accommodate new uses of the data.

Weights could be determined by direct collection, that is, by asking users to provide the weights together with their current purpose of use once the dependencies have been determined. Alternatively, the significance weight could be derived from the observed frequency of data use, for example by recording how often a particular object is used in conjunction with another. This approach applies only where usage data can be collected across multiple users.
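The frequency-based approach can be illustrated with a minimal sketch. It assumes that usage is logged as sessions, each listing the objects used together, and that the weight of a relation from object a to object b is estimated as the fraction of sessions using a that also use b. The function name, the session format and this particular estimator are assumptions made for illustration; the deliverable does not prescribe them.

```python
from collections import Counter
from itertools import permutations


def co_usage_weights(sessions):
    """Estimate weight(a, b) as the share of sessions using a that also use b.

    `sessions` is an iterable of collections of object identifiers
    (a hypothetical log format). Every returned weight lies in [0, 1].
    """
    used = Counter()      # number of sessions in which each object appears
    together = Counter()  # number of sessions in which an ordered pair appears
    for session in sessions:
        objects = set(session)  # ignore repeated use within one session
        used.update(objects)
        together.update(permutations(objects, 2))
    return {(a, b): n / used[a] for (a, b), n in together.items()}


# Illustrative usage log, invented for this example.
sessions = [
    {"calibration_data", "analysis_tool"},
    {"calibration_data", "analysis_tool", "instrument_manual"},
    {"calibration_data"},
    {"instrument_manual"},
]
weights = co_usage_weights(sessions)
```

In this invented log the analysis tool is never used without the calibration data, so that relation receives a weight of 1, while the reverse relation receives a lower weight because the calibration data is also used on its own.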

For an individual DO, these weights would include factors based on the cost of collecting the information (e.g. a subscription fee may be required to access the information contained in the resource). A threshold defined by the user community, archive and content providers would determine which pieces of information make up the SEI.
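As a minimal sketch of the threshold step, the selection could look as follows. The item names, the weights and the threshold value are invented for illustration, and treating a weight equal to the threshold as significant is an assumption rather than something the deliverable specifies.

```python
def select_sei(weights, threshold):
    """Return the environment information whose weight meets the threshold.

    `weights` maps a (hypothetical) item identifier to a weight in [0, 1].
    Items with weight >= threshold are kept; the inclusive comparison is
    an assumption of this sketch.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0 and 1")
    for item, weight in weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {item!r} must lie between 0 and 1")
    return {item for item, weight in weights.items() if weight >= threshold}


# Illustrative weights for one digital object, invented for this example.
relation_weights = {
    "calibration_data": 1.0,
    "analysis_tool": 0.8,
    "operator_notes": 0.3,
}
sei = select_sei(relation_weights, threshold=0.5)
```

With these values, the calibration data and the analysis tool are selected and the operator notes fall below the threshold.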

For example: in the case of a subscription fee for accessing information contained in a third-party repository, one could define the weight as 1 - cost/budget, where the budget could be the total funds allocated to the archival of this DO.
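The cost-based weight can be written as a short function. This is a sketch only: the function name is hypothetical, and clamping the result to the range 0 to 1 when the cost exceeds the budget is an added assumption, made so that the result remains a valid weight.

```python
def cost_weight(cost, budget):
    """Compute the weight 1 - cost/budget for a piece of environment information.

    `budget` is the total funds allocated to archiving the digital object.
    The result is clamped to [0, 1]; the clamping is an assumption of this
    sketch for the case where the cost exceeds the budget.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if cost < 0:
        raise ValueError("cost must not be negative")
    return max(0.0, min(1.0, 1.0 - cost / budget))


# Invented figures: a fee of 200 against an archival budget of 1000.
fee_weight = cost_weight(cost=200, budget=1000)
free_weight = cost_weight(cost=0, budget=1000)
over_budget_weight = cost_weight(cost=1500, budget=1000)
```

A fee of 200 against a budget of 1000 gives a weight of 0.8, freely available information gives a weight of 1, and information costing more than the whole budget gives a weight of 0.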

The user community, content providers and archive will need to determine how the weight is defined. Particular patterns in data usage could also help determine the current user activity, which in turn indicates the purpose of a dependency, and from there the inference of the purpose of use could be automated.