Derived and transformed variables
Users rarely collect data in the final form for analysis and interpretation. They almost always have a set of transformations or derivations they want to make on the raw data before submitting it for analysis. In addition they may occasionally transform data to make it conform better to assumptions of a statistical analysis.
Transformations generally involve only one variable - convert form one scale (gms/plot) to another (kh/ha) or from a count to a log count. Derivations, on the other hand, may involve several input variables such as the derivation of grain yield at 15% moisture in t/ha which involves knowing the grain yield per plot, the moisture content (sometimes per plot) and the harvested plot size.
Another complication is that not infrequently the derivation my involve summarizing data from one sampling scale to another for example from single plant peasures (5 per plot) to and average plant measure for the plot, or from four disease measures at different times to the area under the disease progress curve over those four times.
The current work-around for this problem in the IBWS is to develop a fieldbook with all the variables in it - both raw and derived, to collect the data into the fieldbook (or via a hand-held into the database and then exported to the excel fieldbook) and then to perform the derivations in the excel file and load these into the database. For summaries over sampling scales the user can either put columns for each sample unit in the raw variates (with different methods) so that they appear in parallel in the fieldbook and then use excel to average or summarize the columns into a summary variable so that both the raw and summary variables are loaded and stored in the database or the user can manually create another sheet in the excel workbook to contain the values on the sub-samples and then user excel features to aggregate these up to the final variates on the Observation sheet. In this case the sample values would not be stored in the database.
Possible options for an integrated solution
It would be possible for users (working with data managers) to specify the formula for the derivation or transformation in the description of the method applied to the derived variate. The formula would be an arithmetic expression involving the standard variable names of the input variables.
Suppose the following standard variables are used in the study:
PLOTSIZE - Plot Size Harvested, m**2
GRMOIST - Grain moisture, %
GYLDP - Grain yield, kg/plot
Then the method associated with the variable: GRNYLD - grain yield adjusted to 15% moisture, t/ha would be:
Yield Adjustment with description (GYLDP*10/ PLOTSIZE)*(100- GRMOIST)/85
Now at some point (when the raw data has been entered) this formula needs to be applied to derive the values of the new variable. There are some interesting complications. Firstly not all variables need be in the same dataset. In the example above PLOTSIZE is a variate that might be a CONSTANT and therefore attach to the STUDY or to the TRIAL INSTANCE dataset. GRMOIST may be a CONSTANT or a VARIATE with values on each plot, and GYLDP is a variate of the PLOT dataset.
If the computation is to be done by the workbench, then the workbench needs to have a formula parser which understands mathematical formulae and can identify the input variate and the operations. Then it needs to find the variables in the study and make the appropriate cast of values to the level of the derived variate (in other words if the derived variate, GRNYLD, is at the PLOT level then the input variates need to be cast to the PLOT level). It may be possible to find an off-the-shelf formula parser in Java and this would save a lot of pain.
On the other hand, statistical packages like Genstat and R already have advanced formula parsers and the alternative would be to get BV to do the computations and pass the results back to the database - or just do the computations and analyse the results. However to do this the workbench would have to parse the formula and translate the standard variable names into local study names since these are the names that BV will know. This is an easier parsing job and could be made even easier by using a convention in the formulae which identifies the variable names (like enclosing them in curly brackets).