Business Rules for using the Chado ND Module for managing field trial phenotyping data
STUDY AND DATASET INFORMATION
The Study Component schema (from CHADO)
STUDIES and DATASETS are captured using the CHADO project tables. Studies are organized into hierarchical structures of folders. FOLDERS are CHADO projects which can contain sub folders or studies. Studies contain DATASETS. The structure of folders, sub-folders, studies and datasets in managed through the project_relationship table. (type_id is "is a sub-folder of", cvterm_id=1140, "is a study in", cvterm_id=1145 and "is a dataset of", cvterm_id=1150). Folders do not contain information (other than their names and descriptions in the project table) but they group sub-folders and studies for easy browsing.
Figure 5. The project and project_relationship tablea
(note the 1140 in the first line should be 1145)
Every folder, study and dataset has a unique name and may have a title or description which are stored in the project.name and project.description fields. If there is a study label with property STUDY, method ASSIGNED and scale DBCV then it contains the name of the study as its single level and it is stored in the project,name field. If there is no such label in the study then a unique study name must be supplied by the application.
If there is a study label with property STUDY TITLE, method ASSIGNED and scale TEXT then this is stored in the project.description field, otherwise this can be null.
If there is a dataset label with property DATASET, method ASSIGNED and scale DBCV then it contains the unique name of the dataset as its single level and it is stored in the project,name field. If there is no such label in the study then a unique dataset name must be supplied by the application for each dataset in the study.
If there is a label with property DATASET TITLE, method ASSIGNED and scale TEXT then this is stored in the project.description field of the dataset project record, otherwise this can be null.
Every study variable in a study is described by three records in the projectprop table with projectprop.project_id equal to the project_id of the study describing the user-supplied name, the user-supplied description of the variable and the cvterm_id of the standard variable to which it belongs.
All projectprop records for the same label have the same projectprop.rank in the projectprop table. The rank specifies the user-supplied order of the variables in the study.
The projectprop.type_id field indicates whether the projectprop.value contains a local name for the variable and if so, where it is stored in the schema, or whether it contains a description (type_id=1060) or the link to a standard variable (type_id=1070). The interpretation of the type_id for the name projectprop records is given in Table 6.
As described above the levels of the study name and title labels (if present) are stored in the project table. If the labels are Germplasm Entry labels (eg a study with only one genotype) the values are stored in the stock or stockprop tables linked to a study observation unit (nd_experiment record linked to the study). If the labels are Trial Environment labels (eg a study with only one location) then the levels are stored in the nd_geolocation and nd_geolocationprop tables linked to a study observation unit (nd_experiment record linked to the study).
The levels of all other study labels are stored in a fourth projectprop record for each label with the same projectprop.rank value as the label description records. If there are any study constant variates (observationsor measurements having a single value for the whole study) they are similarly described in the projectprop table with project_id of the whole study. The values however are stored in the phenotype table linked to a study observation unit (nd_experiment record linked to the study).
The values of study variables are cast to all observation units in the study by the study-dataset and dataset-observation unit relationships.
Figure 6. Part of the projectprop table for the STYDY (project_id=2)
All other variables (labels and variates) in the study belong to one or more datasets and these are described in projectprop records, in the same way as study variables, for each dataset in which they occur. As for studies the levels of the dataset name and title labels, if present, are stored in the project table. Other labels which are dataset conditions (such as the dataset user id) are stored in the projectprop table as with study labels. Values of all other variables in a dataset are stored in tables appropriate to their class, linked to observation units related to the dataset.
Every study and dataset should have user name and user id labels to identify the database user who entered the data for that study or dataset.See the section below on attribution and ownership.
Figure 7. Part of the projectprop table for the plot dataset (project_id=4)
The values of labels in datasets are stored in the TRIAL ENVIRONMENT, GERMPLASM ENTRY or TRIAL DESIGN components of the schema and values of the variates of studies or datasets are stored in the OBSERVATION component of the schema. These storage locations are identified from the projectprop.type_id of the local name property, which is also reflected in the standard variable cvterm_relationship with type_id=1044 so that the storage location of any variable is known from the dataset side or the standard variable side without linking.
The Trial Environment Component schema (from CHADO)
The location/environment component of the logical data model manages information about the trial environment where an experiment is conducted. It uses the nd_geolocation table and nd_geolocationprop table to store all values of location labels. Information in these tables corresponds to actual physical locations where Field Trials are conducted.
The nd_geolocation.description field is used to store the value of factor with property TRIAL INSTANCE, method ENUMERATED and scale NUMBER.If the study comes from the IB Fieldbook this is always present and contains a sequential number 1,2, .... But studies from other applications may have TRIAL INSTANCE factors with a different scale. If there is no TRIAL INSTANCE factor then simply assign sequential numbers to nd_geolocation.description and put all LOCATION labels in the nd_gelocationprop table. Georeference properties of the trial site are also stored in the nd_geolocation table if available, but all other properties of the trial location (eg site name, side code, site ID and loc_id) are stored in the nd_gelocationprop table.
Global information about locations is stored in the Location Management Module of the database, outside the scope of the Phenotyping database, but linked via the location ID label with property LOCATION, method ASSIGNED and scale DBID.
Other labels of the trial site which describe the management or environment of the trial (ie not treatments within the trial) are also stored as properties of the trial environment in the nd_geolocationprop table. (eg, irrigation, pesticides, season etc.) In the IB Fieldbook, these are labels of the TRAIL INSTANCE factor.
Figure 8. nd_geolocation table and properties SITE ID and SITE for the trial environment stored in the nd_geolocationprop table.
The Germplasm Entries Component schema (from CHADO)
The stock table maps to the Germplasm component of the logical data model. Entries in the stock table represent the Germplasm used in a field trail. Information about the Germplasm is stored in the GMS database, outside the scope of the Phenotyping database.
The stock.uniquename field is used to store the label with property GERMPLASM ENTRY, method ENUMERATED and scale NUMBER (ENTRY_NO, cvterm_id=8230) this is always present if the study comes from the IBFieldbook, but if it is not present (study from another application) simply store a sequence number for the germplasm entries in the study in this field. (In migration from IBDB V1 the levelno will do).
If there is a label with property GERMPLASM ID, nethod ASSIGNED and scale DBID (GID, cvterm_id=8240) this contains the GID from GMS and they should be stored in the stock.dbxref_id field. Else null. The stock table thus serves as a link to the information in the GMS via the stock.dbxref_id field.
If there is a label with property GERMPLASM ID, method ASSIGNED and scale DBCV (DESIGNATION, cvterm_id=8250) it contains a germplasm name from the database and its levels should be stored in the stock.name field. Else null.
If there is another label with property GERMPLASM ID, method ASSIGNED and scale CODE (ENTRY_CODE, cvterm_id=8300) it contains a study level entry code for the germplasm and its levels should be stored in the stock.value field and its cvterm_id in the stock.type_id field. Else both are null.
So what to use the type_id for since it is obligatory? All the variables have fixed and known cvterms so what other data could be useful? Put 8300 (ENTRY_CODE) for now.
Figure 9: Information on the entries in the experiment in the stock table.
The stockprop table contains levels of any other labels with relationship 'stored in' (cvterm_id=1044) pointing to 'Germplasm entry' (cvterm_id=1040) for example other names or whether a certain seed is a control/check in a field trial, or its seed source - where you got the seed or a label for the pedigree.
The Trial Design Component schema (from CHADO)
The Field Trial component of the Logical Data Model uses the nd_experiment tables. The information stored in the nd_experiment and supporting tables describes the design of the nursery or field trial and/or the structure of derived data coming from summary processes or statistical analysis. General information about the study which is contained in study labels is inherited by all the observation units of all datasets belonging to the study. Information on the germplasm included in the study is indicated by links to the stock table, and information on environments used in the study is provided by links to the nd_geolocation table.
If a factor describes a management practice which is a treatment in the experiment eg fertilization or irrigation, it is part of the Field Trial design, but if it applies to the whole environment or trial site, for example, irrigation is sometimes applied to an entire location, then it is managed in the Trial Environemt component. In other cases irrigation varies by plot in which case it will be managed as part of the Field Trial component with levels stored for each plot in the nd_experimentprop table.
Each record in the nd_experiment table corresponds to one observation unit in a dataset and serves to link a specific combination of label values (levels) to a specific set of variate values. The levels of the design labels (treatments, design and lay-out) specify the context of the experiment or observation unit on which the associated values of the variates were observed. The study labels, environment labels and entry labels are associated by links from those components to the nd_experiment records (observation units).
Figure 10. Observation units of the plot data listed in the nd_experiment table.
Each observation unit has a type indicated by nd_experiment.type_id linking to cvterm. Some types are:
- Type = Study (cvterm_id=1010) linking levels of labels which apply to the whole study (like the PI's name) to observations which might have been made at the whole study level - like water condition eg irrigated or not.
- Type = Dataset (cvterm_id=1015) linking levels of the whole dataset environment to observations made at this level.
- Type = Trial Environment (cvterm_id=1020) linking levels of labels which apply to each instance of a trial (ie each environment where a trail is repeated). Usage of properties is similar to study above.
- Type =Field Plot (cvterm_id=1155). A field trial. Linked by proj_id (to what?) and links into the stock table. The properties describe the plot. Raw observations are linked from here in the phenotype table.
- Type=Sample. (cvterm_id=1160) For a sample unit smaller than a plot
- Type = Average (cvterm_id=1170). Linked by proj_id, links to the stock table, properties determined by what is being tracked. Means are linked from here in the phenotype table.
- Type = Summary (cvterm_id=1180) for summary statistics like SEs and LSDs. Linked by proj_id, links to the stock table, properties similar to plot.
Each observation unit (nd_experiment record) belongs to a project (study or dataset) and this is recorded in the nd_experiment_project table which allows many to many linkages between nd_experiment records and projects although in the breeding context each observation unit will probably belong to one dataset.
Each nd_experiment record links to Trial Environment and Location labels via the nd_experiment.nd_geolocation_id field. If the location/environment information is not available or is not relevant (eg for the mean over several locations) then this field is set to 1 - a 'not specified' environment (since it is not allowed to be null.
Details of the germplasm applied to each nd_experiment record are linked via the nd_experiment_stock linkage table which allows many to many linkages between nd_experiment records and stock records although in the breeding context this will almost always be one stock to many observation units (nd_experiment records). This table requires a type_id for each link. It is not clear what information that should carry so we can set it to 1000 for now.
Levels of other labels describing the experiment context are supplied in the nd_experimentprop table and these are linked to their standard variable ID via the nd_experimentprop.type_id. The nd_experimentprop.rank is only needed if you have two values of the same type_id for the same nd_experiment_id so it can be left 0 for the moment.
Figure 11. Design and treatment levels for plot units in the nd_experimentprop table.
The Observation Component schema (from CHADO)
Variate values are recorded in the phenotype table. All phenotype records are linked to the appropriate nd_experiment record via the nd_experiment_phenotype linkage table. This allows many to many linkages between nd_experiment records and phenotype records although in breeding trials it will always be one nd_experiment record to one or more phenotype records (possibly none).The type of the observation unit (nd_experiment record) indicates how the values are obtained. In the case of experiments of type Plot the observations in the phenotype table will be the directly observed raw data about plants in field trial (e.g. yield, plant height). In the case of experiments of type Average or Summary, the observations in the phenotype table will be the results of statistical analysis on the raw data from Plot experiments. These results will be stored rather than computed in the database due to the complexity of the analysis even for simple means calculations.
The variate value is stored in the phenotype.value field whether it is numeric or character except for categorical variates for which the valid values are stored in the cvterm table and the actual value is indicated by a link from phenotype.cvlaue_id .
The phenotype.observable_id field links the phenotype value to the standard variable (STDVAR) cv term in the cvterm table. The phenotype.name filed also stores the cvterm id for the standard variable so that links can easily be made to the standard variable id in the projectprop table when browsing datasets.
Figure 12. Phenotype values for plot data in the phenotype table.
OWNERSHIP AND ATTRIBUTION
Every data value can be traced back to a database user. To do this applications must add a UID variable at the appropriate level for the data value. Every study must have a Study_UID (cvterm_id=8020) variable, if a different user adds a dataset to the study, it must have a Dataset_UID variable in the dataset with its value there also, if a different user adds a record to a dataset it must have an OU_UID in the nd_experimentprop table. If a different user adds a variable to a dataset is must have a Variable_UID property attached to the variable description in the dataset and if a different user adds or changes a phenotype value .. we need a phenotypeprop table! Value_UID.
Any value then 'belongs' to its closest UID value in the order phenotype, variable, observation unit, dataset and study.
The same system could be used to timestamp data and different levels of precision.
KEY DIFFERENCES BETWEEN IBDB V1 AND IBDB V2
The key difference between the two versions is that the original concept of user defined factors that was central to IBDB V1 has been dropped. Factors were subsets of the labels in a study, one of which (the factor label) was required to be discriminate between all levels of the property concerned. Other labels in the factor could have one to many relationships with these levels. In IBDB V2 labels are divided into components of the data model - study, trial environment, germplasm entry, trial design. In effect each study has these four factors.
The old constraint that the factor label had to be maximally discriminatory has been implemented slightly differently by assuming or imposing an ID in each factor - study id, trial instance, germplasm entry, and field plot. These are just numbers and may or may not be explicitly included in the component labels for a particular study. If they are not included the model simply inserts them into project.project_id,, nd_geolocation_id and description, stock.uniquename and nd_experiment_id.
Another difference is that the values of variables are stored in their respective components instead of just splitting then into label levels and data values. Also these are no longer split as numeric and character. All numeric values are stored as character strings. However there is now a split between categorical and other types of variables with values of categorical variables being stored as the ids (cvterm_ids) of their valid values from the ontology management system.
Another thing that has changed is the concept of effect and representation. These are now rolled into the concept of dataset and managed through the project_relationship table.