Research-based Training in Trustworthy Data Science
Trainees will work with faculty members on interdisciplinary research projects spanning data science, bioinformatics, chemical & biological engineering, and other subject domains. The interdisciplinary team will examine risks, establish measures of dependability, and develop mechanisms to mitigate the risks of making data-driven decisions from biological and other data. Data-driven decisions are an output of the data science lifecycle (Figure 1). We define a risk as a cause that can lead to failures in data-driven discovery. We define a measure as a facet of dependability that provides metrics and tolerance levels for risks. We define a mechanism as additional logic added to one or more stages of the data science lifecycle to bring risks within tolerance levels or to ensure that they stay within them.
Fig. 1. Working representation of the data science lifecycle developed in the NSF TRIPODS Phase I award (#1934884); not all data science lifecycles contain all of the stages represented in the figure.
We will examine and improve the trustworthiness of data science lifecycles used to solve biological data science problems, such as the prediction of protein function and the differentiation of stem cells.
Data Science Foundation
This research theme focuses on the foundation of trustworthy data science. The team has identified several sources of risk within the data acquisition, exploratory data analysis, modeling & training, model evaluation, prediction, and interpretation stages of the D4 framework. We will explore risk mitigation methods for dealing with noisy data, limited training data, class imbalance, occlusion, and low contrast in cellular image data. We will also explore methods for quantifying the risk associated with predictions. For instance, when predicting a quantitative response, we will report a prediction interval. For point predictions, we will report the associated coverage probability to characterize the certainty of the predictions. We will develop formal foundations and representations of data science lifecycles that are interpretable by domain experts, who will be involved in the various stages of those lifecycles. The proposed data science foundation research aims to catalyze trustworthy data-driven discovery across multiple disciplines.
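As an illustration of reporting a prediction interval together with its coverage, the sketch below uses split conformal prediction on synthetic data. This is one common technique chosen here only as an example; it is not presented as the team's method. The function names, the linear model, the synthetic data, and the miscoverage level `alpha` are all assumptions made for the sake of the example.

```python
import math
import numpy as np


def split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1):
    """Symmetric prediction intervals at nominal level 1 - alpha.

    residuals_cal: (y - y_hat) on a held-out calibration set.
    y_pred_new:    point predictions for new inputs.
    """
    scores = np.sort(np.abs(np.asarray(residuals_cal, dtype=float)))
    n = scores.size
    # Rank of the conformal quantile with the finite-sample correction.
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the interval is unbounded.
    q = scores[k - 1] if k <= n else np.inf
    y_pred_new = np.asarray(y_pred_new, dtype=float)
    return y_pred_new - q, y_pred_new + q


def empirical_coverage(y_true, lower, upper):
    """Fraction of true responses that fall inside their intervals."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_true >= lower) & (y_true <= upper)))


def demo(seed=0, alpha=0.1):
    """Fit a line on synthetic data and check interval coverage."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=1500)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=1500)  # hypothetical data

    x_tr, x_cal, x_te = x[:500], x[500:1000], x[1000:]
    y_tr, y_cal, y_te = y[:500], y[500:1000], y[1000:]

    # Hypothetical model: ordinary least-squares line.
    slope, intercept = np.polyfit(x_tr, y_tr, deg=1)

    def predict(v):
        return slope * v + intercept

    lower, upper = split_conformal_interval(
        y_cal - predict(x_cal), predict(x_te), alpha=alpha
    )
    return empirical_coverage(y_te, lower, upper)


if __name__ == "__main__":
    print(f"empirical coverage: {demo():.3f}")
```

On held-out data, the empirical coverage should land near the nominal level `1 - alpha` when calibration and test data come from the same distribution. A large gap between the two is itself a signal of risk, for example a distribution shift between data acquisition and prediction.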
Data-Driven Prediction of Stem Cell Differentiation for Nerve Regeneration
This research theme focuses on the application of trustworthy data science to stem cell differentiation. Adult stem cells such as mesenchymal stem cells (MSCs) hold considerable potential for nerve regeneration because they are routinely isolated, are capable of self-renewal, face no ethical constraints on their isolation and derivation, and exhibit paracrine activity: in differentiated forms, they can secrete bioactive molecules capable of stimulating neuroprotection and reducing inflammation. Differentiation can be induced by designing biomaterials that impart specific electrical, physical, or chemical cues, or combinations thereof, to the stem cells. However, the use of differentiated stem cells in therapies such as nerve regeneration has not widely transitioned into clinical use because the final fate of the implanted cell population cannot yet be controlled reliably. Differentiation in response to external stimuli is ascertained using a variety of techniques to improve reliability and assure quality. These techniques include morphological assessment of differentiation using microscopy and high-throughput imaging, immunocytochemistry approaches that monitor and quantify the expression of cell surface markers, gene expression and proteomics studies, and functional assays characteristic of MSC differentiation into neural cells.
The multi-measurement approach to data acquisition provides redundancy that improves the reliability of differentiation protocols. It also yields multiple large datasets on which machine learning models can be used for high-throughput exploration of the stimuli landscape and its impact on stem cell differentiation. In this project, domain experts and data scientists will collaboratively develop methods that enable robust predictions of stem cell differentiation.
Computational Protein Function Annotation
Overwhelmed with genomic data, biologists face a major post-genomic challenge: what do all these genes do? This is one of the most urgent questions in molecular biology today, because genomic data are little more than noise if we cannot interpret what the products of these genes do in the living organism.
Experimental biologists, biocurators, and computational biologists all play a role in characterizing protein function. The discovery of protein function in the laboratory by experimental scientists is the foundation of our knowledge about proteins. Biocurators compile experimental findings in knowledge bases to provide standardized, readily accessible, and computationally amenable information. Computational biologists train their methods on these data to predict protein function and guide subsequent experiments. Meeting both the social and technical challenges in the field requires more productive and meaningful interaction among members of these core communities. Within the D4 framework, we will discuss the functional understanding of genomic data from the computational, experimental, and data science perspectives.