Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data
Data Mining for Genomics and Proteomics uses pragmatic examples and a complete case study to demonstrate step-by-step how biomedical studies can be used to maximize the chance of extracting new and useful biomedical knowledge from data. It is an excellent resource for students and professionals involved with gene or protein expression data in a variety of settings.
aggressive, with the threshold as large as 25 percent or even 50 percent. For large experiments, it may be sufficient to remove only those probe sets that have no Present calls. Since probe sets with all (or almost all) Absent calls as well as probe sets expressed at very low levels are very likely to represent experimental noise, it may be preferable to filter expression data using more than one criterion: for instance, filtering by the fraction of Present calls in a class and filtering by the
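The first criterion mentioned above, filtering by the fraction of Present calls in a class, might be sketched as follows. This is only an illustration under stated assumptions: the detection calls are taken to be coded "P" (Present), "M" (Marginal), and "A" (Absent), a probe set is kept when at least one class reaches the threshold, and the probe set IDs, the function names, and the `min_fraction` parameter are hypothetical. The second criterion is cut off in the excerpt and is not reproduced here.

```python
def present_fraction_by_class(calls, classes):
    """Return {class label: fraction of 'P' calls} for one probe set."""
    totals, present = {}, {}
    for call, cls in zip(calls, classes):
        totals[cls] = totals.get(cls, 0) + 1
        present[cls] = present.get(cls, 0) + (call == "P")
    return {cls: present[cls] / totals[cls] for cls in totals}


def filter_probe_sets(call_table, classes, min_fraction=0.25):
    """Keep probe sets whose Present fraction reaches min_fraction in any class.

    ASSUMPTION: "in at least one class" is our reading of the excerpt,
    not a rule confirmed by the source.
    """
    kept = []
    for probe, calls in call_table.items():
        fractions = present_fraction_by_class(calls, classes)
        if max(fractions.values()) >= min_fraction:
            kept.append(probe)
    return kept


# Hypothetical toy data: 4 samples in Class A, 4 in Class B.
classes = ["A"] * 4 + ["B"] * 4
call_table = {
    "probe_1": list("AAAAAAAA"),  # no Present calls at all
    "probe_2": list("PPAAAAAA"),  # 50% Present in Class A
    "probe_3": list("PAAAAMAA"),  # 25% Present in Class A
}

print(filter_probe_sets(call_table, classes, min_fraction=0.25))
print(filter_probe_sets(call_table, classes, min_fraction=0.50))
```

Raising `min_fraction` from 0.25 to 0.50 corresponds to the more aggressive filtering the text describes.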
other genes, we suggest performing the following simple but enlightening Excel exercise described by Drăghici (Drăghici 2003):
• Using the RAND function in Excel, generate a random number for each cell in a spreadsheet with 10,000 rows (genes) and 20 columns (samples).
• Copy all the data and “Paste Special” as “Values” to another spreadsheet.
• Assume that the first ten columns represent Class A (say, Disease), and the last ten columns represent Class B (Control).
• In the 21st column,
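The setup of this exercise can also be reproduced outside Excel. The sketch below generates the same 10,000 x 20 table of random numbers and splits it into two classes of ten samples. Because the excerpt breaks off before saying what goes in the 21st column, the per-gene statistic used here (a pooled-variance two-sample t statistic compared against 2.101, the two-sided 5 percent critical value for 18 degrees of freedom) is our own assumption, chosen only to show how many purely random "genes" can appear to differ between classes. The function names are hypothetical.

```python
import math
import random

N_GENES, N_SAMPLES, N_PER_CLASS = 10_000, 20, 10
T_CRIT = 2.101  # two-sided 5% critical value of Student's t, 18 degrees of freedom


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic for two equal-sized groups."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    return (mean_a - mean_b) / math.sqrt((var_a + var_b) / n)


def count_apparent_differences(seed=0):
    """Count random 'genes' whose |t| exceeds the critical value.

    ASSUMPTION: the t statistic stands in for the 21st-column formula,
    which is truncated in the excerpt.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(N_GENES):
        row = [rng.random() for _ in range(N_SAMPLES)]          # like Excel's RAND
        class_a, class_b = row[:N_PER_CLASS], row[N_PER_CLASS:]  # Disease vs. Control
        if abs(t_statistic(class_a, class_b)) > T_CRIT:
            count += 1
    return count


n_hits = count_apparent_differences()
print(f"{n_hits} of {N_GENES} random genes look 'differentially expressed'")
```

With a 5 percent threshold and no real signal, roughly 5 percent of the rows should be expected to pass by chance alone.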
patient may be improved by the identification of characteristic expression patterns associated with different responses to a variety of treatments. To identify such patterns, large repositories of expression data for patients with known diagnosis, treatment, and outcome parameters are necessary. Multivariate feature selection algorithms can be used to identify genomic or proteomic biomarkers with high classification efficiency. A small size of multivariate biomarkers is
variables. When applied to typical gene expression data, this approach has, however, the following disadvantages (Theodoridis and Koutroumbas 2006; Huberty and Olejnik 2006).
• A basic requirement of the holdout method is the large number of biological samples in each class.
• Our training set would be smaller and, unless we have a very large number of samples, the quality of the identified biomarker and classification model would be worse than when developed from the whole training set.
• The
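To make the second point concrete, here is a minimal sketch of a holdout split on a data set of the size used in typical expression studies. It assumes a stratified split that sets aside a fixed fraction of each class for testing. The function name, the `test_fraction` value, and the class labels are illustrative assumptions, not taken from the source.

```python
import random


def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Split sample indices into training and test sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out at least one sample per class.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical study with 10 samples per class.
labels = ["Disease"] * 10 + ["Control"] * 10
train_idx, test_idx = stratified_holdout(labels)
print(f"training samples: {len(train_idx)}, held-out samples: {len(test_idx)}")
```

With only ten samples per class, holding out 30 percent leaves seven samples per class for biomarker discovery and three per class for testing, which illustrates why the method calls for many samples in each class.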
the size of the complete human proteome is about one million proteins. Why are there more proteins than genes if each protein is synthesized by reading the sequence of a gene? This is due to such events as alternative splicing of genes and post-translational modifications of proteins. The Human Proteome Organisation (HUPO) plans to identify and characterize all proteins in the complete human proteome. However, due to the scale and complexity of this task, the goal of the first phase of the Human