Download preprocessed TGP data
Data set description:
Data for the four studies are zip-compressed and available for download through the links below. Each zip file contains four files in CSV format (comma-separated values): the FARMS-summarized gene expression values per gene (exprs_*.csv), the informative/non-informative (I/NI) call per gene (ini_*.csv), the sample names (sampleNames_*.csv) and the gene names (geneNames_*.csv). Each sample corresponds to one drug measurement. In the gene expression matrix the columns are genes and the rows are samples. The I/NI call is a filter criterion that allows detecting information-carrying genes (e.g., genes with an I/NI call below 0.5; a smaller I/NI call means more information). Replicate measurements were collapsed to one measurement per gene.
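As a rough illustration of the I/NI filter described above, the following Python sketch keeps only the gene columns whose I/NI call is below 0.5. It assumes headerless numeric CSV content and one I/NI value per gene in the same order as the columns of the expression matrix; neither assumption is confirmed by the description, and the helper names (`read_numeric_rows`, `informative_columns`, `filter_matrix`) are hypothetical. Toy in-memory data stand in for exprs_*.csv and ini_*.csv.

```python
import csv
import io


def read_numeric_rows(handle):
    """Read headerless numeric CSV content into a list of float rows (assumed layout)."""
    return [[float(x) for x in row] for row in csv.reader(handle) if row]


def informative_columns(ini_calls, threshold=0.5):
    """Return the column indices of genes whose I/NI call is below the threshold."""
    return [j for j, call in enumerate(ini_calls) if call < threshold]


def filter_matrix(exprs, keep):
    """Keep only the selected gene columns; rows are samples, columns are genes."""
    return [[row[j] for j in keep] for row in exprs]


# Toy stand-ins: 2 samples x 3 genes, and one I/NI call per gene.
exprs = read_numeric_rows(io.StringIO("0.1,1.2,-0.3\n0.0,0.9,-0.5\n"))
ini = [0.9, 0.2, 0.4]

keep = informative_columns(ini)
filtered = filter_matrix(exprs, keep)
print(keep, filtered)
```

For the real files, the same functions could be fed from `open(...)` handles once the actual layout (header rows, orientation of the I/NI file) has been checked.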
TGP drug info and pathological findings (CSV, EXCEL format)
Study – rat in vivo single (CSV format)
- With replicates (63MB) 6264 samples, 12088 genes
- Collapsed replicates (19MB) 2088 samples, 12088 genes
Study – rat in vivo repeated (CSV format)
- With replicates (61MB) 6249 samples, 12088 genes
- Collapsed replicates (18MB) 2092 samples, 12088 genes
Study – rat in vitro single (CSV format)
- With replicates (27MB) 3140 samples, 18988 genes
- Collapsed replicates (13MB) 1570 samples, 18988 genes
Study – human in vitro (CSV format)
- With replicates (16MB) 1418 samples, 18988 genes
- Collapsed replicates (8MB) 714 samples, 18988 genes
Download rat in vivo single study (LIBSVM format)
Data set description:
The example classification data sets below were built using the drug information from “Drug Information.csv” and the expression data from the rat in vivo single study (CSV format). The example data sets contain the FARMS-preprocessed gene expression values and, as labels, the drug-induced liver injury (DILI) classes ("-1", "+1").
The data are stored in LIBSVM format for different time points (2h, 8h, and 24h) and dose levels (low, middle, and high). These binary classification data sets are ready to be analysed using the LIBSVM package. Samples (drugs) of no DILI concern were labeled "-1" and those of most DILI concern "+1". For more details regarding the categorization of DILI see here.
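For readers who want to inspect the files without the LIBSVM tools, here is a minimal Python sketch of a reader for the LIBSVM text format, in which each line has the form `<label> <index>:<value> ...` with 1-based feature indices. The function names and the two sample lines are made up for illustration; the real files are not reproduced here.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM line into (label, {feature_index: value})."""
    parts = line.split()
    label = int(float(parts[0]))  # accepts "+1", "-1", "1.0", ...
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)  # indices are 1-based
    return label, features


def parse_libsvm(lines):
    """Parse an iterable of LIBSVM lines, skipping blank lines."""
    labels, rows = [], []
    for line in lines:
        if not line.strip():
            continue
        label, features = parse_libsvm_line(line)
        labels.append(label)
        rows.append(features)
    return labels, rows


# Made-up sample lines, not taken from the real data sets.
sample = ["+1 1:0.25 2:-1.5 4:0.75", "", "-1 1:0.0 3:2.0"]
labels, rows = parse_libsvm(sample)
print(labels, rows)
```

Passing an open file handle instead of `sample` works the same way, since a text file iterates line by line.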
Example data sets
- Dose level low, time point 2h (2MB) # classes: 2, # features: 12088, # samples: 49
- Dose level low, time point 8h (2MB) # classes: 2, # features: 12088, # samples: 49
- Dose level low, time point 24h (2MB) # classes: 2, # features: 12088, # samples: 49
- Dose level middle, time point 2h (2MB) # classes: 2, # features: 12088, # samples: 49
- Dose level middle, time point 8h (2MB) # classes: 2, # features: 12088, # samples: 49
- Dose level middle, time point 24h (2MB) # classes: 2, # features: 12088, # samples: 49
- Dose level high, time point 2h (2MB) # classes: 2, # features: 12088, # samples: 49
- Dose level high, time point 8h (2MB) # classes: 2, # features: 12088, # samples: 49
- Dose level high, time point 24h (2MB) # classes: 2, # features: 12088, # samples: 49
- All dose levels, all time points (17MB) # classes: 2, # features: 108792, # samples: 49; features are concatenated column-wise (low.2h, low.8h, low.24h, middle.2h, …, high.24h)
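The 108792 features of the combined data set equal 9 × 12088, that is, one block of 12088 genes per dose/time combination. The Python sketch below maps a 1-based LIBSVM feature index back to its block and gene position. The full block order is an assumption: the list above elides the middle of the sequence, and the code assumes dose levels low, middle, high, each with time points 2h, 8h, 24h. The names `BLOCKS` and `locate_feature` are hypothetical.

```python
GENES_PER_BLOCK = 12088
DOSES = ["low", "middle", "high"]
TIMES = ["2h", "8h", "24h"]

# ASSUMPTION: blocks are ordered dose-major, time-minor
# (low.2h, low.8h, low.24h, middle.2h, ..., high.24h).
BLOCKS = [f"{dose}.{time}" for dose in DOSES for time in TIMES]


def locate_feature(index):
    """Map a 1-based feature index to (block name, 1-based gene position)."""
    if not 1 <= index <= GENES_PER_BLOCK * len(BLOCKS):
        raise ValueError("feature index out of range")
    block, offset = divmod(index - 1, GENES_PER_BLOCK)
    return BLOCKS[block], offset + 1


print(locate_feature(1))
print(locate_feature(12089))
print(locate_feature(108792))
```

The gene position within a block would then index into the gene names of the rat in vivo single study, assuming the gene order is the same in every block.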
Description data preprocessing
The Japanese Toxicogenomics Project (TGP) includes gene expression data, toxicological information and pathological data for 131 compounds that were screened for toxicity in vitro and in vivo in rat and in vitro in human.
Upper panel: The y-axis shows the log expression values of the fatty acid-binding protein 1 (Fabp1) estimated by FARMS after quantile normalization, while the grouped compounds are shown on the x-axis. The time points are encoded by orange, green and blue for 2h, 8h and 24h, respectively. The plot shows strong cell-culture effects within the three time points and compounds, which could not be removed by the quantile normalization.
Lower panel: Same as upper panel but batch corrected. The correction with the matched control within cell-culture clearly reduces the cell-culture effects, while compound-induced expression changes are preserved.
The standard microarray preprocessing procedure consists of normalization, summarization and filtering. However, the standard preprocessing pipeline cannot be applied to these data sets, as the initial quality control of the microarray data revealed severe effects between the cell-cultures (see upper panel). To remove these effects, the probe-level data of the microarrays were first quantile normalized. Second, a compound batch correction was made by calculating probe intensity ratios, using the corresponding control measurement for the cell-culture (vehicle only, without compound) as reference. For the next preprocessing step, summarization, probe sets corresponding to genes were defined using alternative CDFs (Version 15.1.0, ENTREZG) from Brainarray [2], and FARMS [1] was applied to summarize the intensity ratios at probe set level to obtain expression values per gene. For the last preprocessing step, gene filtering, the FARMS-based informative/non-informative (I/NI) call [3] was applied to identify all non-informative probe sets.
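The first two steps of this pipeline can be sketched in a few lines of Python. This is not the original implementation: it is a textbook quantile normalization (ties are broken by position rather than averaged) followed by a plain probe-wise ratio of a treated array to its matched control. Whether the original ratios were taken on a log scale is not stated, so none is applied here. The FARMS summarization and the I/NI call are not reproduced, and the function names and toy intensities are invented for illustration.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length probe intensity lists (one list per array)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest intensity across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def ratios_to_control(treated, control):
    """Probe-wise intensity ratio of a treated array to its matched control."""
    return [t / c for t, c in zip(treated, control)]


# Toy probe intensities: one treated array and its vehicle-only control.
treated_raw = [5.0, 2.0, 3.0]
control_raw = [4.0, 1.0, 6.0]

treated, control = quantile_normalize([treated_raw, control_raw])
ratios = ratios_to_control(treated, control)
print(treated, control, ratios)
```

In the toy data, the second probe has the lowest rank in both arrays, so its ratio is 1 after normalization; the other probes change rank between treated and control and therefore deviate from 1.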
References:
- [1] Hochreiter S, Clevert DA, and Obermayer K (2006). A new summarization method for Affymetrix probe level data, Bioinformatics, 22(8):943-949.
- [2] Dai M, Wang P, Boyd AD, et al. (2005). Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data, Nucleic Acids Res., 33(20):e175.
- [3] Talloen W, Clevert DA, Hochreiter S, et al. (2007). I/NI-calls for the exclusion of non-informative genes: a highly effective feature filtering tool for microarray data, Bioinformatics, 23(21):2897-2902.