
[ADHD Blood Diagnosis] Downloading and Preprocessing the Data
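Contents
- 0. Introduction
- 1. Download
- 2. Loading the data
- 3. Removing unneeded columns and transposing
- 4. Equalizing the total read count per sample (computing RPKM)
- 5. Removing transcripts with RPKM of 100 or less
- 6. Averaging transcripts from the same gene into a single gene
- 7. Removing genes with zero variance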
0. Introduction
This article is one part of the following article.
https://www.bioinforest.com/adhd1
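1. Download
At the very bottom of the GSE159104 dataset page, click "GSE159104_DGE_raw_154samples.txt.gz" and "GSE159104_samples.txt.gz" to download them, then decompress both files.
Create a working folder anywhere on your local machine, create a folder named src inside it, and move the decompressed files into src.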
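If you prefer to decompress from Python rather than with a desktop tool, the following is a minimal sketch using only the standard library. The helper name `gunzip` is hypothetical, and the loop at the end assumes the two archives were saved to the current working directory.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(gz_path, out_dir):
    """Decompress a .gz file into out_dir and return the path of the result."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "name.txt.gz" becomes "name.txt"
    out_path = out_dir / gz_path.with_suffix("").name
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Assumption: the downloads sit in the current directory; missing files are skipped.
for name in ("GSE159104_DGE_raw_154samples.txt.gz", "GSE159104_samples.txt.gz"):
    if Path(name).exists():
        print(gunzip(name, "src"))
```

Streaming with `shutil.copyfileobj` avoids reading the whole expression table into memory at once.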
2. Loading the data
Load the data with pandas in Python as follows.

import time
import pickle

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
from scipy.sparse.csgraph import connected_components

import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("src/GSE159104_DGE_raw_154samples.txt", sep="\t")
df

| UCSC ID | Gene name | Description | chr (HG38) | strand (HG38) | start of transcript (HG38) | end of transcript (HG38) | length of exons | P14.2016-03-31T12_40_25.0220160771004.fc1.ch8 | P14.2016-03-31T12_40_25.0220160771004.fc1.ch10 | ... | P13.2016-10-26T13_37_05.0234062861007.fc1.ch19 | P13.2016-11-22T13_16_36.0234063021020.fc2.ch5 | P13.2016-11-22T13_16_36.0234063021020.fc2.ch6 | P13.2016-11-22T13_16_36.0234063021020.fc2.ch7 | P13.2016-11-22T13_16_36.0234063021020.fc2.ch10 | P13.2016-11-22T13_16_36.0234063021020.fc2.ch11 | P13.2016-11-22T13_16_36.0234063021020.fc2.ch15 | P13.2016-11-22T13_16_36.0234063021020.fc2.ch24 | P13.2016-12-08T21_33_13.0249463261006.fc2.ch17 | P13.2016-12-08T21_33_13.0249463261006.fc2.ch18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| uc001aak.4 | FAM138A | Homo sapiens family with sequence similarity 1... | chr1 | - | 34553 | 36081 | 1187 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| uc001aal.1 | OR4F5 | Homo sapiens olfactory receptor, family 4, sub... | chr1 | + | 69090 | 70008 | 918 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| uc001aaq.3 | RP4-669L17.10 | RP4-669L17.10 (from geneSymbol) | chr1 | - | 497239 | 499002 | 696 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| uc001aar.3 | RP4-669L17.10 | RP4-669L17.10 (from geneSymbol) | chr1 | - | 497133 | 498456 | 457 | 0 | 0 | ... | 0 | 0 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| uc001abo.4 | RP11-206L10.2 | Homo sapiens uncharacterized LOC100288069 (LOC... | chr1 | - | 764856 | 778626 | 1317 | 0 | 0 | ... | 0 | 0 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| uc065cwt.1 | IL9R | interleukin 9 receptor (from HGNC IL9R) | chrY | + | 57184253 | 57189490 | 298 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| uc065cwu.1 | AJ271736.10 | Homo sapiens cDNA FLJ77647 complete cds. (from... | chrY | + | 57190737 | 57208756 | 773 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| uc065cwv.1 | IL9R | interleukin 9 receptor (from HGNC IL9R) | chrY | + | 57192633 | 57194566 | 308 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| uc065cww.1 | WASIR1 | WASH and IL9R antisense RNA 1 (from HGNC WASIR1) | chrY | - | 57201142 | 57203357 | 1054 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| uc065cwx.1 | AJ271736.1 | Sequence 159 from Patent EP2733220. (from mRNA... | chrY | + | 57209150 | 57209218 | 68 | 0 |

df.shape

(195178, 162)

3. Removing unneeded columns and transposing
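# Gene names (column 1) become the column labels after transposing
df_selected_columns = df.iloc[:, 1].tolist()
# Columns 0-7 are annotation; columns 8 onward hold the 154 samples
df_selected = df.iloc[:, 8:].T
df_selected.columns = df_selected_columns
df_selected.shape

(154, 195178)
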
df_selected

| | FAM138A | OR4F5 | RP4-669L17.10 | RP4-669L17.10 | RP11-206L10.2 | LINC01128 | FAM41C | RP11-54O7.1 | SAMD11 | NOC2L | ... | SNORA70 | AC013734.1 | VAMP7 | VAMP7 | VAMP7 | IL9R | AJ271736.10 | IL9R | WASIR1 | AJ271736.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P14.2016-03-31T12_40_25.0220160771004.fc1.ch8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 15.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P14.2016-03-31T12_40_25.0220160771004.fc1.ch10 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P14.2016-03-31T12_40_25.0220160771004.fc1.ch12 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P14.2016-03-31T12_40_25.0220160771004.fc1.ch15 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 9.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P14.2016-03-31T12_40_25.0220160771004.fc1.ch19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
| P13.2016-11-22T13_16_36.0234063021020.fc2.ch11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P13.2016-11-22T13_16_36.0234063021020.fc2.ch15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P13.2016-11-22T13_16_36.0234063021020.fc2.ch24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P13.2016-12-08T21_33_13.0249463261006.fc2.ch17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P13.2016-12-08T21_33_13.0249463261006.fc2.ch18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4. Equalizing the total read count per sample (computing RPKM)
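The arithmetic of this step is easiest to see on a toy table before the full listing: every sample (row) is divided by its own total and multiplied by one million. The sketch below is an illustration only; the frame `toy`, its values, and the helper `scale_per_million` are made up and are not part of the original code.

```python
import pandas as pd

# Toy counts: 2 samples (rows) x 3 transcripts (columns); values are invented.
toy = pd.DataFrame(
    [[10, 30, 60], [1, 1, 2]],
    index=["sample_a", "sample_b"],
    columns=["tx1", "tx2", "tx3"],
)


def scale_per_million(counts):
    """Divide each row by its row total, then multiply by one million."""
    totals = counts.sum(axis=1)
    # axis=0 aligns the totals with the row index, so each row uses its own total
    return counts.div(totals, axis=0) * 1e6


toy_scaled = scale_per_million(toy)
print(toy_scaled)
print(toy_scaled.sum(axis=1))  # every row now sums to about one million
```

A sample whose counts sum to zero would produce NaN values in its row, so it may be worth checking the row totals before scaling.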
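data = []

# Scale each sample (row) so that its counts sum to one million.
# Note: nothing is divided by transcript length here, so strictly speaking
# this is counts per million (CPM); the name df_rpkm follows the article.
for i in range(df_selected.shape[0]):
    v = df_selected.iloc[i, :].tolist()
    s = np.sum(v)
    v_ = [(j/s)*(10**6) for j in v]
    data.append(v_)

df_rpkm = pd.DataFrame(data)
df_rpkm.index = df_selected.index
df_rpkm.columns = df_selected.columns

df_rpkm.shape

(154, 195178)

5. Removing transcripts with RPKM of 100 or less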
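As a compact illustration of this filter, the sketch below keeps only the columns whose values, summed over all samples, exceed a threshold. It uses a boolean mask rather than a loop; the frame `toy` and the helper `drop_low_columns` are hypothetical names chosen for the example.

```python
import pandas as pd

# Toy normalized values: 2 samples x 3 transcripts; values are invented.
toy = pd.DataFrame(
    {"tx1": [60.0, 50.0], "tx2": [40.0, 60.0], "tx3": [0.0, 5.0]},
    index=["sample_a", "sample_b"],
)


def drop_low_columns(table, threshold=100):
    """Keep columns whose sum over all rows is strictly greater than threshold."""
    # A positional boolean mask also works when column labels are duplicated
    keep = (table.sum(axis=0) > threshold).to_numpy()
    return table.loc[:, keep]


toy_kept = drop_low_columns(toy)
print(toy_kept.columns.tolist())
```

Here tx2 sums to exactly 100 and is dropped, matching "100 or less". One difference from a `not (sum <= 100)` test: a column whose sum is NaN fails `> threshold` and is dropped, whereas it passes the negated comparison and is kept.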
idxes = []

# Keep a transcript only if its values, summed over all 154 samples, exceed 100
for i in range(df_rpkm.shape[1]):
    if not np.sum(df_rpkm.iloc[:,i]) <= 100:
        idxes.append(i)

df_excluded = df_rpkm.iloc[:, idxes]
df_excluded.shape

(154, 92807)

6. Averaging transcripts from the same gene into a single gene
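Because several transcripts share one gene name, the table has duplicate column labels at this point. A minimal sketch of the averaging idea on a toy table follows; it groups by label after transposing, which is one possible pandas idiom and not necessarily how the article does it. The names `toy` and `average_by_label` are hypothetical.

```python
import pandas as pd

# Toy table: GENE1 appears twice (two transcripts), GENE2 once; values are invented.
toy = pd.DataFrame(
    [[1.0, 3.0, 10.0], [2.0, 6.0, 20.0]],
    index=["sample_a", "sample_b"],
    columns=["GENE1", "GENE1", "GENE2"],
)


def average_by_label(table):
    """Average columns that share a label, returning one column per label."""
    # Transpose so labels become the row index, group identical labels,
    # take the mean per group, then transpose back to samples x genes.
    return table.T.groupby(level=0).mean().T


toy_gene = average_by_label(toy)
print(toy_gene)
```

`groupby` sorts the labels, so the resulting columns come out in alphabetical order, whereas iterating over a `set` of names yields an arbitrary order.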
gene_set = list(set(df_excluded.columns.tolist()))

data = []

for i, g in enumerate(gene_set):
    # A unique gene name returns a Series; a duplicated name returns a DataFrame
    tab = df_excluded.loc[:, g]
    if len(tab.shape) == 1:
        v = tab.tolist()
    else:
        # Average the transcripts of this gene within each sample
        v = tab.mean(axis=1).tolist()
    data.append(v)

df_gene = pd.DataFrame(data).T
df_gene.index = df_excluded.index
df_gene.columns = gene_set

df_gene.shape

(154, 16177)

7. Removing genes with zero variance
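A gene whose value is identical in every sample carries no information for later analysis. As a sketch of the idea on toy data, the helper below drops such constant columns; `toy` and `drop_constant_columns` are hypothetical names, and `ddof=0` is chosen so that the result matches NumPy's default population standard deviation.

```python
import pandas as pd

# Toy gene table: GENE2 is constant across samples; values are invented.
toy = pd.DataFrame(
    {
        "GENE1": [1.0, 2.0, 3.0],
        "GENE2": [5.0, 5.0, 5.0],
        "GENE3": [0.0, 0.0, 4.0],
    }
)


def drop_constant_columns(table):
    """Keep columns whose population standard deviation is non-zero."""
    keep = (table.std(axis=0, ddof=0) != 0).to_numpy()
    return table.loc[:, keep]


toy_var = drop_constant_columns(toy)
print(toy_var.columns.tolist())
```

An exact comparison with zero can miss columns that are constant only up to floating-point rounding, so a small tolerance may be a safer choice on real data.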
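non_zero_list = []

for i in range(df_gene.shape[1]):
    # Use the gene-level table df_gene here. The listing as published read
    # df_selected.iloc[:, i], which measures raw transcript columns instead;
    # the shape shown at the end is the one reported for that published listing.
    v = df_gene.iloc[:, i].tolist()
    s = np.std(v)
    if s != 0:
        non_zero_list.append(i)

df_std = df_gene.iloc[:, non_zero_list]

df_std.to_csv("src/df_std.csv")

df_std.shape

(154, 16042)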
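When the saved CSV is read back later, the sample IDs need to be restored as the index. The sketch below shows the round trip on a toy table written to a temporary directory; the frame `toy` is made up, and only the `index_col=0` argument is the point of the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the final table; values are invented.
toy = pd.DataFrame(
    {"GENE1": [1.5, 2.5], "GENE3": [0.0, 4.0]},
    index=["sample_a", "sample_b"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "df_std.csv"
    toy.to_csv(path)
    # index_col=0 turns the first CSV column back into the row index
    restored = pd.read_csv(path, index_col=0)

print(restored)
```

Without `index_col=0`, the sample IDs would come back as an ordinary column and the table would gain one extra column.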
