Forest background
バイオインフォの森へようこそ
【ADHD血液診断】データのダウンロードと前処理

【ADHD血液診断】データのダウンロードと前処理

目次

0. はじめに

本記事は、以下の記事の一部です。

https://www.bioinforest.com/adhd1

1. ダウンロード

こちらのページの最下部にある「GSE159104_DGE_raw_154samples.txt.gz」と「GSE159104_samples.txt.gz」をクリックしてダウンロードし、解凍します。

適当にローカルにフォルダを作成し、その中にsrcというフォルダを作成し、上記で解凍したファイルを移動させておきます。

2. データのロード

以下のようにしてPythonのpandasからデータをロードします。

Text
import time
import pickle

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
from scipy.sparse.csgraph import connected_components

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("src/GSE159104_DGE_raw_154samples.txt", sep="\t")
df
UCSC IDGene nameDescriptionchr (HG38)strand (HG38)start of transcript (HG38)end of transcript (HG38)length of exonsP14.2016-03-31T12_40_25.0220160771004.fc1.ch8P14.2016-03-31T12_40_25.0220160771004.fc1.ch10...P13.2016-10-26T13_37_05.0234062861007.fc1.ch19P13.2016-11-22T13_16_36.0234063021020.fc2.ch5P13.2016-11-22T13_16_36.0234063021020.fc2.ch6P13.2016-11-22T13_16_36.0234063021020.fc2.ch7P13.2016-11-22T13_16_36.0234063021020.fc2.ch10P13.2016-11-22T13_16_36.0234063021020.fc2.ch11P13.2016-11-22T13_16_36.0234063021020.fc2.ch15P13.2016-11-22T13_16_36.0234063021020.fc2.ch24P13.2016-12-08T21_33_13.0249463261006.fc2.ch17P13.2016-12-08T21_33_13.0249463261006.fc2.ch18
uc001aak.4FAM138AHomo sapiens family with sequence similarity 1...chr1-3455336081118700...0000000000
uc001aal.1OR4F5Homo sapiens olfactory receptor, family 4, sub...chr1+690907000891800...0000000000
uc001aaq.3RP4-669L17.10RP4-669L17.10 (from geneSymbol)chr1-49723949900269600...0010000000
uc001aar.3RP4-669L17.10RP4-669L17.10 (from geneSymbol)chr1-49713349845645700...000.50000000
uc001abo.4RP11-206L10.2Homo sapiens uncharacterized LOC100288069 (LOC...chr1-764856778626131700...000.50000000
...............................................................
uc065cwt.1IL9Rinterleukin 9 receptor (from HGNC IL9R)chrY+571842535718949029800...0000000000
uc065cwu.1AJ271736.10Homo sapiens cDNA FLJ77647 complete cds. (from...chrY+571907375720875677300...0000000000
uc065cwv.1IL9Rinterleukin 9 receptor (from HGNC IL9R)chrY+571926335719456630800...0000000000
uc065cww.1WASIR1WASH and IL9R antisense RNA 1 (from HGNC WASIR1)chrY-5720114257203357105400...0000000000
uc065cwx.1AJ271736.1Sequence 159 from Patent EP2733220. (from mRNA...chrY+5720915057209218680
Text
df.shape
Text
(195178, 162)

3. 不要な列の削除と転置

Text
df_selected_columns = df.iloc[:,1].tolist()
df_selected = df.iloc[:, 8:].T
df_selected.columns = df_selected_columns
df_selected.shape
Text
(154, 195178)
Text
df_selected
FAM138AOR4F5RP4-669L17.10RP4-669L17.10RP11-206L10.2LINC01128FAM41CRP11-54O7.1SAMD11NOC2L...SNORA70AC013734.1VAMP7VAMP7VAMP7IL9RAJ271736.10IL9RWASIR1AJ271736.1
P14.2016-03-31T12_40_25.0220160771004.fc1.ch800000001015.5...000000000
P14.2016-03-31T12_40_25.0220160771004.fc1.ch100000010008.5...000000000
P14.2016-03-31T12_40_25.0220160771004.fc1.ch120000010004.5...000000000
P14.2016-03-31T12_40_25.0220160771004.fc1.ch150000100009.5...000000000
P14.2016-03-31T12_40_25.0220160771004.fc1.ch190000000001...000000000
...............................................................
P13.2016-11-22T13_16_36.0234063021020.fc2.ch110000000001...000000000
P13.2016-11-22T13_16_36.0234063021020.fc2.ch150000000001...000000000
P13.2016-11-22T13_16_36.0234063021020.fc2.ch240000000002...000000000
P13.2016-12-08T21_33_13.0249463261006.fc2.ch170000000001...000000000
P13.2016-12-08T21_33_13.0249463261006.fc2.ch180000000000...000000000

4. 1サンプルあたりのリードの和を揃える (RPKMを計算)

Text
data = []

for i in range(df_selected.shape[0]):
    v = df_selected.iloc[i, :].tolist()
    s = np.sum(v)
    v_ = [(j/s)*(10**6) for j in v]
    data.append(v_)

df_rpkm = pd.DataFrame(data)
df_rpkm.index = df_selected.index
df_rpkm.columns = df_selected.columns

df_rpkm.shape
Text
(154, 195178)

5. RPKMが100以下のトランスクリプトを排除

Text
idxes = []

for i in range(df_rpkm.shape[1]):
    if not np.sum(df_rpkm.iloc[:,i]) <= 100:
        idxes.append(i)

df_excluded = df_rpkm.iloc[:, idxes]
df_excluded.shape
Text
(154, 92807)

6. 同じ遺伝子由来のトランスクリプトの平均をとって1つの遺伝子にまとめる

Text
gene_set = list(set(df_excluded.columns.tolist()))

data = []

for i, g in enumerate(gene_set):
    tab = df_excluded.loc[:, g]
    if len(tab.shape) == 1:
        v = tab.tolist()
    else:
        v = df_excluded.loc[:, g].apply(lambda x: np.mean(x), axis=1).tolist()
    data.append(v)

df_gene = pd.DataFrame(data).T
df_gene.index = df_excluded.index
df_gene.columns = gene_set

df_gene.shape
Text
(154, 16177)

7. 分散が0の遺伝子を除去する

Text
non_zero_list = []

for i in range(df_gene.shape[1]):
    v = df_selected.iloc[:, i].tolist()
    s = np.std(v)
    if s != 0:
        non_zero_list.append(i)

df_std = df_gene.iloc[:, non_zero_list]

df_std.to_csv("src/df_std.csv")

df_std.shape
Text
(154, 16042)