今回のノートブックを開くには、以下の手順。
$ git clone https://github.com/yyoshiaki/drbono.chap4.git
とターミナルでうち、今回のリポジトリをcloneしてくる。gitがない場合は適宜インストール。$ jupyter notebook
を立ち上げて、20180414_drbonobon_chap4.ipynbというファイルを実行する。だいたいのことはEMBOSSで実行できます。
! head seq/L07770.1.fasta
>L07770.1 Xenopus laevis rhodopsin mRNA, complete cds GGTAGAACAGCTTCAGTTGGGATCACAGGCTTCTAGGGATCCTTTGGGCAAAAAAGAAACACAGAAGGCA TTCTTTCTATACAAGAAAGGACTTTATAGAGCTGCTACCATGAACGGAACAGAAGGTCCAAATTTTTATG TCCCCATGTCCAACAAAACTGGGGTGGTACGAAGCCCATTCGATTACCCTCAGTATTACTTAGCAGAGCC ATGGCAATATTCAGCACTGGCTGCTTACATGTTCCTGCTCATCCTGCTTGGGTTACCAATCAACTTCATG ACCTTGTTTGTTACCATCCAGCACAAGAAACTCAGAACACCCCTAAACTACATCCTGCTGAACCTGGTAT TTGCCAATCACTTCATGGTCCTGTGTGGGTTCACGGTGACAATGTACACCTCAATGCACGGCTACTTCAT CTTTGGCCAAACTGGTTGCTACATTGAAGGCTTCTTTGCTACACTTGGTGGTGAAGTGGCCCTCTGGTCA CTGGTAGTATTGGCCGTTGAAAGATATATGGTGGTCTGCAAGCCCATGGCCAACTTCCGATTCGGGGAGA ACCATGCTATTATGGGTGTAGCCTTCACATGGATCATGGCTTTGTCTTGTGCTGCTCCTCCTCTCTTCGG
! head seq/U23808.2.fasta
>U23808.2 Xenopus laevis rhodopsin gene, complete cds GGATCCATGTTAGTAGCCTTCAAATCGAGTTTAGTCTAATTTGAATCAAATTCGATAAGAGTTTTCAAGT CAGCAAAAATTGTACGAGTTTTAGATGTCCAATTTTTTTTTCTATAAGTAATTTAGAGTATATTCGAATT TAATCGAGTTTAAAAAAATTCACATTAATTCCAAATTCGACCTTTAATCAATGTGCCTCCCAGGATGAGT GCCAGACTATGAACTCCTCCAACCTATTTATTCATCAACCAGATTTGACAGTGGGAGACCTAAGAATCCA GTGATTGAATGAGTTTAATGCATGAAAAGGTTGGCTTGGTTATTTGCAGCACAGAGAGAAACCAATAAGG AGAAAGTAGAGTTGAGGTGGGACATTCATGAATACTCACCTAAAGGACTTTGATTGGGATTCAGCCAATA ATATCCCATCCACCACATTTACTGAGCTTAGAACATTCTCCAAACATGAGCAGTAAAGGGACAAAGAGTA ATGGCTTTAAAGGAGAACTAAACCCTAAAACTGAATGTGTAAAAATGTTATATTTTATAGACAGAACTTA TTGCACCAGCCTAAGTTTCAGCTTGTCAATAGCAGCAATGATCCAGAACTTCAAACTTGTCACAGGGGGT
$ dottup L07770.1.fasta U23808.2.fasta -wordsize 10 -graph png
大域的アラインメント とは配列中の全塩基やアミノ酸を並べるようにする方法。 現在Needleman-Wunch法と呼ばれる方法が広く使われている。
大域的アライメントの方法を改良して、部分的な類似性が見つけられるようにしたのが局所的アライメント。局所的アライメントの方法として最もよく用いられているのが Smith-Waterman法。query配列をDBから検索してくるという作業はこの局所的アラインメントのおかげで可能になった。
EMBOSSでの実行例は大石さんのスライドが参考になります。挙動をpythonライブラリのscikit-bioで見てみましょう。
from skbio.sequence import DNA
from skbio.alignment import *
s1 = DNA("AGGTGCGTGCCTAAGGTATGCAAGAGATGG")
s2 = DNA("ACGTGCCTAGGTACGCAAG")
# 教育用の関数なので、動作は遅い。下記URLを参照。
# http://scikit-bio.org/docs/0.2.0/alignment.html?highlight=alignment#module-skbio.alignment
r = global_pairwise_align_nucleotide(s1, s2)
# r = local_pairwise_align_nucleotide(s1, s2)
print(r)
(TabularMSA[DNA] ------------------------------ Stats: sequence count: 2 position count: 30 ------------------------------ AGGTGCGTGCCTAAGGTATGCAAGAGATGG ----ACGTGCCTA-GGTACGCAAG------, 8.0, [(0, 29), (0, 18)])
/Users/yasumizuyoshiaki/.pyenv/versions/anaconda3-4.2.0/lib/python3.5/site-packages/skbio/alignment/_pairwise.py:599: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this). "to track progress on this).", EfficiencyWarning)
Smith-Waterman法 先程の例で使用。FASTAパッケージに含まれているssearchが並列計算を用いたDB配列検索によく用いられていた。
FASTA法 かつてよく使われた。現在はBLASTがよく使われるが、FANTOMプロジェクトでcDNA配列解析をしていた際は、塩基配列をアミノ酸に翻訳して配列比較を行うプログラムのfastxとfastyが有用だったそう。
BLAST 現在おなじみの配列類似性検索ツール。
BLAT 非常に高速に、しかもエクソンとイントロンの境界も考慮して、ゲノムに対するアラインメントがなされる。ほぼ一致する領域を探すことに特化している。BLASTはもともと遠く離れた類縁関係を検出することを目的としていた。営利目的の使用は有償。
BLASTとBLATを試してみます。
! head seq/chr22.hg19.fa
>chr22 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
! makeblastdb -in seq/chr22.hg19.fa -dbtype nucl
Building a new DB, current time: 04/12/2018 22:18:29 New DB name: /Users/yasumizuyoshiaki/bioinformatics/bono/seq/chr22.hg19.fa New DB title: seq/chr22.hg19.fa Sequence type: Nucleotide Deleted existing Nucleotide BLAST database named /Users/yasumizuyoshiaki/bioinformatics/bono/seq/chr22.hg19.fa Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 1 sequences in 0.666504 seconds.
!ls seq
L07770.1.fasta chr22.hg19.fa chr22.hg19.fa.nin dgcr6.fa U23808.2.fasta chr22.hg19.fa.nhr chr22.hg19.fa.nsq
今回はDiGeorge症候群に関連するDGCR6のmRNAから配列を取ってきました。
! head seq/dgcr6.fa
>NM_005675.4 Homo sapiens DiGeorge syndrome critical region gene 6 (DGCR6), mRNA CGCGAGGCGCGCCGCGATCGGGGACTGTCCTAAGACGGGCGGGGCGCGCTGCGCTAGGGACTGTCATAAA AGGGGCGGGACGCGCCGCGGTCGGGATGACGTGAGCTGGGGGCGCTCGTCGCTGCAGCCGGCGGCTAGCG GGCGTCCGCGCCATGGAGCGCTACGCGGGCGCCTTGGAGGAGGTGGCGGACGGTGCCCGGCAGCAGGAGC GACACTACCAGCTGCTGTCGGCGTTACAGAGCCTGGTGAAGGAGTTGCCCAGCTCATTCCAGCAGCGCTT GTCCTACACCACGCTGAGCGACCTGGCCCTGGCGCTTCTCGACGGCACCGTGTTCGAAATCGTGCAGGGG CTACTGGAGATCCAGCACCTCACCGAAAAGAGCCTGTACAACCAGCGCCTGCGCCTACAGAACGAGCATC GAGTGCTCAGGCAGGCGCTGCGGCAGAAGCACCAGGAAGCCCAGCAGGCCTGCCGGCCCCATAACCTGCC TGTGCTTCAGGCGGCTCAGCAGCGAGAACTAGAGGCGGTGGAGCACCGGATCCGTGAGGAGCAGCGGGCG ATGGACCAGAAGATCGTCCTGGAGCTGGACCGGAAGGTGGCTGACCAGCAGAGCACACTGGAGAAGGCGG
! blastn -query seq/dgcr6.fa -db seq/chr22.hg19.fa
BLASTN 2.6.0+ Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14. Database: seq/chr22.hg19.fa 1 sequences; 51,304,566 total letters Query= NM_005675.4 Homo sapiens DiGeorge syndrome critical region gene 6 (DGCR6), mRNA Length=1245 Score E Sequences producing significant alignments: (Bits) Value chr22 1020 0.0 > chr22 Length=51304566 Score = 1020 bits (552), Expect = 0.0 Identities = 552/552 (100%), Gaps = 0/552 (0%) Strand=Plus/Plus Query 663 CAGGAGCTGATGCTGCAGATGAACCTGCTGGAACTCATCCGGAAGCTGCAGCAGAGGGGC 722 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899050 CAGGAGCTGATGCTGCAGATGAACCTGCTGGAACTCATCCGGAAGCTGCAGCAGAGGGGC 18899109 Query 723 TGCTGGGCAGGGAAGGCAGCCCTGGGGCTAGGAGGTCCCTGGCAGTTGCCTGCTGCCCAG 782 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899110 TGCTGGGCAGGGAAGGCAGCCCTGGGGCTAGGAGGTCCCTGGCAGTTGCCTGCTGCCCAG 18899169 Query 783 TGTGACCAGAAAGGCAGCCCTGTCCCACCATAGCCACAGGCAGCAGAAGTCTGGGCAGAG 842 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899170 TGTGACCAGAAAGGCAGCCCTGTCCCACCATAGCCACAGGCAGCAGAAGTCTGGGCAGAG 18899229 Query 843 TTCATCTTCTTGACCTTTGGCCACTGCCTTCCCAGCTGCCCGCAGGGGGTTCCCCCTGCT 902 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899230 TTCATCTTCTTGACCTTTGGCCACTGCCTTCCCAGCTGCCCGCAGGGGGTTCCCCCTGCT 18899289 Query 903 GAGGAGAGACCAGGTGGACCCCAGCTGCCTGTCACCCTTCATCTGGGACTTGCTGTCAAA 962 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899290 GAGGAGAGACCAGGTGGACCCCAGCTGCCTGTCACCCTTCATCTGGGACTTGCTGTCAAA 18899349 Query 963 CCCTAGGATAGTCTCATAAAGGGGAGGCTGGGCCAGCCTGCTGCTGTCTGCTTCAGGGCC 1022 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899350 CCCTAGGATAGTCTCATAAAGGGGAGGCTGGGCCAGCCTGCTGCTGTCTGCTTCAGGGCC 18899409 Query 1023 AGGCAGAGAGTGAGGCTGGGGGTTCTCACACCTTACTCCACCGGGCACATCCCAACCTGC 1082 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899410 AGGCAGAGAGTGAGGCTGGGGGTTCTCACACCTTACTCCACCGGGCACATCCCAACCTGC 18899469 Query 1083 ACTGGGGCCCACTCGAGCGCTTGTTCTGGTCTCAGCCGCTCCCTTGGCAGCTGCAGCCCC 1142 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899470 ACTGGGGCCCACTCGAGCGCTTGTTCTGGTCTCAGCCGCTCCCTTGGCAGCTGCAGCCCC 18899529 Query 1143 CATGCAGAAGAGGCTCCCAGGCCCAAGCTCTGTGTGACCCAGAGAAATAAAGATGCCTCA 1202 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18899530 CATGCAGAAGAGGCTCCCAGGCCCAAGCTCTGTGTGACCCAGAGAAATAAAGATGCCTCA 18899589 Query 1203 GTGTGGCCCGCa 1214 |||||||||||| Sbjct 18899590 GTGTGGCCCGCA 18899601 Score = 970 bits (525), Expect = 0.0 Identities = 541/549 (99%), Gaps = 0/549 (0%) Strand=Plus/Minus Query 663 CAGGAGCTGATGCTGCAGATGAACCTGCTGGAACTCATCCGGAAGCTGCAGCAGAGGGGC 722 ||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||| Sbjct 20302350 CAGGAGCTGATGCTGCAGATGAACCTGCTGGAACTCATCCGAAAGCTGCAGCAGAGGGGC 20302291 Query 723 TGCTGGGCAGGGAAGGCAGCCCTGGGGCTAGGAGGTCCCTGGCAGTTGCCTGCTGCCCAG 782 ||| |||||||||| ||||||||||| || |||||||||||||||| ||||||||||||| Sbjct 20302290 TGCCGGGCAGGGAATGCAGCCCTGGGACTGGGAGGTCCCTGGCAGTCGCCTGCTGCCCAG 20302231 Query 783 TGTGACCAGAAAGGCAGCCCTGTCCCACCATAGCCACAGGCAGCAGAAGTCTGGGCAGAG 842 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20302230 TGTGACCAGAAAGGCAGCCCTGTCCCACCATAGCCACAGGCAGCAGAAGTCTGGGCAGAG 20302171 Query 843 TTCATCTTCTTGACCTTTGGCCACTGCCTTCCCAGCTGCCCGCAGGGGGTTCCCCCTGCT 902 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20302170 TTCATCTTCTTGACCTTTGGCCACTGCCTTCCCAGCTGCCCGCAGGGGGTTCCCCCTGCT 20302111 Query 903 GAGGAGAGACCAGGTGGACCCCAGCTGCCTGTCACCCTTCATCTGGGACTTGCTGTCAAA 962 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20302110 GAGGAGAGACCAGGTGGACCCCAGCTGCCTGTCACCCTTCATCTGGGACTTGCTGTCAAA 20302051 Query 963 CCCTAGGATAGTCTCATAAAGGGGAGGCTGGGCCAGCCTGCTGCTGTCTGCTTCAGGGCC 1022 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||| || Sbjct 20302050 CCCTAGGATAGTCTCATAAAGGGGAGGCTGGGCCAGCCTGCTGCTGTCTGCTTCAGGACC 20301991 Query 1023 AGGCAGAGAGTGAGGCTGGGGGTTCTCACACCTTACTCCACCGGGCACATCCCAACCTGC 1082 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20301990 AGGCAGAGAGTGAGGCTGGGGGTTCTCACACCTTACTCCACCGGGCACATCCCAACCTGC 20301931 Query 1083 ACTGGGGCCCACTCGAGCGCTTGTTCTGGTCTCAGCCGCTCCCTTGGCAGCTGCAGCCCC 1142 |||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20301930 ACTGGGGCCCACCCGAGCGCTTGTTCTGGTCTCAGCCGCTCCCTTGGCAGCTGCAGCCCC 20301871 Query 1143 CATGCAGAAGAGGCTCCCAGGCCCAAGCTCTGTGTGACCCAGAGAAATAAAGATGCCTCA 1202 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20301870 CATGCAGAAGAGGCTCCCAGGCCCAAGCTCTGTGTGACCCAGAGAAATAAAGATGCCTCA 20301811 Query 1203 GTGTGGCCC 1211 ||||||||| Sbjct 20301810 GTGTGGCCC 20301802 Score = 484 bits (262), Expect = 7e-136 Identities = 262/262 (100%), Gaps = 0/262 (0%) Strand=Plus/Plus Query 1 CGCGAGGCGCGCCGCGATCGGGGACTGTCCTAAGACGGGCGGGGCGCGCTGCGCTAGGGA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18893736 CGCGAGGCGCGCCGCGATCGGGGACTGTCCTAAGACGGGCGGGGCGCGCTGCGCTAGGGA 18893795 Query 61 CTGTCATAAAAGGGGCGGGACGCGCCGCGGTCGGGATGACGTGAGCTGGGGGCGCTCGTC 120 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18893796 CTGTCATAAAAGGGGCGGGACGCGCCGCGGTCGGGATGACGTGAGCTGGGGGCGCTCGTC 18893855 Query 121 GCTGCAGCCGGCGGCTAGCGGGCGTCCGCGCCATGGAGCGCTACGCGGGCGCCTTGGAGG 180 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18893856 GCTGCAGCCGGCGGCTAGCGGGCGTCCGCGCCATGGAGCGCTACGCGGGCGCCTTGGAGG 18893915 Query 181 AGGTGGCGGACGGTGCCCGGCAGCAGGAGCGACACTACCAGCTGCTGTCGGCGTTACAGA 240 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18893916 AGGTGGCGGACGGTGCCCGGCAGCAGGAGCGACACTACCAGCTGCTGTCGGCGTTACAGA 18893975 Query 241 GCCTGGTGAAGGAGTTGCCCAG 262 |||||||||||||||||||||| Sbjct 18893976 GCCTGGTGAAGGAGTTGCCCAG 18893997 Score = 339 bits (183), Expect = 5e-92 Identities = 224/244 (92%), Gaps = 2/244 (1%) Strand=Plus/Minus Query 21 GGGACTGTCCTAAGA-CGGGCGGGGCGCGCTGCGCTAGGGACTGTCATAAAAGGGGCGGG 79 ||||||||| ||| | |||||| |||||| |||| ||||| || ||||||||||||| Sbjct 20307645 GGGACTGTCGTAAAAGGGGGCGGAGCGCGCCGCGCGTGGGACAATCGTAAAAGGGGCGGG 20307586 Query 80 ACGCGCCGCGGTCGGGATGACGTGA-GCTGGGGGCGCTCGTCGCTGCAGCCGGCGGCTAG 138 |||||||||| |||||||| ||||| |||||||| ||||||||| || |||||||||||| Sbjct 20307585 ACGCGCCGCGCTCGGGATGTCGTGAAGCTGGGGGAGCTCGTCGCCGCCGCCGGCGGCTAG 20307526 Query 139 CGGGCGTCCGCGCCATGGAGCGCTACGCGGGCGCCTTGGAGGAGGTGGCGGACGGTGCCC 198 |||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||| Sbjct 20307525 CGGGCGTCCGCGCCATGGAGCGCTACGCGGCCGCCTTGGAGGAGGTGGCGGACGGTGCCC 20307466 Query 199 GGCAGCAGGAGCGACACTACCAGCTGCTGTCGGCGTTACAGAGCCTGGTGAAGGAGTTGC 258 ||||||||||||||||||||||| ||||||||||| |||||||||||||||||||||||| Sbjct 20307465 GGCAGCAGGAGCGACACTACCAGTTGCTGTCGGCGCTACAGAGCCTGGTGAAGGAGTTGC 20307406 Query 259 CCAG 262 |||| Sbjct 20307405 CCAG 20307402 Score = 303 bits (164), Expect = 2e-81 Identities = 167/168 (99%), Gaps = 1/168 (1%) Strand=Plus/Plus Query 260 CAGCTCATTCCAGCAGCGCTTGTCCTACACCACGCTGAGCGACCTGGCCCTGGCGCTTCT 319 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18894075 CAGCTCATTCCAGCAGCGCTTGTCCTACACCACGCTGAGCGACCTGGCCCTGGCGCTTCT 18894134 Query 320 CGACGGCACCGTGTTCGAAATCGTGCAGGGGCTACTGGAGATCCAGCACCTCACCGAAAA 379 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18894135 CGACGGCACCGTGTTCGAAATCGTGCAGGGGCTACTGGAGATCCAGCACCTCACCGAAAA 18894194 Query 380 GAGCCTGTACAACCAGCGCCTGCGCCTACAGAACGAGCATCGAG-TGC 426 |||||||||||||||||||||||||||||||||||||||||||| ||| Sbjct 18894195 GAGCCTGTACAACCAGCGCCTGCGCCTACAGAACGAGCATCGAGGTGC 18894242 Score = 281 bits (152), Expect = 9e-75 Identities = 163/168 (97%), Gaps = 1/168 (1%) Strand=Plus/Minus Query 260 CAGCTCATTCCAGCAGCGCTTGTCCTACACCACGCTGAGCGACCTGGCCCTGGCGCTTCT 319 |||||| |||||||||||| |||||||||||||||| ||||||||||||||||||||||| Sbjct 20307325 CAGCTCTTTCCAGCAGCGCCTGTCCTACACCACGCTCAGCGACCTGGCCCTGGCGCTTCT 20307266 Query 320 CGACGGCACCGTGTTCGAAATCGTGCAGGGGCTACTGGAGATCCAGCACCTCACCGAAAA 379 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20307265 CGACGGCACCGTGTTCGAAATCGTGCAGGGGCTACTGGAGATCCAGCACCTCACCGAAAA 20307206 Query 380 GAGCCTGTACAACCAGCGCCTGCGCCTACAGAACGAGCATCGAG-TGC 426 ||||||||||||||||||||||||||||||||||||||| |||| ||| Sbjct 20307205 GAGCCTGTACAACCAGCGCCTGCGCCTACAGAACGAGCACCGAGGTGC 20307158 Score = 267 bits (144), Expect = 3e-70 Identities = 144/144 (100%), Gaps = 0/144 (0%) Strand=Plus/Plus Query 523 AGGCGGTGGAGCACCGGATCCGTGAGGAGCAGCGGGCGATGGACCAGAAGATCGTCCTGG 582 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18898399 AGGCGGTGGAGCACCGGATCCGTGAGGAGCAGCGGGCGATGGACCAGAAGATCGTCCTGG 18898458 Query 583 AGCTGGACCGGAAGGTGGCTGACCAGCAGAGCACACTGGAGAAGGCGGGGGTGGCTGGCT 642 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18898459 AGCTGGACCGGAAGGTGGCTGACCAGCAGAGCACACTGGAGAAGGCGGGGGTGGCTGGCT 18898518 Query 643 TCTACGTGACCACCAACCCACAGG 666 |||||||||||||||||||||||| Sbjct 18898519 TCTACGTGACCACCAACCCACAGG 18898542 Score = 250 bits (135), Expect = 3e-65 Identities = 141/144 (98%), Gaps = 0/144 (0%) Strand=Plus/Minus Query 523 AGGCGGTGGAGCACCGGATCCGTGAGGAGCAGCGGGCGATGGACCAGAAGATCGTCCTGG 582 |||| ||||| |||||||||||||||||||||||||||||||||||||||||| |||||| Sbjct 20303001 AGGCCGTGGAACACCGGATCCGTGAGGAGCAGCGGGCGATGGACCAGAAGATCATCCTGG 20302942 Query 583 AGCTGGACCGGAAGGTGGCTGACCAGCAGAGCACACTGGAGAAGGCGGGGGTGGCTGGCT 642 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20302941 AGCTGGACCGGAAGGTGGCTGACCAGCAGAGCACACTGGAGAAGGCGGGGGTGGCTGGCT 20302882 Query 643 TCTACGTGACCACCAACCCACAGG 666 |||||||||||||||||||||||| Sbjct 20302881 TCTACGTGACCACCAACCCACAGG 20302858 Score = 193 bits (104), Expect = 4e-48 Identities = 104/104 (100%), Gaps = 0/104 (0%) Strand=Plus/Plus Query 422 AGTGCTCAGGCAGGCGCTGCGGCAGAAGCACCAGGAAGCCCAGCAGGCCTGCCGGCCCCA 481 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 18897683 AGTGCTCAGGCAGGCGCTGCGGCAGAAGCACCAGGAAGCCCAGCAGGCCTGCCGGCCCCA 18897742 Query 482 TAACCTGCCTGTGCTTCAGGCGGCTCAGCAGCGAGAACTAGAGG 525 |||||||||||||||||||||||||||||||||||||||||||| Sbjct 18897743 TAACCTGCCTGTGCTTCAGGCGGCTCAGCAGCGAGAACTAGAGG 18897786 Score = 182 bits (98), Expect = 1e-44 Identities = 102/104 (98%), Gaps = 0/104 (0%) Strand=Plus/Minus Query 422 AGTGCTCAGGCAGGCGCTGCGGCAGAAGCACCAGGAAGCCCAGCAGGCCTGCCGGCCCCA 481 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 20303744 AGTGCTCAGGCAGGCGCTGCGGCAGAAGCACCAGGAAGCCCAGCAGGCCTGCCGGCCCCA 20303685 Query 482 TAACCTGCCTGTGCTTCAGGCGGCTCAGCAGCGAGAACTAGAGG 525 |||||||||||| |||||||||||||||||||||||||||||| Sbjct 20303684 CAACCTGCCTGTGGTTCAGGCGGCTCAGCAGCGAGAACTAGAGG 20303641 Lambda K H 1.33 0.621 1.12 Gapped Lambda K H 1.28 0.460 0.850 Effective search space used: 62540234260 Database: seq/chr22.hg19.fa Posted date: Apr 12, 2018 10:18 PM Number of letters in database: 51,304,566 Number of sequences in database: 1 Matrix: blastn matrix 1 -2 Gap Penalties: Existence: 0, Extension: 2.5
! blastn -query seq/dgcr6.fa -db seq/chr22.hg19.fa -outfmt 7 -out blast.out.txt
! head blast.out.txt
# BLASTN 2.6.0+ # Query: NM_005675.4 Homo sapiens DiGeorge syndrome critical region gene 6 (DGCR6), mRNA # Database: seq/chr22.hg19.fa # Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score # 10 hits found NM_005675.4 chr22 100.000 552 0 0 663 1214 18899050 18899601 0.0 1020 NM_005675.4 chr22 98.543 549 8 0 663 1211 20302350 20301802 0.0 970 NM_005675.4 chr22 100.000 262 0 0 1 262 18893736 18893997 6.52e-136 484 NM_005675.4 chr22 91.803 244 18 2 21 262 20307645 20307402 5.37e-92 339 NM_005675.4 chr22 99.405 168 0 1 260 426 18894075 18894242 1.96e-81 303
query\DB | 塩基配列 | アミノ酸配列 |
---|---|---|
塩基配列 | blastn, tblastx | blastx |
アミノ酸配列 | tblastn | blastp |
NCBIのwebサイトやbiopythonを使ってpythonプログラムに組み込むことも可能。
!blat seq/chr22.hg19.fa seq/dgcr6.fa output.psl
Loaded 51304566 letters in 1 sequences Searched 1245 bases in 1 sequences
!head output.psl
psLayout version 3 match mis- rep. N's Q gap Q gap T gap T gap strand Q Q Q Q T T T T block blockSizes qStarts tStarts match match count bases count bases name size start end name size start end count --------------------------------------------------------------------------------------------------------------------------------------------------------------- 1214 0 0 0 0 0 4 4652 + NM_005675.4 1245 0 1214 chr22 51304566 18893735 18899601 5 262,161,101,141,549, 0,262,423,524,665, 18893735,18894077,18897684,18898400,18899052, 1120 25 0 0 0 0 5 4652 - NM_005675.4 1245 66 1211 chr22 51304566 20301801 20307598 6 546,141,101,161,158,38, 34,580,721,822,983,1141, 20301801,20302858,20303641,20307161,20307401,20307560,
ラウス肉腫ウイルスが宿主のニワトリから持ち出したと考えられる配列をニワトリゲノムに対して検索する、といったことも可能である。
多重配列アラインメント(multiple sequence alignment)は3本以上の配列を並べて、出来る限りギャップをいれないようにして、似たアミノ酸(もしくは塩基)を並べる手法である。アラインメントした結果から分子系統樹を推定し、その結果からアラインメントを更に改良するという作業が行われる。clustal-omegaやmafftが使われる。どちらともwebインターフェースとしても公開されている。
多重配列アラインメントの結果はJalview等のソフトウェアで可視化出来る。詳しくは書籍もしくはtogo tvを参照。
BWAとBowtieはNGSから出てきたデータをreference genome配列にマッピングすることに特化したソフトウェア。検索する対象DB(reference genome配列)が予めBurrows Wheeler Transform(BWT)で処理されており、そのため高速に、短い配列に対しても処理を行うことが出来る。
マッピングの流れは基本的に以下。
condaがおすすめ。個人的にmacでマッピングはしたくない。condaだとsudo権限のないlinuxでも簡単にインストール出来る。
$ conda install -c bioconda bowtie
$ conda install -c bioconda bwa
bwaは伝統的にゲノム解析で使われる印象。bowtieは少し古く、bowtie2、更にはhisat、hisat2へと開発が進んでいる。さらに、STARやLASTなど様々な選択肢がある。トランスクリプトーム解析だと最近ではkallistoやsalmonなど、pseudo-alignmentと呼ばれる高速でトランスクリプトームの定量まで行えるソフトウェアが人気。どのワークフローが良いかは日夜議論の対象となっている。
出典 Mohammad, S. et al. Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis. doi:10.1038/s41467-017-00050-4
上述のBWAやBowtieと同様の検索をGoogle検索のように行えるDBCLSのサービス。CRISPR/CasのガイドRNA設計ツールにも応用されている。
! curl http://gggenome.dbcls.jp/hg38/1/AGGTVANNTGACCT.gff
### ERROR : searcher timed out ###
最近の次世代シーケンサーが長いreadが読めるようになったとはいえ、一本で完全長のDNAなどを読むのはまだ難しい。そこで、次世代シーケンサーによって得られたreadをコンピューター上でつなぎ合わせて再構築する。それがAssemble。
Trinityはguid referenceなしのアセンブルに用いられる。あまり解析の進んでいない生物にくわえ、ヒトにおけるfusion geneやvirusゲノムの組み込みなどを解析する際にも使える。癌ゲノムに焦点を当てたTrinity Cancer Transcriptome Analysis Toolkit( https://github.com/NCIP/Trinity_CTAT/wiki )なども公開されている。guid referenceが信頼できる場合はStringtieを使うこともできる。
文法を極力単純化してコードの可読性を高め、読みやすく、また書きやすくしてプログラマの作業性とコードの信頼性を高めることを重視してデザインされた、汎用の高水準言語である。 wikipedia 「Python」より
フリーかつオープンソースの言語。バイオインフォマティクスだけではなく幅広いコミュニティをもち、web構築やGUI構築、科学演算、機械学習、データベース、ひいてはドローンの操作まで出来る万能言語。
様々なところでRとpythonの比較はされている。
統計やバイオインフォマティクス特化ならRで十分。DESeq2など高機能ライブラリもRで用意されている。最新の機械学習やドローンなど、他のこともしたい欲があるならpythonがおすすめ。juliaもじわじわ人気。
まとめてpythonでやってみましょう。 データはImmuno-NavigatorからTregのマイクロアレイデータをダウンロードした。
Vandenbon, A. et al. Immuno-Navigator, a batch-corrected coexpression database, reveals cell type-specific gene networks in the immune system. Proc. Natl. Acad. Sci. U. S. A. 113, E2393-402 (2016).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
# immuno-NavigatorからダウンロードしたTregの発現ファイルを読み込み。
df_exp = pd.read_csv('data/expression_3_human.txt', sep='\t')
# Affymetrixのprobe IDの詳細データをダウンロード。これを使って発現テーブルのProbe IDを置き換える。
df_annot = pd.read_csv('data/HG-U133_Plus_2-na36-annot-csv/HG-U133_Plus_2.na36.annot.csv', comment='#')
df_annot = df_annot[['Probe Set ID', 'Gene Symbol']]
df_annot.columns = ['ProbeID', 'GeneSymbol']
df_exp['ProbeID'] = df_exp.index
df_exp = pd.merge(df_exp, df_annot, on='ProbeID', how='inner')
df_exp = df_exp.drop(['ProbeID'], axis=1)
# 一つの遺伝子に複数のProbeが割り当てられている事があるので、平均を取った。
df_exp = df_exp.groupby('GeneSymbol').mean()
df_exp = df_exp[df_exp.index != '---']
df_exp
GSM266974.CEL | GSM266967.CEL | GSM266970.CEL | GSM266982.CEL | GSM266988.CEL | GSM266972.CEL | GSM266985.CEL | GSM266986.CEL | GSM266975.CEL | GSM266966.CEL | ... | GSM723820.CEL | GSM723824.CEL | GSM723819.CEL | GSM723823.CEL | GSM723825.CEL | GSM723822.CEL | MKLE_133-2_EXP2-TREG-TUBE2_071111_HG-U133_Plus_2.CEL | MKLE_133-2_EXP1-TREG-TUBE14_070511_HG-U133_Plus_2.CEL | MKLE_133-2_EXP1-TREG-P-TUBE18_070511_HG-U133_Plus_2.CEL | MKLE_133-2_EXP2-TREG-P-TUBE6_070511_HG-U133_Plus_2.CEL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GeneSymbol | |||||||||||||||||||||
A1BG | 6.329995 | 6.672491 | 6.554513 | 6.370591 | 6.439727 | 6.030856 | 5.976589 | 6.344407 | 6.223625 | 6.300893 | ... | 5.781228 | 6.500306 | 6.572795 | 6.519515 | 6.458826 | 6.598172 | 6.231950 | 6.915423 | 6.154872 | 6.427209 |
A1BG-AS1 | 5.037404 | 5.686309 | 6.483032 | 5.350224 | 5.415077 | 5.731335 | 5.595510 | 5.958698 | 5.565518 | 5.551137 | ... | 4.918857 | 5.826912 | 5.532381 | 5.621202 | 4.893040 | 5.537691 | 5.401727 | 5.964308 | 5.738294 | 5.632629 |
A1CF | 4.384173 | 4.398225 | 4.504269 | 4.487556 | 4.027412 | 4.217721 | 3.970703 | 4.053616 | 3.863796 | 4.280985 | ... | 4.041906 | 4.025384 | 3.958178 | 4.084334 | 4.220860 | 4.138160 | 4.321285 | 4.304278 | 4.133247 | 4.441416 |
A2M | 5.257580 | 5.592144 | 5.431512 | 5.122258 | 4.889085 | 5.122039 | 5.090914 | 4.710800 | 5.149762 | 5.052591 | ... | 5.051330 | 5.554665 | 5.037134 | 5.237206 | 5.497572 | 5.357655 | 5.624329 | 5.253649 | 4.745921 | 5.510778 |
A2M-AS1 | 6.906687 | 5.845524 | 4.045360 | 5.894904 | 6.369392 | 6.136523 | 7.209754 | 6.270830 | 7.231036 | 6.375537 | ... | 6.507730 | 5.662840 | 6.469428 | 6.107442 | 6.191663 | 6.255368 | 7.022739 | 6.088032 | 5.814550 | 5.900342 |
A2ML1 | 4.218624 | 4.148094 | 4.327877 | 4.155352 | 4.042741 | 4.051983 | 3.834966 | 4.365218 | 3.862751 | 3.912596 | ... | 4.226010 | 4.128961 | 4.134998 | 4.138744 | 3.924122 | 4.110336 | 4.417305 | 4.147809 | 4.196089 | 4.134112 |
A2MP1 | 5.258512 | 5.276884 | 3.778824 | 4.896070 | 4.408941 | 4.567173 | 5.295994 | 4.623203 | 4.594767 | 4.674247 | ... | 4.080358 | 4.798134 | 4.171829 | 4.583593 | 4.774224 | 5.165135 | 4.243110 | 4.558537 | 4.952115 | 5.188502 |
A4GALT | 6.321879 | 5.804474 | 7.195217 | 6.363988 | 6.128443 | 5.523223 | 6.045786 | 6.558648 | 6.094470 | 6.182731 | ... | 5.794614 | 5.987141 | 6.391888 | 6.051665 | 5.976650 | 5.923692 | 5.933604 | 6.429013 | 6.273170 | 6.567489 |
A4GNT | 4.333680 | 4.474698 | 4.498369 | 4.754718 | 4.213597 | 4.231558 | 4.332090 | 4.182145 | 4.062237 | 4.208773 | ... | 4.306044 | 4.497004 | 4.618224 | 4.693414 | 4.259028 | 4.520218 | 4.237504 | 4.450787 | 4.334868 | 4.539352 |
AA06 | 5.393698 | 4.665403 | 4.380906 | 4.930437 | 4.771812 | 4.538719 | 4.848835 | 4.383819 | 5.036763 | 4.450124 | ... | 4.788990 | 4.940413 | 4.638221 | 5.078762 | 4.961179 | 5.140533 | 4.580044 | 5.145684 | 4.442799 | 4.528140 |
AAAS | 7.484228 | 7.285323 | 7.393733 | 7.372531 | 6.956579 | 7.434012 | 7.355164 | 6.789198 | 7.119065 | 7.328291 | ... | 6.957142 | 7.488046 | 7.428081 | 7.781621 | 7.334754 | 7.758605 | 6.609913 | 7.128531 | 7.486500 | 7.323181 |
AACS | 6.378566 | 6.970323 | 6.365766 | 6.529469 | 6.626753 | 6.575329 | 6.558108 | 6.205369 | 7.171728 | 6.582380 | ... | 6.352892 | 6.286587 | 6.504212 | 6.301055 | 6.306702 | 6.624690 | 6.651286 | 7.232763 | 6.190090 | 6.156982 |
AACSP1 | 3.145278 | 2.975717 | 3.096689 | 3.062645 | 3.085016 | 2.691928 | 2.689060 | 3.336866 | 3.089322 | 2.992623 | ... | 3.100575 | 2.965334 | 3.071555 | 3.131415 | 2.963740 | 3.091258 | 3.242600 | 2.965728 | 2.923044 | 3.401935 |
AADAC | 4.846279 | 4.156425 | 5.179911 | 4.457879 | 4.481283 | 4.275923 | 4.432578 | 4.243737 | 4.543590 | 4.292354 | ... | 4.475644 | 4.397545 | 4.486347 | 4.436835 | 4.111668 | 4.320535 | 4.911669 | 4.326585 | 4.451919 | 4.563897 |
AADACL2 | 3.121081 | 3.474939 | 3.579743 | 3.455728 | 3.152071 | 2.925857 | 2.969151 | 3.197674 | 3.480939 | 3.486558 | ... | 3.081818 | 3.381752 | 3.348910 | 2.801880 | 3.241722 | 3.625903 | 3.113075 | 3.149511 | 3.248608 | 3.013602 |
AADACP1 | 3.299814 | 3.214657 | 3.278485 | 3.546459 | 3.255058 | 3.525315 | 3.334051 | 3.311696 | 3.625368 | 2.793833 | ... | 3.368335 | 3.706438 | 3.073400 | 3.537613 | 3.221581 | 3.570733 | 3.611234 | 3.050288 | 3.214717 | 3.262152 |
AADAT | 3.866535 | 4.000587 | 3.863209 | 3.769423 | 3.934697 | 3.566442 | 3.662077 | 3.957041 | 3.925848 | 3.564020 | ... | 3.710834 | 4.018663 | 3.736145 | 3.816661 | 3.617059 | 3.725673 | 4.331902 | 3.899052 | 3.567507 | 3.981494 |
AAED1 | 7.770825 | 7.098003 | 7.443736 | 6.741960 | 8.518507 | 7.632394 | 8.180012 | 8.054782 | 8.475554 | 7.033080 | ... | 8.330915 | 7.526159 | 7.813668 | 7.613171 | 7.697455 | 7.618173 | 8.008320 | 7.960115 | 7.769471 | 7.836551 |
AAGAB | 7.098844 | 7.436299 | 7.064280 | 7.383824 | 7.254084 | 7.344425 | 7.308551 | 7.037541 | 7.157243 | 7.088123 | ... | 7.244505 | 7.277905 | 6.518174 | 7.140176 | 6.974104 | 6.849210 | 7.098186 | 6.992624 | 7.332086 | 7.595613 |
AAK1 | 6.981258 | 6.767286 | 6.618398 | 6.758833 | 6.857997 | 6.984558 | 6.925797 | 6.993195 | 6.841952 | 6.822264 | ... | 7.071926 | 6.846925 | 6.756679 | 7.090422 | 6.897952 | 6.890690 | 6.858001 | 6.902427 | 6.668177 | 6.457301 |
AAMDC | 5.160006 | 5.324996 | 5.159385 | 5.398079 | 5.174976 | 5.379032 | 5.541379 | 5.431284 | 5.056683 | 5.262009 | ... | 5.531084 | 5.164038 | 5.436543 | 5.210015 | 5.320398 | 5.316070 | 5.211888 | 5.586485 | 5.257232 | 5.245482 |
AAMP | 7.166707 | 7.829989 | 6.709885 | 6.956228 | 7.471870 | 8.035248 | 7.469542 | 7.332768 | 7.126895 | 7.593800 | ... | 7.206061 | 7.583037 | 7.930606 | 7.857851 | 7.872428 | 7.845199 | 7.161119 | 7.073605 | 7.812235 | 7.758489 |
AANAT | 3.694477 | 3.484826 | 4.684714 | 3.801925 | 3.689042 | 3.716612 | 3.413324 | 3.718113 | 3.376590 | 3.517447 | ... | 3.706694 | 3.890034 | 3.637831 | 3.620472 | 3.757300 | 4.024139 | 3.371585 | 4.069667 | 3.429284 | 3.456324 |
AAR2 | 7.877142 | 8.085598 | 7.132338 | 7.930943 | 8.252844 | 7.944194 | 7.854837 | 8.126661 | 8.300647 | 7.639179 | ... | 8.015395 | 8.163611 | 8.388253 | 8.274179 | 8.178604 | 8.178095 | 7.631466 | 7.830055 | 8.281054 | 8.187867 |
AARS | 8.144657 | 8.753102 | 7.837781 | 7.968016 | 8.193309 | 9.092173 | 8.951710 | 8.091970 | 8.058308 | 9.020546 | ... | 8.125359 | 8.424393 | 8.630149 | 8.817591 | 8.887776 | 8.996524 | 7.736918 | 8.288400 | 8.981580 | 8.912190 |
AARS2 | 6.579071 | 6.678795 | 6.255364 | 6.537113 | 6.850929 | 6.786886 | 7.166372 | 6.838984 | 6.682881 | 6.682269 | ... | 6.617070 | 6.556515 | 6.934701 | 6.767751 | 6.694380 | 6.857485 | 6.256266 | 6.719111 | 7.103893 | 6.790279 |
AARSD1 /// PTGES3L-AARSD1 | 6.489105 | 7.319580 | 5.668359 | 6.370102 | 6.896480 | 6.707557 | 6.738000 | 6.871708 | 6.689252 | 6.375201 | ... | 6.073023 | 6.971701 | 6.883990 | 7.050118 | 6.582501 | 6.501194 | 5.966756 | 6.372048 | 7.114629 | 6.453428 |
AASDH | 7.074000 | 8.195313 | 7.210623 | 7.514958 | 7.528762 | 8.248522 | 7.957446 | 7.505874 | 7.280934 | 7.976176 | ... | 7.762211 | 7.152091 | 7.999862 | 7.408004 | 7.462255 | 7.795535 | 7.340061 | 7.693434 | 7.468181 | 7.374443 |
AASDHPPT | 7.321247 | 7.100702 | 7.404780 | 7.517359 | 7.610962 | 7.826139 | 7.162326 | 7.427644 | 7.114044 | 7.539108 | ... | 7.344945 | 7.355283 | 6.966186 | 7.273707 | 7.468971 | 7.346118 | 7.337866 | 7.249796 | 7.136639 | 7.419338 |
AASS | 4.745798 | 4.626478 | 5.116449 | 4.634705 | 4.655642 | 4.473566 | 4.273255 | 4.689584 | 4.858475 | 5.069104 | ... | 4.697543 | 4.972510 | 4.533027 | 4.606036 | 4.673496 | 4.234631 | 5.136650 | 4.465306 | 4.709777 | 5.057968 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ZSCAN25 | 6.965716 | 7.343121 | 6.776739 | 7.431333 | 7.158724 | 7.069241 | 7.175228 | 7.062324 | 7.067914 | 7.177454 | ... | 7.438494 | 7.309628 | 7.518539 | 7.111170 | 7.274723 | 7.381359 | 6.800567 | 7.013460 | 6.919259 | 6.967833 |
ZSCAN26 | 6.472607 | 7.525173 | 6.038698 | 6.182118 | 6.456805 | 7.507058 | 7.577202 | 6.629014 | 6.148266 | 7.364820 | ... | 7.551551 | 6.939673 | 7.012030 | 6.865640 | 6.978210 | 6.712314 | 5.842205 | 6.163694 | 7.169735 | 6.785658 |
ZSCAN29 | 6.822445 | 7.991624 | 7.014349 | 7.484403 | 7.293331 | 8.098088 | 8.206188 | 7.358728 | 7.321688 | 8.204116 | ... | 8.102251 | 7.829694 | 8.051798 | 7.815997 | 7.089490 | 7.957736 | 7.418007 | 7.325251 | 7.897328 | 7.853105 |
ZSCAN30 | 5.891297 | 6.100541 | 5.588563 | 5.557731 | 5.806909 | 5.955336 | 5.934812 | 5.557202 | 5.685127 | 5.975555 | ... | 5.802787 | 5.987463 | 5.669433 | 5.852736 | 5.841063 | 5.578973 | 5.848648 | 5.576084 | 5.618770 | 5.763638 |
ZSCAN31 | 4.969071 | 3.803487 | 4.153126 | 4.184705 | 4.563488 | 3.349163 | 3.791677 | 3.647429 | 4.056464 | 4.461130 | ... | 5.464177 | 3.594939 | 4.555461 | 4.011858 | 3.505327 | 4.133448 | 3.337982 | 3.845519 | 4.605612 | 3.783439 |
ZSCAN32 | 6.482495 | 6.891778 | 6.130611 | 6.224822 | 6.673122 | 6.886765 | 6.616539 | 6.275076 | 6.037710 | 6.895340 | ... | 6.599642 | 6.682538 | 6.942645 | 6.744576 | 6.669189 | 6.475831 | 6.125477 | 6.492872 | 6.662209 | 6.854058 |
ZSCAN4 | 3.149061 | 3.590883 | 3.339891 | 3.229534 | 3.444897 | 3.250199 | 3.297852 | 3.555481 | 3.396513 | 3.111150 | ... | 3.439371 | 3.276863 | 3.150498 | 3.095562 | 3.146420 | 3.578157 | 3.055982 | 3.193255 | 3.346913 | 3.235752 |
ZSCAN5A | 4.938788 | 4.960243 | 4.743162 | 4.862338 | 4.764949 | 5.115277 | 4.973510 | 4.894480 | 4.696079 | 4.784316 | ... | 4.885687 | 4.972420 | 5.098590 | 4.889080 | 4.389971 | 4.768561 | 4.639562 | 4.910014 | 4.653369 | 4.809885 |
ZSCAN9 | 6.561056 | 7.315718 | 6.598600 | 7.104977 | 7.060993 | 7.247849 | 7.606776 | 6.724439 | 6.246730 | 7.249823 | ... | 7.654670 | 7.312807 | 7.315430 | 7.192130 | 7.119312 | 7.421506 | 6.304430 | 6.550459 | 7.294009 | 7.534429 |
ZSWIM1 | 6.377744 | 6.256704 | 6.443409 | 6.174444 | 5.542088 | 6.291561 | 6.006684 | 6.177934 | 6.187452 | 6.052880 | ... | 6.885835 | 6.175371 | 6.398363 | 6.281831 | 6.200171 | 6.027931 | 6.128073 | 6.022472 | 5.786127 | 5.879002 |
ZSWIM2 | 3.135231 | 3.154574 | 3.104452 | 3.532067 | 3.240295 | 2.849146 | 3.269629 | 3.181506 | 3.253384 | 3.020387 | ... | 3.204598 | 3.328911 | 3.302065 | 3.179092 | 3.315916 | 3.313163 | 3.695217 | 3.248029 | 3.093018 | 3.100363 |
ZSWIM3 | 5.810809 | 6.445546 | 5.801737 | 5.670764 | 5.252791 | 6.448127 | 5.892647 | 5.692403 | 5.636769 | 6.289205 | ... | 6.485304 | 6.286789 | 6.243936 | 6.487948 | 6.188584 | 5.741934 | 5.646498 | 5.809768 | 6.356532 | 5.606267 |
ZSWIM4 | 5.123830 | 4.247439 | 4.935015 | 4.177440 | 4.125815 | 4.142566 | 3.823305 | 4.377297 | 4.034792 | 4.005143 | ... | 3.969417 | 4.325007 | 4.270083 | 4.475759 | 4.239169 | 4.543260 | 4.871726 | 4.783760 | 4.427503 | 4.115403 |
ZSWIM5 | 5.061247 | 4.268369 | 5.620412 | 5.010646 | 5.304072 | 5.250946 | 5.135438 | 5.187186 | 5.385591 | 4.777841 | ... | 5.589952 | 4.984416 | 4.999301 | 4.894395 | 5.267742 | 5.193105 | 4.893966 | 5.111675 | 4.760028 | 4.761519 |
ZSWIM6 | 7.104259 | 8.302044 | 6.819198 | 7.737233 | 7.638553 | 8.412299 | 8.423985 | 7.671966 | 7.570882 | 8.738596 | ... | 8.115822 | 8.003511 | 7.070686 | 8.012586 | 7.896693 | 7.931935 | 7.710851 | 7.720100 | 8.216624 | 8.419844 |
ZSWIM7 | 5.762341 | 6.296414 | 5.195159 | 5.876310 | 6.557895 | 6.621798 | 6.645763 | 6.108089 | 6.020111 | 6.707180 | ... | 6.480891 | 5.920207 | 6.236654 | 6.150594 | 5.615262 | 6.349885 | 6.465304 | 6.343185 | 6.583639 | 5.869975 |
ZSWIM8 | 6.655869 | 6.354241 | 6.745099 | 6.092956 | 6.183647 | 5.979727 | 6.028081 | 6.451701 | 6.280447 | 6.112246 | ... | 6.244308 | 6.375418 | 6.582424 | 6.592328 | 6.024382 | 6.100777 | 6.117906 | 6.923466 | 6.496569 | 6.372735 |
ZUFSP | 7.815441 | 6.824549 | 6.347263 | 7.799579 | 8.210375 | 6.952307 | 7.272167 | 7.805215 | 7.951880 | 7.314708 | ... | 8.168623 | 7.243567 | 7.181108 | 7.251295 | 7.055141 | 6.923462 | 7.794678 | 7.446957 | 7.075280 | 6.902104 |
ZW10 | 6.785843 | 7.834948 | 6.418325 | 6.727571 | 7.233974 | 7.746689 | 7.798367 | 6.920381 | 6.789222 | 7.437107 | ... | 6.582008 | 7.115714 | 7.562147 | 7.279648 | 7.565980 | 7.197110 | 7.033704 | 6.921524 | 7.656918 | 7.565217 |
ZWILCH | 7.201657 | 6.297545 | 6.798789 | 7.659460 | 7.289148 | 7.178505 | 7.198985 | 6.854641 | 7.486535 | 7.487574 | ... | 7.228371 | 6.925207 | 7.324762 | 7.082216 | 6.915969 | 7.206885 | 7.526647 | 7.521560 | 6.728067 | 6.599255 |
ZWINT | 6.248166 | 7.954764 | 7.121643 | 7.078205 | 6.858055 | 7.994592 | 7.582563 | 6.923124 | 6.798789 | 6.876908 | ... | 7.562404 | 6.694565 | 7.465004 | 7.010036 | 6.962721 | 7.015432 | 8.563703 | 7.789426 | 7.296689 | 7.333868 |
ZXDA | 6.602988 | 5.820766 | 6.208023 | 6.766386 | 7.130399 | 6.547887 | 6.243968 | 6.936459 | 7.004140 | 6.985058 | ... | 7.669662 | 6.310258 | 6.646783 | 6.102405 | 6.306645 | 6.533250 | 6.135229 | 6.350061 | 6.286034 | 6.294129 |
ZXDA /// ZXDB | 5.169698 | 4.907419 | 4.954019 | 5.181062 | 5.492337 | 4.575762 | 4.931880 | 5.634828 | 5.371257 | 4.814620 | ... | 5.054535 | 4.769206 | 5.484213 | 4.776155 | 4.945827 | 4.986257 | 4.996734 | 5.559169 | 4.571086 | 4.648404 |
ZXDB | 5.831605 | 5.634411 | 5.567937 | 5.796505 | 6.442235 | 5.254390 | 5.474351 | 6.268193 | 5.887443 | 5.734685 | ... | 6.098971 | 5.255370 | 6.116957 | 5.470177 | 5.542593 | 5.776629 | 5.674837 | 6.041412 | 5.380701 | 5.062694 |
ZXDC | 6.331872 | 5.919160 | 5.781993 | 6.413931 | 6.252410 | 6.393202 | 6.203285 | 6.409323 | 6.478785 | 6.346135 | ... | 6.440188 | 6.001538 | 6.354298 | 6.333354 | 6.079599 | 6.031899 | 6.376990 | 6.241490 | 6.193611 | 5.963847 |
ZYG11A | 5.910114 | 5.272793 | 5.337675 | 5.240433 | 5.099909 | 5.066068 | 4.760521 | 5.002463 | 5.431786 | 4.950386 | ... | 5.669280 | 4.925695 | 5.471463 | 4.913366 | 5.637031 | 5.092262 | 4.605311 | 5.450916 | 5.086130 | 5.653435 |
ZYG11B | 8.675450 | 8.410204 | 8.054727 | 8.710899 | 8.558867 | 9.090438 | 9.143191 | 8.080272 | 8.936341 | 9.006650 | ... | 8.753219 | 8.374556 | 8.744095 | 8.499122 | 8.263795 | 8.594041 | 8.670517 | 8.597402 | 7.923820 | 7.892749 |
ZYX | 6.468198 | 6.532613 | 6.003965 | 6.111791 | 6.519835 | 6.744659 | 6.328306 | 7.069401 | 6.805881 | 6.572609 | ... | 6.356087 | 6.640681 | 6.923513 | 6.907630 | 6.808598 | 7.024192 | 6.912290 | 7.288850 | 6.580884 | 6.160194 |
ZZEF1 | 7.136295 | 6.511970 | 6.227152 | 6.447227 | 7.103101 | 6.457195 | 6.483693 | 6.917548 | 7.162369 | 6.171867 | ... | 6.454512 | 6.744475 | 6.576435 | 6.741442 | 6.697681 | 6.769907 | 6.949340 | 7.129077 | 6.618018 | 6.409453 |
ZZZ3 | 7.958952 | 8.502548 | 7.146590 | 8.289285 | 8.559358 | 8.613209 | 8.818681 | 8.361559 | 8.665872 | 8.577187 | ... | 8.492928 | 8.622157 | 8.445104 | 8.059314 | 8.574227 | 8.443216 | 8.022772 | 7.951454 | 8.449775 | 8.402241 |
22485 rows × 65 columns
# 変動係数を出してみる
(df_exp.std(axis=1) / df_exp.mean(axis=1)).sort_values(ascending=False).head(10)
GeneSymbol EIF1AY 3.306372 IL3 0.716975 IL4 0.687709 IL5 0.604539 KDM5D 0.603147 FASLG 0.534145 USP9Y 0.487174 RPS4Y1 0.480955 IFNG 0.470246 DDX3Y 0.451477 dtype: float64
sns.clustermap(df_exp.loc[['FOXP3', 'CTLA4', 'IL2RA', 'STAT5A', 'STAT5B', 'CD4', 'RORA', 'IL3', 'IL4']],
z_score=0, cmap='bwr', method='ward')
<seaborn.matrix.ClusterGrid at 0x120f51c50>
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE
# PCAの前に正規化したほうが良いです。
# http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
# 正規化
X = StandardScaler().fit_transform(df_exp.T)
# PCA
pca = PCA(n_components=50)
pca.fit(X)
result = pca.transform(X)
# plot
plt.scatter(result[:,0], result[:,1])
# plt.savefig('img/pca.png')
<matplotlib.collections.PathCollection at 0x121f0d7f0>
# 固有ベクトル(Loadings)を分析
df_comp = pd.DataFrame(pca.components_)
df_comp.columns = df_exp.index
df_comp.loc[0].abs().sort_values(ascending=False).head(20)
GeneSymbol TMEM41B 0.016315 OTUB1 0.016224 OGFR 0.016120 OPA1 0.016079 ARHGAP4 0.016032 ABCA5 0.016021 DNASE2 0.015974 SENP6 0.015971 C7orf55-LUC7L2 /// LUC7L2 0.015959 HIGD2A 0.015957 EVA1B 0.015934 MRPL28 0.015931 CORO1B 0.015889 GPAA1 0.015802 TVP23B 0.015796 NAF1 0.015763 ACAP1 0.015735 TSTA3 0.015664 ICAM3 0.015663 MAP2K2 0.015623 Name: 0, dtype: float64
df_comp.loc[1].abs().sort_values(ascending=False).head(20)
GeneSymbol PIGF 0.016535 NUDT5 0.016494 GJA9-MYCBP /// MYCBP 0.016339 PDCD6IP 0.016294 MSH2 0.016248 IPP 0.016052 CYB5B 0.016014 TPST2 0.015695 UNG 0.015691 PRIM1 0.015672 LDAH 0.015669 INTS7 0.015621 HMGN4 0.015610 CBX1 0.015608 KRCC1 0.015592 HSPB11 0.015587 ZNF75D 0.015587 HENMT1 0.015557 DNAJA3 0.015556 RNF20 0.015531 Name: 1, dtype: float64
# 寄与率
plt.bar(range(50),pca.explained_variance_ratio_)
plt.title('explained variance ratio')
plt.xlabel('PC')
Text(0.5,0,'PC')
# plotlyで更にわかりやすく
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# Create a trace
trace = go.Scatter(
x = result[:,0],
y = result[:,1],
mode = 'markers',
text = df_exp.columns,
marker = dict(
color = df_exp.loc['FOXP3'],
size='16',
colorscale='Viridis',
showscale=True,
colorbar = dict(title = 'FOXP3')
)
)
data = [trace]
layout=dict(title='PCA')
fig=dict(data=[trace], layout=layout)
# はみ出るので次のスライドいきます。
# offline.iplot(fig, filename='PCA')
offline.iplot(fig, filename='PCA')
plotly : インタラクティブな可視化が出来るライブラリ。これも熱い。
scikit-bio : alignmentの例で使用。