Chapter 1Introduction1Introduction Gene(基因)Historyn1865 Mendel:The basic unit of inheritance is a gene.lMendels work was forgotten until 1900s.n1944 The gene was known to be made of DNA(Deoxyribonucleic Acid).n1953 James Watson and Francis Crick:Double helical structure of DNA.(雙股螺旋)2Introduction Gene History(Cont.)n1990 The Human Genome Project(人類基 因體計畫)started.n1995 The first free-living organism to be sequenced:haemophilus influenzae(流行性感冒嗜血桿菌)n1998 CELERA joined the gene research.n2000 The human DNA sequence draft was completed(published in 2001).3Bioinformatics-國內相關計畫n2000年國科會生物資訊跨領域研究n2001年國科會國家型研究計畫l基因體醫學國家型計畫n2001年國科會跨領域專題研究l工程處:資訊科技l生物處:生物資訊4動物細胞(細胞核、細胞質、細胞膜)nDNA位於細胞核內之核仁5DNA Double Helix(雙股螺旋)6DNA Double Helix(雙股螺旋)7DNA中核甘酸間之鍵結8核甘酸n核甘酸(Nucleotide)為核酸分子構成單元n核甘酸包含:l五碳糖(去氧核糖,deoxyribose)l磷酸基(phosphate group)l含氮鹼基之一(A、G、C、T、U)胞嘧啶(C)9DNA四種含氮鹼基10DNA Double Helix(雙股螺旋)11DNA Sequence12DNA and RNAnNucleotide(核甘酸):腺嘌呤(adenine,A)鳥糞糞嘌呤(guanine,G)胞嘧啶(cytosine,C)胸腺嘧啶(thymine,T)尿嘧啶(uracil,U)nDNA(deoxyribonucleic acid,去氧核糖核酸)A,G,C,T (base pair:GC,A=T)nRNA(ribonucleic acid,核糖核酸)A,G,C,U (base pair:GC,A=U,GU)13DNA LengthnThe total length of the human DNA is about 3109 (30億)base pairs.n1%1.5%of DNA sequence is useful.n#of human genes:30,00040,000lConclusion from the human genome projectlExpected#is 100,000 originally.14DNA Sequencing(定序)nGiven DNA sequence:TGCACTTGACGCATGCT Cut the sequence after random A:ATGCT length=5 ACGCATGCT length=9 AACGCATGCT length=10 ACTTGAACGCATGCT length=1515DNA Sequencingn電泳法(eletrophoresis)16DNA Sequencing 17Amino Acids(胺基酸)胺基酸:蛋白質的基本單位,共20種18General Structure of an Amino Acid 3 groups:Amino Group(胺基)Carboxyl Group(羧基)R Group(R 基團)19Amino Acids(胺基酸)分子 20Amino Acids(胺基酸)分子21Protein(蛋白質)分子 22Amino Acids and RNA每三個核甘酸(codon,基因密碼)對應至一種胺基酸。
  
                            Second Position of CodonUCAGFirstPositionUUUU Phe FUUC Phe FUUA Leu LUUG Leu LUCU Ser SUCC Ser SUCA Ser SUCG Ser SUAU Tyr YUAC Tyr YUAA Ter endUAG Ter endUGU Cys CUGC Cys CUGA Ter endUGG Trp WUCAGThirdPositionCCUU Leu LCUC Leu LCUA Leu LCUG Leu LCCU Pro PCCC Pro PCCA Pro PCCG Pro PCAU His HCAC His HCAA Gln QCAG Gln QCGU Arg RCGC Arg RCGA Arg RCGG Arg RUCAGAAUU Ile IAUC Ile IAUA Ile IAUG Met MACU Thr TACC Thr TACA Thr TACG Thr TAAU Asn NAAC Asn NAAA Lys KAAG Lys KAGU Ser SAGC Ser SAGA Arg RAGG Arg RUCAGGGUU Val VGUC Val VGUA Val VGUG Val VGCU Ala AGCC Ala AGCA Ala AGCG Ala AGAU Asp DGAC Asp DGAA Glu EGAG Glu EGGU Gly GGGC Gly GGGA Gly GGGG Gly GUCAGAUG is also the“start”codon.23From DNA via RNA to Protein24DNA,Genes and ProteinsnDNA:program for cell processesnProteins:execute cell processesTCCAACGGTGCTGAGGTGCACGeneProteinDNA25Promoter(啟動子)and Gene26 GeneRegulatory ElementRNA polymerase(Protein)Transcription Factor(Protein)DNABy BlanchetteRegulation(調控)of Genes27 GeneRNA polymeraseTranscription Factor(Protein)Regulatory ElementDNABy BlanchetteRegulation of Genes28 GeneRNA polymeraseTranscription FactorRegulatory ElementDNANew proteinBy BlanchetteRegulation of Genes29From DNA via RNA to Protein30From RNA to Protein31From RNA to Protein32Primary Structure(一級結構)of Protein牛的胰島素(一種蛋白質)之胺基酸序列33Secondary Structure(二級結構)of Protein34Tertiary Structure(三級結構)of Protein血紅素分子三級結構35Quaternary Structure(四級結構)of Protein血紅素分子四級結構36Problems on Different Levels37Some Problems in BioinformaticsnSequence comparisonlLongest common subsequencelEdit distancelSimilaritylMultiple sequence alignmentnFragment assembly of DNA sequenceslShortest common superstringnPhysical mappinglDouble digest problemlConsecutive ones problemnEvolutionary treesnMolecular structure predictionlProtein folding38Sequence ComparisonnGoals:lDatabase search:Given a sequence S and a set of sequences G,to find all the sequences in G,which are similar to S.lSimilarity:To find which parts of the sequences are alike and which parts differ.-Sequence alignment(global alignment)-Local alignment39Sequence AlignementnGlobal alignmentnLocal alignment40Longest Common Subsequence(1)nTo find a longest common subsequence between two strings.string1:TAGTCACG string2:AGACTGTC LCS:AGACGnDynamic programming:41Longest Common Subsequence(2)TAGTCACGAGACTGTCLCS:AGACGS2S142Edit Distance(1)nTo find a smallest edit process between two strings.S1:TAGTCAC G S2:AG ACTGTCOperation:DMMDDMMIMII 43Edit Distance(2)TAGTCAC G AG ACTGTCDMMDDMMIMII S2S144SimilaritynTwo sequences s1 and s2.np is the match value if ai=bj,else it is the mismatch value.ng is the gap penalty.45Sequence Alignmenta=TAGTCACGb=AGACTGTC-TAGTCACGTAGTCAC-G-AGACT-GTC-AG-ACTGTCnWhich one is better?46Sequence Alignment Formulac0,0=0ci,0=ic0,j=j if ai bj if ai=bj47Sequence Alignment ExampleTAGTCAC-G-AG-ACTGTC48Multiple Sequence Alignments1=ATTCGATs2=TTGAGs3=ATGCT alignments1=ATTCGATs2=-TT-GAGs3=AT-GCTnIf the number of sequences is k,and k is large,how to solve the problem?nNP-complete problem49Multiple Sequence Alignment-SPnSum-of-pairsscore=50Example of Sum-of-pairs Scores1=ATTCGATs2=-TT-GAGs3=AT-GCTFor the alignment,the pairwise alignment scores are:score(s1,s2)=5score(s2,s3)=0score(s1,s3)=5 SP score=1051Multiple Sequence Alignment-StarnStar alignment is an approximation system of sum-of-pairs(SP)scoring system.nStar alignment score=52Example of Star Scores1=ATTCGATs2=-TT-GAGs3=AT-GCTFor the alignment,the pairwise alignment scores are:score(s1,s2)=5score(s2,s3)=0score(s1,s3)=5 Star score=maxscore(s1,s2)+score(s1,s3),score(s2,s1)+score(s2,s3),score(s3,s1)+score(s3,s2)=max5+5,5+0,5+0=1053Multiple Sequence Alignment-TreenTreescore=where Si and Sj are adjacent,Sk and Sl are adjacent.54Fragment AssemblynDepending on experimental factors:lFragment length can be as low as 200 or high as 700.nTypical problems involve target sequences 30,000 to 100,000 base-pairs long,and total number of fragments is in the range 500 to 2000.55Shortest Common SuperstringnGiven a set of k strings P=s1,s2,sk,to find a shortest superstring s containing every string in P as a substring.That is,|s|is the minimal.ACCGT-ACCGT-CGTGC-CGTGCTTACTTAC-TACCGT-TACCGT -TTACCGTGCnNP-complete problem56Physical MappingnGiven a(0,1)matrix of probes versus clones,to reconstruct the relative places of clones or probes.nNP-complete problem57Consecutive Ones Problem(1)58Consecutive Ones Problem(2)nConsider a(0,1)matrix M,with rows indexed by clones and columns by probes,and position(i,j)is 1 if clone i contains probe j.nThe problem is to permute the columns so that the ones in each row are consecutive.nA(0,1)matrix has the k-consecutive ones property(k-C1P)if there exists a column order such that in each row the occurrences of all ones appear in at most k consecutive blocks.nThe k-consecutive ones Problem:lDoes a given(0,1)matrix have the k-consecutive ones property?lNP-complete,for k 259Double Digest Problem(1)enzyme a=|A|=1,3,3,12enzyme b=|B|=1,2,3,3,4,6c=|A B|=|C|=1,1,1,1,2,2,2,3,660Double Digest Problem(2)nGiven the lengths of fragments,|Xi Xj|,1 i j n,obtained by applying either one of the two restriction enzymes A and B,or both,to determine the order of these fragments.na=|A|=ai:1 i n from the first digestb=|B|=bi:1 i m from the second digest.c=|A B|=|C|=ci:1 i l from first and second digests.61Evolutionary Trees(1)62Evolutionary Trees(2)nGenome sequences:Given genomes of several organisms,to build an evolutionary tree in which the number of mutations(changes)is minimal.nCharacter matrix:Given a(0,1)character state matrix of several organisms,to build a perfect evolutionary tree.nDistance matrix:Given a distance matrix of several organisms,to build a tree satisfying the distances between all organisms.63Perfect Phylogeny(1)64Protein Structure65Protein FoldingnGiven the primary structure of a protein,to compute or evaluate its 3-dimensional structure.nPrimary structure(sequence):66Protein Folding Problem nThe characteristic of each amino:lH(hydrophobic,non-polar)(hating water,疏水性)lP(hydrophilic,polar)(loving water,親水性)nThe amino acid sequence of a protein can be viewed as a binary sequence of Hs(1s)and Ps(0s).67Example of H-P ModelnInput sequence:011001001110010011001001111000011001001111000Score=5Score=368Protein folding on H-P ModelnThe protein folding on H-P model:Given a sequence of 1s(Hs)and 0s(Ps),to find a self-avoiding paths embedded in either a 2D or 3D lattice such that the number of pairs of adjacent 1s is maximized.nNP-complete even for 2D lattice.69RNA Secondary Structure Prediction(1)nRNA:A,G,C,UnBase pairs:GC (Watson-Crick base pair)A=U (Watson-Crick base pair)GU (Wobble base pair)n(a,b)is defined as 1 if a and b can form a base pair;otherwise it is 0.70RNA Secondary Structure Prediction(2)71RNA Secondary Structure Prediction(3)nX(i,j)is maximum number of base pairs in the sequence aiai+1aj,i j.nDynamic Programming:X(i,j)=0 if|j i|1.i k j 1.72Reference-BooksnAlgorithms on strings,trees,and sequences:computer science and computational biology,Dan Gusfield,Cambridge University Press,1997.nIntroduction to computational molecular biology,Joao Carlos Setubal and Joao Meidanis,PWS Pub.,1997.nIntroduction to computational biology:maps,sequences and genomes,Michael S.Waterman,Champman&Hall,1995.nManuscript of Prof.R.C.T.Lee http:/www.csie.ncnu.edu.tw/rctlee/biology.html73Reference Books(Biology)n生物學 C.Starr&R.Taggart 原著 丁澤民 王偉 張世衿 連慧瑞 編譯n現代分子生物學 朱玉賢 李毅 編著 藝軒出版社n分子生物學入門 駒野徹、酒井裕 合著 何士慶 譯 科技圖書nDNA 圖解小百科 (名詞解釋)威惹利、培瑞、李哈 合著 潘震澤 譯 新新聞文化公司74Reference Journals(1)nBioinfomatics(SCI)nBulletin of Mathematical Biology(SCI)nComputer Applications in the BiosciencesnJournal of Computational Biology(SCI expanded)nJournal of Mathematical Biology(SCI)nJournal of Molecular Biology(SCI)nNucleic Acids Research(SCI)nGene(SCI)nScience(SCI)75Reference Journals(2)nGenome Research(SCI)nPROTEINS:Structure,Function,and Bioinformatics(SCI)nGene(SCI)nCurrent Opinion in Structural Biology(SCI)nProtein-Structure Function and Bioinfomatics(SCI)nBMC Bioinformatics(SCI Expanded)nComputational Biology and Chemistry(SCI)nBioSystems(SCI)76Reference Web SitesnC.B.Yanghttp:/par.cse.nsysu.edu.twnBioweb link of C.B.YangnBioWeb http:/bioweb.uwlax.edu/nMIT Biology Hypertextbook http:/esg-www.mit.edu:8001/esgbio/nBioinformatics Related Journals http:/www.iscb.org/journals.htmlnNCBI (National Center for Biotechnology Informationhttp:/www.ncbi.nlm.nih.gov/77Conclusion(1)Bioinformatics and Computer SciencenAlgorithm:all computing problems.nImage processing:3D images of RNA folds or protein.nDatabase:massive database and retrieval.nDistributed system and parallel processing:massive storage and accelerating computation.78Conclusion(2)nBiology easily has 500 years of exciting problems to work on.-Donald E.Knuth79。