Computational Methods for Pseudogene Annotation Based on Sequence Homology

Part of the book series: Methods in Molecular Biology (MIMB, volume 2324)

Abstract

The number of complete genome sequences grows rapidly with each passing year, so methods for genome annotation must be continually refined to handle the deluge of information. Annotation of pseudogenes (i.e., gene copies that appear not to make a functional protein) remains a persistent problem. Here, we give an overview of pseudogene annotation methods that are based on the detection of sequence homology in genomic DNA.
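
As a rough illustration of what "appear not to make a functional protein" can mean in practice, the sketch below flags premature in-frame stop codons in a candidate gene copy. It is a toy example under stated assumptions, not a method taken from the chapter: the function name `find_premature_stops`, the example sequences, and the choice to test only for stop codons in a known reading frame are all hypothetical. Homology-based pipelines generally first align the genomic DNA to a known protein or gene and also consider frameshifts.

```python
# Toy sketch: report premature in-frame stop codons in a DNA sequence.
# Assumes the reading frame is already known (e.g., from a homology alignment).

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_premature_stops(seq, frame=0):
    """Return 0-based nucleotide positions of in-frame stop codons,
    ignoring the final codon, which may be the genuine terminator."""
    seq = seq.upper()
    # Split the sequence into complete codons starting at the given frame.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [
        frame + 3 * index
        for index, codon in enumerate(codons[:-1])
        if codon in STOP_CODONS
    ]


# Hypothetical sequences: an intact ORF and a copy with one disabling mutation.
intact_copy = "ATGGCTGAATTTTAA"   # ATG GCT GAA TTT TAA
decayed_copy = "ATGGCTTAATTTTAA"  # ATG GCT TAA TTT TAA

print(find_premature_stops(intact_copy))
print(find_premature_stops(decayed_copy))
```

A copy with one or more reported positions is only a pseudogene candidate; sequencing errors and genuine shorter isoforms would need to be ruled out separately.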


Author information

Correspondence to Paul M. Harrison.

Copyright information

© 2021 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Harrison, P.M. (2021). Computational Methods for Pseudogene Annotation Based on Sequence Homology. In: Poliseno, L. (eds) Pseudogenes. Methods in Molecular Biology, vol 2324. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-1503-4_3

  • DOI: https://doi.org/10.1007/978-1-0716-1503-4_3

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-1502-7

  • Online ISBN: 978-1-0716-1503-4

  • eBook Packages: Springer Protocols
