Research Article By
With the recent rise of genetic sequencing and the resulting influx of data, researchers need a way to organize and make sense of DNA sequences. Genome annotation addresses this need; the first software system for it was designed in 1995 by Dr. Owen White, in collaboration with The Institute for Genomic Research. Genome annotation is the process of analyzing a DNA sequence, locating its genes, identifying the functions of their coding regions, and attaching a note to each one so that the sequence is easier to interpret. This way, researchers can easily access valuable data, such as the number of and spacing between repeat regions or the location of SNPs significant to their study. Many different tools are used to annotate DNA sequences, including AUGUSTUS and GEMINI. While computational tools for genome annotation have aided researchers and tremendously reduced the time it takes to annotate a DNA sequence, this is only the beginning for the technology, which continues to improve year after year.
Using this steadily improving annotation technology, researchers have compiled hundreds of eukaryotic genomes and over 100,000 bacterial genomes in databases. Currently, the two main types of genomic annotation are structural and functional annotation. Structural annotation locates and labels the various genomic elements of DNA, such as introns, exons, and promoters, while functional annotation describes the functions of these structural components and attaches biological information to them.
The goal of structural annotation is to locate genes in DNA. The two primary types of genes are protein-coding genes, which produce proteins, and noncoding genes, which produce functional RNA. Noncoding genes give rise to many types of RNA, including transfer RNA, microRNA, ribosomal RNA, and small nuclear RNA. In addition to protein-coding and noncoding genes, structural annotation also identifies pseudogenes, which are imperfect copies of functional genes that can play a role in gene regulation. Pseudogenes often arise through mutations such as indels, which are insertions or deletions of nucleotides in a DNA sequence. Identifying all of these elements in a DNA sequence is crucial to researchers’ understanding of how genomes function.
Both structural and functional genomic annotation can be used together to give the researcher an overall better understanding of a DNA sequence. This figure demonstrates the intersection between these two methods. (source)
In addition to genes, another important feature of DNA, and the first to be identified through structural annotation, is repeat DNA. Approximately 50 percent of the human genome is made of repetitive sequences. To be as efficient as possible and avoid analyzing the same stretch of DNA several times, repeat masking is performed before the rest of the DNA sequence is annotated. Even so, many repetitive elements remain difficult to identify, and repeat-masking software relies on databases of stored repeat sequences. RepeatMasker is one of the most common of these programs. It uses sequence alignment to locate repeat regions and then replaces each nucleotide in them with a placeholder letter such as “N” or “X.”
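As a rough illustration of what masking does to a sequence, the sketch below replaces the bases inside known repeat intervals with a placeholder letter. It does not reproduce RepeatMasker's alignment step: the function name mask_repeats, the example sequence, and the interval coordinates are all made up for illustration, and the repeat locations are assumed to be known in advance.

```python
def mask_repeats(sequence, repeat_intervals, mask_char="X"):
    """Replace every base inside the given repeat intervals with mask_char.

    repeat_intervals holds (start, end) pairs, 0-based with end exclusive.
    In a real pipeline these coordinates would come from a repeat finder;
    here they are hypothetical inputs.
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        # Clamp the interval so out-of-range coordinates are ignored.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


# Toy example: the "ACAC..." run is treated as a known repeat.
dna = "GATTACACACACACGGT"
masked_dna = mask_repeats(dna, [(4, 14)])
print(masked_dna)
```

The masked sequence keeps its original length, so the coordinates of any genes found later still refer to the same positions in the unmasked DNA.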
Within structural genome annotation, the principal task is gene prediction, the process of identifying where the coding genes are in a DNA sequence. However, introns (segments of DNA that interrupt genes and do not code for proteins) are widespread and vary greatly in size, which makes this a challenging feat. Gene prediction programs are used to overcome this obstacle and determine where a coding gene could be in a DNA sequence. There are three major approaches to gene prediction: ab initio methods, homology-based methods, and combined methods. Ab initio methods rely on the nucleotide sequence alone to make a gene prediction. They employ statistical models, one of the most common being the Hidden Markov Model (HMM), to determine key features of the genome sequence, such as the coding and noncoding regions, promoters, and junctions between introns and exons (which are especially helpful for pinpointing the location of coding regions). Homology-based gene prediction aligns the sequence in question with protein sequences or with short sequences of complementary DNA called expressed sequence tags (ESTs), and uses the similarities detected by the alignment to predict the genes in the sequence. Finally, the combined method uses both ab initio and homology techniques to produce more accurate results. In all three cases, the identified genes are then labeled through structural annotation to make the sequence easier to comprehend.
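To make the HMM idea more concrete, here is a minimal sketch of the Viterbi algorithm on a two-state model that labels each base as "coding" or "noncoding." It is a toy, not how any particular gene predictor is implemented: the state names, every probability in the tables, and the example sequence are invented for illustration, and real gene finders use far richer models (codon structure, splice sites, and so on).

```python
import math

STATES = ("coding", "noncoding")

# All probabilities below are invented for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANSITION = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMISSION = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
}


def viterbi(sequence):
    """Return the most probable state path for a DNA string (log-space Viterbi)."""
    if not sequence:
        return []
    # scores[s] = best log-probability of any path that ends in state s
    scores = {
        s: math.log(START[s]) + math.log(EMISSION[s][sequence[0]])
        for s in STATES
    }
    backpointers = []
    for base in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Pick the previous state that gives the best score for landing in s.
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(TRANSITION[p][s])
            )
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANSITION[best_prev][s])
                + math.log(EMISSION[s][base])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


path = viterbi("ATATATTAGCGCGGCCGCATATTA")
print("".join("C" if s == "coding" else "n" for s in path))
```

Working in log space avoids numerical underflow on long sequences, which matters because real inputs contain thousands to millions of bases.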
Genome annotation workflow. (source)
After the key parts of a genomic sequence are identified via structural annotation, functional annotation is applied to make sense of them. Put simply, functional annotation involves aligning a sequence against a database and labeling the matching regions with a description of the gene. Along with similarities, functional annotation is also used to analyze variations in genes. One kind of variation occurs when a single nucleotide in a DNA sequence is replaced by one that the majority of the population does not carry, but that at least one percent of the population does. This is called a single-nucleotide polymorphism, or SNP.
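The one-percent rule above can be expressed as a small calculation. The sketch below assumes the bases observed at a single position have already been collected, one per sampled chromosome; the function names and the example counts are hypothetical.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Return (major_allele, minor_allele, minor_frequency) for one site.

    alleles is a non-empty list of bases observed at the same position,
    one per sampled chromosome. Only the two most common alleles are used.
    """
    counts = Counter(alleles).most_common()
    major = counts[0][0]
    if len(counts) == 1:
        return major, None, 0.0  # everyone carries the same base
    minor, minor_count = counts[1]
    return major, minor, minor_count / len(alleles)


def is_snp(alleles, threshold=0.01):
    """Apply the one-percent rule described in the text."""
    _, minor, freq = minor_allele_frequency(alleles)
    return minor is not None and freq >= threshold


# Hypothetical site: 970 sampled chromosomes carry G, 30 carry T.
site = ["G"] * 970 + ["T"] * 30
print(minor_allele_frequency(site))
print(is_snp(site))
```

A variant seen in only one of a thousand samples would fall below the threshold and would not be called a SNP under this definition.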
Commonly used gene prediction programs. (source)
SNPs carry essential information about a DNA sequence and can be very useful to doctors and researchers, for example in predicting a patient’s reaction to a specific drug, predisposition to a certain disease, or response to specific environmental factors. Functional annotation is especially helpful in these situations. Suppose a patient has a “T” nucleotide where the majority of the population has a “G,” and a specific treatment is already known to be ineffective for patients with this SNP. Rather than spend time on trial and error, the physician can choose a different treatment. Another type of variant that functional annotation can explore is the copy number variation (CNV), in which a segment of DNA is deleted or duplicated so that the number of copies differs between individuals. The significance of these variants can be looked up in databases, then described in the context of the sequence through functional annotation.
Ontology based annotation tools. (source)
With the invention of next-generation DNA sequencing came an abundance of genetic information, so much that manual annotation alone was no longer realistic. This created the need for automatic functional annotation, which can be achieved with local alignment tools, such as BLAST, that find similarities between protein sequences. The functions of these sequences can then be inferred, on the assumption that similar protein sequences evolved from a common ancestor and therefore tend to have similar functions. This kind of automatic functional annotation can also detect orthologous and paralogous relationships: orthologous genes descend from a single ancestral gene and were separated by speciation, while paralogous genes arose through duplication.
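To show what "local alignment" means in practice, the sketch below scores the best local alignment between two sequences with the Smith-Waterman algorithm. This is a swapped-in textbook method, not BLAST itself: BLAST uses faster heuristics to approximate this kind of search across large databases. The scoring values and the example sequences are arbitrary choices for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                        # start a fresh local alignment
                previous[j - 1] + step,   # align a[i-1] with b[j-1]
                previous[j] + gap,        # gap in b
                current[j - 1] + gap,     # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# Hypothetical protein fragments: the query appears inside the first subject.
query = "MKTAYIAKQR"
print(smith_waterman_score(query, "GGMKTAYIAKQRLL"))
print(smith_waterman_score(query, "WWWWWW"))
```

A high score relative to the query length suggests the two sequences share a conserved region, which is the signal that annotation transfer relies on.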
There are various computational programs that aid researchers in performing both structural and functional genomic annotation. In terms of structural annotation, two of the most common applications are AUGUSTUS and Exonerate. The first of the two, AUGUSTUS, uses an ab initio method to predict and annotate genes, and it is one of the most accurate programs of its kind.
Ab initio and Homology-based annotation tools summary. (source)
Ab initio annotation relies on gene predictors, which must first be trained on experimental data before they can perform statistical analysis on the sequence in question. K-mer statistics and frame length are two variables that this analysis considers. K-mers are substrings of length k taken from a DNA sequence, and their frequencies can be used to help identify certain genes. AUGUSTUS assigns probabilities to the nucleotides in a DNA sequence, and it can also predict alternative splicing, in which the exons of a gene are joined in different combinations to produce different mRNA transcripts. This software and method can be applied in various situations, such as identifying similar regions in different genetic sequences and drawing comparisons between them. For example, suppose two patients have similar symptoms, and one of them has already been diagnosed with a disease and has an annotated variation in their DNA sequence. If the same variation is found in the annotated sequence of the second patient, this may suggest (though it does not prove) that the two patients have the same disease.
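K-mer statistics are simple to compute. The following sketch counts every overlapping substring of length k in a sequence; the function name count_kmers and the example sequence are illustrative, and this is not how AUGUSTUS itself is implemented.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


kmers = count_kmers("ATGATGCATG", 3)
print(kmers["ATG"])         # how often the start-codon triplet occurs
print(sum(kmers.values()))  # total number of 3-mers in the sequence
```

A gene predictor would compare counts like these, gathered from known coding and noncoding regions, to decide which kind of region a new stretch of sequence most resembles.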
Another program used for structural annotation is Exonerate, which uses a homology-based gene prediction method to identify the parts of a gene. It aligns the sequence with another known, well-annotated sequence in order to detect similarities and predict genes, based on a principle of molecular evolution: the most functionally important sections of the genome evolve more slowly than other regions, so they remain similar enough across species to be aligned. The program can still produce inaccuracies because of this assumption, since the two aligned sequences become more dissimilar as the evolutionary distance between them increases.
Two common functional annotation programs are BLAST and GEMINI. BLAST is a sequence alignment tool that can be used to perform automatic functional annotation. For highly similar alignments, the function of the already-annotated sequence in a database is assigned to the genes of the unknown sequence, quickly providing the researcher with information crucial to their study. As previously mentioned, variant annotation, such as the identification and functional analysis of SNPs, is a notable aspect of functional annotation.
Tools such as GEMINI are helpful in providing researchers with this information. GEMINI brings together data on many types of human genetic variation. It draws annotations from different databases, including dbSNP and KEGG, then automatically annotates the variants in a new DNA sequence by comparing them with those in the databases. This tool is valuable to physicians and researchers alike, because identifying and annotating variants like SNPs can save medical professionals time by giving them detailed information about their patients, such as their susceptibility to developing a certain disease or the chances that a specific medication will be effective for them.
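The comparison step can be pictured as a table lookup. The sketch below is only a conceptual illustration and does not use GEMINI or its interface: the table KNOWN_VARIANTS, its entries, and the function annotate_variants are all hypothetical, and the notes are invented rather than taken from dbSNP or KEGG.

```python
# Hypothetical annotation table keyed by (chromosome, position, ref, alt).
# The entries are invented and do not come from dbSNP, KEGG, or GEMINI.
KNOWN_VARIANTS = {
    ("chr1", 12345, "G", "T"): "example: reduced response to drug A",
    ("chr2", 67890, "C", "A"): "example: no known effect",
}


def annotate_variants(variants, known=KNOWN_VARIANTS):
    """Pair each variant with its note, or 'unannotated' if it is not in the table."""
    return [(v, known.get(v, "unannotated")) for v in variants]


# Variants observed in a hypothetical patient sample.
sample = [("chr1", 12345, "G", "T"), ("chr3", 111, "A", "G")]
for variant, note in annotate_variants(sample):
    print(variant, "->", note)
```

Variants that are missing from the table are flagged rather than dropped, since a novel variant may still matter to the study.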
While genomic annotation technologies have developed rapidly over the years, their results still contain many sources of error. In fact, some scientists argue that annotation has recently become less accurate. One of the most troublesome issues is that gaps and errors in the underlying DNA sequence lead to errors in the annotations. For example, when a DNA sequence is divided into several contigs, DNA from another organism can contaminate the sample, or the contigs can be mixed up during reassembly, producing an inaccurate DNA sequence. Contaminating DNA can even be mistaken for horizontal gene transfer, the movement of genes between different organisms. Any annotation built on such an incorrect sequence will be incorrect as well.
Genes can also be missed entirely: if a gene is expressed only in a few tissues or at very low levels, sequencing technology may fail to capture it. Along with sequencing errors, alignment errors can occur during functional annotation when annotations are transferred between regions that are not truly similar, ultimately leading to incorrect annotations.
When the DNA sequence of one species is annotated, a similar sequence in any other species can be aligned to it and annotated with the same function. As a result, annotation errors can spread quickly throughout databases, making them difficult to contain and resolve.
Undoubtedly, genomic annotation is a groundbreaking development that has greatly aided many researchers in their work with genetic sequences. Even so, the system clearly still has flaws, and this is only the beginning for genomic annotation. To resolve the inaccuracies described above, sequencing and alignment technologies must also become more precise. One up-and-coming technology that could help the field flourish is the direct sequencing of RNA.
Currently, RNA must first be converted into DNA before it can be sequenced, and errors can be introduced along the way. With emerging nanopore technology, however, RNA can be sequenced directly to produce full-length transcripts, rather than many short reads that can get mixed up and lead to inaccuracies. Ultimately, this should lead to a cleaner, truly “high-throughput” process and far more accurate annotations for the genes of many species.
Student 1 (High School)
Student 2 (High School)
(External students, posted at the request of mentors)
Brent, Michael R. “Genome annotation past, present, and future: how to define an ORF at each locus.” Genome research vol. 15,12 (2005): 1777-86. doi:10.1101/gr.3866105
Ejigu, Girum Fitihamlak, and Jaehee Jung. “Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing.” Biology vol. 9,9 295. 18 Sep. 2020, doi:10.3390/biology9090295
Erxleben, Anika, and Björn Grüning. “Galaxy Training: Genome Annotation.” Galaxy Training Network, 18 Oct. 2022, https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/genome-annotation/tutorial.html.
“Genome Annotation.” ScienceDirect Topics, https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/genome-annotation.
Haubold, Bernhard, and Thomas Wiehe. “How repetitive are genomes?” BMC bioinformatics vol. 7 541. 22 Dec. 2006, doi:10.1186/1471-2105-7-541
“Medical Definition of Genome Annotation.” MedicineNet, 29 Mar. 2021, https://www.medicinenet.com/genome_annotation/definition.htm.
Salzberg, Steven L. “Next-Generation Genome Annotation: We Still Struggle to Get It Right.” Genome Biology, BioMed Central, 16 May 2019, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1715-2.
The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of Elio Academy.