15.15:

Genome Annotation and Assembly

The entire genome of an organism cannot be read as one continuous sequence. Even the newest generation of sequencing technologies produces fragmented data: thousands of short DNA fragments ranging from 50 to 1,000 bp in length.

These short DNA sequences – called reads – need to be assembled to reconstruct the complete sequence of a genome in a process called genome assembly.

There are four main steps in any next-generation genome assembly – raw data analysis, contig assembly, scaffolding, and finally, gap closing.

The first step is to assess the quality of the raw data and then eliminate any contamination, biased data, or poor-quality reads with a large number of unknown nucleotides.
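
The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names, the Phred+33 quality encoding, and the two thresholds (`max_n_fraction`, `min_mean_quality`) are all assumptions chosen for the example.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(seq: str, quality: str,
                  max_n_fraction: float = 0.1,
                  min_mean_quality: float = 20.0) -> bool:
    """Keep a read only if it has few unknown bases (N) and adequate quality.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if not seq:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction <= max_n_fraction and mean_phred(quality) >= min_mean_quality


# Toy reads as (sequence, quality string) pairs.
reads = [
    ("ACGTACGTAC", "IIIIIIIIII"),  # high quality, no unknown bases
    ("ACNNNNGTAC", "IIIIIIIIII"),  # 40% unknown bases
    ("ACGTACGTAC", "##########"),  # very low quality throughout
]
clean_reads = [(s, q) for s, q in reads if passes_filter(s, q)]
```

Only the first toy read survives: the second has too many Ns and the third fails the quality threshold.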

Next, the clean reads are trimmed to remove the adapter sequences from their ends. Any bases at the fragment ends that do not pass the quality threshold are also trimmed.
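
As a rough sketch of this trimming step, the example below removes an adapter by exact string match and then clips low-quality bases from both ends. The function names, the quality threshold, and the Phred+33 encoding are assumptions for illustration; dedicated trimming tools also handle partial and mismatched adapters, which this sketch does not.

```python
def trim_adapter(seq: str, qual: str, adapter: str) -> tuple[str, str]:
    """Cut the read at the first exact occurrence of the adapter sequence."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_ends(seq: str, qual: str,
                          min_quality: int = 20,
                          offset: int = 33) -> tuple[str, str]:
    """Remove bases below min_quality from both ends of the read."""
    start, end = 0, len(seq)
    while start < end and ord(qual[start]) - offset < min_quality:
        start += 1
    while end > start and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[start:end], qual[start:end]


# Toy read: 8 genomic bases followed by an illustrative adapter sequence.
adapter = "AGATCGGAAGAGC"
seq = "ACGTTGCA" + adapter
qual = "#IIIIII#" + "I" * len(adapter)

seq, qual = trim_adapter(seq, qual, adapter)
trimmed = trim_low_quality_ends(seq, qual)
```

Here the adapter is removed first, and the low-quality first and last bases of the remaining fragment are then clipped, leaving `CGTTGC`.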

Then, a well-suited assembly tool is used to assemble the reads into contiguous sequences – called contigs – based on the overlapping DNA segments.

Comparative genome assembly can be used when a reference genome of a closely related organism is available to direct the reconstruction of the new genome. Here, the reads are aligned to the reference genome, and this provides a layout for further steps in the genome assembly.
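
The idea of using a reference to lay out reads can be illustrated with a deliberately naive placement routine. This sketch assumes ungapped alignment and a hypothetical mismatch limit (`max_mismatches`); real read aligners use indexed, gapped alignment and are far more efficient.

```python
def best_position(read: str, reference: str) -> tuple[int, int]:
    """Return (position, mismatches) of the lowest-mismatch ungapped placement."""
    best = (-1, len(read) + 1)
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best[1]:
            best = (pos, mismatches)
    return best


def layout_reads(reads: list[str], reference: str,
                 max_mismatches: int = 1) -> list[tuple[int, str]]:
    """Place each read on the reference and return the layout sorted by position."""
    placed = []
    for read in reads:
        pos, mismatches = best_position(read, reference)
        if pos >= 0 and mismatches <= max_mismatches:
            placed.append((pos, read))
    return sorted(placed)


reference = "ATGGCGTACGTTAGC"
reads = ["GTTAGC", "ATGGCG", "CGTACG", "GCGTTCGT", "TTTTTT"]
layout = layout_reads(reads, reference)
```

The read `GCGTTCGT` differs from the reference at one base and is still placed, which is how differences between the new genome and its relative show up; `TTTTTT` fits nowhere within the limit and is left out of the layout.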

Alternatively, de novo genome assembly needs to be performed in the absence of a reference genome. Here, the overlaps between reads are used to merge them into longer contigs.
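
To make overlap-based assembly concrete, here is a small greedy sketch that repeatedly merges the pair of sequences with the longest suffix-prefix overlap. It is a simplification for illustration only: the function names and `min_overlap` value are assumptions, sequencing errors and the reverse strand are ignored, and production assemblers rely on overlap or de Bruijn graphs rather than this greedy loop.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlap of min_overlap remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            return contigs
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["ATGGCG", "GCGTAC", "TACGTT", "CCCAAA"]
contigs = greedy_assemble(reads)
```

The first three toy reads chain together through three-base overlaps into one contig, while `CCCAAA` overlaps nothing and remains a separate contig.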

In the next step, paired short reads whose two mates fall near the ends of different contigs are used to order and orient the contigs into scaffolds.

Where the sequence between adjacent contigs is unknown, the gaps are filled with Ns. However, if long reads of more than 1 kb are used to stitch the contigs together, the gaps can be filled with actual sequence.

The result is an assembled genome that then needs to be annotated with the help of automated tools – a process called genome annotation.

The two main aims of genome annotation are gene structure and gene function prediction, commonly known as structural annotation and functional annotation, respectively.

Structural annotation identifies genomic elements such as coding regions and regulatory motifs, while functional annotation assigns biological functions to these structural elements, especially protein-coding genes.
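
One of the simplest forms of structural annotation is scanning for open reading frames (ORFs). The sketch below is an assumption-laden illustration rather than a gene predictor: it uses the standard start codon ATG and the standard stop codons, checks only the forward strand, ignores introns, and uses a hypothetical `min_codons` cutoff.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 0-based with an exclusive end that includes the stop codon.
    min_codons counts the start and stop codons.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


sequence = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(sequence)
start, end, frame = orfs[0]
orf_sequence = sequence[start:end]
```

For this toy sequence a single ORF is reported, running from the ATG at position 2 through the TAA stop codon.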

Genome annotation tools use available data, including known transcripts, protein or signal sequences, predicted genes from other sequenced genomes, or signatures of conserved domains, as the references for any new annotation.

Once the software aligns this available data to the draft genome, the resulting alignments need to be filtered and polished, either manually or using annotation tools, to obtain a final set of gene annotations.
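
As a hedged illustration of the filtering step, the sketch below keeps only well-supported evidence alignments and, where alignments overlap, retains the one with the highest identity. The `Hit` record, its fields, the example names, and the `min_identity` threshold are all hypothetical; real annotation pipelines combine many more evidence types and rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A hypothetical evidence alignment on the draft genome."""
    name: str
    start: int       # 0-based, inclusive
    end: int         # exclusive
    identity: float  # fraction of identical positions, 0.0 to 1.0


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two alignments share at least one genomic position."""
    return a.start < b.end and b.start < a.end


def filter_hits(hits: list[Hit], min_identity: float = 0.8) -> list[Hit]:
    """Drop weak hits, then keep the best hit among any overlapping ones."""
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.identity, reverse=True):
        if hit.identity < min_identity:
            continue
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit("kinase_A", 100, 400, 0.95),
    Hit("kinase_B", 150, 420, 0.85),     # overlaps kinase_A with lower identity
    Hit("transporter", 900, 1500, 0.90),
    Hit("weak_match", 2000, 2300, 0.40)  # below the identity threshold
]
final_annotations = filter_hits(hits)
```

In this toy input, `kinase_A` and `transporter` are kept, while the lower-identity overlapping hit and the weak match are discarded.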

The genome refers to all of the genetic material in an organism. It can range from a few million base pairs in microbial cells to several billion base pairs in many eukaryotic organisms. Genome assembly refers to the process of taking the DNA sequencing data and putting it all back together in a correct order to create a close representation of the original genome. This is followed by the identification of functional elements on the newly assembled genome, a process called genome annotation.

Genome assembly is a complicated process. While human genomes in a population can have variable gene copy numbers and repeated sequences that add complexity to genome assembly, the physical location of the genes remains constant. In contrast, bacterial genes are not always in the same location, and multiple copies of the same gene may appear in different locations on the genome. This adds complexity to the assembly of the bacterial genomes. Therefore, a single genome assembly from an organism cannot represent all the diversity within the population of a species.

Furthermore, the possibility of technological or algorithmic errors adds further complexity to the process of genome assembly. As a result, many published genomes are continuously updated as sequencing technologies and assembly and annotation tools advance. For example, human genome assembly build 37 (GRCh37) was released in 2009, and an updated version, build 38 (GRCh38), was made available in 2013.

Additionally, the evolution of genome annotation tools over the last few decades has increased their resolution. These tools have come a long way from annotating only long protein-coding genes and regulatory elements to annotating single nucleotides within a population.

Both genome assembly and annotation are essential tools for genome analysis that lead to precise insights into the biology of species, populations, and individuals.
