Introduction
Exome sequencing, also known as whole exome sequencing (WES), is a technique for sequencing all the expressed, protein-coding regions of the genes in a genome, collectively called the exome. It is one of the most widely used targeted sequencing methods. The exome makes up less than 2% of the human genetic code, yet it contains more than 85% of known disease-related variants. Whole exome sequencing is therefore a cost-effective alternative to whole genome sequencing (WGS).
This technique is efficient at identifying coding variants across a wide range of applications, including population genetics, cancer genetics, and genetic disease studies. The human exome, the portion of the genome that encodes proteins, makes up about 1.5% of the genome. It is the functional part of the genome and determines the functional pattern of the genes. Exome sequencing can efficiently reveal recessive, dominant, and sex-linked traits that are otherwise not detectable by microarray [1].
The Traditional Methods
1. Maxam-Gilbert sequencing: This method uses radioactive labeling at the 5’ end of the DNA, followed by purification of the DNA fragment to be sequenced. The label is added by a kinase reaction using gamma-32P ATP. Chemical treatment then generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (A+G, G, C, C+T). Purines (A+G) are depurinated by formic acid, guanine (and to some extent adenine) is methylated by dimethyl sulfate, and pyrimidines (C+T) are hydrolyzed by hydrazine. Adding sodium chloride to the hydrazine reaction inhibits the reaction of thymine, giving the C-only reaction. Piperidine then cleaves the DNA at the modified bases. This generates a series of labeled fragments, and the fragments from the four reactions are electrophoresed side by side in an acrylamide gel for size separation.
2. Sanger sequencing: Also known as the dideoxy chain termination method, it is based on the use of dideoxynucleotides (ddNTPs). ddNTPs differ from deoxyribonucleotides (dNTPs) in that they lack a free 3’-OH group on the five-carbon sugar ring. When a ddNTP is added to the growing DNA strand, the chain is therefore terminated, because no free OH group is available for the next nucleotide. By using a predetermined ratio of dNTPs to ddNTPs, it is thus possible to generate DNA fragments of various sizes when replicating DNA in vitro.
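The effect of the dNTP-to-ddNTP ratio on fragment length can be illustrated with a small simulation. This is only a sketch under a simplifying assumption that is not taken from the text: at every position a ddNTP is incorporated with the fixed probability 1 / (ratio + 1), regardless of the base. The function name `simulate_sanger_fragments` and its parameters are hypothetical.

```python
import random


def simulate_sanger_fragments(template_length, dntp_to_ddntp_ratio, n_strands, seed=0):
    """Return the lengths of the strands produced by n_strands extension events.

    Assumption (illustrative only): at each position a ddNTP is incorporated
    with probability 1 / (ratio + 1); otherwise a dNTP is added and extension
    continues. Strands that never take up a ddNTP reach full template length.
    """
    rng = random.Random(seed)
    p_terminate = 1.0 / (dntp_to_ddntp_ratio + 1.0)
    lengths = []
    for _ in range(n_strands):
        length = 0
        while length < template_length:
            length += 1  # one nucleotide is incorporated
            if rng.random() < p_terminate:
                break  # a ddNTP was added: no free 3'-OH, the chain terminates
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    for ratio in (10, 100):
        lengths = simulate_sanger_fragments(500, ratio, n_strands=10_000)
        mean_length = sum(lengths) / len(lengths)
        print(f"dNTP:ddNTP = {ratio}:1, mean fragment length = {mean_length:.1f} nt")
```

Under this assumption, a higher dNTP:ddNTP ratio shifts the fragment population toward longer products, which is why the ratio has to be chosen to suit the length of template being read.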
Second generation and next generation sequencing: In the late 1990s, several methods emerged that became known as second generation sequencing. One of the pioneering methods was pyrosequencing. It differs from Sanger sequencing in that it relies on detecting the pyrophosphate released upon nucleotide incorporation and does not involve chain termination with ddNTPs.
Today, a number of sequencing methods known as next-generation sequencing methods are available. The most widely used is Illumina sequencing. It involves up to 5 × 10^8 separate sequencing reactions, which are run simultaneously on a single slide (comparable to a microscope slide) placed in a single machine. Every reaction is analyzed separately, and the sequences generated from these 500 million DNA fragments are stored on the attached computer. Each reaction is a modified replication reaction that uses fluorescently tagged nucleotides, but no ddNTPs are needed [2,3].
Shotgun sequencing: The Sanger sequencing method can generate sequences of several hundred nucleotides per reaction. Most next-generation sequencing methods generate even shorter reads. Since a genome is made up of chromosomes spanning millions of base pairs, these short fragments have to be put in the correct order to produce the uninterrupted genome sequence. Whole genome shotgun sequencing requires the isolation of many copies of the chromosomal DNA. The chromosomes are then broken at random locations into fragments small enough to be sequenced. Every fragment is sequenced, and sophisticated computer programs compare the reads to find which fragments overlap with one another. Eventually the sequence of the entire chromosome is assembled [4].
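The overlap-and-merge idea behind shotgun assembly can be sketched with a toy greedy assembler. This is a minimal illustration, not the algorithm used by production assemblers: it assumes error-free reads from a single strand and a sequence without repeats, and the function names (`overlap_length`, `greedy_assemble`) and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left; the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Three overlapping fragments of a made-up 17-base sequence, in random order.
    reads = ["CCTGAATCG", "ATGGCGTA", "GCGTACCTG"]
    print(greedy_assemble(reads))
```

Real genomes contain repeats and reads contain errors, so practical assemblers use more elaborate graph-based methods; the sketch only shows why overlapping fragments are enough to recover the order.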
The exome consists of all the exons present in the genome. The term exon is derived from “expressed region”, as exons are the segments that are expressed and translated into proteins, whereas introns, a term derived from “intragenic regions”, are not translated into proteins. Exome sequencing lets us examine the genome in a way that large-scale studies of common variation, such as the genome-wide association study (GWAS), cannot. GWAS can only identify DNA variants that are common in the population, whereas exome sequencing determines every nucleotide of the exons present in the DNA and can therefore reveal rare mutations that GWAS cannot detect [5,6].
Exome Sequencing
Solution-based exome sequencing: Here the DNA sample is fragmented and selectively hybridized to biotinylated oligonucleotide probes that target specific regions of the genome. The biotinylated probes are bound by magnetic streptavidin beads, and the non-targeted regions of the genome are washed away. The captured DNA is then amplified by PCR, which enriches the sample for DNA from the target region. Finally, the sample is sequenced and passed on to bioinformatics analysis.
Array-based methods are very similar to the solution-based method, except that the probes are bound to a high-density microarray. The array-based method was the first to be employed in exome capture. The solution-based method is more efficient, however, as it requires less input DNA than array-based methods. Even so, some studies have shown that array-based methods performed better than their solution-based counterparts in regions of low GC content and in single nucleotide polymorphism (SNP) detection, and showed higher sensitivity and mapping rates [7].
Exome Capture Platforms
Various exome capture platforms are available from providers such as Nimblegen, Agilent, and Illumina. Nimblegen’s SeqCap EZ Exome Library has the greatest bait density of any platform. It uses short (55-105 bp) overlapping baits to cover the entire target region. This approach is very efficient at enrichment, needing the least amount of sequencing to cover the targeted portion. It also detects variants sensitively and has a high level of specificity compared with other platforms.
This bait design shows greater genotype sensitivity and more uniform coverage of sequenced regions that contain areas of high GC content. Agilent’s SureSelect Human All Exon Kit is the only platform that uses RNA probes. Its baits (114-126 bp) are longer than those used in the Nimblegen platform, and the corresponding target sequences lie adjacent to one another, whereas they overlap in the Nimblegen platform.
This design is efficient at identifying insertions and deletions, as longer baits can tolerate larger mismatches. It has also been hypothesized that it reduces reference allele bias at heterozygous sites compared with other bait designs. This platform produces fewer duplicate reads than Nimblegen, but also fewer high-quality reads. Illumina’s TruSeq Exome Enrichment Kit uses 95 bp probes that leave small gaps in the targeted region, relying on paired-end reads that extend beyond the bait sequence during sequencing. This design has a high percentage of off-target enrichment, which plays a major role in reducing its target efficiency compared with other platforms [8].
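Off-target enrichment is commonly summarized as the fraction of reads that overlap the capture targets. The sketch below shows one way such an on-target rate could be computed. It rests on assumptions that do not come from the text: reads and targets are given as (chromosome, start, end) tuples with half-open coordinates, a read counts as on-target if it overlaps a target by at least one base, and all function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_target_index(targets):
    """Group target intervals by chromosome, then sort and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping or touching intervals are combined into one.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def is_on_target(read, index):
    """True if the read overlaps any merged target interval by >= 1 base."""
    chrom, start, end = read
    intervals = index.get(chrom, [])
    # Locate the last merged interval that starts before the read ends.
    pos = bisect_right(intervals, (end - 1, float("inf"))) - 1
    return pos >= 0 and intervals[pos][1] > start


def on_target_rate(reads, targets):
    """Fraction of reads overlapping at least one target interval."""
    if not reads:
        return 0.0
    index = build_target_index(targets)
    hits = sum(1 for read in reads if is_on_target(read, index))
    return hits / len(reads)


if __name__ == "__main__":
    targets = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
    reads = [("chr1", 90, 110), ("chr1", 300, 350), ("chr2", 10, 50),
             ("chr3", 0, 10), ("chr1", 250, 260)]
    print(f"on-target rate: {on_target_rate(reads, targets):.2f}")
```

In practice this figure is usually taken from aligned reads and the vendor's target file; the sketch only makes the definition concrete.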
Uses in Humans
Exome sequencing has been used extensively to diagnose novel diseases and to find novel causative mutations for various known diseases. In human medicine it is used for screening and identifying difficult-to-diagnose patients, diagnosing young patients who do not exhibit the complete spectrum of symptoms, early diagnosis of debilitating diseases, and prenatal diagnosis. Finding the causative mutation can also support changes in treatment, a more accurate prognosis, avoidance of further invasive testing, and a confirmed diagnosis.
The use of exome sequencing in human medicine benefits from the availability of large databases of SNPs, control genomes, and pathogenic variants. These give researchers a large set of reference exomes that are free of the homozygous variants responsible for childhood-onset Mendelian diseases.
The diseases for which exome sequencing has been used successfully to detect a causative variant include Alzheimer’s disease, maturity-onset diabetes of the young, high myopia, amyotrophic lateral sclerosis, autosomal recessive polycystic kidney disease, acromelic frontonasal dysostosis, and various cancer predisposition mutations [9].
Uses in Other Species
Exome sequencing has been used in plant genomes, which are extremely complex, repetitive, and often polyploid. An exome capture kit based on accumulated transcriptome data has recently been designed for wheat. The capture region of this kit is 56.5 Mb long, which is estimated to be around the size of one diploid wheat chromosome. The kit has been widely used to identify induced mutations in the plant genome and to assist in studies of gene function. Rice (Oryza sativa) and soybean (Glycine max) have also been studied in separate exome analyses [10].
Other Uses
Whole exome sequencing can be especially useful in model organisms, particularly where sequence data from a large number of individuals are needed. Species with fairly large genomes, such as mice (3.5 Gb) and zebrafish (1.95 Gb), are common model organisms. A sensible approach is to sequence only a small part of the genome; candidate regions in particular can be targeted with high specificity. This is very cost-effective, as it allows deeper coverage and potentially increases the number of individuals that can be included in a study.
The genomes of different animals have different levels of linkage disequilibrium (LD), and LD can also vary with breed and effective population size. When causal variants are sought by exome sequencing, high LD can lead to a benign variant being identified instead, because the causative variant may be missed in variant calling or lie outside the sequenced space. On the other hand, the larger haplotypes associated with high LD increase the accuracy of imputation and lower its computational burden [11].
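For readers unfamiliar with how LD is quantified, the sketch below computes two standard measures, the coefficient D and the squared correlation r^2, for a pair of biallelic loci. These measures are not named in the text and are included only as an illustration; the function name and the haplotype labels ("AB", "Ab", "aB", "ab") are hypothetical conventions chosen for the example.

```python
def linkage_disequilibrium(haplotype_counts):
    """Compute D and r^2 for two biallelic loci from haplotype counts.

    haplotype_counts: dict with the keys "AB", "Ab", "aB", "ab", where the
    first letter is the allele at locus 1 and the second at locus 2.
    """
    total = sum(haplotype_counts.values())
    freq_ab_hap = haplotype_counts["AB"] / total
    freq_a = (haplotype_counts["AB"] + haplotype_counts["Ab"]) / total
    freq_b = (haplotype_counts["AB"] + haplotype_counts["aB"]) / total
    # D is the excess of the AB haplotype over its expectation under independence.
    d = freq_ab_hap - freq_a * freq_b
    denominator = freq_a * (1 - freq_a) * freq_b * (1 - freq_b)
    # r^2 is undefined if either locus is monomorphic in the sample.
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


if __name__ == "__main__":
    complete_ld = {"AB": 50, "Ab": 0, "aB": 0, "ab": 50}
    no_ld = {"AB": 25, "Ab": 25, "aB": 25, "ab": 25}
    print(linkage_disequilibrium(complete_ld))
    print(linkage_disequilibrium(no_ld))
```

When r^2 between a benign variant and the causal variant is close to 1, the two are nearly indistinguishable in association data, which is the situation the paragraph above warns about.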
Conclusion
Whole genome sequencing is still more expensive than whole exome sequencing, despite the rapid decrease in the cost of WGS. The price of WES comprises the price of capture and the price of sequencing, whereas the price of WGS consists of the sequencing cost alone. For human genomes, the HiSeq X platform has a cost per Gb far lower than that of other platforms, but it is limited to 30x WGS of human genomes. Even so, a 30x human genome costs $1,000, which is 2-3 times the cost of a 40x human exome. It is therefore advantageous to sequence more samples by WES, gaining statistical power, than to sequence fewer samples by WGS.
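The sample-size argument can be made concrete with a short calculation. The sketch uses the figures quoted above ($1,000 per 30x genome, with WGS costing 2-3 times as much as a 40x exome); the total budget of $100,000 and the function name are assumptions introduced purely for illustration.

```python
def samples_affordable(budget, wgs_cost, wgs_to_wes_cost_ratio=1):
    """Number of samples a budget covers.

    With wgs_to_wes_cost_ratio=1 this is the number of genomes; with a ratio
    of r it is the number of exomes when one exome costs wgs_cost / r.
    Integer arithmetic is used to avoid floating-point rounding.
    """
    return budget * wgs_to_wes_cost_ratio // wgs_cost


if __name__ == "__main__":
    budget = 100_000   # hypothetical study budget in dollars
    wgs_cost = 1_000   # cost of one 30x human genome, as quoted in the text
    print("WGS samples:", samples_affordable(budget, wgs_cost))
    for ratio in (2, 3):  # WGS is 2-3 times the cost of a 40x exome
        print(f"WES samples at ratio {ratio}:",
              samples_affordable(budget, wgs_cost, ratio))
```

Under these assumptions the same budget covers two to three times as many exomes as genomes, which is the source of the gain in statistical power.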
In other species the cost difference is even greater; in pigs, for example, WGS is estimated to cost 9-10 times as much as WES. WGS generates approximately 100 times as much data as WES at the same coverage, and the infrastructure required to store, analyze, and manage these data adds significantly to the cost of WGS.
WGS generates a larger number of variants than WES. In addition, variation in non-coding regions is less well understood than variation in coding regions, which makes it very difficult to predict whether such variants are relevant to a trait of interest. However, WGS can cover the entire genome with more consistent coverage than WES. It can also detect structural variants more accurately, and it is free of the reference sequence bias introduced by the probe sequences used in WES. WGS is expected eventually to take the leading role in genome interrogation, but this will have to wait until its data storage and analysis methods improve and the cost of the entire technique falls relative to its WES counterpart [12].
References
Bilgüvar K, Öztürk A, Louvi A, Kwan K, Choi M, Tatlı B et al. Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature. 2010; 467(7312): 207-210.
Teer J, Mullikin J. Exome sequencing: the sweet spot before whole genomes. Hum Mol Genet. 2010; 19(R2): R145-R151.
Warr A, Robert C, Hume D, Archibald A, Deeb N, Watson M. Exome Sequencing: Current and Future Perspectives. G3 (Bethesda). 2015; 5(8): 1543-1550.
Mertes F, ElSharawy A, Sauer S, van Helvoort J, van der Zaag P, Franke A et al. Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief Funct Genomics. 2011; 10(6): 374-386.
Choi M, Scholl U, Ji W, Liu T, Tikhonova I, Zumbo P et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA. 2009; 106(45): 19096-19101.
Turner et al. Methods for genomic partitioning. Annu Rev Genomics Hum Genet. 2009; 10: 263-284.
Raffan E, Hurst L, Turki S, Carpenter G, Scott C, Daly A et al. Early Diagnosis of Werner's Syndrome Using Exome-Wide Sequencing in a Single, Atypical Patient. Front Endocrinol (Lausanne). 2011; 2: 8.
Editorial. Nature Chemical Biology features. Nature Chemical Biology 2005; 1, 63.
Ng S, Buckingham K, Lee C, Bigham A, Tabor H, Dent K et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010; 42: 30-35.
Worthey EA et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med. 2011; 13(3): 255-262.
Ng S, Turner E, Robertson P, Flygare S, Bigham A, Lee C et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009; 461(7261): 272-276.
Rauch A et al. Diagnostic yield of various genetic approaches in patients with unexplained developmental delay or mental retardation. Am J Med Genet A. 2006; 140(19): 2063-2074.