DNA sequencing

DNA sequencing is the process of determining the order of nucleotides in a given DNA fragment; the result is called the DNA sequence. For some thirty years the great majority of DNA sequencing has been carried out with the chain termination method, developed by Frederick Sanger and colleagues and published in 1977. This technique relies on sequence-specific termination of an in vitro DNA synthesis reaction by modified nucleotide substrates. 'Next generation' sequencing technologies, such as Pyrosequencing and 454 Sequencing, have recently begun to deliver substantial amounts of sequence data as well (see below).

The sequence of DNA encodes the necessary information for living things to survive and reproduce. Determining the sequence is therefore useful in 'pure' research into why and how organisms live, as well as in applied subjects. Because DNA is key to all living things, knowledge of DNA sequence may be useful in almost any biological subject area. For example, in medicine it can be used to identify, diagnose and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for contagious diseases.

Chemical sequencing
In 1976 and 1977, Allan Maxam and Walter Gilbert developed a method of DNA sequencing based on chemical modification of DNA and subsequent cleavage at specific bases. Chemical sequencing methods originated in the study of DNA-protein interactions (footprinting), nucleic acid structure and epigenetic modifications to DNA, and these remain important applications.

Chain termination method
In chain terminator sequencing (Sanger sequencing), which is possible because of the availability of clones and/or thermal cycling DNA amplification, extension is initiated at a specific site on the template DNA by using a short oligonucleotide 'primer' complementary to the template at that region. The oligonucleotide primer is extended using a DNA polymerase, an enzyme that replicates DNA. Included with the primer and DNA polymerase are the four deoxynucleotide bases (DNA building blocks), along with a low concentration of a chain terminating nucleotide (most commonly a di-deoxynucleotide). Limited incorporation of the chain terminating nucleotide by the DNA polymerase results in a series of related DNA fragments that are terminated only at positions where that particular nucleotide is used. The fragments are then size-separated by electrophoresis in a slab polyacrylamide gel, or more commonly now, in a narrow glass tube (capillary) filled with a viscous polymer.

The classical chain termination method, or Sanger method, first involves preparing the DNA to be sequenced as a single strand. (A single-strand preparation yields one band per nucleotide position, whereas a double-strand preparation yields two bands per position, which makes the sequence impossible to read.) The DNA sample is divided into four separate samples. Each of the four samples receives a primer, the four normal deoxynucleotides (dATP, dGTP, dCTP and dTTP), DNA polymerase, and only one of the four dideoxynucleotides (ddATP, ddGTP, ddCTP or ddTTP). The dideoxynucleotides are added in limited quantities. Either the primer or the dideoxynucleotides are radiolabeled or carry a fluorescent tag.

As the DNA strand is elongated, the DNA polymerase catalyses the addition of deoxynucleotides complementary to the bases of the template. The nucleotides available to the polymerase are a mixture of normal and tagged/terminating nucleotides, so from time to time the appropriate dideoxynucleotide is incorporated into the elongating strand in place of the normal one. The terminating base prevents further elongation because a dideoxynucleotide lacks the 3'-OH group needed to attach the next nucleotide. Each reaction therefore produces a series of DNA fragments of varying length, all ending in the same dideoxynucleotide; because the tag is not base-specific, the four reactions must be kept separate. Unfortunately, only short stretches of DNA can be sequenced in each reaction. The polymerase chain reaction (PCR) technique is limited to about 10,000 base pairs, and the maximum length of extension is dictated by the concentration of tagged/terminating nucleotides.

The DNA is then denatured and the resulting fragments are separated by size (with a resolution of just one nucleotide) by gel electrophoresis, with the shortest fragments migrating farthest. Each of the four DNA samples is run in its own lane (lanes A, T, G, C), according to which dideoxynucleotide was added. Depending on whether the primers or dideoxynucleotides were radiolabeled or fluorescently labeled, the DNA bands can be detected by autoradiography (exposing X-ray film to the gel) or under UV light, and the DNA sequence can be read directly off the gel. In the image on the right, X-ray film was exposed to the dried gel, and the dark bands indicate the positions of the DNA molecules of different lengths. A dark band in a lane indicates a chain termination at that particular nucleotide, and the DNA sequence can be read off as indicated.
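
The four-lane readout can be illustrated with a small simulation. The following is only a sketch under idealized assumptions: every possible termination product is present, the primer is ignored, and the template is written 3' to 5' so that the new strand reads left to right. The function names are hypothetical and chosen for this example.

```python
# Idealized four-lane Sanger readout (illustrative sketch, not a lab protocol).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sanger_lanes(template):
    """Return the fragment lengths (bands) produced in each ddNTP reaction.

    Assumption: the template is written 3'->5', so its complement is the
    newly synthesized strand written 5'->3'. Primer length is ignored.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    lanes = {base: set() for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        # A fragment of this length ends in `base`, so it can only arise
        # in the reaction that contains the matching dideoxynucleotide.
        lanes[base].add(length)
    return lanes

def read_gel(lanes):
    """Read the sequence from the shortest band to the longest."""
    band_to_base = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(band_to_base[length] for length in sorted(band_to_base))

lanes = sanger_lanes("TACGGT")
print(lanes["A"])       # band positions in the ddATP lane
print(read_gel(lanes))  # ATGCCA
```

Each fragment length appears in exactly one lane, which is why reading the bands in order of size across the four lanes recovers the sequence of the synthesized strand.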

Several problems can arise when sequencing by the Sanger method. The primer may anneal to a second site on the template, so that two sequences are superimposed in the readout. This can be remedied by using a higher annealing temperature and a primer with higher G and C content. Another problem occurs when RNA contaminates the reaction: the RNA can act as a primer, and this nonspecific priming leads to bands in all lanes at all positions. Other sources of trouble include contaminating plasmids, inhibitors of DNA polymerase, and low template concentrations in general. Secondary structure in the DNA being read by the polymerase can also cause reading problems, which appear on the readout as bands in all lanes at only a few positions. In short, the problems of this method are the standard problems one would encounter in PCR.
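
The link between G and C content and annealing temperature can be made concrete with a rough calculation. The sketch below uses the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), which is only an approximation for short oligonucleotides and is not taken from the text above; the two primer sequences are invented for illustration.

```python
# Rough primer statistics (illustrative; the Wallace rule is an approximation
# intended for short oligonucleotides of roughly 14-20 bases).

def gc_content(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Estimated melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

# Hypothetical 18-base primers, one AT-rich and one GC-rich.
low_gc = "ATTATAATTAATATTAAT"
high_gc = "GCGGCCGCATGCGGCCGC"

for primer in (low_gc, high_gc):
    print(primer, round(gc_content(primer), 2), wallace_tm(primer))
```

The GC-rich primer has a much higher estimated melting temperature, so it tolerates the higher annealing temperatures that discourage binding at imperfectly matched secondary sites.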

There are two sub-types of chain-termination sequencing. In the original method, the nucleotide order of a particular DNA template can be inferred by performing four parallel extension reactions using one of the four chain-terminating bases in each reaction. The DNA fragments are detected by labelling the primer with a base-nonspecific label, radioactive phosphorus for example, prior to performing the sequencing reaction. The four reactions would then be run out in four adjacent lanes on a slab polyacrylamide gel.

The Sanger method can also be carried out with primers that carry a base-nonspecific fluorescent dye label, which ends up at the 5' end of each extension product. Instead of the label being included in the terminating nucleotide at the 3' end of the fragment, the label is in the primer at the 5' end. Four separate reactions are still required, but the dye labels can be read using an optical system instead of film or phosphor storage screens, so the approach is faster, cheaper, and easier to automate. It is known as 'dye-primer sequencing'.

Dye terminator sequencing
An alternative to labelling the primer is to label the terminators instead, an approach commonly called 'dye terminator sequencing'. Its major advantage is that the complete sequencing set can be performed in a single reaction, rather than the four needed with the labelled-primer approach. This is accomplished by labelling each of the four dideoxynucleotide chain-terminators with a different fluorescent dye, each of which fluoresces at a different wavelength. This method is easier and quicker than the dye-primer approach, but may produce more uneven data peaks (different heights), owing to template-dependent differences in the incorporation of the bulky dye chain-terminators. This problem has been significantly reduced by the introduction of new enzymes and dyes that minimize incorporation variability.
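
Because each terminator carries its own dye, base calling from a single reaction reduces to ordering the fragments by size and translating colours to bases. The following is a minimal sketch of that idea; the colour assignment and the `call_bases` helper are hypothetical, and real base callers also handle peak heights, spacing and noise.

```python
# Single-lane dye-terminator readout (illustrative sketch).
# Hypothetical dye colour assigned to each dideoxynucleotide.
DYE_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}

def call_bases(peaks):
    """Translate (fragment_length, dye_colour) pairs into a sequence.

    Fragments are ordered by length, shortest first, as they would pass
    the detector; each colour identifies the terminating base.
    Assumes exactly one peak per fragment length.
    """
    return "".join(DYE_TO_BASE[dye] for _, dye in sorted(peaks))

# Peaks listed in arbitrary order, as an unsorted detector log might be.
peaks = [(3, "yellow"), (1, "green"), (4, "blue"), (2, "red")]
print(call_bases(peaks))  # ATGC
```

With dye-primer or radioactive labelling the same information has to be collected from four separate lanes, since the label alone does not say which base ended the fragment.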

This method is now used for the vast majority of sequencing reactions as it is both simpler and cheaper. The major reason for this is that the primers do not have to be separately labelled (which can be a significant expense for a single-use custom primer), although this is less of a concern with frequently used 'universal' primers.

Automation and sample preparation
Modern automated DNA sequencing instruments (called DNA sequencers) are able to sequence as many as 384 fluorescently labelled samples in a batch (run) and perform as many as 24 runs a day. These instruments perform only the size separation and peak reading; the actual sequencing reaction(s), cleanup and resuspension in a suitable buffer solution must be performed separately.

The magnitude of the fluorescent signal is related to the number of strands of DNA that are in the reaction. If the initial amount of DNA is small, the signals will be weak. However, the properties of PCR allow one to increase the signal by increasing the number of cycles in the PCR programme.

Large-scale sequencing strategies
Current methods can directly sequence only short lengths of DNA at a time. For example, modern sequencing machines using the Sanger method can achieve a maximum of around 1000 base pairs. This limitation is due to the geometrically decreasing probability of chain termination at increasing lengths, as well as physical limitations on gel size and resolution.
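
The geometric fall-off can be sketched numerically. The simplified model below assumes that every extension step terminates with the same probability p, regardless of the base involved; the value p = 0.01 is an arbitrary illustration, not a measured figure.

```python
# Simplified model of chain termination (illustrative sketch).
# Assumption: each extension step terminates with the same probability p.

def termination_probability(n, p):
    """Probability that a growing chain terminates exactly at length n.

    The chain must survive n - 1 steps without termination and then
    terminate at step n, giving a geometric distribution.
    """
    return p * (1 - p) ** (n - 1)

p = 0.01  # hypothetical per-base termination probability
for n in (1, 100, 500, 1000):
    print(n, termination_probability(n, p))
```

Under this model the fraction of fragments ending at a given position shrinks by a constant factor with each additional base, so the signal from long fragments eventually becomes too weak to read. Lowering p shifts the distribution toward longer fragments at the cost of weaker signal for short ones.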

It is often necessary to obtain the sequence of much larger regions. For example, even simple bacterial genomes contain millions of base pairs, and the human genome has more than 3 billion. Several strategies have been devised for large-scale DNA sequencing, including primer walking (see also chromosome walking) and shotgun sequencing. These involve taking many small reads of the DNA through the Sanger method and subsequently assembling them into a contiguous sequence. The different strategies have different tradeoffs in speed and accuracy; for example, the shotgun method is the most practical for sequencing large genomes, but its assembly process is complex and potentially error-prone.
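
The assembly step can be illustrated with a toy greedy algorithm that repeatedly merges the pair of reads with the longest suffix-to-prefix overlap. This is only a sketch under strong assumptions (error-free reads, all from the same strand, no repeats); the function names and the `min_len` parameter are hypothetical, and production assemblers use considerably more elaborate methods.

```python
# Toy greedy overlap assembly (illustrative sketch; assumes error-free,
# same-strand reads and a target sequence without repeats).

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads by best overlap until no overlaps of min_len remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads wholly contained in another read.
    contigs = [r for r in contigs
               if not any(r != s and r in s for s in contigs)]
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs

reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

Repeated sequences are what make real assembly error-prone: when two different regions share the same stretch, a greedy merge like this one can join reads that do not actually belong together.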

It is easier to obtain high-quality sequence data when the desired DNA has been separated from any contaminants in the original sample and amplified. This can be achieved through PCR if it is practical to design primers that cover the entire desired region. Alternatively, the sample can be cloned using a bacterial vector, harnessing bacteria to "grow" copies of the desired DNA a few thousand base pairs at a time. Most large-scale sequencing efforts involve the preparation of a large library of such clones. The advantage of sequencing clones rather than PCR products is that nonspecific PCR products, which can cause signal noise, are virtually eliminated.