About PGN

PGN is a repository for plant EST sequence data located at Cornell. It comprises an analysis pipeline and a website, and presently contains mainly data from the Floral Genome Project. However, PGN also accepts submissions from other sources. This page gives more information about the methods used in the analysis pipeline.

Sequence processing pipelines
PGN has developed a standard sequence analysis pipeline consisting of base calling using Phred, screening for vector and E. coli sequence contamination, and unigene assembly. In addition, a database was developed that also serves as the back end for the Plant Genome Network website (www.pgn.cornell.edu). The analysis pipeline and the sequence database are tightly integrated. The quality screen begins by trimming low quality sequence, based on Phred scores, using a custom algorithm that works as follows: to extend the high quality sequence as far as possible for a given quality threshold, the sequence is scanned and, at each position, the difference between the quality score and the quality threshold (termed the "adjusted score") is summed along the length of the sequence. The high quality sequence is defined as the region in which the sum of the adjusted scores is maximal. Importantly, this region can include short stretches of lower scoring nucleotides if they are "compensated" by higher scoring downstream sequence. Next, putative polyA tails are removed if they contain more than 20 consecutive adenine residues. A contamination screen is then performed to remove E. coli chromosomal sequences from the dataset. In a final quality screening step, sequences shorter than a length threshold (150 bp), sequences with more than 4% ambiguous base calls (Ns), and sequences of low complexity (defined as sequences composed of more than 60% of a single nucleotide) are rejected. The rejected sequences are not used in unigene builds, but are retained in the database along with the reason for their rejection. For each library, a quick evaluation assembly is generated using Phrap to assess the redundancy of the sequences.
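The trimming and final screening steps described above can be sketched in Python as follows. This is only an illustrative sketch, not the PGN pipeline code: the function names, the default quality threshold of 20, and the wording of the rejection reasons are assumptions made for the example. The 150 bp, 4% and 60% cutoffs are taken from the text. The trimming step is written as a maximum-sum contiguous region search over the adjusted scores, which is one straightforward way to realize the description.

```python
def trim_high_quality(quals, threshold=20):
    """Return (start, end), a half-open interval of the region whose summed
    adjusted score (quality - threshold) is maximal.

    The threshold default is an assumption; the source does not state it.
    Returns (0, 0) if no position scores above the threshold.
    """
    best_sum = 0
    best_region = (0, 0)
    running_sum = 0
    region_start = 0
    for i, q in enumerate(quals):
        running_sum += q - threshold
        if running_sum <= 0:
            # A region with a non-positive sum cannot help any later region,
            # so start again after this position.
            running_sum = 0
            region_start = i + 1
        elif running_sum > best_sum:
            best_sum = running_sum
            best_region = (region_start, i + 1)
    return best_region


def final_screen(seq, min_length=150, max_n_fraction=0.04,
                 max_single_base_fraction=0.60):
    """Return (passed, reason) for the final quality screen.

    The reason strings are hypothetical labels chosen for this sketch.
    """
    seq = seq.upper()
    if len(seq) == 0 or len(seq) < min_length:
        return False, "too short"
    if seq.count("N") / len(seq) > max_n_fraction:
        return False, "too many ambiguous bases"
    if max(seq.count(base) for base in "ACGT") / len(seq) > max_single_base_fraction:
        return False, "low complexity"
    return True, None


# The dip at index 4 (score 10) is kept because the higher scoring
# downstream bases compensate for it.
example_quals = [5, 5, 30, 30, 10, 40, 40, 5, 5]
example_region = trim_high_quality(example_quals, threshold=20)
```

In this example the selected region is positions 2 to 6 inclusive, so the single low scoring base in the middle stays inside the high quality sequence, as the description of the algorithm requires.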

Unigene building
During a sequencing project, unigene builds are generated at regular intervals and again at the end of the project, combining all libraries for a given organism. The sequences are first preclustered, and the resulting clusters are then assembled with the CAP3 program (Huang et al., 1997). Sequences are also checked for length, complexity, and contamination, using the same parameters as for the evaluation builds, and extensive chimera detection is performed. The builds are then uploaded to the database, where each unigene is assigned a unique unigene ID. This ID remains unique and is never reused: when the unigene set is built again (such as when new sequence for a library becomes available), every unigene in the new build receives a new ID. Unigenes from a newer build can be traced back to older builds through the ESTs that they share, and a complete history of unigene IDs is available on the website for this purpose.
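Tracing unigenes between builds through shared ESTs can be illustrated with a short Python sketch. The data layout (a dictionary from unigene ID to a set of EST IDs), the function name, and the example IDs are all hypothetical and chosen for illustration. The sketch also assumes that each EST belongs to at most one unigene within a build. It does not reflect the actual PGN database schema.

```python
def trace_unigenes(old_build, new_build):
    """Map each unigene ID in new_build to the sorted list of old_build
    unigene IDs with which it shares at least one EST.

    Both arguments are dicts of the form {unigene_id: set_of_est_ids}.
    """
    # Index the old build by EST so that each lookup is a single dict access.
    est_to_old_unigene = {}
    for old_id, ests in old_build.items():
        for est in ests:
            est_to_old_unigene[est] = old_id

    history = {}
    for new_id, ests in new_build.items():
        shared = {est_to_old_unigene[est] for est in ests
                  if est in est_to_old_unigene}
        history[new_id] = sorted(shared)
    return history


# Hypothetical example: two old unigenes are merged in the new build,
# and one new unigene consists only of newly added sequence.
old_build = {"U1": {"est1", "est2"}, "U2": {"est3"}}
new_build = {"U10": {"est1", "est2", "est3"}, "U11": {"est4"}}
history = trace_unigenes(old_build, new_build)
```

Here `history["U10"]` lists both `U1` and `U2`, while `history["U11"]` is empty because none of its ESTs were present in the older build.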

Annotation of sequence data
For functional annotation, BLAST is used to find the best match of each unigene sequence in the GenBank NR database and in the complete coding sequences from Arabidopsis. These annotations are stored in the database and serve as the primary source of annotation. The annotation framework will be extended to Gene Ontology annotations in the future.
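As a rough illustration of selecting one best match per unigene, the following Python sketch picks the top hit from BLAST results in the standard 12-column tabular format. Several points here are assumptions: the source does not say which BLAST output format PGN uses, nor how "best" is defined. The sketch takes the hit with the highest bit score. The function name and the example identifiers are hypothetical.

```python
def best_hits(tabular_lines):
    """Return {query_id: (subject_id, evalue, bitscore)} holding, for each
    query, the hit with the highest bit score.

    Expects tab-separated lines in the standard 12-column BLAST tabular
    layout, where column 11 is the e-value and column 12 the bit score.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits for two unigenes.
example_lines = [
    "U10\tsubjA\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t180.0",
    "U10\tsubjB\t99.0\t210\t2\t0\t1\t210\t1\t210\t1e-90\t390.0",
    "U11\tsubjC\t80.0\t150\t30\t0\t1\t150\t1\t150\t1e-10\t60.0",
]
annotations = best_hits(example_lines)
```

For `U10` the second hit is kept because it has the higher bit score. A real pipeline would likely also apply an e-value cutoff before accepting a match as an annotation.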