Description
Purple sea urchins (Strongylocentrotus purpuratus, or Sp) are invertebrates that share more than 7,000 genes with humans, more than other common invertebrate model organisms such as fruit flies and worms. In addition, the innate immune system of sea urchins demonstrates unprecedented complexity. These factors make the sea urchin a very interesting organism for investigations of immunology. Of particular interest is the set of proteins in Sp that contain C-type lectin (CLECT) domains, functional regions of a protein that recognize sugars. Proteins containing CLECTs may be particularly important to immune system robustness because sugars are present on the surfaces of pathogens.
The primary goals of this research project are first to identify all the CLECT-containing proteins in the Sp genome, and then to predict their function based on similarity to characterized proteins in other species (protein homology or similarity). The latter goal is particularly challenging and requires new and creative analysis methods.
From an informational viewpoint, a protein is represented by a unique sequence of letters, each letter corresponding to an amino acid. For example, G-A-V indicates the sequence glycine, alanine, and valine. Similarity between proteins is usually measured by sequence alignment; that is, by directly comparing the sequences of letters of two proteins. Algorithms and tools for these alignments are among the most standardized and widely available in bioinformatics.
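To make the idea of comparing letter sequences concrete, here is a minimal sketch of a global (Needleman-Wunsch style) alignment score. It is an illustration only, not the tooling used in this project: the function name and the flat scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for readability, whereas production alignment tools typically use amino acid substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b.

    Scoring values are illustrative assumptions, not a substitution matrix.
    """
    # prev[j]: best score for aligning the first i-1 letters of a
    # with the first j letters of b (row i-1 of the scoring table).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning i letters of a against nothing costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = prev[j - 1] + (match if same else mismatch)
            gap_in_b = prev[j] + gap      # letter of a aligned to a gap
            gap_in_a = curr[j - 1] + gap  # letter of b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GAV", "GAV"))
    print(global_alignment_score("GAV", "GLV"))
```

Keeping only the previous row of the scoring table is enough when just the score is needed; recovering the alignment itself would require storing the full table.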
Sequence similarity between homologous proteins can degrade over long evolutionary timescales. This is in part because some mutations at the sequence level can occur without compromising a protein's overall function. This is akin to the evolution of a language: modern English and Middle English, for example, initially appear to be separate languages because of spelling differences. Because domains are regions of a protein that can function semi-independently, they are less tolerant of mutations and are therefore better conserved. By comparing proteins based on the ordering of their domains, or their "domain architecture", it becomes possible to identify homology between proteins separated by extensive evolution.
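A domain architecture comparison can be sketched by treating each protein as an ordered list of domain names rather than a string of amino acid letters. The example below is a rough illustration using the standard library's `difflib.SequenceMatcher`, which is a stand-in for a dedicated domain alignment tool and not the method used in this project. The function name and the two architectures are hypothetical; only CLECT comes from the text above, and the other domain names are placeholders.

```python
from difflib import SequenceMatcher


def architecture_similarity(domains_a, domains_b):
    """Return a 0..1 similarity between two ordered lists of domain names.

    The ratio is 2 * (number of matched domains) / (total domains in both
    lists), with matches required to preserve domain order.
    """
    matcher = SequenceMatcher(None, domains_a, domains_b, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    # Hypothetical architectures, for illustration only.
    urchin_protein = ["CLECT", "EGF", "CCP"]
    candidate_homolog = ["CLECT", "EGF", "EGF", "CCP"]
    print(architecture_similarity(urchin_protein, candidate_homolog))
```

Because the comparison works on whole domains, two proteins can score highly here even when their underlying amino acid sequences have diverged considerably.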
Alignment tools based on domain architecture are promising but still in their infancy. Consequently, very few researchers use sequence and domain alignment methods together to corroborate one another. Using Python scripts in tandem with various web tools and databases, we have identified the top alignment candidates for the CLECT-containing Sp proteins using both methods. With the help of the Enthought Tool Suite, we have created a simple visualization tool that allows users to examine the sequence alignments side by side with two types of domain alignments. Taken together, these three sets of results are much more informative for predicting protein function than any single method alone. Finally, we have developed a systematic set of heuristic rules that allow users to make objective comparisons among the three sets of results. The results can later be parsed using Python scripts to make quantitative and qualitative assessments of the dataset. We believe that these new comparison and visualization techniques will apply generally to computational proteomics.