Proteins

In this exercise you will learn how to search for protein structures. Starting from a sequence of amino acids, you will use a combination of tools (BLAST, Swiss-Prot, Pfam, Swiss-Model, and PDB) to work out the protein's physical properties and likely structure.

This is a fun unit, but it will test your skills and patience as you explore the complexities of sequence-structure-function. You'll use the infamous SPDBV viewer and RasMol applications, both of which work pretty well and produce beautiful structures that you can print or export as images.

PDB - The Protein Data Bank. This site contains a database of over 25,000 structures and domains, from which you can download .pdb files that can be viewed using RasMol or the Swiss-PdbViewer (SPDBV). PDB is also available at NCBI as a database that you can search or BLAST.

Pfam - Protein families database of alignments and HMMs. You'll use Pfam extensively to look up and compare conserved domains. NCBI has an extensive Conserved Domain Database (CDD) that you'll use to search and retrieve using CDART: Conserved Domain Architecture Retrieval Tool. You'll use Pfam, CDD, and CDART to find related sequences to your protein to predict function and structure.

PIR - Protein Information Resource, includes Protein Sequence Database, iProClass, and PIR-NREF for researching proteins.

ScanProsite - Take your sequence (or COMT) into PROSITE and search for patterns and profiles. The results include post-translational modification sites, each of which is linked to its documentation entry. For example, you can see the cAMP- and cGMP-dependent protein kinase phosphorylation site from this link. What post-translational modifications occur in your protein?

SMART - Simple Modular Architecture Research Tool. Allows you to do a search against sequences with known domains and motifs. You will be searching with HMMER, Pfam, as well as looking for signal peptides and internal repeats.

Swiss-Prot - Protein Database of EMBL-EBI. Contains entries for most known proteins and enzymes, including tools for calculating physical properties, and links to metabolic pathways, including KEGG. Swiss-Prot is a key Web resource with links to most other databases.

Swiss-Model - An Automated Comparative Protein Modeling Server. From this link you can perform structure searches, including the First Approach mode for comparing your sequence against (PDB) models, giving you a rough approximation of what your structure might be.

UniProt - UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

NCBI Structure Resources and using Cn3D - Use NCBI's PDB archive and Molecular Modeling Database (MMDB) to search for and compare protein structures. You'll use VAST to compare multiple protein structures (and their aligned sequences) in the Cn3D viewer.
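
To make the ScanProsite pattern search described above more concrete, here is a minimal sketch of how a simple PROSITE-style pattern can be turned into a regular expression and scanned against a sequence. The pattern used for the cAMP- and cGMP-dependent protein kinase phosphorylation site, [RK](2)-x-[ST], is quoted from memory of the PROSITE entry and should be checked against your own ScanProsite results. The function names and the toy sequence are made up for illustration, and the converter handles only the basic pattern syntax, not everything ScanProsite supports.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into Python regular-expression syntax."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")   # (2) or (2,4) = repeat count
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex


def find_sites(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every hit, overlapping hits included."""
    # A lookahead lets the scan report matches that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


# Assumed pattern for the cAMP/cGMP-dependent kinase site; verify in PROSITE.
CAMP_SITE = "[RK](2)-x-[ST]"

# Made-up toy sequence, not COMT.
toy_sequence = "AAKRASAARRKTAA"

for start, text in find_sites(toy_sequence, CAMP_SITE):
    print(start, text)
```

A short pattern like this one matches by chance in many proteins, so a hit is a candidate site to investigate, not proof that the modification occurs.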

Assignment deliverables:

Take your favorite protein (or use COMT, or Gag-Pol of HIV and SIV) and explore the sites above and the exercises below. Also look at HIV-SIV_gag-pol_FASTA.txt, which shows the gag-pol amino acid sequence for HIV-1, HIV-2, SIV-1, and SIV-2, and which you can import into VAST.
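
If you want to check what a multi-record FASTA file such as HIV-SIV_gag-pol_FASTA.txt contains before pasting sequences into a web form, a few lines of Python are enough. The sketch below assumes the file follows the usual FASTA convention (a header line starting with ">" followed by one or more sequence lines); the function name and the toy records are made up for illustration and are not the real gag-pol sequences.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for every record in FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; join them and drop inner spaces.
            records[header] += line.replace(" ", "").upper()
    return records


# Made-up placeholder records standing in for the real file contents.
example = """>toy_record_1
ACDEFGHIKL
MNPQRSTVWY
>toy_record_2
ACDEFGHIKM
"""

for name, seq in parse_fasta(example.splitlines()).items():
    print(name, len(seq))
```

To read the actual file, open it and pass the file object in place of example.splitlines(), since a file object also yields one line at a time.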

Before you explore on your own, you are also encouraged to work through protein_structure_exercise.htm, which will help you master the navigation of the structure sites, which can be quite tricky. It will take you a couple of hours to complete.

Starting with PDB, find the structure for your protein. You may have done that already, but there are other ways to locate your structure file. You can start by finding the structure entry for your protein at NCBI Structure, and then looking for the PDB link into The Protein Data Bank.

BLAST the PDB database: A more interesting approach, and one that will help you with protein structures, is to first get the FASTA protein sequence for COMT, and then BLAST the PDB database at NCBI. When you choose BLASTp, make sure to set the pull-down menu to PDB. Note that before BLAST actually looks for local alignments, it returns a page indicating that it has found conserved domains. You *might* want to click on the rounded rectangular graphic (for COMT it's COG4122) and see where that takes you. But we'll come back to that later on.

The BLAST results from above should give you four entries. 1VID is our structure (you can now retrieve the file from http://www.rcsb.org/pdb/). Follow the link into GenPept for 1VID, and also click on the small red box (for structure). That will take you to a page with entries for related structures in the Molecular Modeling Database (MMDB) at NCBI. Follow the links into related structures if you wish. Those links will start you looking at Multiple Sequence Alignments (MSA), which are the focus of assignment four (evolutionary studies through MSA).
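
Once you have downloaded a .pdb file such as the one for 1VID, it can be useful to see that it is just fixed-column text. The sketch below is one possible way to pull the amino acid sequence of a chain out of the ATOM records by reading the alpha-carbon (CA) atoms. It assumes the standard PDB column layout (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26). The function names are hypothetical, and the demo lines are made-up records with invented coordinates, not data from 1VID.

```python
from typing import Iterable

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def ca_sequence(pdb_lines: Iterable[str], chain: str = "A") -> str:
    """Return the one-letter sequence of a chain, read from its CA atoms."""
    residues = []
    seen = set()
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        atom_name = line[12:16].strip()
        res_name = line[17:20]
        chain_id = line[21]
        res_key = line[22:27]  # residue number plus insertion code
        if atom_name != "CA" or chain_id != chain or res_key in seen:
            continue
        seen.add(res_key)  # ignore alternate locations and repeated models
        residues.append(THREE_TO_ONE.get(res_name, "X"))  # X = non-standard residue
    return "".join(residues)


def toy_atom_line(serial, atom, res, chain, resseq, x, y, z):
    """Format a column-correct ATOM record for the demo (hypothetical helper)."""
    return "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, atom, res, chain, resseq, x, y, z)


# Made-up records; the coordinates are invented.
demo = [
    toy_atom_line(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    toy_atom_line(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
    toy_atom_line(3, " CA ", "GLY", "A", 2, 24.100, 26.000, 3.500),
    toy_atom_line(4, " CA ", "LYS", "B", 1, 10.000, 11.000, 12.000),
]

print(ca_sequence(demo, chain="A"))
print(ca_sequence(demo, chain="B"))
```

The sequence read this way covers only the residues that were resolved in the structure, so it may be shorter than the full sequence you retrieved from GenPept.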

BLASTing the PDB database takes us to entries where the structures of matching proteins are already known. Where there is homology (inferred from strong sequence similarity), the structure is likely to be similar, and so the function is likely to be conserved or at least similar. Remember that mantra - sequence-structure-function!
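
"Strong sequence similarity" is usually summarized as a percent identity over an alignment. As a rough illustration of what that number means, the sketch below counts identical residues in two sequences that have already been aligned, with "-" marking gaps. Skipping gapped columns in the denominator is an assumption made for this sketch; BLAST reports identities over the full alignment length, so its figure can differ. The function name and the toy alignment are made up.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned (non-gap) columns where both sequences share a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap column: not counted in this convention
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Made-up toy alignment: one gap column and one mismatch.
print(percent_identity("MKT-AYIAK", "MKTQAYLAK"))
```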

Pfam - Using our FASTA sequence we can delve into Pfam to find more related sequences, but more importantly, we'll identify which family of proteins our sequence belongs to. Take the FASTA sequence from NCBI and go to Pfam. From there choose protein search, and enter the COMT sequence, or your protein's sequence. Be patient, this can take a few minutes. What is the domain of the COMT sequence? You should have O-methyltransferase. Follow some of the links from that entry to learn more about the biological function of this domain.

PIR - The Protein Information Resource (PIR) is a database of protein information that is very well cross-linked to other databases. Try the NREF search and enter your protein sequence (or COMT). We will try to get a pattern match, or a list of post-translational modifications that our protein is likely to undergo. Write down the NREF, iProClass, PIR-PSD, Swiss-Prot / TrEMBL, and Superfamily entries. Follow the Superfamily link for COMT. What information were you able to retrieve? Go to the Domain - Motif display. Click on each motif. What information are you seeing?

Follow the iProClass link for COMT or whatever protein sequence you are now using. Scroll down the iProClass entry until you get to PIR Feature & Post Translational Modifications for COMT. Follow the link to AA059. Follow the link to InterPro for P21964 (COMT). Explore!

Swiss-Prot - Go to Swiss-Prot and follow the links at the bottom of the page to find physical properties for COMT or for your sequence.
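
To get a feel for what such physical-property tools calculate, the sketch below estimates two of the simplest properties directly from a sequence: the average molecular weight and the amino acid composition. It assumes the commonly tabulated average residue masses (in daltons) and adds one water molecule for the free termini. The function names and the toy sequence are made up, and the result is only an approximation of what the Swiss-Prot tools report, since they may also account for modifications or processing.

```python
from collections import Counter
from typing import Dict

# Average residue masses in daltons (amino acid minus one water), 20 standard residues.
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.015  # one water added back for the N- and C-terminal groups


def molecular_weight(sequence: str) -> float:
    """Approximate average molecular weight (Da) of an unmodified protein chain."""
    seq = sequence.strip().upper()
    unknown = sorted(set(seq) - set(RESIDUE_MASS))
    if unknown:
        raise ValueError("unsupported residue codes: " + ", ".join(unknown))
    if not seq:
        return 0.0
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def composition(sequence: str) -> Dict[str, float]:
    """Percentage of each residue type present in the sequence."""
    seq = sequence.strip().upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}


# Made-up toy sequence, not COMT.
toy = "MKTAYIAKQR"
print(round(molecular_weight(toy), 1))
print(composition(toy))
```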

Swiss-Model - From Swiss-Model you can either search for structures or have the server perform a 'first approach' estimate of your protein's structure based upon its homology to one or more known protein structures. Go to Swiss-Model and then to the link to First Approach mode. Type your sequence (or COMT) into the search box. Enter your email address, name, and a memorable title (like COMT sequence). With luck, Swiss-Model will email you an attachment in about 10 to 15 minutes. You'll need to download and install DeepView (Swiss-PdbViewer, SPDBV). Go to this link to find and download it. You can also use RasMol to display PDB files. RasMol is a little leaner; SPDBV has some power to it!

Cn3D - (If you are uncertain about installing this software, you can simply use NCBI Cn3D 4.1 software for the structure exercises.) Cn3D is a great structure viewer tool that works with the Molecular Modeling Database (MMDB) at NCBI. You can view a single structure (sequence), or multiple sequences to determine where structural differences occur. You'll use Cn3D in combination with VAST in the exercise below.

VAST - NCBI allows you to compare multiple structures using VAST and the Cn3D viewer. Follow this tutorial (it's a little tricky). If you have time, try to compare two different proteins, perhaps from HIV-SIV or HTLV-STLV. Be forewarned that this software isn't always 'intuitive'.

Email me at rdcormia@earthlink.net if you have specific questions. Remember that the goal of this assignment is to explore, not conquer!


Protein Structure Prediction (see also protein_structure_exercise.htm and Cn3D lecture - structure.htm)


Predicting the three-dimensional structure of proteins, either by comparison with structures of proteins and motifs already determined by X-ray crystallography and/or NMR, or by a first approach estimate, is an important and difficult part of bioinformatics in general and proteomics specifically.


This lesson is copyrighted using an Educational Common License, and may be used freely without restriction for academic purposes.

Robert D. Cormia

rdcormia@earthlink.net