WebQuest Step-by-Step
1) NCBI Tour
2) BLAST
3) Gene / Map Viewer
4) OMIM, UniGene, SAGE
5) UniProt, Swiss-Prot
6) Pfam, PIR
7) PDB, Swiss-Model
8) GeneCards
9) Biology Workbench
10) KEGG
NCBI tour - We start at http://www.ncbi.nlm.nih.gov/ Start by looking at the title bar.
PubMed, Entrez, Gene, BLAST, OMIM, Books, TaxBrowser, Structure
We will explore each of the NCBI databases below in our WebQuest:
PubMed - The public version of Medline.
Entrez - The search retrieval tool for NCBI databases.
GenBank - NCBI primary database for nucleotide and protein data
BLAST - Basic Local Alignment Search Tool
Gene - Linking a gene or translated protein to a specific chromosomal position
dbSNP - NCBI database of variation
OMIM - Online Mendelian Inheritance in Man
Books - Books that you can read and print from online. Great resources.
TaxBrowser - Links to organisms and whole genome resources for each.
Structure - A newer feature at NCBI that uses the Cn3D molecular viewer.
Under Hot Spots, you will see:
We will visit some of these in assignment one, but try to take time to familiarize yourself with these resources throughout the class.
NCBI - We will spend time looking at Gene, UniGene, and SAGE resources initially, but do explore each of these as time permits. So let's start out at NCBI. As we explore, you can always return to the home page by clicking in the upper left-hand corner.
Exploring PubMed - This is the gate to a world of Medline publications, all available by abstract, and some full-text. There are links to full-text documents that you may purchase online (not cheap) but better than a trip to the Stanford library. We'll start with a few searches for HIV immunity, and the delta 32 mutation in CCR5. Start by typing in HIV and you'll get 198318 (plus / minus) documents. Try HIV immunity and it's 16901. Try "HIV immunity delta 32 CCR5 mutation" and you have 11 documents. Read the abstract for PMID: 9433423. Were you able to find it? (Use the PMID number).
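The same narrowing searches can also be scripted. The sketch below is only an illustration, and it assumes NCBI's public E-utilities service (esearch and efetch), which this handout does not otherwise cover. It builds the request URLs for the three searches above and for the abstract of PMID 9433423, so you can paste them into a browser. The function names are made up for this example.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities endpoint (not part of the handout).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pubmed_search_url(term, retmax=20):
    """Build an esearch URL that searches PubMed for `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def pubmed_abstract_url(pmid):
    """Build an efetch URL that returns one abstract as plain text."""
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Each search adds keywords, so each should return fewer documents.
    for term in ["HIV", "HIV immunity", "HIV immunity delta 32 CCR5 mutation"]:
        print(pubmed_search_url(term))
    print(pubmed_abstract_url("9433423"))
```

The document counts you get back will differ from the numbers quoted above, since PubMed grows every day.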
Now let's go to COMT (catechol-O-methyltransferase). Just try "COMT" - and you have 2684 hits. Now try "COMT and mood disorders" and you have 83 hits. What is the mutation that is associated with chronic schizophrenia? Look for Val / Met. COMT will be our theme as we investigate NCBI even deeper. Look at the article titled "Molecular Genetics of Mood Disorders". Now use that as a query. Type 12192614 into the text search, and follow the link to the abstract. Can you purchase this document? Where, and for how much? What else does the Val / Met mutation do? (See the CNN article that links COMT to pain).
As time permits, come back to NCBI PubMed and explore a topic of interest to you. We'll do a few light BLAST exercises, then UniGene and Gene. It never hurts to take a topic of interest into Google to build a better set of keywords for searching PubMed. Look for PDF articles too.
BLAST - I have attached (in the appendix) a short version of the BLAST tutorial, and also supplied you with a longer version. Refer to those handouts for our workshop. You don't need to do a lot of BLAST at this moment - we'll do much more in a separate exercise very soon.
UniGene and Gene - These are two very powerful resources at NCBI, and they work hand-in-hand as we explore our COMT gene and its place in the genome. We'll start with UniGene first. Look at the URL, and the Entrez component. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene Type in COMT. The URL of the page returned should be http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=370408 How many hits were there? What are these organisms? Display the human locus Hs.370408. Look at the entire page, and scroll all the way to the bottom. Did you find the expressed sequence tags at the bottom of the document? At the top of the document there is a link to PIR, pir:A38459. This is an accession number (we will explore that more later). If this link does not work, try the entry for COMT_HUMAN in UniProt.
We'll now search for COMT in MapViewer. Can you find it? Go to the NCBI front page, then look under 'hot spots'. Choose MapViewer. From the pull-down menu, choose Homo sapiens (human) and in the search box, type COMT. The URL of the page returned should be: http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=22&MAPS=genec,pheno,ugHs,genes-r&cmd=focus&fill=80&query=uid(18246192,14219869)&QSTR=COMT It's on chromosome 22. Can you spot the link? Follow that link, but also come back and visit the MIM link at the bottom of that page. Choose the link to the gene neighborhood, and then to schizophrenia.
As you move back and forth between Gene, UniGene, MapViewer, and HomoloGene, it's easy to get confused. Try to pay close attention to page headings, and especially URLs (and the lower left hand corner of your Web browser for links), and you will pick up navigation faster.
Gene - This is one of the more fascinating pages in all of NCBI, as it connects our locus, COMT, with other key reference points. Gene is almost a 'dictionary' entry for our gene. First let's find out what we can about COMT. From the front page of Gene, type in COMT, and the URL returned should be http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=1312 For those of you who remember LocusLink (now retired), the uid=1312 is the way that we follow the COMT 'locus' into the various NCBI databases. From the link above, look at the very top of the page for links to HGNC:2228 and MIM_116790, which you should visit to see how related sites show links to information about COMT. Scan this entire page, all the way to the bottom, and take a moment to follow PharmGKB PA117.
MapViewer - From the MapViewer link, choose Homo sapiens and type in COMT. You'll see the set of human chromosomes; look for the link on number 22. That will show you the neighborhood our gene lives in. Navigate into chromosome 22, and see the neighborhood of genes for COMT. We're not going to spend a lot of time on these right now, as they are the focus of the "genome navigation". Practice here will benefit you when you start to use other genome browsers, such as the UCSC Genome Browser, and Perlegen's Haplotype and Genotype Browsers.
HomoloGene - HomoloGene is a resource of curated and calculated orthologs for genes as represented by UniGene or by annotation of genomic sequences. Search the database for COMT, and explore the links to the genes conserved in mammals, and in particular the link to HomoloGene:19664. On that page, look at the protein entries in PIR, as well as the tools to do pairwise alignments. Do an alignment of the mouse and rat COMT proteins. How identical are they? Where do they differ? How significant is that difference? (try this in Cn3D).
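To make "how identical are they?" concrete, here is a minimal sketch of how percent identity can be computed from two sequences that have already been aligned (with "-" marking gaps). It is an illustration only: alignment tools differ in what they use as the denominator, and this version counts every column where at least one sequence has a residue. The function name and the short test strings in the usage lines are invented for the example; they are not the real mouse or rat COMT sequences.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (percent identity, list of differing columns) for two aligned sequences.

    Both inputs must be the same length, with '-' marking alignment gaps.
    Columns are numbered from 1, as in most alignment displays.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    identical = 0
    differences = []
    for column, (a, b) in enumerate(zip(aligned_a, aligned_b), start=1):
        if a == "-" and b == "-":
            continue  # a column of two gaps carries no information
        compared += 1
        if a == b:
            identical += 1
        else:
            differences.append((column, a, b))

    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * identical / compared, differences


if __name__ == "__main__":
    # Hypothetical ten-column alignment, for illustration only.
    identity, diffs = percent_identity("MGDTKEQRIL", "MGDTKEQ-IL")
    print(f"{identity:.1f}% identical; differences at {diffs}")
```

The list of differing columns answers "where do they differ?"; whether a difference matters is a question for the structure, which is why the exercise points you to Cn3D.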
GenBank and GenPept. The accession numbers for COMT are Z26491 (nucleotide) and CAA81263 (protein), respectively. Understanding the NCBI data model, and the GenBank Flat File (GBFF) is the subject of our second and third weeks. From the main page at NCBI, select nucleotide or protein as the database to search, and enter Z26491 (nucleotide) or CAA81263 (protein), and follow the link to the GenBank entry.
Let's also try a more general search of GenBank from the NCBI home page. Choose search nucleotide. Type in just "COMT". You should get 164 entries. Now try COMT human. You should get 156 entries. Now let's type in Z26491. You will get only one entry returned. Click on that link, which will display the GenBank entry for Z26491 - H.sapiens gene for catechol O-methyltransferase. When you are inspecting a GenBank record, look at the display format - it's the button near the top of the file on the left-hand side. Choose GenBank and FASTA displays, and to the right of that entry, choose (send to) text as your output. You will learn to save GenBank and FASTA displays in text formats (see below). The GBFF is returned as "default". Try to get in the habit of displaying the file as "GenBank" or "FASTA". You can "send" the entry to "file", "text" or "clipboard". Choose text, and save the GenBank and FASTA displays as text files (make sure to save each file with a .txt extension).
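If you would rather fetch the two displays without clicking through the page, the sketch below shows one possible way. It assumes NCBI's public E-utilities efetch service, which is not part of this handout, and the helper names and the file naming scheme are made up for the example. It only builds the URL and a .txt file name for each display; the download itself is left to your browser.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities efetch endpoint (not part of the handout).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Map the efetch "rettype" value to the display name used on the NCBI page.
DISPLAY_LABELS = {"gb": "GenBank", "fasta": "FASTA"}


def nucleotide_record_url(accession, display="gb"):
    """Build a URL for a nucleotide record in GenBank ('gb') or FASTA format."""
    if display not in DISPLAY_LABELS:
        raise ValueError("display must be 'gb' or 'fasta'")
    params = {"db": "nuccore", "id": accession, "rettype": display, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def text_filename(accession, display):
    """Suggest a file name with a .txt extension (hypothetical naming scheme)."""
    return f"{accession}_{DISPLAY_LABELS[display]}.txt"


if __name__ == "__main__":
    for display in ("gb", "fasta"):
        print(text_filename("Z26491", display), nucleotide_record_url("Z26491", display))
```

Putting the accession number in the file name makes it easy to see later which record a saved file came from.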
Cn3D and the Structure database - these are two very powerful resources that we will explore more when we look at conserved domains and structures. For now, search the structure database for COMT, and choose the 1VID link. From this page, click on the graphic 'chain' bar, and see COMT's structural neighbors. Then go back to the 1VID entry, and view the 1VID structure using Cn3D. You'll need to have it installed, but it is very stable, and is a lot of fun to use. The VAST tutorial for this tool is complicated, so attempt that when you have time and patience.
Swiss-Prot - We will explore our gene, COMT, and its enzyme entry, EC 2.1.1.6. First, go to http://us.expasy.org/sprot/. From the text entry field, we will type in COMT. There are 122564 entries. Fortunately we can go to COMT_HUMAN, which is the entry for our gene. P21964 is our accession number. In Swiss-Prot we have a collection of links to protein resources, including the Protein Data Bank. Look carefully at the links to GenBank, DDBJ, and EMBL, as well as to protein structure tools linked from each page.
UniProt - UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. (It is a relatively new site).
UniProt is composed of three components, each optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-reference. The UniProt Reference Clusters (UniRef) databases combine closely related sequences into a single record. The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences. Try doing a quick search for COMT at UniProt, and look for COMT entries for mouse, pig, human, horse, etc. If you have time, check some of the boxes for COMT, and try a multiple sequence alignment. We will come back to UniProt later in the course, especially looking for structure and enzyme links. For now, just scan http://www.pir.uniprot.org/cgi-bin/upEntry?id=Q6IB07_HUMAN
Pfam and PIR - we will explore families of proteins at Pfam http://www.sanger.ac.uk/Software/Pfam/ , and our specific protein at PIR http://pir.georgetown.edu/. First we go to Pfam. From the home page, choose protein search. Scroll down to protein sequence - single sequence searches. Open up any of the HIV or SIV text files (gag-pol FASTA) and paste in the amino acid sequence. You can also upload a sequence directly from a file, but we'll try that later. After a short (or long) period of time, a page will be returned with alignments. These are the families related to your protein domain. You can even paste in the entire HIV or SIV viral genome (try that).
PIR - Go to http://pir.georgetown.edu/ Remember the PIR sequence that we discovered when exploring NCBI? It was pir:A38459 . Try using the text search field, and enter COMT or COMT_HUMAN. You can also enter a protein sequence in the text box below that. What search worked best for you? What was the superfamily accession number? Hint - you should see Summary Report for iProClass Superfamily: SF005841. Look at the cross-references to Pfam and other bioinformatic protein databases.
PDB and Swiss Model - we'll look up the entry for our protein COMT or COMT_HUMAN. You can find the link to PDB from Swiss-Prot, but PIR will require a text search. You should find an iProClass link to All Sequences in SF005841 . At first, that page will be completely overwhelming. The PDB link is 1VID.pdb Explore the structure data at PDB. There is a Swiss Protein Data Bank Viewer that you can download, but the Cn3D tool at NCBI is more elegant. Hint - you can get some great images of proteins in different size JPEG files to spice up your biology reports!
GeneCards - http://nciarray.nci.nih.gov/cards/ Choose the symbol for text entry, and search for COMT. What is the name of our gene? Is COMT correct? Follow the link to http://thr.cit.nih.gov/cgi-bin/cards/carddisp?COMT Scroll down the page to see the colorful gene expression levels from different tissues. There is a lot of information on this page, which makes it a lot like GeneLynx. Make sure to bookmark this site!
GeneLynx - Your one-stop-shop for links to all the information you would ever need to know about COMT. Try this link directly.
Biology Workbench - The San Diego Supercomputer Center hosts Biology Workbench at http://workbench.sdsc.edu/. This site requires that you create an account (painless) and then you can upload sequences into a session. A word of caution: you need a little tutorial to make this work for you. I also have a Workbench primer at http://www.informaticus.org/COIN81/pdf/How_to Use Biology Workbench.pdf The HIV comparison that I demonstrated in class follows the two papers and the data found in this directory http://www.informaticus.org/COIN81/bioinformatics/HIV_mutation/
You can download the HIV mutation data from the zipped file. The two PDF documents describe the results of the ClustalW and phylogenetic analysis. It's time consuming, but a great project, and a great set of tools. This will be the focus of the fourth assignment in class.
For the Biology Workbench exercise, there are three files to look at, all in FASTA format. These are HIV-1_gag-pol_FASTA.txt, HIV-2_gag-pol_FASTA.txt, and SIV_gag-pol_FASTA.txt. You will upload each of these to a session in Workbench, and use ClustalW as a protein alignment tool to compare the sequences. You can find each of these files in http://informaticus.org/COIN81/bioinformatics/
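Before uploading, it can help to confirm that each file really is in FASTA format: a header line that starts with ">", followed by one or more lines of sequence. The sketch below is a small, hand-rolled reader written for illustration (it is not a Workbench tool, and the function name is invented). The two short records in SAMPLE are made-up placeholders, not HIV or SIV data; to check a real file, open it and pass the file object to the same function.

```python
import io


def read_fasta(handle):
    """Read FASTA text from a file-like object into a list of (header, sequence)."""
    records = []
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up placeholder records, used only to demonstrate the reader.
SAMPLE = ">example_1 toy protein\nMKTAYIAKQR\nQISFVKSHFS\n>example_2 toy protein\nMKTAYLAKQR\n"

if __name__ == "__main__":
    for header, sequence in read_fasta(io.StringIO(SAMPLE)):
        print(header, len(sequence), "residues")
```

A file that raises an error here, or that returns no records, is probably a GenBank flat file or a word-processor document rather than plain FASTA text.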
KEGG - Kyoto Encyclopedia of Genes and Genomes is a resource of genes, pathways, and substrates. We'll do a quick exploration of the tyrosine metabolism that COMT is a part of. You should go to http://www.genome.jp/ and then to KEGG at http://www.genome.jp/kegg/kegg2.html Search all organisms (or just Homo Sapiens) for COMT. Your result should be hsa:1312 COMT; catechol-O-methyltransferase [EC:2.1.1.6] . The direct link is http://www.genome.jp/dbget-bin/www_bget?hsa:1312 Now let's try to find where COMT plays in metabolism. Go to http://www.genome.jp/kegg/metabolism.html or go to the tyrosine metabolism at http://www.genome.jp/kegg/pathway/map/map00350.html Can you find EC 2.1.1.6, our COMT translation product? These links can be tricky, so take your time with this.
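The KEGG result line packs several identifiers into one string, and the direct link is built from the first of them. As a rough sketch, the snippet below pulls the line apart and rebuilds the link using the URL pattern quoted above. The regular expression is written only for entries shaped like the COMT line; other KEGG entries (for example, ones with no EC number) would need a looser pattern. The function names are invented for this example.

```python
import re

# Pattern for lines shaped like:
#   hsa:1312 COMT; catechol-O-methyltransferase [EC:2.1.1.6]
KEGG_LINE = re.compile(
    r"^(?P<organism>[a-z]+):(?P<gene_id>\d+)\s+"
    r"(?P<symbol>[^;]+);\s*"
    r"(?P<name>.+?)\s*"
    r"\[EC:(?P<ec>[\d.]+)\]\s*$"
)


def parse_kegg_line(line):
    """Split a KEGG gene result line into its named parts."""
    match = KEGG_LINE.match(line.strip())
    if match is None:
        raise ValueError("line does not look like 'org:id SYMBOL; name [EC:x.x.x.x]'")
    return match.groupdict()


def kegg_entry_url(organism, gene_id):
    """Rebuild the direct link, following the URL pattern given in the handout."""
    return f"http://www.genome.jp/dbget-bin/www_bget?{organism}:{gene_id}"


if __name__ == "__main__":
    entry = parse_kegg_line("hsa:1312 COMT; catechol-O-methyltransferase [EC:2.1.1.6]")
    print(entry)
    print(kegg_entry_url(entry["organism"], entry["gene_id"]))
```

Note that the gene number, 1312, is the same identifier that follows COMT through the NCBI Gene pages, and the EC number is what you look for on the pathway map.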
Let's take a break from COMT and have some fun looking at some genomics data and tools for influenza. One of the new resources should not surprise you - it's from NCBI. Point your browser at: http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html From this vantage you should be able to inspect the flu sequence dataset explorer. If this is over your head, try just looking at the influenza genome, or the Influenza Virus Genome Viewer. Each of these resources takes some time to explore, and requires some knowledge of the flu virus, and its epidemiology.
http://www.ncbi.nlm.nih.gov/BLAST/
http://www.informaticus.org/COIN81/BLAST.htm
Below is a quick write-up of using BLAST. I used the tutorial to do a few exercises, and it mostly works pretty well. This is intended to be an informal first pass, while I work on a more structured exercise.
Go to the BLAST home page (http://www.ncbi.nlm.nih.gov/BLAST/ )
Along the left-hand margin of the page, follow the links down to BLAST course and BLAST tutorial. I highly recommend that you take some time and explore both of these resources, as they will answer most of the questions that you'll eventually have.
To start a BLAST search, open up one of the 25 or so text files in the BLAST file archive. I recommend the bovine_mRNA.txt file or the amino_acid_sequence.txt file, which will be used for nucleotide and protein BLAST searches, respectively.
On the main BLAST page, select 'standard nucleotide BLAST [blastn]' for genomic searches, and 'standard protein-protein BLAST [blastp]' for protein searches.
To learn more about either, click the ? icon on the box next to the items below:
Nucleotide BLAST
Protein BLAST
Translated BLAST searches
Search for conserved domains
Pairwise BLAST
Genomic BLAST pages
Specialized BLAST pages
Treat BLAST like a search engine (like Google). You have a genomic or protein search sequence as a query (your unknown sequence).
The BLAST database is a collection of 13 billion nucleotides in 12 million records, or about 1,000 base pairs per file entry. Each file entry represents a gene or a protein.
When you search against the database, your query is "aligned" against all possible matches. The BLAST algorithm starts by finding all sequences that share a run of at least 11 consecutive matching nucleotides with your query (called an 11-mer), and then extends outward from those short matches to find longer ones. BLAST looks for long, straight series of uninterrupted matches, regardless of whether the entries are the same length (in bases, for example). Only a few matches will be identical over the full length of your query; most will cover a shorter region, and these often represent conserved domains in protein structures.
BLAST (Basic Local Alignment Search Tool) optimizes the searches to return the longest uninterrupted length of alignment. Good 'hits' may not match the whole length of your query. You'll quickly see how this works, but read the tutorials to learn faster.
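The word-matching idea described above can be sketched in a few lines. This is a deliberately simplified toy, not the real BLAST program: it finds exact 11-mers shared by a query and a single subject sequence, then extends each one left and right only while the bases keep matching. Real BLAST scores the extension, tolerates mismatches, and searches a whole database. The function names and the two short sequences are made up for the illustration.

```python
def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an exact word of word_size matches."""
    # Index every word in the subject by its starting position.
    index = {}
    for s in range(len(subject) - word_size + 1):
        index.setdefault(subject[s:s + word_size], []).append(s)

    seeds = []
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, word_size=11):
    """Grow a seed in both directions while bases match exactly (no gaps, no mismatches)."""
    q_start, s_start = q, s
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = q + word_size, s + word_size
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return q_start, s_start, q_end - q_start


def longest_exact_match(query, subject, word_size=11):
    """Return (query_start, subject_start, length) of the longest extended seed, or None."""
    best = None
    for q, s in find_seeds(query, subject, word_size):
        hit = extend_seed(query, subject, q, s, word_size)
        if best is None or hit[2] > best[2]:
            best = hit
    return best


if __name__ == "__main__":
    # Toy sequences: a shared 17-base core with different flanking bases.
    query = "TTT" + "ACGTACGTTAGCCATGG" + "AAA"
    subject = "GGGG" + "ACGTACGTTAGCCATGG" + "CCCC"
    print(longest_exact_match(query, subject))
```

Notice that the hit covers only part of each sequence, which is the "local" in Basic Local Alignment Search Tool.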
Step-by-step
The secret to learning how to 'BLAST' is practice, practice, and more practice. I did this at least 10 times before it really started to make sense. You have over 25 files in the BLAST exercise folder that I have made for you. Many of these entries include the accession number in the filename, so you can see right away if you're looking at the right file. Try saving GenBank records as flatfile and FASTA formats for archiving.
And try to have fun!
I have two sequences (which I usually attach with this handout) for you to practice with. One is named "amino_acid_sequence.txt" and the other is named "bovine_mRNA.txt". Use each of these to start practicing "BLASTing" at NCBI - you're "Surfing the Genome".
This lesson is copyrighted using an Educational Common License, and may be used freely without restriction for academic purposes.
Robert D. Cormia