BLAST exercises - each of the text files below can be used in nucleotide or protein BLAST, as labeled. For each entry, use BLAST to arrive at the matching GenBank entry (nucleotide or protein). Save each GenBank record in both FASTA and flatfile format. I have done that for AB003468 - FASTA and flatfile formats. This may seem like a tedious activity, and it is exactly that. It is also a skill that you should be able to do "with your eyes closed".
BLAST ex1 - BLAST exercise one (nucleotide) bovine mRNA
BLAST ex2 - BLAST exercise one (protein) eye protein
BLAST ex3 - BLAST exercise two (nucleotide) COMT
BLAST ex4 - BLAST exercise two (protein) COMT
BLAST ex5 - BLAST exercise three (nucleotide) cAMP
BLAST ex6 - BLAST exercise three (protein) Prion
BLAST ex7 - BLAST exercise four (nucleotide) BRCA1 (this is a very large file - try BLASTing just 200 or 300 base pairs at a time).
BLAST ex8 - BLAST exercise four (protein) Heat Shock
BLAST ex9 - BLAST exercise five (nucleotide) HIV-1 gag-pol
BLAST ex10 - BLAST exercise five (protein) HIV-1 gag-pol
Pairwise Alignment - Hemoglobin / Myoglobin FASTA file
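For the large BRCA1 file in exercise four, cutting the sequence into pieces by hand gets old quickly. Here is a minimal Python sketch that splits a sequence into consecutive windows you can paste into BLAST one at a time. The function name and the 300-letter default are my own choices for illustration, not part of the exercise files.

```python
def split_into_windows(sequence, window=300):
    """Return consecutive, non-overlapping pieces of at most `window` letters.

    Whitespace and line breaks are removed first, so a sequence pasted
    from a wrapped text file can be passed in as-is.
    """
    cleaned = "".join(sequence.split()).upper()
    return [cleaned[i:i + window] for i in range(0, len(cleaned), window)]


# Placeholder input (800 letters); substitute the contents of your own file.
demo = "ACGT" * 200
for number, piece in enumerate(split_into_windows(demo, 300), start=1):
    print(f"segment {number}: {len(piece)} letters")
```

Any single window should be enough to find the GenBank entry; the others are there in case one window lands in a region with poor matches.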
Extra credit (5 points) - for each of the protein sequences, also BLAST the PDB database (from BLASTp). Save the resulting PDB entry in flat file format. Additionally, find and download the PDB structure (http://www.rcsb.org/pdb/) for some of the protein files.
Extra credit (5 points) - for each of the nucleotide files, remove every 10th or 11th base pair. BLAST a short segment of your modified file using translated BLAST (tblastx). What did you find? How can you explain these results?
Extra credit (5 points) - Open the pairwise myoglobin / hemoglobin file. Go to the BLAST 2 Sequences page at NCBI. Do a pairwise BLAST of hemoglobin chain A against chain B, and of hemoglobin chain A (and B) against myoglobin chain A. What do you see? What does it tell you?
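For the extra credit that asks you to remove every 10th base, a short Python sketch like the one below can prepare the modified sequence and show how its triplets line up before and after the edit. Both function names are hypothetical helpers written for this page, and the 21-letter example is just a short stretch copied from the sample nucleotide sequence used in the walkthrough.

```python
def drop_every_nth(sequence, n=10):
    """Delete the letters at positions n, 2n, 3n, ... (counting from 1)."""
    return "".join(
        base for position, base in enumerate(sequence, start=1)
        if position % n != 0
    )


def codons(sequence, frame=0):
    """Split a sequence into complete three-letter groups from `frame`."""
    usable = sequence[frame:]
    return [usable[i:i + 3] for i in range(0, len(usable) - 2, 3)]


original = "ATGGCCGCTGACAGCCCGAGC"
modified = drop_every_nth(original, 10)

print(codons(original))  # triplets of the untouched sequence
print(codons(modified))  # triplets after the deletions
```

Compare the two printed lists before you run tblastx, and keep them in mind when you answer the questions in the exercise.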
For a great tutorial on similarity searching, try looking at http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Similarity/simsrch1.html
Using BLAST at NCBI (a Word document version of this walkthrough is available for download; printed, it is a much better format than the text below.)
BLAST works like a search engine such as Google: it compares your query string of letters against the 12 million records at NCBI (15 billion letters of data). You will search with an unknown nucleotide sequence and an unknown amino acid sequence. This exercise will take you about 30 minutes when you first start out. With a little practice it will take you 2 minutes. Take your time and have fun. You're "surfing the genome"! Here's the step-by-step. Good luck!
Start at http://www.ncbi.nlm.nih.gov
Then go to http://www.ncbi.nlm.nih.gov/BLAST/
Note the links along the left-hand side of the page for the BLAST course, tutorial, and references.
Take an unknown nucleotide sequence like:
GAAGGCCGAGGCCAAGATGGCGGCGCTGGCGGCTCTCCGCCTGCTCCACCCTATCCTTGCTGTGCGCTCTGGCGTGGGTGCCGCTCTGCAGGTGCGAGGGGTCCATTCTAGCATGGCCGCTGACAGCCCGAGCAGCACTCAGCCCGCCGTATCCCAGGCCAGAGCTGTGGTCCCCAAACCTGCCGCACTTCCCAGCAGCCGGGGCGAGTATGTGGTGGCCAAGCTGGACGACCTCATCAACTGGGCGCGCCGGAGCTCGCTGTGGCCCATGACCTTTGGCCTGGCCTGCTGTGCCGTGGAGATGATGCACATGGCGGCACCCCGCTATGACATGGACCGCTTTGGCGTGGTCTTCCGTGCCAGCCCGCGCCAGTCCGACGTGATGATTGTGGCTGGGACGCTCACCAACAAGATGGCCCCTGCGCTCCGCAAGGTCTACGACCAGATGCCGGAGCCCCGCTATGTCGTATCCATGGGGAGCTGTGCCAACGGAGGTGGCTACTACCACTACTCCTACTCAGTGGTGCGGGGCTGCGACCGCATCGTTCCAGTGGACATCTACGTCCCAGGCTGCCCGCCTACGGCCGAGGCCCTGCTGTATGGCATTCTGCAGCTGCAGAAGAAGATCAAACGGGAAAAGAGGTTGCGGATCTGGTACCGCAGGTAGTGCTGGCGCCCGGCCGCCCTGAGCATCGTCCCTCCAAGAGGCGGTCAATAAACCCGATCAACCCTCAAAAAAAAAAAAAAAAAAA
Or an amino acid sequence like:
MSQAAKASASATVAVNPGPDTKGKGAPPAGTSPSPGTTLAPTTVPITSAKAAELPPGNYRLVVFELENFQGRRAEFSGECSNLADRGFDRVRSIIVSAGPWVAFEQSNFRGEMFILEKGEYPRWNTWSSSYRSDRLMSFRPIKMDAQEHKISLFEGANFKGNTIEIQGDDAPSLWVYGFSDRVGSVKVSSGTWVGYQYPGYRGYQYLLEPGDFRHWNEWGAFQPQMQSLRRLRDKQWHLEGSFPVLATEPPX
For nucleotide BLAST, click the link under the nucleotide BLAST pull-down menu:
Standard nucleotide-nucleotide BLAST [blastn]
For protein BLAST, click the link under the protein BLAST pull-down menu:
Standard protein-protein BLAST [blastp]
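If you are not sure which of the two programs an unknown file belongs in, the letters themselves usually give it away. The sketch below is a rough heuristic of my own (not an NCBI tool): it measures what fraction of the letters are A, C, G, T, U or N and suggests blastn when that fraction is high. The 0.9 cutoff is an assumption chosen for illustration.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_blast_program(sequence, threshold=0.9):
    """Suggest 'blastn' or 'blastp' from the letter composition.

    Proteins also contain A, C, G, T and N, so this is only a guess;
    very short or unusual sequences can fool it.
    """
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "blastn" if fraction >= threshold else "blastp"


print(guess_blast_program("GAAGGCCGAGGCCAAGATGGCGGCG"))
print(guess_blast_program("MSQAAKASASATVAVNPGPDTKGKGAPPAGTSPSPG"))
```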
Each will take you to a formatting page.
Nucleotide BLAST will look like:
[Screenshot: nucleotide BLAST form]
Protein BLAST will look like:
[Screenshot: protein BLAST form]
Paste the sequence of letters into the form box, and then click the button that submits the search.
The page that is returned for the nucleotide BLAST will look like this:
Your request has been successfully
submitted and put into the Blast Queue.
Query = (752 letters)
Click on the format button, and be prepared to wait until the search is done. While you wait, you will see a page with a clock timer that counts upward.
The page that is returned for the protein BLAST will look like this:
Your request has been successfully
submitted and put into the Blast Queue.
Query = (252 letters)
Putative conserved domains have been detected, click on the image below for
detailed results.
The request ID is
or
The results are estimated to be ready in 4 seconds but may be done sooner.
Please press "FORMAT!" when you wish to check your results. You may
change the formatting options for your result via the form below and press "FORMAT!"
again. You may also request results of a different search by entering any other
valid request ID to see other recent jobs.
You might want to click on the conserved domains, which will lead you to families of proteins (Pfam) and other interesting links.
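The request ID (RID) on these pages is what ties a submitted search to its results. If you ever want to script the same submit-then-check cycle, NCBI documents a URL interface to BLAST that uses the same idea. The sketch below only builds the query strings and reads the RID out of a reply; it does not contact NCBI. The function names are my own, the sample reply is a made-up illustration of the reply layout, and you should check the current NCBI documentation before relying on any parameter.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint for the NCBI BLAST URL interface.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_query(sequence, program="blastn", database="nt"):
    """Query string that submits a search (CMD=Put)."""
    return urlencode({
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": sequence,
    })


def build_status_query(rid):
    """Query string that asks whether the search for `rid` has finished."""
    return urlencode({"CMD": "Get", "FORMAT_OBJECT": "SearchInfo", "RID": rid})


def parse_submission_reply(reply_text):
    """Extract the request ID and estimated wait (seconds) from a reply."""
    rid = re.search(r"^\s*RID\s*=\s*(\S+)", reply_text, re.MULTILINE)
    rtoe = re.search(r"^\s*RTOE\s*=\s*(\d+)", reply_text, re.MULTILINE)
    if rid is None:
        raise ValueError("no RID found in reply")
    return rid.group(1), (int(rtoe.group(1)) if rtoe else None)


# Illustrative reply text only, reusing the RID shown in this walkthrough.
sample_reply = """<!--QBlastInfoBegin
    RID = 1042230468-025965-5346
    RTOE = 4
QBlastInfoEnd
-->"""
print(parse_submission_reply(sample_reply))
```

The estimated wait plays the same role as the "results are estimated to be ready in 4 seconds" message on the web page: wait at least that long before asking for the status.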
The Nucleotide BLAST will return pages that look like:
BLASTN 2.2.5 [Nov-16-2002]
Reference:
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
RID: 1042230468-025965-5346
Query=
(752 letters)
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,
GSS, or phase 0, 1 or 2 HTGS sequences)
1,570,776 sequences; 7,667,485,011 total letters
If you have any problems or questions with the results of this search
please refer to the BLAST FAQs
Scroll down to the list of hits and click on one of the links:
Sequences producing significant alignments: (bits) Value
gi|11256|emb|X65020.1|BTPSSTSU B.taurus mRNA for PSST subun... 1453 0.0
gi|17921974|ref|NM_024407.2| Homo sapiens NADH dehydrogenas... 628 e-177
gi|17511653|gb|BC001715.2|BC001715 Homo sapiens, NADH dehyd... 628 e-177
gi|13543602|gb|BC005954.1|BC005954 Homo sapiens, clone MGC:... 628 e-177
gi|12001973|gb|AF060512.1|AF060512 Homo sapiens clone 016d0... 559 e-156
gi|21750695|dbj|AK092169.1| Homo sapiens cDNA FLJ34850 fis,... 523 e-145
Click on a link like
gi|11256|emb|X65020.1|BTPSSTSU
You'll go to a GenBank entry that looks like:
1: X65020. B.taurus mRNA for...[gi:11256] Links
LOCUS BTPSSTSU 752 bp mRNA linear MAM 29-NOV-1994
DEFINITION B.taurus mRNA for PSST subunit of the NADH: ubiquinone
oxidoreductase complex.
ACCESSION X65020
VERSION X65020.1 GI:11256
KEYWORDS NADH-ubiquinone oxidoreductase subunit.
SOURCE Bos taurus (cow)
When you get there, scroll down and view the page. Use the controls in the page header that say "display default" and "send to text". You'll get a text file. Save that file from your Web browser (locally) in a text format (.txt).
Then click on "display FASTA" and "Send to text". You'll get something that looks like:
>gi|11256|emb|X65020.1|BTPSSTSU B.taurus mRNA for PSST subunit of the NADH: ubiquinone oxidoreductase complex
GAAGGCCGAGGCCAAGATGGCGGCGCTGGCGGCTCTCCGCCTGCTCCACCCTATCCTTGCTGTGCGCTCT
GGCGTGGGTGCCGCTCTGCAGGTGCGAGGGGTCCATTCTAGCATGGCCGCTGACAGCCCGAGCAGCACTC
AGCCCGCCGTATCCCAGGCCAGAGCTGTGGTCCCCAAACCTGCCGCACTTCCCAGCAGCCGGGGCGAGTA
TGTGGTGGCCAAGCTGGACGACCTCATCAACTGGGCGCGCCGGAGCTCGCTGTGGCCCATGACCTTTGGC
CTGGCCTGCTGTGCCGTGGAGATGATGCACATGGCGGCACCCCGCTATGACATGGACCGCTTTGGCGTGG
TCTTCCGTGCCAGCCCGCGCCAGTCCGACGTGATGATTGTGGCTGGGACGCTCACCAACAAGATGGCCCC
TGCGCTCCGCAAGGTCTACGACCAGATGCCGGAGCCCCGCTATGTCGTATCCATGGGGAGCTGTGCCAAC
GGAGGTGGCTACTACCACTACTCCTACTCAGTGGTGCGGGGCTGCGACCGCATCGTTCCAGTGGACATCT
ACGTCCCAGGCTGCCCGCCTACGGCCGAGGCCCTGCTGTATGGCATTCTGCAGCTGCAGAAGAAGATCAA
ACGGGAAAAGAGGTTGCGGATCTGGTACCGCAGGTAGTGCTGGCGCCCGGCCGCCCTGAGCATCGTCCCT
CCAAGAGGCGGTCAATAAACCCGATCAACCCTCAAAAAAAAAAAAAAAAAAA
Again, save it in a text format.
You're done with the nucleotide BLAST!
The protein BLAST will return a page that looks like:
BLASTP 2.2.5 [Nov-16-2002]
Reference:
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
RID: 1042229648-016793-20669
Query=
(252 letters)
Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF
1,296,341 sequences; 413,974,790 total letters
If you have any problems or questions with the results of this search
please refer to the BLAST FAQs
Click on one of the links below:
Sequences producing significant alignments: (bits) Value
gi|4503061|ref|NP_001878.1| crystallin, beta B1; eye lens s... 402 e-111
gi|9739217|gb|AAF97950.1|AF286652_1 betaB1-crystallin [Ratt... 345 3e-94
gi|20178278|sp|P02523|CRB1_RAT Beta crystallin B1 [Contains... 344 7e-94
gi|12963789|ref|NP_076184.1| crystallin, beta B1 [Mus muscu... 340 8e-93
gi|6978711|ref|NP_037068.1| crystallin beta B1 [Rattus norv... 339 3e-92
gi|22137737|gb|AAH29012.1| crystallin, beta B1 [Mus musculus] 338 4e-92
gi|117394|sp|P07318|CRB1_BOVIN BETA CRYSTALLIN B1 >gi|10854... 334 7e-91
Click on a link like
gi|4503061|ref|NP_001878.1|
You'll go to a GenBank entry that looks like:
1: NP_001878. crystallin, beta ...[gi:4503061] BLink, Links
LOCUS CRYBB1 252 aa linear PRI 05-NOV-2002
DEFINITION crystallin, beta B1; eye lens structural protein [Homo sapiens].
ACCESSION NP_001878
VERSION NP_001878.1 GI:4503061
DBSOURCE REFSEQ: accession NM_001887.3
KEYWORDS .
SOURCE Homo sapiens (human)
When you get there, scroll down and view the page. Use the controls in the page header that say "display default" and "send to text". You'll get a text file. Save that file from your Web browser (locally) in a text format (.txt).
Then click on "display FASTA" and "Send to text". You'll get something that looks like:
>gi|4503061|ref|NP_001878.1| crystallin, beta B1; eye lens structural protein [Homo sapiens]
MSQAAKASASATVAVNPGPDTKGKGAPPAGTSPSPGTTLAPTTVPITSAKAAELPPGNYRLVVFELENFQGRRAEFSGECSNLADRGFDRVRSIIVSAGPWVAFEQSNFRGEMFILEKGEYPRWNTWSSSYRSDRLMSFRPIKMDAQEHKISLFEGANFKGNTIEIQGDDAPSLWVYGFSDRVGSVKVSSGTWVGYQYPGYRGYQYLLEPGDFRHWNEWGAFQPQMQSLRRLRDKQWHLEGSFPVLATEPPK
Again, save it in a text format.
Now you're done with the protein BLAST exercise.
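The identifiers in the hit lists and FASTA headers above all follow the same bar-separated pattern: gi, a GI number, a database code (ref, gb, emb, dbj, sp), an accession, and sometimes a locus name. When you have saved a lot of records, a few lines of Python can pull those pieces apart so you can name and sort your files. This is a sketch written for the headers shown on this page, not a general parser for every NCBI identifier style, and the function name is my own.

```python
def parse_defline(line):
    """Split a 'gi|number|db|accession|locus description' header.

    Returns a dict with the GI number, database code, accession,
    optional locus name, and the free-text description.
    """
    text = line.lstrip(">").strip()
    identifiers, _, description = text.partition(" ")
    fields = identifiers.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("not a gi|...|db|accession header: " + line)
    locus = fields[4] if len(fields) > 4 and fields[4] else None
    return {
        "gi": fields[1],
        "database": fields[2],
        "accession": fields[3],
        "locus": locus,
        "description": description.strip(),
    }


record = parse_defline(">gi|11256|emb|X65020.1|BTPSSTSU B.taurus mRNA for PSST subunit")
print(record["accession"], record["locus"])
```

One design note: the accession with its version (X65020.1) is kept whole, since that is the form shown on the VERSION line of the flatfile.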
Practice both nucleotide and protein BLAST using the sample files. Look for the LocusLink and UniGene buttons too.
The goals of these exercises are to get you comfortable using BLAST, and then to explore the NCBI databases. It takes lots and lots of practice; it took me 30 to 40 hours. In time you will know these skills very well, and it won't seem like such a big deal. If your molecular biology is a bit weak, you'll want to read the PDF primer on molecular genetics from the Department of Energy.
Feel free to send me email with any questions or concerns. I'm here to help!
http://www.informaticus.org/COIN81/BLAST/
This lesson is copyrighted using an Educational Common License, and may be used freely without restriction for academic purposes.
Robert D. Cormia