Options for local sequence searches.

Local implementations of NCBI software and the NCBI databases are available at UTHSCSA.  These databases can be searched as an alternative to searching at NCBI.  Because only UTHSCSA users can access these services, the local search is generally faster than the turnaround from NCBI.  Additionally, other programs, including HMMER and SAM, can access the databases.  The local tools can also be used to gain some extra capabilities, which are listed under each program below.

Note to GCG users:

The default local databases provided in the GCG system are not regularly updated and hence should be considered obsolete.  See the notes below for accessing, from within GCG, the same daily updated local databases that the local Blast suite uses.

Programs available

For either method, you will need a username and password for bcf.  If you do not have one, apply here.

NetBlast family.

NOTE: as of 7/12/2006, the local NetBlast implementation was out of order. If you wish to use this program, contact Dr. Demeler.

Use these for the following purposes:

Features: Usage: Databases available:

The same nucleotide and protein databases at NCBI are available at UTHSCSA.  They are updated periodically directly from NCBI.  Databases for RPS-Blast are NCBI's remake of Pfam, Smart, COGS, KOGS, and an option to search All (which includes a few more protein families than the union of the other four).

Comparison to other methods:

Command line Blast versions (Standalone Blast).

Note: These programs are constantly updated from NCBI. Sometimes there are small changes to the command syntax. If the suggested syntax does not work, try typing the command followed by a <space> <dash> to see a synopsis of the syntax for the currently installed version.

Use these to:

Databases available

  • The same nucleotide and protein databases at NCBI are available at UTHSCSA.  They are updated nightly directly from NCBI.  To refer to the databases from the command line blast programs, use the path $BLAST_DB.  More than one database can be searched at the same time by enclosing the names in quotes, e.g. -d "$BLAST_DB/human_genomic $BLAST_DB/other_genomic".  See this link for the current distribution of sequences among the different database divisions.  If a database name has an extension, it must be included in the -d option, e.g. -d $BLAST_DB/drosoph.aa
  • Access

    Documentation for standalone programs

     
    NOTE: the documentation described below is now somewhat dated, but the files can still be read in /home/hardies/oldblast.  Newer documentation is at this site.

    Program  Documentation file  Summary of functions
    blastall README.bls Run blastp, blastn, blastx, tblastn or tblastx searches. 
    Permits 
    • batch of searches at once from a multifasta file for the key
    • search of multiple databases at once
    • search of database restricted to a list of gi numbers
    Examples: 
    • blastall -i key.fa -d $BLAST_DB/nr -p blastp -o result.out
    • blastall -i key.fa -d "$BLAST_DB/nr my_local_library" -p blastp -o result.out
    • blastall -i key.fa -d $BLAST_DB/nr -l gi.lst -p blastp -o result.out
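The multiple-database form of -d is easy to get wrong by hand, so the following is a minimal sketch of a helper that assembles the quoted value. The function name db_list, the placeholder default for BLAST_DB, and the convention that bare names live under $BLAST_DB while names starting with / or ./ are used as given are all assumptions made for illustration. The command is printed rather than run so that it can be inspected first.

```shell
# Placeholder default; on the real system $BLAST_DB is already set.
BLAST_DB="${BLAST_DB:-/path/to/blastdb}"

# db_list NAME...  (hypothetical helper)
# Joins database names into the single space-separated string that -d expects.
db_list() {
    out=""
    for name in "$@"; do
        case "$name" in
            /*|./*) db_path="$name" ;;            # already a path: keep as given
            *)      db_path="$BLAST_DB/$name" ;;  # bare name: look under $BLAST_DB
        esac
        out="${out:+$out }$db_path"
    done
    printf '%s\n' "$out"
}

# Print the command instead of running it.
echo blastall -i key.fa -d "\"$(db_list nr ./my_local_library)\"" -p blastp -o result.out
```

To run the search for real, copy the printed line to the prompt, or drop the leading echo and the escaped quotes.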
    blastall README.bls Run psitblastn search. Key is a position specific matrix generated by blastpgp. The key is searched against a nucleotide database translated in all six frames.
    fastacmd README.bls
    • Retrieve sequence by ID from local database to a fasta file. 
      • e.g. fastacmd -s NP_640322 -pT -d $BLAST_DB/nr -o NP_640322.fa
      • The -pT parameter is necessary for the protein nr database to avoid confusion with the nucleotide nr database.
    • Retrieve multiple sequences specified by a file of IDs to a multifasta file
      • e.g. fastacmd -i gilist.txt -d $BLAST_DB/est -o <output filename>
    • Restore a formatted database to a multifasta file
      • e.g. fastacmd -DT -cT -d $BLAST_DB/euk_genome -o <output filename>
    • Count the sequences in a database
    • Extracting a subrange of sequence from large nucleotide entries
      • e.g. given the following source descriptor in a bacterial protein entry: /coded_by="complement(NC_004431.1:5193864..5194652)"
      • Retrieve gene plus surrounding DNA by:
      • fastacmd -s NC_004431 -L 5193500,5195000 -d $BLAST_DB/prok_genomes -o <output filename>
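The arithmetic behind the subrange example above can be sketched as a small shell function. It takes a simple coded_by descriptor and a flank size, and prints the accession and a widened start,stop pair suitable for -L. The function name coded_by_range, the flank of 350 bases, and the output file name region.fa are hypothetical. The sketch assumes a single span of the form ACCESSION.VERSION:START..STOP, optionally wrapped in complement(...); joined or multi-part locations are not handled.

```shell
# coded_by_range DESCRIPTOR [FLANK]  (hypothetical helper)
# Prints "ACCESSION START,STOP" with the span widened by FLANK bases per side.
coded_by_range() {
    desc=$1
    flank=${2:-0}
    core=${desc#"complement("}   # strip an optional complement( prefix
    core=${core%")"}             # and its closing parenthesis
    acc=${core%%:*}              # e.g. NC_004431.1
    acc=${acc%%.*}               # drop the version suffix, as in the example above
    span=${core#*:}              # e.g. 5193864..5194652
    start=${span%%..*}
    stop=${span##*..}
    start=$((start - flank))
    if [ "$start" -lt 1 ]; then start=1; fi   # do not run off the 5' end
    stop=$((stop + flank))
    printf '%s %s,%s\n' "$acc" "$start" "$stop"
}

# Print the resulting fastacmd command rather than running it.
set -- $(coded_by_range 'complement(NC_004431.1:5193864..5194652)' 350)
echo fastacmd -s "$1" -L "$2" -d '$BLAST_DB/prok_genomes' -o region.fa
```

The stop coordinate is not clipped to the length of the entry, because the sketch has no way of knowing it.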
    blastpgp README.bls Conduct Psi-Blast search in non-interactive mode 
    Permits: 
    • batch of searches at once from a multifasta file of keys.
    • search of multiple databases at once
    • inclusion of user-created databases, allowing inclusion of private sequences in reverse Psi-Blast strategies
    • Example
      • blastpgp -i key.fa -d "$BLAST_DB/nr mylib" -j 8 -o results.out
    • restriction of database searched to a list of gi numbers (for increased sensitivity).
      • Get the gi list from NCBI Entrez using a web browser, e.g. retrieve all bacterial gi's with Entrez query bacteria[orgn], select "gi list" in the display box, and download.
      • blastpgp -i key.fa -d $BLAST_DB/<database name> -l list.gi -j 8 -o results.out
    • creation of PSSM in one database, and then use in another.
    • creation of PSSM for use by blastall in psitblastn mode.
    • creation of PSSM for conversion to form used by rpsblast.
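A gi list saved from a web browser may carry blank lines, Windows line endings, or repeated entries. The following is a minimal sketch of a filter that tidies such a file before it is passed to -l. The function name clean_gi_list and the file name list_raw.gi are hypothetical, and the sketch assumes the list holds one numeric gi per line.

```shell
# clean_gi_list  (hypothetical helper)
# Reads a raw gi list on stdin; writes unique numeric gi's, one per line.
clean_gi_list() {
    tr -d ' \t\r' |            # remove spaces, tabs and carriage returns
        grep -E '^[0-9]+$' |   # keep lines that are purely numeric
        sort -un               # sort numerically and drop duplicates
}

# Small demonstration on inline data.
printf '12\r\n7\n\n12\nabc\n' | clean_gi_list

# Typical use on a downloaded file:
#   clean_gi_list < list_raw.gi > list.gi
```

Lines that are not purely numeric are dropped silently, so compare line counts before and after if the download looks suspect.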
    Conducts PHI-Blast search 
    A requirement to match a regular expression specifying a sequence motif is added to the first round of a Psi-Blast search.  Note: command line Psi-Blast requires you to specify the regular expression in a file of a particular format (see documentation).  NetBlast lets you just put the regular expression in a box.  Syntax of regular expressions for blast is described at http://www.ncbi.nlm.nih.gov/blast/html/PHIsyntax.html
    seedtop README.bls Searches a database for match to a regular expression specifying a sequence motif. 
    Can also search a library of patterns against a sequence.  You would have to obtain the library (from the PROSITE database, for example).  NetBlast only lets you do this in conjunction with a Blast search.  Command line blast lets you do it separately.
    bl2seq README.bls Blast two sequences against each other.  Can be done in blastp, blastn, blastx,  tblastx, or tblastn modes.
    formatdb README.formatdb Creates a user-defined blast searchable database from a multifasta file.
    To get rid of redundancies that block formatdb over a list of sequences retrieved by Entrez: 
    • Set Entrez to retrieve refseq only.
    • Retrieve gi list
    • fastacmd -i gi.lst -d $BLAST_DB/nr -tT -pT -o library
    • formatdb -i library -oT -pT
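The last two steps above can be chained in a small script. This is a sketch only: the function name build_library and the RUN switch are hypothetical, the default for BLAST_DB is a placeholder, and the flags are those shown in the steps above. By default the commands are printed, not executed; setting RUN=1 executes them.

```shell
# Placeholder default; on the real system $BLAST_DB is already set.
BLAST_DB="${BLAST_DB:-/path/to/blastdb}"

# build_library GI_LIST LIBRARY  (hypothetical helper)
# Dumps the listed sequences from protein nr, then formats them as a database.
build_library() {
    gi_list=$1     # gi list downloaded from Entrez
    library=$2     # name of the multifasta file and resulting database
    if [ "${RUN:-0}" = 1 ]; then runner=""; else runner="echo"; fi
    # formatdb only runs if fastacmd succeeded.
    $runner fastacmd -i "$gi_list" -d "$BLAST_DB/nr" -tT -pT -o "$library" &&
        $runner formatdb -i "$library" -oT -pT
}

# Dry run: shows the two commands that would be executed.
build_library gi.lst library
```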
    megablast README.mbl Speeds up a blastn search between two very long nucleotide sequences at the cost of assuming near identity.  Mainly used for overlapping clones.
    blastclust README.bcl Organizes a database into sets of homologous sequences
    rpsblast README.rps Searches a sequence against a library of protein family models (mainly derived from the Pfam and smart databases). 

    Databases are in /ncbi/rpsblast and allowable names are Pfam, Smart, Cog, Kog, and All. The case must be matched.  e.g.
    rpsblast -i key.fa -d /ncbi/rps-blast/All -o <output filename>
    NOTE: the rpsblast database has not been maintained. If you need to do local rpsblast, contact Dr. Demeler.

    Permits: 

    • Inclusion of user-specified protein family models created with blastpgp.
    • Search of a nucleotide sequence translated in all 6 frames.
    impala README.imp Searches a protein family model (derived from blastpgp) against a database of sequences.
    copymat, makemat README.rps Programs used to convert PSSMs generated by blastpgp to protein family models used by rpsblast and impala.
    fmerge Merges two databases (as multifasta files) with removal of redundant gi's. 
    It is a two-step process that adds the fasta entries from update_file to the multifasta file oldlib:
    • fmerge -t 1 -n oldlib -i index.oldlib
    • fmerge -t 2 -m update_file -i index.oldlib
    • Look in fmerge.log to see summary of the process.

    Access to databases from GCG

    The GCG programs blast and psiblast can be redirected to search the daily updated local databases by adding the parameter -INfile2=$BLAST_DB/<database name> to the command line.

    To retrieve specific sequences from the daily updated databases, instead of the GCG fetch function, use fastacmd -s <accession number> -d $BLAST_DB/<database name> -o <output file> to retrieve the sequence.  If the database is protein nr, also include -pT in the parameter list to avoid confusion with the nucleotide nr database.  The file retrieved will be in fasta format.  Use GCG command fromfasta to convert it to GCG format, if desired.

    We do not at the moment have a way to make the daily updated databases available to non-blast GCG programs like FastA.  We recommend the Blast suite in place of those programs in any case.



    Last update 7/12/06; Steve Hardies