Many biological databases have been developed and have become an important
toolbox for scientists in research and academia. To find a homologue of a
protein or DNA sequence, or to establish the novelty of a sequence, one needs
to run a sequence search against the available databases. Similarly, to find
open reading frames, structures, functional and regulatory sequences, or
repeated elements, we need to search our query against the different available
databases.
As biological data keeps growing over time, a search and access system is
required to retrieve useful information from it. Three retrieval systems are
widely used for biological data: Entrez, the Sequence Retrieval System (SRS)
and DBGET. These systems let the user run a text search against multiple
molecular databases and return, along with the records that best match the
search criteria, useful related information in the form of internal or external
links. The basic difference between these retrieval systems lies in the
databases they search and the links they provide to other useful information.
The National Center for Biotechnology Information (NCBI) offers a text-based
retrieval system for molecular databases, known as Entrez. This system is easy
to use and searches 36 distinct databases of genes, proteins, literature,
health, genomes and chemicals. It also provides search options for limiting the
information retrieved. The aim of this review is to provide an overview of the
Entrez retrieval system, with the different search options, tags and stopwords
used for searching Entrez databases.
NCBI Retrieval System:
ENTREZ:
Entrez provides a global search page with six broad categories of available
databases: Chemicals, Literature, Proteins, Genomes, Genes and Health. A
text-based query returns the number of relevant results next to each database
name on the global search page. A search can be limited to a specific database
by selecting the required database from the dropdown menu to the left of the
query text box, or by using the advanced search options. Boolean operators
(AND, OR, NOT) and combinations of these operators refine the query results.
Single words, database identifiers, gene names, symbols, phrases and sentences
can all be searched as a query in the Entrez system. Searches can be made more
effective by using query translation, index fields and Boolean operators, which
together give more precise results for a query.
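As a minimal sketch of how a Boolean query restricted to one database could be expressed programmatically, the snippet below builds a request URL for the public NCBI E-utilities ESearch endpoint. The helper names `boolean_query` and `esearch_url` are hypothetical, the search terms are only examples, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_query(*terms, op="AND"):
    """Join search terms with one uppercase Boolean operator (hypothetical helper)."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError("op must be AND, OR or NOT")
    return f" {op} ".join(terms)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request limited to one database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Example terms only; 'protein' is the Entrez protein database.
term = boolean_query("dengue virus", "protease")
url = esearch_url("protein", term)
print(term)
print(url)
```

Passing `db` plays the same role as choosing a database from the dropdown menu on the search page; the URL can be opened in a browser or fetched with any HTTP client.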
Advanced and Limit Search:
Different indices have been created to facilitate Entrez database searching;
each index holds information extracted from a specific field of the records.
Some fields contain free text, whereas others hold identifiers or controlled
terms such as PMID, Accession and MeSH. By default Entrez searches all fields,
which returns a large number of records and can include unwanted results. To
find a specific and relevant record, one may restrict the search using the
Limit and Advanced search options. The Advanced search page lists all the
indexed terms and fields available in an Entrez database, and fields can be
restricted to a certain range, e.g. Molecular Weight, Modification Date,
Publication Date, Sequence Length, Accession and so on.
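To illustrate field restriction and range limits, here is a small sketch that assembles a fielded query string. The helpers `field` and `field_range` are hypothetical; the tags used (`Organism`, `SLEN` for sequence length, `PDAT` for publication date) and the `low:high[TAG]` range form are given as I understand Entrez query syntax, and the Advanced search page of the target database remains the reference for which fields it actually indexes.

```python
def field(term, tag):
    """Restrict a term to one index field, e.g. dengue virus[Organism]."""
    return f"{term}[{tag}]"


def field_range(low, high, tag):
    """Restrict a field to a range using the low:high[TAG] form."""
    return f"{low}:{high}[{tag}]"


# Example values only: an organism, a sequence-length range and a date range.
query = " AND ".join([
    field("dengue virus", "Organism"),
    field_range(300, 400, "SLEN"),
    field_range("2010/01/01", "2015/12/31", "PDAT"),
])
print(query)
```

The resulting string can be pasted into the Entrez query box or used as the `term` parameter of a programmatic search.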
Query Mapping and Controlled Vocabulary
The Organism and Medical Subject Headings (MeSH) fields are controlled
vocabularies, with a hierarchical classification of terminology, for the
biomolecular databases and the literature database PubMed. MeSH terms are
assigned to each PubMed record to add useful information about the original
paper. The PubChem database, which provides small-molecule records, also uses
MeSH terms as a controlled terminology and for additional information linked to
chemical compounds. PubChem also provides navigational links to the Taxonomy
database for the organisms, and their phylogenetic classification, associated
with chemical compounds. The Taxonomy Browser and MeSH Browser provide a useful
way of linking important information and, whenever possible, queries are
automatically mapped to these two vocabularies.
Similarly, identifiers from the different Entrez databases are automatically
mapped to the matching record; these identifiers can be PubMed IDs (PMIDs),
author names, nucleotide sequence accession numbers, protein accession numbers
or any other identifier. The Entrez search system also ignores stopwords
(non-informative words that occur very frequently in records). Examples include
conjunctions, definite and indefinite articles, prepositions and punctuation.
To search for such problematic words, enclose the phrase in quotation marks so
that precise and accurate results can be obtained.
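The effect of stopwords, and of quoting a phrase, can be pictured with a short local sketch. The `SAMPLE_STOPWORDS` set below is an illustrative assumption and not the list NCBI actually uses, and `drop_stopwords` and `as_phrase` are hypothetical helpers; the code only mimics the behaviour described above and does not query Entrez.

```python
# Illustrative sample only; the real stopword list is maintained by NCBI.
SAMPLE_STOPWORDS = {"a", "an", "of", "the", "in", "for", "to", "with"}


def drop_stopwords(query):
    """Return the words of a query that would survive stopword removal."""
    return [w for w in query.split() if w.lower() not in SAMPLE_STOPWORDS]


def as_phrase(query):
    """Wrap a query in double quotes so it is submitted as one exact phrase."""
    return '"' + query.strip('"') + '"'


raw = "the role of protease in dengue"
print(drop_stopwords(raw))   # words left after the sample stopwords are ignored
print(as_phrase(raw))        # quoted form, submitted as a single phrase
```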
Using Wild Cards or Query Truncation
Entrez supports truncated searching, in which a word ending with the wildcard
character “*” stands for all words that start with the same characters. The “*”
means: search for all terms that share the same stem before the *, irrespective
of the letters after it. For example, Dengue virus has 4 serotypes; to search
for records related to the Dengue serotypes one can query DENV*, and for the
protease NS domain one can search protease NS*. Several fields support
truncation, and it is useful when someone is uncertain about a spelling and/or
wants results for identifiers in different ranges. Truncation may also give
poor and incomplete results, because a truncated search only uses the first
600 variations of the query term found in the indexed vocabulary.
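The truncation behaviour, including the cap of 600 variations mentioned above, can be sketched against a small in-memory term list. Everything here is an assumption made for illustration: the index contents are made up, `expand_truncation` is a hypothetical helper, and taking the "first" variations in alphabetical order is a guess about ordering rather than documented Entrez behaviour.

```python
def expand_truncation(pattern, index_terms, limit=600):
    """Expand a trailing-* pattern against an index, keeping at most `limit` terms."""
    if not pattern.endswith("*"):
        raise ValueError("pattern must end with the wildcard character *")
    stem = pattern[:-1].lower()
    # Case-insensitive prefix match; alphabetical order is an assumption.
    matches = sorted(t for t in index_terms if t.lower().startswith(stem))
    return matches[:limit]


# Made-up index terms for illustration.
index = ["DENV1", "DENV2", "DENV3", "DENV4", "dengue", "DNA"]
print(expand_truncation("DENV*", index))

# With more than 600 matching terms, the rest are silently left out.
big_index = [f"NS{i:04d}" for i in range(1000)]
print(len(expand_truncation("NS*", big_index)))
```

The second call shows why a very short stem can give incomplete results: any matching term beyond the cap never takes part in the search.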
Related data: Neighbors and Links
The most useful and distinctive feature of the Entrez retrieval system is its
data integration, which allows associations between different records to be
searched and retrieved easily. Some of these associations may be unexpected,
which makes Entrez a search engine for scientific discovery. Two major types of
relationships have been established in the Entrez system:
1. Hard links: based on information available in the records, and
2. Neighbors: links derived from computed relationships within a database.
As neighbors are computationally determined, Entrez uses the BLAST and VAST
algorithms to find similar sequences and structures, respectively; related
citations in PubMed are found using an algorithm that compares the words and
phrases of abstracts with each other.
Cross-links between databases are recorded in the database records themselves,
e.g. a nucleotide sequence in GenBank has a cross-link to the original PubMed
article(s) reporting the sequence. Similarly, PubMed articles have navigational
links to structural databases and to the full-text database (PubMed Central),
and UniGene provides navigational links back to records in different databases.
Combining hard links and neighbors can be an effective technique for linking
data and finding valuable information, and it can be done without running extra
searches. For example, to move from a nucleotide record (mRNA) to the structure
database for the corresponding or a closely related protein, one can follow the
navigational link to the protein, then the links to related proteins, and from
these proteins follow the links to the structure database. A distinct Related
Structure link on various protein records makes this process even faster.
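The mRNA-to-structure walk described above could be sketched with the NCBI E-utilities ELink endpoint, which returns the records in one database that are linked to a record in another. The snippet only builds the three request URLs. The helper `elink_url` is hypothetical and the UIDs are placeholders: in practice each step would use the UIDs returned by the previous request.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ELink endpoint.
ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(dbfrom, db, uid):
    """Build (but do not send) an ELink request from one database to another."""
    return ELINK + "?" + urlencode({"dbfrom": dbfrom, "db": db, "id": uid})


# Placeholder UIDs: real values come from the response of the previous step.
steps = [
    ("nuccore", "protein", "NUCLEOTIDE_UID"),   # hard link: mRNA -> its protein
    ("protein", "protein", "PROTEIN_UID"),      # neighbors: related proteins
    ("protein", "structure", "NEIGHBOR_UID"),   # link: protein -> structure
]
urls = [elink_url(dbfrom, db, uid) for dbfrom, db, uid in steps]
for u in urls:
    print(u)
```

Setting `dbfrom` and `db` to the same database asks for neighbors within that database, while different values follow links across databases.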