Information Retrieval Systems in Bioinformatics: Entrez

Many biological databases have been developed and have become an essential toolbox for scientists in research and teaching. To find a protein or DNA sequence homologue, or to establish the novelty of a sequence, one needs to run a sequence search against the available databases. Similarly, finding open reading frames, structures, functional and regulatory sequences, or repeated elements requires searching a query against different databases.

As biological data keep growing, a search and access system is needed to retrieve useful information from them. Three retrieval systems are widely used for biological data: Entrez, the Sequence Retrieval System (SRS) and DBGET. Each lets the user run a text search against multiple molecular databases and returns, alongside the records that best match the search criteria, internal or external links to related information. The basic difference between these systems lies in the databases they search and the links they provide to other information.

The National Center for Biotechnology Information (NCBI) offers a text-based retrieval system for molecular databases, known as Entrez. The system is easy to use and searches 36 distinct databases covering genes, proteins, literature, health, genomes and chemicals. It also provides Limit options to narrow a search. The aim of this review is to give an overview of the Entrez retrieval system and of the search options, tags and stopwords used when searching Entrez databases.

NCBI Retrieval System: Entrez

Entrez provides a global search page that groups the available databases into six broad categories: Chemicals, Literature, Proteins, Genomes, Genes and Health. A text query on the global search page returns the number of matching records next to each database name. A search can be limited to a specific database by selecting it from the dropdown menu to the left of the query text box, or by using the advanced search options. The Boolean operators AND, OR and NOT, alone or in combination, narrow or broaden the query results. Single words, database identifiers, gene names, symbols, phrases and whole sentences can all be searched as queries in Entrez. Searching becomes more effective when query translation, index fields and Boolean operators are used together, since these features make the results more precise.
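The way Boolean operators and index fields combine into a single query can be sketched in a few lines of Python. This is only an illustration: the helper names (field_term, combine, esearch_url) are hypothetical, the example terms are invented, and the URL targets ESearch from the NCBI E-utilities, the programmatic counterpart of the Entrez search box, which the text above does not itself discuss. The sketch builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def field_term(text, field=None):
    """Return one query term, optionally restricted to an index field."""
    # Quote multi-word text so that it is searched as a phrase.
    term = f'"{text}"' if " " in text else text
    return f"{term}[{field}]" if field else term


def combine(operator, *terms):
    """Join terms with AND, OR or NOT and wrap the result in parentheses."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return "(" + f" {operator} ".join(terms) + ")"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for one database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


# Hypothetical example: dengue virus proteins whose title mentions NS3 or protease.
query = combine(
    "AND",
    field_term("dengue virus", "Organism"),
    combine("OR", field_term("NS3", "Title"), field_term("protease", "Title")),
)
print(query)
print(esearch_url("protein", query))
```

Choosing the database in the dropdown menu corresponds to the db parameter here, while the bracketed tags play the role of the index fields offered on the advanced search page.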

Advanced and Limit Search

Entrez databases are searched through indices, each built from a specific field of the records. Some fields hold free text, whereas others hold database identifiers or controlled terms such as PMID, Accession and MeSH. By default Entrez searches all fields, which can return a large number of records, some of them unwanted. To find specific, relevant records, one may restrict the search with the Limit and Advanced search options. The Advanced search page lists the indexed terms and fields available in each Entrez database, and some fields can be restricted to a range, e.g. Molecular Weight, Modification Date, Publication Date, Sequence Length and Accession.
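A range restriction is written in an Entrez query as start:end followed by the field name in square brackets. The sketch below only formats such terms; the helper name range_term and the example values are hypothetical, and it assumes that the full field names shown on the Advanced search page are accepted as tags.

```python
def range_term(start, end, field):
    """Format a range restriction as start:end[field]."""
    return f"{start}:{end}[{field}]"


# Hypothetical limits: sequences of 300 to 700 residues, published 2015 to 2020.
length_limit = range_term(300, 700, "Sequence Length")
date_limit = range_term("2015/01/01", "2020/12/31", "Publication Date")

# Range terms combine with other terms through Boolean operators.
query = f"protease[Title] AND {length_limit} AND {date_limit}"
print(query)
```

Keeping each limit as its own term makes it easy to add or drop one restriction at a time when a search returns too many or too few records.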

Query Mapping and Controlled Vocabulary

The Organism and Medical Subject Headings (MeSH) fields are controlled vocabularies, with a hierarchical classification of terms, for the biomolecular databases and the literature database PubMed. MeSH terms are assigned to each PubMed record to add useful information about the paper it describes. PubChem, which holds small-molecule records, also uses MeSH terms as a controlled terminology and for additional information linked to chemical compounds. PubChem also provides navigational links to the Taxonomy database for the organisms, and their phylogenetic classification, associated with chemical compounds. The Taxonomy Browser and MeSH Browser provide a useful way of linking this information, and whenever possible Entrez automatically maps queries to these two vocabularies.

Similarly, identifiers used in a query are automatically mapped to the matching records in the Entrez databases; such identifiers include PubMed IDs (PMIDs), author names, nucleotide and protein accession numbers, and others. Entrez also ignores stopwords, non-informative words that occur very frequently in records. Examples include conjunctions, prepositions, definite and indefinite articles, and punctuation. To search for such problematic words, enclose the phrase in quotation marks so that precise and accurate results are obtained.
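The effect of stopwords and quotation marks on a query can be pictured with a toy model. The code below is not how Entrez is implemented: the stopword set is a tiny invented sample, not NCBI's actual list, and the function name effective_terms is hypothetical. It only shows why a quoted phrase survives intact while unquoted stopwords drop out.

```python
import re

# A tiny illustrative sample; NOT the actual stopword list used by Entrez.
SAMPLE_STOPWORDS = {"a", "an", "the", "of", "in", "for", "to", "with", "on"}


def effective_terms(query):
    """Return the terms left after stopword removal, keeping quoted phrases whole."""
    terms = []
    # Each match is either a double-quoted phrase or a run of non-space characters.
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            terms.append(phrase)  # quoted text is kept exactly as typed
        elif word.lower() not in SAMPLE_STOPWORDS:
            terms.append(word)  # unquoted words are kept only if informative
    return terms


print(effective_terms('role of the protease in dengue'))
print(effective_terms('"role of the protease" in dengue'))
```

In the first call the words of, the and in are dropped; in the second the quoted phrase is searched as a unit, stopwords included.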

Using Wild Cards or Query Truncation

Entrez supports truncated searches, in which a word ends with the wild card character “*” standing for any remaining characters. The “*” tells Entrez to search for all words that share the same stem before the “*”, whatever letters follow it. For example, Dengue virus has four serotypes; to find records related to the Dengue serotypes one can query DENV*, and for the protease NS domain one can search protease NS*. Many fields support truncation, which is useful when one is uncertain about a spelling or wants results for a range of identifiers. Truncation may also give poor or incomplete results, because a truncated search uses only the first 600 variations of the query term found among the indexed vocabulary terms.
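How a truncated term expands, and why the 600-variation cap can lose results, can be shown with a small local model. This is a sketch under stated assumptions: the vocabulary is invented, the function name expand_truncated is hypothetical, and taking variations in alphabetical order is an assumption, since the text above only says that the first 600 are used.

```python
MAX_VARIATIONS = 600  # cap on expanded variations described in the text above


def expand_truncated(term, vocabulary, limit=MAX_VARIATIONS):
    """Expand a term ending in '*' into the indexed words that share its stem."""
    if not term.endswith("*"):
        # No wild card: only an exact (case-insensitive) match counts.
        return [w for w in vocabulary if w.lower() == term.lower()]
    stem = term[:-1].lower()
    matches = sorted(w for w in vocabulary if w.lower().startswith(stem))
    # Variations beyond the cap are silently left out of the search.
    return matches[:limit]


# Invented index vocabulary for illustration.
vocabulary = ["DENV1", "DENV2", "DENV3", "DENV4", "dengue", "denovo"]
print(expand_truncated("DENV*", vocabulary))

# A very short stem can exceed the cap, so some variations are never searched.
large_vocabulary = [f"ns{i:04d}" for i in range(1000)]
print(len(expand_truncated("ns*", large_vocabulary)))
```

A longer stem keeps the number of variations under the cap, which is the practical way to avoid incomplete results.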

Related data: Neighbors and Links

The most valuable feature of the Entrez retrieval system is its data integration, which makes associations between different records easy to search and retrieve. Some of these associations may be unexpected, which makes Entrez a search engine for scientific discovery. Two major types of relationship are established in the Entrez system:

1. Hard links: based on information available in the records themselves, and

2. Neighbors: links derived from computed relationships within a database. Because neighbors are computationally determined, Entrez uses the BLAST and VAST algorithms to find similar sequences and structures; related citations in PubMed are found by an algorithm that compares the words and phrases of abstracts with each other.

Hard links connect records across databases; for example, a nucleotide sequence in GenBank is cross-linked to the original PubMed article(s) reporting that sequence. Similarly, PubMed articles have navigational links to the structure databases and to the full-text database PubMed Central, and UniGene provides links back to records in other databases. Combining hard links and neighbors is an effective technique for connecting data and finding valuable information, and it requires no extra searches. For example, to move from a nucleotide record (an mRNA) to the structure of the corresponding or a closely related protein, one can follow the navigational link to the protein database, move to related proteins, and from those proteins follow the link to the structure database. A dedicated Related Structure link on many protein records makes this process even faster.
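The same hops can be expressed programmatically through ELink, the E-utilities service that exposes Entrez links. The sketch below is an assumption-laden illustration rather than part of the workflow described above: the helper name elink_url is hypothetical, the UIDs are placeholders that stand for real record identifiers, and the URLs are only built, not requested.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(dbfrom, db, uid):
    """Build (but do not send) an ELink URL from a record in dbfrom to database db."""
    params = {"dbfrom": dbfrom, "db": db, "id": uid}
    return EUTILS_BASE + "elink.fcgi?" + urlencode(params)


# Placeholder UIDs; a real run would take each one from the previous response.
NUCLEOTIDE_UID = "12345"
PROTEIN_UID = "67890"
PUBMED_UID = "11111"

# Hop 1: nucleotide record (e.g. an mRNA) to its linked protein records.
print(elink_url("nuccore", "protein", NUCLEOTIDE_UID))
# Hop 2: one of those proteins to linked records in the structure database.
print(elink_url("protein", "structure", PROTEIN_UID))
# Neighbors stay inside one database: related citations for a PubMed record.
print(elink_url("pubmed", "pubmed", PUBMED_UID))
```

Each hop needs the UIDs returned by the previous one, so in practice the two cross-database calls are chained, which mirrors clicking through the navigational links on the record pages.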

Information Retrieval System: Implementation

NCBI provides an information retrieval system, Entrez, designed to give user-friendly access to biomedical data, including structural, molecular, sequence and literature data. Entrez provides access and search facilities for more than 30 databases covering genomes, health, structures, literature, sequences and chemicals. It offers faceted, limited and advanced search options with Boolean operators to customize the user’s query. It also supports queries with wild card characters, query mapping and controlled vocabularies. The web implementation of Entrez has more applications and benefits than Network Entrez, as it supports searching a tremendous amount of data across different databases. For each record, Entrez provides navigational links between databases, whether hosted by NCBI or external (journals and other databases), using two types of relationship: neighbors and hard links. Both of these types of relationships have been found on the basis of controlled vocabulary and algor...
