BioMart: An Innovative and Unified Access To Biological Databases

BioMart:

New high-throughput techniques have increased both the complexity and the quantity of biomedical data. Many bioinformatics resources have been created to link newly generated information with existing knowledge, but each resource has its own method for querying and processing information, which makes it hard for scientists to use them together in their research. Compiling results from even a few resources is a further challenge, because there is no common data catalogue and moving between resources means working with different query interfaces. Scientists also need to generate and maintain their own independent data sets. These problems call for a common interface that makes it simple to generate, manage and distribute data among scientists. The BioMart project addresses these challenges.

BioMart is a freely available, open-source, federated database system. It is cross-platform and supports several popular database management systems, including SQL Server, DB2, Oracle and MySQL. It also works with third-party packages such as Galaxy, biomaRt, Cytoscape and Taverna.

BioMart links geographically distributed databases. The software can be readily adapted to existing data sets, and it can be extended and customized through a plug-in system. Because BioMart is freely available, the scientific community can take part in its development and maintain their own data sets. These features provided the basis for the BioMart Central Portal, a community-maintained single point of access that brings important independent biological databases together under one umbrella.

For administrators, BioMart provides options to create plug-ins and make use of the available tools, which helps them to build and extend their own databases by passing queries to individual database servers. Administrators keep full control over their data and databases.

For users, the BioMart Central Portal provides a central repository for a variety of biologically important databases, as well as links to various external resources. Pancreatic Expression Database entries (Cutts, et al.) and KEGG pathway information are examples of the content included in the BioMart Central Portal.

As new resources are added to the BioMart Central Portal, they become available to its users. The web-based interface offers a variety of query and access methods, with application programming interfaces for SOAP, Java, REST and SPARQL. Moreover, data sets generated or maintained by users and administrators can be connected to the central access point, which allows data sets to be linked and distributed for research purposes.
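As a rough illustration of programmatic access, the sketch below builds an XML query document of the kind that BioMart's REST-style service accepts. The element layout is modelled on the BioMart XML query format; the dataset, filter and attribute names are illustrative assumptions, and no server URL is given because it depends on the portal being queried.

```python
import xml.etree.ElementTree as ET

def build_query(dataset, attributes, filters=None):
    """Build a BioMart-style XML query document and return it as a string."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated output
        "header": "0",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters restrict the rows returned; attributes choose the output columns.
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")

# Dataset, filter and attribute names below are assumptions for illustration.
xml_query = build_query(
    "hsapiens_gene_ensembl",
    attributes=["ensembl_gene_id", "external_gene_name"],
    filters={"chromosome_name": "21"},
)
print(xml_query)
```

The resulting string would typically be sent to the server as the value of a query parameter; the exact endpoint and parameter name should be taken from the documentation of the portal in question.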

Design and Implementation

The BioMart system has three tiers: a client tier, an application tier and a database tier. The client tier handles the query interfaces and presents results to users. The database tier is responsible for the organization, storage and maintenance of data. Between the two, the application tier executes and routes queries, retrieves data and organizes the results in the proper format for the user. The BioMart software integrates the front end with a Java application for creating, configuring and deploying server instances.

Database schema

BioMart data sets are organized in a relational schema optimized for query processing, known as the “reverse star” schema, which is derived from the warehouse star schema and designed for retrieving large volumes of data. Relationships from the central (main) table to the satellite (‘dimension’) tables are mostly one-to-many, or many-to-many, rather than many-to-one. Queries are optimized by accessing smaller tables: larger tables are split into several smaller ones. Further optimization comes from placing subsets of the data in different locations on the same server or on different servers.
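To make the layout concrete, here is a minimal sketch of a reverse star schema using Python's built-in sqlite3 module. The table names, columns and rows are invented for illustration and are not taken from any real BioMart data set. One main table holds one row per gene, and a dimension table holds the one-to-many values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Main table: one row per central object (here, a gene).
CREATE TABLE gene__main (
    gene_id_key INTEGER PRIMARY KEY,
    gene_name   TEXT,
    chromosome  TEXT
);
-- Dimension table: many rows per gene (one-to-many from the main table).
CREATE TABLE gene__go__dm (
    gene_id_key INTEGER REFERENCES gene__main(gene_id_key),
    go_term     TEXT
);
INSERT INTO gene__main VALUES (1, 'BRCA2', '13'), (2, 'TP53', '17');
INSERT INTO gene__go__dm VALUES
    (1, 'DNA repair'), (2, 'DNA repair'), (2, 'apoptotic process');
""")

# A query that filters on the main table and only joins the dimension it needs.
rows = conn.execute("""
    SELECT m.gene_name, d.go_term
    FROM gene__main AS m
    JOIN gene__go__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = '17'
    ORDER BY d.go_term
""").fetchall()
print(rows)
```

A query that only asks for gene names on a chromosome never touches the dimension table at all, which is the point of keeping the main table small.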

Algorithm

The BioMart algorithm automatically converts any database in third normal form into the reverse star schema. It first finds the longest paths between the central (main) relation and all other tables; walking along these paths, it joins and merges the relations to create the main table. The algorithm then checks the relationships among the contributing tables. If it finds a many-to-one or one-to-many relationship among them, one relation becomes the main table and the other contributing tables become sub-relations linked to it; otherwise, the tables are partitioned into separate tables.
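The following toy sketch shows one way to read the transformation described above. It is an interpretation for illustration only, not BioMart's actual implementation, and it does not reproduce the longest-path search. It assumes a small, tree-shaped, hypothetical schema in which every relationship is labelled with its cardinality: tables reached through many-to-one links are merged into the table that owns them, and tables reached through one-to-many links become separate dimension tables.

```python
from collections import deque

# Hypothetical normalized schema. Each entry lists (related_table, cardinality),
# read from the table on the left: "n:1" means each row has exactly one related
# row, "1:n" means each row can have many related rows.
RELATIONS = {
    "gene": [("species", "n:1"), ("transcript", "1:n"), ("xref", "1:n")],
    "transcript": [("biotype", "n:1"), ("exon", "1:n")],
    "species": [],
    "xref": [],
    "biotype": [],
    "exon": [],
}

def to_reverse_star(central):
    """Walk outward from the central table and decide where each table goes."""
    main = [central]     # source tables merged into the main table
    dimensions = {}      # dimension table -> source tables merged into it
    queue = deque([(central, None)])  # (table, owning dimension; None = main)
    while queue:
        table, owner = queue.popleft()
        for related, cardinality in RELATIONS[table]:
            if cardinality == "n:1":
                # Merging does not multiply rows, so fold into the owner.
                target = main if owner is None else dimensions[owner]
                target.append(related)
                queue.append((related, owner))
            else:
                # One-to-many: keep as a separate dimension table.
                dimensions[related] = [related]
                queue.append((related, related))
    return main, dimensions

main, dimensions = to_reverse_star("gene")
print(main)
print(dimensions)
```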

SQL compilation

BioMart processes user queries automatically, depending on whether the data sets are materialized or non-materialized. For materialized data sets, it compiles the query by choosing the relevant relations at run time along the SQL path, which yields a non-redundant result set. Queries on non-materialized data sets are compiled against the source schema in third normal form. ANSI SQL statements are used to ensure cross-platform portability.
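A minimal sketch of the idea for materialized data sets is shown below: given the requested attributes and filters, the compiler picks only the tables that hold those columns and emits a plain SQL statement. The column catalogue, table names and key name are hypothetical, and the `?` placeholders follow Python's DB-API convention rather than anything specific to BioMart.

```python
# Hypothetical catalogue of materialized tables: column name -> table holding it.
COLUMN_TABLE = {
    "gene_name": "gene__main",
    "chromosome": "gene__main",
    "go_term": "gene__go__dm",
}
MAIN_TABLE = "gene__main"
KEY = "gene_id_key"

def compile_sql(attributes, filters):
    """Return (sql, params), joining only the tables the query actually needs."""
    columns = list(attributes) + list(filters)
    needed = {COLUMN_TABLE[c] for c in columns}
    select = ", ".join(f"{COLUMN_TABLE[a]}.{a}" for a in attributes)
    # DISTINCT keeps the output free of duplicate rows.
    sql = f"SELECT DISTINCT {select} FROM {MAIN_TABLE}"
    for table in sorted(needed - {MAIN_TABLE}):
        sql += f" INNER JOIN {table} ON {table}.{KEY} = {MAIN_TABLE}.{KEY}"
    if filters:
        conditions = [f"{COLUMN_TABLE[c]}.{c} = ?" for c in filters]
        sql += " WHERE " + " AND ".join(conditions)
    return sql, list(filters.values())

sql, params = compile_sql(["gene_name", "go_term"], {"chromosome": "17"})
print(sql)
print(params)
```

Because the join list is derived from the requested columns, a query that only uses main-table columns compiles to a single-table statement.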

Query Engine

The query engine's main responsibility is to distribute queries to the underlying data sources and collect the results. Query processing is a multistep process that begins with validation. The validation step checks that the query's elements and restrictions exist as defined by the data provider; if they do, the query engine links them together. After validation, the query is partitioned into smaller subqueries that can be managed and executed efficiently in parallel. Depending on the portal configuration, queries across multiple data sets are compiled as either an intersection or a union. Intersection queries are optimized by the software, which finds the optimal execution path and links to pre-computed index files. All of this compilation takes place either on the same BioMart server or across different BioMart servers. BioMart optimizes memory usage by synchronizing incoming results with buffered results, so that the server consumes less temporary storage and performs better.
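The steps above (validate, split into subqueries, run in parallel, combine) can be sketched as follows. This is a simplified illustration with in-memory stand-ins for the data sources; the data set names and identifiers are made up, and the real engine's execution-path optimization and index files are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data sources: data set name -> records keyed by gene identifier.
SOURCES = {
    "expression": {"G1": "high", "G2": "low"},
    "pathways": {"G2": "apoptosis", "G3": "glycolysis"},
}

def validate(datasets):
    """Reject queries that are empty or name data sets the provider lacks."""
    if not datasets:
        raise ValueError("query must name at least one data set")
    unknown = [d for d in datasets if d not in SOURCES]
    if unknown:
        raise ValueError(f"unknown data sets: {unknown}")

def run_subquery(dataset):
    """One subquery per data set; returns the identifiers found there."""
    return set(SOURCES[dataset])

def run_query(datasets, mode="intersection"):
    validate(datasets)
    # Subqueries are independent, so they can run in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_subquery, datasets))
    combine = set.intersection if mode == "intersection" else set.union
    return sorted(combine(*results))

print(run_query(["expression", "pathways"]))
print(run_query(["expression", "pathways"], mode="union"))
```

With the toy data, the intersection keeps only identifiers present in every data set, while the union keeps identifiers present in any of them.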

Plug-ins

The BioMart community extends and develops the system by writing code and sharing it with the whole community as plug-ins, which perform tasks such as gene sequence retrieval and visualization. Plug-ins can be server-side or client-side modules.

Security

Hypertext Transfer Protocol Secure (HTTPS) protects data from interception as it travels between server and clients. BioMart also supports the OAuth protocol, which establishes connections between pre-selected trusted parties and the BioMart server for safe data exchange. The OpenID protocol is supported as well: it handles user authentication and allows user activity to be restricted by managing the visibility of data sets and user groups.

DATA CONTENT

The BioMart Central Portal contains a constantly growing list of data sources, accessible through a wide range of tools and methods. It includes: WormBase, Pancreatic Expression Database, EMAGE, UniProt, InterPro, COSMIC mart, IntOGen, DAPPER, GermOnline, SigReannot, Paramecium DB, MGI, Reactome, HapMap, HGNC, Ensembl, PRIDE, PepSeeker, Eurexpress, Gramene, Cidb, SalmonDB, BCCTB Bioinformatics Portal, Potato database, KazusaMart, VEGA, Rfam, VectorBase, Animal Genome database, GWAS Central, FANTOM5, Rice-Map, i-Phame, Ensembl Genomes, EMMA, EuroPhenome and RhesusBase.

INTERFACE

The BioMart interface provides three main sections for querying: Database search, Tools and Identifier search. The Database section gives access to the individual databases available through the BioMart interface. Relevant information can be found using two options: Search by organism or Search by Type. Search by organism is divided into categories for vertebrates, bacteria, invertebrates, plants and protists. Appropriate attributes are then selected to specify what information is required in the final result. Similarly, the Search by Type option is divided into categories such as protein sequence and structure, interactions and pathways, genomes, gene annotation, model organism databases, cancer and gene expression databases.

The Tools section links to the ID Converter, Sequence retrieval, Variant retrieval and Gene retrieval categories. The Variant retrieval and Gene retrieval sections give fast access to the most popular databases in the Central Portal. Sequence retrieval provides an easy way to query protein and genome sequences in various formats.

The Identifier search interface lets the user specify gene identifiers in many formats, such as RefSeq IDs, Ensembl IDs and gene names. It then searches for these identifiers across the member databases available to the BioMart Portal. The result page presents the information for each identifier, gathered from the various source databases.
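Conceptually, an identifier search resolves whatever identifier the user typed to a common key and then collects matching records from each member database. The sketch below illustrates that flow with toy in-memory tables; the database names and record contents are placeholders, not actual portal data.

```python
# Toy cross-reference table: any known identifier -> a common gene key.
ALIASES = {
    "TP53": "TP53",              # gene name
    "ENSG00000141510": "TP53",   # Ensembl gene ID
    "NM_000546": "TP53",         # RefSeq ID
}

# Toy member databases (placeholder names and contents).
MEMBERS = {
    "gene_annotation": {"TP53": "tumor protein p53"},
    "pathways": {"TP53": "apoptosis"},
    "expression": {},
}

def identifier_search(identifier):
    """Return {database: record} for every member database that knows the gene."""
    key = ALIASES.get(identifier.strip())
    if key is None:
        return {}
    return {db: records[key] for db, records in MEMBERS.items() if key in records}

print(identifier_search("ENSG00000141510"))
print(identifier_search("unknown_id"))
```

Databases with no record for the gene are simply left out of the result, so the output only lists sources that have something to report.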

