A computational biologist at Cornell has developed a resource to store and explore biological data that may be the key to unlocking the secrets of the protein universe.
By Jay Wrolstad
The quest for knowledge takes lots of twists and turns, with both promising leads and dead ends confronting scientists seeking insight into the secrets of life. Sifting through the massive amount of data continuously churned out by labs can be a daunting task for researchers looking to identify and establish links between basic biological phenomena.
Now such researchers have a powerful tool to help bring order to the chaos of computer databases in an array of formats storing information on thousands of proteins and genes. Called Biozon, the project is the brainchild of Cornell computational biologist Golan Yona, who has adapted the latest search technologies used by such Internet giants as Google and Amazon to create a hybrid data processing engine that promises to speed the pace of discovery.
“Biological data is being generated in so many different types and in such large-scale efforts, presenting a challenge to analyze it all effectively and to integrate the available information for the greatest benefit,” Yona says. Research facilities evaluating biological entities in vivo are now producing digital records on the physical sequences of chromosomes, the existence of molecular interactions, the expression patterns of genes, and so on, he notes. “The rules of the game have changed, so that scientists have much more information to consider,” he says.
His solution was to re-examine the way biological data is warehoused, retrieved, and analyzed, employing an extensive and tightly connected logical graph schema and advanced data-crunching methods.
A particular challenge is that the functions of many genes remain to be discovered, and just looking at the gene sequence itself doesn’t reveal the characterization. “And, if you really want to analyze a gene, you need to look at not just related genes, but consider also the interactions it forms and the cellular pathways it is involved in,” says Yona.
The same holds true for biological pathways comprising multiple genes. So understanding the pathway requires knowledge of the constituent proteins. “There is a strong mutual dependency among biological entities,” Yona says.
That mutual dependency is the key to Biozon’s effectiveness and is something that existing biological databases cannot provide. “Biological data is searchable, but we are creating a resource that will integrate all of the data and will relate the entities to each other,” says Yona. That includes connections between DNA sequences and the proteins they encode, relations between protein sequences and the structures that they fold into, and links between expression data and DNA sequences, to name a few.
Biozon currently stores information on no fewer than 40 million protein and DNA sequences, 30,000 protein structures, and 200,000 interactions, totaling 90 million documents from storehouses such as GenBank, Uniprot, PDB, and BIND, as well as from in-house computations. It also lists some 2.5 billion relations between documents.
“We have created a unified resource that allows a view of every entity in its broader biological context,” Yona says. Down the line, Biozon should be able not only to link to the available databases, but also to augment existing information in such a way that scientists can accurately predict interactions between proteins, their domain structure and their function, and the pathways they are involved in.
The process of downloading information from a database, analyzing it, and then moving on to the next depository is a tedious, time-consuming chore for researchers who often lack the skills to do the job efficiently, Yona contends. “We want to help them by compiling the data from multiple databases into a single unified knowledge resource, processing the data to determine relationships between biological entities (such as similarity relations), and then providing those results as part of the database.”
Four years in the making, Biozon remains a work in progress, with mountains of information still to be integrated. The goal is to develop a system that enables open contributions and the sharing of data using the infrastructure and tools created by Yona and his team.
The big bang in cyberspace resulted in a demand from computer users for help in navigating the Internet, especially for ways to find what they are looking for among billions of web sites. Google leaped to the forefront in search with an uncanny ability to deliver the most pertinent results at the top of what can be very long lists of query results on virtually any topic.
Google’s success, along with that of engines run by Lycos, Ask Jeeves, AltaVista, Yahoo, AOL, and Microsoft, among others, has ushered in some amazing search capabilities. Computer users can now quickly search through all of the information on their machines, obtain directions to a restaurant, or get a real-time, bird’s-eye view of locations near and far.
The graph theory technology behind such popular search engines serves as the model for Biozon. “The challenge was that there is not much precedent for this type of database, and few models to use. We had to create the concept ourselves—enabling complex searches not found elsewhere,” says Cornell programmer Aaron Birkland, who collaborated with Yona in creating the foundation for Biozon.
Biozon also borrows from concepts developed by online retailer Amazon.com, a popular web location with the ability to host sites for other vendors and to identify purchasing patterns among consumers that often lead to targeted marketing efforts. In a similar manner, accessing a wide variety of biological databases and identifying patterns across biological data has enormous potential for understanding relationships, predicting function, and discovering emergent data structures and biological phenomena.
The more information a researcher has at his or her disposal, the greater the accuracy of assertions being made, says Yona. “We looked at several different web searching methods and developed effective searching techniques, after determining that the Google method is the most effective way to rank and organize query results.”
Biozon’s effectiveness is the direct result of a data graphing structure that is expressive enough to conduct searches that span highly connected entities and can pinpoint those connections, Birkland says. “In biology, meaning is stated as shapes and pathways, and that lends itself to expressive types of searches.”
The underlying data graph holds a mind-boggling one terabyte (roughly a thousand gigabytes) of information in a relational database. A cluster of 50 machines is used to crunch the data. “It’s a very powerful tool,” Birkland notes.
Also working on the search engine were undergraduates Paul Shafer and Tim Isganitis, both of whom graduated with computer science degrees in May. They were charged with finding the optimum way to rank query results.
“Typically results are not sorted by relevance. Instead they may be ranked arbitrarily based on alphabetical ordering or by the date the record was created,” Isganitis explains. “Our method looks at the web as interconnected pages, like Google, with rankings based on the connectivity of those pages; usually the pages with the most links are the most significant.”
This approach can be applied to a database: by treating entities rather than words as the items to be ranked, protein or DNA sequences can be neatly organized. “We experimented with existing algorithms to see which ones provide the most useful ranking of results,” Isganitis says. Biologists can collect information from many sources, some of it useful and some of it outdated or irrelevant, and order the results so that the most useful information comes first.
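For readers who want to see the idea in miniature, the link-based ranking Isganitis describes can be sketched in a few lines of Python. This is a simplified PageRank-style iteration, not Biozon’s actual code: the function name, the damping value, and the entity identifiers are all assumptions made up for illustration.

```python
def rank_by_connectivity(links, damping=0.85, iterations=50):
    """Return (ranking, scores) for a directed graph given as {node: [targets]}.

    A simplified PageRank-style power iteration; illustrative only.
    """
    # Collect every node that appears as a source or as a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node keeps a small baseline score.
        updated = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this node's score along its outgoing links.
                share = damping * scores[node] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # A node with no outgoing links spreads its score evenly.
                share = damping * scores[node] / n
                for other in nodes:
                    updated[other] += share
        scores = updated

    ranking = sorted(nodes, key=lambda node: scores[node], reverse=True)
    return ranking, scores


# Hypothetical entities and relations, invented for this example.
links = {
    "dna:D1": ["protein:P1"],
    "dna:D2": ["protein:P1"],
    "protein:P1": ["structure:S1", "interaction:I1"],
    "protein:P2": ["interaction:I1"],
    "structure:S1": ["protein:P1"],
    "interaction:I1": ["protein:P1"],
}
ranking, scores = rank_by_connectivity(links)
print(ranking[0])
```

In this toy graph the protein that most other records point to rises to the top of the list, which is the behavior the quote describes for heavily linked pages.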
“It was great research experience, and it was very rewarding to see that the technology we helped to develop was performing as advertised,” Shafer says. “It will be a very valuable tool for biologists.”
Isganitis concurs, saying, “As my first project, Biozon was a valuable experience in that I learned how to conduct proper research methods.” He notes that being an author of a published academic paper detailing the work is a feather in his professional cap and was influential in his acceptance by Carnegie Mellon University to do graduate study at the Language Technologies Institute, focusing on computational biology and artificial intelligence.
To conduct a search on Biozon, the scientist first formulates a question, such as “What are all of the structures of proteins that interact with a particular breast cancer gene?” The answer requires data on protein-to-protein interactions, sequences, and structures and the ability to link it all together. The search engine will query all the relations between structures and proteins, and all relations between proteins and interactions. In Biozon, these relations are not established just based on cross-links between databases. The actual entities are examined carefully to determine possible relations that are often overlooked.
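A question like the one above amounts to walking a chain of relations: from a gene to the proteins it encodes, from those proteins to their interaction partners, and from the partners to their structures. The Python sketch below shows that chain on a tiny in-memory graph. It is only an illustration under stated assumptions: Biozon itself sits on a relational database, and the class, the relation names, and the identifiers here are hypothetical.

```python
from collections import defaultdict


class EntityGraph:
    """A toy graph of typed relations between biological entities."""

    def __init__(self):
        # Maps (source, relation) to the set of targets it reaches.
        self._edges = defaultdict(set)

    def relate(self, source, relation, target, symmetric=False):
        self._edges[(source, relation)].add(target)
        if symmetric:
            self._edges[(target, relation)].add(source)

    def follow(self, sources, relation):
        """Return every entity reachable from `sources` via `relation`."""
        found = set()
        for source in sources:
            found |= self._edges.get((source, relation), set())
        return found


# Hypothetical data, invented for this example.
graph = EntityGraph()
graph.relate("gene:G1", "encodes", "protein:P1")
graph.relate("protein:P1", "interacts_with", "protein:P2", symmetric=True)
graph.relate("protein:P1", "interacts_with", "protein:P3", symmetric=True)
graph.relate("protein:P2", "has_structure", "structure:S2a")
graph.relate("protein:P2", "has_structure", "structure:S2b")
graph.relate("protein:P3", "has_structure", "structure:S3")
graph.relate("protein:P4", "has_structure", "structure:S4")  # unrelated to G1

# "What are all of the structures of proteins that interact with gene G1?"
encoded = graph.follow({"gene:G1"}, "encodes")
partners = graph.follow(encoded, "interacts_with")
structures = graph.follow(partners, "has_structure")
print(sorted(structures))
```

Each call to `follow` corresponds to one hop in the question, so longer questions simply chain more hops together.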
“This increases tremendously the expressiveness of our system,” says Yona. “We can link biological entities based on physical properties, not just on identifiers.” Biozon takes such capabilities a step further by enabling “fuzzy” searches that list tenuous connections among data sets. The notion of similarity is critical, says Yona, because extrapolating on the properties of a particular protein or gene can result in a previously undetected relationship.
Similarities are used often in computational biology to infer relationships, and some databases allow this, but not based on expression data. Measuring the expression levels of different genes can indicate their activity in a cell under certain conditions. Thus, researchers can find correlations between genes, or similar patterns of expression, that are the basis for a functional link (such as an interaction) and in turn suggest new relationships.
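One common way to quantify “similar patterns of expression” is the Pearson correlation between two genes’ expression profiles. The article does not say which measure Biozon uses, so the Python sketch below should be read as one plausible approach rather than the project’s method. The gene names, the measurements, and the 0.9 threshold are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return covariance / math.sqrt(var_x * var_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """List gene pairs whose profiles correlate at or above the threshold."""
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(profiles[first], profiles[second])
            if r >= threshold:
                pairs.append((first, second, r))
    return pairs


# Hypothetical expression levels across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
pairs = coexpressed_pairs(profiles)
for first, second, r in pairs:
    print(first, second, round(r, 3))
```

Pairs that pass the threshold are candidates for a functional link, which a researcher would then check against other evidence such as known interactions.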
Yona is now offering Biozon to the scientific community and getting some rave reviews. “People are excited about how this can be used, how powerful it is,” he says. “A faculty member at one university told me she had a full-time assistant doing these types of searches. There is still a lot of work to do with the prototype, but we have something that is a big advance in database research.”
Also singing Biozon’s praises is David Linn, an assistant professor of biomedical sciences at Cornell who contributed to the project. “Biologists now have access to all the information they need in one place. This makes it much easier to find data on a given gene and related proteins than searching through several databases,” he says.
Linn cites the search engine’s ability to conduct comprehensive data analysis, noting similarities, as particularly useful to him and his peers. “The big question is, what does a gene do? More than half of the identified genes have no known function, and this will allow us to determine that function more quickly, potentially saving years of time spent on research.”
In delving into the roles of genes in the human olfactory system, says Linn, he has taken advantage of Biozon’s capabilities. “We know that these genes have an effect on neurons in the nose and transmit information to the brain, but we have discovered a number of them that have unknown functions. With Biozon we may be able to determine those functions.”
And that’s just one of myriad applications for the search engine, Linn notes. “This has potential for any biological question and any type of biological analysis.”
That utility increases with users such as Linn being able to register with Biozon so that they can contribute or access comments posted on the site, store query results for future reference, and get access to proprietary information.
A biologist may want more than just a list of search results. He or she might want the actual entities or biological structures themselves, to download or review, says Yona. “In Biozon, users can save their queries, materialize the entities, or reproduce prior results,” he says. “Scientists can focus on research rather than on database construction, web tools, and other indirectly related research tasks. This also helps eliminate a duplication of efforts in biology labs.”
To speed the knowledge transfer process, Biozon allows experts to add notes to specific biological entities before the formal publication of a paper. “The time between a scientific discovery and the release of that discovery could be several years. We can provide links to a research Web site and release new information, pending authorization.”
Yona calls Biozon a “complete roadmap of the protein universe” — an apt description of a tool that will help scientists figure out where they are headed and get there faster.