Christoph Koch is looking for better ways to handle data—lots of data.
All sorts of scientists, from physicists to genetic engineers, not to mention Internet search engines such as Google and Yahoo, must find ways to work efficiently with extremely large amounts of data. Most hard drives today are measured in gigabytes, but these databases are measured in petabytes. One petabyte is 1,048,576 gigabytes.
“It’s an exciting area with lots of new things to be done,” says Koch. “It’s also an area that allows me to take new interests I have and combine them with other things that I’m doing, like logics, graph theory, statistics, and probability theory, and put them into a system that a lot of people are interested in.”
Koch gave up a faculty position at Saarland University in Germany, where he was the only one working in data management, to join a strong research group at Cornell. “Some of my friends don’t understand it. It was a tough decision, but in the end I think it was worth it,” he says. “Cornell has a very strong set of people working on network science. It is great to be here, to have people around to talk to about these issues.”
And it’s not just other systems researchers he’s looking forward to collaborating with. “The systems people here provide new research problems for theorists and the theorists feed us back with foundational results that help us drive our systems efforts forward,” he says. “We’re going to be consumers of their results and we’ll be validating their research to some extent.”
Managing incomplete information is one area in which Koch is interested. Incompleteness can be a problem in scientific databases as well as in more familiar sources such as Wikipedia, where information is entered manually and is therefore prone to mistakes. Because Wikipedia is produced by its community, there is no central control. In general, says Koch, the community does a good job of maintaining the quality of the information, but sometimes people disagree about content and get into edit wars, with one person changing an entry and another person changing it back.
Rather than appoint an arbiter to decide who is right, such disagreements could be embraced while still providing users with the data that best serves their needs, says Koch. “The disparate views could be integrated by allowing users to vote on the data,” he says. “This technique, to some extent, is similar to those used by Google and others. If a page has a high number of references, the information on that page gets a higher quality rating.”
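As a rough illustration of the voting idea, the competing versions of an entry could be ranked by the share of user votes each one receives, so that every view is kept and none is discarded by an arbiter. The sketch below is only an assumption about how such a tally might look; the function name `rank_versions` and the sample votes are hypothetical and do not come from Koch's work.

```python
from collections import Counter


def rank_versions(votes):
    """Rank competing versions of an entry by their share of the votes.

    `votes` is a list in which each element names the version one user
    voted for. All versions are kept; they are only ordered by support.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    # most_common() orders versions from most to fewest votes.
    return [(version, n / total) for version, n in counts.most_common()]


# Hypothetical votes on two disputed versions of the same entry.
votes = ["version A", "version B", "version A", "version A", "version B"]
ranked = rank_versions(votes)
for version, share in ranked:
    print(f"{version}: {share:.0%} of votes")
```

A user could then be shown the top-ranked version by default while the others remain available along with their level of support.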
Koch’s research interests also include data streams and data management in video games and simulations. “These problems are kind of new to data management, but very important for managing data for communities,” says Koch. “The algorithms for doing things in the right way need to be worked out.”