Ever dreamed of having access to the information in the millions of books you’ll never have the time to read? David Mimno is your guy. As an assistant professor of Information Sciences, he examines a question that’s not often answered, even in this Age of Information. “While search engines are really good at finding things that people know they want, my work is about giving people the ability to discover what’s available even if they don’t know what they’re looking for,” he says. Specifically, Mimno uses data mining techniques to make sense of huge amounts of data, including text from historical books and documents, social networks, and genomic data.
After majoring in classics and computer science at Swarthmore College, Mimno worked as a computer programmer and eventually landed a job in the UK with the University of Sheffield’s natural language processing group, which uses computational methods to process written language intelligently. Mimno then went on to work at Tufts University’s Perseus Project, which digitizes ancient works from Greek, Roman, and other historical cultures. “It’s tough to read Greek and Latin,” says Mimno. “My goal was to put all of the support structure around reading, including morphological analysis, links to dictionaries, aligning translations with original texts with commentaries.” The work drew together huge amounts of information into a single digital source. “I once laid out all of the books on the table that are equivalent to one page [of Perseus Project text],” says Mimno. “It equated to about seven or eight volumes.”
His time with the Perseus Project led Mimno to pursue a Ph.D. in computer science at UMass Amherst, with a focus on machine learning and statistical data mining. There he worked on a technique called statistical topic modeling, which represents large unstructured text collections (e.g., words on a page, divided into documents) in a clear and searchable manner. “It tackles the question of, what do you do if someone dumps a hundred thousand pages of text on your desk? How do you make sense of that?” says Mimno. “The technique has been particularly useful for people who want to know what’s in a collection without reading it all the way through.”
Mimno has analyzed collections as big as 1.2 million books. “That’s the central issue that I work with,” he says. “How do you give people access to vast amounts of text they won’t possibly have time to read?” Currently, keyword searches are the main way people try to locate digital information. However, “language is highly variable; there’s a lot of ways to say basically the same thing,” says Mimno. “One of the failures of a keyword search is, you don’t know what you’re not finding, or what you’re not asking for.”
With Matthew Jockers at the University of Nebraska, Mimno has brought this approach to the study of English literature. “You’re not just limited by the 200 19th-century novels on your shelf,” says Mimno. “You have access to 5,000 novels. That’s an amazing resource that can fundamentally change what we’re able to say about the history of literature.” The approach can help with other historical research as well. “Frequently, we have sources that people aren’t ever going to read. Almost no one is interested in the Iowa farm journal in 1889 on July 12th. The specific documents are not interesting, but taken in aggregate, we can learn a lot about life back then.”
Mimno’s research at Cornell goes beyond written text. He also mines network data, such as social networks, to predict how people are distributed across communities, which combinations of proteins interact with each other, which airports are connected to each other, and how certain genes are associated with certain traits. “In general, I want to design statistical methods that give people summaries and analytical tools for large data sets,” he says.
While this type of computational approach is familiar to researchers working in science and technology, he says he has run into some healthy skepticism from humanities researchers. This doesn’t bother him too much. “I would really like computing to be as important in history as it is in physics, but we need to demonstrate that it’s useful, and working with scholars like Jockers is vital,” says Mimno. “The statistical approaches are nowhere near the understanding of a grad student, but [they] can show you where there are areas that need more research.”