What Can We Do With a Million Scanned Books?

Associate Professor R. Manmatha

Center for Intelligent Information Retrieval. Dept. of Computer Science, University of Massachusetts, Amherst, MA, USA

Date and time: 12.00 - 13.00, Friday 30th March, 2012

Venue: 10.08.04 (Building 10, Level 8, Room 4)

Abstract:

"A great library contains the diary of the human race." George Dawson (1821-1876), address on the opening of the Birmingham Free Library

Two kinds of problems arise from large collections of scanned books. The first concerns infrastructure. For example, it is useful to know the quality of the OCR output. We will discuss how to estimate that quality when only the noisy OCR output is available.

The second set of problems concerns how to mine useful information from the content of books. Some kinds of links and structure emerge only at the collection level, not at the book level. For example, it would be useful to connect all the versions of Shakespeare's Othello in a set of books. One version of Othello may have only the main text, a second may have the text and substantial footnotes, while a third may be part of a set of plays. By finding partial duplicates, one can link books that share substantial re-used text within a collection. This information has a variety of applications, including providing the appropriate result in response to a query.

The challenge is to do this with noisy OCR output and at large scale, since the number of book pairs that may need to be compared grows quadratically with the size of the collection. To solve this problem, we represent a book by the sequence of unique words in it. Two books are partial duplicates if there is sufficient overlap when the sequences are aligned. This problem can be solved efficiently, and the approach gives better accuracy than shingling techniques. The technique is also resilient to OCR errors. We will also describe how the method can be used to find translations.
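The idea of aligning unique-word sequences can be illustrated with a small sketch. This is not the speakers' implementation: it assumes that "unique words" means words occurring exactly once in a book, it uses a plain longest-common-subsequence alignment, and the function names and the normalisation by the shorter sequence are hypothetical choices made for illustration only.

```python
import re
from collections import Counter


def unique_word_sequence(text):
    """Return the words that occur exactly once in the text, in reading order.

    Assumption: "unique" is read as "appears once in this book".
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_score(text_a, text_b):
    """Fraction of the shorter unique-word sequence that aligns with the other.

    Hypothetical scoring rule: aligned words divided by the shorter length.
    """
    seq_a = unique_word_sequence(text_a)
    seq_b = unique_word_sequence(text_b)
    shorter = min(len(seq_a), len(seq_b))
    if shorter == 0:
        return 0.0
    return lcs_length(seq_a, seq_b) / shorter


plain = "the moor speaks of love and jealousy in venice"
annotated = "the moor speaks [note one] of love and jealousy [note two] in venice"
unrelated = "a treatise on geology and rocks"

print(overlap_score(plain, annotated))   # high: same text with footnotes added
print(overlap_score(plain, unrelated))   # low: almost nothing aligns
```

A word corrupted by OCR simply fails to match and drops out of the alignment, so the remaining words can still line up. A score above some chosen threshold would then flag the pair as partial duplicates. The threshold itself is not given in the abstract.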

Joint work with Zeki Yalniz and Ethem Can

About the speaker:

R. Manmatha is a research associate professor at the University of Massachusetts, Amherst, and is also part of the Center for Intelligent Information Retrieval. His research interests are in image and video retrieval and in the recognition and retrieval of printed and handwritten books. He has previously worked on the automatic annotation and retrieval of images, and he and his students developed the first automatic search engine for historical handwritten scanned documents, specifically those of George Washington. He is currently an associate editor of PAMI and Pattern Recognition Letters and has served on several conference committees. He co-founded SnapTell, a mobile image search company that was acquired by Amazon.


Seminar Organisation

Seminars are free and open to the general public. No booking is necessary. If you would like to give a presentation in this seminar series or suggest a speaker, please contact Xiaodong Li, the seminar co-ordinator.