Machine Learning for Information Integration in Industry

Dr. Benjamin Rubinstein

IBM Research Australia

Date and time: 11.30 - 12.30, Friday September 27 2013

Venue: 10.08.03

Abstract:

Information integration is a staple of the literature in databases (aka entity resolution, deduplication), statistics (aka record linkage), and data mining (related to clustering). I will discuss recent applications at Microsoft, detailing two well-motivated research projects on an entity resolution platform that has been used repeatedly in production in Bing, the Xbox 360 and elsewhere, and that has helped drive hundreds of millions of dollars in revenue.
Machine learning is key to entity resolution, as a means of computing similarity scores between records. While the state of the art has focused on matching one or two sources, matching many sources with existing methods places untenable demands on labeling training data. In the first part, I will describe a novel multi-task learning problem for achieving high accuracy when matching multiple structured sources under small label budgets.
In the second part, I will discuss merging records that have previously been matched. We develop a novel Bayesian probabilistic graphical model that merges records while simultaneously learning source qualities. Work based on CIKM'12 and VLDB'12 papers with Negahban (MIT), Zhao (MSR), Gemmell (Trov), Han (UIUC).

About the speaker:

Ben joined IBM Research Australia in early 2013 as a Research Staff Member, returning to Melbourne after 9 years in California. Previously he enjoyed short stints in the research divisions of Google, Yahoo! and Intel, and for close to 3 years was a full-time Researcher at Microsoft Research, Silicon Valley. At MSR, in addition to patenting and research activities, Ben shipped production systems for entity resolution in Bing and the Xbox 360. Ben earned his PhD in Computer Science from UC Berkeley under Peter Bartlett in 2010, working at the boundary of machine learning and security. Ben actively researches topics in machine learning, security, privacy, and databases, and has served on the program committees of, or organised workshops at, the corresponding top meetings: ICML, IJCAI, CCS, SIGMOD.


Seminar Organisation

Seminars are free and open to the general public. No booking is necessary. If you are interested in giving a presentation in this seminar series, or would like to suggest a speaker, please contact Lawrence Cavedon, the seminar co-ordinator.