Enhancing Data Quality Through Semantics

This project exploits semantics, both semantics implicit in the data and explicit constraints/rules, via graph-based techniques to improve data quality. The goal is to address ambiguity in entity resolution, record linkage, Web people search, speech-based tagging, and multimedia event detection.

Data Cleaning

During my PhD studies I focused on problems that lie at the intersection of the Databases, Information Retrieval, and Natural Language Processing areas. Specifically, my research topic is decreasing the ambiguity in (i.e., cleaning) very large datasets in order to improve their quality for analysis and decision making. Nowadays, most analysis and decisions are based on automatically collected information. For instance, in CiteSeerX, publication information is extracted from crawled files using automatic extraction tools. However, Information Extraction tools are not perfect yet, so they are bound to make mistakes. As a result, such databases contain many duplicate or erroneous records that have to be eliminated. To avoid the `garbage in, garbage out' problem, automatically created datasets should be made as clean as possible.
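As a minimal illustration of the duplicate-elimination step described above (this is a hypothetical sketch, not the project's actual algorithm), extracted publication records can be deduplicated by normalizing titles and comparing their token sets with Jaccard similarity:

```python
def normalize(title):
    """Lowercase and strip punctuation so extraction variants compare equal."""
    return "".join(c.lower() for c in title if c.isalnum() or c.isspace()).split()

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dedupe(records, threshold=0.8):
    """Greedy dedup: keep a record only if it does not match an already-kept one."""
    kept = []
    for rec in records:
        toks = normalize(rec["title"])
        if not any(jaccard(toks, normalize(k["title"])) >= threshold for k in kept):
            kept.append(rec)
    return kept

records = [
    {"title": "Entity Resolution in Large Datasets"},
    {"title": "Entity resolution in large data-sets"},  # extraction variant
    {"title": "Web People Search"},
]
print(len(dedupe(records)))  # 2
```

Real systems replace the single similarity threshold with blocking, richer features (authors, venues, years), and semantic constraints, but the pipeline shape is the same.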

Web People Search

Internet search has become an essential part of our everyday routines. People use search engines to get answers to questions ranging from ``what to cook for dinner?'' to ``what is the h-index of John Smith?''. Web People Search is one such task, and it can be seen as an instance of the entity resolution problem applied to Web search results. The idea is to group, on demand, the Web pages returned for a queried person name; this can be viewed as query-time entity resolution on unstructured datasets. A search engine such as Google or Yahoo! returns a set of web pages, in ranked order, where each web page is deemed relevant to the search keyword entered (the person name in this case). A search for a person such as, say, ``Michael Jordan'' will return pages relevant to any person with the name Michael Jordan.

A next-generation search engine can provide significantly more powerful models for person search. Assume (for now) that for each such web page, the search engine could determine which real entity (i.e., which Michael Jordan) the page refers to. This information can be used to provide clustered person search: instead of a list of web pages of (possibly) multiple persons with the same name, the results are clustered by associating each cluster with a real person. The clusters can be returned in a ranked order determined by aggregating the ranks of the web pages that constitute each cluster. With each cluster, we also provide a summary description representative of the real person associated with that cluster (in this example, the summary description may be a list of words such as ``statistics, computer science, machine learning, professor''). The user can home in on the cluster of interest to her and get all pages in that cluster, i.e., only the pages associated with that Michael Jordan.
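The clustered-search idea above can be sketched in a few lines. This is an illustrative toy, not our actual system: pages are clustered greedily by keyword overlap (standing in for richer entity-resolution features such as named entities and hyperlinks), cluster rank is a simple reciprocal-rank aggregate, and the summary is the cluster's most frequent keywords:

```python
from collections import Counter

def cluster_pages(pages, threshold=0.3):
    """Greedy single-pass clustering of ranked search results by keyword overlap."""
    clusters = []
    for page in pages:  # pages arrive in search-engine rank order
        toks = set(page["keywords"])
        for cl in clusters:
            if len(toks & cl["terms"]) / len(toks | cl["terms"]) >= threshold:
                cl["pages"].append(page)
                cl["terms"] |= toks
                break
        else:
            clusters.append({"pages": [page], "terms": set(toks)})
    for cl in clusters:
        # aggregate member page ranks into a cluster score (sum of reciprocal ranks)
        cl["score"] = sum(1.0 / p["rank"] for p in cl["pages"])
        # summary description: the cluster's most frequent keywords
        counts = Counter(w for p in cl["pages"] for w in p["keywords"])
        cl["summary"] = [w for w, _ in counts.most_common(3)]
    return sorted(clusters, key=lambda c: -c["score"])

pages = [
    {"rank": 1, "keywords": ["machine", "learning", "professor"]},
    {"rank": 2, "keywords": ["basketball", "bulls"]},
    {"rank": 3, "keywords": ["statistics", "machine", "learning"]},
]
top = cluster_pages(pages)[0]
print(top["summary"])  # ['machine', 'learning', 'professor']
```

Here the two academic pages merge into one cluster that outranks the basketball page, mirroring the ``which Michael Jordan?'' scenario.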

I implemented a demo system that clusters Web search results for person-name queries, which allowed us to test our different algorithms on live data.

Automatic Evaluation of Retrieval Effectiveness

In my M.Sc. thesis, I worked on the retrieval effectiveness of Information Retrieval systems. I studied how to measure retrieval effectiveness in the absence of human relevance judgments. The models we proposed used data fusion techniques to generate the judgment sets automatically.