Information Organization and Retrieval with Collaboratively Generated Content

The proliferation of ubiquitous Internet access enables millions of Web users to collaborate online on a variety of activities. Many of these activities result in the construction of large repositories of knowledge, either as their primary aim (e.g., Wikipedia) or as a by-product (e.g., Yahoo! Answers). In this tutorial, we will discuss organizing and exploiting collaboratively generated content (CGC) for information organization and retrieval. Specifically, we intend to cover two complementary areas of the problem: (1) using such content as a powerful enabling resource for knowledge-enriched, intelligent representations and new information retrieval algorithms, and (2) developing supporting technologies for extracting, filtering, and organizing collaboratively created content.

The unprecedented amount of information in CGC enables new, knowledge-rich approaches to information access that are significantly more powerful than conventional word-based methods. Considerable progress has been made in this direction over the last few years. Examples include explicit manipulation of human-defined concepts and their use to augment the bag of words (cf. Explicit Semantic Analysis), using large-scale taxonomies of topics from Wikipedia or the Open Directory Project to construct additional class-based features, and using Wikipedia for better word sense disambiguation.
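To make the idea of augmenting the bag of words with human-defined concepts concrete, the following is a minimal sketch in the spirit of Explicit Semantic Analysis, not the published implementation. The three toy "articles", the function names, and the simple TF-IDF weighting are all assumptions made for illustration; a real system would build the index from a full Wikipedia dump and apply further filtering.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy "articles" standing in for Wikipedia concepts (illustration only).
CONCEPTS = {
    "Jaguar (animal)": "jaguar cat predator jungle animal prey",
    "Jaguar Cars": "jaguar car engine luxury vehicle british",
    "Rainforest": "jungle tree rain animal forest canopy",
}


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def build_inverted_index(concepts):
    """Map each word to {concept name: TF-IDF weight of the word in that concept}."""
    tf = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    n = len(concepts)
    index = defaultdict(dict)
    for name, counts in tf.items():
        for word, count in counts.items():
            index[word][name] = count * math.log(n / df[word])
    return index


def concept_vector(text, index):
    """Represent a text as a weighted vector over concepts instead of words."""
    vec = defaultdict(float)
    for word, count in Counter(tokenize(text)).items():
        # Words absent from every concept contribute nothing.
        for concept, weight in index.get(word, {}).items():
            vec[concept] += count * weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


index = build_inverted_index(CONCEPTS)
a = concept_vector("predator in the jungle", index)
b = concept_vector("forest canopy", index)
c = concept_vector("luxury engine", index)
```

In this toy setup, the texts behind `a` and `b` share no words, yet `cosine(a, b)` is positive because both map onto the "Rainforest" concept, while `cosine(a, c)` is zero. A purely word-based comparison would score both pairs as unrelated.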

However, the quality and comprehensiveness of collaboratively created content vary widely, and for this resource to be useful, a significant amount of preprocessing, filtering, and organization is necessary. Consequently, new methods for analyzing CGC and the corresponding user interactions are required to effectively harness the resulting knowledge. Thus, not only can content repositories be used to improve IR methods, but the reverse is also possible: better information extraction methods can be used to automatically collect more knowledge or to verify contributed content. This natural connection between modeling the generation process of CGC and effectively using the accumulated knowledge suggests covering both areas together in a single tutorial.

The intended audience of the tutorial includes IR researchers and graduate students who would like to learn about recent advances and research opportunities in working with collaboratively generated content. The emphasis of the proposed tutorial will be on comparing existing approaches and presenting practical techniques that IR practitioners can use in their research. We also plan to cover open research challenges and to survey available resources (software tools and data) for getting started in this research field.


Dr. Eugene Agichtein is an Assistant Professor in the Math & Computer Science Department at Emory University. He is a founder of the Emory Intelligent Information Access Laboratory (IRLab). Eugene's research expertise is in information access and retrieval, in particular in understanding and modeling user interactions in web search and social media to improve information access and discovery.

Dr. Evgeniy Gabrilovich is a Senior Research Scientist and Manager of the NLP & IR Group at Yahoo! Research. His research interests include information retrieval, machine learning, and computational linguistics. Evgeniy is a recipient of the Karen Sparck Jones Award for his contributions to natural language processing and information retrieval. Evgeniy earned his MSc and PhD degrees in Computer Science from the Technion - Israel Institute of Technology. In his PhD thesis, he developed a methodology for using large-scale repositories of world knowledge (e.g., all the knowledge available in Wikipedia) to enhance text representation beyond the bag of words.