Efficient techniques for multi-document summarization using document graphs

Date of Completion

January 2006


Computer Science




Due to the explosion in the amount of information available online, researchers in many fields have turned their attention to the problem of multi-document summarization. The challenges of multi-document summarization include capturing hidden information or relations between concepts/entities in the text, generating user-focused summaries, fusing salient information from different sources into one concise summary, and identifying the different needs for different types of summaries.

In this thesis we present new techniques for automatically generating summaries of multiple documents using document graphs and meta-search algorithms. We propose using document graph algorithms to capture hidden relations among the concepts/entities in the text. Capturing these relations helps identify the most salient information by building a relation tree that contains hierarchical information about the relations among the concepts/entities in the original document(s). Our algorithms measure the salience level of each relation, and hence the salience level of each concept/entity. We compare the summaries produced by our algorithms to those of other summarizers that participated in the Document Understanding Conference.

We also present new techniques for generating summaries based on meta-search algorithms. Meta-search algorithms have been used successfully in many applications. Our new meta-search summarization systems generate summaries by composing pre-generated summaries from different systems, which strengthens the quality of the resulting summaries.

One of the most important problems in multi-document summarization is generating user-focused summaries that suit the user's intent or query. In our query-summarization model we present a novel approach that generates user-focused summaries based on pre-specified queries.

We also present a novel approach that uses the similarity measure as feedback to the summarization system to improve its quality. To the best of our knowledge, this approach has not been used before in the context of text summarization.

Finally, we present a case study on one of the largest document collections, the PubMed collection. We apply five of our approaches to the collection and present a comprehensive analysis of the results.