Date and time:
Venue: 10.08.04
Chairs: Xiaodong Li
Abstract:
In information retrieval, queries can fail to find documents due to mismatch in terminology. Query expansion is a well-known technique addressing this problem, and has been shown to be effective at improving query performance. During this talk, I will address four areas that I investigated during my PhD candidature. Current techniques for query expansion use fixed values for key parameters. These parameters may not be generally applicable, and are not optimal for all queries. We demonstrate that new methods for choosing parameters must be found and that there is considerable further scope for improvement to effectiveness through better query expansion. Typically, expansion terms are selected from highly ranked documents returned from an initial retrieval run. However, terms can also be selected from past user queries that are associated with documents in the collection. For web queries, our experiments show relative improvements over unexpanded full text retrieval of 25% over an optimised, conventional expansion approach. Retrieving documents from which expansion terms are sourced is costly. Instead, terms can be drawn from brief document summaries that are held in memory, significantly reducing the time required for query expansion by a factor of five to ten while roughly maintaining effectiveness. Rather than expanding queries themselves, expansion of documents at indexing time would increase query throughput greatly. We will explain two different methods of document expansion. However, the context of queries cannot be matched at indexing time and we find that document expansion is unpromising.
Speaker Bio:
Bodo Billerbeck has
recently completed his PhD in the
Date and time:
Venue: 10.08.04
Chairs: Xiaodong Li
Abstract:
Ongoing changes in computer architecture are affecting the efficiency of sorting
algorithms. The size of main memory in typical computers continues to grow but
memory accesses require increasing numbers of instruction cycles, which is a
problem for the most efficient of the existing string-sorting algorithms, as
they do not utilize cache well for large data sets. We propose a new sorting
algorithm for strings, burstsort, based on dynamic
construction of a compact trie in which strings are
kept in buckets. Our experiments show that, for large sets of strings, burstsort is almost twice as fast as any previous
algorithm, due primarily to a lower rate of cache miss.
However, accessing strings using pointers is not particularly cache-efficient. We explore the alternative strategy of copying suffixes of the strings into the buckets. We introduce C-Burstsort, which transfers the unexamined tail of each key to the bucket and discards the original key. This approach results in improved locality, reduced memory usage, and sorting up to three times as fast as the original burstsort and consequently up to six times faster than previous algorithms. Tries have traditionally been used for string processing tasks such as searching. We have applied tries to sort integers and our algorithms was shown to be competitive with the most efficient algorithms. Moreover, for sorting signed and floating-point numbers, in contrast to other radix-based methods burstsort requires neither pre nor post-processing of the data.
Speaker Bio:
Ranjan Sinha has
recently completed his PhD in the
Seminar Organisation
Seminars are free and open to the general public. No booking is necessary. If you are interested in giving a presentation in this seminar series, or to make suggestions for speakers, please contact Xiaodong Li, the seminar co-ordinator.