The Canterbury Corpus

Compressed-Domain Pattern Matching with the Burrows-Wheeler Transform (Honours Project Report by Matt Powell, 2001)

This report investigates two approaches for online pattern-matching in files compressed with the Burrows-Wheeler transform (Burrows & Wheeler, 1994). The first is based on the Boyer-Moore pattern matching algorithm (Boyer & Moore, 1977), and the second is based on binary search. The new methods use the special structure of the Burrows-Wheeler transform to achieve efficient, robust pattern matching algorithms that can be used on files that have been only partly decompressed. Experimental results show that both new methods perform considerably faster than a decompress-and-search approach for most applications, with binary search being faster than Boyer-Moore at the expense of increased memory usage. The binary search in particular is strongly related to efficient indexing strategies such as binary trees, and suggests a number of new applications of the Burrows-Wheeler transform in data storage and retrieval.

powell2001_hons_report.pdf (528,537 bytes)

Searching BWT Compressed Text with the Boyer-Moore Algorithm and Binary Search (Tim Bell, Matt Powell, Amar Mukherjee and Don Adjeroh, 2001)

This paper explores two techniques for on-line exact pattern matching in files that have been compressed using the Burrows-Wheeler transform. We investigate two approaches. The first is an application of the Boyer-Moore algorithm (Boyer & Moore, 1977) to a transformed string. The second approach is based on the observation that the transform effectively contains a sorted list of all substrings of the original text, which can be exploited for very rapid searching using a variant of binary search. Both methods are faster than a decompress-and-search approach for small numbers of queries, and binary search is much faster even for large numbers of queries.

search_bwt.pdf (173,568 bytes)

Evaluating Lossless Compression Methods (Matt Powell, 2001)

This paper describes the work being done in maintaining the Canterbury Corpus website, and in particular the process of automating results generation. We investigate the popularity and usefulness of the Canterbury Corpus as a data compression standard, and propose several areas for further research and development of the current system.

evaluate.pdf (110,248 bytes)

A corpus for the evaluation of lossless compression algorithms (Ross Arnold and Tim Bell, 1997)

(no abstract available)

corpus.pdf (98,835 bytes)

This page maintained by Matt Powell
Last updated: Monday, 21 January, 2002