Tuesday, July 13, 2004

Spotlight on Spotlight

Mighty interesting read on Apple's new search engine and how it manages metadata, a traditionally difficult computer science problem.
------------------------
Daring Fireball says Apple's Spotlight will be a real, well-thought-out product:
Daring Fireball: Spotlight on Spotlight: ...two years ago... Apple hired Dominic Giampaolo, renowned file system design expert and creator of the highly-regarded, metadata-rich Be File System.... [W]hat then has Giampaolo been working on?... Spotlight — which is, in the words of one WWDC attendee, Giampaolo’s “baby”.... [T]he aforementioned source who attended the Spotlight session at WWDC sent me the following report:
Spotlight is completely, relentlessly focused on files and files’ metadata. Files are the only object returned to Spotlight queries. Two aspects of Jobs’ keynote were thus misleading: The “spotlight” effect on System Preferences was wholly unrelated to Spotlight. Spotlight’s ability to show results from Apple Mail archives on Jobs’ machine was tantamount to a sham. Believe it or not, Tiger Mail has switched to an “exploded” Maildir-like storage format with a single message per file.
One implication of Spotlight’s file-centricity is that its ability to search “email” might not apply to clients other than Apple Mail — it’s the fact that the new Tiger version of Mail stores each message as a separate file that allows Spotlight to effectively return individual mail messages as search results. No other major mail client uses a one-message-per-file storage format.
Spotlight’s full-text search is outsourced to SearchKit, which will be considerably faster in Tiger (“3x indexing, 20x incremental search” over Panther). So, Spotlight has three places to look for information about files: its own hand-tuned substring-matching metadata store (built by Giampaolo, not part of Core Data or anything else), Carbon’s HFS+ catalog calls (so Spotlight will respond to searches for type and creator), and SearchKit’s full-text index.
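The three-source lookup described above can be pictured with a rough Python sketch. This is only an illustration of the idea, not Apple's code or API: the function names and the plain dictionaries standing in for the metadata store, the HFS+ catalog, and the SearchKit index are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-ins for the three sources named in the report.
MetadataStore = Dict[str, Dict[str, str]]   # path -> {attribute: value}
Catalog = Dict[str, Tuple[str, str]]        # path -> (type code, creator code)
FullTextIndex = Dict[str, Set[str]]         # lowercase word -> paths containing it


def search_metadata_store(store: MetadataStore, term: str) -> Set[str]:
    """Case-insensitive substring match against every metadata value."""
    needle = term.lower()
    return {
        path
        for path, attrs in store.items()
        if any(needle in value.lower() for value in attrs.values())
    }


def search_catalog(catalog: Catalog, term: str) -> Set[str]:
    """Exact match on a file's type or creator code."""
    return {path for path, codes in catalog.items() if term in codes}


def search_full_text(index: FullTextIndex, term: str) -> Set[str]:
    """Whole-word lookup in the content index."""
    return set(index.get(term.lower(), set()))


def spotlight_query(term: str, store: MetadataStore, catalog: Catalog,
                    index: FullTextIndex) -> List[str]:
    """Ask all three sources and return the union; results are always files."""
    hits = (search_metadata_store(store, term)
            | search_catalog(catalog, term)
            | search_full_text(index, term))
    return sorted(hits)
```

Note that every result is a file path, which matches the report's point that files are the only objects a query returns.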

Both metadata collection and full-text indexing depend on cooperating per-file-format Importers, either written by Apple or by third parties. Like Google, no matter how much text an Importer provides, Spotlight only cares about the first 100K of raw text. Importers are fired on every file the moment it is created, saved, changed, or moved, including when files are made available through a newly mounted drive. Performance is said to be excellent in every case except network-mounted home directories, which are bedeviling on several levels and on which they’re still working.
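For a feel of how per-format Importers and the 100K cap might fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the registry keyed by file extension, the `register_importer` and `import_file` names, and the reading of "100K" as 102,400 characters. The real Importer interface is not described in the report.

```python
import os
from typing import Callable, Dict, Tuple

# Assumption: "100K" is read here as 100 * 1024 characters of raw text.
MAX_TEXT = 100 * 1024

# An importer takes raw file bytes and returns (metadata, raw text).
Importer = Callable[[bytes], Tuple[Dict[str, str], str]]

# Hypothetical registry: one importer per file extension.
IMPORTERS: Dict[str, Importer] = {}


def register_importer(extension: str, importer: Importer) -> None:
    """Register an importer for a file extension such as '.txt'."""
    IMPORTERS[extension.lower()] = importer


def plain_text_importer(data: bytes) -> Tuple[Dict[str, str], str]:
    """Toy importer: the whole file is the text, plus one metadata field."""
    return {"kind": "Plain text"}, data.decode("utf-8", errors="replace")


def import_file(path: str, data: bytes) -> Tuple[Dict[str, str], str]:
    """Run the matching importer and keep only the first MAX_TEXT characters."""
    extension = os.path.splitext(path)[1].lower()
    importer = IMPORTERS.get(extension)
    if importer is None:
        # No cooperating importer for this format: nothing to index.
        return {}, ""
    metadata, text = importer(data)
    return metadata, text[:MAX_TEXT]
```

In this sketch the cap is applied by the caller rather than by each importer, so an importer that hands back a whole novel still contributes only the first 100K.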
It’s through the default set of Importers that Spotlight is able to index and search format-specific metadata, such as the ID3 tags in MP3 files. What’s cool about this architecture is that Spotlight’s indexes will thus stay up-to-date automatically. All you need to do is save, move, or copy a file, and Spotlight’s metadata and content indexes will note the changes on-the-fly. Compare and contrast to the full-content file searching previously provided via Sherlock, which required periodic monolithic re-indexing of the content of your drives.
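The difference between on-the-fly updates and Sherlock-style periodic re-indexing can be sketched in a few lines of Python. This is a toy inverted index under my own assumptions (whitespace tokenizing, hypothetical `on_file_saved`, `on_file_moved`, and `on_file_removed` hooks), not a description of how Spotlight or SearchKit is actually built.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ContentIndex:
    """Toy inverted index that is updated one file at a time."""

    def __init__(self) -> None:
        self.words_by_path: Dict[str, Set[str]] = {}
        self.paths_by_word: Dict[str, Set[str]] = defaultdict(set)

    def on_file_saved(self, path: str, text: str) -> None:
        """Re-index a single file; the rest of the index is untouched."""
        self.on_file_removed(path)
        words = set(text.lower().split())
        self.words_by_path[path] = words
        for word in words:
            self.paths_by_word[word].add(path)

    def on_file_removed(self, path: str) -> None:
        """Drop every entry that points at this file."""
        for word in self.words_by_path.pop(path, set()):
            paths = self.paths_by_word[word]
            paths.discard(path)
            if not paths:
                del self.paths_by_word[word]

    def on_file_moved(self, old_path: str, new_path: str) -> None:
        """Repoint existing entries without re-reading the file's content."""
        words = self.words_by_path.pop(old_path, None)
        if words is None:
            return
        for word in words:
            self.paths_by_word[word].discard(old_path)
            self.paths_by_word[word].add(new_path)
        self.words_by_path[new_path] = words

    def search(self, word: str) -> List[str]:
        """Return the paths whose text contains the given word."""
        return sorted(self.paths_by_word.get(word.lower(), set()))
```

Each hook does work proportional to one file, which is the whole appeal: nothing ever has to walk the entire drive again just because one document changed.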