Building Better Search

October 23, 2012

Have you ever searched for ‘Buddhism’ only to find records that contain exactly ‘Buddhism’ and not ‘Buddhist’, ‘Buddha’, or related concepts like ‘Maitreya’ or ‘Bodhisattva’? Frustrating, no?


Like many museums’ online collections, the Penn Museum’s first online collection site (launched in January) worked like the previous example: it matched the terms a user searched for against the terms used in a catalog record. This type of search works quite well when either:

  1. All records are fully described, using the same terms that a user is likely to search for (e.g. using both ‘Buddhism’ and ‘Buddhist’)
  2. Users know how the collection is cataloged and can align their searches to accommodate our terminology (e.g. knowing to search for ‘Maitreya’ and not ‘Budai’)

Unfortunately, these conditions are almost never true. Some of our catalog records are very complete, with vivid, detailed descriptions, but much of the collection is minimally cataloged. Nor can users be expected to know offhand how our 330,000 object records have been described over the last 125 years. Over the last six months we have worked to exploit existing data to improve our online search without re-cataloging all 330,000 records to meet those two conditions.


Since the 1980s, a core component of the Penn Museum’s various collections management systems has been a hierarchical controlled vocabulary. In Questor Systems’ Argus it was called the Lexicon; in KE Software’s EMu it is called the Thesaurus. In both cases, it is a set of terms organized into a hierarchy that facilitates object cataloging and searching within the collections management system. The Penn Museum’s thesaurus contains approximately 67,000 terms and controls the data entered in fields like Object Name, Provenience, Material, Culture, Technique, Maker, Culture Area, Subject, and Function. The content and structure of the thesaurus allow curators, collections managers and museum staff to catalog an object with a Provenience of ‘Cincinnati’ and then find that object by searching for ‘United States’ or ‘Ohio’ or ‘Porkopolis’ because of the hierarchical relationship between the terms.
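To make the ‘Cincinnati’ example concrete, here is a minimal Python sketch of how a hierarchy like this can be walked. The dictionaries BROADER and ALTERNATES and the function search_terms_for are hypothetical names made up for illustration; they are not how EMu stores or queries its thesaurus, and treating ‘Porkopolis’ as an alternate name for ‘Cincinnati’ is an assumption.

```python
# Hypothetical in-memory thesaurus: each term points to its broader (parent) term.
BROADER = {
    "Cincinnati": "Ohio",
    "Ohio": "United States",
    "United States": None,
}

# Hypothetical alternate names attached to a term (assumed, for illustration).
ALTERNATES = {"Cincinnati": ["Porkopolis"]}


def search_terms_for(term):
    """Return every term that should find an object cataloged with `term`."""
    matches = []
    current = term
    while current is not None:
        matches.append(current)
        matches.extend(ALTERNATES.get(current, []))
        current = BROADER.get(current)  # step up to the broader term
    return matches


print(search_terms_for("Cincinnati"))
# ['Cincinnati', 'Porkopolis', 'Ohio', 'United States']
```

An object cataloged only as ‘Cincinnati’ is reachable from any term in that list, which is the behavior staff rely on inside the collections management system.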

How terms are organized in the thesaurus

Over the last twenty years, this structure has become ingrained in how museum staff catalog objects. The resulting reliance on discipline-specific terms and limited object-level cataloging (why enter ‘United States, Ohio, Cincinnati’ at the object level when you can enter ‘Cincinnati’ and let the thesaurus work for you?) presented huge barriers to online discovery, because traditional online search requires that all metadata exist at the item level. We quickly discovered that users were unable to find objects they knew we had because their queries didn’t match the object-level metadata. However, we found that many of their search terms did exist in the thesaurus.

What if we use the content and structure of the thesaurus to improve the quality of our search engine?

After experimenting with Apache Solr, we found that it is possible to use the thesaurus and Solr to replicate the functionality of the collections management system in online searches (searching for “United States” will now find objects that are cataloged as “Cincinnati”).


One of my favorite things about EMu is that it provides a set of APIs. Using the API, we are able to export the thesaurus content and structure into Solr and then create two text files that are used by the Solr SynonymFilterFactory to index catalog records and expand searches.

The first text file (index.txt) is used to analyze and index catalog records.  Each row in this text file contains a term from the thesaurus and its primary key in the thesaurus table.

Qing Dynasty=>68250

If a catalog record contains the term ‘Qing Dynasty’, Solr associates the value ‘68250’ with the record in addition to the text value ‘Qing Dynasty’.
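As a rough illustration, rows in this format could be produced from exported thesaurus data along the following lines. This is a sketch only: the function index_line and the list thesaurus_rows are hypothetical, and the export step through the EMu API is not shown.

```python
def index_line(term, key):
    """Format one index.txt row: the term, then '=>', then its primary key."""
    return f"{term}=>{key}"


# Hypothetical export: (term, primary key) pairs taken from the examples in this post.
thesaurus_rows = [("Qing Dynasty", 68250), ("Qin Dynasty", 68528)]

for term, key in thesaurus_rows:
    print(index_line(term, key))
# Qing Dynasty=>68250
# Qin Dynasty=>68528
```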

The second text file (query.txt) is used by Solr to expand searches. Each row in this file contains a term (Qing Dynasty), any alternate spellings (Ch’ing Dynasty, 大清), the broader term (Chinese Dynasty) and the primary key for the term (68250).

Qing Dynasty,Ch’ing Dynasty,大清, Chinese Dynasty=>68250
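A row like this could be assembled from a thesaurus record roughly as follows. The function query_line and its parameters are hypothetical names used for illustration, and the sketch joins names with a bare comma rather than reproducing the space that appears before ‘Chinese Dynasty’ above.

```python
def query_line(term, alternates, broader, key):
    """Format one query.txt row: term, alternate spellings, broader term => key."""
    names = [term, *alternates]
    if broader:  # top-level terms have no broader term
        names.append(broader)
    return f"{','.join(names)}=>{key}"


print(query_line("Qing Dynasty", ["Ch’ing Dynasty", "大清"], "Chinese Dynasty", 68250))
# Qing Dynasty,Ch’ing Dynasty,大清,Chinese Dynasty=>68250
```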

When someone searches the online collection for “Chinese Dynasty”, their search term is looked up in query.txt. Each time Solr finds ‘Chinese Dynasty’ on the left side of the => operator, it uses the value on the right side of the => as a search term. So if the query.txt file looked like this:

Qing Dynasty,Ch’ing Dynasty,大清, Chinese Dynasty=>68250
Qin Dynasty,Chinese Dynasty=>68528
Shang Dynasty,Chinese Dynasty=>68524
Han Dynasty,汉朝, Han Ch’ao,Chinese Dynasty=>68503

Then when a user searches for “Chinese Dynasty” in the Period field, Solr replaces the term with the primary keys from every matching row (68250, 68528, 68524, and 68503 in this example) and returns all records that use the term “Chinese Dynasty” or any of its narrower terms in the Period field.
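The expansion step can be imitated outside Solr with a few lines of Python. This is a sketch of the lookup logic only, not of Solr’s SynonymFilterFactory, which operates on token streams inside an analyzer; load_mappings and expand are hypothetical helper names, and the case-insensitive matching is an assumption.

```python
QUERY_TXT = """\
Qing Dynasty,Ch’ing Dynasty,大清, Chinese Dynasty=>68250
Qin Dynasty,Chinese Dynasty=>68528
Shang Dynasty,Chinese Dynasty=>68524
Han Dynasty,汉朝, Han Ch’ao,Chinese Dynasty=>68503
"""


def load_mappings(text):
    """Map each left-hand name (lowercased) to the keys of the rows it appears in."""
    mappings = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue  # skip blank or malformed rows
        left, right = line.split("=>", 1)
        for name in left.split(","):
            mappings.setdefault(name.strip().lower(), []).append(right.strip())
    return mappings


def expand(term, mappings):
    """Return the primary keys a search term expands to (empty list if none)."""
    return mappings.get(term.strip().lower(), [])


mappings = load_mappings(QUERY_TXT)
print(expand("Chinese Dynasty", mappings))  # ['68250', '68528', '68524', '68503']
print(expand("大清", mappings))              # ['68250']
```

Because the broader term appears on the left side of every narrower term’s row, one search term fans out to the keys of all its narrower terms.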


Whether this kind of search is useful to the public, or whether users even recognize that it is happening, is still an open question, but it was worth trying. Our staff have certainly found it quite useful, since they run these types of queries all the time within EMu. This is a work in progress and there are improvements that we are planning to make, but this was a large step toward programmatically improving resource discovery without re-cataloging the entire collection.

Sample searches

“West Asia” “Armament T&E”

“Greek god” vessel – Note that none of the records contain the terms ‘Vessel’ or ‘Greek god’ but do contain representations of both

“Behavioral Control Device” – See Chenhall’s Nomenclature

“North America” basket -woven – Baskets from “North America” that are NOT woven (the minus (-) is the NOT operator and can be used to exclude items from your query)