University of Pennsylvania Museum of Archaeology and Anthropology

Automation in action

February 17, 2010


Digitization is iterative. There was a time when every institution with a desktop scanner and a work-study student scanned every item it could lay its hands on, like kids in a candy store, and threw the results up onto the web without thinking much about the user experience. Now that we think about these projects in terms of scalability, sustainability and user consumption, we have a lot of legacy data to deal with. And since we’re migrating our images to our new collections management system, it’s time to deal with it.

Every image in the archives has a unique image number — they start from one and can go to infinity. Lower numbers tend to be older images — glass plate negatives of varying sizes, nitrate negatives (since destroyed), etc. When the archives first started scanning images, we started with our notable photographers collection, and instead of naming digital surrogates with their image numbers, we assigned a different set of numbers — “DD numbers.”

Needless to say, it’s never a good idea to have an extra identifier when the original will do. We have a gazillion image files that are named in a way that means nothing to us anymore: that DD number I mentioned. It would make a lot more sense if these image files were identified by their image numbers. Since I simply didn’t have the coding chops to rename that many files, I asked Scott if he could help out. He’s here to tell you about the cool, geeky stuff he did.


Hi all! I’m the Database Administrator for the museum’s collections management system, and as part of the museum’s ‘digital spine’ I get to work on a lot of really interesting projects with the collections and archives staff. One of the many initiatives in the archives is to assign a unique, non-descriptive identifier to each image and rename the file to that identifier, referred to as the archives image number. One component of the archives image migration project is renaming about 7,000 “DD numbered” images to their archives image numbers.

Working with Maureen, I was able to script this process, and what would have taken hundreds of hours now takes about 3 minutes (thanks, automation!). While we still haven’t solved the problem of data integrity, this is a big step in the right direction.

I haven’t had the opportunity to do much programming lately, so I took this chance to brush the dust off and write a little code. So what did I do? I wrote a program in Java that reads a tab-delimited export of the archives image database, parses that data and does some validation. Then it reads a specified directory on the archives image server, takes each filename and removes the file extension to get the DD number (e.g. DD2004-14563.tiff parses to DD2004-14563). With these two lists (the archives image database and the file directory list), it searches the archives image database for each DD number found in the server directory, and whenever it finds a match it renames the DD file to the archives image number.
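The steps above can be sketched roughly as follows. This is a minimal illustration, not the program that was actually run: the class and method names are hypothetical, the export is assumed to hold the archives image number in the first column and the DD number in the second, and skipping a file when the target name already exists is an assumed safety choice.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names and column layout are assumptions.
final class DdRenameSketch {

    // "DD2004-14563.tiff" -> "DD2004-14563"; names without an extension are returned unchanged.
    static String stripExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    // "DD2004-14563.tiff" -> ".tiff"; empty string when there is no extension.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(dot) : "";
    }

    // Parse tab-delimited lines (assumed layout: imageNumber <TAB> ddNumber)
    // into a DD number -> archives image number lookup. Incomplete rows are skipped.
    static Map<String, String> parseExport(List<String> lines) {
        Map<String, String> ddToImage = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t", -1);
            if (cols.length < 2) continue;
            String imageNumber = cols[0].trim();
            String ddNumber = cols[1].trim();
            if (imageNumber.isEmpty() || ddNumber.isEmpty()) continue;
            ddToImage.putIfAbsent(ddNumber, imageNumber); // keep the first mapping seen
        }
        return ddToImage;
    }

    // Rename every file whose DD number has a match; returns the number of files renamed.
    static int renameAll(Path dir, Map<String, String> ddToImage) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            // Collect first so the directory is not modified while it is being listed.
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        int renamed = 0;
        for (Path file : files) {
            String name = file.getFileName().toString();
            String imageNumber = ddToImage.get(stripExtension(name));
            if (imageNumber == null) continue;           // no archives image number for this DD number
            Path target = file.resolveSibling(imageNumber + extensionOf(name));
            if (Files.exists(target)) continue;          // assumption: never overwrite an existing file
            Files.move(file, target);
            renamed++;
        }
        return renamed;
    }
}
```

Keeping the original extension and refusing to overwrite are conservative defaults for a one-off migration; a dry run that only prints the planned renames would be a natural first step before touching live data.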

How does this work?

I will run this on the live data at the end of the week, but based on sampling and test runs we should rename approximately 4,000 files (65%) and also know which DD numbers don’t have an archives image number, which DD numbers are assigned to multiple archives image numbers and which archives image numbers are assigned to a single DD number!  Not bad for 3 minutes of work.
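The exception lists mentioned above fall out of the same lookup data. Here is a hedged sketch of how they could be produced, assuming each database row has already been split into an {archives image number, DD number} pair; the class and method names are hypothetical and this is not the original program.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; the row layout {imageNumber, ddNumber} is an assumption.
final class DdReportSketch {

    // Group the values in valueCol under the key in keyCol.
    // index(rows, 1, 0) gives DD number -> image numbers;
    // index(rows, 0, 1) gives image number -> DD numbers.
    static Map<String, Set<String>> index(List<String[]> rows, int keyCol, int valueCol) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[keyCol], k -> new TreeSet<>()).add(row[valueCol]);
        }
        return grouped;
    }

    // DD numbers found on disk that have no archives image number in the database.
    static List<String> withoutMatch(Collection<String> ddOnDisk, Map<String, Set<String>> byDd) {
        List<String> missing = new ArrayList<>();
        for (String dd : ddOnDisk) {
            if (!byDd.containsKey(dd)) missing.add(dd);
        }
        return missing;
    }

    // Keys that map to more than one value, e.g. a DD number with several image numbers.
    static List<String> withSeveralValues(Map<String, Set<String>> grouped) {
        List<String> ambiguous = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            if (entry.getValue().size() > 1) ambiguous.add(entry.getKey());
        }
        return ambiguous;
    }
}
```

Calling `withSeveralValues` on the index built in the other direction reports archives image numbers that are shared by more than one DD number, which is the kind of data-integrity follow-up these lists are meant to feed.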

  • Scott Williams

    Quick update: After adding additional blocks to handle some other naming conventions, I ran the program today and successfully renamed 5,614 of 6,894 files (81%!). Big thanks to Maureen for all the help with this project.
