Working With Legacy Data

Working with the data from Intute presented us with a few fairly knotty problems when it came to re-purposing it for the ARCH project. If I hadn’t spent some years on the Intute project myself, working with the data at the database table level, I would perhaps have been hard pushed to make much sense of it. Intute had the task of unifying half a dozen rather disparate ways of collecting, classifying, and cataloguing data across its range of subjects, and this had to be achieved within an impossibly tight time-frame. The people who would be working with the new unified database were going to be largely the same people who had worked with the individual subject data at the different subject hubs, and their existing users were going to have certain expectations about the way the data was classified and presented. Given all this it was, perhaps, inevitable that a pragmatic, somewhat piecemeal approach was taken to building up the database structure. The original subject hubs often had a requirement for data fields which none of the others were likely to use, sometimes in a one-to-many relationship to the specific record. This led to a profusion of columns and whole tables which had no relevance to the majority of the records but were added to maintain existing practices at the subject hubs. Rather than being an opportunity to rethink and rebuild an internet resource catalogue system from the ground up, Intute became a sprawling and unwieldy hybrid of its component parts.

Even though, for the purposes of ARCH, I was only really interested in three of the 20+ tables which made up the Intute DB, I was still faced with the problem of the cryptic, one-directional way in which two of them, the record and classification tables, were linked. The three tables were: the record table, in which the majority of the usable and interesting data was stored; the classification table, where the subject taxonomies were stored and used to generate the web site browse structure; and the record_admin table, which contains certain metadata about the record itself (as opposed to the resource the record points at), such as who catalogued it, on which date, and its status as live or otherwise on the public site. The record table had some fields where multiple values were stored in the field itself as semicolon-separated values, while others held a reference to an id field in another table. It was rather a struggle to separate all this out and to hive the arts and humanities records (16,000+) off from the other subjects (150,000+ records). Having done this, and saved the results into a new table (ahrecord), I then tried to work out how the id numbers in the classification field in the record table related to the actual rows of the classification table. This proved to be rather less than trivial, given that there was no way in SQL to actually join the two tables.
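
To illustrate the kind of untangling involved, here is a minimal Python sketch, not the code actually used on ARCH. It assumes, purely for illustration, that the classification field in ahrecord holds semicolon-separated id numbers; the link table record_classification, the column names and the sample rows are all made up. The idea is simply to split the packed field into one row per record/classification pair, after which an ordinary SQL join becomes possible.

```python
import sqlite3


def split_ids(field):
    """Split a semicolon-separated field into a list of integer ids.

    Empty fields, stray whitespace and non-numeric fragments are skipped.
    """
    if not field:
        return []
    return [int(part) for part in field.split(";") if part.strip().isdigit()]


def build_link_rows(records):
    """Turn (record_id, packed_field) pairs into (record_id, classification_id) rows."""
    rows = []
    for record_id, field in records:
        for class_id in split_ids(field):
            rows.append((record_id, class_id))
    return rows


def demo():
    """Build a tiny in-memory database and join records to their headings."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, cut-down versions of the tables described above.
    conn.execute("CREATE TABLE ahrecord (id INTEGER PRIMARY KEY, title TEXT, classification TEXT)")
    conn.execute("CREATE TABLE classification (id INTEGER PRIMARY KEY, heading TEXT)")
    # Hypothetical link table: one row per record/classification pair.
    conn.execute("CREATE TABLE record_classification (record_id INTEGER, classification_id INTEGER)")

    conn.executemany(
        "INSERT INTO ahrecord VALUES (?, ?, ?)",
        [(1, "Example resource A", "10;20"), (2, "Example resource B", "20; ")],
    )
    conn.executemany(
        "INSERT INTO classification VALUES (?, ?)",
        [(10, "History"), (20, "Philosophy")],
    )

    # Unpack the semicolon-separated field into the link table.
    records = conn.execute("SELECT id, classification FROM ahrecord").fetchall()
    conn.executemany(
        "INSERT INTO record_classification VALUES (?, ?)", build_link_rows(records)
    )

    # With the link table in place, a plain join works.
    return conn.execute(
        """
        SELECT r.title, c.heading
        FROM ahrecord r
        JOIN record_classification rc ON rc.record_id = r.id
        JOIN classification c ON c.id = rc.classification_id
        ORDER BY r.id, c.id
        """
    ).fetchall()


if __name__ == "__main__":
    for title, heading in demo():
        print(title, "->", heading)
```

Keeping the unpacked pairs in a separate link table, rather than rewriting the original field, means the legacy data stays untouched while the new structure is being checked.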

At this point no use is being made of the record_admin table, which contains metadata for the record itself, as opposed to the resource. In a complete system this table, or one derived from it, would be used to track community contributions and take care of the ranking system and associated privileges. In the original Intute database, the record_admin table was linked, via an id field, to the editors table, which contained personal and contact information for the cataloguers who originally created these records. For reasons of data-protection policy, and lacking the resources to chase up and obtain permission from these legacy cataloguers, it was decided not to use this data and instead to mark all records as originating from a generic ‘Intute Staff’ user. It should be noted that, if ARCH gained registered users, it would leave a similar legacy problem in its data for any future projects.

A quick survey of the translated and transferred data in the new ARCH system suggests that a great many of the URLs in the Intute data are no longer working. An immediate task before making the site fully public would be to find and flag these dead links and put them into a pool available for new registrants to work with. The basic task of the lowest ARCH rank would be tracking down these lost resources and either editing in the revised URL or flagging the resource as no longer available. This would give new users a ready path of progression to a higher rank, gaining a number of badges en route.
