Data Modeling for the Digital Humanist
This is a post aimed at people who need to deal with organizing, investigating, saving and retrieving information, but who have little or no knowledge and experience of how to do this effectively.
A data oriented Digital Humanities project is often started in a spreadsheet. More and more columns are added as new options are found – for example a Location column was created in a table of manuscripts but suddenly it is not enough as the manuscript may have moved between cities. The quick solution to this will be to enter a list of values separated by some delimiter (location1; location2; …). An alternative may be to split Location into Location1 and Location2, and later Location3 and 4 are added and may still not be enough. Because location name spelling may have changed through time, a city will appear in the various columns in several variants – and don’t let us forget spelling mistakes. A single table becomes even more gnarly when people can appear as authors of a work but also as collaborators (Colab1, Colba2 etc.). Some columns will require an accuracy description, such as dates, and some an indication of certainty in the assertions represented by that column’s data.
Major problems with this approach arise because a large part of the data is repeated, and needs to be repeated accurately or it will not be available for aggregation and summation. Maintenance and changes to the data are complex and unreliable as changes to a data point need to be performed with a search and replace operation across all columns, or even worse, in all lists where the data may occur. More specific changes are even harder to make reliably – in the manuscript locations example above, let’s assume the order of locations in the list represents the order in which the manuscript traveled between locations, so to change the order will require editing the list. Additionally, complex relationships between entities are difficult to represent and even harder to investigate and retrieve.
The other question that presents itself quickly is where will the data live, who will be able to look at it, query and report on it and who and how will be able to maintain and update it. Although google sheets and even excel workbooks can be shared over the web, they are clearly insufficient in terms of presentation, access control and complex data structure capabilities. The required solution is a database web application. A future post will present a complete, robust and almost free-to-run framework for such projects.
The database component in a database web application can take several forms. A relational database is usually selected for data structures similar to those outlined above. The term SQL (Structured Query Language) which is the name of a general-purpose language used to deal with relational data is often used for this approach too. Common relational database systems include MySQL, Postgres, SQLServer as well as others.
The next post will describe the relational data modeling solutions for some of the challenges outlined above.
