404 Music Not Found: What's the Problem with Music Datasets?

Presented at the University of East Anglia, School of Music research seminar series, 19 October 2009

1 Introduction

I'll begin with a question: is musicology a data-poor discipline? What is a data-poor discipline? What's a data-rich discipline for that matter? Why am I interested in the data-richness (or poverty) of musicology?

2 Astrophysics example

First let's examine a classic example of a data-rich discipline.

In his contribution to a collected volume on applied computer science, Neil deGrasse Tyson describes what seems to be a very rosy state of affairs in astrophysics in the US. According to Tyson, publication output in astrophysics doubled between 1986 and 2002, as compared with the whole period from 1895 to 1985. Tyson also reveals what he believes to be the secret to this success. Every ten years or so the US astrophysics community comes together to thrash out what its research programme should include for the next ten years. This plan of work covers what new technologies they will require, what observations will need to be made, and how the labour will be divided amongst the numerous observatories and university physics departments. They produce a document detailing this research programme and present it to Congress. As a result, US astrophysicists are, on the whole, pretty good at getting their research funded. With instruments such as the Hubble Space Telescope and the Very Large Array, they have the hardware capacity to make pretty comprehensive observations of our visible universe.

And this is exactly their aim. Tyson describes how researchers in his lab now have the computing capacity to model systems from clusters of hundreds of thousands of stars up to galaxies and galactic clusters. But what they also have, as a result of their unified front on funding, is standards for observations, for gathering astrophysical data, and one massive, distributed database of these observations. Astrophysicists are no longer competing with each other for time on telescopes; anyone with the right software and a decent internet connection now has access to large portions of the universe on their desktop. In short, astrophysics is a data-rich discipline.

3 e-Science

This success story was an early example of a movement ongoing in universities in Europe, North America, Australia, and elsewhere, variously called e-Science, e-Research, or (even worse) cyberinfrastructure. e-Science is generally described as consisting of three major components: the sharing of computational resources, distributed access to massive datasets, and the use of digital platforms for collaboration and communication, all with the aim of enhancing existing research methods or of making new ones possible.

However, Paul Wouters describes an alternative view of e-Science as three new modes of knowledge creation: computational discovery, comparative research, and digital library browsing. In computational discovery, some of the responsibility for hypothesis formation is given to the computer; tools such as expert systems have demonstrated some limited success in this area. The system is given facts and rules pertaining to a limited knowledge domain and is capable of inferring new facts using those rules.
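To make the expert-system idea concrete, here is a minimal forward-chaining sketch in Python; the facts, rules, and toy musical domain are my own illustrative assumptions, not an example from Wouters.

    # A minimal forward-chaining inference sketch: seed the system with
    # facts and if-then rules for a small domain; it derives new facts
    # by applying rules until nothing new can be inferred.
    facts = {("mode", "dorian"), ("final", "D")}

    # Each rule pairs a set of required facts with the new fact it licenses.
    rules = [
        ({("mode", "dorian"), ("final", "D")}, ("classification", "mode 1")),
        ({("classification", "mode 1")}, ("ambitus", "D-d")),
    ]

    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True

    print(facts)  # now includes the two inferred facts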

In the second mode, comparative research, computers are employed in making numerous comparisons between artefacts in a dataset, creating new knowledge by identifying patterns in the data. In the third, digital library browsing, it's the human agents (if you like) who are responsible for creating new knowledge; the digital library allows them to see items together which in physical libraries would be impossibly far apart.

All these activities have in common the fact that they rely on having access to data. For scientists, data is usually an unproblematic notion. Whether it's astronomical observations, chemical analyses, meteorological observations, or demographic surveys, scientists tend to be pretty good at agreeing what information should look like, what information is valuable, and what information is necessary. Data richness, then, tends to be an unproblematic notion in the sciences.

4 Humanities computing example

But what about the humanities? Is there such a thing as data-rich humanistic research? What would it mean to treat the objects of humanistic study as data?

Stephen Ramsay argues that in fact humanists have been doing this for centuries (before the humanities was a discipline; in fact, before there were even such things as disciplines): "Whenever humanists have amassed enough information to make retrieval (or comprehensive understanding) cumbersome, technologies of whatever epoch have sought to put forth ideas about how to represent that information in some more tractable form." Dictionaries, concordances, and catalogues are examples of just such technologies. It's not possible for one person to comprehend a whole language, literary corpus, or museum collection, but it is possible for one person to make effective use of a dictionary, a concordance, or a catalogue.

But what of digital datasets for humanities research? The project widely regarded to be the first humanities computing project was that of Father Roberto Busa who, in 1949, set out to produce a computerised concordance of the works of Thomas Aquinas. He collaborated with Thomas J. Watson of IBM and set to work producing punched cards of Aquinas' writings, a project that was actually even more difficult than it sounds. As well as the laborious nature of committing data to punched cards and checking it, they had to deal with some serious deficiencies of the technology. In the days before Unicode it's understandable that a number of characters appearing in Aquinas' writings wouldn't be catered for on IBM hardware, but Father Busa didn't even have the ability to distinguish between capital and lowercase characters. The solution he adopted was to precede true-capital letters with an asterisk!
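Only the asterisk convention itself comes from the Busa story; as an illustration of how such a scheme might be decoded today, one could write something like the following (the function name and sample text are hypothetical):

    def decode_busa_caps(text: str) -> str:
        """Expand the asterisk convention: '*s' marks a true capital 'S'."""
        out, i = [], 0
        while i < len(text):
            if text[i] == "*" and i + 1 < len(text):
                out.append(text[i + 1].upper())  # marked letter: true capital
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    print(decode_busa_caps("*summa *theologiae"))  # -> Summa Theologiae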

This work set the tone of humanities computing for the next three decades: numerous computerised concordances were produced, and software such as COCOA was developed for dealing with them. Punched cards became magnetic tape, which still presented serious problems for random access. Problems such as character sets and the slow turnaround of batch processing were largely solved by the introduction of disk storage and of the graphical user interface with Apple's first Macintosh computer, and, indeed, by the whole idea of a desktop computer, which removed the necessity to book time on the university's mainframe.

Humanities computing began to make its presence felt as a unified discipline with the inauguration of journals such as Computers and the Humanities in 1966 edited by Joseph Raben, and the establishment of centres such as the Centre for Literary and Linguistic Computing in Cambridge in 1963. The Oxford Text Archive was established in 1976, reflecting the new concern in humanities computing with producing digital critical editions of texts, rather than just concordances.

Another big development for this text work was the introduction of the World Wide Web. Although originally viewed sceptically by some members of the discipline (HTML was seen as a pretty poor relation for markup semantics compared to the then dominant SGML), the Web provided a ready solution to user interface and dissemination concerns that many in the discipline were beginning to feel quite keenly.

It's important to note one last feature of humanities computing projects: the projects are conceived by humanists and the majority of the work is carried out by humanists. At the least, they learn enough technology to be able to mark up their texts, and often they learn enough to be able to publish their Web sites. But whatever the level of technology employed, the projects are executed by humanist scholars working in humanities departments; they are very much humanities projects.

Unfortunately, this is about the state of the art in humanities computing. Numerous projects get funded to produce critical editions of various corpora, which get published on the Web, but then the work stops. What's not happening is the publication of any new findings as a result of having access to all this data, or the establishment of new, data-rich research methodologies.

Examples include the British Women Writers project, the Blake Archive, and NINES.

Something is not happening in humanities computing which does happen in the sciences. Scientists are actually using their datasets to make new predictions or to prove or disprove hypotheses. Humanistic datasets, however, seem to be lacking in use cases and in applications.

One possible explanation for this disparity of data use between the humanities and the sciences is that many scientists taking advantage of advanced computing techniques are doing so because the theoretical advances in their fields require such processing power and storage capacity in order to be tested experimentally. In the humanities the relationship is the other way round: the discipline is in a position where it can react to such processing power and storage capacity in order to make new theoretical advances, but is so far failing to do so.

So we seem to have established a sort of hierarchy of data-richness in current research. At the top, scientists are able to collaborate on producing data, and are able to generate interesting results from it. Literary scholars are able to produce data, but seem to lack a research programme that uses that data. So what's the situation for musicology? Where are the datasets of musicology? What would go into a musical dataset? Who would want one? And what sort of research would a musical dataset enhance or enable?

5 Music as data

To begin to consider musical datasets, one important notion has to be established: that of music as data. The ability, or willingness, or perhaps act of faith which allows one to accept this notion depends quite a lot on one's disciplinary sensibilities. I'll consider the points of view of three different disciplines on the music-as-data idea: those of music psychologists, computer scientists (in which I include audio engineers), and musicologists.

5.1 Music psychology

Music psychology takes as its driving research question that of how the human mind is able to understand musical stimuli. This is certainly quite a difficult challenge and poses numerous methodological problems, including how a scholar actually goes about doing experiments on a mind. One technique is to use human subjects in perceptual experiments and ask them to report the workings of their own mind, as far as they understand them. Was this pitch higher or lower than that one? Was this rhythm the same as or different to that one? Obviously, experiments of this kind must take into account the potential inaccuracies of subjects' self reporting.

Another method of the music psychologist begins with the assumption that a mind is located within a brain, and that by studying the brain, one may be able to gain insights into the working of the mind. But how do you study a brain? One historically common method was to study brains that had gone wrong, in the hope that comparing the performance of an abnormal brain with a normal brain would tell you something about the functioning of each. More recently, scanning techniques developed for clinical purposes have allowed psychologists to see what the brain is doing in certain situations, including experiencing musical stimuli.

But the final and most interesting method of studying minds (at least for our present purposes) is the technique of modelling cognition computationally. Part of the motivation for this is the criticism of behavioural psychology that it doesn't address the processes of cognition, and that minds in the wild are subject to so many variables that controlling for just one in behavioural experiments is near impossible. Instead, proponents of computational cognitive modelling attempt to embody the processes of cognition in algorithms which can be executed on computers and the results compared with the performance of human subjects. The extent to which these models simulate the processes of cognition varies: from so-called product theories or black-box theories, where the model only produces the right output but the means by which it does so are not intended to replicate the cognitive process, to pure process theories, which actually attempt to replicate the cognitive process. A key problem for cognitive modelling is the lack of evidence about cognitive processes against which to test these models.

An example of cognitive modelling for music perception is the work of Geraint Wiggins and Marcus Pearce at Goldsmiths. They implemented an information-theoretic model of melodic expectation which proved to be quite successful at replicating human behaviour. It was this model that Geraint demonstrated in this seminar in 2007.
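Pearce and Wiggins' model is far more sophisticated than this, but the underlying idea (score each note by its information content, that is, its negative log probability under a statistical model of melody) can be sketched with a simple bigram model; the toy corpus and add-one smoothing below are my own illustrative assumptions, not details of their model.

    import math
    from collections import Counter, defaultdict

    # Toy training corpus of melodies as MIDI pitch numbers (hypothetical data).
    corpus = [
        [60, 62, 64, 65, 67, 65, 64, 62, 60],
        [60, 64, 67, 64, 60],
    ]

    # Count bigram transitions: pitch -> next pitch.
    transitions = defaultdict(Counter)
    for melody in corpus:
        for prev, nxt in zip(melody, melody[1:]):
            transitions[prev][nxt] += 1

    def information_content(prev, nxt, alphabet=128):
        """-log2 P(next | prev), with add-one smoothing over the pitch alphabet."""
        counts = transitions[prev]
        prob = (counts[nxt] + 1) / (sum(counts.values()) + alphabet)
        return -math.log2(prob)

    # A continuation the model has seen carries less surprisal than one it hasn't.
    print(information_content(60, 62))  # seen in the corpus: lower surprisal
    print(information_content(60, 61))  # unseen chromatic step: higher surprisal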

Whether or not you agree with the principle that computers can model cognition, what's important here is that there is a group of scholars who require a form of musical data which they can use as virtual stimuli to feed into their cognitive models. And that the object of study of these scholars is the cognitive process which allows humans to deal with sound stimuli at a level they can call music.

5.2 Computer science

Now let's look at the computer scientists' view of musical data. The particular group we're focusing on are those engaged in a very active programme of research called music information retrieval. This field is concerned with designing and implementing techniques for taking raw digital sound data and extracting structural, or possibly meaningful, information from it. The basic principle is that digital sound data is just a stream of numbers to which statistical analyses implemented in software may be applied in order to extract features. One of the most common applications is testing to what extent computers may make judgements of musical similarity. This is a purely syntactical process in which features are extracted from a database of music and then from a query fragment. The features of the query fragment are tested against the features of the fragments in the database, and those with the best matches are considered similar. If this idea isn't jarring enough, practitioners in this field also often work against so-called ground truth data: in order to test the performance of your algorithm, you compare its output against subjectively correct answers generated by a human authority.
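As a sketch of that purely syntactical matching (my own toy illustration, not any particular MIR system), one might extract a crude feature vector, say a normalised pitch-class histogram, from every item and rank the database by distance to the query:

    import math

    def pitch_class_histogram(pitches):
        """A crude feature vector: normalised counts of the 12 pitch classes."""
        hist = [0.0] * 12
        for p in pitches:
            hist[p % 12] += 1
        total = sum(hist)
        return [h / total for h in hist]

    def distance(a, b):
        """Euclidean distance between two feature vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Hypothetical database of melodic fragments as MIDI pitch numbers.
    database = {
        "fragment A": [60, 62, 64, 65, 67],
        "fragment B": [60, 63, 66, 68, 70],
    }
    query = [62, 64, 65, 67, 69]

    # Rank the database by feature distance to the query; the nearest
    # items are judged 'similar' purely on these extracted statistics.
    qf = pitch_class_histogram(query)
    for name, pitches in sorted(
            database.items(),
            key=lambda item: distance(pitch_class_histogram(item[1]), qf)):
        print(name)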

Like cognitive modelling, there's plenty to disagree with here, but what's important is that these scholars are using music as data. MIR is a discipline which, like many sciences, assigns a lot of value to weight of evidence. So the more music you can test your algorithm against, the better the case you can make for its accuracy or utility. Finally, in music information retrieval, the objects of study are exactly those algorithms for dealing with musical signal or musical symbols as information.

5.3 Musicology

So do musicologists engage in any practices which treat music as data? Perhaps music analysis is a candidate? Certainly on the face of it, it would seem that the close reading of a musical score requires a conception of musical notation as information. This process, however, has so far proved impossible to replicate algorithmically. There is something about the close reading of a musical score which requires much more than just the notational data printed on the page. Music analysts bring with them a vast quantity of tacit cultural knowledge which allows them to gain insight into the music they are examining. Further, the result of an analytical study of a musical work is rarely a description of the information content of that work. More often it's an attempt to describe the processes that the composer may have applied in order to arrive at the finished work, or to describe how a work fits into a cultural context.

What about musical philology: the creation of critical editions of works, composers' outputs, or manuscripts? This practice tends to treat the sources as a kind of data and attempts to present that data in a comprehensible manner to the future user. And in fact, the creation of digital critical editions is exactly one of the application areas that has enjoyed quite considerable interest in recent years.

Another quite new area of research which has seen a small amount of attention is the analysis of performance through recorded music, particularly by the Centre for the History and Analysis of Recorded Music (CHARM) based at Royal Holloway. They created a database of numerous recordings of Chopin mazurkas made throughout the twentieth century, extracted data such as performance tempi, and used it to attempt to draw conclusions such as mapping schools of influence in performance practice.
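CHARM's extraction methods were more painstaking than any push-button process; purely as an illustration of deriving a rough global tempo estimate from a recording, a sketch using the librosa audio analysis library might look like this (the file name is hypothetical):

    # A rough sketch of estimating tempo and beat times from a recording;
    # an illustration only, not CHARM's actual method.
    import librosa

    y, sr = librosa.load("mazurka_recording.wav")  # hypothetical file
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)

    print("estimated global tempo (BPM):", tempo)
    print("first few beat times (s):", beat_times[:5])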

These three data-oriented views of musicological practice (music analysis, musical philology, and performance analysis) all share one important epistemological property: in none of them is the musical work as an information entity a primary object of study. Performance analysis takes performance as its object of study; musical philology takes the notated sources as its objects of study; and music analysis is never considered a valid exercise if all it achieves is to describe a musical work. In each case, music is process, and that process requires a context. It's these which form the object of study.

This seems to parallel an unspoken problem in humanities computing, which I held up as a model for computational musicology earlier: just as the object of musical study isn't exactly musical works themselves, the object of literary study isn't exactly works of literature. Neither a musicologist nor a literary scholar can ever find much of interest to say if they concern themselves solely with the musical or literary work; it's only by examining context and process that the study of literature or music becomes interesting and worthwhile. We cannot find a programme of research which uses computers on literary data for the same reason that we cannot find a programme of research for computational musicology: the techniques, being essentially syntactical, actually require that the scholar's object of study be the literary or musical work itself. This simply isn't the case.

6 Music datasets

So where does this leave the practice of creating music datasets? Can there be a culturally and scholarly valid effort to gather representations of musical works together in one place?

We've established that no scholarship genuinely has the musical work as its object of study; rather, the process and context of musical practice represent a more credible object of study. However, there is one other object of study, which we touched on earlier, and which falls within the sphere of attention of the musicologist: that of musical source materials. These are the things which go towards building your critical edition, and while they include fragments of notation, they are by no means limited to such expressions. They include any kind of score, sketch, or script that contributes to the process of a musical work; they include fragments of sound captured on tape or other media which provide evidence of the musical process; and they include evidence of the context of musical works, such as programme notes and commissioning letters. What can these materials contribute to the data-richness of musicology?

7 Purcell Plus

This has become the research question of the project that I'm currently employed on. The Purcell Plus project was funded under a joint call by the AHRC, EPSRC and JISC on the subject of e-Science in the arts and humanities. It set out to investigate what an e-Science methodology for musicology might be like. Given that e-Science may be described as bringing technology to bear on the data richness of research practices, we first needed to establish what our musicological data might be. In line with the notion of musical source materials as a valid object of study, we defined three domains of source materials for Purcell Plus: performances captured as audio data; musical notation; and textual commentaries, including programme notes, record sleeves, and analytical essays.

We then set out to build a small proof-of-concept dataset. For this, we chose to deal with Henry Purcell's Fantazias and In Nomines, a corpus of sixteen works for viols representing a late example of a then quite archaic English instrumental style. We prepared editions of the works from one of the two complete surviving sources, the British Library manuscript Add. MS 30930, a manuscript in Purcell's hand in which the Fantazias are copied from the reverse. We encoded two of the three published editions of the works: the Peter Warlock edition of 1927, and the Thurston Dart edition (revised by Michael Tilmouth) of 1990. We've also collected around 30 complete recordings of the works, dating back as far as 1927. Our textual commentaries currently consist of several book chapters discussing the works, and several journal articles which deal with the sources of the works. In fact, the preparation of textual commentaries for inclusion in the database has proved to be the most time-consuming element.

Having amassed all these materials, we then designed a database schema to describe them and, importantly, the relationships between them and their parts. The kinds of entities this database deals with include musical works, literary works, performances, recordings, manuscripts, published scores, etc. For each of these classes of entity, we also needed to specify a syntax and semantics for referring to their parts. This has allowed us to encode in our database things such as where a literary work makes reference to a musical work, or to a recording or performance. As a result, we have a richly interlinked collection of musical materials pertaining to this small corpus of works.
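I won't reproduce our actual schema here, but the general shape just described (entity tables plus a link table recording that a part of one item refers to a part of another) can be sketched as follows; all table and column names below are hypothetical illustrations:

    # A sketch of the general shape of such a schema, using SQLite from
    # Python; not the actual Purcell Plus schema.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE entity (
        id    INTEGER PRIMARY KEY,
        kind  TEXT NOT NULL,  -- 'musical_work', 'literary_work', 'performance',
                              -- 'recording', 'manuscript', 'published_score', ...
        label TEXT NOT NULL
    );

    -- A directed link: (part of) the source refers to (part of) the target.
    -- The 'part' columns hold a domain-specific locator, e.g. bar numbers
    -- for notation or a time offset for a recording.
    CREATE TABLE reference (
        source_id   INTEGER REFERENCES entity(id),
        source_part TEXT,
        target_id   INTEGER REFERENCES entity(id),
        target_part TEXT
    );
    """)

    # e.g. a book chapter passage discussing the opening bars of a fantazia
    conn.execute("INSERT INTO entity VALUES (1, 'musical_work', 'Fantazia no. 7')")
    conn.execute("INSERT INTO entity VALUES (2, 'literary_work', 'book chapter')")
    conn.execute("INSERT INTO reference VALUES (2, 'pp. 12-14', 1, 'bars 1-8')")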

With this resource available as a proof of concept, we are now beginning to consult the scholarly community on what a research programme that addresses it as an object of study might include.

8 Conclusion

One last question I'll raise is that asked at the "What to do with a Million Books" colloquium held in Chicago in 2006. If we have access to a large dataset of musical materials, how do we treat it as an object of study? What questions can we ask of it? And how? It's important that, as technology begins to make this feasible, musicologists continue to apply the kind of thoughtful self-reflection they do to their own discipline to this new interdisciplinary ground. If not, there's a very real risk that our interaction with these resources may be defined and constrained by the technical interests of computer scientists and the commercial interests of the music industry. Musicologists have the opportunity to push these developments in the direction of novel and exciting ways of reading and experiencing musical source materials which could have a lasting impact on the research agenda of the discipline well into the future.

9 References

Ramsay, Stephen (2004). Databases. In S. Schreibman, R. Siemens and J. Unsworth (eds.) A Companion to Digital Humanities. Oxford: Blackwell: 177–197.

Tyson, Neil deGrasse (2002). Science's Endless Golden Age. In P. J. Denning (ed.) The Invisible Future: The Seamless Integration of Technology in Everyday Life. McGraw-Hill: 1–14.

Wouters, Paul (2006). What is the matter with e-Science? – thinking aloud about informatisation in knowledge creation. The Pantaneto Forum 23 (July 2006). http://www.pantaneto.co.uk/issue23/wouters.htm

Wiggins, Geraint A. (2007). Models of Musical Similarity. Musicae Scientiae, Discussion Forum 4a: 315–338.

Author: Richard Lewis

Date: 2009-10-22 15:20:18 BST
