Wrap-up on Late-Breaking Session: (Deep) Feature Learning


In the late-breaking track of ISMIR 2012, we held a 30-minute session on learning features from audio music data, with a special focus on unsupervised learning and deep learning. With between 20 and 30 participants, we took the largest room with a video projector. About half of the participants had already worked with unsupervised learning; the others were new to this topic. The session was moderated by Jan Schlüter and Eric Humphrey.

Course of Discussion

Jan tried to start the discussion with a one-minute demo on how unsupervised learning found reliable audio features for speech and music detection, but encountered a technical problem.
In the meantime, Eric attempted to get a sense of what the group hoped to accomplish in the session. After some expected---and admittedly unproductive---banter between the resident "deep learners," a lively discussion emerged, sparked by a question about unsupervised training. The sentiment was expressed that it seems a bit like magic that a system might automatically learn anything on its own, and this was addressed by multiple (complementary or sometimes contrary) responses from the more experienced participants, relating unsupervised learning to manifold learning, density estimation, or data compression.

It was at this point that the tone of the session started to take shape: participants unfamiliar with or unconvinced by deep learning (but otherwise self-confident) began asking questions about concepts they didn't understand or naming specific doubts regarding the viability of these methods. For better or worse, however, these questions were mainly answered by a select few. Regardless, some other topics that came up during this stretch focused on steering one architecture toward different applications from unsupervised data, the influence of supervised fine-tuning, ground-truth data requirements, and the difference between types of supervision during training. The rationale behind autoencoders and the idea of an intrinsic probability density of "real" data were also discussed briefly.
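To make the autoencoder rationale concrete for readers new to it, here is a minimal sketch (not shown in the session; the toy data and all names are illustrative): a linear autoencoder in plain NumPy that compresses inputs through a narrow bottleneck and reconstructs them, which forces it to discover the low-dimensional structure of the data without any labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 3-D that actually lie near a 1-D line,
# so a bottleneck of size 1 can represent them well.
t = rng.standard_normal((200, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.standard_normal((200, 3))

# Encoder/decoder weights (linear autoencoder, bottleneck size 1).
W_enc = rng.standard_normal((3, 1)) * 0.1
W_dec = rng.standard_normal((1, 3)) * 0.1

def loss(X, W_enc, W_dec):
    R = X @ W_enc @ W_dec          # encode, then decode
    return np.mean((X - R) ** 2)   # reconstruction error

initial = loss(X, W_enc, W_dec)
lr = 0.01
for _ in range(500):
    H = X @ W_enc                  # hidden code (the learned feature)
    R = H @ W_dec                  # reconstruction
    E = R - X                      # reconstruction error
    # Gradients of the squared error w.r.t. the weights
    # (up to a constant factor, absorbed into the learning rate).
    g_dec = H.T @ E / len(X)
    g_enc = X.T @ (E @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

final = loss(X, W_enc, W_dec)
print(initial, final)  # reconstruction error drops as structure is learned
```

The point of the sketch: no labels are involved, yet minimizing reconstruction error through the bottleneck makes the hidden code track the dominant direction of variation in the data, which is the "unsupervised feature learning" idea discussed above.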

About 20 minutes into the session, Jan brought up the demo again, with the technical problems fixed, to illustrate some of the topics that were covered. It proved particularly helpful to simply show a diagram of a multi-layer architecture, as this initiated some discussion around model selection. The conversation then focused on the architectural design of "deep networks": how one might go about crafting one, what rationale factors into these decisions, and, more generally, the seemingly guesswork nature of neural networks. After a few responses from various folks in the room, Eric hopped on the projector and showed a diagram of tempo estimation as a three-layer network. He offered the notion that domain knowledge already steers our architectural decisions in manually designed MIR systems, so it could also steer our choice of networks in deep learning. In contrast to manually designed systems, however, (deep) feature learning spares us from having to find good parameters (e.g., filter coefficients) for a chosen architecture.

After 30 minutes, J.-J. Aucouturier promptly ended the session by ringing a bell. Most of the participants left the room, and a small group of 5-6 people (including the moderators) stood in a circle and continued for a while. At the end, Geoffroy Peeters asked whether there was a toolbox for Deep Learning, and we pointed him to deeplearning.net.


The session generated a lot of interest. Although initially proposed as an opportunity for discussion among experienced deep learning practitioners, the considerable number of inexperienced participants turned it into a Q&A session on feature learning. Direct feedback after the session showed that it got people interested in deep learning who did not know about it or were even skeptical about it. At least two participants would have liked to see a more technical introduction on how to apply deep learning methods to their tasks, but this was outside the scope of this short session. Even so, given the technical nature of the topic, it probably would have been beneficial to start with a 2-3 minute, high-level review of the main concepts with illustrations, to quickly put everyone on at least a basic foundation. This is probably good practice for all late-breaking sessions that assume or require some kind of prerequisite knowledge, as opposed to a topic that might be more accessible, e.g., Teaching MIR.

Overall, participation was good, but perhaps not ideal for a barcamp session. As conversation progressed, it became clear that there were two camps of individuals content to voice their thoughts in the forum: those experienced in deep learning and those skeptical of it. These two groups comprised maybe 40% of the total number of attendees, so most didn't actually contribute to the conversation. At least one participant suggested having a tutorial on Deep Learning at next year's ISMIR, which might make sense given what seems to be a predisposition toward a Q&A format between experts and interested individuals.


Late-breaking sessions have the potential to lay the foundation for a tutorial: they serve as a test-bed for possible tutorial topics, and they also bring together possible lecturers.