Archive FM

Data Skeptic

deepjazz

Duration:
29m
Broadcast on:
29 Apr 2016
Audio Format:
other

Deepjazz is a project from Ji-Sung Kim, a computer science student at Princeton University. Built with Theano, Keras, music21, and Evan Chow's project jazzml, it creates original jazz compositions using recurrent neural networks trained on Pat Metheny's "And Then I Knew". You can hear some of deepjazz's original compositions on SoundCloud.

[music] Data Skeptic features interviews with experts on topics related to data science, all through the eye of scientific skepticism. [music] Ji-Sung Kim is a sophomore at Princeton, with an interest in data science, machine learning, and deep learning. This summer he'll be working as a data scientist at Merck. His interests extend beyond that into fashion, cooking, and mental health, and presumably also music, because the project I invited him here to talk about today is something called Deep Jazz. Ji-Sung, welcome to Data Skeptic. Hey, how are you? Doing very well. Thanks for coming on. Yeah, yeah, no problem. Excited to share more about Deep Jazz. Excellent. Yeah, well, let's start there. What is Deep Jazz? Deep Jazz is a prototype for deep-learning-driven jazz generation. In other words, it's a limited AI, specialized to generating jazz based on existing music. From what I understand, you created Deep Jazz as part of HackPrinceton. Can you tell me about the event? HackPrinceton is a hackathon. It's an event where people are challenged to build something interesting within a very short period of time. Hackathons began in college communities and are a pretty big scene for engineers and application-oriented CS students. I wanted to showcase something cool with machine learning, so I gave it a go. HackPrinceton was my first official hackathon, and I ended up building Deep Jazz within a period of thirty-six hours there. I thought maybe it'd be a good time in the show to drop in some background music produced by Deep Jazz. Do you have a particular favorite of the compositions that are there online? I think 128 is pretty good. Epoch 128, yeah. Do you see any distinction between what you're creating and what a jazz musician does when they improvise? Yeah, actually, the way the human brain works is really quite similar to how a neural network works, and the basis of Deep Jazz is a neural network. There's a theory in psychology called statistical learning, where people learn language from the probabilities of the next words, given some input. Deep Jazz does much the same: given an input sentence of notes, basically a musical sequence that we treat as a kind of textual sentence, it predicts the next note based on the probabilities that the model generates, given some sort of input. The great thing about Deep Jazz is that it uses an LSTM, which is a kind of recurrent neural network, which actually has the ability to have some sort of temporal memory. That's quite well suited to the task of trying to generate jazz, because the next note obviously depends on the context, namely the previous notes that precede it. Interesting. Could we delve a little bit more into an LSTM, maybe for someone that has limited exposure to neural networks? What does that look like as a black box? What are its inputs and outputs? One of the biggest purposes of machine learning models is to generate predictions, or in other words, to predict a class based on some new input. As a black box, you're given an input, and what the model spits out is the probabilities of the next notes, in the case of Deep Jazz. So, for example, if I feed in the notes, maybe A, B, C, and Deep Jazz has seen a lot of scales, then it may predict that the most likely next note will be E. An LSTM is a type of neural network called a recurrent neural network. Recurrent neural networks, or RNNs as they're called, are neural networks that have loops in them, allowing them to have temporal behavior. 
Namely, they can reason about previous events, because they loop back: the decision from a previous step can influence the next. As a result, RNNs can have inputs from multiple sources, not just the present, but also the past. We often like to describe this by saying that RNNs have memory. This is pretty fascinating, and it makes RNNs very well suited to tasks which involve order and sequence, which is the case for generating music. From what I understand, when the Baidu team was working on speech recognition, one of their major breakthroughs was when they switched to RNNs for exactly those same reasons. Did you have a sense going into the project that this was certainly the way to go for music, or did you have to try a few things to arrive at that? I looked at this task of generating music as a modified natural language processing problem, and the classic natural language processing problem is text generation, given some sort of corpus. So I had the idea that we treat the musical sequence from some sort of music file as, instead, a corpus of music. We treat each note as a value, which has its place in a sequence of similar values. And using the classical NLP problem as a prototype, I applied the idea of using LSTMs to music, and LSTMs have been really popular for NLP applications due to the exact reasons that I stated before. It just made sense: with the inspiration that we can treat music as a musical corpus, it follows that we can treat the problem as a text generation problem, where instead of words or characters, we instead have notes. Yeah, it makes sense to me that you'd leverage methods learned in the NLP community to describe music. Anyone who studies computer science and knows some music will pretty quickly recognize music as having a grammar. So yeah, if you have an algorithm that's good at modeling human language, those same methodologies seem applicable to modeling music as well. Can you go into a bit more detail about how that works, from a perspective a little closer to the algorithmic level of abstraction? Given some sort of sequence, you predict the next note, right? The model will give you probabilities for what it predicts may be the next note. And from there, we can do a weighted random selection of the next note. And as we add notes based on this input sequence, we shift the reading frame over by one every time we generate a note. So, for example, if the input sequence is A, B, C, and we generate D, then we shift the reading frame over, and now the new input is going to be B, C, D, and then if we generate E, it's going to be C, D, E. With a starting seed sequence that we put into the model, we can thereby generate a completely novel sequence of notes using this idea of note-wise generation based on the previous sequence. Ah, that's interesting. So, I'd imagine you could fine-tune the window that Deep Jazz is working with. Was that a part of the project? To be honest, not for this hackathon build, because I had about 36 hours to work on it, including sleep time, eating time, homework time. Sure. I actually only had maybe like 12 hours to build it, so I just kind of guesstimated and tried different settings really casually to determine the length of the reading frame. In the future, optimizing that is definitely on the books. 
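To make that note-wise generation concrete, here is a minimal sketch of the loop described above. It is not the actual deepjazz code: the `model` object, its `predict` method, and the integer note vocabulary are assumptions for illustration, in the style of a Keras-like classifier that returns a probability distribution over the next token.

```python
import numpy as np

def generate_notes(model, seed, vocab_size, length=64):
    """Note-wise generation: predict a distribution over the next note,
    sample from it, then slide the reading frame forward by one."""
    window = list(seed)                      # e.g. indices for [A, B, C]
    generated = []
    for _ in range(length):
        # One-hot encode the current window: shape (1, window_len, vocab_size)
        x = np.zeros((1, len(window), vocab_size))
        for t, idx in enumerate(window):
            x[0, t, idx] = 1.0
        probs = model.predict(x)[0]                        # probabilities for the next note
        next_idx = np.random.choice(vocab_size, p=probs)   # weighted random selection
        generated.append(next_idx)
        window = window[1:] + [next_idx]                   # shift frame: A,B,C -> B,C,D
    return generated
```

Sampling from the distribution, rather than always taking the single most likely note, is what keeps the output from collapsing into a repetitive loop.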
And that probability distribution that it produces, have you happened to compare it to any of the standard guides musicians look at? Like, I know there are certain heuristics and rules for how you formulate a melody or resolve a chord. I'm wondering if maybe Deep Jazz figured some of those patterns out on its own, or if you've had time to look at it from that angle. It's actually a really interesting idea, and I think that would be a great side project for someone else to do: to really compare what Deep Jazz produces to what's accepted in music. I haven't had a chance to do that personally because of my lack of formal music theory background. But definitely that's an amazing idea, and I hope a listener with the know-how would be able to compare the outputs of Deep Jazz to what's accepted today in the field of music. You can hardly turn around these days without seeing some new breakthrough or interesting result coming out of the deep learning community, AlphaGo being one of the most recent ones. And AlphaGo is kind of interesting in that AlphaGo itself plays Go way better than any of the people who actually developed AlphaGo. So I'm curious: in developing a system that creates music, do you have a music background yourself, and do you feel that your creation is outperforming your own, perhaps, improvisational skills? Yeah, I used to play clarinet, actually, for about a period of seven years back in high school. And I played competitively in all-state band, regional band, etc. And one of the things I was the worst at was improvisation. Deep Jazz sounds pretty good to a casual listener. I remember reading a post on Reddit where someone commented saying that if they heard this on the radio, they wouldn't really mind it, and that they would just accept it as it is. Which is really fascinating and exciting to me: that a prototype that I built in 36 hours can pass off as human. And personally, I think that Deep Jazz would probably do a better job at improvising than I would. Well, what of your experience with music was helpful in building out this prototype? Yeah, just understanding the structure of music was really helpful. Now, I don't really have a formal background in music theory. It would have probably been optimal if I did for this project. But the idea of chords and notes and different rhythms was really helpful in terms of being able to comprehend the extraction step of it, because we needed to extract the musical corpus from the MIDI file. That brings me to another interesting point I wanted to discuss with you. I don't want to take for granted that any of my listeners necessarily know what MIDI is. Could you share a bit of the background on it and how it was useful to you in the project? You've probably heard of an MP3 file. An MP3 file is essentially an audio file that contains the audio in a form where the notes aren't explicitly represented. But a MIDI file is actually very detailed. It explicitly describes the notes, each note's length, and each note's pitch. In contrast, with a raw audio file, if you try to somehow parse it, it'd be extremely difficult because you have saxophones, drums, clarinets all playing at the same time without each part being explicitly specified. But a MIDI file is very organized. It has each note specified as well as each part. So as a result, it made the parsing task a lot easier. And actually, there are libraries to help with that. I used music21, which is a Python library developed at MIT for processing MIDI files and for working with music computationally in general. Interesting, yeah. 
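For a sense of what that extraction step can look like, here is a minimal music21 sketch, assuming a local MIDI file at a placeholder path. It illustrates pulling pitches and durations out of a parsed stream; it is not the actual deepjazz preprocessing.

```python
from music21 import converter, note, chord

# Parse a MIDI file into a music21 stream (the path is a placeholder)
score = converter.parse("and_then_i_knew.mid")

tokens = []
for element in score.flat.notes:      # notes and chords, in time order
    if isinstance(element, note.Note):
        # e.g. ("E4", 0.5): pitch name plus duration in quarter notes
        tokens.append((element.pitch.nameWithOctave, float(element.quarterLength)))
    elif isinstance(element, chord.Chord):
        # represent a chord by its joined pitch names
        names = ".".join(p.nameWithOctave for p in element.pitches)
        tokens.append((names, float(element.quarterLength)))

print(tokens[:10])
```

A token sequence like this is what gets treated as the "musical corpus" in the text-generation framing discussed earlier.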
Well, I've worked a fair amount with MIDI files, and while I like them a lot, they feel a little bit low-level to me sometimes. They're sort of the assembly language of music, in that it's really an event-driven system. It has this timer that is sort of an arbitrary clock, and you get a message that says this note went down with this velocity, and then you're informed later that that note went up. So it's not exactly formulated at the level of abstraction that a musician thinks about music. Did you have to transform that data at all, and is that a service that music21 provided for you? Yeah, so I'm not really sure about the exact low-level interpretation of how music21 gets the different parts of the MIDI file. And actually, most of the pre-processing code I adapted from my friend's project JazzML by Evan Chow. Fortunately, he's a very accomplished jazz pianist, so he had the musical knowledge to do so. Music21 does most of the legwork for you, and all you need to do is represent each note as a bigram or a trigram of, principally, its pitch and length, with some other information. Does Deep Jazz have some sense of key and timing, or is it just sort of playing to whatever native inputs it has? We don't explicitly pass to the underlying LSTM things about timing or rhythm. We give it the input sequence, the notes represented in textual form. At the later stages, after the model has been trained, we do specify the BPM for playback. In my experience, neural networks in general really require a pretty large training set to give you good, effective results. Could you talk about the corpus of MIDI files you used for your training? This is a prototype, so unfortunately I didn't have the time to gather a large corpus. So right now, at its current stage, it isn't a truly general model, unfortunately. But we have big plans in store for Deep Jazz and developing it further, and I'll talk about that later. So right now, Deep Jazz improvises and learns from a single file. There are limitations to that, right? So I actually discussed this problem with a couple of researchers in the field of music generation using AI. I got in touch with them through Deep Jazz, and some of them saw my project. Basically, the challenge is to balance stability and overfitting. Namely, you want your output to be stable, you want a good quality output, something that you can listen to and that sounds natural. But at the same time, you don't want your output to seem too similar to your input; namely, the problem of overfitting, where your model loses its ability to generalize because it's learned too much from the input file. And here, in this prototype that I built, I don't have a really good method for overcoming overfitting. That's a problem that persists across a lot of these attempts to use AI to generate music. In the future, I have some plans for solving this problem. Namely, we want to cluster the relevant genres in a large body of music, and we want to annotate these individual clusters with some sort of metadata tag. This has actually been done for text, and it would also suffice to train separate models for each genre, in order to maximize stability while providing enough variety so that you avoid the overfitting problem. But one solution that someone sent me, I think it was on Reddit, someone sent me this. 
It was basically to add metadata tags to different genres of music, and to use these metadata tags as part of the training data. So instead of having a number of different models, one for each genre, you have one large model that's able, based on the metadata tag, to learn different genres of music. From what I read in the material that this commenter sent me, this is much more efficient than training K different models; instead, one large model incorporates these K genres using metadata-tag annotation. Yeah, I could see that. I guess the way I'm making sense of it would be that if you include that as a feature, then there are certain genres that will make the same musical decisions. You know, that the one goes to the five, the five to the four, and some genres may or may not do that. But since there's consistency across most music, perhaps aggregating all the different genres, yet leaving that as a feature, would give a more robust set. Is that the basic idea? Yeah, yeah, basically. You'd mentioned that JazzML program. Could you tell us a bit more about that? JazzML was a project by Evan Chow. He's graduating this year from Princeton with a degree in economics, and he's actually working at Uber this upcoming fall. It has a similar idea, where it tries to use machine learning to generate jazz, but it's a little bit different. It's a lot less generative than the Deep Jazz framework that I implemented. It uses the same pre-processing step, which is to transform the data, the MIDI file, into a corpus of trigrams. It also uses k-means clustering and chord inference with SVMs. And what it does is, in a more limited way, construct these artificial sequences from the input MIDI file and then select from these possible sequences. Because it forces this structure where you have to select from a limited number of sequences, instead of generating music on the fly as Deep Jazz does, it's a lot more limited than what Deep Jazz presents. And that was one of the compelling reasons why I wanted to build Deep Jazz: to make a more truly generative model. Instead of just selecting from a number of sequences generated from the model, I wanted to use an LSTM, which has a notion of memory, and may learn, inside the black box, more about the training data. So you used a couple of other tools in the project. If I'm not mistaken, Keras and Theano are in there. Could you talk about the selection of these libraries and how they are part of Deep Jazz? Theano is right now one of the most popular libraries for deep learning research and also for general usage. And Keras provides a layer of abstraction which simplifies a lot of the processes that a pure Theano implementation would require. With these advantages in mind, I chose Keras and Theano for their simplicity and also for their power, in case I needed to do something more robust using Theano directly. But fortunately for this project, the Keras layer was sufficient for all of my implementation. 
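As a rough illustration of the kind of Keras model being described, here is a sketch of a stacked LSTM that outputs a probability distribution over the next note token. This is an assumption-laden example, not the actual deepjazz architecture: the window length, vocabulary size, layer widths, and dropout rate are all made up for illustration.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

window_len = 20   # length of the reading frame (assumed)
vocab_size = 80   # number of distinct note/chord tokens (assumed)

model = Sequential()
# First LSTM returns full sequences so a second LSTM can be stacked on top
model.add(LSTM(128, return_sequences=True, input_shape=(window_len, vocab_size)))
model.add(Dropout(0.2))   # dropout, one common way to push back on overfitting
model.add(LSTM(128))
model.add(Dropout(0.2))
# Softmax over the vocabulary gives the next-note probabilities sampled earlier
model.add(Dense(vocab_size, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

# model.fit(X, y, batch_size=64, epochs=128) would then train on one-hot
# windows X with their actual next notes y as labels.
```

Running Keras on the Theano backend, as discussed here, is a configuration choice; the model-building code itself looks the same either way.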
The success of most recurrent neural networks is, in my opinion, partially driven by the massive training sets they can use, even when they're effective on smaller sets. There's this famous paper by Banko and Brill out of Microsoft Research that talks about how their natural language processing algorithms got better not through code improvements, but through more corpora being available. I was curious to get your thoughts on how you think Deep Jazz would evolve if you provided it larger and larger corpora of MIDI files. There are a couple of different directions to try to take this model. One is to create a kind of universal music generator, a generalist model, which is able to make very far-reaching and broad interpretations of some sort of given input. And the way I want to take Deep Jazz now isn't so much to create an extremely general, universal music generator. Instead, my plan for the future is to modify Deep Jazz in a way that anyone with a MIDI file could interact with it and make an improvisation on a particular piece that they really like. Most of us have probably faced that problem where we really love a song and can't stop listening to it, but after a while it becomes dry and dull. One of the cool things about Deep Jazz is that it provides a framework where anyone could use their own MIDI file to generate an improvisation on a piece of music that they particularly like. For this goal, I think that limiting the corpus to a certain genre and artist would be more powerful than making a gigantic model which serves as a universal music generator. Yeah, we have big plans in mind for Deep Jazz, and I'm working with a couple of the people who've reached out to me. You should keep an eye, and definitely an ear, out for Deep Jazz in the future to see where it goes. For sure, will do. I'm curious about those big plans. Anything you're able to give us a hint or a tease about? The future of Deep Jazz is something that the public will be able to interact with. It's something that will dynamically generate music based on provided input, and I'm looking forward to how it evolves; we have some really smart people on the team. Do you think that the creations of Deep Jazz, the creations it's coming up with today, would pass a sort of Turing test for musicianship, or for composership? Maybe I should say more specifically what that means, in case people don't know the Turing test. The Turing test is the classic proposal of Alan Turing that if you could interact with a chatbot and not tell the difference between the human and the chatbot, then we acknowledge that bot as having some achievement in the direction of AI. Do you think Deep Jazz has a similar achievement in its ability to be indistinguishable from a human improviser? As I mentioned before, I actually read a comment on Reddit where someone said that the musicianship wasn't exactly spectacular, but that if they heard this casually on the radio they wouldn't really mind it, they wouldn't turn it off; they would just accept it as something a human would play. And to me, the fact that Deep Jazz is able to pass off as a normal musical piece, even anecdotally, is pretty amazing. It's something I built in less than two days, and the fact that it can pass off as human to some stranger on the internet is, to me, pretty fascinating. But on a more general scale, I think that if you told someone it may have been a robot, it would probably fail most of the time with the average person. But I'm sure it would pass at least some decent-sized proportion of the time. Deep Jazz has a lot of potential for growth. Right now, it's really specialized, really limited in its capabilities. But with more development and the future plans, Deep Jazz will be able to make something really amazing. 
Hopefully, in the next 30, 40, 50 years, we might have AI artists instead of just people generating music. And the idea that an AI would be able to generate music is pretty profound, considering that music is something we consider deeply human. There's something interesting here: I follow film scoring and composition a bit. I'm far from an expert, but there's a slight controversy here, maybe not a controversy, but something worth noting, which is that there's been an evolution in that area. Fifty years ago, a composer would sit down and write out a complete score for an orchestra, and then it would be recorded formally. These days, the software is so good that the composer is still typically writing most of the complete score, but a lot of the instruments are synthesized. They're in a computer, maybe supplemented with some actual musicians, or maybe not at all in certain cases. Do you think the next evolution might be that a composer is no longer writing notes, but just making suggestions and hints to a machine? Well, I think that human creativity is something that an AI will never quite be able to imitate within the next 50 years, I think. But definitely in the future, if this technology develops more, and to be honest, I'm not an expert on deep learning, machine learning, or data science, I'm just an aspiring data scientist in my second year of college, but with the right minds and with the right people, I believe that such a system, where you just input suggestions to an AI to generate some sort of complete piece, isn't too far off in the future. If we think about the growth and the acceleration of the development of artificial intelligence in the last 20 years or so, it's been crazy. We've all heard of AlphaGo, the AI that's been trained to beat world champions in the game of Go. That feat of beating a human world champion in Go is quite incredible. It was predicted in 1997, less than 20 years ago, that it would take 100 or more years for a computer to beat humans at Go. And yet we solved this very, very hard AI problem within less than 20 years of the estimated 100-year development time. That demonstrates the rapid pace of AI research nowadays. And I feel that, not too far in the future, we may have entire compositions being written based on a couple of inputs. In fact, you might not even have to be a musician to create your own composition. You just need a sufficiently powerful program, a sufficiently powerful computer, and a couple of notes in mind. A system, maybe Deep Jazz version 25, may be able to generate a complete piece based on a number of simple inputs that don't necessarily have to come from a musician. I'm curious to hear how you would frame Deep Jazz: as an unsupervised or a supervised learning problem? Because it sounds like a supervised problem, in that you're trying to predict the next note, so you have a labeled corpus. But ultimately the output is music, which I'm going to judge in this sort of ill-defined way, whether it's pleasing to my ear or not, which feels more like an unsupervised problem. Do you think those are useful labels at all for this conversation? Yeah, so the idea of unsupervised and supervised really applies at a lower level than the output you're looking at. Classically, unsupervised algorithms are more like clustering algorithms: you want to group together similar data points. 
With supervised algorithms, you try to fit some model, knowing the labels, in order to predict label information for unseen data. Deep Jazz as it is right now is more of a supervised approach, because in order to learn the training corpus, it has to compare its predictions to the actual next notes that are present in the data set. In terms of whether it's a supervised or unsupervised problem, it's hard to classify at the higher level of whether we judge the output to be human or whether we judge the musical quality to be good. It's really more of a lower-level designation of whether this task is a supervised or an unsupervised task. I think in the future, for Deep Jazz at least, when it trains on different corpora, it's going to have to compare the next notes that it predicts to the actual next notes in the training data. So it will, I think, for the near future, be a supervised learning problem. But framing how we judge the results as unsupervised or supervised isn't a good fit, because those terms really describe a lower level of the machine learning task. Yeah, it seems that future work could involve different stochastic methods for training, maybe transfer learning, or fine-tuning some activation functions. I don't know all the types of research that could be done, but ultimately you're getting closer to better fits for your input corpus. Do you think that'll end up resulting in later versions that sound more like they're plagiarizing than creating? What if you train it so well that it says, "Yeah, your thing's just ripping off Coltrane," or Wynton Marsalis? Is there a way to tune it for creativity? Oh, yeah, that's exactly the overfitting problem I described before. You have to balance overfitting and stability. Right now, with the current technology, it's very feasible, with enough epochs, basically enough repetitions of training cycles on the data, to produce an output that's very, very similar to the input, so that we don't get novel music, music that sounds different, that sounds original. The way I addressed this in Deep Jazz was to use a method called dropout. The current research being done on generating music with machine learning approaches is to use a very varied corpus in order to force the model to become more generalized. For example, if you train a model purely on, let's say, a particular Beethoven symphony, then of course that model will do very well in terms of generating music that sounds like a Beethoven piece. But if you then train the model again with, let's say, a pop rock piece, then the underlying model structure, which we can't really unpack because neural networks are really black boxes at this point, will be forced to become less specialized to the Beethoven piece and more generalized to other types of music. The idea of balancing stability and overfitting is actually a very big problem right now in research that uses machine learning to produce music. As for the future of Deep Jazz, I was hoping to solve it, as I mentioned before, by using the annotation method, where we basically attempt to have the model learn different genres of music based on the metadata tags, in order to preserve stability within a genre but also to create enough variety so that the output doesn't sound too much like one piece or another. 
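One simple way to picture that metadata-tag idea is to prepend a genre token to each training window, so a single model sees the genre as part of its input. The sketch below is a hypothetical illustration of that conditioning scheme, not anything from the deepjazz codebase; the tag names and token format are made up.

```python
# Hypothetical genre conditioning: one model, with the genre as an input token.
GENRES = ["<jazz>", "<classical>", "<pop_rock>"]

def tag_window(genre, window):
    """Prepend a genre token to a window of note tokens.

    The model's vocabulary then contains both genre tags and note tokens,
    so one network can learn several genres instead of training K models."""
    if genre not in GENRES:
        raise ValueError(f"unknown genre tag: {genre}")
    return [genre] + list(window)

# Example: tag_window("<jazz>", ["E4_0.5", "G4_0.5", "Bb4_1.0"])
# -> ["<jazz>", "E4_0.5", "G4_0.5", "Bb4_1.0"]
```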
Makes sense. It also seems to open the opportunity for genre blending and things like that, which actually makes me wonder: why Deep Jazz, why not Deep Ska or Deep Broadway for your first iteration? The piece that Deep Jazz was first built around is Pat Metheny's "And Then I Knew", and because it's a jazz piece, and because Deep Jazz has its roots in my friend's project, JazzML, Deep Jazz sounded like a pretty catchy name, so I went with it. I believe all your code's up there on GitHub, is that correct? Yep, you can download it today, try running it on a Metheny piece, or modify the code for another task. So maybe some more advanced listeners will fork that and do some interesting things with it, maybe send you a pull request even. Where can people follow you and Deep Jazz online? Yeah, so I have a personal website, jisungkim.com, and it'll be updated with my projects and what I'm up to, and, you know, what I'm doing over the summer in case someone wants to reach out and ask any questions. They can also contact me on LinkedIn. And Deep Jazz has its own website; deepjazz.io is its URL. Any updates or any other projects that stem out of Deep Jazz will definitely be posted there. I'll be sure to put deepjazz.io and your personal site in the show notes for anyone who wants to check that out. Well, Ji-Sung, thanks again so much for coming on the show. Yeah, no, thank you, and thanks for taking the time to talk to me about Deep Jazz. I really didn't expect Deep Jazz to blow up, you know, it was kind of a casual side project that I built over a weekend, but I think that the idea behind Deep Jazz is pretty powerful and pretty profound, and it's that an AI is able to do what we often thought would be impossible for machines to do, which is to generate art, and specifically music. And I just had a ton of fun building the project, and I would hope that, you know, other people, students, young people who are interested in jazz music, or people who might not have thought of CS as a career, really dive into the field of data science and machine learning, because that is the future. Data is already involved in everything that we do, from using Facebook, to our Gmail, to how we receive mail, to how we buy products. It's at the core of, you know, the core of technology today. We need a lot of smart people working on these problems that involve data. I had a ton of fun building Deep Jazz, and I'd hope that, you know, other people, listeners, students who are interested in the subject would try their hand at CS, you know, try a hackathon, learn some skills, see what you can build. It'd be great to have more people in the field of CS. I think that's a great suggestion. I hope maybe somebody will get inspired by that, and that they will be a future guest on the show with an equally interesting project. Yeah. Excellent. Well, Ji-Sung, thanks again. I look forward to this going live. I'm sure the audience is going to really enjoy it. Yeah, cool. Thanks so much, Kyle. Have a good one. And until next time, I want to remind all the listeners to keep thinking skeptically of and with data. For more on this episode, visit dataskeptic.com. If you enjoyed the show, please give us a review on iTunes or Stitcher. 
[music]