Hello, this is my final project. I want to create a program that generates music based on user-provided video input.
I am proposing a solo project built in Max/MSP or Web Audio (preferably Max/MSP, though Web Audio may be chosen depending on difficulties and time available) with the goal of generating an algorithmic composition from video input. I am interested in this topic because, in my experience, whenever I combined music and video for an edit I would spend more time conforming the raw video to the music than conforming the music to the video. So, I wanted to try developing a tool that could help with creating edits by generating music that "matches" the video.
At the very least, the project will be able to generate some sequence of sounds algorithmically from a given video. I would like the generated sound to "correlate" with the actions and events shown in the video. A stretch goal of mine would be to input multiple videos with audio to train the parameters, such that if another video were then input, the generated audio's "style" would be similar to that of the videos used for training.
For user interaction, I want the user to be able to preview their input, so I will add video and audio preview capabilities. Additionally, I think it would be better if the user could manipulate the video within the program rather than having to cut and edit in a third-party program, export, and then import the result. A bonus capability would be live video input, such as from a webcam or video capture card. I also want to add a menu of some sort so that the user can choose between multiple ways of generating the audio; it will list the algorithms, categorize them by algorithmic complexity, and provide some details on how they work (e.g. naive, uses Fourier, etc.). Depending on the algorithm, the generated audio could either play as it is being generated or be built up in a buffer that is played once generation finishes. Either way, the audio would play alongside the video, and there would be a way to save the generated audio.
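If I do end up on the Web Audio fallback, the "build a buffer, then play it with the video" path could look roughly like the sketch below. This is only a minimal sketch: the single sine tone stands in for the real generation algorithms, and names like renderGeneratedAudio are mine rather than anything that exists yet.

```typescript
// Minimal Web Audio sketch of the "render to a buffer, then play with the video" path.
// renderGeneratedAudio and its placeholder tone are hypothetical stand-ins for the
// real generation algorithms described above.

async function renderGeneratedAudio(durationSec: number): Promise<AudioBuffer> {
  // An OfflineAudioContext renders faster than real time into an AudioBuffer.
  const offline = new OfflineAudioContext(2, Math.ceil(durationSec * 44100), 44100);

  // Placeholder "generation": a single sine tone for the whole duration.
  const osc = offline.createOscillator();
  osc.frequency.value = 440;
  osc.connect(offline.destination);
  osc.start(0);
  osc.stop(durationSec);

  return offline.startRendering();
}

async function playWithVideo(video: HTMLVideoElement): Promise<void> {
  // Assumes the video's metadata is already loaded so video.duration is valid.
  const ctx = new AudioContext();
  const buffer = await renderGeneratedAudio(video.duration);

  // Play the rendered buffer and the video preview together.
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);

  video.currentTime = 0;
  await video.play();
  source.start();
}
```

One nice side effect of the offline render is that saving the generated audio becomes straightforward, since the finished AudioBuffer can be encoded to a file afterwards.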
I will define the deadlines of my intermediate steps on a weekly basis:

Week 1: Prototype the video and audio preview capabilities as well as the video editing capabilities. To prepare for week two, build some internal functionality for extracting video qualities.

Week 2: Develop a naive algorithm for sound generation (probably using MIDI notes and the global transport; see the sketch below) and begin experimenting with and researching the more complex topics for the harder algorithms, in particular audio effects and Fourier analysis. Part of developing the naive algorithm will also be fleshing out the "output" part of the UI.

Week 3: Finish the algorithms related to the previous week's research and start looking into implementing the learning-based algorithms. The learning algorithms are a stretch goal; by this week, most of the required functionality should be done.

Week 4: Finish implementing most of the functionality that was planned and found feasible.

Week 5: Finish polishing the UI and documentation.
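To make the week-two item concrete, here is roughly what I imagine the naive algorithm doing, sketched in Web Audio/TypeScript terms since a Max patch does not paste well into a post. The brightness-to-pentatonic mapping and the 250 ms pulse are illustrative assumptions, not a fixed design; in Max the pulse would come from the global transport rather than setInterval.

```typescript
// Naive mapping sketch: average frame brightness -> MIDI note, one note per "beat".
// The mapping (brightness to a pentatonic degree) is an assumption for illustration.

const SCALE = [60, 62, 64, 67, 69]; // C major pentatonic, as MIDI note numbers

function averageBrightness(video: HTMLVideoElement, canvas: HTMLCanvasElement): number {
  const ctx2d = canvas.getContext("2d")!;
  ctx2d.drawImage(video, 0, 0, canvas.width, canvas.height);
  const { data } = ctx2d.getImageData(0, 0, canvas.width, canvas.height);
  let sum = 0;
  for (let i = 0; i < data.length; i += 4) {
    sum += (data[i] + data[i + 1] + data[i + 2]) / 3; // mean of R, G, B
  }
  return sum / (data.length / 4) / 255; // normalized to 0..1
}

function midiToHz(note: number): number {
  return 440 * Math.pow(2, (note - 69) / 12);
}

function playNote(audio: AudioContext, note: number, when: number, dur = 0.2): void {
  const osc = audio.createOscillator();
  const gain = audio.createGain();
  osc.frequency.value = midiToHz(note);
  gain.gain.setValueAtTime(0.2, when);
  gain.gain.exponentialRampToValueAtTime(0.001, when + dur);
  osc.connect(gain).connect(audio.destination);
  osc.start(when);
  osc.stop(when + dur);
}

// Every 250 ms (a stand-in for Max's global transport), sample the current frame
// and trigger the scale degree chosen by its brightness.
function startNaiveGenerator(video: HTMLVideoElement, canvas: HTMLCanvasElement): void {
  const audio = new AudioContext(); // may need to be resumed after a user gesture
  setInterval(() => {
    const b = averageBrightness(video, canvas);
    const note = SCALE[Math.min(SCALE.length - 1, Math.floor(b * SCALE.length))];
    playNote(audio, note, audio.currentTime);
  }, 250);
}
```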
A big part of the project will simply be research, especially for the stretch goals. I will need to learn more about how to use the Fourier transform in Max/MSP and about the audio effects for the algorithms I will be making. Additionally, learning more about Jitter will be integral, as it makes up most of the video capabilities of the program. For the learning-based algorithm, I will need to look at previous posts and projects focused on deep/machine learning.
Fortunately, since Jitter is not a third-party library, the default resources for Max will be highly beneficial in learning it. Googling machine learning for Max/MSP leads me to a small video series on one person's approach to an implementation, and there are forum posts asking about the same topic. Looking through Max's package manager will also help me find third-party packages that could simplify the implementation or handle the more computationally heavy tasks more efficiently.
Currently, I am still primarily working on the research part of my project. I have a single patch with some of my tinkering, which has been video-related only so far. I have mostly been following along with the official Jitter documentation as well as the help patches for the cv.jit library, and the majority of the video manipulations in my patch are thanks to cv.jit.
So far, I have somewhat accomplished part of my first-week goal, which was to have video and audio preview capabilities; however, there are no editing capabilities yet. For now, I think I will push that goal to a later week, since most of the clips I will be tinkering with will have already been edited, which lets me focus more effectively on the algorithmic composition part. My second-week goal has barely been touched, as I am still looking at video. Later this week, I plan on looking more at cv.jit's feature extraction and "hooking up" the resulting data to my previous algorithmic composition project for a simple test run.
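Since the cv.jit side is Max-specific, here is the kind of "hook the extracted feature up to the composition" step I mean, again sketched in the Web Audio fallback. Frame differencing is a crude stand-in for cv.jit's feature extractors, and motionAmount is a hypothetical helper of mine; its output could drive any of the composition parameters, such as note density.

```typescript
// Crude "motion amount" feature via frame differencing, as a stand-in for
// cv.jit-style feature extraction. Returns a value in 0..1 that could be
// fed to the composition parameters (e.g. note density or filter cutoff).

let previousFrame: Uint8ClampedArray | null = null;

function motionAmount(video: HTMLVideoElement, canvas: HTMLCanvasElement): number {
  const ctx2d = canvas.getContext("2d")!;
  ctx2d.drawImage(video, 0, 0, canvas.width, canvas.height);
  const frame = ctx2d.getImageData(0, 0, canvas.width, canvas.height).data;

  let diff = 0;
  if (previousFrame) {
    for (let i = 0; i < frame.length; i += 4) {
      diff += Math.abs(frame[i] - previousFrame[i]); // compare the red channel only
    }
    diff /= (frame.length / 4) * 255; // normalize to 0..1
  }
  previousFrame = frame.slice();
  return diff;
}
```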
Other than that, I will be catching up on previous deadlines and beginning research on implementing learning methods, perhaps with the help of the ml.star package. The fourth- and fifth-week goals are still pretty much the same, plus moving the editing capabilities into one of those weeks. I am still somewhat interested in incorporating motion vectors from MPEG compression as one of the video features that affects the parameter functions for the algorithmic composition. I have been thinking of analyzing feature variation within the video input as a function of time and using calculus to derive other functions that could be assigned to the parameters of the algorithmic composition. Because of this idea, I am interested in pursuing the motion vectors due to the vector nature of the data. I will still have to figure out how to parse the data from the cv.jit objects effectively in order to apply this idea.
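In practice, the calculus idea would amount to something like a finite difference over a feature signal sampled in time. Here is a small sketch of that; the 16-sample window is an arbitrary assumption, and FeatureSignal is just a name I made up for illustration.

```typescript
// Sketch of the "feature variation as a function of time" idea: keep a short
// history of a feature's samples and estimate its time derivative by a
// backward finite difference. The derivative could then drive its own
// composition parameter (how fast the motion is changing, not just how much
// motion there is).

class FeatureSignal {
  private samples: { t: number; value: number }[] = [];

  push(t: number, value: number): void {
    this.samples.push({ t, value });
    if (this.samples.length > 16) this.samples.shift(); // keep a short window
  }

  // Backward finite difference over the last two samples.
  derivative(): number {
    const n = this.samples.length;
    if (n < 2) return 0;
    const a = this.samples[n - 2];
    const b = this.samples[n - 1];
    return (b.value - a.value) / (b.t - a.t || 1e-6); // guard against zero dt
  }
}
```

For example, the motionAmount value from the earlier sketch could be pushed into a FeatureSignal on every tick, with the raw value mapped to one parameter and its derivative mapped to another.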
Check out my current experimental project here. Currently, it has one RAW test video.
Check the final project out at https://drive.google.com/drive/folders/10UBt2-JgFB4xiqaftfw85WxABgfuU6Bs?usp=sharing.
Feel free to contact me at [email protected]!