A tool for AI artists, visual jockeys, synesthetes and psychonauts.
Want to make a deep music video? Wrap your mind around BigGAN. Developed at DeepMind by Brock et al. (2018)¹, BigGAN is a recent chapter in the brief history of generative adversarial networks (GANs). A GAN is an AI model built from two competing neural networks: a generator creates new images based on statistical patterns learned from a set of example images, and a discriminator tries to classify those images as real or fake. By training the generator to fool the discriminator, GANs learn to create realistic images.
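To make the generator-versus-discriminator game concrete, here is a toy training step in PyTorch. It is only a sketch of a generic GAN, with assumed placeholder names for the networks and optimizers, not BigGAN's actual training code.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_images, noise_dim=128):
    """One adversarial update, assuming the discriminator returns one logit per image."""
    batch_size = real_images.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Discriminator step: learn to label real images 1 and generated images 0.
    d_opt.zero_grad()
    fake_images = generator(torch.randn(batch_size, noise_dim)).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real_images), real_labels)
              + F.binary_cross_entropy_with_logits(discriminator(fake_images), fake_labels))
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator call freshly generated images "real".
    g_opt.zero_grad()
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(generator(torch.randn(batch_size, noise_dim))), real_labels)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```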
BigGAN is considered Big because it contains over 300 million parameters, trained on hundreds of Google TPUs at an estimated cost of $60,000. The result is an AI model that generates images from 1,128 input parameters:
i) a 1000-unit class vector of weights between 0 and 1, each corresponding to one of the 1000 ImageNet classes, or object categories.
ii) a 128-unit noise vector of values between -2 and 2 that control the visual features of the objects in the output image, like color, size, position and orientation.
A class vector of zeros except a one in the vase class outputs a vase:
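To conjure that vase yourself, here is a minimal sketch using the open-source pytorch-pretrained-biggan package, which wraps the released BigGAN weights. The index 883 is "vase" in the standard ImageNet ordering, and the truncation value is an arbitrary choice for illustration.

```python
import numpy as np
import torch
from pytorch_pretrained_biggan import BigGAN, truncated_noise_sample, save_as_images

model = BigGAN.from_pretrained('biggan-deep-512')

# 1000-unit class vector: all zeros except a one at the vase class.
class_vector = np.zeros((1, 1000), dtype=np.float32)
class_vector[0, 883] = 1.0

# 128-unit noise vector controlling color, size, position, orientation, etc.
noise_vector = truncated_noise_sample(truncation=0.4, batch_size=1)

with torch.no_grad():
    output = model(torch.from_numpy(noise_vector),
                   torch.from_numpy(class_vector),
                   truncation=0.4)

save_as_images(output, file_name='vase')
```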

Interpolating between classes without changing the noise vector reveals shared features in the latent space, like faces:
Interpolating between random vectors reveals deeper sorts of structure:
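Here is a rough sketch of that kind of interpolation with the same package; the two endpoint classes, the frame count, and the truncation value are arbitrary choices for illustration.

```python
import numpy as np
import torch
from pytorch_pretrained_biggan import (BigGAN, one_hot_from_int,
                                       truncated_noise_sample, save_as_images)

model = BigGAN.from_pretrained('biggan-deep-256')

# Two endpoints in latent space: (class, noise) pairs for vase and jellyfish.
class_a = one_hot_from_int([883], batch_size=1)   # vase
class_b = one_hot_from_int([107], batch_size=1)   # jellyfish
noise_a = truncated_noise_sample(truncation=0.4, batch_size=1)
noise_b = truncated_noise_sample(truncation=0.4, batch_size=1)

frames = []
for alpha in np.linspace(0, 1, 30):
    # Linearly blend both the class vector and the noise vector.
    class_vec = torch.from_numpy((1 - alpha) * class_a + alpha * class_b).float()
    noise_vec = torch.from_numpy((1 - alpha) * noise_a + alpha * noise_b).float()
    with torch.no_grad():
        frames.append(model(noise_vec, class_vec, truncation=0.4))

save_as_images(torch.cat(frames), file_name='interpolation')
```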
If you’re intrigued, join the expedition of artists, computer scientists and cryptozoologists on this strange frontier. Apps like Artbreeder provide simple interfaces for creating AI artwork, and while autonomous artificial artists loom, some users occupy themselves searching for the Mona Lisa.
Others have set BigGAN to music.
These “deep music videos” have garnered mixed reactions, ranging from beautiful to trippy to horrifying. To be fair, one is wise to fear what lurks in latent space…
What other unlikely chimeras, mythical creatures, priceless artworks and familiar dreams reside within BigGAN? To find out, we need to cover more ground. That’s why I built the deep music visualizer, an open source, easy-to-use tool for navigating the latent space with sound.
A latent spaceship, with bluetooth.
Take it for a spin and create some cool music videos along the way. Just make sure to share what you discover.
Tutorial: Using the Deep Music Visualizer
Clone the GitHub repository, msieg/deep-music-visualizer (github.com), and follow the installation instructions in the README.
Run this command in your terminal:
python visualize.py --song beethoven.mp3
That’s it. Here’s the output:
What’s going on here? The deep music visualizer syncs pitch with the class vector, and volume and tempo with the noise vector, so that pitch controls the objects, shapes, and textures in each frame, while volume and tempo control movement between frames. At each time point in the song, a chromagram of the twelve chromatic notes determines the weights (between 0 and 1) of up to twelve ImageNet classes in the class vector. Independently, the rate of change of the volume (mainly percussion) controls the rate of change of the noise vector.
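Here is a simplified sketch of that mapping using librosa; it illustrates the idea rather than reproducing the actual code in visualize.py, and the filename and normalization details are placeholders.

```python
import numpy as np
import librosa

y, sr = librosa.load('song.mp3')  # sr defaults to 22050 Hz

# Chromagram: energy of each of the 12 chromatic notes in every audio frame.
chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=512)        # shape (12, n_frames)
class_weights = chroma / (chroma.sum(axis=0, keepdims=True) + 1e-8)    # per-frame weights in [0, 1]

# Volume envelope; its rate of change (mostly percussion) drives the noise vector.
volume = librosa.feature.rms(y=y, hop_length=512)[0]
volume_change = np.abs(np.gradient(volume))

tempo_sensitivity = 0.25                 # corresponds to the --tempo_sensitivity flag
noise = np.random.randn(128)
noise_frames = []
for t in range(min(class_weights.shape[1], len(volume_change))):
    # Nudge the 128-unit noise vector in proportion to how fast the volume is changing.
    noise = np.clip(noise + tempo_sensitivity * volume_change[t] * np.random.randn(128), -2, 2)
    noise_frames.append(noise.copy())

# class_weights[:, t] and noise_frames[t] would then be fed to BigGAN to render frame t.
```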
Video customization
Resolution
- 128, 256, or 512
- Default: 512
BigGAN is big, and therefore slow. If you run the first example on your laptop, it will take roughly 7 hours to render. At a resolution of 128×128, it takes only about 25 minutes (per minute of video).
python visualize.py --song beethoven.mp3 --resolution 128
However, I recommend generating high-resolution videos by launching a virtual GPU on Google Cloud, which cuts the runtime from roughly 7 hours to a few minutes. It isn’t free, but Google awards new users $300 in credit, and a GPU costs about $1/hour.
Duration (seconds)
- Integer ≥ 1
- Default: Full length of audio
It can be useful to generate shorter videos to limit runtime while testing out some other input parameters.
Pitch sensitivity
- Range: 1–299
- Default: 220
The pitch sensitivity is the sensitivity of the class vector to changes in pitch. At higher pitch sensitivity, the shapes, textures and objects in the video change more rapidly and adhere more precisely to the notes in the music.
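As a mental model only (visualize.py may implement this differently), you can think of pitch sensitivity as the smoothing factor of an exponential moving average over the per-frame chroma weights, where higher values track the notes more tightly:

```python
import numpy as np

def smooth_class_weights(chroma_weights, pitch_sensitivity, max_sensitivity=300):
    """Exponentially smooth per-frame class weights.

    chroma_weights: array of shape (12, n_frames) with values in [0, 1].
    pitch_sensitivity: 1-299, as in the --pitch_sensitivity flag.
    max_sensitivity: hypothetical scaling constant for this illustration.
    """
    alpha = pitch_sensitivity / max_sensitivity   # near 1 => follow the notes closely
    smoothed = np.zeros_like(chroma_weights)
    smoothed[:, 0] = chroma_weights[:, 0]
    for t in range(1, chroma_weights.shape[1]):
        smoothed[:, t] = alpha * chroma_weights[:, t] + (1 - alpha) * smoothed[:, t - 1]
    return smoothed
```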
Tempo sensitivity
- Range: 0–1
- Default: 0.25
The tempo sensitivity is the sensitivity of the noise vector to changes in volume and tempo. Higher tempo sensitivity yields more movement.
In this example, the classes cohere strongly to the pitch because pitch sensitivity is high, but there is little overall movement because tempo sensitivity is low.
python visualize.py --song moon_river.mp3 --duration 60 --pitch_sensitivity 290 --tempo_sensitivity 0
In this example, the class mixture hardly changes because pitch sensitivity is low, but there is more overall movement because tempo sensitivity is high.
python visualize.py --song moon_river.mp3 --duration 60 --pitch_sensitivity 10 --tempo_sensitivity 0.6
Num. classes
- 1–12
- Default: 12
Lower the number of classes to mix fewer objects.
Classes
- Up to twelve indices (0–999) corresponding to the 1000 ImageNet classes
- Default: Twelve random indices
You can choose which classes you want to include in the video. The classes sync with pitches in chromatic order (A, A#, B…).
Alternatively, set sort_classes_by_power to 1 if you prefer to enter classes in a prioritized order.
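To make the note-to-class pairing concrete, here is a tiny plain-Python illustration (not code from the repo) using the two classes from the example below:

```python
# The i-th class passed via --classes follows the i-th chromatic note.
notes = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']
classes = [985, 107]                       # daisy, jellyfish (ImageNet indices)
pitch_to_class = dict(zip(notes, classes))
print(pitch_to_class)                      # {'A': 985, 'A#': 107}
```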
In this example, the video includes daisy (#985) and jellyfish (#107), but with more daisy than jellyfish:
python visualize.py --song cold.mp3 --duration 10 --pitch_sensitivity 260 --tempo_sensitivity 0.8 --num_classes 2 --classes 985 107 --sort_classes_by_power 1
Frame length
- Multiples of 64
- Default: 512
The frame length is the number of audio samples per video frame. The default frame length of 512 yields a video frame rate of ~43 fps. Decreasing the frame length increases the frame rate so the image updates more frequently (but the video will take longer to render). This is most useful for visualizing rapid music.
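As a quick sanity check on those numbers, assuming librosa's default 22,050 Hz sample rate (which matches the ~43 fps figure above):

```python
# Frame rate = audio sample rate / samples per video frame.
sample_rate = 22050                        # librosa's default; an assumption here
for frame_length in (512, 256, 128, 64):   # must be a multiple of 64
    print(frame_length, '->', round(sample_rate / frame_length, 1), 'fps')
# 512 -> 43.1 fps, 256 -> 86.1 fps, 128 -> 172.3 fps, 64 -> 344.5 fps
```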
python visualize.py --song T69_collapse.mp3 --duration 30 --num_classes 4 --classes 527 511 545 611 --sort_classes_by_power 1 --frame_length 128
I hope you found this tutorial interesting and informative. If you want to express thanks, tweet me a deep music video you create with this code!
You can find more of my videos here.
Open Questions
- What sort of art can GANs not create? Must art imitate reality?
- Does music have intrinsic visual structure? Are certain sounds, instruments, songs and genres better represented by certain ImageNet classes? Someone with synesthesia might think so.
- Is BigGAN’s artistic ability explained by the representational similarity between deep neural networks and the human visual cortex²? If so, could the latent space represent a topological map of the human imagination?
- Can BigGAN predict object imageability? For example, picture a wall clock. Did you actually visualize all of the digits? Neither did BigGAN.