Artistic beauty is more than skin deep. But does it even require skin?

[Editor’s note: This is a complicated and controversial subject, and we welcome your opinions. Please let us know on our Facebook page or via this link. Also, Kays and I are planning to record a more philosophical discussion about this in the next few days. Spoiler alert: if you find anything remotely appealing about the utterly execrable images in this story, you won’t agree with what I have to say. 🙂 – NB]
If your social network friends are anything like mine, your feed has been inundated with posts featuring unusual and sometimes nightmarish images generated by the latest trend in Neural Network and Machine Learning technology: text (or prompt) to image.
For instance, the graphic at the top of this story is one that I made from the prompt “Alien version of Tommy Wiseau” (Wiseau is the unlikely star of The Room, quite possibly the worst movie ever made and one of the most entertaining).
Dall-E, Disco Diffusion, Midjourney and other text-to-image generators have quickly risen in popularity during the past few months, at first through online articles written by a handful of privileged insiders, and more recently with the advent of betas open to the general public.
The technology behind Neural Networks and Machine Learning (often incorrectly referred to as A.I. or Artificial Intelligence by overeager journalists) works by first analyzing hundreds of thousands, sometimes even millions, of samples. From that analysis it builds an internal model that allows it to generate results coherent with what the user is asking for.
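To make the learn-then-generate workflow concrete, here is a deliberately tiny sketch: a character-level Markov chain that is “trained” on sample text and then generates new text statistically consistent with what it has seen. Real neural networks learn vastly richer models, and nothing here reflects how Dall-E or Midjourney actually work internally, but the two-phase shape is the same: analyze samples first, generate afterwards.

```python
import random
from collections import defaultdict

def train(samples, order=2):
    """Learning phase: count which character tends to follow each
    `order`-length context across all the training samples."""
    model = defaultdict(list)
    for text in samples:
        for i in range(len(text) - order):
            context = text[i:i + order]
            model[context].append(text[i + order])
    return model

def generate(model, seed, length=40, rng=None):
    """Generation phase: extend `seed` one character at a time by
    sampling from the counts learned during training."""
    rng = rng or random.Random(0)  # fixed seed keeps the demo repeatable
    out = seed
    order = len(seed)
    for _ in range(length):
        choices = model.get(out[-order:])
        if not choices:
            break  # never saw this context during training
        out += rng.choice(choices)
    return out

samples = ["the cat sat on the mat", "the dog sat on the log"]
model = train(samples)
print(generate(model, "th"))
```

The output will always look like plausible recombinations of the training material, which is also the essential limitation: the model can only remix what it has analyzed.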
While the fundamentals of the technology have been around for decades (the term Machine Learning [ML] was actually coined in 1959 by IBM employee Arthur Samuel), what makes it all so exciting today is just how good it has become at generating results that seem… er… human.
There have already been some relatively primitive attempts at ML-generated music, but the natural question for many is: “How soon before I can type ‘Paul McCartney singing about playing ping pong in Las Vegas in the style of the Red Hot Chili Peppers’ and get a reasonably credible rendition of such a song?”
We might get the answer to that question very soon, and I would be highly surprised if engineering teams at Google, OpenAI, and other tech firms aren’t already working on it.
In order for this technology to develop in a particular field, three key ingredients are required:
1. A large sample pool of freely available data to analyze.
2. Some form of pre-existing structure that can be used to help generate algorithms.
3. Enough interest on the part of the public and the engineers to make it worth the effort.
Music easily fulfills these three requirements, which is why I believe it’s an ideal future candidate.
Machine-generated music that sounds as authentic and vibrant as music made by humans has many appealing uses, providing the public with an almost limitless supply of never-before-heard music. Imagine, for instance, how happy your teenage niece might be at getting not only a wholly original birthday song with lyrics that relate to her, but one sung by BTS!
However, many fear that once it arrives, Neural Network-generated music will quickly overtake human composers with its sheer speed and versatility, putting many out of a job.
In reality, I feel that these new developments will have both positives and negatives.
If we define a “job” as an exchange of one person’s money for another person’s time, Neural Networks would appear poised to generate massive amounts of new music very quickly. But the reality is that a human is still very much a requirement for this technology.
In my experiments trying to get usable images from Midjourney, I found that achieving good results still required a great deal of knowledge, experience, and time.
Midjourney works by interpreting a text prompt and generating an initial group of four images, which you then refine progressively until you reach an acceptable final result.
The amount of detail in the text prompts heavily affects the results. “Dog wearing a cheese hat,” for instance, is not likely to produce results as good as “studio photoshoot of a Pembroke Welsh Corgi wearing a hat in the shape of a triangular wedge of Emmental Swiss cheese, portrait, intricate complexity, rule of thirds, style of Krenz Cushart, Ashley Wood, and Charlie Bowater and Craig Mullins, intricate accurate details, Artstation trending, Octane render, cinematic color grading, muted colors, soft light, bokeh, yellows and blues, spotlight.”
The text prompt is just the beginning. Midjourney will then return four “coarse” results; you pick the one that best matches your vision, and it refines that one further. This in turn returns four more results, of which you pick one for further refinement; rinse and repeat until you’re finally satisfied and hit that “Upres” button to get the final image.
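The select-and-refine loop above can be sketched in a few lines. Everything here is a stand-in (the function names `generate_variations`, `upscale`, and `refine` are illustrative, not Midjourney’s actual API); the point is simply that a human `pick` decision sits inside every round:

```python
def generate_variations(prompt, base=None):
    """Stand-in generator: each round yields four labelled candidates,
    optionally derived from a previously chosen candidate."""
    parent = base or prompt
    return [f"{parent}/v{i}" for i in range(1, 5)]

def upscale(image):
    """Stand-in for the final 'Upres' step."""
    return f"{image}@highres"

def refine(prompt, pick, rounds=3):
    """Run `rounds` select-and-refine passes, then upscale the survivor.
    `pick` represents the human choosing one of the four candidates."""
    chosen = pick(generate_variations(prompt))
    for _ in range(rounds - 1):
        chosen = pick(generate_variations(prompt, base=chosen))
    return upscale(chosen)

# Example: a "user" who always keeps the second candidate of each grid.
result = refine("corgi in a cheese hat", pick=lambda grid: grid[1])
print(result)  # corgi in a cheese hat/v2/v2/v2@highres
```

Notice that the loop cannot run at all without the `pick` callback: the generator proposes, but a person with taste and intent disposes.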
You get the idea: someone with knowledge, experience, and a vision still needs to be in the driver’s seat for the Neural Network to generate useful results. I suspect that the same will be true for music.
In many ways we should think of Machine Learning and Neural Networks not as a substitute for humans, but as yet another tool in our creative arsenals. And that leads me to what I think is the real problem with newer technological trends: far too often they promote a lazier approach, which in turn yields blandness and homogeneity.
In many ways we are already experiencing such an effect, thanks to current technology that many of us use on a daily basis.
Pitch correction and beat correction make us worry less about performing a part well, since we know that we can “fix it in post.” Machine learning algorithms can already be used for mixing, mastering, even EQing tracks.
Recently I reviewed Native Instruments’ Playbox, a Kontakt instrument that offers countless variations of chord progressions and sounds while requiring very little prior knowledge of music theory, synthesis, or sound engineering.
A friend of mine calls this trend in music products “instant gratification.” The idea is that the tools can generate complex and polished results with relatively little effort and input from the user.
While there is a profound difference between how a Neural Network generates new compositions and how a plug-in does it, the net potential for complacency is equally damaging.
I have seen a counter movement to the instant gratification trends in recent years. Analog technologies such as modular synths and outboard gear are in many ways the polar opposite of ease of use. These tools require hard work, knowledge, and dedication to achieve good results, but many find this to be a small price to pay for music that is individual and unique.
I like to think that perhaps the future will hold a hybrid approach to music creation, one that will leverage the best of what each technology can offer.
We are fast approaching (or perhaps have already reached) a creative crossroads, and it is up to each and every one of us to decide how we are going to proceed.
Will we be content with having Neural Networks generate music for us while we sit back and enjoy a cold one, or will we challenge ourselves to push the boundaries of these new tools by creating innovative and yet unforeseen new ways of releasing our creativity?
The answer lies in each and every one of us, and I hope that we collectively choose to step up to this challenge, as the future of music depends on it. Now excuse me while I go and ask Midjourney to show me what a pancake with the face of Elvis Presley wearing a sombrero and singing into a carrot looks like.