Disclaimer: All opinions expressed here are solely my own and do not represent those of my employer.
Layered up and panting, I blearily hiked up the steep trail in total darkness while my 30-pound backpack snoozed on my back. It was 4 am and the caffeine from a hastily brewed pot of joe hadn't kicked in. I could see my breath fog up under the light of my headlamp. A stitch in my side begged me to stop. Darkness surrounded us.
With me were fellow photographers. Like fireflies drawn to a distant light source, we were united by the hope of capturing the first light of dawn. It felt like we'd been hiking for days with no end in sight. When we finally reached the summit, the sky was just beginning to glow like a freshly kindled flame. I'd only ever seen something like this twice before in my life.
It was magical.
Like a marathoner rejuvenated by a second wind at the sight of the finish line, I felt energized and rushed to frame a composition that would do justice to the scene. Any photographer worth their salt knows that the magic hour changes fast. Working feverishly, I took several images before the sun rose above the peak. The colors vanished as fast as they had come. I had just made memories that would last my lifetime and an image that would outlive me. As of this writing, that morning was seven years ago. I remember it like it was yesterday. Would you like to see the image I made that morning?
Surprise!
Only one of the images above is the one I captured that morning. The other I created today with the help of Artificial Intelligence (AI). Which one is which?
One of these cost me thousands of dollars in travel and camera equipment. The other cost just 10 minutes of my time typing in some text. One of these unleashed my latent creative potential; the other put it on sabbatical. One of these emotionally stirs me and invokes memories of adventure. The other evokes a sense of shock, unease, and powerlessness.
While I was slogging to make that memorable image, machine learning was busy learning to caption images, i.e., given an image, describe it in plain text. The inverse problem, i.e., generating realistic images from plain text, was still beyond the realm of possibility.
In the seven years since, I have added a few tricks to my photography tool belt. But that is dwarfed by the quantum leap machine learning has made to get to where we are today.
Mist Dissolving in the Morning Light
As a machine learning researcher and photographer, I'm uniquely positioned to observe this tectonic shift in creative expression. I feel a bit like Clark Kent when behind a camera and like Superman in front of a computer.
Today, I can stand on the line separating these two media and spot differences, both noticeable and subtle. With each passing day, however, I see this line fading away like a sketch drawn on sand frequently visited by frothy waves.
As Dickens famously wrote,
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.”
Much of what he wrote applies to our story today.
Never has it been easier to become a creator. With recent advances in machine learning, millions have been able to produce art beyond their wildest imaginations. Every new creator is handed turbocharged training wheels, and their creations, too, are stunning.
Take the example above. With no context about my experience whatsoever, a model called Midjourney came up with incredible imagery from just a fragmented sentence I gave it. I mean, it didn't even have to smell the wildflowers to be inspired.
It's not just photographs. Artificial Intelligence can now proficiently author stories, compose music, paint, and transcribe videos. Heck, it can write a Twitter thread in the style of any person. Just ask productivity YouTuber Ali Abdaal:
The thread above was written entirely by a machine learning model (based on GPT-3) pretending to be him. Not only that, it was one of his best-performing threads, garnering over 1 million views and nearly 25k engagements. Talk about never having writer's block again.
An artificially generated painting won first place at the Colorado State Fair's art competition. In fact, it was so good that the judges said that even if they'd known it was produced by AI, they'd have still given it first place.
Creativity was supposed to be the lone sentinel that stood when all the others bowed to the machines: the paragon of what it meant to be human, the singular quality we prided ourselves on.
That fortress has been breached. Insignificance envelops me as I write this.
Legacy creators worked hard on developing their craft, spending long hours trying, tinkering, and finding new ways to express themselves. Some took years to produce their first masterpiece, and through that process had eager students queueing up to apprentice under them.
Legendary photographer Ansel Adams spent nearly ten years, from 1919 to 1928, honing his craft through regular trips to the Sierras with like-minded souls. Ten years.
Such hard graft is no longer necessary for the TikTok generation sporting a fruit-fly attention span. The next Picasso could be "Netflix and chilling" while a model cranks out artwork for its master to evaluate. All they need is an internet connection.
This might read like sour grapes, and there is some truth to that. I spent years myself traveling and learning from the masters of our time, trying to make an image worthy of their consideration. It took me a while to finally pluck up the courage to submit my work to competitions, and to win them. Was it all worth it? Is great art defined by just its curb appeal, or by the sum total of unseen effort, craftsmanship, and pizzazz?
A Perverse Finishing School
What really makes these waters murky is the manner in which these models are trained to produce novel pieces of art. To understand why, we must first open up the black box inside these models that provides inspiration.
Unlike traditional rule-based systems, where you tell a computer explicitly what to do through code, machine learning systems learn the rules directly from large amounts of data. Keep showing a model various pictures of dogs and soon enough, it will learn rules that allow it to measure the "dogginess" of a picture: two ears on top of the head, fur, walks on four feet, lovable, and so on. Before long, it can identify most kinds of dogs with a high level of accuracy.
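To make that contrast concrete, here is a minimal sketch of "learning rules from data" instead of hand-writing if/else rules. The dataset, feature names, and labels are toy assumptions of mine, purely for illustration:

```python
# A minimal sketch of learning rules from data rather than hand-coding them.
# The examples and "dogginess" attributes below are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each example: [ear_count, has_fur, leg_count] -- crude animal attributes.
X = [
    [2, 1, 4],  # dog
    [2, 1, 4],  # dog
    [0, 0, 2],  # bird
    [0, 0, 0],  # fish
]
y = ["dog", "dog", "bird", "fish"]

# Instead of us writing the rules, the model infers them from the examples.
model = DecisionTreeClassifier().fit(X, y)

print(model.predict([[2, 1, 4]]))  # -> ['dog']
```

Real systems learn from raw pixels rather than hand-picked attributes, but the principle is the same: the rules come from the data, not the programmer.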
Have you seen kids learning to draw? Initially, all of their doodles make no sense to anyone but them. As they see more and more examples of things (unfortunately, a majority of them being characters from CoComelon, Blippi, and SpongeBob), they slowly start building a visual memory bank of how things are supposed to look. They learn attributes like shape, color, size, posture, and so forth. Little by little, like a sculpture emerging from an amorphous block of stone, their art begins to crystallize into recognizable forms. Their drawings don't exactly resemble the things they have seen but feel inspired by them. In the same way, these models build up a visual dictionary from the various things they're shown. However, their memory bank is significantly larger and has hundreds if not thousands of attributes, some of which we don't even have names for. These attributes help the model extract meaning from new examples it is shown.
This multi-dimensional visual dictionary that the model learns is called the latent space.
Items that sit closer together in the latent space correspond to similar things. For example, different types of dogs would cluster together, as would different types of cars, and so on.
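You can poke at this yourself. Here is a hedged sketch using CLIP, one real, openly available model that learns such a space (the checkpoint name is just one common choice), to check that two dog breeds land closer together than a dog and a car:

```python
# Sketch: related concepts sit near each other in a learned latent space.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a golden retriever", "a husky", "a sports car"]
inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)

# Unit-normalize so dot products become cosine similarities.
emb = emb / emb.norm(dim=-1, keepdim=True)
sim = emb @ emb.T

# Expect sim[0, 1] (dog vs. dog) to exceed sim[0, 2] (dog vs. car).
print(sim)
```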
To quickly recap: a generative machine learning model builds a representation of the world (the latent space) from the examples it is shown. Each example is deconstructed into the attributes mentioned above and becomes a point in the latent space, and similar things cluster together there.
To generate art, the model simply projects the text input (a.k.a. the prompt) into this space, i.e., matches the text to the attributes it has learned, and thereby finds inspiration by drawing from the examples it has seen before.
What's interesting is that providing the same prompt will not produce the same artwork twice. This is due to some randomness within the system (there are ways to make it deterministic, but I digress).
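As a concrete, hedged illustration: Midjourney is closed, but the open-source Stable Diffusion weights expose the same prompt-to-image loop through the Hugging Face diffusers library. The prompt and seed below are arbitrary choices of mine:

```python
# Sketch of the prompt-to-image loop, using open Stable Diffusion weights.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "mist dissolving over a mountain ridge at first light"

# Each call samples fresh random noise, so the same prompt gives new images.
image_a = pipe(prompt).images[0]
image_b = pipe(prompt).images[0]  # almost certainly differs from image_a

# Fixing the random seed is one way to make the process deterministic.
gen = torch.Generator("cuda").manual_seed(42)
image_c = pipe(prompt, generator=gen).images[0]  # reproducible
```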
But to be able to handle so many different kinds of requests, the model needs to be trained on a massive and diverse dataset of examples. Think on the order of hundreds of millions of image-text pairs. Where do you think it got these from?
Engineers created ginormous datasets by scraping images from the web along with their textual descriptions (alt text). These datasets included images from Pinterest and Fine Art America, as well as other third-party websites. Where do you think legacy creators post their work online to sell it?
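For a sense of the mechanics, here is a toy sketch of harvesting image-text pairs from a single web page. The URL is a placeholder of mine; real datasets like LAION were assembled from Common Crawl at a vastly larger scale:

```python
# Toy sketch: turning a page's <img> tags into (image URL, alt text) pairs.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/gallery", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

pairs = [
    (img["src"], img["alt"])
    for img in soup.find_all("img")
    if img.get("src") and img.get("alt")  # keep only images with alt text
]

# Each (image URL, alt text) pair becomes one training example.
print(pairs[:5])
```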
These machine learning masterpieces are the twisted offspring of legacy artists. You just need to know how to communicate with these models to get exactly what you want out of them. What took legacy artists months to complete now takes model whisperers just minutes. Michelin-star restaurant quality at McDonald's quantity.
This feels like a seminal moment in history. These models have freed the imaginations of many who lack an artistic hand. Even seasoned artists are using these models to spark their creativity and quickly get a few directions to explore.
Personally, I feel like Thanos having just been told that I must sacrifice something dear to me to get the Soul Stone. If I switch to the dark side, will I ever have that experience of joy, wonder, and creation again? Will I resort to snapping my fingers and producing art beyond my wildest dreams, knowing full well its checkered origins?
I'm not sure - yet.