Alien Dreams: An Emerging Art Scene
By Charlie Snell
In recent months there has been a bit of an explosion in the AI-generated art scene.
Ever since OpenAI released the weights and code for their CLIP model, various hackers, artists, researchers, and deep learning enthusiasts have figured out how to use CLIP as an effective "natural language steering wheel" for various generative models, allowing artists to create all sorts of interesting visual art merely by inputting some text – a caption, a poem, a lyric, a word – to one of these models.
For example, inputting "a cityscape at night" produces this cool, abstract-looking depiction of some city lights:
(source: @RiversHaveWings on Twitter)
Or asking for an image of the sunset returns this interesting minimalist thing:
(source: @Advadnoun on Twitter)
Asking for "an abstruse painting of a planet ruled by little castles" results in this satisfying and trippy piece:
(source: @RiversHaveWings on Twitter)
Feed the system a portion of the poem "The Waste Land" by T.S. Eliot and you get this sublime, calming piece of work:
(source: @Advadnoun on Twitter)
You can even mention specific cultural references and it'll usually come up with something sort of accurate. Querying the model for a "studio ghibli landscape" produces a reasonably convincing result:
(source: @ak92501 on Twitter)
You can create little animations with this same method too. In my own experimentation, I tried asking for "Starry Night" and ended up with this pretty cool looking gif:
These models have so much creative power: just input some words and the system does its best to render them in its own uncanny, abstract fashion. It's really fun and surprising to play with: I never really know what's going to come out; it might be a trippy pseudo-realistic landscape or something more abstract and minimal.
And despite the fact that the model does most of the work in actually generating the image, I still feel creative – I feel like an artist – when working with these models. There's a real element of creativity to figuring out what to prompt the model for. The natural language input is a total open sandbox, and if you can wield words to the model's liking, you can create just about anything.
In concept, this idea of generating images from a text description is incredibly similar to OpenAI's DALL-E model (if you've seen my previous blog posts, I covered both the technical inner workings and philosophical ideas behind DALL-E in great detail). But in fact, the method here is quite different. DALL-E is trained end-to-end for the sole purpose of producing high quality images directly from language, whereas this CLIP method is more like a beautifully hacked-together trick for using language to steer existing unconditional image generating models.
A high-level depiction of how DALL-E's end-to-end text-to-image generation works.
A high-level depiction of how CLIP can be used to generate art.
The weights for DALL-E haven't even been publicly released yet, so you can see this CLIP work as somewhat of a hacker's attempt at reproducing the promise of DALL-E.
Since the CLIP-based approach is a little more hacky, the outputs are not quite as high quality and precise as what's been demonstrated with DALL-E. Instead, the images produced by these systems are weird, trippy, and abstract. The outputs are grounded in our world for sure, but it's like they were produced by an alien that sees things a little bit differently.
It's exactly this weirdness that makes these CLIP-based works so uniquely artistic and beautiful to me. There's something special about seeing an alien perspective on something familiar.
(Note: technically DALL-E makes use of CLIP to re-rank its outputs, but when I say CLIP-based methods here, I'm not talking about DALL-E.)
Over the last few months, my Twitter timeline has been taken over by this CLIP-generated art. A growing community of artists, researchers, and hackers has been experimenting with these models and sharing their outputs. People have also been sharing code and various tricks/methods for modifying the quality or artistic style of the images produced. It all feels a bit like an emerging art scene.
I've had a lot of fun watching this art scene develop and evolve over the course of the year, so I figured I'd write a blog post about it because it's just so cool to me.
I'm not going to go in-depth on the technical details of how this system generates art. Instead, I'm going to document the unexpected origins and evolution of this art scene, and along the way I'll also present some of my own thoughts and some cool artwork.
Of course I am not able to cover every aspect of this art scene in a single blog post. But I think this post hits most of the big points and big ideas, and if there's anything important that you think I might have missed, feel free to comment below or tweet at me.
CLIP: An Unexpected Origin Story
On January 5th 2021, OpenAI released the model weights and code for CLIP: a model trained to determine which caption from a set of captions best fits with a given image. After learning from hundreds of millions of images in this manner, CLIP not only became quite good at picking out the best caption for a given image, but it also learned some surprisingly abstract and general representations for vision (see the multimodal neuron work from Goh et al. on Distill).
For example, CLIP learned a neuron that activates specifically for images and concepts relating to Spider-Man. There are also other neurons that activate for images relating to emotions, geographic locations, or even famous individuals (you can explore these neuron activations yourself with OpenAI's Microscope tool).
Image representations at this level of abstraction were somewhat of a first of their kind. And in addition to all of this, the model also demonstrated greater classification robustness than any prior work.
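To make the caption-matching setup concrete, here's a minimal sketch (not code from the CLIP release itself) that uses OpenAI's open-sourced clip package to score a few candidate captions against a single image. The image path and the candidate captions are placeholders you would swap out for your own.

```python
import torch
import clip  # OpenAI's released CLIP package: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path; swap in any image you like.
image = preprocess(Image.open("my_image.jpg")).unsqueeze(0).to(device)
captions = ["a cityscape at night", "a bowl of fruit", "a studio ghibli landscape"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # CLIP scores every (image, caption) pair; a softmax turns the scores into a ranking.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

for caption, p in sorted(zip(captions, probs), key=lambda x: -x[1]):
    print(f"{p:.3f}  {caption}")
```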
So from a research perspective, CLIP was an incredibly exciting and powerful model. But nothing here clearly suggests that it would be helpful for generating art – let alone spawning the art scene that it did.
Nonetheless, it only took a day for various hackers, researchers, and artists (most notably @advadnoun and @quasimondo on Twitter) to figure out that, with a simple trick, CLIP can actually be used to guide existing image generating models (like GANs, autoencoders, or implicit neural representations like SIREN) to produce original images that fit with a given description.
In this method, CLIP acts as something like a "natural language steering wheel" for generative models. CLIP essentially guides a search through the latent space of a given generative model to find latents that map to images which fit with a given sequence of words.
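As a rough illustration of what that steering looks like in code, here is a stripped-down sketch of the general loop (not the actual Big Sleep or VQ-GAN+CLIP code): a latent vector is optimized by gradient descent so that the generated image's CLIP embedding matches the caption's embedding. The TinyGenerator is an untrained stand-in for a real pretrained generator like BigGAN or VQ-GAN, and the augmentation tricks used in practice are omitted.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep everything in fp32 for this simple sketch

class TinyGenerator(torch.nn.Module):
    """Untrained stand-in for a real pretrained generator (BigGAN, VQ-GAN, a SIREN, ...)."""
    def __init__(self, latent_dim=128, size=64):
        super().__init__()
        self.latent_dim, self.size = latent_dim, size
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim, 3 * size * size), torch.nn.Sigmoid()
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, self.size, self.size)  # RGB image in [0, 1]

generator = TinyGenerator().to(device)
for p in generator.parameters():
    p.requires_grad_(False)  # the generator stays frozen; only the latent moves

# Encode the caption once; its embedding is the fixed target.
with torch.no_grad():
    text_features = clip_model.encode_text(clip.tokenize(["a cityscape at night"]).to(device))
    text_features = F.normalize(text_features, dim=-1)

# CLIP's expected input normalization constants.
clip_mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
clip_std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

latent = torch.randn(1, generator.latent_dim, device=device, requires_grad=True)
optimizer = torch.optim.Adam([latent], lr=0.05)

for step in range(300):
    image = generator(latent)
    image = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
    image = (image - clip_mean) / clip_std
    image_features = F.normalize(clip_model.encode_image(image), dim=-1)

    # Maximize the cosine similarity between the image and the caption under CLIP.
    loss = -(image_features * text_features).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the real community notebooks, the latent parameterization and extra augmentations (like taking many random crops of the image before scoring it with CLIP) matter a lot for output quality; this sketch only shows the core idea.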
Early results using this technique were weird but nonetheless surprising and promising:
left – (source: @quasimondo on Twitter); right – (source: @advadnoun on Twitter)
The Big Sleep: Humble Beginnings
In just a couple weeks, there was a breakthrough. @advadnoun released code for The Big Sleep: a CLIP-based text-to-image technique, which used BigGAN as the generative model.
(source: @advadnoun on Twitter)
In its own unique way, The Big Sleep roughly met the promise of text-to-image. It can approximately render just about anything you can put into words: "a sunset", "a face like an M.C. Escher drawing", "when the wind blows", "the grand canyon in 3d".
Of course, the outputs from The Big Sleep are maybe not everyone's cup of tea. They're weird and abstract, and while they are usually globally coherent, sometimes they don't make much sense. There is definitely a unique style to artworks produced by The Big Sleep, and I personally find it to be aesthetically pleasing.
"a sunset" co-ordinate to The Large Slumber (source: @advadnoun on Twitter)
"a face like an Chiliad.C. Escher cartoon" from The Big Sleep (source: @advadnoun on Twitter)
"when the air current blows" from The Big Sleep (source: @advadnoun on Twitter)
But the main wonder and enchantment that I get from The Big Sleep does not necessarily come from its aesthetics; rather, it's a bit more meta. The Big Sleep's optimization objective when generating images is to find a point in GAN latent space that maximally corresponds to a given sequence of words under CLIP. So when looking at outputs from The Big Sleep, we are literally seeing how CLIP interprets words and how it "thinks" they correspond to our visual world.
To really appreciate this, you can think of CLIP as being either statistical or alien. I prefer the latter. I like to think of CLIP as something like an alien brain that we're able to unlock and peer into with the help of techniques like The Big Sleep. Neural networks are very different from human brains, so thinking of CLIP as some kind of alien brain is not really that crazy. Of course CLIP is not truly "intelligent", but it's still showing us a different view of things, and I find that idea quite enchanting.
The alternative perspective/philosophy on CLIP is a little more statistical and cold. You could think of CLIP's outputs as the product of mere statistical averages: the result of computing the correlations between language and vision as they exist on the internet. With this perspective, the outputs from CLIP are more akin to peering into the zeitgeist (at least the zeitgeist at the time that CLIP's training data was scraped) and seeing things as something like a "statistical average of the internet" (of course this assumes minimal approximation error with respect to the true distribution of data, which is probably an unreasonable assumption).
Since CLIP's outputs are so weird, the alien viewpoint makes a lot more sense to me. I think the statistical zeitgeist perspective applies more to situations like GPT-3, where the approximation error is presumably quite low.
"At the end of everything, aging buildings and a weapon to pierce the sky" from The Big Sleep (source: @advadnoun on Twitter)
"the grand canyon in 3d" according to The Big Sleep
Looking back, The Big Sleep is not the first AI art technique to capture this magical feeling of peering into the "mind" of a neural network, but it does capture that feeling arguably better than any technique that has come before.
That's not to say that older AI art techniques are irrelevant or uninteresting. In fact, it seems that The Big Sleep was in some ways influenced by one of the most popular neural network art techniques from a foregone era: DeepDream.
Per @advadnoun (The Big Sleep's creator):
The Big Sleep's name is "an allusion to DeepDream and the surrealist film noir, The Big Sleep. The second reference is due to its strange, dreamlike quality" (source).
It's interesting that @advadnoun partly named The Big Sleep after DeepDream because, looking back now, they are spiritually sort of related.
DeepDream was an incredibly popular AI art technique from a previous generation (2015). The technique essentially takes in an image and modifies it slightly (or dramatically) such that the image maximally activates certain neurons in a neural network trained to classify images. The results are usually very psychedelic and trippy, like the image below.
an image produced by DeepDream (source).
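For a sense of the mechanism, here is a bare-bones sketch of the DeepDream idea (not Google's original implementation, which used GoogLeNet plus tricks like octaves and jitter, and which I've simplified here): gradient ascent on the pixels of an image to amplify the activations of a chosen layer in a pretrained classifier. It starts from random noise so it runs without an input file, the layer index is an arbitrary choice, and ImageNet input normalization is omitted for brevity.

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.to(device).eval()
for p in features.parameters():
    p.requires_grad_(False)

# DeepDream normally starts from a photograph; noise is used here so the sketch
# runs with no input file. Swap in a real image tensor to "dream" on it instead.
image = torch.rand(1, 3, 256, 256, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.01)
target_layer = 20  # which VGG feature layer's activations to amplify (arbitrary choice)

for step in range(200):
    x = image
    for i, layer in enumerate(features):
        x = layer(x)
        if i == target_layer:
            break
    loss = -x.norm()  # gradient *ascent* on the activations = descent on the negative norm
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)  # keep pixels in a valid range
```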
Although aesthetically DeepDream is quite different from The Big Sleep, both of these techniques share a similar vision: they both aim to extract art from neural networks that were not necessarily meant to generate art. They dive inside the network and pull out beautiful images. These art techniques feel like deep learning interpretability tools that accidentally produced art along the way.
So in a way, The Big Sleep is sort of like a sequel to DeepDream. But in this case the sequel is arguably better than the original. The alien views generated by DeepDream will always be timeless in their own respect, but there's something really powerful about being able to probe CLIP's knowledge by prompting it with natural language. Anything you can put into words will be rendered through this alien dream-like lens. It's just such an enchanting way to make art.
VQ-GAN: New Generative Superpowers
On December 17 2020, researchers (Esser et al.) from Heidelberg University posted their paper "Taming Transformers for High-Resolution Image Synthesis" on arXiv. They presented a novel GAN architecture called VQ-GAN, which combines conv-nets with transformers in a way that optimally takes advantage of both the local inductive biases of conv-nets and the global attention in transformers, making for an especially strong generative model.
Around early April, @advadnoun and @RiversHaveWings started doing some experiments combining VQ-GAN and CLIP to generate images from a text prompt. At a high level, the method they used is mostly identical to The Big Sleep. The main difference is really just that instead of using BigGAN as the generative model, this system used VQ-GAN.
The results were a huge stylistic shift:
"A Serial Of Tubes" from VQ-GAN+CLIP (source: @RiversHaveWings on Twitter)
"The Yellow Fume That Rubs Its Muzzle On The Window-Panes" from VQ-GAN+CLIP (source: @RiversHaveWings on Twitter)
"Planetary Metropolis C" from VQ-GAN+Clip (source: @RiversHaveWings on Twitter)
"Dancing in the moonlight" from VQ-GAN+Clip (source: @advadnoun on Twitter)
"Mechanic Want" from VQ-GAN+Clip (source: @RiversHaveWings on Twitter)
"Mechanic Desire" from VQ-GAN+CLIP (source: @RiversHaveWings on Twitter)
"a tree with weaping branches" from VQ-GAN+Prune (source: @advadnoun on Twitter)
The outputs from VQ-GAN+CLIP tend to look less painted than The Big Sleep's and more like a sculpture. Even when the images are too abstract to be real, there's a certain material quality to them that makes it seem as if the objects in the images could have been crafted by hand. At the same time, there's still an alien weirdness to it all, and the aura of peering into a neural network and seeing things from its viewpoint is most definitely not lost here.
Just swapping out the generative model from BigGAN to VQ-GAN was almost like gaining a whole new artist with their own unique style and viewpoint: a new lens for seeing the world through CLIP's eyes. This highlights the generality of this CLIP-based system. Anytime a new latent-variable generative model is released, it can usually be plugged into CLIP without too much trouble, and all of a sudden we can generate art with a new style and form. In fact, this has already happened at least once: less than 8 hours after DALL-E's dVAE weights were publicly released, @advadnoun was already tweeting out art made with dVAE+CLIP.
The Joys of Prompt Programming: The Unreal Engine Trick
We've seen how switching generative models can dramatically change the style of CLIP's outputs without too much effort, but it turns out that there's an even simpler trick for doing this.
All you need to do is add some specific keywords to your prompt that signal something about the style of your desired image, and CLIP will do its best to "understand" and alter its output accordingly. For example, you could append "in the style of Minecraft" or "in the style of a Drawing" or even "in the style of DeepDream" to your prompt, and most of the time CLIP will actually output something that roughly matches the style described.
In fact, one specific prompting trick has gained quite a bit of traction. It has become known as the "unreal engine trick".
(source: @arankomatsuzaki on Twitter)
It was discovered by @jbustter in EleutherAI's Discord just a few weeks ago that if you add "rendered in unreal engine" to your prompt, the outputs look much more realistic.
(source: the #art channel in EleutherAI's Discord)
Unreal Engine is a popular 3D video game engine created by Epic Games. CLIP likely saw lots of images from video games that were tagged with the description "rendered in Unreal Engine". So by adding this to our prompt, we're effectively incentivizing the model to replicate the look of those Unreal Engine images.
And it works pretty well, just look at some of these examples:
"a magic fairy house, unreal engine" from VQ-GAN+CLIP (source: @arankomatsuzaki on Twitter)
"A Void Dimension Rendered in Unreal Engine" from VQ-GAN+CLIP (source: @arankomatsuzaki on Twitter)
"A Lucid Nightmare Rendered in Unreal Engine" from VQ-GAN+CLIP (source: @arankomatsuzaki on Twitter)
CLIP learned general enough representations that in order to induce desired behavior from the model, all we need to do is ask for it in the prompt. Of course, finding the right words to get the best outputs can be quite a challenge; after all, it did take several months to discover the unreal engine trick.
In a way, the unreal engine trick was a breakthrough. It made people realize just how effective adding keywords to the prompt can be. And in the last couple weeks, I've seen increasingly complicated prompts being used that are aimed at extracting the highest quality outputs possible from CLIP.
For example, asking VQ-GAN+CLIP for "a small hut in a blizzard near the top of a mountain with one light turned on at dusk trending on artstation | unreal engine" produces this hyper-realistic looking output:
(source: @ak92501 on Twitter)
Or querying the model with "view from on top of a mountain where you can see a village below at night with the lights on landscape painting trending on artstation | vray" gives this monumental view:
(source: @ak92501 on Twitter)
Or "matte painting of a house on a hilltop at midnight with small fireflies flying around in the mode of studio ghibli | artstation | unreal engine":
(source: @ak92501 on Twitter)
Each of these images looks nothing like the VQ-GAN+CLIP art we saw in the previous section. The outputs still have a certain surreal quality to them and perhaps the coherence breaks down at a few points, but overall the images just pop like nothing else we've seen so far; they look more like edited photographs or scenes from a video game. So it seems that each of these keywords – "trending on artstation", "unreal engine", "vray" – plays a crucial role in defining the unique style of these outputs.
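As a tiny illustration of this kind of prompt tinkering, here is how one might mechanically tack style modifiers onto a base prompt before feeding it to one of these notebooks; the pipe separator and the modifier phrases simply follow the conventions in the examples above, not anything the model formally requires.

```python
base_prompt = "a small hut in a blizzard near the top of a mountain"
style_modifiers = [
    "rendered in unreal engine",
    "trending on artstation",
    "matte painting",
    "vray",
]

# Build one prompt per modifier, plus a "kitchen sink" prompt with all of them.
prompts = [f"{base_prompt} | {m}" for m in style_modifiers]
prompts.append(base_prompt + " | " + " | ".join(style_modifiers))

for p in prompts:
    print(p)
```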
This general paradigm of prompting models for desired behavior is becoming known as "prompt programming", and it is really quite an art. In order to have any intuition as to what prompts might be effective, you need some clue as to how the model "thinks" and what types of data the model "saw" during training. Otherwise, prompting can be a little bit like dumb luck. Although hopefully, in the future, as models get even larger and more powerful, this will become a little bit easier.
This is Just The Beginning
In this blog post I've described some of the early milestones in the evolution of CLIP-based generative art. But by no means was this an extensive coverage of the art that people have been able to create with CLIP. I didn't even get around to talking about the super cool work that's been done with StyleGAN+CLIP, or the really interesting CLIPDraw work, or even the saga of experiments done with DALL-E's dVAE+CLIP. I could go on and on, and the list of new methods for creating art with CLIP is expanding each week. In fact, it really feels like this is just the beginning; there is likely so much to improve and build upon and so many creative discoveries yet to be made.
So if this stuff is interesting to you, and you'd like to learn more about how these CLIP-based art systems work, or even if you just want to keep up with some of the most innovative artists in this space, or if you want to try your own hand at generating some art, be sure to check out the resources below.
References, Notebooks, and Relevant Twitter Accounts
References
(see the captions below each piece of artwork for its corresponding reference; all images without references are works that I created)
- CLIP blog post
- CLIP paper
- Big-GAN paper
- VQ-GAN paper
- The Big Sleep blog post
- DeepDream blog post
- DALL-E blog post
- Multimodal Neurons Distill
Notebooks
(you can use these Colab notebooks to make your own CLIP-based art; just input a prompt. They each use slightly different techniques. Have fun!)
- The Big Sleep
- Aleph2Image
- Deep Daze
- VQ-GAN+CLIP (codebook sampling)
- VQ-GAN+CLIP (z+quantize)
- VQ-GAN+CLIP (EleutherAI)
(Note: if you are unfamiliar with Google Colab, I can recommend this tutorial on how to operate the notebooks.)
Relevant Twitter Accounts
(these are all Twitter accounts that often post art generated with CLIP)
- @ak92501
- @arankomatsuzaki
- @RiversHaveWings
- @advadnoun
- @eps696
- @quasimondo
- @M_PF
- @hollyherndon
- @matdryhurst
- @erocdrahs
- @erinbeess
- @ganbrood
- @92C8301A
- @bokar_n
- @genekogan
- @danielrussruss
- @kialuy
- @jbusted1
- @BoneAmputee
- @eyaler
Source: https://ml.berkeley.edu/blog/posts/clip-art/