Plug & Play Generative Networks | Uber AI Labs CoFounder Jason Yosinski

February 28, 2020


So we have this method of producing images: we start with some pixels and try to get the network to draw better and better pictures. It didn't work at first, then it worked a little bit better, and we'd like to take this to the next level. I won't go into all the math, but basically, rather than hand-coding a prior for these images, we now train a new network, a sort of painter network, to paint images. So now we have two networks. We have the one network we're trying to visualize, the one I've shown you with the face detector and all that. And we train a second neural network that knows how to paint pictures. It's pretty cool: you train it to take a latent vector, a little vector that describes which picture you'd like to paint. You put that in, propagate it through the network, and it generates the pixels corresponding to that vector. These are called plug and play generative networks. I won't go into the math, but more or less you plug these two networks together. One network, the generator, is trained to know what images are realistic, so it says, "this looks realistic." And the condition network says, "this image looks like a cheeseburger." By combining them we can hopefully produce an image that is a cheeseburger and also looks realistic. So here is the cheeseburger, and it does look pretty realistic.
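To make that plugging-together step concrete, here is a minimal sketch in PyTorch. It is not the exact PPGN sampler from the paper (which uses an MCMC-style update with a learned prior); it just shows the core loop of optimizing a latent code so the generator's output makes a chosen class neuron of the condition network fire strongly. The names `generator` and `classifier`, the latent dimensionality, and the hyperparameters are all placeholders.

```python
import torch

# Assumed placeholders: `generator` is a frozen pretrained "painter" that maps a
# latent vector z to an image; `classifier` is the frozen condition network whose
# class neuron we want to visualize. Only the latent code z is optimized.
def synthesize(generator, classifier, class_idx, z_init, steps=200, lr=0.05):
    z = z_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        image = generator(z)                # painter network draws an image from the latent code
        logits = classifier(image)          # condition network scores the drawing
        class_score = logits[0, class_idx]  # e.g. the "cheeseburger" output neuron
        prior = z.pow(2).mean()             # crude stand-in for the paper's learned prior
        loss = -class_score + 0.01 * prior  # maximize the class score while staying plausible
        loss.backward()
        optimizer.step()

    return generator(z).detach()
```

It would be called as something like `synthesize(G, net, cheeseburger_idx, torch.zeros(1, 4096))`, where the latent size 4096 and the class index are stand-ins for whatever the actual networks use.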
We can generate other image classes too: here are swimming trunks, and lots of other things from ImageNet, like leaf beetle, triumphal arch, and toaster. And with this approach we'd like to generate more than a single toaster or a single cheeseburger. A tricky part of neural networks is that there isn't one image that is "the" cheeseburger to the network; there are many cheeseburgers that would cause the network to think it's seeing a cheeseburger. The same goes for dogs: dogs zoomed out, zoomed in, turning left, turning right, in different colors, in different lighting conditions, and so on. So we'd like the generator network to be able to draw a diverse array of possible inputs. Here, for example, we show we can do that pretty well: these are all kinds of synthetic volcano images that cause high firing of the volcano neuron. You can see there's night, day, some with clouds, some with steam, some with just blue sky, and so on. So we can get a lot of diversity. Here are different categories too, so different types of birds and ants and monasteries, and so on.
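In the actual PPGN work this diversity comes from a sampling procedure in latent space rather than a single deterministic optimization, but the simplest way to see the idea is just to rerun the `synthesize` sketch above from different random starting codes. A minimal, hypothetical sketch:

```python
import torch

# Hypothetical: reuse the `synthesize` sketch above, starting each run from a
# different random latent code so the images we end up with differ.
def synthesize_diverse(generator, classifier, class_idx, n_samples=8, latent_dim=4096):
    return [
        synthesize(generator, classifier, class_idx, z_init=torch.randn(1, latent_dim))
        for _ in range(n_samples)
    ]
```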
We can do some other fun stuff with this. So far we've plugged together two real networks and asked them to generate a real thing, a generated cheeseburger. We can also make things up. In this setup, y is the label we're going for, so y might be "cheeseburger." But we can make up new labels that don't really exist and ask the network to draw those, which is kind of fun. For example, here's a real category, castle, and we get the network to draw a castle. Here's a candle. So this one is "please make the castle neuron fire," and this one is "please make the candle neuron fire." And then we can ask it to fire both at the same time: make us a picture that causes castle and candle both to fire, and we get something like this, a castle that's sort of on fire. Here's fireboat plus candle, and we can generate images of fireboats fighting fires, and other fireboats that are themselves on fire. If your fireboat is on fire, you should probably just go home and call it a bad day.
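A hedged sketch of what "fire both at the same time" might look like on top of the earlier `synthesize` routine: instead of maximizing one class logit, sum the logits of the two target classes. The class indices here are placeholders, not real ImageNet or Places indices.

```python
import torch

# Hypothetical sketch: maximize two class scores at once by summing their logits.
# `generator` and `classifier` are the same frozen placeholder networks as before.
def synthesize_combo(generator, classifier, class_indices, z_init, steps=200, lr=0.05):
    z = z_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        logits = classifier(generator(z))
        combo_score = sum(logits[0, i] for i in class_indices)  # e.g. castle + candle together
        loss = -combo_score + 0.01 * z.pow(2).mean()             # same crude prior term as before
        loss.backward()
        optimizer.step()

    return generator(z).detach()
```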
We can do other things too: we can replace the condition network with other classifiers, which is kind of fun. So let's imagine the year is 2035 and the whole world is not doing so well. You're in a trash heap somewhere and you find a hard drive with a neural network on it, and you want to know: what does this neural network do? What was it trained for? You want to do some digital forensics. So we did that. We grabbed a network (it was trained on MIT Places, just to give the answer away); we just downloaded the weights from someone's GitHub repo and tried to visualize it. We said, okay, we're going to take this new condition network and generate images that cause it to fire really strongly. We did that for a category, we just picked a random output neuron, and we got this picture. So what do you think this neuron is for? It turns out that in the original MIT Places dataset, this category was called "residential area." The condition network knows what a residential area looks like, and our generator network is able to satisfy it by drawing things it knows about. It knows about grass and trees and houses, and the condition network tells it: you'd better put this grass and these trees and houses together in this configuration for me to think it's a residential area. So we can generate visualizations for things the generator has never seen before. There are other categories too, canyon and banquet hall and so on; in banquet hall you can see something like a dress at the bottom, as if someone's dancing, and art studio, and so on.
Cool. We can do some other stuff: we can replace this network with a caption network. If you've seen caption networks, you give one a picture and train it to generate a caption for that picture; you show it the Eiffel Tower and it might say "this is the Eiffel Tower," or "the Eiffel Tower in the springtime," or something like that. Using this generative process we can reverse that: we can type in a caption and get it to generate a plausible picture for that caption. For example, we typed in "a red car parked on the side of the road" and asked it to generate a few plausible red-car images. This is not a beautiful image; on the other hand, this network was never trained to generate cars at all. We changed "red car" to "blue car," and we can make the car look blue.
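One way to picture that reversal: instead of a class logit, maximize the probability the caption network assigns to the target caption given the generated image. Here is a rough sketch under that assumption; `caption_model` and its `caption_log_prob` method are placeholders for whatever scoring interface the real caption model exposes, not an actual API.

```python
import torch

# Hypothetical sketch of "reversing" a captioning model: optimize the latent code
# so the caption network assigns high log-probability to the target caption.
def synthesize_from_caption(generator, caption_model, caption_tokens, z_init,
                            steps=300, lr=0.05):
    z = z_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        image = generator(z)
        log_prob = caption_model.caption_log_prob(image, caption_tokens)  # assumed method
        loss = -log_prob + 0.01 * z.pow(2).mean()
        loss.backward()
        optimizer.step()

    return generator(z).detach()
```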
We can try some other captions. Here's "a marina filled with boats in the water"; you can see some boats, some water, and some specular reflections. We can also learn a lot when this process fails. Here we typed in "a bird sitting on a branch" and generated these four pictures. You can kind of see maybe a nature scene with a little branch, but no bird. So why is the bird missing? "Sitting," maybe? Oh, "sitting," as opposed to standing or something; yeah, good question. We saw this and we were really confused. We thought, you know, our model just isn't great, it can't draw birds very well. But actually, I already showed you it can draw really beautiful birds. So why doesn't it just put the bird on the branch? We went a little deeper, and we thought maybe there's something else going on.
To figure this out, we took the caption model, grabbed this image from Google Images, put it into the caption model, and asked what it thought the caption was. The caption model said, "a bird sitting on a tree branch with a tree." Okay, a little grammatically weird, but it does have the three concepts; it gets bird, tree, and branch all into the sentence. So then my co-author painstakingly photoshopped the bird out of the image. We gave that image to the caption model, asked what it thought it was, and got almost the same thing: "a bird perched on a branch in a tree." So what does this teach us? There are two networks. The one network can definitely draw birds, but the second network, the caption network, is not asking it to draw birds, because in the caption network these three concepts, bird and tree and branch, are kind of confused; they're kind of commingled. So why might that have happened?
It goes back to the way the dataset was collected. This model is trained on the MS COCO vision dataset. The way they created that dataset is they took a bunch of images, gave them to Mechanical Turk workers, and had the workers click on interesting regions of the image. So one person might highlight a box around this bird, and then a second person types in a caption answering "what is in that box?" That person might say, "a bird on a tree branch." Okay, fine. So you have bird, tree, and branch all being trained together. But nobody ever highlighted this region over here and wrote just "tree branch," no bird, because, just because of the way the data was collected, nobody was ever incentivized to do that. So going back to a question we had at the very beginning: do biases creep into these models? Absolutely. Both the types of insidious biases that you and I would really like to avoid in the world, but also really weird, subtle biases like this one with birds and tree branches. If you never teach the model that birds and tree branches aren't the same thing, it might accidentally learn that they're kind of the same thing.
There's some other work, actually, not by us. We really wanted to replace this condition network with a real human or mammalian brain. It was kind of a crazy idea: we could generate these bird pictures or something, record neuron firings, and optimize images to produce what the brain was trying to see. But of course, I don't know anything about brains or monkeys, so we thought this was never going to happen. But actually, two labs in the last year or so have done this with real monkeys, which is super cool. I'll just show you some of their basic results. What they do is generate images using this kind of network and show them to a monkey that has recording devices implanted. They record how much a neuron fires, and then they change the image over and over again to make it fire more and more. By doing this, you can take an initial image, like this one in the top left, that doesn't cause much firing at all, and slowly tweak it. In our network, that would mean slowly making it look more and more like a school bus or a zebra; in the monkey network, literally the brain of the monkey, it's slowly making it look more and more like something.
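Since you cannot backpropagate through a real brain, those monkey experiments have to search the generator's latent space without gradients, for example with a simple evolutionary loop. Here is a rough, hypothetical sketch of that kind of closed loop; `measure_firing_rate` stands in for showing the image to the animal and reading out the recorded neuron, and is not a real API.

```python
import torch

# Hypothetical sketch of the closed-loop experiment: no gradients are available
# from a real neuron, so evolve the generator's latent codes toward higher firing.
@torch.no_grad()
def evolve_preferred_stimulus(generator, measure_firing_rate, latent_dim=4096,
                              population=20, generations=100, noise=0.1):
    codes = torch.randn(population, latent_dim)              # initial random latent codes

    for _ in range(generations):
        images = generator(codes)                            # render candidate stimuli
        rates = torch.tensor([measure_firing_rate(img) for img in images])
        survivors = codes[rates.topk(population // 4).indices]  # codes the neuron liked best
        codes = survivors.repeat(4, 1)                       # refill the population
        codes = codes + noise * torch.randn_like(codes)      # with noisy copies of the survivors

    images = generator(codes)
    rates = torch.tensor([measure_firing_rate(img) for img in images])
    return images[rates.argmax()]                            # the stimulus the neuron fires for most
```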
And actually, what they made it look like is this thing here. They were probing a neuron deep in some monkey's head, and really, it was looking for another monkey; that's what it wanted to see. I think this is super cool, because I don't know anything about squishy brains, but it's great that people can apply these methods there.
So I want to wrap up a little bit. I hope I've shown you that we're getting into a region where we're building things that are really amazing and totally going to change the world, but that are hard to understand. So we need some of these methods, and probably a hundred more, to understand what's going on. A lot of the work I showed you was done not only by myself but by many really talented and amazing collaborators. If you're curious about this work and want to see more, there's a lot of code and papers on the website. You can email me if you'd like, and if you'd like to see these slides, they're posted online as well. So thanks to my collaborators, and thanks to you all for listening.
