Researchers at Google Brain have discovered a technique by which a black-box decider, successfully trained for one task, can be made to perform an unrelated computation: the inputs for that computation are embedded in the input to the black-box decider, and the result is extracted from the decider's output.
One of the paper's proof-of-concept experiments uses an ImageNet classifier (referred to below simply as "ImageNet") for recognition of handwritten numerals. The inputs for the numeral-recognition problem are small images (twenty-eight pixels high and twenty-eight pixels wide), and the task is to determine which of the ten decimal numerals each input represents. Normally ImageNet takes much larger, full-color images as inputs and outputs a tag identifying what's in the picture, chosen from a fixed list of a thousand tags. Numerals aren't included in that list, so ImageNet never outputs a numeral. It's not designed to be a recognizer for handwritten numerals.
But ImageNet can be co-opted. The researchers took the first ten tags from the ImageNet tag list and associated them with numerals (tench ↦ 0, goldfish ↦ 1, etc.). Then they set up an optimization problem: find the pattern of pixels making up a large image that maximizes ImageNet's success in “interpreting” the images that result when each small image from the numeral-recognition training set is embedded at the center of the large image. An interpretation counts as correct, for this purpose, if ImageNet returns the tag that is mapped to the correct numeral.
The pixel pattern that emerges from this optimization problem looks like video snow; it doesn't have any human-recognizable elements. When one of the small handwritten numerals is embedded at the center, the image looks to a human being like a white handwritten numeral in a small black square surrounded by this random-looking video snow. But if the numeral is a 9, ImageNet thinks that it looks very like an ostrich, whereas if it's a 3, then ImageNet thinks that it depicts a tiger shark.
Note that ImageNet is not being retrained here and isn't doing anything that it wouldn't do right out of the box. The “training” step here is just finding the solution to the optimization problem: What pattern of pixels will most effectively trick ImageNet into doing the computation we want it to do when the input data for our problem is embedded into that pattern of pixels?
The researchers call the optimized pixel patterns “adversarial programs.”
Besides the numeral-recognition task, the researchers were also able to trick ImageNet — six different variants of ImageNet, in fact — into doing two other standard classification tasks, just by finding optimal pixel patterns — adversarial programs — in which to embed the input data.
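The setup can be sketched in miniature. The toy below is my own construction, not the paper's code: it freezes a small, randomly initialized network as a stand-in for the pretrained classifier, then gradient-descends on the surrounding "program" pixels only. The network's weights are never touched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen, randomly initialized network standing in for a pretrained
# classifier (a toy stand-in for ImageNet; its weights never change).
D, H, K = 16 * 16, 128, 10                    # pixels, hidden units, tags
W1 = rng.normal(0.0, 1.0 / 16, (H, D))
W2 = rng.normal(0.0, 1.0 / np.sqrt(H), (K, H))

# Two tiny 4x4 "numerals" that we want the frozen net to map to its
# first two tags (0 and 1), by analogy with tench -> 0, goldfish -> 1.
small_images = [rng.random((4, 4)) for _ in range(2)]
wanted_tags = [0, 1]

mask = np.ones((16, 16))
mask[6:10, 6:10] = 0.0                        # program pixels surround the center

def embed(small, program):
    frame = program * mask                    # program everywhere ...
    frame[6:10, 6:10] = small                 # ... except the embedded input
    return frame.ravel()

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_grad(program):
    total, grad = 0.0, np.zeros((16, 16))
    for small, tag in zip(small_images, wanted_tags):
        x = embed(small, program)
        h = W1 @ x
        p = softmax(W2 @ np.maximum(h, 0.0))
        total -= np.log(p[tag])               # cross-entropy on the wanted tag
        dz = p.copy()
        dz[tag] -= 1.0                        # dL/dlogits
        gx = W1.T @ ((W2.T @ dz) * (h > 0))   # backprop down to the pixels
        grad += gx.reshape(16, 16) * mask     # only program pixels are free
    return total, grad

program = rng.random((16, 16))                # the "adversarial program"
initial_loss, _ = loss_and_grad(program)
for _ in range(300):
    _, grad = loss_and_grad(program)
    program -= 0.5 * grad                     # optimize pixels, not weights
final_loss, _ = loss_and_grad(program)
```

The loss falls steadily, and with enough steps the frozen net assigns the wanted tags to the embedded inputs. As in the paper, the only thing that ever changed is the pattern of pixels around the embedded image.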
“Adversarial Reprogramming of Neural Networks”
Gamaleldin F. Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein, arXiv, June 28, 2018
The authors provide summary descriptions of three proposed strategies for defending black-box deciders against attempts to find adversarial examples: “adversarial training” (including adversarial examples in training sets), “defensive distillation” (“smoothing the model's decision surface” in the hope of eliminating the abrupt discontinuities that adversarial examples exploit), and “gradient masking” (flattening the gradients in the vicinity of a successfully classified training object, so that software searching for adversarial examples cannot explore that neighborhood productively).
The authors' assessment is that the first two methods are whack-a-mole games in which new adversarial examples just pop up in previously unexplored parts of the input space, while the third doesn't work at all: it papers over exploitable weaknesses, but would-be attackers can simply train their own models without gradient masking and find adversarial examples against those substitute models. Such examples will be equally effective against the gradient-masked decider.
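Of the three, adversarial training is the easiest to show concretely. The toy below is my own construction, not anything from the blog post: it trains a logistic-regression decider on a mix of clean points and points perturbed by the fast-gradient-sign method, i.e. pushed in the direction that most increases the loss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class data: two Gaussian blobs in 20 dimensions.
d, n = 20, 200
X = np.vstack([rng.normal(-1.0, 1.0, (n, d)), rng.normal(+1.0, 1.0, (n, d))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(d), 0.0
eps, lr = 0.1, 0.1

for _ in range(200):
    # Fast-gradient-sign perturbation: for a logistic model the input
    # gradient of the loss is (p - y) * w, so each point moves eps in
    # the sign of that direction.
    p = sigmoid(X @ w + b)
    X_adv = X + eps * np.sign((p - y)[:, None] * w[None, :])

    # One gradient step on clean and adversarial examples together.
    for Xb in (X, X_adv):
        p = sigmoid(Xb @ w + b)
        w -= lr * Xb.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)

# Accuracy on the clean training points after adversarial training.
acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
```

The authors' complaint applies even to this sketch: the model is now robust to perturbations of size eps in this particular direction, but nothing rules out adversarial examples elsewhere in the input space.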
“Is Attacking Machine Learning Easier Than Defending It?”
Ian Goodfellow and Nicolas Papernot, cleverhans-blog, February 15, 2017
The authors conclude with some possible explanations of the easy availability, robustness, and persistence of adversarial examples:
Adversarial examples are hard to defend against because it is hard to construct a theoretical model of the adversarial example crafting process. Adversarial examples are solutions to an optimization problem that is non-linear and non-convex for many ML models, including neural networks. Because we don't have good theoretical tools for describing the solutions to these complicated optimization problems, it is very hard to make any kind of theoretical argument that a defense will rule out a set of adversarial examples.
From another point of view, adversarial examples are hard to defend against because they require machine learning models to produce good outputs for every possible input. Most of the time, machine learning models work very well but only work on a very small amount of all the many possible inputs they might encounter.
The authors don't mention one explanation that I find particularly plausible. A trained deep neural network maps each possible input to a decision or a classification, and the space of possible inputs is typically immense. I imagine the decision function that the network implements as carving this input space into regions, each region containing the inputs that will be classified in the same way. (The regions may or may not be simply connected; that doesn't matter.) Instead of cleanly separating the space into blocks with easily described shapes, the regions have extremely irregular boundaries that curl around one another, thread through one another, and break one another up in complicated ways. And because the space has a huge number of dimensions, the Euclidean distance between an arbitrarily chosen point inside one region and the nearest point inside some other (arbitrarily chosen) region is likely to be small: there are so many directions in which to search. Evolution drives animal brains to make decisions and classifications that are not only mostly accurate but also intelligible (and energy-efficient). Training a deep neural network imposes no such additional constraints, and so it yields decision functions that are far more likely to carve up their input spaces in these irremediably intricate ways.
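The distance claim is easy to check for the simplest possible decision function, a random linear boundary through a high-dimensional space (a stand-in of my own choosing, not a trained network):

```python
import numpy as np

rng = np.random.default_rng(2)

d = 10_000                               # dimensionality of the input space
w = rng.normal(size=d)                   # normal vector of a random boundary w.x = 0
x = rng.normal(size=d)                   # an arbitrary input point

logit = w @ x
# Euclidean distance from x to the boundary, moving along the gradient:
dist_to_boundary = abs(logit) / np.linalg.norm(w)
# Typical size of the point itself:
point_norm = np.linalg.norm(x)
ratio = dist_to_boundary / point_norm    # tiny when d is large

# Stepping just past that distance along the gradient flips the decision:
step = -np.sign(logit) * (dist_to_boundary + 1e-9) * w / np.linalg.norm(w)
flipped = np.sign(w @ (x + step)) != np.sign(logit)
```

With d = 10,000 the nearest crossing point is typically around one percent of the point's own norm away. There are so many directions that one of them is almost always a shortcut into another region.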
A snapshot of the state of research on adversarial examples at the time of publication (February 2017). It's partial, but there are a lot of links that look useful.
“Attacking Machine Learning with Adversarial Examples”
Ian Goodfellow, Nicolas Papernot, Sandy Huang, Yan Duan, Pieter Abbeel, and Jack Clark, OpenAI, February 24, 2017
Eight vignettes from a foreseeable future.
Cory Doctorow, this., August 17, 2017
It is possible to fool face-recognition (FR) systems into misidentifying one person A as some specified other person B by projecting a pattern of infrared light onto A's face when the recognizer's camera photographs it, creating a customized adversarial example. Since light in the near infrared can be detected by surveillance cameras but not by human eyes, other people cannot detect the masquerade, even at close range. To project the light patterns, the researchers had person A wear a baseball cap with tiny infrared LEDs tucked up under the bill.
“Invisible Mask: Practical Attacks on Face Recognition with Infrared”
Zhe Zhou, Di Tang, Xiaofeng Wang, Weili Han, Xiangyu Liu, and Kehuan Zhang, arXiv, March 13, 2018
In this paper, we present the first approach that makes it possible to apply [an] automatically-identified, unique adversarial example to [a] human face in an inconspicuous way [that is] completely invisible to human eyes. As a result, the adversary masquerading as someone else will be able to walk on the street, without any noticeable anomaly to other individuals[,] but appearing to be a completely different person to the FR system behind surveillance cameras.
Although accurate image captioning is strictly more difficult than image classification, and can produce a larger variety of results, the strategies that have been developed for constructing adversarial examples against image classifiers can be adapted to image-captioning systems as well.
“Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning”
Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh, arXiv, December 6, 2017
Changing only one (carefully selected) pixel in an image can cause black-box deciders to misclassify the image in an astonishing number of cases. The authors develop a method that makes fewer assumptions than other adversarial-example techniques about the mechanics of the deciders it is trying to fool.
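A brute-force version of the idea fits in a few lines. The toy below is my own construction, not the authors' differential-evolution search: it scans an image for a single pixel whose value can be pushed to an extreme to flip a linear scorer's decision.

```python
import numpy as np

# A toy linear "decider" over flattened 8x8 images: score > 0 means "cat".
# One pixel (index 10) carries an outsized weight, standing in for the way
# trained models concentrate sensitivity in a few input directions.
rng = np.random.default_rng(3)
w = rng.normal(0.0, 0.05, 64)
w[10] = 5.0

x = np.full(64, 0.5)                       # a flat gray image
bias = -(w @ x) + 0.1                      # calibrated so x is "cat" by a hair

def label(img):
    return "cat" if w @ img + bias > 0 else "dog"

# One-pixel attack: try pushing each pixel to its extremes (0 or 1)
# and keep the first change that flips the decision.
attack = None
for i in range(64):
    for v in (0.0, 1.0):
        candidate = x.copy()
        candidate[i] = v
        if label(candidate) != label(x):
            attack = (i, v)                # (pixel index, new value)
            break
    if attack:
        break
```

Against deep networks the authors need a cleverer, gradient-free search (differential evolution), but the existence of such single-coordinate shortcuts is the same phenomenon.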
“One Pixel Attack for Fooling Deep Neural Networks”
Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai, arXiv, February 22, 2018
“AI Has a Hallucination Problem That's Proving Tough to Fix”
Tom Simonite, Wired, March 9, 2018
“Data Driven Exploratory Attacks on Black Box Classifiers in Adversarial Domains”
Tegjyot Singh Sethi and Mehmed Kantardzic, arXiv, March 23, 2017
Machine learning operates under the assumption of stationarity, i.e. the training and testing distributions are assumed to be identically and independently distributed … . This assumption is often violated in an adversarial setting, as adversaries gain nothing by generating samples which are blocked by a defender's system. …
In an adversarial environment, the accuracy of classification has little significance, if an attacker can easily evade detection by intelligently perturbing the input samples.
Most of the paper deals with strategies for probing black-box deciders that are only accessible as services, through APIs or Web interfaces. The justification for the strategies is more heuristic than theoretical, but the authors give some evidence that they are good enough to generate adversarial examples for a lot of real-world black-box deciders.
What I liked most about the paper was the application of “the security mindset.”
This 2005 paper is a kind of precursor to the current literature about adversarial examples: It shows how to modify an e-mail that a naive Bayesian spam filter correctly classifies as spam so as to induce the same filter to misclassify it as ham. The modification consists in replacing a small number of words.
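The attack is easy to reproduce against a miniature filter. The word counts and messages below are invented for illustration; the point is only that swapping a couple of words moves the log-likelihood ratio across the spam/ham threshold.

```python
import math

# Word counts from a (made-up) training corpus.
spam_counts = {"free": 20, "winner": 20, "cash": 15, "meeting": 1, "report": 1}
ham_counts  = {"free": 1,  "winner": 1,  "cash": 1,  "meeting": 20, "report": 20}
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood_ratio(words):
    """Sum of log P(w|spam)/P(w|ham), Laplace-smoothed; > 0 means spam."""
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    return sum(
        math.log((spam_counts.get(w, 0) + 1) / spam_total)
        - math.log((ham_counts.get(w, 0) + 1) / ham_total)
        for w in words
    )

def classify(words):
    return "spam" if log_likelihood_ratio(words) > 0 else "ham"

original = ["free", "winner", "cash"]          # correctly flagged as spam
modified = ["free", "meeting", "report"]       # two words swapped for "good" ones
```

Lowd and Meek go further than this hand-picked swap: their attacker finds effective substitutions with only query access to the filter, never seeing its training data.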
“Adversarial Learning”
Daniel Lowd and Christopher Meek, ACM Conference on Knowledge Discovery and Data Mining, August 2005
This is the paper that introduced the term “adversarial examples” and initiated the systematic study and construction of such examples.
“Intriguing Properties of Neural Networks”
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, arXiv, February 19, 2014
When carefully selected adversarial inputs are presented to image classifiers implemented as deep neural networks and trained using machine-learning techniques, the classifiers fail badly, because the functions from inputs to outputs that they implement are highly discontinuous. Such adversarial inputs are not difficult to generate and are not dependent on particular training data or learning regimens, since the same adversarial examples are misclassified by image classifiers trained on different data under different learning regimens.
The authors present a second, possibly related discovery: Even the pseudo-neurons (“units”) that are close to the output layer usually don't carry semantic information individually; the contributions of linear combinations of such units are indistinguishable in nature from the contributions of the units individually, so any semantic information that is present emerges holistically from the entire layer.
These results suggest that the deep neural networks that are learned by backpropagation have nonintuitive characteristics and intrinsic blind spots, whose structure is connected to the data distribution in a non-obvious way. …
If the network can generalize well, how can it be confused by these adversarial negatives, which are indistinguishable from the regular examples? Possible explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near … virtually every test case.
Black-box classifiers are prone to hasty generalizations from training data. If you train your neural network on a lot of pictures depicting sheep on grassy hills, the network learns to posit sheep whenever it sees a grassy hill.
“Do Neural Nets Dream of Electric Sheep?”
Janelle Shane, Postcards from the Frontiers of Science, March 2, 2018
Are neural networks just hyper-vigilant, finding sheep everywhere? No, as it turns out. They only see sheep where they expect to see them. They can find sheep easily in fields and mountainsides, but as soon as sheep start showing up in weird places, it becomes obvious how much the algorithms rely on guessing and probabilities.
Bring sheep indoors, and they're labeled as cats. Pick up a sheep (or a goat) in your arms, and they're labeled as dogs.
Paint them orange, and they become flowers.
On the other hand, as Shane observes, the Microsoft Azure classifier is hypervigilant about giraffes (“due to a rumored overabundance of giraffes in the original dataset”).
Several of the papers to be presented at this year's International Conference on Learning Representations propose strategies for blocking the construction of adversarial examples against machine-learning-based image-classification systems. The goal is to harden such systems enough to make them usable even in high-risk situations in which adversaries can select and control the inputs that the fully trained systems are expected to classify.
Once these post hoc defenses are incorporated into the systems, however, it is possible to devise more specialized attacks against them, resulting in new, even more robust adversarial examples:
“Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples”
Anish Athalye, Nicholas Carlini, and David Wagner, arXiv, February 1, 2018
“Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples”
Anish Athalye, Nicholas Carlini, and David Wagner, GitHub, February 2, 2018
“Adversarial Attacks on Modern Speech-to-Text”
Max Little, Language Log, January 30, 2018
For many commercial STT and associated user-centric applications this is mostly a curiosity. If I can order pizza and nearly always get it right in one take through Siri, I don't really see the problem here, even if it is obviously highly brittle. …
Nonetheless, I think this brittleness does have consequences. There will be critical uses for which this technology simply can't work. Specialised dictionaries may exist (e.g. clinical terminology) for which it may be almost impossible to obtain sufficient training data to make it useful. Poorly represented minority accents may cause it to fail. Stroke survivors and those with voice or speech impairments may be unable to use them. And there are attacks … in which a device is hacked remotely.