The authors provide summary descriptions of three proposed defensive strategies for training black-box deciders that block attempts to find adversarial examples: “adversarial training” (including adversarial examples in training sets), “defensive distillation” (“smoothing the model's decision surface” in the hope of eliminating the abrupt discontinuities that adversarial examples exploit), and “gradient masking” (flattening the gradients in the vicinity of a successfully classified training object so that it is computationally difficult for the software that finds adversarial examples to explore that space productively).
The authors' assessment is that the first two methods are whack-a-mole games in which new adversarial examples just pop up in previously unexplored parts of the input space, while the third doesn't work at all. It papers over exploitable weaknesses, but would-be attackers can simply develop their own models, not using gradient masking, and find the adversarial examples against those models. They will be equally effective against the gradient-masked decider.
“Is Attacking Machine Learning Easier Than Defending It?”
Ian Goodfellow and Nicolas Papernot, cleverhans-blog, February 15, 2017
The authors conclude with some possible explanations of the easy availability, robustness, and persistence of adversarial examples:
Adversarial examples are hard to defend against because it is hard to construct a theoretical model of the adversarial example crafting process. Adversarial examples are solutions to an optimization problem that is non-linear and non-convex for many ML models, including neural networks. Because we don't have good theoretical tools for describing the solutions to these complicated optimization problems, it is very hard to make any kind of theoretical argument that a defense will rule out a set of adversarial examples.
From another point of view, adversarial examples are hard to defend against because they require machine learning models to produce good outputs for every possible input. Most of the time, machine learning models work very well but only work on a very small amount of all the may possible inputs they might encounter.
The authors don't mention one explanation that I find particularly plausible. A trained deep neural network maps each possible input to a decision or a classification. The space of possible inputs is typically immense. I imagine the decision function that the network implements as carving this input space into regions, each region containing the inputs that will be classified in the same way. (The regions may or may not be simply connected; that doesn't matter.) Instead of cleanly separating the space into blocks with easily described shapes, the regions have extremely irregular boundaries that curl around one another and thread through one another and break one another up in complicated ways. The number of dimensions of the space is huge, so that the Euclidean distance between an arbitrarily chosen point inside one region and the nearest point that is inside some other (arbitrarily chosen) region is likely to be small, since there are so many directions in which to search. Evolution drives animal brains to make decisions and classifications that are not only mostly accurate but also intelligible (and energy-efficient). Training a deep neural network doesn't impose these additional constraints and so yields network configurations that implement decision functions that are much more likely to carve up their input spaces in these irremediably intricate ways.
Although accurate image captioning is strictly more difficult than image classification, and can produce a larger variety of results, the strategies that have been developed for constructing adversarial examples against image classifiers can be adapted to image-captioning systems as well.
“Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning”
Hongge Chen, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh, arXiv, December 6, 2017
Changing only one (carefully selected) pixel in an image can cause black-box deciders to misclassify the image in an astonishing number of cases. The authors develop a method that makes fewer assumptions about the mechanics of the deciders that it is trying to fool than other methods of constructing adversarial examples.
“One Pixel Attack for Fooling Deep Neural Networks”
Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai, arXiv, February 22, 2018
“AI Has a Hallucination Problem That's Proving Tough to Fix”
Tom Simonite, Wired, March 9, 2018
This is the paper that introduced the term “adversarial examples” and initiated the systematic study and construction of such examples.
“Intriguing Properties of Neural Networks”
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, arXiv, February 19, 2014
When carefully selected adversarial inputs are presented to image classifiers implemented as deep neural networks and trained using machine-learning techniques, the classifiers fail badly, because the functions from inputs to outputs that they implement are highly discontinuous. Such adversarial inputs are not difficult to generate and are not dependent on particular training data or learning regimens, since the same adversarial examples are misclassified by image classifiers trained on different data under different learning regimens.
The authors present a second, possibly related discovery: Even the pseudo-neurons (“units”) that are close to the output layer usually don't carry semantic information individually; the contributions of linear combinations of such units are indistinguishable in nature from the contributions of the units individually, so any semantic information that is present emerges holistically from the entire layer.
These results suggest that the deep neural networks that are learned by backpropagation have nonintuitive characteristics and intrinsic blind spots, whose structure is connected to the data distribution in a non-obvious way. …
If the network can generalize well, how can it be confused by these adversarial negatives, which are indistinguishable from the regular examples? Possible explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the natural numbers), and so it is found near … virtually every test case.
“The Secret Sharer: Measuring Unintended Neural Network Memorization and Extracting Secrets”
Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song, arXiv, February 22, 2018
Given access to a fully trained black-box decider, it is surprisingly easy to recover personally identifiable information (such as Social Security numbers and credit-card information) that was present in its training set. This paper works out some of the methods and suggests adding noise to the training data, as in differential-privacy schemes, as a solution.
The neural network's implicit memorization of information in its training data is not due to overfitting and occurs even if additional validation is carried out during the learning process specifically to stop the training before overfitting occurs.
The secrets of a black-box decider need not be extracted by brute-force testing of all possible secrets. The authors propose a more efficient algorithm that uses a priority queue of partially determined secrets, to organize the search. (The measure used as the priority is the total entropy of the posited components of the secret as they are filled in during the search process. When all of the components have been filled in.)
Black-box classifiers are prone to hasty generalizations from training data. If you train your neural network on a lot of pictures depicting sheep on grassy hills, the network learns to posit sheep whenever it sees a grassy hill.
“Do Neural Nets Dream of Electric Sheep?”
Janelle Shane, Postcards from the Frontiers of Science, March 2, 2018
Are neural networks just hyper-vigilant, finding sheep everywhere? No, as it turns out. They only see sheep where they expect to see them. They can find sheep easily in fields and mountainsides, but as soon as sheep start showing up in weird places, it becomes obvious how much the algorithms rely on guessing and probabilities.
Bring sheep indoors, and they're labeled as cats. Pick up a sheep (or a goat) in your arms, and they're labeled as dogs.
Paint them orange, and they become flowers.
On the other hand, as Shane observes, the Microsoft Azure classifier is hypervigilant about giraffes (“due to a rumored overabundance of giraffes in the original dataset”).
Some of the limitations of black-box deciders become more obvious and intuitive when one recognizes the machine-learning algorithms behind them as software tools for approximating functions, using large data sets and statistical tools for calibration.
“The Delusions of Neural Networks”
Giacomo Tesio, Medium, January 18, 2018
The key points:
(A) Neural networks simulate functions and are calibrated statistically, with the assistance of large data sets comprising known argument-value pairs.
(B) Like other simulations, neural networks sometimes yield erroneous or divergent results. The functions they actually compute are usually not mathematically equal to the functions they simulate.
(C) The reason for this is that the calibration process uses only a finite number of argument-value pairs, whereas the function that the neural network is designed to simulate computes values for infinitely many arguments, or at least for many, many more arguments than are used in the calibration. (Otherwise, the simulation would be useless.) The data used in the calibration are compatible with many, many more functions than the one that the neural network is designed to simulate. The probability that calibrating the neural network results in its computing a function that is mathematically equal to the one it is designed to simulate is negligible — for practical purposes, it is zero.
(D) The problem of determining how accurately a neural network simulates the function it is designed to simulate is undecidable. There is no general algorithm to answer questions of this form. As a result, neural networks are usually validated not by proving their correctness but by empirical measurement: We apply them to arguments not used in the calibration process and compare the values they compute to the values that the functions they are designed to simulate associate with the same test arguments. When they match in a large enough percentage of cases, we pronounce the simulation a success.
(E) However, these test arguments are not generated at random, but are drawn from the same “natural” sources as the data used in the calibration of the network. The success of the simulation depends on this bias: Unless the test arguments are sufficiently similar to the data used in the calibration, the probability that the computed values will match is again negligible. This would essentially never happen if the test arguments were randomly selected.
(F) Consequently, the process of validating a neural network does not prove that it is unbiassed. On the contrary: in order to be pronounced valid, a neural network must simulate the biasses of the data set used in the calibration.
(G) In principle, it would be possible for independent judges to confirm that the data set is free from forms of bias that constitute discrimination against some protected class of persons and to provide strong empirical evidence that the function actually computed by the neural network does not actually introduce such a bias. In practice, this confirmation process would be prohibitively expensive and time-consuming.
(H) Neural networks can be used, and often are used, to simulate unknown functions. In those cases, there would be no way for a panel of independent judges even to begin the process of confirming freedom from discriminatory bias, because no one even knows whether the function that the neural network is designed to simulate exemplifies such a bias.
“I Will Improve My Batography Skills.”
“New Ways to Market Your Self-Affirmation Talk, Thanks to a Neural Network”
Janelle Shane, Postcards from the Frontiers of Science, February 23, 2018
An overview of the recent achievements, acknowledged limitations, and plausible extensions of multi-level neural networks suggests that this approach to artificial intelligence is nearly played out and must be supplemented by alternative approaches in order to make further progress.
In section 3, the author identifies ten “limits on the scope of deep learning,” including some that I would consider critical and ineradicable (see section 3.5, “Deep Learning Thus Far Is Not Sufficiently Transparent,” and section 3.9, “Deep Learning Thus Far Works Well as an Approximation, But Its Answers Often Cannot Be Fully Trusted”).
“Deep Learning: A Critical Appraisal”
Gary Marcus, arXiv, January 2018
The transparency issue, as yet unsolved, is a potential liability when using deep learning for problem domains like financial trades or medical diagnosis, in which human users might like to understand how a given system made a given decision. … Such opacity can also lead to serious issues of bias.
None of Marcus's proposals for supplementing machine learning addresses either the transparency problem or the problem posed by adversarial examples.