“ ‘All models are wrong, but some are useful.’
So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don’t have to settle for wrong models. Indeed, they don’t have to settle for models at all.”
So proclaimed WIRED editor-in-chief Chris Anderson 7 years ago, opening the July 2008 issue of stories relating to the advent of “The Petabyte Age” with his piece entitled: “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. Anderson concludes this preface, saying:
The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
There’s no reason to cling to our old ways. It’s time to ask: What can science learn from Google?
WIRED is known for occasional overstatement, and it’s generally unwise to give any sort of credence to those who use the phrase “crunch the numbers” (numbers are many things, but crunchiness is not one of their qualities). WIRED’s captain’s strongly-worded Op-Ed can probably be written off as nothing but bluster and hype. Anderson warrants his far-fetched claims in a variety of ways, but one direct quote surprised me.
Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”
“To set the record straight: That’s a silly statement, I didn’t say it, and I disagree with it.” Peter Norvig writes in a blog post (All we want are the facts, ma’am) after the publication of Anderson’s piece, staunchly disavowing the quote attributed to him in the article. Norvig continues:
The ironic thing is that even the article’s author, Chris Anderson, doesn’t believe the idea. I saw him later that summer at Google and asked him about the article, and he said “I was going for a reaction.”
Ok. So both these intellectuals generally agree with one another: the scientific method is still alive and relevant, not likely to be made obsolete anytime soon.
Except, perhaps not. When Anderson writes of the opinions he puts forward in his piece that “This kind of thinking is poised to go mainstream,” he looks, from the vantage point of 2015, uncharacteristically clairvoyant. What’s changed since 2008?
Science is a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the universe. -Wiki
A scientific model is an explanation and a prediction bundled into one. Or at least, it used to be. In another blog post (On Chomsky and the Two Cultures of Statistical Learning), Norvig draws on a paper by Leo Breiman to point out that another kind of model is growing in appeal and becoming widely adopted. He describes the dichotomy between the two modeling cultures in the following way:
First the data modeling culture (to which, Breiman estimates, 98% of statisticians subscribe) holds that nature can be described as a black box that has a relatively simple underlying model which maps from input variables to output variables (with perhaps some random noise thrown in). It is the job of the statistician to wisely choose an underlying model that reflects the reality of nature, and then use statistical data to estimate the parameters of the model.
Second the algorithmic modeling culture (subscribed to by 2% of statisticians and many researchers in biology, artificial intelligence, and other fields that deal with complex phenomena), which holds that nature’s black box cannot necessarily be described by a simple model. Complex algorithmic approaches (such as support vector machines or boosted decision trees or deep belief networks) are used to estimate the function that maps from input to output variables, but we have no expectation that the form of the function that emerges from this complex algorithm reflects the true underlying nature.
This second kind of model is different from the former in that it offers a prediction without an explanation. This is what Chomsky finds problematic.
It seems that the algorithmic modeling culture is what Chomsky is objecting to most vigorously. It is not just that the models are statistical (or probabilistic), it is that they produce a form that, while accurately modeling reality, is not easily interpretable by humans, and makes no claim to correspond to the generative process used by nature. In other words, algorithmic modeling describes what does happen, but it doesn’t answer the question of why.
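To make the distinction concrete, here is a minimal sketch that fits the same synthetic data both ways. The data-generating line (y = 3x + 1), the noise level, and the choice of k are all assumptions made for illustration. Neither Breiman nor Norvig gives this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known process (an assumption): y = 3x + 1 plus noise.
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)

# Data modeling culture: assume a form (a line) and estimate its parameters.
slope, intercept = np.polyfit(x, y, deg=1)

def predict_linear(x_new):
    return slope * x_new + intercept

# Algorithmic modeling culture: assume no form, just average the targets of
# the k nearest training points. No parameter here "means" anything.
def predict_knn(x_new, k=5):
    nearest = np.argsort(np.abs(x - x_new))[:k]
    return y[nearest].mean()

x_new = 4.2
print(f"linear: slope={slope:.2f}, intercept={intercept:.2f}, "
      f"prediction={predict_linear(x_new):.2f}")
print(f"k-NN:   prediction={predict_knn(x_new):.2f}")
```

Both predictors land near the same value, but only the first hands back a slope and an intercept that say something about how the data were generated.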
Anderson’s piece seems more apt to me now than it did in 2008 because I believe the balance of power between these two approaches has begun to shift towards the latter. The models we develop today are not only less frequently explanatory, but are often veritable black boxes: we may not even understand the mechanisms that give them such great prediction accuracy in the first place. Recently, explanatory models of complex phenomena (such as language) have been ceding the stage to a class of model that may have little to no explanatory power, but a great deal of predictive power. Even where these models might have explanatory power, researchers have routinely de-emphasized it. I’m talking about Neural Networks and “Deep Learning,” a modeling approach which has experienced a renaissance.
Neural Networks have come to the fore in the past 5 or so years as some of the most powerful tools for modeling arbitrary data. They are universal function approximators, meaning that a network with enough hidden units can, in principle, approximate any continuous function on a compact (closed and bounded) domain to arbitrary accuracy; finding that approximation in practice still takes sufficient compute, data, and an adequate architecture. For many engineering tasks, their re-discovery is a great boon, but scientists will have to treat them cautiously, because they belong staunchly to the algorithmic modeling culture, which today corresponds roughly to what we call Machine Learning.
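As a rough illustration of the approximation claim, the sketch below fits a one-hidden-layer tanh network to sin(x). To keep it short and deterministic it does not use backpropagation: the hidden weights are random and fixed, and only the output weights are solved by ridge-regularized least squares. The function name `fit_random_hidden_layer`, the layer width, and the ridge value are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_random_hidden_layer(x, y, n_hidden=50, ridge=1e-6):
    """One-hidden-layer tanh network: random, fixed hidden weights;
    output weights solved by ridge-regularized least squares."""
    w = rng.normal(0.0, 2.0, size=n_hidden)        # hidden weights (not trained)
    b = rng.uniform(-np.pi, np.pi, size=n_hidden)  # hidden biases (not trained)

    def hidden(x_in):
        # Hidden activations plus a constant column acting as the output bias.
        h = np.tanh(np.outer(x_in, w) + b)
        return np.hstack([h, np.ones((len(x_in), 1))])

    H = hidden(x)
    # Solve (H^T H + ridge * I) c = H^T y for the output weights c.
    c = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ y)
    return lambda x_in: hidden(x_in) @ c

x_train = np.linspace(-np.pi, np.pi, 200)
net = fit_random_hidden_layer(x_train, np.sin(x_train))

# Evaluate on points that are not on the training grid.
x_test = np.linspace(-3.0, 3.0, 101)
max_error = np.max(np.abs(net(x_test) - np.sin(x_test)))
print(f"max |net(x) - sin(x)| on test grid: {max_error:.4f}")
```

The fit tracks the sine closely, yet none of the fifty-odd numbers inside `net` says anything recognizable about sines, which is exactly the worry for a scientist.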
Machine Learning imports an almost entirely distinct paradigm and methodology into research, parallel to but separate from science. In traditional science, models are generated in a directed, deliberate, and well-reasoned way (the scientific method). In Machine Learning, hypotheses are generated by trial and error, always to some degree and sometimes entirely. Models are not created so much as evolved (trained, in machine learning parlance).
And this allows for a sort of profound intellectual laziness on the part of some researchers who imbibe the Deep Learning Kool-Aid, best summarized by their catchphrase: “Throw a Neural Network at the problem”. Doing so de-prioritizes understanding, practically exporting the role of science, of learning, to machines. This does not bode well for the role of robust and principled theory in science, which is why I feel inclined to give credence to Anderson’s inflammatory 2008 op-ed. A few years back, many researchers with data to model would find that deep learning strategies consistently provided the best prediction accuracy, and would take this as the tacit signal that the research was complete.
Chomsky seems to believe this is calamitous, and “derided researchers in machine learning who use purely statistical methods to produce behavior that mimics something in the world, but who don’t try to understand the meaning of that behavior.” (MIT Tech Review)
There is a notion of success … which I think is novel in the history of science.
It interprets success as approximating unanalyzed data.
I tend to agree with him. But let’s hear Norvig out. He proceeds in his blog:
It is not “The End of Theory.” It is an important change (or addition) in the methodology and set of tools that are used by science, and perhaps a change in our stereotype of scientific discovery. […]
In complex, messy domains, particularly game-theoretic domains involving unpredictable agents such as human beings, there are no general theories that can be expressed in simple equations like F = m a or E = m c². But if you have a dense distribution of data points, it may be appropriate to employ non-parametric density approximation models such as nearest-neighbors or kernel methods rather than parametric models such as low-dimensional linear regression. […]
The great thing about F = m a is that it is so simple that we can easily see how to apply it to falling objects on Earth, and then to the orbits of the moon and planets, and then to the flight of spacecraft. But complex models may hold secrets that they are less willing to give up. […]
Sure, we all love succinct theories like F = m a. But social science domains and even biology appear to be inherently more complex than physics. Let’s stop expecting to find a simple theory, and instead embrace complexity, and use as much data as well as we can to help define (or estimate) the complex models we need for these complex domains.
There’s a lot to unpack here.
Someday, surely, we will see the principle underlying existence as so simple, so beautiful, so obvious that we will all say to each other, ‘Oh, how could we all have been so blind, so long.’ — John Archibald Wheeler
Here is a beautiful affirmation of the traditional ontology of science — the faith that there exists a model of reality which is so simple and yet corresponds so well with observed phenomena that we grant this model the designation of scientific theory. This intuition, this faith in the existence of such a model (exhibited by Wheeler’s quote) is exactly what Norvig says Chomsky must abandon for “complex, messy domains” like social science.
However, Norvig also says something unobtrusively radical (above).
The crux, I believe, of Norvig’s argument, is cautiously worded as:
But complex models may hold secrets that they are less willing to give up.
Here, Norvig is making the claim that in fact, these models may actually contain explanatory power, but more deeply buried than the insights one might garner from another type of model. Anderson claims “science can advance even without coherent models, unified theories, or really any mechanistic explanation at all” which may seem fairly accurate when one considers that classification accuracy keeps going up and we don’t have a clear idea why. But he’s mistaken, because uninterpretable models aren’t particularly actionable. Science, after all, is explanation and prediction. We mustn’t give in to this idea, but instead, discover means of understanding these complex models and teasing out the explanations for the phenomena being modeled that might lie therein.
Norvig illustrates the idea with this koan:
There’s a famous (well famous in Artificial Intelligence circles, anyways) Zen koan that goes like this:
In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
“What are you doing?”, asked Minsky.
“I am training a randomly wired neural net to play Tic-Tac-Toe,” Sussman replied.
“Why is the net wired randomly?”, asked Minsky.
“I do not want it to have any preconceptions of how to play”, Sussman said.
Minsky shut his eyes.
“Why do you close your eyes?”, Sussman asked his teacher.
“So that the room will be empty.”
At that moment, Sussman was enlightened.
Minsky was showing Sussman that the randomly-wired neural net (a complex model if ever there was one) actually did have preconceptions; it’s just that we don’t understand what these preconceptions are.
The implication here is that there might exist robust theories and beautiful models underlying many of these complex phenomena, but that we simply choose not to look for them inside the many parameters of the neural network. In this koan, Sussman resembles the modern-day practitioner of deep learning, who merely seeks to make the network work rather than to understand how it works. Minsky, by contrast, exhibits the intellectual courage needed to extract the secrets these models are less willing to give up, to peer inside the black box.
Once this neural network renaissance of “deep learning” really got going (circa 2013), it didn’t take long for machine learning researchers to begin trying to pry open the black box. The first of these, as far as I can tell, was Zeiler and Fergus’ seminal “Visualizing and Understanding Convolutional Networks”.
TensorFlow, a deep learning library released by Google, now has the methods presented in this paper built into its “TensorBoard” tool.
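One simple way to probe a black-box classifier, in the spirit of the occlusion experiments in that paper, is to hide one region of the input at a time and watch how the output changes. The sketch below is not the paper's deconvolutional visualization and uses no real network: `toy_score` is a hypothetical stand-in for a trained classifier's class score, and the image size and patch size are assumptions.

```python
import numpy as np

def toy_score(image):
    """Hypothetical stand-in for a trained classifier's class score:
    it only responds to brightness in rows 8-11, columns 8-11."""
    return float(image[8:12, 8:12].sum())

def occlusion_map(image, score_fn, patch=4, fill=0.0):
    """Slide an occluding patch over the image and record how much the
    score drops when each region is hidden."""
    base = score_fn(image)
    rows, cols = image.shape
    drops = np.zeros((rows // patch, cols // patch))
    for i in range(0, rows - patch + 1, patch):
        for j in range(0, cols - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            drops[i // patch, j // patch] = base - score_fn(occluded)
    return drops

image = np.ones((16, 16))
drops = occlusion_map(image, toy_score)
most_important = np.unravel_index(np.argmax(drops), drops.shape)
print(drops)
print("most influential patch (row, col):", most_important)
```

The map of score drops points at the one region the scorer actually depends on, without any access to its internals. With a real network the picture is noisier, but the question being asked is the same.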
But I think the most honest and thorough introspection I’ve seen must be Szegedy et al’s “Intriguing Properties of Neural Networks”.
In both of these papers, the authors seek to understand how these tremendously powerful models work, to develop a theory of neural networks instead of merely abdicating the role of thinking to these algorithmic tools.
Notable also is the work that’s been done on the “Manifold Hypothesis” of deep learning, which is extremely well described and visualized by Christopher Olah’s blog post:
He describes how neural networks might be plying the manifold upon which the data exists in order to separate it, which he visualizes in two dimensions:
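A toy, one-dimensional version of this idea can be sketched in a few lines. It is not Olah's code or his figure: the data, the two hand-picked ReLU units, and the helper `separable_by_threshold` are all assumptions made for illustration. The point is only that a hidden layer can reshape data so that classes which no single cut can separate in the input become separable afterwards.

```python
import numpy as np

# Toy 1-D data: class A sits in the middle, class B on both sides.
x = np.linspace(-3.0, 3.0, 60)
labels = (np.abs(x) > 1.0).astype(int)  # 0 = class A, 1 = class B

def separable_by_threshold(values, labels):
    """True if a single threshold on `values` splits the two classes."""
    a, b = values[labels == 0], values[labels == 1]
    return bool(a.max() < b.min() or b.max() < a.min())

def hidden_layer(x_in):
    """Two ReLU units with hand-picked weights (assumptions, not trained)."""
    h1 = np.maximum(0.0, x_in - 1.0)   # active only for x > 1
    h2 = np.maximum(0.0, -x_in - 1.0)  # active only for x < -1
    return np.stack([h1, h2], axis=1)

h = hidden_layer(x)
projection = h @ np.array([1.0, 1.0])  # one linear readout of the hidden layer

print("separable in input space: ", separable_by_threshold(x, labels))
print("separable in hidden space:", separable_by_threshold(projection, labels))
```

In the input space, class B surrounds class A, so no threshold works. After the hidden layer, every class A point collapses to zero and every class B point is pushed above it.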
This sort of work excites me a great deal, because it is unrelentingly intellectually brave to delve into these complex machines and try to understand them, rather than merely trying to get them to work.
It also points the way forwards for both machine learning research and machine learning-empowered research. Finally, it illuminates the contours of some truly beautiful questions:
How might we develop a language to begin to unravel complexity?
Might it be the case that what Norvig describes as “complex, messy domains” are in fact describable by simple, elegant theories in a different language than any we currently know? Recall that Sir Isaac Newton had to invent calculus, and within that language, the vast complexity of physics was reduced to simple rules. Might it be the case that these phenomena (life, intelligence) necessitate the inception of a new language, and perhaps a new paradigm, to deconvolve their complexity and explain it simply? How would one begin to go about this?