Scott Aaronson's Fall 2011 MIT 6.893 course reading list
Philosophy and Theoretical Computer Science
Tags: lay, computer science, philosophy, mit 6.893
I’m going through the reading list for Scott Aaronson’s Philosophy and Theoretical Computer Science course held at MIT in Fall 2011. Note that the course code is now used for a different topic. I have included links to the papers where available, but many of them have linkrotted. If you are interested, I have personal backups. I shall update this post with brief thoughts on the readings as I go through them.
- 6.893 Class Home (mit.edu)
- S. Aaronson, NP-complete Problems and Physical Reality
- S. Aaronson, Dude, It’s Like You Read My Mind (read 2025-06-20)
- S. Aaronson and J. Watrous, Closed Timelike Curves Make Quantum and Classical Computing Equivalent
- E. Bernstein and U. Vazirani, Quantum Complexity Theory
- N. Bostrom, Are You Living In A Computer Simulation?
- D. Chalmers, Does A Rock Implement Every Finite-State Automaton? (read 2025-06-20)
- D. Chalmers, The Singularity: A Philosophical Analysis
- D. Deutsch, Quantum Mechanics Near Closed Timelike Lines
- D. Deutsch, Quantum theory, the Church-Turing Principle and the universal quantum computer
- P. Domingos, The Role of Occam’s Razor in Knowledge Discovery (read 2025-06-30)
- C. Gentry, Computing Arbitrary Functions of Encrypted Data
- M. Hutter, On Universal Prediction and Bayesian Confirmation
- H. Levesque, Is It Enough To Get The Behaviour Right?
- S. Lloyd, Computational Capacity of the Universe
- C. Papadimitriou, Algorithms, Games, and the Internet
- I. Parberry, Knowledge, Understanding, and Computational Complexity
- S. Kritchman and R. Raz, The Surprise Examination Paradox and the Second Incompleteness Theorem
- J. Schmidhuber, A Computer Scientist’s View of Life, the Universe, and Everything
- S. Shieber, The Turing Test As Interactive Proof
- M. Sipser, The History and Status of the P versus NP Question
- R. Stalnaker, The Problem of Logical Omniscience I
- A. Turing, On Computable Numbers
- A. Turing, Computing Machinery and Intelligence (read 2025-06-10)
- L. Valiant, A Theory of the Learnable (read 2025-06-11)
- L. Valiant, Evolvability
- A. Wigderson, P, NP and Mathematics
- A. Wigderson, Knowledge, Creativity, and P versus NP (read 2025-06-08)
- R. de Wolf, Philosophical Applications of Computational Learning Theory
This new offering will examine the relevance of modern theoretical computer science to traditional questions in philosophy, and conversely, what philosophy can contribute to theoretical computer science. Topics include: the status of the Church-Turing Thesis and its modern polynomial-time variants; quantum computing and the interpretation of quantum mechanics; complexity aspects of the strong-AI and free-will debates; complexity aspects of Darwinian evolution; the claim that “computation is physical”; the analog/digital distinction in computer science and physics; Kolmogorov complexity and the foundations of probability; computational learning theory and the problem of induction; bounded rationality and common knowledge; new notions of proof (probabilistic, interactive, zero-knowledge, quantum) and the nature of mathematical knowledge. Intended for graduate students and advanced undergraduates in computer science, philosophy, mathematics, and physics. Participation and discussion are an essential part of the course.
Prerequisites: Previous coursework in either computability and complexity theory (such as 6.045 or 6.840) or analytic philosophy (such as 24.09)
S. Aaronson, Dude, It’s Like You Read My Mind (2005)
Aaronson presents Newcomb’s problem. There are two boxes; the first contains either $1,000,000 or nothing, and the second contains $1,000. You can choose to open just the first box, or open both boxes; you keep everything found in boxes you open. What the first box contained was determined by an adversary known as the Predictor.
The Predictor has already predicted what you’ll do. If he predicted you’ll open both boxes, then he left the first box empty; if he predicted you’ll open the first box only, then he put $1,000,000 in the first box. Furthermore, the Predictor has played this game hundreds of times before, with you and other people, and has never once been wrong.
To me, if we formulate this scenario formally such that “if he predicted you’ll open the first box only” is interpreted as “if you’ll open the first box only”, then there is a self-reference; yet there exist models of this description of the scenario in which either choice is represented.
Aaronson argues that the Predictor has perfect information and can perfectly simulate my decisions beforehand and put the appropriate amount into the first box. So I ought to choose to open just the first box (as that is what the Predictor would already have simulated me doing).
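As a crude way to see the tension (my own back-of-envelope calculation, not Aaronson’s formalism): if the Predictor is correct with probability $p$, the expected payoffs are

$$\mathbb{E}[\text{one-box}] = 1{,}000{,}000\,p, \qquad \mathbb{E}[\text{two-box}] = 1{,}000{,}000\,(1-p) + 1{,}000,$$

so one-boxing has the higher expectation whenever $p > 0.5005$; a Predictor that has never once been wrong is presumably far past that threshold.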
D. Chalmers, Does A Rock Implement Every Finite-State Automaton? (1996)
Chalmers responds to Putnam. Chalmers is known for a particular naturalistic dualist position in the theory of mind, which posits the existence of non-physical phenomena called qualia, related to one’s subjective experience of the external world. However, it is not this position that is being discussed here.
Rather, Putnam has made an argument against computational functionalism, which generally contends that some mental phenomena can be described as the execution of computer programs. Chalmers’ characterization of one of Putnam’s arguments is that “every ordinary open system realizes every abstract finite automaton”. I will adapt the arguments a bit.
For this argument as described by Chalmers, first assume that an open system in a sufficiently complex environment, with sufficiently demarcated boundaries, does not revisit an old state. That is, first fix a certain way to describe the distinct possible physical states of the objects in the system at a particular time. Then at any time $t$, the system does not possess the physical state it had at any earlier time $t' < t$.
Now discretize time into time steps, and consider the state of the system at each time step to be the complete description of the system as it evolves within this time step. Then the state at each time step is distinct, so the evolution of the state over time steps has a natural correspondence with the natural numbers indexing the time steps themselves. One can then define an equivalence relation adversarially among the natural numbers that partitions them into a finite number of equivalence classes, consider the equivalence classes as the states of a DFA, and define the transitions between them so that the class of each $n$ transitions to the class of $n+1$, as caused by the evolution at time step $n$. Such a DFA transitions at every step and ignores the input; these are here called inputless FSAs. Regarding unreached states, perhaps encode states unrealized by the system into negative numbers that then get mapped to unreached states.
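A minimal sketch of this construction (my own illustration, with made-up names): given any inputless FSA and a trace of distinct physical states, map the state at time step $n$ to the FSA state at step $n$; the equivalence classes are then the preimages of the FSA states.

```python
def putnam_mapping(physical_trace, fsa_transition, fsa_start):
    """Map each distinct physical state in the trace to the abstract FSA state
    that the inputless FSA is in at that time step."""
    assert len(set(physical_trace)) == len(physical_trace), "states must be distinct"
    mapping = {}
    state = fsa_start
    for phys in physical_trace:
        mapping[phys] = state          # time step n realizes the FSA's n-th state
        state = fsa_transition[state]  # the FSA advances once per step
    return mapping

# Any 6-step trace of distinct physical states "implements" the cyclic FSA A -> B -> C -> A.
trace = [f"rock@t{n}" for n in range(6)]
fsa = {"A": "B", "B": "C", "C": "A"}
print(putnam_mapping(trace, fsa, "A"))
# {'rock@t0': 'A', 'rock@t1': 'B', 'rock@t2': 'C', 'rock@t3': 'A', 'rock@t4': 'B', 'rock@t5': 'C'}
```

The equivalence classes are exactly the preimages of the FSA states (here {rock@t0, rock@t3} maps to A, and so on), which is what makes the construction so permissive.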
The force of Putnam’s argument is that even a rock experiences mental phenomena, if mental phenomena are computational processes.
Chalmers raises two points of contention regarding Putnam’s argument, which he then resolves. I would resolve them differently, but they can be resolved.
One is centered around the word causes. Take an open system, and fix the mapping to a DFA based on an execution. But because it is an open system, at any time step $t$ before its termination, should the interaction with the environment be different, the description of the state at $t+1$ would be a different one. Because of this counterfactual, we cannot say that a state causes the next state. Personally, I agree with Chalmers’ rebuttal, but I don’t think that causation was essential to Putnam’s argument. It was sufficient for me that one state led to another, even in an open system, without need for causation, but I don’t know whether Putnam would then have disagreed. The way Chalmers resolves it is to simply consider, instead of a rock, a stable system like a clock (one that is comparatively unaffected by common environmental conditions).
Regarding the second point of contention, Chalmers observes that DFAs specify state-transitions that are not taken as well (one example is the set of state transitions away from an unreachable state). So again, our mapping must satisfy that if it is physically possible for the state indexed by $i$ at time $t$ to be followed by the state indexed by $j$ at time $t+1$, then the DFA must have the transition from the class of $i$ to the class of $j$ (we of course generalize our use of the natural numbers to some larger set that indexes all physical states). Chalmers contends that because any number of DFAs varying on the state-transitions agree on the states and transitions that the physical system actually takes, we cannot then say that the system implements any particular DFA.
Personally, I don’t think this argument works either. Putnam’s contention was that depending on the mapping from physical states to DFAs, the trace could be a trace of any number of DFAs. This observation just strengthens the point: even with the mapping fixed with respect to the trace, the trace could be a trace of any number of DFAs (of a more restricted class).
Chalmers seems to resolve this differently. Instead, we should consider an infinite set of initial states of our system (he calls the choice of initial state a dial), so that every state transition of the DFA can be traced through by the system starting from the equivalence class of some initial state.
Ultimately, Chalmers agrees that using inputless FSAs to specify a mind is inappropriate, and computational functionalists do not attempt to do that anyway, understanding that a richer computational object is needed for their goals.
Chalmers then discusses Putnam’s later consideration of a DFA that transitions based on input. Putnam argues that such a DFA is realized by some open physical systems that satisfy certain constraints on their system/environment boundary. Chalmers elaborates on the significance of this: conversely, fix a physical system and a mapping of boundary conditions to input symbols. Then there are an unbounded number of DFAs that satisfy these constraints. Putnam’s construction is a variation of the previous one, except that the constructed transitions are adversarially chosen to take the observed prior state and environment as input.
Putnam notes that if a mind can be made by implementing some particular automaton, and only the I/O sequence of a system determines whether the automaton is implementable or not, then the mind only matters to the extent that it maps the input sequences to output sequences.
Chalmers observes that with Putnam’s construction, after fixing a mapping to a DFA that agrees on the trace, we again have no guarantee that the DFA agrees with respect to the transitions not taken by the trace. The case of initial states is solved by considering only physical systems with a stable clock and dials. But we still have untraced transitions based on unobserved inputs. To address this, Chalmers further restricts the physical systems under consideration to ones that remember their input history. In this case, every state of the system is distinct; distinct inputs and input histories lead to different states. We can then draw the right equivalence classes to collapse this state space into a DFA with the right transitions.
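A minimal sketch of this collapse (again my own toy example, not Chalmers’ notation): map each remembered input history to the DFA state reached by running the DFA over it; the equivalence classes are the preimages of the DFA states, and the induced transitions agree with the DFA even on inputs never actually observed.

```python
from itertools import product

def collapse(dfa_delta, dfa_start, alphabet, max_len):
    """Map each input history (the state of a system that remembers everything
    it has received) to the DFA state reached by running the DFA over it."""
    def run(history):
        state = dfa_start
        for symbol in history:
            state = dfa_delta[(state, symbol)]
        return state
    histories = [h for n in range(max_len + 1) for h in product(alphabet, repeat=n)]
    return {h: run(h) for h in histories}

# Example: parity DFA over {0, 1}; tracks whether an even or odd number of 1s was seen.
delta = {("even", 0): "even", ("even", 1): "odd",
         ("odd", 0): "odd",  ("odd", 1): "even"}
classes = collapse(delta, "even", [0, 1], max_len=2)
print(classes[(1, 1)], classes[(1,)])  # even odd
```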
The conclusion that Chalmers arrives at is that physical systems that remember their input history would possess a mind, if DFAs with I/O were sufficient for modelling mentality. For example, everyone with the same inputs and same outputs will still (be able to) realize the same DFA. The argument that a rock implements a mind does not apply to this formalism, but a very banal system still possesses a mind.
Chalmers thinks that DFAs are insufficient because their “states” lack internal structure, and the structure is completely specified by the transitions. Chalmers thinks that Turing machines and cellular automata, in contrast, have states with internal structure. I suppose Chalmers is thinking of the infinitely long tape of a TM or the unbounded 1-D space of cellular automata.
Chalmers introduces the notion of a combinatorial state automaton (CSA), which augments each state with a feature tuple. State transitions must then also map feature tuple to feature tuple. Self-transitions that modify just the feature tuple are permissible. It is clear that formally, for tuples over finite value sets, this has the same power as DFAs.
The mapping is modified as follows. We first a priori subdivide the system and input into components that are easy to describe, so that the physical and input states are tuples of these components; call each tuple a substate. Then define the causal transitions between the tuples under input and output tuples; these follow the laws of physics, so they are again easy to describe. In this substate space, again find a way to partition the substate space and input space that is easy to describe. Then induce a DFA on this partition, drawing transitions wherever all but vanishingly few of the substates go from one partition to the next. Then all the DFAs implemented by the physical system (with respect to the subdivision) are these DFAs, ranging over the ways to partition the substate space. The complexity of the DFA is measured by how difficult it is to describe the state transitions of the DFA. Chalmers associates the substates with the feature tuples, so this is a combinatorial state automaton.
As an illustrative example, perhaps we subdivide a rock into its molecules, and define the causal transitions. Suppose we partition the substate space in a trivial way, associating each distinct feature tuple with a distinct state, and this induces a very large DFA with a lot of transitions (it is guaranteed to be induced because of the causal transitions). However, it is comparatively extremely simple to describe the DFA’s transitions because the physical laws that govern the evolution of each particular molecule are extremely uniform. Perhaps we can find an artificial partition, but then such a partition is difficult to describe.
Chalmers discusses issues regarding “false implementation” that seem to be related to the notion of an “artificial partition” above. He wonders about “specifying an additional constraint on causal structure”.
[We might] add a “uniformity” clause, specifying that underlying each formal state-transition of a CSA there needs to be a uniform causal mechanism, as opposed to the highly disjunctive causal mechanism found in the false implementations. Another might be to add a causal relevance clause along the lines suggested before. But it is not obvious just how these suggestions should be spelled out in detail, so I leave the project as a challenge for the future.
The “easy to describe” criterion I mentioned is similar to his “uniformity” clause. We can moreover formalize this by perhaps saying “easy to describe” means having low K-complexity.
In Chalmers’ presentation, the subdivision into components was based on physical location. He thinks that a weaker “independence” notion could suffice, and here I say that the “easy to describe” notion suffices.
In fact, I shall try something else. Consider the whole state space and input space of an open physical system. Now partition the state space $S$ and input space $I$ of the open system into partitions $\mathcal{S}$ and $\mathcal{I}$ respectively, according to some parsimonious criterion; the spaces themselves may be uncountable, yet they perhaps can be partitioned with a very succinct criterion. When done, define a DFA on $\mathcal{S}$ and $\mathcal{I}$ to be induced by a particular time step $\Delta t$, possessing the state transition $(P, Q) \mapsto P'$ if every $s \in P$, under every input $i \in Q$, causally transitions to some $s' \in P'$ after $\Delta t$. Then define the computational power of the system by something like $\sup_{\mathcal{S}, \mathcal{I}} \big(K(\delta_{\mathcal{S},\mathcal{I}}) - K(\mathcal{S}) - K(\mathcal{I})\big)$, where $K(\cdot)$ denotes the Kolmogorov complexity of an object and $\delta_{\mathcal{S},\mathcal{I}}$ the induced transition function.
This addresses the triviality argument: a trivial system will be seen to have low computational power, since the $K(\mathcal{S})$ and $K(\mathcal{I})$ of an artificial partition will be high. This addresses the robust transition criterion: states that do not transition in the same way as the rest of their partition will require a redefinition of $\mathcal{S}$ and/or $\mathcal{I}$ to exclude them. Such a redefined $\mathcal{S}$ and $\mathcal{I}$ will have larger K-complexity.
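To make the induced-DFA condition concrete, here is a minimal Python sketch (the toy system and partitions are my own, purely illustrative); it checks whether a candidate partition of the state and input spaces induces a well-defined DFA under one time step, which is the precondition before any of the K-complexity scoring above would apply.

```python
from itertools import product

def induced_dfa(state_part, input_part, step):
    """Return the DFA transition table induced by the partitions under `step`,
    or None if some block's members disagree on where they go (the partition
    then fails the robust-transition requirement)."""
    block_of = {s: b for b, block in enumerate(state_part) for s in block}
    table = {}
    for s_b, st_block in enumerate(state_part):
        for i_b, in_block in enumerate(input_part):
            targets = {block_of[step(s, x)] for s, x in product(st_block, in_block)}
            if len(targets) != 1:   # members of the block transition differently
                return None
            table[(s_b, i_b)] = targets.pop()
    return table

# Toy system: states 0..5 evolving by "add the input, mod 6".
step = lambda s, x: (s + x) % 6
parity_states = [[0, 2, 4], [1, 3, 5]]   # partition states by parity
parity_inputs = [[0], [1]]               # partition inputs trivially
print(induced_dfa(parity_states, parity_inputs, step))
# {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0} -- a 2-state parity DFA
```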
P. Domingos, The Role of Occam’s Razor in Knowledge Discovery (1999)
In my response to this paper, I shall say “encoded” to refer to naive, uncompressed encodings of a model, unless otherwise specified.
Domingos writes a position paper cautioning KDD practitioners at that time against an unwarranted preference for simpler models. He distinguishes between two notions of Occam’s razor. What he calls the first razor states that if two models generalize equally well, then prefer the simpler one for explainability. He accepts this. He notes that it is often applied instead with a subtle but crucial difference; what he calls the second razor is to prefer simpler models if they perform equally on the training set, with the understanding that simpler models tend to generalize better.
In general, Domingos argues that the second razor is a dangerous heuristic. However, he fails to distinguish between two different notions of “simpler” models. What he is arguing against is generally a preference, once a model architecture is fixed, for models that are naively simpler in that architecture (e.g. fewer parameters, lower tree depth, fewer constants). But these are just proxies for some theoretical descriptively simplest model that describes the data well, once we generalize the architectures we consider to arbitrary Turing machines predicting labels given features. In this architecture-independent setting, we should prefer models that are theoretically simplest, which, with respect to a given architecture, may be best approximated by a more naively complex model. In fact, some of his arguments are that more complex models with regularization are better than naively simpler models, but regularization is a way to apply the second razor, just not in a way that is biased towards models that are naively simpler in an architecture-dependent sense.
What follows are my thoughts on his arguments; my point-by-point responses may not be very well thought out or expressed.
He first rebuts several arguments for the use of the second razor.
One argument in favor is based on a result from PAC-learning: if we start off with a small hypothesis class a priori, and then find that one of its members agrees well with the data, then we can be confident that it will do well on future data. The hope is that since only a small class of simple models can be described succinctly, should a dataset be well-described by some model in this class, that model will generalize well. The pitfall should be quite obvious. What if the true hypothesis lies outside this class? Nor can we a priori just choose an arbitrary small hypothesis class that is not succinct to describe, as again there is no assurance that the true hypothesis is similar to any hypothesis in our small set of hypotheses under consideration.
Another argument is based on the Bayesian literature. The procedure of Bayesian model averaging (BMA) generally asks that the practitioner not discard “non-modal” models, but retain them and have them perform a weighted vote whose result is then averaged. With this seen as intractable, we satisfy ourselves with choosing the maximum a posteriori (MAP) model structure (linear model/quadratic model/cubic model, etc., if we are considering only polynomial models); we further assume a uniform prior over model structures, so that the model structures are a priori equally likely. This reduces the task to finding the model structure with maximum marginal likelihood. The marginal likelihood is a weighted integral, over choices of parameters, of the likelihood of the data given the parameters and model, times the prior probability of the parameters given the model. Again, this integral is infeasible to compute, so we assume that this function of the parameters is unimodal with most of the mass concentrated at the mode, which is reasonable when the number of data points is large. Then we approximate this function with a Gaussian having the same mode and precision (Laplace’s approximation). The covariance is related to the Fisher information, and the resulting log-determinant penalty grows as $\frac{d}{2}\ln n$, where $d$ is the dimensionality and $n$ is the number of samples. Under this approximation, and eliminating terms that vanish for large $n$, the BIC is proportional to the log marginal likelihood.
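For concreteness, the standard Laplace-approximation bookkeeping behind this (textbook material rather than Domingos’s own notation; $\hat{\theta}$ is the posterior mode, $d$ the parameter dimensionality, $n$ the number of i.i.d. samples, and $A$ the Hessian of the negative log posterior at $\hat{\theta}$):

$$\ln p(D \mid M) \approx \ln p(D \mid \hat{\theta}, M) + \ln p(\hat{\theta} \mid M) + \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln \lvert A \rvert.$$

Since $\lvert A \rvert$ grows like $n^d$, keeping only the terms that grow with $n$ gives

$$\ln p(D \mid M) \approx \ln p(D \mid \hat{\theta}, M) - \frac{d}{2}\ln n, \qquad \text{i.e.} \quad \mathrm{BIC} = -2 \ln p(D \mid \hat{\theta}, M) + d \ln n.$$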
The relevance of this to Occam’s razor is that structures with more parameters (higher dimensionality of the Gaussian) are penalized. Domingos argues that this may cause a simpler model to be preferred over a correct model. I don’t find this particularly convincing. The BIC encodes some generally healthy skepticism regarding more expressive model structures, and with enough data, the BIC should penalize the simpler, less accurate models enough to prefer the correct model structure. Of course, in hindsight, we should consider preferring models with fewer parameters as some form of regularization (one that may take other guises), albeit one that helps with tractability with respect to the model architecture.
Domingos then moves on to discuss the minimum description length (MDL) principle. He notes that “there is no guarantee that it will select the most accurate model.” I will return to this claim later. He then analyzes the related minimum message length (MML) principle. The data term in MML represents the additional information needed to explain the dataset given the model’s predictions, and there is another term that represents the optimal coding of the model in bits. Assuming discrete models, this term is similarly determined by the prior distribution on the parameters. Domingos complains about the arbitrariness of the prior. The criticism is similar, and I would respond similarly.
Nevertheless, Domingos’ criticisms are in fact not directly against the second razor, but toward tractable and fallible proxies of the second razor. In the most powerful realization of the second razor, we do not select models from a particular family, but instead seek the shortest description of a Turing machine that maps features to labels, and we should expect this to predict the dataset perfectly except in case of multiplicity of feature vectors corresponding to different labels, in which case we add some compressed encoding of the residuals to the description. This TM ought to generalize best, in the sense that if we should look at slightly longer descriptions and find ones that also do well on some validation dataset combined with the training dataset, then
- there should also be competing TMs of similar description length that perform worse on the validation dataset to a comparable degree and it would not have been possible a priori to choose between them,
- repeating the procedure over both datasets should find a more descriptively succinct model than the description of the model we obtained by perturbation.
That is, we have searched across all predictive models and chosen the one that minimizes Kolmogorov complexity (K-complexity). Against K-complexity, the earlier criteria should be seen as a variation where we have very strongly restricted the models from arbitrary TMs to the traditionally studied models.
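To be clear about what an (uncomputable) architecture-independent razor would look like in miniature, here is a toy, heavily resource-bounded stand-in: enumerate candidate predictors in order of description length and return the shortest one consistent with the training data. The “programs” here are just small boolean formulas over the features, which is exactly the kind of restricted proxy criticized above, so treat it purely as a sketch of the shape of the procedure.

```python
from itertools import product

def formulas(features):
    """Enumerate (description, predictor) pairs: literals, then pairwise
    conjunctions and disjunctions of literals."""
    atoms = [(f, lambda x, f=f: bool(x[f])) for f in features]
    atoms += [(f"not {f}", lambda x, f=f: not x[f]) for f in features]
    yield from atoms
    for (d1, p1), (d2, p2) in product(atoms, repeat=2):
        yield (f"({d1} and {d2})", lambda x, p1=p1, p2=p2: p1(x) and p2(x))
        yield (f"({d1} or {d2})", lambda x, p1=p1, p2=p2: p1(x) or p2(x))

def shortest_consistent(data, features):
    """Return the shortest-description formula that fits the data exactly."""
    for desc, pred in sorted(formulas(features), key=lambda dp: len(dp[0])):
        if all(pred(x) == y for x, y in data):
            return desc
    return None

data = [({"a": True,  "b": False}, True),
        ({"a": False, "b": False}, False),
        ({"a": True,  "b": True},  True)]
print(shortest_consistent(data, ["a", "b"]))  # prints "a"
```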
Domingos then exhibits some arguments that he claims show that the second razor is not true. Schaffer’s conservation law of generalization performance tells us that when the datapoints are arbitrary, generalization does no better than random guessing. The No Free Lunch theorem makes a similar argument with respect to the choice of optimizer, and how fast learning happens. Mitchell makes a similar point as Schaffer, saying that bias is required in generalization.
It seems to me that from an uninformative, architecture-agnostic perspective, introducing domain-dependent bias, choosing a model architecture, or selecting a particular model from an architecture based on data all look the same. All of these look like advice $A$ for generating predictions $\hat{y}$ of the labels $y$. The generated predictions may then be compared against a hold-out dataset with a measure like the empirical loss $\ell(y \mid \hat{y})$. This loss can be viewed as an extremely resource-bound descriptive complexity of the residuals of $y$ assuming that we know $\hat{y}$. When there are patterns in the data that the advice captures, the advice generates $\hat{y}$ closer to $y$, such that we should then expect something like $\ell(A) + \ell(y \mid \hat{y}) < \ell(y)$. That is, $A$ compresses the residuals in a way that the resource-bound measure could not afford to find on its own, so that the length of $A$ plus the length of the new residuals is lower than the length of the old residuals. When the advice fails to capture patterns in the data, $\ell(y \mid \hat{y})$ is close to the amount of information needed to describe $y$ outright. So the model just adds dead weight: the length of $A$ plus the length of the new residuals exceeds the length of the old residuals.
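A toy accounting with made-up numbers: if the raw labels need $\ell(y) = 1000$ bits, then advice of length $50$ bits that shrinks the residuals to $600$ bits pays off, while advice of the same length that only shrinks them to $990$ bits is dead weight:

$$50 + 600 = 650 < 1000, \qquad 50 + 990 = 1040 > 1000.$$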
Domingos then discusses a result based on the VC dimension. But the VC dimension again is an imperfect proxy for descriptive complexity.
Domingos then discusses whether overfitting (which is thought to lead to poor generalization) is due to overly complex models or to multiple testing. But these are actually issues regarding certain model architectures. Conventionally we are forced to choose between inexpressive models that capture simple, general trends, and expressive models that capture complex dynamics within the data (if those exist). But there is no reason why we cannot have model architectures (and corresponding training regimens that select from them) that allow themselves to be more expressive near evidence, and restrain themselves from complexity far from evidence. The alternative methods to combat overfitting that Domingos proposes indeed regularize complexity.
Domingos then discusses the bias-variance tradeoff. My reply above also applies in this case.
Domingos analyzes what is going on regarding the pruning of tree models. I agree with that analysis. I note that the issue is that simplicity in terms of the tree architecture is being used as a proxy for capturing simple trends in the data itself and for regularizing complexity far from evidence.
The same goes for the 1R algorithm.
I want to address Domingos’ discussion regarding the application of the second razor in physics and other domains. I agree with his rebuttals of contrary positions, that general relativity does in fact make more assumptions than Newtonian gravity. But I disagree with what he assumes to be obvious: that we should use a spherical model of the Earth over the flat-Earth model as it is more accurate, though more complex. Taken to the extreme, we should not use the spherical model either; it is still just an approximation. Instead, we should use the full sum of our knowledge of the topography of the Earth, which still is just an approximation. In turn, this is absurd for most applications, as we would not need such precision, and furthermore, the topography of the Earth at this level of detail changes significantly over time. So similarly, there are far more applications that require just that we model the Earth as a sphere, and even more that require just that we model the Earth as a flat plane.
He later discusses what he considers to be empirical evidence against the second razor. As before, I agree to the extent that these are arguments against architecture-specific proxies for complexity. I agree that apparently complex models can perform better with regularization techniques, but I would contend that this is just evidence pointing to the second razor in an architecture-independent form. Naive simplicity in the model parameters fails to capture the inherent simplicity that regularization succeeds in capturing; the increased expressivity of the model parameters is required to express the kind of internal simplicity actually present in the training data.
A. Turing, On Computable Numbers (1936)
In this paper, Turing describes what we refer to today as Turing machines. He calls $m$-configurations what we today call the states of the machine. He uses the same terminology of “tape” and “symbols”, but calls “squares” what we might call cells. Instead of the “head”, he discusses the “scanned square”, which is “in the machine”. The distinguished symbol is said to be “blank”. The typical TM he calls an $a$-machine, or automatic machine, and he distinguishes a variant called a choice machine or $c$-machine, where some transitions are partially specified and may appeal to an external system to make a choice between possible transitions. He does not focus on the $c$-machines.
He discusses TMs in the context of computing numbers, so he actually defines a variant of a transducer that reads no input. The machine uses the same tape for output and for intermediate computation, but uses distinct alphabets (I refer to these as the output alphabet and the internal alphabet) to distinguish between the two. At any point, the subsequence of symbols from the output alphabet written so far should be taken to be the output at that step. Since the task is to compute binary expansions of (computable) real numbers, he does not expect the machines to terminate, but that progress is always made, i.e. the next output bit is written after a finite number of steps. I do not think that he expects output bits to be changeable. The internal alphabet is used as memory. Turing later adopts a convention that effectively makes his TM into a 2-track TM, where one track holds the internal symbols liable to erasure, and the other holds the output symbols.
Machines that eventually stop making progress are called circular machines. Machines that always make progress are called circle-free machines.
A computable number is a number whose bits are printed by some circle-free machine, and we can refer to each computable number by the number encoding the description of a TM that prints out its binary sequence. Turing calls the description numbers of circle-free machines “satisfactory”.
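As a toy example of a circle-free machine in this sense (my own encoding, not Turing’s original table), here is a two-configuration machine that prints the binary expansion of $1/3 = 0.010101\ldots$, always making progress and never terminating:

```python
# A minimal sketch of a circle-free machine: it never terminates but always
# makes progress, printing the binary expansion of 1/3 = 0.010101...
# The two m-configurations alternate between printing 0 and 1 on successive
# output squares; no erasable internal track is needed for this example.

def one_third_machine(steps):
    tape = []                 # output squares only
    m_config = "b"            # Turing's term for the machine state
    for _ in range(steps):
        if m_config == "b":
            tape.append("0")  # print 0, move right
            m_config = "c"
        else:                 # m_config == "c"
            tape.append("1")  # print 1, move right
            m_config = "b"
    return "".join(tape)

print(one_third_machine(10))  # "0101010101"
```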
Turing then talks about the existence of a Universal TM.
He then considers applying the diagonal argument to the computable numbers: although they are countable, the diagonal construction fails to yield a new computable number, because deciding which machines are circle-free is itself not computable, essentially the non-existence of a HALT program.
Next, Turing provides two arguments appealing to intuition that the Turing machine is expressive enough to model computation consisting of the symbolic manipulation done by someone working through math, and finally he provides examples of large classes of numbers which are computable.
With the above, he relates the non-existence of a HALT program to Hilbert’s Entscheidungsproblem, the question of whether there exists an effective procedure that can prove or disprove any logical statement, and concludes that there is no such procedure.
A. Turing, Computing Machinery and Intelligence (1950)
Here Turing introduces the Turing test as a criterion for assenting to the question “Can machines think?” It is quaint to review this in the present milieu. Today, machines are indistinguishable from humans in restricted domains of discourse. Beyond that, LLMs have the ability to hold a casual conversation comparably to a lay person. After introducing the test, Turing anticipates objections and responds to them. Finally, he briefly presents the notion of learning machines. This is in response to a variant of Lovelace’s objection that a machine can “never do anything new”. He talks about how impoverished the signal of punishments and rewards is with respect to learning a concept, and discusses Symbolic AI systems that logically build up new facts and conditional rules from a dataset of facts and conditional rules, and act on them. He discusses some element of randomization in choosing which inferences to make. This is the interpretation Turing has here with respect to doing something new: the system synthesizes new conditional rules and acts on them, beyond the data that it has been given. Personally, I find that interpreting “never do anything new” as asking for a machine that synthesizes from its inputs demands a harder task than what the objection requires. Socially, we would consider another person to be “doing something new”, or to “originate something”, if we find that they are responding even simply to a part of the environment that we are not aware of. We are frequently satisfied that another human learns from their environment even when they do not do much synthesis on those inputs.
The discussion on the role of electricity in computers, as well as the anticipated objections, are very reflective of the milieu that Turing was writing in, and quite enjoyable to witness.
L. Valiant, A Theory of the Learnable (1984)
This is the paper that introduced PAC-learning (probably approximately correct learning). Despite that, the focus of the paper is on the efficient learnability of a particular class of concepts (boolean functions), rather than on the PAC-learning framework introduced to analyze it. In the decades since, more general classes of concepts have been studied for their PAC-learnability. For context, I shall immediately state a modern definition of PAC-learning.
Let $X$ be a domain of instances, and let $Y$ be classification labels. A concept $c : X \to Y$ is a function that labels each instance. Write $\big((x_1, c(x_1)), \ldots, (x_m, c(x_m))\big)$ to denote tuples of labelled pairs. We say that
- an algorithm $A$
- using a hypothesis space $H$
- $(\epsilon, \delta)$-learns a concept $c$
- using instances $(x, c(x))$
- with $x$ distributed according to the distribution $D$ over $X$
- using $m$ instances

if

$$\Pr_{x_1, \ldots, x_m \sim D}\Big[\Pr_{x \sim D}\big[h(x) \neq c(x)\big] \le \epsilon\Big] \ge 1 - \delta, \qquad \text{where } h = A\big((x_1, c(x_1)), \ldots, (x_m, c(x_m))\big) \in H.$$

That is to say, with probability at least $1 - \delta$ over $m$ i.i.d. instances distributed as $D$, the algorithm $A$, when trained on the instances, successfully learns a hypothesis $h \in H$ that is approximately correct; that is, $h$ disagrees with $c$ on a subset of $X$ that has total probability mass at most $\epsilon$. We call $\epsilon$ the accuracy parameter and $\delta$ the confidence parameter.
Fix a concept class $C$ and hypothesis space $H$. We call $m_A(\epsilon, \delta)$ the sample complexity of algorithm $A$ if it is the minimal $m$ for which, for any $\epsilon, \delta > 0$ and arbitrary concept $c \in C$ and distribution $D$, $A$ $(\epsilon, \delta)$-learns $c$ using $m$ instances distributed according to $D$. We say that $A$ has polynomial sample complexity if $m_A(\epsilon, \delta)$ is eventually bounded above by a multinomial in $1/\epsilon$ and $1/\delta$.
We say that a concept class $C$ is PAC-learnable with respect to a hypothesis space $H$ if some algorithm $A$ has polynomial sample complexity. Furthermore, we generally say that it is efficiently PAC-learnable if $A$ has running time polynomial in certain parameters we care about, like its input size.
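As a concrete instance of polynomial sample complexity (a standard bound for the realizable case with a finite hypothesis class, not something specific to Valiant’s paper): if $H$ is finite, contains the target concept, and $A$ returns any hypothesis consistent with the sample, then

$$m \ge \frac{1}{\epsilon}\left(\ln \lvert H \rvert + \ln \frac{1}{\delta}\right)$$

samples suffice to $(\epsilon, \delta)$-learn, since the probability that some hypothesis with error greater than $\epsilon$ remains consistent with all $m$ samples is at most $\lvert H \rvert (1 - \epsilon)^m \le \lvert H \rvert e^{-\epsilon m} \le \delta$.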
A. Wigderson, Knowledge, Creativity, and P versus NP (April 2009)
Wigderson gives a lay introduction to P and NP, informally characterizing NP as the class of problems for which solutions are difficult to find but easy to verify. He uses trunk packing as a relatable example of an NP-complete problem. Near the end, he talks about cryptographic protocols being based on the hardness of NP problems (in the guise of one-way functions and trapdoor functions), as well as the notion of zero-knowledge proofs, to which he has contributed seminal work.
Wigderson here considers the heavily culturally loaded notion of creativity to be materially equivalent to the heavy computational work of finding solutions to NP-hard problems. I consider this identification to be quite naïve, and I suppose that Wigderson has more respectable views than he lets on here, if not then, then perhaps since. But for a pop-sci discussion, I suppose it is acceptable. In particular, I don’t think this characterization sufficiently appreciates the dialectic involved in doing empirical science or engineering. Here, material and communicative limitations dominate, and all the internal time and space resources gifted to an agent cannot save it from learning a wrong concept due to improper instrumentation. One must repeatedly query other entities and nature, and even then the inaccuracy of instrumentation limits the complexity of feasible concepts. Even with perfect instrumentation, for example in the physical sciences, one immediately runs into EXPTIME-like computational intractability even in accurately simulating very “simple” molecular systems from first principles, so experimentation and querying the oracle of nature is fundamental to knowledge production.
Wigderson claims from intuition that “Every natural process can be viewed as computational”. I suppose that one can in some manner say this, once one has demarcated for ourselves some phenomena corresponding to input and output, and the boundaries of the system in which said process takes place. And still I suspect that the model of computation one would use to model such phenomena is very different from the models typically studied in computer science. In some sense, to the extent that we have microchips that do amazing computation, to that extent nature has the capacity to do so much computation in so little space. Our models, even physical models, are more reflective of human concerns than of any particularly impersonal and inhuman process.
R. de Wolf, Philosophical Applications of Computational Learning Theory (July 1997)
The author introduces the usual definitions of PAC-learning, the VC dimension, and fundamental results. He then applies these notions to language learning, and shows that classes of languages in the Chomsky hierarchy are not PAC-learnable, except in certain situations with access to certain oracles. He then introduces Kolmogorov complexity, the randomness of (finite and infinite) strings in terms of Kolmogorov complexity, and some applications, including Gödel’s first incompleteness theorem. He then discusses the formalization of Occam’s razor in the notion of Minimum Description Length (MDL), and the interpretation of MDL using K-complexity as the notion of description length. Finally in that chapter, the author discusses how even random strings exhibit regularities that can be compressed, and science can be done on them, to the extent that science is the art of identifying and compressing the regularities in data.