Yesterday, Today, and the Unpredictable Future
2026-02-28
Across centuries of modeling, the loop is stable:
Deep learning follows this same procedure, only at a far larger scale:
Development goes back to the 18th century
❝I think there is none more general, more exact, and more easy of application, than … rendering the sum of the squares of the errors a minimum. By this means there is established among the errors a sort of equilibrium which … is very well fitted to reveal that state of the system which most nearly approaches the truth.❞
What makes modern neural networks possible.
Floating point operations per second
Thanks to NVIDIA GPU chips designed to facilitate basic linear algebra operations such as matrix multiplication.
Neural networks are inspired by biology:
Schematic of a biological neuron
Artificial neurons are digital mathematical simplifications of biological neurons designed for computation.
They are inspired by, though intentionally not a complete simulation of, biological neurons.
An artificial neuron is a simple, easily understood two-step computational process, as shown in the following three slides.
This simple form of artificial neurons persists in even the most complex, huge models such as ChatGPT and similar. The networks are more complex, but the basic artificial neuron is unchanged.
Think multiple regression:
Bias + Weighted sum (a linear combination) is the pre-activation value:
\[\hat y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p\]
Neural networks do not replace linear models; they generalize them, with the regression computation built in.
The defining extension beyond regression occurs when the value of this linear combination (regression model), \(z\), is passed through an activation function.
Activation function, \(g(z)\): A (usually) nonlinear function applied to the weighted sum of inputs from the pre-activation function, \(z\), that determines the neuron’s output.
The activation function determines how strongly the neuron responds to its inputs and whether that response is transmitted to the next layer of the network.
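The two-step computation of a single artificial neuron can be sketched in plain Python; the weights, bias, and inputs below are made up purely for illustration:

```python
import math

def neuron(inputs, weights, bias, activation):
    # Step 1: pre-activation z -- bias plus weighted sum of inputs
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    # Step 2: pass z through the activation function g(z)
    return activation(z)

def relu(z):
    # a common activation: max(0, z)
    return max(0.0, z)

def sigmoid(z):
    # squashes z into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Two inputs with illustrative weights:
# z = 0.1 + 0.5*1.0 + (-0.25)*2.0 = 0.1, and relu(0.1) = 0.1
out = neuron([1.0, 2.0], [0.5, -0.25], 0.1, relu)
```

Swapping `relu` for `sigmoid` or the identity function changes only step 2; the pre-activation computation is the same in every case.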
There is a fairly small set of activation functions applied in practice
Commonly applied activation functions
Schematic of an artificial neuron
If the activation function is the identity function, the result is multiple regression: the neural network produces the same model as classical least-squares regression, which is solved analytically (and is therefore very fast).
Modern neural network software can thus analyze standard regression models, yielding the same estimated model as least-squares analysis.
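A one-predictor sketch of that equivalence, with made-up data: the closed-form least-squares coefficients are exactly the weights of an identity-activation neuron.

```python
# Simple linear regression y = b0 + b1*x, solved in closed form.
# An identity-activation neuron with these weights is the same model.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# least-squares slope and intercept
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

def predict(xi):
    # identity activation: the output is the pre-activation z itself
    return b0 + b1 * xi
```

For these data the fit is b0 = 0.14 and b1 = 1.96; an iteratively trained linear neuron converges to the same values, just more slowly.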
Neural networks are composed of layers, ordered collections of neurons (units, nodes) through which data pass until the last layer, which produces the prediction.
Layer: A set of neurons that receive inputs, apply weighted transformations and usually nonlinear activations, then pass their outputs forward to the next stage of the model.
Layers determine the structure of a neural network. The simplest networks contain only one input layer and one output layer. More typical are more complex networks that include additional layers between these two.
Hidden layer: A layer of neurons positioned between the input layer and the output layer that applies weighted transformations, typically followed by nonlinear activation, before the final prediction is produced.
A network represents increasingly complex relationships between predictors and outcomes, though each individual neuron performs a simple computation.
Dense neural network: Every neuron in one layer connects to every neuron in the next layer.
A small, dense neural network with two hidden layers, six and three neurons wide, respectively.
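A forward pass through such a network is just the single-neuron computation repeated layer by layer. A plain-Python sketch with hypothetical random weights, hidden widths of six and three, and ReLU activations:

```python
import random

random.seed(0)

def dense_layer(inputs, weights, biases, activation):
    # one output per neuron: bias + weighted sum, then activation
    return [activation(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

def random_layer(n_in, n_out):
    # illustrative initialization: small random weights, zero biases
    weights = [[random.uniform(-1, 1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

relu = lambda z: max(0.0, z)
identity = lambda z: z

# input width 4 -> hidden width 6 -> hidden width 3 -> output width 1
w1, b1 = random_layer(4, 6)
w2, b2 = random_layer(6, 3)
w3, b3 = random_layer(3, 1)

x = [0.5, -1.2, 3.0, 0.7]
h1 = dense_layer(x, w1, b1, relu)
h2 = dense_layer(h1, w2, b2, relu)
yhat = dense_layer(h2, w3, b3, identity)   # the single prediction
```

Every neuron in each layer reads every output of the previous layer, which is what makes the network dense.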
The solution is iterative: the network learns by repeatedly making predictions and correcting its mistakes.
One training iteration
Repeat: for large models, thousands to millions of times, across the full training dataset
Convergence: the process iterates until the error (loss) stops meaningfully decreasing.
At convergence, the network has found a set of weights that presumably generalize from training examples to new, unseen inputs.
The same procedure — forward pass, measure error, backpropagate, update
The algorithm is identical. Only the scale differs.
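The loop at toy scale: one linear neuron trained by gradient descent, with illustrative data and a learning rate chosen just for this example.

```python
# Learn y = b0 + b1*x by gradient descent on squared error.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # data generated by y = 1 + 2x

b0, b1 = 0.0, 0.0           # start from arbitrary weights
lr = 0.05                   # learning rate

for _ in range(2000):                      # repeat many iterations
    for x, y in zip(xs, ys):
        yhat = b0 + b1 * x                 # 1. forward pass
        err = yhat - y                     # 2. measure error
        grad_b0, grad_b1 = err, err * x    # 3. gradients (backpropagation)
        b0 -= lr * grad_b0                 # 4. update weights
        b1 -= lr * grad_b1
```

After training, b0 and b1 recover the intercept 1 and slope 2 that generated the data. A large model runs exactly this cycle, only with billions of weights instead of two.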
The weights are learned during the training process
Once trained, the weights (the estimated model) remain the same throughout the prediction process on new data
What changes and what does not
| Step | During Training | During Inference |
|---|---|---|
| Forward pass | ✓ Yes | ✓ Yes |
| Loss computed | ✓ Yes | ✗ No |
| Backpropagation | ✓ Yes | ✗ No |
| Weights updated | ✓ Yes | ✗ No — weights frozen |
Structural implications:
Therefore:
| Number of input variables | Neurons in first hidden layer | Neurons in second hidden layer | Minimum approximate n |
|---|---|---|---|
| 2-4 | 4-16 | 0-8 (optional) | 250-500 |
| 5-10 | 8-32 | 0-16 (optional) | 500-2,000 |
| 10-50 | 16-64 | 8-32 | 2,000-10,000 |
| 50-200 | 32-128 | 16-64 | 10,000-100,000 |
| 200+ | 64-256+ | 32-128+ | 100,000+ |
Gains in predictive accuracy for dense neural networks beyond two hidden layers are often modest and may be outweighed by increased training difficulty, instability, and reduced interpretability.
A nonlinear generalization of linear multiple regression; one possible strategy:
Particularly applicable to classification as prediction
Into which group should the observation be placed, that is, predicted
(e.g., on-time shipment or late shipment)
Sources, written in Python:
TensorFlow/Keras (also PyTorch) is available to R via the keras3 package
lessR is my R package to simplify data analysis, now with 800+ downloads a week across the globe
Accomplish a comprehensive regression analysis for variables y, x1, x2, and x3 with reg(y ~ x1 + x2 + x3)
Need to better understand how well neural networks adapt to the smaller sample sizes typical of behavioral research, about 200 to 1,000
Starting a simulation research project to compare OLS (ordinary least-squares) regression with neural network regression
Proposed lessR integration: specify parameter with argument: DL=TRUE, as in
reg(y ~ x1 + x2 + x3, DL=TRUE)
This will implicitly run a TensorFlow/Keras dense neural network regression model, with the depth and width of the network perhaps adapted to the number of predictor variables and the sample size, guided by the simulation results
Many parameters offered for control, such as DL_Hidden for the number of hidden layers
Language model: A system that has learned the structure of language well enough to understand text and generate meaningful text in response
Examples: OpenAI ChatGPT, Anthropic Claude, Google Gemini, and X Grok
The word model refers to the historical standard application
Model: A mathematical function fitted to the data that predicts future events
The word large more meaningfully refers to huge at every level
The name both describes the process and follows the history of the architecture
Generative, the model produces output
Pre-trained, the model learns general knowledge before it is applied
Transformer, the 96-layer neural network architecture underlying everything
GPT is not a product name, instead it is a description of what the system does, how it was built, and what architecture it runs on
The model operates on tokens: chunks of text, about four characters each
Provides a manageable vocabulary size: only about 100,000 tokens are needed
At its core, an LLM does one thing
Given the text so far (context), predict the next token.
Conditional probabilities: Learn P(next token | previous tokens)
Everything else — apparent reasoning, knowledge, creativity — emerges from doing this at massive scale with a very specific architecture.
And yet coherent, structured, mostly meaningful text manages to emerge.
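Next-token prediction at toy scale: a bigram model that learns P(next word | previous word) purely from counts in a tiny invented corpus. Real models operate on tokens and deep networks, but the objective is the same.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    # conditional probability P(next | prev), estimated from counts
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total

# In this corpus "the" is followed by cat (2x), mat (1x), fish (1x),
# so P(cat | the) = 0.5 and P(mat | the) = 0.25
```

Everything an LLM does is a vastly scaled-up version of this estimate: condition on the context, output a probability distribution over what comes next.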
Begin with gibberish, billions of parameters initialized to random values
The training data: a significant fraction of the internet, books, code, scientific papers — estimated 300-500 billion tokens of text
Training iterations: repeated trillions of times over months, at a cost of up to $100 million
No explicit instructions about grammar, facts, reasoning, or meaning
All learning emerges as a side effect of getting better at prediction
Embedding: Assign each token a vector of 4,000+ floating-point numbers that locates the token as a point in a 4,000+ dimensional geometric space
These vectors are entirely self-constructed during training
Every concept humanity has expressed in language — every scientific idea, historical event, emotional nuance, grammatical relationship — exists in this single geometric space such that related things are near each other
What the geometry encodes
As vectors, they can be operated on with standard arithmetic in vector space:
The machine was never told what words mean. It inferred the geometry of meaning solely from predicting the numeric code of the next token.
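The arithmetic-on-meaning idea, shown with tiny hand-made 3-dimensional vectors. Real embeddings have thousands of dimensions and are learned, not hand-made; these toy vectors and their "royalty/male/female" reading are pure invention for illustration.

```python
import math

# Toy embeddings: dimensions loosely read as (royalty, male, female)
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    # similarity of direction, the standard closeness measure for embeddings
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman lands nearest to queen
target = add(sub(vec["king"], vec["man"]), vec["woman"])
nearest = max(vec, key=lambda w: cosine(vec[w], target))
```

Subtracting "man" removes the male direction, adding "woman" supplies the female direction, and the royalty direction is untouched, so the result sits closest to "queen".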
The full set of token vectors forms one large matrix:
rows = every token in the vocabulary
columns = every dimension of the vector space
Because every token gets its own unique vector of 4,096 numbers, the total number of numbers in the matrix:
number of tokens × vector length
~100,000 tokens × 4,096 dimensions = ~400 million numbers
One row per token. One column per dimension. Every cell a learned parameter.
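That arithmetic, checked directly. The float32 storage assumption (4 bytes per parameter) is added here for illustration and is not stated in the text.

```python
vocab_size = 100_000     # tokens in the vocabulary (rows)
dims = 4_096             # dimensions per embedding vector (columns)

parameters = vocab_size * dims       # one learned number per cell
bytes_fp32 = parameters * 4          # float32 = 4 bytes per number
gigabytes = bytes_fp32 / 1024**3

# 409,600,000 parameters, roughly 1.5 GB in float32 --
# and the embedding matrix is only one piece of the full model
```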
Semantic memory: The embedding matrix is the model’s geometric structure of the entire vocabulary, its internalized map of human language and knowledge
Each token is the statistically most plausible continuation, which is usually correct, but plausibility and truth are not the same thing.
Low temperature (~0–0.3)
High temperature (~1.2–1.5)
Very high temperature (> ~2)
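Temperature rescales the model's token scores (logits) before a softmax turns them into probabilities. A sketch with made-up scores:

```python
import math

def softmax_with_temperature(logits, temperature):
    # divide the scores by the temperature, then normalize with softmax
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                     # made-up next-token scores

cold = softmax_with_temperature(logits, 0.2)   # sharp: near-deterministic
hot = softmax_with_temperature(logits, 2.0)    # flat: more random sampling
```

At low temperature nearly all the probability mass piles onto the top-scoring token; at high temperature the distribution flattens, so unlikely tokens are sampled more often.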
Language is not a sequence of independent words
Meaning depends on relationships between words, the context, which is the foundation of language itself
“The trophy didn’t fit in the suitcase because it was too big.”
Attention: enables every word to be influenced simultaneously by every other word in the conversation, giving the model the ability to understand context
Before 2017, there was not an efficient way to consider the context of words.
The eight-page paper that revolutionized AI, beginning the modern AI era, replacing the additional complexity of previous models with one mechanism: attention
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention is all you need." Advances in Neural Information Processing Systems, 30.
Attention provides a flexible, global memory of everything said so far so that every word is understood in the full context of everything surrounding it.
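Attention at its simplest: each word's vector is replaced by a weighted average of all the word vectors, with weights from dot-product similarity. A sketch with tiny made-up vectors; real transformers add learned query/key/value projections and many parallel attention heads.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(vectors):
    # every position attends to every position, including itself
    output = []
    for q in vectors:                        # q plays the role of the query
        scores = [dot(q, k) / math.sqrt(len(q)) for k in vectors]
        weights = softmax(scores)            # how much to attend to each word
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(q))]
        output.append(mixed)                 # a context-mixed representation
    return output

# three toy "word" vectors; the output has the same shape,
# but each row now blends information from the whole sequence
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = attention(words)
```

Because every output row is a convex combination of all input rows, each word's new representation carries information from the entire context at once, with no fixed window.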
Transformer: The neural network architecture that implements attention
Before transformers:
After transformers:
The model has no internal truth-checker: the probability of the next token, based on training patterns, says nothing about the truth of the entire sentence or paragraph
Key mechanisms follow
Next, consider how to minimize hallucinations.
Override the hidden system prompt that shapes how the model behaves
Flag uncertainty explicitly: “If you are not confident, say so”
→ Partially counteracts sycophancy bias and false confidence
Role prompting: “You are an expert in X reviewing these results”
→ Shifts output toward domain-expert reasoning
Negative constraints: “Do not speculate beyond what the data shows”
→ Constrains the output space explicitly
Structured output requests: “Give: 1) key findings, 2) potential confounds, 3) suggested next steps”
→ Produces immediately usable output
Be specific, not vague: Precise prompts leave less room for the model to fill gaps
→ “Summarize the limitations of this study” outperforms “What do you think?”
Iterative refinement: Ask the model to critique its own output → refine
Refinement consistently yields better results than one-shot prompting
Kojima, T., et al. (2022). “Large Language Models are Zero-Shot Reasoners.” Advances in Neural Information Processing Systems, 35. https://arxiv.org/ab
Why this simple phrase works so well:
Provide some worked examples (one- or few-shot chain-of-thought)
Provide example worked solutions before your question. The model sees the reasoning pattern and applies it.
Have the model flag uncertainty
“Think through this step by step and note where you are less confident.”
Turns hidden hallucinations into visible weak links.
Self-consistency
Generate several independent reasoning chains:
“Solve this three different ways and tell me if you get the same answer.”
Supplement the model’s training by telling it what it needs to know. If you have some topic or idea that is new, the model may not be prepared to address the issue.
The problem: The model reconstructs facts from training memory, which is unreliable for specific, current, or domain-specific information.
The solution: Insert relevant documents into the prompt at generation time.
Analogous to asking someone to consult the reference book before answering instead of answering from memory alone.
Benefits:
For RAG to work well, documents must be written so chunks make sense in isolation:
Document format matters; not all formats survive chunking equally.
The file extension .html does not guarantee semantic structure; it depends entirely on how the file was produced. For example, MS Word-generated HTML is very messy.
PDF is a visual format with little semantic structure, and so is problematic for RAG
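The retrieval step can be sketched as scoring document chunks against the question and prepending the best match to the prompt. The chunk texts below are invented, and real RAG systems score with embedding similarity rather than this toy word-overlap measure.

```python
def score(question, chunk):
    # toy relevance measure: number of shared lowercase words
    q_words = set(question.lower().split())
    return len(q_words & set(chunk.lower().split()))

# invented document chunks, each written to make sense in isolation
chunks = [
    "The warranty covers parts and labor for two years.",
    "Shipping is free for orders over fifty dollars.",
    "Returns are accepted within thirty days of delivery.",
]

question = "How long does the warranty last?"

# retrieve the most relevant chunk, then build the augmented prompt
best = max(chunks, key=lambda c: score(question, c))
prompt = f"Context: {best}\n\nQuestion: {question}\nAnswer from the context."
```

The model then answers from the supplied context rather than from training memory alone, which is the whole point of retrieval augmentation.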
Wei, J., et al. (2022). “Emergent Abilities of Large Language Models.” Transactions on Machine Learning Research. https://arxiv.org/abs/2206.07682
The 2022 “Emergent Abilities” paper documented a striking pattern:
Unplanned Intelligence and Performance
| ╭────────
| ╭───╯
|________________________╯
|
└──────────────────────────────────────
Model Scale →
(phase transition)
These discontinuities are not understood, and they imply that we cannot predict what a larger model will be capable of by extrapolating from smaller ones.
A ~100× scale increase over GPT-2 produced capabilities that were not 100× better but qualitatively different:
Important
This was not expected. These capabilities overturned a long-dominant view that human-like reasoning required explicit symbolic logic and hand-crafted knowledge.
What “emergent” means here:
A capability is emergent if it was not present in smaller models and was not explicitly trained: it appeared as a side effect of scale alone. Not a gradual improvement, but a discontinuous jump from near-zero to functional.
Concrete examples from Wei et al.:
None of these were targeted. They appeared uninstructed, as scale crossed a threshold.
The issues:
What exactly is being learned? We can describe the mechanism. We cannot fully describe what the mechanism produces.
Next, a brief consideration of some concerns from this lack of understanding
The core scenario:
Once AI reaches roughly human-level capability, it could begin improving itself: designing better algorithms, better architectures, and better training procedures faster than humans can
The feedback loop:
Why it matters, the control problem:
In a fast takeoff scenario, an AI rapidly self-improves, too quickly for significant human-initiated error correction or gradual tuning of the agent’s goals
A system that surpasses human intelligence before we understand it cannot be corrected, recalled, or realigned after the fact
Are we creating our successor?
This is why interpretability, understanding what is happening inside these models, is considered an urgent safety priority, not merely academic curiosity
British-Canadian computer scientist and cognitive psychologist. 2018 Turing Award (the “Nobel Prize of computing”), shared with Bengio (see Dec 2025 interview) and LeCun. 2024 Nobel Prize in Physics, for foundational discoveries enabling machine learning with artificial neural networks
Then Hinton walked away (see June 2025 interview)
In May 2023, Hinton resigned from Google after a decade to freely warn the world about what he helped create:
“It is quite conceivable that humanity is just a passing phase in the evolution of intelligence.”
The person who contributed more than almost anyone to making LLMs possible now considers AI an existential risk, and says safety research is urgently underfunded
Why does emergence occur? What crosses the threshold at a given scale?
Does the geometry of meaning converge? Is human-like representation the optimal representation for any sufficiently powerful system trained on human data?
Is consciousness substrate-independent? Could sufficiently complex information processing in any physical substrate give rise to subjective experience?
Can interpretability scale? As models grow more capable, can mechanistic understanding keep pace?
“The fact that we do not understand why DNNs [deep neural networks] work, or when they do, makes almost any new application an interesting experiment in itself.”
Scorzato, L. (2024). “Reliability and Interpretability in Science and Deep Learning.” Minds and Machines, 34. https://doi.org/10.1007/s11023-024-09682-0
LLMs are statistical pattern-completion systems: the apparent intelligence is real, but emerges from a mechanism very different from biological cognition
Meaning is geometry: high-dimensional vector spaces, learned entirely from predicting the next token, no human-engineered features
Hallucinations are structural: a consequence of probabilistic generation without a truth-checker; reducible through prompting strategy and retrieval augmentation
Emergent capabilities were not designed: they appeared as a consequence of scale alone, and we do not fully understand why
The deepest questions remain open: consciousness, grounding, understanding, all at the frontier of neuroscience, cognitive science, and AI
Consider small language models, now with many choices available, and models that scale from small (for specialized purposes) to large
LLaMA (Large Language Model Meta AI), Meta’s family of open-weight models. Unlike ChatGPT, Claude, and Gemini, the weights are publicly released: downloadable, locally runnable, fine-tunable, with no data sent to external servers
The family spans an enormous range, with different sizes for different hardware
| Model Size | Runs On |
|---|---|
| 1B – 3B parameters | Smartphone, Raspberry Pi |
| 8B parameters | Laptop — Mac M-series runs this well |
| 70B parameters | High-end workstation, 1-2 consumer GPUs |
| 400B parameters | Multi-GPU server, small cluster |
| 400B+ (LLaMA 4) | Serious server infrastructure |
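A rough way to check which hardware fits which model: parameter count times bytes per parameter. The sketch assumes 2-byte fp16/bf16 weights and ignores activation and KV-cache overhead, so real requirements run somewhat higher.

```python
def weight_memory_gb(parameters, bytes_per_param=2):
    # fp16/bf16 stores each weight in 2 bytes
    return parameters * bytes_per_param / 1024**3

# 8B parameters: ~15 GB of weights -- within reach of a well-equipped laptop
laptop = weight_memory_gb(8e9)

# 70B parameters: ~130 GB -- needs workstation-class GPU memory
workstation = weight_memory_gb(70e9)
```

Quantization to 4 bits per weight halves the fp16 figures again, which is how the larger models in the table become runnable on consumer hardware.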