Yesterday, Today, and the Unpredictable Future
2026-02-28
Across centuries of modeling, the loop is stable:
Deep learning follows this same procedure, only at a far larger scale:
Development goes back to the 18th century
❝I think there is none more general, more exact, and more easy of application, than … rendering the sum of the squares of the errors a minimum. By this means there is established among the errors a sort of equilibrium which … is very well fitted to reveal that state of the system which most nearly approaches the truth.❞
What makes modern neural networks possible.
Floating point operations per second
Thanks to NVIDIA GPU chips designed to facilitate basic linear algebra operations such as matrix multiplication.
Neural networks are inspired by biology:
Schematic of a biological neuron
Artificial neurons are digital mathematical simplifications of biological neurons designed for computation.
They are inspired by, though intentionally not a complete simulation of, biological neurons.
An artificial neuron is a simple, easily understood two-step computational process, as shown in the following three slides.
This simple form of artificial neurons persists in even the most complex, huge models such as ChatGPT and similar. The networks are more complex, but the basic artificial neuron is unchanged.
Think multiple regression:
Bias + Weighted sum (a linear combination) is the pre-activation value:
\[\hat y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p\]
Neural networks do not replace linear models; they generalize them, with the regression computation built in.
The defining extension beyond regression occurs when the value of this linear combination (regression model), \(z\), is passed through an activation function.
Activation function, \(g(z)\): A (usually) nonlinear function applied to the weighted sum of inputs from the pre-activation function, \(z\), that determines the neuron’s output.
The activation function determines how strongly the neuron responds to its inputs and whether that response is transmitted to the next layer of the network.
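The two-step computation of a single artificial neuron can be sketched in plain Python; the weights, bias, and inputs below are made up purely for illustration:

```python
import math

def neuron(inputs, weights, bias, activation):
    # Step 1: pre-activation z -- bias plus weighted sum of inputs
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    # Step 2: pass z through the activation function g(z)
    return activation(z)

def relu(z):
    # a common activation: max(0, z)
    return max(0.0, z)

def sigmoid(z):
    # squashes z into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Two inputs with illustrative weights:
# z = 0.1 + 0.5*1.0 + (-0.25)*2.0 = 0.1, and relu(0.1) = 0.1
out = neuron([1.0, 2.0], [0.5, -0.25], 0.1, relu)
```

Swapping `relu` for `sigmoid` or the identity function changes only step 2; the pre-activation computation is the same in every case.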
There is a fairly small set of activation functions applied in practice
Commonly applied activation functions
Schematic of an artificial neuron
If the activation function is the identity function, the result is multiple regression: the neural network produces the same model as classical least-squares regression, which is solved analytically (and is therefore very fast).
Modern neural network software can thus analyze standard regression models, yielding the same estimated model as least-squares analysis.
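A one-predictor sketch of that equivalence, with made-up data: the closed-form least-squares coefficients are exactly the weights of an identity-activation neuron.

```python
# Simple linear regression y = b0 + b1*x, solved in closed form.
# An identity-activation neuron with these weights is the same model.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# least-squares slope and intercept
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

def predict(xi):
    # identity activation: the output is the pre-activation z itself
    return b0 + b1 * xi
```

For these data the fit is b0 = 0.14 and b1 = 1.96; an iteratively trained linear neuron converges to the same values, just more slowly.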
Neural networks are composed of layers, ordered collections of neurons (units, nodes) through which data pass until the last layer, which produces the prediction.
Layer: A set of neurons that receive inputs, apply weighted transformations and usually nonlinear activations, then pass their outputs forward to the next stage of the model.
Layers determine the structure of a neural network. The simplest networks contain only one input layer and one output layer. More typical are more complex networks that include additional layers between these two.
Hidden layer: A layer of neurons positioned between the input layer and the output layer that applies weighted transformations, typically followed by nonlinear activation, before the final prediction is produced.
A network represents increasingly complex relationships between predictors and outcomes, though each individual neuron performs a simple computation.
Dense neural network: Every neuron in one layer connects to every neuron in the next layer.
A small, dense neural network with two hidden layers, six and three neurons wide, respectively.
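A forward pass through such a network is just the single-neuron computation repeated layer by layer. A plain-Python sketch with hypothetical random weights, hidden widths of six and three, and ReLU activations:

```python
import random

random.seed(0)

def dense_layer(inputs, weights, biases, activation):
    # one output per neuron: bias + weighted sum, then activation
    return [activation(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

def random_layer(n_in, n_out):
    # illustrative initialization: small random weights, zero biases
    weights = [[random.uniform(-1, 1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

relu = lambda z: max(0.0, z)
identity = lambda z: z

# input width 4 -> hidden width 6 -> hidden width 3 -> output width 1
w1, b1 = random_layer(4, 6)
w2, b2 = random_layer(6, 3)
w3, b3 = random_layer(3, 1)

x = [0.5, -1.2, 3.0, 0.7]
h1 = dense_layer(x, w1, b1, relu)
h2 = dense_layer(h1, w2, b2, relu)
yhat = dense_layer(h2, w3, b3, identity)   # the single prediction
```

Every neuron in each layer reads every output of the previous layer, which is what makes the network dense.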
The solution is iterative: the network learns by repeatedly making predictions and correcting its mistakes.
One training iteration
Repeat: for large models, thousands to millions of times, across the full training dataset
Convergence: the process iterates until the error (loss) stops meaningfully decreasing.
At convergence, the network has found a set of weights that presumably generalize from training examples to new, unseen inputs.
The same procedure — forward pass, measure error, backpropagate, update
The algorithm is identical. Only the scale differs.
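The loop at toy scale: one linear neuron trained by gradient descent, with illustrative data and a learning rate chosen just for this example.

```python
# Learn y = b0 + b1*x by gradient descent on squared error.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # data generated by y = 1 + 2x

b0, b1 = 0.0, 0.0           # start from arbitrary weights
lr = 0.05                   # learning rate

for _ in range(2000):                      # repeat many iterations
    for x, y in zip(xs, ys):
        yhat = b0 + b1 * x                 # 1. forward pass
        err = yhat - y                     # 2. measure error
        grad_b0, grad_b1 = err, err * x    # 3. gradients (backpropagation)
        b0 -= lr * grad_b0                 # 4. update weights
        b1 -= lr * grad_b1
```

After training, b0 and b1 recover the intercept 1 and slope 2 that generated the data. A large model runs exactly this cycle, only with billions of weights instead of two.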
The weights are learned during the training process
Once trained, the weights (the estimated model) remain the same throughout the prediction process on new data
What changes and what does not
| Step | During Training | During Inference |
|---|---|---|
| Forward pass | ✓ Yes | ✓ Yes |
| Loss computed | ✓ Yes | ✗ No |
| Backpropagation | ✓ Yes | ✗ No |
| Weights updated | ✓ Yes | ✗ No — weights frozen |
Structural implications:
Therefore:
| Number of input variables | Neurons in first hidden layer | Neurons in second hidden layer | Minimum approximate n |
|---|---|---|---|
| 2-4 | 4-16 | 0-8 (optional) | 250-500 |
| 5-10 | 8-32 | 0-16 (optional) | 500-2,000 |
| 10-50 | 16-64 | 8-32 | 2,000-10,000 |
| 50-200 | 32-128 | 16-64 | 10,000-100,000 |
| 200+ | 64-256+ | 32-128+ | 100,000+ |
Gains in predictive accuracy for dense neural networks beyond two hidden layers are often modest and may be outweighed by increased training difficulty, instability, and reduced interpretability.
A nonlinear generalization of linear multiple regression; one possible strategy:
Particularly applicable to classification as prediction
Into which group should the observation be placed, that is, predicted
(e.g., on-time shipment or late shipment)
Sources, written in Python:
TensorFlow/Keras (also PyTorch) is available to R via the keras3 package
lessR is my R package to simplify data analysis, now with 800+ downloads a week across the globe
Accomplish a comprehensive regression analysis for variables y, x1, x2, and x3 with reg(y ~ x1 + x2 + x3)
Need to better understand how well neural networks adapt to the smaller sample sizes typical of behavioral research, about 200 to 1,000
Starting a simulation research project to compare OLS (ordinary least-squares) regression with neural network regression
Proposed lessR integration: specify parameter with argument: DL=TRUE, as in
reg(y ~ x1 + x2 + x3, DL=TRUE)
This will implicitly run a TensorFlow/Keras dense neural network regression model, with the depth and width of the network perhaps adapted to the number of predictor variables and the sample size, guided by the simulation results
Many parameters offered for control, such as DL_Hidden for the number of hidden layers
Language model: A system that has learned the structure of language well enough to understand text and generate meaningful text in response
Examples: OpenAI ChatGPT, Anthropic Claude, Google Gemini, and X Grok
The word model refers to the historical standard application
Model: A mathematical function fitted to the data that predicts future events
The word large more meaningfully refers to huge at every level
The name both describes the process and follows the history of the architecture
Generative, the model produces output
Pre-trained, the model learns general knowledge before it is applied
Transformer, the 96-layer neural network architecture underlying everything
GPT is not a product name, instead it is a description of what the system does, how it was built, and what architecture it runs on
The model operates on tokens: chunks of text, about four characters each
Provides a manageable vocabulary size: only about 100,000 tokens are needed
At its core, an LLM does one thing
Given the text so far (context), predict the next token.
Conditional probabilities: Learn P(next token | previous tokens)
Everything else — apparent reasoning, knowledge, creativity — emerges from doing this at massive scale with a very specific architecture.
And yet coherent, structured, mostly meaningful text manages to emerge.
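Next-token prediction at toy scale: a bigram model that learns P(next word | previous word) purely from counts in a tiny invented corpus. Real models operate on tokens and deep networks, but the objective is the same.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    # conditional probability P(next | prev), estimated from counts
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total

# In this corpus "the" is followed by cat (2x), mat (1x), fish (1x),
# so P(cat | the) = 0.5 and P(mat | the) = 0.25
```

Everything an LLM does is a vastly scaled-up version of this estimate: condition on the context, output a probability distribution over what comes next.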
Begin with gibberish, billions of parameters initialized to random values
The training data: a significant fraction of the internet, books, code, scientific papers — estimated 300-500 billion tokens of text
Training iterations: repeated trillions of times over months, at a cost of up to $100 million
No explicit instructions about grammar, facts, reasoning, or meaning
All learning emerges as a side effect of getting better at prediction
Embedding: Assign each token a vector of 4,000+ floating-point numbers that locates the token as a point in a 4,000+ dimensional geometric space
These vectors are entirely self-constructed during training
Every concept humanity has expressed in language — every scientific idea, historical event, emotional nuance, grammatical relationship — exists in this single geometric space such that related things are near each other
What the geometry encodes
As vectors, they can be operated on with standard arithmetic in vector space:
The machine was never told what words mean. It inferred the geometry of meaning solely from predicting the numeric code of the next token.
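The arithmetic-on-meaning idea, shown with tiny hand-made 3-dimensional vectors. Real embeddings have thousands of dimensions and are learned, not hand-made; these toy vectors and their "royalty/male/female" reading are pure invention for illustration.

```python
import math

# Toy embeddings: dimensions loosely read as (royalty, male, female)
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    # similarity of direction, the standard closeness measure for embeddings
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman lands nearest to queen
target = add(sub(vec["king"], vec["man"]), vec["woman"])
nearest = max(vec, key=lambda w: cosine(vec[w], target))
```

Subtracting "man" removes the male direction, adding "woman" supplies the female direction, and the royalty direction is untouched, so the result sits closest to "queen".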
The full set of token vectors forms one large matrix:
rows = every token in the vocabulary
columns = every dimension of the vector space
Because every token gets its own unique vector of 4,096 numbers, the total number of numbers in the matrix:
number of tokens × vector length
~100,000 tokens × 4,096 dimensions = ~400 million numbers
One row per token. One column per dimension. Every cell a learned parameter.
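That arithmetic, checked directly. The float32 storage assumption (4 bytes per parameter) is added here for illustration and is not stated in the text.

```python
vocab_size = 100_000     # tokens in the vocabulary (rows)
dims = 4_096             # dimensions per embedding vector (columns)

parameters = vocab_size * dims       # one learned number per cell
bytes_fp32 = parameters * 4          # float32 = 4 bytes per number
gigabytes = bytes_fp32 / 1024**3

# 409,600,000 parameters, roughly 1.5 GB in float32 --
# and the embedding matrix is only one piece of the full model
```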
Semantic memory: The embedding matrix is the model’s geometric structure of the entire vocabulary, its internalized map of human language and knowledge
Each token is the statistically most plausible continuation, which is usually correct, but plausibility and truth are not the same thing.
Low temperature (~0–0.3)
High temperature (~1.2–1.5)
Very high temperature (> ~2)
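Temperature rescales the model's token scores (logits) before a softmax turns them into probabilities. A sketch with made-up scores:

```python
import math

def softmax_with_temperature(logits, temperature):
    # divide the scores by the temperature, then normalize with softmax
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                     # made-up next-token scores

cold = softmax_with_temperature(logits, 0.2)   # sharp: near-deterministic
hot = softmax_with_temperature(logits, 2.0)    # flat: more random sampling
```

At low temperature nearly all the probability mass piles onto the top-scoring token; at high temperature the distribution flattens, so unlikely tokens are sampled more often.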
Language is not a sequence of independent words
Meaning depends on relationships between words, the context, which is the foundation of language itself
“The trophy didn’t fit in the suitcase because it was too big.”
Attention: enables every word to be influenced simultaneously by every other word in the conversation, giving the model the ability to understand context
Before 2017, there was not an efficient way to consider the context of words.
The eight-page paper that revolutionized AI, beginning the modern AI era, replacing the additional complexity of previous models with one mechanism: attention
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention is all you need." Advances in Neural Information Processing Systems, 30.
Attention provides a flexible, global memory of everything said so far so that every word is understood in the full context of everything surrounding it.
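Attention at its simplest: each word's vector is replaced by a weighted average of all the word vectors, with weights from dot-product similarity. A sketch with tiny made-up vectors; real transformers add learned query/key/value projections and many parallel attention heads.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(vectors):
    # every position attends to every position, including itself
    output = []
    for q in vectors:                        # q plays the role of the query
        scores = [dot(q, k) / math.sqrt(len(q)) for k in vectors]
        weights = softmax(scores)            # how much to attend to each word
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(q))]
        output.append(mixed)                 # a context-mixed representation
    return output

# three toy "word" vectors; the output has the same shape,
# but each row now blends information from the whole sequence
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = attention(words)
```

Because every output row is a convex combination of all input rows, each word's new representation carries information from the entire context at once, with no fixed window.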
Transformer: The neural network architecture that implements attention
Before transformers:
After transformers:
The model has no internal truth-checker: the probability of the next token, based on training patterns, says nothing about the truth of the entire sentence or paragraph
Key mechanisms follow
Next, consider how to minimize hallucinations.
Override the hidden system prompt that shapes how the model behaves
Flag uncertainty explicitly: “If you are not confident, say so”
→ Partially counteracts sycophancy bias and false confidence
Role prompting: “You are an expert in X reviewing these results”
→ Shifts output toward domain-expert reasoning
Negative constraints: “Do not speculate beyond what the data shows”
→ Constrains the output space explicitly
Structured output requests: “Give: 1) key findings, 2) potential confounds, 3) suggested next steps”
→ Produces immediately usable output
Be specific, not vague: Precise prompts leave less room for the model to fill gaps
→ “Summarize the limitations of this study” outperforms “What do you think?”
Iterative refinement: Ask the model to critique its own output → refine
Refinement consistently yields better results than one-shot prompting
Kojima, T., et al. (2022). “Large Language Models are Zero-Shot Reasoners.” Advances in Neural Information Processing Systems, 35. https://arxiv.org/ab
Why this simple phrase works so well:
Provide some worked examples (one- or few-shot chain-of-thought)
Provide example worked solutions before your question. The model sees the reasoning pattern and applies it.
Have the model flag uncertainty
“Think through this step by step and note where you are less confident.”
Turns hidden hallucinations into visible weak links.
Self-consistency
Generate several independent reasoning chains:
“Solve this three different ways and tell me if you get the same answer.”
Supplement the model’s training by telling it what it needs to know. If you have some topic or idea that is new, the model may not be prepared to address the issue.
The problem: The model reconstructs facts from training memory, which is unreliable for specific, current, or domain-specific information.
The solution: Insert relevant documents into the prompt at generation time.
Analogous to asking someone to consult the reference book before answering instead of answering from memory alone.
Benefits:
For RAG to work well, documents must be written so chunks make sense in isolation:
Document format matters; not all formats survive chunking equally.
The file extension .html does not guarantee semantic structure; it depends entirely on how the file was produced. For example, MS Word-generated HTML is very messy.
PDF is a visual format with little semantic structure, and so is problematic for RAG
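The retrieval step can be sketched as scoring document chunks against the question and prepending the best match to the prompt. The chunk texts below are invented, and real RAG systems score with embedding similarity rather than this toy word-overlap measure.

```python
def score(question, chunk):
    # toy relevance measure: number of shared lowercase words
    q_words = set(question.lower().split())
    return len(q_words & set(chunk.lower().split()))

# invented document chunks, each written to make sense in isolation
chunks = [
    "The warranty covers parts and labor for two years.",
    "Shipping is free for orders over fifty dollars.",
    "Returns are accepted within thirty days of delivery.",
]

question = "How long does the warranty last?"

# retrieve the most relevant chunk, then build the augmented prompt
best = max(chunks, key=lambda c: score(question, c))
prompt = f"Context: {best}\n\nQuestion: {question}\nAnswer from the context."
```

The model then answers from the supplied context rather than from training memory alone, which is the whole point of retrieval augmentation.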
Wei, J., et al. (2022). “Emergent Abilities of Large Language Models.” Transactions on Machine Learning Research. https://arxiv.org/abs/2206.07682
The 2022 “Emergent Abilities” paper documented a striking pattern:
Unplanned Intelligence and Performance
| ╭────────
| ╭───╯
|________________________╯
|
└──────────────────────────────────────
Model Scale →
(phase transition)
These discontinuities are not understood, and they imply that we cannot predict what a larger model will be capable of by extrapolating from smaller ones.
A ~100× scale increase over GPT-2 produced capabilities that were not 100× better but qualitatively different:
Important
This was not expected. These capabilities overturned a long-dominant view that human-like reasoning required explicit symbolic logic and hand-crafted knowledge.
What “emergent” means here:
A capability is emergent if it was not present in smaller models and was not explicitly trained: it appeared as a side effect of scale alone. Not a gradual improvement, but a discontinuous jump from near-zero to functional.
Concrete examples from Wei et al.:
None of these were targeted. They appeared uninstructed, as scale crossed a threshold.
The issues:
What exactly is being learned? We can describe the mechanism. We cannot fully describe what the mechanism produces.
Next, a brief consideration of some concerns from this lack of understanding
The core scenario:
Once AI reaches roughly human-level capability, it could begin improving itself: designing better algorithms, better architectures, and better training procedures faster than humans can
The feedback loop:
Why it matters, the control problem:
In a fast takeoff scenario, an AI rapidly self-improves, too quickly for significant human-initiated error correction or gradual tuning of the agent’s goals
A system that surpasses human intelligence before we understand it cannot be corrected, recalled, or realigned after the fact
Are we creating our successor?
This is why interpretability, understanding what is happening inside these models, is considered an urgent safety priority, not merely academic curiosity
British-Canadian computer scientist and cognitive psychologist. 2018 Turing Award (the “Nobel Prize of computing”), shared with Bengio (see Dec 2025 interview) and LeCun. 2024 Nobel Prize in Physics, for foundational discoveries enabling machine learning with artificial neural networks
Then Hinton walked away (see June 2025 interview)
In May 2023, Hinton resigned from Google after a decade to freely warn the world about what he helped create:
“It is quite conceivable that humanity is just a passing phase in the evolution of intelligence.”
The person who contributed more than almost anyone to making LLMs possible now considers AI an existential risk, and says safety research is urgently underfunded
Why does emergence occur? What crosses the threshold at a given scale?
Does the geometry of meaning converge? Is human-like representation the optimal representation for any sufficiently powerful system trained on human data?
Is consciousness substrate-independent? Could sufficiently complex information processing in any physical substrate give rise to subjective experience?
Can interpretability scale? As models grow more capable, can mechanistic understanding keep pace?
“The fact that we do not understand why DNNs [deep neural networks] work, or when they do, makes almost any new application an interesting experiment in itself.”
Scorzato, L. (2024). “Reliability and Interpretability in Science and Deep Learning.” Minds and Machines, 34. https://doi.org/10.1007/s11023-024-09682-0
LLMs are statistical pattern-completion systems: the apparent intelligence is real, but emerges from a mechanism very different from biological cognition
Meaning is geometry: high-dimensional vector spaces, learned entirely from predicting the next token, no human-engineered features
Hallucinations are structural: a consequence of probabilistic generation without a truth-checker; reducible through prompting strategy and retrieval augmentation
Emergent capabilities were not designed: they appeared as a consequence of scale alone, and we do not fully understand why
The deepest questions remain open: consciousness, grounding, understanding, all at the frontier of neuroscience, cognitive science, and AI
Consider small language models, now with many choices available, and models that scale from small (for specialized purposes) to large
LLaMA (Large Language Model Meta AI), Meta’s family of open-weight models. Unlike ChatGPT, Claude, and Gemini, the weights are publicly released: downloadable, locally runnable, fine-tunable, with no data sent to external servers
The family spans an enormous range, with different sizes for different hardware
| Model Size | Runs On |
|---|---|
| 1B – 3B parameters | Smartphone, Raspberry Pi |
| 8B parameters | Laptop — Mac M-series runs this well |
| 70B parameters | High-end workstation, 1-2 consumer GPUs |
| 400B parameters | Multi-GPU server, small cluster |
| 400B+ (LLaMA 4) | Serious server infrastructure |
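A rough way to check which hardware fits which model: parameter count times bytes per parameter. The sketch assumes 2-byte fp16/bf16 weights and ignores activation and KV-cache overhead, so real requirements run somewhat higher.

```python
def weight_memory_gb(parameters, bytes_per_param=2):
    # fp16/bf16 stores each weight in 2 bytes
    return parameters * bytes_per_param / 1024**3

# 8B parameters: ~15 GB of weights -- within reach of a well-equipped laptop
laptop = weight_memory_gb(8e9)

# 70B parameters: ~130 GB -- needs workstation-class GPU memory
workstation = weight_memory_gb(70e9)
```

Quantization to 4 bits per weight halves the fp16 figures again, which is how the larger models in the table become runnable on consumer hardware.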