From Rational Animations.
How do we rigorously measure AI’s intelligence? We don’t really know. What we know is that measuring intelligence is tricky, and if we’re not careful, our tests might not measure what we intend. We explore this topic by starting with the story of Clever Hans, a horse who seemingly could do arithmetic. Later, we explain the potential limitations of today’s AI benchmarks and how we could do better by looking at the established discipline of cognitive science.
▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
The Project Gutenberg EBook of Clever Hans, by Oskar Pfungst: https://www.gutenberg.org/files/33936/33936-h/33936-h.htm#CHAPTER_IV
The Wiring of Intelligence: https://journals.sagepub.com/doi/10.1177/1745691619866447
New and emerging models of human intelligence: https://wires.onlinelibrary.wiley.com/doi/10.1002/wcs.1356
NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results: https://arxiv.org/pdf/2504.10685v1
HellaSwag: https://rowanzellers.com/hellaswag/
Are We Done with MMLU? https://arxiv.org/abs/2406.04127
Artificial cognition: How experimental psychology can help generate explainable artificial intelligence: https://link.springer.com/article/10.3758/s13423-020-01825-5
o3-mini System Card: https://cdn.openai.com/o3-mini-system-card-feb10.pdf
Measuring Massive Multitask Language Understanding: https://arxiv.org/pdf/2009.03300
Requiem for nutrition as the cause of IQ gains: Raven’s gains in Britain 1938–2008: https://www.sciencedirect.com/science/article/abs/pii/S1570677X09000057?via%3Dihub
Observational Scaling Laws and the Predictability of Language Model Performance: https://doi.org/10.48550/arXiv.2405.10938
Introducing Claude 4 (agentic benchmarks): https://www.anthropic.com/news/claude-4
Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses: https://turntrout.com/original-truthfulqa-weaknesses
Phonological memory and vocabulary development during the early school years: A longitudinal study: https://psycnet.apa.org/doi/10.1037/0012-1649.28.5.887
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark: https://arxiv.org/pdf/2412.15194
Smelling themselves: Dogs investigate their own odours longer when modified in an “olfactory mirror” test: https://doi.org/10.1016/j.beproc.2017.08.001
Elephants’ jumbo mirror ability: http://news.bbc.co.uk/2/hi/science/nature/6100430.stm
ARC Prize 2024: Technical Report: https://arxiv.org/pdf/2412.04604
Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others: https://arxiv.org/pdf/2102.11938v1
CogBench: a large language model walks into a psychology lab: https://arxiv.org/pdf/2402.18225
The Animal-AI Environment: A virtual laboratory for comparative cognition and artificial intelligence research: https://doi.org/10.3758/s13428-025-02616-3
A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment: https://openreview.net/forum?id=eUkbTUsDgs
▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, MERCH▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🟠 Patreon: https://www.patreon.com/rationalanimations
🔵 Channel membership: https://www.youtube.com/channel/UCgqt1RE0k0MIr0LoyJRy2lg/join
🟢 Merch: https://rational-animations-shop.fourthwall.com
🟤 Ko-fi, for one-time and recurring donations: https://ko-fi.com/rationalanimations
▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Rational Animations Discord: https://discord.gg/5Y3Dwz89yH
Reddit: https://www.reddit.com/r/RationalAnimations/
X/Twitter: https://twitter.com/RationalAnimat1
Instagram: https://www.instagram.com/rationalanimations/
▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Thanks to our patrons and channel members from the Simple Adder tier and above: https://docs.google.com/document/d/1pu5mGj0_FcotxmsOxnXVQaGWhdZfjnuW7zg8vDNvdZQ/edit?usp=sharing
▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Credits here: https://docs.google.com/document/d/1dr5mKwN2AESrIkyZU4PrytOiLVvIfzYG9CotpS-t400/edit?usp=sharing