What a 100-year-old horse teaches us about AI

From Rational Animations.

How do we rigorously measure AI’s intelligence? We don’t really know. What we know is that measuring intelligence is tricky, and if we’re not careful, our tests might not measure what we intend. We explore this topic by starting with the story of Clever Hans, a horse who seemingly could do arithmetic. Later, we explain the potential limitations of today’s AI benchmarks and how we could do better by looking at the established discipline of cognitive science.

▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

The Project Gutenberg EBook of Clever Hans, by Oskar Pfungst: https://www.gutenberg.org/files/33936/33936-h/33936-h.htm#CHAPTER_IV

The Wiring of Intelligence: https://journals.sagepub.com/doi/10.1177/1745691619866447

New and emerging models of human intelligence: https://wires.onlinelibrary.wiley.com/doi/10.1002/wcs.1356

NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results: https://arxiv.org/pdf/2504.10685v1

HellaSwag: https://rowanzellers.com/hellaswag/

Are We Done with MMLU? https://arxiv.org/abs/2406.04127

Artificial cognition: How experimental psychology can help generate explainable artificial intelligence: https://link.springer.com/article/10.3758/s13423-020-01825-5

o3-mini System Card: https://cdn.openai.com/o3-mini-system-card-feb10.pdf

Measuring Massive Multitask Language Understanding: https://arxiv.org/pdf/2009.03300

Requiem for nutrition as the cause of IQ gains: Raven’s gains in Britain 1938–2008: https://www.sciencedirect.com/science/article/abs/pii/S1570677X09000057?via%3Dihub

Observational Scaling Laws and the Predictability of Language Model Performance: https://doi.org/10.48550/arXiv.2405.10938

Introducing Claude 4 (agentic benchmarks): https://www.anthropic.com/news/claude-4

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses: https://turntrout.com/original-truthfulqa-weaknesses

Phonological memory and vocabulary development during the early school years: A longitudinal study: https://psycnet.apa.org/doi/10.1037/0012-1649.28.5.887

MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark: https://arxiv.org/pdf/2412.15194

Smelling themselves: Dogs investigate their own odours longer when modified in an “olfactory mirror” test: https://doi.org/10.1016/j.beproc.2017.08.001

Elephants’ jumbo mirror ability: http://news.bbc.co.uk/2/hi/science/nature/6100430.stm

ARC Prize 2024: Technical Report: https://arxiv.org/pdf/2412.04604

Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others: https://arxiv.org/pdf/2102.11938v1

CogBench: a large language model walks into a psychology lab: https://arxiv.org/pdf/2402.18225

The Animal-AI Environment: A virtual laboratory for comparative cognition and artificial intelligence research: https://doi.org/10.3758/s13428-025-02616-3

A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment: https://openreview.net/forum?id=eUkbTUsDgs

▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, MERCH▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🟠 Patreon: https://www.patreon.com/rationalanimations

🔵 Channel membership: https://www.youtube.com/channel/UCgqt1RE0k0MIr0LoyJRy2lg/join

🟢 Merch: https://rational-animations-shop.fourthwall.com

🟤 Ko-fi, for one-time and recurring donations: https://ko-fi.com/rationalanimations

▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Rational Animations Discord: https://discord.gg/5Y3Dwz89yH

Reddit: https://www.reddit.com/r/RationalAnimations/

X/Twitter: https://twitter.com/RationalAnimat1

Instagram: https://www.instagram.com/rationalanimations/

▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Thanks to our patrons and channel members from the Simple Adder tier and above: https://docs.google.com/document/d/1pu5mGj0_FcotxmsOxnXVQaGWhdZfjnuW7zg8vDNvdZQ/edit?usp=sharing

▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Credits here: https://docs.google.com/document/d/1dr5mKwN2AESrIkyZU4PrytOiLVvIfzYG9CotpS-t400/edit?usp=sharing
