The AI Download is your weekly guide to navigating the rapidly evolving world of AI and digital technology. Written by Jim Christian, a digital strategy consultant and former tech educator, this newsletter cuts through the noise to deliver practical insights and actionable strategies. Each week, you’ll get behind-the-scenes access to real-world experiments with cutting-edge AI tools, automation strategies, and emerging technologies.
From the Turing Test to Humanity's Last Exam: How We Measure AI
The AI Download #017
March 21st, 2025
Dear Reader,
Have you ever wondered how we (as a species) determine if a machine is truly "intelligent"? Long before ChatGPT entered our daily lives, notable mathematician Alan Turing was already thinking about this question. His simple yet profound test has shaped how we evaluate AI for over 70 years–and as these systems grow more capable, the ways we measure them continue to evolve in fascinating ways.
The Original Question: Can Machines Think?
In 1950, Alan Turing published a paper titled “Computing Machinery and Intelligence” that began with a deceptively simple question: “Can machines think?” Rather than getting lost in philosophical debates about the nature of consciousness, Turing proposed a practical test.
Imagine you’re texting with someone. You can’t see them or hear their voice–you only have their written responses. Could you tell if you were chatting with a human or a computer program? If you couldn’t reliably distinguish between the two, Turing suggested, then perhaps the machine deserves to be called “intelligent” in some meaningful way.
This became known as the Turing Test, and it’s remarkably similar to how many of us now interact with AI assistants daily. When you ask a question and receive a helpful, nuanced response, does it matter whether a human or AI composed it? Turing’s insight was that intelligence might be better judged by behaviour than by mechanism.
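To make the setup concrete, here's a minimal sketch of the imitation game in Python. The human, machine, and judge functions are hypothetical stand-ins for the real participants, not anyone's actual implementation:

```python
import random

def imitation_game(questions, human, machine, judge):
    """One round of Turing's imitation game. `human` and `machine` each map
    a question to a written reply; `judge` maps the full transcript to a
    guess. All three are hypothetical stand-ins for real participants."""
    is_machine = random.choice([True, False])          # hide who is answering
    respond = machine if is_machine else human
    transcript = [(q, respond(q)) for q in questions]  # text only, no voice or face
    return judge(transcript) == is_machine             # did the judge guess right?
```

If, over many rounds, the judge is right no more often than a coin flip, the machine has "passed"–the behaviour was indistinguishable, whatever the mechanism underneath.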
Beyond Pass/Fail: How the Measurement of AI Has Evolved
While elegant, the Turing Test has limitations. Its binary pass/fail nature doesn't capture the spectrum of capabilities modern AI systems possess.
Today’s approach to evaluating AI has become much more nuanced:
Task-specific benchmarks measure performance on everything from grammar checking to medical diagnosis
Reasoning assessments evaluate whether AI can follow logical steps to solve problems
Creative tasks test if AI can generate novel, valuable outputs
Safety evaluations determine if AI systems can avoid harmful outputs when prompted
For example, when researchers want to measure an AI’s understanding of physics, they might present it with puzzles about objects in motion rather than asking it to fool a human judge in conversation. This gives us a more detailed picture of where these systems excel and where they still fall short.
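As a rough illustration of how a task-specific benchmark works in practice, here's a minimal scoring harness. The ask_model function and the toy physics items are assumptions for the sketch, not a real benchmark:

```python
def score_benchmark(ask_model, items):
    """Score a model on (question, expected_answer) pairs; returns accuracy,
    i.e. the share of answers that match the expected one."""
    correct = 0
    for question, expected in items:
        answer = ask_model(question)  # hypothetical: plug in your model call here
        if answer.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(items)

# Toy physics-flavoured items in the spirit of the example above
physics_items = [
    ("If you drop a ball, which way does it move?", "down"),
    ("In a vacuum, does a heavy ball fall faster than a light one?", "no"),
]
```

Real benchmarks are far larger and use more careful answer-matching, but the principle is the same: a score on a defined task, rather than a single pass/fail verdict from a fooled judge.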
Humanity’s Last Exam: A New Framework for the AI Era
In recent years, a provocative idea has emerged in AI research circles: what if we’re creating systems that will eventually surpass human capabilities across all domains? This concept, sometimes called “Humanity’s Last Exam,” suggests that the tests we design for AI today might be the last meaningful challenge humans pose before super-intelligent systems begin creating their own benchmarks.
Think about what this means for a moment. The math problems, coding challenges, and reasoning tests we’re using to evaluate today’s AI could be the final exams humans give to machines before they graduate beyond our level of intelligence.
These evaluations aren’t just academic exercises–they’re how we ensure AI systems align with human values before they potentially surpass human capabilities.
Measuring What Matters: Beyond Intelligence to Alignment
Perhaps the most important evolution in how we evaluate AI isn’t about intelligence at all–it’s about alignment with human values and goals.
When selecting an AI system to assist with tasks, raw performance isn’t the only concern. Teams need to know the system will protect privacy, provide unbiased recommendations, and explain its reasoning in understandable ways.
This reflects a broader shift in AI measurement. While we still care about capability, researchers are developing increasingly sophisticated ways to evaluate:
How well AI systems understand and respect human intent
Whether they can explain their reasoning in understandable terms
If they make fair and unbiased decisions
How they handle edge cases and uncertain situations
What's the next step in evolution for intelligent machines?
The Tests We Create Reveal What We Value
There’s a fascinating aspect to this evolution in AI measurement that often goes unnoticed: the tests we design reveal what we truly value in intelligence.
When early AI researchers focused exclusively on logic puzzles and chess, they were expressing a particular view of intelligence centred on calculation and strategic thinking. As our evaluations expanded to include emotional intelligence, creativity, and ethical reasoning, we acknowledged a broader understanding of what makes intelligence valuable.
How we test AI systems today will influence what abilities those systems prioritise tomorrow. It’s like education–if we only test for memorisation, that’s what students will focus on developing.
What You Can Do: Becoming an Informed AI Evaluator
As AI becomes more integrated into our daily lives, each of us becomes an informal evaluator. Here are some practical questions you can ask when interacting with AI systems:
Does it understand the nuance in my request, or am I having to oversimplify?
When it makes a mistake, can it learn from the feedback I provide?
Does it respect boundaries I set, or does it require me to repeatedly reinforce them?
Can it explain its recommendations in terms that help me make better decisions?
These questions aren’t just academic–they help you determine which AI tools genuinely enhance your work and life, and which ones aren’t quite ready for prime time.
The Conversation Continues
The way we measure AI capabilities continues to evolve, reflecting our deepening understanding of both intelligence and what we want from our technological creations.
From Turing’s simple imitation game to today’s multifaceted evaluation frameworks, the question has expanded from “Can machines think?” to “Can machines think in ways that are beneficial, safe, and aligned with human flourishing?”
I’d love to hear your thoughts on this topic. Have you found yourself evaluating AI systems in your work or personal life? What criteria matter most to you?
🛠️ Tool of the Week: Obsidian – A Powerful Free Note-Taking App
If you’re looking for a flexible, offline-first note-taking tool that keeps your thoughts organised without locking you into a specific ecosystem, Obsidian is worth checking out.
Unlike other note-taking apps, Obsidian is built on plain-text Markdown files, meaning your notes are fully in your control—no forced cloud storage, no proprietary formats. But the real magic happens when you start linking your notes together.
I've yet to find one app that "does everything" in terms of dashboards and second brains, but Obsidian is pretty damn close (sorry, Notion 😢), primarily because I can access it all offline and my data's not going to someone else's cloud unless I expressly want it to.
You can even use third-party or local AI to query your notes.
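Because a vault is just a folder of .md files, you can script against it directly. Here's a minimal sketch, assuming your vault lives in a ~/Notes folder, that builds a backlink index from Obsidian's standard [[wikilink]] syntax:

```python
import re
from collections import defaultdict
from pathlib import Path

VAULT = Path.home() / "Notes"  # assumption: point this at your own vault
WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # captures the target in [[Note]], [[Note|alias]], [[Note#heading]]

def backlink_index(vault):
    """Map each note name to the set of notes that link to it."""
    backlinks = defaultdict(set)
    for md in vault.rglob("*.md"):
        for target in WIKILINK.findall(md.read_text(encoding="utf-8")):
            backlinks[target.strip()].add(md.stem)
    return backlinks

for note, sources in sorted(backlink_index(VAULT).items()):
    print(f"{note} <- {', '.join(sorted(sources))}")
```

That openness is the point: the same plain files Obsidian reads can be read by your own scripts, a local AI model, or whatever tool comes next.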
Key features:
✅ Bi-directional linking – Connect ideas and build your own knowledge web
✅ Offline-first – No internet? No problem
✅ Cross-platform – Sync across your computers, tablets and phones, wherever you are
✅ Markdown support – Keep your notes lightweight and future-proof
✅ Customisable with plugins – Supercharge it with themes, templates, and automation, including AI integration
✅ Free for personal use – No subscriptions needed
🚀 Best Use Case? Obsidian is perfect for knowledge workers, researchers, solopreneurs, and creatives who want a powerful, distraction-free way to organise their thoughts—whether it’s for brainstorming, writing, project planning, or even journaling.
💡 Pro Tip: Start small. Use daily notes + linking to create a simple second brain, then explore community plugins when you’re ready.