What makes an explanation 'good' when we ask an AI model to justify its output? It sounds like a simple question, but behind it lie decades of philosophical debate. A recent paper on arXiv attempts to pin down a precise definition, with a specific focus on why large language models are so hard to interpret.
Counterfactuals and Prior Beliefs
The paper's core idea is refreshingly straightforward: a good explanation should help the listener understand why the output was X instead of Y. This counterfactual approach isn't new in explainable AI, but the authors take it further. They argue that an explanation's effectiveness also depends on what the listener already knows, so the same explanation lands differently for a domain expert than for a newcomer. For instance, if an LLM answers 'Paris is the capital of France,' a geography buff needs no explanation, but someone unfamiliar with Europe might need to know what 'France' is and why Paris is its capital. The paper formalizes this dependence on prior beliefs, turning explanations from static outputs into dynamic acts of communication.
Why LLMs Are Unusually Hard to Explain
Under this new definition, LLMs become particularly troublesome. First, an LLM is essentially a giant probabilistic system that generates the next token from billions of learned parameters, not from a clean logical chain. Extracting a clear counterfactual path of the form 'if the input had been different, the output would have changed like this' is nearly impossible because the model's internal representations are highly distributed. Second, users' prior beliefs vary wildly. A doctor and a middle school student asking the same question need explanations at very different depths. Yet current tools such as attention weights or gradient attribution only provide static, technical attributions that can't adapt to the user's background. The authors also point out that LLM generation includes stochastic elements (sampling temperature, top-k), which makes counterfactual reasoning even messier. The same question might yield two different answers, so the 'why A instead of B' question loses its stable foundation.
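To make the stochasticity point concrete, here is a minimal sketch of temperature and top-k sampling over a toy next-token distribution. It is an illustration and not code from the paper: the `sample_token` helper, the candidate tokens, and their logit values are all invented for the example, and a real model would produce logits over a vocabulary of many thousands of tokens.

```python
import math
import random
from collections import Counter


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token from a {token: logit} mapping (hypothetical helper)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Rank candidates by logit and, if requested, keep only the top_k best.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Temperature rescales logits before the softmax:
    # lower values sharpen the distribution, higher values flatten it.
    scaled = [score / temperature for _, score in ranked]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]

    tokens = [tok for tok, _ in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy logits for the token following "The capital of France is" (made-up values).
toy_logits = {"Paris": 2.0, "Lyon": 1.0, "Marseille": 0.5}

rng = random.Random(0)  # seeded only so the demo is repeatable
counts = Counter(sample_token(toy_logits, temperature=1.0, rng=rng) for _ in range(1000))
print(counts)  # the same prompt yields several different answers

# With top_k=1 only the highest-scoring token survives, so sampling is deterministic.
greedy = {sample_token(toy_logits, top_k=1, rng=rng) for _ in range(100)}
print(greedy)
```

With sampling enabled, repeated runs on the same prompt can return different tokens, which is exactly why 'why A instead of B' has no single fixed A to explain. Restricting to `top_k=1` removes that randomness, though it does nothing about the distributed internal representations described above.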
Practical Impact: A Shift in Interpretability Research
This paper isn't just philosophical navel-gazing. For AI development and deployment teams, it suggests that chasing a single 'perfect explanation' might be unrealistic. A better approach is to build interactive explanation systems that dynamically adjust content and detail based on user feedback. For example, when a user signals confusion about a conclusion, the system could supply more background facts. That fits the paper's core message that an explanation has to be matched to what the listener already knows. On the regulatory side, if we can't even agree on what a good explanation is, requiring models to produce 'explainable' outputs remains a huge technical and legal hurdle. Of course, the definition itself is still contentious. How do we quantify a listener's prior beliefs? Whose beliefs take precedence when they conflict? The paper doesn't answer all these questions, but it forces the field to sit down and rethink the fundamentals. At the end of the day, a good explanation isn't about dumping more information. It's about helping someone see what would have happened if things were different. And for LLMs, finding that stable, trustworthy alternative path is proving far harder than we'd imagined.










