BehaviorBench: Real-World Data for AI Decision Models

Emma Carter

June 4, 2026

BehaviorBench is a new benchmark for evaluating personalized decision-making AI, built on authentic, wallet-level behavioral data from prediction markets and blockchain records. It features two tasks, belief prediction and transaction prediction, across 2,000 wallets, with more than 141,000 belief instances and nearly 1.49 million transaction instances. The dataset aims to foster AI systems that reflect complex human behavior more accurately than synthetic data allows.

For years, researchers building personalized decision systems have grappled with a fundamental problem: a severe shortage of reliable, real-world user data. Most existing benchmarks lean heavily on simulated users or model-generated behaviors. However, recent studies increasingly highlight a critical flaw in this approach: simulated data often carries systematic biases that diverge significantly from actual human behavior. What performs flawlessly in a lab setting can fall apart when faced with the messy realities of the real world.

Grounding AI in Authentic Human Behavior

BehaviorBench tackles this issue head-on with a straightforward premise: if synthetic data isn't cutting it, let's use genuine behavioral traces. The research team meticulously reconstructed the complete decision histories of 2,000 distinct wallets by sifting through publicly available prediction market and blockchain transaction records. This isn't some sanitized, experimental playground; it's a raw, high-stakes arena where real money is on the line, with every transaction reflecting genuine market judgment and individual risk appetite.

The benchmark is structured around two complementary tasks: belief prediction and transaction prediction. The former challenges models to forecast a user's ultimate stance and confidence level within a market, while the latter goes deeper, requiring predictions of the direction and amount of individual transactions. This two-level design captures both long-term user perspectives and granular, short-term trading patterns.
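To make the two tasks concrete, here is a minimal Python sketch of what a single instance of each might look like, along with two simple scores for the transaction task. The class names, field names, label values ("yes"/"no", "buy"/"sell"), and metrics are illustrative assumptions, not the benchmark's actual schema or evaluation protocol.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BeliefInstance:
    """Hypothetical belief-prediction target: final stance and confidence."""
    wallet_id: str
    market_id: str
    stance: str        # assumed labels, e.g. "yes" or "no"
    confidence: float  # assumed to lie in [0, 1]


@dataclass(frozen=True)
class TransactionInstance:
    """Hypothetical transaction-prediction target: direction and amount."""
    wallet_id: str
    market_id: str
    direction: str  # assumed labels, e.g. "buy" or "sell"
    amount: float


def direction_accuracy(
    predicted: Sequence[TransactionInstance],
    actual: Sequence[TransactionInstance],
) -> float:
    """Fraction of transactions whose predicted direction matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    hits = sum(p.direction == a.direction for p, a in zip(predicted, actual))
    return hits / len(actual)


def mean_absolute_amount_error(
    predicted: Sequence[TransactionInstance],
    actual: Sequence[TransactionInstance],
) -> float:
    """Average absolute gap between predicted and actual transaction amounts."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    total = sum(abs(p.amount - a.amount) for p, a in zip(predicted, actual))
    return total / len(actual)
```

Scoring direction and amount separately mirrors the way the article describes the transaction task; the paper may well use different metrics.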

Scale, Structure, and Real-World Nuance

As detailed in the accompanying paper, BehaviorBench comprises 141,445 belief prediction instances and 1,485,972 transaction prediction instances, a volume large enough to train and rigorously evaluate deep neural networks. Crucially, each wallet's historical record forms a complete user profile, detailing when positions were opened or closed and how risk was managed. These intricate behavioral patterns are extremely difficult, if not impossible, to replicate with synthetic data.
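As a rough illustration of how per-wallet profiles could be assembled from flat records, the sketch below groups records by wallet and orders each group chronologically. The `Record` fields and the "open"/"close" action labels are hypothetical; the team's actual reconstruction pipeline is not described in this article.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Record(NamedTuple):
    """Hypothetical flat record; field names are assumptions."""
    wallet_id: str
    timestamp: int  # e.g. Unix seconds
    action: str     # assumed labels: "open" or "close"
    amount: float


def build_wallet_histories(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Group records by wallet and sort each wallet's records by time."""
    histories: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        histories[record.wallet_id].append(record)
    for wallet_records in histories.values():
        wallet_records.sort(key=lambda r: r.timestamp)
    return dict(histories)


if __name__ == "__main__":
    raw = [
        Record("0xabc", 300, "close", 50.0),
        Record("0xdef", 100, "open", 20.0),
        Record("0xabc", 200, "open", 50.0),
    ]
    for wallet, history in build_wallet_histories(raw).items():
        print(wallet, [(r.timestamp, r.action) for r in history])
```

Keeping every record, rather than dropping ones that look anomalous, is consistent with the benchmark's stated aim of preserving complete decision histories.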

A particularly insightful design choice by the team was to deliberately retain real-world noise. For instance, users might make irrational decisions driven by emotion. In traditional benchmarks, such behaviors are often filtered out as anomalies. BehaviorBench, however, treats them as valid signals. Keeping this noise forces models to learn how to handle the inherent imperfections of human decision-making, rather than just optimizing for idealized scenarios.

Implications for AI Research and Development

BehaviorBench fills a significant gap in the evaluation landscape. For researchers developing personalized recommendation systems, adaptive interfaces, or financial assistants, it offers a far more realistic testing ground. While models can be trained on simulated data, their true mettle will be tested against these authentic behavioral trajectories. Can the AI truly understand a user's genuine intent, or is it merely repeating patterns it has been shown?

Of course, the benchmark isn't without its limitations. Its data sources are confined to prediction markets and blockchain transactions, so its generalizability to other decision domains, such as e-commerce browsing or content consumption, still needs to be thoroughly validated. Additionally, while wallet-level data is rich, the absence of direct user identity information prevents linking a user's behavior across platforms.

Despite these caveats, BehaviorBench's open-source nature and structured design provide a robust stepping stone for the industry. It serves as a powerful reminder: for AI systems to genuinely assist human decision-making, they must first learn to truly comprehend how humans make decisions in the wild.

Tags: decision modeling, benchmark, behavioral data, user prediction, personalized AI, AI research, prediction markets, blockchain data, human behavior