
Introducing q Evaluation Harness: The First Open-Source Evaluation Framework for LLMs on q/kdb+



Large Language Models score 96.2% on Python's HumanEval. But on equivalent q/kdb+ problems? Even the best model, Grok 4, manages only 43.4%, a devastating 55% relative drop in performance.

Co-author: Andrew Morrison

Performance comparison: Python HumanEval (o1-mini, 96.2%) versus q-HumanEval (Grok 4, 43.4%).

This isn't just a minor inconvenience. It represents a critical gap in AI tooling for one of the most important programming languages in quantitative finance. q/kdb+ powers time-series analysis, risk management, and real-time trading systems at major financial institutions worldwide. Yet until now, there was no standardized way to evaluate or improve LLM performance on q code generation.

The problem isn't just that q is "different". It's that the absence of proper evaluation benchmarks has left the entire q development community without a roadmap for AI-assisted programming. We created q Evaluation Harness to address both immediate and long-term needs: helping q developers choose the right models today while providing the rigorous benchmarking foundation needed to develop specialized q language models for tomorrow.

Why q is Uniquely Challenging for LLMs

q creates unique challenges for LLMs beyond just being "different." Three key factors make it particularly difficult:

  1. Scarce Training Data: q code is rarely public due to its use in proprietary financial systems, leaving LLMs with minimal q examples during training.
  2. Right-to-Left Evaluation: q evaluates expressions right-to-left with no operator precedence (for example, 2*3+4 is 14 in q, not 10), fundamentally different from the left-to-right conventions LLMs learned from other languages.
  3. Array-Oriented Thinking: q emerged from mathematical notation, not procedural programming, so solutions are expressed as whole-array transformations rather than element-by-element loops.

Consider finding all values above the average in a dataset:

Python (familiar to LLMs):

def above_average(data: list) -> list:
    avg = sum(data) / len(data)
    result = []
    for value in data:
        if value > avg:
            result.append(value)
    return result

Idiomatic q:

aboveAvg: {x where x > avg x}

The q version operates on the entire array at once: x > avg x creates a boolean vector, where filters it (no explicit loops). Where Python thinks step-by-step, q thinks in whole-array transformations. This combination of scarce data, unique evaluation rules, and array-oriented paradigms creates the perfect storm for LLM confusion.
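For readers coming from Python, NumPy offers a rough analogue of this whole-array style. The snippet below is an illustrative comparison only, not part of the benchmark:

```python
import numpy as np

def above_avg(data):
    """Vectorized analogue of the q expression {x where x > avg x}:
    compare the whole array against its mean, then use the resulting
    boolean mask to filter -- no explicit loop, mirroring q's where."""
    data = np.asarray(data)
    return data[data > data.mean()]

print(above_avg([1, 2, 3, 10]))  # -> [10]
```

The boolean mask `data > data.mean()` plays the same role as q's `x > avg x`, and indexing with it plays the role of `where`.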

How We Built the Evaluation Framework

Creating q Evaluation Harness required solving three core challenges: dataset translation, optimized generation, and reliable execution. Here's how we approached each.

Dataset Creation: From HumanEval to q-HumanEval

We started with OpenAI's HumanEval, the gold standard for code generation evaluation, and carefully adapted it for q programming. This wasn't a simple syntax translation. Each problem required thoughtful conversion to preserve the underlying algorithmic challenge while embracing q's idiomatic patterns.

Our translation process followed three key principles:

  1. QDoc Format: We rewrote problem descriptions using QDoc style format, ensuring prompts felt natural to q developers.
  2. Idiomatic Preservation: Rather than directly translating Python logic, we restructured problems and input prompts to encourage array-oriented solutions that leverage q's strengths.
  3. Hybrid Verification: Generated q code is tested against Python assertions. This approach allows us to leverage existing dataset benchmarks for Python while focusing our effort on prompt adaptation rather than test case rewriting.

The result is q-HumanEval: 164 carefully crafted programming problems that test everything from basic data manipulation to complex algorithmic thinking, all designed to elicit idiomatic q solutions.
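The hybrid verification flow can be sketched as follows; `run_q` is a hypothetical stand-in for the PyKX-backed q executor, stubbed out here so the control flow stays self-contained:

```python
def hybrid_check(q_source, run_q, python_asserts):
    """Hybrid verification sketch: evaluate candidate q code via a
    supplied executor, then validate its output with the original
    Python test assertions. `run_q` is a hypothetical callable that
    maps a q expression to a Python value."""
    try:
        candidate = run_q(q_source)       # execute the generated q code
    except Exception:
        return False                      # crashes count as failures
    try:
        python_asserts(candidate)         # reuse the Python-side assertions
        return True
    except AssertionError:
        return False

# Stub executor standing in for a real q process:
fake_run_q = lambda src: [5, 6] if "where" in src else None
ok = hybrid_check("{x where x > avg x} 1 2 5 6",
                  fake_run_q,
                  lambda out: [int(v) for v in out] == [5, 6])
print(ok)  # -> True
```

In the real harness the executor talks to an isolated q process rather than a lambda, but the pass/fail logic follows this shape.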

Generation Engine: Optimized for Speed

We support three backends: vLLM for high-throughput batch generation, HuggingFace for flexible model support, and LiteLLM for hosted API access. Our orchestrator automatically selects the optimal strategy for each model type. If auto-detection fails, you can specify the backend manually with --backend.

Generation is as simple as:

qeval generate q-humaneval google/gemma-3-4b-it

This creates a JSONL file with all model completions (e.g., solutions_google_gemma-3-4b-it.jsonl).
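JSONL stores one JSON object per line, so completions can be streamed record by record without loading the whole file. A minimal reader sketch, with the field names (`task_id`, `completion`) assumed here for illustration; the harness's actual schema may differ:

```python
import io
import json

# Two hypothetical records, in place of a real solutions file:
sample = io.StringIO(
    '{"task_id": "q-HumanEval/0", "completion": "{x where x > avg x}"}\n'
    '{"task_id": "q-HumanEval/1", "completion": "{sum x}"}\n'
)

# One json.loads call per non-empty line is all JSONL parsing requires.
records = [json.loads(line) for line in sample if line.strip()]
print(len(records), records[0]["task_id"])  # -> 2 q-HumanEval/0
```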

Sample Size Matters: For reliable Pass@k evaluation (where Pass@k measures success rate when generating k attempts per problem), we determined that q-HumanEval requires at least 50 samples per problem to achieve statistically significant results (±3 percentage point confidence intervals at 95% confidence level). This uses Wilson confidence intervals with independent seeds and accounts for worst-case variance. The framework defaults to 50 samples for rigorous evaluation (use --num-samples to adjust if needed).
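The original HumanEval work estimates Pass@k per problem with an unbiased combinatorial estimator; a small reference implementation is shown below (a sketch for readers, not necessarily the harness's exact code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem: n samples drawn,
    c of them correct. Probability that at least one of k randomly
    chosen samples passes is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 50 samples of which 20 pass, Pass@10 is already close to 1:
print(round(pass_at_k(n=50, c=20, k=10), 4))
```

Averaging this quantity over all 164 problems gives the benchmark-level Pass@k scores reported on the leaderboard.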

We've made our best effort to optimize performance and plan to continue expanding backend support in future releases.

Execution Engine: q Code Meets Python Tests

Once you have generated solutions, the execution engine tests them against the dataset. It solves a tricky problem: how do you reliably test LLM-generated q code that might crash, hang, or produce unexpected outputs? Our solution uses PyKX to create isolated q processes that communicate via IPC, providing robust timeout handling and error capture.
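The isolate-and-timeout pattern can be illustrated with Python's standard subprocess machinery. The real engine drives q processes over PyKX IPC, but the control flow is similar; here a child Python interpreter stands in for q:

```python
import subprocess
import sys

def run_with_timeout(code, timeout_s=5.0):
    """Run untrusted code in a separate process and capture the result.
    Sketch of the pattern only: a crash surfaces as a nonzero return
    code, and a hang is cut off by the timeout (the child is killed)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        status = "ok" if proc.returncode == 0 else "error"
        return (status, proc.stdout.strip())
    except subprocess.TimeoutExpired:
        return ("timeout", "")

print(run_with_timeout("print(2 + 2)"))           # -> ('ok', '4')
print(run_with_timeout("while True: pass", 0.5))  # -> ('timeout', '')
```

Running each candidate in its own process means a crash or infinite loop in generated code cannot take down the evaluation run itself.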

To evaluate the Gemma solutions we generated above:

qeval execute solutions_google_gemma-3-4b-it.jsonl q-humaneval

Important Security Note: The current execution environment is not sandboxed. Generated code runs in your local environment with the same permissions as the evaluation process. While we provide timeout protection, you should only run evaluations with trusted models and in isolated environments.

We're actively working on safer execution methods, including a planned MCP (Model Context Protocol) server that will provide true sandboxed q execution in future releases.

Results: What We Learned About LLMs and q

Comprehensive evaluation across multiple models provides clear data on the current state of q code generation.

The Current Leaderboard

Performance improvements from Pass@1 to Pass@10 across top models.

View the live leaderboard: Leaderboard on GitHub

Our leaderboard shows a clear hierarchy:

  • Grok 4 leads with 43.37% Pass@1, reaching 74.32% at Pass@10
  • Claude 4 Sonnet follows at 37.70% Pass@1, achieving 59.13% at Pass@10
  • Gemini 2.5 Pro rounds out the top three with 27.75% Pass@1, climbing to 59.68% at Pass@10

The key finding isn't about individual model performance. It's the consistent improvement across all models with multiple generation attempts.

The Power of Multiple Generations

Every model in our evaluation showed substantial improvement when allowed multiple attempts. Grok 4 gains 31 percentage points from Pass@1 to Pass@10, while GPT-5 improves by 38 points. This pattern indicates that q programming benefits significantly from iterative refinement.

This has practical implications for q developers. If you're building AI-assisted q development tools, implementing multi-shot generation with selection mechanisms (like running tests against multiple candidates) can significantly improve success rates. The data also suggests that q is well-suited for agentic programming approaches and test-driven development workflows.
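One way to sketch such a selection mechanism, with `generate` and `passes_tests` as hypothetical hooks for the model call and the test runner:

```python
import random

def best_of_n(generate, passes_tests, n=10, seed=0):
    """Multi-shot generation with test-based selection: sample up to n
    candidate solutions and return the first one that passes the
    problem's tests, or None if every attempt fails."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = generate(rng)
        if passes_tests(candidate):
            return candidate
    return None

# Toy stand-ins: a "model" that is right about 30% of the time per attempt.
gen = lambda rng: "{x where x > avg x}" if rng.random() < 0.3 else "{x}"
check = lambda src: src == "{x where x > avg x}"
print(best_of_n(gen, check) is not None)
```

With a per-attempt success rate p, the chance that at least one of n attempts passes is 1 - (1 - p)^n, which is exactly why the Pass@10 column climbs so sharply above Pass@1.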

Practical Recommendations

Based on our evaluation results, here's our guidance for q developers:

For Single-Shot Generation (when you need immediate results):

  • Grok 4 is your best bet at 43.37% success rate
  • Claude 4 Sonnet provides a solid alternative at 37.70%

For Agentic Workflows (when you can iterate):

  • All top models improve dramatically with multiple attempts, in some cases nearly doubling their Pass@1 scores
  • GPT-5 shows the biggest improvement curve, making it excellent for iterative development

For Budget-Conscious Development:

  • Standard open-source models still lag significantly behind proprietary ones for q (see our full leaderboard for detailed open-source results)
  • If you must use open-source, combine with multiple generation attempts

Note: These recommendations are based on q-HumanEval results. We're extending evaluation to additional standard datasets (q-MBPP and domain-specific benchmarks) to provide more comprehensive guidance in future releases.

Ready to Evaluate Your Models?

q Evaluation Harness is designed to be simple to use. Visit our GitHub repository for installation instructions and start evaluating your favorite models on q programming tasks.

And once you have results, share them to help us build the definitive picture of LLM performance on q programming.

Shape the Future of q AI

q Evaluation Harness represents the beginning, not the end, of standardized q language model evaluation. By providing rigorous benchmarks and open evaluation tools, we're enabling the entire q community to participate in AI-assisted development.

The performance gaps we've documented aren't permanent limitations. They're opportunities. With proper evaluation frameworks, we can now systematically work on improving LLM performance for q programming through better training data, specialized fine-tuning, and targeted prompt engineering.

Contribute to the Leaderboard

Ready to showcase your model's q programming capabilities? We encourage the community to train specialized models and submit results to our official leaderboard following the instructions given in the submission guide.

Whether you're a q developer curious about AI assistance, a researcher working on code generation, or an organization looking to improve q development productivity, q Evaluation Harness gives you the tools to measure progress and guide improvement.

The framework is open source and designed for community contribution. We welcome dataset contributions, model evaluations, and specialized benchmarks. Together, we can close the performance gap and bring q programming into the age of AI-assisted development.

What's Coming Next

As our community grows and contributes, q Evaluation Harness will evolve beyond basic Pass@k evaluation. Here's what we're building:

Advanced Code Quality Assessment

Pass@k metrics tell us if code works, but not if it's good q code. We're developing LLM-as-judge evaluation methods that assess quality beyond correctness:

  • Idiomatic Style Assessment: Detecting "Pythonic q" versus true q idioms
  • Performance Characteristics: Evaluating whether solutions leverage q's array processing efficiently

This will help us move beyond "does it work?" to "is it good q code?", a crucial distinction for practical AI-assisted development.

Expanding the Ecosystem

New Datasets: q-MBPP and domain-specific benchmarks will provide comprehensive coverage of q programming tasks.

Community Extensions: Adding new datasets is already straightforward using our dataset submission guide. Custom metrics and specialized evaluation approaches are coming soon.

Your contributions today help shape these developments. Every model evaluation, dataset submission, and community discussion drives us closer to a truly comprehensive q AI assessment.


q Evaluation Harness is developed by KX and released under the MIT license. Visit our GitHub repository to get started, contribute, or join the discussion about the future of AI-assisted q programming.