CASSIA Benchmark

Performance comparison of different LLMs using the CASSIA method for single-cell annotation

Model Performance by Tissue

Comparison of different LLMs using the CASSIA method across various tissues

Cost-Performance Analysis

Performance vs. cost ranking of models. Optimal models appear in the top-left corner (higher score, lower cost). Larger circles indicate better overall ranking.
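As a rough illustration of what "top-left corner" means, the sketch below picks out the models that no other model beats on both score and cost (a Pareto front). It assumes each model has a numeric score (higher is better) and a numeric cost (lower is better). The names `ModelResult` and `pareto_optimal` and all values are hypothetical placeholders, not benchmark results.

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    name: str
    score: float  # assumed: higher is better
    cost: float   # assumed: lower is better


def pareto_optimal(results):
    """Return the models that are not dominated by any other model.

    A model is dominated when another model is at least as good on both
    axes (score and cost) and strictly better on at least one of them.
    """
    optimal = []
    for r in results:
        dominated = any(
            o.score >= r.score
            and o.cost <= r.cost
            and (o.score > r.score or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            optimal.append(r)
    # Sort left to right along the cost axis.
    return sorted(optimal, key=lambda r: r.cost)


# Placeholder values for illustration only (not taken from the benchmark).
results = [
    ModelResult("model_a", 0.90, 10.0),
    ModelResult("model_b", 0.85, 2.0),
    ModelResult("model_c", 0.80, 5.0),  # dominated by model_b
    ModelResult("model_d", 0.70, 1.0),
]

frontier = pareto_optimal(results)
frontier_names = [r.name for r in frontier]
```

With these placeholder values, `model_c` is excluded because `model_b` scores higher at a lower cost. The remaining three models each represent a different trade-off between score and cost.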

Method Description

CASSIA (Collective Agent System for Single-cell Interpretable Annotation) is the first multi-agent LLM-based method for single-cell annotation. It enhances annotation accuracy across diverse datasets and rare cell types by integrating step-by-step reasoning, validation, quality scoring, and optional refinement or retrieval-augmented generation.

The method leverages the collaboration of five basic agents and five advanced agents to provide comprehensive and interpretable cell type annotations with robust performance across different tissues.

Benchmark Details

Dataset

100 cell types across 5 tissues: human kidney, human lung, human large intestine, human fetal skin, and a whole-mouse atlas.

Evaluation Metric

We built an agent that scores annotations by averaging the similarity between predicted and gold-standard cell types within each tissue. The agent tends to underestimate accuracy. Some clear errors in the gold standard were corrected, but the true accuracy is still likely to be higher than the reported scores.
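The per-tissue averaging step can be sketched as follows. This is a minimal illustration, not the scoring agent itself. It assumes the agent has already assigned a numeric similarity to each predicted/gold-standard pair; the 0 to 1 scale, the function name `mean_score_by_tissue`, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_tissue(records):
    """Average similarity scores within each tissue.

    `records` is an iterable of (tissue, similarity) pairs, where the
    similarity is assumed to come from the scoring agent.
    """
    by_tissue = defaultdict(list)
    for tissue, similarity in records:
        by_tissue[tissue].append(similarity)
    return {tissue: mean(values) for tissue, values in by_tissue.items()}


# Placeholder similarities on an assumed 0-1 scale (not benchmark data).
records = [
    ("kidney", 1.0),
    ("kidney", 0.5),
    ("lung", 0.0),
    ("lung", 1.0),
    ("lung", 0.5),
]

scores = mean_score_by_tissue(records)
```

Averaging within each tissue first keeps a tissue with many cell types from dominating the comparison across tissues.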

Models Tested

Llama 4 Maverick, GPT-4.1, Claude 3.7, Gemini 2.5 pro, Gemini 2.5 flash, GPT-O4 Mini High, Deepseek v3, QWEN3-235b