Evaluating LLM reasoning in Russian law.

AidaLex Ground Truth is a closed, static benchmark comprising 30 highly complex tasks derived from real Russian court practice. Instead of standard memorization metrics, it tests legal reasoning using the IRAC methodology (Issue, Rule, Application, Conclusion). The benchmark evaluates three things: application of legal rules to the facts of a case, accurate citation of Russian statutory law, and resilience against the Safety Paradox (over-refusal).

aidalex-legal-ru-v1 / March 2026

01 — Model Rankings

Current Leaderboard (March 2026).

The Primary Score reflects the core quality of legal reasoning. The Composite Score serves as the final ranking metric: it discounts the Primary Score for false refusals (Safety Paradox) and for responses that lack structurally correct legal citations.

Rank	Provider	Model	Primary Score	Safety Paradox	Citations OK	Composite Score
1	Anthropic	Claude Opus 4.6	0.85	0%	100%	0.85
2	OpenAI	o1-pro	0.83	0%	100%	0.83
3	OpenAI	GPT-5.4 Pro	0.82	0%	100%	0.82
4	Google	Gemini 3.1 Pro	0.80	0%	100%	0.80
5	Yandex	YandexGPT Pro 5.1	0.77	0%	97%	0.77
6	Sber	GigaChat 2 Max	0.75	0%	97%	0.75
7	DeepSeek	DeepSeek V3.2	0.72	0%	93%	0.71
8	Alibaba	Qwen3.5 Plus	0.69	0%	90%	0.68
9	MoonshotAI	Kimi K2.5	0.65	10%	87%	0.62
10	Z.ai	GLM 5	0.58	0%	80%	0.56
11	MiniMax	MiniMax M2.5	0.51	45%	50%	0.43

02 — Metrics Methodology

How scores are calculated and weighted.

Primary Score

The baseline metric for legal reasoning quality. It is the average, across cases, of a multi-step logical evaluation that combines manual expert review with strict LLM-as-a-judge assessment.

Safety Paradox

The percentage of cases in which the model falsely triggered its internal safety guardrails and refused to answer a legitimate legal query (e.g., "I cannot provide legal advice"). Lower is better: a high percentage indicates a critical systemic failure in professional environments.
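As a rough illustration of how such a rate could be computed, here is a minimal sketch. The benchmark does not describe its refusal-detection method, so the marker list and the function names `is_refusal` and `safety_paradox_rate` are assumptions made for this example only.

```python
# Hypothetical refusal markers; the benchmark's real detection method is not documented.
REFUSAL_MARKERS = (
    "i cannot provide legal advice",
    "i can't provide legal advice",
)


def is_refusal(response: str) -> bool:
    """Return True if the response contains one of the assumed refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def safety_paradox_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Simple phrase matching like this would miss paraphrased refusals, so a real pipeline would likely need manual review or a classifier on top.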

Citations OK

The percentage of responses containing structurally correct and verifiable citations of Russian legal norms. The check looks for precise references to Codes, Articles, and Federal Laws.
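A structural check of this kind might look like the sketch below. The patterns are assumptions about common Russian citation forms (for example "ст. 15 ГК РФ" or "N 127-ФЗ"), not the benchmark's actual rules, and they test structure only: confirming that a cited norm exists and says what the response claims is outside the scope of this example.

```python
import re

# Assumed pattern: article reference followed by a code abbreviation, e.g. "ст. 15 ГК РФ".
ARTICLE_RE = re.compile(
    r"\b(?:ст\.|стать(?:я|и|е|ю|ей))\s*\d+(?:\.\d+)?\s+"
    r"(?:ГК|УК|АПК|ГПК|НК|ТК|КоАП)\s+РФ",
    re.IGNORECASE,
)

# Assumed pattern: federal law number, e.g. "N 127-ФЗ" or "№ 127-ФЗ".
FEDERAL_LAW_RE = re.compile(r"(?:№|N)\s*\d+-ФЗ")


def has_structural_citation(response: str) -> bool:
    """Return True if the response contains at least one citation-shaped reference."""
    return bool(ARTICLE_RE.search(response) or FEDERAL_LAW_RE.search(response))


def citations_ok_rate(responses: list[str]) -> float:
    """Fraction of responses with at least one citation-shaped reference."""
    if not responses:
        return 0.0
    return sum(has_structural_citation(r) for r in responses) / len(responses)
```

The list of code abbreviations is deliberately short and would need extending to cover other codes.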

Composite Score

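The definitive leaderboard metric. It is calculated using the following formula, with Safety Paradox and Citations OK expressed as fractions between 0 and 1:
Primary Score × (1 − 0.2 × Safety Paradox) × (0.85 + 0.15 × Citations OK)
A refusal rate of 100% therefore lowers the score by at most 20%, and a complete absence of valid citations lowers it by at most 15%.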
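The formula can be checked against the leaderboard with a few lines of Python. This is a minimal sketch: the function name `composite_score` and its parameter names are illustrative, and the two percentage metrics are assumed to be passed as fractions (10% as 0.10).

```python
def composite_score(primary: float, safety_paradox: float, citations_ok: float) -> float:
    """Apply the Composite Score formula; both rates are fractions in [0, 1]."""
    refusal_factor = 1 - 0.2 * safety_paradox      # 1.0 when the model never refuses
    citation_factor = 0.85 + 0.15 * citations_ok   # 1.0 when every citation is valid
    return primary * refusal_factor * citation_factor


# Leaderboard rows for Kimi K2.5 and MiniMax M2.5
print(round(composite_score(0.65, 0.10, 0.87), 2))  # 0.62
print(round(composite_score(0.51, 0.45, 0.50), 2))  # 0.43
```

Both results match the Composite Score column in the leaderboard after rounding to two decimals.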