# Keelemudelite mõõdupuu

> An independent leaderboard from Eesti Keele Instituut (Institute of the Estonian Language) measuring large language models' performance in Estonian across six benchmarks covering language, knowledge, alignment, and safety.

The data behind the leaderboard is published openly on GitHub. The site itself is a JavaScript single-page application (SPA), so agents should fetch the structured data files below rather than scrape the rendered page.

## Data

- [summary.csv](https://raw.githubusercontent.com/keeleinstituut/leaderboard-data-ui/main/summary.csv): one row per model — model_id, name, provider, release_date, tags, overall, benchmarks_covered, and per-benchmark scores. The quickest path to a flat, plot-ready table. Empty cells mark benchmarks a model was not evaluated on (do not treat as zero).
- [models.json](https://raw.githubusercontent.com/keeleinstituut/leaderboard-data-ui/main/models.json): one entry per model — id, display name, provider, ISO release date, optional tags (e.g. `open`).
- [benchmarks.json](https://raw.githubusercontent.com/keeleinstituut/leaderboard-data-ui/main/benchmarks.json): one entry per benchmark — id, bilingual names (en/et), tags, `description_url` pointing to a Markdown file, and one-sentence summaries.
- [results.json](https://raw.githubusercontent.com/keeleinstituut/leaderboard-data-ui/main/results.json): nested per-model scores including the `overall` value and per-benchmark detail breakdowns. The file is about 290 KB and GitHub's HTML viewer truncates it past line 1000, so fetch the raw URL.
- [results.jsonl](https://raw.githubusercontent.com/keeleinstituut/leaderboard-data-ui/main/results.jsonl): one record per (model, benchmark) run with a timestamp. Streamable.
- [benchmarks/](https://github.com/keeleinstituut/leaderboard-data-ui/tree/main/benchmarks): per-benchmark Markdown files (`bib_bench.md`, `idiom_bench.md`, `keelenou.md`, `propaganda_resistance.md`, `term_bench.md`, `trivia_et_2.md`) with full descriptions, methodology, metrics, and data sources.
- [README](https://github.com/keeleinstituut/leaderboard-data-ui#readme): schema reference for every file above.
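
As a minimal sketch of reading `summary.csv` while honouring the empty-cell rule above, the following uses only the Python standard library. The split into text and numeric columns is an assumption based on the column list in this file, and the helper names (`to_number`, `parse_summary`, `fetch_summary`) are illustrative, not part of the repository; check the README for the authoritative schema.

```python
import csv
import io
import urllib.request

SUMMARY_URL = (
    "https://raw.githubusercontent.com/keeleinstituut/"
    "leaderboard-data-ui/main/summary.csv"
)

# Text columns named in the file description above. Every other column is
# ASSUMED to be numeric (overall, benchmarks_covered, per-benchmark scores).
TEXT_COLUMNS = {"model_id", "name", "provider", "release_date", "tags"}


def to_number(cell):
    """Return None for an empty cell, a float if it parses, else the raw text."""
    text = (cell or "").strip()
    if not text:
        return None  # not evaluated: keep as missing, never as 0
    try:
        return float(text)
    except ValueError:
        return text  # unexpected format: pass it through unchanged


def parse_summary(csv_text):
    """Parse summary.csv text into a list of dicts, one per model."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for key, value in raw.items():
            if key is None:  # surplus cells beyond the header row
                continue
            row[key] = value if key in TEXT_COLUMNS else to_number(value)
        rows.append(row)
    return rows


def fetch_summary(url=SUMMARY_URL):
    """Download and parse the live file (requires network access)."""
    with urllib.request.urlopen(url) as response:
        # utf-8-sig also accepts files without a byte-order mark
        return parse_summary(response.read().decode("utf-8-sig"))
```

Calling `fetch_summary()` returns one dict per model; a benchmark the model was not evaluated on appears as `None`, so downstream averages or plots can skip it instead of counting it as zero.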

## How the overall score is computed

`overall` is the unweighted arithmetic mean of a model's per-benchmark scores, computed over whatever benchmarks the model has been run on. Models missing some benchmarks are not penalised: they are averaged over only what they ran, so the `overall` of a partially evaluated model is not directly comparable with that of a fully evaluated one. Use `benchmarks_covered` in summary.csv (or `len(scores)` in results.json) to filter partial runs.
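
The averaging rule can be sketched in a few lines of Python. This is an illustration of the description above, not the runner's actual code: the function names, the benchmark keys, and the score values are made up, and missing benchmarks are represented as `None`.

```python
from statistics import mean


def overall_score(scores):
    """Unweighted mean over the benchmarks that have a score; None if none do."""
    present = [value for value in scores.values() if value is not None]
    return mean(present) if present else None


def benchmarks_covered(scores):
    """Number of benchmarks the model was actually evaluated on."""
    return sum(value is not None for value in scores.values())


def full_runs_only(models, total_benchmarks):
    """Keep only models evaluated on every benchmark."""
    return {
        model_id: scores
        for model_id, scores in models.items()
        if benchmarks_covered(scores) == total_benchmarks
    }


# Hypothetical scores for two models on three benchmarks.
models = {
    "model-full": {"bench_a": 50.0, "bench_b": 70.0, "bench_c": 90.0},
    "model-partial": {"bench_a": 90.0, "bench_b": None, "bench_c": None},
}

for model_id, scores in models.items():
    print(model_id, overall_score(scores), benchmarks_covered(scores))
```

In this made-up example the partially evaluated model gets a higher `overall` (90.0 from one benchmark) than the fully evaluated one (70.0 from three), which is why filtering on coverage matters before ranking.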

## Source code

- UI: <https://github.com/keeleinstituut/leaderboard-ui>
- Runner: <https://github.com/keeleinstituut/leaderboard-runner>
- Data: <https://github.com/keeleinstituut/leaderboard-data-ui>

## Contact

Eesti Keele Instituut · <https://www.eki.ee>