LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning

Introduction

LingLan is a large-scale, expert-curated benchmark for evaluating large language models (LLMs) in Traditional Chinese Medicine (TCM). It spans five domains (licensing exam, fundamental knowledge, Chinese patent medicine, information extraction, diagnostic–therapeutic reasoning) with 13 subtasks and 25,624 instances. Each dataset includes a 400-item Hard subset to probe robustness.

What’s in LingLan

LingLan covers five domains with diverse formats (single-choice, multiple-choice, cloze, multi-label lists, dosage vectors) under a unified metric system:

single-choice: Accuracy
multi-choice: instance-level Accuracy; option-level Precision/Recall/F1
cloze: character-level F1
extraction & multi-label: list-level Precision/Recall/F1
dosage: cosine similarity (primary), MAE when prescription overlap is adequate
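The multiple-choice and cloze metrics above can be sketched as follows. This is a minimal illustration, not the released scoring code: the function names `char_f1` and `multi_choice_scores` are hypothetical, character-level F1 is assumed to use multiset character overlap, and option-level scores are assumed to be micro-averaged over all instances.

```python
from collections import Counter


def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 via multiset character overlap (assumed definition)."""
    if not pred or not gold:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def multi_choice_scores(preds, golds):
    """preds and golds are parallel lists of sets of option letters."""
    # Instance-level accuracy: the predicted option set must match exactly.
    exact = sum(p == g for p, g in zip(preds, golds))
    # Option-level counts, micro-averaged over instances (an assumption).
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    n_pred = sum(len(p) for p in preds)
    n_gold = sum(len(g) for g in golds)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = exact / len(golds) if golds else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Under these assumptions, a prediction of {C} against a gold answer of {C, D} scores zero on instance-level accuracy but still earns partial option-level credit, which is why the two views are reported side by side.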

LingLan further introduces decision-recognition versions of clinical tasks (single-choice) for clean, reproducible comparisons.

Data Statistics

TCM licensing exam: 1,832
Fundamental TCM knowledge (single/multi-choice, cloze): 5,844
Chinese patent medicine (single/multi-choice, cloze): 5,948
Decision recognition (syndrome / treatment / prescription): 2,000 + 2,000 + 2,000
Diagnostic–therapeutic reasoning (syndrome / treatment / prescription / dosage): 2,000
Information extraction — classical: 2,000
Information extraction — EMR: 2,000
Total: 9 files, 25,624 items (each dataset includes a 400-item Hard subset)
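As a quick sanity check, the per-dataset counts listed above do add up to the stated total. The dictionary keys below are descriptive labels chosen for this sketch, not the actual file names in the repository.

```python
# Per-dataset instance counts copied from the statistics above.
# Keys are illustrative labels, not real file names.
counts = {
    "licensing_exam": 1832,
    "fundamental_knowledge": 5844,
    "chinese_patent_medicine": 5948,
    "decision_syndrome": 2000,
    "decision_treatment": 2000,
    "decision_prescription": 2000,
    "diagnostic_therapeutic_reasoning": 2000,
    "information_extraction_classical": 2000,
    "information_extraction_emr": 2000,
}

n_files = len(counts)
total_items = sum(counts.values())
print(n_files, total_items)
```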

Evaluation Metrics

Classification: Accuracy (single-choice; decision recognition)
Multi-choice: Accuracy; option-level Precision/Recall/F1
Cloze: character-level F1
Extraction & multi-label clinical tasks: list-level Precision/Recall/F1
Dosage: cosine similarity (primary); MAE optionally reported when prescription overlap is reasonable
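A possible reading of the list-level and dosage metrics is sketched below. The function names, the herb-to-dose dictionary representation, and the `min_overlap` threshold are all assumptions made for illustration; the README does not state what counts as reasonable prescription overlap.

```python
import math


def list_prf(pred, gold):
    """List-level precision, recall and F1 over de-duplicated items."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def dosage_scores(pred, gold, min_overlap=0.5):
    """pred and gold map herb name -> dose. Returns (cosine, mae_or_None).

    min_overlap is a hypothetical threshold: MAE is reported only when at
    least that fraction of the gold herbs also appears in the prediction.
    """
    herbs = sorted(set(pred) | set(gold))
    p = [pred.get(h, 0.0) for h in herbs]  # missing herbs get dose 0
    g = [gold.get(h, 0.0) for h in herbs]
    dot = sum(a * b for a, b in zip(p, g))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
    cosine = dot / norm if norm else 0.0

    shared = set(pred) & set(gold)
    overlap = len(shared) / len(gold) if gold else 0.0
    mae = None
    if shared and overlap >= min_overlap:
        mae = sum(abs(pred[h] - gold[h]) for h in shared) / len(shared)
    return cosine, mae
```

Cosine similarity is computed over the union of herbs, so it stays defined even when the predicted and reference prescriptions share nothing. MAE is computed only over shared herbs, which is why it is meaningful only when the overlap is large enough.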

Download

data: LingLan/data

Quick Start

Refer to simple-evals for more details.

Evaluation code: tcm_evals

The annotations will not be released until the paper is accepted; therefore, the evaluation code is not currently executable.

Leaderboards

Concise Leaderboard

Overall averages (%) on the Full and Hard sets; Δ is Full minus Hard, in percentage points (pp). Every model loses between 15.3 and 22.3 pp on Hard, a marked gap between current LLMs and expert-level TCM reasoning.

Rank	Model	Full	Hard	Δ (pp)
1	DeepSeek-R1	51.1	31.9	19.2
2	DeepSeek-V3.1	50.9	31.2	19.7
3	Qwen3-235B-A22B	50.6	30.8	19.8
4	Qwen3-32B	48.4	28.8	19.6
5	GPT-5	48.1	28.0	20.1
6	Qwen3-14B	47.5	27.2	20.3
7	Qwen3-30B-A3B	47.2	26.7	20.5
8	Baichuan-M2-32B	47.1	26.2	20.9
9	Qwen3-8B	46.2	25.4	20.8
10	Qwen3-4B	43.8	24.5	19.3
11	Qwen3-Next-80B-A3B	44.9	22.6	22.3
12	GPT-5-mini	43.9	23.3	20.6
13	GPT-OSS-120B	40.7	23.1	17.6
14	GPT-OSS-20B	36.1	20.8	15.3
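The Δ column is the Full score minus the Hard score. A small sketch that recomputes it for three rows copied from the table above (the helper name `hard_gap` is hypothetical):

```python
# (model, Full %, Hard %) copied from the leaderboard above.
rows = [
    ("DeepSeek-R1", 51.1, 31.9),
    ("Qwen3-Next-80B-A3B", 44.9, 22.6),
    ("GPT-OSS-20B", 36.1, 20.8),
]


def hard_gap(full: float, hard: float) -> float:
    """Drop from Full to Hard in percentage points, rounded to one decimal."""
    return round(full - hard, 1)


gaps = {model: hard_gap(full, hard) for model, full, hard in rows}
for model, gap in gaps.items():
    print(f"{model}: {gap} pp")
```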

Cite LingLan

@misc{hua2026linglanmidiansystematicevaluationllms,
      title={LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning}, 
      author={Rui Hua and Yu Wei and Zixin Shu and Kai Chang and Dengying Yan and Jianan Xia and Zeyu Liu and Hui Zhu and Shujie Song and Mingzhong Xiao and Xiaodong Li and Dongmei Jia and Zhuye Gao and Yanyan Meng and Naixuan Zhao and Yu Fu and Haibin Yu and Benman Yu and Yuanyuan Chen and Fei Dong and Zhizhou Meng and Pengcheng Yang and Songxue Zhao and Lijuan Pei and Yunhui Hu and Kan Ding and Jiayuan Duan and Wenmao Yin and Yang Gu and Runshun Zhang and Qiang Zhu and Jian Yu and Jiansheng Li and Baoyan Liu and Wenjia Wang and Xuezhong Zhou},
      year={2026},
      eprint={2602.01779},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.01779}, 
}


@misc{LingLan2026,
  title={LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning},
  author={Ruihua},
  year={2026},
  note={https://github.com/TCMAI-BJTU/LingLan}
}

License & Ethics

Data are curated from de-identified clinical records, classical texts, official exam sources, and structured knowledge resources.

Acknowledgements

We thank the curators, annotators, and domain experts who contributed to dataset construction and verification.
