As a Japanese learner, I wanted a frequency-based view of Kanji usage, not just textbook ordering. Wikipedia is a convenient large corpus for that purpose.
Source: jawiki dump (2024-08-01) from dumps.wikimedia.org, using current-page revisions only. Compressed size: 4.6 GB.
```shell
bzgrep -oP '[\x{4E00}-\x{9FAF}\x{3400}-\x{4DBF}]' jawiki-20240801-pages-meta-current.xml.bz2
```
Unicode ranges from Rikai Kanji Tables: CJK Unified Ideographs + Extension A.
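I streamed matching Kanji directly from the compressed file with bzgrep, then counted them in Python. The `-o` flag prints each match on its own line, so the script receives exactly one Kanji per input line; `-P` enables the Perl-style `\x{...}` code point escapes.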
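```python
#!/usr/bin/python3
"""Count how often each Kanji appears; expects one Kanji per input line."""
import fileinput
from collections import Counter

total = Counter()
for line in fileinput.input():
    kanji = line.rstrip()
    total[kanji] += 1

# Print "kanji,count" sorted by ascending count, so the most frequent come last.
for k, v in sorted(total.items(), key=lambda item: item[1]):
    print(f'{k},{v}')
```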
```shell
bzgrep -oP '[\x{4E00}-\x{9FAF}\x{3400}-\x{4DBF}]' jawiki-20240801-pages-meta-current.xml.bz2 | python3 counter.py > python_result.txt
```
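For readers without `bzgrep`, the same extract-and-count step can be sketched in pure Python. This is a minimal sketch rather than what I actually ran: the function names `count_kanji` and `count_kanji_in_dump` are made up for illustration, and the character class simply mirrors the two ranges from the command above.

```python
import bz2
import re
from collections import Counter

# Same ranges as the bzgrep pattern: CJK Unified Ideographs and Extension A.
KANJI_RE = re.compile(r'[\u4E00-\u9FAF\u3400-\u4DBF]')


def count_kanji(lines):
    """Count every Kanji occurrence in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(KANJI_RE.findall(line))
    return counts


def count_kanji_in_dump(path):
    """Stream a .bz2 file line by line without decompressing it to disk."""
    with bz2.open(path, 'rt', encoding='utf-8') as dump:
        return count_kanji(dump)
```

Calling `count_kanji_in_dump('jawiki-20240801-pages-meta-current.xml.bz2')` would return a `Counter` holding the same information that the pipeline writes to `python_result.txt`.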
Rare-character pages (for example, CJK Extension A listings) contribute substantially to the unique-character count.
| Used ≤ X times | Unique Kanji count |
|---|---|
| 1 | 3,335 |
| 10 | 14,062 |
| 100 | 18,053 |
| 1,000 | 20,599 |
| 10,000 | 22,256 |
Subtracting the last row from the total number of distinct Kanji found leaves about 2,660 Kanji used more than 10,000 times, close to the common estimate that everyday Japanese relies on roughly 3,000 characters in practice.
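The cumulative table above can be derived from the per-character counts with a few lines of Python. This is a sketch under assumptions: `counts` is any mapping from Kanji to occurrence count (for example, one built from `python_result.txt`), and the helper names `cumulative_counts` and `used_more_than` are hypothetical.

```python
from bisect import bisect_right


def cumulative_counts(counts, thresholds=(1, 10, 100, 1_000, 10_000)):
    """Return {X: number of distinct characters used at most X times}."""
    frequencies = sorted(counts.values())
    # bisect_right gives the number of sorted values that are <= x.
    return {x: bisect_right(frequencies, x) for x in thresholds}


def used_more_than(counts, threshold):
    """Number of distinct characters used strictly more than `threshold` times."""
    return sum(1 for n in counts.values() if n > threshold)
```

By construction, `used_more_than(counts, 10_000)` equals `len(counts) - cumulative_counts(counts)[10_000]`, which is the subtraction behind the "about 2,660" figure.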
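The grid below shows the 100 most frequent Kanji. In this layout, entries further right and lower down are more frequent, so the most frequent character sits in the bottom-right corner.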
| 見 | 家 | 下 | 的 | 文 | 県 | 内 | 話 | 記 | 学 |
| 理 | 送 | 小 | 同 | 立 | 道 | 子 | 場 | 書 | 人 |
| 対 | 通 | 表 | 選 | 高 | 編 | 時 | 新 | 事 | 会 |
| 連 | 全 | 版 | 目 | 業 | 後 | 画 | 生 | 中 | 大 |
| 京 | 主 | 前 | 語 | 分 | 手 | 第 | 上 | 一 | 者 |
| 号 | 頼 | 所 | 長 | 発 | 山 | 田 | 行 | 出 | 用 |
| 依 | 公 | 明 | 戦 | 動 | 代 | 成 | 市 | 作 | 本 |
| 町 | 開 | 校 | 削 | 関 | 東 | 地 | 合 | 国 | 月 |
| 機 | 野 | 回 | 物 | 間 | 社 | 除 | 部 | 名 | 日 |
| 世 | 使 | 川 | 定 | 集 | 自 | 和 | 方 | 利 | 年 |
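A grid like the one above can be generated from the ascending output of the counting script. The sketch below assumes the list is poured in column by column (top to bottom, then left to right), which is consistent with 年 landing in the bottom-right cell; that layout rule is inferred from the table, and `to_grid` and `to_markdown` are hypothetical helper names. It also assumes the number of characters is a multiple of `rows`.

```python
def to_grid(chars_ascending, rows=10):
    """Pour an ascending list into columns: top to bottom, then left to right."""
    columns = [
        chars_ascending[i:i + rows]
        for i in range(0, len(chars_ascending), rows)
    ]
    # Transpose the columns into rows so the grid can be printed line by line.
    return [list(row) for row in zip(*columns)]


def to_markdown(grid):
    """Render the grid as Markdown table rows."""
    return '\n'.join('| ' + ' | '.join(row) + ' |' for row in grid)
```

Passing the last 100 characters of the sorted result to `to_grid` puts the least frequent of them in the top-left cell and the most frequent in the bottom-right one.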
The top three characters are 年, 日, 月 (year/day/month), which makes sense for encyclopedia-style text full of dates. Characters typical of personal, first-person writing, such as 私 ("I") and 良 ("good"), rank much lower, consistent with a formal writing style.
Will this completely change my study routine? Not entirely. Frequency is useful, but pedagogical order and context still matter.