Counting all the Kanji in Japanese Wikipedia

I counted every Kanji in a full Japanese Wikipedia dump to answer one practical question: which characters appear most often in real-world text?

As a Japanese learner, I wanted a frequency-based view of Kanji usage, not just textbook ordering. Wikipedia is a convenient large corpus for that purpose.

Dataset and Method

Source: jawiki dump (2024-08-01) from dumps.wikimedia.org, using current-page revisions only. Compressed size: 4.6 GB.

bzgrep -oP '[\x{4E00}-\x{9FAF}\x{3400}-\x{4DBF}]' jawiki-20240801-pages-meta-current.xml.bz2

Unicode ranges from Rikai Kanji Tables: CJK Unified Ideographs + Extension A.
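For readers who would rather see the character class spelled out, here is a minimal Python sketch of the same membership test. The names KANJI_RANGES, is_kanji, and extract_kanji are illustrative and not part of the original pipeline. The bounds are copied from the command above.

```python
# Inclusive code point ranges, copied from the grep character class.
KANJI_RANGES = (
    (0x4E00, 0x9FAF),  # CJK Unified Ideographs, with the upper bound used in the command
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
)


def is_kanji(ch: str) -> bool:
    """Return True if the single character ch falls inside one of KANJI_RANGES."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in KANJI_RANGES)


def extract_kanji(text: str) -> list[str]:
    """Return every Kanji in text, in order, dropping kana, digits, and punctuation."""
    return [ch for ch in text if is_kanji(ch)]
```

Calling extract_kanji on a mixed string such as "2024年8月1日のダンプ" keeps only the three Kanji and skips the digits and kana.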

I streamed matching Kanji directly from the compressed file with bzgrep, then counted them in Python.

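#!/usr/bin/python3

import fileinput

# Maps each Kanji to the number of times it has been seen.
total = {}

# bzgrep -o prints one match per line, so each input line is a single Kanji.
for line in fileinput.input():
    kanji = line.rstrip()
    total[kanji] = total.get(kanji, 0) + 1

# Print "kanji,count" rows in ascending order of count (most frequent last).
for k, v in sorted(total.items(), key=lambda item: item[1]):
    print(f'{k},{v}')

With the script saved as counter.py, the full pipeline is:

bzgrep -oP '[\x{4E00}-\x{9FAF}\x{3400}-\x{4DBF}]' jawiki-20240801-pages-meta-current.xml.bz2 | python3 counter.py > python_result.txt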
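As an alternative to the shell pipeline, the extraction and counting could be done in a single Python process. This is only a sketch of that idea and not the method used for the numbers in this post. The function name count_kanji is hypothetical, and it assumes the dump is UTF-8 text inside a bzip2 file.

```python
import bz2
import re
from collections import Counter

# Same two ranges as the bzgrep character class, written as Python escapes.
KANJI_RE = re.compile(r'[\u4E00-\u9FAF\u3400-\u4DBF]')


def count_kanji(path: str) -> Counter:
    """Stream a bzip2-compressed text file and count every matching Kanji."""
    counts = Counter()
    # Text mode decompresses lazily, so the dump is never fully held in memory.
    with bz2.open(path, mode='rt', encoding='utf-8', errors='replace') as dump:
        for line in dump:
            counts.update(KANJI_RE.findall(line))
    return counts
```

The returned Counter can then be written out in the same "kanji,count" format as counter.py. Whether this is faster or slower than bzgrep on a 4.6 GB dump is something to measure and not assume.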

Headline Results

Total Kanji tokens: 1,834,033,943 (about 1.8 billion)
Unique Kanji: 24,916
Snapshot date: 2024-08-01 jawiki dump

Rare-character pages (for example, CJK Extension A listings) contribute substantially to the unique-character count.

Frequency Distribution

Used ≤ X times    Unique Kanji count
1                 3,335
10                14,062
100               18,053
1,000             20,599
10,000            22,256
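A table like this one can be derived from the per-character counts in a few lines. The sketch below assumes the counts are already loaded into a dict that maps each Kanji to its count. The function name cumulative_buckets and the default thresholds are illustrative choices that mirror the rows of the table.

```python
def cumulative_buckets(counts, thresholds=(1, 10, 100, 1_000, 10_000)):
    """For each threshold X, count the distinct Kanji that occur at most X times.

    counts maps each Kanji to its total number of occurrences. The buckets
    are cumulative, so every Kanji counted under X=10 is also counted under X=100.
    """
    return {x: sum(1 for n in counts.values() if n <= x) for x in thresholds}


def used_more_than(counts, x):
    """Count the distinct Kanji that occur strictly more than x times."""
    return sum(1 for n in counts.values() if n > x)
```

Because the buckets are cumulative, used_more_than(counts, x) always equals the number of unique Kanji minus the bucket for x.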

That leaves 24,916 - 22,256 = 2,660 Kanji used more than 10,000 times, close to the common estimate that everyday Japanese relies on roughly 3,000 characters in practice.

Top 100 Kanji (Frequency Summary)

In this layout, entries further right and higher up are more frequent.

[Figure: the 100 most frequent Kanji, including 使]
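The ranking behind a top-100 summary can be recomputed from the "kanji,count" rows that counter.py prints. This is a minimal sketch. The helper names parse_counts and top_kanji are hypothetical, and the input is assumed to be an iterable of lines, such as an open python_result.txt.

```python
from collections import Counter


def parse_counts(lines):
    """Parse 'kanji,count' rows into a Counter, skipping lines without a comma."""
    counts = Counter()
    for line in lines:
        kanji, sep, n = line.strip().partition(',')
        if sep:
            counts[kanji] = int(n)
    return counts


def top_kanji(lines, n=100):
    """Return the n most frequent (kanji, count) pairs, most frequent first."""
    return parse_counts(lines).most_common(n)
```

Since the results file is sorted in ascending order, taking its last 100 lines would give the same characters. Parsing the counts has the advantage of not depending on the sort order.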

Interpretation

The top three characters are 年, 日, and 月 (year, day, month), which makes sense for encyclopedia-style text full of dates. Characters tied to personal or evaluative language, such as 私 (I) and 良 (good), rank much lower, consistent with a formal writing style.

Will this completely change my study routine? Not entirely. Frequency is useful, but pedagogical order and context still matter.