Published January 11, 2023 | Version 0.0.1
Other Open

Pinyin - IPA Mapping

  • 1. Chemnitz University of Technology

Description

This upload contains JSON mappings from pinyin to IPA for all pinyin-romanized Chinese syllables found in a large corpus.

The original Chinese words come from the Leipzig Corpora Collection (uni-leipzig); we used the 1M Wikipedia corpus from 2018. Each syllable was extracted and converted to pinyin with pypinyin (v0.47.1) via dict-from-pypinyin (v0.0.1). The pinyin was then transcribed to IPA with pinyin-to-ipa (v0.0.1). Only the first possible transcription was included in the mappings.
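The steps above can be sketched roughly as follows. This is a minimal illustration and not the contents of script.sh: the function name build_mapping and the two callables to_pinyin and to_ipa are hypothetical stand-ins for the pypinyin and pinyin-to-ipa lookups, and the sketch assumes that "first possible transcription" refers to the first IPA candidate for each pinyin syllable.

```python
from __future__ import annotations

from typing import Callable, Iterable, Sequence


def build_mapping(
    symbols: Iterable[str],
    to_pinyin: Callable[[str], Sequence[str]],  # hypothetical stand-in for the pypinyin lookup
    to_ipa: Callable[[str], Sequence[str]],     # hypothetical stand-in for pinyin-to-ipa
) -> tuple[dict[str, str], list[str]]:
    """Return (pinyin -> IPA mapping, symbols without any pinyin reading)."""
    mapping: dict[str, str] = {}
    oov: list[str] = []
    # dict.fromkeys removes duplicates while keeping first-seen order.
    for symbol in dict.fromkeys(symbols):
        readings = to_pinyin(symbol)
        if not readings:
            # No pinyin available: the symbol goes to the OOV list.
            oov.append(symbol)
            continue
        for pinyin in readings:
            if pinyin in mapping:
                continue
            candidates = to_ipa(pinyin)
            if candidates:
                # Assumption: keep only the first IPA candidate.
                mapping[pinyin] = candidates[0]
    return mapping, oov
```

Passing the lookups in as parameters keeps the sketch independent of the exact library versions; in the real pipeline they would be backed by pypinyin and pinyin-to-ipa.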

Note: tone sandhi is not considered since the vocabulary consists only of stand-alone syllables.

Files:

  • hanzi-vocabulary.txt
    • contains the hanzi vocabulary from which pinyin was transcribed (Chinese syllables)
  • pinyin-ipa-map-NORMAL.json (418 mappings)
    • contains toneless pinyin mapped to IPA in pypinyin-style NORMAL, e.g., beng
  • pinyin-ipa-map-TONE.json (1400 mappings)
    • contains pinyin mapped to IPA with pinyin tones in pypinyin-style TONE, e.g., bèng
  • pinyin-ipa-map-TONE2.json (1400 mappings)
    • contains pinyin mapped to IPA with pinyin tones in pypinyin-style TONE2, e.g., be4ng
  • pinyin-ipa-map-TONE3.json (1400 mappings)
    • contains pinyin mapped to IPA with pinyin tones in pypinyin-style TONE3, e.g., beng4
  • pinyin-ipa-map-TONE3-all.json (2508 mappings)
    • contains all theoretical combinations of pinyin mapped to IPA with pinyin tones in pypinyin-style TONE3, e.g., beng4
  • oov-vocabulary.txt
    • contains the vocabulary entries for which no pinyin could be produced (because the entry is not a Chinese character or has no pinyin representation)
  • script.sh
    • contains the script to reproduce all results
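
The four key styles listed above differ only in how the tone is written. As a rough illustration, the sketch below derives the NORMAL, TONE2 and TONE forms from a TONE3 key such as beng4. It is not taken from pypinyin: the function names are hypothetical, the tone-mark placement follows the usual textbook rule (a or e first, then the o in ou, otherwise the last vowel), and syllables without a vowel are returned unmarked, so edge cases may differ from pypinyin's output.

```python
from __future__ import annotations

import unicodedata

VOWELS = "aeiouüv"  # "v" is treated as a stand-in for "ü" (assumption)
# Combining diacritics for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}


def split_tone3(syllable: str) -> tuple[str, str]:
    """Split 'beng4' into ('beng', '4'); a missing digit yields ''."""
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], syllable[-1]
    return syllable, ""


def tone_vowel_index(base: str) -> int:
    """Index of the vowel that carries the tone, or -1 if there is none."""
    for vowel in "ae":
        if vowel in base:
            return base.index(vowel)
    if "ou" in base:
        return base.index("o")
    positions = [i for i, ch in enumerate(base) if ch in VOWELS]
    return positions[-1] if positions else -1


def tone3_to_normal(syllable: str) -> str:
    """'beng4' -> 'beng'"""
    return split_tone3(syllable)[0]


def tone3_to_tone2(syllable: str) -> str:
    """'beng4' -> 'be4ng' (digit placed after the tone-carrying vowel)."""
    base, tone = split_tone3(syllable)
    i = tone_vowel_index(base)
    if tone not in TONE_MARKS or i < 0:
        return base
    return base[: i + 1] + tone + base[i + 1 :]


def tone3_to_tone(syllable: str) -> str:
    """'beng4' -> 'bèng' (diacritic on the tone-carrying vowel)."""
    base, tone = split_tone3(syllable)
    base = base.replace("v", "ü")
    i = tone_vowel_index(base)
    if tone not in TONE_MARKS or i < 0:
        return base
    marked = base[: i + 1] + TONE_MARKS[tone] + base[i + 1 :]
    # NFC merges the vowel and the combining mark into one code point.
    return unicodedata.normalize("NFC", marked)
```

A helper like this can be handy for looking up the same syllable across the TONE, TONE2 and TONE3 files, provided the results are checked against the actual keys.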

Notes

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

Files (229.7 kB)

  • md5:8ae3e0e18bdd436d074cbc41aab6790a (48.0 kB)
  • md5:4d0d7f727573f6b7b971c90c25736c36 (6.8 kB)
  • md5:82b56b91105131a44ec4835af3d12840 (8.4 kB)
  • md5:f1d73f58ff409a54a452c2b3dc01738d (34.5 kB)
  • md5:65bb66864435d5a891a26626fb4aac83 (34.6 kB)
  • md5:f5e0a4e416e481c4da20b0e2b09a1276 (59.0 kB)
  • md5:81e7ac22dc9918b738716b140485aa20 (34.6 kB)
  • md5:15b0a474d3319a758b41602a63e0ecd6 (4.0 kB)
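
A downloaded file can be checked against one of the md5 values listed above. The following is a minimal sketch using only Python's standard library; the function names are hypothetical, and because the listing here does not show which file name belongs to which checksum, the expected value has to be supplied by the caller.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def md5_of_file(path: str | Path, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str | Path, listed: str) -> bool:
    """Compare a file against a value such as 'md5:8ae3...' from the listing."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

MD5 is used here only because it is the checksum the record provides; it detects corrupted downloads but is not a safeguard against deliberate tampering.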

Additional details

References

  • D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12), 2012.