Pinyin - IPA Mapping

doi:10.5281/zenodo.7525638

Published January 11, 2023 | Version 0.0.1

Other Open

Taubert, Stefan¹

1. Chemnitz University of Technology

This upload contains JSON mappings from pinyin to IPA for all pinyin-romanized Chinese syllables extracted from a large corpus.

The original Chinese words come from the Leipzig Corpora Collection (uni-leipzig); we used the 1M Wikipedia corpus from 2018. Each syllable was extracted and converted to pinyin with pypinyin (v0.47.1) using dict-from-pypinyin (v0.0.1); the pinyin was then transcribed to IPA using pinyin-to-ipa (v0.0.1). Only the first possible transcription was included in the mappings.

Note: tone sandhi is not considered since the vocabulary consists only of stand-alone syllables.

Files:

hanzi-vocabulary.txt
- contains the hanzi vocabulary (Chinese syllables) from which the pinyin was transcribed, e.g., 㩳
pinyin-ipa-map-NORMAL.json (418 mappings)
- contains toneless pinyin mapped to IPA in pypinyin-style NORMAL, e.g., beng
pinyin-ipa-map-TONE.json (1400 mappings)
- contains pinyin mapped to IPA with pinyin tones in pypinyin-style TONE, e.g., bèng
pinyin-ipa-map-TONE2.json (1400 mappings)
- contains pinyin mapped to IPA with pinyin tones in pypinyin-style TONE2, e.g., be4ng
pinyin-ipa-map-TONE3.json (1400 mappings)
- contains pinyin mapped to IPA with pinyin tones in pypinyin-style TONE3, e.g., beng4
pinyin-ipa-map-TONE3-all.json (2508 mappings)
- contains all theoretical combinations of pinyin mapped to IPA with pinyin tones in pypinyin-style TONE3, e.g., beng4
oov-vocabulary.txt
- contains the vocabulary for which no pinyin could be transcribed (because the symbol is not Chinese or has no pinyin representation), e.g., 방 or 㕔
script.sh
- contains the script to reproduce all results
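
The NORMAL, TONE, TONE2 and TONE3 styles differ only in where the tone is written. As a rough illustration, the sketch below converts a TONE3 key such as beng4 into the other three forms using the standard pinyin tone-placement rule. The function names are hypothetical and are not part of this upload or of pypinyin; syllables without a vowel (such as ng) and the v spelling of ü are not handled.

```python
import unicodedata

# Combining diacritics for tones 1-4 (macron, acute, caron, grave).
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiouü"


def split_tone3(syllable):
    """Split a TONE3 syllable such as 'beng4' into ('beng', 4)."""
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], int(syllable[-1])
    return syllable, None  # no tone digit present


def tone_position(base):
    """Return the index of the vowel that carries the tone."""
    for vowel in "ae":  # 'a' or 'e' always takes the tone
        if vowel in base:
            return base.index(vowel)
    if "ou" in base:  # in 'ou' the tone goes on the 'o'
        return base.index("o")
    for i in range(len(base) - 1, -1, -1):  # otherwise the last vowel
        if base[i] in VOWELS:
            return i
    raise ValueError(f"no vowel found in {base!r}")


def convert_tone3(syllable):
    """Hypothetical helper: derive NORMAL, TONE2 and TONE from TONE3."""
    base, tone = split_tone3(syllable)
    if tone is None:
        return {"NORMAL": base, "TONE2": base, "TONE": base}
    i = tone_position(base)
    tone2 = base[: i + 1] + str(tone) + base[i + 1:]
    # Tones outside 1-4 (e.g. a neutral tone written as 5) get no diacritic.
    marked = unicodedata.normalize("NFC", base[i] + TONE_MARKS.get(tone, ""))
    return {"NORMAL": base, "TONE2": tone2, "TONE": base[:i] + marked + base[i + 1:]}


print(convert_tone3("beng4"))
```

For keys that actually occur in the mapping files, pypinyin itself remains the authoritative source of the spellings; this sketch only shows how the example keys in the list above relate to each other.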

Notes

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

Files (229.7 kB)

Name	MD5	Size
hanzi-vocabulary.txt	8ae3e0e18bdd436d074cbc41aab6790a	48.0 kB
oov-vocabulary.txt	4d0d7f727573f6b7b971c90c25736c36	6.8 kB
pinyin-ipa-map-NORMAL.json	82b56b91105131a44ec4835af3d12840	8.4 kB
pinyin-ipa-map-TONE.json	f1d73f58ff409a54a452c2b3dc01738d	34.5 kB
pinyin-ipa-map-TONE2.json	65bb66864435d5a891a26626fb4aac83	34.6 kB
pinyin-ipa-map-TONE3-all.json	f5e0a4e416e481c4da20b0e2b09a1276	59.0 kB
pinyin-ipa-map-TONE3.json	81e7ac22dc9918b738716b140485aa20	34.6 kB
script.sh	15b0a474d3319a758b41602a63e0ecd6	4.0 kB
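
The MD5 checksums listed with the files can be used to confirm that a download is intact. A minimal sketch using only the Python standard library follows; the helper names md5_of_file and matches_checksum are hypothetical, and the file is assumed to have been downloaded to a local path.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with the value listed for it."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, with script.sh saved in the current directory, matches_checksum("script.sh", "15b0a474d3319a758b41602a63e0ecd6") should return True for an unmodified copy.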

Additional details

D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12), 2012
