Introduction

Word frequency (WF) is arguably the most established predictor of lexical processing accuracy and efficiency (for a review, see Brysbaert et al., 2018). Obtaining reliable frequency estimates has thus become one of the most pursued aims in psycholinguistic research in the past two decades. Intriguingly, the extent to which WF explains word recognition might depend on the language register from which the frequency values were computed (Brysbaert & New, 2009; Brysbaert et al., 2018). The best frequency measures for undergraduate students are based on corpora of television and movie subtitles (Brysbaert & New, 2009) or social media (Gimenes & New, 2016), whereas traditional frequency measures based on printed books and newspapers seem to better account for the language experience of the elderly (Brysbaert & Ellis, 2016). What about children? Particularly, what are good frequency measures for beginner readers who are starting to get exposed to printed books while also receiving considerable media language input? In this article, we present a Chinese lexical database based on animated movies and TV series for young children: Chinese Children’s Lexicon of Oral Words (CCLOOW). We show that CCLOOW word frequencies, as well as contextual diversity measures, can explain young children’s word recognition performance while frequency measures based on adults’ language samples cannot. They also explain unique variance in adults’ word recognition latencies in addition to adult frequency measures.

How to compute WF measures

The necessity of WF norms comes from the robust finding that high-frequency words enjoy a processing advantage relative to low-frequency words. The frequency effect has been consistently demonstrated in lexical processing tasks such as lexical decision (Balota et al., 2007; Tsang et al., 2018; Van Heuven et al., 2014), word naming (Balota et al., 2004; Liu et al., 2007) and picture naming (Lampe et al., 2021; Taylor et al., 2012). Particularly, WF is the most important variable in predicting lexical decision reaction times (RTs), accounting for over 40% of the variance (Balota et al., 2007). In comparison, when frequency is partialled out, other lexical variables (e.g., familiarity, imageability, family size) together account for less than 5% of the variance in RT (Balota et al., 2004; Yap et al., 2009). Given the importance of WF, it is not surprising that psycholinguistic researchers have devoted considerable effort to seeking reliable frequency measures.

The type of corpus from which WF is calculated has changed over time. WF is traditionally computed by counting words’ occurrences in written samples of newspapers, magazines, and scientific books. An example is the widely used Kucera and Francis (1967) word list (KF). With the increasing popularity of Internet use, web-based frequency measures have also been created (Balota et al., 2004; Burgess & Livesay, 1998; Herdağdelen & Marelli, 2017). Later, in Brysbaert and New’s (2009) seminal work, television and movie subtitles were proposed to be a good estimate of the language experience of average adult language users. Importantly, as megastudies of word recognition data have become available, the predictive validity of traditional norms such as KF frequency on lexical decision and word naming was called into question (Balota et al., 2004, 2007). Compared with frequency measures based on subtitles, the KF norms explained 6% and 10% less variance in word recognition RTs and accuracies, respectively (Brysbaert & New, 2009). It was suggested that movie subtitles are reflective of the language the participants are most likely exposed to, whereas traditional frequency measures were based on very limited topics and edited language that is detached from natural language use. Because of their high predictive validity, subtitle-based frequency norms quickly became available in many different languages such as Chinese (Cai & Brysbaert, 2010), German (Brysbaert et al., 2011), Dutch (Keuleers et al., 2010) and Spanish (Cuetos et al., 2011). They also showed better performance in explaining word recognition than traditional frequency norms. For example, the SUBTLEX-CH frequencies (Cai & Brysbaert, 2010) accounted for 42.7% of the variance in adults’ lexical decision latencies, compared with 13.7–27.3% of the variance explained by frequencies based on written language samples.

Another important issue is the computation of standardized frequencies. Frequency measures have mostly been raw counts or frequency per million words (fpmw). However, Van Heuven et al. (2014) noted that the interpretation of these measures depends highly on the size of the corpus. Frequency values in a 1-million-word corpus are not comparable to those in a 20-million-word corpus, because the smallest possible value is 1 fpmw in the former, whereas values can fall below 1 fpmw in the latter, despite both being the lowest in their respective corpora. Therefore, they proposed a frequency measure that promotes cross-corpora interpretation: the Zipf scale. It is a logarithmic scale roughly ranging from 1 to 7, computed as: log10 (frequency per million words) + 3. A Zipf value smaller than 1 (corresponding to one occurrence per 100 million words) represents very low frequency; a value between 3 (corresponding to one occurrence per million words) and 4 indicates medium frequency; and a value greater than 6 (corresponding to one occurrence per 1000 words) suggests very high frequency.
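The corpus-size argument can be illustrated with a minimal Python sketch of the formula as stated above. The function name `zipf_from_count` and the toy counts are hypothetical and serve only to show that equal relative frequencies yield equal Zipf values regardless of corpus size.

```python
import math

def zipf_from_count(count: int, corpus_size: int) -> float:
    """Zipf value as defined in the text: log10(frequency per million words) + 3."""
    if count <= 0 or corpus_size <= 0:
        raise ValueError("count and corpus_size must be positive")
    fpmw = count * 1_000_000 / corpus_size  # frequency per million words
    return math.log10(fpmw) + 3

# One occurrence in a 1-million-word corpus and 20 occurrences in a
# 20-million-word corpus are the same relative frequency (1 fpmw).
small = zipf_from_count(1, 1_000_000)
large = zipf_from_count(20, 20_000_000)

# A single occurrence in the larger corpus falls below 1 fpmw,
# which the smaller corpus cannot represent.
rare = zipf_from_count(1, 20_000_000)

print(small, large, round(rare, 3))
```

Because the scale is defined on relative frequency, the first two values coincide, while the third lies below 3, the Zipf value of 1 fpmw.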

More recently, a complementary or alternative index to WF has been proposed to take into account the contextual variations in which words are experienced: contextual diversity (CD). It is calculated by counting the number of unique documents in which a word occurs (Adelman et al., 2006), so that repeated appearances within a document are not counted again. It has been argued that CD might outperform WF in reflecting the linguistic experience with a word. According to the Semantic Distinctiveness Model (Johns & Jones, 2022; M. Jones et al., 2012), the strength of a word’s lexical representation is updated each time it is encountered in a new context. The update is determined by the dissimilarity between the current context and the previous contexts the word occurs in. Thus, words experienced in a wider variety of contexts, having a high CD, should be more easily acquired and processed than low-CD words, as they are more likely to be accessed in the future. There is indeed evidence that CD explains more variance in word recognition latencies than WF (Adelman et al., 2006; Brysbaert & New, 2009). In sentence reading, high-CD words were fixated for a shorter time than low-CD words, and once CD was controlled for, the WF effect was diminished (Chen, Huang, et al., 2017a; Chen, Zhao, et al., 2017b; Plummer et al., 2014). The CD effect has also recently been demonstrated in children (Huang et al., 2020; Perea et al., 2013). More diverse contexts also seem to facilitate learning to read novel words (Joseph & Nation, 2017; Pagan & Nation, 2019; Rosa et al., 2017).

Lexical databases for children

Turning to developmental research, the number of frequency databases based on children’s reading materials has been growing rapidly in recent years. Some of the earliest children’s databases were for English, such as the Children’s Printed Word Database (CPWD, Masterson et al., 2003) and its later extension (Masterson et al., 2010). CPWD was compiled from 556 books used by Grades 1–4 teachers in a representative sample of elementary schools in the UK. It provides WF as well as orthographic and phonological variables for 12,193 English words for 5- to 9-year-old children. Children’s lexical databases in other languages have followed ever since (French: Lété et al., 2004; Chinese: Li et al., 2022; German: Schroeder et al., 2014; Greek: Terzopoulos et al., 2017). Most of these corpora sampled from grade-leveled school textbooks (e.g., Terzopoulos et al., 2017) with the aim of reflecting a developmental trend of average children’s experience with printed words. Some of them also included measures of CD, calculated as the proportion of textbooks in which a word occurs at a given grade level (Soares et al., 2014; Terzopoulos et al., 2017).

Constructing specialized lexical corpora for children is necessitated by the drastically different language environments children and adults are exposed to. Infants and toddlers mostly hear child-directed speech from caregivers, which is characterized by item-based phrases (Cameron-Faulkner et al., 2003) and decontextualized language such as narratives (Rowe, 2012). When children start to read, the picture books they most likely read are accompanied by illustrative pictures and use more communicative and interactive language than adults’ printed books. Analyses have shown that word frequencies based on children’s materials correlated with each other better than with adults’ frequency norms (Van Heuven et al., 2014). Given that the best frequency measures are based on the language samples the participants are most familiar with (Brysbaert et al., 2018), there is good reason to predict that frequencies computed from children’s materials are better predictors of children’s lexical processing than adults’ frequencies.

In Chinese, there are currently two lexical databases that provide WF and CD measures based on children’s reading materials. The CJC database (Huang et al., 2020) is based on a collection of 52 Grades 1–4 textbooks used in Jiangsu Province and 43 storybooks, containing 2.65 million characters and 1.83 million words. Using the CJC WF and CD measures, the researchers have found that Grade 4 children’s RTs in lexical decision were affected by CD and not WF. The other database is the recently released Chinese Children’s Lexicon of Written Words (CCLOWW, Li et al., 2022), which is by far the largest written lexical database for Chinese children and the only one publicly available. CCLOWW sampled 2131 books of 34 million characters (22 million words) organized into three grade levels: Grades 2 and below (G2), Grades 3–4 (G34) and Grades 5–6 (G56). In particular, the G2 subcorpus was compiled from over 1500 picture books and an additional couple of Grades 1–2 textbooks. Thus, they should reflect the very early print experience of preschoolers and beginner readers. CCLOWW frequencies were not only able to explain a fair amount of variance in Grade 3 children’s word naming RTs, but also, they contributed significant extra variance in adults’ naming and lexical decision RTs in addition to adult frequency measures, indicating the influence of early reading experience on the mature lexicon and lexical processing.

Subtitle-like frequencies for children?

Despite the popularity of adult frequency norms based on movie subtitles, there is a lack of children’s lexical databases in similar language registers. SUBTLEX-UK (Van Heuven et al., 2014) is the only corpus that contains subtitles from two children’s channels broadcast in the UK: CBeebies for 0- to 6-year-old children and CBBC for 6- to 12-year-old children. The subtitle-based frequencies were correlated with the children’s CPWD written frequencies. It was found that CPWD correlated with CBeebies and CBBC frequencies (CBeebies: r = .756; CBBC: r = .690) better than with SUBTLEX-UK frequencies (r = .664). Intriguingly, although CBeebies frequencies were assumed to be based on subtitles of TV channels, we know that when preschool children watch TV, they cannot yet read the subtitles, if there are any. That is, for prereaders, the programs are more likely a source of spoken rather than written language input.

The impact of screen media on children’s language development is two-fold. Although watching TV and using other screen media have sometimes been negatively associated with language development because it may displace other language-learning activities (Pagani et al., 2013), it can also be beneficial provided that it delivers interactive content (Linebarger & Vaala, 2010; Myers et al., 2017) and the time spent on the screen was not excessive (Dore et al., 2020). Some media content, such as animated cartoons and movies, could potentially provide a learning environment that promotes interactive learning similar to daily speech, because the language is also accompanied by communicative cues from the characters. Processing such multisensory information might enhance semantic memory retention (Li & Jeong, 2020; Mayer et al., 1999) and may thus lead to a learning advantage relative to unimodal learning environments such as reading printed texts.

Importantly, screen media such as television does constitute a considerable portion of children’s everyday language exposure. Recent surveys show that on average, children under the age of three in the United States spend as much as 3.6 h a day watching TV (Madigan et al., 2019). It is likely that beginner readers spend comparable or more time on screen than they read books. Therefore, WF computed from children’s TV programs and movies might provide a good reflection of their daily language experience.

The CCLOOW database

In this article, we introduce the Chinese Children’s Lexicon of Oral Words (CCLOOW), a new database of characters and words compiled from animated movies and TV series for 3-to-9-year-old Chinese children. We aim to provide an addition to the recent CCLOWW database to profile the lexical experience of prereaders and beginner readers in mainland China. Importantly, with a series of validation analyses, we show that frequency norms for children could be reliably computed from screen media language similar to subtitle-based databases for adults (e.g., Cai & Brysbaert, 2010).

In the following, we first describe how the corpus was collected and how the language samples were processed. We then provide information about the distributions of word length, syntactic category, and frequency and CD indices of the characters and words in CCLOOW. Next, we evaluate the predictive validity of CCLOOW character frequencies and CDs, first with existing character naming data of Grades 2–3 children. We then report word naming and lexical decision experiments with Grade 2 children to validate the word measures. We also compare the effects of CCLOOW word frequencies with those of CCLOWW G2 subcorpus and SUBTLEX-CH frequencies on the beginner readers’ lexical recognition performance. This way, we examine the notion that the best WF norms are based on language to which the participants are most likely exposed. Finally, we investigate the effects of CCLOOW word frequencies on adults’ word recognition latencies to explore the potential influences of early language experience on the mature lexicon.

The CCLOOW database

Corpus sampling and linguistic processing

We acquired a sample of 21 animated TV series and 145 animated movies in Mandarin Chinese. To ensure that the materials were representative of what young Chinese children commonly watch, we checked with the “child” sections of popular streaming platforms (Tencent Video, iQiyi, Youku) and the China Central Television children’s channel to make sure that the samples were included in these programs. Note that there is no official recommended age for the TV series and movies collected for the current corpus. Our aim was to collect a sample that likely covers average 3-to-9-year-old children’s exposure to screen media language in mainland China. The samples were all narratives. The audio materials were converted to written samples using the AISPEECH recognition software (https://www.aispeech.com/, Gong et al., 2022; Liu et al., 2020). The written samples were then cleaned and proofread by three native Chinese speakers. Each movie and episode of TV series was treated as one document. To avoid overrepresentations of any TV series with multiple episodes when calculating CD, 20 of the series that contained more than ten episodes and an average episode length of less than 300 characters were divided into documents of approximately equal length comprising multiple episodes. The final corpus thus contained a total of 265 documents.

Word segmentation was conducted using fastHan (Geng et al., 2020), a BERT (Bidirectional Encoder Representations from Transformers)-based Chinese natural language processing toolkit. FastHan has been tested with multiple corpora and has achieved 97.41% accuracy on word segmentation and 95.66% on POS tagging, outperforming other popular Chinese word segmentation tools (Geng et al., 2020). It used the Penn Chinese Treebank 9.0 (Xue et al., 2019) for POS tagging and dependency parsing. The output of fastHan was cleaned following Li et al. (2022). That is, we first removed non-characters including alphabet letters and numbers in low-frequency sequences. The Table of General Standard Chinese Characters (Ministry of Education, 2013) was then used to exclude characters that are no longer used in contemporary Chinese. Next, the following items were removed: items with more than 15 characters and a frequency count smaller than 2, non-nouns with more than eight characters, interjections with more than four characters and numerals with a frequency count smaller than 10. These are likely nonwords.
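The removal rules listed above can be written as a single predicate. The following Python sketch is illustrative only: the function name `is_likely_nonword`, the tag sets (Penn Chinese Treebank-style labels are assumed for nouns, interjections, and numerals), and the toy items are assumptions rather than the pipeline actually used, and the first rule is read literally as a conjunction of length and frequency.

```python
# Assumed CTB-style tag sets; the exact tags used in the cleaning are not given in the text.
NOUN_TAGS = {"NN", "NR", "NT"}
INTERJECTION_TAGS = {"IJ"}
NUMERAL_TAGS = {"CD", "OD"}

def is_likely_nonword(form: str, pos: str, count: int) -> bool:
    """Return True if a segmented item matches one of the four removal rules."""
    n_chars = len(form)
    if n_chars > 15 and count < 2:               # very long and seen only once
        return True
    if pos not in NOUN_TAGS and n_chars > 8:     # long non-nouns
        return True
    if pos in INTERJECTION_TAGS and n_chars > 4: # long interjections
        return True
    if pos in NUMERAL_TAGS and count < 10:       # rare numerals
        return True
    return False

# Toy items (form, POS tag, raw count); values are made up for illustration.
items = [
    ("朋友", "NN", 120),
    ("啊啊啊啊啊", "IJ", 3),
    ("三千五百", "CD", 2),
]
kept = [form for form, pos, count in items if not is_likely_nonword(form, pos, count)]
print(kept)
```

Keeping the rules in one function makes it straightforward to log which rule removed an item if the thresholds need to be audited.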

Frequency and contextual diversity calculation

We calculated the total number of times a character/word occurs in the corpus (raw frequency), the number of occurrences per million characters/words, and the number of unique documents in which a character/word occurs (CD). In addition, we computed the standardized frequency measure Zipf (Van Heuven et al., 2014) using the formula: Zipf = log10 (frequency per million) + 3. We also computed log-transformed values for contextual diversity as logCD. In the following, we report distributions and experimental analyses using Zipf and logCD values. Other indices can be found and downloaded in the online database.
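A minimal Python sketch of these indices is given below, assuming each document is already segmented into a list of tokens. The function name `lexical_stats` and the toy documents are hypothetical, and a base-10 logarithm is assumed for logCD because the base is not stated here.

```python
import math
from collections import Counter

def lexical_stats(documents):
    """Compute raw frequency, frequency per million, Zipf, CD, and logCD.

    documents: list of token lists, one list per document.
    """
    raw = Counter()  # total occurrences across the corpus
    cd = Counter()   # number of documents containing the item
    for tokens in documents:
        raw.update(tokens)
        cd.update(set(tokens))  # each document counts at most once per item
    total = sum(raw.values())
    stats = {}
    for item, count in raw.items():
        fpm = count * 1_000_000 / total
        stats[item] = {
            "raw": count,
            "fpm": fpm,
            "zipf": math.log10(fpm) + 3,
            "cd": cd[item],
            "log_cd": math.log10(cd[item]),  # assumed base 10
        }
    return stats

# Toy corpus of three "documents"; not taken from CCLOOW.
docs = [["你", "好", "你"], ["你", "看"], ["好", "了"]]
stats = lexical_stats(docs)
print(stats["你"]["raw"], stats["你"]["cd"])
```

In the toy corpus, 你 occurs three times but in only two documents, which is the distinction between raw frequency and CD.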

Summary of the corpus

The final corpus comprised 2,745,366 character tokens and 1,889,656 word tokens. There were 3920 unique character types and 22,229 unique word forms (26,582 non-lemmatized words). The mean document length was 10,360 characters (min = 342, max = 60,709) and 7,131 words (min = 189, max = 34,394). Compared with CCLOWW G2 (2.2 million character and 1.5 million word tokens), which contained 4351 character and 37,516 word types, CCLOOW has far fewer unique items, despite its larger size. This suggests that written texts for young Chinese children provide a wider range of lexical items than the movies and TV series they watch, which is broadly consistent with Montag et al.’s (2015) finding that children’s picture books (written texts) contain more diverse word types than child-directed speech (spoken language).

Statistics of the characters and words

Word length

Among the words in CCLOOW, 1891 contained one character, 15,636 two characters, 3453 three characters and 1154 four characters. Two-character words were the most common type of words, accounting for over 70% of all word types, which is consistent with previous findings with Chinese dictionary (Tan & Perfetti, 1999) and children’s corpora (L. Li et al., 2022). Nevertheless, one-character words were most frequent (1,889,656 tokens), accounting for over 60% of all word tokens. This result adds to previous findings that one-character words take up 58.41% of the CCLOWW Grade 2 and below (G2) subcorpus and the percentage decreases to 52.94% in the Grades 5–6 (G56) subcorpus. Together, they show that children encounter more complex words as they develop from prereaders (CCLOOW, CCLOWW G2) to advanced readers (CCLOWW G56).

Syntactic categories

Table 1 presents the percentages of word types and tokens by syntactic categories in CCLOOW. The distributions in CCLOWW G2 are also shown to allow direct comparisons. The percentage ranks of words’ syntactic categories were largely consistent with CCLOWW G2. Also, similar to CCLOWW G2, content words (nouns, verbs, adjectives, adverbs, numerals, and measure words) take up over 95% of all word types. Nouns were the most common type of words, followed by verbs, adjectives, and adverbs. Although over 50% of the word types are nouns, verb tokens accounted for a much higher percentage than noun tokens. This pattern is also demonstrated in CCLOWW G2 but to a lesser degree. Another interesting comparison between CCLOOW and CCLOWW G2 is the percentages of some function words. For example, pronoun types took up similar proportions in the two databases (.35 in CCLOOW and .29 in CCLOWW G2), whereas their token percentage was much higher in CCLOOW (12.56%) than in CCLOWW G2 (7.95%). This potentially reflects that media language affords a communicative context where some cues are present in speech that are absent in writing (Castles et al., 2018). For example, gestures and facial expressions may be used along with a pronoun to indicate a referent in the animated movies, whereas a noun is necessary for the same referent in written texts.

Table 1 Distribution of syntactic categories of words in CCLOOW and CCLOWW G2

Frequency and CD

Distributions of the frequency (Zipf) and CD (logCD) are presented below in Fig. 1. The words’ Zipf and logCD are for lemmatized words. Frequency and CD measures for non-lemmatized words with POS tags can be found in the online database. The characters’ Zipf values ranged from 2.56 to 7.56 (M = 4.27, SD = 1.02). The words’ Zipf values ranged from 3.02 to 7.63, with a mean of 3.64 (SD = .64). The distributions, particularly for words, were heavily right-skewed, with most items concentrated at the low end of the scale. The most frequent 100 characters (2.55% of all types) and words (0.45% of all types) accounted for 58.40% and 56.63%, respectively, of all tokens. The pattern of CD distributions was similar. The average number of documents in which the words occurred was 11.32 (SD = 26.09). Over 50% of the words appeared in three or fewer documents. Two words occurred in all documents: the pronoun 你 (“you”) and the aspect marker 了. Similar to findings with previous lexical databases, the correlations between frequency and CD in CCLOOW are high (character r = .95; word r = .92).

Fig. 1 Distributions of character and word frequency (Zipf) and CD (logCD) in CCLOOW. Dashed vertical lines indicate the 10th, 25th, 50th, 75th, and 90th percentiles

Validating CCLOOW measures

Correlations with other Chinese lexical databases

We carried out Pearson’s correlational analyses on the frequency and CD measures of CCLOOW with those of two other Chinese lexical databases: CCLOWW G2 and SUBTLEX-CH. We selected the former because it was mostly based on picture books for children whose age is comparable to the target age of the current corpus. We selected the latter because it is the most widely used set of Chinese adult frequency norms and it was based on movie subtitles, a similar register to the language samples in CCLOOW. The analyses were based on 3,680 characters and 12,624 words shared across the three databases. The results are shown in Table 2.
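The logic of restricting the analysis to shared items and then correlating pairs of measures can be sketched as follows. This is a minimal Python illustration under stated assumptions: each database is represented as a hypothetical mapping from item to Zipf value, the helper names `shared_items` and `pearson_r` are illustrative, and the toy values are made up rather than drawn from any of the three databases.

```python
import math

def shared_items(*databases):
    """Items present in every database, in a stable order."""
    common = set(databases[0])
    for db in databases[1:]:
        common &= set(db)
    return sorted(common)

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy Zipf values (illustrative only).
db_a = {"你": 7.2, "猫": 4.6, "跑": 5.1, "云": 4.0}
db_b = {"你": 7.0, "猫": 4.9, "跑": 5.0, "树": 4.4}
db_c = {"你": 7.3, "猫": 4.1, "跑": 5.4, "云": 3.9}

items = shared_items(db_a, db_b, db_c)
r_ab = pearson_r([db_a[w] for w in items], [db_b[w] for w in items])
print(items, round(r_ab, 2))
```

Fixing the item set before computing any correlation ensures that all coefficients in a table of this kind are based on the same items and are therefore comparable.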

Table 2 Pearson’s correlations between CCLOOW, CCLOWW G2, and SUBTLEX-CH

The correlations within databases were all very high (rs > .92). Between-corpora correlations were also reasonably high, particularly for characters. Correlations between CCLOOW character measures and the other two databases were comparable (with CCLOWW G2: r range = .87–.88, with SUBTLEX-CH: r range = .81–.88). In contrast, word measures in CCLOOW correlated better with those in CCLOWW G2 (r range = .74–.76) than with those in SUBTLEX-CH (r range = .63–.71). This is consistent with Van Heuven et al.’s (2014) finding that English children’s written frequencies correlated better with lexical frequencies based on children’s TV channels than with adults’ subtitle frequencies, highlighting the necessity for lexical databases made specifically for children.

Predicting children’s character reading

Character naming

We first validated the frequency and CD measures of CCLOOW characters using existing children’s character naming data. We obtained the data of 52 Grades 2–3 children from Li et al. (2022) and collected additional data from 13 children of the same age. The children (age M = 8.33, SD = .31, 26 males) were tested individually in a quiet room at East China Normal University. They were instructed to read the characters presented on a paper sheet as accurately as they could. They were told that if they could not read a character, they could just skip it. The task took about 10 min. The participants all had normal or corrected-to-normal vision and were assessed by the teachers to have normal language and reading abilities. The experiments reported in this article have been approved by the East China Normal University Committee on Human Research Protection. Written consent was obtained from the children’s guardians.

The analysis was based on 65 Grades 2–3 children’s naming accuracy of 118 characters. Data were analyzed using the packages lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) implemented in R (R Core Team, 2013). Two simple mixed effects models were built in which the characters’ Zipf and logCD were separately added into the baseline model as a fixed factor. For this and the following regression analyses, following Barr et al.’s (2013) suggestion, we always started with a maximal random effects structure: (1 + Zipf/logCD | subject) + (1 + Zipf/logCD | item), and removed the random effect associated with the smallest variance one at a time to facilitate model convergence when a model could not converge. The final model structures are provided in the Appendix. Naming accuracies were binarily coded as the dependent variable in the logistic regression model. P values for the fixed effects were obtained using the lmerTest package. The effect of character Zipf was significant, β = 4.77, SE = .46, z = 10.45, p < .001, as well as the effect of character logCD, β = 6.38, SE = .67, z = 9.46, p < .001.

Predicting children’s word reading

Next, to validate the words’ frequency and CD measures in CCLOOW, we collected beginner readers’ behavioral data in word naming and lexical decision experiments. Because we also wanted to compare its predictive validity with WF indexed from written corpora, we sampled words that did not correlate much on CCLOOW and CCLOWW G2 (Li et al., 2022) frequencies. The stimuli were 150 two-character words. The correlation between the words’ frequencies in the two corpora was low, Pearson’s r = .21, p < .001. The distributions of the target words’ frequencies in the two databases are shown in Fig. 2.

Fig. 2 Distributions of words’ Zipf values in CCLOOW and in CCLOWW G2. Grey dots represent the common words in the two databases. Colored dots represent the sampled target words in the children’s word naming and lexical decision experiments

Word naming

Forty children participated in the experiment (18 males, age M = 8.05, SD = .55). One participant was excluded from the analysis due to technical error during data collection. Data from 39 children were included in the analysis. In the experiment, each child was seated in front of a computer in a quiet classroom on campus. They were presented a central fixation cross for 500 ms, followed by a target word at the center of a computer screen until an oral response was captured by the voice key or for a maximum of 3500 ms. The stimuli were presented in randomized order. The children were asked to say the word aloud to the microphone as quickly and accurately as they could. Optional breaks were provided after every 50 trials. The task took about 10 min.

Lexical decision

Thirty-one children (13 males, age M = 8.05, SD = .55) participated in lexical decision. Two participants’ data were excluded from the analysis because their accuracies were close to chance level (.50). Data from 29 children were included in the analysis. One hundred and fifty pseudowords were created by combining the first character of a target word with the second character of another one. Ten native adult Chinese speakers judged that they were made-up words. In the experiment, after a fixation cross of 500 ms, a target word or pseudoword was shown at the center of the screen for a maximum of 5000 ms. The children were asked to press one of two keys on the keyboard to indicate whether the word was a real word, meaning that they had seen it before and knew its meaning, or a made-up word, meaning that they had not seen it before and did not know its meaning. The 300 stimuli were split into five lists, each containing 30 target words and 30 pseudowords. A blocked design was used. The assignment of the lists was counterbalanced across participants such that each list had a similar total number of occurrences in the five blocks. Optional breaks were provided between blocks. The task took about 20 min.
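The recombination step for pseudowords can be sketched in Python as below. This only generates candidates; it does not replace the judgments by native speakers described above. The function name `make_pseudowords`, the rotation strategy for choosing the donor word, the lexicon check, and the toy words are all assumptions made for illustration.

```python
def make_pseudowords(targets, lexicon):
    """Pair the first character of each target with the second character
    of a different target, skipping candidates that are real words."""
    n = len(targets)
    pseudowords = []
    used = set()
    for i, word in enumerate(targets):
        # Try donors in rotation order until a usable candidate is found.
        for shift in range(1, n):
            donor = targets[(i + shift) % n]
            candidate = word[0] + donor[1]
            if candidate not in lexicon and candidate not in used:
                pseudowords.append(candidate)
                used.add(candidate)
                break
        else:
            raise ValueError(f"no valid pseudoword found for {word}")
    return pseudowords

# Toy two-character targets; the lexicon holds known real words to avoid.
targets = ["苹果", "老师", "学校", "朋友"]
lexicon = set(targets) | {"学友"}
pseudowords = make_pseudowords(targets, lexicon)
print(pseudowords)
```

In the toy example the candidate 学友 is rejected because it is listed as a real word, so the next donor is tried. Note that this simple rotation does not guarantee that each second character is reused exactly once.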

Results

Data were cleaned before we proceeded to statistical analyses. In word naming, all data (5850 trials) were included in the analysis for naming accuracy; for naming RTs, incorrect responses, speech errors and responses with RTs beyond 3 standard deviations from the participant’s mean were removed for each participant (16.74% of the data), resulting in 4870 valid observations. In lexical decision, 1.79% of the trials on the target words had RTs beyond 3 SDs of the participants’ mean and were excluded, leading to 4272 valid observations in the analysis of accuracy; among them, 3533 trials were correctly responded to and the data were used in the analysis of lexical decision RTs. The average word naming accuracy was .90 (SD = .30) and the average naming RT was 966.09 ms (SD = 295.85 ms). The average lexical decision accuracy was .83 (SD = .38) and the average RT was 1126.31 ms (SD = 491.91 ms).
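The RT trimming rule can be illustrated with a short Python sketch. It assumes that each participant's mean and standard deviation are computed over that participant's correct responses, which is one possible reading of the procedure; the function name `trim_rts`, the tuple layout, and the toy trials are hypothetical.

```python
from statistics import mean, stdev

def trim_rts(trials, n_sd=3.0):
    """Keep correct trials whose RT lies within n_sd standard deviations
    of the participant's own mean.

    trials: list of (participant_id, rt_ms, correct) tuples.
    """
    correct_rts = {}
    for pid, rt, correct in trials:
        if correct:
            correct_rts.setdefault(pid, []).append(rt)

    bounds = {}
    for pid, rts in correct_rts.items():
        if len(rts) < 2:  # SD is undefined for a single observation
            bounds[pid] = (float("-inf"), float("inf"))
        else:
            m, sd = mean(rts), stdev(rts)
            bounds[pid] = (m - n_sd * sd, m + n_sd * sd)

    return [
        (pid, rt, correct)
        for pid, rt, correct in trials
        if correct and bounds[pid][0] <= rt <= bounds[pid][1]
    ]

# Toy data: 20 typical correct trials, one extreme RT, one incorrect trial.
trials = [("p1", 900 + 10 * i, True) for i in range(20)]
trials.append(("p1", 5000, True))
trials.append(("p1", 950, False))
kept = trim_rts(trials)
print(len(kept))
```

Computing the bounds per participant, rather than over the pooled data, prevents generally slow readers from losing a disproportionate share of their trials.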

Again, we first built simple mixed effects models in which the words’ Zipf and logCD computed in CCLOOW and CCLOWW G2 were separately added into the baseline model as a fixed factor. Because of high multicollinearity among the predictors, here we did not build full models with all variables included. Naming and lexical decision accuracy was binarily coded as the dependent variable in logistic mixed effects models, and log-transformed RT was used as the dependent variable in linear mixed effects models. The results are shown in Table 3. Both WF and CD computed in the current database and in CCLOWW G2 significantly predicted the children’s word naming and lexical decision accuracies and RTs.

Table 3 Results of simple mixed-effects logistic and linear models fitted to accuracy and log RT in naming and lexical decision

Next, to examine the unique effects of each frequency measure, we constructed multiple logistic and linear mixed effects models in a stepwise manner. The base model first included some psycholinguistic variables known to influence naming and lexical decision performance, including orthographic complexity (i.e., number of strokes), concreteness, age of acquisition (AoA), and the number of strokes and the phonetic regularity of the first and second constituting characters (C1 and C2). Phonetic regularity was coded to indicate non-phonograms (0) and whether (1) or not (–1) a phonogram is identical to its phonetic radical in pronunciation. Concreteness values were obtained for all 150 target words from Xu and Li (2020). The numbers of strokes for the 150 words and the phonetic regularity for 104 of the first and for 102 of the second constituting characters were acquired from Liu et al. (2007). Nonetheless, among the variables, only AoA significantly predicted the children’s naming and lexical decision. Therefore, we only kept AoA as the fixed effect in the base model (Model 0).

In the next steps, CCLOWW G2 word frequency was entered (Model 1), followed by CCLOWW G2 frequency of the constituting characters (C1_Zipf_CCLOWW G2 and C2_Zipf_CCLOWW G2, Model 2). Then we entered CCLOOW word frequency (Model 3) and finally CCLOOW character frequency (Model 4). Changes in the Akaike information criterion (AIC) were used to identify the set of predictors that maximize model fit. Likelihood ratio test was used to compare the nested models, the results of which are summarized in Table 4. Model 3 fit the data better than other models across all outcome variables.
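The two model comparison statistics used here follow directly from the models' log-likelihoods. The Python sketch below shows the arithmetic under stated assumptions: the log-likelihood values and parameter counts are made up for illustration, the function names are hypothetical, and the chi-square tail probability is given only in its closed forms for 1 and 2 degrees of freedom, which correspond to steps that add one predictor (word frequency) or two predictors (the frequencies of the two characters).

```python
import math

def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik

def likelihood_ratio_test(ll_reduced: float, ll_full: float, df: int):
    """Chi-square statistic and p value for two nested models."""
    stat = 2 * (ll_full - ll_reduced)
    if stat < 0:
        raise ValueError("the full model should not fit worse than the reduced model")
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, df = 1
    elif df == 2:
        p = math.exp(-stat / 2)             # chi-square survival function, df = 2
    else:
        raise NotImplementedError("only df = 1 or 2 are handled in this sketch")
    return stat, p

# Made-up log-likelihoods for a reduced model and a model with one extra predictor.
ll_m0, k_m0 = -1000.0, 4
ll_m1, k_m1 = -995.0, 5

stat, p = likelihood_ratio_test(ll_m0, ll_m1, df=k_m1 - k_m0)
delta_aic = aic(ll_m1, k_m1) - aic(ll_m0, k_m0)
print(round(stat, 2), round(p, 4), delta_aic)
```

A negative change in AIC together with a significant likelihood ratio test indicates that the added predictor improves fit beyond what the extra parameter would be expected to buy.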

Table 4 Results of likelihood ratio test for model comparisons

Results of the final regression models, that is, Model 3, are presented in Table 5. CCLOOW WF was a significant predictor of the children’s word recognition in all outcome measures, whereas the effect of CCLOWW G2 written word frequency was no longer significant once CCLOOW WF was included in the model. CCLOWW G2 written frequency of the constituting characters, particularly the first character, also consistently predicted word recognition performance.

Table 5 Results of final multiple regression models fitted to accuracy and log RT in naming and lexical decision

We then examined the additional effect of subtitle-based adult frequencies on the children’s data by adding SUBTLEX-CH frequencies to the above multiple regression models. SUBTLEX-CH frequency was a significant predictor of the children’s naming RTs (β = –.02, SE = .01, t = –2.77, p = .006) and lexical decision RTs (β = .03, SE = .01, t = 2.57, p = .001), but not of naming accuracy (β = –.04, SE = .28, z = .13, p = .894) or lexical decision accuracy (β = .10, SE = .22, z = .347, p = .638). Intriguingly, its effect on lexical decision RTs was in the reverse direction of the typical WF effect. Therefore, we built another simple mixed-effects model with SUBTLEX-CH frequency as the only fixed effect. In this model, the effect of SUBTLEX-CH frequency on lexical decision RTs only approached significance (β = –.03, SE = .01, t = –1.90, p = .060).

Predicting adults’ word naming and lexical decision RTs

Given Li et al.’s (2022) finding that frequency measures based on children’s books explained variance in adults’ word naming and lexical decision RTs in addition to adult frequency measures, we also explored whether CCLOOW frequencies can explain adults’ word recognition. We obtained adult word naming data from Li et al. (2022) and lexical decision data from Tsang et al. (2018) for 277 two-character words and regressed the RTs on SUBTLEX-CH frequencies and CCLOOW frequencies. Accuracies were not analyzed because they were at ceiling. A number of other significant predictors of the RTs found in Li et al. (2022) were also included in the linear regression models: frequency of the first character in CCLOOW (C1_Zipf_CCLOOW), frequency of the second character in CCLOOW (C2_Zipf_CCLOOW), total number of strokes (N strokes), regularity of the first character (Regularity_C1), number of words in which the first (Ortho N_C1) and the second character (Ortho N_C2) occur, number of characters in which the semantic radical of the first (SR ortho N_C1) and the second character (SR ortho N_C2) occurs, and number of characters in which the phonetic radical of the first (PR ortho N_C1) and the second character (PR ortho N_C2) occurs. These variables were obtained from Sun et al. (2018). Concreteness obtained from Xu and Li (2020) was also included in the lexical decision model. AoA and the stroke counts of the constituting characters were not included to avoid multicollinearity issues. The results are shown in Table 6.

Table 6 Results of linear regression models predicting adults’ word naming and lexical decision RTs

The significant predictors jointly explained 60.48% of the variance in adults’ word naming RTs and 57.42% of the variance in lexical decision RTs. The effect of CCLOOW word frequencies was significant on lexical decision but not on naming RTs. We then explored the unique variance in the RTs explained by SUBTLEX-CH frequencies and by CCLOOW frequencies of the words and of the constituting characters in stepwise multiple regression analyses. A base model with the significant predictors in the previous analyses other than the frequency measures was first built (Step 1). We entered SUBTLEX-CH word frequencies in Step 2 and CCLOOW word frequencies in Step 3. The CCLOOW character frequencies of the first and the second constituting characters were entered in Steps 4 and 5, respectively. The results (Table 7) show that in word naming, the effect of CCLOOW words’ Zipf was significant, β = –.06, SE = .03, t = –2.02, p = .043, and accounted for 15.40% of the variance in naming RTs.

Nevertheless, its effect became nonsignificant once the first characters’ CCLOOW Zipf was added, which contributed 22.00% additional variance. In comparison, in lexical decision, the effect of CCLOOW words’ Zipf was consistently significant and explained 8.12% of the variance, while the characters’ Zipf values jointly explained 6.70% of the variance in the RTs.
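The variance contributed at each step of a hierarchical regression is the change in R² between successive models, which can be evaluated with an F test on that change. The Python sketch below illustrates the arithmetic only; the R² values and predictor count are hypothetical placeholders rather than figures from Table 7, and the function name is an assumption.

```python
def r2_change(r2_reduced, r2_full, n_obs, n_pred_full, n_added):
    """Return (delta R^2, F statistic, residual df) for a regression step.

    F = (delta R^2 / n_added) / ((1 - R^2_full) / (n_obs - n_pred_full - 1))
    """
    delta = r2_full - r2_reduced
    df_resid = n_obs - n_pred_full - 1
    f_stat = (delta / n_added) / ((1.0 - r2_full) / df_resid)
    return delta, f_stat, df_resid

# Hypothetical step: one predictor added to a model fitted to 277 words
delta, f_stat, df_resid = r2_change(
    r2_reduced=0.30, r2_full=0.38, n_obs=277, n_pred_full=8, n_added=1
)
```

The resulting F statistic would be referred to an F distribution with `n_added` and `df_resid` degrees of freedom.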

Table 7 Results of stepwise regression analyses of adults’ word naming and lexical decision RTs

Discussion and conclusion

CCLOOW is the first lexical database based on animated movies and TV series for Chinese children. Compared with the recent CCLOWW database (Li et al., 2022) compiled from printed books, CCLOOW provides frequency and contextual diversity values of characters and words based on another important source of language input in children’s environment. The database is particularly appropriate for profiling the language experience of children transitioning from pre-readers to beginner readers (3- to 9-year-olds). CCLOOW will add to a comprehensive description of the lexical statistics for developing language learners in mainland China.

Our analysis shows that frequencies computed from animated movies and TV series for children are reliable estimates of young children’s experience with characters and words. The frequency and contextual diversity measures correlated well within the CCLOOW database. Their correlations with the children’s lexical database based on printed books (CCLOWW G2, Li et al., 2022) and with the adult subtitle database (SUBTLEX-CH, Cai & Brysbaert, 2010) were also reasonably high. In particular, the correlations of word measures were higher between the two children’s corpora than between the children’s and the adult corpora. This indicates that language materials for children, regardless of the register, differ from those for adults, highlighting the importance of building specialized corpora for children. This suggestion was corroborated by the results of the word recognition experiments with Grade 2 children.

CCLOOW measures significantly predicted Grade 2 children’s character naming, word naming and lexical decision performance. In particular, the WF effects on word recognition were robust and went beyond those of the other frequency measures (CCLOWW G2 and SUBTLEX-CH). When WF measures from these databases were entered into one model, only CCLOOW WF remained significant. This contrast shows that beginner readers’ lexical processing is much better explained by frequencies computed from children’s materials, particularly spoken texts, than by those computed from adults’ language samples, again stressing the necessity of children’s lexical databases in developmental psycholinguistic research.

The finding that the effect of CCLOWW G2 written WF was also diminished once CCLOOW WF was added to the models suggests that beginner readers’ lexical knowledge is still largely shaped by their spoken language environment. This suggestion was also supported by our finding that CCLOWW G2 frequency of the words’ constituting characters did predict the children’s word recognition performance. For Chinese children, the priority of the early school years is to learn to read characters rather than words. Shu et al. (2003) have similarly suggested that the task of learning characters is the heaviest for children until Grade 3. Therefore, the beginner readers in our word naming and lexical decision experiments might have been intensively exposed to written characters rather than written words, thus demonstrating a strong frequency effect of written characters but not of written words. It is likely that the written WF effect would be more pronounced in older children and in children who started reading early, whereas CCLOOW WF might perform even better with younger children with limited reading experience. We aim to investigate these questions in future work. Note that children in rural areas might not have as much access to TV and screen media as children living in cities. Given that they are also likely to be exposed to written texts later or less often, other sources of language input such as daily conversation might have a greater impact. How CCLOOW statistics might affect character and word reading in these children also awaits investigation.

In our analysis of the adults’ word naming and lexical decision data, the effect of CCLOOW word frequencies was diminished once character frequencies were added to the model predicting naming RTs, whereas the WF effect remained robust in lexical decision. Moreover, CCLOOW character frequencies explained as much as 22.20% extra variance in naming RTs, in addition to adults’ word frequencies and CCLOOW word frequencies. In lexical decision RTs, by contrast, CCLOOW word frequencies contributed more variance (8.12%) than character frequencies (6.70%). Intriguingly, CCLOOW word frequencies contributed three times more variance (15.40%) than SUBTLEX-CH frequencies (4.93%) in naming, but the pattern was reversed for lexical decision. These results are consistent with Li et al.’s (2022) finding, and together they suggest that early language experience in childhood, particularly early spoken language experience, might have lasting impacts on the mature lexicon. How variations in early spoken and written language input might impact lexical processing in skilled language users warrants further investigation.

Although frequencies are the most established predictors of lexical processing, it should be noted that there are other indices of words’ occurrences in spoken and written texts which likely influence lexical representation and acquisition. We have shown that CCLOOW CD could also predict children’s word and character reading, but its effects were not compared to those of frequency. The two indices were often analyzed separately due to their high correlation, yet it is possible to disentangle their effects by examining an alternative measure of CD: semantic diversity (SD). SD considers the content rather than the number of documents in which a word occurs to capture context-dependent variability in word meaning (Hoffman et al., 2013; Jones et al., 2012). Importantly, some studies have found that SD outperformed WF and CD in predicting naming and lexical decision latencies in adults (Chang & Lee, 2018; Hoffman et al., 2013; Jones et al., 2012; Jones et al., 2017) and children (Hsiao & Nation, 2018). Future work should also consider indices such as SD to measure how variability in words’ meaningful occurrences in context influences children’s reading and learning to read in Chinese.

The CCLOOW database is accessible online at https://www.learn2read.cn/ccloow. Statistics of frequency, contextual diversity, word length, and syntactic category are provided for characters, non-lemmatized POS-tagged words, and lemmatized words. Search functions are available so that users can tailor queries to their needs. In the future, we plan to include other lexical and sublexical variables to provide a comprehensive account of the characters and words in the language environment of young Chinese children from early to late childhood.