
Identifying Nonprofits by Scaling Mission and Activity with Word Embedding

  • Research Papers
  • Published in: VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations

Abstract

This study develops a new text-as-data method for organization identification based on word embedding. We introduce the method and apply it to identify identity-based nonprofit organizations, using the mission and activity information that U.S. nonprofits reported on IRS Form 990 from 2010 to 2016. Our results show that the method is simple yet versatile. It complements existing dictionary-based approaches and supervised machine learning methods for classification purposes, and it generates a reliable continuous measure of document-to-keyword relevance. Our approach provides a nonbinary alternative for nonprofit big data analyses. Using word embedding, researchers can identify organizations of interest, track possible changes over time, and capture nonprofits’ multi-dimensionality.
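For illustration, the core scoring step can be sketched as follows. This is a minimal sketch of the general idea, not the authors’ exact pipeline: the pretrained "glove-wiki-gigaword-100" vectors stand in for embeddings trained on Form 990 text, and `embed` and `relevance` are hypothetical helper names.

```python
# A minimal sketch: embed a mission statement as the average of its word
# vectors, then score its relevance to a keyword by cosine similarity.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # maps word -> 100-d vector

def embed(text):
    """Average the vectors of all in-vocabulary words in the text."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def relevance(text, keyword):
    """Continuous document-to-keyword relevance (cosine similarity)."""
    doc, key = embed(text), vectors[keyword]
    return float(doc @ key / (np.linalg.norm(doc) * np.linalg.norm(key)))

mission = "provide scholarships and mentoring to immigrant women and girls"
print(relevance(mission, "immigrant"))  # higher score = more relevant
```

Because the score is continuous rather than binary, one can rank all organizations by relevance to a query term or track an organization’s score across filing years.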


Notes

  1. “Low-dimensional” numeric representation with word embedding turns every word into a 100–300-dimensional numeric “word vector.” Word vectors capture the relationships among words, although their absolute values have no interpretable meaning. They are considered “low-dimensional” relative to the “high-dimensional” representations of previous methods, whose numeric word representations can take tens of thousands of dimensions.
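As a concrete illustration of this contrast (with hypothetical sizes), compare a one-hot representation, whose dimensionality equals the vocabulary size, with a dense embedding lookup:

```python
# Hypothetical sizes: a 50,000-word vocabulary gives a 50,000-d one-hot
# vector per word, while an embedding compresses each word to a dense
# 100-d vector whose individual entries are not interpretable.
import numpy as np

V, N = 50_000, 100
one_hot = np.zeros(V)
one_hot[123] = 1.0                  # "high-dimensional": shape (50000,)
embeddings = np.random.rand(V, N)   # stands in for trained word vectors
dense = embeddings[123]             # "low-dimensional": shape (100,)
print(one_hot.shape, dense.shape)   # (50000,) (100,)
```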

References

  • Applied Nonprofit Research (2019a). Form 990 variables (Versions 2009v1.4-2.12v3.0; 2013v3.0-2016v3.0).

  • Applied Nonprofit Research (2019b). Governance Dataset (Version 2019-01-15).

  • Arora, S., Liang, Y., & Ma, T. (2017). A simple but tough to beat baseline for sentence embeddings. In Proceedings of International Conference on Learning Representations.

  • Barman, E. (2013). Classificatory struggles in the nonprofit sector: the formation of the national taxonomy of exempt entities, 1969–1987. Social Science History, 37(1), 103–141.

    Google Scholar 

  • Brown, W. (2017). Classification of program activities: How nonprofits create social value. Administrative Sciences, 7(2), 12.

    Article  Google Scholar 

  • Carvalho, J.-P. (2016). Identity-based organizations. American Economic Review, 106(5), 410–14.

    Article  Google Scholar 

  • Crenshaw, K. (1990). Mapping the margins: Intersectionality, identity politics, and violence against women of color. Stanford Law Review, 43, 1241.

    Article  Google Scholar 

  • Dai, A. M., Olah, C., & Le, Q. V. (2015). Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  • Fyall, R., Moore, M. K., & Gugerty, M. K. (2018). Beyond ntee codes: Opportunities to understand nonprofit activity through mission statement content coding. Nonprofit and Voluntary Sector Quarterly, 47(4), 677–701.

    Article  Google Scholar 

  • Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644.

    Article  Google Scholar 

  • Grønbjerg, K. A., & Paarlberg, L. (2002). Extent and nature of overlap between listings of irs tax-exempt registration and nonprofit incorporation: The case of Indiana. Nonprofit and Voluntary Sector Quarterly, 31(4), 565–594.

    Article  Google Scholar 

  • Gugerty, M. K., & Prakash, A. (2010). Voluntary regulation of NGOs and nonprofits: An accountability club framework. Cambridge University Press.

    Book  Google Scholar 

  • Guo, C. (2007). When government becomes the principal philanthropist: The effects of public funding on patterns of nonprofit governance. Public Administration Review, 67(3), 458–473.

    Article  Google Scholar 

  • Iyyer, M., Manjunatha, V., Boyd-Graber, J., & III H. D. (2015). Deep unordered composition rivals syntactic methods for text classification. In Proceedings ofthe 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (pp. 1681–1691).

  • Joos, M. (1950). Description of language design. The Journal of the Acoustical Society of America, 22(6), 701–707.

    Article  Google Scholar 

  • Jurafsky, D. & Martin, J. H. (2019). Speech and language processing: An introduction to natual language processing, computational linguistics, and speech recognition (3rd Edition).

  • Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1746–1751).

  • Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., & Fidler, S. (2015). Skip-thought vectors, 786, 1–11.

  • Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5), 905–949.

    Article  Google Scholar 

  • Le, Q. V. & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st international conference on machine learning (Vol. 32).

  • Lecy, J. D., Ashley, S. R., & Santamarina, F. J. (2019). Do nonprofit missions vary by the political ideology of supporting communities? some preliminary results. Public Performance and Management Review, 42(1), 115–141.

    Article  Google Scholar 

  • Levy, O., Goldberg, Y., & Dagan, I. (2018). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225.

    Article  Google Scholar 

  • Litofcenko, J., Karner, D., & Maier, F. (2020). Methods for classifying nonprofit organizations according to their field of activity: A report on semi-automated methods based on text. VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations, 31(1), 227–237.

    Article  Google Scholar 

  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692

  • Ma, J. (2021). Automated coding using machine learning and remapping the US nonprofit sector: A guide and benchmark. Nonprofit and Voluntary Sector Quarterly, 50(3), 662–687.

    Article  Google Scholar 

  • Messamore, A., & Paxton, P. (2020). Surviving victimization: How service and advocacy organizations describe traumatic experiences, 1998–2016. Social Currents, 2329496520948198.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Proceedings of NIPS, 2013, 1–9.

    Google Scholar 

  • Mikolov, T., Yih, W. T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, pp. 746–751.

  • Minkoff, D. C. (2002). The emergence of hybrid organizational forms: Combining identity-based service provision and political action. Nonprofit and Voluntary Sector Quarterly, 31(3), 377–401.

    Article  Google Scholar 

  • Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1429.

    Article  Google Scholar 

  • Nguyen, H. T., & Le Nguyen, M. (2018). Multilingual opinion mining on youtube—A convolutional n-gram bilstm word embedding. Information Processing and Management, 54(3), 451–462.

    Article  Google Scholar 

  • Plummer, S., Hughes, M. M., & Smith, J. (2019). The challenges of organizational classification: A research note. Social. Currents, 2329496519878469.

  • Reid, E. J. (1999). Nonprofit advocacy and political participation (pp. 291–325). Nonprofits and government: Collaboration and conflict.

  • Salamon, L. M., & Anheier, H. K. (1996). The international classification of nonprofit organizations. Johns Hopkins University Institute for Policy Studies Baltimore Mar.

    Google Scholar 

Download references

Acknowledgments

An earlier version of this paper was presented at the 2019 Association for Public Policy Analysis & Management Annual Conference. We thank the panel attendees, Yuan Cheng, the editors and anonymous reviewers for their constructive feedback; we thank Jonathan Richter for research assistance.

Author information

Authors and Affiliations

Authors

Contributions

Both authors contributed to the study conception and design. Material preparation and data collection were performed by RZ. Methodology and analysis were performed by HC. Both authors drafted, revised, read, and approved the manuscript.

Corresponding author

Correspondence to Ruodan Zhang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Learning Distributed Semantics with word2vec

Word2vec has two variants: the Skip-gram model and the continuous bag-of-words (CBOW) model. The difference between the two lies in the specific prediction task performed: the Skip-gram model predicts context words from the target word, whereas the CBOW model predicts the target word from its context words. The definitions of the likelihood functions therefore differ slightly, leading to different optimization tasks.
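A runnable sketch of the two variants, assuming the gensim library (the appendix describes the algorithms themselves, not any particular implementation):

```python
# sg=1 selects Skip-gram, sg=0 selects CBOW; `window` is the context size c
# and `vector_size` is the embedding dimension N. The toy corpus is an
# assumption; real input would be tokenized mission/activity text.
from gensim.models import Word2Vec

sentences = [
    ["food", "bank", "serving", "low", "income", "families"],
    ["scholarships", "for", "immigrant", "women", "and", "girls"],
]

skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
print(skipgram.wv["food"].shape)  # (100,)
```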

To formalize, we start with the following setup. Consider a sequence of text of T words in total. Let the size of the context window be c (i.e., the c immediate neighbors before and after a word are considered its context). Let \({\mathbf {v}}_w\) be the vector of distributed semantics (word vector) of word w. Let the size of the vocabulary be V (i.e., there are V unique words in the text). Let \(p(w_i | w_j)\) be the probability of word \(w_i\) appearing, given word \(w_j\). Let \(\mathcal {L}\) be the likelihood.

Word2vec uses a softmax function to link the distributed representations of words (word vectors) to their predicted probabilities. Specifically, the probability of word \(w_i\), given word \(w_j\) in its context window, is the exponential of the dot product of the word vectors \({\mathbf {v}}_{w_i}, \tilde{{\mathbf {v}}}_{w_j}\) over the sum of the exponentials of the dot products of \(\tilde{{\mathbf {v}}}_{w_j}\) with the word vectors of all words in the vocabulary:

$$\begin{aligned} \log p(w_i | w_j)&= \log \frac{\exp ({\mathbf {v}}_{w_i}^T \tilde{{\mathbf {v}}}_{w_j})}{\sum ^{V}_{k=1} \exp ({\mathbf {v}}_{w_k}^T \tilde{{\mathbf {v}}}_{w_j})} \qquad (5) \\&= {\mathbf {v}}_{w_i}^T \tilde{{\mathbf {v}}}_{w_j} - \log \sum ^{V}_{k=1} \exp ({\mathbf {v}}_{w_k}^T \tilde{{\mathbf {v}}}_{w_j}) \qquad (6) \end{aligned}$$
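In code, Eqs. (5)–(6) amount to a dot product normalized by a log-sum-exp over the vocabulary. A numpy sketch, where the weight matrices are random stand-ins for trained parameters:

```python
# W holds the target-role vectors v_w, W_ctx the context-role vectors v~_w.
# log_p(i, j) computes Eq. (6): the log-probability of word i given word j.
import numpy as np

V, N = 1000, 100                  # vocabulary and embedding sizes (assumed)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))       # target word vectors
W_ctx = rng.normal(size=(V, N))   # context word vectors

def log_p(i, j):
    scores = W @ W_ctx[j]         # v_{w_k}^T v~_{w_j} for every k
    return float(W[i] @ W_ctx[j] - np.log(np.sum(np.exp(scores))))
```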

Thus, the log-likelihood of the Skip-gram model is computed as follows. At location t of the text sequence, the joint conditional probability of the words in the context window (conditional on the target word at t) is calculated; the conditional probabilities are obtained by applying the softmax to the target word vector against each context word vector. The algorithm then moves to location \(t+1\) and repeats the process until the end of the sequence. The log-likelihood is the sum of all log probabilities. Formally:

$$\begin{aligned} \log \mathcal {L}&= \sum ^{T}_{t=1} \sum _{-c \le j \le c, j \ne 0} \log p(w_{t+j} | w_t) \qquad (7) \\&= \sum ^{T}_{t=1} \sum _{-c \le j \le c, j \ne 0} \log \frac{\exp ({\mathbf {v}}_{w_{t+j}}^T \tilde{{\mathbf {v}}}_{w_{t}})}{\sum ^{V}_{k=1} \exp ({\mathbf {v}}_{w_k}^T \tilde{{\mathbf {v}}}_{w_{t}})} \qquad (8) \\&= \sum ^{T}_{t=1} \left[ \sum _{-c \le j \le c, j \ne 0} {\mathbf {v}}_{w_{t+j}}^T \tilde{{\mathbf {v}}}_{w_t} - 2c \log \sum _{k=1}^{V} \exp ({\mathbf {v}}_{w_k}^T \tilde{{\mathbf {v}}}_{w_t}) \right] \qquad (9) \end{aligned}$$
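Eqs. (7)–(9) then translate into a direct, deliberately inefficient double loop. This continues the numpy sketch above (reusing `W`, `W_ctx`, `log_p`, and `rng`); `tokens` is a hypothetical sequence of word indices:

```python
# Slide over the token sequence and sum the log-probability of each context
# word given the target word at position t; O(T * c * V) as written.
def skipgram_log_likelihood(tokens, c):
    ll, T = 0.0, len(tokens)
    for t in range(T):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < T:
                ll += log_p(tokens[t + j], tokens[t])  # Eq. (7)
    return ll

print(skipgram_log_likelihood(list(rng.integers(0, V, size=50)), c=2))
```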

Similarly, the log-likelihood of the CBOW model is computed as follows: at location t of the text sequence, the probability of the target word given its context words is calculated; the conditional probability is obtained by a softmax of the target word vector against the average of the context word vectors. The algorithm then moves to location \(t+1\) and repeats the process until the end of the sequence. The log-likelihood is the sum of all log probabilities. Formally:

$$\begin{aligned} \log \mathcal {L}&= \sum ^{T}_{t=1} \log p(w_t | w_{t-c}, w_{t-c+1}, \ldots , w_{t+c-1}, w_{t+c}) \qquad (10) \\&= \sum ^{T}_{t=1} \log \frac{\exp ({\mathbf {v}}_{w_t}^T \bar{{\mathbf {v}}}_t)}{\sum ^{V}_{k=1} \exp ({\mathbf {v}}_{w_k}^T \bar{{\mathbf {v}}}_t)} \qquad (11) \\&= \sum ^{T}_{t=1} \left[ {\mathbf {v}}_{w_t}^T \bar{{\mathbf {v}}}_t - \log \sum ^{V}_{k=1} \exp ({\mathbf {v}}_{w_k}^T \bar{{\mathbf {v}}}_t) \right] \qquad (12) \\&\quad \text {where } \bar{{\mathbf {v}}}_t = \frac{1}{2c} \sum _{-c \le j \le c, j \ne 0} \tilde{{\mathbf {v}}}_{w_{t+j}} \qquad (13) \end{aligned}$$
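The CBOW likelihood, Eqs. (10)–(13), differs only in averaging the context vectors before the softmax. Again continuing the sketch above:

```python
# For each position t, average the context vectors (Eq. 13), then score the
# target word against that average with the same softmax (Eqs. 11-12).
def cbow_log_likelihood(tokens, c):
    ll, T = 0.0, len(tokens)
    for t in range(T):
        ctx = [tokens[t + j] for j in range(-c, c + 1)
               if j != 0 and 0 <= t + j < T]
        v_bar = np.mean(W_ctx[ctx], axis=0)   # Eq. (13)
        scores = W @ v_bar                    # v_{w_k}^T v_bar for all k
        ll += float(W[tokens[t]] @ v_bar - np.log(np.sum(np.exp(scores))))
    return ll
```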

Both the Skip-gram and CBOW models train vector representations of words to maximize the likelihood defined above. The process is operationalized as a neural network trained by stochastic gradient descent: both variants are neural networks with one hidden layer and two weight matrices. The first weight matrix \({\mathbf {W}}_{V \times N}\) contains the vector representations of all V words in the vocabulary in their target role: \({\mathbf {W}}_{V \times N} = [{\mathbf {v}}_{w_1}, {\mathbf {v}}_{w_2}, \ldots , {\mathbf {v}}_{w_V}]^T\). The second weight matrix \(\tilde{{\mathbf {W}}}_{N \times V}\) contains the vectors of words in their context role: \(\tilde{{\mathbf {W}}}_{N \times V} = [\tilde{{\mathbf {v}}}_{w_1}, \tilde{{\mathbf {v}}}_{w_2}, \ldots , \tilde{{\mathbf {v}}}_{w_V}]\). The input and output layers are one-hot-encoded words. The difference between Skip-gram and CBOW is evident in the model architectures: Skip-gram uses target words to predict context words, while CBOW uses context words to predict target words. For the final output, researchers can use either of the two weight matrices \({\mathbf {W}}_{V \times N}, \tilde{{\mathbf {W}}}_{N \times V}^T\) or their average as the representation of distributed semantics.
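The forward pass of this architecture is short enough to spell out, continuing the numpy sketch above; the one-hot multiplication is shown literally, though in practice it reduces to a row lookup:

```python
# A one-hot input selects a row of W (the hidden layer is an embedding
# lookup); W_ctx then maps the hidden layer to one score per vocabulary
# word, and softmax turns scores into probabilities. Forward pass only;
# training would backpropagate through both weight matrices.
def forward(target_idx):
    x = np.zeros(V)
    x[target_idx] = 1.0         # one-hot encoded input word
    hidden = x @ W              # equals W[target_idx], shape (N,)
    scores = hidden @ W_ctx.T   # shape (V,), one score per word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()      # softmax over the vocabulary
```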

Training word2vec models can be computationally taxing, primarily because the complexity of the softmax step (Eq. 5) grows linearly with the vocabulary size (i.e., O(V) complexity): in the forward pass, the denominator requires a summation over the whole vocabulary of size V; in the backpropagation, all V word vectors in the vocabulary are updated. Two methods have been developed to boost efficiency. First, hierarchical softmax uses a binary tree in which words are represented by its leaf units; the probability of a word being the output is estimated by the probability of the path from the root to that word's leaf. Given the tree structure, the method reduces the computational complexity from O(V) to \(O(\log _2 V)\). A second and more intuitive method, negative sampling, takes a random sample of words from the vocabulary to approximate the denominator in the forward pass and updates only that sample in the backpropagation; the computational complexity thus depends on the size of the negative sample and does not grow with the vocabulary size. Both methods have demonstrated good performance in existing applications.
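In gensim, for instance (an assumed implementation, using the library's documented parameters), the two speed-ups are exposed as the `hs` and `negative` arguments:

```python
# hs=1 enables hierarchical softmax; negative=k draws k negative samples
# per positive example. The toy corpus is a stand-in for real tokenized text.
from gensim.models import Word2Vec

sentences = [["food", "bank", "families"], ["arts", "education", "youth"]]
hier = Word2Vec(sentences, vector_size=100, min_count=1, sg=1, hs=1, negative=0)
neg = Word2Vec(sentences, vector_size=100, min_count=1, sg=1, hs=0, negative=10)
```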

Appendix B: Word Cloud of Other Query Terms

[Figure: word clouds for the other query terms]

Appendix C: Additional Histograms of Cosine Similarity

[Figure: additional histograms of cosine similarity]

About this article


Cite this article

Chen, H., Zhang, R. Identifying Nonprofits by Scaling Mission and Activity with Word Embedding. Voluntas 34, 39–51 (2023). https://doi.org/10.1007/s11266-021-00399-7
