ABSTRACT
Stack Overflow is widely regarded as the most popular community-driven question answering (CQA) website for programmers. Questions posted on Stack Overflow that are not related to programming topics are marked as 'closed' by experienced users and community moderators. A question can be 'closed' for one of five reasons: duplicate, off-topic, subjective, not a real question, or too localized. In this work, we present the first study of 'closed' questions on Stack Overflow. We download 4 years of publicly available data containing 3.4 million questions. We first analyze and characterize the complete set of 0.1 million 'closed' questions. Next, we use a machine learning framework to build a predictive model that identifies a 'closed' question at the time of question creation.
One of our key findings is that, despite being marked as 'closed', subjective questions contain high information value and are very popular with users. We observe an increasing trend in the percentage of closed questions over time and find that this increase is positively correlated with the number of newly registered users. In addition, we see a decrease in community participation in marking questions as 'closed', which has led to an increase in moderation job time. We also find that questions closed with the Duplicate and Off Topic labels are relatively more prone to reputation gaming. Our analysis suggests broader implications for content quality maintenance on CQA websites. For the 'closed' question prediction task, we use multiple feature sets based on user profile, community process, textual style, and question content. We use a state-of-the-art machine learning classifier based on an ensemble framework and achieve an overall accuracy of 70.3%. Analysis of the feature space reveals that 'closed' questions are relatively less informative and descriptive than non-'closed' questions. To the best of our knowledge, this is the first experimental study to analyze and predict 'closed' questions on Stack Overflow.
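To make the prediction setup concrete, here is a minimal sketch of how a question might be mapped to features at creation time. It is an illustration only and not the authors' feature set: the function name `extract_features`, every feature name, and the sample question are assumptions. The abstract names only the four feature families, and community-process features are omitted here because they would require site history.

```python
import re

# All feature names below are hypothetical; the abstract names only the
# feature families (user profile, community process, textual style, content).
CODE_BLOCK = re.compile(r"<code>.*?</code>", re.DOTALL)
URL = re.compile(r"https?://\S+")


def extract_features(title, body, reputation, account_age_days):
    """Map a question, as seen at creation time, to a flat feature dict."""
    code_blocks = CODE_BLOCK.findall(body)
    text = CODE_BLOCK.sub(" ", body)  # prose with code snippets removed
    words = text.split()
    return {
        # user profile (assumed inputs)
        "reputation": reputation,
        "account_age_days": account_age_days,
        # textual style
        "title_length": len(title),
        "body_word_count": len(words),
        "ends_with_question_mark": title.rstrip().endswith("?"),
        # question content
        "code_block_count": len(code_blocks),
        "url_count": len(URL.findall(text)),
    }


# Made-up example question, for illustration only.
features = extract_features(
    title="Why does my loop never terminate?",
    body="I wrote <code>while i < 10: pass</code> and it hangs. Any ideas?",
    reputation=15,
    account_age_days=3,
)
print(features)
```

A dictionary like this could then be vectorized and passed to any ensemble classifier; which features and which classifier the paper actually uses are not specified in the abstract.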
Index Terms
- Fit or unfit: analysis and prediction of 'closed questions' on stack overflow