Abstract
Text analytics based on supervised machine learning has shown great promise in a multitude of domains but has yet to be applied to seismology. We describe some common classifiers (Naïve Bayes, k-Nearest Neighbors, Support Vector Machines, and Random Forests) as well as the standard steps of supervised learning (training, validation for adjusting model parameters, and testing). To illustrate text classification on a seismological corpus, we use a hundred articles on the topic of precursory accelerating seismicity, published between 1988 and 2010. This corpus was labelled by Mignan [Tectonophysics, 2011] according to whether the precursor is explained by critical processes (i.e., cascade triggering) or by other processes (such as a signature of main fault loading). We investigate how the classification process can be automated to help analyze larger corpora and thereby better understand trends in earthquake predictability research. We find that the Naïve Bayes model performs best, in agreement with the machine learning literature for the case of small datasets, with cross-validation accuracies showing the model’s predictive ability for both a binary classification (“critical process” or not) and a multiclass classification (“non-critical process,” “agnostic,” “critical process assumed,” “critical process demonstrated”). Prediction on a dozen articles published since 2011, however, shows weak generalization, which can be explained, in part, by the empirical variance of the small training set. This preliminary study demonstrates the potential of supervised learning to reveal textual patterns in the seismological literature. Manual labelling remains essential but is made transparent by an investigation of Naïve Bayes keyword posterior probabilities.
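As a rough illustration of the workflow summarized above, the following Python sketch fits a multinomial Naïve Bayes classifier, estimates its accuracy by leave-one-out cross-validation, and computes a keyword posterior probability. It is not the implementation used in the study: the four text snippets are invented toy documents rather than corpus articles, Laplace smoothing with alpha = 1 is an assumption, and the names `train_nb`, `predict`, `keyword_posterior`, and `loo_accuracy` are hypothetical.

```python
import math
from collections import Counter

# Toy snippets invented for illustration (NOT taken from the actual corpus),
# each paired with a hand-assigned binary label.
TRAIN = [
    ("power law critical point cascade triggering", "critical"),
    ("self organized critical cascade of events", "critical"),
    ("stress loading on the main fault", "non-critical"),
    ("fault loading and stress accumulation", "non-critical"),
]

def train_nb(docs, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing alpha."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in docs:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    # Class priors are estimated from the label frequencies.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict(model, text):
    """Return the class with the highest log-posterior score."""
    log_prior, log_lik, vocab = model
    scores = {
        # Words never seen during training are ignored.
        c: log_prior[c] + sum(log_lik[c][w] for w in text.split() if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

def keyword_posterior(model, word):
    """P(class | word) via Bayes' rule, for one in-vocabulary keyword."""
    log_prior, log_lik, _ = model
    joint = {c: math.exp(log_prior[c] + log_lik[c][word]) for c in log_prior}
    norm = sum(joint.values())
    return {c: p / norm for c, p in joint.items()}

def loo_accuracy(docs):
    """Leave-one-out cross-validation accuracy."""
    hits = 0
    for i, (text, label) in enumerate(docs):
        model = train_nb(docs[:i] + docs[i + 1:])
        hits += predict(model, text) == label
    return hits / len(docs)

MODEL = train_nb(TRAIN)
```

Calling `predict(MODEL, "cascade triggering near the critical point")` assigns the toy sentence to the "critical" class, and `keyword_posterior(MODEL, "cascade")` gives that class a posterior above 0.5, which is the kind of keyword-level inspection that can make manual labelling more transparent. With a real corpus, text preprocessing (tokenization, stop-word removal, term weighting) would precede these steps.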
References
Adamaki AK, Roberts RG (2017) Precursory activity before larger events in Greece revealed by aggregated seismicity data. Pure Appl Geophys 174:1331–1343. https://doi.org/10.1007/s00024-017-1465-6
Aggarwal CC (2018) Machine learning for text. Springer Nature, 493 pp. https://doi.org/10.1007/978-3-319-73531-3
Bak P, Tang C (1989) Earthquakes as a self-organized critical phenomenon. J Geophys Res 94:15,635–15,637
Bennett KP, Campbell C (2000) Support vector machines: hype or hallelujah? SIGKDD Explor 2:1–13
Benoit K (2018) Quantitative analysis of textual data, package 'quanteda', available at https://cran.r-project.org/web/packages/quanteda/ (last accessed August 2018)
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Bouchon M, Marsan D (2015) Reply to 'Artificial seismic acceleration'. Nat Geosci 8:83
Bouchon M, Durand V, Marsan D, Karabulut H, Schmittbuhl J (2013) The long precursory phase of most large interplate earthquakes. Nat Geosci 6:299–302
Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall/CRC, Taylor & Francis Group 358 pp
Bufe CG, Varnes DJ (1993) Predictive modeling of the seismic cycle of the greater San Francisco Bay region. J Geophys Res 98:9,871–9,883
Christou EV, Karakaisis G, Scordilis E (2016) Time dependent seismicity along the western coast of Canada. Res Geophys 5:5730
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory IT-13:21–27
De Santis A, Cianchini G, Di Giovambattista R (2015) Accelerating moment release revisited: examples of application to Italian seismic sequences. Tectonophysics 639:82–98. https://doi.org/10.1016/j.tecto.2014.11.015
Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29:103–130
Felzer KR, Page MT, Michael AJ (2015) Artificial seismic acceleration. Nat Geosci 8:82–83
Forman G (2008) BNS feature scaling: an improved representation over TF-IDF for SVM text classification. ACM 17th Conf Info Knowl Management, 263–270
Freund Y, Schapire RE (1999) A short introduction to boosting. J Japanese Soc AI 14:771–780
Geller RJ (1997) Earthquake prediction: a critical review. Geophys J Int 131:425–450
Glez-Peña D, Lourenço A, Lopez-Fernandez H, Reboiro-Jato M, Fdez-Riverola F (2013) Web scraping technologies in an API world. Brief Bioinform 15:788–797
Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 21:267–297. https://doi.org/10.1093/pan/mps028
Grün B, Hornik K (2017) Topic models, package 'topicmodels', available at https://cran.r-project.org/web/packages/topicmodels/ (last accessed August 2018)
Guilhem A, Bürgmann R, Freed AM, Tabrez Ali S (2013) Testing the accelerating moment release (AMR) hypothesis in areas of high stress. Geophys J Int 195:785–798. https://doi.org/10.1093/gji/ggt298
Hardebeck JL, Felzer KR, Michael AJ (2008) Improved tests reveal that the accelerating moment release hypothesis is statistically insignificant. J Geophys Res 113:B08310. https://doi.org/10.1029/2007JB005410
Hechenbichler K, Schliep KP (2004) Weighted k-nearest-neighbor techniques and ordinal classification. Discussion paper 399, SFB 386, Ludwig-Maximilians University, Munich
Hough S (2010) Predicting the unpredictable: the tumultuous science of earthquake prediction. Princeton University Press 272 pp
Huang H, Meng L (2018) Slow unlocking processes preceding the 2015 Mw 8.4 Illapel, Chile, earthquake. Geophys Res Lett 45:3914–3922. https://doi.org/10.1029/2018GL077060
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recogn Lett 31:651–666. https://doi.org/10.1016/j.patrec.2009.09.011
Jiang C, Wu Z (2012) Insights into the long-to-intermediate-term pre-shock accelerating moment release (AMR) from the March 11, 2011, off the Pacific coast of Tohoku, Japan, M9 earthquake. Earth Planets Space 64:765–769
Jiang C, Wu Z (2013) Intermediate-term medium-range precursory accelerating seismicity prior to the 12 May 2008, Wenchuan earthquake. Pure Appl Geophys 170:209–219. https://doi.org/10.1007/s00024-011-0413-0
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. Mach Learn ECML-98:137–142
Karakaisis GF, Papazachos CB, Scordilis EM (2013) Recent reliable observations and improved tests on synthetic catalogs with spatiotemporal clustering verify precursory decelerating-accelerating seismicity. J Seismol 17:1063–1072. https://doi.org/10.1007/s10950-013-9372-5
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab – an S4 package for kernel methods in R. J Stat Softw 11:1–20
Kazemian J, Hatami MR (2017) Temporal variations of seismic parameters in Tehran region. Pure Appl Geophys 174:3841–3852. https://doi.org/10.1007/s00024-017-1549-3
Kharde VA, Sonawane SS (2016) Sentiment analysis of Twitter data: a survey of techniques. Int J Comput Appl 139:5–15
King GCP (1983) The accommodation of large strains in the upper lithosphere of the earth and other solids by self-similar fault systems: the geometrical origin of b-value. Pure Appl Geophys 121:761–815
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI'95 Proceed 14th Int Joint Conf AI 2:1137–1143
Kuhn T (1970) The structure of scientific revolutions, enlarged. In: International encyclopedia of unified science, 2nd edn. The University of Chicago Press 210 pp
Lagios E, Papadimitriou P, Novali F, Sakkas V, Fumagalli A, Vlachou K, Del Conte S (2012) Combined seismicity pattern analysis, DGPS and PSInSAR studies in the broader area of Cephalonia (Greece). Tectonophysics 524–525:43–58. https://doi.org/10.1016/j.tecto.2011.12.015
Liaw A, Wiener M (2018) Breiman and Cutler's random forests for classification and regression, package 'randomForest', available at https://cran.r-project.org/web/packages/randomForest/ (last accessed August 2018)
Mignan A (2011) Retrospective on the accelerating seismic release (ASR) hypothesis: controversy and new horizons. Tectonophysics 505:1–16. https://doi.org/10.1016/j.tecto.2011.03.010
Mignan A (2012) Seismicity precursors to large earthquakes unified in a stress accumulation framework. Geophys Res Lett 39:L21308. https://doi.org/10.1029/2012GL053946
Mignan A (2014) The debate on the prognostic value of earthquake foreshocks: a meta-analysis. Sci Rep 4:4099. https://doi.org/10.1038/srep04099
Mignan A (2015) Modeling aftershocks as a stretched exponential relaxation. Geophys Res Lett 42:9726–9732. https://doi.org/10.1002/2015GL066232
Mignan A, King GCP, Bowman D (2007) A mathematical formulation of accelerating moment release based on the stress accumulation model. J Geophys Res 112:B07308. https://doi.org/10.1029/2006JB004671
Mouselimis L (2018) Kernel k nearest neighbors, package 'KernelKnn', available at https://cran.r-project.org/web/packages/KernelKnn/ (last accessed August 2018)
Ng AY, Jordan MI (2001) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Adv Neural Inf Proces Syst 14:605–610
Ng S-K, Wong M (1999) Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Inform 10:104–112
Ogata Y (1988) Statistical models for earthquake occurrences and residual analysis for point processes. J Am Stat Assoc 83:9–27
Papadopoulos GA (1988) Long-term accelerating foreshock activity may indicate the occurrence time of a strong shock in the Western Hellenic Arc. Tectonophysics 152:179–192
Papazachos BC, Karakaisis GF, Papazachos CB, Scordilis EM (2007) Evaluation of the results for an intermediate-term prediction of the 8 January 2006 Mw 6.9 Cythera earthquake in the southwestern Aegean. Bull Seismol Soc Am 97:347–352. https://doi.org/10.1785/0120060075
Pearce D, Rantala V (1983) New foundations for metascience. Synthese 56:1–26
Pliakis D, Papakostas T, Vallianatos F (2012) A first principles approach to understand the physics of precursory accelerating seismicity. Ann Geophys 55:165–170. https://doi.org/10.4401/ag-5363
Rokach L (2010) Ensemble-based classifiers. Artif Intell Rev 33:1–39. https://doi.org/10.1007/s10462-009-9124-7
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536
Salton G, McGill M (eds) (1983) Introduction to modern information retrieval. McGraw-Hill
Sammis CG, Sornette D (2002) Positive feedback, memory, and the predictability of earthquakes. PNAS 99:2501–2508. https://doi.org/10.1073/pnas.012580999
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34:1–47
Seif S, Mignan A, Zechar JD, Werner MJ, Wiemer S (2017) Estimating ETAS: the effects of truncation, missing data, and model assumptions. J Geophys Res Solid Earth 122:449–469. https://doi.org/10.1002/2016JB012809
Seif S, Zechar JD, Mignan A, Nandan S, Wiemer S (2018) Foreshocks and their potential deviation from general seismicity. Bull Seismol Soc Am 109:1–18. https://doi.org/10.1785/0120170188
Sornette D (2000) Critical phenomena in natural sciences, chaos, fractal, self-organization and disorder: concepts and tools. Springer 434 pp
Steinwart I, Christmann A (2008) Support vector machines, information science and statistics. Springer 601 pp
Tsytsarau M, Palpanas T (2012) Survey on mining subjective data on the web. Data Min Knowl Disc 24:478–514. https://doi.org/10.1007/s10618-011-0238-6
Welbers K, Van Atteveldt W, Benoit K (2017) Text analysis in R. Commun Methods Meas 11:245–265. https://doi.org/10.1080/19312458.2017.1387238
Acknowledgments
I thank Pablo Nieto and Marco Broccardo for discussions on the topic of text classification, as well as reviewer Riccardo Zaccarelli for his valuable comments.
Data and resources
All the corpus articles are available on journal websites. The corpus meta-data and labelling are provided in the supplementary material to this article.
Cite this article
Mignan, A. A preliminary text classification of the precursory accelerating seismicity corpus: inference on some theoretical trends in earthquake predictability research from 1988 to 2018. J Seismol 23, 771–785 (2019). https://doi.org/10.1007/s10950-019-09833-2
DOI: https://doi.org/10.1007/s10950-019-09833-2