Natural language processing systems for data extraction and mapping on the basis of unstructured text blocks

DOI: 10.35595/2414-9179-2020-3-26-53-61

About the Authors

Pavel M. Kikin

Peter the Great St. Petersburg Polytechnic University (SPbPU),
Polytechnicheskaya str., 29, 195251, St. Petersburg, Russia,
E-mail: it-technologies@yandex.ru

Alexey A. Kolesnikov

Siberian State University of Geosystems and Technologies,
Plakhotnogo str., 10, 630108, Novosibirsk, Russia,
E-mail: alexeykw@mail.ru

Alexey M. Portnov

Moscow State University of Geodesy and Cartography,
Gorokhovsky lane, 4, 105064, Moscow, Russia,
E-mail: portnov@miigaik.ru

Denis V. Grischenko

Siberian State University of Geosystems and Technologies,
Plakhotnogo str., 10, 630108, Novosibirsk, Russia,
E-mail: mr_divis@mail.ru

Abstract

The state of an ecological system, like its general characteristics, is almost always described by indicators that vary in space and time, which significantly complicates the construction of mathematical models for predicting that state. One way to simplify and automate the construction of such models is to use machine learning methods. The article compares traditional and neural-network-based machine learning algorithms and methods for forecasting spatio-temporal series that represent ecosystem data. The following algorithms and methods were analyzed and compared: logistic regression, random forest, gradient boosting on decision trees, SARIMAX, and long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks. The study used data sets that have both spatial and temporal components: mosquito counts, the number of dengue infections, the physical condition of tropical grove trees, and the water level in a river. The article discusses the preliminary data processing steps required by each algorithm. Kolmogorov complexity was also calculated for these data sets as one of the parameters that can help formalize the choice of the most suitable algorithm when constructing mathematical models of spatio-temporal data. Based on the results of the analysis, recommendations are given on the application of particular methods and specific technical solutions, depending on the characteristics of the data set that describes a particular ecosystem.
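The abstract does not say how Kolmogorov complexity was estimated. True Kolmogorov complexity is uncomputable, so a common stand-in is the compressed size of a symbolic description of the series. The sketch below illustrates that idea only: the three-symbol encoding, the use of zlib, the function names, and the synthetic data are all assumptions made for illustration, not the authors' procedure.

```python
import math
import random
import zlib


def symbolize(series):
    """Encode a numeric series as a string over {'u', 'd', 's'} (up, down, same)."""
    symbols = []
    for prev, curr in zip(series, series[1:]):
        if curr > prev:
            symbols.append("u")
        elif curr < prev:
            symbols.append("d")
        else:
            symbols.append("s")
    return "".join(symbols)


def compression_complexity(series):
    """Return the compressed size in bytes of the symbolic description.

    This is only a computable upper-bound proxy; it is not the
    Kolmogorov complexity itself, which cannot be computed.
    """
    text = symbolize(series).encode("ascii")
    return len(zlib.compress(text, 9))


# Hypothetical data: a strictly periodic signal and a pseudo-random one.
rng = random.Random(0)
periodic = [math.sin(2 * math.pi * t / 12) for t in range(600)]
noisy = [rng.random() for _ in range(600)]

periodic_score = compression_complexity(periodic)
noisy_score = compression_complexity(noisy)
```

Under these assumptions the periodic series compresses to far fewer bytes than the noisy one. A low score would hint that a simple seasonal model may be enough, while a high score would suggest that the series has less exploitable regularity.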

Keywords

ecosystems, spatio-temporal indicators, LSTM, SARIMAX, forecasting

For citation: Kikin P.M., Kolesnikov A.A., Portnov A.M., Grischenko D.V. Natural language processing systems for data extraction and mapping on the basis of unstructured text blocks. InterCarto. InterGIS. GI support of sustainable development of territories: Proceedings of the International conference. Moscow: Moscow University Press, 2020. V. 26. Part 3. P. 53–61. DOI: 10.35595/2414-9179-2020-3-26-53-61 (in Russian)