Natural language processing systems for data extraction and mapping on the basis of unstructured text blocks

DOI: 10.35595/2414-9179-2020-1-26-375-384

About the Authors

Alexey A. Kolesnikov

Siberian State University of Geosystems and Technologies (SSUGT),
Plakhotny str., 10, 630108, Novosibirsk, Russia;
E-mail: alexeykw@mail.ru

Pavel M. Kikin

Peter the Great St. Petersburg Polytechnic University (SPbPU),
Polytechnicheskaya str., 29, 195251, St. Petersburg, Russia;
E-mail: it-technologies@yandex.ru

Giovanni Niko

Institute for Applied Mathematics “Mauro Picone” (IAC), National Research Council of Italy (CNR),
Via Amendola, 122/O, 75100, Bari, Italy;
E-mail: g.nico@ba.iac.cnr.it

Elena V. Komissarova

Siberian State University of Geosystems and Technologies (SSUGT),
Plakhotny str., 10, 630108, Novosibirsk, Russia;
E-mail: komissarova_e@mail.ru

Abstract

Modern natural language processing (NLP) technologies make it possible to work with texts without specialist training in linguistics. Because linguistic models can be developed and run on widely used data processing platforms, they can also be integrated into popular geographic information systems, which significantly extends the functionality and improves the accuracy of standard geocoding functions. The article compares the most popular methods, and the software built on them, using the task of extracting geographical names from plain text as an example. This task is an extended form of geocoding: the result still includes the coordinates of the point features of interest, but the addresses or geographical names do not have to be extracted from the text separately beforehand. In computational linguistics, this problem is solved by named entity recognition (NER) methods. Among the most modern approaches, the authors selected rule-based algorithms, maximum entropy models and convolutional neural networks for the final implementation. The selected algorithms and methods were evaluated not only by the accuracy with which they find geographical objects in text, but also by how easily their basic rules or mathematical models can be refined using a custom text corpus. Reports on technological violations, accidents and incidents at facilities of the heat and power complex of the Ministry of Energy of the Russian Federation served as the source data for testing these methods and software solutions. The article also presents a study of a method for improving named entity recognition quality by additionally training a neural network model on a specialized text corpus.
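The idea of treating place-name extraction as an extended form of geocoding can be illustrated with a minimal rule-based sketch. It is not the authors' implementation and uses neither SpaCy nor DeepPavlov: the gazetteer `GAZETTEER`, its approximate coordinates, the helper names `build_pattern` and `extract_places`, and the sample report text are all hypothetical, chosen only to show how names found in free text can be returned together with point coordinates in one step.

```python
import re

# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
# The values are rounded and serve only as illustration.
GAZETTEER = {
    "Novosibirsk": (55.03, 82.92),
    "St. Petersburg": (59.94, 30.31),
    "Bari": (41.12, 16.87),
}


def build_pattern(names):
    """Compile one regular expression that matches any gazetteer name."""
    # Longest names first, so multi-word names win over shorter overlaps.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Lookarounds stop a name from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")


PLACE_PATTERN = build_pattern(GAZETTEER)


def extract_places(text):
    """Return (name, character offset, (lat, lon)) for each name found."""
    return [
        (match.group(0), match.start(), GAZETTEER[match.group(0)])
        for match in PLACE_PATTERN.finditer(text)
    ]


# Made-up sentence in the style of an incident report.
report = "A heating main failed in Novosibirsk; crews from St. Petersburg assisted."
places = extract_places(report)
for name, offset, (lat, lon) in places:
    print(f"{name} at char {offset}: lat={lat}, lon={lon}")
```

A fixed gazetteer only finds names it already contains and cannot handle inflected forms, which is one reason statistical and neural NER models, refined on a domain corpus, are compared against rule-based approaches.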

Keywords

geographical name, named entity recognition, SpaCy, DeepPavlov, natural language processing

For citation: Kolesnikov A.A., Kikin P.M., Niko G., Komissarova E.V. Natural language processing systems for data extraction and mapping on the basis of unstructured text blocks. InterCarto. InterGIS. GI support of sustainable development of territories: Proceedings of the International conference. Moscow: Moscow University Press, 2020. V. 26. Part 1. P. 375–384. DOI: 10.35595/2414-9179-2020-1-26-375-384 (in Russian)