Exo-Tatar digital world

How to help the neural network with the republic’s official languages?

Photo: Радиф Кашапов

Translator and interpreter Farkhad Fatkullin, who was named Wikimedian of the Year in 2018 for “his public organisational work among representatives of communities speaking Russia’s regional languages” and who continues working to preserve and develop his mother tongue, decided to write an op-ed column in response to Alexander Kraynov’s speech in mid-April. The director for AI technologies development at Yandex noted how few texts in some languages are available on the Internet.

“Not that there aren’t enough texts, there aren’t any”

As a simultaneous interpreter who has dedicated his life to serving participants in sectoral conferences, both during the events and in the preliminary study of the topics to be discussed, I find any invitation to formulate and voice my own opinion on an important issue valuable and useful. Here both my generative intellect and my knowledge- and rule-based intellect have the right to create something “on their own”, in dialogue with my heart and soul.

The occasion was the remarks made by the director for AI technologies development at Yandex at the Data Fusion 2024 forum on 18 April, in the month when the great Tatar poet Gabdulla Tukay was born. He in effect repeated three of the four points I had made earlier, using the example of Tatar and all the other languages of multi-ethnic Tatarstan, when answering questions from local and Moscow journalists, members of the Tatarstan government and of the Commission for the Conservation and Development of the Tatar Language, our highly esteemed state adviser, the Civic Chamber and others, and even at a seminar in the Tatarstan State Council.

“There are a lot of good texts in English on the Internet. It is not just that there are enough texts; there are even too many, and it is not necessary to use all of them. In Russian, we collect everything we can reach… And that is more or less enough to create quality language models, but there is no surplus… As for Uzbek, Tajik or Kazakh, not to mention Buryat or Tatar, it is not that there aren’t enough texts, there aren’t any,” said Alexander Kraynov.

To paraphrase the esteemed expert, it is not yet possible to build neural networks on “exotic” languages such as Kazakh, Uzbek, Tatar and Buryat, which are poor in the range of topics their texts cover. Serving the needs of these languages’ speakers will instead require machine translation on top of an English, Russian, Chinese, Arabic or other generative core.

Let’s look at the wider context. At the Digital Economy forum on 6 February, Russian Minister of Digitalisation Maksut Shadayev announced that access to the State Services through a virtual assistant is expected. Yandex’s Alisa smart speaker already speaks Kazakh. The translate.tatar and speak.tatar services of the Institute of Applied Semiotics are in operation and constantly improving. People across the planet are actively using AI for their business tasks, which raises workforce productivity and creates previously unimaginable added economic value.

A fully fledged world of Tatar culture and values may still be far off, but a Tatar-speaking one is just around the corner.

Pay attention to the footnote below. Screenshot from tatarstan.ru

Let's make texts world heritage

There is no point in grieving over a sovereign digital Tatar world that does not yet exist. The situation will improve with time, especially thanks to speakers of the language editing machine translations into Tatar.

Let us keep in mind that the Yandex representative is indirectly expressing the general scientific and industry consensus that English-language AI models are on the whole stronger, more accurate and more reliable. Only the Chinese models come close; the rest are outsiders, and that includes not only the Russians but also the French with their rather good Mistral. All languages, and the other cultural knowledge accumulated by generations of people, are a common intangible heritage. So we are all in the same boat. If we recall the main law of cybernetics and the GIGO principle (“garbage in, garbage out”), it is important for humans to improve their ability to formulate questions and tasks. Here broad scientific erudition and a fully fledged multilingual and multicultural environment, for humans and all their artificial servants alike, are more important and valuable than strength, power and speed.

Tatar history and culture teach us to move only forward, to accumulate knowledge and rethink experience, and to create an environment that is comfortable for cooperation as equals and that enriches every member of the community. In the era of cooperation between humans and machines, every speaker of the language who improves a Tatar text, whether it was written by a person or generated by a machine, makes a priceless contribution to the life of Tatar culture and turns that text into world heritage.

Here I would recommend that all Tatar speakers, and creators working in other languages who care about the future of the language, publish their content under free licences, at least CC-BY, as on kremlin.ru, tatarstan.ru, tatar-congress.org, kzn.ru and so on. This can be indicated even when uploading something to YouTube. Better still, use CC0 or its analogues, as on wikidata.org, osm.org, flickr.com and others.

Otherwise, developers of future generations of neural networks and machine translation systems, both in Russia and abroad, may start avoiding such content. It became known in April that some media outlets had already begun to prohibit Yandex from using their content for AI.

Farkhad Fatkullin
Reference

The author's opinion does not necessarily coincide with the position of Realnoe Vremya's editorial board.
