Detect Language and Translate to English in Python
From Text Data with Multiple Languages to a Single Language

Image by PublicDomainPictures from Pixabay

This article was updated on 20th June 2022.

A translated version of this article is available for Spanish speakers.

Happy new year to you! 2021 is here and you did it 💪. 2020 is now behind us, and even though it was a tough and strange year for many people around the world, there is still a lot to celebrate. In 2020, I learned that all we need is the love and support of our loved ones, family members, and friends.
This will be my first article for 2021. I will talk about some language challenges a data scientist or machine learning engineer can face while working on an NLP project, and how you can solve them.

Imagine you are a data scientist assigned to an NLP project that analyzes what people post on social media (e.g. Twitter) about COVID-19. One of your first tasks is to find different hashtags for COVID-19 (e.g. #covid19) and then start collecting all tweets related to COVID-19.

When you start to analyze the collected data, you find that it is written in different languages from around the world, such as English, Swahili, Spanish, Chinese, and Hindi. In this case, you have two problems to solve before you can analyze the dataset: the first is to identify the language of each piece of text, and the second is to translate the data into the language of your choice (e.g. all data should be in English).

So how can we solve these two problems?

Image by Tumisu from Pixabay

First Problem: Language Detection

The first problem is to detect the language of a given text. In this case, you can use a simple Python package called langdetect.

langdetect is a simple Python package developed by Michal Danilák that supports detection of 55 different languages out of the box (ISO 639-1 codes):

af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, ...

Install langdetect

To install langdetect, run the following command in your terminal.

pip install langdetect

Basic Example

To detect the language of a text, e.g. “Tanzania ni nchi inayoongoza kwa utalii barani afrika”, first import the detect function from langdetect and then pass the text to it.

Output: “sw”

The function detects that the text provided is in the Swahili language (‘sw’).

You can also find the probabilities for the top candidate languages by using the detect_langs function.
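The two calls described above can be sketched as a small helper. This is a minimal sketch, not the article's original code: the name language_report and the detect_fn / detect_langs_fn parameters are hypothetical, added so the helper can be tried without langdetect installed. When they are left out, the helper falls back to langdetect's detect and detect_langs.

```python
def language_report(text, detect_fn=None, detect_langs_fn=None):
    """Return (best_code, [(code, probability), ...]) for `text`."""
    if detect_fn is None or detect_langs_fn is None:
        # Imported lazily so the helper can also be exercised with stand-ins.
        from langdetect import detect, detect_langs
        detect_fn = detect_fn or detect
        detect_langs_fn = detect_langs_fn or detect_langs
    best = detect_fn(text)
    # detect_langs returns objects that expose .lang and .prob attributes.
    candidates = [(c.lang, c.prob) for c in detect_langs_fn(text)]
    return best, candidates


# Usage with langdetect installed (the sentence is the article's example):
# best, candidates = language_report(
#     "Tanzania ni nchi inayoongoza kwa utalii barani afrika"
# )
# print(best)        # the article reports "sw" for this sentence
# print(candidates)
```

Returning plain (code, probability) tuples keeps the result easy to store in a DataFrame column or to filter by a confidence threshold.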
For the Swahili sentence above, detect_langs gives the output: [sw:0.9999971710531397]

NOTE: The language detection algorithm is non-deterministic. If you run it on a text that is either too short or too ambiguous, you might get different results every time you run it. To enforce consistent results, import DetectorFactory from langdetect and set DetectorFactory.seed = 0 before language detection.

Now you can detect any language in your data by using the langdetect Python package.

Second Problem: Language Translation

The second problem you need to solve is to translate a text from one language into the language of your choice. In this case, you will use another useful Python package called google_trans_new.

google_trans_new is a free and unlimited Python package that implements the Google Translate API and also performs automatic language detection.

Install google_trans_new

To install google_trans_new, run the following command in your terminal.

pip install google_trans_new

Basic Example

To translate a text from one language to another, you have to import the translator class from google_trans_new, create a translator object, and pass it the text together with the target language.

In this example, a Swahili sentence is translated into the English language. Here is the output after translation.

Tanzania is the leading tourism country in Africa

By default, the source language is detected automatically.

Here are all the language names along with their shorthand notation.
{'af': 'afrikaans', 'sq': 'albanian', 'am': 'amharic', 'ar': 'arabic', 'hy': 'armenian', 'az': 'azerbaijani', 'eu': 'basque', 'be': 'belarusian', 'bn': 'bengali', 'bs': 'bosnian', 'bg': 'bulgarian', 'ca': 'catalan', 'ceb': 'cebuano', 'ny': 'chichewa', 'zh-cn': 'chinese (simplified)', 'zh-tw': 'chinese (traditional)', 'co': 'corsican', 'hr': 'croatian', 'cs': 'czech', 'da': 'danish', 'nl': 'dutch', 'en': 'english', 'eo': 'esperanto', 'et': 'estonian', 'tl': 'filipino', 'fi': 'finnish', 'fr': 'french', 'fy': 'frisian', 'gl': 'galician', 'ka': 'georgian', 'de': 'german', 'el': 'greek', 'gu': 'gujarati', 'ht': 'haitian creole', 'ha': 'hausa', 'haw': 'hawaiian', 'iw': 'hebrew', 'hi': 'hindi', 'hmn': 'hmong', 'hu': 'hungarian', 'is': 'icelandic', 'ig': 'igbo', 'id': 'indonesian', 'ga': 'irish', 'it': 'italian', 'ja': 'japanese', 'jw': 'javanese', 'kn': 'kannada', 'kk': 'kazakh', 'km': 'khmer', 'ko': 'korean', 'ku': 'kurdish (kurmanji)', 'ky': 'kyrgyz', 'lo': 'lao', 'la': 'latin', 'lv': 'latvian', 'lt': 'lithuanian', 'lb': 'luxembourgish', 'mk': 'macedonian', 'mg': 'malagasy', 'ms': 'malay', 'ml': 'malayalam', 'mt': 'maltese', 'mi': 'maori', 'mr': 'marathi', 'mn': 'mongolian', 'my': 'myanmar (burmese)', 'ne': 'nepali', 'no': 'norwegian', 'ps': 'pashto', 'fa': 'persian', 'pl': 'polish', 'pt': 'portuguese', 'pa': 'punjabi', 'ro': 'romanian', 'ru': 'russian', 'sm': 'samoan', 'gd': 'scots gaelic', 'sr': 'serbian', 'st': 'sesotho', 'sn': 'shona', 'sd': 'sindhi', 'si': 'sinhala', 'sk': 'slovak', 'sl': 'slovenian', 'so': 'somali', 'es': 'spanish', 'su': 'sundanese', 'sw': 'swahili', 'sv': 'swedish', 'tg': 'tajik', 'ta': 'tamil', 'te': 'telugu', 'th': 'thai', 'tr': 'turkish', 'uk': 'ukrainian', 'ur': 'urdu', 'uz': 'uzbek', 'vi': 'vietnamese', 'cy': 'welsh', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'zu': 'zulu', 'fil': 'Filipino', 'he': 'Hebrew'}

Detect and Translate Python Function

I have created a simple Python function that can both detect and translate the text
into the language of your choice.

The Python function receives a text and a target language as parameters. It detects the language of the text provided. If that language is the same as the target language, it returns the text unchanged; if it is not the same, it translates the text into the target language.

Example: the sentence is translated into the Swahili language. Here is the output:

Natumai kwamba, nitakapojiwekea akiba, nitaweza kusafiri kwenda Mexico

Wrapping Up

In this article, you have learned how to solve two language challenges that arise when you have text data in different languages and want to translate it into a single language of your choice.

Congratulations 👏, you have made it to the end of this article!

You can download the notebook used in this article here: https://github.com/Davisy/Detect-and-Translate-Text-Data

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! I can also be reached on Twitter @Davis_McDavid.

How do you translate text to English in Python?

    from googletrans import Translator
    translator = Translator()
    translated_text = translator.translate('안녕하세요.')
    print(translated_text.text)
    translated_text = translator.translate('안녕하세요.', dest='ja')

How do I translate German to English in Python?

You can also translate text documents via the Google Translate API. All you have to do is open the text file in Python using the open method, read the text, and pass it to the translate() method.
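The file-based steps in the answer above (open the file, read it, pass the text to translate()) might look like the following. This is a sketch under stated assumptions: translate_file and its translate_fn parameter are hypothetical names, and the default path assumes a googletrans release in which Translator.translate() is synchronous and returns an object with a .text attribute.

```python
def translate_file(path, dest="en", translate_fn=None, encoding="utf-8"):
    """Read a text file and return its contents translated into `dest`."""
    if translate_fn is None:
        # Lazy import. Assumption: translate() is synchronous in the
        # installed googletrans release and its result exposes .text.
        from googletrans import Translator
        translator = Translator()

        def translate_fn(text, dest_lang):
            return translator.translate(text, dest=dest_lang).text

    with open(path, encoding=encoding) as handle:
        text = handle.read()

    # Nothing to translate in an empty or whitespace-only file.
    if not text.strip():
        return text
    return translate_fn(text, dest)


# Usage with googletrans installed (the file name is hypothetical):
# print(translate_file("brief.txt", dest="en"))
```

Passing translate_fn explicitly makes it possible to swap in another translation backend, or a stub during testing, without touching the file handling.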
How do you identify and translate a language in an NLP project?

First, import the detect function from langdetect and then pass the text to it. For the example sentence, it detects that the text provided is in the Swahili language ('sw'). You can also find the probabilities for the top candidate languages by using the detect_langs function.
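The detect-then-translate function described in the article could be sketched as below. This is not the author's original code: detect_and_translate, detect_fn, and translate_fn are illustrative names, and the default translation path assumes the google_translator class and its lang_tgt argument as shown in the google_trans_new README.

```python
def detect_and_translate(text, target_lang="en", detect_fn=None, translate_fn=None):
    """Return `text` unchanged if it is already in `target_lang`,
    otherwise return its translation into `target_lang`."""
    if detect_fn is None:
        # Lazy import so the logic can be tried with stand-in functions.
        from langdetect import detect
        detect_fn = detect

    source_lang = detect_fn(text)
    if source_lang == target_lang:
        # Already in the requested language: nothing to translate.
        return text

    if translate_fn is None:
        # Assumption: API as documented in the google_trans_new README.
        from google_trans_new import google_translator
        translator = google_translator()

        def translate_fn(value, lang):
            return translator.translate(value, lang_tgt=lang)

    return translate_fn(text, target_lang)


# Usage with both packages installed (the sentence is the article's example):
# print(detect_and_translate(
#     "Tanzania ni nchi inayoongoza kwa utalii barani afrika",
#     target_lang="en",
# ))
```

Checking the detected language first avoids a network call for text that is already in the target language, which matters when processing a large collection of tweets.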
Which language translator is used by Python?

googletrans is a Python module for translating text. It uses the Google Translate Ajax API to detect languages and translate text.