polyglot: A Pipeline-Style Multilingual NLP Toolkit


Yue Yongpeng (岳永鹏), Knownsec (知道创宇) IA-Lab

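At present, open-source toolkits for English NLP include NLTK, spaCy, Stanford CoreNLP, GATE, and OpenNLP, while toolkits for Chinese include Jieba, ICTCLAS, THULAC, and HIT LTP. Most of these tools, however, support only specific languages. This article introduces polyglot, a powerful multilingual Python toolkit that supports pipeline-style processing. The project was first open-sourced on GitHub by AboSamoor on March 16, 2015, and had collected 1,021 stars at the time of writing.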

Features

  • Language Detection (196 languages)
  • Tokenization: sentence and word segmentation (165 languages)
  • Named Entity Recognition (40 languages)
  • Part of Speech Tagging (16 languages)
  • Sentiment Analysis (136 languages)
  • Word Embeddings (137 languages)
  • Transliteration (69 languages)
  • Pipelines

Installation

Install or upgrade from PyPI:

$ pip install polyglot

polyglot depends on numpy and libicu-dev. On Ubuntu/Debian Linux distributions you can install them with:

$ sudo apt-get install python-numpy libicu-dev

After installation succeeds, check the version from a Python prompt:

>>> import polyglot
>>> polyglot.__version__
'16.07.04'
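If the import fails, it helps to see which dependency is missing. The following is a minimal sketch (Python 3) that uses only the standard library; `check_modules` is a hypothetical helper, and the module names passed to it are assumptions about the import names of the dependencies (`icu` for PyICU, `pycld2` for the cld2 bindings).

```python
import importlib.util


def check_modules(names):
    """Return a dict mapping each top-level module name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Assumed import names: "icu" (PyICU) and "pycld2" (cld2 bindings).
for name, ok in check_modules(["numpy", "polyglot", "icu", "pycld2"]).items():
    print("{:10s} {}".format(name, "found" if ok else "MISSING"))
```

Any module reported as MISSING needs to be installed before the examples in this article will run.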

Data

The examples that follow use Chinese, English, and mixed Chinese-English text as test data.

text_en = u"Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."
text_cn = u" 日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。"
text_mixed = text_cn + text_en

Language Detection

polyglot's language detection relies on pycld2 and cld2; cld2 is a multilingual language detection library developed by Google.

Example

Import dependencies

from polyglot.detect import Detector

Detect the language

>>> Detector(text_cn).language
name: Chinese     code: zh       confidence:  99.0 read bytes:  1996
>>> Detector(text_en).language
name: English     code: en       confidence:  99.0 read bytes:  1144
>>> Detector(text_mixed).language
name: Chinese     code: zh       confidence:  50.0 read bytes:  1996

For the mixed Chinese-English text_mixed, the detected language is Chinese, but the confidence is only 50. To list every language detected in the text:

>>> for language in Detector(text_mixed).languages:
...     print(language)
name: Chinese     code: zh       confidence:  50.0 read bytes:  1996
name: English     code: en       confidence:  49.0 read bytes:  1144
name: un          code: un       confidence:   0.0 read bytes:     0
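To build some intuition for why a mixed text yields a split result, the sketch below measures what share of the letters in a string belongs to each script. This is only an illustration and not how cld2 works internally; `script_shares` is a hypothetical helper, and taking the first word of a character's Unicode name as its script is a rough assumption.

```python
import unicodedata
from collections import Counter


def script_shares(text):
    """Return the share of letters per coarse script label, e.g. {'CJK': 0.4, 'LATIN': 0.6}."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        # The first word of the Unicode character name is a rough script label.
        label = unicodedata.name(ch, "").split(" ")[0]
        counts[label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


print(script_shares(u"日本 Japan"))
```

A text whose letters are divided between scripts cannot be assigned to a single language with high confidence, which matches the roughly even split reported above.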

cld2 currently supports detecting the following languages:

>>> Detector.supported_languages()
  1. Abkhazian                  2. Afar                       3. Afrikaans
  4. Akan                       5. Albanian                   6. Amharic
  7. Arabic                     8. Armenian                   9. Assamese
 10. Aymara                    11. Azerbaijani               12. Bashkir
 13. Basque                    14. Belarusian                15. Bengali
 16. Bihari                    17. Bislama                   18. Bosnian
 19. Breton                    20. Bulgarian                 21. Burmese
 22. Catalan                   23. Cebuano                   24. Cherokee
 25. Nyanja                    26. Corsican                  27. Croatian
 28. Croatian                  29. Czech                     30. Chinese
 31. Chinese                   32. Chinese                   33. Chinese
 34. Chineset                  35. Chineset                  36. Chineset
 37. Chineset                  38. Chineset                  39. Chineset
 40. Danish                    41. Dhivehi                   42. Dutch
 43. Dzongkha                  44. English                   45. Esperanto
 46. Estonian                  47. Ewe                       48. Faroese
 49. Fijian                    50. Finnish                   51. French
 52. Frisian                   53. Ga                        54. Galician
 55. Ganda                     56. Georgian                  57. German
 58. Greek                     59. Greenlandic               60. Guarani
 61. Gujarati                  62. Haitian_creole            63. Hausa
 64. Hawaiian                  65. Hebrew                    66. Hebrew
 67. Hindi                     68. Hmong                     69. Hungarian
 70. Icelandic                 71. Igbo                      72. Indonesian
 73. Interlingua               74. Interlingue               75. Inuktitut
 76. Inupiak                   77. Irish                     78. Italian
 79. Ignore                    80. Javanese                  81. Javanese
 82. Japanese                  83. Kannada                   84. Kashmiri
 85. Kazakh                    86. Khasi                     87. Khmer
 88. Kinyarwanda               89. Krio                      90. Kurdish
 91. Kyrgyz                    92. Korean                    93. Laothian
 94. Latin                     95. Latvian                   96. Limbu
 97. Limbu                     98. Limbu                     99. Lingala
100. Lithuanian               101. Lozi                     102. Luba_lulua
103. Luo_kenya_and_tanzania   104. Luxembourgish            105. Macedonian
106. Malagasy                 107. Malay                    108. Malayalam
109. Maltese                  110. Manx                     111. Maori
112. Marathi                  113. Mauritian_creole         114. Romanian
115. Mongolian                116. Montenegrin              117. Montenegrin
118. Montenegrin              119. Montenegrin              120. Nauru
121. Ndebele                  122. Nepali                   123. Newari
124. Norwegian                125. Norwegian                126. Norwegian_n
127. Nyanja                   128. Occitan                  129. Oriya
130. Oromo                    131. Ossetian                 132. Pampanga
133. Pashto                   134. Pedi                     135. Persian
136. Polish                   137. Portuguese               138. Punjabi
139. Quechua                  140. Rajasthani               141. Rhaeto_romance
142. Romanian                 143. Rundi                    144. Russian
145. Samoan                   146. Sango                    147. Sanskrit
148. Scots                    149. Scots_gaelic             150. Serbian
151. Serbian                  152. Seselwa                  153. Seselwa
154. Sesotho                  155. Shona                    156. Sindhi
157. Sinhalese                158. Siswant                  159. Slovak
160. Slovenian                161. Somali                   162. Spanish
163. Sundanese                164. Swahili                  165. Swedish
166. Syriac                   167. Tagalog                  168. Tajik
169. Tamil                    170. Tatar                    171. Telugu
172. Thai                     173. Tibetan                  174. Tigrinya
175. Tonga                    176. Tsonga                   177. Tswana
178. Tumbuka                  179. Turkish                  180. Turkmen
181. Twi                      182. Uighur                   183. Ukrainian
184. Urdu                     185. Uzbek                    186. Venda
187. Vietnamese               188. Volapuk                  189. Waray_philippines
190. Welsh                    191. Wolof                    192. Xhosa
193. Yiddish                  194. Yoruba                   195. Zhuang
196. Zulu

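Tokenization

NLP tasks can be divided into character-level, word-level, sentence-level, paragraph-level, and document-level tasks. Tokenization finds the boundaries between characters, words, sentences, and paragraphs. Paragraphs can be split on line-break characters such as \n and \r, and splitting into characters is also straightforward; sentence and word segmentation are somewhat more complex.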
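The difference in difficulty can be seen in a small sketch. The two functions below are hypothetical helpers written with the standard library only (Python 3.7+), not polyglot's implementation: paragraph splitting on line breaks is reliable, while a naive punctuation rule for sentences breaks easily.

```python
import re


def split_paragraphs(text):
    """Split on runs of line-break characters and drop empty pieces."""
    return [p.strip() for p in re.split(r"(?:\r\n|\r|\n)+", text) if p.strip()]


def split_sentences(paragraph):
    """Naively split after '.', '!', '?' and the full-width '。'."""
    parts = re.split(r"(?<=[.!?。])\s*", paragraph)
    return [s for s in parts if s]


print(split_paragraphs("first paragraph\n\nsecond paragraph"))
print(split_sentences("Hello world. How are you?"))
# The naive rule wrongly splits inside a decimal number:
print(split_sentences("It costs 3.5 dollars."))
```

The last call returns two fragments, "It costs 3." and "5 dollars.", which is why sentence and word segmentation usually rely on trained models or much richer rules.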

Example

Import dependencies

from polyglot.text import Text

Sentence segmentation

>>> Text(text_cn).sentences
[Sentence("日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。"), Sentence("目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。")]
>>> Text(text_en).sentences
[Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.")]
>>> Text(text_mixed).sentences
[Sentence("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction."), Sentence("Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years."), Sentence("日本最后一家寻呼机服务营业商宣布,将于2019年9月结束服务,标志着日本寻呼业长达50年的历史正式落幕。"), Sentence("目前大约还有1500名用户使用东京电信通信公司提供的寻呼服务,该公司在20年前就已停止生产寻呼机。")]

Word tokenization

>>> Text(text_cn).words
日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 20199 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 。 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 。
>>> Text(text_en).words
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years .
>>> Text(text_mixed).words
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years . 日本 最后 一家 寻 呼 机 服务 营业 商 宣布 , 将 于 20199 月 结束 服务 , 标志 着 日本 寻 呼 业 长达 50 年 的 历史 正式 落幕 。 目前 大约 还有 1500 名 用户 使用 东京 电信 通信 公司 提供 的 寻 呼 服务 , 该 公司 在 20 年前 就 已 停止 生产 寻 呼 机 。

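Named Entity Recognition

Named entity recognition identifies entities in text that carry a specific meaning. They commonly fall into three categories:

  • Entities: names of people, places, organizations, products, trademarks, and so on
  • Time: dates and times
  • Numbers: birthdays, phone numbers, QQ numbers, and so on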

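Approaches to entity recognition also fall into three groups:

  • Rule-based: linguistic grammar-based techniques
    These techniques rely mainly on rules; in engineering practice this means writing many regular expressions (RegEx). This approach can recognize some time and numeric entities.
  • Statistical learning: statistical models
    The main statistical methods are HMM and CRF models, which are relatively mature.
  • Deep learning: deep learning models
    Deep learning is currently the most popular approach, especially RNN-family models, which can absorb more of the semantic information in text and currently give the best results.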
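As a concrete illustration of the rule-based approach, here is a minimal sketch that recognizes dates and phone numbers with regular expressions. The patterns and the `find_entities` helper are hypothetical and cover only a couple of formats; a real rule set would need many more cases.

```python
import re

# Illustrative patterns only (assumptions, not a complete rule set).
PATTERNS = {
    "DATE": re.compile(r"\d{4}年\d{1,2}月(?:\d{1,2}日)?|\d{4}-\d{2}-\d{2}"),
    # An 11-digit number starting with 1, not embedded in a longer digit run.
    "PHONE": re.compile(r"(?<!\d)1\d{10}(?!\d)"),
}


def find_entities(text):
    """Return (label, matched_text, start_offset) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


print(find_entities(u"将于2019年9月结束服务,电话13800138000"))
```

Rules like these work for rigidly formatted entities, but names of people, places, and organizations have no such fixed shape, which is where statistical and deep learning models come in.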

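polyglot's entity recognition models are trained on Wikipedia. The trained models are not included in the initial installation and must be downloaded separately. polyglot recognizes entity-class names (people, places, organizations) in 40 languages.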

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("ner2", 3))
  1. Polish                     2. Turkish                    3. Russian
  4. Indonesian                 5. Czech                      6. Arabic
  7. Korean                     8. Catalan; Valencian         9. Italian
 10. Thai                      11. Romanian, Moldavian, ...  12. Tagalog
 13. Danish                    14. Finnish                   15. German
 16. Persian                   17. Dutch                     18. Chinese
 19. French                    20. Portuguese                21. Slovak
 22. Hebrew (modern)           23. Malay                     24. Slovene
 25. Bulgarian                 26. Hindi                     27. Japanese
 28. Hungarian                 29. Croatian                  30. Ukrainian
 31. Serbian                   32. Lithuanian                33. Norwegian
 34. Latvian                   35. Swedish                   36. English
 37. Greek, Modern             38. Spanish; Castilian        39. Vietnamese
 40. Estonian

Model download

Download the entity recognition models for English and Chinese. Run the polyglot command from a shell (or prefix it with ! inside an IPython/Jupyter session):

$ polyglot download ner2.en ner2.zh embeddings2.zh embeddings2.en
[polyglot_data] Downloading package ner2.en to
[polyglot_data] Downloading package ner2.zh to
[polyglot_data] Downloading package embeddings2.zh to
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import dependencies

>>> from polyglot.text import Text

Entity recognition

>>> Text(text_cn).entities
[I-ORG([u'东京'])]
>>> Text(text_en).entities
[I-LOC([u'Tokyo'])]
>>> Text(text_mixed).entities
[I-ORG([u'东京'])]
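Each result above pairs a tag such as I-ORG or I-LOC with a chunk of one or more words. The sketch below shows one way per-token tags could be merged into such chunks; `group_entities` is a hypothetical helper, the token tags in the usage line are made up for illustration, and this is not polyglot's own code.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share the same non-'O' tag into (tag, [words]) chunks."""
    chunks = []
    current_tag, current_words = None, []
    for word, tag in tagged_tokens:
        if tag == current_tag and tag != "O":
            current_words.append(word)  # extend the chunk in progress
            continue
        if current_tag not in (None, "O"):
            chunks.append((current_tag, current_words))  # close the previous chunk
        current_tag, current_words = tag, [word]
    if current_tag not in (None, "O"):
        chunks.append((current_tag, current_words))
    return chunks


# Hypothetical per-token tags; "O" marks tokens outside any entity.
tokens = [("subscribed", "O"), ("to", "O"), ("Tokyo", "I-LOC"),
          ("Telemessage", "I-LOC"), (",", "O"), ("Japan", "I-LOC")]
print(group_entities(tokens))
```

With these made-up tags, "Tokyo" and "Telemessage" end up in one I-LOC chunk and "Japan" in another, because an untagged token separates them.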

Part of Speech Tagging

Part-of-speech tagging assigns a part-of-speech label to each token. The commonly used tags are:

  • ADJ: adjective
  • ADP: adposition
  • ADV: adverb
  • AUX: auxiliary verb
  • CONJ: coordinating conjunction
  • DET: determiner
  • INTJ: interjection
  • NOUN: noun
  • NUM: numeral
  • PART: particle
  • PRON: pronoun
  • PROPN: proper noun
  • PUNCT: punctuation
  • SCONJ: subordinating conjunction
  • SYM: symbol
  • VERB: verb
  • X: other

polyglot's part-of-speech models are trained on the CoNLL datasets. They support 16 languages; Chinese is not among them.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("pos2"))
  1. German                     2. Italian                    3. Danish
  4. Czech                      5. Slovene                    6. French
  7. English                    8. Swedish                    9. Bulgarian
 10. Spanish; Castilian        11. Indonesian                12. Portuguese
 13. Finnish                   14. Irish                     15. Hungarian
 16. Dutch

Model download

Download the English part-of-speech model:

$ polyglot download pos2.en
[polyglot_data] Downloading package pos2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import dependencies

from polyglot.text import Text

POS tagging

>>> Text(text_en).pos_tags
[(u"Japan's", u'NUM'), (u'last', u'ADJ'), (u'pager', u'NOUN'), (u'provider', u'NOUN'), (u'has', u'AUX'),
 (u'announced', u'VERB'), (u'it', u'PRON'), (u'will', u'AUX'), (u'end', u'VERB'), (u'its', u'PRON'),
 (u'service', u'NOUN'), (u'in', u'ADP'), (u'September', u'PROPN'), (u'2019', u'NUM'), (u'-', u'PUNCT'),
 (u'bringing', u'VERB'), (u'a', u'DET'), (u'national', u'ADJ'), (u'end', u'NOUN'), (u'to', u'ADP'),
 (u'telecommunication', u'VERB'), (u'beepers', u'NUM'), (u',', u'PUNCT'), (u'50', u'NUM'), (u'years', u'NOUN'),
 (u'after', u'ADP'), (u'their', u'PRON'), (u'introduction.Around', u'NUM'), (u'1,500', u'NUM'), (u'users', u'NOUN'),
 (u'remain', u'VERB'), (u'subscribed', u'VERB'), (u'to', u'ADP'), (u'Tokyo', u'PROPN'), (u'Telemessage', u'PROPN'),
 (u',', u'PUNCT'), (u'which', u'DET'), (u'has', u'AUX'), (u'not', u'PART'), (u'made', u'VERB'),
 (u'the', u'DET'), (u'devices', u'NOUN'), (u'in', u'ADP'), (u'20', u'NUM'), (u'years', u'NOUN'), (u'.', u'PUNCT')]
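Because the result is a plain list of (word, tag) pairs, it is easy to post-process. The sketch below, using two hypothetical helpers and a slice of the output above as sample data, counts how often each tag occurs and pulls out the words carrying a given tag.

```python
from collections import Counter


def tag_counts(pos_tags):
    """Count how many tokens carry each part-of-speech tag."""
    return Counter(tag for _, tag in pos_tags)


def words_with_tag(pos_tags, wanted):
    """Return the words whose tag equals `wanted`, in their original order."""
    return [word for word, tag in pos_tags if tag == wanted]


# A slice of the tagger output shown above.
sample = [("Tokyo", "PROPN"), ("Telemessage", "PROPN"), (",", "PUNCT"),
          ("which", "DET"), ("has", "AUX"), ("not", "PART"),
          ("made", "VERB"), ("the", "DET"), ("devices", "NOUN")]

print(tag_counts(sample).most_common(2))
print(words_with_tag(sample, "PROPN"))
```

Filtering on PROPN is a quick way to surface candidate names such as "Tokyo" and "Telemessage" without running a separate entity model.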

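Sentiment Analysis

polyglot's sentiment analysis works at the word level: each token is labeled 1 for positive, 0 for neutral, and -1 for negative. It currently supports 136 languages.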

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("sentiment2"))
  1. Turkmen                    2. Thai                       3. Latvian
  4. Zazaki                     5. Tagalog                    6. Tamil
  7. Tajik                      8. Telugu                     9. Luxembourgish, Letzeb...
 10. Alemannic                 11. Latin                     12. Turkish
 13. Limburgish, Limburgan...  14. Egyptian Arabic           15. Tatar
 16. Lithuanian                17. Spanish; Castilian        18. Basque
 19. Estonian                  20. Asturian                  21. Greek, Modern
 22. Esperanto                 23. English                   24. Ukrainian
 25. Marathi (Marāṭhī)         26. Maltese                   27. Burmese
 28. Kapampangan               29. Uighur, Uyghur            30. Uzbek
 31. Malagasy                  32. Yiddish                   33. Macedonian
 34. Urdu                      35. Malayalam                 36. Mongolian
 37. Breton                    38. Bosnian                   39. Bengali
 40. Tibetan Standard, Tib...  41. Belarusian                42. Bulgarian
 43. Bashkir                   44. Vietnamese                45. Volapük
 46. Gan Chinese               47. Manx                      48. Gujarati
 49. Yoruba                    50. Occitan                   51. Scottish Gaelic; Gaelic
 52. Irish                     53. Galician                  54. Ossetian, Ossetic
 55. Oriya                     56. Walloon                   57. Swedish
 58. Silesian                  59. Lombard language          60. Divehi; Dhivehi; Mald...
 61. Danish                    62. German                    63. Armenian
 64. Haitian; Haitian Creole   65. Hungarian                 66. Croatian
 67. Bishnupriya Manipuri      68. Hindi                     69. Hebrew (modern)
 70. Portuguese                71. Afrikaans                 72. Pashto, Pushto
 73. Amharic                   74. Aragonese                 75. Bavarian
 76. Assamese                  77. Panjabi, Punjabi          78. Polish
 79. Azerbaijani               80. Italian                   81. Arabic
 82. Icelandic                 83. Ido                       84. Scots
 85. Sicilian                  86. Indonesian                87. Chinese Word
 88. Interlingua               89. Waray-Waray               90. Piedmontese language
 91. Quechua                   92. French                    93. Dutch
 94. Norwegian Nynorsk         95. Norwegian                 96. Western Frisian
 97. Upper Sorbian             98. Nepali                    99. Persian
100. Ilokano                  101. Finnish                  102. Faroese
103. Romansh                  104. Javanese                 105. Romanian, Moldavian, ...
106. Malay                    107. Japanese                 108. Russian
109. Catalan; Valencian       110. Fiji Hindi               111. Chinese
112. Cebuano                  113. Czech                    114. Chuvash
115. Welsh                    116. West Flemish             117. Kirghiz, Kyrgyz
118. Kurdish                  119. Kazakh                   120. Korean
121. Kannada                  122. Khmer                    123. Georgian
124. Sakha                    125. Serbian                  126. Albanian
127. Swahili                  128. Chechen                  129. Sundanese
130. Sanskrit (Saṁskṛta)      131. Venetian                 132. Northern Sami
133. Slovak                   134. Sinhala, Sinhalese       135. Bosnian-Croatian-Serbian
136. Slovene

Model download

Download the English and Chinese sentiment models:

$ polyglot download sentiment2.en sentiment2.zh
[polyglot_data] Downloading package sentiment2.en to
[polyglot_data] Downloading package sentiment2.zh to
[polyglot_data]  /home/user/polyglot_data...

Example

Import dependencies

from polyglot.text import Text

Sentiment analysis

>>> text = Text("The movie is very good and the actors are prefect, but the cinema environment is very poor.")
>>> print(text.words,text.polarity)
(WordList([u'The', u'movie', u'is', u'very', u'good', u'and', u'the', u'actors', u'are', u'prefect', u',', u'but', u'the', u'cinema', u'environment', u'is', u'very', u'poor', u'.']), 0.0)
>>> print([(w,w.polarity) for w in text.words])
[(u'The', 0), (u'movie', 0), (u'is', 0), (u'very', 0), (u'good', 1), (u'and', 0), (u'the', 0), (u'actors', 0), (u'are', 0), (u'prefect', 0), (u',', 0), (u'but', 0), (u'the', 0), (u'cinema', 0), (u'environment', 0), (u'is', 0), (u'very', 0), (u'poor', -1), (u'.', 0)]
>>> text = Text("这部电影故事非常好,演员也非常棒,但是电影院环境非常差。")
>>> print(text.words,text.polarity)
(WordList([这 部 电影 故事 非常 好 , 演员 也 非常 棒 , 但是 电影 院 环境 非常 差 。]), 0.0)
>>> print([(w,w.polarity) for w in text.words])
[(u'\u8fd9', 0), (u'\u90e8', 0), (u'\u7535\u5f71', 0), (u'\u6545\u4e8b', 0), (u'\u975e\u5e38', 0), (u'\u597d', 1), (u'\uff0c', 0), (u'\u6f14\u5458', 0), (u'\u4e5f', 0), (u'\u975e\u5e38', 0), (u'\u68d2', 0), (u'\uff0c', 0), (u'\u4f46\u662f', 0), (u'\u7535\u5f71', 0), (u'\u9662', 0), (u'\u73af\u5883', 0), (u'\u975e\u5e38', 0), (u'\u5dee', -1), (u'\u3002', 0)]
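Both sentences above contain one positive word and one negative word and receive an overall polarity of 0.0. One aggregation rule consistent with that result is to average the non-neutral word scores. The sketch below implements that rule as an assumption for illustration; it is not confirmed here to be polyglot's own computation, and `sentence_polarity` is a hypothetical helper.

```python
def sentence_polarity(word_polarities):
    """Average the non-zero word polarities; return 0.0 when every word is neutral."""
    scores = [p for p in word_polarities if p != 0]
    return sum(scores) / len(scores) if scores else 0.0


# Word-level scores for the English example: 'good' is 1, 'poor' is -1, the rest 0.
english_scores = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0]
print(sentence_polarity(english_scores))
```

Under this rule a single positive and a single negative word cancel out, so a sentence with clearly mixed sentiment looks the same as a fully neutral one. A word-level lexicon cannot distinguish the two cases.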

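Word Embeddings

In NLP, word embedding is a collective term for a set of language-modeling and feature-learning techniques that map words or phrases from a vocabulary to vectors of real numbers. There are two common families of word representations: discrete and distributed. Discrete methods include one-hot and N-gram; their drawbacks are that they capture the relatedness between words poorly and suffer from the curse of dimensionality. The idea behind distributed representations is to represent a word by the words that appear near it, which is the familiar word2vec. word2vec includes the Skip-Gram model, which predicts the n words before and after the current word, and the CBOW model, which predicts a word from the n words of its context. Pre-trained English vectors include GloVe, which provides 50-, 100-, 200-, and 300-dimensional vectors; Tencent AI Lab also recently open-sourced 200-dimensional Chinese word vectors. polyglot can read word vectors from the following sources: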

  • Gensim word2vec objects: (from_gensim method)
  • Word2vec binary/text models: (from_word2vec method)
  • GloVe models (from_glove method)
  • polyglot pickle files: (load method)

Of these, the polyglot pickle files provide word vectors for 136 languages.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("embeddings2"))
  1. Scots                      2. Sicilian                   3. Welsh
  4. Chuvash                    5. Czech                      6. Egyptian Arabic
  7. Kapampangan                8. Chechen                    9. Catalan; Valencian
 10. Slovene                   11. Sinhala, Sinhalese        12. Bosnian-Croatian-Serbian
 13. Slovak                    14. Japanese                  15. Northern Sami
 16. Sanskrit (Saṁskṛta)       17. Croatian                  18. Javanese
 19. Sundanese                 20. Swahili                   21. Swedish
 22. Albanian                  23. Serbian                   24. Marathi (Marāṭhī)
 25. Breton                    26. Bosnian                   27. Bengali
 28. Tibetan Standard, Tib...  29. Bulgarian                 30. Belarusian
 31. West Flemish              32. Bashkir                   33. Malay
 34. Romanian, Moldavian, ...  35. Romansh                   36. Esperanto
 37. Asturian                  38. Greek, Modern             39. Burmese
 40. Maltese                   41. Malagasy                  42. Spanish; Castilian
 43. Russian                   44. Mongolian                 45. Chinese
 46. Estonian                  47. Yoruba                    48. Sakha
 49. Alemannic                 50. Assamese                  51. Lombard language
 52. Yiddish                   53. Silesian                  54. Venetian
 55. Azerbaijani               56. Afrikaans                 57. Aragonese
 58. Amharic                   59. Hebrew (modern)           60. Hindi
 61. Quechua                   62. Haitian; Haitian Creole   63. Hungarian
 64. Bishnupriya Manipuri      65. Armenian                  66. Gan Chinese
 67. Macedonian                68. Georgian                  69. Khmer
 70. Panjabi, Punjabi          71. Korean                    72. Kannada
 73. Kazakh                    74. Kurdish                   75. Basque
 76. Pashto, Pushto            77. Portuguese                78. Gujarati
 79. Manx                      80. Irish                     81. Scottish Gaelic; Gaelic
 82. Upper Sorbian             83. Galician                  84. Arabic
 85. Walloon                   86. Urdu                      87. Norwegian Nynorsk
 88. Norwegian                 89. Dutch                     90. Chinese Character
 91. Nepali                    92. French                    93. Western Frisian
 94. Bavarian                  95. English                   96. Persian
 97. Polish                    98. Finnish                   99. Faroese
100. Italian                  101. Icelandic                102. Volapük
103. Ido                      104. Waray-Waray              105. Indonesian
106. Interlingua              107. Lithuanian               108. Uzbek
109. Latvian                  110. German                   111. Danish
112. Cebuano                  113. Ukrainian                114. Latin
115. Luxembourgish, Letzeb... 116. Divehi; Dhivehi; Mald... 117. Vietnamese
118. Uighur, Uyghur           119. Limburgish, Limburgan... 120. Zazaki
121. Ilokano                  122. Fiji Hindi               123. Malayalam
124. Tatar                    125. Kirghiz, Kyrgyz          126. Ossetian, Ossetic
127. Oriya                    128. Turkish                  129. Tamil
130. Tagalog                  131. Thai                     132. Turkmen
133. Telugu                   134. Occitan                  135. Tajik
136. Piedmontese language
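The Skip-Gram and CBOW descriptions earlier in this section can be made concrete with a small sketch of how training examples are drawn from a token sequence. The two generators are hypothetical helpers for illustration only; they build the (input, target) pairs and do not train any model.

```python
def skipgram_pairs(tokens, n):
    """Yield (center, context) pairs: each word predicts its neighbours within n positions."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
            if j != i:
                yield center, tokens[j]


def cbow_examples(tokens, n):
    """Yield (context_words, target) examples: neighbours within n positions predict the word."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - n):i] + tokens[i + 1:i + n + 1]
        if context:
            yield context, target


words = ["日本", "寻呼", "服务", "结束"]
print(list(skipgram_pairs(words, 1)))
print(list(cbow_examples(words, 1)))
```

The two models see the same windows from opposite directions: Skip-Gram emits one pair per neighbour, while CBOW groups all neighbours into a single example per position.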

Model download

Download the English and Chinese word vectors:

$ polyglot download embeddings2.zh embeddings2.en
[polyglot_data] Downloading package embeddings2.zh to
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import dependencies and load the word vectors

>>> from polyglot.mapping import Embedding
>>> embeddings = Embedding.load('/home/user/polyglot_data/embeddings2/zh/embeddings_pkl.tar.bz2')

Look up a word vector

>>> print(embeddings.get("中国"))
[ 0.60831094  0.37644583 -0.67009342  0.43529209  0.12993187 -0.07703398
 -0.04931475 -0.42763838 -0.42447501 -0.0219319  -0.52271312 -0.57149178
 -0.48139745 -0.31942225  0.12747335  0.34054375  0.27137381  0.1362032
 -0.54999739 -0.39569679  1.01767457  0.12317979 -0.12878017 -0.65476489
  0.18644606  0.2178454   0.18150428  0.18464987  0.29027358  0.21979097
 -0.21173042  0.08130789 -0.77350897  0.66575652 -0.14730017  0.11383133
  0.83101833  0.01702038 -0.71277034  0.29339811  0.3320756   0.25922608
 -0.51986367  0.16533957  0.04327472  0.36460632  0.42984027  0.04811303
 -0.16718218 -0.18613082 -0.52108622 -0.47057685 -0.14663117 -0.30221295
  0.72923231 -0.54835045 -0.48428732  0.65475166 -0.34853089  0.03206051
  0.2574054   0.07614037  0.32844698 -0.0087136 ]
>>> print(len(embeddings.get("中国")))
64

Find similar words

>>> neighbors = embeddings.nearest_neighbors("中国")
>>> print(" ".join(neighbors))
上海 美国 韩国 北京 欧洲 台湾 法国 德国 天津 广州
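A similar-word query amounts to ranking the vocabulary by closeness to the query vector. The sketch below does this with cosine similarity over a tiny table of made-up two-dimensional vectors. Both the vectors and the choice of cosine similarity are assumptions for illustration; polyglot's nearest_neighbors may use a different distance measure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors, top_k=3):
    """Return the top_k other words ranked by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other) for other, vec in vectors.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:top_k]]


# Made-up 2-dimensional vectors; real polyglot vectors have 64 dimensions.
toy = {"中国": [0.9, 0.1], "美国": [0.8, 0.2], "北京": [0.7, 0.6], "苹果": [0.1, 0.9]}
print(nearest("中国", toy, top_k=2))
```

With these toy vectors the two closest words to "中国" are "美国" and "北京", while "苹果" points in a very different direction and ranks last.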

Transliteration

polyglot's transliteration uses an unsupervised method (described in the paper "False-Friend Detection and Entity Matching via Unsupervised Transliteration") and supports 69 languages.

>>> from polyglot.downloader import downloader
>>> print(downloader.supported_languages_table("transliteration2"))
  1. Haitian; Haitian Creole    2. Tamil                      3. Vietnamese
  4. Telugu                     5. Croatian                   6. Hungarian
  7. Thai                       8. Kannada                    9. Tagalog
 10. Armenian                  11. Hebrew (modern)           12. Turkish
 13. Portuguese                14. Belarusian                15. Norwegian Nynorsk
 16. Norwegian                 17. Dutch                     18. Japanese
 19. Albanian                  20. Bulgarian                 21. Serbian
 22. Swahili                   23. Swedish                   24. French
 25. Latin                     26. Czech                     27. Yiddish
 28. Hindi                     29. Danish                    30. Finnish
 31. German                    32. Bosnian-Croatian-Serbian  33. Slovak
 34. Persian                   35. Lithuanian                36. Slovene
 37. Latvian                   38. Bosnian                   39. Gujarati
 40. Italian                   41. Icelandic                 42. Spanish; Castilian
 43. Ukrainian                 44. Urdu                      45. Indonesian
 46. Khmer                     47. Galician                  48. Korean
 49. Afrikaans                 50. Georgian                  51. Catalan; Valencian
 52. Romanian, Moldavian, ...  53. Basque                    54. Macedonian
 55. Russian                   56. Azerbaijani               57. Chinese
 58. Estonian                  59. Welsh                     60. Arabic
 61. Bengali                   62. Amharic                   63. Irish
 64. Malay                     65. Marathi (Marāṭhī)         66. Polish
 67. Greek, Modern             68. Esperanto                 69. Maltese

Model download

Download the English and Chinese transliteration models:

$ polyglot download transliteration2.zh transliteration2.en
[polyglot_data] Downloading package transliteration2.zh to
[polyglot_data] Downloading package transliteration2.en to
[polyglot_data]  /home/user/polyglot_data...

Example

Import dependencies

>>> from polyglot.text import Text

Transliterating English into Chinese

>>> text = Text(text_en)
>>> print(text_en)
Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers, 50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage, which has not made the devices in 20 years.
>>> print("".join([t for t in text.transliterate("zh")]))
拉斯特帕格普罗维德尔哈斯安诺乌恩斯德伊特维尔恩德伊特斯塞尔维斯因塞普特艾伯布林吉恩格阿恩阿特伊奥纳尔恩德托特埃莱科姆穆尼卡特伊昂布熙佩尔斯年年耶阿尔斯阿夫特特海尔乌斯尔斯雷马因苏布斯克里贝德托托基奥特埃莱梅斯斯阿格埃惠克赫哈斯诺特马德特赫德耶夫伊斯斯因耶阿尔斯

As the output shows, the quality of English-to-Chinese transliteration is fairly poor, so we do not cover it further here.

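Pipelines

A pipeline runs several NLP tasks in sequence, with the output of one task serving as the input of the next. In entity recognition and relation extraction, for example, the pipeline approach first identifies the entities and then identifies the relations between them; the alternative is the joint approach, which handles entity recognition and relation extraction together.

Example

Tokenize first, then count the words that occur at least 2 times. The command runs from a shell (or with a leading ! inside an IPython/Jupyter session):

$ polyglot --lang en tokenize --input testdata/example.txt | polyglot count --min-count 2
in          10
the         6
.           6
-           5
,           4
of          3
and         3
by          3
South       2
5           2
2007        2
Bermuda     2
which       2
score       2
against     2
Mitchell    2
as          2
West        2
India       2
beat        2
Afghanistan 2
Indies      2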
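The same two-stage idea can be sketched in plain Python, with a generator standing in for the shell pipe. The `tokenize` and `count` functions below are hypothetical stand-ins written with the standard library; the regex tokenizer is a rough assumption and far simpler than polyglot's.

```python
import re
from collections import Counter


def tokenize(lines):
    """Stage 1: yield word and punctuation tokens from an iterable of text lines."""
    for line in lines:
        for token in re.findall(r"\w+|[^\w\s]", line):
            yield token


def count(tokens, min_count=2):
    """Stage 2: return (token, frequency) pairs seen at least min_count times, most frequent first."""
    counts = Counter(tokens)
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]


lines = ["the cat and the dog.", "the end."]
# The output of the first stage feeds straight into the second stage.
for token, freq in count(tokenize(lines), min_count=2):
    print("{:10s} {}".format(token, freq))
```

Because `tokenize` is a generator, tokens flow into `count` one at a time, much as they would through a pipe between two processes.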
