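ANYKS Spell-checker (ASC)

Project description

There are many systems for correcting typos and text errors. Each has its strengths and weaknesses, each has a right to exist, and each will find its own user base. I would like to present my own typo correction system, which has its own unique features.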

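List of features

  • Correction of errors in words with a Levenshtein distance of up to 4;
  • Correction of different types of typos in words: insertion, deletion, substitution, and transposition of characters;
  • Ё-fication of a word based on context (the letter 'ё' is commonly replaced by the letter 'е' in typed Russian text);
  • Context-based capitalization of proper names and titles;
  • Context-based splitting of words that are missing the separating space character;
  • Text analysis without modifying the original text;
  • Searching the text for errors, typos, and incorrect context.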
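
The first two features are stated in terms of edit distance and four typo types. As a rough illustration only (this is not ASC's implementation, which is not shown here), the sketch below computes the optimal string alignment variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and swaps of adjacent characters. The names `osa_distance`, `is_candidate`, and `MAX_DISTANCE` are assumptions made for this example; only the limit of 4 comes from the feature list.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertion, deletion, substitution,
    and transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


MAX_DISTANCE = 4  # upper bound taken from the feature list (hypothetical name)


def is_candidate(typo: str, word: str, limit: int = MAX_DISTANCE) -> bool:
    """True if `word` is within `limit` edits of `typo`."""
    return osa_distance(typo, word) <= limit
```

With this definition, a dropped letter ("helo" for "hello") and a swapped pair ("hlelo" for "hello") each count as a single edit.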

Requirements

Install PyBind11

$ python3 -m pip install pybind11

Ready-to-use dictionaries

Dictionary name           Size (GB)   RAM (GB)   N-gram order   Language
wittenbell-3-big.asc      1.97        15.6       3              Russian
wittenbell-3-middle.asc   1.24        9.7        3              Russian
mkneserney-3-middle.asc   1.33        9.7        3              Russian
wittenbell-3-single.asc   0.772       5.14       3              Russian
wittenbell-5-single.asc   1.37        10.7       5              Russian

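Testing

To test the system, we used data from the 2016 "spelling correction" competition organized by Dialog21.
The trained binary dictionary used for testing: wittenbell-3-middle.asc

Mode               Precision   Recall   F-measure
Typo correction    76.97       62.71    69.11
Error correction   73.72       60.53    66.48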

I do not think any other data needs to be added. Anyone who wishes can repeat the test (all files used for testing are attached below).
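
The third column of the results table is consistent with the usual definition of the F-measure as the harmonic mean of precision and recall. The short check below recomputes it from the two published columns; the function name `f_measure` is chosen for this example and is not part of ASC.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as published in the results table.
results = {
    "Typo correction": (76.97, 62.71),
    "Error correction": (73.72, 60.53),
}

for mode, (precision, recall) in results.items():
    print(f"{mode}: F = {f_measure(precision, recall):.2f}")
# Typo correction: F = 69.11
# Error correction: F = 66.48
```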

Files used for testing


Description of methods

Methods:

  • idw - Word ID retrieval method
  • idt - Token ID retrieval method
  • ids - Sequence ID retrieval method

Example:

>>> import asc
>>>
>>> asc.idw("hello")
313191024
>>>
>>> asc.idw("<s>")
1
>>>
>>> asc.idw("</s>")
22
>>>
>>> asc.idw("<unk>")
3
>>>
>>> asc.idt("1424")
2
>>>
>>> asc.idt("hello")
0
>>>
>>> asc.idw("Living")
13268942501
>>>
>>> asc.idw("in")
2047
>>>
>>> asc.idw("the")
83201
>>>
>>> asc.idw("USA")
72549
>>>
>>> asc.ids([13268942501, 2047, 83201, 72549])
16314074810955466382

Description

Name         Description
〈s〉          Sentence beginning token
〈/s〉         Sentence end token
〈url〉        URL address token
〈num〉        Number (Arabic or Roman) token
〈unk〉        Unknown word token
〈time〉       Time token (15:44:56)
〈score〉      Score count token (4:3 ¦ 01:04)
〈fract〉      Fraction token (5/20 ¦ 192/864)
〈date〉       Date token (18.07.2004 ¦ 07/18/2004)
〈abbr〉       Abbreviation token (1-й ¦ 2-е ¦ 20-я ¦ p.s ¦ p.s.)
〈dimen〉      Dimensions token (200x300 ¦ 1920x1080)
〈range〉      Range of numbers token (1-2 ¦ 100-200 ¦ 300-400)
〈aprox〉      Approximate number token (~93 ¦ ~95.86 ¦ 10~20)
〈anum〉       Pseudo-number token (combination of numbers and other symbols) (T34 ¦ 895-M-86 ¦ 39km)
〈pcards〉     Playing card symbols (♠ ¦ ♣ ¦ ♥ ¦ ♦ )
〈punct〉      Punctuation token (. ¦ , ¦ ? ¦ ! ¦ : ¦ ; ¦ … ¦ ¡ ¦ ¿)
〈route〉      Direction symbols (arrows) (← ¦ ↑ ¦ ↓ ¦ ↔ ¦ ↵ ¦ ⇐ ¦ ⇑ ¦ ⇒ ¦ ⇓ ¦ ⇔ ¦ ◄ ¦ ▲ ¦ ► ¦ ▼)
〈greek〉      Symbols of the Greek alphabet (Α ¦ Β ¦ Γ ¦ Δ ¦ Ε ¦ Z ¦ Η ¦ Θ ¦ Ι ¦ Κ ¦ Λ ¦ Μ ¦ Ν ¦ Ξ ¦ Ο ¦ Π ¦ Ρ ¦ Σ ¦ Τ ¦ Υ ¦ Φ ¦ X ¦ Ψ ¦ Ω)
〈isolat〉     Isolation/quotation token (( ¦ ) ¦ [ ¦ ] ¦ { ¦ } ¦ " ¦ « ¦ » ¦ „ ¦ " ¦ ` ¦ ⌈ ¦ ⌉ ¦ ⌊ ¦ ⌋ ¦ ‹ ¦ › ¦ ‚ ¦ ¦ ′ ¦ ‛ ¦ ″ ¦ ' ¦ ” ¦ ' ¦ ' ¦ 〈 ¦ 〉)
〈specl〉      Special character token (_ ¦ @ ¦ # ¦ № ¦ © ¦ ® ¦ & ¦ § ¦ æ ¦ ø ¦ Þ ¦ – ¦ ‾ ¦ - ¦ — ¦ ¯ ¦ ¶ ¦ ˆ ¦ ~ ¦ † ¦ ‡ ¦ • ¦ ‰ ¦ ⁄ ¦ ℑ ¦ ℘ ¦ ℜ ¦ ℵ ¦ ◊ ¦ \ )
〈currency〉   Symbols of world currencies ($ ¦ € ¦ ₽ ¦ ¢ ¦ £ ¦ ₤ ¦ ¦ ¦ ¥ ¦ ℳ ¦ ₣ ¦ ₴ ¦ ₸ ¦ ₹ ¦ ₩ ¦ ₦ ¦ ₭ ¦ ₪ ¦ ৳ ¦ ៘ ¦ ₨ ¦ ¦ ₮ ¦ ₱ ¦ ﷼ ¦ ₡ ¦ ₲ ¦ ؋ ¦ ₵ ¦ ₺ ¦ ₼ ¦ ₾ ¦ ₠ ¦ ₧ ¦ ₯ ¦ ₢ ¦ ₳ ¦ ₥ ¦ ₰ ¦ ₿)
〈math〉       Mathematical operation token (+ ¦ - ¦ = ¦ / ¦ * ¦ ^ ¦ × ¦ ÷ ¦ - ¦ ∕ ¦ ∖ ¦ ∗ ¦ √ ¦ ∝ ¦ ∞ ¦ ∠ ¦ ± ¦ ¹ ¦ ² ¦ ³ ¦ ½ ¦ ⅓ ¦ ¼ ¦ ¾ ¦ % ¦ ∼ ¦ ≅ ¦ ≈ ¦ ≠ ¦ ≡ ¦ ≤ ¦ ≥ ¦ ¦ ¦ ⊂ ¦ ⊃ ¦ ⊄ ¦ ⊆ ¦ ⊇ ¦ ⊕ ¦ ⊗ ¦ ⊥ ¦ ¨)
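
To make the numeric token types in the table above more concrete, here is a minimal sketch that recognizes a few of them (date, time, score, fraction, dimensions, range, approximate number, plain number) with regular expressions. It is an assumption-laden illustration built only from the sample values in the table, not the rules ASC actually applies; the short labels, `TOKEN_PATTERNS`, and `classify` are hypothetical names, and Roman numerals are not handled.

```python
import re

# Order matters: more specific shapes are tried before more general ones.
# Patterns are inferred from the sample values in the table, nothing more.
TOKEN_PATTERNS = [
    ("date", re.compile(r"\d{2}[./]\d{2}[./]\d{4}")),  # 18.07.2004, 07/18/2004
    ("time", re.compile(r"\d{1,2}:\d{2}:\d{2}")),      # 15:44:56
    ("score", re.compile(r"\d{1,2}:\d{1,2}")),         # 4:3, 01:04
    ("fract", re.compile(r"\d+/\d+")),                 # 5/20, 192/864
    ("dimen", re.compile(r"\d+x\d+")),                 # 200x300, 1920x1080
    ("range", re.compile(r"\d+-\d+")),                 # 1-2, 100-200
    ("aprox", re.compile(r"~\d+(?:\.\d+)?")),          # ~93
    ("num", re.compile(r"\d+")),                       # 1424 (Arabic digits only)
]


def classify(text: str):
    """Return the label of the first pattern matching the whole string, or None."""
    for label, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return None
```

Because every pattern must match the whole string, ordinary words such as "hello" fall through and return `None`.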

Methods:

  • setZone - Method for setting a user zone

Example:

>>> import asc
>>>
>>> asc.setZone("com")
>>> asc.setZone("ru")
>>> asc.setZone("org")
>>> asc.setZone("net")

Methods:

  • clear - Method for clearing all data
  • setAlphabet - Method for setting the alphabet
  • getAlphabet - Method for getting the alphabet

Example:

>>> import asc
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'
>>>
>>> asc.setAlphabet("abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя")
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyzабвгдеёжзийклмнопрстуфхцчшщъыьэюя'
>>>
>>> asc.clear()
>>>
>>> asc.getAlphabet()
'abcdefghijklmnopqrstuvwxyz'

Methods:

  • setUnknown - Method for setting the unknown word
  • getUnknown - Method for retrieving the unknown word

Example:

>>> import asc
>>>
>>> asc.setUnknown("word")
>>>
>>> asc.getUnknown()
'word'

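Methods:

  • infoIndex - Method for printing information about the dictionary
  • token - Method for determining the token type of a word
  • addText - Method for adding text for estimation
  • collectCorpus - Training method that assembles the text data for ASC [corpus = file name or directory, smoothing = wittenBell, modified = False, prepares = False, mod = 0.0, status = Null]
  • pruneVocab - Method for pruning the dictionary
  • buildArpa - Method for building the ARPA
  • writeWords - Method for writing the words to a file
  • writeVocab - Method for writing dictionary data to a file
  • writeNgrams - Method for writing data to N-gram files
  • writeMap - Method for writing the sequence map to a file
  • writeSuffix - Method for writing data to the file of suffixes for numeric abbreviations
  • writeAbbrs - Method for writing data to the abbreviations file
  • getSuffixes - Method for retrieving the list of suffixes of numeric abbreviations
  • writeArpa - Method for writing data to an ARPA file
  • setThreads - Method for setting the number of threads to use (0 - all available threads)
  • setStemmingMethod - Method for setting an external stemming function
  • loadIndex - Method for loading a binary index
  • spell - Method for performing spell checking
  • analyze - Method for analyzing text
  • addAlt - Method for adding a word/letter with an alternative letter
  • setAlphabet - Method for setting the alphabet
  • setPilots - Method for setting pilot words
  • setSubstitutes - Method for setting the letters used to correct words written in mixed alphabets
  • addAbbr - Method for adding an abbreviation
  • setAbbrs - Method for setting abbreviations
  • getAbbrs - Method for retrieving the list of abbreviations
  • addGoodword - Method for adding a good word
  • addBadword - Method for adding a bad word
  • addUWord - Method for adding a word that always starts with a capital letter
  • setUWords - Method for adding a list of identifiers of words that always start with a capital letter
  • readArpa - Method for reading an ARPA file (language model)
  • readVocab - Method for reading the dictionary
  • setEmbedding - Method for setting the embedding
  • buildIndex - Method for building the spell-checker index
  • setAdCw - Method for setting dictionary characteristics (cw - count of all words in the dataset, ad - count of all documents in the dataset)
  • setCode - Method for setting the language code
  • addLemma - Method for adding a lemma to the dictionary
  • setNSWLibCount - Method for setting the maximum number of variants for analysis

Example: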

>>> import asc
>>> 
>>> asc.infoIndex("./wittenbell-3-single.asc")

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

* Code: RU

* Version: 1.0.0

* Dictionary name: Russian - single

* Locale: en_US.UTF-8
* Alphabet: абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz

* Build date: 09/08/2020 15:39:31

* Encrypted: NO

* ALM type: ALMv1

* Allow apostrophe: NO

* Count words: 106912195
* Count documents: 263998

* Only good words: NO
* Mix words in dicts: YES
* Confidence arpa: YES

* Count upper words: 841915
* Count pilots words: 15
* Count bad words: 108790
* Count good words: 124
* Count substitutes: 14
* Count abbreviations: 16532

* Alternatives: е => ё
* Count alternatives words: 58138

* Size embedding: 28

* Length n-gram: 3
* Count n-grams: 6710202

* Author: Yuriy Lobarev

* Contacts: site: https://anyks.com, e-mail: forman@anyks.com

* Copyright ©: Yuriy Lobarev

* License type: GPLv3
* License text:
The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it.

For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.

Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users.

Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free.

The precise terms and conditions for copying, distribution and modification follow.

URL: https://www.gnu.org/licenses/gpl-3.0.ru.html

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
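
The report above follows a simple `* Key: value` layout. If that text has been saved or captured as a string, it can be turned into a dictionary for further use. The sketch below is one way to do that; `parse_info` is a hypothetical helper, not an ASC method, and it assumes the layout shown above (lines that do not start with `* ` or contain no colon, such as the separator rows and the license body, are skipped).

```python
def parse_info(report: str) -> dict:
    """Collect '* Key: value' lines of a dictionary report into a dict."""
    info = {}
    for line in report.splitlines():
        line = line.strip()
        # Skip blank lines, separator rows, and free text such as the license.
        if not line.startswith("* ") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line[2:].partition(":")
        info[key.strip()] = value.strip()
    return info


# A few lines copied from the report above.
sample = """
* * * * * * * * * *
* Code: RU
* Dictionary name: Russian - single
* Build date: 09/08/2020 15:39:31
* Length n-gram: 3
* Count n-grams: 6710202
"""

info = parse_info(sample)
print(info["Dictionary name"], info["Length n-gram"])
```

All values are kept as strings; converting counts such as "Count n-grams" to integers is left to the caller.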

Example:

>>> import asc
>>> import spacy
>>> import pymorphy2
>>>
>>> # Use all available threads
>>> asc.setThreads(0)
>>> # Enable spell-checker options
>>> asc.setOption(asc.options_t.ascSplit)
>>> asc.setOption(asc.options_t.ascAlter)
>>> # asc.setOption(asc.options_t.…)
>>> asc.setOption(asc.options_t.ascRSplit)
>>> asc.setOption(asc.options_t.ascUppers)
>>> asc.setOption(asc.options_t.ascHyphen)
>>> asc.setOption(asc.options_t.ascWordRep)
>>> asc.setOption(asc.options_t.mixDicts)
>>> asc.setOption(asc.options_t.confidence)
>>> asc.setOption(asc.options_t.stemming)
>>>
>>> # Morphological analyzers used for lemmatization
>>> morphRu = pymorphy2.MorphAnalyzer()
>>> morphEn = spacy.load('en', disable=['parser', 'ner'])
>>>
>>> def status(text, status):
...     print(text, status)
...
>>>
>>> # Lemmatize an English word with spaCy
>>> def eng(word):
...     global morphEn
...     words = morphEn(word)
...     word = ''.join([token.lemma_ for token in words]).strip()
...     if word[0] != '-' and word[len(word) - 1] != '-':
...         return word
...     else:
...         return ""
...
>>>
>>> # Lemmatize a Russian word with pymorphy2
>>> def rus(word):
...     global morphRu
...     if morphRu is not None:
...         word = morphRu.parse(word)[0].normal_form
...         return word
...