
Easy to use, state-of-the-art Neural Machine Translation

Project description

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation

This package provides easy-to-use, state-of-the-art machine translation for more than 100 languages. The highlights of this package are:

  • Easy installation and usage: Use state-of-the-art machine translation with 3 lines of code
  • Automatic download of pre-trained machine translation models
  • Translation between more than 150 languages
  • Automatic language detection for more than 170 languages
  • Sentence and document translation
  • Multi-GPU and multi-process translation


Docker & REST-API

We provide ready-to-use Docker images that wrap EasyNMT in a REST API:

docker run -p 24080:80 easynmt/api:2.0-cpu

Call the REST API:

http://localhost:24080/translate?target_lang=en&text=Hallo%20Welt
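From Python, the same request URL can be built with the standard library so the text is query-encoded correctly. A minimal sketch, assuming the container is running with the port mapping shown above (the `translate` endpoint and its `target_lang`/`text` parameters come from the example URL):

```python
from urllib.parse import urlencode

# Build the request URL for the REST API shown above.
# Host/port match the `docker run -p 24080:80` mapping.
base_url = "http://localhost:24080/translate"
params = {"target_lang": "en", "text": "Hallo Welt"}
url = f"{base_url}?{urlencode(params)}"
print(url)

# With the Docker container running, the translation could then be
# fetched with e.g.:
#   import urllib.request, json
#   result = json.load(urllib.request.urlopen(url))
```

Note that `urlencode` escapes the space as `+` rather than `%20`; both are valid in a query string.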

For more information on the different Docker images and the REST API endpoints, see docker/

Also check out our EasyNMT Google Colab REST API hosting example on how to host a translation API with Google Colab and a free GPU.

Installation

You can install the package via:

pip install -U easynmt

The models are based on PyTorch. If you have a GPU available, see how to install PyTorch with GPU support. If you use Windows and run into installation issues, see this issue on how to solve them.

Usage

The usage is simple:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

#Translate a single sentence to German
print(model.translate('This is a sentence we want to translate to German', target_lang='de'))

#Translate several sentences to German
sentences = ['You can define a list with sentences.',
             'All sentences are translated to your target language.',
             'Note, you could also mix the languages of the sentences.']
print(model.translate(sentences, target_lang='de'))

Document Translation

The available models are based on the Transformer architecture, which provides state-of-the-art translation quality. However, the input length is limited to 512 word pieces for the opus-mt models and to 1024 word pieces for the M2M models.

translate() performs automatic sentence splitting to be able to translate longer documents:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

document = """Berlin is the capital and largest city of Germany by both area and population.[6][7] Its 3,769,495 inhabitants as of 31 December 2019[2] make it the most-populous city of the European Union, according to population within city limits.[8] The city is also one of Germany's 16 federal states. It is surrounded by the state of Brandenburg, and contiguous with Potsdam, Brandenburg's capital. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants and an area of more than 30,000 km2,[9] Germany's third-largest metropolitan region after the Rhine-Ruhr and Rhine-Main regions. Berlin straddles the banks of the River Spree, which flows into the River Havel (a tributary of the River Elbe) in the western borough of Spandau. Among the city's main topographical features are the many lakes in the western and southeastern boroughs formed by the Spree, Havel, and Dahme rivers (the largest of which is Lake Müggelsee). Due to its location in the European Plain, Berlin is influenced by a temperate seasonal climate. About one-third of the city's area is composed of forests, parks, gardens, rivers, canals and lakes.[10] The city lies in the Central German dialect area, the Berlin dialect being a variant of the Lusatian-New Marchian dialects.

First documented in the 13th century and at the crossing of two important historic trade routes,[11] Berlin became the capital of the Margraviate of Brandenburg (1417–1701), the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–1933), and the Third Reich (1933–1945).[12] Berlin in the 1920s was the third-largest municipality in the world.[13] After World War II and its subsequent occupation by the victorious countries, the city was divided; West Berlin became a de facto West German exclave, surrounded by the Berlin Wall (1961–1989) and East German territory.[14] East Berlin was declared capital of East Germany, while Bonn became the West German capital. Following German reunification in 1990, Berlin once again became the capital of all of Germany.

Berlin is a world city of culture, politics, media and science.[15][16][17][18] Its economy is based on high-tech firms and the service sector, encompassing a diverse range of creative industries, research facilities, media corporations and convention venues.[19][20] Berlin serves as a continental hub for air and rail traffic and has a highly complex public transportation network. The metropolis is a popular tourist destination.[21] Significant industries also include IT, pharmaceuticals, biomedical engineering, clean tech, biotechnology, construction and electronics."""

#Translate the document to German
print(model.translate(document, target_lang='de'))

This function breaks the document down into sentences, which are then translated individually using the specified model.
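The split-translate-join pattern can be illustrated with a small sketch. The regex splitter and the `translate_fn` callback below are only hypothetical stand-ins, not EasyNMT's actual sentence-boundary detection:

```python
import re

def split_sentences(document: str) -> list:
    """Naive sentence splitter: break on ., ! or ? followed by
    whitespace. Only a stand-in for real sentence-boundary detection."""
    parts = re.split(r'(?<=[.!?])\s+', document.strip())
    return [p for p in parts if p]

def translate_document(document, translate_fn, batch_size=16):
    """Split a long document into sentences, translate them in batches,
    and re-join the results -- mirroring the behaviour described above."""
    sentences = split_sentences(document)
    translated = []
    for i in range(0, len(sentences), batch_size):
        translated.extend(translate_fn(sentences[i:i + batch_size]))
    return " ".join(translated)

# Demo with an identity "translator" standing in for the model:
doc = "Berlin is the capital of Germany. It has about 3.7 million inhabitants."
print(translate_document(doc, lambda batch: batch))
```

This keeps every model input below the word-piece limits mentioned above, since each sentence is translated on its own.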

Automatic Language Detection

You can set the source_lang parameter for the translate method to define the source language. If source_lang is not set, fastText is used to automatically determine the source language. This also allows you to pass a list of sentences/documents in different languages:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

#Translate several sentences to English
sentences = ['Dies ist ein Satz in Deutsch.',   #This is a German sentence
             '这是一个中文句子',    #This is a Chinese sentence
             'Esta es una oración en español.'] #This is a Spanish sentence
print(model.translate(sentences, target_lang='en'))

Available Models

The following models are currently available. They provide translations between more than 150 languages.

| Model | Reference | #Languages | Size | Speed GPU (Sentences/sec on V100) | Speed CPU (Sentences/sec) | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| opus-mt | Helsinki-NLP | 186 | 300 MB | 50 | 6 | Individual models (~300 MB) per translation direction |
| mbart50_m2m | Facebook Research | 52 | 2.3 GB | 25 | - | |
| mbart50_m2en | Facebook Research | 52 | 2.3 GB | 25 | - | Can only translate from the other languages to English. |
| mbart50_en2m | Facebook Research | 52 | 2.3 GB | 25 | - | Can only translate from English to the other languages. |
| m2m_100_418M | Facebook Research | 100 | 1.8 GB | 22 | - | |
| m2m_100_1.2B | Facebook Research | 100 | 5.0 GB | 13 | - | |

Translation Quality

A comparison of the translation quality of the models will be added here soon. So far, my personal, subjective impression is that opus-mt and m2m_100_1.2B yield the best translations.

Opus-MT

We provide a wrapper for the pre-trained models from Opus-MT.

Opus-MT offers more than 1,200 different translation models, each able to translate one direction (e.g. from German to English). Each model is about 300 MB in size.

Supported languages: aav, aed, af, alv, am, ar, art, ase, az, bat, bcl, be, bem, ber, bg, bi, bn, bnt, bzs, ca, cau, ccs, ceb, cel, chk, cpf, crs, cs, csg, csn, cus, cy, da, de, dra, ee, efi, el, en, eo, es, et, eu, euq, fi, fj, fr, fse, ga, gaa, gil, gl, grk, guw, gv, ha, he, hi, hil, ho, hr, ht, hu, hy, id, ig, ilo, is, iso, it, ja, jap, ka, kab, kg, kj, kl, ko, kqn, kwn, kwy, lg, ln, loz, lt, lu, lua, lue, lun, luo, lus, lv, map, mfe, mfs, mg, mh, mk, mkh, ml, mos, mr, ms, mt, mul, ng, nic, niu, nl, no, nso, ny, nyk, om, pa, pag, pap, phi, pis, pl, pon, poz, pqe, pqw, prl, pt, rn, rnd, ro, roa, ru, run, rw, sal, sg, sh, sit, sk, sl, sm, sn, sq, srn, ss, ssp, st, sv, sw, swc, taw, tdt, th, ti, tiv, tl, tll, tn, to, toi, tpi, tr, trk, ts, tum, tut, tvl, tw, ty, tzo, uk, umb, ur, ve, vi, vsl, wa, wal, war, wls, xh, yap, yo, yua, zai, zh, zne

Usage:

from easynmt import EasyNMT
model = EasyNMT('opus-mt', max_loaded_models=10)

A suitable Opus-MT model is automatically detected and loaded. With the optional parameter max_loaded_models you can specify the maximum number of models loaded at the same time. If you then translate an unseen language direction, the oldest model is unloaded and the new model is loaded.
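This unload-oldest behaviour can be pictured as a small least-recently-used cache. A hypothetical sketch of the idea (not EasyNMT's actual implementation; `load_fn` stands in for loading an Opus-MT checkpoint):

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `max_loaded_models` models in memory; when a new
    translation direction is requested and the cache is full, the
    least recently used model is dropped before loading the new one."""

    def __init__(self, max_loaded_models: int, load_fn):
        self.max_loaded_models = max_loaded_models
        self.load_fn = load_fn          # stand-in for loading a checkpoint
        self._models = OrderedDict()    # direction -> model, oldest first

    def get(self, direction: str):
        if direction in self._models:
            self._models.move_to_end(direction)   # mark as recently used
        else:
            if len(self._models) >= self.max_loaded_models:
                self._models.popitem(last=False)  # evict the oldest model
            self._models[direction] = self.load_fn(direction)
        return self._models[direction]

# Demo: "load" models as plain strings
cache = ModelCache(2, load_fn=lambda d: f"model:{d}")
cache.get("de-en")
cache.get("en-fr")
cache.get("fr-es")  # cache full -> evicts de-en
print(list(cache._models))
```

With max_loaded_models=2, requesting a third direction evicts the least recently used one, so only the two newest directions stay loaded.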

mBART_50

We provide a wrapper for Facebook's mBART50 model, which is able to translate between any pair of 50+ languages. There are also models to translate from English to these languages, and vice versa.

Usage:

from easynmt import EasyNMT
model = EasyNMT('mbart50_m2m')

Supported languages: af, ar, az, bn, cs, de, en, es, et, fa, fi, fr, gl, gu, he, hi, hr, id, it, ja, ka, kk, km, ko, lt, lv, mk, ml, mn, mr, my, ne, nl, pl, ps, pt, ro, ru, si, sl, sv, sw, ta, te, th, tl, tr, uk, ur, vi, xh, zh

M2M_100

We provide a wrapper for Facebook's M2M 100 model, which is able to translate between any pair of 100 languages.

Supported languages: af, am, ar, ast, az, ba, be, bg, bn, br, bs, ca, ceb, cs, cy, da, de, el, en, es, et, fa, ff, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, ht, hu, hy, id, ig, ilo, is, it, ja, jv, ka, kk, km, kn, ko, lb, lg, ln, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, ns, oc, or, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, so, sq, sr, ss, su, sv, sw, ta, th, tl, tn, tr, uk, ur, uz, vi, wo, xh, yi, yo, zh, zu

Currently, we provide wrappers for two M2M 100 models:

  • m2m_100_418M: M2M model with 418 million parameters (1.8 GB)
  • m2m_100_1.2B: M2M model with 1.2 billion parameters (5.0 GB)

Usage:

from easynmt import EasyNMT
model = EasyNMT('m2m_100_418M')   #or: EasyNMT('m2m_100_1.2B') 

You can find more information here. Note: the 12 billion parameter M2M model is currently not supported.

As soon as you call EasyNMT('m2m_100_418M') / EasyNMT('m2m_100_1.2B'), the respective model is downloaded and cached locally.

Author

Contact person: Nils Reimers, info@nils-reimers.de

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software to encourage future research.
