1. Overview
1.1 The concept of NLP
1.2 NLP research areas and techniques
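Natural language processing covers a very wide range of research topics. Only some of them are listed here (mainly the areas this article works with): word segmentation (Word-Segmentation), part-of-speech tagging (Part-of-speech tagging), syntactic parsing (Parsing), information retrieval (Information-Retrieval), text proofreading (Text-Proofreading), word vector models (WordVector-Model), language models (Language-Model), and question answering systems (Question-Answering-System). Each is introduced in turn below.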
Tool            Version               Download
HIT LTP         ltp4j                 download
berkeleylm      berkeleylm 1.1.5      download
ElasticSearch   elasticsearch-2.4.5   download

3. Implementation
3.1 Word segmentation (Word-Segmentation)
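3.1.1 This section covers Chinese word segmentation. There are many ways to segment Chinese text, for example Stanford CoreNLP (see the separate post on Chinese word segmentation with Stanford NLP for a concrete implementation) and jieba. Here we use LTP, the Language Technology Platform from Harbin Institute of Technology, which is also used for part-of-speech tagging and syntactic parsing later on. The steps are as follows:
Download the LTP4J jar (download).
Copy all of the downloaded files into the JDK bin directory, as shown below: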
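package ccw.ltpdemo;

import java.util.ArrayList;
import java.util.List;

import edu.hit.ir.ltp4j.Segmentor;

public class ltpSegmentDemo {
    public static void main(String[] args) {
        Segmentor segmentor = new Segmentor();
        // Load the segmentation model; a negative return value means loading failed
        if (segmentor.create("D:/NLP/LTP/ltp_data_v3.4.0/ltp_data_v3.4.0/cws.model") < 0) {
            System.out.println("Failed to load model");
        } else {
            String sent = "这是中文分词测试";
            List<String> words = new ArrayList<String>();
            // The segmented words are written into `words`
            int size = segmentor.segment(sent, words);
            for (String word : words) {
                System.out.print(word + "\t");
            }
            segmentor.release();
        }
    }
}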
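3.2.1 This section shows how to implement Chinese part-of-speech tagging with LTP. The code is as follows:

package ccw.ltpdemo;

import java.util.ArrayList;
import java.util.List;

import edu.hit.ir.ltp4j.Postagger;

public class ltpPostaggerDemo {
    public static void main(String[] args) {
        Postagger postagger = new Postagger();
        // Load the POS model; a negative return value means loading failed
        if (postagger.create("D:/NLP/LTP/ltp_data_v3.4.0/ltp_data_v3.4.0/pos.model") < 0) {
            System.out.println("Failed to load model");
        } else {
            // Input: an already segmented sentence
            List<String> words = new ArrayList<String>();
            words.add("我");
            words.add("是");
            words.add("中国");
            words.add("人");
            // Output: one tag per input word
            List<String> values = new ArrayList<String>();
            int size = postagger.postag(words, values);
            for (int i = 0; i < words.size(); i++) {
                System.out.print(words.get(i) + " " + values.get(i) + "\t");
            }
            postagger.release();
        }
    }
}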
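package ccw.ltpdemo;

import java.util.ArrayList;
import java.util.List;

import edu.hit.ir.ltp4j.Parser;

public class ltpParserDemo {
    /**
     * @param args
     */
    public static void main(String[] args) {
        Parser parser = new Parser();
        // Load the parser model; a negative return value means loading failed
        if (parser.create("D:/NLP/LTP/ltp_data_v3.4.0/ltp_data_v3.4.0/parser.model") < 0) {
            System.out.println("Failed to load model");
        } else {
            // Input: segmented words and their POS tags, in parallel lists
            List<String> words = new ArrayList<String>();
            List<String> tags = new ArrayList<String>();
            words.add("我");
            tags.add("r");
            words.add("非常");
            tags.add("d");
            words.add("喜欢");
            tags.add("v");
            words.add("音乐");
            tags.add("n");
            // Output: head index and dependency relation for each word
            List<Integer> heads = new ArrayList<Integer>();
            List<String> deprels = new ArrayList<String>();
            int size = parser.parse(words, tags, heads, deprels);
            for (int i = 0; i < size; i++) {
                System.out.print(heads.get(i) + ":" + deprels.get(i));
                if (i == size - 1) {
                    System.out.println();
                } else {
                    System.out.print("        ");
                }
            }
            parser.release();
        }
    }
}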
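The parser returns two parallel lists, one head index and one relation label per word. A minimal Python sketch of how such output could be read as dependency arcs is given here. It assumes head indices are 1-based with 0 standing for a virtual ROOT node, and the `heads` and `deprels` values below are hypothetical illustration data, not actual LTP output; `dependency_arcs` is a made-up helper name.

```python
def dependency_arcs(words, heads, deprels):
    """Turn parallel parser outputs into (dependent, relation, head) triples.

    Assumption: heads are 1-based indices into `words`, and 0 denotes ROOT.
    """
    if not (len(words) == len(heads) == len(deprels)):
        raise ValueError("words, heads and deprels must have the same length")
    arcs = []
    for word, head, rel in zip(words, heads, deprels):
        head_word = "ROOT" if head == 0 else words[head - 1]
        arcs.append((word, rel, head_word))
    return arcs


# Hypothetical values for illustration only, not real parser output.
words = ["我", "非常", "喜欢", "音乐"]
heads = [3, 3, 0, 3]
deprels = ["SBV", "ADV", "HED", "VOB"]

for dependent, relation, head in dependency_arcs(words, heads, deprels):
    print(f"{dependent} --{relation}--> {head}")
```

Keeping the arcs as plain tuples makes it easy to filter by relation label afterwards, for example to pull out the subject and object of the main verb.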
3.3.2 The output is as follows:
3.5 Text proofreading (Text-Proofreading) and language models (Language-Model)
3.5.1 The N-gram model (N-gram)
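We begin with the N-gram model, a very important concept in natural language processing. In NLP, an N-gram model estimated from a corpus is commonly used to predict or evaluate whether a sentence is plausible. Let T be a sentence made up of the word sequence $w_1, w_2, \ldots, w_m$. The probability of T is

$$P(T) = P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$$

These conditional probabilities involve a huge number of parameters and are clearly hard to estimate, so a Markov assumption is introduced: the probability of each word depends only on the few words immediately before it. This greatly shortens the history that has to be considered:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

In particular, for small values of n: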
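When n = 1, the probability of each word is determined only by the frequency of that word itself. This is called the unigram model:

$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i)$$

Let M be the total number of words in the corpus and $C(w_i)$ the number of times $w_i$ occurs in the corpus. Then

$$P(w_i) = \frac{C(w_i)}{M}$$

When n = 2, the probability of each word is determined only by the word immediately before it. This is called the bigram model:

$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1})$$

Let $C(w_{i-1} w_i)$ be the number of times the pair $w_{i-1} w_i$ occurs in the corpus. Then

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}$$

When n = 3, it is called the trigram model:

$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})$$

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}$$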
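The count-based estimates above are easy to try on a toy corpus. The sketch below is a minimal illustration in Python, assuming the text is already segmented into word lists; the three-sentence corpus and the function names (`train_bigram`, `p_unigram`, `p_bigram`) are made up for this example, and no smoothing is applied.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams C(w) and bigrams C(w_prev w) over tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_unigram(word, unigrams):
    """P(w) = C(w) / M, with M the total number of tokens in the corpus."""
    total = sum(unigrams.values())
    return unigrams[word] / total if total else 0.0


def p_bigram(word, prev, unigrams, bigrams):
    """P(w | prev) = C(prev w) / C(prev); 0.0 when prev was never seen."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0


# Toy corpus, hypothetical and already segmented.
corpus = [
    ["我", "喜欢", "音乐"],
    ["我", "喜欢", "中国"],
    ["我", "是", "中国", "人"],
]
unigrams, bigrams = train_bigram(corpus)
print(p_unigram("我", unigrams))
print(p_bigram("喜欢", "我", unigrams, bigrams))
```

Any word pair that never occurs in the corpus gets probability zero here, which is why practical toolkits add smoothing on top of these raw counts.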
package ccw.spring.ccw.lucencedemo;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class Spellcheck {
    public static String directorypath;
    public static String origindirectorypath;
    public SpellChecker spellcheck;
    public LuceneDictionary dict;

    /**
     * Create the spelling index from a plain-text word list.
     *
     * @throws IOException
     */
    public static void createIndex(String directorypath, String origindirectorypath) throws IOException {
        Directory directory = FSDirectory.open(new File(directorypath));
        SpellChecker spellchecker = new SpellChecker(directory);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9, null);
        // Read the word list explicitly as UTF-8
        PlainTextDictionary pdic = new PlainTextDictionary(
                new InputStreamReader(new FileInputStream(new File(origindirectorypath)), "utf-8"));
        spellchecker.indexDictionary(pdic, config, false);
        spellchecker.close();
        directory.close();
    }

    public Spellcheck(String opath, String path) {
        origindirectorypath = opath;
        directorypath = path;
        Directory directory;
        try {
            directory = FSDirectory.open(new File(directorypath));
            spellcheck = new SpellChecker(directory);
            IndexReader oriIndex = IndexReader.open(directory);
            dict = new LuceneDictionary(oriIndex, "name");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void setAccuracy(float v) {
        spellcheck.setAccuracy(v);
    }

    // Returns null when the query is already in the dictionary, otherwise the suggestions
    public String[] search(String queryString, int suggestionsNumber) {
        String[] suggestions = null;
        try {
            if (exist(queryString))
                return null;
            suggestions = spellcheck.suggestSimilar(queryString, suggestionsNumber);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return suggestions;
    }

    // Walk the dictionary entries and compare each one with the query
    private boolean exist(String queryString) throws IOException {
        InputIterator ite = dict.getEntryIterator();
        BytesRef term;
        while ((term = ite.next()) != null) {
            if (term.utf8ToString().equals(queryString))
                return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        String opath = "D:\\Lucene\\NLPLucene\\words.txt";
        String ipath = "D:\\Lucene\\NLPLucene\\index";
        Spellcheck.createIndex(ipath, opath);
        Spellcheck spellcheck = new Spellcheck(opath, ipath);
        // spellcheck.createSpellIndex();
        spellcheck.setAccuracy((float) 0.5);
        String[] result = spellcheck.search("辣糖", 15);
        // Check for null first: search() returns null when the word is already known
        if (result == null || result.length == 0) {
            System.out.println("No error found");
        } else {
            System.out.println("Did you mean:");
            for (String hit : result) {
                System.out.println(hit);
            }
        }
    }
}
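The listing above delegates the actual matching to Lucene. To make the idea behind `setAccuracy` and `suggestSimilar` more concrete, here is a small Python sketch of the general technique: score each dictionary word by a normalised edit-distance similarity and keep those at or above a threshold. It is an illustration under assumptions, not Lucene's implementation; the function names and the three dictionary words are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def suggest_similar(word, dictionary, max_suggestions, accuracy=0.5):
    """Return up to max_suggestions dictionary words scoring >= accuracy."""
    if word in dictionary:
        return []  # already a known word, nothing to suggest
    scored = [(similarity(word, cand), cand) for cand in dictionary]
    kept = [(s, c) for s, c in scored if s >= accuracy]
    kept.sort(key=lambda sc: (-sc[0], sc[1]))  # best score first
    return [c for _, c in kept[:max_suggestions]]


# Hypothetical dictionary for illustration.
dictionary = ["麻辣烫", "麻辣香锅", "酸辣粉"]
print(suggest_similar("麻辣糖", dictionary, 15))
```

Raising `accuracy` trades recall for precision: fewer suggestions come back, but each one is closer to the query.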
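This section covers training a Chinese language model. The training is based on the N-gram algorithm. The main open-source language model toolkits are SRILM, KenLM, and berkeleylm. KenLM performs somewhat better than SRILM, is written in C++, and supports training on large data on a single machine. berkeleylm is written in Java. This article describes how to train a Chinese language model with berkeleylm.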
package ccw.berkeleylm;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import edu.berkeley.nlp.lm.ConfigOptions;
import edu.berkeley.nlp.lm.StringWordIndexer;
import edu.berkeley.nlp.lm.io.ArpaLmReader;
import edu.berkeley.nlp.lm.io.LmReaders;
import edu.berkeley.nlp.lm.util.Logger;

public class demo {

    private static void usage() {
        System.err.println("Usage: <lmOrder> <ARPA lm output file> <textfiles>*");
        System.exit(1);
    }

    public void makelml(String[] argv) {
        if (argv.length < 2) {
            usage();
        }
        final int lmOrder = Integer.parseInt(argv[0]);
        final String outputFile = argv[1];
        final List<String> inputFiles = new ArrayList<String>();
        for (int i = 2; i < argv.length; ++i) {
            inputFiles.add(argv[i]);
        }
        if (inputFiles.isEmpty()) inputFiles.add("-");
        Logger.setGlobalLogger(new Logger.SystemLogger(System.out, System.err));
        Logger.startTrack("Reading text files " + inputFiles + " and writing to file " + outputFile);
        // Register the sentence-boundary and unknown-word symbols
        final StringWordIndexer wordIndexer = new StringWordIndexer();
        wordIndexer.setStartSymbol(ArpaLmReader.START_SYMBOL);
        wordIndexer.setEndSymbol(ArpaLmReader.END_SYMBOL);
        wordIndexer.setUnkSymbol(ArpaLmReader.UNK_SYMBOL);
        // Train a Kneser-Ney model and write it out in ARPA format
        LmReaders.createKneserNeyLmFromTextFiles(inputFiles, wordIndexer, lmOrder, new File(outputFile), new ConfigOptions());
        Logger.endTrack();
    }

    public static void main(String[] args) {
        demo d = new demo();
        String inputfile = "D:\\NLP\\languagematerial\\quest.txt";
        String outputfile = "D:\\NLP\\languagematerial\\q.arpa";
        // Arguments: model order, output ARPA file, input text file
        String s[] = {"8", outputfile, inputfile};
        d.makelml(s);
    }
}
Finally, the trained model is read back and used to score a sentence. The implementation is as follows:
package ccw.berkeleylm;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm;
import edu.berkeley.nlp.lm.ConfigOptions;
import edu.berkeley.nlp.lm.StringWordIndexer;
import edu.berkeley.nlp.lm.io.LmReaders;

public class readdemo {
    // Load an ARPA file into an array-encoded backoff language model
    public static ArrayEncodedProbBackoffLm<String> getLm(boolean compress, String file) {
        final File lmFile = new File(file);
        final ConfigOptions configOptions = new ConfigOptions();
        configOptions.unknownWordLogProb = 0.0f;
        final ArrayEncodedProbBackoffLm<String> lm = LmReaders.readArrayEncodedLmFromArpa(lmFile.getPath(), compress, new StringWordIndexer(), configOptions, Integer.MAX_VALUE);
        return lm;
    }

    public static void main(String[] args) {
        ArrayEncodedProbBackoffLm<String> model = readdemo.getLm(false, "D:\\NLP\\languagematerial\\q.arpa");
        String sentence = "是";
        String[] words = sentence.split(" ");
        List<String> list = new ArrayList<String>();
        for (String word : words) {
            System.out.println(word);
            list.add(word);
        }
        // Log-probability of the word sequence under the model
        float score = model.getLogProb(list);
        System.out.println(score);
    }
}
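To show what a log-probability score means, the following Python sketch sums base-10 log conditional probabilities over a sentence padded with boundary symbols. It is a simplified illustration and not berkeleylm's algorithm: the bigram table is hypothetical, and unseen pairs get a fixed floor (`unseen_log10`, an assumed parameter) where a real backoff model would fall back to lower-order estimates.

```python
import math

START, END = "<s>", "</s>"


def sentence_log10_prob(tokens, bigram_log10, unseen_log10=-5.0):
    """Sum log10 P(w_i | w_{i-1}) over the sentence with boundary symbols added.

    Unseen bigrams receive the fixed floor `unseen_log10` (a simplification).
    """
    padded = [START] + list(tokens) + [END]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        total += bigram_log10.get((prev, word), unseen_log10)
    return total


# Hypothetical bigram probabilities, stored as log10 values.
bigram_log10 = {
    (START, "我"): math.log10(0.5),
    ("我", "是"): math.log10(0.25),
    ("是", END): math.log10(0.1),
}

print(sentence_log10_prob(["我", "是"], bigram_log10))
print(sentence_log10_prob(["是"], bigram_log10))
```

Scores closer to zero mean the model finds the sentence more plausible, so two candidate sentences can be compared by their scores, ideally after normalising for length.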
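The eighth character of a code is a marker with three possible values: "=", "#", and "@". "=" means equal, that is, synonymous. "#" means not equal but of the same class, that is, related words. "@" means self-contained or independent: the word has neither synonyms nor related words in the dictionary. The code below uses the Cilin synonym thesaurus (同义词词林) to compare the similarity of two words: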
package cilin;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.List;
import java.util.Vector;

public class CiLin {
    public static HashMap<String, List<String>> keyWord_Identifier_HashMap; // <keyword, list of codes>
    public int zero_KeyWord_Depth = 12;
    public static HashMap<String, Integer> first_KeyWord_Depth_HashMap;  // <level-1 code, depth>
    public static HashMap<String, Integer> second_KeyWord_Depth_HashMap; // <first two levels of code, depth>
    public static HashMap<String, Integer> third_KeyWord_Depth_HashMap;  // <first three levels of code, depth>
    public static HashMap<String, Integer> fourth_KeyWord_Depth_HashMap; // <first four levels of code, depth>
    // public HashMap<String, HashSet<String>> ciLin_Sort_keyWord_HashMap = new HashMap<String, HashSet<String>>(); // <(synonym) code, set of keywords>

    static {
        keyWord_Identifier_HashMap = new HashMap<String, List<String>>();
        first_KeyWord_Depth_HashMap = new HashMap<String, Integer>();
        second_KeyWord_Depth_HashMap = new HashMap<String, Integer>();
        third_KeyWord_Depth_HashMap = new HashMap<String, Integer>();
        fourth_KeyWord_Depth_HashMap = new HashMap<String, Integer>();
        initCiLin();
    }

    // Open one of the Cilin data files as UTF-8 text
    private static BufferedReader open(String path) throws IOException {
        return new BufferedReader(new InputStreamReader(new FileInputStream(path), "utf-8"));
    }

    // Load a "<code prefix> <depth>" file into the given map
    private static void loadDepthMap(String path, HashMap<String, Integer> map) throws IOException {
        try (BufferedReader inFile = open(path)) {
            String str;
            while ((str = inFile.readLine()) != null) {
                String[] strs = str.split(" ");
                map.put(strs[0], Integer.valueOf(strs[1]));
            }
        }
    }

    // 3. Initialise the Cilin data
    public static void initCiLin() {
        try {
            // Initialise the <keyword, list of codes> map
            try (BufferedReader inFile = open("cilin/keyWord_Identifier_HashMap.txt")) {
                String str;
                while ((str = inFile.readLine()) != null) {
                    String[] strs = str.split(" ");
                    List<String> list = new Vector<String>();
                    for (int i = 1; i < strs.length; i++)
                        list.add(strs[i]);
                    keyWord_Identifier_HashMap.put(strs[0], list);
                }
            }
            // Initialise the four <code prefix, depth> maps
            loadDepthMap("cilin/first_KeyWord_Depth_HashMap.txt", first_KeyWord_Depth_HashMap);
            loadDepthMap("cilin/second_KeyWord_Depth_HashMap.txt", second_KeyWord_Depth_HashMap);
            loadDepthMap("cilin/third_KeyWord_Depth_HashMap.txt", third_KeyWord_Depth_HashMap);
            loadDepthMap("cilin/fourth_KeyWord_Depth_HashMap.txt", fourth_KeyWord_Depth_HashMap);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Compute the similarity of two keywords
    public static double calcWordsSimilarity(String key1, String key2) {
        List<String> identifierList1 = null, identifierList2 = null; // lists of Cilin codes
        if (key1.equals(key2))
            return 1.0;
        if (!keyWord_Identifier_HashMap.containsKey(key1) || !keyWord_Identifier_HashMap.containsKey(key2)) {
            // If either word is missing from Cilin, return a similarity of 0.1
            // System.out.println(key1 + "  " + key2 + ": one of the words is not in Cilin!");
            return 0.1;
        }
        identifierList1 = keyWord_Identifier_HashMap.get(key1); // codes of the first word
        identifierList2 = keyWord_Identifier_HashMap.get(key2); // codes of the second word
        return getMaxIdentifierSimilarity(identifierList1, identifierList2);
    }

    // A word may have several codes; take the best similarity over all code pairs
    public static double getMaxIdentifierSimilarity(List<String> identifierList1, List<String> identifierList2) {
        int i, j;
        double maxSimilarity = 0, similarity = 0;
        for (i = 0; i < identifierList1.size(); i++) {
            j = 0;
            while (j < identifierList2.size()) {
                similarity = getIdentifierSimilarity(identifierList1.get(i), identifierList2.get(j));
                System.out.println(identifierList1.get(i) + "  " + identifierList2.get(j) + "  " + similarity);
                if (similarity > maxSimilarity)
                    maxSimilarity = similarity;
                if (maxSimilarity == 1.0)
                    return maxSimilarity;
                j++;
            }
        }
        return maxSimilarity;
    }

    public static double getIdentifierSimilarity(String identifier1, String identifier2) {
        int n = 0, k = 0; // n is the total number of nodes in the branch layer, k is the distance between the two branches
        // double a = 0.5, b = 0.6, c = 0.7, d = 0.96;
        double a = 0.65, b = 0.8, c = 0.9, d = 0.96;
        if (identifier1.equals(identifier2)) { // equal at the fifth level
            if (identifier1.substring(7).equals("="))
                return 1.0;
            else
                return 0.5;
        } else if (identifier1.substring(0, 5).equals(identifier2.substring(0, 5))) { // equal at the fourth level, e.g. Da13A01=
            n = fourth_KeyWord_Depth_HashMap.get(identifier1.substring(0, 5));
            k = Integer.valueOf(identifier1.substring(5, 7)) - Integer.valueOf(identifier2.substring(5, 7));
            if (k < 0) k = -k;
            return Math.cos(n * Math.PI / 180) * ((double) (n - k + 1) / n) * d;
        } else if (identifier1.substring(0, 4).equals(identifier2.substring(0, 4))) { // equal at the third level, e.g. Da13A01=
            n = third_KeyWord_Depth_HashMap.get(identifier1.substring(0, 4));
            k = identifier1.substring(4, 5).charAt(0) - identifier2.substring(4, 5).charAt(0);
            if (k < 0) k = -k;
            return Math.cos(n * Math.PI / 180) * ((double) (n - k + 1) / n) * c;
        } else if (identifier1.substring(0, 2).equals(identifier2.substring(0, 2))) { // equal at the second level
            n = second_KeyWord_Depth_HashMap.get(identifier1.substring(0, 2));
            k = Integer.valueOf(identifier1.substring(2, 4)) - Integer.valueOf(identifier2.substring(2, 4));
            if (k < 0) k = -k;
            return Math.cos(n * Math.PI / 180) * ((double) (n - k + 1) / n) * b;
        } else if (identifier1.substring(0, 1).equals(identifier2.substring(0, 1))) { // equal at the first level
            n = first_KeyWord_Depth_HashMap.get(identifier1.substring(0, 1));
            k = identifier1.substring(1, 2).charAt(0) - identifier2.substring(1, 2).charAt(0);
            if (k < 0) k = -k;
            return Math.cos(n * Math.PI / 180) * ((double) (n - k + 1) / n) * a;
        }
        return 0.1;
    }
}

// Test (place in its own file, Test.java)
public class Test {
    public static void main(String args[]) {
        String word1 = "相似", word2 = "类似";
        double sim = 0;
        sim = CiLin.calcWordsSimilarity(word1, word2); // compute the similarity of the two words
        System.out.println(word1 + "  " + word2 + " similarity: " + sim);
    }
}
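The scoring rule in `getIdentifierSimilarity` is easier to see in a compact form. The Python sketch below mirrors that logic: find the deepest level at which two 8-character codes share a prefix, then apply cos(n·π/180) · ((n − k + 1)/n) · weight. The weights are the a, b, c, d values from the listing; the `branch_counts` table and the sample codes are hypothetical stand-ins for the data files, and `code_similarity` is a made-up name.

```python
import math

# Level weights a, b, c, d taken from the Java listing.
LEVEL_WEIGHTS = {1: 0.65, 2: 0.8, 3: 0.9, 4: 0.96}

# (level, shared prefix length, slice of the next-level field)
LEVELS = ((4, 5, slice(5, 7)), (3, 4, slice(4, 5)),
          (2, 2, slice(2, 4)), (1, 1, slice(1, 2)))


def branch_similarity(n, k, weight):
    """cos(n * pi / 180) * ((n - k + 1) / n) * weight."""
    return math.cos(n * math.pi / 180) * ((n - k + 1) / n) * weight


def code_similarity(code1, code2, branch_counts):
    """Similarity of two 8-character Cilin codes such as 'Da13A01='."""
    if code1 == code2:
        return 1.0 if code1[7] == "=" else 0.5
    for level, prefix_len, field in LEVELS:
        prefix = code1[:prefix_len]
        if prefix == code2[:prefix_len]:
            n = branch_counts[prefix]
            f1, f2 = code1[field], code2[field]
            # Numeric fields differ by value, letter fields by character code
            k = abs(int(f1) - int(f2)) if f1.isdigit() else abs(ord(f1) - ord(f2))
            return branch_similarity(n, k, LEVEL_WEIGHTS[level])
    return 0.1


# Hypothetical branch count for the prefix shared by the two sample codes.
branch_counts = {"Da13A": 4}
print(code_similarity("Da13A01=", "Da13A03=", branch_counts))
```

The deeper the shared prefix, the larger the weight, so words that sit in the same small group score higher than words that only share a top-level category.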
3.6 Word vector models (WordVector-Model)
3.6.1 Word vectors
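Word2vec, open-sourced by Google in 2013, is a software tool for training word vectors. Given a corpus, it uses an optimized training model to express each word as a vector quickly and efficiently. It contains two training models, CBOW and Skip-gram, each consisting of an input layer, a projection layer, and an output layer, as shown in the figure below: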
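Apart from the figure, the difference between the two models can also be seen in the training examples each one draws from a sentence. The Python sketch below is only an illustration of that data preparation step, not word2vec's training code: Skip-gram uses the center word to predict each neighbour, while CBOW uses the neighbours together to predict the center word. The function names and the `window` parameter are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbours predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


tokens = ["我", "非常", "喜欢", "音乐"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces more, smaller training examples per sentence, whereas CBOW produces one example per position with the whole context grouped together.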
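3.6.3 word2vec

# coding: utf-8
# Train word vectors
import gensim


class TextLoader(object):
    """Iterate over a pre-segmented corpus, one space-separated sentence per line."""

    def __init__(self):
        pass

    def __iter__(self):
        with open('corpus-seg.txt', 'r', encoding='utf-8') as infile:
            line = infile.readline()
            # As in the original loop, stop at the first line of 4 characters or fewer
            while line and len(line) > 4:
                # print(line)
                segments = line.strip().split(' ')
                yield segments
                line = infile.readline()


sentences = TextLoader()
model = gensim.models.Word2Vec(sentences, workers=8)
model.save('word2vector2.model')
print('ok')

# coding: utf-8
# Query a trained model
from gensim.models import Word2Vec

# Load the model
model = Word2Vec.load('word2vector.model')
# The query methods are reached through model.wv
vectors = model.wv

# Compare the similarity of two words; higher means more similar
print('Similarity of "唐山" and "中国": ' + str(vectors.similarity('唐山', '中国')))
print('Similarity of "中国" and "祖国": ' + str(vectors.similarity('祖国', '中国')))
print('Similarity of "中国" and "中国": ' + str(vectors.similarity('中国', '中国')))

# Constrain the search with positive and negative words
result = vectors.most_similar(positive=['中国', '城市'], negative=['学生'])
print('Words close to "中国" and "城市" but not to "学生":')
for item in result:
    print('   "' + item[0] + '"  similarity: ' + str(item[1]))

result = vectors.most_similar(positive=['男人', '权利'], negative=['女人'])
print('Words close to "男人" and "权利" but not to "女人":')
for item in result:
    print('   "' + item[0] + '"  similarity: ' + str(item[1]))

result = vectors.most_similar(positive=['女人', '法律'], negative=['男人'])
print('Words close to "女人" and "法律" but not to "男人":')
for item in result:
    print('   "' + item[0] + '"  similarity: ' + str(item[1]))

# Find the word that does not belong in a group
print('老师 学生 校长: which one does not match? word2vec says: ' + vectors.doesnt_match('老师 学生 校长'.split()))
print('火车 汽车 自行车 相机: which one does not match? word2vec says: ' + vectors.doesnt_match('火车 汽车 自行车 相机'.split()))
print('白色 蓝色 绿色 红色 大米: which one does not match? word2vec says: ' + vectors.doesnt_match('白色 蓝色 绿色 红色 大米'.split()))

# Look at a word vector directly
print('Feature vector of "中国":')
print(vectors['中国'])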
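The similarity scores printed above are cosine similarities between word vectors. The following standalone Python sketch shows that calculation on tiny hand-made vectors, along with a simple odd-one-out rule based on the mean of the normalised vectors. The two-dimensional vectors are hypothetical, and `odd_one_out` is a rough approximation written for illustration, not gensim's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group's mean direction."""
    unit = {}
    for w in words:
        norm = math.sqrt(sum(x * x for x in vectors[w]))
        unit[w] = [x / norm for x in vectors[w]]
    dims = len(next(iter(unit.values())))
    mean = [sum(unit[w][d] for w in words) / len(words) for d in range(dims)]
    return min(words, key=lambda w: cosine_similarity(unit[w], mean))


# Hypothetical 2-dimensional vectors for illustration only.
toy_vectors = {
    "老师": [1.0, 0.1],
    "学生": [0.9, 0.2],
    "相机": [0.0, 1.0],
}
print(cosine_similarity(toy_vectors["老师"], toy_vectors["学生"]))
print(odd_one_out(["老师", "学生", "相机"], toy_vectors))
```

Because cosine similarity ignores vector length, it compares only direction, which is why a word compared with itself always scores 1.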
