Learning Lucene's Tokenizer and TokenFilter

How TokenStream, Tokenizer, and TokenFilter relate in Lucene

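A TokenStream is a class that produces a sequence of tokens when it is called. It comes in two kinds: Tokenizer and TokenFilter. The difference is that a TokenFilter holds another TokenStream as its input, and that input may itself be a TokenFilter, so filters can be nested to any depth (a composite-style, decorator-like design). A Tokenizer takes a Reader, reads characters from it, and creates the tokens; a TokenFilter processes the incoming tokens and produces new ones by adding, removing, or modifying their attributes.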
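The wrapping relationship can be illustrated without Lucene on the classpath. The following is a minimal sketch in plain Java; `SimpleTokenStream`, `WhitespaceSplitter`, `LowerCaser`, and `MinLengthFilter` are hypothetical stand-ins invented for this illustration, not Lucene classes, and real TokenStreams expose attributes rather than plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified model of the TokenStream hierarchy. These are NOT Lucene's classes. */
final class TokenChainSketch {

    /** Stands in for TokenStream: returns the next token, or null when exhausted. */
    interface SimpleTokenStream {
        String next();
    }

    /** Stands in for a Tokenizer: reads raw text and cuts it into tokens. */
    static final class WhitespaceSplitter implements SimpleTokenStream {
        private final String[] parts;
        private int index = 0;

        WhitespaceSplitter(String text) {
            String trimmed = text.trim();
            this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        @Override
        public String next() {
            return index < parts.length ? parts[index++] : null;
        }
    }

    /** Stands in for a TokenFilter: wraps another stream and rewrites its tokens. */
    static final class LowerCaser implements SimpleTokenStream {
        private final SimpleTokenStream input;

        LowerCaser(SimpleTokenStream input) {
            this.input = input;
        }

        @Override
        public String next() {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        }
    }

    /** A second filter: drops tokens shorter than minLength. */
    static final class MinLengthFilter implements SimpleTokenStream {
        private final SimpleTokenStream input;
        private final int minLength;

        MinLengthFilter(SimpleTokenStream input, int minLength) {
            this.input = input;
            this.minLength = minLength;
        }

        @Override
        public String next() {
            String token;
            while ((token = input.next()) != null) {
                if (token.length() >= minLength) {
                    return token;
                }
            }
            return null;
        }
    }

    /** Builds tokenizer -> filter -> filter and collects everything the chain emits. */
    static List<String> analyze(String text) {
        SimpleTokenStream chain =
                new MinLengthFilter(new LowerCaser(new WhitespaceSplitter(text)), 2);
        List<String> tokens = new ArrayList<>();
        String token;
        while ((token = chain.next()) != null) {
            tokens.add(token);
        }
        return tokens;
    }
}
```

Calling `TokenChainSketch.analyze("Lucene is A Search Library")` walks the outermost filter, which pulls from the filter it wraps, which in turn pulls from the splitter; only the innermost object ever touches the raw text.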

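For some TokenFilters, the order in which tokens are processed during analysis matters a great deal. When choosing the order of the filters, also consider how that arrangement may affect the performance of the application.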
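A small plain-Java sketch shows why the order can change the result. It assumes a case-sensitive stop-word set and models the two filters as list transformations; the class, the method names, and the stop-word list are made up for illustration and are not Lucene's API.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Models two filters as list transformations to compare both orderings. */
final class FilterOrderSketch {

    // Hypothetical, lower-case-only stop-word list.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of");

    static List<String> lowerCase(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    static List<String> removeStopWords(List<String> tokens) {
        // Case-sensitive lookup: "The" is not the same entry as "the".
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    /** Lower-casing first lets the stop filter see "the" and remove it. */
    static List<String> lowerThenStop(List<String> tokens) {
        return removeStopWords(lowerCase(tokens));
    }

    /** Stopping first misses "The", which is lower-cased only afterwards. */
    static List<String> stopThenLower(List<String> tokens) {
        return lowerCase(removeStopWords(tokens));
    }
}
```

For the tokens `The`, `Art`, `of`, `Search`, the first ordering yields `art`, `search`, while the second lets `the` slip through. The second ordering also lower-cases tokens only after the stop filter has seen them, which is the kind of cost difference the performance remark refers to.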

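When a Document is indexed or searched, the Analyzer examines the text of each Field and produces a token stream. An Analyzer is built from a Tokenizer and a series of filters. The Tokenizer cuts the field content into individual words, or tokens. The filters receive the token stream emitted by the Tokenizer and transform or prune it, for example by converting tokens (Simplified to Traditional Chinese) or by discarding uninformative ones (function words and the like). Together the Tokenizer and the filters form a pipeline, or chain, that processes both the documents being indexed and the query text; this chain is what is called an Analyzer. The resulting terms are stored in the index dictionary and are matched against the query input.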

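The index-time analyzer and the query-time analyzer can also be configured separately. For example, a field can be configured so that, at index time, the text passes through a basic tokenizer, is converted to lower case, has every word that is not listed in keepword.txt filtered out, and finally has the remaining tokens converted to their synonyms; at query time, the text only passes through a basic tokenizer and is converted to lower case.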
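The two chains described above can be sketched in plain Java. This is only a model under stated assumptions: `KEEP_WORDS` is a hypothetical stand-in for the contents of keepword.txt, `SYNONYMS` is an invented synonym table, and whitespace splitting stands in for the "basic tokenizer".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Models separate index-time and query-time analysis chains. */
final class IndexQueryChainsSketch {

    // Hypothetical stand-in for the words listed in keepword.txt.
    static final Set<String> KEEP_WORDS = Set.of("lucene", "search");

    // Hypothetical synonym table: each kept word maps to the terms to index.
    static final Map<String, List<String>> SYNONYMS =
            Map.of("search", List.of("search", "retrieval"));

    /** Shared first steps: split on whitespace, then lower-case. */
    static List<String> tokenizeAndLowerCase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Index chain: tokenize, lower-case, keep-word filter, synonym expansion. */
    static List<String> analyzeForIndex(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenizeAndLowerCase(text)) {
            if (!KEEP_WORDS.contains(token)) {
                continue; // not in the keep list: dropped
            }
            out.addAll(SYNONYMS.getOrDefault(token, List.of(token)));
        }
        return out;
    }

    /** Query chain: tokenize and lower-case only. */
    static List<String> analyzeForQuery(String text) {
        return tokenizeAndLowerCase(text);
    }
}
```

Because synonyms are expanded only at index time in this arrangement, a query term such as `retrieval` needs no expansion of its own to match a document that originally contained `Search`.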

A custom TokenFilter that injects synonyms into the token stream:

    package org.liuyuantao.lucene.analyzer.util;

    import java.io.IOException;
    import java.util.Stack;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    public class MySameTokenFilter extends TokenFilter {
        private CharTermAttribute cta = null;
        private PositionIncrementAttribute pia = null;
        private AttributeSource.State current;
        private Stack<String> sames = null;
        private SamewordContext samewordContext;

        protected MySameTokenFilter(TokenStream input, SamewordContext samewordContext) {
            super(input);
            cta = this.addAttribute(CharTermAttribute.class);
            pia = this.addAttribute(PositionIncrementAttribute.class);
            sames = new Stack<String>();
            this.samewordContext = samewordContext;
        }

        @Override
        public final boolean incrementToken() throws IOException {
            if (sames.size() > 0) {
                // Pop the next pending synonym off the stack
                String str = sames.pop();
                // Restore the state captured for the original token
                restoreState(current);
                cta.setEmpty();
                cta.append(str);
                // Position increment 0: the synonym shares the original token's position
                pia.setPositionIncrement(0);
                return true;
            }
            if (!this.input.incrementToken()) return false;
            if (addSames(cta.toString())) {
                // The token has synonyms: save the current state first
                current = captureState();
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            // Discard synonyms and state left over from a previous stream
            sames.clear();
            current = null;
        }

        private boolean addSames(String name) {
            String[] sws = samewordContext.getSamewords(name);
            if (sws != null) {
                for (String str : sws) {
                    sames.push(str);
                }
                return true;
            }
            return false;
        }
    }

The Analyzer, which chains the Tokenizer and the filters together:

    package org.liuyuantao.lucene.analyzer.util;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.liuyuantao.lucene.utils.IKTokenizer5x;

    public class MySameAnalyzer extends Analyzer {
        private SamewordContext samewordContext;

        public MySameAnalyzer(SamewordContext swc) {
            samewordContext = swc;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new IKTokenizer5x(true);
            // Chain: tokenizer -> lower case -> stop words -> synonyms
            return new TokenStreamComponents(tokenizer,
                    new MySameTokenFilter(
                            new StopFilter(new LowerCaseFilter(tokenizer),
                                    new CharArraySet(StopAnalyzer.ENGLISH_STOP_WORDS_SET, true)),
                            this.samewordContext));
        }
    }

A simple synonym lookup engine:

    // SamewordContext.java
    package org.liuyuantao.lucene.analyzer.util;

    /**
     * Synonym dictionary engine
     */
    public interface SamewordContext {
        public String[] getSamewords(String name);
    }

    // SimpleSamewordContext.java (the implementation)
    package org.liuyuantao.lucene.analyzer.util;

    import java.util.HashMap;
    import java.util.Map;

    public class SimpleSamewordContext implements SamewordContext {
        Map<String, String[]> maps = new HashMap<String, String[]>();

        public SimpleSamewordContext() {
            maps.put("中国", new String[]{"天朝", "大陆"});
            maps.put("我", new String[]{"咱", "俺"});
        }

        @Override
        public String[] getSamewords(String name) {
            return maps.get(name);
        }
    }

Testing the final result:

    /**
     * Tests the synonym dictionary
     */
    @Test
    public void test05() {
        try {
            Analyzer a2 = new MySameAnalyzer(new SimpleSamewordContext());
            String txt = "我来自中国人，我来自中国山东临沂";
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(a2));
            Document doc = new Document();
            doc.add(new TextField("content", txt, Field.Store.YES));
            writer.addDocument(doc);
            writer.close();
            // Print every token the analyzer produces for the text
            AnalyzerUtils.displayAllTokenInfo(txt, a2);
            System.out.println("----------查询-------------"); // "query"
            System.out.println("使用咱进行查询"); // "searching with 咱"
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            TopDocs tds1 = searcher.search(new TermQuery(new Term("content", "咱")), 10);
            ScoreDoc[] scoreDocs1 = tds1.scoreDocs;
            if (scoreDocs1 != null && scoreDocs1.length > 0) {
                for (ScoreDoc sd : scoreDocs1) {
                    Document d = searcher.doc(sd.doc);
                    System.out.println(d.get("content"));
                }
            }
            System.out.println("使用大陆进行查询"); // "searching with 大陆"
            TopDocs tds2 = searcher.search(new TermQuery(new Term("content", "大陆")), 10);
            ScoreDoc[] scoreDocs2 = tds2.scoreDocs;
            if (scoreDocs2 != null && scoreDocs2.length > 0) {
                for (ScoreDoc sd : scoreDocs2) {
                    Document d = searcher.doc(sd.doc);
                    System.out.println(d.get("content"));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Result:

    1:我[0-1]-->cn_word
    0:俺[0-1]-->cn_word
    0:咱[0-1]-->cn_word
    1:来自[1-3]-->cn_word
    1:中国人[3-6]-->cn_word
    1:我[7-8]-->cn_word
    0:俺[7-8]-->cn_word
    0:咱[7-8]-->cn_word
    1:来自[8-10]-->cn_word
    1:中国[10-12]-->cn_word
    0:大陆[10-12]-->cn_word
    0:天朝[10-12]-->cn_word
    1:山东[12-14]-->cn_word
    1:临沂[14-16]-->cn_word
    ----------查询-------------
    使用咱进行查询
    我来自中国人，我来自中国山东临沂
    使用大陆进行查询
    我来自中国人，我来自中国山东临沂

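The output shows that 我, 俺, and 咱 (all meaning "I") are treated as synonyms, so a search for 咱 finds the document.

Likewise 中国, 大陆, and 天朝 (all referring to China) are synonyms, so a search for 大陆 finds the document as well.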
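The order of the synonyms in the token dump, and the 0 in front of each of them, follow from the stack inside the filter. The sketch below replays that logic in plain Java, with strings of the form `increment:term` standing in for Lucene's attributes; `SynonymEmissionSketch` and `expand` are hypothetical names for this illustration and are not part of the author's code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/** Replays the stack-based synonym emission as "positionIncrement:term" strings. */
final class SynonymEmissionSketch {

    static List<String> expand(List<String> tokens, Map<String, String[]> synonyms) {
        List<String> out = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>(); // used as a LIFO stack
        for (String token : tokens) {
            // The original token advances the position by one.
            out.add("1:" + token);
            String[] same = synonyms.get(token);
            if (same != null) {
                for (String s : same) {
                    pending.push(s);
                }
            }
            // Synonyms are stacked on the same position (increment 0),
            // and come out in reverse push order.
            while (!pending.isEmpty()) {
                out.add("0:" + pending.pop());
            }
        }
        return out;
    }
}
```

With the dictionary from this post, 我 maps to the array 咱, 俺; the entries are pushed in that order and popped in reverse, which is why the dump lists 俺 before 咱. Because each synonym is emitted with a position increment of 0, it occupies the same position as the original token.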

Several of Lucene's built-in tokenizers follow. The text analyzed in each case is: please email clark.ma@gmail.com by 09, re:aa-bb

    1:[please:0->6:]
    2:[email:7->12:]
    3:[clark.ma:13->21:]
    4:[gmail.com:22->31:]
    6:[09:35->37:]
    7:[re:aa:39->44:]
    8:[bb:45->47:]

In the output above, whitespace, punctuation marks, and the @ sign are removed.

    1:[please:0->6:]
    2:[email:7->12:]
    3:[clark.ma@gmail.com:13->31:]
    5:[09:35->37:]
    6:[re:39->41:]
    7:[aa:42->44:]
    8:[bb:45->47:]

    1:[please:0->6:word]
    2:[email:7->12:word]
    3:[clark:13->18:word]
    4:[ma:19->21:word]
    5:[gmail:22->27:word]
    6:[com:28->31:word]
    7:[by:32->34:word]
    8:[re:39->41:word]
    9:[aa:42->44:word]
    10:[bb:45->47:word]

    1:[please email clark.ma@gmail.com by 09, re:aa-bb:0->47:word]

    1:[please:0->6:word]
    2:[email:7->12:word]
    3:[clark:13->18:word]
    4:[ma:19->21:word]
    5:[gmail:22->27:word]
    6:[com:28->31:word]
    7:[by:32->34:word]
    8:[re:39->41:word]
    9:[aa:42->44:word]
    10:[bb:45->47:word]

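The minimum gram size minGramSize (default 1) and the maximum gram size maxGramSize (default 2) can be configured. This tokenizer produces a comparatively large number of tokens.

For example, with minGramSize=2 and maxGramSize=3, the input abcde yields: ab, abc, bc, bcd, cd, cde, de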
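The n-gram expansion is easy to reproduce in plain Java. This is a minimal sketch that assumes grams are listed by start position and then by length; the order in which a particular Lucene version emits them may differ, and `NGramSketch` is a hypothetical helper, not a Lucene class.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates every substring whose length lies between minGram and maxGram. */
final class NGramSketch {

    static List<String> ngrams(String input, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            // Grow the gram from minGram to maxGram while it still fits in the input.
            for (int len = minGram; len <= maxGram && start + len <= input.length(); len++) {
                grams.add(input.substring(start, start + len));
            }
        }
        return grams;
    }
}
```

`NGramSketch.ngrams("abcde", 2, 3)` returns the seven grams ab, abc, bc, bcd, cd, cde, de. With the default sizes 1 and 2 the same input already produces nine grams, which illustrates why the token count grows quickly.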

Input: c:\my document\filea\fileb, tokenized with new PathHierarchyTokenizer('\\', '/')

    1:[c::0->2:word]
    [c:/my document:0->14:word]
    [c:/my document/filea:0->20:word]
    [c:/my document/filea/fileb:0->26:word]
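The prefix expansion shown above can be modeled in a few lines of plain Java. This is a sketch of the idea only, assuming one token per path prefix with the delimiter swapped for the replacement character; `PathHierarchySketch` is a hypothetical helper and does not reproduce every option of Lucene's PathHierarchyTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Emits every ancestor prefix of a path, replacing the delimiter on the way. */
final class PathHierarchySketch {

    static List<String> prefixes(String path, char delimiter, char replacement) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == delimiter) {
                // A delimiter closes the prefix collected so far.
                if (current.length() > 0) {
                    out.add(current.toString());
                }
                current.append(replacement);
            } else {
                current.append(c);
            }
        }
        // The full path is the last and longest token.
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

Calling `PathHierarchySketch.prefixes("c:\\my document\\filea\\fileb", '\\', '/')` yields the four prefixes c:, c:/my document, c:/my document/filea, and c:/my document/filea/fileb. Every token starts at offset 0, which matches the offsets in the listing.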

Two parameters are required: pattern, a regular expression, and group, the capture group to emit.

pattern="[A-Z][A-Za-z]*" group="0"

Input: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

Output: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
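The same result can be reproduced with `java.util.regex` alone. The sketch below assumes that a non-negative group means "emit that group of every match"; `PatternGroupSketch` is a hypothetical helper, and it does not cover any split-style mode a real pattern tokenizer may offer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Emits the chosen capture group of every regex match as a token. */
final class PatternGroupSketch {

    static List<String> tokens(String input, String regex, int group) {
        List<String> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group); // group 0 is the whole match
            if (token != null && !token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }
}
```

With the pattern `[A-Z][A-Za-z]*` and group 0, only words that start with a capital letter match, which is why "name", "is", "killed", and the other lower-case words are absent from the output.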

    1:[please:0->6:]
    2:[email:7->12:]
    3:[clark.ma@gmail.com:13->31:]
    4:[by:32->34:]
    5:[09:35->37:]
    6:[re:aa:39->44:]
    7:[bb:45->47:]

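Filters can be chained together, with each filter processing the tokens that the previous filter has already handled, so the order of the filters is significant. The first filter should preferably handle the most general cases, and the last filters should be the most specialized.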

    1:[abc:0->3:]
    2:[i.b.m:4->9:]
    3:[cat:10->15:]
    4:[can:16->21:]

    1:[i.b.m:0->5:]
    2:[cat's:6->11:]
    3:[can't:12->17:]

If email_type.txt lists the ALPHANUM type, all tokens of that type are kept; tokens of any other type are dropped.

    1:[i.b:0->5:]
    2:[cat:6->11:]
    3:[can:12->17:]

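a: hyphen: wi-fi becomes wi, fi
b: camel case: LoveSong becomes Love, Song (controlled by a corresponding parameter)
c: letter-digit boundary: xiaomi100 becomes xiaomi, 100
d: double hyphen: like--me becomes like, me
e: trailing 's: mother's becomes mother
f: hyphen: wi-fi becomes wifi; unlike rule a, the result is not split into two tokens
g: hyphen between digits: 400-884586 becomes 400884586
h: hyphen between letters and digits alike, every hyphen removed: wi-fi-4 becomes wifi4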
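The splitting and joining rules above can be approximated with one regular expression in plain Java. This is a simplified sketch that handles ASCII letters and digits only; `WordDelimiterSketch`, `split`, and `catenateAll` are hypothetical names for this illustration, and a real word-delimiter filter has more rules and options than are shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Approximates hyphen, case-change, and letter-digit splitting on ASCII input. */
final class WordDelimiterSketch {

    private static final Pattern BOUNDARY = Pattern.compile(
            "-+"                             // one or more hyphens (rules a and d)
            + "|(?<=[a-z])(?=[A-Z])"         // lower-to-upper case change (rule b)
            + "|(?<=[A-Za-z])(?=[0-9])"      // letter followed by digit (rule c)
            + "|(?<=[0-9])(?=[A-Za-z])");    // digit followed by letter

    /** Splits one token into its parts, dropping a trailing 's first (rule e). */
    static List<String> split(String token) {
        String stem = token.endsWith("'s")
                ? token.substring(0, token.length() - 2)
                : token;
        List<String> parts = new ArrayList<>();
        for (String part : BOUNDARY.split(stem)) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    /** Joins all parts back into a single token (rules f, g, and h). */
    static String catenateAll(String token) {
        return String.join("", split(token));
    }
}
```

For instance, `split("xiaomi100")` gives xiaomi and 100, while `catenateAll("wi-fi-4")` gives wifi4. Splitting and joining are kept as separate methods here because, in the rules above, they are separate behaviors that can apply to the same input.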

protected="protwords.txt" names a word list; the words in that list are left unmodified.