有关于 Lucene 索引中文,网上的叙述很多,为了一次到位,我直接先择了: ICTCLAS,但是关于用C#接合ICTCLAS和LUCENE的文章并不多。我还是找了个JAVA版的对着写的如下的代码。
WhitespaceAnalyzer:仅仅是去除空格,对字符没有lowcase化,不支持中文
SimpleAnalyzer:功能强于WhitespaceAnalyzer,将除去letter之外的符号全部过滤掉,并且将所有的字符lowcase化,不支持中文
StopAnalyzer:StopAnalyzer的功能超越了SimpleAnalyzer,在SimpleAnalyzer的基础上
增加了去除StopWords的功能,不支持中文
StandardAnalyzer:英文的处理能力同于StopAnalyzer.支持中文采用的方法为单字切分.
ChineseAnalyzer:来自于Lucene的sand box.性能类似于StandardAnalyzer,缺点是不支持中英文混和分词.
CJKAnalyzer:chedong写的CJKAnalyzer的功能在英文处理上的功能和StandardAnalyzer相同
但是在汉语的分词上,不能过滤掉标点符号,即使用二元切分
譬如你有词典,然后你根据正向最大匹配法或者逆向最大匹配法写了一个分词方法,却想在Lucene中应用,很简单,你只要把他们包装成Lucene的TokenStream就好了
首先要从这里下载 c# 的合适版本:
http://ictclas.org/Down_share.html
把包里的:Data 目录和Configure.xml 放到某个目录下,
实现ICTCLAS的接口:
using System;
using System.Runtime.InteropServices;
namespace ChineseAnalyzer {
[StructLayout(LayoutKind.Explicit)]
public struct result_t {
[FieldOffset(0)]
public int start;
[FieldOffset(4)]
public int length;
[FieldOffset(8)]
public int sPos;
[FieldOffset(12)]
public int sPosLow;
[FieldOffset(16)]
public int POS_id;
[FieldOffset(20)]
public int word_ID;
[FieldOffset(24)]
public int word_type;
[FieldOffset(28)]
public int weight;
}
/// <summary>
/// Description of ICTCLAS.
/// </summary>
public class ICTCLAS {
const string path = @"ICTCLAS30.dll";
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_Init")]
public static extern bool Init(String sInitDirPath);
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_ParagraphProcess")]
public static extern String ParagraphProcess(String sParagraph, int bPOStagged);
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_Exit")]
public static extern bool Exit();
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_ImportUserDict")]
public static extern int ImportUserDict(String sFilename);
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_FileProcess")]
public static extern bool FileProcess(String sSrcFilename, String sDestFilename, int bPOStagged);
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_FileProcessEx")]
public static extern bool FileProcessEx(String sSrcFilename, String sDestFilename);
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_GetParagraphProcessAWordCount")]
static extern int GetParagraphProcessAWordCount(String sParagraph);
//ICTCLAS_GetParagraphProcessAWordCount
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_ParagraphProcessAW")]
static extern void ParagraphProcessAW(int nCount, [Out, MarshalAs(UnmanagedType.LPArray)] result_t result);
[DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "ICTCLAS_AddUserWord