Word count: 8,972; pages: 24
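Abstract
With the rapid development of information technology and the widespread use of data generation and data acquisition devices, the volume of data people obtain is growing exponentially. Yet extracting information from these massive data sets has not become correspondingly easier. One reason is that data quality has declined markedly and no longer meets the needs of applications.
This paper explains why research on data quality is necessary, surveys the current hot topics in the field, and focuses on improving data quality through record linkage. Record linkage is carried out with two matching techniques, the edit distance algorithm and the Jaro-Winkler algorithm, whose principles and implementations are described. Computing the similarity of two records solves the character-edit-based string matching problem and allows duplicate and near-duplicate records to be detected, which is what makes the records linkable. The paper closes with an outlook on matching techniques in data quality research.
Keywords: data quality; record linkage; matching; edit distance; Levenshtein algorithm; Jaro-Winkler algorithm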
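As a rough illustration of the edit-distance matching summarized above, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and converts it into a similarity score. The function names and the normalization by the longer string's length are assumptions made for this example; the paper's own implementation is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as the row
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 when every
    position of the longer string has to be edited."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` returns 3, so `similarity` gives 1 - 3/7, about 0.57. A record-linkage step could treat two fields as a candidate match when this score exceeds a threshold chosen for the data set.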
Implementation of String Matching Algorithms Based on Character Editing
ABSTRACT
With the rapid development of information technology and the widespread use of data generation and acquisition equipment, the amount of data people collect is growing exponentially. However, obtaining information from these huge volumes of data has not become noticeably more convenient. One of the reasons is that data quality has decreased significantly and is insufficient to meet application requirements.
This paper introduces the necessity of data quality research, describes the current hot topics in the field, and emphasizes record linkage as a way to improve data quality. Record linkage is achieved with two matching techniques, the edit distance algorithm and the Jaro-Winkler algorithm, and the paper describes the principles and implementation of both. By calculating the similarity of two records, it solves the character-edit-based string matching problem and detects duplicate and similar records. Finally, it looks ahead to research on matching technology for data quality.
Keywords: Data Quality; Record Linkage; Matching; Edit Distance; Levenshtein Algorithm; Jaro-Winkler Algorithm
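The Jaro-Winkler measure named in the abstract can be sketched as follows. This is a minimal version of the commonly cited definition, not the paper's own code: the names `jaro` and `jaro_winkler`, the prefix scale of 0.1, and the 4-character prefix cap are assumptions for this example, and the optional "boost threshold" used by some variants is left out.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1], based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are equal and at most
    # `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a
    # transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:max_prefix], s2[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

With the textbook pair "MARTHA" and "MARHTA", `jaro` gives about 0.944 and `jaro_winkler` about 0.961, because the two strings share the three-character prefix "MAR".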
Contents
Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
Chapter 2 Edit Distance
2.1 The Idea Behind the Levenshtein Algorithm
2.2 Principle of the Levenshtein Algorithm
2.3 Implementation of the Algorithm
2.3.1 The Levenshtein Algorithm
2.3.2 Implementing the Levenshtein Algorithm
2.4 Notes on Correctness
2.5 Additional Notes on the Levenshtein Algorithm
Chapter 3 Jaro-Winkler Distance
3.1 The Jaro Algorithm
3.1.1 Principle of the Jaro Algorithm
3.1.2 Implementing the Jaro Algorithm
3.2 The Jaro-Winkler Algorithm
3.2.1 Principle of Jaro-Winkler
3.2.2 Implementing Jaro-Winkler
3.2.3 Additional Notes on the Algorithm
Conclusion
Acknowledgments
References
Appendix