Thesis information

Title:

 Design and Implementation of a Real-Time Text Indexing System

Author:

 Zhou Meizi (周美孜)

Student type:

 Bachelor's

Degree:

 Bachelor of Engineering

University:

 Renmin University of China (中国人民大学)

School:

 School of Information

Major:

 Information Security

First advisor:

 Shi Wenchang (石文昌)

Completion date:

 2014

Abstract:
Unstructured data is huge in volume and growing rapidly, so many fields urgently need efficient ways to process it. Beyond the traditional need for exact-match queries, fuzzy queries such as string matching with wildcards have become a new focus of demand in emerging fields. Such queries are usually supported by building an index over the data, but as data sizes keep growing, existing indexing methods suffer from ambiguous word segmentation, too many disk accesses, large index size, and limited matching functionality. To investigate better indexing techniques for large unstructured data and to support wildcard string queries, this thesis designs and implements a real-time text indexing system. The system indexes and queries text using an improved suffix array method, and a web client interacts with the server through socket communication, so that users can index and query large data sets conveniently from a browser. A performance comparison with FEMTO, an existing big-data indexing tool, shows that the system has advantages in index construction time, peak space usage during construction, and search functionality, but its index occupies too much space. Future work will start from the FM-index approach and study compressed full-text indexing techniques that use less space and search faster.
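The abstract does not describe the thesis's improved suffix array, so the following is only a minimal sketch of the general idea, assuming a plain suffix array with naive construction and a `*` wildcard that matches any (possibly empty) run of characters. The function names `build_suffix_array`, `find_occurrences`, and `wildcard_search` are hypothetical and are not taken from the thesis.

```python
from bisect import bisect_left


def build_suffix_array(text):
    """Naive construction: sort suffix start positions by suffix content.

    Fine for a small demo; not the construction used in the thesis.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of an exact pattern.

    Two binary searches locate the block of suffixes whose first
    len(pattern) characters equal the pattern.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def wildcard_search(text, sa, pattern):
    """Return start positions where a pattern containing '*' matches.

    Assumption: the pattern is split into literal segments, and each
    segment must occur, in order, at or after the end of the previous one.
    """
    segments = [s for s in pattern.split("*") if s]
    if not segments:
        raise ValueError("pattern needs at least one literal character")
    occurrences = [find_occurrences(text, sa, s) for s in segments]
    matches = []
    for start in occurrences[0]:
        end = start + len(segments[0])
        matched = True
        for seg, positions in zip(segments[1:], occurrences[1:]):
            k = bisect_left(positions, end)  # earliest occurrence at or after end
            if k == len(positions):
                matched = False
                break
            end = positions[k] + len(seg)
        if matched:
            matches.append(start)
    return matches


text = "banana bandana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))   # [1, 3, 11]
print(wildcard_search(text, sa, "ban*d"))  # [0, 7]
```

Taking the earliest possible occurrence of each later segment is enough to decide whether a match exists from a given start position, which keeps the wildcard step to one binary search per segment.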
Release date:

 2016-03-21
