Journal of Qingdao Ocean Shipping Mariners College

Research and Implementation of a Web Information Extraction Algorithm

WANG Meng-bo

(Guangzhou Xinhua University, Guangzhou 510520, Guangdong, China)

Keywords: web information extraction; information filtering; automatic storage

Information resources on the Internet are growing so rapidly that they are practically impossible to count, and almost every web page contains noise unrelated to its key information. Collecting the needed information by hand and storing it in a database or document is difficult and takes considerable time and labor to organize. This study therefore uses Windows as the development platform and Java as the implementation language to investigate web page information extraction, with the goal of a basic, simple, but feasible algorithm. When a user submits a keyword search in the front end, the back-end algorithm filters out the noise, automatically extracts information from smart and related websites, and stores it in a database.
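The abstract does not give the algorithm's details, so the following is only a minimal sketch of the described pipeline (keyword in, noise filtered out, matching content stored), not the author's implementation. All names (`ExtractionSketch`, `extract`, `InMemoryStore`) are hypothetical, the list of "noise" elements is an assumption, regular expressions stand in for a real HTML parser, and an in-memory list stands in for the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative sketch only; every name here is hypothetical, not taken from the paper. */
final class ExtractionSketch {

    // Assumption: these elements carry only noise (scripts, styling, navigation, footers).
    private static final Pattern NOISE_ELEMENTS = Pattern.compile(
            "(?is)<(script|style|nav|footer|aside)\\b[^>]*>.*?</\\1\\s*>");
    // Block-level tags are treated as boundaries between candidate text blocks.
    private static final Pattern BLOCK_BREAKS = Pattern.compile(
            "(?i)</?(p|div|li|h[1-6]|br|tr|td|section|article)\\b[^>]*>");
    // Any remaining tag is dropped, leaving plain text.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

    /** Returns the text blocks of the page that mention the keyword (case-insensitive). */
    static List<String> extract(String html, String keyword) {
        String withoutNoise = NOISE_ELEMENTS.matcher(html).replaceAll("\n");
        String withBreaks = BLOCK_BREAKS.matcher(withoutNoise).replaceAll("\n");
        String text = ANY_TAG.matcher(withBreaks).replaceAll(" ");
        String needle = keyword.toLowerCase(Locale.ROOT);

        List<String> hits = new ArrayList<>();
        for (String line : text.split("\n")) {
            String block = line.replaceAll("\\s+", " ").trim(); // normalize whitespace
            if (!block.isEmpty() && block.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(block);
            }
        }
        return hits;
    }

    /** Stand-in for the database; a real system would write rows through JDBC instead. */
    static final class InMemoryStore {
        private final List<String> rows = new ArrayList<>();

        void saveAll(List<String> blocks) {
            rows.addAll(blocks);
        }

        List<String> rows() {
            return new ArrayList<>(rows); // defensive copy
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head><body>"
                + "<nav>Home | Java jobs | Login</nav>"
                + "<p>Java is widely used for web crawlers.</p>"
                + "<p>Weather today: sunny.</p>"
                + "<script>var ad = 'Java ads';</script>"
                + "<footer>Copyright Java Portal</footer></body></html>";

        InMemoryStore store = new InMemoryStore();
        store.saveAll(extract(html, "java"));
        store.rows().forEach(System.out::println);
    }
}
```

In this sketch the navigation bar, script, and footer all mention the keyword but are discarded as noise before matching, so only the paragraph about web crawlers is stored. A production version would more plausibly use a proper HTML parser and a relational database, which the abstract mentions but does not specify.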