Experimental Objectives:
※ Master the working principles of a crawler and how to implement one;
※ Become familiar with the complete web-crawling workflow and its operational steps;
※ Master writing, debugging, and running a crawler application;
※ Master topic-focused crawling and content-analysis techniques;
Experiment Content:
Implement a topic crawler using a chosen algorithm or strategy, then release it onto the Web to crawl pages;
Experimental Requirements:
※ Select 2 to 6 seed sites and lock onto one topic for collecting high-quality pages, e.g. education news, an information-retrieval course, travel information, or job-recruitment information;
※ Design a crawl log file that records at least a timestamp and the URL for each fetch, laying the groundwork for later statistical analysis of the crawling process (see the logging sketch after this list);
※ When fetching any one URL, establish at most 2 connections (the number of local threads used for page parsing is unrestricted);
※ Obey crawling politeness rules: check the robots.txt file and meta tags for restrictions, and have each thread sleep 1-2 seconds after it finishes fetching a page (see the robots.txt sketch after this list);
※ Parse HTML pages and extract the link URLs; detect whether an extracted URL has already been processed, and never re-parse a page that has already been crawled (see the link-extraction sketch after this list);
※ Allow the crawler's basic parameters to be configured, including crawl depth, seed URLs, etc.;
※ Send a User-agent header to identify the crawler to servers;
※ Produce crawl statistics, including crawl speed, total time to complete the crawl, and the total number of pages crawled; comment all important variables and every class and method;
※ Follow coding conventions, e.g. naming conventions for classes, methods, and files;
※ Optional: a GUI or Web interface through which the crawler can be managed, including starting/stopping it and adding/removing URLs;
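A minimal Java sketch of the logging requirement above; the class name CrawlLog, the file path handling, and the tab-separated timestamp-URL line format are illustrative assumptions, not taken from the original report:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.time.LocalDateTime;

    // Hypothetical crawl logger: one line per fetched page, timestamp then
    // URL, which is the minimum the requirement asks for and enough raw
    // material for later statistical analysis of the crawl.
    public class CrawlLog {
        private final PrintWriter out;

        public CrawlLog(String path) throws IOException {
            // Append mode with autoflush so log lines survive a crash.
            out = new PrintWriter(new FileWriter(path, true), true);
        }

        public synchronized void record(String url) {
            out.println(LocalDateTime.now() + "\t" + url);
        }
    }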
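For the politeness requirement, a sketch of a robots.txt check plus the mandated 1-2 second pause. RobotsChecker and its parsing are simplified assumptions (only "User-agent: *" rule groups are honored, and results are not cached); a real crawler would also inspect meta robots tags in each fetched page:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: checks robots.txt before fetching, then pauses.
    public class RobotsChecker {
        private final List<String> disallowed = new ArrayList<>();

        // Download and parse the site's robots.txt, recording Disallow rules
        // that apply to all agents ("User-agent: *"). A real crawler would
        // also honor the rule group for its own User-agent string.
        public RobotsChecker(String host) {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL("http://" + host + "/robots.txt").openStream()))) {
                boolean appliesToUs = false;
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("user-agent:")) {
                        appliesToUs = line.substring(11).trim().equals("*");
                    } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                        String path = line.substring(9).trim();
                        if (!path.isEmpty()) disallowed.add(path);
                    }
                }
            } catch (Exception e) {
                // No robots.txt (or unreadable): treat everything as allowed.
            }
        }

        public boolean isAllowed(String path) {
            for (String rule : disallowed) {
                if (path.startsWith(rule)) return false;
            }
            return true;
        }

        // Politeness pause required after each fetched page (1-2 seconds).
        public static void politePause() throws InterruptedException {
            Thread.sleep(1000 + (long) (Math.random() * 1000));
        }
    }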
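And for the parse-and-deduplicate requirement, a sketch that pulls href attributes with a regular expression and keeps a synchronized set of already-seen URLs; in a real implementation an HTML parser library would be more robust than a regex. The names LinkExtractor and visited are illustrative:

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical sketch: extract href links from HTML and skip URLs
    // that have already been crawled.
    public class LinkExtractor {
        // Matches href="..." values; URLs containing '#' are skipped so
        // in-page fragment anchors are not treated as new pages.
        private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"#]+)\"", Pattern.CASE_INSENSITIVE);

        // Thread-safe set of URLs already seen, so no page is parsed twice.
        private final Set<String> visited =
            Collections.synchronizedSet(new HashSet<>());

        public List<String> extractNewLinks(String html, URL base) {
            List<String> fresh = new ArrayList<>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    // Resolve relative links against the page's own URL.
                    String abs = new URL(base, m.group(1)).toExternalForm();
                    if (visited.add(abs)) {   // add() is false if already present
                        fresh.add(abs);
                    }
                } catch (Exception ignored) { /* malformed URL: skip it */ }
            }
            return fresh;
        }
    }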
This system computes topic relevance from page content, has a clean architecture, and can collect pages related to a specified topic from the Web. Experiments show that the system performs well and accurately crawls to high-quality pages.
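The report does not spell out how the content-based relevance score is computed. A common choice, and a plausible reading of the term-weight vectors mentioned below, is cosine similarity between term-frequency vectors of the page text and of a topic description; the following is a sketch under that assumption, with all names (Relevance, isRelevant, threshold) hypothetical:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Assumed relevance measure: cosine similarity between the term-frequency
    // vector of the page text and that of the topic description.
    public class Relevance {
        static Map<String, Integer> termFreq(String text) {
            Map<String, Integer> tf = new HashMap<>();
            for (String w : text.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) tf.merge(w, 1, Integer::sum);
            }
            return tf;
        }

        static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
            Set<String> vocab = new HashSet<>(a.keySet());
            vocab.addAll(b.keySet());
            double dot = 0, na = 0, nb = 0;
            for (String t : vocab) {
                int x = a.getOrDefault(t, 0), y = b.getOrDefault(t, 0);
                dot += x * y;
                na += x * x;
                nb += y * y;
            }
            return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // A page is kept when its similarity to the topic exceeds a threshold.
        static boolean isRelevant(String pageText, String topic, double threshold) {
            return cosine(termFreq(pageText), termFreq(topic)) >= threshold;
        }
    }

In practice the topic vector would be built from the seed pages or a keyword list, and the threshold tuned on sample pages.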
Because this experiment directly parses and relevance-checks every link found on a fetched page, it may parse a large number of useless, irrelevant pages such as navigation links and advertisement links. One improvement would be to rank the extracted URLs by some algorithm before computing relevance, so that the higher-quality pages at the front of the queue are analyzed first. Furthermore, when the number of threads exceeds the number of seed URLs, this experiment tries to make full use of the surplus threads by having any thread that initially finds the to-crawl queue empty sleep for 30 seconds. This arrangement is limited: when a thread wakes up, a relevant URL may still not have been enqueued. The improvement is to coordinate the threads with wait() and notify() instead, as in the sketch below.
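The wait()/notify() improvement proposed above could look like the following sketch: a shared frontier on which idle threads block until a URL actually arrives, rather than sleeping a fixed 30 seconds. The class name UrlFrontier is an assumption:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch of the suggested improvement: threads wait() on the empty
    // frontier and are woken by notifyAll() as soon as a URL is enqueued,
    // instead of sleeping for a fixed 30 seconds.
    public class UrlFrontier {
        private final Deque<String> queue = new ArrayDeque<>();

        public synchronized void enqueue(String url) {
            queue.addLast(url);
            notifyAll();              // wake any thread blocked in dequeue()
        }

        public synchronized String dequeue() throws InterruptedException {
            while (queue.isEmpty()) { // loop guards against spurious wakeups
                wait();
            }
            return queue.removeFirst();
        }
    }

java.util.concurrent.LinkedBlockingQueue provides the same blocking behavior out of the box, but the explicit wait()/notify() version matches the improvement the report names.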
The experiment also ran into some problems. For example, page text is obtained by parsing the title and p tags; when a page has no text in its title or p tags, the downstream code fails at run time because it cannot compute the term-weight vector. The fix is to check whether the parsed text is empty and, if it is, to set every element of the corresponding weight vector directly to 0.
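The fix described above amounts to a guard in front of the weighting code; a sketch follows, where indexing the vector by topic terms and the raw occurrence-count weighting are stand-in simplifications, not the report's actual formula:

    // Guard from the report: if the parsed title/p text is empty, every
    // element of the weight vector is set directly to 0, so downstream
    // code never fails on a missing vector.
    static double[] weightsFor(String parsedText, String[] topicTerms) {
        double[] w = new double[topicTerms.length]; // Java zero-initializes
        if (parsedText == null || parsedText.trim().isEmpty()) {
            return w; // all zeros, exactly the fix the report describes
        }
        String text = parsedText.toLowerCase();
        for (int i = 0; i < topicTerms.length; i++) {
            // Simplified stand-in weighting: raw occurrence count per term.
            String term = topicTerms[i].toLowerCase();
            if (term.isEmpty()) continue;
            int count = 0, from = 0;
            while ((from = text.indexOf(term, from)) != -1) {
                count++;
                from += term.length();
            }
            w[i] = count;
        }
        return w;
    }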