Coherent Keyphrase Extraction via Web Mining: Selected Sample Paper
2016-01-21 Source: 51due teaching staff Category: Sample paper
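The approach uses the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. Experiments show that the enhancements improve the quality of the extracted keyphrases. The algorithm generalizes well when it is trained on one domain and tested on another (physics documents). The sample paper below gives the details.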
Abstract
Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).
Introduction
A journal article is often accompanied by a list of keyphrases, composed of about five to fifteen important words and phrases that express the primary topics and themes of the paper. For an individual document, keyphrases can serve as a highly condensed summary, they can supplement or replace the title as a label for the document, or they can be highlighted within the body of the text, to facilitate speed reading (skimming). For a collection of documents, keyphrases can be used for indexing, categorizing (classifying), clustering, browsing, or searching. Keyphrases are most familiar in the context of journal articles, but many other types of documents could benefit from the use of keyphrases, including web pages, email messages, news reports, magazine articles, and business papers.
The vast majority of documents currently do not have keyphrases. Although the potential benefit is large, it would not be practical to manually assign keyphrases to them. This is the motivation for developing algorithms that can automatically supply keyphrases for a document. Section 2.1 discusses past work on this task. This paper focuses on one approach to supplying keyphrases, called keyphrase extraction. In this approach, a document is decomposed into a set of phrases, each of which is considered as a possible candidate keyphrase. A supervised learning algorithm is taught to classify candidate phrases as keyphrases and non-keyphrases. The induced classification model is then used to extract keyphrases from any given document [Turney, 1999, 2000; Frank et al., 1999; Witten et al., 1999, 2000].
A limitation of prior keyphrase extraction algorithms is that the output keyphrases are at times incoherent. For example, if ten keyphrases are selected for a given document, eight of them might fit well together, but the remaining two might be outliers, with no apparent semantic connection to the other eight or to each other. Informal analysis of many machine-extracted keyphrases suggests that these outliers almost never correspond to author-assigned keyphrases. Thus discarding the incoherent candidates might improve the quality of the machine-extracted keyphrases. Section 2.2 examines past work on measuring the coherence of text. The approach used here is to measure the degree of statistical association among the candidate phrases [Church and Hanks, 1989; Church et al., 1991]. The hypothesis is that semantically related phrases will tend to be statistically associated with each other, and that avoiding unrelated phrases will tend to improve the quality of the output keyphrases.
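The hypothesis can be illustrated with a small sketch. It is not the paper's method, which adds coherence features to a learned classifier; it only shows the simpler idea of scoring each candidate by its mean association with the other candidates and dropping those that score low. The function names, the threshold, and the association values in `TOY_ASSOC` are all made up for illustration.

```python
def coherence_scores(phrases, assoc):
    """Map each phrase to its mean association with the other phrases."""
    scores = {}
    for p in phrases:
        others = [q for q in phrases if q != p]
        if others:
            scores[p] = sum(assoc(p, q) for q in others) / len(others)
        else:
            scores[p] = 0.0
    return scores


def drop_outliers(phrases, assoc, threshold):
    """Keep only phrases whose mean association reaches the threshold."""
    scores = coherence_scores(phrases, assoc)
    return [p for p in phrases if scores[p] >= threshold]


# Made-up association values for illustration only (not measured data).
TOY_ASSOC = {
    frozenset(["keyphrase extraction", "machine learning"]): 3.0,
    frozenset(["keyphrase extraction", "text classification"]): 2.5,
    frozenset(["machine learning", "text classification"]): 2.0,
}


def toy_assoc(p, q):
    """Symmetric lookup; unlisted pairs count as unassociated (0.0)."""
    return TOY_ASSOC.get(frozenset([p, q]), 0.0)


candidates = [
    "keyphrase extraction",
    "machine learning",
    "text classification",
    "coffee beans",
]
coherent = drop_outliers(candidates, toy_assoc, threshold=1.0)
print(coherent)
```

In this toy setting "coffee beans" has no association with the other three phrases, so it is the only candidate removed.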
Assignment versus Extraction
There are two general approaches to automatically supplying keyphrases for a document: keyphrase assignment and keyphrase extraction. Both approaches use supervised machine learning from examples. In both cases, the training examples are documents with manually supplied keyphrases. In keyphrase assignment, there is a predefined list of keyphrases (in the terminology of library science, a controlled vocabulary or controlled index terms). These keyphrases are treated as classes, and techniques from text classification (text categorization) are used to learn models for assigning a class to a given document [Leung and Kan, 1997; Dumais et al., 1998].
Usually the learned models will map an input document to several different controlled vocabulary keyphrases. In keyphrase extraction, keyphrases are selected from within the body of the input document, without a predefined list. When authors assign keyphrases without a controlled vocabulary (in library science, free text keywords or free index terms), typically from 70% to 90% of their keyphrases appear somewhere in the body of their documents [Turney, 1999]. This suggests the possibility of using author-assigned free text keyphrases to train a keyphrase extraction system. In this approach, a document is treated as a set of candidate phrases and the task is to classify each candidate phrase as either a keyphrase or non-keyphrase [Turney, 1999, 2000; Frank et al., 1999; Witten et al., 1999, 2000].
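As a rough sketch of this framing, the snippet below decomposes a document into candidate phrases and labels each one by whether it matches an author-assigned keyphrase, which is the form of training data a classifier would need. The candidate rules used here (n-grams of up to three words that do not start or end with a stopword), the tiny stopword list, and the function names are assumptions for illustration, not the procedure of any cited system.

```python
import re

# Small illustrative stopword list (an assumption; real lists are longer).
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "on", "for",
    "to", "is", "are", "with", "that", "by",
}


def candidate_phrases(text, max_len=3):
    """Return word n-grams (1..max_len words) usable as candidates."""
    candidates = set()
    # Split on punctuation so a phrase never spans a clause boundary.
    for chunk in re.split(r"[^\w\s-]+", text.lower()):
        words = chunk.split()
        for i in range(len(words)):
            for n in range(1, max_len + 1):
                gram = words[i:i + n]
                if len(gram) < n:
                    break  # ran past the end of the chunk
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                candidates.add(" ".join(gram))
    return candidates


def label_candidates(text, author_keyphrases):
    """Label each candidate True if it is an author-assigned keyphrase."""
    gold = {k.lower() for k in author_keyphrases}
    return {p: (p in gold) for p in candidate_phrases(text)}


doc = "Keyphrase extraction is a supervised learning task."
labels = label_candidates(doc, ["Keyphrase extraction", "machine learning"])
for phrase in sorted(labels):
    print(phrase, labels[phrase])
```

Note that "machine learning" never becomes a training example here because it does not occur in the text, which mirrors the point that extraction can only recover keyphrases that appear in the document body.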
Coherence
An early study of coherence in text was the work of Halliday and Hasan [1976]. They argued that coherence is created by several devices: the use of semantically related terms, coreference, ellipsis, and conjunctions. The first device, semantic relatedness, is particularly useful for isolated words and phrases, outside of the context of sentences and paragraphs. Halliday and Hasan [1976] called this device lexical cohesion. Morris and Hirst [1991] computed lexical cohesion by using a thesaurus to measure the relatedness of words. Recent work on text summarization has used lexical cohesion in an effort to improve the coherence of machine-generated summaries. Barzilay and Elhadad [1997] used the WordNet thesaurus to measure lexical cohesion in their approach to summarization.
Keyphrases are often specialized technical phrases of two or three words that do not appear in a thesaurus such as WordNet. In this paper, instead of using a thesaurus, statistical word association is used to estimate lexical cohesion. The idea is that phrases that often occur together tend to be semantically related. There are many statistical measures of word association [Manning and Schütze, 1999]. The measure used here is Pointwise Mutual Information (PMI) [Church and Hanks, 1989; Church et al., 1991]. PMI can be used in conjunction with a web search engine, which enables it to effectively exploit a corpus of about one hundred billion words [Turney, 2001]. Experiments with synonym questions, taken from the Test of English as a Foreign Language (TOEFL), show that word association, measured with PMI and a web search engine, corresponds well to human judgements of synonymy relations between words [Turney, 2001].
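A minimal sketch of the PMI computation follows, assuming the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))) with probabilities estimated from counts. With a web search engine, the counts would be hit counts for each phrase alone and for both phrases together. How those counts are obtained is not shown, and the function name, the `total` parameter, and the numbers in the usage lines are hypothetical.

```python
import math


def pmi(hits_x, hits_y, hits_xy, total):
    """Pointwise mutual information in bits, estimated from counts.

    hits_x, hits_y: number of documents containing each phrase
    hits_xy:        number of documents containing both phrases
    total:          assumed size of the document collection
    """
    if min(hits_x, hits_y, hits_xy) <= 0:
        return float("-inf")  # no co-occurrence evidence at all
    # p(x,y) / (p(x) * p(y)) simplifies to hits_xy * total / (hits_x * hits_y)
    return math.log2(hits_xy * total / (hits_x * hits_y))


# Hypothetical counts, for illustration only.
associated = pmi(hits_x=100, hits_y=200, hits_xy=50, total=10_000)
independent = pmi(hits_x=100, hits_y=200, hits_xy=2, total=10_000)
print(associated, independent)
```

A positive value means the two phrases co-occur more often than independence would predict; a value of zero means they co-occur exactly as often as chance.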
Conclusion
This paper provides evidence that statistical word association can be used to improve the coherence of keyphrase extraction, resulting in higher quality keyphrases, measured by the degree of overlap with the authors’ keyphrases. Furthermore, the new coherence features are not domain-specific.
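One simple way to quantify overlap with the authors' keyphrases is to count exact matches after normalizing case and whitespace. The sketch below makes that assumption; the matching rule, the function name, and the example phrases are illustrative and are not taken from the paper's evaluation procedure.

```python
def keyphrase_overlap(extracted, author):
    """Return (match count, precision, recall) for extracted keyphrases."""

    def normalize(phrase):
        # Lowercase and collapse runs of whitespace.
        return " ".join(phrase.lower().split())

    ext = {normalize(p) for p in extracted}
    gold = {normalize(p) for p in author}
    matched = ext & gold
    precision = len(matched) / len(ext) if ext else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return len(matched), precision, recall


# Illustrative phrases only.
machine = ["Web mining", "keyphrase  extraction", "search engine", "coffee"]
authors = ["keyphrase extraction", "web mining", "coherence"]
n_matched, precision, recall = keyphrase_overlap(machine, authors)
print(n_matched, precision, recall)
```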
