Hend Alrasheed

Abstract

Keyword extraction refers to the process of detecting the most relevant terms and expressions in a given text in a timely manner. In the information explosion era, keyword extraction has attracted increasing attention. The importance of keyword extraction in text summarization, text comparisons, and document categorization has led to an emphasis on graph-based keyword extraction techniques because they can capture more structural information compared to other classic text analysis methods. In this paper, we propose a simple unsupervised text mining approach that aims to extract a set of keywords from a given text and analyze its topic diversity using graph analysis tools. Initially, the text is represented as a directed graph using synonym relationships. Then, community detection and other measures are used to identify keywords in the text. The set of extracted keywords is used to assess topic diversity within the text and analyze its sentiment. The proposed approach relies on grouping semantically similar candidate words. This approach ensures that the set of extracted keywords is comprehensive. Differing from other graph-based keyword extraction approaches, the proposed method does not require user parameters during graph construction and word scoring. The proposed approach achieved significant results compared to other keyword extraction techniques.

Citation: Alrasheed H (2021) Word synonym relationships for text analysis: A graph-based approach. PLoS ONE 16(7): e0255127. https://doi.org/10.1371/journal.pone.0255127

Editor: Diego Raphael Amancio, University of Sao Paulo, BRAZIL

Received: January 7, 2021; Accepted: July 9, 2021; Published: July 27, 2021

Copyright: © 2021 Hend Alrasheed. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All text datasets are available at https://github.com/halrashe/Topic-Diversity.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Currently, social media outlets produce extremely large amounts of data. Text analysis provides an effective way to process and utilize the most relevant data. Such analysis supports various applications in different domains, such as marketing, content filtering, and search. Manual processing of the huge number of documents available online is tedious, time-consuming, and error-prone. Text mining refers to the automatic extraction of information and the identification of valuable and previously unknown hidden patterns from unstructured textual data [1]. Text mining algorithms make it possible to process huge amounts of unstructured textual data efficiently and effectively.

Many text mining techniques, such as text summarization, text comparisons, document categorization, and document similarity measurements, depend on the extraction of a representative set of keywords from the given text [2, 3]. Keywords can be defined as a set of one or more words that provides a compact representation of the content of a text document [4]. Automatic keyword extraction refers to the process of detecting the most relevant terms and expressions from a given text in a timely manner. Keyword extraction approaches can be categorized as statistical [5–7], machine learning [8, 9], linguistic [10], and graph-based approaches [3, 11–18]. Due to their simplicity, statistical approaches, such as term frequency, do not always produce good results [12]. Machine learning approaches require data training and can be biased toward the domain on which they were trained [19]. Graph-based keyword extraction techniques can capture more structural information about the text compared to other classic text analysis methods [20].

The underlying principle in graph-based keyword extraction is measuring and identifying the most important vertices (words) based on information obtained from the structure of the constructed text graph. Such vertices can be obtained using node centrality measures, such as degree centrality, closeness centrality, PageRank [12, 13, 15, 18] and k-degeneracy [18, 21]. However, keyword extraction approaches vary according to their text graph construction techniques, which directly impacts the ranking of the candidate keywords. Most of the proposed graph-based keyword extraction approaches depend on word co-occurrences; therefore, they do not necessarily generate a set of keywords that covers the main topics discussed in the text [22]. Moreover, most existing graph-based keyword extraction approaches require user parameters [12, 18, 21].

In this paper, we propose a simple unsupervised text mining approach that aims to extract a set of key concepts from a given text and then analyze its topic diversity using graph analysis tools. The proposed approach relies on grouping semantically similar candidate words; as a result, the extracted set of keywords is ensured to be comprehensive. Moreover, differing from other existing graph-based keyword extraction approaches [12, 18, 21], the proposed method does not require user parameters during graph construction and word scoring.

The text is first represented as a directed synonym graph. A word v is a direct synonym of another word u if it has a similar meaning. Word v is an indirect synonym of word u if a word w that is a synonym of v is also a synonym of word u. For example, the word “publication” is a direct synonym of the word “book” while the word “paper” is an indirect synonym of the word “book” (Fig 1). Direct synonym relationships between word pairs represent stronger relationships compared to indirect ones. Once the text graph is constructed, community detection and other measures are used to identify keywords. The set of most central vertices in each community is included in the set of keywords and is ranked according to the community qualities. The quality of each community is assessed according to its attributes, such as size and diameter. The set of extracted keywords is used to assess topic diversity. Finally, sentiment analysis is conducted to identify the general orientation of the opinions in the text. The proposed approach achieved significant results compared to other keyword extraction techniques. Our primary contributions are as follows. (1) We propose a graph-based text representation approach using word direct and indirect synonym relationships. (2) We extract a representative set of keywords from the text graph based on its structure. We primarily use vertex centrality and the community structure. (3) We analyze the topic diversity and sentiment of the text using the set of extracted keywords.

Topic diversity refers to the presence of multiple, possibly contradictory, topics in a given text [23]. Several diversity dimensions have been discussed in the literature, including diversity in topic, diversity in viewpoint, and diversity in language. Here, we focus on topic diversity, which is the diverse representation of information (by including different ideas, dimensions, beliefs, perspectives, or feelings) on a specific topic. The greater the number of topics in a conversation, the more diverse it is. Text sentiment analysis attempts to extract the semantic orientation conveyed in the text, which can be positive, negative, or neutral [23]. Topic diversity and sentiment analysis have many applications in health care, public opinion analysis, social relationship analysis, marketing, and sales predictions [24]. Various methods for topic modeling and extraction [25–27] and sentiment analysis [28, 29] have been discussed in the literature. In this study, we use the structure of the text graph and the set of extracted keywords to assess the topic diversity of a given text and analyze its sentiment.

There are several limitations of the current study. First, the proposed method associates word pairs based on their synonym relationships and does not consider word contexts. Moreover, identifying the words that actively contribute to the meaning of the text during preprocessing is challenging because part of speech (POS) taggers are usually trained on a different dataset. Finally, the community detection approach that produces the most accurate keyword set requires further exploration.

Preliminaries

A graph is a mathematical representation that allows the effective exploration of the relationships between the elements of a system. A given text T can be represented as a directed weighted graph G = (V, E), where V is the set of vertices (words) and E ⊆ V × V is the set of edges. Edges exist between node pairs based on a specific text relation between them.

Each node u has an in-degree degree_in(u) that represents the number of edges pointing toward u. Moreover, each vertex u has an out-degree degree_out(u) that represents the number of edges pointing out from u toward other vertices. The distance d(u, v) between a pair of vertices u and v in G is defined as the number of edges on a shortest path between u and v. The weighted distance between u and v is defined as the sum of the weights of all the edges that exist on a shortest path between u and v. Here, a shortest path is a path that minimizes the distance between a vertex pair. The diameter of a graph diam(G) is the length of a longest shortest path between any two vertices u and v in G, i.e., diam(G) = max_{u,v∈V} d(u, v).

A subgraph G_W = (W, E_W), where W ⊆ V and E_W = {uv ∈ E : u, v ∈ W}, is called the subgraph of G induced by the vertex set W. A strongly connected component in a directed graph is a subgraph in which each vertex is reachable from every other vertex in the subgraph. A weakly connected component in a directed graph is a subgraph in which each vertex is reachable from every other vertex when the direction of the edges is ignored. A singleton vertex u is a vertex with no connections to any other vertex in the graph, i.e., degree_in(u) = degree_out(u) = 0.

In graph theory, centrality measures rank vertices based on their importance in the graph. Degree centrality considers the central vertices of the graph as the set of vertices that have the highest number of connections. Betweenness centrality expresses how much effect each vertex has in the communication process between other vertices. Finally, closeness centrality considers the graph center as the subset of vertices with the minimum total distance to all other vertices.

The clustering coefficient of a given graph G, denoted by CC(G), measures the extent to which vertices tend to cluster together. CC(G) = (1/|V|) Σ_{u∈V} CC(u), where CC(u) is the clustering coefficient of a single vertex u. CC(u) is computed as the proportion of edges among u’s neighbors to the number of all possible edges that could exist within the neighborhood of vertex u.

A graph community refers to a set of vertices that is densely connected internally and loosely connected to other vertices outside the community. Graph modularity [30], denoted by M ∈ [−1, 1], is a graph property that measures the quality of a proposed division of a graph into distinct communities. M is positive when the number of edges between the vertices within the communities is higher than would be expected by chance (indicating a better community division) and negative when the number of edges is less than would be expected by chance.

Several methods have been proposed for community detection in networks. The Louvain algorithm is a greedy algorithm that attempts to optimize the modularity of a network partition. First, the algorithm looks for small communities by optimizing modularity locally. Second, it builds a new network by aggregating vertices within each community. The steps are repeated until a maximum of modularity is attained. This process naturally produces a hierarchical decomposition of the network.

The Leiden community detection algorithm [31], which is an extension of the Louvain algorithm, partitions the vertices into different communities that are guaranteed to be connected. The proposed communities are then refined by splitting them further into multiple partitions or merging vertices with a randomly chosen community.

In this work, we use both the Louvain and the Leiden community detection algorithms because they do not require a priori knowledge of the number of communities that will be detected. Moreover, both algorithms have the advantage of finding high quality communities in a time-efficient manner.
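
The sketch below shows one way the two algorithms can be invoked from Python; it is not the paper's implementation, and the package choices (NetworkX 3.x for Louvain, python-igraph plus leidenalg for Leiden) as well as the conversion to an undirected igraph graph are assumptions made for illustration.

```python
# Minimal sketch: community detection with Louvain (NetworkX) and Leiden
# (igraph + leidenalg). Not the paper's code; package choices are assumptions.
import networkx as nx

def louvain_partition(g):
    # Returns a list of vertex sets, one per community, using edge weights.
    return nx.community.louvain_communities(g, weight="weight", seed=42)

def leiden_partition(g):
    import igraph as ig
    import leidenalg
    # Convert to igraph (edge direction dropped here for simplicity) and
    # optimize modularity; Leiden guarantees every community is connected.
    h = ig.Graph.from_networkx(g.to_undirected())
    parts = leidenalg.find_partition(h, leidenalg.ModularityVertexPartition)
    names = h.vs["_nx_name"]
    return [{names[v] for v in community} for community in parts]
```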

Related work

This work proposes a keyword extraction technique by exploiting the structure of the text graph and the synonym relationships between words. The set of extracted keywords is used to assess the topic diversity and sentiment of the text. In this section, we review relevant keyword extraction, sentiment analysis, and topic diversity approaches with the focus on graph-based text analysis techniques.

Modeling text as graphs

Modeling text as graphs attempts to uncover text patterns and analyze linguistic properties hidden in the text. Graph-based approaches require transforming the text into a structured format (graph) by identifying the graph vertices and edges. First, a subset of the words in the text is selected as vertex candidates. Second, the relationships connecting vertex pairs need to be identified. Relationships between vertex pairs vary from very simple ones, such as word co-occurrences (words that appear together, in the same sequence, or within a specific window) [3, 13, 15, 16, 18], to more complex ones, such as word semantic relationships [11, 17] and word syntax relationships [14].

Text graph representation has been used for keyword extraction [3, 11–19], text summarization [32], and language classification [33]. Modeling text as graphs has also been used for text semantic analysis including information retrieval [34, 35] and authorship attribution analysis [36, 37], and word sense disambiguation [38, 39].

Text graph representation can be enhanced using the concept of word embedding [40], in which words are represented by vectors that capture their semantic and contextual features. Vector similarity measures are used to capture the similarity between words. In a word co-occurrence graph, for example, identifying words that are semantically similar may not be straightforward: “hard” and “difficult” may be mapped into two distinct vertices. In this case, word embedding can be used to map words conveying the same meaning into the same vertex by adding virtual edges between words with similar vectors. Word embedding strategies include Word2Vec and FastText. Word2Vec [41] defines dense vector representations of words using a three-layer neural network with a single hidden layer. FastText [42] represents each word as a bag of character n-grams; the neural network therefore trains on several n-grams of each word, and the word vector is the sum of the vectors obtained for the character n-grams of the word.
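
As a brief illustration of this idea (an assumption on our part, not part of the reviewed methods), a gensim Word2Vec model can score the similarity between two candidate words; word pairs whose vectors exceed a chosen similarity threshold could then be linked by a virtual edge or merged into one vertex.

```python
# Sketch: scoring word similarity with gensim's Word2Vec (gensim >= 4 API).
# The toy corpus and the 0.7 threshold are illustrative assumptions.
from gensim.models import Word2Vec

corpus = [["hard", "work", "pays", "off"],
          ["difficult", "work", "pays", "off"],
          ["easy", "tasks", "finish", "fast"]]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, seed=1)

def should_link(w1, w2, threshold=0.7):
    # Add a virtual edge (or merge vertices) when the cosine similarity of
    # the two word vectors exceeds the threshold.
    return model.wv.similarity(w1, w2) >= threshold

print(should_link("hard", "difficult"))
```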

An approach that uses word embedding for keyword extraction was proposed in [43]. The evaluation showed that using word embedding for keyword extraction outperforms many baseline algorithms. Keyword extraction using word embedding was also explored in [44]. First, a word embedding model that integrates local context information of the word graph is used to represent each word. Second, a novel PageRank-based model that incorporates the embedded information is used to rank the candidate words. In [45], the authors investigated whether adding virtual edges using word embedding in co-occurrence graphs may improve the quality of text classification tasks. Their results showed that using word embedding increased the classification performance compared to using traditional co-occurrence graphs.

Keyword extraction

Keyword extraction refers to the process of detecting the most relevant terms and expressions from a given text. Here, the goal is to summarize the text content and highlight its main topics. Automatic keyword extraction is a key step for multiple text mining applications, including summarization, classification, clustering, and topic detection [2, 3]. Keyword extraction techniques range from simple statistical approaches, such as word frequency [5] and word collocation and co-occurrence [6], to more advanced machine learning approaches, such as Naive Bayes [8] and support vector machines [9].

Recent keyword extraction methods use both statistics and context information. For example, YAKE [7] relies on the word position and frequency as well as new statistical metrics that capture context information. YAKE calculates five features for each term, i.e., case, position, frequency, relatedness to context, and how often a candidate word appears in different sentences. Then, all features are used to compute a score for each term.
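
For reference, the sketch below shows how single-word keywords can be obtained with the yake Python package; the input text and the parameter values are illustrative only.

```python
# Sketch: single-word keyword extraction with the yake package.
import yake

text = ("Keyword extraction refers to the process of detecting the most "
        "relevant terms and expressions in a given text.")
extractor = yake.KeywordExtractor(lan="en", n=1, top=5)
# extract_keywords returns (keyword, score) pairs; lower scores indicate
# more relevant terms in YAKE's scoring scheme.
for keyword, score in extractor.extract_keywords(text):
    print(keyword, round(score, 4))
```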

Multiple graph-based text representations have been used for keyword extraction such as word co-occurrence [3, 13, 15, 16, 18], word semantic relationships [11, 17], and syntax relationships [14]. The underlying principle in graph-based keyword extraction is measuring and identifying the most important vertices (words) based on information obtained from the structure of the graph, e.g., vertex centrality [3, 12, 13, 15, 16, 18] and k-degeneracy [18, 21].

A comparison of five centrality measures (degree, closeness, betweenness, eigenvector, and TextRank) showed that the simple degree centrality measure achieved results comparable to those of the widely used TextRank [12] algorithm. In addition, it has been shown that closeness centrality outperforms the other centrality measures on short documents [46].

TextRank [12] is a popular graph-based keyword and sentence extraction technique. TextRank uses a word co-occurrence relation to control the distance between word occurrences: two lexical units (vertices) are connected if they occur within a window of at most N words.

Syntactic filters can be used to select lexical units of a certain part of speech to be added to the graph. Vertices are then ranked based on the PageRank algorithm and the top vertices are returned as keywords. In [47], the authors construct the text graph based on word semantic similarity and then use PageRank centrality to extract keywords. [48] introduced a text analysis and visualization method using graph analysis tools to identify the pathways for meaning circulation.

Keyword extraction from a collection of texts using semantic relationship graphs has been discussed in [17]. A graph is first constructed using word co-occurrences and related senses. That is, word relationships are not established only based on a simple co-occurrence, but also as a result of a significant number of occurrences within a document that represents a semantic unit. Then, the relations between words in the obtained graph are enriched with information from WordNet.

Word semantic relations have also been exploited in [49]. The proposed method identifies exemplar words by leveraging clustering techniques to guarantee that the document is semantically covered. First, words are grouped into clusters according to their semantic distances. Then, each cluster is represented by a centroid word. The exemplar words are used to extract the keywords from the document.

Motivated by the fact that both documents and words can be represented by several semantic topics, [50] proposed a keyword extraction technique using multiple random walks specific to various topics. Accordingly, they assigned multiple importance scores to each word. Then, keywords are extracted based on their relevance to the document and their topic coverage.

In [51], the authors proposed a method to extract the main topics in conversations among Twitter users. They created their text graph based on the logical proximity of the concepts. That is, two words become adjacent if they are shared by users directly or through some specific separation degree. Then they use the k-core and modularity to isolate the different topics in the text. Measuring the level of bias in a discourse using text graphs is discussed in [52]. The proposed tool creates the text graph based on word co-occurrences. The most influential words and the different topics are identified using betweenness centrality and community detection techniques.

The above graph-based keyword extraction methods suffer from several problems. First, they require the number of keywords as a preset parameter because they are not able to determine an optimal number of keywords based on the content of the text. Second, the constructed text graphs rely mostly on co-occurrence relations, ignoring semantic relationships between the terms in the text.

Keyword extraction using word synonym relationships.

[53] proposed a keyword extraction algorithm using PageRank on synonym graphs. First, the text is represented as a weighted synonym co-occurrence graph. Then, the PageRank algorithm is used to rank each synonym group. Finally, several top-ranked synonym groups are selected as keywords. Using word synonym relationships for keyword extraction has also been discussed in the literature under the notion of lexical chains [54]. Lexical chains describe sets of semantically related words. Lexical chains can be created using three steps: (1) select a set of candidate words, (2) determine a suitable chain by calculating the semantic relatedness among members of the chain, and (3) if a chain exists, add the word and update the chain; else, create a new chain to fit the word [54, 55]. The second step can be performed using an existing database of synsets, such as the one included in the WordNet corpus [56]. Lexical chains and graph centrality measures were also used for keyword extraction in [55, 57].

Topic diversity and sentiment analysis

Topic diversity refers to the diverse representation of information on a specific topic by including different ideas, beliefs, perspectives, or feelings. Topic diversity is related to sentiment analysis, opinion mining, and text summarization. Various approaches can be applied to analyze topic diversity. In [58], the authors proposed a natural language processing technique to discover opinion diversity in a text using domain-specific vocabulary. They used two initial lists of positive and negative adjectives. Each list is then expanded using word synonym and antonym relationships. [59] used a graph-based template approach for topic variation detection. Here, the text is represented as semantic subgraphs and best matching subgraphs are used as a template to compare the text in an unsupervised manner.

Topic diversity can also be investigated through the analysis of community structure in graphs. In [4], text is first modeled as a graph of semantic relationships between terms. Then, community detection techniques are used for keyword extraction. The results showed that the terms related to the main topics of the text tend to form several cohesive communities. [60] identified a collection of communities related to a range of topics in Twitter conversation graphs. In [15], noun phrases that represent the main topics are extracted, clustered into topics, and used as vertices in a complete graph. Topics are scored using TextRank [12], and key phrases are extracted by selecting the most representative candidate from each of the top-ranked topics.

Sentiment analysis aims to classify a text as positive, negative, or neutral. Its primary focus is identifying opinion words in the text. Supervised learning techniques with three classes (positive, negative, and neutral) have been applied by training on labeled data [29]. For example, customer ratings can be directly translated into a class: a 4–5 star review is considered positive, a 1–2 star review is considered negative, and a 3 star review is considered neutral. A previous study [28] identified opinion sentences in customer reviews about a specific product feature to determine whether the review is positive or negative. The authors identified opinion adjectives and determined their sentiment using WordNet. Note that WordNet and other similar sources do not include sentiment information for each adjective; thus, the authors used the WordNet synonym and antonym sets to predict the sentiment information.

The current paper proposes an unsupervised parameterless domain independent keyword extraction approach using text graph representation. The proposed approach does not require a training dataset labeled by humans. Moreover, because the proposed approach relies on grouping semantically similar candidate words, it can extract a set of keywords that covers the main topics discussed in the text.

Proposed method

Given a text comprising a collection of text items, such as comments to a tweet or a news article, the proposed method aims to identify a set of keywords in the text, assess the diversity of the text, and analyze its sentiment (the code is included as Supplementary S1 Code). Here, the main idea is to represent the text as a synonym graph and then analyze the graph structure. The proposed method is described in the following steps (Fig 2 describes the proposed method workflow).

Step 1. Data preprocessing

Generally, when people write in a conversational style, their texts tend to be very noisy for any text mining technique. Therefore, the proposed method starts by preprocessing the text using the following steps (a minimal code sketch follows the list).

  • Tokenization: Each word/item in the text is treated as a token. A given text T is represented as T = {t1, t2, …ti}, where i is the number of tokens.
  • Token removal: All stop word tokens, non-alphabetic word tokens, and non-English word tokens are removed from the text. Moreover, when the text contains user names (Twitter text, for example), they are removed from the text. In this work, we use the NLTK corpus stopwords [56].
  • POS analysis: The part of speech (POS) of each of the remaining tokens is analyzed and only tokens that actively contribute to the meaning of the text are included. Here we include nouns, proper nouns, adjectives, and adverbs (including comparative adjectives and adverbs). In this work, we use the Semantic/syntactic Extraction using a Neural graph Architecture (SENNA) part of speech tagger [61].
  • Token normalization: All words are normalized by converting them to their lemmas, i.e., their meaningful base forms. Here we use the WordNet lemmatizer [56].
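
The following is a minimal sketch of this preprocessing pipeline using NLTK; it is not the paper's implementation, and nltk.pos_tag is substituted here for the SENNA tagger, so the retained tokens may differ slightly from the SENNA-based example shown later.

```python
# Sketch of Step 1 with NLTK (requires the punkt, stopwords, wordnet, and
# averaged_perceptron_tagger data packages). The SENNA tagger used in the
# paper is replaced by nltk.pos_tag purely for illustration.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS",   # nouns and proper nouns
             "JJ", "JJR", "JJS",           # adjectives
             "RB", "RBR", "RBS"}           # adverbs

def preprocess(text):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    # Tokenization.
    tokens = nltk.word_tokenize(text.lower())
    # Token removal: drop stop words and non-alphabetic tokens.
    tokens = [t for t in tokens if t.isalpha() and t not in stop]
    # POS analysis: keep only content-bearing parts of speech.
    kept = [t for t, tag in nltk.pos_tag(tokens) if tag in KEEP_TAGS]
    # Token normalization: reduce each word to its WordNet lemma.
    return [lemmatizer.lemmatize(t) for t in kept]

print(preprocess("I hope the money is going to charity."))
```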

Step 2. Text graph construction

The text is converted to a directed weighted graph G = (V, E), where the set of vertices V represents the set of tokens and the set of edges E represents the synonym relationships between word pairs as follows. A directed edge is added between two vertices u and v if vertex (word) v is a synonym of vertex (word) u. Edge direction is used to capture the asymmetric relationships between word synonyms [62]. For example, according to the Collins English Dictionary [63], the word “unit” is a synonym of the word “department”, but the opposite is not true.

Here, each edge has a weight that represents the strength of the relationship between the pair of words. We assume two relationship types (strengths): direct and indirect.

A word v is a direct synonym of another word u if the two words have similar meanings. Word v is an indirect synonym of word u if a word w that is a synonym of v is also a synonym of u. For example, in Fig 1, the word “publication” is a direct synonym of the word “book,” while the word “paper” is an indirect synonym of the word “book”. Direct synonym relationships represent stronger relationships between word pairs compared to indirect synonym relationships; in other words, direct synonym relationships result in higher edge weights.

In this text graph, the in-degree of a vertex u represents how many words have u as their synonym. The out-degree of vertex u represents the number of other words in the text that are synonyms of u. In addition, each word u has the frequency attribute (denoted freq(u)), which represents the number of occurrences of the word.

Given two words u and v and their synonym sets, denoted by syn(u) and syn(v) respectively, the edges and their weights in the text graph are assigned as follows: w(e_uv) = 1 if v ∈ syn(u) (direct synonym), and w(e_uv) = 0.5 if v ∉ syn(u) and syn(u) ∩ syn(v) ≠ ∅ (indirect synonym), (1) where e_uv is a directed edge pointing toward vertex v and w(e_uv) denotes the edge weight. The weights are selected to represent direct and indirect synonym relationships (words that are direct synonyms have a stronger relationship in the text graph). Throughout this work, we use the NetworkX Python library (https://networkx.org) to construct and analyze our text graphs.
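
The sketch below illustrates Step 2 with WordNet (via NLTK) as the synonym source and the weights from Eq (1); note that WordNet synsets give symmetric synonym relations, whereas a thesaurus such as the Collins English Dictionary can be asymmetric, so the resulting edges here are only an approximation of the construction described above.

```python
# Sketch of Step 2: build the directed, weighted synonym graph. WordNet
# (symmetric) is used here as the synonym source for illustration.
import networkx as nx
from nltk.corpus import wordnet as wn

def synonyms(word):
    # All lemma names that share a synset with the word (direct synonyms).
    return {lemma.name().lower().replace("_", " ")
            for synset in wn.synsets(word)
            for lemma in synset.lemmas()} - {word}

def build_text_graph(tokens):
    g = nx.DiGraph()
    words = set(tokens)
    syn = {w: synonyms(w) for w in words}
    for u in words:
        g.add_node(u, freq=tokens.count(u))
    for u in words:
        for v in words - {u}:
            if v in syn[u]:                  # v is a direct synonym of u
                g.add_edge(u, v, weight=1.0)
            elif syn[u] & syn[v]:            # shared synonym w: indirect
                g.add_edge(u, v, weight=0.5)
    return g
```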

Step 3. Text graph cleaning

Unimportant vertices (words) are removed from the text graph. We measure vertex importance based on its degree (number of connections) and its frequency in the text (number of times it occurs within the text). Both measures indicate word importance within a given text [64].

Let S be the set of singleton vertices in G, where S = {u ∈ V : degree_in(u) = 0 and degree_out(u) = 0}. Singleton vertices with a frequency of one are considered to have little contribution to the topic; therefore, those vertices are removed from the text graph. However, singleton vertices with a frequency greater than one are considered high contributors. This set of important singleton vertices forms the set S′ = {u ∈ S : freq(u) > 1}.
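
A short sketch of this cleaning step is given below; it reuses the freq vertex attribute from the construction sketch above and returns the important singleton set S′.

```python
# Sketch of Step 3: remove singletons that occur only once; keep the
# frequent singletons as the important set S'.
def clean_graph(g):
    singletons = [u for u in g
                  if g.in_degree(u) == 0 and g.out_degree(u) == 0]
    s_important = {u for u in singletons if g.nodes[u]["freq"] > 1}
    g.remove_nodes_from([u for u in singletons if g.nodes[u]["freq"] == 1])
    return g, s_important
```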

Step 4. Text graph analysis—Community extraction and evaluation

Our goal is to use the structure of the text graph to identify its keywords, which will be used subsequently to assess the text topic diversity and analyze its sentiment. To do so, we partition the vertices of the text graph into distinct communities each of which includes (to an extent) a set of interrelated concepts. We then apply the concept of vertex centrality to extract keywords from each community.

The text graph is first partitioned into communities using a community detection algorithm such as the Louvain algorithm [65] or the Leiden algorithm [31], and then the quality of each community is assessed. We define the following attributes for each community Ci with |Vi| = ni vertices and |Ei| = mi edges:

  • size(Ci): The community size defined as size(Ci) = ni + mi. Larger community sizes indicate the existence of more words and relationships within the community, i.e., the same concept (or concepts) was introduced in multiple different ways in the text or most of the concepts are synonyms. Accordingly, the larger the size of a community, the more important it is in the text.
  • weight(Ci): The community weight is computed as the sum of the frequencies of all vertices in the community, i.e., weight(Ci) = Σ_{u∈Vi} freq(u). The larger the weight of the community, the more relevant it is in the text.
  • density(Ci): The community density is defined as density(Ci) = mi / (ni(ni − 1)), where 0 ≤ density(Ci) ≤ 1. Density reports the ratio between the number of existing edges and the maximum possible number of edges. Higher densities imply stronger vertex relationships.
  • diam(Ci): The community diameter is the length of a longest shortest path between any two vertices in the subgraph induced by the community vertices. Larger community diameters indicate the existence of words that are less relevant to each other.
  • CC(Ci): The community clustering coefficient, where 0 ≤ CC(Ci) ≤ 1. The clustering coefficient measures the extent to which vertices tend to cluster together. A larger clustering coefficient indicates a community that includes strongly related words, i.e., stronger synonym relationships between the words.

Communities are sorted lexicographically according to their weights, sizes, and densities. Each community is then assigned a Quality value, which can be “High” or “Low”. The Quality of a community Ci is considered High if it achieves at least one of the following: density(Ci) ≥ δ or CC(Ci) ≥ δ, where δ > 0 is a threshold for the community quality. Otherwise, the community quality is considered Low.

The quality of a community can be enhanced by partitioning its vertices into smaller communities. This process can be applied to communities with lower qualities and repeated iteratively until no further improvement of the community partition is possible.
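
The sketch below computes the community attributes and the quality label for one community. The diameter is taken over the largest connected piece of the community with edge direction ignored, and δ = 0.5 (the value used in the worked example later) is used as the default threshold; both choices are illustrative.

```python
# Sketch of Step 4: attributes and quality of a single community.
import networkx as nx

def community_quality(g, nodes, delta=0.5):
    sub = g.subgraph(nodes)
    und = sub.to_undirected()
    n, m = sub.number_of_nodes(), sub.number_of_edges()
    attrs = {
        "size": n + m,
        "weight": sum(g.nodes[u]["freq"] for u in nodes),
        "density": nx.density(sub),                    # m / (n * (n - 1))
        # Longest shortest path within the community, ignoring edge direction.
        "diameter": max(nx.diameter(und.subgraph(c))
                        for c in nx.connected_components(und)),
        "cc": nx.average_clustering(und),
    }
    attrs["quality"] = ("High" if attrs["density"] >= delta
                        or attrs["cc"] >= delta else "Low")
    return attrs
```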

Step 5. Keywords identification

To extract the set of keywords from the given text, we use the concepts of vertex centrality and community quality. First, the graph communities are ranked based on their attributes: community weight, size, clustering coefficient, and diameter. Then, the set of most central vertices in each community is identified.

Multiple centrality measures can be used to rank vertices. Here, we use in-degree centrality. Vertices with higher in-degrees are more important because they are synonyms of more words in the text. The set of most important vertices of a community Ci with respect to the in-degree measure is denoted top_k(Ci), where k ≥ 1 is the selected number of top words to be returned from this set. In addition, we compute the sets of medium important vertices and least important vertices for each community (denoted by mid_k(Ci) and low_k(Ci), respectively).

The set of keywords kw is formed by combining the most important vertices in each community. In addition, all important singleton vertices (the set S′ defined in the text graph cleaning step) are added. That is, kw = (∪i top_k(Ci)) ∪ S′. (2)

To increase the precision and the quality of the set kw, the sets of medium and least important vertices (mid_k(Ci) and low_k(Ci)) are included for communities with low quality. This is important to ensure the completeness of the content because communities with low quality may include words that are not strongly related.
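
Combining the pieces above, a sketch of the keyword selection step is given below; it reuses community_quality from the previous sketch, and the cut-off used for the medium and least important vertices of low-quality communities (the next 2k ranked vertices) is an illustrative assumption.

```python
# Sketch of Step 5: top-k in-degree vertices per community, extra vertices
# for low-quality communities, plus the important singletons S'.
def extract_keywords(g, communities, s_important, k=1, delta=0.5):
    keywords = set(s_important)
    for nodes in communities:
        ranked = sorted(nodes, key=lambda u: g.in_degree(u), reverse=True)
        keywords.update(ranked[:k])
        if community_quality(g, nodes, delta)["quality"] == "Low":
            # Low-quality communities may mix loosely related words, so more
            # of their vertices are kept; the 3k cut-off is an assumption.
            keywords.update(ranked[k:3 * k])
    return keywords
```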

Step 6. Topic diversity assessment

Topic diversity is assessed using the number of weakly connected components in the text graph |W| and the graph modularity M. The number of weakly connected components (|W|) shows how strongly related the vertices in the graph are. The graph modularity M indicates how well defined the communities in the graph are. Using the two values, topic diversity in the text graph is assessed as follows.

  • When the number of weakly connected components is large relative to the number of vertices (|W|/|V| above a chosen threshold), topic diversity is very high since the text contains many topics that are weakly related.
  • When |W|/|V| is below the threshold and M ≥ 0.65, topic diversity is high since the text contains multiple topics that are weakly related.
  • When |W|/|V| is below the threshold and M < 0.65, topic diversity is low since the text contains few topics that are closely related.

The modularity threshold (0.65) was chosen based on previous studies [52, 65].
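
A sketch of the diversity assessment is given below; the component-ratio threshold is left as a parameter because the paper's exact cut-off is not reproduced here, while the modularity threshold of 0.65 follows the text.

```python
# Sketch of Step 6: topic diversity from weakly connected components and
# modularity. `ratio` is a placeholder threshold, not the paper's value.
import networkx as nx

def topic_diversity(g, communities, ratio=0.5):
    w = nx.number_weakly_connected_components(g)
    m = nx.community.modularity(g, communities, weight="weight")
    if w / g.number_of_nodes() >= ratio:
        return "very high"
    return "high" if m >= 0.65 else "low"
```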

Step 7. Sentiment analysis

The overall text sentiment is assessed by identifying the general orientation (polarity) of the set of keywords kw. Here, the sentiment of each keyword is assessed using the VADER package [66] and accordingly classified as positive, negative, or neutral. Given a concept w, VADER assigns it a polarity score polarity(w) that shows its orientation and the orientation level (−1 ≤ polarity(w)≤1). The overall text sentiment is assessed in two ways as follows.

  1. Determine the cardinalities of the positive, negative, and neutral concepts in the set kw. This provides a sentiment analysis overview about the text.
  2. Compute the text weighted polarity P by aggregating the polarity scores of the keywords (Eq 3), where kw is the set of keywords and polarity(wi) is the polarity score of keyword wi as assigned by VADER.
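
A sketch of this sentiment step using the vaderSentiment package is shown below; the aggregation of keyword polarities into a single score P as a frequency-weighted average is an assumption, since Eq (3) is only described in words here.

```python
# Sketch of Step 7: per-keyword polarity with VADER plus an assumed
# frequency-weighted aggregate polarity P.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def keyword_sentiment(g, keywords):
    analyzer = SentimentIntensityAnalyzer()
    scores = {w: analyzer.polarity_scores(w)["compound"] for w in keywords}
    counts = {"positive": sum(s > 0 for s in scores.values()),
              "negative": sum(s < 0 for s in scores.values()),
              "neutral": sum(s == 0 for s in scores.values())}
    total = sum(g.nodes[w].get("freq", 1) for w in keywords)
    p = sum(g.nodes[w].get("freq", 1) * s for w, s in scores.items()) / total
    return counts, p
```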

An illustrative example

We use the set of replies to a Tweet posted by CNN on Dec. 29, 2019 (available at https://twitter.com/CNN/status/1210348818492997633) to demonstrate the proposed method. At the time of this analysis, the initial CNN tweet had 87 replies (listed in Supplementary S1 Text). The text was first tokenized into a list of words and preprocessed.

During preprocessing, we used the Semantic/syntactic Extraction using a Neural graph Architecture (SENNA) part of speech tagger [61] to keep only nouns, proper nouns, adjectives, and adverbs. The remaining tokens were then normalized by converting them to their lemmas. For example, “I hope the money is going to charity.” will become [“hope”, “money”, “go”, “charity”].

Then the associated text graph was constructed. We used the word synonyms from WordNet corpus [56] using the NLTK toolkit to identify synonym relationships between vertices in the text graph. The unimportant vertices (singletons with frequencies less than two) were removed.

Fig 3 shows the text graph associated with the tweet replies. The text graph has 193 vertices (34 vertices are singletons), 572 edges, and 45 connected components. 112 edges represent direct synonym relationships, and 460 represent indirect synonym relationships. For example, in Fig 3, the word “proper” is a direct synonym of the word “right” and an indirect synonym of the word “best.” Table 1 lists the set of important singleton vertices S′.

Fig 3. Text graph for the tweet replies text.

This graph has 193 vertices (34 vertices are singletons), 572 edges, and 45 connected components. 112 edges represent direct synonym relationships, and 460 represent indirect synonym relationships. The size of each vertex reflects its frequency in the original text. Thick and thin edges indicate direct and indirect synonym relationships, respectively.

https://doi.org/10.1371/journal.pone.0255127.g003

The qualities of the 22 main communities (partitioned using the Louvain algorithm) in the text graph are summarized in Table 2. We evaluate the quality of each community using its attributes: weight, size, density, diameter, and clustering coefficient. The last column in Table 2 shows the quality of each community based on its attributes. Quality is classified as “High” or “Low”. The quality of a community Ci is considered high if it achieves at least one of the following: density(Ci)≥0.5 or CC(Ci)≥0.5. Otherwise, the community quality is considered low.

First, communities are sorted lexicographically according to their weights and sizes. Then each community is assigned a quality rank according to its density, clustering coefficient, and diameter.

Table 3(a)–3(d) show the synonym relationships within four communities. The community presented in Table 3(b) has a low clustering coefficient compared to the community in Table 3(c). Higher clustering coefficients indicate that the majority of words in the community are connected by either direct or indirect synonym relationships, and vice versa. The concepts in Table 3(c) are more related to one another; as shown in Fig 3, they form almost a star graph. The concepts within the community in Table 3(b) also form a star but with weaker ties between the vertices (smaller weights). This indicates that it contains words that are not closely related, such as the words “miserable” and “poor”. Similarly, the density of a community reflects the relationship strength among the words in the community. Note that the clustering coefficient of the community in Table 3(c) is 0.7, while the clustering coefficient of the community in Table 3(b) is 0.25. Another community attribute that can be used to evaluate quality is the diameter. Generally, the length of the diameter (with respect to the number of vertices in the community) correlates negatively with quality. In other words, shorter diameters indicate stronger communities and vice versa. For example, one community with 23 vertices has a diameter of 3, while another community with 100 vertices has a diameter of 2.

Table 3. Synonym relationships and clustering coefficients (CC) of four communities in the text graph.

The numbers represent relationship strengths (a strength of 1 is assigned between direct synonyms and 0.5 is assigned between indirect synonyms).

https://doi.org/10.1371/journal.pone.0255127.t003

The set of keywords kw for k = 1, together with all important singleton vertices, for our example is

kw = {study, practice, bill, complete, address, extreme, good, blood, disorder, ready, cause, concept, pressure, tweet, life, bite, attack, burning, harder, food, long, loose, miserable, hurt, regular, ill, earth, stem, thank, speed, yea, ate, word, muhammed, ramadan, muslim, month, day, body, time, many, meal, intermittent, ever, also, twice, health, breakfast, thing, never, hungry, doubt, western, obese, something, prophet, yes, sure, hunger, woman, much, hospital, obesity, diet, anyway, calorie, everyone, american}.

Community extraction using the Leiden algorithm for this example is briefly discussed in S1 Appendix. The set of keywords using iterative community extraction is shown in S2 Appendix.

The number of weakly connected components in the text graph is |W| = 45, the graph modularity is M = 0.75, and the number of vertices is |V| = 193. That is, |W|/|V| is below the threshold and M ≥ 0.65, which suggests high topic diversity within the text.

We also analyzed the sentiment of the text. In our example, the set of keywords includes 68 words. The sentiment analysis shows that 8 words are positive, 10 are negative, and 50 are neutral. The weighted polarity P = −0.24 suggests that the text is slightly negative in orientation.

Evaluation

To assess the keyword extraction performance of the proposed technique, we compared its performance to that of two keyword extraction techniques: TextRank [12] (a graph-based technique) and YAKE [7] (a statistical-based technique) using several different datasets. TextRank uses word co-occurrences to control the distance between word occurrences in creating the text graph. Then it uses eigenvector centrality to rank each term. YAKE computes a score for each term based on five features: case, position, frequency, relatedness to context, and how often a candidate word appears in different sentences.

Three performance measures were used as key concept extraction evaluation metrics: precision (Pr), recall (Re), and F-score, defined as follows: Pr = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|, (4) Re = |{Relevant} ∩ {Retrieved}| / |{Relevant}|, (5) and F-score = 2 · Pr · Re / (Pr + Re). (6)
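
For completeness, a small helper computing Eqs (4)–(6) from the retrieved and relevant keyword sets might look as follows.

```python
# Sketch: precision, recall, and F-score for a set of extracted keywords.
def pr_re_f(retrieved, relevant):
    matched = len(set(retrieved) & set(relevant))
    pr = matched / len(retrieved) if retrieved else 0.0
    re = matched / len(relevant) if relevant else 0.0
    f = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f
```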

We also used a metric similar to the Pyramid [67] for our evaluation. The Pyramid method [67] creates a pyramid using the human annotated keywords. The set of keywords extracted by each method is then compared to the pyramid. Each keyword w is assigned a pyramid score ps based on the number of human annotators who selected it. The higher the pyramid score, the higher the keyword is in the pyramid. A system’s oracle score os, where 0 ≤ os ≤ 1, is computed by adding the pyramid scores of the keywords generated by the system.

In our performance analysis, we let the number of keywords returned by each method contribute to the method’s oracle score. We define the weighted-score in terms of the oracle score, the set U = {Retrieved} − {Keywords}, and N, the number of unique keywords extracted by the human extractors.

Two evaluation tasks were performed: evaluation using human extractors and evaluation using publicly available annotated datasets. For each dataset, we used the community partition (Louvain or Leiden) that yielded the best result.

Evaluation using human extractors

In this evaluation task, three human extractors were asked to extract an unspecified number of keywords from a given text dataset. The human extractors were instructed to extract keywords based on importance and relevance to the topic. Then, the intersection of the three keyword sets was determined.

Table 4 shows three sets of keywords extracted by three human extractors for the text dataset that includes the Tweet replies for the Tweet in the Illustrative example. The performances of the proposed method, TextRank, and YAKE are shown in Table 5. In Table 5, Relevant represents the set of keywords that appear in at least one of the lists extracted by the human extractors. Inter_Relevant represents the set of keywords that appears in the intersection set of the human extractor lists. Precision represents the probability that a key concept is relevant given that it is returned by a system. Recall represents the probability that a relevant key concept is returned [68, 69].

Table 5 shows the performance of two versions of the proposed method: first, when all keywords are included (set kw as discussed in the keyword identification step), and second, set kw after removing singleton vertices with frequencies ≤2. For comparison, two YAKE versions are also listed in Table 5: YAKE, which represents all keywords returned by this technique, and YAKE*, which represents the 51 most important concepts.

Moreover, this experiment was conducted with eight more text datasets: five Tweet replies datasets, two CNN news datasets, and one speech dataset. The Tweet replies datasets are: TRAVEL-BAN, AIRPODS, GENE-TECH, JEFF-BEZOS, and INT-FASTING. All five datasets were collected from Twitter. Each dataset contains the set of Tweet replies posted under each Tweet. The numbers of replies associated with each tweet are 69, 72, 30, 161, and 87. The CNN news datasets are: BIDEN and COVID-19. The last dataset (KING-SPEECH) is a transcript of Martin Luther King Jr’s “I have dream” speech. Table 6 lists the main statistical data of each dataset.

Table 6. Statistical and text graph data of each dataset.

Number of words and number of tokens denote the number of words in the dataset before and after preprocessing, respectively. Direct edges and indirect edges represent the number of direct and indirect synonym relationships between words in the text graph, respectively.

https://doi.org/10.1371/journal.pone.0255127.t006

For each dataset, we used the keyword set extracted by the human extractors to compare the performance of the proposed method against TextRank and YAKE (see Table 7). Table 7 shows two sets of comparisons. The first comparison considers all keywords extracted by each approach. The second comparison considers only the first N keywords, where N is the number of unique keywords extracted by all human extractors.

In Table 7, precision represents the ratio of the number of correctly matched words to the total number of extracted words, i.e., Pr = |{Correctly matched}| / |{Extracted}|. Recall represents the ratio of the number of correctly matched words to the total number of assigned words, i.e., Re = |{Correctly matched}| / |{Assigned}|. The table shows the performance of the proposed method using three different values of k (the number of words extracted from each community in the text graph).

Evaluation using annotated datasets

As another experiment to evaluate the performances of the proposed method and the other existing methods used for comparison, we used a set of publicly available human annotated datasets from the Inspec database [10]. The dataset includes a collection of abstracts and the corresponding manually assigned keywords. The abstracts are from Computer Science and Information Technology journal papers. Two sets of keywords are assigned for each abstract: controlled (restricted to a given thesaurus) and uncontrolled (freely assigned by the annotators). Following [10, 12], we use the uncontrolled set of keywords for our comparisons.

First, we extracted the set of human annotated keywords for five abstract datasets (see Table 8). Then we used the keyword set to compare the performance of the proposed method against the baselines: TextRank, and YAKE (the number of extracted keywords was limited to the number of keywords assigned by human annotators). The results are shown in Table 9.

Table 8. Statistical and text graph data of each abstract in the HULTH dataset.

Number of words and number of tokens denote the number of words in the dataset before and after preprocessing, respectively. Direct edges and indirect edges represent the number of direct and indirect synonym relationships between words in the text graph, respectively.

https://doi.org/10.1371/journal.pone.0255127.t008

In Table 9, Precision represents the ratio of the number of correctly matched words to the total number of extracted words, and Recall represents the ratio of the number of correctly matched words to the total number of assigned words. Table 9 also shows the performance of the proposed method using three different values of k.

Discussion

Table 5 shows the results of the proposed method and the two baselines when compared using the keywords extracted by the three human extractors. The proposed method achieves better results compared to the two baseline methods when all keywords are included. The performance of the proposed method is slightly affected by the removal of singleton vertices with low frequencies (≤2).

Table 7 shows the results for all text datasets. Overall, the proposed method shows good results. Using Precision, the proposed method outperforms both baselines on TRAVEL-BAN, GENE-TECH, and INT-FASTING. On AIRPODS and JEFF-BEZOS, the proposed method provides comparable results to both baselines. On the BIDEN, COVID-19, and KING-SPEECH datasets, the proposed method fails to do better than the baselines. This can be explained by the number of keywords extracted by each method. In all datasets, the two baselines extract a large number of keywords. For example, the number of keywords extracted by TextRank is about three times the number of keywords extracted by the proposed method in almost all datasets. Similarly, the number of keywords extracted by YAKE is about four times the number of keywords extracted by the proposed method. Overall, the proposed method extracts a concise set of keywords without requiring the number of keywords as a preset parameter.

Using the weighted-score, the proposed method achieves comparable results to both baselines with a fraction of keywords.

Increasing the number of extracted keywords (by increasing the value of k) does not seem to improve the performance of the proposed method. This highlights the importance of synonym relationships between words (communities in the text graph can be represented by a single word).

Table 9 shows the results for TextRank, YAKE, and the proposed method compared against the keywords assigned by the human annotators. Trends similar to those in Table 7 can be observed. First, the number of keywords extracted by the proposed method is smaller than the number of keywords returned by TextRank and YAKE. In fact, the number of keywords extracted by the proposed method is smaller than the number of keywords annotated by the human annotators. This is because our method takes into account the semantic relationships between the words, which is crucial when there is a limit on the number of returned keywords. For example, the dataset ABSTRACT-5 includes two pairs of direct synonyms: “velocity” and “speed”, and “procedure” and “function”.

Second, Table 9 shows that the proposed method outperforms TextRank and YAKE. However, increasing the number of extracted keywords (by increasing k) does not seem to improve the performance of the proposed method. Again, this shows that including additional semantically related words does not increase the accuracy of keyword extraction algorithms.

The highest F-score achieved is about 55%, and the average is 34.9%. Keyword extraction is a highly subjective task, and an F-score of 100% is infeasible [70]. For example, in the human annotated abstract datasets, some keywords are not present in the abstracts because the human indexers had access to the full-length documents when assigning the keywords [10]. Specifically, the numbers of absent keywords in the ABSTRACT-4 and ABSTRACT-5 datasets analyzed above are 10 and 16, respectively. This implies that the highest a method could theoretically achieve is 100% precision for both datasets, 75% recall for ABSTRACT-4, and 71% recall for ABSTRACT-5. This gives a maximum F-score of 87% for ABSTRACT-4 and 83% for ABSTRACT-5.

A number of limitations of the proposed method must be noted. First, in English (and many other languages), the meaning of a word usually depends on its context. However, the proposed method associates word pairs based on their synonym relationships and does not consider context. For example, the words “regular” and “even” are direct synonyms; thus, they form their own community. The word “regular” appeared twice in the text in the following comments:

  • “I went from 285+ to 165 at the age of 50 … I fast for 20+ hrs and will never go back to a “regular” eating schedule.”
  • “Naaah when I am hungry I just tuck in. The yoga stuff is not for regular folk. Good day snowflakes!”

The word “even” appeared four times in the following comments:

  • “Irresponsible post. Need to preface this with excluding women. A lot of women. Don’t do well with even a 12 hour fast.”
  • “I don’t even want to live another day let along any longer! Thanks to fake news!”
  • “Americans can’t even give up guns and you want them to give up food? They’re overweight for a reason.”
  • “who told you you gotta eat three times or even more a day and who told you you gotta eat each time till you filled up complete day after day.”

Considering context in relation to these comments, “regular” and “even” carry two different meanings and should not be in the same community.

The other limitation is related to the words’ POS and the POS taggers. The SENNA tagger [61] is used to identify the POS of each word and retain those that carry the core text content. In our case, we selected nouns, proper nouns, adjectives, and adverbs. POS taggers may not always identify the correct POS due to differences between the training dataset and the current dataset. Those differences are usually related to the different uses of words in English. For example, identifying the correct POS of a word ending with “ing” can be problematic.

Conclusion

The goal of keyword extraction is to identify the concepts that describe the main topics discussed in a conversation. Keyword extraction can provide insights about the topics discussed within the text. Keyword extraction approaches can be categorized as statistical, machine learning, linguistic, and graph-based approaches. Graph-based keyword extraction approaches capture more structural information about the text compared to other text analysis techniques.

In this work, we extract the keywords in a given text, assess its topic diversity, and analyze its sentiment using a graph representation of the text and the synonym relationships between the words. We first partition the text graph into different communities and then identify the most central vertices as keywords. The quality of each community is assessed according to its attributes, such as the number of vertices and edges, its diameter, and its clustering coefficient. The community quality indicates the strength of the relationships between the words in the community. We first sort the communities according to their qualities and then extract the most central vertices in each community using the degree centrality measure. We also add the set of singleton vertices with high frequencies to the set of keywords.

The motivation behind our work is to overcome the limitations of other graph-based keyword extraction approaches, primarily their dependence on word co-occurrences alone for text graph construction and their user parameter requirements. Our basic approach can be improved by also analyzing collocations (words that appear adjacent to one another) or co-occurrences (words that appear together within the text but not necessarily adjacent).

Word synonym relationships connect words with semantic associations. Here we used two synonym relationship types: direct and indirect. A word is a direct synonym of another word if it has a similar meaning. A word is an indirect synonym of another word if they are both synonyms of a third one. Word synonym relationships are used as edge weights to indicate the strength of the relationship between word pairs (direct synonym relationships are stronger compared to indirect ones).

The proposed method has a number of limitations. First, the proposed method associates word pairs based on their synonym relationships and does not consider context; however, the meaning of English words usually depends on the context. Second, the part of speech (POS) taggers, which are used to select words that actively contribute to the meaning of the text during preprocessing, may not always identify the correct POS due to differences between the training dataset and the current dataset. Typically, those differences are related to the different uses of words in English. Finally, the community detection approach that results in the best keyword set is yet to be explored.

As future work, we plan to extend this word relationship formulation by including higher degree synonym relationships among words. Other vertex centrality measures for ranking the words in each community can also be considered. Moreover, community detection approaches that allow for overlapping communities need to be considered. Further analysis needs to be conducted to identify the best community detection algorithm that can be used with synonym graphs. Finally, using word embedding and virtual edges to improve the performance of the proposed approach needs to be investigated.

Acknowledgments

The author thanks Maram Bahareth for her insight during the early stages of this research. The author also thanks the Deanship of Scientific Research and RSSU at King Saud University for their technical support.

References

  1. Dumais S. Using SVMs for text categorization. IEEE Intelligent Systems. 1998;13(4):21–23.
  2. Feldman R, Sanger J. The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press; 2007.
  3. Abilhoa WD, De Castro LN. A keyword extraction method from twitter messages represented as graphs. Applied Mathematics and Computation. 2014;240:308–325.
  4. Grineva M, Grinev M, Lizorkin D. Extracting key terms from noisy and multitheme documents. Proceedings of the 18th International Conference on World Wide Web; 2009. p. 661–670.
  5. Luhn HP. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development. 1957;1(4):309–317.
  6. Matsuo Y, Ishizuka M. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools. 2004;13(01):157–169.
  7. Campos R, Mangaravite V, Pasquali A, Jorge AM, Nunes C, Jatowt A. YAKE! collection-independent automatic keyword extractor. Proceedings of the European Conference on Information Retrieval. Springer; 2018. p. 806–810.
  8. Uzun Y. Keyword extraction using naive bayes. Bilkent University, Department of Computer Science, Turkey; 2005. Available from: http://www.cs.bilkent.edu.tr/guvenir/courses/CS550/Workshop/Yasin_Uzun.pdf.
  9. Zhang K, Xu H, Tang J, Li J. Keyword extraction using support vector machine. Proceedings of the International Conference on Web-age Information Management. Springer; 2006. p. 85–96.
  10. Hulth A. Improved automatic keyword extraction given more linguistic knowledge. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing; 2003. p. 216–223.
  11. Washio T, Motoda H. State of the art of graph-based data mining. ACM SIGKDD Explorations Newsletter. 2003;5(1):59–68.
  12. Mihalcea R, Tarau P. TextRank: Bringing order into text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing; 2004. p. 404–411.
  13. Palshikar GK. Keyword extraction from a single document using centrality measures. Proceedings of the International Conference on Pattern Recognition and Machine Intelligence. Springer; 2007. p. 503–510.
  14. Liu H, Hu F. What role does syntax play in a language network? EPL (Europhysics Letters). 2008;83(1):18002.
  15. Bougouin A, Boudin F, Daille B. TopicRank: Graph-based topic ranking for keyphrase extraction. Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP); 2013. p. 543–551.
  16. Lahiri S, Choudhury SR, Caragea C. Keyword and keyphrase extraction using centrality measures on collocation networks. arXiv:1401.6571 [Preprint]. 2014 [cited 2021 March 20]. Available from: https://arxiv.org/abs/1401.6571.
  17. Martinez-Romo J, Araujo L, Duque Fernandez A. SemGraph: Extracting keyphrases following a novel semantic graph-based approach. Journal of the Association for Information Science and Technology. 2016;67(1):71–82.
  18. Tixier A, Malliaros F, Vazirgiannis M. A graph degeneracy-based approach to keyword extraction. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing; 2016. p. 1860–1870.
  19. Vega-Oliveros DA, Gomes PS, Milios EE, Berton L. A multi-centrality index for graph-based keyword extraction. Information Processing & Management. 2019;56(6):102063.
  20. Do TNQ, Napoli A. A graph model for text analysis and text mining. Master Thesis, Université de Lorraine; 2012.
  21. Rousseau F, Vazirgiannis M. Main core retention on graph-of-words for single-document keyword extraction. Proceedings of the European Conference on Information Retrieval. Springer; 2015. p. 382–393.
  22. Hasan KS, Ng V. Automatic keyphrase extraction: A survey of the state of the art. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics; 2014. p. 1262–1273.
  23. Giunchiglia F, Maltese V, Madalli D, Baldry A, Wallner C, Lewis P, et al. Foundations for the representation of diversity, evolution, opinion and bias. Technical Report DISI-09-063; 2009. Available from: http://eprints.biblio.unitn.it/1758.
  24. Liu B. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies. 2012;5(1):1–167.
  25. Hofmann T. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1999. p. 50–57.
  26. Demeester T, Rocktäschel T, Riedel S. Lifted rule injection for relation embeddings. arXiv:1606.08359 [Preprint]. 2016 [cited 2021 March 20]. Available from: https://arxiv.org/abs/1606.08359.
  27. Moody CE. Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv:1605.02019 [Preprint]. 2016 [cited 2021 March 20]. Available from: https://arxiv.org/abs/1605.02019.
  28. Hosseini AS. Sentence-level emotion mining based on combination of adaptive Meta-level features and sentence syntactic features. Engineering Applications of Artificial Intelligence. 2017;65:361–374.
  29. Saranya K, Jayanthy S. Onto-based sentiment classification using machine learning techniques. Proceedings of the 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS). IEEE; 2017. p. 1–5.
  30. Newman ME. Detecting community structure in networks. The European Physical Journal B. 2004;38(2):321–330.
  31. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports. 2019;9(1):1–12. pmid:30914743
  32. Tohalino JV, Amancio DR. Extractive multi-document summarization using multilayer networks. Physica A: Statistical Mechanics and its Applications. 2018;503:526–539.
  33. Mehri A, Jamaati M. Statistical metrics for languages classification: A case study of the Bible translations. Chaos, Solitons & Fractals. 2021;144:110679.
  34. Véronis J. Hyperlex: lexical cartography for information retrieval. Computer Speech & Language. 2004;18(3):223–252.
  35. Mihalcea R, Radev D. Graph-based natural language processing and information retrieval. Cambridge University Press; 2011.
  36. Mehri A, Darooneh AH, Shariati A. The complex networks approach for authorship attribution of books. Physica A: Statistical Mechanics and its Applications. 2012;391(7):2429–2437.
  37. Segarra S, Eisen M, Ribeiro A. Authorship attribution through function word adjacency networks. IEEE Transactions on Signal Processing. 2015;63(20):5464–5478.
  38. Corrêa EA Jr, Lopes AA, Amancio DR. Word sense disambiguation: A complex network approach. Information Sciences. 2018;442:103–113.
  39. Corrêa EA Jr, Amancio DR. Word sense induction using word embeddings and community detection in complex networks. Physica A: Statistical Mechanics and its Applications. 2019;523:180–190.
  40. Chopra A, Prashar A, Sain C. Natural language processing. International Journal of Technology Enhancements and Emerging Engineering Research. 2013;1(4):131–134.
  41. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv:1301.3781 [Preprint]. 2013 [cited 2021 March 20]. Available from: https://arxiv.org/abs/1301.3781.
  42. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. 2017;5:135–146.
  43. Wang R, Liu W, McDonald C. Using word embeddings to enhance keyword identification for scientific publications. Proceedings of the Australasian Database Conference. Springer; 2015. p. 257–268.
  44. Zhang Y, Liu H, Wang S, Ip W, Fan W, Xiao C. Automatic keyphrase extraction using word embeddings. Soft Computing. 2019; p. 1–16.
  45. Quispe LV, Tohalino JA, Amancio DR. Using word embeddings to improve the discriminability of co-occurrence text networks. arXiv:2003.06279 [Preprint]. 2020 [cited 2021 March 20]. Available from: https://arxiv.org/abs/2003.06279.
  46. Boudin F. A comparison of centrality measures for graph-based keyphrase extraction. Proceedings of the Sixth International Joint Conference on Natural Language Processing; 2013. p. 834–838.
  47. Liu J, Wang J. Keyword extraction using language network. Proceedings of the 2007 International Conference on Natural Language Processing and Knowledge Engineering. IEEE; 2007. p. 129–134.
  48. Paranyushkin D. Identifying the pathways for meaning circulation using text network analysis. Nodus Labs. 2011;26.
  49. Liu Z, Li P, Zheng Y, Sun M. Clustering to find exemplar terms for keyphrase extraction. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing; 2009. p. 257–266.
  50. Liu Z, Huang W, Zheng Y, Sun M. Automatic keyphrase extraction via topic decomposition. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing; 2010. p. 366–376.
  51. Lipizzi C, Iandoli L, Marquez JER. Extracting and evaluating conversational patterns in social media: A socio-semantic analysis of customers’ reactions to the launch of new products using Twitter streams. International Journal of Information Management. 2015;35(4):490–503.
  52. Paranyushkin D. InfraNodus: Generating insight using text network analysis. Proceedings of the World Wide Web Conference; 2019. p. 3584–3589.
  53. Liu Z, Liu J, Yao W, Wang C. Keyword extraction using PageRank on synonym networks. Proceedings of the 2010 International Conference on E-Product E-Service and E-Entertainment. IEEE; 2010. p. 1–4.
  54. Stairmand M, et al. A computational analysis of lexical cohesion with applications in information retrieval. Doctoral Dissertation, The University of Manchester; 1996. Available from: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.503546.
  55. Aggarwal A, Sharma C, Jain M, Jain A. Semi supervised graph based keyword extraction using lexical chains and centrality measures. Computación y Sistemas. 2018;22(4).
  56. WordNet. NLTK 3.5 documentation; 2020. Available from: https://www.nltk.org.
  57. Ercan G, Cicekli I. Using lexical chains for keyword extraction. Information Processing & Management. 2007;43(6):1705–1714.
  58. Bizău A, Rusu D, Mladenić D. Expressing Opinion Diversity. DiversiWeb 2011. 2011; p. 5.
  59. Trampuš M, Mladenic D. Approximate subgraph matching for detection of topic variations. DiversiWeb 2011. 2011; p. 25.
  60. Smith MA, Rainie L, Shneiderman B, Himelboim I. Mapping Twitter topic networks: From polarized crowds to community clusters. Pew Research Center. 2014;20:1–56.
  61. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. Journal of Machine Learning Research. 2011;12:2493–2537.
  62. Chodorow M, Ravin Y, Sachar HE. A tool for investigating the synonymy relation in a sense disambiguated thesaurus. Proceedings of the Second Conference on Applied Natural Language Processing; 1988. p. 144–151.
  63. Collins English Thesaurus. Available from: https://www.collinsdictionary.com/dictionary/english-thesaurus.
  64. Biswas SK, Bordoloi M, Shreya J. A graph based keyword extraction model using collective node weight. Expert Systems with Applications. 2018;97:51–59.
  65. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment. 2008;2008(10):P10008.
  66. Gilbert C, Hutto E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the Eighth International Conference on Weblogs and Social Media (ICWSM-14); 2014.
  67. Nenkova A, Passonneau RJ. Evaluating content selection in summarization: The pyramid method. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004; 2004. p. 145–152.
  68. Goutte C, Gaussier E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. Proceedings of the European Conference on Information Retrieval. Springer; 2005. p. 345–359.
  69. Bordoloi M, Biswas SK. Keyword extraction from micro-blogs using collective weight. Social Network Analysis and Mining. 2018;8(1):58.
  70. Kim SN, Medelyan O, Kan MY, Baldwin T. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. Proceedings of the 5th International Workshop on Semantic Evaluation; 2010. p. 21–26.