Extraction of Topics



The Topic Extraction feature of WordStat attempts to uncover the hidden thematic structure of a text collection by applying a combination of natural language processing and statistical analysis. The main statistical procedure used for topic extraction in WordStat is factor analysis. Technically speaking, such an extraction is achieved by computing a word-by-document frequency matrix, or alternatively by segmenting documents into smaller chunks and computing a word-by-segment frequency matrix. Once this matrix is obtained, a factor analysis with Varimax rotation is computed in order to extract a small number of factors. All words with a factor loading higher than a specific criterion are then retrieved as part of the extracted topic. Whereas in hierarchical cluster analysis a word may appear in only one cluster, topic modeling using factor analysis may result in a word being associated with more than one factor, a characteristic that more realistically represents the polysemic nature of some words as well as the multiple contexts of word usage.
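
As a rough illustration of the first step described above, the following Python sketch builds a word-by-segment frequency matrix, with one row per segment and one column per word. It is not WordStat's implementation: the helper names (`split_segments`, `tokenize`, `frequency_matrix`), the tokenizer, and the naive paragraph and sentence splitting rules are all assumptions made for the example. The factor analysis itself is not shown.

```python
import re
from collections import Counter

def split_segments(document, unit="document"):
    """Split one document into segments (hypothetical helper, naive rules)."""
    if unit == "document":
        parts = [document]
    elif unit == "paragraph":
        parts = re.split(r"\n\s*\n", document)        # blank line = paragraph break
    elif unit == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", document)  # split after ., ! or ?
    else:
        raise ValueError(f"unknown unit: {unit}")
    return [p.strip() for p in parts if p.strip()]

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_matrix(documents, unit="document"):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in segment i."""
    segments = [s for doc in documents for s in split_segments(doc, unit)]
    counts = [Counter(tokenize(s)) for s in segments]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, matrix

docs = [
    "Taxes are too high. Lower taxes help families.",
    "Schools need funding. Teachers need support.",
]
vocab, by_doc = frequency_matrix(docs, unit="document")
_, by_sentence = frequency_matrix(docs, unit="sentence")
print(len(by_doc), "document rows,", len(by_sentence), "sentence rows")
```

The same collection yields more, but sparser, rows when it is segmented by sentence, which is why the choice of unit changes which words are seen as co-occurring.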

 

The current implementation of the topic-modeling procedure has a limit of 2,500 words or content categories. (We are working on ways to increase the capability to at least twice this amount.) To ensure the stability of the factoring solution, low-frequency items should preferably be excluded. It is thus strongly recommended to remove any word occurring fewer than 10 times in smaller data sets, and ideally fewer than 30 to 50 times in larger ones. Stemming, lemmatization or the creation of a categorization dictionary may also be used to group words or phrases, including less frequent ones, prior to the topic extraction.
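
The frequency recommendation above can be sketched in a few lines of Python. The function name `select_vocabulary` and its parameters are hypothetical, and keeping only the most frequent items when the 2,500-item limit is exceeded is an assumption of this example, not a documented WordStat behavior.

```python
from collections import Counter

def select_vocabulary(token_lists, min_count=10, max_items=2500):
    """Keep words occurring at least min_count times, most frequent first.

    Truncating to max_items is an assumption made for this sketch.
    """
    totals = Counter(tok for tokens in token_lists for tok in tokens)
    kept = [(w, n) for w, n in totals.items() if n >= min_count]
    kept.sort(key=lambda item: (-item[1], item[0]))  # by frequency, then A-Z
    return [w for w, _ in kept[:max_items]]

# Toy data: "tax" occurs 15 times, "school" 10 times, "budget" 9 times.
token_lists = [["tax"] * 12 + ["budget"] * 9, ["tax"] * 3 + ["school"] * 10]
print(select_vocabulary(token_lists))
```

With the default threshold of 10, "budget" is dropped because it occurs only 9 times in the toy data.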

 

WordStat provides the following analysis options to control the topic modeling process:

 

Segmentation - This option allows one to specify whether the data to be used for topic modeling will be based on the co-occurrence of words in the same document, or whether they will be based on co-occurrence within paragraphs or sentences. The choice of segmentation should ideally reflect how topics are being distributed in a typical document and across documents, as well as the objective of the analysis. When the text collection consists of long documents containing multiple topics (such as long political speeches) and one needs to identify all topics in order to compare their relative frequencies, then performing a segmentation by paragraph or by sentence may be more sensitive than computing co-occurrences by documents. Alternatively, if one attempts to differentiate documents by identifying domains or disciplines, or to identify the dominant issue of documents, then performing the analysis at the document level may be more appropriate. When analyzing responses to open-ended questions, which may include several topics listed in a single paragraph, segmenting by sentence may also result in a more precise extraction of the various topics they contain.

 

No. Topics - Setting this option allows one to specify how many topics to extract.

 

Loading - This option allows one to set the minimum factor loading a word should reach in order to be retained in the factor solution. By default, this value is set to 0.4. Increasing the cutoff value will reduce the number of words, keeping only the more representative ones, while reducing it may include words that are somewhat less characteristic of the extracted topic.
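
To make the effect of the Loading option concrete, here is a minimal Python sketch that turns a table of factor loadings into topics. The loadings are invented for the example, the function name `topics_from_loadings` is hypothetical, and treating a loading equal to the cutoff as retained (and ignoring negative loadings) are assumptions.

```python
def topics_from_loadings(loadings, cutoff=0.4):
    """Map factor number -> words whose loading reaches the cutoff.

    loadings: dict of word -> list of loadings, one per factor.
    Words are listed in descending order of loading; factors with no
    retained word are left out of the result.
    """
    n_factors = len(next(iter(loadings.values()))) if loadings else 0
    topics = {}
    for f in range(n_factors):
        members = [(w, row[f]) for w, row in loadings.items() if row[f] >= cutoff]
        if members:
            members.sort(key=lambda item: -item[1])
            topics[f + 1] = [w for w, _ in members]
    return topics

# Invented loadings on three factors.
loadings = {
    "tax":     [0.81, 0.05, 0.10],
    "budget":  [0.66, 0.12, 0.02],
    "school":  [0.08, 0.74, 0.11],
    "funding": [0.45, 0.52, 0.09],
    "weather": [0.10, 0.07, 0.21],
}
print(topics_from_loadings(loadings))              # default cutoff of 0.4
print(topics_from_loadings(loadings, cutoff=0.6))  # stricter cutoff
```

In this toy data, "funding" is retained on two factors at the default cutoff, which illustrates how a word can belong to more than one topic, and the third factor disappears because none of its loadings reaches the cutoff.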

 

Once the options have been set, click the button to perform the analysis. Please note that extracting topics on more than a few hundred words can take several minutes. Once extracted, the TOPICS page should look like this:

The table to the left contains the following information:

 

NO

Shows the factor number. Please note that some factor numbers may be omitted if none of their items attained the factor-loading cutoff criterion. When factors are merged by the user, this column displays the numbers of all factors that have been merged together.

NAME

WordStat uses an algorithm to automatically provide a label for the extracted topic. This label may be edited by clicking the button.

KEYWORDS

Lists all keywords meeting the factor-loading cutoff criterion, in descending order of factor loading.

% VAR

Shows the percentage of variance explained. Please note that the smaller the segment one chooses, the lower this percentage will be.

FREQ

Displays the total frequency of all items listed in the keywords column.

CASES

Shows the number of cases containing at least one of the items listed in the keywords column.

% CASES

Displays the percentage of cases with at least one of the items listed in the keywords column.
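
The FREQ, CASES and % CASES columns can be illustrated with a short Python sketch. The function name `topic_table_row` and the representation of each case as a list of tokens are assumptions made for the example.

```python
def topic_table_row(keywords, cases):
    """Compute FREQ, CASES and % CASES for one topic.

    cases: list of token lists, one per case (hypothetical input format).
    """
    keyset = set(keywords)
    freq = sum(1 for tokens in cases for tok in tokens if tok in keyset)
    n_cases = sum(1 for tokens in cases if keyset.intersection(tokens))
    pct = 100.0 * n_cases / len(cases) if cases else 0.0
    return {"FREQ": freq, "CASES": n_cases, "% CASES": pct}

cases = [
    ["tax", "cut", "tax", "budget"],
    ["school", "funding"],
    ["budget", "deficit"],
    ["weather", "report"],
]
row = topic_table_row(["tax", "budget"], cases)
print(row)
```

Here the keywords occur four times in total but in only two of the four cases, so FREQ and CASES differ.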

 

 

Topic Modeling Buttons:

 

Allows one to delete the topic on the selected row.

 

Click to merge a topic into another one. One first needs to select the row containing the first topic one would like to merge, and then click this button. A dialog box will appear with a list of all other topics. Select the second topic and click OK.

 

To rename a topic, first select the topic and then click this button. Type the new name and click OK.

 

To retrieve segments associated with a topic, select it and click this button. By default, all text segments containing at least two keywords of the selected topic will be retrieved and presented in a table format. You may, however, change the type of segments retrieved (paragraphs, sentences or full documents) as well as the minimum number of topic words needed for retrieval.
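
The retrieval rule described above can be sketched as follows in Python. The function name `retrieve_segments` is hypothetical, and counting distinct keywords (rather than repeated occurrences of the same keyword) toward the minimum is an assumption of this example.

```python
import re

def retrieve_segments(segments, keywords, min_keywords=2):
    """Return segments containing at least min_keywords distinct keywords."""
    keyset = {k.lower() for k in keywords}
    hits = []
    for seg in segments:
        found = keyset.intersection(re.findall(r"[a-z']+", seg.lower()))
        if len(found) >= min_keywords:
            hits.append(seg)
    return hits

segments = [
    "The tax cut shrinks the budget.",
    "Tax season starts soon.",
    "The budget vote is delayed.",
    "Tax, tax, and more tax.",
]
topic_keywords = ["tax", "budget", "deficit"]
print(retrieve_segments(segments, topic_keywords))
print(len(retrieve_segments(segments, topic_keywords, min_keywords=1)))
```

Lowering the minimum to one keyword retrieves every segment in this toy list, while the default of two keeps only the segment that mentions both "tax" and "budget".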

 

Allows one to perform co-occurrence analysis of all the extracted topics including clustering and multidimensional scaling, and to create proximity plots as well as link charts. For more information on the various features available, see the Co-Occurrence Page topic.

 

 

Allows one to perform full crosstabulation analysis of all the displayed topics with structured data, to apply statistical analysis, and to create various charts such as correspondence plots, heatmaps, bubble charts, and bar charts. For more information on the various features available for crosstabulation analysis, see the Crosstab Page topic.

 

 

Stores the extracted topics currently displayed into a new categorization dictionary where folders at the first level correspond to different topics, and where each of those folders contains the associated words. A dialog box allows one to save

 

 

Press this button to append a copy of the topic table to the Report Manager. A descriptive title will be provided automatically. To edit this title or to enter a new one, hold down the SHIFT key while clicking this button (for more information on the Report Manager, see the Report Management Feature topic).

 

 

Allows one to store the topic table to disk in various formats, including Excel, tab- and comma-delimited files, plain text, HTML, XML, SPSS or Stata files.

 

 

Allows one to print a copy of the displayed chart.

 

 

Using the Right Panel

The right side of this table is a panel that allows one to look at the distribution of the selected topic among values of up to two structured variables. One may display this distribution using either a vertical bar chart, a horizontal bar chart or a line chart, by clicking on the corresponding button. Four statistics may also be represented on those charts:

 

Case Occurrence - number of cases in this subgroup containing at least one of these words.

Category Percent - percentage of cases in this subgroup containing at least one of these words.

Word Frequency - total number of these words in this subgroup.

Rate per 10,000 Words - rate of words in this subgroup per 10,000 words.
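
The four statistics listed above can be illustrated with a small Python sketch. The function name `subgroup_statistics`, the input format, and the use of the total number of tokens in the subgroup as the denominator of the rate are assumptions made for the example.

```python
def subgroup_statistics(cases, keywords):
    """Compute the four chart statistics for each subgroup.

    cases: list of (subgroup, tokens) pairs (hypothetical input format).
    """
    keyset = set(keywords)
    stats = {}
    for group, tokens in cases:
        s = stats.setdefault(group, {"n_cases": 0, "n_words": 0,
                                     "case_occurrence": 0, "word_frequency": 0})
        hits = sum(1 for t in tokens if t in keyset)
        s["n_cases"] += 1
        s["n_words"] += len(tokens)
        s["word_frequency"] += hits
        s["case_occurrence"] += 1 if hits else 0
    for s in stats.values():
        s["category_percent"] = 100.0 * s["case_occurrence"] / s["n_cases"]
        # Assumption: the rate is relative to all tokens in the subgroup.
        s["rate_per_10000"] = (10000.0 * s["word_frequency"] / s["n_words"]
                               if s["n_words"] else 0.0)
    return stats

cases_by_group = [
    ("urban", ["tax", "cut", "budget", "vote"]),
    ("urban", ["weather", "report", "today", "sunny"]),
    ("rural", ["tax", "farm"]),
]
stats = subgroup_statistics(cases_by_group, ["tax", "budget"])
for group, s in stats.items():
    print(group, s["case_occurrence"], s["category_percent"],
          s["word_frequency"], s["rate_per_10000"])
```

Because the rate is normalized by the amount of text, it allows subgroups of very different sizes to be compared, which raw word frequency does not.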

Right-clicking anywhere in the chart areas displays a popup menu that allows one to edit the chart, save it to disk or in the Report Manager, or to copy it to the clipboard. Clicking a specific bar or a data point of a line chart also allows one to retrieve text segments associated with the selected class and containing words of the selected topic.