Description
We will explore how to apply basic natural language processing (NLP) to the text, including tokenization and stemming to combine word forms, and stop word removal and sentence detection to examine word sequences (n-grams). We will then look at the distribution of vocabulary terms and n-grams in our data set using term frequency and inverse document frequency (TF.IDF).
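As a rough illustration of these pre-processing steps, the sketch below chains NLTK's tokenizer, Porter stemmer, stop word list, and n-gram helper. The sample sentence and variable names are invented for demonstration; the actual pipeline applied to the job-description data set may differ in its details.

    # A minimal sketch of the pre-processing steps named above, using NLTK.
    # The sample sentence is a made-up placeholder, not data from the talk.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    nltk.download("punkt")      # tokenizer models ("punkt_tab" on newer NLTK versions)
    nltk.download("stopwords")  # English stop word list

    text = "The analyst performs experiments and is performing data analysis."

    tokens = word_tokenize(text.lower())       # tokenization
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]  # "performs"/"performing" -> "perform"
    stop = set(stopwords.words("english"))
    content = [t for t in stems if t.isalpha() and t not in stop]  # stop word removal
    bigrams = list(ngrams(content, 2))         # word sequences (n-grams, here n=2)

    print(content)
    print(bigrams)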
Abstract
In the initial analysis of a data set, it is useful to gather informative summaries. This includes evaluating the available fields, by finding unique value counts or by calculating summary statistics such as averages for numerical fields. These summaries help in understanding what is in the data and its underlying quality, and they illuminate potential paths for further exploration. In structured data this is a straightforward task, but unstructured text requires different types of summaries. Useful examples for text data include the number of documents in which a term occurs and the number of times a term occurs within a document. Since vocabulary terms often have variant forms, e.g. “performs” and “performing”, it is useful to pre-process the text and combine these forms before computing distributions. Oftentimes we want to look at sequences of words; for example, we may want to count the number of times “data science” occurs, and not just “data” and “science” separately. We will use the pandas Python Data Analysis Library and the Natural Language Toolkit (NLTK) to process a data set of job descriptions posted by employers in the United States, and look at how vocabularies differ across job segments.
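To make these distributional summaries concrete, the sketch below computes document frequency, per-document term frequency, and a simple TF.IDF score (tf × log(N/df)) with pandas over a toy corpus of invented snippets; the analysis in the talk runs over the full job-description data set.

    # A minimal sketch of the term distributions described above: document
    # frequency, term frequency, and TF.IDF, computed with pandas over a
    # toy corpus of invented job-description snippets.
    import math
    import pandas as pd

    docs = pd.Series([
        "data science and machine learning",
        "data engineering and data pipelines",
        "sales and account management",
    ])

    tokenized = docs.str.split()
    n_docs = len(docs)

    # Document frequency: the number of documents in which each term occurs.
    df = pd.Series(
        [term for terms in tokenized for term in set(terms)]
    ).value_counts()

    # Term frequency per document, then TF.IDF = tf * log(N / df).
    for i, terms in tokenized.items():
        tf = pd.Series(terms).value_counts()
        tfidf = tf * (n_docs / df[tf.index]).apply(math.log)
        print(f"doc {i}:", tfidf.sort_values(ascending=False).head(3).to_dict())

Libraries such as scikit-learn ship ready-made TF.IDF vectorizers; the manual computation here is only meant to make the definitions concrete.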