Several days ago, I discovered text2vec, a new R package for text mining. It is a great package. I used it together with other packages to predict the defect name from the defect description text. The idea is similar to sentiment analysis: just change the values of the response variable Y from positive/negative to defect name 1/2/3/…

Below is a flowchart of the text mining process.

  1. Text Data

  2. Segmentation (for Chinese text): jiebaR package

  3. Transform to structured data, such as a matrix, using the text2vec package. For English text, this package can split sentences into words, which is similar to segmentation.

     3.1 Data Cleaning

     3.2 Transform to Document-Term Matrix

  4. Model

     For example, sentiment analysis uses a classification model to predict positive or negative.
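The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the text2vec API and not the code from the posts listed below: the function names (`tokenize`, `build_dtm`, `train_nb`, `predict`) are hypothetical, the defect descriptions are made-up toy data, and multinomial Naive Bayes is swapped in as one possible classification model. For Chinese text, the `tokenize` step would be replaced by a segmenter such as jiebaR.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Data cleaning + word splitting for English: lowercase, keep letter runs."""
    return re.findall(r"[a-z]+", text.lower())

def build_dtm(docs):
    """Return (vocab, dtm) where dtm[i][j] counts term vocab[j] in document i."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: j for j, t in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for t in toks:
            dtm[i][index[t]] += 1
    return vocab, dtm

def train_nb(dtm, labels):
    """Multinomial Naive Bayes with add-one smoothing.

    Returns {class: (log_prior, [log_likelihood per term])}.
    """
    n_terms = len(dtm[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [row for row, y in zip(dtm, labels) if y == c]
        term_totals = [sum(col) for col in zip(*rows)]
        denom = sum(term_totals) + n_terms
        log_prior = math.log(len(rows) / len(labels))
        log_lik = [math.log((n + 1) / denom) for n in term_totals]
        model[c] = (log_prior, log_lik)
    return model

def predict(model, vocab, text):
    """Score each class on the terms of `text` that appear in the vocabulary."""
    index = {t: j for j, t in enumerate(vocab)}
    counts = Counter(t for t in tokenize(text) if t in index)
    scores = {
        c: log_prior + sum(n * log_lik[index[t]] for t, n in counts.items())
        for c, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Toy, made-up defect descriptions (assumption: not real data).
docs = [
    "screen cracked after drop",
    "cracked glass on screen",
    "battery drains fast",
    "battery will not charge",
    "button stuck and unresponsive",
    "power button stuck",
]
# Response variable Y: defect names instead of positive/negative.
labels = ["screen", "screen", "battery", "battery", "button", "button"]

vocab, dtm = build_dtm(docs)
model = train_nb(dtm, labels)
print(predict(model, vocab, "the screen glass is cracked"))
```

Switching between sentiment analysis and defect-name prediction only changes the `labels` list; the document-term matrix and the model code stay the same.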

Below is my R and Python code:

Text Mining Series 1.1: Sentiment Analysis in Movie Review Dataset in English by R

Text Mining Series 1.2: Predict Defect Name From Defect Description Text in Chinese by R

Text Mining Series 1.3: Predict Defect Name From Defect Description Text in Chinese by Python

Your advice and suggestions are welcome!

For the record, this article was posted on LinkedIn and had 63 views as of November 2021.