Before we can process text, we need to clean it up in a few steps:

  • convert the whole text to lowercase – in R: tm_map(ourtextvariable, content_transformer(tolower))
  • remove punctuation – in R: tm_map(ourtextvariable, removePunctuation)
  • remove numbers – in R: tm_map(ourtextvariable, removeNumbers)
  • remove unnecessary (stop) words – in R: tm_map(ourtextvariable, removeWords, stopwords("english"))
  • remove extra whitespace – in R: tm_map(ourtextvariable, stripWhitespace)
  • create a document-term matrix – in R: DocumentTermMatrix(ourtextvariable)
  • remove sparse terms (terms that occur in only a few documents) – in R: removeSparseTerms() applied to the term matrix
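
Put together, the steps above might look roughly like the sketch below. It assumes the tm package is installed. The three sample sentences, the names dtm and dtm_dense, and the sparsity threshold of 0.5 are made up for illustration only; on a real corpus a threshold much closer to 1 is more usual.

```r
library(tm)

# Hypothetical sample documents, used only for illustration.
docs <- c("The quick brown fox jumps over 2 lazy dogs!",
          "A lazy dog sleeps all day; the fox does not.",
          "Foxes and dogs are both mammals, 100% true.")

# Build a corpus with one document per element of the vector.
ourtextvariable <- VCorpus(VectorSource(docs))

# Base R functions such as tolower must be wrapped in content_transformer()
# so that the result is still a corpus of text documents.
ourtextvariable <- tm_map(ourtextvariable, content_transformer(tolower))
ourtextvariable <- tm_map(ourtextvariable, removePunctuation)
ourtextvariable <- tm_map(ourtextvariable, removeNumbers)
ourtextvariable <- tm_map(ourtextvariable, removeWords, stopwords("english"))
ourtextvariable <- tm_map(ourtextvariable, stripWhitespace)

# Rows are documents, columns are terms, cells are term counts.
dtm <- DocumentTermMatrix(ourtextvariable)

# Drop terms that are missing from too many documents.
# The value 0.5 is an arbitrary choice for this tiny example.
dtm_dense <- removeSparseTerms(dtm, 0.5)

inspect(dtm_dense)
```

The order matters a little: lowercasing comes before stop word removal because the list returned by stopwords("english") is lowercase, so a capitalised "The" would otherwise be left in.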