Reduce words to their root forms
Stemming is a text normalization technique in natural language processing that reduces words to their root forms. It works primarily by removing affixes from words, so the resulting stem may not be a valid dictionary word.
Stemming is commonly used for:
- Information retrieval, where stemmed words are used as synonyms to expand search criteria
- Engineering applications to reduce dimensionality, where stemming results in fewer words to be tracked and used in a model with machine learning algorithms
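Both uses can be illustrated with a minimal Python sketch. The suffix list and the `toy_stem`, `build_stem_index`, and `expand_query` helpers below are hypothetical names invented for this illustration; they are far cruder than a real stemmer and only show how grouping words by stem shrinks the vocabulary and lets one search term match its variants.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; real stemmers use richer rules.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(vocabulary):
    """Map each stem to the set of vocabulary words that reduce to it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[toy_stem(word)].add(word)
    return index


def expand_query(term, index):
    """Return every indexed word sharing the query term's stem."""
    return sorted(index.get(toy_stem(term), {term}))


vocabulary = ["connect", "connected", "connecting", "connects", "search", "searching"]
index = build_stem_index(vocabulary)

# Dimensionality reduction: fewer distinct features to track in a model.
print(len(vocabulary), "distinct words ->", len(index), "stems")

# Information retrieval: a query for one form also matches the other forms.
print(expand_query("connecting", index))
```

Keying the index on stems means a model or search engine tracks one entry per stem rather than one per surface form.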
Porter’s Stemming Algorithm
The Porter stemming algorithm is one of the most popular stemming approaches for the English language and is based on simple heuristic rules. It is fast but not always accurate. Many other algorithms have been proposed since, but Porter’s stemming algorithm remains popular due to its speed and simplicity.
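To give a sense of what these heuristic rules look like, the Python sketch below covers only the plural-handling rules usually described as the first step of Porter's algorithm (SSES → SS, IES → I, SS → SS, S → nothing). The function name `porter_step_1a` is chosen here for illustration. The full algorithm applies several further steps with conditions on the remaining stem, which this sketch omits.

```python
def porter_step_1a(word):
    """Apply only the plural-handling rules of Porter's first step.

    The rules are tried in order and the first (longest) match wins.
    This is a partial illustration, not the complete algorithm.
    """
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


for example in ["caresses", "ponies", "caress", "cats"]:
    print(example, "->", porter_step_1a(example))
```

The result for "ponies" is "poni", which is not a dictionary word. This is the kind of output that makes rule-based stemming fast but not always accurate.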
Stemming vs. Lemmatization
A related but more sophisticated approach to stemming is lemmatization. In comparison:
- Lemmatization uses vocabulary and morphological analysis, whereas stemming uses simple heuristic rules
- Lemmatization returns dictionary forms of the words, whereas stemming may result in invalid words
The differences between lemmatization and stemming are shown below.
Actual Word | Lemmatization | Stemming |
---|---|---|
Requiring | Require | Requir |
Required | Require | Requir |
Requirement | Requirement | Requir |
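
The contrast in the table can be reproduced with a toy Python sketch. Both helpers are hypothetical simplifications made up for this example: `toy_lemmatize` stands in for vocabulary-based analysis with a three-entry lookup table, and `toy_stem` strips a hard-coded suffix list rather than running Porter's rules.

```python
# Hypothetical mini-dictionary standing in for real vocabulary and
# morphological analysis; a real lemmatizer covers the whole language.
LEMMA_LOOKUP = {
    "requiring": "require",
    "required": "require",
    "requirement": "requirement",
}

# Hypothetical suffix list chosen to cover only the three example words.
STEM_SUFFIXES = ("ement", "ing", "ed")


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_LOOKUP.get(word, word)


def toy_stem(word):
    """Strip the first matching suffix without checking any dictionary."""
    word = word.lower()
    for suffix in STEM_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for word in ["Requiring", "Required", "Requirement"]:
    print(f"{word:<12} {toy_lemmatize(word):<12} {toy_stem(word)}")
```

The lookup returns valid dictionary forms, while the suffix stripper collapses all three words to the same non-word stem.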
In MATLAB, stemming can be done using the “normalizeWords” function with the default style option of ‘stem’. To learn more about stemming and building models with text data, see Text Analytics Toolbox™.
See also: natural language processing, sentiment analysis, word2vec, n-gram, text mining with MATLAB, data science, deep learning, Deep Learning Toolbox™, Statistics and Machine Learning Toolbox™