Stemming
A given word is either a stem word itself or an inflection of a stem word. For example, bikes is an inflection of bike, and car is the stem word of cars. In this context, stemming means getting the stem word from an inflected word.
The underlying WeCantSpell.Hunspell package doesn't support stemming directly, but by putting the right pieces together, the Stem method of the HunspellTextAnalyzer class from this package can be used to get the stem word.
@using System.Web.Hosting
@using Skybrud.TextAnalysis.Hunspell
@using Skybrud.TextAnalysis.Hunspell.Stem
@{

    // Map the path to the dictionary and affix files
    string dic = HostingEnvironment.MapPath("~/App_Data/Hunspell/en-US.dic");
    string aff = HostingEnvironment.MapPath("~/App_Data/Hunspell/en-US.aff");

    // Load a new text analyzer (Hunspell wrapper)
    HunspellTextAnalyzer analyzer = HunspellTextAnalyzer.CreateFromFiles(dic, aff);

    // Get the stem words of "bikes" (underlying package only ever returns one stem)
    HunspellStemResult[] stems = analyzer.Stem("bikes");

}
In this example, the Stem method returns an array with the stem word bike as the only item.
Note that the underlying WeCantSpell.Hunspell package doesn't support returning multiple stem words (as opposed to the older NHunspell package used in v1.x of this package).
Compound Words
In some languages (e.g. Danish), compound words are written as a single word without any separators. For example, summer house (a combination of summer and house) is spelled sommerhus (a combination of sommer and hus) in Danish.
This causes a few problems with the implementation in the WeCantSpell.Hunspell package, so the HunspellStemResult class is something we've built on top of their implementation.
To add better support for working with compound words in these languages, the HunspellStemResult class exposes the Stem and Prefix properties, as well as the Value property, which is the combination of the Prefix and Stem properties. This is particularly useful for morph operations.
@using System.Web.Hosting
@using Skybrud.TextAnalysis.Hunspell
@using Skybrud.TextAnalysis.Hunspell.Stem
@{

    // Map the path to the dictionary and affix files
    string dic = HostingEnvironment.MapPath("~/App_Data/Hunspell/da-DK.dic");
    string aff = HostingEnvironment.MapPath("~/App_Data/Hunspell/da-DK.aff");

    // Load a new text analyzer (Hunspell wrapper)
    HunspellTextAnalyzer analyzer = HunspellTextAnalyzer.CreateFromFiles(dic, aff);

    // Get the stem words of "webredaktører" (underlying package only ever returns one stem)
    HunspellStemResult[] stems = analyzer.Stem("webredaktører");

    foreach (HunspellStemResult stem in stems) {
        <pre>Value: @stem.Value</pre>
        <pre>Stem: @stem.Stem</pre>
        <pre>Prefix: @stem.Prefix</pre>
        <br />
    }

}
In this example, the Danish compound word webredaktører is split into web (the Prefix property) and redaktør (the Stem property), which combined give webredaktør (the Value property). Morph operations are based on the stem only, but the prefix is then automatically prepended to each inflection returned by the morph operation, so the final result is still correct for compound words.
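The prefix handling described above can be illustrated with a small standalone sketch. It does not use this package or WeCantSpell.Hunspell: the CompoundStem class and its ApplyPrefix method are hypothetical stand-ins made up for illustration, and the list of inflections is hand-written sample data rather than the output of a real morph operation. It only shows the idea that Value is the prefix followed by the stem, and that the prefix is prepended to each inflection of the stem.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for illustration only; this is NOT the package's
// HunspellStemResult class, it merely mirrors the three properties described above.
public sealed class CompoundStem
{
    // The leading part of the compound word (e.g. "web"); empty for non-compound words
    public string Prefix { get; }

    // The stem that inflections are based on (e.g. "redaktør")
    public string Stem { get; }

    // The prefix followed by the stem (e.g. "webredaktør")
    public string Value => Prefix + Stem;

    public CompoundStem(string prefix, string stem)
    {
        Prefix = prefix ?? string.Empty;
        Stem = stem ?? throw new ArgumentNullException(nameof(stem));
    }

    // Prepends the prefix to each inflection of the stem (assumed behaviour, sketched here)
    public IEnumerable<string> ApplyPrefix(IEnumerable<string> stemInflections)
    {
        return stemInflections.Select(inflection => Prefix + inflection);
    }
}

public static class Program
{
    public static void Main()
    {
        CompoundStem result = new CompoundStem("web", "redaktør");

        // Hand-written sample inflections of "redaktør"; a real morph operation
        // would derive these from the dictionary and affix files
        string[] inflections = { "redaktør", "redaktøren", "redaktører", "redaktørerne" };

        Console.WriteLine("Value: " + result.Value);

        foreach (string word in result.ApplyPrefix(inflections))
        {
            Console.WriteLine(word);
        }
    }
}
```

With this sample data, every word written by the loop starts with web, so the inflected forms remain valid compound words even though the inflections themselves were derived from redaktør alone.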