This is just a prelude to a complete spellcheck addon for the discussion forum. I figured I’d start with a basic autocomplete that ties into the wordlist.
Essentially, all spellcheckers refer to a global wordlist in the specified language, and any words that aren’t in it get flagged.
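To make that idea concrete, here is a minimal JavaScript sketch of the flagging step. The five-word list and the findUnknownWords name are made up for illustration; a real checker would load the full wordlist and handle more than plain a-z.

```javascript
// Hypothetical, tiny word list; a real one has many thousands of entries.
const wordlist = new Set(["the", "quick", "brown", "fox", "jumps"]);

// Returns every word in `text` that is missing from `words`.
function findUnknownWords(text, words) {
  // Lowercase first, then pull out runs of letters and apostrophes.
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  return tokens.filter((token) => !words.has(token));
}

console.log(findUnknownWords("The quikc brown fox jumsp", wordlist));
```

Anything the function returns is what would get the red squiggly underline.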
The hardest part of this turned out to be finding a decent wordlist. I was actually surprised at the delicate balance between finding a “good enough” list and one that’s “too good”. Too good? Yes, apparently a list that has too many words means you will get a lot of misses, where an actual misspelling happens to match an obscure word… and you didn’t mean to use obscure words.
The final list I settled on has a word count of 125,346 and comes from the Ispell project, which also includes common acronyms. Note: This is not the same as Iespell (written ieSpell), although if you Google “Ispell”, you’ll get “ieSpell” as the first result. Ispell lists are available for download at the Kevin’s Wordlist page. I have also combined the 4 main English lists into one file (MS Word). WordPress, strangely, won’t allow plain text files to be uploaded, but allows rich text documents. Email me if you want the plaintext version.
I started with a simple DB table to store all the entries. Since I may also be adding more languages, I also have a WordLang field which can be something small like “en”, “de”, “fr” etc…
I then created an MVC app and loaded each of the wordlist files into the db using a simple function (this can take a while depending on filesize):
    public List<Wordentry> GetWords(string p)
    {
        var query = from line in File.ReadAllLines(p)
                    let text = NormalizeString(line) // Normalize once per line
                    select new Wordentry
                    {
                        WordText = text,
                        WordLowercase = text.ToLower(),
                        WordLang = "en"
                    };

        return query.ToList();
    }
After feeding it a HostingEnvironment.MapPath to the filename, I can use this to load all entries into the list and call db.Wordentries.InsertAllOnSubmit on the result, followed by db.SubmitChanges to commit. NormalizeString is another helper function which I will list below.
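If you want to see the load-and-normalize idea without a database, here is a rough JavaScript sketch of the same steps. The normalizeWord and parseWordlist names are mine, and the trimming and blank-line skipping are extras I’m assuming you’d want; the field names just mirror the Wordentry columns.

```javascript
// Strips accents: decompose to NFD, then drop non-spacing marks.
function normalizeWord(raw) {
  if (!raw) return "";
  return raw.trim().normalize("NFD").replace(/\p{Mn}/gu, "");
}

// Turns raw wordlist text (one word per line) into entry objects.
function parseWordlist(text, lang = "en") {
  return text
    .split(/\r?\n/)
    .map(normalizeWord)
    .filter((word) => word.length > 0) // skip blank lines
    .map((word) => ({
      WordText: word,
      WordLowercase: word.toLowerCase(),
      WordLang: lang,
    }));
}

console.log(parseWordlist("Caf\u00e9\n\nna\u00efve\r\nNASA"));
```

Storing the lowercase form alongside the original is what lets the search later match case-insensitively while still returning the proper spelling.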
I’m using a Spellword model instead of directly using the Wordentry object since I may want to extend the returned result in the future and changing the columns in the DB wouldn’t be practical.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;

    namespace Spellcheck.Models
    {
        public class Spellword
        {
            public int Id { get; set; }
            public string Spelling { get; set; }
            public string Lowercase { get; set; }
            public string Lang { get; set; }
        }
    }
And we’re using a SpellRepository class so we can keep the controllers free of too much data access stuff.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.IO;
    using System.Text;
    using System.Globalization;

    namespace Spellcheck.Models
    {
        public class SpellRepository
        {
            // DataContext global
            private readonly CMDataContext db;

            public SpellRepository(CMDataContext _db)
            {
                db = _db;
            }

            /// <summary>
            /// Counts the total number of word entries
            /// </summary>
            /// <returns>Wordcount int</returns>
            public int GetCount()
            {
                return (from w in db.Wordentries select w.WordText).Count();
            }

            /// <summary>
            /// Searches a given word or word fragment
            /// </summary>
            /// <param name="word">Search word/fragment</param>
            /// <param name="limit">Number of returned results. Defaults to 10</param>
            /// <param name="lang">Language to search</param>
            /// <param name="lower">Search lowercase field only</param>
            /// <returns>List of spellwords</returns>
            public List<Spellword> GetWords(string word, int limit = 10,
                string lang = "en", bool lower = true)
            {
                word = (lower) ? NormalizeString(word.ToLower()) : NormalizeString(word);

                var query = from w in db.Wordentries select w;

                // Get only unique entries in case we have
                // duplicates in the db (Edited from an earlier "GroupBy")
                query = query.Distinct().OrderBy(w => w.WordLowercase);

                // If a language code was specified
                if (!string.IsNullOrEmpty(lang))
                    query = query.Where(w => w.WordLang == lang);

                // Lowercase?
                query = (lower) ?
                    query.Where(w => w.WordLowercase.StartsWith(word)) :
                    query.Where(w => w.WordText.StartsWith(word));

                // Order alphabetically
                query = query.OrderBy(w => w.WordLowercase);

                return (from w in query
                        select new Spellword
                        {
                            Id = w.WordId,
                            Spelling = w.WordText,
                            Lowercase = w.WordLowercase,
                            Lang = w.WordLang
                        }).Take(limit).ToList();
            }

            /// <summary>
            /// Inserts a new list of words into the spellcheck library
            /// </summary>
            public void SaveWords(List<Spellword> Words)
            {
                // Keep only the first entry for each spelling
                var query = Words.GroupBy(w => w.Spelling)
                    .Select(w => w.First())
                    .OrderBy(w => w.Spelling).ToList();

                List<Wordentry> Entries = (from w in query
                                           orderby w.Spelling ascending
                                           select new Wordentry
                                           {
                                               WordText = w.Spelling,
                                               WordLowercase = w.Lowercase,
                                               WordLang = w.Lang
                                           }).ToList();

                db.Wordentries.InsertAllOnSubmit(Entries);
                db.SubmitChanges();
            }

            /// <summary>
            /// Helper function normalizes a given word to the Unicode equivalent
            /// </summary>
            /// <param name="txt">Raw word</param>
            /// <returns>Normalized word</returns>
            private static string NormalizeString(string txt)
            {
                // Nothing to normalize (also avoids a null reference below)
                if (String.IsNullOrEmpty(txt))
                    return String.Empty;

                // Decompose, then drop the non-spacing (accent) marks
                StringBuilder sb = new StringBuilder();
                sb.Append(
                    txt.Normalize(NormalizationForm.FormD).Where(
                        c => CharUnicodeInfo.GetUnicodeCategory(c)
                            != UnicodeCategory.NonSpacingMark).ToArray()
                );

                return sb.ToString().Normalize(NormalizationForm.FormD);
            }
        }
    }
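For anyone who wants to poke at the search logic without setting up the database, here is a small in-memory JavaScript sketch of what GetWords does: filter by language, match on prefix, drop duplicates, sort, and cap the results. The sample entries are invented, and this version only lowercases the search term rather than fully normalizing it.

```javascript
// Hypothetical in-memory stand-in for the Wordentries table.
const entries = [
  { WordText: "apply", WordLowercase: "apply", WordLang: "en" },
  { WordText: "Apple", WordLowercase: "apple", WordLang: "en" },
  { WordText: "Apfel", WordLowercase: "apfel", WordLang: "de" },
  { WordText: "apply", WordLowercase: "apply", WordLang: "en" }, // duplicate
  { WordText: "banana", WordLowercase: "banana", WordLang: "en" },
];

function getWords(rows, word, limit = 10, lang = "en") {
  const prefix = word.toLowerCase();
  const seen = new Set();
  return rows
    .filter((e) => !lang || e.WordLang === lang) // language filter
    .filter((e) => e.WordLowercase.startsWith(prefix)) // prefix match
    .filter((e) => {
      // keep only the first copy of each word
      if (seen.has(e.WordLowercase)) return false;
      seen.add(e.WordLowercase);
      return true;
    })
    .sort((a, b) => (a.WordLowercase < b.WordLowercase ? -1 : 1))
    .slice(0, limit); // cap the number of suggestions
}

console.log(getWords(entries, "AP").map((e) => e.WordText));
```

The order of operations matters a little: sorting before slicing means the limit always returns the alphabetically first matches, which is what you want from a suggestion list.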
To use this, we’ll just add a JsonResult action to our controller. I just created a Suggestions action in the default Home controller since this is just an example.
    public JsonResult Suggestions(string word, int limit = 10, string lang = "en")
    {
        List<Spellword> Words = new List<Spellword>();

        if (!string.IsNullOrEmpty(word))
        {
            using (CMDataContext db = new CMDataContext())
            {
                SpellRepository repository = new SpellRepository(db);

                // 10 results is usually enough
                Words = repository.GetWords(word, limit, lang);
            }
        }

        // Need to use AllowGet or else we'll need to use POST
        return Json(Words, JsonRequestBehavior.AllowGet);
    }
… And that pretty much covers the backend for now.
To test whether the word suggestions work, we’ll set up a single autocomplete textbox. Just add the jQuery and jQuery UI script files and the jQuery UI CSS to your layout first, then add this to the default view:
<script type="text/javascript">
    $(function () {
        var searchtext = $("#search");
        searchtext.autocomplete({
            source: function (request, response) {
                $.ajax({
                    url: "/Home/Suggestions", // Or your controller
                    dataType: "json",
                    data: { word: request.term },
                    success: function (data) {
                        // Returned data follows the Spellword model
                        response($.map(data, function (item) {
                            return {
                                id: item.Id,
                                label: item.Spelling,
                                value: item.Lowercase
                            };
                        }));
                    }
                });
            },
            minLength: 3 // Option names are case-sensitive
        });
    });
</script>

<form action="/" method="post">
    <input id="search" type="text" name="search" />
</form>
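The only real logic on the client side is reshaping the JSON into the label/value objects that jQuery UI expects. Here is that step pulled out as a plain function so it can be tried without a browser; the sample response is invented, but its shape follows the Spellword model.

```javascript
// Hypothetical response, shaped like the Spellword model.
const sampleResponse = [
  { Id: 1, Spelling: "Apple", Lowercase: "apple", Lang: "en" },
  { Id: 2, Spelling: "apply", Lowercase: "apply", Lang: "en" },
];

// Maps Spellword objects to autocomplete items:
// `label` is shown in the dropdown, `value` goes into the textbox.
function toAutocompleteItems(data) {
  return data.map((item) => ({
    id: item.Id,
    label: item.Spelling,
    value: item.Lowercase,
  }));
}

console.log(toAutocompleteItems(sampleResponse));
```

If you’d rather have the textbox receive the original spelling instead of the lowercase one, swapping value to item.Spelling should be all it takes.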
Fun fact: Total misspellings as I was writing this (excluding Ispell/ieSpell names and code) before running spellcheck = 12.
Yeah, I really can’t spell.