Pseudo language generator

The usual method of placing filler text on a page or in a web design was to copy the Lorem Ipsum text found all over the place. The one downside to this was making sure the text wasn’t too recognizable, considering everyone uses it, and making sure you had enough of it for a particularly big layout or dummy text range.

For some reason this didn’t seem particularly elegant, especially given the aforementioned commonality, and the only other alternative was to create actual posts with, you know, a post. Well, that didn’t seem elegant either, since those posts tend to look contrived and stupid… a lot like most of the “real” posts on the internet, sadly.

So I set about creating my own pseudo language generator in JavaScript that I can plug in anywhere to generate as much text as needed without worrying about repeating myself. I’m home today thanks to doctor’s orders and bored to death of walking up and down the apartment, so I might as well do it now…

First step : Wikipedia

Apparently, in English, ETAON RISHD LFCMU GYPWB VKXJQ Z is the alphabet arranged from the most frequently used letter to the least. While this seems logical to me, it didn’t make sense to mix the vowels in with the consonants, since I needed to match a consonant with at least one vowel. In my own cursory observation of Lorem Ipsum, the vowels tended to have an even distribution, but since I’m trying to match English as much as possible, I sorted the vowels by decreasing frequency: eaoiu, and the consonants as well: tnrshdlfcmgypwbvkxjqz.

To make it easier to pair them with a consonant, I figured I’d put all the vowels together and randomly pick one from the pool, but that ignores their relative frequency.

The easiest way to make sure the most frequent letters are picked more often than the least frequent was to multiply the occurrences of the most frequent letters in the pool. I could have repeated the letters by hand in the pool, but that felt hackish, so for this I wrote a simple multiply function.

function multFrequency(chars) {
	var cn = '';
	for(var i = chars.length; i > 0; i--) {
		var ch = chars.charAt(chars.length - i);
		for(var j = i; j >= 0; j--) {
			cn += ch;
		}
	}
	return cn;
}

What’s going on here? Well, first we iterate through the pool with i starting at the pool’s length and decrementing by one. Each letter is selected with charAt(chars.length - i), so we walk the pool front to back while i counts down. At each letter, we repeatedly add it to the new pool, “cn”, i + 1 times, because the inner loop runs j from i down to and including 0. Since i decreases as the loop continues, characters further down the pool get added fewer times, and even the last letter still gets added twice.
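To see what the weighting does, here it is run on a tiny pool. This is only a sketch: multFrequency is repeated as listed above so the snippet runs on its own, while the "abc" pool and the countChars helper are made up for illustration.

```javascript
// Same weighting function as above, repeated so this runs on its own
function multFrequency(chars) {
	var cn = '';
	for (var i = chars.length; i > 0; i--) {
		var ch = chars.charAt(chars.length - i);
		// j runs from i down to 0, so ch is appended i + 1 times
		for (var j = i; j >= 0; j--) {
			cn += ch;
		}
	}
	return cn;
}

// Hypothetical helper: counts how often each character appears
function countChars(str) {
	var counts = {};
	for (var i = 0; i < str.length; i++) {
		var ch = str.charAt(i);
		counts[ch] = (counts[ch] || 0) + 1;
	}
	return counts;
}

var pool = multFrequency('abc');
console.log(pool);             // aaaabbbcc
console.log(countChars(pool)); // { a: 4, b: 3, c: 2 }
```

For the weighting to have any effect in the generator, the pools would presumably be passed through multFrequency once before any letters are picked from them.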

Pairing ending letters

According to the Wikipedia article, English also has common pairings. TH HE AN RE ER IN ON AT ND ST ES EN OF TE ED OR TI HI AS TO.

I figured it will be easier to again split these up by pairings that start with vowels and those that start with consonants. So that leaves :

Consonant pairs : “th”, “he”, “nd”, “st”, “te”, “ti”, “hi”, “to”.
Vowel pairs : “an”, “er”, “in”, “on”, “at”, “es”, “en”, “of”, “ed”, “or”, “as”

So we need another function that takes a letter and randomly selects an ending letter from the pairs that start with it. This is actually a lot easier than it sounds. The first part of this problem is making a little helper function that will randomly give a number between a minimum and maximum.

// Random number helper (returns an integer from min to max, both inclusive)
function rnd(min, max) {
	return Math.floor(Math.random() * (max - min + 1)) + min;
}
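With a helper like this it’s worth checking whether min and max themselves can actually come out. Below is a small sketch of an inclusive helper for doing that. The name rndInclusive and the optional third argument (a stand-in for Math.random, so the edges can be checked without relying on luck) are assumptions for illustration and not part of the generator.

```javascript
// Hypothetical inclusive helper: returns an integer from min to max,
// both ends included. The optional third argument replaces Math.random
// so the boundaries can be checked deterministically.
function rndInclusive(min, max, random) {
	random = random || Math.random;
	return Math.floor(random() * (max - min + 1)) + min;
}

// Math.random() returns values in [0, 1), so these are the two extremes
console.log(rndInclusive(2, 10, function () { return 0; }));        // 2
console.log(rndInclusive(2, 10, function () { return 0.999999; })); // 10
```

The "+ 1" is what lets the top value appear. Without it, the largest result would be max - 1.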

Now we’re going to create the pair matching helper.

function fromPair(pairs, p) {
	var nc = '';
	for(var i = 0; i < pairs.length; i++) {
		if (pairs[i].charAt(0) == p)
			nc += pairs[i].charAt(1);
	}
	if (nc == '')
		return nc;
	
	// Or else...
	return fromRange(nc);
}

What this does is iterate through each pair in the pool and take the second letter of each pair matching the first letter and create a new pool, “nc”. If “nc” is empty, then it didn’t find a matching pair and returns an empty string, but if at least one pair was found, it will randomly select from this pool… using the function below.
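The pool-building half of that can be looked at on its own, without the random pick at the end. This is a sketch only: pairEndings is a made-up name, and the pair lists are the ones given earlier.

```javascript
var consonantPairs = ["th", "he", "nd", "st", "te", "ti", "hi", "to"];
var vowelPairs = ["an", "er", "in", "on", "at", "es", "en", "of", "ed", "or", "as"];

// Hypothetical helper: collects the second letter of every pair
// that starts with the given letter
function pairEndings(pairs, first) {
	var pool = '';
	for (var i = 0; i < pairs.length; i++) {
		if (pairs[i].charAt(0) == first)
			pool += pairs[i].charAt(1);
	}
	return pool;
}

console.log(pairEndings(consonantPairs, 't')); // heio
console.log(pairEndings(vowelPairs, 'e'));     // rsnd
console.log(pairEndings(consonantPairs, 'z')); // (empty string, no pair starts with z)
```

So a word ending in “t” can pick up an h, e, i or o, while a word ending in “z” is left alone.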

We need a function that will avoid letter duplications. I could be wrong, but in the original Lorem Ipsum, I don’t recall seeing double vowels. I think this makes sense in our new pseudo language.

function fromRange(chars, p) {
	var c;
	if (arguments.length > 1) {
		do {
			c = chars.charAt(rnd(0, chars.length -1));
		} while(c == p);
	} else {
		c = chars.charAt(rnd(0, chars.length -1));
	}
	
	return c;
}

This function is what we’ll use to randomly select characters from any pool. It also doubles as a duplicate remover if the second parameter is specified. Basically, it will retry a random pick from the pool until the pick differs from the given character.
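Here is a stripped-down sketch of the retry idea on its own. pickExcept is a hypothetical stand-in for fromRange that indexes the pool directly instead of going through the rnd helper.

```javascript
// Hypothetical stand-in for fromRange: picks a random character from
// chars, retrying while the pick equals the character to avoid
function pickExcept(chars, avoid) {
	var c;
	do {
		c = chars.charAt(Math.floor(Math.random() * chars.length));
	} while (c == avoid);
	return c;
}

// Build a short run of vowels where no letter repeats back to back
var vowels = "eaoiu";
var run = '';
var last = '';
for (var i = 0; i < 20; i++) {
	last = pickExcept(vowels, last);
	run += last;
}
console.log(run); // random each time, but never contains a doubled letter
```

One caveat with this approach: if the pool holds nothing but the character to avoid, the loop never finishes, so it only suits pools with at least two different letters.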

Building the language

To give this the look and feel of true randomness, I put all the above constants (vowels, consonants, pairings) into variables. I also created a minimum and maximum value set for word length, sentence sizes and paragraph sizes to give the impression of random entries.

var vowels = "eaoiu";

// The consonants are placed in the order of their frequency of appearance
var consonants = "tnrshdlfcmgypwbvkxjqz";

// Letters commonly paired (with consonants first and vowels next)
var consonantPairs = ["th", "he", "nd", "st", "te", "ti", "hi", "to"];
var vowelPairs = ["an", "er", "in", "on", "at", "es", "en", "of", "ed", "or", "as"];

var wMin = 2;		// Minimum word length
var wMax = 10;		// Maximum word length
var sMin = 4;		// Minimum sentence size
var sMax = 20;		// Maximum sentence size
var pMin = 1;		// Minimum sentences per paragraph
var pMax = 3;		// Maximum sentences per paragraph
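var vFreq = 3;		// Every x-th character (starting with the first) is a consonant

That last variable, vFreq, is what I think will really make or break this. As the word loop below is written, every 3rd character, starting with the first, is a consonant and the two in between are vowels, which I think will make this seem realistic.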

Now we need a function to generate a realistic sounding word…

function getWord(u) {
	// Default to lowercase when no argument is given
	if(arguments.length < 1)
		u = false;
	
	var r = rnd(wMin, wMax);
	
	var w = '';	// Completed word holder
	var c = '';	// Generated letter holder
	
	for(var i = 0; i < r; i++) {
		// Every vFreq-th character, starting with the first, is a consonant; the rest are vowels
		if (i % vFreq == 0) {
			c = fromRange(consonants);
		} else {
			c = fromRange(vowels, c);
		}
			
		 // First letter of the word requested in uppercase
		if(u == true && i == 0)
			c = c.toUpperCase();
		w +=  c;
	}
	
	// Commonly paired letters
	if (consonants.indexOf(c) > -1 ) {
		w += fromPair(consonantPairs, c);
	} else {
		w += fromPair(vowelPairs, c);
	}
	
	return w;
}

This function has an argument to make the first letter upper case for use at the beginning of a sentence. Note the wMin and wMax variables we declared earlier, between which the word lengths vary. Also note that in the for loop, we’re using the fromRange function with the second parameter (to skip duplicates) specified for vowels. I’m also making use of the fromPair function, with the pair list chosen by whether the word ends in a consonant or a vowel.
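To make the consonant and vowel rhythm of that loop easier to see, here is a sketch that prints only the slot types. letterPattern is a made-up helper, and vFreq is redeclared so the snippet runs by itself.

```javascript
var vFreq = 3;

// Hypothetical helper: shows which slots the getWord loop fills with a
// consonant (C) and which with a vowel (V) for a given loop count
function letterPattern(length) {
	var p = '';
	for (var i = 0; i < length; i++) {
		p += (i % vFreq == 0) ? 'C' : 'V';
	}
	return p;
}

console.log(letterPattern(4)); // CVVC
console.log(letterPattern(8)); // CVVCVVCV
```

On top of this pattern, fromPair may append one more letter. A word like “Buedaehain” fits that shape: nine letters in the CVV rhythm, then an “n” that pairs with the final “i”.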

Now that we have the word generator we need a function that creates a sentence by repeatedly calling the above getWord function. Note the sMin and sMax variables that allow the sentence length to fluctuate.

// Creates a sentence (bunch of words ending in '. ');
function getSentence() {
	var r = rnd(sMin, sMax);
	var s = '';
	for(var i = 0; i < r; i++) {
		if(i == 0) // First letter in first word is uppercase
			s += getWord(true) + ' ';
		else
			s += getWord() + ' ';
	}
	
	return s.substring(0, s.length - 1) + '. ';
}

Finally, a very simple paragraph generator that calls getSentence between pMin and pMax times.

// Creates a paragraph (bunch of sentences wrapped in <p>)
function getParagraph() {
	var r = rnd(pMin, pMax);
	var p =  '<p>';
	
	for(var i = 0; i < r; i++) {
		p += getSentence();
	}
	
	return p + '</p>';
}

Putting these functions together, I created a paragraph that looks less like faux Latin and more like a Scandinavian language…

Buedaehain ges seist gieneof yauteof moareon noisoin daeceolan peobuohen rieyeiher sieqeawof cuekuaxeof deuliukan roapen teahan noifaogu liacon. Daogeadan rin xaegiehin can qeoviof dairin toefoatean rion teiceivean naijaeton riof rain hiakeof weawean.

But, oh well. For what it is, it does a good enough job, I think. Here’s a running demo of everything together.

For some reason, Modernizr kept throwing an error which means it doesn’t work in Firefox.

Update

Thanks to a very helpful comment by Lin, the code now works on Firefox. Turned out to be an encoding issue (Firefox doesn’t like ANSI and UTF dancing together).

Also a minor improvement:
I changed the following line in the multFrequency function…

for(var j = i; j >= 0; j--)

Into…

for(var j = (i * i + 1); j >= 0; j--)

This yielded much better distribution of letters for both vowels and consonants.
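To put numbers on the difference, here is a sketch that counts how many copies of each letter end up in the pool under both versions of the inner loop. copiesPerChar is a hypothetical helper written just for this comparison.

```javascript
// Hypothetical helper: number of copies each character gets in the
// weighted pool, for the original (linear) and updated (squared) loops
function copiesPerChar(chars, squared) {
	var counts = {};
	for (var i = chars.length; i > 0; i--) {
		var ch = chars.charAt(chars.length - i);
		var start = squared ? (i * i + 1) : i;
		// The inner loop runs from start down to 0, i.e. start + 1 times
		counts[ch] = start + 1;
	}
	return counts;
}

console.log(copiesPerChar("eaoiu", false)); // { e: 6, a: 5, o: 4, i: 3, u: 2 }
console.log(copiesPerChar("eaoiu", true));  // { e: 27, a: 18, o: 11, i: 6, u: 3 }
```

With the original loop, “e” is only three times as likely as “u”. With the squared version it is nine times as likely, so the gap between common and rare letters is much wider.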

Autocomplete with jQuery and MVC

This is just a prelude to a complete spellcheck addon to the discussion forum. I figured I’d start with basic autocomplete first that ties into the wordlist.

All spellcheckers essentially refer to a global wordlist in the specified language, and any words that don’t belong get flagged.
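In its simplest form that idea fits in a few lines. The sketch below uses a four-word stand-in list and a made-up flagUnknown helper. It only illustrates the lookup and is nothing like a full spellchecker.

```javascript
// Tiny stand-in wordlist; a real one has many thousands of entries
var wordlist = { "the": true, "quick": true, "brown": true, "fox": true };

// Hypothetical helper: returns the words in text that are not in the list
function flagUnknown(text, words) {
	var tokens = text.toLowerCase().match(/[a-z']+/g) || [];
	var flagged = [];
	for (var i = 0; i < tokens.length; i++) {
		if (!words.hasOwnProperty(tokens[i]))
			flagged.push(tokens[i]);
	}
	return flagged;
}

console.log(flagUnknown("The quick brwon fox", wordlist)); // [ 'brwon' ]
```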

The hardest part of this turned out to be finding a decent wordlist. I was actually surprised at the delicate balance between finding a “good enough” list and one that’s “too good”. Too good? Yes, apparently a list that has too many words means you will get a lot of misses, where what looks like a misspelling turns out to be an obscure word… and you didn’t mean to use obscure words.

The final list I settled on has a word count of 125,346 and came from the Ispell project, which also has common acronyms. Note: This is not the same as Iespell (written ieSpell), although if you Google “Ispell”, you’ll get “ieSpell” as the first result. Ispell lists are available for download at the Kevin’s Wordlist page. I have also combined the 4 main English lists into one file (MS Word). WordPress, strangely, won’t allow plain text files to be uploaded, but allows richtext documents. Email me if you want the plaintext version.

I started with a simple DB table to store all the entries. Since I may also be adding more languages, I also have a WordLang field which can be something small like “en”, “de”, “fr” etc…

Wordentries table

 

I then created an MVC app and loaded each of the wordlist files into the db using a simple function (this can take a while depending on filesize):

public List<Wordentry> GetWords(string p) {
	var query = from line in File.ReadAllLines(p)
			select new Wordentry
			{
				WordText = NormalizeString(line),
				WordLowercase = NormalizeString(line).ToLower(),
				WordLang = "en"
			};
	return query.ToList();
}

 

After feeding it a HostingEnvironment.MapPath to the filename, I can use this to load all entries into the list and call db.Wordentries.InsertAllOnSubmit on the result. NormalizeString is another helper function which I will list below.

I’m using a Spellword model instead of directly using the Wordentry object since I may want to extend the returned result in the future and changing the columns in the DB wouldn’t be practical.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;

namespace Spellcheck.Models
{
	public class Spellword
	{
		public int Id { get; set; }
		public string Spelling { get; set; }
		public string Lowercase { get; set; }
		public string Lang { get; set; }
	}
}

 

And we’re using a SpellRepository class so we’ll keep the controllers free of too much data access stuff.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.IO;
using System.Text;
using System.Globalization;

namespace Spellcheck.Models
{
	public class SpellRepository
	{
		// DataContext global
		private readonly CMDataContext db;

		public SpellRepository(CMDataContext _db)
		{
			db = _db;
		}

		/// <summary>
		/// Counts the total number of word entries
		/// </summary>
		/// <returns>Wordcount int</returns>
		public int GetCount()
		{
			return (from w in db.Wordentries
			 select w.WordText).Count();
		}

		/// <summary>
		/// Searches a given word or word fragment
		/// </summary>
		/// <param name="word">Search word/fragment</param>
		/// <param name="limit">Number of returned results. Defaults to 10</param>
		/// <param name="lang">Language to search. Defaults to "en"</param>
		/// <param name="lower">Search lowercase field only</param>
		/// <returns>List of spellwords</returns>
		public List<Spellword> GetWords(string word, int limit = 10,
			string lang = "en", bool lower = true)
		{
			word = (lower) ?
				NormalizeString(word.ToLower()) :
				NormalizeString(word);

			var query = from w in db.Wordentries
						select w;

			// Get only unique entries in case we have
			// duplicates in the db (Edited from an earlier "GroupBy")
			query = query.Distinct().OrderBy(w => w.WordLowercase);

			// If a language code was specified
			if (!string.IsNullOrEmpty(lang))
				query = query.Where(w=>w.WordLang == lang);

			// Lowercase?
			query = (lower) ?
				query.Where(w => w.WordLowercase.StartsWith(word)) :
				query.Where(w => w.WordText.StartsWith(word));

			// Order alphabetically
			query = query.OrderBy(w => w.WordLowercase);

			return (from w in query
					select new Spellword
					{
						Id = w.WordId,
						Spelling = w.WordText,
						Lowercase = w.WordLowercase,
						Lang = w.WordLang
					}).Take(limit).ToList();
		}
		/// <summary> 
		/// Inserts a new list of words into the spellcheck library
		/// </summary>
		public void SaveWords(List<Spellword> Words)
		{
			var query = Words.GroupBy(w => w.Spelling)
				.Select(w => w.First())
				.OrderBy(w => w.Spelling).ToList();

			List<Wordentry> Entries = (from w in query
									   orderby w.Spelling ascending
									   select new Wordentry
									   {
										   WordText = w.Spelling,
										   WordLowercase = w.Lowercase,
										   WordLang = w.Lang
									   }).ToList();

			db.Wordentries.InsertAllOnSubmit(Entries);
			db.SubmitChanges();
		}

		/// <summary> 
		/// Helper function normalizes a given word to the Unicode equivalent
		/// </summary>
		/// <param name="txt">Raw word</param>
		/// <returns>Normalized word</returns>
		private static string NormalizeString(string txt)
		{
			// Nothing to normalize; also avoids calling Normalize on null below
			if (String.IsNullOrEmpty(txt))
				return String.Empty;

			StringBuilder sb = new StringBuilder();

			sb.Append(
				txt.Normalize(NormalizationForm.FormD).Where(
					c => CharUnicodeInfo.GetUnicodeCategory(c)
					!= UnicodeCategory.NonSpacingMark).ToArray()
				);

			return sb.ToString().Normalize(NormalizationForm.FormD);
		}
	}
}
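For comparison, the same accent-stripping idea can be sketched in JavaScript, which is handy for checking what a normalized word should look like from the browser console. stripAccents is a hypothetical name. It only removes marks in the U+0300 to U+036F block, which is narrower than the NonSpacingMark category used in the C# version, so treat it as an approximation.

```javascript
// Hypothetical JavaScript counterpart of NormalizeString: decompose to
// NFD, then drop combining marks in the U+0300 to U+036F block
function stripAccents(txt) {
	if (!txt) return '';
	return txt.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Escapes are used so the snippet survives any file encoding
console.log(stripAccents('caf\u00e9'));   // cafe
console.log(stripAccents('na\u00efve'));  // naive
console.log(stripAccents('Z\u00fcrich')); // Zurich
```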

To use this, we’ll just add a JsonResult action to our controller. I just created a Suggestions action in the default Home controller since this is just an example.

public JsonResult Suggestions(string word, int limit = 10, string lang="en")
{
	List<Spellword> Words = new List<Spellword>();
	if (!string.IsNullOrEmpty(word))
	{
		using (CMDataContext db = new CMDataContext())
		{
			SpellRepository repository = new SpellRepository(db);
			// 10 results is usually enough
			Words = repository.GetWords(word, limit, lang);
		}
	}
	// AllowGet is required to serve JSON over GET; without it we'd need to use POST
	return Json(Words, JsonRequestBehavior.AllowGet);
}

 

… And that pretty much covers the backend for now.
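If it helps to picture what the lookup does without a database, here is a rough in-memory sketch of the same steps: prefix match on the lowercase form, alphabetical order, then a cap on the number of results. The suggest function and the sample list are made up for illustration, and this is not how the repository itself is implemented.

```javascript
// Hypothetical in-memory version of the lookup: case-insensitive
// prefix match, sorted alphabetically, capped at `limit` results
function suggest(words, prefix, limit) {
	limit = limit || 10;
	prefix = prefix.toLowerCase();
	var matches = [];
	for (var i = 0; i < words.length; i++) {
		if (words[i].toLowerCase().indexOf(prefix) === 0)
			matches.push(words[i]);
	}
	matches.sort(function (a, b) {
		a = a.toLowerCase();
		b = b.toLowerCase();
		return a < b ? -1 : (a > b ? 1 : 0);
	});
	return matches.slice(0, limit);
}

var sample = ["spelt", "Spell", "spend", "apple", "spelling", "speck"];
console.log(suggest(sample, "spel"));  // [ 'Spell', 'spelling', 'spelt' ]
console.log(suggest(sample, "sp", 2)); // [ 'speck', 'Spell' ]
```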

To test whether the word suggestion works, we’ll make one autocomplete textbox. First add the jQuery and jQuery UI script files and the jQuery UI CSS to your layout, then add this to the default view :

<script type="text/javascript">
	$(function () {
		var searchtext = $("#search");
		searchtext.autocomplete({
			source: function (request, response) {
				$.ajax({
					url: "/Home/Suggestions", // Or your controller
					dataType: "json",
					data: { word: request.term },
					success: function (data) {
						// Returned data follows the Spellword model
						response($.map(data, function (item) {
							return {
								id: item.Id,
								label: item.Spelling,
								value: item.Lowercase
							}
						}))
					}
				});
			},
			minLength: 3
		});
	});
</script>
<form action="/" method="post">
<input id="search" type="text" name="search" />
</form>
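The $.map callback in the middle is the only part that knows about the Spellword shape, so here it is pulled out as a plain function that can be tried without jQuery or a server. toAutocompleteItems and the sample response are hypothetical and only mirror the field names used above.

```javascript
// Hypothetical stand-alone version of the $.map callback: turns
// Spellword-shaped objects into the items the widget works with
function toAutocompleteItems(data) {
	var items = [];
	for (var i = 0; i < data.length; i++) {
		items.push({
			id: data[i].Id,
			label: data[i].Spelling,  // shown in the suggestion menu
			value: data[i].Lowercase  // placed in the textbox on select
		});
	}
	return items;
}

// Made-up response in the shape of the Spellword model
var response = [
	{ Id: 1, Spelling: "Spell", Lowercase: "spell", Lang: "en" },
	{ Id: 2, Spelling: "spelling", Lowercase: "spelling", Lang: "en" }
];
console.log(toAutocompleteItems(response));
```

Since value comes from the lowercase field, picking “Spell” from the menu puts “spell” in the textbox. Swap the two fields if you’d rather keep the original casing.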

 

Fun fact : Total misspellings as I was writing this (excluding Ispell/ieSpell names and code) before running spellcheck = 12.

Yeah, I really can’t spell.