Pseudo language generator

The usual way to place filler text on a page or in a web design is to copy the Lorem Ipsum text found all over the place. The one downside is making sure the text isn’t too recognizable, considering everyone uses it, and making sure you have enough of it for a particularly big layout or dummy text range.

For some reason this didn’t seem particularly elegant, especially given the aforementioned commonality, and the only other alternative was to create actual posts with, you know, actual text. Well, that didn’t seem elegant either, since those posts tend to look contrived and stupid… a lot like most of the “real” posts on the internet, sadly.

So I set about creating my own pseudo language generator in JavaScript that I can plug in anywhere to generate as much text as needed without worrying about repeating myself. I’m home today thanks to doctor’s orders and bored to death of walking up and down the apartment, so I might as well do it now…

First step : Wikipedia

Apparently, in English, ETAON RISHD LFCMU GYPWB VKXJQ Z is the alphabet arranged from the most frequently used letter to the least. While this seems logical to me, it didn’t make sense to keep the vowels mixed in with the consonants, since I needed to match each consonant with at least one vowel. In my own cursory observation of Lorem Ipsum, the vowels tended to have an even distribution, but since I’m trying to match English as much as possible, I sorted the vowels by decreasing frequency: eaoiu, and the consonants as well : tnrshdlfcmgypwbvkxjqz.

To make it easier to pair them with a consonant, I figured I’d put all the vowels together and randomly pick one from the pool, but a uniform pick from that pool ignores their frequency.

The easiest way to make sure the most frequent letters are picked more often than the least frequent was to multiply the occurrences of the most frequent letters in the pool. I could have repeated the letters by hand in the pool, but that felt hackish, so I wrote a simple multiply function.

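function multFrequency(chars) {
	var cn = '';	// New, weighted pool
	// i counts down from the pool length, so earlier letters get larger counts
	for(var i = chars.length; i > 0; i--) {
		var ch = chars.charAt(chars.length - i);
		// Add the current letter once for each j from i down to 0
		for(var j = i; j >= 0; j--) {
			cn += ch;
		}
	}
	return cn;
}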

What’s going on here? Well, first we iterate through the pool with a counter, i, that starts at the pool’s length and decrements by one. Each letter is selected with charAt at position chars.length - i, so we still walk the pool from its first letter to its last. At each letter, we repeatedly add it to the new pool, “cn”. Since i decreases as the loop continues, characters further down the pool get added fewer times. Because the inner loop runs while j >= 0 rather than j > 0, each letter is actually added i + 1 times, so even the last letter in the pool goes in twice.
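To see the weighting concretely, here is a small check that runs the same loop on a three-letter pool. The pool “abc” is only an illustrative input, not one of the pools used in this post.

```javascript
// Same loop as multFrequency above, repeated so the snippet runs on its own.
function multFrequency(chars) {
	var cn = '';
	for (var i = chars.length; i > 0; i--) {
		var ch = chars.charAt(chars.length - i);
		// j runs from i down to 0 inclusive, so ch is appended i + 1 times
		for (var j = i; j >= 0; j--) {
			cn += ch;
		}
	}
	return cn;
}

var weighted = multFrequency('abc');
console.log(weighted); // "aaaabbbcc"
```

The first letter gets four copies, the second three and the last two, so a uniform pick from the new pool favours the letters at the front.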

Pairing ending letters

According to the Wikipedia article, English also has common pairings: TH HE AN RE ER IN ON AT ND ST ES EN OF TE ED OR TI HI AS TO.

I figured it would be easier to again split these up into pairings that start with vowels and those that start with consonants. So that leaves :

Consonant pairs : “th”, “he”, “nd”, “st”, “te”, “ti”, “hi”, “to”.
Vowel pairs : “an”, “er”, “in”, “on”, “at”, “es”, “en”, “of”, “ed”, “or”, “as”

So we need another function that will take a letter, find the pairs in a pool that start with it, and randomly select one of their ending letters. This is actually a lot easier than it sounds. The first part of this problem is making a little helper function that will randomly give a number between a minimum and maximum.

// Random number helper: returns an integer from min to max, both inclusive
function rnd(min, max) {
	return Math.floor(Math.random() * (max - min + 1)) + min;
}
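With a floor-based formula like this, whether the maximum can ever be returned depends on whether the multiplier is (max - min) or (max - min + 1). The sketch below compares the two. The names rndExclusive, rndInclusive and randomSource are hypothetical, and the random source is injected only so the boundary cases can be checked without relying on chance.

```javascript
// randomSource stands in for Math.random, which returns a value in [0, 1).
function rndExclusive(min, max, randomSource) {
	return Math.floor(randomSource() * (max - min)) + min;
}

function rndInclusive(min, max, randomSource) {
	return Math.floor(randomSource() * (max - min + 1)) + min;
}

// Fixed sources representing the smallest and (nearly) largest possible draws
var lowest = function () { return 0; };
var highest = function () { return 0.999999; };

var exclusiveRange = [rndExclusive(2, 10, lowest), rndExclusive(2, 10, highest)];
var inclusiveRange = [rndInclusive(2, 10, lowest), rndInclusive(2, 10, highest)];

console.log(exclusiveRange); // [ 2, 9 ]
console.log(inclusiveRange); // [ 2, 10 ]
```

This matters whenever a caller passes something like chars.length - 1 as the maximum and expects the last index to be reachable.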

Now we’re going to create the pair matching helper.

function fromPair(pairs, p) {
	var nc = '';
	// Collect the ending letter of every pair that starts with p
	for(var i = 0; i < pairs.length; i++) {
		if (pairs[i].charAt(0) == p)
			nc += pairs[i].charAt(1);
	}
	if (nc == '')
		return nc;
	
	// Or else...
	return fromRange(nc);
}

What this does is iterate through each pair in the pool, take the second letter of every pair whose first letter matches the given letter, p, and build a new pool, “nc”. If “nc” is empty, then it didn’t find a matching pair and returns an empty string, but if at least one pair was found, it will randomly select from this pool… using the function below.
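As an illustration of the pool-building step on its own, the sketch below returns the candidate endings without making the random pick. pairEndings is a hypothetical helper name, and it visits every pair in the array.

```javascript
var consonantPairs = ["th", "he", "nd", "st", "te", "ti", "hi", "to"];

// pairEndings (hypothetical): collect the second letter of each pair
// whose first letter equals the given letter.
function pairEndings(pairs, first) {
	var pool = '';
	for (var i = 0; i < pairs.length; i++) {
		if (pairs[i].charAt(0) === first) {
			pool += pairs[i].charAt(1);
		}
	}
	return pool;
}

console.log(pairEndings(consonantPairs, 't')); // "heio"
console.log(pairEndings(consonantPairs, 'n')); // "d"
console.log(pairEndings(consonantPairs, 'z')); // "" (no pair starts with z)
```

So a word ending in “t” can pick up an h, e, i or o, while a word ending in “z” is left alone.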

We need a function that will avoid letter duplications. I could be wrong, but in the original Lorem Ipsum, I don’t recall seeing double vowels. I think this makes sense in our new pseudo language.

function fromRange(chars, p) {
	var c;
	if (arguments.length > 1) {
		do {
			c = chars.charAt(rnd(0, chars.length -1));
		} while(c == p);
	} else {
		c = chars.charAt(rnd(0, chars.length -1));
	}
	
	return c;
}

This function is what we’ll use to randomly select characters from any pool. It also doubles as a duplicate remover if the second parameter is specified. Basically, it will retry a random pick from the pool until the result differs from the given character.
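One thing to keep in mind with the retry loop: if the pool contained nothing but the excluded character, it would never finish. That can’t happen with the five distinct vowels used here, but as a sketch of an alternative, the version below filters the pool first and picks once. pickExcluding and randomSource are hypothetical names, not part of the code in this post.

```javascript
// pickExcluding (hypothetical): remove the excluded character from the pool,
// then pick one of the remaining characters.
// randomSource stands in for Math.random so the picks can be checked.
function pickExcluding(chars, exclude, randomSource) {
	var pool = '';
	for (var i = 0; i < chars.length; i++) {
		if (chars.charAt(i) !== exclude) {
			pool += chars.charAt(i);
		}
	}
	if (pool === '') {
		return ''; // nothing left to pick from
	}
	var index = Math.floor(randomSource() * pool.length);
	return pool.charAt(index);
}

var first = pickExcluding('eaoiu', 'e', function () { return 0; });
var last = pickExcluding('eaoiu', 'e', function () { return 0.999; });
var none = pickExcluding('eee', 'e', Math.random);

console.log(first); // "a"
console.log(last);  // "u"
console.log(none);  // ""
```

The trade-off is an extra pass over the pool on every call, which is negligible for pools this small.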

Building the language

To give this the look and feel of true randomness, I put all the above constants (vowels, consonants, pairings) into variables. I also created a minimum and maximum value set for word length, sentence sizes and paragraph sizes to give the impression of random entries.

var vowels = "eaoiu";

// The consonants are placed in order of decreasing frequency
var consonants = "tnrshdlfcmgypwbvkxjqz";

// Letters commonly paired (with consonants first and vowels next)
var consonantPairs = ["th", "he", "nd", "st", "te", "ti", "hi", "to"];
var vowelPairs = ["an", "er", "in", "on", "at", "es", "en", "of", "ed", "or", "as"];

var wMin = 2;		// Minimum word length
var wMax = 10;		// Maximum word length
var sMin = 4;		// Minimum sentence size
var sMax = 20;		// Maximum sentence size
var pMin = 1;		// Minimum sentences per paragraph
var pMax = 3;		// Maximum sentences per paragraph
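var vFreq = 3;		// Every x-th character is a consonant; the rest are vowels

That last variable, vFreq, is what I think will really make or break this: it sets the rhythm of consonants and vowels. With a value of 3, every 3rd character (starting with the first) is a consonant and the two in between are vowels, which I think will make this seem realistic.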

Now we need a function to generate a realistic sounding word…

function getWord(u) {
	// u is optional; default to no capitalisation when it is omitted
	if(arguments.length < 1)
		u = false;
	
	var r = rnd(wMin, wMax);
	
	var w = '';	// Completed word holder
	var c = '';	// Generated letter holder
	
	for(var i = 0; i < r; i++) {
		// Every vFreq-th character, starting with the first, is a consonant;
		// the others are vowels that never repeat the previous letter
		if (i % vFreq == 0) {
			c = fromRange(consonants);
		} else {
			c = fromRange(vowels, c);
		}
			
		// First letter of the word requested in uppercase
		if(u == true && i == 0)
			c = c.toUpperCase();
		w +=  c;
	}
	
	// Commonly paired letters
	if (consonants.indexOf(c) > -1 ) {
		w += fromPair(consonantPairs, c);
	} else {
		w += fromPair(vowelPairs, c);
	}
	
	return w;
}

This function has an argument to make the first letter upper case for use at the beginning of a sentence. Note the wMin and wMax variables we declared earlier, between which the word lengths vary. Also note that in the for loop, we’re using the fromRange function with the second parameter (to skip duplicates) specified for vowels. I’m also making use of the fromPair function depending on whether the word ends in a consonant or a vowel.
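To make the rhythm of the loop easier to picture, here is a small sketch that mirrors only its branch condition (i % vFreq == 0) and marks each position as a consonant (C) or vowel (V). letterPattern is a hypothetical helper, and the pair ending appended after the loop is not included.

```javascript
// letterPattern (hypothetical): show which positions in a word of the
// given length would be drawn from the consonant pool (C) or vowel pool (V).
function letterPattern(length, vFreq) {
	var pattern = '';
	for (var i = 0; i < length; i++) {
		pattern += (i % vFreq === 0) ? 'C' : 'V';
	}
	return pattern;
}

console.log(letterPattern(7, 3)); // "CVVCVVC"
console.log(letterPattern(4, 2)); // "CVCV"
```

With vFreq at 3, every consonant is followed by two vowels, which is why skipping duplicate vowels matters.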

Now that we have the word generator we need a function that creates a sentence by repeatedly calling the above getWord function. Note the sMin and sMax variables that allow the sentence length to fluctuate.

// Creates a sentence (bunch of words ending in '. ');
function getSentence() {
	var r = rnd(sMin, sMax);
	var s = '';
	for(var i = 0; i < r; i++) {
		if(i == 0) // First letter in first word is uppercase
			s += getWord(true) + ' ';
		else
			s += getWord() + ' ';
	}
	
	return s.substring(0, s.length - 1) + '. ';
}

Finally, a very simple paragraph generator that calls getSentence between pMin and pMax times.

// Creates a paragraph (bunch of sentences wrapped in <p>)
function getParagraph() {
	var r = rnd(pMin, pMax);
	var p =  '<p>';
	
	for(var i = 0; i < r; i++) {
		p += getSentence();
	}
	
	return p + '</p>';
}

Putting these functions together, I created a paragraph that looks less like faux Latin and more like a Scandinavian language…

Buedaehain ges seist gieneof yauteof moareon noisoin daeceolan peobuohen rieyeiher sieqeawof cuekuaxeof deuliukan roapen teahan noifaogu liacon. Daogeadan rin xaegiehin can qeoviof dairin toefoatean rion teiceivean naijaeton riof rain hiakeof weawean.

But, oh well. For what it is, it does a good enough job, I think. Here’s a running demo of everything together.

For some reason, Modernizr kept throwing an error which means it doesn’t work in Firefox.

Update

Thanks to a very helpful comment by Lin, the code now works on Firefox. Turned out to be an encoding issue (Firefox doesn’t like ANSI and UTF dancing together).

Also a minor improvement:
I changed the following line in the multFrequency function…

for(var j = i; j >= 0; j--)

Into…

for(var j = (i * i + 1); j >= 0; j--)

This yielded a much better distribution of letters for both vowels and consonants.
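To put numbers on the difference, the sketch below counts how many copies of each vowel the inner loop adds for each of the two starting values of j. copyCounts is a hypothetical helper written for this comparison only.

```javascript
// copyCounts (hypothetical): for each letter, count the iterations of the
// inner loop when j starts at startFor(i) and runs down to 0 inclusive.
function copyCounts(chars, startFor) {
	var counts = {};
	for (var i = chars.length; i > 0; i--) {
		var ch = chars.charAt(chars.length - i);
		var copies = 0;
		for (var j = startFor(i); j >= 0; j--) {
			copies++;
		}
		counts[ch] = copies;
	}
	return counts;
}

var linear = copyCounts('eaoiu', function (i) { return i; });
var squared = copyCounts('eaoiu', function (i) { return i * i + 1; });

console.log(linear);  // { e: 6, a: 5, o: 4, i: 3, u: 2 }
console.log(squared); // { e: 27, a: 18, o: 11, i: 6, u: 3 }
```

With the original line, “e” is three times as likely to be picked as “u”; with the squared version it is nine times as likely, so the pool leans much harder towards the frequent letters.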

9 thoughts on “Pseudo language generator”

  1. I kept getting confused about what was post and what was randomly generated text. Ha ha, just kidding. Great work, despite all the double vowels giving it such an old Germanic look (not that there’s anything wrong with old Germanic or whatever language this resembles, only that it looks so foreign).

    • Thank you!

      That’s what I was thinking too. I figured if I can match at least a semblance of a real language, then it was worth the effort. Nothing like accidentally cursing in Old Germanic, eh? ;)

      Now if only I can get it to work in Firefox… which seems to be acting weird for some reason.

  2. I am glad you posted this, because back in February I was trying to do the same thing, but the results were not great. Here’s an example sentence:

    “Sk t pl eabok sioneagrary sptonan pr l tisait d aricendilodocaus k onolsarecon aucertenf kewhan e deiragatrorowk fte serouameanilendd se cay atauseacock minedont maiosisic ltovernsothansthivisslinesingar gomitesh.”

    Your post gave me a push, though I didn’t use your method but improved the one I was using. Now it can generate different languages based on a words file, and the funny thing is Google Translate thinks that is *the* language. A sample text:

    “Comids blowleemenisaverad abrelly nnes ms s rines ded wyer al’s mbletenda d giscession’s heetrang riersal wont cedifestemary g band s nt’s mpeeper t pendrica sidgentischomed ks.”

    It still needs some improvements, but the original design is for… Well, you will understand by this generator’s name: “Blah, WTF?”

    Anyway, something strange is going on with your demo site. The contents of those two .js files look messed up when viewed with FireBug, but if you open the files in their own tab or download them, they look like normal js files.

    • Problem solved! Your comment helped me narrow it down, because I remembered the same exact issue back when my code files had mixed encoding (some ANSI, others UTF-8 etc…). The files were mangled in Firefox, but when I downloaded them, they were fine.

      I remembered that I opened each code file again and re-saved in UTF-8. I’m referencing Google’s developer lib for jQuery which was in UTF-8. The mixed encoding messed up the script.

      Thank you so much! :D

      “Blah, WTF?” Hahaha, that’s a great project name!

      I think if you approached it while looking at an actual language for reference, you would get better results. In my case, it was English and Latin, but there’s no reason the same can’t be used for other languages.

      But I also tried avoiding double letters, which made it seem more natural and of course, putting in a vowel with regular frequency helps normalize the text. Then there’s also the common pairings approach. I used the common pairings in the word endings because this is usually where we see parts like “th”, “es” and “nd”. In that Wikipedia article, there are also common double letter pairings which can be sprinkled in with the same frequency.

      The vowels and consonants were actually my last steps. First I tried to get a properly random sample for the entire alphabet in word and sentence sizes. Then I split the vowels and consonants so I could sprinkle them in at every 3rd character. I figured that’s about what a Latinesque language looks like. The common letter pairings were the last step.

      This trial-error approach took me about an hour (side distraction to keep myself from going mad while I’m home sick), so with more time and effort, I’m sure it can look even more like Lorem Ipsum.

      • Thanks!

        Ah, see that’s the approach to truly mimic a language. In my case, it was just to generate more realistic looking gibberish.

        I noticed in your code that you’re using a matrix for the lookbehind. If you only want to look at the last letter and not the last collection, you could very well end up with a more convincing sentence like :

        Colorless green ideas sleep furiously.

      • Well, it’s not intended to output a sentence like “Colorless green ideas sleep furiously,” but to make people try to pronounce the words and say “WTF?” ;)
