Optimal layout for soft keyboard

Ibix · Jul 1, 2012

I hope this isn't going to be a really dumb question, but I'm stuck on a personal project.

I'm writing a multi-tap keyboard (like ye olde phones that actually had keyboards and didn't have predictive text) for a smartphone with a small screen. I'd like the layout to be in some sense optimal. I think I have a working definition of "optimal" in this context, and have cracked some of the maths. Since what I'm writing is a software keyboard, I also have a data source which would allow the keyboard to re-optimise itself on request. I'd appreciate any comments and any help with the maths I haven't figured out.

I think there are three components to optimal - the number of keystrokes required to type a letter, the number of times two consecutive letters are on the same key (which forces you to wait a second), and the actual placement of the keys. What I have chosen to do (and any suggestions for more flexible approaches would be of interest) is to optimise for minimum keystrokes and, since this is non-unique, choose the solution with the minimal repeat use of keys. The third constraint I was intending to implement last, simply placing the most popular keys where my thumb doesn't have to bend too much.

I can do the first optimisation. I record the number of times, [itex]n_i[/itex], that the user types the character [itex]c_i[/itex]. That gives me an estimate of the probability that the user wants to type that character, [itex]p_i=n_i/\sum_jn_j[/itex]. I order the characters by descending [itex]n_i[/itex] and (on a keyboard with [itex]K[/itex] keys) I assign the first [itex]K[/itex] characters as first characters on a key, the next [itex]K[/itex] as second characters, and so on. I'm going to skip proving that this minimises the number of keystrokes. I can do it, but hopefully the idea that common characters should need few keystrokes is uncontroversial.

I have now defined the set of "first rank" characters, the set of "second rank" characters, and so on. The question is which "second rank", "third rank", etc. character to put on a key with which "first rank" character. This is the bit I cannot see how to do. I can record [itex]n_{ij}[/itex], the number of times the character sequence [itex]c_ic_j[/itex] is typed and hence calculate the probability, [itex]p_{ij}[/itex], that the next two characters that the user types will be [itex]c_ic_j[/itex]. That means that, if you type [itex]c_i[/itex], I expect you to have to wait [itex]\sum_{j\in \{k_i\}}p_{ij}/p_i[/itex] (where [itex]\{k_i\}[/itex] means all characters assigned to the key to which [itex]c_i[/itex] is assigned) before typing the second character. Therefore the non-conditional expected waiting time after the first character is [itex]\sum_i\sum_{j\in \{k_i\}}p_{ij}[/itex]. That is to say (I think) that if you write the [itex]p_{ij}[/itex] in a matrix and group the rows and columns by the key to which the characters are assigned, the merit function is the sum of the same-key blocks on the leading diagonal.

I have absolutely no idea how to optimise this in general. In the case that there are only two characters on each key, I can see that you could try any arrangement of the second characters then try swapping any pair of second characters, accept the best swap and iterate. This would lead "downhill", but I have no idea if it necessarily leads to the global optimum. You could extend this to three ranks by starting with this solution, adding the third rank and pairwise swapping the third rank by the same algorithm, but this feels unlikely to lead to a global optimum. Any ideas? I'll go with the above if no-one has a better idea, trying to make an intelligent choice of starting position (say, for the first rank character with the highest [itex]n_i[/itex], pick the second rank character that minimises [itex]p_{ij}[/itex], then move on to the second-highest [itex]n_i[/itex] and pick from the remaining second rank characters).

The final optimisation is actually easy. Once you have the set of characters on each key then the expected number of keystrokes on a key [itex]k_i[/itex] is [itex]\sum_{j\in \{k_i\}}r_jp_j[/itex]. Put the highest-scoring key in the easiest to reach place, etc.

Thank you if you've even read all of the above. Do you have any thoughts? Better approaches to the second optimisation, or a more holistic approach, recommended reading, or just commentary would be welcome.

jedishrfu · Jul 1, 2012

well you could order initially by the frequency alphabet for the first character obviating the need to predict what character will be typed first. There is also a two letter frequency table that you could use the first letter as a search key to get the list of likely second chars. These tables often are associated with cryptanalysis.

http://en.wikipedia.org/wiki/Letter_frequency

The more common approach is to use a dictiionary and predictive word selection based on the first few letters to complete the word the source of all humorous typing errors on the internet.

Ibix · Jul 1, 2012

Thanks for your reply.

I'm ashamed to admit that my initial setup is alphabetical at the moment but I can do much better, as you suggest. Then update the frequency table for the mix of languages that I enter.

I'm resistant to autocorrect type options because the dictionary for my typical use needs to cover two languages plus some bits of code. Also, I know I would end up with something inappropriate in a text to the boss...

haruspex · Jul 1, 2012

There's something I don't understand about the design. You've assigned a first rank character to every key, so when I tap one of those keys, how does it know whether that's the whole character or the first tap of a multi-tap sequence?

Ibix · Jul 1, 2012

As per the physical twelve-key keypads on non-smart phones, the keyboard doesn't really commit to a character until a timeout elapses or you press another key. For example, the standard keypad has abc on the 2 key, def on 3, and ghi on 4. If you want to type the word "age", you press 2433. You can be as slow or fast as you like with the 2 and the 4, but you must get the two 3s in within the timeout, or else you get two ds instead of one e. If you want to type "aged" you press 2433, again being quick on the two 3s, then wait for the timeout to expire and press 3 a third time to get the d. There is usually visual evidence of the timeout - for example the cursor stays to the left of the character until timeout or you hit a different key.

Minimising the incidence of wait-for-timeout is the driver for the attempt to minimise consecutive characters on the same key.

Stephen Tashi · Jul 2, 2012

Some miscellaneous thoughts:

A keyboard layout defines a "code" and peopel study the cost of using various codes to transmit messages, so you might find some help if you search for such material. A keyboard layout defines a code whose symbols are K[i,j] where i = 1,2,.. number of keys on the keyboard and j = 1,2,3 = number of keypresses needed.

A digraph model of your problem:

A state i is a vector of two letters such as (a,b) (so we consider (a,b) as different from (b,a) ) For example, if we are in state (a,b) and transmit the character 'x', we transition from state (a,b) to state (b,x).

Let M[i,j] be the probability of going to state j given that your are in state i when another character is transmitted.
A nice transition matrix M will have a stationary state vector S such that M S = S. (I susptect the the single letter frequencies are a good approximation to S. At any rate, there are numerical methods for finding S if you have approximated M from observed data.) I think of S as "what fraction of the time on the average, the process is in state i".

Let C be a cost matrix such that C[i,j] defines the cost of going from state i to state j.
The natural "total cost" of the model is the "mean cost of transmitting a character" which I would define as [itex]F = \sum_i S \sum_j M[i,j] C[i,j]. [/itex].

Minimizing F with respect to C is an "integer programming" problem - another phrase you could use in searching the web.

Elaborating the model further requires stating the constraints on the C[i,j] that are imposed by the code symbols K[r,s]. I'll see if there is any interest in the model before attempting that!

Ibix · Jul 2, 2012

Stephen - excellent stuff, thank you. I'll look into integer programming. Could I ask, though, is S an eigenvector of M? If so, is it the one with eigenvalue nearest one, with the assumption being that if M is "nice" there will actually be one with eigenvalue 1? Also, am I understanding that I could either apply this to my step two by setting C[i,j] to 0 where i and j aren't on the same key and 1 where they are, or apply it to solve the first two steps by setting C[i,j]=r+r[j]+w[i,j], where the r are the numbers of keystrokes to type a character and w[i,j] is zero for characters on different keys and a positive constant for characters on the same key? Varying that constant let's me vary how much I care about repeats? Obviously the constraints would be different in the two cases, too.

Stephen Tashi · Jul 2, 2012

Ibix said:

is S an eigenvector of M? If so, is it the one with eigenvalue nearest one, with the assumption being that if M is "nice" there will actually be one with eigenvalue 1?

Yes, S is an eigenvector of M with eigenvalue 1. It's called the stationary distribution of the process.

Also, am I understanding that I could either apply this to my step two by setting C[i,j] to 0 where i and j aren't on the same key and 1 where they are, or apply it to solve the first two steps by setting C[i,j]=r+r[j]+w[i,j], where the r are the numbers of keystrokes to type a character and w[i,j] is zero for characters on different keys and a positive constant for characters on the same key?

The states in the digraph model are not characters, they are pairs of characters. So r would not represent a cost associated with a single character because i does not refer to a single character. For example if the state i is the vector of characters ('h', 'e' ). By typing another character we move to a different state, which begins with an 'e'. If you want to make statements about indices define characters, you'd have to define the algorithm that computes which pairs (r,s) of indices of characters correspond to the single index of a state i.

Suppse we index the character by variables q,r,s,t and the states by indices i,j.

Let state i be the vector of charcter indices (q,r) and let state j be the vector of indices (s,t).

Unless r = s, we never transition from state i to state j. So we can stiipulate that C[i,j] = 0 when r is not equal to s, since that transition never happens.
If r = s then the cost of transitioning from i to j is the cost of typing the character with index t given that you have just typed the character with index r = s. So, as I understand your costs, C[i,j] is the number of keystrokes rquired to type character t plus a penalty if the character t and character r are on the same key.

chiro · Jul 2, 2012

Hey Tbix and welcome to the forums.

Formally you could analyze this using context-free grammars and then using the structure and the nesting components to design the final keyboard.

This will not only take into account frequency information of letters, but also all the conditional aspects of entire actions and commands and not just for typing written or other related forms of language.

From the language definition you associate distributions and you get a very complicated markov-model which provides all the conditional behaviour.

The language itself again is not just for written and spoken languages, but also incorporates the entire command-set for driving the entire application.

By optimizing the above and taking into account other constraints like usability and user interface requirements, and being able to prioritize different tasks (amongst other things), you will have a systematic and analytic approach to the entire keyboard and UI design.

haruspex · Jul 3, 2012

Stephen Tashi said:

The states in the digraph model are not characters, they are pairs of characters.

Hi Stephen,
I don't understand why the states need to be pairs of characters. If the only frequency data being used is for digrams, doesn't the last completed character give you all the state you need? If X and Y are on the same key then the cost of going from state X to state Y is wait_time+numkeystrokes(Y). If on different keys it's just numkeystrokes(Y). And the probability of Y next depends only on X having been the preceding character. Once Y is complete, the history before that is of no further consequence.
(If trigram frequency data is used then ordered letter pairs might need to be the states, but it's not obvious to me that trigram frequency matters here.)
If M is the digram frequency matrix, then I think it's trivial that the single letter frequency is an eigenvector, not just an approximation to one.
Your equation for F still works for these reduced states.

lpetrich · Jul 3, 2012

This reminds me of Carpalx - keyboard layout optimizer. The software comes with a training-text corpus; you can use it even if you choose not to use Carpalx.

For a multistroke system, one could use the idea of Huffman coding: http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html, though maybe generalized to more than two branches per branch node.

Ibix · Jul 3, 2012

Stephen - Thanks for repeating the explanation - obviously I hadn't quite got your whole first post into my head when I replied. I am interested in haruspex's question, though. Can I not just use the single character transitions? Your expression for costs appears to be entirely independent of the character you indexed q, depending only on the transition from r(=s) to t. It seems to me that this would be the same for a unigraph (?) model where the state i is just [r] and the state j is just [t], and C[i,j] is again the cost of transitioning from r to t. Happy to be corrected - and probably will be.

Chiro - thanks for your input. I know even less about Markov chains than I do about integer programming. Can you recommend a primer, or do I just go with Wikipedia's external links?

Ipetrich - The guy who wrote Carpalx says that he doesn't think he'll be the first or the last to try to optimise a keyboard. I doubt I will be either. I've had a quick skim and will have to do some further reading. He's got a slightly different model of the way typing works from me, largely because I think he's trying to optimise two-handed typing on a full-scale keyboard, where I'm doing one thumb on a small screen. However, he did put in a transition cost for moving from one key to another, which is perhaps something I should consider as that aspect is more or less completely neglected in my model.

Regarding Huffman coding, I'm not sure that's quite what I want to do. A Huffman code symbol (usually) switches between the underlying symbols (so a binary code for "z" might be 101001), which would be possible to type, but I can't imagine how I'd make the UI work. Edit: ...although the non-prefix nature of the code means that I could dispense with the waiting time entirely. Hm. Something to think about.

lpetrich · Jul 3, 2012

My idea of Huffman coding was to construct a decision tree for finding each letter. The program would show a button for each letter or group of letters, and you'd push a group-of-letter buttons to get its letters. You wouldn't need to see the binary or whatever representation of a letter in a tree.

As to the CarpalX approach, you'd adjust the distance penalties to be appropriate for whatever keyboard design you have in mind, like a nearly square grid.

Stephen Tashi · Jul 4, 2012

haruspex said:

Hi Stephen,
I don't understand why the states need to be pairs of characters. If the only frequency data being used is for digrams, doesn't the last completed character give you all the state you need?

You are correct that one can make a model where the states are single characters (the most recent one typed) and whose transitions probabilities are computed from the conditional probabilities implied by the digraph probabilities. I wanted a model where the stationary distribution would be the vector of digraph probabilities, not the vector of single character probabilities. Your way seems simpler.

If M is the digram frequency matrix, then I think it's trivial that the single letter frequency is an eigenvector, not just an approximation to one.

I think that's true for a probability matrix, but for an empirical frequency matrix it may not be.

Consider the text data to be the string "abaab". If we take "the digraph frequency matrix" to be the one used by the model that you suggested, I make that out to be:
[tex]\begin{pmatrix} 1/2 & 2/3 \\ 1 & 0 \end{pmatrix}[/tex]
The single letter frequency vector is
[itex]\begin{pmatrix} 3/5 \\ 2/5 \end{pmatrix}[/itex]

haruspex · Jul 4, 2012

Stephen Tashi said:

Consider the text data to be the string "abaab". If we take "the digraph frequency matrix" to be the one used by the model that you suggested, I make that out to be:
[tex]\begin{pmatrix} 1/2 & 2/3 \\ 1 & 0 \end{pmatrix}[/tex]
The single letter frequency vector is
[itex]\begin{pmatrix} 3/5 \\ 2/5 \end{pmatrix}[/itex]

Ah yes - it's the frequency of initial letters that's the problem. This will be different from the overall frequencies. I'm not sure what the ideal vector is for this calculation. Overall single letter frequency might be better than the eigenvector.

Stephen Tashi · Jul 4, 2012

ibix,

I'm a believer in using simulations to tackle real world problems and also in optiimization methods that are essentially trial-and-error with fancier names, like genetic programming and simulated annealing. If this problem motivates you study mathematical topics and you find them interesting then by all means, proceed to study them. But since you seem to be confident in programming, don't neglect to use that as a resource!

For example, from mathematics, you might find layout that minimizes the expected cost per character in a message and that layout may work by making some messages quick to type and a few messages extremely slow to type. From tiral-and-error you might find another layout that is slightly slower on the average but doesn't make so many messages extremely slow to type. Consumers might like this better than the "optimum" layout.

Ibix · Jul 4, 2012

Ipetrich - are you suggesting that I just implement the first node on the Huffman tree and put all the characters on children of child 1 on key 1, etc? Using a two key setup and the data from your link, for example, would give me 6 and 54321 as keys. Or were you suggesting that when I tap the 54321 key, the buttons change to 54 and 321? I understood the latter, which is cool but involves a lot of key switching and a hard learning curve, but the former is interesting as a seed layout at the very least. Regarding Carpalx, I thought his model enforced hand symmetry - although I've only skimmed and the model could probably be adapted, even if it does.

Stephen - am I being dumb or is there a typo in your matrix? I understood that it should be
[tex]\left(\begin{array}{cc} p(aa|a)&p(ab|a)\\ p(ba|b)&p(bb|b) \end{array}\right)[/tex]
which I make
[tex]\left(\begin{array}{cc} \frac{1}{3}&\frac{2}{3}\\ 1&0 \end{array}\right)[/tex]

Ibix · Jul 4, 2012

Stephen - 'consumers' in this case is almost certainly just me. I'm certainly going to think about the integer programming, but am a bit worried that branch-and-bound is going to take a while on my phone once I work out how many constraints there will be. Optimum keystrokes plus avoiding massive clangers like e and d on the same key may indeed end up being the way forward.

Stephen Tashi · Jul 5, 2012

Ibix said:

I

Stephen - am I being dumb or is there a typo in your matrix? I understood that it should be
[tex]\left(\begin{array}{cc} p(aa|a)&p(ab|a)\\ p(ba|b)&p(bb|b) \end{array}\right)[/tex]
which I make
[tex]\left(\begin{array}{cc} \frac{1}{3}&\frac{2}{3}\\ 1&0 \end{array}\right)[/tex]

Your version is correct!

haruspex · Jul 6, 2012

Fwiw, here's how I would proceed:
Construct matrix F_ij for relative frequencies of the character sequences (i,j).
Based on single letter frequency, organise the characters into tiers of N (N being the number of keys).
Assign the first tier to keys arbitrarily (or maybe based on some ergonomics). In general the character assigned to key r in the nth tier is k_rn.
There are N! possible ways of assigning the next tier. For each possible arrangement, the cost is Ʃ_r(F_{k_r1k_r2}+F_{k_r2k_r1}). With N <= 10 it's no problem for a computer to check them all and pick the lowest cost.
Likewise at the next tier, but now the cost function is Ʃ_r(F_{k_r1k_r3}+F_{k_r3k_r1}+F_{k_r2k_r3}+F_{k_r3k_r2}), etc.

There are two respects in which this might not produce the least overall cost:
- A different initial assignment into tiers might work better. To assess that, need to expand the cost function to reflect the frequency of each character versus its tier. Then variations could be tried by swapping characters that fell either side of tier boundaries.
- The optimal arrangement within one tier might make lower tiers worse. But since the earlier tiers have the more frequent letters, I doubt this will matter.

Ibix · Jul 31, 2012

I've had a go at implementing the keyboard optimisation algorithm discussed above. I found the answers to a couple of questions and ended up with one more. I learned linear programming from here and integer programming here.

The reason to use the digram model suggested by Stephen Tashi is that the cost function for the unigram model (counter-proposed by Haruspex and myself) depends on the transition between two unigrams, and is therefore bilinear. Integer programming depends on linearity of constraints and merit functions, so you can't apply it.

The reason not to use the digram model is that the simplex array based on this model is bigger than the SD card on my phone, even for a minimal keyboard with little more than the English alphabet.

I implemented an integer program to do more or less what I originally planned - having ranked the letters in decreasing order of individual frequency, I work down the ranks using the integer program to find the optimum arrangement based on the (now fixed) arrangement of the higher ranks. The program is fairly straightforward. For the character c, I define K binary variables, [itex]x_k^c[/itex], that are 1 if the character is to be assigned to key k and 0 otherwise. This makes the constraints:
[tex]\begin{array}{rcll} x_k^c&\leq&1&\forall k,c\\ \sum_kx_k^c&=&1&\forall c\\ \sum_cx_k^c&=&1&\forall k \end{array}[/tex]The first requires each variable to be either zero or one; the second requires each character to be assigned to exactly one key; the third requires each key to have exactly one character assigned to it.

I departed somewhat from Stephen's suggestion for a merit function; I'd be interested to know if I've misunderstood something. Stephen suggested using [itex]\sum_is_i\sum_jM_{ij}C_{ij}[/itex], where s is the eigenvector of M with unit eigenvalue, M is the transition probability matrix ([itex]M_{ij}=p(ij|i)[/itex] is the probability that the next character will be the one indexed j given that the current one is indexed i), and C is the cost matrix. The problem I had was that the vector s was always parallel to (1 1 1 1 1...), so its contribution was trivial. This appears to be a consequence of the fact that the rows of M sum to 1; any other variant I tried ([itex]\sum_{ij}M_{ij}=1[/itex], totally unnormalised) did not yield a unit eigenvalue.

Instead, having assigned first rank character c_k to key k, I defined the "merit" associated with the variable [itex]x_k^c[/itex] as the probability (may be unnormalised) of a transition from c to c_k or vice versa. That makes the merit function
[tex] \sum_{k,c}x_k^c\left(p(cc_k|c)+p(c_kc|c_k)\right)[/tex]which must be minimised in this case. The generalisation to the third rank is to add on the transition probabilities between the second rank character (c'_k) and the third rank character:
[tex] \sum_{k,c}x_k^c\left(p(cc_k|c)+p(c_kc|c_k)+p(cc'_k|c)+p(c'_kc|c'_k)\right)[/tex]

That's more or less it - it produces a plausible-looking layout when fed with available data. I have yet to integrate it into my keyboard program; I did the development in Python rather than Java because I have a linear algebra package installed.

The only further thing to note is that I tried Haruspex's enumeration of permutations approach with the same merit function. Reassuringly, it comes up with the same answer (or at least an equal merit score) where it can. It is faster for keyboard sizes below eight keys; after that it gets rapidly slower. Eleven keys took the integer program about 5.4s, and the enumeration method a little under 41 minutes; I estimate that fifteen keys will take around two years where the integer program needed a little over 16s.

I'd be very appreciative if anyone can explain if I misunderstood something about Stephen's proposed merit function. If not, thanks to everyone who contributed - I hope to be able to start typing much quicker on my phone in the near future.

Stephen Tashi · Aug 5, 2012

My thought about a cost function for a layout is that it should be an estimate of the average cost per character. To this end, it should be:

sum over all characters k of : ( probability that a character selected at random from a messages is
character k) (cost of typing the next character given the first character is k)

The cost of typing the next character given the firs character is k is: Sum over all second characters j of: ( the probability that the second character is j given the first character is k)( the cost of typing character j given that the character you previously typed was character k).

The probabilities would be estimated by frequencies.

The probability that a randomly selected character is k could be estimated by the relative frequency of the character k as a fraction of all characters or by the sum of the frequences of all digraph that begin with character k as a fraction of all occurences of digraphs.

The probability that the next character is j given that the first character is k is estimated by the frequency of the diagraph (k,j) as a fraction of all occurences of digraphs beginning with k.

I'd have to review both your notation and mine to know if this was successfully written in symbolic form.

Optimal layout for soft keyboard

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Similar threads

Graduate Expected numbers of cards of a last color remaining

Undergrad The problem of points

Graduate Probability puzzle

Undergrad How does axiom of foundation prevent infinite sequence of elements?

Undergrad The countability paradox of computable numbers

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect