# Optimal layout for soft keyboard

1. Jul 1, 2012

### Ibix

I hope this isn't going to be a really dumb question, but I'm stuck on a personal project.

I'm writing a multi-tap keyboard (like ye olde phones that actually had keyboards and didn't have predictive text) for a smartphone with a small screen. I'd like the layout to be in some sense optimal. I think I have a working definition of "optimal" in this context, and have cracked some of the maths. Since what I'm writing is a software keyboard, I also have a data source which would allow the keyboard to re-optimise itself on request. I'd appreciate any comments and any help with the maths I haven't figured out.

I think there are three components to optimal - the number of keystrokes required to type a letter, the number of times two consecutive letters are on the same key (which forces you to wait a second), and the actual placement of the keys. What I have chosen to do (and any suggestions for more flexible approaches would be of interest) is to optimise for minimum keystrokes and, since this is non-unique, choose the solution with the minimal repeat use of keys. The third constraint I was intending to implement last, simply placing the most popular keys where my thumb doesn't have to bend too much.

I can do the first optimisation. I record the number of times, $n_i$, that the user types the character $c_i$. That gives me an estimate of the probability that the user wants to type that character, $p_i=n_i/\sum_jn_j$. I order the characters by descending $n_i$ and (on a keyboard with $K$ keys) I assign the first $K$ characters as first characters on a key, the next $K$ as second characters, and so on. I'm going to skip proving that this minimises the number of keystrokes. I can do it, but hopefully the idea that common characters should need few keystrokes is uncontroversial.

I have now defined the set of "first rank" characters, the set of "second rank" characters, and so on. The question is which "second rank", "third rank", etc. character to put on a key with which "first rank" character. This is the bit I cannot see how to do. I can record $n_{ij}$, the number of times the character sequence $c_ic_j$ is typed and hence calculate the probability, $p_{ij}$, that the next two characters that the user types will be $c_ic_j$. That means that, if you type $c_i$, I expect you to have to wait $\sum_{j\in \{k_i\}}p_{ij}/p_i$ (where $\{k_i\}$ means all characters assigned to the key to which $c_i$ is assigned) before typing the second character. Therefore the non-conditional expected waiting time after the first character is $\sum_i\sum_{j\in \{k_i\}}p_{ij}$. That is to say (I think) that if you write the $p_{ij}$ in a matrix and group the rows and columns by the key to which the characters are assigned, the merit function is the sum of the same-key blocks on the leading diagonal.

I have absolutely no idea how to optimise this in general. In the case that there are only two characters on each key, I can see that you could try any arrangement of the second characters then try swapping any pair of second characters, accept the best swap and iterate. This would lead "downhill", but I have no idea if it necessarily leads to the global optimum. You could extend this to three ranks by starting with this solution, adding the third rank and pairwise swapping the third rank by the same algorithm, but this feels unlikely to lead to a global optimum. Any ideas? I'll go with the above if no-one has a better idea, trying to make an intelligent choice of starting position (say, for the first rank character with the highest $n_i$, pick the second rank character that minimises $p_{ij}$, then move on to the second-highest $n_i$ and pick from the remaining second rank characters).

The final optimisation is actually easy. Once you have the set of characters on each key then the expected number of keystrokes on a key $k_i$ is $\sum_{j\in \{k_i\}}r_jp_j$. Put the highest-scoring key in the easiest to reach place, etc.

Thank you if you've even read all of the above. Do you have any thoughts? Better approaches to the second optimisation, or a more holistic approach, recommended reading, or just commentary would be welcome.

2. Jul 1, 2012

### Staff: Mentor

well you could order initially by the frequency alphabet for the first character obviating the need to predict what character will be typed first. There is also a two letter frequency table that you could use the first letter as a search key to get the list of likely second chars. These tables often are associated with cryptanalysis.

http://en.wikipedia.org/wiki/Letter_frequency

The more common approach is to use a dictiionary and predictive word selection based on the first few letters to complete the word the source of all humorous typing errors on the internet.

3. Jul 1, 2012

### Ibix

I'm ashamed to admit that my initial setup is alphabetical at the moment but I can do much better, as you suggest. Then update the frequency table for the mix of languages that I enter.

I'm resistant to autocorrect type options because the dictionary for my typical use needs to cover two languages plus some bits of code. Also, I know I would end up with something inappropriate in a text to the boss...

4. Jul 1, 2012

### haruspex

There's something I don't understand about the design. You've assigned a first rank character to every key, so when I tap one of those keys, how does it know whether that's the whole character or the first tap of a multi-tap sequence?

5. Jul 1, 2012

### Ibix

As per the physical twelve-key keypads on non-smart phones, the keyboard doesn't really commit to a character until a timeout elapses or you press another key. For example, the standard keypad has abc on the 2 key, def on 3, and ghi on 4. If you want to type the word "age", you press 2433. You can be as slow or fast as you like with the 2 and the 4, but you must get the two 3s in within the timeout, or else you get two ds instead of one e. If you want to type "aged" you press 2433, again being quick on the two 3s, then wait for the timeout to expire and press 3 a third time to get the d. There is usually visual evidence of the timeout - for example the cursor stays to the left of the character until timeout or you hit a different key.

Minimising the incidence of wait-for-timeout is the driver for the attempt to minimise consecutive characters on the same key.

6. Jul 2, 2012

### Stephen Tashi

Some miscellaneous thoughts:

A keyboard layout defines a "code" and peopel study the cost of using various codes to transmit messages, so you might find some help if you search for such material. A keyboard layout defines a code whose symbols are K[i,j] where i = 1,2,.. number of keys on the keyboard and j = 1,2,3 = number of keypresses needed.

A digraph model of your problem:

A state i is a vector of two letters such as (a,b) (so we consider (a,b) as different from (b,a) ) For example, if we are in state (a,b) and transmit the character 'x', we transition from state (a,b) to state (b,x).

Let M[i,j] be the probability of going to state j given that your are in state i when another character is transmitted.
A nice transition matrix M will have a stationary state vector S such that M S = S. (I susptect the the single letter frequencies are a good approximation to S. At any rate, there are numerical methods for finding S if you have approximated M from observed data.) I think of S as "what fraction of the time on the average, the process is in state i".

Let C be a cost matrix such that C[i,j] defines the cost of going from state i to state j.
The natural "total cost" of the model is the "mean cost of transmitting a character" which I would define as $F = \sum_i S \sum_j M[i,j] C[i,j].$.

Minimizing F with respect to C is an "integer programming" problem - another phrase you could use in searching the web.

Elaborating the model further requires stating the constraints on the C[i,j] that are imposed by the code symbols K[r,s]. I'll see if there is any interest in the model before attempting that!

7. Jul 2, 2012

### Ibix

Stephen - excellent stuff, thank you. I'll look into integer programming. Could I ask, though, is S an eigenvector of M? If so, is it the one with eigenvalue nearest one, with the assumption being that if M is "nice" there will actually be one with eigenvalue 1? Also, am I understanding that I could either apply this to my step two by setting C[i,j] to 0 where i and j aren't on the same key and 1 where they are, or apply it to solve the first two steps by setting C[i,j]=r+r[j]+w[i,j], where the r are the numbers of keystrokes to type a character and w[i,j] is zero for characters on different keys and a positive constant for characters on the same key? Varying that constant lets me vary how much I care about repeats? Obviously the constraints would be different in the two cases, too.

8. Jul 2, 2012

### Stephen Tashi

Yes, S is an eigenvector of M with eigenvalue 1. It's called the stationary distribution of the process.

The states in the digraph model are not characters, they are pairs of characters. So r would not represent a cost associated with a single character because i does not refer to a single character. For example if the state i is the vector of characters ('h', 'e' ). By typing another character we move to a different state, which begins with an 'e'. If you want to make statements about indices define characters, you'd have to define the algorithm that computes which pairs (r,s) of indices of characters correspond to the single index of a state i.

Suppse we index the character by variables q,r,s,t and the states by indices i,j.

Let state i be the vector of charcter indices (q,r) and let state j be the vector of indices (s,t).

Unless r = s, we never transition from state i to state j. So we can stiipulate that C[i,j] = 0 when r is not equal to s, since that transition never happens.
If r = s then the cost of transitioning from i to j is the cost of typing the character with index t given that you have just typed the character with index r = s. So, as I understand your costs, C[i,j] is the number of keystrokes rquired to type character t plus a penalty if the character t and character r are on the same key.

9. Jul 2, 2012

### chiro

Hey Tbix and welcome to the forums.

Formally you could analyze this using context-free grammars and then using the structure and the nesting components to design the final keyboard.

This will not only take into account frequency information of letters, but also all the conditional aspects of entire actions and commands and not just for typing written or other related forms of language.

From the language definition you associate distributions and you get a very complicated markov-model which provides all the conditional behaviour.

The language itself again is not just for written and spoken languages, but also incorporates the entire command-set for driving the entire application.

By optimizing the above and taking into account other constraints like usability and user interface requirements, and being able to prioritize different tasks (amongst other things), you will have a systematic and analytic approach to the entire keyboard and UI design.

10. Jul 3, 2012

### haruspex

Hi Stephen,
I don't understand why the states need to be pairs of characters. If the only frequency data being used is for digrams, doesn't the last completed character give you all the state you need? If X and Y are on the same key then the cost of going from state X to state Y is wait_time+numkeystrokes(Y). If on different keys it's just numkeystrokes(Y). And the probability of Y next depends only on X having been the preceding character. Once Y is complete, the history before that is of no further consequence.
(If trigram frequency data is used then ordered letter pairs might need to be the states, but it's not obvious to me that trigram frequency matters here.)
If M is the digram frequency matrix, then I think it's trivial that the single letter frequency is an eigenvector, not just an approximation to one.
Your equation for F still works for these reduced states.

11. Jul 3, 2012

### lpetrich

12. Jul 3, 2012

### Ibix

Stephen - Thanks for repeating the explanation - obviously I hadn't quite got your whole first post into my head when I replied. I am interested in haruspex's question, though. Can I not just use the single character transitions? Your expression for costs appears to be entirely independent of the character you indexed q, depending only on the transition from r(=s) to t. It seems to me that this would be the same for a unigraph (?) model where the state i is just [r] and the state j is just [t], and C[i,j] is again the cost of transitioning from r to t. Happy to be corrected - and probably will be.

Chiro - thanks for your input. I know even less about Markov chains than I do about integer programming. Can you recommend a primer, or do I just go with Wikipedia's external links?

Ipetrich - The guy who wrote Carpalx says that he doesn't think he'll be the first or the last to try to optimise a keyboard. I doubt I will be either. I've had a quick skim and will have to do some further reading. He's got a slightly different model of the way typing works from me, largely because I think he's trying to optimise two-handed typing on a full-scale keyboard, where I'm doing one thumb on a small screen. However, he did put in a transition cost for moving from one key to another, which is perhaps something I should consider as that aspect is more or less completely neglected in my model.

Regarding Huffman coding, I'm not sure that's quite what I want to do. A Huffman code symbol (usually) switches between the underlying symbols (so a binary code for "z" might be 101001), which would be possible to type, but I can't imagine how I'd make the UI work. Edit: ...although the non-prefix nature of the code means that I could dispense with the waiting time entirely. Hm. Something to think about.

Last edited: Jul 3, 2012
13. Jul 3, 2012

### lpetrich

My idea of Huffman coding was to construct a decision tree for finding each letter. The program would show a button for each letter or group of letters, and you'd push a group-of-letter buttons to get its letters. You wouldn't need to see the binary or whatever representation of a letter in a tree.

As to the CarpalX approach, you'd adjust the distance penalties to be appropriate for whatever keyboard design you have in mind, like a nearly square grid.

14. Jul 4, 2012

### Stephen Tashi

You are correct that one can make a model where the states are single characters (the most recent one typed) and whose transitions probabilities are computed from the conditional probabilities implied by the digraph probabilities. I wanted a model where the stationary distribution would be the vector of digraph probabilities, not the vector of single character probabilities. Your way seems simpler.

I think that's true for a probability matrix, but for an empirical frequency matrix it may not be.

Consider the text data to be the string "abaab". If we take "the digraph frequency matrix" to be the one used by the model that you suggested, I make that out to be:
$$\begin{pmatrix} 1/2 & 2/3 \\ 1 & 0 \end{pmatrix}$$
The single letter frequency vector is
$\begin{pmatrix} 3/5 \\ 2/5 \end{pmatrix}$

15. Jul 4, 2012

### haruspex

Ah yes - it's the frequency of initial letters that's the problem. This will be different from the overall frequencies. I'm not sure what the ideal vector is for this calculation. Overall single letter frequency might be better than the eigenvector.

16. Jul 4, 2012

### Stephen Tashi

ibix,

I'm a believer in using simulations to tackle real world problems and also in optiimization methods that are essentially trial-and-error with fancier names, like genetic programming and simulated annealing. If this problem motivates you study mathematical topics and you find them interesting then by all means, proceed to study them. But since you seem to be confident in programming, don't neglect to use that as a resource!

For example, from mathematics, you might find layout that minimizes the expected cost per character in a message and that layout may work by making some messages quick to type and a few messages extremely slow to type. From tiral-and-error you might find another layout that is slightly slower on the average but doesn't make so many messages extremely slow to type. Consumers might like this better than the "optimum" layout.

17. Jul 4, 2012

### Ibix

Ipetrich - are you suggesting that I just implement the first node on the Huffman tree and put all the characters on children of child 1 on key 1, etc? Using a two key setup and the data from your link, for example, would give me 6 and 54321 as keys. Or were you suggesting that when I tap the 54321 key, the buttons change to 54 and 321? I understood the latter, which is cool but involves a lot of key switching and a hard learning curve, but the former is interesting as a seed layout at the very least. Regarding Carpalx, I thought his model enforced hand symmetry - although I've only skimmed and the model could probably be adapted, even if it does.

Stephen - am I being dumb or is there a typo in your matrix? I understood that it should be
$$\left(\begin{array}{cc} p(aa|a)&p(ab|a)\\ p(ba|b)&p(bb|b) \end{array}\right)$$
which I make
$$\left(\begin{array}{cc} \frac{1}{3}&\frac{2}{3}\\ 1&0 \end{array}\right)$$

Last edited: Jul 4, 2012
18. Jul 4, 2012

### Ibix

Stephen - 'consumers' in this case is almost certainly just me. I'm certainly going to think about the integer programming, but am a bit worried that branch-and-bound is going to take a while on my phone once I work out how many constraints there will be. Optimum keystrokes plus avoiding massive clangers like e and d on the same key may indeed end up being the way forward.

19. Jul 5, 2012

### Stephen Tashi

20. Jul 6, 2012

### haruspex

Fwiw, here's how I would proceed:
Construct matrix Fij for relative frequencies of the character sequences (i,j).
Based on single letter frequency, organise the characters into tiers of N (N being the number of keys).
Assign the first tier to keys arbitrarily (or maybe based on some ergonomics). In general the character assigned to key r in the nth tier is krn.
There are N! possible ways of assigning the next tier. For each possible arrangement, the cost is Ʃr(Fkr1kr2+Fkr2kr1). With N <= 10 it's no problem for a computer to check them all and pick the lowest cost.
Likewise at the next tier, but now the cost function is Ʃr(Fkr1kr3+Fkr3kr1+Fkr2kr3+Fkr3kr2), etc.

There are two respects in which this might not produce the least overall cost:
- A different initial assignment into tiers might work better. To assess that, need to expand the cost function to reflect the frequency of each character versus its tier. Then variations could be tried by swapping characters that fell either side of tier boundaries.
- The optimal arrangement within one tier might make lower tiers worse. But since the earlier tiers have the more frequent letters, I doubt this will matter.