**Note:** This is a follow-up post to the other post I made on embedding hidden messages in plain text.

**Source code:** http://pastebin.com/B1bc9CCt

After I made the original post on embedding hidden messages in plain text, a friend of mine (Avegaard) pointed out that there's a far more efficient way of storing the bits within plain text. Recall the procedure the last post used for storing bits:

|H|E|L|L|O| |T|H|E|R|E|

Where we either did or didn't insert a ZERO_SPACE_WIDTH character (Unicode's ZERO WIDTH SPACE, U+200B) between the letters to denote the bits. Given a string of length *n*, this gave us *n+1* bits of storage. However, this is far from the most efficient approach. If we instead let the letters of the string themselves denote 0's and the ZERO_SPACE_WIDTH characters denote 1's, we have many more bits at our disposal: adjacent 1's can simply be crammed together, giving us more space to play with. If we let | denote a ZERO_SPACE_WIDTH character, encoding the bit-string 01010111 in the 'hello there' message above would look like

H|E|L|||LO THERE

Notice how we have multiple 1’s in a row. But because the ZERO_SPACE_WIDTH character has literally no width, we can cram together as many as we want without anyone being the wiser. In reality, however, the above message would look like this:

HELLO THE|R|E|||

This is because it doesn't matter how many 0's we *prepend* to a bitstring – it's still the same bitstring. *Appending* 0's, on the other hand, changes the data completely. If we placed the bits at the start, the leftover carrier letters would read as trailing 0's, so we'd need a special *STOP* character telling the program to stop counting bits. By placing the bits at the end instead, no *STOP* character is needed, because any leftover letters before the message just become harmless leading 0's – 00000001 is the same as 01, that is.
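A minimal sketch of this scheme (my own illustration, not the author's pastebin code – `ZeroWidthEncoder` and its method names are made up): letters stand for 0's, ZERO_SPACE_WIDTH characters for 1's, and the bitstring is aligned to the *end* of the carrier so that no STOP marker is needed.

```java
// Sketch of the end-aligned zero-width encoding described above.
// Assumption: the carrier has at least as many characters as the
// bitstring has 0's.
public class ZeroWidthEncoder {
    static final char ZWSP = '\u200B'; // the ZERO_SPACE_WIDTH character

    static String encode(String carrier, String bits) {
        StringBuilder out = new StringBuilder();
        int c = carrier.length() - 1;            // next carrier char (a 0-bit)
        for (int i = bits.length() - 1; i >= 0; i--) {
            if (bits.charAt(i) == '1') {
                out.append(ZWSP);                // 1-bit: invisible character
            } else {
                out.append(carrier.charAt(c--)); // 0-bit: consume a visible char
            }
        }
        while (c >= 0) out.append(carrier.charAt(c--)); // leftovers = leading 0's
        return out.reverse().toString();
    }

    static String decode(String encoded) {
        StringBuilder bits = new StringBuilder();
        for (char ch : encoded.toCharArray())
            bits.append(ch == ZWSP ? '1' : '0');
        return bits.toString();                  // leading 0's are harmless
    }

    public static void main(String[] args) {
        String enc = encode("HELLO THERE", "01010111");
        System.out.println(decode(enc)); // prints 0000000001010111
    }
}
```

Running it reproduces the example from the post: the encoded string is `HELLO THE|R|E|||` (with | as the invisible character), and decoding yields the original bitstring padded with leading zeros.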

Now, let's analyse how well our new algorithm performs, storage-wise. The worst case is embedding a bitstring consisting solely of 0's, as it adds no extra characters. For a text string of size *n*, the worst-case storage capacity is therefore *n* bits, which is actually one bit worse than our old algorithm – but a bitstring of nothing but 0's just represents a long run of *A*'s, so unless we're storing only A's, the new algorithm will always be able to store at least as much as the old one, and usually more.

*Note*: Regardless of the method used to store a hidden message within text, it's *not* an efficient compression algorithm, as we're using two *bytes* to represent each ZERO_SPACE_WIDTH character, and one (or more) bytes to represent each original character. So when I say *space efficient* I mean efficient in the sense of how many bits can be stored and subsequently read from a string of regular text.

It should be obvious by now that the number of bits this algorithm can store within a string of *n* characters is *n* plus the number of 1's, as each 1 turns into an invisible ZERO_SPACE_WIDTH character. Thus, the more 1's our secret message contains, the more data we'll be able to store. Let's make the naïve assumption that our secret message is completely random. Then there are an equal number of 0's and 1's on average, so we can store one additional bit for each regular character, giving a storage capacity of *2n* bits – given that an image file is 200 kB, we'd need 100 kB of text to hide the image, which is 100,000 ASCII characters = 350 Wikipedia pages.
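As a sanity check on the *2n* figure, here's a small simulation (my own illustration – `CapacityEstimate` and its method name are made up): draw random hidden bits, where each 0-bit consumes one visible carrier character and each 1-bit is a free zero-width character, and count how many bits fit per carrier character.

```java
import java.util.Random;

public class CapacityEstimate {
    // Returns the average number of embedded bits per visible carrier
    // character when the hidden message is uniformly random.
    static double bitsPerCarrierChar(int carrierLength, long seed) {
        Random rng = new Random(seed);
        int zeros = 0, total = 0;
        while (zeros < carrierLength) {          // stop when carrier is used up
            if (!rng.nextBoolean()) zeros++;     // 0-bit: consumes a character
            total++;                             // 1-bit: free (zero width)
        }
        return (double) total / carrierLength;
    }

    public static void main(String[] args) {
        System.out.println(bitsPerCarrierChar(100_000, 42)); // ≈ 2.0
    }
}
```

For random data the 1's and 0's balance out, so the simulated capacity hovers around 2 bits per carrier character, matching the *2n* estimate.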

Although this seems completely horrendous, it's *definitely* better than the old algorithm (which needed nearly 700 pages). But there's just one problem – we're not storing random data, we're storing *text*, and text is far from random: the letter *e* is much more likely to show up than the letter *q*, for instance. Here's a list of the most frequent characters in the English language (source: http://millikeys.sourceforge.net/freqanalysis.html):

| Character | Frequency |
| --- | --- |
| SPACE BAR | 18.74 % |
| E | 9.60 % |
| T | 7.02 % |
| A | 6.21 % |
| O | 5.84 % |
| I | 5.22 % |
| N | 5.21 % |
| H | 4.87 % |
| S | 4.77 % |
| R | 4.43 % |

How can we use this information to our advantage? Recall that 1's are essentially *free*, as they don't take any space away from the characters – the algorithm is limited by the number of 0's in the hidden message, not the number of 1's. This means we can assign the bitstring with the most 1's to the character that shows up most often, which is the *space bar*. Here's a new encoding that assigns the most frequent characters the bitstrings containing the most 1's (in descending order):

| Character | Binary Code | Decimal Code |
| --- | --- | --- |
| SPACE BAR | 11111 | 31 |
| E | 01111 | 15 |
| T | 10111 | 23 |
| A | 11011 | 27 |
| O | 11101 | 29 |
| I | 11110 | 30 |
| N | 11100 | 28 |
| H | 11010 | 26 |
| S | 10110 | 22 |
| R | 01110 | 14 |
| D | 01101 | 13 |
| L | 01011 | 11 |
| U | 00111 | 7 |
| M | 10011 | 19 |
| C | 10101 | 21 |
| W | 11001 | 25 |
| G | 00011 | 3 |
| F | 00101 | 5 |
| Y | 01001 | 9 |
| P | 10001 | 17 |
| COMMA | 10010 | 18 |
| DOT | 10100 | 20 |
| B | 11000 | 24 |
| K | 01100 | 12 |
| V | 01010 | 10 |
| APOSTROPHE | 00110 | 6 |
| ? | 10000 | 16 |
| X | 01000 | 8 |
| J | 00100 | 4 |
| ! | 00010 | 2 |
| Q | 00001 | 1 |
| Z | 00000 | 0 |
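The table above can be sketched as a simple fixed-width lookup (my own illustration – `FiveBitCode` and `toBits` are hypothetical names; the alphabet order simply follows the table):

```java
import java.util.HashMap;
import java.util.Map;

public class FiveBitCode {
    // Most frequent characters get the codes with the most 1's.
    // Assumption: the message only contains these 32 characters.
    static final String ALPHABET = " ETAOINHSRDLUMCWGFYP,.BKV'?XJ!QZ";
    static final int[] CODES = {31, 15, 23, 27, 29, 30, 28, 26, 22, 14,
                                13, 11, 7, 19, 21, 25, 3, 5, 9, 17,
                                18, 20, 24, 12, 10, 6, 16, 8, 4, 2, 1, 0};

    // Turn a hidden message into the bitstring to embed.
    static String toBits(String message) {
        Map<Character, Integer> code = new HashMap<>();
        for (int i = 0; i < ALPHABET.length(); i++)
            code.put(ALPHABET.charAt(i), CODES[i]);
        StringBuilder bits = new StringBuilder();
        for (char c : message.toUpperCase().toCharArray()) {
            String b = Integer.toBinaryString(code.get(c));
            bits.append("00000".substring(b.length())).append(b); // pad to 5 bits
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits("HI")); // H = 11010, I = 11110
    }
}
```

Because every code is exactly 5 bits, decoding is just a matter of reading the embedded bitstring in chunks of five and looking each chunk up in the reverse table.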

So how many bits can we expect to store within a string of text *n* characters long? In probability theory, the expected value is the sum of all possible values times the probability of each value. That is, *E[X] = Σ x · P(x)*, where *x* denotes an event and *P(x)* is the probability of that event happening. This means we can find the expected number of additional bits of information, where each event is the number of 1's in a character, and the probability is the probability of that character appearing. I constructed an Excel file for doing just this, taking into account that the probabilities in the above link don't add up to 100 % (they add up to 98.64 %, so I divide all the probabilities by 0.9864 to compensate). I find that the expected number of additional bits is…

… 3.6. This means we gain approximately 3.6 bits for every character in the text string, so the expected number of bits of hidden message in a text string of length *n* is *4.6n*. This means we're able to encode 0.575 characters of hidden message into a single character of text. Using the data found at this website, this means we can encode around 5 news articles into a Wikipedia article.
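The expected-value computation can be sketched in code as well (my own illustration – `ExpectedBits` is a made-up name, and it uses only the ten frequencies listed above rather than the full frequency table, so it lands near 3.9 instead of the 3.6 computed from the complete table; the most frequent characters were deliberately given the most 1's, which skews the top-ten estimate upward):

```java
public class ExpectedBits {
    // E[#1's] = sum over characters of P(character) * (1-bits in its code),
    // using only the ten frequencies listed above, renormalised to sum to 1.
    static double expectedOnes() {
        double[] freq = {18.74, 9.60, 7.02, 6.21, 5.84, 5.22, 5.21, 4.87, 4.77, 4.43};
        int[]    ones = {5, 4, 4, 4, 4, 4, 3, 3, 3, 3}; // 1-bits per 5-bit code
        double total = 0, expected = 0;
        for (double f : freq) total += f;       // renormalise, as in the post
        for (int i = 0; i < freq.length; i++)
            expected += (freq[i] / total) * ones[i];
        return expected;
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", expectedOnes()); // ≈ 3.94 over the top ten
    }
}
```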

Here's a brief example of how to use the program:

```java
// Encode a hidden message
String encoded = Encoder.encodeString(plain_text, hidden_message);

// Retrieve the hidden message
String decoded = Encoder.getMessage(encoded);
```

**Source code:** http://pastebin.com/B1bc9CCt

**Secret message:** http://pastebin.com/S35w7W8D