Rolling Hashes almost made my day
Posted by muntasirkhan on January 22, 2009
Today, rolling hashes almost made my day. Almost, but not quite. Ended up having a pretty rotten day after all. In today’s TopCoder SRM, the easy (250 point) problem involved checking for a checking strings that end up looking the same when cyclically rotated by a number of places
(see the problem statement here).
Suppose we have a string T=”abcd”. A rotation by i places of T is defined as a string T(i) such that T[j]=T(i)[i+j]. In simple terms, to rotate a string T by 1 place, we take off the first character of T and append it to its end to get T(1). To rotate by x places, we simply do this x times. It is clear that there cannot be more than length(T) distinct rotations of T.
Example:
T=”abcd”
T(0)=”abcd” (no rotations)
T(1)=”bcda”
T(2)=”cdab”
T(3)=”dabc”
The problem actually asks a lot more, the TC site hasn’t been updated yet but if you have a TC account you can check it out in the arena. The tricky part of the problem was, given a string T, count how many T(i) there are where T=T(i). That is, how many rotations of T yield T as the result.
Example:
T=”abab”
T(0)=”abab” (match, 0 counts as a valid rotation here)
T(1)=”baba”
T(2)=”abab” (match)
T(3)=”baba”
So, for this string we would get 2 rotations that are matches.
Now, the simple way of counting involves trying out all rotations of a string and counting the ones that match. That would mean, in a string of length n, trying n different rotations (O(n)), checking to see if the rotated string matches (also O(n)), and the actual work of rotating includes chopping off the first char (again another O(n) ). That results in a total complexity of O(n^3). If you are clever, you can even avoid actually rotating strings by chopping up the first character and appending thus getting you a runtime of O(n^2). One way can be by cleverly manipulating the string indices as we check, thus not having to keep two strings at all. An easier way of going about this is to append T with itself (“abab” -> “abababab”) and then searching for the occurrences of T in it. I will not go into why it works, one can easily convince oneself that it works by trying it out a few examples. I actually ended up using a bit of both.
At this point I coded and tested my solution. TLE on the last case. The strings were short enough, only about a 100 char max, so O(n^2) is quite doable. Turns out, the other part of the problem involved trying out different permutaions that added another m! (not: m not n, another variable from the input) to the overall complexity, thus increasing the runtime by 8! at worst. It was obvious the permutations would have to be tried out, there was no other way. The checking function had to be made faster. A few moments later I realized merely a linear time string search was required. Checking a for T in T+T was just a string search after all.
I am familiar with (i.e. have coded before) two different algorithms for string search, KMP and Rabin-Karp. My memory of KMP was rather rusty, having last coded it more than a year ago. Rabin-Karp, however, I remember quite well. I even set a problem at an online contest at UVa for which my solution used an idea derived from Rabin-Karp. Moreover, the idea behind it is much simpler, although I had reservations about my ability to make a bug free implementation in time.
The main idea behind it is the concept of a hash function. Simply put, you give an object (in this case, a string) as input and the function returns a number. The same Object always returns the same number. But different objects might return the same number, but in a well designed hash function that should be very rare (such an event is called a collision). For a string, we can for our purposes define a hash function as
hash(x)=sum(value(x[i])*a^i)%M + C over all i in [0,length(x)]
The values of a and M define how the hash function behaves, and are randomly chosen large integers for best results in this case.
Here, value(c) can be anything that makes a one-to-one relation between chars and integers (e.g. a=1, b=2…, ascii values etc.)
(Note: computing a hash is therefore O(length(x)) )
Now, if we have two strings x and y,
x=y => hash(x)=hash(y)
But note,
hash(x)=hash(y) =/=> x=y
However, with a well designed randomized hash function, we can say that,
probability(hash(x)=hash(y) => x=y) ~ 1.0
i.e. if their hashes match, it is very probable that the strings also match.
Thus by comparing two numbers (a constant time operation) we can test if two potentially large strings are the same or not. But, we have noticed that computing the hash of a string itself is linear. Thus checking each n length substring of T+T would involved taking O(n) hashes each taking O(n) time to compute, bringing us back to our old complexity of O(n^2).
But notice that in making these O(n) substrings of T, substring from [i,i+n) shares a lot of chars and relative positional characteristics with substring [i+1, i+1+n). Actually, the second substring can be produced from the first by simply discarding its first char and adding the char following the substring (i.e. at i+n) in the string. Consequently, we can compute the hash value for the second by using the value of the first and adjusting it by taking away the value contributed to it by the i-th char and adding the value contributed by the (i+n)th character. In our example hash function it would look like,
hash( x[i+1..n+1) ) = ( hash[ x[i..n) ] – value(x[i])*a^(n-1) + value( x[i+n] ) ) % M
Thus we can compute the next hash value in O(1) time. This gives our solution an overall complexity of O(n), since we generate all O(n) hashes in O(1) after the first one.
In Rabin Karp, if the hash value matches, we check the strings for a match using the the usual O(n) method, thus elimination false positives due to collisions. This is ok if we only need to check whether a string occurs in another or not. But in this problem we also need to count all occurrences, so using that will end up being O(n^2) if all substrings match. So we simply choose a very good randomized hash function and depend on the fact that collisions are so rare that the chances of one happening during system tests is practically zero.
I coded this right away and, after the traditional bug hunt all my code demand, was happy to see my solution solve all sample cases well in time. Confident of my solution, I moved on to the next problem , quite pleased with myself and thinking of a large rating increase.
But my solution failed the system tests. I looked at my code again, but could not find any mistakes. After a few more minutes I gave up. Later on, I tried my solution in the practice rooms, and it turned out it fails even some of the example cases in the system tests. That’s really odd since I tested those cases myself before submitting and it passed all of them. Running systems tests again, I saw my solution now fails on several other cases, but passes some of the old ones. The exact same code giving different results on the same input. My immediate suspicion was that the hash function was getting way too many collisions. I changed my code to use long longs instead of ints and ensured that the value of M would never be too small. I was dumb enough to simply code that M would be in the range [3,2^31), so small values of M would still make my solution fail. I submitted again, and it passes all tests. Code for my practice room submission can be found here.
Perhaps writing this will ensure I remember this lesson. On the bright side, I got a small rating increase despite scoring a duck.