David Smith

Here is some sample code, a C program I wrote in March 2014.

Problem

Given:

Word W on alphabet {A, B, C, ..., Z}, where a word is just a list of letters, not necessarily in any dictionary. Assume W is 1 to 25 letters long.

Find:

Number N = N(W), the rank of W, so that W is the Nth word in the lexicographically sorted list of words on the multiset of letters of W. Assume W is chosen so that N < 264.

Example:

Suppose W = PEEP.
  1. EEPP
  2. EPEP
  3. EPPE
  4. PEEP
  5. PEPE
  6. PPEE
Then N = 4.

Criteria:

Fast, memory-efficient, and understandable.

My solution

Let < denote the lexicographic order when applied to words. Now N(W) - 1 is the number of words W' < W on the same letter multiset as W. We partition these as follows. Let Sn(W) be the set of words W' < W on the same letter multiset where W and W' disagree in the nth position from the end, but agree in all positions left of that one.

Example:

Let W = POOLS.

S1(POOL|S) = S1(POOLS) is the set of words W' < POOLS that begin POOL and have the same letter multiset as POOLS. Then the last letter of W' must be S, but then it is not the case that W' < POOLS, so S1 is empty.

S2(POO|LS) is the set of words POOxy where xy is a word on letter multiset LS so that x < L. Again there are none.

S3(PO|OLS) is the set of words POxyz where xyz is a word on letter multiset LOS so that x < O. (For this purpose we can ignore the prefix PO and focus on the suffix OLS.) Then xyz is one of LOS, LSO. In fact, the size #S3 of set S3 is the multinomial coefficient

#S3 = m(2; 1, 1) = m(1, 1) = (1 + 1)! / (1! · 1!) = 2,
for we know x = L, and once this is determined, there are m(1, 1) ways to choose y and z from letter multiset OS (there is one O and one S).

S4(P|OOLS) is the set of words Pxyzw where xyzw is a word on letter multiset LOOS and x < O. Then x can be only L, and yzw is a word on letter multiset OOS. Then

#S4 = m(2, 1) = (2 + 1)! / (2! · 1!) = 3,
as there are 2 instances of O and 1 of S; yzw is one of OOS, OSO, SOO.

S5(|POOLS) is the set of words xyzwv on letter multiset LOOPS with x < P. Then x is one of OL, and

#S5 = m(1, 1, 1, 1) + m(2, 1, 1).

In all,

N(POOLS) - 1 = m(1, 1) + m(2, 1) + m(1, 1, 1, 1) + m(2, 1, 1) = 2 + 3 + 24 + 12 = 41.
That is, N(POOLS) = 42.

An insight:

Each multinomial coefficient that we consider differs only a little from the previous one. We may multiply and divide by little factors.

Definitions:

Let the sequence of tail words be denoted W1, W2, ..., e.g.,
W0 empty, W1 = S, W2 = LS, W3 = OLS, W4 = OOLS, W5 = POOLS.

Let Mn denote the set of arrangements of the letters of Wn. Let wn(c) be the number of occurrences of character c in Wn. Then |Mn| is a multinomial coefficient,

|Mn| = mn = [wn(A) + ... + wn(Z)]! / [wn(A)! · ... · wn(Z)!].

In the code below, *t refers to the character at the head of Wn. Let wn(c < *t) denote the number of characters c in Wn before *t in the alphabet.

On overflow:

In this problem, words can be 25 letters long, but 24! doesn't fit in a 64-bit integer. A way to avoid overflow is to separate factors of 2:
6! = 1 · 2 · 3 · 4 · 5 · 6 = 2 · 3 · 22 · 5 · (2 · 3) = 45 · 24.
We store the number of factors 2 rather than a power of 2. For example, the odd part of 6! is 45, and its even part is 4. The odd part of 24! fits in 58 bits (and conveniently in a 64-bit unsigned long long int of C99). Each multinomial coefficient m(k; ...) that we consider has no more of any prime factor than does k!, and k! has no more of any prime factor than does 24!.
#include <stdio.h>
int odd[]  = {0,1,1,3,1,5,3,7,1,9,5,11,3,13,7,15,1,17,9,19,5,21,11,23,3,25};
int even[] = {0,0,1,0,2,0,1,0,3,0,1, 0,2, 0,1, 0,4, 0,1, 0,2, 0, 1, 0,3, 0};
We need odd and even parts of only a few small numbers. Precompute them for speed.

Algorithm:

Scan the word right to left. Keep a (singly) linked list in alphabetical order of distinct characters seen along with their counts. Each time a letter of the word is revealed, update the list. Only one letter-count pair is incremented or inserted each time. While seeking it in the sorted list, add up the multinomial coefficients corresponding to its alphabetical predecessors.
void rank(char *s) { 
    char L[32];                         // L[k]: kth distinct character seen
    int C[32];                          // C[k]: no. chars. L[k] in W_n
    int D = 0;                          // D: no. distinct characters seen
    int next[32];     // Let i(0) = 0; i(k) = next[i(k-1)] if k > 0; and next[i(D)] 
    next[0] = 0;      // = 0. The alphabetically kth distinct char. seen is L[i(k)].
    unsigned long long rank = 1, m_odd = 1;
    int m_even = 0;                     // m = m_odd * (2 ^ m_even)
    char *t;                            // tail string W_n
    for(t = s; *t; t++);                // advance t to end of string
    for(int n = 1; t-- != s; n++) {     // read string right to left
        int k = 0, w = 0;               // w: w_n(c < *t)
        for(; next[k] && L[next[k]] < *t; k = next[k])
            w += C[next[k]];
        // now L[k] is *t's alphabetical predecessor
        if(!next[k] || L[next[k]] > *t) { // *t is new; update L, C, D, and next
            D++;
            L[D] = *t;
            C[D] = 0;
            next[D] = next[k];
            next[k] = D;
        }
        int ct = ++C[next[k]];          // ct: w_n(*t)

        // rank += m_{n-1} * w_n(c < *t) / w_n(*t); see comments @1, @2
        if(w) rank += m_odd * odd[w] / odd[ct] << m_even + even[w] - even[ct]; 

        // m = m_n = m_{n-1} * n / w_n(*t)
        m_odd  = m_odd  *  odd[n] /  odd[ct];
        m_even = m_even + even[n] - even[ct];
    }
    printf("%llu\n", rank);
}

Comment @1:

This formula is a straightforward implementation of the small-adjustment idea. It might be nice to have a combinatorial interpretation. The cartesian product of sets A and B is denoted A × B. Then a combinatorial interpretation is a one-to-one correspondence from
Sn × {1, 2, ..., wn(*t)}
to
Mn-1 × {1, 2, ..., wn(c < *t)}.

Here's one. Think of an element (U, k) of Sn × {1, ..., wn(*t)} as a word U < Wn (on the letters of Wn) with its kth character *t underscored. (By definition, there are wn(*t) occurrences of the letter *t in Wn. Put an underscore under the kth one, but don't tie it to the character, as the character will be removed shortly.)

E.g., if Wn = BAABA, then U could be AAABB, and k could be 1, giving AAABB, where bold type indicates the underscored character.

Consider the following operation: take such an element (U, k), pop off its head letter, and replace the underscored character by it. The result is a word on the letters of Wn-1 with a character c < *t underscored.

E.g., AAABB -> AAAB. That is, A1A2A3BiBii -> A2A3A1Bii.

The inverse operation is to remove the underscored character to the front and fill its vacated place with *t.

Comment @2:

Overflow may occur only when we reconstitute a multinomial coefficient from its odd and even parts, or when we add it to the tally. Either way, if overflow occurs, it is because the answer itself is too large. When we update m_odd at the end of the loop, m_odd occupies at most 58 bits, and odd[n] occupies at most 5 bits. Similarly the product m_odd * odd[w] does not overflow.

It would be easier to avoid overflow by using a big-integer library such as GMP, but slower. If longer words are to be considered, we might try to separate more prime factors. Because factorials lack large prime factors, it should be possible to pack their prime-factor representations into a single 64-bit integer for strings 33 characters long. Combining this with the idea of separating 2 from the next 12 primes, it should be possible to pack prime-factor representations (say, with a 5-bit count of each prime) into a single 64-bit integer for strings of 43 characters or so.

int main(int argc, char **argv) {
    if(argc < 2) {
        printf("Usage: %s WORD\n", argv[0]);
        return 1;
    }
    if(argv[1][0] != '-') 
        rank(argv[1]);
    else { // take words from stdin instead of argv (for performance testing)
        char line[30];
        while(0 < scanf("%s", line))
            rank(line);
    }
    return 0;
}

Compilation:

I used gcc -std=c99 -O3.

Efficiency:

This program is pretty fast. My computer takes 0.7 seconds to process a million random words of length uniformly distributed on interval [1..25], three times longer than it takes to compute the lengths of the same words. (In both cases, words are read from stdin, and output is directed to /dev/null.) Little memory is used, perhaps 1000 bytes on startup in addition to whatever is reserved for the 9000-byte executable file.

Remark:

I don't know how representative this program is. Usually mine have more than a couple functions. Seldom do I hunt for 10% speed improvements as I did here. If even greater speed is needed, a next step might be to examine the instructions generated by the compiler. This would be slow work, as I rarely write in assembly.

Here is the complete program source.