Array of 10000 having 16bit elements, find bits set (unlimited RAM) - Google interview


21

This was asked in my Google interview recently. I offered an answer that involved bit shifts and was O(n), but she said this is not the fastest way to go about it. I don't understand: is there a way to count the set bits without iterating over all the bits?

2012-04-05 00:18
by noMAD
No. If you want to count all the bits, then you need to visit all the bits. So you can't do better than O(n) - Oliver Charlesworth 2012-04-05 00:20
Unlimited RAM? 64K-entry lookup table sounds faster to me - cHao 2012-04-05 00:21
So you did a shift 16 times per each element? I.e. count bits in the first element, then second etc - Eugene Retunsky 2012-04-05 00:22
@Oli - maybe not asymptotically faster, but with the stated "unlimited RAM", it could be made about 16 times faster by having a 16-bit lookup table - Carl Norum 2012-04-05 00:22
When you say O(n), the n here is 160000, I assume. Like other posts below, you can use a look-up array, say of 256 numbers [8 bits], which gives the number of set bits for each number. So you'll just have to look up this array twice [for the lower & upper 8 bits] for each of the 10000 numbers [that's 20000 look-ups instead of 160000]. You can extend this to a 65536-entry look-up array [16 bits] - Bhaskar 2012-04-05 00:26
Depending on platform, you could also use instructions like POPCNT which count the number of set bits in a word - hammar 2012-04-05 00:27
Have a look at the 2nd answer here - mshsayem 2012-04-05 00:29
2^16-entry lookup table. See population count in Hacker's Delight - Ron 2012-04-05 00:30
@mshsayem: That's what I gave them. Not good enough. I think the look-up table as mentioned was what they were looking for - noMAD 2012-04-05 00:33
Was this actually a Google interview question? It sounds a bit basic (compared to the impression I get when reading about Google interviews elsewhere...) - Oliver Charlesworth 2012-04-05 00:57
There is also the traditional algorithm to count the 1 bits in a single word that runs in O(k) time, where k is the number of 1s: while (w != 0) { count++; w &= w - 1; } But using this still results in worst case O(n) words (10,000 in this case) times 16 bits per word - dcow 2013-01-03 22:08


19

Brute force: 10000 * 16 * 4 = 640,000 ops (shift, compare, increment and loop iteration for each of the 16 bits in a word).

Faster way:

We can build a table 00-FF -> number of bits set. 256 * 8 * 4 = 8192 ops.

I.e., we build a table where for each byte value we pre-calculate the number of bits set.

Then for each 16-bit int we split it into its upper and lower bytes:

for (n in array) {
   byte lo = n & 0xFF; // lower 8 bits
   byte hi = n >> 8;   // upper 8 bits
   // simply add the number of bits in the upper and lower parts
   // of each 16-bit number using the pre-calculated table
   k += table[lo] + table[hi];
}

60000 ops in total for the iteration, i.e. 68192 ops overall. It's still O(n), but with a smaller constant (~9 times less).

In other words, we calculate the number of set bits for every 8-bit value once, and then split each 16-bit number into two 8-bit halves to count its set bits using the pre-built table.
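A runnable Java sketch of the approach described above (class and method names are illustrative, not from the answer):

```java
public class ByteTableBitCount {
    // TABLE[b] = number of set bits in the byte value b (0..255).
    private static final int[] TABLE = new int[256];
    static {
        for (int b = 0; b < 256; b++) {
            // Reuse the already-computed entry for b >> 1.
            TABLE[b] = (b & 1) + TABLE[b >> 1];
        }
    }

    // Each array entry holds a 16-bit value; sum set bits via two table lookups.
    static int countBits(int[] words) {
        int k = 0;
        for (int n : words) {
            int lo = n & 0xFF;        // lower 8 bits
            int hi = (n >> 8) & 0xFF; // upper 8 bits
            k += TABLE[lo] + TABLE[hi];
        }
        return k;
    }

    public static void main(String[] args) {
        int[] words = {0x0000, 0xFFFF, 0x00FF, 0x8001};
        System.out.println(countBits(words)); // 0 + 16 + 8 + 2 = 26
    }
}
```

Note the table is built with the recurrence suggested in the comments below (table[n] = table[n >> 1] + (n & 1)) rather than an inner 8-bit loop; both produce the same 256 entries.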

2012-04-05 00:33
by Eugene Retunsky
Can you elaborate a little more? Trying to understand what you wrote - noMAD 2012-04-05 00:35
I updated the answer. Please let me know if it's clear or not clear enough - Eugene Retunsky 2012-04-05 00:38
Of course, you're assuming that random access into a 64kB table is the same cost as any other operation.. - Oliver Charlesworth 2012-04-05 00:43
One other simple way to build the table: table[n] = table[n >> 1] + (n & 1); this requires fewer memory accesses - Priyank Bhatnagar 2012-04-05 00:48
This precomputed table (256 bytes) can even be cached in L1, which is very fast - Eugene Retunsky 2012-04-05 02:41
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel - has anyone actually timed the code? Using the table vs. the parallel counting of bits? Tables were faster on older CPUs; tight loops in the instruction cache and L1/L2 caches often outperform table lookups - Alexander Atanasov 2013-01-06 08:32
@AlexanderAtanasov have you timed the code? : - Eugene Retunsky 2013-01-06 18:08
@EugeneRetunsky Nope, that's why i've asked - i've done a basic test doing for loop iterations and measuring time - it is faster. : - Alexander Atanasov 2013-01-06 20:40
@AlexanderAtanasov what is faster - look-ups or brute-force? in what language? I tried in Java - lookups were faster, as far as I remember - Eugene Retunsky 2013-01-07 01:04
@EugeneRetunsky in pure C lookups are faster - tried your table vs parallel bit counting. I've seen a case where in cold tests table lookups were faster but in real code they were slower due to cache misses - but that involved much bigger tables and a lot more computation. The weird thing is: with gcc -O3 -Wall -o 1 1.c: Tab Took 8353706, LL Took 24346915; with gcc -O2 -Wall -o 1 1.c: Tab Took 104, LL Took 314 -- the same code with O3 gives very bad results, while O2 looks ok. With gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6. - Alexander Atanasov 2013-01-07 06:01
@EugeneRetunsky code at http://i.bik.bg/zaxl/bitsin64karr. - Alexander Atanasov 2013-01-07 06:20
So I did the math - Andres Kütt 2017-01-24 07:35


6

There's (almost) always a faster way. Read up on lookup tables.

2012-04-05 00:20
by Carl Norum


2

I don't know what the correct answer was when this question was asked, but I believe the most sensible way to solve this today is to use the POPCNT instruction. Specifically, you should use the 64-bit version. Since we just want the total number of set bits, boundaries between 16-bit elements are of no interest to us. Since the 32-bit and 64-bit POPCNT instructions are equally fast, you should use the 64-bit version to count four elements' worth of bits per cycle.
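On the JVM this idea can be sketched with `Long.bitCount`, which HotSpot typically compiles to a POPCNT instruction on x86; packing four 16-bit elements into one long matches the "four elements per popcount" point above (a sketch, not a benchmark):

```java
public class PopcntSum {
    // Count set bits across an array of 16-bit values stored in shorts,
    // processing four elements per 64-bit bitCount call.
    static int countBits(short[] a) {
        int total = 0;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            long packed = (a[i] & 0xFFFFL)
                    | ((a[i + 1] & 0xFFFFL) << 16)
                    | ((a[i + 2] & 0xFFFFL) << 32)
                    | ((a[i + 3] & 0xFFFFL) << 48);
            total += Long.bitCount(packed); // single popcount for 4 elements
        }
        for (; i < a.length; i++) { // up to 3 leftover elements
            total += Integer.bitCount(a[i] & 0xFFFF);
        }
        return total;
    }

    public static void main(String[] args) {
        short[] a = {(short) 0xFFFF, 0x0001, 0x00FF, 0x0000, 0x0003};
        System.out.println(countBits(a)); // 16 + 1 + 8 + 0 + 2 = 27
    }
}
```

The packing step is pure shifts and ORs, so even counting it, the per-element work stays far below a 16-iteration shift loop.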

2015-03-17 02:08
by dfeuer


0

I just implemented it in Java:

import java.util.Random;

public class Main {

    static int array_size = 1024;
    static int[] array = new int[array_size];
    static int[] table = new int[256]; // set-bit count for every byte value
    static int total_bits_in_the_array = 0;

    private static void create_table() {
        for (int i = 0; i < 256; i++) {
            int bits_set = 0;
            for (int z = 0; z < 8; z++) {
                bits_set += (i >> z) & 0x1;
            }
            table[i] = bits_set;
            //System.out.println("i = " + i + " bits_set = " + bits_set);
        }
    }

    public static void main(String[] args) {
        create_table();
        fill_array();
        parse_array();
        System.out.println("The amount of bits in the array is: " + total_bits_in_the_array);
    }

    private static void parse_array() {
        for (int i = 0; i < array.length; i++) {
            int current = array[i];

            int down = current & 0xff;      // lower byte
            int up = (current >> 8) & 0xff; // upper byte

            total_bits_in_the_array += table[up] + table[down];
        }
    }

    private static void fill_array() {
        Random ran = new Random();
        for (int i = 0; i < array.length; i++) {
            array[i] = ran.nextInt(1 << 16); // random 16-bit value
        }
    }
}

Also at https://github.com/leitao/bits-in-a-16-bits-integer-array/blob/master/Main.java

2015-03-16 20:51
by Breno Leitão


0

You can pre-calculate the bit counts of all byte values and then use that table for lookup. It is faster, if you make certain assumptions.

The number of operations (just computation, not reading input) is as follows.

Shift approach:

For each 16-bit element: 2 ops (shift, add) times 16 bits = 32 ops, 0 mem accesses; times 10 000 elements = 320 000 ops + 0 mem accesses

Pre-calculation approach:

Building the table: 256 entries times 2 ops (shift, add) times 8 bits = 4096 ops + 256 mem accesses (writing the results)

For each element: 2 ops (compute the two table addresses) + 2 mem accesses + 1 op (add the results); times 10 000 elements = 30 000 ops + 20 000 mem accesses

Total of 34 096 ops + 20 256 mem accesses

So a lot more memory accesses with far fewer operations.

Thus, all else being equal, pre-calculation for 10 000 elements is faster as long as one memory access costs less than (320 000 - 34 096) / 20 256 ≈ 14.1 times one operation.

Which is probably true if you are alone on a dedicated core on a reasonably modern box, as the 256-byte table should fit into L1 cache. If you start getting cache misses, the assumption might no longer hold.

Also, this math assumes pointer arithmetic and direct memory access as well as atomic operations and atomic memory access. Depending on your language of choice (and, apparently, based on previous answers, your choice of compiler switches), that assumption might not hold.

Finally, things get more interesting if you consider scalability: shifting can easily be parallelised across up to 10 000 cores, but pre-computation not necessarily. As the number of elements goes up, however, lookup gets more and more advantageous.

So, in short: yes, pre-calculation is faster under pretty reasonable assumptions, but no, it is not guaranteed to be faster.
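The back-of-the-envelope numbers above can at least be sanity-checked with a rough micro-benchmark. This is an illustrative sketch, not a rigorous harness (single timed pass after a crude warm-up; real conclusions need something like JMH), and all names in it are made up:

```java
import java.util.Random;

public class BitCountBench {
    // 256-entry table of set-bit counts per byte value.
    static final int[] TABLE = new int[256];
    static {
        for (int b = 0; b < 256; b++) TABLE[b] = (b & 1) + TABLE[b >> 1];
    }

    // Shift approach: examine all 16 bits of every element.
    static int shiftCount(int[] a) {
        int k = 0;
        for (int n : a)
            for (int z = 0; z < 16; z++) k += (n >> z) & 1;
        return k;
    }

    // Lookup approach: two table reads per element.
    static int tableCount(int[] a) {
        int k = 0;
        for (int n : a) k += TABLE[n & 0xFF] + TABLE[(n >> 8) & 0xFF];
        return k;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        int[] a = new int[10_000];
        for (int i = 0; i < a.length; i++) a[i] = r.nextInt(1 << 16);

        // Crude JIT warm-up, then one timed pass per variant.
        for (int w = 0; w < 1000; w++) { shiftCount(a); tableCount(a); }
        long t0 = System.nanoTime();
        int s1 = shiftCount(a);
        long t1 = System.nanoTime();
        int s2 = tableCount(a);
        long t2 = System.nanoTime();
        System.out.printf("shift: %d bits in %d ns%n", s1, t1 - t0);
        System.out.printf("table: %d bits in %d ns%n", s2, t2 - t1);
    }
}
```

Whatever the timings show on a given machine, the two methods must agree on the count, which is the easy part to verify.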

2017-01-24 07:59
by Andres Kütt