This was asked in my Google interview recently. I offered an answer that used bit shifts and was O(n), but the interviewer said it was not the fastest way. I don't understand: is there a way to count the set bits without having to iterate over all of the bits?
O(n), the n here is 160,000, I assume. Like other posts below, you can use a lookup array, say of 256 entries [8 bits], which gives the number of set bits for each byte value. So you'll have to look up this array just twice [for the lower & upper 8 bits] for each of the 10,000 numbers [that's 20,000 lookups instead of 160,000]. You can extend this to a 65,536-entry lookup array [16 bits] - Bhaskar 2012-04-05 00:26
There is an instruction, POPCNT, which counts the number of set bits in a word - hammar 2012-04-05 00:27
while( w != 0 ) count++, w &= w-1;
But using this still results in worst-case O(n) (which is 10,000 in this case) times 16 bits per word (in this example) - dcow 2013-01-03 22:08
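For reference, the trick in the comment above (Kernighan's method) can be sketched in Java like this; the class and method names are mine. `w &= w - 1` clears the lowest set bit, so the loop runs once per set bit rather than once per bit position:

```java
public class Kernighan {
    // Count set bits: w &= w - 1 clears the lowest set bit,
    // so the loop iterates once per set bit.
    static int popCount16(int w) {
        int count = 0;
        while (w != 0) {
            count++;
            w &= w - 1; // drop the lowest set bit
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(popCount16(0b1011));  // 3 bits set
        System.out.println(popCount16(0xFFFF)); // 16 bits set
    }
}
```

As the comment notes, this is still O(n) over the words, but for sparse words it does far fewer iterations than shifting through all 16 bit positions.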
Brute force: 10000 * 16 * 4 = 640,000 ops (shift, compare, increment, and loop iteration for each of the 16 bits in each word).
Faster way:
We can build a table mapping each byte 00-FF -> number of bits set: 256 * 8 * 4 = 8,192 ops.
I.e. we build a table where, for each byte value, we calculate the number of bits set.
Then, for each 16-bit int, we split it into the upper and lower bytes:
for (n in array) {
    byte lo = n & 0xFF;        // lower 8 bits
    byte hi = (n >> 8) & 0xFF; // upper 8 bits
    // simply add the number of bits in the upper and lower
    // halves of each 16-bit number
    // using the pre-calculated table
    k += table[lo] + table[hi];
}
60,000 ops in the iteration (6 ops per number times 10,000), i.e. 68,192 ops in total. It's still O(n), but with a smaller constant (~9 times smaller).
In other words, we pre-calculate the number of bits set for every 8-bit value, and then split each 16-bit number into two 8-bit halves and count the set bits using the pre-built table.
You can fill the table incrementally: table[n] = table[n >> 1] + (n & 1); this requires fewer memory accesses - Priyank Bhatnagar 2012-04-05 00:48
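The incremental construction from that comment can be sketched as follows (a minimal example; the class and method names are mine). Shifting right by one drops the lowest bit, whose count `(n & 1)` adds back, so each entry is computed from an already-filled smaller entry:

```java
public class IncrementalTable {
    // Build a 256-entry popcount table without an inner bit loop:
    // n >> 1 drops the lowest bit, and (n & 1) adds it back if set.
    static int[] buildTable() {
        int[] table = new int[256]; // table[0] = 0 by default
        for (int n = 1; n < 256; n++) {
            table[n] = table[n >> 1] + (n & 1);
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = buildTable();
        System.out.println(table[0x0F]); // 4
        System.out.println(table[0xFF]); // 8
    }
}
```

This builds the table in one pass with one add and one shift per entry, instead of looping over all 8 bits of each value.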
There's (almost) always a faster way. Read up on lookup tables.
I don't know what the correct answer was when this question was asked, but I believe the most sensible way to solve this today is to use the POPCNT
instruction. Specifically, you should use the 64-bit version. Since we just want the total number of set bits, boundaries between 16-bit elements are of no interest to us. Since the 32-bit and 64-bit POPCNT
instructions are equally fast, you should use the 64-bit version to count four elements' worth of bits per cycle.
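Java offers no way to emit POPCNT directly, but `Long.bitCount` is a HotSpot intrinsic that compiles down to POPCNT on CPUs that support it. A sketch of this answer's idea, packing four 16-bit elements into one 64-bit word per count (the packing helper and names are mine):

```java
public class PopcntSum {
    // Long.bitCount is JIT-compiled to the POPCNT instruction on
    // supporting x86 CPUs; packing four 16-bit elements per long
    // counts 64 bits per call, since element boundaries don't matter.
    static int countBits(short[] data) {
        int total = 0;
        int i = 0;
        // Process four 16-bit elements per 64-bit word.
        for (; i + 4 <= data.length; i += 4) {
            long word = (data[i] & 0xFFFFL)
                      | (data[i + 1] & 0xFFFFL) << 16
                      | (data[i + 2] & 0xFFFFL) << 32
                      | (data[i + 3] & 0xFFFFL) << 48;
            total += Long.bitCount(word);
        }
        // Handle any leftover elements one at a time.
        for (; i < data.length; i++) {
            total += Integer.bitCount(data[i] & 0xFFFF);
        }
        return total;
    }

    public static void main(String[] args) {
        short[] a = {0x000F, (short) 0xFFFF, 0x0001, 0, 0x0003};
        System.out.println(countBits(a)); // 4 + 16 + 1 + 0 + 2 = 23
    }
}
```

The masks with `0xFFFFL` matter: `short` is signed in Java, so without them a value like 0xFFFF would sign-extend and corrupt the packed word.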
I just implemented it in Java:
import java.util.Random;

public class Main {
    static int array_size = 1024;
    static int[] array = new int[array_size];
    static int[] table = new int[256]; // bit count for every possible byte value
    static int total_bits_in_the_array = 0;

    private static void create_table() {
        for (int i = 0; i < 256; i++) {
            int bits_set = 0;
            for (int z = 0; z < 8; z++) {
                bits_set += i >> z & 0x1;
            }
            table[i] = bits_set;
            //System.out.println("i = " + i + " bits_set = " + bits_set);
        }
    }

    public static void main(String args[]) {
        create_table();
        fill_array();
        parse_array();
        System.out.println("The amount of bits in the array is: " + total_bits_in_the_array);
    }

    private static void parse_array() {
        for (int i = 0; i < array.length; i++) {
            int current = array[i];
            int down = current & 0xff;      // lower 8 bits
            int up = (current >> 8) & 0xff; // upper 8 bits
            total_bits_in_the_array += table[up] + table[down];
        }
    }

    private static void fill_array() {
        Random ran = new Random();
        for (int i = 0; i < array.length; i++) {
            array[i] = ran.nextInt(512); // random values in [0, 512); any 16-bit value works
        }
    }
}
Also at https://github.com/leitao/bits-in-a-16-bits-integer-array/blob/master/Main.java
You can pre-calculate the bit counts for all byte values and then use that table for lookups. It is faster, if you make certain assumptions.
The number of operations (just computation, not reading input) should be as follows.
Shift approach:
For each 16-bit word: 2 ops (shift, add) times 16 bits = 32 ops, 0 mem accesses; times 10 000 = 320 000 ops + 0 mem accesses
Pre-calculation approach:
255 numbers times 2 ops (shift, add) times 8 bits = 4 080 ops + 255 mem accesses (writing the results)
For each 16-bit word: 2 ops (compute the table addresses) + 2 mem accesses + 1 op (add the results) = 30 000 ops + 20 000 mem accesses
Total of 34 080 ops + 20 255 mem accesses
So: a lot more memory accesses, but far fewer operations.
Thus, assuming everything else is equal, pre-calculation for 10 000 words is faster if we can assume one memory access costs no more than (320 000 - 34 080)/20 255 ≈ 14.1 operations.
Which is probably true if you are alone on a dedicated core on a reasonably modern box, as the 256-entry table should fit into the cache. If you start getting cache misses, the assumption might no longer hold.
Also, this math assumes pointer arithmetic and direct memory access as well as atomic operations and atomic memory access. Depending on your language of choice (and, apparently, based on previous answers, your choice of compiler switches), that assumption might not hold.
Finally, things get more interesting if you consider scalability: shifting can easily be parallelised across up to 10 000 cores, but pre-computation not necessarily. As the number of words goes up, however, the lookup approach becomes more and more advantageous.
So, in short: yes, pre-calculation is faster under pretty reasonable assumptions, but no, it is not guaranteed to be faster.