This is based on the Wikipedia example, the StackOverflow thread, and Bit Twiddling Hacks by Sean Anderson. Note that in practice, modern CPUs have a dedicated instruction for this, interoperable enough that it's even in WebAssembly. This post explains what this instruction is useful for. (A variation of this parallelization trick can also be used for reversing the bits in a word (see also StackOverflow), which has a dedicated CPU instruction in ARM but not x86, so maybe that's more useful.) As an example, consider 1,825,859,23710 = 01101100 11010100 01100110 101001012, which has 16 ones. (For readability, the bytes are separated by spaces and the nibbles are separated by dots.) We can of course go through and count them one-by-one, but there's a clever trick where we count them in parallel. Sketch: First, to count the ones in each pair of bits AB, we just want to do A+B. If we mask, we can pull out the Bs; if we shift and mask, we can pull out the As; then we can add them: 01101100 11010100 01100110 10100101 & 01010101 11010101 01010101 01010101 01000100 01010100 01000100 00000101 01101100 11010100 01100110 10100101 >> 1 = 00110110 01101010 00110011 01010010 00110110 01101010 00110011 01010010 & 01010101 11010101 01010101 01010101 00010100 01000000 00010001 01010000 Then for each pair of 2-bit counts, we can do the same trick, mask with 00110011 00110011 00110011 00110011 to pull out every other 2-bit count, then shift and mask to pull out the other every other 2-bit counts, and add them, then repeat with 00001111 00001111 00001111 00001111, etc: x = (x & 0x55555555) + ((x >> 1) & 0x55555555) x = (x & 0x33333333) + ((x >> 2) & 0x33333333) x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F) x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF) x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF) This is pretty cool, but there's 3 tricks we can do to even further optimize this: Trick 1: We can simplify the first step because for just two bits, instead of (x & 01) + ((x >> 1) & 01), it so happens that you can just do x - ((x >> 1) & 01); this easy to verify by trying all 4 values of 2 bits (0, 1, 2, 3). Trick 2: We can simplify the third step by adding before masking, rather than two separate masks before adding, because in the maximum of 8 possible ones easily fits in 4 bits. (Can't do this in earlier steps because, for example, the maximum of 4 possible ones does not fit in 2 bits.) Trick 3: We can simplify the fourth and later steps by noticing that the maximum possible total of 32 ones easily fits in 8 bits, so we can just multiply by 0x01010101, which is equivalent to x + (x<<8) + (x<<16) + (x<<24). End result: x = x - ((x >> 1) & 0x55555555); // Trick 1 x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // original x = (x + (x >> 4)) & 0x0F0F0F0F; // Trick 2 x = x * 0x01010101; // Trick 3