This is based on the Wikipedia example, the StackOverflow thread, and Bit Twiddling Hacks by Sean Anderson. Note that in practice, modern CPUs have a dedicated instruction for this, interoperable enough that it's even in WebAssembly. This post explains what this instruction is useful for. (A variation of this parallelization trick can also be used for reversing the bits in a word (see also StackOverflow), which has a dedicated CPU instruction in ARM but not x86, so maybe that's more useful.) As an example, consider 1,825,859,237₁₀ = 0110⋅1100 1101⋅0100 0110⋅0110 1010⋅0101₂, which has 16 ones. (For readability, the bytes are separated by spaces and the nibbles are separated by dots.) We can of course go through and count them one-by-one, but there's a clever trick where we count them in parallel. Sketch: First, to count the ones in each pair of bits AB, we just want to do A+B. If we mask, we can pull out the Bs; if we shift and mask, we can pull out the As; then we can add them: 0110⋅1100 1101⋅0100 0110⋅0110 1010⋅0101 & 0101⋅0101 1101⋅0101 0101⋅0101 0101⋅0101 0100⋅0100 0101⋅0100 0100⋅0100 0000⋅0101 0110⋅1100 1101⋅0100 0110⋅0110 1010⋅0101 >> 1 = 0011⋅0110 0110⋅1010 0011⋅0011 0101⋅0010 0011⋅0110 0110⋅1010 0011⋅0011 0101⋅0010 & 0101⋅0101 1101⋅0101 0101⋅0101 0101⋅0101 0001⋅0100 0100⋅0000 0001⋅0001 0101⋅0000 Then for each pair of 2-bit counts, we can do the same trick, mask with 0011⋅0011 0011⋅0011 0011⋅0011 0011⋅0011 to pull out every other 2-bit count, then shift and mask to pull out the other every other 2-bit counts, and add them, then repeat with 0000⋅1111 0000⋅1111 0000⋅1111 0000⋅1111, etc: x = (x & 0x55555555) + ((x >> 1) & 0x55555555) x = (x & 0x33333333) + ((x >> 2) & 0x33333333) x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F) x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF) x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF) This is pretty cool, but there's 3 tricks we can do to even further optimize this: Trick 1: We can simplify the first step because for just two bits, instead of (x & 01) + ((x >> 1) & 01), it so happens that you can just do x - ((x >> 1) & 01); this easy to verify by trying all 4 values of 2 bits (0, 1, 2, 3). Trick 2: We can simplify the third step by adding before masking, rather than two separate masks before adding, because in the maximum of 8 possible ones easily fits in 4 bits. (Can't do this in earlier steps because, for example, the maximum of 4 possible ones does not fit in 2 bits.) Trick 3: We can simplify the fourth and later steps by noticing that the maximum possible total of 32 ones easily fits in 8 bits, so we can just multiply by 0x01010101, which is equivalent to x + (x<<8) + (x<<16) + (x<<24). End result: x = x - ((x >> 1) & 0x55555555); // Trick 1 x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // original x = (x + (x >> 4)) & 0x0F0F0F0F; // Trick 2 x = x * 0x01010101; // Trick 3