
Binary Search

Binary search is an efficient algorithm for determining whether a target value exists within a sorted array. It is straightforward to implement and can be optimized for improved performance. In the sections below, I will present a comprehensive explanation of both the basic version and several optimized versions.

Classroom approach

As we learned in classroom lectures, writing a binary search is a straightforward task, as demonstrated below:

int binary_search(const int *data, int len, const int &x) {
  int lo = 0, hi = len - 1;

  while (lo <= hi) {
    // Overflow-safe midpoint.
    int mid = lo + ((hi - lo) >> 1);
    if (data[mid] == x) {
      return mid;
    } else if (data[mid] < x) {
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }

  return -1;
}
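
For completeness, here is a minimal driver (my own illustration; the array contents are arbitrary):

#include <cstdio>

int binary_search(const int *data, int len, const int &x);  // defined above

int main() {
  // binary_search requires the input to be sorted in ascending order.
  const int data[] = {1, 3, 5, 7, 9, 11};
  const int len = sizeof(data) / sizeof(data[0]);

  std::printf("%d\n", binary_search(data, len, 7));  // prints 3 (index of 7)
  std::printf("%d\n", binary_search(data, len, 4));  // prints -1 (not found)
  return 0;
}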

It exhibits an efficient time complexity of O(log n). However, in high-performance, low-latency systems, it may be necessary to rewrite it into a more optimized version.

Real-world scenarios

The above version is not good enough because branches (which incur branch-misprediction penalties in the instruction pipeline) and data cache misses can negatively impact performance. It is important to eliminate branches and data cache misses where possible.
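
As a contrived illustration of the first point (my own example, separate from the versions below), a data-dependent if/else can be rewritten as arithmetic on the result of a comparison, which compilers can typically lower to a conditional move instead of a conditional jump:

// Branchy: the CPU must predict which side will be taken.
int select_branchy(int lhs, int rhs, int a, int b) {
  if (lhs < rhs)
    return a;
  return b;
}

// Branchless: the comparison result (0 or 1) selects the value arithmetically.
int select_branchless(int lhs, int rhs, int a, int b) {
  int take_a = (lhs < rhs);
  return take_a * a + (1 - take_a) * b;
}

The branchless lower_bound variants below apply the same trick to the pointer-advance step of the search.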

Using std::lower_bound() versions

One alternative solution is to use a modified std::lower_bound as a replacement for std::binary_search, as illustrated below one version at a time:

int binary_search(const int *data, int len, const int &x) {
  int i = lower_bound(data, len, x);

  return (i == len || data[i] != x) ? -1 : i;
}
  • lower_bound v1
int lower_bound(const int *data, int len, const int &x) {
  if (!len || x > data[len - 1])
    return len;

  const int *base = data;
  while (len > 1) {
    int half = (len >> 1);
    if (base[half - 1] < x) {
      base += half;
      len -= half;
    } else {
      len -= half;
    }
  }

  return base - data;
}
  • lower_bound v2 (branchless)
int lower_bound_v2(const int *data, int len, const int &x) {
  if (!len || x > data[len - 1])
    return len;

  const int *base = data;
  while (len > 1) {
    int half = (len >> 1);
    base += half * (base[half - 1] < x);
    len -= half;
  }

  return base - data;
}
  • lower_bound v3 (branchless with prefetch)
int lower_bound_v3(const int *data, int len, const int &x) {
  if (!len || x > data[len - 1])
    return len;

  const int *base = data;
  while (len > 1) {
    int half = (len >> 1);
    // Prefetch the two regions the next probe may touch, whichever way
    // the comparison goes.
    __builtin_prefetch(base + (half >> 1) - 1);
    __builtin_prefetch(base + half + (half >> 1) - 1);
    base += half * (base[half - 1] < x);
    len -= half;
  }

  return base - data;
}
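
Before benchmarking, the variants can be sanity-checked against std::lower_bound (a small test sketch of my own, not part of the original post):

#include <algorithm>
#include <cassert>
#include <vector>

// Declarations of the versions shown above.
int lower_bound(const int *data, int len, const int &x);
int lower_bound_v2(const int *data, int len, const int &x);
int lower_bound_v3(const int *data, int len, const int &x);

int main() {
  std::vector<int> v = {1, 2, 2, 4, 7, 7, 9, 12};

  for (int x = 0; x <= 13; ++x) {
    // std::lower_bound provides the reference answer.
    int expected = std::lower_bound(v.begin(), v.end(), x) - v.begin();

    assert(lower_bound(v.data(), (int)v.size(), x) == expected);
    assert(lower_bound_v2(v.data(), (int)v.size(), x) == expected);
    assert(lower_bound_v3(v.data(), (int)v.size(), x) == expected);
  }
  return 0;
}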

Using the Eytzinger layout version

In the paper “Array Layouts for Comparison-Based Searching” by Paul-Virak Khuong and Pat Morin, the authors describe a method for optimizing binary search by reorganizing the elements of a sorted array in a cache-friendly manner. For a detailed understanding of this technique, I recommend reading a blog post that offers an in-depth explanation of the subject.
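
For instance, with a sorted array of ten integers (my own illustration), the rearranged array places the root of the implicit search tree first, then its children level by level:

// Sorted input:      1  2  3  4  5  6  7   8  9  10
// Eytzinger layout:  7  4  9  2  6  8  10  1  3  5
//
// The element stored at index k (1-based) has its children at indices
// 2k and 2k + 1, so the first few levels of the tree occupy the first
// few cache lines of the array.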

However, it is worth noting that the blog presents a recursive version for rearranging the elements, which can crash once the recursion exceeds the stack depth limit. To address this issue, I have implemented an iterative version, which avoids recursion entirely.

#include <algorithm>
#include <queue>
#include <vector>

class Eytzinger {
public:
  struct zpair {
    int val = -1;
    int idx = -1;
  };

  static constexpr int block_size = 64 / sizeof(zpair);

  Eytzinger(const std::vector<int> &vec) : z_(vec.size() + 1) {
    build(vec);
  }

  int search(int target) {
    auto zp = search_(target);
    return (zp.val == target) ? zp.idx : -1;
  }

private:
  // Size of the left subtree of a heap-shaped tree with n nodes. Splitting
  // every range at this point keeps the implicit tree complete, so the node
  // assigned index i in BFS order really has its children at 2i and 2i + 1.
  static int left_subtree_size(int n) {
    if (n <= 1)
      return 0;
    int h = 31 - __builtin_clz(n);     // height of the tree: floor(log2(n))
    int last = n - ((1 << h) - 1);     // number of nodes on the last level
    int half = 1 << (h - 1);           // last-level capacity of the left subtree
    return (half - 1) + std::min(last, half);
  }

  void build(const std::vector<int> &vec) {
    int i = 1;
    std::queue<std::pair<int, int>> q;
    q.push({0, (int)vec.size()});

    while (!q.empty()) {
      auto range = q.front();
      q.pop();
      int lo = range.first, hi = range.second;
      if (lo < hi) {
        // Choose the root of this range so that the subtree stays heap-shaped;
        // a plain (lo + hi) / 2 split breaks the 2i / 2i + 1 child indexing.
        int mid = lo + left_subtree_size(hi - lo);
        z_[i].idx = mid;
        z_[i].val = vec[mid];
        ++i;

        q.push({lo, mid});
        q.push({mid + 1, hi});
      }
    }
  }

  inline zpair search_(int target) {
    long k = 1;
    while (k < (int)z_.size()) {
      // Prefetch the block of descendants a few levels below node k.
      __builtin_prefetch(z_.data() + k * block_size);
      k = (k << 1) + (z_[k].val < target);
    }
    // Undo the trailing right turns; k becomes 0 (the sentinel) when the
    // target is greater than every element.
    k >>= __builtin_ffs(~k);
    return z_[k];
  }

private:
  alignas(64) std::vector<zpair> z_;
};
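
Here is a minimal usage sketch (my own driver around the class above):

#include <cstdio>
#include <vector>

int main() {
  // The input must be sorted; the constructor builds a rearranged copy internally.
  std::vector<int> sorted = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  Eytzinger eyt(sorted);

  std::printf("%d\n", eyt.search(7));   // prints 6: the index of 7 in the sorted array
  std::printf("%d\n", eyt.search(11));  // prints -1: not found
  return 0;
}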

Benchmarks

I have conducted a performance benchmark of the different versions. To analyze the results clearly, I have divided them into two parts: one for smaller datasets and another for larger datasets. Based on the results, it can be concluded that for smaller datasets, version 2 performs better, whereas for larger datasets, the Eytzinger version outperforms the others. This distinction provides a clearer picture of the relative performance of each version in different scenarios. Here is my source code.
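
For reference, here is a simplified sketch of how such a comparison could be timed with std::chrono; this is only an illustration, not the actual benchmark code behind the results above:

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int binary_search(const int *data, int len, const int &x);  // classroom version

int main() {
  const int n = 1 << 20;                      // dataset size; vary this per experiment
  std::vector<int> data(n);
  for (int i = 0; i < n; ++i)
    data[i] = 2 * i;                          // sorted input

  std::mt19937 rng(42);
  std::uniform_int_distribution<int> dist(0, 2 * n);

  // Pre-generate queries so random-number generation is not timed.
  std::vector<int> queries(1 << 22);
  for (auto &q : queries)
    q = dist(rng);

  long long checksum = 0;                     // keeps the calls from being optimized away
  auto start = std::chrono::steady_clock::now();
  for (int q : queries)
    checksum += binary_search(data.data(), n, q);  // swap in the other versions to compare
  auto stop = std::chrono::steady_clock::now();

  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  std::printf("%.1f ns per lookup (checksum %lld)\n", (double)ns / queries.size(), checksum);
  return 0;
}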

This post is licensed under CC BY 4.0 by the author.