...one of the most highly regarded and expertly designed C++ library projects in the world.
— Herb Sutter and Andrei Alexandrescu, C++ Coding Standards
Histograms are a basic tool in statistical analysis. A histogram consists of a number of non-overlapping cells in data space. When an input value is passed to the histogram, the corresponding cell that envelopes the value is found and an associated counter is incremented.
When analyzing a large low-dimensional data set, it is more convenient to work with a histogram of the input values than with the original values. Keeping the cell counts in memory for analysis and/or processing requires far fewer resources than keeping the original values in memory and processing them. Information present in the original data can also be extracted from the histogram^{[1]}. Some information is lost in this way, but if the cells are small enough^{[2]}, the loss is often negligible. A histogram is a kind of lossy data compression. It is also often used as a simple estimator for the probability density function of the input data. More complex density estimators exist, but histograms remain attractive because they are easy to reason about.
This library provides a histogram for multi-dimensional data. In the multi-dimensional case, the input consists of tuples of values which belong together and describe different aspects of the same entity. A point in space is a good example. You need three coordinate values to describe a point. The entity is the point, and to fully characterize a point distribution in space you need three values and therefore a three-dimensional histogram. A three-dimensional histogram collects more information than three separate one-dimensional histograms, one for each coordinate. For example, you could have a point distribution that looks like a checkerboard in three dimensions (a checker cube): high and low densities alternate along each coordinate. The one-dimensional histograms along each coordinate would then look like flat distributions, completely hiding the complex structure, while the three-dimensional histogram would retain the structure for further analysis.
The term histogram is usually strictly used for something with cells over discrete or continuous data. This histogram class can also process categorical variables, and it even allows for non-consecutive cells if that is desired. There is no restriction to numbers as input. Any C++ type can be fed into the histogram if the user provides a specialized axis class that maps values of this type to a cell index. The only remaining restriction is that cells must be non-overlapping, since there must be a unique mapping from input value to cell. The library cannot automatically ensure this for user-provided axis classes, so the responsibility is on the user.
Furthermore, the histogram can handle weighted input. Normally, the cell counter which is connected to an input tuple is incremented by one, but sometimes it is useful to increment by a weight, an integral or floating point number.
Finally, the histogram can be configured to store an accumulator in each cell. Arbitrary samples can be passed to this accumulator, which may compute the mean, variance, median, or other interesting statistics from the samples that are sorted into its cell. When the accumulator computes a mean, the result is called a profile.
C++ lacks a widely-used, free multi-dimensional histogram class. While it is easy to write a one-dimensional histogram, writing a general multi-dimensional histogram poses more of a challenge. If you add a few more features required by scientific professionals onto the wish-list, the implementation becomes non-trivial and a well-tested library solution becomes desirable.
The GNU Scientific Library (GSL) and the ROOT framework from CERN have histogram implementations. The GSL has histograms for one and two dimensions in C. The implementations are not customizable. ROOT has well-tested implementations of histograms, but they are not customizable and they are not easy to use correctly. ROOT also has new implementations in beta-stage similar to this one, but they cannot be used without the rest of ROOT, which is a huge library to install just to get histograms.
The templated histogram class in this library has a minimalistic interface, which strives to be as elegant as the GSL implementations. In addition, it is very customizable and extensible through user-provided classes. A single implementation is used for one and multi-dimensional histograms. While being safe, customizable, and convenient, the histogram is also very fast. The static version, which has an axis configuration that is hard-coded at compile-time, is faster than any tested competitor.
One of the central design goals was to provide an abstract interface to the internal bin counters. The internal counting mechanism is encapsulated in a storage class, which can be replaced at compile-time. The default storage uses adaptive memory management which is safe to use, memory-efficient, and fast. The safety comes from the guarantee that counts cannot overflow or be capped, a guarantee hardly found in other libraries. In the standard configuration, the histogram just works under any circumstance. Users with unusual requirements can nevertheless implement their own custom storage class or use an alternative builtin array-based storage.
This library was written based on a decade of experience collected while working with big data, more precisely in the fields of particle physics and astroparticle physics. The design is guided by advice from people like Bjarne Stroustrup, Scott Meyers, Herb Sutter, Andrei Alexandrescu, and Chandler Carruth. I also like the Zen of Python (which applies to other languages, too) and borrowed ideas from the Eigen library.
Design goals of the library:
The library consists of three orthogonal components:
- the histogram host class, which combines axes and storage and translates input tuples into cell indices,
- axis types, which map input values to bin indices,
- storage types, which hold the actual bin counters.
Histograms store axis objects and a storage object. A one-dimensional histogram has one axis, a multi-dimensional histogram has several. When you pass an input tuple, say (v1, v2, v3), the first axis maps v1 onto index i1, the second axis v2 onto i2, and so on, to generate the index tuple (i1, i2, i3). The histogram host class then converts these indices into a linear global index that is used to address the bin counter in the storage object.
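As a minimal sketch of this mapping (assuming the make_histogram factory and regular axes provided by the library), a two-dimensional histogram can be created, filled, and inspected like this:

```cpp
#include <boost/histogram.hpp>

int main() {
  using namespace boost::histogram;

  // two axes: the histogram host class combines them with a storage object
  auto h = make_histogram(axis::regular<>(10, 0.0, 1.0),   // maps v1 to i1
                          axis::regular<>(5, 0.0, 100.0)); // maps v2 to i2

  h(0.23, 42.0); // the input tuple (v1, v2) is mapped to (i1, i2) = (2, 2)

  // at(i1, i2) addresses the bin counter behind the linear global index
  auto count = h.at(2, 2); // == 1
  (void)count;
}
```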
Note: To understand the need for multi-dimensional histograms, think of point coordinates. If all points that you consider lie on a line, you need only one value to describe the point. If all points lie in a plane, you need two values to describe the position. Three values are needed for a point in space. A histogram puts a discrete grid over the line, the plane or the space, and counts how many points lie in each cell of the grid. To approximate a point distribution on a line, a 1d-histogram is sufficient. To do the same in 3d-space, one needs a 3d-histogram.
This library supports different axis types, so that the user can customize how the mapping is done exactly; see axis types. Users can furthermore choose between several ways of storing axis types in the histogram.
When the number and types of the axes are known at compile-time, the histogram host class stores axis types in a std::tuple. We call this a static histogram. To access a particular axis, one should use a compile-time number as index (a run-time index also works with some limitations). A static histogram is extremely fast (see benchmark), because there is no overhead and the compiler can inline code, unroll loops, and more. Also, many user errors are caught at compile-time rather than run-time.
Static histograms are the best kind, but cannot be used when histograms are to be created with an axis configuration that is only known at run-time. This is the case, for example, when histograms are created from Python or from a graphical user interface. Therefore, more dynamic kinds of histograms are also supported.
There are two levels of dynamism. The histogram can hold instances of a single axis type in a std::vector. Now the number of axis instances per histogram can vary at run-time, but the axis type must be the same for all instances. We call this a semi-dynamic histogram.
If the axis types also need to vary at run-time, one can place the boost::histogram::axis::variant type in a std::vector; the variant can hold one of a set of different concrete axis types. We call this a dynamic histogram. The dynamic histogram is a single type that can store arbitrary sequences of different axis types, which may be generated at run-time. The polymorphic behavior of the generic boost::histogram::axis::variant type has a run-time cost, however. Typically, the performance is reduced by a factor of two compared to a static histogram.
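A rough sketch of the two extremes (assuming the make_histogram factory; the particular set of axis types placed in the variant is the user's choice):

```cpp
#include <boost/histogram.hpp>
#include <utility>
#include <vector>

int main() {
  using namespace boost::histogram;

  // static histogram: axis types are fixed at compile-time (stored in a std::tuple)
  auto h_static = make_histogram(axis::regular<>(10, 0.0, 1.0),
                                 axis::integer<>(0, 5));

  // dynamic histogram: the axis sequence is assembled at run-time
  using variant_t = axis::variant<axis::regular<>, axis::integer<>>;
  std::vector<variant_t> axes;
  axes.emplace_back(axis::regular<>(10, 0.0, 1.0));
  axes.emplace_back(axis::integer<>(0, 5));
  auto h_dynamic = make_histogram(std::move(axes));

  h_static(0.5, 2);
  h_dynamic(0.5, 2);
}
```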
Note: The design decision to store axis types in the variant-like type …
An axis defines an injective mapping of (a range of) input values to a bin. The logic is encapsulated in an axis type. Users can create their own axis classes and use them with the library, by implementing the Axis concept. The library comes with four builtin types, which implement different specializations.
- boost::histogram::axis::regular sorts real numbers into bins with equal width. The regular axis also supports monotonic transforms, which are applied when the input values are passed to the axis. This can be used to make a fast logarithmic axis, where the bins have equal width in the logarithm of the variable.
- boost::histogram::axis::variable sorts real numbers into bins with varying width.
- boost::histogram::axis::integer is a specialization of a regular axis for a range of integers with unit bin width. It is much faster than a regular axis.
- boost::histogram::axis::category is a bijective mapping of unique values onto bin indices and vice versa. This can be used with discrete categorical data, like "red", "green", "blue", for example.
Each builtin axis type has a few compile-time options, which change its behavior.
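For illustration, here is a sketch of how the builtin axis types might be constructed (the particular constructor arguments shown are arbitrary examples, not prescriptions):

```cpp
#include <boost/histogram.hpp>
#include <string>

int main() {
  using namespace boost::histogram;

  axis::regular<> r(10, 0.0, 1.0);                                // 10 bins of equal width over [0, 1)
  axis::regular<double, axis::transform::log> lr(5, 1.0, 100.0);  // bins of equal width in log(x)
  axis::variable<> v{0.0, 0.1, 0.5, 1.0};                         // 3 bins with varying width
  axis::integer<> i(0, 5);                                        // 5 unit-width bins for integers 0..4
  axis::category<std::string> c{"red", "green", "blue"};          // categorical data
  (void)r; (void)lr; (void)v; (void)i; (void)c;
}
```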
A storage type holds the actual cell values. It uses a one-dimensional index for cell lookup, computed by the histogram host from the indices generated by its axes. The storage needs to know nothing about axes. Users can integrate their own storage classes with the library by implementing the storage concept.
Standard containers can be used as storage backends; the library adapts them with the boost::histogram::storage_adaptor. Cell lookup typically happens in a tight loop and is random-access, so a normal std::vector works well as a storage backend. Sometimes this is the best solution, but there are some caveats to this approach. The user has to decide which type should represent the cell counts, and it is not an obvious choice. An integer type needs to be large enough to avoid counter overflow, but only a fraction of the bits are used if its capacity is too large. Such wasted memory causes more cache misses and degrades performance (see the benchmarks); the performance of modern CPUs depends a lot on effective utilization of the CPU cache, which is still small. Using floating point numbers instead of integers is also dangerous. They don't overflow, but cap the bin count when the bits in the mantissa are used up.
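A brief sketch of plugging in a standard container as the storage backend (assuming the make_histogram_with factory; choosing a counter type that is large enough remains the user's responsibility):

```cpp
#include <boost/histogram.hpp>
#include <vector>

int main() {
  using namespace boost::histogram;

  // use a std::vector<int> as the counter array; it is wrapped in a storage_adaptor
  auto h = make_histogram_with(std::vector<int>(),
                               axis::regular<>(10, 0.0, 1.0));
  h(0.5);
  // the user must now ensure that int never overflows for this data set
}
```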
The default storage used in the library is boost::histogram::unlimited_storage. It solves these issues with dynamic counter type management, based on the following insight. The curse of dimensionality makes the total number of bins grow very fast as the dimension of the histogram grows. However, having many bins also reduces the typical number of counts per bin, since the input values are spread over many more bins. This means a small counter is often sufficient for high-dimensional histograms.
The default storage therefore starts with a minimum amount of memory per cell: it uses 1 byte. If the count in any cell is about to overflow, all cells switch to the next larger integer type simultaneously. This goes on, doubling the capacity per cell whenever it is used up, until 8 bytes per bin are reached. The following images illustrate this progression for a storage of 3 bin counters. A new memory block is always allocated for all counters when the first one of them hits its capacity limit.
When even that is not enough, the default storage switches to a multiprecision type similar to those in Boost.Multiprecision, whose capacity is limited only by available memory. The following image is not to scale:
This approach is not only memory conserving, but also provides the strong guarantee that bin counters cannot overflow.
Note: The no-overflow-guarantee only applies when the histogram is not using weighted fills, or if all weights are integral numbers. When floating point weights are used, the default storage switches to a double counter per cell to store the sum of such weights. A double cannot provide the no-overflow-guarantee.
The best part: this approach is even faster for a histogram with sufficient size despite the run-time overheads of handling the counter type dynamically. The benchmarks show that the gains from better cache usage outweigh the run-time overheads of dynamic dispatching to the right bin counter type and the occasional allocation costs. Doubling the size of the bin counters each time helps, because the allocations happen only O(logN) times for N increments.
Lambdas were considered and rejected as a form of simple user-defined axis type, because they do not allow access to their state, such as the current axis size. Lambdas can be fully replaced by locally-defined structs. A local struct cannot be templated and cannot have templated methods, but this is not an issue: in the local context where the struct is created, all relevant types are already known, so locally defined structs can simply use these concrete types and there is no need for templates.
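As a sketch (assuming the minimal Axis concept consisting of an index() and a size() method), such a locally-defined axis struct could look like this:

```cpp
#include <boost/histogram.hpp>

int main() {
  using namespace boost::histogram;

  // locally-defined axis: sorts negative values into bin 0, non-negative values into bin 1
  struct sign_axis {
    axis::index_type index(double x) const { return x < 0 ? 0 : 1; }
    axis::index_type size() const { return 2; }
  };

  auto h = make_histogram(sign_axis{});
  h(-2.0);
  h(3.0);
  h(5.0);
  // h.at(0) == 1, h.at(1) == 2
}
```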
Axis instances by default add extra bins that count values which fall below or above the range covered by the axis (for those types where that makes sense). These extra bins are called under- and overflow bins, respectively. The extra bins can be turned off individually for each axis to conserve memory, but it is generally recommended to have them. The normal bins, excluding under- and overflow, are called inner bins.
Under- and overflow bins are useful in one-dimensional histograms, and nearly essential in multi-dimensional histograms. Here are the advantages:
- Without them, the reduce operation on a histogram, which removes some axes by summing all counts along those dimensions, would lead to distortions of the histogram along the remaining axes. When under- and overflow bins are present, the reduce operation always produces a sub-histogram identical to one obtained if it had been filled with the original data.
The presence of the extra bins does not interfere with normal indexing. On an axis with n bins, the first bin has the index 0 and the last bin n-1, while the under- and overflow bins are accessible at the indices -1 and n, respectively. This choice is optimized for users who are unaware of the existence of these extra bins. They would find the alternative indexing scheme surprising, where one starts with 0 at the underflow bin and the first normal bin is at 1. Also, the chosen scheme allows one to turn off the extra bins in the code where the histogram is created, without changing any code downstream that addresses inner bins with indices.
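For illustration, a small sketch of this indexing scheme (assuming an axis with n = 2 bins and the default under- and overflow bins):

```cpp
#include <boost/histogram.hpp>
#include <cassert>

int main() {
  using namespace boost::histogram;

  // 2 inner bins covering [0, 2); under- and overflow bins are added by default
  auto h = make_histogram(axis::regular<>(2, 0.0, 2.0));

  h(-1.0); // below the axis range, counted in the underflow bin
  h(0.5);  // first inner bin
  h(1.5);  // last inner bin
  h(9.0);  // above the axis range, counted in the overflow bin

  assert(h.at(-1) == 1); // underflow bin at index -1
  assert(h.at(0) == 1);  // first inner bin at index 0
  assert(h.at(1) == 1);  // last inner bin at index n-1
  assert(h.at(2) == 1);  // overflow bin at index n
}
```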
The standard library returns a container size as an unsigned integer, because a container size cannot be negative. The size() method of the histogram class follows this rule, but the size() methods of axis types return a signed integral type. Why? As explained in the section about under- and overflow, a histogram axis may have an optional underflow bin, which is addressed by the index -1. It follows that the index type must be a signed integer for all axis types.
The size() method of any axis returns the same signed integer type. The size of an axis cannot be negative, but returning a signed type has two advantages. Firstly, the value returned by size() is itself guaranteed to be a valid index, which is good since it may address the overflow bin. Secondly, comparisons between an index and the value returned by size() are frequent. If size() returned an unsigned integral type, compilers would produce a warning for each such comparison, and rightly so. Something awful happens on most machines when you compare -1 with an unsigned integer: -1 < 1u evaluates to false, which causes a serious bug in the following innocent-looking loop:
```cpp
auto my_axis = /* ... */;
// naive loop to iterate over all bins, including underflow and overflow
for (int i = -1; i <= my_axis.size(); ++i) {
  // body is never executed if the return value of my_axis.size() is an unsigned integral type
}
```
The advantages clearly outweigh the disadvantages of this choice.
An axis has a method which converts an index into the equivalent value at that index. If the axis is continuous, there are many possible values in the interval between two adjacent integer indices. Users often want to access the center of such an interval. The easiest and most efficient way to compute the center value is to accept real-valued indices in the conversion method. Then, the center of the first bin between index i and i+1 is simply obtained by passing i+0.5 to the method which computes the value.
This scheme is computationally efficient and intuitive. It is so useful that each continuous axis is required to accept a real-valued index. Library code relies on this to detect whether an axis is continuous or discrete. It introspects the argument type of the conversion function and checks whether it is a floating point or integral type, respectively.
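A short sketch of this convention, using the regular axis and its value() conversion method (the particular axis parameters here are arbitrary):

```cpp
#include <boost/histogram/axis/regular.hpp>
#include <iostream>

int main() {
  boost::histogram::axis::regular<> ax(10, 0.0, 1.0); // 10 bins over [0, 1)

  // a real-valued index is interpolated between bin edges;
  // i + 0.5 yields the center of bin i
  std::cout << ax.value(0.0) << "\n"; // lower edge of the first bin: 0
  std::cout << ax.value(0.5) << "\n"; // center of the first bin: 0.05
  std::cout << ax.value(9.5) << "\n"; // center of the last bin: 0.95
}
```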
Once a histogram is filled, the bin counter can be accessed with the at(...) method. The standard counter type has a value() method to return the count k. It also offers a variance() method, which returns an estimate v of the variance of that count.
If the input values for the histogram come from a stochastic process, the variance provides useful additional information. Examples of a stochastic process are a physics experiment or a random person filling out a questionnaire ^{[3]}. The variance v is the square of the standard deviation. The standard deviation is a number that tells us how much we can expect the observed value to fluctuate if we or someone else were to repeat our experiment with new random input.
Variance estimates are useful in many ways:
Why return the variance v and not the standard deviation s = sqrt(v)? The reason is that variances can be trivially added. Variances of independent samples can be added like normal numbers, v3 = v1 + v2. This is not true for standard deviations, where the addition law is more complex, s3 = sqrt(s1^2 + s2^2). In that sense, the variance is more straightforward to use during data processing. It is also more efficient for the same reason. The user can take the square root at the end of the processing to obtain the standard deviation as needed.
How is the variance estimate v computed? If we know the expected number of counts lambda per bin, we could compute the variance as v = lambda, because counts in a histogram follow the Poisson distribution ^{[4]}. After filling a histogram, we do not know the expected number of counts lambda for any particular bin, but we know the observed count k, which is not too far from lambda. We therefore might be tempted to just replace lambda with k in the formula v = lambda = k. This is in fact the so-called non-parametric estimate for the variance based on the plug-in principle. It is the best (and only) estimate for the variance, if we know nothing more about the underlying stochastic process which generated the inputs (or want to feign ignorance about it).
Now, if the number returned by the variance() method is just the same as the number returned by the value() method, why bother with adding a variance() method? There is a reason, which becomes apparent if the histograms are filled with weights, which is discussed next.
A histogram sorts input values into bins and increments a bin counter if an input value falls into the range covered by that bin. The standard storage uses integer types to store these counts; see the storage section for how integer overflow is avoided. However, sometimes histograms need to be filled with values that have a weight w attached to them. In this case, the corresponding bin counter is not increased by one, but by the weight value w.
Note: There are several use-cases for weighted increments. The main use in particle physics is to adapt simulated data of an experiment to real data. Simulations are needed to determine various corrections and efficiencies, but a simulated experiment is almost never a perfect replica of the real experiment. In addition, simulations are expensive to do. So, when deviations in a simulated distribution of a variable are found, one typically does not rerun the simulations, but assigns weights to match the simulated distribution to the real one.
When the unlimited_storage is used, histograms may also be filled with weighted value tuples. The choice of using weighted fills can be made at run-time. If the call operator()(weight(x), ...) is used, two doubles per bin are stored (previous integer counts are automatically converted). The first double keeps track of the sum of weights. The second double keeps track of the sum of weights squared, which is the variance estimate in this case. The former is accessed with the value() method of the bin counter, and the latter with the variance() method.
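As a sketch of the value()/variance() pair in action (this uses the make_weighted_histogram factory from recent library versions rather than the default storage, which is an assumption on my part; the weight() modifier is the one described above):

```cpp
#include <boost/histogram.hpp>
#include <iostream>

int main() {
  using namespace boost::histogram;

  auto h = make_weighted_histogram(axis::integer<>(0, 3));

  h(1);              // ordinary fill: sum of weights += 1, sum of weights^2 += 1
  h(weight(2.5), 1); // weighted fill: sum of weights += 2.5, sum of weights^2 += 6.25
  h(weight(0.5), 1); // weighted fill: sum of weights += 0.5, sum of weights^2 += 0.25

  // sum of weights and its variance estimate for bin 1
  std::cout << h.at(1).value() << " " << h.at(1).variance() << "\n"; // 4 7.5
}
```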
Note: Why the sum of weights squared is the variance estimate can be derived from the mathematical properties of the variance. Let us say a bin is filled k1 times with a fixed weight w1. The sum of weights is then w1 k1. It then follows from the variance properties that Var(w1 k1) = w1^2 Var(k1). Using the reasoning from before, the estimated variance of k1 is k1, so that Var(w1 k1) = w1^2 Var(k1) = w1^2 k1. Variances of independent samples are additive. If the bin is further filled k2 times with weight w2, the sum of weights is w1 k1 + w2 k2, with variance w1^2 k1 + w2^2 k2. This also holds for k1 = k2 = 1. Therefore, the sum of weights w[i] has variance equal to the sum of w[i]^2. In other words, to incrementally keep track of the variance of the sum of weights, we need to keep track of the sum of weights squared.
Python is a popular scripting language in the data science community. Thus, the library must be designed to support Python bindings, which are developed separately. The histogram is usable as an interface between a complex simulation or data-storage system written in C++ and data-analysis/plotting in Python. Users are able to define a histogram in Python, let it be filled on the C++ side, and then get it back for further data analysis or plotting.
Boost.Histogram can be configured to use arbitrary accumulators as cells, in particular the accumulators from Boost.Accumulators. Sample values can be passed to the cell accumulator, which it may use to compute the mean, median, variance or other statistics of the samples sorted into each cell.
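As a sketch of a cell accumulator that computes a mean (assuming the make_profile factory and the sample() modifier available in recent library versions):

```cpp
#include <boost/histogram.hpp>
#include <iostream>

int main() {
  using namespace boost::histogram;

  // a profile: each cell accumulates the mean of the sampled values
  auto p = make_profile(axis::regular<>(5, 0.0, 1.0));

  p(0.1, sample(2.0)); // the value 0.1 selects the cell, 2.0 is fed to its accumulator
  p(0.1, sample(4.0));

  // the cell in which 0.1 falls now holds 2 entries with mean 3
  std::cout << p.at(0).count() << " " << p.at(0).value() << "\n"; // 2 3
}
```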
The histogram class is a valid range and can be used with the Boost.Range library. This library provides a custom adaptor generator, indexed, analogous to the corresponding adaptor generator in Boost.Range, but with a potentially multi-dimensional index.
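A small sketch of the indexed adaptor generator (a minimal example, not the full range of its options):

```cpp
#include <boost/histogram.hpp>
#include <iostream>

int main() {
  using namespace boost::histogram;

  auto h = make_histogram(axis::integer<>(0, 2), axis::integer<>(0, 2));
  h(0, 1);

  // indexed(h) iterates over the inner cells and exposes the multi-dimensional index
  for (auto&& cell : indexed(h))
    std::cout << cell.index(0) << " " << cell.index(1) << ": " << *cell << "\n";
}
```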
Serialization is implemented using Boost.Serialization. A portable binary archive with support for floating point data would be great for storing and retrieving histograms efficiently, but such an archive is currently not available. The library therefore has to remain open to other serialization libraries.
Boost.Histogram has a minor overlap with Boost.Accumulators, but the scopes are rather different. The statistical accumulators density and weighted_density in Boost.Accumulators generate one-dimensional histograms. The axis range and the bin widths are determined automatically from a cached sample of initial values. They cannot be used for multi-dimensional data. Boost.Histogram focuses on multi-dimensional data and gives the user full control of how the binning should be done for each dimension.
Automatic binning is not an option for Boost.Histogram, because it does not scale well to many dimensions. Because of the Curse of Dimensionality, a prohibitive number of samples would need to be collected.
Note: There is no scientific consensus on how to do automatic binning in an optimal way, mostly because there is no consensus on the cost function (there are many articles with different solutions in the literature). The problem is not solved for one-dimensional data, and even less so for multi-dimensional data.
Recommendation:
Boost.MultiArray implements a multi-dimensional array; it also converts an index tuple into a global index that is used to access an element in the array. Boost.Histogram and Boost.MultiArray share this functionality, but Boost.Histogram cannot use Boost.MultiArray as a backend. Boost.MultiArray makes the rank of the array a compile-time property, while this library needs the rank to be dynamic.
Boost.MultiArray also does not allow changing the element type dynamically, which is needed to implement the adaptive storage mentioned further up. Using a variant type as the element type of a Boost.MultiArray would not work, because it creates this wasteful layout:

[type-index 1][value 1][type-index 2][value 2]...
A type index is stored for each cell. Moreover, the variant is always as large as the largest type in the union, so there is no way to save memory by using a smaller type when the bin count is low, as is done by the adaptive storage. The adaptive storage uses only one type-index for the whole array and allocates a homogeneous array of values of the same type that exactly matches their sizes, creating the following layout:

[type-index][value 1][value 2][value 3]...
There is only one type index, and the number of allocated bytes for the array can be adapted dynamically to the size of the value type.
^{[1] } Parameters of interest, like the center of a distribution, can be extracted from the histogram instead of the original data set; likewise, statistical models can be fitted to histograms.
^{[2] } What small enough means has to be decided case by case.
^{[3] } The choices of the person are most likely not random, but if we pick a random person from a group, we randomly sample from a pool of opinions.
^{[4] } The Poisson distribution is correct as far as the counts k themselves are of interest. If the fractions per bin p = k / N are of interest, where N is the total number of counts, then the correct distribution to describe the fractions is the multinomial distribution.