TDigest

class pytdigest.TDigest(compression: int = 100)

TDigest estimates an approximate empirical distribution in a single pass through the data. Multiple TDigests calculated on chunks of data can be combined to obtain an approximate distribution of the overall dataset. The estimated distribution can then be used to compute an approximate empirical CDF (cumulative distribution function) or inverse CDF (quantiles).

The precision of the estimation is controlled by the compression parameter.

cdf(at: Number | List | ndarray) float | ndarray

Calculate approximate values of the empirical cumulative distribution function (CDF) at the given points.

Parameters:

at – Values at which the CDF should be calculated.

Returns:

Values of the CDF calculated at the given points.

Raises:

TypeError – If at is of invalid type.

static combine(first: TDigest | Iterable[TDigest], second: TDigest | None = None) TDigest

Combine multiple TDigests together.

static compute(x: Number | ndarray | Series, w: Number | ndarray | Series | None = None, handling_invalid: HandlingInvalid = HandlingInvalid.Drop, compression: int = 100) TDigest

Estimate a TDigest directly from data.

Parameters:
  • x – Values to calculate the distribution of.

  • w – Optional weights for the values. If it is an np.ndarray or pd.Series, it must have the same (one-dimensional) shape as x.

  • handling_invalid – How to handle invalid values in the calculation [‘drop’, ‘raise’]; the default is ‘drop’. Provided either as an enum or its string representation.

  • compression – A higher compression value leads to more precise results and larger memory requirements. Compression gives the maximum number of centroids in the internal representation after the TDigest has been merged. The unmerged representation can have up to roughly six times as many centroids.

Returns:

New TDigest object estimated based on (possibly weighted) data.

Raises:
  • TypeError – If x or w is not of a permitted type, or if x and w are of different types.

  • ValueError – If handling_invalid is ‘raise’ and there are invalid values in data (nan, infinity, negative weight).

force_merge()

Force merging of centroids of the underlying representation.

get_centroid(i: int)

Get the centroid (i.e. a tuple of mean and weight) at a given index of the underlying representation.

get_centroids()

Get all centroids as a two-dimensional array. Based on the centroids, the TDigest can be fully reconstructed (for a given compression). Conversion to centroids and back can be used for serialization.

inverse_cdf(quantile: Number | List | ndarray) float | ndarray

Calculate quantiles at the given points (the quantile function is the inverse of the cumulative distribution function).

Parameters:

quantile – The values at which the inverse CDF should be calculated. Multiple values are passed to the C library for fast estimation of a larger number of quantiles.

Returns:

Estimated approximate quantiles.

Raises:

TypeError – If quantile is of invalid type.

property mean

Exact mean of the data.

static of_centroids(centroids: ndarray, compression: float = 100)

Reconstruct a TDigest from the centroids and a given compression.

scale_weight(factor)

Scale weight by a certain factor. Can be used for instance to estimate exponentially decaying approximate quantiles.

property std_dev

Standard deviation (biased, i.e. without ddof correction).

update(x: Number | ndarray | Series, w: Number | ndarray | Series | None = None, handling_invalid: HandlingInvalid = HandlingInvalid.Drop)

Add new data to TDigest.

Parameters:
  • x – Values to calculate the distribution of.

  • w – Optional weights for the values. If it is an np.ndarray or pd.Series, it must have the same (one-dimensional) shape as x.

  • handling_invalid – How to handle invalid values in the calculation [‘drop’, ‘raise’]; the default is ‘drop’. Provided either as an enum or its string representation.

Raises:
  • TypeError – If x or w is not of a permitted type, or if x and w are of different types.

  • ValueError – If handling_invalid is ‘raise’ and there are invalid values in data (nan, infinity, negative weight).

property var

Alias for variance.

property variance

Exact variance of the data (biased, i.e. normalized by total weight without ddof correction).

property weight

Total weight of the data.