OMeanVar
- class omoment.OMeanVar(mean: Number = nan, var: Number = nan, weight: Number = 0, handling_invalid: HandlingInvalid = HandlingInvalid.Drop)
Bases: OMean
Online estimator of weighted mean and variance.
Represents the mean, variance and weight of a part of the data. Two OMeanVar objects can be added together to produce correct estimates for the overall dataset. Mean, variance and weight are stored using __slots__, which keeps the objects lightweight enough to be used in large quantities, even in a pandas DataFrame (they are still Python objects, however, not numpy types).
Most methods are fairly permissive and accept numbers, numpy arrays or pandas DataFrames. By default, invalid values (NaNs, infinities and negative weights) are omitted from the data. Variance in OMeanVar is based on ddof = 0, in agreement with the default of the numpy std method.
Addition of \(\mathrm{OMeanVar}(m_1, v_1, w_1)\) and \(\mathrm{OMeanVar}(m_2, v_2, w_2)\) is calculated as:
\begin{gather*}
\delta_m = m_2 - m_1\\
\delta_v = v_2 - v_1\\
w_N = w_1 + w_2\\
r = \frac{w_2}{w_N}\\
m_N = m_1 + \delta_m \frac{w_2}{w_N}\\
v_N = v_1 + \delta_v r + \delta_m^2 r (1 - r)
\end{gather*}
where the subscript N denotes the new values produced by the addition.
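The addition formulas can be checked numerically with a minimal sketch that uses only the Python standard library. The function name `combine` and the sample data are hypothetical and are not part of omoment; the sketch assumes unweighted data (every point has weight 1) and does not handle the case where both weights are zero.

```python
from statistics import fmean, pvariance


def combine(m1, v1, w1, m2, v2, w2):
    """Merge two (mean, variance, weight) triples using the addition formulas.

    Hypothetical helper for illustration only; assumes w1 + w2 > 0.
    """
    delta_m = m2 - m1
    delta_v = v2 - v1
    w_n = w1 + w2
    r = w2 / w_n
    m_n = m1 + delta_m * r
    v_n = v1 + delta_v * r + delta_m ** 2 * r * (1 - r)
    return m_n, v_n, w_n


# Two parts of one unweighted dataset, so the weight of a part is its length.
part_a = [1.0, 2.0, 3.0, 4.0]
part_b = [10.0, 20.0]

# Moments of each part, merged with the formulas (variance uses ddof = 0).
merged = combine(
    fmean(part_a), pvariance(part_a), len(part_a),
    fmean(part_b), pvariance(part_b), len(part_b),
)

# Moments computed directly on the full dataset, for comparison.
full = part_a + part_b
direct = (fmean(full), pvariance(full), len(full))
```

Here `merged` and `direct` should agree up to floating-point rounding, which is the property that makes the per-part estimates safe to add together.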
- classmethod compute(x: Number | ndarray | Series, w: Number | ndarray | Series | None = None, handling_invalid: HandlingInvalid = HandlingInvalid.Drop) OMeanVar
Shortcut for initialization of an empty object and its update based on data.
- mean
- static of_groupby(data: pd.DataFrame, g: str | List[str], x: str, w: str | None = None, handling_invalid: HandlingInvalid = HandlingInvalid.Drop) pd.Series[OMean]
Optimized version for calculating means of a large number of groups in the data.
Avoids the slower groupby -> apply workflow and uses only optimized aggregation functions. The function is about 5x faster on a test dataset with 10,000,000 rows and 100,000 groups.
- Parameters:
data – input DataFrame
g – name of the column containing group keys; can also be a list of multiple column names
x – name of the column with values to calculate the mean of
w – name of column with weights (optional)
handling_invalid – How to handle invalid values in calculation [‘drop’, ‘keep’, ‘raise’], default value ‘drop’. Provided either as enum or its string representation.
- Returns:
pandas Series indexed by group values g and containing estimated OMeanVar objects
- property std_dev: float
Estimate of standard deviation, calculated as \(\sqrt{\mathrm{Var}}\). Based on ddof = 0, the same default as in numpy std method.
- property unbiased_std_dev: float
Estimate of unbiased standard deviation based on ddof = 1 (suitable for unweighted data).
- property unbiased_var: float
Estimate of unbiased variance based on ddof = 1 (suitable for unweighted data).
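The relationship between the ddof = 0 and ddof = 1 estimates can be illustrated with the standard library alone. This is a sketch for unweighted data, where the total weight equals the number of observations n; the rescaling factor n / (n - 1) is the textbook correction, and whether omoment computes the properties in exactly this way is an assumption.

```python
from math import sqrt
from statistics import pvariance, variance

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)  # for unweighted data the total weight equals n

# Population variance (ddof = 0), the convention used for the stored variance.
var_ddof0 = pvariance(data)

# Rescale to the unbiased estimate (ddof = 1); requires n > 1.
unbiased_var = var_ddof0 * n / (n - 1)
unbiased_std_dev = sqrt(unbiased_var)

# statistics.variance uses ddof = 1 and serves as an independent check.
reference = variance(data)
```

With non-uniform weights the factor n / (n - 1) no longer has a single agreed-upon form, which is why these properties are described as suitable for unweighted data.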
- update(x: Number | ndarray | Series, w: Number | ndarray | Series | None = None, handling_invalid: HandlingInvalid = HandlingInvalid.Drop) OMeanVar
Update the moments by adding new values.
Accepts either single values or batches of data in numpy arrays. In the latter case, moments are first estimated on the new data and then combined with the moments of the old data. Invalid values and negative weights are omitted by default. The calculated variance is based on ddof = 0; the properties unbiased_var and unbiased_std_dev provide estimates based on ddof = 1.
- Parameters:
x – Values to add to the current estimate.
w – Weights for the values. If provided, has to have the same length as x.
handling_invalid – How to handle invalid values in calculation [‘drop’, ‘keep’, ‘raise’], default value ‘drop’. Provided either as enum or its string representation.
- Returns:
The same OMeanVar object updated for the new data.
- Raises:
ValueError – if handling_invalid is set to ‘raise’ and there are invalid values (NaNs, infinities or negative weights) in data.
TypeError – if values x or w have more than one dimension or if they are of different size.
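The default ‘drop’ behaviour described for update can be sketched in plain Python as follows. The helpers `drop_invalid` and `weighted_moments` are hypothetical names used only for illustration, not omoment functions, and the real implementation may differ (for example in how it treats zero weights).

```python
import math


def drop_invalid(xs, ws):
    """Keep only (x, w) pairs with finite x, finite w and non-negative w."""
    return [
        (x, w)
        for x, w in zip(xs, ws)
        if math.isfinite(x) and math.isfinite(w) and w >= 0
    ]


def weighted_moments(xs, ws):
    """Weighted mean and variance (ddof = 0) of the valid pairs."""
    pairs = drop_invalid(xs, ws)
    total_w = sum(w for _, w in pairs)
    if total_w == 0:
        # Mirrors the documented defaults of an empty object: nan, nan, 0.
        return math.nan, math.nan, 0
    mean = sum(w * x for x, w in pairs) / total_w
    var = sum(w * (x - mean) ** 2 for x, w in pairs) / total_w
    return mean, var, total_w


# The NaN value, the infinite value and the negative weight are all dropped,
# leaving the pairs (1.0, 1.0), (2.0, 1.0) and (5.0, 2.0).
xs = [1.0, 2.0, float("nan"), 4.0, 5.0, float("inf")]
ws = [1.0, 1.0, 1.0, -1.0, 2.0, 1.0]
mean, var, weight = weighted_moments(xs, ws)
```

A ‘raise’ mode would instead check whether `drop_invalid` removed anything and raise ValueError in that case.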
- var
- weight