Introduction

Linear regression is something you probably did, begrudgingly, in high school. Most people remember applying it to find a “line of best fit” through a series of data points, probably using ordinary least squares.

So, naturally, let’s do just that with eval. Consider the following data, charted over time:

[Figure: the data charted over time]

The data itself doesn’t matter, only that it’s not flat. Let’s use eval to find a line of best fit and illustrate the overall trend of the data.

Ordinary Least Squares

I’ll post the entire query first, and then break it down:

@OLS{
    tag=data json value
    | eval
        var offset = 0.0;
        var count = 0.0;
        var x = 0.0;
        var y = 0.0;
        var x2 = 0.0;
        var y2 = 0.0;
        var xy = 0.0;

        if (count == 0.0) {
            offset = unix(TIMESTAMP);
        }

        xi = float(unix(TIMESTAMP)-offset);
        yi = float(value);

        count++;
        x = x + xi;
        y = y + yi;
        x2 = x2 + (xi*xi);
        y2 = y2 + (yi*yi);
        xy = xy + (xi*yi);

        n = count;
        xoffset = offset;
        Σx = x;
        Σy = y;
        Σx2 = x2;
        Σy2 = y2;
        Σxy = xy;
    | last
    | eval 
        β = ((n*Σxy) - (Σx*Σy))/((n*Σx2) - math_pow(Σx,2));
        α = ((1/n)*Σy) - (β*(1/n)*Σx);
    | table
};

tag=data json value
| enrich -r @OLS
| eval 
    var count = 0;
    x = float(unix(TIMESTAMP)-xoffset);
    fit = α + (β*x);
| chart fit value 

Compound queries

@OLS{
    tag=data json value

The first thing we notice is the use of a compound query, named OLS for ordinary least squares. What we’ll do here is use the inner query to determine the parameters (α and β) of the line of best fit, and then use the main query to plot the line alongside our original data.

9th grade math

| eval
    var offset = 0.0;
    var count = 0.0;
    var x = 0.0;
    var y = 0.0;
    var x2 = 0.0;
    var y2 = 0.0;
    var xy = 0.0;

    if (count == 0.0) {
        offset = unix(TIMESTAMP);
    }

    xi = float(unix(TIMESTAMP)-offset);
    yi = float(value);

    count++;
    x = x + xi;
    y = y + yi;
    x2 = x2 + (xi*xi);
    y2 = y2 + (yi*yi);
    xy = xy + (xi*yi);

    n = count;
    xoffset = offset;
    Σx = x;
    Σy = y;
    Σx2 = x2;
    Σy2 = y2;
    Σxy = xy;
| last
| eval 
    β = ((n*Σxy) - (Σx*Σy))/((n*Σx2) - math_pow(Σx,2));
    α = ((1/n)*Σy) - (β*(1/n)*Σx);
| table

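I won’t go into much detail, but the first eval uses persistent variables (the ones declared with var) to accumulate running sums as it walks each entry in the dataset: the count n and the sums Σx, Σy, Σx2, Σy2, and Σxy. After last keeps only the final entry, which carries the completed sums, the second eval computes the values α and β of the line of best fit. Those values satisfy the slope-intercept form of a line, y = α + β*x. Once we have these two values, we can plot a line alongside our original data.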
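For readers who want to sanity-check the arithmetic outside of a query, here is a minimal Python sketch of the same running-sum calculation. The function name `ols_running_sums` and the sample data are my own illustrations, not part of the query above; the guard against a zero denominator is also an addition that the query does not have.

```python
def ols_running_sums(points):
    """Fit y = alpha + beta*x in one pass, keeping only running sums."""
    n = sx = sy = sx2 = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x          # running sum of x
        sy += y          # running sum of y
        sx2 += x * x     # running sum of x squared
        sxy += x * y     # running sum of x*y

    denom = n * sx2 - sx ** 2
    if denom == 0:
        # Happens with no data, or when every x is identical.
        raise ValueError("need at least two distinct x values")

    beta = (n * sxy - sx * sy) / denom   # slope
    alpha = (sy - beta * sx) / n         # intercept
    return alpha, beta


# Hypothetical sample data lying exactly on y = 1 + 2x.
sample = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
alpha, beta = ols_running_sums(sample)
```

The intercept is written as (Σy - β*Σx)/n here, which is algebraically the same as the ((1/n)*Σy) - (β*(1/n)*Σx) form used in the query.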

Note: We convert the timestamp of each entry to integer seconds (UNIX time) and subtract the first entry’s timestamp (offset), which gives us a number we can use both for calculating the line and for plotting it later.
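The query does not say why it subtracts the first timestamp, so treat the following as my assumption rather than the stated reason: keeping x small is kinder to floating-point arithmetic. The Python sketch below assumes 64-bit floats and uses made-up timestamps and values; `fit_line` is a hypothetical helper that repeats the same closed-form calculation.

```python
def fit_line(xs, ys):
    """Closed-form OLS from sums, the same formula the query uses."""
    n = float(len(xs))
    sx, sy = sum(xs), sum(ys)
    sx2 = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    beta = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
    alpha = (sy - beta * sx) / n
    return alpha, beta


# Hypothetical entries: one per second, with values on
# y = 5 + 2 * (seconds since the first entry).
timestamps = [1_700_000_000 + i for i in range(100)]
values = [5.0 + 2.0 * i for i in range(100)]

# Subtract the first timestamp, as the query does with `offset`.
offset = timestamps[0]
relative = [float(t - offset) for t in timestamps]
alpha, beta = fit_line(relative, values)

# A raw UNIX timestamp squared is already past 2**53, the point beyond
# which a 64-bit float can no longer represent every integer exactly.
raw_square_exceeds_exact_range = float(timestamps[0]) ** 2 > 2.0 ** 53
```

With offset-relative x values every sum stays small and exact, so the sketch recovers the slope and intercept cleanly. With raw timestamps, the Σx2 term and the subtraction in the denominator operate on numbers too large to be held exactly.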

Plotting

tag=data json value
| enrich -r @OLS
| eval 
    var count = 0;
    x = float(unix(TIMESTAMP)-xoffset);
    fit = α + (β*x);
| chart fit value 

The main query uses the enrich module to add the values computed by the inner query (α, β, and xoffset) to every entry in our dataset. From there, we compute the same numeric timestamp as before and evaluate the line of best fit at that point in time.
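As a rough analogy for what this stage does, the Python sketch below attaches a fit value to each entry given parameters that were computed earlier. The dictionary layout, the `add_fit` name, and the numbers are all hypothetical; in the real query, enrich and eval do this work.

```python
def add_fit(entries, alpha, beta, xoffset):
    """Return copies of the entries with a 'fit' value added to each."""
    out = []
    for entry in entries:
        # Same numeric timestamp as in the inner query: seconds since xoffset.
        x = float(entry["timestamp"] - xoffset)
        out.append({**entry, "fit": alpha + beta * x})
    return out


# Hypothetical entries and parameters, chosen only for illustration.
entries = [
    {"timestamp": 1000, "value": 1.2},
    {"timestamp": 1001, "value": 2.9},
    {"timestamp": 1002, "value": 5.1},
]
fitted = add_fit(entries, alpha=1.0, beta=2.0, xoffset=1000)
```

Each returned entry keeps its original value and gains a fit, which is the pair the chart stage plots side by side.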

Finally, we just chart the fit and original value alongside each other:

[Figure: the chart with the fit line plotted alongside the original values]

Success!