Filling in Missing Values in a string of data

  • Context: Undergrad 
  • Thread starter Thread starter Brian82784
  • Start date Start date
  • Tags Tags
    Analysis Data String
Click For Summary

Discussion Overview

The discussion revolves around methods for estimating missing values in a dataset related to an automatic transformer's operations counter. Participants explore various techniques for filling in gaps in the data, including linear interpolation and polynomial fitting, while also addressing the implications of their choices on data accuracy and reporting.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant describes using linear interpolation to estimate missing values by averaging surrounding data points.
  • Another participant suggests that fitting a curve, such as a second-degree polynomial, might provide a better estimate due to the increasing trend in the data.
  • Some participants express uncertainty about the best method for estimating missing values, questioning whether linear interpolation is sufficient or if more complex methods are warranted.
  • There is a discussion about how to apportion total changes between months, with suggestions to either split them equally or based on the number of days in each month.
  • A participant mentions quadratic regression as a method for fitting a curve to the data and provides a resource for performing this in Excel.
  • Concerns are raised about ensuring transparency in the estimation process, emphasizing the need for clarity on which figures are exact and which are estimates.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the best method for estimating missing values, with multiple competing views on the appropriateness of linear interpolation versus more complex polynomial fitting.

Contextual Notes

Participants note the importance of considering the method of estimation and its impact on data interpretation, but there are no resolved mathematical steps or definitive conclusions regarding the best approach.

Brian82784
Messages
19
Reaction score
0
TL;DR
Finding a missing value for trending and analysis
hello,

I’m trying to figure out if I’m doing this correctly or if there’s a different way that I should be finding a missing value. I’m trending data for an automatic transformer. Every month I collect the operations counter value and at the end of the year sum the number of tap changes. Sometimes I can’t get to the transformer so there are months where I don’t Collect the count.

For example:

Operations count
Jan = 654104
Feb = Missing
Mar = 659176

To find approximate missing value take 654104 + 659176 = 1313280 and dividing 1313280/2 = 656640

Now it would look like
Jan = 654104
Feb = 656640
Mar = 659176

I will get into a bit more down below.

Operations Counter

Jan = 654104
Feb = missing (populate with calculated 656640)
Mar = 659176
April= 661476
May= 664376
June= missing
July = missing
August = 671976
September= 675076

I then find the total changes month to month

Total tap Changes
Jan - Feb = 2536
Feb- March = 2536
March-April= 2300
April-May= 2900

After I have a years worth of data I sum the total changes. To see how much it operated in a year. So above from Jan - May would be a total of 10,272 operations (I know it’s not. Full year just showing an example).
 
Physics news on Phys.org
It's not clear what your question is. Are you asking if there is a better way to estimate the missing values versus simple interpolation?
 
berkeman said:
It's not clear what your question is. Are you asking if there is a better way to estimate the missing values versus simple interpolation?
Just wondering if this is the best/right way or if I’m completely wrong.
 
What you did to estimate the missing Feb data is called linear interpolation, which amounts to getting the average of the two surrounding months. I'm not sure what you did to find the missing data for June and July.

Given that the counts are increasing month by month, another option would be to fit the data points to a curve rather than a straight line. In this case a second-degree polynomial (a parabola) might be a good choice. That is, the curve whose equation is ##y = ax^2 + bx + c## that minimizes the vertical distance between points on the curve and your data.
 
  • Like
Likes   Reactions: Vanadium 50 and FactChecker
I would be inclined to keep it simple (linear interpolation) unless there was a significant reason to get more complicated (higher-degree polynomial). The linear interpolation would also be safer if there is missing data at either end. That would require extrapolation of the known data, which can give more extreme estimates from a higher-degree approximation.
 
Mark44 said:
What you did to estimate the missing Feb data is called linear interpolation, which amounts to getting the average of the two surrounding months. I'm not sure what you did to find the missing data for June and July.

Given that the counts are increasing month by month, another option would be to fit the data points to a curve rather than a straight line. In this case a second-degree polynomial (a parabola) might be a good choice. That is, the curve whose equation is ##y = ax^2 + bx + c## that minimizes the vertical distance between points on the curve and your data.

I’m not sure how to calculate that could you show me an example?
 
What I described is called quadratic regression. Here's an article that shows how to do this in Excel - https://www.statology.org/quadratic-regression-excel/. You need to have Analysis Toolpack installed to do this (it's a free download). If you have some other spreadsheet, there likely is an equivalent set of tools available.
 
You need to decide how to attribute the total change over February and March to those two months. The simplest way is to apportion it equally, which is what you did. Alternatively, you can apportion them according to the number of (work)days in each month.

A more accurate approach is to apportion operations to months according to the actual dates of the meter readings, if you have that information.

Regardless of the method, you need to ensure that people who use this data can find out (1) which figures were exact and which had to be estimated, and (2) how the estimation was done.
 

Similar threads

  • · Replies 27 ·
Replies
27
Views
3K
  • · Replies 18 ·
Replies
18
Views
3K
  • · Replies 9 ·
Replies
9
Views
3K
  • · Replies 4 ·
Replies
4
Views
8K
Replies
1
Views
2K
  • · Replies 3 ·
Replies
3
Views
5K
Replies
2
Views
2K
  • · Replies 10 ·
Replies
10
Views
2K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 1 ·
Replies
1
Views
2K