Filling in Missing Values in a string of data

  • Context: Undergrad 
  • Thread starter Thread starter Brian82784
  • Start date Start date
  • Tags Tags
    Analysis Data String
Click For Summary
SUMMARY

This discussion focuses on methods for filling in missing values in time-series data, specifically for operations counters related to transformers. The primary technique discussed is linear interpolation, where the missing value is calculated as the average of surrounding known values. Additionally, quadratic regression is suggested as an alternative for fitting data points to a curve, particularly when data trends are non-linear. The importance of transparency in data estimation methods is emphasized, ensuring users can identify which values are estimated and the methods used for estimation.

PREREQUISITES
  • Understanding of time-series data analysis
  • Familiarity with linear interpolation techniques
  • Knowledge of quadratic regression and its application
  • Basic proficiency in Excel, particularly with the Analysis Toolpak
NEXT STEPS
  • Research "Linear Interpolation in Excel" for practical applications
  • Learn "Quadratic Regression Techniques" for advanced data fitting
  • Explore "Time-Series Analysis Methods" to understand various estimation techniques
  • Investigate "Data Transparency Best Practices" for reporting estimated values
USEFUL FOR

Data analysts, engineers working with time-series data, and anyone involved in operations monitoring and reporting will benefit from this discussion.

Brian82784
Messages
19
Reaction score
0
TL;DR
Finding a missing value for trending and analysis
hello,

I’m trying to figure out if I’m doing this correctly or if there’s a different way that I should be finding a missing value. I’m trending data for an automatic transformer. Every month I collect the operations counter value and at the end of the year sum the number of tap changes. Sometimes I can’t get to the transformer so there are months where I don’t Collect the count.

For example:

Operations count
Jan = 654104
Feb = Missing
Mar = 659176

To find approximate missing value take 654104 + 659176 = 1313280 and dividing 1313280/2 = 656640

Now it would look like
Jan = 654104
Feb = 656640
Mar = 659176

I will get into a bit more down below.

Operations Counter

Jan = 654104
Feb = missing (populate with calculated 656640)
Mar = 659176
April= 661476
May= 664376
June= missing
July = missing
August = 671976
September= 675076

I then find the total changes month to month

Total tap Changes
Jan - Feb = 2536
Feb- March = 2536
March-April= 2300
April-May= 2900

After I have a years worth of data I sum the total changes. To see how much it operated in a year. So above from Jan - May would be a total of 10,272 operations (I know it’s not. Full year just showing an example).
 
Physics news on Phys.org
It's not clear what your question is. Are you asking if there is a better way to estimate the missing values versus simple interpolation?
 
berkeman said:
It's not clear what your question is. Are you asking if there is a better way to estimate the missing values versus simple interpolation?
Just wondering if this is the best/right way or if I’m completely wrong.
 
What you did to estimate the missing Feb data is called linear interpolation, which amounts to getting the average of the two surrounding months. I'm not sure what you did to find the missing data for June and July.

Given that the counts are increasing month by month, another option would be to fit the data points to a curve rather than a straight line. In this case a second-degree polynomial (a parabola) might be a good choice. That is, the curve whose equation is ##y = ax^2 + bx + c## that minimizes the vertical distance between points on the curve and your data.
 
  • Like
Likes   Reactions: Vanadium 50 and FactChecker
I would be inclined to keep it simple (linear interpolation) unless there was a significant reason to get more complicated (higher-degree polynomial). The linear interpolation would also be safer if there is missing data at either end. That would require extrapolation of the known data, which can give more extreme estimates from a higher-degree approximation.
 
Mark44 said:
What you did to estimate the missing Feb data is called linear interpolation, which amounts to getting the average of the two surrounding months. I'm not sure what you did to find the missing data for June and July.

Given that the counts are increasing month by month, another option would be to fit the data points to a curve rather than a straight line. In this case a second-degree polynomial (a parabola) might be a good choice. That is, the curve whose equation is ##y = ax^2 + bx + c## that minimizes the vertical distance between points on the curve and your data.

I’m not sure how to calculate that could you show me an example?
 
What I described is called quadratic regression. Here's an article that shows how to do this in Excel - https://www.statology.org/quadratic-regression-excel/. You need to have Analysis Toolpack installed to do this (it's a free download). If you have some other spreadsheet, there likely is an equivalent set of tools available.
 
You need to decide how to attribute the total change over February and March to those two months. The simplest way is to apportion it equally, which is what you did. Alternatively, you can apportion them according to the number of (work)days in each month.

A more accurate approach is to apportion operations to months according to the actual dates of the meter readings, if you have that information.

Regardless of the method, you need to ensure that people who use this data can find out (1) which figures were exact and which had to be estimated, and (2) how the estimation was done.
 

Similar threads

  • · Replies 27 ·
Replies
27
Views
3K
  • · Replies 18 ·
Replies
18
Views
3K
  • · Replies 9 ·
Replies
9
Views
3K
  • · Replies 4 ·
Replies
4
Views
8K
Replies
1
Views
2K
  • · Replies 3 ·
Replies
3
Views
5K
Replies
2
Views
2K
  • · Replies 10 ·
Replies
10
Views
2K
  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 1 ·
Replies
1
Views
2K