Is there a statistically significant increase in phrase occurrences?

  • Context: Undergrad
  • Thread starter: kmrstats
  • Tags: Data, Statistics
SUMMARY

The discussion focuses on detecting statistically significant increases in phrase occurrences within a news data feed. The user initially generated signals by comparing each daily phrase count against a moving average, but found this of little value because the counts are bursty. The proposed alternative is to calculate the historical average (HA) and historical standard deviation (HSD) up to the current day and flag the day when its count exceeds HA plus two times HSD.

PREREQUISITES
  • Understanding of moving averages in time series analysis
  • Knowledge of statistical concepts such as mean and standard deviation
  • Familiarity with hypothesis testing
  • Experience with data analysis tools like Python or R
NEXT STEPS
  • Study the calculation of historical averages and standard deviations in time series data
  • Learn about hypothesis testing techniques for identifying significant changes
  • Explore Python libraries such as Pandas for data manipulation and analysis
  • Investigate advanced statistical methods for analyzing bursty data patterns
USEFUL FOR

Data analysts, statisticians, and anyone involved in monitoring and analyzing trends in news data feeds will benefit from this discussion.

kmrstats
Hi -

First timer here. Excuse me if this question is not up to the level I see posted on this forum, but here goes.

I have been asked to provide a daily signal generated from the number of occurrences of a set of specified phrases present in a news data feed. The first thing I did was compute a moving average of the daily count of each phrase in the feed and generate a signal whenever the current count was above the moving average by a specified percentage. With this approach I didn't think the signal provided much value because the phrase counts are very bursty. The count can be in the low teens for a number of days in a row, then jump to 100 for a couple of days, and then settle back into the low teens.
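For reference, the moving-average rule described above might look roughly like the sketch below. The function name, the 7-day trailing window, and the 50% threshold are all illustrative assumptions; the post does not state which window or percentage was used.

```python
def moving_average_signals(counts, window=7, pct=0.5):
    """Flag days whose count exceeds the trailing moving average by `pct`.

    `window` and `pct` are hypothetical defaults, not values from the post.
    The average is taken over up to `window` days *before* the current day.
    """
    signals = []
    for t, n in enumerate(counts):
        prior = counts[max(0, t - window):t]  # trailing window, excludes today
        if not prior:
            signals.append(False)  # no history yet on the first day
            continue
        moving_avg = sum(prior) / len(prior)
        signals.append(n > moving_avg * (1 + pct))
    return signals


# Made-up counts mimicking the pattern in the post: low teens, then a burst.
example_counts = [12, 14, 11, 13, 12, 15, 13, 100, 95, 12]
example_signals = moving_average_signals(example_counts)
```

Because the burst days quickly enter the trailing window, the average itself jumps after a burst, which is one way to see why this rule behaves erratically on bursty counts.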

What type of statistics should I use to determine a statistically significant event given my scenario described above?

Thanks in advance
 
One way is to:
1. Calculate the historical average up to day t: [itex]\mathrm{HA}(t) = \frac{1}{t}\sum_{s=1}^{t} n_s[/itex], where [itex]n_s[/itex] is the number of occurrences on day s.
2. Calculate the historical standard deviation HSD(t) similarly.
3. Test whether [itex]n_t > \mathrm{HA}(t) + 2\,\mathrm{HSD}(t)[/itex].
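The three steps above can be sketched in Python as follows. This is only a minimal illustration under some assumptions: the function name and the `k` and `min_history` parameters are hypothetical, the history includes the current day (as the sum from s = 1 to t suggests), and "similarly" is read as the population standard deviation over the same days.

```python
from statistics import mean, pstdev


def burst_signals(counts, k=2.0, min_history=2):
    """Flag day t when n_t > HA(t) + k * HSD(t).

    HA(t) and HSD(t) are computed over days 1..t inclusive, following the
    formula in the reply. `k=2.0` matches the reply's "2 HSD"; `min_history`
    is an assumed guard so that very short histories never produce a signal.
    """
    signals = []
    for t in range(1, len(counts) + 1):
        history = counts[:t]      # days 1..t, including today
        n_t = history[-1]
        if t < min_history:
            signals.append(False)
            continue
        ha = mean(history)        # historical average HA(t)
        hsd = pstdev(history)     # population standard deviation (assumption)
        signals.append(n_t > ha + k * hsd)
    return signals


# Made-up counts: low teens for a week, a two-day burst, then back down.
example_counts = [12, 14, 11, 13, 12, 15, 13, 100, 95, 12]
example_signals = burst_signals(example_counts)
```

On these made-up counts only the first burst day is flagged: once the count of 100 is part of the history, it raises both HA and HSD enough that the following count of 95 falls below the threshold. Whether that behaviour is desirable depends on whether the signal should mark the start of a burst or every day within it.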
 