How Can We Estimate Sampling Probability from Sales Data?

SUMMARY

The discussion focuses on estimating the sampling probability p from sales data consisting of buyer and receiver transactions. The working assumption is binomial sampling: each transaction is included independently with probability p, so more active buyers are more likely to appear in the sample. Two approaches are explored: the first solves a linear system relating the number of buyers in the sample to the original network, and the second uses subsampling from the sample to estimate p. Both run into trouble: the first yields nonsensical results, while the second suggests a sampling rate of approximately 0.02%.

PREREQUISITES
  • Understanding of binomial sampling techniques
  • Familiarity with R programming for statistical analysis
  • Knowledge of linear algebra for solving systems of equations
  • Experience with data analysis and interpretation of sales data
NEXT STEPS
  • Research advanced binomial sampling methods and their applications
  • Learn about R functions for statistical modeling and data manipulation
  • Explore techniques for estimating sampling probabilities in large datasets
  • Investigate the implications of power law distributions in sales data
USEFUL FOR

Data scientists, statisticians, and analysts working with sales data who need to estimate sampling probabilities and understand the implications of buyer activity on sampling methods.

roark13
In my research I am working on some data (let's call it sales data), but we have almost no information on the original sample. My goal is to come up with possible explanations of how it could have been sampled, using consistency arguments.

In more detail, the data provided is composed of sales transactions and has 2 columns (buyer and receiver). One of the most reasonable ways to sample would be (we believe) to assign each transaction an inclusion probability p and then keep a random subset of transactions accordingly (sampling without replacement).

This corresponds to binomial sampling. Our information on the original network is minimal, except that I have an estimate of the total number of buyers overall (circa 2 million). In addition, we assume that buyers are customers belonging to the original data provider's network, while sellers may belong to other networks; thus our attention is limited to buyers.


The consistency argument is as follows: in the original network, if a buyer is highly active (makes more transactions), then the probability that that buyer appears in the sample is higher than for a less active customer.
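The inclusion probability above can be made explicit. Under binomial sampling of transactions at rate p, a buyer with k transactions in the original network appears in the sample iff at least one of their transactions is kept, which happens with probability 1 - (1-p)^k. A minimal sketch (the values of p and k are illustrative):

```r
# A buyer with k transactions appears in the sample iff at least one of
# their transactions is kept; under binomial sampling at rate p this
# happens with probability 1 - (1 - p)^k, which increases with activity.
p <- 0.1
k <- c(1, 5, 20, 100)
inclusion <- 1 - (1 - p)^k
round(inclusion, 3)
```

Even at a modest p, highly active buyers are almost certain to appear, which is exactly the bias the consistency argument exploits.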


Then, by counting how many transactions each buyer makes and grouping buyers by their transaction count, we get the number of buyers who make $i$ transactions. Based on our estimate of the total number of buyers in the network (2M), my goal is to estimate the sampling rate p.

R code:


set.seed(0)

# Generate random sales transactions
ss1 <- sample.int(100,   size = 100, replace = TRUE)
ss2 <- sample.int(10000, size = 100, replace = TRUE)

buyer    <- sort(ss1, decreasing = TRUE)
receiver <- sort(ss2, decreasing = TRUE)

data <- data.frame(buyer = buyer, receiver = receiver)

# Compute the number of sales each buyer makes
count <- data.frame(table(buyer))
buyer.sales <- count[order(count$Freq, decreasing = TRUE), ]

# Number of buyers who make 1 transaction, 2 transactions, etc.
df <- data.frame(table(buyer.sales$Freq))

# We have a guess on the total number of buyers: 2M.
# What would be the sampling rate p?

Here are the approaches I’ve tried so far:

**Approach 1:**

Let $Y_i$ be the number of customers who make $i$ transactions in the sample (the data), and similarly $X_i$ the number of customers who make $i$ transactions in the original network. We know the Y's and try to get the X's.

Then, if the transactions are sampled at a constant rate p,

$Y_i = \sum_{k=i}^{k_{\max}} X_k \binom{k}{i} p^i (1-p)^{k-i}$

If we define the matrix A as follows:

A <- function(i, j, prob) ifelse(i > j, 0, choose(j, i) * (1 - prob)^(j - i) * prob^i)

The problem is then to solve A x = y, which has an easy solution since A is upper triangular. But on my actual data, the x values I get make no sense: they swing between plus and minus infinity, whereas I would expect the x vector to show some kind of correlation with the y's.

I solved the linear system by minimizing the L1 and L2 norms, and also with positivity constraints. Still no correlation, and the sum of the $x_i$'s differs wildly from 2M.
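For what it's worth, the mechanics of Approach 1 do work on noiseless synthetic counts. A sketch (the variable names and the synthetic x_true are mine; p is assumed known here, unlike in the real problem):

```r
# Sketch of Approach 1 on synthetic counts. Build the upper-triangular
# matrix A with A[i, j] = C(j, i) p^i (1-p)^(j-i), form the exact expected
# sample counts y = A x, and check that back-substitution recovers x.
p    <- 0.3
kmax <- 8
x_true <- rev(seq_len(kmax)) * 100       # hypothetical original counts x_1..x_kmax

A <- outer(seq_len(kmax), seq_len(kmax),
           function(i, j) ifelse(i > j, 0,
                                 choose(j, i) * p^i * (1 - p)^(j - i)))

y     <- as.vector(A %*% x_true)         # exact expected sample counts
x_hat <- backsolve(A, y)                 # A is upper triangular

max(abs(x_hat - x_true))                 # ~0 on noiseless data
```

Note that the diagonal entries of A are p^i, which decay geometrically, so for small p the system is severely ill-conditioned: back-substitution amplifies any sampling noise in y enormously, which is consistent with the wild plus-minus swings observed on the real data.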

**Approach 2:**

This approach is based on subsampling from the sample (the data). I thought I could subsample from the sample with a given p and then count the number of buyers in that subsample. The most reasonable choice of p would be the value for which the ratio (number of buyers in original / number of buyers in sample) equals the ratio (number of buyers in sample / number of buyers in subsample).
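The subsampling idea can be sketched as a grid search. The helper names are mine, and an assumed total of 5,000 original buyers stands in for the 2M estimate so the toy example stays small:

```r
# Sketch of Approach 2 on synthetic data: for a grid of candidate rates q,
# subsample the transactions with probability q and pick the q whose
# buyer ratio (sample / subsample) is closest to the target ratio
# (assumed original total / sample).
set.seed(1)
data <- data.frame(buyer = sample.int(2000, 5000, replace = TRUE))

n_sample <- length(unique(data$buyer))   # buyers observed in the sample
target   <- 5000 / n_sample              # assumed original total / sample

ratio_at <- function(q) {
  keep <- runif(nrow(data)) < q          # binomial subsample of transactions
  n_sample / length(unique(data$buyer[keep]))
}

qs     <- seq(0.02, 0.98, by = 0.02)
ratios <- vapply(qs, ratio_at, numeric(1))
q_hat  <- qs[which.min(abs(ratios - target))]
q_hat
```

One caveat with this self-similarity argument: it implicitly assumes the buyer-count ratio scales the same way from network to sample as from sample to subsample, which only holds if the activity distribution is scale-free across those levels.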


This approach gave me p ≈ 0.02%, but I don't think this is correct. I think Approach 1 makes more sense, but the results do not support it.


I should note as well that the data on hand consists of monthly transaction data over a period of 12 months, so the amount of data I take as the sample matters too. I tried Poisson rate estimation, but the amount of data I include strongly affects the estimate of the total number of buyers, and I am really stuck.

I would expect the original data to show some kind of power-law behavior, but that's purely a guess without any justification. I am really stuck on which approach makes more sense. Any comments or ideas appreciated.
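The power-law guess can at least be eyeballed from the data already computed (the `df` table of buyers per transaction count). An informal sketch on synthetic Pareto-like counts (variable names are mine; this is not a rigorous fit):

```r
# Informal check of the power-law guess: regress log(number of buyers
# making i transactions) on log(i). An approximately straight line with
# negative slope is consistent with power-law behavior. For real data a
# maximum-likelihood fit (e.g. the poweRlaw package) is preferable to
# log-log least squares, which is known to be biased.
set.seed(2)
deg <- floor(runif(10000)^(-1 / 1.5))    # Pareto(alpha = 1.5) counts, floored
tab <- table(deg)
i   <- as.numeric(names(tab))

fit <- lm(log(as.numeric(tab)) ~ log(i))
unname(coef(fit)[2])                     # slope: clearly negative for power-law data
```

If the real data does look power-law, that would also explain why Approach 2's self-similarity ratio argument is even plausible, since power laws are the scale-free case.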
 


roark13 said:
I am really stuck on which approach makes more sense.

Unless you define what you are trying to accomplish by the sampling, I don't see how the sensibility or non-sensibility of a given sampling method can be evaluated. (For example, why not simply sample the data giving each transaction an equal chance of being picked, sampling with replacement?)

You didn't mention whether the transactions in the data have a time stamp.
 
