Inference on original population based on sample - estimating sampling probability

In summary: The conversation discusses the researcher's goal of finding plausible explanations for how a set of sales data with two columns (buyer and receiver) was sampled. The researcher proposes that each transaction was included independently with some probability p, so the data are a random subset of the transactions drawn without replacement. Almost nothing is known about the original population except an estimate of the total number of buyers (about 2 million), and attention is restricted to buyers, who belong to the network of the original data provider. The consistency argument rests on the idea that more active buyers are more likely to appear in the sample. A reply adds that if the transactions carry time stamps, and the period covered by the data is known not to be representative of the year, then that is important information.
  • #1
roark13
In my research I am working with some data (let's call it sales data), but we have almost no information on how the original sample was drawn. My goal is to come up with possible explanations of how it could have been sampled, using consistency arguments.

In more detail, the data consists of sales transactions and has 2 columns (buyer and receiver). One of the most reasonable ways to sample (we believe) would be to assign each transaction a probability (p) of being included, and then keep the subset of transactions selected by the random realizations (sampling without replacement).

This corresponds to binomial sampling. Our information on the original population is limited (almost none), except that I have an estimate of the total number of buyers there may be overall (roughly 2 million). In addition, we assume that buyers are customers who belong to the network of the original data provider, whereas sellers may belong to other networks, so our attention is limited to buyers.
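The per-transaction inclusion scheme described above can be sketched in R as follows. This is only an illustration under assumed values: the "original" table, its size, the id ranges, and the rate `p = 0.1` are all made up, since the real population is unknown.

```r
set.seed(1)

# Made-up "original" transaction table; sizes and ids are illustrative only.
n_trans <- 1000
original <- data.frame(
  buyer    = sample.int(200, n_trans, replace = TRUE),
  receiver = sample.int(5000, n_trans, replace = TRUE)
)

p <- 0.1                          # hypothetical sampling rate
keep <- runif(n_trans) < p        # one independent draw per transaction
sampled <- original[keep, ]

# The number of kept transactions is Binomial(n_trans, p);
# the set of buyers seen in the sample can only shrink.
nrow(sampled)
length(unique(sampled$buyer))
length(unique(original$buyer))
```

Because every transaction is kept or dropped on its own draw, no transaction can appear twice, which is the "without replacement" property mentioned above.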


The consistency argument is based on the following: if a buyer is highly active in the original network (makes more transactions), then the probability that this buyer is included in the sample is higher than for a less active buyer.
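Under the assumption that each transaction is kept independently with the same probability p, a buyer with k transactions shows up in the sample with probability 1 - (1 - p)^k, which increases with k. A minimal sketch, where the rate and the activity levels are hypothetical values chosen only for illustration:

```r
# Probability that a buyer with k transactions appears at least once
# when each transaction is kept independently with probability p.
inclusion_prob <- function(k, p) 1 - (1 - p)^k

p <- 0.05                  # hypothetical sampling rate
k <- c(1, 5, 20, 100)      # hypothetical activity levels
round(inclusion_prob(k, p), 3)
```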


Then, by counting how many transactions each buyer makes and grouping buyers by their number of transactions, we get the number of buyers who make $i$ transactions. Based on our estimate of the total number of buyers in the network (2M), my goal is to estimate the sampling rate (p).

R code:


set.seed(0)

# Generate random sales transactions (toy data)
ss1 <- sample.int(100, size = 100, replace = TRUE)
ss2 <- sample.int(10000, size = 100, replace = TRUE)

buyer <- sort(ss1, decreasing = TRUE)
receiver <- sort(ss2, decreasing = TRUE)

data <- data.frame(buyer = buyer, receiver = receiver)

# Compute the number of sales each buyer makes
count <- data.frame(table(buyer))

buyer.sales <- count[order(count$Freq, decreasing = TRUE), ]

# Number of buyers who make 1 transaction, 2 transactions, etc.
df <- data.frame(table(buyer.sales$Freq))

# We have a guess for the total number of buyers: 2M. What would the sampling rate (p) be?

Here are the approaches I’ve tried so far:

**Approach 1:**

Let Y_i be the number of customers who make i transactions in the sample (data), and similarly X_i the number of customers who make i transactions in the original network. We know the Y's and try to recover the X's.

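If the transactions are sampled at a constant rate p, then in expectation

$E[Y_i] = \sum_{k=i}^{k_{\max}} X_k \binom{k}{i} p^{i} (1-p)^{k-i}$,

where $k_{\max}$ is the largest number of transactions made by any buyer in the original network.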

If we define the matrix A as follows:

A <- function(i,j,prob) ifelse(i > j, 0,(choose(j,i)*((1-prob)^(j-i))*(prob^(i))) );

The problem is then to solve A*x = b, where b holds the observed Y values. This has an easy solution since A is upper triangular. But for my current problem the x values I get do not make any sense: they swing between very large positive and very large negative values, whereas I would expect the x vector to show some kind of correlation with the y's.
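A small sketch may help show where the wild values can come from. It builds the matrix A from the function above for a toy case and solves the system by back-substitution. The size `K`, the rate `p`, and the population counts `x_true` are made up. The sketch relies on one fact about binomial thinning: the inverse of the matrix for rate p is the same formula evaluated at 1/p, so its entries alternate in sign and grow like (1/p)^j. Any noise in the observed counts is therefore amplified, more so for small p.

```r
# Thinning matrix: entry (i, j) is P(i of a buyer's j transactions are kept).
thinning_matrix <- function(K, p) {
  outer(1:K, 1:K, function(i, j) {
    ifelse(i > j, 0, choose(j, i) * p^i * (1 - p)^(j - i))
  })
}

K <- 6                                    # hypothetical max transactions per buyer
p <- 0.3                                  # hypothetical sampling rate
x_true <- c(500, 200, 100, 50, 20, 10)    # made-up population counts X_1..X_K

A <- thinning_matrix(K, p)
y_expected <- as.vector(A %*% x_true)     # noise-free E[Y_i]

# Back-substitution recovers x when y is exactly the expectation.
x_back <- backsolve(A, y_expected)

# The inverse is the same formula at rate 1/p (alternating, large entries).
A_inv <- thinning_matrix(K, 1 / p)
max(abs(A_inv %*% A - diag(K)))           # numerically zero

# Changing a single observed count by 1 moves the solution by a lot.
y_noisy <- y_expected + c(0, 0, 0, 0, 0, 1)
x_noisy <- backsolve(A, y_noisy)
round(x_noisy - x_back)
```

With real data the Y values are random counts rather than expectations, so an exact solve behaves like the noisy case here. This is a sketch of the mechanism under the constant-p assumption, not a diagnosis of the actual data.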

I also solved the linear system in the L1 and L2 sense (minimizing the corresponding norms), with and without positivity constraints. The results are still not correlated with the y's, and the sum of the x_i's is very different from 2M.

**Approach 2:**

This approach subsamples from the sample (data). I thought I could subsample from the sample with a given p and then count the number of buyers in that subsample. The most reasonable choice of p would be the one for which the ratio (number of buyers in the original / number of buyers in the sample) equals the ratio (number of buyers in the sample / number of buyers in the subsample).


This approach gave me p ~ 0.02%, but I don't think this is correct. I think Approach 1 makes more sense, but the results do not support that.
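One possible reading of the ratio-matching idea in Approach 2 is sketched below. Instead of drawing random subsamples, it uses the expected number of distinct buyers after thinning at rate q, which is a substitution made here to keep the example deterministic. The toy buyer column and the population size `n_orig` are made up, and the premise that thinning the sample mimics the original sampling step is an assumption of the approach itself.

```r
# Expected number of distinct buyers left after keeping each transaction
# independently with probability q; k holds per-buyer transaction counts.
expected_buyers <- function(k, q) sum(1 - (1 - q)^k)

# Toy sample (made up): one entry per transaction.
buyer <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4)
k <- as.vector(table(buyer))     # transactions per buyer: 3, 2, 1, 4
n_sample <- length(k)            # buyers seen in the sample
n_orig <- 8                      # hypothetical guess at the population size

# Match n_orig / n_sample = n_sample / E[buyers in subsample at rate q].
target <- n_sample^2 / n_orig
q_hat <- uniroot(function(q) expected_buyers(k, q) - target,
                 interval = c(0, 1), tol = 1e-10)$root
q_hat
```

The expected count rises monotonically from 0 at q = 0 to `n_sample` at q = 1, so the root is unique whenever `n_orig` exceeds `n_sample`.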


I should also note that the data at hand consists of monthly transaction data over a period of 12 months, so the amount of data I take as the sample matters as well. I tried Poisson rate estimation, but the amount of data I include strongly affects the estimate of the total number of buyers, and I am really stuck.

I would think that the original data would have some kind of a power law behavior, but that’s purely a guess without any justification. I am really stuck on which approach makes more sense. Any comments, ideas appreciated.
 
  • #2


roark13 said:
I am really stuck on which approach makes more sense.

Unless you define what you are trying to accomplish by the sampling, I don't see how the sensibility or non-sensibility of a given method of sampling can be evaluated. (For example, why not simply sample the data, giving each sample an equal chance of being picked and sampling with replacement?)

You didn't mention whether the transactions in the data have a time stamp.
 

1. What is the purpose of inference on original population based on sample?

The purpose of inference on original population based on sample is to make inferences or predictions about the characteristics of a larger population based on a smaller sample of data. It allows scientists to draw conclusions about a population without having to collect data from every single individual in that population.

2. How is sampling probability estimated?

When the sampling design is known, the sampling (inclusion) probability follows directly from the design; under simple random sampling of n units from a population of N, for example, it is n/N. When the design is unknown, as in this thread, it has to be estimated by assuming a sampling model and choosing the parameter values that best explain the observed sample, for example by maximum likelihood.

3. What factors can affect the accuracy of inference on original population based on sample?

Several factors can affect the accuracy of inference on original population based on sample. These include the size and representativeness of the sample, the sampling method used, the variability of the population, and the quality of the data collected. It is important to carefully consider these factors when making inferences about a larger population.

4. How do scientists determine the confidence level in their inferences?

The confidence level in inferences is typically quantified with statistical measures such as confidence intervals and p-values. A 95% confidence interval comes from a procedure that, over repeated samples, covers the true population value about 95% of the time. A p-value is the probability, assuming the null hypothesis is true, of observing data at least as extreme as those actually observed; it is not the probability that the results are due to chance.

5. Can inference on original population based on sample be used in all situations?

No, inference on original population based on sample is not always appropriate or accurate. It relies on certain assumptions and may not be valid in certain situations, such as when the population is highly variable or when the sample is not representative of the population. It is important for scientists to carefully consider the limitations and potential biases in their data before making inferences about a larger population.
