roark13
In my research I am working with some data (let's call it sales data); however, we have almost no information on the original sample. My goal is to come up with possible explanations of how it could have been sampled, using consistency arguments.
In more detail, the data consists of sales transactions and has two columns (buyer and receiver). One of the most reasonable sampling schemes, we believe, is to assign a probability $p$ of including each transaction, and then include a subset of transactions according to independent random realizations (sampling without replacement).
This corresponds to binomial sampling. Our information on the original sample is very limited (almost none), except that I have an estimate of the total number of buyers overall (circa 2 million). In addition, we assume that buyers are customers who belong to the original data provider's network, whereas sellers may belong to other networks, so our attention is restricted to buyers.
The consistency argument is the following: if a buyer is highly active in the original network (has more transactions), then the probability that this specific buyer gets included in a sample is higher than for a less active customer.
Then, by counting how many transactions each buyer makes and grouping buyers by a common number of transactions, we get the number of buyers who make $i$ transactions. Based on our estimate of the total number of buyers in the network (2M), my goal is to estimate the sampling rate $p$.
R code:

```r
set.seed(0)
# generate random sales transactions
ss1 <- sample.int(100, size = 100, replace = TRUE)
ss2 <- sample.int(10000, size = 100, replace = TRUE)
buyer <- sort(ss1, decreasing = TRUE)
receiver <- sort(ss2, decreasing = TRUE)
data <- data.frame(buyer = buyer, receiver = receiver)

# compute the number of sales each buyer makes
count <- data.frame(table(buyer))
buyer.sales <- count[order(count$Freq, decreasing = TRUE), ]
df <- data.frame(table(buyer.sales$Freq))  # number of buyers who make 1 transaction, 2 transactions, etc.

# We have a guess on the total number of buyers: 2M. What would the sampling rate p be?
```
Here are the approaches I’ve tried so far:
**Approach 1:**
Let $Y_i$ be the number of customers who make $i$ transactions in the sample (data), and similarly $X_i$ the number of customers who make $i$ transactions in the original network. We know the $Y$'s and try to recover the $X$'s.
Then, if the transactions are sampled at a constant rate $p$,
$$Y_i = \sum_{k=i}^{k_{\max}} X_k \binom{k}{i}\, p^{i} (1-p)^{k-i}.$$
If we define the matrix $A$ as follows:

```r
A <- function(i, j, prob) ifelse(i > j, 0, choose(j, i) * prob^i * (1 - prob)^(j - i))
```
The problem is then to solve $Ax = y$, which has an easy solution since $A$ is upper triangular. But for my actual problem, the $x$ values I get make no sense: they range wildly between plus and minus infinity, whereas ideally the $x$ vector should bear some relation to the $y$'s.
I also solved the linear system by minimizing the L1 and L2 norms of the residual, with and without positivity constraints. Still the results are not correlated with the $y$'s, and the sum of the $x_i$'s is very different from 2M.
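To check that the triangular system itself is set up correctly, here is a minimal sketch under assumed toy values (`k_max`, `p`, and the degree counts `X` are hypothetical, not from the real data): it builds $A$ with the entry function above, applies the binomial-thinning relation in expectation, and inverts by back-substitution. With noise-free expected counts the recovery is exact, which suggests the wild values come from sampling noise in the observed $Y$'s rather than from the matrix construction.

```r
# entry of the thinning matrix: P(thinned degree = i | original degree = j)
A_entry <- function(i, j, prob) ifelse(i > j, 0, choose(j, i) * prob^i * (1 - prob)^(j - i))

k_max <- 5
p     <- 0.3                              # assumed sampling rate
A     <- outer(1:k_max, 1:k_max, A_entry, prob = p)  # upper triangular

X     <- c(100, 50, 20, 10, 5)            # hypothetical true degree counts
Y     <- as.vector(A %*% X)               # expected sampled degree counts
X_hat <- backsolve(A, Y)                  # solve the upper-triangular system

max(abs(X_hat - X))                       # essentially zero: exact recovery
```

With observed (noisy) counts in place of the expectations, the same back-substitution amplifies the noise, since the diagonal entries $p^i$ shrink rapidly with $i$.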
**Approach 2:**
This approach subsamples from the sample (data). I thought I could subsample the transactions with a given rate $q$ and count the number of distinct buyers in the subsample. The value of $q$ for which the ratio (number of buyers in sample / number of buyers in subsample) equals the assumed ratio (number of buyers in original / number of buyers in sample) would then be the most reasonable choice for $p$.
This approach gave me $p \approx 0.02\%$, but I don't think this is correct. I think Approach 1 makes more sense, but its results do not support that.
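A minimal sketch of Approach 2, using the same toy data as the code above (the helper `subsample_buyers` and the grid of candidate rates are illustrative assumptions, not part of the original method):

```r
set.seed(0)
# toy transaction table, as in the code above
data <- data.frame(buyer = sample.int(100, size = 100, replace = TRUE))

# thin the transactions with rate q and count distinct surviving buyers
subsample_buyers <- function(d, q) {
  keep <- runif(nrow(d)) < q
  length(unique(d$buyer[keep]))
}

n_sample <- length(unique(data$buyer))
q_grid   <- seq(0.1, 0.9, by = 0.1)
ratios   <- sapply(q_grid, function(q) n_sample / max(subsample_buyers(data, q), 1))

# choose the q whose ratio (buyers in sample / buyers in subsample) is
# closest to the assumed ratio (total buyers / buyers in sample), e.g. 2e6 / n_sample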
I should also note that the data at hand consists of monthly transaction data over a period of 12 months, so the amount of data I take as the sample matters as well. I tried Poisson rate estimation, but the amount of data I include strongly affects the estimate of the total number of buyers, and I am really stuck.
I would guess that the original data exhibits some kind of power-law behavior, but that is purely a guess without any justification. I am really stuck on which approach makes more sense. Any comments or ideas are appreciated.