Pick k (assume k < n) items from the list

  • Context: Undergrad 
  • Thread starter Thread starter bob j
  • Start date Start date
  • Tags Tags
    List
Click For Summary
SUMMARY

The discussion focuses on the challenge of selecting k items from a list of n items, where k is less than n, without knowing the total number of items (n) in advance. Participants propose methods for achieving this in a single pass through the data. One suggested algorithm involves storing the first k items as preliminary picks and then selecting subsequent items with a probability of k/x, where x is the current item index. This method allows for random selection while maintaining the desired probability distribution.

PREREQUISITES
  • Understanding of probability theory, specifically uniform distribution
  • Familiarity with algorithms for random sampling
  • Knowledge of data structures, particularly lists and sets
  • Experience with file I/O operations in programming
NEXT STEPS
  • Research "Reservoir Sampling" for efficient random selection from streams
  • Explore algorithms for random sampling without replacement
  • Learn about the implications of probability in algorithm design
  • Investigate file handling techniques for large datasets in programming
USEFUL FOR

Data scientists, software engineers, and anyone involved in algorithm development or data processing who needs to implement efficient random sampling techniques.

bob j
Messages
22
Reaction score
0
I have a list of n items. My goal is to pick k (assume k < n) items from the list such that each one of them as probability 1/n of being chosen. Would it be possible to accomplish this without knowing n, in one reading?

thanks
 
Physics news on Phys.org


bob j said:
I have a list of n items. My goal is to pick k (assume k < n) items from the list such that each one of them as probability 1/n of being chosen. Would it be possible to accomplish this without knowing n, in one reading?

thanks

I don't understand how you can have a list of n items and not know n.

Besides:
Are you picking a subset from a set?
Set from a list?
List from a set?
List from a list?

I think there is a difference.
 


Let's say there is a file composed of n lines. I do not know the number of lines in advance. I wish to pick k random lines from the file, with one read only (without storing the lines in memory and then doing the rest)...
 


bob j said:
Let's say there is a file composed of n lines. I do not know the number of lines in advance. I wish to pick k random lines from the file, with one read only (without storing the lines in memory and then doing the rest)...

I think the best you can do is to pick k random distinct integers from {0,...,m} where m is an arbitary number bigger than k. Order the k integers in a list L.

During your one pass through, process the i-th line if i is an element in L, skip the rest.

Obviously you will not be able to read k lines if the actual value of n > m. But this is the best I can think of.
 


I assume you mean k/n, not 1/n.

bob j said:
I have a list of n items. My goal is to pick k (assume k < n) items from the list such that each one of them as probability 1/n of being chosen. Would it be possible to accomplish this without knowing n, in one reading?

Yes -- but it won't be efficient. Store the first k items as your preliminary picks. Start x as k + 1 and increase it by 1 until you reach the end of the file. For each x, select x with probability k/x. If selected, choose a random number r from 1 to k and replace preliminary pick r with x.
 


CRGreathouse said:
I assume you mean k/n, not 1/n.



Yes -- but it won't be efficient. Store the first k items as your preliminary picks. Start x as k + 1 and increase it by 1 until you reach the end of the file. For each x, select x with probability k/x. If selected, choose a random number r from 1 to k and replace preliminary pick r with x.

What happens if k > n? That to me is the problem...
 


ych22 said:
What happens if k > n? That to me is the problem...

Since the OP said

bob j said:
(assume k < n)

I'm not worried about this case. My program can cause nasal demons if that happens.
 


CRGreathouse said:
I'm not worried about this case. My program can cause nasal demons if that happens.

Oh, fair enough. Then my algorithm should work too.

However this k<n assumption is still pretty suspicious. How can one claim not to know n yet make the claim that k<n? :eek:
 


ych22 said:
Oh, fair enough. Then my algorithm should work too.

Well, yours doesn't pick them with the desired probability. I think that mine does, though I haven't proved it.

ych22 said:
However this k<n assumption is still pretty suspicious. How can one claim not to know n yet make the claim that k<n? :eek:

Perhaps k is small and n is very large. You might know that n is between a million and ten million, and that k is 10.
 
  • #10


CRGreathouse said:
Well, yours doesn't pick them with the desired probability. I think that mine does, though I haven't proved it.



Perhaps k is small and n is very large. You might know that n is between a million and ten million, and that k is 10.

Oh oops. You are right on both counts!
 

Similar threads

  • · Replies 6 ·
Replies
6
Views
989
  • · Replies 20 ·
Replies
20
Views
2K
  • · Replies 29 ·
Replies
29
Views
4K
  • · Replies 8 ·
Replies
8
Views
2K
  • · Replies 1 ·
Replies
1
Views
2K
  • · Replies 3 ·
Replies
3
Views
1K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 1 ·
Replies
1
Views
2K
  • · Replies 13 ·
Replies
13
Views
2K
  • · Replies 8 ·
Replies
8
Views
2K