Get Frequency Distribution Vector from a Random Sequence

In summary, the conversation discussed a desire for a mathematical function or operator to transform a vector of random discrete values into a frequency distribution vector. Various suggestions were made, such as using the empirical pdf method, but it was ultimately determined that there is no specific function for this task. The conversation also touched on the topic of finding the pdf for discrete random variables and how it relates to mapping a sequence to another domain. It was noted that this is a complex task that requires careful consideration and knowledge of numerical analysis and interpolation techniques.
  • #1
dexterdev
Hi all,
Suppose I have a random discrete sequence such as x = [1 2 3 2 5 2 4 2 3 1 6 3 5] (where the possible outcomes are 1, 2, 3, 4, 5, or 6) and I want its frequency distribution vector
f = [2 4 3 1 2 1], which means that 1 occurs 2 times, 2 occurs 4 times, and so on. I want a mathematical function or operator that transforms the vector x into f. Is that possible?

In general, x is the input to the system and f must be the output. How can this be generalized to the continuous case?

TIA
 
  • #2


At least it would be helpful to know how to perform an inverse-like operation for a non-square matrix or a vector, for example

if Ax = B

then x = inv(A) B, or something like that. I don't know if that works.
 
  • #3


dexterdev said:
Hi all,
Suppose I have a random discrete sequence such as x = [1 2 3 2 5 2 4 2 3 1 6 3 5] (where the possible outcomes are 1, 2, 3, 4, 5, or 6) and I want its frequency distribution vector
f = [2 4 3 1 2 1], which means that 1 occurs 2 times, 2 occurs 4 times, and so on. I want a mathematical function or operator that transforms the vector x into f. Is that possible?

In general, x is the input to the system and f must be the output. How can this be generalized to the continuous case?

TIA

So you are using data to guess the pdf. It's called the "empirical pdf". It's obvious how to derive it, isn't it? I don't see what the problem is.
 
  • #4


Thank you for the reply. I did not know the term 'empirical pdf'. Anyway, can you suggest some references to help me? Are there equations to find it? I mean some sort of transformation that we apply to the input random sequence to get the probability density function.
 
  • #5


ImaLooser said:
So you are using data to guess the pdf. It's called the "empirical pdf". It's obvious how to derive it, isn't it? I don't see what the problem is.
As I read it, the OP is asking for a mathematical function to convert a vector representing results of trials to a vector consisting of the counts of the different outcomes.
Dexterdev, I'm not aware of any such function. It certainly couldn't be a matrix since it would not be a linear operation. Why do you need it to be a mathematical function?
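
For reference, the counting operation being asked about can at least be written down compactly with indicator functions. The block below is only a notational sketch: the symbols N (number of samples), K (number of possible outcomes) and the indicator bracket are assumptions introduced here, not notation from the thread. The map is indeed not linear in x itself, although it is a plain sum over the indicator (one-hot) encoding of each sample.

```latex
% Count of outcome k in the sequence x_1, ..., x_N
f_k = \sum_{i=1}^{N} \mathbf{1}[x_i = k], \qquad k = 1, \dots, K

% Relative frequency (empirical probability) of outcome k
\hat{p}_k = \frac{f_k}{N}, \qquad \sum_{k=1}^{K} \hat{p}_k = 1
```

For the sequence in post #1, N = 13 and K = 6, which gives f = [2 4 3 1 2 1].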
 
  • #6


dexterdev said:
Are there equations to find it? I mean some sort of transformation that we apply to the input random sequence to get the probability density function.

You haven't defined a specific mathematical problem because "random discrete sequence" isn't a specific type of random process. You must define the process that generates the sequence. For example, it isn't clear whether the selection of one term in your sequence is independent of the selection of the other terms. If you have a real life application in mind, you'll get better advice by telling what it is.
 
  • #7


There is a standard method for finding the pdf of discrete random variables (DRVs): count the number of occurrences of a particular value and divide that count by the total number of samples considered. The larger the number of samples, the more accurate the estimated pdf value. If you want a system that outputs the pdf for a given set of input observation samples, program the above method as the system operation and implement it. Why look for some other operation? What is the problem with the existing method?
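
A minimal sketch of the count-and-divide method described above, in Python. The function names `frequency_vector` and `empirical_pmf` are made up for illustration, and the list of possible outcomes is assumed to be known in advance.

```python
from collections import Counter

def frequency_vector(x, outcomes):
    """Count how often each possible outcome occurs in the sequence x."""
    counts = Counter(x)
    # A Counter returns 0 for outcomes that never occur
    return [counts[k] for k in outcomes]

def empirical_pmf(x, outcomes):
    """Relative frequencies: each count divided by the number of samples."""
    n = len(x)
    return [c / n for c in frequency_vector(x, outcomes)]

x = [1, 2, 3, 2, 5, 2, 4, 2, 3, 1, 6, 3, 5]
outcomes = range(1, 7)

f = frequency_vector(x, outcomes)  # [2, 4, 3, 1, 2, 1]
p = empirical_pmf(x, outcomes)     # each entry is the count divided by 13
```

Passing the outcomes explicitly keeps zero counts in the vector, so its length does not depend on which values happen to show up in the sample.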
 
  • #8


OK guys, I will explain why I need such a mathematical operation. I would like to illustrate the central limit theorem. We know that if x1 and x2 are two independent random variables with pdfs pdf(x1) and pdf(x2) respectively, then pdf(x1 + x2) = convolution(pdf(x1), pdf(x2)).

So I thought of explaining it this way:

x1 ------------------> pdf(x1) = some equation depending on x1
x2 ------------------> pdf(x2) = some equation depending on x2

x1 + x2 -------------> pdf(x1+x2) = some equation depending on x1 and x2, i.e. here convolution(pdf(x1), pdf(x2))
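
The convolution relation above can be checked numerically. The following is a sketch under assumptions that are not in the thread: the variables are fair six-sided dice, and the helper name `convolve` is made up for illustration. It compares the convolved pmf with the empirical frequencies of simulated sums, then convolves repeatedly to move toward the central limit theorem picture.

```python
import random
from collections import Counter

def convolve(p, q):
    """Discrete convolution of two pmfs given as lists indexed from 0."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

# pmf of a fair die; index 0 corresponds to the outcome 1
die = [1 / 6] * 6

# pmf of the sum of two independent dice; index 0 corresponds to the sum 2
two_dice = convolve(die, die)

# Empirical check with simulated throws (seed fixed for repeatability)
rng = random.Random(0)
n = 100_000
sums = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n))
empirical = [sums[s] / n for s in range(2, 13)]

# Repeated convolution: pmf of the sum of eight dice (sums 8 to 48)
pmf = die
for _ in range(7):
    pmf = convolve(pmf, die)
```

Note that the convolution acts on the pmfs, not on the sample sequences themselves: `empirical` only approaches `two_dice` as the number of simulated throws grows.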

Is it right that finding a discrete probability density function is similar to mapping a sequence to another domain with some loss of information?

TIA
 
  • #9


dexterdev said:
x1 + x2 -------------> pdf(x1+x2) = some equation depending on x1 and x2, i.e. here convolution(pdf(x1), pdf(x2))

Is it right that finding a discrete probability density function is similar to mapping a sequence to another domain with some loss of information?

I see no similarity between the two situations. The pdf of x1+x2 is a function of one variable, not two variables. How can you write the pdf of a sequence as a function of one variable?
 
  • #10


If you don't mind please explain that to me...
 
  • #11


dexterdev said:
If you don't mind please explain that to me...

Explain what?
 
  • #12


Hey dexterdev.

Are you trying to take a set of sample data and generate a purely symbolic representation of the PDF from that sample?
 
  • #13


yes.
 
  • #14


Well in that case you're going to face quite a lot of issues.

The first thing you need to think about is what the space of possible representations will be and the technique you will use.

There is an area (or maybe sub-area is the right word) that deals with interpolation, and this is a whole field of numerical analysis in itself. Interpolation is basically a way of generating a representation of a function that goes through known points, with special properties that are specific to the interpolation algorithm.

There are many interpolation algorithms. Some of the more complex methods are known as NURBS (Non-Uniform Rational B-Splines), which allow you to interpolate not only over points but also to specify multiplicities for each point with special vectors, so you have a lot more control over the process than with your standard Lagrange polynomial.

Now, the interpolation will generate polynomial expressions, but these expressions will look like garbage: with hundreds of data points you get hundreds of terms, and if this goes to, say, thousands or tens of thousands of points, you can see that doing this is going backwards.
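
To make the interpolation idea concrete, here is a small sketch of the standard Lagrange polynomial, evaluated numerically rather than expanded symbolically. Using the (outcome, count) pairs from post #1 as the nodes is an assumption made for illustration, and the function name `lagrange_interpolate` is made up.

```python
def lagrange_interpolate(points, t):
    """Evaluate at t the Lagrange polynomial through the (x, y) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        # Basis polynomial: equals 1 at xi and 0 at every other node
        basis = 1.0
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis *= (t - xj) / (xi - xj)
        total += yi * basis
    return total

# Nodes: (outcome, count) pairs from the frequency vector f = [2 4 3 1 2 1]
points = list(zip(range(1, 7), [2, 4, 3, 1, 2, 1]))

# The polynomial reproduces the counts at the nodes...
at_nodes = [lagrange_interpolate(points, k) for k in range(1, 7)]

# ...and also returns a value between nodes, where the data defines nothing
between = lagrange_interpolate(points, 2.5)
```

Six nodes already give a degree-5 polynomial, which hints at how quickly the symbolic form becomes unwieldy as the number of points grows.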

So that's the interpolation side.

The second way deals with what happens in signal processing, data compression, and other similar fields (these two are applied everywhere, and signal processing is a field of its own for good reason).

In signal processing you have a signal and an orthogonal basis; you project the signal onto the basis and reconstruct the signal as a linear combination of the basis vectors.

In mathematics this kind of thing is known as Fourier analysis and deals with orthogonal functions: you take a signal, get its component along each basis function, and obtain coefficients that are used to reconstruct the signal relative to that basis.
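
As a rough illustration of the projection and reconstruction steps, the sketch below uses the discrete Fourier basis (complex exponentials) on the frequency vector from post #1. The choice of this particular basis and of that vector as the "signal" are assumptions made for the example; the function names are made up.

```python
import cmath

def dft(signal):
    """Project the signal onto the complex exponentials exp(2*pi*i*k*n/N)."""
    N = len(signal)
    return [sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

def inverse_dft(coeffs):
    """Rebuild the signal as a linear combination of the basis vectors."""
    N = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)) / N
            for n in range(N)]

f = [2, 4, 3, 1, 2, 1]
coeffs = dft(f)  # one coefficient per basis vector
rebuilt = [value.real for value in inverse_dft(coeffs)]
```

The zeroth coefficient is just the sum of the entries (13 here), and keeping all the coefficients recovers the vector up to rounding error.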

The third way deals with a form of convergence to a particular model. This is used frequently in probability and statistics, and one particular method is known as the EM (expectation-maximization) method.

This works by fitting arbitrary data to a fixed model and getting the best representation of that data relative to the model.

So you have to assume a PDF model, and then the algorithm takes your data and provides the best fit according to that model.
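
A toy sketch of what "assume a model, then fit" can look like with EM. Everything specific here is an assumption chosen for illustration and does not come from the thread: the model is a mixture of two 1-D Gaussians, the data are made up, and the starting values and the floor on the standard deviation are ad hoc.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iterations=50):
    """Fit weights, means and standard deviations of a 2-component mixture."""
    mu = [min(data), max(data)]  # crude initial guess
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            dens = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted data
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    return weight, mu, sigma

# Made-up data with two well-separated clusters
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weight, mu, sigma = em_two_gaussians(data)
```

The result is only as good as the assumed model: the algorithm will return two Gaussians whether or not the data actually came from two Gaussians.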

The difference between two and three is that the second is an explicit technique and the third is an implicit technique.

So now you are faced with a few decisions: the first approach generates a symbolic equation that gives you nothing more useful than the raw data, and the other two require you to supply a basis or a model to fit to.

You have a trade-off: either you make no assumptions about constraints and get something that just confuses you more, or you make assumptions about the constraints, which means you are pre-defining the characteristics of the model anyway and simply fitting your data to a pre-defined set of constraints.

So what do you choose?
 
  • #15


I will read your explanation and reply. Thanks for your elaborate reply.
 

1. How do I get the frequency distribution vector from a random sequence?

To get the frequency distribution vector from a random sequence, first count the occurrences of each possible outcome in the sequence. These counts, listed in the order of the possible outcomes (not in the order in which the elements appear in the sequence), form the frequency vector, such as f = [2 4 3 1 2 1] in the original question. To get relative frequencies instead, divide each count by the total number of elements in the sequence.

2. Why is it important to get the frequency distribution vector from a random sequence?

The frequency distribution vector provides valuable information about the distribution of elements in a sequence. This information can be used for data analysis, pattern recognition, and statistical modeling. It also helps in identifying outliers and understanding the overall structure of the sequence.

3. Can I get the frequency distribution vector from any type of random sequence?

Yes, the method of getting the frequency distribution vector can be applied to any type of random sequence, including numerical, categorical, or textual data. However, the approach may differ depending on the type of data and the tools available.

4. Is there a specific software or tool to get the frequency distribution vector from a random sequence?

Many software packages and programming languages have built-in functions or libraries for calculating the frequency distribution vector. Some popular options are Python's collections.Counter class, R's table() function, and MATLAB's histcounts() function. You can also write your own code to calculate the vector using basic programming concepts.

5. Are there any limitations to using the frequency distribution vector from a random sequence?

The frequency distribution vector is a useful tool for analyzing and understanding a random sequence, but it does have some limitations. It only provides information about the distribution of elements in the sequence and does not take into account the order or relationship between elements. Also, the accuracy of the vector may be affected by the size and diversity of the sequence.
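
The loss of ordering information can be seen in a few lines of Python. This is only an illustrative sketch: it uses the sequence from the original question, a sorted copy of it, and counts of adjacent pairs as one (assumed) example of an order-dependent statistic.

```python
from collections import Counter

x = [1, 2, 3, 2, 5, 2, 4, 2, 3, 1, 6, 3, 5]
reordered = sorted(x)  # same values, completely different order

# The frequency counts cannot tell the two sequences apart
same_counts = Counter(x) == Counter(reordered)

# Counts of adjacent pairs depend on the order, so they do differ
pairs_x = Counter(zip(x, x[1:]))
pairs_reordered = Counter(zip(reordered, reordered[1:]))
same_pairs = pairs_x == pairs_reordered
```

Any reordering of the sequence gives the same frequency vector, so the original sequence cannot be recovered from it.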
