
Who can perform a Chi-squared analysis on my linguistic data?

  1. Jan 3, 2012 #1
    I am doing a PhD in General Linguistics and I have a large amount of data on which I would like to have a Chi-squared analysis done. Unfortunately I don't know the first thing about maths. Could anyone help me or at least point me in the right direction? That would be much appreciated. Also, if you need more info, please let me know exactly what you need to know, because, like I said, I am 100% ignorant when it comes to maths...:(

    Thanks in advance for your help.

  3. Jan 3, 2012 #2


    Science Advisor

    Hey ametisto and welcome to the forums.

    If you are using classical statistics to either a) measure goodness of fit (i.e. how well your data fit a particular distribution) or b) find an interval estimate for the true population variance (i.e. estimate the variance parameter of a distribution from a sample), then you will need to use the chi-square distribution.

    Before you do this, it will be helpful for us to know exactly what you are trying to do. It might turn out that you cannot use standard analytic techniques because certain assumptions do not hold, so it would be beneficial for both you and other forum members to know this upfront.
  4. Jan 4, 2012 #3
    Thanks a lot for your speedy reply, chiro. Good to know I'm in the right place.

    Ok, let me explain what it is I'm trying to do:

    To put it simply, I am trying to see if there is a correlation between two grammatical structures belonging to two different grammatical categories. The structures in question look like this:
    a. ser + Verb in the past participle (category of voice - as in the active/passive voice)
    b. tener + Verb in the past participle (category of aspect - a category similar to tense)
    When certain verbs are used in the a construction, the sentence becomes ungrammatical. My hypothesis is that the a construction in fact belongs to the same category as the b construction. It is well known that the b construction is subject to certain lexical restrictions, i.e. not all verbs can be used in this construction. No such restrictions are supposed to exist for the a category. Therefore, if my hypothesis is correct and a belongs to the same category as b, they should be subject to the same lexical restrictions. I have therefore come up with a list of 231 sentences which are ungrammatical in a. Then I have used the same verbs to form sentences using the b construction to find out if the verbs that make the a sentences ungrammatical also make the b sentences ungrammatical. And this seems to be the case: for approx. 87% of the a sentences, the corresponding b sentences are also ungrammatical.

    Now somebody mentioned that I could use the chi-square distribution to analyse my data professionally and give them more credibility. I had never heard of the chi-square distribution before, and the person telling me about it is also a linguist, so I am not at all sure what exactly it is about, or if it's at all suitable for my type of data. I hope I made myself clear. If I forgot anything important, do let me know.

    Again, thanks a lot for your help and patience;)

    Edit: I should probably say a few words on how I arrived at the correlation of 85%:
    Saying that 87% of the a sentences are also ungrammatical in the b sentences was a bit of a generalisation. In fact the data is composed as follows:
    1. -[a] and -[b] = 47%
    2. ?[a] and -[b] = 28%
    3. ?[a] and ?[b] = 8.7%
    4. -[a] and ?[b] = 3%
    Explanation: - means 'ungrammatical', and ? means 'odd, but not necessarily ungrammatical'
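
    As a quick check of the arithmetic, the four percentages listed above can be added up. The sketch below only sums the figures given in the post; the dictionary keys are illustrative labels, not part of the original data.

```python
# Percentages reported in the post for the four combinations of judgements
# (a-sentence judgement / b-sentence judgement). Keys are illustrative labels.
shares = {"-a/-b": 47.0, "?a/-b": 28.0, "?a/?b": 8.7, "-a/?b": 3.0}

# Round to one decimal place to avoid floating-point noise in the sum.
total = round(sum(shares.values()), 1)
print(total)  # 86.7
```

    The total is 86.7%, which rounds to the 87% quoted above.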

    For now I have simply added up the numbers because for me as a linguist the data is quite convincing. However, a chi-square analysis would give my research further credibility and I would therefore like to use it to find out how good the data is in relation to my initial hypothesis.
    Last edited: Jan 4, 2012
  5. Jan 4, 2012 #4


    Science Advisor

    I did a quick google for chi-square analysis and I got the following link:


    In statistics this is known as a "goodness-of-fit" test.

    Just so you know, I am training to become a statistician, and as a result I know that the chi-square distribution (which you are using in the goodness-of-fit test) is actually used for a variety of purposes, and the goodness-of-fit test is just one application of this distribution.

    The GOF (goodness-of-fit) test works like this: you measure how far the observed counts deviate from the counts expected under a given distribution, which gives you a single total "deviation" figure. You then use the chi-squared distribution with the appropriate degrees of freedom to check statistically whether the observed data are "different enough" from the expected distribution (think: is the discrepancy between the two larger than chance alone would plausibly produce).
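
    For reference, the total deviation described here is usually written as Pearson's chi-square statistic. The symbols below are notation chosen for this note, not taken from the thread: O_i is the observed count in category i, E_i is the expected count, and k is the number of categories.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad \text{degrees of freedom} = k - 1
```

    The k - 1 degrees of freedom apply when the expected distribution is fully specified in advance; if parameters are estimated from the same data, the degrees of freedom are reduced accordingly.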

    So the next logical thing is to find out exactly what the expected and observed data is.

    From what you have said, you have two categories a and b that have certain linguistic constraints (with one being a lot more constrained than the other).

    Now from what you have posted I'm guessing you have a set of verbs and each verb has a frequency of being "declined" in each category (a and b).

    The frequency data form the "expected" and the "observed" distributions. Your less constrained frequency data (category a) are your expected and the category b frequencies are your observed. In other words, you are seeing how well category b fits category a, where a is the less constrained set and b is the more constrained set.

    First I need to make sure that this is what you want to do: in other words, check if the frequency information between categories a and b is statistically significantly "close enough" to each other.

    If this is the case, then I can tell you how to analyze your data, and give you an idea of how to interpret it and what that actually means.
  6. Jan 4, 2012 #5
    Thanks, chiro. I edited my post for clarification. I'm about to leave the house, but I'll get back to you asap.
  7. Jan 9, 2012 #6
    Sorry it took me so long to get back to you. I was away for a few days and didn't have time to reply.

    To be precise, I have 231 verbs which are declined in category a, but not all of them are declined in category b. According to the grammatical rule, a should not have any constraints at all, whereas b does have constraints. I would now like to find out if it's ok to claim that a behaves like b on the basis of the 85% correlation (i.e. 85% of the verbs declined in a are also declined in b)

    In theory (according to the grammatical rules) a should be less constrained than b, but in actual fact it's the other way round (a = 100% declined, b= 85% declined)

    I am not sure I understand you. Would you be so kind as to give me a fool-proof explanation;)

    Thanks a lot for your help. I really appreciate it.

  8. Jan 9, 2012 #7


    Science Advisor

    Frequency information just relates to the number of times something happens: it just encodes a probability.

    For example, let's say we have a die that we roll a very large number of times. For a theoretical fair die, the probability of rolling any given number is 1/6. This, for our purposes, is the frequency information.
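
    To make the die example concrete, here is a minimal sketch of how the statistic would be computed by hand. The observed counts are made-up numbers for illustration only, and the variable names are arbitrary.

```python
# Hypothetical counts from 600 rolls of a die (made-up numbers for illustration).
observed = [95, 105, 98, 102, 110, 90]

total_rolls = sum(observed)
# A fair die assigns probability 1/6 to each face, so every expected count
# is the total number of rolls divided by 6.
expected = [total_rolls / 6] * 6

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
# chi-square = 2.58, df = 5
```

    The closer the observed counts are to 100 per face, the smaller the statistic becomes.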

    What the chi-square goodness-of-fit test does is measure how far an observed distribution deviates from an expected distribution.

    What this means intuitively is that if there isn't much variation between the two distributions, you will get a low chi-square statistic and you will fail to reject the hypothesis that the observed data follow the expected distribution at the chosen significance level (usually 5%, i.e. 95% confidence). If the statistic is large enough, we reject that hypothesis and conclude that the two distributions are statistically significantly different.

    That's the basic idea of the chi-squared goodness-of-fit test.
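
    The decision rule can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the lookup table holds the standard upper 5% critical values of the chi-square distribution for 1 to 5 degrees of freedom, rounded to three decimals.

```python
# Upper 5% critical values of the chi-square distribution for small degrees
# of freedom (standard table values, rounded to three decimals).
CRITICAL_VALUES_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def goodness_of_fit_decision(chi_square, degrees_of_freedom):
    """Return the test decision at the 5% significance level.

    Hypothetical helper: 'reject' means the observed data differ significantly
    from the expected distribution; 'fail to reject' means they do not.
    """
    critical = CRITICAL_VALUES_5_PERCENT[degrees_of_freedom]
    if chi_square > critical:
        return "reject"
    return "fail to reject"


print(goodness_of_fit_decision(2.58, 5))  # fail to reject
print(goodness_of_fit_decision(15.0, 5))  # reject
```

    A statistic of 2.58 with 5 degrees of freedom is well below 11.070, so the data would be considered consistent with the expected distribution, while a statistic of 15.0 would not.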

    What I think you need to do is set up two distributions: the first is a discrete uniform distribution with 231 entries (all entries have the same probability of 1/231), and the second relates to the probabilities found in your constrained category "b".

    Does this make sense to you?