Comparing two sets of data with percentages

  • Thread starter: Carole
  • Tags: Data Sets
AI Thread Summary
The discussion centers on comparing survey data from two areas with differing sample sizes, specifically using percentages to analyze Likert scale responses. Concerns arise about the representation of responses due to the unequal sample sizes (n1=31, n2=34), leading to questions about the appropriateness of using percentages or alternative statistical tests. Suggestions include using a two-sample t-test to assess differences in means or the Mann-Whitney U test for non-parametric data. The importance of defining specific hypotheses for analysis is emphasized, along with the potential use of graphical representations like box plots. Ultimately, the consensus leans towards the two-sample t-test as a suitable method for this analysis.
Carole
Hi everyone,

I was wondering if someone could help with the following:

I am doing my undergraduate project and have collected two sets of answers following a survey, which I would like to compare.

The questions (29 of them) are mostly Likert style, some allowing for multiple responses.

I wanted to use percentages, but I am not sure I can, as area 1 has n1=31 and area 2 has n2=34. For example, for one of the questions there was only one "I agree" response in each area, which gave me the following percentages:
area 1 = 3.2%; area 2 = 2.9%
So it seems that area 2 is under-represented.
Can I still use percentages, or is there anything I can do to "adapt" my data? Or should I add n1 and n2 together to get 65, in which case each response would be worth the same percentage in both areas (in the example above, 1 response would equal 1.5% in each case)?

Alternatively, a friend suggested a two-sample t-test, but if I remember properly this is for the mean, and I believe the median is more appropriate? Also, I would like to keep it as simple as possible to avoid biting off more than I can chew!

Thank you very much, I hope that I made sense, I would appreciate an answer in as plain English as possible as I will get lost in the jargon otherwise!

Carole
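
A minimal sketch of the arithmetic in the question above, using only the counts quoted in the post (n1=31, n2=34, one "I agree" response in each area). Dividing by each area's own sample size is what makes the two areas comparable when the sample sizes differ. Dividing by the combined total describes the two areas taken together, which is a different question. The helper name percent is made up for this illustration.

```python
# Counts quoted in the post: one "I agree" response in each area.
n1, n2 = 31, 34          # respondents in area 1 and area 2
agree1, agree2 = 1, 1    # "I agree" responses in each area


def percent(count, total):
    """Share of `total` made up by `count`, as a percentage (hypothetical helper)."""
    return 100.0 * count / total


pct_area1 = percent(agree1, n1)                  # share within area 1
pct_area2 = percent(agree2, n2)                  # share within area 2
pct_pooled = percent(agree1 + agree2, n1 + n2)   # share across both areas combined

print(f"area 1: {pct_area1:.1f}%  area 2: {pct_area2:.1f}%  pooled: {pct_pooled:.1f}%")
```

The small gap between the two within-area figures comes purely from the different denominators, not from any difference in how many people agreed.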
 
Carole said:
The questions (29 of them) are mostly Likert style, some allowing for multiple responses.

Give an example of a Likert-style question. I'm not sure what it is.

I wanted to use percentages

Use percentages for what? What is it that you are trying to do?


So it seems that area 2 is under-represented.

What characteristics make a number a correct representation? Without an explanation of the data and what you're trying to analyze, I have no ideas on that subject.

Alternatively, a friend suggested a two-sample t-test, but if I remember properly this is for the mean, and I believe the median is more appropriate?

Why do you believe the median is more appropriate?

If people in your field publish analyses of surveys, it would be wise to look at published papers and reports and see what the authors did. In statistics, tradition often trumps any other consideration. If you are doing the analysis only for your own satisfaction, you have to formulate precisely what question you are asking.

The Mann-Whitney U test is often used to test the hypothesis of the equality of two distributions. Equality of distributions is a more restrictive specification than merely saying that two distributions have the same median.
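
As a rough illustration of the Mann-Whitney U test mentioned above, here is a sketch in plain Python. It assumes the Likert answers have been coded as integers (for example 1 = "Very bad" up to 4 = "Very good", a coding chosen here for illustration) and uses the normal approximation with a tie correction but no continuity correction, so its p-value may differ slightly from what a statistics package reports. The function name and the sample data are invented.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value via the normal approximation)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which the x value exceeds the y value; ties count one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n = n1 + n2
    mean_u = n1 * n2 / 2
    # Tie correction: sum of (t^3 - t) over each group of tied values in the pooled sample.
    ties = sum(t ** 3 - t for t in Counter(list(x) + list(y)).values())
    var_u = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if var_u <= 0:
        # Every value is identical, so there is no evidence of any difference.
        return u, 1.0
    z = (u - mean_u) / sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value


# Hypothetical coded answers, invented purely for illustration.
area_a = [3, 3, 2, 4, 3, 2, 3, 1]
area_b = [2, 2, 3, 1, 2, 3, 2]
u, p_value = mann_whitney_u(area_a, area_b)
print(f"U = {u}, two-sided p = {p_value:.3f}")
```

With only four answer categories there will be many ties, which is why the tie correction is included.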
 
Dear Stephen, thanks for your reply,


Stephen Tashi said:
Give an example of a Likert-style question. I'm not sure what it is.

Likert is a type of question used to scale responses.
For example: "How would you rate the air quality in this area?" (Tick one only)
- Very good
- Good
- Bad
- Very bad

My other questions allow multiple choices to test the respondents' level of knowledge, such as:
"Which of the following pollutants known to affect air quality are you aware of?"
- Nitrogen Dioxide
- Ozone
- Particulate Matter
- Carbon Monoxide ... and so on

Use percentages for what? What is it that you are trying to do?

I wanted to use percentages to compare the responses between the two surveyed areas, for example: "10% of the respondents in Clifton Village were aware of four or more air pollutants, as opposed to 30% in the City Centre."


What characteristics make a number a correct representation? Without an explanation of the data and what you're trying to analyze, I have no ideas on that subject.

At the moment I have coded all the answers and entered them into SPSS, and analysed them for frequencies, so it gave me how many times each specific answer was given, per area (this is as far as my abilities go with SPSS, though). For example, for "Very good" in the question above, only 1 person in Clifton gave that answer, and likewise in the centre. If I had exactly the same number of responses in each area, it would be very easy to compare percentages, but because they differ I just don't know how to do it, as 1 response in Clifton equals 2.9% of the responses, and in the centre 3.2%.
I am not sure if I am looking at this the wrong way entirely, maybe I just need to look at it as a whole?

Why do you believe the median is more appropriate?

Well, the mean would give me the average, whereas the median would be more precise and steer me towards the main trend in what people believe/think/know.

If people in your field publish analyses of surveys, it would be wise to look at published papers and reports and see what the authors did. In statistics, tradition often trumps any other consideration. If you are doing the analysis only for your own satisfaction, you have to formulate precisely what question you are asking.

I have done that; they express their results in percentages and compare various sets with differing numbers of responses, just as I wanted to do, but sadly I haven't been able to find one that shows the working behind the results. They go as far as describing the method, but do not mention the type of statistics they used. They do not give their original full set of responses/results, so I can't even try to work it out backwards.

The Mann-Whitney U test is often used to test the hypothesis of the equality of two distributions. Equality of distributions is a more restrictive specification than merely saying that two distributions have the same median.


I can try this. I have tried using Minitab to create some box plots, but I last used it three years ago and am now going through some trial and error to remember how to input my data in the worksheet.

I hope I made sense, thanks again,

Carole
 
In my opinion, you haven't formed a precise statement of your objectives yet. You appear to have several objectives (which is perfectly OK). To use the traditional method of "hypothesis testing" you need a hypothesis! It must be specific. Different "null hypotheses" may require different statistical tests.

Examples of various hypotheses

1. There is no difference between the population of Clifton and City Centre with respect to the distribution of answers on the survey, if the entire population of each is polled.

2. The population of Clifton has a higher fraction of people who answered a) to question 12 than the population of City Centre.

3. When answering question 37, the population of Clifton tends to rate air quality as being higher than the population of City Centre rates it.

If you want to test a generality such as "The residents of Clifton are less concerned about environmental problems than the residents of City Centre" then you have to create a definition of that generality in terms of very specific hypotheses like those above.

I think you are at the preliminary stage of analysis. You are using "descriptive statistics". This means you are making plots and graphs to form an intuitive understanding of the data. I'm sure that people have studied how to do this effectively - but I'm not one of them! As far as I know, there aren't any strict rules about what you must do. Perhaps papers in your field only publish informal arguments based on such descriptions.
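
A small sketch of the kind of descriptive summary discussed above: the percentage of each answer and the median answer, computed separately for each area. The numeric coding of the four answers and the sample data are assumptions made up for this example, not values from the survey.

```python
from collections import Counter
from statistics import median

# Assumed coding of the air-quality answers (not taken from the thread).
LABELS = {1: "Very bad", 2: "Bad", 3: "Good", 4: "Very good"}


def describe(codes):
    """Return (percentage of each answer, median code) for one area."""
    counts = Counter(codes)
    total = len(codes)
    shares = {LABELS[c]: 100.0 * counts.get(c, 0) / total for c in LABELS}
    return shares, median(codes)


# Hypothetical coded answers, invented purely for illustration.
clifton = [3, 3, 2, 4, 3, 2, 3, 1]
centre = [2, 2, 3, 1, 2, 3, 2]

for name, codes in (("Clifton", clifton), ("City Centre", centre)):
    shares, med = describe(codes)
    rounded = {label: round(value, 1) for label, value in shares.items()}
    print(name, rounded, "median code:", med)
```

Because each area's percentages are computed against that area's own total, the two rows can be read side by side even though the sample sizes differ.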
 
Dear Carole,

Nothing comes to mind as to how to use medians instead of means to compare your data other than box diagrams etc., which are a good graphical aid to a project and take very little time to produce and write about. However, with the different populations I would probably lean more towards using the mean.

I think your friend's idea of a two-sample t-test is one of the better options. You will be able to conclude whether there is a significant difference between the two areas for each of your questions, and using the extra information in the output I think you can find things like an estimate for the difference. Plus, as you have inputted the data already, a two-sample t-test is very quick and can only aid you in your project :) As long as you test that the data is normally distributed and test for equal variances first.
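
To make the two-sample t-test above concrete, here is a sketch that computes the test statistic by hand. It uses Welch's variant, which does not assume equal variances. That is a substitution made for this illustration rather than the pooled-variance test described in the post. It also treats the Likert codes as plain numbers, which is itself an assumption. The function name and the data are invented, and turning t into a p-value would still need a t-distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n1, n2 = len(x), len(y)
    # Squared standard error of each sample mean (sample variance divided by n).
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical coded answers, invented purely for illustration.
area_a = [3, 3, 2, 4, 3, 2, 3, 1]
area_b = [2, 2, 3, 1, 2, 3, 2]
t, df = welch_t(area_a, area_b)
print(f"t = {t:.2f} on about {df:.1f} degrees of freedom")
```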

Another test you may want to use is a 2 proportions test. You can see if the proportion in n1 is similar to that in n2, for example if the proportion of obese females is the same as the proportion of obese males. This might help with your issue of misrepresentation that you mentioned. This would also need the data to be normally distributed and have equal variances.
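
A sketch of the 2 proportions test described above, applied to the counts from the opening post (1 of 31 versus 1 of 34). It uses the common pooled z-test, which relies on a normal approximation to the sample proportions. That approximation is rough when the counts are as small as one, and an exact test such as Fisher's is usually preferred in that situation, so treat the result as illustrative only. The function name is made up.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(successes1, n1, successes2, n2):
    """Return (z statistic, two-sided p-value) for a pooled two-proportion z-test."""
    p1, p2 = successes1 / n1, successes2 / n2
    # Pooled proportion under the null hypothesis that both areas share one rate.
    pooled = (successes1 + successes2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; the z-test is undefined")
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Counts quoted in the opening post: one "I agree" response in each area.
z, p_value = two_proportion_z(1, 31, 1, 34)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

A z value very close to zero here would indicate that the gap between 3.2% and 2.9% is well within what the different sample sizes alone would produce.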

If you do not have normally distributed data (I don't think this is likely for a lot of the questions) the Mann-Whitney U test can be used instead of the two-sample t-test and is found in the Non-Parametric tests :)

I hope this helped!
 
Dear Stephen and Monachus,

Thanks a lot to both of you for your help. I will go with the two-sample t-test, which appears to be the best option as advised by several parties.

Thanks again,

Kind regards,

Carole
 