Increase Accuracy of Program: How Many Subjects Needed?

  • Context: Undergrad 
  • Thread starter: jt.harperjr
  • Tags: Analysis

Discussion Overview

The discussion revolves around the statistical modeling of two categorical variables, A and B, which represent distinct characteristics of individuals. Participants explore how to improve the accuracy of a program designed to classify individuals based on various measurements, considering the implications of sample size and model refinement.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants suggest that the initial model assumes a binary classification (A or B) without accounting for nuances such as bisexuality or variations within categories.
  • There is a proposal to refine the model to represent a continuum rather than a strict dichotomy, possibly incorporating multiple categories (A, B, C, D, E) to better reflect real-world complexities.
  • Participants question the definition of "accuracy rating" and how it relates to the sample size and representativeness of the population.
  • One participant notes that a larger sample size may yield results that are more representative of the general population, but the meaning of "accuracy" remains ambiguous and dependent on sampling methods.
  • There is a discussion about the need for a model that accommodates variations within categories, similar to how height is measured on a continuous scale.

Areas of Agreement / Disagreement

Participants generally agree that the model needs refinement and that sample size plays a crucial role in accuracy. However, there is no consensus on the definition of accuracy or the best approach to modeling the variables.

Contextual Notes

Limitations include the lack of clarity on what constitutes an "accuracy rating" and the dependence on how the sample is chosen, which may affect the generalizability of the results.

Who May Find This Useful

Individuals interested in statistical modeling, data classification, and those working on projects involving categorical data analysis may find this discussion relevant.

jt.harperjr
I'm a programmer, but I know very little about statistics and am not even sure where or how to ask this. Let's say you have 2 variables about people in general, var A and var B, that are tangible characteristics of these people. Each person possesses either A or B.

I then take 11 different measurements about the person and use those to determine if they are actually A or B without looking at them. The program successfully determines if someone is A or B in a group of 10 people. But as I test more and more people, I find that some people have slight differences or exceptions in their variables that I have to account for.

Example: All A people have the first variable in a range of 12 to 13 and the 2nd variable in a range of 5 to 6, but then I find an A person who has a value of 1 for the 2nd variable. So I add to the formula that if the 2nd variable = 1, then the person has A.

My question - How many people would I have to test to get an accuracy rating above 80% for the program, or is that even possible? As I add more and more subjects that fit that equation, does that translate into an increase in accuracy of the program when used on the general population?
 
jt.harperjr said:
I'm a programmer, but I know very little about statistics and am not even sure where or how to ask this. Let's say you have 2 variables about people in general, var A and var B, that are tangible characteristics of these people. Each person possesses either A or B.
So to start with, your model assumes that someone is one or the other ... e.g. maybe A=gay and B=straight ... and you cannot be bisexual.

I then take 11 different measurements about the person and use those to determine if they are actually A or B without looking at them. The program successfully determines if someone is A or B in a group of 10 people. But as I test more and more people, I find that some people have slight differences or exceptions in their variables that I have to account for.

Example: All A people have the first variable in a range of 12 to 13 and the 2nd variable in a range of 5 to 6, but then I find an A person who has a value of 1 for the 2nd variable. So I add to the formula that if the 2nd variable = 1, then the person has A.
So you discover with testing that the initial model needs to be refined to account for those mostly straight people who have experimented with same-sex relationships in college or something?

Maybe you noticed that some people with mostly blue eyes have a few brown flecks in them, or that some people who are basically dark-skinned are not exactly black either.

My question - How many people would I have to test to get an accuracy rating above 80% for the program, or is that even possible? As I add more and more subjects that fit that equation, does that translate into an increase in accuracy of the program when used on the general population?
You need to change your model. Perhaps you need to set A and B as opposite ends of a scale.
You also need to define what you mean by "accuracy rating".
But your question is too general.
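
One common way to pin down "accuracy rating" (an assumption here, not something the thread settles) is the fraction of correctly classified subjects among people who were not used to write or adjust the rules. The sketch below uses a hypothetical two-measurement rule loosely based on the example in the original post, plus invented subjects purely for illustration; the names classify, accuracy and held_out are made up for this sketch.

```python
def classify(measurements):
    """Hypothetical hand-written rule, loosely following the thread's example.

    Combining the two conditions this way is an assumption for illustration.
    """
    first, second = measurements[0], measurements[1]
    if 12 <= first <= 13 and (5 <= second <= 6 or second == 1):
        return "A"
    return "B"


def accuracy(classifier, labelled_subjects):
    """Fraction of subjects whose predicted label matches the known label."""
    if not labelled_subjects:
        raise ValueError("need at least one labelled subject")
    correct = sum(
        1 for measurements, label in labelled_subjects
        if classifier(measurements) == label
    )
    return correct / len(labelled_subjects)


# Invented subjects that were NOT used when the rule was written.
held_out = [
    ((12.5, 5.5), "A"),
    ((12.1, 1), "A"),
    ((9.0, 5.5), "B"),
    ((12.8, 3.0), "A"),  # an exception the rule has not seen yet
]

print(accuracy(classify, held_out))  # 3 of 4 correct
```

Scoring the rule on the same people it was tuned on would always look good, because every exception found so far has already been patched in; the held-out score is the one that says something about new people.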
 
Simon Bridge said:
So to start with, your model assumes that someone is one or the other ... e.g. maybe A=gay and B=straight ... and you cannot be bisexual.

So you discover with testing that the initial model needs to be refined to account for those mostly straight people who have experimented with same-sex relationships in college or something?

Maybe you noticed that some people with mostly blue eyes have a few brown flecks in them, or that some people who are basically dark-skinned are not exactly black either.


You need to change your model. Perhaps you need to set A and B as opposite ends of a scale.
You also need to define what you mean by "accuracy rating".
But your question is too general.

To give more detail, in the example above, you could be bisexual. I was just trying to keep it simple. There are actually A, B, C, D, and E; they are on a scale going from A to E, placing people somewhere in between or on the ends.

By accuracy rating, I mean: if I incorporate enough subjects, would the program become more accurate for the world's population as a whole?
 
The bigger your sample, the more representative the results will be of the population responses - if you were to test the entire population. Does that mean it is "accurate"? Depends what you mean by "accurate". Depends how you choose your sample.
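
As a rough illustration of how sample size limits what a test result can show, the sketch below computes an approximate 95% Wilson score interval for the true accuracy, given how many test subjects were classified correctly. It assumes the test subjects are a random sample of the population and were not used to tune the rules; the function name wilson_interval and the choice of z = 1.96 are assumptions of this sketch, not something from the thread.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Approximate Wilson score interval for a proportion.

    correct: number of test subjects classified correctly
    n:       number of test subjects
    z:       normal quantile (1.96 is roughly a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = correct / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


for correct, n in [(10, 10), (90, 100)]:
    low, high = wilson_interval(correct, n)
    print(f"{correct}/{n} correct: true accuracy plausibly in [{low:.2f}, {high:.2f}]")
```

Under these assumptions, 10 out of 10 correct still leaves a lower bound of about 0.72, so it does not establish "above 80%", whereas 90 out of 100 gives a lower bound of about 0.83. If the sample is not chosen randomly, no interval of this kind carries over to the general population.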

You also need a model that accounts for the continuum between the types.
Height, for example, is pretty much continuous, but someone recorded as 185cm tall is actually somewhere between 184.5cm and 185.5cm.
You could be coarser than that, using wider ranges for "small", "medium", "tall", "giant" etc.
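
A minimal sketch of that idea, assuming the program first produces a single score between 0 and 1 and then maps it onto the five types A to E mentioned earlier in the thread. The cut points and the names LABELS, CUTS and to_category are hypothetical, chosen only for illustration.

```python
import bisect

LABELS = ["A", "B", "C", "D", "E"]
# Hypothetical cut points between neighbouring categories (equal-width bins).
CUTS = [0.2, 0.4, 0.6, 0.8]


def to_category(score):
    """Map a continuous score in [0, 1] to one of the labels A..E."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    # bisect_right counts how many cut points are <= score,
    # which is exactly the index of the bin the score falls into.
    return LABELS[bisect.bisect_right(CUTS, score)]


print(to_category(0.05))  # A
print(to_category(0.5))   # C
print(to_category(0.95))  # E
```

Keeping the underlying score and only binning it at the end means the bin widths can be changed later without re-measuring anyone.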
 
Thanks for your help Simon, after some work on the dry erase board, I've found an answer.
 
Well done. I use a windowpane myself ;)
 
