
How can I visualize this data?

  1. Apr 10, 2012 #1


    Gold Member

    I've acquired data on about 14,000 users who participated in my questionnaire. I have their questionnaire answers (98 questions, rated 1 to 5) as well as some user demographic data such as gender, age and honesty.

    It's all sitting in a MySQL database on my web host. I can write some queries on it, and I can export it to one of many formats (OpenDoc Calc or simply CSV), but I'm a bit overwhelmed about how I might extract all the useful, interesting information from it.

    My initial queries filter out incomplete questionnaires and dummy users. Then I can filter on gender by age:
    Code (Text):
    SELECT COUNT(*) FROM `Users` WHERE `Completed` = 7 AND `Dummy` IS NULL AND `Gender`= 1 AND `Age` = 2
    This gets me the number of male users in their 20s.

    But this is going to get tedious. Additionally, I want to display these results visually using graphs. (I've tried to figure out OpenDoc Calc's graphing feature but it's not going well.)

    I know there are a million ways to do what I want, but I'd like your opinions on how to do it easily. Basic is better than advanced. I don't plan to do too much fancy manipulation.
  3. Apr 11, 2012 #2
    I would export the complete data, e.g. to CSV, and import it into R. R is a very powerful tool for statistical computations, so I would use it to analyze the data. In R, you can select data according to categories, values etc. See, e.g.
  4. Apr 11, 2012 #3
    For simple plotting without too complex analysis, try Gnuplot on the CSV export. It is easy to use, well documented and for this kind of analysis I always like to write scripts that I can copy/paste and modify to my liking rather than click-orgies. Gnuplot can produce publication-quality PDF or EPS. It is free and runs on Windows and Linux.

    If you want to do real statistical analysis, correlations etc. then R sounds like the way to go, but I have not used that myself.
  5. Apr 11, 2012 #4
    I think he's looking for advice on the best way to visualize it. Presumably he doesn't find it difficult to actually generate the plots.

    With data that high dimensional, visualization is difficult. You could plot individual questions or pairs of questions, but I imagine that's not what you're interested in. I have very little experience with analyzing questionnaires, but I'd probably do some sort of factor or cluster analysis and look for large scale differences between groups. Visualizing every question at once seems too difficult.
  6. Apr 11, 2012 #5
    With your goal of keeping it simple, I would export the data to a CSV file, import it into Excel, and use a pivot table for summaries. If you want serious statistical analysis, R is a good choice.
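
    As a rough idea of what a pivot table computes, here is a minimal Python sketch (standard library only) that counts users for every age/gender combination in one pass, instead of running one query per combination. The column names come from the query in the first post; the sample rows, the codes other than Gender = 1 and Age = 2, and the assumption that NULL exports as an empty field are all made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the CSV export of the Users table.
# Column names follow the query in the first post; the numeric codes
# beyond Gender = 1 (male) and Age = 2 (20s) are assumptions.
SAMPLE_CSV = """Gender,Age,Completed,Dummy
1,2,7,
1,2,7,
2,2,7,
2,3,7,
1,3,7,
1,2,3,
2,2,7,1
"""

def crosstab(rows):
    """Count completed, non-dummy users for every (Age, Gender) pair."""
    counts = Counter()
    for row in rows:
        # Same filter as the SQL: Completed = 7 AND Dummy IS NULL.
        # An empty CSV field is treated as NULL here (an assumption).
        if row["Completed"] == "7" and row["Dummy"] == "":
            counts[(row["Age"], row["Gender"])] += 1
    return counts

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
table = crosstab(rows)

ages = sorted({age for age, _ in table})
genders = sorted({gender for _, gender in table})

# Print one row per age code and one column per gender code.
print("Age  " + "  ".join(f"G{g:>3}" for g in genders))
for age in ages:
    print(f"{age:>3}  " + "  ".join(f"{table[(age, g)]:>4}" for g in genders))
```

    With a real export you would replace the sample string with the opened CSV file. The resulting table is the same kind of summary a spreadsheet pivot table gives you, ready to be charted.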
  7. Apr 11, 2012 #6


    Gold Member

    Not a good assumption. :smile:

    I really am looking for ways to generate the plots. Not quite Graphing for Dummies - the software, but something with not too steep a learning curve. This is new for me.

    No, not too complex, simple breakdown is fine. I'll want to show for example,
    - total breakdown by age and gender
    - the breakdown of answers for the given questions by age and gender, etc.
    But there's lots of these. I can see generating a hundred graphs or more, easy. (After all there's 98 questions)

    I brought it in to Open Office Calc.

    I will look up what a pivot table is, and how to generate one in OOC.

  8. Apr 11, 2012 #7


    Gold Member

    OK, well one of the nice things about click-orgies is that I can get past File:Open without having to sit down with the manual... :grumpy:

    You have a strange definition of "easy to use"... :tongue:
    Last edited: Apr 11, 2012
  9. Apr 11, 2012 #8


    Science Advisor

    Hey DaveC426913.

    I would also recommend R and Gnuplot, particularly R if you want to get a plot up very quickly.

    To read in a CSV, you use the command read.csv. You can then create new objects that take entire columns and throw them in a new object using something like [,1] which will grab the entire first column. So you can grab any two columns and throw it into a 2D data set and use plot to plot the data.

    With regards to filtering the data by your specific questions, you can write simple functions that retain data according to characteristics. One way to do this is to copy the input to another data object and then create a function to take in your filtering requirements and then remove any data that doesn't fit the requirements and copy it to a new data object and then plot that.
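
    The filtering idea is not specific to R. A minimal sketch in Python, used here only as a stand-in: `keep` and `answer_counts` are hypothetical helper names, "Q1" is an assumed name for a question column, and the rows are invented sample data.

```python
from collections import Counter

# Hypothetical rows, as they might look after reading the CSV export.
# "Q1" is an assumed name for a question column; answers are rated 1 to 5.
ROWS = [
    {"Gender": "1", "Age": "2", "Q1": "4"},
    {"Gender": "1", "Age": "2", "Q1": "5"},
    {"Gender": "2", "Age": "2", "Q1": "4"},
    {"Gender": "1", "Age": "3", "Q1": "1"},
    {"Gender": "1", "Age": "2", "Q1": "4"},
]

def keep(rows, **requirements):
    """Return a new list holding only the rows that match every requirement."""
    return [
        row for row in rows
        if all(row[column] == value for column, value in requirements.items())
    ]

def answer_counts(rows, question):
    """Tally how often each rating 1..5 was given for one question."""
    tally = Counter(row[question] for row in rows)
    return [tally[str(rating)] for rating in range(1, 6)]

men_in_20s = keep(ROWS, Gender="1", Age="2")
print(len(men_in_20s))                  # 3
print(answer_counts(men_in_20s, "Q1"))  # [0, 0, 0, 2, 1]
```

    The original rows are left untouched, so the same data can be filtered again with different requirements for the next plot.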

    Once you've got the function to generate the data and the plots, then edit a text file that calls each variation of the function that generates a data object and a plot for that particular 'question' and then execute that in the R console. You might even be able to pipe the output to a bitmap file which means that once you run the code it will automatically save all the results for you, but I don't know if you can do this (but if you can it will make your life very, very easy in comparison to if you couldn't).

    Also if you need to customize anything of the plot, you can supply your custom function with extra arguments for title data, axis data, scale data, color data and so on.
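
    On saving the results automatically: R can write plots straight to files through its png() and pdf() graphics devices, so that part does work. The batch idea itself looks roughly like the Python sketch below, which is only a stand-in for the R workflow. It draws text bar charts instead of real plots so that it needs nothing beyond the standard library; the column names "Q1" and "Q2", the sample rows and the output file name are assumptions.

```python
import os
import tempfile
from collections import Counter

# Hypothetical sample rows; "Q1" and "Q2" are assumed question column names.
ROWS = [
    {"Gender": "1", "Q1": "4", "Q2": "2"},
    {"Gender": "1", "Q1": "5", "Q2": "2"},
    {"Gender": "2", "Q1": "4", "Q2": "3"},
    {"Gender": "2", "Q1": "1", "Q2": "3"},
]
QUESTIONS = ["Q1", "Q2"]  # the real survey would list all 98 columns

def bar_chart(rows, question, title):
    """Render one question as a text bar chart, one line per rating 1..5."""
    tally = Counter(row[question] for row in rows)
    lines = [title]
    for rating in range(1, 6):
        count = tally[str(rating)]
        lines.append(f"  {rating} | {'#' * count} {count}")
    return "\n".join(lines)

def write_report(rows, questions, path):
    """Write a chart for every question and every gender code to one file."""
    genders = sorted({row["Gender"] for row in rows})
    charts = []
    for question in questions:
        for gender in genders:
            subset = [row for row in rows if row["Gender"] == gender]
            charts.append(bar_chart(subset, question, f"{question}, gender {gender}"))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n\n".join(charts) + "\n")
    return len(charts)

report_path = os.path.join(tempfile.gettempdir(), "survey_report.txt")
chart_count = write_report(ROWS, QUESTIONS, report_path)
print(f"wrote {chart_count} charts to {report_path}")
```

    The title argument of `bar_chart` plays the role of the extra arguments mentioned above: anything you want to vary per plot becomes a parameter, and the loop supplies it.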

    I have uploaded a short reference card for R and if you decide to use it, you'll find this pdf will help you get things done faster.


  10. Apr 11, 2012 #9


    Science Advisor
    Homework Helper

    IMO it is easy to use, but it's not easy to learn, especially if you want instant gratification.

    One big win with a command line interface is that after you have got one plot looking the way you want it, it's easy to create more that look the same - and repeat the process on different data sets - though you probably won't need that till you get the results of your follow-up survey.....

    A tip for working with big datasets: if you want to produce 98 (or more!) similar plots, make a script file that outputs them all to a PDF. Then you can browse through them, bookmark the interesting ones, etc, with your favorite PDF viewer, without the hassle of retyping the gnuplot commands.
  11. Apr 12, 2012 #10
    The biggest drawback of the File:Open approach is that you cannot prepare operations on plots and data in advance. Of course, you can open a file and click here and there 1e3 times, and it is (usually) intuitive, but you need to repeat it for every new data file. Even if you want to change something trivial in the data, you often need to start from scratch, throwing all plots away. That's where scripting wins. Once you learn some basics, which indeed takes some *short* time (half a day at most for learning principles), you start to be very effective. Repeatedly. This is an investment into the future :-)
  12. Apr 12, 2012 #11
    Code (Text):
     set datafile separator ','
     plot 'data.csv' using 1:2
    sounds pretty easy to me :-)
    Code (Text):
     help plot
    gets you the help text, and you can google up plenty of examples.

    What I like about scripts is that you can start out very basic and then add complexity, cosmetics etc around it until your figure is ready for PRL.

    The second advantage kicks in after you've used it a bit: you can recycle plots and analysis scripts (gnuplot is pretty good for fitting, too), especially if you have to fit the same scan for 100 different temperatures. Click that!

    Third, examples of scripts are easier to follow (or copy/paste) than descriptions of where to click when.

    But whatever program you use most is always the easiest to use.