# Homework Help: Experiment/Principal Components - Unsupervised Learning

1. Nov 4, 2013

### brojesus111

1. The problem statement, all variables and given/known data

A researcher collects expression measurements for 1,000 genes in 100 tissue samples. The data can be written as a 1,000 × 100 matrix, which we call X, in which each row represents a gene and each column a tissue sample. Each tissue sample was processed on a different day, and the columns of X are ordered so that the samples that were processed earliest are on the left, and the samples that were processed later are on the right. The tissue samples belong to two groups: control (C) and treatment (T). The C and T samples were processed in a random order across the days. The researcher wishes to determine whether each gene’s expression measurements differ between the treatment and control groups.

As a pre-analysis (before comparing T versus C), the researcher performs a principal component analysis of the data, and finds that the first principal component (a vector of length 100) has a strong linear trend from left to right, and explains 10% of the variation. The researcher now remembers that each patient sample was run on one of two machines, A and B, and machine A was used more often in the earlier times while B was used more often later. The researcher has a record of which sample was run on which machine.

(a) The researcher decides to replace the (i, j)th element of X with

x_ij − z_i1 φ_j1

where z_i1 is the ith score, and φ_j1 is the jth loading, for the first principal component. He will then perform a two-sample t-test on each gene in this new data set in order to determine whether its expression differs between the two conditions. Critique this idea, and suggest a better approach.

(b) Design and run a small simulation experiment to demonstrate the superiority of your idea.
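To make the subtraction in (a) concrete, here is a minimal NumPy sketch of what replacing x_ij with x_ij − z_i1 φ_j1 does mechanically. It is not part of the original problem. It assumes rows are observations (indexed by i) and columns are features (indexed by j), which may be the transpose of the layout described above, and the random `data` matrix is a hypothetical stand-in for real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 observations (rows, index i) by
# 1,000 features (columns, index j). This orientation is an assumption.
n_obs, n_features = 100, 1000
data = rng.normal(size=(n_obs, n_features))

# PCA operates on column-centered data.
centered = data - data.mean(axis=0)

# Thin SVD: centered = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

scores_pc1 = U[:, 0] * s[0]   # z_i1: one score per observation
loadings_pc1 = Vt[0]          # phi_j1: one loading per feature

# The proposed replacement, applied to every (i, j) at once.
deflated = centered - np.outer(scores_pc1, loadings_pc1)

# Projection of the new data onto the first loading vector.
residual_scores = deflated @ loadings_pc1
```

In this sketch `residual_scores` is zero up to rounding: the subtraction removes everything that lies along the first principal component, whatever its source, and leaves the other directions untouched.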

3. The attempt at a solution

I'm just not sure what's going on in this problem. I'm pretty sure there's something wrong with how he decides to replace the (i,j)th element of X, but I'm not sure what. What is he accomplishing with his subtraction?

I'm assuming my simulation should be based on my approach from part (a), but does that mean I have to make up some fake data? Will any fake data work?
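For what it's worth, fake data that mirrors the setup in the problem might be generated along these lines. This is only a sketch under assumptions of mine, not something given in the problem: the probability ramp for machine B, the effect sizes, and the choice of 50 truly different genes are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 1000, 100

# Columns are ordered by processing day. Machine B becomes more likely
# on later days (assumed linear ramp from 0.1 to 0.9).
p_machine_b = np.linspace(0.1, 0.9, n_samples)
machine_b = (rng.random(n_samples) < p_machine_b).astype(float)

# Treatment/control labels in random order across days: 50 of each.
treated = rng.permutation(np.repeat([0.0, 1.0], n_samples // 2))

# Assumed effects: each gene shifts by its own amount on machine B,
# and only the first 50 genes truly differ between T and C.
machine_shift = rng.normal(0.0, 1.0, size=n_genes)
treatment_shift = np.zeros(n_genes)
treatment_shift[:50] = 1.0

# Genes in rows, samples in columns, as in the problem statement.
X = (rng.normal(size=(n_genes, n_samples))
     + np.outer(machine_shift, machine_b)
     + np.outer(treatment_shift, treated))

# First principal component across samples (a vector of length 100),
# computed after centering each gene.
centered = X - X.mean(axis=1, keepdims=True)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc1 = Vt[0]

# How closely PC1 tracks the machine indicator (the sign of a PC is arbitrary).
corr_with_machine = abs(np.corrcoef(pc1, machine_b)[0, 1])
```

Because the truth is known here (which genes really differ, which samples ran on which machine), any per-gene testing procedure can be scored against it.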

I appreciate any help.

2. Nov 5, 2013

Anyone? :/