# Linear regression with two data sets?

I want to try to predict USA summer highs using linear regression. I know I can probably take data from the last 10 summers and use that to predict, but I'd like to use two data sources: first, the historical highs from past summers in the USA, and second, the historical highs from past summers in another country, such as Australia, which has the opposite seasons to the USA. Is it possible to fit a linear regression to two data sources and use both of them to predict a number?

Do you think this would be more accurate? Or is it just as accurate as using one data source, like historical USA highs?

Svein
Linear regression will give you a number, but that number will have little or nothing to do with what you are looking for. Natural temperature variations (day/night, summer/winter) are cyclic, and extremes are more or less random.
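To make the point about cyclic data concrete, here is a minimal sketch. It assumes a purely synthetic temperature series (a sine wave with a 365-day period, not real measurements) and fits a straight line to it with `numpy.polyfit`. The names `t`, `temp` and `r2` are made up for the illustration.

```python
import numpy as np

# Synthetic daily temperatures over ten 365-day cycles (illustrative only, not real data).
t = np.arange(3650)
temp = 15.0 + 10.0 * np.sin(2.0 * np.pi * t / 365.0)

# Fit a straight line temp ~ slope * t + intercept.
slope, intercept = np.polyfit(t, temp, 1)
fitted = slope * t + intercept

# Coefficient of determination: the fraction of variance the line explains.
ss_res = np.sum((temp - fitted) ** 2)
ss_tot = np.sum((temp - temp.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"slope = {slope:.6f} per day, R^2 = {r2:.4f}")
```

Under these assumptions the line explains almost none of the variance, because the seasonal swing averages out over whole cycles. Real data would also contain noise and possibly a trend, so this only illustrates the cyclic part of the argument.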

Combining temperatures in Australia with temperatures in the USA is meaningless. Think about it - northern Australia is quite close to the equator, a large part of central Australia is desert, etc...

chiro
Hey Josh Terrill.

You can do it - but you have to specify the regression model to combine them.

The simplest way to combine information is a weighted sum, w1*x1 + w2*x2 with w1 + w2 = 1, but there are many other ways to combine information (based on the different kinds of functions you can think of).
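As a rough sketch of the weighted-sum idea, the constraint w1 + w2 = 1 lets you substitute w2 = 1 - w1, which leaves a one-parameter least-squares problem: (y - x2) ≈ w1*(x1 - x2). The function name `fit_convex_weight` and the numbers below are hypothetical, chosen only for illustration, and the sketch assumes x1, x2 and y are already paired observation by observation.

```python
import numpy as np

def fit_convex_weight(x1, x2, y):
    """Least-squares estimate of w1 in the model y ~ w1*x1 + (1 - w1)*x2."""
    x1, x2, y = (np.asarray(a, dtype=float) for a in (x1, x2, y))
    d = x1 - x2
    # With w2 = 1 - w1 the model becomes (y - x2) ~ w1 * (x1 - x2),
    # so w1 is the slope of a regression through the origin.
    return float(np.dot(d, y - x2) / np.dot(d, d))

# Made-up values, not real temperature data.
x1 = np.array([30.0, 32.0, 31.0, 35.0])
x2 = np.array([40.0, 38.0, 41.0, 39.0])
y = 0.7 * x1 + 0.3 * x2  # constructed so that the true weights are 0.7 and 0.3

w1 = fit_convex_weight(x1, x2, y)
w2 = 1.0 - w1
print(w1, w2)
```

Note that this fit does not force the weights to lie between 0 and 1; if that matters, the estimate would need to be clipped or fitted with a constrained solver.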

You will have to have an understanding of the differences and context between the different data sets and how they standardize against each other with respect to the variable you are making inferences on.

This will probably mean adjusting for things like season, and other geographical factors.

Without any context or domain knowledge for your data, an extended response is not possible.

Stephen Tashi
> I want to try to predict the USA summer highs using a linear regression.

You should explain clearly what you want to do. You use the plural "highs". This suggests there is some aspect of time involved. For example, perhaps you are trying to predict the maximum temperature on each day of the summer. You say "USA". It isn't clear whether you mean to get a single high temperature for the entire USA, whether you are interested in one particular city, or whether you want to predict the daily max temperature for each major city in the US.

When you speak of "using a linear regression", this suggests using a single model that consists of a single linear equation. However, perhaps you'd also consider using different models for different situations. For example, you might use one equation to predict the high temperature in Greensboro NC on June 12 based on historical high temperatures for other cities, and use a different equation to predict the high temperature for a different date or for a different US city.

> Is it possible to do a linear regression from two data sources, and use both of them to predict a number?

It may be mathematically possible, depending on what the data and the equation actually are.
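One way it can be possible, sketched under assumptions: if each observation of the quantity being predicted can be paired with one value from each source (for example, one row per year), then the two sources are simply two predictors in a multiple linear regression y ≈ c0 + c1*a + c2*b. The helper names `fit_two_source_regression` and `predict`, the pairing scheme, and all numbers below are hypothetical; the fit uses `numpy.linalg.lstsq`.

```python
import numpy as np

def fit_two_source_regression(a, b, y):
    """Ordinary least squares for y ~ c0 + c1*a + c2*b; returns [c0, c1, c2]."""
    a, b, y = (np.asarray(v, dtype=float) for v in (a, b, y))
    # Design matrix: a column of ones for the intercept, then one column per source.
    X = np.column_stack([np.ones_like(a), a, b])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict(coef, a, b):
    """Evaluate the fitted linear equation at new predictor values."""
    return coef[0] + coef[1] * a + coef[2] * b

# Made-up paired values (one row per year), not real temperature data.
source_a = np.array([90.0, 92.0, 95.0, 91.0, 94.0])
source_b = np.array([100.0, 104.0, 101.0, 103.0, 102.0])
target = 5.0 + 0.8 * source_a + 0.1 * source_b  # constructed, noise-free

coef = fit_two_source_regression(source_a, source_b, target)
print(coef)
print(predict(coef, 93.0, 102.0))
```

Because the target here was constructed from the predictors, the fit recovers the coefficients exactly. With real data, whether the second source improves the prediction at all is an empirical question that would have to be checked, for instance on held-out years.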

A more general question is whether we can increase the reliability of predictions by using information that seems, at first sight, to be irrelevant or only indirectly relevant to what is being predicted. A lot has been written about this problem, but I can't summarize it as a simple set of instructions.

Svein