Getting this Array to be in 2D instead of 1D for Python Linear Regression

WWGD · Dec 4, 2019

Python:

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
import pandas as pd

# Load CSV and columns
df = pd.read_csv(r"C:\Housing.csv")

Y = df['price']
X = df['lotsize']
# Split the data into training/testing sets
X_train = X[:-250]
X_test = X[-250:]

# Split the targets into training/testing sets
Y_train = Y[:-250]
Y_test = Y[-250:]

# Plot outputs
plt.scatter(X_test, Y_test,  color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())

plt.show()

That worked well and produced a plot of the data. But the next batch is giving me problems. I asked on Stack Overflow, but they just referred me to other answers I had already tried.

Python:

regr = linear_model.LinearRegression()

X_train = X[:-250]
X_test = X[-250:]

# Split the targets into training/testing sets
Y_train = Y[:-250]
Y_test = Y[-250:]

# Train the model using the training sets
regr.fit(X_train, Y_train)
X_test.reshape(-1,1)
Y_test.reshape(-1,1)

I added the 'reshape' calls after getting this error message:

ValueError: Expected 2D array, got 1D array instead:
array=[ 5850. 4000. 3060. 6650. 6360. 4160. 3880. 4160. 4800. 5500.
7200. 3000. 1700. 2880. 3600. 3185. 3300. 5200. 3450. 3986.
4785. 4510. 4000. 3934. 4960. 3000. 3800. 4960. 3000. 4500.
3500. 3500. 4000. 4500. 6360. 4500. 4032. 5170. 5400. 3150.
3745. 4520. 4640. 8580. 2000. 2160. 3040. 3090. 4960. 3350.
5300. 4100. 9166. 4040. 3630. 3620. 2400. 7260. 4400. 2400.
4120. 4750. 4280. 4820. 5500. 5500. 5040. 6000. 2500. 4095.
4095. 3150. 1836. 2475. 3210. 3180. 1650. 3180. 3180. 6360.
4240. 3240. 3650. 3240. 3780. 6480. 5850. 3150. 3000. 3090.
6060. 5900. 7420. 8500. 8050. 6800. 8250. 8250. 3500. 2835.
4500. 3300. 4320. 3500. 4992. 4600. 3720. 3680. 3000. 3750.
5076. 4500. 5000. 4260. 6540. 3700. 3760. 4000. 4300. 6840.
4400. 10500. 4400. 4840. 4120. 4260. 5960. 8800. 4560. 4600.
4840. 3850. 4900. 3850. 3760. 6000. 4370. 7700. 2990. 3750.
3000. 2650. 4500. 4500. 4500. 4500. 2175. 4500. 4800. 4600.
3450. 3000. 3600. 3600. 3750. 2610. 2953. 2747. 1905. 3968.
3162. 6000. 2910. 2135. 3120. 4075. 3410. 2800. 2684. 3100.
3630. 1950. 2430. 4320. 3036. 3630. 5400. 3420. 3180. 3660.
4410. 3990. 4340. 3510. 3420. 3420. 5495. 3480. 7424. 3460.
3630. 3630. 3480. 3460. 3180. 3635. 3960. 4350. 3930. 3570.
3600. 2520. 3480. 3180. 3290. 4000. 2325. 4350. 3540. 3960.
2640. 2700. 2700. 3180. 3500. 3630. 6000. 3150. 3792. 3510.
3120. 3000. 4200. 2817. 3240. 2800. 3816. 3185. 6321. 3650.
4700. 6615. 3850. 3970. 3000. 4352. 3630. 3600. 3000. 3000.
2787. 3000. 4770. 3649. 3970. 2910. 3480. 6615. 3500. 3450.
3450. 3520. 6930. 4600. 4360. 3450. 4410. 4600. 3640. 6000.
5400. 3640. 3640. 4040. 3640. 3640. 5640. 3600. 3600. 4632.
3640. 4900. 4510. 4100. 3640. 5680. 6300. 4000. 3960. 5960.
5830. 4500. 4100. 6750. 9000. 2550. 7152. 6450. 3360. 3264.
4000. 4000. 3069. 4040. 4040. 3185. 5900. 3120. 5450. 4040.
4080. 8080. 4040. 4080. 5800. 5885. 9667. 3420. 5800. 7600.
5400. 4995. 3000. 5500. 6450. 6210. 5000. 5000. 5828. 5200.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Python:

# Plot outputs
plt.plot(X_test, regr.predict(X_test))

Mark44 · Dec 4, 2019

I would start by Reading The Fershlugginer Manual: https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html
I'm assuming that the reshape function is part of the numpy library you're importing.

If a is a single-dimension array, then np.reshape(a, (20, 10)) reshapes a to an array with 20 rows and 10 columns. You can have one of the parameters as -1, so the size of that dimension is inferred from the number of data in the array. Personally, I would want to know the dimensions of the two-d array I'm reshaping to, before calling reshape.

I'm not sure what happens if the quantity of data doesn't align with the sizes you pick for the parameters. For example, if you have 26 data values, and you want 4 rows, that might be a problem, because 26 isn't evenly divisible by 4.
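
To make the reshape behaviour concrete, here is a small sketch using a made-up array of 200 integers (the sizes are assumptions chosen to match the 20 by 10 example above). It also tries the case of 26 values into 4 rows: NumPy refuses with a ValueError rather than padding or truncating. Note too that reshape hands back a new array object and leaves the original's shape alone, so the result has to be assigned to something.

```python
import numpy as np

# Made-up 1D data: 200 values, so 20 rows x 10 columns fits exactly.
a = np.arange(200)

grid = np.reshape(a, (20, 10))   # explicit dimensions
column = a.reshape(-1, 1)        # -1 is inferred as 200: one feature per row
row = a.reshape(1, -1)           # -1 is inferred as 200: a single sample

# reshape does not modify 'a' in place; 'a' keeps its 1D shape.
print(a.shape, grid.shape, column.shape, row.shape)

# 26 values cannot be split into 4 equal rows, so NumPy raises ValueError.
try:
    np.arange(26).reshape(4, -1)
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
print(mismatch_raises)
```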

pbuk · Dec 5, 2019

Unfortunately sklearn has not given you a very helpful error message. Python arrays (or rather array-like lists) do not have a .reshape method - as @Mark44 says, .reshape is a method of the numpy library, so you need to use it like regr.fit(np.reshape(X_train, (-1, 1)), Y_train)
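
For anyone landing here with the same ValueError, a minimal sketch of the fix follows. It uses made-up numbers in place of the 'lotsize' and 'price' columns (Housing.csv is not reproduced here), and the variable names are only illustrative. The two points it tries to show are that the feature array is made 2D before fit is called, and that the reshaped result is kept in a variable, since reshape returns a new array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up stand-ins for the 'lotsize' and 'price' columns (assumed values).
lot_size = np.array([3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 8000.0])
price = 10.0 * lot_size + 5000.0

# scikit-learn expects features with shape (n_samples, n_features),
# so a single feature becomes a column with one entry per row.
X = lot_size.reshape(-1, 1)

# Split after reshaping so the training and test sets are both 2D.
X_train, X_test = X[:-2], X[-2:]
y_train, y_test = price[:-2], price[-2:]

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

With a pandas dataframe, selecting the column with double brackets, df[['lotsize']], also gives a 2D object that fit accepts. The target (Y) can stay 1D.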

WWGD · Dec 5, 2019

Thank you all, in case someone else is interested, this different setup worked much more smoothly:

I am not following any manual, just piecing together bits here and there.

Mark44 · Dec 5, 2019

WWGD said:

I am not following any manual, just piecing together bits here and there.

But you should take a look at the documentation of any library functions they use in whatever it is that you're following.

WWGD · Dec 5, 2019

Mark44 said:

But you should take a look at the documentation of any library functions they use in whatever it is that you're following.

I do this sort of weird top-down, bottom-up approach.

Still, my previous attempt was not working, and I figured out, using [name].isnull(), that there were some missing values below the 500th row. I reassigned my dataframe to df.head(500) and the regression went through. It was running before, but the algorithm was not converging, I presume because of the missing values.

But I will look at the documentation, @Mark44
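
As a rough sketch of the missing-value check described above: the small dataframe below is invented for illustration (it is not the housing file), and dropping incomplete rows is just one alternative to truncating with head(500), not necessarily the better choice for every dataset.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the housing data, with gaps in two rows.
frame = pd.DataFrame({
    "lotsize": [5850.0, 4000.0, 3060.0, np.nan, 6360.0],
    "price": [42000.0, 38500.0, 49500.0, 60500.0, np.nan],
})

# Count missing values per column instead of scanning the full table by eye.
missing_per_column = frame.isnull().sum()
print(missing_per_column)

# Keep only the rows where both columns used in the regression are present.
complete_rows = frame.dropna(subset=["lotsize", "price"])
print(len(complete_rows))
```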

WWGD · Dec 5, 2019

@Mark44: I did a full simple linear regression. It took me two full days to figure out and I am telling everyone. The fruit vendor guy, the dry cleaner, everyone. May I post the full code in case someone is interested?

Mark44 · Dec 5, 2019

Sure, I don't see why not.

WWGD · Dec 5, 2019

OK, you need to have an Excel .xlsx or a .csv file. Mine is called 'Housing.csv'.

Python:

from numpy import *
import sklearn
import pandas as pd
from scipy.interpolate import *
df=pd.read_csv("C:/Housing.csv")

#First batch

df

df1=df.head(500)

#File had missing values above row 500 , so I shortened it.

df1

#Now checking to see if I there are still missing values left:

df1.isnull()

#Did not see any ( I am not using column 13, which does have nulls)

X=df1['bathrms'].values
Y=df1['stories'].values
p1=polyfit(X,Y,1)

p1

#Now printing fitted values (fitted by regression line)

Yfit=p1[0]*X +p1[1]
print(Yfit)

#Printing out analysis of regression:

Yresid=Y-Yfit
SSResid=sum(pow(Yresid,2))
SSTotal=len(Y)*var(Y)
rsq=1 -SSResid/SSTotal

print(rsq)
print(Yresid)
print(SSTotal)

#Plotting residuals:
from matplotlib import *
import matplotlib.pyplot as plt
plt.scatter(Y, Yresid,  color='black')
plt.show

Mark44 · Dec 5, 2019

What are lines 9, 15, and 27 doing? Should they instead be comments?

WWGD · Dec 5, 2019

Mark44 said:

What are lines 9, 15, and 27 doing? Should they instead be comments?

They just output the variables, e.g., df will show the values of the dataframe, all rows, columns. I guess it does this without the need to use 'print'.

pbuk · Dec 6, 2019

Python:

from numpy import *

import * is not a good idea, you'll see one reason why on the next line, but the more important reason comes later...

Python:

import sklearn

Your code doesn't use sklearn so you should drop this line. If you had done from sklearn import * you wouldn't be able to tell whether you were using it or not.

Python:

import pandas as pd

OK, but I would use from pandas import read_csv - this makes it much easier to see why you are importing the pandas module and therefore to reuse the code.

Python:

from scipy.interpolate import *

Python:

df=pd.read_csv("C:/Housing.csv")

These may seem like nit-picking, but trust me, if you adopt these habits you will both make fewer errors and make it easier to find errors that you do make.

use spaces around operators, including the assignment operator
stick to using either double or single quotes for string literals (later on you use single quotes) - in Python it doesn't matter which, but pick one and stick to it. Most Python code I see uses single quotes.
use meaningful variable names. df may mean something to you now, but a dataframe is a pandas concept, and as you are about to rely on some of this class's methods I would be more explicit, and use a comment to point this out. So we have instead

Python:

# Read the data into a pandas dataframe object.
data = pd.read_csv('C:/Housing.csv')

Moving on...

Python:

#First batch

df

Again a space after the # makes things more readable, but don't leave a blank line between the comment and the line it explains.

EDIT: later on in the thread (see #23) it became clear that the OP was not a Python script, it was input to an interactive REPL session. The following comment assumed that pandas implemented a magic method to print when accessing a dataframe in this way in a script which is not the case.

Now this is where I have to say I don't like pandas. Reading this code, it is anyone's guess what df does - @Mark44 certainly had no clue. So say what it does in a comment:

Python:

# Print the whole dataframe to stdout.
data

Moving on again...

Code:

# Data file has missing values past row 500 so truncate it and print to stdout.
truncated = data.head(500)
truncated

Comments should be before the related code, not after. I've also tidied up the next few lines...

Code:

# Print a table of null values to stdout to check for any more missing values.
truncated.isnull()

# Did not see any except in column 13 which I am not using.
X = truncated['bathrms'].values
Y = truncated['stories'].values

Now we come to the other reason for not using import *. Is polyfit a built-in Python function? No, perhaps it comes from numpy. Or maybe it comes from scipy.interpolate? If my code doesn't work, maybe I intended to use numpy.polyfit but this was overwritten by scipy.interpolate.polyfit which I didn't know existed? Or even worse, maybe my code works today, but in a year's time when a new release of scipy comes out implementing interpolate.polyfit it stops working.

So at the top of the file we should have done

Python:

# polyfit is provided by numpy (it is not part of scipy.interpolate).
from numpy import polyfit

and then we can safely do (with some standard spacing)...

Python:

# Fit a polynomial curve of degree 1 (i.e. linear regression) and print it to stdout.
fit = polyfit(X, Y, 1)
fit

Wait a minute, will this print anything out? I'll leave this for you to work out and come to your own opinion on objects with magic side-effects like pandas dataframes.

For now, I'm just going to rewrite this...

Python:

# Fit a polynomial curve of degree 1 (i.e. linear regression).
(fitSlope, fitIntercept) = polyfit(X, Y, 1)

# Now print fitted values.
Yfit = fitSlope * X + fitIntercept
print(Yfit)

# Print out analysis of regression.
Yresiduals = Y - Yfit
sumSquaredResiduals = sum(pow(Yresiduals, 2))
# np.var assumes "import numpy as np" at the top of the file.
sumSquaredTotal = len(Y) * np.var(Y)
rSquared = 1 - sumSquaredResiduals / sumSquaredTotal

print(rSquared)
print(Yresiduals)
print(sumSquaredTotal)

#Plotting residuals:
from matplotlib import *
import matplotlib.pyplot as plt

Woah there, imports should all be at the top (why do you think this is a good idea?) and think about what you actually want to import.

Python:

plt.scatter(Y, Yresiduals, color='black')
plt.show

Well the last line is definitely not going to work - what should it be? Do you see how you have got into this habit?

To avoid most of these problems, you should use a code linter in your IDE - Pylint is the de facto standard for Python.
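
Pulling the points above together, one possible shape for the whole calculation is sketched below. The six data points are invented so that it runs without Housing.csv, the snake_case names follow PEP 8 rather than the names used earlier, and the plotting step is left out; treat it as an illustration of the habits discussed, not as the definitive version.

```python
import numpy as np

# Invented data standing in for the 'bathrms' and 'stories' columns.
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0])

# Fit a polynomial of degree 1 (a straight line).
# np.polyfit returns the coefficients with the highest degree first.
slope, intercept = np.polyfit(x, y, 1)

# Fitted values and residuals.
fitted = slope * x + intercept
residuals = y - fitted

# Coefficient of determination from the residual and total sums of squares.
sum_squared_residuals = np.sum(residuals ** 2)
sum_squared_total = len(y) * np.var(y)
r_squared = 1 - sum_squared_residuals / sum_squared_total

print(slope, intercept, r_squared)
```

If the residual plot is added back, the final call needs parentheses, plt.show(), because plt.show on its own only names the function without calling it.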

Mark44 · Dec 6, 2019

@pbuk, excellent points all!

Ibix · Dec 6, 2019

pbuk said:

use spaces around operators, including the assignment operator

I agree with everything except this. Does it really help people? I just find it a waste of horizontal space...

WWGD · Dec 6, 2019

Thanks for the input, pbuk. Yes, I was just desperate to get some output after 3 days. Now I can start thinking about the elegance and efficiency of my code.

pbuk · Dec 6, 2019

Ibix said:

I agree with everything except this. Does it really help people? I just find it a waste of horizontal space...

Yes it does help, and white space is cheap compared with programmers' time.

Python:

# Compare...
some_variable=one_other_variable-other_variable
some_variable_one=other-variable_other_variable
# ... with
some_variable = one_other_variable - other_variable
some_variable_one = other - variable_other_variable

WWGD said:

Now I can start thinking of elegance and efficiency of my code.

No, this is not about elegance or efficiency, it is about writing code that stands a better chance of working first time and if it doesn't is easier to debug however inelegant or inefficient it is.

Ibix · Dec 6, 2019

pbuk said:

Yes it does help

I'll take your word for it, but neither of your examples seems easier to read than the other to me. Maybe I just have an unusual way of looking at code.

PeterDonis · Dec 6, 2019

Ibix said:

Does it really help people?

It certainly helps me with code readability.

Also, FWIW, for Python it's part of the standard PEP 8 code style guide:

https://www.python.org/dev/peps/pep-0008/#other-recommendations

WWGD · Dec 6, 2019

Just to clarify: my code is for "standard" (OLS) regression, not ML OLS regression. I am not sure how the two differ, but I believe they do. As I understand it, ML uses a loss function with a threshold value, partitions the dataset into training and test data, iterates over different partitions of the dataset, evaluates with the loss function whether the threshold is met for the given choice, keeps iterating if it is not (i.e., if the loss function value is above the chosen threshold), and stops otherwise. I am not sure how this differs from just using some program, e.g., SPSS, to conduct OLS and "just" spit out output: intercept, coefficients with confidence intervals, F-tests, etc.
EDIT: Maybe the two are the same, but programs like SPSS somehow "black-box" the iteration process?
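
One way to probe this question is to fit the same line twice: once with a direct least-squares solve and once with an iterative loop that keeps reducing a squared-error loss. The sketch below does that on synthetic data. The data, the learning rate, and the iteration count are all assumptions picked so that the loop converges, and plain gradient descent is used here only as a simple example of an iterative method. As far as I know, scikit-learn's LinearRegression takes the direct route for ordinary least squares, so the iteration described above is not needed for this particular model.

```python
import numpy as np

# Synthetic data: a noisy straight line (assumed slope 3, intercept 2).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=200)

# Direct route: solve the least-squares problem in one step.
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# Iterative route: gradient descent on the mean squared error.
slope, intercept = 0.0, 0.0
learning_rate = 0.01  # assumed; small enough for x in the range 0..10
for _ in range(20000):
    error = slope * x + intercept - y
    slope -= learning_rate * 2.0 * np.mean(error * x)
    intercept -= learning_rate * 2.0 * np.mean(error)

print(ols_slope, ols_intercept)
print(slope, intercept)
```

Both routes minimise the same sum of squared residuals, so they should agree on the coefficients up to numerical tolerance; the difference is in how the minimum is found, not in what is being minimised.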

Mark44 · Dec 6, 2019

pbuk said:

use spaces around operators, including the assignment operator

Ibix said:

I agree with everything except this. Does it really help people? I just find it a waste of horizontal space...

Some code I found that was posted here at PF. The coder was adept at not wasting horizontal space. Aside from the impenetrable variable names, is this really as easy to read as code for which there are spaces around operators?

Fortran:

psie(i,ifli)=psie(i,ifli)+exphel(i,j)*psib(j,ifli) 
err(i,1)=1.d2*( (dble(psie(i,1))-dble(psio(i)))**2  &
  &  + (imag(psie(i,1))-imag(psio(i)))**2)
psibare(i,ifli)=psibare(i,ifli)+exphel(i,j)*psibarb(j,ifli)
err(i,2)=1.d2*((dble(psibare(i,1))-dble(psibaro(i)))**2 &
  & +(imag(psibare(i,1))-imag(psibaro(i)))**2)

pbuk · Dec 6, 2019

OLS stands for "Ordinary Least Squares", and that is exactly what it is. It exists as a statistical technique because (i) it is easy to do, and (ii) it is possible to prove a number of things using it. However, in the real world, OLS often provides a sub-optimal fit because outliers tend to affect the parameters too much - exactly because you are squaring the difference between a sample value and its predicted value.

Because of this, in the real world we tend to modify our regression by applying a weighting function and/or thresholds to lessen the impact of outliers. When we have many independent variables (which again in the real world may not be truly independent) then similarly good fits may be obtained by very different combinations of coefficients. The buzz-word 'Machine Learning' in this context simply means that instead of simple calculations the code employs adaptive heuristic algorithms to attempt to fit a model to the data.

For a single independent variable, the best way to fit a line to data is to plot it and fit by eye: the human brain is pretty good at adaptive heuristics.
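
A small numerical sketch of the outlier point, under stated assumptions: the data are an exact line with one deliberately corrupted value, and the weight given to that value is chosen by hand purely for illustration (a real weighting function would normally be derived from the data). It uses the w parameter of np.polyfit, which applies weights to the residuals before they are squared.

```python
import numpy as np

# Exact line y = 2x + 1, then one corrupted observation at the end.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y_outlier = y.copy()
y_outlier[-1] += 50.0

# Ordinary least squares on the clean and on the corrupted data.
clean_slope, _ = np.polyfit(x, y, 1)
ols_slope, _ = np.polyfit(x, y_outlier, 1)

# Same fit with the suspect point down-weighted (hand-picked weight).
weights = np.ones_like(x)
weights[-1] = 0.01
weighted_slope, _ = np.polyfit(x, y_outlier, 1, w=weights)

print(clean_slope, ols_slope, weighted_slope)
```

A single bad point out of ten is enough to pull the unweighted slope far from 2, while the down-weighted fit stays close to it.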

Ibix · Dec 7, 2019

Mark44 said:

Some code I found that was posted here at PF. The coder was adept at not wasting horizontal space. Aside from the impenetrable variable names, is this really as easy to read as code for which there are spaces around operators?

Fortran:

psie(i,ifli)=psie(i,ifli)+exphel(i,j)*psib(j,ifli)
err(i,1)=1.d2*( (dble(psie(i,1))-dble(psio(i)))**2  &
  &  + (imag(psie(i,1))-imag(psio(i)))**2)
psibare(i,ifli)=psibare(i,ifli)+exphel(i,j)*psibarb(j,ifli)
err(i,2)=1.d2*((dble(psibare(i,1))-dble(psibaro(i)))**2 &
  & +(imag(psibare(i,1))-imag(psibaro(i)))**2)

Fortran:

psie(i,ifli) = psie(i,ifli) + exphel(i,j) * psib(j,ifli)
err(i,1) = 1.d2 * ((dble(psie(i,1)) - dble(psio(i)))**2 &
  &  + (imag(psie(i,1)) - imag(psio(i)))**2)
psibare(i,ifli) = psibare(i,ifli) + exphel(i,j) * psibarb(j,ifli)
err(i,2) = 1.d2 * ((dble(psibare(i,1)) - dble(psibaro(i)))**2 &
   & + (imag(psibare(i,1)) - imag(psibaro(i)))**2)

I don't see the improvement. The operators seem to me to stand out clearly anyway (they're very different shapes from letters), so wrapping them in whitespace doesn't add anything.

The multiple nested brackets are a huge issue for comprehensibility for me. Assuming I'm correctly interpreting the variables as arrays of complex numbers, and hazily recalling that arithmetic operations on complex types are allowed in Fortran, for the err() computations I'd almost certainly have created a variable called (e.g.) delta to store the difference and then taken the squared modulus of that in a separate line.

Perfectly happy to accept that I'm in the minority on the whitespace around operators issue. I'm surprised it seems to be quite such a small minority, though.

pbuk · Dec 7, 2019

pbuk said:

Python:

#First batch

df

Again a space after the # makes things more readable, but don't leave a blank line between the comment and the line it explains. Now this is where I have to say I don't like pandas. Reading this code, it is anyone's guess what df does - @Mark44 certainly had no clue. So say what it does in a comment:

Python:

# Print the whole dataframe to stdout.
data

I've just realized that this is nonsense - pandas DataFrames don't have any magic output methods; the OP was simply posting what he was entering in the Python REPL (aka interactive mode) as if it were a script. That won't work.

@WWGD the convention for posting REPL sessions is to show the >>> prompt like this:

Python:

>>> df

But when we write reusable code we are always writing a script, and now we don't need a comment because it is obvious what we are doing.

Python:

print(df)
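
The difference between the two settings can be reproduced without pandas. The sketch below is only an illustration: the helper run and the list bound to value are hypothetical, and it relies on the built-in compile function, whose 'single' mode is the one meant for interactive statements (it echoes the value of a bare expression), while 'exec' mode treats the source as an ordinary script.

```python
import contextlib
import io


def run(source, mode):
    """Run source in the given compile mode and return what it wrote to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, "<example>", mode), {"value": [1, 2, 3]})
    return buffer.getvalue()


# A bare expression in a script is evaluated and then discarded.
script_output = run("value", "exec")

# The same expression entered interactively is echoed back.
repl_output = run("value", "single")

# An explicit print behaves the same way in a script.
explicit_output = run("print(value)", "exec")
```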
