Getting this Array to be in 2D instead of 1D for Python Linear Regression

  • #1
WWGD
TL;DR Summary
Python code is producing data in 1D arrays instead of the 2D arrays I need for linear regression.
Python:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
import pandas as pd

# Load CSV and columns
df = pd.read_csv("C:/Housing.csv")

Y = df['price']
X = df['lotsize']
# Split the data into training/testing sets
X_train = X[:-250]
X_test = X[-250:]

# Split the targets into training/testing sets
Y_train = Y[:-250]
Y_test = Y[-250:]

# Plot outputs
plt.scatter(X_test, Y_test,  color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())

plt.show()
This worked out well and produced a plot of the data. But then the next batch is giving me problems. I tried Stack Overflow, but they just referred me to other answers I had already tried.

Python:
regr = linear_model.LinearRegression()

X_train = X[:-250]
X_test = X[-250:]

# Split the targets into training/testing sets
Y_train = Y[:-250]
Y_test = Y[-250:]

# Train the model using the training sets
regr.fit(X_train, Y_train)
X_test.reshape(-1,1)
Y_test.reshape(-1,1)
I appended the 'reshape' method after the error messages:

ValueError: Expected 2D array, got 1D array instead:
array=[ 5850. 4000. 3060. 6650. 6360. 4160. 3880. 4160. 4800. 5500.
7200. 3000. 1700. 2880. 3600. 3185. 3300. 5200. 3450. 3986.
4785. 4510. 4000. 3934. 4960. 3000. 3800. 4960. 3000. 4500.
3500. 3500. 4000. 4500. 6360. 4500. 4032. 5170. 5400. 3150.
3745. 4520. 4640. 8580. 2000. 2160. 3040. 3090. 4960. 3350.
5300. 4100. 9166. 4040. 3630. 3620. 2400. 7260. 4400. 2400.
4120. 4750. 4280. 4820. 5500. 5500. 5040. 6000. 2500. 4095.
4095. 3150. 1836. 2475. 3210. 3180. 1650. 3180. 3180. 6360.
4240. 3240. 3650. 3240. 3780. 6480. 5850. 3150. 3000. 3090.
6060. 5900. 7420. 8500. 8050. 6800. 8250. 8250. 3500. 2835.
4500. 3300. 4320. 3500. 4992. 4600. 3720. 3680. 3000. 3750.
5076. 4500. 5000. 4260. 6540. 3700. 3760. 4000. 4300. 6840.
4400. 10500. 4400. 4840. 4120. 4260. 5960. 8800. 4560. 4600.
4840. 3850. 4900. 3850. 3760. 6000. 4370. 7700. 2990. 3750.
3000. 2650. 4500. 4500. 4500. 4500. 2175. 4500. 4800. 4600.
3450. 3000. 3600. 3600. 3750. 2610. 2953. 2747. 1905. 3968.
3162. 6000. 2910. 2135. 3120. 4075. 3410. 2800. 2684. 3100.
3630. 1950. 2430. 4320. 3036. 3630. 5400. 3420. 3180. 3660.
4410. 3990. 4340. 3510. 3420. 3420. 5495. 3480. 7424. 3460.
3630. 3630. 3480. 3460. 3180. 3635. 3960. 4350. 3930. 3570.
3600. 2520. 3480. 3180. 3290. 4000. 2325. 4350. 3540. 3960.
2640. 2700. 2700. 3180. 3500. 3630. 6000. 3150. 3792. 3510.
3120. 3000. 4200. 2817. 3240. 2800. 3816. 3185. 6321. 3650.
4700. 6615. 3850. 3970. 3000. 4352. 3630. 3600. 3000. 3000.
2787. 3000. 4770. 3649. 3970. 2910. 3480. 6615. 3500. 3450.
3450. 3520. 6930. 4600. 4360. 3450. 4410. 4600. 3640. 6000.
5400. 3640. 3640. 4040. 3640. 3640. 5640. 3600. 3600. 4632.
3640. 4900. 4510. 4100. 3640. 5680. 6300. 4000. 3960. 5960.
5830. 4500. 4100. 6750. 9000. 2550. 7152. 6450. 3360. 3264.
4000. 4000. 3069. 4040. 4040. 3185. 5900. 3120. 5450. 4040.
4080. 8080. 4040. 4080. 5800. 5885. 9667. 3420. 5800. 7600.
5400. 4995. 3000. 5500. 6450. 6210. 5000. 5000. 5828. 5200.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
# Plot outputs
plt.plot(X_test, regr.predict(X_test))
 
  • #2
I would start by Reading The Fershlugginer Manual: https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html
I'm assuming that the reshape function is part of the numpy library you're importing.

If a is a single-dimension array, then np.reshape(a, (20, 10)) reshapes a to an array with 20 rows and 10 columns. You can have one of the parameters as -1, so the size of that dimension is inferred from the number of data in the array. Personally, I would want to know the dimensions of the two-d array I'm reshaping to, before calling reshape.

I'm not sure what happens if the quantity of data doesn't align with the sizes you pick for the parameters. For example, if you have 26 data values, and you want 4 rows, that might be a problem, because 26 isn't evenly divisible by 4.
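
To make the -1 behaviour and the divisibility question concrete, here is a small sketch using NumPy only. The 26 values are made up for illustration. As far as I know, NumPy refuses the reshape with a ValueError rather than padding or truncating when the sizes don't line up.

```python
import numpy as np

values = np.arange(26.0)  # 26 made-up data values, shape (26,)

# -1 asks NumPy to infer that dimension from the array's size.
column = values.reshape(-1, 1)  # shape (26, 1): 26 samples, 1 feature
row = values.reshape(1, -1)     # shape (1, 26): 1 sample, 26 features
print(column.shape, row.shape)

# 26 is not evenly divisible by 4, so this reshape is rejected.
try:
    values.reshape(4, -1)
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("reshape failed:", error)
```

Note that reshape returns the reshaped array and leaves `values` itself unchanged, so the result has to be assigned to a name or passed on directly.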
 
  • #3
Unfortunately sklearn has not given you a very helpful error message. Python arrays (or rather array-like lists) do not have a .reshape method - as @Mark44 says, .reshape is part of the numpy library, so you need to use it on the underlying numpy array, like regr.fit(np.reshape(X_train.values, (-1, 1)), Y_train)
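
A minimal sketch of that fix end to end, assuming X is a pandas Series as in the original post. The data below is synthetic (the real Housing.csv is not available here), and converting with to_numpy() before reshaping is my own choice rather than something from the thread.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for the 'lotsize' and 'price' columns of Housing.csv.
rng = np.random.default_rng(0)
lotsize = rng.uniform(1500, 10000, size=500)
price = 20.0 * lotsize + 30000 + rng.normal(0, 5000, size=500)
df = pd.DataFrame({'lotsize': lotsize, 'price': price})

# A single column is 1D; sklearn wants X shaped (n_samples, n_features).
X = df['lotsize'].to_numpy().reshape(-1, 1)
Y = df['price'].to_numpy()

# Split into training/testing sets, as in the original post.
X_train, X_test = X[:-250], X[-250:]
Y_train, Y_test = Y[:-250], Y[-250:]

regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
predictions = regr.predict(X_test)
print(X_train.shape, predictions.shape)
```

The features are reshaped once, before the split, so both X_train and X_test are already 2D when they reach fit and predict. The target Y can stay 1D.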
 
  • #4
Thank you all, in case someone else is interested, this different setup worked much more smoothly:



I am not following any manual, just piecing together bits here and there.
 
  • #5
WWGD said:
I am not following any manual, just piecing together bits here and there.
But you should take a look at the documentation of any library functions they use in whatever it is that you're following.
 
  • #6
Mark44 said:
But you should take a look at the documentation of any library functions they use in whatever it is that you're following.
I do this sort of weird top-down, bottom-up approach.

Still, my previous attempt was not working. Using [name].isnull(), I figured out there were some missing values below the 500th row, so I reassigned my dataframe to df.head(500) and the regression went through. It was running before, but the algorithm was not converging, I presume because of the missing values.

But I will look at the documentation, @Mark44
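
For anyone repeating the missing-value check described above, here is a small sketch. The frame and its numbers are made up, and counting nulls per column and using dropna are my suggestions, not what was done in the post.

```python
import numpy as np
import pandas as pd

# Made-up frame: the last two rows have a missing 'price'.
df = pd.DataFrame({
    'lotsize': [5850.0, 4000.0, 3060.0, 6650.0, 6360.0],
    'price': [42000.0, 38500.0, 49500.0, np.nan, np.nan],
})

# Count missing values per column instead of scanning the full table by eye.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: keep only the leading rows, as in the post (3 here instead of 500).
truncated = df.head(3)

# Option 2: drop any row with a missing value in the columns actually used.
cleaned = df.dropna(subset=['lotsize', 'price'])
print(len(truncated), len(cleaned))
```

Dropping rows by content does not depend on the missing values happening to sit at the end of the file, which is the main reason to consider it over head().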
 
  • #7
@Mark44: I did a full simple linear regression. It took me two full days to figure out and I am telling everyone. The fruit vendor guy, the dry cleaner, everyone. May I post the full code in case someone is interested?
 
  • #9
OK, you need to have an Excel .xlsx or a .csv file. Mine is called 'Housing.csv'.

Python:
from numpy import *
import sklearn
import pandas as pd
from scipy.interpolate import *
df=pd.read_csv("C:/Housing.csv")

#First batch

df

df1=df.head(500)

#File had missing values above row 500 , so I shortened it.

df1

#Now checking to see if there are still missing values left:

df1.isnull()

#Did not see any ( I am not using column 13, which does have nulls)

X=df1['bathrms'].values
Y=df1['stories'].values
p1=polyfit(X,Y,1)

p1

#Now printing fitted values (fitted by regression line)

Yfit=p1[0]*X +p1[1]
print(Yfit)

#Printing out analysis of regression:

Yresid=Y-Yfit
SSResid=sum(pow(Yresid,2))
SSTotal=len(Y)*var(Y)
rsq=1 -SSResid/SSTotal

print(rsq)
print(Yresid)
print(SSTotal)

#Plotting residuals:
from matplotlib import *
import matplotlib.pyplot as plt
plt.scatter(Y, Yresid,  color='black')
plt.show
 
  • #10
What are lines 9, 15, and 27 doing? Should they instead be comments?
 
  • #11
Mark44 said:
What are lines 9, 15, and 27 doing? Should they instead be comments?
They just output the variables, e.g., df will show the values of the dataframe, all rows, columns. I guess it does this without the need to use 'print'.
 
  • #12
Python:
from numpy import *
import * is not a good idea, you'll see one reason why on the next line, but the more important reason comes later...

Python:
import sklearn
Your code doesn't use sklearn so you should drop this line. If you had done from sklearn import * you wouldn't be able to tell whether you were using it or not.

Python:
import pandas as pd
OK, but I would use from pandas import read_csv - this makes it much easier to see why you are importing the pandas module and therefore to reuse the code.

Python:
from scipy.interpolate import *

Python:
df=pd.read_csv("C:/Housing.csv")
These may seem like nit-picking, but trust me, if you adopt these habits you will both make fewer errors and make it easier to find errors that you do make.
  • use spaces around operators, including the assignment operator
  • stick to using either double or single quotes for string literals (later on you use single quotes) - in Python it doesn't matter which, but pick one and stick to it. Most Python code I see uses single quotes.
  • use meaningful variable names. df may mean something to you now, but a dataframe is a pandas concept, and as you are about to rely on some of this class's methods I would be more explicit, and use a comment to point this out. So we have instead
Python:
# Read the data into a pandas dataframe object.
data = pd.read_csv('C:/Housing.csv')

Moving on...
Python:
#First batch

df
Again a space after the # makes things more readable, but don't leave a blank line between the comment and the line it explains.

EDIT: later on in the thread (see #23) it became clear that the OP was not a Python script, it was input to an interactive REPL session. The following comment assumed that pandas implemented a magic method to print when accessing a dataframe in this way in a script which is not the case.

Now this is where I have to say I don't like pandas. Reading this code, it is anyone's guess what df does - @Mark44 certainly had no clue. So say what it does in a comment:
Python:
# Print the whole dataframe to stdout.
data

Moving on again...
Code:
# Data file has missing values past row 500 so truncate it and print to stdout.
truncated = data.head(500)
truncated
Comments should be before the related code, not after. I've also tidied up the next few lines...

Code:
# Print a table of null values to stdout to check for any more missing values.
truncated.isnull()

# Did not see any except in column 13 which I am not using.
X = truncated['bathrms'].values
Y = truncated['stories'].values
Now we come to the other reason for not using import *. Is polyfit a built-in Python function? No, perhaps it comes from numpy. Or maybe it comes from scipy.interpolate? If my code doesn't work, maybe I intended to use numpy.polyfit but this was overwritten by scipy.interpolate.polyfit which I didn't know existed? Or even worse, maybe my code works today, but in a year's time when a new release of scipy comes out implementing interpolate.polyfit it stops working.

So at the top of the file we should have done
Python:
import numpy as np  # needed further down for np.var
from numpy import polyfit

and then we can safely do (with some standard spacing)...
Python:
# Fit a polynomial curve of degree 1 (i.e. linear regression) and print it to stdout.
fit = polyfit(X, Y, 1)
fit
Wait a minute, will this print anything out? I'll leave this for you to work out and come to your own opinion on objects with magic side-effects like pandas dataframes.

For now, I'm just going to rewrite this...
Python:
# Fit a polynomial curve of degree 1 (i.e. linear regression).
(fitSlope, fitIntercept) = polyfit(X, Y, 1)

# Now print fitted values.
Yfit = fitSlope * X + fitIntercept
print(Yfit)

#Printing out analysis of regression:

Yresiduals = Y - Yfit
sumSquaredResiduals = sum(pow(Yresiduals, 2))
sumSquaredTotal = len(Y) * np.var(Y)
rSquared = 1 - sumSquaredResiduals / sumSquaredTotal

print(rSquared)
print(Yresiduals)
print(sumSquaredTotal)

#Plotting residuals:
from matplotlib import *
import matplotlib.pyplot as plt
Woah there, imports should all be at the top (why do you think this is a good idea?) and think about what you actually want to import.

Python:
plt.scatter(Y, Yresiduals, color='black')
plt.show
Well the last line is definitely not going to work - what should it be? Do you see how you have got into this habit?

To avoid most of these problems, you should use a code linter in your IDE - Pylint is the de facto standard for Python.
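
Putting the pieces of this review together, the whole calculation might look something like the sketch below. It is not the original script: the data is synthetic (standing in for the 'bathrms' and 'stories' columns), the snake_case names are my own, and plotting is left out so that it runs with NumPy alone.

```python
import numpy as np

# Synthetic stand-ins for the 'bathrms' and 'stories' columns.
rng = np.random.default_rng(1)
X = rng.integers(1, 4, size=100).astype(float)
Y = 0.5 * X + 1.0 + rng.normal(0.0, 0.3, size=100)

# Fit a polynomial of degree 1 (i.e. a straight line) by least squares.
fit_slope, fit_intercept = np.polyfit(X, Y, 1)

# Fitted values and residuals.
Y_fit = fit_slope * X + fit_intercept
Y_residuals = Y - Y_fit

# R squared = 1 - SS_residual / SS_total.
# np.var is the population variance, so len(Y) * np.var(Y) is SS_total.
sum_squared_residuals = np.sum(Y_residuals ** 2)
sum_squared_total = len(Y) * np.var(Y)
r_squared = 1 - sum_squared_residuals / sum_squared_total

print(fit_slope, fit_intercept, r_squared)
```

To plot the residuals you would add import matplotlib.pyplot as plt at the top, then plt.scatter(Y, Y_residuals) followed by a call to plt.show() with parentheses.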
 
  • #14
pbuk said:
use spaces around operators, including the assignment operator
I agree with everything except this. Does it really help people? I just find it a waste of horizontal space...
 
  • #15
Thanks for the input, pbuk. Yes, I was just desperate to get some output after 3 days. Now I can start thinking of elegance and efficiency of my code.
 
  • #16
Ibix said:
I agree with everything except this. Does it really help people? I just find it a waste of horizontal space...
Yes it does help, and white space is cheap compared with programmers' time.
Python:
# Compare...
some_variable=one_other_variable-other_variable
some_variable_one=other-variable_other_variable
# ... with
some_variable = one_other_variable - other_variable
some_variable_one = other - variable_other_variable

WWGD said:
Now I can start thinking of elegance and efficiency of my code.
No, this is not about elegance or efficiency, it is about writing code that stands a better chance of working first time and if it doesn't is easier to debug however inelegant or inefficient it is.
 
  • #17
pbuk said:
Yes it does help
I'll take your word for it, but neither of your examples seems easier to read than the other to me. Maybe I just have an unusual way of looking at code.
 
  • #19
Just to clarify, my code is for "standard" (OLS) regression but not ML OLS regression. I'm not sure how the two differ, but I believe they do. As I understand it, ML uses a loss function with a threshold value, partitions the dataset into training and test data, iterates over different partitions of the dataset, and uses the loss function to evaluate whether the threshold is met for a given choice. It keeps iterating if the answer is no (i.e., if the loss function value is above the chosen threshold), and stops otherwise. I'm not sure how this differs from just using some program, e.g., SPSS, to conduct OLS and "just" spit out output: intercept, coefficients with confidence intervals, F-tests, etc.
EDIT: Maybe the two are the same, but programs like SPSS, etc. somehow "black-box" the iteration process?
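
One way to picture the question is to fit the same line twice: once with the direct least-squares solution and once by iterating on a loss function until it stops improving. The sketch below does that on synthetic data. The learning rate, tolerance and stopping rule are arbitrary choices of mine, and this is only one reading of what "iterating until a threshold is met" could mean.

```python
import numpy as np

# Synthetic data around the line y = 3x + 2.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, size=200)

# Direct OLS solution: no iteration visible to the user.
slope_ols, intercept_ols = np.polyfit(x, y, 1)

# Iterative fit: gradient descent on the mean squared error loss.
slope_gd, intercept_gd = 0.0, 0.0
learning_rate = 0.3
tolerance = 1e-12
previous_loss = np.inf
for step in range(100_000):
    residuals = slope_gd * x + intercept_gd - y
    loss = np.mean(residuals ** 2)
    # Stop once the loss has essentially stopped improving.
    if previous_loss - loss < tolerance:
        break
    previous_loss = loss
    slope_gd -= learning_rate * 2.0 * np.mean(residuals * x)
    intercept_gd -= learning_rate * 2.0 * np.mean(residuals)

print(slope_ols, intercept_ols)
print(slope_gd, intercept_gd, step)
```

Both routes minimise the same squared-error loss, so on a problem like this they should land on essentially the same slope and intercept. The train/test split is a separate step for judging the fitted model, whichever way it was obtained.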
 
  • #20
pbuk said:
use spaces around operators, including the assignment operator
Ibix said:
I agree with everything except this. Does it really help people? I just find it a waste of horizontal space...
Some code I found that was posted here at PF. The coder was adept at not wasting horizontal space. Aside from the impenetrable variable names, is this really as easy to read as code for which there are spaces around operators?
Fortran:
psie(i,ifli)=psie(i,ifli)+exphel(i,j)*psib(j,ifli) 
err(i,1)=1.d2*( (dble(psie(i,1))-dble(psio(i)))**2  &
  &  + (imag(psie(i,1))-imag(psio(i)))**2)
psibare(i,ifli)=psibare(i,ifli)+exphel(i,j)*psibarb(j,ifli)
err(i,2)=1.d2*((dble(psibare(i,1))-dble(psibaro(i)))**2 &
  & +(imag(psibare(i,1))-imag(psibaro(i)))**2)
 
  • #21
OLS stands for "Ordinary Least Squares", and that is exactly what it is. It exists as a statistical technique because (i) it is easy to do, and (ii) it is possible to prove a number of things using it. However, in the real world, OLS often provides a sub-optimal fit because outliers tend to affect the parameters too much - exactly because you are squaring the difference between a sample value and its predicted value.

Because of this, in the real world we tend to modify our regression by applying a weighting function and/or thresholds to lessen the impact of outliers. When we have many independent variables (which again in the real world may not be truly independent) then similarly good fits may be obtained by very different combinations of coefficients. The buzz-word 'Machine Learning' in this context simply means that instead of simple calculations the code employs adaptive heuristic algorithms to attempt to fit a model to the data.
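
As a rough illustration of the outlier effect and of weighting, the sketch below fits a line to exact data with one corrupted point, first unweighted and then with that point down-weighted. The data and the weight of 0.01 are invented, and hand-picking the weight for a known outlier is a simplification; real robust methods choose weights from the data.

```python
import numpy as np

x = np.arange(10.0)
y = 2.0 * x + 1.0  # points exactly on a line with slope 2
y[9] += 50.0       # one large outlier at the end

# Ordinary least squares: the squared outlier residual drags the slope up.
slope_ols, intercept_ols = np.polyfit(x, y, 1)

# Weighted least squares: np.polyfit accepts per-point weights via w.
weights = np.ones_like(x)
weights[9] = 0.01
slope_weighted, intercept_weighted = np.polyfit(x, y, 1, w=weights)

print(slope_ols, slope_weighted)
```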

For a single independent variable, the best way to fit a line to data is to plot it and fit by eye: the human brain is pretty good at adaptive heuristics.
 
  • #22
Mark44 said:
Some code I found that was posted here at PF. The coder was adept at not wasting horizontal space. Aside from the impenetrable variable names, is this really as easy to read as code for which there are spaces around operators?
Fortran:
psie(i,ifli)=psie(i,ifli)+exphel(i,j)*psib(j,ifli)
err(i,1)=1.d2*( (dble(psie(i,1))-dble(psio(i)))**2  &
  &  + (imag(psie(i,1))-imag(psio(i)))**2)
psibare(i,ifli)=psibare(i,ifli)+exphel(i,j)*psibarb(j,ifli)
err(i,2)=1.d2*((dble(psibare(i,1))-dble(psibaro(i)))**2 &
  & +(imag(psibare(i,1))-imag(psibaro(i)))**2)
Fortran:
psie(i,ifli) = psie(i,ifli) + exphel(i,j) * psib(j,ifli)
err(i,1) = 1.d2 * ((dble(psie(i,1)) - dble(psio(i)))**2 &
  &  + (imag(psie(i,1)) - imag(psio(i)))**2)
psibare(i,ifli) = psibare(i,ifli) + exphel(i,j) * psibarb(j,ifli)
err(i,2) = 1.d2 * ((dble(psibare(i,1)) - dble(psibaro(i)))**2 &
   & + (imag(psibare(i,1)) - imag(psibaro(i)))**2)
I don't see the improvement. The operators seem to me to stand out clearly anyway (they're very different shapes from letters), so wrapping them in whitespace doesn't add anything.

The multiple nested brackets are a huge issue for comprehensibility for me. Assuming I'm correctly interpreting the variables as arrays of complex numbers, and hazily recalling that arithmetic operations on complex types are allowed in Fortran, for the err() computations I'd almost certainly have created a variable called (e.g.) delta to store the difference and then taken the squared modulus of that in a separate line.

Perfectly happy to accept that I'm in the minority on the whitespace around operators issue. I'm surprised it seems to be quite such a small minority, though.
 
  • #23
pbuk said:
Python:
#First batch

df
Again a space after the # makes things more readable, but don't leave a blank line between the comment and the line it explains. Now this is where I have to say I don't like pandas. Reading this code, it is anyone's guess what df does - @Mark44 certainly had no clue. So say what it does in a comment:
Python:
# Print the whole dataframe to stdout.
data
I've just realized that this is nonsense - pandas DataFrames don't have any magic output methods, the OP was simply posting what he was entering in the Python REPL (aka interactive mode) as a script. This won't work.

@WWGD the convention for posting REPL sessions is to show the >>> prompt like this:
Python:
>>> df

But when we write reusable code we are always writing a script, and now we don't need a comment because it is obvious what we are doing.
Python:
print(df)
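
The difference between the two modes can be reproduced from inside a script. This sketch relies on the built-in compile function: mode 'single' is, as far as I know, what the interactive prompt uses and echoes the value of a bare expression, while mode 'exec' behaves like a script. The helper name run_and_capture and the list standing in for the dataframe are made up.

```python
import contextlib
import io

data = [1, 2, 3]  # stand-in for the dataframe


def run_and_capture(source, mode):
    """Run source in the given compile mode and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, '<demo>', mode), {'data': data})
    return buffer.getvalue()


repl_output = run_and_capture('data', 'single')        # like typing at >>>
script_output = run_and_capture('data', 'exec')        # bare name in a script
explicit_output = run_and_capture('print(data)', 'exec')

print(repr(repl_output), repr(script_output), repr(explicit_output))
```

The bare name produces output only in the interactive mode. In a script it is evaluated and discarded, which is why print is needed there.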
 

1. What is the difference between a 1D and 2D array in Python?

A 1D array has a single axis: its shape is (n,), and it has no notion of rows versus columns. A 2D array has two axes, rows and columns, and its shape is (rows, columns). The same n values stored as a single column of a 2D array have shape (n, 1).

2. How can I convert a 1D array to a 2D array in Python?

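To convert a 1D array to a 2D array in Python, you can use the reshape() function from the NumPy library, or the array's own reshape() method. You specify the new shape, and one dimension may be given as -1 so that NumPy infers it from the number of elements; for example, a.reshape(-1, 1) turns n values into n rows of one column. reshape returns the reshaped array rather than modifying the original in place, so the result must be assigned or passed on.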

3. Why is it important to use a 2D array for Python linear regression?

scikit-learn's estimators expect the features X as a 2D array of shape (n_samples, n_features), even when there is only one feature. A 1D array is ambiguous: it could be one feature measured on many samples, or many features of a single sample. That is why the error message suggests reshape(-1, 1) for a single feature and reshape(1, -1) for a single sample.

4. What are some common methods for creating a 2D array in Python?

There are several methods for creating a 2D array in Python, including passing nested lists to the array() function from the NumPy library, reshaping a 1D array, and stacking 1D arrays as columns. Plain nested lists also work as array-like input in many places. The array module in the standard library only provides 1D arrays, so it is not a route to a 2D array.
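
A few of these routes side by side, as a sketch with made-up numbers:

```python
import numpy as np

# From nested lists: each inner list becomes a row.
from_nested_lists = np.array([[1, 2, 3], [4, 5, 6]])

# Pre-allocated with a given shape.
zeros = np.zeros((2, 3))

# By reshaping a 1D array.
from_reshape = np.arange(6).reshape(2, 3)

# By stacking 1D sequences as columns (note the transposed shape).
from_columns = np.column_stack(([1, 2, 3], [4, 5, 6]))

print(from_nested_lists.shape, zeros.shape, from_reshape.shape, from_columns.shape)
```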

5. Can I perform linear regression on a 2D array with missing values?

Not directly with scikit-learn: LinearRegression raises an error if the input contains missing values (NaN), so you need to handle them before running the regression. This can be done by either removing the rows or columns with missing values, or by imputing the missing values with a suitable method, such as mean or median imputation.
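
Both options can be sketched with NumPy alone. The small array below is made up, and imputing with the column mean is just one simple choice among several.

```python
import numpy as np

# Made-up feature matrix with two missing entries.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, np.nan],
    [7.0, 8.0],
])

# Option 1: keep only the rows that have no missing values.
rows_without_nan = X[~np.isnan(X).any(axis=1)]

# Option 2: replace each missing entry with the mean of its column,
# computed over the values that are present.
column_means = np.nanmean(X, axis=0)
imputed = np.where(np.isnan(X), column_means, X)

print(rows_without_nan.shape)
print(imputed)
```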
