Python Elementary Python Questions: Data Frames, k-nary functions

AI Thread Summary
In this discussion, the focus is on handling data structures in Python, specifically using Pandas and defining functions for summation. The first question addresses the data structure of a selected column from a Pandas DataFrame, with the consensus that using `.values` returns a NumPy array, and it's recommended to use `.to_numpy()` for better functionality. The second question explores how to define a summation function that can handle multiple inputs, particularly lists or tuples, suggesting a recursive approach to add elements element-wise. The built-in `sum` function is highlighted as capable of summing iterables, but for specific use cases, a custom function is proposed. Additionally, the discussion touches on defining variance in Pandas and using NumPy, with examples provided for calculating variance across different axes in a DataFrame. The conversation emphasizes the importance of understanding data types and the flexibility of Python functions in handling various input forms.
WWGD
Hi All,
A couple of questions, please:

1) Say df is a dataframe in Python Pandas, and I select a specific column from df:
Y=df[column].values.
What kind of data structure is Y?

2)
I want to find the sum of two numbers:
def Sum(a=0, b=0):
    return a + b

If I want to find a sum over some data structure (say, a list), how can
I define the sum, i.e., how do I extend it from a binary operation? Should I use
recursion and/or some 'for' loop?

Thanks.
 
Here's a tutorial on pandas; it shows ints, floats, and strings as column types in a DataFrame.

https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm

Tuples can be used as well, along with other Python data types. The best way to find out, though, is to try it yourself.

The key point is that a column should use the same data type in all its rows, although that is not strictly enforced: a column with dtype object can hold mixed Python objects.

It's similar to a SQL table schema.

Instead of .values, it's suggested that you use .to_numpy().

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html

So you could extract a column into a NumPy array and use NumPy functions to sum it or compute statistics.
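To make the answer to question 1 concrete, here is a minimal sketch, assuming pandas and NumPy are installed. The toy frame and its column names are made up for illustration; only `.values`, `.to_numpy()`, and the type check come from the discussion above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are invented for this example.
df = pd.DataFrame({"height": [1.5, 1.7, 1.6], "name": ["a", "b", "c"]})

y_values = df["height"].values     # the attribute used in the question
y_array = df["height"].to_numpy()  # the method the pandas docs recommend

print(type(y_values))  # <class 'numpy.ndarray'>
print(y_array.dtype)   # float64
print(y_array.sum())   # NumPy reductions work directly on the result
```

For a plain numeric column like this one, both spellings give a NumPy array, so Y in the question would be a one-dimensional ndarray.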
 
WWGD said:
If I want to find a sum over some data structure (say, a list), how can
I define sum

You don't have to, the Python built-in sum function takes any iterable as an argument.
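As a quick illustration of what "any iterable" covers, here is a small sketch using only the built-in. The example data is made up.

```python
numbers = [1, 2, 3, 4]
prices = {"apple": 0.5, "pear": 0.75}  # hypothetical example data

list_total = sum(numbers)                    # 10
tuple_total = sum((1.5, 2.5))                # 4.0
dict_total = sum(prices.values())            # 1.25
squares_total = sum(x * x for x in numbers)  # 30, from a generator expression

print(list_total, tuple_total, dict_total, squares_total)
```

The last line shows that the "for" part can live inside the argument, so no explicit loop or recursion is needed in the function itself.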
 
PeterDonis said:
You don't have to, the Python built-in sum function takes any iterable as an argument.
In this context, I suspect @WWGD wants a function that returns a list whose ith element is a[i]+b[i], which isn't what the builtin sum function does with a list-type object. Builtin sum returns the sum of all elements in one list, and fails with multiple lists.

Assuming I'm understanding the problem correctly, I think you want something like
Python:
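# Named add_elementwise rather than sum so the built-in sum is not shadowed.
def add_elementwise(a, b):
    # If both arguments are lists or tuples, add them element by element,
    # recursing so that lists of lists are handled too.
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return [add_elementwise(ai, bi) for (ai, bi) in zip(a, b)]
    else:
        # Otherwise just try adding the two values directly.
        return a + b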
What that does is check if a and b are lists or tuples. If they are it adds them element by element, calling itself so it can handle lists-of-lists. Otherwise it just tries adding them.

I don't know if that's the most efficient way to do things. You would also want to add sense checking - for example, what would you want the behaviour to be if the lists are different lengths? And you may want to replace the isinstance calls with something better suited to your application. And you may not want the recursive behaviour with lists of lists.
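One possible answer to the different-lengths question is to refuse mismatched inputs outright. The sketch below assumes that raising an error is the behaviour you want; the name `add_elementwise_strict` is made up for this example. Note that a plain `zip` would silently truncate to the shorter list instead.

```python
def add_elementwise_strict(a, b):
    """Add two values, or two equally long (possibly nested) lists/tuples."""
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        # Sense check: zip() alone would silently drop the extra elements.
        if len(a) != len(b):
            raise ValueError(f"length mismatch: {len(a)} != {len(b)}")
        return [add_elementwise_strict(ai, bi) for ai, bi in zip(a, b)]
    # Scalars (or anything else supporting +) are simply added.
    return a + b


print(add_elementwise_strict([1, 2], [3, 4]))                  # [4, 6]
print(add_elementwise_strict([[1, 2], [3]], [[10, 20], [30]]))  # [[11, 22], [33]]
```

Calling it with `[1, 2]` and `[1, 2, 3]` raises a ValueError rather than returning a two-element result.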
 
Thank you. I was actually trying to define a k-nary function ##(x_1, x_2, \dots, x_n) \rightarrow x_1 + x_2 + \dots + x_n##.

This is trivial for 2 terms:
def sum(x_1, x_2):
    return x_1 + x_2

But I was hoping to define a sum over, say, a list, or maybe the values in a dictionary, etc., and could not think of a way of defining it. I thought I would need a 'for' loop somewhere, but beyond that it was not clear to me.

I tried using it to define variance, but I got an error message about not being able to iterate over floats.
 
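In that case, Peter's answer is correct about the built-in sum function. If a is a list, sum(a) will return the sum of the elements of the list. To add n separate variables, wrap them in a single iterable, as in sum([x1, x2, ..., xn]); calling sum(x1, x2, ..., xn) directly raises a TypeError, because sum expects one iterable plus an optional start value.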

The general pattern for defining a function without a prespecified argument list is
Python:
def f(*args, **kwargs):
    print(args)
    print(kwargs)
Calling f with arbitrary arguments will put the values into a tuple called args or a dict called kwargs, depending on whether you named the arguments or not. f(1, 2, 3, a=4, b=5) would make args = (1, 2, 3) and kwargs a two-element dictionary with keys "a" and "b" and corresponding values 4 and 5.

You don't have to specify both of *args and **kwargs if you only need one. As far as I'm aware the names args and kwargs are merely conventional and the asterisks are the important things, but I don't think I've ever seen anyone use anything other than args and kwargs.
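Putting the two ideas together, a k-ary sum can be sketched by collecting the arguments with an asterisk and handing them to the built-in. The names `total` and `show_arguments` are made up for this example.

```python
def total(*values):
    """Return x_1 + x_2 + ... + x_n for any number of positional arguments."""
    # values is a tuple of everything passed in, so the built-in sum can iterate over it.
    return sum(values)


def show_arguments(*args, **kwargs):
    """Return the collected arguments so their types can be inspected."""
    return args, kwargs


print(total(1, 2, 3))      # 6
print(total())             # 0
print(total(*[1.5, 2.5]))  # 4.0, unpacking an existing list into arguments
print(show_arguments(1, 2, 3, a=4, b=5))  # ((1, 2, 3), {'a': 4, 'b': 5})

# Passing bare numbers to the built-in fails, because sum expects an iterable:
try:
    sum(1.0, 2.0)
except TypeError as err:
    print("TypeError:", err)
```

The failing call at the end is one plausible source of a complaint about floats not being iterable.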
 
Oh, and
Python:
def f(a, b, *args, **kwargs):
is perfectly acceptable usage: f(1, 2, 3) will put 1 and 2 in a and b, and args will be (3,).
 
WWGD said:
I tried using it to define variance

There are library functions for that.

For values in a pandas DataFrame, there's DataFrame.var.
Python:
import pandas

df = pandas.DataFrame(...)

# Column-wise variance (pandas uses the sample variance, ddof=1, by default)
df.var()
df.var(axis=0)

# Row-wise variance
df.var(axis=1)

# Variance of a single column
df.loc[:, column].var()

For general iterables, there's numpy's var
Python:
import numpy

# Population variance by default (ddof=0); pass ddof=1 for the sample variance
numpy.var(a_list)

# Should also work on a pandas.Series object:
numpy.var(df[column])
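Here is a small worked example of those calls, assuming pandas and NumPy are installed. The data and column names are invented. It also shows the one thing that tends to trip people up: pandas divides by n - 1 by default, while NumPy divides by n.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen so the variances are easy to check by hand.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

column_var = df.var()        # one value per column, ddof=1
row_var = df.var(axis=1)     # one value per row, ddof=1
sample_var = df["x"].var()   # 5/3, divides by n - 1
population_var = np.var(df["x"].to_numpy())       # 1.25, divides by n
matching_var = np.var(df["x"].to_numpy(), ddof=1)  # 5/3 again

print(column_var)
print(row_var)
print(sample_var, population_var, matching_var)
```

Passing the same ddof to both libraries makes their results agree.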
 
Thanks. I was trying to practice by defining it on my own, and I got an error about iterating over floats when defining it as sum[(x_i - xbar)(x_i - xbar) for x_i in list], where xbar is the mean.
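For practice purposes, a hand-rolled version might look like the sketch below. It is not a diagnosis of the exact error, which depends on code not shown in the thread. Two things worth checking in the expression above are that sum is called with parentheses around a single iterable, and that the two factors are joined with an explicit `*`. The function name `variance` and its `ddof` parameter are choices made for this example.

```python
def variance(values, ddof=0):
    """Variance of an iterable of numbers; ddof=0 is population, ddof=1 is sample."""
    data = list(values)  # materialise once so the data can be traversed twice
    n = len(data)
    if n - ddof <= 0:
        raise ValueError("need more data points than ddof")
    xbar = sum(data) / n  # the mean
    # The generator expression is the single iterable that sum() receives.
    return sum((x - xbar) * (x - xbar) for x in data) / (n - ddof)


print(variance([1, 2, 3, 4]))          # 1.25
print(variance([1, 2, 3, 4], ddof=1))  # sample variance, 5/3
```

Avoiding `list` as a variable name also keeps the built-in type available inside the function.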
 
