Cprev and self.C are referring to the same NumPy array

  • Thread starter: Zap
  • Tags: Bug, Python

Discussion Overview

The discussion revolves around a bug encountered in Python related to the handling of NumPy arrays, specifically regarding the assignment of references versus copies. Participants explore the implications of this behavior in the context of a k-means clustering algorithm and broader programming practices in Python.

Discussion Character

  • Exploratory
  • Technical explanation
  • Conceptual clarification
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes a bug where changes to an attribute self.C also affect Cprev, leading to confusion about object references in Python.
  • Another participant clarifies that Cprev and self.C refer to the same NumPy array and suggests using numpy.copy() to create an explicit copy.
  • A participant emphasizes that Python does not automatically make copies of objects, providing examples to illustrate this point.
  • One participant expresses surprise at their previous misunderstanding and reflects on the importance of grasping fundamental programming concepts.
  • Discussion shifts to the effectiveness of the k-means clustering algorithm, with participants expressing skepticism about its performance and reliability.
  • Another participant notes that k-means clustering is not a definitive algorithm and discusses the NP-hard nature of finding global optima.
  • Participants discuss the use of Python for neural networks, questioning whether Python's speed limitations necessitate using languages like C++ for performance-critical tasks.
  • One participant suggests using the sklearn.cluster module for clustering tasks, while another emphasizes the value of building algorithms from scratch for learning purposes.
  • There is a discussion about the performance of Python libraries, noting that many computationally intensive tasks are handled by underlying C or C++ code.

Areas of Agreement / Disagreement

Participants generally agree on the importance of understanding object references in Python and the limitations of the k-means algorithm. However, there are competing views on the necessity of using C++ for neural networks and the value of using established libraries versus building algorithms from scratch.

Contextual Notes

Participants express varying levels of familiarity with programming concepts, indicating a range of experience in Python and algorithm development. The discussion includes assumptions about the behavior of Python and NumPy that may not be universally understood.

Who May Find This Useful

Readers interested in Python programming, NumPy usage, machine learning algorithms, and the nuances of object references in Python may find this discussion beneficial.

Zap
TL;DR
Weird Bug in Python
I've encountered this really weird bug in Python. Below is a snippet from a class method. The error occurs where I've placed the two print statements. The update method has absolutely nothing to do with Cprev; it only changes the value of the attribute self.C. However, the two print statements (1 and 2) print different values for Cprev. For whatever reason, the changes made to self.C by the update method also show up in Cprev, even though nothing in the code explicitly does that.

Python:
def classify( self : object ) -> None :
    Cprev = np.empty( self.C.shape )
    while np.not_equal( Cprev, self.C ) :
        Cprev = self.C
        print( 1, Cprev )
        self.__update()
        print( 2, Cprev )
    return
 
I found an answer. Cprev and self.C are referring to the same NumPy array. They are not arrays themselves; they are two names pointing to one array. I have to explicitly make a copy of self.C and assign it to Cprev. I assumed the copy was automatically made by Python. I suppose this is not the case when using NumPy. In NumPy the copy is made via numpy.copy( self.C ).
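
For reference, here is a minimal sketch of the loop with the copy in place. The class name KMeansSketch and the body of _update are made up for illustration (the original update method is not shown in this thread); the only assumption carried over is that the update changes self.C in place. The sketch also swaps np.not_equal for np.array_equal, because np.not_equal compares element by element and returns an array, which a while condition cannot reduce to a single truth value once the array has more than one element.

```python
import numpy as np


class KMeansSketch:
    """Toy stand-in for the class in the question; only C and the loop matter."""

    def __init__(self, C):
        self.C = np.asarray(C, dtype=float)

    def _update(self):
        # Hypothetical update: modifies self.C in place by halving every entry,
        # then snaps tiny values to zero so the loop eventually terminates.
        self.C *= 0.5
        self.C[np.abs(self.C) < 1e-3] = 0.0

    def classify(self):
        # NaN never compares equal to anything, so the first test always fails
        # and the loop body runs at least once.
        Cprev = np.full(self.C.shape, np.nan)
        iterations = 0
        while not np.array_equal(Cprev, self.C):
            Cprev = self.C.copy()   # independent snapshot, not a second name
            self._update()
            iterations += 1
        return iterations


model = KMeansSketch([[1.0, 2.0]])
n_iterations = model.classify()
```

With `Cprev = self.C` instead of `self.C.copy()`, the comparison would succeed right after the first update, because both names would still refer to the one array that was just modified.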
 
Zap said:
I assumed the copy was automatically made by Python.

No, Python never automatically makes copies of objects. You have to explicitly say you want a copy. For example, consider the following transcript of an interactive Python session:

Python:
>>> list_a = [1, 2, 3]
>>> list_b = list_a
>>> list_b is list_a
True
>>> list_b = list_a.copy()
>>> list_b is list_a
False
>>> list_b = list(list_a)
>>> list_b is list_a
False
 
That's amazing. I had no idea. This was really vital for me to learn. I can't believe I had assumed incorrectly this whole time. Glad I have made the correction now!

I remember encountering this before, but fixing it somehow without knowing exactly what was going on. That's insane. Just goes to show that you can be a data analyst and still have no idea what you're doing LOL.

This is what sucks about learning programming on the fly. You find out that you missed some very simple and seemingly trivial concept that turns out to be extremely vital. I can imagine a million situations in which this tiny little detail could be catastrophic if not known.

I was coding in Python professionally with people as clueless as I was about the fundamentals of SE while working in management consulting. It's a very goofy job with a lot of horrible code. Let it be known that I am always on the path to improving.

I made a k-means clustering algorithm with the code above, and then realized k-means isn't actually a very good clustering algorithm ... why so much hype around k-means? It's a crappy clustering algorithm. I have to run it 100 times to achieve a reasonable result when the number of clusters is high, and even then the reasonable result is not guaranteed.
 
Zap said:
I made a k-means clustering algorithm with the code above, and then realized k-means isn't actually a very good clustering algorithm ... why so much hype around k-means? It's a crappy clustering algorithm. I have to run it 100 times to achieve a reasonable result when the number of clusters is high, and even then the reasonable result is not guaranteed.

Well, strictly speaking, k-means clustering is not an algorithm, in the same way that the n-Queens problem is not an algorithm. Finding the globally optimal set of centroids is an NP-hard problem.

I assume you are referring to Lloyd's algorithm, which is a commonly used iterative heuristic for finding a locally optimal set and whose performance depends on the method used to seed the initial set.
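
As a rough sketch of what Lloyd's algorithm does, assuming Euclidean distance and seeding from randomly chosen data points: the function name lloyd, its parameters, and the synthetic two-blob data are all made up for this illustration and are not the code discussed in the thread.

```python
import numpy as np


def lloyd(X, k, rng, max_iter=100):
    """One run of Lloyd's algorithm on an (n, d) array X.

    Returns (centroids, labels), where labels come from the last assignment step.
    """
    # Seeding: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                  # converged, but only to a local optimum
        centroids = new_centroids
    return centroids, labels


rng = np.random.default_rng(0)
# Two well-separated synthetic blobs (made-up data, for illustration only).
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = lloyd(X, k=2, rng=rng)
```

Nothing in the loop looks beyond the current assignment, which is why the result depends so heavily on where the centroids start.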

Other methods exist and can be found by searching the ACM Digital Library. Also, code written in Python itself (as opposed to a compiled extension) is not the ideal way to implement a computationally intensive task.
 
Yeah, just naive k-means clustering. I thought coding it up would be a good start to learn something about clustering. The k-means thing was kind of hyped up in my mind, since I've heard about it so many times, but it's not super impressive. I am just using random points from the data to initialize the centroids.

I found that if you repeat the "algorithm" a number of times and then choose the result with the smallest sum of squared differences, or in this case the sum of distances from each centroid, you will get okay results. But this just relies on the chance that, if you repeat the random initialization enough times, eventually each centroid will be initialized near the true centroid location.
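
A hedged sketch of that restart strategy follows. The helper names kmeans_once and best_of, the fixed iteration count, and the synthetic data are assumptions made for the example; the score used here is the within-cluster sum of squared distances, which is one common choice rather than necessarily the exact quantity used in the post.

```python
import numpy as np


def kmeans_once(X, k, rng, n_iter=50):
    """Compact Lloyd iteration from a random seeding; returns (centroids, sse)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.min(axis=1).sum()


def best_of(X, k, n_runs, seed=0):
    """Repeat the random seeding n_runs times and keep the lowest-score run."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[1])


data_rng = np.random.default_rng(1)
# Four synthetic blobs (made-up data, for illustration only).
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
X = np.vstack([c + data_rng.normal(0.0, 0.4, size=(40, 2)) for c in centers])

single_centroids, single_sse = best_of(X, k=4, n_runs=1)
best_centroids, best_sse = best_of(X, k=4, n_runs=20)
print(single_sse, best_sse)
```

Because both calls use the same seed, the 20-run result can never score worse than the single run; it still carries no guarantee of being the global optimum.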

A little off topic again, but do people actually write neural nets in Python? You hear about Python and neural nets being used together all of the time, but if Python is slow, wouldn't you want to create a neural net with something like C++? Or are the neural nets written in a different language and only using Python as a kind of interface?

Would you suggest rewriting something like k-means clustering in C++?
 
Zap said:
A little off topic again, but do people actually write neural nets in Python? You hear about Python and neural nets being used together all of the time, but if Python is slow, wouldn't you want to create a neural net with something like C++? Or are the neural nets written in a different language and only using Python as a kind of interface?

Would you suggest rewriting something like k-means clustering in C++?

Maybe, but before you take that step you should try the sklearn.cluster module provided by the scikit-learn package. I'm pretty sure there are some forums dedicated to ML in Python around too.
 
Zap said:
if Python is slow

It's not that simple.

The simple fact is that Python bytecode is slow. But there are plenty of Python libraries that do the heavy lifting in C or C++ code, not Python bytecode. Numpy is an example: all the CPU intensive numerical computation is done in C extension modules, at C speed. The only parts of a numpy application that are actually running Python bytecode are the "glue" parts, that organize the overall application.

There are even built-in parts of Python that don't directly depend on Python bytecode. For example, if you call the sort method of a Python list, the sorting algorithm is written in C and runs at C speed.
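
One way to see the difference for yourself is to time the same computation written as a Python loop and as a single NumPy call. This is only a sketch: the function names and the array size are arbitrary choices, and the actual timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

N = 200_000
values = list(range(N))
array = np.arange(N, dtype=np.int64)


def sum_squares_python(xs):
    total = 0
    for x in xs:            # every iteration executes Python bytecode
        total += x * x
    return total


def sum_squares_numpy(a):
    return int(np.dot(a, a))   # the loop runs inside NumPy's compiled code


# Both versions must agree before the timings mean anything.
assert sum_squares_python(values) == sum_squares_numpy(array)

t_py = timeit.timeit(lambda: sum_squares_python(values), number=3)
t_np = timeit.timeit(lambda: sum_squares_numpy(array), number=3)
print(f"pure Python: {t_py:.4f} s, NumPy: {t_np:.4f} s")
```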
 
pbuk said:
Maybe, but before you take that step you should try the sklearn.cluster module provided by the scikit-learn package. I'm pretty sure there are some forums dedicated to ML in Python around too.

I know there are already fully optimized clustering algorithms you can simply plug and chug in Python. That wasn't the point. I wanted to build something from scratch to learn the ins and outs of it, and for the sake of coding something up and thinking it through. What's the fun in importing a library and doing a plug and chug? I like writing algorithms and going through the math. I am using NumPy for the computational stuff, so I suspect I am not losing too much performance.

Now, if I do decide to do a clustering of data for whatever reason, I will probably use something like sklearn to do it for me, but I will understand exactly what is going on.
 
PeterDonis said:
It's not that simple.

The simple fact is that Python bytecode is slow. But there are plenty of Python libraries that do the heavy lifting in C or C++ code, not Python bytecode. Numpy is an example: all the CPU intensive numerical computation is done in C extension modules, at C speed. The only parts of a numpy application that are actually running Python bytecode are the "glue" parts, that organize the overall application.

There are even built-in parts of Python that don't directly depend on Python bytecode. For example, if you call the sort method of a Python list, the sorting algorithm is written in C and runs at C speed.

This is what I suspected. Really neat stuff. I do, however, enjoy learning C++ for the sake of learning C++, but if it's not really going to provide much benefit in the long run for data stuff, then maybe I will hold off on it for the time being.

If I can write a neural net in Python using NumPy without much cost to performance, then I think that's what I'll try to do. Python is just so easy to use.
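
To give a feel for what that looks like, here is a minimal sketch of a forward pass through a two-layer network in NumPy. The layer sizes, the tanh and softmax choices, and all names are assumptions picked for the example; there is no training step. The point is only that the Python code is a few lines of glue around matrix products that NumPy evaluates in compiled code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs, 8 hidden units, 3 outputs.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)


def forward(X):
    """Forward pass for a batch X of shape (n_samples, 4)."""
    hidden = np.tanh(X @ W1 + b1)     # matrix product runs in compiled code
    logits = hidden @ W2 + b2
    # Softmax, shifted by the row maximum for numerical stability.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


probs = forward(rng.normal(size=(5, 4)))
print(probs.shape)   # (5, 3)
```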
 
Zap said:
If I can write a neural net in Python using NumPy without much cost to performance, then I think that's what I'll try to do.

For neural nets, scipy might be worth checking out as well. It takes the same basic approach as numpy with regard to computation intensive tasks.
 
Sounds good. Writing neural nets in Python is making more sense to me, with the fact that NumPy uses C behind the scenes. That's awesome.
 
