Cprev and self.C are referring to the same NumPy array

In summary, the apparent bug arises because Cprev and self.C refer to the same NumPy array, so the in-place changes made by the update method are visible through both names. To work around it, you have to explicitly make a copy of the NumPy array and assign the copy to the Cprev variable.
  • #1
Zap
TL;DR Summary
Weird Bug in Python
I've encountered this really weird bug in Python. Below is a snippet from a class method. The error occurs where I've placed the two print statements. The update method has absolutely nothing to do with Cprev. The update function changes the value of the attribute self.C. However, the two print statements (1 and 2) print different values for Cprev. For whatever reason, the changes made to self.C by the update method are also made for Cprev, though this is nowhere explicitly defined.

Python:
def classify( self : object ) -> None :
    Cprev = np.empty( self.C.shape )
    while np.not_equal( Cprev, self.C ) :
        Cprev = self.C
        print( 1, Cprev )
        self.__update()
        print( 2, Cprev )
    return
 
  • #2
I found an answer. Cprev and self.C are referring to the same NumPy array. They are not arrays themselves; both names point to one array, so I have to explicitly make a copy of self.C and assign it to Cprev. I assumed the copy was automatically made by Python. I suppose this is not the case when using NumPy. In NumPy the copy is made via numpy.copy( self.C ).
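A minimal sketch of the fix, assuming the update step changes self.C in place. The Clusterer class and its _update method are hypothetical stand-ins for the code in post #1, not the original implementation. The loop condition uses np.array_equal because np.not_equal returns an elementwise array, which a while statement cannot evaluate as a single truth value when the array has more than one element.

```python
import numpy as np


class Clusterer:
    """Hypothetical stand-in for the class in the original post."""

    def __init__(self, C):
        self.C = np.asarray(C, dtype=float)

    def _update(self):
        # Placeholder for the real update step: halve C in place and
        # snap tiny values to zero so that the loop terminates.
        self.C *= 0.5
        self.C[np.abs(self.C) < 1e-3] = 0.0

    def classify(self):
        iterations = 0
        while True:
            Cprev = self.C.copy()   # independent snapshot, not a second name
            self._update()          # mutates self.C in place
            iterations += 1
            if np.array_equal(Cprev, self.C):
                break               # the update changed nothing: converged
        return iterations


model = Clusterer([1.0, 2.0])
n = model.classify()
print(n, model.C)
```

With Cprev = self.C in place of the copy, the comparison would always see two references to one array and the loop would stop after the first pass.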
 
  • #3
Zap said:
I assumed the copy was automatically made by Python.

No, Python never automatically makes copies of objects. You have to explicitly say you want a copy. For example, consider the following transcript of an interactive Python session:

Python:
>>> list_a = [1, 2, 3]
>>> list_b = list_a
>>> list_b is list_a
True
>>> list_b = list_a.copy()
>>> list_b is list_a
False
>>> list_b = list(list_a)
>>> list_b is list_a
False
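The same rules apply to NumPy arrays. The short sketch below (variable names are illustrative only) contrasts a second name, an explicit copy, an in-place update, and a rebinding:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a              # second name for the same array object
c = a.copy()       # new array with its own data

a += 10            # in-place update: visible through every name for the object
print(b is a, b)   # True [11. 12. 13.]
print(c is a, c)   # False [1. 2. 3.]

a = a + 1          # builds a new array and rebinds the name a; b is untouched
print(b is a)      # False
```

The distinction between a += 10 (mutation) and a = a + 1 (rebinding) is what decides whether other names see the change.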
 
  • #4
That's amazing. I had no idea. This was really vital for me to learn. I can't believe I had assumed incorrectly this whole time. Glad I have made the correction now!

I remember encountering this before, but fixing it somehow without knowing exactly what was going on. That's insane. Just goes to show that you can be a data analyst and still have no idea what you're doing LOL.

This is what sucks about learning programming on the fly. You find out that you missed some very simple and seemingly trivial concept that turns out to be extremely vital. I can imagine a million situations in which this tiny little detail could be catastrophic if not known.

I was coding in Python professionally with people as clueless as I was about the fundamentals of SE while working in management consulting. It's a very goofy job with a lot of horrible code. Let it be known that I am always on the path to improving.

I made a k-means clustering algorithm with the code above, and then realized k-means isn't actually a very good clustering algorithm ... why so much hype around k-means? It's a crappy clustering algorithm. I have to run it 100 times to achieve a reasonable result when the number of clusters is high, and even then the reasonable result is not guaranteed.
 
  • #5
Zap said:
I made a k-means clustering algorithm with the code above, and then realized k-means isn't actually a very good clustering algorithm ... why so much hype around k-means? It's a crappy clustering algorithm. I have to run it 100 times to achieve a reasonable result when the number of clusters is high, and even then the reasonable result is not guaranteed.
Well, strictly speaking, k-means clustering is not an algorithm, in the same way that the n-Queens problem is not an algorithm. Finding the globally optimal set of centroids is an NP-hard problem.

I assume you are referring to Lloyd's algorithm, which is a commonly used iterative heuristic for finding a locally optimal set and whose performance depends on the method used to seed the initial set.

Other methods exist and can be found by searching the ACM library. Also, Python user-space is not the ideal way to implement a computationally intensive task.
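For reference, one run of Lloyd's algorithm can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code from post #1: the function name lloyd, its parameters, the choice of random data points as seeds, and the two-blob test data are all hypothetical.

```python
import numpy as np


def lloyd(X, k, rng, max_iter=100):
    """One run of Lloyd's algorithm on data X of shape (n_samples, n_features)."""
    # Seed the centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points;
        # a centroid with no points keeps its old position.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.array_equal(new_centroids, centroids):
            break  # nothing moved: a local optimum has been reached
        centroids = new_centroids
    return centroids, labels


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = lloyd(X, k=2, rng=rng)
print(centroids)
```

Note that new_centroids is built from an explicit copy, so the convergence test compares two distinct arrays.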
 
  • #6
Yea just naive k-means clustering. I thought coding it up would be a good start to learn something about clustering. The k-means thing was kind of hyped up in my mind, since I've heard about it so many times, but it's not super impressive. I am just using random data points in the data to initialize the centroids.

I found that if you repeat the "algorithm" a number of times and then choose the result with the smallest sum of squared differences, or in this case the sum of distances from each centroid, you will get okay results. But this is just relying on the chance that if you repeat the random initialization process so many times that eventually each centroid will be initiated near the true centroid location.
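The restart strategy described above might look like the following sketch. The helper names kmeans_once and best_of, the default of 20 restarts, and the three-blob test data are assumptions made for illustration; the score is the sum of squared distances from each point to its nearest centroid.

```python
import numpy as np


def kmeans_once(X, k, rng, max_iter=100):
    """Single run seeded with k random data points; returns (centroids, sse)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        updated = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
        if np.array_equal(updated, centroids):
            break
        centroids = updated
    # Score the final centroids: squared distance to the nearest one, summed.
    sq_dists = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
    return centroids, sq_dists.min(axis=1).sum()


def best_of(X, k, n_restarts=20, seed=0):
    """Repeat the random initialization and keep the run with the lowest score."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])


data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(m, 0.5, size=(40, 2)) for m in (0.0, 4.0, 8.0)])
best_centroids, best_sse = best_of(X, k=3)
print(best_sse)
```

More restarts can only lower (or keep) the best score, but nothing guarantees that the global optimum is among the runs.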

A little off topic again, but do people actually write neural nets in Python? You hear about Python and neural nets being used together all of the time, but if Python is slow, wouldn't you want to create a neural net with something like C++? Or are the neural nets written in a different language and only using Python as a kind of interface?

Would you suggest rewriting something like k-means clustering in C++?
 
  • #7
Zap said:
A little off topic again, but do people actually write neural nets in Python? You hear about Python and neural nets being used together all of the time, but if Python is slow, wouldn't you want to create a neural net with something like C++? Or are the neural nets written in a different language and only using Python as a kind of interface?

Would you suggest rewriting something like k-means clustering in C++?
Maybe, but before you take that step you should try the sklearn.cluster module provided by the scikit-learn package. I'm pretty sure there are some forums dedicated to ML in Python around too.
 
  • #8
Zap said:
if Python is slow

It's not that simple.

The simple fact is that Python bytecode is slow. But there are plenty of Python libraries that do the heavy lifting in C or C++ code, not Python bytecode. Numpy is an example: all the CPU intensive numerical computation is done in C extension modules, at C speed. The only parts of a numpy application that are actually running Python bytecode are the "glue" parts, that organize the overall application.

There are even built-in parts of Python that don't directly depend on Python bytecode. For example, if you call the sort method of a Python list, the sorting algorithm is written in C and runs at C speed.
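As a rough illustration of the difference, the two functions below compute the same sum of squares. The function names and the array size are arbitrary choices for this sketch; the first runs every iteration through the bytecode interpreter, while the second hands the whole loop to NumPy's compiled code.

```python
import numpy as np


def sum_sq_python(values):
    # Every iteration of this loop is executed as Python bytecode.
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_sq_numpy(values):
    # A single call; the loop over the elements runs inside NumPy.
    return float(np.dot(values, values))


data = np.arange(100_000, dtype=np.float64)
slow = sum_sq_python(data)
fast = sum_sq_numpy(data)
print(np.isclose(slow, fast))
```

Timing both with the standard timeit module should show the gap on your own machine; the exact ratio depends on hardware and array size.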
 
  • #9
pbuk said:
Maybe, but before you take that step you should try the sklearn.cluster module provided by the scikit-learn package. I'm pretty sure there are some forums dedicated to ML in Python around too.

I know there are already fully optimized clustering algorithms you can simply plug and chug in Python. That wasn't the point. I wanted to build something from scratch to learn the ins and outs of it, and for the sake of coding something up and thinking it through. What's the fun in importing a library and doing a plug and chug? I like writing algorithms and going through the math. I am using NumPy for the computational stuff, so I suspect I am not losing too much performance.

Now, if I do decide to do a clustering of data for whatever reason, I will probably use something like sklearn to do it for me, but I will understand exactly what is going on.
 
  • #10
PeterDonis said:
It's not that simple.

The simple fact is that Python bytecode is slow. But there are plenty of Python libraries that do the heavy lifting in C or C++ code, not Python bytecode. Numpy is an example: all the CPU intensive numerical computation is done in C extension modules, at C speed. The only parts of a numpy application that are actually running Python bytecode are the "glue" parts, that organize the overall application.

There are even built-in parts of Python that don't directly depend on Python bytecode. For example, if you call the sort method of a Python list, the sorting algorithm is written in C and runs at C speed.

This is what I suspected. Really neat stuff. I do however enjoy learning C++ for the sake of learning C++, but if it's not really going to provide much benefit in the long run, for data stuff, then maybe I will hold off on it for the time being.

If I can write a neural net in Python using NumPy, without much cost to performance, then I think that's what I'll try to do. Python is just so easy to use.
 
  • #11
Zap said:
If I can write a neural net in Python using NumPy, without much cost to performance, then I think that's what I'll try to do.

For neural nets, scipy might be worth checking out as well. It takes the same basic approach as numpy with regard to computation intensive tasks.
 
  • #12
Sounds good. Writing neural nets in Python is making more sense to me, with the fact that NumPy uses C behind the scenes. That's awesome.
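To make that concrete, here is a tiny sketch of a forward pass through a two-layer network written with NumPy only. The layer sizes, the random weights, and the function name forward are all hypothetical; the point is that the Python code only strings together a handful of array operations, each of which runs in compiled code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs, 8 hidden units, 3 outputs.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)


def forward(X):
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU activation
    logits = hidden @ W2 + b2
    # Softmax, shifted by the row maximum for numerical stability.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


batch = rng.normal(size=(5, 4))
probs = forward(batch)
print(probs.shape)
```

Training would add a backward pass and weight updates, which are likewise expressed as matrix operations.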
 

Related to Cprev and self.C referring to the same NumPy array

1. What are Cprev and self.C?

Neither is part of NumPy. In the thread's code, self.C is an instance attribute that holds a NumPy array, and Cprev is a local variable in the classify method. After the assignment Cprev = self.C, both names refer to the same array object.

2. Why do Cprev and self.C refer to the same array?

Assignment in Python binds a name to an existing object; it never copies the object. Cprev = self.C therefore creates a second reference to the array, and any in-place change made through self.C is visible through Cprev.

3. Can Cprev and self.C be used in the same program?

Yes, but as long as they refer to the same array they cannot be used to compare an old state with a new one. Cprev must hold a copy for the comparison in the loop to be meaningful.

4. Are there any differences between Cprev and self.C?

Not after a plain assignment: Cprev is self.C evaluates to True. After Cprev = self.C.copy(), or equivalently numpy.copy( self.C ), they are separate arrays whose contents were equal at the time of the copy.

5. How do I know when to make a copy?

Copy whenever a snapshot must survive later in-place changes to the original, as with Cprev here. Plain assignment is fine when you only want another name for the same data.
