Cprev and self.C are referring to the same NumPy array

Zap · Sep 27, 2020

I've encountered this really weird bug in Python. Below is a snippet from a class method. The error occurs where I've placed the two print statements. The update method has absolutely nothing to do with Cprev. The update function changes the value of the attribute self.C. However, the two print statements (1 and 2) print different values for Cprev. For whatever reason, the changes made to self.C by the update method are also made for Cprev, though this is nowhere explicitly defined.

Python:

def classify( self : object ) -> None :
    Cprev = np.empty( self.C.shape )
    while np.not_equal( Cprev, self.C ) :
        Cprev = self.C
        print( 1, Cprev )
        self.__update()
        print( 2, Cprev )
    return

Zap · Sep 27, 2020

I found an answer. Cprev and self.C are referring to the same NumPy array. The two names are not arrays themselves; they both point to one array object, so any change made through self.C also shows up through Cprev. I have to explicitly make a copy of self.C and assign it to Cprev. I assumed the copy was automatically made by Python. I suppose this is not the case when using NumPy. In NumPy the copy is made via numpy.copy( numpy.ndarray ).
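For anyone landing here later, a minimal sketch of the corrected loop. The class name `Clusterer` and the body of `_update` are hypothetical stand-ins (the real update method is not shown in the thread); the only assumption that matters is that the update changes `self.C` in place. Two further changes are my own: `np.array_equal` returns a single bool, whereas `np.not_equal` returns an element-wise array that cannot be used directly as a `while` condition for multi-element arrays, and the initial `Cprev` is filled with NaN because the contents of `np.empty` are arbitrary.

```python
import numpy as np


class Clusterer:
    """Hypothetical stand-in for the class in the question."""

    def __init__(self, C):
        self.C = np.asarray(C, dtype=float)

    def _update(self):
        # Placeholder update (assumption): modifies self.C IN PLACE,
        # which is what made the aliasing visible in the original code.
        self.C[:] = np.floor(self.C / 2)

    def classify(self):
        iterations = 0
        # NaN never compares equal, so the loop body runs at least once.
        Cprev = np.full(self.C.shape, np.nan)
        while not np.array_equal(Cprev, self.C):
            Cprev = self.C.copy()  # independent snapshot, not a second name
            self._update()
            iterations += 1
        return iterations


model = Clusterer([8, 4, 2])
n_iter = model.classify()
print(n_iter, model.C)
```

With `Cprev = self.C` instead of `self.C.copy()`, the comparison would always see two names for the same array and the loop would stop after one pass.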

PeterDonis · Sep 27, 2020

Zap said:

I assumed the copy was automatically made by Python.

No, Python never automatically makes copies of objects. You have to explicitly say you want a copy. For example, consider the following transcript of an interactive Python session:

Python:

>>> list_a = [1, 2, 3]
>>> list_b = list_a
>>> list_b is list_a
True
>>> list_b = list_a.copy()
>>> list_b is list_a
False
>>> list_b = list(list_a)
>>> list_b is list_a
False
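The same behaviour can be sketched with NumPy arrays. This is only an illustration with made-up variable names; it also shows one NumPy-specific wrinkle, namely that a basic slice is a distinct object (so `is` reports False) yet still shares data with the original, which `np.shares_memory` can reveal.

```python
import numpy as np

a = np.array([1, 2, 3])
b = a          # second name for the same array object
c = a.copy()   # independent array with its own data
v = a[:2]      # basic slice: new array object, but a view into a's data

a[0] = 99      # in-place change made through the name a

print(b is a, b[0])                          # True 99
print(c is a, c[0])                          # False 1
print(v is a, np.shares_memory(v, a), v[0])  # False True 99
```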

Zap · Sep 29, 2020

That's amazing. I had no idea. This was really vital for me to learn. I can't believe I had assumed incorrectly this whole time. Glad I have made the correction now!

I remember encountering this before, but fixing it somehow without knowing exactly what was going on. That's insane. Just goes to show that you can be a data analyst and still have no idea what you're doing LOL.

This is what sucks about learning programming on the fly. You find out that you missed some very simple and seemingly trivial concept that turns out to be extremely vital. I can imagine a million situations in which not knowing this tiny little detail could be catastrophic.

I was coding in Python professionally with people as clueless as I was about the fundamentals of SE while working in management consulting. It's a very goofy job with a lot of horrible code. Let it be known that I am always on the path to improving.

I made a k-means clustering algorithm with the code above, and then realized k-means isn't actually a very good clustering algorithm ... why so much hype around k-means? It's a crappy clustering algorithm. I have to run it 100 times to achieve a reasonable result when the number of clusters is high, and even then the reasonable result is not guaranteed.

pbuk · Sep 29, 2020

Zap said:

I made a k-means clustering algorithm with the code above, and then realized k-means isn't actually a very good clustering algorithm ... why so much hype around k-means? It's a crappy clustering algorithm. I have to run it 100 times to achieve a reasonable result when the number of clusters is high, and even then the reasonable result is not guaranteed.

Well, strictly speaking, k-means clustering is not an algorithm, in the same way that the n-Queens problem is not an algorithm. Finding the globally optimal set of centroids is an NP-hard problem.

I assume you are referring to Lloyd's algorithm, which is a commonly used iterative heuristic for finding a locally optimal set and whose performance depends on the method used to seed the initial set.

Other methods exist and can be found by searching the ACM library. Also, Python user-space is not the ideal way to implement a computationally intensive task.

Zap · Sep 29, 2020

Yea just naive k-means clustering. I thought coding it up would be a good start to learn something about clustering. The k-means thing was kind of hyped up in my mind, since I've heard about it so many times, but it's not super impressive. I am just using random data points in the data to initialize the centroids.

I found that if you repeat the "algorithm" a number of times and then choose the result with the smallest sum of squared differences, or in this case the sum of distances from each centroid, you will get okay results. But this just relies on chance: if you repeat the random initialization enough times, eventually each centroid will be initialized near the true centroid location.
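A rough sketch of that restart strategy, not the code from my project. The function names `lloyd` and `best_of`, the restart count, the seeds, and the synthetic test data are all assumptions chosen for illustration. Each run seeds the centroids from random data points, and the run with the smallest sum of squared distances is kept.

```python
import numpy as np


def lloyd(X, k, rng, max_iter=100):
    """One run of Lloyd's iteration, seeded from k random data points."""
    # Fancy indexing returns a copy, so the centroids do not alias X.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.array_equal(new_centroids, centroids):
            break  # assignments stopped changing
        centroids = new_centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1), d2.min(axis=1).sum()


def best_of(X, k, restarts=20, seed=0):
    """Repeat the random initialization and keep the lowest-SSE run."""
    rng = np.random.default_rng(seed)
    runs = [lloyd(X, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


# Synthetic data (assumption): three well-separated 2-D blobs.
data_rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.concatenate([c + data_rng.normal(size=(50, 2)) for c in centers])

centroids, labels, sse = best_of(X, k=3)
print(centroids, sse)
```

The restarts only improve the odds of a good local optimum; nothing here guarantees the global one.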

A little off topic again, but do people actually write neural nets in Python? You hear about Python and neural nets being used together all of the time, but if Python is slow, wouldn't you want to create a neural net with something like C++? Or are the neural nets written in a different language and only using Python as a kind of interface?

Would you suggest rewriting something like k-means clustering in C++?

pbuk · Sep 29, 2020

Zap said:

A little off topic again, but do people actually write neural nets in Python? You hear about Python and neural nets being used together all of the time, but if Python is slow, wouldn't you want to create a neural net with something like C++? Or are the neural nets written in a different language and only using Python as a kind of interface?

Would you suggest rewriting something like k-means clustering in C++?

Maybe, but before you take that step you should try the sklearn.cluster module provided by the scikit-learn package. I'm pretty sure there are some forums dedicated to ML in Python around too.

PeterDonis · Sep 29, 2020

Zap said:

if Python is slow

It's not that simple.

The simple fact is that Python bytecode is slow. But there are plenty of Python libraries that do the heavy lifting in C or C++ code, not Python bytecode. NumPy is an example: all the CPU-intensive numerical computation is done in C extension modules, at C speed. The only parts of a NumPy application that are actually running Python bytecode are the "glue" parts that organize the overall application.

There are even built-in parts of Python that don't directly depend on Python bytecode. For example, if you call the sort method of a Python list, the sorting algorithm is written in C and runs at C speed.
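As a small illustration of the difference (a sketch with made-up function names, not a benchmark): both functions below compute the same sum of squares, but the first executes Python bytecode once per element, while the second hands the whole array to a single NumPy call. Timing them, for example with the standard `timeit` module, is left to the reader.

```python
import numpy as np


def sum_sq_loop(values):
    # Every iteration of this loop is interpreted Python bytecode.
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_sq_numpy(values):
    # One call; the element-wise work happens inside NumPy's compiled code.
    return float(np.dot(values, values))


data = np.arange(1000, dtype=float)
print(sum_sq_loop(data), sum_sq_numpy(data))
```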

Zap · Sep 29, 2020

pbuk said:

Maybe, but before you take that step you should try the sklearn.cluster module provided by the scikit-learn package. I'm pretty sure there are some forums dedicated to ML in Python around too.

I know there are already fully optimized clustering algorithms you can simply plug and chug in Python. That wasn't the point. I wanted to build something from scratch to learn the ins and outs of it, and for the sake of coding something up and thinking it through. What's the fun in importing a library and doing a plug and chug? I like writing algorithms and going through the math. I am using NumPy for the computational stuff, so I suspect I am not losing too much performance.

Now, if I do decide to do a clustering of data for whatever reason, I will probably use something like sklearn to do it for me, but I will understand exactly what is going on.

Zap · Sep 29, 2020

PeterDonis said:

It's not that simple.

The simple fact is that Python bytecode is slow. But there are plenty of Python libraries that do the heavy lifting in C or C++ code, not Python bytecode. NumPy is an example: all the CPU-intensive numerical computation is done in C extension modules, at C speed. The only parts of a NumPy application that are actually running Python bytecode are the "glue" parts that organize the overall application.

There are even built-in parts of Python that don't directly depend on Python bytecode. For example, if you call the sort method of a Python list, the sorting algorithm is written in C and runs at C speed.

This is what I suspected. Really neat stuff. I do, however, enjoy learning C++ for the sake of learning C++, but if it's not really going to provide much benefit in the long run for data stuff, then maybe I will hold off on it for the time being.

If I can write a neural net in Python using NumPy, without much cost to performance, then I think that's what I'll try to do. Python is just so easy to use.

PeterDonis · Sep 29, 2020

Zap said:

If I can write a neural net in Python using NumPy, without much cost to performance, then I think that's what I'll try to do.

For neural nets, SciPy might be worth checking out as well. It takes the same basic approach as NumPy with regard to computation-intensive tasks.

Zap · Sep 30, 2020

Sounds good. Writing neural nets in Python is making more sense to me, with the fact that NumPy uses C behind the scenes. That's awesome.
