Using argmin() and argsort(): How to Solve the Problem

EngWiPy · Nov 14, 2017

I used argmin() in my code when I wanted to find the index of the smallest number in a vector. Later, I tried to used argsort()[:k] to find the indices of the smallest k numbers. It returns the correct results, but not in a form similar to argmin(). For example, suppose I have two vectors of the same size, x, and y, where x is numeric while y is char. First I want to find the index of the smallest number in x, say it is a, then I want to extract y[a]. If the index of the smallest number in x is 2, and y[2] is 'c', then print(x.argmin(), y(x.argmin()) and print(x.argsort()[:1], y(x.argsort()[:1]) output:

Code:

2 c
[2] ['c']

respectively. I am not sure how these are different, but I know that when I replace argmin() in my code by argsort()[:1] I get drastically different results. How can I fix this problem?

Mark44 · Nov 14, 2017

It would be helpful to know what language you're using. Is it Python? Don't make us guess...

EngWiPy · Nov 14, 2017

Mark44 said:

It would be helpful to know what language you're using. Is it Python? Don't make us guess...

Yes, it is Python. It is mentioned as a prefix in the title.

Mark44 · Nov 14, 2017

Sorry, somehow I missed it in the title.

Mark44 · Nov 14, 2017

argmin() and argsort() do different things - argsort() returns a list of indexes -- see https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

EngWiPy · Nov 14, 2017

Mark44 said:

argmin() and argsort() do different things - argsort() returns a list of indexes -- see https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

Yes, they are not similar, and they both give different results. What I want to do is to select the indices of the k smallest numbers instead of selecting the index of the smallest number as a generalization of an algorithm. So, I found argsort()[:k] might do it. However, as a general case, I expected argsort()[:1] to give the same results as the spatial case argmin(). But it didn't! Is there another way I can do this such that I get the same results for both argmin() and when I use k = 1 for whatever function I can use for the general case?

Mark44 · Nov 14, 2017

S_David said:

Yes, they are not similar, and they both give different results. What I want to do is to select the indices of the k smallest numbers instead of selecting the index of the smallest number as a generalization of an algorithm. So, I found argsort()[:k] might do it. However, as a general case, I expected argsort()[:1] to give the same results as the spatial case argmin(). But it didn't! Is there another way I can do this such that I get the same results for both argmin() and when I use k = 1 for whatever function I can use for the general case?

Why not just sort the list first, and then you'll have the k smallest numbers at one end of the list. If it's important to keep the original list, you could copy the list to a second list, and then sort the second list.

EngWiPy · Nov 15, 2017

Mark44 said:

Why not just sort the list first, and then you'll have the k smallest numbers at one end of the list. If it's important to keep the original list, you could copy the list to a second list, and then sort the second list.

I have a numeric scalar a, and a numeric vector x and a char vector y contains the corresponding labels of the elements in x. What I want to do is

first to compute the distance between a and all x's elements and put the results in z.
From z I want to find the k indices of the k smallest numbers to use them to fetch the corresponding labels from y.
From the labels fetched from y I want to find the label the occurs the most, and assign to a that label.

It is an k-nn (k-nearest neighbor) classification problem.

Mark44 · Nov 15, 2017

S_David said:

I have a numeric scalar a, and a numeric vector x and a char vector y contains the corresponding labels of the elements in x. What I want to do is

first to compute the distance between a and all x's elements and put the results in z.

From z I want to find the k indices of the k smallest numbers to use them to fetch the corresponding labels from y.

From the labels fetched from y I want to find the label the occurs the most, and assign to a that label.

Changed bullet list above to number list.
Rather than having three separate arrays, x, y, and z, you could create an array (or list) of tuples, with each tuple holding three values: a number from x, the label associated with x_i, and the distance between a and x_i.

For 2, it might be possible to create a new array of tuples, sorted by the distance values. That way it would be relatively easy to find the tuples with the smallest distance values. For each distance value, that tuple would have the associated numeric x_i value and associated y_i label.

If you're writing code, I would recommend using better variable names than a, x, y, and z.

EngWiPy · Nov 16, 2017

Mark44 said:

Changed bullet list above to number list.
Rather than having three separate arrays, x, y, and z, you could create an array (or list) of tuples, with each tuple holding three values: a number from x, the label associated with x_i, and the distance between a and x_i.

For 2, it might be possible to create a new array of tuples, sorted by the distance values. That way Θit would be relatively easy to find the tuples with the smallest distance values. For each distance value, that tuple would have the associated numeric x_i value and associated y_i label.

If you're writing code, I would recommend using better variable names than a, x, y, and z.

You mean like this:

\mathbf{z} = [(x_0, y_0, z_0), \ldots, (x_k, y_k, z_k), \ldots, (x_n, y_n, z_n)]

where

z_i = \sqrt{(x_i - a)^2}

and the sorted vector as:

\mathbf{z}_{\text{sorted}} = [(x_{(0)}, y_{(0)}, z_{(0)}), \ldots, (x_{(k)}, y_{(k)}, z_{(k)}), \ldots, (x_{(n)}, y_{(n)}, z_{(n)})]

where

z_{(0)}\leq z_{(1)}\cdots \leq z_{(n)}

In that case, how to go efficiently from the unsorted to the sorted list of tuples, and efficiently get the labels of the first k tuples in the sorted list?

Mark44 · Nov 16, 2017

S_David said:

You mean like this:

\mathbf{z} = [(x_0, y_0, z_0), \ldots, (x_k, y_k, z_k), \ldots, (x_n, y_n, z_n)]

Yes.

S_David said:

where

z_i = \sqrt{(x_i - a)^2}

Not quite -- ##z_i = abs(x_i - a)##. Python has a built-in absolute value function.

S_David said:

and the sorted vector as:

\mathbf{z}_{\text{sorted}} = [(x_{(0)}, y_{(0)}, z_{(0)}), \ldots, (x_{(k)}, y_{(k)}, z_{(k)}), \ldots, (x_{(n)}, y_{(n)}, z_{(n)})]

where

z_{(0)}\leq z_{(1)}\cdots \leq z_{(n)}

Yes, that's what I had in mind.

S_David said:

How to go from the unsorted to the sorted list of tuples, and get the labels of the first k tuples in the sorted list?

You can sort a list of tuples using the built-in list.sort() and sorted() functions. The Python docs discuss how these are used, and give example code showing how to use them with lists that contain complex types. Look for "Sorting HOW TO" under "Python HOWTOs".

After the list is sorted, you'll have all the tuples ordered by distance. It should be fairly easy to iterate through the list to get a new list with the k smallest distance values.

Again, I strongly recommend using variable names that are more self-explanatory than x, y, and z.

jim mcnamara · Nov 16, 2017

I second @Mark44 's comment about variable names. When you start writing more complex functions, you will run into problems you probably have not seen, like the scope of variables. You could have variables declared as: a global x, and an x that is local to a function, for example. Then when you work with x it is easy to get confused about what you are actually doing. The compiler won't get confused but you may get garbage results. And it is much harder to debug code littered with meaningless names.

EngWiPy · Nov 16, 2017

Mark44 said:

Again, I strongly recommend using variable names that are more self-explanatory than x, y, and z.

Thanks. I will look for the sort function. Yes, I am using these variables here only. In my code I use more descriptive variables.

EngWiPy · Nov 16, 2017

Code:

>>> x = [(1, 'a', 0.5), (1.5, 'b', 0.2), (2, 'c', 0.7)]
>>> x.sort(key = lambda tub: tub[2])
>>> x
[(1.5, 'b', 0.2), (1, 'a', 0.5), (2, 'c', 0.7)]

It is as easy as that! Thanks for the hints.

Mark44 · Nov 16, 2017

The Python documentation is very helpful. If you're going to be programming in Python, it behooves you to take a close look at the docs.

EngWiPy · Nov 18, 2017

In selecting the second element from each tuble and creating a new list of them I used:

Code:

sel = [w[1] for w in z]

but other people use something like this:

Code:

sel = [tub for _, for tub in z]

I couldn't understand the logic of the second one. How to interpret it? I read about the underscore in python, which could mean different things, among them ignoring the index in for loop. What does this mean in the above context?

Mark44 · Nov 18, 2017

S_David said:

I couldn't understand the logic of the second one. How to interpret it?

No idea. I did a cursory check in the documentation, but didn't find anything.

jim mcnamara · Nov 19, 2017

No idea either.

But look at it this way: this is why you document code, so the next guy can pick up and run with it - after you go elsewhere. Same thing as better variable names.
Perl, awk, and python all have "shortcut" keywords that are often abused - so do shells. And other languages to a lesser extent.
Example in awk and shell (bash), remove duplicates from a file:

Code:

# cool version
awk '!a[$0]++' filename > tmp.tmp && mv tmp.tmp filename
# easy to understand version
awk '{arr[$0]++ 
         if(arr[$0]==1) {print }
       }' filename >  tmp.tmp
if [ $? -eq 0 ] ; then
   mv tmp.tmp filename
fi

Newbies think this is "cool", but when you have to debug somebody's clever code littered with constructions like this, you learn that it may not be so cool after all.
And no it does not run way faster, awk is interpreted as is perl (sometimes) and shell. The code is read one time, then "compiled" into memory.

EngWiPy · Nov 19, 2017

Sorry there was a typo. The second way is like this:

Code:

sel = [tub for _, tub in z]

i.e., there is no second for. Also, the tubles in my code contains only two values, the distance and the labels only. It works like this on Python shell:

Code:

>>> x = [('a', 1), ('b', 2), ('c', 3)]
>>> y = [z for _, z in x]
>>> y
[1, 2, 3]

So, it apparently selects the second element of each tuble. But again: how?

EDIT: OK, I think I see what's going on.

Code:

y = [z for _, z in x]

the (for _, z) part probably means to ignore the first value of the tuble (using the underscore), and use only the second value.

Yes, it is like I said, because the following code works like this:

Code:

>>> x = [('a', 1), ('b', 2), ('c', 3)]
>>> y = [(z, w) for w, z in x]
>>> y
[(1, 'a'), (2, 'b'), (3, 'c')]#I can reorder the tubles using this method!

And of course I can select the the first element of each tuble as:

Code:

>>> y = [z for z, _ in x]
>>> y
['a', 'b', 'c']

Mark44 · Nov 19, 2017

S_David said:

the second element of each tuble

Standard term is "tuple".

Using argmin() and argsort(): How to Solve the Problem

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Who May Find This Useful

Similar threads

How to increase phone signal strength by lying about it

Use of AI (ML/DL) in Science

Could the reason why I can't select any kernels in VS Code be this error?

Star maps using Blender

Other than just FizzBuzz to test programmer candidates

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight