Python Using argmin() and argsort(): How to Solve the Problem

  • Thread starter Thread starter EngWiPy
  • Start date Start date
AI Thread Summary
The discussion focuses on the differences between the `argmin()` and `argsort()` functions in Python, particularly in the context of finding indices of the smallest numbers in a vector. While `argmin()` returns the index of the smallest element, `argsort()[:k]` provides the indices of the k smallest elements, which can lead to confusion when expecting similar outputs. Participants suggest sorting a list of tuples containing values and labels to efficiently retrieve the k smallest distances and their corresponding labels for a k-nearest neighbor classification problem. The conversation also emphasizes the importance of using descriptive variable names and understanding Python's list comprehensions, particularly the use of the underscore to ignore certain values when unpacking tuples. Overall, the thread highlights effective strategies for handling data in Python while clarifying common misconceptions about these functions.
EngWiPy
Messages
1,361
Reaction score
61
I used argmin() in my code when I wanted to find the index of the smallest number in a vector. Later, I tried to used argsort()[:k] to find the indices of the smallest k numbers. It returns the correct results, but not in a form similar to argmin(). For example, suppose I have two vectors of the same size, x, and y, where x is numeric while y is char. First I want to find the index of the smallest number in x, say it is a, then I want to extract y[a]. If the index of the smallest number in x is 2, and y[2] is 'c', then print(x.argmin(), y(x.argmin()) and print(x.argsort()[:1], y(x.argsort()[:1]) output:

Code:
2 c
[2] ['c']

respectively. I am not sure how these are different, but I know that when I replace argmin() in my code by argsort()[:1] I get drastically different results. How can I fix this problem?
 
Technology news on Phys.org
It would be helpful to know what language you're using. Is it Python? Don't make us guess...
 
Mark44 said:
It would be helpful to know what language you're using. Is it Python? Don't make us guess...

Yes, it is Python. It is mentioned as a prefix in the title.
 
Sorry, somehow I missed it in the title.
 
Mark44 said:
argmin() and argsort() do different things - argsort() returns a list of indexes -- see https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

Yes, they are not similar, and they both give different results. What I want to do is to select the indices of the k smallest numbers instead of selecting the index of the smallest number as a generalization of an algorithm. So, I found argsort()[:k] might do it. However, as a general case, I expected argsort()[:1] to give the same results as the spatial case argmin(). But it didn't! Is there another way I can do this such that I get the same results for both argmin() and when I use k = 1 for whatever function I can use for the general case?
 
S_David said:
Yes, they are not similar, and they both give different results. What I want to do is to select the indices of the k smallest numbers instead of selecting the index of the smallest number as a generalization of an algorithm. So, I found argsort()[:k] might do it. However, as a general case, I expected argsort()[:1] to give the same results as the spatial case argmin(). But it didn't! Is there another way I can do this such that I get the same results for both argmin() and when I use k = 1 for whatever function I can use for the general case?
Why not just sort the list first, and then you'll have the k smallest numbers at one end of the list. If it's important to keep the original list, you could copy the list to a second list, and then sort the second list.
 
Mark44 said:
Why not just sort the list first, and then you'll have the k smallest numbers at one end of the list. If it's important to keep the original list, you could copy the list to a second list, and then sort the second list.

I have a numeric scalar a, and a numeric vector x and a char vector y contains the corresponding labels of the elements in x. What I want to do is
  • first to compute the distance between a and all x's elements and put the results in z.
  • From z I want to find the k indices of the k smallest numbers to use them to fetch the corresponding labels from y.
  • From the labels fetched from y I want to find the label the occurs the most, and assign to a that label.
It is an k-nn (k-nearest neighbor) classification problem.
 
S_David said:
I have a numeric scalar a, and a numeric vector x and a char vector y contains the corresponding labels of the elements in x. What I want to do is
  1. first to compute the distance between a and all x's elements and put the results in z.
  2. From z I want to find the k indices of the k smallest numbers to use them to fetch the corresponding labels from y.
  3. From the labels fetched from y I want to find the label the occurs the most, and assign to a that label.
Changed bullet list above to number list.
Rather than having three separate arrays, x, y, and z, you could create an array (or list) of tuples, with each tuple holding three values: a number from x, the label associated with x_i, and the distance between a and x_i.

For 2, it might be possible to create a new array of tuples, sorted by the distance values. That way it would be relatively easy to find the tuples with the smallest distance values. For each distance value, that tuple would have the associated numeric x_i value and associated y_i label.

If you're writing code, I would recommend using better variable names than a, x, y, and z.
 
  • Like
Likes EngWiPy and QuantumQuest
  • #10
Mark44 said:
Changed bullet list above to number list.
Rather than having three separate arrays, x, y, and z, you could create an array (or list) of tuples, with each tuple holding three values: a number from x, the label associated with x_i, and the distance between a and x_i.

For 2, it might be possible to create a new array of tuples, sorted by the distance values. That way Θit would be relatively easy to find the tuples with the smallest distance values. For each distance value, that tuple would have the associated numeric x_i value and associated y_i label.

If you're writing code, I would recommend using better variable names than a, x, y, and z.

You mean like this:

\mathbf{z} = [(x_0, y_0, z_0), \ldots, (x_k, y_k, z_k), \ldots, (x_n, y_n, z_n)]

where

z_i = \sqrt{(x_i - a)^2}

and the sorted vector as:

\mathbf{z}_{\text{sorted}} = [(x_{(0)}, y_{(0)}, z_{(0)}), \ldots, (x_{(k)}, y_{(k)}, z_{(k)}), \ldots, (x_{(n)}, y_{(n)}, z_{(n)})]

where

z_{(0)}\leq z_{(1)}\cdots \leq z_{(n)}

In that case, how to go efficiently from the unsorted to the sorted list of tuples, and efficiently get the labels of the first k tuples in the sorted list?
 
  • #11
S_David said:
You mean like this:

\mathbf{z} = [(x_0, y_0, z_0), \ldots, (x_k, y_k, z_k), \ldots, (x_n, y_n, z_n)]
Yes.
S_David said:
where

z_i = \sqrt{(x_i - a)^2}
Not quite -- ##z_i = abs(x_i - a)##. Python has a built-in absolute value function.
S_David said:
and the sorted vector as:

\mathbf{z}_{\text{sorted}} = [(x_{(0)}, y_{(0)}, z_{(0)}), \ldots, (x_{(k)}, y_{(k)}, z_{(k)}), \ldots, (x_{(n)}, y_{(n)}, z_{(n)})]

where

z_{(0)}\leq z_{(1)}\cdots \leq z_{(n)}
Yes, that's what I had in mind.
S_David said:
How to go from the unsorted to the sorted list of tuples, and get the labels of the first k tuples in the sorted list?
You can sort a list of tuples using the built-in list.sort() and sorted() functions. The Python docs discuss how these are used, and give example code showing how to use them with lists that contain complex types. Look for "Sorting HOW TO" under "Python HOWTOs".

After the list is sorted, you'll have all the tuples ordered by distance. It should be fairly easy to iterate through the list to get a new list with the k smallest distance values.

Again, I strongly recommend using variable names that are more self-explanatory than x, y, and z.
 
  • Like
Likes EngWiPy and jim mcnamara
  • #12
I second @Mark44 's comment about variable names. When you start writing more complex functions, you will run into problems you probably have not seen, like the scope of variables. You could have variables declared as: a global x, and an x that is local to a function, for example. Then when you work with x it is easy to get confused about what you are actually doing. The compiler won't get confused but you may get garbage results. And it is much harder to debug code littered with meaningless names.
 
  • Like
Likes EngWiPy
  • #13
Mark44 said:
Again, I strongly recommend using variable names that are more self-explanatory than x, y, and z.

Thanks. I will look for the sort function. Yes, I am using these variables here only. In my code I use more descriptive variables.
 
  • #14
Code:
>>> x = [(1, 'a', 0.5), (1.5, 'b', 0.2), (2, 'c', 0.7)]
>>> x.sort(key = lambda tub: tub[2])
>>> x
[(1.5, 'b', 0.2), (1, 'a', 0.5), (2, 'c', 0.7)]

It is as easy as that! Thanks for the hints.
 
  • #15
The Python documentation is very helpful. If you're going to be programming in Python, it behooves you to take a close look at the docs.
 
  • Like
Likes EngWiPy
  • #16
In selecting the second element from each tuble and creating a new list of them I used:

Code:
sel = [w[1] for w in z]

but other people use something like this:

Code:
sel = [tub for _, for tub in z]

I couldn't understand the logic of the second one. How to interpret it? I read about the underscore in python, which could mean different things, among them ignoring the index in for loop. What does this mean in the above context?
 
  • #17
S_David said:
I couldn't understand the logic of the second one. How to interpret it?
No idea. I did a cursory check in the documentation, but didn't find anything.
 
  • Like
Likes EngWiPy
  • #18
No idea either.

But look at it this way: this is why you document code, so the next guy can pick up and run with it - after you go elsewhere. Same thing as better variable names.
Perl, awk, and python all have "shortcut" keywords that are often abused - so do shells. And other languages to a lesser extent.
Example in awk and shell (bash), remove duplicates from a file:
Code:
# cool version
awk '!a[$0]++' filename > tmp.tmp && mv tmp.tmp filename
# easy to understand version
awk '{arr[$0]++ 
         if(arr[$0]==1) {print }
       }' filename >  tmp.tmp
if [ $? -eq 0 ] ; then
   mv tmp.tmp filename
fi

Newbies think this is "cool", but when you have to debug somebody's clever code littered with constructions like this, you learn that it may not be so cool after all.
And no it does not run way faster, awk is interpreted as is perl (sometimes) and shell. The code is read one time, then "compiled" into memory.
 
  • Like
Likes EngWiPy
  • #19
Sorry there was a typo. The second way is like this:

Code:
sel = [tub for _, tub in z]

i.e., there is no second for. Also, the tubles in my code contains only two values, the distance and the labels only. It works like this on Python shell:

Code:
>>> x = [('a', 1), ('b', 2), ('c', 3)]
>>> y = [z for _, z in x]
>>> y
[1, 2, 3]

So, it apparently selects the second element of each tuble. But again: how?

EDIT: OK, I think I see what's going on.

Code:
y = [z for _, z in x]

the (for _, z) part probably means to ignore the first value of the tuble (using the underscore), and use only the second value.

Yes, it is like I said, because the following code works like this:

Code:
>>> x = [('a', 1), ('b', 2), ('c', 3)]
>>> y = [(z, w) for w, z in x]
>>> y
[(1, 'a'), (2, 'b'), (3, 'c')]#I can reorder the tubles using this method!

And of course I can select the the first element of each tuble as:

Code:
>>> y = [z for z, _ in x]
>>> y
['a', 'b', 'c']
 
Last edited:
  • #20
S_David said:
the second element of each tuble
Standard term is "tuple".
 
Back
Top