Using argmin() and argsort(): How to Solve the Problem

  • Context: Python 
  • Thread starter Thread starter EngWiPy
  • Start date Start date
Click For Summary

Discussion Overview

The discussion revolves around the use of the functions argmin() and argsort() in Python, particularly in the context of finding indices of the smallest elements in a vector and how these functions differ in their outputs. Participants explore their applications in a k-nearest neighbor classification problem, addressing issues related to indexing and sorting.

Discussion Character

  • Technical explanation
  • Conceptual clarification
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes using argmin() to find the index of the smallest number in a vector and attempts to use argsort()[:k] to find the indices of the smallest k numbers, noting differences in results.
  • Several participants clarify that argmin() and argsort() serve different purposes, with argsort() returning a list of indices.
  • Another participant suggests sorting the list first to obtain the k smallest numbers, proposing a method to maintain the original list while sorting a copy.
  • A participant introduces a k-nearest neighbor classification problem, detailing the need to compute distances and fetch corresponding labels based on the smallest distances.
  • There is a discussion about creating a list of tuples to hold values, labels, and distances, and sorting this list to facilitate finding the smallest distances.
  • Participants exchange ideas on how to efficiently sort and retrieve labels from the sorted list of tuples.
  • Some participants recommend using more descriptive variable names to avoid confusion in code, especially in more complex functions.

Areas of Agreement / Disagreement

Participants generally agree on the differences between argmin() and argsort(), but there is no consensus on a single method to achieve the desired results for both functions in the context of the k-nearest neighbor problem. Multiple approaches are proposed, and the discussion remains unresolved regarding the best solution.

Contextual Notes

Participants express uncertainty about the efficiency of transitioning from an unsorted to a sorted list of tuples and how to retrieve the labels of the first k tuples from the sorted list.

Who May Find This Useful

This discussion may be useful for individuals interested in Python programming, particularly those working with numerical data and looking to implement algorithms related to finding minimum values and sorting data structures.

EngWiPy
Messages
1,361
Reaction score
61
I used argmin() in my code when I wanted to find the index of the smallest number in a vector. Later, I tried to used argsort()[:k] to find the indices of the smallest k numbers. It returns the correct results, but not in a form similar to argmin(). For example, suppose I have two vectors of the same size, x, and y, where x is numeric while y is char. First I want to find the index of the smallest number in x, say it is a, then I want to extract y[a]. If the index of the smallest number in x is 2, and y[2] is 'c', then print(x.argmin(), y(x.argmin()) and print(x.argsort()[:1], y(x.argsort()[:1]) output:

Code:
2 c
[2] ['c']

respectively. I am not sure how these are different, but I know that when I replace argmin() in my code by argsort()[:1] I get drastically different results. How can I fix this problem?
 
Technology news on Phys.org
It would be helpful to know what language you're using. Is it Python? Don't make us guess...
 
Mark44 said:
It would be helpful to know what language you're using. Is it Python? Don't make us guess...

Yes, it is Python. It is mentioned as a prefix in the title.
 
Sorry, somehow I missed it in the title.
 
Mark44 said:
argmin() and argsort() do different things - argsort() returns a list of indexes -- see https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

Yes, they are not similar, and they both give different results. What I want to do is to select the indices of the k smallest numbers instead of selecting the index of the smallest number as a generalization of an algorithm. So, I found argsort()[:k] might do it. However, as a general case, I expected argsort()[:1] to give the same results as the spatial case argmin(). But it didn't! Is there another way I can do this such that I get the same results for both argmin() and when I use k = 1 for whatever function I can use for the general case?
 
S_David said:
Yes, they are not similar, and they both give different results. What I want to do is to select the indices of the k smallest numbers instead of selecting the index of the smallest number as a generalization of an algorithm. So, I found argsort()[:k] might do it. However, as a general case, I expected argsort()[:1] to give the same results as the spatial case argmin(). But it didn't! Is there another way I can do this such that I get the same results for both argmin() and when I use k = 1 for whatever function I can use for the general case?
Why not just sort the list first, and then you'll have the k smallest numbers at one end of the list. If it's important to keep the original list, you could copy the list to a second list, and then sort the second list.
 
Mark44 said:
Why not just sort the list first, and then you'll have the k smallest numbers at one end of the list. If it's important to keep the original list, you could copy the list to a second list, and then sort the second list.

I have a numeric scalar a, and a numeric vector x and a char vector y contains the corresponding labels of the elements in x. What I want to do is
  • first to compute the distance between a and all x's elements and put the results in z.
  • From z I want to find the k indices of the k smallest numbers to use them to fetch the corresponding labels from y.
  • From the labels fetched from y I want to find the label the occurs the most, and assign to a that label.
It is an k-nn (k-nearest neighbor) classification problem.
 
S_David said:
I have a numeric scalar a, and a numeric vector x and a char vector y contains the corresponding labels of the elements in x. What I want to do is
  1. first to compute the distance between a and all x's elements and put the results in z.
  2. From z I want to find the k indices of the k smallest numbers to use them to fetch the corresponding labels from y.
  3. From the labels fetched from y I want to find the label the occurs the most, and assign to a that label.
Changed bullet list above to number list.
Rather than having three separate arrays, x, y, and z, you could create an array (or list) of tuples, with each tuple holding three values: a number from x, the label associated with x_i, and the distance between a and x_i.

For 2, it might be possible to create a new array of tuples, sorted by the distance values. That way it would be relatively easy to find the tuples with the smallest distance values. For each distance value, that tuple would have the associated numeric x_i value and associated y_i label.

If you're writing code, I would recommend using better variable names than a, x, y, and z.
 
  • Like
Likes   Reactions: EngWiPy and QuantumQuest
  • #10
Mark44 said:
Changed bullet list above to number list.
Rather than having three separate arrays, x, y, and z, you could create an array (or list) of tuples, with each tuple holding three values: a number from x, the label associated with x_i, and the distance between a and x_i.

For 2, it might be possible to create a new array of tuples, sorted by the distance values. That way Θit would be relatively easy to find the tuples with the smallest distance values. For each distance value, that tuple would have the associated numeric x_i value and associated y_i label.

If you're writing code, I would recommend using better variable names than a, x, y, and z.

You mean like this:

\mathbf{z} = [(x_0, y_0, z_0), \ldots, (x_k, y_k, z_k), \ldots, (x_n, y_n, z_n)]

where

z_i = \sqrt{(x_i - a)^2}

and the sorted vector as:

\mathbf{z}_{\text{sorted}} = [(x_{(0)}, y_{(0)}, z_{(0)}), \ldots, (x_{(k)}, y_{(k)}, z_{(k)}), \ldots, (x_{(n)}, y_{(n)}, z_{(n)})]

where

z_{(0)}\leq z_{(1)}\cdots \leq z_{(n)}

In that case, how to go efficiently from the unsorted to the sorted list of tuples, and efficiently get the labels of the first k tuples in the sorted list?
 
  • #11
S_David said:
You mean like this:

\mathbf{z} = [(x_0, y_0, z_0), \ldots, (x_k, y_k, z_k), \ldots, (x_n, y_n, z_n)]
Yes.
S_David said:
where

z_i = \sqrt{(x_i - a)^2}
Not quite -- ##z_i = abs(x_i - a)##. Python has a built-in absolute value function.
S_David said:
and the sorted vector as:

\mathbf{z}_{\text{sorted}} = [(x_{(0)}, y_{(0)}, z_{(0)}), \ldots, (x_{(k)}, y_{(k)}, z_{(k)}), \ldots, (x_{(n)}, y_{(n)}, z_{(n)})]

where

z_{(0)}\leq z_{(1)}\cdots \leq z_{(n)}
Yes, that's what I had in mind.
S_David said:
How to go from the unsorted to the sorted list of tuples, and get the labels of the first k tuples in the sorted list?
You can sort a list of tuples using the built-in list.sort() and sorted() functions. The Python docs discuss how these are used, and give example code showing how to use them with lists that contain complex types. Look for "Sorting HOW TO" under "Python HOWTOs".

After the list is sorted, you'll have all the tuples ordered by distance. It should be fairly easy to iterate through the list to get a new list with the k smallest distance values.

Again, I strongly recommend using variable names that are more self-explanatory than x, y, and z.
 
  • Like
Likes   Reactions: EngWiPy and jim mcnamara
  • #12
I second @Mark44 's comment about variable names. When you start writing more complex functions, you will run into problems you probably have not seen, like the scope of variables. You could have variables declared as: a global x, and an x that is local to a function, for example. Then when you work with x it is easy to get confused about what you are actually doing. The compiler won't get confused but you may get garbage results. And it is much harder to debug code littered with meaningless names.
 
  • Like
Likes   Reactions: EngWiPy
  • #13
Mark44 said:
Again, I strongly recommend using variable names that are more self-explanatory than x, y, and z.

Thanks. I will look for the sort function. Yes, I am using these variables here only. In my code I use more descriptive variables.
 
  • #14
Code:
>>> x = [(1, 'a', 0.5), (1.5, 'b', 0.2), (2, 'c', 0.7)]
>>> x.sort(key = lambda tub: tub[2])
>>> x
[(1.5, 'b', 0.2), (1, 'a', 0.5), (2, 'c', 0.7)]

It is as easy as that! Thanks for the hints.
 
  • #15
The Python documentation is very helpful. If you're going to be programming in Python, it behooves you to take a close look at the docs.
 
  • Like
Likes   Reactions: EngWiPy
  • #16
In selecting the second element from each tuble and creating a new list of them I used:

Code:
sel = [w[1] for w in z]

but other people use something like this:

Code:
sel = [tub for _, for tub in z]

I couldn't understand the logic of the second one. How to interpret it? I read about the underscore in python, which could mean different things, among them ignoring the index in for loop. What does this mean in the above context?
 
  • #17
S_David said:
I couldn't understand the logic of the second one. How to interpret it?
No idea. I did a cursory check in the documentation, but didn't find anything.
 
  • Like
Likes   Reactions: EngWiPy
  • #18
No idea either.

But look at it this way: this is why you document code, so the next guy can pick up and run with it - after you go elsewhere. Same thing as better variable names.
Perl, awk, and python all have "shortcut" keywords that are often abused - so do shells. And other languages to a lesser extent.
Example in awk and shell (bash), remove duplicates from a file:
Code:
# cool version
awk '!a[$0]++' filename > tmp.tmp && mv tmp.tmp filename
# easy to understand version
awk '{arr[$0]++ 
         if(arr[$0]==1) {print }
       }' filename >  tmp.tmp
if [ $? -eq 0 ] ; then
   mv tmp.tmp filename
fi

Newbies think this is "cool", but when you have to debug somebody's clever code littered with constructions like this, you learn that it may not be so cool after all.
And no it does not run way faster, awk is interpreted as is perl (sometimes) and shell. The code is read one time, then "compiled" into memory.
 
  • Like
Likes   Reactions: EngWiPy
  • #19
Sorry there was a typo. The second way is like this:

Code:
sel = [tub for _, tub in z]

i.e., there is no second for. Also, the tubles in my code contains only two values, the distance and the labels only. It works like this on Python shell:

Code:
>>> x = [('a', 1), ('b', 2), ('c', 3)]
>>> y = [z for _, z in x]
>>> y
[1, 2, 3]

So, it apparently selects the second element of each tuble. But again: how?

EDIT: OK, I think I see what's going on.

Code:
y = [z for _, z in x]

the (for _, z) part probably means to ignore the first value of the tuble (using the underscore), and use only the second value.

Yes, it is like I said, because the following code works like this:

Code:
>>> x = [('a', 1), ('b', 2), ('c', 3)]
>>> y = [(z, w) for w, z in x]
>>> y
[(1, 'a'), (2, 'b'), (3, 'c')]#I can reorder the tubles using this method!

And of course I can select the the first element of each tuble as:

Code:
>>> y = [z for z, _ in x]
>>> y
['a', 'b', 'c']
 
Last edited:
  • #20
S_David said:
the second element of each tuble
Standard term is "tuple".
 

Similar threads

  • · Replies 28 ·
Replies
28
Views
5K
  • · Replies 5 ·
Replies
5
Views
2K
  • · Replies 16 ·
Replies
16
Views
2K
  • · Replies 13 ·
Replies
13
Views
2K
  • · Replies 15 ·
Replies
15
Views
2K
Replies
1
Views
2K
  • · Replies 1 ·
Replies
1
Views
2K
Replies
2
Views
1K
  • · Replies 34 ·
2
Replies
34
Views
6K
  • · Replies 3 ·
Replies
3
Views
2K