Spreadsheet help: compare two lists to find exceptions

  • Thread starter Thread starter DaveC426913
  • Start date Start date
  • Tags Tags
    Compare
AI Thread Summary
The discussion revolves around comparing two patient lists in OpenOffice Calc or Excel to identify missing entries from each list. The lists are not normalized, complicating direct comparisons due to differing formats and additional data fields. Users suggest using functions like VLOOKUP and MATCH to find discrepancies, emphasizing the importance of correctly formatting data as numbers or text to avoid errors. The MATCH function is highlighted as a preferred method, particularly when set to return exact matches. There are challenges with derived fields and formatting issues that prevent successful comparisons. Some participants recommend using Excel's Power Query for better data management and suggest considering a database approach for handling large datasets. The conversation also touches on the difficulties of working in a virtual environment and the need for solutions that can be maintained after the original creator departs. Ultimately, the goal is to automate the identification of non-matching records efficiently, given the large volume of data involved.
DaveC426913
Gold Member
Messages
23,861
Reaction score
7,903
TL;DR Summary
Compare a column of each of two spreadsheets to pull out exceptions
(I am using OpenOffice Calc, though I have access to Excel).
I have coerced two lists of patients into two sheets on a spreadsheet.

I need to determine who is missing from each list.

So, schematically, something like this output:

List 1:
Alice
Bob
Carol
Dave

List 2:
Alice
Bill
Bob
Dave

or something like this is fine too:

List 1:
Carol
List 2:
Bill

I'm not really picky. I am compiling my report manually, so "cheap and cheerful" is the word-of-the-day. The only reason I need to automate parts of it is because of the sheer volume: 70 doctors, totaling 16,000 patients.I'm not sure how to Google search for such operations - whether it's "conditional formatting" or "compare sheets", etc. Which is why I'm asking for direction/inspiration.As for comparing sheets, note that the two lists are not currently normalized:
One list might be:
Last Name, First name, Address, Gender, etc.
The other list list might be:
Doctor, Phone#, Full name, Gender, etc.

(The point I'm making is that a straight up "compare two sheets" is not really what this is. A "compare two sheets" operation is likely expecting a majority of overlap with only data/content differences.) Normalizing them could be done, but that's even more work. I'm sure I will have to at least normalize the name field(s).Also: No hurry. Anytime in the next half hour will be fine. :smile:
 
Computer science news on Phys.org
Excel has something called an intersection operator. You could use that. Syntx is tricky, as white space is significant.
 
jack action said:
VLOOKUP seems like a nice candidate.
In my experience, MATCH is almost always a better solution than any of the (V/H/)LOOKUP family (often, as here, with the 'match-type' option set to 0). The answer here is probably to introduce a column doing the MATCHing and using that for conditional formatting (which is always a bit of a faff) or with a bit more work, ingenuity and intermediate calculations, extracting the matching rows.
 
  • Like
Likes DaveC426913
OK, making progress with the MATCH function.

I've abandoned using names because one list has Bobbie Smith and the other has Bobbie Jo Smith.
But insurance numbers are unique, so that's a lot easier.

But one list has ON 1234 567 890 YZ while the other has 1234 567 890, so now I have to see if there's a way of trimming/truncating a field...

Ah .. MID()
 
Nope. Not working.

Here's a sample:
Sheet 'H-RCP':

7072​
Alice​
4154​
Andy​
etc.
Sheet 'H-PSS' (sorted):

7033​
Bill​
7058​
Dave​
7072​
Alice​
7103​
Charlie​
7113​
Annie​
etc.
Sheet 'Output', Cell A1:

=MATCH('H-RCP'.A1; 'H-PSS'.A1:A5; 1)
If I am using this correctly, MATCH is
looking for H-RCP cell 1: 7072
in the array H-PSS cells 1-5

It should return row 3, but it actually returns #N/A, meaning it didn't find it.
 
Use 0 as the third argument; you don't need to sort them. Also make sure you are not comparing numbers with text that looks like a number (you might need to convert with VALUE or TEXT).
 
pbuk said:
Use 0 as the third argument; you don't need to sort them. Also make sure you are not comparing numbers with text that looks like a number (you might need to convert with VALUE or TEXT).
Yup, and yup.
Tried them as numbers and as text.

The only other difference I can think of is that one of them is, itself, the output of a function: the MID function, to strip off the prefix and suffix, thus:

B1 =MID(A1;3;10)
ON6323609633ET​
6323609633​
 
How many names are there? If it's mall, and you don't need to repeat it often, it might be aster and cheaper to have two sorted lists side by side.
 
  • #10
Vanadium 50 said:
How many names are there? If it's mall, and you don't need to repeat it often, it might be aster and cheaper to have two sorted lists side by side.
16,000 names for 70 doctors.
 
  • #11
DaveC426913 said:
The only other difference I can think of is that one of them is, itself, the output of a function: the MID function, to strip off the prefix and suffix, thus:

B1 =MID(A1;3;10)
Yes, that will give a string (text). Convert it to a number with VALUE(MID(A1;3;10))[icode].
 
  • #12
pbuk said:
Yes, that will give a string (text). Convert it to a number with VALUE(MID(A1;3;10))[icode].
I'm comparing text.
 
  • #13
Give the problem, along with your worked example to ChatGPT. It can be very good at Excel. You can suggest using MATCH.
 
  • #14
Really? 🤔
 
  • #15
DaveC426913 said:
Really? 🤔
ChatGPT Q & A results.
Q: I have a spreadsheet file. Sheet A has a list of ID numbers in column A, and names in column B. Sheet B has a list of ID numbers in column A, and names in column B. Write an Excel expression to join the two and return only the matches in Sheet C.
A: In Excel, you can use the VLOOKUP function ... [cut off]
Q: No, use the MATCH function.
A: You can use the MATCH function in Excel to find the row number of a specific value in a range, and then use that row number to retrieve the corresponding data. Here's how to use the MATCH function in Sheet C to find the matching ID numbers in Sheet A and Sheet B:

In Cell A2 of Sheet C, enter the following formula:

=MATCH(A2,SheetA!A:A,0)

In Cell B2 of Sheet C, enter the following formula:

=MATCH(A2,SheetB!A:A,0)

These formulas will return the row number of the matching ID number in Sheet A and Sheet B, respectively. To retrieve the corresponding name for each ID, use the INDEX function, like this:

In Cell C2 of Sheet C, enter the following formula:

=INDEX(SheetA!B:B,A2)

In Cell D2 of Sheet C, enter the following formula:

=INDEX(SheetB!B:B,B2)

These formulas will return the name corresponding to the matching ID number in Sheet A and Sheet B, respectively.

You can then drag these formulas down to the last row of your data to apply them to all the IDs in Sheet C.
 
  • #16
DaveC426913 said:
TL;DR Summary: Compare a column of each of two spreadsheets to pull out exceptions

(The point I'm making is that a straight up "compare two sheets" is not really what this is. A "compare two sheets" operation is likely expecting a majority of overlap with only data/content differences.)
Just as an aside question, have you considered just using a typical scripting programming language to do this type of task instead of Excel? Have you used Python/Perl/Tcl at all before? I will often use Excel for simple stuff, but as soon as things get a little complicated, I'm inclined to just write a program to do it instead...
 
  • Like
Likes Vanadium 50 and FactChecker
  • #17
I was also going to suggest an honest to goodness database, but I suspect that the additional ease of that operation would be cancelled out by the trouble getting the data in and out.
 
  • Like
Likes DaveC426913
  • #18
DaveC426913 said:
I'm comparing text.
You sure? 6323609633 looks like a number to me (and to Excel).

Also, if you have Excel I'd be inclined to use Excel, not OpenOffice which is mostly compatible but you can never be sure how edge cases are handled.
 
  • #19
anorlunda said:
ChatGPT Q & A results.
Which are broken: a human expert can spot this instantly (SheetC!A2 will be a circular reference).

You really need to stop believing that a language tool can ever provide a reliable answer to technical problems.
 
  • Like
Likes Wrichik Basu and jack action
  • #20
pbuk said:
You sure? 6323609633 looks like a number to me (and to Excel).
By setting the format to text explicitly.

pbuk said:
Also, if you have Excel I'd be inclined to use Excel, not OpenOffice which is mostly compatible but you can never be sure how edge cases are handled.
You're right, apparently this utility may have to survive me, which means I'll have to use their tools, not mine.
 
  • #21
berkeman said:
Just as an aside question, have you considered just using a typical scripting programming language to do this type of task instead of Excel? Have you used Python/Perl/Tcl at all before? I will often use Excel for simple stuff, but as soon as things get a little complicated, I'm inclined to just write a program to do it instead...
*sigh* Yeah. I made* a Python script for a very similar utility. (There's lots of data massaging that needs to happen in this office because of old. obsolete and unfriendly Record Management software).

The problem with such utilities is how to make them survive me.

It's hard enough writing several pages of instructions for my successor to export/massage/import/process the data, but if I use a python program, that adds another bunch of pages in the "Office Procedures" documentation for how to install and use Python and environment. And that drops any likelihood this program will ever be used beyond my exit to less than 1%.*read as: had a friend make for me
 
  • #22
DaveC426913 said:
By setting the format to text explicitly.
No, a number formatted as text is not text, it is still a number (you can verify this by adding 1 to it). This is a common source of problems when extracting numbers from strings in spreadsheets. The most reliable way around it is to make sure that anything that looks like a number is a number with VALUE, not to try to make numbers look like text.
 
  • Like
Likes hutchphd, DaveE, DaveC426913 and 1 other person
  • #23
Since you have access to Excel, I would suggest taking advantage of Power Query. Load every table into your data model, apply any necessary transformations to get each dataset into a consistent format, and then analyse them.

(You will need the actual desktop version of Excel, not the Office365 app.)
 
  • #24
Well this is disappointing.

It appears that a derived field does not want to be formatted as a number, so I can't MATCH it.

My Search data:
1675782708778.png


My Array to be searched:
1675782742349.png

Column B is derived from column A.
Column C is flat numbers (as a control).

The results:
1675782964324.png

Column B cannot find a MATCH to the derived number, even though they're both formatted as 'number'.
Column C (the control) MATCHes fine.

I am stumped.

Test file attached.
 
  • #25
pbuk said:
Yes, that will give a string (text). Convert it to a number with VALUE(MID(A1;3;10)).
Ah. @pbuk solved that already! Enclose it in VALUE(). (It also works to add +0)

1675784959491.png
 
  • #26
Dave, what was the formula you used in cell D1 of post #25? I'm using =MATCH(D1, B1:B5), but it just gives the (correct) index of the value I have in D1, of the column range B1 - B5.

I can't figure out how to use the index of a range to get the value there.
 
  • #27
Mark44 said:
I'm using =MATCH(D1, B1:B5), but it just gives the (correct) index of the value I have in D1, of the column range B1 - B5.
Tips for MATCH:
  • Always specify the final argument, which you most often want to be 0.
  • It is often best to MATCH on a whole column e.g. MATCH(D1,B:B), this avoids mistakes when you add more items to the list (and you should never put that formula in D1, or anywhere in column B as it would be a circular reference).
Mark44 said:
I can't figure out how to use the index of a range to get the value there.
You could try INDEX :-p
 
  • #28
pbuk said:
  • It is often best to MATCH on a whole column e.g. MATCH(D1,B:B), this avoids mistakes
Indeed. Too bad it doesn't work in OpenOffice Calc. :frown:
 
  • #29
DaveC426913 said:
Indeed. Too bad it doesn't work in OpenOffice Calc. :frown:
Yes, it does:

Screenshot_2023-02-07_17-48-39.png
 
  • #30
DaveC426913 said:
Indeed. Too bad it doesn't work in OpenOffice Calc. :frown:
In that case I strongly recommend you use guard rows (the narrow rows in the sheet below - I downloaded the latest OpenOffice just for you!)
1675810630626.png


Coming up: the formulas to achieve this.
 
  • #31
pbuk said:
Coming up: the formulas to achieve this.
Doesn't seem easy in OpenOffice, I'll give you the entries in row 3, the rest just copy down.
Code:
C3 =MATCH(A3;B$2:B$25;0)
D3 =IF(ISNA(C3);0;MAX(D2:D$2)+1)
E3 =E2+1
F3 =MATCH(E3;D$2:D$25;0)
G3 =IF(ISNA(F3);"";INDEX(A$2:A$25;$F3))
 
  • #32
jack action said:
Yes, it does:
It didn't for me (OpenOffice on Windows) - that looks like LibreOffice on Linux to me.
 
  • Like
Likes jack action and DaveC426913
  • #33
Yeah, works in LibreOffice on Windows too, and we can display the formulas - see below (different formulas as we don't need the guard rows now).

1675811998151.png


Can you persuade your client to move to LibreOffice, apart from these there are a number of technical and commercial reasons to drop OpenOffice since the fork?
 
  • #34
pbuk said:
Can you persuade your client...
Well, client (hospital) has Excel on their image for virtual servers, but it is such a pain to tunnel in via VPN to do this kind of finicky work.

The biggest headache is that the virtual environment actively discourages copy/pasting of any sort. I mean, literally copying a number from a Notepad doc and pasting it in to a spreadsheet cell will not work most of the time. There is some sort of chron job or utility that clears the clipboard once per second. If I hit Ctrl-C ten times in a row, and then paste within about two seconds I have maybe a 50% chance of pasting what was in the clipboard. Crazy stuff.
 
  • #35
DaveC426913 said:
Indeed. Too bad it doesn't work in OpenOffice Calc. :frown:
In my case (LibreOffice), after I begin writing a function, when I click on a column heading, it writes B1:B1048576, and then it transforms into B:B after I get out of the cell.

Maybe you can just write the full column range by default?
 
  • #36
jack action said:
Maybe you can just write the full column range by default?
Yeah. I just set it to A$1:A$500, which will cover the largest list I have.

But this really gets into the weeds of how I'm going to move forward with this. If this whole process has to survive me (which I verified is the desire, I'm going to have to use the client's preferred software, which is Excel.

Still, no idea where it's going to live or who will be the keeper of the lore. There's no "owner" and no central repository. But that's not an issue that will be solved here...
 
  • #37
DaveC426913 said:
But this really gets into the weeds of how I'm going to move forward with this. If this whole process has to survive me (which I verified is the desire, I'm going to have to use the client's preferred software, which is Excel.
Oh I have misunderstood, I didn't realise you were being paid for this - my bill is in the post.
 
  • Haha
Likes DaveC426913
  • #38
pbuk said:
Oh I have misunderstood, I didn't realise you were being paid for this - my bill is in the post.
It is technically for my (salaried) job, but I don't get expenses. I'm just expected to produce this report.
 
  • #39
OK, I've switched to Excel. I've got a primitive solution but since it uses the MATCH function, it spits out matches. Any matched records display as #N/A.

What I'm after is the opposite. I need to display non matches.

I've looked at VLOOKUP and HLOOKUP but that's not what I'm trying to do - at least they don't seem to work across two sheets.)

So:

Sheet "RCP":
7072Alice
4154Andy
etc.
Sheet "PSS":
7033Bill
7058Dave
7072Alice
7103Charlie
7113Annie
etc.
I hope to get:

RCP NO matches in PSS:
4154 Andy

PSS NO matches in RCP:
7033 Bill
7058 Dave
7103 Charlie
7113 Annie
 
  • #40
With 16K ids on the list, upload each list as a separate table into MS Access and do an outer join
 
  • #41
BWV said:
With 16K ids on the list, upload each list as a separate table into MS Access and do an outer join
I found an even better solution:
conscript a gaggle of interns to do it with pencil and paper.
 
  • #42
DaveC426913 said:
I found an even better solution:
conscript a gaggle of interns to do it with pencil and paper.
Well you could use the query to check the work, than fire the lot of them for wasting time doing it manually and not knowing how to do a simple query
 
  • Haha
Likes DaveC426913
Back
Top