How do data scientists resolve name discrepancies like this?

  • Context: Python 
  • Thread starter Eclair_de_XII
  • Tags Data

Discussion Overview

The discussion revolves around the challenges data scientists face in resolving name discrepancies within datasets, particularly when dealing with variations in job titles across multiple .csv files. Participants explore methods for associating records to a common identifier, such as "Driver," and the implications of data cleaning and normalization processes.

Discussion Character

  • Exploratory
  • Technical explanation
  • Conceptual clarification
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests linking various job titles related to "Driver" to a common identifier, expressing concerns about the tediousness of the process and questioning the effectiveness of regular expressions.
  • Another participant proposes programming a function to associate records containing the term "driver" in a case-insensitive manner, acknowledging the potential for false positives that would require manual correction.
  • A different viewpoint highlights the complexity of natural language processing and suggests a balance between manual identification of relevant fields and automation, framing it as a trade-off between labor and efficiency.
  • One participant mentions the relevance of data cleaning methods in the context of ETL (Extract, Transform, Load) processes when creating data warehouses from diverse sources.
  • Another participant emphasizes the importance of building a "data dictionary" to understand existing data fields and their relationships, which is crucial before any data manipulation occurs.

Areas of Agreement / Disagreement

Participants express a variety of approaches to the problem, indicating that there is no consensus on a single best method for resolving name discrepancies. Different strategies, including manual and automated techniques, are discussed without agreement on their effectiveness.

Contextual Notes

Participants acknowledge the limitations of their proposed methods, including the potential for false positives and the need for manual intervention. The discussion also touches on the dependency of solutions on the specific context of the datasets being analyzed.

Eclair_de_XII
TL;DR
You have one data set containing an attribute about some person and many data sets pertaining to that same attribute. In each of these data sets, this attribute is spelt differently. How would a data scientist go about resolving these discrepancies?
Suppose you have a .csv file that resembles something like this:

Code:
Name,Profession, ...
Mike Jones, Driver, ...

And now suppose you have many .csv files pertaining to information about people who drive for a living.

Code:
Profession,Qualified to Operate Motor Vehicle,...
[variation of "driver"],TRUE,...

where these variations can range from "bus driver", "taxi driver", "self-employed on-call driver", etc.

Now suppose you wished to learn a bit more about Mr. Jones and his life as a driver of sorts. To get a general picture of this life, you would want to associate the information found in these files with him using the key: "Driver". Would correcting these discrepancies in the files be a job for a data correction team, or would the data scientist have a sure-fire way of associating "Mike Jones" with all these .csv files?

Personally, the way I would do it is to map each of these related professions to "Driver" and, using that as a conduit of sorts, link the information to Mr. Jones that way. But it seems tedious to me and I wonder if there is a better way. I'm wondering whether regular expressions would be appropriate here. I haven't actually learned about them, to be honest.
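The mapping approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory CSV strings stand in for the real files, and the names `CANONICAL`, `canonical_profession` and `link_records` are hypothetical, not anything from the thread.

```python
import csv
import io

# Hypothetical stand-ins for the files described above.
people_csv = "Name,Profession\nMike Jones,Driver\n"
jobs_csv = (
    "Profession,Qualified to Operate Motor Vehicle\n"
    "bus driver,TRUE\n"
    "taxi driver,TRUE\n"
    "self-employed on-call driver,TRUE\n"
)

# Hand-built lookup from each observed variant to one canonical key.
CANONICAL = {
    "driver": "Driver",
    "bus driver": "Driver",
    "taxi driver": "Driver",
    "self-employed on-call driver": "Driver",
}


def canonical_profession(raw):
    """Return the canonical key for a raw profession string, or None."""
    return CANONICAL.get(raw.strip().lower())


def link_records(people_text, jobs_text):
    """Map each person's name to the job rows sharing their canonical key."""
    jobs_by_key = {}
    for row in csv.DictReader(io.StringIO(jobs_text)):
        key = canonical_profession(row["Profession"])
        if key is not None:
            jobs_by_key.setdefault(key, []).append(row)

    linked = {}
    for person in csv.DictReader(io.StringIO(people_text)):
        key = canonical_profession(person["Profession"])
        linked[person["Name"]] = jobs_by_key.get(key, [])
    return linked


linked = link_records(people_csv, jobs_csv)
```

The tedious part is exactly the `CANONICAL` table: every new spelling has to be added by hand, which is what motivates the pattern-matching question.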

EDIT: Changed "driver" example to "psychologist".
EDIT 2: Changed example back to "driver".
 
You could program it to assign to the class 'driver' every record, in each csv of interest, whose profession string contains the pattern 'driver' once it has been converted to lower case (in Python, with str.lower()). You would get some false positives, which you would need to weed out by hand. As you manually identify each false positive (e.g. screwdriver salesperson), you could add it to a list of exclusions for the search algorithm to apply.

We generally call such pattern matching 'grepping', after the Unix tool grep. For a Python approach to grepping, see here.
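A minimal sketch of that suggestion, using Python's standard `re` module. The names `DRIVER_PATTERN`, `EXCLUSIONS` and `is_driver` are made up for illustration, and the single exclusion entry is just the example given above.

```python
import re

# The pattern 'driver' anywhere in the string, ignoring case.
DRIVER_PATTERN = re.compile(r"driver", re.IGNORECASE)

# Exclusion list, grown by hand as false positives are spotted.
EXCLUSIONS = {"screwdriver salesperson"}


def is_driver(profession):
    """True if the profession matches 'driver' and is not a known exclusion."""
    cleaned = profession.strip().lower()
    if cleaned in EXCLUSIONS:
        return False
    return DRIVER_PATTERN.search(cleaned) is not None
```

A word-boundary pattern such as `r"\bdriver"` would reject "screwdriver" without needing an exclusion entry, at the cost of missing run-together spellings like "busdriver".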
 
You may want to think about how a search engine like Google can perform so many successful searches when the users type in the queries with so many languages and so much flexibility in how to phrase the question.

You could spend a lifetime dealing with natural language processing. Or you could do no such processing and manually identify the fields you want to treat as "driver" in each of those csv files (as @andrewkirk suggested). Or you can find a compromise solution. It is a classic trade-off between manual labor and automation.
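One possible shape for such a compromise, sketched here with the standard library's `difflib` (a technique not mentioned in the thread): trust a hand-reviewed table first, fall back to approximate string matching against that table, and queue everything else for a human. The table contents, the `classify` name and the 0.8 cutoff are all assumptions chosen for illustration.

```python
import difflib

# Variants a person has already reviewed and mapped by hand.
REVIEWED = {"bus driver": "Driver", "taxi driver": "Driver"}


def classify(raw, cutoff=0.8):
    """Return (canonical key or None, how the decision was reached)."""
    cleaned = raw.strip().lower()
    if cleaned in REVIEWED:
        return REVIEWED[cleaned], "manual"
    # Approximate match against the reviewed spellings (catches typos).
    close = difflib.get_close_matches(cleaned, list(REVIEWED), n=1, cutoff=cutoff)
    if close:
        return REVIEWED[close[0]], "fuzzy"
    return None, "needs review"
```

Raising the cutoff shifts work toward manual review; lowering it shifts risk toward false matches.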
 
You're asking broadly how a data scientist works with this.

One of the first things a data scientist needs to do is build a "data dictionary". For an existing system, this is done by looking at all existing system documentation to identify what fields exist, where they come from, and how they are used. In most cases, this also involves interviewing a comprehensive sample of the data users.

With this information in hand, a full Boyce-Codd normalization can be done to identify the data relations. In the process, fields from multiple sources can be recognized as denoting the same information, and the data scientist can pick a name for each field to use in the dictionary.

All this, and nothing has been done with the data yet. But in the case of very large data sets, the data dictionary is often useful as a user reference document even if no other software development follows.
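The purely mechanical part of that first step, listing which fields exist and where they come from, might look something like the sketch below. It covers only the inventory, not the documentation review, interviews or normalization. The file names and the `build_field_inventory` helper are hypothetical.

```python
import csv
import io

# Hypothetical sources: file name -> file contents.
SOURCES = {
    "people.csv": "Name,Profession\nMike Jones,Driver\n",
    "licenses.csv": (
        "Profession,Qualified to Operate Motor Vehicle\n"
        "bus driver,TRUE\n"
    ),
}


def build_field_inventory(sources):
    """Map each header field to the list of sources it appears in."""
    inventory = {}
    for filename, text in sources.items():
        header = next(csv.reader(io.StringIO(text)))
        for field in header:
            inventory.setdefault(field.strip(), []).append(filename)
    return inventory


inventory = build_field_inventory(SOURCES)

# Fields present in more than one source are candidates for a shared key.
shared = sorted(f for f, files in inventory.items() if len(files) > 1)
```

Matching header names are only a hint: deciding whether two fields really denote the same information still takes the human review described above.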
 
Right-o. Thanks for the information. I'll keep these tips in mind moving forward.
 
