Two separate csv files with some common entries

  • Context: Python
  • Thread starter: EngWiPy
  • Tags: CSV files

Discussion Overview

The discussion revolves around the challenge of merging two dataframes, df1 and df2, based on a common 'Country' column where df2 contains country codes. Participants explore various methods to automate the extraction of these codes without manual entry, addressing issues related to differing country names and orders in the dataframes.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests using a hash table to store df2 keyed by country code for efficient lookups while processing df1.
  • Another participant describes their approach using nested for loops to match countries from df1 to df2, retrieving codes and storing them in a list.
  • Some participants clarify that Python's hash tables are called dictionaries and discuss their advantages over manual entry methods.
  • Concerns are raised about the standardization of country names, which may vary in spelling, order, or capitalization, complicating the matching process.
  • One participant mentions that with a limited number of countries, a brute force method of looping through lists may still be effective.
  • Another participant emphasizes that using hash tables can reduce unnecessary code and improve performance compared to sequential searches.
  • A participant shares their implementation using a dictionary to match countries and codes, asking for potential improvements.
  • A later post provides a pseudo code example in Perl for checking country abbreviation conflicts, illustrating the complexity of data validation in such tasks.

Areas of Agreement / Disagreement

Participants express varying opinions on the best approach to take, with some advocating for hash tables while others suggest simpler methods may suffice given the context. The discussion remains unresolved regarding the optimal solution, particularly in light of potential issues with data standardization.

Contextual Notes

Participants note limitations related to the standardization of country names, which may affect the accuracy of matches. There are also references to the performance implications of different coding approaches, but no consensus on the best method is reached.

EngWiPy
Hello,

I have two dataframes, df1 and df2. Both have a Country column, but one of them, say df2, has the codes for the country names as well. I wish to extract these codes and assign them to the corresponding countries in df1; I just don't want to look up and fill in the codes manually. I am hoping for something like this:

Python:
df1['Code'] = df2[df2['Country'] == df1_countries]['Code']

but I have two problems:
1. The numbers of countries in the two dataframes are different, and
2. They are not in the same order (a country in df1 can be found anywhere in df2).

Thanks in advance
 
Can you read one dataframe into a hash table keyed by country code? Then you’d be able to find matches as you read each line of the other file and then check the hash table for a matching country code.
 
jedishrfu said:
Can you read one dataframe into a hash table keyed by country code? Then you’d be able to find matches as you read each line of the other file and then check the hash table for a matching country code.

I am not familiar with hash tables, but I did it with two for loops: the outer one iterates over the countries in df1, and the inner one iterates over the countries in df2. For each country in df1, once I found its match in df2, I retrieved that country's Code from df2 and appended it to a list. At the end I created a new column in df1 and assigned it the values of the resulting list.
 
Python calls hash tables dicts. Well worth looking up if you aren't familiar with them. Basically they are unordered arrays that are indexed by more or less anything (anything that implements the __hash__ method, in Python). So you could create a dict whose keys are the country names and whose values are the codes. Then you loop over the countries in df1 and look each one up in the dict.
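A minimal sketch of that dict approach, assuming the Country and Code column names from the thread (the sample frames are made up for illustration):

Python:
import pandas as pd

# Hypothetical stand-ins for df1 and df2 from the thread
df1 = pd.DataFrame({'Country': ['France', 'Japan', 'Brazil']})
df2 = pd.DataFrame({'Country': ['Brazil', 'France', 'Japan'],
                    'Code': ['BR', 'FR', 'JP']})

# Build the dict (hash table) from df2: country name -> code
codes = dict(zip(df2['Country'], df2['Code']))

# One pass over df1, looking each country up in the dict;
# unmatched names get an empty string
df1['Code'] = [codes.get(c, '') for c in df1['Country']]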
 
Ibix said:
Python calls hash tables dicts. Well worth looking up if you aren't familiar with them. Basically they are unordered arrays that are indexed by more or less anything (anything that implements the __hash__ method, in Python). So you could create a dict whose keys are the country names and whose values are the codes. Then you loop over the countries in df1 and look each one up in the dict.

But the main idea is to avoid typing the codes manually. Otherwise, I would have added them to my file without the need of the other file. I am familiar with dictionaries in python, and I think I get the idea, but it is what I want to avoid, since I have many countries.
 
Don't populate the dict manually - populate it from the code/country pairing you've got in df2. Since hash lookup is a lot faster than the sequential search you are using, the result should be a lot quicker.
 
Regardless of the approach used, a common problem when working with two databases is that the country names are not completely standardized. One might be abbreviated, in a different order, or capitalized differently (sometimes even inconsistently within its own entries). On a large project, that can be a nightmare. Typically, you want to automate the matching and print any problems so that you can fix things up. A final clean automated run then tells you that everything can be matched.
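A sketch of that automated mismatch report: normalize names before matching, and collect anything that still fails so it can be fixed by hand. The country data and the norm helper here are hypothetical.

Python:
def norm(name):
    """Lowercase, trim, and collapse internal whitespace."""
    return ' '.join(name.strip().lower().split())

# Hypothetical code table and incoming country list
df2_codes = {norm('United States'): 'US', norm('South Korea'): 'KR'}
df1_countries = ['United  States', 'Korea, South']

# Report anything that survives normalization unmatched
unmatched = [c for c in df1_countries if norm(c) not in df2_codes]
for c in unmatched:
    print(f'No match for {c!r} - fix manually')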
 
FactChecker said:
a common problem when working with two data bases would be that the country names are not completely standardized
Indeed. One of the three rules for working with data is "do not trust your data"[1]. There's always someone who enters "Isle of Angelsey".

[1] The other two rules are "do not trust your data" and "do not trust your data".
 
S_David said:
But the main idea is to avoid typing the codes manually. Otherwise, I would have added them to my file without the need of the other file. I am familiar with dictionaries in python, and I think I get the idea, but it is what I want to avoid, since I have many countries.

It should work quite well with up to 10000 entries in your hash table.
 
  • #10
Since there are so few countries in the world, the brute force method of looping through a list and comparing should work fine.
 
  • #11
FactChecker said:
Since there are so few countries in the world, the brute force method of looping through a list and comparing should work fine.

Yes, of course. It's just that you're writing extra code that's unnecessary if you use a hash table. That's why folks use higher-level languages: for the conveniences they offer the programmer, and because the provided implementation is often faster than what you'd code yourself.
 
  • #12
jedishrfu said:
Yes, of course. It's just that you're writing extra code that's unnecessary if you use a hash table. That's why folks use higher-level languages: for the conveniences they offer the programmer, and because the provided implementation is often faster than what you'd code yourself.
I agree. Hash tables are the best way to do it (as long as their limits are not exceeded).
 
  • #13
I did it as follows, using a dictionary:

Python:
d = {}
for index, row in df2.iterrows():
    d[row['Country']] = row['Code']

Code = []
for index, row in df1.iterrows():
    country = row['Country']
    if country in d:
        Code.append(d[country])
    else:  # Not all countries are found, or some have a different name (to be entered manually)
        Code.append('')
df1['Code'] = Code

Can I improve it more than this?
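One possible tightening, a sketch rather than the thread's accepted answer: pandas' Series.map can replace both loops, doing the dict build and the per-row lookup in one step. Column names are the ones used in the thread; the sample data is made up, and this assumes country names in df2 are unique.

Python:
import pandas as pd

# Hypothetical stand-ins for the thread's frames;
# 'Atlantis' has no code on purpose, to show the unmatched case
df1 = pd.DataFrame({'Country': ['France', 'Atlantis', 'Japan']})
df2 = pd.DataFrame({'Country': ['Japan', 'France'], 'Code': ['JP', 'FR']})

# A Series indexed by country acts as the lookup table;
# map() returns NaN for unmatched countries, replaced with ''
# to mirror the loop version above
codes = pd.Series(df2['Code'].values, index=df2['Country'])
df1['Code'] = df1['Country'].map(codes).fillna('')

This avoids iterrows() entirely, which also tends to be faster on large frames.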
 
  • #14
Just to give you an example of the type of data checking that you might want to do for an official program, I made a rough cut at Perl pseudo code. (Sorry, I don't know Python.) I have ignored any possible errors regarding Perl syntax or accessing the data in d1 and d2. So don't expect this to be working Perl code. It's just to give you a rough idea.
Perl:
# ================================================================
# Process country abbreviations from d1
# ================================================================
# Process country abbreviations from d1
foreach $record (@d1){
    $abbreviation = $record->{'abbreviation'};
    $country = $record->{'country'};
    # check if $abbreviation has already been used by a different country
    if( ( defined $abbr_used{$abbreviation} )
               &&  ( $abbr_used{$abbreviation} ne $country ) )
    {
        print "Abbreviation $abbreviation use conflict: $abbr_used{$abbreviation} and $country\n";
    }

    # check if country has already been assigned a different abbreviation
    if( ( defined $abbr{$country} )
               &&  ( $abbr{$country} ne $abbreviation ) )
    {
        print "Country $country abbreviation conflict: $abbr{$country} and $abbreviation\n";
    }

    # ================================================================
    # Store abbreviation for country
    $abbr{$country} = $abbreviation;
    $abbr_used{$abbreviation} = $country;
}

# ================================================================
# Process d2 countries
foreach $record (@d2){
    $country = $record->{'country'};
    # check if there is a d1 abbreviation for this country
    if( ! defined $abbr{$country} ){
        print "No abbreviation defined in d1 for country $country\n";
        $record->{'abbreviation'} = 'undefined';
    }else{
        $record->{'abbreviation'} = $abbr{$country};
    }
}
 
