Python Two separate csv files with some common entries

  • Thread starter Thread starter EngWiPy
  • Start date Start date
  • Tags Tags
    Csv files
AI Thread Summary
The discussion revolves around merging two dataframes, df1 and df2, where df1 contains country names and df2 includes country codes. The user seeks an automated method to match and assign these codes without manual input, despite differences in country names and order between the two dataframes. Suggestions include using hash tables (dictionaries in Python) for efficient lookups, which can significantly speed up the process compared to nested loops. Additionally, the conversation highlights the importance of data standardization to avoid mismatches due to variations in country naming conventions. Overall, leveraging hash tables is recommended for improved performance and ease of data handling.
EngWiPy
Messages
1,361
Reaction score
61
Hello,

I have two dataframes, df1 and df2. Both have Country column, but one, say df2, has the codes of the country names as well. I wish to extract these codes, and assign them to the corresponding countries in df1. I just don't want to look up and fill the codes manually. I am hoping to find something like this:

Python:
df1['Code'] = df2[df2['Country'] == df1_countries]['Code']

but I have two problems:
1. The number of countries in both dataframes are different, and
2. They are not of the same order (countries of df1 can be found somewhere in df2).

Thanks in advance
 
Technology news on Phys.org
Can you read one dataframe into a hash table keyed by country code? Then you’d be able to find matches as you read each line of the other file and then check the hash table for a matching country code.
 
  • Like
Likes QuantumQuest, FactChecker, EngWiPy and 1 other person
jedishrfu said:
Can you read one dataframe into a hash table keyed by country code? Then you’d be able to find matches as you read each line of the other file and then check the hash table for a matching country code.

I am not familiar with hash tables, but I did it with two for loops, the first (outer) one iterates over the countries in df1, and the second iterates over the countries in df2. I then found a match for each country in df1 in df2, and once I found it, I retrieved the Code of that country from df2 and added it to a list. At the end I created a new column in df1 assigned the values of the resulted list.
 
Python calls hash tables dicts. Well worth looking up if you aren't familiar with them. Basically they are unordered arrays that are indexed by more or less anything (anything that implements the _hash_ function, in python). So you could create a dict whose indices were the country names and whose values were the codes. Then you loop over the countries df1 and look each one up in the dict.
 
  • Like
Likes EngWiPy
Ibix said:
Python calls hash tables dicts. Well worth looking up if you aren't familiar with them. Basically they are unordered arrays that are indexed by more or less anything (anything that implements the _hash_ function, in python). So you could create a dict whose indices were the country names and whose values were the codes. Then you loop over the countries df1 and look each one up in the dict.

But the main idea is to avoid typing the codes manually. Otherwise, I would have added them to my file without the need of the other file. I am familiar with dictionaries in python, and I think I get the idea, but it is what I want to avoid, since I have many countries.
 
Don't populate the dict manually - populate it from the code/country pairing you've got in df2. Since hash lookup is a lot faster than the sequential search you are using, the result should be a lot quicker.
 
  • Like
Likes QuantumQuest and jedishrfu
Regardless of the approach that is used, a common problem when working with two data bases would be that the country names are not completely standardized. One might be abbreviated, in a different order, or different capitalization (sometimes even within its own entries). On a large project, that can be a nightmare. Typically, you want to automate the matching and print any problems so that you can fix things up. Then a final clean automated run tells you that everything can be matched.
 
  • Like
Likes QuantumQuest, jedishrfu and Ibix
FactChecker said:
a common problem when working with two data bases would be that the country names are not completely standardized
Indeed. One of the three rules for working with data is "do not trust your data"[1]. There's always someone who enters "Isle of Angelsey".

[1] The other two rules are "do not trust your data" and "do not trust your data".
 
  • Like
Likes QuantumQuest, jedishrfu and FactChecker
S_David said:
But the main idea is to avoid typing the codes manually. Otherwise, I would have added them to my file without the need of the other file. I am familiar with dictionaries in python, and I think I get the idea, but it is what I want to avoid, since I have many countries.

It should work quite well with up to 10000 entries in your hash table.
 
  • #10
Since there are so few countries in the world, the brute force method of looping through a list and comparing should work fine.
 
  • #11
FactChecker said:
Since there are so few countries in the world, the brute force method of looping through a list and comparing should work fine.

Yes, of course. It’s just that you’re writing extra code that’s unnecessary if you use a hash table. It’s why folks use higher level languages for the conveniences offered to the programmer and that the provided implementation is often faster than what you code yourself.
 
  • Like
Likes FactChecker
  • #12
jedishrfu said:
Yes, of course. It’s just that you’re writing extra code that’s unnecessary if you use a hash table. It’s why folks use higher level languages for the conveniences offered to the programmer and that the provided implementation is often faster than what you code yourself.
I agree. Hash functions are the best way to do it (as long as its limits are not exceeded).
 
  • #13
I did it as following using a dictionary:

Python:
d = {}
for index, row in df2.iterrows():
    d[row['Country']] = row['Code']
Code = []
for index, row in df1.iterrows():
    country = row['Country']
    if country in d.keys():
        Code.append(d[country])
    else:#Not all countries are found, or may have a different name (to be entered manually)
        Code.append('')
df1['Code'] = Code

Can I improve it more than this?
 
  • #14
Just to give you an example of the type of data checking that you might want to do for an official program, I made a rough cut at Perl pseudo code. (Sorry, I don't know Python.) I have ignored any possible errors regarding Perl syntax or accessing the data in d1 and d2. So don't expect this to be working Perl code. It's just to give you a rough idea.
Perl:
# ================================================================
# Process country abbreviations from d1
foreach $record (@d1){
    $abbreviation = $record{'abbreviation'};
    $country = $record{'country'};
    # check if $abbreviation has already been used by a different country
    if( ( defined $abbr_used{$abbreviation})
               &&  ( $abbr_used{$abbreviation} ne $country )
    {
        print "Abbreviation $abbreviation use conflict: $abbr_used{$abbreviation} and $country\n";
    }
   
    # check if country has already been assigned a different abbreviation
    if( (defined $abbr{$country} )
               &&  ( $abbr{$country} ne $abbreviation)
    {
        print "Country $country abbreviation conflict: $abbr{$country} and $abbreviation\n";
    }
   
    # ================================================================
    # Store abbreviation for country
    $abbr{$country} = $abbreviation;
    $abbr_used{$abbreviation} = $country;
}

# ================================================================
# Process d2 countries
foreach $record (@d2){
    $country = $record{'country'};
    # check if there is a d1 abbreviation for this country
    if( ! defined $abbr{$country} ){
        print "No abbreviation defined in d1 for country $country\n";
        $record{'abbreviation'} = 'undefined';
    }else{
        $record{'abbreviation'} = $abbr{$country};
    }
}
 
  • Like
Likes EngWiPy
Back
Top