Two separate csv files with some common entries

  • Context: Python
  • Thread starter: EngWiPy
  • Tags: CSV files

Discussion Overview

The discussion revolves around the challenge of merging two dataframes, df1 and df2, based on a common 'Country' column where df2 contains country codes. Participants explore various methods to automate the extraction of these codes without manual entry, addressing issues related to differing country names and orders in the dataframes.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant suggests using a hash table to store df2 keyed by country code for efficient lookups while processing df1.
  • Another participant describes their approach using nested for loops to match countries from df1 to df2, retrieving codes and storing them in a list.
  • Some participants clarify that Python's hash tables are called dictionaries and discuss their advantages over manual entry methods.
  • Concerns are raised about the standardization of country names, which may vary in spelling, order, or capitalization, complicating the matching process.
  • One participant mentions that with a limited number of countries, a brute force method of looping through lists may still be effective.
  • Another participant emphasizes that using hash tables can reduce unnecessary code and improve performance compared to sequential searches.
  • A participant shares their implementation using a dictionary to match countries and codes, asking for potential improvements.
  • A later post provides a pseudo code example in Perl for checking country abbreviation conflicts, illustrating the complexity of data validation in such tasks.

Areas of Agreement / Disagreement

Participants express varying opinions on the best approach to take, with some advocating for hash tables while others suggest simpler methods may suffice given the context. The discussion remains unresolved regarding the optimal solution, particularly in light of potential issues with data standardization.

Contextual Notes

Participants note limitations related to the standardization of country names, which may affect the accuracy of matches. There are also references to the performance implications of different coding approaches, but no consensus on the best method is reached.

EngWiPy
Hello,

I have two dataframes, df1 and df2. Both have a Country column, but one of them, say df2, has the codes for the country names as well. I wish to extract these codes and assign them to the corresponding countries in df1; I just don't want to look up and fill in the codes manually. I am hoping for something like this:

Python:
df1['Code'] = df2[df2['Country'] == df1_countries]['Code']

but I have two problems:
1. The numbers of countries in the two dataframes are different, and
2. They are not in the same order (a country in df1 can be found anywhere in df2).

Thanks in advance
 
Can you read one dataframe into a hash table keyed by country code? Then you’d be able to find matches as you read each line of the other file and then check the hash table for a matching country code.
 
jedishrfu said:
Can you read one dataframe into a hash table keyed by country code? Then you’d be able to find matches as you read each line of the other file and then check the hash table for a matching country code.

I am not familiar with hash tables, but I did it with two for loops: the outer one iterates over the countries in df1, and the inner one iterates over the countries in df2. For each country in df1, once I found its match in df2, I retrieved that country's Code from df2 and appended it to a list. At the end I created a new column in df1 and assigned it the values of the resulting list.
 
Python calls hash tables dicts. Well worth looking up if you aren't familiar with them. Basically they are unordered arrays that are indexed by more or less anything (anything that implements the __hash__ method, in Python). So you could create a dict whose keys are the country names and whose values are the codes. Then you loop over the countries in df1 and look each one up in the dict.
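A minimal sketch of that dict approach, assuming the Country and Code column names from the thread (the sample frames are made up for illustration):

Python:
import pandas as pd

# Hypothetical stand-ins for df1 and df2 from the thread
df1 = pd.DataFrame({'Country': ['France', 'Japan', 'Brazil']})
df2 = pd.DataFrame({'Country': ['Brazil', 'France', 'Japan'],
                    'Code': ['BR', 'FR', 'JP']})

# Build the dict (hash table) from df2: country name -> code
codes = dict(zip(df2['Country'], df2['Code']))

# One pass over df1, looking each country up in the dict;
# unmatched names get an empty string
df1['Code'] = [codes.get(c, '') for c in df1['Country']]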
 
Ibix said:
Python calls hash tables dicts. Well worth looking up if you aren't familiar with them. Basically they are unordered arrays that are indexed by more or less anything (anything that implements the __hash__ method, in Python). So you could create a dict whose keys are the country names and whose values are the codes. Then you loop over the countries in df1 and look each one up in the dict.

But the main idea is to avoid typing the codes manually. Otherwise, I would have added them to my file without the need of the other file. I am familiar with dictionaries in python, and I think I get the idea, but it is what I want to avoid, since I have many countries.
 
Don't populate the dict manually - populate it from the code/country pairing you've got in df2. Since hash lookup is a lot faster than the sequential search you are using, the result should be a lot quicker.
 
Regardless of the approach used, a common problem when working with two databases is that the country names are not completely standardized. One might be abbreviated, in a different order, or capitalized differently (sometimes even inconsistently within its own entries). On a large project, that can be a nightmare. Typically, you want to automate the matching and print any problems so that you can fix things up. A final clean automated run then tells you that everything can be matched.
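A sketch of that automated mismatch report: normalize names before matching, and collect anything that still fails so it can be fixed by hand. The country data and the norm helper here are hypothetical.

Python:
def norm(name):
    """Lowercase, trim, and collapse internal whitespace."""
    return ' '.join(name.strip().lower().split())

# Hypothetical code table and incoming country list
df2_codes = {norm('United States'): 'US', norm('South Korea'): 'KR'}
df1_countries = ['United  States', 'Korea, South']

# Report anything that survives normalization unmatched
unmatched = [c for c in df1_countries if norm(c) not in df2_codes]
for c in unmatched:
    print(f'No match for {c!r} - fix manually')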
 
FactChecker said:
a common problem when working with two data bases would be that the country names are not completely standardized
Indeed. One of the three rules for working with data is "do not trust your data"[1]. There's always someone who enters "Isle of Angelsey".

[1] The other two rules are "do not trust your data" and "do not trust your data".
 
S_David said:
But the main idea is to avoid typing the codes manually. Otherwise, I would have added them to my file without the need of the other file. I am familiar with dictionaries in python, and I think I get the idea, but it is what I want to avoid, since I have many countries.

It should work quite well with up to 10000 entries in your hash table.
 
  • #10
Since there are so few countries in the world, the brute force method of looping through a list and comparing should work fine.
 
  • #11
FactChecker said:
Since there are so few countries in the world, the brute force method of looping through a list and comparing should work fine.

Yes, of course. It's just that you're writing extra code that's unnecessary if you use a hash table. That's why folks use higher-level languages: for the conveniences they offer the programmer, and because the provided implementation is often faster than what you'd code yourself.
 
  • #12
jedishrfu said:
Yes, of course. It's just that you're writing extra code that's unnecessary if you use a hash table. That's why folks use higher-level languages: for the conveniences they offer the programmer, and because the provided implementation is often faster than what you'd code yourself.
I agree. Hash tables are the best way to do it (as long as their limits are not exceeded).
 
  • #13
I did it as follows, using a dictionary:

Python:
d = {}
for index, row in df2.iterrows():
    d[row['Country']] = row['Code']

Code = []
for index, row in df1.iterrows():
    country = row['Country']
    if country in d:
        Code.append(d[country])
    else:  # Not all countries are found, or some have a different name (to be entered manually)
        Code.append('')
df1['Code'] = Code

Can I improve it more than this?
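One possible tightening, a sketch rather than the thread's accepted answer: pandas' Series.map can replace both loops, doing the dict build and the per-row lookup in one step. Column names are the ones used in the thread; the sample data is made up, and this assumes country names in df2 are unique.

Python:
import pandas as pd

# Hypothetical stand-ins for the thread's frames;
# 'Atlantis' has no code on purpose, to show the unmatched case
df1 = pd.DataFrame({'Country': ['France', 'Atlantis', 'Japan']})
df2 = pd.DataFrame({'Country': ['Japan', 'France'], 'Code': ['JP', 'FR']})

# A Series indexed by country acts as the lookup table;
# map() returns NaN for unmatched countries, replaced with ''
# to mirror the loop version above
codes = pd.Series(df2['Code'].values, index=df2['Country'])
df1['Code'] = df1['Country'].map(codes).fillna('')

This avoids iterrows() entirely, which also tends to be faster on large frames.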
 
  • #14
Just to give you an example of the type of data checking that you might want to do for an official program, I made a rough cut at Perl pseudo code. (Sorry, I don't know Python.) I have ignored any possible errors regarding Perl syntax or accessing the data in d1 and d2. So don't expect this to be working Perl code. It's just to give you a rough idea.
Perl:
# ================================================================
# Process country abbreviations from d1
# ================================================================
# Process country abbreviations from d1
foreach $record (@d1){
    $abbreviation = $record->{'abbreviation'};
    $country = $record->{'country'};
    # check if $abbreviation has already been used by a different country
    if( ( defined $abbr_used{$abbreviation} )
               &&  ( $abbr_used{$abbreviation} ne $country ) )
    {
        print "Abbreviation $abbreviation use conflict: $abbr_used{$abbreviation} and $country\n";
    }

    # check if country has already been assigned a different abbreviation
    if( ( defined $abbr{$country} )
               &&  ( $abbr{$country} ne $abbreviation ) )
    {
        print "Country $country abbreviation conflict: $abbr{$country} and $abbreviation\n";
    }

    # ================================================================
    # Store abbreviation for country
    $abbr{$country} = $abbreviation;
    $abbr_used{$abbreviation} = $country;
}

# ================================================================
# Process d2 countries
foreach $record (@d2){
    $country = $record->{'country'};
    # check if there is a d1 abbreviation for this country
    if( ! defined $abbr{$country} ){
        print "No abbreviation defined in d1 for country $country\n";
        $record->{'abbreviation'} = 'undefined';
    }else{
        $record->{'abbreviation'} = $abbr{$country};
    }
}
 
