Some time ago I was doing a demo for querying Politically Exposed Persons from Wikidata. This was supposed to be a part of a bigger project which didn’t materialize. Nevertheless the code can be a good exercise in the topic, so I am sharing it for anyone that might be curious.
Additionally, I wanted to do the project in python
, so that finally I have enough reasons to make a demo shiny app in python
(this will be in part two of this series).
Wikidata, the knowledge base maintained by Wikimedia, has a SPARQL
query service. SPARQL
is kind of like SQL
, but not really, and it takes some time getting used to it. Fortunately, Wikidata maintains excelent documentation, as well as a point and click query builder which is quite useful.
The Wikidata query
It is possible to explore wikidata through the query builder. I started working on the query line by line.
The final query I am using is this:
SELECT ?personLabel ?personGenderLabel ?dateOfBirth ?politicalPartyLabel ?spouseLabel ?childLabel WHERE {
?person wdt:P106 wd:Q82955;
wdt:P27 wd:Q221;
wdt:P21 ?personGender;
wdt:P569 ?dateOfBirth;
rdfs:label ?personLabel.
FILTER(LANG(?personLabel) = "[AUTO_LANGUAGE]").
#FILTER(STRSTARTS(?personLabel, "Q")).
OPTIONAL {?person wdt:P102 ?politicalParty. }
OPTIONAL {?person wdt:P26 ?spouse. }
OPTIONAL {?person wdt:P40 ?child. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 500
But let’s try to explain what is happening here.
First I started by trying to find all persons that are politicians in North Macedonia.
That query looked like this:
SELECT ?personLabel WHERE {
?person wdt:P106 wd:Q82955;
wdt:P27 wd:Q221;
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
This query returns all persons where occupation (wdt:P106
) is politician (wd:Q82955
) and citizenship (wdt:P27
) is North Macedonia (wd:Q221
). These prefixes are a bit confusing and I am not sure I can explain them well, but here is a good reference. I think the easiest way to get these values is through the query builder, though searching on Wikidata is also an option.
As the query progresses you will notice two things specific to Wikidata querying. First, it is the lines with optional
. This is where we say to return the persons and their political party, but also return the person if there is no political party listed. Arguably, the same can be done for gender and date of birth, but I just assumed all persons will have those on file, as opposed to political parties, spouses and children.
The second thing is the line that starts with rdfs
and the following filter
lines. The problem I had with the returned data was that for some persons there was no label, but the query returned the unique identifier of a data item on Wikidata (a string that starts with Q
and has bunch of numbers after it). This is why I tried to filter out those rows, although it seems only the first filter line takes care of the issue. I am not entirely sure how that part of the query works.
Finally, the LIMIT
is not really relevant for North Macedonia since the query returns around 350 rows anyway. But it may be a good thing to have it if the queried country is different.
On to Python
The module used to interact with Wikidata is called SPARQLWrapper
, first thing as always, install it. I am also importing pandas
because we need to wrangle the data too.
Now, the SPARQLWrapper
code below comes from Wikidata as well. Once you run the query in the query service there is an option to copy the code for several languages.
Note however, that the query we are sending below is slightly different from the one above. For some reasons, there were issues with the filter
clauses when using the query as part of a SPARQLWrapper
call, and I removed them.
# pip install sparqlwrapper
# https://rdflib.github.io/sparqlwrapper/
import sys
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import networkx as nx
# Query Macedonian politicians and their political party affiliation
endpoint_url = "https://query.wikidata.org/sparql"
query = """SELECT ?personLabel ?personGenderLabel ?dateOfBirth ?politicalPartyLabel ?spouseLabel ?childLabel WHERE {
?person wdt:P106 wd:Q82955;
wdt:P27 wd:Q221;
wdt:P21 ?personGender;
wdt:P569 ?dateOfBirth;
OPTIONAL {?person wdt:P102 ?politicalParty. }
OPTIONAL {?person wdt:P26 ?spouse. }
OPTIONAL {?person wdt:P40 ?child. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 500"""
def get_results(endpoint_url, query):
user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
# TODO adjust user agent; see https://w.wiki/CX6
sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
return sparql.query().convert()
results = get_results(endpoint_url, query)
Data wrangling
The object returned by the query service is a dict, which we can verify.
type(results)
The data we need is in a list of dictionaries. Fortunately pandas
can deal with this in one line of code.
df = pd.json_normalize(results['results']['bindings'])
df.columns
Index(['dateOfBirth.datatype', 'dateOfBirth.type', 'dateOfBirth.value',
'personLabel.xml:lang', 'personLabel.type', 'personLabel.value',
'personGenderLabel.xml:lang', 'personGenderLabel.type',
'personGenderLabel.value', 'politicalPartyLabel.xml:lang',
'politicalPartyLabel.type', 'politicalPartyLabel.value',
'spouseLabel.xml:lang', 'spouseLabel.type', 'spouseLabel.value',
'childLabel.xml:lang', 'childLabel.type', 'childLabel.value'],
dtype='object')
We only want to keep the columns that have the value
. Another one liner from pandas
:
df_filtered = df[df.filter(like='value').columns]
Before moving on to other things, we can rename the columns, and change the date of birth to date time.
df_filtered = df_filtered.rename(columns={'dateOfBirth.value': 'dob', 'personLabel.value': 'name',
'personGenderLabel.value': 'gender', 'politicalPartyLabel.value': 'party',
'childLabel.value': 'child_name', 'spouseLabel.value': 'spouse_name'})
df_filtered["dob"] = pd.to_datetime(df_filtered["dob"])
There are some interesting graphs / tables that can be pulled from the data. For example:
- age distribution of men and women;
- number of politicians per political party;
- politicians who switched parties;
The last one is particularly useful here because we cann detect mismathes in the date of births of the politicans as well as their allegiance changes.
df_filtered[df_filtered.name.duplicated(keep=False)].sort_values('name')
dob name gender party spouse_name child_name
302 1937-01-01 00:00:00+00:00 Dimitar Dimitrov male NaN Ratka Dimitrova Nikola Dimitrov
301 1936-05-30 00:00:00+00:00 Dimitar Dimitrov male NaN Ratka Dimitrova Nikola Dimitrov
The main goals was to try and visualize a network of relations, so let’s move on to that.
The data for Macedonian politicians are actually quite poor. A lot data points are missing.
df_rel = df_filtered.dropna(subset=['child_name', 'spouse_name'], how='all')
df_names = df_rel[['name', 'child_name', 'spouse_name']]
Eventually we keep the names of the politicians, and names of their spouses / children.
df_names = df_names.drop_duplicates('name')
Now we pivot the table from wide to long to get it ready for plotting with networkx
.
df_names_long = pd.melt(df_names, id_vars=['name'], value_vars=['child_name', 'spouse_name'])
df_names_long = df_names_long.dropna(subset = ['value'])
df_names_long
name variable value
4 Kole Čašule child_name Slobodan Čašule
5 Krste Crvenkovski child_name Stevo Crvenkovski
6 Dimitar Dimitrov child_name Nikola Dimitrov
7 Vera Dolgova-Korubin child_name Rubens Korubin
8 Esma Redžepova spouse_name Stevo Teodosievski
9 Kiro Gligorov spouse_name Nada Gligorova
10 Zoran Zaev spouse_name Zorica Zaeva
11 Boris Trajkovski spouse_name Vilma Trajkovska
14 Dimitar Dimitrov spouse_name Ratka Dimitrova
15 Vera Dolgova-Korubin spouse_name Q12286704
Finally, we plot:
g = nx.from_pandas_edgelist(df_names_long, source = 'name', target = 'value')
nx.draw_kamada_kawai(g, with_labels=True)
The graph doesn’t look spectacular, but it show the possibilities of what can be achieved. The few data points we had actually helped having a readable graph like this. Probably a different visualization would be better suited if the relations were many.