This is my attempt at developing and deploying the machine learning engine used by Michael Becker in his PyCon 2014 talk on Realtime Predictive Analytics with scikit-learn & RabbitMQ. I will share my implementation with you as a series of tutorials.
To make the series easier to follow, it is broken down into three parts. I will start with the OSEMN (Obtain, Scrub, Explore, Model, and iNterpret) process tutorial as part 1; part 2 will cover model development; and finally part 3 will cover model distribution (i.e. deployment and scaling).
NB – This is my first attempt as a machine learning practitioner, so suggestions and any helpful advice are most welcome. If you have any questions about the code or the hypotheses I made, do not hesitate to post a comment in the comments section below.
All code for this tutorial is available in a GitHub repo which you can go ahead and fork!
Getting and processing the data
Obtain
For us to build a real-time predictive machine learning algorithm, we need to obtain training data consisting mainly of text in different languages. Wikipedia is a natural choice for this data source, as it has an abundant supply of articles in multiple languages through the dataset dumps it makes available online. The extraction process follows the one Michael Becker implemented: we use the Wikimedia Pageview API to get the top 1000 articles by pageview count per month for each language's Wikipedia project over the first six months of 2016.
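To get a feel for the raw data, here is a minimal sketch of a single Pageview API request. The response fields shown are the ones the insertion script below relies on, so treat the exact structure as an assumption inferred from that usage:
import requests

# top articles for the English Wikipedia in January 2016 (all access methods, all days)
url = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/top/'
       'en.wikipedia/all-access/2016/01/all-days')
result = requests.get(url).json()

# the insertion script below expects a single item holding the project name
# and a ranked list of articles with their view counts
item = result['items'][0]
print(item['project'])       # e.g. 'en.wikipedia'
print(item['articles'][0])   # e.g. {'article': ..., 'views': ..., 'rank': 1}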
We use MongoDB to persist the data from the JSON response of each API call, and its aggregation framework to then extract the top 5000 articles for each language. We do this by starting a MongoDB instance locally and connecting to it on the default port 27017 using pymongo, the official driver for MongoDB:
# -*- coding: utf-8 -*-
import requests

from lang_map import code_lang_map
from pymongo import InsertOne, MongoClient

# connect to a local MongoDB instance on the default port
connection = MongoClient('localhost', 27017)
# use the test database
db = connection.test
# handle to the wikipedia collection
wikipedia = db.wikipedia


def insert_top_articles():
    """
    Insert into Mongo the top 1000 articles per month by pageview count
    for the Wikipedia project of each language for the first 6 months of 2016.
    """
    # initialize bulk insert operations list
    ops = []
    # clear the existing wikipedia collection
    wikipedia.delete_many({})
    for lang in code_lang_map.keys():
        for month in range(1, 7):
            url = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/top/'
                   '{0}.wikipedia/all-access/2016/{1}/all-days').format(lang, str(month).zfill(2))
            try:
                result = requests.get(url).json()
                if 'items' in result and len(result['items']) == 1:
                    r = result['items'][0]
                    for article in r['articles']:
                        # tag each article document with its language code
                        article['lang'] = r['project'][:2]
                        ops.append(InsertOne(article))
            except Exception:
                print('ERROR while fetching or parsing ' + url)
    if ops:
        wikipedia.bulk_write(ops)
def get_top_articles(lang):
    """
    Aggregate the top 5000 articles by maximum monthly pageview count
    for the given language over the first 6 months of 2016.
    """
    # initialize aggregation pipeline
    pipeline = [
        { "$match": { "lang": lang } },
        {
            "$group": {
                "_id": "$article",
                "lang": { "$first": "$lang" },
                "max_views": { "$max": "$views" }
            }
        },
        {
            "$project": {
                "page": "$_id", "_id": 0,
                "lang": 1,
                "max_views": 1
            }
        },
        { "$sort": { "max_views": -1 } },
        { "$limit": 5000 }
    ]
    result = list(wikipedia.aggregate(pipeline))
    return result
Now, having generated the list of articles above, we export the content of each article through the Wikipedia Special:Export page, using this simple Python script to execute the export queries:
# -*- coding: utf-8 -*-
import bz2
import os
import requests

from shutil import copyfileobj


def load_articles(lang, pagelist):
    url = "https://{0}.wikipedia.org/w/index.php?title=Special:Export&action=submit".format(lang)
    origin = "https://{0}.wikipedia.org".format(lang)
    referer = "https://{0}.wikipedia.org/wiki/Special:Export".format(lang)
    filename = "dumps/wikipedia-{0}.xml".format(lang)
    # Special:Export expects one page title per line
    pages = '\n'.join(pagelist)
    headers = {
        "Origin": origin,
        "Accept-Encoding": "gzip,deflate,sdch",
        "User-Agent": "Mozilla/5.0 Chrome/35.0",
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Cache-Control": "max-age=0",
        "Referer": referer,
        "Connection": "keep-alive",
        "DNT": "1"
    }
    payload = {
        'catname': '',
        'pages': pages,
        'curonly': '1',
        'wpDownload': '1',
    }
    res = requests.post(url, data=payload, headers=headers)
    # write the raw XML dump, then recompress it as bz2 and drop the original
    with open(filename, 'wb') as f:
        f.write(res.content)
    with open(filename, 'rb') as infile:
        with bz2.BZ2File(filename + '.bz2', 'wb', compresslevel=9) as outfile:
            copyfileobj(infile, outfile)
    os.remove(filename)
    return filename + '.bz2'
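To tie the export step to the aggregation above, a minimal driver might look like the following sketch. It assumes get_top_articles and load_articles are importable in the same session, that the relative dumps/ directory is where the output should go, and that the full page list can be posted in one request (in practice you may need to send it in smaller chunks):
import os

from lang_map import code_lang_map

# ensure the output directory used by load_articles exists (assumed relative path)
os.makedirs('dumps', exist_ok=True)

for lang in code_lang_map.keys():
    # each aggregated document carries the article title under the 'page' key
    pagelist = [doc['page'] for doc in get_top_articles(lang)]
    dump = load_articles(lang, pagelist)
    print('wrote ' + dump)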
Here is an example of the downloaded XML file:
<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <!-- page namespace code -->
    <ns>0</ns>
    <id>2</id>
    <!-- If the page is a redirect, the "redirect" element contains the title of the target page -->
    <redirect title="Redirect page title" />
    <restrictions>edit=sysop:move=sysop</restrictions>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor>
        <username>Foobar</username>
        <id>65536</id>
      </contributor>
      <comment>I have just one thing to say!</comment>
      <text>A bunch of [text] here.</text>
      <minor />
    </revision>
    <revision>
      <timestamp>2001-01-15T13:10:27Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>new!</comment>
      <text>An earlier [[revision]].</text>
    </revision>
    <revision>
      <!-- deleted revision example -->
      <id>4557485</id>
      <parentid>1243372</parentid>
      <timestamp>2010-06-24T02:40:22Z</timestamp>
      <contributor deleted="deleted" />
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text deleted="deleted" />
      <sha1/>
    </revision>
  </page>
  <page>
    <title>Talk:Page title</title>
    <revision>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>hey</comment>
      <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
    </revision>
  </page>
</mediawiki>
Scrub
The dumps from the above export process are in XML format. To scrub the Wikipedia markup and convert it to plain text, I used the WikiExtractor.py Python script, since it extracts and cleans most of the markup from a Wikipedia database dump.
Example of Use
The following commands illustrate how to apply the script to a Wikipedia dump:
> python WikiExtractor.py 'dumps/wikipedia-en.xml.bz2' -cb 250K -o extracted -
In order to combine the whole extracted text into a single file one can issue:
> find extracted -name '*bz2' -exec bunzip2 -c {} \; > articles.xml
> rm -rf extracted
The output is stored in a number of files of similar size in a given directory. Each file will contain several documents in the format:
<doc id="" revid="" url="" title="">
...
</doc>
The output from the above tool needs to be cleaned further using a parser that strips the XML markup with a regex and returns a dictionary with the data we need. The dictionary has four keys: id (an integer tracking the version of the page), url (a permanent link to this version of the page), title, and content (the plain text of the page):
import bz2
import re

# named groups id, url, title and content, matching the <doc ...> blocks
# produced by WikiExtractor
article = re.compile(
    r'<doc id="(?P<id>[^"]+)" url="(?P<url>[^"]+)" title="(?P<title>[^"]+)">'
    r'\n(?P<content>.+)\n</doc>',
    re.S | re.U)


def parse(filename):
    data = ""
    with bz2.BZ2File(filename, 'r') as f:
        for line in f:
            line = line.decode('utf-8')
            data += line
            # once a closing </doc> tag is seen, extract the whole document
            if line.count('</doc>'):
                m = article.search(data)
                if m:
                    yield m.groupdict()
                data = ""
Explore
Now that we have scrubbed our data, we need to resist the urge to dive in and immediately start building models and getting answers, and instead explore this data with the overall aim of summarizing its main characteristics, i.e. seeing what the data can tell us beyond the formal modeling.
We can start off by analyzing the number of times each character occurs in each language. We use a Counter object to achieve this, as it turns a sequence of values into a defaultdict(int)-like object mapping keys to counts. For example:
from collections import Counter
c = Counter([0, 1, 2, 0])
# c is basically { 0: 2, 1: 1, 2: 1}
An instance of a Counter object has a most_common method which will be useful to get the top 2000 characters per language.
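As a tiny illustration of most_common on a sample string:
from collections import Counter

# the two most frequent characters in a small sample string
print(Counter('aabbbc').most_common(2))  # [('b', 3), ('a', 2)]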
This Python data structure will be ideal for generating the counts for each character in a language article:
import os

from collections import Counter

# one combined dump file per language sits in the current directory
files = [f for f in os.listdir('.') if os.path.isfile(f)]
top_letters = []
for f in files:
    print(f)
    c = Counter()
    for article in parse(f):
        c['articles_count'] += 1
        for letter in article['content']:
            c[letter] += 1
            c['letters_count'] += 1
    # keep the 2000 most frequent keys (the bookkeeping counts will be among them)
    d = dict(c.most_common(2000))
    # tag the row with its language code, assuming filenames like wikipedia-en.xml.bz2
    d['lang'] = f.split('-')[-1].split('.')[0]
    top_letters.append(d)
Once we have our list of dictionaries, top_letters, we can load it into a pandas DataFrame, which represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type:
from pandas import DataFrame
df = DataFrame(top_letters)
df.fillna(0, inplace=True)
df = df.set_index('lang')
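As a quick sanity check on the resulting frame (the 'en' index label here is an assumed example):
# shape: one row per language, one column per character plus the bookkeeping counts
print(df.shape)

# most frequent characters for one language; 'en' is an assumed example label,
# and the bookkeeping columns will naturally dominate the counts
print(df.loc['en'].sort_values(ascending=False).head(10))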
We will continue the series in the next part, where we will go one step further and use visualization tools to explore the data, and then develop the models for our language classifiers.