Originally, when I set out to do this, I thought it would take a day or two at most.
I wanted to get a frequency count of every word on Wikipedia for a project on calculating different media sources’ biases (in terms of coverage, at least) on different topics.
I have a super shiny desktop PC with 32 cores (and 64 threads), so on paper this should not have been a particularly difficult challenge, as it’s a task that can be done in parallel. However, my measly 64GB of RAM is not enough to hold English-language Wikipedia in memory (boo) - so I had to design every part of this around that constraint.
But a difficult challenge I wanted; and so I set the conditions of success:
- I wanted to do it in Node - a single-threaded JavaScript runtime.
- I wanted to use the official Wikipedia backup, so I could repeat the process when future data is released.
- I wanted it to be as fast as reasonably possible given those constraints.
I won’t go into detail about each of the wrong turns I took along the way, but those mistakes included:
- Trying to read data using a streaming XML parser (yes, Wikipedia’s backups are in XML)
- Trying to parse the Wiki markup myself
- Trying to write the results using JSON.stringify()
- Trying to read 80GB of data through MongoDB’s Node API
- Using the pre-installed bzip2 decompression instead of a parallel one
- Assuming someone had made a convenient API with good performance for reading lines from a read stream
…and many more…
So how do I do it?
Note: this project’s code and the top results are available on GitHub
- Download it: https://dumps.wikimedia.org/backup-index.html (the latest bz2 XML backup - you’d do well not to try and forge your own path here)
- While you’re downloading it, set up `pbzip2` and use it to decompress the archive to an XML file
- Get a Node project set up, add `dumpster-dive` as a dependency and reference the XML file - using the plaintext option it will also remove a large proportion of the Wiki code in the process, so it’s worth it! (there’s a minimal sketch of this after the list)
- Use `mongodump` (not `mongoexport`) - for me, `mongodump -d enwiki` - note, it doesn’t give a progress bar, so open another terminal tab and `ls -hal` the output to see how big the file is, and use that as a rough estimate for how long is left. You will be left with a text file with a JSON object on each line.
- At this point you’re gonna need to use my code as a guide because it’s pretty much all green-field code, but the core approach is:
  - An asynchronous generator that produces lines from a read stream in a performant way (see the line-reader sketch below)
  - Split these lines up (you’ll need to batch them) and send them to different workers (using Node’s built-in `cluster` tools) - note: frequent messaging between the main process and the workers cripples performance, so send 1000s of lines in each batch (see the cluster sketch below)
  - Only when they are on the worker do you parse them as JSON and extract the plain text
  - Then do whatever you like with the data - I searched for words to create a word frequency relationship
  - Save it in Redis (in my case as a sorted set, allowing quick lookups and O(k) retrieval of the top k most frequent words - see the Redis sketch below)
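To give a feel for the dumpster-dive step, here’s a minimal sketch of the wiring. The option names (`file`, `db`, `plaintext`) are my recollection of its API rather than something lifted from this project, so double-check them against the dumpster-dive README:

```js
// Sketch only: load the decompressed XML dump into a local MongoDB database,
// keeping plain text rather than raw Wiki markup.
const dumpster = require('dumpster-dive');

dumpster(
  {
    file: './enwiki-latest-pages-articles.xml', // the decompressed dump
    db: 'enwiki',                               // MongoDB database to write into
    plaintext: true,                            // strip most Wiki markup as it goes
  },
  () => console.log('done!')
);
```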
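The line reader is the piece I’d assumed already existed in a fast, convenient form. A minimal sketch of the idea (not the exact code from the repo) is an async generator that splits each chunk on newlines and carries any partial last line over to the next chunk:

```js
const fs = require('fs');

// Yields one line at a time from a large file without ever holding the whole
// file in memory; `leftover` buffers any partial line between chunks.
async function* lines(path) {
  let leftover = '';
  const stream = fs.createReadStream(path, { encoding: 'utf8', highWaterMark: 1 << 20 });
  for await (const chunk of stream) {
    const parts = (leftover + chunk).split('\n');
    leftover = parts.pop(); // the last element may be an incomplete line
    yield* parts;
  }
  if (leftover) yield leftover; // final line if the file doesn't end with a newline
}
```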
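Next, a rough sketch of the cluster fan-out, assuming the `lines()` generator above is in scope. The file name, the `plaintext` field and `countWords()` are placeholders rather than the project’s real names - the point is that the primary process only reads and batches, while the JSON parsing happens on the workers:

```js
// Sketch of the fan-out: the primary process only reads and batches lines,
// and the workers do the JSON parsing and counting. Assumes the lines()
// generator from the previous sketch is available (pasted above or required
// from another file); './enwiki.jsonl', 'plaintext' and countWords() are
// placeholders, not necessarily what the real code uses.
const cluster = require('cluster');
const os = require('os');

const BATCH_SIZE = 5000; // thousands of lines per message keeps IPC overhead small

if (cluster.isPrimary) { // cluster.isMaster on older Node versions
  const workers = os.cpus().map(() => cluster.fork()); // one worker per logical CPU
  (async () => {
    let batch = [];
    let next = 0;
    for await (const line of lines('./enwiki.jsonl')) {
      batch.push(line);
      if (batch.length === BATCH_SIZE) {
        workers[next++ % workers.length].send(batch); // simple round-robin, no backpressure
        batch = [];
      }
    }
    if (batch.length) workers[next % workers.length].send(batch);
    for (const w of workers) w.send('done');
  })();
} else {
  process.on('message', async (batch) => {
    if (batch === 'done') return process.exit(0);
    for (const line of batch) {
      const article = JSON.parse(line);    // only parse the JSON once it's on the worker
      await countWords(article.plaintext); // e.g. the Redis sketch below
    }
  });
}
```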
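Finally, the Redis end. This sketch uses the node-redis v4 API and a made-up key name; it aggregates counts per batch in memory first and flushes them in a single MULTI, because one round trip per word would be far too chatty. With a sorted set, the top-k query at the end is just a reverse range:

```js
const { createClient } = require('redis');

// Count words in a chunk of text and bump their scores in a Redis sorted set.
async function flushCounts(client, text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z]+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  const multi = client.multi();
  for (const [word, count] of counts) multi.zIncrBy('wordFrequency', count, word);
  await multi.exec(); // one round trip for the whole batch
}

async function main() {
  const client = createClient();
  await client.connect();

  await flushCounts(client, 'the quick brown fox jumps over the lazy dog the end');

  // Top 10 most frequent words, highest count first (REV needs Redis 6.2+).
  console.log(await client.zRangeWithScores('wordFrequency', 0, 9, { REV: true }));

  await client.quit();
}

main();
```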
Simple, huh! Honestly though, even though there are a lot of steps, each of them is quite quick, and each can be done in less than an hour with the right approach!
At this kind of scale, the wrong approach will cost you hours to days of time, so I strongly encourage anyone attempting a task like this to follow this path - anything else will waste your time and your CPU’s.
Results
When only considering words in (or around*) the English language, the top 100 is as follows:
* around meaning suffixes and prefixes separated by anything other than a hyphen
Word | Frequency |
---|---|
"the" | 187079743 |
"of" | 93608799 |
"and" | 77153878 |
"in" | 76470188 |
"a" | 56831470 |
"to" | 52900711 |
"was" | 30922054 |
"is" | 24636808 |
"for" | 22888523 |
"as" | 22506216 |
"on" | 22187946 |
"s" | 21941778 |
"by" | 19513091 |
"with" | 19051459 |
"he" | 15800917 |
"at" | 15089282 |
"that" | 14659948 |
"from" | 14238840 |
"his" | 12510050 |
"it" | 12142171 |
"an" | 10331521 |
"were" | 8170793 |
"are" | 7584606 |
"which" | 7560026 |
"this" | 6599758 |
"be" | 6394422 |
"also" | 6370524 |
"or" | 6281272 |
"first" | 6046964 |
"had" | 5935722 |
"has" | 5825258 |
"new" | 5607477 |
"their" | 5191310 |
"one" | 5152099 |
"after" | 5066202 |
"its" | 4943748 |
"who" | 4935555 |
"her" | 4841379 |
"she" | 4795762 |
"but" | 4722933 |
"not" | 4706620 |
"they" | 4469967 |
"have" | 4177744 |
"two" | 4138736 |
"been" | 3780025 |
"other" | 3514326 |
"when" | 3424821 |
"during" | 3339508 |
"all" | 3317140 |
"th" | 3157888 |
"time" | 3136051 |
"school" | 3134592 |
"may" | 3110937 |
"into" | 3107391 |
"there" | 3005568 |
"team" | 2889521 |
"i" | 2839534 |
"university" | 2763787 |
"more" | 2751715 |
"world" | 2737773 |
"national" | 2686585 |
"born" | 2662497 |
"over" | 2601852 |
"city" | 2593619 |
"years" | 2580163 |
"state" | 2564343 |
"american" | 2562583 |
"only" | 2557004 |
"de" | 2538979 |
"would" | 2520255 |
"between" | 2465696 |
"united" | 2452601 |
"no" | 2441796 |
"most" | 2430468 |
"where" | 2426813 |
"film" | 2417207 |
"later" | 2410908 |
"up" | 2393615 |
"d" | 2371211 |
"about" | 2313284 |
"some" | 2287103 |
"match" | 2267597 |
"name" | 2264341 |
"can" | 2261249 |
"out" | 2237657 |
"him" | 2237329 |
"three" | 2223914 |
"year" | 2192001 |
"then" | 2176276 |
"made" | 2172445 |
"used" | 2171999 |
"such" | 2168488 |
"while" | 2163569 |
"season" | 2138300 |
"states" | 2135492 |
"under" | 2100864 |
"part" | 2091441 |
"known" | 2065582 |
"south" | 2060629 |
"second" | 2025200 |
Performance
As I’m writing this, I’ve noticed a new backup has appeared, so I shall benchmark each step using my (close-to-ideal-conditions) setup.
Step | Name | Time |
---|---|---|
1 | mirror download | 10 mins |
2 | pbzip2 | 2 mins |
3 | dumpster-dive | 1 hour |
4 | mongodump | 20 mins |
5 | word frequency | 16 mins |
When writing this word frequency code there were something like 10 different bottlenecks that took a lot of wrangling to get past - for most of the time I worked on it, it wanted to take 4 to 24 hours to run - from now on I won’t take performance for granted when working with any data larger than the computer’s memory!