Thursday, December 16, 2010

ParallelPython vs multiprocessing

Today I'm working on parallelising a process I have written. The process is simply a text conversion: I have a translator class, organism directories, and files inside these directories. What I want to do is split the data among the processors and perform the operation faster.

To do this, I tested the multiprocessing and ParallelPython modules of Python. Without using these modules, it took 38 seconds to perform the task, whereas with their help it went down to 29 seconds (multiprocessing) and 30 seconds (ParallelPython). Not a huge gain, but better than nothing. By the way, ParallelPython is far more complicated than multiprocessing.
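
A minimal sketch of that kind of timing comparison (with a dummy convert_organism standing in for the real translator work) would look something like this:

import time
from multiprocessing import Pool

def convert_organism(organism):
    # stand-in for the real conversion work
    time.sleep(1)

organisms = ["organism1", "organism2", "organism3", "organism4"]

if __name__ == '__main__':
    # serial baseline
    start = time.time()
    for organism in organisms:
        convert_organism(organism)
    print "serial: %.1f s" % (time.time() - start)

    # same work through a pool of worker processes
    start = time.time()
    pool = Pool(processes=2)
    pool.map(convert_organism, organisms)
    print "pool:   %.1f s" % (time.time() - start)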

Here is the code for ParallelPython:

import translator
import os
from utils.pp import pp

base = "/some/path"
organisms = [ "organism1", "organism2", ...]

def convert_organism(base, organism):
    t = translator.BiogridOspreyTranslator()
    # uses os module here
    t.translate()

if __name__ == '__main__':
    job_server = pp.Server(ppservers=())
    # submit one job per organism, listing the modules each job needs
    jobs = [(organism, job_server.submit(convert_organism, (base, organism), (), ("os", "translator")))
            for organism in organisms]
    for organism, job in jobs:
        job()  # wait for the job to finish

ParallelPython requires you to tell it which modules the function needs. I didn't like that.
And here is the code for multiprocessing:

from translator import *
import os
from multiprocessing import Pool

base = "/some/path"
organisms = [ "organism1", "organism2", ...]

def convert_organism(organism):
    t = BiogridOspreyTranslator()
    # uses os module here
    t.translate()

if __name__ == '__main__':
    # distribute the organisms over two worker processes
    pool = Pool(processes = 2)
    pool.map(convert_organism, organisms)
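
The pool size is hard-coded to 2 here; on a machine with more cores it could just as well be tied to the CPU count, something like:

from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    # one worker per CPU instead of a hard-coded 2;
    # convert_organism and organisms as defined above
    pool = Pool(processes=cpu_count())
    pool.map(convert_organism, organisms)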
