
Monday, February 25, 2013

dcap – A Distributed Computation Architecture In Python



Get The Code: https://github.com/byterial/dcap


Compatible with: Linux (tested on Ubuntu); see OS Compatibility below.

Part of my master's thesis consists of training many different classifiers and combining them using majority voting. As the training of a classifier tends to take a long time, it is not feasible to train them all on a single machine. Instead, we use CSAIL's OpenStack cloud and farm out the training of classifiers to machines running in the cloud.


[Figure: training classifiers on the cloud using dcap]


In order to farm out tasks, I developed a small client-server architecture with four goals in mind:

  1. Few constraints on the type of tasks - For my project the tasks would consist of running Matlab code, but in the future our group may want to farm out other types of tasks to client machines.
  2. Robustness - The cloud is a fickle environment, and if nodes or the entire cloud go down, we don't want to have to repeat the entire computation. We also want to make sure that each task that should be computed is actually computed.
  3. Elasticity - It should be possible to kill or add instances to the computation at any time.
  4. Intermediate results - The system should provide intermediate results and should be able to change tasks which have not yet been computed, depending on the results of previous tasks. The motivation behind this goal is boosting: we want to be able to send back a vector telling clients which data points to use, allowing us to weight data points more heavily the more often they are misclassified.

I wrote the architecture for this in Python and ended up calling it “dcap”, an acronym which stands for “A Distributed Computation Architecture In Python” (I guess the acronym should be adcap, but that doesn't roll off the tongue nearly as nicely :) ). Here's how it works:
  1. The server loads the list of tasks specified in a tasks.txt file. Each task consists of a Python script and a folder containing data for that particular task.
  2. Machines running a dcap client connect to the server and request a task.
  3. The server selects the next task and loads the specified Python script and a zipped version of the data folder into memory. It then sends both over the network to the client.
  4. The client unzips the zip file into a temporary folder and then calls the transferred Python script with two arguments: the first is the path to where the unzipped data is on the client, the second is a directory where the Python script should store the results of its computation.
  5. After the task is executed, the client zips the results directory and transfers it back to the server.
  6. The server unzips the results in /server/results/computationName and updates the completed tasks file therein. It then sends the next task, if any remain, to the client.

    [Figure: stages of processing one task]
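On the client side, steps 4 and 5 boil down to something like the following sketch. The helper below is hypothetical and glosses over all the networking; the real logic lives in the dcap client modules:

import os
import shutil
import subprocess
import tempfile
import zipfile

def run_one_task(script_path, data_zip_path, results_dir):
    # Hypothetical sketch of one task on the client: unzip the data,
    # run the transferred script, then zip the results for the trip back.
    data_dir = tempfile.mkdtemp(prefix="dcap_data_")
    with zipfile.ZipFile(data_zip_path) as archive:
        archive.extractall(data_dir)

    # Call the transferred script with the two documented arguments:
    # the unpacked data directory and the directory for the results.
    if not os.path.isdir(results_dir):
        os.makedirs(results_dir)
    subprocess.check_call(["python", script_path, data_dir, results_dir])

    # Zip the results directory so it can be sent back to the server.
    return shutil.make_archive(results_dir, "zip", results_dir)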

A Brief Overview of dcap

Specifying Your Own Tasks

The most important thing you are going to want to do is specify the tasks you'd like to farm out. Here's how it's done:
For each task:
  1. Create a Python script that should be executed on the client; as an example, see server/tasks/demotask.py. This script will be called with two arguments: the first is the path to where the client unpacked the data directory, the second is the path where the result of the computation should be saved (see Storing Results below). A minimal example sketch follows this list.
  2. Optional: create a data directory. This directory will be zipped and sent to the client and can be used to ship additional code, data or parameters.
  3. List each task in tasks.txt on a separate line in the following CSV format:
    uniqueTaskName,path/to/python/clientscript.py,/path/to/datadirectory
    If you are not using a data directory, simply specify the task as:
    uniqueTaskName,path/to/python/clientscript.py,
    (note the trailing comma!)
    The location of the tasks.txt file can be specified using the -t flag when executing RunServer.py.
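As promised in step 1, here is a minimal sketch of what such a client script could look like. The file names input.txt and output.txt are made up for illustration; see server/tasks/demotask.py for the real example:

import os
import sys

def main():
    # dcap always calls the script with exactly two arguments.
    data_dir = sys.argv[1]     # where the client unpacked the data directory
    results_dir = sys.argv[2]  # where results must go for the trip back

    # Read some input from the (hypothetical) data directory ...
    with open(os.path.join(data_dir, "input.txt")) as f:
        text = f.read()

    # ... and store whatever should be sent back in the results directory.
    with open(os.path.join(results_dir, "output.txt"), "w") as f:
        f.write(text.upper())

if __name__ == "__main__":
    main()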

It is possible to use the same Python client script for each task. For example, when I use dcap to train many different classifiers using Matlab, the Python client script is the same for every task, as all it does is start Matlab (see the sketch below). However, the data directory differs for every task, as it contains a parameters.mat file which Matlab reads in order to train a classifier with specific parameters.
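A Matlab-launching client script can be very short. The sketch below shows the general idea; the train.m entry point and its arguments are assumptions for illustration, not the exact script from my setup:

import subprocess
import sys

def main():
    data_dir = sys.argv[1]
    results_dir = sys.argv[2]

    # Run Matlab in batch mode. train.m is a hypothetical function,
    # shipped in the data directory, that reads parameters.mat from
    # data_dir and writes the trained classifier into results_dir.
    command = "addpath('%s'); train('%s', '%s'); exit" % (
        data_dir, data_dir, results_dir)
    subprocess.check_call(["matlab", "-nodisplay", "-nosplash",
                           "-r", command])

if __name__ == "__main__":
    main()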

Storing Results

  • To send back the results of the computation, the client script should store them in the results directory. The path to this directory is passed in as the second argument to the client script.
  • The server will decompress the results it receives from the client into server/results/computationTaskName/uniqueTaskName. Note that computationTaskName is a name you can specify by passing it in via the -n option.

It is of course possible for the clients to store their results somewhere else, for example in a database. In this case the transferred Python script could simply send the results there after the computation and leave the results directory empty, as in the sketch below.
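For instance, a task script could post its result to a service of your own and write nothing into the results directory. The URL and the result payload below are made up, and the sketch uses Python 3's urllib:

import json
import sys
import urllib.request

def main():
    data_dir = sys.argv[1]     # unused in this sketch
    results_dir = sys.argv[2]  # deliberately left empty

    result = {"task": "example", "accuracy": 0.97}  # stand-in for real output

    # Send the result to a hypothetical collection service instead of
    # writing files; dcap then simply transfers back an empty directory.
    request = urllib.request.Request(
        "http://results.example.org/submit",
        data=json.dumps(result).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request)

if __name__ == "__main__":
    main()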


The Folder Structure

Many of these directories will be created by the scripts as they run if they do not exist beforehand:
|-- dcap
|   |-- client
|   |   |-- logs
|   |   |-- results
|   |   `-- temp
|   |-- common
|   |-- documentation
|   `-- server
|       |-- logs
|       |-- results
|       |-- tasks
|       |   |-- demoMatlabTask
|       |   `-- demotaskdata
|       `-- temp


Running The Demo Task

I have included a simple demo task to show how dcap runs. If you start the following two processes:

  • python RunServer.py
  • python RunClient.py

The demo task specified by the files
  • tasks/tasks.txt
  • tasks/demotask.py
  • tasks/demotaskdata
will run with both the server and the client on the local machine. The demo task consists of sending a text file to the client, which the client will read and print out, and then send the text file back to the server.

The dcap server always constructs a name for the computation using user input (passed in through the -n option) and the current local time. If no name for the computation task is specified, the name is simply based on the current local time.

If you'd like to run the demo task on two separate machines, use the -p option on the server side to specify the port, and the -p and -ip options on the client side to specify the port and IP address of the server, as in the example below.
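For example, assuming the server machine has the IPv4 address 192.168.1.10 and should listen on port 5555:

  • On the server machine: python RunServer.py -p 5555
  • On each client machine: python RunClient.py -ip 192.168.1.10 -p 5555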

Options

Server, optional arguments:
  -h     Show help
  -p     Specify on which port the server should listen, default is 4444
  -t     Specify where the tasks file is located, default is ./tasks.txt
  -c     Specify path to a file with tasks that have already been completed, default is None
  -n     Specify a user defined name for these computational tasks

Client, optional arguments:
  -h     Show help
  -ip    Specify the IPv4 address the server is listening on
  -p     Specify the port the server is listening on, default is 4444 

Miscellaneous

  • API Documentation: For API documentation, open documentation/index.html in any browser.
  • Options: Type python RunServer.py -h or python RunClient.py -h to see the options.
  • Client Recovery: If individual clients fail, their incomplete task will be put back into the tasks queue.
  • Server Recovery: If the server fails, but the data on the server was backed up regularly, it is possible to restart from a backup. The server keeps track of the tasks the clients have completed in a text file in results/computationTaskName/completedTasks.txt. After a server failure, you can restart the computation where it left off by specifying, in addition to the original tasks file, the completed tasks file with the -c option (see the example after this list). Tasks in the original tasks.txt file that match tasks in the completed tasks file will not be farmed out to clients again.
  • Updating tasks after a result is returned: Although it is not demonstrated by the demo task, it is possible to update tasks that have not yet been farmed out, depending on the result of the completed task. After a result is stored, the ServerSideResultsProcessor module will be called with a path to the most recently received results. By default this module does nothing, but if so desired, it can be modified to manipulate the remaining tasks (see the sketch after this list). Note that in order to prevent race conditions, you should acquire the IOLock before manipulating any tasks. This also means that if updating tasks is computationally expensive, the server becomes a bottleneck.
  • OS Compatibility: dcap was developed for machines running Ubuntu. Unfortunately, it seems the multiprocessing library causes conflicts on Windows and OS X. I may at some point switch from the multiprocessing library to threading in order to fix these issues.
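A restart after a server failure might look like this, assuming completedTasks.txt was restored from the backup and using a made-up computation name:

  python RunServer.py -t tasks.txt -c results/computationTaskName/completedTasks.txt -n myComputation

And here is a minimal sketch of what a task-updating results processor could do, in the spirit of the boosting example. The function signature, file names and lock usage below are placeholders, not the actual hooks of the ServerSideResultsProcessor module:

import os

def process_results(results_path, pending_task_dirs, io_lock):
    # Read something out of the most recently received results, for
    # example a (hypothetical) weight vector produced by boosting.
    with open(os.path.join(results_path, "weights.txt")) as f:
        weights = f.read()

    # Hold the IOLock so the remaining tasks are not modified while the
    # server is concurrently farming tasks out or completing them.
    with io_lock:
        for task_dir in pending_task_dirs:
            with open(os.path.join(task_dir, "weights.txt"), "w") as out:
                out.write(weights)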







