Background
- Thread: the smallest independent execution unit
- Each processor (not process) is split into multiple logical cores, and each logical core provides one kernel thread
- The kernel switches among threads very quickly, so it looks as if all threads are running at the same time; this is called concurrency
- There can be more (user-level) threads than kernel threads
- Process: an instance of a computer program that is being executed by one or more threads
- GIL (global interpreter lock) in CPython:
- prevents multiple threads from executing Python bytecode at once
- so within one process, only one thread can execute Python code at a time
- because of the GIL, multiprocessing is usually better than multithreading for CPU-bound work (multithreading is still useful for I/O-bound work); see the sketch after this list
- Parallel computing:
- Running many processors to handle separate parts of an overall task
- In CPython, multiprocessing is the most common way for parallel computing
- In Python, using as many processes as there are kernel threads (logical cores) is usually the most efficient
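A minimal sketch of why the GIL matters for CPU-bound work (the helper count_down, the loop size, and the two-worker setup are illustrative, not from the source): with threads the loop runs roughly serially, with processes it runs in parallel.

import time
from multiprocessing import Pool
from threading import Thread

def count_down(n):
    # pure CPU-bound loop: threads gain nothing here because of the GIL
    while n > 0:
        n -= 1

if __name__ == '__main__':
    N = 10_000_000

    # two threads: they take turns holding the GIL, so runtime is roughly serial
    start = time.time()
    threads = [Thread(target=count_down, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('threads:  ', time.time() - start)

    # two processes: each has its own interpreter and GIL, so they run in parallel
    start = time.time()
    with Pool(processes=2) as pool:
        pool.map(count_down, [N, N])
    print('processes:', time.time() - start)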
MapReduce
- Map
- worker nodes apply the map function to each element of their chunk of the input data
map(f, iterable)
# roughly equivalent to
[f(i) for i in iterable]  # except that map() is lazy (returns an iterator) in Python 3
- Shuffle
- worker nodes redistribute data based on the output keys (produced by the map function)
- Reduce
- worker nodes now process each group of output data, per key, in parallel (see the word-count sketch after this list)
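A minimal single-machine word-count sketch of the map -> shuffle -> reduce steps above (the sample docs and the phase layout are illustrative):

from collections import defaultdict
from functools import reduce

docs = ["a b a", "b c"]

# map: emit a (key, value) pair for every word
mapped = [(word, 1) for doc in docs for word in doc.split()]

# shuffle: group the emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: combine each key's values independently (this step could run per key in parallel)
counts = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}
print(counts)  # {'a': 2, 'b': 2, 'c': 1}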
functools.reduce()
in Python 3, reduce() must be imported from functools:
reduce(f, [x1, x2, x3, x4]) = f(f(f(x1, x2), x3), x4)
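For instance (a small illustrative example):

from functools import reduce

# ((1 + 2) + 3) + 4
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10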
multiprocessing.Pool
import multiprocessing

cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=cores)

# method 1: map -> list (blocks until every result is ready)
results = pool.map(function, iterable)

# method 2: imap -> iterator (yields results in order)
for y in pool.imap(function, iterable):
    print(y)

# method 3: imap_unordered -> iterator (yields results as they finish, in any order)
for y in pool.imap_unordered(function, iterable):
    print(y)

pool.close()
pool.join()
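A complete runnable sketch, assuming a simple square worker (illustrative, not from the source); the __main__ guard is needed because child processes may re-import the module on platforms that spawn processes, e.g. Windows and recent macOS.

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()
    # the with-block closes and joins the pool automatically
    with multiprocessing.Pool(processes=cores) as pool:
        print(pool.map(square, range(10)))  # ordered list of results
        for y in pool.imap_unordered(square, range(10)):
            print(y)  # printed in completion order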