Background
- Thread: the smallest independent execution unit
- Each processor (not process) is split into multiple logical cores, and each logical core provides one kernel thread
- The kernel switches among threads very quickly, so it looks as if all threads are running at the same time; this is called concurrency
- There can be more (user-level) threads than kernel threads
- Process: an instance of a computer program that is being executed by one or more threads
- GIL (global interpreter lock) in CPython:
- prevents multiple threads from executing Python bytecode at once
- so within one process, only one thread can execute Python code at a time
- because of the GIL, multiprocessing is usually better than multithreading for CPU-bound work (multithreading is still useful for I/O-bound work); see the sketch after this list
- Parallel computing:
- Running many processors to handle separate parts of an overall task
- In CPython, multiprocessing is the most common way for parallel computing
- In Python, using as many processes as there are kernel threads (logical cores) is usually the most efficient
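A minimal sketch of why the GIL matters for CPU-bound work (the helper count_down, the loop size, and the two-worker setup are illustrative, not from the source): with threads the loop runs roughly serially, with processes it runs in parallel.

import time
from multiprocessing import Pool
from threading import Thread

def count_down(n):
    # pure CPU-bound loop: threads gain nothing here because of the GIL
    while n > 0:
        n -= 1

if __name__ == '__main__':
    N = 10_000_000

    # two threads: they take turns holding the GIL, so runtime is roughly serial
    start = time.time()
    threads = [Thread(target=count_down, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('threads:  ', time.time() - start)

    # two processes: each has its own interpreter and GIL, so they run in parallel
    start = time.time()
    with Pool(processes=2) as pool:
        pool.map(count_down, [N, N])
    print('processes:', time.time() - start)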
MapReduce
- Map
- worker nodes apply the map function to each element of their chunk of the input data
map(f, iterable)
# roughly equivalent to
[f(i) for i in iterable]  # except that map() is lazy (returns an iterator) in Python 3
- Shuffle
- worker nodes redistribute data based on the output keys (produced by the map function)
- Reduce
- worker nodes now process each group of output data, per key, in parallel (see the word-count sketch after this list)
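A minimal single-machine word-count sketch of the map -> shuffle -> reduce steps above (the sample docs and the phase layout are illustrative):

from collections import defaultdict
from functools import reduce

docs = ["a b a", "b c"]

# map: emit a (key, value) pair for every word
mapped = [(word, 1) for doc in docs for word in doc.split()]

# shuffle: group the emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: combine each key's values independently (this step could run per key in parallel)
counts = {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}
print(counts)  # {'a': 2, 'b': 2, 'c': 1}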
functools.reduce()
in Python 3, reduce() must be imported from functools:
reduce(f, [x1, x2, x3, x4]) = f(f(f(x1, x2), x3), x4)
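For instance (a small illustrative example):

from functools import reduce

# ((1 + 2) + 3) + 4
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10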
multiprocessing.Pool
import multiprocessing

cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=cores)

# method 1: map -> list (blocks until every result is ready)
results = pool.map(function, iterable)

# method 2: imap -> iterator (yields results in order)
for y in pool.imap(function, iterable):
    print(y)

# method 3: imap_unordered -> iterator (yields results as they finish, in any order)
for y in pool.imap_unordered(function, iterable):
    print(y)

pool.close()
pool.join()
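A complete runnable sketch, assuming a simple square worker (illustrative, not from the source); the __main__ guard is needed because child processes may re-import the module on platforms that spawn processes, e.g. Windows and recent macOS.

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    cores = multiprocessing.cpu_count()
    # the with-block closes and joins the pool automatically
    with multiprocessing.Pool(processes=cores) as pool:
        print(pool.map(square, range(10)))  # ordered list of results
        for y in pool.imap_unordered(square, range(10)):
            print(y)  # printed in completion order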