Parallel computing in Python

Qingqi@2020-11-24 #tech

Background

  • Thread: the smallest unit of execution that the operating system schedules independently
    • Each physical processor core (not process) can expose several logical cores, for example through simultaneous multithreading, and each logical core runs one thread at a time
    • The operating system switches each logical core among threads very quickly, so all threads seem to run at the same time; this is called concurrency
    • A program can create more threads than there are logical cores
  • Process: an instance of a computer program that is being executed by one or more threads
  • GIL (global interpreter lock) in CPython:
    • prevents multiple threads from executing Python bytecode at once
    • so within one process, only one thread runs Python code at any moment
    • for CPU-bound work, multiprocessing is usually better than multithreading because of the GIL; threads remain useful for I/O-bound work, since the GIL is released while a thread waits on I/O
  • Parallel computing:
    • Running separate parts of an overall task on many processors at the same time
    • In CPython, multiprocessing is the most common way to do parallel computing
    • For CPU-bound work, using as many processes as there are logical cores is usually the most efficient choice
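
To make the difference between threads and processes concrete, here is a minimal sketch that runs the same tasks under a thread pool and a process pool from the standard concurrent.futures module. The helper names cpu_bound, io_bound and timed_map are made up for this illustration, and the task sizes are arbitrary. No timings are claimed here because they depend on the machine and interpreter build.

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python arithmetic; in CPython the thread holds the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def io_bound(delay):
    """Stand-in for a network or disk wait; sleeping releases the GIL."""
    time.sleep(delay)
    return delay


def timed_map(executor_cls, func, args, workers):
    """Run func over args with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as executor:
        results = list(executor.map(func, args))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # os.cpu_count() reports logical cores; capped at 4 to keep the demo short.
    workers = min(4, os.cpu_count() or 1)
    tasks = [300_000] * 4

    _, t_threads = timed_map(ThreadPoolExecutor, cpu_bound, tasks, workers)
    _, t_procs = timed_map(ProcessPoolExecutor, cpu_bound, tasks, workers)
    _, t_io = timed_map(ThreadPoolExecutor, io_bound, [0.1] * 4, 4)

    print(f"CPU-bound with threads:   {t_threads:.3f}s")
    print(f"CPU-bound with processes: {t_procs:.3f}s")
    print(f"I/O-bound with threads:   {t_io:.3f}s")
```

On a multi-core machine the process pool would normally finish the CPU-bound tasks faster once the workload is large enough to outweigh process start-up cost, while the four simulated I/O waits overlap under threads.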

MapReduce

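  • Map
    • worker nodes apply the map function to their share of the input data; Python's built-in map() applies a function to every item in the same way
list(map(f, iterable))
# equivalent to
[f(i) for i in iterable]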
  • Shuffle
    • worker nodes redistribute data based on the output keys (produced by the map function)
  • Reduce
    • worker nodes now process each group of output data, per key, in parallel
  • functools.reduce() in Python
reduce(f, [x1, x2, x3, x4]) = f(f(f(x1, x2), x3), x4)
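
The three phases can be sketched in a single process as a word count. This is only an illustration of the data flow, not a distributed implementation; the function names map_phase, shuffle_phase, reduce_phase and word_count are invented for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by the key produced in the map phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: fold each key's values into one result with functools.reduce."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(documents):
    """Chain the phases: map every document, shuffle by word, reduce per word."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["the cat sat", "the dog sat down"]))
```

In a real MapReduce system each phase runs on many worker nodes. Here the per-key reduction is independent for every key, which is the property that makes it possible to run that step in parallel.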

multiprocessing.Pool

import multiprocessing
cores = multiprocessing.cpu_count()  # number of logical cores
pool = multiprocessing.Pool(processes=cores)  # one worker process per logical core

# method 1: map -> list (blocks until every result is ready)
pool.map(function, iterable)

# method 2: imap -> iterator (results in input order)
for y in pool.imap(function, iterable):
    print(y)

# method 3: imap_unordered -> iterator
# results may arrive in any order
for y in pool.imap_unordered(function, iterable):
    print(y)
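
For reference, the snippets above can be assembled into a complete script along these lines. The worker function square and the input range(8) are placeholders chosen for this sketch. The `if __name__ == "__main__":` guard is needed on platforms where worker processes start by importing the main module, and the `with` block shuts the pool down on exit, so the iterators are consumed inside it.

```python
import multiprocessing


def square(x):
    """Placeholder worker; it must be defined at module level so it can be pickled."""
    return x * x


def main():
    cores = multiprocessing.cpu_count()  # number of logical cores
    with multiprocessing.Pool(processes=cores) as pool:
        # map: returns a list, in input order, once all results are ready
        ordered = pool.map(square, range(8))

        # imap: lazy iterator that yields results in input order
        lazy = list(pool.imap(square, range(8)))

        # imap_unordered: yields results as workers finish, so sort to compare
        unordered = sorted(pool.imap_unordered(square, range(8)))

    print(ordered)
    print(lazy)
    print(unordered)


if __name__ == "__main__":
    main()
```

For long iterables, map, imap and imap_unordered all accept a chunksize argument that sends items to workers in batches, which can reduce inter-process communication overhead.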