What are the limitations of multithreading in python?
Testing a multithreaded application is more difficult than testing a single-threaded application because defects are often timing-related and more difficult to reproduce. Show
Existing code often requires significant re-architecting to take advantage of multithreading and multicontexting. Programmers need to:
Because the completed port must be tested and re-tested, the work required to port a multithreaded and/or multicontexted application is substantial. This deep dive on Python parallelization libraries - multiprocessing and threading - will explain which to use when for different data scientist problem sets. Sep 7, 2019 • 14 min read Sooner or later, every data science project faces an inevitable challenge: speed. Working with larger data sets leads to slower processing thereof, so you'll eventually have to think about optimizing your algorithm's run time. As most of you already know, parallelization is a necessary step of this optimization. Python offers two built-in libraries for parallelization: multiprocessing and threading. In this article, we'll explore how data scientists can go about choosing
between the two and which factors should be kept in mind while doing so. As you all know, data science is the science of dealing with large amounts of data and extracting useful insights from them. More often than not, the operations we perform on the data are easily parallelizable, meaning that different processing agents can run the operation on the data one piece at a time, then combine the results at the
end to get the complete result. To better visualize parallelizability, let's consider a real world analogy. Suppose you need to clean three rooms in your home. You can either do it all by yourself, cleaning the rooms one after the other, or you can ask your two siblings to help you out, with each of you cleaning a single room. In the latter approach, each of you are working parallely on a part of the whole task, thus reducing the total time required to complete it. This is
parallelizability in action. Parallel processing can be achieved in Python in two different ways: multiprocessing and threading. Fundamentally, multiprocessing and threading are two ways to achieve parallel computing, using processes and threads, respectively, as the processing agents. To understand how these work, we have to clarify what processes and threads are. ProcessA process is an instance of a computer program being executed. Each process has its own memory space it uses to store the instructions being run, as well as any data it needs to store and access to execute. ThreadsThreads are components of a process, which can run parallely. There can be multiple threads in a process, and they share the same memory space, i.e. the memory space of the parent process. This would mean the code to be executed as well as all the variables declared in the program would be shared by all threads. Process and Threads, By I, Cburnett, CC BY-SA 3.0, Link For example, let us consider the programs being run on your computer right now. You’re probably reading this article in a browser, which probably has multiple tabs open. You might also be listening to music through the Spotify desktop app at the same time. The browser and the Spotify application are different processes; each of them can use multiple processes or threads to achieve parallelism. Different tabs in your browser might be run in different threads. Spotify can play music in one thread, download music from the internet in another, and use a third to display the GUI. This would be called multithreading. The same can be done with multiprocessing—multiple processes—too. In fact, most modern browsers like Chrome and Firefox use multiprocessing, not multithreading, to handle multiple tabs. Technical details
Pitfalls of Parallel ComputingIntroducing parallelism to a program is not always a positive-sum game; there are some pitfalls to be aware of. The most important ones are as follows.
Multiprocessing and Threading in PythonThe Global Interpreter LockWhen it comes to Python, there are some oddities to keep in mind. We know that threads share the same memory space, so special precautions must be taken so that two threads don’t write to the same memory location. The
CPython interpreter handles this using a mechanism called From the Python wiki:
Check the slides here for a more detailed look at the Python The This bottleneck, however, becomes irrelevant if your program has a more severe bottleneck elsewhere, for example in network, IO, or user interaction. In those cases, threading is an entirely effective method of parallelization. But for programs that are CPU bound, threading ends up making the program slower. Let's explore this with some example use cases. Use Cases for ThreadingGUI programs use threading all the time to make applications responsive. For example, in a text editing program, one thread can take care of recording the user inputs, another can be responsible for displaying the text, a third can do spell-checking, and so on. Here, the program has to wait for user interaction, which is the biggest bottleneck. Using multiprocessing won’t make the program any faster. Another use case for threading is programs that are IO bound or network bound, such as web-scrapers. In this case, multiple threads can take care of scraping multiple webpages in parallel. The threads have to download the webpages from the Internet, and that will be the biggest bottleneck, so threading is a perfect solution here. Web servers, being network bound, work similarly; with them, multiprocessing doesn’t have any edge over threading. Another relevant example is Tensorflow, which uses a thread pool to transform data in parallel. Use Cases for MultiprocessingMultiprocessing outshines threading in cases where the program is CPU intensive and doesn’t have to do any IO or user interaction. For example, any program that just crunches numbers will see a massive speedup from multiprocessing; in fact, threading will probably slow it down. An interesting real world example is Pytorch Dataloader, which uses multiple subprocesses to load the data into GPU. Parallelization in Python, in ActionPython offers two libraries -
You
can see that I've created a function Then, I’ve created two threads that will execute the same function. The thread objects have a As you can see, the API for spinning up a new thread to a task in the background is pretty straightforward. What’s great is that the API for multiprocessing is almost the exact same as well; let’s check it out.
There it is—just swap There’s obviously a lot more you can do with this, but that’s not within the scope of this article, so we won’t go into it here. Check out the docs here and here if you’re interested in learning more. BenchmarksNow that we have an idea of how the code implementing parallelization looks like, let’s get back to the performance issues. As we’ve noted before, threading is not suitable for CPU bound tasks; in those cases it ends up being a bottleneck. We can validate this using some simple benchmarks. Firstly, let’s see how threading compares against multiprocessing for the code sample I showed you above. Keep in mind that this task does not involve any kind of IO, so it’s a pure CPU bound task. And let’s see a similar benchmark for an IO bound task. For example, the following function —
The function is simply fetching a webpage and saving that to a local file, multiple times in a loop. Useless but straightforward and thus a good fit for demonstration. Let’s look at the benchmark. Now there are a few things to note from these two charts:
Differences, Merits and Drawbacks
From all this discussion, we can conclude the following —
From the Perspective of a Data ScientistA typical data processing pipeline can be divided into the following steps:
Let’s explore how we could introduce parallelism in these tasks so that they can be sped up. Step 1 involves reading data from disk, so clearly disk IO is going to be the bottleneck for this step. As we’ve discussed, threads are the best option for parallelizing this kind of operation. Similarly, step 3 is also an ideal candidate for the introduction of threading. However, step 2 consists of computations that involve the CPU or a GPU. If it’s a CPU based task, using threading will be of no use; instead, we have to go for multiprocessing. Only then we’ll be able to exploit the multiple cores of the CPU and achieve parallelism. If it’s a GPU based task, since GPU already implements a massively parallelized architecture at the hardware level, using the correct interface (libraries and drivers) to interact with the GPU should take care of the rest. Now you may be thinking, “My data pipeline looks a bit different to this; I have some tasks that don’t really fit into this general framework.” Still, you should be able to observe the methodology used here to decide between threading and multiprocessing. The factors you should consider are:
With these factors in mind, together with the takeaways above, you should be able to make the decision. Also, keep in mind that you don’t have to use a single form of parallelism throughout your program. You should use one or the other for different parts of your program, whichever is suitable for that particular part. Now we’ll look at two example scenarios a data scientist might face and how you can use parallel computing to speed them up. Scenario: Downloading EmailsLet’s say you want to analyze all the emails in the inbox of your own home-grown startup and understand the trends: who are the most frequent senders, what are the most common keywords appearing in the emails, which day of the week or which hour of the day do you receive most emails, and so on. The first step of this project would be, of course, downloading the emails to your computer. At first, let’s do it sequentially without using any parallelization. The code to use is below and it should be pretty self-explanatory. There is a function
Time taken :: Now let’s introduce some parallelizability into this task to speed things up. Before we dive into writing the code, we have to decide between threading and multiprocessing. As you’ve learned so far, threads are the best option when it comes to tasks that have some IO as the bottleneck. The task at hand obviously belongs to this category, as it is accessing an IMAP server over the internet. So we’ll be going with Much
of the code we’re going to use is going to be the same as the one we used in the sequential case. The only difference is that we will split the list of 100 email ids into 10 smaller chunks, each chunk containing 10 ids, then create 10 threads and call the
Time taken :: As you can see, threading, sped it up considerably. Scenario: Classification Using Scikit-LearnLet’s say you have a classification problem, and you want to use a random forest classifier for this. As it’s a standard and well-known machine learning algorithm, let’s not reinvent the wheel and just use The below code serves demonstration purposes. I have created a classification dataset
using the helper function
Time taken :: Now we’ll look into how we can reduce the running time of this algorithm. We know that this algorithm can be parallelized to some extent, but what kind of parallelization would be suitable? It does not have any IO bottleneck; on the contrary, it’s a very CPU intensive task. So multiprocessing would be the logical choice. Fortunately,
Time taken :: As expected, multiprocessing made it quite a bit faster. ConclusionMost if not all data science projects will see a massive increase in speed with parallel computing. In fact, many of the popular data science libraries already have parallelism built into them, you just have to enable it. So before trying to implement it on your own, look through the documentation of the library you’re using and check if it supports parallelism (by the way, I definitively recommend you to check out dask). In case it doesn’t, this article should assist you in implementing it on your own. About the AuthorSumit is a computer enthusiast who started programming at an early age; he's currently finishing his master's degree in computer science at IIT Delhi. Whenever he isn't programming, you can probably find him either reading philosophy, playing the guitar, taking photos, or blogging. You can connect with Sumit on Twitter, LinkedIn, Github, and his website. What are the limitations of threading?The Thread class has the following disadvantages: With more threads, the code becomes difficult to debug and maintain. Thread creation puts a load on the system in terms of memory and CPU resources. We need to do exception handling inside the worker method as any unhandled exceptions can result in the program crashing.
What are two limitations of multithreading?Multithreaded and multicontexted applications present the following disadvantages:. Difficulty of writing code. Multithreaded and multicontexted applications are not easy to write. ... . Difficulty of debugging. ... . Difficulty of managing concurrency. ... . Difficulty of testing. ... . Difficulty of porting existing code.. What are the advantages and disadvantages of multithreading in Python?The primary function of multithreading is to simultaneously run or execute multiple tasks.. Complex debugging and testing processes.. Overhead switching of context.. Increased potential for deadlock occurrence.. Increased difficulty level in writing a program.. Unpredictable results.. What is some issues with multithreading?Complications due to Concurrency − It is difficult to handle concurrency in multithreaded processes. This may lead to complications and future problems. Difficult to Identify Errors− Identification and correction of errors is much more difficult in multithreaded processes as compared to single threaded processes.
|