Python asyncio Tutorial

Async IO is a concurrent programming design that has received dedicated support in Python, evolving rapidly from Python 3.4 through 3.7, and probably beyond.

You may be thinking with dread, “Concurrency, parallelism, threading, multiprocessing. That’s a lot to grasp already. Where does async IO fit in?”

This tutorial is built to help you answer that question, giving you a firmer grasp of Python’s approach to async IO.

Here’s what you’ll cover:

  • Asynchronous IO (async IO): a language-agnostic paradigm (model) that has implementations across a host of programming languages

  • async/await: two new Python keywords that are used to define coroutines

  • asyncio: the Python package that provides a foundation and API for running and managing coroutines

Coroutines (specialized generator functions) are the heart of async IO in Python, and we’ll dive into them later on.

Before you get started, you’ll need to make sure you’re set up to use asyncio and other libraries found in this tutorial.

Setting Up Your Environment

You’ll need Python 3.7 or above to follow this article in its entirety, as well as the aiohttp and aiofiles packages:

$ python3.7 -m venv ./py37async
$ source ./py37async/bin/activate  # Windows: .\py37async\Scripts\activate.bat
$ pip install --upgrade pip aiohttp aiofiles  # Optional: aiodns

For help with installing Python 3.7 and setting up a virtual environment, check out Python 3 Installation & Setup Guide or Virtual Environments Primer.

With that, let’s jump in.

The 10,000-Foot View of Async IO

Async IO is a bit lesser known than its tried-and-true cousins, multiprocessing and threading. This section will give you a fuller picture of what async IO is and how it fits into its surrounding landscape.

Where Does Async IO Fit In?

Concurrency and parallelism are expansive subjects that are not easy to wade into. While this article focuses on async IO and its implementation in Python, it’s worth taking a minute to compare async IO to its counterparts in order to have context about how async IO fits into the larger, sometimes dizzying puzzle.

Parallelism consists of performing multiple operations at the same time. Multiprocessing is a means to effect parallelism, and it entails spreading tasks over a computer’s central processing units (CPUs, or cores). Multiprocessing is well-suited for CPU-bound tasks: tightly bound for loops and mathematical computations usually fall into this category.

Concurrency is a slightly broader term than parallelism. It suggests that multiple tasks have the ability to run in an overlapping manner. (There’s a saying that concurrency does not imply parallelism.)

Threading is a concurrent execution model whereby multiple threads take turns executing tasks. One process can contain multiple threads. Python has a complicated relationship with threading thanks to its GIL, but that’s beyond the scope of this article.

What’s important to know about threading is that it’s better for IO-bound tasks. While a CPU-bound task is characterized by the computer’s cores continually working hard from start to finish, an IO-bound job is dominated by a lot of waiting on input/output to complete.

To recap the above, concurrency encompasses both multiprocessing (ideal for CPU-bound tasks) and threading (suited for IO-bound tasks). Multiprocessing is a form of parallelism, with parallelism being a specific type (subset) of concurrency. The Python standard library has offered longstanding support for both of these through its multiprocessing, threading, and concurrent.futures packages.
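As a minimal sketch of that recap (the workloads and URL here are illustrative stand-ins, not from this tutorial), concurrent.futures exposes both models through a common interface:

import urllib.request
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_task(n: int) -> int:
    # CPU-bound: a tight arithmetic loop, a good fit for multiprocessing.
    return sum(i * i for i in range(n))

def io_task(url: str) -> int:
    # IO-bound: dominated by waiting on the network, a good fit for threading.
    with urllib.request.urlopen(url) as resp:
        return len(resp.read())

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:  # Parallelism across CPU cores
        print(list(pool.map(cpu_task, [2_000_000] * 4)))
    with ThreadPoolExecutor() as pool:   # Concurrency while threads wait on IO
        print(list(pool.map(io_task, ["https://example.com"] * 4)))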

Now it’s time to bring a new member to the mix. Over the last few years, a separate design has been more comprehensively built into CPython: asynchronous IO, enabled through the standard library’s asyncio package and the new async and await language keywords. To be clear, async IO is not a newly invented concept, and it has existed or is being built into other languages and runtime environments, such as Go, C#, or Scala.

The asyncio package is billed by the Python documentation as a library to write concurrent code. However, async IO is not threading, nor is it multiprocessing. It is not built on top of either of these.

In fact, async IO is a single-threaded, single-process design: it uses cooperative multitasking, a term that you’ll flesh out by the end of this tutorial. It has been said in other words that async IO gives a feeling of concurrency despite using a single thread in a single process. Coroutines (a central feature of async IO) can be scheduled concurrently, but they are not inherently concurrent.

To reiterate, async IO is a style of concurrent programming, but it is not parallelism. It’s more closely aligned with threading than with multiprocessing but is very much distinct from both of these and is a standalone member in concurrency’s bag of tricks.

That leaves one more term. What does it mean for something to be asynchronous? This isn’t a rigorous definition, but for our purposes here, I can think of two properties:

  • Asynchronous routines are able to “pause” while waiting on their ultimate result and let other routines run in the meantime.
  • Asynchronous code, through the mechanism above, facilitates concurrent execution. To put it differently, asynchronous code gives the look and feel of concurrency.
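To make those two properties concrete, here’s a minimal sketch (the coroutine names and delays are illustrative only) of two routines that pause at an await and let each other run:

import asyncio

async def greet(name: str, delay: float) -> None:
    # The await below "pauses" this coroutine while it waits,
    # letting other routines run in the meantime.
    await asyncio.sleep(delay)
    print(f"Hello, {name}!")

async def main() -> None:
    # Both coroutines overlap: total runtime is about 2 seconds, not 3.
    await asyncio.gather(greet("first", 1), greet("second", 2))

asyncio.run(main())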

Here’s a diagram to put it all together. The white terms represent concepts, and the green terms represent ways in which they are implemented or effected:

[Diagram: parallelism (a subset of concurrency) is effected by multiprocessing, while concurrency more broadly is effected by threading and async IO.]

Next up is the full example program of this tutorial, areq.py, which asynchronously fetches a set of URLs and extracts the href links found in each page:

#!/usr/bin/env python3
# areq.py

"""Asynchronously get links embedded in multiple pages' HTML."""

import asyncio
import logging
import re
import sys
import urllib.error
import urllib.parse
from typing import IO

import aiofiles
import aiohttp
from aiohttp import ClientSession

logging.basicConfig(
    format="%(asctime)s %(levelname)s:%(name)s: %(message)s",
    level=logging.DEBUG,
    datefmt="%H:%M:%S",
    stream=sys.stderr,
)
logger = logging.getLogger("areq")
logging.getLogger("chardet.charsetprober").disabled = True

HREF_RE = re.compile(r'href="(.*?)"')

async def fetch_html(url: str, session: ClientSession, **kwargs) -> str:
    """GET request wrapper to fetch page HTML.

    kwargs are passed to `session.request()`.
    """
    resp = await session.request(method="GET", url=url, **kwargs)
    resp.raise_for_status()
    logger.info("Got response [%s] for URL: %s", resp.status, url)
    html = await resp.text()
    return html

async def parse(url: str, session: ClientSession, **kwargs) -> set:
    """Find HREFs in the HTML of `url`."""
    found = set()
    try:
        html = await fetch_html(url=url, session=session, **kwargs)
    except (
        aiohttp.ClientError,
        aiohttp.http_exceptions.HttpProcessingError,
    ) as e:
        logger.error(
            "aiohttp exception for %s [%s]: %s",
            url,
            getattr(e, "status", None),
            getattr(e, "message", None),
        )
        return found
    except Exception as e:
        logger.exception(
            "Non-aiohttp exception occurred: %s", getattr(e, "__dict__", {})
        )
        return found
    else:
        for link in HREF_RE.findall(html):
            try:
                abslink = urllib.parse.urljoin(url, link)
            except (urllib.error.URLError, ValueError):
                logger.exception("Error parsing URL: %s", link)
            else:
                found.add(abslink)
        logger.info("Found %d links for %s", len(found), url)
        return found

async def write_one(file: IO, url: str, **kwargs) -> None:
    """Write the found HREFs from `url` to `file`."""
    res = await parse(url=url, **kwargs)
    if not res:
        return None
    async with aiofiles.open(file, "a") as f:
        for p in res:
            await f.write(f"{url}\t{p}\n")
        logger.info("Wrote results for source URL: %s", url)

async def bulk_crawl_and_write(file: IO, urls: set, **kwargs) -> None:
    """Crawl & write concurrently to `file` for multiple `urls`."""
    async with ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(
                write_one(file=file, url=url, session=session, **kwargs)
            )
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    import pathlib
    import sys

    assert sys.version_info >= (3, 7), "Script requires Python 3.7+."
    here = pathlib.Path(__file__).parent

    with open(here.joinpath("urls.txt")) as infile:
        urls = set(map(str.strip, infile))

    outpath = here.joinpath("foundurls.txt")
    with open(outpath, "w") as outfile:
        outfile.write("source_url\tparsed_url\n")

    asyncio.run(bulk_crawl_and_write(file=outpath, urls=urls))

This script is longer than our initial toy programs, so let’s break it down.

The constant HREF_RE is a regular expression to extract what we’re ultimately searching for, href tags within HTML:

>>> HREF_RE.search('Go to <a href="https://realpython.com/">Real Python</a>')
<re.Match object; span=(9, 39), match='href="https://realpython.com/"'>

The coroutine fetch_html() is a wrapper around a GET request to make the request and decode the resulting page HTML. It makes the request, awaits the response, and raises right away in the case of a non-200 status:

resp = await session.request(method="GET", url=url, **kwargs)
resp.raise_for_status()

If the status is okay, fetch_html() returns the page HTML (a str). Notably, there is no exception handling done in this function. The logic is to propagate that exception to the caller and let it be handled there.

We await session.request() and resp.text() because they’re awaitable coroutines. The request/response cycle would otherwise be the long-tailed, time-hogging portion of the application, but with async IO, fetch_html() lets the event loop work on other readily available jobs such as parsing and writing URLs that have already been fetched.

Next in the chain of coroutines comes parse(), which waits on fetch_html() for a given URL, and then extracts all of the href tags from that page’s HTML, making sure that each is valid and formatting it as an absolute path.

Admittedly, the second portion of parse() is blocking, but it consists of a quick regex match and ensuring that the links discovered are made into absolute paths.

In this specific case, this synchronous code should be quick and inconspicuous. But just remember that any line within a given coroutine will block other coroutines unless that line uses yield, await, or return. If the parsing were a more intensive process, you might want to consider running this portion in its own process with loop.run_in_executor().
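For instance, here’s a hedged sketch of what that might look like (parse_links() is a hypothetical stand-in for heavier parsing logic, not a function from areq.py):

import asyncio
import re
from concurrent.futures import ProcessPoolExecutor

HREF_RE = re.compile(r'href="(.*?)"')

def parse_links(html: str) -> set:
    """A stand-in for CPU-heavy, blocking parsing logic."""
    return set(HREF_RE.findall(html))

async def parse_heavy(html: str) -> set:
    loop = asyncio.get_running_loop()  # Python 3.7+
    # Offload the blocking call to a separate process so that it
    # doesn't stall other coroutines running on the event loop.
    with ProcessPoolExecutor() as pool:
        return await loop.run_in_executor(pool, parse_links, html)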

Next, the coroutine write_one() takes a file object and a single URL, and waits on parse() to return a set of the parsed URLs, writing each to the file asynchronously along with its source URL through use of aiofiles, a package for async file IO.

Lastly, bulk_crawl_and_write() serves as the main entry point into the script’s chain of coroutines. It uses a single session, and a task is created for each URL that is ultimately read from urls.txt.

One additional point deserves mention: if you’d like to explore a bit more, the companion files for this tutorial, up at GitHub, have comments and docstrings attached as well.

Here’s the execution in all of its glory, as areq.py gets, parses, and saves results for 9 URLs in under a second:

$ python3 areq.py
21:33:22 DEBUG:asyncio: Using selector: KqueueSelector
21:33:22 INFO:areq: Got response [200] for URL: https://www.mediamatters.org/
21:33:22 INFO:areq: Found 115 links for https://www.mediamatters.org/
21:33:22 INFO:areq: Got response [200] for URL: https://www.nytimes.com/guides/
21:33:22 INFO:areq: Got response [200] for URL: https://www.politico.com/tipsheets/morning-money
21:33:22 INFO:areq: Got response [200] for URL: https://www.ietf.org/rfc/rfc2616.txt
21:33:22 ERROR:areq: aiohttp exception for https://docs.python.org/3/this-url-will-404.html [404]: Not Found
21:33:22 INFO:areq: Found 120 links for https://www.nytimes.com/guides/
21:33:22 INFO:areq: Found 143 links for https://www.politico.com/tipsheets/morning-money
21:33:22 INFO:areq: Wrote results for source URL: https://www.mediamatters.org/
21:33:22 INFO:areq: Found 0 links for https://www.ietf.org/rfc/rfc2616.txt
21:33:22 INFO:areq: Got response [200] for URL: https://1.1.1.1/
21:33:22 INFO:areq: Wrote results for source URL: https://www.nytimes.com/guides/
21:33:22 INFO:areq: Wrote results for source URL: https://www.politico.com/tipsheets/morning-money
21:33:22 INFO:areq: Got response [200] for URL: https://www.bloomberg.com/markets/economics
21:33:22 INFO:areq: Found 3 links for https://www.bloomberg.com/markets/economics
21:33:22 INFO:areq: Wrote results for source URL: https://www.bloomberg.com/markets/economics
21:33:23 INFO:areq: Found 36 links for https://1.1.1.1/
21:33:23 INFO:areq: Got response [200] for URL: https://regex101.com/
21:33:23 INFO:areq: Found 23 links for https://regex101.com/
21:33:23 INFO:areq: Wrote results for source URL: https://regex101.com/
21:33:23 INFO:areq: Wrote results for source URL: https://1.1.1.1/

That’s not too shabby! As a sanity check, you can check the line-count on the output. In my case, it’s 626, though keep in mind this may fluctuate:

$ wc -l foundurls.txt
     626 foundurls.txt

$ head -n 3 foundurls.txt
source_url  parsed_url
https://www.bloomberg.com/markets/economics https://www.bloomberg.com/feedback
https://www.bloomberg.com/markets/economics https://www.bloomberg.com/notices/tos

Async IO in Context

Now that you’ve seen a healthy dose of code, let’s step back for a minute and consider when async IO is an ideal option and how you can make the comparison to arrive at that conclusion or otherwise choose a different model of concurrency.

When and Why Is Async IO the Right Choice?

This tutorial is no place for an extended treatise on async IO versus threading versus multiprocessing. However, it’s useful to have an idea of when async IO is probably the best candidate of the three.

The battle over async IO versus multiprocessing is not really a battle at all. In fact, they can be used in concert. If you have multiple, fairly uniform CPU-bound tasks (a great example is a grid search in libraries such as scikit-learn or keras), multiprocessing should be an obvious choice.

Simply putting async before every function is a bad idea if all of the functions use blocking calls. (This can actually slow down your code.) But as mentioned previously, there are places where async IO and multiprocessing can live in harmony.

The contest between async IO and threading is a little bit more direct. I mentioned in the introduction that “threading is hard.” The full story is that, even in cases where threading seems easy to implement, it can still lead to infamous impossible-to-trace bugs due to race conditions and memory usage, among other things.

Threading also tends to scale less elegantly than async IO, because threads are a system resource with a finite availability. Creating thousands of threads will fail on many machines, and I don’t recommend trying it in the first place. Creating thousands of async IO tasks is completely feasible.
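As a rough sketch of that scaling claim (the one-second sleep stands in for IO wait), spawning ten thousand tasks is unremarkable:

import asyncio

async def tiny_job(i: int) -> int:
    await asyncio.sleep(1)  # Simulated IO wait
    return i

async def main() -> None:
    # Tasks are cheap objects scheduled on one thread, not OS threads,
    # so thousands of them can wait concurrently.
    results = await asyncio.gather(*(tiny_job(i) for i in range(10_000)))
    print(len(results))  # 10000, after roughly one second

asyncio.run(main())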

Async IO shines when you have multiple IO-bound tasks where the tasks would otherwise be dominated by blocking IO-bound wait time, such as:

  • Network IO, whether your program is the server or the client side

  • Serverless designs, such as a peer-to-peer, multi-user network like a group chatroom

  • Read/write operations where you want to mimic a “fire-and-forget” style but worry less about holding a lock on whatever you’re reading and writing to

The biggest reason not to use it is that await only supports a specific set of objects that define a specific set of methods. If you want to do async read operations with a certain DBMS, you’ll need to find not just a Python wrapper for that DBMS, but one that supports the async/await syntax. Coroutines that contain synchronous calls block other coroutines and tasks from running.
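A minimal sketch of that last point: a synchronous call such as time.sleep() inside a coroutine blocks the entire event loop, while awaiting asyncio.sleep() yields control so other tasks can run:

import asyncio
import time

async def blocking() -> None:
    time.sleep(1)  # Blocking: freezes the whole event loop for a second

async def cooperative() -> None:
    await asyncio.sleep(1)  # Non-blocking: other tasks run during the wait

async def main() -> None:
    # Two cooperative sleeps overlap (~1s total); two blocking sleeps
    # would run back-to-back (~2s total).
    await asyncio.gather(cooperative(), cooperative())

asyncio.run(main())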

For a shortlist of libraries that work with async/await, see the list at the end of this tutorial.

Async IO It Is, but Which One?

This tutorial focuses on async IO, the async/await syntax, and using asyncio for event-loop management and specifying tasks. asyncio certainly isn’t the only async IO library out there. This observation from Nathaniel J. Smith says a lot:

[In] a few years, asyncio might find itself relegated to becoming one of those stdlib libraries that savvy developers avoid, like urllib2.

What I’m arguing, in effect, is that asyncio is a victim of its own success: when it was designed, it used the best approach possible; but since then, work inspired by asyncio – like the addition of async/await – has shifted the landscape so that we can do even better, and now asyncio is hamstrung by its earlier commitments. (Source)

To that end, a few big-name alternatives that do what asyncio does, albeit with different APIs and different approaches, are curio and trio. Personally, I think that if you’re building a moderately sized, straightforward program, just using asyncio is plenty sufficient and understandable, and lets you avoid adding yet another large dependency outside of Python’s standard library.

But by all means, check out curio and trio, and you might find that they get the same thing done in a way that’s more intuitive for you as the user. Many of the package-agnostic concepts presented here should permeate to alternative async IO packages as well.

Odds and Ends

In these next few sections, you’ll cover some miscellaneous parts of asyncio and async/await that haven’t fit neatly into the tutorial thus far, but are still important for building and understanding a full program.

Other Top-Level asyncio Functions

In addition to asyncio.run(), you’ve seen a few other package-level functions such as asyncio.create_task() and asyncio.gather().

You can use create_task() to schedule the execution of a coroutine object, followed by asyncio.run():

>>> import asyncio

>>> async def coro(seq) -> list:
...     """'IO' wait time is proportional to the max element."""
...     await asyncio.sleep(max(seq))
...     return list(reversed(seq))
...
>>> async def main():
...     # This is a bit redundant in the case of one task
...     # We could use `await coro([3, 2, 1])` on its own
...     t = asyncio.create_task(coro([3, 2, 1]))  # Python 3.7+
...     await t
...     print(f't: type {type(t)}')
...     print(f't done: {t.done()}')
...
>>> t = asyncio.run(main())
t: type <class '_asyncio.Task'>
t done: True

There’s a subtlety to this pattern: if you don’t await t within main(), it may finish before main() itself signals that it is complete. Because asyncio.run(main()) calls loop.run_until_complete(main()), the event loop is only concerned (without await t present) that main() is done, not that the tasks that get created within main() are done. Without await t, the loop’s other tasks will be cancelled, possibly before they are completed. If you need to get a list of currently pending tasks, you can use asyncio.Task.all_tasks().
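Here’s a minimal sketch of that pitfall (hypothetical code, not from the examples above): without an await, the pending task is cancelled when the loop shuts down:

import asyncio

async def background() -> None:
    await asyncio.sleep(1)
    print("finished")  # May never print

async def main() -> None:
    asyncio.create_task(background())
    # main() returns immediately; asyncio.run() then cancels the
    # still-pending task while shutting down the event loop.

asyncio.run(main())  # Likely produces no output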

Separately, there’s asyncio.gather(). While it doesn’t do anything tremendously special, gather() is meant to neatly put a collection of coroutines (futures) into a single future. As a result, it returns a single future object, and, if you await asyncio.gather() and specify multiple tasks or coroutines, you’re waiting for all of them to be completed. (This somewhat parallels queue.join() from our earlier example.) The result of gather() will be a list of the results across the inputs:

>>> import time
>>> async def main():
...     t = asyncio.create_task(coro([3, 2, 1]))
...     t2 = asyncio.create_task(coro([10, 5, 0]))  # Python 3.7+
...     print('Start:', time.strftime('%X'))
...     a = await asyncio.gather(t, t2)
...     print('End:', time.strftime('%X'))  # Should be 10 seconds
...     print(f'Both tasks done: {all((t.done(), t2.done()))}')
...     return a
...
>>> a = asyncio.run(main())
Start: 16:20:11
End: 16:20:21
Both tasks done: True
>>> a
[[1, 2, 3], [0, 5, 10]]

You probably noticed that gather() waits on the entire result set of the Futures or coroutines that you pass it. Alternatively, you can loop over asyncio.as_completed() to get tasks as they are completed, in the order of completion. The function returns an iterator that yields tasks as they finish. Below, the result of coro([3, 2, 1]) will be available before coro([10, 5, 0]) is complete, which is not the case with gather():

>>> async def main():
...     t = asyncio.create_task(coro([3, 2, 1]))
...     t2 = asyncio.create_task(coro([10, 5, 0]))
...     print('Start:', time.strftime('%X'))
...     for res in asyncio.as_completed((t, t2)):
...         compl = await res
...         print(f'res: {compl} completed at {time.strftime("%X")}')
...     print('End:', time.strftime('%X'))
...     print(f'Both tasks done: {all((t.done(), t2.done()))}')
...
>>> a = asyncio.run(main())
Start: 09:49:07
res: [1, 2, 3] completed at 09:49:10
res: [0, 5, 10] completed at 09:49:17
End: 09:49:17
Both tasks done: True

Lastly, you may also see asyncio.ensure_future(). You should rarely need it, because it’s a lower-level plumbing API that has largely been replaced by create_task(), which was introduced later.
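For reference, a quick sketch of the difference (assuming these lines run inside a coroutine, where an event loop is already running, and that coro() is defined as above):

# Inside a running coroutine:
t1 = asyncio.ensure_future(coro([1]))  # Accepts any awaitable
t2 = asyncio.create_task(coro([1]))    # Accepts a coroutine; preferred in 3.7+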

The Precedence of await

While they behave somewhat similarly, the await keyword has significantly higher precedence than yield. This means that, because it is more tightly bound, there are a number of instances where you’d need parentheses in a yield from statement that are not required in an analogous await statement. For more information, see examples of await expressions from PEP 492.
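As an illustrative sketch of that difference in binding (the coroutine names here are hypothetical):

import asyncio

async def one() -> int:
    return 1

async def demo() -> int:
    # await binds tightly: this parses as (await one()) + 1.
    # A generator-based version would need explicit parentheses:
    #     result = (yield from some_gen()) + 1
    return await one() + 1

print(asyncio.run(demo()))  # 2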

Conclusion

You’re now equipped to use async/await and the libraries built off of it. You’ve covered what async IO is as a language-agnostic paradigm, how it compares with concurrency models such as threading and multiprocessing, and how to define and run coroutines with the async/await keywords and the asyncio package.

Resources

Python Version Specifics

Async IO in Python has evolved swiftly, and it can be hard to keep track of what came when. Here’s a list of Python minor-version changes and introductions related to asyncio:

  • 3.3: The yield from expression allows for generator delegation.

  • 3.4: asyncio was introduced in the Python standard library with provisional API status.

  • 3.5: async and await became a part of the Python grammar, used to signify and wait on coroutines. They were not yet reserved keywords. (You could still define functions or variables named async and await.)

  • 3.6: Asynchronous generators and asynchronous comprehensions were introduced. The API of asyncio was declared stable rather than provisional.

  • 3.7: async and await became reserved keywords. (They cannot be used as identifiers.) They are intended to replace the asyncio.coroutine() decorator. asyncio.run() was introduced to the asyncio package, among a bunch of other features.

If you want to be safe (and be able to use asyncio.run()), go with Python 3.7 or above to get the full set of features.

Articles

Here’s a curated list of additional resources:

  • Real Python: Speed up your Python Program with Concurrency
  • Real Python: What is the Python Global Interpreter Lock?
  • CPython: The asyncio package source
  • Python docs: Data model > Coroutines
  • TalkPython: Async Techniques and Examples in Python
  • Brett Cannon: How the Heck Does Async-Await Work in Python 3.5?
  • PYMOTW: asyncio
  • A. Jesse Jiryu Davis and Guido van Rossum: A Web Crawler With asyncio Coroutines
  • Andy Pearce: The State of Python Coroutines: yield from
  • Nathaniel J. Smith: Some Thoughts on Asynchronous API Design in a Post-async/await World
  • Armin Ronacher: I don’t understand Python’s Asyncio
  • Andy Balaam: series on asyncio (4 posts)
  • Stack Overflow: Python asyncio.semaphore in async-await function
  • Yeray Diaz:
    • AsyncIO for the Working Python Developer
    • Asyncio Coroutine Patterns: Beyond await

A few Python What’s New sections explain the motivation behind language changes in more detail:

  • What’s New in Python 3.3 (yield from and PEP 380)
  • What’s New in Python 3.6 (PEP 525 & 530)

From David Beazley:

  • Generator: Tricks for Systems Programmers
  • A Curious Course on Coroutines and Concurrency
  • Generators: The Final Frontier

YouTube talks:

  • John Reese - Thinking Outside the GIL with AsyncIO and Multiprocessing - PyCon 2018
  • Keynote David Beazley - Topics of Interest (Python Asyncio)
  • David Beazley - Python Concurrency From the Ground Up: LIVE! - PyCon 2015
  • Raymond Hettinger, Keynote on Concurrency, PyBay 2017
  • Thinking about Concurrency, Raymond Hettinger, Python core developer
  • Miguel Grinberg - Asynchronous Python for the Complete Beginner - PyCon 2017
  • Yury Selivanov - async/await and asyncio in Python 3.6 and beyond - PyCon 2017
  • Fear and Awaiting in Async: A Savage Journey to the Heart of the Coroutine Dream
  • What Is Async, How Does It Work, and When Should I Use It? (PyCon APAC 2014)

Libraries That Work With async/await

From aio-libs:

  • aiohttp: Asynchronous HTTP client/server framework
  • aioredis: Async IO Redis support
  • aiopg: Async IO PostgreSQL support
  • aiomcache: Async IO memcached client
  • aiokafka: Async IO Kafka client
  • aiozmq: Async IO ZeroMQ support
  • aiojobs: Jobs scheduler for managing background tasks
  • async_lru: Simple LRU cache for async IO

From magicstack:

  • uvloop: Ultra fast async IO event loop
  • asyncpg: (Also very fast) async IO PostgreSQL support

From other hosts:

  • trio: Friendlier asyncio intended to showcase a radically simpler design
  • aiofiles: Async file IO
  • asks: Async requests-like http library
  • asyncio-redis: Async IO Redis support
  • aioprocessing: Integrates multiprocessing module with asyncio
  • umongo: Async IO MongoDB client
  • unsync: Unsynchronize asyncio
  • aiostream: Like itertools, but async
