Guide: What is data transformation in Python?


Introduction

Visualization is an important tool for generating insight, but it is rare that you get the data in exactly the form you need. Often you will need to create new variables or summaries, rename variables, or reorder observations to make the data easier to work with. You will learn how to do all of that (and more!) in this chapter, which teaches you how to transform your data using the pandas package and a new dataset on flights departing New York City in 2013.


Prerequisites

In this chapter we will focus on how to use the pandas package, the foundational package for data science in Python. We will illustrate the key ideas using data from the nycflights13 R package, and use Altair to help us understand the data. We will also need two additional Python packages to help with mathematical and statistical functions: NumPy and SciPy. Note that `from scipy import stats` follows the SciPy guidance to import functions from submodule namespaces. From here on we will call SciPy functions with the `stats.<FUNCTION>` structure.

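# Core packages used throughout this chapter
import pandas as pd
import altair as alt
import numpy as np
from scipy import stats

# Flights data exported from the nycflights13 R package
flights_url = "https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/flights/flights.csv"

# Read the CSV, then parse time_hour into a datetime column
flights = pd.read_csv(flights_url)
flights['time_hour'] = pd.to_datetime(flights.time_hour, format="%Y-%m-%d %H:%M:%S")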
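
As a small illustration of the `stats.<FUNCTION>` calling pattern, the sketch below calls one SciPy function through the `stats` namespace. The values in `delays` are made up for the example and are not taken from the flights data.

```python
from scipy import stats

# Hypothetical departure delays in minutes (illustrative values only).
delays = [1, 2, 3, 4, 5]

# Call a SciPy function through the `stats` namespace:
# the interquartile range is the 75th percentile minus the 25th percentile.
spread = stats.iqr(delays)
print(spread)
```

Any other function in `scipy.stats` is reached the same way, by prefixing its name with `stats.`.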

nycflights13

To explore the basic data manipulation verbs of pandas, we will use `flights`. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in the nycflights13 package.

#>         year  month  day  ...  hour  minute                 time_hour
#> 0       2013      1    1  ...     5      15 2013-01-01 10:00:00+00:00
#> 1       2013      1    1  ...     5      29 2013-01-01 10:00:00+00:00
#> 2       2013      1    1  ...     5      40 2013-01-01 10:00:00+00:00
#> 3       2013      1    1  ...     5      45 2013-01-01 10:00:00+00:00
#> 4       2013      1    1  ...     6       0 2013-01-01 11:00:00+00:00
#> ...      ...    ...  ...  ...   ...     ...                       ...
#> 336771  2013      9   30  ...    14      55 2013-09-30 18:00:00+00:00
#> 336772  2013      9   30  ...    22       0 2013-10-01 02:00:00+00:00
#> 336773  2013      9   30  ...    12      10 2013-09-30 16:00:00+00:00
#> 336774  2013      9   30  ...    11      59 2013-09-30 15:00:00+00:00
#> 336775  2013      9   30  ...     8      40 2013-09-30 12:00:00+00:00
#> 
#> [336776 rows x 19 columns]

You might notice that this data frame does not print in its entirety like other data frames you may have seen in the past: it shows only the first few and last few rows, and only the columns that fit on one screen.
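
If the truncated printout hides something you need, one option is to lift the pandas display limits temporarily. The sketch below is only an illustration: `wide` is a hypothetical stand-in for `flights`, and the option values are assumptions you may want to tune for your own screen.

```python
import pandas as pd

# Hypothetical wide data frame standing in for `flights` (30 columns, 100 rows).
wide = pd.DataFrame({f"col_{i}": range(100) for i in range(30)})

# head() limits the number of rows explicitly.
first_rows = wide.head(3)

# Temporarily remove the column limit and widen the line so that
# every column is rendered; the settings revert after the `with` block.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full_text = repr(first_rows)

print(full_text)
```

Using `pd.option_context` keeps the change local, so later printouts go back to the compact default.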

Using `flights.dtypes` will show you the variable type for each column:

#> year                            int64
#> month                           int64
#> day                             int64
#> dep_time                      float64
#> sched_dep_time                  int64
#> dep_delay                     float64
#> arr_time                      float64
#> sched_arr_time                  int64
#> arr_delay                     float64
#> carrier                        object
#> flight                          int64
#> tailnum                        object
#> origin                         object
#> dest                           object
#> air_time                      float64
#> distance                        int64
#> hour                            int64
#> minute                          int64
#> time_hour         datetime64[ns, UTC]
#> dtype: object

`int64` stands for integers.

`float64` stands for doubles, or real numbers.

`object` stands for character vectors, or strings.

`datetime64` stands for date-times (a date plus a time) and dates. You can read more about the pandas datetime tools in the pandas documentation.

There are two other common variable types that are not used in this dataset but that you will encounter later in the book:

`bool` stands for logical vectors, which contain only `True` or `False`.

`category` stands for factors, which pandas uses to represent categorical variables with a fixed set of possible values.
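
To see these types side by side, here is a minimal sketch on a hypothetical miniature data frame (`toy` and its values are invented for illustration and are not part of the flights data). The exact labels pandas prints for string and datetime columns can vary between pandas versions, so treat the comments as typical rather than guaranteed.

```python
import pandas as pd

# Hypothetical three-row frame with one column per type discussed above.
toy = pd.DataFrame({
    "flight": [1545, 1714, 1141],          # integers -> int64
    "dep_delay": [2.0, 4.0, None],         # reals; None is stored as NaN -> float64
    "carrier": ["UA", "UA", "AA"],         # strings -> typically object
    "time_hour": pd.to_datetime(
        ["2013-01-01 05:00", "2013-01-01 05:00", "2013-01-01 06:00"]
    ),                                     # date-times -> datetime64
})

# A comparison produces a logical (bool) column; NaN > 0 evaluates to False.
toy["delayed"] = toy["dep_delay"] > 0

# A categorical column with a fixed set of possible values.
toy["origin"] = pd.Categorical(["EWR", "LGA", "JFK"])

print(toy.dtypes)
```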

Using `flights.info()` also prints the data types, along with other useful information about your pandas data frame.

flights.info()
#> <class 'pandas.core.frame.DataFrame'>
#> RangeIndex: 336776 entries, 0 to 336775
#> Data columns (total 19 columns):
#>  #   Column          Non-Null Count   Dtype
#> ---  ------          --------------   -----
#>  0   year            336776 non-null  int64
#>  1   month           336776 non-null  int64
#>  2   day             336776 non-null  int64
#>  3   dep_time        328521 non-null  float64
#>  4   sched_dep_time  336776 non-null  int64
#>  5   dep_delay       328521 non-null  float64
#>  6   arr_time        328063 non-null  float64
#>  7   sched_arr_time  336776 non-null  int64
#>  8   arr_delay       327346 non-null  float64
#>  9   carrier         336776 non-null  object
#>  10  flight          336776 non-null  int64
#>  11  tailnum         334264 non-null  object
#>  12  origin          336776 non-null  object
#>  13  dest            336776 non-null  object
#>  14  air_time        327346 non-null  float64
#>  15  distance        336776 non-null  int64
#>  16  hour            336776 non-null  int64
#>  17  minute          336776 non-null  int64
#>  18  time_hour       336776 non-null  datetime64[ns, UTC]
#> dtypes: datetime64[ns, UTC](1), float64(5), int64(9), object(4)
#> memory usage: 48.8+ MB
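
The non-null counts reported by `info()` are a quick way to spot missing values. As a hedged sketch, the same counts can be computed directly on a hypothetical miniature frame (`toy` is invented for illustration), and the `info()` text can be captured through its `buf` argument if you want to inspect it as a string.

```python
import io
import pandas as pd

# Hypothetical frame with one missing departure time.
toy = pd.DataFrame({
    "dep_time": [517.0, None, 542.0],
    "carrier": ["UA", "UA", "AA"],
})

# Capture the info() report as text instead of printing it.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()

# Per-column counts of present and missing values.
non_null = toy.notna().sum()
missing = toy.isna().sum()

print(summary)
print(missing)
```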

pandas data manipulation basics

In this chapter you will learn the five key pandas functions, or object methods. Object methods are things that objects can do. For example, pandas data frames know how to tell you their shape, and the pandas object knows how to concatenate two data frames. We tell an object to do something with the dot operator. We will refer to these object operators as functions or methods. Below are the five methods that let you solve the vast majority of your data manipulation challenges:

Pick observations by their values (`query()`).

Reorder the rows (`sort_values()`).

Pick variables by their names (`filter()`).

Create new variables with functions of existing variables (`assign()`).

Collapse many values down to a single summary (`groupby()`).
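
The sketch below runs each of the five operations once, assuming they correspond to the pandas methods `query()`, `sort_values()`, `filter()`, `assign()`, and `groupby()` followed by `agg()`. The data frame `toy`, its values, and the new column names `gain` and `mean_dep_delay` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `flights` data frame.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "carrier": ["UA", "AA", "UA", "AA"],
    "dep_delay": [2.0, 10.0, -3.0, 5.0],
    "arr_delay": [11.0, 20.0, -8.0, 1.0],
})

# The dot operator asks the object for an attribute or method.
n_rows, n_cols = toy.shape

# 1. Pick observations by their values.
january = toy.query("month == 1")

# 2. Reorder the rows, largest departure delay first.
ordered = toy.sort_values("dep_delay", ascending=False)

# 3. Pick variables by their names.
delays = toy.filter(items=["carrier", "dep_delay"])

# 4. Create a new variable from existing ones.
with_gain = toy.assign(gain=toy["dep_delay"] - toy["arr_delay"])

# 5. Collapse many values to one summary per carrier.
summary = toy.groupby("carrier").agg(mean_dep_delay=("dep_delay", "mean"))

print(summary)
```

Each call returns a new data frame and leaves `toy` unchanged, which is why the results are assigned to new names.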

The pandas package can handle all of the same functionality as dplyr in R. You can read the pandas mapping guide and the Towards Data Science article for more detail on the brief table that follows.

Table 5.1: Comparable functions in R-dplyr and Python-pandas

R dplyr function	Python pandas function
`filter()`	`query()`
`arrange()`	`sort_values()`
`select()`	`filter()` or `loc[]`
`rename()`	`rename()`
`mutate()`	`assign()` (see note)
`group_by()`	`groupby()`
`summarise()`	`agg()`

Note: The `dplyr::mutate()` function works similarly to `assign()` in pandas on data frames. But you cannot use `assign()` on a grouped data frame in pandas the way you would use `dplyr::mutate()` on a grouped object. In that case you would use `transform()`, and even then the functionality is not quite the same.
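The difference between `assign()` on a plain data frame and `transform()` on a grouped one can be sketched on a small example. The `toy` frame and its column names below are hypothetical stand-ins for the flights data, not part of the original chapter.

```python
import pandas as pd

# Hypothetical toy data standing in for the flights table.
toy = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "UA"],
    "dep_delay": [10.0, 20.0, 5.0, 15.0],
})

# assign() on a plain data frame: add a column computed from existing ones.
with_hours = toy.assign(dep_delay_hours=toy["dep_delay"] / 60)

# Grouped case: transform() returns one value per original row,
# here each carrier's mean delay broadcast back to that carrier's rows.
carrier_mean = toy.groupby("carrier")["dep_delay"].transform("mean")
with_mean = toy.assign(carrier_mean_delay=carrier_mean)

print(with_mean)
```

Here the grouped result is attached back to the ungrouped frame with `assign()`, which is one common way to approximate a grouped `mutate()`.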

`groupby()` changes the scope of each function from operating on the entire dataset to operating on it group by group. These functions provide the verbs for a language of data manipulation.

All verbs work similarly:

The first argument is a pandas DataFrame (in method syntax, the object the method is called on).
The subsequent methods describe what to do with the data frame.
The result is a new data frame.

Together these properties make it easy to chain multiple simple steps to achieve a complex result. Let's dive in and see how these verbs work.
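Because each verb takes a data frame and returns a new one, the steps can be chained. The following is a minimal sketch on a hypothetical `toy` frame (the column names are assumed for illustration only):

```python
import pandas as pd

# Hypothetical toy data; the real flights table has many more columns.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "day": [1, 2, 1, 2],
    "dep_delay": [4.0, -2.0, 30.0, 12.0],
})

result = (
    toy
    .query("dep_delay > 0")                                  # keep delayed flights
    .assign(dep_delay_hours=lambda d: d["dep_delay"] / 60)   # add a derived column
    .sort_values("dep_delay", ascending=False)               # longest delay first
)

print(result)
```

Each step returns a new data frame, so `toy` itself is left unchanged by the chain.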

Filter rows with `.query()`

`.query()` lets you subset observations based on their values. Its argument is a string holding a Boolean expression written in terms of the column names, and the rows for which the expression is true are kept. (With `.loc[]`, by contrast, the first argument selects rows by label or by a Boolean series and the second argument selects columns.) Boolean filtering of rows is our focus. For example, we can select all flights on January 1st with:

flights.query('month == 1 & day == 1')
#>      year  month  day  ...  hour  minute                 time_hour
#> 0    2013      1    1  ...     5      15 2013-01-01 10:00:00+00:00
#> 1    2013      1    1  ...     5      29 2013-01-01 10:00:00+00:00
#> 2    2013      1    1  ...     5      40 2013-01-01 10:00:00+00:00
#> 3    2013      1    1  ...     5      45 2013-01-01 10:00:00+00:00
#> 4    2013      1    1  ...     6       0 2013-01-01 11:00:00+00:00
#> ..    ...    ...  ...  ...   ...     ...                       ...
#> 837  2013      1    1  ...    23      59 2013-01-02 04:00:00+00:00
#> 838  2013      1    1  ...    16      30 2013-01-01 21:00:00+00:00
#> 839  2013      1    1  ...    19      35 2013-01-02 00:00:00+00:00
#> 840  2013      1    1  ...    15       0 2013-01-01 20:00:00+00:00
#> 841  2013      1    1  ...     6       0 2013-01-01 11:00:00+00:00
#> 
#> [842 rows x 19 columns]

The previous expression is equivalent to `flights[(flights.month == 1) & (flights.day == 1)]`.

When you run that line of code, pandas executes the filtering operation and returns a new data frame. pandas functions usually don't modify their inputs, so if you want to save the result, you'll need to use the assignment operator, `=`:

jan1 = flights.query('month == 1 & day == 1')

Interactive Python either prints out the results or saves them to a variable.
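The equivalence between `query()` and Boolean-mask indexing, and the fact that the input frame is left unmodified, can be checked on a small example. The `toy` frame below is a hypothetical stand-in for `flights`.

```python
import pandas as pd

# Hypothetical toy data standing in for the flights table.
toy = pd.DataFrame({
    "month": [1, 1, 2, 12],
    "day": [1, 2, 1, 1],
})

# Same filter written two ways.
via_query = toy.query("month == 1 & day == 1")
via_mask = toy[(toy["month"] == 1) & (toy["day"] == 1)]

print(via_query.equals(via_mask))  # do both approaches select the same rows?
print(len(toy))                    # the original frame keeps all of its rows
```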

Comparisons

To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. Python provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
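As a quick illustration, each comparison operator can be used inside a `query()` string. The sketch below counts matching rows in a hypothetical `toy` frame (the `dep_delay` values are made up for the example):

```python
import pandas as pd

# Hypothetical delays in minutes.
toy = pd.DataFrame({"dep_delay": [-5, 0, 10, 120]})

# Count the rows matched by each comparison against zero.
counts = {
    op: len(toy.query(f"dep_delay {op} 0"))
    for op in [">", ">=", "<", "<=", "!=", "=="]
}

print(counts)
```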

When you're starting out with Python, the easiest mistake to make is to use `=` instead of `==` when testing for equality. When this happens you'll get an error:

flights.query('month = 1')
#> Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: cannot assign without a target object
#> 
#> Detailed traceback:
#>   File "", line 1, in 
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 3341, in query
#>     res = self.eval(expr, **kwargs)
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 3471, in eval
#>     return _eval(expr, inplace=inplace, **kwargs)
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 341, in eval
#>     parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 787, in __init__
#>     self.terms = self.parse()
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 806, in parse
#>     return self._visitor.visit(self.expr)
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
#>     return visitor(node, **kwargs)
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 404, in visit_Module
#>     return self.visit(expr, **kwargs)
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 398, in visit
#>     return visitor(node, **kwargs)
#>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 607, in visit_Assign
#>     raise ValueError("cannot assign without a target object")

There's another common problem you might encounter when using `==`: floating-point numbers. The following result might surprise you!

np.sqrt(2) ** 2 ==  2
#> False
1 / 49 * 49 == 1
#> False

Computers use finite-precision arithmetic (they obviously can't store an infinite number of digits!), so remember that every number you see is an approximation. Instead of relying on `==`, use `np.isclose()`:

np.isclose(np.sqrt(2) ** 2,  2)
#> True
np.isclose(1 / 49 * 49, 1)
#> True
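`np.isclose()` also works element-wise on arrays, and its `rtol` and `atol` parameters control how much difference is tolerated. The sketch below uses made-up values to contrast exact comparison, the default tolerance, and a deliberately tiny tolerance:

```python
import numpy as np

values = np.array([np.sqrt(2) ** 2, 1 / 49 * 49, 0.1 + 0.2])
targets = np.array([2.0, 1.0, 0.3])

exact = values == targets                    # exact floating-point equality
close = np.isclose(values, targets)          # default tolerances
tight = np.isclose(values, targets, rtol=0.0, atol=1e-20)  # almost no tolerance

print(exact)
print(close)
print(tight)
```

With a tolerance far below the rounding error, `np.isclose()` behaves much like `==`, so the defaults are usually the sensible starting point.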

Logical operators

`query()` takes a single expression string, so multiple conditions have to be combined inside that string. When conditions are joined with "and", every one of them must be true for a row to be included in the output. For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `~` (or the keyword `not`) is "not". Figure 5.1 shows the complete set of Boolean operations.

Figure 5.1: Complete set of Boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded regions show which parts each operator selects.

The following code finds all flights that departed in November or December:

flights.query('month == 11 | month == 12')
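As a minimal sketch of these operators, the snippet below builds a small stand-in for `flights` and applies `|`, `&`, and `not` inside `.query()`. The frame `toy` and all of its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the flights table; the values are made up.
toy = pd.DataFrame({
    "month": [1, 11, 12, 11, 6],
    "dep_delay": [5, 130, -2, 10, 200],
})

nov_or_dec = toy.query("month == 11 | month == 12")        # "or"
nov_and_late = toy.query("month == 11 & dep_delay > 120")  # "and"
not_nov = toy.query("not (month == 11)")                   # "not"

print(nov_or_dec.index.tolist())    # [1, 2, 3]
print(nov_and_late.index.tolist())  # [1]
print(not_nov.index.tolist())       # [0, 2, 4]
```

Printing the row labels rather than the frames keeps the output short enough to check each operator by eye.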

The order of operations doesn't work like English. You can't write `flights.query('month == (11 | 12)')`, which you might literally translate as "find all flights that departed in November or December". Instead, it compares `month` with the value of `11 | 12`. In Python, `|` between two integers is a bitwise OR, so `11 | 12` evaluates to `15` rather than to a test against either value, and the filter does not return the November and December flights. This is quite confusing!
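The pitfall can be checked in plain Python, outside `.query()`. This sketch uses an invented five-row frame and ordinary boolean masks; it only shows what `11 | 12` means for Python integers and makes no claim about how `.query()` parses the parenthesised form.

```python
import pandas as pd

toy = pd.DataFrame({"month": [1, 11, 12, 11, 6]})  # made-up rows

combined = 11 | 12  # bitwise OR of two integers
wrong = toy[toy["month"] == combined]  # compares month with a single number
right = toy[(toy["month"] == 11) | (toy["month"] == 12)]  # one test per month

print(combined)    # 15
print(len(wrong))  # 0
print(len(right))  # 3
```

Each comparison needs its own parentheses in the mask version, because `|` binds more tightly than `==` in ordinary Python.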

A useful shorthand for this problem is `x in y`. This selects every row where `x` is one of the values in `y`. We can use it to rewrite the code above:

flights.query('month in [11, 12]')
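A small check, on invented data, that the `in` form selects the same rows as the spelled-out `|` form. `Series.isin()` is included as the equivalent outside `.query()`; the frame `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"month": [1, 11, 12, 11, 6]})  # made-up rows

with_or = toy.query("month == 11 | month == 12")
with_in = toy.query("month in [11, 12]")
with_isin = toy[toy["month"].isin([11, 12])]

# All three selections contain the same rows.
print(with_or.equals(with_in))
print(with_in.equals(with_isin))
```

The `in` form scales better when the list of accepted values grows, since the condition stays a single expression.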

Sometimes you can simplify complicated subsetting by remembering De Morgan's laws: `~(x & y)` is the same as `~x | ~y`, and `~(x | y)` is the same as `~x & ~y`. For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters (they agree on rows where both delays are known):

flights.query('not (arr_delay > 120 | dep_delay > 120)')
flights.query('arr_delay <= 120 & dep_delay <= 120')
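De Morgan's law can be verified on a tiny example. The sketch below assumes two delay columns named `arr_delay` and `dep_delay` with invented values and no missing entries. With missing values the two forms may differ, because a comparison against a missing value is `False` and negating it gives `True`.

```python
import pandas as pd

# Invented delays in minutes, deliberately without missing values.
toy = pd.DataFrame({
    "arr_delay": [10, 150, 30, 200, 120],
    "dep_delay": [5, 20, 180, 190, 0],
})

negated_or = toy.query("not (arr_delay > 120 | dep_delay > 120)")
both_small = toy.query("arr_delay <= 120 & dep_delay <= 120")

print(negated_or.index.tolist())      # [0, 4]
print(negated_or.equals(both_small))  # True
```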

Whenever you start using complicated, multipart expressions in `.query()`, consider making them explicit variables instead. That makes it much easier to check your work. You will learn how to create new variables shortly.
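One way to make the parts of a condition explicit, sketched here with ordinary boolean Series rather than new columns. The names `late_arrival`, `late_departure`, and `acceptable` are hypothetical, as is the data.

```python
import pandas as pd

# Invented delays in minutes.
toy = pd.DataFrame({
    "arr_delay": [10, 150, 30, 200, 120],
    "dep_delay": [5, 20, 180, 190, 0],
})

# Name each piece of the condition so it can be inspected on its own.
late_arrival = toy["arr_delay"] > 120
late_departure = toy["dep_delay"] > 120
acceptable = ~(late_arrival | late_departure)

print(acceptable.tolist())  # [True, False, False, False, True]
result = toy[acceptable]
print(result.index.tolist())  # [0, 4]
```

Each intermediate Series can be printed or counted separately, which is what makes the final filter easier to verify.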

Missing values

One important feature of pandas in Python that can make comparisons tricky is missing values, or `NA`s ("not availables"). `NA` represents an unknown value, so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
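A short illustration of how a missing value propagates through arithmetic, using `np.nan` as the missing marker. The Series values are invented. The last line shows one of the exceptions behind "almost": pandas reductions such as `sum()` skip missing values by default.

```python
import numpy as np
import pandas as pd

total = np.nan + 10
half = np.nan / 2
print(total, half)  # nan nan

s = pd.Series([1.0, np.nan, 3.0])  # made-up values
doubled = s * 2
print(doubled.tolist())  # [2.0, nan, 6.0]

# Reductions skip missing values unless skipna=False is passed.
print(s.sum())  # 4.0
```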

The most confusing results are the comparisons: comparing anything with a missing value using `>`, `<`, or `==` returns `False`. The logic for this result is explained on Stack Overflow. The pandas missing data guide is a helpful read.
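The comparisons can be tried directly. In this sketch `x` and `y` are hypothetical names for two unknown values, both represented by `np.nan`.

```python
import numpy as np

x = np.nan  # an unknown value
y = np.nan  # another unknown value

greater = x > 5
equal_to_ten = 10 == x
same = x == y

print(greater, equal_to_ten, same)  # False False False
```

Even two missing values do not compare as equal, which is why an equality test cannot be used to find them.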

It is easiest to understand why this is true with a bit more context: if two values are both unknown, there is no way to tell whether they are equal, so the comparison cannot come out as true.

The Python development team decided to provide a way to find `np.nan` objects in your code by letting `np.nan != np.nan` return `True`. Once again, you can read the rationale for this decision. Python now has `isnan()` functions, such as `math.isnan()` and `np.isnan()`, to make this check more straightforward in your code.
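Both ways of detecting a missing float can be compared side by side. The variable name `value` is only for illustration.

```python
import math
import numpy as np

value = np.nan

differs_from_itself = value != value  # True only for NaN
via_math = math.isnan(value)
via_numpy = bool(np.isnan(value))

print(differs_from_itself, via_math, via_numpy)  # True True True
```

The `isnan()` calls state the intent directly, whereas `value != value` relies on the reader knowing this property of NaN.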

Pandas uses the nan structure in Python to identify NA or 'missing' values. If you want to determine whether a value is missing, use pd.isna().
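As a minimal sketch of that check (the variable names here are illustrative and not taken from the flights data), the snippet below contrasts an equality test, which cannot detect a missing value, with pd.isna():

```python
import numpy as np
import pandas as pd

x = np.nan  # a single missing value

# An equality test cannot detect a missing value: nan is not equal to itself.
equal_to_itself = x == x

# pd.isna() works on a scalar and, element by element, on a Series.
scalar_is_missing = pd.isna(x)
values = pd.Series([1.0, np.nan, 3.0])
missing_mask = pd.isna(values)

print(equal_to_itself)
print(scalar_is_missing)
print(missing_mask.tolist())
```

The mask returned for a Series can be used directly to count or select the missing entries.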

.query() only includes rows where the condition is True; it excludes both False and NA values.
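The behaviour can be seen on a tiny, made-up data frame (the column name x is only an assumption for illustration). The row holding the missing value is dropped because its comparison does not evaluate to True:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, np.nan, 3]})

# 1 > 1 is False and nan > 1 is also False, so only the last row survives.
kept = df.query("x > 1")
print(kept)
```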

If you want to preserve missing values, ask for them explicitly, either with the trick mentioned in the previous paragraph or by using pd.isna() with the symbolic reference @ in your condition.
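One possible way to write both variants is sketched below on a made-up data frame. The helper variable is_missing is hypothetical: it holds a mask built with pd.isna() outside the query string and is then referenced with @, which is only one of several ways to combine the two.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, np.nan, 3]})

# Option 1: a missing value is the only value that differs from itself.
kept_by_self_compare = df.query("x != x | x > 1")

# Option 2: build the mask with pd.isna() and reference it with @.
is_missing = pd.isna(df["x"])  # hypothetical helper variable
kept_by_isna = df.query("@is_missing | x > 1")

print(kept_by_self_compare)
print(kept_by_isna)
```

Both results keep the row with the missing value as well as the row where x is greater than 1.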

Exercises

Find all flights that:

A. Had an arrival delay of two or more hours
B. Flew to Houston (IAH or HOU)
C. Were operated by United, American, or Delta
D. Departed in summer (July, August, and September)
E. Arrived more than two hours late, but didn't leave late
F. Were delayed by at least an hour, but made up over 30 minutes in flight
G. Departed between midnight and 6am (inclusive)

How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

Arrange or sort rows with .sort_values()

.sort_values() works similarly to .query() except that instead of selecting rows, it changes their order. It takes a data frame and a column name or a list of column names to order by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns.
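A small sketch of tie-breaking, using a made-up four-row frame (flights_small is an illustrative stand-in, not the real flights data): rows are ordered by year first, ties are broken by month, and remaining ties by day.

```python
import pandas as pd

flights_small = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [2, 1, 1, 2],
    "day": [5, 9, 3, 1],
})

# All years are equal, so month decides; within a month, day decides.
ordered = flights_small.sort_values(by=["year", "month", "day"])
print(ordered)
```

The original index labels travel with the rows, which makes it easy to see how they were rearranged.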

Use the argument ascending = False to re-order by a column in descending order.
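The sketch below applies the argument to a made-up frame (flights_small is illustrative only). It also shows that ascending accepts a list of booleans, one per sort column, when the directions should differ.

```python
import pandas as pd

flights_small = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [2, 1, 1, 2],
    "day": [5, 9, 3, 1],
})

# Every sort column in descending order.
all_descending = flights_small.sort_values(
    by=["year", "month", "day"], ascending=False
)

# Month ascending, but day descending within each month.
mixed = flights_small.sort_values(by=["month", "day"], ascending=[True, False])

print(all_descending)
print(mixed)
```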

By default, missing values are sorted at the end, whether the order is ascending or descending.
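To see this on a small example (the frame below is made up for illustration), sort the same column in both directions and note where the missing value lands:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [5, 2, np.nan]})

# Ascending: 2, 5, then the missing value.
ascending_order = df.sort_values("x")

# Descending: 5, 2, and the missing value is still last.
descending_order = df.sort_values("x", ascending=False)

print(ascending_order)
print(descending_order)
```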

Exercises

How could you use sort_values() to sort all missing values to the start? (Hint: use a function that tests for missing values.)

Sort flights to find the most delayed flights. Find the flights that left earliest.

Sort flights to find the fastest (highest speed) flights.

Which flights travelled the farthest? Which travelled the shortest?

Select columns with filter() or loc[]

It's not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you're actually interested in. .filter() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

Additionally, many pandas users select columns with .loc[]. You can read more about the .loc[] method in the pandas documentation.

filter() is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea:
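As a minimal sketch of column selection by name, the block below uses a tiny hand-made stand-in for the flights data. The values are made up for illustration and only a few of the real column names are included; the variable names by_name and without_dates are likewise only illustrative.

```python
import pandas as pd

# Toy stand-in for the flights data (made-up values, not the real dataset).
flights = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [1, 1, 9],
    "day": [1, 1, 30],
    "dep_time": [517.0, 533.0, None],
    "carrier": ["UA", "UA", "MQ"],
})

# Select columns by name; the result follows the order given in the list.
by_name = flights.filter(["year", "month", "day"])

# Keep every column except the ones listed.
without_dates = flights.drop(columns=["year", "day"])

print(by_name.columns.tolist())        # ['year', 'month', 'day']
print(without_dates.columns.tolist())  # ['month', 'dep_time', 'carrier']
```

Both calls return a new DataFrame and leave flights itself unchanged.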

loc[] works in a similar fashion.
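A short sketch of the same selections written with .loc[], again on a made-up stand-in for the flights data (the values and the names by_name and by_slice are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for the flights data (made-up values, not the real dataset).
flights = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [1, 1, 9],
    "day": [1, 1, 30],
    "dep_time": [517.0, 533.0, None],
    "carrier": ["UA", "UA", "MQ"],
})

# ":" keeps every row; the list picks columns by name.
by_name = flights.loc[:, ["year", "month", "day"]]

# Label slices include both endpoints, so this keeps year, month and day.
by_slice = flights.loc[:, "year":"day"]

print(by_slice.columns.tolist())  # ['year', 'month', 'day']
```

Unlike positional slicing in plain Python, a label slice in .loc[] includes its end label, which is why day is part of the result.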

Exercises

There are a number of regular expressions you can use within filter():

flights.filter(regex = '^sch'): matches column names that begin with "sch".

flights.filter(regex = 'time$'): matches names that end with "time".


flights.filter(regex = 'dep'): matches names that contain "dep".

flights.filter(regex = r'(.)\1'): selects variables that match a regular expression. This one matches any variable whose name contains a repeated character. You'll learn more about regular expressions in the chapter on strings.

See the pandas filter documentation for more details.
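The kinds of patterns described above (names that begin with "sch", end with "time", contain "dep", or contain a repeated character) can be tried out on a one-row stand-in frame. This is only a sketch: the row values are made up, and the exact pattern strings are assumptions chosen to match those descriptions.

```python
import pandas as pd

# One-row toy frame that only mimics some flights column names (made-up values).
flights = pd.DataFrame({
    "year": [2013],
    "dep_time": [517.0],
    "sched_dep_time": [515],
    "dep_delay": [2.0],
    "arr_time": [830.0],
    "sched_arr_time": [819],
    "arr_delay": [11.0],
    "carrier": ["UA"],
    "air_time": [227.0],
})

# Names that begin with "sch".
starts_sch = flights.filter(regex="^sch").columns.tolist()

# Names that end with "time".
ends_time = flights.filter(regex="time$").columns.tolist()

# Names that contain "dep" anywhere.
has_dep = flights.filter(regex="dep").columns.tolist()

# Names with the same character twice in a row (the "rr" in arr_time, carrier, ...).
repeated = flights.filter(regex=r"(.)\1").columns.tolist()

print(starts_sch)  # ['sched_dep_time', 'sched_arr_time']
print(repeated)    # ['arr_time', 'sched_arr_time', 'arr_delay', 'carrier']
```

For a DataFrame, filter() applies the pattern to column labels by default, so no axis argument is needed here.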

Use rename() to rename one or more columns.
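A minimal sketch of renaming, assuming a toy frame with made-up values; the new names YEAR and MONTH are arbitrary choices for illustration:

```python
import pandas as pd

# Toy stand-in for the flights data (made-up values, not the real dataset).
flights = pd.DataFrame({
    "year": [2013, 2013],
    "month": [1, 9],
    "day": [1, 30],
})

# Map old names to new names; columns not mentioned keep their names.
renamed = flights.rename(columns={"year": "YEAR", "month": "MONTH"})

print(renamed.columns.tolist())  # ['YEAR', 'MONTH', 'day']
print(flights.columns.tolist())  # ['year', 'month', 'day']
```

rename() returns a new DataFrame, so the result has to be assigned to a name if you want to keep it.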

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

What happens if you include the name of a variable multiple times in a filter() call?

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

We recommend the NumPy package for accessing the suite of mathematical functions you need. You would import NumPy with import numpy as np. There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:

Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later. For example, x / np.sum(x) calculates the proportion of a total, and y - np.mean(y) computes the difference from the mean.
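To make the two expressions concrete, here is a small sketch on a made-up Series; the names x, proportion and centered are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values chosen so the results are easy to check by hand.
x = pd.Series([10.0, 20.0, 30.0, 40.0])

# Each value divided by the total: the proportions add up to 1.
proportion = x / np.sum(x)

# Each value minus the mean (25.0 here): the results add up to 0.
centered = x - np.mean(x)

print(proportion.tolist())
print(centered.tolist())  # [-15.0, -5.0, 5.0, 15.0]
```

The aggregate (a single number) is combined with every element of the Series, which is what makes these one-line transformations possible.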

Modular arithmetic: // (integer division) and % (remainder), where x == y * (x // y) + (x % y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:
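A hedged sketch of that split, assuming dep_time stores clock times as numbers such as 517 for 5:17. The four values below are made up, and assign() with lambda functions is just one way to add the new columns.

```python
import pandas as pd

# Made-up departure times in HHMM form (for example 1459 means 14:59).
flights = pd.DataFrame({"dep_time": [517, 533, 1459, 2359]})

times = (
    flights
    .filter(["dep_time"])
    .assign(
        hour=lambda d: d.dep_time // 100,   # integer division drops the minutes
        minute=lambda d: d.dep_time % 100,  # remainder keeps only the minutes
    )
)

print(times.hour.tolist())    # [5, 5, 14, 23]
print(times.minute.tolist())  # [17, 33, 59, 59]
```

Multiplying hour by 100 and adding minute gives back the original dep_time, which is the identity x == y * (x // y) + (x % y) with y = 100.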

Logs: np.log(), np.log2(), np.log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive ones, a feature we'll come back to in modelling.

All else being equal, I recommend using np.log2() because it is easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale, and a difference of -1 corresponds to halving.
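The doubling and halving interpretation can be checked on a few made-up numbers. This is only an illustrative sketch; the name distance and its values are assumptions, not taken from the flights data.

```python
import numpy as np

# Made-up values: doubled, doubled again, then halved.
distance = np.array([100.0, 200.0, 400.0, 200.0])

log_distance = np.log2(distance)

# Differences between consecutive values on the log2 scale:
# roughly +1 for each doubling and -1 for the halving.
steps = np.diff(log_distance)
print(steps)

# Multiplication on the original scale becomes addition on the log scale.
print(np.isclose(np.log2(100.0 * 4.0), np.log2(100.0) + np.log2(4.0)))
```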

Offsets: shift(1) and shift(-1) allow you to refer to lagging or leading values. This allows you to compute running differences (e.g. x - x.shift(1)) or find when values change (x != x.shift(1)). They are most useful in conjunction with groupby(), which you'll learn about shortly.
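A small sketch of offsets on made-up Series; the names lagged, led, running_diff and changed are only illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 10))  # 1, 2, ..., 9

lagged = x.shift(1)   # previous value; the first entry becomes NaN
led = x.shift(-1)     # next value; the last entry becomes NaN

# Running difference between each value and the one before it.
running_diff = x - x.shift(1)

# Flag the positions where a value differs from the previous one.
y = pd.Series([1, 1, 2, 2, 2, 3])
changed = y != y.shift(1)

print(running_diff.tolist())
print(changed.tolist())  # [True, False, True, False, False, True]
```

The first element is always flagged as changed, because comparing it with the missing value introduced by shift(1) is never equal.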

Cumulative and rolling aggregates: pandas provides functions for running sums, products, minimums and maximums: cumsum(), cumprod(), cummin() and cummax(). If you need rolling aggregates (i.e. a sum computed over a rolling window), try rolling() in the pandas package.
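The following sketch shows the cumulative functions and a rolling window side by side. The series x is an arbitrary example, and the window size of 2 is chosen only for illustration.

```python
import pandas as pd

x = pd.Series(range(1, 11))  # 1, 2, ..., 10

running_sum = x.cumsum()     # 1, 3, 6, 10, ...
running_prod = x.cumprod()   # 1, 2, 6, 24, ...
running_min = x.cummin()     # stays at 1 because x is increasing
running_max = x.cummax()     # equals x because x is increasing

# Rolling mean over a window of two values; the first entry is NaN
# because the window is not yet full.
rolling_mean = x.rolling(2).mean()

print(pd.DataFrame({
    "x": x,
    "cumsum": running_sum,
    "rolling_mean": rolling_mean,
}))
```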

Logical comparisons, <, <=, >, >=, != and ==, which you learned about earlier. If you are doing a complex sequence of logical operations, it is often a good idea to store the interim values in new variables so you can check that each step works as expected.
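One way to follow that advice is sketched below. The frame flights_small is a hypothetical four-row stand-in for the flights table, and the thresholds are arbitrary.

```python
import pandas as pd

# Hypothetical miniature of the flights table.
flights_small = pd.DataFrame({
    "month": [1, 11, 12, 12],
    "dep_delay": [2.0, 130.0, -3.0, 45.0],
    "arr_delay": [11.0, 90.0, -10.0, 50.0],
})

# Store each condition in its own variable ...
is_late_year = flights_small["month"] >= 11
is_big_delay = flights_small["dep_delay"] > 60
made_up_time = flights_small["arr_delay"] < flights_small["dep_delay"]

# ... so that every step can be inspected before they are combined.
print(is_late_year.sum(), is_big_delay.sum(), made_up_time.sum())

selected = flights_small[is_late_year & is_big_delay & made_up_time]
print(selected)
```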

Ranking: there are a number of ranking functions, but you should start with rank(method='min'). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). By default the smallest values get the smallest ranks; use ascending=False to give the largest values the smallest ranks.
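A short sketch of both directions, assuming the ranking function meant here is the pandas rank() method with method="min". The series y is an invented example that contains a tie and a missing value.

```python
import numpy as np
import pandas as pd

y = pd.Series([1, 2, 2, np.nan, 3, 4])

# Smallest value gets rank 1; tied values share the lowest rank of the tie.
asc = y.rank(method="min")

# Largest value gets rank 1 instead.
desc = y.rank(method="min", ascending=False)

# Missing values stay missing in both results.
print(pd.DataFrame({"y": y, "asc": asc, "desc": desc}))
```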

If method='min' does not do what you need, look at the other values that the method parameter of rank() accepts: 'average', 'max', 'first' and 'dense'. See the rank help page for more details.
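To see how the tie-handling options differ, the sketch below ranks one small series several ways. The series is invented, and pct=True is shown as an additional option of rank() rather than something the text above names.

```python
import numpy as np
import pandas as pd

y = pd.Series([1, 2, 2, np.nan, 3, 4])

variants = pd.DataFrame({
    "y": y,
    "average": y.rank(method="average"),  # ties share the mean of their ranks
    "max": y.rank(method="max"),          # ties share the highest rank
    "first": y.rank(method="first"),      # ties broken by order of appearance
    "dense": y.rank(method="dense"),      # like min, but without gaps afterwards
    "pct": y.rank(pct=True),              # ranks rescaled to the interval (0, 1]
})
print(variants)
```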

Exercises

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they are not really continuous numbers. Convert them to a more convenient representation: the number of minutes since midnight.

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

Compare dep_time, sched_dep_time and dep_delay. How would you expect those three numbers to be related?

Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for method='min'.

What trigonometric functions does NumPy provide?

Grouped summaries or aggregations with agg()

The last key verb is agg(). It collapses a data frame to a single row.

(pandas aggregation functions ignore NaN values by default, much like na.rm = TRUE in R.)
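A minimal sketch of such a collapse, and of how missing values are skipped. The frame flights_small and its values are invented for illustration. With a dict of column names and function names, agg() returns one value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the flights table with one missing row.
flights_small = pd.DataFrame({
    "dep_delay": [2.0, np.nan, 10.0, 0.0],
    "arr_delay": [11.0, np.nan, 5.0, -1.0],
})

# Each column collapses to a single number; NaN rows are skipped.
summary = flights_small.agg({"dep_delay": "mean", "arr_delay": "mean"})
print(summary)

# Opting out of the skipping makes the missing value propagate.
strict_mean = flights_small["dep_delay"].mean(skipna=False)
print(strict_mean)
```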

agg() is not terribly useful unless we pair it with groupby(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the pandas functions on a grouped data frame, they are automatically applied by group. For example, if we apply exactly the same code to a data frame grouped by date, we get the average delay per date. Note that with groupby(), the aggregation is specified as a tuple that identifies the column (first entry) and the function to apply to that column (second entry). This is called named aggregation in pandas.
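The grouped version can be sketched as follows. The data are invented, and the output column name delay is a hypothetical choice. The keyword passed to agg() becomes the name of the result column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the flights table.
flights_small = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013, 2013],
    "month": [1, 1, 1, 2, 2],
    "day": [1, 1, 2, 1, 1],
    "dep_delay": [2.0, 4.0, np.nan, 10.0, 20.0],
})

# Group by date, then aggregate inside each group.
by_day = flights_small.groupby(["year", "month", "day"])

# Named aggregation: new_column=(source_column, function).
daily = by_day.agg(delay=("dep_delay", "mean"))
print(daily)
```

The result is indexed by the grouping columns, so daily carries a three-level MultiIndex of year, month and day.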

Note the use of reset_index() to undo the MultiIndex that pandas creates. You can read more about groupby() in the pandas user guide section Group by: split-apply-combine.

Together, .groupby() and .agg() provide one of the tools you will use most often when working with pandas: grouped summaries. But before we go any further with this, we need to introduce a structure for pandas code when doing data science work. We structure our code much like "the pipe", %>%, in the tidyverse packages from RStudio.

Combining multiple operations

Imagine that we want to explore the relationship between the distance and the average delay for each destination. Using what you know about pandas, you might write code like this:
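A minimal sketch of that step-by-step approach. It assumes a tiny, made-up stand-in for `flights` so that it runs on its own; the intermediate names (`by_dest`, `delay`, `delay_filter`) and the count threshold are illustrative, not taken from the original listing.

```python
import pandas as pd

# Made-up stand-in for the flights data: only the columns used below.
flights = pd.DataFrame({
    "dest": ["ATL", "ATL", "ATL", "HNL", "HNL", "ORD"],
    "distance": [762, 762, 762, 4983, 4983, 733],
    "arr_delay": [10.0, None, 20.0, -5.0, 5.0, 30.0],
})

# Step 1: group flights by destination.
by_dest = flights.groupby("dest")

# Step 2: summarise each group (number of flights, mean distance, mean delay).
delay = by_dest.agg(
    n=("distance", "size"),
    dist=("distance", "mean"),
    delay=("arr_delay", "mean"),
).reset_index()

# Step 3: drop small groups and Honolulu.
# The threshold is 1 only because the toy data is tiny.
delay_filter = delay.query("n > 1 and dest != 'HNL'")
print(delay_filter)
```

Every intermediate result needs its own name here, even though only the last one is of interest.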


There are three steps to prepare this data:

Group flights by destination.
Summarise to compute distance, average delay, and number of flights.
Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we do not care about it. Naming things is hard, so this slows down our analysis.

There is another way to tackle the same problem without the additional objects:
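One way this could look, as a sketch under the same assumptions as before (a made-up miniature `flights` table and an illustrative threshold): the whole computation is a single method chain wrapped in parentheses, so no intermediate names are needed.

```python
import pandas as pd

# Made-up stand-in for the flights data: only the columns used below.
flights = pd.DataFrame({
    "dest": ["ATL", "ATL", "ATL", "HNL", "HNL", "ORD"],
    "distance": [762, 762, 762, 4983, 4983, 733],
    "arr_delay": [10.0, None, 20.0, -5.0, 5.0, 30.0],
})

delays = (
    flights
    .groupby("dest")                        # group ...
    .agg(                                   # ... then summarise ...
        n=("distance", "size"),
        dist=("distance", "mean"),
        delay=("arr_delay", "mean"),
    )
    .reset_index()
    .query("n > 1 and dest != 'HNL'")       # ... then filter
)
print(delays)
```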


This focuses on the transformations, not on what is being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarise, then filter. As this reading suggests, a good way to pronounce "." when reading pandas code is "then".

You can use "()" with "." to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We will use this format frequently from now on because it considerably improves the readability of complex pandas code.

Missing values

You may have wondered about the NaN values we put into our pandas data frame above. Pandas introduced an experimental option (in version 1.0) for pd.NA, but it is not yet standard in the way NA is in the R language. You can read the full details in the pandas documentation on missing data.

Pandas and NumPy handle missing values with defaults that are the opposite of those in R and the tidyverse. Here are three key defaults when using pandas.

When summing data, NA (missing) values are treated as zero. If the data are all NA, the result is 0.

Cumulative methods ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include missing values, use skipna=False.

All .groupby() methods exclude missing values in their calculations, as described in the pandas groupby documentation.
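These defaults can be checked on a small example. The sketch below uses made-up values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Sums skip missing values; an all-NA sum is 0.
total = s.sum()
all_na_total = pd.Series([np.nan, np.nan]).sum()

# Cumulative methods skip NA in the running total but keep the NA in place.
running = s.cumsum()

# skipna=False makes the missing value propagate instead.
strict_total = s.sum(skipna=False)
strict_running = s.cumsum(skipna=False)

# Grouped aggregations also exclude missing values.
df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, np.nan, 3.0]})
group_means = df.groupby("g")["x"].mean()

print(total, all_na_total, strict_total)
print(running.tolist(), strict_running.tolist())
print(group_means)
```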

In our case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights. We will save this dataset so we can reuse it in the next few examples.
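A minimal sketch of that clean-up step, assuming a cancelled flight is one with a missing `dep_delay` or `arr_delay`. The name `not_cancelled` and the sample rows are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the flights data.
flights = pd.DataFrame({
    "tailnum": ["N1", "N2", "N3", "N4"],
    "dep_delay": [2.0, np.nan, 5.0, -1.0],
    "arr_delay": [11.0, np.nan, np.nan, -3.0],
})

# Keep only rows where both delay columns are present.
not_cancelled = flights.dropna(subset=["dep_delay", "arr_delay"])
print(not_cancelled)
```

Note that a flight which departed but has no recorded arrival delay (N3 above) is dropped as well under this rule.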


Counts

Whenever you do any aggregation, it is always a good idea to include either a count (size) or a count of non-missing values (count). That way you can check that you are not drawing conclusions based on very small amounts of data. For example, let's look at the planes (identified by their tail number) that have the highest average delays:
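As a sketch of this idea on made-up data, the aggregation below reports the mean arrival delay per tail number together with both kinds of count: `size` counts every row in the group, while `count` only counts non-missing values. The column names `n` and `n_valid` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the flights data.
flights = pd.DataFrame({
    "tailnum": ["N1", "N1", "N1", "N1", "N2", "N3", "N3"],
    "arr_delay": [10.0, -5.0, np.nan, 15.0, 300.0, 20.0, 40.0],
})

delays = (
    flights
    .groupby("tailnum")
    .agg(
        delay=("arr_delay", "mean"),     # mean skips missing values
        n=("arr_delay", "size"),         # all rows in the group
        n_valid=("arr_delay", "count"),  # non-missing rows only
    )
    .reset_index()
    .sort_values("delay", ascending=False)
)
print(delays)
```

In this toy data the plane with the highest average delay is also the one with a single flight, which is exactly the situation the counts are meant to expose.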


Wow, there are some planes that have an average delay of 5 hours (300 minutes)!

The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of the number of flights vs. the average delay:


Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you will see that the variation decreases as the sample size increases.

When looking at this sort of plot, it is often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for simple data frame manipulations that are only needed for a chart.
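A sketch of the filtering step only, with the plotting call left out. The summary table, the `MIN_FLIGHTS` threshold, and the name `plot_data` are all hypothetical; the point is that the filter can be written inline, right where the chart data is needed, without saving a separate data frame.

```python
import pandas as pd

# Made-up per-plane summary: mean delay and number of flights.
delays = pd.DataFrame({
    "tailnum": ["N1", "N2", "N3", "N4"],
    "delay": [300.0, -20.0, 6.5, 4.0],
    "n": [1, 2, 40, 120],
})

MIN_FLIGHTS = 25  # hypothetical cut-off for "enough" observations

# Filter inline with a callable, so the step slots into a method chain.
plot_data = delays.loc[lambda d: d["n"] > MIN_FLIGHTS]
print(plot_data)
```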


There is another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they are at bat. Here I use data from the Lahman package to compute the batting average (number of hits / number of attempts) of every major league baseball player.

When I plot the skill of the batter (measured by the batting average, ba) against the number of opportunities to hit the ball (measured by at bat, ab), you see two patterns:

As above, the variation in our aggregate decreases as we get more data points.

There is a positive correlation between skill (ba) and opportunities to hit the ball (ab). This is because teams control who gets to play, and obviously they will pick their best players.
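A sketch of how the two quantities could be computed. The column names `playerID`, `H`, and `AB` follow the Lahman Batting table, but the rows are made up and the result name `batters` is illustrative.

```python
import pandas as pd

# Made-up rows shaped like the Lahman Batting table (one row per player-season).
batting = pd.DataFrame({
    "playerID": ["a", "a", "b", "c", "c"],
    "H": [1, 0, 1, 150, 170],    # hits
    "AB": [2, 0, 1, 500, 520],   # at bats
})

batters = (
    batting
    .groupby("playerID")
    .agg(hits=("H", "sum"), ab=("AB", "sum"))
    .assign(ba=lambda d: d["hits"] / d["ab"])  # batting average = hits / at bats
    .reset_index()
)
print(batters)
```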


This also has important implications for ranking. If you naively sort by batting average in descending order, the people with the best batting averages are clearly lucky, not skilled:
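The effect is easy to reproduce on made-up numbers. In this sketch the `batters` table and its values are hypothetical; sorting by `ba` alone puts players with only one or two at bats at the top.

```python
import pandas as pd

# Made-up per-player summary: at bats and batting average.
batters = pd.DataFrame({
    "playerID": ["p1", "p2", "p3", "p4"],
    "ab": [1, 2, 600, 550],
    "ba": [1.0, 0.5, 0.31, 0.29],
})

# Naive ranking: highest batting average first.
naive = batters.sort_values("ba", ascending=False)
print(naive)
```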


You can find a good explanation of this problem at http://varianceExplained.org/r/empirical_bayes_baseball/ and http://www.evanmiller.org/how-not-to-sort-by-by

Useful summary functions

Just using means, counts, and sums can get you a long way, but NumPy, SciPy, and pandas provide many other useful summary functions (remember that we are using the SciPy stats submodule):

Measures of location: we have used mean, but median is also useful. The mean is the sum divided by the length; the median is a value where 50% of

#>         year  month  day  ...  hour  minute                 time_hour
#> 0       2013      1    1  ...     5      15 2013-01-01 10:00:00+00:00
#> 1       2013      1    1  ...     5      29 2013-01-01 10:00:00+00:00
#> 2       2013      1    1  ...     5      40 2013-01-01 10:00:00+00:00
#> 3       2013      1    1  ...     5      45 2013-01-01 10:00:00+00:00
#> 4       2013      1    1  ...     6       0 2013-01-01 11:00:00+00:00
#> ...      ...    ...  ...  ...   ...     ...                       ...
#> 336771  2013      9   30  ...    14      55 2013-09-30 18:00:00+00:00
#> 336772  2013      9   30  ...    22       0 2013-10-01 02:00:00+00:00
#> 336773  2013      9   30  ...    12      10 2013-09-30 16:00:00+00:00
#> 336774  2013      9   30  ...    11      59 2013-09-30 15:00:00+00:00
#> 336775  2013      9   30  ...     8      40 2013-09-30 12:00:00+00:00
#> 
#> [336776 rows x 19 columns]

11 ở trên nó và 50% thấp hơn nó.

It is sometimes useful to combine aggregation with logical subsetting. We have not yet talked about this kind of subsetting in detail, but you will learn more about it in the section on subsetting.
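As a minimal sketch of the two ideas above, the following compares the mean and the median per day and then averages only the positive delays. The small flights table here is a made-up stand-in (its values are not from the real dataset), and the column names day and arr_delay are assumed to match the flights data used in this chapter.

```python
import pandas as pd

# Toy stand-in for the flights table; the values are invented for illustration.
flights = pd.DataFrame({
    "day": [1, 1, 1, 2, 2, 2],
    "arr_delay": [-10.0, 20.0, 50.0, -5.0, 5.0, 300.0],
})

by_day = flights.groupby("day")["arr_delay"]

# Logical subsetting: keep only rows with a positive delay before aggregating.
positive = flights[flights["arr_delay"] > 0]

summary = pd.DataFrame({
    "mean_delay": by_day.mean(),
    "median_delay": by_day.median(),
    "mean_pos_delay": positive.groupby("day")["arr_delay"].mean(),
})
print(summary)
```

On day 2 of this toy data the single 300-minute delay pulls the mean far above the median, which is why looking at both is worthwhile.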


Measures of spread: the standard deviation, the interquartile range, and the median absolute deviation. The root mean squared deviation, or standard deviation, is the standard measure of spread. The interquartile range and the median absolute deviation are robust equivalents that may be more useful if you have outliers.
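The three measures of spread can be sketched as follows. To keep the example self-contained, the interquartile range and the median absolute deviation are computed directly with pandas rather than with the SciPy stats submodule; the helper names iqr and mad, and the toy data, are assumptions made for illustration.

```python
import pandas as pd

# Toy data: destination BBB contains one extreme outlier (1000).
flights = pd.DataFrame({
    "dest": ["AAA"] * 5 + ["BBB"] * 5,
    "distance": [100, 100, 100, 100, 100, 10, 20, 30, 40, 1000],
})

def iqr(s):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

def mad(s):
    """Median absolute deviation: median of |value - median|."""
    return (s - s.median()).abs().median()

# Named aggregation: one output column per measure of spread.
spread = flights.groupby("dest")["distance"].agg(sd="std", iqr=iqr, mad=mad)
print(spread)
```

For BBB the outlier inflates the standard deviation to several hundred, while the two robust measures stay close to the spread of the other four values.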


Measures of rank: the minimum, a quantile, and the maximum. Quantiles are a generalization of the median. For example, the 0.25 quantile is a value that is greater than 25% of the values and less than the remaining 75%.
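A short sketch of these rank measures, using invented departure times and assuming the column names day and dep_time from the flights data. The helper q25 is a hypothetical name; it relies on the pandas quantile method with its default linear interpolation.

```python
import pandas as pd

# Toy departure times in HHMM form; the values are invented for illustration.
flights = pd.DataFrame({
    "day": [1] * 5 + [2] * 5,
    "dep_time": [500, 600, 700, 800, 900, 515, 530, 545, 600, 2300],
})

def q25(s):
    """First quartile: larger than 25% of the values in the group."""
    return s.quantile(0.25)

ranks = flights.groupby("day")["dep_time"].agg(
    earliest="min",
    first_quartile=q25,
    latest="max",
)
print(ranks)
```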


Measures of position: the first, nth, and last values. These work like positional indexing (taking the first, second, or last element) but let you set a default value if that position does not exist (i.e. you tried to get the third element from a group that has only two elements). For example, we can find the first and last departure for each day.
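One way to sketch the first and last departure per day is shown below, on invented data. The pandas aggregations named first and last do not themselves take a default value, so the default-handling case is illustrated with a small hypothetical helper, nth_or_default, which is not part of pandas.

```python
import pandas as pd

# Toy data: day 1 has three departures, day 2 has only two.
flights = pd.DataFrame({
    "day": [1, 1, 1, 2, 2],
    "dep_time": [2356.0, 517.0, 533.0, 2354.0, 42.0],
})

# Sort first so that "first" and "last" mean earliest and latest departure.
by_day = flights.sort_values("dep_time").groupby("day")["dep_time"]

first_last = by_day.agg(first_dep="first", last_dep="last")
print(first_last)

def nth_or_default(s, n, default):
    """Return the element at position n, or default if the group is too short."""
    return s.iloc[n] if len(s) > n else default

# Third departure of each day, falling back to -1.0 when there is none.
third = by_day.apply(lambda s: nth_or_default(s, 2, default=-1.0))
print(third)
```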


Counts: You have seen size(), which takes no arguments and returns the size of the current group. To count the number of non-missing values, use count(). To count the number of unique (distinct) values, use nunique().
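The three kinds of count can be compared side by side. This is a sketch on invented data, assuming the column names dest, carrier, and dep_delay from the flights table; it uses the pandas group methods size, count, and nunique.

```python
import pandas as pd

# Toy data with two missing departure delays.
flights = pd.DataFrame({
    "dest": ["ATL", "ATL", "ATL", "BOS", "BOS"],
    "carrier": ["DL", "DL", "FL", "B6", "AA"],
    "dep_delay": [5.0, float("nan"), 12.0, float("nan"), -3.0],
})

by_dest = flights.groupby("dest")

counts = pd.DataFrame({
    "n": by_dest.size(),                         # rows per group, missing or not
    "n_non_missing": by_dest["dep_delay"].count(),  # non-missing delays only
    "n_carriers": by_dest["carrier"].nunique(),     # distinct carriers
})
print(counts)
```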


Counts are so useful that pandas provides a simple helper if all you want is a count.
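The helper is not named in the text as it survives here; one pandas method that fits the description is value_counts, so the sketch below assumes that is what is meant. The data is invented.

```python
import pandas as pd

# Toy data: three flights to ATL and two to BOS.
flights = pd.DataFrame({
    "dest": ["ATL", "BOS", "ATL", "ATL", "BOS"],
})

# Count rows per destination, most frequent first.
dest_counts = flights["dest"].value_counts()
print(dest_counts)
```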


Counts and proportions of logical values: the sum and the mean of a Boolean condition. When used with numeric functions, True is converted to 1 and False to 0. This makes the sum and the mean very useful: the sum gives the number of True values in x, and

#> year                            int64
#> month                           int64
#> day                             int64
#> dep_time                      float64
#> sched_dep_time                  int64
#> dep_delay                     float64
#> arr_time                      float64
#> sched_arr_time                  int64
#> arr_delay                     float64
#> carrier                        object
#> flight                          int64
#> tailnum                        object
#> origin                         object
#> dest                           object
#> air_time                      float64
#> distance                        int64
#> hour                            int64
#> minute                          int64
#> time_hour         datetime64[ns, UTC]
#> dtype: object

86 đưa ra tỷ lệ.
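A minimal sketch of this idea, assuming the sentence above refers to pandas' `sum()` and `mean()` applied to a boolean Series (an assumption, since the original inline code did not survive). The `delays` values are made up for illustration.

```python
import pandas as pd

# Hypothetical arrival delays in minutes (illustrative values only).
delays = pd.Series([-5, 12, 0, 45, -2, 30])

# A comparison yields a boolean Series.
is_late = delays > 10

# Summing booleans counts the True values; their mean is the proportion.
n_late = is_late.sum()
prop_late = is_late.mean()

print(n_late)     # number of delays above 10 minutes
print(prop_late)  # share of delays above 10 minutes
```

The same pattern works inside a grouped aggregation, where it yields one count or proportion per group.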


Grouping by multiple variables

Be careful when progressively rolling up summaries: it is fine for sums and counts, but you need to think about weighting for means and variances, and it cannot be done exactly for rank-based statistics such as the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
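The warning above can be checked on a tiny example. This is only a sketch with made-up numbers: the frame `df` and its column names are hypothetical, and the two groups are deliberately unequal in size so that the effect is visible.

```python
import pandas as pd

# Hypothetical data: two groups of unequal size (illustrative values only).
df = pd.DataFrame({
    "group": ["a", "a", "a", "b"],
    "value": [1, 2, 3, 100],
})

per_group = df.groupby("group")["value"]

# Sums roll up exactly: the sum of the group sums is the overall sum.
sum_of_sums = per_group.sum().sum()
overall_sum = df["value"].sum()

# Means need weighting: an unweighted mean of group means ignores group size.
mean_of_means = per_group.mean().mean()
overall_mean = df["value"].mean()

# Medians cannot be rolled up at all from the group medians alone.
median_of_medians = per_group.median().median()
overall_median = df["value"].median()

print(sum_of_sums, overall_sum)
print(mean_of_means, overall_mean)
print(median_of_medians, overall_median)
```

Only the first pair of numbers agrees; the mean and median pairs differ because group `b` holds a single, very large value.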

Ungrouping (resetting the index)

If you need to remove the grouping and the multi-level index, use […]. This is a rough equivalent to […] in R, but it is not the same thing. Notice that the column names are no longer in multiple levels.
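A minimal sketch of ungrouping, assuming the call meant here is pandas' `reset_index()`, as the section heading suggests (the original inline code did not survive). The frame `flights_small` and the column `mean_delay` are hypothetical names with made-up values.

```python
import pandas as pd

# Hypothetical miniature of the flights data (illustrative values only).
flights_small = pd.DataFrame({
    "month": [1, 1, 1, 2],
    "day": [1, 1, 2, 1],
    "arr_delay": [10.0, 20.0, 5.0, -3.0],
})

# After grouping by two keys, the keys form a two-level MultiIndex.
daily = flights_small.groupby(["month", "day"]).agg(
    mean_delay=("arr_delay", "mean")
)
print(daily.index.nlevels)

# reset_index() moves the index levels back into ordinary columns.
flat = daily.reset_index()
print(list(flat.columns))
```

After the reset, `month` and `day` can be used like any other column, for example in `query()` or in a further `groupby()`.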


Exercises

Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it is 2 hours late.
Which is more important: arrival delay or departure delay?

Our definition of cancelled flights ([…]) is slightly suboptimal. Why? Which is the most important column?

Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports from those of bad carriers? Why or why not? (Hint: think about […])

Grouped transformations (and filters)

Grouping is most useful in conjunction with […], but you can also do convenient operations with […]. This is a difference between pandas and dplyr. Once you create a […] object, you cannot use […], and the best equivalent is […]. Following the pandas GroupBy guide on "split-apply-combine", we assign the transformed variables to our data frame and then perform the filters on the full data frame.
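A minimal sketch of this split-apply-combine pattern, assuming the calls meant above are pandas' `groupby()`, `agg()`, and `transform()` (an assumption, since the original inline code did not survive). The frame `flights_small` and the column `dest_mean_delay` are hypothetical, with made-up values.

```python
import pandas as pd

# Hypothetical miniature of the flights data (illustrative values only).
flights_small = pd.DataFrame({
    "dest": ["ATL", "ATL", "BOS", "BOS", "BOS"],
    "arr_delay": [10.0, 30.0, 0.0, 6.0, 12.0],
})

by_dest = flights_small.groupby("dest")["arr_delay"]

# An aggregation collapses each group to a single row.
mean_per_dest = by_dest.agg("mean")

# A transform broadcasts the group result back onto every original row,
# so it can be assigned as a new column on the full data frame.
flights_small["dest_mean_delay"] = by_dest.transform("mean")

# The filter then runs on the full frame, using the new column.
late_vs_dest = flights_small[
    flights_small["arr_delay"] > flights_small["dest_mean_delay"]
]
print(late_vs_dest)
```

The aggregated result has one row per destination, while the transformed column has one value per flight, which is what makes the row-wise comparison possible.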

- Find the worst members of each group.
- Find all groups bigger than a threshold.
- Standardise to compute per-group metrics.
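The three tasks above can be sketched as follows. This is not the original code, which did not survive; it is one possible approach using `rank()` and `transform()` on a grouped column. The frame `flights_small`, the new column names, and the thresholds are all hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the flights data (illustrative values only).
flights_small = pd.DataFrame({
    "dest": ["ATL", "ATL", "ATL", "BOS", "BOS", "ORD"],
    "arr_delay": [10.0, 50.0, 40.0, 5.0, 15.0, 60.0],
})

# 1. Worst members of each group: rank delays within each destination
#    (rank 1 = largest delay) and keep the two largest per destination.
flights_small["delay_rank"] = (
    flights_small.groupby("dest")["arr_delay"].rank(ascending=False)
)
worst = flights_small.query("delay_rank <= 2")

# 2. Groups bigger than a threshold: attach the group size to every row,
#    then filter on it.
flights_small["n"] = flights_small.groupby("dest")["arr_delay"].transform("size")
popular = flights_small.query("n >= 2")

# 3. Per-group metric: each flight's share of its destination's total delay.
dest_total = flights_small.groupby("dest")["arr_delay"].transform("sum")
flights_small["prop_delay"] = flights_small["arr_delay"] / dest_total

print(worst)
print(popular)
print(flights_small)
```

In each case the grouped result is first written back to the full frame as a column, and only then filtered, which mirrors the assign-then-filter approach described in the text.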

Exercises

Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Explore how the delay of a flight is related to the delay of the immediately preceding flight.
Look at each destination. Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)? Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
For each plane, count the number of flights before the first delay of greater than 1 hour.

What is data transformation? Explained with an example

As the term implies, data transformation means taking data stored in one format and converting it to another. As a computer end-user, you probably perform basic data transformations on a routine basis. When you convert a Microsoft Word file to a PDF, for example, you are transforming data.
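As a small illustration of converting data from one format to another, the sketch below reads CSV text and writes the same records out as JSON with pandas. The CSV content and the column names are made up for the example.

```python
import io
import json

import pandas as pd

# Hypothetical CSV content (illustrative values only).
csv_text = "carrier,flights\nAA,3\nUA,5\n"

# Read the CSV-formatted text into a DataFrame...
df = pd.read_csv(io.StringIO(csv_text))

# ...and write the same records back out as JSON.
json_text = df.to_json(orient="records")
records = json.loads(json_text)

print(json_text)
```

The content of the data is unchanged; only its representation differs.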

What are the methods of data transformation?

In data mining, data transformation is used to combine unstructured data with structured data so that it can be analyzed later. It is also important when data is moved to a new cloud data warehouse. When the data is homogeneous and well structured, it is easier to analyze and to look for patterns.

Can Python be used for data transformation?

What is the transform function in Python? The pandas `transform` method returns a new DataFrame with transformed values after applying the function passed as its argument. The result has the same length as the DataFrame it was called on.
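A minimal sketch of that behaviour, using pandas' `DataFrame.transform` with a made-up frame. The column names and the scaling function are illustrative choices, not taken from the original.

```python
import pandas as pd

# Hypothetical data (illustrative values only).
df = pd.DataFrame({
    "distance": [100, 400, 900],
    "air_time": [30, 60, 120],
})

# Apply the same function to every column: scale each by its maximum.
# The result keeps the same rows and index as the input.
scaled = df.transform(lambda col: col / col.max())

print(scaled)
```

Because the output has one row per input row, it can be joined back or assigned column by column without any realignment.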

What is data transformation in machine learning?

17 October 2022. Data transformation is defined as the technical process of converting data from one format, standard, or structure to another, without changing the content of the datasets, typically to prepare it for consumption by an app or a user, or to improve the data quality.
