What is OPENROWSET in SQL Server and the 65001 error (2024)

While running a script in a sqlproj using Visual Studio 17.9 Preview 1, I get the error "Incorrect syntax near '65001'", while the same query runs successfully in SSMS 19.

How to use OPENROWSET using serverless SQL pool in Azure Synapse Analytics


The OPENROWSET(BULK...) function allows you to access files in Azure Storage. It reads the content of a remote data source (for example, a file) and returns the content as a set of rows. Within the serverless SQL pool resource, the OPENROWSET bulk rowset provider is accessed by calling the OPENROWSET function and specifying the BULK option.

The OPENROWSET function can be referenced in the FROM clause of a query as if it were a table name. It supports bulk operations through a built-in BULK provider that enables data from a file to be read and returned as a rowset.

Note

The OPENROWSET function is not supported in dedicated SQL pool.

Data source

The OPENROWSET function in Synapse SQL reads the content of the file(s) from a data source. The data source is an Azure storage account, and it can be explicitly referenced in the OPENROWSET function or dynamically inferred from the URL of the files that you want to read. The OPENROWSET function can optionally contain a DATA_SOURCE parameter to specify the data source that contains the files.

  • OPENROWSET without DATA_SOURCE can be used to directly read the contents of the files from the URL location specified as the BULK option:

    SELECT *
    FROM OPENROWSET(BULK 'http://<storage account>.dfs.core.windows.net/container/folder/*.parquet',
                    FORMAT = 'PARQUET') AS [file]

This is a quick and easy way to read the content of the files without pre-configuration. This option enables you to use the basic authentication option to access the storage (Microsoft Entra passthrough for Microsoft Entra logins and SAS token for SQL logins).

  • OPENROWSET with DATA_SOURCE can be used to access files on a specified storage account:

    SELECT *
    FROM OPENROWSET(BULK '/folder/*.parquet',
                    DATA_SOURCE='storage', --> Root URL is in LOCATION of DATA SOURCE
                    FORMAT = 'PARQUET') AS [file]

    This option enables you to configure the location of the storage account in the data source and specify the authentication method that should be used to access storage (a sketch of defining such a data source follows this list).

    Important

    OPENROWSET without DATA_SOURCE provides a quick and easy way to access the storage files, but it offers limited authentication options. For example, Microsoft Entra principals can access files only using their Microsoft Entra identity or publicly available files. If you need more powerful authentication options, use the DATA_SOURCE option and define the credential that you want to use to access storage.
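For reference, a minimal sketch of how the 'storage' data source referenced in the example above might be defined (the account URL and container name are illustrative, not values from the article):

CREATE EXTERNAL DATA SOURCE storage
WITH (
    -- Root URL of the storage container; relative BULK paths are resolved against it
    LOCATION = 'https://contosostorage.dfs.core.windows.net/mycontainer'
);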

Security

A database user must have ADMINISTER BULK OPERATIONS permission to use the OPENROWSET function.
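As a sketch, that permission might be granted like this (the principal name is illustrative; depending on the environment, the database-scoped variant ADMINISTER DATABASE BULK OPERATIONS may be the applicable permission instead):

-- grant the bulk-operations permission to a hypothetical user
GRANT ADMINISTER BULK OPERATIONS TO [bulk_reader];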

The storage administrator must also enable a user to access the files by providing a valid SAS token or by enabling the Microsoft Entra principal to access storage files. Learn more in the article on storage access control.

OPENROWSET uses the following rules to determine how to authenticate to storage:

  • In OPENROWSET without DATA_SOURCE, the authentication mechanism depends on the caller type.

    • Any user can use OPENROWSET without DATA_SOURCE to read publicly available files on Azure storage.
    • Microsoft Entra logins can access protected files using their own Microsoft Entra identity if Azure storage allows the Microsoft Entra user to access the underlying files (for example, if the caller has the Storage Blob Data Reader permission on Azure storage).
    • SQL logins can also use OPENROWSET without DATA_SOURCE to access publicly available files, files protected with a SAS token, or the Managed Identity of the Synapse workspace. You would need to create a server-scoped credential to allow access to storage files.

  • In OPENROWSET with DATA_SOURCE, the authentication mechanism is defined in the database scoped credential assigned to the referenced data source. This option enables you to access publicly available storage, or access storage using a SAS token, the Managed Identity of the workspace, or the Microsoft Entra identity of the caller (if the caller is a Microsoft Entra principal). If DATA_SOURCE references Azure storage that isn't public, you would need to create a database scoped credential and reference it in the data source to allow access to storage files (see the sketch below).

The caller must have REFERENCES permission on the credential to use it to authenticate to storage.
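A minimal sketch of the DATA_SOURCE-with-credential path, assuming SAS authentication and illustrative names (a database master key must already exist before creating the credential):

-- credential holding an illustrative SAS token
CREATE DATABASE SCOPED CREDENTIAL sas_credential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = 'sv=2022-11-02&ss=b&srt=co&sp=rl&sig=<signature>';

-- data source that points to a non-public container and uses the credential
CREATE EXTERNAL DATA SOURCE secured_storage
WITH (
    LOCATION = 'https://contosostorage.dfs.core.windows.net/private',
    CREDENTIAL = sas_credential
);

SELECT TOP 10 *
FROM OPENROWSET(BULK '/folder/*.parquet',
                DATA_SOURCE = 'secured_storage',
                FORMAT = 'PARQUET') AS [file];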

Syntax

--OPENROWSET syntax for reading Parquet or Delta Lake files
OPENROWSET
( { BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ]
    FORMAT= ['PARQUET' | 'DELTA'] }
)
[WITH ( {'column_name' 'column_type' }) ]
[AS] table_alias(column_alias,...n)

--OPENROWSET syntax for reading delimited text files
OPENROWSET
( { BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ]
    FORMAT = 'CSV'
    [ <bulk_options> ]
    [ , <reject_options> ] }
)
WITH ( {'column_name' 'column_type' [ 'column_ordinal' | 'json_path'] })
[AS] table_alias(column_alias,...n)

<bulk_options> ::=
[ , FIELDTERMINATOR = 'char' ]
[ , ROWTERMINATOR = 'char' ]
[ , ESCAPECHAR = 'char' ]
[ , FIRSTROW = 'first_row' ]
[ , FIELDQUOTE = 'quote_characters' ]
[ , DATA_COMPRESSION = 'data_compression_method' ]
[ , PARSER_VERSION = 'parser_version' ]
[ , HEADER_ROW = { TRUE | FALSE } ]
[ , DATAFILETYPE = { 'char' | 'widechar' } ]
[ , CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
[ , ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}' ]

<reject_options> ::=
{
    | MAXERRORS = reject_value,
    | ERRORFILE_DATA_SOURCE = <data source name>,
    | ERRORFILE_LOCATION = '/REJECT_Directory'
}

Arguments

You have three choices for input files that contain the target data for querying. Valid values are:

  • 'CSV' - Includes any delimited text file with row/column separators. Any character can be used as a field separator, such as TSV: FIELDTERMINATOR = tab (see the sketch after this list).
  • 'PARQUET' - Binary file in Parquet format
  • 'DELTA' - A set of Parquet files organized in Delta Lake (preview) format

Values with blank spaces are not valid, e.g. 'CSV ' is not a valid value.
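As an illustration of the TSV case, a tab-separated file could be read with a sketch like the following (the path and data source name are assumptions, not values from the article):

SELECT *
FROM OPENROWSET(
    BULK '/folder/data.tsv',          -- illustrative relative path
    DATA_SOURCE = 'storage',          -- assumes an existing external data source
    FORMAT = 'CSV',                   -- TSV is read as delimited text
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = '\t'            -- tab as the field separator
) AS [rows];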

'unstructured_data_path'

The unstructured_data_path that establishes a path to the data may be an absolute or relative path:

  • Absolute path in the format <prefix>://<storage_account_path>/<storage_path> enables a user to directly read the files.
  • Relative path in the format <storage_path> that must be used with the DATA_SOURCE parameter and describes the file pattern within the location defined in <storage_account_path>.

Below you'll find the relevant values that will link to your particular external data source.

External Data Source       | Prefix  | Storage account path
---------------------------|---------|---------------------------------------------------------------
Azure Blob Storage         | http[s] | <storage_account>.blob.core.windows.net/path/file
Azure Blob Storage         | wasb[s] | <container>@<storage_account>.blob.core.windows.net/path/file
Azure Data Lake Store Gen1 | http[s] | <storage_account>.azuredatalakestore.net/webhdfs/v1
Azure Data Lake Store Gen2 | http[s] | <storage_account>.dfs.core.windows.net/path/file
Azure Data Lake Store Gen2 | abfs[s] | <file_system>@<account_name>.dfs.core.windows.net/path/file

'<storage_path>'

Specifies a path within your storage that points to the folder or file you want to read. If the path points to a container or folder, all files will be read from that particular container or folder. Files in subfolders won't be included.

You can use wildcards to target multiple files or folders. Usage of multiple nonconsecutive wildcards is allowed. Below is an example that reads all csv files starting with population from all folders starting with /csv/population:

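A plausible shape for that example (the storage URL is illustrative; the wildcard pattern is the point):

SELECT *
FROM OPENROWSET(
    BULK 'https://sqlondemandstorage.blob.core.windows.net/csv/population*/population*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [rows];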

If you specify the unstructured_data_path to be a folder, a serverless SQL pool query will retrieve files from that folder.

You can instruct serverless SQL pool to traverse folders by specifying /** at the end of the path, as in the following example:
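A sketch of what that traversal example could look like (again with an illustrative storage URL):

SELECT *
FROM OPENROWSET(
    BULK 'https://sqlondemandstorage.blob.core.windows.net/csv/population/**',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [rows];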

Note

Unlike Hadoop and PolyBase, serverless SQL pool doesn't return subfolders unless you specify /** at the end of path. Just like Hadoop and PolyBase, it doesn't return files for which the file name begins with an underline (_) or a period (.).

In the example below, if the unstructured_data_path points to a folder, a serverless SQL pool query will return rows from mydata.txt. It won't return mydata2.txt and mydata3.txt because they're located in a subfolder.

[Figure: example folder layout with mydata.txt in the queried folder and mydata2.txt and mydata3.txt in a subfolder]

WITH ( {'column_name' 'column_type' [ 'column_ordinal' | 'json_path'] })

The WITH clause allows you to specify columns that you want to read from files.

  • For CSV data files, to read all the columns, provide column names and their data types. If you want a subset of columns, use ordinal numbers to pick the columns from the originating data files by ordinal. Columns will be bound by the ordinal designation. If HEADER_ROW = TRUE is used, then column binding is done by column name instead of ordinal position.

    Tip: You can also omit the WITH clause for CSV files. Data types will be automatically inferred from the file content. You can use the HEADER_ROW argument to specify the existence of a header row, in which case column names will be read from the header row. For details, see Automatic schema discovery below.
  • For Parquet or Delta Lake files, provide column names that match the column names in the originating data files. Columns will be bound by name, and the matching is case-sensitive. If the WITH clause is omitted, all columns from the Parquet files will be returned.

    Important: Column names in Parquet and Delta Lake files are case-sensitive. If you specify a column name with casing different from the column name casing in the files, NULL values will be returned for that column.

column_name = Name for the output column. If provided, this name overrides the column name in the source file, and the column name provided in the JSON path if there is one. If json_path is not provided, it is automatically added as '$.column_name'. Check the json_path argument for behavior.

column_type = Data type for the output column. The implicit data type conversion will take place here.

column_ordinal = Ordinal number of the column in the source file(s). This argument is ignored for Parquet files, since binding is done by name. The following example returns only the second column from a CSV file:

WITH (
    --[country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
    [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2 2
    --[year] smallint,
    --[population] bigint
)

json_path = JSON path expression to a column or nested property. The default path mode is lax.

Note

In strict mode, the query will fail with an error if the provided path does not exist. In lax mode, the query will succeed and the JSON path expression will evaluate to NULL.

FIELDTERMINATOR = 'field_terminator'

Specifies the field terminator to be used. The default field terminator is a comma (",").

ROWTERMINATOR = 'row_terminator'

Specifies the row terminator to be used. If a row terminator is not specified, one of the default terminators will be used. Default terminators for PARSER_VERSION = '1.0' are \r\n, \n and \r. Default terminators for PARSER_VERSION = '2.0' are \r\n and \n.

Note

When you use PARSER_VERSION='1.0' and specify \n (newline) as the row terminator, it will be automatically prefixed with a \r (carriage return) character, which results in a row terminator of \r\n.

ESCAPECHAR = 'char'

Specifies the character in the file that is used to escape itself and all delimiter values in the file. If the escape character is followed by a value other than itself, or any of the delimiter values, the escape character is dropped when reading the value.

The ESCAPECHAR parameter is applied regardless of whether FIELDQUOTE is enabled. It won't be used to escape the quoting character; the quoting character must be escaped with another quoting character. A quoting character can appear within a column value only if the value is enclosed in quoting characters.

FIRSTROW = 'first_row'

Specifies the number of the first row to load. The default is 1 and indicates the first row in the specified data file. The row numbers are determined by counting the row terminators. FIRSTROW is 1-based.

FIELDQUOTE = 'field_quote'

Specifies a character that will be used as the quote character in the CSV file. If not specified, the quote character (") will be used.

DATA_COMPRESSION = 'data_compression_method'

Specifies compression method. Supported in PARSER_VERSION='1.0' only. Following compression method is supported:

  • GZIP

PARSER_VERSION = 'parser_version'

Specifies parser version to be used when reading files. Currently supported CSV parser versions are 1.0 and 2.0:

  • PARSER_VERSION = '1.0'
  • PARSER_VERSION = '2.0'

CSV parser version 1.0 is the default and feature rich. Version 2.0 is built for performance and does not support all options and encodings.

CSV parser version 1.0 specifics:

  • Following options aren't supported: HEADER_ROW.
  • Default terminators are \r\n, \n and \r.
  • If you specify \n (newline) as the row terminator, it will be automatically prefixed with a \r (carriage return) character, which results in a row terminator of \r\n.

CSV parser version 2.0 specifics:

  • Not all data types are supported.
  • Maximum character column length is 8000.
  • Maximum row size limit is 8 MB.
  • Following options aren't supported: DATA_COMPRESSION.
  • Quoted empty string ("") is interpreted as empty string.
  • DATEFORMAT SET option is not honored.
  • Supported format for DATE data type: YYYY-MM-DD
  • Supported format for TIME data type: HH:MM:SS[.fractional seconds]
  • Supported format for DATETIME2 data type: YYYY-MM-DD HH:MM:SS[.fractional seconds]
  • Default terminators are \r\n and \n.

HEADER_ROW = { TRUE | FALSE }

Specifies whether a CSV file contains a header row. The default is FALSE. Supported in PARSER_VERSION='2.0'. If TRUE, the column names will be read from the first row according to the FIRSTROW argument. If TRUE and the schema is specified using WITH, binding of column names will be done by column name, not ordinal positions.

DATAFILETYPE = { 'char' | 'widechar' }

Specifies the encoding: char is used for UTF8 files, widechar is used for UTF16 files.

CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' }

Specifies the code page of the data in the data file. The default value is 65001 (UTF-8 encoding).

ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'

This option disables the file modification check during query execution and reads files that are updated while the query is running. This is a useful option when you need to read append-only files that are appended while the query is running. In appendable files, the existing content is not updated and only new rows are added, so the probability of wrong results is minimized compared to updateable files. This option might enable you to read frequently appended files without handling errors.
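A sketch of how the option is passed (the path and data source name are illustrative):

SELECT *
FROM OPENROWSET(
    BULK '/logs/*.csv',               -- illustrative append-only files
    DATA_SOURCE = 'storage',          -- assumes an existing external data source
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) AS [rows];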

Reject Options

Note

The rejected rows feature is in public preview. Note that the rejected rows feature works for delimited text files and PARSER_VERSION 1.0.

You can specify reject parameters that determine how the service will handle dirty records it retrieves from the external data source. A data record is considered 'dirty' if actual data types don't match the column definitions of the external table.

When you don't specify or change reject options, the service uses default values. The service will use the reject options to determine the number of rows that can be rejected before the actual query fails. The query will return (partial) results until the reject threshold is exceeded. It then fails with the appropriate error message.

MAXERRORS = reject_value

Specifies the number of rows that can be rejected before the query fails. MAXERRORS must be an integer between 0 and 2,147,483,647.

ERRORFILE_DATA_SOURCE = data source

Specifies the data source where rejected rows and the corresponding error file should be written.

ERRORFILE_LOCATION = Directory Location

Specifies the directory within the DATA_SOURCE, or ERRORFILE_DATA_SOURCE if specified, where the rejected rows and the corresponding error file should be written. If the specified path doesn't exist, the service will create one on your behalf. A child directory is created with the name "_rejectedrows". The "_" character ensures that the directory is escaped for other data processing unless explicitly named in the location parameter. Within this directory, there's a folder created based on the time of load submission in the format YearMonthDay_HourMinuteSecond_StatementID (for example, 20180330-173205-559EE7D2-196D-400A-806D-3BF5D007F891). You can use the statement id to correlate the folder with the query that generated it. In this folder, two files are written: the error.json file and the data file.

The error.json file contains a JSON array with the encountered errors related to rejected rows. Each element representing an error contains the following attributes:

Attribute | Description
----------|------------
Error     | Reason why the row was rejected.
Row       | Rejected row ordinal number in the file.
Column    | Rejected column ordinal number.
Value     | Rejected column value. If the value is larger than 100 characters, only the first 100 characters are displayed.
File      | Path to the file that the row belongs to.
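A minimal sketch of the reject options in use (the paths, data source name, and column list are assumptions):

SELECT *
FROM OPENROWSET(
    BULK '/csv/dirty-data/*.csv',        -- illustrative relative path
    DATA_SOURCE = 'storage',             -- assumes an existing external data source
    FORMAT = 'CSV',
    PARSER_VERSION = '1.0',              -- rejected rows require parser version 1.0
    MAXERRORS = 10,                      -- tolerate up to 10 dirty rows
    ERRORFILE_LOCATION = '/rejected'     -- rejected rows are written under this directory
)
WITH (
    [id] INT 1,
    [amount] DECIMAL(10, 2) 2
) AS [rows];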

Fast delimited text parsing

There are two delimited text parser versions you can use. CSV parser version 1.0 is the default and feature rich, while parser version 2.0 is built for performance. The performance improvement in parser 2.0 comes from advanced parsing techniques and multi-threading. The difference in speed grows with the file size.

Automatic schema discovery

You can easily query both CSV and Parquet files without knowing or specifying the schema by omitting the WITH clause. Column names and data types will be inferred from the files.

Parquet files contain column metadata, which will be read; the type mappings can be found in the Type mapping for Parquet section. Check the Examples section for samples.

For CSV files, column names can be read from the header row. You can specify whether a header row exists using the HEADER_ROW argument. If HEADER_ROW = FALSE, generic column names will be used: C1, C2, ... Cn, where n is the number of columns in the file. Data types will be inferred from the first 100 data rows. Check the Examples section for samples.

Keep in mind that if you are reading a number of files at once, the schema will be inferred from the first file the service gets from the storage. This can mean that some of the expected columns are omitted because the file used by the service to define the schema did not contain these columns. In that case, use the OPENROWSET WITH clause.

Important

There are cases when the appropriate data type cannot be inferred due to a lack of information and a larger data type will be used instead. This brings performance overhead and is particularly important for character columns, which will be inferred as varchar(8000). For optimal performance, check the inferred data types and explicitly specify appropriate data types in the WITH clause.

Type mapping for Parquet

Parquet and Delta Lake files contain type descriptions for every column. The following table describes how Parquet types are mapped to SQL native types.

Parquet type         | Parquet logical type (annotation)                    | SQL data type
---------------------|------------------------------------------------------|----------------------------------
BOOLEAN              |                                                      | bit
BINARY / BYTE_ARRAY  |                                                      | varbinary
DOUBLE               |                                                      | float
FLOAT                |                                                      | real
INT32                |                                                      | int
INT64                |                                                      | bigint
INT96                |                                                      | datetime2
FIXED_LEN_BYTE_ARRAY |                                                      | binary
BINARY               | UTF8                                                 | varchar *(UTF8 collation)
BINARY               | STRING                                               | varchar *(UTF8 collation)
BINARY               | ENUM                                                 | varchar *(UTF8 collation)
FIXED_LEN_BYTE_ARRAY | UUID                                                 | uniqueidentifier
BINARY               | DECIMAL                                              | decimal
BINARY               | JSON                                                 | varchar(8000) *(UTF8 collation)
BINARY               | BSON                                                 | Not supported
FIXED_LEN_BYTE_ARRAY | DECIMAL                                              | decimal
BYTE_ARRAY           | INTERVAL                                             | Not supported
INT32                | INT(8, true)                                         | smallint
INT32                | INT(16, true)                                        | smallint
INT32                | INT(32, true)                                        | int
INT32                | INT(8, false)                                        | tinyint
INT32                | INT(16, false)                                       | int
INT32                | INT(32, false)                                       | bigint
INT32                | DATE                                                 | date
INT32                | DECIMAL                                              | decimal
INT32                | TIME (MILLIS)                                        | time
INT64                | INT(64, true)                                        | bigint
INT64                | INT(64, false)                                       | decimal(20,0)
INT64                | DECIMAL                                              | decimal
INT64                | TIME (MICROS)                                        | time
INT64                | TIME (NANOS)                                         | Not supported
INT64                | TIMESTAMP (NORMALIZED TO UTC) (MILLIS / MICROS)      | datetime2
INT64                | TIMESTAMP (NOT NORMALIZED TO UTC) (MILLIS / MICROS)  | bigint - make sure that you explicitly adjust the bigint value with the timezone offset before converting it to a datetime value.
INT64                | TIMESTAMP (NANOS)                                    | Not supported
                     | LIST                                                 | varchar(8000), serialized into JSON
                     | MAP                                                  | varchar(8000), serialized into JSON

Examples

Read CSV files without specifying schema

The following example reads a CSV file that contains a header row, without specifying column names and data types:

SELECT 
    *
FROM OPENROWSET(
    BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE) as [r]

The following example reads a CSV file that doesn't contain a header row, without specifying column names and data types:

SELECT 
    *
FROM OPENROWSET(
    BULK 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0') as [r]

Read Parquet files without specifying schema

The following example returns all columns of the first row from the census data set, in Parquet format, and without specifying column names and data types:

SELECT 
    TOP 1 *
FROM  
    OPENROWSET(
        BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=20*/*.parquet',
        FORMAT='PARQUET'
    ) AS [r]

Read Delta Lake files without specifying schema

The following example returns all columns of the first row from the census data set, in Delta Lake format, and without specifying column names and data types. Note that for Delta Lake the BULK path should point to the root folder of the Delta table rather than to individual Parquet files, so treat the path below as illustrative:

SELECT 
    TOP 1 *
FROM  
    OPENROWSET(
        BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=20*/*.parquet',
        FORMAT='DELTA'
    ) AS [r]

Read specific columns from CSV file

The following example returns only two columns with ordinal numbers 1 and 4 from the population*.csv files. Since there's no header row in the files, it starts reading from the first line:

SELECT 
    *
FROM OPENROWSET(
        BULK 'https://sqlondemandstorage.blob.core.windows.net/csv/population/population*.csv',
        FORMAT = 'CSV',
        FIRSTROW = 1
    )
WITH (
    [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2 1,
    [population] bigint 4
) AS [r]

Read specific columns from Parquet file

The following example returns only two columns of the first row from the census data set, in Parquet format:

SELECT 
    TOP 1 *
FROM  
    OPENROWSET(
        BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=20*/*.parquet',
        FORMAT='PARQUET'
    )
WITH (
    [stateName] VARCHAR (50),
    [population] bigint
) AS [r]

Specify columns using JSON paths

The following example shows how you can use JSON path expressions in the WITH clause and demonstrates the difference between strict and lax path modes:
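The article's original listing isn't preserved here; the following sketch, built on the census Parquet example above, illustrates the idea (column names other than stateName and population are assumptions):

SELECT 
    TOP 1 *
FROM  
    OPENROWSET(
        BULK 'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=20*/*.parquet',
        FORMAT='PARQUET'
    )
WITH (
    [stateName] VARCHAR (50),                          -- bound by matching column name
    [population] bigint 'lax $.population',            -- explicit lax JSON path (default mode)
    [missing] bigint 'lax $.doesNotExist'              -- lax: a non-existent path returns NULL
    -- [missing_strict] bigint 'strict $.doesNotExist' -- strict: the same path would fail the query
) AS [r]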

Specify multiple files/folders in BULK path

The following example shows how you can use multiple file/folder paths in the BULK parameter:
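Again the original listing isn't preserved; a sketch of the pattern, reusing the census folders from the earlier examples, could look like this (BULK accepts a comma-separated list of paths in parentheses):

SELECT 
    TOP 10 *
FROM  
    OPENROWSET(
        BULK (
            'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=2000/*.parquet',
            'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=2010/*.parquet'
        ),
        FORMAT='PARQUET'
    ) AS [r]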

Next steps

For more samples, see the query data storage quickstart to learn how to use OPENROWSET to read CSV, PARQUET, DELTA LAKE, and JSON file formats. Check the best practices for achieving optimal performance. You can also learn how to save the results of your query to Azure Storage using CETAS.