
Work with Remote Data

You can read and write data from remote locations using MATLAB® functions and objects, such as file I/O functions and some datastore objects. These examples show how to set up, read from, and write to remote locations on the following cloud storage platforms:

  • Amazon S3™ (Simple Storage Service)

  • Azure® Blob Storage (previously known as Windows Azure® Storage Blob (WASB))

  • Hadoop® Distributed File System (HDFS™)

Amazon S3

MATLAB lets you use Amazon S3 as an online file storage web service offered by Amazon Web Services. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form

s3://bucketname/path_to_file

bucketname is the name of the container and path_to_file is the path to the file or folders.

Amazon S3 provides data storage through web services interfaces. You can use a bucket as a container to store objects in Amazon S3.

Set Up Access

To work with remote data in Amazon S3, you must set up access first:

  1. Sign up for an Amazon Web Services (AWS) root account. See Amazon Web Services: Account.

  2. Using your AWS root account, create an IAM (Identity and Access Management) user. See Creating an IAM User in Your AWS Account.

  3. Generate an access key to receive an access key ID and a secret access key. See Managing Access Keys for IAM Users.

  4. Configure your machine with the AWS access key ID, secret access key, and region using the AWS Command Line Interface tool from https://aws.amazon.com/cli/. Alternatively, set the environment variables directly by using setenv, as sketched after this list:

    • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY — Authenticate and enable use of Amazon S3 services. (You generated this pair of access key variables in step 3.)

    • AWS_DEFAULT_REGION (optional) — Select the geographic region of your bucket. The value of this environment variable is typically determined automatically, but the bucket owner might require that you set it manually.

    • AWS_SESSION_TOKEN (optional) — Specify the session token if you are using temporary security credentials, such as with AWS® Federated Authentication.
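The following is a minimal sketch of setting these environment variables with setenv. The key, region, and token values are placeholders that you replace with your own credentials; the region shown is only an example.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');           % required
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');   % required
setenv('AWS_DEFAULT_REGION', 'us-east-1');                       % optional; example region
setenv('AWS_SESSION_TOKEN', 'YOUR_AWS_SESSION_TOKEN');           % optional; only with temporary credentials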

If you are using Parallel Computing Toolbox™, you must ensure the cluster has been configured to access S3 services. You can copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
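For example, the following is a minimal sketch of starting a parallel pool that copies the AWS credential variables from the client to the workers, assuming your default cluster profile is already set up:

pool = parpool('EnvironmentVariables', ...
    {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'});   % add 'AWS_SESSION_TOKEN' if you use temporary credentials

The same name-value pair applies to batch and createJob, so batch jobs and independent jobs can reach S3 in the same way.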

Read Data from Amazon S3

The following example shows how to use an ImageDatastore object to read a specified image from Amazon S3, and then display the image to screen.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = imageDatastore('s3://bucketname/image_datastore/jpegfiles', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)

Write Data to Amazon S3

The following example shows how to use a tabularTextDatastore object to read tabular data from Amazon S3 into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Amazon S3.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = tabularTextDatastore('s3://bucketname/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('s3://bucketname/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('s3://bucketname/preprocessedData/');
tt = tall(ds);
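If the preprocessed result is small enough to fit in client memory, you can then gather it into an ordinary in-memory array. This is a minimal sketch; gather triggers evaluation of the deferred tall operations.

arrDelay = gather(tt);   % evaluate the tall array and return the result in memory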

Azure Blob Storage

MATLAB lets you use Azure Blob Storage for online file storage. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form

wasbs://container@account/path_to_file/file.ext

container@account is the name of the container and path_to_file is the path to the file or folders.

Azure provides data storage through web services interfaces. You can use a blob to store data files in Azure. See What is Azure for more information.

Set Up Access

To work with remote data in Azure storage, you must set up access first:

  1. Sign up for a Microsoft Azure account. See Microsoft Azure Account.

  2. Set up your authentication details by setting exactly one of the following two environment variables using setenv:

    • MW_WASB_SAS_TOKEN — Authentication via Shared Access Signature (SAS)

      Obtain an SAS. For details, see the "Get the SAS for a blob container" section in Manage Azure Blob Storage resources with Storage Explorer.

      In MATLAB, set MW_WASB_SAS_TOKEN to the SAS query string. For example,

      setenv MW_WASB_SAS_TOKEN '?st=2017-04-11T09%3A45%3A00Z&se=2017-05-12T09%3A45%3A00Z&sp=rl&sv=2015-12-11&sr=c&sig=E12eH4cRCLilp3Tw%2BArdYYR8RruMW45WBXhWpMzSRCE%3D'

      You must set this string to a valid SAS token generated from the Azure Storage web UI or Explorer.

    • MW_WASB_SECRET_KEY — Authentication via one of the Account's two secret keys

      Each Storage Account has two secret keys that permit administrative privilege access. This same access can be given to MATLAB without having to create an SAS token by setting the MW_WASB_SECRET_KEY environment variable. For example:

      setenv MW_WASB_SECRET_KEY '1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF'

If you are using Parallel Computing Toolbox, you must copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.

For more information, see Use Azure storage with Azure HDInsight clusters.

Read Data from Azure

To read data from an Azure Blob Storage location, specify the location using the following syntax:

wasbs://container@account/path_to_file/file.ext

container@account is the name of the container and path_to_file is the path to the file or folders.

For example, if you have a file airlinesmall.csv in a folder /airline on a test storage account wasbs://blobContainer@storageAccount.blob.core.windows.net/, then you can create a datastore by using:

location = 'wasbs://blobContainer@storageAccount.blob.core.windows.net/airline/airlinesmall.csv';
ds = tabularTextDatastore(location, 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});

You can use Azure for all calculations that datastores support, including direct reading, mapreduce, tall arrays, and deep learning. For example, create an ImageDatastore object, read a specified image from the datastore, and then display the image to screen.

setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = imageDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)

Write Data to Azure

This example shows how to read tabular data from Azure into a tall array using a tabularTextDatastore object, preprocess it by removing missing entries and sorting, and then write it back to Azure.

setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = tabularTextDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/');
tt = tall(ds);

Hadoop Distributed File System

Specify Location of Data

MATLAB lets you use Hadoop Distributed File System (HDFS) as an online file storage web service. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of one of these forms:

hdfs:/path_to_file
hdfs:///path_to_file
hdfs://hostname/path_to_file

hostname is the name of the host or server and path_to_file is the path to the file or folders. Specifying the hostname is optional. When you do not specify the hostname, Hadoop uses the default host name associated with the Hadoop Distributed File System (HDFS) installation in MATLAB.

For example, you can use either of these commands to create a datastore for the file, file1.txt, in a folder named data located at a host named myserver:

  • ds = tabularTextDatastore('hdfs:///data/file1.txt')
  • ds = tabularTextDatastore('hdfs://myserver/data/file1.txt')

If hostname is specified, it must correspond to the namenode defined by the fs.default.name property in the Hadoop XML configuration files for your Hadoop cluster.

Optionally, you can include the port number. For example, this location specifies a host named myserver with port 7867, containing the file file1.txt in a folder named data:

'hdfs://myserver:7867/data/file1.txt'

The specified port number must match the port number set in your HDFS configuration.

Set Hadoop Environment Variable

Before reading from HDFS, use the setenv function to set the appropriate environment variable to the folder where Hadoop is installed. This folder must be accessible from the current machine.

  • Hadoop v1 only — Set the HADOOP_HOME environment variable.

  • Hadoop v2 only — Set the HADOOP_PREFIX environment variable.

  • If you work with both Hadoop v1 and Hadoop v2, or if the HADOOP_HOME and HADOOP_PREFIX environment variables are not set, then set the MATLAB_HADOOP_INSTALL environment variable.

For example, use this command to set the HADOOP_HOME environment variable. hadoop-folder is the folder where Hadoop is installed, and /mypath/ is the path to that folder.

setenv('HADOOP_HOME','/mypath/hadoop-folder');

HDFS Data on Hortonworks or Cloudera

If your current machine has access to HDFS data on Hortonworks or Cloudera®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when using Hortonworks or Cloudera application edge nodes.

Prevent Clearing Code from Memory

When reading from HDFS or when reading Sequence files locally, the datastore function calls the javaaddpath command. This command does the following:

  • Clears the definitions of all Java® classes defined by files on the dynamic class path

  • Removes all global variables and variables from the base workspace

  • Removes all compiled scripts, functions, and MEX-functions from memory

To prevent persistent variables, code files, or MEX-files from being cleared, use the mlock function.
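For example, the following is a minimal sketch of a hypothetical function that calls mlock so that its file, and therefore its persistent variable, is not cleared when datastore calls javaaddpath:

function value = cachedConfig(newValue)
% cachedConfig  Hypothetical helper: mlock locks this function file in memory,
% so the persistent variable below survives calls to javaaddpath.
mlock
persistent storedValue
if nargin > 0
    storedValue = newValue;   % update the cached value
end
value = storedValue;          % return the cached value (empty until set)
end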

Write Data to HDFS

This example shows how to use a tabularTextDatastore object to write data to an HDFS location. Use the write function to write your tall and distributed arrays to a Hadoop Distributed File System. When you call this function on a distributed or tall array, you must specify the full path to an HDFS folder. The following example shows how to read tabular data from HDFS into a tall array, preprocess it by removing missing entries and sorting, and then write it back to HDFS.

ds = tabularTextDatastore('hdfs://myserver/some/path/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('hdfs://myserver/some/path/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('hdfs://myserver/some/path/preprocessedData/');
tt = tall(ds);
