
Azure Machine Learning Datasets

February 14, 2020

>>You’re not going to
want to miss this episode of the AI Show where we talk about Azure Machine
Learning Service datasets. Imagine being able to know exactly
where your model came from, how it was built, and what data
is in it. Make sure you tune in. [MUSIC]>>Hello and welcome to this episode of the AI Show where we’re going to talk about something
called a dataset. I have with me my colleague Mae. Can you tell us who you are
and what you do, my friend?>>My name is Mae. I’m a
PM at Microsoft, and the product I’m working on is
Azure Machine Learning datasets.>>Fantastic. So why don’t
you tell us what datasets are.>>Sure. Datasets are assets in your Machine Learning workspace
that help you connect to the data in your storage service and make it available for your
machine learning experiments.>>So when you say asset, are we basically copying data? Does it cost more?
What’s going on there?>>No. When you create a
dataset, you just create a reference to the data
in your storage service. We do not copy your data, so there’s no storage cost incurred
when you create a dataset.>>So when you’re using
a dataset it’s basically a pointer to some other data that’s stored on storage.
Did I get that right?>>Exactly.>>So why would you do that? Why would you make a dataset if you could just point
it directly to storage?>>Well, first, we make it easy
for you to access your data. You only need to register
the data once, and then you can reuse it across different experiments or share it with your colleagues. Second, we integrate well
with our training products. You can use datasets as a direct input for your
training script [inaudible] , Azure Machine Learning pipelines and also automated machine learning. Third, we help you track
where data has been used. For each experiment run, you will be able to see which input datasets were used for
that experiment, and you can even register a dataset with your models so that users
will be able to know that this
model was trained on these datasets, for reproducibility
and audit purposes.>>So let me see if
I understand this. There were three reasons you said. The first one is because it’s a shareable asset
that anyone can use.>>Yes.>>The second one is
because it’s easy to put into experimentation and the last one was this
thing of being able to trace where the datasets
have been used.>>Exactly.>>Okay. So let’s
start, because I wanted to see how
you actually use it. First,
how do you create a dataset? And second,
how do you use it? Can you show us?>>Sure. Definitely.>>All right. Let’s do it.>>So I’m in my Machine
Learning workspace.>>Okay.>>We have datasets under Assets. You can create datasets from local files, from your
datastores, or from web URLs.>>So let me pause right there.>>Sure.>>Let’s go back to that little dropdown here. So when you say create dataset from local files, it’s
basically that I can upload stuff?>>Yes. Exactly.>>Okay. From a
datastore is if I already have storage that I’ve mapped to my Azure Machine Learning workspace,
is that right? Okay. So far I’m two for three. I’m not sure what it
means by web files. Why don’t you
describe what that is?>>Web files just means you can paste a URL for public data,
for example. You can paste a URL that
points straight to the data itself.>>I see, and does it pull it down
and put it somewhere for you?>>No. We just record that as a pointer and when
you use the dataset, we will help you load
data from the web.>>That’s pretty cool.
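For readers following along in code rather than the studio UI, the web-file flow just described might look roughly like this with the azureml-core Python SDK; the URL and dataset name below are illustrative placeholders, not values from the demo:

```python
# Sketch: a dataset that just records a public URL as a pointer.
# Assumes the azureml-core SDK; the URL and name are placeholders.
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()  # loads your workspace from config.json

web_path = "https://example.com/data/passengers.csv"  # placeholder URL
dataset = Dataset.Tabular.from_delimited_files(path=web_path)

# Register it so it shows up as an asset in the workspace.
dataset.register(workspace=ws, name="passengers-web")
```

When the dataset is consumed, the data is loaded from the web on demand; nothing is copied into your storage at creation time.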
So in essence, it’s like if there’s a public Kaggle
dataset or some other dataset, you can have a web reference directly to that CSV file
and I’m done.>>Exactly.>>Fantastic. All right. Sorry, I
interrupted you. Let’s make one.>>Sure. I’ll create one from my datastore. So these are all the
storage containers I have. I’ll use the workspace default Azure Blob storage here, and
if I click “Browse”, it will show all the
files available in my Blob Storage. When creating this, you can point to either a single
CSV file or a full folder, which means multiple files. So in this case, I’ll just
point to this folder.>>I see. Each CSV file has
the same tabular structure.>>Exactly. Same schema.>>Before you go to “Save”: when you point it to a datastore, that’s the mapping of
the actual Azure Storage to the machine learning
workspace, is that right?>>Yes.>>Okay. Cool.>>So I’ll click “Save” and
give my dataset a name. When I click “Next,” it will
try to parse my data into a tabular format. It helps detect the file
format, delimiters, and column headers for you, and this shows the first few
rows of my data. If I click “Next,” this
is the schema of my data, and I can choose to drop a few columns that aren’t
relevant for my experiments.>>I see.>>Or I can change the
column type if I want. Then after that, I will click “Done”, and the dataset will show up in my Machine
Learning Workspace.>>Got it, and so now that asset is shareable with anyone
that’s in the workspace?>>Exactly.>>That’s awesome, and this
feels almost like using Pandas, but storing all this stuff in the
actual workspace. Is that right?>>Exactly.>>Cool. All right.
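The datastore flow just walked through in the UI could be sketched in code roughly like this; the folder path, column name, and dataset name are placeholders:

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()  # workspace default Azure Blob storage

# Point at a folder of CSV files that share one schema; this creates
# a reference only -- no data is copied.
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "my-data/*.csv")  # placeholder folder path
)

# Drop columns that aren't relevant, then register the dataset by name.
dataset = dataset.drop_columns(["unused_column"])  # placeholder column
dataset.register(workspace=ws, name="my-tabular-dataset")
```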
So how do you use it? Well, you’re going to
show us something here.>>So this is the overview page of the dataset I just created.
As for how to use it, there are two ways to
consume your dataset. The first way is to
consume it directly. So here is the sample code. After you
get your dataset by name from the workspace object, you can load it into
common data frames like Pandas or Spark, and then
data scientists can continue their data preparation
and feature engineering using the libraries that
they’re very familiar with.>>That’s really cool
because it actually puts in your subscription ID and resource
group and everything, so your workspace is ready.
You could literally just copy this code and put it
into your notebooks, right?>>Exactly. So that’s the
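That generated snippet is along these lines — a sketch assuming a registered tabular dataset; the name is a placeholder:

```python
from azureml.core import Workspace, Dataset

# Workspace.from_config() picks up the subscription ID, resource group,
# and workspace name, so the snippet is ready to paste into a notebook.
ws = Workspace.from_config()

# Fetch the registered dataset by name from the workspace object...
dataset = Dataset.get_by_name(ws, name="my-tabular-dataset")  # placeholder

# ...and load it into a pandas DataFrame for data preparation and
# feature engineering with familiar libraries.
df = dataset.to_pandas_dataframe()
print(df.head())
```

A `to_spark_dataframe()` counterpart exists for Spark users.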
first way to consume it. The second way is more for deep learning scenarios where data frame doesn’t
really make much sense.>>Right.>>So datasets help
you mount your data from your storage service
to the remote compute.>>Right.>>I can show you a demo.>>Let’s do that because as
you’re getting the tab open: I’m a computer vision guy. I love computer vision,
and you don’t want to copy all of the
images into the container.>>Exactly.>>You want to somehow mount it. So I’m interested to see how
that’s done with datasets.>>Sure. So here I’m loading my dataset from
Azure ML Open Datasets. I’ll use MNIST here, and I will register this MNIST
dataset with my workspace. By running that, this MNIST dataset will actually show
up in my workspace.>>So you created the
dataset in code this time?>>Yes.>>Got it. Cool.>>By registering it,
it shows up in the list. So here’s the MNIST dataset.>>Cool.>>Cool. So here I’m going
to do some data exploration, so what I will do is mount my MNIST dataset to my Notebook VM. So again, we don’t copy your data. Your data stays in the Blob Storage;
we just mount the data to the Notebook VM
so that we can take a look at a few sample files.>>I see. So when you’re
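A sketch of that mount step, assuming a registered file dataset; the dataset name and mount point are placeholders:

```python
import os

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
mnist = Dataset.get_by_name(ws, name="mnist-dataset")  # placeholder name

# Mount the file dataset onto the Notebook VM's file system.
# The files stay in Blob Storage; nothing is downloaded up front.
mount_context = mnist.mount("/tmp/mnist")  # placeholder mount point
mount_context.start()

print(os.listdir("/tmp/mnist"))  # peek at a few sample files

mount_context.stop()  # unmount when done
```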
calling MNIST dataset mount, you’re mounting it to a
path in the container, but it’s not copying the files over?>>No. It’s just mounting.>>Got it. Cool.>>So this shows the sample images and labels. It all looks good,
and now I’m going to start configuring the training.>>Cool.>>So I will first get my dataset
back from the workspace by name, and dataset.to_path() shows you what files
are available, as pointed to by your dataset.>>That’s cool.>>Then here I’m configuring
my TensorFlow estimator, and the one thing I
want to highlight here is how you use the dataset as the input. So you can call
as_named_input and then as_mount. This basically says to mount my MNIST dataset
to this compute target.>>Got it.>>This is my training script. It shows you how to get your data. So as you remember, we pass this as an argument
and in your training script, you will get this data folder
back and it basically points to the mounted path on
your remote compute.>>If I remember right,
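Pieced together, the configuration being described might look roughly like this; the estimator class reflects the SDK of the time, and the compute target, script folder, and dataset name are assumptions, not the exact demo code:

```python
from azureml.core import Workspace, Dataset
from azureml.train.dnn import TensorFlow  # estimator API current at the time

ws = Workspace.from_config()
mnist = Dataset.get_by_name(ws, name="mnist-dataset")  # placeholder name

# as_named_input(...).as_mount() tells the run to mount the dataset on the
# remote compute and pass the mounted path as a command-line argument.
estimator = TensorFlow(
    source_directory="./scripts",  # placeholder folder holding train.py
    entry_script="train.py",
    compute_target="gpu-cluster",  # placeholder compute target name
    script_params={"--data-folder": mnist.as_named_input("mnist").as_mount()},
)

# Inside train.py, the script parses --data-folder with argparse and treats
# it as an ordinary local directory -- it never knows the data is mounted.
```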
this is actually running inside of a remote compute
inside of a container.>>Yes.>>So basically when you
say the data folder, it thinks it’s just a local folder
on the container when, in reality, the file system is mounted as a share.>>Exactly.>>Got it. Cool.>>So data scientists don’t need to change
a single line of code in their training script
when they switch from local training to remote training.>>That’s cool.>>Yeah. So you can just load
your data as if it were a local path.>>That’s cool.>>So after all this configuration, you can just submit the
run, and you will be able to see the experiment from your workspace. This was the experiment I
submitted just now, and if you go to the run detail page, we have the input datasets
saved as one of the properties. So this is where we help you trace which dataset
was used for the run.>>So that is what’s
called data lineage. For example, when you run something, obviously a model is produced. From the model you can point back to
the experiment that made it, and then you can point back to the
actual dataset you used as well. Is that right?>>Yes. So if you click this link, it will link you back
to the MNIST dataset.>>Got it.>>So this was the dataset
that I used for this training. As for your question about models, you can actually register a model
with the dataset directly. So I will just run
register model, registering my model with my training
dataset, and if I do that, under the MNIST dataset, you
will see a Models tab which lists all the models that have
used this MNIST dataset.>>That’s really cool because generally when I do machine learning, I just run the thing
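The model registration being described could be sketched like this; the model name and file path are placeholders:

```python
from azureml.core import Workspace, Dataset, Model

ws = Workspace.from_config()
mnist = Dataset.get_by_name(ws, name="mnist-dataset")  # placeholder name

# Registering the model together with its training dataset records the
# lineage: the dataset's Models tab lists every model trained on it.
model = Model.register(
    workspace=ws,
    model_name="mnist-classifier",        # placeholder model name
    model_path="outputs/model.pkl",       # placeholder path to model file
    datasets=[("training data", mnist)],  # (relationship, dataset) pairs
)
```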
and export the model. But now, if you do this in the context of Azure
Machine Learning Service, you’re able to trace the model to the experiment and to the dataset
as well, which is really cool.>>Yes.>>That’s awesome. Well, where can people go to find
out more about this?>>You can find more
documentation about datasets on the Azure Machine Learning
documentation website, and we have lots of sample
notebooks in our GitHub.>>Fantastic. We’ll put some links below so you can take
a look at those. Well, thank you so much for
spending some time with us.>>Thank you, Seth.>>Thank you so much for watching. We’ve learned all about how to use datasets in the Azure Machine
Learning Service workspace. Thanks for watching, and we’ll
see you next time. Take care. [MUSIC]
