PyText is an open-source Natural Language Processing (NLP) tool recently developed by the Facebook AI team. The tool has many applications, but for the purpose of this example, let's take a look at how it can be used to develop an AI assistant/chatbot.
Before diving deep into the example, however, it is important to note that the tool can be easily used by both Machine Learning (ML) experts and newcomers to the field: it lets you start with the basics and extend your application as your knowledge and ML expertise grow. This article is a simple introduction to PyText: it shows the main concepts with minimal coding.
Any AI assistant/chatbot is built on intent detection and slot filling: to understand spoken language, we need to automatically identify the intent the user expresses in natural language and extract the associated arguments, or slots, needed to achieve a goal.
For this example, we will use e-commerce as the domain for creating an AI assistant. Take a look at some possible dialogue examples:
| Intent | Text | Slots |
| --- | --- | --- |
| product | Show me black t-shirts with size M. | product=t-shirts, size=M, color=black |
| add | I will take the first one. | position=first |
| cost | How much will it cost to deliver to 221B Baker Str London? | location=221B Baker Str London |
For each sentence, the intent needs to be identified in order to gather what the customer is inquiring about, and the slots (labeled words) need to be found in order to capture the details and comprehend the context. For example, when a customer asks for a black t-shirt, we understand that they are looking for a specific kind of product, we know what kind of product it is (t-shirt), and what color the product should be (black).
Introduction to PyText
To start using PyText, you need to have Python 3 and pip installed in order to run the following command that will install PyText:
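PyText is published on PyPI under the name pytext-nlp:

```
pip install pytext-nlp
```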
PyText provides simple and extensible interfaces and abstractions for model components. The following example shows how easy it can be to train a model and start using it in production right away.
The command-line tool pytext is automatically installed onto your PATH; it is the fastest way to start training a model.
Data Preparation
Data preparation is the most critical part of any machine learning project. Unfortunately, there is no openly available dataset for this domain, so we need to generate our own simple one. If you would like to save some time, feel free to skip this part and use my dataset from https://github.com/kiril-me/assistance/tree/master/data.
One complication is that PyText requires the training dataset to be in a particular format. It is, however, rather easy to write a tool that transforms marked-up text into the PyText format. For the sake of simplicity, use my assistance_data.py script.
First, generate your chat samples and store them in chat.txt. Take a look at a sample of the training dataset we will be using:
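The rows below, reconstructed from the dialogue examples in the table above, show the markup (intent and text are tab-separated):

```
product	Show me (black)[color] (t-shirts)[product] with size (M)[size].
add	I will take the (first)[position] one.
cost	How much will it cost to deliver to (221B Baker Str London)[location]?
```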
Each row contains two tab-separated columns: the first is the intent and the second is the text in a markdown-like format. The slot value is inside round brackets, and the slot name is inside square brackets. Remember, however, that three samples are not sufficient for training an ML model, so many more training samples certainly need to be added. You can also add more intents and slots to increase the complexity.
To convert the data into the PyText format and split it into training, validation, and test sets, run the assistance_data.py script with the following parameters:
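The exact arguments depend on the script version; an invocation along these lines (the file and directory names here are illustrative) reads chat.txt and writes the splits into data/:

```
python assistance_data.py chat.txt data/
```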
The script will generate three files inside the data directory.
Configuring the Model
Training a PyText model on a dataset is driven primarily by the configuration parameters. We have already created our training dataset; now we need to configure the deep learning network and model parameters.
PyText configurations are in JSON format. Create the joint-model.json file:
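The configuration schema has changed across PyText releases, so treat the field names below as an illustrative sketch; it reflects the data handler, embedding, and model settings discussed next (the .tsv paths are assumptions):

```json
{
  "task": {
    "JointTextTask": {
      "trainer": {
        "epochs": 10
      },
      "data_handler": {
        "columns_to_read": ["doc_label", "word_label", "text"],
        "train_path": "data/train.tsv",
        "eval_path": "data/eval.tsv",
        "test_path": "data/test.tsv"
      },
      "features": {
        "word_feat": {
          "embed_dim": 50,
          "pretrained_embeddings_path": "glove.6B.50d.txt"
        }
      },
      "model": {
        "dropout": 0.1
      }
    }
  }
}
```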
Let's take a look at the data_handler configuration. It sets the locations of all three datasets and defines our data format: the first column is the document (intent) label, the next is the word (slot) label, and the last is the text itself. This is a fairly standard text format for PyText.
To train a model, we need to convert our text to a machine representation (vectorize it). For our model, we use GloVe pre-trained word embeddings. You can download the GloVe dataset from the Stanford website https://nlp.stanford.edu/projects/glove/ and take the smallest one, glove.6B.50d.txt. It was trained on about 6 billion tokens, and each word is represented by a 50-dimensional vector. PyText will automatically convert our sentences to vectors. The greater the number of dimensions, the better the accuracy, but the slower the training and prediction. To see why, consider a sentence of 20 words with 50 dimensions each: that is a 20 * 50 = 1,000-element vector. Using 300-dimensional embeddings instead yields a 20 * 300 = 6,000-element vector, so performance drops roughly six-fold.
The model configuration defines the deep learning network. You can learn more about the joint model of intent determination and slot filling approach at https://www.ijcai.org/Proceedings/16/Papers/425.pdf. It uses a bi-directional long short-term memory (LSTM) network. Because the dataset is small, we set the dropout to 0.1.
Starting the Training
To start the training we will use the PyText train mode. It takes the joint-model.json configuration file, initializes the model, and begins the training process; at the end, it saves the best model in the snapshot folder. Our configuration sets epochs to 10, which means we perform 10 passes over the training data and select the best model. The model snapshot can then be used in production.
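The pytext CLI reads the configuration from standard input:

```
pytext train < joint-model.json
```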
You will get the final F1 score for each intent and slot.
Executing the Model
Before we use the model in production, we want to make sure that it satisfies our needs, that is, has good accuracy. Let's test it:
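The test mode evaluates the saved snapshot against the test set, again taking the configuration on standard input:

```
pytext test < joint-model.json
```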
Because our dataset is so small, we cannot expect a good F1 score.
Let's try to make a prediction for one sample:
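The predict mode accepts a JSON payload on standard input:

```
pytext --config-file joint-model.json predict <<< '{"text": "Show me black t-shirts with size M."}'
```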
This command will print all the intent and slot coefficients:
This shows that the customer is most likely inquiring about a product, since the product intent has the smallest coefficient; the product and size slots likewise have the smallest coefficients. To get a better F1 score, we need to build a better training dataset and potentially adjust the model parameters.
Starting to Use the Model in Your Project
Before we start using the model, we need to export it. To do so, make the following call:
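The export mode writes the trained model as a Caffe2 network:

```
pytext export --output-path joint_model.c2 < joint-model.json
```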
The joint_model.c2 file will contain all the information needed to run our model.
For a simple web application, we will use Flask.
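Here is a minimal sketch modeled on PyText's Flask demo; the route, port, and response rendering are illustrative:

```python
import sys

import flask
import pytext

config_file = sys.argv[1]  # joint-model.json
model_file = sys.argv[2]   # joint_model.c2

# Load the PyText config and wrap the exported Caffe2 model in a predictor.
config = pytext.load_config(config_file)
predictor = pytext.create_predictor(config, model_file)

app = flask.Flask(__name__)


@app.route("/", methods=["GET", "POST"])
def predict():
    # The request body is the raw utterance text.
    text = flask.request.data.decode()
    # Depending on the PyText version, the expected input key is
    # "text" or "raw_text".
    result = predictor({"text": text})
    # Render the raw intent and slot scores; mapping label indices back
    # to words is shown in the PyText demo project linked below.
    return str(result)


app.run(host="0.0.0.0", port=8080, debug=True)
```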
Run the application (assuming the sketch above is saved as app.py), then query it with curl:
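```
python app.py joint-model.json joint_model.c2
curl -d "Show me black t-shirts with size M." http://localhost:8080/
```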
For demonstration purposes, our code only displays labels, not the actual words. You can see how to map the labels back to actual words in the PyText demo project: https://github.com/facebookresearch/pytext/blob/master/demo/flask_server/atis.py
As you can see, the code is quite compact, and you can experiment with your model, tweak parameters, or even swap in a different machine learning algorithm. Data engineering is the hardest problem in building accurate machine learning models, but it is also the part you can most easily iterate on to improve results.
The full code is available at https://github.com/kiril-me/assistance