
One of the technologies currently attracting the most attention is AI (artificial intelligence). Companies and researchers around the world are competing to develop it, and a growing number of companies are working to use AI to improve operational efficiency.
Highly accurate AI is essential for streamlining business operations. Training data plays a particularly important role in improving the accuracy of machine learning, one of AI's main analysis methods. The huge volume of training data needed as learning material for machine learning is often discussed, but in practice, both the amount of training data required and the accuracy of the resulting model depend heavily on whether you can prepare appropriate training data.
Here, we explain the relationship between machine learning and training data, describe the challenges of preparing training data, and introduce how FRONTEO's AI "KIBIT" addresses those challenges.


Supervised by
FRONTEO Inc.
Director/CTO
Ph.D. (Science)
Hiroyoshi Toyoshiba
He majored in mathematics and obtained a Ph.D. in science. He has conducted research on gene expression data analysis, target discovery, and biomarker discovery at the National Institute of Environmental Health Sciences (NIEHS) and Takeda Pharmaceutical Co., Ltd., and is involved in the research and development of FRONTEO's AI algorithms.
3 steps of AI (artificial intelligence): training data is an important "input"
When thinking about what AI can do and how to make better use of AI tools, it is easier to organize your thinking by dividing AI into three steps: "input," "analysis method," and "output."

All three are essential for AI, but the quality of the input, that is, the learning data, matters most. To increase the accuracy of AI, it must be loaded with appropriate training data.
The analysis method is the technology or algorithm that comes to mind when you hear the word "AI," such as machine learning or deep learning, and the output is the application, such as "classifying sentences" or "generating images." Applications are also called tasks in machine learning. As introduced in our articles on AI learning, it is important to use the AI best suited to each purpose, in the right place, or in combination.
Machine learning using training data: Supervised learning
Among learning data, data that pairs questions with correct answers is called training data. It is like an example problem with its answer. In supervised learning, the AI explores and learns the patterns and rules present in the training data, and the model is optimized so that it can make predictions and judgments on new input data whose correct answer is unknown. Supervised learning is used for tasks such as email spam detection, machine failure prediction, and demand forecasting.
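The spam-detection case above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not FRONTEO's method: the example emails, labels, and choice of a naive Bayes classifier are all assumptions made for the sketch.

```python
# Minimal supervised-learning sketch: spam detection.
# The tiny texts and labels below are illustrative, not real data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training data: each text is paired with its correct answer (1 = spam).
texts = ["win a free prize now", "claim your free reward",
         "meeting agenda for tomorrow", "quarterly report attached"]
labels = [1, 1, 0, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)  # learn word patterns from the labeled examples

# Predict for a new email whose correct answer is unknown.
print(model.predict(["free prize inside"])[0])  # classified as spam (1)
```

With only four examples the model works only on words it has already seen; real spam filters are trained on far larger labeled sets, which is exactly the role of training data discussed here.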
Unsupervised learning and reinforcement learning
There are two machine learning methods that do not use training data: unsupervised learning and reinforcement learning.
Unsupervised learning finds patterns and structures in data for which no correct answers are given, and is used for grouping data (clustering). Reinforcement learning is a method in which the AI learns the optimal answer through its own trial and error, and is used in AI for games with winners and losers, such as chess and shogi.
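Clustering, the unsupervised example named above, can be sketched with k-means. The coordinates and the choice of two clusters are assumptions for the illustration; the point is that no correct answers are supplied.

```python
# Unsupervised-learning sketch: grouping unlabeled points with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two visibly separated groups, but no correct answers given.
points = np.array([[0.1, 0.2], [0.2, 0.1], [0.0, 0.3],
                   [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])

# The algorithm discovers the grouping on its own.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # the first three points share one label, the last three the other
```

Note that k-means only assigns group numbers; unlike supervised learning, it cannot say what each group *means* without a human interpreting the result.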
What is training data in AI machine learning?
Preparing training data for machine learning begins with data collection and annotation (labeling data with correct answers). In addition to drawing on in-house data or public datasets, creation is sometimes outsourced to external vendors. In-house data has the advantage of fitting your purpose, while external data comes in a well-organized, ready-to-use format.
How to prepare training data necessary for machine learning
To prepare training data, first collect the data, then label it.
How to label data (annotation)
Annotation is the process of attaching the corresponding answer to each piece of data, that is, labeling it. When building a spam detector, for example, each email in the training data is given a correct label indicating whether or not it is spam.
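The output of annotation can be pictured as raw items paired with labels. The emails and the "spam"/"not_spam" label scheme below are hypothetical, chosen only to show the shape of annotated data.

```python
# Sketch of annotation output: each raw email is paired with a correct label.
raw_emails = [
    "You have won a lottery, click here!",
    "Please review the attached contract.",
]

# An annotator assigns the correct answer to each item.
annotated = [
    {"text": raw_emails[0], "label": "spam"},
    {"text": raw_emails[1], "label": "not_spam"},
]

for item in annotated:
    print(item["label"], "<-", item["text"])
```

In practice this pairing is produced at scale with annotation tools, and label quality is checked by multiple annotators, which is where the know-how discussed later comes in.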
Where to collect data from
・Use your own data
Utilize your company's document data, image data, sales data, and so on. Because it is your own data, you can expect to build an AI model that fits your company's business and circumstances and reflects its culture, terminology, and subtle expressions.
・Use external datasets
There are two approaches: use publicly available datasets, or outsource to a vendor that sells datasets or creates training data on your behalf. The advantage is obtaining a complete, well-organized set of data that is easy to use for machine learning. In fact, some AI services provide models built on publicly available information or designed for specific industries. ChatGPT, for example, is an AI service trained on a large amount of general content such as web pages, news articles, and papers.
Relationship between training data and machine learning - What is machine learning doing?
It seems that a human could make reasonable judgments from just two or three training examples. With so little training data, however, it is difficult to get machine learning to produce correct answers. Why?
Creating an AI model means finding a function
Creating an AI model means finding laws from data with answers, that is, from training data. In other words, machine learning outputs a result for some input: in mathematical terms, given a set of pairs (x₁, y₁), (x₂, y₂), … (the training data, shown as ● in the figure), it draws a graph, a function y = f(x), that also works when a value of x other than the given ● points is entered, returning the corresponding y.

If only two points are given, infinitely many curves can be drawn that pass through both of them. The same goes for three points: many possible graphs remain. In other words, with little training data (● in the figure), a good AI model cannot be built.

With 10 points, you can get a fairly good idea of the curve's shape. Using this curve as a guide, the y value corresponding to a new x can be predicted with considerable accuracy.
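The idea of "finding the function from points" can be sketched numerically. The underlying function f(x) = x², the noise level, and the 10 sample points are assumptions standing in for the figure's ● marks.

```python
# Sketch of "finding a function from training data": fit a curve to 10
# samples of an assumed underlying function f(x) = x**2, then predict.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 10)                 # 10 training points (the marks)
y = x**2 + rng.normal(0, 0.1, x.size)      # noisy observations of f

coeffs = np.polyfit(x, y, deg=2)           # find the function (a parabola here)
predict = np.poly1d(coeffs)

x_new = 1.5                                # a new x not in the training data
print(predict(x_new))                      # should be close to the true 2.25
```

With only two or three points in `x`, many parabolas (and infinitely many other curves) would fit equally well, which is exactly why too little training data fails.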

In actual AI, the functions are more complex and usually require a far larger amount of training data. Moreover, x and y are not just numbers but a wide variety of things, such as vectors and natural language. For image classification with deep learning, for example, it is said that roughly 1,000 to 10,000 training examples are required.
Challenges of creating appropriate training data
Quantity and quality of training data
The amount and quality of training data required varies with the purpose and context in which the AI is used. If the amount of data is insufficient, the model cannot fully learn the pattern, and there is a risk of overfitting: the model answers correctly on the training data but lacks generality on real data. Yet collecting a large amount of data takes time and effort, which often becomes an obstacle to training.
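Overfitting can be made concrete by continuing the curve-fitting picture. The underlying function (x²) and the fixed "noise" values below are assumptions for the sketch; the point is that an overly flexible model memorizes the noise.

```python
# Overfitting sketch: a model flexible enough to match the training data
# exactly memorizes its noise instead of the underlying pattern.
import numpy as np

x = np.linspace(-3, 3, 10)
noise = np.array([0.3, -0.2, 0.5, -0.4, 0.1, -0.3, 0.4, -0.1, 0.2, -0.5])
y = x**2 + noise                           # noisy training data

over = np.poly1d(np.polyfit(x, y, 9))      # degree 9: passes through every point
good = np.poly1d(np.polyfit(x, y, 2))      # degree 2: matches the true pattern

# The overfit model has near-zero training error: it has learned the noise.
print(np.max(np.abs(over(x) - y)))         # essentially 0
# The simpler model leaves a residual about the size of the noise instead.
print(np.max(np.abs(good(x) - y)))
```

The degree-9 curve looks perfect on the training data but wiggles between the points, so its predictions for new x values are unreliable, which is the "correct on training data but lacks generality" risk described above.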
In terms of quality, AI cannot learn properly if the data contains noise or outliers such as incorrect labels or unrelated records, so the selection and creation process cannot be neglected. The format of the collected data must be consistent, and labeling requires know-how tailored to the purpose.
Creating training data is thus one of the first hurdles in utilizing AI, and a major challenge.
The identity of training data: can an AI model for your company be built from data gathered across the world and across industries?
When analyzing your company's data with language-based AI, using an AI service trained on data created from across the world or across industries will not necessarily give you results with the desired precision. One reason is that even within the same industry, different companies use different terms and conventions.
(Figure: graph for "amount of training data")
For example, in an AI model that uses external data from around the world or from other companies as training data, many pairs (x, y) may exist with the same x but different y values, as shown in the figure. Much of this data becomes noise when analyzing your own data, and it is hard to know where the model should draw the line. This shows that too much training data can itself become noise; more is not necessarily better.
Alternatively, the training data may be spread over "a different space" with an additional z axis, not just the (x, y) plane, while almost no data exists on the (x, y) plane alone. This corresponds to your company's unique wording. If you input your company's data into an AI model built this way, the accuracy of the output will be low.

If you are a business person, you have probably experienced, after a transfer or a job change, how vocabulary differs between departments and companies, or how phrasing differs subtly depending on the business partner. This suggests that to improve AI accuracy, it is better to use your own company's data as training data rather than general-purpose data. In other words, what training data needs is sameness with the data you want to analyze (usually your own data).
Issues specific to "fraud investigations" and "audits" that FRONTEO supports with AI, and the strengths and uses of each AI
FRONTEO's AI is a natural language processing AI that analyzes with high accuracy even from a small amount of training data, supporting companies in evidence review for international litigation, fraud investigations, and various forms of monitoring (audits).
[“Fraud investigation” and “audit” supported by FRONTEO with AI]
・eDiscovery
・Forensic investigation
・Email & chat audits
・SNS monitoring
Evidence review for fraud and litigation requires precision in identifying subtle differences from only a small amount of training data.
When using AI to investigate fraud, we would ideally use actual emails containing fraudulent exchanges as training data. However, because fraud does not happen often, the targeted emails make up a very small portion of all email, and real examples are rarely available in quantity.
Moreover, finding litigation materials and evidence of fraud in a huge volume of text such as emails and documents requires distinguishing subtle differences between sentences. Searching for evidence in vast amounts of data in corporate litigation and fraud investigations is an extremely difficult task, even with AI.
FRONTEO's "KIBIT": an AI specialized for "discovery"
To increase accuracy, then, it is best to use training data built from your own data with correct-answer labels attached, and the amount of training data should be small enough that preparing it is not a burden on people. Meeting both conditions is a major issue for the social implementation of AI. FRONTEO's natural language processing AI "KIBIT" is a "discovery" AI designed to meet this difficult requirement, built for the purpose of finding the necessary information in huge amounts of text data.
Because quality matters more than quantity, KIBIT reduces the required training data to just a few dozen items, making it feasible to verify quality by human eye. By feeding in high-quality in-house training data prepared this way, a highly accurate AI model can be built that reads subtle differences and nuances in sentences.
In other words, for the "use" of finding information such as evidence in data, combining the "analysis method" (the discovery AI KIBIT, which meets this requirement) with, above all, high-quality "training data" brings the three elements into proper alignment and maximizes AI accuracy.

Training data issues solved with KIBIT
With AI services whose training data is prepared in advance and need not be created in-house, it is usually difficult to achieve accurate analysis of emails exchanged within a company. FRONTEO's AI solutions therefore emphasize the internal data of the companies conducting investigations and audits, and create training data from each company's own data. Accuracy cannot be improved without training data containing the distinctive words and sentences the target company uses.
Even so, the burden of creating training data remains heavy, so a major feature of KIBIT is that it can currently build highly accurate, operational AI models from only a few dozen training examples. Unlike ordinary machine learning and deep learning, thousands of training examples are not required.
KIBIT achieves high accuracy with little training data thanks to a simple, mathematically grounded algorithm that uses few parameters. Its simplicity also means text can be analyzed and trained on a laptop-class computer, with low power consumption and at high speed.
Lowering barriers to AI introduction and aiming for social implementation of AI
Even from a small amount of training data, KIBIT learns human tacit knowledge, sensibilities, and judgments, reproduces human thinking, and quickly finds the necessary data in large volumes of data whose correct answers are unknown.
KIBIT can currently be deployed with a few dozen training examples, and we continue to develop and improve it to operate with even less training data while maintaining high accuracy. If highly accurate AI can be utilized with less burden, it will be able to support many more experts, expanding the social implementation of AI.