For researchers, searching the literature is essential at every stage of the research process: access to prior research is the foundation of new ideas.
Nowadays, electronic journals and article databases have made it far easier to obtain article information. In drug development, novelty is one of the key elements when formulating a hypothesis for a new drug, so papers on rare diseases and new mechanisms of action are needed; however, such papers are difficult to find through keyword search alone.
Here, we outline the kinds of papers essential to research and how to search for them, and explain the features of "KIBIT Amanogawa", FRONTEO's "discovery concept search" AI system that can find similar and related papers with high accuracy.
Supervision
FRONTEO Inc.
Executive Officer
Drug Discovery AI Factory Chief Executive Officer
Head of Life Science AI Business Headquarters and Director of Behavioral Information Science Research Institute
CTO Ph.D. (Science)
Hiroyoshi Toyoshiba
After graduating from the Department of Mathematics at Waseda University's Graduate School of Science and Engineering, he engaged in research on gene expression data analysis, target discovery, biomarker discovery, and related topics at the U.S. National Institute of Environmental Health Sciences, Takeda Pharmaceutical Company, and elsewhere. He now researches and develops AI algorithms for the life sciences.
A scientist's research begins with a paper search
For researchers, papers are clues to what their predecessors thought, what they tried, and what they discovered. From setting a research topic to writing and presenting a paper, referring to prior research papers is indispensable.
Papers are a means of presenting and sharing research and are a treasure trove of the latest information.
Academic research is generally presented and published in several forms. It begins with reports in institutional journals and presentations at conferences; letters are used when priority is given to rapid reporting; original articles submit original research results to academic journals; and reviews summarize the overview and latest trends of a particular field or theme. As new knowledge becomes established, it is organized and compiled into books.
In the world of research, the term "paper" generally refers to letters, original articles, and reviews.
Difficulty in finding innovative perspectives and ideas from article searches, and limitations of keyword searches
Searching for papers is an essential process in research, and research proceeds by repeatedly reading and searching papers to confirm evidence and obtain new ideas. In drug development, you would search article databases and electronic journals by disease name or substance name. If you have papers relevant to your purpose at hand, you can find useful papers one after another by following their citations.
However, finding the right papers can be very difficult, especially when formulating the initial hypothesis for a study. If the target is a disease with no established cure, little prior research may exist, and keyword searches will return almost no results.
Limits of generating groundbreaking hypotheses from a huge number of papers
Researchers around the world publish papers one after another, and these accumulate as collective knowledge, but finding suitable papers among them is only getting harder. For example, PubMed, the life science article search database, contains over 30 million articles, and Elsevier's academic database Scopus holds over 91 million records across all fields.
It is impossible to keep track of all papers
Searching for papers is an important process in research and deserves sufficient time, but time is limited. It is simply impossible to read and understand every paper in order to find the relevant ones.
Readers may be biased
At the stage of selecting papers to reference, biases stemming from the researcher's own expertise and ways of thinking inevitably creep in. Searches that rely on personal knowledge may miss papers that should be relevant.
The real problem with article searches: highly similar and related articles remain unfound
In medicine, there is a good chance that a paper related to the target disease exists somewhere in a database, even if it does not mention the same disease name or gene name. In practice, researchers may find information in their own field of expertise yet fail to find other information that should be related. This happens very often, and it can be called an invisible loss.
How to search for papers and the mechanisms and technologies used there
Although the number of articles keeps increasing, access to them has improved dramatically with the spread of search databases and electronic journals. The key is how to search among them. Below is an overview of search methods for scientific papers and the techniques and mechanisms working behind the scenes.
Search method: search database, electronic journal
In the natural sciences, many services are widely used, including search databases such as PubMed and Embase, and electronic journal platforms such as ScienceDirect.
PubMed | The largest medical database, provided by the National Center for Biotechnology Information (NCBI) within the U.S. National Library of Medicine. Keyword and thesaurus searches are available. |
Embase | A bibliographic database in the medical field provided by Elsevier. It is rich in clinical trial papers on pharmaceuticals, and thesaurus search is also available. |
Google Scholar | A search system provided by Google. Keyword searches can comprehensively cover papers in Japanese, English, and other languages. |
Igaku Chuo Zasshi (Ichushi Web) | Enables comprehensive searching of Japanese papers in medicine, dentistry, pharmacy, nursing, and related fields. |
Techniques and mechanisms commonly used for text searches
Behind the scenes of each search service, in addition to the commonly seen keyword matching searches, various efforts have been made to suggest and list related papers in text.
Keyword search | The most common method: find articles that mention the desired keywords. Results are returned as a list ranked by relevance. |
Thesaurus (controlled vocabulary) | A thesaurus is like a dictionary that organizes relationships between words: synonyms, hierarchical structure, hypernyms and hyponyms, and so on. Using these controlled terms makes it easier to find articles via keywords or words related to the topic. |
Word2Vec | A method used in many natural language processing tasks, such as sentence summarization, clustering, and information retrieval. It exploits the semantic relationships between words, which can be expressed by word vectorization, for retrieval. |
BERT | An algorithm developed by Google that learns word relationships. It learns by predicting masked words in a sentence and by judging whether two sentences are semantically continuous. |
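As a minimal illustration of how vector-based methods such as Word2Vec rank results, the sketch below compares toy word vectors by cosine similarity. The vectors and their values are invented purely for illustration; real embeddings are learned from large corpora.

```python
import math

# Invented toy word vectors (illustration only).
vectors = {
    "diabetes": [0.9, 0.1, 0.3],
    "insulin":  [0.8, 0.2, 0.4],
    "galaxy":   [0.1, 0.9, 0.0],
}

def cosine(a, b):
    """Cosine similarity: angle-based closeness of two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A semantically related pair scores higher than an unrelated pair.
print(cosine(vectors["diabetes"], vectors["insulin"]))   # high
print(cosine(vectors["diabetes"], vectors["galaxy"]))    # low
```

A retrieval system built on this idea vectorizes the query, computes cosine similarity against every document vector, and returns the documents sorted by that score.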
Natural language processing AI and its mechanism, which is widely used for article searches
To use computers and AI for article searches, natural language must first be converted into a form AI can handle. Natural language means the languages we use in everyday life, such as Japanese and English, and natural language processing AI is AI (artificial intelligence) trained on natural language so that it can be processed mechanically. Two ideas are involved here: the "distributional hypothesis" and the embedding of meaning through "word vectorization".
Distributional hypothesis: "The meaning of a word is determined by the words around it."
Many of the methods used to represent words as vectors in natural language processing (NLP) are based on the "distributional hypothesis." Proposed in the 1950s, this is one of the important ideas in natural language processing: meaning is formed not by a word itself but by the surrounding words and context.
What it means to vectorize a language
In natural language processing, "vectorization" is the process of converting language, which is unstructured (non-numerical) data, into a form that computers can handle. This allows linguistic information to be represented numerically and processed by computers. The information about a word's surrounding words is said to be "embedded" into a combination of numbers, that is, a vector.
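The distributional hypothesis can be made concrete with a toy co-occurrence count: words used in similar contexts accumulate overlapping context counts, and those counts can serve as crude vectors. This three-sentence corpus with a one-word window is only a sketch of the general idea, not any particular system's algorithm.

```python
from collections import Counter, defaultdict

# Tiny corpus; in practice this would be millions of abstracts.
corpus = [
    "the drug treats the disease",
    "the drug cures the disease",
    "the cat chases the mouse",
]

window = 1  # how many neighbors on each side count as "context"
cooc = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[w][words[j]] += 1

# "treats" and "cures" share the context word "drug", so their
# co-occurrence vectors overlap; "chases" never co-occurs with "drug".
print(cooc["treats"])
print(cooc["cures"])
print(cooc["chases"])
```

Under the distributional hypothesis, the overlap between the "treats" and "cures" context counts is exactly what signals that the two words are used with similar meaning.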
“Discovery concept search” realized by FRONTEO’s AI “KIBIT”
Deep learning, and the Transformer architecture in particular, underpins the large language models that are such a hot topic today. By contrast, FRONTEO's in-house-developed AI "KIBIT" is a form of machine learning that uses a different algorithm from deep learning.
The paper search AI "KIBIT Amanogawa", powered by "KIBIT", is a "discovery concept search" system that learns from all papers in PubMed and applies FRONTEO's own natural language processing AI technology to paper searching.
"KIBIT Amanogawa" has a powerful ability to find similarities and relationships across all PubMed articles. It can:
・find "sentences with similar concepts" to the input words and sentences
・reach unknown information in other fields through "virtual addition and subtraction of concepts"
This cutting-edge approach is realized by an algorithm based on the distributional hypothesis, which accurately lists highly similar and related papers, and by vectorization, which makes it possible to find papers through figurative expressions using the addition and subtraction (operations) of meaning.
Searching for papers using AI requires both an appropriate database and an AI engine.
When searching for papers, it is natural to choose databases containing high-quality papers best suited to the research topic, but the performance of the AI engine also has a major impact on the success of the search. In other words, the combination of an appropriate database and an effective AI engine is essential for successful article searching.
Unlike generative AI, which is suitable for writing and summarizing sentences, KIBIT is excellent at detecting similarities.
Generative AI, and in particular ChatGPT, a hot topic since late 2022, is attracting attention for its excellent ability to create and summarize text; what supports this is an algorithm specialized in generating natural sentences.
While generative AI algorithms such as ChatGPT* and BERT* are specialized for generating sentences, their accuracy is limited when it comes to finding linguistic similarities. This is because generative AI approaches are not optimized for document comparison tasks such as similarity comparison and word weighting.
Based on an algorithm rooted in distributional hypotheses, KIBIT Amanogawa can find the mathematical closeness, or similarity, between words with high accuracy. This is because KIBIT does not aim at sentence generation, but instead focuses on language understanding itself.
*GPT = Generative Pre-trained Transformer; BERT = Bidirectional Encoder Representations from Transformers. Both are language processing methods based on the deep learning model Transformer (a natural language processing model announced in 2017).
Why KIBIT Amanogawa is highly accurate and enables new discoveries
KIBIT Amanogawa's accuracy in finding highly relevant and similar papers comes from a unique vectorization method that differs from other search algorithms.
One point is that it analyzes both words and sentences; another is that the algorithm is built faithfully to the distributional hypothesis. Further strengths are word weighting, which allows appropriate searches even for rare words, and analysis through the addition and subtraction of word meanings.
An algorithm surpassing Google's
KIBIT Amanogawa's high accuracy in the task of finding similar documents has been demonstrated: it achieved approximately 15% higher accuracy than BioBERT, a model based on Google's natural language processing model BERT (Yamada et al., 2020).
In language vectorization, algorithms derived from the distributional hypothesis have been reported to outperform the Transformer used in BERT and generative AI, which is one reason for KIBIT Amanogawa's high accuracy. Furthermore, in FRONTEO's own verification, KIBIT Amanogawa proved more accurate than Word2Vec (an algorithm developed by Google), which is also derived from the distributional hypothesis, demonstrating that it is a better algorithm than Google's for identifying similarities and relationships.
Unique technology that analyzes sentences and words
KIBIT Amanogawa evaluates similarity based on the proximity of features, whether the inputs are words or sentences. It vectorizes search terms (newly entered words or sentences) using a unique approximation formula, compares them with the vectorized PubMed data, and presents results in order of similarity.
Analyzing words and sentences together is technically described as analyzing them in the same vector space. "Space" here is a mathematical term meaning a set, that is, a collection of elements that share the same properties.
Other methods, such as Word2Vec, treat words and sentences as having different properties, belonging to different vector spaces. Similarity is therefore usually evaluated word-to-word or sentence-to-sentence, computed within each respective space. KIBIT Amanogawa's algorithm, however, approximates words and sentences in the same vector space, which means it can treat words and sentences as comparable elements. This method improved the accuracy of identifying highly similar papers compared with word-to-word or sentence-to-sentence methods, and it recognizes a wide range of contexts and differences in word meaning with high accuracy. Treating words and sentences in the same vector space is a patented, groundbreaking approach.
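KIBIT Amanogawa's approximation formula itself is proprietary, but the general idea of placing a sentence in the same vector space as its words can be sketched with a common baseline: averaging word vectors. The vectors below are invented for illustration.

```python
import math

# Invented word vectors; the real approximation formula is proprietary,
# and averaging is only one common baseline for projecting a sentence
# into the same space as its words.
word_vecs = {
    "rare":    [1.0, 0.0],
    "disease": [0.0, 1.0],
    "therapy": [0.1, 0.9],
}

def sentence_vec(sentence):
    """Project a sentence into the word space by averaging its word vectors."""
    vs = [word_vecs[w] for w in sentence.split() if w in word_vecs]
    return [sum(component) / len(vs) for component in zip(*vs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# The sentence vector lives in the same space as any single word vector,
# so word-to-sentence similarity is computed exactly like word-to-word.
v = sentence_vec("rare disease")
print(cosine(v, word_vecs["therapy"]))
```

Once words and sentences share one space, a single similarity function covers word-word, word-sentence, and sentence-sentence comparisons, which is the practical benefit the text describes.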
Faithfully represents the distribution hypothesis
KIBIT Amanogawa's algorithm thoroughly follows the distributional hypothesis; technically speaking, it captures the relationships between words and sentences based on the co-occurrence relationships of words. For example, this resembles the process by which children learn words, that is, acquiring a word's meaning and usage from "the context in which the new word is used."
Furthermore, AI has no human-like biases (preconceived ways of thinking). Because it objectively analyzes context based solely on the distributional hypothesis, that is, the patterns of words occurring around a term, it presents useful information purely from the associations among documents and words.
Weighting of words allows for appropriate searches even for rare words
KIBIT Amanogawa also excels at finding highly relevant papers from low-frequency, or rare, words. This is because the algorithm automatically assigns appropriate weights to words based on their frequency. Specifically, a word that appears many times in the PubMed corpus receives a low weight, while a rare word that appears few times receives a high weight. This makes it possible to search for rare topics.
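Frequency-based weighting of this kind can be illustrated with the standard inverse document frequency (IDF) formula. The document counts below are invented, and KIBIT's actual weighting scheme may well differ; the point is only that rarer words receive larger weights.

```python
import math

# Hypothetical document frequencies in a corpus of 1,000,000 abstracts.
N = 1_000_000
doc_freq = {"the": 990_000, "cancer": 120_000, "progeria": 150}

def idf(word):
    """Inverse document frequency: the rarer the word, the higher the weight."""
    return math.log(N / doc_freq[word])

for w in ("the", "cancer", "progeria"):
    print(w, round(idf(w), 2))
```

With such a scheme, a query mentioning a rare disease name is dominated by that name rather than by ubiquitous words like "the", which is what enables searches on rare topics.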
Adding and subtracting concepts leads to new "discoveries" and ideas
Once language is vectorized, that is, converted into numbers, it becomes possible to perform operations on the meanings of words, or concepts. A famous example, shown in the figure, involves "king" and "queen": subtracting the "male" element from "king" and adding the "female" element yields a vector close to "queen".
Using this property, KIBIT Amanogawa can "virtually add or subtract specific concepts (meanings) within the PubMed database" to obtain information that conventional searches would not find, and to generate new ideas. As shown in the figure, using information on diabetes (the disease) and GLP1R (a diabetes target), the database can be searched through semantic calculation for target candidates corresponding to another disease, simple diabetic retinopathy (SDR).
In addition, when searching for a certain gene, you can remove (subtract) the "concept of cancer" contained in that gene's vector and virtually search for "a gene without the concept of cancer", obtaining search results that lead to even more innovative ideas.
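The addition and subtraction of concepts can be sketched with the classic king/queen toy example. The two-dimensional vectors are invented (dimension 0 standing in for "royalty", dimension 1 for "gender"); real embeddings have hundreds of dimensions.

```python
import math

# Invented 2-D "concept" vectors: dim 0 ~ royalty, dim 1 ~ gender.
vecs = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman: subtract the "male" element, add the "female" one.
target = add(sub(vecs["king"], vecs["man"]), vecs["woman"])

# Nearest remaining word to the resulting vector.
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

Searching "virtually with concepts subtracted" works the same way: instead of querying a word's raw vector, the system queries the result of the arithmetic and ranks documents against that modified vector.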
Article search AI opens up the future of drug discovery to solve drug discovery challenges
Drug discovery, that is, the development of pharmaceuticals, takes a long time and costs a great deal of money, and the declining success rate of bringing drugs to market is a major issue.
To increase the success rate of "first-in-class" drug discovery (drugs with new mechanisms of action discovered through unconventional approaches), it is essential to select targets at the start of drug discovery and to generate the hypotheses that determine the course of research and development. AI drug discovery support companies that provide such hypothesis generation are extremely rare, not only in Japan but worldwide. One technique for overcoming this problem is article searching with KIBIT Amanogawa.
Unbiased AI brings serendipitous encounters and supports discovery
KIBIT Amanogawa has three characteristics: unbiasedness, serendipity, and discovery.
"Unbiased" refers to the ability to analyze data objectively and comprehensively, without being skewed by the researcher's personal knowledge, interests, or particular journals. "Serendipity" is the ability to discover unexpected information and documents that standard keyword-based searches cannot predict, allowing related information to be found in different fields. Through serendipitous encounters with papers enabled by unbiased analysis, researchers can arrive at new ideas, or "discoveries", and advance further exploration and research.
This solves the problem of being unable to find truly related or similar papers based solely on shared keywords: by sorting information according to conceptual relevance, "conceptually relevant information" is displayed with higher priority.
At this point, the correct answer the searcher expected may not appear at the top, or some candidates may look like noise. However, these are unbiased results found from the entire database according to consistent rules. We should change our way of thinking about such unexpected results: thanks to AI, we were able to discover information across the entire database that we could not find with the concepts we had grasped ourselves. The "concepts" embedded across an entire database sometimes exceed the scope of human comprehension.
"KIBIT Amanogawa" is an AI system based on the same logic as the human language acquisition process.
Since the "distributional hypothesis" resembles the process by which children acquire words, building a natural language processing AI on this hypothesis can be said to be the optimal approach for handling linguistic information effectively.
The paper search AI "KIBIT Amanogawa" uses an algorithm faithful to the distributional hypothesis to realize efficient, context-aware language processing. This also connects to the idea underlying FRONTEO's AI development: reproducing human thought processes with mathematical algorithms.
AI can capture subtle nuances that are difficult to verbalize and find true similarities in papers.
High similarity between sentences can be described as the impression that their nuances are alike, yet this is difficult and subtle to explain in words. By representing language numerically, however, AI can capture subtle nuances that words cannot express and bring them to our attention without the biases humans carry.
KIBIT Amanogawa is an indispensable article search AI system for pharmaceutical researchers working on new drug development: with algorithms specialized for analyzing and discovering article information, it provides new knowledge and supports the researchers who create groundbreaking ideas.