Technology

The rapid accumulation of experimental data presents both opportunities and challenges for scientific research. While vast datasets enable deeper insights and evidence-based discoveries, they also create obstacles such as information overload, data reliability issues, and normalization and integration difficulties. The sheer volume of scientific information exceeds human processing capacity, making interpretation increasingly complex.

LLMs address the information overload problem by learning from massive amounts of text data, but their summarization and synthesis capabilities come with significant limitations: they are enormously computationally intensive, lack transparency in tracing information back to its sources, and may generate unreliable outputs, particularly when dealing with scarce information.

Our technological approach overcomes these barriers by splitting knowledge processing and learning into three distinct phases: entity recognition, natural language processing, and knowledge learning. The result is a significantly faster, more robust, and more interpretable system that finds and summarizes relevant information quickly and efficiently.
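As a rough illustration of how the three phases compose, consider the skeleton below. The function names are hypothetical placeholders, not our actual API; each phase is sketched in more detail in its section further down.

```python
# Hypothetical skeleton of the three-phase pipeline; all names are illustrative.

def recognize_entities(text: str) -> str:
    """Phase 1: mark normalized taxonomy entities found in the raw text."""
    ...  # dictionary-based matching, sketched in the Entity recognition section

def parse_to_triplets(annotated_text: str) -> list:
    """Phase 2: split sentences into Subject-Verb-Object (SVO) triplets."""
    ...  # deterministic parsing, sketched in the Natural Language Processing section

def learn_knowledge_model(triplets: list) -> dict:
    """Phase 3: cluster triplets into relationship types between entity types."""
    ...  # clustering, sketched in the Knowledge learning section

def process_corpus(documents: list) -> dict:
    """Chain the three phases over a corpus of documents."""
    triplets = [t for doc in documents
                for t in parse_to_triplets(recognize_entities(doc))]
    return learn_knowledge_model(triplets)
```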

Below are the key components and highlights of our technology.


Entity recognition

  • Based on proprietary normalized biomedical taxonomy covering millions of entities, including gene/protein names, diseases, biological processes, cells, tissues, organs, chemicals, medical procedures, etc.
  • Taxonomies are compiled and cross-linked from public taxonomies and ontologies
  • Taxonomies undergo multiple rounds of cleaning and enrichment to ensure high detection accuracy
  • Highly accurate and efficient algorithm for matching and marking taxonomy entities in text corpora (a simplified sketch follows this list)
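As a toy illustration of dictionary-based entity matching, the sketch below uses a greedy longest-match lookup. The miniature taxonomy and its identifiers are illustrative stand-ins, not our actual normalized biomedical taxonomy or matching algorithm.

```python
# Toy taxonomy: surface form -> (entity type, normalized identifier).
# Entries and identifiers are illustrative only.
TAXONOMY = {
    "p53": ("GENE", "TP53"),
    "tp53": ("GENE", "TP53"),
    "breast cancer": ("DISEASE", "MESH:D001943"),
    "apoptosis": ("BIOLOGICAL_PROCESS", "GO:0006915"),
}
MAX_ENTITY_WORDS = 2  # longest multi-word entity in this toy dictionary

def recognize_entities(sentence: str):
    """Greedy longest-match lookup of taxonomy entities in a sentence."""
    tokens = [t.strip(".,;:()") for t in sentence.lower().split()]
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for span in range(min(MAX_ENTITY_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + span])
            if candidate in TAXONOMY:
                entity_type, normalized_id = TAXONOMY[candidate]
                matches.append((candidate, entity_type, normalized_id))
                i += span
                break
        else:
            i += 1  # no entity starts at this token
    return matches

print(recognize_entities("P53 mutations are frequent in breast cancer."))
```

Trying longer spans first ensures that multi-word entities such as "breast cancer" win over any single-word prefixes they might contain.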

Natural Language Processing

  • Proprietary lexicon, grammar and fast deterministic parsing algorithm
  • Lightning-fast text processing of up to 10,000 sentences per second
  • Sentences are split into Subject-Verb-Object (SVO) triplets
  • Triplets capture the grammatical structure of Subject and Object and are further characterized by a number of linguistic properties
  • SVO Triplets form the building blocks for information summarization
  • Triplets are traceable back to the source documents and specific sentences
  • Triplets are additionally compressed into short token fingerprints/signatures representing the essence of their meaning (the relationship); a simplified sketch follows this list
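The sketch below illustrates one possible shape for a traceable SVO triplet and its fingerprint. The fields, the hash-based signature, and the example document identifier are assumptions made for illustration; the actual parser, linguistic properties, and signature scheme are proprietary.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Triplet:
    """One SVO triplet, traceable back to its source document and sentence."""
    subject: str
    verb: str
    obj: str
    doc_id: str        # traceability: source document (hypothetical identifier)
    sentence_no: int   # traceability: sentence within the document
    properties: dict   # linguistic properties (e.g. negation, voice)

    def fingerprint(self) -> str:
        """Compress the triplet into a short signature of its core relationship.
        Here: a truncated hash of the normalized S/V/O parts; the real
        signature scheme may differ."""
        core = "|".join(p.lower().strip() for p in (self.subject, self.verb, self.obj))
        return hashlib.sha1(core.encode()).hexdigest()[:12]

t = Triplet(
    subject="TP53", verb="regulates", obj="apoptosis",
    doc_id="doc-123",  # hypothetical source document id
    sentence_no=4,
    properties={"negated": False, "voice": "active"},
)
print(t.subject, t.verb, t.obj, "->", t.fingerprint())
```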

Knowledge learning

  • Knowledge model is defined as a set of possible relationships between different types of entities
  • Individual triplets describe specific types of relationship
  • Knowledge model is learned from corpora by clustering similar triplets connecting individual pairs of entities
  • Clustering procedure can be guided by domain experts (see the sketch below)
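A minimal sketch of the idea follows, with a hand-written verb-synonym map standing in for the learned, expert-guidable clustering; all verbs, labels, and entity types here are illustrative.

```python
from collections import defaultdict

# Illustrative verb clusters that a domain expert might confirm or adjust.
VERB_CLUSTERS = {
    "regulates": "REGULATION", "controls": "REGULATION", "modulates": "REGULATION",
    "inhibits": "INHIBITION", "suppresses": "INHIBITION",
}

def learn_knowledge_model(triplets):
    """Aggregate triplets into possible relationship types between entity types."""
    model = defaultdict(set)  # (subject_type, object_type) -> relationship labels
    for subject_type, verb, object_type in triplets:
        label = VERB_CLUSTERS.get(verb, verb.upper())
        model[(subject_type, object_type)].add(label)
    return dict(model)

triplets = [
    ("GENE", "regulates", "BIOLOGICAL_PROCESS"),
    ("GENE", "suppresses", "BIOLOGICAL_PROCESS"),
    ("CHEMICAL", "inhibits", "GENE"),
]
print(learn_knowledge_model(triplets))
```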

Indexing and searching

  • Triplets are indexed using proprietary indexing engine
  • S/V/O parts of triplets are indexed separately, which allows construction of complex queries and retrieval of information with high accuracy
  • User queries are transformed into a search plan over the S/V/O indexes to find and retrieve relevant triplets (see the sketch below)
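The following sketch shows the core idea of per-part inverted indexes and intersection-based retrieval. The class and its methods are hypothetical; the production indexing engine is proprietary.

```python
from collections import defaultdict

class TripletIndex:
    """Toy per-part inverted indexes over S/V/O triplets."""

    def __init__(self):
        self.by_subject = defaultdict(set)  # subject -> triplet ids
        self.by_verb = defaultdict(set)     # verb -> triplet ids
        self.by_object = defaultdict(set)   # object -> triplet ids
        self.triplets = {}                  # triplet id -> (S, V, O)

    def add(self, tid, subject, verb, obj):
        self.triplets[tid] = (subject, verb, obj)
        self.by_subject[subject].add(tid)
        self.by_verb[verb].add(tid)
        self.by_object[obj].add(tid)

    def search(self, subject=None, verb=None, obj=None):
        """Simple query plan: intersect postings for the constrained parts."""
        postings = [index[key] for index, key in
                    ((self.by_subject, subject),
                     (self.by_verb, verb),
                     (self.by_object, obj))
                    if key is not None]
        hits = set.intersection(*postings) if postings else set()
        return [self.triplets[tid] for tid in hits]

index = TripletIndex()
index.add(1, "TP53", "regulates", "apoptosis")
index.add(2, "TP53", "inhibits", "MDM2")
print(index.search(subject="TP53", verb="regulates"))
```

Because each part is indexed separately, a query can constrain any combination of subject, verb, and object, which is what enables complex, high-precision searches.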

Summarization and Categorization

  • Learned knowledge model rules are applied to retrieved triplets to categorize them into topics
  • Linguistic properties of triplets are used for sorting and filtering to identify the most relevant ones
  • Summary is constructed from triplets that belong to specific topics (a simplified sketch follows)
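As a final illustration, the sketch below categorizes retrieved triplets into topics with simple rules and orders them by a relevance score. The rules, topic names, and scores are illustrative stand-ins for the learned knowledge-model rules and the sorting/filtering on real linguistic properties.

```python
# Illustrative learned rules: relationship label -> topic.
TOPIC_RULES = {"REGULATION": "Gene regulation", "INHIBITION": "Drug effects"}

def summarize(retrieved):
    """retrieved: list of (sentence_text, relationship_label, relevance_score)."""
    topics = {}
    for text, label, relevance in retrieved:
        topic = TOPIC_RULES.get(label, "Other")
        topics.setdefault(topic, []).append((relevance, text))
    # Highest-relevance triplets first within each topic; the score stands in
    # for sorting/filtering on the triplets' linguistic properties.
    return {topic: [text for _, text in sorted(items, reverse=True)]
            for topic, items in topics.items()}

retrieved = [
    ("TP53 regulates apoptosis", "REGULATION", 0.9),
    ("Nutlin-3 inhibits MDM2", "INHIBITION", 0.8),
]
print(summarize(retrieved))
```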

To learn more about AI Evolution Labs, please email us at: