LCMs vs LLMs
Large Concept Models (LCMs) represent a new paradigm in language modeling that operates at a higher level of abstraction than traditional Large Language Models (LLMs). Instead of processing information at the token level, LCMs work with "concepts," which represent higher-level ideas or actions. The core idea of LCMs is to decouple reasoning from language representation: rather than just manipulating words, as LLMs do, they aim to capture the semantics of concepts and the relationships between them. LLMs are like expert linguists, whereas LCMs are like philosophers.
While there may be some overlap, and even complementary aspects, between LCMs and LLMs, they differ in key ways.
Why are LCMs important?
LCMs represent a significant departure from traditional LLMs and have the potential to revolutionize how machines process and generate language. Because they focus on underlying semantic meaning rather than specific linguistic representations, LCMs can be trained on data in multiple languages and modalities (text, speech, images, etc.) simultaneously. Operating at the concept level also facilitates the generation of coherent long-form output and makes it easier for humans to understand and interact with the model.
LLMs can be fooled by adversarial examples or produce nonsensical outputs because they don't genuinely understand the concepts they are using. LLMs struggle with complex reasoning tasks that require applying commonsense or making inferences beyond the patterns they've learned from text. Again, the core idea of LCMs is to decouple reasoning from language representation. This is similar to how humans plan their thoughts before communicating them verbally. A presenter might have the same core ideas for a presentation (represented by their slides) but use different words each time they deliver it.
By building LCMs that understand formal representations of concepts and their relationships in specific domains, we can potentially create more robust, reliable, and explainable AI systems.
Ontologies Guiding the Development of LCM Embedding Spaces
Embedding spaces are mathematical representations in which concepts, words, or other entities are represented as vectors in a high-dimensional space. The positions of vectors, and the distances between them, are intended to capture semantic relationships.
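As a minimal illustration of such a space (the vectors below are hand-picked toys, not learned embeddings), semantic relatedness is often measured with cosine similarity between vectors:

```python
# Toy illustration: concepts as vectors, with cosine similarity as a proxy
# for semantic relatedness. Values are hand-picked, not learned.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = {
    "drug":       np.array([0.9, 0.8, 0.1]),
    "medication": np.array([0.8, 0.9, 0.2]),
    "galaxy":     np.array([0.1, 0.1, 0.9]),
}

print(cosine_similarity(embeddings["drug"], embeddings["medication"]))  # related concepts -> high
print(cosine_similarity(embeddings["drug"], embeddings["galaxy"]))      # unrelated concepts -> low
```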
Ontologies provide the explicit, structured representation of concepts and relationships that LCMs need to perform these tasks effectively. They are a natural fit because they provide the very thing LCMs aim to model. Ontologies are symbolic representations, and LCMs are designed to work with this type of knowledge. While more data is generally better, LCMs can achieve meaningful results using well-defined ontologies, even when those ontologies do not contain massive amounts of data.
Ontologies can be used to create conceptually meaningful embedding spaces for LCMs. Ontology-guided embeddings can be trained to respect the hierarchical and relational structure defined in an ontology. For instance, the embedding for "dog" would be closer to the embedding for "mammal" than to "fish," reflecting the class hierarchy in the ontology. The embedding for "worksFor" would carry a specific relational meaning connecting an "employee" and an "organization" embedding.
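One way to encourage this behavior (a hedged sketch, not a prescribed LCM training method) is a triplet margin loss over (child class, parent class, unrelated class) triples derived from the ontology's hierarchy:

```python
# Hedged sketch: pushing embeddings to respect an ontology's class hierarchy
# with a triplet margin loss. The (child, parent, unrelated) triples are assumed
# to be extracted from the ontology; the vectors here are random stand-ins.
import torch
import torch.nn.functional as F

def hierarchy_triplet_loss(child, parent, unrelated, margin=1.0):
    # Pull the child toward its parent class and push it away from an
    # unrelated class by at least `margin`.
    d_pos = F.pairwise_distance(child, parent)
    d_neg = F.pairwise_distance(child, unrelated)
    return F.relu(d_pos - d_neg + margin).mean()

dim = 64
dog, mammal, fish = (torch.randn(1, dim, requires_grad=True) for _ in range(3))
loss = hierarchy_triplet_loss(dog, mammal, fish)
loss.backward()  # gradients nudge "dog" toward "mammal" and away from "fish"
```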
Instead of just representing words, embeddings can be created for the concepts defined in the ontology. Two words can be synonyms and have different word embeddings, but they could point to the same concept embedding. These concept embeddings capture the meaning and semantic properties of the concept, as defined in the ontology, rather than just the statistical properties of its associated words. The axioms and constraints in an ontology can be used to guide the training of embeddings, ensuring that the resulting embedding space is consistent with the logical rules of the domain. By aligning ontologies or mapping concepts between different ontologies, embeddings can be developed that facilitate reasoning across different domains.
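A minimal sketch of the word-versus-concept distinction: two synonymous surface forms resolve to one concept identifier and therefore share a single concept embedding (the lexicon and identifier below are illustrative placeholders, not real ontology data):

```python
# Sketch: synonyms map to the same concept, and therefore to the same embedding.
# The concept ID and lexicon are illustrative placeholders.
import numpy as np

concept_embeddings = {"C_ASPIRIN": np.random.rand(64)}  # one vector per concept

word_to_concept = {
    "aspirin": "C_ASPIRIN",
    "acetylsalicylic acid": "C_ASPIRIN",  # synonym -> same concept
}

def concept_vector(word: str) -> np.ndarray:
    return concept_embeddings[word_to_concept[word.lower()]]

# Different word embeddings could exist for these strings, but the concept
# embedding they resolve to is identical:
assert concept_vector("Aspirin") is concept_vector("acetylsalicylic acid")
```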
It's not necessarily that ontologies "work better" with LCMs than LLMs in an absolute sense, but rather that they are more naturally aligned with the goals and mechanisms of LCMs and address some inherent limitations of LLMs. While LLMs can learn some implicit knowledge about concepts from vast amounts of text, they are not inherently designed for structured knowledge representation or logical reasoning. Their primary goal is to predict and generate text, not to build a deep model of the world's concepts.
Ontologies are essential for LCMs because they provide the formal framework for representing concepts, relationships, and constraints. They are key to developing embedding spaces specifically designed for LCMs. These embeddings, in turn, empower LCMs with enhanced reasoning, generalization, and explainability capabilities, moving us closer to truly intelligent AI systems.
Scenario: Drug Repurposing using a Biomedical Ontology and an LCM
Identifying new uses for existing drugs (drug repurposing) is a crucial area of research. It's faster and more cost-effective than developing entirely new drugs. It requires understanding complex relationships between drugs, diseases, genes, proteins, and biological pathways. We can use a hybrid approach: combine a biomedical ontology (such as the Unified Medical Language System, UMLS) with a large corpus of biomedical literature (e.g., PubMed abstracts), and jointly train embeddings on both ontological and textual data to build the LCM.
Ontology-Based Embedding Initialization: Initialize the embeddings for concepts and relationships based on the UMLS ontology, using a pre-training step in which the model learns embeddings solely from the UMLS.
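A hedged sketch of what this pre-training step could look like, using a simple TransE-style objective (head + relation should land near tail) over ontology triples; the triple list, dimensions, and training details are illustrative assumptions, and negative sampling is omitted for brevity:

```python
# Hedged sketch: initialize concept and relation embeddings from ontology
# triples alone. `umls_triples` is a tiny illustrative stand-in for real UMLS content.
import torch
import torch.nn as nn

umls_triples = [("Aspirin", "TREATS", "Inflammation")]  # illustrative only

concepts = sorted({h for h, _, _ in umls_triples} | {t for _, _, t in umls_triples})
relations = sorted({r for _, r, _ in umls_triples})
c_idx = {c: i for i, c in enumerate(concepts)}
r_idx = {r: i for i, r in enumerate(relations)}

dim = 128
concept_emb = nn.Embedding(len(concepts), dim)
relation_emb = nn.Embedding(len(relations), dim)

def ontology_score(h, r, t):
    # Lower score = the triple is better satisfied by the current embeddings.
    return torch.norm(concept_emb(h) + relation_emb(r) - concept_emb(t), dim=-1)

params = list(concept_emb.parameters()) + list(relation_emb.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for _ in range(100):  # pre-training solely on the ontology (no text yet)
    h, r, t = zip(*[(c_idx[a], r_idx[b], c_idx[c]) for a, b, c in umls_triples])
    loss = ontology_score(torch.tensor(h), torch.tensor(r), torch.tensor(t)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```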
Textual Data Preparation: Process the PubMed abstracts to identify mentions of UMLS concepts. This can be done using entity linking tools that map phrases in the text to UMLS concept unique identifiers (CUIs). Create text-based triples. If an abstract states, "Aspirin has been shown to reduce inflammation," and "Aspirin" and "inflammation" are linked to their respective UMLS CUIs, create a triple: ("Aspirin", "reduces", "inflammation"). Note: "reduces" might not be a formal UMLS relationship, but we capture it from the text.
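A hedged sketch of this step; a real pipeline would use a dedicated UMLS entity linker, whereas here a tiny dictionary lexicon and a crude relation pattern stand in, and the CUIs shown are illustrative:

```python
# Hedged sketch: naive dictionary-based entity linking plus a crude relation
# pattern. The lexicon, CUIs, and abstract are illustrative placeholders.
import re

lexicon = {"aspirin": "C0004057", "inflammation": "C0021368"}  # phrase -> CUI

def link_entities(text: str):
    found = []
    for phrase, cui in lexicon.items():
        for m in re.finditer(re.escape(phrase), text.lower()):
            found.append((m.start(), phrase, cui))
    return sorted(found)

def text_triples(text: str):
    mentions = link_entities(text)
    triples = []
    # Crude heuristic: a verb-like token between two linked mentions becomes the relation.
    for (s1, p1, c1), (s2, _, c2) in zip(mentions, mentions[1:]):
        span = text.lower()[s1 + len(p1):s2]
        verb = re.search(r"\b(reduces?|treats?|inhibits?)\b", span)
        if verb:
            triples.append((c1, verb.group(1), c2))
    return triples

abstract = "Aspirin has been shown to reduce inflammation."
print(text_triples(abstract))  # e.g., [('C0004057', 'reduce', 'C0021368')]
```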
Joint Loss Function: The key to joint training is the loss function. We'll use a combined loss function that considers both of the following components (a code sketch combining them appears after the two descriptions below):
Ontology Loss: This part of the loss function encourages the model to preserve the relationships defined in the UMLS, similar to the previous example. For instance, if ("Aspirin", "TREATS", "Inflammation") is in the UMLS, the model is penalized if the embeddings don't reflect this relationship. We want to retain the knowledge we already have.
Textual Loss: This part of the loss function encourages the model to learn from the co-occurrence of concepts in the text. For example, if "Aspirin" and "inflammation" frequently appear close together in PubMed abstracts, their embeddings should be closer, even if "TREATS" is not explicitly stated between them in that abstract. We also want to learn new information that may not yet be in our ontology.
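The sketch below shows one way these two terms could be implemented and combined with a weighting coefficient alpha; the margin-based ontology term, the skip-gram-style textual term, and the function names are assumptions, not a prescribed formulation:

```python
# Hedged sketch of a combined objective: an ontology term that preserves UMLS
# triples plus a textual term that pulls together co-occurring concepts.
import torch
import torch.nn.functional as F

def ontology_loss(h, r, t, h_neg, t_neg, margin=1.0):
    # Margin loss over a TransE-style score: true UMLS triples should score
    # lower (better) than corrupted ones.
    pos = torch.norm(h + r - t, dim=-1)
    neg = torch.norm(h_neg + r - t_neg, dim=-1)
    return F.relu(pos - neg + margin).mean()

def textual_loss(a, b):
    # Skip-gram-style term: concepts that co-occur in abstracts should have
    # similar embeddings (negative sampling omitted for brevity).
    return -F.logsigmoid((a * b).sum(dim=-1)).mean()

def joint_loss(onto_batch, text_batch, alpha=0.5):
    # alpha balances preserving existing knowledge vs. learning from text.
    return alpha * ontology_loss(*onto_batch) + (1 - alpha) * textual_loss(*text_batch)
```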
Joint Training Iterations:
Iterate through the training data, sampling both ontology triples and text-derived triples. For each triple, calculate the loss and update the embeddings using an optimization algorithm. This process updates the embeddings to simultaneously satisfy the constraints of the ontology and the patterns observed in the text.
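A hedged sketch of such an iteration, reusing the embedding tables from the initialization sketch and the joint_loss above; the batch samplers draw random indices as placeholders for real UMLS and PubMed batches:

```python
# Hedged sketch of the joint training loop. Reuses `concept_emb` / `relation_emb`
# (initialization sketch) and `joint_loss` (loss sketch); the samplers below are
# placeholders that draw random indices instead of real UMLS / PubMed batches.
import torch

def sample_ontology_batch(batch_size=64):
    c = lambda: concept_emb(torch.randint(concept_emb.num_embeddings, (batch_size,)))
    r = relation_emb(torch.randint(relation_emb.num_embeddings, (batch_size,)))
    return c(), r, c(), c(), c()  # (h, r, t, h_neg, t_neg) embeddings

def sample_text_batch(batch_size=64):
    c = lambda: concept_emb(torch.randint(concept_emb.num_embeddings, (batch_size,)))
    return c(), c()  # embeddings of two concepts that co-occur in an abstract

params = list(concept_emb.parameters()) + list(relation_emb.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(1000):
    loss = joint_loss(sample_ontology_batch(), sample_text_batch(), alpha=0.5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```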
Improved Inference of New Relationships:
After joint training, the embeddings will have captured a richer representation of biomedical knowledge, integrating both structured (ontology) and unstructured (text) information. This leads to improved inference:
Let's say the UMLS doesn't have a direct relationship between "Aspirin" and "Myocardial infarction." However, many PubMed abstracts might discuss the role of inflammation in myocardial infarction and also mention the anti-inflammatory effects of Aspirin. During joint training, the textual loss will encourage the embeddings of "Aspirin" and "Myocardial infarction" to become closer due to their indirect relationship through "inflammation" as observed in the text. As a result, the model might predict a new potential relationship: "Aspirin" - "MAY TREAT" - "Myocardial infarction," even though it's not explicitly in the UMLS.
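As a hedged sketch of how such predictions could be surfaced (the names, vectors, and similarity measure are illustrative assumptions), one could rank disease concepts by their proximity to a drug's jointly trained embedding and flag pairs that are not already linked in the ontology:

```python
# Hedged sketch: rank diseases by embedding proximity to a drug and surface
# pairs not already linked in UMLS as repurposing hypotheses. The vectors here
# are random stand-ins for jointly trained embeddings.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
drug = "Aspirin"
diseases = ["Myocardial infarction", "Inflammation", "Influenza"]
emb = {name: rng.normal(size=64) for name in [drug] + diseases}  # stand-ins
known_umls_links = {("Aspirin", "Inflammation")}                 # already in the ontology

candidates = [
    (cosine(emb[drug], emb[d]), d)
    for d in diseases
    if (drug, d) not in known_umls_links
]
for score, disease in sorted(candidates, reverse=True):
    print(f"{drug} MAY TREAT {disease}?  similarity={score:.3f}")
```

Such hypotheses would, of course, still require experimental and clinical validation before any repurposing claim is made.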
Complementary rather than Competitive
Joint training of embeddings on both ontological and textual data is a powerful technique for building LCMs that can effectively leverage both structured knowledge and the vast amount of information available in unstructured text. This approach leads to more comprehensive, accurate, and insightful relationship predictions, ultimately accelerating scientific discovery in fields like drug repurposing. It helps to bridge the gap between LLMs and LCMs, bringing the best of both worlds. While still in the early stages of development, LCMs hold immense promise for creating more intelligent, adaptable, and trustworthy AI systems. They are a complementary, rather than competitive, approach to LLMs, and future AI systems may well integrate both language and conceptual understanding for a more holistic intelligence.