Enterprise AI Engineering: Embeddings, RAG, and Multimodal Agents with AWS Bedrock and Nova (Video Course)
Discover how Amazon Nova and Bedrock empower you to build AI systems that process text, images, and video, automate decisions, and extract real value from your data. Learn practical strategies to create robust, enterprise-ready solutions at scale.
Related Certification: Certification in Building and Deploying Enterprise AI Solutions with AWS Bedrock and Nova

What You Will Learn
- Explain embeddings, tokenization, and positional embeddings
- Build RAG pipelines and manage vector databases
- Apply multimodal models (Nova, CLIP, BLIP-2) to text, images, and video
- Use Bedrock Knowledge Bases and Agents for automation
- Design chunking, embedding, and retrieval best practices
Study Guide
Introduction: Why Enterprise AI, Embeddings, RAG, and Multimodal Agents Matter
AI is no longer a buzzword; it's the backbone of modern enterprise innovation.
But the real power of AI for business comes from going beyond simple question-answering. It's about building systems that understand not just text, but images, video, and complex workflows: systems that can search vast proprietary knowledge, automate decisions, and adapt to your organization’s needs, all at scale.
This guide is your deep-dive into the practical side of advanced enterprise AI with Amazon Nova and Bedrock. It covers the essential concepts and hands-on strategies behind:
- Embeddings: The fundamental building blocks that let AI “understand” language, images, and more.
- Retrieval Augmented Generation (RAG): A game-changing technique that gives your AI up-to-date, context-rich knowledge.
- Multimodal Models: AI that processes and reasons across text, images, and video together.
- Bedrock Agents and Knowledge Bases: Enterprise tools for automation, workflow orchestration, and seamless integration with your data and business logic.
Whether you’re just getting started or looking to architect advanced AI solutions, this guide gives you the clarity, real-world examples, and actionable steps needed to turn these technologies into business value.
Foundations: Why Embeddings are the Language of AI
Raw data is noise to a machine. Embeddings turn it into meaning.
At their core, AI models, especially large language models (LLMs), can't “see” text, audio, or video in their natural form. They need everything translated into numbers. That’s where embeddings come in.
Understanding Embeddings: The Bridge from Raw Data to AI Understanding
What are embeddings?
An embedding is a numerical representation (an array or vector of numbers) that captures the “essence” of a piece of data, be it a word, sentence, image, or even a video frame. This transformation is vital, because AI models process vectors, not raw language or pixels.
Why do embeddings matter?
- They allow AI to measure similarity: Two similar pieces of text (or images) will have similar embeddings (vectors “close” together in a high-dimensional space).
- They let AI generalize: Instead of memorizing, the model learns patterns, relationships, and context.
Practical Example 1:
Suppose you have the sentences “The cat sat on the mat” and “A feline rested on the rug.” Their embeddings will be close to each other, even though the words differ, because they share similar meaning.
Practical Example 2:
If you embed an image of a car and the text “red sports car,” a well-trained multimodal embedding model will map both to similar positions in the vector space, enabling applications like visual search or cross-modal retrieval.
Tokenization: The First Step to Embedding Text
Why tokenization?
Before a model can embed text, it needs to break it down into manageable pieces called tokens. Tokenization is how sentences are split into words, subwords, or even characters, depending on the approach.
Simple Tokenization Methods:
- Whitespace splitting: “Hello world!” becomes [“Hello”, “world!”].
- Special characters: Some tokenizers split on punctuation or use regular expressions to handle contractions and compound words.
Why is tokenization crucial?
AI models can’t process entire sentences or documents as a single unit. Breaking them into tokens lets the model learn relationships and patterns at a finer level.
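To make the two simple methods above concrete, here is a minimal sketch in plain Python; the example sentence and the regex are illustrative, not a production tokenizer:

```python
import re

text = "Hello world! Don't split contractions badly."

# Naive whitespace splitting: punctuation stays glued to the words.
whitespace_tokens = text.split()
# ['Hello', 'world!', "Don't", 'split', 'contractions', 'badly.']

# A slightly better regex tokenizer: keep words (including contractions)
# and emit punctuation marks as their own tokens.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
# ['Hello', 'world', '!', "Don't", 'split', 'contractions', 'badly', '.']
```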
The Out-of-Vocabulary (OOV) Problem and Special Tokens
What is the OOV problem?
Suppose your model’s vocabulary was built on a dataset without the word “blockchain.” When the model encounters “blockchain” during inference, it doesn’t know how to process it.
Addressing OOV with Special Tokens:
A common workaround is the introduction of an “unknown” token (<UNK>). When the model sees an unfamiliar word, it substitutes <UNK>. Similarly, tokens like <EOS> (end of sentence) or <PAD> (padding) help define sequence boundaries.
Limitation of this approach:
All unknown words are mapped to the same representation, making it impossible for the model to distinguish between them. This results in loss of specificity and semantic nuance.
Example:
If “blockchain” and “nanotech” are both unknown, the model treats them identically, reducing the quality of its understanding and responses.
Byte Pair Encoding (BPE): The Evolution of Tokenization
BPE solves the OOV challenge by breaking words into subword units.
How does BPE work?
- Start with a character-level vocabulary.
- Count the frequency of all adjacent character pairs in your text corpus.
- Merge the most frequent pair into a new token (e.g., “th” from “t” and “h”).
- Repeat: update counts, merge the next most common pair, and so on.
This iterative process results in a vocabulary of common words, prefixes, suffixes, and subword fragments.
How does BPE handle OOV words?
Any rare or new word can be broken down into known subword tokens. For instance, “blockchain” might become [“block”, “chain”] or even [“bl”, “ock”, “ch”, “ain”], depending on merge history.
Practical Example 1:
The rare word “reforestation” can be tokenized as [“re”, “forest”, “ation”], even if the full word was never seen during training.
Practical Example 2:
A brand-new term like “hyperloop” could be split as [“hyper”, “loop”].
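The merge loop itself is simple enough to sketch in a few lines of Python. The toy corpus and word frequencies below are invented purely for illustration; real BPE implementations (such as GPT-2's tokenizer) run thousands of merges over large corpora:

```python
from collections import Counter

# Toy BPE setup: each word is a tuple of symbols, mapped to its corpus frequency.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(vocab):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(3):
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print(f"merge {step + 1}: {pair}")
# On this tiny corpus the first merges are ('e', 's'), ('es', 't'), ('l', 'o'),
# which is how common fragments like "est" become single subword tokens.
```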
Token IDs and the Embedding Matrix
Assigning Token IDs:
Once tokens are defined, each is assigned a unique integer ID, creating a mapping from tokens to numbers.
The Embedding Matrix:
Imagine a giant table (matrix), where each row is a vector representing one token. The size of each vector (its dimensionality) is a model design choice, often 256, 384, 1024, or higher.
How embeddings are generated and learned:
- Initial embeddings are typically random.
- During model training, these vectors are updated so that tokens with similar meanings end up with similar vectors.
- This is achieved through self-supervised learning objectives, such as predicting the next token in a sequence.
Example:
After training, the embeddings for “cat” and “feline” will be close together, while “cat” and “rocket” will be far apart.
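In PyTorch terms, the embedding matrix is simply a learnable lookup table. A minimal sketch, with the vocabulary size, dimensionality, and token IDs chosen purely for illustration:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 256   # illustrative design choices

# One row per token ID; rows start out random and are refined during training.
token_embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[101, 2054, 2003]])   # a batch with one sequence of made-up IDs
vectors = token_embedding(token_ids)            # shape: (1, 3, 256)
print(vectors.shape)
```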
Dimensionality: How Much Information Can an Embedding Capture?
Why does dimensionality matter?
A higher-dimensional embedding can capture more nuance and relationships, but too high a dimension can lead to inefficiency or redundancy.
Example 1:
A 256-dimensional embedding might be sufficient for basic semantic search, but a 1024-dimensional embedding can capture more subtle differences, useful for tasks like ranking search results or clustering documents.
Example 2:
For image embeddings, higher dimensions allow the model to capture not just color and shape, but also style, texture, and even emotional tone.
Positional Embeddings: Encoding the Order of Tokens
Why is position important?
“The dog chased the cat” is very different from “The cat chased the dog.” Token embeddings alone do not encode order.
Absolute Positional Embeddings:
A unique vector is assigned for each position in the sequence (first word, second word, etc.), and this is added to the corresponding token embedding.
Calculating Positional Embeddings:
Many transformers use sine and cosine functions with varying frequencies to generate these vectors, ensuring that each position has a unique, smoothly varying embedding across dimensions.
Example 1:
For the sentence “AI transforms business,” the word “AI” at position 0 will have a different final embedding than “AI” at position 2 due to its positional embedding.
Example 2:
In long documents, positional embeddings help the model distinguish between an introduction and a conclusion, even if similar words are used in both.
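A sketch of the sine/cosine scheme described above; the sequence length and dimensionality are arbitrary examples:

```python
import torch

def sinusoidal_positional_embeddings(seq_len: int, dim: int) -> torch.Tensor:
    """Sine/cosine positional encodings with varying frequencies, one vector per position."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    freqs = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-torch.log(torch.tensor(10000.0)) / dim)
    )
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(positions * freqs)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(positions * freqs)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_embeddings(seq_len=128, dim=256)
# The model's input is token_embedding + pe, which is why "AI" at position 0
# ends up with a different final embedding than "AI" at position 2.
```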
Self-Supervised Learning and Data Sampling for LLM Training
How do LLMs actually learn?
The classic training objective is to predict the next token given a sequence of previous tokens, a form of self-supervised learning.
Sliding Window Technique:
A window of N tokens is used as input X, and the next token is the target Y. The window slides forward to generate more (X, Y) pairs.
Example:
Given "The quick brown fox jumps," the model sees ["The", "quick", "brown"] → "fox", then ["quick", "brown", "fox"] → "jumps".
Efficient Sampling:
In frameworks like PyTorch, Dataset and DataLoader classes manage batching and sampling, ensuring training is scalable even with massive datasets.
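A minimal sketch of the sliding window as a PyTorch Dataset. Here the target Y is the whole input window shifted by one position, which generalizes the single next-token target described above; the token IDs are placeholders:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    """Sliding-window (X, Y) pairs for next-token prediction."""
    def __init__(self, token_ids, context_size):
        self.token_ids = token_ids
        self.context_size = context_size

    def __len__(self):
        return len(self.token_ids) - self.context_size

    def __getitem__(self, idx):
        x = self.token_ids[idx : idx + self.context_size]
        y = self.token_ids[idx + 1 : idx + self.context_size + 1]  # window shifted by one
        return torch.tensor(x), torch.tensor(y)

token_ids = list(range(1000))                      # stand-in for real token IDs
dataset = NextTokenDataset(token_ids, context_size=8)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
x_batch, y_batch = next(iter(loader))              # shapes: (4, 8) and (4, 8)
```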
Amazon Titan Text Embeddings: Enterprise-Grade Embedding Models
What is Amazon Titan?
Available via Amazon Bedrock, Titan Text Embeddings are pre-trained models built for high-quality, scalable text embedding generation.
Key Features:
- Input Length: Up to 8,192 tokens per request. Longer texts must be split (“chunked”).
- Output Dimensions: Choose from sizes like 1024, 384, or 256. Higher dimensions capture more, but come with computational cost.
- Normalization: Optionally, embeddings can be normalized to unit length. This makes similarity and retrieval consistent.
- Semantic Similarity: Effectiveness is measured by how well semantically related texts are placed close together (high cosine similarity).
Practical Example 1:
Searching a legal document repository: Embed the query and all document chunks. Retrieve the most similar chunks for the user.
Practical Example 2:
Product search: Embedding product descriptions and customer queries to enable semantic search: users find what they mean, not just what they type.
Best Practice:
Always normalize embeddings when using them for retrieval or similarity comparison. This ensures results are based on direction (semantic meaning) rather than length (which can be arbitrary).
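A minimal sketch of generating and comparing Titan embeddings with boto3. It assumes model access has been granted, that the model ID amazon.titan-embed-text-v2:0 is available in your region, and that the request fields (inputText, dimensions, normalize) match the current Titan Text Embeddings V2 schema; verify against the Bedrock documentation before relying on it:

```python
import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def titan_embed(text: str, dimensions: int = 1024, normalize: bool = True) -> np.ndarray:
    """Return a Titan Text Embeddings V2 vector (assumed request/response schema)."""
    body = json.dumps({"inputText": text, "dimensions": dimensions, "normalize": normalize})
    response = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
    return np.array(json.loads(response["body"].read())["embedding"])

query = titan_embed("red sports car")
doc = titan_embed("a crimson two-seater coupe")
# Because both vectors are normalized, the dot product equals cosine similarity.
print(float(np.dot(query, doc)))
```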
Multimodal Models: Bringing Text, Images, and Video Together
Why Multimodality?
The modern enterprise deals with more than just text. Data comes in the form of images, scanned documents, tables, and even videos. Real value comes from systems that can process and connect these modalities.
Examples of Multimodal Applications:
- Image captioning: “Describe what’s in this photo.”
- Visual search: “Find products that look like this image.”
- Visual question answering: “What’s the total in this invoice image?”
Types of Multimodality:
- Input: Accepting text, images, video, or combinations.
- Output: Generating text, images, or video as responses.
- Embeddings: Mapping all modalities into a shared vector space for cross-modal retrieval.
Processing Images with Vision Transformers (ViT)
Why can’t we tokenize images like text?
Text is inherently sequential and discrete, but images are 2D arrays of pixels.
The “Patching” Approach:
- Divide the image into non-overlapping patches (e.g., 16x16 pixels).
- Flatten each patch into a vector (concatenate pixel values).
- Project each patch vector into an embedding space via a linear layer.
- Add positional embeddings to preserve spatial arrangement.
Example 1:
An invoice image is split into patches, each patch carrying information about a small region (e.g., a number, a logo).
Example 2:
A medical X-ray is divided into patches, allowing the model to attend to both local features (e.g., a fracture) and global context (e.g., bone alignment).
Why not use token IDs for images?
Unlike text, there’s no universal “dictionary” of image patches, so each patch is embedded directly, not mapped via IDs.
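A toy sketch of the patching and projection steps; in a real ViT the linear projection is a learned layer inside the model, and positional embeddings are added to the patch embeddings afterwards:

```python
import torch

def image_to_patch_embeddings(image: torch.Tensor, patch_size: int, embed_dim: int) -> torch.Tensor:
    """Split a (C, H, W) image into flattened non-overlapping patches and project them linearly."""
    c, h, w = image.shape
    # Unfold height and width into a grid of patch_size x patch_size tiles.
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
    projection = torch.nn.Linear(c * patch_size * patch_size, embed_dim)  # learned in a real ViT
    return projection(patches)                 # shape: (num_patches, embed_dim)

image = torch.rand(3, 224, 224)                # dummy RGB image
patch_embeddings = image_to_patch_embeddings(image, patch_size=16, embed_dim=768)
print(patch_embeddings.shape)                  # (196, 768); positional embeddings come next
```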
CLIP: Contrastive Language-Image Pre-training
What is CLIP?
CLIP is a breakthrough architecture that jointly learns to embed images and their textual descriptions into the same vector space.
How does CLIP train?
- Uses a dataset of image-caption pairs.
- For each pair, create “positive pairs” (correct image-caption) and “negative pairs” (image with unrelated captions).
- Text encoder: Converts the caption to an embedding.
- Image encoder: Converts the image to an embedding.
- Contrastive learning: The model is trained to make positive pairs close and negative pairs far apart.
Practical Example 1:
Given a query “Find all images of red cars,” CLIP can retrieve images whose embeddings are close to the embedding of the text “red car.”
Practical Example 2:
Visual question answering: Given an image and the question “How many people are present?”, both are embedded and the model generates an answer based on their proximity in vector space.
Best Practice:
Always preprocess images to the format expected by the CLIP model (resize, normalize pixel values).
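A minimal sketch of cross-modal scoring with a public CLIP checkpoint via the Hugging Face transformers library; the checkpoint name, caption list, and local image file are assumptions for illustration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("car.jpg")                                   # hypothetical local file
texts = ["a red sports car", "a bowl of fruit", "a city skyline"]

# The processor handles resizing and pixel normalization expected by the model.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher scores mean the caption's embedding sits closer to the image's embedding.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```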
BLIP-2: Efficient Multimodal Language Understanding
What is BLIP-2?
BLIP-2 links a pre-trained vision transformer (for images) and a pre-trained LLM (for text) with a new bridge called the “Q-Former.”
Key Features:
- The vision transformer and LLM are “frozen”: their weights are not updated; only the Q-Former is trained.
- The Q-Former transforms visual features into “soft visual prompts” that the LLM can understand and use for generation tasks.
- Training uses three objectives:
  - Image-Text Matching (ITM): Classify if an image and text pair match.
  - Image-Text Contrastive (ITC): Align image and text embeddings.
  - Image-Grounded Text Generation (ITG): Generate text describing an image.
Efficiency:
Only the Q-Former is trained. This leverages the power of large, pre-trained models without retraining them from scratch.
Practical Example 1:
Generating a product description from a product photo.
Practical Example 2:
Answering questions about a chart or infographic by combining visual and textual information.
Best Practice:
Use BLIP-2 when you need to integrate vision and language at scale without the overhead of retraining massive models.
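A minimal sketch of visual question answering with a public BLIP-2 checkpoint via Hugging Face transformers; the checkpoint, prompt, and image file are illustrative, and the model is large enough that a GPU or a smaller variant is advisable:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("product.jpg")                      # hypothetical product photo
prompt = "Question: what product is shown and what color is it? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```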
Amazon Bedrock: The Platform for Enterprise AI
What is Amazon Bedrock?
Bedrock is AWS’s fully managed service for accessing, customizing, and deploying a suite of foundational models (FMs) from Amazon (Titan, Nova) and leading AI companies (Anthropic, Cohere, Mistral, etc.).
What does Bedrock offer?
- Simplified Access: One API for all supported models; no need to juggle multiple keys or endpoints.
- Customization: Fine-tuning, model distillation, and continuous pre-training for eligible models.
- Playgrounds: Interactive environments for experimentation.
- Builders (Agents and Knowledge Bases): Tools for building RAG pipelines and workflow automation.
- Model Access Management: Explicitly request access to each model before invoking via SDK.
Practical Example:
A developer can quickly test Titan and Nova models, switch between them, and choose the best fit for their use case, all from within the same platform.
Best Practice:
Always verify and manage model access rights in the Bedrock console before integrating with your application.
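A quick sketch of checking which foundation models are available to you, assuming your IAM identity has the relevant Bedrock permissions and model access has been requested in the console:

```python
import boto3

# The "bedrock" client covers control-plane operations such as listing models;
# "bedrock-runtime" is used later for inference calls.
bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.list_foundation_models()
for summary in response["modelSummaries"]:
    print(summary["modelId"], summary.get("inputModalities"), summary.get("outputModalities"))
```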
Amazon Nova Models: The Next Generation of Multimodal AI
Overview of Nova Models:
- Nova Micro: Text-only, highly cost-effective for pure language tasks.
- Nova Lite: Multimodal (text, image, video input; text output), fast and affordable.
- Nova Pro: Multimodal with advanced reasoning, creativity, and code generation. For demanding enterprise applications.
- Nova Premier: “Any-to-any” multimodal (image, text, video in and out) for end-to-end AI pipelines.
- Nova Canvas and Nova Reel: Creative content generation (images and video, respectively).
API Structure:
- Content Object: Accepts text, image (Base64), or video (Base64 or S3 URL).
- Inference Parameters:
  - max_new_tokens: Output limit (up to 5,000).
  - temperature: Randomness of output.
  - top_p (nucleus sampling): Focuses generation on the most probable tokens until a probability threshold is reached.
  - top_k: Restricts output to the k most likely tokens.
  - stop_sequences: Strings that end generation.
- Preprocessing: Images are automatically rescaled for optimal input (e.g., 900x450 for a 2:1 aspect ratio).
- Video Understanding: Supports major formats. Samples one frame per second (up to 960 frames for videos under 16 minutes). Resolution does not impact analysis since frame sampling is fixed.
Practical Example 1:
Feeding a customer support chat transcript and an attached screenshot to Nova Pro, which generates a response that references both text and image evidence.
Practical Example 2:
Uploading a marketing video for Nova Lite to summarize the key points in text.
Tips:
- For large files (images/video), use S3 URLs for smooth API integration.
- Adjust temperature and top_p for more creative or more deterministic responses as needed.
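A minimal multimodal request sketch using InvokeModel. The field names follow the native Nova request schema described above (messages, content, inferenceConfig with max_new_tokens, temperature, top_p), while the model ID, file name, and response parsing are assumptions to verify against the current Bedrock documentation; some regions require an inference-profile ID such as us.amazon.nova-lite-v1:0:

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("screenshot.png", "rb") as f:                      # hypothetical attachment
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = {
    "messages": [{
        "role": "user",
        "content": [
            {"text": "Summarize the issue shown in this screenshot."},
            {"image": {"format": "png", "source": {"bytes": image_b64}}},
        ],
    }],
    # Parameter names follow the native Nova schema listed above.
    "inferenceConfig": {"max_new_tokens": 500, "temperature": 0.3, "top_p": 0.9},
}

response = bedrock.invoke_model(modelId="amazon.nova-lite-v1:0", body=json.dumps(body))
result = json.loads(response["body"].read())
# Assumed response shape: {"output": {"message": {"content": [{"text": ...}]}}, ...}
print(result["output"]["message"]["content"][0]["text"])
```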
Retrieval Augmented Generation (RAG): Overcoming the Limitations of LLMs
Why RAG?
LLMs have two major limitations:
- Knowledge Cutoff: They only know what was in their training data, frozen at a certain point in time.
- Hallucination: When asked about information outside their dataset (e.g., your company’s internal documents), they may invent plausible-sounding but incorrect answers.
How does RAG work?
- Data Ingestion: Proprietary data (text, images, tables) is embedded and stored in a vector database (e.g., OpenSearch, Aurora, Pinecone).
- Retrieval: User queries are embedded and used for semantic search against this vector database.
- Augmentation: Top-matching data chunks are retrieved and concatenated with the user’s query to form a rich context.
- Generation: The context-augmented prompt is sent to the LLM, enabling accurate, grounded responses based on up-to-date, proprietary, or domain-specific information.
Orchestration:
Frameworks like LangChain manage chunking, embedding, storage, retrieval, and prompt construction, making RAG pipelines scalable and robust.
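A minimal end-to-end sketch of the four steps, using an in-memory list in place of a real vector database. The Titan and Nova model IDs and the Titan request fields are assumptions to verify; a production pipeline would use LangChain or Bedrock Knowledge Bases rather than hand-rolled retrieval:

```python
import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> np.ndarray:
    """Titan Text Embeddings V2, normalized so dot product = cosine similarity (assumed schema)."""
    body = json.dumps({"inputText": text, "dimensions": 256, "normalize": True})
    out = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
    return np.array(json.loads(out["body"].read())["embedding"])

# 1) Ingestion: embed document chunks (an in-memory list stands in for a vector DB).
chunks = ["Refunds are processed within 14 days.", "Premium support is available 24/7."]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2) Retrieval: embed the query and rank chunks by cosine similarity.
query = "How long do refunds take?"
q_vec = embed(query)
top_chunks = sorted(index, key=lambda item: float(np.dot(q_vec, item[1])), reverse=True)[:1]

# 3) Augmentation and 4) Generation: pass the retrieved context plus the query to an LLM.
prompt = ("Answer using only this context:\n"
          + "\n".join(c for c, _ in top_chunks)
          + f"\n\nQuestion: {query}")
answer = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(answer["output"]["message"]["content"][0]["text"])
```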
Practical Example 1:
A customer support bot retrieves the latest policy documents to answer user queries, even if these documents were updated after the LLM was trained.
Practical Example 2:
A medical assistant searches research papers and provides evidence-based summaries to clinicians, reducing hallucination risk.
Best Practice:
Always keep your vector database and embeddings in sync with your data source. Regularly re-embed when documents change.
Multimodal RAG: Extending RAG to Images, Tables, and Video
Why Multimodal RAG?
Enterprises store valuable knowledge in PDFs, scanned documents, tables, and videos. A standard text RAG pipeline can’t tap into this.
How does Multimodal RAG work?
- Data Extraction: Tools like PyMuPDF or Tabula extract text, images, and tables from documents.
- Multimodal Embedding: All content is embedded using a multimodal model (e.g., Amazon Titan Multimodal Embeddings), ensuring a shared vector space.
- Vector Storage: Store embeddings and raw chunks (with metadata) in a vector database.
- Retrieval and Generation: User queries trigger semantic search, returning relevant multimodal chunks, which are then passed to a multimodal LLM (like Nova Pro) for context-aware generation.
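A sketch of the embedding step: an extracted image plus a short caption are embedded together so text and images land in one vector space. The model ID (amazon.titan-embed-image-v1) and the request fields (inputText, inputImage, embeddingConfig) are assumptions based on the Titan Multimodal Embeddings schema; verify them for your region:

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("signature_page.png", "rb") as f:                  # hypothetical extracted image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = json.dumps({
    "inputText": "Signature page of the 2023 supplier contract",
    "inputImage": image_b64,
    "embeddingConfig": {"outputEmbeddingLength": 1024},
})
response = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
embedding = json.loads(response["body"].read())["embedding"]
# Store `embedding` plus metadata (source file, page number) in your vector database.
```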
Practical Example 1:
A legal assistant retrieves both the relevant contract text and the embedded image of a signature page to answer a compliance query.
Practical Example 2:
A financial analyst asks for the latest quarterly numbers, and the system retrieves tables from a PDF report and generates a summary.
Alternative Multimodal RAG Strategies:
- Option 2 (Text-based LLM): All modalities are summarized into text, which is embedded and used for retrieval, enabling use of a standard text LLM. However, this loses raw image/table context.
- Option 3 (Hybrid): Retrieval is performed on text summaries, but original raw data is fetched based on metadata and passed to a multimodal LLM. This balances efficient search with richer context for generation.
Best Practice:
Use Option 3 when you require both search efficiency and the ability to reference raw data (e.g., for compliance or audit trails).
Amazon Bedrock Knowledge Bases: Fully Managed RAG for the Enterprise
What are Knowledge Bases?
Bedrock Knowledge Bases automate the entire RAG pipeline (data ingestion, chunking, embedding, vector storage, and retrieval) so you can focus on application logic, not infrastructure.
Key Features:
- Automated Data Ingestion: Connect to S3, web URLs, Confluence, Salesforce, SharePoint, etc. The service handles extraction, chunking, and embedding.
- Retrieval APIs: Use “retrieve” (returns relevant chunks) or “retrieveAndGenerate” (returns model-generated answers with context).
- Chunking Strategies:
  - Default: ~300-token chunks, sentence-aware.
  - Fixed: Customizable size and overlap, similar to LangChain’s recursive splitter.
  - Hierarchical: Small “child” chunks for search, larger “parent” chunks for generation; ideal for context-rich documents.
  - Semantic: Chunks based on embedding similarity, with buffer size for context.
  - No Chunking: Each document is a single chunk (if it fits the embedding model’s input limit).
- Dynamic Syncing: Incrementally updates the knowledge base as your data evolves.
- Vector Database Choice: Supports OpenSearch Serverless, Aurora, Pinecone.
Practical Example 1:
HR teams set up a knowledge base linked to their policy documents on S3. Employees can query policies via a chatbot, with RAG ensuring up-to-date, accurate answers.
Practical Example 2:
A law firm connects its case files to a knowledge base. Attorneys search across thousands of documents and receive context-rich responses, including links to the source.
Best Practice:
- For long or complex documents, use hierarchical or semantic chunking to maximize context and retrieval precision.
- Regularly resync data sources to keep the knowledge base current.
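A minimal sketch of the two retrieval APIs mentioned above, using the bedrock-agent-runtime client; the knowledge base ID, model ARN, and exact parameter shapes are placeholders and assumptions to check against the current API reference:

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Option 1: retrieve returns the top matching chunks plus metadata.
chunks = runtime.retrieve(
    knowledgeBaseId="KB1234567890",                              # placeholder ID
    retrievalQuery={"text": "How many vacation days do new employees get?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
)
for result in chunks["retrievalResults"]:
    print(result["content"]["text"][:120])

# Option 2: retrieve_and_generate returns a model-generated answer grounded in the retrieved context.
answer = runtime.retrieve_and_generate(
    input={"text": "How many vacation days do new employees get?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0",
        },
    },
)
print(answer["output"]["text"])
```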
Function Calling: Teaching LLMs to Recommend Actions
What is Function Calling?
Function calling lets an LLM recognize when a prompt requires an external action (e.g., calling an API, running a database query), and recommend the appropriate function or tool.
How does it work?
- The LLM receives structured descriptions of available functions.
- When a user query matches a function, the LLM outputs a “call” recommendation (e.g., “call get_weather(location=London)”).
- The application code receives this recommendation and executes the function.
Practical Example 1:
A support bot recognizes the intent “reset my password” and recommends calling the password reset API.
Practical Example 2:
A travel assistant detects “book me a flight to Paris” and outputs a structured call to the booking system.
Tips:
- Function calling improves reliability and auditability. The application layer retains control over what gets executed.
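One concrete way to wire this up is the Bedrock Converse API's tool configuration. The sketch below defines a hypothetical get_weather tool, lets the model recommend a call, and leaves execution to the application; the model ID and response handling are assumptions to verify against the current documentation:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# A hypothetical tool the model may recommend; the application decides whether to run it.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            }},
        }
    }]
}

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",                       # illustrative model ID
    messages=[{"role": "user", "content": [{"text": "What's the weather in London?"}]}],
    toolConfig=tool_config,
)

if response["stopReason"] == "tool_use":
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            print(block["toolUse"]["name"], block["toolUse"]["input"])
            # Application code executes get_weather(**block["toolUse"]["input"]) here,
            # then sends the result back in a follow-up toolResult message.
```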
Bedrock Agents: Automating Complex Workflows with AI
What are Bedrock Agents?
Agents are intelligent systems that combine an LLM with the ability to select, sequence, and execute external functions (tools), and to reason through multi-step tasks.
How do Agents work?
- Agents are configured with “Action Groups”; each group maps to a Lambda function or other business logic.
- Agents can also attach Knowledge Bases as tools, combining data retrieval with workflow automation.
- When a user interacts, the agent’s LLM breaks down the query, determines which actions to take, executes them, and synthesizes a final response.
Insurance Claim Automation Example:
- Separate action groups for “create a new claim,” “gather evidence,” and “send reminders.” Each triggers a Lambda function (e.g., updating records in DynamoDB).
- The agent also queries a knowledge base populated with policy documents for informational questions.
- When a user says “file a claim for my car accident,” the agent calls the “create claim” group and updates the database. If the user asks, “What’s covered under my policy?”, the agent retrieves the answer from the knowledge base.
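Once an agent like this is deployed, invoking it from application code is a single streaming call. A minimal sketch with placeholder agent and alias IDs; the event-stream handling should be verified against the current SDK documentation:

```python
import uuid
import boto3

agents_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agents_runtime.invoke_agent(
    agentId="AGENT1234",                       # placeholder
    agentAliasId="ALIAS5678",                  # placeholder
    sessionId=str(uuid.uuid4()),               # keeps multi-turn context together
    inputText="File a claim for my car accident on 12 March.",
)

# The response arrives as an event stream; concatenate the text chunks.
completion = ""
for event in response["completion"]:
    if "chunk" in event:
        completion += event["chunk"]["bytes"].decode("utf-8")
print(completion)
```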
Benefits:
- End-to-end workflow automation, reducing manual steps.
- Intelligent orchestration: The agent can reason about which actions to take and in what order.
- Seamless integration of proprietary knowledge and business logic.
Practical Example 1:
A procurement agent manages purchase requests (creating orders, checking budgets, and providing policy guidance), all orchestrated via action groups and a knowledge base.
Practical Example 2:
A healthcare agent automates patient intake, document lookups, and appointment scheduling, using action groups for backend tasks and a knowledge base for clinical information.
Tips:
- Start with clear action group definitions and robust Lambda functions; agents are only as smart as the tools you provide.
- Test agent workflows with real-world scenarios to uncover edge cases.
Enterprise Implementation: Bringing It All Together
Data Preparation:
- Extract, clean, and chunk your data according to its modality (text, image, table, video).
- Choose appropriate chunking strategies for your use case (hierarchical for legal documents, semantic for medical research).
- Embed and store data in a scalable vector database.
Model and Tool Selection:
- Use Bedrock to select and access the right foundational models for your needs (text-only, multimodal, creative generation).
- Configure inference parameters to balance creativity, accuracy, and speed.
- Leverage Bedrock Knowledge Bases for managed RAG pipelines, and Bedrock Agents for workflow automation.
Integration and Automation:
- Build orchestration flows using frameworks like LangChain or Bedrock’s own builder tools.
- Combine semantic retrieval with business logic for automated, reliable enterprise solutions.
Testing and Continuous Improvement:
- Regularly evaluate semantic similarity scores to ensure embedding quality.
- Monitor agent actions and knowledge base accuracy, retrain or re-embed as new data and business needs emerge.
Security and Compliance:
- Control access to sensitive data through IAM roles and data source permissions.
- Audit agent actions and retrievals for traceability.
Conclusion: Key Takeaways and Next Steps
Enterprise AI is about more than just deploying a chatbot or a text generator. It’s about building intelligent, context-aware systems that leverage your unique data, automate complex workflows, and adapt to your business processes.
You’ve learned how:
- Embeddings transform raw data into meaningful representations, powering everything from semantic search to multimodal retrieval.
- Tokenization (especially BPE) and positional embeddings are foundational to LLM understanding and flexibility.
- Amazon Bedrock and Nova offer scalable, enterprise-ready infrastructure for accessing, customizing, and deploying advanced AI models.
- Retrieval Augmented Generation (RAG) overcomes LLM limitations by grounding answers in your proprietary knowledge, especially when extended to multimodal data.
- Bedrock Knowledge Bases and Agents automate and orchestrate complex workflows, combining retrieval, reasoning, and action seamlessly.
The real value lies in application: Don’t just understand these concepts; apply them. Start by embedding your own data, experiment with RAG pipelines, and build agents that reflect your business needs. As you do, you’ll unlock new possibilities for automation, intelligence, and differentiation in your enterprise.
AI is not just a tool. It’s a new way of thinking about how your business interacts with information, customers, and the world. Master these foundations, and you’re ready to lead that transformation.
Frequently Asked Questions
This FAQ is crafted to clarify concepts, workflows, and best practices around embeddings, Retrieval Augmented Generation (RAG), and multimodal agents using Amazon Nova and Bedrock. It addresses questions from foundational understanding to hands-on implementation, covering common technical challenges and practical business implications. Whether you're just starting or looking to refine your enterprise AI strategy, you'll find actionable insights and real-world examples to guide your next steps.
What are embeddings in the context of Large Language Models (LLMs) and why are they important?
Embeddings are a fundamental concept in Natural Language Processing (NLP) and LLMs.
At their core, embeddings represent data, such as text, images, or audio, as an array of numbers (a vector). This numerical representation allows machines to understand and process the data. For instance, in text, similar words or concepts are mapped to vectors that are numerically close to each other in a multi-dimensional space.
Embeddings are crucial because LLMs and other models cannot directly understand raw data. By converting data into numerical arrays, embeddings provide a foundation for training these models. The "Enterprise AI Tutorial" highlights that while simple 2D representations exist for illustrative purposes, real-world embeddings involve hundreds or thousands of numbers, allowing for a much richer capture of information. During LLM training, these embedding vectors are learned and refined through processes like backpropagation, ensuring that semantically similar data points have similar numerical representations.
How does tokenisation work in LLMs, particularly with techniques like Byte Pair Encoding (BPE)?
Tokenisation is the process of breaking down a continuous stream of text into smaller units called tokens.
Initially, simple tokenisation might involve splitting text by whitespace or special characters. However, this approach has limitations, such as encountering "out-of-vocabulary" (OOV) words – words not present in the model's pre-defined vocabulary. To address this, special tokens like "unknown" or "end-of-text" can be introduced, but this can lead to different words being mapped to the same "unknown" token, losing their unique meaning.
Byte Pair Encoding (BPE) is a more sophisticated tokenisation technique widely used in modern LLMs like GPT-2. BPE works by iteratively merging the most frequently occurring adjacent pairs of bytes (or characters) in a text corpus. It starts with a vocabulary of individual characters and progressively builds a larger vocabulary of common subwords, words, and phrases. This subword tokenisation is powerful because it allows the model to handle OOV words gracefully. For example, an unfamiliar word can be broken down into known subword units (e.g., "unknownword" might be tokenised as "un" + "known" + "word"). This ensures that even new or rare words can be represented and understood by the model, as long as their constituent subwords are in the vocabulary.
How are training data samples (X and Y) generated for self-supervised LLM training?
In self-supervised LLM training, the model learns to predict the next token in a sequence without requiring explicitly labelled data.
This is achieved by transforming the input text into training pairs of input (X) and expected output (Y). The core idea is to slide a window across the text, where the input (X) consists of a sequence of tokens, and the output (Y) is the very next token in the sequence.
For example, if the text is "LLMs learn to predict one word at a time," and the context size is four tokens, the training pairs would be generated as follows:
- Input (X): "LLMs learn to predict" -> Output (Y): "one"
- Input (X): "learn to predict one" -> Output (Y): "word"
- And so on.
What are multimodal models and why are they significant in AI?
Multimodal models are advanced AI systems capable of processing and integrating different types of data, such as text, images, video, and audio.
They are significant because real-world data often comes in various formats, and models that can understand and reason across these modalities are essential for many applications.
Traditional AI models were often limited to a single modality (e.g., text-only NLP models or image-only computer vision models). Multimodal models overcome this limitation by allowing inputs from multiple sources and often generating outputs in one or more modalities. Use cases include image captioning (generating text descriptions for images), visual question answering (answering questions about an image), and visual search (finding products based on an image). The "Enterprise AI Tutorial" emphasizes the need for models that can embed different modalities into the same vector space, enabling semantic understanding across various data types.
How do multimodal embedding models, particularly those based on CLIP (Contrastive Language-Image Pre-training), work?
Multimodal embedding models, like those based on CLIP, aim to embed different modalities (e.g., images and text) into the same vector space. The goal is for semantically similar items, regardless of their modality (e.g., an image of a cat and the text "a cat"), to have embedding vectors that are very close to each other.
CLIP models are trained using a dataset of image-caption pairs. The training process involves two main components: a text encoder and an image encoder. During training, the model learns to:
- Maximise similarity for positive pairs: The embeddings of an image and its correct caption are encouraged to be very close.
- Minimise similarity for negative pairs: The embeddings of an image and incorrect captions (from other images) are pushed far apart.
What is Retrieval Augmented Generation (RAG) and how does it address the limitations of traditional LLMs?
Retrieval Augmented Generation (RAG) is a technique that enhances LLMs by providing them with external, up-to-date, or proprietary information, thereby overcoming the "knowledge cutoff" and "hallucination" limitations of standalone LLMs.
In a RAG pipeline:
- Data Ingestion: Proprietary or recent data (text, images, tables, etc.) is converted into numerical embeddings and stored in a vector database.
- Retrieval: When a user asks a question, the question itself is embedded and used to perform a semantic search in the vector database. This retrieves "relevant chunks" or contexts that are semantically similar to the query.
- Generation: The original query, along with the retrieved relevant chunks (context), is then sent to the LLM. The LLM uses this provided context to generate a more accurate, informed, and up-to-date response.
How does Amazon Bedrock simplify the implementation of AI applications, including RAG and multimodal models?
Amazon Bedrock is a fully managed service on AWS that provides access to various foundational models (FMs) from Amazon and other AI startups through a unified API. It significantly simplifies AI application development by abstracting away infrastructure management and direct API integrations with multiple providers.
Key benefits and features include:
- Unified Access: Developers can switch between different FMs with minimal code changes, avoiding the need for multiple API keys.
- Model Customisation: Bedrock supports fine-tuning, model distillation, and continuous pre-training for eligible models.
- Playgrounds: Interactive interfaces for experimenting with different models and prompts.
- Developer Tools: Includes Agents and Knowledge Bases, which automate complex AI workflows.
What are AI Agents in Amazon Bedrock and how do they leverage Knowledge Bases and Action Groups for complex tasks?
In Amazon Bedrock, AI Agents are advanced AI systems that automatically determine how to best answer a user's question by performing a series of actions. They extend the concept of function calling by not only selecting the right tool (function) but also executing it.
Agents leverage two primary components:
- Action Groups: These represent specific business logic or external tools (e.g., Lambda functions) that the agent can call to perform tasks. For example, an "Insurance Claim Agent" might have action groups for "create claim," "gather evidence," or "send reminders." The agent uses its internal LLM to reason about which action group is appropriate for a given query.
- Knowledge Bases: As discussed, these are managed RAG systems. Agents can use Knowledge Bases as a tool to retrieve information from proprietary or recent data. If a user's question requires information from the company's internal documents, the agent can query the attached Knowledge Base to get the relevant context.
Why can't models understand raw text, audio, or video?
AI models require numerical data to perform mathematical operations. Raw text, audio, or video are unstructured and do not have inherent numeric meaning, so models cannot process them directly. Embeddings transform these raw formats into vectors (arrays of numbers) that encode essential patterns and semantics, enabling models to learn relationships and make predictions. For example, an audio file becomes a series of numbers representing sound waves; text is tokenized and embedded into vectors.
How are embeddings typically represented, and what does the dimensionality mean?
Embeddings are represented as vectors: arrays of numbers. Each dimension captures a particular aspect or feature of the data. The number of dimensions (e.g., 128, 512, 1024) determines how much information can be encoded. Higher dimensionality allows for more nuanced and detailed representations but also increases computational cost and storage. For example, a 768-dimensional embedding can capture more subtle differences between words than a 64-dimensional one but may require more memory and processing power.
Why does the dimensionality of an embedding vector matter?
The dimensionality of an embedding determines the richness of information encoded. A low-dimensional embedding may not capture all the nuances in the data, while very high dimensionality can make models prone to overfitting and reduce retrieval efficiency. Choosing the right dimension is a trade-off between accuracy, computational efficiency, and storage. In practice, dimensions are often chosen empirically, balancing business needs and infrastructure constraints.
What is tokenization and why is it a crucial first step in LLM training?
Tokenization breaks text into smaller units (tokens) such as words, subwords, or characters. This is essential because LLMs operate on sequences of tokens, not whole sentences or paragraphs. Tokenization makes it possible to map raw text to numerical IDs, which are then used for embedding and further processing in the model. Accurate tokenization ensures that the model can handle diverse and complex language, including rare or new words.
What is the "out-of-vocabulary" (OOV) problem and how can it be addressed?
The OOV problem occurs when a model encounters words during inference that were not present in its training vocabulary. Simple tokenizers often map OOV words to a generic "unknown" token, causing loss of meaning. Subword tokenization techniques like Byte Pair Encoding (BPE) solve this by breaking unfamiliar words into known subwords or character combinations, preserving much of the original meaning. This enables the model to process and generate responses for previously unseen words with greater accuracy.
How does Byte Pair Encoding (BPE) solve tokenization challenges?
BPE bridges the gap between word-level and character-level tokenization. It starts with a base vocabulary of characters and repeatedly merges the most common adjacent pairs, forming new tokens (subwords). This efficiently addresses the OOV problem, reduces vocabulary size, and keeps token sequences manageable for model training and inference. For example, "unhappiness" could be split into "un," "happi," and "ness," all of which are likely known subwords.
Certification
About the Certification
Get certified in Enterprise AI Engineering: design and deploy scalable AI solutions using AWS Bedrock and Nova, leveraging embeddings, RAG, and multimodal agents to automate workflows, analyze diverse data, and drive impactful business outcomes.
Official Certification
Upon successful completion of the "Certification in Building and Deploying Enterprise AI Solutions with AWS Bedrock and Nova", you will receive a verifiable digital certificate. This certificate demonstrates your expertise in the subject matter covered in this course.
Benefits of Certification
- Enhance your professional credibility and stand out in the job market.
- Validate your skills and knowledge in cutting-edge AI technologies.
- Unlock new career opportunities in the rapidly growing AI field.
- Share your achievement on your resume, LinkedIn, and other professional platforms.
How to complete your certification successfully?
To earn your certification, you’ll need to complete all video lessons, study the guide carefully, and review the FAQ. After that, you’ll be prepared to pass the certification requirements.
Join 20,000+ professionals using AI to transform their careers
Join professionals who didn’t just adapt; they thrived. You can too, with AI training designed for your job.