Multi-Modal Features
Work with images, audio, and video using a unified API. Analyze, generate, and transform content across all modalities.
Image Understanding
Analyze images with object detection, scene understanding, OCR, and visual Q&A
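Detection results usually need some filtering before they are useful. This page does not document the shape of the detected objects, so the following is only a minimal sketch: the `DetectedObject` interface (with `label`, `confidence`, and `boundingBox` fields) and the `filterDetections` helper are hypothetical and not part of `@rana/core`.

```typescript
// Hypothetical shape for one detected object; the real shape returned by
// analyzeImage is not documented on this page.
interface DetectedObject {
  label: string;
  confidence: number; // assumed to be in the range 0..1
  boundingBox: { x: number; y: number; width: number; height: number };
}

// Keep detections at or above a confidence threshold, optionally restricted
// to one label, sorted from most to least confident.
function filterDetections(
  objects: DetectedObject[],
  minConfidence: number,
  label?: string
): DetectedObject[] {
  return objects
    .filter(
      (o) =>
        o.confidence >= minConfidence &&
        (label === undefined || o.label === label)
    )
    .sort((a, b) => b.confidence - a.confidence);
}
```

If the real result matches this assumed shape, a call such as `filterDetections(analysis.objects, 0.5, 'car')` would return only the confident car detections.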
import { analyzeImage, askAboutImage, extractTextFromImage } from '@rana/core';
// Comprehensive image analysis
const analysis = await analyzeImage(imageUrl, {
  features: ['objects', 'scene', 'text', 'classification']
});
console.log(analysis.objects); // Detected objects with bounding boxes
console.log(analysis.scene); // Scene description, lighting, mood
console.log(analysis.text); // Extracted text (OCR)
console.log(analysis.classifications); // Image categories
// Visual Q&A
const answer = await askAboutImage(imageUrl, "What color is the car?");
console.log(answer.answer); // "The car is red"
console.log(answer.confidence); // 0.95
Image Generation
Generate, edit, and transform images with AI
import { generateImage, editImage, upscaleImage } from '@rana/core';
// Generate from text
const images = await generateImage("A sunset over mountains", {
  size: '1024x1024',
  quality: 'hd',
  style: 'photorealistic'
});
// Edit existing image
const edited = await editImage(imageBuffer, {
  prompt: "Add a rainbow in the sky",
  mask: maskBuffer
});
// Upscale image
const upscaled = await upscaleImage(imageBuffer, { scale: 4 });
Audio Transcription
Convert speech to text with speaker diarization and timestamps
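Timestamped segments map directly onto subtitle cues. As a rough illustration of what the SubRip (SRT) layout looks like, here is a minimal sketch that formats segments by hand. The `Segment` shape (start and end in seconds, plus text and an optional speaker) and the `segmentsToSrt` helper are assumptions made for this example; they are not the documented shape of `result.segments` or the library's own exporter.

```typescript
// Hypothetical segment shape; the real shape of `result.segments`
// is not documented on this page.
interface Segment {
  start: number; // seconds from the beginning of the audio
  end: number; // seconds from the beginning of the audio
  text: string;
  speaker?: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function zeroPad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = '0' + s;
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${zeroPad(h, 2)}:${zeroPad(m, 2)}:${zeroPad(s, 2)},${zeroPad(ms, 3)}`;
}

// Build an SRT document: a 1-based cue index, a time range, the cue text,
// and a blank line between cues.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => {
      const line = seg.speaker ? `${seg.speaker}: ${seg.text}` : seg.text;
      const range = `${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}`;
      return `${i + 1}\n${range}\n${line}\n`;
    })
    .join('\n');
}
```

WebVTT uses a similar cue layout, but it starts with a `WEBVTT` header line and writes the millisecond separator as a period instead of a comma.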
import {
  transcribeAudio,
  transcribeToSRT,
  transcribeToVTT
} from '@rana/core';
// Basic transcription
const result = await transcribeAudio(audioFile, {
  language: 'en',
  enableDiarization: true, // Identify speakers
  enableTimestamps: true
});
console.log(result.text); // Full transcription
console.log(result.segments); // Timestamped segments
console.log(result.speakers); // Speaker information
// Export formats
const srt = await transcribeToSRT(audioFile); // SubRip format
const vtt = await transcribeToVTT(audioFile); // WebVTT format
Text-to-Speech
Convert text to natural-sounding speech with multiple voices
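SSML (Speech Synthesis Markup Language) is a W3C markup for controlling rate, pitch, and similar properties of synthesized speech. To show what such markup looks like, here is a minimal sketch that wraps text in a `<prosody>` element. The `buildSsml` and `escapeXml` helpers and the `ProsodyOptions` type are hypothetical; this is not the library's SSML implementation, and which SSML features a given provider accepts is not stated on this page.

```typescript
// Hypothetical options; SSML's prosody element accepts named values such as
// 'slow' or 'high' for these attributes.
interface ProsodyOptions {
  rate?: string;
  pitch?: string;
}

// Escape characters that are special in XML so user text cannot break the markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Wrap text in <speak>, adding a <prosody> element only when options are given.
function buildSsml(text: string, options: ProsodyOptions = {}): string {
  const attrs: string[] = [];
  if (options.rate) attrs.push(`rate="${escapeXml(options.rate)}"`);
  if (options.pitch) attrs.push(`pitch="${escapeXml(options.pitch)}"`);
  const body = escapeXml(text);
  const inner =
    attrs.length > 0 ? `<prosody ${attrs.join(' ')}>${body}</prosody>` : body;
  return `<speak>${inner}</speak>`;
}
```

A strictly conforming standalone SSML document also carries `version` and `xmlns` attributes on `<speak>`; the sketch omits them for brevity.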
import { speak, getVoices, speakStream } from '@rana/core';
// List available voices
const voices = await getVoices();
// [{ id: 'alloy', name: 'Alloy', gender: 'neutral' }, ...]
// Generate speech
const audio = await speak("Hello, welcome to RANA!", {
  voice: 'nova',
  speed: 1.0,
  format: 'mp3'
});
// Stream for real-time playback
const stream = await speakStream(longText, { voice: 'echo' });
// SSML support for fine control
// Note: `tts` is not imported in this snippet; it is assumed to be a
// text-to-speech instance created elsewhere
const ssml = tts.textToSSML("Hello", {
  rate: 'slow',
  pitch: 'high'
});
Video Understanding
Analyze videos with scene detection, object tracking, and temporal Q&A
import {
  analyzeVideo,
  askAboutVideo,
  summarizeVideo,
  searchVideo
} from '@rana/core';
// Full video analysis
const analysis = await analyzeVideo(videoFile, {
  features: ['scenes', 'objects', 'actions', 'transcript']
});
console.log(analysis.scenes); // Scene boundaries and descriptions
console.log(analysis.objects); // Tracked objects with trajectories
console.log(analysis.actions); // Detected activities
console.log(analysis.keyMoments); // Important timestamps
// Ask questions about video
const answer = await askAboutVideo(videoFile, "When does the speaker mention AI?");
console.log(answer.relevantTimestamps);
// Search within video
const results = await searchVideo(videoFile, "person walking");
Unified Multi-Modal
Work with all modalities through a single interface
import { createMultiModal } from '@rana/core';
const mm = createMultiModal({
  imageUnderstanding: { provider: 'openai' },
  imageGeneration: { provider: 'openai' },
  audioTranscription: { provider: 'openai' },
  textToSpeech: { provider: 'openai' },
  videoUnderstanding: { provider: 'google' }
});
// Analyze any media
const imageAnalysis = await mm.analyze(imageUrl, 'image');
const audioResult = await mm.analyze(audioFile, 'audio');
// Cross-modal operations
const speech = await mm.describeImageAsSpeech(imageUrl);
const image = await mm.generateImageFromSpeech(audioFile);
Supported Providers
Image Understanding
- OpenAI (GPT-4 Vision)
- Anthropic (Claude)
- Google (Gemini)
- Hugging Face
Image Generation
- OpenAI (DALL-E 3)
- Stability AI
- Midjourney
- Hugging Face
Audio/Speech
- OpenAI (Whisper, TTS)
- Deepgram
- AssemblyAI
- ElevenLabs
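The lists above can be encoded as data to catch a mismatched provider before any request is made. The following is a minimal sketch under stated assumptions: only `'openai'` and `'google'` appear as provider identifiers on this page, so the other identifier strings, the `audioSpeech` grouping key, and the `unsupportedEntries` helper are hypothetical and not part of `@rana/core`.

```typescript
// Provider identifiers are assumptions, apart from 'openai' and 'google',
// which appear in the configuration example on this page.
const SUPPORTED_PROVIDERS = {
  imageUnderstanding: ['openai', 'anthropic', 'google', 'huggingface'],
  imageGeneration: ['openai', 'stability', 'midjourney', 'huggingface'],
  audioSpeech: ['openai', 'deepgram', 'assemblyai', 'elevenlabs'],
} as const;

type Capability = keyof typeof SUPPORTED_PROVIDERS;

// True when the provider is listed for the given capability.
function isSupported(capability: Capability, provider: string): boolean {
  const list: readonly string[] = SUPPORTED_PROVIDERS[capability];
  return list.indexOf(provider) !== -1;
}

// Return a "capability: provider" string for every entry that is not listed.
function unsupportedEntries(
  config: Partial<Record<Capability, { provider: string }>>
): string[] {
  const problems: string[] = [];
  for (const capability of Object.keys(config) as Capability[]) {
    const entry = config[capability];
    if (entry && !isSupported(capability, entry.provider)) {
      problems.push(`${capability}: ${entry.provider}`);
    }
  }
  return problems;
}
```

Keeping the table in one constant means a newly supported provider only needs to be added in a single place.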