Multi-Modal Features

Work with images, audio, and video using a unified API. Analyze, generate, and transform content across all modalities.

Image Understanding

Analyze images with object detection, scene understanding, OCR, and visual Q&A

import { analyzeImage, askAboutImage, extractTextFromImage } from '@rana/core';

// Comprehensive image analysis
const analysis = await analyzeImage(imageUrl, {
  features: ['objects', 'scene', 'text', 'classification']
});

console.log(analysis.objects);    // Detected objects with bounding boxes
console.log(analysis.scene);      // Scene description, lighting, mood
console.log(analysis.text);       // Extracted text (OCR)
console.log(analysis.classifications); // Image categories

// Visual Q&A
const answer = await askAboutImage(imageUrl, "What color is the car?");
console.log(answer.answer);       // "The car is red"
console.log(answer.confidence);   // 0.95
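
The extractTextFromImage helper imported above is useful when you only need OCR. A minimal sketch, assuming it resolves to an object exposing the recognized text (the field name is an assumption, not confirmed API):

// Standalone OCR -- skips object detection and scene analysis
const ocr = await extractTextFromImage(imageUrl);
console.log(ocr.text);            // assumed: recognized text as a string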

Image Generation

Generate, edit, and transform images with AI

import { generateImage, editImage, upscaleImage } from '@rana/core';

// Generate from text
const images = await generateImage("A sunset over mountains", {
  size: '1024x1024',
  quality: 'hd',
  style: 'photorealistic'
});

// Edit existing image
const edited = await editImage(imageBuffer, {
  prompt: "Add a rainbow in the sky",
  mask: maskBuffer
});

// Upscale image
const upscaled = await upscaleImage(imageBuffer, { scale: 4 });
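
These helpers compose: for example, generate an image and then upscale the first result. A hedged sketch, assuming generateImage resolves to an array whose items expose the raw bytes via a buffer field (that property is an assumption):

// Generate, then upscale the first result to 4x
const [first] = await generateImage("A sunset over mountains", {
  size: '1024x1024'
});
const large = await upscaleImage(first.buffer, { scale: 4 });  // buffer field assumed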

Audio Transcription

Convert speech to text with speaker diarization and timestamps

import {
  transcribeAudio,
  transcribeToSRT,
  transcribeToVTT
} from '@rana/core';

// Basic transcription
const result = await transcribeAudio(audioFile, {
  language: 'en',
  enableDiarization: true,  // Identify speakers
  enableTimestamps: true
});

console.log(result.text);         // Full transcription
console.log(result.segments);     // Timestamped segments
console.log(result.speakers);     // Speaker information

// Export formats
const srt = await transcribeToSRT(audioFile);  // SubRip format
const vtt = await transcribeToVTT(audioFile);  // WebVTT format
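
With diarization and timestamps enabled, the segments can be rendered as a speaker-labeled transcript. A sketch assuming each segment carries start, end, speaker, and text fields (the field names are assumptions inferred from the output above):

// Print a speaker-labeled, timestamped transcript
for (const seg of result.segments) {
  // start, end, speaker, and text are assumed segment fields
  console.log(`[${seg.start}s - ${seg.end}s] ${seg.speaker}: ${seg.text}`);
}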

Text-to-Speech

Convert text to natural-sounding speech with multiple voices

import { speak, getVoices, speakStream, textToSSML } from '@rana/core';

// List available voices
const voices = await getVoices();
// [{ id: 'alloy', name: 'Alloy', gender: 'neutral' }, ...]

// Generate speech
const audio = await speak("Hello, welcome to RANA!", {
  voice: 'nova',
  speed: 1.0,
  format: 'mp3'
});

// Stream for real-time playback
const stream = await speakStream(longText, { voice: 'echo' });

// SSML support for fine control
const ssml = textToSSML("Hello", {
  rate: 'slow',
  pitch: 'high'
});
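
Saving the generated audio is just a matter of writing the bytes out. A minimal Node.js sketch, assuming speak resolves to a Buffer (or Buffer-compatible value) of encoded mp3 data:

import { writeFile } from 'node:fs/promises';

// Assumes `audio` from speak() above is a Buffer of mp3 bytes
await writeFile('welcome.mp3', audio);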

Video Understanding

Analyze videos with scene detection, object tracking, and temporal Q&A

import {
  analyzeVideo,
  askAboutVideo,
  summarizeVideo,
  searchVideo
} from '@rana/core';

// Full video analysis
const analysis = await analyzeVideo(videoFile, {
  features: ['scenes', 'objects', 'actions', 'transcript']
});

console.log(analysis.scenes);     // Scene boundaries and descriptions
console.log(analysis.objects);    // Tracked objects with trajectories
console.log(analysis.actions);    // Detected activities
console.log(analysis.keyMoments); // Important timestamps

// Ask questions about video
const answer = await askAboutVideo(videoFile, "When does the speaker mention AI?");
console.log(answer.relevantTimestamps);

// Search within video
const results = await searchVideo(videoFile, "person walking");
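
summarizeVideo, imported above, condenses a long video into prose. A hedged sketch; the option name and return shape here are assumptions:

// Condense the video into a short summary (maxSentences is assumed)
const summary = await summarizeVideo(videoFile, { maxSentences: 5 });
console.log(summary.text);        // assumed: the summary string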

Unified Multi-Modal

Work with all modalities through a single interface

import { createMultiModal } from '@rana/core';

const mm = createMultiModal({
  imageUnderstanding: { provider: 'openai' },
  imageGeneration: { provider: 'openai' },
  audioTranscription: { provider: 'openai' },
  textToSpeech: { provider: 'openai' },
  videoUnderstanding: { provider: 'google' }
});

// Analyze any media
const imageAnalysis = await mm.analyze(imageUrl, 'image');
const audioResult = await mm.analyze(audioFile, 'audio');

// Cross-modal operations
const speech = await mm.describeImageAsSpeech(imageUrl);
const image = await mm.generateImageFromSpeech(audioFile);
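
Because mm.analyze takes the modality as its second argument, a small router can dispatch arbitrary uploads. A sketch using only mm.analyze from above; the MIME-to-modality helper is illustrative, not part of the API:

// Hypothetical helper: map a MIME type to a modality
function modalityFor(mime: string): 'image' | 'audio' | 'video' {
  if (mime.startsWith('image/')) return 'image';
  if (mime.startsWith('audio/')) return 'audio';
  return 'video';
}

const analysis = await mm.analyze(file, modalityFor(file.type));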

Supported Providers

Image Understanding

  • OpenAI (GPT-4 Vision)
  • Anthropic (Claude)
  • Google (Gemini)
  • Hugging Face

Image Generation

  • OpenAI (DALL-E 3)
  • Stability AI
  • Midjourney
  • Hugging Face

Audio/Speech

  • OpenAI (Whisper, TTS)
  • Deepgram
  • AssemblyAI
  • ElevenLabs
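
Providers can be mixed per capability using the unified config shown earlier. A hedged example; the provider ids 'deepgram' and 'elevenlabs' are assumptions inferred from the lists above:

const mm = createMultiModal({
  audioTranscription: { provider: 'deepgram' },   // assumed provider id
  textToSpeech: { provider: 'elevenlabs' }        // assumed provider id
});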