Technologies Used

  • Python
  • FastAPI
  • Ollama API
  • LLaMA (large language models served via Ollama)

Production-Ready NLP API Platform Using FastAPI and Ollama

This project is a FastAPI-based NLP service that integrates with the Ollama API to expose multiple large language model capabilities through clean, well-structured REST endpoints.

The goal of this project was to design a modular, API-first backend system that makes advanced language model functionalities easily accessible for real-world applications such as chat systems, summarization tools, sentiment analysis engines, translation services, and more.

Instead of building separate scripts for each task, I consolidated everything into a unified, scalable FastAPI application with properly defined endpoints, structured request/response schemas, and interactive API documentation via Swagger UI.


Key Features

  • Modular REST API built with FastAPI
  • Integration with Ollama (LLaMA-based models)
  • Text generation with configurable parameters
  • Summarization, classification, and Q&A endpoints
  • Chatbot with persistent memory handling
  • Translation and simulated text-to-speech
  • Token counting endpoint for prompt optimization

Core Architecture

  • FastAPI handles routing, validation, and API documentation.
  • Ollama acts as the LLM inference backend.
  • Each NLP task is exposed as a dedicated endpoint.
  • JSON-based request and response structure for clean integration.
  • Designed for easy extension with additional LLM capabilities.

Installation

1. Clone the repository and enter the project directory:

git clone https://github.com/akhilsu/ollama-fastapi.git
cd ollama-fastapi

2. Run the Application

With the dependencies installed (FastAPI and Uvicorn) and a local Ollama instance running, start the server:

uvicorn main:app --reload

The interactive Swagger UI is then available at http://127.0.0.1:8000/docs

API Endpoints

1. Basic Text Generation

Endpoint: POST /chat/

Request Body Example:

{
  "prompt": "Once upon a time..."
}

Response:

{
  "response": "Generated response text based on the prompt."
}
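
A request like the one above can be sent from any HTTP client. As a quick sanity check, here is a small Python client sketch against the local development server (the URL assumes the uvicorn defaults from the installation step):

```python
import requests

API_URL = "http://127.0.0.1:8000/chat/"  # local dev server from the installation step

def ask(prompt: str) -> str:
    """Send a prompt to the /chat/ endpoint and return the generated text."""
    r = requests.post(API_URL, json={"prompt": prompt}, timeout=120)
    r.raise_for_status()
    return r.json()["response"]

if __name__ == "__main__":
    print(ask("Once upon a time..."))
```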

2. Text Generation with Custom Parameters

Endpoint: POST /generate/

Request Body Example:

{
  "prompt": "Tell me a story.",
  "temperature": 0.8,
  "max_tokens": 200,
  "top_p": 0.9
}

Response:

{
  "response": "Generated response text based on the prompt and parameters."
}
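
Under the hood, parameters like these plausibly map onto Ollama's "options" object, where the token limit is called num_predict. A sketch of such a payload builder, assuming that mapping and the model tag from /model_info/:

```python
def build_ollama_payload(prompt, temperature=0.8, max_tokens=200, top_p=0.9,
                         model="llama3.2:3b-instruct-q4_K_M"):
    """Translate the /generate/ request fields into an Ollama generate payload."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Ollama groups sampling controls under "options"; max_tokens -> num_predict.
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,
            "top_p": top_p,
        },
    }
```

Keeping this translation in one helper means every endpoint shares a single, consistent view of the sampling parameters.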

3. Text Summarization

Endpoint: POST /summarize/

Request Body Example:

{
  "text": "The quick brown fox jumps over the lazy dog. The dog wakes up and chases the fox."
}

Response:

{
  "summary": "Shortened version of the text provided."
}

4. Text Classification (Sentiment Analysis)

Endpoint: POST /classify/

Request Body Example:

{
  "text": "I love this place!"
}

Response:

{
  "classification": "positive"
}

5. Question Answering

Endpoint: POST /qa/

Request Body Example:

{
  "context": "The Eiffel Tower is in Paris.",
  "question": "Where is the Eiffel Tower located?"
}

Response:

{
  "answer": "Paris"
}
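
Task endpoints like summarization, classification, and Q&A can be implemented as thin prompt templates over the same generation call. The exact wording below is illustrative, not the project's actual templates:

```python
def summarize_prompt(text: str) -> str:
    return f"Summarize the following text in one or two sentences:\n\n{text}"

def classify_prompt(text: str) -> str:
    return ("Classify the sentiment of the following text as positive, "
            f"negative, or neutral. Reply with one word.\n\n{text}")

def qa_prompt(context: str, question: str) -> str:
    return (f"Answer the question using only the context below.\n\n"
            f"Context: {context}\nQuestion: {question}\nAnswer:")
```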

6. Chatbot with Memory (Persistent Chat)

Endpoint: POST /chat_with_memory/

Request Body Example:

{
  "prompt": "Hello, how are you?",
  "conversation_history": [{"user": "Hi!"}]
}

Response:

{
  "conversation_history": [
    {"user": "Hi!"},
    {"assistant": "Hello, how are you?"}
  ]
}
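
One plausible way to handle the persistent memory is to flatten the conversation history into the prompt and append each new exchange to the returned history. The project's actual bookkeeping may differ; this is a sketch of the shape shown above:

```python
def history_to_prompt(history):
    # Flatten [{"user": ...}, {"assistant": ...}] turns into a plain text prompt.
    lines = []
    for turn in history:
        for role, text in turn.items():
            lines.append(f"{role.capitalize()}: {text}")
    lines.append("Assistant:")  # cue the model to produce the next reply
    return "\n".join(lines)

def advance_conversation(history, user_prompt, reply):
    # Return the updated history in the shape the endpoint responds with.
    return history + [{"user": user_prompt}, {"assistant": reply}]
```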

7. Language Translation

Endpoint: POST /translate/

Request Body Example:

{
  "text": "Hello, how are you?",
  "target_language": "es"
}

Response:

{
  "translation": "Hola, ¿cómo estás?"
}

8. Text-to-Speech (Simulated)

Endpoint: POST /text_to_speech/

Request Body Example:

{
  "prompt": "Hello, welcome!"
}

Response:

{
  "audio_file": "Simulated audio for: Hello, welcome!"
}

9. Model Information

Endpoint: GET /model_info/

Response:

{
  "model": "llama3.2:3b-instruct-q4_K_M",
  "version": "v1.0",
  "parameters": {
    "num_layers": 32,
    "hidden_units": 4096
  }
}

10. Token Count

Endpoint: POST /token_count/

Request Body Example:

{
  "text": "Hello, world!"
}

Response:

{
  "token_count": 3
}