Technologies Used

Python · Hugging Face Transformers · Meta Llama · PyTorch · Streamlit

I built and deployed an interactive AI chatbot that generates responses in a conversational format using Hugging Face Transformers and Streamlit. The goal of this project was to create a lightweight “chat-with-AI” experience that feels responsive, configurable, and secure — while still running a modern instruction-tuned language model.

The application loads Meta’s Llama-3.2-1B-Instruct model via Hugging Face, supports real-time conversation history using Streamlit session state, and provides user controls to tune response behavior (length and creativity). To ensure secure deployment, the Hugging Face access token is handled via Streamlit Secrets / environment variables instead of hardcoding credentials.


Key Features

  • Real-time chat interface with persistent conversation history
  • Adjustable response behavior using sliders (max tokens + temperature)
  • Secure token handling using Streamlit Secrets (HF_TOKEN)
  • GPU-aware execution (runs on CUDA if available, otherwise CPU)
  • Hugging Face chat template support for structured prompting
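The two sliders map straight onto generation parameters. A minimal sketch of turning slider values into keyword arguments for `model.generate()` — the bounds, defaults, and helper name here are illustrative, not taken from the app:

```python
def build_generation_kwargs(max_new_tokens, temperature):
    """Translate UI slider values into kwargs for model.generate().

    The clamping bounds below are illustrative defaults, not the app's
    actual slider ranges.
    """
    max_new_tokens = max(1, min(int(max_new_tokens), 1024))
    temperature = max(0.1, min(float(temperature), 2.0))
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,  # sampling must be on for temperature to matter
        "temperature": temperature,
    }

kwargs = build_generation_kwargs(256, 0.7)
```

Keeping this mapping in one helper means the UI layer never touches generation internals directly.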

Secure Token Handling (Deployment-Friendly)

# Never hardcode the token; read it from Streamlit Secrets at runtime
HF_TOKEN = st.secrets["HF_TOKEN"]
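In deployment, the token lives in Streamlit's secrets store rather than in the repository. A minimal sketch of the expected `.streamlit/secrets.toml` entry (the token value is a placeholder):

```toml
# .streamlit/secrets.toml — keep this file out of version control
HF_TOKEN = "hf_..."
```

On Streamlit Community Cloud the same key is entered through the app's Secrets settings instead of a local file.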

Model + Tokenizer Loading with GPU Support

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(model_name, token=HF_TOKEN)

# Prefer the GPU when one is available; fall back to CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

The app automatically uses GPU acceleration when available, which noticeably reduces response latency.

Conversational Prompting + Response Generation

# add_generation_prompt=True appends the assistant header so the model
# continues as the assistant rather than echoing the user turn
input_ids = tokenizer.apply_chat_template(
    conversation, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=max_new_tokens,
    do_sample=True,
    temperature=temperature,
    eos_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)

I use the tokenizer’s chat template to preserve conversational context and generate a more natural assistant response. Sampling is enabled with temperature control so users can choose between more deterministic or more creative outputs.
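The `conversation` object passed to `apply_chat_template` is a plain list of role-tagged messages. A small sketch of maintaining and trimming that history — the helper names and the `max_turns` limit are illustrative, not part of the app:

```python
# Each turn is a dict with a "role" ("system", "user", or "assistant")
# and a "content" string, matching the format apply_chat_template expects.
def append_turn(history, role, content):
    """Append one message to the chat history (illustrative helper)."""
    history.append({"role": role, "content": content})
    return history

def trim_history(history, max_turns=10):
    """Keep any system prompt plus only the most recent max_turns messages,
    so the prompt stays within the model's context window."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-max_turns:]

conversation = []
append_turn(conversation, "system", "You are a helpful assistant.")
append_turn(conversation, "user", "Hello!")
append_turn(conversation, "assistant", "Hi! How can I help?")
```

In the Streamlit app this list would live in `st.session_state` so it survives reruns between user messages.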