Hi, I am Vivek Aryan.

Data Scientist | ML Engineer | AI Engineer

Welcome to my Data Science Portfolio! I am a skilled data scientist proficient in ML and AI with more than 3.5 years of professional experience. Explore my projects showcasing expertise in predictive modeling, NLP, computer vision, and more.

Currently seeking full-time opportunities from May 2024.

Contact

+1 (346) 303-8568

Email

aryanravula@gmail.com

Location

Houston, Texas

About Me

My introduction

I am a Data Science master's student with 1 year of experience focused on Generative AI research, particularly in the Large Language Models space, and over 2.5 years of professional industry experience analyzing data and providing recommendations at product- and service-based companies. I am skilled in machine learning, statistics, data visualization, and generative AI. This blend of academic rigor and practical industry exposure makes me a versatile professional capable of navigating the complexities of data science. I am a highly dedicated problem solver, goal-oriented, an efficient team player, self-driven, and a fast learner.

What I may lack in years of experience, I compensate with my ability to grasp new tools and techniques quickly. This is evident in my journey from Civil Engineering to Business Intelligence Engineer and ultimately to a Data Scientist, gaining experience in generative AI and computer vision along the way.

3.5+ Years
experience
15+ Completed
projects
03+ Companies
worked

Domain Knowledge

Industries and domains I have worked in

Food Tech

Ad Tech

Fin Tech

E-Commerce

Explainable AI

User Experience and Customer Experience

Technical Skills

My technical level

Programming

Python

Advanced

PyTorch

Advanced

HTML

Proficient

CSS

Proficient

JavaScript

Proficient

Database

MySQL

Advanced

Redshift

Advanced

Snowflake

Advanced

Vector Database [FAISS, Weaviate]

Proficient

Graph Database [NebulaGraph, Neo4J]

Proficient

Analytical Tools

Tableau [Certified]

Advanced

PowerBI

Advanced

Microsoft Excel

Advanced

Microsoft PowerPoint

Advanced

Machine Learning Algorithms

Linear Regression

Logistic Regression

KNN

Decision Trees

Random Forest

Support Vector Machines

Apriori Algorithm

Dimensionality Reduction [PCA, SOM, t-SNE]

Deep Learning/Artificial Intelligence

Recurrent Neural Networks

LSTM

Transformers

Large Language Models

Convolutional Neural Networks

Object Detection

Image Segmentation

Pose Detection

Prompt Engineering

Retrieval Augmented Generation

Knowledge Graphs

DevOps Tools

Git

Advanced

Docker

Proficient

AWS

Proficient

Data Analysis & ML/DL Dependencies

NumPy

Pandas

SQL

OpenCV

Pillow

TensorFlow

PyTorch

Scikit-Learn

Experience

My personal journey
Education
Work Experience

M.S in Data Science

University of Houston - Main Campus
Houston, Texas
GPA: 3.93

Jan 2023 - May 2024 (Summa Cum Laude)

B.Tech in Civil Engineering

Manipal Institute of Technology
Manipal, India
Aug 2015 - Sept 2019

AI Research Assistant

Aiceberg - Houston, Texas
2023 - 2024
  • Fine-tuned large language models (such as Llama 2) on a custom dataset using LoRA and QLoRA fine-tuning methods, together with model quantization.
  • Grounded the fine-tuned LLM with external non-parametric knowledge through RAG and Knowledge Graphs, and generated quality responses through prompt engineering to tackle hallucination in LLMs.
  • Conducted experiments across multiple frameworks (LangChain, LlamaIndex), embedder models (SOTA embedders), and vector storage databases (from simple methods to FAISS) to optimize both the speed and quality of the process.
  • Researched techniques to enhance the interpretability and explainability of text-generative AI. Developed an end-to-end commercial integration from a cybersecurity perspective.
  • Developed and experimented with novel techniques for feature extraction using vector embeddings and prompt engineering.

Business Analyst

Meesho - Bangalore, India
2022 - 2023
  • Generated leads through data mining and measured the impact/performance of webinars and 1:1 training through A/B testing.
  • Provided business recommendations to improve supplier engagement.
  • Developed and maintained analytical dashboards utilized by stakeholders to track the L0 metrics of the Supplier Activation charter.

Business Analyst

Swiggy - Bangalore, India
2021 - 2022
  • Implemented A/B testing and normalizations to measure the impact/performance of in-house products or features on Chatbot (CRM) and formulate necessary business recommendations.
  • Improved the CPO by 15% by changing the nomenclature of a bot disposition. Reduced 95th percentile customer wait times during peak hours by 60% by balancing the load.
  • Utilized Power BI to develop and maintain smart, compelling analytical dashboards to monitor KPIs, identify trends, and monitor company initiatives and agents' performance.
  • Contributed to the formulation of various metrics (active agents) and the enhancement of a bot efficacy metric.
  • Conducted driver analysis on key metrics to identify potential improvement areas in the Swiggy Chatbot flow.
  • Collaborated with enterprise data warehouse, data governance, and business teams on data quality issues, as well as the architecture of data repositories or fact tables under my purview.

Product Data Analyst Trainee

Capital Float - Bangalore, India
2020 - 2021
  • Used tools such as Redshift, Python, Excel, and Power BI to define metrics, drive roadmaps, and provide data-driven insights to improve the Unsecured Business Loan product and drive growth.
  • Facilitated Root Cause Analysis (RCA) on incidents and produced statistics and reports to demonstrate performance.
  • Performed data mining to drive growth. Designed a one-stop source schema in the database that increased query efficiency.

Data Analyst Intern

Inmobi - Bangalore, India
2019 - 2020
  • Worked on structured data and performed statistical analysis to identify patterns and trends in the iDSP product.
  • Delivered insights and inferences that helped boost the business side of the company, using tools such as Microsoft Excel, Python, and SQL, and worked on customer relations.

Portfolio

Most recent work

Movie Recommendation System using LLMs

The Movie Recommendation System helps users find similar movies by entering a movie title. It provides a top 10 list of similar movies, ranked by weighted scores and popularity. Additionally, the system generates a brief summary of each recommended movie using the Phi-3 language model, offering an overview of the plot for each suggestion.

Workflow

Data:

The data was taken from the Movies Daily Update Dataset available on Kaggle. Additional director names were scraped from Wikipedia and The Movie Database website. The data is ingested, and recommendations are provided based on genre, keywords, cast, and director name.

Recommendations:

Previously, I used the Count Vectorization method to obtain embeddings, which essentially represents text as a vector of word counts. To improve on this, I generated embeddings using a Sentence Transformer, which captures context, leading to better semantic understanding and improved performance.
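The ranking step can be sketched in a few lines: embed the query movie, then rank the catalog by cosine similarity. The toy vectors below stand in for Sentence Transformer outputs; the titles and numbers are illustrative, not the project's actual data:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, catalog, k=3):
    """Rank catalog entries (title -> vector) by similarity to the query."""
    scored = sorted(catalog.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [title for title, _ in scored[:k]]

# Toy dense embeddings (in the real system these come from a Sentence Transformer).
catalog = {
    "Movie A": [0.9, 0.1, 0.0],
    "Movie B": [0.8, 0.2, 0.1],
    "Movie C": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], catalog, k=2))  # most similar first
```

In the deployed system this similarity search is delegated to Weaviate rather than computed in Python.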

Weaviate:

Used for storage and performing similarity search.

LangChain:

This framework was used to integrate Weaviate into the workflow and to create a pipeline that streamlines the process of generating summaries.

FastAPI:

Used to handle movie recommendations and summary generation.

Next.js:

Used to build the user interface for interacting with the recommendation system.

GitHub Link

Reddit Data Engineering Pipeline (ETL) with Apache Airflow

This project offers a robust data pipeline solution designed to efficiently extract, transform, and load (ETL) Reddit data into a Redshift data warehouse. Leveraging a blend of industry-standard tools and services, the pipeline ensures seamless data processing and integration.

This project covers the process of connecting Apache Airflow to a Reddit instance in the cloud. PRAW, the Python Reddit API Wrapper, was injected with Reddit API credentials to extract subreddit posts and metadata from the Reddit app, and this extraction was integrated into the Airflow setup. The Airflow environment was configured with a Celery backend and PostgreSQL for efficient task management and data storage. The Reddit data within Airflow was transferred to an S3 bucket. AWS Glue was utilized for data manipulation before querying and visualizing the data with Amazon Athena. Additionally, a data warehouse was set up on AWS using Amazon Redshift to demonstrate real-time data loading.
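The transform stage can be illustrated with a minimal sketch: flattening raw post dictionaries into CSV rows ready for an S3 upload. The field list here is a hypothetical subset; the real column set lives in the Glue/Redshift schema:

```python
import csv
import io

# Hypothetical subset of post fields kept for the warehouse.
FIELDS = ["id", "title", "score", "num_comments", "created_utc"]

def transform(posts):
    """Flatten raw post dicts into rows matching FIELDS, dropping extra keys."""
    return [{f: post.get(f) for f in FIELDS} for post in posts]

def to_csv(rows):
    """Serialize rows to a CSV string ready for an S3 upload."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

raw = [{"id": "abc", "title": "post", "score": 10, "num_comments": 2,
        "created_utc": 1700000000, "selftext": "dropped"}]
print(to_csv(transform(raw)))
```

In the actual pipeline, an Airflow task would run this kind of transform and hand the result to an S3 upload operator.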

GitHub Link

Transformers vs CNNs for Semantic Segmentation of Bridge Damages

Image segmentation serves as a pivotal tool in the detection of defects in bridges, offering a systematic and thorough approach to enhance the accuracy, efficiency, and timeliness of assessing structural integrity and ensuring the safety of critical infrastructure.

This project focuses on implementing the SegFormer-B5 and YOLOv8-m models to segment the different damage and object classes in the dacl10k dataset. The performance metric chosen was mean Intersection over Union (mIoU). The experiments show that YOLOv8-m trained for 70 epochs (0.312 mIoU) performs slightly better than SegFormer-B5 trained for 5 epochs (0.268 mIoU).
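For reference, mIoU averages the per-class intersection-over-union between predicted and ground-truth masks. A minimal sketch on flat per-pixel label lists (toy data, not dacl10k):

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, given flat lists of per-pixel class ids.
    Classes absent from both prediction and target are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Two-class toy example: prediction matches target on 3 of 4 pixels,
# giving IoU 2/3 for class 0 and 1/2 for class 1.
print(mean_iou([0, 1, 1, 0], [0, 1, 0, 0], num_classes=2))
```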

GitHub Link

Pedestrian Crossing Detection using YOLOv8

The safety of pedestrians in smart cities and advanced traffic management systems is of paramount concern today. This project focuses on developing a sophisticated pedestrian detection system aimed at enhancing road safety. This was achieved by fine-tuning the state-of-the-art single-shot object detection model YOLOv8s on the Pedestrian Intention Estimation (PIE) dataset, from which we extracted 4922 images containing three different classes of pedestrians crossing roads. To maintain high standards and yield the best results, the data was meticulously prepared and processed. Training was evaluated on a range of metrics such as bounding-box precision, recall, mAP50, mAP50-95, DFL loss, and classification loss. Our final model achieved a mAP50 of 0.590 and a mAP50-95 of 0.412. The fine-tuned model was then used to detect instances in an independent image dataset, simulating real-world scenarios in which pedestrians were or were not crossing the road with respect to the ego vehicle.

GitHub Link

Room Occupancy Estimation

The aim of the project is to estimate the occupancy of a room in the event of a “break and enter”. It is a classification task. Data from multiple non-intrusive environmental sensors, such as temperature, sound, CO2, and PIR (motion detection), were used to estimate occupancy without a computer vision solution. The structured data, comprising 10129 instances and 16 variables, were collected over a period of 4 days in a controlled manner. The target occupancy varies between 0 and 3 people. The two best models are a Random Forest Classifier with 99.56% accuracy and an MLP with 98.33% accuracy.

GitHub Link

Illuminating LLM Responses (XAI) with GPT-3.5 and T5

Leveraged the LangChain framework on GPT-3.5 to create high-quality QA downstream-task responses on custom documents. Fine-tuned the T5 LLM to generate headlines from responses and employed contextual (cosine) similarity on the embeddings of the T5-generated headline and the document title to retrieve the document that was used to generate the response.

This integrated framework represents a pioneering approach to document-specific QA tasks, showcasing the potential for advanced language models in nuanced information retrieval and synthesis.

GitHub Link

Keyword extraction using Attention mechanism in BERT

Leveraging the power of BERT (Bidirectional Encoder Representations from Transformers), I harnessed the contextual embeddings from the final hidden layer of its output. This strategic choice allowed for a nuanced understanding of the text body, capturing intricate contextual nuances. By employing this information, I successfully developed a robust system to automatically extract the top 5 keywords from any given text.

This approach not only enhanced keyword extraction accuracy but also demonstrated the efficacy of utilizing contextual embeddings for extracting meaningful insights from textual data.
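One simple way to turn a BERT attention matrix into keyword scores is to rank tokens by the average attention they receive from the rest of the sequence. The sketch below uses a hand-made matrix and a tiny stopword set; it illustrates the idea only, not the project's exact scoring:

```python
def top_keywords(tokens, attention, k=5, stopwords=frozenset({"the", "a", "of"})):
    """Score each token by the mean attention it receives from all tokens
    (one head's weight matrix), then return the k highest-scoring tokens."""
    n = len(tokens)
    scores = {}
    for j, tok in enumerate(tokens):
        if tok.lower() in stopwords:
            continue
        received = sum(attention[i][j] for i in range(n)) / n
        scores[tok] = max(scores.get(tok, 0.0), received)  # dedupe repeats
    return sorted(scores, key=scores.get, reverse=True)[:k]

tokens = ["the", "transformer", "uses", "attention"]
# Toy 4x4 attention matrix: rows attend to columns, each row sums to 1.
attention = [
    [0.1, 0.5, 0.1, 0.3],
    [0.1, 0.4, 0.1, 0.4],
    [0.1, 0.3, 0.2, 0.4],
    [0.1, 0.5, 0.1, 0.3],
]
print(top_keywords(tokens, attention, k=2))
```

With a real model, `attention` would come from BERT's attention outputs (or scores derived from final-layer embeddings) rather than a hand-written matrix.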

GitHub Link

Hotdog or Not Hotdog

The project culminated in the creation of a highly accurate custom image classification model, boasting an impressive accuracy rate of 93.67%. This achievement was made possible by leveraging the powerful Inception V3 architecture, which played a pivotal role in extracting intricate features from the images.

The model's accuracy underscores the efficacy of Inception V3, showcasing the potential of tailored models for achieving precision in diverse visual recognition tasks.

GitHub Link

Resume Parser

The project involved the development of a custom Named Entity Recognition (NER) model utilizing spaCy. The model was meticulously trained on a diverse dataset comprising examples of both hard skills and soft skills. Its primary objective was to adeptly extract skills from PDF files, demonstrating a versatile application for automated information extraction.

The tailored NER model not only showcased the capabilities of spaCy in accurately identifying and categorizing entities but also highlighted the project's significance in streamlining the extraction of valuable information from unstructured data sources like PDF documents.
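spaCy's NER training examples pair raw text with character-offset entity spans. Below is a minimal sketch of that format, with illustrative labels and sentences rather than the project's dataset, plus a sanity check that the offsets point at the intended strings:

```python
# spaCy-style NER training examples: (text, {"entities": [(start, end, label)]}).
# The labels and sentences here are illustrative, not the project's data.
TRAIN_DATA = [
    ("Proficient in Python and SQL",
     {"entities": [(14, 20, "HARD_SKILL"), (25, 28, "HARD_SKILL")]}),
    ("Strong communication and teamwork",
     {"entities": [(7, 20, "SOFT_SKILL"), (25, 33, "SOFT_SKILL")]}),
]

def span_texts(example):
    """Recover the surface strings the (start, end) offsets point at --
    a quick sanity check before feeding examples to the trainer."""
    text, annotations = example
    return [text[start:end] for start, end, _ in annotations["entities"]]

print(span_texts(TRAIN_DATA[0]))
```

Misaligned offsets are the most common failure mode when building such datasets, so validating spans like this before training saves debugging time later.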

GitHub Link

HackerEarth: Adopt a buddy Challenge

The project entailed constructing a robust multi-label machine learning model, combining the strengths of the XGBoost and LightGBM algorithms by stacking the models. This dynamic ensemble approach was used to determine both the type and breed of the animals in the dataset.

The model's exceptional performance not only secured accurate predictions but also helped me achieve the top 8% on the project leaderboard. This success underscores the effectiveness of the stacked ensemble strategy and its practical application in achieving high-ranking results in complex classification tasks.
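The mechanics of stacking can be sketched without the real libraries: base-model predictions become the features of a meta-model. The plain functions below are stand-ins for the trained XGBoost and LightGBM classifiers and the learned meta-learner:

```python
def stack_predictions(base_models, meta_model, X):
    """Stacking: each base model predicts on every sample, and those
    predictions form the feature vector fed to the meta-model."""
    meta_features = [[model(x) for model in base_models] for x in X]
    return [meta_model(feats) for feats in meta_features]

# Toy stand-ins: two weak rules, and a meta-rule that requires agreement.
rule_a = lambda x: 1 if x[0] > 0.5 else 0
rule_b = lambda x: 1 if x[1] > 0.5 else 0
meta = lambda feats: 1 if sum(feats) >= 2 else 0

X = [[0.9, 0.8], [0.9, 0.1], [0.2, 0.2]]
print(stack_predictions([rule_a, rule_b], meta, X))  # [1, 0, 0]
```

In practice the meta-learner is itself trained on out-of-fold base predictions to avoid leaking the training labels.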

GitHub Link

Content-Based Movie Recommender System

The project involved the development of an end-to-end movie recommendation system utilizing Cosine Similarity, with data sourced through API calls to the TMDB website. This comprehensive system allowed users to receive personalized movie suggestions based on their preferences. To make it accessible, I encapsulated the recommendation engine into a user-friendly web app using Flask. Furthermore, the deployment was seamlessly executed on the Heroku platform, ensuring widespread accessibility and convenience for movie enthusiasts seeking tailored film suggestions.

GitHub Link