Statistical Natural Language Processing

Text Classification & Sentiment Analysis of a reviews dataset
  • Designed bag-of-words and N-gram features, applied TF-IDF weighting, trained a Logistic Regression model, and used the learned model parameters to explain its predictions.
  • Applied Naive Bayes Model for sentiment analysis.
  • Applied Support Vector Machine for sentiment analysis.
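A minimal sketch of the TF-IDF weighting step listed above, using only the standard library. It assumes one common variant (relative term frequency times log(N / document frequency)); the function name `tfidf` and the toy reviews are illustrative and are not taken from the project.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical toy corpus, for illustration only.
reviews = [
    "great food great service".split(),
    "terrible food".split(),
    "great value".split(),
]
vectors = tfidf(reviews)
```

In the second review, "terrible" receives a higher weight than "food" because it appears in fewer documents. The resulting sparse vectors are the kind of input a linear classifier such as Logistic Regression would consume.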
Sequence Tagging & Named Entity Recognition
  • Preprocessed the training set, built an N-gram language model, extracted the transition and emission parameters of a Hidden Markov Model (HMM), and applied the Viterbi algorithm to extract the optimal tag sequence; achieved a best F1 score of 0.286.
  • Applied a Maximum Entropy Markov Model (MEMM) to build a log-linear tagger, designed features manually, and used the Viterbi algorithm to extract the optimal tag sequence; achieved a best F1 score of 0.398.
  • Applied a bidirectional Long Short-Term Memory (Bi-LSTM) network with a Conditional Random Field (CRF) layer; achieved a best F1 score of 0.625.
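A minimal sketch of Viterbi decoding for an HMM tagger, as used in the first bullet above. The tag set, probabilities, and sentence are made-up toy values, and the function signature is an assumption for illustration, not the project's code. Log-probabilities are used to avoid underflow on long sequences.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs under an HMM."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the best predecessor state for s.
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev
    # Follow back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy model: "PER" marks a person name, "O" anything else.
states = ["O", "PER"]
start_p = {"O": 0.8, "PER": 0.2}
trans_p = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.6, "PER": 0.4}}
emit_p = {"O": {"met": 0.5, "yesterday": 0.5}, "PER": {"alice": 0.6, "bob": 0.4}}
tags = viterbi(["alice", "met", "bob"], states, start_p, trans_p, emit_p)
```

The same dynamic program applies to the MEMM tagger if the transition and emission terms are replaced by the log-linear model's local scores.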
Machine Translation
  • Implemented IBM Model 1 and Model 2 and compared their performance on a French-English dataset from the European Parliament; achieved a best F1 score of 0.449 with Model 2.
  • Implemented a Transformer-based sequence-to-sequence model.
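A minimal sketch of the EM training loop for IBM Model 1, which estimates word translation probabilities t(f | e) from sentence pairs. The function name, the `<NULL>` token, and the three-sentence corpus are illustrative assumptions; the project's own implementation and data handling may differ.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) from (french_tokens, english_tokens) pairs with EM."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialization of the translation table, keyed by (f, e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all
        for fs, es in pairs:
            es = ["<NULL>"] + es    # allow alignment to an empty word
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm  # E-step: alignment posterior
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Hypothetical toy corpus, for illustration only.
corpus = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une fleur".split(), "a flower".split()),
]
t = train_ibm_model1(corpus)
```

After a few iterations, "la" is explained mostly by "the", which in turn pushes "maison" toward "house". Model 2 extends this scheme with position-dependent alignment probabilities.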
    Graduate Networked System

    Distributed & Fault-tolerant SurfStore
    A cloud-based file storage system patterned on Dropbox that can survive server failures, datacenter failures, and network failures. Written in Python.
  • Implemented a BlockStore service: divided content of files into chunks/blocks, used SHA-256 to get a unique identifier, stored identifier and block as key-value pairs, implemented get/put operations.
  • Implemented a MetadataStore service: stored each filename with its version number and hashlist as key-value pairs, implemented get/update operations.
  • Implemented the client to sync new or missing files to and from the cloud, sync locally or remotely updated files to and from the cloud, and handle concurrent modifications from multiple clients.
  • Modified the MetadataStore service to be fault tolerant based on the Raft protocol.
  • [CODE]
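A minimal in-memory sketch of the BlockStore idea described above: file content is split into blocks, each block is keyed by its SHA-256 digest, and a file is represented by its list of digests (the hashlist). The class and function names and the 4096-byte block size are assumptions for illustration, not the project's actual interface.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes

class BlockStore:
    """Content-addressed key-value store: SHA-256 hex digest -> block bytes."""

    def __init__(self):
        self._blocks = {}

    def put_block(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        self._blocks[digest] = block  # identical blocks share one entry
        return digest

    def get_block(self, digest: str) -> bytes:
        return self._blocks[digest]

def upload(store, data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into blocks, store each one, and return the hashlist."""
    return [
        store.put_block(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]

def download(store, hashlist):
    """Reassemble file content from its hashlist."""
    return b"".join(store.get_block(h) for h in hashlist)

store = BlockStore()
data = b"a" * 10000
hashlist = upload(store, data)
```

Because blocks are addressed by content, the two identical full blocks in this example are stored only once, and a MetadataStore only needs to keep the hashlist and a version number per filename.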
    TritonHTTP
    A web server that implements a subset of the HTTP/1.1 protocol, which we call TritonHTTP. Written in C++.
  • Implemented a client and a server based on a stream-oriented TCP protocol.
  • Added features such as reusing a TCP connection across requests and replies, using the HTTP message format for requests and responses, and supporting pipelined requests.
  • Implemented error codes and a custom error page to handle various types of errors.
  • [CODE]
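A minimal sketch of how pipelined requests can be pulled out of a TCP byte stream. The project itself is in C++; this illustration is in Python, assumes requests carry no body (so each request ends at the first blank line), and uses hypothetical function names rather than the project's own.

```python
def split_requests(buffer: bytes):
    """Split a receive buffer into complete header blocks and leftover bytes."""
    # Every element except the last was terminated by a blank line (CRLF CRLF).
    *complete, rest = buffer.split(b"\r\n\r\n")
    return complete, rest

def parse_request(raw: bytes):
    """Parse one header block into (method, url, version, headers)."""
    lines = raw.decode("ascii").split("\r\n")
    method, url, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    return method, url, version, headers

# Two complete pipelined requests followed by a partial third one.
buffer = (
    b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
    b"GET /a.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
    b"GET /par"
)
complete, rest = split_requests(buffer)
requests = [parse_request(raw) for raw in complete]
```

The leftover bytes in `rest` would be kept and prepended to the next read from the socket, which is what lets one connection serve several requests in sequence. A malformed request line raises `ValueError` here, which a server would map to an error response.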

    Computer Vision

    Pattern Recognition of Images
    This is a class project for differentiating the foreground and background of a cheetah image.
  • Applied Maximum Likelihood Estimation for a multivariate Gaussian and performed feature selection using the Bhattacharyya coefficient.
  • Applied Bayesian Estimation with a Gaussian likelihood and a Gaussian prior under different prior knowledge.
  • Applied the Expectation Maximization algorithm to Gaussian mixture models with varying numbers of mixture components and feature dimensions.
  • [CODE]
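A minimal sketch of feature selection with the Bhattacharyya coefficient. It is a simplification of the approach listed above: each feature is modeled per class as a univariate Gaussian fitted by maximum likelihood, whereas the project used multivariate Gaussians. The function names and sample values are hypothetical.

```python
import math

def gaussian_mle(xs):
    """Maximum likelihood mean and variance of a univariate Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # MLE divides by n, not n - 1
    return mean, var

def bhattacharyya_coefficient(m1, v1, m2, v2):
    """Overlap between two univariate Gaussians: 1 = identical, near 0 = separable."""
    dist = (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2 * math.sqrt(v1 * v2))))
    return math.exp(-dist)

def rank_features(foreground, background):
    """Sort feature names from most to least discriminative (lowest overlap first)."""
    scores = {
        name: bhattacharyya_coefficient(*gaussian_mle(foreground[name]),
                                        *gaussian_mle(background[name]))
        for name in foreground
    }
    return sorted(scores, key=scores.get)

# Hypothetical per-class samples for two features.
foreground = {"f1": [0.9, 1.0, 1.1, 1.2], "f2": [0.4, 0.5, 0.6, 0.5]}
background = {"f1": [0.1, 0.2, 0.0, 0.1], "f2": [0.5, 0.4, 0.6, 0.5]}
ranking = rank_features(foreground, background)
```

Features whose class-conditional densities overlap least (smallest coefficient) are kept as the most useful for separating foreground from background.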
    Image denoising
  • Implemented a fully convolutional neural network (DnCNN) with a skip connection between the first and last layers, achieving a peak signal-to-noise ratio (PSNR) of 29.1651.
  • Implemented a U-Net-like DnCNN with pooling and unpooling operations, achieving a PSNR of 29.1381.
  • Implemented a U-Net-like DnCNN with dilated convolutions, achieving a PSNR of 29.1664.
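For reference, a minimal sketch of how PSNR is computed from the mean squared error between a clean image and a denoised estimate. Images are flattened to lists of pixel values here, and `max_value` (the peak pixel value) is an assumption: the pixel range used in the project is not stated.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Hypothetical pixel values in [0, 1].
clean = [0.5, 0.5, 0.5, 0.5]
denoised = [0.6, 0.4, 0.6, 0.4]
score = psnr(clean, denoised, max_value=1.0)
```

Higher values mean the estimate is closer to the reference. Because the scale is logarithmic, the small gaps between the three models above correspond to small relative changes in mean squared error.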
    Pet Adoption Speed Prediction
    This is a Kaggle competition.
  • Text feature extraction: Used the Google Natural Language API to acquire sentiment features. Built a TF-IDF weighted bag-of-words model. Used pre-trained fastText word embedding vectors.
  • Image feature extraction: Used a Cascade Object Detector to locate the pet's body in each image, then fine-tuned AlexNet, VGG16, and DenseNet to extract image features.
  • Classification: Implemented Extreme Gradient Boosting (XGB) and Light Gradient Boosting Machine (LightGBM) for classification of the combined features for more than 150k entities.
  • [PDF]

    Sensing and Estimation in Robotics

    SLAM
    SLAM is short for simultaneous localization and mapping.
  • Used NumPy to process and analyze data collected from an encoder, an IMU, a Hokuyo LIDAR, an RGB-D camera, and a stereo camera at nearly 5000 timestamps on a differential drive robot.
  • Implemented localization prediction, localization update, and (occupancy grid/texture) mapping using a Particle Filter with resampling, based on a differential drive motion model and an RGB-D camera/Hokuyo LIDAR observation model.
  • Implemented localization prediction and localization and landmark updates using an Extended Kalman Filter, based on an IMU kinematics motion model and a stereo camera observation model.
  • [PDF], [CODE]
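A minimal sketch of the resampling step of a particle filter, using systematic (low-variance) resampling, which is one common scheme and not necessarily the one used in the project. The function names are hypothetical, and the weights are assumed to be normalized to sum to 1.

```python
import random

def effective_sample_size(weights):
    """Number of particles that effectively carry weight (weights sum to 1)."""
    return 1.0 / sum(w * w for w in weights)

def systematic_resample(particles, weights, rng=random):
    """Draw len(particles) particles with probability proportional to weight."""
    n = len(particles)
    step = 1.0 / n
    u = rng.random() * step  # single random offset in [0, step)
    resampled = []
    cumulative = weights[0]
    i = 0
    for _ in range(n):
        # Advance to the particle whose cumulative weight interval contains u.
        while u >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        resampled.append(particles[i])
        u += step
    # After resampling, all particles carry equal weight.
    return resampled, [step] * n

# Hypothetical degenerate particle set: all weight sits on one pose hypothesis.
particles = ["pose_a", "pose_b", "pose_c", "pose_d"]
weights = [0.0, 0.0, 1.0, 0.0]
if effective_sample_size(weights) < len(particles) / 2:
    particles, weights = systematic_resample(particles, weights, random.Random(0))
```

Triggering resampling only when the effective sample size drops below a threshold (half the particle count here, an assumed value) is a common way to avoid discarding particle diversity unnecessarily.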