
NLP Portfolio

Jakiel David 4-23-26 P7: Portfolio Presentation

1. NLP Techniques Implemented

This project implemented core NLP techniques to turn raw web text into usable structured data. The pipeline scraped HTML content with requests.get() in stage01_extract.py. After extraction, the HTML was parsed with BeautifulSoup in stage02_validate_jak.py to locate structured elements such as the title, authors, abstract, and subject. Text cleaning and normalization were completed in stage03_transform_jak.py: converting text to lowercase, removing punctuation, filtering stopwords with spaCy, and normalizing whitespace. The cleaned text was then tokenized and used to generate derived features such as token counts, unique word counts, and vocabulary richness (type-token ratio). These preprocessing steps prepared the text for analysis.

https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage01_extract.py
https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage02_validate_jak.py
https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage03_transform_jak.py
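A minimal end-to-end sketch of these steps, assuming the arXiv page layout and an installed spaCy model (en_core_web_sm); the selectors and variable names here are illustrative, not the exact code in the linked stages:

```python
import requests
import spacy
from bs4 import BeautifulSoup

URL = "https://arxiv.org/abs/2602.20021"  # target page from config_jak.py

# Stage 1-style extraction: fetch the raw HTML.
resp = requests.get(URL, timeout=30)
resp.raise_for_status()
html = resp.text

# Stage 2-style parsing: locate structured elements with BeautifulSoup.
# The selectors below are guesses at the arXiv layout, not the repo's exact ones.
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", class_="title")
abstract = soup.find("blockquote", class_="abstract")
title_text = title.get_text(" ", strip=True) if title else ""
raw_text = abstract.get_text(" ", strip=True) if abstract else ""

# Stage 3-style cleaning: lowercase, drop punctuation and stopwords via spaCy.
nlp = spacy.load("en_core_web_sm")
doc = nlp(raw_text.lower())
tokens = [t.text for t in doc if t.is_alpha and not t.is_stop]

# Derived features: token count, unique word count, type-token ratio.
token_count = len(tokens)
unique_count = len(set(tokens))
ttr = unique_count / token_count if token_count else 0.0
print(title_text, token_count, unique_count, round(ttr, 3))
```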

2. Systems and Data Sources

The system analyzed structured HTML content from https://arxiv.org/abs/2602.20021. The raw HTML was parsed into structured elements with BeautifulSoup and then converted into a DataFrame with pandas. The pipeline targets web pages; the URL is defined in config_jak.py. It handles the transition from unstructured HTML (web) to semi-structured JSON (intermediate stage) and finally to structured DataFrames/CSV (load stage). In stage02_validate_jak.py, the system checks for empty strings or failed fetches so that "messy" web data (like pages with no readable text) doesn't crash the subsequent transformation stages.

https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/config_jak.py
https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage02_validate_jak.py
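A hedged sketch of that validation step, with hypothetical file paths; it refuses to hand empty or missing pages to the transform stage and emits the intermediate JSON record described above:

```python
import json
from pathlib import Path

def validate_raw_html(path: Path) -> dict:
    """Illustrative version of the stage02 checks described above:
    refuse to pass empty or missing pages downstream."""
    if not path.exists():
        raise FileNotFoundError(f"fetch failed: {path} was never written")
    html = path.read_text(encoding="utf-8")
    if not html.strip():
        raise ValueError(f"{path} is empty: no readable text to transform")
    return {"source": str(path), "chars": len(html)}

# Hand the validated record to the next stage as semi-structured JSON.
record = validate_raw_html(Path("data/raw/page.html"))  # hypothetical path
out = Path("data/intermediate/page.json")               # hypothetical path
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(record, indent=2), encoding="utf-8")
```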

3. Pipeline Structure (EVTAL)

The project followed an EVTAL (Extract, Validate, Transform, Analyze, Load) pipeline structure.

• Extract: In stage01_extract.py, HTML data was pulled from https://arxiv.org/abs/2602.20021 and saved locally.
• Validate: In stage02_validate_jak.py, required HTML elements were checked to confirm that the page structure contained the expected fields before moving forward.
• Transform: In stage03_transform_jak.py, metadata fields were extracted, text was cleaned, tokens were generated, and additional features such as token counts and author counts were created.
• Analyze: In stage04_analyze_jak.py, frequency-based analysis and visualizations such as bar charts and word clouds were generated.
• Load: In stage05_load.py, the final structured dataset was saved as a CSV file for reuse and reporting.

https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage01_extract.py
https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage02_validate_jak.py
https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage03_transform_jak.py
https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage04_analyze_jak.py
https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/stage05_load.py
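A self-contained sketch of the EVTAL ordering; the real repo splits these steps across the five stage modules linked above, and its function names may differ from the placeholders here:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def extract(url: str) -> str:
    """Stage 1: fetch raw HTML and return it."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text


def validate(html: str) -> str:
    """Stage 2: fail fast if the page has no readable content."""
    if not html.strip():
        raise ValueError("page had no readable text")
    return html


def transform(html: str) -> dict:
    """Stage 3 (simplified): strip markup, tokenize, derive features."""
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True).lower()
    tokens = text.split()
    return {"tokens": len(tokens), "unique": len(set(tokens))}


def analyze(features: dict) -> dict:
    """Stage 4 (simplified): add a vocabulary-diversity signal."""
    features["ttr"] = features["unique"] / features["tokens"]
    return features


def load(features: dict, path: str) -> None:
    """Stage 5: persist the structured result as CSV."""
    pd.DataFrame([features]).to_csv(path, index=False)


if __name__ == "__main__":
    url = "https://arxiv.org/abs/2602.20021"  # target from config_jak.py
    load(analyze(transform(validate(extract(url)))), "pipeline_output.csv")
```

Chaining the stages this way makes the ordering explicit: each stage consumes only the previous stage's output, which is what lets the repo keep them in separate, independently testable modules.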

4. Signals and Analysis Methods

The pipeline generated several useful text-based signals. Word frequency counts were calculated with Python's collections.Counter to identify the most common tokens. Additional metrics included total token counts, unique token counts, and the type-token ratio, which provided insight into vocabulary diversity. Visualizations such as frequency bar charts, word clouds, and scatter plots highlighted patterns in token usage and word structure. These outputs made it easier to interpret trends and confirm that the preprocessing steps were working correctly.

https://github.com/ss2jak/nlp-06-nlp-pipeline/tree/main/data/processed
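A small sketch of these frequency signals, using collections.Counter and a matplotlib bar chart; the toy token list stands in for the pipeline's cleaned tokens:

```python
from collections import Counter

import matplotlib.pyplot as plt

# Stand-in for the cleaned token list produced by the transform stage.
tokens = ["model", "data", "model", "training", "data", "model"]

freq = Counter(tokens)
top = freq.most_common(10)

# Vocabulary-diversity metrics reported by the pipeline.
total = sum(freq.values())
unique = len(freq)
ttr = unique / total
print(f"tokens={total} unique={unique} type-token ratio={ttr:.3f}")

# Frequency bar chart, similar in spirit to the stage04 visualizations.
words, counts = zip(*top)
plt.bar(words, counts)
plt.title("Top tokens")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("top_tokens.png")
```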

5. Insights

The analysis revealed clear patterns in topic-related vocabulary, with domain-specific words appearing frequently across the cleaned text. This confirmed that the normalization and stopword-removal steps reduced noise while preserving meaningful information. One useful insight was how the vocabulary-diversity metrics helped identify repeated terminology, which is common in technical or research-style writing. The visual outputs also provided fast confirmation that the extracted text matched the expected subject themes. Overall, this project demonstrated how unstructured web content can be transformed into structured data using an EVTAL pipeline. The results showed how combining web scraping, NLP preprocessing, and visualization can produce meaningful insights from textual data in a reproducible workflow.

https://github.com/ss2jak/nlp-06-nlp-pipeline/blob/main/src/nlp/pipeline_web_html.py

6. Representative Work

https://github.com/ss2jak/nlp-03--text-exploration/blob/main/notebooks%20copy/nlp_corpus_fst_jak.ipynb
This project shows skills in working with APIs, handling JSON data, and cleaning and transforming text into structured formats, which are core tasks in real-world analytics workflows. It demonstrates the ability to automate data collection and build repeatable pipelines, which improves efficiency and reliability. These skills are representative of a good analyst because they show the ability to gather reliable data, prepare it correctly, and make it usable for analysis and decision-making.

https://github.com/ss2jak/nlp-03--text-exploration/blob/main/notebooks%20copy/nlp_corpus_fst_jak.ipynb
This project shows skills in text exploration, tokenization, and frequency analysis using corpus data, which are foundational techniques in natural language processing. It demonstrates the ability to analyze large text datasets and identify patterns such as common words and vocabulary structure, which is key for understanding textual data. These skills represent a good analyst because they show the ability to explore raw data, find meaningful patterns, and prepare text for deeper analysis or modeling.

https://github.com/ss2jak/nlp-06-nlp-pipeline/tree/main/src/nlp
This project shows skills in building a full NLP pipeline, including extracting data, validating it, transforming text, analyzing patterns, and loading results into structured outputs. It demonstrates the ability to organize workflows into repeatable stages, which is important because NLP pipelines typically involve step-by-step processing such as cleaning text and generating features. These skills represent a good analyst because they show the ability to manage complex data workflows, ensure data quality, and produce reliable results that can be reused for reporting or decision-making.

7. Skills

Skills demonstrated in the linked files:

• Process data in Python, including reading, cleaning, and transforming text files and DataFrames
• Work with real-world text data from HTML pages, APIs, and raw corpora
• Extract and structure information such as titles, metadata, and abstract content from unstructured sources
• Clean messy data by handling missing values, removing noise, and normalizing text (stopwords, punctuation, casing)
• Perform text analysis, including tokenization, frequency counts, and basic feature engineering
• Build repeatable NLP pipelines using staged, modular scripts (extract, validate, transform, analyze, load)
• Create simple visualizations (word frequencies, word clouds, charts) to communicate patterns in text
• Document and present work professionally using Markdown and an organized GitHub-style project structure