CodeParser Tutorial: Extract Clean Data From Scripts

Automating data extraction from raw source code is a common challenge for software engineers and data scientists. Whether you are auditing security compliance, generating documentation, or migrating legacy software, parsing code manually is inefficient and error-prone.
This tutorial shows how to use CodeParser, an open-source library that converts raw scripts into structured data such as JSON or CSV.

Understanding CodeParser
CodeParser reads source code files and translates them into an Abstract Syntax Tree (AST). Instead of treating code as plain text, CodeParser recognizes programming logic, structural blocks, variables, and functions.

Key Capabilities
Multi-Language Support: Parses Python, JavaScript, TypeScript, Java, and C++.
Syntax Isolation: Extracts comments, docstrings, or function names while ignoring execution logic.
Format Exporting: Converts extracted code elements directly into clean JSON, CSV, or pandas DataFrames.

Step 1: Installation and Setup
CodeParser requires Python 3.8 or higher. Install the core package and its language dependencies via pip:

    pip install codeparser-lib
To verify the installation, run a quick version check in your terminal:

    codeparser --version

Step 2: Basic Extraction (Functions and Classes)
Let’s extract all function names and their arguments from a target Python script named app.py.

The Target Script (app.py)
    def calculate_metrics(data_stream, threshold=0.5):
        """Calculates systemic threshold metrics."""
        return [x for x in data_stream if x > threshold]

    class DataProcessor:
        def process_node(self, node_id):
            pass

The CodeParser Script
Create a new file named parse_script.py and add the following code:
    import codeparser
    import json

    # Initialize the parser for Python
    parser = codeparser.LanguageParser(language="python")

    # Load and parse the file
    tree = parser.parse_file("app.py")

    # Query the AST for function metadata
    functions_data = []
    for node in tree.find_all("function"):
        functions_data.append({
            "name": node.name,
            "arguments": [arg.name for arg in node.arguments],
            "line_number": node.start_line
        })

    # Output clean JSON data
    print(json.dumps(functions_data, indent=4))

The Output

Running the script yields clean, structured metadata:
    [
        {
            "name": "calculate_metrics",
            "arguments": ["data_stream", "threshold"],
            "line_number": 1
        },
        {
            "name": "process_node",
            "arguments": ["self", "node_id"],
            "line_number": 6
        }
    ]
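As a cross-check on the output above, the same function inventory can be produced with Python's built-in ast module. This sketch is a stand-in that uses only the standard library, not CodeParser's API; the SOURCE string mirrors app.py, and the helper name list_functions is hypothetical.

```python
import ast
import json

# Same content as app.py, embedded so the example is self-contained.
SOURCE = '''\
def calculate_metrics(data_stream, threshold=0.5):
    """Calculates systemic threshold metrics."""
    return [x for x in data_stream if x > threshold]

class DataProcessor:
    def process_node(self, node_id):
        pass
'''


def list_functions(source):
    """Return name, positional arguments, and start line for every function."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # Methods are FunctionDef nodes too, so process_node is included.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append({
                "name": node.name,
                "arguments": [arg.arg for arg in node.args.args],
                "line_number": node.lineno,
            })
    # ast.walk does not promise source order, so sort by line number.
    return sorted(found, key=lambda item: item["line_number"])


functions_data = list_functions(SOURCE)
print(json.dumps(functions_data, indent=4))
```

Run against this source, the sketch reports the same names, arguments, and line numbers (1 and 6) as the JSON above. Note that node.args.args covers only regular positional parameters; keyword-only arguments, *args, and **kwargs live in other fields of ast.arguments.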
Step 3: Advanced Extraction (Filtering Comments and Docstrings)
Data extraction often requires isolating documentation from code to analyze developer notes, perform sentiment analysis, or build LLM training datasets.
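To make the idea of isolating documentation concrete, here is a minimal sketch that uses Python's standard ast module rather than CodeParser. It assumes the same app.py content (embedded as SOURCE), and collect_docstrings is a hypothetical helper name.

```python
import ast

# Same content as app.py, embedded so the example is self-contained.
SOURCE = '''\
def calculate_metrics(data_stream, threshold=0.5):
    """Calculates systemic threshold metrics."""
    return [x for x in data_stream if x > threshold]

class DataProcessor:
    def process_node(self, node_id):
        pass
'''

# Node types that can carry a docstring.
DOCSTRING_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def collect_docstrings(source):
    """Return the docstrings found in source, skipping nodes without one."""
    docstrings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DOCSTRING_OWNERS):
            text = ast.get_docstring(node)  # None when no docstring is present
            if text is not None:
                docstrings.append(text)
    return docstrings


docstrings = collect_docstrings(SOURCE)
print("Extracted Documentation:", docstrings)
```

Only calculate_metrics carries a docstring in this source, so the list has a single entry. The ast module discards # comments during parsing, so recovering those with the standard library would need the tokenize module instead. The tutorial's CodeParser version of the docstring query follows.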
    import codeparser

    parser = codeparser.LanguageParser(language="python")
    tree = parser.parse_file("app.py")

    # Extract only the docstrings
    docstrings = [node.text for node in tree.find_all("docstring")]
    print("Extracted Documentation:", docstrings)
    # Output: Extracted Documentation: ['Calculates systemic threshold metrics.']

Step 4: Exporting to CSV for Data Analysis
For large-scale codebases, you can aggregate data across hundreds of scripts and export them directly to a CSV file for analytical review.
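The aggregation step can be sketched with the standard library alone, using ast for parsing and csv for export instead of CodeParser and pandas. Everything here is illustrative: the two sample files, the temporary directory, and the helper name collect_function_metrics are assumptions made so the sketch runs on its own.

```python
import ast
import csv
import tempfile
from pathlib import Path


def collect_function_metrics(directory):
    """Return one row per function found in the .py files of directory."""
    rows = []
    for path in sorted(Path(directory).glob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                rows.append({
                    "File": path.name,
                    "Function": node.name,
                    # end_lineno is inclusive, hence the + 1.
                    "LineCount": node.end_lineno - node.lineno + 1,
                })
    return rows


with tempfile.TemporaryDirectory() as workdir:
    # Two tiny sample scripts standing in for a real codebase.
    Path(workdir, "a.py").write_text("def one():\n    return 1\n", encoding="utf-8")
    Path(workdir, "b.py").write_text(
        "def two(x):\n    y = x * 2\n    return y\n", encoding="utf-8"
    )

    rows = collect_function_metrics(workdir)

    # Write the rows to CSV, then read the file back to confirm its contents.
    out_path = Path(workdir, "codebase_metrics.csv")
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["File", "Function", "LineCount"])
        writer.writeheader()
        writer.writerows(rows)
    csv_text = out_path.read_text(encoding="utf-8")

print(csv_text)
```

The tutorial's own version, using CodeParser and pandas, follows.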
    import pandas as pd
    import codeparser
    import glob

    parser = codeparser.LanguageParser(language="python")
    all_extracted_data = []

    # Scan all Python files in the current directory
    for filepath in glob.glob("*.py"):
        tree = parser.parse_file(filepath)
        for node in tree.find_all("function"):
            all_extracted_data.append({
                "File": filepath,
                "Function": node.name,
                # + 1 assumes end_line is inclusive; without it the count is one short
                "LineCount": node.end_line - node.start_line + 1
            })

    # Convert to a DataFrame and export
    df = pd.DataFrame(all_extracted_data)
    df.to_csv("codebase_metrics.csv", index=False)

Conclusion
CodeParser simplifies the process of treating source code as structured data. By abstracting away complex AST manipulations, it allows you to write simple queries to pull clean text, variable names, and architectural maps out of raw scripts.