Working with CSV and JSON Data: Handling Real-World Information
Every day, millions of data files flow through the internet. Banking systems, e-commerce platforms, weather forecasts, and medical records all rely on structured data formats like CSV and JSON. This chapter teaches you to read, parse, transform, and generate these data formats—essential skills for any data scientist, backend engineer, or data analyst.
Understanding CSV: The Universal Data Format
CSV (Comma-Separated Values) is one of the most widely used data formats. Excel spreadsheets, government datasets, and database exports can all be saved as CSV. Here's what the raw file actually looks like:
# File: student_marks.csv (actual content)
roll_no,name,math_score,science_score,english_score,grade
101,Raj,92,88,85,8
102,Priya,88,95,90,8
103,Amit,78,82,79,8
104,Neha,95,93,94,8
105,Arjun,70,68,72,8
# Notice: First row is header, data follows
# Commas separate values, newlines separate rows
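One subtlety worth noting before we read any files: if a value itself contains a comma, it must be wrapped in double quotes, and Python's csv module handles this quoting automatically in both directions. A minimal sketch (the name below is made up for illustration):

```python
import csv
from io import StringIO

# A field containing a comma must be quoted in the file;
# csv.reader un-quotes it for us on the way in.
row_with_comma = '"Kumar, Raj",92\n'
reader = csv.reader(StringIO(row_with_comma))
print(next(reader))  # ['Kumar, Raj', '92']
```

This is why hand-splitting lines with `line.split(',')` is unreliable: it would break the quoted name into two fields.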
Reading CSV Files with Different Approaches
Learn multiple techniques for different scenarios:
import csv

# Approach 1: Simple reading (list of rows)
print("=== Reading as list of lists ===")
with open('student_marks.csv', 'r') as file:
    reader = csv.reader(file)
    for i, row in enumerate(reader):
        if i == 0:
            print(f"Header: {row}")
        else:
            print(f"Row {i}: {row}")
# Output: Row 1: ['101', 'Raj', '92', '88', '85', '8']

# Approach 2: Reading as dictionaries (much better for real work)
print("\n=== Reading with column names ===")
with open('student_marks.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(f"Student: {row['name']}, Math: {row['math_score']}")
        # row['name'] is easier than row[1]!

# Approach 3: Filter and process while reading
print("\n=== Filter high performers ===")
high_performers = []
with open('student_marks.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        total = int(row['math_score']) + int(row['science_score']) + int(row['english_score'])
        avg = total / 3
        if avg > 85:
            high_performers.append({
                'name': row['name'],
                'average': round(avg, 2),
                'grade': row['grade']
            })

for student in high_performers:
    print(f"{student['name']}: Avg {student['average']}")
Writing CSV Files: Creating Datasets
Generate CSV files for reports, exports, or data sharing:
import csv
from datetime import datetime

# Write student performance report
print("=== Writing CSV output ===")
data = [
    {'roll_no': 101, 'name': 'Raj', 'math': 92, 'science': 88, 'grade': 8},
    {'roll_no': 102, 'name': 'Priya', 'math': 88, 'science': 95, 'grade': 8},
    {'roll_no': 103, 'name': 'Amit', 'math': 78, 'science': 82, 'grade': 8},
    {'roll_no': 104, 'name': 'Neha', 'math': 95, 'science': 93, 'grade': 8},
]

# Write data to CSV
output_filename = f"report_{datetime.now().strftime('%Y%m%d')}.csv"
with open(output_filename, 'w', newline='') as file:
    fieldnames = ['roll_no', 'name', 'math', 'science', 'grade']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()    # Write header
    writer.writerows(data)  # Write data rows
print(f"Report written to {output_filename}")

# Real example: Export filtered data
print("\n=== Export only Math toppers (90+) ===")
math_toppers = [row for row in data if row['math'] >= 90]
with open('math_toppers.csv', 'w', newline='') as file:
    fieldnames = ['name', 'math_score']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for student in math_toppers:
        writer.writerow({'name': student['name'], 'math_score': student['math']})
JSON: Structured Data for APIs and Web Services
JSON (JavaScript Object Notation) is the standard data format for web APIs. Most modern services, from UPI payment gateways to weather providers and social media platforms, exchange data as JSON.
import json

# JSON structure (notice keys, types, nesting)
json_string = '''
{
  "school_id": "DPS001",
  "school_name": "Delhi Public School",
  "established": 1995,
  "is_residential": false,
  "class_8_students": [
    {
      "roll_no": 101,
      "name": "Raj Kumar",
      "age": 14,
      "subjects": ["Math", "Science", "English"],
      "marks": {"math": 92, "science": 88, "english": 85}
    },
    {
      "roll_no": 102,
      "name": "Priya Singh",
      "age": 14,
      "subjects": ["Math", "Science", "English"],
      "marks": {"math": 88, "science": 95, "english": 90}
    }
  ],
  "contact": {
    "phone": "+91-11-2345-6789",
    "email": "info@dps.edu.in",
    "location": {"city": "Delhi", "state": "Delhi"}
  }
}
'''

# Parse JSON string to Python dictionary
print("=== Parse JSON ===")
data = json.loads(json_string)
print(f"School: {data['school_name']}")
print(f"Established: {data['established']}")
print(f"Students: {len(data['class_8_students'])}")

# Access nested data
print("\n=== Access nested values ===")
for student in data['class_8_students']:
    print(f"Roll {student['roll_no']}: {student['name']} - Math: {student['marks']['math']}")

# Convert Python object back to JSON string
print("\n=== Write JSON ===")
new_student = {
    "roll_no": 103,
    "name": "Amit Patel",
    "age": 14,
    "subjects": ["Math", "Science", "English"],
    "marks": {"math": 78, "science": 82, "english": 79}
}

# Add to existing data
data['class_8_students'].append(new_student)

# Save to file
with open('school_data.json', 'w') as f:
    json.dump(data, f, indent=2)  # indent=2 for readable formatting
print("JSON saved to school_data.json")
CSV vs JSON: When to Use Each
| Format | Best For | Example |
|---|---|---|
| CSV | Tabular data, spreadsheets, compact files for large datasets | Student marks, sales records, bulk exports |
| JSON | Nested data, APIs, configuration | User profiles with addresses, API responses, config files |
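To make the comparison concrete, here is the same record rendered in both formats, written to in-memory buffers rather than files (a minimal sketch using the sample data from earlier; `lineterminator='\n'` just keeps the CSV output readable across platforms):

```python
import csv
import json
from io import StringIO

student = {'roll_no': 101, 'name': 'Raj', 'math_score': 92}

# As CSV: a flat header row plus a data row; compact but no nesting
csv_buffer = StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=student.keys(), lineterminator='\n')
writer.writeheader()
writer.writerow(student)
print(csv_buffer.getvalue())
# roll_no,name,math_score
# 101,Raj,92

# As JSON: self-describing keys, and nesting is possible
print(json.dumps(student))
# {"roll_no": 101, "name": "Raj", "math_score": 92}
```

Note that CSV stores everything as text (92 comes back as the string '92'), while JSON preserves the distinction between numbers, strings, and booleans.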
Data Cleaning: Handling Real-World Messy Data
Real-world data is rarely clean. Missing values, typos, and inconsistent formats are the norm, so you must clean data before analysis.
import csv

print("=== Data cleaning example ===")

# Raw data with problems
raw_data = [
    {'name': 'Raj', 'score': '92', 'grade': '8'},
    {'name': 'Priya', 'score': '', 'grade': '8'},       # Missing score
    {'name': 'Amit', 'score': 'N/A', 'grade': ''},      # N/A and missing grade
    {'name': ' Neha ', 'score': '95', 'grade': '8'},    # Extra spaces
    {'name': 'Arjun', 'score': '70', 'grade': 'VIII'},  # Inconsistent format
]

def clean_data(row):
    """Clean a single data row"""
    cleaned = {}
    # Clean name: strip whitespace, title case
    cleaned['name'] = row['name'].strip().title()
    # Clean score: handle missing and invalid values
    score = row['score'].strip()
    if score in ['', 'N/A', 'NA', 'null']:
        cleaned['score'] = None  # Represent missing data as None
    else:
        try:
            cleaned['score'] = int(score)
        except ValueError:
            cleaned['score'] = None
    # Clean grade: standardize format
    grade = row['grade'].strip()
    if grade in ['VIII', '8', '8th']:
        cleaned['grade'] = '8'
    else:
        cleaned['grade'] = grade if grade else None
    return cleaned

print("Original data:")
for row in raw_data:
    print(row)

print("\nCleaned data:")
cleaned_rows = [clean_data(row) for row in raw_data]
for row in cleaned_rows:
    print(row)

# Save cleaned data
with open('student_marks_clean.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'score', 'grade'])
    writer.writeheader()
    writer.writerows(cleaned_rows)
Real-World Example: Processing Zomato Restaurant Data
import json
import csv

# Sample JSON in the shape of a restaurant-listing API response
zomato_json = '''
{
  "restaurants": [
    {
      "id": 1,
      "name": "Spice Route",
      "cuisine": ["North Indian", "Mughlai"],
      "rating": 4.5,
      "avg_cost_for_two": 500,
      "location": "New Delhi",
      "is_vegetarian": false
    },
    {
      "id": 2,
      "name": "Green Plate",
      "cuisine": ["Vegan", "Organic"],
      "rating": 4.2,
      "avg_cost_for_two": 600,
      "location": "Bangalore",
      "is_vegetarian": true
    }
  ]
}
'''

# Parse JSON
data = json.loads(zomato_json)

# Filter and export to CSV
print("=== Export vegetarian restaurants ===")
veg_restaurants = [r for r in data['restaurants'] if r['is_vegetarian']]
with open('vegetarian_restaurants.csv', 'w', newline='') as f:
    fieldnames = ['name', 'cuisine', 'rating', 'cost', 'location']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for restaurant in veg_restaurants:
        writer.writerow({
            'name': restaurant['name'],
            'cuisine': ', '.join(restaurant['cuisine']),  # Flatten the list for CSV
            'rating': restaurant['rating'],
            'cost': restaurant['avg_cost_for_two'],
            'location': restaurant['location']
        })
print(f"Exported {len(veg_restaurants)} vegetarian restaurants")
Using CSV and JSON with Government Data (India)
India's Data.gov.in provides open datasets in CSV and JSON formats:
import json
import csv

# Example: fetch rainfall data from India's government portal
# and convert it between formats
def fetch_rainfall_data(state):
    """Fetch from data.gov.in (the real API requires registration, an API
    key, and an HTTP library such as requests; this stub returns sample
    data in the same shape instead)."""
    return {
        "state": state,
        "annual_rainfall_mm": 800,
        "last_updated": "2024-01-20"
    }

# Convert between formats
rainfall_csv = [
    {"state": "Rajasthan", "rainfall": 500},
    {"state": "Kerala", "rainfall": 3000},
    {"state": "Delhi", "rainfall": 700}
]

# Save as CSV
with open('rainfall.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['state', 'rainfall'])
    writer.writeheader()
    writer.writerows(rainfall_csv)

# Convert to JSON
rainfall_json = {"rainfall_data": rainfall_csv}
with open('rainfall.json', 'w') as f:
    json.dump(rainfall_json, f, indent=2)

print("Data exported to both CSV and JSON formats")
print("Data exported to both CSV and JSON formats")
Advanced Data Transformation: CSV to JSON and Back
Professional data engineers constantly convert between formats. Master bidirectional conversion:
import csv
import json

# CSV to JSON conversion
def csv_to_json(csv_filename, json_filename):
    """Convert CSV file to JSON format"""
    data = []
    with open(csv_filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # Type conversion if needed
            if 'score' in row:
                row['score'] = int(row['score'])
            if 'rating' in row:
                row['rating'] = float(row['rating'])
            data.append(row)
    # Write JSON
    with open(json_filename, 'w') as jsonfile:
        json.dump(data, jsonfile, indent=2)
    return len(data)

# JSON to CSV conversion
def json_to_csv(json_filename, csv_filename):
    """Convert JSON file to CSV format"""
    with open(json_filename, 'r') as jsonfile:
        data = json.load(jsonfile)
    # Handle both list and dict JSON structures
    if isinstance(data, dict) and len(data) == 1:
        # If a single key contains the array, extract it
        key = list(data.keys())[0]
        data = data[key]
    if not data:
        print("No data to write")
        return 0
    # Get field names from first record
    fieldnames = list(data[0].keys())
    # Write CSV
    with open(csv_filename, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    return len(data)

# Example usage
print("CSV → JSON")
count1 = csv_to_json('students.csv', 'students.json')
print(f"Converted {count1} records")

print("\nJSON → CSV")
count2 = json_to_csv('students.json', 'students_export.csv')
print(f"Converted {count2} records")
Working with Large Files: Memory-Efficient Processing
Real datasets have millions of rows. Process in chunks to save memory:
import csv
import json

# Memory-efficient CSV reading
def process_large_csv_by_chunks(filename, chunk_size=1000):
    """Yield the CSV's rows in chunks instead of loading them all at once"""
    chunk = []
    with open(filename, 'r') as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            chunk.append(row)
            # Yield when chunk is full
            if len(chunk) >= chunk_size:
                print(f"Processing chunk {i // chunk_size + 1}")
                yield chunk
                chunk = []
    # Don't forget the last, partially filled chunk
    if chunk:
        yield chunk

# Example: Process IRCTC booking data (millions of records)
print("Processing large IRCTC booking file...")
for i, chunk in enumerate(process_large_csv_by_chunks('irctc_bookings.csv', 5000)):
    print(f"Chunk {i + 1}: Processing {len(chunk)} records")
    # Filter high-value bookings
    premium_bookings = [r for r in chunk if float(r.get('fare', 0)) > 5000]
    print(f"  Found {len(premium_bookings)} premium bookings in this chunk")

# Memory-efficient JSON streaming
def process_large_json_streaming(filename):
    """Process a large JSON array one record at a time.

    Caution: this shortcut only works if the file has exactly one complete
    object per line inside the array; arbitrary JSON layouts need a real
    streaming parser (e.g. the third-party ijson library).
    """
    with open(filename, 'r') as f:
        f.readline()  # Skip the opening '[' line
        for line in f:
            line = line.strip().rstrip(',')  # Remove trailing comma
            if line and line != ']':
                try:
                    record = json.loads(line)
                    yield record
                except json.JSONDecodeError:
                    continue

# Example: Process JSON logs (streaming approach)
print("\nProcessing large JSON log file...")
error_count = 0
for log_entry in process_large_json_streaming('app_logs.json'):
    if log_entry.get('level') == 'ERROR':
        error_count += 1
print(f"Found {error_count} error entries without loading the entire file")
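Because the line-by-line approach depends on how the array happens to be formatted, many logging and data pipelines use the JSON Lines convention instead: one complete JSON object per line, with no surrounding array, so every line parses independently. A minimal sketch with made-up log records (using an in-memory buffer in place of a file):

```python
import json
from io import StringIO

# Write JSON Lines: one json.dumps() call per record, one record per line
records = [
    {"level": "INFO", "msg": "started"},
    {"level": "ERROR", "msg": "disk full"},
]
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Read JSON Lines: each line is parsed on its own, so memory use stays flat
errors = 0
for line in StringIO(jsonl_text):
    record = json.loads(line)
    if record["level"] == "ERROR":
        errors += 1
print(f"Found {errors} error entries")  # Found 1 error entries
```

With real files, replace the StringIO buffer with `open(...)`; the reading loop is unchanged because iterating a file also yields one line at a time.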
Data Validation and Type Conversion
Ensure each field is the correct type before processing:
from datetime import datetime

class DataValidator:
    """Validate and convert data types"""

    @staticmethod
    def validate_and_convert(row, schema):
        """
        Validate row against schema,
        e.g. schema = {'score': int, 'date': str, 'rating': float}
        """
        converted = {}
        errors = []
        for field, field_type in schema.items():
            value = row.get(field)
            try:
                if field_type == int:
                    converted[field] = int(value)
                elif field_type == float:
                    converted[field] = float(value)
                elif field_type == str:
                    converted[field] = str(value).strip()
                elif field_type == bool:
                    converted[field] = value.lower() in ['true', '1', 'yes']
                elif field_type == datetime:
                    converted[field] = datetime.strptime(value, '%Y-%m-%d')
            except (ValueError, TypeError, AttributeError):
                errors.append(f"{field}: {value} → {field_type.__name__} failed")
                converted[field] = None
        return converted, errors

# Example: Validate student records
schema = {
    'roll_no': int,
    'name': str,
    'math_score': int,
    'attendance_percent': float,
    'is_scholarship': bool
}

student_data = [
    {'roll_no': '101', 'name': ' Raj ', 'math_score': '92', 'attendance_percent': '85.5', 'is_scholarship': 'true'},
    {'roll_no': '102', 'name': 'Invalid', 'math_score': 'ABC', 'attendance_percent': '90', 'is_scholarship': 'false'},
]

validator = DataValidator()
for student in student_data:
    converted, errors = validator.validate_and_convert(student, schema)
    if errors:
        print(f"Validation errors: {errors}")
    else:
        print(f"Valid student: {converted}")
Real-World Case Study: Processing Flipkart Product Data
import csv
import json
from io import StringIO
from datetime import datetime

# A typical product-catalogue export (CSV). Note: no leading blank line
# inside the string, or DictReader would misread the header row.
flipkart_csv_data = '''product_id,name,category,price,rating,stock_quantity,last_updated
SKU001,Laptop Pro,Electronics,75000,4.5,50,2024-01-20
SKU002,Headphones,Electronics,2500,4.2,150,2024-01-20
SKU003,Cotton Shirt,Fashion,800,4.0,500,2024-01-19
SKU004,Yoga Mat,Sports,1200,4.7,75,2024-01-20
'''

print("=== Processing Flipkart Product Catalog ===")

# Read CSV from the in-memory string
reader = csv.DictReader(StringIO(flipkart_csv_data))
products = list(reader)

# Transform: Add derived fields
for product in products:
    product['price'] = int(product['price'])
    product['rating'] = float(product['rating'])
    product['stock_quantity'] = int(product['stock_quantity'])
    product['in_stock'] = product['stock_quantity'] > 0
    product['discount_eligible'] = product['price'] > 5000
    product['last_updated'] = datetime.strptime(product['last_updated'], '%Y-%m-%d')

# Filter: High-value products with good ratings
premium_products = [p for p in products if p['price'] > 2000 and p['rating'] >= 4.5]
print(f"Total products: {len(products)}")
print(f"Premium products: {len(premium_products)}")

# Export: Save filtered data as JSON
output = {
    'export_date': datetime.now().isoformat(),
    'total_products': len(products),
    'premium_products': premium_products
}
with open('flipkart_premium.json', 'w') as f:
    json.dump(output, f, indent=2, default=str)  # default=str handles datetime
print("Exported to flipkart_premium.json")
Common Pitfalls and How to Avoid Them
| Pitfall | Problem | Solution |
|---|---|---|
| Reading without header | Confusing indices: row[1] vs row['name'] | Always use DictReader for CSV |
| Not handling missing values | Crashes on None/empty values | Check for None and provide defaults |
| Loading entire file in memory | Out of memory for large files | Use generators, process in chunks |
| Type mismatches | Comparing "92" (string) with 92 (int) | Validate and convert types explicitly |
| Encoding issues | UTF-8 characters cause errors | Specify encoding: open(file, encoding='utf-8') |
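The last pitfall deserves a quick demonstration: Hindi, Tamil, and other non-ASCII text round-trips safely only when you state the encoding explicitly on both write and read. A minimal sketch (the filename is made up):

```python
import csv

# Write a name in Devanagari script; encoding='utf-8' makes
# the bytes on disk unambiguous on every platform.
with open('students_utf8.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    writer.writerow(['राज', 'दिल्ली'])

# Read it back with the same encoding
with open('students_utf8.csv', 'r', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows[1])  # ['राज', 'दिल्ली']
```

If you omit the encoding, Python falls back to a platform-dependent default (often cp1252 on Windows), which is exactly where mojibake and `UnicodeDecodeError` come from.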
Key Takeaways
- CSV is simple, universal, and perfect for tabular data
- JSON is flexible, human-readable, and the standard for APIs
- Always read CSV with DictReader (column names are clearer than indices)
- Always validate and clean real-world data before analysis
- Missing values must be handled explicitly—don't ignore them
- Convert between formats as needed (CSV → JSON for APIs, JSON → CSV for reports)
- Process large files in chunks or streams to save memory
- Validate data types before processing to catch errors early
- Major companies such as Flipkart, Zomato, ISRO, and TCS process enormous volumes of CSV/JSON data daily
- Encoding matters—always specify UTF-8 for international data
Practice Problems
- Create a CSV file with movie data (title, year, rating, director) and read it with DictReader
- Convert a CSV file to JSON format and save the output with proper formatting
- Write a script that filters CSV data and writes only records where score > 80 to a new CSV
- Clean a CSV file with missing values, whitespace, and inconsistent data formats
- Create JSON representing a student with nested marks data, then extract and process specific subjects
- Combine CSV and JSON: read CSV with 1000+ rows, convert to JSON, add computed fields, save as JSON
- Process a large CSV file (10MB+) in chunks without loading entire file in memory
- Validate student records against a schema: convert types, handle errors, output validated records
- Build a data transformation pipeline: CSV → validate → transform → JSON → output
- Create a class DataProcessor that can convert CSV↔JSON bidirectionally with error handling
Summary and Recap
Key Takeaways: In this chapter, we covered the fundamentals of CSV and JSON data formats, explored practical examples with Python code, and connected these concepts to real-world applications in Indian tech companies. Remember: mastery comes from practice, not just reading!
Challenge Exercises
Think about this: How would you explain CSV and JSON data formats to someone who has never programmed before? What analogy or metaphor would make it click? Imagine you are building a real application — which concepts from this chapter would you use first?
Try this exercise: implement one concept from this chapter from scratch, without looking at the examples. Then compare your solution. What did you learn?