Case Study: Client Cigars Scraping Projects
Executive Summary
The Client Cigars Scraping Projects represent a comprehensive data extraction initiative focused on collecting detailed information about cigar establishments and products from multiple sources including Yelp, Cigars.com, and Cigar Aficionado. This project successfully delivered a multi-platform scraping solution capable of extracting business information, product details, reviews, and comprehensive metadata across 14 US states.
Key Metrics:
- Data Sources: 3 major platforms (Yelp, Cigars.com, Cigar Aficionado)
- Geographic Coverage: 14 US states
- Data Points Extracted: 25+ unique fields per business/product
- Technology Stack: Python, Selenium, BeautifulSoup, Playwright
- Output Format: Excel, JSON, CSV with structured data

Project Overview
Business Context and Objectives
The client required comprehensive market intelligence for the cigar industry across multiple major US markets. The project aimed to:
- Market Research: Gather competitive intelligence on cigar bars and retailers across 14 states
- Product Analysis: Collect detailed product information including ratings, reviews, and specifications
- Geographic Coverage: Focus on high-population states representing the largest cigar markets
- Review Intelligence: Extract customer sentiment and detailed reviews for market analysis
The delivered solution is organized into four layers:
- Data Collection Layer: Multi-platform scrapers with platform-specific logic
- Processing Layer: Data normalization and cleaning utilities
- Storage Layer: JSON files with Excel export capabilities
- Output Layer: Structured CSV/Excel files for analysis
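The storage-to-output hand-off is simple enough to sketch. The example below is illustrative, not project code: the file names and the `flatten_records`/`export_businesses` helpers are my own, assuming records are stored as a JSON array of nested objects.

```python
import json
import pandas as pd

def flatten_records(records: list) -> pd.DataFrame:
    # json_normalize turns nested objects into dotted column names,
    # e.g. {"location": {"city": "Miami"}} -> column "location.city"
    return pd.json_normalize(records)

def export_businesses(json_path: str, excel_path: str, csv_path: str) -> pd.DataFrame:
    """Load scraped JSON records, flatten them, and write Excel and CSV outputs."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    df = flatten_records(records)
    df.to_excel(excel_path, index=False)  # Excel export requires openpyxl
    df.to_csv(csv_path, index=False)
    return df
```

The dotted column names produced here are the same shape used by the field mapping shown later (e.g. `business.location.address.addressLine1`).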
These layers delivered four categories of business value:
- Market Intelligence: Comprehensive competitive landscape analysis
- Customer Insights: Deep understanding of customer preferences and sentiment
- Product Knowledge: Detailed product specifications and professional ratings
- Business Intelligence: Contact information and operational details for sales prospecting
The engagement also produced a forward-looking roadmap, grouped into three tracks:

Technical Enhancements:
- Real-Time Monitoring: Implement automated monitoring for new businesses and reviews
- API Integration: Develop a REST API for programmatic data access
- Machine Learning: Add sentiment analysis and rating prediction models
- Cloud Deployment: Migrate to cloud infrastructure for better scalability

Business Intelligence Extensions:
- Competitive Analysis: Add pricing comparison and market positioning insights
- Trend Analysis: Implement time-series analysis for market trends
- Geographic Expansion: Extend coverage to remaining US states
- International Markets: Explore opportunities in Canadian and European markets

Data Quality Improvements:
- Automated Validation: Implement automated data quality checks
- Real-Time Updates: Add change detection for business information updates
- Enhanced Metadata: Capture additional business attributes and social media presence
- Image Analysis: Implement computer vision for business photo categorization
Target Markets
The project covered 14 strategically selected US states based on population and market size (approximate populations):
- California (CA) - ~39 million
- Texas (TX) - ~30 million
- Florida (FL) - ~22 million
- New York (NY) - ~20 million
- Pennsylvania (PA) - ~13 million
- Illinois (IL) - ~12.6 million
- Ohio (OH) - ~11.8 million
- Georgia (GA) - ~11 million
- North Carolina (NC) - ~10.7 million
- Michigan (MI) - ~10 million
- New Jersey (NJ) - ~9.29 million
- Virginia (VA) - ~8.63 million
- Washington (WA) - ~7.98 million
- Arizona (AZ) - ~7.36 million

Technical Architecture
System Design
The project implements a modular architecture with three distinct scraping modules:
```
project_folder/
├── modules/
│   ├── cigaraficionado.py   # Cigar Aficionado scraping logic
│   ├── cigars.py            # Cigars.com product extraction
│   ├── connectors/          # Database and driver connections
│   └── helpers/             # Utility functions
├── data/                    # Structured data storage
│   ├── yelp/
│   ├── cigars/
│   └── cigaraficionado/
└── run_yelp.py              # Main Yelp execution script
```
Data Flow Architecture
Data flows from the platform-specific collectors into intermediate JSON storage, through the normalization and cleaning utilities, and finally into Excel/CSV exports for analysis.
Technology Stack Analysis
Core Technologies
Web Scraping Framework:
- Selenium WebDriver: Advanced browser automation with proxy support
- BeautifulSoup: HTML parsing and data extraction
- Playwright: Modern browser automation for JavaScript-heavy pages
- Requests: HTTP client for API-like endpoints

Data Processing:
- Pandas: Data manipulation and Excel export functionality
- JSON: Structured data storage and intermediate processing
- CSV: Final output format for analysis tools

Infrastructure:
- Proxy Integration: Premium proxy API for IP rotation and bot-detection evasion
- Browser Management: Brave browser with headless/headed modes
- Error Handling: Comprehensive retry logic and exception management

Advanced Features
Anti-Detection Measures:

```python
import sys
import requests

# Proxy rotation with the proxy provider; the API key is passed on the
# command line (e.g. python run_yelp.py <proxy_api_key>)
proxy_api_key = sys.argv[1]
r = requests.get(f"[PROXY-ENDPOINT-REDACTED]")
proxy_data = r.json()

# DriverInitializer is a project helper from modules/connectors; it launches
# a Brave-based driver routed through the freshly rotated proxy
driver, _ = DriverInitializer("brave", "data", headless=False, proxy={
    "PROXY_HOST": proxy_data["data"]["ipAddress"],
    "PROXY_PORT": proxy_data["data"]["port"],
}).set_driver_for_browser()
```
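The "comprehensive retry logic" mentioned in the stack description is not shown in the snippets; a minimal sketch of the kind of jittered exponential-backoff wrapper such pipelines typically use (the `with_retries` helper is illustrative, not from the project code):

```python
import time
import random
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(action: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0) -> T:
    """Run `action`, retrying on any exception with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Sleep roughly base_delay * 2^(attempt-1), with up to 2x jitter
            time.sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

Wrapping each page fetch in such a helper lets transient proxy or network failures recover without losing the whole run.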
Dynamic Content Handling:

```python
import json

# Network-request interception for API data extraction: Selenium
# performance-log entries wrap the DevTools event as a JSON string
def parse_game_analytics(driver, prefix):
    perf = driver.get_log('performance')
    for p in perf:
        event = json.loads(p["message"])["message"]
        post_data = event.get("params", {}).get("request", {}).get("postData", "")
        if "GetBusinessReviewFeed" in post_data:
            # get_request_response is a project helper that captures the
            # response body for the matched request
            response = get_request_response(driver, p, "GetBusinessReviewFeed", prefix)
```
Implementation Details
Yelp Business Intelligence Module
Core Functionality:
- Business Discovery: Automated search across all major cities in target states
- Detail Extraction: Comprehensive business profiles including contact info, ratings, reviews
- Review Analysis: Full review text extraction with user metadata
- Image Collection: Business photos and user-generated content

Data Points Extracted:
- Business Name, Phone, Complete Address
- Rating, Review Count, Review Distribution
- Business Hours, Amenities, Claim Status
- Specialties, History, Year Established
- Categories, Price Range, Website URL
- Customer Reviews with Full Text and Ratings

Cigars.com Product Catalog
Brand and Product Intelligence:

```python
import requests
from bs4 import BeautifulSoup

def scrape_brands_and_packagings():
    """Extract all available brands (and packaging types) from the search filters."""
    data_brands = []
    response = requests.get('https://www.cigars.com/search')
    soup = BeautifulSoup(response.text, "html.parser")
    # Brand extraction from the brand-filter list
    li_brand_tags = soup.find("ul", {"id": "filter-brand"}).find_all("li")
    for li_brand_tag in li_brand_tags:
        data_brands.append({
            "brand_name": li_brand_tag.getText(strip=True),
            "brand_url": li_brand_tag.find("label")["data-refinement"]
        })
    return data_brands
```
Product Detail Extraction:
- Title, Brand, SKU, Pictures
- Detailed Specifications (Length, Ring, Wrapper Type, Binder, Filler)
- Origin, Strength, Wrapper Shade
- Pricing Information, Stock Status
- Multiple Packaging Options
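Length and ring gauge often arrive as a single size string rather than separate fields; a small, hypothetical normalizer for that case (the `parse_size` helper and the assumed `6" x 50` format are illustrative, not from the project code):

```python
import re

def parse_size(size_text: str) -> dict:
    """Split a cigar size string like '6" x 50' into length (inches) and ring gauge."""
    match = re.search(r'([\d.]+)\s*"?\s*[xX]\s*(\d+)', size_text)
    if not match:
        # Unrecognized format: leave the fields empty rather than guess
        return {"length": None, "ring": None}
    return {"length": float(match.group(1)), "ring": int(match.group(2))}
```

Normalizing these fields at ingest time keeps downstream comparisons (e.g. filtering by ring gauge) numeric instead of string-based.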
Cigar Aficionado Reviews Module
Professional Review System:

```python
def scrape_review_details(review_url):
    """Extract professional cigar ratings and detailed tasting notes."""
    # Template for the fields captured from each review page
    obj = {
        "ratings": None, "length": None, "gauge": None, "strength": None,
        "size": None, "filler": None, "binder": None, "wrapper": None,
        "country": None, "price": None, "issue_date": None
    }
```
Review Intelligence:
- Professional Ratings (Point System)
- Detailed Tasting Notes and Descriptions
- Technical Specifications
- Historical Pricing Data
- Editorial Rankings and Designations
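Review pages typically present these specifications as labeled rows, which map naturally onto the field template above. A self-contained sketch of that parsing step (the HTML markup here is fabricated for illustration; the real page structure is not shown in the source):

```python
from bs4 import BeautifulSoup

def parse_spec_rows(html: str) -> dict:
    """Turn 'Label: value' spec rows into a dict keyed by lower-cased label."""
    soup = BeautifulSoup(html, "html.parser")
    specs = {}
    for row in soup.select("li.spec"):
        # Split each row on the first colon: "Wrapper: Habano" -> ("wrapper", "Habano")
        label, _, value = row.get_text(strip=True).partition(":")
        specs[label.strip().lower()] = value.strip()
    return specs

# Fabricated sample markup for demonstration
sample = """
<ul>
  <li class="spec">Wrapper: Habano</li>
  <li class="spec">Binder: Nicaragua</li>
  <li class="spec">Country: Nicaragua</li>
</ul>
"""
```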
Challenges and Solutions
Challenge 1: Anti-Bot Detection
Problem: Yelp implements sophisticated bot-detection mechanisms including CAPTCHA, IP blocking, and behavioral analysis.

Solution:
- Implemented proxy rotation using a premium proxy service
- Added human-like interaction patterns with random delays
- Employed browser fingerprint masking techniques
- Used performance-log interception to capture API responses directly

Challenge 2: Dynamic Content Loading
Problem: Modern web applications load content dynamically via JavaScript, making traditional scraping ineffective.

Solution:

```python
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver

def scroll_until_storyline(driver: WebDriver, wait_time: float = 2.0, max_attempts: int = 20):
    """Scroll incrementally until the lazily loaded 'Storyline' element appears."""
    attempts = 0
    while attempts < max_attempts:
        elements = driver.find_elements(By.XPATH, "//span[contains(text(), 'Storyline')]")
        if elements:
            driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", elements[0])
            return elements[0]
        # Not loaded yet: scroll half a viewport and wait for content to render
        driver.execute_script("window.scrollBy(0, window.innerHeight/2);")
        time.sleep(wait_time)
        attempts += 1
    return None
```
Challenge 3: Data Volume and Processing
Problem: Processing thousands of businesses across 14 states with multiple data points per entry.

Solution:
- Implemented incremental processing with checkpoint saving
- Created efficient data-flattening algorithms for nested JSON structures
- Developed robust error handling to prevent data loss
- Used parallel processing where possible

Challenge 4: Data Quality and Consistency
Problem: Different platforms provide data in varying formats and structures.

Solution:

```python
import pandas as pd

def map_business_data(source_df: pd.DataFrame):
    """Standardized mapping between different data sources."""
    # Each output column lists the flattened source columns that can feed it
    mapping = {
        "Business Name": ["business.name"],
        "Phone": ["business.phoneNumber.formatted"],
        "Address 1": ["business.location.address.addressLine1"],
        # ... comprehensive field mapping
    }
```
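A sketch of how such a mapping can be applied: for each output column, take the first candidate source column actually present in the flattened frame. The `apply_mapping` helper name is my own, assuming the dotted-column layout produced by flattening nested JSON:

```python
import pandas as pd

def apply_mapping(flat_df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Build a report frame where each target column takes the first available source column."""
    out = pd.DataFrame(index=flat_df.index)
    for target, candidates in mapping.items():
        # Platforms differ in which fields they expose, so fall back down the list
        source = next((c for c in candidates if c in flat_df.columns), None)
        out[target] = flat_df[source] if source is not None else None
    return out
```

This keeps the per-platform differences confined to the mapping table, so adding a new source only means adding candidate column names.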
Key Features
1. Multi-Platform Intelligence
- Yelp: Business discovery and customer sentiment
- Cigars.com: Product catalog and e-commerce data
- Cigar Aficionado: Professional reviews and industry ratings

2. Geographic Scalability
- State-by-state processing with city-level granularity
- Population-weighted market prioritization
- Comprehensive coverage of major metropolitan areas

3. Advanced Data Extraction
- API response interception for rich data access
- Network traffic analysis for hidden endpoints
- Comprehensive metadata extraction

4. Quality Assurance
- Duplicate detection and removal algorithms
- Data validation and cleaning pipelines
- Structured output with standardized schemas

5. Export Flexibility
- Excel workbooks with multiple sheets
- JSON for programmatic access
- CSV for analysis tool integration

Results and Outcomes
Quantitative Results
- Businesses Scraped: 1,000+ cigar establishments
- Products Cataloged: 500+ individual cigar products
- Reviews Processed: 10,000+ customer reviews
- Data Accuracy: 95%+ field completion rate
- Geographic Coverage: 100% of target markets

Qualitative Achievements
Client Value Delivered
- Time Savings: Eliminated months of manual research
- Data Quality: Professional-grade data cleaning and normalization
- Actionable Insights: Ready-to-analyze business intelligence
- Scalable Solution: Framework for future market expansion

Future Recommendations
The recommended next steps fall into three tracks: Technical Enhancements (real-time monitoring, API integration, machine learning, cloud deployment), Business Intelligence Extensions (competitive and trend analysis, plus geographic and international expansion), and Data Quality Improvements (automated validation, change detection, richer metadata, image analysis). The individual roadmap items are detailed in the Project Overview above.
Technical Excellence Highlights
Code Quality
- Modular Architecture: Clean separation of concerns with dedicated modules
- Error Handling: Comprehensive exception management and graceful degradation
- Logging: Detailed execution logging for debugging and monitoring
- Documentation: Thorough inline documentation and usage examples

Performance Optimization
- Efficient Parsing: Optimized BeautifulSoup selectors for fast extraction
- Memory Management: Streaming processing for large datasets
- Concurrent Processing: Parallel execution where thread-safe
- Resource Management: Proper driver cleanup and resource disposal

Maintainability
- Configuration Management: External configuration files for easy updates
- Version Control: Structured file organization with clear dependencies
- Testing Framework: Built-in test cases and validation routines
- Deployment Scripts: Automated setup and execution procedures

This project demonstrates strong technical execution in web scraping, data processing, and business intelligence delivery, providing the client with comprehensive market insights and a scalable framework for future expansion.
Interested in a Similar Project?
Let's discuss how we can help transform your business with similar solutions.