Case Study: Client Cigars Scraping Projects

Executive Summary

The Client Cigars Scraping Projects comprise a comprehensive data extraction initiative focused on collecting detailed information about cigar establishments and products from multiple sources, including Yelp, Cigars.com, and Cigar Aficionado. The engagement delivered a multi-platform scraping solution capable of extracting business information, product details, reviews, and comprehensive metadata across 14 US states.

Key Metrics:

  - Data Sources: 3 major platforms (Yelp, Cigars.com, Cigar Aficionado)
  - Geographic Coverage: 14 US states
  - Data Points Extracted: 25+ unique fields per business/product
  - Technology Stack: Python, Selenium, BeautifulSoup, Playwright
  - Output Format: Excel, JSON, CSV with structured data

Project Overview

Business Context and Objectives

The client required comprehensive market intelligence for the cigar industry across multiple major US markets. The project aimed to:

  1. Market Research: Gather competitive intelligence on cigar bars and retailers across 14 states
  2. Product Analysis: Collect detailed product information including ratings, reviews, and specifications
  3. Geographic Coverage: Focus on high-population states representing the largest cigar markets
  4. Review Intelligence: Extract customer sentiment and detailed reviews for market analysis

Target Markets

The project covered 14 strategically selected US states based on population and market size:

  - California (CA) - ~39 million
  - Texas (TX) - ~30 million
  - Florida (FL) - ~22 million
  - New York (NY) - ~19.7 million
  - Pennsylvania (PA) - ~13 million
  - Illinois (IL) - ~12.6 million
  - Ohio (OH) - ~11.8 million
  - Georgia (GA) - ~10.9 million
  - North Carolina (NC) - ~10.7 million
  - Michigan (MI) - ~10 million
  - New Jersey (NJ) - ~9.29 million
  - Virginia (VA) - ~8.63 million
  - Washington (WA) - ~7.98 million
  - Arizona (AZ) - ~7.36 million
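
For illustration, the state targets can be encoded as a configuration mapping that drives the per-city searches. The structure and city lists below are illustrative assumptions, not the project's actual configuration.

    # Hypothetical config: state codes mapped to example cities for per-city searches.
    TARGET_STATES = {
        "CA": ["Los Angeles", "San Diego", "San Jose"],
        "TX": ["Houston", "San Antonio", "Dallas"],
        "FL": ["Jacksonville", "Miami", "Tampa"],
        # ... remaining states omitted for brevity
    }

    def iter_search_targets():
        """Yield (state, city) pairs in a stable order for the scrapers."""
        for state, cities in TARGET_STATES.items():
            for city in cities:
                yield state, city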

Technical Architecture

System Design

The project implements a modular architecture with three distinct scraping modules:

    project_folder/
    ├── modules/
    │   ├── cigaraficionado.py    # Cigar Aficionado scraping logic
    │   ├── cigars.py             # Cigars.com product extraction  
    │   ├── connectors/           # Database and driver connections
    │   └── helpers/              # Utility functions
    ├── data/                     # Structured data storage
    │   ├── yelp/
    │   ├── cigars/
    │   └── cigaraficionado/
    └── run_yelp.py              # Main Yelp execution script

Data Flow Architecture

  1. Data Collection Layer: Multi-platform scrapers with platform-specific logic
  2. Processing Layer: Data normalization and cleaning utilities
  3. Storage Layer: JSON files with Excel export capabilities
  4. Output Layer: Structured CSV/Excel files for analysis
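
A minimal sketch of how these four layers could be chained for one city is shown below. The helper functions and record shape are placeholders for illustration, not the project's actual API.

    import json
    import pandas as pd

    def collect(state: str, city: str) -> list[dict]:
        """Collection-layer placeholder: the real project calls the platform scraper here."""
        return [{"business": {"name": "Example Cigar Lounge", "rating": 4.5}}]

    def normalize(record: dict) -> dict:
        """Processing-layer placeholder: flatten and clean one raw record."""
        business = record.get("business", {})
        return {"name": business.get("name"), "rating": business.get("rating")}

    def run_pipeline(state: str, city: str) -> None:
        raw = collect(state, city)                        # 1. data collection layer
        cleaned = [normalize(r) for r in raw]             # 2. processing layer
        with open(f"{state}_{city}.json", "w") as fh:     # 3. storage layer (JSON)
            json.dump(cleaned, fh, indent=2)
        pd.DataFrame(cleaned).to_csv(f"{state}_{city}.csv", index=False)  # 4. output layer (CSV)
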
Technology Stack Analysis

Core Technologies

Web Scraping Framework:

  - Selenium WebDriver: Advanced browser automation with proxy support
  - BeautifulSoup: HTML parsing and data extraction
  - Playwright: Modern browser automation for JavaScript-heavy pages
  - Requests: HTTP client for API-like endpoints

Data Processing:

  - Pandas: Data manipulation and Excel export functionality
  - JSON: Structured data storage and intermediate processing
  - CSV: Final output format for analysis tools

Infrastructure:

  - Proxy Integration: Premium proxy API for IP rotation and bot detection evasion
  - Browser Management: Brave browser with headless/headed modes
  - Error Handling: Comprehensive retry logic and exception management

Advanced Features

Anti-Detection Measures:
    import sys
    import requests

    from modules.connectors import DriverInitializer  # assumed import path, per the project layout above

    # Fetch a fresh proxy from the provider's API (endpoint redacted)
    proxy_api_key = sys.argv[1]
    r = requests.get(f"[PROXY-ENDPOINT-REDACTED]")
    proxy_data = r.json()

    # Launch a Brave-based WebDriver session routed through the fetched proxy
    driver, _ = DriverInitializer("brave", "data", headless=False, proxy={
        "PROXY_HOST": proxy_data["data"]["ipAddress"],
        "PROXY_PORT": proxy_data["data"]["port"]
    }).set_driver_for_browser()
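
A note on the design choice: fetching one fresh exit IP at process start, rather than rotating mid-session, keeps each run on a single consistent network identity, which tends to look more like a real user than an IP that changes between page loads.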
Dynamic Content Handling:
    import json

    # Network request interception for API data extraction
    def parse_game_analytics(driver, prefix):
        perf = driver.get_log('performance')  # Chrome DevTools performance log entries
        for p in perf:
            message = json.loads(p["message"])["message"]  # each log entry is a JSON string
            post_data = message.get("params", {}).get("request", {}).get("postData", "")
            if "GetBusinessReviewFeed" in post_data:
                # Pull the matching review-feed response out of the captured traffic
                response = get_request_response(driver, p, "GetBusinessReviewFeed", prefix)
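
The `get_request_response` helper referenced above is not shown in the excerpt. A minimal sketch of how such a helper could retrieve a captured response body through Selenium's Chrome DevTools Protocol bridge is below; the function name and error handling are illustrative, not the project's actual implementation.

    import base64
    import json

    def get_response_body(driver, log_entry):
        """Hypothetical helper: fetch a captured request's response body via the
        Chrome DevTools Protocol (requires performance logging to be enabled)."""
        message = json.loads(log_entry["message"])["message"]
        request_id = message["params"]["requestId"]
        result = driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
        body = result["body"]
        if result.get("base64Encoded"):
            body = base64.b64decode(body).decode("utf-8")
        return json.loads(body)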

Implementation Details

Yelp Business Intelligence Module

Core Functionality:

  - Business Discovery: Automated search across all major cities in target states
  - Detail Extraction: Comprehensive business profiles including contact info, ratings, reviews
  - Review Analysis: Full review text extraction with user metadata
  - Image Collection: Business photos and user-generated content

Data Points Extracted:

  - Business Name, Phone, Complete Address
  - Rating, Review Count, Review Distribution
  - Business Hours, Amenities, Claim Status
  - Specialties, History, Year Established
  - Categories, Price Range, Website URL
  - Customer Reviews with Full Text and Ratings

Cigars.com Product Catalog

Brand and Product Intelligence:
    import requests
    from bs4 import BeautifulSoup

    def scrape_brands_and_packagings():
        # Extract all available brands and packaging types from the search page
        response = requests.get('https://www.cigars.com/search')
        soup = BeautifulSoup(response.text, "html.parser")

        # Brand extraction from the brand filter list
        data_brands = []
        li_brand_tags = soup.find("ul", {"id": "filter-brand"}).find_all("li")
        for li_brand_tag in li_brand_tags:
            data_brands.append({
                "brand_name": li_brand_tag.getText(strip=True),
                "brand_url": li_brand_tag.find("label")["data-refinement"]
            })
        return data_brands
Product Detail Extraction:

  - Title, Brand, SKU, Pictures
  - Detailed Specifications (Length, Ring, Wrapper Type, Binder, Filler)
  - Origin, Strength, Wrapper Shade
  - Pricing Information, Stock Status
  - Multiple Packaging Options

Cigar Aficionado Reviews Module

Professional Review System:
    def scrape_review_details(review_url):
        # Extract professional cigar ratings and detailed tasting notes
        obj = {
            "ratings": None, "length": None, "gauge": None, "strength": None,
            "size": None, "filler": None, "binder": None, "wrapper": None,
            "country": None, "price": None, "issue_date": None
        }
        # ... each field is populated from the review page markup, then:
        return obj
Review Intelligence:

  - Professional Ratings (Point System)
  - Detailed Tasting Notes and Descriptions
  - Technical Specifications
  - Historical Pricing Data
  - Editorial Rankings and Designations

Challenges and Solutions

Challenge 1: Anti-Bot Detection

Problem: Yelp implements sophisticated bot detection mechanisms including CAPTCHA, IP blocking, and behavioral analysis.

Solution:

  - Implemented proxy rotation using a premium proxy service
  - Added human-like interaction patterns with random delays (see the sketch below)
  - Employed browser fingerprint masking techniques
  - Used performance log interception to capture API responses directly
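
The simplest of these measures, human-like random delays, can be sketched as follows; the delay bounds are illustrative rather than the project's tuned values.

    import random
    import time

    def human_pause(min_s: float = 1.5, max_s: float = 4.0) -> None:
        """Sleep for a randomized interval so request timing does not look scripted."""
        time.sleep(random.uniform(min_s, max_s))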

Challenge 2: Dynamic Content Loading

Problem: Modern web applications load content dynamically via JavaScript, making traditional scraping ineffective.

Solution:
    import time

    from selenium.webdriver.common.by import By
    from selenium.webdriver.remote.webdriver import WebDriver

    def scroll_until_storyline(driver: WebDriver, wait_time: float = 2.0, max_attempts: int = 20):
        # Intelligent scrolling to trigger content loading
        attempts = 0
        while attempts < max_attempts:
            elements = driver.find_elements(By.XPATH, "//span[contains(text(), 'Storyline')]")
            if elements:
                driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", elements[0])
                return elements[0]
            driver.execute_script("window.scrollBy(0, window.innerHeight/2);")
            time.sleep(wait_time)
            attempts += 1
        return None  # target element never appeared within the attempt budget

Challenge 3: Data Volume and Processing

Problem: Processing thousands of businesses across 14 states with multiple data points per entry.

Solution:

  - Implemented incremental processing with checkpoint saving (see the sketch below)
  - Created efficient data flattening algorithms for nested JSON structures
  - Developed robust error handling to prevent data loss
  - Used parallel processing where possible
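
A minimal sketch of the checkpoint idea: progress is saved incrementally so a crash mid-state does not lose completed work. The file name and record shape here are assumptions, not the project's actual schema.

    import json
    import os

    CHECKPOINT_FILE = "checkpoint.json"  # hypothetical path

    def load_checkpoint() -> dict:
        """Return previously saved progress, or an empty structure on first run."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as fh:
                return json.load(fh)
        return {"done_cities": [], "businesses": []}

    def save_checkpoint(state: dict) -> None:
        """Persist progress after each city so an interruption never loses finished work."""
        with open(CHECKPOINT_FILE, "w") as fh:
            json.dump(state, fh, indent=2)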

Challenge 4: Data Quality and Consistency

Problem: Different platforms provide data in varying formats and structures.

Solution:
    import pandas as pd

    def map_business_data(source_df: pd.DataFrame) -> pd.DataFrame:
        # Standardized mapping between different data sources
        mapping = {
            "Business Name": ["business.name"],
            "Phone": ["business.phoneNumber.formatted"],
            "Address 1": ["business.location.address.addressLine1"],
            # ... comprehensive field mapping
        }
        # Illustrative application: pick the first source column present for each field
        out = pd.DataFrame(index=source_df.index)
        for target, candidates in mapping.items():
            found = [c for c in candidates if c in source_df.columns]
            out[target] = source_df[found[0]] if found else None
        return out

Key Features

1. Multi-Platform Intelligence

  - Yelp: Business discovery and customer sentiment
  - Cigars.com: Product catalog and e-commerce data
  - Cigar Aficionado: Professional reviews and industry ratings

2. Geographic Scalability

  - State-by-state processing with city-level granularity
  - Population-weighted market prioritization
  - Comprehensive coverage of major metropolitan areas

3. Advanced Data Extraction

  - API response interception for rich data access
  - Network traffic analysis for hidden endpoints
  - Comprehensive metadata extraction

4. Quality Assurance

  - Duplicate detection and removal algorithms (see the sketch below)
  - Data validation and cleaning pipelines
  - Structured output with standardized schemas
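
A minimal sketch of duplicate removal as it might be applied to the scraped business table; keying on name plus address is an assumption about the dedup criteria.

    import pandas as pd

    def drop_duplicate_businesses(df: pd.DataFrame) -> pd.DataFrame:
        """Remove repeated listings, keeping the first occurrence of each business."""
        return df.drop_duplicates(subset=["Business Name", "Address 1"], keep="first")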

5. Export Flexibility

  - Excel workbooks with multiple sheets
  - JSON for programmatic access
  - CSV for analysis tool integration
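
A minimal sketch of the multi-sheet Excel export using pandas; the sheet names and frame variables are illustrative.

    import pandas as pd

    def export_workbook(businesses: pd.DataFrame, reviews: pd.DataFrame,
                        path: str = "cigar_data.xlsx") -> None:
        """Write related tables to separate sheets of a single workbook."""
        with pd.ExcelWriter(path, engine="openpyxl") as writer:
            businesses.to_excel(writer, sheet_name="Businesses", index=False)
            reviews.to_excel(writer, sheet_name="Reviews", index=False)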

Results and Outcomes

Quantitative Results

  - Businesses Scraped: 1,000+ cigar establishments
  - Products Cataloged: 500+ individual cigar products
  - Reviews Processed: 10,000+ customer reviews
  - Data Accuracy: 95%+ field completion rate
  - Geographic Coverage: 100% of target markets

Qualitative Achievements

  1. Market Intelligence: Comprehensive competitive landscape analysis
  2. Customer Insights: Deep understanding of customer preferences and sentiment
  3. Product Knowledge: Detailed product specifications and professional ratings
  4. Business Intelligence: Contact information and operational details for sales prospecting

Client Value Delivered

  - Time Savings: Eliminated months of manual research
  - Data Quality: Professional-grade data cleaning and normalization
  - Actionable Insights: Ready-to-analyze business intelligence
  - Scalable Solution: Framework for future market expansion

Future Recommendations

Technical Enhancements

  1. Real-Time Monitoring: Implement automated monitoring for new businesses and reviews
  2. API Integration: Develop REST API for programmatic data access
  3. Machine Learning: Add sentiment analysis and rating prediction models
  4. Cloud Deployment: Migrate to cloud infrastructure for better scalability

Business Intelligence Extensions

  1. Competitive Analysis: Add pricing comparison and market positioning insights
  2. Trend Analysis: Implement time-series analysis for market trends
  3. Geographic Expansion: Extend coverage to remaining US states
  4. International Markets: Explore opportunities in Canadian and European markets

Data Quality Improvements

  1. Automated Validation: Implement automated data quality checks
  2. Real-Time Updates: Add change detection for business information updates
  3. Enhanced Metadata: Capture additional business attributes and social media presence
  4. Image Analysis: Implement computer vision for business photo categorization

Technical Excellence Highlights

Code Quality

- Modular Architecture: Clean separation of concerns with dedicated modules
- Error Handling: Comprehensive exception management and graceful degradation
- Logging: Detailed execution logging for debugging and monitoring
- Documentation: Thorough inline documentation and usage examples

Performance Optimization

- Efficient Parsing: Optimized BeautifulSoup selectors for fast extraction
- Memory Management: Streaming processing for large datasets
- Concurrent Processing: Parallel execution where thread-safe (see the sketch below)
- Resource Management: Proper driver cleanup and resource disposal
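
A minimal sketch of how city-level work could be parallelized where it is thread-safe; the worker function and pool size are illustrative assumptions, not the project's actual implementation.

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def scrape_city(city: str) -> dict:
        """Placeholder worker: in the real project this would run one city's scrape."""
        return {"city": city, "businesses": []}

    def scrape_cities_in_parallel(cities: list[str], max_workers: int = 4) -> list[dict]:
        results = []
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(scrape_city, c): c for c in cities}
            for future in as_completed(futures):
                results.append(future.result())
        return results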

Maintainability

- Configuration Management: External configuration files for easy updates
- Version Control: Structured file organization with clear dependencies
- Testing Framework: Built-in test cases and validation routines
- Deployment Scripts: Automated setup and execution procedures

This project demonstrates exceptional technical execution in web scraping, data processing, and business intelligence delivery, providing the client with comprehensive market insights and a scalable framework for future expansion.

Interested in a Similar Project?

Let's discuss how we can help transform your business with similar solutions.

Start Your Project