Case Study: Client Cigars Scraping Projects
Executive Summary
The Client Cigars Scraping Projects represent a comprehensive data extraction initiative focused on collecting detailed information about cigar establishments and products from multiple sources including Yelp, Cigars.com, and Cigar Aficionado. This project successfully delivered a multi-platform scraping solution capable of extracting business information, product details, reviews, and comprehensive metadata across 14 US states.
Key Metrics:
- Data Sources: 3 major platforms (Yelp, Cigars.com, Cigar Aficionado)
- Geographic Coverage: 14 US states
- Data Points Extracted: 25+ unique fields per business/product
- Technology Stack: Python, Selenium, BeautifulSoup, Playwright
- Output Format: Excel, JSON, CSV with structured data

Project Overview
Business Context and Objectives
The client required comprehensive market intelligence for the cigar industry across multiple major US markets. The project aimed to:
- Market Research: Gather competitive intelligence on cigar bars and retailers across 14 states
- Product Analysis: Collect detailed product information including ratings, reviews, and specifications
- Geographic Coverage: Focus on high-population states representing the largest cigar markets
- Review Intelligence: Extract customer sentiment and detailed reviews for market analysis
The delivered solution is organized into four layers:
- Data Collection Layer: Multi-platform scrapers with platform-specific logic
- Processing Layer: Data normalization and cleaning utilities
- Storage Layer: JSON files with Excel export capabilities
- Output Layer: Structured CSV/Excel files for analysis
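The storage-to-output hand-off is simple enough to sketch. The example below is illustrative, not project code: the file names and the `flatten_records`/`export_businesses` helpers are my own, assuming records are stored as a JSON array of nested objects.

```python
import json
import pandas as pd

def flatten_records(records: list) -> pd.DataFrame:
    # json_normalize turns nested objects into dotted column names,
    # e.g. {"location": {"city": "Miami"}} -> column "location.city"
    return pd.json_normalize(records)

def export_businesses(json_path: str, excel_path: str, csv_path: str) -> pd.DataFrame:
    """Load scraped JSON records, flatten them, and write Excel and CSV outputs."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    df = flatten_records(records)
    df.to_excel(excel_path, index=False)  # Excel export requires openpyxl
    df.to_csv(csv_path, index=False)
    return df
```

The dotted column names produced here are the same shape used by the field mapping shown later (e.g. `business.location.address.addressLine1`).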
These layers delivered four categories of business value:
- Market Intelligence: Comprehensive competitive landscape analysis
- Customer Insights: Deep understanding of customer preferences and sentiment
- Product Knowledge: Detailed product specifications and professional ratings
- Business Intelligence: Contact information and operational details for sales prospecting
The engagement also produced a forward-looking roadmap, grouped into three tracks:

Technical Enhancements:
- Real-Time Monitoring: Implement automated monitoring for new businesses and reviews
- API Integration: Develop a REST API for programmatic data access
- Machine Learning: Add sentiment analysis and rating prediction models
- Cloud Deployment: Migrate to cloud infrastructure for better scalability

Business Intelligence Extensions:
- Competitive Analysis: Add pricing comparison and market positioning insights
- Trend Analysis: Implement time-series analysis for market trends
- Geographic Expansion: Extend coverage to remaining US states
- International Markets: Explore opportunities in Canadian and European markets

Data Quality Improvements:
- Automated Validation: Implement automated data quality checks
- Real-Time Updates: Add change detection for business information updates
- Enhanced Metadata: Capture additional business attributes and social media presence
- Image Analysis: Implement computer vision for business photo categorization
Target Markets
The project covered 14 strategically selected US states based on population and market size (approximate populations):
- California (CA) - ~39 million
- Texas (TX) - ~30 million
- Florida (FL) - ~22 million
- New York (NY) - ~20 million
- Pennsylvania (PA) - ~13 million
- Illinois (IL) - ~12.6 million
- Ohio (OH) - ~11.8 million
- Georgia (GA) - ~11 million
- North Carolina (NC) - ~10.7 million
- Michigan (MI) - ~10 million
- New Jersey (NJ) - ~9.29 million
- Virginia (VA) - ~8.63 million
- Washington (WA) - ~7.98 million
- Arizona (AZ) - ~7.36 million

Technical Architecture
System Design
The project implements a modular architecture with three distinct scraping modules:
```
project_folder/
├── modules/
│   ├── cigaraficionado.py   # Cigar Aficionado scraping logic
│   ├── cigars.py            # Cigars.com product extraction
│   ├── connectors/          # Database and driver connections
│   └── helpers/             # Utility functions
├── data/                    # Structured data storage
│   ├── yelp/
│   ├── cigars/
│   └── cigaraficionado/
└── run_yelp.py              # Main Yelp execution script
```
Data Flow Architecture
Data flows from the platform-specific collectors into intermediate JSON storage, through the normalization and cleaning utilities, and finally into Excel/CSV exports for analysis.
Technology Stack Analysis
Core Technologies
Web Scraping Framework:
- Selenium WebDriver: Advanced browser automation with proxy support
- BeautifulSoup: HTML parsing and data extraction
- Playwright: Modern browser automation for JavaScript-heavy pages
- Requests: HTTP client for API-like endpoints

Data Processing:
- Pandas: Data manipulation and Excel export functionality
- JSON: Structured data storage and intermediate processing
- CSV: Final output format for analysis tools

Infrastructure:
- Proxy Integration: Premium proxy API for IP rotation and bot-detection evasion
- Browser Management: Brave browser with headless/headed modes
- Error Handling: Comprehensive retry logic and exception management

Advanced Features
Anti-Detection Measures:

```python
import sys
import requests

# Proxy rotation with the proxy provider; the API key is passed on the
# command line (e.g. python run_yelp.py <proxy_api_key>)
proxy_api_key = sys.argv[1]
r = requests.get(f"[PROXY-ENDPOINT-REDACTED]")
proxy_data = r.json()

# DriverInitializer is a project helper from modules/connectors; it launches
# a Brave-based driver routed through the freshly rotated proxy
driver, _ = DriverInitializer("brave", "data", headless=False, proxy={
    "PROXY_HOST": proxy_data["data"]["ipAddress"],
    "PROXY_PORT": proxy_data["data"]["port"],
}).set_driver_for_browser()
```
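The "comprehensive retry logic" mentioned in the stack description is not shown in the snippets; a minimal sketch of the kind of jittered exponential-backoff wrapper such pipelines typically use (the `with_retries` helper is illustrative, not from the project code):

```python
import time
import random
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(action: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0) -> T:
    """Run `action`, retrying on any exception with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Sleep roughly base_delay * 2^(attempt-1), with up to 2x jitter
            time.sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

Wrapping each page fetch in such a helper lets transient proxy or network failures recover without losing the whole run.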
Dynamic Content Handling:

```python
import json

# Network-request interception for API data extraction: Selenium
# performance-log entries wrap the DevTools event as a JSON string
def parse_game_analytics(driver, prefix):
    perf = driver.get_log('performance')
    for p in perf:
        event = json.loads(p["message"])["message"]
        post_data = event.get("params", {}).get("request", {}).get("postData", "")
        if "GetBusinessReviewFeed" in post_data:
            # get_request_response is a project helper that captures the
            # response body for the matched request
            response = get_request_response(driver, p, "GetBusinessReviewFeed", prefix)
```
Implementation Details
Yelp Business Intelligence Module
Core Functionality:
- Business Discovery: Automated search across all major cities in target states
- Detail Extraction: Comprehensive business profiles including contact info, ratings, reviews
- Review Analysis: Full review text extraction with user metadata
- Image Collection: Business photos and user-generated content

Data Points Extracted:
- Business Name, Phone, Complete Address
- Rating, Review Count, Review Distribution
- Business Hours, Amenities, Claim Status
- Specialties, History, Year Established
- Categories, Price Range, Website URL
- Customer Reviews with Full Text and Ratings

Cigars.com Product Catalog
Brand and Product Intelligence:

```python
import requests
from bs4 import BeautifulSoup

def scrape_brands_and_packagings():
    """Extract all available brands (and packaging types) from the search filters."""
    data_brands = []
    response = requests.get('https://www.cigars.com/search')
    soup = BeautifulSoup(response.text, "html.parser")
    # Brand extraction from the brand-filter list
    li_brand_tags = soup.find("ul", {"id": "filter-brand"}).find_all("li")
    for li_brand_tag in li_brand_tags:
        data_brands.append({
            "brand_name": li_brand_tag.getText(strip=True),
            "brand_url": li_brand_tag.find("label")["data-refinement"]
        })
    return data_brands
```
Product Detail Extraction:
- Title, Brand, SKU, Pictures
- Detailed Specifications (Length, Ring, Wrapper Type, Binder, Filler)
- Origin, Strength, Wrapper Shade
- Pricing Information, Stock Status
- Multiple Packaging Options
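Length and ring gauge often arrive as a single size string rather than separate fields; a small, hypothetical normalizer for that case (the `parse_size` helper and the assumed `6" x 50` format are illustrative, not from the project code):

```python
import re

def parse_size(size_text: str) -> dict:
    """Split a cigar size string like '6" x 50' into length (inches) and ring gauge."""
    match = re.search(r'([\d.]+)\s*"?\s*[xX]\s*(\d+)', size_text)
    if not match:
        # Unrecognized format: leave the fields empty rather than guess
        return {"length": None, "ring": None}
    return {"length": float(match.group(1)), "ring": int(match.group(2))}
```

Normalizing these fields at ingest time keeps downstream comparisons (e.g. filtering by ring gauge) numeric instead of string-based.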
Cigar Aficionado Reviews Module
Professional Review System:

```python
def scrape_review_details(review_url):
    """Extract professional cigar ratings and detailed tasting notes."""
    # Template for the fields captured from each review page
    obj = {
        "ratings": None, "length": None, "gauge": None, "strength": None,
        "size": None, "filler": None, "binder": None, "wrapper": None,
        "country": None, "price": None, "issue_date": None
    }
```
Review Intelligence:
- Professional Ratings (Point System)
- Detailed Tasting Notes and Descriptions
- Technical Specifications
- Historical Pricing Data
- Editorial Rankings and Designations
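Review pages typically present these specifications as labeled rows, which map naturally onto the field template above. A self-contained sketch of that parsing step (the HTML markup here is fabricated for illustration; the real page structure is not shown in the source):

```python
from bs4 import BeautifulSoup

def parse_spec_rows(html: str) -> dict:
    """Turn 'Label: value' spec rows into a dict keyed by lower-cased label."""
    soup = BeautifulSoup(html, "html.parser")
    specs = {}
    for row in soup.select("li.spec"):
        # Split each row on the first colon: "Wrapper: Habano" -> ("wrapper", "Habano")
        label, _, value = row.get_text(strip=True).partition(":")
        specs[label.strip().lower()] = value.strip()
    return specs

# Fabricated sample markup for demonstration
sample = """
<ul>
  <li class="spec">Wrapper: Habano</li>
  <li class="spec">Binder: Nicaragua</li>
  <li class="spec">Country: Nicaragua</li>
</ul>
"""
```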
Challenges and Solutions
Challenge 1: Anti-Bot Detection
Problem: Yelp implements sophisticated bot-detection mechanisms including CAPTCHA, IP blocking, and behavioral analysis.

Solution:
- Implemented proxy rotation using a premium proxy service
- Added human-like interaction patterns with random delays
- Employed browser fingerprint masking techniques
- Used performance-log interception to capture API responses directly

Challenge 2: Dynamic Content Loading
Problem: Modern web applications load content dynamically via JavaScript, making traditional scraping ineffective.

Solution:

```python
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver

def scroll_until_storyline(driver: WebDriver, wait_time: float = 2.0, max_attempts: int = 20):
    """Scroll incrementally until the lazily loaded 'Storyline' element appears."""
    attempts = 0
    while attempts < max_attempts:
        elements = driver.find_elements(By.XPATH, "//span[contains(text(), 'Storyline')]")
        if elements:
            driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", elements[0])
            return elements[0]
        # Not loaded yet: scroll half a viewport and wait for content to render
        driver.execute_script("window.scrollBy(0, window.innerHeight/2);")
        time.sleep(wait_time)
        attempts += 1
    return None
```
Challenge 3: Data Volume and Processing
Problem: Processing thousands of businesses across 14 states with multiple data points per entry.

Solution:
- Implemented incremental processing with checkpoint saving
- Created efficient data-flattening algorithms for nested JSON structures
- Developed robust error handling to prevent data loss
- Used parallel processing where possible

Challenge 4: Data Quality and Consistency
Problem: Different platforms provide data in varying formats and structures.

Solution:

```python
import pandas as pd

def map_business_data(source_df: pd.DataFrame):
    """Standardized mapping between different data sources."""
    # Each output column lists the flattened source columns that can feed it
    mapping = {
        "Business Name": ["business.name"],
        "Phone": ["business.phoneNumber.formatted"],
        "Address 1": ["business.location.address.addressLine1"],
        # ... comprehensive field mapping
    }
```
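A sketch of how such a mapping can be applied: for each output column, take the first candidate source column actually present in the flattened frame. The `apply_mapping` helper name is my own, assuming the dotted-column layout produced by flattening nested JSON:

```python
import pandas as pd

def apply_mapping(flat_df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Build a report frame where each target column takes the first available source column."""
    out = pd.DataFrame(index=flat_df.index)
    for target, candidates in mapping.items():
        # Platforms differ in which fields they expose, so fall back down the list
        source = next((c for c in candidates if c in flat_df.columns), None)
        out[target] = flat_df[source] if source is not None else None
    return out
```

This keeps the per-platform differences confined to the mapping table, so adding a new source only means adding candidate column names.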
Key Features
1. Multi-Platform Intelligence
- Yelp: Business discovery and customer sentiment
- Cigars.com: Product catalog and e-commerce data
- Cigar Aficionado: Professional reviews and industry ratings

2. Geographic Scalability
- State-by-state processing with city-level granularity
- Population-weighted market prioritization
- Comprehensive coverage of major metropolitan areas

3. Advanced Data Extraction
- API response interception for rich data access
- Network traffic analysis for hidden endpoints
- Comprehensive metadata extraction

4. Quality Assurance
- Duplicate detection and removal algorithms
- Data validation and cleaning pipelines
- Structured output with standardized schemas

5. Export Flexibility
- Excel workbooks with multiple sheets
- JSON for programmatic access
- CSV for analysis tool integration

Results and Outcomes
Quantitative Results
- Businesses Scraped: 1,000+ cigar establishments
- Products Cataloged: 500+ individual cigar products
- Reviews Processed: 10,000+ customer reviews
- Data Accuracy: 95%+ field completion rate
- Geographic Coverage: 100% of target markets

Qualitative Achievements
Client Value Delivered
- Time Savings: Eliminated months of manual research
- Data Quality: Professional-grade data cleaning and normalization
- Actionable Insights: Ready-to-analyze business intelligence
- Scalable Solution: Framework for future market expansion

Future Recommendations
The recommended next steps fall into three tracks: Technical Enhancements (real-time monitoring, API integration, machine learning, cloud deployment), Business Intelligence Extensions (competitive and trend analysis, plus geographic and international expansion), and Data Quality Improvements (automated validation, change detection, richer metadata, image analysis). The individual roadmap items are detailed in the Project Overview above.
Technical Excellence Highlights
Code Quality
- Modular Architecture: Clean separation of concerns with dedicated modules
- Error Handling: Comprehensive exception management and graceful degradation
- Logging: Detailed execution logging for debugging and monitoring
- Documentation: Thorough inline documentation and usage examples

Performance Optimization
- Efficient Parsing: Optimized BeautifulSoup selectors for fast extraction
- Memory Management: Streaming processing for large datasets
- Concurrent Processing: Parallel execution where thread-safe
- Resource Management: Proper driver cleanup and resource disposal

Maintainability
- Configuration Management: External configuration files for easy updates
- Version Control: Structured file organization with clear dependencies
- Testing Framework: Built-in test cases and validation routines
- Deployment Scripts: Automated setup and execution procedures

This project demonstrates strong technical execution in web scraping, data processing, and business intelligence delivery, providing the client with comprehensive market insights and a scalable framework for future expansion.
Interested in a Similar Project?
Let's discuss how we can help transform your business with similar solutions.