How AI collects business information
AI does not randomly decide what to say about your business. It builds a picture from specific sources: training data, web crawlers, search engines, review platforms and directories. Understanding where AI gets its information tells you exactly where to focus your efforts. This article explains every source and how to make sure your business is well-represented in each one.
Two layers of information: training data and live search
Most major AI platforms draw on two layers of information. The first is training data: a massive collection of text from the internet that the AI was trained on. This includes websites, news articles, Wikipedia, forums, academic papers and social media posts. Training data has a cut-off date, which means the AI may not know about recent changes to your business.
The second layer is live web search. Modern AI platforms like ChatGPT, Gemini and Perplexity can search the internet in real time when answering a question. This means they can find your latest blog post, your most recent reviews and your current service offerings. Live search makes it possible to influence your AI visibility right now, not just through historical data.
Training data
- Billions of pages of internet text
- Has a cut-off date (months or years old)
- Includes websites, forums, news, Wikipedia
- Cannot be updated directly by businesses
Live web search
- Real-time search results
- Current and up to date
- Influenced by SEO, reviews, content freshness
- Can be influenced by businesses immediately
Where each AI platform gets its data
ChatGPT
- Training data: Massive internet text corpus with periodic updates
- Live search: Bing web search (not Google)
- Crawler: GPTBot (collects content for model training) and OAI-SearchBot (indexes pages for ChatGPT search results)
- Key sources: Websites, news sites, Wikipedia, forums, review sites
Google Gemini
- Training data: Google's own internet index
- Live search: Google Search (the full ecosystem)
- Crawler: Googlebot (the same crawler that indexes pages for regular Google Search)
- Key sources: Google Business Profile, Maps, Reviews, Shopping, Knowledge Graph
Claude
- Training data: High-quality internet text with quality filters
- Live search: Web search capability for current information
- Crawler: ClaudeBot (Anthropic's web crawler)
- Key sources: Detailed website content, expert articles, documentation
Perplexity
- Training data: Depends on the model in use, since Perplexity runs on multiple underlying AI models
- Live search: Its own web search index
- Crawler: PerplexityBot
- Key sources: Web content with heavy emphasis on source citations
Only 11% of sources overlap between AI platforms answering the same question. That is why you need to be present on many different platforms and not rely on any single source.
The specific sources AI uses to evaluate your business
AI does not look at just one source. It triangulates across multiple data points. Here are the main sources it draws from.
Your website
The primary source. AI reads your service pages, about page, FAQ section and blog posts. It extracts what you do, where you operate, what questions you answer and how detailed your information is. Content-rich websites provide more material for AI to work with.
Google Business Profile
Critical for Gemini and Google AI Overviews. Your business name, address, phone number, categories, opening hours, photos, posts and reviews. An incomplete profile is a missed opportunity, especially for local businesses.
Review platforms
Google Reviews and Trustpilot are the most important in the UK. AI reads review volume, average rating, recency and even the text of individual reviews. It also notes whether you respond to reviews.
Directories and listings
Yell.com, Checkatrade, Bark, Rated People, Companies House, professional body directories (Law Society, FCA register, RICS). Each listing is an independent confirmation that your business exists and operates in a specific sector.
Social media
LinkedIn is the most relevant for business recommendations. Your company page, employee profiles, posts and articles all contribute. Facebook and Twitter/X are secondary but still contribute to your overall footprint.
News and media mentions
Press coverage, guest articles in trade publications, features in local newspapers. These are strong authority signals. Being quoted in the Financial Times or even a local paper tells AI you are a recognised expert in your field.
What does AI actually say about your business?
VestVale monitors all 4 AI platforms. See exactly what information AI uses and how it presents your business.
How AI crawlers access your data
AI platforms send automated crawlers to visit and index websites. GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity) and Googlebot (Google) are the main ones. These crawlers read the HTML of your pages, extract text content, and add it to the platform's knowledge base.
Well-behaved crawlers follow the rules in your robots.txt file, so if that file disallows them, AI platforms will not read your content. Some UK hosting providers and WordPress security plugins block AI crawlers by default. Check your robots.txt file (yourdomain.co.uk/robots.txt) and make sure GPTBot, ClaudeBot and PerplexityBot are not disallowed, along with OAI-SearchBot, the crawler OpenAI uses for ChatGPT search. If you want AI visibility, these bots need access to your content. More on this in our article on how AI reads websites.
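The robots.txt check described above can be run with Python's standard `urllib.robotparser` module. The sketch below is illustrative, not a definitive audit tool: the `blocked_crawlers` helper and the sample robots.txt content are assumptions made for this example, and it parses a string so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from this article; adjust to the bots you care about.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot"]


def blocked_crawlers(robots_txt, page_url, agents=AI_CRAWLERS):
    """Return the crawlers that the given robots.txt text stops from fetching page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, page_url)]


# Hypothetical robots.txt that blocks GPTBot but allows everything else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = blocked_crawlers(SAMPLE_ROBOTS, "https://yourdomain.co.uk/services")
print(blocked)  # ['GPTBot']
```

To test a live site instead of a string, the same parser can load the file itself by calling `set_url("https://yourdomain.co.uk/robots.txt")` followed by `read()`.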
Beyond crawling, AI platforms also collect information from their live web search capabilities. When ChatGPT searches Bing to answer a question, it accesses whatever Bing has indexed. When Gemini answers a query, it pulls from Google's entire search index including Maps, Reviews and Shopping data.
This means your Bing SEO matters for ChatGPT visibility, even if you have always focused exclusively on Google. Submit your site to Bing Webmaster Tools if you have not already. Many UK businesses overlook Bing entirely, but it underpins web search for one of the most widely used AI platforms.
How AI selects and filters the information it uses
AI does not use everything it finds. It filters and evaluates. Not all sources carry equal weight. A mention on the BBC carries more weight than a mention on a small blog. A Trustpilot profile with 100 reviews carries more weight than a single listing on an obscure directory.
The filtering works roughly like this: AI looks for consensus. If your website says you are a plumber in Bristol, and Google Maps confirms it, and Checkatrade lists you as a plumber in Bristol, and 50 reviewers confirm you did plumbing work in Bristol, then AI has high confidence in that information.
If your website says "plumber" but Google says "heating engineer" and your LinkedIn says "facilities management", AI has low confidence and may not recommend you for any of those terms. Consistency is the filter. Inconsistent information gets filtered out.
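You can run the same consistency check on your own listings before AI does. The sketch below is a minimal illustration: the `find_inconsistencies` helper, the field names and the sample data are all hypothetical, and the matching is a simple case-insensitive comparison, not a model of how any AI platform actually weighs sources.

```python
def normalise(value):
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def find_inconsistencies(listings):
    """For each field whose values disagree, map each distinct value to its sources."""
    report = {}
    fields = {field for record in listings.values() for field in record}
    for field in sorted(fields):
        by_value = {}
        for source, record in listings.items():
            if field in record:
                by_value.setdefault(normalise(record[field]), []).append(source)
        if len(by_value) > 1:  # more than one distinct value means a conflict
            report[field] = by_value
    return report


# Hypothetical listings mirroring the example in the text.
listings = {
    "website": {"category": "Plumber", "town": "Bristol"},
    "google_business_profile": {"category": "Heating engineer", "town": "Bristol"},
    "linkedin": {"category": "Facilities management", "town": "bristol "},
}

report = find_inconsistencies(listings)
for field, values in report.items():
    print(field, values)
```

Here only `category` is reported, because all three sources agree on the town once case and spacing are ignored.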
Freshness matters too. AI weighs recent information more heavily than old information. A review from last week carries more weight than a review from 2021. A blog post from this month is more relevant than one from three years ago. This is why ongoing activity matters.
AI can also pick up on information that looks manufactured: dozens of five-star reviews posted on the same day, keyword-stuffed website copy that reads like it was written for search engines rather than humans, or fake testimonials. These signals reduce trust rather than build it.
Quality over quantity. 30 genuine, detailed reviews from real customers are worth more than 200 generic one-line reviews. AI can often distinguish between authentic and manufactured signals.
What you can control and what you cannot
You can control:
- Your website content (FAQs, services, blog posts)
- Your Google Business Profile completeness
- Whether you ask for and respond to reviews
- Which directories and platforms you are listed on
- Consistency of your information across platforms
- Whether AI crawlers can access your website
- Structured data on your website
You cannot control:
- What is in AI training data (historical snapshots)
- How AI weighs different sources against each other
- The exact wording AI uses when mentioning you
- Whether AI mentions you on any given query (variability)
- Algorithm updates and changes to how AI processes data
The controllable factors are where your effort should go. For a step-by-step action plan, read our guide on how to get visible in ChatGPT.
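Of the controllable factors listed above, structured data is the most mechanical to set up. One common approach is schema.org markup in JSON-LD format, embedded in a script tag on your pages. The sketch below generates such a block in Python. The business details are placeholders and the `local_business_jsonld` helper is hypothetical; the property names (`name`, `url`, `telephone`, `address`, `openingHours`) come from the schema.org `LocalBusiness` and `PostalAddress` types.

```python
import json


def local_business_jsonld(name, url, telephone, street, town, postcode, opening_hours):
    """Build a schema.org LocalBusiness description as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "LocalBusiness",
        "name": name,
        "url": url,
        "telephone": telephone,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street,
            "addressLocality": town,
            "postalCode": postcode,
            "addressCountry": "GB",
        },
        "openingHours": opening_hours,
    }
    return json.dumps(data, indent=2)


# Placeholder details for illustration only; replace with your real business data.
markup = local_business_jsonld(
    name="Example Plumbing Ltd",
    url="https://yourdomain.co.uk",
    telephone="+44 117 496 0000",
    street="1 Example Street",
    town="Bristol",
    postcode="BS1 0XX",
    opening_hours="Mo-Fr 08:00-18:00",
)

# Wrap the JSON-LD in the script tag that goes into the page's HTML.
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

Whatever values you use here should match your Google Business Profile and directory listings exactly, since consistency across sources is what builds confidence.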
Frequently asked questions
Can I submit my business to ChatGPT's database?
No. There is no submission process. AI gathers information automatically from the web. The best way to get into AI's knowledge base is to have a strong, consistent online presence that AI crawlers can find and index.
Should I block AI crawlers?
Generally no, unless you have specific reasons. Blocking GPTBot, ClaudeBot or PerplexityBot prevents these platforms from indexing your content. If you want AI visibility, you want these crawlers to access your site. Check your robots.txt file to make sure you are not blocking them unintentionally.
How often does AI update its information about my business?
Live web search reflects whatever the search engine has currently indexed, so changes can show up quickly. Training data is refreshed only periodically, often months apart. Google Business Profile changes can affect Gemini within days. Website content changes can take days to weeks to be fully indexed. Keep your information current across all platforms.
Does AI read my social media posts?
Public LinkedIn profiles and posts are generally accessible to AI. Facebook content is often limited by privacy settings. Public Twitter/X posts may be accessible, although access varies by platform. However, social media posts are a secondary signal. Your website and review platforms are more impactful for AI recommendations.
See what AI knows about your business
From £19.95/mo excl. VAT. Cancel monthly.