Scraping Wikipedia¶
We'll use Superpipe to build a pipeline that takes a famous person's name and figures out their date of birth, whether they're still alive, and if not, their cause of death.
This pipeline works in four steps:
- Do a Google search with the person's name
- Use an LLM to extract the URL of their Wikipedia page from the search results
- Fetch the contents of the Wikipedia page and convert them to markdown
- Use an LLM to extract the date of birth, living/dead status, and cause of death from the Wikipedia contents
We'll build the pipeline, evaluate it on some data, and optimize it to maximize accuracy while reducing cost and latency.
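Conceptually, the pipeline is just a chain of named steps, each reading the fields produced so far and adding new ones; Superpipe wraps each step and tracks per-step statistics. The minimal sketch below is hypothetical, not the Superpipe API, and its step names are illustrative stand-ins.

```python
# Hypothetical sketch (not the Superpipe API): a pipeline is a sequence of
# (name, function) pairs; each function sees the accumulated row and its
# return value is stored under the step's name.
def run_pipeline(steps, row):
    for name, fn in steps:
        row[name] = fn(row)
    return row

# Toy stand-ins for the real steps built below
steps = [
    ("search", lambda row: f"search results for '{row['name']} wikipedia'"),
    ("wikipedia_url", lambda row: "https://en.wikipedia.org/wiki/..."),
]
out = run_pipeline(steps, {"name": "Jean-Paul Sartre"})
```

Each step's output is keyed by its name, which is why every Superpipe step below is given a unique `name` argument.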
Step 1: Building the pipeline¶
from superpipe.steps import LLMStructuredStep, CustomStep, SERPEnrichmentStep
from superpipe import models
from pydantic import BaseModel, Field
# Step 1: use Superpipe's built-in SERP enrichment step to search for the person's wikipedia page
# Include a unique "name" for the step that will be used to reference this step's output in future steps
search_step = SERPEnrichmentStep(
    prompt=lambda row: f"{row['name']} wikipedia",
    name="search"
)
# Step 2: Use an LLM to extract the wikipedia URL from the search results
# First, define a Pydantic model that specifies the structured output we want from the LLM
class ParseSearchResult(BaseModel):
    wikipedia_url: str = Field(description="The URL of the Wikipedia page for the person")
# Then we use the built-in LLMStructuredStep and specify a model and a prompt
# The prompt is a function that has access to all the fields in the input as well as the outputs of previous steps
parse_search_step = LLMStructuredStep(
    model=models.gpt35,
    prompt=lambda row: f"Extract the Wikipedia URL for {row['name']} from the following search results: \n\n {row['search']}",
    out_schema=ParseSearchResult,
    name="parse_search"
)
from superpipe.pipeline import Pipeline
import requests
import html2text
import json
h = html2text.HTML2Text()
h.ignore_links = True
# Step 3: we create a CustomStep that can execute any arbitrary function (transform)
# The function fetches the contents of the wikipedia url and converts them to markdown
fetch_wikipedia_step = CustomStep(
    transform=lambda row: h.handle(requests.get(row['wikipedia_url']).text),
    name="wikipedia"
)
# Step 4: we extract the date of birth, living/dead status and cause of death from the wikipedia contents
class ExtractedData(BaseModel):
    date_of_birth: str = Field(description="The date of birth of the person in the format YYYY-MM-DD")
    alive: bool = Field(description="Whether the person is still alive")
    cause_of_death: str = Field(description="The cause of death of the person. If the person is alive, return 'N/A'")
extract_step = LLMStructuredStep(
    model=models.gpt4,
    prompt=lambda row: f"""Extract the date of birth for {row['name']}, whether they're still alive \
and if not, their cause of death from the following Wikipedia content: \n\n {row['wikipedia']}""",
    out_schema=ExtractedData,
    name="extract_data"
)
# Finally we define and run the pipeline
pipeline = Pipeline([
    search_step,
    parse_search_step,
    fetch_wikipedia_step,
    extract_step
])
output = pipeline.run({"name": "Jean-Paul Sartre"})
print(json.dumps(output, indent=2))
Step 2: Evaluating the pipeline¶
Now we'll evaluate the pipeline on a dataset. Think of this as unit tests for your code: just as you wouldn't ship code to production without testing it, you shouldn't ship LLM pipelines to production without evaluating them.
To do this, we need:
- A dataset with labels - In this case we need a list of famous people and the true date of birth, living status and cause of death of each person
- Evaluation function - a function that defines what "correct" is. We'll use simple comparison for date of birth and living status, and an LLM call to evaluate the correctness of cause of death.
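The scoring scheme can be sketched independently of Superpipe. The weights below match the `eval_fn` defined later in this section; `judge_cause` is a hypothetical stand-in for the LLM judgement call.

```python
# Partial-credit scoring sketch: 0.25 for an exact date-of-birth match,
# 0.25 for the alive flag, 0.5 for the cause of death. `judge_cause` is a
# stand-in for the LLM-based comparison used when phrasing may differ.
def score_row(row, judge_cause):
    score = 0.0
    if row["date_of_birth"] == row["dob_label"]:
        score += 0.25
    if row["alive"] == row["alive_label"]:
        score += 0.25
    if judge_cause(row):
        score += 0.5
    return score
```

Partial credit makes it easy to see which field a pipeline is getting wrong, rather than collapsing everything into a single pass/fail bit.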
import pandas as pd
data = [
("Ruth Bader Ginsburg", "1933-03-15", False, "Pancreatic cancer"),
("Bill Gates", "1955-10-28", True, "N/A"),
("Steph Curry", "1988-03-14", True, "N/A"),
("Scott Belsky", "1980-04-18", True, "N/A"),
("Steve Jobs", "1955-02-24", False, "Pancreatic tumor/cancer"),
("Paris Hilton", "1981-02-17", True, "N/A"),
("Kurt Vonnegut", "1922-11-11", False, "Brain injuries"),
("Snoop Dogg", "1971-10-20", True, "N/A"),
("Kobe Bryant", "1978-08-23", False, "Helicopter crash"),
("Aaron Swartz", "1986-11-08", False, "Suicide")
]
df = pd.DataFrame([{"name": d[0], "dob_label": d[1], "alive_label": d[2], "cause_label": d[3]} for d in data])
class EvalResult(BaseModel):
    result: bool = Field(description="Is the answer correct or not?")
cause_evaluator = LLMStructuredStep(
    model=models.gpt4,
    prompt=lambda row: f"This is the correct cause of death: {row['cause_label']}. Is this provided cause of death accurate? The phrasing might be slightly different. Use your judgement: \n{row['cause_of_death']}",
    out_schema=EvalResult,
    name="cause_evaluator")
def eval_fn(row):
    score = 0
    if row['date_of_birth'] == row['dob_label']:
        score += 0.25
    if row['alive'] == row['alive_label']:
        score += 0.25
    if row['cause_label'] == "N/A":
        if row['cause_of_death'] == "N/A":
            score += 0.5
    elif cause_evaluator.run(row)['result']:
        score += 0.5
    return score
pipeline.run(df)
print("Score: ", pipeline.evaluate(eval_fn))
df
Applying step search: 100%|██████████| 10/10 [00:08<00:00, 1.16it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:10<00:00, 1.02s/it]
Applying step wikipedia: 100%|██████████| 10/10 [00:04<00:00, 2.27it/s]
Applying step extract_data: 100%|██████████| 10/10 [01:26<00:00, 8.66s/it]
Score: 1.0
name | dob_label | alive_label | cause_label | search | __parse_search__ | wikipedia_url | wikipedia | __extract_data__ | date_of_birth | alive | cause_of_death | __eval_fn__ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ruth Bader Ginsburg | 1933-03-15 | False | Pancreatic cancer | {"searchParameters":{"q":"Ruth Bader Ginsburg ... | {'input_tokens': 1922, 'output_tokens': 23, 'i... | https://en.wikipedia.org/wiki/Ruth_Bader_Ginsburg | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 46522, 'output_tokens': 37, '... | 1933-03-15 | False | complications of metastatic pancreatic cancer | 1.0 |
1 | Bill Gates | 1955-10-28 | True | N/A | {"searchParameters":{"q":"Bill Gates wikipedia... | {'input_tokens': 1809, 'output_tokens': 20, 'i... | https://en.wikipedia.org/wiki/Bill_Gates | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 46613, 'output_tokens': 32, '... | 1955-10-28 | True | N/A | 1.0 |
2 | Steph Curry | 1988-03-14 | True | N/A | {"searchParameters":{"q":"Steph Curry wikipedi... | {'input_tokens': 1339, 'output_tokens': 20, 'i... | https://en.wikipedia.org/wiki/Stephen_Curry | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 64861, 'output_tokens': 32, '... | 1988-03-14 | True | N/A | 1.0 |
3 | Scott Belsky | 1980-04-18 | True | N/A | {"searchParameters":{"q":"Scott Belsky wikiped... | {'input_tokens': 1566, 'output_tokens': 21, 'i... | https://en.wikipedia.org/wiki/Scott_Belsky | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 2227, 'output_tokens': 32, 'i... | 1980-04-18 | True | N/A | 1.0 |
4 | Steve Jobs | 1955-02-24 | False | Pancreatic tumor/cancer | {"searchParameters":{"q":"Steve Jobs wikipedia... | {'input_tokens': 1625, 'output_tokens': 20, 'i... | https://en.wikipedia.org/wiki/Steve_Jobs | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 47086, 'output_tokens': 42, '... | 1955-02-24 | False | respiratory arrest related to a pancreatic neu... | 1.0 |
5 | Paris Hilton | 1981-02-17 | True | N/A | {"searchParameters":{"q":"Paris Hilton wikiped... | {'input_tokens': 1322, 'output_tokens': 20, 'i... | https://en.wikipedia.org/wiki/Paris_Hilton | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 49288, 'output_tokens': 32, '... | 1981-02-17 | True | N/A | 1.0 |
6 | Kurt Vonnegut | 1922-11-11 | False | Brain injuries | {"searchParameters":{"q":"Kurt Vonnegut wikipe... | {'input_tokens': 1369, 'output_tokens': 22, 'i... | https://en.wikipedia.org/wiki/Kurt_Vonnegut | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 29700, 'output_tokens': 45, '... | 1922-11-11 | False | brain injuries incurred several weeks prior, f... | 1.0 |
7 | Snoop Dogg | 1971-10-20 | True | N/A | {"searchParameters":{"q":"Snoop Dogg wikipedia... | {'input_tokens': 1702, 'output_tokens': 20, 'i... | https://en.wikipedia.org/wiki/Snoop_Dogg | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 40901, 'output_tokens': 32, '... | 1971-10-20 | True | N/A | 1.0 |
8 | Kobe Bryant | 1978-08-23 | False | Helicopter crash | {"searchParameters":{"q":"Kobe Bryant wikipedi... | {'input_tokens': 1355, 'output_tokens': 21, 'i... | https://en.wikipedia.org/wiki/Kobe_Bryant | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 74108, 'output_tokens': 33, '... | 1978-08-23 | False | helicopter crash | 1.0 |
9 | Aaron Swartz | 1986-11-08 | False | Suicide | {"searchParameters":{"q":"Aaron Swartz wikiped... | {'input_tokens': 1329, 'output_tokens': 21, 'i... | https://en.wikipedia.org/wiki/Aaron_Swartz | Jump to content\n\nMain menu\n\nMain menu\n\nm... | {'input_tokens': 37532, 'output_tokens': 34, '... | 1986-11-08 | False | Suicide by hanging | 1.0 |
Step 3: Optimizing the pipeline¶
This pipeline has an accuracy score of 100%, but perhaps there's room for improvement on cost and speed. First let's view the cost and latency of each step to figure out which one is the bottleneck.
for step in pipeline.steps:
    print(f"Step {step.name}:")
    print(f"- Latency: {step.statistics.total_latency}")
    print(f"- Cost: {step.statistics.input_cost + step.statistics.output_cost}")
Step search:
- Latency: 12.000389575958252
- Cost: 0.0
Step parse_search:
- Latency: 10.51110366685316
- Cost: 0.008334
Step wikipedia:
- Latency: 4.235257387161255
- Cost: 0.0
Step extract_data:
- Latency: 90.95815300196409
- Cost: 4.7203800000000005
Clearly the final step (extract_data) is responsible for the bulk of the cost and latency. This makes sense: we're feeding the entire Wikipedia article into GPT-4, one of the most expensive models.
Let's find out whether we can get away with a cheaper, faster model. Most models can't handle the number of tokens needed to ingest a whole Wikipedia article, so we'll try the two that can and are also cheaper than GPT-4: Claude 3 Sonnet and Claude 3 Haiku.
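A back-of-the-envelope estimate shows why context length is the constraint here. The ~4 characters per token figure is a common rule of thumb for English text, and the article length is an illustrative assumption, not a measured value.

```python
# Rough token estimate (assumption: ~4 characters per token for English text).
# A long biography article in markdown can run to tens of thousands of tokens,
# past the context window of most models.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

article_chars = 180_000  # illustrative length for a long biography article
print(estimate_tokens("x" * article_chars))  # → 45000
```

The token counts recorded in the run above (30k-75k input tokens per extraction) are in the same ballpark, which is why only long-context models are candidates.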
from superpipe.grid_search import GridSearch
from superpipe.models import claude3_haiku, claude3_sonnet
from superpipe.steps import LLMStructuredCompositeStep
# we need to use LLMStructuredCompositeStep which uses GPT-3.5 for structured JSON extraction
# because Claude does not support JSON mode or function calling out of the box
new_extract_step = LLMStructuredCompositeStep(
    model=models.claude3_haiku,
    prompt=extract_step.prompt,
    out_schema=ExtractedData,
    name="extract_data_new"
)
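The composite pattern is worth spelling out: a capable model answers in free text, then a cheaper JSON-mode model converts that answer into the target schema. Everything below is a hypothetical sketch with stubbed-out model calls, not Superpipe or provider APIs.

```python
# Sketch of two-stage structured extraction. Both functions are hypothetical
# stand-ins: answer_freeform for the Claude call, structure_answer for the
# GPT-3.5 structured-extraction call.
import json

def answer_freeform(prompt):
    # stand-in: a long-context model answers in prose
    return "Sartre was born on 21 June 1905 and died of pulmonary edema."

def structure_answer(text):
    # stand-in: a cheap JSON-mode model maps the prose onto the schema
    return json.dumps({"date_of_birth": "1905-06-21", "alive": False,
                       "cause_of_death": "Pulmonary edema"})

def composite_extract(prompt):
    return json.loads(structure_answer(answer_freeform(prompt)))
```

The expensive model only has to read the long document; the schema-conformance work is delegated to a model that supports it natively.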
new_pipeline = Pipeline([
    search_step,
    parse_search_step,
    fetch_wikipedia_step,
    new_extract_step
], evaluation_fn=eval_fn)
param_grid = {
    new_extract_step.name: {
        "model": [claude3_haiku, claude3_sonnet]
    }
}
grid_search = GridSearch(new_pipeline, param_grid)
grid_search.run(df)
Applying step search: 100%|██████████| 10/10 [00:08<00:00, 1.20it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:10<00:00, 1.06s/it]
Applying step wikipedia: 100%|██████████| 10/10 [00:03<00:00, 2.56it/s]
Applying step extract_data_new: 100%|██████████| 10/10 [01:26<00:00, 8.63s/it]
Applying step search: 100%|██████████| 10/10 [00:08<00:00, 1.18it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:10<00:00, 1.03s/it]
Applying step wikipedia: 100%|██████████| 10/10 [00:03<00:00, 2.57it/s]
Applying step extract_data_new: 100%|██████████| 10/10 [05:17<00:00, 31.73s/it]
extract_data_new__model | score | input_cost | output_cost | total_latency | input_tokens | output_tokens | num_success | num_failure | index | |
---|---|---|---|---|---|---|---|---|---|---|
0 | claude-3-haiku-20240307 | 1.000000 | 0.129856 | 0.001945 | 109.038948 | defaultdict(... | defaultdict(... | 10 | 0 | 4643861466949536679 |
1 | claude-3-sonnet-20240229 | 0.450000 | 1.465117 | 0.022944 | 339.825781 | defaultdict(... | defaultdict(... | 10 | 0 | 3722756468172814577 |
Strangely, Claude 3 Haiku is both more accurate (100% vs. 45%) and cheaper and faster. This is surprising, but it's useful information we wouldn't have discovered without building and evaluating pipelines on our own data rather than benchmark data.
best_params = grid_search.best_params
new_pipeline.update_params(best_params)
new_pipeline.run(df)
print("Score: ", new_pipeline.score)
for step in new_pipeline.steps:
    print(f"Step {step.name}:")
    print(f"- Latency: {step.statistics.total_latency}")
    print(f"- Cost: {step.statistics.input_cost + step.statistics.output_cost}")
Applying step search: 100%|██████████| 10/10 [00:08<00:00, 1.14it/s]
Applying step parse_search: 100%|██████████| 10/10 [00:11<00:00, 1.15s/it]
Applying step wikipedia: 100%|██████████| 10/10 [00:03<00:00, 2.52it/s]
Applying step extract_data_new: 100%|██████████| 10/10 [01:27<00:00, 8.76s/it]
Score: 1.0
Step search:
- Latency: 8.75270938873291
- Cost: 0.0
Step parse_search:
- Latency: 11.506851500831544
- Cost: 0.007930999999999999
Step wikipedia:
- Latency: 3.9602952003479004
- Cost: 0.0
Step extract_data_new:
- Latency: 87.57113150181249
- Cost: 0.12396325000000001