    About

    A project-based approach to learning Python programming for beginners. Intriguing projects teach you how to tackle challenging problems with code.

    You've mastered the basics. Now you're ready to explore some of Python's more powerful tools. Real-World Python will show you how.

    Through a series of hands-on projects, you'll investigate and solve real-world problems using sophisticated computer vision, machine learning, data analysis, and language processing tools. You'll be introduced to important modules like OpenCV, NumPy, Pandas, NLTK, Bokeh, Beautiful Soup, Requests, HoloViews, Tkinter, turtle, matplotlib, and more. You'll create complete, working programs and think through intriguing projects that show you how to:

  • Save shipwrecked sailors with an algorithm designed to prove the existence of God
  • Detect asteroids and comets moving against a starfield
  • Program a sentry gun to shoot your enemies and spare your friends
  • Select landing sites for a Mars probe using real NASA maps
  • Send unbreakable messages based on a book code
  • Survive a zombie outbreak using data science
  • Discover exoplanets and alien megastructures orbiting distant stars
  • Test the hypothesis that we're all living in a computer simulation
  • And more!

    If you're tired of learning the bare essentials of Python programming with isolated snippets of code, you'll relish the relevant and geeky fun of Real-World Python!

    Table of Contents

    Introduction
    Chapter 1: Saving Shipwrecked Sailors with Bayes’ Rule
    Chapter 2: Attributing Authorship with Stylometry
    Chapter 3: Summarizing Speeches
    Chapter 4: Sending Super Secret Messages
    Chapter 5: Finding Pluto
    Chapter 6: Winning the Moon Race with Apollo 8
    Chapter 7: Selecting Martian Landing Sites
    Chapter 8: Detecting Distant Exoplanets
    Chapter 9: Identifying Friend or Foe
    Chapter 10: Finding a Safe Space
    Chapter 11: Charting Exoplanets
    Chapter 12: Are We Living in an Alien Simulation?
    Appendix: Answers to the practice problems
    Index

    Excerpt

    ATTRIBUTING AUTHORSHIP WITH STYLOMETRY


    Stylometry is the quantitative study of literary style through computational text analysis. It’s based on the idea that we all have a unique, consistent, and recognizable style to our writing. This includes our vocabulary, our use of punctuation, the average length of our sentences and words, and so on.

    A common application of stylometry is authorship attribution. Do you ever wonder if Shakespeare really wrote all his plays? Or if John Lennon or Paul McCartney wrote the song “In My Life”? Could Robert Galbraith, author of The Cuckoo’s Calling, really be J. K. Rowling in disguise? Stylometry can help find the answer!

    Stylometry has been used to overturn murder convictions and even helped identify the Unabomber, who was arrested in 1996. Other uses include detecting plagiarism and determining the emotional tone behind words, such as in social media posts. Stylometry can even be used to detect signs of depression and suicidal tendencies.

    In this chapter, you’ll use multiple stylometric techniques to determine whether Sir Arthur Conan Doyle or H. G. Wells wrote the novel The Lost World.

    Project #2: The Hound, The War, and The Lost World

    Sir Arthur Conan Doyle (1859–1930) is best known for the Sherlock Holmes stories, considered milestones in the field of crime fiction. H. G. Wells (1866–1946) is famous for several groundbreaking science fiction novels, including The War of the Worlds, The Time Machine, The Invisible Man, and The Island of Dr. Moreau.

    In 1912, the Strand Magazine published The Lost World, a serialized version of a science fiction novel. It told the story of an Amazon basin expedition, led by zoology professor George Edward Challenger, that encountered living dinosaurs and a vicious tribe of ape-like creatures.

    Although the author of the novel is known, for this project, let’s pretend it’s in dispute and it’s your job to solve the mystery. Experts have narrowed the field down to two authors, Doyle and Wells. Wells is slightly favored because The Lost World is a work of science fiction, which is his purview. It also includes brutish troglodytes redolent of the Morlocks in his 1895 work The Time Machine. Doyle, on the other hand, is known for detective stories and historical fiction.


    THE OBJECTIVE

    Write a Python program that uses stylometry to determine whether Sir Arthur Conan Doyle or H. G. Wells wrote the novel The Lost World.

    THE STRATEGY

    The science of natural language processing (NLP) deals with the interactions between the precise and structured language of computers and the nuanced, frequently ambiguous “natural” language used by humans. Example uses for NLP include machine translation, spam detection, comprehension of search engine questions, and predictive text recognition for cell phone users.

    The most common NLP tests for authorship analyze the following features of a text:

    Word length: A frequency distribution plot of the length of words in a document

    Stop words: A frequency distribution plot of stop words (short, noncontextual function words like the, but, and if)

    Parts of speech: A frequency distribution plot of words based on their syntactic functions (such as nouns, pronouns, verbs, adverbs, adjectives, and so on)

    Most common words: A comparison of the most commonly used words in a text

    Jaccard similarity: A statistic used for gauging the similarity and diversity of a sample set


    If Doyle and Wells have distinctive writing styles, these five tests should be enough to distinguish between them. We’ll talk about each test in more detail in the coding section.
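
    As a rough illustration of what three of these tests measure, here is a minimal sketch that uses only Python's standard library rather than NLTK. The function names, the tiny stop-word set, and the sample sentence are all hypothetical and chosen for illustration. Part-of-speech counts are left out because they need a trained tagger.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; real lists are far longer.
STOP_WORDS = {"the", "but", "if", "and", "a", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def word_length_freq(words):
    """Count how many words there are of each length."""
    return Counter(len(word) for word in words)


def stop_word_freq(words):
    """Count occurrences of each stop word."""
    return Counter(word for word in words if word in STOP_WORDS)


def most_common_words(words, n=3):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in Counter(words).most_common(n)]


# Made-up sample sentence, not a quotation from either author.
sample = "The hound ran to the moor, but the fog hid the path."
words = tokenize(sample)

print(word_length_freq(words))
print(stop_word_freq(words))
print(most_common_words(words, 1))
```

    Comparing the shapes of these distributions between two texts, rather than the raw counts, keeps documents of different lengths on an equal footing.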

    To capture and analyze each author’s style, you’ll need a representative corpus, or body of text. For Doyle, use the famous Sherlock Holmes novel The Hound of the Baskervilles, published in 1902. For Wells, use The War of the Worlds, published in 1898. Both novels contain more than 50,000 words, more than enough for a sound statistical sampling. You’ll then compare each author’s sample to The Lost World to determine how closely the writing styles match.
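
    The comparison step can be sketched with the Jaccard similarity, which in its usual definition is the number of unique words two samples share divided by the number of unique words in either. The code below is a minimal sketch under stated assumptions: the three word lists are invented stand-ins for the novels, and the names are hypothetical, not taken from the book's program.

```python
def jaccard_similarity(words_a, words_b):
    """Return size of intersection / size of union of the unique words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty samples: treat as no similarity
    return len(set_a & set_b) / len(union)


# Invented miniature "corpora" standing in for the known authors' novels.
known_samples = {
    "doyle": "the hound ran across the moor".split(),
    "wells": "the martians came in the cylinder".split(),
}
# Invented stand-in for the text of unknown authorship.
unknown = "the hound came across the plateau".split()

# Score the unknown text against each known author and pick the closest.
scores = {
    author: jaccard_similarity(words, unknown)
    for author, words in known_samples.items()
}
best_match = max(scores, key=scores.get)

print(scores)
print(best_match)
```

    A score of 1.0 means the two vocabularies are identical and 0.0 means they share no words. With samples of very different lengths, the longer text tends to have the larger vocabulary, so it helps to compare samples of similar size.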

    To perform stylometry, you’ll use the Natural Language Toolkit (NLTK), a popular suite of programs and libraries for working with human language data in Python. It’s free and works on Windows, macOS, and Linux. Created in 2001 as part of a computational linguistics course at the University of Pennsylvania, NLTK has continued to develop and expand with the help of dozens of contributors.

    Author

    Lee Vaughan is a programmer, pop culture enthusiast, educator, and author of Impractical Python Projects (No Starch Press). As a former executive-level scientist at ExxonMobil, he spent decades constructing and reviewing complex computer models, developing and testing software, and training geoscientists and engineers.