What is the most frequent word in Moby Dick?

It takes no more than a few lines of Python and a free book or two to find interesting words in English. Moby Dick makes life easy: the book is out of copyright, and the full original text is available from Project Gutenberg at:

https://www.gutenberg.org/cache/epub/2701/pg2701.txt

Now let's go read it, but in fast-forward.

1. How to break text into words

The first important step toward finding interesting words in a text is to break the text into words. I know you're thinking of some kind of split, but with words it's a bit more complicated: I have no use for names of people or places, and in general the interesting words will usually be verbs, nouns, or adjectives. I also don't want punctuation getting in the way of the search. So yes, you could strip the unwanted characters by hand, but it's easier to let the computer do it.
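To see why a bare split falls short, here is a minimal sketch using the sample sentence from this post. It only uses the built-in `str.split`; the variable names are just for illustration.

```python
sentence = "This is such a long sentence that I cannot read it so go on please."

# Splitting on whitespace keeps punctuation glued to the neighbouring word
words = sentence.split()

print(words[-1])    # the last "word" still carries the period
print(len(words))   # and we learn nothing about which words are verbs or nouns
```

The last item comes back as `please.` with the period attached, and a plain split gives no hint about parts of speech, which is the information needed for the filtering described above.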

The spaCy library can split text into tokens (that is, words or punctuation marks) and can also tell you the role each token plays in the text.

This code, for example, takes a sentence and prints every token along with its role in the sentence:

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("This is such a long sentence that I cannot read it so go on please.")

print([(w.text, w.pos_) for w in doc])

And the output:

[('This', 'PRON'), ('is', 'AUX'), ('such', 'DET'), ('a', 'DET'), ('long', 'ADJ'), ('sentence', 'NOUN'), ('that', 'SCONJ'), ('I', 'PRON'), ('can', 'AUX'), ('not', 'PART'), ('read', 'VERB'), ('it', 'PRON'), ('so', 'CCONJ'), ('go', 'VERB'), ('on', 'ADP'), ('please', 'INTJ'), ('.', 'PUNCT')]

Now that I have the parts of speech, I can filter and print only the interesting words, that is, the verbs, adjectives, and nouns.
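As a small sketch of that filtering step, the snippet below works directly on the (text, tag) pairs printed above, so it runs without loading a model. The helper name `interesting_words` and the constant `INTERESTING_TAGS` are made up for this illustration.

```python
# The (text, part-of-speech) pairs from the output shown earlier
tagged = [('This', 'PRON'), ('is', 'AUX'), ('such', 'DET'), ('a', 'DET'),
          ('long', 'ADJ'), ('sentence', 'NOUN'), ('that', 'SCONJ'),
          ('I', 'PRON'), ('can', 'AUX'), ('not', 'PART'), ('read', 'VERB'),
          ('it', 'PRON'), ('so', 'CCONJ'), ('go', 'VERB'), ('on', 'ADP'),
          ('please', 'INTJ'), ('.', 'PUNCT')]

# Hypothetical name: the tags treated as "interesting" in this sketch
INTERESTING_TAGS = {'VERB', 'ADJ', 'NOUN'}


def interesting_words(tagged_tokens, keep=INTERESTING_TAGS):
    """Return the lowercased text of every token whose tag is in keep."""
    return [text.lower() for text, pos in tagged_tokens if pos in keep]


print(interesting_words(tagged))
# ['long', 'sentence', 'read', 'go']
```

With a real spaCy `doc`, the same idea would read `w.text` and `w.pos_` from each token instead of unpacking tuples.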

2. Problem 1 - where to get the text

The text of Moby Dick is available online, so I can fetch it with the following Python code:

import urllib.request
import ssl

if __name__ == "__main__":
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    with urllib.request.urlopen("https://www.gutenberg.org/cache/epub/2701/pg2701.txt",
                                context=ctx) as f:
        text = f.read().decode('utf8')
        print(len(text))

In my code I ignored SSL problems by turning off certificate verification. I think the problems were caused by an installation issue or missing files on my machine, but either way it didn't matter for the Moby Dick example program.
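On a machine where the certificate store is set up correctly, the download should also work with verification left on. The following is a minimal sketch under that assumption; `verified_context` and `fetch_text` are hypothetical helper names, not part of the original program.

```python
import ssl
import urllib.request

BOOK_URL = "https://www.gutenberg.org/cache/epub/2701/pg2701.txt"


def verified_context() -> ssl.SSLContext:
    # The default context checks both the certificate and the hostname
    return ssl.create_default_context()


def fetch_text(url: str = BOOK_URL) -> str:
    """Download url and decode it as UTF-8, with certificate checks enabled."""
    with urllib.request.urlopen(url, context=verified_context()) as f:
        return f.read().decode("utf8")
```

Calling `fetch_text()` would then replace the `urlopen` block above. If it raises a certificate error, that points to the kind of local installation problem described here.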

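3. Problem 2 - the text is too long

An attempt to combine the two example programs failed because the text of Moby Dick is too long (spaCy rejects input longer than `nlp.max_length`, which defaults to 1,000,000 characters). So I added another function that breaks a long text into smaller pieces without cutting words in the middle: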

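def chunks(text: str, max_size: int):
    """Yield pieces of text up to max_size long, cutting at spaces when possible."""
    while len(text) > max_size:
        # Cut at the last space that fits inside the window
        space_index = text[:max_size].rfind(' ')
        if space_index <= 0:
            # No usable space: cut at max_size so the loop always makes progress
            space_index = max_size
        yield text[:space_index]
        text = text[space_index:]
        print(f"{len(text)} chars left")
    yield text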
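To check the idea on something smaller than a whole novel, here is a self-contained sketch of the same cutting logic without the progress print. The name `split_at_spaces` is hypothetical, and the fallback for a window with no space is an assumption added for safety.

```python
def split_at_spaces(text: str, max_size: int):
    """Yield pieces of text up to max_size long, cutting at spaces when possible."""
    while len(text) > max_size:
        space_index = text[:max_size].rfind(' ')
        if space_index <= 0:
            # Assumed fallback: no space to cut at, so cut at max_size
            space_index = max_size
        yield text[:space_index]
        text = text[space_index:]
    yield text


sample = "call me ishmael some years ago"
pieces = list(split_at_spaces(sample, 12))
print(pieces)
# ['call me', ' ishmael', ' some years', ' ago']

# Nothing is lost: gluing the pieces back gives the original text
print("".join(pieces) == sample)
```

Each piece after the first starts with the space it was cut at, so no characters are dropped and no word is split.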

Now we're ready to count the words:

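from collections import Counter

import spacy

def count_words(text: str):
    nlp = spacy.load("en_core_web_trf")
    # Keep only the parts of speech that tend to carry interesting words
    word_tags = {'ADV', 'VERB', 'NOUN', 'ADJ'}
    # Tokens that pass the tag filter but are not interesting
    weird_tokens = {"'s", "so", "then", "there"}
    word_count = Counter()
    for chunk in chunks(text, 100_000):
        doc = nlp(chunk)
        # The weird_tokens check is case-sensitive: it runs on the original text
        word_count.update([w.text.lower()
                           for w in doc
                           if (w.pos_ in word_tags) and (w.text not in weird_tokens)])

    return word_count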
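The filter in `count_words` compares the token text as written, so a capitalized "So" or an "’s" with a typographic apostrophe is not matched by the exclusion set. As a hedged sketch of one way to tighten that, the snippet below normalizes the text before comparing. The helpers `normalize` and `keep_token` are hypothetical and work on plain strings, so the example runs without spaCy.

```python
WORD_TAGS = {"ADV", "VERB", "NOUN", "ADJ"}
WEIRD_TOKENS = {"'s", "so", "then", "there"}


def normalize(text: str) -> str:
    # Lowercase and map the typographic apostrophe to the ASCII one
    return text.lower().replace("\u2019", "'")


def keep_token(text: str, pos: str) -> bool:
    """True if the tag is interesting and the normalized text is not excluded."""
    return pos in WORD_TAGS and normalize(text) not in WEIRD_TOKENS


samples = [("So", "ADV"), ("so", "ADV"), ("\u2019s", "VERB"),
           ("whale", "NOUN"), ("the", "DET")]

print([normalize(text) for text, pos in samples if keep_token(text, pos)])
# ['whale']
```

Swapping this in would change the counts, so numbers produced by the original filter would no longer match.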

4. The full program and lists of interesting words

So which words are in Moby Dick? I wanted three lists. The first is the most common words. The most common word in the book, in case you were wondering, is of course whale. But there are a few other common words we probably know. The second list was the least common words in the book, and the third, which I found the most interesting, was the 200 least common words that still appear more than once.

All in all, this was the whole program:

import spacy
from collections import Counter
import urllib.request
import ssl

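def chunks(text: str, max_size: int):
    """Yield pieces of text up to max_size long, cutting at spaces when possible."""
    while len(text) > max_size:
        # Cut at the last space that fits inside the window
        space_index = text[:max_size].rfind(' ')
        if space_index <= 0:
            # No usable space: cut at max_size so the loop always makes progress
            space_index = max_size
        yield text[:space_index]
        text = text[space_index:]
        print(f"{len(text)} chars left")
    yield text

def count_words(text: str):
    nlp = spacy.load("en_core_web_sm")
    # Keep only the parts of speech that tend to carry interesting words
    word_tags = {'ADV', 'VERB', 'NOUN', 'ADJ'}
    # Tokens that pass the tag filter but are not interesting
    weird_tokens = {"'s", "so", "then", "there"}
    word_count = Counter()
    for chunk in chunks(text, 100_000):
        doc = nlp(chunk)
        # The weird_tokens check is case-sensitive: it runs on the original text
        word_count.update([w.text.lower()
                           for w in doc
                           if (w.pos_ in word_tags) and (w.text not in weird_tokens)])

    return word_count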


if __name__ == "__main__":
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    with urllib.request.urlopen("https://www.gutenberg.org/cache/epub/2701/pg2701.txt",
                                context=ctx) as f:
        text = f.read().decode('utf8')
        word_count = count_words(text)
        print(f"Book has {len(word_count)} words")

        print("--- most common 200 words:")
        print(word_count.most_common(200))

        print("--- least common 200 words:")
        print(word_count.most_common()[:-201:-1])

        print("--- least common 200 words that appear more than once:")
        word_count_greater_than_1 = sorted({k: v for k, v in word_count.items() if v > 1}.items(),
                                           key=lambda x: x[1])

        print(word_count_greater_than_1[:200])
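The slicing at the end of the program is compact, so here is a toy sketch of the same three queries on a tiny `Counter`. The words and counts are invented for the illustration, and the limit is 2 instead of 200.

```python
from collections import Counter

# Invented counts, just to make the three queries easy to follow
toy = Counter({'whale': 5, 'ship': 3, 'sea': 3, 'oil': 2,
               'quid': 2, 'berg': 1, 'flume': 1})

# 1. Most common: most_common(n) returns the n highest counts, highest first
most_common = toy.most_common(2)

# 2. Least common: walk the full ranking backwards and stop after 2 items
#    (the program uses [:-201:-1] to get 200 items the same way)
least_common = toy.most_common()[:-3:-1]

# 3. Least common among words seen more than once: filter, then sort ascending
repeated = sorted(((k, v) for k, v in toy.items() if v > 1),
                  key=lambda item: item[1])[:2]

print(most_common)   # [('whale', 5), ('ship', 3)]
print(least_common)  # [('flume', 1), ('berg', 1)]
print(repeated)      # [('oil', 2), ('quid', 2)]
```

Words with equal counts keep the order in which they were first counted, which is why the least common list reads from the end of the book backwards.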

Results? Sure. First list - the most common words in Moby Dick:

[('whale', 894), ('now', 781), ('ship', 515), ('more', 507), ('man', 504), ('old', 440), ('other', 432), ('sea', 431), ('’s', 416), ('only', 378), ('head', 333), ('boat', 331), ('time', 329), ('long', 327), ('very', 322), ('here', 316), ('ye', 315), ('still', 311), ('great', 300), ('said', 296), ('most', 286), ('seemed', 279), ('last', 275), ('way', 269), ('chapter', 267), ('see', 265), ('again', 258), ('have', 256), ('yet', 247), ('whales', 246), ('little', 246), ('_', 243), ('men', 239), ('say', 233), ('round', 230), ('first', 225), ('much', 223), ('same', 213), ('such', 208), ('hand', 207), ('side', 206), ('never', 206), ('ever', 205), ('own', 205), ('good', 202), ('look', 200), ('almost', 196), ('even', 192), ('go', 192), ('deck', 188), ('thing', 187), ('water', 186), ('all', 185), ('as', 183), ('too', 182), ('made', 177), ('come', 177), ('away', 175), ('world', 174), ('white', 174), ('day', 171), ('thou', 170), ('life', 167), ('far', 165), ('seen', 164), ('do', 163), ('many', 161), ('well', 159), ('line', 158), ('let', 157), ('eyes', 156), ('had', 156), ('fish', 154), ('part', 153), ('sort', 152), ('cried', 150), ('thought', 148), ('know', 148), ('back', 147), ('once', 147), ('night', 147), ('boats', 145), ('so', 144), ('air', 140), ('crew', 137), ('whole', 136), ('full', 135), ('take', 134), ('thus', 134), ('things', 133), ('tell', 133), ('small', 130), ('soon', 129), ('feet', 127), ('hands', 125), ('came', 123), ('whaling', 122), ('mast', 121), ('has', 121), ('captain', 119), ('think', 118), ('half', 118), ('found', 117), ('just', 117), ('place', 117), ('called', 116), ('make', 114), ('saw', 112), ('times', 112), ('right', 110), ('body', 110), ('work', 110), ('poor', 108), ('high', 106), ('heard', 106), ('moment', 105), ('sight', 104), ('sperm', 104), ('end', 102), ('aye', 101), ('stand', 100), ('one', 100), ('sail', 98), ('strange', 98), ('hold', 98), ('years', 96), ('however', 95), ('face', 95), ('sun', 95), ('down', 94), ('voyage', 94), ('few', 94), 
('went', 94), ('also', 93), ('dead', 93), ('get', 92), ('certain', 91), ('is', 90), ('oil', 90), ('going', 89), ('heart', 89), ('perhaps', 89), ('stood', 89), ('indeed', 89), ('give', 88), ('ships', 88), ('eye', 87), ('sometimes', 87), ('heads', 86), ('days', 86), ('seems', 86), ('like', 86), ('true', 85), ('matter', 85), ('arm', 85), ('iron', 85), ('hard', 84), ('set', 84), ('black', 83), ('soul', 82), ('death', 81), ('seem', 81), ('wild', 81), ('standing', 81), ('cabin', 81), ('known', 80), ('tail', 80), ('always', 80), ('present', 80), ('seas', 79), ('large', 79), ('mind', 79), ('young', 79), ('light', 79), ('length', 78), ('land', 78), ('instant', 77), ('least', 76), ('open', 76), ('harpooneer', 76), ('enough', 76), ('bed', 76), ('at', 76), ('fire', 75), ('mate', 75), ('harpoon', 75), ('leg', 75), ('word', 74), ('morning', 74), ('vast', 73), ('living', 73), ('board', 73), ('put', 73), ('did', 73), ('lay', 73), ('done', 73), ('often', 73), ('-', 72), ('point', 71), ('deep', 70)]

Second list - the least common words in Moby Dick:

[('newsletter', 1), ('subscribe', 1), ('includes', 1), ('confirmed', 1), ('volunteer', 1), ('network', 1), ('originator', 1), ('checks', 1), ('addresses', 1), ('donation', 1), ('web', 1), ('treatment', 1), ('gratefully', 1), ('international', 1), ('donors', 1), ('accepting', 1), ('prohibition', 1), ('solicitation', 1), ('www.gutenberg.org/donate', 1), ('locations', 1), ('paperwork', 1), ('charities', 1), ('outdated', 1), ('widespread', 1), ('www.gutenberg.org/contact', 1), ('deductible', 1), ('identification', 1), ('corporation', 1), ('educational', 1), ('501(c)(3', 1), ('sections', 1), ('ensuring', 1), ('goals', 1), ('financial', 1), ('formats', 1), ('synonymous', 1), ('c', 1), ('deletions', 1), ('additions', 1), ('modification', 1), ('alteration', 1), ('employee', 1), ('indemnify', 1), ('provisions', 1), ('void', 1), ('unenforceability', 1), ('invalidity', 1), ('maximum', 1), ('violates', 1), ('types', 1), ('implied', 1), ('disclaimers', 1), ('elect', 1), ('distributor', 1), ('1.f.3', 1), ('warranty', 1), ('remedies', 1), ('disclaim', 1), ('codes', 1), ('virus', 1), ('disk', 1), ('infringement', 1), ('transcription', 1), ('data', 1), ('inaccurate', 1), ('defects', 1), ('stored', 1), ('proofread', 1), ('expend', 1), ('employees', 1), ('manager', 1), ('discontinue', 1), ('notifies', 1), ('periodic', 1), ('legally', 1), ('owed', 1), ('taxes', 1), ('%', 1), ('exporting', 1), ('hypertext', 1), ('processing', 1), ('proprietary', 1), ('nonproprietary', 1), ('compressed', 1), ('binary', 1), ('redistribute', 1), ('detach', 1), ('unlink', 1), ('redistributing', 1), ('indicating', 1), ('texts', 1), ('accessed', 1), ('1.e.', 1), ('representations', 1), ('downloading', 1), ('govern', 1), ('unprotected', 1), ('compilation', 1), ('1.e', 1), ('1.c', 1), ('indicate', 1), ('1.a.', 1), ('renamed', 1), ('orphan', 1), ('retracing', 1), ('sheathed', 1), ('padlocks', 1), ('dirgelike', 1), ('liberated', 1), ('ixion', 1), ('suction', 1), ('halfspent', 1), ('forth?—because', 1), 
('thrill', 1), ('etherial', 1), ('intercept', 1), ('incommoding', 1), ('tauntingly', 1), ('backwardly', 1), ('touched;—at', 1), ('coincidings', 1), ('ironical', 1), ('intermixingly', 1), ('whelmings', 1), ('inanimate', 1), ('animate', 1), ('lookouts', 1), ('infatuation', 1), ('gaseous', 1), ('mediums', 1), ('bewildering', 1), ('bowstring', 1), ('mutes', 1), ('voicelessly', 1), ('grooves;—ran', 1), ('grapple', 1), ('unconquering', 1), ('comber', 1), ('foregone', 1), ('prow,—death', 1), ('bullied', 1), ('uncracked', 1), ('unsurrendered', 1), ('flume', 1), ('dislodged', 1), ('buttress', 1), ('predestinating', 1), ('inactive', 1), ('coppers', 1), ('though;—cherries', 1), ('gulping', 1), ('assassins', 1), ('brushwood', 1), ('mattrass', 1), ('unwinking', 1), ('unappeasable', 1), ('fidelities', 1), ('plaid', 1), ('gap', 1), ('splashing', 1), ('persecutions', 1), ('evolution', 1), ('crashing', 1), ('cracks!—’tis', 1), ('sinew', 1), ('tug', 1), ('ungraduated', 1), ('unprepared', 1), ('foreknew', 1), ('writhed', 1), ('fiercer', 1), ('tell”—he', 1), ('rowlocks', 1), ('pertinaciously', 1), ('abate', 1), ('busying', 1), ('staved', 1), ('judicious', 1), ('seekest', 1), ('again.—aye', 1), ('breath—“aye', 1), ('befooled!”—drawing', 1), ('befooled', 1), ('frayed', 1), ('flailed', 1), ('knitted', 1), ('tiers', 1), ('combinedly', 1), ('creamed', 1), ('brokenly', 1), ('swamping', 1), ('bedraggled', 1), ('berg', 1), ('upheaved', 1), ('ahab!—shudder', 1), ('it!—where', 1), ('soars', 1), ('vane”—pointing', 1), ('again!—drive', 1), ('whale!—ho', 1)]
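Many entries at the start of this list (newsletter, donation, warranty and so on) look like they come from the Project Gutenberg license text that surrounds the novel rather than from the novel itself. A possible way to leave them out is to cut the downloaded text at the start and end markers before counting. This is only a sketch: the marker strings below are an assumption about how the file is laid out, and `strip_gutenberg_boilerplate` is a hypothetical helper.

```python
# Assumed marker prefixes; check them against the downloaded file
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the part of text between the start and end marker lines."""
    start = text.find(START_MARKER)
    if start != -1:
        # Drop everything up to the end of the marker line
        newline = text.find("\n", start)
        text = text[newline + 1:] if newline != -1 else ""
    end = text.find(END_MARKER)
    if end != -1:
        text = text[:end]
    return text.strip()


sample = ("License header\n"
          "*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK ***\n"
          "Call me Ishmael.\n"
          "*** END OF THE PROJECT GUTENBERG EBOOK MOBY DICK ***\n"
          "Donate, subscribe to the newsletter")

print(strip_gutenberg_boilerplate(sample))
# Call me Ishmael.
```

If a marker is missing, the function leaves that side of the text untouched, so it is safe to call on a text that has no boilerplate.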

Third list - the least common words that still appear more than once:

[('restrictions', 2), ('updated', 2), ('*', 2), ('dick', 2), ('loomings', 2), ('postscript', 2), ('historically', 2), ('diminish?—will', 2), ('aloft.—thunder', 2), ('chase.—third', 2), ('combination', 2), ('defunct', 2), ('indebted', 2), ('dusting', 2), ('grammars', 2), ('vaulted', 2), ('entertaining', 2), ('affording', 2), ('rosy', 2), ('sadness', 2), ('tuileries', 2), ('gulp', 2), ('hoary', 2), ('paunch', 2), ('biggest', 2), ('verbal', 2), ('gulf', 2), ('insomuch', 2), ('parmacetti', 2), ('boil', 2), ('quid', 2), ('pikes', 2), ('fry', 2), ('troops', 2), ('caution', 2), ('discoverer', 2), ('fence', 2), ('abode', 2), ('conceal', 2), ('boldness', 2), ('revenue', 2), ('momentary', 2), ('serves', 2), ('impetus', 2), ('enemies', 2), ('swords', 2), ('finny', 2), ('mightier', 2), ('flounders', 2), ('gateway', 2), ('monument', 2), ('sprout', 2), ('resounds', 2), ('rushes', 2), ('neglected', 2), ('opportunities', 2), ('witnessing', 2), ('formidable', 2), ('displays', 2), ('employ', 2), ('a.d.', 2), ('inevitable', 2), ('national', 2), ('rebounds', 2), ('totally', 2), ('ex', 2), ('arches', 2), ('entrances', 2), ('alcoves', 2), ('whites', 2), ('manage', 2), ('cheery', 2), ('giant', 2), ('regulating', 2), ('warehouses', 2), ('surrounds', 2), ('waterward', 2), ('battery', 2), ('cooled', 2), ('seaward', 2), ('pent', 2), ('benches', 2), ('extremest', 2), ('suffice', 2), ('caravan', 2), ('metaphysical', 2), ('employs', 2), ('hermit', 2), ('woodlands', 2), ('overlapping', 2), ('spurs', 2), ('bathed', 2), ('sighs', 2), ('shepherd', 2), ('wade', 2), ('lilies', 2), ('cataract', 2), ('poet', 2), ('pedestrian', 2), ('deity', 2), ('tormenting', 2), ('rag', 2), ('tribulations', 2), ('judiciously', 2), ('mummies', 2), ('grasshopper', 2), ('touches', 2), ('indignity', 2), ('orchard', 2), ('thieves', 2), ('entailed', 2), ('cheerfully', 2), ('wholesome', 2), ('leaders', 2), ('secretly', 2), ('magnificent', 2), ('tragedies', 2), ('delusion', 2), ('inducements', 2), ('gates', 2), ('inmost', 2), 
('amazingly', 2), ('monopolising', 2), ('sally', 2), ('concernment', 2), ('grapnels', 2), ('expensive', 2), ('congealed', 2), ('tinkling', 2), ('building', 2), ('stumble', 2), ('porch', 2), ('dilapidated', 2), ('carted', 2), ('ruins', 2), ('cheap', 2), ('pea', 2), ('palsied', 2), ('judging', 2), ('writer', 2), ('lookest', 2), ('improvements', 2), ('copestone', 2), ('curbstone', 2), ('tatters', 2), ('drinks', 2), ('tepid', 2), ('frosted', 2), ('thoroughly', 2), ('diligent', 2), ('systematic', 2), ('contemplation', 2), ('oft', 2), ('unwarranted', 2), ('yeast', 2), ('sublimity', 2), ('deceptive', 2), ('combat', 2), ('hyperborean', 2), ('yielded', 2), ('aggregated', 2), ('opinions', 2), ('dismantled', 2), ('purposing', 2), ('sickle', 2), ('segment', 2), ('mown', 2), ('mower', 2), ('wondered', 2), ('imbedded', 2), ('decanters', 2), ('bottles', 2), ('withered', 2), ('dearly', 2), ('sells', 2), ('pours', 2), ('villanous', 2), ('cheating', 2), ('rudely', 2), ('surround', 2), ('skrimshander', 2), ('objections', 2), ('liked', 2), ('adjoining', 2), ('nightmare', 2), ('complexioned', 2), ('seed', 2), ('coats', 2), ('comforters', 2), ('icicles', 2), ('molasses', 2), ('sovereign', 2), ('capering', 2), ('interested', 2), ('ordained', 2), ('partner', 2), ('brawn', 2), ('reminiscences', 2), ('stature', 2), ('orgies', 2)]