
Reddit Bible

About

In short, the Reddit Bible pairs questions found on 8 advice subreddits with interrogative sentences found in the Bible.

After much wrangling with religious-text-based concepts, I decided to create a "Reddit Advice" board by answering biblical questions with Reddit content. I scraped the Bible for interrogative sentences, then used sentence embeddings to find similar questions scraped from Reddit advice boards. From there, I wanted to answer the Bible questions with Markov-chain-generated answers based on the thread responses of the respective Reddit questions.

Unexpectedly, I loved the results of the "similar" Reddit and Bible questions--the seeming connections between the two almost hint at some sort of relation between the questions people are asking now and the ones people asked in biblical times. Though I did go through with the Markov-chained responses, the results were a bit jumbly and seemed to take away from what I liked about the juxtaposed questions. Ultimately, I decided to cut the Markov chains and instead highlight the contrast in each pair of questions, as well as how similar the computer thinks the two questions are.

Inspiration

I originally wanted to generate some sort of occult text, e.g. Chinese divination. I ended up pivoting to the more normative of religious texts, the Bible to be specific, since I have a lot of personal experience with this one. Prior to the "Reddit advice" board, I actually had the opposite idea: making a "Christian advice" board where I would gather 4chan questions and answer them with Markov-chain-generated responses based on real Christian advice forums. I scraped a Christian advice forum, but the results were too few and inconsistent, so I knew I had to pivot a bit. That's when I flipped the idea and decided to answer Bible questions with Reddit data instead (4chan's threads were a little too inconsistent and sparse compared with Reddit's thousand-comment response threads).

If anyone ever wants buttloads of responses from a Christian forum: https://github.com/swlsheltie/christian-forum-data

Process

Once I solidified my concept, it was time to execute, one step at a time.

  1. Getting matching question pairs from Reddit and the Bible
    1. Getting questions from Reddit
1. Reddit has a great Python API wrapper called PRAW
2. Originally I only scraped r/Advice, but towards the end, I decided to bump it up and scrape 8 different advice subreddits: r/Advice, r/internetparents, r/legaladvice, r/needafriend, r/needadvice, r/relationship_advice, r/techsupport, r/socialskills
1. Using PRAW, I looked at the top posts of all time (Reddit's listings cap out at around 1,000 posts)
3. r/Advice yielded over 1,000 question sentences, and the other advice subreddits each landed somewhere around that number.
4. The lists of scraped questions:
        1. https://github.com/swlsheltie/reddit_bible/tree/master/lists%20of%20reddit%20questions
5. # REDDIT QUESTIONS X BIBLE QUESTIONS
# ENTER A SUBREDDIT
#   FIND ALL QUESTIONS
# INFERSENT WITH BIBLE QUESTIONS
# GET LIST OF [DISTANCE, SUBREDDIT, THREAD ID, SUBMISSION TEXT, BIBLE TEXT]

import praw

# bible questions and the sentence splitter (both described below)
from final_bible_questions import bible_questions
from split_into_sentences import split_into_sentences

# one output file per subreddit
rel_advice_file = open("rel_advice.txt", "w")
legal_advice_file = open("legal_advice.txt", "w")
tech_support_file = open("tech_support.txt", "w")
need_friend_file = open("needfriend.txt", "w")
internetparents_file = open("internetparents.txt", "w")
socialskills_file = open("socialskills.txt", "w")
advice_file = open("advice.txt", "w")
needadvice_file = open("needadvice.txt", "w")

files = [rel_advice_file, legal_advice_file, tech_support_file, need_friend_file, internetparents_file, socialskills_file, advice_file, needadvice_file]

# reddit API credentials (redacted)
reddit = praw.Reddit(client_id='---',
                     client_secret='---',
                     user_agent='script: http://0.0.0.0:8000/: v.0.0(by /u/swlsheltie)')

# the eight advice subreddits to scrape
list_subreddits = ["relationship_advice", "legaladvice", "techsupport", "needafriend", "internetparents", "socialskills", "advice", "needadvice"]

# one list of question sentences per subreddit
questions_list_list = [[] for _ in list_subreddits]

for i, subreddit in enumerate(list_subreddits):
    print(subreddit)
    sub_reddit = reddit.subreddit(subreddit)
    txt_file_temp = []

    # grab the selftext of the top posts of all time
    for counter, submission in enumerate(sub_reddit.top(limit=1000)):
        print("...getting from reddit", counter)
        submission_txt = submission.selftext.replace('\n', ' ').replace('\r', '')
        txt_file_temp.append(submission_txt)
    print("gottem")

    # split each post into sentences and keep the ones that end in "?"
    print("splitting and grabbing questions")
    for sub_txt in txt_file_temp:
        for sent in split_into_sentences(sub_txt):
            if sent.endswith("?"):
                questions_list_list[i].append(sent)

    print("writing file")
    files[i].write(str(questions_list_list[i]))
    files[i].close()
    print("written file, next")
2. Getting questions from the Bible
1. Used the King James Version because of this awesome text file that only has the text in it (no verse numbers, etc.), which would bite me in the ass later on
2. Found some code on Stack Overflow that allowed me to get a list of the sentences in the Bible
3. Originally I used RiTa to get the question sentences, but towards the end (since bridging RiTa and Python was too much of a hassle), I just went through and found all sentences that end with a "?".
1. file = open("bible.txt", "r")
empty = open("bible_sent.txt", "w")
bible = file.read()

import re

alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    # protect non-terminal periods with <prd>, mark real sentence
    # boundaries with <stop>, then split on <stop>
    text = " " + text + "  "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

# write one sentence per line
for sent in split_into_sentences(bible):
    empty.write(sent + "\n")
empty.close()
4. Results: just over 1,500 Bible questions (1,539, to be exact); find them here
    3. Get matching pairs!!! 
1. My boyfriend suggested that I use sentence embeddings to find the best matching pairs. This library (Facebook's InferSent) is the one I used. It was super easy to install and use! 4.75 stars
2. InferSent pumps out a 2D matrix containing one vector per sentence in the list you provide. I gave it the list of Bible questions, then ran a loop to get the embeddings of all 8 subreddit question lists.
3. Then another loop normalizes every embedding to unit length and uses matrix multiplication to get, for each subreddit, a matrix of cosine similarities ("distances") between the Bible questions and that subreddit's questions (there's a small sketch of this trick after this list).
4. Since the matrix contains such precise information about how close two sentences are, I wanted to visualize this data. I saved the "distances," and used circle size to show how close each pair is.
        1. I didn't have that much time to visually design the book, which I regret, and the circle sizes were obviously not that communicative about what they represented. I ended up including a truncated version of the distances on each page.
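A minimal sketch of the normalize-then-multiply trick, with tiny made-up vectors standing in for InferSent's 4096-dimensional ones (none of these numbers come from the real data):

import numpy

# toy embeddings: 3 reddit questions and 2 bible questions in 4 dims
reddit = numpy.array([[1.0, 2.0, 0.0, 1.0],
                      [0.0, 1.0, 1.0, 0.0],
                      [2.0, 0.0, 1.0, 1.0]])
bible = numpy.array([[1.0, 1.0, 0.0, 1.0],
                     [0.0, 0.0, 1.0, 1.0]])

# divide each row by its L2 norm so every vector has unit length
reddit_n = reddit / numpy.linalg.norm(reddit, axis=1).reshape(-1, 1)
bible_n = bible / numpy.linalg.norm(bible, axis=1).reshape(-1, 1)

# a matmul of unit vectors gives the cosine similarity of every pair
similarity = numpy.matmul(reddit_n, bible_n.transpose())  # shape (3, 2)

# each reddit question's best bible match is the argmax of its row
print(similarity.argmax(axis=1))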
    4. Markov chaining the responses
1. Now that I had matched my Reddit questions to their Bible question counterparts, I wanted to get a mishmash of the response threads of the submissions the questions came from.
2. import markovify
from praw.models import MoreComments

# `pair` holds one matched question pair plus its reddit thread id
submission = reddit.submission(pair["thread_id"])

# gather the top-level comments of the matched submission
comment_body = []
for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    comment_body.append(top_level_comment.body)

# markovify wants one big string, so join the comments back together
comment_body_single_str = " ".join(comment_body)

text_model = markovify.Text(comment_body_single_str)
for i in range(10):
    print(text_model.make_sentence())
        3. This was an example of my result:
          1.  Submission Data: {'thread_id': '7zg0rt', 'submission': 'Deep down, everyone likes it when someone has a crush on them right?   So why not just tell people that you like them if you do? Even if they reject you, deep down they feel good about themselves right? '}

            Markov Result: Great if you like it, you generally say yes, even if they have a hard time not jumping for joy.

            Us guys really do love it when someone has a crush on them right?

            I think it's a much more nuanced concept.

            In an ideal world, yesDepends on the maturity of the art.

            But many people would not recommend lying there and saying "I don't like you", they would benefit from it, but people unequipped to deal with rejection in a more positive way.

            If it's not mutual, then I definitely think telling them how you feel, they may not be friends with you or ignore you, or not able to answer?

            I've always been open about how I feel, and be completely honest.

            In an ideal world, yesDepends on the maturity of the chance to receive a big compliment.

            So while most people would argue that being honest there.

            I think that in reality it's a much more nuanced examples within relationships too.

4. While this was OK, I felt like the Markov source data just wasn't extensive enough for the output to sound meaningfully different from the source. It didn't seem like it would add anything to the concept, so I decided to cut it.
    5. Organizing the matching pairs 
1. The part I probably had the most difficulty with was organizing the lists into least-distant to most-distant pairs. It seemed really easy in my head, but for some reason, I had a lot of trouble executing it.
1. However, it paid off in the end, as I was able to get random pairs from only the 100 closest pairs of each of the 8 subreddit/Bible lists (see the argsort note after the code below).
2. import numpy
import torch

# the eight scraped subreddit question lists
from rel_advice import rel_advice
from legal_advice import legal_advice
from techsupport import techsupport
from needfriend import needfriend
from internetparents import internet_parents
from socialskills import socialskills
from advice import advice
from needadvice import needadvice

from final_bible_questions import bible_questions
from bible_sent import bible_sentences

from models import InferSent

final_pairs = open("final_pairs.txt", "w")
rqbq_data_file = open("rqbq_data.txt", "w")

subreddit_questions = [rel_advice, legal_advice, techsupport, needfriend, internet_parents, socialskills, advice, needadvice]
list_subreddits = ["relationship_advice", "legaladvice", "techsupport", "needafriend", "internetparents", "socialskills", "advice", "needadvice"]

# load the pretrained InferSent model and its fastText word vectors
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}

infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = 'dataset/fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

# build the vocab, then add every word in the Bible to it
infersent.build_vocab_k_words(K=100000)
infersent.update_vocab(bible_sentences)

# embed the bible questions once, and normalize each row to unit
# length so the matrix product below gives cosine similarities
print("embed bible")
embeddings_bible = infersent.encode(bible_questions, tokenize=True)
normalizer_bible = numpy.linalg.norm(embeddings_bible, axis=1)
normalized_bible = embeddings_bible / normalizer_bible.reshape(-1, 1)

pairs = {}
for question_list in range(len(subreddit_questions)):
    name = list_subreddits[question_list]
    pairs[name] = {}

    print("embed", name, "questions")
    temp_reddit = subreddit_questions[question_list]  # trim these to test smaller list sizes
    embeddings_reddit = infersent.encode(temp_reddit, tokenize=True)

    normalizer_reddit = numpy.linalg.norm(embeddings_reddit, axis=1)
    normalized_reddit = embeddings_reddit / normalizer_reddit.reshape(-1, 1)

    # (n_reddit x 4096) times (4096 x n_bible): cosine similarity of every pair
    reddit_x_bible = numpy.matmul(normalized_reddit, normalized_bible.transpose())

    matrix = reddit_x_bible.tolist()
    distances = []
    distances_double = []
    distances_index = []
    bible_indeces = []

    # for each reddit question, keep its single closest bible question
    for cur_index, reddit_row in enumerate(matrix):
        closest = max(reddit_row)
        bible_indeces.append(reddit_row.index(closest))
        distances.append(closest)
        distances_double.append(closest)

        final_pairs.write("\n-------\n" + "distance: " + str(closest) + "\n" + name + "\n" + temp_reddit[cur_index] + "\n" + bible_questions[reddit_row.index(closest)] + "\n-------\n")

    # sort the pairs from most to least similar, remembering where
    # each distance originally lived
    distances.sort(reverse=True)
    for distance in distances:
        distances_index.append(distances_double.index(distance))

    pairs[name]["distances"] = distances
    pairs[name]["distances_indexer"] = distances_index
    pairs[name]["bible_question"] = bible_indeces

rqbq_data_file.write(str(pairs))
rqbq_data_file.close()
final_pairs.close()
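Looking back, the sort-then-find-the-index dance above (which also quietly assumes no two pairs share the exact same distance) could be collapsed into a single numpy.argsort call; a sketch with made-up distances:

import numpy

distances = numpy.array([0.71, 0.93, 0.64, 0.88])  # made-up similarities

# indices of the pairs ordered from most to least similar
order = numpy.argsort(distances)[::-1]  # -> [1, 3, 0, 2]
print(distances[order])                 # -> [0.93, 0.88, 0.71, 0.64]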
  2. Basil.js
    1. Using Basil's CSV method, I was able to load pair data into the book.
      1. from rqbq_data import rqbq_dictionary
        from bibleverses import find_verse
        import random
        import csv
         
         
        from rel_advice import rel_advice
        from legal_advice import legal_advice
        from techsupport import techsupport
        from needfriend import needfriend
        from internetparents import internet_parents
        from socialskills import socialskills
        from advice import advice
        from needadvice import needadvice
         
        from final_bible_questions import bible_questions
         
         
         
        # print(len(rqbq_dictionary))
        # print(rqbq_dictionary.keys())
         
        list_subreddits = ["relationship_advice", "legaladvice", "techsupport", "needafriend", "internetparents", "socialskills", "advice", "needadvice"]
        subreddit_questions = [rel_advice, legal_advice, techsupport, needfriend, internet_parents, socialskills, advice, needadvice]
         
         
write_csv = []

def getPage():
    # pick a random subreddit, then a random pair from its closest matches
    subreddit_index = random.randint(0, 7)
    subreddit = list_subreddits[subreddit_index]
    print("subreddit: ", subreddit)
    length = len(rqbq_dictionary[subreddit]["distances"])
    print("length: ", length)
    # the upper bound is the accuracy cutoff: 0 = most accurate pair,
    # length - 1 = least accurate pair
    random_question = random.randint(0, 500)
    print("random question num: ", random_question)
    print("distance of random question: ", rqbq_dictionary[subreddit]["distances"][random_question])
    print("index of random question: ", rqbq_dictionary[subreddit]["distances_indexer"][random_question])
    index_rand_q = rqbq_dictionary[subreddit]["distances_indexer"][random_question]
    index_rand_q_bible = rqbq_dictionary[subreddit]["bible_question"][index_rand_q]
    print("question: ", subreddit_questions[subreddit_index][index_rand_q])
    print("verse: ", bible_questions[index_rand_q_bible])
    verse = find_verse(bible_questions[index_rand_q_bible])
    write_csv.append([rqbq_dictionary[subreddit]["distances"][random_question], subreddit, subreddit_questions[subreddit_index][index_rand_q], verse, bible_questions[index_rand_q_bible]])

# fifteen pairs per book
for i in range(15):
    getPage()

with open('redditxBible.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["distance", "subreddit", "reddit_question", "verse", "bible_question"])
    writer.writerows(write_csv)
2. Example of one of the book's CSVs: redditxBible
2. As mentioned earlier, stripping the verse data came back to bite me when I realized it would be nice to show the verses. So I had to run each Bible question through this code to recover its verse:
      1. import json
         
        with open("kjv.json", "r") as read_file:
            bible_corpus = json.load(read_file)
         
# an example question, as it appears in the verse-less text file
sample = " went up against the king of Assyria to the river Euphrates: and king Josiah went against him; and he slew him at Megiddo, when he had seen him."

def find_verse(string):
    # brute-force search: return the name of the first verse
    # whose text contains the given sentence
    for book in bible_corpus["books"]:
        for chapter in book["chapters"]:
            for verse in chapter["verses"]:
                if string in verse["text"]:
                    return verse["name"]
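Running the sample sentence through it gives back the verse name; assuming kjv.json uses the usual "Book Chapter:Verse" naming, that looks something like:

print(find_verse(sample))  # -> "2 Kings 23:29" (exact string depends on kjv.json)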
    3. Designing the book
      1. Unfortunately, I didn't save many iterations from my design process, but I did play with having two columns, and other ways of organizing the text+typography.
4. I had used Basil.js before in Kyu's class last year, which really helped with knowing how to auto-resize the text boxes.
      1. That way, I was able to get exact distances between all the text boxes.
5. I had some trouble rotating the text boxes and getting their locations after rotation.
    6. The circle was drawn easily by mapping the distance between sentences to a relative size of the page.
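I don't have the exact Basil.js line saved, but the mapping is just a linear interpolation; a sketch in Python with a made-up page width:

def map_range(value, in_lo, in_hi, out_lo, out_hi):
    # linear interpolation, like Processing/Basil.js's map()
    return out_lo + (value - in_lo) * (out_hi - out_lo) / (in_hi - in_lo)

# a pair with cosine similarity 0.83 on a hypothetical 600 pt wide page:
print(map_range(0.83, 0.0, 1.0, 10, 600))  # -> a 499.7 pt circle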
Examples

Example spreads: 00-nannon, 01-nannon, 02-nannon, 03-nannon, 05-nannon

Thoughts

Overall, I really enjoyed making this book. This project was super interesting in that I had to apply more programming knowledge than in previous projects in order to combine all the parts into fully usable code. The parts couldn't just work disparately; they had to work all together too. Piping everything together was definitely the toughest part. I only wish I had more time to design the book, but I will probably keep working on that going forward.

See a list of ALL (8,000+) pairs here