## 1\. How big is the data?

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1737962635743/b3ceaa34-7c45-4313-b09f-2bdf1c455c76.png align="center")

`df.shape` returns the **dimensions of a DataFrame** as a `(rows, columns)` tuple. It is an attribute, not a method, so it takes no parentheses.
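As a minimal sketch of the idea, the snippet below builds a tiny made-up DataFrame (the column names and values are assumptions for illustration, not the data from the screenshot) and unpacks its shape.

```python
import pandas as pd

# Hypothetical stand-in data; the article's real dataset is only shown as an image.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple is handy when you only need the row count, for example to compute percentages later.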

## 2\. What does the data look like?

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1737964454244/94634416-5911-4588-b913-059ef91802fe.png align="center")

`df.sample()` in pandas returns a **random sample of rows** from a DataFrame (or of columns, with `axis=1`). By default it returns a single row; pass `n` to ask for more.
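A small sketch of random sampling, again on made-up data (the columns here are assumptions). Passing `random_state` is optional; it is used only to make the sample reproducible.

```python
import pandas as pd

# Hypothetical stand-in data for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0],
    "Sex": ["male", "female", "female", "male", "female"],
})

one_row = df.sample()                              # default: one random row
three_rows = df.sample(n=3, random_state=42)       # three rows, reproducible
two_cols = df.sample(n=2, axis=1, random_state=42) # random columns instead of rows

print(three_rows)
```

Compared with `df.head()`, a random sample is less likely to be misleading when the file happens to be sorted.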

## 3\. What are the data types of the columns?

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1737964593583/d5807b0a-4125-4895-aa6c-6723fb37b48b.png align="center")

`df.info()` is a **pandas** method that prints a concise summary of a DataFrame: the column names, the non-null count and data type of each column, and the memory usage.
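The sketch below shows the call on a small made-up DataFrame (column names are assumptions). `df.info()` prints its summary and returns `None`, so `df.dtypes` is shown alongside it for cases where you want the types as a value.

```python
import pandas as pd

# Hypothetical stand-in data for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, None, 26.0],
    "Name": ["Ann", "Bob", "Cid"],
})

# Prints column names, non-null counts, dtypes, and memory usage.
df.info()

# dtypes returns the same type information as a Series you can inspect in code.
dtypes = df.dtypes
print(dtypes)
```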

## 4\. Are there any missing values?

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1737964696984/6c13675e-a02a-4a91-95d4-429a9c48c3c1.png align="center")

`df.isnull().sum()` in pandas **counts the missing (null or NaN) values** in each column of a DataFrame: `isnull()` marks every missing cell as `True`, and `sum()` adds those up per column.
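Here is a minimal sketch on made-up data (the columns and values are assumptions). It also computes the share of missing values per column, which is often easier to read than raw counts.

```python
import pandas as pd

# Hypothetical stand-in data with some deliberately missing cells.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": ["C85", None, None, None],
    "Survived": [0, 1, 1, 0],
})

# Number of missing values in each column.
missing = df.isnull().sum()

# Percentage of missing values: the mean of a boolean column is its share of True.
missing_pct = df.isnull().mean() * 100

print(missing)
print(missing_pct)
```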

## 5\. How does the data look mathematically?

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1737964850475/dc0fe74d-e3b5-4e5a-b97a-e867d3a176fb.png align="center")

`df.describe()` in pandas generates **descriptive statistics** (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns of a DataFrame. Non-numeric columns are skipped by default.
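A short sketch on made-up data (the columns are assumptions) showing the default numeric summary and, as an optional extra, `include="all"` to bring non-numeric columns into the table.

```python
import pandas as pd

# Hypothetical stand-in data for illustration only.
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "Fare": [10.0, 20.0, 60.0],
    "Sex": ["male", "female", "female"],
})

# Default: numeric columns only.
numeric_stats = df.describe()

# include="all" also summarizes text columns (count, unique, top, freq).
all_stats = df.describe(include="all")

print(numeric_stats)
```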

## 6\. Are there duplicate values?

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1737964934886/124cd07f-91bb-4caa-9e69-e1ce69070408.png align="center")

`df.duplicated().sum()` in pandas **counts duplicate rows** in a DataFrame. A row counts as a duplicate when every value matches an earlier row; the first occurrence is not flagged.
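The sketch below uses a tiny made-up DataFrame (names and values are assumptions) in which one row repeats another. `drop_duplicates()` is included as one common follow-up, not as something the article itself prescribes.

```python
import pandas as pd

# Hypothetical stand-in data: the third row repeats the first.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Age": [30, 25, 30, 41],
})

# duplicated() flags rows that repeat an earlier row; sum() counts the flags.
n_dupes = int(df.duplicated().sum())

# One possible follow-up: keep only the first occurrence of each row.
deduped = df.drop_duplicates()

print(n_dupes)
```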

## 7\. How are the columns correlated?

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1737965035598/5e1d2903-1fc3-46ed-b186-c8ea95f12ccb.png align="center")

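`df.corr()['Survived']` computes the **correlation coefficients** (Pearson by default) between the column `Survived` and all other numeric columns in the DataFrame. In pandas 2.0 and later, `df.corr()` raises an error if the DataFrame contains non-numeric columns, so pass `numeric_only=True` in that case.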
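As a hedged sketch, the snippet below runs the correlation on made-up data that includes a text column (all names and values are assumptions). It assumes a pandas version that supports the `numeric_only` argument (1.5 or later).

```python
import pandas as pd

# Hypothetical stand-in data with one non-numeric column.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Fare": [7.0, 71.0, 53.0, 8.0, 30.0],
    "Pclass": [3, 1, 1, 3, 2],
    "Sex": ["male", "female", "female", "male", "female"],
})

# numeric_only=True skips the text column instead of raising an error.
corr_with_target = (
    df.corr(numeric_only=True)["Survived"]
    .sort_values(ascending=False)
)

print(corr_with_target)
```

Sorting the result puts the strongest positive relationships first and the strongest negative ones last. The column's correlation with itself is always 1.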

Key dataset questions: size, structure, types, missing values, statistics, duplicates, correlations. Use pandas for analysis. Ideal for beginners.

How to Understand Any Dataset: 7 Essential Questions

1. How big is the data?

2. What does the data look like?

3. What are the data types of the columns?

4. Are there any missing values?

5. How does the data look mathematically?

6. Are there duplicate values?

7. How are the columns correlated?