It provides a fast, memory-efficient multi-dimensional array (ndarray) object and an extensive collection of routines for operating on these arrays (numpy.org). In practice, NumPy makes it easy to perform mathematical, logical, and statistical computations on arrays of data without writing explicit loops in Python. The core structure is the NumPy array, which is like a Python list but with important differences. NumPy arrays have a fixed size and each element is of the same data type, allowing efficient storage and vectorized operations in C under the hood (numpy.org). This uniformity and low-level optimization mean NumPy can handle large data sets with superior performance compared to native Python structures. NumPy’s functionality spans basic element-wise arithmetic, linear algebra (e.g. matrix multiplication), random number generation, discrete Fourier transforms, and more (numpy.org). It serves as the foundation of the Python scientific stack, powering libraries such as Pandas, SciPy, scikit-learn, and others (these libraries typically accept NumPy arrays as input and build on them internally) (numpy.org). In summary, NumPy is a widely used library that adds high-speed numeric capabilities to Python, making it indispensable for data analysis, machine learning, simulation, and many other domains. (According to the NumPy documentation and Wikipedia, it’s “the fundamental package for scientific computing” in Python (numpy.org), providing support for large multi-dimensional arrays and high-level math functions (en.wikipedia.org).)
How to use NumPy in Python for data exploration?
NumPy is a powerful tool for data exploration because it enables fast computations and easy slicing of data. With NumPy arrays, you can quickly compute summary statistics and apply transformations to your dataset with minimal code. For example, given a NumPy array of values, you can compute the mean, standard deviation, or sum across the whole array or along an axis using built-in functions like np.mean, np.std, np.sum, etc. These functions operate on entire arrays at once (vectorized operations) and are highly optimized in C, making them much faster than pure Python loops (medium.com). NumPy also allows for filtering data using boolean conditions. You can easily extract subsets of data that meet certain criteria (e.g., all values above a threshold) by using boolean indexing or the np.where function, which is very handy for exploratory analysis. For instance, if data is a NumPy array, data[data > 100] gives all elements greater than 100, and np.where(data > 100) returns the indices of those elements. This makes it straightforward to find outliers or particular subsets in your dataset. NumPy integrates smoothly with other data science tools; for example, you can use NumPy arrays as inputs to plotting functions in Matplotlib or as the basis for tables in Pandas (medium.com). (Functions like matplotlib.pyplot.xticks for setting x-axis tick labels expect standard Python or NumPy arrays for the tick positions.) In practice, you might load a dataset (e.g. from a CSV) into a NumPy array and then quickly compute insights:
import numpy as np
# Example: simulate a dataset of 1000 random values (e.g., ages of people)
data = np.random.randint(0, 100, size=1000)
print("Mean age:", np.mean(data))  # compute mean
print("Age 90th percentile:", np.percentile(data, 90))  # 90th percentile
# Filter data: find all ages between 18 and 30
young_adults = data[(data >= 18) & (data <= 30)]
print("Number of ages 18-30:", young_adults.size)
Example output (exact values vary from run to run because the data is random):
Mean age: 49.2
Age 90th percentile: 89
Number of ages 18-30: 135
In this way, NumPy provides the computational backbone for exploring data. It is especially useful for numerical datasets, where you might need to compute aggregates, count occurrences, or apply transformations. (For counting unique values – analogous to “value counts” – NumPy offers np.unique with the return_counts=True option to get frequencies (geeksforgeeks.org).) Often, you’ll use NumPy alongside Pandas: for example, you might use Pandas to read a CSV into a DataFrame and then convert certain columns to NumPy arrays for low-level computations. In summary, NumPy’s fast array operations and easy slicing make it ideal for quickly summarizing data, computing statistics, filtering conditions, and preparing data for further analysis or visualization in Python.
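As a small sketch of the value-counts idea mentioned above, np.unique with return_counts=True returns the sorted unique values together with how often each occurs. The codes array is made-up sample data used only for illustration.

```python
import numpy as np

# Hypothetical sample data: small integer category codes
codes = np.array([3, 1, 2, 3, 3, 1, 2, 3])

# Sorted unique values plus the number of times each one appears
values, counts = np.unique(codes, return_counts=True)

for value, count in zip(values, counts):
    print(f"{value}: {count}")

# The value with the highest count
most_common = values[np.argmax(counts)]
print("Most frequent:", most_common)
```

The two returned arrays line up position by position, so counts[i] is the frequency of values[i].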
Getting started with NumPy
Installing NumPy: If you don’t already have NumPy, you can install it using Python’s package manager pip by running pip install numpy. (In Anaconda distributions, NumPy usually comes pre-installed.) On Windows, Mac, or Linux, the installation process is the same via pip, since NumPy provides pre-compiled binary wheels for all major platforms (pypi.org). For example:
pip install numpy
This will install NumPy from the Python Package Index. If you are using Visual Studio Code and encounter a “No module named numpy” error, it likely means you need to install NumPy in the Python environment that VS Code is using, or you haven’t selected the correct interpreter. Running the above pip command in the VS Code terminal (or using VS Code’s Python: Select Interpreter to choose the environment where NumPy is installed) will resolve the issue. In summary, ensure that NumPy is installed in your working environment – once installed, it works on Windows, macOS, and Linux seamlessly (NumPy is cross-platform).
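When the “No module named numpy” error appears, it can help to check which interpreter is actually running your code. The following is a rough diagnostic sketch using only the standard library; it is one possible way to check, not an official VS Code procedure.

```python
import importlib.util
import sys

# Path of the Python interpreter running this script;
# pip has to install NumPy into this same environment.
print("Interpreter:", sys.executable)

# find_spec returns None when the package cannot be imported from this interpreter
numpy_spec = importlib.util.find_spec("numpy")
if numpy_spec is None:
    print("NumPy is not installed here. Try:", sys.executable, "-m pip install numpy")
else:
    print("NumPy found at:", numpy_spec.origin)
```

Running pip through the interpreter path printed above (with -m pip) installs into exactly that environment, which avoids the mismatch between the terminal and the selected interpreter.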
Importing NumPy: By convention, NumPy is imported under the alias np. In your Python code, you’ll typically write:
import numpy as np
This allows you to access NumPy functions with the shorter prefix np, for example np.array(...) or np.mean(...). The statement import numpy as np simply means “import the NumPy library and refer to it as np in this script.” This alias is not mandatory, but it’s a widely followed convention in the Python community (medium.com).
Checking the version: You can check your NumPy version by printing np.__version__. NumPy has a long development history – for instance, NumPy 1.x was maintained for many years, and as of mid-2025 the library is in the 2.x series (NumPy 2.3.1 was released in June 2025, per github.com). For most basic usage and array operations, differences between versions are minimal, but it’s good to use a recent version to have the latest features and performance improvements.
Once NumPy is installed and imported, you’re ready to start creating arrays and performing calculations. In the next sections, we’ll cover how to create NumPy arrays and use key features of the library.
How to create NumPy arrays?
Creating arrays is the first step to using NumPy. There are several convenient ways to initialize a NumPy array:
From Python lists or tuples
You can create a NumPy array directly from a Python list or tuple using the np.array() function. This will copy the data into a NumPy ndarray. For example:
import numpy as np
py_list = [1, 2, 3, 4]
arr = np.array(py_list)
print(arr)  # output: [1 2 3 4]
print(arr.dtype)  # data type of elements
Here, arr becomes a NumPy array. By default, NumPy infers an appropriate data type (dtype) from the input data – in this case, integers. You can explicitly specify a dtype if needed, for instance np.array([1,2,3], dtype=float), which would produce an array of floats (1.0, 2.0, 3.0). If you create a 2D list (a list of lists), np.array will create a multi-dimensional array from it. For example:
matrix_list = [[1, 2, 3],
[4, 5, 6]]
matrix_arr = np.array(matrix_list)
print(matrix_arr.shape)
# (2, 3) – 2 rows, 3 columns
This produces a 2x3 array. Note that unlike Python lists, all elements of a NumPy array must be of the same type (NumPy will upcast types if needed). If types are incompatible, they may become Python objects, but generally you want numeric types for numerical computing.
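To make the upcasting behaviour concrete, here is a short sketch showing what dtype NumPy picks for a few mixed inputs. The variable names are only illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])     # int and float -> upcast to float64
with_text = np.array([1, "a"])    # number and string -> every element becomes a string
objects = np.array([1, None])     # no common numeric type -> generic Python objects

print(ints.dtype.kind)        # 'i' (integer; exact width depends on the platform)
print(mixed.dtype)            # float64
print(with_text.dtype.kind)   # 'U' (unicode string)
print(objects.dtype)          # object
```

Object arrays lose most of the speed benefits, so it is usually worth converting the data to a numeric dtype before doing calculations.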
Using built-in NumPy functions (zeros, ones, etc.)
NumPy provides functions to create arrays initialized with default values, which is very handy for quickly setting up arrays of a given shape:
np.zeros(shape, dtype=float) – creates an array of the given shape filled with 0. For example, np.zeros((3,4), dtype=int) creates a 3x4 array of zeros (an integer zero matrix) (medium.com).
np.ones(shape, dtype=float) – creates an array of given shape filled with 1. For example, np.ones((2,5)) gives a 2x5 array of 1.0s.
np.full(shape, fill_value) – creates an array of given shape filled with a specified value. E.g., np.full((2,2), 7) produces [[7,7],[7,7]].
np.eye(n) or np.identity(n) – creates an n×n identity matrix (1s on the diagonal, 0s elsewhere). For example, np.eye(3) gives a 3x3 identity matrix.
These functions are very efficient. Under the hood, they allocate the memory for the array and set the values in one step. For instance, using np.zeros to create a large zero-filled array is much faster than creating a Python list and converting it to an array. You can also specify the data type if you need something other than the default float64; for example, np.zeros((5,5), dtype=complex) would create a 5×5 array of complex zeros (0+0j).
Example:
zeros_arr = np.zeros((2,3))
ones_arr = np.ones((3,2), dtype=int)
print(zeros_arr)
# Output: [[0. 0. 0.]
#          [0. 0. 0.]]
print(ones_arr, ones_arr.dtype)
# Output: [[1 1]
#          [1 1]
#          [1 1]] int64
As shown, zeros_arr is a 2x3 array of float64 zeros, and ones_arr is a 3x2 array of int64 ones (int64 is the default integer type on most platforms). These convenience functions make it easy to initialize arrays for further use (e.g., a zero matrix to fill in later, or an array of ones to use as a starting point in an algorithm). There are also variations like np.zeros_like(existing_array) and np.ones_like(...) which create an array of zeros/ones with the same shape and data type as a given array.
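The remaining constructors from the list above can be tried the same way. This is a minimal sketch; the template array is an arbitrary example.

```python
import numpy as np

filled = np.full((2, 2), 7)     # 2x2 array where every element is 7
identity = np.eye(3)            # 3x3 identity matrix (float64 by default)

template = np.array([[1.5, 2.5, 3.5],
                     [4.5, 5.5, 6.5]])
zeros_same = np.zeros_like(template)   # zeros with template's shape and dtype
ones_same = np.ones_like(template)     # ones with template's shape and dtype

print(filled)
print(identity)
print(zeros_same.shape, zeros_same.dtype)
```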
Creating sequences: np.arange and np.linspace
For generating sequences of numbers (regular intervals), NumPy offers np.arange and np.linspace:
np.arange(start, stop, step): Similar to Python’s built-in range, this function returns an array of evenly spaced values within a half-open interval [start, stop). You can specify a step (which can be an integer or float). For example, np.arange(0, 10, 2) produces array([0, 2, 4, 6, 8]). If only one argument is given, it’s taken as the stop (with start=0). Note that np.arange can produce non-integer steps (like 0.5), but be mindful of floating-point precision. (There is no np.xrange – in Python 3, range itself handles large ranges efficiently. For NumPy, use arange for similar functionality.)
np.linspace(start, stop, num): Generates a specified number of evenly spaced values between start and stop inclusive. For example, np.linspace(0, 1, 5) yields array([0.00, 0.25, 0.50, 0.75, 1.00]) with 5 points from 0 to 1 inclusive. This is particularly useful for creating grids of points or sample points for plotting functions. You can also specify endpoint=False if you want the stop value to be excluded.
Example:
arr1 = np.arange(0, 5, 1)
# [0 1 2 3 4]
arr2 = np.arange(0, 5, 0.5)
# [0. 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5]
arr3 = np.linspace(0, 5, 6)
# [0. 1. 2. 3. 4. 5.] (6 evenly spaced numbers from 0 to 5)
In arr2 above, we used a step of 0.5, demonstrating that np.arange works with floats (but note that the stop value 5 is not included: the interval is half-open, so the sequence ends at 4.5). The np.linspace in arr3 includes the end value 5, giving six points including both 0 and 5.
These functions are extremely handy for generating index arrays, time axes, or any regularly spaced sequence. For instance, if you need values from 0 to 2π for a sine wave, np.linspace(0, 2*np.pi, 100) will give 100 points between 0 and 2π.
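The sine-wave case above can be sketched in a few lines, along with the endpoint=False option. The variable names are illustrative.

```python
import numpy as np

# 100 sample points from 0 to 2*pi (both ends included) and the sine at each point
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
print(x.shape, y.shape)   # (100,) (100,)

# endpoint=False leaves out the stop value, giving a half-open interval like arange
half_open = np.linspace(0, 1, 5, endpoint=False)
print(half_open)          # [0.  0.2 0.4 0.6 0.8]
```

When you need an exact number of points, np.linspace is usually the safer choice, because the count is given directly rather than derived from a floating-point step.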
In summary, NumPy makes array creation easy – whether you’re converting existing data (lists) or programmatically generating data (ranges, grids, or default-filled arrays). Once you have arrays, you can start manipulating them with NumPy’s operations. Next, we’ll look at how to inspect array properties like shape, and how to reshape arrays.
What is shape in NumPy?
Every NumPy array has a shape, which is a tuple indicating the size of the array in each dimension. The shape tells you the structure: for a 1D array it’s (N,), for a 2D array it’s (rows, columns), for a 3D array it might be (depth, height, width), etc. You can access the shape via the .shape attribute of the array. For example, if a = np.array([[1,2,3],[4,5,6]]), then a.shape will return (2, 3) because the array has 2 rows and 3 columns. This matches our expectation since we created it from a 2x3 nested list. In general, if an array’s shape is (d1, d2, ..., dn), we say it is an n-dimensional array with size d1 along axis 0, d2 along axis 1, and so on. The number of dimensions (n) is accessible via the .ndim attribute, and the total number of elements via .size. For instance, continuing with array a above: a.ndim would be 2 (it’s a 2D array) and a.size would be 6 (since 2*3 = 6 elements in total). NumPy arrays are homogeneous, so another attribute a.dtype gives the data type of the elements (e.g., int64, float32). These attributes (ndim, shape, size, dtype) are useful for understanding an array’s structure (numpy.org).
Example:
X = np.random.randint(0, 10, size=(3,4))  # 3x4 array of random ints
print(X)
# Example output:
# [[5 0 1 3]
#  [7 2 9 4]
#  [1 6 8 8]]
print("Shape:", X.shape)  # Shape: (3, 4)
print("Num. of dimensions:", X.ndim)  # 2
print("Total elements:", X.size)  # 12
print("Data type:", X.dtype)
In this example, X.shape is (3,4) meaning 3 rows and 4 columns. X.ndim confirms it’s 2-dimensional, and X.size = 12 (3*4). The dtype might be int64 (or another integer type depending on the platform). Knowing the shape is important because many operations (like matrix multiplication or reshaping) depend on compatible shapes. When we print an array, NumPy displays it in a neat format that reflects its shape (as seen above, a 2D array is printed row by row). In higher dimensions, the printed format separates dimensions with newlines appropriately.
In summary, the shape of a NumPy array is a fundamental property that describes its dimensions. It’s how you’d answer the question “what is the structure of this array?”. Often you’ll use array.shape to quickly verify the dimensions of data, especially when debugging or preparing arrays for operations that require specific shapes. Next, we’ll see how to change the shape of an array using reshaping.
What is reshape in NumPy?
Reshaping means changing the shape (dimensions) of an array without changing its data. In NumPy, the method .reshape() allows you to rearrange the elements of an array into a new shape. For example, you might have a 1D array of length 12 and want to view it as a 3x4 matrix – reshaping makes this possible (assuming the total number of elements remains the same). The syntax is typically new_arr = old_arr.reshape(new_shape). The new_shape can be given as a tuple or as separate arguments. One dimension can be -1 to automatically calculate its size based on the total number of elements.
Example:
v = np.arange(15)  # v is [0, 1, 2, ..., 14], shape (15,)
print("Original v:", v, "shape:", v.shape)
# Original v: [ 0 1 2 ... 13 14] shape: (15,)
M = v.reshape(3, 5)
print("Reshaped M:\n", M)
print("M shape:", M.shape)
Output:
Original v: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14] shape: (15,)
Reshaped M:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
M shape: (3, 5)
In this example, we started with a 1D array v of length 15. After M = v.reshape(3,5), we got a 3x5 array M. Notice that the data in M is the same sequence 0 through 14, just laid out in a 2D grid. The reshape operation is possible because 3 × 5 = 15, which matches the total number of elements. If we tried v.reshape(4,4), it would raise an error because 4 × 4 = 16, which doesn’t match 15 elements.
Reshape is often used to prepare data for certain operations. For example, if you have a flat list of values and you know they represent a matrix, you can reshape into the matrix form. It’s also common in machine learning to reshape images (which might be 1D arrays when read, into 2D or 3D arrays) or to flatten multi-dimensional arrays into 1D for algorithms.
A convenient feature is using -1 as one of the dimensions in reshape when you don't want to calculate that dimension explicitly. NumPy will infer it. E.g., v.reshape(3, -1) would automatically produce a (3,5) shape in the above case, since 15/3 = 5. Conversely, v.reshape(-1, 5) would infer the first dimension as 3. This is useful when you know one dimension and want NumPy to figure out the other.
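A quick sketch of the -1 shorthand, using the same 15-element array as in the example above:

```python
import numpy as np

v = np.arange(15)

a = v.reshape(3, -1)    # second dimension inferred: 15 / 3 = 5
b = v.reshape(-1, 5)    # first dimension inferred: 15 / 5 = 3
flat = a.reshape(-1)    # a lone -1 flattens the array back to 1D

print(a.shape, b.shape, flat.shape)   # (3, 5) (3, 5) (15,)
```

Only one dimension can be -1 in a given call, since NumPy needs the others to work out the missing size.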
One thing to note: reshape typically returns a view (not a copy) of the original array whenever possible, meaning the data is shared. If you modify M in the above example, v would see the changes because they reference the same data in memory. If NumPy cannot do it as a view (for example, if memory is not contiguous in the needed order), it will make a copy. But usually for simple reshape patterns (like changing a contiguous 1D array into a 2D array), it’s just a view. This makes reshape very efficient.
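The view-versus-copy behaviour can be checked directly with np.shares_memory. The sketch below reuses v and M from the example above; the transposed case is just one illustration of a reshape that cannot be done as a view.

```python
import numpy as np

v = np.arange(15)
M = v.reshape(3, 5)

M[0, 0] = 99                     # write through the reshaped view
print(v[0])                      # 99, because v and M share the same data
print(np.shares_memory(v, M))    # True

# Flattening a transposed array changes the element order in memory,
# so NumPy has to copy the data here.
flat_t = M.T.reshape(-1)
print(np.shares_memory(M, flat_t))   # False
```

If you need to be sure a result is independent of the original, call .copy() explicitly rather than relying on which case applies.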
In summary, np.reshape (or array.reshape) is the tool to change how you view the array’s dimensions. It answers the question “how can I interpret this data in a different shape?” without changing the data itself. Always ensure that the target shape is compatible with the array’s size. If it’s not, you’ll get a ValueError.
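For completeness, here is what the incompatible-shape case looks like in practice. This sketch catches the ValueError so the script keeps running; the reshape_failed flag is only there for illustration.

```python
import numpy as np

v = np.arange(15)
reshape_failed = False

try:
    v.reshape(4, 4)   # 4 * 4 = 16 slots, but v has only 15 elements
except ValueError as err:
    reshape_failed = True
    print("Reshape failed:", err)
```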
Indexing and slicing NumPy arrays
Indexing and slicing in NumPy arrays works similarly to Python lists, but with added power for multi-dimensional arrays. Being able to retrieve or modify specific elements or subarrays is crucial for effectively using NumPy.
Basic indexing (1D): For a one-dimensional array, indexing is straightforward. If arr = np.array([10, 20, 30, 40]), then arr[0] is 10, arr[1] is 20, and so on (indexes start at 0). Negative indices work too (arr[-1] is 40, the last element). This is just like lists.
Slicing (1D): You can use : to slice ranges. For example, arr[1:3] would give a subarray [20, 30] (elements at indices 1 and 2). Slicing in NumPy, like in Python, uses a half-open interval – it includes the start index but excludes the end index. You can omit start or end to slice from the beginning or to the end respectively. For instance, arr[:2] gives [10, 20] and arr[2:] gives [30, 40]. Also, arr[:] gives a view of the whole array. One important difference from Python lists is that slices of NumPy arrays are views, not copies. If you do sub = arr[1:3] and then modify sub, those changes will reflect in the original arr as well. (If you need an independent copy, you should explicitly call .copy().) This behavior is by design for efficiency – it avoids data duplication when working with subarrays.
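The difference between a slice (view) and an explicit copy is easy to see in a short sketch, using the same arr as above:

```python
import numpy as np

arr = np.array([10, 20, 30, 40])

sub = arr[1:3]      # a view onto elements 1 and 2
sub[0] = 99         # modifies arr as well
print(arr)          # [10 99 30 40]

independent = arr[1:3].copy()   # separate data
independent[0] = -1             # arr is not affected
print(arr)          # [10 99 30 40]
```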
Indexing in multi-dimensional arrays: For multi-D arrays, NumPy uses a comma-separated tuple of indices for indexing. For a 2D array M with shape (m, n), M[i, j] accesses the element in the i-th row and j-th column. For example:
M = np.array([[5, 7, 1],
[3, 2, 9],
[8, 6, 4]])
print(M[0, 1])  # element at 1st row, 2nd column -> 7
print(M[2, 2])  # element at 3rd row, 3rd column -> 4
You can also use negative indices for multi-D: M[-1, -1] would be 4 (last row, last column in the above matrix).
Slicing multi-dimensional arrays: You can slice along each axis by separating slices with commas. For instance, M[0:2, 0:3] would take the submatrix consisting of the first 2 rows and first 3 columns of M (medium.com). For the matrix M above, the narrower slice M[0:2, 0:2] would yield [[5, 7],[3, 2]]. You can mix indices and slices: e.g., M[:, 0] means “all rows, column 0”, effectively extracting the first column as a 1D array ([5, 3, 8] in our example). Similarly, M[1, :] would give the second row as a 1D array ([3, 2, 9]). The colon by itself means “take everything in this dimension.”
You can use the slice step as well. For example, if you have a 1D array x = np.arange(10) (which is [0,1,2,3,4,5,6,7,8,9]), then x[::2] will slice every 2nd element -> [0,2,4,6,8]. This works in higher dimensions too – for instance, M[:, ::-1] would reverse the columns of M (because ::-1 means step -1, i.e., reverse order) and M[::-1, :] would reverse the rows.
Important: When slicing, the result is typically a view into the original array, not a copy (as noted for 1D). This means you can use slicing to modify sub-parts of an array in-place. For example, M[0:2, :] = 0 would set the first two rows of M to all zeros. This modifies the original M without creating a new array.
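The multi-dimensional slicing patterns described above can be tried together in one sketch. It uses the same matrix M as the earlier example; Z is a copy introduced here so that the in-place assignment does not alter M.

```python
import numpy as np

M = np.array([[5, 7, 1],
              [3, 2, 9],
              [8, 6, 4]])

first_col = M[:, 0]          # all rows, column 0 -> [5 3 8]
second_row = M[1, :]         # row 1, all columns -> [3 2 9]
top_left = M[0:2, 0:2]       # first 2 rows and first 2 columns
cols_reversed = M[:, ::-1]   # column order reversed
rows_reversed = M[::-1, :]   # row order reversed

Z = M.copy()
Z[0:2, :] = 0                # in-place: zero out the first two rows of Z
print(Z)
```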
Fancy indexing (index arrays): NumPy extends indexing capabilities by allowing arrays of indices. Suppose v = np.arange(0, 30, 3) which gives [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]. If you want to pick, say, the 2nd, 5th, and 7th elements (i.e., values 3, 12, 18), you can use an index list or array: idx = [1, 4, 6] and then v[idx] (medium.com). This returns a new array with the elements at those positions: [3, 12, 18]. This is called fancy indexing. It is very powerful – you can use multi-dimensional index arrays as well (though the indexing rules get more complex in higher dimensions). One thing to remember: fancy indexing always creates a copy of the data (unlike basic slicing). So v[idx] does not share memory with v.
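A short sketch of fancy indexing with the same v and idx, including a check that the result is a copy:

```python
import numpy as np

v = np.arange(0, 30, 3)    # [ 0  3  6  9 12 15 18 21 24 27]
idx = [1, 4, 6]

picked = v[idx]            # elements at positions 1, 4 and 6
print(picked)              # [ 3 12 18]

picked[0] = -1             # changing the copy leaves v untouched
print(v[1])                # 3
print(np.shares_memory(v, picked))   # False
```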
Boolean indexing: Another extremely useful technique is indexing with boolean arrays (also known as masking). If you have a condition, you can obtain an array of booleans of the same shape, and use that to index the array. For example, continuing with v above: mask = v < 17 would produce an array of booleans like [ True, True, True, True, True, True, False, False, False, False] (since v = [0,3,6,9,12,15,18,...], the first six are <17). Now, v[mask] will return a new array with only the elements where the mask is True (medium.com). In this case it would yield [0, 3, 6, 9, 12, 15] – all values of v that are < 17. This is a concise way to filter an array for values that meet a condition. You can combine conditions using bitwise logical operators & (and), | (or), ~ (not). Important: When combining conditions, each condition must be parenthesized, e.g. (v > 5) & (v < 20), because & binds more tightly than the comparison operators. This returns a boolean array that you can use as v[(v > 5) & (v < 20)] to get all elements between 5 and 20. For multi-dimensional arrays, the boolean mask either has the same shape as the array (selecting individual elements) or matches its leading dimension (for example, a 1D mask with one entry per row selects whole rows). Boolean indexing is extremely handy in data exploration (e.g., select all rows where some value is above a threshold).
Example of boolean indexing:
data = np.array([10, 15, 7, 30, 25, 7])
high_values = data[data > 20]  # values greater than 20
print(high_values)  # [30 25]
# multiple conditions:
between_10_and_20 = data[(data >= 10) & (data <= 20)]
print(between_10_and_20)  # [10 15]
In this example, high_values picks out 30 and 25. The second expression shows combining two conditions to get values between 10 and 20 inclusive, resulting in [10, 15]. This syntax is concise and much faster than looping through the array in Python to check each element.
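The same idea extends to selecting whole rows of a 2D array. In this sketch, table is a made-up array whose second column holds the value being tested.

```python
import numpy as np

# Hypothetical table: column 0 is an id, column 1 is a measurement
table = np.array([[1,  50],
                  [2, 120],
                  [3,  80],
                  [4, 200]])

row_mask = table[:, 1] > 100   # one boolean per row
selected = table[row_mask]     # keeps only the rows where the mask is True

print(row_mask)    # [False  True False  True]
print(selected)
```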
np.where function: Another way to achieve conditional selection is using np.where(condition, x, y). This function returns an array built by picking elements from x or y depending on the condition. If only a condition is provided (just np.where(condition)), it is equivalent to np.nonzero(condition) and returns the indices where the condition is True (numpy.org). For example:
arr = np.array([10, 3, 14, 7, 21])
idx = np.where(arr % 2 == 0)
print("Indices of even numbers:", idx)  # (array([0, 2]),)
print("Even values:", arr[idx])  # [10 14]
Here indices 0 and 2 are returned because 10 and 14 are the only even values; 21 (at index 4) is odd, so it is not selected. Any other condition works the same way: np.where(arr > 10), for instance, would give the indices of elements greater than 10. The key point is np.where with a condition returns the indices (as a tuple of arrays, one for each dimension) where the condition holds. In practice, many people find boolean indexing (as above) more direct for filtering values. But np.where(condition, A, B) is useful to choose values from A or B array of same shape based on condition (like vectorized if-else). For multiple conditions, combine them with & or | inside the condition as demonstrated. For example, np.where((arr > 5) & (arr < 15), arr, -1) would yield an array where values between 5 and 15 stay unchanged and all other values become -1.
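The three-argument form can be sketched with the same arr. The "high"/"low" labels are only an illustration of using np.where as a vectorized if-else.

```python
import numpy as np

arr = np.array([10, 3, 14, 7, 21])

# Keep values strictly between 5 and 15, replace everything else with -1
clipped = np.where((arr > 5) & (arr < 15), arr, -1)
print(clipped)     # [10 -1 14  7 -1]

# Vectorized if-else that produces labels instead of numbers
labels = np.where(arr > 10, "high", "low")
print(labels)

# Condition-only form: the indices where the condition holds
big_idx = np.where(arr > 10)[0]
print(big_idx)     # [2 4]
```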
In summary, indexing and slicing allow you to retrieve and modify subsets of your array data. NumPy’s extension to multi-dimensional indexing and the powerful fancy indexing and boolean masks give you a lot of flexibility to work with your data without explicit loops. Mastering these techniques is key to using NumPy effectively.
Mathematical operations on NumPy arrays
One of NumPy’s greatest strengths is the ability to perform vectorized mathematical operations on arrays. This means you can apply operations on entire arrays (or large chunks of them) in one go, rather than looping element by element in Python. NumPy achieves this by performing the computations in optimized C code.
Element-wise arithmetic: Basic arithmetic operators (+, -, *, /, ** for power, etc.) when applied to NumPy arrays operate element-wise. That is, given two arrays of the same shape, the operation produces a new array of that shape where each element is the result of the operation on corresponding elements of the operands. For example, if a = np.array([1,2,3]) and b = np.array([4,5,6]), then a + b yields [5, 7, 9], and a * b yields [4, 10, 18] – each position is computed independently (medium.com). Similarly, a * 5 would yield [5, 10, 15] (every element of a times 5). This vectorization saves you from writing loops; it also tends to be much faster.
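These element-wise operators can be tried directly with the same a and b:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)    # [5 7 9]
print(a * b)    # [ 4 10 18]
print(a * 5)    # [ 5 10 15]
print(a ** 2)   # [1 4 9]
print(b / a)    # [4.  2.5 2. ]  (true division always gives floats)
```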
Universal functions (ufuncs): NumPy provides a host of mathematical functions that operate element-wise on arrays. These include trigonometric functions (np.sin, np.cos, etc.), exponentials and logarithms (np.exp, np.log), square roots (np.sqrt), and many more (medium.com). For example, np.sqrt(np.array([1,4,9,16])) returns [1. 2. 3. 4.], and np.exp(np.array([0,1])) returns [1. 2.71828183] (since e^0 = 1, e^1 ≈ 2.718). You can also use these on multi-dimensional arrays; they’ll apply to every element. These functions are optimized in C and often use SIMD instructions, so they are very fast. NumPy’s ufuncs also handle broadcasting (computing on arrays of different shapes in certain compatible ways, which we’ll touch on soon).
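A brief sketch of a few ufuncs on 1D and 2D inputs. The arrays are arbitrary examples.

```python
import numpy as np

angles = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(angles)                      # approximately [0, 1, 0]

roots = np.sqrt(np.array([1, 4, 9, 16]))    # [1. 2. 3. 4.]

# ufuncs apply to every element of a multi-dimensional array as well
grid = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
round_trip = np.exp(np.log(grid))           # log followed by exp recovers grid (up to rounding)

print(sines)
print(roots)
print(round_trip)
```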
Aggregation functions: NumPy provides routines to compute aggregate statistics on arrays easily: for example, np.sum, np.mean, np.min, np.max, np.std (standard deviation), np.var (variance), np.median, np.percentile, etc. These can be applied either to the whole array or along a specific axis. By default, functions like np.sum(arr) will return a single number – the sum of all elements in the array. If you want to sum by rows or by columns in a 2D array, you can use the axis parameter. For instance, given a 2D array M, np.sum(M, axis=0) will sum down the columns (producing one value per column), and np.sum(M, axis=1) will sum across each row (medium.com). The same logic applies to mean, min, max, etc.
Example of aggregations:
data = np.array([[5, 1, 3],
[7, 2, 8],
[4, 6, 0]])
print("Total sum:", np.sum(data))  # sum of all elements
print("Column sums:", np.sum(data, axis=0))  # sum of each column
print("Row means:", np.mean(data, axis=1))  # mean of each row
print("Max value:", np.max(data))  # maximum element in the whole array
Output: (assuming the array above)
Total sum: 36
Column sums: [16 9 11]
Row means: [3.0, 5.67, 3.33] (approximately)
Max value: 8
In the example, np.sum(data) gave 36, which is the sum of all 9 numbers in the 3x3 array. Summing with axis=0 gave an array of 3 values (one per column: 5+7+4=16, 1+2+6=9, 3+8+0=11). The row means were computed across each subarray [5,1,3], [7,2,8], [4,6,0].
Vectorized operations vs Python loops: The benefit of using these array operations is not just convenience, but significant performance gains. For example, to compute the element-wise product of two large arrays, writing a Python for loop would be orders of magnitude slower than doing A * B with NumPy. As a demonstration of concept, consider multiplying each element of one list by the corresponding element of another: a pure Python approach might use zip or indexing in a loop, which incurs Python overhead for each operation. NumPy does this in C under the hood, so it can be hundreds of times faster for large arrays (numpy.org). In summary, whenever possible, use NumPy’s vectorized operations and functions instead of looping – this is the key to high performance in numerical Python code.
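If you want to see the difference on your own machine, a rough timing sketch like the following can help. The array size n is arbitrary, and the measured times depend on hardware and NumPy version, so treat the printed numbers as indicative only.

```python
import time
import numpy as np

n = 200_000
A = np.arange(n, dtype=np.float64)
B = np.arange(n, dtype=np.float64)[::-1]

# Pure Python: multiply pairs one at a time
list_a, list_b = A.tolist(), B.tolist()
start = time.perf_counter()
loop_result = [x * y for x, y in zip(list_a, list_b)]
loop_seconds = time.perf_counter() - start

# NumPy: a single vectorized multiplication
start = time.perf_counter()
vec_result = A * B
vec_seconds = time.perf_counter() - start

print(f"Python loop: {loop_seconds:.4f} s")
print(f"NumPy:       {vec_seconds:.4f} s")
```

Both approaches produce the same values; only the time taken differs.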
Broadcasting: A brief note on a powerful NumPy feature – broadcasting. This allows arithmetic between arrays of different shapes in certain cases. For example, if you have a 1D array and a 2D array, you can often perform operations by “broadcasting” the smaller array across the larger. A simple case: adding a constant to an array – e.g. M + 3 will add 3 to every element of M (the scalar 3 is broadcast to the shape of M). Or adding a 1D array of length 3 to a 2D array with 3 columns: NumPy will add the 1D array to each row of the 2D array. Broadcasting follows specific rules: shapes are compared from the last dimension backward, and each pair of dimensions must either be equal or have one of them equal to 1 (a missing leading dimension is treated as 1). This eliminates the need for manually replicating arrays to match shapes. As an example:
a = np.array([1,2,3])  # shape (3,)
B = np.array([[10],[20],[30]])  # shape (3,1)
print(a + B)
Here a has shape (3,) and B has shape (3,1). When we do a + B, NumPy “stretches” a across the three rows and B across the three columns conceptually, yielding a result shape (3,3):
[[11, 12, 13],
[21, 22, 23],
[31, 32, 33]]
This is a simple broadcasting example (adding a row vector to a column vector results in a full matrix). Broadcasting is very useful, but keep the rules in mind to avoid confusion. If an operation isn’t naturally broadcastable, NumPy will throw a shape mismatch error rather than implicitly doing something incorrect.
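The other cases mentioned above (a 1D array added to each row, a scalar added to every element, and an incompatible shape) can be sketched as follows. The arrays are arbitrary examples, and mismatch_error is only a holder for the caught exception.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)

shifted = M + row       # row is added to each row of M
plus_three = M + 3      # the scalar is added to every element
print(shifted)          # [[11 22 33]
                        #  [14 25 36]]

# Shapes (2, 3) and (2,) are not compatible: the last dimensions are 3 and 2
mismatch_error = None
try:
    M + np.array([1, 2])
except ValueError as err:
    mismatch_error = err
    print("Broadcast failed:", err)
```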
Linear algebra operations: NumPy provides a module np.linalg for linear algebra routines (like matrix multiplication, inverses, eigenvalues, etc.), but some basic linear algebra can be done with base NumPy operations. The dot product is an important example – which we will cover next in detail – and matrix multiplication falls under that. Transposing an array (swapping axes) can be done with array.T for 2D matrices or np.transpose. There’s also support for tensor dot products, solving linear systems, decompositions, etc., via np.linalg. But for many use cases, basic operations and the np.dot function or @ operator suffice.
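As a small taste of np.linalg, the sketch below transposes a matrix, solves a 2x2 linear system, and computes an inverse. The matrix A and vector b are arbitrary example values.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

A_t = A.T                      # transpose (swaps rows and columns)
x = np.linalg.solve(A, b)      # solves the linear system A x = b
A_inv = np.linalg.inv(A)       # matrix inverse

print("Solution x:", x)
print("Check A @ x:", A @ x)   # should reproduce b up to rounding
```

For solving systems, np.linalg.solve is generally preferred over computing the inverse and multiplying, since it is both faster and numerically more stable.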
Example – computing a statistic: Suppose you want to normalize an array of values to have mean 0 and standard deviation 1 (often called z-score normalization). With NumPy you can do: normed = (data - np.mean(data)) / np.std(data). This single line subtracts the mean of the array from every element and then divides every element by the standard deviation. If data is multi-dimensional and you want to normalize by, say, column, you can specify the axis in mean and std: normed = (data - np.mean(data, axis=0)) / np.std(data, axis=0). This will broadcast the 1D mean and std arrays (one value per column) across the rows for subtraction/division. In pure Python, you’d have to loop through each element to do this; NumPy does it all at C speed.
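Here is the column-wise normalization written out as a runnable sketch. The data values are made up, and np.std uses the population standard deviation (ddof=0) by default.

```python
import numpy as np

# Hypothetical data: 3 rows (samples), 2 columns (features)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

col_mean = np.mean(data, axis=0)   # one mean per column, shape (2,)
col_std = np.std(data, axis=0)     # one std per column, shape (2,)

# The (2,) arrays are broadcast across the 3 rows
normed = (data - col_mean) / col_std

print(normed.mean(axis=0))   # approximately [0. 0.]
print(normed.std(axis=0))    # approximately [1. 1.]
```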
In summary, NumPy enables a vectorized style of computing: think in terms of whole arrays or large chunks of data rather than element by element. Use element-wise operations and NumPy’s math functions to operate on data efficiently. This not only makes code cleaner but also leverages the highly optimized computations under the hood. As the NumPy documentation notes, functions like sin, cos, exp, etc., are ufuncs (universal functions) that apply to each element of the array (numpy.org, medium.com). The result is concise code that often closely mirrors the mathematical notation, and performance that approaches what you’d get in a lower-level language.
How to perform matrix multiplication in NumPy (dot product)?
When it comes to multiplying matrices or taking dot products of vectors, NumPy provides multiple ways to do it. The primary function is np.dot(), and since Python 3.5, the @ operator is also available for matrix multiplication (it calls the same underlying functionality as np.matmul). Understanding these is important for linear algebra operations.
Vector dot product: If you have two 1D arrays (vectors) of the same length, np.dot(u, v) will compute their inner product (also called dot product). For example:
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
print(np.dot(u, v))
# Output: 32 (since 1*4 + 2*5 + 3*6 = 32)
Matrix multiplication: If both arguments are 2D arrays (matrices), np.dot(A, B) will perform matrix multiplication (provided the inner dimensions match), i.e., each entry of the result is the dot product of a row of A with a column of B (numpy.org). For instance, if A is shape (2,3) and B is (3,2), np.dot(A, B) yields a (2,2) matrix result. The @ operator does the same thing: A @ B is equivalent to np.dot(A, B) for 2D arrays. NumPy actually recommends using @ (or the function np.matmul) for clarity when doing matrix multiply, but np.dot is fine too.
Scalar and higher-dimension behavior: If one argument is a scalar (0-dimensional), np.dot just multiplies every element of the other array by that scalar (though in practice you can just use *). If a is an N-D array and b is a 1D array, np.dot(a, b) will sum the products over the last axis of a and the vector b. In general, for N-D arrays, np.dot does a sum-product over the last axis of the first array and the second-to-last axis of the second array (numpy.org). This is a bit complex, but essentially allows things like multiplying a stack of matrices by a vector, etc. Note that np.matmul and @ treat N-D inputs differently (they broadcast over stacks of matrices), so the equivalence with np.dot holds only for 1D and 2D arrays. For most users, sticking to 1D or 2D with dot is common.
Example (matrix multiplication):
import numpy as np
A = np.array([[1, 0],
              [0, 1]])  # 2x2 identity matrix
B = np.array([[4, 1],
              [2, 2]])  # another 2x2 matrix
print("A*B with dot:\n", np.dot(A, B))
# You can also use the @ operator:
print("A*B with @:\n", A @ B)
Output:
A*B with dot:
 [[4 1]
 [2 2]]
A*B with @:
 [[4 1]
 [2 2]]
In this example, A is the identity matrix, so A*B = B. Both np.dot and the @ operator yield the same result.
Matrix multiplication is not commutative, so the order matters: np.dot(A, B) is generally not the same as np.dot(B, A) (unless the two matrices happen to commute, as the identity matrix does with any square matrix of the same size).
If you have a vector x and you want to do a matrix-vector product A*x, you can use np.dot(A, x), which will yield a 1D array result of length equal to the number of rows of A. Conversely, np.dot(x, A) (where x is 1D of appropriate length) would pre-multiply A by x as a row vector and yield a 1D array of length equal to the number of columns of A.
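As a small illustration of the shape rules and of the order-dependence described above, here is a sketch with made-up arrays (the names y, P, and C are placeholders for this example only):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
x = np.array([1, 0, -1])       # length 3: matches the columns of A
y = np.array([1, 2])           # length 2: matches the rows of A

Ax = A @ x   # (2, 3) @ (3,) -> shape (2,)
yA = y @ A   # (2,) @ (2, 3) -> shape (3,)
print(Ax)    # [-2 -2]
print(yA)    # [ 9 12 15]

# Order matters for matrix products: P @ C swaps the rows of C,
# while C @ P swaps its columns.
P = np.array([[0, 1],
              [1, 0]])
C = np.array([[1, 2],
              [3, 4]])
print(P @ C)  # [[3 4], [1 2]]
print(C @ P)  # [[2 1], [4 3]]
```

If the inner dimensions do not match (for example A @ y here), NumPy raises a ValueError rather than guessing.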
Outer product: While dot gives the inner product, if you need an outer product (which results in a matrix given two vectors), you can use np.outer(u, v). For example, np.outer([1,2,3], [4,5]) produces a 3x2 matrix:
[[ 4,  5],
 [ 8, 10],
 [12, 15]]
This is essentially each element of the first vector multiplied by each element of the second. It differs from dot: np.dot on two 1D arrays requires them to have the same length and returns a scalar, so it would raise an error for these two vectors of lengths 3 and 2.
Transpose: When doing matrix math, sometimes you need to transpose a matrix. You can do this with A.T. For example, if A is 2x3, then A.T is 3x2. Transposing might be needed before a dot product if dimensions don’t align in the way you need.
Example (dot product usage in practice): The same operation covers many everyday tasks: multiplying two matrices that represent datasets or transformations, applying a vector of coefficients to a matrix, projecting one vector onto another, or transforming point coordinates from one basis to another with a matrix multiplication. NumPy makes each of these straightforward.
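Two of the uses mentioned above, projecting one vector onto another and transforming point coordinates with a matrix, might look like the sketch below. The vectors, the 90-degree rotation matrix R, and the points array are illustrative choices, not taken from the text.

```python
import numpy as np

# Projection of u onto v: (u.v / v.v) * v
u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
proj = (np.dot(u, v) / np.dot(v, v)) * v
print(proj)  # [3. 0.]

# Coordinate transform: rotate points by 90 degrees counter-clockwise.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
points = np.array([[1.0, 0.0],
                   [0.0, 2.0]])   # one point per row

# With points stored as rows, multiply by R.T so that each row p becomes R @ p.
rotated = points @ R.T
print(rotated)  # rows are (0, 1) and (-2, 0)
```

Storing points as rows and multiplying by the transpose is a common layout choice; storing them as columns and computing R @ points works equally well.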
For completeness, NumPy’s np.linalg module offers more advanced operations. For example, solving linear systems: given a matrix $A$ and vector $b$, you can find $x$ that solves $A x = b$ with np.linalg.solve(A, b). This is generally faster and more numerically stable than computing the inverse of $A$ explicitly and multiplying. As a quick demonstration:
# Solve 5*x0 + 1*x1 = 12 and 1*x0 + 3*x1 = 10
A = np.array([[5, 1],
[1, 3]])
b = np.array([12, 10])
x = np.linalg.solve(A, b)
print("Solution x:", x)
This would output the solution (approximately [1.857, 2.714] in this case) (medium.com). Under the hood, it uses efficient linear algebra routines (like LAPACK). While this goes beyond just using np.dot, it highlights that NumPy can handle many linear algebra tasks.
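A quick way to sanity-check a solution is to substitute it back into the system. The sketch below repeats the same 2x2 system and also computes the inverse-based answer purely for comparison; the variable names x_inv and residual are introduced here for illustration.

```python
import numpy as np

A = np.array([[5.0, 1.0],
              [1.0, 3.0]])
b = np.array([12.0, 10.0])

x = np.linalg.solve(A, b)          # preferred: solves A x = b directly
x_inv = np.linalg.inv(A) @ b       # works, but forms the inverse explicitly

residual = A @ x - b               # should be (numerically) zero
print("x:", x)                     # approximately [1.857, 2.714]
print("residual:", residual)
print("solve and inv agree:", np.allclose(x, x_inv))
```

For a small, well-conditioned system like this the two approaches agree; the difference shows up with large or ill-conditioned matrices, where np.linalg.solve is the safer habit.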
Bitwise operations (e.g., XOR): On a different note, NumPy also supports element-wise bitwise operations on integer or boolean arrays. For instance, np.bitwise_xor(array1, array2) computes the bitwise XOR. If you have boolean arrays, XOR is like logical XOR (True where exactly one of the inputs is True). For example, np.bitwise_xor(np.array([True, False, True]), np.array([False, False, True])) results in [ True, False, False] (since True XOR False -> True, False XOR False -> False, True XOR True -> False). NumPy uses the &, |, ~ operators for bitwise AND, OR, NOT on booleans as well (just remember to use parentheses appropriately due to operator precedence). There is no dedicated function for XNOR (logical equivalence), but you can get XNOR by negating the result of XOR, for instance ~(a ^ b) if a and b are boolean arrays (or using np.logical_not(np.logical_xor(a, b))). For integer arrays, ^ is the bitwise XOR operator as well. So, if someone asks for “numpy xor two arrays,” the answer is to use np.bitwise_xor or the ^ operator between two boolean/int arrays. (E.g., np.bitwise_xor([28, 5], [14, 3]) would compute XOR elementwise. For 28 (11100 in binary) and 14 (01110 in binary), the XOR is 18 (10010 in binary); source: w3resource.com.)
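The XOR and XNOR recipes above, plus the parenthesization point for combining conditions, can be checked with a few lines. The vals array and the mask are made-up examples.

```python
import numpy as np

a = np.array([True, False, True])
b = np.array([False, False, True])

xor = a ^ b            # same as np.bitwise_xor(a, b) for boolean arrays
xnor = ~(a ^ b)        # True where both inputs agree
print(xor)             # [ True False False]
print(xnor)            # [False  True  True]

# Integer arrays: XOR is applied bit by bit.
ints = np.bitwise_xor(np.array([28, 5]), np.array([14, 3]))
print(ints)            # [18  6]

# Comparisons must be parenthesized, because & binds tighter than > and <.
vals = np.array([1, 3, 5, 4])
mask = (vals > 2) & (vals < 5)
print(vals[mask])      # [3 4]
```

For boolean arrays, the comparison a == b gives the same result as the XNOR expression.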
Cross-correlation: A term like “numpy xcorr” usually refers to cross-correlation of sequences. NumPy doesn’t have a direct xcorr function, but you can use np.correlate for 1D cross-correlation of sequences (or use convolution with one sequence reversed). For more complex correlation (like 2D correlations in images or bigger datasets), one might use SciPy. But it’s worth noting that basic signal processing can be done with NumPy’s FFT as well.
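Here is a minimal sketch of 1D cross-correlation with np.correlate, including the equivalence with convolution against a reversed sequence (valid for real-valued inputs) and a toy lag estimate. The signals x and y and the lags bookkeeping are assumptions made for this example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 0.5])

# mode="full" returns every overlap: length len(a) + len(v) - 1.
full = np.correlate(a, v, mode="full")
via_conv = np.convolve(a, v[::-1], mode="full")   # same values for real inputs
print(full)   # [0.5 2.  3.5 3.  0. ]

# Toy lag estimation: y is x delayed by 2 samples.
x = np.array([0, 0, 1, 2, 1, 0, 0, 0], dtype=float)
y = np.roll(x, 2)                       # trailing zeros wrap to the front
corr = np.correlate(y, x, mode="full")
lags = np.arange(-(len(x) - 1), len(y)) # lag value for each output index
best_lag = lags[np.argmax(corr)]
print("estimated delay:", best_lag)     # 2
```

The default mode of np.correlate is "valid", which returns only the fully overlapping part, so mode="full" has to be requested explicitly when you want all lags.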
In summary, NumPy provides robust support for linear algebra operations. The dot product and matrix multiplication capabilities allow you to handle matrix equations and vector calculations easily. When performing these operations, always ensure your dimensions line up (NumPy will raise an error if they don’t). Use np.dot or the @ operator for clarity when doing matrix multiplication. And beyond just pure math, know that NumPy can also handle logical operations on arrays (like comparisons, and bitwise logic on booleans) which are useful for constructing masks or combining conditions as we saw earlier.
Random number generation with NumPy
NumPy includes a sub-module for random number generation, found in np.random. This module allows you to generate random numbers drawn from various distributions, which is extremely useful for simulations, randomized algorithms, bootstrapping in statistics, or just creating random test datasets.
Basic random functions:
np.random.rand(d0, d1, ..., dn): generates an array of the given shape filled with random samples from a uniform distribution over [0, 1). For example, np.random.rand(3,4) gives a 3x4 array of random floats in [0,1).
np.random.randn(d0, d1, ..., dn): generates samples from the standard normal distribution (mean 0, std 1). E.g. np.random.randn(100) gives 100 Gaussian-distributed values.
np.random.randint(low, high=None, size=None): generates random integers. If high is provided, it generates integers in [low, high); if high is None, it generates integers in [0, low). For example, np.random.randint(0, 10, size=5) might produce something like [4, 8, 0, 3, 9] (5 random ints between 0 and 9 inclusive) (medium.com). If you give a tuple for size, it produces an array of that shape (e.g., np.random.randint(1, 7, size=(2,3)) could simulate six throws of a 6-sided die arranged in a 2x3 array).
Seeding the generator: By default, NumPy’s pseudo-random number generator (PRNG) is seeded from operating-system entropy, so each run produces different values. If you want reproducible results, you should set the seed via np.random.seed(some_integer). For example, calling np.random.seed(42) and then np.random.rand(3) will produce the same “random” array on every run. This is important for debugging or if you need deterministic behavior for tests.
Other distributions: NumPy’s random module can generate samples from many probability distributions:
Normal (Gaussian): np.random.normal(loc=mean, scale=std, size=shape) for a normal distribution. For example, np.random.normal(10, 4, size=(3,4)) gives a 3x4 array of samples from a Normal(μ=10, σ=4) (medium.com).
Binomial: np.random.binomial(n, p, size) for binomial distribution (number of successes in n trials).
Poisson: np.random.poisson(lam, size) for Poisson.
Uniform (general): np.random.uniform(low, high, size) for uniform in [low, high).
Choice: np.random.choice(sequence, size, replace) to randomly pick elements from a given sequence (like picking random items from a list).
…and many more (Beta, Gamma, etc.). The module is quite comprehensive.
Example usage:
import numpy as np
np.random.seed(0)  # for reproducibility
# Generate some random numbers
print("Random float between 0-1:", np.random.rand())
print("Array of 5 random integers 0-9:", np.random.randint(0, 10, size=5))
# 2x3 matrix of samples from N(0,1)
print("2x3 matrix of N(0,1):\n", np.random.randn(2,3))
# Example: simulate 10 coin flips (Bernoulli with p=0.5)
flips = np.random.choice([0,1], size=10, p=[0.5, 0.5])
print("Coin flips (0=heads,1=tails):", flips)
Sample output (illustrative; the exact numbers depend on the order of the calls made after seeding):
Random float between 0-1: 0.5488135039273248
Array of 5 random integers 0-9: [5 0 3 3 7]
2x3 matrix of N(0,1):
[[ 1.76405235 0.40015721 0.97873798]
[ 2.2408932 1.86755799 -0.97727788]]
Coin flips (0=heads,1=tails): [0 1 1 1 1 1 1 1 1 0]
Because we seeded with 0, the script prints the same values on every run. Without setting a seed, you’d get different values each time (which is usually what you want outside of tests and debugging).
Use cases: Random number generation is used in simulations (e.g., Monte Carlo methods), in machine learning (e.g., initializing weights in neural networks; Xavier initialization, for instance, draws from a scaled uniform or normal distribution), for creating random splits of data, etc. If, say, you want to initialize a weight matrix with Xavier/Glorot initialization, you might do:
fan_in, fan_out = 64, 32  # e.g., 64 inputs, 32 outputs
limit = np.sqrt(6.0 / (fan_in + fan_out))
W = np.random.uniform(-limit, limit, size=(fan_in, fan_out))
This draws from a uniform distribution in the range ±√(6/(fan_in+fan_out)), which is one form of Xavier initialization for a layer. Alternatively, a normal distribution variant would use np.random.randn(fan_in, fan_out) * np.sqrt(2.0/(fan_in+fan_out)). This shows how NumPy can be used to implement such initialization strategies with just a couple of lines.
Note on new random API: Since version 1.17, NumPy has offered a newer random number generation API (using numpy.random.default_rng() to create a Generator object). The older API (using functions directly under np.random) as used above is still available for backward compatibility, although the NumPy documentation recommends the Generator interface for new code. If you see code using np.random.Generator, that’s the newer approach (it has some advantages like being able to have independent RNG streams).
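For comparison with the legacy calls used above, the same kinds of draws with the Generator API might look like this. The seed value 42 and the variable names are arbitrary; no particular output values are claimed, since the Generator stream differs from the legacy one.

```python
import numpy as np

rng = np.random.default_rng(42)             # seeded Generator object

u = rng.random(3)                           # uniform floats in [0, 1)
ints = rng.integers(0, 10, size=5)          # integers in [0, 10), like randint
z = rng.standard_normal((2, 3))             # N(0, 1) samples, like randn
n = rng.normal(loc=10, scale=4, size=4)     # Normal(mu=10, sigma=4)
pick = rng.choice(["a", "b", "c"], size=2, replace=False)

# Re-creating the generator with the same seed reproduces the stream.
rng_again = np.random.default_rng(42)
print(np.array_equal(u, rng_again.random(3)))   # True

# Independent streams (e.g., one per worker) via SeedSequence.spawn.
child_seeds = np.random.SeedSequence(42).spawn(2)
streams = [np.random.default_rng(s) for s in child_seeds]
first_draws = [s.random(3) for s in streams]
```

Passing a Generator object around explicitly, instead of relying on global state set by np.random.seed, also makes it easier to see which code depends on which random stream.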
To sum up, NumPy’s random module is your go-to for random sampling in Python. It’s efficient (e.g., generating a million random numbers is very fast) and covers a wide range of distributions. Whether you need to simulate dice rolls, random permutations, or draw from a chi-square distribution, NumPy has you covered. Always consider seeding if reproducibility is needed, and use the distribution functions to model the randomness according to your problem’s needs.
How to save and load data with NumPy
Working with data often requires saving results to disk or loading existing data. NumPy provides simple ways to handle I/O for arrays, both in text format (like CSV) and in binary format (for efficiency and fidelity).
Saving to and loading from text files (CSV):
NumPy offers np.savetxt for saving an array to a text file, and np.loadtxt (and its more robust cousin np.genfromtxt) for loading. For example, suppose arr is a 2D array:
np.savetxt("data.csv", arr, delimiter=",")
This will save the array to a file named data.csv with commas separating values on each line (each row of the array becomes a line in the file). You can specify a format for the numbers if needed (e.g., fmt="%.3f" for three decimal places). np.savetxt is convenient for simple uses, but keep in mind it’s for numeric data and will lose some information like the dtype (it just writes numbers). Also, if you have a 1D array, savetxt will save it as a column by default (one value per line); you might need to reshape it or set the newline parameter if you want a single row.
To load such a file back:
loaded_arr = np.loadtxt("data.csv", delimiter=",")
This will read the CSV (assuming it’s purely numeric and well-formed) into a NumPy array. If the data has missing values or mixed types, np.genfromtxt is more flexible (it can handle missing values, skipping comments, etc., but it’s a bit slower).
For example, if you have a CSV file with a header line, you might call np.loadtxt("data.csv", delimiter=",", skiprows=1) to skip the first row. You can also use the dtype parameter to force a certain data type.
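The loading options above can be tried without touching the disk by reading from in-memory text. In this sketch the CSV contents are invented, and io.StringIO stands in for a real file path.

```python
import io
import numpy as np

clean_text = "x,y,z\n1.0,2.0,3.0\n4.0,5.0,6.0\n"
gappy_text = "x,y,z\n1.0,2.0,3.0\n4.0,,6.0\n"     # one missing value
int_text = "1,2\n3,4\n"

# Header skipped with skiprows; result is a float64 array of shape (2, 3).
clean = np.loadtxt(io.StringIO(clean_text), delimiter=",", skiprows=1)

# genfromtxt uses skip_header (not skiprows) and fills the gap with nan.
gappy = np.genfromtxt(io.StringIO(gappy_text), delimiter=",", skip_header=1)

# dtype forces the element type instead of the default float.
ints = np.loadtxt(io.StringIO(int_text), delimiter=",", dtype=int)

print(clean)
print(gappy)        # the missing entry shows up as nan
print(ints.dtype)
```

Because nan is a float, a column with missing values cannot be loaded as an integer type this way; np.nanmean and related functions are useful once the gaps are in the array.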
Saving and loading in NumPy’s binary format:
If you want to save and load NumPy arrays with full fidelity and more efficiently, you can use NumPy’s native binary format:
np.save("filename.npy", arr) will save a single array to a file in NPY format.
np.load("filename.npy") will load it back. This format preserves the dtype and shape, and is very fast to read/write (much faster than text). However, it’s not human-readable.
np.savez("filename.npz", arr1=arr1, arr2=arr2, ...) can save multiple arrays in a single uncompressed archive (with .npz extension); np.savez_compressed takes the same arguments and compresses the archive. You can then load the file with np.load("filename.npz"), which gives an archive (dictionary-like) from which you can retrieve each array by name.
For instance:
np.save("my_array.npy", arr)
# ... later or in another program:
arr_copy = np.load("my_array.npy")
Now arr_copy will be equal to the original arr. Using .npy/.npz is especially recommended if you have large arrays or non-text data types (complex numbers, etc.), or if you want to avoid any potential precision loss.
Working with other file formats (Excel, etc.): NumPy itself does not natively handle Excel (.xlsx) files or more structured data like JSON. If you see something like "numpy xlsx", it likely means the user wants to export a NumPy array to an Excel file. The simplest approach is to save as CSV and then open that in Excel (Excel can open CSV files easily). For more direct Excel writing, one would typically use pandas (whose to_excel method writes Excel files via OpenPyXL or XlsxWriter under the hood). So, while you can use NumPy for the numeric crunching, to output an .xlsx you’d do something like:
import pandas as pd
df = pd.DataFrame(arr)  # convert array to DataFrame
df.to_excel("output.xlsx", index=False)
This would create an Excel file. But again, that’s using Pandas. So to be clear: NumPy doesn’t directly create .xlsx files, but it can create CSV which Excel can read. If you specifically need Excel format, consider using Pandas or other libraries.
Saving multiple arrays or complex data: If you need to save several arrays together, possibly along with some metadata, np.savez or np.savez_compressed is useful. For example, np.savez("model_weights.npz", W=weight_array, b=bias_array) saves two arrays in one file. Then data = np.load("model_weights.npz") lets you access data['W'] and data['b']. The compressed variant np.savez_compressed compresses the data (useful if arrays have many repeating values or zeros, or to reduce storage at some CPU cost).
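A round trip through an .npz archive might look like the following sketch. The arrays W and b are placeholders, and a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile
import numpy as np

W = np.arange(6, dtype=np.float32).reshape(2, 3)   # stand-in weight matrix
b = np.zeros(3, dtype=np.int64)                    # stand-in bias vector

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_weights.npz")
    np.savez_compressed(path, W=W, b=b)            # keyword names become keys

    # np.load on an .npz returns a lazy, dictionary-like archive;
    # using it as a context manager closes the file afterwards.
    with np.load(path) as data:
        names = sorted(data.files)
        W_loaded = data["W"]
        b_loaded = data["b"]

print(names)                                 # ['W', 'b']
print(W_loaded.dtype, W_loaded.shape)        # float32 (2, 3)
```

Both dtype and shape survive the round trip, which is the main advantage over text formats.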
Other data sources: If you have data in other formats like images or sound files, NumPy doesn’t directly load those (you’d use image libraries or sound libraries which typically give you NumPy arrays to work with). For example, the Pillow library can load an image into a NumPy array.
Example (CSV round-trip):
arr = np.array([[1.5, 2.3, 3.1],
                [4.0, 5.2, 6.8]])
np.savetxt("example.csv", arr, delimiter=",", fmt="%.1f")
# The file "example.csv" now contains:
# 1.5,2.3,3.1
# 4.0,5.2,6.8
loaded = np.loadtxt("example.csv", delimiter=",")
print("Loaded back:\n", loaded)
This would output:
Loaded back:
 [[1.5 2.3 3.1]
 [4.  5.2 6.8]]
The loaded array matches the original (note that 4.0 may display as 4., but it’s the same value).
If you need to append to a file or write in chunks, note that np.savetxt writes the whole array in one call. To write incrementally, you can open the file yourself and call np.savetxt on small chunks, passing the open file handle, or use Python file I/O with ','.join(map(str, row)) for each row. Similarly, for extremely large data that doesn’t fit in memory, you might need more specialized solutions (like memory-mapped arrays with np.memmap, or Pandas or Dask).
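Writing in chunks with np.savetxt can be sketched as below: the file is opened once in append mode and each chunk is written through the same handle. The chunks list is synthetic, standing in for data produced piece by piece.

```python
import os
import tempfile
import numpy as np

# Three hypothetical 2x3 chunks, filled with 0.0, 1.0 and 2.0 respectively.
chunks = [np.full((2, 3), float(i)) for i in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunks.csv")

    # np.savetxt accepts an open file handle as well as a filename.
    with open(path, "a") as fh:
        for chunk in chunks:
            np.savetxt(fh, chunk, delimiter=",", fmt="%.1f")

    loaded = np.loadtxt(path, delimiter=",")

print(loaded.shape)   # (6, 3): the three chunks stacked row-wise
```

Only one chunk has to be in memory at a time while writing, although np.loadtxt in the last step still reads the whole file at once.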
In summary, for saving data, use:
np.savetxt for human-readable text (CSV etc.), or when you need to interface with other tools quickly.
np.save/np.load for efficient storage and retrieval of raw arrays (especially in applications or between runs of your program).
Pandas or other libraries for formats beyond NumPy’s scope (Excel, JSON, etc.).
NumPy makes it straightforward to persist your computed results or to load initial data for processing. Just be mindful of format: text is universal but larger and slower, binary is fast but NumPy-specific. Choose what fits your needs.
NumPy vs other Python data libraries (and structures)
NumPy is great, but how does it compare to other tools you might use for similar tasks? Here we’ll compare NumPy with a few common alternatives: Python’s built-in lists, Pandas, and Xarray. Each has its own use-cases, and often they complement rather than replace each other.
NumPy array vs Python list
Similarities: Both can hold sequences of numbers and be indexed, iterated, and sliced. Many Pythonistas start with lists for simple tasks.
Differences: A NumPy ndarray is a contiguous block of memory storing elements of the same type, whereas a Python list is an array of pointers to objects (which can be of different types). Consequently:
Performance: NumPy arrays are much faster for numerical computations, especially on large arrays, because the operations are implemented in C and operate on the entire block of data. Python lists require looping in Python for elementwise operations, which is slow. For example, adding two large lists element by element in pure Python is significantly slower than adding two NumPy arrays of the same size (numpy.org). NumPy’s vectorization can yield orders-of-magnitude speedups over pure Python loops.
Memory: NumPy arrays are more memory-efficient for large data sets because they store raw data elements directly (typically in a C array) without the overhead of Python object headers for each element (numpy.org). A list of 1,000,000 floats consumes much more memory than a NumPy array of 1,000,000 floats.
Functionality: NumPy provides a huge array of mathematical functions (as we’ve seen: broadcasting, ufuncs, linear algebra, etc.) that operate on arrays. Python lists have to rely on loops, list comprehensions, or built-ins such as sum, which are not optimized for large numeric data.
Fixed size: A NumPy array has a fixed size once created (though you can make a new one if you need to “resize”). A Python list can dynamically grow or shrink (append, extend, etc.). If your use-case involves lots of incremental growth, accumulating the data in a Python list and converting to NumPy once at the end is often a good pattern; if you know the size upfront, preallocate the array instead.
In summary, for numerical data where performance matters, use NumPy arrays. Python lists are fine for small sequences or when you need heterogeneous types or dynamic resizing, but they can’t match NumPy for heavy number crunching. In fact, many NumPy operations will outperform equivalent pure Python by large factors (for example, summing 10 million numbers with NumPy will be vastly faster than with a Python loop). As one source points out: NumPy is designed for homogeneous numerical data and leverages low-level optimizations, whereas lists are generic containers (numpy.org).
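The memory and speed claims above are easy to check on your own machine. The sketch below is a rough measurement, not a benchmark: the array size n is arbitrary, sys.getsizeof counts only CPython's per-object sizes, and timings will vary by hardware, so no specific numbers are claimed.

```python
import sys
import timeit
import numpy as np

n = 200_000
py_list = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

# Memory: the list stores pointers plus one float object per element;
# the array stores 8 raw bytes per element.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
arr_bytes = arr.nbytes
print(f"list: ~{list_bytes / 1e6:.1f} MB, array: {arr_bytes / 1e6:.1f} MB")

# Speed: element-wise addition in a Python-level loop vs. one vectorized call.
t_list = timeit.timeit(lambda: [a + b for a, b in zip(py_list, py_list)], number=3)
t_arr = timeit.timeit(lambda: arr + arr, number=3)
print(f"list add: {t_list:.4f} s, array add: {t_arr:.4f} s")
```

On typical setups the array version is expected to use several times less memory and to run much faster, but the exact ratios depend on the Python and NumPy builds.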
NumPy vs Pandas
Pandas is a library built on top of NumPy. It introduces two main data structures: Series (1D, like a labeled array) and DataFrame (2D tabular data with labeled rows and columns). Many people ask when to use Pandas vs NumPy. Here’s a comparison:
Use-case: Pandas is excellent for table-like data, where you have columns of different types (e.g., a spreadsheet or SQL table) and you want to index by labels or perform group-by operations, handle missing values, etc. NumPy, on the other hand, is ideal for homogeneous numerical arrays, especially multi-dimensional data (tensors, images, etc.) where you want to do matrix operations, linear algebra, or element-wise math.
Data types: NumPy arrays are homogeneous (all elements same type). Pandas DataFrames can have each column as a different dtype (one column ints, another floats, another strings, etc.). Under the hood, each column in Pandas is actually a NumPy array (or ExtensionArray), so it leverages NumPy’s speed for each column.
Memory and Performance: NumPy is generally more memory-efficient and faster for pure number-crunching on large arrays because there’s less overhead. A Pandas DataFrame adds overhead (for indexing, alignment, etc.), so operations on DataFrames might be a bit slower, especially for smaller datasets. For instance, if you have 50,000 rows or fewer of purely numeric data, NumPy may outperform Pandas for computations (favtutor.com). However, Pandas shines when you need to do operations like filtering by column names, merging/joining tables, or dealing with missing data, which would be cumbersome with just NumPy.
Convenience: Pandas provides a lot of high-level functionality: easy CSV/Excel reading, “group by” splits, rolling/window calculations, time series handling, categorical data, etc. If your task is more about data manipulation and less about raw numeric performance, Pandas often gets you there faster in terms of coding time. NumPy is more low-level; for example, to remove rows with NaNs in NumPy you’d have to do boolean indexing manually, whereas Pandas has dropna() methods.
Indexing: Pandas allows label-based indexing (e.g., by date or by row key) and has a concept of alignment (if you add two Series with different indices, it will align by index labels). NumPy only has positional indexing (by integer location). This makes Pandas more powerful for datasets where each row has an identifier or timestamp.
When they work together: Often you might use Pandas to load data (from CSV, etc.), do some high-level filtering or grouping, and then extract the underlying NumPy arrays for heavy numerical computations (especially if using libraries that expect NumPy arrays). Pandas DataFrame.values or to_numpy() gives the NumPy representation. Conversely, if you have results in NumPy and want to add labels or output as CSV with headers, you might wrap it in a Pandas DataFrame.
Example difference: Suppose you have a dataset of 1 million rows, with columns: “age”, “salary”, “department”. If you want to compute the average salary by department, Pandas can do this in one line with groupby: df.groupby('department')['salary'].mean(). Doing that with NumPy alone would require sorting or grouping indices manually. On the other hand, if you want to perform a heavy linear algebra computation (e.g., invert a large matrix or multiply two large matrices), you’d convert relevant data to NumPy and use np.linalg or similar; Pandas would not be involved in that.
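To make the "grouping indices manually" part concrete, here is one way the per-department mean could be computed with NumPy alone. The tiny department and salary arrays are invented, and np.unique with np.bincount is just one of several possible approaches.

```python
import numpy as np

department = np.array(["eng", "ops", "eng", "hr", "ops", "eng"])
salary = np.array([100.0, 60.0, 120.0, 70.0, 80.0, 110.0])

# labels: sorted unique departments; inverse: group index of each row.
labels, inverse = np.unique(department, return_inverse=True)

sums = np.bincount(inverse, weights=salary)   # total salary per group
counts = np.bincount(inverse)                 # rows per group
means = sums / counts

for name, mean in zip(labels, means):
    print(name, mean)
# eng 110.0
# hr 70.0
# ops 70.0
```

It works, but the Pandas one-liner states the intent far more directly, which is the point of the comparison.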
Performance-wise, a direct numerical operation (like adding two arrays) is typically a bit slower in Pandas because of the additional overhead (for example, indexing checks); one reference shows NumPy being faster than Pandas by some factor for element-wise operations (favtutor.com). But when data size grows very large (say millions of rows), the overhead might become relatively small compared to the actual computation or memory access costs; also Pandas might use optimized paths for certain operations internally. In general, NumPy is faster for raw array computations, while Pandas is fast enough for most high-level data analysis tasks and far more convenient for labeled data.
To conclude: Use Pandas for structured, labeled data (especially if mixed types or you need to easily manipulate by column names). Use NumPy for heavy numerical tasks, especially on homogeneous data or when you need multi-dimensional arrays. They are often used together – Pandas relies on NumPy, and you can interchange data between them easily. (As a note, many Pandas operations still utilize NumPy under the hood, so you’re benefiting from NumPy’s performance even when you use Pandas.)
NumPy vs Xarray
Xarray is a library that extends NumPy (and Pandas) to N-dimensional labeled arrays. It’s particularly popular in the physical sciences (e.g., climate data, N-dimensional data with coordinates like lat, lon, time). An Xarray DataArray is like a NumPy ndarray but with labeled dimensions and coordinates; an Xarray Dataset is like a dictionary of DataArrays that share some dimensions (similar to a Pandas DataFrame for multi-dim data) (docs.xarray.dev).
Key points in comparison:
Labels and coordinates: NumPy arrays are indexed by integers. Xarray lets you index by coordinate labels. For example, if one axis is “time” with specific timestamp values, you can select data by time label (e.g., arr.sel(time='2025-01-01')) rather than by integer index. This is akin to Pandas for multi-dimensional data. Xarray attaches meaning to array axes (names them) and values along those axes (coordinates) (docs.xarray.dev). This makes code more readable and reduces errors when dealing with multiple dimensions: you don’t have to remember that axis 0 is latitude and axis 1 is longitude, you can name them.
Multi-dimensional focused: Xarray is designed for N-dimensional data (N>2) where you have several dimensions and maybe want to do operations across one dimension easily. NumPy can of course handle N-d arrays, but Xarray adds convenient methods. For example, Xarray’s DataArray.sum(dim='time') will sum over the time dimension by name (docs.xarray.dev), whereas with NumPy you’d need to know which axis corresponds to time and do np.sum(arr, axis=some_index).
Interoperability: Xarray builds on NumPy and Pandas. Under the hood, the data in an Xarray DataArray is stored as a NumPy array (or sometimes a dask array for out-of-core computation). So pure computational speed for arithmetic is similar to NumPy (with a small overhead for label handling). If you iterate or do a lot of small operations, the overhead can add up, but for heavy operations it’s usually worth the convenience.
When to use: If you have data with inherent labeled dimensions (e.g., an array indexed by [time, latitude, longitude]), Xarray can make life easier. It also seamlessly handles operations that require alignment (like Pandas does for 1D). For example, if two DataArrays have the same dimensions but different coordinate values, operations will align on coordinates (which can be a pro or con, but it prevents certain mistakes). Xarray also integrates with netCDF file format (common in scientific data), allowing you to read/write netCDF with ease – Pandas and NumPy alone don’t have that concept of dimension labels.
Pros and Cons relative to NumPy: Xarray’s main pro is more intuitive, less error-prone code for multi-dim data with labels (docs.xarray.dev). Its con could be slight overhead and a learning curve if you’re not used to labeled data. If performance is critical and you only need numeric arrays, NumPy might be marginally faster due to less abstraction. But Xarray can leverage Dask to handle arrays larger than memory by chunking, which plain in-memory NumPy arrays cannot do (np.memmap aside, NumPy expects the whole array in memory).
Example: Imagine you have a 3D NumPy array arr with shape (12, 50, 100) where dimensions are [month, latitude, longitude]. If you want the mean over all longitudes and latitudes for each month, in NumPy you’d do arr.mean(axis=(1,2)) (assuming axis0=month, axis1=lat, axis2=lon). In Xarray, if data is an Xarray DataArray with dims ('month','lat','lon'), you could do data.mean(dim=['lat','lon']). The latter is self-explanatory in code and you can’t accidentally swap axes, because you're using names.
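To see what the named version saves you, here is a plain-NumPy sketch of the same reduction together with the kind of axis bookkeeping you would otherwise maintain by hand. The dims tuple and the mean_over helper are hypothetical, written only for this illustration; they are not part of NumPy or Xarray.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((12, 50, 100))        # synthetic [month, lat, lon] data

# Positional version: you must remember which axis is which.
monthly_mean = arr.mean(axis=(1, 2))   # shape (12,)

# Hand-rolled "named axes": a tuple of names kept alongside the array.
dims = ("month", "lat", "lon")

def mean_over(a, dims, names):
    """Average over the axes whose names are listed (hypothetical helper)."""
    axes = tuple(dims.index(name) for name in names)
    return a.mean(axis=axes)

monthly_mean_named = mean_over(arr, dims, ["lat", "lon"])
print(monthly_mean_named.shape)        # (12,)
```

Xarray effectively carries the dims information inside the array object and keeps it consistent through slicing and arithmetic, which a side tuple like this cannot do.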
So, NumPy vs Xarray: If your data is simple or you’re already managing labels in parallel arrays or know exactly what each axis is, NumPy is sufficient. But if you start writing a lot of code to handle metadata (axis meanings, etc.), Xarray might save time and reduce errors by introducing labels and alignment. Performance for heavy computations will be similar, since Xarray relies on NumPy (or Dask) for the actual math. In fact, one StackOverflow discussion notes that xarray can sometimes be faster for certain tasks because it vectorizes some operations or uses broadcasting based on dimension names automatically, but generally the raw computation speed is comparable to NumPy since it’s using NumPy under the hood for math (docs.xarray.dev).
Other considerations
SciPy: SciPy is another library that builds on NumPy, offering a wide array of scientific computations (optimizations, signal processing, statistics, etc.). It’s not an alternative to NumPy but rather an extension. SciPy uses NumPy arrays as input/output for its functions. If you need advanced algorithms (Fourier transforms beyond basic FFT, optimization routines, sparse matrices, etc.), you’d use SciPy in conjunction with NumPy.
JAX, CuPy, etc.: These are newer libraries that provide NumPy-like interfaces but target different backends (JAX can do GPU and autodiff, CuPy does GPU NumPy, etc.). They mimic NumPy’s API. From a user perspective, they’re like “drop-in” alternatives if you need to leverage hardware acceleration or automatic differentiation, but under the hood they are quite different. Classic NumPy itself is CPU-only and doesn’t do autograd.
Tensor libraries (TensorFlow, PyTorch): These have NumPy-like abilities (PyTorch even has torch.Tensor, which is akin to np.ndarray). For general data science tasks though, NumPy remains a core tool and those libraries are usually used specifically for machine learning workflows.
In conclusion, NumPy serves as the base layer for many data tools in Python. Pandas and Xarray provide more abstraction (labels) and specialized functionality for data manipulation, at a modest cost in overhead. Pure Python lists are flexible but inefficient for numerical work. Often, you’ll find yourself using a combination: perhaps using Pandas to read a dataset, NumPy to perform a heavy numerical operation on it, and maybe Xarray if you’re dealing with multi-dimensional labeled data from scientific sources. Each tool has its niche, and knowing their strengths helps you pick the right one for the task.
(To summarize a key point from above: NumPy focuses on raw speed and efficiency for numeric arrays (homogeneous and typically lower-level), Pandas and Xarray focus on usability and handling richer data (heterogeneous columns or labeled axes) at the cost of some overhead (favtutor.com, docs.xarray.dev). And Python lists are general-purpose containers, much slower for math and not suitable for large-scale numerical computing.)
Frequently Asked Questions (FAQs)
1. How do I install NumPy in Python (or fix “No module named numpy” in VS Code)?
To install NumPy, use Python’s package manager pip: pip install numpy. If you’re using Anaconda, NumPy is likely pre-installed (otherwise, run conda install numpy). In Visual Studio Code, make sure the correct Python interpreter is selected. The “No module named numpy” error means the environment running your code doesn’t have NumPy; installing it into that environment (for example with python -m pip install numpy, which installs for the interpreter you invoke) resolves the error. Once installed, you can verify by importing NumPy in Python (import numpy as np).
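When the error persists after installing, it usually helps to check which interpreter is actually running. The following is a minimal diagnostic sketch, not an official troubleshooting tool; it uses only the standard library plus NumPy itself, and the printed hint is just a suggestion.

```python
import importlib.util
import sys

# Path of the interpreter running this code. In VS Code it should match
# the interpreter selected in the status bar.
print("Interpreter:", sys.executable)

# find_spec returns None when a package cannot be imported from this environment.
numpy_spec = importlib.util.find_spec("numpy")

if numpy_spec is None:
    # Suggest installing into this exact interpreter.
    print("NumPy not found; try:", sys.executable, "-m pip install numpy")
else:
    import numpy as np
    print("NumPy version:", np.__version__)
```

If the interpreter path printed here differs from the one where you ran pip, that mismatch is the likely cause of the error.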
2. What does import numpy as np mean?
When you write import numpy as np, you are importing the NumPy library and giving it the alias “np” within your code. This is simply a convenience: it lets you call NumPy functions with a shorter prefix. For example, after import numpy as np, you use np.array([...]) instead of writing out numpy.array([...]) each time. This alias is a common convention; nearly all code examples and libraries use np to refer to NumPy. It doesn’t change any functionality; it’s just a shorthand reference to the numpy module.
3. How do I create an array of zeros or ones in NumPy?
Use the built-in functions np.zeros and np.ones. For example, np.zeros((3,4)) creates a 3x4 array filled with 0.0 (floating-point zeros). Similarly, np.ones((5,), dtype=int) creates a 1D array of length 5 filled with ones of integer type. You specify the shape as a tuple (or a single integer for 1D arrays) and an optional dtype, which defaults to floating point. These functions are useful for initializing arrays. For instance, a 2x2 zero matrix can be created with np.zeros((2,2)), and an array of five ones with np.ones(5). NumPy will allocate the array and populate it with the specified value efficiently.
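The calls described above can be sketched as follows. The variable names are illustrative, and np.full is not mentioned in the text; it is included only as a related function for filling an array with another constant.

```python
import numpy as np

zeros_2d = np.zeros((3, 4))           # 3x4 array of 0.0 (default dtype is float64)
ones_int = np.ones((5,), dtype=int)   # length-5 array of integer ones
ones_1d = np.ones(5)                  # a plain int is accepted as a 1D shape

# Related helper (an addition to the text): fill with an arbitrary constant.
sevens = np.full((2, 2), 7.0)

print(zeros_2d.shape, zeros_2d.dtype)
print(ones_int)
print(sevens)
```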
4. How do I reshape a NumPy array?
To change the shape of an array (without changing its data), use the reshape method or function. For example, if arr has 12 elements, arr.reshape((3,4)) gives you a 3x4 array over the same data (3*4 must equal the original size). NumPy returns a view whenever the memory layout allows it and a copy otherwise, so do not rely on either behavior without checking. You can also write arr.reshape(3,4) directly. One dimension can be -1, which tells NumPy to infer that dimension. For instance, arr.reshape(3, -1) will reshape into 3 rows and as many columns as needed. After reshaping, the array’s contents remain the same but are seen in the new shape. Keep in mind that the total number of elements must remain constant; otherwise, reshape raises a ValueError.
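A short sketch of these reshape behaviors, assuming a contiguous 12-element array (for which reshape returns a view). The names arr, grid, and inferred are illustrative.

```python
import numpy as np

arr = np.arange(12)            # 0, 1, ..., 11
grid = arr.reshape(3, 4)       # explicit shape
inferred = arr.reshape(3, -1)  # NumPy infers 4 columns

# For this contiguous array the result is a view that shares memory with arr.
print(np.shares_memory(arr, grid))

# Writing through the view changes the original array as well.
grid[0, 0] = 99
print(arr[0])

# A shape whose size does not match (5 * 3 != 12) raises ValueError.
try:
    arr.reshape(5, 3)
except ValueError as exc:
    print("reshape failed:", exc)
```

If you need an independent array, calling .copy() on the reshaped result avoids the shared-memory behavior.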
5. How do I perform matrix multiplication in NumPy?
NumPy supports matrix multiplication via the np.dot function or the @ operator. If A and B are two-dimensional arrays (matrices) with compatible shapes, np.dot(A, B) or A @ B will perform matrix multiplication (dot product of rows of A with columns of B). For example, if A has shape (2,3) and B has shape (3,2), then A @ B results in a (2,2) matrix. For vector dot products (1D arrays), np.dot(u, v) returns the scalar inner product. Note that * in NumPy performs element-wise multiplication, not matrix multiplication, so use @ or np.dot for true matrix multiplication.
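The difference between @, np.dot, and * can be seen in a small example. The matrices and vectors below are made-up values chosen only for illustration.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
B = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

product = A @ B                  # matrix product, shape (2, 2)
same_product = np.dot(A, B)      # identical result for 2D inputs

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
inner = np.dot(u, v)             # scalar inner product: 1*4 + 2*5 + 3*6

elementwise = A * A              # element-wise product, shape stays (2, 3)

print(product)
print(inner)
print(elementwise.shape)
```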
6. How do I use np.where() with multiple conditions?
You can combine multiple conditions using the element-wise operators & (AND), | (OR), and ~ (NOT) and then pass the combined condition to np.where. Remember to wrap each individual comparison in parentheses, because & and | bind more tightly than comparison operators; Python’s and/or keywords do not work element-wise on arrays. For example, suppose arr is an array of numbers: np.where((arr > 5) & (arr < 10)) returns a tuple of index arrays for the elements whose value is strictly between 5 and 10. If you want the elements themselves, you can simply write arr[(arr > 5) & (arr < 10)]. The three-argument form, np.where(cond, x, y), accepts combined conditions for cond in the same way. For instance, np.where((arr % 2 == 0) | (arr > 100), arr, 0) keeps the elements of arr that are even or greater than 100 and uses 0 everywhere else. The key is using & and | (with parentheses) to build the composite condition.
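Both forms of np.where can be tried on a small array. The values below are arbitrary and serve only to illustrate the conditions discussed above.

```python
import numpy as np

arr = np.array([2, 7, 12, 9, 150, 5])

# One-argument form: a tuple with one index array per dimension.
idx = np.where((arr > 5) & (arr < 10))

# Boolean-mask indexing returns the matching elements directly.
between = arr[(arr > 5) & (arr < 10)]

# Three-argument form: keep even values or values above 100, else 0.
picked = np.where((arr % 2 == 0) | (arr > 100), arr, 0)

print(idx[0])    # positions of 7 and 9
print(between)
print(picked)
```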
7. How do I calculate the mean or average of an array in NumPy (and a weighted average)?
NumPy provides np.mean for the arithmetic mean. For a 1D array a, np.mean(a) returns the average of all elements. For a 2D array, you can specify an axis: for example, np.mean(matrix, axis=0) gives the mean of each column, and axis=1 gives the mean of each row. Arrays also have a mean() method you can call directly. For a weighted average, use np.average, which accepts a weights parameter. E.g., np.average(data, weights=w) computes the average of the array data using the corresponding weights from array w (w must have the same shape as data, or be 1D with a length matching the chosen axis). This function effectively computes (data * w).sum() / w.sum(). If you want the weighted mean along an axis, specify axis in np.average as well. In summary: use np.mean for a simple unweighted mean, and np.average(..., weights=...) if you have weights for each element.
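The following sketch compares np.mean and np.average on small made-up arrays and checks the weighted result against the manual formula given above.

```python
import numpy as np

matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

overall = np.mean(matrix)            # mean of all six elements
col_means = np.mean(matrix, axis=0)  # one value per column
row_means = matrix.mean(axis=1)      # method form, one value per row

data = np.array([1.0, 2.0, 3.0])
w = np.array([3.0, 1.0, 1.0])        # illustrative weights

weighted = np.average(data, weights=w)
manual = (data * w).sum() / w.sum()  # same computation written out

# 1D weights applied along axis=1 (length matches the number of columns).
row_weighted = np.average(matrix, axis=1, weights=w)

print(overall, col_means, row_means)
print(weighted, manual)
print(row_weighted)
```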
8. How do I save a NumPy array to a CSV file?
Use np.savetxt. For example, np.savetxt("mydata.csv", arr, delimiter=",") writes the array arr to a file named “mydata.csv” using commas between values. Each row of the array becomes a line in the file. You can control the format of the numbers with the fmt argument (e.g., fmt="%.5f" for 5 decimal places). If arr is one-dimensional, savetxt writes one value per line by default, that is, as a column; reshape it to (1, -1) if you want a single row. To include a header line, use the header argument of np.savetxt; by default the header is prefixed with the comment string "# ", which you can turn off with comments="". Note that savetxt handles 1D and 2D arrays; to store an array in binary form use np.save (.npy files), to store several arrays in one file use np.savez, and consider Pandas for more complex outputs. Once saved as CSV, the file can be opened in Excel or read by other programs easily.
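A round trip through a CSV file might look like the sketch below. It writes to a temporary directory so nothing is left on disk; the column names a, b, c and the sample values are made up for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np

arr = np.array([[1.5, 2.25, 3.0],
                [4.0, 5.125, 6.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "mydata.csv"

    # comments="" stops NumPy from prefixing the header line with "# ".
    np.savetxt(csv_path, arr, delimiter=",", fmt="%.5f",
               header="a,b,c", comments="")

    first_line = csv_path.read_text().splitlines()[0]

    # Skip the header row when reading the numbers back.
    loaded = np.loadtxt(csv_path, delimiter=",", skiprows=1)

print(first_line)
print(loaded)
```

Keep in mind that fmt controls precision, so values with more decimal places than the format allows are rounded on disk.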
9. What is the difference between NumPy and Pandas?
NumPy and Pandas serve different purposes, though Pandas is built on NumPy. NumPy provides the ndarray
for fast numeric computation on homogeneous data (all one type), especially useful for mathematical operations on vectors, matrices, or higher-dimensional arrays. Pandas provides higher-level data structures: Series (1D) and DataFrame (2D tabular data with named columns and rows), which can hold heterogeneous data types and have labels for indices. The main differences: Pandas is excellent for data manipulation, analysis, and handling missing data in table-like structures, offering convenient functions to group, filter, merge, etc., by column names or index labels. NumPy is lower-level, focusing on raw speed for numerical calculations on large arrays. Pandas will use NumPy under the hood for computations but has some overhead due to its richer functionality (like maintaining indices). In short: use NumPy for number-crunching on pure arrays or when you need multi-dimensional data structures; use Pandas when dealing with datasets that have column names or mixed types, or whenever you need functionality like join/group-by or easily reading/writing CSVs. They often work together: for example, you might use Pandas to load a CSV into a DataFrame, then extract .values
(a NumPy array) for some heavy linear algebra, then put results back into a DataFrame for analysis or output.
10. How is a NumPy array different from a Python list?
NumPy arrays and Python lists can both hold sequences of values, but there are key differences. A NumPy array is optimized for numerical computations: it is homogeneous (every element has the same type, e.g. all floats) and is typically stored in one contiguous block of memory. This allows vectorized operations (like element-wise addition, multiplication, etc.) to be implemented efficiently in low-level code. As a result, operations on NumPy arrays can be orders of magnitude faster than equivalent operations on Python lists (numpy.org). A Python list is a general container that can hold elements of different types, and it stores references to Python objects. It’s very flexible (you can append, remove, or mix types), but this flexibility comes at the cost of performance for numerical tasks. For example, adding two large lists element by element requires an explicit Python loop or comprehension (the + operator concatenates lists instead), whereas adding two NumPy arrays runs at C speed without a Python-level loop. NumPy arrays also use less memory for large numbers of elements because they don’t carry the per-element overhead that Python objects do (numpy.org). However, Python lists are built in and can be more convenient for small sequences or when you need to handle a collection of non-numeric objects. In summary: use NumPy arrays for efficient numerical computing; use Python lists for small or mixed-type collections or when performance is not a concern.
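A small comparison makes these differences concrete. This is an illustrative sketch with made-up values; it demonstrates semantics only and does not benchmark speed.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

# "+" concatenates lists but adds arrays element by element.
list_plus = py_list + py_list
arr_plus = np_arr + np_arr

# Element-wise addition of lists needs an explicit loop or comprehension.
list_sum = [a + b for a, b in zip(py_list, py_list)]

# Arrays are homogeneous: mixing int and float upcasts everything to float.
mixed = np.array([1, 2.5])

print(list_plus)
print(arr_plus)
print(mixed.dtype)

# Raw data size is itemsize * number of elements, with no per-element objects.
print(np_arr.itemsize, np_arr.nbytes)
```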